3 Importing Datasets

A first step in data analysis is importing datasets. These can be in several formats. Fortunately, R has several packages that allow us to easily import data from comma-separated value (CSV), SPSS and Excel files.

You will find the following datasets on ILIAS:

  1. zufriedenheit.csv
  2. zufriedenheit-semicolon.csv
  3. zufriedenheit.sav
  4. zufriedenheit.xls

This is a generated dataset. The variables are a satisfaction rating, recorded at two time points for four subjects.

zufriedenheit.csv is a text file, in which the columns are separated by commas (hence the name). zufriedenheit-semicolon.csv is also a text file, but with a semi-colon as the delimiter. zufriedenheit.sav and zufriedenheit.xls are SPSS and Excel files.

In your RStudio project folder in the file system of your computer, create a new directory named ‘data’ and save the files you downloaded.

In RStudio, there are two ways of importing files:

  1. Using functions: read_csv(), read_csv2() (for ‘;’ delimiters), read_sav() and read_excel().

  2. Via the GUI: You can access this functionality either through ‘File > Import Dataset’ or in the Environment pane.

The second option is easier to use, and has the advantage that R will generate all code for importing datasets for us, which can be copied for subsequent use.

3.1 Comma-separated value (CSV) files

The functions we need for importimg CSV files are available in the readr package. This needs to be loaded:

library(readr)

Alternatively, we can just load all tidyverse packages:

library(tidyverse)

We will first import the data files using the GUI. In Environment, click on Import Dataset > From Text (readr) (or ‘File > Import Dataset > From Text (readr)’). You will see a dialogue containing a Code Preview with the following code:

library(readr)
dataset <- read_csv(NULL)
View(dataset)

On the left (bottom) you will find all options for importing the data. These options are all arguments of the read_csv() function (or the more general read_delim() funtion):

args(read_csv)
#> function (file, col_names = TRUE, col_types = NULL, locale = default_locale(), 
#>     na = c("", "NA"), quoted_na = TRUE, quote = "\"", comment = "", 
#>     trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, 
#>         n_max), progress = show_progress()) 
#> NULL

In the File Browser you can select the zufriedenheit.csv file. After doing so you will see Data Preview.

The variables Vpn and messzeitpunkt have been imported as character vectors. These will need to be converted to factors. Under Import Options a file name is generated automatically. This is simply the file name, minus the .csv suffix.

Try to discover what happens when you play around with the options, e.g. “First Row as Names”.

Now click on Import. A new data frame (tibble) with the name zufriedenheit will appear in the Environment pane, and the generated R code is printed to the console.

library(readr)
zufriedenheit <- read_csv("data/zufriedenheit.csv")
#> Parsed with column specification:
#> cols(
#>   Vpn = col_character(),
#>   messzeitpunkt = col_character(),
#>   rating = col_double()
#> )

We still have to convert the grouping variables to factors:

zufriedenheit$Vpn <- as.factor(zufriedenheit$Vpn)
zufriedenheit$messzeitpunkt <- as.factor(zufriedenheit$messzeitpunkt)
zufriedenheit
#> # A tibble: 8 x 3
#>   Vpn   messzeitpunkt rating
#>   <fct> <fct>          <dbl>
#> 1 VP_1  t1              64.3
#> 2 VP_2  t1              65.1
#> 3 VP_3  t1              54.2
#> 4 VP_4  t1              54.5
#> 5 VP_1  t2              67.2
#> 6 VP_2  t2              75.6
#> # ... with 2 more rows

Now do the same with the semi-colon delimited file zufriedenheit-semicolon.csv. This is something that you frequently be confronted with if your computer uses a German locale.

You should see the following code in the Code Preview:

library(readr)
zufriedenheit_semicolon <- read_delim("data/zufriedenheit-semicolon.csv",
    ";", escape_double = FALSE, trim_ws = TRUE)

Try importing both files using the R commnands. For the semi-colon file you can use the function read_csv2().

If you want to save a data frame as a CSV file, you can use the function write_csv():

write_csv(x = zufriedenheit_long, path = "data/zufriedenheit.csv")

3.2 SPSS datasets

We can import the same dataset, but this time from the SPSS file format (zufriedenheit.sav). There is a package called haven; this provides the function read_sav(). We can load the package to make the function available:

library(haven)
zufriedenheit_spss <- read_sav("data/zufriedenheit.sav")

But, as before, we can also use the GUI. In the Environment pane, click on Import Dataset > From SPSS, and choose the file. You will see a Code Preview:

In contrast to importing CSV files, we have no options when importing from SPSS except for the name of the data frame, which we will change to zufriedenheit_spss.

Now you can click on Import. In the Environment pane, the newly created varibale will appear. Variables imported from SPSS can have additional attributes, the most important of which is the labels attribute:

zufriedenheit_spss$Vpn
#>  
#> [1] 1 2 3 4 1 2 3 4
#> attr(,"format.spss")
#> [1] "F8.0"
#> attr(,"labels")
#> VP_1 VP_2 VP_3 VP_4 
#>    1    2    3    4

This contains the SPSS value labels. Using this, we can look up what type of coding scheme was used for categorical variables. If the values 0 and 1 were used for the sex variable, we determine which sex has the value 0.

If we want to convert variables to factors, we can use the function as_factor() from the haven package. This allows us to use either the SPSS values or value labels as levels of the factor in R. This is achieved by using the argument levels; this can take the values "default", "labels", "values" or "both" (you can inspect help page using ?as_factor). "default" seems to be the most sensible option - this means that labels are used if available and otherwise the values themselves are used. The other options are "both" (values and value labels are combined), "label" (labels only) and "values" (values only).

Arguments of the as_factor() function

levels
  How to create the levels of the generated factor:
      "default": uses labels where available, otherwise the values. Labels are sorted by value.
      "both": like "default", but pastes together the level and value
      "label": use only the labels; unlabelled values become NA
      "values: use only the values

ordered
    If TRUE create an ordered (ordinal) factor, if FALSE (the default) create a regular (nominal) factor.

The argument ordered allows us to create an ordered factor, if the ordering of the factor levels is important.

zufriedenheit_spss$Vpn <- as_factor(zufriedenheit_spss$Vpn,
                                    levels = "default")
zufriedenheit_spss$messzeitpunkt <- as_factor(zufriedenheit_spss$messzeitpunkt,
                                              levels = "default")

We can also save a data frame in the SPSS file format using the write_save() function:

write_sav(data = zufriedenheit_long, path = "data/zufriedenheit.sav")

Download and import the file called Beispieldatensatz.sav from ILIAS. Which value labels do the categorical variables have?

3.3 Excel files

We can also import Excel spreadsheets. Click on Import Dataset > From Excel, and then choose the file you want to import. We will call the data frame zufriedenheit_xls. Underneath the Name text box there is a drop-down menu entitled Sheet. This allows you to specify which worksheet you want to import. You should get the following R code in the Code Preview:

library(readxl)
zufriedenheit_xls <- read_excel("data/zufriedenheit.xlsx",
    sheet = "zufriedenheit")

The function we are using is called read_excel() and is available in the readxl package (not part of the tidyverse).

Categorical variables should again be converted to factors:

zufriedenheit_xls$Vpn <- as.factor(zufriedenheit_xls$Vpn)
zufriedenheit_xls$messzeitpunkt <- as.factor(zufriedenheit_xls$messzeitpunkt)

3.4 RData files

The last option is an RData file. This is a binary file format and has the advantage that we can combine multiple R objects, including all their attributes, in a single file. This is very useful; when exporting to a text file, such as CSV, all metadata will be lost. A further advantage is that the file may be compressed in order to save space. However, this file is specific to R, and thus may not be the best option when sharing your data with other people.

You can save objects in your workspace as .RData (or .Rda) files with the function save():

save(zufriedenheit, file = "data/zufriedenheit.Rda")

You can also save several objects in one go:

save(zufriedenheit, zufriedenheit_spss, zufriedenheit_xls,
     file = "data/zufriedenheit_alle.Rda")

The file can be loaded using the function load():

load(file = "data/zufriedenheit_alle.Rda")

3.5 Exercises

Importing datasets 1

  1. Download and import the file Therapy.sav from ILIAS, using both the GUI and the function read_sav().

  2. Check the coding of the grouping variables. Which level should the reference category?

  3. Convert to grouping variables to factors.

Importing datasets 2

  1. Generate a (simulated) dataset and export it either as a CSV, xls oder sav file. This dataset should contain at least a numeric and a grouping variable. Exchange files with a partner.

  2. You will receive a file from your partner. Try to import this, and perform all necessary conversions.