5 Importing and exporting data

One of the first steps in data analysis is usually importing datasets. These can be in stored in several different formats, depending on their origin. The most common text-based format is a coma-separated values (CSV) file, with tab-delimited text files being another option. It is also very common to keep data in an Excel spreadsheet, or in an SPSS .sav file. Fortunately, R has several packages that allow us to easily import data from comma-separated value (CSV) or other text files, SPSS files and Excel files. These functions all start with read_*. There are also the corresponding write_* function that allow us to export datasets to CSV, SPSS or Excel files.

In this chapter, we will import the following datasets, all of which can be downloaded from ILIAS:

File name Type Import function Package
zufriedenheit.csv Text file with commas as columns delimiters read_csv() readr
zufriedenheit-semicolon.csv Text file with a semi-colon as the delimiter read_csv2(), read_delim() readr
zufriedenheit.sav SPSS file read_sav() haven
zufriedenheit.xls Excel file read_excel(), read_xls() or read_xlsx() readxl

These are four different files containing the same dataset. The data consist of a satisfaction rating, recorded at two time points for four subjects.

In your RStudio project directory, create a new directory named ‘data’ if it doesn’t already exist, and put the files you downloaded in there.

In RStudio, there are two ways of importing files:

  1. Using functions: read_csv(), read_csv2() (for ‘;’ delimiters), read_sav() and read_excel().

  2. Via the GUI: You can access this functionality either through ‘File > Import Dataset’ or in the Environment pane.

The second option is easier to use, and has the advantage that R will generate all code for importing datasets for us, which can be copied for subsequent use.

The advantage of using the first approach is that you can keep a record of every step in your data analysis pipeline, which is important for reproducibility.

We will now inport these four datasets using both options.

5.1 Comma-separated value (CSV) files

The functions we need for importing CSV files are available in the readr package. This needs to be loaded:

library(readr)

However, readr is loaded automatically when we load the tidyverse package:

library(tidyverse)

We will first import the data files using the GUI. In the Environment pane , click on

Import Dataset > From Text (readr) or use the menu ‘File > Import Dataset > From Text (readr)’. You will see a dialogue containing a Code Preview.

On the left (bottom) you will find all options for importing the data. These options are all arguments of the read_csv() function (or the more general read_delim() funtion):

args(read_csv)
#> function (file, col_names = TRUE, col_types = NULL, locale = default_locale(), 
#>     na = c("", "NA"), quoted_na = TRUE, quote = "\"", comment = "", 
#>     trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, 
#>         n_max), progress = show_progress(), skip_empty_rows = TRUE) 
#> NULL

In the File Browser you can select the zufriedenheit.csv file. After doing so you will see Data Preview.

Alternatively, you can also navigate to the ‘data’ directory in the Files pane, and then click on the dataset you want to import. You will then be asked if you want to view the file, or import it. Selecting ‘import’ will take you straight to the data preview.

According to the preview, the variables Vpn and messzeitpunkt will be imported as character vectors. This means that if we want to treat them as categorical variables, which they are, they will need to be converted to factors.

Under Import Options a file name is generated automatically. This is simply the file name, minus the .csv suffix.

Try to discover what happens when you play around with the options, e.g. “First Row as Names”.

Now click on Import. A new data frame (tibble) with the name zufriedenheit will appear in the Environment pane, and the generated R code is printed to the console.

You can copy these lines of code, and paste them into an R script (although you can of course import tidyverse instead of just readr). Next time you want to import this dataset, you can simply run the code, without having to use the GUI.

library(readr)
zufriedenheit <- read_csv("data/zufriedenheit.csv")
#> Parsed with column specification:
#> cols(
#>   Vpn = col_character(),
#>   messzeitpunkt = col_character(),
#>   rating = col_double()
#> )

Now is a good time to convert the grouping variables Vpn and messzeitpunkt to factors:

zufriedenheit$Vpn <- as_factor(zufriedenheit$Vpn)
zufriedenheit$messzeitpunkt <- as_factor(zufriedenheit$messzeitpunkt)

If you print the tibble zufriedenheit, R tells you that Vpn and messzeitpunkt are factors (abbreviated to in the console), and that rating is a numeric variable (). The Environment also provides the information that Vpn has 4 levels and messzeitpunkt has 2 levels.

zufriedenheit
#> # A tibble: 8 x 3
#>   Vpn   messzeitpunkt rating
#>   <fct> <fct>          <dbl>
#> 1 VP_1  t1              64.3
#> 2 VP_2  t1              65.1
#> 3 VP_3  t1              54.2
#> 4 VP_4  t1              54.5
#> 5 VP_1  t2              67.2
#> 6 VP_2  t2              75.6
#> # … with 2 more rows

Now do the same with the semi-colon delimited file zufriedenheit-semicolon.csv. This is something that you might frequently work with if your computer uses German language settings.

This file uses semicolons instead of commas as a column delimiter. Therefore, you need to change the delimiter setting. Now you should see the following code in the Code Preview:

library(readr)
zufriedenheit_semicolon <- read_delim("data/zufriedenheit-semicolon.csv",
    ";", escape_double = FALSE, trim_ws = TRUE)

Try importing the semicolon-delimited file using the function read_csv2().

Solution

zufriedenheit  <- read_csv2("data/zufriedenheit-semicolon.csv") 
#> Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
#> Parsed with column specification:
#> cols(
#>   Vpn = col_character(),
#>   messzeitpunkt = col_character(),
#>   rating = col_number()
#> )

5.1.1 Exporting a CSV file

If you want to save a data frame as a CSV file, you can use the function write_csv(), or write_delim() if you want to use semicolons. You have to specify the path where you want to save the CSV file.

write_csv(x = zufriedenheit, path = "data/zufriedenheit.csv")

5.2 SPSS datasets

We can import the same dataset, but this time from the SPSS file (zufriedenheit.sav). The function we use for this is called read_sav(). This is provided by the haven package, which also allows you to import Stata and SAS files. This package is not automatically loaded when you load tidyverse, therefore you have to load it using: We can load the package to make the function available:

library(haven)

As before, we can use the GUI to import files. In the Environment pane, click on Import Dataset > From SPSS, and choose the file. To distinguish this from the previously import CSV file, let’s change the name to zufriedenheit_spss.

IN the Code Preview, the important line is (besides library(haven)):

zufriedenheit_spss <- read_sav("data/zufriedenheit.sav") 

Note that the read_sav() function does have more arguments. These allow you skip the first n lines, define values to be converted to NA, select columns and to repair non-unique column names.

In contrast to importing CSV files, however, the GUI offers no further options when importing from SPSS except for the name of the data frame.

Now you can click on Import. In the Environment pane, the newly created variable will appear. Variables imported from SPSS can have additional attributes, the most important of which is the labels attribute. Both the Vpn and the messzeitpunkt columns have this:

zufriedenheit_spss$Vpn
#> <Labelled double>
#> [1] 1 2 3 4 1 2 3 4
#> 
#> Labels:
#>  value label
#>      1  VP_1
#>      2  VP_2
#>      3  VP_3
#>      4  VP_4
zufriedenheit_spss$messzeitpunkt
#> <Labelled double>
#> [1] 1 1 1 1 2 2 2 2
#> 
#> Labels:
#>  value label
#>      1    t1
#>      2    t2

Both of these variables are declared as labelled double. This a special class provided by the haven package to associate arbitrary labels with numeric or character vectors. These labels contain the SPSS value labels. Using this, we can look up what type of coding scheme was used for categorical variables. If for example the values 0 and 1 were used for a sex grouping variable, we can determine which sex has was coded as 0 and which was coded as 1.

If we want to convert these variables to factors, we can use the function as_factor() from the haven package. This allows us to use either the SPSS values or value labels as levels of the factor in R. This is achieved by using the argument levels, which can take the values "default", "labels", "values" or "both" (you can inspect help page using ?as_factor).

"default" seems to be the most sensible option - this means that labels are used if available and otherwise the values themselves are used. The other options are "both" (values and value labels are combined), "label" (labels only) and "values" (values only).

Arguments of the haven::as_factor() function

levels
  How to create the levels of the generated factor:
      "default": uses labels where available, otherwise the values. Labels are sorted by value.
      "both": like "default", but pastes together the level and value
      "label": use only the labels; unlabelled values become NA
      "values: use only the values

ordered
    If TRUE create an ordered (ordinal) factor, if FALSE (the default) create a regular (nominal) factor.

The argument ordered allows us to create an ordered factor, if the ordering of the factor levels is important.

zufriedenheit_spss$Vpn <- as_factor(zufriedenheit_spss$Vpn,
                                    levels = "default")

zufriedenheit_spss$messzeitpunkt <- as_factor(zufriedenheit_spss$messzeitpunkt,
                                              levels = "default")

Once we have converted the categorical variables to factors, they are no longer labelled.The labels are only intended to be an intermediate data structure, so that we can convert into regular R factors. Note that you an also remove labels using the zap_labels() function.

5.2.1 Importing a large dataset

Let’s import a bigger dataset (in the sense that it has many variables).

Import the file called beispieldaten.sav. Inspect the value labels of the categorical variables? The dataset contains a lot of variables.

  1. A good way to get an overview is to use the glimpse() function. Using glimpse(beispieldaten) you should be able to see which variables are labelled.

  2. The class of the labelled variables is “haven_labelled”.

beispieldaten <- read_sav("data/beispieldaten.sav")

If you want to know which variables (columns) are labelled, you can use the function is.labelled(). You want to apply this function to each column of the data frame beispieldaten.

A function which lets you do just that is sapply(). This applies a function to each item in a list (a data frame is a list) without having to write a for loop.

Running the code sapply(beispieldaten, is.labelled) will report TRUE/FALSE for each column, depending on whether it is labelled.

What we want is to select only those columns that are labelled. One way of doing this is:

which(sapply(beispieldaten, is.labelled)) 
#>               westost            geschlecht             Mathenote 
#>                     2                     3                    77 
#>      Fremdsprachenote         bildung_vater        bildung_mutter 
#>                    78                    80                    81 
#>  bildung_vater_binaer bildung_mutter_binaer 
#>                    82                    83

This function returns the indices of all columns for which sapply(beispieldaten, is.labelled) returns TRUE.

In the next chapter, we will get to know the select functions from the dplyr package. These make such operations easier. The code that will produce a similar output is:

select_if(beispieldaten, is.labelled)   
#> # A tibble: 286 x 8
#>   westost geschlecht Mathenote Fremdsprachenote bildung_vater bildung_mutter
#>   <dbl+l>  <dbl+lbl> <dbl+lbl>        <dbl+lbl>     <dbl+lbl>      <dbl+lbl>
#> 1 0 [Wes… 1 [weibli… 3 [ausre… 5 [gut]          4 [Fachhochs… 4 [Fachhochsc…
#> 2 0 [Wes… 0 [männli… 4 [befri… 3 [ausreichend]  3 [Fachabitu… 2 [Realschula…
#> 3 0 [Wes… 1 [weibli… 5 [gut]   4 [befriedigend] 2 [Realschul… 1 [Hauptschul…
#> 4 0 [Wes… 1 [weibli… 2 [mange… 3 [ausreichend]  2 [Realschul… 4 [Fachhochsc…
#> 5 0 [Wes… 1 [weibli… 4 [befri… 4 [befriedigend] 4 [Fachhochs… 2 [Realschula…
#> 6 0 [Wes… 0 [männli… 4 [befri… 3 [ausreichend]  2 [Realschul… 2 [Realschula…
#> # … with 280 more rows, and 2 more variables: bildung_vater_binaer <dbl+lbl>,
#> #   bildung_mutter_binaer <dbl+lbl>
beispieldaten  %>% 
    select_if(is.labelled)   
#> # A tibble: 286 x 8
#>   westost geschlecht Mathenote Fremdsprachenote bildung_vater bildung_mutter
#>   <dbl+l>  <dbl+lbl> <dbl+lbl>        <dbl+lbl>     <dbl+lbl>      <dbl+lbl>
#> 1 0 [Wes… 1 [weibli… 3 [ausre… 5 [gut]          4 [Fachhochs… 4 [Fachhochsc…
#> 2 0 [Wes… 0 [männli… 4 [befri… 3 [ausreichend]  3 [Fachabitu… 2 [Realschula…
#> 3 0 [Wes… 1 [weibli… 5 [gut]   4 [befriedigend] 2 [Realschul… 1 [Hauptschul…
#> 4 0 [Wes… 1 [weibli… 2 [mange… 3 [ausreichend]  2 [Realschul… 4 [Fachhochsc…
#> 5 0 [Wes… 1 [weibli… 4 [befri… 4 [befriedigend] 4 [Fachhochs… 2 [Realschula…
#> 6 0 [Wes… 0 [männli… 4 [befri… 3 [ausreichend]  2 [Realschul… 2 [Realschula…
#> # … with 280 more rows, and 2 more variables: bildung_vater_binaer <dbl+lbl>,
#> #   bildung_mutter_binaer <dbl+lbl>

The second example uses the pipe operater %>%, which we will learn about in the next chapter.

5.2.2 Exporting an SPSS file

We can also save a data frame in the SPSS file format using the write_save() function:

write_sav(data = zufriedenheit_long, path = "data/zufriedenheit.sav")

5.3 Excel files

We can also import Excel spreadsheets. Click on Import Dataset > From Excel, and then choose the file you want to import. We will call the data frame zufriedenheit_xls. Underneath the Name text box there is a drop-down menu entitled Sheet. This allows you to specify which worksheet you want to import. You should get the following R code in the Code Preview:

library(readxl)
zufriedenheit_xls <- read_excel("data/zufriedenheit.xlsx",
    sheet = "zufriedenheit")

The function we are using is called read_excel() and is available in the readxl package (not part of the tidyverse).

Categorical variables should again be converted to factors:

zufriedenheit_xls$Vpn <- as_factor(zufriedenheit_xls$Vpn)
zufriedenheit_xls$messzeitpunkt <- as_factor(zufriedenheit_xls$messzeitpunkt)

5.4 RData files

The last option is an RData file. This is a binary file format and has the advantage that we can combine multiple R objects, including all their attributes, in a single file. This is very useful; when exporting to a text file, such as CSV, all metadata will be lost. A further advantage is that the file may be compressed in order to save space. However, this file is specific to R.

You can save objects in your workspace as .RData (or .Rda) files with the function save():

save(zufriedenheit, file = "data/zufriedenheit.Rda")

You can also save several objects in one go:

save(zufriedenheit, zufriedenheit_spss, zufriedenheit_xls,
     file = "data/zufriedenheit_all.Rda")

The file can be loaded using the function load():

load(file = "data/zufriedenheit_all.Rda")

5.5 Exercises

Importing datasets 1

  1. Import the file Therapy.sav, using both the GUI and the function read_sav().
  2. Check the coding of the grouping variables. Which level should be the reference category?
  3. Convert the grouping variables to factors.

Solution

  1. Import the data:
library(haven)
Therapy <- read_sav("data/Therapy.sav")
  1. Check labelled variables:
which(sapply(Therapy, is.labelled)) 
#> Gruppe 
#>      2

The only labelled column is Gruppe. We can use attributes() to look at the labels:

attributes(Therapy$Gruppe)
#> $format.spss
#> [1] "F8.2"
#> 
#> $class
#> [1] "haven_labelled"
#> 
#> $labels
#> Kontrollgruppe Therapiegruppe 
#>              0              1
  1. Convert to factor:

Vpnr is imported as a numeric vector - it does not have a labels attribute. Therefore, as_factor() has no labels to use as factor levels and we get integers as factor levels. Gruppe is ‘labelled’. Here, as_factor() can use the labels as factor levels..

attributes(Therapy$Vpnr)
#> $format.spss
#> [1] "F3.0"
Therapy$Vpnr <- as_factor(Therapy$Vpnr)

Therapy$Gruppe <- as_factor(Therapy$Gruppe)
levels(Therapy$Gruppe)
#> [1] "Kontrollgruppe" "Therapiegruppe"

The first factor level is Kontrollgruppe. If we use dummy coding in a linear model, this will be used as the reference category.

Therapy
#> # A tibble: 100 x 5
#>   Vpnr  Gruppe         Pretest Posttest Difference_PrePost
#>   <fct> <fct>            <dbl>    <dbl>              <dbl>
#> 1 1     Kontrollgruppe    4.29     3.21              1.08 
#> 2 2     Kontrollgruppe    6.18     5.99              0.190
#> 3 3     Kontrollgruppe    3.93     4.17             -0.239
#> 4 4     Kontrollgruppe    5.06     4.76              0.295
#> 5 5     Kontrollgruppe    6.45     5.64              0.814
#> 6 6     Kontrollgruppe    4.49     4.67             -0.180
#> # … with 94 more rows