5 Importing and exporting data
One of the first steps in data analysis is usually importing datasets. These can
be in stored in several different formats, depending on their origin. The most
common text-based format is a coma-separated values (CSV) file, with
tab-delimited text files being another option. It is also very common to keep
data in an Excel spreadsheet, or in an SPSS .sav
file. Fortunately, R has
several packages that allow us to easily import data from comma-separated value
(CSV) or other text files, SPSS files and Excel files. These functions all start
with read_*
. There are also the corresponding write_*
function that allow us
to export datasets to CSV, SPSS or Excel files.
In this chapter, we will import the following datasets, all of which can be downloaded from ILIAS:
File name | Type | Import function | Package |
---|---|---|---|
zufriedenheit.csv | Text file with commas as columns delimiters | read_csv() |
readr |
zufriedenheit-semicolon.csv | Text file with a semi-colon as the delimiter | read_csv2() , read_delim() |
readr |
zufriedenheit.sav | SPSS file | read_sav() |
haven |
zufriedenheit.xls | Excel file | read_excel() , read_xls() or read_xlsx() |
readxl |
These are four different files containing the same dataset. The data consist of a satisfaction rating, recorded at two time points for four subjects.
In your RStudio project directory, create a new directory named ‘data’ if it doesn’t already exist, and put the files you downloaded in there.
In RStudio, there are two ways of importing files:
Using functions:
read_csv()
,read_csv2()
(for ‘;’ delimiters),read_sav()
andread_excel()
.Via the GUI: You can access this functionality either through ‘File > Import Dataset’ or in the Environment pane.
The second option is easier to use, and has the advantage that R will generate all code for importing datasets for us, which can be copied for subsequent use.
The advantage of using the first approach is that you can keep a record of every step in your data analysis pipeline, which is important for reproducibility.
We will now inport these four datasets using both options.
5.1 Comma-separated value (CSV) files
The functions we need for importing CSV files are available in the readr
package. This needs to be loaded:
However, readr
is loaded automatically when we load the tidyverse
package:
library(tidyverse)
We will first import the data files using the GUI. In the Environment pane , click on
Import Dataset > From Text (readr) or use the menu ‘File > Import Dataset > From Text (readr)’. You will see a dialogue containing a Code Preview.
On the left (bottom) you will find all options for importing the data. These
options are all arguments of the read_csv()
function (or the more general
read_delim()
funtion):
args(read_csv)
#> function (file, col_names = TRUE, col_types = NULL, locale = default_locale(),
#> na = c("", "NA"), quoted_na = TRUE, quote = "\"", comment = "",
#> trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000,
#> n_max), progress = show_progress(), skip_empty_rows = TRUE)
#> NULL
In the File Browser you can select the zufriedenheit.csv
file. After doing so you will see Data Preview.
Alternatively, you can also navigate to the ‘data’ directory in the Files pane, and then click on the dataset you want to import. You will then be asked if you want to view the file, or import it. Selecting ‘import’ will take you straight to the data preview.
According to the preview, the variables Vpn
and messzeitpunkt
will be
imported as character vectors
. This means that if we want to treat them as
categorical variables, which they are, they will need to be converted to
factors.
Under Import Options a file name is generated automatically. This is simply
the file name, minus the .csv
suffix.
Try to discover what happens when you play around with the options, e.g. “First Row as Names”.
Now click on Import. A new data frame (tibble) with the name zufriedenheit
will appear in the Environment pane, and the generated R code is printed to
the console.
You can copy these lines of code, and paste them into an R script (although you
can of course import tidyverse
instead of just readr
). Next time you want to
import this dataset, you can simply run the code, without having to use the GUI.
library(readr)
zufriedenheit <- read_csv("data/zufriedenheit.csv")
#> Parsed with column specification:
#> cols(
#> Vpn = col_character(),
#> messzeitpunkt = col_character(),
#> rating = col_double()
#> )
Now is a good time to convert the grouping variables Vpn
and messzeitpunkt
to factors:
zufriedenheit$Vpn <- as_factor(zufriedenheit$Vpn)
zufriedenheit$messzeitpunkt <- as_factor(zufriedenheit$messzeitpunkt)
If you print the tibble zufriedenheit
, R tells you that Vpn
and
messzeitpunkt
are factors (abbreviated to rating
is a numeric variable (Vpn
has 4 levels and messzeitpunkt
has 2 levels.
zufriedenheit
#> # A tibble: 8 x 3
#> Vpn messzeitpunkt rating
#> <fct> <fct> <dbl>
#> 1 VP_1 t1 64.3
#> 2 VP_2 t1 65.1
#> 3 VP_3 t1 54.2
#> 4 VP_4 t1 54.5
#> 5 VP_1 t2 67.2
#> 6 VP_2 t2 75.6
#> # … with 2 more rows
Now do the same with the semi-colon delimited file zufriedenheit-semicolon.csv
. This is something that you might frequently work with if your computer uses German language settings.
This file uses semicolons instead of commas as a column delimiter. Therefore, you need to change the delimiter setting. Now you should see the following code in the Code Preview:
library(readr)
zufriedenheit_semicolon <- read_delim("data/zufriedenheit-semicolon.csv",
";", escape_double = FALSE, trim_ws = TRUE)
Try importing the semicolon-delimited file using the function read_csv2()
.
Solution
5.2 SPSS datasets
We can import the same dataset, but this time from the SPSS file
(zufriedenheit.sav
). The function we use for this is called read_sav()
. This
is provided by the haven
package, which also allows you to import Stata and
SAS files. This package is not automatically loaded when you load tidyverse
, therefore you have to load it using:
We can load the package to make the function available:
As before, we can use the GUI to import files. In the Environment pane, click
on Import Dataset > From SPSS, and choose the file. To distinguish this from
the previously import CSV file, let’s change the name to zufriedenheit_spss
.
IN the Code Preview, the important line is (besides library(haven)
):
Note that the read_sav()
function does have more arguments. These allow you
skip the first n lines, define values to be converted to NA, select columns and
to repair non-unique column names.
In contrast to importing CSV files, however, the GUI offers no further options when importing from SPSS except for the name of the data frame.
Now you can click on Import. In the Environment pane, the newly created
variable will appear. Variables imported from SPSS can have additional
attributes, the most important of which is the labels
attribute. Both the Vpn
and the messzeitpunkt
columns have this:
zufriedenheit_spss$Vpn
#> <Labelled double>
#> [1] 1 2 3 4 1 2 3 4
#>
#> Labels:
#> value label
#> 1 VP_1
#> 2 VP_2
#> 3 VP_3
#> 4 VP_4
zufriedenheit_spss$messzeitpunkt
#> <Labelled double>
#> [1] 1 1 1 1 2 2 2 2
#>
#> Labels:
#> value label
#> 1 t1
#> 2 t2
Both of these variables are declared as labelled double
. This a special class
provided by the haven
package to associate arbitrary labels with numeric or
character vectors. These labels contain the SPSS value labels. Using this, we
can look up what type of coding scheme was used for categorical variables. If for example
the values 0
and 1
were used for a sex
grouping variable, we can determine which sex
has was coded as 0
and which was coded as 1
.
If we want to convert these variables to factors, we can use the function
as_factor()
from the haven
package. This allows us to use either the SPSS
values or value labels as levels of the factor in R.
This is achieved by using
the argument levels
, which can take the values "default"
, "labels"
,
"values"
or "both"
(you can inspect help page using ?as_factor
).
"default"
seems to be the most sensible option - this means that labels are
used if available and otherwise the values themselves are used. The other
options are "both"
(values and value labels are combined), "label"
(labels
only) and "values"
(values only).
Arguments of the haven::as_factor()
function
levels
How to create the levels of the generated factor:
"default": uses labels where available, otherwise the values. Labels are sorted by value.
"both": like "default", but pastes together the level and value
"label": use only the labels; unlabelled values become NA
"values: use only the values
ordered
If TRUE create an ordered (ordinal) factor, if FALSE (the default) create a regular (nominal) factor.
The argument ordered
allows us to create an ordered factor, if the ordering of the factor levels is important.
zufriedenheit_spss$Vpn <- as_factor(zufriedenheit_spss$Vpn,
levels = "default")
zufriedenheit_spss$messzeitpunkt <- as_factor(zufriedenheit_spss$messzeitpunkt,
levels = "default")
Once we have converted the categorical variables to factors, they are no longer
labelled.The labels are only intended to be an intermediate data structure, so
that we can convert into regular R factors. Note that you an also remove labels using the zap_labels()
function.
5.2.1 Importing a large dataset
Let’s import a bigger dataset (in the sense that it has many variables).
Import the file called beispieldaten.sav
. Inspect the value labels of the categorical variables? The dataset contains a lot of variables.
-
A good way to get an overview is to use the
glimpse()
function. Usingglimpse(beispieldaten)
you should be able to see which variables are labelled. -
The
class
of the labelled variables is“haven_labelled”
.
If you want to know which variables (columns) are labelled, you can use the
function is.labelled()
. You want to apply this function to each column of
the data frame beispieldaten
.
A function which lets you do just that is sapply()
. This applies a function to
each item in a list (a data frame is a list) without having to write a for loop.
Running the code sapply(beispieldaten, is.labelled)
will report TRUE/FALSE
for each column, depending on whether it is labelled.
What we want is to select only those columns that are labelled. One way of doing this is:
which(sapply(beispieldaten, is.labelled))
#> westost geschlecht Mathenote
#> 2 3 77
#> Fremdsprachenote bildung_vater bildung_mutter
#> 78 80 81
#> bildung_vater_binaer bildung_mutter_binaer
#> 82 83
This function returns the indices of all columns for which sapply(beispieldaten, is.labelled)
returns TRUE
.
In the next chapter, we will get to know the select
functions from the dplyr
package. These make such operations easier. The code that will produce a similar output is:
select_if(beispieldaten, is.labelled)
#> # A tibble: 286 x 8
#> westost geschlecht Mathenote Fremdsprachenote bildung_vater bildung_mutter
#> <dbl+l> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
#> 1 0 [Wes… 1 [weibli… 3 [ausre… 5 [gut] 4 [Fachhochs… 4 [Fachhochsc…
#> 2 0 [Wes… 0 [männli… 4 [befri… 3 [ausreichend] 3 [Fachabitu… 2 [Realschula…
#> 3 0 [Wes… 1 [weibli… 5 [gut] 4 [befriedigend] 2 [Realschul… 1 [Hauptschul…
#> 4 0 [Wes… 1 [weibli… 2 [mange… 3 [ausreichend] 2 [Realschul… 4 [Fachhochsc…
#> 5 0 [Wes… 1 [weibli… 4 [befri… 4 [befriedigend] 4 [Fachhochs… 2 [Realschula…
#> 6 0 [Wes… 0 [männli… 4 [befri… 3 [ausreichend] 2 [Realschul… 2 [Realschula…
#> # … with 280 more rows, and 2 more variables: bildung_vater_binaer <dbl+lbl>,
#> # bildung_mutter_binaer <dbl+lbl>
beispieldaten %>%
select_if(is.labelled)
#> # A tibble: 286 x 8
#> westost geschlecht Mathenote Fremdsprachenote bildung_vater bildung_mutter
#> <dbl+l> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
#> 1 0 [Wes… 1 [weibli… 3 [ausre… 5 [gut] 4 [Fachhochs… 4 [Fachhochsc…
#> 2 0 [Wes… 0 [männli… 4 [befri… 3 [ausreichend] 3 [Fachabitu… 2 [Realschula…
#> 3 0 [Wes… 1 [weibli… 5 [gut] 4 [befriedigend] 2 [Realschul… 1 [Hauptschul…
#> 4 0 [Wes… 1 [weibli… 2 [mange… 3 [ausreichend] 2 [Realschul… 4 [Fachhochsc…
#> 5 0 [Wes… 1 [weibli… 4 [befri… 4 [befriedigend] 4 [Fachhochs… 2 [Realschula…
#> 6 0 [Wes… 0 [männli… 4 [befri… 3 [ausreichend] 2 [Realschul… 2 [Realschula…
#> # … with 280 more rows, and 2 more variables: bildung_vater_binaer <dbl+lbl>,
#> # bildung_mutter_binaer <dbl+lbl>
The second example uses the pipe
operater %>%
, which we will learn about in the next chapter.
5.3 Excel files
We can also import Excel spreadsheets. Click on Import Dataset > From Excel,
and then choose the file you want to import. We will call the data frame
zufriedenheit_xls
. Underneath the Name
text box there is a drop-down menu
entitled Sheet
. This allows you to specify which worksheet you want to import.
You should get the following R code in the Code Preview:
The function we are using is called read_excel()
and is available in the readxl
package (not part of the tidyverse).
Categorical variables should again be converted to factors:
5.4 RData files
The last option is an RData
file. This is a binary
file format and has the
advantage that we can combine multiple R objects, including all their
attributes, in a single file. This is very useful; when exporting to a text
file, such as CSV
, all metadata will be lost. A further advantage is that the
file may be compressed in order to save space. However, this file is specific to
R.
You can save objects in your workspace as .RData
(or .Rda
) files with the function save()
:
You can also save several objects in one go:
The file can be loaded using the function load()
:
5.5 Exercises
Importing datasets 1
- Import the file
Therapy.sav
, using both the GUI and the functionread_sav()
. - Check the coding of the grouping variables. Which level should be the reference category?
- Convert the grouping variables to factors.
Solution
- Import the data:
- Check labelled variables:
The only labelled column is Gruppe
. We can use attributes()
to look at the labels:
attributes(Therapy$Gruppe)
#> $format.spss
#> [1] "F8.2"
#>
#> $class
#> [1] "haven_labelled"
#>
#> $labels
#> Kontrollgruppe Therapiegruppe
#> 0 1
- Convert to factor:
Vpnr
is imported as a numeric vector - it does not have a labels attribute.
Therefore, as_factor() has no labels to use as factor levels and we get integers
as factor levels. Gruppe
is ‘labelled’. Here, as_factor()
can use the
labels as factor levels..
attributes(Therapy$Vpnr)
#> $format.spss
#> [1] "F3.0"
Therapy$Vpnr <- as_factor(Therapy$Vpnr)
Therapy$Gruppe <- as_factor(Therapy$Gruppe)
levels(Therapy$Gruppe)
#> [1] "Kontrollgruppe" "Therapiegruppe"
The first factor level is Kontrollgruppe
. If we use dummy coding in a linear
model, this will be used as the reference category.
Therapy
#> # A tibble: 100 x 5
#> Vpnr Gruppe Pretest Posttest Difference_PrePost
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 1 Kontrollgruppe 4.29 3.21 1.08
#> 2 2 Kontrollgruppe 6.18 5.99 0.190
#> 3 3 Kontrollgruppe 3.93 4.17 -0.239
#> 4 4 Kontrollgruppe 5.06 4.76 0.295
#> 5 5 Kontrollgruppe 6.45 5.64 0.814
#> 6 6 Kontrollgruppe 4.49 4.67 -0.180
#> # … with 94 more rows