7 Plotting with ggplot2
Graphics are very important for data analysis. On the one hand, we can use them for exploratory data analysis, to discover hidden relationships or simply to get an overview. On the other hand, we need graphics to present results and communicate them to others.
There are several ways to create graphics in R. With the original graphics system (R Base Graphics) we can quickly create simple plots. It is very powerful and flexible, but the problem is that the syntax looks a bit archaic, and it is difficult for beginners to customize graphics.
In contrast, ggplot2 is based on an intuitive syntax, the so-called grammar of graphics. Once you get used to it, you can create very complex graphics with an elegant and consistent “grammar”. ggplot2 is designed to work with tidy data, i.e. we need data in long format. Plots are always created according to the same principle:
1. Start by preparing a dataset so that it is in the right format.
2. Create a plot object using the function ggplot().
3. Define so-called “aesthetic mappings”, i.e. we determine which variables should be displayed on the X and Y axes and which variables are used to group the data. The function we use for this is called aes().
4. Add one or more “layers” to the plot. These layers define how something should be displayed, e.g. as a line or as a histogram. These functions begin with the prefix geom_, e.g. geom_line().
5. Add further specifications, such as the color scheme that should be used, and possibly faceting by the levels of a grouping variable.
To use ggplot2 we need an additional operator: +. You already know this as a mathematical operator, but in this context + means that we combine individual elements of a plot object.
After this somewhat abstract introduction, we illustrate these steps with a practical example.
At the end of the last chapter, we examined the relationship between psychological stress and gender. Now we load the data set again (the read_sav() call is shown in Section 7.4) and create a data set containing only the variables ID, geschlecht and stress_psychisch.
stress <- adolescents %>%
dplyr::select(ID, geschlecht, stress_psychisch) %>%
mutate_at(vars(geschlecht, ID), as_factor) %>%
drop_na()
stress
#> # A tibble: 284 x 3
#> ID geschlecht stress_psychisch
#> <fct> <fct> <dbl>
#> 1 1 weiblich 1.67
#> 2 2 männlich 3.5
#> 3 10 weiblich 3.67
#> 4 11 weiblich 1.5
#> 5 12 weiblich 2.5
#> 6 14 männlich 1
#> # … with 278 more rows
In this data set we have a numeric variable, stress_psychisch, and a grouping variable, geschlecht (gender). Our question was whether the variable stress_psychisch is related to the grouping variable geschlecht. We could graphically illustrate this relationship in various ways: with dots, with a box plot or with a violin plot. In the language of ggplot2 these three methods are geoms and can be used as follows: geom_point(), geom_boxplot() or geom_violin(). In addition, there is a function geom_jitter() that spatially jitters the data points (as an alternative to displaying data points with the same value on top of each other).
We can load the ggplot2 package individually or as part of the tidyverse:
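A minimal sketch of the two loading options, assuming the packages are already installed. Only one of the two calls is needed.

```r
# Option 1: load only ggplot2
library(ggplot2)

# Option 2 (alternative): load the whole tidyverse, which attaches ggplot2
# together with dplyr, tidyr and other packages
# library(tidyverse)
```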
7.1 Step 1: Creating a plot object
We start with a data set and create a plot object with the function ggplot(). This function takes a data frame as its first argument, which means that we can use the pipe operator. We have two options: we prefer the pipe notation here, but it is also possible to specify the data frame as an argument within the function. Furthermore, we assign the object to a variable and call it p.
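The two options could look roughly like this. This is only a sketch: stress_toy is a small made-up data frame standing in for the stress data set, and dplyr is loaded here only to make the pipe operator %>% available.

```r
library(ggplot2)
library(dplyr)  # provides the pipe operator %>%

# Hypothetical toy data standing in for the chapter's `stress` data set
stress_toy <- data.frame(
  geschlecht = factor(rep(c("weiblich", "männlich"), each = 6)),
  stress_psychisch = c(1.67, 3.67, 1.5, 2.5, 2.5, 3, 3.5, 1, 2, 2, 4, 1.5)
)

# Option 1: send the data frame to ggplot() with the pipe
p <- stress_toy %>%
  ggplot()

# Option 2: pass the data frame as the first argument
p2 <- ggplot(data = stress_toy)
```

Both calls create the same kind of object: a plot that knows its data but has no mappings and no layers yet.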
7.2 Step 2: Aesthetic mappings
With the second argument, mapping, we now define the “aesthetic mappings”. These determine how the variables are used to represent the data and are defined using the aes() function. We want to represent the grouping variable geschlecht on the X-axis, and stress_psychisch should be displayed on the Y-axis. In addition, aes() can have further arguments: fill, color, shape, linetype, group. These are used to assign different colors, shapes, line types, etc. to the levels of grouping variables.
In this example, we have the grouping variable geschlecht, and we want its two levels to have different outline colors and different fill colors.
p <- stress %>%
ggplot(mapping = aes(x = geschlecht,
y = stress_psychisch,
color = geschlecht,
fill = geschlecht))
p is now an ‘empty’ plot object. We can look at it, but no data are displayed yet, as it does not yet contain any “layers”. A ggplot2 object is displayed by sending the object to the console, either with or without print().
We can see that ggplot2 has already labeled the axes for us using the variable names.
7.3 Step 3: Add geoms
Using the geom_ functions we can now add “layers” to the plot object p. The syntax works like this: we “add” (+) a geom to the plot object p: p + geom_*().
7.3.1 Scatter plot
We first try to represent the observations as points:
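A sketch of this step, using a small made-up data frame (stress_toy) in place of the stress data set so that the example runs on its own:

```r
library(ggplot2)

# Hypothetical toy data standing in for the chapter's `stress` data set
stress_toy <- data.frame(
  geschlecht = factor(rep(c("weiblich", "männlich"), each = 6)),
  stress_psychisch = c(1.67, 3.67, 1.5, 2.5, 2.5, 3, 3.5, 1, 2, 2, 4, 1.5)
)

# Plot object with the aesthetic mappings from Step 2
p <- ggplot(stress_toy,
            mapping = aes(x = geschlecht,
                          y = stress_psychisch,
                          color = geschlecht,
                          fill = geschlecht))

# Add a point layer and display the plot
p_points <- p + geom_point()
p_points
```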
The points are now displayed in different colors, but within a gender, points with the same value are plotted on top of each other (overplotting). For this case, there is the function geom_jitter(), which adds a small amount of random noise to the positions so that the points are drawn side by side:
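A sketch with made-up toy data (stress_toy is hypothetical, not the chapter's data set). Because the jitter is random, the plot looks slightly different each time it is drawn.

```r
library(ggplot2)

# Hypothetical toy data; several observations share the same value
stress_toy <- data.frame(
  geschlecht = factor(rep(c("weiblich", "männlich"), each = 6)),
  stress_psychisch = c(1.67, 3.67, 1.5, 2.5, 2.5, 3, 3.5, 1, 2, 2, 4, 1.5)
)

p <- ggplot(stress_toy,
            mapping = aes(x = geschlecht,
                          y = stress_psychisch,
                          color = geschlecht))

# Same points as geom_point(), but with random displacement
p_jitter <- p + geom_jitter()
p_jitter
```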
geom_jitter() has an argument width, with which we can define how widely the points are jittered horizontally.
geom_jitter() has further arguments: size determines the size of the points, and alpha determines their transparency.
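The arguments can be combined as in the following sketch (toy data again, and the specific values are only illustrative). The argument height = 0 is an addition not discussed above: by default geom_jitter() also displaces points vertically, and setting height to 0 keeps the stress values themselves unchanged.

```r
library(ggplot2)

# Hypothetical toy data standing in for the chapter's `stress` data set
stress_toy <- data.frame(
  geschlecht = factor(rep(c("weiblich", "männlich"), each = 6)),
  stress_psychisch = c(1.67, 3.67, 1.5, 2.5, 2.5, 3, 3.5, 1, 2, 2, 4, 1.5)
)

p <- ggplot(stress_toy,
            mapping = aes(x = geschlecht,
                          y = stress_psychisch,
                          color = geschlecht))

p_jitter <- p + geom_jitter(width = 0.2,  # horizontal spread
                            height = 0,   # no vertical displacement
                            size = 3,     # larger points
                            alpha = 0.6)  # semi-transparent points
p_jitter
```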
7.3.2 Visualizing a distribution
Another possibility is to represent the central tendency and dispersion of the data with a box plot or a violin plot.
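In a box plot, the line inside the box marks the median, the box spans the middle 50% of the data (the interquartile range), and each whisker extends to the most extreme observation within 1.5 times the interquartile range of the box. Observations beyond the whiskers are drawn as individual dots (outliers). To actually see the median in the diagram, it is better to omit the fill attribute: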
p <- stress %>%
ggplot(mapping = aes(x = geschlecht,
y = stress_psychisch,
color = geschlecht))
p + geom_boxplot()
A violin plot is similar to a box plot, but instead of the quantiles it shows a kernel density estimate. A violin plot looks best when we use the fill attribute.
p <- stress %>%
ggplot(mapping = aes(x = geschlecht,
y = stress_psychisch,
fill = geschlecht))
p + geom_violin()
If we find that a mapping should not apply to all “layers”, then we can define it individually for each “layer” rather than in the ggplot() function:
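One way this could look, sketched with made-up toy data: the x and y mappings are defined globally in ggplot(), while fill is defined only inside geom_violin().

```r
library(ggplot2)

# Hypothetical toy data standing in for the chapter's `stress` data set
stress_toy <- data.frame(
  geschlecht = factor(rep(c("weiblich", "männlich"), each = 6)),
  stress_psychisch = c(1.67, 3.67, 1.5, 2.5, 2.5, 3, 3.5, 1, 2, 2, 4, 1.5)
)

# Global mappings: apply to every layer
p <- ggplot(stress_toy,
            mapping = aes(x = geschlecht, y = stress_psychisch))

# Local mapping: fill is used by the violin layer only
p_violin <- p + geom_violin(mapping = aes(fill = geschlecht))
p_violin
```

Any further layer added to p_violin would inherit x and y, but not fill.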
7.3.3 Combining multiple layers
We can also use multiple layers. We just need to combine several geom_ functions with a +:
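As a sketch, a violin layer and a jittered point layer can be stacked like this (toy data, and the choice of these two geoms is only an illustration):

```r
library(ggplot2)

# Hypothetical toy data standing in for the chapter's `stress` data set
stress_toy <- data.frame(
  geschlecht = factor(rep(c("weiblich", "männlich"), each = 6)),
  stress_psychisch = c(1.67, 3.67, 1.5, 2.5, 2.5, 3, 3.5, 1, 2, 2, 4, 1.5)
)

p <- ggplot(stress_toy,
            mapping = aes(x = geschlecht,
                          y = stress_psychisch,
                          fill = geschlecht))

# Layers are drawn in the order in which they are added:
# first the violins, then the individual observations on top
p_combined <- p +
  geom_violin(alpha = 0.5) +
  geom_jitter(width = 0.2, height = 0)
p_combined
```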
In the previous examples in this script, we did not create a plot object, but rather sent the data frame to the ggplot() function using the pipe operator, and then directly added the geoms with +. In addition, we have used other functions, such as theme_classic(), to make the background white.
7.4 Geoms for different data types
Let’s summarize: so far we have learned how to put together a plot in several steps. We start with a data frame and define a ggplot2 object using the ggplot() function. With the aes() function, we assign variables of the data frame to the X or Y axis and define further “aesthetic mappings”, e.g. a color coding based on a grouping variable. Then we add graphic elements to the plot object as “layers”, using the geom_*() functions.
Now let’s look at a selection of geoms for different combinations of variables. We can either display one variable on the X axis or two variables on the X and Y axes, and these variables can be either continuous or categorical.
Here we will only consider a small selection of the possible ggplot2 functions. The package is very extensive and has a very good website where everything is documented: ggplot2 documentation
After working through this chapter, you will be able to find graphical visualization solutions yourself. Data visualization can be a very creative process and is fun! Further examples of specific data analysis methods can be found in the following chapters.
For the following examples we use the data sets ‘beispieldaten’ (adolescents) and ‘Kinderwunsch_Schweiz’ (kinderwunsch):
adolescents <- read_sav("data/beispieldaten.sav")
adolescents <- adolescents %>%
mutate(westost = as_factor(westost),
geschlecht = as_factor(geschlecht),
bildung_mutter = as_factor(bildung_mutter,
levels = "values"),
bildung_vater = as_factor(bildung_vater,
levels = "values"),
ID = as.factor(ID))
kinderwunsch <- read_sav("data/Kinderwunsch.sav")
kinderwunsch <- kinderwunsch %>%
mutate(geschl = as_factor(geschl))
7.4.1 One variable
Even if we want to display only one variable (on the X-axis), we still have to represent some values on the Y-axis. This will often be a descriptive summary such as the frequencies of a categorical variable.
Categorical variable
When we plot a categorical variable, we often use a bar chart (bar graph). This plot represents the frequency of each category by the height of a rectangular bar. The function that is used for this is called geom_bar().
As an example we want to plot the frequencies of the four levels of education of the father:
If we use fill = 'lightblue', color = 'black' outside the aes() function, these arguments are not treated as a grouping specification. Instead, they are fixed values that apply to all elements: for instance, we can simply color all bars light blue with fill = 'lightblue'.
p <- adolescents %>%
dplyr::select(bildung_vater) %>%
drop_na() %>%
ggplot(aes(x = bildung_vater))
p + geom_bar(fill = 'lightblue', color = 'black')
For an overview of the possible color names, use the colors() function. There are 657 colors, of which we show only 15 here, randomly sampled with sample(15):
colors() %>% sample(15)
#> [1] "gray96" "linen" "salmon" "skyblue3"
#> [5] "lightcyan1" "lemonchiffon" "grey1" "aquamarine"
#> [9] "pink1" "gray4" "azure4" "rosybrown1"
#> [13] "lavenderblush2" "gray18" "darkorchid4"
Here, too, we can specify a grouping variable, which we use to color code the rectangles.
p <- adolescents %>%
dplyr::select(bildung_vater, westost) %>%
drop_na() %>%
ggplot(aes(x = bildung_vater, fill = westost))
p + geom_bar()
By default, ggplot2 creates a stacked bar chart, i.e. the rectangles are stacked on top of each other. If this is not desired, we can use the argument position = "dodge" of the function geom_bar(). This tells the geom_bar() function that the bars should be drawn next to each other.
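A sketch of the dodged variant. The data frame edu_toy is made up for illustration and only mimics the variables bildung_vater and westost.

```r
library(ggplot2)

# Hypothetical toy data mimicking father's education and region
edu_toy <- data.frame(
  bildung_vater = factor(c(1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4)),
  westost = factor(c("West", "Ost", "West", "West", "West", "Ost",
                     "Ost", "West", "Ost", "West", "West", "Ost"))
)

p <- ggplot(edu_toy, aes(x = bildung_vater, fill = westost))

# One bar per combination of education level and region, side by side
p_dodge <- p + geom_bar(position = "dodge")
p_dodge
```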
As a third variant we can use position = "identity". The bars of the groups then all start at zero and are drawn in front of one another, so that they overlap. Since a bar in the background may be hidden by the bar in front of it, we use the alpha argument to make the bars transparent.
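A sketch of the overlapping variant with made-up data (edu_toy is hypothetical, and alpha = 0.5 is just one possible value):

```r
library(ggplot2)

# Hypothetical toy data mimicking father's education and region
edu_toy <- data.frame(
  bildung_vater = factor(c(1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4)),
  westost = factor(c("West", "Ost", "West", "West", "West", "Ost",
                     "Ost", "West", "Ost", "West", "West", "Ost"))
)

p <- ggplot(edu_toy, aes(x = bildung_vater, fill = westost))

# Bars overlap; transparency keeps the bar in the background visible
p_identity <- p + geom_bar(position = "identity", alpha = 0.5)
p_identity
```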
Continuous variable
If the variable that we want to represent graphically is not categorical but continuous, a histogram is the appropriate option; we generate this with the function geom_histogram(). As an example we consider the psychological stress symptoms.
A histogram provides a graphical representation of the distribution of a numerical variable. For this purpose, the values of the variable are subdivided into discrete intervals, or bins. On the Y-axis, the frequencies in the respective intervals are then displayed, analogous to a bar chart. The choice of the interval size (bin width) is critical. If we do not specify anything, ggplot2 selects a bin width itself, but we can also specify it ourselves using the binwidth argument.
p <- adolescents %>%
dplyr::select(stress_psychisch) %>%
drop_na() %>%
ggplot(mapping = aes(x = stress_psychisch))
# automatic bin width selection
p + geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The optimal bin width depends on the scale and variance of the variable and should neither be too fine nor too coarse. Try out several values for the bin width.
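If we want the Y axis to show the density (the relative frequency divided by the bin width, so that the bar areas sum to 1) instead of the absolute frequencies, we can map y to the computed variable density inside the aes() function, using y = after_stat(density). Older code writes this as y = ..density.., a notation that is deprecated since ggplot2 3.4.0.
p <- adolescents %>%
dplyr::select(stress_psychisch) %>%
drop_na() %>%
ggplot(mapping = aes(x = stress_psychisch, y = after_stat(density)))
p + geom_histogram(binwidth = 0.5)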
We can also use a grouping variable with histograms:
p <- adolescents %>%
dplyr::select(stress_psychisch, geschlecht) %>%
drop_na() %>%
ggplot(mapping = aes(x = stress_psychisch, fill = geschlecht))
p + geom_histogram(binwidth = 0.5)
As with the bar chart, the histograms are stacked on top of each other by default. If we want them drawn separately in the same plotting area, we use position = "identity" together with the transparency argument alpha.
p <- adolescents %>%
dplyr::select(stress_psychisch, geschlecht) %>%
drop_na() %>%
ggplot(mapping = aes(x = stress_psychisch, fill = geschlecht))
p + geom_histogram(binwidth = 0.5,
position = "identity",
alpha = 0.6)
Placing the bars next to each other is also possible:
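A sketch of a dodged histogram, using made-up toy data in place of the adolescents data set:

```r
library(ggplot2)

# Hypothetical toy data standing in for the adolescents data set
stress_toy <- data.frame(
  geschlecht = factor(rep(c("weiblich", "männlich"), each = 6)),
  stress_psychisch = c(1.67, 3.67, 1.5, 2.5, 2.5, 3, 3.5, 1, 2, 2, 4, 1.5)
)

p <- ggplot(stress_toy,
            mapping = aes(x = stress_psychisch, fill = geschlecht))

# Within each bin, the bars of the two groups are drawn side by side
p_hist <- p + geom_histogram(binwidth = 0.5, position = "dodge")
p_hist
```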
7.4.2 Two variables
Now we display two variables of a data set together. Again, the possible geoms depend on the data type of the variables.
X and Y continuous
If both variables are continuous, we can show their relationship using a scatterplot or a line graph. We use the functions geom_point() or geom_line().
As an example we want to visualize the relation between psychological stress symptoms and life satisfaction.
p <- adolescents %>%
dplyr::select(stress_psychisch, leben_gesamt) %>%
drop_na() %>%
ggplot(mapping = aes(x = stress_psychisch, y = leben_gesamt))
p + geom_point(size = 2, alpha = 0.6)
We have already seen the size and alpha arguments above, as well as the possibility of avoiding ‘overplotting’ with the function geom_jitter(). Both geom_jitter() and geom_point() also have a color argument.
Grouping based on a categorical variable works here as well. We use both the color and the shape of the dots to better distinguish the categories.
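A sketch of such a grouped scatterplot. The data frame wellbeing_toy and its values are made up and only mimic the variables stress_psychisch, leben_gesamt and geschlecht.

```r
library(ggplot2)

# Hypothetical toy data mimicking stress and life satisfaction
wellbeing_toy <- data.frame(
  geschlecht = factor(rep(c("weiblich", "männlich"), each = 5)),
  stress_psychisch = c(1.5, 2, 2.5, 3, 4, 1, 2, 3, 3.5, 4.5),
  leben_gesamt = c(6, 5.5, 5, 4.5, 3, 6.5, 5, 4.5, 4, 3.5)
)

# Map the grouping variable to both color and shape
p <- ggplot(wellbeing_toy,
            mapping = aes(x = stress_psychisch,
                          y = leben_gesamt,
                          color = geschlecht,
                          shape = geschlecht))

p_grouped <- p + geom_point(size = 3, alpha = 0.6)
p_grouped
```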
X categorical and Y continuous
If one of the variables is categorical, then instead of using it as a grouping variable, we can represent it on one axis.
We have already seen examples of this, where we plotted the variables geschlecht and stress_psychisch and used the functions geom_boxplot() and geom_violin(). But we can also use the function geom_bar() for two variables. Because the bars are stacked by default, the values on the Y axis are effectively summed over all observations in each category on the X axis. Since the values should be used as they are rather than counted, no statistical transformation is required, and we use the argument stat = 'identity'.
As an example, let’s consider the kinderwunsch data set. In this study, adolescent subjects were asked whether or not they wanted to have a child in the future (binary answer, variable ‘kind_d’). We now want to show the absolute frequencies of a “yes” response on the Y-axis:
p <- kinderwunsch %>%
ggplot(aes(x = geschl,
y = kind_d,
fill = geschl))
p + geom_bar(stat = 'identity')
In order to better understand this graph, we additionally calculate the relative frequencies of a “yes” response per gender.
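The following sketch shows one way to do this. It rests on two assumptions that are not confirmed by the text: that kind_d is coded 0/1 with 1 meaning “yes”, and that the mean of kind_d per group is therefore the relative frequency of “yes”. The data frame kinderwunsch_toy and the column name prop_yes are made up for illustration.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data; kind_d is ASSUMED to be coded 0 = no, 1 = yes
kinderwunsch_toy <- data.frame(
  geschl = factor(rep(c("weiblich", "männlich"), each = 5)),
  kind_d = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0)
)

# Relative frequency of "yes" per gender = mean of the 0/1 variable
rel_freq <- kinderwunsch_toy %>%
  group_by(geschl) %>%
  summarise(prop_yes = mean(kind_d))

# One bar per gender; the values are plotted as they are
p_rel <- rel_freq %>%
  ggplot(aes(x = geschl, y = prop_yes, fill = geschl)) +
  geom_bar(stat = "identity")
p_rel
```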
X and Y categorical
Finally, the variables can be categorical on both the X and Y axes. In this case it is useful to plot the joint frequencies using the function geom_count().
As an example we want to consider the joint frequency distribution of the education of the father and the education of the mother.
p <- adolescents %>%
dplyr::select(starts_with("bildung")) %>%
drop_na() %>%
ggplot(aes(x = bildung_vater,
y = bildung_mutter))
p + geom_count()
geom_count() counts the joint frequencies of the categories of the two variables and displays them as the size (area) of the points.
We can obtain the frequency table with the table() function.
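A sketch of such a frequency table, computed on a small made-up data frame (edu_pairs is hypothetical and only mimics the two education variables):

```r
# Hypothetical toy data mimicking the education of father and mother
edu_pairs <- data.frame(
  bildung_vater  = factor(c(1, 1, 2, 2, 3, 4, 4, 4)),
  bildung_mutter = factor(c(1, 2, 2, 2, 3, 3, 4, 4))
)

# Cross-tabulate: rows = father's education, columns = mother's education
freq <- table(edu_pairs$bildung_vater,
              edu_pairs$bildung_mutter,
              dnn = c("bildung_vater", "bildung_mutter"))
freq
```

Each cell of the table corresponds to one point in the geom_count() plot.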
7.4.3 Example
Wide versus Long: Education of Parents
In the next example, we look at the difference between long and wide formats. We have used bildung_vater and bildung_mutter as separate variables to represent them on separate axes when we presented the joint frequency distribution of father’s and mother’s education. However, we could also treat bildung_vater and bildung_mutter as levels of a repeated-measures factor eltern (parents), with the education levels as the measurement variable bildung (the value), i.e. as a key/value pair. We do this if we want to use bildung as a variable on an axis and eltern as a grouping variable.
This may not be easy to understand, so let’s look at a concrete example. We want to graphically display the average grade for the different educational levels of the parents. However, we want different lines for father and mother. Now it is important that we have a long dataset.
adolescents <- read_sav("data/beispieldaten.sav")
adolescents <- adolescents %>%
mutate(westost = as_factor(westost),
geschlecht = as_factor(geschlecht),
bildung_mutter = as_factor(bildung_mutter,
levels = "values"),
bildung_vater = as_factor(bildung_vater,
levels = "values"),
ID = as.factor(ID))
bildung <- adolescents %>%
# choose variables
dplyr::select(Gesamtnote, bildung_vater, bildung_mutter) %>%
# remove missing values
drop_na() %>%
# wide to long
pivot_longer(-Gesamtnote, names_to = "eltern", values_to = "bildung") %>%
# convert to factors
mutate(eltern = as.factor(eltern),
bildung = as.factor(bildung)) %>%
mutate(eltern = str_replace(eltern, ".*_", "")) %>%
# grouping: first parents (eltern), then educational level (bildung)
group_by(eltern, bildung) %>%
# compute grade average
summarise(Gesamtnote = mean(Gesamtnote))
bildung
#> # A tibble: 8 x 3
#> # Groups: eltern [2]
#> eltern bildung Gesamtnote
#> <chr> <fct> <dbl>
#> 1 mutter 1 4.10
#> 2 mutter 2 4.37
#> 3 mutter 3 4.51
#> 4 mutter 4 4.73
#> 5 vater 1 4.08
#> 6 vater 2 4.38
#> # … with 2 more rows
p <- bildung %>%
ggplot(aes(x = bildung,
y = Gesamtnote,
colour = eltern,
linetype = eltern,
group = eltern))
p + geom_line(size = 2) +
geom_point(size = 4)
In this example it becomes clear that a big part of working with ggplot2 is getting the data into the ‘correct’ format first. Once this is done, however, it is relatively easy to create the desired plot.
7.5 Facets
Previously, we used grouping variables to create different colors/shapes/lines for the categories of the grouping variable within a plot. Sometimes this is too confusing.
For example, if we want to create a histogram of grades at each stage of the mother’s education, then the graph would be completely overloaded.
p <- adolescents %>%
dplyr::select(Gesamtnote, bildung_mutter) %>%
drop_na() %>%
ggplot(mapping = aes(x = Gesamtnote,
fill = bildung_mutter))
p + geom_histogram(binwidth = 0.8,
position = "dodge")
An obvious solution would be to display the histograms for the mother’s education levels in separate graphics.
This is exactly what we can do with the functions facet_wrap() and facet_grid().
With facet_wrap() we create a separate panel for each category of the grouping variable:
p <- adolescents %>%
dplyr::select(Gesamtnote, bildung_mutter) %>%
drop_na() %>%
ggplot(mapping = aes(x = Gesamtnote,
fill = bildung_mutter)) +
facet_wrap(~bildung_mutter)
p + geom_histogram(binwidth = 0.8)
The ~ (tilde) sign here roughly means ‘depending on’ or ‘as a function of’.
If we have two grouping variables, we can create a grid of plots with facet_grid().
p <- adolescents %>%
dplyr::select(Gesamtnote, bildung_mutter, bildung_vater) %>%
drop_na() %>%
ggplot(mapping = aes(x = Gesamtnote)) +
facet_grid(bildung_mutter ~ bildung_vater)
p + geom_histogram(binwidth = 0.8,
fill = 'steelblue4')
Here the levels of bildung_mutter are shown in the rows, and the levels of bildung_vater in the columns.
As a second example we can plot the grade average as a line diagram, depending on the parents’ education, separated into plots for fathers and mothers. This example shows that we can use facet_grid() even if we only have one grouping variable, to choose whether the panels are arranged in rows or in columns.
If we want the grouping in the rows, we write facet_grid(grouping_variable ~ .); if we want it in the columns, we write facet_grid(. ~ grouping_variable). The dot . here means that we do not use a grouping variable for the columns or rows, respectively.
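A sketch of both arrangements. The data frame grades_toy mimics the structure of the bildung data set from the previous section, but its grade values are made up.

```r
library(ggplot2)

# Hypothetical toy data; the grade averages are invented for illustration
grades_toy <- data.frame(
  eltern = rep(c("mutter", "vater"), each = 4),
  bildung = factor(rep(1:4, times = 2)),
  Gesamtnote = c(4.1, 4.4, 4.5, 4.7, 4.0, 4.3, 4.6, 4.8)
)

# group = eltern is needed so that geom_line() connects the points
# across the levels of the categorical x variable
p <- ggplot(grades_toy,
            aes(x = bildung, y = Gesamtnote, group = eltern)) +
  geom_line() +
  geom_point()

# Grouping variable in the columns: panels side by side
p_cols <- p + facet_grid(. ~ eltern)

# Grouping variable in the rows: panels stacked vertically
p_rows <- p + facet_grid(eltern ~ .)

p_cols
p_rows
```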
7.6 Colors and themes
So far, we have used the default colour palette. However, the standard colours are unsuitable for color blind people. There are many color palettes that we could use. One particularly attractive colour scheme is the viridis colour palette.
We will use the palette as follows:
- To fill in shapes, we use scale_fill_viridis_d() or scale_fill_viridis_c(), depending on whether the variable is discrete or continuous.
- For lines and dots, we use scale_color_viridis_d() or scale_color_viridis_c().
As an example, we again plot the relationship between stress_psychisch and leben_gesamt, this time with a color palette of our choice.
p <- adolescents %>%
dplyr::select(ID, geschlecht, stress_psychisch, leben_gesamt) %>%
mutate_at(vars(geschlecht, ID), as_factor) %>%
drop_na() %>%
ggplot(mapping = aes(x = stress_psychisch,
y = leben_gesamt,
color = geschlecht,
shape = geschlecht))
p + geom_jitter(size = 3, alpha = 0.9) +
scale_color_viridis_d()
We can also assign the colors ‘manually’:
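A sketch using scale_color_manual(). The toy data and the two color names are arbitrary choices for illustration. The colors are assigned to the factor levels in order.

```r
library(ggplot2)

# Hypothetical toy data mimicking stress and life satisfaction
wellbeing_toy <- data.frame(
  geschlecht = factor(rep(c("weiblich", "männlich"), each = 5)),
  stress_psychisch = c(1.5, 2, 2.5, 3, 4, 1, 2, 3, 3.5, 4.5),
  leben_gesamt = c(6, 5.5, 5, 4.5, 3, 6.5, 5, 4.5, 4, 3.5)
)

p <- ggplot(wellbeing_toy,
            mapping = aes(x = stress_psychisch,
                          y = leben_gesamt,
                          color = geschlecht))

# One color per level of the grouping variable
p_manual <- p +
  geom_point(size = 3) +
  scale_color_manual(values = c("darkorange", "steelblue4"))
p_manual
```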
Another tricky point is the gray background that ggplot2 uses by default. The easiest way to change this is to set a ‘theme’. Two of the built-in themes have a white background: theme_bw() and theme_classic(). They differ in that theme_classic() draws no gridlines and shows only the left and bottom axis lines, whereas theme_bw() keeps the gridlines and draws a border around the panel.
p <- adolescents %>%
dplyr::select(stress_psychisch, leben_gesamt, geschlecht) %>%
drop_na() %>%
ggplot(mapping = aes(x = stress_psychisch,
y = leben_gesamt,
color = geschlecht,
shape = geschlecht))
p + geom_jitter(size = 3, alpha = 0.9) +
scale_color_viridis_d() +
theme_bw()
p <- adolescents %>%
dplyr::select(stress_psychisch, leben_gesamt, geschlecht) %>%
drop_na() %>%
ggplot(mapping = aes(x = stress_psychisch,
y = leben_gesamt,
color = geschlecht,
shape = geschlecht))
p + geom_jitter(size = 3, alpha = 0.9) +
scale_color_viridis_d() +
theme_classic()
There are quite a few packages on CRAN that provide additional ggplot2 themes.
7.7 Plot labels
Now we can also change the labels of the X and Y axes with xlab() and ylab(), and give the plot a title with the ggtitle() function. With the function labs() we can additionally change the title of the legend.
Finally, we also want to increase the font size, as the default often seems too small. We do this with the argument base_size of the theme_*() functions, e.g. base_size = 14.
p <- adolescents %>%
dplyr::select(stress_psychisch, leben_gesamt, geschlecht) %>%
drop_na() %>%
ggplot(mapping = aes(x = stress_psychisch,
y = leben_gesamt,
color = geschlecht,
shape = geschlecht))
p + geom_jitter(size = 3, alpha = 0.9) +
scale_color_viridis_d() +
theme_classic(base_size = 14) +
ggtitle("Zufriedenheit vs. Stress") +
xlab("Psych. Stress")+
ylab("Zufriedenheit") +
# the title of both the color legend and the shape legend is "Geschlecht"
labs(color = "Geschlecht",
shape = "Geschlecht")
7.8 Saving plots
Of course, once we have made a nice plot, we want to save it. We can do this with the function ggsave(). The function takes as arguments the file name, the name of the plot object and other properties, such as the desired height and width of the plot. These may be expressed in centimeters using the argument units = "cm". To pass the plot to ggsave(), we first assign our finished ggplot2 object to a variable:
p <- adolescents %>%
dplyr::select(stress_psychisch, leben_gesamt, geschlecht) %>%
drop_na() %>%
ggplot(mapping = aes(x = stress_psychisch,
y = leben_gesamt,
color = geschlecht,
shape = geschlecht))
my_plot <- p + geom_jitter(size = 3, alpha = 0.9) +
scale_color_viridis_d() +
theme_classic(base_size = 14) +
ggtitle("Zufriedenheit vs. Stress") +
xlab("Psych. Stress")+
ylab("Zufriedenheit") +
labs(color = "Geschlecht") +
guides(shape = "none")
my_plot can now be saved:
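A sketch of the call. The plot is built from made-up toy data so that the example runs on its own, and the file name, size and output folder are arbitrary choices. ggsave() picks the file format from the file extension.

```r
library(ggplot2)

# Hypothetical toy data and a simple plot object to save
stress_toy <- data.frame(
  geschlecht = factor(rep(c("weiblich", "männlich"), each = 6)),
  stress_psychisch = c(1.67, 3.67, 1.5, 2.5, 2.5, 3, 3.5, 1, 2, 2, 4, 1.5)
)

my_plot <- ggplot(stress_toy,
                  aes(x = geschlecht, y = stress_psychisch,
                      color = geschlecht)) +
  geom_jitter(width = 0.2, height = 0) +
  theme_classic(base_size = 14)

# Here the file goes to a temporary folder; in practice we would
# use a path inside our project, e.g. "my_plot.pdf"
out_file <- file.path(tempdir(), "my_plot.pdf")

ggsave(filename = out_file,
       plot = my_plot,
       width = 16,
       height = 12,
       units = "cm")
```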
7.9 Exercise
In this exercise we want to examine the six self-efficacy scales in the ‘adolescents’ data set. We assume that these are related to the overall grade average. Before we calculate a multiple regression, let’s try to plot the relationships between these variables.
selbstwirksamkeit_wide <- adolescents %>%
dplyr::select(ID, Gesamtnote, starts_with("swk_")) %>%
drop_na()
selbstwirksamkeit_wide
#> # A tibble: 279 x 8
#> ID Gesamtnote swk_neueslernen swk_lernregulat… swk_motivation
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 2 4 5 4 4.6
#> 2 10 5 5 4.88 6
#> 3 11 4 4.57 5.38 5.4
#> 4 12 5 5.33 5.29 4.67
#> 5 14 4 4.86 4.5 4.6
#> 6 15 4 4 4.38 5.6
#> # … with 273 more rows, and 3 more variables: swk_durchsetzung <dbl>,
#> # swk_sozialkomp <dbl>, swk_beziehung <dbl>
Try to discover relationships between the six self-efficacy scales by plotting them. For example, try to create a scatterplot for each pair of variables.
If you want a pairs plot, you can install the package GGally from CRAN. This provides the function ggpairs().
color is an attribute of lines or points; fill is an attribute of areas.
If we define the “aesthetic mappings” within the function ggplot(), they apply to all “layers”, i.e. to all elements of the plot. We could also define these mappings separately for each “layer”.