Intermediate R for reproducible scientific analysis
For loops
Learning objectives
- Write and understand
for
loops.
Repeating operations
Often when we’re trying to solve a problem or run some analysis we find ourselves doing the same thing over, and over, and over again on different groupings of data, or on different files, or with slight parameter variations.
The great thing about R, and programming in general, is it allows us to be lazy. Why do a repetitive task if you can make the computer do it for you?
For example, lets say I wanted to calculated the total population for each continent in the gapminder dataset in 2007. We could do this in several ways, but the most basic approach is manually:
gap[year == 2007 & continent == "Asia", sum(pop)]
[1] 3811953827
gap[year == 2007 & continent == "Africa", sum(pop)]
[1] 929539692
gap[year == 2007 & continent == "Americas", sum(pop)]
[1] 898871184
gap[year == 2007 & continent == "Europe", sum(pop)]
[1] 586098529
gap[year == 2007 & continent == "Oceania", sum(pop)]
[1] 24549947
This is tedious to type out. We can do it, but imagine if we wanted to run some calculation for each country!
The clever way to do this would be to use our recently acquired data.table skills:
gap[year == 2007, sum(pop), by=continent]
continent V1
1: Asia 3811953827
2: Europe 586098529
3: Africa 929539692
4: Americas 898871184
5: Oceania 24549947
But sometimes the solution to a problem isn’t obvious, or doesn’t fit into a format we’re used to, so it’s helpful to have multiple tools in our problem-solving toolbox to fall back on.
With a for loop we can instead iterate over each continent, and tell R to run the same command:
for (cc in gap[,unique(continent)]) {
popsum <- gap[year == yy & continent == cc, sum(pop)]
print(paste(cc, ":", popsum))
}
Error in eval(expr, envir, enclos): object 'yy' not found
This construct tells R to go through each thing on the right of the in
operator and store it in the variable cc
. Inside the body of the for
loop, i.e. any lines of code that fall between the curly braces ({
and }
), we can then access the value of cc
to do whatever we like. So first, cc
will hold the value “Asia”, then it will run the line of code, and return back to the top of the loop. Next cc
will hold the value “Europe”, and do the same thing, and so on.
What if we want to look at the change in total population for each continent over the years? We can “nest” for loops to iterate through multiple separate conditions:
for (cc in gap[,unique(continent)]) {
for (yy in gap[,unique(year)]) {
popsum <- gap[year == yy & continent == cc, sum(pop)]
print(paste(cc, yy, ":", popsum))
}
}
[1] "Asia 1952 : 1395357351.99999"
[1] "Asia 1957 : 1562780599"
[1] "Asia 1962 : 1696357182"
[1] "Asia 1967 : 1905662900"
[1] "Asia 1972 : 2150972248"
[1] "Asia 1977 : 2384513556"
[1] "Asia 1982 : 2610135582"
[1] "Asia 1987 : 2871220762"
[1] "Asia 1992 : 3133292191"
[1] "Asia 1997 : 3383285500"
[1] "Asia 2002 : 3601802203"
[1] "Asia 2007 : 3811953827"
[1] "Europe 1952 : 418120846"
[1] "Europe 1957 : 437890351"
[1] "Europe 1962 : 460355155"
[1] "Europe 1967 : 481178958"
[1] "Europe 1972 : 500635059"
[1] "Europe 1977 : 517164531"
[1] "Europe 1982 : 531266901"
[1] "Europe 1987 : 543094160"
[1] "Europe 1992 : 558142797"
[1] "Europe 1997 : 568944148"
[1] "Europe 2002 : 578223869"
[1] "Europe 2007 : 586098529"
[1] "Africa 1952 : 237640501"
[1] "Africa 1957 : 264837738"
[1] "Africa 1962 : 296516865"
[1] "Africa 1967 : 335289489"
[1] "Africa 1972 : 379879541"
[1] "Africa 1977 : 433061021"
[1] "Africa 1982 : 499348587"
[1] "Africa 1987 : 574834110"
[1] "Africa 1992 : 659081517"
[1] "Africa 1997 : 743832984"
[1] "Africa 2002 : 833723916"
[1] "Africa 2007 : 929539692"
[1] "Americas 1952 : 345152446"
[1] "Americas 1957 : 386953916"
[1] "Americas 1962 : 433270254"
[1] "Americas 1967 : 480746623"
[1] "Americas 1972 : 529384210"
[1] "Americas 1977 : 578067699"
[1] "Americas 1982 : 630290920"
[1] "Americas 1987 : 682753971"
[1] "Americas 1992 : 739274104"
[1] "Americas 1997 : 796900410"
[1] "Americas 2002 : 849772762"
[1] "Americas 2007 : 898871184"
[1] "Oceania 1952 : 10686006"
[1] "Oceania 1957 : 11941976"
[1] "Oceania 1962 : 13283518"
[1] "Oceania 1967 : 14600414"
[1] "Oceania 1972 : 16106100"
[1] "Oceania 1977 : 17239000"
[1] "Oceania 1982 : 18394850"
[1] "Oceania 1987 : 19574415"
[1] "Oceania 1992 : 20919651"
[1] "Oceania 1997 : 22241430"
[1] "Oceania 2002 : 23454829"
[1] "Oceania 2007 : 24549947"
For or Apply? The second circle of hell.
We made our way into the second Circle, here live the gluttons. – The R inferno
One of the biggest things that trips up novices and experienced R users alike, is building a results object (vector, list, matrix, data frame) as your for loop progresses. For example:
results <- data.frame(continent=character(), year=numeric(), popsum=numeric())
for (cc in gap[,unique(continent)]) {
for (yy in gap[,unique(year)]) {
popsum <- gap[year == yy & continent == cc, sum(pop)]
this_result <- data.frame(continent=cc, year=yy, popsum=popsum)
results <- rbind(results, this_result)
}
}
results
continent year popsum
1 Asia 1952 1395357352
2 Asia 1957 1562780599
3 Asia 1962 1696357182
4 Asia 1967 1905662900
5 Asia 1972 2150972248
6 Asia 1977 2384513556
7 Asia 1982 2610135582
8 Asia 1987 2871220762
9 Asia 1992 3133292191
10 Asia 1997 3383285500
11 Asia 2002 3601802203
12 Asia 2007 3811953827
13 Europe 1952 418120846
14 Europe 1957 437890351
15 Europe 1962 460355155
16 Europe 1967 481178958
17 Europe 1972 500635059
18 Europe 1977 517164531
19 Europe 1982 531266901
20 Europe 1987 543094160
21 Europe 1992 558142797
22 Europe 1997 568944148
23 Europe 2002 578223869
24 Europe 2007 586098529
25 Africa 1952 237640501
26 Africa 1957 264837738
27 Africa 1962 296516865
28 Africa 1967 335289489
29 Africa 1972 379879541
30 Africa 1977 433061021
31 Africa 1982 499348587
32 Africa 1987 574834110
33 Africa 1992 659081517
34 Africa 1997 743832984
35 Africa 2002 833723916
36 Africa 2007 929539692
37 Americas 1952 345152446
38 Americas 1957 386953916
39 Americas 1962 433270254
40 Americas 1967 480746623
41 Americas 1972 529384210
42 Americas 1977 578067699
43 Americas 1982 630290920
44 Americas 1987 682753971
45 Americas 1992 739274104
46 Americas 1997 796900410
47 Americas 2002 849772762
48 Americas 2007 898871184
49 Oceania 1952 10686006
50 Oceania 1957 11941976
51 Oceania 1962 13283518
52 Oceania 1967 14600414
53 Oceania 1972 16106100
54 Oceania 1977 17239000
55 Oceania 1982 18394850
56 Oceania 1987 19574415
57 Oceania 1992 20919651
58 Oceania 1997 22241430
59 Oceania 2002 23454829
60 Oceania 2007 24549947
“Growing” a results object like this is bad practice. At each iteration, R needs to talk to the computer’s operating system to ask for the right amount of memory for your new results object. Like all diplomatic negotiations, this can take a while (at least in computer time!). As a result, you might find that your for loops seem to take forever when you start working with bigger datasets or more complex calculations.
It’s much better to tell R how big your results object will be up front, that way R only needs to ask the computer for the right amount of memory once:
# First lets calculate the number of rows we need:
nresults <- gap[,length(unique(continent))] * gap[,length(unique(year))]
results <- data.frame(
continent=character(length=nresults),
year=numeric(length=nresults),
popsum=numeric(length=nresults)
)
# Instead of iterating over values, we need to keep track of indices so we know
# which row to insert or new results into at each iteration.
# `seq_along` will create a sequence of numbers based on the length of the
# vector. So instead of c("Asia", "Americas", "Europe", "Africa", "Oceania"),
# ii will store c(1,2,3,4,5)
continents <- gap[,unique(continent)]
years <- gap[,unique(year)]
# We also need to keep track of which row to insert into. We could do fancy
# math based on our indices, but this is hard to get right and can lead to hard
# to detect errors. Its much easier to just keep track of this manually.
this_row <- 1
for (ii in seq_along(continents)) {
for (jj in seq_along(years)) {
# Now we need to look-up the appopriate values based on our indices
cc <- continents[ii]
yy <- years[jj]
popsum <- gap[year == yy & continent == cc, sum(pop)]
results[this_row,] <- list(cc, yy, popsum)
# Increment the row counter
this_row <- this_row + 1
}
}
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Asia"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Asia"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Asia"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Asia"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Asia"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Asia"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Asia"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Asia"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Asia"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Asia"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Asia"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Asia"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Europe"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Europe"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Europe"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Europe"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Europe"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Europe"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Europe"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Europe"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Europe"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Europe"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Europe"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Europe"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Africa"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Africa"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Africa"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Africa"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Africa"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Africa"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Africa"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Africa"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Africa"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Africa"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Africa"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Africa"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Americas"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Americas"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Americas"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Americas"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Americas"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Americas"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Americas"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Americas"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Americas"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Americas"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Americas"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Americas"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Oceania"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Oceania"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Oceania"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Oceania"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Oceania"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Oceania"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Oceania"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Oceania"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Oceania"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Oceania"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Oceania"): invalid factor
level, NA generated
Warning in `[<-.factor`(`*tmp*`, iseq, value = "Oceania"): invalid factor
level, NA generated
results
continent year popsum
1 <NA> 1952 1395357352
2 <NA> 1957 1562780599
3 <NA> 1962 1696357182
4 <NA> 1967 1905662900
5 <NA> 1972 2150972248
6 <NA> 1977 2384513556
7 <NA> 1982 2610135582
8 <NA> 1987 2871220762
9 <NA> 1992 3133292191
10 <NA> 1997 3383285500
11 <NA> 2002 3601802203
12 <NA> 2007 3811953827
13 <NA> 1952 418120846
14 <NA> 1957 437890351
15 <NA> 1962 460355155
16 <NA> 1967 481178958
17 <NA> 1972 500635059
18 <NA> 1977 517164531
19 <NA> 1982 531266901
20 <NA> 1987 543094160
21 <NA> 1992 558142797
22 <NA> 1997 568944148
23 <NA> 2002 578223869
24 <NA> 2007 586098529
25 <NA> 1952 237640501
26 <NA> 1957 264837738
27 <NA> 1962 296516865
28 <NA> 1967 335289489
29 <NA> 1972 379879541
30 <NA> 1977 433061021
31 <NA> 1982 499348587
32 <NA> 1987 574834110
33 <NA> 1992 659081517
34 <NA> 1997 743832984
35 <NA> 2002 833723916
36 <NA> 2007 929539692
37 <NA> 1952 345152446
38 <NA> 1957 386953916
39 <NA> 1962 433270254
40 <NA> 1967 480746623
41 <NA> 1972 529384210
42 <NA> 1977 578067699
43 <NA> 1982 630290920
44 <NA> 1987 682753971
45 <NA> 1992 739274104
46 <NA> 1997 796900410
47 <NA> 2002 849772762
48 <NA> 2007 898871184
49 <NA> 1952 10686006
50 <NA> 1957 11941976
51 <NA> 1962 13283518
52 <NA> 1967 14600414
53 <NA> 1972 16106100
54 <NA> 1977 17239000
55 <NA> 1982 18394850
56 <NA> 1987 19574415
57 <NA> 1992 20919651
58 <NA> 1997 22241430
59 <NA> 2002 23454829
60 <NA> 2007 24549947
As you can see, this involves a lot more work. Most R users will even go so far to tell you that for loops are bad, and that you should use something called apply
instead! We’ll cover this in the next lesson, and later we’ll show you another method, foreach
which also handles object creation for you.
For loops are most useful when you’re performing a series of calculations where each iteration depends on the results of the last (for example a random walk).
Challenge 1
Write a script that loops through the gapminder
data by continent and prints out the mean life expectancy in 1952.
Challenge 3
Modify the script so that it loops through the years as well as the continents.
Challenge 4
Write a for loop that performs a random walk for 100 steps, then plot the result.
Hint: You can use sign(rnorm(1))
in the body of the loop to randomly choose a direction (forward or backward) at each iteration.
Hint: You will want to store the resulting position (starting at 0) after each iteration for plotting purposes.
Hint: give the plot
function the indices 0:100 as the x axis, and the stored positions as the y axis. specify the ‘type’ argument as “l” to draw a the path.