R – Sorting a data frame by the contents of a column
Let’s examine how to sort the contents of a data frame by the value of a column:
> numPeople = 10
> sex = sample(c("male","female"), numPeople, replace=T)
> age = sample(14:102, numPeople, replace=T)
> income = sample(20:150, numPeople, replace=T)
> minor = age<18
This last statement might look surprising if you’re used to Java or another traditional programming language. Rather than becoming a single boolean/truth value, minor actually becomes a vector of truth values, one per element of age. It’s equivalent to this much more verbose Java code:
int[] age = ...;
boolean[] minor = new boolean[age.length];
for (int i = 0; i < age.length; i++) {
    minor[i] = age[i] < 18;
}
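For comparison, here is a minimal R sketch of the same element-wise comparison on a small, fixed vector. The ages are made up by hand so the result is predictable (they are not the sampled values used in this post), and the names example_age and example_minor are hypothetical.

```r
# Hypothetical ages, chosen by hand for illustration
example_age <- c(15, 42, 17, 68, 18)

# The comparison is applied to every element; no explicit loop is needed
example_minor <- example_age < 18
print(example_minor)

# A logical vector can be counted or used to pick out matching elements
print(sum(example_minor))          # number of TRUE values
print(example_age[example_minor])  # the ages that are under 18
```

Note that 18 itself is not under 18, so the last element comes out FALSE.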
Just as expected, the value of minor is a vector:
> mode(minor)
[1] "logical"
> minor
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

Next we create a data frame, which groups together our various vectors into the columns of a data structure:
> population = data.frame(sex=sex, age=age, income=income, minor=minor)
> population
      sex age income minor
1    male  68    150 FALSE
2    male  48     21 FALSE
3  female  68     58 FALSE
4  female  27    124 FALSE
5  female  84    103 FALSE
6    male  92    112 FALSE
7    male  35     65 FALSE
8  female  15    117  TRUE
9    male  89     95 FALSE
10   male  26     54 FALSE

The arguments (sex=sex, age=age, income=income, minor=minor) assign the same names to the columns as I originally named the vectors; I could just as easily call them anything. For instance,
> data.frame(a=sex, b=age, c=income, minor=minor)
        a  b   c minor
1    male 68 150 FALSE
2    male 48  21 FALSE
3  female 68  58 FALSE
4  female 27 124 FALSE
5  female 84 103 FALSE
6    male 92 112 FALSE
7    male 35  65 FALSE
8  female 15 117  TRUE
9    male 89  95 FALSE
10   male 26  54 FALSE

But I prefer the more descriptive labels I gave previously.
> population
      sex age income minor
1    male  68    150 FALSE
2    male  48     21 FALSE
3  female  68     58 FALSE
4  female  27    124 FALSE
5  female  84    103 FALSE
6    male  92    112 FALSE
7    male  35     65 FALSE
8  female  15    117  TRUE
9    male  89     95 FALSE
10   male  26     54 FALSE

Now let’s say we want to order the rows by the age of the people. That is a one-liner:
> population[order(population$age),]
      sex age income minor
8  female  15    117  TRUE
10   male  26     54 FALSE
4  female  27    124 FALSE
7    male  35     65 FALSE
2    male  48     21 FALSE
1    male  68    150 FALSE
3  female  68     58 FALSE
5  female  84    103 FALSE
9    male  89     95 FALSE
6    male  92    112 FALSE

This is not magic; you can select arbitrary rows from any data frame with the same syntax:
> population[c(1,2,3),]
     sex age income minor
1   male  68    150 FALSE
2   male  48     21 FALSE
3 female  68     58 FALSE

The order function merely returns the indices of the rows in sorted order.
> order(population$age)
 [1]  8 10  4  7  2  1  3  5  9  6

Note the $ syntax; you select columns of a data frame by using a dollar sign and the name of the column. You can retrieve the names of the columns of a data frame with the names function.
> names(population)
[1] "sex"    "age"    "income" "minor"
> population$income
 [1] 150  21  58 124 103 112  65 117  95  54
> income
 [1] 150  21  58 124 103 112  65 117  95  54

As you can see, they are exactly the same.
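The dollar sign is not the only way to pull a column out of a data frame. Here is a short sketch of the equivalent forms; the data frame is retyped from the first three rows of the printed output above, and the name people is a stand-in used only for this example.

```r
# First three rows of the population table, retyped by hand
people <- data.frame(
  sex    = c("male", "male", "female"),
  age    = c(68, 48, 68),
  income = c(150, 21, 58)
)

by_dollar  <- people$income        # dollar-sign syntax
by_bracket <- people[["income"]]   # double brackets with the column name
by_matrix  <- people[, "income"]   # matrix-style indexing, all rows

# All three forms return the same plain vector
print(identical(by_dollar, by_bracket))
print(identical(by_dollar, by_matrix))
```

The double-bracket form is handy when the column name is stored in a variable, since $ expects the name to be written out literally.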
So what we’re really doing with the command
population[order(population$age),]
is
population[c(8,10,4,7,2,1,3,5,9,6),]

Note the trailing comma; it means "take all the columns." If we only wanted certain columns, we could specify them after this comma.
> population[order(population$age),c(1,2)]
      sex age
8  female  15
10   male  26
4  female  27
7    male  35
2    male  48
1    male  68
3  female  68
5  female  84
9    male  89
6    male  92
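A few variations on the same idea, as a sketch rather than the only way to do it. The data frame below is retyped from the printed population table so the example runs on its own; the variable names oldest_first and age_then_income are made up for illustration. It shows that indexing by order() matches sort(), how to sort oldest first, and how a second argument to order() breaks ties (rows 1 and 3 both have age 68).

```r
# Data retyped from the printed population table above
population <- data.frame(
  sex    = c("male", "male", "female", "female", "female",
             "male", "male", "female", "male", "male"),
  age    = c(68, 48, 68, 27, 84, 92, 35, 15, 89, 26),
  income = c(150, 21, 58, 124, 103, 112, 65, 117, 95, 54)
)
population$minor <- population$age < 18

# Indexing a vector by its own order() gives the same result as sort()
print(identical(population$age[order(population$age)], sort(population$age)))

# Oldest first: ask order() for decreasing order
oldest_first <- population[order(population$age, decreasing = TRUE), ]
print(head(oldest_first, 3))

# Break ties in age by income: later arguments to order() act as secondary keys
age_then_income <- order(population$age, population$income)
print(age_then_income)

# Columns can also be selected by name instead of by position
print(population[age_then_income, c("sex", "age")])
```

With income as the tie-breaker, row 3 (income 58) now comes before row 1 (income 150), whereas sorting by age alone left them in their original order.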
Running totals in R
Let’s say we wanted to simulate flipping a coin 50 times using the statistical language R, where 1 is heads and 0 is tails:
> flips=sample(0:1, 50, replace=T)
> flips
 [1] 0 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 0 1
[39] 1 1 1 0 1 0 0 1 1 0 1 1
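Because sample draws random values, the flips will differ on every run. Here is a minimal sketch, assuming a repeatable sequence is wanted: calling set.seed first fixes the state of the random number generator. The seed value 42 is arbitrary, and first_run and second_run are names invented for this example.

```r
# Same seed, same draws
set.seed(42)
first_run <- sample(0:1, 50, replace = TRUE)

set.seed(42)
second_run <- sample(0:1, 50, replace = TRUE)

print(identical(first_run, second_run))

# Count how many tails (0) and heads (1) came up
print(table(first_run))
```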
Now we can plot the values to see which were heads and which were tails:
> plot(flips, main="Coin flips",ylab="0 = tails, 1 = heads")

[Figure: Raw values of heads and tails]

What if we want to see a running total of the number of heads over time? I was faced with just this problem in a completely different domain; I’ve written the function myself multiple times in Java and other languages, but I was hoping it would be built into a stats language like R. Fortunately I was right; the command you want is cumsum (cumulative sum). There are a total of four functions like this:
Cumulative Sums, Products, and Extremes
cumsum(x)
cumprod(x)
cummax(x)
cummin(x)
They work just as you’d expect.
> cumsum(flips)
 [1]  0  1  1  2  2  3  4  5  6  7  8  9 10 10 10 11 12 12 13 14 14 15 15 16 17
[26] 18 19 19 19 20 20 20 20 20 20 21 21 22 23 24 25 25 26 26 26 27 28 28 29 30
> plot(cumsum(flips), main="Number of heads flipped over time",ylab="Number of heads")

[Figure: Running total of number of heads]
This is a trivial example, but it certainly simplifies my life.
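To make the four cumulative functions concrete, here is a small sketch on a hand-picked vector, followed by a running share of heads. The vector x is arbitrary, first_flips is retyped from the first ten values of the flips output shown earlier, and running_heads and running_share are names made up for this example.

```r
x <- c(3, 1, 4, 1, 5)
print(cumsum(x))   # running sum:     3 4 8 9 14
print(cumprod(x))  # running product: 3 3 12 12 60
print(cummax(x))   # running maximum: 3 3 4 4 5
print(cummin(x))   # running minimum: 3 1 1 1 1

# First ten flips from the output shown earlier
first_flips <- c(0, 1, 0, 1, 0, 1, 1, 1, 1, 1)
running_heads <- cumsum(first_flips)

# Divide by the number of flips so far to get the share of heads over time
running_share <- running_heads / seq_along(first_flips)
print(running_share)
```

Dividing by seq_along turns the running total into a running proportion, which is often easier to read on a plot than the raw count.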
How Facebook and Google use R
Interesting read. rpart comes in handy again.