

R – Sorting a data frame by the contents of a column

February 12, 2010

Let’s examine how to sort the contents of a data frame by the values in one of its columns. First we create a few vectors of sample data:

> numPeople = 10
> sex=sample(c("male","female"),numPeople,replace=T)
> age = sample(14:102, numPeople, replace=T)
> income = sample(20:150, numPeople, replace=T)
> minor = age<18

This last statement might look surprising if you’re used to Java or another traditional programming language. Rather than becoming a single boolean (truth) value, minor becomes a vector of truth values, one per element of the age vector. It’s equivalent to this much more verbose Java code:

int[] age = ...;
boolean[] minor = new boolean[age.length];
for (int i = 0; i < age.length; i++) {
   minor[i] = age[i] < 18;
}

Just as expected, the value of minor is a vector:

> mode(minor)
[1] "logical"
> minor
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
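
Because the comparison is element-wise, the resulting logical vector can be counted and used as an index directly. Here is a minimal sketch; the ages are copied by hand from the data frame printed in this post, since the original values came from a random sample:

```r
# Ages copied from the data frame printed in this post
age <- c(68, 48, 68, 27, 84, 92, 35, 15, 89, 26)

minor <- age < 18      # element-wise comparison, one result per age
n_minors <- sum(minor) # TRUE counts as 1, so this counts the minors
where <- which(minor)  # positions of the TRUE values
minor_ages <- age[minor]  # a logical vector can be used as an index

print(n_minors)
print(where)
print(minor_ages)
```

No loop is needed for any of these steps, which is the point of the comparison with the Java version.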

Next we create a data frame, which groups together our various vectors into the columns of a data structure:

> population = data.frame(sex=sex, age=age, income=income, minor=minor)
> population
      sex age income minor
1    male  68    150 FALSE
2    male  48     21 FALSE
3  female  68     58 FALSE
4  female  27    124 FALSE
5  female  84    103 FALSE
6    male  92    112 FALSE
7    male  35     65 FALSE
8  female  15    117  TRUE
9    male  89     95 FALSE
10   male  26     54 FALSE

The arguments (sex=sex, age=age, income=income, minor=minor) give the columns the same names as the original vectors; I could just as easily have called them anything else. For instance,

> data.frame(a=sex, b=age, c=income, minor=minor)
        a  b   c minor
1    male 68 150 FALSE
2    male 48  21 FALSE
3  female 68  58 FALSE
4  female 27 124 FALSE
5  female 84 103 FALSE
6    male 92 112 FALSE
7    male 35  65 FALSE
8  female 15 117  TRUE
9    male 89  95 FALSE
10   male 26  54 FALSE

But I prefer the more descriptive labels I gave previously.
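
If you do end up with terse column names, they can be changed after the fact by assigning to names(). This is a small sketch on a stand-in data frame (the first three rows from above); the name income_k is just a made-up example:

```r
# Stand-in data frame using the first three rows shown above
people <- data.frame(a = c("male", "male", "female"),
                     b = c(68, 48, 68),
                     c = c(150, 21, 58))

names(people) <- c("sex", "age", "income")  # replace all column names at once
names(people)[3] <- "income_k"              # or rename one column by position
print(names(people))
```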

> population
      sex age income minor
1    male  68    150 FALSE
2    male  48     21 FALSE
3  female  68     58 FALSE
4  female  27    124 FALSE
5  female  84    103 FALSE
6    male  92    112 FALSE
7    male  35     65 FALSE
8  female  15    117  TRUE
9    male  89     95 FALSE
10   male  26     54 FALSE

Now let’s say we want to order the rows by the age of the people. That is a one-liner:

> population[order(population$age),]
      sex age income minor
8  female  15    117  TRUE
10   male  26     54 FALSE
4  female  27    124 FALSE
7    male  35     65 FALSE
2    male  48     21 FALSE
1    male  68    150 FALSE
3  female  68     58 FALSE
5  female  84    103 FALSE
9    male  89     95 FALSE
6    male  92    112 FALSE
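
The same pattern extends to descending order and to sorting on more than one column. The sketch below rebuilds the data frame by hand from the printed values (without the minor column) so that it runs on its own:

```r
# Data copied from the data frame printed above
population <- data.frame(
  sex    = c("male", "male", "female", "female", "female",
             "male", "male", "female", "male", "male"),
  age    = c(68, 48, 68, 27, 84, 92, 35, 15, 89, 26),
  income = c(150, 21, 58, 124, 103, 112, 65, 117, 95, 54)
)

# Oldest first: ask order() for decreasing indices
oldest_first <- population[order(population$age, decreasing = TRUE), ]

# Sort by sex, then by age descending within each sex;
# negating a numeric column reverses its sort direction
by_sex_then_age <- population[order(population$sex, -population$age), ]

print(oldest_first)
print(by_sex_then_age)
```

Additional arguments to order() act as tie-breakers, so later columns only matter where the earlier ones are equal.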

This is not magic; you can select arbitrary rows from any data frame with the same syntax:

> population[c(1,2,3),]
     sex age income minor
1   male  68    150 FALSE
2   male  48     21 FALSE
3 female  68     58 FALSE

The order function merely returns the indices of the rows in sorted order.

> order(population$age)
 [1]  8 10  4  7  2  1  3  5  9  6
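
To make the relationship concrete, indexing a vector by its own order() gives the same result as sort(). A short sketch using the ages from the data frame above:

```r
# Ages copied from the data frame printed above
age <- c(68, 48, 68, 27, 84, 92, 35, 15, 89, 26)

idx <- order(age)   # positions of the elements, smallest value first
print(idx)
print(age[idx])     # the ages themselves in increasing order

# sort() returns the values, order() returns where they came from
same <- identical(age[idx], sort(age))
print(same)
```

The advantage of order() is that the indices can be applied to something else, such as the rows of a whole data frame.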

Note the $ syntax; you select columns of a data frame by using a dollar sign and the name of the column. You can retrieve the names of the columns of a data frame with the names function.

> names(population)
[1] "sex"    "age"    "income" "minor" 

> population$income
 [1] 150  21  58 124 103 112  65 117  95  54
> income
 [1] 150  21  58 124 103 112  65 117  95  54

As you can see, they are exactly the same.

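So what we’re really doing with the command

> population[order(population$age),]

is passing the row indices returned by order as the row selector, so the rows of the data frame come back in sorted order.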

Note the trailing comma; it means “take all the columns”. If we only wanted certain columns, we could specify them after the comma.

> population[order(population$age),c(1,2)]
      sex age
8  female  15
10   male  26
4  female  27
7    male  35
2    male  48
1    male  68
3  female  68
5  female  84
9    male  89
6    male  92
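
Columns can also be picked by name rather than by position, which tends to survive later changes to the column order. A sketch, again rebuilding the data frame by hand from the printed values:

```r
# Data copied from the data frame printed above
population <- data.frame(
  sex    = c("male", "male", "female", "female", "female",
             "male", "male", "female", "male", "male"),
  age    = c(68, 48, 68, 27, 84, 92, 35, 15, 89, 26),
  income = c(150, 21, 58, 124, 103, 112, 65, 117, 95, 54)
)

# Same selection as c(1,2), but by column name
by_age <- population[order(population$age), c("sex", "age")]
print(by_age)

# Selecting a single column returns a plain vector by default;
# drop = FALSE keeps the result as a one-column data frame
ages_only <- population[order(population$age), "age", drop = FALSE]
print(ages_only)
```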

Categories: programming, R

Running totals in R

February 11, 2010

Let’s say we wanted to simulate flipping a coin 50 times using the statistical language R, where 1 is heads and 0 is tails.

> flips=sample(0:1, 50, replace=T)
> flips
 [1] 0 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 0 1
[39] 1 1 1 0 1 0 0 1 1 0 1 1
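
Since sample is random, running the command again will give a different sequence from the one shown here. If you want the same flips every time, one option is to fix the seed first. This is a sketch; the seed value 42 is arbitrary:

```r
set.seed(42)                                # 42 is an arbitrary choice
flips_a <- sample(0:1, 50, replace = TRUE)

set.seed(42)                                # reset to the same seed
flips_b <- sample(0:1, 50, replace = TRUE)

# Same seed, same sequence of flips
print(identical(flips_a, flips_b))
```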

Now we can plot the values to see which were heads and which were tails:

> plot(flips, main="Coin flips",ylab="0 = tails, 1 = heads")

Raw values of heads and tails

What if we want to see a running total of the number of heads over time? I was faced with just this problem in a completely different domain; I’ve written the function myself multiple times in Java and other languages, but I was hoping it would be built into a stats language like R. Fortunately I was right; the command you want is cumsum (cumulative sum). There are four functions like this, documented together in R’s help:

Cumulative Sums, Products, and Extremes

cumsum(x)
cumprod(x)
cummax(x)
cummin(x)

They work just as you’d expect.

> cumsum(flips)
 [1]  0  1  1  2  2  3  4  5  6  7  8  9 10 10 10 11 12 12 13 14 14 15 15 16 17
[26] 18 19 19 19 20 20 20 20 20 20 21 21 22 23 24 25 25 26 26 26 27 28 28 29 30
> plot(cumsum(flips), main="Number of heads flipped over time",ylab="Number of heads")

Running total of number of heads

This is a trivial example, but it certainly simplifies my life.
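
As a small extension, dividing the running total by the number of flips so far gives the running proportion of heads, and the other cumulative functions (cumprod, cummax, cummin) behave analogously. A sketch using the first ten flips from the output above; the vector x is just a made-up example:

```r
# First ten flips copied from the output above
flips <- c(0, 1, 0, 1, 0, 1, 1, 1, 1, 1)

heads_so_far <- cumsum(flips)                  # running count of heads
prop_heads <- heads_so_far / seq_along(flips)  # running proportion of heads
print(heads_so_far)
print(prop_heads)

# The other cumulative functions on a made-up vector
x <- c(3, 1, 4, 1, 5)
print(cumprod(x))   # running product
print(cummax(x))    # largest value seen so far
print(cummin(x))    # smallest value seen so far
```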

Categories: programming, R

How Facebook and Google use R

February 21, 2009

How Facebook and Google use R

Interesting read.  rpart comes in handy again.

Categories: link, programming, R