## R – Sorting a data frame by the contents of a column

Let’s examine how to sort the contents of a data frame by the values in one of its columns. First we build some sample vectors:

```
> numPeople = 10
> sex = sample(c("male","female"), numPeople, replace=T)
> age = sample(14:102, numPeople, replace=T)
> income = sample(20:150, numPeople, replace=T)
> minor = age < 18
```

This last statement might look surprising if you’re used to Java or another traditional programming language. Rather than becoming a single boolean (truth) value, minor becomes a vector of truth values, one per element of the age vector. It’s equivalent to this much more verbose Java code:

```
int[] age = ...;
boolean[] minor = new boolean[age.length];
for (int i = 0; i < age.length; i++) {
    minor[i] = age[i] < 18;
}
```
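For comparison, here is a minimal R sketch of the same elementwise comparison on a small fixed vector. The ages and variable names are made up for illustration; they are not the random sample from above.

```r
# Hypothetical ages, fixed so the result is reproducible.
ages = c(68, 15, 27, 12)

# The comparison is applied to every element, giving a logical vector.
is_minor = ages < 18                 # FALSE TRUE FALSE TRUE

# Logical vectors can be counted and used to find positions.
num_minors = sum(is_minor)           # TRUE counts as 1, so this is 2
minor_positions = which(is_minor)    # 2 4
```

No loop and no preallocated result array are needed; the comparison operator works on the whole vector at once.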

Just as expected, the value of minor is a vector:

```
> mode(minor)
[1] "logical"
> minor
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
```

Next we create a data frame, which groups our vectors together as the columns of a single data structure:

```
> population = data.frame(sex=sex, age=age, income=income, minor=minor)
> population
      sex age income minor
1    male  68    150 FALSE
2    male  48     21 FALSE
3  female  68     58 FALSE
4  female  27    124 FALSE
5  female  84    103 FALSE
6    male  92    112 FALSE
7    male  35     65 FALSE
8  female  15    117  TRUE
9    male  89     95 FALSE
10   male  26     54 FALSE

```

The arguments (sex=sex, age=age, income=income, minor=minor) give the columns the same names as the original vectors, but I could just as easily have called them anything. For instance:

```
> data.frame(a=sex, b=age, c=income, minor=minor)
        a  b   c minor
1    male 68 150 FALSE
2    male 48  21 FALSE
3  female 68  58 FALSE
4  female 27 124 FALSE
5  female 84 103 FALSE
6    male 92 112 FALSE
7    male 35  65 FALSE
8  female 15 117  TRUE
9    male 89  95 FALSE
10   male 26  54 FALSE

```

But I prefer the more descriptive labels I gave previously:

```
> population
      sex age income minor
1    male  68    150 FALSE
2    male  48     21 FALSE
3  female  68     58 FALSE
4  female  27    124 FALSE
5  female  84    103 FALSE
6    male  92    112 FALSE
7    male  35     65 FALSE
8  female  15    117  TRUE
9    male  89     95 FALSE
10   male  26     54 FALSE

```

Now let’s say we want to order the rows by the age of the people. That is a one-liner:

```
> population[order(population$age),]
      sex age income minor
8  female  15    117  TRUE
10   male  26     54 FALSE
4  female  27    124 FALSE
7    male  35     65 FALSE
2    male  48     21 FALSE
1    male  68    150 FALSE
3  female  68     58 FALSE
5  female  84    103 FALSE
9    male  89     95 FALSE
6    male  92    112 FALSE

```

This is not magic; you can select arbitrary rows from any data frame with the same syntax:

```
> population[c(1,2,3),]
     sex age income minor
1   male  68    150 FALSE
2   male  48     21 FALSE
3 female  68     58 FALSE

```

The order function merely returns the indices of the rows in sorted order:

```
> order(population$age)
[1]  8 10  4  7  2  1  3  5  9  6

```

Note the $ syntax: you select a column of a data frame with a dollar sign followed by the name of the column. You can retrieve the column names of a data frame with the names function:

```
> names(population)
[1] "sex"    "age"    "income" "minor"

> population$income
[1] 150  21  58 124 103 112  65 117  95  54
> income
[1] 150  21  58 124 103 112  65 117  95  54

```

As you can see, they are exactly the same. So what we’re really doing with the command

```
population[order(population$age),]
```

is

```
population[c(8,10,4,7,2,1,3,5,9,6),]

```

Note the trailing comma: leaving the column index empty means to take all the columns. If we only wanted certain columns, we could specify them after the comma:

```
> population[order(population$age),c(1,2)]
      sex age
8  female  15
10   male  26
4  female  27
7    male  35
2    male  48
1    male  68
3  female  68
5  female  84
9    male  89
6    male  92


```
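The same pattern extends to a few common variations. The sketch below uses a small, fixed data frame (the people data frame and its values are hypothetical, chosen so the results are reproducible) to show sorting in descending order, breaking ties with a second column, and selecting columns by name rather than position.

```r
# Hypothetical, fixed data (not the random sample used in the post).
people = data.frame(
  name   = c("Ann", "Bob", "Cat", "Dan"),
  age    = c(68, 15, 68, 27),
  income = c(58, 117, 150, 124),
  stringsAsFactors = FALSE
)

# Oldest first: decreasing = TRUE reverses the sort direction.
oldest_first = people[order(people$age, decreasing = TRUE), ]

# Ties on age are broken by the second key, income (ascending),
# so the rows come out in the order Bob, Dan, Ann, Cat.
by_age_income = people[order(people$age, people$income), ]

# Columns can be picked by name instead of by position.
ages_only = people[order(people$age), c("name", "age")]

print(by_age_income)
```

Selecting columns by name tends to be more robust than c(1,2), since it keeps working if the column order of the data frame changes.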
Categories: programming, R

## Running totals in R

Let’s say we want to simulate flipping a coin 50 times in the statistical language R, where 1 is heads and 0 is tails.

```
> flips = sample(0:1, 50, replace=T)
> flips
 [1] 0 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 0 1
[39] 1 1 1 0 1 0 0 1 1 0 1 1
```

Now we can plot the values to see which were heads and which were tails:

```
> plot(flips, main="Coin flips", ylab="0 = tails, 1 = heads")
```

Figure: Raw values of heads and tails

What if we want to see a running total of the number of heads over time? I faced just this problem in a completely different domain. I’ve written the function myself multiple times in Java and other languages, but I was hoping it would be built into a stats language like R. Fortunately I was right: the command you want is cumsum (cumulative sum). There are four functions like this in total:

Cumulative Sums, Products, and Extremes (the title of the R help page for these functions)

```
cumsum(x)
cumprod(x)
cummax(x)
cummin(x)
```

They work just as you’d expect.
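As a quick illustration, here are all four functions on a small made-up vector. The vector x and the variable names are assumptions for this example, not data from the coin flips.

```r
# Hypothetical input vector, chosen only for illustration.
x = c(3, 1, 4, 1, 5)

running_sum     = cumsum(x)    # 3 4 8 9 14
running_product = cumprod(x)   # 3 3 12 12 60
running_max     = cummax(x)    # 3 3 4 4 5
running_min     = cummin(x)    # 3 1 1 1 1
```

Each result has the same length as the input, with element i summarizing the first i values. The same call on the simulated flips gives the running number of heads: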

```
> cumsum(flips)
 [1]  0  1  1  2  2  3  4  5  6  7  8  9 10 10 10 11 12 12 13 14 14 15 15 16 17
[26] 18 19 19 19 20 20 20 20 20 20 21 21 22 23 24 25 25 26 26 26 27 28 28 29 30
```

Figure: Running total of number of heads

This is a trivial example, but it certainly simplifies my life.
