darkness

Saturday, 12 November 2005

Using R

darkness @ 21:52:37

Situation: I’ve got a Cricket getting distance readings from three other Cricket beacons. The data file looks like:

01:9E:14:C4:0A:00:00:A1 219
01:7D:CA:60:0A:00:00:8A 150
01:9F:FF:C3:0A:00:00:BA 205
01:9E:14:C4:0A:00:00:A1 219
01:7D:CA:60:0A:00:00:8A 150
01:9F:FF:C3:0A:00:00:BA 206
01:9E:14:C4:0A:00:00:A1 219
01:7D:CA:60:0A:00:00:8A 150
01:9F:FF:C3:0A:00:00:BA 206
01:9E:14:C4:0A:00:00:A1 219
.
.
.

The first column is the beacon ID, the second is the computed distance to that beacon. Right now our beacons send out a signal about once a second. (In one case it took me four minutes to retrieve >200 readings for each of the three beacons.) The goals are to know when we’ve sampled enough data to make a reasonable guess at what the real value for distance is, and of course to actually make that guess.

Ideally, we’d like to have our distance accurate to 1cm, or maybe 2cm; of course, it’s not clear how useful math or statistics or anything else short of prayer will be if the Cricket sends back totally bogus results at any point. We’ve had spots where its reported distance to beacon was much too high; we attributed this to ultrasound reflection. Still, if we trust that the data coming in is “generally right,” we just need some math/statistics to “scrub” the data. We need to know which data points to keep, and which to throw out. (Alternatively, I might just be using this as a diversion to learn more about statistics and statistics software.)

I’m using R 2.2.0 from Fedora Extras, with Emacs Speaks Statistics. I’m also using my old statistics text book. I took one statistics course at college. I consider myself to have done well in that course, and (surprisingly) I feel like I grok the textbook. All that said… it was one course. The guy barely spoke English. A little bit of knowledge can be very dangerous. So if you think the stuff I’m doing here is pointless, flawed, or you simply have a better suggestion for doing this, please do let me know.

Anyway. Given the above file, I read the data in like this:

> position1 < - read.table("distances.1", header=FALSE)
> position1
V1  V2
1   01:9E:14:C4:0A:00:00:A1 219
2   01:7D:CA:60:0A:00:00:8A 150
3   01:9F:FF:C3:0A:00:00:BA 205
4   01:9E:14:C4:0A:00:00:A1 219
5   01:7D:CA:60:0A:00:00:8A 150
6   01:9F:FF:C3:0A:00:00:BA 206
7   01:9E:14:C4:0A:00:00:A1 219
8   01:7D:CA:60:0A:00:00:8A 150
9   01:9F:FF:C3:0A:00:00:BA 206
10  01:9E:14:C4:0A:00:00:A1 219
.
.
.

No headers in my input file. < - is an assignment operator; you can reverse the arrow, so long as it’s pointing to what you’re assigning to. You can rename V1 and V2 to something better (like beaconid and distance) with edit(position1).

position1 is now a data.frame (try class(position1)). The beaconid column is non-numeric, so data.table made it into a “factor.” I don’t know enough to explain what a factor is, but you can use it for grouping. Check this out:

> attach(position1)
> tapply(distance, beaconid, summary)
$"01:7D:CA:60:0A:00:00:8A"
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
149.0   150.0   150.0   150.1   150.0   151.0 
$"01:9E:14:C4:0A:00:00:A1"
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
51.0   219.0   219.0   217.6   219.0   220.0 
$"01:9F:FF:C3:0A:00:00:BA"
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
85.0   205.0   206.0   203.5   206.0   207.0

attach basically makes the columns of the given object accessible without being qualified by the object’s name. That is, without attach(position1) first, I’d have to use position1$beaconid instead of just beaconid. tapply takes the vector as its first argument (the columns of a data.frame are vectors–or, at least, they are in this case) and divides it up based on the factor given as its second argument, then calls the given function (summary here) with each of the resulting vectors. I think of it as analogous to GROUP BY in SQL. By the way, if you want to undo attach, just detach().

It looks like the first beacon is probably 150cm away. I noticed that the mean and the median had some disagreement for the second and third beacons. I wasn’t crazy about it. But look at the minimums: way too low for the beacon IDs ending in A1 and BA. So trim the top and bottom 10% of the data set:

> tapply(distance, beaconid, mean, trim=0.1)
01:7D:CA:60:0A:00:00:8A 01:9E:14:C4:0A:00:00:A1 01:9F:FF:C3:0A:00:00:BA 
150.0000                219.0000                205.7964

Much better. The trim=0.1 is an argument that’s passed to mean.

Variance and standard deviation:

> tapply(distance, beaconid, var)
01:7D:CA:60:0A:00:00:8A 01:9E:14:C4:0A:00:00:A1 01:9F:FF:C3:0A:00:00:BA 
0.08623402            208.02434337            217.29764082 
> tapply(distance, beaconid, sd)
01:7D:CA:60:0A:00:00:8A 01:9E:14:C4:0A:00:00:A1 01:9F:FF:C3:0A:00:00:BA 
0.2936563              14.4230490              14.7410190

What would those look like if you trim the data set, as we did with mean, above?

Diversion: subset is kind of interesting. I was interested in pulling out just one beacon’s readings into a separate vector:

> a1 < - subset(position1, beaconid=="01:9E:14:C4:0A:00:00:A1", select=distance)
> a1
distance
1        219
4        219
7        219
10       219
13       219
16       220
19       219
21        85
24       219
27       219
.
.
.
> class(a1)
[1] “data.frame”

Not a vector, but instead a data.frame; that’s hardly a problem right now, though. By the way, you can do things like position1[[1]] and position[1]. They both output the beacon ID column, but in different formats:

> class(position1[1])
[1] "data.frame"
> class(position1[[1]])
[1] "factor"

I still want to trim data. I don’t see anything in R that returns a “trimmed” vector. So I wrote my own (with help from http://www.math.mcgill.ca/~steele/math680/lecture2.680.pdf).

trim < - function(vect, percent=0.1) {
    nelem <- length(vect)
    howmany <- round(nelem * percent)
    sort(vect)[(howmany + 1):(nelem - howmany)]
}

Note that the parenthesis around howmany + 1 and nelem - howmany are apparently quite important. In use:

> 1:10
[1]  1  2  3  4  5  6  7  8  9 10
> trim(1:10)
[1] 2 3 4 5 6 7 8 9
> trim(1:10, percent=0.2)
[1] 3 4 5 6 7 8

Once again, Emacs makes this easier: write the function in a scratch buffer, M-C-x to eval the function in the current R process. Note that 1:10 makes some sort of… I don’t know what it is, but I guess it creates something that acts like a vector, if it’s not exactly a vector.

Now lets see what the data for beacon A1 (that is, ID ending in A1) looks like with 10% trimming from both ends:

> sd(trim(a1[[1]]))
[1] 0

So, uh. There you go.

Before I forget, I think you can do help.start() and it’ll pop up a browser (if it can) with local HTML docs. This is helpful to have open. You can also see how you’ve been polluting your name space:

> objects()
[1] "a1"           "last.warning" "mydata"       "origdata"     "position1"   
[6] "trim"        
> rm(origdata, mydata)
> objects()
[1] "a1"           "last.warning" "position1"    "trim"

This might be important because it looks like, on exit, you can choose to save your current session to disk. When you start R next time from the same directory, you’re popped right back into your session. This is why they recommend that, for each project/set of data/whatever you want to work with in R, you work in a separate directory.

Lets look at beacon BA:

> levels(beaconid)
[1] "01:7D:CA:60:0A:00:00:8A" "01:9E:14:C4:0A:00:00:A1"
[3] "01:9F:FF:C3:0A:00:00:BA"
> ba < - subset(position1, beaconid=="01:9F:FF:C3:0A:00:00:BA", select=distance)
> summary(trim(ba[[1]]))
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
205.0   206.0   206.0   205.8   206.0   206.0 
> sd(trim(ba[[1]]))
[1] 0.4012177

That’s really not so bad, either.

Of course, I just realized: it would be nice if I knew if that was anywhere near the correct distance. Will have to experiment with a measuring tape at some point.

What happens if I stopped after the first 20 readings?

> summary(trim(ba[[1]][1:20]))
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
205.0   206.0   206.0   205.8   206.0   206.0 
> sd(ba[[1]][1:20])
[1] 0.5231484

Not bad. In fact, with fewer than 20 readings I still seem to get between 205.8 and 206 for the average. I can live with 0.2cm.

Will have to work more on this later, dinner time now.

Powered by WordPress