Using R
Situation: I’ve got a Cricket getting distance readings from three other Cricket beacons. The data file looks like:
01:9E:14:C4:0A:00:00:A1 219
01:7D:CA:60:0A:00:00:8A 150
01:9F:FF:C3:0A:00:00:BA 205
01:9E:14:C4:0A:00:00:A1 219
01:7D:CA:60:0A:00:00:8A 150
01:9F:FF:C3:0A:00:00:BA 206
01:9E:14:C4:0A:00:00:A1 219
01:7D:CA:60:0A:00:00:8A 150
01:9F:FF:C3:0A:00:00:BA 206
01:9E:14:C4:0A:00:00:A1 219
.
.
.
The first column is the beacon ID, the second is the computed distance to that beacon. Right now our beacons send out a signal about once a second. (In one case it took me four minutes to retrieve >200 readings for each of the three beacons.) The goals are to know when we’ve sampled enough data to make a reasonable guess at what the real value for distance is, and of course to actually make that guess.
Ideally, we’d like to have our distance accurate to 1cm, or maybe 2cm; of course, it’s not clear how useful math or statistics or anything else short of prayer will be if the Cricket sends back totally bogus results at any point. We’ve had spots where its reported distance to beacon was much too high; we attributed this to ultrasound reflection. Still, if we trust that the data coming in is “generally right,” we just need some math/statistics to “scrub” the data. We need to know which data points to keep, and which to throw out. (Alternatively, I might just be using this as a diversion to learn more about statistics and statistics software.)
I’m using R 2.2.0 from Fedora Extras, with Emacs Speaks Statistics. I’m also using my old statistics text book. I took one statistics course at college. I consider myself to have done well in that course, and (surprisingly) I feel like I grok the textbook. All that said… it was one course. The guy barely spoke English. A little bit of knowledge can be very dangerous. So if you think the stuff I’m doing here is pointless, flawed, or you simply have a better suggestion for doing this, please do let me know.
Anyway. Given the above file, I read the data in like this:
> position1 < - read.table("distances.1", header=FALSE)
> position1
V1 V2
1 01:9E:14:C4:0A:00:00:A1 219
2 01:7D:CA:60:0A:00:00:8A 150
3 01:9F:FF:C3:0A:00:00:BA 205
4 01:9E:14:C4:0A:00:00:A1 219
5 01:7D:CA:60:0A:00:00:8A 150
6 01:9F:FF:C3:0A:00:00:BA 206
7 01:9E:14:C4:0A:00:00:A1 219
8 01:7D:CA:60:0A:00:00:8A 150
9 01:9F:FF:C3:0A:00:00:BA 206
10 01:9E:14:C4:0A:00:00:A1 219
.
.
.
No headers in my input file. < - is an assignment operator; you can
reverse the arrow, so long as it’s pointing to what you’re
assigning to. You can rename V1 and V2 to something better (like
beaconid and distance) with edit(position1).
position1 is now a data.frame (try class(position1)). The
beaconid column is non-numeric, so data.table made it into a
“factor.”
I don’t know enough to explain what a factor is, but you can use
it for grouping. Check this out:
> attach(position1)
> tapply(distance, beaconid, summary)
$"01:7D:CA:60:0A:00:00:8A"
Min. 1st Qu. Median Mean 3rd Qu. Max.
149.0 150.0 150.0 150.1 150.0 151.0
$"01:9E:14:C4:0A:00:00:A1"
Min. 1st Qu. Median Mean 3rd Qu. Max.
51.0 219.0 219.0 217.6 219.0 220.0
$"01:9F:FF:C3:0A:00:00:BA"
Min. 1st Qu. Median Mean 3rd Qu. Max.
85.0 205.0 206.0 203.5 206.0 207.0
attach basically makes the columns of the given object accessible
without being qualified by the object’s name. That is, without
attach(position1) first, I’d have to use position1$beaconid
instead of just beaconid. tapply takes the vector as its first
argument (the columns of a data.frame are vectors–or, at least,
they are in this case) and divides it up based on the factor given as
its second argument, then calls the given function (summary here)
with each of the resulting vectors. I think of it as analogous to
GROUP BY in SQL. By the way, if you want to undo attach, just
detach().
It looks like the first beacon is probably 150cm away. I noticed that
the mean and the median had some disagreement for the second and third
beacons. I wasn’t crazy about it. But look at the minimums: way too
low for the beacon IDs ending in A1 and BA. So trim the top and
bottom 10% of the data set:
> tapply(distance, beaconid, mean, trim=0.1)
01:7D:CA:60:0A:00:00:8A 01:9E:14:C4:0A:00:00:A1 01:9F:FF:C3:0A:00:00:BA
150.0000 219.0000 205.7964
Much better. The trim=0.1 is an argument that’s passed to mean.
Variance and standard deviation:
> tapply(distance, beaconid, var)
01:7D:CA:60:0A:00:00:8A 01:9E:14:C4:0A:00:00:A1 01:9F:FF:C3:0A:00:00:BA
0.08623402 208.02434337 217.29764082
> tapply(distance, beaconid, sd)
01:7D:CA:60:0A:00:00:8A 01:9E:14:C4:0A:00:00:A1 01:9F:FF:C3:0A:00:00:BA
0.2936563 14.4230490 14.7410190
What would those look like if you trim the data set, as we did with
mean, above?
Diversion: subset is kind of interesting. I was interested in
pulling out just one beacon’s readings into a separate vector:
> a1 < - subset(position1, beaconid=="01:9E:14:C4:0A:00:00:A1", select=distance)
> a1
distance
1 219
4 219
7 219
10 219
13 219
16 220
19 219
21 85
24 219
27 219
.
.
.
> class(a1)
[1] "data.frame"
Not a vector, but instead a data.frame; that’s hardly a problem
right now, though. By the way, you can do things like
position1[[1]] and position[1]. They both output the beacon ID
column, but in different formats:
> class(position1[1])
[1] "data.frame"
> class(position1[[1]])
[1] "factor"
I still want to trim data. I don’t see anything in R that returns a “trimmed” vector. So I wrote my own (with help from http://www.math.mcgill.ca/~steele/math680/lecture2.680.pdf).
trim < - function(vect, percent=0.1) {
nelem <- length(vect)
howmany <- round(nelem * percent)
sort(vect)[(howmany + 1):(nelem - howmany)]
}
Note that the parenthesis around howmany + 1 and nelem - howmany
are apparently quite important. In use:
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> trim(1:10)
[1] 2 3 4 5 6 7 8 9
> trim(1:10, percent=0.2)
[1] 3 4 5 6 7 8
Once again, Emacs makes this easier: write the function in a scratch
buffer, M-C-x to eval the function in the current R process. Note
that 1:10 makes some sort of… I don’t know what it is, but I guess
it creates something that acts like a vector, if it’s not exactly a
vector.
Now lets see what the data for beacon A1 (that is, ID ending in A1)
looks like with 10% trimming from both ends:
> sd(trim(a1[[1]]))
[1] 0
So, uh. There you go.
Before I forget, I think you can do help.start() and it’ll pop up a
browser (if it can) with local HTML docs. This is helpful to have
open. You can also see how you’ve been polluting your name space:
> objects()
[1] "a1" "last.warning" "mydata" "origdata" "position1"
[6] "trim"
> rm(origdata, mydata)
> objects()
[1] "a1" "last.warning" "position1" "trim"
This might be important because it looks like, on exit, you can choose to save your current session to disk. When you start R next time from the same directory, you’re popped right back into your session. This is why they recommend that, for each project/set of data/whatever you want to work with in R, you work in a separate directory.
Lets look at beacon BA:
> levels(beaconid)
[1] "01:7D:CA:60:0A:00:00:8A" "01:9E:14:C4:0A:00:00:A1"
[3] "01:9F:FF:C3:0A:00:00:BA"
> ba < - subset(position1, beaconid=="01:9F:FF:C3:0A:00:00:BA", select=distance)
> summary(trim(ba[[1]]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
205.0 206.0 206.0 205.8 206.0 206.0
> sd(trim(ba[[1]]))
[1] 0.4012177
That’s really not so bad, either.
Of course, I just realized: it would be nice if I knew if that was anywhere near the correct distance. Will have to experiment with a measuring tape at some point.
What happens if I stopped after the first 20 readings?
> summary(trim(ba[[1]][1:20]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
205.0 206.0 206.0 205.8 206.0 206.0
> sd(ba[[1]][1:20])
[1] 0.5231484
Not bad. In fact, with fewer than 20 readings I still seem to get between 205.8 and 206 for the average. I can live with 0.2cm.
Will have to work more on this later, dinner time now.