MGoBlog has moved. The new site can be found at

Monday, December 19, 2005

Okay... I have the fluttering beginnings of something cool. I have hacked my way through the various jungles of variable data input from hell to compile a database of... er... most of this year's football plays. For everybody. Bad news if you're an Eastern Michigan fan (other than "you're an Eastern Michigan fan"): I'm still working on an especially head-meets-wall section of plays that should work but do not. Other failures are scattered throughout the data.

Despite this, there are still over 100,000 plays that have been dumped into an SQL database (by hand! Uphill both ways! Wearing communism on my back like an extremely red, not-yet-shot-into-space monkey!). I think better than 95% of the plays from scrimmage made it in safely. Now things like this...

...exist and appear fairly sane. That's a breakdown of the Michigan D's third-down performance YTD: blue is unsuccessful conversion attempts, red is successful ones.
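For the curious, a breakdown like that boils down to one grouped query. Here's a minimal sketch in Python against an in-memory SQLite table. The table name `plays`, its columns, and the sample rows are all invented for illustration; they are not the real schema or real data.

```python
import sqlite3

# Hypothetical schema and toy rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plays ("
    "defense TEXT, down_num INTEGER, to_go INTEGER, converted INTEGER)"
)
sample_rows = [
    ("Michigan", 3, 1, 1),
    ("Michigan", 3, 1, 0),
    ("Michigan", 3, 10, 0),
    ("Michigan", 3, 10, 0),
    ("Michigan", 3, 10, 1),
    ("Michigan", 2, 9, 0),     # second down, so the query filters it out
    ("Opponent U", 3, 10, 1),  # different defense, also filtered out
]
conn.executemany("INSERT INTO plays VALUES (?, ?, ?, ?)", sample_rows)


def third_down_breakdown(conn, defense):
    """Return {to_go: (conversions_allowed, stops)} for one defense on third down."""
    cursor = conn.execute(
        """
        SELECT to_go,
               SUM(converted)            AS conversions,
               COUNT(*) - SUM(converted) AS stops
        FROM plays
        WHERE defense = ? AND down_num = 3
        GROUP BY to_go
        ORDER BY to_go
        """,
        (defense,),
    )
    return {to_go: (conv, stops) for to_go, conv, stops in cursor}


breakdown = third_down_breakdown(conn, "Michigan")
for to_go, (conv, stops) in breakdown.items():
    print(f"3rd and {to_go}: {conv} converted, {stops} stopped")
```

Each distance gets two counts, which is all a two-color bar chart needs.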

So that's cool but totally without context and somewhat silly on the surface. Is this good? Is it bad? If you find yourself in second and nine should you attempt to lose two yards to access Michigan's horrid 3rd-and-11 defense?

So then we've got this:

The background there is the NCAA average; the foreground is Michigan's success at halting their opponents. Now we have something to relate it to: surprisingly, Michigan's defense on third and one is subpar despite Gabe Watson. Those seismic spikes towards the end are way stupid, though... so we need to smooth us some data:

This is better, but still wonky.

[WARNING] Those who experience seizures at math talk should skip the next section. [/WARNING]

I had two ideas for the smoothing:
  1. a sort of moving average based on the percentages. Unsatisfactory because this particular graph does not appear to be linear... there's a steep drop at the beginning and then a levelling off. Also, tends to ignore the fact that 4/6 on third and eleven isn't quite as meaningful as 6/22 or whatever on third and ten.
  2. something similar to #1 except adding up each success and failure instead of treating each distance as a monolithic percentage. This more properly weights stuff that is frequently experienced (like 3rd and 10), but also has a distorting effect that you can clearly see around 12 or 13 yards to go... with so few instances of those distances, their lower conversion percentages are drowned out by the vast quantity of 3rd and 10s.
#2 is currently in use, but I'm highly suspicious. The preponderance of 3rd and 10s seems to be dragging items nearby to its level, like a gravitational well. Someone somewhere has tackled data like this before and come out with something better than either of these alternatives. Are you one of these people? Please say yes. Any help is appreciated.
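To make the difference between the two ideas concrete, here's a rough Python sketch. The function names, the window of one yard on either side, the way the edges are handled, and the 1-for-5 on third and twelve are all assumptions made up for illustration (the 4/6 and 6/22 are the example figures from the list above). This is not the code behind the graphs.

```python
def smooth_percentages(counts, window=1):
    """Idea #1: average the conversion *rates* of neighboring distances.

    counts maps yards-to-go -> (successes, attempts).
    Every distance gets equal weight, however few attempts it has.
    """
    smoothed = {}
    for d in counts:
        rates = [
            counts[n][0] / counts[n][1]
            for n in range(d - window, d + window + 1)
            if n in counts and counts[n][1] > 0
        ]
        smoothed[d] = sum(rates) / len(rates) if rates else None
    return smoothed


def smooth_pooled(counts, window=1):
    """Idea #2: add up raw successes and attempts across neighbors, then divide.

    Distances with many attempts dominate the pooled rate.
    """
    smoothed = {}
    for d in counts:
        neighbors = [
            counts[n] for n in range(d - window, d + window + 1) if n in counts
        ]
        successes = sum(s for s, _ in neighbors)
        attempts = sum(a for _, a in neighbors)
        smoothed[d] = successes / attempts if attempts else None
    return smoothed


# Toy data: (successes, attempts) by yards to go. Illustrative only.
toy = {10: (6, 22), 11: (4, 6), 12: (1, 5)}

by_pct = smooth_percentages(toy)
pooled = smooth_pooled(toy)
for d in sorted(toy):
    print(f"3rd and {d}: idea #1 = {by_pct[d]:.3f}, idea #2 = {pooled[d]:.3f}")
```

On these toy numbers, idea #1 puts 3rd and 11 at about 38% while idea #2 puts it at 11/33, or about 33%: the 22 attempts at 3rd and 10 dominate the pool, which is the gravitational-well effect in miniature.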

Anyway, I'm a couple tweaks away from releasing a little app that will give you these graphs for any offense or defense in I-A, but before I do it I'd really like to get some better smoothing so that the results for individual teams don't look goofy. Consider this an APB for assistance.

(You can come out now, mathophobes.)