Sunday, July 19, 2015



Sorry, y'all, I made a couple of mistakes interpreting my own extrapolation results. This is the curse of the data-driven -- the result ends up being a single number that's sometimes hard to check.

The main problem stemmed from a slight error in the assumptions behind the models used in the last two posts (see below if interested). The other was a simple misinterpretation of my own results: those models reported the number of days until an event occurs counted from the first data point, and I was reading that as days from the present (a difference of about 16 days). The new versions of the models report results in terms of "days since the most recent data point," so this confusion won't occur again.

With this error corrected, I can make some better predictions. First of all, the previous estimate of when Bernie Sanders (I) surpasses Hillary Clinton (D) is reduced. My last post indicated that in 88 days (based on my full data set), they should have the same number of Likes. Using only the last 12 days (where Sanders has been on a bit more of a hot streak), I had predicted the crossover in 70 days.

After applying the correction, using all the data, Sanders should pass Clinton in 70 days (c. September 27). Using just the last 12 days (starting on day 6 of my data collection), Sanders should pass Clinton in 56 days (c. September 13). Two weeks in an election year can mean a lot of momentum, hence my desire to correct this. In either case, it would occur around 1.3M or 1.4M Likes each.
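To make the "days from now" bookkeeping concrete, here is a small Python sketch (not the Mathematica I actually used) that turns a day count into a calendar date. It assumes the most recent data point falls on the date of this post, July 19, 2015; the function name `crossing_date` is made up for the example.

```python
from datetime import date, timedelta

def crossing_date(last_data_point, days_ahead):
    """Calendar date that lies `days_ahead` days after the most recent data point."""
    return last_data_point + timedelta(days=days_ahead)

# Assumption: the most recent measurement was taken on the day of this post.
last_point = date(2015, 7, 19)

full_fit = crossing_date(last_point, 70)    # estimate from the full data set
recent_fit = crossing_date(last_point, 56)  # estimate from the last 12 days only

print(full_fit.isoformat(), recent_fit.isoformat())
```

Counting from the first data point instead of the most recent one shifts every such date by the length of the data set, which is exactly the mix-up described above.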

(The animation in the previous post is actually correct -- I just didn't report the correct number of days.)

Based on my entire data set, here are some (potentially) interesting future dates:
July 29: Sanders passes Marco Rubio (R) around 918K Likes.
August 10: Donald Trump (R) hits 3M Likes.
August 14: Ben Carson (R) passes Mike Huckabee (R) around 1.8M Likes.
August 22: Clinton passes Rick Perry (R) at 1.2M Likes.
September 7: Sanders passes Perry at 1.2M Likes.
September 17: Trump hits 4M Likes.
September 20: Jeb Bush (R) passes Rick Santorum (R) at 266K Likes.
September 22: Clinton passes Ted Cruz (R) at 1.4M Likes.
September 24: Sanders passes Cruz at 1.4M Likes.
September 27: Sanders passes Clinton at 1.4M Likes.
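For anyone curious how a crossing date like these can be computed, here is a rough Python sketch. It assumes a straight-line (least-squares) fit per candidate, which is only one possible "simplest" model, and it is not my actual Mathematica code. The Like counts are invented toy numbers, not my data, and `fit_line` and `days_until_crossing` are names made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def days_until_crossing(days, likes_a, likes_b):
    """Day offset at which the two fitted lines meet, or None if they are parallel."""
    slope_a, intercept_a = fit_line(days, likes_a)
    slope_b, intercept_b = fit_line(days, likes_b)
    if slope_a == slope_b:
        return None
    return (intercept_b - intercept_a) / (slope_a - slope_b)

# Toy data: times are days relative to the most recent measurement (0 = now),
# deliberately spaced unevenly. These are NOT real Like counts.
days = [-3.0, -1.5, -1.0, 0.0]
chaser = [670.0, 685.0, 690.0, 700.0]   # gaining 10 per day
leader = [988.0, 994.0, 996.0, 1000.0]  # gaining 4 per day

crossing = days_until_crossing(days, chaser, leader)
print(crossing)
```

Because the time axis is measured relative to the most recent data point, the answer comes out directly as "days from now."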

Anyway, these are just the simplest possible predictions and don't account for an off-hand sound bite here or a riveting debate there. I'll go into that over the next couple days while things seem to be quiet on the data front. I'll try to work up some more visualizations too -- this has been a lot of text. Until next time.

For those interested in the details: the error was that I was assuming (incorrectly) that the Like data was collected at 24-hour intervals. This is largely the case, but not always. As such, the analysis would see, say, a 1.5-day interval and a 0.5-day interval and treat them both as full days. Why is this bad? Because on the longer interval we see an uncharacteristically large jump in Likes, and on the shorter one an uncharacteristically small jump, even if the underlying growth rate never changed.
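Here is a toy illustration in Python (not my actual analysis code, and with invented numbers) of how uneven sampling distorts things. The page below gains a steady 1,000 Likes per day, but two of the measurements are 1.5 and 0.5 days apart.

```python
def naive_daily_gains(likes):
    """Gain between consecutive measurements, wrongly treating each gap as one day."""
    return [later - earlier for earlier, later in zip(likes, likes[1:])]

def true_daily_rates(days, likes):
    """Gain between consecutive measurements divided by the actual elapsed days."""
    return [
        (l1 - l0) / (d1 - d0)
        for (d0, l0), (d1, l1) in zip(zip(days, likes), zip(days[1:], likes[1:]))
    ]

# Hypothetical measurements: a constant 1,000 Likes/day sampled at uneven times.
days = [0.0, 1.0, 2.5, 3.0, 4.0]
likes = [10000, 11000, 12500, 13000, 14000]

naive = naive_daily_gains(likes)        # shows a spurious spike and dip
rates = true_daily_rates(days, likes)   # flat once the real gaps are used
print(naive, rates)
```

The naive version reports a jump followed by a slump that never happened; dividing by the real elapsed time recovers the steady rate.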

To correct this, I made a simple function that figures out exactly how much time passed between measurements, and I now use that in my models. In particular, it creates a list giving each measurement's time in days relative to the most recent measurement (so if the first data point was 16 days before the most recent point, it has a value of -16). That way, all predictions are phrased in terms of "days from now" rather than "days since the start of my data set," which should make it harder to repeat the same mistake.

From a list of times called "times" that contains timestamps as strings, I created this new list with:

(* days relative to the most recent measurement; earlier points come out negative *)
dayTimes = N[(AbsoluteTime[#] - AbsoluteTime[Last[times]])/(3600*24)] & /@ times
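For readers who don't use Mathematica, a rough Python equivalent of the one-liner above might look like the following. The timestamp format and the sample timestamps are assumptions made for the example (my actual strings aren't shown here), and `day_offsets` is a made-up name.

```python
from datetime import datetime

def day_offsets(times, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert timestamp strings to days relative to the last entry in the list.

    Assumes `times` is in chronological order, so the last entry is the most
    recent measurement and every earlier entry gets a negative offset.
    """
    parsed = [datetime.strptime(t, fmt) for t in times]
    latest = parsed[-1]
    return [(p - latest).total_seconds() / 86400 for p in parsed]

# Hypothetical timestamps with uneven gaps (1.5 days, then 0.5 days, then 14 days).
times = [
    "2015-07-03 09:00:00",
    "2015-07-04 21:00:00",
    "2015-07-05 09:00:00",
    "2015-07-19 09:00:00",
]

offsets = day_offsets(times)
print(offsets)
```

Feeding these offsets to a model as the time axis means uneven sampling is handled automatically, and any extrapolated time is already expressed in days from the most recent measurement.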
