## Tuesday, October 20, 2015

### Wolfram Technology Conference

Today I presented my work on this blog at the Wolfram Technology Conference. So that happened.

I'm a part of the Wolfram Student Ambassador Program, and was invited to share the sorts of things I've been doing. I've been a nervous wreck about it for weeks. The people at the conference are industry leaders, many of whom have worked in Mathematica for many years (compared to my ~3-4 years). But y'know what?

It was great!

I think roughly 60 people attended my talk, and it went really well. *phew* Several people came up to me after the talk to say how much they enjoyed it, and that was about the biggest self-esteem boost I've had since getting into graduate school.

The slides for the talk are available here:
• PDF: https://dl.dropboxusercontent.com/u/4972364/WTC_2015/Presentation_beta.nb.pdf
• Mathematica Notebook: https://dl.dropboxusercontent.com/u/4972364/WTC_2015/Presentation_beta.nb (needs the following GIF: https://dl.dropboxusercontent.com/u/4972364/WTC_2015/blog.gif)

The Mathematica Notebook version (if you own a copy of Mathematica) is preferred -- the formatting is better and it's fully interactive. In either case, it gives you some insight into how I create the visualizations for this blog and some of the technology behind it -- specifically the Wolfram Data Drop, the Manipulate function, and CloudDeploy.

I'll have some updates for interactive things over the next couple weeks. Going to make it a lot easier for you to play with the data on your own. I'll also have some other updates from the coding side of things.

Some questions from the Q&A are worth mentioning here, along with some things I'd like to do in the future:

(paraphrased) Will you look at other data streams, like Twitter or Google?
Short answer: yes, but not yet. I'm definitely hoping to analyze these sorts of data streams as well, seeing how well they match up to each other and how well they match up to poll-based public opinion estimates.

(paraphrased) Isn't Facebook a biased sample of the electorate?
This question (and others like it) points to a very valid flaw (or at least a major assumption) inherent in this analysis: Facebook, Twitter, and even Google have a skewed representation of the population. A lot of likely voters simply aren't going to have a Facebook page with a lot of information. Furthermore, just because someone follows a candidate doesn't mean they'll vote for them, nor does not following a candidate indicate a lack of support. I'm hoping to look at past elections and find similar Facebook Like data to see how well these sorts of things actually match up. If there's enough overlap, maybe I can make accurate predictions in spite of the biases, or at least try to correct for them.

(not asked, but something I mentioned) What about extrapolation? Surely a linear model is inappropriate here.
Absolutely. Right now, I'm making very naive predictions that are surely wrong. That's not a statement made out of false humility. It's really a quite stupid way of looking at things. That doesn't make it bad. Just very likely to be inaccurate as time goes on. The question is how to bring in more sophisticated models (specifically regarding discrete events like the debates). The computation is straightforward. The challenge is the theory. And honestly, I don't (yet) have a ton of experience with this sort of data. So, it's coming down the pipeline, just maybe not for a while.

Anyway, I need to get some sleep for the conference tomorrow, followed by flights, working on my neuroscience homework, and studying for midterms for Thursday. Oy.

## Tuesday, October 6, 2015

### Carson Leads; Second Debate; Interactive Graph

Howdy! It's been a while. Sorry about that. Turns out grad school is kinda hard. Who knew?

There's a lot of data to talk about (if you want graphs, just scroll down a bit to the Graphs and Stuff section), but I wanted to say a bit about some of the code changes I've made and why I think they're worth noting, plus just tell more of a story about the project.

## Code

A lot of my time for this blog is spent trying to make new and better visualization tools for this data set. Part of it comes from learning new tools in Mathematica, part of it comes from more software engineering -- how can I build this so that this time in January I can still effectively use the same code? That takes some forethought. As my high school computer science teacher's wall said, "weeks of programming can save hours of planning" (Yes, Mr. Martin, I still think about this regularly when I code and quote it to the students I tutor).

### Being my own project manager (+ y'all)

As the sole programmer on this project, I know all the flaws of my system. I know that before this post, I hadn't included John Kasich (R) in my results even though I was monitoring his page. I know that it was impossible to understand what was going on in some of my plots. I know I wasn't handling some special cases well. I know I couldn't highlight a particular candidate.

Figuring out these problems is one thing. Addressing them is another. Although I usually end up having long coding sessions to fix multiple things, I have to do so systematically. That means determining what's critical, what will cause changes "downstream" and so forth. For example, I ran into this one for this post: what happens if the data in a Databin in the Wolfram Language isn't in chronological order? I include a timestamp in each submission I make to the Wolfram Data Drop, but the fact is, the rows in a Databin are ordered by submission time, not manual timestamp (although they are treated the same in many places).

Why is that a problem? A lot of the code in the project relies on neighboring rows being chronological (especially for finding the differences between days and things like that). So when I had to re-create some submissions for Kasich to account for missing data (assuming -1 or 0 everywhere, including percent change, until real daily values could be computed), I had a new problem: those rows landed out of order, and I had to compensate. Fortunately, I was able to deal with this (and with sorting names by last name in some places) up front. Without planning how to do this effectively, I would probably still be editing code rather than making a structured fix. Planning > hacking, at least if you care about the quality of the code next week.
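For concreteness, here's the re-sorting idea sketched in Python (the project itself is in the Wolfram Language, and the field names here are hypothetical): rows come back ordered by submission time, so you sort by the embedded timestamp before computing day-to-day differences.

```python
from datetime import datetime

# Hypothetical rows as they might come back from a databin: ordered by
# submission time, where one row was backfilled (submitted late).
rows = [
    {"timestamp": "2015-10-02 00:00:00", "likes": 4000000},
    {"timestamp": "2015-09-23 00:00:00", "likes": 3900000},  # backfilled later
    {"timestamp": "2015-10-03 00:00:00", "likes": 4010000},
]

# Re-order by the embedded timestamp, not submission order, so that
# neighboring rows are actually consecutive days.
rows.sort(key=lambda r: datetime.strptime(r["timestamp"], "%Y-%m-%d %H:%M:%S"))
```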

I also want to respond to reader feedback on this blog. I hadn't been able to display daily results, and that's kinda annoying if you want to check in on things between my posts. I hadn't included Kasich (nor fixed Scott Walker's (R) data [he also entered the race late]), nor added the ability to highlight a candidate, along with some other back-end updates. That's a lot. Carry on to see results.

#### Optional Parameters

Okay, super tech-ing out now. The Wolfram Language supports optional parameters. That sentence may make no sense to you. Let's talk about fast food instead. You pull up to your local Whataburger (I'm all about that honey butter chicken biscuit) and see all the choices you can order. You decide on a double cheeseburger. That could be the end of the story. But maybe you want to customize it. Maybe you want grilled onions. Oh, and you should put double lettuce. And substitute ketchup for mustard. You took the main idea then added a lot of options (admittedly fairly structured ones -- you're not going to be able to add a side of chicken enchiladas with green sauce).

The Wolfram Language (WL, which I'm using for this project) supports this sort of flexibility, including providing defaults (or guesses). For example, a simple plot can be made with no options at all, or heavily customized:
 Simple plot
 Highly customized plot
As part of my project, I'm building a lot of functions specifically for my data. Particularly for the graphs of Likes versus time, there's a lot to be customized. There's color, there's whether to highlight a candidate, there's which candidates to plot, there's the plot range... and I'm sure I'll have more next time.

There's a cool pattern in WL to do this (there are others; this is the one I used successfully for some statistics-related functions I created as an exercise a few weeks ago). You can create the full function:
myNewFunction[x_, y_, z_] := ...
Then you can determine which variables should have options and default values:
myNewFunction[x_, OptionsPattern[{y -> 5, z -> 7}]] := myNewFunction[x, OptionValue[y], OptionValue[z]]
This creates a new version that requires a value for x and assumes y=5 and z=7. However, if the user supplies new values (myNewFunction[2, y -> 10]), the calculation is updated accordingly. Essentially, it's nice to have the first version, where the caller specifies all the values, so you don't get confused (as the person writing the function), and it's nice to have the second as a user because it's more flexible and "smart." This pattern provides the best of both worlds: any functionality changes are made only once, and any new options are easy to incorporate into both versions.
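For readers without Mathematica, the same idea maps roughly onto Python keyword arguments with defaults (the function name and computation below are just placeholders, not anything from my actual code):

```python
# Rough Python analogue of the pattern above: x is required, y and z are
# optional with defaults.
def my_new_function(x, y=5, z=7):
    return x + y * z

full = my_new_function(2, 5, 7)     # caller spells out every value
defaulted = my_new_function(2)      # y=5, z=7 assumed
tweaked = my_new_function(2, y=10)  # override just one option
```

The difference is that WL options are named and order-independent at the call site, which scales better once a plotting function grows a dozen knobs.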

In doing so with my time-series graphing function, I made it so I can selectively customize options for the plot without having to change the function or write out every single parameter I have at my disposal. TL;DR: I can change things more easily and create prettier plots faster.

## Graphs and Stuff

Here are the actual results. (I apologize that I don't have data for a few days due to some technical issues)

### Ben Carson (R) Leads All Candidates

Last time, Donald Trump (R) had full control over the political field. A challenger has arisen. Ben Carson (R) has rapidly risen in popularity on Facebook:
 Oh hey, I can highlight candidates now.
Each vertical line indicates one of the televised GOP debates. Carson has skyrocketed since the first debate and even more so since the second one. He now has approximately 19% of the total number of Likes to Trump's 18% and third-place Rand Paul's (R) 9.5%.

In related news, sometime between September 23 and October 2 (a gap in my data), he became the first candidate to hit 4M Facebook Likes.

### Past Predictions

This brings me to some past predictions. I'll copy the predictions here and write the updates in bold. (Many of the events occurred on September 17 – just a few hours after the second GOP debate)
• August 26: Clinton passes Perry at 1.2M Likes (Occurred on August 24 at 1.2M)
• August 29: Rubio hits 1M Likes (Occurred on September 17 – a delay was predicted last time due to his waning FB support at the time)
• September 1: Bush passes Santorum at 265K (Occurred on September 17 at 266K)
• September 4: Sanders passes Cruz at 1.4M (Occurred September 17)
• September 18: Bush passes Jindal at 286K (Not yet occurred)
• September 29: Sanders passes Huckabee (R) at 1.8M (Not yet occurred)
• September 29: Bush passes Walker at 300K (Not yet occurred)
• October 13: Sanders passes Paul (R) at 2.1M (Not yet occurred)
So, my predictions were kinda bad. #LinearRegressionOnNonlinearData

I don't have new predictions yet, but here's the current state of affairs.
 One of these days, I'll have this sort of thing compared with real polling numbers... or FEC filings.

## Interactive Graph!!!

Okay, so this took a lot of work, but I think I finally got it, folks! You can now make some of your own graphs with the power of the cloud (the Wolfram Cloud, specifically)!

Here's what it looks like:
You select which type of data you want to look at, optionally filter by political party, optionally highlight a particular candidate, then hit submit! Give it a little while to process the data for you and then voilà! You get your very own graph (such as the one earlier in this post), or these:

 Absolute overnight change, Republicans only, Donald Trump highlighted
 Percentage overnight change, Democrats only, Bernie Sanders highlighted

To create your own graphs, GO HERE: https://wolfr.am/7mn~J_kq. Enjoy!

## Saturday, August 22, 2015

### Bernie Beats Hillary... Again?

Interesting times for the Democrats! As of today, Bernie Sanders (I) has passed Hillary Clinton (D) in terms of total likes on Facebook. After trailing by ~7000 likes for the past couple days (see below), Bernie finally passed Hillary. But is that the full story? It's hard to say.

Quick Milestone Update
Before delving into the Bernie / Hillary data, just a quick update on some prior predictions. See http://mathematelection.blogspot.com/2015/08/the-first-gop-debate.html for the previous ones.
• August 13: Fiorina (R) passes Bush (R) at 250K likes (no prediction, although I think I may have misread things last time because around this date was when I thought Fiorina would pass Walker)
• Marco Rubio (R) has still not hit 1M likes (prediction: August 17). Since the spike in likes from the first GOP debate, he's cooled off a bit to roughly 1,000 likes per night.
• August 18: Fiorina passes Santorum (R) at 265K (I think this was a misread too -- predicted Bush to do this on the 19th)
• August 22: Sanders passes Clinton at 1.2M likes (prediction: August 20 at 1.2M).
• Unless something quite strange happens, Sanders is within 7000 likes of Rick Perry (R) and based on Sanders' typical increase of ~10000, he'll pass Perry tomorrow. Clinton will likely follow after in another couple days.
• August 22: Fiorina passes Bobby Jindal (R) with 278K (again, I think last time was a misread, because Bush was supposed to pass Jindal on August 23 with 286K)
Back to the Story
Bernie passes Hillary. Based on the data so far, this was a long time coming. Since I started collecting data, Bernie has been closing the gap [for the record, I'm using first names because each candidate's campaign primarily uses his or her first name]:

[As an aside, I discovered today, courtesy of an article from The Guardian several days ago, that Bernie actually has two official Facebook Pages, one of which passed Hillary long ago. However, the page I'm monitoring is the official campaign page {the other is for his presence as a U.S. Senator}, so I'm going to stick with it. As a related aside to the pages I'm following, I will not be monitoring the "candidate" Deez Nuts who recently polled at 9% in North Carolina until a verified page is created -- and this may not happen since the high school student running under this pseudonym is not actually eligible to legally become the POTUS due to age restrictions]

Since the first GOP debate (in which neither of these candidates participated), Bernie took off relative to Hillary. What intrigues me is the sudden change on August 19. All of a sudden he lost momentum relative to Hillary (or, equivalently, she gained momentum relative to him). What's interesting is that nothing major seems to have happened that day. Hillary had been (somewhat) facing criticism following recorded remarks made to #BlackLivesMatter activists. Bernie, two days prior, was getting (mostly) positive press for telling a reporter that his hair is not a serious issue when asked why it gets less scrutiny than Hillary's.

What happened?

According to a user on reddit, it may have been Hillary's campaign buying Likes overseas. If you want to see what that user came up with, click the link. I won't copy the infographic to this page because I've seen no indication yet that such figures are legitimate. I mention it because the timeframe seems to coincide with where things took a rather sudden turn in my data set, but with a word of caution against jumping to conclusions: in the extremely chaotic world of social media, this could also be just a fluke for a few days.

What I do know is data. From August 6 through August 19, the linear best-fit line for Hillary's lead compared to Bernie was the following, where "d" is the number of days since August 6.
$\hat{\text{Lead}} = 185{,}681 - 15{,}108.2 \cdot d$
Obviously, this is a best fit, so the slope and intercept are estimates. Considering the standard error of the fit, the intercept is 186,000 +/- 3,000 and the slope is -15,100 +/- 400. That means that on August 6, Hillary was roughly 186,000 Likes ahead of Bernie, with her lead dropping about 15,000 Likes per day. The fit produced a coefficient of determination of R^2 = 0.990. What does that mean? Formally, 99% of the variance in the data can be accounted for by the model listed above. Less formally, the data are pretty much exactly in a line. If you don't believe the number, just scroll up and look at the fit. [I changed how "d" is expressed in the formula above to be days since August 6 rather than days before August 22, to clarify what the intercept means in this case; the slope is the same either way]
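The fit itself is ordinary least squares. Here's a sketch of the computation (in Python rather than the Wolfram Language I actually use), taking (day, lead) pairs and returning the intercept, slope, and R^2:

```python
def fit_line(xs, ys):
    """Least-squares fit ys ≈ intercept + slope*x, plus R^2."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    r2 = 1 - ss_res / ss_tot
    return intercept, slope, r2
```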

Including the data since August 19, R^2 drops to 0.962. In practice, that means the data are still pretty much linear. As stated before, the last few days could be more of a fluke [for example, Hillary's lead stagnated for several days about a month ago] and things may get back to normal later, but from a very simplified (perhaps not even really valid, just interesting) perspective, it seems to me that something has been going on for one of the campaigns (or both) over the past couple days. The fact that this happened between the two leading candidates on the Democratic ticket right as their numbers of Likes on Facebook were equalizing is at the very least curious, and should probably elicit some skepticism about this data set to begin with. It is very much true that some companies exist to sell Likes on Facebook (such as fbskip.com), and what better time to buy than when your lead is falling? Whether that's going on now is currently just speculation. But I think there's good reason to wear very skeptical spectacles when reviewing this data.

Predictions:
(I'll be real careful this time) I'm still using a linear model fit of the data since the first debate on August 6. Predictions change in response to the new data.
• August 26: Clinton passes Perry at 1.2M Likes
• August 29: Rubio hits 1M Likes (I doubt this date is correct because he's been losing momentum) [See below]
• September 1: Bush passes Santorum at 265K
• September 4: Sanders passes Cruz at 1.4M
• September 18: Bush passes Jindal at 286K
• September 29: Sanders passes Huckabee (R) at 1.8M
• September 29: Bush passes Walker at 300K
• October 13: Sanders passes Paul (R) at 2.1M
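For what it's worth, the arithmetic behind each "X passes Y" date is just the crossing point of two best-fit lines. A minimal sketch (Python, with purely illustrative coefficients):

```python
def crossing_day(a1, b1, a2, b2):
    """Day d at which line a1 + b1*d meets line a2 + b2*d.
    Returns None if the slopes are equal (the lines never cross)."""
    if b1 == b2:
        return None
    return (a2 - a1) / (b1 - b2)

# Illustrative: a chaser gaining 15K Likes/day on a leader gaining 5K/day,
# starting 200K behind, catches up in 20 days.
days = crossing_day(1_200_000, 5_000, 1_000_000, 15_000)
```

Plugging the crossing day back into either line gives the Like count at which the pass occurs, which is where the "at 1.2M"-style figures above come from.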

Interactive Plots Coming:
I'm going to keep working to get better interactive plots up. More on that probably next time. In the meantime, if there's a plot you want based on the graphics I've displayed anywhere on the blog (or on the existing interactive plots with old data), just let me know and I'll get it to you.

## Wednesday, August 12, 2015

### The First GOP Debate

The first debate for Republican candidates was August 6, just about a week ago. Since then, there have been many interesting changes in the Facebook Like dataset.

Debate Winners:
This is one of the first times to see if major changes in the political landscape are reflected by Facebook Likes. CNN and The Huffington Post seem to agree that Carly Fiorina (R), Marco Rubio (R), and Ben Carson (R) were the winners of the debates (Fiorina was in the "happy hour" debate while Rubio and Carson were in the "primetime" debate). Let's take a look at the Facebook data.

Here's the data from August 7:

Based on the percentage overnight change, it certainly seems that Fiorina and Carson were major winners. In fact, Carson's overnight increase of 124,341 Likes set a new record (at least since I started this project). [Although he broke his own record the next day with 145,526 Likes] It definitely seems like percent change is one of the best ways to track shifts in public opinion, rather than total Likes or even absolute overnight change. Trump consistently does well overnight, but it's difficult to ascertain whether this is because of agreement with his views on the issues or because people are simply interested in following him due to controversy. However, the percentage bumps seem to jibe with the "declared winners" of the debate (to some extent, anyway). What's interesting is that the Facebook data doesn't really show much of an increase for Rubio. In fact, Bernie Sanders (I) has seen more of a bump since the debate (the large bump for Carson is August 7).
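For reference, the overnight statistics are just consecutive differences over the daily totals. A quick sketch (in Python, though the project itself uses the Wolfram Language):

```python
def overnight_changes(likes):
    """Absolute and percent change between consecutive daily Like totals."""
    absolute = [b - a for a, b in zip(likes, likes[1:])]
    percent = [100.0 * (b - a) / a for a, b in zip(likes, likes[1:])]
    return absolute, percent
```

Percent change normalizes away page size, which is why a smaller page like Fiorina's can "win" a night that Trump dominates in absolute terms.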

I created a new statistic this time that shows, I think, who the front-runners seem to be in terms of momentum: the percent increase in Likes since the start of the data set. While Donald Trump (R) still leads in terms of total Likes, Fiorina now actually has the highest percent change, up there with Bernie Sanders (I) and Carson. We'll see how this plays out, but I wouldn't be surprised if those four become the major contenders later on (but that's just a personal opinion).

The Sanders Bump:
One strange artifact of the debate (and other recent press) is that Sanders is taking off relative to Hillary Clinton (D). Over the past week or so, Sanders has rapidly decreased Clinton's lead. Using all the data collected so far in a linear model, Sanders should pass Clinton in approximately 43 days. Using just the past week (given his actual increased political momentum after the debate and the #BlackLivesMatter interruption in Seattle at one of his rallies), Sanders should pass Clinton in just a little over a week. Interesting times are ahead for the Democrats.

Milestones:
Here are some milestones that candidates have hit recently, along with when they were predicted to happen (see my previous post).

August 3: Bernie Sanders (I) passes Marco Rubio (R) with 925K Likes (prediction: July 29 with 918K)
August 6: Ben Carson (R) passes Mike Huckabee (R) with 1.8M Likes (prediction: August 14 with 1.8M)
August 8: Ben Carson (R) passes Rand Paul (R) with 2M Likes (no prediction)
August 9: Donald Trump (R) hits 3M Likes (prediction: August 10)
August 10: Bernie Sanders (I) hits 1M Likes (no prediction)

New Predictions:
These are based on just the data since the debate. The newest predictions for the party nominees are Ben Carson and Bernie Sanders.

August 14: Fiorina passes Walker (R) at 290K
August 17: Rubio hits 1M
August 19: Bush (R) passes Santorum (R) at 265K
August 20: Sanders passes Perry (R) and Clinton at 1.2M
August 23: Clinton passes Perry at 1.2M
August 25: Bush passes Jindal (R) at 286K
September 4: Sanders passes Cruz (R) at 1.5M
September 8: Carson passes Trump at 3.97M
September 9: Carson hits 4M
September 11: Christie (R) passes Graham (R) at 136K

## Sunday, August 2, 2015

This time, I don't really have many updates on the actual election front. Things have been fairly stagnant over the past week or more.

However.

I've been changing quite a bit of my process and have some cool things to share:

1) Wolfram Data Drop
As stated previously, I've been using Wolfram Mathematica to process the FB Like data. I'm working to move things more and more to the cloud and make things more and more accessible. A friend of mine told me about Wolfram Data Drop (cloud-based storage, publicly available, often used as an interface for the "Internet of Things" -- maybe you store your minute-by-minute pulse measurements from a FitBit or something there). So now my data is accessible here.

2) Cloud Visualizations
Another cool thing is deploying some of the things I've built to the cloud. I'm still having a little bit of trouble producing things quite the way I want, but I do have a way to present all the data up until today in an interactive format:

To view graphs like below, go here.

To view a pie chart like below, go here.

To view the actual data like below, go here.

And finally, to see the estimated total likes of each candidate from now until election day (based on a best-fit line of data collected so far), go here.

I'll be working on making these faster soon. There are a lot of calculations happening in each of these, which is why they may take a while to load. Until then, I wish you happy times in exploring all the data!

## Sunday, July 19, 2015

### Corrections

Oops.

Sorry, y'all, made a couple mistakes in interpreting my own results in terms of extrapolation. This is the curse of the data-driven -- things end up being a number that's sometimes hard to check.

The main problem stemmed from a slight error in the assumptions behind the models used in the last two posts (see below if interested). The other is a simple misinterpretation of my own results: these models used to report the number of days until an event occurs counted from the first data point, but I was treating that as days from the present (a ~16-day difference). The new versions of the models produce results in terms of "days since the most recent data point," so this confusion won't occur again.

With this error now corrected, I can now make some better predictions. First of all, the previous estimate for Bernie Sanders (I) surpassing Hillary Clinton (D) is reduced. My last post indicated that in 88 days (based on my full data set), they should have the same number of Likes. Furthermore, using only the last 12 days (where Sanders has had a bit more of a hot streak), I predicted this shift to occur in 70 days.

After applying the correction, using all the data, Sanders should pass Clinton in 70 days (c. September 27). Using just the last 12 days (starting on day 6 of my data collection), Sanders should pass Clinton in 56 days (c. September 13). Two weeks in an election year can mean a lot of momentum, hence my desire to correct this. In either case, it would occur around 1.3M or 1.4M Likes each.

(the animation in the previous post is actually correct -- I just didn't report the correct number of days).

Based on my entire data set, here are some (potentially) interesting future dates:
July 29: Sanders passes Marco Rubio (R) around 918K Likes.
August 10: Donald Trump (R) hits 3M Likes.
August 14: Ben Carson (R) passes Mike Huckabee (R) around 1.8M Likes.
August 22: Clinton passes Rick Perry (R) at 1.2M Likes.
September 7: Sanders passes Perry at 1.2M Likes.
September 17: Trump hits 4M Likes.
September 20: Jeb Bush (R) passes Rick Santorum (R) at 266K Likes.
September 22: Clinton passes Ted Cruz (R) at 1.4M Likes.
September 24: Sanders passes Cruz at 1.4M Likes.
September 27: Sanders passes Clinton at 1.4M Likes.

Anyway, these are just the simplest possible predictions and don't account for an off-hand sound bite here or a riveting debate there. I'll go into that over the next couple days while things seem to be quiet on the data front. I'll try to work up some more visualizations too -- this has been a lot of text. Until next time.

Below:
The error was that I was assuming (incorrectly) that the Like data was collected at 24-hour intervals. That's largely the case, but not always. As such, the analysis would see a 1.5-day interval and a 0.5-day interval and treat both as full days. Why is this bad? Because over the longer interval we see an uncharacteristically large jump in Likes, and over the shorter one an uncharacteristically small jump.

To correct this, I made a simple function that figures out exactly how much time there was between measurements and now use that in my models. In particular, it creates a list of days since the last measurement (so if the first data point was 16 days prior to the most recent point, it would have a value of -16). That way, all predictions are phrased in terms of "days from now" rather than "days since the start of my data set." Easier to not make the same mistake in the future.

From a list of times called "times" that contains timestamps as strings, I created this new list with:

dayTimes = N[(AbsoluteTime[#] - AbsoluteTime[times[[-1]]])/(3600*24)] & /@ times
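For readers more comfortable in Python, the same computation might look like the following (the timestamp format string is an assumption, since my actual strings may be formatted differently):

```python
from datetime import datetime

def day_offsets(times, fmt="%Y-%m-%d %H:%M:%S"):
    """Days between each measurement and the most recent one.
    Earlier points come out negative, mirroring the dayTimes list above."""
    stamps = [datetime.strptime(t, fmt) for t in times]
    last = stamps[-1]
    return [(s - last).total_seconds() / 86400.0 for s in stamps]
```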

## Saturday, July 18, 2015

### Bernie Beats Biden

#CalledIt

Last night, Bernie Sanders (I) surpassed Joe Biden (D) [not yet announced] in terms of total Likes. Sanders now has 837K to Biden's 836K as of midnight central time.

Sanders is still behind Hillary Clinton (D), the front-runner for the Democrats (1,066K). However, over the last 11 days, Sanders has added at least 1.5 times as many Likes as Clinton each night. Especially considering that he is still behind, it should come as no surprise that his percent increase in Likes has been at least double hers over the same timeframe. In fact, a linear best-fit line of Clinton's lead now predicts Sanders surpassing her in 88 days if we use the data from my entire span of collection. Over these past 11 "magical" days for Sanders, the same fit shows him rivaling Clinton in 71 days. Assuming linear growth in Likes for Sanders, this would put both of them at 1.25M Likes.

To further this analysis, I performed a simple linear regression (best-fit line) for each candidate based on all the data I have so far (admittedly, it's not quite fair to Scott Walker [R], who announced after I started collecting, so most of his early data shows no change) and extrapolated it. The following animation shows the total estimated number of Likes for each candidate running up until the election.

Interestingly, it shows that by the time of the election, we could expect Donald Trump (R) to be up against Sanders. However, because primaries happen earlier, we should look at where candidates are estimated to be around Super Tuesday. In that case, this model predicts Trump to have 8.2M Likes compared to Ben Carson's (R) 2.64M and Sanders' 2.56M. This is perhaps the simplest possible model for analyzing this data, but an interesting one nevertheless.