Tuesday, October 20, 2015

Wolfram Technology Conference

Today I presented my work on this blog at the Wolfram Technology Conference. So that happened.

I'm a part of the Wolfram Student Ambassador Program and was invited to share the sorts of things I've been doing. I'd been a nervous wreck about it for weeks. The people at the conference are industry leaders, many of whom have been working in Mathematica for many years (compared to my ~3-4 years). But y'know what?

It was great!

I think roughly 60 people attended my talk, and it went really well. *phew* Several people came up to me after the talk to say how much they enjoyed it, and that was about the biggest self-esteem boost I've had since getting into graduate school.

The slides for the talk are available here:
PDF: https://dl.dropboxusercontent.com/u/4972364/WTC_2015/Presentation_beta.nb.pdf
Mathematica Notebook: https://dl.dropboxusercontent.com/u/4972364/WTC_2015/Presentation_beta.nb
(needs the following GIF: https://dl.dropboxusercontent.com/u/4972364/WTC_2015/blog.gif)

The Mathematica Notebook version (if you own a copy of Mathematica) is preferred -- the formatting is better and it's fully interactive. In either case, it gives you some insight into how I create the visualizations for this blog and some of the technology behind it -- specifically the Wolfram Data Drop, the Manipulate function, and CloudDeploy.

I'll have some updates for interactive things over the next couple weeks. Going to make it a lot easier for you to play with the data on your own. I'll also have some other updates from the coding side of things.

Some questions from the Q&A are worth mentioning here, along with things I'd like to do in the future:

Have you thought about using Twitter / Google Trends?
Short answer: yes, but not yet. I'm definitely hoping to analyze these sorts of data streams as well, seeing how well they match up to each other and how well they match up to poll-based public opinion estimates.

What about the selection bias of Facebook?
This (and questions like it) points to a very real flaw (or at least a major assumption) inherent in this analysis: Facebook, Twitter, and even Google offer a skewed representation of the population. A lot of likely voters simply aren't going to have a Facebook page with much information on it. Furthermore, just because someone follows a candidate doesn't mean they'll vote for them, nor does not following a candidate indicate a lack of support. I'm hoping to look at past elections and find similar Facebook Like data to see how well these sorts of things actually match up. If there's enough overlap, maybe I can make accurate predictions in spite of the biases, or at least try to correct for them.

(not asked, but something I mentioned) What about extrapolation? Surely a linear model is inappropriate here.
Absolutely. Right now, I'm making very naive predictions that are surely wrong. That's not a statement made out of false humility. It's really a quite stupid way of looking at things. That doesn't make it bad. Just very likely to be inaccurate as time goes on. The question is how to bring in more sophisticated models (specifically regarding discrete events like the debates). The computation is straightforward. The challenge is the theory. And honestly, I don't (yet) have a ton of experience with this sort of data. So, it's coming down the pipeline, just maybe not for a while.
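For concreteness, the "very naive" prediction is basically a straight-line fit read forward in time, something like the sketch below (the numbers are made up for illustration):

likes = {{0, 3.00*^6}, {1, 3.04*^6}, {2, 3.11*^6}, {3, 3.20*^6}};   (* {day, total Likes}, made-up data *)
model = LinearModelFit[likes, t, t];                                 (* fit a straight line *)
model[30]   (* extrapolate a month out -- exactly where a straight line stops being believable *)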

Anyway, I need to get some sleep for the conference tomorrow, followed by flights, working on my neuroscience homework, and studying for Thursday's midterms. Oy.

Tuesday, October 6, 2015

Carson Leads; Second Debate; Interactive Graph

Howdy! It's been a while. Sorry about that. Turns out grad school is kinda hard. Who knew?

There's a lot of data to talk about (if you want graphs, just scroll down a bit to the Graphs and Stuff section), but I wanted to say a bit about some of the code changes I've made and why I think they're worth noting, plus just tell more of a story about the project.

Code

A lot of my time for this blog is spent trying to make new and better visualization tools for this data set. Part of it comes from learning new tools in Mathematica; part of it comes from more software engineering -- how can I build this so that, come January, I can still effectively use the same code? That takes some forethought. As the poster on my high school computer science teacher's wall said, "weeks of programming can save hours of planning." (Yes, Mr. Martin, I still think about this regularly when I code and quote it to the students I tutor.)

Being my own project manager (+ y'all)

As the sole programmer on this project, I know all the flaws of my system. I know that before this post, I hadn't included John Kasich (R) in my results even though I was monitoring his page. I know that it was impossible to understand what was going on in some of my plots. I know I wasn't handling some special cases well. I know I couldn't highlight a particular candidate.

Figuring out these problems is one thing. Addressing them is another. Although I usually end up having long coding sessions to fix multiple things, I have to do so systematically. That means determining what's critical, what will cause changes "downstream," and so forth. For example, I ran into one such problem for this post: what happens if the data in a Databin in the Wolfram Language isn't in chronological order? I include a timestamp in each submission I make to the Wolfram Data Drop, but the fact is, the rows in a Databin are ordered by submission time, not by the manual timestamp (although the two are treated the same in many places).

Why is that a problem? A lot of the code in the project relies on neighboring rows being sequential (especially for finding the differences between days and things like that). So when I had to re-create some submissions to properly handle Kasich and account for his missing data (assume -1 or 0 everywhere, including percent change, until real daily values can be computed), that left me with a new problem: how to compensate for rows that are no longer in chronological order. Fortunately, I was able to deal with this (and with sorting names by last name in some places) upfront. Without planning how to do this effectively, I would probably still be editing code rather than making a structured fix. Planning > hacking, at least if you care about the quality of the code next week.
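A minimal sketch of that kind of fix -- re-sorting the entries by the stored timestamp before taking differences. The bin ID and field names below are placeholders, not the real schema:

entries = Normal[Databin["xxxxxxxx"]];           (* entries come back in submission order *)
sorted  = SortBy[entries, #["timestamp"] &];     (* re-order by the recorded timestamp instead *)
daily   = Differences[#["likes"] & /@ sorted];   (* day-over-day changes now line up correctly *)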

I also want to respond to what you think about this blog. I hadn't been able to display daily results, and that's kinda annoying if you want to check in on things between my posts. I hadn't included Kasich (nor fixed Scott Walker's (R) data -- he also entered the race later), nor added the ability to highlight a candidate, on top of some other back-end updates that were needed. That's a lot. Carry on to see results.

Optional Parameters

Okay, super tech-ing out now. The Wolfram Language supports optional parameters. That sentence may make no sense to you. Let's talk about fast food instead. You pull up to your local Whataburger (I'm all about that honey butter chicken biscuit) and see all the choices you can order. You decide on a double cheeseburger. That could be the end of the story. But maybe you want to customize it. Maybe you want grilled onions. Oh, and you should put double lettuce. And substitute ketchup for mustard. You took the main idea then added a lot of options (admittedly fairly structured ones -- you're not going to be able to add a side of chicken enchiladas with green sauce).

The Wolfram Language (WL, which I'm using for this project) supports this sort of flexibility, including providing defaults (or guesses). For example, the same plot can be made with all the defaults or with many options specified:
Simple plot 
Highly customized plot
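Roughly the kind of contrast those two plots show (a throwaway example, not the actual slide code):

Plot[Sin[x], {x, 0, 2 Pi}]                                 (* all defaults *)
Plot[Sin[x], {x, 0, 2 Pi},
  PlotStyle -> Directive[Thick, Orange], Filling -> Axis,
  Frame -> True, FrameLabel -> {"x", "sin(x)"},
  PlotLabel -> "A highly customized plot", ImageSize -> Large]   (* many options specified *)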
As part of my project, I'm building a lot of functions specifically for my data. Particularly for the graphs of Likes versus time, there's a lot to be customized. There's color, there's whether to highlight a candidate, there's which candidates to plot, there's the plot range... and I'm sure I'll have more next time.

There's a cool pattern in WL to do this (there are others; this is the one I used successfully for some statistics-related functions I created as an exercise a few weeks ago). You can create the full function:
myNewFunction[x_, y_, z_] := ....
Then you can determine which variables should have options and default values:
myNewFunction[x_, OptionsPattern[{y -> 5, z -> 7}]] := myNewFunction[x, OptionValue[y], OptionValue[z]]
This creates a new version that requires a value for x and assumes y = 5 and z = 7. However, if the user supplies new values (myNewFunction[2, y -> 10]), the calculation is updated accordingly. Essentially, it's nice to have the first version, where the user specifies all the values, so you don't get confused (as the person writing the function), and it's nice to have the second as a user because it's more flexible and "smart." This pattern gives the best of both worlds: any functionality changes only have to be made once, and any new options are easy to incorporate into both versions.
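To see the mechanics end to end, here's the same pattern with a stand-in body (x + y*z is just a hypothetical computation so the calls return something):

myNewFunction[x_, y_, z_] := x + y*z                              (* stand-in for the full computation *)
myNewFunction[x_, OptionsPattern[{y -> 5, z -> 7}]] :=
  myNewFunction[x, OptionValue[y], OptionValue[z]]                (* optional-parameter version *)

myNewFunction[2]             (* uses both defaults: 2 + 5*7 = 37 *)
myNewFunction[2, y -> 10]    (* overrides y:        2 + 10*7 = 72 *)
myNewFunction[2, 10, 3]      (* fully explicit:     2 + 10*3 = 32 *)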

In doing so with my time series graphing function, I made it so I can selectively customize options for the plot without having to change the function or write out every single parameter I have at my disposal. TL;DR: I can change things more easily to create prettier plots faster.
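As a rough illustration of what that buys me (this is a made-up sketch, not my actual graphing function -- the name likesPlot, the option names, and the assumed data shape of candidate -> {{date, likes}, ...} are all hypothetical):

likesPlot[data_Association, OptionsPattern[{"Highlight" -> None, "Candidates" -> All}]] :=
  Module[{names, styles},
    names = If[OptionValue["Candidates"] === All, Keys[data], OptionValue["Candidates"]];
    (* the highlighted candidate gets a thick red line; everyone else stays gray *)
    styles = If[# === OptionValue["Highlight"], Directive[Thick, Red], Gray] & /@ names;
    DateListPlot[Lookup[data, names], PlotStyle -> styles, PlotLegends -> names]];

likesPlot[likeData, "Highlight" -> "Ben Carson"]   (* likeData: candidate -> {{date, likes}, ...} *)

With something like that in place, adding a new plotting option later only means touching the OptionsPattern list and the one spot that uses it.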

Graphs and Stuff

Here are the actual results. (I apologize that I don't have data for a few days due to some technical issues)

Ben Carson (R) Leads All Candidates

Last time, Donald Trump (R) had full control over the political field. A challenger has arisen. Ben Carson (R) has rapidly risen in popularity on Facebook:
Oh hey, I can highlight candidates now.
Each vertical line indicates one of the televised GOP debates. Carson has skyrocketed since the first debate and even more so since the second one. He now has approximately 19% of the total number of Likes to Trump's 18% and third-place Rand Paul's (R) 9.5%.

In related news, sometime between September 23 and October 2 (during the gap in my data), he became the first candidate to hit 4M Facebook Likes.

Past Predictions

This brings me to some past predictions. I'll copy the predictions here and write the updates in bold. (Many of the events occurred on September 17 – just a few hours after the second GOP debate)
  • August 26: Clinton passes Perry at 1.2M Likes (Occurred on August 24 at 1.2M)
  • August 29: Rubio hits 1M Likes (Occurred on September 17 – a delay was predicted last time due to his waning FB support at the time)
  • September 1: Bush passes Santorum at 265K (Occurred on September 17 at 266K)
  • September 4: Sanders passes Cruz at 1.4M (Occurred September 17)
  • September 18: Bush passes Jindal at 286K (Not yet occurred)
  • September 29: Sanders passes Huckabee (R) at 1.8M (Not yet occurred)
  • September 29: Bush passes Walker at 300K (Not yet occurred)
  • October 13: Sanders passes Paul (R) at 2.1M (Not yet occurred)
So, my predictions were kinda bad. #LinearRegressionOnNonlinearData

I don't have new predictions yet, but here's the current state of affairs.
One of these days, I'll have this sort of thing compared with real polling numbers... or FEC filings. 

Interactive Graph!!!

Okay, so this took a lot of work, but I think I finally got it, folks! You can now make some of your own graphs with the power of the cloud (the Wolfram Cloud, specifically)!

Here's what it looks like: 
You select which type of data you want to look at, optionally filter by political party and highlight a particular candidate, then hit Submit! Give it a little while to process the data for you, and then voilà! You get your very own graph (such as the one earlier in this post), or these:

Absolute overnight change, Republicans only, Donald Trump highlighted
Percentage overnight change, Democrats only, Bernie Sanders highlighted

To create your own graphs, GO HERE https://wolfr.am/7mn~J_kq Enjoy!
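For the curious, the deployment side boils down to something like the sketch below -- the field names and choices are guesses, selectData is an imaginary helper that pulls the right series out of the Data Drop, and likesPlot is the made-up plotting function sketched earlier, so the real implementation surely differs:

(* hypothetical sketch of a FormFunction deployment; selectData and likesPlot are stand-ins *)
CloudDeploy[
  FormFunction[
    {"DataType"  -> {"Total Likes", "Overnight Change", "Percent Change"},
     "Party"     -> {"All", "Democratic", "Republican"},
     "Highlight" -> "String"},
    likesPlot[selectData[#DataType, #Party], "Highlight" -> #Highlight] &,
    "PNG"],
  Permissions -> "Public"]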