Wednesday 16 March 2016

Spark & streaming: first impressions

I recently participated in a 24-hour company hackathon with two colleagues. We used Spark Streaming to do near real-time processing of production data, with plain old MySQL as a real-time "session" data store. We even managed to bolt on a machine learning algorithm using Spark MLlib, courtesy of yours truly.

Spark turned out to be amazingly easy to use and it performed really well for our use case. We did ten second microbatches, which gave us near real-time data processing, and we could also get near real-time metrics and statistics as well with statsd, graphite and d3.js.

For the processing, our division of labour was fairly standard, with three separate Spark workflows:

1) The main processor continuously reads production data from a suitable data source, in real time, and processes it in ten-second chunks. Data "sessions" are created and updated, using a single MySQL table for storage. As soon as we have a minimum amount of data available, we also calculate a prediction for the future "outcome" of each data bunch, based on a previously learned model (see below). Real-time stats and metrics are sent to statsd as data comes in.

2) The history processor runs periodically every X seconds. It connects to MySQL, selects all recently completed data bunches from the main MySQL table and processes them. A completed bunch is simply one whose data has not been updated for some specific amount of time, say 15 minutes.

For each completed bunch the final stats (sums, counts etc) are calculated and sent on to statsd and the bunch is then moved into the history table. This way we can keep the live table at a manageable size. In addition, for each finished bunch we also check how well our previous prediction held up, i.e. whether our prediction was the same as the actual final outcome.

3) The model builder runs periodically every Z minutes. It connects to MySQL, selects a random sample of recent historical data bunches from the history table, and trains a machine learning model on them. We used a random forest for our predictions. Basically, I wrote a bunch of code to turn our historical samples into vectors of doubles (taking categorical variables into account as well), configured the MLlib random forest learner with proper parameters, and that was that. After the model is trained it is simply saved to disk and then used at step 1) by the main processor. With very little work I was able to get a prediction accuracy that was many times better than random guessing and clearly worthwhile.

All in all, initial impressions of Spark were very positive. It made building a relatively non-trivial pipeline like the above super easy. For our processing we don't need to do any joins or other complicated things, which does make things a bit easier; nevertheless with Spark you get scaling, redundancy and failover out of the box, which will help a lot with future-proofing. The ease of MLlib and overall the amount of attention being paid to proper working and scaling of all the tools and libraries is just really nice. Spark Streaming works very nicely as well, and according to several smart people seems to be a very good solution for streaming in general.

MySQL in contrast works really well on just one beefy machine (if you have enough RAM and a proper SSD), in our case handling up to 1000 read-write requests per second with a live table of around 300-600k items at any given time. But high-availability and failover is a bit harder to do, as is scaling. Since this was just a random hackathon we "solved" those problems by simply ignoring them. The value of MySQL was that we could have multiple indices on a table and thus do lookups based both on the key and the recentness of the latest data, which is more annoying to do with something like Cassandra or indeed with Spark itself. So MySQL worked well for us in this limited use case, but future-proofing it would be noticeably harder than with Spark.

Need to do more things with Spark in the future.