Data Stream Processing
Recently I wanted to check out so-called Data Stream Processing algorithms.
So first, what is Stream Processing and why do we need specific technologies for that?
Stream Processing is:
- the process of extracting knowledge structures from continuous, rapid and unlimited data
- finding statistically relevant patterns between data examples whose values are delivered in a sequence
However, Stream Processing has to provide results immediately, without processing tons of data on top of a Big Data solution with a slow MapReduce job that runs once a day. We are talking about real-time data, so results are changing anyway (see “Why you don’t want real-time analytics to be exact”) and therefore don’t need to be exact; they “only” need to estimate things.
Some example use cases:
- Determining the number of unique elements (very interesting with a limited amount of memory and no persistence)
- Recommendation Engines (e.g. real-time calculation of certain user-specific vectors that can be used for the next page delivery)
- Estimating Top Lists
- Sensor Data (reacting to unusual data)
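To get a feeling for the first use case, here is a minimal sketch of linear counting, one of the simplest distinct-count estimators: a fixed-size bitmap replaces the full set of seen elements. This is an illustrative sketch of the general technique, not code from Stream-Lib; the class name and parameters are made up.

```java
import java.util.BitSet;

/** Linear counting: estimates the number of distinct elements with a fixed-size bitmap. */
class LinearCounter {
    private final BitSet bitmap;
    private final int m;

    LinearCounter(int m) { this.m = m; this.bitmap = new BitSet(m); }

    void offer(String item) {
        // Map the item's hash to one of m buckets; collisions are expected and
        // are corrected for by the estimator below.
        int bucket = Math.floorMod(item.hashCode(), m);
        bitmap.set(bucket);
    }

    long cardinality() {
        int zeros = m - bitmap.cardinality();
        if (zeros == 0) return m; // bitmap saturated, estimate unreliable
        // Standard linear-counting estimator: n ≈ -m * ln(zeros / m)
        return Math.round(-m * Math.log((double) zeros / m));
    }
}
```

Note that duplicates cost nothing: offering the same element twice sets the same bit, which is exactly why no persistence of the seen elements is needed.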
What is so special? Stream Processing needs to be fast (limited processing time) and distributed, with nearly no persistence and a low memory footprint. It can’t rely on querying arbitrary stored data; that would be too slow. An algorithm for Stream Processing is significantly more complex: it deals only with recent data, is usually incremental, and works with things like hashes and estimations.
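As a tiny illustration of “incremental”: a running mean and variance can be maintained in O(1) memory with Welford’s update, without ever storing the stream. This is a generic textbook sketch, not code from any of the libraries mentioned here.

```java
/** Incremental mean/variance: one pass, O(1) memory, no stored data. */
class RunningStats {
    private long n = 0;
    private double mean = 0.0, m2 = 0.0;

    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;          // update the mean incrementally
        m2 += delta * (x - mean);   // Welford's update for the variance term
    }

    double mean() { return mean; }
    double variance() { return n > 1 ? m2 / (n - 1) : 0.0; }
}
```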
Great sources of streaming algorithms can be found here:
- Stream-Lib (addthis: “We have endeavored to create useful implementations from iterating over the existing academic literature”)
- Storm Starter Tools (not really comparable to the one above, but shows some concepts of Sliding Window and Moving Average implementations)
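A Moving Average over a Sliding Window boils down to a ring buffer plus a running sum, roughly like this (a sketch of the general idea, not the Storm Starter code):

```java
/** Fixed-size sliding-window moving average using a ring buffer. */
class MovingAverage {
    private final double[] window;
    private int next = 0, count = 0;
    private double sum = 0.0;

    MovingAverage(int size) { window = new double[size]; }

    double add(double value) {
        if (count == window.length) {
            sum -= window[next];  // window full: evict the oldest value
        } else {
            count++;
        }
        window[next] = value;
        sum += value;
        next = (next + 1) % window.length;
        return sum / count;       // average over the current window
    }
}
```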
But there are probably many more papers and concepts waiting for an implementation. If you want to get an impression of how to implement a paper, this is a good source: Data Streams as Random Permutations: the Distinct Element Problem, a concept that is actually implemented here: Recordinality.
Or this one: Efficient Computation of Frequent and Top-k Elements in Data Streams, which is implemented here: StreamSummary.
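The core idea of the Space-Saving algorithm behind StreamSummary fits in a few lines: keep at most k counters, and when a new element arrives while the table is full, evict the smallest counter and let the newcomer inherit its count. This is a deliberately simplified sketch; Stream-Lib’s StreamSummary uses a more efficient linked structure and also tracks error bounds.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Simplified Space-Saving: at most k counters, evict the minimum when full. */
class SpaceSaving {
    private final int k;
    private final Map<String, Long> counters = new HashMap<>();

    SpaceSaving(int k) { this.k = k; }

    void offer(String item) {
        if (counters.containsKey(item)) {
            counters.merge(item, 1L, Long::sum);      // known item: just count
        } else if (counters.size() < k) {
            counters.put(item, 1L);                   // room left: start a counter
        } else {
            // Table full: replace the current minimum; the newcomer inherits
            // its count + 1, which makes counts overestimates, never underestimates.
            String minKey = Collections.min(counters.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            long minCount = counters.remove(minKey);
            counters.put(item, minCount + 1);
        }
    }

    Map<String, Long> topK() { return counters; }
}
```

The memory footprint is bounded by k no matter how many distinct elements the stream contains, which is exactly the property a Top-k language counter over the Twitter stream needs.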
So the algorithms are clear. What next? What should a system for Stream Processing look like? Well, there are many.
So what should be used? Well, S4 seems to be dead, Flink is maybe too new, and Storm is already well known (we are using Storm in production). Akka? Always a good candidate, although I’m unable to predict whether Typesafe can compete against Samza, Storm, Spark and the companies behind them; all of them already have a good Scala API or are even written in Scala. My best guess would be “No”. Samza is a complex system and uses YARN; the same goes for Flink.
Vert.x has the following features:
- A Distributed Event Bus to wire components together
- WebSocket support for real-time server-push applications
- A powerful encapsulated module system
- And much more…
Promising, isn’t it?
So what could be a minimum viable test case for Vert.x?
- Consuming the Twitter data feed
- Estimating a Top-k hit-list of the most used languages
- A WebSocket to provide real-time updates
- Showing a bar chart in the browser
I do not want to go into detail; I hope the code is self-explanatory (well, I am sure it is :). For the Twitter connection I am using the work of a great Streaming Data Hackathon at Berlin Buzzwords 2014: bbuzz14-stream-mining.
Regarding Vert.x, I implemented four verticles (the packages of code that Vert.x executes are called verticles), written in Scala:
- MainVerticle for executing the following three verticles
- TwitterVerticle for connecting to Twitter and publishing the tweet stream to the event bus
- LanguageCountVerticle for counting Top-k languages (Stream-Lib: StreamSummary)
- WebSocketVerticle for sending the result to connected WebSockets
I think the way a case like this can be implemented in Vert.x is really smart. The event bus wires verticles together; even a connected WebSocket can be triggered by publishing an event to a specific WebSocket ID. Executing it is straightforward as well. Simply type
vertx runmod eu.fakod~my-twitter-module~1.0-SNAPSHOT -conf config.json
The configuration can be provided by simply adding the -conf parameter, which is very flexible.
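To illustrate what such a configuration could look like, here is a hypothetical config.json; every key here is made up for illustration, as the actual keys depend on how the verticles read their settings:

```json
{
  "twitter_consumer_key": "YOUR_KEY",
  "twitter_consumer_secret": "YOUR_SECRET",
  "websocket_port": 8080,
  "topk_size": 10
}
```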
I hope this code is a good starting point for your own investigations. So have fun…