The code for this is available here.
There is a good architectural concept that defines where realtime analysis or stream processing belongs: the Lambda Architecture.
The Lambda Architecture defines a clear set of architectural principles for building robust and scalable data systems. It is based on three main design principles:
- human fault tolerance – the system must be resilient against data loss and data corruption caused by human error, because at scale such damage could be irreparable.
- data immutability – data is stored in its rawest form, immutable and in perpetuity.
- recomputation – given the two principles above, it is always possible to (re)compute results by running a function over the raw data.
In the case of realtime analysis we are talking about the Speed Layer. It compensates for the high latency of updates to the serving layer and deals with recent data only. (For more information, please read this or, of course, Nathan Marz’s Big Data book.)
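To make the recomputation principle concrete, here is a minimal, illustrative Java sketch (my own example, not part of any real system): the master dataset is an append-only, immutable collection of raw events, and any view is just a function applied to that raw data — so it can always be recomputed from scratch.

```java
import java.util.List;

public class Recomputation {
    // A view (here: page view count) is a pure function over the raw,
    // immutable master dataset -- it can be recomputed at any time.
    static long pageViews(List<String> rawEvents, String page) {
        return rawEvents.stream().filter(e -> e.equals(page)).count();
    }

    public static void main(String[] args) {
        // The master dataset: immutable, rawest form, kept forever.
        List<String> masterDataset = List.of("/home", "/search", "/home");
        System.out.println(pageViews(masterDataset, "/home")); // 2
    }
}
```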
Realtime Analysis Algorithms
There is a special set of concepts and ideas for computing results in a Speed Layer, called Stream Algorithms. They usually have to meet the following requirements:
- Should not use any persistence (it would be too slow)
- Runs in memory only (which is limited)
- Runs on a distributed system (we want a cluster)
- Uses a sliding window (no processing over the whole data set)
- Limited processing time per item
That’s what makes realtime analysis so special. Algorithms need to be fast and are not allowed to consume much memory or to rely on expensive persistence. To achieve that, their results are often estimates. For displaying immediate results this is fine in most cases; the data does not need to be perfectly accurate. For intrusion detection, e.g., a traffic-light indicator or a line chart is sufficient – it acts as a visual trigger for deeper and more accurate data analysis.
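A fixed-size sliding-window counter is a simple example of these constraints in action. This is a hypothetical Java sketch (not the project’s implementation): memory is bounded by the number of window slots, no matter how many events the stream carries.

```java
public class SlidingWindowCounter {
    private final long[] slots;   // one count per time window -- fixed memory
    private int head = 0;         // index of the current window

    SlidingWindowCounter(int windows) { this.slots = new long[windows]; }

    // O(1) per item -- limited processing time per event
    void increment() { slots[head]++; }

    // called by a timer each time a window closes; the oldest slot is reused
    void advanceWindow() {
        head = (head + 1) % slots.length;
        slots[head] = 0;
    }

    // total count over the sliding window, never over the whole data set
    long totalOverWindow() {
        long sum = 0;
        for (long c : slots) sum += c;
        return sum;
    }
}
```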
AddThis StreamLib is a great source of implementations of common realtime analysis algorithms.
For technical logging I recently switched to Kibana. But after a couple of sprints of development it gets stuffed with all sorts of log messages that merely signal purely technical issues, errors and warnings. It is not easy to extract business-relevant data and processes for realtime analysis from that. So I decided to add a separate log stream, which I call Business Logging. The Business Logging stream addresses specific business concerns and also serves as a data source for my BigData solution – that is, as a source for both the Speed Layer and the Batch Layer.
I decided to use some of LinkedIn’s tools, so:
- Added a Business Logging Scala Trait to every relevant Service or Component
- Used an Avro based serialization to keep data small (helpful as well, that nearly every Hadoop/BigData tool supports Avro)
- Every Service/Component sends Business Logging messages to Kafka under a specific topic (e.g. search term data to topic eu.fakod.app.search.term)
- LinkedIn’s Camus is used to store this data to HDFS (see also my Camus Sweeper post on dealing with the Hadoop small file problem)
- Vertx, as the realtime analysis component, also consumes and processes these messages
- A Vertx Verticle provides a WebSocket that can be consumed by a D3.js based dashboard
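To illustrate the producer side of this pipeline, here is a hypothetical Java sketch of a business-log message. The class and field names are my own invention, and the real setup serializes with Avro rather than the plain strings used here; only the topic naming convention (eu.fakod.app.search.term) comes from the setup above.

```java
import java.time.Instant;

// Illustrative only: a business-log event destined for a Kafka topic.
public class BusinessLogEvent {
    final String topic;      // e.g. "eu.fakod.app.search.term"
    final String payload;    // e.g. the search term entered by the user
    final Instant timestamp; // when the business event occurred

    BusinessLogEvent(String topic, String payload) {
        this.topic = topic;
        this.payload = payload;
        this.timestamp = Instant.now();
    }

    // builds a topic name following the convention from the post
    static String topicFor(String... parts) {
        return "eu.fakod.app." + String.join(".", parts);
    }
}
```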
I open sourced the Vertx part of this setup. I think it is helpful for understanding how easy it is to implement valuable realtime analysis for business people.
In this project I am using:
- StreamLib: Great set of implemented stream processing algorithms
- Top-k: Efficient Computation of Frequent and Top-k Elements in Data Streams (paper)
- Adaptive Counting: Fast and accurate counting of unique elements (paper)
- Sliding Window: Only considering data in a certain time window
- MainVerticle: Starts the following Verticles
- KafkaVerticle: Kafka Consumer Verticle, consumes a configured set of Topics
- WebSocketVerticle: Manages frontend WebSocket connections
- DataVerticle: Processes (realtime analysis) the received messages and calculates the messages for the frontend
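The Top-k entry above refers to the Space-Saving algorithm from the cited paper, which StreamLib’s StreamSummary implements. A stripped-down, illustrative version of the core idea (the real class tracks error bounds and is far more efficient) looks like this: keep at most a fixed number of counters, and when a new item arrives while the table is full, evict the item with the smallest count and let the newcomer inherit that count.

```java
import java.util.*;
import java.util.stream.Collectors;

public class SpaceSaving {
    private final int capacity;                        // fixed memory bound
    private final Map<String, Long> counts = new HashMap<>();

    SpaceSaving(int capacity) { this.capacity = capacity; }

    void offer(String item) {
        if (counts.containsKey(item)) {
            counts.merge(item, 1L, Long::sum);
        } else if (counts.size() < capacity) {
            counts.put(item, 1L);
        } else {
            // evict the current minimum; the newcomer inherits its count + 1,
            // over-estimating the newcomer by at most the evicted count
            Map.Entry<String, Long> min = Collections.min(
                    counts.entrySet(), Map.Entry.<String, Long>comparingByValue());
            counts.remove(min.getKey());
            counts.put(item, min.getValue() + 1);
        }
    }

    List<String> topK(int k) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```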
Implemented Realtime Analysis Functionality:
- AccessCounter is used to simply count accesses
- TopKCounter uses Top-k (class StreamSummary) to calculate the top 50 search terms
- UniqueUserCounter uses Adaptive Counting (class AdaptiveCounting) to count unique users
- SlidingWindowCounter is used to implement the Sliding Window functionality
- Frontend Data contains:
- Time Windowed Top 50 Search Term Chart
- Access Chart. One value for every window (50 windows in total)
- Unique User Chart. Unique users are calculated for every time window; one value per window (50 windows in total)
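To show the kind of estimation AdaptiveCounting performs for the unique user chart, here is a hedged sketch of linear counting, one of the techniques it builds on (my own simplified illustration, not the StreamLib implementation): each user id is hashed into a fixed-size bitmap, and the number of distinct users is estimated from the fraction of bits still zero. Memory stays constant no matter how many events arrive in a window.

```java
import java.util.BitSet;

public class LinearCounter {
    private final int m;        // bitmap size in bits -- the only memory cost
    private final BitSet bits;

    LinearCounter(int m) { this.m = m; this.bits = new BitSet(m); }

    void offer(String userId) {
        // scramble the hash with an odd multiplier, then map to a bucket;
        // offering the same id twice sets the same bit, so duplicates are free
        int bucket = Math.floorMod(userId.hashCode() * 0x9E3779B9, m);
        bits.set(bucket);
    }

    long estimateUniques() {
        int zeros = m - bits.cardinality();
        if (zeros == 0) return m; // bitmap saturated; real impls switch schemes
        // linear counting estimator: n ~= -m * ln(zero fraction)
        return Math.round(-m * Math.log((double) zeros / m));
    }
}
```

Because only bits are stored, 100,000 events from 100 users cost exactly the same memory as 100 events — which is why this class of algorithm fits the Speed Layer constraints above.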