Real-Time Stream Processing in a Big Data Platform

Batch processing systems such as Hadoop have evolved and matured over the past few years into excellent offline data processing platforms for Big Data. Hadoop is a high-throughput system that can crunch a huge volume of data using a distributed parallel processing paradigm called MapReduce. But there are many use cases across various domains that require real-time or near real-time responses on Big Data for faster decision making, and Hadoop is not suitable for them. Credit card fraud analytics, network fault prediction from sensor data, security threat prediction, and so forth need to process real-time data streams on the fly to predict whether a given transaction is fraudulent, whether the system is developing a fault, or whether there is a security threat in the network. If decisions such as these are not taken in real time, the opportunity to mitigate the damage is lost.

Real-time systems perform analytics on short time windows, i.e., correlating and predicting event streams generated over the last few minutes. For better prediction capabilities, real-time systems often leverage batch processing systems such as Hadoop.

The heart of any prediction system is the model. There are various machine learning algorithms available for different types of prediction systems, and any prediction system has a higher probability of correctness if the model is built from good training samples. This model-building phase can be done offline. For instance, a credit card fraud prediction system could leverage a model built from previous credit card transaction data collected over a period of time. Imagine a credit card system for a given provider serving hundreds of thousands of users and generating millions of transactions over that period; we need a Hadoop-like system to process them. Once built, this model can be fed to a real-time system to detect any deviation in the real-time stream data.

 

                

In this article we will showcase how real-time analytics use cases can be solved using popular open source technologies that process real-time data in a fault-tolerant and distributed manner. The major challenges are that the system needs to be always available to process real-time feeds, it needs to be scalable enough to process hundreds of thousands of messages per second, and it needs to support a scale-out distributed architecture to process the stream in parallel.

Real-Time Data Processing Challenges

Real-time data processing challenges are complex. As we all know, Big Data is commonly characterized by the volume, velocity, and variety of the data, and Hadoop-like systems handle the volume and variety parts of it. Along with volume and variety, a real-time system needs to handle the velocity of the data as well, and handling the velocity of Big Data is not an easy task. First, the system should be able to collect the data generated by real-time event streams coming in at a rate of millions of events per second. Second, it needs to handle the parallel processing of this data as and when it is being collected. Third, it should perform event correlation using a Complex Event Processing engine to extract meaningful information from this moving stream. These three steps should happen in a fault-tolerant and distributed way. The real-time system should be a low-latency system so that the computation can happen very fast, with near real-time response capabilities.

 

 

To solve this complex real-time processing challenge, we have evaluated two popular open source technologies: Apache Kafka (http://kafka.apache.org/design.html), a distributed messaging system, and Storm (http://storm-project.net/), a distributed stream processing engine.

Storm and Kafka are the future of stream processing, and they are already in use at a number of high-profile companies including Groupon, Alibaba, The Weather Channel, and many more.

An idea born inside of Twitter, Storm is a “distributed real-time computation system”. Meanwhile, Kafka is a messaging system developed at LinkedIn to serve as the foundation for their activity stream and the data processing pipeline behind it.

With Storm and Kafka, you can conduct stream processing at linear scale, assured that every message is processed in a real-time, reliable manner. Storm and Kafka can handle data velocities of tens of thousands of messages every second.

Stream processing solutions such as Storm and Kafka have caught the attention of many enterprises due to their superior approach to ETL (extract, transform, and load) and data integration.

Let’s take a closer look at Kafka and Storm to see how they achieve parallelism and robust processing of stream data.

Distributed Messaging Architecture: Apache Kafka

Apache Kafka is a messaging system originally developed at LinkedIn for processing LinkedIn’s activity stream. Let’s first look at the issues with traditional messaging systems that caused people to seek an alternative.

Any messaging system has three major components: the message producer, the message consumer, and the message broker. Message producers and message consumers use a message queue for asynchronous inter-process communication. Any messaging system supports both point-to-point and publish/subscribe communication. In the point-to-point paradigm, the producer sends messages to a queue. There can be multiple consumers associated with the queue, but only one consumer can consume a given message from that queue. In the publish/subscribe paradigm, there can be multiple producers producing messages for a given entity called a topic, and there can be multiple consumers subscribed to that topic; each subscription receives a copy of each message sent to the topic. This differs from point-to-point communication, where one message is consumed by only one consumer. The message broker is the heart of the whole messaging system; it bridges the gap between the producer and the consumer.

A “durable message” is a message that the broker will hold on to if the subscriber is temporarily unavailable. Durability is therefore defined by the relationship between a “Topic Subscriber” and the “Broker”, and it applies only to the publish/subscribe paradigm.

A “persistent message”, by contrast, is defined by the relationship between a “Message Producer” and the “Broker”, and it can be established for both point-to-point and publish/subscribe communication. It concerns guaranteed delivery: the broker persists the message once it has been received from the message producer.

Both message durability and message persistence come with a cost. Keeping the state of a message (whether it has been consumed or not) is a tricky problem. Traditional messaging brokers keep track of consumer state: they maintain metadata about the messages and store that metadata in the broker. Storing metadata about billions of messages creates a large overhead for the broker, and on top of that, a relational database for storing the message metadata does not scale very well. The broker tries to keep the metadata small by deleting messages which have already been consumed, but then a challenging problem arises: how do the broker and the consumer agree that a given message has been consumed? Can a broker mark a message as consumed as soon as it puts the message on the network for delivery? What happens if the consumer is down by the time the message reaches it? To solve this, most messaging systems use an acknowledgement scheme: when a message is delivered by the broker it is marked as “sent”, and when the consumer consumes the message and sends an acknowledgement, the broker marks it as “consumed”. But what happens if the consumer actually consumed the message, yet a network failure occurred before the acknowledgement reached the broker? The broker will still keep the state as “sent”, not “consumed”, and if it resends the message, the message will be consumed twice. The bigger problem is the performance of the broker, which must now keep track of each and every message. Imagine the overhead on the broker when thousands of messages are being produced every second. This is a major reason why traditional messaging systems are not able to scale beyond a certain limit.

The following is how Kafka solved these problems.

          

Messaging System: The Kafka Way

The figure below shows how different types of producers can communicate with different types of consumers through the Kafka broker.

[Figure: Producers and consumers communicating through the Kafka broker]

This happens fairly naturally for brokers and producers, but consumers require particular support. Each consumer process belongs to a consumer group, and each message is delivered to exactly one process within every consumer group. Hence, a consumer group allows many processes or machines to logically act as a single consumer. The concept of a consumer group is very powerful and can be used to support the semantics of either a queue or a topic as found in JMS. To support queue semantics, we put all consumers in a single consumer group, in which case each message goes to a single consumer. To support topic semantics, each consumer is put in its own consumer group, so that all consumers receive each message. Kafka has the added benefit, in the case of large data, that no matter how many consumers a topic has, a message is stored only a single time.
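
To make the queue-versus-topic distinction concrete, here is a minimal consumer sketch using the Kafka Java client. The broker address, the topic name "transactions", and the group id are illustrative assumptions, and the client API has changed across Kafka versions, so treat this as a sketch rather than a drop-in program.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class GroupSemanticsExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            // Queue semantics: give every process the SAME group.id, so each
            // message is handled by exactly one member of the group.
            // Topic (pub/sub) semantics: give every process its OWN group.id,
            // so each of them receives every message.
            props.put("group.id", "fraud-detectors");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("transactions"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }

Running several copies of this process with the same group.id spreads the topic's partitions across them like a queue; giving each copy its own group.id instead turns the same topic into a broadcast.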

 

The overall architecture of Kafka is shown below. Kafka is distributed in nature: a Kafka cluster typically consists of multiple brokers. To balance load, a topic is divided into multiple partitions, and each broker stores one or more of those partitions. Multiple producers and consumers can publish and retrieve messages at the same time.
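
The following is a minimal producer sketch, again using the Kafka Java client with an assumed local broker and a hypothetical "sensor-events" topic. The message key is what Kafka hashes to choose a partition, so records sharing a key stay ordered within one partition while different keys spread load across the cluster.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SensorEventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker list
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Messages with the same key always land in the same partition,
                // preserving per-key ordering while partitions spread the load
                // across the brokers in the cluster.
                producer.send(new ProducerRecord<>("sensor-events", "segment-42",
                    "{\"temperature\": 71.3, \"ts\": 1690000000}"));
            }
        }
    }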

Kafka relies heavily on the file system for storing and caching messages. There is a general perception that “disks are slow” which makes people skeptical that a persistent structure can offer competitive performance. In fact, disks are both much slower and much faster than people expect depending on how they are used; a properly designed disk structure can often be as fast as the network.

This suggests a design which is very simple: rather than maintain as much as possible in-memory and flush to the file system only when necessary, Kafka inverts that. All data is immediately written to a persistent log on the file system without any call to flush the data. In effect, this just means that it is transferred into the kernel’s page cache where the OS can flush it later.

Kafka has a very simple storage layout. Each partition of a topic corresponds to a logical log. Physically, a log is implemented as a set of segment files of approximately the same size (e.g., 1 GB). Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file. For better performance, Kafka flushes the segment files to disk only after a configurable number of messages have been published or a certain amount of time has elapsed. A message is only exposed to consumers after it is flushed.
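
The segment size and flush thresholds are ordinary broker settings. A sketch of the relevant entries in the broker's server.properties might look like the following; the property names are taken from Kafka's broker configuration, and the values are purely illustrative, not recommendations.

    # server.properties (broker configuration) -- illustrative values only
    # Roll to a new segment file once the current one reaches ~1 GB.
    log.segment.bytes=1073741824
    # Force a flush after this many messages have been appended to a log...
    log.flush.interval.messages=10000
    # ...or after this much time has passed, whichever comes first.
    log.flush.interval.ms=1000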

Unlike typical messaging systems, a message stored in Kafka doesn’t have an explicit message ID. Instead, each message is addressed by its logical offset in the log. This avoids the overhead of maintaining auxiliary, seek-intensive random-access index structures that map the message IDs to the actual message locations.

If the messaging system is designed around this kind of read-ahead, write-behind design, how does Kafka address the consumer state problem we defined earlier, i.e., how does Kafka keep track of which messages have been “sent” or “consumed”? The fact of the matter is, the Kafka broker never keeps track of this. In Kafka, it is the consumer that keeps track of the messages it has consumed. The consumer maintains something like a watermark that tells which offset in the log segment has been consumed. A consumer always consumes messages from a particular partition sequentially, so if the consumer acknowledges a particular message offset, it implies that the consumer has received all messages prior to that offset in the partition. This gives the consumer the flexibility to consume older messages again by lowering the watermark. Typically, the consumer stores this state information in ZooKeeper, which serves as a distributed consensus service. Alternatively, the consumer can maintain the watermark in any data structure it wishes, depending on the consumer. For instance, if Hadoop is consuming messages from Kafka, it can store the watermark value in HDFS.
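
Below is a minimal sketch of this "consumer owns the watermark" idea using the Kafka Java client's manual partition assignment. The topic, partition number, and persistence helpers are hypothetical stand-ins; a real deployment would persist the watermark in ZooKeeper, HDFS, or some other durable store.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class WatermarkConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            props.put("enable.auto.commit", "false");          // we own the watermark ourselves
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            TopicPartition partition = new TopicPartition("sensor-events", 0);
            long watermark = loadWatermark();   // e.g. from ZooKeeper, HDFS, or a local file

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.assign(Collections.singletonList(partition));
                consumer.seek(partition, watermark);  // rewind or fast-forward at will
                while (true) {
                    for (ConsumerRecord<String, String> record :
                            consumer.poll(Duration.ofMillis(500))) {
                        process(record.value());
                        watermark = record.offset() + 1;  // everything before this is consumed
                        saveWatermark(watermark);
                    }
                }
            }
        }

        // Hypothetical persistence helpers; a real system would write to ZooKeeper or HDFS.
        static long loadWatermark() { return 0L; }
        static void saveWatermark(long offset) { /* persist externally */ }
        static void process(String value) { System.out.println(value); }
    }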

Along with the architectural details mentioned above, Kafka also has many advanced configurations such as Topic Partitioning, Automatic Load Balancing, and so on. More advanced details can be found on the Kafka website.

This design makes Kafka highly scalable, able to process millions of messages per second. The producer of real-time data can write messages into the Kafka cluster, and the real-time consumer can read the messages from the Kafka cluster for parallel processing.

Apache Storm is one such distributed real-time parallel processing platform, developed by Twitter. With Storm and Kafka, one can conduct stream processing at linear scale, assured that every message is reliably processed in real time. Together they can handle data velocities of tens of thousands of messages every second.

Let us now look at the Storm architecture and how it can connect to Kafka for real time processing.

Real-Time Data Processing Platform: Storm

Storm is a free and open source distributed real time computation system. Storm has many use cases: real time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast; a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

The concept behind Storm is somewhat similar to Hadoop: in a Hadoop cluster you run MapReduce jobs, while in a Storm cluster you run topologies. The core abstraction in Storm is the “stream”. A stream is an unbounded sequence of tuples, as shown below.

[Figure: A stream as an unbounded sequence of tuples]

Storm topologies are combinations of spouts and bolts. Spouts are where the data stream is injected into the topology. Bolts process the streams that are piped into them; a bolt can consume data from spouts or from other bolts. Storm takes care of running spouts and bolts in parallel and moving data between them, as the sketch below illustrates.
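
The following is a hedged sketch of such a topology against the Apache Storm Java API (package names changed from backtype.storm to org.apache.storm when the project moved to Apache, so adjust for your version). The spout here fabricates random sensor readings rather than reading from Kafka, and the threshold value is purely illustrative.

    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class SensorTopology {

        // Spout: injects a stream of (fabricated) sensor readings into the topology.
        public static class SensorSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            public void open(Map<String, Object> conf, TopologyContext ctx,
                             SpoutOutputCollector collector) {
                this.collector = collector;
            }
            public void nextTuple() {
                // In a real deployment this would read from Kafka (e.g. via a Kafka spout).
                collector.emit(new Values("segment-42", Math.random() * 100));
            }
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("segment", "reading"));
            }
        }

        // Bolt: processes each tuple as it flows through the topology.
        public static class ThresholdBolt extends BaseBasicBolt {
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                double reading = tuple.getDoubleByField("reading");
                if (reading > 90.0) {
                    System.out.println("Possible fault on " + tuple.getStringByField("segment"));
                }
            }
            public void declareOutputFields(OutputFieldsDeclarer declarer) { /* terminal bolt */ }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("sensor-spout", new SensorSpout(), 2);    // 2 parallel spout tasks
            builder.setBolt("threshold-bolt", new ThresholdBolt(), 4)  // 4 parallel bolt tasks
                   .shuffleGrouping("sensor-spout");

            // Local test run; a production topology would be submitted to a Storm cluster.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("sensor-topology", new Config(), builder.createTopology());
            Thread.sleep(10_000);
            cluster.shutdown();
        }
    }

The parallelism hints on setSpout and setBolt are what give the topology its scale-out behavior: Storm distributes those task instances across the cluster and routes tuples between them.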

 

The Offline Process: Model Building

Failure prediction is one of the major use cases in real-time analytics. As mentioned earlier, to detect failures in a given stream of sensor data, we first need to define normal behavior; for this we need to build a model from the historical sensor data. To build the model, we can use an offline batch processing system such as Hadoop to extract, transform, and load historical sensor records from logs. For offline data processing, we used Hadoop to pull the historical data stored in the Kafka cluster, partition the log records, and extract the correct sequence of events for a given network segment. Once this model is built, it is used in the real-time system (Storm/Kafka) for fault prediction. A representative depiction of the architecture is shown below.

[Figure: Overall architecture: Kafka feeding Storm for online prediction, with Hadoop building the model offline]
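
As a rough illustration of the offline step only, here is a minimal MapReduce sketch that groups historical sensor log lines by network segment. The tab-separated record layout, class names, and input/output paths are assumptions; a real model-building job would do considerably more (sorting by timestamp, feature extraction, and the actual model fitting).

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Groups historical sensor log lines by network segment so that a per-segment
    // sequence of events can be extracted and fed into the model-building step.
    public class EventSequenceJob {

        public static class SegmentMapper extends Mapper<Object, Text, Text, Text> {
            protected void map(Object key, Text line, Context ctx)
                    throws IOException, InterruptedException {
                // Assumed record layout: segmentId \t timestamp \t event
                String[] fields = line.toString().split("\t", 3);
                if (fields.length == 3) {
                    ctx.write(new Text(fields[0]), new Text(fields[1] + "\t" + fields[2]));
                }
            }
        }

        public static class SequenceReducer extends Reducer<Text, Text, Text, Text> {
            protected void reduce(Text segment, Iterable<Text> events, Context ctx)
                    throws IOException, InterruptedException {
                StringBuilder sequence = new StringBuilder();
                for (Text event : events) {         // a real job would sort by timestamp here
                    sequence.append(event).append(';');
                }
                ctx.write(segment, new Text(sequence.toString()));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "event-sequence");
            job.setJarByClass(EventSequenceJob.class);
            job.setMapperClass(SegmentMapper.class);
            job.setReducerClass(SequenceReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // historical logs pulled from Kafka
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // per-segment event sequences
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }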

 

  

Reference: Dibyendu Bhattacharya and Manidipa Mitra, “Analytics on Big Fast Data Using Real Time Stream Data Processing Architecture”.
