Stream processing is rapidly growing in adoption and expanding to more and more use cases as the technology matures. While in the early days stream processors were used to compute approximate aggregates, today’s stream processors are capable of powering accurate analytics applications and evaluating complex business logic on high-throughput streams. One of the most important aspects of stream processing is state handling—i.e., remembering past input and using it to influence the processing of future input.
In this article, the first of a two-part series, we will discuss stateful stream processing as a design pattern for real-time analytics and event-driven applications. As we’ll see, stream processing has evolved to support a huge array of use cases, spanning both analytical and transactional applications. We will also introduce Apache Flink and its features that have been specifically designed to run stateful streaming applications. The second article of our series will be more hands-on and show in detail how to implement stateful stream processing applications with Flink.
What is stateful stream processing?
Virtually all business-relevant data is produced as a stream of events. Sensor measurements, website clicks, interactions with mobile applications, database modifications, application and machine logs, stock trades and financial transactions… all of these operations are characterized by continuously generated data. In fact, there are very few bounded data sets that are produced all at once instead of being recorded from a stream of data.
If we look at how data is accessed and processed, we can identify two classes of problems: 1) use cases in which the data changes faster than the processing logic and 2) use cases in which the code or query changes faster than the data. While in the first scenario we are dealing with a stream processing problem, the latter case indicates a data exploration problem. Examples for data exploration use cases include offline data analysis, data mining, and data science tasks. The clear separation of data streaming and data exploration problems leads to the insight that the majority of all production applications address stream processing problems. However, only a few of them are implemented using stream processing technology today.
For example, there are many applications that continuously analyze data by collecting continuously produced data (in a database or data lake) and periodically running batch jobs on the recorded data. The results of data exploration tasks, such as machine learning pipelines or reporting queries, often become part of streaming applications when they are applied in a production environment. Stream processing problems are also not limited to analytical applications. Any kind of application that continuously processes input events essentially addresses a streaming problem, such as for example many applications that are implemented as microservices.
The first open source stream processors were not designed to target the broad scope of stream processing use cases. Instead, they focused on reducing the latency for analytical applications at the cost of accuracy. Since then, stream processing technology has evolved rapidly and matured significantly. Today, state-of-the-art open source stream processors, such as Apache Flink, can address a much wider range of use cases, including accurate, low-latency analytics and event-driven applications. The expansion of use cases was made possible mainly due to extended support for two important concepts, namely state and time. The amount of control that a stream processing framework exposes over state and time, and the features it offers to manage state and time, directly determine what kinds of applications can be built with and executed by the framework.
Time is an important aspect of stream processing because streams are unbounded and continuously serve events. Hence, many stream processing applications have inherent time semantics, such as applications that produce aggregates in intervals or evaluate temporal patterns on a sequence of events. Compared to batch processing applications that process bounded data sets, stream processing applications can often explicitly handle time in order to trade off the latency and completeness of results.
Stream processing applications ingest events as they arrive. Depending on the business logic, events or intermediate results need to be stored for later processing. Any kind of data that is remembered to perform a computation in the future belongs to the state of an application. You can also think of state as a local variable in a regular program. While the concept of state is easy to understand, the task of executing a stateful stream processing application in a distributed system is more challenging.
As everybody knows, failures are omnipresent in distributed systems. In contrast to early stream processors, which treated state as volatile or put it into external data stores, stateful stream processors maintain application state locally and guarantee its consistency. Therefore, a stateful stream processor needs to make sure that the state of an application is not lost in case of a failure and is consistently restored after the application has recovered from the failure. This means that an application must be able to continue processing as if the failure had never happened. In order to achieve good scalability, application state must be effectively distributed across all involved worker nodes and even re-distributed if the application is scaled out or in. Finally, all state management operations must be performed efficiently because application state can grow to several terabytes in size.
Use cases for stateful stream processing
Every application that continuously applies the same business or processing logic to a sequence of input events addresses a stream processing problem. Such applications are ubiquitous in business environments and the vast majority of them are stateful. While few such applications are currently implemented with stream processing frameworks, there are many classes of applications for which distributed and stateful stream processors could add significant value, such as improved latency, higher throughput, better scalability, improved consistency, and easier application lifecycle management. Among these types of applications are analytical queries, ETL jobs, and data pipelines, but also transactional applications and data-driven microservices.
At Data Artisans, we have worked with users and customers who have built stateful streaming applications for a wide range of use cases. Customers have used stateful stream processing to build event-driven back ends for web applications (such as social networks), dynamically configurable aggregation pipelines to analyze the behavior of mobile game users, and even internal and public services to define data pipelines with SQL and run ad-hoc SQL queries on data streams. Other use cases include fraud detection for financial transactions, monitoring and quality assessment of business processes, intrusion detection for computer networks, real-time dashboards to monitor the service quality of mobile networks, and anomaly detection on IoT data. Many of these use cases have been presented at one of the past Flink Forward conferences. Recordings and slides of most talks are available online.
Introducing Apache Flink
Apache Flink is a distributed data processor that has been specifically designed to run stateful computations over data streams. Its runtime is optimized for processing unbounded data streams as well as bounded data sets of any size. Flink is able to scale computations to thousands of cores and thereby process data streams with high throughput at low latency. Flink applications can be deployed to resource managers including Hadoop YARN, Apache Mesos, and Kubernetes or to stand-alone Flink clusters. Fault-tolerance is a very important aspect of Flink, as it is for any distributed system. Flink can operate in a highly-available mode without a single point of failure and can recover stateful applications from failures with exactly-once state consistency guarantees. Moreover, Flink offers many features to ease the operational aspects of running stream processing applications in production. It integrates nicely with existing logging and metrics infrastructure and provides a REST API to submit and control running applications.
Flink features multiple APIs with different trade-offs for expressiveness and conciseness to implement stream processing applications. The DataStream API is the basic API and provides familiar primitives found in other data-parallel processing frameworks such as map, flatMap, split, and union. These primitives are extended by common stream processing operations, as for example windowed aggregations, joins, and an operator for asynchronous requests against external data stores.
Flink’s ProcessFunctions are low-level interfaces that expose precise control over state and time. For example, a ProcessFunction can be implemented to store each received event in its state and register a timer for a future point in time. Later, when the timer fires, the function can retrieve the event and possibly other events from its state to perform a computation and emit a result. This fine-grained control over state and time enables a broad scope of applications.
Finally, Flink’s SQL support and Table API offer declarative interfaces to specify unified queries against streaming and batch sources. This means that the same query can be executed with the same semantics on a bounded data set and a stream of real-time events. Both ProcessFunctions and SQL queries can be seamlessly integrated with the DataStream API, which gives developers maximum flexibility in choosing the right API.
In addition to Flink’s core APIs, Flink features domain-specific libraries for graph processing and analytics as well as complex event processing (CEP). Flink’s CEP library provides an API to define and evaluate patterns on event streams. This pattern API can be used to monitor processes or raise alarms in case of unexpected sequences of events.
Streaming applications never run as isolated services. Instead they need to ingest and typically also emit event streams. Apache Flink provides a rich library of connectors for the most commonly used stream and storage systems. Applications can ingest streams from or publish streams to Apache Kafka and Amazon Kinesis. Streams can also be ingested by reading files as they appear in directories or be persisted by writing events to bucketed files. Flink supports a number of different file systems including HDFS, S3, and NFS. Moreover, Flink applications can “sink” data via JDBC (i.e. export it into a relational database) or insert it into Apache Cassandra and Elasticsearch.
Flink is powering business-critical applications in many companies around the world and across many industries, such as ecommerce, telecommunication, finance, gaming, and entertainment. Users are reporting applications that run on thousands of cores, maintain terabytes of state, and process billions of events per day. The open source community that is developing Flink is continuously growing and adding new users.
State management in Apache Flink
Many of Flink’s outstanding features revolve around the handling of application state. Flink always co-locates state and computation on the same machine, such that all state accesses are local. How the state is locally stored and accessed depends on the state back end that is configured for an application. State back ends are pluggable components. Flink provides implementations that store application state as objects on the JVM heap or serialized in RocksDB, an embedded, disk-based storage engine. While the heap-based back end provides in-memory performance, it is limited by the size of the available memory. In contrast, the RocksDB back end, which leverages RocksDB’s efficient LSM-tree-based disk storage, can easily maintain much larger state sizes.