Stateful stream processing with Apache Flink

Stream processing is rapidly growing in adoption and expanding to more and more use cases as the technology matures. While in the early days stream processors were used to compute approximate aggregates, today’s stream processors are capable of powering accurate analytics applications and evaluating complex business logic on high-throughput streams. One of the most important aspects of stream processing is state handling—i.e., remembering past input and using it to influence the processing of future input.

In this article, the first of a two-part series, we will discuss stateful stream processing as a design pattern for real-time analytics and event-driven applications. As we’ll see, stream processing has evolved to support a huge array of use cases, spanning both analytical and transactional applications. We will also introduce Apache Flink and its features that have been specifically designed to run stateful streaming applications. The second article of our series will be more hands-on and show in detail how to implement stateful stream processing applications with Flink.

What is stateful stream processing?

Virtually all business-relevant data is produced as a stream of events. Sensor measurements, website clicks, interactions with mobile applications, database modifications, application and machine logs, stock trades and financial transactions… all of these operations are characterized by continuously generated data. In fact, there are very few bounded data sets that are produced all at once instead of being recorded from a stream of data.

If we look at how data is accessed and processed, we can identify two classes of problems: 1) use cases in which the data changes faster than the processing logic and 2) use cases in which the code or query changes faster than the data. While in the first scenario we are dealing with a stream processing problem, the latter case indicates a data exploration problem. Examples for data exploration use cases include offline data analysis, data mining, and data science tasks. The clear separation of data streaming and data exploration problems leads to the insight that the majority of all production applications address stream processing problems. However, only a few of them are implemented using stream processing technology today.

Leave a Reply

Your email address will not be published. Required fields are marked *