
An Education in Streaming

Rion Williams

May 9, 2020 • 6 min read

I’m probably one of the few folks out there who love a good technical book. Sure, well-written blog posts, tutorials, and exploratory projects are amazing, but if you really want to dig deep, books can be a great resource.

Since I’ve fallen in love with streaming data systems, I thought it would be fitting to focus on books in that arena. Streaming itself seems to be very hit or miss in terms of understanding. It’s not always intuitive and can often go against what your previous experience has taught you to believe.


I’ve seen countless developers, incredibly smart folks, struggle with some of the most basic concepts behind streaming systems, and with this post I’m hoping to provide some “recommended readings” to help prevent that from happening to you.

So without further fanfare, let’s take a look at a few books (in no particular order) that might help you on your journey.

Kafka: The Definitive Guide

kafka-the-definitive-guide-cover

A perfect primer for Kafka that covers its origins, internals, and uses, and very much a recommended read before diving into it.

Kafka is probably the most ubiquitous technology when you think of streaming, so it’s a good one to understand. Who better to cover this topic than the folks at Confluent, a team composed of tons of Kafka aficionados and many members of the original team that built it?

It’s important to note that we are talking about Apache Kafka proper (not to be confused with Kafka Streams, which is an abstraction and framework built atop Kafka).

Kafka: The Definitive Guide is an excellent (and free) entry point into the world of Kafka. It’s an incredibly easy read, and provides a high-level overview of Kafka itself, the value it provides, and how it works in a nutshell. You'll learn about producers, consumers, brokers, clusters, partitions, and all of the other building blocks that make Kafka great, and all within the first chapter.
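To make a few of those building blocks a bit more concrete, here's a tiny, in-memory sketch of the idea behind keyed partitioning. It isn't Kafka code and doesn't use any Kafka client: ToyTopic, partition_for, and the three-partition count are all made up for illustration, and CRC32 is just a stand-in for the hashing a real producer would use. The point is only that records with the same key land in the same partition, so they are read back in the order they were written.

```python
import zlib

NUM_PARTITIONS = 3  # hypothetical partition count for this toy topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Pick a partition from the key (CRC32 is a stand-in, not Kafka's real hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class ToyTopic:
    """An in-memory stand-in for a topic: one append-only list per partition."""

    def __init__(self, num_partitions: int = NUM_PARTITIONS):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key: str, value: str):
        """Append a record to the partition chosen by its key; return (partition, offset)."""
        p = partition_for(key, len(self.partitions))
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def consume(self, partition: int, offset: int):
        """Read every record in a partition from the given offset onward."""
        return self.partitions[partition][offset:]


topic = ToyTopic()
for key, value in [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
]:
    topic.produce(key, value)

# All of user-1's events live in one partition, in the order they were produced.
p = partition_for("user-1")
user1_events = [v for k, v in topic.consume(p, 0) if k == "user-1"]
print(user1_events)  # ['login', 'click', 'logout']
```

Ordering here is only guaranteed within a partition, which is why the choice of key matters so much once you start designing real topics.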

After the overview, you'll get a nice walkthrough of the installation process and a primer on all of the companion applications and processes that help Kafka do its job effectively (things like ZooKeeper), configuration, hardware needs, and more. Once you get your feet wet there, you'll journey into the world of writing producers to generate messages as well as consumers to read them, and the pieces will start to come together.

The book then dives into the nitty-gritty details: the internals. You'll revisit many of the concepts introduced earlier in the book and go into deep-dives on each of them (e.g. instead of just knowing that partitions are the building blocks of Kafka, you'll learn about how they are structured, segments, indexes, etc.). These details segue into actually constructing a pipeline and all of the considerations that go into making it perform well, scale, and handle failures (which happen all the time). You'll also cover a variety of other production-oriented topics surrounding administration and monitoring, and you should be able to leave the book with some confidence in working with Kafka.

One of the most common reviews of the book itself states that “I wish that I had read this book before working with Kafka” and that couldn’t be more true (and it's a great resource if you are interested in becoming certified with Kafka).

Kafka Streams in Action

kafka-streams-in-action-cover

An excellent introduction to Kafka, its internals, and when to use it, all framed around the Kafka Streams framework for building streaming applications.

Keeping with the topic of Kafka, the next book “Kafka Streams in Action” by Bill Bejeck provides a much more code-heavy approach. It almost reads like a revision of the Definitive Guide mentioned earlier with a much more practical focus specifically for Kafka Streams.

The book goes over many of the same concepts as the earlier book, and the first few chapters of the two closely mirror each other, discussing high-level use cases, when and when not to consider Kafka for solving problems, etc.

From around Chapter 3 onward, this overview gives way to very informative, targeted chapters which focus on things like join semantics, windowing, stateful processing, etc. Each chapter works through and builds upon earlier problems and provides code examples for how to go about leveraging Kafka Streams to solve these.

You'll learn about the various supported APIs that Kafka Streams exposes, from the high-level DSL that provides a very SQL-esque syntax for working with streams to the low-level Processor API that'll allow you to work at the most granular level. Its chapters regarding state really shine and demonstrate some of the true potential behind the Kafka Streams framework.
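If windowing is a new idea, here's a rough feel for it in plain Python. This is not Kafka Streams code and isn't taken from the book; window_start, count_per_window, and the one-minute window size are just names and values I picked for illustration. It sketches a tumbling (fixed, non-overlapping) window: each event is assigned to a bucket based on its timestamp, and counts are kept per key and per bucket.

```python
from collections import Counter

WINDOW_SIZE_MS = 60_000  # hypothetical one-minute tumbling windows


def window_start(timestamp_ms: int, size_ms: int = WINDOW_SIZE_MS) -> int:
    """Return the start of the fixed window that contains the timestamp."""
    return timestamp_ms - (timestamp_ms % size_ms)


def count_per_window(events, size_ms: int = WINDOW_SIZE_MS) -> dict:
    """Count events per (key, window start) pair."""
    counts = Counter()
    for key, timestamp_ms in events:
        counts[(key, window_start(timestamp_ms, size_ms))] += 1
    return dict(counts)


# (key, event timestamp in milliseconds)
events = [
    ("page-a", 5_000),
    ("page-a", 59_999),
    ("page-b", 30_000),
    ("page-a", 60_000),  # exactly on the boundary, so it opens the next window
    ("page-a", 61_000),
]

window_counts = count_per_window(events)
print(window_counts)
```

A real streaming framework also has to decide when a window is "done" and what to do with late events, which is where most of the interesting (and tricky) details live.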

As with the previously mentioned Kafka: The Definitive Guide, you'll explore many common design patterns and use cases for Kafka along with accompanying Kafka Streams code to implement them. You'll cover concepts like testing, monitoring, and all of the other goodness you'd expect before creating a production-ready Kafka Streams pipeline.

Ultimately, you can’t go wrong with either of these if you are working with Kafka. However, if you are using Kafka Streams, I’d probably lean towards this one; otherwise, stick with the Definitive Guide.

Designing Data-Intensive Applications

designing-data-intensive-applications

Probably the best-known book on this list, and with good reason. It covers all facets of building data-oriented applications, and while it doesn’t focus solely on streaming, it covers it quite a bit.

In this fantastic work on all things data, Martin Kleppmann guides you through the foundations of data in an easily digestible and comprehensive journey. Regardless of your experience, there’s something for you in Designing Data-Intensive Applications.

Kleppmann begins the journey at the foundational level, but don’t assume that it consists of just a one- or two-sentence high-level overview. All of the popular flavors of data stores are covered (relational, document, graph, etc.) in depth with regard to scalability, maintainability, and reliability. You’ll also learn about the underlying implementations (e.g. B-trees, LSM trees, etc.), formatting techniques, and protocols that make each excel as databases or warehouses respectively.

After this foundational level, the book dives into the distributed world, which is primarily why it’s on this list. Streaming technologies such as Kafka are covered along with all of the concepts that make them great (partitioning, concurrency, replication, etc.). You’ll learn in depth about concepts (and potential assumptions) like isolation, locking, serialization, and clocks (they can be unreliable). There’s lots of information to grok here, but it’ll affect how you think about streaming and distributed applications.

The final part of the book focuses on both batch processing and streaming, and many of the patterns that align the two. You’ll revisit Kafka in depth here (hopefully with the knowledge gleaned earlier in the book) as well as other pub-sub technologies and message queues. Lots of good nuggets about failures (spoiler alert: bad things happen a lot), idempotence, microbatching, windowing, and much, much more.
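Idempotence is one of those nuggets that's easier to appreciate with a small example. The sketch below is not from the book; apply_once, the message IDs, and the running balance are all hypothetical. It assumes each message carries a unique ID and that the same message may be delivered more than once (as happens after retries), and shows how remembering the IDs you've already handled keeps a redelivery from being applied twice.

```python
def apply_once(balance: int, seen_ids: set, message: tuple) -> int:
    """Apply a (message_id, amount) message unless that ID was already processed."""
    message_id, amount = message
    if message_id in seen_ids:
        return balance  # duplicate delivery: safe to ignore
    seen_ids.add(message_id)
    return balance + amount


# "m1" and "m2" are each delivered twice, as might happen after a retry.
deliveries = [("m1", 10), ("m2", 5), ("m1", 10), ("m3", -3), ("m2", 5)]

balance = 0
seen = set()
for message in deliveries:
    balance = apply_once(balance, seen, message)

print(balance)  # 12
```

In a real system the set of seen IDs would need to be stored durably alongside the result, otherwise a crash between the two would bring the duplicates right back.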

If you had to read one book from this list, Designing Data-Intensive Applications wouldn’t be a bad one to pick.

Streaming Systems

streaming-systems-cover

A fantastic book on streaming in a holistic sense that focuses on building systems around it, the trade-offs between batching and streaming, the Apache Beam model, and all sorts of other goodness.

Arguably one of the best books I’ve come across regarding streaming as a whole. Streaming Systems was written by a team of three engineers at Google who worked on the Dataflow team and were members of the steering committee for Apache Beam.

The book itself talks holistically about streaming with a focus on the Beam Model (i.e. Apache Beam) and does so because Beam itself is designed to be an interface or abstraction for various streaming technologies (e.g. Spark, Flink, Dataflow, Samza, and countless others). As such, all of the code examples are written in Beam so that they can easily be mapped onto these other technologies, but more importantly, they help you understand what is happening.

The book is extremely well written and is an easy read, despite the content occasionally teetering on academic (i.e. very detailed, low-level). It’s chock full of knowledge and tidbits, with a dash of humor in every chapter. It’s what most development books should aspire to be.

Streaming Systems also introduces a common “What”, “How”, “When”, “Where” motif that will stick with you when approaching streaming problems.

Another plus for Streaming Systems is that despite the name, it doesn’t shy away from saying the B-word. Batching comes up frequently, primarily because the Beam Model supports it, but also because it’s the manner in which most companies perform data processing. The book details how the two interact, how they parallel each other, and how to strike a balance between them.
