Apache Kafka is an **open-source distributed event streaming platform**. Originally developed at LinkedIn and now maintained by the Apache Software Foundation, it is designed to handle high-volume, real-time data feeds with low latency. It is written in Scala and Java and can be deployed on various infrastructures, including bare metal, virtual machines, and containers, both on-premises and in the cloud.

### Core Architectural Concepts 🏗️

At its heart, Kafka operates on a publish-subscribe model, but with unique characteristics that provide durability, scalability, and fault tolerance.

- **Events (Messages/Records):** The fundamental unit of data in Kafka. An event represents a fact that "something happened." Each event consists of:
  - **Key:** Optional. Used for routing messages to specific partitions. Messages with the same key are guaranteed to go to the same partition, ensuring order for that key.
  - **Value:** The actual payload of the message. It can be any data format (e.g., JSON, Avro, Protobuf, plain text), since Kafka stores it as an opaque byte array.
  - **Timestamp:** Set by the producer or assigned by the broker, indicating when the event occurred or was received.
  - **Headers:** Optional metadata (key-value pairs).
- **Topics:** Logical channels or categories to which events are published. Producers write to topics, and consumers read from topics. Think of them as named feeds of events. Topics are inherently multi-producer and multi-subscriber.
- **Partitions:** The backbone of Kafka's parallelism and scalability. Each topic is divided into one or more partitions.
  - **Ordered, Immutable Sequence:** Each partition is an ordered, append-only log of events. Events within a partition are assigned a sequential ID called an **offset**.
  - **Scalability & Parallelism:** Partitions allow a topic's data to be distributed across multiple brokers. This enables multiple consumers to read from a topic in parallel (one consumer per partition within a consumer group) and producers to write in parallel.
  - **Data Distribution:** Events are assigned to partitions based on the message key (a hash of the key, by default) or in round-robin fashion if no key is provided (see the producer sketch after this list).
- **Offsets:** A unique, sequential integer assigned to each event within a partition. Consumers use offsets to track their read position. Kafka does not track which messages have been "read" at the broker level; that is the consumer's responsibility.
- **Brokers:** The servers that form a Kafka cluster. Each broker hosts a subset of the cluster's partitions.
  - **Storage Layer:** Brokers receive messages from producers, assign offsets, and commit the messages to disk (the log). They also serve fetch requests from consumers.
  - **Cluster:** Kafka runs as a cluster of one or more brokers. This distributed design provides fault tolerance and scalability.
  - **Controller:** Within a Kafka cluster, one broker acts as the controller. The controller manages the state of partitions and replicas and performs administrative tasks such as reassigning partitions when a broker fails.
- **Replication:** For fault tolerance, partitions are replicated across multiple brokers.
  - **Leader:** For each partition, one broker is elected "leader." All writes and reads for that partition go through the leader.
  - **Followers:** Other brokers host "follower" replicas, which passively replicate the leader's data.
  - **In-Sync Replicas (ISRs):** The subset of replicas currently caught up with the leader. Writes are only considered committed once acknowledged by all ISRs (configurable), providing durability. If a leader fails, one of the ISRs is elected as the new leader.
- **Producers:** Client applications that publish (write) streams of events to Kafka topics. Producers can choose which partition to send a message to, or let Kafka decide based on the message key.
- **Consumers:** Client applications that subscribe to (read and process) streams of events from Kafka topics.
- **Consumer Groups:** Consumers typically belong to a "consumer group." Each event in a topic is delivered to exactly one consumer instance within each subscribing consumer group, which enables load balancing and parallel processing across multiple consumer instances. If a consumer in a group fails, its partitions are reassigned to the remaining members of the group.
- **ZooKeeper (Phasing Out with KRaft):** Traditionally, Kafka used Apache ZooKeeper for cluster coordination, including managing broker metadata, controller election, topic configurations, and access control lists (ACLs).
- **KRaft (Kafka Raft Metadata mode):** Newer versions of Kafka are transitioning to KRaft mode, which replaces ZooKeeper for metadata management. KRaft embeds the consensus protocol directly within Kafka, simplifying deployment, improving the scalability of metadata operations, and reducing operational overhead by removing an external dependency.
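To make the key-to-partition routing concrete, here is a minimal producer sketch using Kafka's Java client. The broker address `localhost:9092`, the topic name `orders`, and the key `customer-42` are placeholder assumptions for illustration, not part of any standard setup:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Bootstrap servers are the entry point to the cluster; "localhost:9092" is a placeholder.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Kafka stores keys and values as byte arrays; serializers turn our Strings into bytes.
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // All records with key "customer-42" hash to the same partition,
                // so they are appended (and later read) in order for that key.
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("orders", "customer-42", "order-" + i);
                producer.send(record, (metadata, exception) -> {
                    if (exception == null) {
                        // The broker reports where the event landed: partition and offset.
                        System.out.printf("partition=%d offset=%d%n",
                                metadata.partition(), metadata.offset());
                    } else {
                        exception.printStackTrace();
                    }
                });
            }
        } // close() flushes any buffered records before exiting
    }
}
```

Because every record carries the same key, the default partitioner routes them all to one partition, and the offsets reported by the callback increase monotonically: the per-key ordering guarantee in action.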
---

### Guarantees and Key Features ✨

- **Ordering:** Kafka guarantees the order of messages _within a partition_. If a producer sends messages M1 then M2 to the same partition, M1 is written first and has a lower offset than M2, and consumers see them in that order. Order across partitions is not guaranteed.
- **Durability:** Messages written to Kafka are persisted on disk and replicated. Durability is configurable through producer acknowledgment settings and the topic's replication factor.
- **Delivery Semantics:** Kafka supports different message delivery semantics (the sketches after this list illustrate the consumer and producer sides):
  - **At most once:** Messages might be lost but are never redelivered.
  - **At least once:** Messages are never lost but might be redelivered (e.g., if a consumer processes a message but fails before committing its offset).
  - **Exactly once:** Each message is delivered and processed precisely one time. This is achievable in Kafka, especially with the Streams API and transactional producers/consumers, or when integrating with systems that support idempotent writes or transactions.
- **High Throughput:** Kafka is designed for high-volume message streams, capable of handling millions of messages per second per broker, by leveraging sequential disk I/O and batching.
- **Scalability:**
  - **Horizontal Scaling:** Clusters can be expanded by adding brokers, and topics can be scaled by adding partitions.
  - **Consumer Scalability:** Consumer groups allow consumption to scale out by adding consumer instances, up to the number of partitions.
- **Fault Tolerance:** Replicating partitions across brokers lets the system tolerate broker failures. If the broker hosting a partition's leader fails, a follower replica takes over.
- **Decoupling:** Producers and consumers are fully decoupled. Producers don't know about consumers, and consumers don't know about producers; each side interacts only with the Kafka brokers.
- **Data Retention:** Kafka stores messages for a configurable period (or up to a configurable size). Messages are not deleted immediately after consumption, allowing multiple independent consumers to read the same data at different times and speeds. This also supports re-processing of data.
- **Log Compaction:** For certain use cases (e.g., maintaining the latest state for each key), topics can be configured for log compaction, which ensures that Kafka retains at least the last known value for each message key within a log partition.
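The difference between these delivery semantics comes down to when offsets are committed relative to processing. Below is a minimal at-least-once consumer sketch using the Java client; the group id `order-processors` is a placeholder, and the `orders` topic is carried over from the producer sketch above:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // All instances sharing this group.id split the topic's partitions among themselves.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Disable auto-commit: we commit offsets only after processing succeeds.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // "Processing" stands in for real business logic here.
                    System.out.printf("key=%s value=%s offset=%d%n",
                            record.key(), record.value(), record.offset());
                }
                // Committing after processing yields at-least-once: a crash before this
                // line redelivers the batch. Committing before would be at-most-once.
                consumer.commitSync();
            }
        }
    }
}
```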
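On the producing side, exactly-once publishing combines idempotence with transactions. This is a minimal sketch, assuming a cluster where the `orders` and `audit-log` topics exist and the transactional id `orders-tx-1` is unused; production code would also distinguish fatal errors (e.g., `ProducerFencedException`) from retriable ones instead of aborting on every exception:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all waits for the in-sync replicas, enable.idempotence de-duplicates
        // broker-side retries, and transactional.id enables atomic multi-partition writes.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "orders-tx-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                // Both writes commit or abort together, even across topics/partitions.
                producer.send(new ProducerRecord<>("orders", "customer-42", "order-created"));
                producer.send(new ProducerRecord<>("audit-log", "customer-42", "order event recorded"));
                producer.commitTransaction();
            } catch (KafkaException e) {
                // Consumers reading with isolation.level=read_committed never see aborted records.
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```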
---

### Getting Started & Operations 🛠️

- **Installation:** Kafka can be downloaded and run locally, or deployed using Docker containers or Kubernetes operators.
- **Configuration:** Kafka has extensive configuration options for brokers, topics, producers, and consumers to fine-tune performance, durability, and resource usage. Key configurations include the replication factor, the number of partitions, acknowledgment settings, and retention policies (see the topic-creation sketch after this list).
- **Cluster Management:** Involves tasks such as adding/removing brokers, rebalancing partitions, monitoring cluster health, managing security, and upgrading versions.
- **Security:** Kafka supports several security features (a client-configuration sketch follows the list):
  - **Encryption:** SSL/TLS for data in transit.
  - **Authentication:** SASL (e.g., Kerberos, PLAIN, SCRAM) to verify client identities.
  - **Authorization:** ACLs to control which clients can perform which operations on which resources (topics, groups, the cluster).
- **Geo-Replication:** Tools and patterns (such as MirrorMaker or Confluent Replicator) replicate data between Kafka clusters in different data centers or cloud regions, for disaster recovery or to bring data closer to users.
- **Tiered Storage:** Some distributions offer tiered storage, which moves older data to cheaper long-term storage (such as S3) while keeping it queryable, reducing the cost of retaining large volumes of data on Kafka brokers.
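Many of the key configurations above are set per topic at creation time. Here is a sketch using the Java `Admin` client; the values (6 partitions, replication factor 3, 7-day retention) are illustrative choices, not recommendations:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // 6 partitions bound consumer-group parallelism; replication factor 3
            // means each partition lives on three brokers for fault tolerance.
            NewTopic topic = new NewTopic("orders", 6, (short) 3)
                    .configs(Map.of(
                            TopicConfig.RETENTION_MS_CONFIG, "604800000",    // retain data for 7 days
                            TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2"));  // with acks=all, tolerate 1 replica down
            admin.createTopics(List.of(topic)).all().get(); // block until the brokers confirm
        }
    }
}
```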
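Client-side, the security features above are enabled purely through configuration. A sketch of client properties for SASL/SCRAM authentication over TLS; the mechanism, credentials, and trust-store path are all deployment-specific placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.common.config.SaslConfigs;
import org.apache.kafka.common.config.SslConfigs;

public class SecureClientConfig {
    // Builds client properties for SASL/SCRAM authentication over TLS.
    // The username, passwords, and trust-store path are placeholders.
    static Properties secureClientProps() {
        Properties props = new Properties();
        // SASL_SSL = TLS encryption in transit plus SASL authentication.
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
        props.put(SaslConfigs.SASL_MECHANISM, "SCRAM-SHA-512");
        props.put(SaslConfigs.SASL_JAAS_CONFIG,
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                        + "username=\"app-user\" password=\"change-me\";");
        // Trust store holding the CA that signed the brokers' certificates.
        props.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "/etc/kafka/client.truststore.jks");
        props.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "change-me-too");
        return props;
    }
}
```

These properties can be merged into any of the producer, consumer, or admin configurations shown earlier; authorization (ACLs) is then enforced broker-side against the authenticated principal.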