Topics, Partitions and Offsets

  1. Topics: a particular stream of data, similar to a table in a database. You can have as many topics as you want; each is identified by its name.
  2. Partitions: a topic is split into partitions, and each partition is ordered. When you create a topic, you have to define how many partitions you want (see the sketch after this list).
  3. Offset: within a partition, each message gets an auto-incrementing id that grows without bound. Offsets only have meaning for a specific partition: offset 1 in partition 1 does not refer to the same data as offset 1 in partition 2. Order is guaranteed only within one partition. Data is kept for a limited time (default is one week), but offset ids never roll back, even after older data is deleted. Data is immutable once written to a partition, and is assigned to a random partition unless a key is provided.
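
A minimal sketch of creating a topic with a chosen partition count, using the Java AdminClient. The topic name truck_gps, the broker address, and the partition/replication numbers are placeholder choices, not anything mandated by Kafka:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Address of any broker in the cluster (placeholder).
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Name, number of partitions, and replication factor are all
            // fixed at creation time.
            NewTopic topic = new NewTopic("truck_gps", 3, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```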

 

Brokers

the servers that hold the topics.

  1. A Kafka cluster is composed of multiple brokers (servers).
  2. Each broker is identified by an id, which is a number, not a name.
  3. Each broker contains only certain topic partitions, not all of the data: Kafka is distributed.
  4. Connect to any broker (a bootstrap broker) and you are connected to the entire cluster.
  5. There is no relationship between partition numbers and broker numbers.
  6. A topic is spread among the brokers. When you create a topic, Kafka automatically distributes its partitions across the brokers.

Topic replication factor

in a big data world, things should keep working even when a machine goes down.

  1. Replication factor > 1, usually 2 (risky) or 3 (the golden number: one replica down for maintenance, one down unexpectedly, one still live).
  2. One leader, multiple ISRs (in-sync replicas): at any time, only ONE broker can be the leader for a given partition, and only that leader can receive and serve data for the partition. The other brokers passively synchronize the data.
  3. Zookeeper decides who the leader is; you don't need to worry about it (see the sketch after this list).
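
To see which broker is the leader and which replicas are in sync for each partition, you can describe the topic with the Java AdminClient; a sketch, reusing the hypothetical truck_gps topic from above:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

import java.util.Collections;
import java.util.Properties;

public class ShowLeaderAndIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin
                    .describeTopics(Collections.singletonList("truck_gps"))
                    .all().get().get("truck_gps");
            // One line per partition: the single leader plus its in-sync replicas.
            desc.partitions().forEach(p ->
                    System.out.printf("partition %d: leader=%s isr=%s%n",
                            p.partition(), p.leader(), p.isr()));
        }
    }
}
```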

Producers

how we get data into Kafka.

  1. Producers write data to topics, and topics are made of partitions.
  2. Producers automatically know which broker and partition to write to.
  3. If a broker fails, producers automatically recover.
  4. If there is no key, producers send to partitions in a “round robin” fashion to achieve load balancing (see the sketch after this list).
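
A minimal producer sketch with the Java client. No key is set, so the client spreads records across partitions for load balancing (newer clients batch keyless records to one partition at a time, which still balances load overall). The topic name and broker address are placeholders:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key: records are distributed across the topic's partitions.
            producer.send(new ProducerRecord<>("truck_gps", "hello kafka"));
            producer.flush();  // block until the broker has the record
        }
    }
}
```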

 

No message keys: before writing to a topic, producers can choose among 3 kinds of acknowledgment for data writes (see the snippet after this list):

  1. acks=0: Producer won’t wait for acknowledgment (possible data loss).
  2. acks=1: Producer will wait for the leader to acknowledge (limited data loss).
  3. acks=all: Leader + replicas acknowledgment (no data loss).
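
The acknowledgment level is just one extra property in the producer sketch above:

```java
// acks=0: fire and forget (possible data loss)
// acks=1: wait for the leader only (limited data loss)
// acks=all: wait for the leader and all in-sync replicas (no data loss)
props.put(ProducerConfig.ACKS_CONFIG, "all");
```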

With message keys

  1. Producers can choose to send a key with the message; the key can be a string, a number, etc. If no key is sent, the key is null and the data is sent round robin.
  2. If a key is sent, all messages for that key will go to the same partition. Send a key when you need message ordering for a specific field, e.g. truck_id. The key-to-partition mechanism is called hashing. It doesn't control whether truck_id_123 goes to partition 0 or partition 1; it only guarantees that once truck_id_123 goes to partition 0, all future truck_id_123 messages will also go to partition 0 (see the sketch after this list).
  3. As long as the number of partitions remains constant for a topic (NO NEW PARTITIONS), the same key will always go to the same partition.
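
Sending a key is one extra constructor argument in the producer sketch above. The Java client's default partitioner hashes the key bytes (murmur2) modulo the partition count, which is exactly why the mapping holds only while the partition count stays fixed:

```java
// Same key => same partition, as long as the partition count never changes.
// The default partitioner effectively computes:
//   partition = murmur2(keyBytes) % numPartitions
producer.send(new ProducerRecord<>("truck_gps", "truck_id_123", "lat=52.5,lon=13.4"));
```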

Consumers

read data from a topic (identified by name).

  1. Consumers know which broker to read from.
  2. If a broker fails, consumers know how to recover.
  3. Data is read in order within each partition, offset by offset.
  4. Consumers can also read from multiple partitions (see the sketch after this list).
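
A minimal consumer sketch with the Java client; group.id ties the instance to a consumer group (next section), and subscribing lets Kafka assign it one or more partitions. Topic, group name, and broker address are placeholders:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "truck_gps_app");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("truck_gps"));
            while (true) {
                // Records arrive in offset order within each assigned partition.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            r.partition(), r.offset(), r.key(), r.value());
                }
            }
        }
    }
}
```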

 

Consumer Groups

each consumer within a group reads from exclusive partitions.

    1. A consumer group can be thought of as one application.
    2. If there are more consumers than partitions, some consumers will be inactive. Usually you don't want inactive consumers; the only case to have one is as a hot standby, so that when consumer 3 is about to shut down, consumer 4 takes over right away. In general, have at most as many consumers as partitions (see the snippet after this list).
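
Running several copies of the consumer sketch above with the same group.id splits the topic's partitions between them. After a poll, each instance can print the partitions it was assigned:

```java
// After the first poll(), the group coordinator has handed out partitions.
// With more consumers than partitions, this set is empty for the extras.
System.out.println("assigned: " + consumer.assignment());
```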

 

Consumer Offsets

similar to a checkpoint or bookmark.

  1. Kafka stores the offsets at which a consumer group has been reading.
  2. Consumer offsets are committed live in a Kafka topic named __consumer_offsets.
  3. When a consumer in a group has processed data received from Kafka, it should commit the offsets.
  4. The benefit: if the consumer dies, it will read back from where it left off (see the snippet after this list).
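
By default the Java client commits offsets automatically in the background. To commit only after your own processing, disable auto-commit and call commitSync(); a sketch building on the consumer loop above:

```java
// In the consumer Properties:
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

// In the poll loop, after the fetched records have been processed:
consumer.commitSync();  // writes the group's position to __consumer_offsets
// If this consumer dies, a restarted one resumes from the committed offset.
```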

 

Delivery semantics for consumers

Consumers choose when to commit offsets

    1. At most once: offsets are committed as soon as the message is received, before it is processed. So if processing goes wrong, the message is lost. Not preferred.
    2. At least once: offsets are committed after the message is processed; if processing goes wrong, the message will be read again. But this can result in duplicate processing of the same messages, so your processing MUST be IDEMPOTENT, i.e. f(x) = f(f(x)).
    3. Exactly once: can be achieved for Kafka-to-Kafka workflows using the Kafka Streams API. For Kafka to an external system (e.g. a database), it usually means you should use an IDEMPOTENT consumer (see the sketch after this list).
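
The difference between the first two semantics is only where the commit happens relative to processing. For at-least-once, duplicates are made harmless by an idempotent write; a sketch building on the consumer loop above, where database.upsert is a hypothetical placeholder, not a real API:

```java
// At most once:  commit first, then process (a crash loses the message).
// At least once: process first, then commit (a crash replays the message).
for (ConsumerRecord<String, String> r : records) {
    // Idempotent sink: upserting by a stable id means processing the
    // same record twice leaves the target unchanged.
    String id = r.topic() + "-" + r.partition() + "-" + r.offset();
    database.upsert(id, r.value());  // hypothetical placeholder call
}
consumer.commitSync();  // commit only after processing succeeded
```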

Kafka Broker Discovery

how do producers and consumers figure out which broker to work with?

  1. EVERY Kafka broker is also called a “bootstrap server”, which means you ONLY need to connect to one broker to access the entire cluster.
  2. Each broker in the cluster knows about all the other brokers, topics, and partitions (metadata).
  3. Once a Kafka client connects to a bootstrap broker, it will automatically issue a “metadata request” (see the sketch after this list).
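
You can watch bootstrapping work by pointing an AdminClient at a single broker and asking for the whole cluster; a sketch:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

import java.util.Properties;

public class DiscoverCluster {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // One bootstrap broker is enough; the metadata request reveals the rest.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            admin.describeCluster().nodes().get().forEach(node ->
                    System.out.println("broker id=" + node.id() + " host=" + node.host()));
        }
    }
}
```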

 

Zookeeper

  1. manages brokers and keeps a list of them
  2. helps with partition leader election
  3. sends notifications to Kafka in case of changes (broker dies, new topic created, etc …)
  4. Kafka can’t live without Zookeeper, so start Zookeeper before starting Kafka.
  5. Zookeeper by design operates with an odd number of servers (3, 5, 7 …) 
  6. Zookeeper has a leader that handles writes; the rest of the servers are followers that handle reads.
  7. Producers and consumers write to Kafka; Kafka manages all its metadata in Zookeeper.
  8. IMPORTANT: Zookeeper does NOT store consumer offsets with Kafka > v0.10. Zookeeper is completely isolated from consumers and producers. Consumer offsets are stored in a Kafka topic, __consumer_offsets.

 

Theory Roundup

key points to remember

Kafka Guarantees:
  1. Messages are appended to a topic-partition in the order they are sent.
  2. Consumers read messages in the order stored in a topic-partition.
  3. Replication factor of N can tolerate up to N-1 brokers being down.
  4. As long as the number of partitions remains constant for a topic (NO NEW PARTITIONS), the same key will always go to the same partition.