Pre-Configuration

1. Increase the maximum limit (we will demonstrate batch processing, so a limit of 1000 is easy to reach)

2. Install the Maven dependencies (a sketch follows)
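A sketch of the pom.xml dependencies this project likely needs; the artifact coordinates are the standard kafka-clients and ElasticSearch REST high-level client, but the version numbers are illustrative assumptions:

```xml
<dependencies>
    <!-- Kafka consumer client (version is illustrative) -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>2.0.0</version>
    </dependency>
    <!-- ElasticSearch REST high-level client (version is illustrative) -->
    <dependency>
        <groupId>org.elasticsearch.client</groupId>
        <artifactId>elasticsearch-rest-high-level-client</artifactId>
        <version>7.10.2</version>
    </dependency>
</dependencies>
```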

 

 

The Real Code

1. Set up the ElasticSearch client

2. Create a Kafka Consumer

3. Post records from Kafka into ElasticSearch (Bonsai); see the combined sketch below
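A minimal end-to-end sketch of the three steps. The Bonsai host and credentials, the group ID, the twitter_tweets topic, and the twitter index are illustrative placeholders, and the client API shown assumes the ElasticSearch 7.x REST high-level client:

```java
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ElasticSearchConsumer {

    public static void main(String[] args) throws Exception {
        // 1. Set up the ElasticSearch client (Bonsai provides host, user, and password)
        BasicCredentialsProvider credentialsProvider = new BasicCredentialsProvider();
        credentialsProvider.setCredentials(AuthScope.ANY,
                new UsernamePasswordCredentials("<bonsai-user>", "<bonsai-password>"));
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("<bonsai-host>", 443, "https"))
                        .setHttpClientConfigCallback(builder ->
                                builder.setDefaultCredentialsProvider(credentialsProvider)));

        // 2. Create a Kafka consumer
        Properties properties = new Properties();
        properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
        properties.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "kafka-demo-elasticsearch");
        properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
        consumer.subscribe(Collections.singleton("twitter_tweets"));

        // 3. Post records from Kafka into ElasticSearch
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                IndexRequest indexRequest = new IndexRequest("twitter")
                        .source(record.value(), XContentType.JSON);
                IndexResponse response = client.index(indexRequest, RequestOptions.DEFAULT);
                System.out.println("Indexed document: " + response.getId());
            }
        }
    }
}
```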

 

Consumer Offset Commit Strategy

Notes on at-most-once, at-least-once, and exactly-once delivery

  1. At most once: offsets are committed as soon as the message is received. If the processing goes wrong, the message is lost and will not be read again.
  2. At least once: offsets are committed after the message is processed. If the processing goes wrong, the message is read again. This can result in duplicate processing, so make sure your processing is idempotent.
  3. Exactly once: can be achieved for Kafka-to-Kafka workflows using the Kafka Streams API. For Kafka-to-sink workflows, use an idempotent consumer.
  4. Bottom line: for most applications you should use at-least-once processing and ensure your transformations/processing are idempotent. (A sketch of the difference between the first two follows.)
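A minimal sketch of the difference, continuing with the consumer from the sketch above (`process` is a hypothetical handler, not a real API):

```java
// At-most-once (sketch): commit first, then process.
// If process() fails, the batch is lost: its offsets are already committed.
ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(100));
consumer.commitSync();
process(batch);

// At-least-once (sketch): process first, commit only after success.
// If process() fails before the commit, the batch is read again,
// so process() must be idempotent.
ConsumerRecords<String, String> nextBatch = consumer.poll(Duration.ofMillis(100));
process(nextBatch);
consumer.commitSync();
```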

 

Configuration

1. enable.auto.commit = true & synchronous processing of batches (offsets are committed automatically, at regular intervals, each time you call .poll(); this is only safe if each batch is fully processed synchronously before the next .poll())

2. enable.auto.commit = false & manual commit of offsets

 

Every 100 records, commit the consumer offsets (a sketch follows).
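A sketch of option 2, assuming the standard kafka-clients configs `enable.auto.commit` and `max.poll.records` (the latter caps each batch at 100 records, so each commit covers at most 100 records):

```java
// Must be set before the KafkaConsumer is constructed
properties.setProperty(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // take control of commits
properties.setProperty(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");     // at most 100 records per poll

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        // ... process / index the record into ElasticSearch ...
    }
    // Commit only after the whole batch (up to 100 records) is processed:
    // at-least-once semantics.
    consumer.commitSync();
}
```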

 

 

Idempotent Strategy

Two strategies for an idempotent consumer (both sketched below):

1. Kafka generic ID: build a unique ID from the record coordinates (topic + partition + offset).
2. Application-specific ID: use an ID that comes from the data itself (e.g. a tweet's own ID).
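A sketch of both, reusing the illustrative `twitter` index from earlier; `extractIdFromTweet` is a hypothetical helper. Because ElasticSearch overwrites a document when it receives the same ID again, re-processing a record updates the existing document instead of creating a duplicate:

```java
// Strategy 1: Kafka generic ID, unique per record
String id = record.topic() + "_" + record.partition() + "_" + record.offset();

// Strategy 2: an ID taken from the data itself (hypothetical helper)
// String id = extractIdFromTweet(record.value());

// Supplying the ID makes the index request idempotent
IndexRequest indexRequest = new IndexRequest("twitter")
        .source(record.value(), XContentType.JSON)
        .id(id);
```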

 

 

Bulk Request to ElasticSearch
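Instead of issuing one index request per record, the records from a single poll can be batched into one bulk call. A sketch with the REST high-level client's `BulkRequest` / `client.bulk` APIs (the surrounding variables come from the earlier sketches):

```java
BulkRequest bulkRequest = new BulkRequest();
for (ConsumerRecord<String, String> record : records) {
    String id = record.topic() + "_" + record.partition() + "_" + record.offset();
    bulkRequest.add(new IndexRequest("twitter")
            .source(record.value(), XContentType.JSON)
            .id(id)); // idempotent, see above
}
if (bulkRequest.numberOfActions() > 0) {
    BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT);
    consumer.commitSync(); // commit offsets only after the bulk request succeeded
}
```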

 

 

Consumer Offset Reset Behaviour

  1. The behaviour for the consumer is defined by auto.offset.reset (see the sketch after this list):
     1. latest: read from the end of the topic (the default)
     2. earliest: read from the beginning of the topic
     3. none: throw an exception if no offset is found
  2. Consumer offsets can be lost if the consumer hasn't read new data:
     1. in 1 day (Kafka < 2.0)
     2. in 7 days (Kafka >= 2.0)
  3. This retention period can be controlled with the broker setting offsets.retention.minutes.
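The consumer-side setting, for reference (a one-line sketch):

```java
// "latest" (default), "earliest", or "none"
properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
```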

 

 

Replay Data for Consumers

1. Reset the consumer offsets (see the CLI sketch after this list).

 

2. Replay the previous tweets.
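One way to do this with the Kafka CLI, assuming the illustrative group and topic names used above. The consumer group must be stopped before its offsets can be reset, and the script may be named kafka-consumer-groups.sh depending on your installation:

```bash
# 1. Reset the group's offsets to the beginning of the topic
kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 \
  --group kafka-demo-elasticsearch \
  --topic twitter_tweets \
  --reset-offsets --to-earliest --execute

# 2. Restart the consumer application: it will re-read the old tweets
```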

 

Controlling Consumer Liveness

 

Consumer heartbeat thread: used to detect that a consumer application is down.

1. session.timeout.ms:
   1. Heartbeats are sent periodically to the broker.
   2. If no heartbeat is sent during that period, the consumer is considered dead.
   3. Set it even lower for faster consumer rebalances.
2. heartbeat.interval.ms:
   1. How often to send heartbeats.
   2. Usually set to 1/3rd of session.timeout.ms.

Consumer Poll Thread

1. max.poll.interval.ms: the maximum amount of time allowed between two .poll() calls before declaring the consumer dead (default: 5 minutes).
2. This matters for big data processing: if more than 5 minutes pass before the next .poll(), Kafka considers the consumer dead and kicks it out of the group.
3. This mechanism is used to detect a data processing issue with the consumer. (A configuration sketch follows.)
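A sketch of the three liveness settings named above (the values shown are common illustrative choices, not recommendations):

```java
// Heartbeat thread
properties.setProperty(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "10000");    // dead after 10 s without heartbeats
properties.setProperty(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000");  // heartbeat every 3 s (~1/3 of the session timeout)

// Poll thread
properties.setProperty(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000"); // at most 5 minutes between poll() calls
```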