By   December 29, 2014

Kafka – Introduction & Terminologies
Kafka is an apache project originally introduced by LinkedIn as an open source project. It is a distributed, fast, durable and scalable publish / subscribe messaging system. This messaging is also persistent with constant time performance [ O(1) ] even with many terabytes of storage. It also supports parallel data load into hadoop.

Conceptual Kafka

From the 40,000 feet high view, there are three entities in the kafka world. There are producers (publish message). These messages are published to kafka cluster. The messages are then pulled by consumers.

kafka_conceptual

Broker

As we just discussed, Kafka is a distributed message platform. It runs a cluster with one or more servers. Each server in the kafka cluster is called Broker.

kafka_brokers

Topics & Partitions

Kafka messages are published to a topic. They can be referred as categories of messages. The producers publish their message on a topic, and consumers subscribe to topics for message consumption. Each topic can have one or more partitions, which are immutable, ordered sequence of messages. These partitions can be on the same broker, or they can span across multiple brokers.

Producer needs to decide which particular partition it needs to publish message to. It may do so in a round robin fashion. It can also be done based on some semantic partition function (key based).

kafka_partition_selection

Kafka introduced fault tolerance by introducing leaders and followers. The messages on a partition can be replicated on different servers. In this case, one of the servers would be called as Leader, and the rest are called followers. Messages are always published to the leader. They are then passively replicated by all the followers. In case of leader failure, one of the follower takes over as a leader without causing any effect to the consumers.

cluser_leader_follower

Mapping between Partitions & Consumers

Kafka keeps the ordering by one-to-one mapping between partition and consumer in a consumer group. This is similar to basic Domain & Range concept for Functions (single valued) in basic calculus. It can be an injective or non-injective surjective function. Each partition is assigned to only one consumer in a consumer group. More than one partitions can be assigned to one consumer. There can be no consumer without an assigned partition. The below example is definitely like a non-injective surjective function.

kafka_partitions_consumers

Consumer Pull Vs Partition Push

Kafka is based on pull mechanism. A consumer pulls messages from the assigned partition when it is ready.

Handling Ordering of Messages

Kafka guarantees ordering of messages on partition level. So if a topic has more than one partitions, the order of messages on different partitions cannot be guaranteed. It looks like, this could be even worse if those partitions are selected on a round robin fashion.

How would more than one producer use round robin fashion for publishing messages as it seems for round robin, it should be a central point of publishing???

Queuing Vs Publish / Subscribe

In a queuing system, a message can only be consumed by a single consumer. In a publish / subscribe based mechanism, it can be consumed by a number of consumers. Kafka can be configured as either of them. It is a queuing system for each consumer group. So if you have one consumer group, it is a queuing system.

publish_subscribe_queue

https://cwiki.apache.org/confluence/display/KAFKA/Index