Apache Kafka with Spark Streaming: Real-time Analytics Redefined


Agenda
At the end of this webinar, we will be able to understand:
• What is Kafka?
• Why do we need Kafka?
• Kafka components
• How Kafka works
• Which companies are using Kafka?
• Kafka and Spark integration (hands-on)

Why Kafka?

Kafka vs. other message brokers
Why Kafka, when we have other messaging systems? Aren’t they good?

They are all good, but not for all use cases.

So what are my use cases?
• Transportation of logs
• Activity stream in real time
• Collection of performance metrics
– CPU/IO/memory usage
– Application-specific
• Time taken to load a web page
• Time taken by multiple services while building a web page
• Number of requests
• Number of hits on a particular page/URL

What is common?
• Scalable: needs to be highly scalable. There is a lot of data; it can be billions of messages.
• Reliable: how reliable must message delivery be? What if I lose a small number of messages; is that fine with me?
• Distributed: multiple producers, multiple consumers.
• High-throughput: does not need to follow JMS standards, which may be overkill for some use cases such as transportation of logs.
– As per JMS, each message has to be acknowledged.
– An exactly-once delivery guarantee requires a two-phase commit.

Why LinkedIn built Kafka?
To collect its growing data, LinkedIn developed many custom data pipelines for streaming and queueing data, such as:
• To flow data into the data warehouse
• To send batches of data into its Hadoop workflow for analytics
• To collect and aggregate logs from every service
• To collect tracking events like page views
• To queue its InMail messaging system
• To keep its people search system up to date whenever someone updated their profile
As the site needed to scale, each individual pipeline needed to scale, and many other pipelines were needed.
Something had to give!
The result was the development of Kafka.

The number has been growing since
Source : confluent

A diagram of LinkedIn’s data architecture as of February 2013, including everything from Kafka to Teradata.
Source: http://gigaom.com/2013/12/09/netflix-open-sources-its-data-traffic-cop-suro/

Kafka?
• Built with speed and scalability in mind
• Enabled near real-time access to any data source
• Empowered Hadoop jobs
• Allowed us to build real-time analytics
• Vastly improved our site monitoring and alerting capability
• Enabled us to visualize and track our call graphs
Apache Kafka Hits 1.1 Trillion Messages Per Day (September 2015)
• Kafka is a distributed pub-sub messaging platform
• Universal pipeline, built around the concept of a commit log
• Kafka as a universal stream broker

Kafka Benchmarks

Kafka Producer/Consumer Performance
Processes hundreds of thousands of messages per second

How fast is Kafka?
Source: http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
• “Up to 2 million writes/sec on 3 cheap machines”
– Using 3 producers on 3 different machines, 3x async replication
• Only 1 producer/machine because NIC already saturated
• Sustained throughput as stored data grows
– Slightly different test config than 2M writes/sec above.
• Test setup
– Kafka trunk as of April 2013, but 0.8.1+ should be similar.
– 3 machines: 6-core Intel Xeon 2.5 GHz, 32GB RAM, 6x 7200rpm SATA, 1GigE

Why is Kafka so fast?
• Fast writes:
– While Kafka persists all data to disk, essentially all writes go to the page cache of the OS, i.e. RAM.
– Cf. hardware specs and OS tuning (we cover this later)
• Fast reads:
– Very efficient to transfer data from page cache to a network socket
– Linux: sendfile() system call
• Combination of the two = fast Kafka!
– Example (Operations): On a Kafka cluster where the consumers are mostly caught up, you will see no read activity on the disks, as they will be serving data entirely from cache.
Source: http://kafka.apache.org/documentation.html#persistence
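
The fast-read path can be illustrated with a minimal sketch in Python, assuming a Unix-like system. It is not Kafka's implementation (Kafka runs on the JVM); it only shows the same idea of handing a file to the kernel to send over a socket. The names send_segment and demo, the temporary file, and the local socket pair are all hypothetical stand-ins for a log segment and a consumer connection.

```python
import os
import socket
import tempfile


def send_segment(path, conn):
    """Send the whole file at `path` over a connected stream socket."""
    with open(path, "rb") as segment:
        # socket.sendfile() uses os.sendfile() where the platform provides it,
        # so the kernel moves data from the page cache to the socket without
        # copying it through a user-space buffer. Elsewhere it falls back to send().
        return conn.sendfile(segment)


def demo():
    payload = b"msg-0\nmsg-1\nmsg-2\n"
    # Hypothetical "log segment": a small temporary file.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(payload)
        path = f.name
    # A connected local socket pair stands in for a broker-to-consumer connection.
    sender, receiver = socket.socketpair()
    try:
        sent = send_segment(path, sender)
        sender.close()  # signal end-of-stream to the receiver
        received = b""
        while True:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            received += chunk
        return sent, received
    finally:
        sender.close()
        receiver.close()
        os.unlink(path)
```

Calling demo() returns the number of bytes sent and the bytes the receiving side read back, which should match the file contents.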

Why is Kafka so fast?
• Example: loggly.com, who run Kafka & Co. on Amazon AWS
– “99.99999% of the time our data is coming from disk cache and RAM; only very rarely do we hit the disk.”
– “One of our consumer groups (8 threads) which maps a log to a customer can process about 200,000 events per second draining from 192 partitions spread across 3 brokers.”
• Brokers run on m2.xlarge Amazon EC2 instances backed by provisioned IOPS
Source: http://www.developer-tech.com/news/2014/jun/10/why-loggly-loves-apache-kafka-how-unbreakable-infinitely-scalable-messaging-makes-log-management-better/

How does it work?

A first look
• The who is who
– Producers write data to brokers.
– Consumers read data from brokers.
– All this is distributed.
• The data
– Data is stored in topics.
– Topics are split into partitions, which are replicated.

Topics
• Topic: feed name to which messages are published
– Example: “zerg.hydra”
• Producers (A1, A2, …, An) always append new messages to the “tail” of the topic (think: append to a file).
• Kafka prunes the “head” based on age or max size or “key”.
[Diagram: a Kafka topic on the broker(s), with older messages at the head and newer messages at the tail; producers A1 … An append new messages.]
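
The append-at-the-tail, prune-at-the-head behaviour can be sketched as a toy in-memory log. This is an illustration only, not how a broker stores data: the class TopicLog and its parameters max_messages and max_age_s are hypothetical names, timestamps are passed in explicitly, and key-based pruning is not modelled.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TopicLog:
    """Toy single-partition topic: append at the tail, prune at the head."""
    max_messages: int      # hypothetical size-based retention limit
    max_age_s: float       # hypothetical age-based retention limit
    base_offset: int = 0   # offset of the oldest message still retained
    _entries: deque = field(default_factory=deque)  # (timestamp, payload)

    def append(self, payload, now):
        """Append to the tail and return the offset assigned to the message."""
        offset = self.base_offset + len(self._entries)
        self._entries.append((now, payload))
        self._prune(now)
        return offset

    def _prune(self, now):
        # Drop from the head while the log is too long or the head is too old.
        while self._entries and (
            len(self._entries) > self.max_messages
            or now - self._entries[0][0] > self.max_age_s
        ):
            self._entries.popleft()
            self.base_offset += 1

    def messages(self):
        return [payload for _, payload in self._entries]
```

Offsets keep increasing even as old messages are pruned, so a retained message never changes its offset.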

Topics
• Producers (A1, A2, …, An) always append to the “tail” (think: append to a file).
• Consumers use an “offset pointer” to track/control their read progress (and decide the pace of consumption).
[Diagram: consumer groups C1 and C2 read the same topic on the broker(s), each at its own position between the older and newer messages.]
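
A minimal sketch of offset pointers, assuming a single-partition topic held in memory. The class Topic and its poll method are hypothetical, and storing the pointers inside the topic object is a simplification made for brevity; the point is only that each consumer group advances its own pointer at its own pace.

```python
class Topic:
    """Toy single-partition topic with one offset pointer per consumer group."""

    def __init__(self):
        self.messages = []       # append-only; list index doubles as the offset
        self.group_offsets = {}  # group name -> next offset to read

    def produce(self, msg):
        self.messages.append(msg)

    def poll(self, group, max_records):
        """Return up to max_records unread messages for this group."""
        start = self.group_offsets.get(group, 0)
        batch = self.messages[start:start + max_records]
        # Advance only this group's pointer; other groups are unaffected.
        self.group_offsets[group] = start + len(batch)
        return batch
```

A slow group that polls two records at a time and a fast group that polls everything both see every message, just at different times.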

Topics
• A topic consists of partitions.
• Partition: an ordered + immutable sequence of messages that is continually appended to

Topics
• #partitions of a topic is configurable
• #partitions determines max consumer (group) parallelism
– Consumer group A, with 2 consumers, reads from a 4-partition topic
– Consumer group B, with 4 consumers, reads from the same topic
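
Why the partition count caps parallelism can be seen in a small sketch. The helper assign_partitions is hypothetical and uses a plain round-robin; Kafka's actual assignment strategies differ in detail, but each partition is still read by at most one consumer within a group.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partitions round-robin over the consumers of one group."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Group A: 2 consumers on a 4-partition topic, two partitions each.
group_a = assign_partitions(4, ["a1", "a2"])
# Group B: 4 consumers on the same topic, one partition each.
group_b = assign_partitions(4, ["b1", "b2", "b3", "b4"])
# A fifth consumer in a group would get no partition and sit idle.
group_c = assign_partitions(4, ["c1", "c2", "c3", "c4", "c5"])
```

Adding consumers beyond the number of partitions does not increase throughput for that group.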

Topics
• Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset
– Consumers track their pointers via (offset, partition, topic) tuples
[Diagram: consumer group C1 reading from the partitions of a topic.]
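
That offsets are unique only within a partition can be shown with a toy model. The class ToyBroker and its methods are hypothetical, and the topic name is reused from the earlier slide purely as sample data.

```python
from collections import defaultdict


class ToyBroker:
    """Toy broker: one append-only list per (topic, partition)."""

    def __init__(self):
        self.logs = defaultdict(list)

    def append(self, topic, partition, msg):
        """Append msg and return (offset, partition, topic) for it."""
        log = self.logs[(topic, partition)]
        log.append(msg)
        return (len(log) - 1, partition, topic)

    def fetch(self, topic, partition, offset):
        """Return all messages in the partition from `offset` onwards."""
        return self.logs[(topic, partition)][offset:]


broker = ToyBroker()
p0_first = broker.append("zerg.hydra", 0, "a")
p0_second = broker.append("zerg.hydra", 0, "b")
p1_first = broker.append("zerg.hydra", 1, "c")
```

Both partitions start counting at 0, so an offset on its own is ambiguous; the full (offset, partition, topic) tuple is what identifies a position.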

Partition
Source: http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/

Broker
• The broker does not push messages to consumers; consumers poll messages from the broker.
[Diagram: queue topology and topic topology. A producer writes to the Kafka broker; Consumer1 and Consumer2 (Group1) and Consumer3 and Consumer4 (Group2) read from it, and the consumed message offset is updated in ZooKeeper.]

Putting it all together
Source: http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/

Kafka + Spark = Real Time Analytics

Analytics Flow

Data Ingestion Source

Real time Analysis with Spark Streaming

Analytics Result Displayed/Stored

Streaming In Detail
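
Spark Streaming handles a stream as a series of small batches. The ingest, analyze, and store flow can be sketched in plain Python without Spark or Kafka. Everything here is a hypothetical stand-in: the events are made-up page views with timestamps, micro_batches plays the role of the batch interval, and a list plays the role of the result store.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_interval_s):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Intervals with no events are skipped in this sketch.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval_s)].append(value)
    return [batches[index] for index in sorted(batches)]


def analyze(batch):
    """Per-batch analysis: count hits per page."""
    return dict(Counter(batch))


def run_flow(events, batch_interval_s):
    results = []  # stands in for a dashboard or database
    for batch in micro_batches(events, batch_interval_s):
        results.append(analyze(batch))
    return results


page_views = [(0.5, "/home"), (1.2, "/home"), (1.9, "/jobs"),
              (2.1, "/home"), (4.0, "/jobs")]
hits_per_batch = run_flow(page_views, batch_interval_s=2)
```

Each entry in hits_per_batch holds the page-hit counts for one two-second batch, in time order.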


Kafka adoption and use cases
• LinkedIn : activity streams, operational metrics, data bus
– 400 nodes, 18k topics, 220B msg/day (peak 3.2M msg/s), May 2014
• Netflix : real-time monitoring and event processing
• Twitter : as part of their Storm real-time data pipelines
• Spotify : log delivery (from 4h down to 10s), Hadoop
• Loggly : log collection and processing
• Mozilla : telemetry data
• Airbnb, Cisco, Gnip, InfoChimps, Ooyala, Square, Uber, …
Source: https://cwiki.apache.org/confluence/display/KAFKA/Powered+By

Survey
Your feedback is vital to us, be it a compliment, a suggestion, or a complaint. It helps us make your experience better!
Please spare a few minutes to take the survey after the webinar.
