www.edureka.co/r-for-analytics
www.edureka.co/apache-Kafka
Apache Kafka with Spark Streaming - Real Time Analytics
Redefined
Slide 2Slide 2Slide 2 www.edureka.co/apache-Kafka
Agenda
At the end of this webinar we will be able understand :
 What Is Kafka?
 Why We Need Kafka ?
 Kafka Components
 How Kafka Works
 Which Companies Are Using Kafka ?
 Kafka And Spark Integration Hands on
Slide 3Slide 3Slide 3 www.edureka.co/apache-Kafka
Why Kafka ??
Slide 4Slide 4Slide 4 www.edureka.co/apache-Kafka
Why Kafka?
When we have other messaging systems
Aren’t they Good?
Kafka Vs Other Message Broker?
Slide 5Slide 5Slide 5 www.edureka.co/apache-Kafka
They all are Good
But not for all use-cases.
Slide 6Slide 6Slide 6 www.edureka.co/apache-Kafka
• Transportation of logs
• Activity Stream in Real time.
• Collection of Performance Metrics
– CPU/IO/Memory usage
– Application Specific
• Time taken to load a web-page.
• Time taken by Multiple Services while building a web-page.
• No of requests.
• No of hits on a particular page/url.
So what are my Use-cases…
Slide 7Slide 7Slide 7 www.edureka.co/apache-Kafka
What is Common?
Scalable : Need to be Highly Scalable. A lot of Data. It can be billions of message.
Reliability of messages, What If, I loose a small no. of messages. Is it fine with me ?
Distributed : Multiple Producers, Multiple Consumers
High-throughput : Does not need to have JMS Standards, as it may be an overkill for some use-cases like
transportation of logs.
As per JMS, each message has to be acknowledged back.
Exactly one delivery guarantee requires two-phase commit.
Slide 8Slide 8Slide 8 www.edureka.co/apache-Kafka
Why LinkedIn built Kafka ?
To collect its growing data, LinkedIn developed many custom data pipelines for streaming and queueing data, like :
To flow data into
data warehouse
To send batches of
data into our
hadoop workflow
for analytics
To collect and
aggregate logs
from every service
To collect tracking
events like page
views
To queue their
inmail messaging
system
To keep their
people search
system up to date
whenever someone
updated their
profile
As the site needed to scale, each individual pipeline needed to scale and many other pipelines were needed.
Something had to give !!!
The result was development of
Kafka
Slide 9Slide 9Slide 9 www.edureka.co/apache-Kafka
The number has been growing since
Source : confluent
Slide 10Slide 10Slide 10 www.edureka.co/apache-Kafka
http://gigaom.com/2013/12/09/netflix-open-sources-its-data-traffic-cop-suro/
A diagram of LinkedIn’s data architecture as of February 2013, including everything from Kafka to Teradata.
diagram of LinkedIn’s data architecture
Slide 11Slide 11Slide 11 www.edureka.co/apache-Kafka
Kafka ?
Built with speed and
scalability in mind.
Enabled near real-time
access to any data
source
Empowered hadoop
jobs
Allowed us to build
real-time analytics
Vastly improved our
site monitoring and
alerting capability
Enabled us to visualize
and track our call
graphs.
Apache Kafka Hits 1.1 Trillion Messages Per Day (September 2015)
Kafka is a distributed pub-sub
messaging platform
Universal pipeline, built around
the concept of a commit log
Kafka as a universal stream broker
Slide 12Slide 12Slide 12 www.edureka.co/apache-Kafka
Kafka Benchmarks
Slide 13Slide 13Slide 13 www.edureka.co/apache-Kafka
Kafka Producer/Consumer Performance
Processes hundred of thousands of messages in a second
Slide 14Slide 14Slide 14 www.edureka.co/apache-Kafka14
http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
How fast is Kafka?
• “Up to 2 million writes/sec on 3 cheap machines”
– Using 3 producers on 3 different machines, 3x async replication
• Only 1 producer/machine because NIC already saturated
• Sustained throughput as stored data grows
– Slightly different test config than 2M writes/sec above.
• Test setup
– Kafka trunk as of April 2013, but 0.8.1+ should be similar.
– 3 machines: 6-core Intel Xeon 2.5 GHz, 32GB RAM, 6x 7200rpm SATA, 1GigE
Slide 15Slide 15Slide 15 www.edureka.co/apache-Kafka
• Fast writes:
– While Kafka persists all data to disk, essentially all writes go to the
page cache of OS, i.e. RAM.
– Cf. hardware specs and OS tuning (we cover this later)
• Fast reads:
– Very efficient to transfer data from page cache to a network socket
– Linux: sendfile() system call
• Combination of the two = fast Kafka!
– Example (Operations): On a Kafka cluster where the consumers are mostly caught up you will see no read
activity on the disks as they will be serving data entirely from cache.
15
http://kafka.apache.org/documentation.html#persistence
Why is Kafka so fast?
Slide 16Slide 16Slide 16 www.edureka.co/apache-Kafka
• Example: loggly.com, who run Kafka & Co. on Amazon AWS
– “99.99999% of the time our data is coming from disk cache and RAM; only very rarely do we hit the
disk.”
– “One of our consumer groups (8 threads) which maps a log to a customer can process about 200,000
events per second draining from 192 partitions spread across 3 brokers.”
• Brokers run on m2.xlarge Amazon EC2 instances backed by provisioned IOPS
16
http://www.developer-tech.com/news/2014/jun/10/why-loggly-loves-apache-kafka-how-unbreakable-infinitely-scalable-messaging-makes-log-management-better/
Why is Kafka so fast?
Slide 17Slide 17Slide 17 www.edureka.co/apache-Kafka
How it works ??
Slide 18Slide 18Slide 18 www.edureka.co/apache-Kafka
• The who is who
– Producers write data to brokers.
– Consumers read data from brokers.
– All this is distributed.
• The data
– Data is stored in topics.
– Topics are split into partitions, which are replicated.
18
A first look
Slide 19Slide 19Slide 19 www.edureka.co/apache-Kafka
Broker(s)
19
• Topic: feed name to which messages are published
– Example: “zerg.hydra”
ne
w
Producer A1
Producer A2
Producer An
…
…
Kafka prunes “head” based on age or max size or “key”
Older msgs Newer msgs
Kafka topic
Topics
Producers always append to “tail”
(think: append to a file)
Slide 20Slide 20Slide 20 www.edureka.co/apache-Kafka
Broker(s)
20
ne
w
Producer A1
Producer A2
Producer An
…
Producers always append to “tail”
(think: append to a file)
…
Older msgs Newer msgs
Consumer group C1 Consumers use an “offset pointer” to
track/control their read progress
(and decide the pace of consumption)
Consumer group C2
Topics
Slide 21Slide 21Slide 21 www.edureka.co/apache-Kafka
• A topic consists of partitions.
• Partition: ordered + immutable sequence of messages that is continually appended
Topics
Slide 22Slide 22Slide 22 www.edureka.co/apache-Kafka2
2
• #partitions of a topic is configurable
• #partitions determines max consumer (group) parallelism
– Consumer group A, with 2 consumers, reads from a 4-partition topic
– Consumer group B, with 4 consumers, reads from the same topic
Topics
Slide 23Slide 23Slide 23 www.edureka.co/apache-Kafka23
• Offset: messages in the partitions are each assigned a unique (per partition) and sequential id
called the offset
– Consumers track their pointers via (offset, partition, topic) tuples
Consumer group C1
Topics
Slide 24Slide 24Slide 24 www.edureka.co/apache-Kafka24
http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/
Partition
Slide 25Slide 25Slide 25 www.edureka.co/apache-Kafka
Consumer3
(Group2)
Kafka
Broker
Consumer4
(Group2)
Producer
Zookeeper
Consumer2
(Group1)
Consumer1
(Group1)
Update Consumed
Message offset
Queue
Topology
Topic
Topology
Kafka
Broker
Broker does not Push messages to Consumer, Consumer Polls messages from Broker.
Broker
Slide 26Slide 26Slide 26 www.edureka.co/apache-Kafka26
http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/
Putting it altogether
Slide 27Slide 27Slide 27 www.edureka.co/apache-Kafka
Kafka + Spark = Real Time Analytics
Slide 28Slide 28Slide 28 www.edureka.co/apache-Kafka
Analytics Flow
Slide 29Slide 29Slide 29 www.edureka.co/apache-Kafka
Data Ingestion Source
Slide 30Slide 30Slide 30 www.edureka.co/apache-Kafka
Real time Analysis with Spark Streaming
Slide 31Slide 31Slide 31 www.edureka.co/apache-Kafka
Analytics Result Displayed/Stored
Slide 32Slide 32Slide 32 www.edureka.co/apache-Kafka
Streaming In Detail
Apache Kafka with Spark Streaming: Real-time Analytics Redefined
Slide 34Slide 34Slide 34 www.edureka.co/apache-Kafka
• LinkedIn : activity streams, operational metrics, data bus
– 400 nodes, 18k topics, 220B msg/day (peak 3.2M msg/s), May 2014
• Netflix : real-time monitoring and event processing
• Twitter : as part of their Storm real-time data pipelines
• Spotify : log delivery (from 4h down to 10s), Hadoop
• Loggly : log collection and processing
• Mozilla : telemetry data
• Airbnb, Cisco, Gnip, InfoChimps, Ooyala, Square, Uber, …
34
https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
Kafka adoption and use cases
Questions
Slide 35
Slide 36
Your feedback is vital for us, be it a compliment, a suggestion or a complaint. It helps us to make your
experience better!
Please spare few minutes to take the survey after the webinar.
Survey

More Related Content

PPTX
Introduction to Azure Databricks
PPTX
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
PPTX
Pythonsevilla2019 - Introduction to MLFlow
PPTX
Flexible and Real-Time Stream Processing with Apache Flink
PDF
Emerging Trends in Data Engineering
PDF
Introduction to Apache Flink - Fast and reliable big data processing
PPTX
Spark and Spark Streaming
PPTX
Introduction to ML with Apache Spark MLlib
Introduction to Azure Databricks
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Pythonsevilla2019 - Introduction to MLFlow
Flexible and Real-Time Stream Processing with Apache Flink
Emerging Trends in Data Engineering
Introduction to Apache Flink - Fast and reliable big data processing
Spark and Spark Streaming
Introduction to ML with Apache Spark MLlib

What's hot (20)

PDF
Iceberg: a fast table format for S3
PPTX
Azure Cognitive Services Bring AI to your applications in 3 steps.pptx
PDF
MLOps Using MLflow
PPTX
Introduction to Apache Spark
PPTX
Introduction to the Microsoft Azure Cloud.pptx
PDF
Data Versioning and Reproducible ML with DVC and MLflow
PDF
MLflow: A Platform for Production Machine Learning
PPTX
Apache Flink in the Cloud-Native Era
PDF
Iceberg: A modern table format for big data (Strata NY 2018)
PPTX
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
PDF
Vector databases and neural search
PDF
Vector database
PDF
Performance Acceleration: Summaries, Recommendation, MPP and more
PDF
Top 5 mistakes when writing Spark applications
PPTX
Technical overview of Azure Cosmos DB
PPTX
MLOps.pptx
PDF
Apache Flink internals
PPTX
Introduction to Kafka and Zookeeper
PPTX
Azure Synapse Analytics Overview (r1)
PDF
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Iceberg: a fast table format for S3
Azure Cognitive Services Bring AI to your applications in 3 steps.pptx
MLOps Using MLflow
Introduction to Apache Spark
Introduction to the Microsoft Azure Cloud.pptx
Data Versioning and Reproducible ML with DVC and MLflow
MLflow: A Platform for Production Machine Learning
Apache Flink in the Cloud-Native Era
Iceberg: A modern table format for big data (Strata NY 2018)
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Vector databases and neural search
Vector database
Performance Acceleration: Summaries, Recommendation, MPP and more
Top 5 mistakes when writing Spark applications
Technical overview of Azure Cosmos DB
MLOps.pptx
Apache Flink internals
Introduction to Kafka and Zookeeper
Azure Synapse Analytics Overview (r1)
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Ad

Similar to Apache Kafka with Spark Streaming: Real-time Analytics Redefined (20)

PDF
Connect K of SMACK:pykafka, kafka-python or?
PPTX
How kafka is transforming hadoop, spark & storm
PPTX
How Apache Kafka is transforming Hadoop, Spark and Storm
PPTX
Apache Kafka
PDF
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
PPTX
Building Event-Driven Systems with Apache Kafka
PDF
Apache Kafka - Scalable Message-Processing and more !
PDF
An Introduction to Apache Kafka
PDF
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
PDF
Building Streaming Data Applications Using Apache Kafka
PPTX
Apache Kafka: Next Generation Distributed Messaging System
PPTX
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
PDF
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
PPTX
Building streaming data applications using Kafka*[Connect + Core + Streams] b...
PDF
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
PDF
Fault Tolerance with Kafka
PPTX
Apache Kafka 0.8 basic training - Verisign
PDF
JConWorld_ Continuous SQL with Kafka and Flink
PDF
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
PPTX
Apache kafka
Connect K of SMACK:pykafka, kafka-python or?
How kafka is transforming hadoop, spark & storm
How Apache Kafka is transforming Hadoop, Spark and Storm
Apache Kafka
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Building Event-Driven Systems with Apache Kafka
Apache Kafka - Scalable Message-Processing and more !
An Introduction to Apache Kafka
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Building Streaming Data Applications Using Apache Kafka
Apache Kafka: Next Generation Distributed Messaging System
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Building streaming data applications using Kafka*[Connect + Core + Streams] b...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Fault Tolerance with Kafka
Apache Kafka 0.8 basic training - Verisign
JConWorld_ Continuous SQL with Kafka and Flink
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
Apache kafka
Ad

More from Edureka! (20)

PDF
What to learn during the 21 days Lockdown | Edureka
PDF
Top 10 Dying Programming Languages in 2020 | Edureka
PDF
Top 5 Trending Business Intelligence Tools | Edureka
PDF
Tableau Tutorial for Data Science | Edureka
PDF
Python Programming Tutorial | Edureka
PDF
Top 5 PMP Certifications | Edureka
PDF
Top Maven Interview Questions in 2020 | Edureka
PDF
Linux Mint Tutorial | Edureka
PDF
How to Deploy Java Web App in AWS| Edureka
PDF
Importance of Digital Marketing | Edureka
PDF
RPA in 2020 | Edureka
PDF
Email Notifications in Jenkins | Edureka
PDF
EA Algorithm in Machine Learning | Edureka
PDF
Cognitive AI Tutorial | Edureka
PDF
AWS Cloud Practitioner Tutorial | Edureka
PDF
Blue Prism Top Interview Questions | Edureka
PDF
Big Data on AWS Tutorial | Edureka
PDF
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
PDF
Kubernetes Installation on Ubuntu | Edureka
PDF
Introduction to DevOps | Edureka
What to learn during the 21 days Lockdown | Edureka
Top 10 Dying Programming Languages in 2020 | Edureka
Top 5 Trending Business Intelligence Tools | Edureka
Tableau Tutorial for Data Science | Edureka
Python Programming Tutorial | Edureka
Top 5 PMP Certifications | Edureka
Top Maven Interview Questions in 2020 | Edureka
Linux Mint Tutorial | Edureka
How to Deploy Java Web App in AWS| Edureka
Importance of Digital Marketing | Edureka
RPA in 2020 | Edureka
Email Notifications in Jenkins | Edureka
EA Algorithm in Machine Learning | Edureka
Cognitive AI Tutorial | Edureka
AWS Cloud Practitioner Tutorial | Edureka
Blue Prism Top Interview Questions | Edureka
Big Data on AWS Tutorial | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
Kubernetes Installation on Ubuntu | Edureka
Introduction to DevOps | Edureka

Recently uploaded (20)

PDF
The-2025-Engineering-Revolution-AI-Quality-and-DevOps-Convergence.pdf
PDF
Rapid Prototyping: A lecture on prototyping techniques for interface design
PDF
Electrocardiogram sequences data analytics and classification using unsupervi...
PDF
INTERSPEECH 2025 「Recent Advances and Future Directions in Voice Conversion」
PDF
4 layer Arch & Reference Arch of IoT.pdf
PDF
AI.gov: A Trojan Horse in the Age of Artificial Intelligence
PPTX
agenticai-neweraofintelligence-250529192801-1b5e6870.pptx
PDF
Co-training pseudo-labeling for text classification with support vector machi...
PPTX
Training Program for knowledge in solar cell and solar industry
PDF
Data Virtualization in Action: Scaling APIs and Apps with FME
PDF
Lung cancer patients survival prediction using outlier detection and optimize...
PDF
The-Future-of-Automotive-Quality-is-Here-AI-Driven-Engineering.pdf
PDF
SaaS reusability assessment using machine learning techniques
PDF
Aug23rd - Mulesoft Community Workshop - Hyd, India.pdf
PDF
5-Ways-AI-is-Revolutionizing-Telecom-Quality-Engineering.pdf
PDF
MENA-ECEONOMIC-CONTEXT-VC MENA-ECEONOMIC
PDF
Auditboard EB SOX Playbook 2023 edition.
PDF
Build Real-Time ML Apps with Python, Feast & NoSQL
PDF
NewMind AI Weekly Chronicles – August ’25 Week IV
PDF
IT-ITes Industry bjjbnkmkhkhknbmhkhmjhjkhj
The-2025-Engineering-Revolution-AI-Quality-and-DevOps-Convergence.pdf
Rapid Prototyping: A lecture on prototyping techniques for interface design
Electrocardiogram sequences data analytics and classification using unsupervi...
INTERSPEECH 2025 「Recent Advances and Future Directions in Voice Conversion」
4 layer Arch & Reference Arch of IoT.pdf
AI.gov: A Trojan Horse in the Age of Artificial Intelligence
agenticai-neweraofintelligence-250529192801-1b5e6870.pptx
Co-training pseudo-labeling for text classification with support vector machi...
Training Program for knowledge in solar cell and solar industry
Data Virtualization in Action: Scaling APIs and Apps with FME
Lung cancer patients survival prediction using outlier detection and optimize...
The-Future-of-Automotive-Quality-is-Here-AI-Driven-Engineering.pdf
SaaS reusability assessment using machine learning techniques
Aug23rd - Mulesoft Community Workshop - Hyd, India.pdf
5-Ways-AI-is-Revolutionizing-Telecom-Quality-Engineering.pdf
MENA-ECEONOMIC-CONTEXT-VC MENA-ECEONOMIC
Auditboard EB SOX Playbook 2023 edition.
Build Real-Time ML Apps with Python, Feast & NoSQL
NewMind AI Weekly Chronicles – August ’25 Week IV
IT-ITes Industry bjjbnkmkhkhknbmhkhmjhjkhj

Apache Kafka with Spark Streaming: Real-time Analytics Redefined

  • 2. Slide 2Slide 2Slide 2 www.edureka.co/apache-Kafka Agenda At the end of this webinar we will be able understand :  What Is Kafka?  Why We Need Kafka ?  Kafka Components  How Kafka Works  Which Companies Are Using Kafka ?  Kafka And Spark Integration Hands on
  • 3. Slide 3Slide 3Slide 3 www.edureka.co/apache-Kafka Why Kafka ??
  • 4. Slide 4Slide 4Slide 4 www.edureka.co/apache-Kafka Why Kafka? When we have other messaging systems Aren’t they Good? Kafka Vs Other Message Broker?
  • 5. Slide 5Slide 5Slide 5 www.edureka.co/apache-Kafka They all are Good But not for all use-cases.
  • 6. Slide 6Slide 6Slide 6 www.edureka.co/apache-Kafka • Transportation of logs • Activity Stream in Real time. • Collection of Performance Metrics – CPU/IO/Memory usage – Application Specific • Time taken to load a web-page. • Time taken by Multiple Services while building a web-page. • No of requests. • No of hits on a particular page/url. So what are my Use-cases…
  • 7. Slide 7Slide 7Slide 7 www.edureka.co/apache-Kafka What is Common? Scalable : Need to be Highly Scalable. A lot of Data. It can be billions of message. Reliability of messages, What If, I loose a small no. of messages. Is it fine with me ? Distributed : Multiple Producers, Multiple Consumers High-throughput : Does not need to have JMS Standards, as it may be an overkill for some use-cases like transportation of logs. As per JMS, each message has to be acknowledged back. Exactly one delivery guarantee requires two-phase commit.
  • 8. Slide 8Slide 8Slide 8 www.edureka.co/apache-Kafka Why LinkedIn built Kafka ? To collect its growing data, LinkedIn developed many custom data pipelines for streaming and queueing data, like : To flow data into data warehouse To send batches of data into our hadoop workflow for analytics To collect and aggregate logs from every service To collect tracking events like page views To queue their inmail messaging system To keep their people search system up to date whenever someone updated their profile As the site needed to scale, each individual pipeline needed to scale and many other pipelines were needed. Something had to give !!! The result was development of Kafka
  • 9. Slide 9Slide 9Slide 9 www.edureka.co/apache-Kafka The number has been growing since Source : confluent
  • 10. Slide 10Slide 10Slide 10 www.edureka.co/apache-Kafka http://gigaom.com/2013/12/09/netflix-open-sources-its-data-traffic-cop-suro/ A diagram of LinkedIn’s data architecture as of February 2013, including everything from Kafka to Teradata. diagram of LinkedIn’s data architecture
  • 11. Slide 11Slide 11Slide 11 www.edureka.co/apache-Kafka Kafka ? Built with speed and scalability in mind. Enabled near real-time access to any data source Empowered hadoop jobs Allowed us to build real-time analytics Vastly improved our site monitoring and alerting capability Enabled us to visualize and track our call graphs. Apache Kafka Hits 1.1 Trillion Messages Per Day (September 2015) Kafka is a distributed pub-sub messaging platform Universal pipeline, built around the concept of a commit log Kafka as a universal stream broker
  • 12. Slide 12Slide 12Slide 12 www.edureka.co/apache-Kafka Kafka Benchmarks
  • 13. Slide 13Slide 13Slide 13 www.edureka.co/apache-Kafka Kafka Producer/Consumer Performance Processes hundred of thousands of messages in a second
  • 14. Slide 14Slide 14Slide 14 www.edureka.co/apache-Kafka14 http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines How fast is Kafka? • “Up to 2 million writes/sec on 3 cheap machines” – Using 3 producers on 3 different machines, 3x async replication • Only 1 producer/machine because NIC already saturated • Sustained throughput as stored data grows – Slightly different test config than 2M writes/sec above. • Test setup – Kafka trunk as of April 2013, but 0.8.1+ should be similar. – 3 machines: 6-core Intel Xeon 2.5 GHz, 32GB RAM, 6x 7200rpm SATA, 1GigE
  • 15. Slide 15Slide 15Slide 15 www.edureka.co/apache-Kafka • Fast writes: – While Kafka persists all data to disk, essentially all writes go to the page cache of OS, i.e. RAM. – Cf. hardware specs and OS tuning (we cover this later) • Fast reads: – Very efficient to transfer data from page cache to a network socket – Linux: sendfile() system call • Combination of the two = fast Kafka! – Example (Operations): On a Kafka cluster where the consumers are mostly caught up you will see no read activity on the disks as they will be serving data entirely from cache. 15 http://kafka.apache.org/documentation.html#persistence Why is Kafka so fast?
  • 16. Slide 16Slide 16Slide 16 www.edureka.co/apache-Kafka • Example: loggly.com, who run Kafka & Co. on Amazon AWS – “99.99999% of the time our data is coming from disk cache and RAM; only very rarely do we hit the disk.” – “One of our consumer groups (8 threads) which maps a log to a customer can process about 200,000 events per second draining from 192 partitions spread across 3 brokers.” • Brokers run on m2.xlarge Amazon EC2 instances backed by provisioned IOPS 16 http://www.developer-tech.com/news/2014/jun/10/why-loggly-loves-apache-kafka-how-unbreakable-infinitely-scalable-messaging-makes-log-management-better/ Why is Kafka so fast?
  • 17. Slide 17Slide 17Slide 17 www.edureka.co/apache-Kafka How it works ??
  • 18. Slide 18Slide 18Slide 18 www.edureka.co/apache-Kafka • The who is who – Producers write data to brokers. – Consumers read data from brokers. – All this is distributed. • The data – Data is stored in topics. – Topics are split into partitions, which are replicated. 18 A first look
  • 19. Slide 19Slide 19Slide 19 www.edureka.co/apache-Kafka Broker(s) 19 • Topic: feed name to which messages are published – Example: “zerg.hydra” ne w Producer A1 Producer A2 Producer An … … Kafka prunes “head” based on age or max size or “key” Older msgs Newer msgs Kafka topic Topics Producers always append to “tail” (think: append to a file)
  • 20. Slide 20Slide 20Slide 20 www.edureka.co/apache-Kafka Broker(s) 20 ne w Producer A1 Producer A2 Producer An … Producers always append to “tail” (think: append to a file) … Older msgs Newer msgs Consumer group C1 Consumers use an “offset pointer” to track/control their read progress (and decide the pace of consumption) Consumer group C2 Topics
  • 21. Slide 21Slide 21Slide 21 www.edureka.co/apache-Kafka • A topic consists of partitions. • Partition: ordered + immutable sequence of messages that is continually appended Topics
  • 22. Slide 22Slide 22Slide 22 www.edureka.co/apache-Kafka2 2 • #partitions of a topic is configurable • #partitions determines max consumer (group) parallelism – Consumer group A, with 2 consumers, reads from a 4-partition topic – Consumer group B, with 4 consumers, reads from the same topic Topics
  • 23. Slide 23Slide 23Slide 23 www.edureka.co/apache-Kafka23 • Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset – Consumers track their pointers via (offset, partition, topic) tuples Consumer group C1 Topics
  • 24. Slide 24Slide 24Slide 24 www.edureka.co/apache-Kafka24 http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/ Partition
  • 25. Slide 25Slide 25Slide 25 www.edureka.co/apache-Kafka Consumer3 (Group2) Kafka Broker Consumer4 (Group2) Producer Zookeeper Consumer2 (Group1) Consumer1 (Group1) Update Consumed Message offset Queue Topology Topic Topology Kafka Broker Broker does not Push messages to Consumer, Consumer Polls messages from Broker. Broker
  • 26. Slide 26Slide 26Slide 26 www.edureka.co/apache-Kafka26 http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/ Putting it altogether
  • 27. Slide 27Slide 27Slide 27 www.edureka.co/apache-Kafka Kafka + Spark = Real Time Analytics
  • 28. Slide 28Slide 28Slide 28 www.edureka.co/apache-Kafka Analytics Flow
  • 29. Slide 29Slide 29Slide 29 www.edureka.co/apache-Kafka Data Ingestion Source
  • 30. Slide 30Slide 30Slide 30 www.edureka.co/apache-Kafka Real time Analysis with Spark Streaming
  • 31. Slide 31Slide 31Slide 31 www.edureka.co/apache-Kafka Analytics Result Displayed/Stored
  • 32. Slide 32Slide 32Slide 32 www.edureka.co/apache-Kafka Streaming In Detail
  • 34. Slide 34Slide 34Slide 34 www.edureka.co/apache-Kafka • LinkedIn : activity streams, operational metrics, data bus – 400 nodes, 18k topics, 220B msg/day (peak 3.2M msg/s), May 2014 • Netflix : real-time monitoring and event processing • Twitter : as part of their Storm real-time data pipelines • Spotify : log delivery (from 4h down to 10s), Hadoop • Loggly : log collection and processing • Mozilla : telemetry data • Airbnb, Cisco, Gnip, InfoChimps, Ooyala, Square, Uber, … 34 https://cwiki.apache.org/confluence/display/KAFKA/Powered+By Kafka adoption and use cases
  • 36. Slide 36 Your feedback is vital for us, be it a compliment, a suggestion or a complaint. It helps us to make your experience better! Please spare few minutes to take the survey after the webinar. Survey