Cortex: Horizontally Scalable, Highly Available Prometheus
Cortex is a horizontally scalable, highly available monitoring and alerting system based on Prometheus, designed to efficiently handle large volumes of time series data while providing long-term storage and multitenancy. It leverages technologies such as AWS DynamoDB, Google Cloud Bigtable, and Apache Cassandra for durability and reliability, while managing data ingestion and queries effectively. Additionally, Grafana Cloud offers a fully managed metrics platform integrating Grafana and Prometheus, simplifying observability for users.
Prometheus
• A monitoring & alerting system.
• Inspired by Google’s BorgMon
• Originally built by SoundCloud in 2012
• Open Source, now part of the CNCF
• Simple text-based metrics format
• Multidimensional data model
• Rich, concise query language
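As a rough illustration of the text-based format and the label-based data model, the sketch below renders one sample as a line of Prometheus-style exposition text. The helper name `format_sample` and the metric and label names are hypothetical; a real exporter would normally use a client library, and escaping of label values is omitted here.

```python
def format_sample(name, labels, value):
    """Render one sample as `name{k="v",...} value` (hypothetical helper)."""
    if not labels:
        return f"{name} {value}"
    # Sort labels so the same series always renders identically.
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{pairs}}} {value}"


line = format_sample(
    "http_requests_total",
    {"method": "GET", "status": "200"},
    1027,
)
print(line)
```

Each distinct combination of label values identifies its own time series, which is what "multidimensional" refers to.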
Cortex
• Horizontally scalable Prometheus
• Distributed, fault tolerant architecture
• Long term storage
• Multitenant
github.com/cortexproject/cortex
16/06/2016 First design doc
25/08/2016 PromCon 2016 talk
25/10/2016 Renamed to Cortex
23/01/2017 Support for Recording Rules & Alerts
13/07/2017 BigTable support added
18/08/2017 PromCon 2017 talk
08/02/2018 Cassandra support added
20/09/2018 Joined CNCF Sandbox
http://goo.gl/prdUYV
Community
• Commits from 37 contributors, spanning ~6 companies.
• Apache 2 license.
• Community mailing list + ~fortnightly call since Feb 2018.
• Establishing governance based on CNI.
Prometheus Scaling
[Diagram: Prometheus scraping "Your Jobs", "Your Apps", and "Your Infra"]
• Scale up
• Manually shard
Cortex HA: Dynamo-style replication
[Diagram: Cortex Distributors and three Cortex Ingesters]
Distributor replicates samples on ingest. Waits for N/2 ACKs from ingesters to ensure consistency.
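The replicate-and-wait step can be sketched roughly as follows. This is a minimal illustration, not Cortex's implementation: `Ring`, `replicate`, and the `send` callback are hypothetical names, the ring uses one token per ingester, writes are sent sequentially rather than in parallel, and "N/2 ACKs" is interpreted as a strict majority (`n // 2 + 1`), which is an assumption.

```python
import hashlib
from bisect import bisect_right


def token(key: str) -> int:
    """Map a string onto a 32-bit hash ring position."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


class Ring:
    def __init__(self, ingesters):
        # One token per ingester keeps the sketch short.
        self.entries = sorted((token(name), name) for name in ingesters)

    def replicas(self, series_key: str, n: int):
        """Return up to n distinct ingesters clockwise from the key's position."""
        positions = [t for t, _ in self.entries]
        start = bisect_right(positions, token(series_key))
        count = min(n, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


def replicate(ring, series_key, sample, send, n=3):
    """Send a sample to n ingesters; report success if a majority ACKed."""
    needed = n // 2 + 1  # assumption: "N/2 ACKs" means a strict majority
    acks = 0
    for ingester in ring.replicas(series_key, n):
        if send(ingester, series_key, sample):
            acks += 1
    return acks >= needed


# Usage: three in-memory "ingesters", one of which is failing.
stores = {"ingester-1": [], "ingester-2": [], "ingester-3": []}


def send(ingester, key, sample):
    if ingester == "ingester-2":  # simulate one failed ingester
        return False
    stores[ingester].append((key, sample))
    return True


ring = Ring(stores)
ok = replicate(ring, 'up{job="api"}', (1000, 1.0), send)
print(ok)
```

With one of three ingesters down the write still succeeds, which is the availability property the replication scheme is after.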
[Diagram: Cortex Queriers]
Querier de-dupes samples on read, again only waiting for N/2 responses.
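De-duplication on read could look something like the sketch below. It assumes each ingester response is a list of `(timestamp, value)` pairs for a single series and that replicas hold identical copies of a sample; `merge_dedupe` is a hypothetical name, not a Cortex function.

```python
def merge_dedupe(responses):
    """Merge per-ingester sample lists, keeping one value per timestamp."""
    merged = {}
    for samples in responses:
        for timestamp, value in samples:
            # Replicas hold copies of the same sample, so the first one wins.
            merged.setdefault(timestamp, value)
    return sorted(merged.items())


a = [(1000, 1.0), (1015, 2.0)]
b = [(1015, 2.0), (1030, 3.0)]
print(merge_dedupe([a, b]))
```

Because every sample was written to several ingesters, a subset of responses can still cover the full series, and overlapping samples collapse to one.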
Durability is hard…
AWS DynamoDB
Google Cloud Bigtable
Apache Cassandra
…let someone else deal with it.
• Why not just write the samples straight to the NoSQL DB?
• By building & flushing chunks, Cortex acts as a “write deamplifier”, massively reducing cost.
• The NoSQL DBs also don’t necessarily support the right indexes for executing PromQL queries. Cortex adds these.
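The "write deamplifier" idea can be sketched as an in-memory buffer that turns many sample appends into a few chunk writes. Everything here is illustrative: `ChunkBuffer`, `CountingStore`, and the tiny `chunk_size` are hypothetical, and a real system would presumably also flush on age and compress the chunk.

```python
class ChunkBuffer:
    """Buffer samples per series in memory and flush them as whole chunks."""

    def __init__(self, store, chunk_size=4):
        self.store = store            # any object with put(series, samples)
        self.chunk_size = chunk_size  # tiny value, for illustration only
        self.pending = {}

    def append(self, series, timestamp, value):
        samples = self.pending.setdefault(series, [])
        samples.append((timestamp, value))
        if len(samples) >= self.chunk_size:
            self.flush(series)

    def flush(self, series):
        samples = self.pending.pop(series, [])
        if samples:
            self.store.put(series, samples)  # one write covers many samples


class CountingStore:
    """Stand-in for the NoSQL DB that just counts writes."""

    def __init__(self):
        self.writes = 0
        self.chunks = []

    def put(self, series, samples):
        self.writes += 1
        self.chunks.append((series, samples))


store = CountingStore()
buffer = ChunkBuffer(store, chunk_size=4)
for t in range(8):
    buffer.append('up{job="api"}', t, 1.0)
print(store.writes)
```

Eight appended samples result in two store writes here; larger chunks widen that ratio, which is where the cost reduction comes from.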
30k samples/s
450k series
~10 IOPs
Pod-per-tenant
Pros
• No application modifications necessary.
• Effectively zero chance of “leakage” between tenants.
Cons
• Cattle-not-pets
• Provisioning automation hides a lot of complexity…
Multitenant
Pros
• Per-tenant marginal costs can be close to zero.
• Can take advantage of statistical multiplexing.
• Reduced provisioning complexity can be traded for more “interesting” architecture.
Cons
• Takes work…
Grafana Cloud: Your complete, fully managed, hosted metrics platform.
What is Grafana Cloud?
Grafana Cloud is a hosted and fully managed SaaS metrics platform that helps Ops and Dev teams using Grafana to understand the behavior of their applications and infrastructure.
Grafana Cloud allows users to provision and manage the best open source observability tools, Grafana and Prometheus, all through a simple UI and single API.
Store, visualize and alert without the headache of scaling or managing your own monitoring stack.