Cortex: Horizontally Scalable, Highly Available Prometheus
Tom Wilkie, Nov 2018

@tom_wilkie
Prometheus
• A monitoring & alerting system.

• Inspired by Google’s BorgMon

• Originally built by SoundCloud in 2012

• Open Source, now part of the CNCF

• Simple text-based metrics format

• Multidimensional datamodel

• Rich, concise query language
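As an illustration of the text format and the multidimensional data model, here is a minimal Python sketch that parses one sample line into a metric name, a set of labels and a value. The regular expressions are a deliberate simplification (no timestamps, escapes, or comment lines), and the helper names `parse_sample` and `series_key` are hypothetical, not part of any Prometheus library.

```python
import re

# Simplified patterns: a metric name, optional {labels}, then a value.
# Real exposition lines may also carry timestamps, escapes and comments.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Parse one sample line into (metric name, labels dict, value)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, raw_value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(raw_value)


def series_key(name, labels):
    """A series is identified by its name plus its sorted label pairs."""
    return (name, tuple(sorted(labels.items())))
```

Each distinct combination of name and label values is its own time series, which is what lets the query language slice and aggregate along any label.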
Cortex
• Horizontally scalable Prometheus

• Distributed, fault tolerant architecture

• Long term storage

• Multitenant

github.com/cortexproject/cortex
16/06/2016  First design doc

25/08/2016  PromCon 2016 talk

25/10/2016  Renamed to Cortex

23/01/2017  Support for Recording Rules & Alerts

13/07/2017  BigTable support added

18/08/2017  PromCon 2017 talk

08/02/2018  Cassandra support added

20/09/2018  Joined CNCF Sandbox
http://goo.gl/prdUYV
>2 million samples/s

>100 million timeseries
Adopters Users
Community
• Commits from 37 contributors,
spanning ~6 companies.

• Apache 2 license.

• Community mailing list +
~fortnightly call since Feb
2018.

• Establishing governance
based on CNI.

Horizontally Scalable

Highly Available

Long Term Storage

Multitenant
Horizontally Scalable
Prometheus Scaling
[Diagram: two options. Scale Up: a single Prometheus for all of your jobs and apps. Manually Shard: separate Prometheus servers, e.g. for your apps and your infra.]
Cortex Scaling: Distributed Hash Table
[Diagram: a Cortex Distributor computes hash(s) for each incoming sample s and uses it to pick one of four Cortex Ingesters, which sit at positions 0, 16, 32 and 48 in the hash space.]
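The distributed hash table idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Cortex's implementation: the hash space of 64, the SHA-256 hash, the "first token at or after the hash, wrapping around" rule, and the names `Ring`, `owner_of` and `hash_series` are all hypothetical choices made so the example lines up with the 0/16/32/48 positions on the slide.

```python
import bisect
import hashlib

# Hypothetical size, chosen so tokens 0/16/32/48 split the space evenly.
HASH_SPACE = 64


def hash_series(series_key):
    """Map a series key to a position in [0, HASH_SPACE)."""
    digest = hashlib.sha256(series_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % HASH_SPACE


class Ring:
    """Maps positions in the hash space to the ingester that owns them."""

    def __init__(self, owners):
        # owners: dict of token (int) -> ingester name
        self._owners = dict(owners)
        self._tokens = sorted(self._owners)

    def owner_of(self, position):
        # Assumed rule: first token at or after the position, wrapping to
        # the lowest token when the position is past the last one.
        index = bisect.bisect_left(self._tokens, position)
        token = self._tokens[index % len(self._tokens)]
        return self._owners[token]

    def lookup(self, series_key):
        return self.owner_of(hash_series(series_key))
```

Because the mapping depends only on the hash and the token list, any distributor holding the same token list routes a given series to the same ingester, and adding an ingester only means adding tokens.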
Global View
[Diagram: separate Prometheus servers in us-central1 and eu-west2, each monitoring that region's jobs and apps.]
Can configure multiple
datasources in Grafana…

…but then only see data for one
Prometheus at a time.
Global View II
[Diagram: a “global” Prometheus sitting above the Prometheus servers in us-central1 and eu-west2.]
Can configure a “global”
Prometheus to federate samples
from the “local” Prometheus servers…

…but in practice it only propagates
aggregates, rules have to be
preconfigured, it is hard to scale, etc.
Global View III
[Diagram: data from us-central1 and eu-west2 flows into a single “global” Cortex.]
Or can push all data to a central
Cortex cluster.

Cortex's horizontal scalability
allows it to handle all the
raw samples.
Highly Available
Prometheus HA
[Diagram: Prometheus monitoring your jobs and apps, with two Alertmanagers.]
Cortex HA: Dynamo-style replication
[Diagram: a Cortex Distributor sends each sample s to three Cortex Ingesters.]
Distributor replicates
samples on ingest.

Waits for a quorum of ACKs
(more than N/2) from ingesters
to ensure consistency.
[Diagram: a Cortex Querier reads sample s back from the Ingesters.]
Querier de-dupes
samples on read,
again only waiting for a quorum
(more than N/2) of responses.
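A minimal Python sketch of the replicate-on-write, de-dupe-on-read pattern follows. It is a toy model, not Cortex code: the `Ingester` class, the `replicate` and `read_deduped` functions, and the in-memory storage are all hypothetical. The sketch assumes the quorum is a strict majority, N // 2 + 1, so that any write quorum and any read quorum overlap in at least one ingester, and it calls ingesters one after another rather than in parallel.

```python
class Ingester:
    """Toy in-memory ingester; `healthy` simulates an outage."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.samples = {}  # (series, timestamp) -> value

    def push(self, series, timestamp, value):
        if not self.healthy:
            return False  # no ACK
        self.samples[(series, timestamp)] = value
        return True

    def query(self, series):
        if not self.healthy:
            return None  # no response
        return [(ts, v) for (s, ts), v in self.samples.items() if s == series]


def quorum(n):
    """Assumed quorum: a strict majority of the n replicas."""
    return n // 2 + 1


def replicate(ingesters, series, timestamp, value):
    """Write to every replica; succeed once a quorum has ACKed."""
    acks = sum(1 for ing in ingesters if ing.push(series, timestamp, value))
    return acks >= quorum(len(ingesters))


def read_deduped(ingesters, series):
    """Read from the replicas and merge duplicates by timestamp."""
    responses = [r for r in (ing.query(series) for ing in ingesters) if r is not None]
    if len(responses) < quorum(len(ingesters)):
        raise RuntimeError("not enough ingesters responded")
    merged = {}
    for response in responses:
        for timestamp, value in response:
            merged[timestamp] = value  # identical copies collapse to one
    return sorted(merged.items())
```

With three replicas, one ingester can be down and both writes and reads still succeed; with two down, both fail.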
Long Term Storage
durability
/dʒɔːrəˈbɪlɪti/
noun
1. the ability to withstand wear, pressure, or damage.
“the reliability and durability of plastics”
Durability is hard…
AWS DynamoDB
Google Cloud Bigtable
Apache Cassandra
…let someone else deal with it.
• Why not just write the samples straight to the NoSQL DB?

• By building & flushing chunks, Cortex acts as a “write deamplifier”,
massively reducing cost.

• The NoSQL DBs also don’t necessarily support the right indexes for
executing PromQL queries. Cortex adds these.
30k samples/s, 450k series → ~10 IOPS
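The “write deamplifier” effect can be illustrated with a small Python sketch: samples are buffered per series in memory and written to the store only when a chunk fills up, so many samples become one write. The `ChunkBuilder` class, the `samples_per_chunk` parameter and the list standing in for the NoSQL DB are hypothetical; a real system would presumably also flush on age and compress the chunk.

```python
class ChunkBuilder:
    """Buffers samples per series and flushes them to a store as chunks."""

    def __init__(self, store, samples_per_chunk=120):
        self.store = store  # a list standing in for the NoSQL DB
        self.samples_per_chunk = samples_per_chunk  # hypothetical threshold
        self.open_chunks = {}  # series -> list of (timestamp, value)

    def append(self, series, timestamp, value):
        chunk = self.open_chunks.setdefault(series, [])
        chunk.append((timestamp, value))
        if len(chunk) >= self.samples_per_chunk:
            self.flush(series)

    def flush(self, series):
        # One store write per chunk, regardless of how many samples it holds.
        chunk = self.open_chunks.pop(series, None)
        if chunk:
            self.store.append((series, chunk))


store = []
builder = ChunkBuilder(store, samples_per_chunk=100)
for timestamp in range(1000):
    for series_id in range(10):
        builder.append(f"series-{series_id}", timestamp, 1.0)
# 10,000 samples were appended, but the store saw only 100 writes.
```

The ratio of samples to store writes equals the chunk size, which is why batching in the ingesters cuts the number of store operations so sharply.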
Multitenant
Pod-per-tenant
[Diagram: samples s pass through an Auth / Frontend to one pod per tenant, set up by Automated Provisioning.]
Multitenant
[Diagram: samples s pass through an Auth / Frontend to shared services.]
Natively multitenant
services handle different
users within the same
process.
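To make “different users within the same process” concrete, here is a minimal Python sketch in which every read and write carries a tenant ID and all data is keyed by it. The `MultitenantStore` class and its methods are hypothetical and only illustrate the idea; the sketch assumes the Auth / Frontend layer has already authenticated the caller and supplied a trustworthy tenant ID.

```python
from collections import defaultdict


class MultitenantStore:
    """One in-process store shared by all tenants, partitioned by tenant ID."""

    def __init__(self):
        # tenant_id -> series -> list of (timestamp, value)
        self._by_tenant = defaultdict(dict)

    def push(self, tenant_id, series, timestamp, value):
        if not tenant_id:
            raise PermissionError("missing tenant ID")
        samples = self._by_tenant[tenant_id].setdefault(series, [])
        samples.append((timestamp, value))

    def query(self, tenant_id, series):
        if not tenant_id:
            raise PermissionError("missing tenant ID")
        # Only this tenant's partition is ever consulted.
        return list(self._by_tenant.get(tenant_id, {}).get(series, []))
```

Isolation here depends entirely on every code path passing the tenant ID through, which is part of the work a natively multitenant design takes on.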
Pod-per-tenant
Pros
• No application modifications
necessary.

• Effectively zero chance of “leakage”
between tenants.

Cons
• Cattle-not-pets

• Provisioning automation hides a lot of
complexity…

Multitenant
Pros
• Per-tenant marginal costs can be
close to zero

• Can take advantage of statistical
multiplexing.

• Reduced provisioning complexity can
be traded for more “interesting”
architecture.

Cons
• Takes work…
Horizontally Scalable

Highly Available

Long Term Storage

Multitenant
More Reading
• PromCon 2016 talk

• KubeCon 2016 talk

• PromCon 2017 talk

• Original design doc

• CNCF TOC Presentation

• Amazon’s Dynamo Paper
Get Involved!
github.com/cortexproject/cortex

#cortex on slack.cncf.io

@tom_wilkie, tom.wilkie@gmail.com
What is Grafana Cloud?
Grafana Cloud: Your complete, fully managed, hosted metrics platform.
Store, visualize and alert without the headache of scaling or managing
your own monitoring stack.
Grafana Cloud is a hosted and fully managed SaaS metrics
platform that helps Ops and Dev teams using Grafana
to understand the behavior of their applications and
infrastructure.
Grafana Cloud allows users to provision and manage
the best open source observability tools - Grafana and
Prometheus - all through a simple UI and single API.
