SRE School: Instrumentation

Instrumentation is the foundation of a monitoring infrastructure. It is the part that directly touches the system(s) being monitored, the source of raw data for our collectors and analyzers and dashboards. It is also the only part that is not under an SRE team's direct control – instrumentation is usually plumbed through the codebase by product teams. Given this, an SRE's primary source of leverage is to make adding instrumentation as easy and painless as possible. We do this by writing instrumentation libraries with friendly, approachable, idiomatic APIs.

Metrics

Each measurable property of the system is a metric. Repeated measurements of a metric's value yield a time series of data samples. A metric's definition includes metadata about how to collect, aggregate, and interpret its samples.

Metric values can in theory be of any serializable data type, but in practice they are numbers, text, or distributions:

  • Numeric metrics may have an associated unit, ideally in a machine-readable annotation. This is most important for metrics where the "natural" definition of a unit is divisible, e.g. to record time intervals as an integral number of milliseconds instead of a fractional number of seconds.
  • Text metrics are most often constants, but are sometimes used for gauges if there's a small number of possible values.
  • Distributions are used for metrics with a very large set of possible values. They are usually visualized as a histogram or heat map.

A C-style enumeration such as enum { OPT_FOO = 1, OPT_BAR = 2 } is best reported as "OPT_FOO" and "OPT_BAR"[1] instead of numeric 1 and 2.

Booleans can be thought of as the enum { FALSE, TRUE }. Some monitoring systems give them a separate type to simplify query planning and analysis.

Metrics can be defined ad-hoc at point of emission, or statically in some global type. I prefer statically declared metrics because that gives the opportunity to attach metric metadata.

There are four common categories of metrics: constants, gauges, counters, and distributions[2].

Constants

A metric that does not change for the lifetime of its associated system component. Samples of a constant metric will always contain the same value. Common examples are build information (e.g. git commit ID), process start time, and process ID. Don't use constants for things that are only constant-ish, such as hostnames.

Constants can be text or numbers. For numbers, integers usually work better than floats (e.g. represent your start time as int64 milliseconds instead of float64 seconds).

Time              /build/timestamp (seconds since UNIX epoch)  /build/revision_id
2011-12-13 14:15  1300000000                                   git:da39a3ee5e6b4b0d3255bfef95601890afd80709
2011-12-13 14:16  1300000000                                   git:da39a3ee5e6b4b0d3255bfef95601890afd80709
2011-12-13 14:17  1300000000                                   git:da39a3ee5e6b4b0d3255bfef95601890afd80709

In Go, using a constant metric might look something like this:

import "foo.com/my/monitoring/impl/metric"

var (
	_TIMESTAMP   int64  /* filled in by linker */
	_REVISION_ID string /* filled in by linker */

	_ = metric.NewConstantInt64("/build/timestamp", _TIMESTAMP)
	_ = metric.NewConstantString("/build/revision_id", _REVISION_ID)
)

Gauges

A gauge metric can vary freely across its possible value range. Think of them like tachometers.

Gauges can be text or numbers.

  • Example integer gauges are memory allocation, thread count, and active RPC count.
  • Example text gauges are mutable config settings (e.g. backend addresses), environment variables, and hostnames.

Time              /proc/thread_count  /proc/working_directory
2011-12-13 14:15  200                 /var/www/current
2011-12-13 14:16  250                 /var/www/previous
2011-12-13 14:17  230                 /var/www/current

In Go, using a gauge metric might look something like this:

import "foo.com/my/monitoring/impl/metric"

var (
	threadCount = metric.NewGaugeInt64("/proc/thread_count")
	workingDir = metric.NewGaugeString("/proc/working_directory")
)

func updateMetrics() {
	threadCount.Set(int64(runtime.NumGoroutine()))
	wd, _ := os.Getwd()
	workingDir.Set(wd)
}

Counters

A counter metric must be a number, and can only increase during the lifetime of the system. Counters are almost always integers to avoid the implications of IEEE-754 rounding.

Example counter metrics are CPU microseconds spent and total request count.

If the metric collector sees that a new value is lower than the previous value, it knows a counter reset has occurred. Resets happen when a process restarts, clearing the counter's in-memory state.

Time              /net/http/server/request_count
2011-12-13 14:15  10000
2011-12-13 14:16  11000
2011-12-13 14:17   1500  (RESET)

In Go, defining a counter metric might look something like this:

import "foo.com/my/monitoring/impl/metric"

var (
	requestCount = metric.NewCounterInt64("/net/http/server/request_count")
)

func handler(w http.ResponseWriter, req *http.Request) {
	requestCount.Increment() // or .IncrementBy(1)
}

Distributions

Distributions are used for metrics with a very large set of possible values. They are usually visualized as a histogram or heat map.

Examples include request latencies, client IP addresses[3], and aggregations of constant/gauge/counter metrics from other sources.

Time              /net/http/server/response_latency (seconds)
2011-12-13 14:15
[ 0,  2) #
[ 2,  3) ###
[ 3,  5) #######
[ 5,  8) ####
[ 8, 13) ##
[13,  ∞)
2011-12-13 14:16
[ 0,  2) #
[ 2,  3) ####
[ 3,  5) ########
[ 5,  8) ###
[ 8, 13) #
[13,  ∞)
2011-12-13 14:17
[ 0,  2) 
[ 2,  3) #
[ 3,  5) ##
[ 5,  8) #####
[ 8, 13) ########
[13,  ∞) #

In Go, defining a distribution metric might look something like this:

import "foo.com/my/monitoring/impl/metric"

var (
	latency = metric.NewDurations(
		"/net/http/server/response_latency",
		metric.BinDurations([]time.Duration{
			2 * time.Second,
			3 * time.Second,
			5 * time.Second,
			8 * time.Second,
			13 * time.Second,
		})
	)
)

func handler(w http.ResponseWriter, req *http.Request) {
	start := time.Now()
	defer func() {
		latency.Sample(time.Now() - start)
	}()
}

Each distribution is also inherently a set of counters, because recording a sample in one of the bins will increment that bin's count. This property can be used to simplify some monitoring configurations.

Bins can be defined statically (as in the example above) or using a function (see the sketch after this list). Binning might be performed either by the system reporting the metric, or by the monitoring infrastructure.

  • With client-side binning, the reporter decides how fine-grained the distribution should be.
    • This is usually configurable per-metric by a command-line flag or config setting.
    • Changing the binning can cause vertical aberrations in visualizations.
  • With collector-side binning, the client reports the events as-is and the monitoring infrastructure aggregates the data before storing/forwarding it.
    • Example: collector receives raw distribution samples from its clients, and records {50,90,95,99}th percentiles over a trailing window.
    • This can be significantly less flexible, and it is often difficult to visualize percentiles as usefully as a full distribution.
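
For function-defined bins, the boundaries can be computed instead of listed by hand. Here's a sketch reusing the hypothetical metric library from the examples above; exponentialBins is a made-up helper, not part of any real API:

import (
	"time"

	"foo.com/my/monitoring/impl/metric"
)

// exponentialBins generates boundaries that grow by a constant factor,
// so a single call replaces a hand-written boundary list.
func exponentialBins(start time.Duration, factor float64, count int) []time.Duration {
	bins := make([]time.Duration, count)
	bound := float64(start)
	for i := range bins {
		bins[i] = time.Duration(bound)
		bound *= factor
	}
	return bins
}

var latencyExp = metric.NewDurations(
	"/net/http/server/response_latency",
	metric.BinDurations(exponentialBins(100*time.Millisecond, 2.0, 10)),
)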

Metric Names

I know of three styles for metric names:

  • Flat snake_case names, e.g. http_request_count.
  • Dotted hierarchies, e.g. myapp_com.net.http.server.request_count.
  • UNIX-style paths, e.g. myapp.com/net/http/server/request_count.

My personal favorite is the UNIX paths style, which I've seen used to great success. Engineers exposed to this style begin to naturally lay out metric hierarchies, with clear meanings and good namespacing. I don't have any solid data about why the naming style has such an effect, but I suspect it has something to do with familiarity:

  • A metric name like http_request_count is well and good, but myapp_com.net.http.server.request_count looks wrong to an experienced engineer. Expressions that use that many dots violate the Law of Demeter.
  • In contrast, path-shaped metric names like myapp.com/net/http/server/request_count inspire no such negative thoughts. Long paths are common in UNIX environments, and it's certainly no harder to remember than many of the paths in Linux's sysfs.

Traces

While metrics help understand the system in aggregate, traces are used to understand the relationship between the parts of a system that processed a particular request.

A trace is a tree of spans, which each represent a logical region of the system's execution time. Spans are nested – all spans except the root span have a parent span, and a trace is constructed by walking the tree to link parents with their children.

########################  GET /user/messages/inbox
 ######                   User permissions check
    ####                  Read template from disk
    #########             Query database
             ###          Render page to HTML
                ##        Compress response body
                  ######  Write response body

Spans and traces can be understood by analogy to lower-level programming concepts. If a trace is a stack trace, then a span is a single stack frame. Just as every stack frame is pushed and popped, each span begins and ends. It's the timing of when the spans begin and end that is interesting when analyzing a trace.

Each span is implicitly a sample of a duration distribution, and therefore also a counter[5].

Tools for creating and recording traces are currently less mature than for creating metrics, and a wide variety of tracing platforms exists. OpenTracing is an attempt to provide vendor-neutral APIs for many languages so that tracing support can more easily be added to shared libraries.
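
In Go, creating a trace with the OpenTracing API might look something like this (a sketch only; the operation names echo the example trace above, and the global tracer is a no-op unless one has been registered):

import (
	opentracing "github.com/opentracing/opentracing-go"
)

func handleInbox() {
	// Root span covering the whole request.
	root := opentracing.StartSpan("GET /user/messages/inbox")
	defer root.Finish()

	// Child spans reference their parent so the collector can
	// reassemble the tree.
	check := opentracing.StartSpan(
		"User permissions check",
		opentracing.ChildOf(root.Context()),
	)
	// ... check permissions ...
	check.Finish()
}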

Events

Events are conceptually similar to logging, but with an implied increase in how interesting a human will find the event as compared to normal logs. A web server might log a message for every request, but only record an event for things like unhandled exceptions, config file changes, or 5xx error codes.

Events are usually rendered in dashboards on top of visualized metric data, so humans can correlate them with potential production impact. For example, an oncaller might be better able to debug a spike in request latency if the dashboard shows it was immediately preceded by a config change.

Events can also be archived to a searchable event log. This can be useful when investigating unexpected behavior that occurred in a large window of time – logs may be too noisy to search, but the event log can quickly find "all SSH logins to this machine in the last 3 hours".

Events that indicate programming errors should be recorded in a ticket tracking system, then assigned to an engineer for diagnosis and correction. This should be relatively rare – if your service encounters unhandled errors more than once a month or so, then you should improve its automated test suite.

Metric Metadata

A raw stream of numbers can be useful to authors of the system who are deeply familiar with its internal details, but can be opaque to other engineers (including oncall SREs). Attaching metadata to metrics at their point of definition can help with this by acting as type hints, documentation, and cross-references.

Types of metadata that might be added include:

  • Human-readable documentation, such as a description of the metric's deeper meaning. Very nice to have when staring at hundreds of similar-looking metrics in a dashboard builder.
  • Numeric units, so the monitoring system can combine millisecond-resolution data from one system with minute-resolution data from another. Or bytes with gibibytes, or Mb/s with kB/s.
  • Tags (see below), which benefit from pre-definition and strong typing in the same way metrics do.
  • Source code location, usually inserted automatically by the monitoring library. Navigating from a dashboard to source code is often the first step for investigating an anomalous chart reading.
  • Contact info for a person or team that has more context about what the metric means, or how it relates to the overall health of a system.
  • Stability: metrics are an API too! Some metrics are experimental and shouldn't be built into other teams' dashboards, so it's useful to be able to indicate "this metric's definition is stable" vs "this could change without warning".

In Go, defining a metric with metadata might look something like this:

import "foo.com/my/monitoring/impl/metric"

var (
	requestCount = metric.NewCounterInt64(
		"/net/http/server/request_count",
		metric.Description("A count of the HTTP requests received by this process."),
	)
)

func handler(w http.ResponseWriter, req *http.Request) {
	requestCount.Increment() // or .IncrementBy(1)
}

Tags

Tags are attached to a sample, span, or event to provide more information and context. They are a critical part of a metric, because without tags you couldn't tell which machine has unusually high load, or whether your HTTP responses are status: OK or status: INTERNAL_SERVER_ERROR.

Tags are almost always named with short_snake_case. There's no need to have full namespacing as in metric names, because tags are implicitly namespaced to their metric.

Tags should have low cardinality – the number of possible key-value combinations in the data. Tagging the response status is fine because there's only a few dozen of them, but tagging the timestamp or client IP would place an enormous load on your collection and analysis infrastructure.

Tag value types are a restricted subset of metric value types: integers, text, and maybe bools. Floats are forbidden due to cardinality, and distributions don't make sense as a tag.

Just like for metrics, tags might be declared ad-hoc or statically. Static declaration of tags with their metrics improves the information available in dashboards, and helps catch programming errors before they land in prod.

In Go, defining a metric with tags might look something like this:

import "foo.com/my/monitoring/impl/metric"

var (
	requestCount = metric.NewCounterInt64(
		"/net/http/server/request_count",
		metric.StringTag("method"),
	)
)

func handler(w http.ResponseWriter, req *http.Request) {
	// func (c *CounterInt64) Increment(tagValues ...metric.TagValue)
	// func (c *CounterInt64) IncrementBy(count int64, tagValues ...metric.TagValue)
	requestCount.Increment(req.Method)
}

An alternative style, which is more type-safe but also more verbose, might be:

var (
	tagMethod    = metric.NewStringTag("method")
	requestCount = metric.NewCounterInt64(
		"/net/http/server/request_count",
		metric.Tags(tagMethod),
	)
)

func handler(w http.ResponseWriter, req *http.Request) {
	requestCount.Increment(tagMethod(req.Method))
}

Note that neither style protects against forgetting a tag. In Go this is acceptable because zero-valued defaults are idiomatic, but other languages may prefer to require all tags to be specified when recording a sample.

Push vs Pull

Until I started looking into open-source monitoring frameworks, I didn't realize the "push vs pull" debate existed. I still don't fully understand it. Have we, as an industry, forgotten that TCP sockets are bidirectional? Anyway, here's a summary of the two sides.

Push Model

In the push model, processes are configured with the network address of a metric collector. They send metrics on their own schedule, either periodically (e.g. every 5 minutes) or whenever a value changes. statsd and its various derivatives are a canonical example of the push model – to increment a counter or set a gauge, the process sends a UDP packet[6] to the collector with the metric name and value.
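
To make the wire format concrete, here's a minimal sketch of a statsd-style push in Go; the collector address and metric names are made up for illustration:

import (
	"fmt"
	"net"
)

func pushMetrics() {
	// Each write to a UDP connection is one fire-and-forget datagram
	// in statsd's "name:value|type" text format.
	conn, err := net.Dial("udp", "metrics-collector.internal:8125")
	if err != nil {
		return // best-effort: real code might log the error
	}
	defer conn.Close()

	fmt.Fprint(conn, "myapp.request_count:1|c") // increment a counter
	fmt.Fprint(conn, "myapp.thread_count:42|g") // set a gauge
}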

The push model is dead simple to implement, and has the significant advantage of not requiring any sort of service discovery infrastructure. But it's also inflexible and difficult to manage – metric collection policies are hardcoded (or require complex configuration management), and load balancing between collectors is difficult.

Pull Model

In the pull model, processes provide network access to their metrics and register themselves in a service discovery infrastructure such as Consul. Typical implementations are an HTTP endpoint (e.g. Prometheus's /metrics) or a simple request-response RPC. The collectors use service discovery to find endpoints, scrape them on their own schedule, and make the data available on their own endpoints for scraping by higher-level collectors.
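
A minimal sketch of the pull side in Go, exposing a single gauge on a scrape endpoint; the path, port, and text format are illustrative rather than any particular system's:

import (
	"fmt"
	"log"
	"net/http"
	"runtime"
)

func serveMetrics() {
	// Collectors scrape this endpoint on their own schedule.
	http.HandleFunc("/metrics", func(w http.ResponseWriter, req *http.Request) {
		fmt.Fprintf(w, "/proc/thread_count %d\n", runtime.NumGoroutine())
	})
	log.Fatal(http.ListenAndServe(":9100", nil))
}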

Two significant downsides to the pull model are the dependency on service discovery, and lack of backpressure:

  • If your service discovery infrastructure is degraded or unavailable, then newly created processes might not be monitored properly. Monitoring the discovery infrastructure itself is also a challenge, because your collectors need some way to hard-code the discovery service metric endpoints.
  • A fleet of collectors can easily send more metric requests than a single process can handle. Incorrect load balancing, monitoring configuration mistakes, or aggressive retries can cause your monitoring infrastructure to degrade the system it's monitoring!

Bi-Directional Collection

One solution to the push-vs-pull debate is to have the instrumented system connect to the collectors, receive its collection policy from them, and then push samples. This achieves the best of both worlds – the collector can set policy about which metrics to push and how often, but implementation of the policy is left up to the monitored system. Service registration is present only in vestigial form, because the monitored system can register with any collector instead of a globally-consistent service discovery infrastructure.

+------------------+               +------------+
| Monitored System |               |  Collector |
+------------------+               +------------+
          ||                              ||
          ||         Announcement         ||
          || ---------------------------> ||
          ||                              ||
          ||     Collection Policy        ||
          || <--------------------------- ||
          ||                              ||
          ||           Samples            ||
          || ---------------------------> ||
          ||                              ||
          ||           Samples            ||
          || ---------------------------> ||
          ||                              ||
          ||            ...               ||
          ||                              ||

The monitored system initiates the exchange by connecting to the collector and announcing its identity. The identity consists of things like cluster name, machine ID, process ID, or other ways to distinguish processes from each other.

A monitored system might announce multiple identities, for example if it's proxying metrics from some other source. A process that scrapes Apache log files to count errors might report two identities, one for itself and one for Apache. Each identity has independent (and possibly overlapping) sets of metrics.
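
A sketch of what an announcement might carry, with illustrative type and field names rather than a real protocol definition:

// Identity distinguishes one metric source from another.
type Identity struct {
	Cluster   string
	MachineID string
	ProcessID int64
}

// Announcement is sent by the monitored system when it connects.
type Announcement struct {
	// A metrics proxy (e.g. an Apache log scraper) may announce
	// several identities, each with its own set of metrics.
	Identities []Identity
}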

Collection Policies

A large binary might be instrumented with many thousands of metrics, but only a subset will be of interest to the SRE team. Furthermore, some metrics should be updated more often than others – and the details can change as the SRE team refines dashboards or investigates ongoing service degradation. The rules about which metrics to push, and how frequently to push them, are encoded in a collection policy[4] that the collector sends to the monitored system.

The following example policy pushes metrics starting with /build/ every 10 minutes, and metrics starting with /proc/ or /net/rpc/server/ every 5 seconds. The metric /net/rpc/client/response_latency is also pushed every 5 seconds, but other metrics under /net/rpc/client/ are not pushed.

metrics:
  - prefix: "/build/"
    interval: {seconds: 600}
  - prefix: "/proc/"
    interval: {seconds: 5}
  - prefix: "/net/rpc/server/"
    interval: {seconds: 5}
  - name: "/net/rpc/client/response_latency"
    interval: {seconds: 5}

A collector might also request specific events and trace spans, or all of them.

Note that there is no hard requirement on the monitored system to push at the specified interval. It might push less often if it's running low on CPU allocation, or perform an unscheduled push during shutdown.

Sample Compression

An unexpected benefit of pushing metrics in a reliable connection-oriented protocol is the opportunity for cheap data compression. Metric names, unchanged sample values, and timestamps are easy wins to reduce bandwidth requirements in your metric collection.

Metric Names

When the monitored system pushes a metric sample, it can allocate a connection-unique ID to that metric name. For later pushes, the name doesn't need to be re-transmitted. This is an especially good fit for protocol buffers, because each message field is identified by an ID. Therefore, a sample push can be encoded in the protobuf wire format as a sequence of (metric_id, metric_value) tuples, where the metric_value is of the protobuf type corresponding to the metric type.

A brief example, showing the original metric definitions followed by the logical protobuf encoding:

metric {
  name: "/proc/thread_count"
  type: INT64
  per_connection_metric_id: 1
}

metric {
  name: "/proc/working_directory"
  type: TEXT
  per_connection_metric_id: 2
}

message {
  int64 proc_thread_count = 1;
  string proc_working_directory = 2;
}

Unchanged Samples

Metric values often change less frequently than their collection interval. Instead of resending the same value over and over, the protocol can have a repeated int64 unchanged_metric_id field. Any metric IDs in this list will be treated as if they were sent using the last value seen in the current connection.
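
A sketch of the sender-side bookkeeping, with SamplePush standing in for the real wire format:

// SamplePush is an illustrative stand-in for the push message.
type SamplePush struct {
	Values             map[int64]int64 // metric ID -> changed value
	UnchangedMetricIDs []int64         // IDs whose previous value still holds
}

// buildPush resends only values that changed since the last push, and
// reports everything else by ID. lastSent records what the collector
// has already seen on this connection.
func buildPush(current, lastSent map[int64]int64) SamplePush {
	push := SamplePush{Values: make(map[int64]int64)}
	for id, value := range current {
		if prev, ok := lastSent[id]; ok && prev == value {
			push.UnchangedMetricIDs = append(push.UnchangedMetricIDs, id)
			continue
		}
		push.Values[id] = value
		lastSent[id] = value
	}
	return push
}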

Timestamps

If timestamps are a metric type encoded into the protocol instead of just using integers, then they can be compressed using a timestamp base. For example, instead of sending int64 timestamp for each sample, send int64 timestamp_base in the announcement message and int32 timestamp_offset in the samples. Then reconstruct the original values in the collector as timestamp_base + timestamp_offset.
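
A worked example of the offset arithmetic in Go, with illustrative names; a real protocol would carry the base in the announcement message and the offset in each sample:

import "time"

func encodeTimestamps() {
	// Sent once, in the announcement message.
	base := time.Now().Unix() // int64 seconds

	// Sent per sample: a small offset instead of a full timestamp.
	sampleTime := time.Now().Add(90 * time.Second)
	offset := int32(sampleTime.Unix() - base)

	// Collector side: reconstruct the original timestamp.
	reconstructed := time.Unix(base+int64(offset), 0)
	_ = reconstructed
}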

This technique works regardless of whether you use a fixed-length integer field or a protobuf-style varint. Fixed-length fields will save 50% of each timestamp per sample, while varint savings will vary depending on how small the offsets are. Note that for protobuf, choosing a timestamp base in the future and using negative offsets may result in more compact output due to ZigZag encoding.

The base time must be updated to a larger value if the offset would overflow a signed 32-bit integer. The resolution of your timestamps will affect how often the base time must be updated:

Maximum Offset     Seconds  Minutes  Hours   Days
2^31 nanoseconds      2.15        -      -      -
2^31 microseconds  2147.48    35.79      -      -
2^31 milliseconds        -        -  596.52  24.86

  1. Or maybe "opt_foo" and "opt_bar". "OptFoo" is right out.

  2. Distributions are sometimes called "histograms", for example by DataDog and Prometheus, but this is technically incorrect – a histogram is a visualization of a distribution.

  3. This may seem like an odd metric value, but it can be useful when diagnosing routing-related network issues.

  4. If you ever feel the urge to write your own turing-complete configuration language, take a deep breath and step back for a bit. Go for a walk around the block. Look at some trees.

  5. Be careful about depending on spans as counters. Many tracing systems record only a subset of the traces they receive, or discard spans with durations outside of their recall window. You may find the implied metrics to be missing data from times when they are most interesting.

  6. UDP could be a reasonable transport for metrics if you used it as the foundation for a reliable connection-oriented protocol (à la QUIC), but statsd does not do this. There is no mechanism to resend lost updates, ignore duplicates, or ensure correct sequencing of gauge values. Embedding the metric name in each packet is enormously wasteful of bandwidth. statsd collection is difficult to load balance across threads, and very difficult to balance across collectors running on separate machines.
