SRE School: Instrumentation

Instrumentation is the foundation of a monitoring infrastructure. It is the part that directly touches the system(s) being monitored, the source of raw data for our collectors and analyzers and dashboards. It is also the only part that is not under an SRE team’s direct control – instrumentation is usually plumbed through the codebase by product teams. Given this, an SRE’s primary source of leverage is to make adding instrumentation as easy and painless as possible. We do this by writing instrumentation libraries with friendly, approachable, idiomatic APIs.

Metrics

Each measurable property of the system is a metric. Repeated measurements of a metric’s value yield a time series of data samples. A metric’s definition includes metadata about how to collect, aggregate, and interpret its samples.

Metric values can in theory be of any serializable data type, but in practice they are numbers, text, or distributions:

A C-style enumeration such as enum { OPT_FOO = 1; OPT_BAR = 2; } is best reported as "OPT_FOO" and "OPT_BAR"1 instead of numeric 1 and 2.

Booleans can be thought of as the enum { FALSE, TRUE }. Some monitoring systems give them a separate type to simplify query planning and analysis.

Metrics can be defined ad-hoc at point of emission, or statically in some global type. I prefer statically declared metrics because that gives the opportunity to attach metric metadata.

There are four common categories of metrics: constants, gauges, counters, and distributions2.

Constants

A metric that does not change for the lifetime of its associated system component. Samples of a constant metric will always contain the same value. Common examples are build information (e.g. git commit ID), process start time, and process ID. Don’t use constants for things that are only constant-ish, such as hostnames.

Constants can be text or numbers. For numbers, integers usually work better than floats (e.g. represent your start time as int64 milliseconds instead of float64 seconds.

Time/build/timestamp (seconds since UNIX epoch)/build/revision_id
2011-12-13 14:151300000000git:da39a3ee5e6b4b0d3255bfef95601890afd80709
2011-12-13 14:161300000000git:da39a3ee5e6b4b0d3255bfef95601890afd80709
2011-12-13 14:171300000000git:da39a3ee5e6b4b0d3255bfef95601890afd80709

In Go, using a constant metric might look something like this:

1
2
3
4
5
6
7
8
9
import "foo.com/my/monitoring/impl/metric"

var (
	_TIMESTAMP   int64 /* filled in by linker */
	_REVISION_ID string /* filled in by linker */
	
	metric.NewConstantInt64("/build/timestamp", _TIMESTAMP)
	metric.NewConstantString("/build/revision_id", _REVISION_ID)
)

Gauges

A gauge metric can vary freely across its possible value range. Think of them like tachometers.

Gauges can be text or numbers.

Time/proc/thread_count/proc/working_directory
2011-12-13 14:15200/var/www/current
2011-12-13 14:16250/var/www/previous
2011-12-13 14:17230/var/www/current

In Go, using a gauge metric might look something like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
import "foo.com/my/monitoring/impl/metric"

var (
	threadCount = metric.NewGaugeInt64("/proc/thread_count")
	workingDir = metric.NewGaugeString("/proc/working_directory")
)

func updateMetrics() {
	threadCount.Set(int64(runtime.NumGoroutine()))
	wd, _ := os.Getwd()
	workingDir.Set(wd)
}

Counters

A counter metric must be a number, and can only increase during the lifetime of the system. Counters are almost always integers to avoid the implications of IEEE-754 rounding.

Example counter metrics are CPU microseconds spent, or the total request count.

Counters can only increase. If the metric collector sees that a new value is lower than the older value, it knows a metric reset has occurred. Resets happen when a process restarts, clearing in-memory state of the counter.

Time/net/http/server/request_count
2011-12-13 14:1510000
2011-12-13 14:1611000
2011-12-13 14:171500RESET

In Go, defining a counter metric might look something like this:

1
2
3
4
5
6
7
8
9
import "foo.com/my/monitoring/impl/metric"

var (
	requestCount = metric.NewCounterInt64("/net/http/server/request_count")
)

func handler(w http.ResponseWriter, req *http.Request) {
	requestCount.Increment() // or .IncrementBy(1)
}

Distributions

Distributions are used for metrics with a very large set of possible values. They are usually visualized as a histogram or heat map.

Examples include request latencies, client IP addresses3, and aggregations of constant/gauge/counter metrics from other sources.

Time/net/http/server/response_latency (seconds)
2011-12-13 14:15
[ 0,  2) #
[ 2, 3) ###
[ 3, 5) #######
[ 5, 8) ####
[ 8, 13) ##
[13, ∞)
2011-12-13 14:16
[ 0,  2) #
[ 2, 3) ####
[ 3, 5) ########
[ 5, 8) ###
[ 8, 13) #
[13, ∞)
2011-12-13 14:17
[ 0,  2) 
[ 2, 3) #
[ 3, 5) ##
[ 5, 8) #####
[ 8, 13) ########
[13, ∞) #

In Go, defining a distribution metric might look something like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
import "foo.com/my/monitoring/impl/metric"

var (
	latency = metric.NewDurations(
		"/net/http/server/response_latency",
		metric.BinDurations([]time.Duration{
			2 * time.Second,
			3 * time.Second,
			5 * time.Second,
			8 * time.Second,
			13 * time.Second,
		})
	)
)

func handler(w http.ResponseWriter, req *http.Request) {
	start := time.Now()
	defer func() {
		latency.Sample(time.Now() - start)
	}()
}

Each distribution is also inherently a set of counters, because recording a sample in one of the bins will increment that bin’s count. This property can be used to simplify some monitoring configurations.

Bins can be defined statically (as in the example above), or using a function. Binning might be performed either by the system reporting the metric, or by the monitoring infrastructure.

Metric Names

I know of three styles for metric names:

My personal favorite is the UNIX paths style, which I’ve seen used to great success. Engineers exposed to this style begin to naturally lay out metric hierarchies, with clear meanings and good namespacing. I don’t have any solid data about why the naming style has such an effect, but I suspect it has something to do with familiarity:

Traces

While metrics help understand the system in aggregate, traces are used to understand the relationship between the parts of a system that processed a particular request.

A trace is a tree of spans, which each represent a logical region of the system’s execution time. Spans are nested – all spans except the root span have a parent span, and a trace is constructed by walking the tree to link parents with their children.

1
2
3
4
5
6
7
########################  GET /user/messages/inbox
 ######                   User permissions check
    ####                  Read template from disk
    #########             Query database
             ###          Render page to HTML
                ##        Compress response body
                  ######  Write response body

Spans and traces can be understood by analogy to lower-level programming concepts. If a trace is a stack trace, then a span is a single stack frame. Just as every stack frame is pushed and popped, each span begins and ends. It’s the timing of when the spans begin and end that is interesting when analysing a trace.

Each span is implicitly a sample of a duration distribution, and therefore also a counter5.

Tools for creating and recording traces are currently less mature than for creating metrics, and a wide variety of tracing platforms exists. OpenTracing is an attempt to provide vendor-neutral APIs for many languages so that tracing support can more easily be added to shared libraries.

Events

Events are conceptually similar to logging, but with an implied increase in how interesting a human will find the event as compared to normal logs. A web server might log a message for every request, but only record an event for things like unhandled exceptions, config file changes, or 5xx error codes.

Events are usually rendered in dashboards on top of visualized metric data, so humans can correlate them with potential production impact. For example, an oncaller might be better able to debug a spike in request latency if the dashboard shows it was immediately preceeded by a config change.

Events can also be archived to a searchable event log. This can be useful when investigating unexpected behavior that occurred in a large window of time – logs may be too noisy to search, but the event log can quickly find “all SSH logins to this machine in the last 3 hours”.

Events that indicate programming errors should be recorded in a ticket tracking system, then assigned to a engineer for diagnosis and correction. This should be relatively rare – if your service encounters unhandled errors more than once a month or so, then you should improve its automated test suite.

Metric Metadata

A raw stream of numbers can be useful to authors of the system who are deeply familiar with its internal details, but can be opaque to other engineers (including oncall SREs). Attaching metadata to metrics at their point of definition can help with this by acting as type hints, documentation, and cross-references.

Types of metadata that might be added include:

In Go, defining a metric with metadata might look something like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
import "foo.com/my/monitoring/impl/metric"

var (
	requestCount = metric.NewCounterInt64(
		"/net/http/server/request_count",
		metric.Description("A count of the HTTP requests received by this process."),
	)
)

func handler(w http.ResponseWriter, req *http.Request) {
	requestCount.Increment() // or .IncrementBy(1)
}

Tags

Tags are attached to a sample, span, or event to provide more information and context. They are a critical part of a metric, because without tags you couldn’t tell which machine has unusually high load, or whether your HTTP responses are status: OK or status: INTERNAL_SERVER_ERROR.

Tags are almost always named with short_snake_case. There’s no need to have full namespacing as in Metric names, because tags are implicitly namespaced to their metric.

Tags should have low cardinality) – the number of possible key-value combinations in the data. Tagging the response status is fine because there’s only a few dozen of them, but tagging the timestamp or client IP would place an enormous load on your collection and analysis infrastructure.

Tag value types are a restricted subset of metric value types: integers, text, and maybe bools. Floats are forbidden due to cardinality, and distributions don’t make sense as a tag.

Just like for metrics, tags might be declared ad-hoc or or statically. Static declaration of tags with their metrics improves the information available in dashboards, and helps catch programming errors before they land in prod.

In Go, defining a metric with tags might look something like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
import "foo.com/my/monitoring/impl/metric"

var (
	requestCount = metric.NewCounterInt64(
		"/net/http/server/request_count",
		metric.StringTag("method"),
	)
)

func handler(w http.ResponseWriter, req *http.Request) {
	// func (c *CounterInt64) Increment(tagValues ...metric.TagValue)
	// func (c *CounterInt64) IncrementBy(count int64, tagValues ...metric.TagValue)
	requestCount.Increment(req.Method)
}

An alternative style, which is more type-safe but also more verbose, might be:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
var (
	tagMethod = metric.NewStringTag("method"),
	requestCount = metric.CounterInt64(
		"/net/http/server/request_count",
		metric.Tags(tagMethod),
	)
)

func handler(w http.ResponseWriter, req *http.Request) {
	requestCount.Increment(tagMethod(req.Method))
}

Note that neither style protects against forgetting a tag. In Go this is acceptable because zero-valued defaults are idiomatic, but other languages may prefer to require all tags to be specified when recording a sample.

Push vs Pull

Until I started looking into open-source monitoring frameworks, I didn’t realize the “push vs pull” debate existed. I still don’t fully understand it. Have we, as an industry, forgotten that TCP sockets are bidirectional? Anyway, here’s a summary of the two sides:

Push Model

In the push model, processes are configured with the network address of a metric collector. They send metrics on their own schedule, either periodically (e.g. every 5 minutes) or whenever a value changes. statsd and its various derivatives are a canonical example of the push model – to increment a counter or set a gauge, the process sends a UDP packet6 to the collector with the metric name and value.

The push model is dead simple to implement, and has a significant advantage of not requiring any sort of service discovery infrastructure. But it’s also inflexible and difficult to manage – metric collection policies are hardcoded (or require a complex configuration management), and load balancing between collectors .

Pull Model

In the pull model, processes provide network access to their metrics and register themselves in a service discovery infrastructure such as Consul. Typical implementations are an HTTP endpoint (e.g. Prometheus’s /varz) or a simple request-response RPC. The collectors use service discovery to find endpoints, scrape them on their own schedule, and make the data available on their own endpoints for scraping by higher-level collectors.

Two significant downsides to the pull model are the dependency on service discovery, and lack of backpressure:

Bi-Directional Collection

One solution to the push-vs-pull debate is to have the instrumented system connect to the collectors, receive its collection policy from them, and then push samples. This achieves the best of both worlds – the collector can set policy about which metrics to push and how often, but implementation of the policy is left up to the monitored system. Service registration is present only in vestigal form, because the monitored system can register with any collector instead of a globally-consistent service discovery infrastructure.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
+------------------+               +------------+
| Monitored System |               |  Collector |
+------------------+               +------------+
          ||                              ||
          ||         Announcement         ||
          || ---------------------------> ||
          ||                              ||
          ||     Collection Policy        ||
          || <--------------------------- ||
          ||                              ||
          ||           Samples            ||
          || ---------------------------> ||
          ||                              ||
          ||           Samples            ||
          || ---------------------------> ||
          ||                              ||
          ||            ...               ||
          ||                              ||

The monitored system starts the process by connecting to the collector, and announcing its identity. The identity consists of things like cluster name, machine ID, process ID, or other ways to distinguish processes from each other.

A monitored system might announce multiple identities, for example if it’s proxying metrics from some other source. A process that scrapes Apache log files to count errors might report two identities, one for itself and one for Apache. Each identity has independent (and possibly overlapping) sets of metrics.

Collection Policies

A large binary might be instrumented with many thousands of metrics, but only a subset will be of interest to the SRE team. Furthermore, some metrics should be updated more often than others – and the details can change as the SRE team refines dashboards or investigates ongoing service degradation. The rules about which metrics to push, and how frequently to push them, are encoded in a collection policy that the collector sends to the monitored system.

The following example policy pushes metrics starting with /build/ every 10 minutes, and metrics starting with /proc/ or /net/rpc/server/ every 5 seconds. The metric /net/rpc/client/response_latency is also pushed every 5 seconds, but other metrics under /net/rpc/client/ are not pushed.

1
2
3
4
5
6
7
8
9
metrics:
  - prefix: "/build/"
    interval: {seconds: 600}
  - prefix: "/proc/"
    interval: {seconds: 5}
  - prefix: "/net/rpc/server/"
    interval: {seconds: 5}
  - name: "/net/rpc/client/response_latency"
    interval: {seconds: 5}

A collector might also request specific events and trace spans, or all of them.

Note that there is no hard requirement on the monitored system to push at the specified interval. It might push less often if it’s running low on CPU allocation, or perform an unscheduled push during shutdown.

Sample Compression

An unexpected benefit of pushing metrics in a reliable connection-oriented protocol is the opportunity for cheap data compression. Metric names, unchanged sample values, and timestamps are easy wins to reduce bandwidth requirements in your metric collection.

Metric Names

When the monitored system pushes a metric sample, it can allocate a connection-unique ID to that metric name. For later pushes, the name doesn’t need to be re-transmitted. This is an especially good fit for protocol buffers, because each message field is identified by an ID. Therefore, a sample push can be encoded in the protobuf wire format as a sequence of (metric_id, metric_value) tuples, where the metric_value is of the protobuf type corresponding to the metric type.

A brief example, showing the original metric definition on the left, and the logical protobuf encoding on the right:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
metric {                             |    message {
  name: "/proc/thread_count"         |      int64 proc_thread_count = 1;
  type: INT64                        |      string proc_working_directory = 2;
  per_connection_metric_id: 1        |    }
}                                    |
                                     |
metric {                             |
  name: "/proc/working_directory"    |
  type: TEXT                         |
  per_connection_metric_id: 2        |
}                                    |

Unchanged Samples

Metric values often change less frequently than their collection interval. Instead of resending the same value over and over, the protocol can have a repeated int64 unchanged_metric_id field. Any metric IDs in this list will be treated as if they were sent using the last value seen in the current connection.

Timestamps

If timestamps are a metric type encoded into the protocol instead of just using integers, then they can be compressed using a timestamp base. For example, instead of sending int64 timestamp for each sample, send int64 timestamp_base in the announcement message and int32 timestamp_offset in the samples. Then reconstruct the original values in the collector as timestamp_base + timestamp_offset.

This technique works regardless of whether you use a fixed-length integer field, or a protobuf-style varint. Fixed-length fields will save 50% of each timestamp per sample, varint savings will vary depending on how small the offsets are. Note that for protobuf, chosing a timestamp base in the future and using negative offsets may result in more compact output due to ZigZag encoding.

The base time must be updated to a larger value if the offset would overflow a signed 32-bit integer. The resolution of your timestamps will affect how often the base time must be updated:

Maximum OffsetSecondsMinutesHoursDays
2^31 nanoseconds2.15---
2^31 microseconds2147.4835.79--
2^31 milliseconds--596.5224.86

  1. Or maybe "opt_foo" and "opt_bar". "OptFoo" is right out.
  2. Distributions are sometimes called “histograms”, for example by DataDog and Prometheus, but this is technically incorrect – a histogram is a visualization of a distribution.
  3. This may seem like an odd metric value, but it can be useful when diagnosing routing-related network issues.
  4. If you ever feel the urge to write your own turing-complete configuration language, take a deep breath and step back for a bit. Go for a walk around the block. Look at some trees.
  5. Be careful about depending on spans as counters. Many tracing systems record only a subset of the traces they receive, or discard spans with durations outside of their recall window. You may find the implied metrics to be missing data from times when they are most interesting.
  6. UDP could be a reasonable transport for metrics if you used it as the foundation for a reliable connection-oriented protocol (ala QUIC), but statsd does not do this. There is no mechanism to resend lost updates, ignore duplicates, or ensure correct sequencing of gauge values. Embedding the metric name in each packet is enormously wasteful of bandwidth. statsd collection is difficult to load balance across threads, and and very difficult to balance across collectors running on separate machines.