<style type=text/css>tt{font-weight:700}</style><blog-article posted=2023-04-18T02:53:51Z><h1 slot=title>Creating TUN/TAP interfaces in Linux</h1><div slot=tableofcontents></div><div slot=summary><p>The basic approach to writing a TUN/TAP client (such as a VPN) for Linux is:</p><ol><li>Open the <tt>/dev/net/tun</tt> device as a file, which (once configured) will communicate network traffic to userspace.</li><li>Allocate (or bind) a virtual network interface with the file handle using <tt>ioctl(TUNSETIFF)</tt>.</li><li>Configure the network interface's address and link state.</li><li>Process network traffic in the userspace program.</li></ol><p>There's reasonably complete documentation about each step of this process, but I couldn't find a worked example that tied it all together. The following C program is intended to serve as a minimal TUN/TAP client.</p></div><blog-section><h2 slot=title>Steps 1-2: Allocating a TUN/TAP interface</h2><p>Opening a file is straightforward, so the important part of this function is the <tt>ioctl(TUNSETIFF)</tt> call. It's this call that creates the network interface, and there are two user-configurable fields:</p><ul><li>The <tt>ifr_name</tt> field contains the interface name, which may be specified by the caller. If unset (empty), then the kernel will assign a name such as <tt>tun0</tt> or <tt>tap0</tt>.</li><li>The <tt>ifr_flags</tt> field selects whether to create a TUN or TAP interface. TUN interfaces process IP packets, and TAP interfaces process Ethernet frames.</li></ul><p>The set of possible flags and their effects is documented at <a href=https://docs.kernel.org/networking/tuntap.html>Linux Networking Documentation » Universal TUN/TAP device driver</a>.</p><p>The interface name, if provided, must be less than <tt>IFNAMSIZ</tt> bytes. After the ioctl call returns, the <tt>ifr_name</tt> field can be inspected to see what name the interface was created with.</p><blog-code syntax=c><pre>
/* Copyright (c) John Millikin <john@john-millikin.com> */
/* SPDX-License-Identifier: 0BSD */
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <linux/if.h>
#include <linux/if_tun.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
int tuntap_connect(const char *iface_name, short flags, char *iface_name_out) {
    int tuntap_fd, rc;
    size_t iface_name_len;
    struct ifreq setiff_request;
    if (iface_name != NULL) {
        iface_name_len = strlen(iface_name);
        if (iface_name_len >= IFNAMSIZ) {
            errno = EINVAL;
            return -1;
        }
    }
    tuntap_fd = open("/dev/net/tun", O_RDWR | O_CLOEXEC);
    if (tuntap_fd == -1) {
        return -1;
    }
    memset(&setiff_request, 0, sizeof setiff_request);
    setiff_request.ifr_flags = flags;
    if (iface_name != NULL) {
        memcpy(setiff_request.ifr_name, iface_name, iface_name_len + 1);
    }
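    /* This ioctl creates the interface. On success the kernel writes the
       interface's actual name (which may have been kernel-assigned) back
       into ifr_name; it is copied out to the caller below. */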
    rc = ioctl(tuntap_fd, TUNSETIFF, &setiff_request);
    if (rc == -1) {
        int ioctl_errno = errno;
        close(tuntap_fd);
        errno = ioctl_errno;
        return -1;
    }
    if (iface_name_out != NULL) {
        memcpy(iface_name_out, setiff_request.ifr_name, IFNAMSIZ);
    }
    return tuntap_fd;
}</pre></blog-code></blog-section><blog-section><h2 slot=title>Step 3: Configure the interface with Netlink</h2><p>At this point, most TUN/TAP examples I've found tell the user to configure the newly-created network interface by using the command line to run tools from <a href=https://wiki.linuxfoundation.org/networking/iproute2>iproute2</a>. In this post I will instead use the Linux kernel's native <a href=https://docs.kernel.org/userspace-api/netlink/intro.html>Netlink</a> subsystem.</p><p>Netlink can be thought of as a sort of RPC-ish request/response protocol, where messages are assembled manually from C structs. Besides the kernel docs linked above, the following manpages are useful for writing a Netlink client:</p><ul><li><a href=https://man7.org/linux/man-pages/man3/netlink.3.html>netlink(3)</a></li><li><a href=https://man7.org/linux/man-pages/man7/netlink.7.html>netlink(7)</a></li><li><a href=https://man7.org/linux/man-pages/man7/rtnetlink.7.html>rtnetlink(7)</a></li></ul><p>In this example we will be using the <tt>NETLINK_ROUTE</tt> mode to send <tt>RTM_NEWADDR</tt> and <tt>RTM_NEWLINK</tt> requests. Netlink error handling is a bit obtuse since it requires manual response handling, so I'm not going to bother with it for this example.</p><p>The first step is to open an <tt>AF_NETLINK</tt> socket by calling <tt>socket(AF_NETLINK)</tt>. I'm also calling <tt>bind()</tt>, which isn't strictly necessary but provides metadata useful to <tt>strace</tt><blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref>.</p><blog-code syntax=c><pre>
/* Copyright (c) John Millikin <john@john-millikin.com> */
/* SPDX-License-Identifier: 0BSD */
#include <arpa/inet.h>
#include <errno.h>
#include <linux/if.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <net/if.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
int netlink_connect() {
    int netlink_fd, rc;
    struct sockaddr_nl sockaddr;
    netlink_fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
    if (netlink_fd == -1) {
        return -1;
    }
    memset(&sockaddr, 0, sizeof sockaddr);
    sockaddr.nl_family = AF_NETLINK;
    rc = bind(netlink_fd, (struct sockaddr*) &sockaddr, sizeof sockaddr);
    if (rc == -1) {
        int bind_errno = errno;
        close(netlink_fd);
        errno = bind_errno;
        return -1;
    }
    return netlink_fd;
}</pre></blog-code><p>The first Netlink command will be <tt>RTM_NEWADDR</tt>, which sets the address and prefix length (netmask) of the interface. I've only implemented IPv4 support for this example, but IPv6 is similar.</p><p>A Netlink request contains a header (<tt>struct nlmsghdr</tt>), message content (here that's a <tt>struct ifaddrmsg</tt>), and an optional list of key-value attributes. The set of necessary attributes isn't well documented, so I ran <tt>strace ip addr add</tt> and replicated its requests.</p><blog-code syntax=c><pre data-start=32>
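/*
 * Optional helper (my addition, not part of the original post): the post
 * skips Netlink error handling, but if a request is sent with NLM_F_ACK
 * added to nlmsg_flags, the kernel answers with an NLMSG_ERROR message
 * whose error field is 0 on success and a negated errno on failure. A
 * minimal acknowledgment reader looks like this:
 */
int netlink_read_ack(int netlink_fd) {
    char resp[4096];
    struct nlmsghdr *header;
    struct nlmsgerr *err;
    ssize_t len;
    len = recv(netlink_fd, resp, sizeof resp, 0);
    if (len == -1) {
        return -1;
    }
    header = (struct nlmsghdr *) resp;
    if (!NLMSG_OK(header, len) || header->nlmsg_type != NLMSG_ERROR) {
        errno = EBADMSG;
        return -1;
    }
    err = (struct nlmsgerr *) NLMSG_DATA(header);
    if (err->error != 0) {
        errno = -err->error;
        return -1;
    }
    return 0;
}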
int netlink_set_addr_ipv4(
    int netlink_fd
    , const char *iface_name
    , const char *address
    , uint8_t network_prefix_bits
) {
    struct {
        struct nlmsghdr header;
        struct ifaddrmsg content;
        char attributes_buf[64];
    } request;
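    /* Annotation (mine, not from the original post): the finished request
     * is 40 bytes, matching nlmsg_len=40 in the strace dump in footnote 1:
     *   struct nlmsghdr    16 bytes
     *   struct ifaddrmsg    8 bytes
     *   rtattr IFA_LOCAL    4-byte header + 4-byte struct in_addr
     *   rtattr IFA_ADDRESS  4-byte header + 4-byte struct in_addr */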
    struct rtattr *request_attr;
    size_t attributes_buf_avail = sizeof request.attributes_buf;
    memset(&request, 0, sizeof request);
    request.header.nlmsg_len = NLMSG_LENGTH(sizeof request.content);
    request.header.nlmsg_flags = NLM_F_REQUEST | NLM_F_EXCL | NLM_F_CREATE;
    request.header.nlmsg_type = RTM_NEWADDR;
    request.content.ifa_index = if_nametoindex(iface_name);
    request.content.ifa_family = AF_INET;
    request.content.ifa_prefixlen = network_prefix_bits;
    /* request.attributes[IFA_LOCAL] = address */
    request_attr = IFA_RTA(&request.content);
    request_attr->rta_type = IFA_LOCAL;
    request_attr->rta_len = RTA_LENGTH(sizeof (struct in_addr));
    request.header.nlmsg_len += request_attr->rta_len;
    inet_pton(AF_INET, address, RTA_DATA(request_attr));
    /* request.attributes[IFA_ADDRESS] = address */
    request_attr = RTA_NEXT(request_attr, attributes_buf_avail);
    request_attr->rta_type = IFA_ADDRESS;
    request_attr->rta_len = RTA_LENGTH(sizeof (struct in_addr));
    request.header.nlmsg_len += request_attr->rta_len;
    inet_pton(AF_INET, address, RTA_DATA(request_attr));
    if (send(netlink_fd, &request, request.header.nlmsg_len, 0) == -1) {
        return -1;
    }
    return 0;
}</pre></blog-code><p>The second Netlink command uses <tt>RTM_NEWLINK</tt> to enable the interface. It's equivalent to running <tt>ip link set up</tt>.</p><blog-code syntax=c><pre data-start=75>
int netlink_link_up(int netlink_fd, const char *iface_name) {
    struct {
        struct nlmsghdr header;
        struct ifinfomsg content;
    } request;
    memset(&request, 0, sizeof request);
    request.header.nlmsg_len = NLMSG_LENGTH(sizeof request.content);
    request.header.nlmsg_flags = NLM_F_REQUEST;
    request.header.nlmsg_type = RTM_NEWLINK;
    request.content.ifi_index = if_nametoindex(iface_name);
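    /* ifi_change is a mask of which ifi_flags bits to update; IFF_UP is
       bit 0x1, so this request changes only the interface's up/down state. */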
    request.content.ifi_flags = IFF_UP;
    request.content.ifi_change = 1;
    if (send(netlink_fd, &request, request.header.nlmsg_len, 0) == -1) {
        return -1;
    }
    return 0;
}</pre></blog-code><p>At this point the TUN/TAP interface has been fully configured and is just waiting for our process to read/write network data.</p></blog-section><blog-section><h2 slot=title>Step 4: Process network traffic</h2><p>For this example I'll be writing a very simple <tt>tun2udp</tt> binary, which forwards IPv4 packets to/from UDP on localhost. Compile it with GCC or Clang:</p><blog-code syntax=commands><pre>
gcc -o tun2udp tun2udp.c
send_port=12345
recv_port=12346
sudo ./tun2udp 10.11.12.0/24 $send_port $recv_port</pre></blog-code><p></p><blog-code syntax=c><pre>
/* Copyright (c) John Millikin <john@john-millikin.com> */
/* SPDX-License-Identifier: 0BSD */
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
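/*
 * Smoke test (my sketch, not from the original post): with tun2udp running
 * as shown above, `ping 10.11.12.99` routes ICMP echo requests into the
 * TUN interface, and each one is forwarded as a raw IPv4 packet to UDP
 * port 12345; a listener such as `nc -u -l 12345 | xxd` will show the
 * packets arriving.
 */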
int run_proxy(int tuntap_fd, int send_fd, int recv_fd) {
    struct pollfd poll_fds[2];
    char recv_buf[UINT16_MAX];
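    /* recv_buf holds 65,535 bytes: the maximum size of an IPv4 packet
       (its total-length header field is 16 bits), so a single read() from
       the TUN device can never truncate a packet. */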
    poll_fds[0].fd = tuntap_fd;
    poll_fds[0].events = POLLIN;
    poll_fds[1].fd = recv_fd;
    poll_fds[1].events = POLLIN;
    while (1) {
        if (poll(poll_fds, 2, -1) == -1) {
            return -1;
        }
        if ((poll_fds[0].revents & POLLIN) != 0) {
            ssize_t count = read(tuntap_fd, recv_buf, UINT16_MAX);
            if (count < 0) {
                return -1;
            }
            send(send_fd, recv_buf, count, 0);
        }
        if ((poll_fds[1].revents & POLLIN) != 0) {
            ssize_t count = recv(recv_fd, recv_buf, UINT16_MAX, 0);
            if (count < 0) {
                return -1;
            }
            if (write(tuntap_fd, recv_buf, count) == -1) {
                return -1;
            }
        }
    }
    return 0;
}
int connect_localhost_udp(uint16_t port) {
    int fd, rc;
    struct sockaddr_in addr;
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd == -1) {
        return -1;
    }
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = inet_addr("127.0.0.1");
    rc = connect(fd, (struct sockaddr*) &addr, sizeof addr);
    if (rc == -1) {
        int connect_errno = errno;
        close(fd);
        errno = connect_errno;
        return -1;
    }
    return fd;
}
int bind_localhost_udp(uint16_t port) {
    int fd, rc;
    struct sockaddr_in addr;
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd == -1) {
        return -1;
    }
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = inet_addr("127.0.0.1");
    rc = bind(fd, (struct sockaddr*) &addr, sizeof addr);
    if (rc == -1) {
        int bind_errno = errno;
        close(fd);
        errno = bind_errno;
        return -1;
    }
    return fd;
}</pre></blog-code><p>The rest of the code is just argument parsing. For the TUN interface address it accepts an IPv4 dotted quad, with an optional netmask (defaulting to <tt>/32</tt>).</p><blog-code syntax=c><pre data-start=93>
int split_address(char *address_str, uint8_t *network_prefix_bits) {
    char *prefix_sep, *prefix_str;
    prefix_sep = strchr(address_str, '/');
    if (prefix_sep == NULL) {
        prefix_str = NULL;
        *network_prefix_bits = 32;
    } else {
        *prefix_sep = 0;
        prefix_str = prefix_sep + 1;
    }
    if (inet_addr(address_str) == INADDR_NONE) {
        if (prefix_sep != NULL) {
            *prefix_sep = '/';
        }
        return -1;
    }
    if (prefix_str != NULL) {
        char *prefix_extra;
        long prefix_raw = strtol(prefix_str, &prefix_extra, 10);
        if (prefix_raw < 0 || prefix_raw > 32) {
            *prefix_sep = '/';
            return -1;
        }
        if (*prefix_extra != 0) {
            *prefix_sep = '/';
            return -1;
        }
        *network_prefix_bits = prefix_raw;
    }
    return 0;
}
int parse_port(char *port_str, uint16_t *port) {
    char *extra;
    long raw = strtol(port_str, &extra, 10);
    if (raw < 0 || raw > UINT16_MAX) {
        return -1;
    }
    if (*extra != 0) {
        return -1;
    }
    *port = raw;
    return 0;
}</pre></blog-code><p>Finally we get to <tt>main()</tt> and can glue everything together. Copy (or <tt>#include</tt>) the TUN/TAP and Netlink code from earlier sections. The TUN/TAP flags are hardcoded to <tt>IFF_TUN | IFF_NO_PI</tt>, which means it will send/receive IP packets with no additional framing. The interface name will be assigned by the kernel.</p><blog-code syntax=c><pre data-start=141>
int main(int argc, char **argv) {
    int tuntap_fd, netlink_fd, send_fd, recv_fd, rc;
    char iface_name[IFNAMSIZ];
    char *address;
    uint8_t prefix_bits;
    uint16_t send_port, recv_port;
    if (argc < 4) {
        fprintf(stderr, "Usage: %s <address> <send-port> <recv-port>\n", argv[0]);
        return 1;
    }
    address = argv[1];
    if (split_address(address, &prefix_bits) == -1) {
        fprintf(stderr, "Invalid address \"%s\"\n", argv[1]);
        return 1;
    }
    if (parse_port(argv[2], &send_port) == -1) {
        fprintf(stderr, "Invalid port \"%s\"\n", argv[2]);
        return 1;
    }
    if (parse_port(argv[3], &recv_port) == -1) {
        fprintf(stderr, "Invalid port \"%s\"\n", argv[3]);
        return 1;
    }
    send_fd = connect_localhost_udp(send_port);
    if (send_fd == -1) {
        fprintf(stderr, "connect_localhost_udp(%u): ", send_port);
        perror(NULL);
        return 1;
    }
    recv_fd = bind_localhost_udp(recv_port);
    if (recv_fd == -1) {
        fprintf(stderr, "bind_localhost_udp(%u): ", recv_port);
        perror(NULL);
        return 1;
    }
    tuntap_fd = tuntap_connect(NULL, IFF_TUN | IFF_NO_PI, iface_name);
    if (tuntap_fd == -1) {
        perror("tuntap_connect");
        return 1;
    }
    netlink_fd = netlink_connect();
    if (netlink_fd == -1) {
        perror("netlink_connect");
        return 1;
    }
    rc = netlink_set_addr_ipv4(netlink_fd, iface_name, address, prefix_bits);
    if (rc == -1) {
        perror("netlink_set_addr_ipv4");
        return 1;
    }
    rc = netlink_link_up(netlink_fd, iface_name);
    if (rc == -1) {
        perror("netlink_link_up");
        return 1;
    }
    close(netlink_fd);
    if (run_proxy(tuntap_fd, send_fd, recv_fd) == -1) {
        perror("run_proxy");
        return 1;
    }
    return 0;
}</pre></blog-code></blog-section><blog-footnotes><hr><ol><li id=fn:1><p>If the Netlink socket has <tt>bind()</tt> called on it, then the traced <tt>RTM_NEWADDR</tt> command is formatted like this:</p><blog-code><pre>
sendto(6, [
    {
        nlmsg_len=40,
        nlmsg_type=RTM_NEWADDR,
        nlmsg_flags=NLM_F_REQUEST|NLM_F_EXCL|NLM_F_CREATE,
        nlmsg_seq=0,
        nlmsg_pid=0
    }, {
        ifa_family=AF_INET,
        ifa_prefixlen=24,
        ifa_flags=0,
        ifa_scope=RT_SCOPE_UNIVERSE,
        ifa_index=if_nametoindex("tun0")
    }, [
        [{nla_len=8, nla_type=IFA_LOCAL}, inet_addr("10.10.0.1")],
        [{nla_len=8, nla_type=IFA_ADDRESS}, inet_addr("10.10.0.1")]
    ]
], 40, 0, NULL, 0) = 40</pre></blog-code><p>If the socket does not have <tt>bind()</tt> called on it, then the same command is formatted like this:</p><blog-code><pre>
sendto(6, [
    {
        nlmsg_len=40,
        nlmsg_type=0x14 /* NLMSG_??? */,
        nlmsg_flags=NLM_F_REQUEST|0x600,
        nlmsg_seq=0,
        nlmsg_pid=0
    }, "\x02\x18\x00\x00\x55\x00\x00\x00\x08\x00\x02\x00\x0a\x0a\x00\x01\x08\x00\x01\x00\x0a\x0a\x00\x01"
], 40, 0, NULL, 0) = 40</pre></blog-code></li></ol></blog-footnotes></blog-article><style type=text/css>tt{font-weight:700}</style><blog-article posted=2023-04-13T00:50:27Z><h1 slot=title>Running SunOS 4 in QEMU (SPARC)</h1><div slot=summary><p><a href=https://en.wikipedia.org/wiki/SunOS>SunOS</a> is a historical UNIX operating system widely used from the mid 80s into the early/mid 90s. Older versions of QEMU struggled to emulate the SPARC platform that SunOS ran on, but QEMU v7.2 supports SPARC well enough to install and run SunOS without any unusual workarounds.</p></div><blog-section><h2 slot=title>Installation media</h2><p>The installation CD-ROM for SunOS 4.1.4 (also branded Solaris 1.1.2) is available on the Internet Archive:</p><ul><li><a href=https://archive.org/details/solaris112sparc>Solaris v1.1.2 SPARC (704-4662-10)</a> (uploaded 2019-09-23)</li></ul><p>You might also want a dump of the SPARCstation 5 boot PROM. QEMU's bundled OpenBIOS is capable of booting SunOS, but the original PROM is useful for people who want a more authentic emulation experience.</p><ul><li><a href=https://github.com/andarazoroflove/sparc/raw/master/ss5.bin>https://github.com/andarazoroflove/sparc/raw/master/ss5.bin</a></li><li><a href=http://vtda.org/bits/ROMs/Sun/ss5.bin>http://vtda.org/bits/ROMs/Sun/ss5.bin</a></li></ul><blog-code syntax=commands><pre>
shasum -a 256 *
# 559c8455918029ffdaaf9890caf9f791c3a3604d2f2158793751b770593c0a3c SunOS-v4.1.4.iso
# e7f40845504c65f4011278aa3e97a9810aa36775e6c199b715839fbc25eec45e ss5.bin</pre></blog-code></blog-section><blog-section><h2 slot=title>Preparing the SunOS mini-root</h2><p>The first stage of the SunOS installation process is to prepare a minimal bootable environment.</p><p>SunOS is designed to run on Sun's hardware, so it's relatively fussy about device layout and configuration compared to an OS intended for consumer hardware. The <a href=https://docs.oracle.com/cd/E19127-01/sparc5.ws/801-6396-11/801-6396-11.pdf>SPARCstation 5 Service Manual</a> is a useful reference.</p><ul><li>The internal HDD must have SCSI target 3, and the internal CD-ROM must have SCSI target 6.</li><li>SunOS expects the CD-ROM to have a physical block size of 512 bytes<blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref>.</li><li>Although a real SPARCstation 5 supports up to 256 MiB of RAM, we'll be giving it only 64 MiB to simplify the installation process<blog-footnote-ref>[<a href=#fn:2>2</a>]</blog-footnote-ref></li></ul><p>Leave off the <tt>-bios ss5.bin</tt> line to use QEMU's built-in OpenBIOS.</p><blog-code syntax=commands><pre>
qemu-system-sparc -version
# QEMU emulator version 7.2.1
# Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers
qemu-img create -f qcow2 sunos-hdd.img 2G
# Formatting 'sunos-hdd.img', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2147483648 lazy_refcounts=off refcount_bits=16
qemu-system-sparc \
# -machine SS-5 \
# -m 64 \
# -bios ss5.bin \
# -drive file=sunos-hdd.img,bus=0,unit=3,media=disk \
# -device scsi-cd,channel=0,scsi-id=6,id=cdrom,drive=cdrom,physical_block_size=512 \
# -drive if=none,file=SunOS-v4.1.4.iso,media=cdrom,id=cdrom</pre></blog-code><p>Once at the firmware prompt, type <tt>boot cdrom</tt> (or <tt>boot cdrom:d</tt> for OpenBIOS).</p><div><img src=https://john-millikin.com/by-sha256/f930483c3271679e40e7bffada5f4646116db225bc417d4812a6ef875af0a977/first-boot-ss5.png style=max-width:512px;margin:1em></div><div><img src=https://john-millikin.com/by-sha256/98b29c99084ae8449ff4eefbcf32311fcfa6e552bde34a7f09c76fa1a16bf64f/first-boot-install-prompt.png style=max-width:512px;margin:1em></div><p>In the disk formatter, select disk type 13 (<tt>SUN2.1G</tt>), write the label to disk, then quit the formatting utility.</p><div><img src=https://john-millikin.com/by-sha256/a30764a4ca8e8bbc4a5e0b288db6f0142382d63cbdf6eb0544d0fd3945b8c2b2/first-boot-format-prompt.png style=max-width:512px;margin:1em></div><div><img src=https://john-millikin.com/by-sha256/ef9f1fb5d3f80f8bb4289f48bc07be18033c9e71b39ed9f46ac062cf6f7f3f12/first-boot-format-done.png style=max-width:512px;margin:1em></div><p>The installation script will prep the disk for the main installer, then prompt for a reboot.</p><p>If using OpenBIOS, the VM might not boot into mini-root by itself. Type <tt>boot disk0:b -sw</tt> at the firmware prompt to continue.</p></blog-section><blog-section><h2 slot=title>Installing SunOS itself</h2><p>After rebooting, you should see some logspam and a root prompt. Run <tt>suninstall</tt> to continue the installation process.</p><div><img src=https://john-millikin.com/by-sha256/f3585168e7aeb2ee9b693b31e1d0de050d503cd760f8035e42de0cf89a70d41c/second-boot-root-prompt.png style=max-width:512px;margin:1em></div><p>There's no complicated decisions to make here, so I just went with the quick install of the full system.</p><div><img src=https://john-millikin.com/by-sha256/393e961d7bb01c02dd152de851aa78978d185f66083ba9b1ee4dfcf92a8aa76a/suninstall-standard-installations.png style=max-width:512px;margin:1em></div><p>After the installation is finished the VM will reboot and you'll be back at the firmware prompt. Type <tt>boot disk</tt> (or <tt>boot disk3:a</tt> for OpenBIOS) to boot.</p><p>In its original environment, a new SunOS workstation would have received its network configuration from <a href=https://en.wikipedia.org/wiki/Reverse_Address_Resolution_Protocol>RARP</a> (the predecessor of DHCP) and <a href=https://en.wikipedia.org/wiki/Network_Information_Service>NIS</a> (sort of a proto-LDAP). Since we don't have a lab of 100 workstations to provision, manual data entry is fine.</p><div><img src=https://john-millikin.com/by-sha256/037a765488c67be16f8cb64c5575e449ac394867d59e21b2ffbf82886575b25e/system-setup.png style=max-width:512px;margin:1em></div><div><img src=https://john-millikin.com/by-sha256/0b4bcf9df79e25fb89551d9234442ac64273ea13ecc4fd14f1dc6e00edd45899/manual-setup.png style=max-width:512px;margin:1em></div><p>The default IP address for QEMU's usermode networking is 10.0.2.15, for which SunOS will assign a netmask of 0xFF000000<blog-footnote-ref>[<a href=#fn:3>3</a>]</blog-footnote-ref>.</p><div><img src=https://john-millikin.com/by-sha256/3fbf9a8bb6f06ebac21fc433ec382219e6f84abf889dd44ea18fbe343f775183/network-setup.png style=max-width:512px;margin:1em></div><p><q>A password should be six to eight characters long</q> 🔒.</p><div><img src=https://john-millikin.com/by-sha256/9f644a2579a79206fb8848c748ab0b6331a00e26d3fbd5edced4a41684897e85/root-password.png style=max-width:512px;margin:1em></div><p>Almost done. 
The last step is to configure the gateway router, and then the VM will have working networking. Just log in as root, set the gateway address, and write it to <tt>/etc/defaultrouter</tt> so it'll persist across reboots.</p><div><img src=https://john-millikin.com/by-sha256/f332958b7f1abf3eb611ffa455a683ae47d366143c4c14bd1fd65c077468c274/default-route-ping.png style=max-width:512px;margin:1em></div><p>Log in as a non-root user to launch the native graphical UI of SunOS, OpenWindows.</p><div><img src=https://john-millikin.com/by-sha256/d21350fcf093cda00d40f26bfa28ae0cdc63c0741e4b77d93538dd1f23603d48/openwindows.png style=max-width:512px;margin:1em></div></blog-section><blog-section><h2 slot=title>Installing a web browser (Netscape)</h2><p>The final version of SunOS was released when the Web was in its infancy, and therefore does not have a bundled web browser (or any sort of HTTP-related utilities). Luckily for us SunOS/SPARC was a popular platform and Netscape published binaries for it. Actually <i>finding</i> those binaries was a bit of a slog, but I eventually located a copy of Netscape Communicator v4.61 on the delightfully retro page <a href=http://www.floodgap.com/retrobits/solace/>The Solbourne Solace @ Floodgap Retrobits</a> (<a href=https://web.archive.org/web/20230325091337/http://www.floodgap.com/retrobits/solace/>archive</a>).</p><p>In the least surprising twist ever, the tarball itself is only available via Gopher, at <a href=gopher://gopher.floodgap.com/9/archive/sunos-4-solbourne-os-mp/communicator-v461-us.sparc-sun-sunos4.1.3_U1.tar.gz>gopher://gopher.floodgap.com/9/archive/sunos-4-solbourne-os-mp/communicator-v461-us.sparc-sun-sunos4.1.3_U1.tar.gz</a>. I have mirrored it to archive.org at <a href=https://archive.org/details/netscape-communicator-v461-us.sparc-sun-sunos4.1.3_U1>Netscape Communicator 4.61 [SunOS 4.1.3]</a>.</p><p>In any case, once you've obtained a copy of the Netscape installation package you'll find that it needs gzip, which at the time was a GNU-specific technology. I recommend following the manual installation instructions from <tt>README.install</tt> on your host machine to produce a plain tarball.</p><blog-code syntax=commands><pre>
shasum -a 256 communicator-v461-us.sparc-sun-sunos4.1.3_U1.tar.gz
# c667feb3a73721872d60ffd4aab24e39be8d5a48761397b4dd2184b4dd2bb5de communicator-v461-us.sparc-sun-sunos4.1.3_U1.tar.gz
tar -xf communicator-v461-us.sparc-sun-sunos4.1.3_U1.tar.gz
cd communicator-v461.sparc-sun-sunos4.1.3_U1/
mkdir -p netscape-v4.61/java/classes
mv *.nif netscape-v4.61/
mv *.jar netscape-v4.61/java/classes/
cd netscape-v4.61/
gzip -dc netscape-v461.nif | tar -xf -
gzip -dc nethelp-v461.nif | tar -xf -
gzip -dc spellchk-v461.nif | tar -xf -
cd ..
tar -cf ../netscape-v4.61.tar netscape-v4.61/</pre></blog-code><p>Getting that tarball into the VM is also a little tricky due to the lack of common network protocols between 1994 and 2023. I ended up writing a helper (<a href=#recv-c>recv.c</a>) that will connect to a TCP socket and stream any data it receives to a file.</p><blog-code syntax=commands><pre>
# # host (Linux, BSD, and most others)
nc -Nl 127.0.0.1 5000 < netscape-v4.61.tar
#
# # host (macOS)
nc -l 127.0.0.1 5000 < netscape-v4.61.tar</pre></blog-code><p></p><blog-code syntax=commands prompt=%><pre>
# # VM
cc -o recv recv.c
./recv 10.0.2.2:5000 netscape-v4.61.tar</pre></blog-code><p>Unpack that tarball, write a wrapper script and a stub <tt>/etc/resolv.conf</tt>, and Netscape is ready to go.</p><blog-code syntax=commands prompt=%><pre>
cat /etc/resolv.conf
# domain sunos.local
# nameserver 10.0.2.3
cat ~/netscape.sh
# #!/bin/sh
# XNLSPATH="${HOME}/netscape-v4.61/nls"
# XKEYSYMDB="${HOME}/netscape-v4.61/XKeysymDB"
# export XNLSPATH XKEYSYMDB
# exec "${HOME}/netscape-v4.61/netscape_dns" "$@"</pre></blog-code><div><img src=https://john-millikin.com/by-sha256/6be1f117f67e6cc235088f2788b3542c767e99f3fb5eaba695e38ee717480695/netscape.png style=max-width:512px;margin:1em></div></blog-section><blog-section id=recv-c><h2 slot=title>Appendix A: recv.c</h2><p>This should be fairly readable despite being written in K&R C; the BSD sockets API hasn't changed much.</p><p>If you don't want to type the whole thing in by hand, see the next section about X11 forwarding.</p><blog-code syntax=c><pre>
#include <arpa/inet.h>
#include <fcntl.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
int split_server_address(server_address, server_ip, server_port)
char *server_address;
unsigned long *server_ip;
unsigned short *server_port;
{
    char *port_str, *port_extra;
    long port_raw;
    port_str = strchr(server_address, ':');
    if (port_str == NULL) {
        return -1;
    }
    *(port_str++) = 0;
    *server_ip = inet_addr(server_address);
    if (*server_ip == -1) {
        return -1;
    }
    port_raw = strtol(port_str, &port_extra, 10);
    if (port_raw < 1 || port_raw > 65535) {
        return -1;
    }
    if (*port_extra != 0) {
        return -1;
    }
    *server_port = port_raw;
    return 0;
}
int recv_file(server_ip, server_port, output_path)
unsigned long server_ip;
unsigned short server_port;
char *output_path;
{
    int socket_fd, output_fd;
    struct sockaddr_in server;
    char buffer[2048];
    socket_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (socket_fd == -1) {
        return -1;
    }
    memset(&server, 0, sizeof server);
    server.sin_family = AF_INET;
    server.sin_addr.s_addr = server_ip;
    server.sin_port = htons(server_port);
    if (connect(socket_fd, (struct sockaddr*)&server, sizeof server) == -1) {
        return -1;
    }
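    /* Note (my annotation): O_TRUNC is omitted here, so re-running against
       an existing, larger output file would leave stale bytes at its end. */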
    output_fd = open(output_path, O_WRONLY | O_CREAT, 0600);
    if (output_fd == -1) {
        return -1;
    }
    while (1) {
        int n = read(socket_fd, buffer, sizeof buffer);
        if (n == -1) {
            close(output_fd);
            return -1;
        }
        if (n == 0) {
            return close(output_fd);
        }
        write(output_fd, buffer, n);
    }
}
int main(argc, argv)
int argc;
char **argv;
{
    unsigned long server_ip;
    unsigned short server_port;
    if (argc < 3) {
        fprintf(stderr, "Usage: %s <server_address> <output_path>\n", argv[0]);
        return 1;
    }
    if (split_server_address(argv[1], &server_ip, &server_port) == -1) {
        fprintf(stderr, "Invalid server address \"%s\"\n", argv[1]);
        return 1;
    }
    if (recv_file(server_ip, server_port, argv[2]) == -1) {
        perror("Error receiving file");
        return 1;
    }
    return 0;
}
</pre></blog-code></blog-section><blog-section><h2 slot=title>Appendix B: X11 forwarding</h2><p>The experience of interacting with a GUI from 1994 via QEMU's console is not great, so I recommend running an X11 server on your host and having the VM connect to it.</p><p>If you're already running an X11-based desktop (BSD, older Linux, macOS with XQuartz<blog-footnote-ref>[<a href=#fn:4>4</a>]</blog-footnote-ref>) then you can proxy its socket directly to TCP and then connect to it from the VM. This will let you copy-paste big blobs of text such as <a href=#recv-c>recv.c</a>.</p><blog-code syntax=commands><pre>
# # host
socat TCP-LISTEN:6001,fork,bind=127.0.0.1 UNIX-CONNECT:/tmp/.X11-unix/X0</pre></blog-code><blog-code syntax=commands prompt=%><pre>
# # VM
setenv DISPLAY 10.0.2.2:1
xterm</pre></blog-code><p>Alternatively, use a nested X11 server such as Xnest or Xephyr. You'll be able to run the OpenWindows window manager, so it feels a bit like using VNC.</p><blog-code syntax=commands><pre>
# # host
Xephyr -ac -listen tcp -screen 2048x1536 :1</pre></blog-code><blog-code syntax=commands prompt=%><pre>
# # VM
setenv DISPLAY 10.0.2.2:1
olwm</pre></blog-code><div><a href=https://john-millikin.com/by-sha256/697985b5647b1048a4e17dd888d0dd5b1dee3d18ce336dc3fb1dad39b5209daf/xephyr.png><img src=https://john-millikin.com/by-sha256/697985b5647b1048a4e17dd888d0dd5b1dee3d18ce336dc3fb1dad39b5209daf/xephyr.png style=max-width:512px;margin:1em></a></div><p>If <tt>olwm</tt> segfaults on startup, make sure that the host machine has the legacy X11 fonts installed. In Ubuntu 22.04 I had to install the <tt>xfonts-100dpi</tt> package.</p></blog-section><blog-footnotes><hr><ol><li id=fn:1><p>Nowadays the physical block size for CD-ROMs is 2048 bytes, but in the 90s this value wasn't standardized yet. Consumer CD-ROM drives had a physical jumper on the back that could select the block size, and some OSes (including SunOS) would encounter read errors if the jumper wasn't set to what they expected.</p></li><li id=fn:2><p>SunOS requires a swap partition that is at least as large as machine memory, and the default swap partition size for <tt>SUN2.1G</tt> is 100 MiB. Using 64 MiB lets us avoid fiddling with the disk geometry in the formatting tool.</p></li><li id=fn:3><p>SunOS pre-dates CIDR, so it thinks of all 10.x.x.x addresses as belonging to the 10.0.0.0/8 "Class A" network. This is technically wrong for QEMU, which by default uses a netmask of 0xFFFFFF00, but it doesn't really matter as long as you don't try to do anything too complicated with multi-VM networking.</p></li><li id=fn:4><p>Note that the default socket path for XQuartz may contain a colon, which will make socat unhappy because it uses colons as part of its option syntax. You can work around this with a symlink.</p></li></ol></blog-footnotes></blog-article><blog-article posted=2023-04-11T02:17:00Z><h1 slot=title>Improved UNIX socket networking in QEMU 7.2</h1><div slot=summary><p><a href=https://www.qemu.org/2022/12/14/qemu-7-2-0/>QEMU 7.2</a> quietly introduced two new network backends, <tt>-netdev dgram</tt> and <tt>-netdev stream</tt>. Unlike the older <tt>-netdev socket</tt>, these new backends directly support <tt>AF_UNIX</tt> socket addresses without the need for an intermediate wrapper tool.</p></div><blog-section><h2 slot=title>The situation up until now</h2><p>QEMU has a <tt>-netdev socket</tt> network backend, which will send/receive Ethernet frames via TCP (the <tt>connect=</tt> and <tt>listen=</tt> modes) or UDP (the <tt>mcast=</tt> and <tt>udp=</tt> modes). This functionality isn't well documented, and its intended use appears to be as a sort of simple network hub for hosts that can't use a TAP device<blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref>.</p><blockquote><pre>$ qemu-system-x86_64 --help
[...]
-netdev socket,id=str[,fd=h][,listen=[host]:port][,connect=host:port]
configure a network backend to connect to another network
using a socket connection
-netdev socket,id=str[,fd=h][,mcast=maddr:port[,localaddr=addr]]
configure a network backend to connect to a multicast maddr and port
use 'localaddr=addr' to specify the host address to send packets from
-netdev socket,id=str[,fd=h][,udp=host:port][,localaddr=host:port]
configure a network backend to connect to another network
using an UDP tunnel</pre></blockquote><p>A less-obvious (and <i>completely</i> undocumented) behavior of <tt>-netdev socket</tt> is that (1) the <tt>fd=</tt> syntax is actually its own mutually-exclusive mode, and (2) it doesn't need to be the file descriptor of a TCP socket in particular. This means it's possible to coax QEMU into using a UNIX socket for its network backend, by connecting to the socket in a wrapper process before spawning QEMU. The wrapper doesn't have to be complex; see <a href=#qemu-wrapper-c>qemu-wrapper.c</a> for a working example in 50 lines of C.</p><p>Whatever process created the UNIX socket can of course do whatever it needs to with the raw Ethernet frames it receives, including acting as a switch or VPN or whatever. If you don't already have a preferred usermode network library, I recommend <a href=https://scapy.net/>Scapy</a> as a comprehensive and beginner-friendly option. For a starting point, try using <a href=#print-frames-py>print-frames.py</a> to log network traffic of a Debian live CD:</p><div><img src=https://john-millikin.com/by-sha256/5bdb157d59ab761707c9ff5404cee640fac578b701400ebf3ea8288fde90f368/netdev-socket.png style=max-width:800px></div></blog-section><blog-section><h2 slot=title>New backends in QEMU 7.2</h2><p>The <a href=https://www.qemu.org/2022/12/14/qemu-7-2-0/>QEMU 7.2</a> release adds two new network backends, <tt>-netdev dgram</tt> and <tt>-netdev stream</tt>. Although the related mailing list discussion<blog-footnote-ref>[<a href=#fn:2>2</a>]</blog-footnote-ref> makes it clear that the new functionality exists to better support UNIX sockets, in classic QEMU fashion this minor detail has been left out of the documentation<blog-footnote-ref>[<a href=#fn:3>3</a>]</blog-footnote-ref>.</p><blockquote><pre style=overflow:auto>
-netdev stream,id=str[,server=on|off],addr.type=inet,addr.host=host,addr.port=port[,to=maxport][,numeric=on|off][,keep-alive=on|off][,mptcp=on|off][,addr.ipv4=on|off][,addr.ipv6=on|off]
-netdev stream,id=str[,server=on|off],addr.type=unix,addr.path=path[,abstract=on|off][,tight=on|off]
-netdev stream,id=str[,server=on|off],addr.type=fd,addr.str=file-descriptor
configure a network backend to connect to another network
using a socket connection in stream mode.
-netdev dgram,id=str,remote.type=inet,remote.host=maddr,remote.port=port[,local.type=inet,local.host=addr]
-netdev dgram,id=str,remote.type=inet,remote.host=maddr,remote.port=port[,local.type=fd,local.str=file-descriptor]
configure a network backend to connect to a multicast maddr and port
use ``local.host=addr`` to specify the host address to send packets from
-netdev dgram,id=str,local.type=inet,local.host=addr,local.port=port[,remote.type=inet,remote.host=addr,remote.port=port]
-netdev dgram,id=str,local.type=unix,local.path=path[,remote.type=unix,remote.path=path]
-netdev dgram,id=str,local.type=fd,local.str=file-descriptor
configure a network backend to connect to another network
using an UDP tunnel
</pre></blockquote><p>The <tt>-netdev stream</tt> backend works just like the pseudo-TCP example above, but doesn't require a wrapper:</p><div><img src=https://john-millikin.com/by-sha256/b7fde3b5d5736a80365d40317e2973012d03875429dc66723de3b4a478b44911/netdev-stream.png style=max-width:800px></div><p>The <tt>-netdev dgram</tt> backend is a bit different. Since datagrams are inherently unidirectional, frames sent to the host use a separate socket from frames sent to the guest. The receiving program also needs to be adjusted, because QEMU (reasonably) doesn't length-prefix datagrams.</p><p><a href=#print-frames-dgram-arp-py>print-frames-dgram-arp.py</a> is an expanded version of the earlier example. It waits for the VM to send an ARP request for address 192.168.100.101, then prints any frames received after that request. Within the VM I turned off Avahi (noisy), manually configured the network, and used Python to send a UDP packet.</p><p>Within the <tt>-netdev dgram</tt> flag, the value of <tt>local.path=</tt> is the socket address that the host will send frames to, and <tt>remote.path=</tt> is the socket address that the host will receive frames from.</p><div><img src=https://john-millikin.com/by-sha256/88f56835840fee177540792f94e91b767be408481362ecd2c0c0d2c8c8bc8fd3/netdev-dgram.png style=max-width:400px>
<img src=https://john-millikin.com/by-sha256/ed457ab4ab465b4dbadde41c6746dbb6b6a6d3c98f5a8ee9e2cf4e39a864681e/hello-from-vm.png style=max-width:400px></div><p>Despite my general crankiness about the docs coverage, I'm quite happy to see this functionality land. Native support for <tt>AF_UNIX</tt> datagrams is exciting (for a certain type of person) because it eliminates a lot of the complexity involved in wiring up QEMU with a userspace network stack. Using UNIX sockets means you don't need to worry about port conflicts, it doesn't need TAP so it's sandbox-friendly, and the VM's network won't break if the packet processor restarts.</p></blog-section><blog-section id=qemu-wrapper-c><h2 slot=title>Appendix A: qemu-wrapper.c</h2><p>Nothing fancy here, it just creates a socket and connects it to a user-provided path.</p><blog-code syntax=c><pre>
/* Copyright (c) John Millikin <john@john-millikin.com> */
/* SPDX-License-Identifier: 0BSD */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>
int main(int argc, char **argv) {
    int sock_fd, rc;
    char *sock_path;
    size_t sock_path_len;
    struct sockaddr_un sock_addr = {AF_UNIX, ""};
    if (argc < 3) {
        fprintf(stderr, "Usage: %s <socket> <qemu> [args...]\n", argv[0]);
        return 1;
    }
    sock_path = argv[1];
    sock_path_len = strlen(sock_path);
    if (sock_path_len >= sizeof sock_addr.sun_path) {
        fprintf(stderr, "Socket path \"%s\" too long\n", sock_path);
        return 1;
    }
    memcpy(sock_addr.sun_path, sock_path, sock_path_len + 1);
    sock_fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (sock_fd == -1) {
        perror("Failed to create socket");
        return 1;
    }
    rc = connect(sock_fd, (struct sockaddr*)&sock_addr, sizeof sock_addr);
    if (rc == -1) {
        fprintf(stderr, "Failed to connect to socket \"%s\": ", sock_path);
        perror(NULL);
        return 1;
    }
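    /* Note (my annotation): the connected descriptor is intentionally left
       open across execv(), so the spawned QEMU can reference it with
       -netdev socket,fd=N; with this wrapper N is typically 3, the first
       descriptor after stdin/stdout/stderr. */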
    execv(argv[2], argv + 2);
    fprintf(stderr, "%s: ", argv[2]);
    perror(NULL);
    return 1;
}</pre></blog-code></blog-section><blog-section id=print-frames-py><h2 slot=title>Appendix B: print-frames.py</h2><p>Reads Ethernet frames from a socket, then uses <a href=https://scapy.net/>Scapy</a> to parse and print them.</p><p>The expected format of the TCP stream doesn't seem to be documented. In my testing the Ethernet frames were always prefixed with their length as a big-endian 32-bit uint.</p><blog-code syntax=python><pre>
#!/usr/bin/python3
# Copyright (c) John Millikin <john@john-millikin.com>
# SPDX-License-Identifier: 0BSD
import os
import os.path
import socket
import struct
import sys
from scapy import all as scapy
socket_path = sys.argv[1]
if os.path.exists(socket_path):
    os.remove(socket_path)
sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.bind(socket_path)
sock.listen(1)
conn, addr = sock.accept()
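# Caveat (my note, not in the original): recv() on a SOCK_STREAM socket may
# return fewer bytes than requested. A robust reader would loop until the
# full length prefix and frame arrive; this demo keeps the happy path only.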
while True:
    frame_len_buf = conn.recv(4)
    if len(frame_len_buf) == 0:
        break
    (frame_len,) = struct.unpack("!L", frame_len_buf)
    frame = scapy.Ether(conn.recv(frame_len))
    print(repr(frame))
    print("")</pre></blog-code></blog-section><blog-section id=print-frames-dgram-arp-py><h2 slot=title>Appendix C: print-frames-dgram-arp.py</h2><p>Similar to the above, but adjusted for unidirectional sockets and expanded to verify that sending frames (ARP responses) to the VM works as expected. Within the VM, ping 192.168.100.101 and watch the ICMP frames come through.</p><blog-code syntax=python><pre>
#!/usr/bin/python3
# Copyright (c) John Millikin <john@john-millikin.com>
# SPDX-License-Identifier: 0BSD
import os
import os.path
import socket
import sys
from scapy import all as scapy
send_socket_path = sys.argv[1]
recv_socket_path = sys.argv[2]
if os.path.exists(recv_socket_path):
    os.remove(recv_socket_path)
send_sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
recv_sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
recv_sock.bind(recv_socket_path)
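# Note (mine): QEMU sends each Ethernet frame as a single datagram with no
# length prefix, so recv() returns exactly one whole frame; 9001 bytes is
# comfortably larger than a jumbo frame.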
ready = False
while True:
    frame = scapy.Ether(recv_sock.recv(9001))
    if ready:
        print(repr(frame))
        print("")
    if not isinstance(frame.payload, scapy.ARP):
        continue
    if frame.payload.op != 1: # who-has
        continue
    if frame.payload.pdst == "192.168.100.101":
        resp_bytes = scapy.raw((scapy.Ether(
            dst="52:54:00:12:34:56",
            src="52:54:00:12:34:ff",
        ) / scapy.ARP(
            op=2, # is-at
            hwsrc="52:54:00:12:34:ff",
            psrc="192.168.100.101",
            hwdst="52:54:00:12:34:56",
            pdst="192.168.100.100",
        )))
        send_sock.sendto(resp_bytes, send_socket_path)
        ready = True</pre></blog-code></blog-section><blog-footnotes><hr><ol><li id=fn:1><p><a href=https://wiki.qemu.org/Documentation/Networking#Socket>https://wiki.qemu.org/Documentation/Networking#Socket</a></p></li><li id=fn:2><p><a href=https://lore.kernel.org/all/20220722185701.300449-1-lvivier@redhat.com/>[PATCH v7 00/14] qapi: net: add unix socket type support to netdev backend</a></p></li><li id=fn:3><p>In case the reader thinks I'm being unfair by expecting <tt>--help</tt> output to have more detail, consider that the QEMU documentation page <a href=https://qemu.readthedocs.io/en/v7.2.0/system/invocation.html>System Emulation » Invocation</a> is the most complete reference I can find for QEMU's <tt>-netdev</tt> flags, and it doesn't even <i>mention</i> the <tt>dgram</tt> or <tt>stream</tt> network backends.</p></li></ol></blog-footnotes></blog-article><blog-article posted=2022-08-23T06:09:54Z><h1 slot=title>Debugging Win32 binaries in Ghidra via Wine</h1><div slot=tableofcontents></div><div slot=summary><div style="float:right;padding:0 0 0 2em"><img src=https://john-millikin.com/by-sha256/e8ac6f818477049205c63a3e23b2396f1b42a31620b9ad43eb9f54268aa9c57a/08_memory-regions.png style=max-width:400px></div><p><a href=https://ghidra-sre.org/>Ghidra</a> is a cross-platform reverse-engineering and binary analysis tool, with recent versions including support for dynamic analysis. I want to try using it as a replacement for IDA Pro for reverse-engineering Win32 binaries, but hit bugs related to address space detection when running gdbserver with Wine (<a href=https://github.com/NationalSecurityAgency/ghidra/issues/4534>ghidra#4534</a>).</p><p>This post contains custom GDB commands that allow Ghidra to query the Linux process ID and memory maps of a Win32 target process running in 32-bit Wine on a 64-bit Linux host.</p></div><blog-section><h2 slot=title>Building a simple Win32 binary on Linux</h2><p>If you've already got a Win32 binary you're interested in analyzing, you can skip this step.</p><p>For the purposes of testing and writing blog posts, it's useful to have a simple "hello world" binary that doesn't have much fancy stuff going on. This is the code for a minimal Win32 console program:</p><blog-code syntax=c><pre>
#include <windows.h>
static const char message[] = "Hello, world!\n";
static const int message_len = sizeof(message) - 1; /* exclude the trailing NUL */
int __stdcall mainCRTStartup(void) {
    HANDLE stdout = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD bytes_written;
    WriteFile(stdout, message, message_len, &bytes_written, NULL);
    return 0;
}</pre></blog-code><p>To compile a <a href=https://en.wikipedia.org/wiki/Portable_Executable>PE</a> binary in Linux you can either use Wine to install the <a href=https://developer.microsoft.com/en-us/windows/downloads/windows-sdk/>Windows SDK</a>, or use a cross-compiler. The Windows SDK is a bit of a pain to install since it's distributed as an ISO image full of installer wizards, so I chose the second option. For cross-compilation I prefer to use Clang and LLD whenever possible since they're "native" cross-compilers, which means that (unlike GNU GCC/LD) their target platform can be selected at runtime.</p><blog-code syntax=commands><pre>
WINE="${HOME}/.opt/wine-7.15"
clang -target i386-pc-win32 -O2 -c \
# -isystem "${WINE}"/include/wine/windows \
# -isystem "${WINE}"/include/wine/msvcrt \
# hello-win32.c
ld.lld -flavor link \
# /out:hello-win32.exe \
# /nxcompat:no \
# /subsystem:console \
# /defaultlib:kernel32 \
# hello-win32.o</pre></blog-code><p>If you don't have a copy of <tt>kernel32.lib</tt> from the Windows SDK, a usable substitute can be generated from <a href=https://gitlab.winehq.org/wine/wine/-/blob/wine-7.15/dlls/kernel32/kernel32.spec><tt>kernel32.spec</tt></a> in Wine's source tree.</p><blog-code syntax=commands><pre>
WINE_SRC="${HOME}/src/third_party/winehq.org/wine-7.15"
"${WINE}"/bin/winebuild --def \
# -E "${WINE_SRC}"/dlls/kernel32/kernel32.spec \
# -o kernel32.def
llvm-dlltool -m i386 -k -d kernel32.def -l kernel32.lib</pre></blog-code><p>Double-check that the executable works:</p><blog-code syntax=commands><pre>
wine hello-win32.exe
# Hello, world!</pre></blog-code></blog-section><blog-section><h2 slot=title>Debugging with gdbserver.exe</h2><p>First, install both Linux and Windows builds of <a href=https://www.sourceware.org/gdb/>GDB</a>, configured with <tt>--target=i686-w64-mingw32</tt>. On Ubuntu an appropriate build of GDB can be installed with <tt>apt install gdb-mingw-w64 gdb-mingw-w64-target</tt>.</p><p>The <tt>gdbserver.exe</tt> process will run "inside" Wine, and use Windows debugging APIs to control the binary being debugged. It listens on a TCP socket implementing the <a href=https://sourceware.org/gdb/onlinedocs/gdb/Remote-Protocol.html>GDB remote serial protocol</a>.</p><blog-code syntax=commands><pre>
wine /usr/share/win32/gdbserver.exe localhost:10000 ./hello-win32.exe
# Listening on port 10000</pre></blog-code><p>The <tt>i686-w64-mingw32-gdb</tt> process runs in the Linux environment, and provides a REPL that can control the "remote" gdbserver. This process is necessary because Ghidra doesn't directly speak the GDB serial protocol; it controls GDB through the text UI. Before starting up Ghidra, verify that the GDB bits are working:</p><blog-code syntax=commands><pre>
/usr/bin/i686-w64-mingw32-gdb</pre></blog-code><blog-code syntax=commands prompt=(gdb)><pre>
file ~/ghidra/hello-win32.exe
# Reading symbols from ~/ghidra/hello-win32.exe...
# (No debugging symbols found in ~/ghidra/hello-win32.exe)
target extended-remote :10000
# Remote debugging using :10000
# Reading C:/windows/system32/ntdll.dll from remote target...
# warning: File transfers from remote targets can be slow. Use "set sysroot" to access files locally instead.
# Reading C:/windows/system32/kernel32.dll from remote target...
# Reading C:/windows/system32/kernelbase.dll from remote target...
# 0x7bc7eb01 in ?? ()</pre></blog-code></blog-section><blog-section><h2 slot=title>Connecting Ghidra to GDB</h2><p>Create the Ghidra project, import the Win32 binary to be analyzed, and enter the Debugger tool. When connecting to GDB you can use either IN-VM or GADP, but GADP is probably better since Ghidra's debugger can get wedged and it's nice to be able to forcefully disconnect by killing the GADP agent.</p><p><a href=https://john-millikin.com/by-sha256/04da0365d1b89e236aea9700b684a12e169444a81b2d7fd53193db40a98132fb/01_start.png><img src=https://john-millikin.com/by-sha256/04da0365d1b89e236aea9700b684a12e169444a81b2d7fd53193db40a98132fb/01_start.png style=max-width:500px></a></p><p><a href=https://john-millikin.com/by-sha256/3e655c75a60cf9bae835893719cfc7167917a9dc1bcf379e0510cc7474192e6e/02_connect-gdb.png><img src=https://john-millikin.com/by-sha256/3e655c75a60cf9bae835893719cfc7167917a9dc1bcf379e0510cc7474192e6e/02_connect-gdb.png style=max-width:500px></a></p><p><a href=https://john-millikin.com/by-sha256/f4292df3e09f457fc63f4ea4a065bd7b2b5578f5b1c888d8d6bc01b127eee51e/03_connected.png><img src=https://john-millikin.com/by-sha256/f4292df3e09f457fc63f4ea4a065bd7b2b5578f5b1c888d8d6bc01b127eee51e/03_connected.png style=max-width:500px></a></p><p>Here's where things start to go wrong. After creating the trace record, Ghidra will start throwing out error popups about trying to access an invalid address space. GitHub issue <a href=https://github.com/NationalSecurityAgency/ghidra/issues/4534>ghidra#4534</a> has some of the nitty-gritty details on what's going on, but in summary Ghidra depends on the GDB command <tt>info proc mappings</tt> to figure out what it can peek at, and GDB doesn't implement that command for Windows targets.</p><p><a href=https://john-millikin.com/by-sha256/b297c63f4d980e7bafada713744f0010855de8b5171bb9f1113001c9a6b9134e/04_record.png><img src=https://john-millikin.com/by-sha256/b297c63f4d980e7bafada713744f0010855de8b5171bb9f1113001c9a6b9134e/04_record.png style=max-width:500px></a></p><p><a href=https://john-millikin.com/by-sha256/6506362a576b78950cdddf6d8ae680d8156408446741b97713caabed5c01725b/05_error-popup.png><img src=https://john-millikin.com/by-sha256/6506362a576b78950cdddf6d8ae680d8156408446741b97713caabed5c01725b/05_error-popup.png style=max-width:500px></a></p></blog-section><blog-section><h2 slot=title>Shimming the GDB memory map</h2><p>There are two problems we're facing here:<ul><li>First, we need to get access to the <tt>/proc/{pid}/maps</tt> file corresponding to the target process, parse it, and render output that matches what Ghidra expects from GDB.</li><li>Second, the gdbserver is running inside Wine and therefore uses Windows process IDs. There's no way to query the Linux process ID for a Windows process; such an API obviously doesn't exist in Windows, and Wine developers have declined to implement it as an extension.</li></ul></p><p>The memory map parsing/formatting sounds tricky but is actually pretty straightforward, because the format of <tt>/proc/{pid}/maps</tt> is <i>almost</i> the same as what GDB's <tt>info proc mappings</tt> produces, and Ghidra doesn't care about the differences. The <a href=https://sourceware.org/gdb/onlinedocs/gdb/Python-API.html>GDB Python API</a> can be used to define a new <tt>remote-proc-mappings</tt> command, which reads <tt>/proc/{pid}/maps</tt> for any process accessible to the remote gdbserver.</p><blog-code syntax=python><pre>
import contextlib
import os
import threading
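# Why the pipe and thread (my note): GDB's "remote get" command can only
# write to a local file path, so the code below hands it /dev/fd/N for the
# write end of a pipe, and drains the read end on a thread so the transfer
# can't block on a full pipe buffer.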
@contextlib.contextmanager
def pipe_fds():
    r_fd, w_fd = os.pipe()
    r_file = os.fdopen(r_fd, mode="rb")
    w_file = os.fdopen(w_fd, mode="wb")
    try:
        yield (r_file, w_file)
    finally:
        r_file.close()
        w_file.close()
class ReadThread(threading.Thread):
    def __init__(self, reader):
        super(ReadThread, self).__init__()
        self.__r = reader
        self.bytes = None
    def run(self):
        self.bytes = bytearray(self.__r.read())
def reformat_line(raw_line):
    split = raw_line.decode("utf-8").split(None, 5)
    # split[0] range
    # split[1] mode
    # split[2] offset
    # split[3] major_minor
    # split[4] inode
    # split[5] object name
    start_addr_s, end_addr_s = split[0].split("-")
    start_addr = int(start_addr_s, 16)
    end_addr = int(end_addr_s, 16)
    if len(split) == 6:
        objfile = split[5]
    else:
        objfile = ""
    return "0x{:X} 0x{:X} 0x{:X} 0x{:X} {} {}\n".format(
        start_addr, end_addr,
        end_addr - start_addr,
        int(split[2], 16),
        split[1],
        objfile,
    )
class RemoteProcMappings(gdb.Command):
    def __init__(self):
        super(RemoteProcMappings, self).__init__("remote-proc-mappings", gdb.COMMAND_STATUS)
    def invoke(self, arg, from_tty):
        argv = gdb.string_to_argv(arg)
        if len(argv) != 1:
            gdb.write("usage: remote-proc-mappings PID\n", gdb.STDERR)
            return
        remote_pid = int(argv[0])
        with pipe_fds() as (r_file, w_file):
            read_thread = ReadThread(reader = r_file)
            read_thread.start()
            maps_path = "/proc/{}/maps".format(remote_pid)
            pipe_writer_path = "/dev/fd/{}".format(w_file.fileno())
            gdb.execute("remote get {} {}".format(maps_path, pipe_writer_path))
            w_file.close()
            read_thread.join()
            raw_bytes = read_thread.bytes
        for raw_line in raw_bytes.split(b"\n"):
            if raw_line:
                gdb.write(reformat_line(raw_line))
RemoteProcMappings()</pre></blog-code><p>Next we need the Linux PID. Luckily(?) Wine allows Win32 binaries to directly invoke Linux syscalls via the <a href=/unix-syscalls/#linux-i386-interrupt><tt>INT 0x80</tt></a> instruction, so a straightforward approach is to inject a <tt>linux_getpid()</tt> function into the target process's address space and then use GDB's <tt>call</tt> command to run it.</p><p>Many Windows binaries have executable stacks (<tt>/nxcompat:no</tt>), which makes this super easy:</p><blog-code syntax=shell><pre>
define getpid-linux-i386
  # MOV eax,20 [SYS_getpid]
  # INT 0x80
  # RET
  set $linux_getpid = {int (void)}($esp-7)
  set {unsigned char[8]}($linux_getpid) = {\
    0xB8, 0x14, 0x00, 0x00, 0x00, \
    0xCD, 0x80, \
    0xC3 \
  }
  output $linux_getpid()
  echo \n
end</pre></blog-code><p>If the above command causes a segfault then the binary was probably compiled with <tt>/nxcompat</tt>, which places the stack in a non-executable mapping. Luckily(?) again, Windows processes map their <tt>.text</tt> segment to a fixed offset (by default <tt>0x401000</tt>), so you can use Ghidra to locate some function padding or an unused error branch or whatever and write the getpid stub there:</p><blog-code syntax=shell><pre>
define getpid-linux-i386
  # MOV eax,20 [SYS_getpid]
  # INT 0x80
  # RET
  set $linux_getpid = {int (void)}0x401020
  set {unsigned char[8]}($linux_getpid) = {\
    0xB8, 0x14, 0x00, 0x00, 0x00, \
    0xCD, 0x80, \
    0xC3 \
  }
  output $linux_getpid()
  echo \n
end</pre></blog-code><p>With these two custom commands defined, it's now possible to override <tt>info proc mappings</tt> to (1) find the Linux pid, and (2) report its memory mappings to Ghidra.</p><blog-code syntax=shell><pre>
source ~/ghidra/getpid-linux-i386.gdb
source ~/ghidra/remote-proc-mappings.py
define info proc mappings
python
remote_pid = gdb.execute("getpid-linux-i386", to_string=True).strip()
gdb.execute("remote-proc-mappings {}".format(remote_pid))
end
end</pre></blog-code><p>Put that into a <tt>wine-win32.gdb</tt> file and source it from Ghidra's GDB interpreter panel. Note that to make Ghidra happy the <tt>info proc mappings</tt> command must be overridden before connecting to the remote gdbserver.</p><p>Since they're regular GDB commands, they can also be used from the command line:</p><blog-code syntax=commands prompt=(gdb)><pre>
file ~/ghidra/hello-win32.exe
# Reading symbols from ~/ghidra/hello-win32.exe...
# (No debugging symbols found in ~/ghidra/hello-win32.exe)
source ~/ghidra/wine-win32.gdb
target extended-remote :10000
# Remote debugging using :10000
# Reading C:/windows/system32/ntdll.dll from remote target...
# warning: File transfers from remote targets can be slow. Use "set sysroot" to access files locally instead.
# Reading C:/windows/system32/kernel32.dll from remote target...
# Reading C:/windows/system32/kernelbase.dll from remote target...
# 0x7bc7eb01 in ?? ()
getpid-linux-i386
# 1872324</pre></blog-code><p>When loaded into Ghidra's GDB session, the trace recording works and the dynamic analysis functionality (Dynamic panel, Regions panel, etc.) works as expected.</p><p><a href=https://john-millikin.com/by-sha256/c85f6e19edfbd2f162f2bb97008663619a26fc7e3f315c80bfe2fa9e59522db1/06_record-with-getpid.png><img src=https://john-millikin.com/by-sha256/c85f6e19edfbd2f162f2bb97008663619a26fc7e3f315c80bfe2fa9e59522db1/06_record-with-getpid.png style=max-width:500px></a></p><p>Ghidra is able to disassemble code injected at runtime. Here, the Dynamic panel shows our <tt>linux_getpid</tt> code injected at 0x401020.</p><p><a href=https://john-millikin.com/by-sha256/5a31c22dceb78e7175acd1890b7f3d8adc281a2e2392ed0e6c3f83a63ad1061b/07_dynamic-view.png><img src=https://john-millikin.com/by-sha256/5a31c22dceb78e7175acd1890b7f3d8adc281a2e2392ed0e6c3f83a63ad1061b/07_dynamic-view.png style=max-width:500px></a></p></blog-section></blog-article><blog-article posted=2022-07-09T06:51:22Z><h1 slot=title>Running BeOS 5 in QEMU (i386)</h1><div slot=tableofcontents></div><div slot=summary><p><a href=https://en.wikipedia.org/wiki/BeOS>BeOS</a> is an operating system from the '90s, notable for its prescient technical decisions and abject business failure<blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref>. It embraced multi-threading at a time when 100 MHz CPUs powered top-shelf workstations, and featured metadata-backed virtual folders ten years before their arrival in mainstream OSes.</p></div><blog-section><h2 slot=title>Installation media</h2><p>The installation CD-ROM for BeOS Pro Edition 5.0 is available on the Internet Archive. It's been uploaded there at least twice, but the content is identical, so either one works. You'll need both the <tt>.bin</tt> and <tt>.cue</tt> files.</p><ul><li><a href=https://archive.org/details/BeosVersion5.0.3IntelAndPpc>BeOS Version 5.0.3 Intel and PPC</a> (uploaded 2017-08-23)</li><li><a href=https://archive.org/details/beos-5.0.3-professional-gobe>BeOS Professional Edition 5.0.3</a> (uploaded 2021-08-18)</li></ul><blog-code syntax=commands><pre>
shasum -a 256 *
# 1889fd6cf5af4259b01c9d1925e62f664effdf9dd88f924dc9b4da41ce1f0106 BeOS_Tools.bin
# 6f4fd9fbf7dff01d27391bee3b8bb27def7ed2fcd978f4b698c220b69eb89af9 BeOS_Tools.cue
# 1889fd6cf5af4259b01c9d1925e62f664effdf9dd88f924dc9b4da41ce1f0106 beos-5.0.3-professional-gobe.bin
# a57d9552cdadbbdbe6f608e8dbe9ac2bec2a010da1ad801fc0176e4d66bb234c beos-5.0.3-professional-gobe.cue</pre></blog-code><p>The BeOS installation media has an unusual layout with three separate filesystems, which must be split to be usable by QEMU<blog-footnote-ref>[<a href=#fn:4>4</a>]</blog-footnote-ref>. Use <a href=http://he.fi/bchunk/>bchunk</a> to extract the bootable ISO 9660 filesystem into a <tt>.iso</tt> file.</p><blog-code syntax=commands><pre>
curl -L -O https://raw.githubusercontent.com/hessu/bchunk/release/1.2.2/bchunk.c
shasum -a 256 bchunk.c
# 34ce2e8c23b41a9f14a7e4f50e14996f2754c27237ba431ede1caaee39e759a6 bchunk.c
gcc -o bchunk bchunk.c
./bchunk BeOS_Tools.bin BeOS_Tools.cue BeOS_Tools.iso
# binchunker for Unix, version 1.2.2 by Heikki Hannikainen <hessu@hes.iki.fi>
# Created with the kind help of Bob Marietta <marietrg@SLU.EDU>,
# partly based on his Pascal (Delphi) implementation.
# Support for MODE2/2352 ISO tracks thanks to input from
# Godmar Back <gback@cs.utah.edu>, Colas Nahaboo <Colas@Nahaboo.com>
# and Matthew Green <mrg@eterna.com.au>.
# Released under the GNU GPL, version 2 or later (at your option).
#
# Reading the CUE file:
#
# Track 1: MODE1/2352 01 00:00:00
# Track 2: MODE1/2352 01 10:48:58
# Track 3: MODE1/2352 01 46:07:03
#
# Writing tracks:
#
# 1: BeOS_Tools.iso01.iso 95/95 MB [********************] 100 %
# 2: BeOS_Tools.iso02.iso 310/310 MB [********************] 100 %
# 3: BeOS_Tools.iso03.iso 236/236 MB [********************] 100 %
shasum -a 256 *.iso
# 5c193d1855ad542f9a40a092a32bf2c6072e273a51d781dbf925a9a02e66d759 BeOS_Tools.iso01.iso
# 0031b4eb35a8ebfcf578d197c2372dfda0f748ef260f44dba2dd93740da35626 BeOS_Tools.iso02.iso
# 26b771b4f22f01b3311b86c82d4a7c2f6d84973b2ed506cf8d65738732f21708 BeOS_Tools.iso03.iso</pre></blog-code><p>Of the split files, <tt>iso01</tt> is bootable. The other two are BeFS filesystems containing x86 and PowerPC installation data.</p><blog-code syntax=commands><pre>
ls -lh *.iso
# -rw-rw-r-- 1 john john 96M Oct 8 14:56 BeOS_Tools.iso01.iso
# -rw-rw-r-- 1 john john 311M Oct 8 14:56 BeOS_Tools.iso02.iso
# -rw-rw-r-- 1 john john 236M Oct 8 14:56 BeOS_Tools.iso03.iso
sudo mount -o loop BeOS_Tools.iso01.iso BeOS_Tools_01
sudo mount -o loop BeOS_Tools.iso02.iso BeOS_Tools_02
sudo mount -o loop BeOS_Tools.iso03.iso BeOS_Tools_03
ls BeOS_Tools_01
# AUTORUN.INF boot.catalog floppy.img GNU Gobe Macintosh Personal PMAGIC
ls BeOS_Tools_02
# apps beos demos home _packages_ preferences var
file BeOS_Tools_02/beos/apps/Terminal
# BeOS_Tools/beos/apps/Terminal: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), dynamically linked, not stripped
file BeOS_Tools_03/beos/apps/Terminal
# BeOS_Tools/beos/apps/Terminal: header for PowerPC PEF executable</pre></blog-code><p>Note how the x86 edition of BeOS uses a common executable format (<a href=https://en.wikipedia.org/wiki/Executable_and_Linkable_Format>ELF</a>), whereas the PowerPC edition uses <a href=https://en.wikipedia.org/wiki/Preferred_Executable_Format>PEF</a> from early Mac OS. It's an unusual decision.</p></blog-section><blog-section><h2 slot=title>Booting the installer</h2><p>We're now ready to start up QEMU and enter the BeOS graphical installer.</p><blog-code syntax=commands><pre>
qemu-system-i386 -version
# QEMU emulator version 7.0.0
# Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers
qemu-img create -f qcow2 beos-5.img 1G
# Formatting 'beos-5.img', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=1073741824 lazy_refcounts=off refcount_bits=16
qemu-system-i386 -m 512M \
-drive media=cdrom,file=BeOS_Tools.iso01.iso \
-drive media=cdrom,file=BeOS_Tools.iso02.iso \
-drive file=beos-5.img</pre></blog-code><div><img src=https://john-millikin.com/by-sha256/028754df4b20845987086c7d384b1097519b3f06b06a1e85ef1eca2e54f86553/Screen%20Shot%202022-07-08%20at%2013.12.07.png style=max-width:500px>
<img src=https://john-millikin.com/by-sha256/87c0825ef5524659b01a3e1e5f525306c27b456b80f6255e2dd29c2c88362954/Screen%20Shot%202022-07-08%20at%2013.12.13.png style=max-width:500px></div><p>At this point the screen will go blank and no further progress happens. To proceed, we must use the boot menu to disable BIOS calls.</p><div><img src=https://john-millikin.com/by-sha256/b74c5bb01a023d401997c92fb440a4b21433a2412ff37108c131ec20c712a978/Screen%20Shot%202022-07-08%20at%2013.16.56.png style=max-width:500px>
<img src=https://john-millikin.com/by-sha256/de8f94fc3d6582b7af2e1aab3ff1636af91b83d4511a5810eb27b17ad64cb0ad/Screen%20Shot%202022-07-08%20at%2013.17.02.png style=max-width:500px></div><p>The installer is now able to boot, and it would actually be able to fully install from here. However, BeOS doesn't recognize the QEMU graphics device and therefore defaults to low-resolution greyscale graphics.</p><div><img src=https://john-millikin.com/by-sha256/172ae576e3a6d0185c34249499fe15330a7b28f47251e5704afe7f97fc7a2a17/Screen%20Shot%202022-07-08%20at%2013.17.26.png style=max-width:500px></div><p>Going back to the boot screen, the default video mode can be manually configured to something more reasonable. I picked <tt>1024x768x16</tt> to get color and a bit more usable screen area.</p><div><img src=https://john-millikin.com/by-sha256/291d007b26248b01b23fee6dd19b1e5f5475bc058f504fb872efd441f9f49ef9/Screen%20Shot%202022-07-08%20at%2013.18.31.png style=max-width:500px>
<img src=https://john-millikin.com/by-sha256/11751b7ef439f46056d6421807190a510a92db95b30022726a88f9118b365032/Screen%20Shot%202022-07-08%20at%2013.19.21.png style=max-width:500px></div><p>BeOS is now ready to install.</p><div><img src=https://john-millikin.com/by-sha256/7aa28aaba778356c0e7b6e8c7a24972541d0fd25cff786e12d2d3273c7ea039a/Screen%20Shot%202022-07-08%20at%2013.20.56.png style=max-width:768px></div></blog-section><blog-section><h2 slot=title>Installing BeOS</h2><p>The installation process for BeOS 5 is mostly unremarkable to modern eyes, but remember this thing was a contemporary of Windows 98. The typical installer UI back then used text-mode VGA, and then here comes BeOS with full graphics (the windows repaint on drag!) straight from the installation media.</p><p>Also, I say <i>mostly</i> unremarkable, because there's no modern OS in the world that could install a complete desktop environment (including web browser and development tools) in 265 MB. Chrome is larger than that just by itself<blog-footnote-ref>[<a href=#fn:5>5</a>]</blog-footnote-ref>.</p><div><img src=https://john-millikin.com/by-sha256/4ad6168d5720361eaa75f135b299cd84a269e0d1e138dd39e7a121c2e7716f91/Screen%20Shot%202022-07-08%20at%2013.22.33.png style=max-width:500px></div><div><img src=https://john-millikin.com/by-sha256/4fe870f08e6b43690dde6a367b27e60184e66b5a57665ddcf9f54cfebd868d7f/Screen%20Shot%202022-07-08%20at%2013.23.59.png style=max-width:500px>
<img src=https://john-millikin.com/by-sha256/d94a0f5ebfa42616a0e4c2d3e3f802133459cf3028f755e5f32308c40768f816/Screen%20Shot%202022-07-08%20at%2013.25.36.png style=max-width:500px></div><div><img src=https://john-millikin.com/by-sha256/be7df93c3b118788dbf75ea15cbc2811d2a86962ec915e68c707585dd75eca1c/Screen%20Shot%202022-07-08%20at%2013.32.19.png style=max-width:500px></div></blog-section><blog-section><h2 slot=title>Post-install configuration</h2><p>At this point, BeOS has been installed but still has some emulation issues. It will kernel panic on startup unless BIOS calls are disabled in the boot menu, and the graphics will default to 640x480 greyscale. Also, it doesn't have any network connectivity.</p><p>For networking, the list of supported NICs is available at <a href=https://web.archive.org/web/20010331170926/http://www.be.com/support/guides/beosreadylist_intel.html#network>BeOS Ready List - Intel » BeOS Ready Network Cards and Connections</a>. QEMU's default NIC emulates an Intel e1000, but it can also emulate the NE2000 family supported by BeOS.</p><blog-code syntax=commands><pre>
qemu-system-i386 -m 512M -drive file=beos-5.img -nic user,model=ne2k_pci</pre></blog-code><p>Once booted, go into the BeOS network preferences and enable DHCP. Click "restart networking" to let the changes take effect. The QEMU user-mode networking stack has built-in DHCP and DNS servers, so it doesn't matter how the host system is configured.</p><div><img src=https://john-millikin.com/by-sha256/269351bf8a51a8610f097306298af25ae77a7cc5ce4d1167e944a200739df5e2/Screen%20Shot%202022-07-08%20at%2014.04.03.png style=max-width:768px></div><p>Next up is fixing the default graphics and disabling BIOS calls. Open a terminal into <tt>/boot/home/config/settings/kernel/drivers</tt>. This directory configures the BeOS boot loader; the <tt>sample/</tt> directory contains example config files.</p><p>Using <tt>sample/kernel</tt> and <tt>sample/vesa</tt> as a guide, create two files in <tt>[...]/kernel/drivers</tt>:</p><ul><li>File <tt>kernel</tt> should contain <tt>bios_calls disabled</tt></li><li>File <tt>vesa</tt> should contain <tt>mode 1024 768 16</tt> (or whatever resolution you want)</li></ul><div><img src=https://john-millikin.com/by-sha256/3cddf2bbc2bd64c581b1a49d37bae78f22cfb27d7b3411285301c54739101a93/Screen%20Shot%202022-07-08%20at%2013.59.14.png style=max-width:768px></div><p>BeOS 5.0.3 comes with VIM 4.5, so features like <tt>:split</tt> are available.</p><div><img src=https://john-millikin.com/by-sha256/87a38401ad6456140b65c67549f282e077c80d3c7068b3edf22e81aef9ba00a4/Screen%20Shot%202022-07-08%20at%2014.00.20.png style=max-width:768px></div><p>Booting now works without any special boot menu selection, and the built-in web browser can be used to view any page that allows plaintext<blog-footnote-ref>[<a href=#fn:6>6</a>]</blog-footnote-ref> HTTP clients.</p><div><img src=https://john-millikin.com/by-sha256/37d10e31a25ee7a2834cb03ee320977601cc2d0ce877e63a8c7ee1e8ea21fff4/Screen%20Shot%202022-07-08%20at%2014.14.08.png style=max-width:768px></div></blog-section><blog-footnotes><hr><ol><li id=fn:1><p>Unfortunately for BeOS, technical innovation alone was not enough to pay the bills. After rejecting a $200 million buyout offer from Apple<blog-footnote-ref>[<a href=#fn:2>2</a>]</blog-footnote-ref>, Be Inc struggled to compete with hyper-competitive mid-90s Microsoft. It was swept away in the <a href=https://en.wikipedia.org/wiki/Dot-com_bubble#Bursting_of_the_bubble>dot-com crash</a>, and in 2001 the remaining assets were sold to Palm for $11 million<blog-footnote-ref>[<a href=#fn:3>3</a>]</blog-footnote-ref>.</p></li><li id=fn:2><p><a href=https://www.mercurynews.com/2014/08/29/1996-jobs-is-back-at-apple/>The Mercury News: Jobs is back at Apple</a></p></li><li id=fn:3><p><a href=https://www.latimes.com/archives/la-xpm-2002-jan-01-fi-be1-story.html>Los Angeles Times: Be Inc. Founder Leaving as Firm Nears Closure</a></p></li><li id=fn:4><p>Invoking QEMU with <tt>-drive media=cdrom,file=BeOS_Tools.bin</tt> will fail to locate a boot sector.</p><div><img src=https://john-millikin.com/by-sha256/ba9351ca9739e4978175641820506d9919c024fdb5c3a1dc4401aff4c40ec70a/Screen%20Shot%202022-07-08%20at%2013.01.11.png style=max-width:768px></div></li><li id=fn:5><p>The Linux build of Chrome is slightly larger than BeOS 5, and it doesn't even include an OpenGL teapot demo.</p><blog-code syntax=commands><pre>du -s --si /opt/google/chrome/
# 278M /opt/google/chrome/</pre></blog-code></li><li id=fn:6><p>Minimum requirements for HTTPS have evolved somewhat in the 20 years since the NetPositive browser last saw development, so the modern web can only be accessed via a MITM proxy.</p></li></ol></blog-footnotes></blog-article>2022-07-09T06:51:22ZGmail accepts forged YouTube emails2022-06-01T01:36:59Zurn:uuid:3e3f4197-df0a-45b9-b70c-e34dde015c8d<blog-article posted=2022-06-01T01:36:59Z><h1 slot=title>Gmail accepts forged YouTube emails</h1><div slot=tableofcontents></div><p>This morning I woke up to an official-looking email from YouTube in my inbox, addressed to an address that isn't mine.</p><p style=display:flex><a href=https://john-millikin.com/by-sha256/b83b6b7be5a258d456e0430f509719c67a6e93ec71871ab43f05c77672f46fef/screenshot.png><img src=https://john-millikin.com/by-sha256/f7a53000766fa634632c395dfc3642ac911432cfb38d00ea9862d16330f994cc/screenshot-small.jpg style="max-width:500px;border:2px solid blue;padding:.5em;align-self:flex-start;margin:1em"></a>
<a href=https://john-millikin.com/by-sha256/335eb94570a5883a1a008f93a2e541fcdac9646b9e0f2cfe660e69357488c6d3/dkim-dmarc-pass.png><img src=https://john-millikin.com/by-sha256/335eb94570a5883a1a008f93a2e541fcdac9646b9e0f2cfe660e69357488c6d3/dkim-dmarc-pass.png style="max-width:500px;border:2px solid blue;padding:.5em;align-self:flex-start;margin:1em"></a></p><p>Long ago this sort of thing would happen if someone sent an email with forged headers<blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref> (e.g. to fish for logins), but the advent of <a href=https://en.wikipedia.org/wiki/DomainKeys_Identified_Mail>DKIM</a> and <a href=https://en.wikipedia.org/wiki/DMARC>DMARC</a> has relegated header forging to ancient history. I was greatly surprised to see that the forged email had passed Gmail's DKIM/DMARC checks.</p><p>A selection of the email's headers (<a href=https://john-millikin.com/by-sha256/1b1955f1efca511c54b29a0aaed74e3de26ac3b0a5ff974337e63a2d32f21a94/original-email.txt>full email</a>) shows that it was accepted as coming from <tt>youtube.com</tt>, despite being received from <tt>robtoledoyour.com</tt>. I'm not familiar enough with the details of email authentication to say <i>why</i> this passed, but it seems pretty clear that something has gone wrong.</p><blog-code><pre>
Delivered-To: jmillikin@gmail.com
Received: by 2002:a19:6d05:0:0:0:0:0 with SMTP id i5csp3611067lfc;
Tue, 31 May 2022 10:35:25 -0700 (PDT)
From: YouTube <no-reply@youtube.com>
To: alltimecaptaincool2019@gmail.com
Date: Fri, 26 Nov 2021 22:16:25 -0800
[...]
ARC-Authentication-Results: i=2; mx.google.com;
dkim=pass header.i=@robtoledoyour.com header.s=prime header.b=On+Vo8dl;
dkim=pass header.i=@youtube.com header.s=20210112 header.b=xGMHx3cn;
arc=pass (i=1 spf=pass spfdomain=scoutcamp.bounces.google.com dkim=pass dkdomain=youtube.com dmarc=pass fromdomain=youtube.com);
spf=pass (google.com: domain of postalerts@robtoledoyour.com designates 2a01:7c8:bb01:51a::7 as permitted sender) smtp.mailfrom=postalerts@robtoledoyour.com;
dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=youtube.com
Return-Path: <postalerts@robtoledoyour.com>
Received: from 7n.robtoledoyour.com (7n.robtoledoyour.com. [2a01:7c8:bb01:51a::7])
</pre></blog-code><p>Whoever is behind this has been active since at least August 2021 – I found references to that <tt>from:</tt> address on Twitter and Reddit:<ul><li>[2021-08-19] <a href=https://old.reddit.com/r/indiasocial/comments/p7jby3/is_this_some_new_kind_of_spam_or_a_glitch/>https://old.reddit.com/r/indiasocial/comments/p7jby3/is_this_some_new_kind_of_spam_or_a_glitch/</a> (<a href=https://archive.ph/MtU14>archive</a>)</li><li>[2022-03-17] <a href=https://twitter.com/CTF/status/1504458796206374918>https://twitter.com/CTF/status/1504458796206374918</a> (<a href=https://archive.ph/diofK>archive</a>)</li><li>[2022-03-18] <a href=https://twitter.com/jurasick/status/1504812649967767556>https://twitter.com/jurasick/status/1504812649967767556</a> (<a href=https://archive.ph/m9jG4>archive</a>)</li><li>[2022-03-18] <a href=https://twitter.com/JayChandran_/status/1504819850044092419>https://twitter.com/JayChandran_/status/1504819850044092419</a> (<a href=https://archive.ph/xSESm>archive</a>)</li></ul></p><p>The <tt>robtoledoyour.com</tt> domain is registered to an address in India. I find this notable, given that the first report of an <tt>alltimecaptaincool2019@gmail.com</tt> email impersonated Amazon.in and was posted in Reddit's /r/indiasocial forum. Also, the YouTube-style email mentions India-specific regulation. Finally, the domain was registered one month before the report on Reddit.</p><blog-section><h2 slot=title>Snapshots of WHOIS and DNS</h2><blog-code><pre>
$ whois robtoledoyour.com
% IANA WHOIS server
% for more information on IANA, visit http://www.iana.org
% This query returned 1 object
[...]
Domain Name: ROBTOLEDOYOUR.COM
Registry Domain ID: 2626055284_DOMAIN_COM-VRSN
Registrar WHOIS Server: whois.name.com
Registrar URL: http://www.name.com
Updated Date: 2021-07-12T06:25:22Z
Creation Date: 2021-07-12T06:25:22Z
Registrar Registration Expiration Date: 2022-07-12T06:25:22Z
Registrar: Name.com, Inc.
Registrar IANA ID: 625
Reseller:
Domain Status: clientTransferProhibited https://www.icann.org/epp#clientTransferProhibited
Registry Registrant ID: Not Available From Registry
Registrant Name: Natarajan K kannan
Registrant Organization:
Registrant Street: 79-1/43-1,Matha sannathi street
Registrant City: Tirunelveli
Registrant State/Province: TN
Registrant Postal Code: 627006
Registrant Country: IN
Registrant Phone: Non-Public Data
Registrant Email: https://www.name.com/contact-domain-whois/robtoledoyour.com/registrant
Registry Admin ID: Not Available From Registry
Admin Name: Natarajan K kannan
Admin Organization:
Admin Street: 79-1/43-1,Matha sannathi street
Admin City: Tirunelveli
Admin State/Province: TN
Admin Postal Code: 627006
Admin Country: IN
Admin Phone: Non-Public Data
Admin Email: https://www.name.com/contact-domain-whois/robtoledoyour.com/admin
Registry Tech ID: Not Available From Registry
Tech Name: Natarajan K kannan
Tech Organization:
Tech Street: 79-1/43-1,Matha sannathi street
Tech City: Tirunelveli
Tech State/Province: TN
Tech Postal Code: 627006
Tech Country: IN
Tech Phone: Non-Public Data
Tech Email: https://www.name.com/contact-domain-whois/robtoledoyour.com/tech
Name Server: ns1dns.name.com
Name Server: ns2fwz.name.com
Name Server: ns3bfm.name.com
Name Server: ns4clq.name.com
DNSSEC: unSigned
Registrar Abuse Contact Email: abuse@name.com
Registrar Abuse Contact Phone: +1.7203101849
URL of the ICANN WHOIS Data Problem Reporting System: http://wdprs.internic.net/
>>> Last update of WHOIS database: 2022-05-31T22:52:19Z <<<
</pre></blog-code><p></p><blog-code><pre>
$ dig robtoledoyour.com MX
[...]
;; ANSWER SECTION:
robtoledoyour.com. 300 IN MX 10 mail.redrool.com.
;; Query time: 134 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Wed Jun 01 08:18:51 JST 2022
;; MSG SIZE rcvd: 75
</pre></blog-code><p>The MX domain <tt>mail.redrool.com</tt> is registered by NameCheap, doesn't have public WHOIS data, and was registered in 2013. If I had to speculate, I'd say this domain is unrelated and is merely being taken advantage of as an <a href=https://en.wikipedia.org/wiki/Open_mail_relay>open relay</a>.</p><blog-code><pre>
$ whois redrool.com
% IANA WHOIS server
% for more information on IANA, visit http://www.iana.org
% This query returned 1 object
[...]
Domain name: redrool.com
Registry Domain ID: 1827884879_DOMAIN_COM-VRSN
Registrar WHOIS Server: whois.namecheap.com
Registrar URL: http://www.namecheap.com
Updated Date: 2021-07-29T08:50:59.21Z
Creation Date: 2013-09-17T10:28:13.00Z
Registrar Registration Expiration Date: 2022-09-17T10:28:13.00Z
Registrar: NAMECHEAP INC
Registrar IANA ID: 1068
Registrar Abuse Contact Email: abuse@namecheap.com
Registrar Abuse Contact Phone: +1.9854014545
Reseller: NAMECHEAP INC
Domain Status: clientTransferProhibited https://icann.org/epp#clientTransferProhibited
Registry Registrant ID:
Registrant Name: Redacted for Privacy
Registrant Organization: Privacy service provided by Withheld for Privacy ehf
Registrant Street: Kalkofnsvegur 2
Registrant City: Reykjavik
Registrant State/Province: Capital Region
Registrant Postal Code: 101
Registrant Country: IS
Registrant Phone: +354.4212434
Registrant Phone Ext:
Registrant Fax:
Registrant Fax Ext:
Registrant Email: a95454e67a0c42f988e530f0aeaa91d5.protect@withheldforprivacy.com
Registry Admin ID:
Admin Name: Redacted for Privacy
Admin Organization: Privacy service provided by Withheld for Privacy ehf
Admin Street: Kalkofnsvegur 2
Admin City: Reykjavik
Admin State/Province: Capital Region
Admin Postal Code: 101
Admin Country: IS
Admin Phone: +354.4212434
Admin Phone Ext:
Admin Fax:
Admin Fax Ext:
Admin Email: a95454e67a0c42f988e530f0aeaa91d5.protect@withheldforprivacy.com
Registry Tech ID:
Tech Name: Redacted for Privacy
Tech Organization: Privacy service provided by Withheld for Privacy ehf
Tech Street: Kalkofnsvegur 2
Tech City: Reykjavik
Tech State/Province: Capital Region
Tech Postal Code: 101
Tech Country: IS
Tech Phone: +354.4212434
Tech Phone Ext:
Tech Fax:
Tech Fax Ext:
Tech Email: a95454e67a0c42f988e530f0aeaa91d5.protect@withheldforprivacy.com
Name Server: ara.ns.cloudflare.com
Name Server: george.ns.cloudflare.com
DNSSEC: unsigned
URL of the ICANN WHOIS Data Problem Reporting System: http://wdprs.internic.net/
>>> Last update of WHOIS database: 2022-05-31T18:17:35.20Z <<<
</pre></blog-code></blog-section><blog-footnotes><hr><ol><li id=fn:1><p>Email was designed without any sort of security or authentication. I remember reading an IRC story, now lost, in which a student emails their professor from deadguy@yourhouse with the message "Help! I'm dead and I'm in your house!".</p></li></ol></blog-footnotes></blog-article>2022-06-01T01:36:59ZCompacting Lunr search indices2022-05-27T10:06:02Zurn:uuid:402b1c16-0087-4280-90c2-29fc17b067fc<style type=text/css>li{margin:.5em 0}p,li{line-height:1.5}</style><blog-article posted=2022-05-27T10:06:02Z><h1 slot=title>Compacting Lunr search indices</h1><div slot=tableofcontents></div><p><a href=https://lunrjs.com/>Lunr</a> is a small JavaScript library for full-text search, which I recently used to implement client-side search for this site. The user experience of client-side search depends in part on how large the search index is, and Lunr's default JSON encoding is more verbose than it needs to be. This page describes a more compact encoding that can reduce the serialized index size by about 40%.</p><blog-section><p>I'll be using the <a href=https://www.gutenberg.org>Project Gutenberg</a> editions of <a href=https://www.gutenberg.org/ebooks/2701>Moby Dick</a> and <a href=https://www.gutenberg.org/ebooks/1342>Pride and Prejudice</a> as the example search corpus, with the Project Gutenberg metadata and licensing information trimmed off.</p><blog-code syntax=commands><pre>
curl -L -O https://www.gutenberg.org/files/1342/1342-0.txt
curl -L -O https://www.gutenberg.org/files/2701/2701-0.txt
shasum -a 256 *.txt
# c3fc0e1900e233a0c3c6ca5784a54a3d3aaf00d40603315c644487bd7a07e22f 1342-0.txt
# 61d5ab6a3910fab66eabc9d2fc708b68b756199cb754fd5ff51751dbe5f766cd 2701-0.txt
tail -n +168 1342-0.txt | head -n 14060 > pride-and-prejudice.txt
tail -n +337 2701-0.txt | head -n 21624 > moby-dick.txt</pre></blog-code><p>A basic Lunr indexing program uses a <a href=https://lunrjs.com/docs/lunr.Builder.html><tt>lunr.Builder</tt></a> to assemble the index, then converts it to JSON with <tt>toJSON()</tt>.</p><blog-code syntax=javascript><pre>
import * as fs from "fs";
import lunr from "lunr";
function main(argv) {
if (argv.length < 3) {
console.error("usage: lunr-index FILES...");
process.exit(1);
}
const fileNames = argv.slice(2);
const idx = lunr((builder) => {
builder.ref("name");
builder.field("text");
builder.pipeline.remove(lunr.stemmer);
fileNames.forEach((fileName => {
builder.add({
name: fileName,
text: fs.readFileSync(fileName)
});
}));
});
process.stdout.write(JSON.stringify(indexToJSON(idx)));
}
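// indexToJSON() is the hook that the later revisions replace with compact encoders.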
function indexToJSON(idx) {
return idx.toJSON();
}
main(process.argv);</pre></blog-code><p>The resulting index JSON is about 1.68 MB raw, 250 KB with gzip, or 192 KB with Brotli. These two compression formats are useful because they can be consumed directly by common web browsers.</p><p>I also tried compressing it with zstd, which I expected to provide the best compression ratio, but in this case Brotli performed better.</p><blog-code syntax=commands><pre>
node lunr-index-v1.js moby-dick.txt pride-and-prejudice.txt > index-v1.json
gzip -9k index-v1.json
brotli -q 11 -w 24 index-v1.json
zstd -19 index-v1.json
# index-v1.json : 12.76% ( 1.61 MiB => 211 KiB, index-v1.json.zst)
ls -lS --si index-v1.*
# -rw-r--r-- 1 john john 1.7M May 27 16:08 index-v1.json
# -rw-r--r-- 1 john john 249k May 27 16:08 index-v1.json.gz
# -rw-r--r-- 1 john john 216k May 27 16:08 index-v1.json.zst
# -rw-r--r-- 1 john john 192k May 27 16:08 index-v1.json.br</pre></blog-code></blog-section><blog-section><h2>Compacting the inverted index</h2><p>The Lunr index JSON contains two big lists, <tt>fieldVectors</tt> and <tt>invertedIndex</tt>. Of those, the inverted index is far bigger, and has more opportunity for space savings.</p><div style=display:flex><blog-code syntax=json style=order:1;width:50%><pre>
{
"fields": ["text"],
"fieldVectors": [
[ "text/moby-dick.txt", [ /* ... */ ] ],
[ "text/pride-and-prejudice.txt", [ /* ... */ ] ] ],
"invertedIndex": [
[ "1", {
"_index": 1363,
"text": {
"moby-dick.txt": {},
"pride-and-prejudice.txt": {}
} } ],
[ "1,000,000", {
"_index": 7298,
"text": {
"moby-dick.txt": {}
} } ],
// ...
] }</pre></blog-code><div style=order:0;margin-right:1em;width:50%><p>Notice how much duplicate text there is:<ul><li>The index is semantically a list ordered by when the token was first observed, but is stored sorted by the token value. The <tt>_index</tt> property wouldn't be needed if the index was stored unsorted.</li><li>Field names (here, <tt>"text"</tt>) are repeated for each token. These could be indexes to the top-level <tt>fields</tt> property instead. Or, since the set of fields is bounded and small, the field indexes could be implied by list ordering.</li><li>Document references (such as <tt>"moby-dick.txt"</tt>) are also repeated, and when combined with field names could be indexes into the <tt>fieldVectors</tt> property.</li></ul></p></div></div><p>The following code applies these basic transformations to the <tt>invertedIndex</tt> property.</p><blog-code syntax=javascript><pre>
function indexToJSON(idx) {
const output = idx.toJSON();
output.invertedIndex = compactInvIndex(output);
return output;
}
function compactInvIndex(index) {
const fields = index["fields"];
const fieldVectorIdxs = new Map(index["fieldVectors"].map((v, idx) => {
return [v[0], idx];
}));
const items = new Map(index["invertedIndex"].map(item => {
const token = item[0];
const props = item[1];
const newItem = [token];
fields.forEach((field) => {
const fProps = props[field];
const matches = [];
Object.keys(fProps).forEach((docRef) => {
const fieldVectorIdx = fieldVectorIdxs.get(`${field}/${docRef}`);
if (fieldVectorIdx === undefined) {
throw new Error();
}
matches.push(fieldVectorIdx);
matches.push(fProps[docRef]);
});
newItem.push(matches);
});
return [props["_index"], newItem];
}));
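// Each entry is now keyed by its original _index; for the sample above,
//   1363 => ["1", [0, {}, 1, {}]]
// where 0 and 1 are positions in the top-level fieldVectors list.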
const indexes = Array.from(items.keys()).sort((a, b) => a - b);
const compacted = Array.from(indexes, (k) => {
const item = items.get(k);
if (item === undefined) {
throw new Error();
}
return item;
});
return compacted;
}</pre></blog-code><p>The raw (uncompressed) index size is reduced by over 50%, and compressed sizes by about 25%.</p><blog-code syntax=commands><pre>
node lunr-index-v2.js moby-dick.txt pride-and-prejudice.txt > index-v2.json
[...]
ls -lS --si index-v2.*
# -rw-r--r-- 1 john john 751k May 27 16:17 index-v2.json
# -rw-r--r-- 1 john john 188k May 27 16:17 index-v2.json.gz
# -rw-r--r-- 1 john john 162k May 27 16:17 index-v2.json.zst
# -rw-r--r-- 1 john john 145k May 27 16:17 index-v2.json.br</pre></blog-code></blog-section><blog-section><h2>Compacting the field vectors</h2><div style=display:flex><div style=margin-right:1em;width:50%><p>The field vectors are semantically a key-value map, where the keys are integer indexes into the inverted index. There isn't much here to optimize in terms of plain text, but a simple tweak can improve how compressible the data is.</p><p>Consider a file that is simply a big list of integers: <tt>1,2,3,4,6,7</tt> and so on. A generic compression function can take advantage of the limited character set, but doesn't have the semantic understanding to encode "next integer in sequence". However, if the original file is changed such that sequential integers use a static sentinel value, then repetitions of the sentinel can be greatly compressed.</p></div><blog-code syntax=json style=width:50%><pre>
"fieldVectors": [
[ "text/moby-dick.txt", [
0, 0.607,
1, 0.356,
2, 0.382,
// ...
] ],
[ "text/pride-and-prejudice.txt", [
1, 0.278,
2, 0.383,
6, 0.278,
// ...
] ] ]</pre></blog-code></div><p>For Lunr indexes, because the inverted index is expanded as new tokens are observed, there are likely to be long runs of sequential integers in the first "column" of the field vectors. They can be replaced with <tt>null</tt>.</p><blog-code syntax=javascript><pre>
function indexToJSON(idx) {
const output = idx.toJSON();
output.invertedIndex = compactInvIndex(output);
output.fieldVectors = compactVectors(output);
return output;
}
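// Rewrite runs of sequential keys as nulls, e.g.
//   [0, 0.607, 1, 0.356, 2, 0.382] -> [0, 0.607, null, 0.356, null, 0.382]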
function compactVectors(index) {
return index["fieldVectors"].map((item) => {
const id = item[0];
const vectors = item[1];
let prev = null;
const compacted = vectors.map((v, ii) => {
if (ii % 2 === 0) {
if (prev !== null && v === prev + 1) {
prev += 1;
return null;
}
prev = v;
}
return v;
});
return [id, compacted];
});
}</pre></blog-code><p>This optimization shaves another 20% to 25% from the compressed file sizes.</p><blog-code syntax=commands><pre>
node lunr-index-v3.js moby-dick.txt pride-and-prejudice.txt > index-v3.json
[...]
ls -lS --si index-v3.*
# -rw-r--r-- 1 john john 741k May 27 16:27 index-v3.json
# -rw-r--r-- 1 john john 137k May 27 16:27 index-v3.json.gz
# -rw-r--r-- 1 john john 120k May 27 16:27 index-v3.json.zst
# -rw-r--r-- 1 john john 118k May 27 16:27 index-v3.json.br</pre></blog-code></blog-section><blog-section><h2>Recovering the original index data</h2><p>Lunr can't directly consume the compacted form of its search index, so we need to reverse the above optimizations before calling <tt>lunr.Index.load()</tt>.</p><blog-code syntax=javascript><pre>
import * as fs from "fs";
function main(argv) {
if (argv.length !== 3) {
console.error("usage: lunr-index-expand FILE");
process.exit(1);
}
const compactIndex = JSON.parse(fs.readFileSync(argv[2]));
process.stdout.write(JSON.stringify(expand(compactIndex)));
}
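// Reverses both compaction passes: nulls become prev + 1, match lists become
// per-document maps, and the inverted index is re-sorted by token.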
function expand(compact) {
const fields = compact["fields"];
const fieldVectors = compact["fieldVectors"].map((item) => {
const id = item[0];
const vectors = item[1];
let prev = null;
const expanded = vectors.map((v, ii) => {
if (ii % 2 === 0) {
if (v === null) {
v = prev + 1;
}
prev = v;
}
return v;
});
return [id, expanded];
});
const invertedIndex = compact["invertedIndex"].map((item, itemIdx) => {
const token = item[0];
const fieldMap = {"_index": itemIdx};
fields.forEach((field, fieldIdx) => {
const matches = {};
let docRef = null;
item[fieldIdx + 1].forEach((v, ii) => {
if (ii % 2 === 0) {
docRef = fieldVectors[v][0].slice(`${field}/`.length);
} else {
matches[docRef] = v;
}
});
fieldMap[field] = matches;
})
return [token, fieldMap];
});
invertedIndex.sort((a, b) => {
if (a[0] < b[0]) {
return -1;
}
if (a[0] > b[0]) {
return 1;
}
return 0;
});
return {
"version": compact["version"],
"fields": fields,
"fieldVectors": fieldVectors,
"invertedIndex": invertedIndex,
"pipeline": compact["pipeline"],
};
}
main(process.argv);</pre></blog-code><p>The expanded output is identical to the original search index JSON.</p><blog-code syntax=commands><pre>
node lunr-index-expand.js index-v3.json > index-v3-expanded.json
shasum -a 256 index-v1.json index-v3-expanded.json
# a6f96e4046152213c0c41a12dc83522a85f91db6603c34cb8b85174efc3ade3f index-v1.json
# a6f96e4046152213c0c41a12dc83522a85f91db6603c34cb8b85174efc3ade3f index-v3-expanded.json</pre></blog-code></blog-section></blog-article>2022-05-27T10:06:02ZJSON is not a YAML subset2022-05-17T05:40:23Zurn:uuid:472c8484-6482-4130-b2a1-3aabf2e110a4<blog-article posted=2022-05-17T05:40:23Z><h1 slot=title>JSON is not a YAML subset</h1><div slot=tableofcontents></div><p>People on the internet believe that JSON is a subset of YAML, and that it's safe to parse JSON using a YAML parser:</p><div><img src=https://john-millikin.com/by-sha256/ee63b2b624eee575eb02a86aa9b30c99d82bedeb06090183be7eba440af4196a/stackoverflow-1726802.png style=max-width:800px;margin:.5em>
<img src=https://john-millikin.com/by-sha256/23fe072b5a626913777f0fd855468def024fac47d8050b32cab03c6e071f6418/hn-12797477.png style=max-width:800px;margin:.5em>
<img src=https://john-millikin.com/by-sha256/38c4831ec27268409254c256f98044fe61abf199cb4658459c251c50110845ef/hn-24371948.png style=max-width:800px;margin:.5em></div><p>Following this advice will end badly because JSON is not a subset of YAML. It is easy to construct JSON documents that (1) fail to parse as YAML, or (2) parse to valid but semantically different YAML. The second case is more dangerous because it's difficult to detect.</p><blog-section><h2>False has over "1.7e3" named fjords</h2><p>YAML (infamously) allows string scalars to be unquoted. A conforming YAML parser, presented with a token known to contain a scalar value, must match that token against a set of patterns and then <i>fall back</i> to treating it as a string. This behavior produces surprising outcomes, and has been named <a href=https://hitchdev.com/strictyaml/why/implicit-typing-removed/>The Norway Problem</a>.</p><blog-code syntax=commands prompt=">>" output-prefix=@><pre>
@$ irb-3.1.2
require 'yaml'
@=> true
YAML.load '[FI,NO,SE]'
@=> ["FI", false, "SE"]</pre></blog-code><p>A similar issue affects JSON documents passed to a YAML parser when dealing with numbers in exponential notation. The YAML 1.1 spec is stricter about the syntax of numbers than JSON: <tt>1e2</tt> is a valid JSON number, but YAML 1.1 requires it to be written as <tt>1.0e+2</tt>. Being an invalid number, the YAML parser will treat it as a string.</p><blog-code syntax=commands prompt=">>" output-prefix=@><pre>
@$ irb-3.1.2
require 'json'
@=> true
require 'yaml'
@=> true
JSON.load '{"a": 1e2}'
@=> {"a"=>100.0}
YAML.load '{"a": 1e2}'
@=> {"a"=>"1e2"}</pre></blog-code><p></p></blog-section><blog-section><h2>YAML 1.2 won't save you</h2><p>YAML 1.2 is a revision to the YAML spec that (among other goals) aims to make YAML a proper superset of JSON. To maintain backwards compatibility with existing YAML documents, the version is specified in a <tt>%YAML</tt> directive.</p><blog-code syntax=yaml><pre>
---
a: 1e2 # document["a"] == "1e2"
b: no # document["b"] == false
</pre></blog-code><p></p><blog-code syntax=yaml><pre>
%YAML 1.2
---
a: 1e2 # document["a"] == 100
b: no # document["b"] == "no"
</pre></blog-code><p>Regardless of whether YAML 1.2 has been (or will be) widely adopted, it does not help those who want to parse a JSON document with a YAML parser. JSON documents do not start with <tt>%YAML</tt>, and therefore cannot opt-in to the YAML parser behavior that would permit correct parsing of JSON.</p></blog-section></blog-article>2022-05-17T05:40:23ZStateless Kubernetes overlay networks with IPv62021-02-20T07:08:48Zurn:uuid:355f5ef2-a16c-4340-8290-46f74229b251<style type=text/css scoped>li{margin:.5em 0}p,li{line-height:1.5}tt{white-space:nowrap}table.packet{color:#000;background-color:#fff;width:600px;border:1px solid #000;border-collapse:collapse;table-layout:fixed;margin:0 auto}table.packet th{background-color:lightgrey}table.packet tr,table.packet td,table.packet th{border:1px solid #000;text-align:center}table.packet td{padding:2px}</style><blog-article posted=2021-02-20T07:08:48Z><h1 slot=title>Stateless Kubernetes overlay networks with IPv6</h1><div slot=summary><p>The <a href=https://kubernetes.io/docs/concepts/cluster-administration/networking/>Kubernetes network model</a> is typically implemented by an overlay network, which allows pods to have an IP address decoupled from the underlying fabric. There are dozens of different overlay network implementations that combine a stateful IPv4 address allocator with VXLAN as a transport layer. IPv4 overlay networks have a number of well-documented drawbacks, which contributes to Kubernetes' reputation as difficult to operate beyond small cluster sizes (~10,000 machines).</p><p>This page describes an overlay network based on stateless IPv6 tunnels, which have better reliability and scalability characteristics than stateful IPv4 overlays. It uses IETF protocols that are natively supported by the Linux kernel, and since it is independent of Kubernetes itself, it can support communication between processes both inside and outside of containers.</p></div><blog-section><h2 slot=title>Wire protocol</h2><table class=packet style="float:right;margin:0 0 3em 1em"><col style=width:32px><tbody><tr><td></td><th colspan=4>IPv4 header</th></tr><tr><th>0</th><td rowspan=2 colspan=4>(other IPv4 control fields)</td></tr><tr><th>4</th></tr><tr><th>8</th><td>TTL</td><td>IP protocol (UDP)</td><td colspan=2>IP checksum</td></tr><tr><th>12</th><td colspan=4>IPv4 source address</td></tr><tr><th>16</th><td colspan=4>IPv4 destination address</td></tr><tr><td></td><th colspan=4>UDP header</th></tr><tr><th>20</th><td colspan=2>source port</td><td colspan=2>destination port (3544)</td></tr><tr><th>24</th><td colspan=2>UDP length</td><td colspan=2>UDP checksum</td></tr><tr><td></td><th colspan=4>IPv6 header</th></tr><tr><th>28</th><td rowspan=2 colspan=4>IPv6 control fields</td></tr><tr><th>32</th></tr><tr><th>36</th><td rowspan=4 colspan=4>IPv6 source address</td></tr><tr><th>40</th></tr><tr><th>44</th></tr><tr><th>48</th></tr><tr><th>52</th><td rowspan=4 colspan=4>IPv6 destination address</td></tr><tr><th>56</th></tr><tr><th>60</th></tr><tr><th>64</th></tr><tr><td></td><th colspan=4>IPv6 payload</th></tr></tbody></table><p><a href=https://en.wikipedia.org/wiki/6to4>6to4</a> (<a href=https://tools.ietf.org/html/rfc3056>RFC 3056</a>) is a standard for routing IPv6 traffic over an IPv4 network. It was originally designed as part of the IPv6 migration strategy, allowing isolated IPv6-only networks to use existing internet infrastructure.
The protocol is extremely simple – the IPv6 packet is treated as an IPv4 payload, using protocol number 41.</p><p><a href=https://en.wikipedia.org/wiki/Teredo_tunneling>Teredo</a> (<a href=https://tools.ietf.org/html/rfc4380>RFC 4380</a>) extends 6to4 by adding a layer of UDP encapsulation, which can improve compatibility with intermediate network devices that have compatibility issues with non-TCP/UDP protocols. This page assumes use of Teredo, but if the underlying network allows 6to4 (protocol 41) then the UDP encap can be turned off to save 8 bytes per packet.</p><p>The Linux kernel has built-in support for creating 6to4 tunnels in the <tt>sit</tt> driver. Such tunnels can optionally use the Teredo protocol by enabling the <a href=https://lwn.net/Articles/614348/>Foo Over UDP</a> (FOU) mode, which is a setting for Linux tunnel drivers that encapsulates packets in UDP. FOU computes synthetic source ports for outbound packets based on the encapsulated packet's connection tuple, thus allowing intermediate routers to distinguish underlying streams (e.g. for link aggregation or flow control).</p></blog-section><blog-section><h2 slot=title>Pod address allocation</h2><p>The 6to4 wire protocol describes how to encapsulate IPv6 packets, but doesn't mandate how IPv6 addresses should be assigned or how a router should calculate the IPv4 address of the destination<blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref>. For that we can use <a href=https://en.wikipedia.org/wiki/IPv6_rapid_deployment>6rd</a> (<a href=https://tools.ietf.org/html/rfc5969>RFC 5969</a>), which is a flexible embedding of the IPv4 address space into IPv6.</p><p>Allocating pod addresses with 6rd has a number of helpful properties:</p><ul><li>Given a pod's IPv6 address, its host's IPv4 address can be computed mechanically by the kernel. There is no userspace routing component.</li><li>Each host IPv4 address maps to a 64-bit IPv6 network prefix. Pod IPs can be allocated from this prefix by the CNI <tt>host-local</tt> IPAM plugin without any risk of conflict.</li><li>IPv6 addresses can be allocated from a <a href=https://en.wikipedia.org/wiki/Unique_local_address>Unique Local Address</a> (<a href=https://tools.ietf.org/html/rfc4193>RFC 4193</a>) range, which is similar to IPv4 private address ranges (e.g. 10.0.0.0/8).</li><li>A host's IPv4 address can have its high bits masked off, which is useful when every IPv4 address is being allocated from the same CIDR block (e.g. a private network).</li></ul><p>Unfortunately the 6rd functionality of iproute2 is not well documented, and the error messages are opaque netlink error codes. When in doubt, I recommend examining the iproute2 and Linux kernel source code to understand how <tt>ip tunnel 6rd</tt> commands map to netlink parameters.</p></blog-section><blog-section><h2 slot=title>Setting up a 6to4 overlay</h2><blog-section><h3 slot=title>Generate a network prefix</h3><p>To use a ULA range as a 6rd prefix, each IPv4 address must be masked to 16 bits or less. For this page I'll be using IPv4 addresses in the 10.0.0.0/8 range, which masks to 24 bits, so the ULA prefix needs to have its length fudged a bit (40 bits => 32).</p><blog-code syntax=commands><pre>
python3 -c 'import os; print("".join("%02x" % b for b in os.urandom(4)))'
# 8ce4b05e</pre></blog-code><p>Converting this to ULA yields an IPv6 network prefix of <tt>fd8c:e4b0:5e00::/40</tt>.</p></blog-section><blog-section><h3 slot=title>Create the SIT interface</h3><p>I'm not going to be stepping through each of these commands, so for folks not familiar with Linux networking I recommend opening the <a href=https://man7.org/linux/man-pages/man8/ip-link.8.html>ip-link(8)</a> and <a href=https://man7.org/linux/man-pages/man8/ip-tunnel.8.html>ip-tunnel(8)</a> manpages to follow along. The only thing to note is that the SIT interface is being created <i>without</i> a remote address – this is an overlay, not a tunnel.</p><p>There are two machines here, <tt>node-a</tt> and <tt>node-b</tt>, which will be set up with identical configuration (adjusted for their different IPv4 addresses):</p><ul><li>IPv4 address <tt>10.1.1.100</tt> maps to IPv6 prefix <tt>fd8c:e4b0:5e01:0164::/40</tt>.</li><li>IPv4 address <tt>10.1.1.101</tt> maps to IPv6 prefix <tt>fd8c:e4b0:5e01:0165::/40</tt>.</li></ul><blog-code syntax=commands prompt=root@node-a:~#><pre>
ip addr show ens37
# 3: ens37: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
# link/ether 00:50:56:36:1e:3d brd ff:ff:ff:ff:ff:ff
# inet 10.1.1.100/24 brd 10.1.1.255 scope global ens37
# valid_lft forever preferred_lft forever
# inet6 fe80::250:56ff:fe36:1e3d/64 scope link
# valid_lft forever preferred_lft forever
ip tunnel add kubetunnel0 \
# mode sit \
# local '10.1.1.100' \
# ttl 64
ip tunnel 6rd dev kubetunnel0 \
# 6rd-prefix 'fd8c:e4b0:5e00::/40' \
# 6rd-relay_prefix '10.0.0.0/8'
ip -6 addr add 'fd8c:e4b0:5e01:0164::1/40' dev kubetunnel0
ip link set kubetunnel0 up
ip -6 addr delete '::10.1.1.100/96' dev kubetunnel0
ip addr show dev kubetunnel0
# 6: kubetunnel0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
# link/sit 10.1.1.100 brd 0.0.0.0
# inet6 fd8c:e4b0:5e01:164::1/40 scope global
# valid_lft forever preferred_lft forever</pre></blog-code><p> </p><blog-code syntax=commands prompt=root@node-b:~#><pre>
ip addr show ens37
# 3: ens37: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
# link/ether 00:50:56:3f:94:51 brd ff:ff:ff:ff:ff:ff
# inet 10.1.1.101/24 brd 10.1.1.255 scope global ens37
# valid_lft forever preferred_lft forever
# inet6 fe80::250:56ff:fe3f:9451/64 scope link
# valid_lft forever preferred_lft forever
ip tunnel add kubetunnel0 \
# mode sit \
# local '10.1.1.101' \
# ttl 64
ip tunnel 6rd dev kubetunnel0 \
# 6rd-prefix 'fd8c:e4b0:5e00::/40' \
# 6rd-relay_prefix '10.0.0.0/8'
ip -6 addr add 'fd8c:e4b0:5e01:0165::1/40' dev kubetunnel0
ip link set kubetunnel0 up
ip -6 addr delete '::10.1.1.101/96' dev kubetunnel0
ip addr show dev kubetunnel0
# 5: kubetunnel0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
# link/sit 10.1.1.101 brd 0.0.0.0
# inet6 fd8c:e4b0:5e01:165::1/40 scope global
# valid_lft forever preferred_lft forever</pre></blog-code></blog-section><blog-section><h3 slot=title>Test 6to4 functionality</h3><p>Before going further, let's take the <tt>kubetunnel0</tt> devices for a spin and make sure they're able to route packets. Any protocol encapsulated by IPv6 should work (here I test ICMPv6, TCP, and UDP).</p><blog-code syntax=commands prompt=root@node-a:~#><pre>
ping6 -c 1 fd8c:e4b0:5e01:165::1
# PING fd8c:e4b0:5e01:165::1(fd8c:e4b0:5e01:165::1) 56 data bytes
# 64 bytes from fd8c:e4b0:5e01:165::1: icmp_seq=1 ttl=64 time=0.456 ms
#
# --- fd8c:e4b0:5e01:165::1 ping statistics ---
# 1 packets transmitted, 1 received, 0% packet loss, time 0ms
# rtt min/avg/max/mdev = 0.456/0.456/0.456/0.000 ms</pre></blog-code><p></p><blog-code syntax=commands prompt=root@node-b:~#><pre>
nc -6lvn -p 1234
# Listening on [::] (family 10, port 1234)</pre></blog-code><p></p><blog-code syntax=commands prompt=root@node-a:~#><pre>
echo 'Hello, world!' | nc -N fd8c:e4b0:5e01:165::1 1234</pre></blog-code><p></p><blog-code syntax=commands prompt=root@node-b:~#><pre>
nc -6lvn -p 1234
# Listening on [::] (family 10, port 1234)
# Connection from fd8c:e4b0:5e01:164::1 39182 received!
# Hello, world!</pre></blog-code><p>If anything goes wrong – for example UDP works but TCP doesn't – you can use pretty much any packet capture tool to debug the overlay. Since 6to4 is a widely-deployed protocol, tools such as <tt>tcpdump</tt> know how to de-encapsulate the underlying flows.</p><blog-code syntax=commands prompt=root@node-b:~#><pre>
tcpdump -i ens37 -nn --no-promiscuous-mode
# listening on ens37, link-type EN10MB (Ethernet), capture size 262144 bytes
# 05:04:16.648264 IP 10.1.1.100 > 10.1.1.101: IP6 fd8c:e4b0:5e01:164::1.39182 > fd8c:e4b0:5e01:165::1.1234: Flags [S], seq 1134988924, win 65320, options [mss 1420,sackOK,TS val 3326468574 ecr 0,nop,wscale 7], length 0
# 05:04:16.648496 IP 10.1.1.101 > 10.1.1.100: IP6 fd8c:e4b0:5e01:165::1.1234 > fd8c:e4b0:5e01:164::1.39182: Flags [S.], seq 2975133435, ack 1134988925, win 64768, options [mss 1420,sackOK,TS val 3023145745 ecr 3326468574,nop,wscale 7], length 0
# 05:04:16.648794 IP 10.1.1.100 > 10.1.1.101: IP6 fd8c:e4b0:5e01:164::1.39182 > fd8c:e4b0:5e01:165::1.1234: Flags [.], ack 1, win 511, options [nop,nop,TS val 3326468575 ecr 3023145745], length 0
# 05:04:16.648889 IP 10.1.1.100 > 10.1.1.101: IP6 fd8c:e4b0:5e01:164::1.39182 > fd8c:e4b0:5e01:165::1.1234: Flags [P.], seq 1:15, ack 1, win 511, options [nop,nop,TS val 3326468575 ecr 3023145745], length 14
# 05:04:16.648906 IP 10.1.1.101 > 10.1.1.100: IP6 fd8c:e4b0:5e01:165::1.1234 > fd8c:e4b0:5e01:164::1.39182: Flags [.], ack 15, win 506, options [nop,nop,TS val 3023145746 ecr 3326468575], length 0
# 05:04:16.648982 IP 10.1.1.100 > 10.1.1.101: IP6 fd8c:e4b0:5e01:164::1.39182 > fd8c:e4b0:5e01:165::1.1234: Flags [F.], seq 15, ack 1, win 511, options [nop,nop,TS val 3326468575 ecr 3023145745], length 0
# 05:04:16.649088 IP 10.1.1.101 > 10.1.1.100: IP6 fd8c:e4b0:5e01:165::1.1234 > fd8c:e4b0:5e01:164::1.39182: Flags [F.], seq 1, ack 16, win 506, options [nop,nop,TS val 3023145746 ecr 3326468575], length 0
# 05:04:16.649677 IP 10.1.1.100 > 10.1.1.101: IP6 fd8c:e4b0:5e01:164::1.39182 > fd8c:e4b0:5e01:165::1.1234: Flags [.], ack 2, win 511, options [nop,nop,TS val 3326468576 ecr 3023145746], length 0
#
# 8 packets captured
# 8 packets received by filter
# 0 packets dropped by kernel</pre></blog-code></blog-section><blog-section><h3 slot=title>Enable Teredo mode (UDP encapsulation)</h3><p>Since Teredo is 6to4 in UDP, we enable FOU mode to turn a 6to4 overlay into a Teredo overlay. FOU mode can be configured to use any destination port – I'm using 3544 because that's the official Teredo port, and it helps packet capture tools figure out what's going on.</p><blog-code syntax=commands prompt=#><pre>
modprobe fou
ip fou add port 3544 ipproto 41
ip link set \
# name kubetunnel0 \
# type sit \
# encap fou \
# encap-sport auto \
# encap-dport 3544</pre></blog-code><p>Note that Teredo support is not as widespread as 6to4. In particular, <tt>tcpdump</tt> doesn't know how to de-encapsulate it.</p><blog-code syntax=commands prompt=root@node-b:~#><pre>
tcpdump -i ens37 -nn --no-promiscuous-mode
# listening on ens37, link-type EN10MB (Ethernet), capture size 262144 bytes
# 05:35:26.040293 IP 10.1.1.100.54772 > 10.1.1.101.3544: UDP, length 80
# 05:35:26.040348 IP 10.1.1.101.54181 > 10.1.1.100.3544: UDP, length 80
# 05:35:26.040868 IP 10.1.1.100.54772 > 10.1.1.101.3544: UDP, length 72
# 05:35:26.041134 IP 10.1.1.100.54772 > 10.1.1.101.3544: UDP, length 86
# 05:35:26.041140 IP 10.1.1.100.54772 > 10.1.1.101.3544: UDP, length 72
# 05:35:26.041185 IP 10.1.1.101.54181 > 10.1.1.100.3544: UDP, length 72
# 05:35:26.041412 IP 10.1.1.101.54181 > 10.1.1.100.3544: UDP, length 72
# 05:35:26.042305 IP 10.1.1.100.54772 > 10.1.1.101.3544: UDP, length 72
#
# 8 packets captured
# 8 packets received by filter
# 0 packets dropped by kernel</pre></blog-code><p>Wireshark works fine.</p><img src=https://john-millikin.com/by-sha256/f20ec1f4bbad03d112370e7bfedc6b5a303c93972c9f8a1cf0865681dae66ad0/Screen%20Shot%202021-01-31%20at%2022.39.23.png style=max-width:1024px></blog-section></blog-section>
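<p>As a sanity check, the 6rd address mapping can be reproduced in a few lines of Python. This is a minimal sketch assuming the <tt>fd8c:e4b0:5e00::/40</tt> prefix and <tt>10.0.0.0/8</tt> relay prefix used above; the <tt>sixrd_prefix</tt> helper is my own name, not part of iproute2 or the kernel.</p><blog-code syntax=python><pre>
import ipaddress

def sixrd_prefix(ipv4, prefix="fd8c:e4b0:5e00::", prefix_len=40, relay_len=8):
    # Keep the low (32 - relay_len) bits of the host's IPv4 address.
    v4_bits = int(ipaddress.IPv4Address(ipv4)) & ((1 << (32 - relay_len)) - 1)
    # Splice those bits in directly after the 6rd prefix.
    shift = 128 - prefix_len - (32 - relay_len)
    return ipaddress.IPv6Address(int(ipaddress.IPv6Address(prefix)) | (v4_bits << shift))

print(sixrd_prefix("10.1.1.100"))  # fd8c:e4b0:5e01:164::
print(sixrd_prefix("10.1.1.101"))  # fd8c:e4b0:5e01:165::</pre></blog-code>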
<blog-section><h2 slot=title>Persistent network configuration</h2><blog-section><h3 slot=title>Debian (ifupdown)</h3><p>Create a file named <tt>/etc/network/interfaces.d/kubetunnel0</tt> in <a href=https://manpages.debian.org/stretch/ifupdown/interfaces.5.en.html>interfaces(5)</a> format. These commands are the same ones run earlier by hand.</p><p>If you don't want to use templating to inject the right local IPv4 address, or need something dynamic (e.g. if the host IPv4 is from DHCP), then move the commands into a helper binary and invoke it from this config file.</p><blog-code><pre>
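# Mirrors the manual "ip tunnel" setup above; the local IPv4 address and
# 6rd mapping are per-host values.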
auto kubetunnel0
iface kubetunnel0 inet6 manual
pre-up ip tunnel add "${IFACE}" \
mode sit \
local '10.1.1.100' \
ttl 64
pre-up ip tunnel 6rd dev "${IFACE}" \
6rd-prefix 'fd8c:e4b0:5e00::/40' \
6rd-relay_prefix '10.0.0.0/8'
pre-up ip -6 addr add 'fd8c:e4b0:5e01:0164::1/40' dev "${IFACE}"
up ip link set "${IFACE}" up
post-up ip -6 addr delete '::10.1.1.100/96' dev "${IFACE}"
down ip link set "${IFACE}" down
post-down ip tunnel delete "${IFACE}"</pre></blog-code><p>You may also want to create a bridge device, so that pods can have NAT'd IPv4 IPs. This lets them talk to existing infrastructure that isn't part of the overlay network. I'll use <tt>192.168.1.1/24</tt> as the NAT range in this example. Put the following into <tt>/etc/network/interfaces.d/kubebridge0</tt>.</p><blog-code><pre>
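# Bridge for NAT'd pod IPv4 traffic; pods also get IPv6 addresses from the
# host's overlay prefix (see the CNI config below).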
auto kubebridge0
iface kubebridge0 inet manual
pre-up \
iptables -t nat -C POSTROUTING -s 192.168.1.1/24 -j MASQUERADE || \
iptables -t nat -A POSTROUTING -s 192.168.1.1/24 -j MASQUERADE
pre-up ip link add name "${IFACE}" type bridge
pre-up ip addr add 192.168.1.1/24 brd + dev "${IFACE}"
pre-up ip -6 addr add fd8c:e4b0:5e01:0164::1:1/112 dev "${IFACE}"
up ip link set "${IFACE}" up
down ip link set "${IFACE}" down
post-down ip link delete "${IFACE}"</pre></blog-code></blog-section></blog-section><blog-section><h2 slot=title>Kubelet configuration</h2><p>After creating the interface, you still need to make the Kubelet use it for pod networking. Create a file <tt>/etc/cni/net.d/10-kubernetes-overlay.conf</tt>, using the <a href=https://www.cni.dev/plugins/ipam/host-local/>host-local</a> IPAM plugin to allocate addresses out of the host's IPv6 prefix:</p><ul><li>The <tt>"ranges"</tt> section sets which subnets to use for pod IPs. The following example allocates two IPs, one from the IPv6 overlay and one from the IPv4 bridge NAT.</li><li>The <tt>"routes"</tt> section configures the pod's network namespace to use the bridge for outbound packets.</li></ul><p>Here there are fewer options if you want to avoid templating the config file. You might need to write a custom CNI plugin that queries the network state and invokes the other CNI binaries.</p><blog-code syntax=json><pre>{
"cniVersion": "0.3.1",
"name": "kubernetes-overlay",
"type": "bridge",
"bridge": "kubebridge0",
"hairpinMode": true,
"ipam": {
"type": "host-local",
"ranges": [
[ { "subnet": "fd8c:e4b0:5e01:0164::1:0/112" } ],
[ { "subnet": "192.168.1.0/24" } ]
],
"routes": [
{ "dst": "0.0.0.0/0", "gw": "192.168.1.1" },
{ "dst": "fd8c:e4b0:5e00::/40", "gw": "fd8c:e4b0:5e01:0164::1:1" }
],
"dataDir": "/var/run/cni/networks/kubernetes-overlay"
}
}</pre></blog-code></blog-section><blog-section><h2 slot=title>Other notes</h2><blog-section><h3 slot=title>Jumbo packets</h3><p>If you're using jumbo packets on your network, be aware that the kernel creates <i>two</i> SIT interfaces: <tt>kubetunnel0</tt> and <tt>sit0</tt>.</p><blog-code syntax=commands prompt=#><pre>
ip addr show sit0
# 4: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
# link/sit 0.0.0.0 brd 0.0.0.0</pre></blog-code><p>This mostly doesn't matter, except that some <tt>sit0</tt> settings (including MTU) seem to affect all SIT tunnels on the machine. You'll want to adjust the <tt>sit0</tt> and <tt>kubetunnel0</tt> MTUs at the same time during interface creation. The MTU of <tt>sit0</tt> should match the physical interface.</p><blog-code syntax=commands prompt=#><pre>
ip link set dev kubetunnel0 mtu 8950
ip link set dev sit0 mtu 9001</pre></blog-code></blog-section></blog-section><blog-footnotes><hr><ol><li id=fn:1><p><a href=https://tools.ietf.org/html/rfc3068>RFC 3068</a> reserved the <tt>2002::/16</tt> anycast prefix for 6to4 tunnels, so that each IPv4 address would convert to a 48-bit "routing prefix". This original scheme wasn't widely adopted because the user experience depends on both sides of the connection having access to high-quality tunnel routers. <a href=https://tools.ietf.org/html/rfc7526>RFC 7526</a> officially deprecated the <tt>2002::/16</tt> prefix.</p></li></ol></blog-footnotes></blog-article>2021-02-20T07:08:48ZExtending VSCode with WebAssembly2020-12-30T03:15:46Zurn:uuid:f01bd1b8-8608-4afc-a4ab-9964dd5a00d4<style type=text/css scoped>li{margin:.5em 0}p,li{line-height:1.5}tt{white-space:nowrap}</style><blog-article posted=2020-12-28T11:57:02Z><h1 slot=title>Extending VSCode with WebAssembly</h1><div slot=tableofcontents></div><p>Two years ago I filed <a href=https://github.com/Microsoft/vscode/issues/65559>Microsoft/vscode#65559</a> asking for WebAssembly support in VSCode extensions. At the time, WASM was supported by Node.JS but the <tt>WebAssembly</tt> symbol wasn't available in the extension's evaluation scope. That issue didn't get much activity from upstream but the other day I tried it again, and … it worked!</p><p>Below is a small "hello world" LSP-based extension that loads a WASM module in <tt>onInitialize()</tt>. It uses the <a href=https://yarnpkg.com/package/vscode-languageserver>vscode-languageserver</a> library; readers new to VSCode extensions can follow along using Microsoft's <a href=https://code.visualstudio.com/api/get-started/your-first-extension>Your First Extension</a> and <a href=https://code.visualstudio.com/api/language-extensions/language-server-extension-guide>Language Server Extension Guide</a> tutorials.</p><div style=text-align:center><img src=https://john-millikin.com/by-sha256/fb136f5feeb60d7cafaaa1fda1967602f097fec4644db18c69d22b62980fcd12/Screen%20Shot%202020-12-28%20at%2017.11.18.png style=max-width:800px;margin:2em></div><blog-section><h2>server.wasm</h2><p>First up, we'll need the WASM file itself. I wrote two flavors (C and Rust) with equivalent API, returning a single static string. More complex APIs can use Emscripten or wasm-bindgen or whatever to deal with the FFI.</p><p>Option 1: C</p><blog-code syntax=c><pre>
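/* The returned pointer is an offset into the module's linear memory;
 * the host reads the bytes through the exported "memory". */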
char *greeting() {
return "Hello world (from C)!";
}</pre></blog-code><p></p><blog-code syntax=commands><pre>
clang --target=wasm32 \
# --no-standard-libraries \
# -Wl,--export-all \
# -Wl,--no-entry \
# -o out/server.wasm \
# src/server.c</pre></blog-code><p>Option 2: Rust</p><blog-code syntax=rust><pre>
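// NUL-terminated so the host can find the end of the string in linear memory.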
#![no_std]
#[no_mangle]
pub extern "C" fn greeting() -> *const u8 {
const HELLO: &'static str = "Hello world (from Rust)!\0";
HELLO.as_ptr()
}
#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
loop {}
}
</pre></blog-code><p></p><blog-code syntax=commands><pre>
rustc -O \
    --target wasm32-unknown-unknown \
    --crate-type=cdylib \
    -o out/server.wasm \
    src/server.rs</pre></blog-code><p>Either way you'll get a WASM module looking something like this:</p><blog-code syntax=commands><pre>
wasm2wat out/server.wasm
# (module
# (type (;0;) (func (result i32)))
# (func $greeting (type 0) (result i32)
# i32.const 1048576)
# (table (;0;) 1 1 funcref)
# (memory (;0;) 17)
# (global (;0;) (mut i32) (i32.const 1048576))
# (global (;1;) i32 (i32.const 1048601))
# (global (;2;) i32 (i32.const 1048601))
# (export "memory" (memory 0))
# (export "greeting" (func $greeting))
# (export "__data_end" (global 1))
# (export "__heap_base" (global 2))
# (data (;0;) (i32.const 1048576) "Hello world (from Rust)!\00"))</pre></blog-code></blog-section><blog-section><h2>server.ts</h2><p>This file implements the server side of an LSP-based VSCode extension. For the sake of brevity I won't have it implement any handlers, so it just loads the WASM module and sends a message once successfully initialized.</p><blog-code syntax=javascript><pre>
import * as fs from "fs";
import * as path from "path";
import {
    createConnection,
    ProposedFeatures,
    InitializeParams,
    TextDocumentSyncKind,
    InitializeResult
} from 'vscode-languageserver/node';
declare const WebAssembly: any;
</pre></blog-code><p>The only unusual part is the <tt>declare</tt>, which is necessary because TypeScript doesn't have type definitions for standalone WASM yet (<a href=https://github.com/DefinitelyTyped/DefinitelyTyped/issues/48648>DefinitelyTyped/DefinitelyTyped#48648</a>). Since the WebAssembly API is small we can stub out the type checks.</p><p>Next up is a helper class to load the module from disk, compile it, and call its exported functions.</p><blog-code syntax=javascript><pre>
class ServerImpl {
    instance: any;

    constructor(instance: any) {
        this.instance = instance;
    }

    // Read the compiled module from disk, then compile and instantiate it.
    public static instantiate(): Promise<ServerImpl> {
        const wasmPath = path.resolve(__dirname, "server.wasm");
        return new Promise((resolve, reject) => {
            fs.readFile(wasmPath, (err, data) => {
                if (err) {
                    reject(err);
                    return;
                }
                const buf = new Uint8Array(data);
                resolve(WebAssembly.instantiate(buf, {})
                    .then((result: any) => (new ServerImpl(result.instance)))
                );
            });
        });
    }

    // greeting() returns an offset into the module's linear memory; copy
    // bytes out of the exported memory until the NUL terminator.
    public greeting(): string {
        const exports = this.instance.exports;
        const result_off = exports.greeting();
        const result_ptr = new Uint8Array(exports.memory.buffer, result_off);
        let result = "";
        for (let ii = 0; result_ptr[ii]; ii++) {
            result += String.fromCharCode(result_ptr[ii]);
        }
        return result;
    }
}
</pre></blog-code><p>Lastly, the server startup and initialization logic, which calls into the helper to fetch the greeting string.</p><blog-code syntax=javascript><pre>
let impl: ServerImpl;
let connection = createConnection(ProposedFeatures.all);

connection.onInitialize((params: InitializeParams) => {
    const result: InitializeResult = {
        capabilities: {
            textDocumentSync: TextDocumentSyncKind.Incremental,
        }
    };
    // Delay the InitializeResult until the WASM module has loaded.
    return ServerImpl.instantiate()
        .then((loadedImpl: ServerImpl) => {
            impl = loadedImpl;
            return result;
        });
});

// onInitialized fires after the client acknowledges initialization, so
// `impl` is always set by the time greeting() is called here.
connection.onInitialized(() => {
    connection.window.showInformationMessage(`greeting: ${impl.greeting()}`);
});

connection.listen();
</pre></blog-code></blog-section><p>When the "Hello World" command is invoked via the command menu, the extension will be initialized and the greeting will pop up.</p></blog-article>2020-12-30T03:15:46ZNotes on cross-compiling Rust2020-12-12T09:34:16Zurn:uuid:f29c4af8-f526-4d49-bb46-b9f6a96ae93f<style type=text/css scoped>li{margin:.5em 0}p,li{line-height:1.5}tt{white-space:nowrap}</style><blog-article posted=2020-12-12T09:34:16Z><h1 slot=title>Notes on cross-compiling Rust</h1><div slot=summary><div style="float:right;padding:0 0 0 2em"><img src=https://john-millikin.com/by-sha256/a09f17ed006ab5ca090818f22f7efdbc9eafa5ccf28bc9b85429733bf4960ebb/raspberry-pi.jpg style=width:400px;height:400px></div><p>One of my current hobby projects involves running Rust binaries on a Raspberry Pi. There are three computers involved: the Pi itself (ARMv7 Linux), my desktop (x86-64 Linux), and sometimes my laptop (x86-64 macOS).</p><p>The release of Cyberpunk 2077 means that my desktop will be spending more time booted into Windows, so I needed to figure out how to get the macOS machine to build binaries for ARMv7 Linux. I had hoped this would be straightforward because <tt>rustc</tt> is a native cross-compiler, and I've had good experiences with cross-compiling other modern languages (e.g. Go).</p><p>Unfortunately when I did a websearch for [cross-compiling rust] the results were universally terrible<blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref>. This page contains my notes on how to get cross-compilation working with either Cargo or Bazel, plus some suggestions for the rustup and rules_rust projects that could make cross-compilation simpler in the future.</p></div><blog-section><h2 slot=title>Background</h2><p>In the early days of software engineering, when high-level languages like C were just starting to displace assembly, compilers used build-time configuration to select a target platform. This meant that any given build of the compiler could only generate object code for a single platform. The concept of <a href=https://en.wikipedia.org/wiki/Cross_compiler>cross-compilation</a> was introduced to describe compilers that could be built to run on Platform A but generate object code for Platform B.</p><p>Times change, and nowadays every major compiler<blog-footnote-ref>[<a href=#fn:2>2</a>]</blog-footnote-ref> is what's called a "native cross compiler", allowing the target platform to be selected at runtime (e.g. with a CLI flag). This includes the Rust compiler <tt>rustc</tt>, which as of v1.48 supports well over a hundred distinct targets.</p><blog-code syntax=commands><pre>
rustc --version
# rustc 1.48.0 (7eac88abb 2020-11-16)
rustc --print target-list | wc -l
# 156
rustc --print target-list | sort -R | head -n 10 | sort
# aarch64-apple-darwin
# i686-uwp-windows-msvc
# msp430-none-elf
# powerpc-unknown-linux-gnuspe
# powerpc-wrs-vxworks
# sparc64-unknown-linux-gnu
# sparc64-unknown-openbsd
# thumbv4t-none-eabi
# thumbv7a-pc-windows-msvc
# x86_64-pc-windows-msvc</pre></blog-code><p>In practice cross-compilation requires more than simply generating object code, but with a bit of effort from the toolchain developers it's possible to make this nearly seamless. Go is the gold standard here; it ships its own linker and the sources for its standard library, so a normal installation can directly build executables for any supported target.</p></blog-section><blog-section><h2 slot=title>Rustup and Cargo</h2><div style="float:right;padding:0 0 0 2em"><img src=https://john-millikin.com/by-sha256/b049b899f6e55fbbd9a80a31a44c7689068b1ac7050ec5a1a6d425e50cfde69f/Cargo-Logo-Small.png style=width:306px;height:275px></div><p>The first build tool I tried is <a href=https://github.com/rust-lang/cargo>Cargo</a>, which I installed with <a href=https://rustup.rs/>rustup</a>. I dislike building with Cargo because it's primitive and inflexible, but since it's the official Rust build tool I hoped it would be the best documented.</p><div style=padding-right:330px><blog-code syntax=toml><pre>
# Cargo.toml
[package]
name = "helloworld"
version = "0.0.1"
edition = "2018"
[[bin]]
name = "helloworld"
path = "helloworld.rs"
</pre></blog-code></div><p>Cargo uses the <tt>--target</tt> flag to enable cross-compilation.</p><blog-code syntax=commands><pre>
cargo build --target armv7-unknown-linux-gnueabihf
# Compiling helloworld v0.0.1 (/Users/john/src/rust-cross-compilation)
# error[E0463]: can't find crate for `std`
# |
# = note: the `armv7-unknown-linux-gnueabihf` target may not be installed</pre></blog-code><p>Whereas Go will build its standard library from source when cross-compiling, Rust relies on precompiled libraries<blog-footnote-ref>[<a href=#fn:3>3</a>]</blog-footnote-ref>. We can use <tt>rustup</tt> to fetch a prebuilt <tt>std</tt> for Linux on ARMv7.</p><blog-code syntax=commands><pre>
rustup target add armv7-unknown-linux-gnueabihf
# info: downloading component 'rust-std' for 'armv7-unknown-linux-gnueabihf'
# info: installing component 'rust-std' for 'armv7-unknown-linux-gnueabihf'
# info: using up to 500.0 MiB of RAM to unpack components
# 18.2 MiB / 18.2 MiB (100 %) 11.5 MiB/s in 1s ETA: 0s</pre></blog-code><p></p><blog-code syntax=commands><pre>
cargo build --target armv7-unknown-linux-gnueabihf
# Compiling helloworld v0.0.1 (/Users/john/src/rust-cross-compilation)
# error: linking with `cc` failed: exit code: 1
# |
# = note: "cc" "-Wl,--as-needed" "-Wl,-z,noexecstack" "-Wl,--eh-frame-hdr" "-L"
# [...]
# "-Wl,-Bdynamic" "-lgcc_s" "-lc" "-lm" "-lrt" "-lpthread" "-lutil" "-ldl" "-lutil"
# = note: clang: warning: argument unused during compilation: '-pie' [-Wunused-command-line-argument]
# ld: unknown option: --as-needed
# clang: error: linker command failed with exit code 1 (use -v to see invocation)</pre></blog-code><p>The source file was successfully compiled, but it couldn't be linked into an executable. It looks like Cargo is trying to use the host system's linker, which will sometimes work, but fails in this particular case because the macOS linker only supports Apple targets.</p><p>Luckily the LLVM project, in addition to the compilation framework, also distributes the cross-platform <a href=https://lld.llvm.org/>LLD</a> linker. While it doesn't cover every platform supported by <tt>rustc</tt>, it does support the common ones. We can configure Cargo to use it for linking our ARMv7 Linux binary.</p><p>I downloaded <a href=https://github.com/llvm/llvm-project/releases/download/llvmorg-11.0.0/clang+llvm-11.0.0-x86_64-apple-darwin.tar.xz><tt>clang+llvm-11.0.0-x86_64-apple-darwin.tar.xz</tt></a> from <a href=https://releases.llvm.org/download.html>https://releases.llvm.org/download.html</a> and extracted it to <tt>~/.opt/</tt>, then added a <tt>.cargo/config.toml</tt> to my workspace.</p><blog-code syntax=toml><pre>
# .cargo/config.toml
[build]
[target.armv7-unknown-linux-gnueabihf]
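# rustc will invoke this binary for the final link, instead of the default "cc"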
linker = "/Users/john/.opt/clang+llvm-11.0.0-x86_64-apple-darwin/bin/lld"
</pre></blog-code><p></p><blog-code syntax=commands><pre>
cargo build --target armv7-unknown-linux-gnueabihf
# Compiling helloworld v0.0.1 (/Users/john/src/rust-cross-compilation)
# error: linking with `/Users/john/.opt/clang+llvm-11.0.0-x86_64-apple-darwin/bin/lld` failed: exit code: 1
# |
# = note: "/Users/john/.opt/clang+llvm-11.0.0-x86_64-apple-darwin/bin/lld" "-flavor" "gnu" "--eh-frame-hdr" "-L"
# [...]
# "-Bdynamic" "-lgcc_s" "-lc" "-lm" "-lrt" "-lpthread" "-lutil" "-ldl" "-lutil"
# = note: lld: error: unable to find library -lgcc_s
# lld: error: unable to find library -lc
# lld: error: unable to find library -lm
# lld: error: unable to find library -lrt
# lld: error: unable to find library -lpthread
# lld: error: unable to find library -lutil
# lld: error: unable to find library -ldl
# lld: error: unable to find library -lutil</pre></blog-code><p>Getting closer!</p><p>The linker is being told to build an executable that dynamically links against the GNU libc, which I don't have a copy of. One option here is to download it from (for example) the Ubuntu package hosting, but I don't want to do that because I don't think a Rust binary should be depending on <tt>libc</tt> at all. Rust ought to be considered a replacement for C, rather than a thin layer on top.</p><p>Therefore I'm going to switch the Cargo target to the MUSL variant, which treats <tt>libc</tt> as an implementation detail rather than a core component of the platform.</p><blog-code syntax=commands><pre>
rustup target add armv7-unknown-linux-musleabihf
# info: downloading component 'rust-std' for 'armv7-unknown-linux-musleabihf'
# info: installing component 'rust-std' for 'armv7-unknown-linux-musleabihf'
# info: using up to 500.0 MiB of RAM to unpack components
# 15.8 MiB / 15.8 MiB (100 %) 12.1 MiB/s in 1s ETA: 0s</pre></blog-code><p></p><blog-code syntax=toml><pre>
# .cargo/config.toml
[build]
[target.armv7-unknown-linux-musleabihf]
linker = "/Users/john/.opt/clang+llvm-11.0.0-x86_64-apple-darwin/bin/lld"
</pre></blog-code><p></p><blog-code syntax=commands><pre>
cargo build --target armv7-unknown-linux-musleabihf
# Compiling helloworld v0.0.1 (/Users/john/src/rust-cross-compilation)
# Finished dev [unoptimized + debuginfo] target(s) in 1.50s</pre></blog-code><p>Success! The resulting binary is a valid executable for ARMv7 Linux, and can be run as-is on the Raspberry Pi.</p><blog-code syntax=commands><pre>
file target/armv7-unknown-linux-musleabihf/debug/helloworld
# target/armv7-unknown-linux-musleabihf/debug/helloworld: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, with debug_info, not stripped</pre></blog-code></blog-section><blog-section><h2 slot=title>Bazel</h2><div style="float:right;padding:0 0 0 2em"><img src=https://john-millikin.com/by-sha256/05daef8103f981c102f1b8486bd7c97f625bdffb14e0ce4875dc4a2ea2b5941e/bazel-icon.svg style=width:300px;height:300px></div><p><a href=https://bazel.build/>Bazel</a> is a language-agnostic build system. Its configuration language deals in actions and dependency graphs, rather than executables and libraries, which gives it some interesting scaling properties:</p><ul><li>Building single-language projects with Bazel can be more difficult than using language-specific tools.</li><li>Building multi-language projects is substantially easier in Bazel than in any other build system.</li></ul><p>This makes Bazel a natural choice of build tool for any system that involves (1) FFI, (2) generated code, or (3) well-factored subsystems. It is uniquely capable when compared to Cargo because it can build multiple Rust libraries ("crates") within a single workspace.</p><p style=clear:both>The first step to build Rust with Bazel is to configure the <tt>WORKSPACE</tt> to depend on <a href=https://github.com/bazelbuild/rules_rust>rules_rust</a>. This will also define the default Rust version and edition. There's no need to install toolchains or targets, because Bazel will fetch them on demand.</p><blog-code syntax=python><pre>
# WORKSPACE
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "io_bazel_rules_rust",
    # HEAD commit as of 2020-12-05
    urls = ["https://github.com/bazelbuild/rules_rust/archive/67f0c5ec0397d24ccc14264a0eda86915ddf63e8.tar.gz"],
    sha256 = "c587d402e4502100b01e4ba7d9584809cf4f4eb2d2f6634097883637bfb512b1",
    strip_prefix = "rules_rust-67f0c5ec0397d24ccc14264a0eda86915ddf63e8",
)

load("@io_bazel_rules_rust//rust:repositories.bzl", "rust_repositories")

rust_repositories(
    edition = "2018",
    version = "1.48.0",
)
</pre></blog-code><p>Next we need to create a top-level <tt>BUILD</tt> file. This will define a <tt>rust_binary</tt> target for our hello-world executable, and also a <tt>platform</tt> describing what sort of system we want to build for.</p><blog-code syntax=python><pre>
# BUILD.bazel
load("@io_bazel_rules_rust//rust:rust.bzl", "rust_binary")

rust_binary(
    name = "helloworld",
    srcs = ["helloworld.rs"],
)

platform(
    name = "linux-armv7",
    constraint_values = [
        "@platforms//os:linux",
        "@platforms//cpu:arm",
    ],
)
</pre></blog-code><p>In the future the platform definition could use a more specific <tt>"cpu:armv7"</tt> constraint (<a href=https://github.com/bazelbuild/rules_rust/pull/509>bazelbuild/rules_rust#509</a>) and also constrain on the Rust release channel (<a href=https://github.com/bazelbuild/rules_rust/pull/510>bazelbuild/rules_rust#510</a>).</p><p>Anyway, that should be enough, but if we try running it we'll hit an error about missing toolchains.</p><blog-code syntax=commands><pre>
bazel build //:helloworld --platforms=//:linux-armv7
# [...]
# ERROR: While resolving toolchains for target //:helloworld: no matching toolchains found for types @io_bazel_rules_rust//rust:toolchain</pre></blog-code><p>This is because rules_rust doesn't pre-register toolchains for all supported target platforms – it makes the user register each (host, target) mapping explicitly. We need to tell rules_rust to register a toolchain that can run on macOS (Darwin) and build for ARMv7 Linux.</p><blog-code syntax=python><pre>
# WORKSPACE
load("@io_bazel_rules_rust//rust:repositories.bzl", "rust_repository_set")

rust_repository_set(
    name = "rust_linux_armv7",
    edition = "2018",
    exec_triple = "x86_64-apple-darwin",
    extra_target_triples = ["arm-unknown-linux-musleabihf"],
    rustfmt_version = "1.4.20",
    version = "1.48.0",
)
</pre></blog-code><p></p><blog-code syntax=commands><pre>
bazel build //:helloworld --platforms=//:linux-armv7
# [...]
# INFO: From Compiling Rust bin helloworld (1 files):
# error: linking with `external/local_config_cc/cc_wrapper.sh` failed: exit code: 1
# |
# = note: "external/local_config_cc/cc_wrapper.sh" "-Wl,--as-needed" "-Wl,-z,noexecstack" "-Wl,--eh-frame-hdr" "-nostartfiles"
# = note: clang: warning: argument unused during compilation: '-no-pie' [-Wunused-command-line-argument]
# ld: unknown option: --as-needed
# clang: error: linker command failed with exit code 1 (use -v to see invocation)</pre></blog-code><p>This is the same linker error as we saw with Cargo, and the solution is to tell rules_rust that it should use LLD. However, there's a problem – rules_rust doesn't have its own linker toolchain; it uses the C/C++ toolchain to find a linker.</p><p>We must now contend with the Bazel C/C++ configuration system, which is designed to handle the world's wide range of strange C compilers. I'm not going to give a blow-by-blow here because none of it is relevant to Rust, but a summary is:</p><ul><li>We create a new Bazel package <tt>//cc-toolchain</tt> that will contain the C/C++ configuration. I'm just going to pull in the linker from the filesystem rather than properly <tt>repository_rule</tt> it, so the toolchain file sets will be empty stubs.</li><li>The <tt>CcToolchainConfigInfo</tt> itself requires paths to a bunch of different tools; since the only one needed here is <tt>lld</tt> I'll hardcode the rest to <tt>/bin/false</tt>.</li><li>This project doesn't need to build any C/C++ code for the host (e.g. for codegen), so I'm going to override <tt>--host_crosstool_top</tt> rather than define a true host-compatible toolchain.</li></ul><p>A more complete solution would probably involve the Clang-based toolchains defined in <a href=https://github.com/bazelbuild/bazel-toolchains>https://github.com/bazelbuild/bazel-toolchains</a>.</p><blog-code syntax=python><pre>
# cc-toolchain/BUILD
load(":config.bzl", "cc_toolchain_config")

filegroup(name = "empty")

cc_toolchain_suite(
    name = "clang_suite",
    toolchains = {
        "armv7": ":armv7_toolchain",
    },
)

cc_toolchain(
    name = "armv7_toolchain",
    all_files = ":empty",
    compiler_files = ":empty",
    dwp_files = ":empty",
    linker_files = ":empty",
    objcopy_files = ":empty",
    strip_files = ":empty",
    supports_param_files = 0,
    toolchain_config = ":armv7_toolchain_config",
    toolchain_identifier = "armv7-toolchain",
)

cc_toolchain_config(name = "armv7_toolchain_config")
</pre></blog-code><p></p><blog-code syntax=python><pre>
# cc-toolchain/config.bzl
load(
    "@bazel_tools//tools/cpp:cc_toolchain_config_lib.bzl",
    "action_config",
    "tool",
    "tool_path",
)
load(
    "@bazel_tools//tools/build_defs/cc:action_names.bzl",
    "CPP_LINK_EXECUTABLE_ACTION_NAME",
)

LLD = "/Users/john/.opt/clang+llvm-11.0.0-x86_64-apple-darwin/bin/lld"

def _cc_toolchain_config_impl(ctx):
    return cc_common.create_cc_toolchain_config_info(
        ctx = ctx,
        toolchain_identifier = "armv7-toolchain",
        host_system_name = "local",
        target_system_name = "armv7-unknown-linux-musleabihf",
        target_cpu = "armv7",
        target_libc = "unknown",
        compiler = "clang",
        abi_version = "unknown",
        abi_libc_version = "unknown",
        # Linking executables is the only action this toolchain needs to
        # support, and LLD is the only real tool it provides.
        action_configs = [
            action_config(
                action_name = CPP_LINK_EXECUTABLE_ACTION_NAME,
                enabled = True,
                tools = [tool(path = LLD)],
            ),
        ],
        tool_paths = [
            tool_path(
                name = "ld",
                path = LLD,
            ),
            tool_path(
                name = "ar",
                path = "/usr/bin/ar",
            ),
            tool_path(
                name = "cpp",
                path = "/bin/false",
            ),
            tool_path(
                name = "gcc",
                path = "/usr/bin/clang",
            ),
            tool_path(
                name = "gcov",
                path = "/bin/false",
            ),
            tool_path(
                name = "nm",
                path = "/bin/false",
            ),
            tool_path(
                name = "objdump",
                path = "/bin/false",
            ),
            tool_path(
                name = "strip",
                path = "/bin/false",
            ),
        ],
    )

cc_toolchain_config = rule(
    implementation = _cc_toolchain_config_impl,
    attrs = {},
    provides = [CcToolchainConfigInfo],
)
</pre></blog-code><p>Whew. With that mess dealt with, rules_rust will now link with LLD and produce valid ARMv7 Linux binaries.</p><blog-code syntax=commands><pre>
bazel build //:helloworld --platforms=//:linux-armv7 \
    --cpu=armv7 \
    --crosstool_top=//cc-toolchain:clang_suite \
    --host_crosstool_top=@bazel_tools//tools/cpp:toolchain
# INFO: Invocation ID: f6c497d9-48db-4240-85b5-c8bfa675c49b
# INFO: Analyzed target //:helloworld (10 packages loaded, 274 targets configured).
# INFO: Found 1 target...
# Target //:helloworld up-to-date:
# bazel-bin/helloworld
# INFO: Elapsed time: 33.660s, Critical Path: 0.45s
# INFO: 10 processes: 5 remote cache hit, 5 internal.
# INFO: Build completed successfully, 10 total actions</pre></blog-code><p></p><blog-code syntax=commands><pre>
file bazel-bin/helloworld
# bazel-bin/helloworld: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, with debug_info, not stripped</pre></blog-code></blog-section><blog-section><h2 slot=title>Suggestions</h2><ol><li>rules_rust has some work to do on making its toolchains ergonomic. Right now they couple the host binaries and target libraries into a single <tt>ToolchainInfo</tt>, which means Bazel can't resolve them separately based on host and target constraints. If they were split up (<a href=https://github.com/bazelbuild/rules_rust/issues/523>bazelbuild/rules_rust#523</a>) then the entire set of supported targets could be pre-registered by a <tt>rust_toolchains()</tt> macro.</li><li>rules_rust should decouple its linker command from the C/C++ toolchain. I shouldn't have to touch anything related to <tt>cc</tt> to get a working <tt>rustc</tt> + <tt>lld</tt> combo.</li><li>Both rustup and rules_rust should integrate support for LLD. While I'm not sure if it should be the default for all platforms, it should definitely be the default (or strongly recommended) for cross-compilation.</li><li><p>The LLVM project should offer non-monolithic downloads of individual tools, or alternatively the Rust project should host a stripped-down archive for LLD. The full LLVM binary distribution is huge, and it doesn't make sense to make users download a complete copy of Clang just so they can link ELF binaries on macOS.</p><blog-code syntax=commands><pre>
du -sh clang+llvm-11.0.0-x86_64-apple-darwin/
# 2.4G clang+llvm-11.0.0-x86_64-apple-darwin/
du -sh clang+llvm-11.0.0-x86_64-apple-darwin/bin/lld
# 81M clang+llvm-11.0.0-x86_64-apple-darwin/bin/lld</pre></blog-code><p>It doesn't even use any of the bundled dylibs!</p><blog-code syntax=commands><pre>
otool -L clang+llvm-11.0.0-x86_64-apple-darwin/bin/lld
# clang+llvm-11.0.0-x86_64-apple-darwin/bin/lld:
# /usr/lib/libxml2.2.dylib (compatibility version 10.0.0, current version 10.9.0)
# /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.11)
# /usr/lib/libncurses.5.4.dylib (compatibility version 5.4.0, current version 5.4.0)
# /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.100.1)
# /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 902.1.0)</pre></blog-code></li><li>Cross-compilation should be covered by official Rust documentation. The Rust Book's maintainers have declined to add a chapter about it (<a href=https://github.com/rust-lang/book/issues/2367>rust-lang/book#2367</a>), which makes me sad, but I am hopeful that it might one day be covered in the <a href=https://rust-embedded.github.io/book/>Embedded Rust book</a>.</li></ol></blog-section><blog-footnotes><hr><ol><li id=fn:1><p>If a tutorial on cross-compiling Rust starts off with installing Docker or Vagrant then I'm not fucking reading it. And stop linking me to <tt>rust-embedded/cross</tt>; hiding these insane dependency stacks behind a "magical" wrapper doesn't help anybody worth helping.</p></li><li id=fn:2><p>Except for GCC, which like most GNU software chooses to remain frozen in a grotesque parody of mid-80s UNIX.</p></li><li id=fn:3><p>I've heard this is due to the Rust standard library's dependency on <tt>libc</tt>, thus requiring a C toolchain and headers to build <tt>std</tt> for a given platform.</p></li></ol></blog-footnotes></blog-article>2020-12-12T09:34:16ZFirst impressions of Rust2020-08-06T22:23:05Zurn:uuid:a3152b0f-5e2a-49c7-9477-0e2f7ebef489<style type=text/css scoped>li{margin:.5em 0}p,li{line-height:1.5}.visible-tab{color:#c6d4e0}tt{white-space:nowrap}</style><blog-article posted=2020-08-06T22:23:05Z><h1 slot=title>First impressions of Rust</h1><div slot=tableofcontents></div><div style="float:right;margin:0 0 2em 2em"><img src=https://john-millikin.com/by-sha256/ab42a08a18def418ac77f16a96e8fee54ccd823ed3b0d8ebdc74bee9dba01121/crab.jpg style=max-width:400px></div><p>I've been wanting to write a big project in <a href=https://www.rust-lang.org/>Rust</a> for a while as a learning exercise, and actually started one in late 2018 (a FUSE server implementation). But then life happened and I got busy and never went anywhere with it. Due to certain world circumstances I'm currently spending a lot of time indoors, so <a href=https://github.com/jmillikin/rust-fuse>rust-fuse</a> (<a href=https://jmillikin.github.io/rust-fuse/fuse-v0.0.1-a6ad16d1127d36f80e6d02f36e48da56920ca693/fuse/>docs</a>) now exists and is good enough to write basic hello-world filesystems. I plan to polish it up a bit more with the goal of releasing a v1.0 that supports the same use cases as <a href=https://github.com/libfuse/libfuse>libfuse</a>.</p><p>I took some notes along the way about things that struck me as especially good or bad. Overall I quite like Rust the language, have mixed feelings about the quality of ancillary tooling, and have strong objections to some decisions made by the packaging system (Cargo + crates.io).</p><blog-section><h2 slot=title>Background</h2><p>I've been programming professionally for 15 years, primarily network servers and GUIs on Linux. Between roughly 2009 and 2015 I experimented with using Haskell for systems programming, writing several projects in pure Haskell (<a href=/software/haskell-dbus>haskell-dbus</a>, <a href=/software/anansi>Anansi</a>) and as bindings (<a href=/software/haskell-ncurses>haskell-ncurses</a>, <a href=/software/haskell-cpython>haskell-cpython</a>).
However, I couldn't achieve the sorts of reliability improvements over bread-and-butter C++ that I had hoped for:<ul><li>Haskell has a lot of tools for reasoning about the structure of computation, notably monads for declarative I/O, but it doesn't do much to help the programmer with non-algorithmic concerns such as memory lifetimes. I spent a lot of time debugging dangling pointers and race conditions.</li><li>I found it very difficult to write Haskell code that could run as fast as C. Avoiding allocation, auto-boxing, etc. felt like it required a deep knowledge of undocumented or unspecified GHC behavior.</li></ul></p><p>In late 2015 I started rough sketches for a new language, Funk, that would combine the type-safety of Haskell with the low-level precision of C/C++. Funk was strongly influenced by Google's internal dialect of C++, which uses smart pointers and sum types (e.g. <tt>StatusOr<T></tt>) to improve memory safety – many of its features later became part of the C++11 and C++14 standards. To this foundation I bolted on Haskell-style typeclasses and modules, then started writing a Funk-to-C translator based on <a href=https://wiki.gnome.org/Projects/Vala/Documentation>Vala</a>.</p><p>At some point I was looking around for inspiration on how to handle memory allocation (I planned to use scoped arenas as the fundamental dynamic memory system) and I discovered Rust. Here was a language that was solving the same problems as Funk but (1) better designed, (2) already implemented, and (3) supported by an entire team of compiler experts. So that was that, Funk went to <tt>/dev/null</tt> and I logged a TODO to learn Rust.</p></blog-section><blog-section><h2 slot=title>The Rust language</h2><p>It shouldn't come as a surprise that someone looking for a cross between C++ and Haskell would like Rust, but I want to be clear: I really <i>really</i> enjoy using Rust. It is nearly everything I want in a systems programming language, and the few parts it's missing are due to legitimate technical difficulty. The amount of careful thought I've seen put into its design – crafting <tt>async/await</tt> to work in <tt>no_std</tt> environments, or the <a href=https://blog.rust-lang.org/inside-rust/2020/06/08/new-inline-asm.html>new inline assembly syntax</a> – has produced a language that is not only better than what I could have designed, it's better among axes I was not even aware existed prior to reading the Rust RFCs.</p><p>The "nightly" release channel is an excellent idea that I wish more infrastructure software made use of. Stabilizing individual features on their own schedules lets the compiler maintain a blistering release cadence (stable releases every <i>six weeks</i>!). Users are empowered to choose their own preferred point on the maintenance/velocity curve, opting in to higher upgrade costs in exchange for early access to new features. The "editions" system goes a bit further, derisking backwards-incompatible syntax changes that would have stymied C++ for decades (see: trigraphs).</p><blog-section><h3 slot=title>Type system</h3><p>Rust has a reasonable amount of Haskell-style type programming, though I wouldn't mind a <i>bit</i> more. Some parts of its type system are limited in non-intuitive ways – for example only lifetime-kinded type parameters can be universally quantified in a trait bound. I hit a lot of compiler errors that recognized exactly what I wanted to do but wouldn't let me do it.</p><p>I wish Rust's type system supported:<ul><li>Closed-world ("sealed") traits.
Rust's rules against private types in the public API are good civilization but they make it difficult to define pseudo-private traits like <tt>Mount</tt> that I want users to name but not implement or call into.</li><li>Associated types in structs. Rust lets traits have associated types, and structs can have associated <i>values</i>, but there's no equivalent to the nested type names found in C++ or Java.</li><li>Very basic dependent typing, or maybe something like Eiffel's contracts, for the purpose of eliminating array bounds checks. I'd like to be able to say "this function accepts a <tt>&[u8]</tt> of at least <tt>size_of<SomeType></tt>" so I can do safe unchecked byte poking.</li></ul></p></blog-section><blog-section><h3 slot=title>Standard library</h3><p>There's a lot of standard UNIX functionality that's missing from the Rust standard library. Some of it is more-or-less available from separate packages like <a href=https://crates.io/crates/nix>nix</a>, but I shouldn't have to depend on four crates plus a C compiler to get access to <tt>getuid()</tt>. I shouldn't have to depend on <i>anything</i> to get the definition of <tt>ENOSYS</tt> or the size of <tt>c_ulong</tt>. Go is the gold standard here – it can cross-compile to a Linux target from macOS using its own copies of the Linux syscall table – and even Haskell has <a href=http://hackage.haskell.org/package/base-4.14.0.0/docs/Foreign-C-Types.html><tt>Foreign.C.Types</tt></a>.</p><p>A <tt>std::os::unix</tt> without <tt>getuid()</tt> is incomplete but can be worked around with a small <tt>extern "C"</tt> block. Much worse is the lack of macro-dependent functions like <tt>recvmsg()</tt>, which is not a great API to begin with, or functions with OS-dependent arity like <tt>mount()</tt>. Rust is not averse to providing clean wrappers around the OS library – the <tt>std::fs</tt> and <tt>std::process</tt> modules contain little else – so it's frustrating to see these very basic functions left out.</p></blog-section></blog-section><blog-section><h2 slot=title>Tooling</h2><blog-section><h3 slot=title>rustdoc</h3><p>I categorize documentation generators into two basic groups:<ul><li>First is the <a href=https://www.sphinx-doc.org/>Sphinx</a> group, which consumes prose and uses embedded pragmas to reference symbols of the library being documented. The output layout tends to be textbook-like, containing long "chapters" that might cover entire modules in one HTML file. Sphinx-style docs are popular among Python programmers.</li><li>Second is the <a href=https://www.doxygen.nl/>Doxygen</a> group, which consumes source code and generates a rigidly-structured catalog of symbols with optional attached prose. The output feels more like an encyclopedia or reference manual.</li></ul></p><p><tt>rustdoc</tt> is obviously in the second category. It is designed to consume doc comments, which are special-cased by the Rust compiler, and produces output closely matching the structure of the exported API. At this task <tt>rustdoc</tt> does a reasonable job: the page layout is navigable, the markup format (<tt>rustdoc</tt> uses Markdown) isn't great but it could be worse, and it doesn't hardcode absolute file paths into the output like Haddock.</p><p>Some of its annotations, like whether a symbol is OS-specific (<a href=https://github.com/rust-lang/rust/issues/43781>rust-lang/rust#43781</a>), are gated to the Nightly toolchain.
It's not obvious to me why they do this – it's a documentation generator, why does it care what version of the Rust compiler I'm using? What's more, some of its functionality is reserved for the standard library only. I can't mark fields as unstable (subject to change in future library versions) because that annotation is based on the <tt>#[unstable]</tt> attribute, which the compiler reserves for its own use. Ditto for annotations about which version a symbol was added in. If I'm going to use a Doxygen-group tool then I don't want it to get too fussy about what libraries it's documenting.</p></blog-section><blog-section><h3 slot=title>rustfmt</h3><p>Something like a cross between <tt>gofmt</tt>, <tt>clang-format</tt>, and GNU indent. It has a lot of configuration options but all the interesting ones are gated to Nightly, and most of those are much less useful than you might expect.</p><p>As a representative sample, consider <tt>rustfmt</tt>'s handling of hard tabs. Given the following input there are two basic ways you might use hard tabs to indent it, depending on whether struct value alignment should apply to nested structs:</p><blog-code><pre>
MyStruct{
field_with_long_name: (some_big_complex_variable_name + another_big_complex_variable_name),
another_field: 123,
nested_struct: &NestedStruct{
nested_struct_field: 456,
},
final_field: 123,
}
</pre></blog-code><p>The first is to treat the nested struct as a "break" in the alignment (<tt>gofmt</tt> does this). I've drawn the tabs as <span class=visible-tab>████</span> for clarity:</p><blog-code><pre class=language-rust>
<code class=language-rust>MyStruct{
<span class=visible-tab>████</span>field_with_long_name: (some_big_complex_variable_name
<span class=visible-tab>████</span> + another_big_complex_variable_name),
<span class=visible-tab>████</span>another_field: 123,
<span class=visible-tab>████</span>nested_struct: &NestedStruct{
<span class=visible-tab>████████</span>nested_struct_field: 456,
<span class=visible-tab>████</span>},
<span class=visible-tab>████</span>final_field: 123,
}
</code></pre></blog-code><p>The second is to align all the values, including the nested struct, and introduce a nested layer of tabs:</p><blog-code><pre class=language-rust>
<code class=language-rust>MyStruct{
<span class=visible-tab>████</span>field_with_long_name: (some_big_complex_variable_name
<span class=visible-tab>████</span> + another_big_complex_variable_name),
<span class=visible-tab>████</span>another_field: 123,
<span class=visible-tab>████</span>nested_struct: &NestedStruct{
<span class=visible-tab>████</span> <span class=visible-tab>████</span>nested_struct_field: 456,
<span class=visible-tab>████</span> },
<span class=visible-tab>████</span>final_field: 123,
}
</code></pre></blog-code><p>But what <tt>rustfmt</tt> produces is an indecisive and poorly formatted combo of the two – it doesn't even properly align the parenthesized expression after line-breaking it:</p><blog-code><pre class=language-rust>
<code class=language-rust>MyStruct {
<span class=visible-tab>████</span>field_with_long_name: (some_big_complex_variable_name
<span class=visible-tab>████████</span>+ another_big_complex_variable_name),
<span class=visible-tab>████</span>another_field: 123,
<span class=visible-tab>████</span>nested_struct: &NestedStruct {
<span class=visible-tab>████████</span>nested_struct_field: 456,
<span class=visible-tab>████</span>},
<span class=visible-tab>████</span>final_field: 123,
}
</code></pre></blog-code><p>I eventually gave up on trying to make the formatted rust-fuse code look pretty, and settled for "consistent".</p></blog-section></blog-section><blog-section><h2 slot=title>Cargo and crates.io</h2><p>While the Rust language feels carefully designed to combine the best parts of multiple popular and interesting languages, Rust's default build system (Cargo) and package repository (crates.io) are the opposite. They combine the worst parts of Cabal/Hackage and NPM, resulting in a user experience that is somehow inferior to both.</p><blog-section><h3 slot=title>Package naming</h3><p>crates.io has no namespacing. If a user uploads a package named <tt>fuse</tt>, that name is taken forever and no other person can upload a package named <tt>fuse</tt> unless the first developer transfers ownership. It so happens that someone did in fact upload <a href=https://crates.io/crates/fuse>crates.io/crates/fuse</a> in 2014 (last updated: 2017), which means I'm going to have to publish mine under some stupid codename or contrived <tt>rusty-libfuse-for-rust-lib</tt> nonsense.</p><p>How did this happen? It's not like package registries are a new invention – PyPI launched in 2003, and CPAN has been running since 1995 (!). NPM has had optional namespaces ("scopes") since at least 2014.</p><p>Go demonstrates how to do distributed package naming well. A Go package is identified by a hierarchical path rooted at a DNS domain, which both solves the issue of ownership (defer to DNS) and lets big shared hosting providers like GitHub cleanly subdivide their namespace. If Cargo had done the same we might have package names like <tt>"github.com/rust-lang/git2-rs"</tt>, which, while not <i>great</i>, at least avoids staking a claim on the very concept of Git.</p><p>But since crates.io is centralized, it can be terser than Go. Cargo could use crates.io as the <i>default</i>, using the presence of a period to distinguish non-default registries in <tt>Cargo.toml</tt>. And if you combined it with NPM's sigils, the official libgit2 binding <tt>"@rust/git2"</tt> could be registered at the same time as Jane Doe's experimental <tt>"~jdoe/git2"</tt> package, and could live on the same Internet as <tt>"example.com/rust-stuff/git2"</tt>. Everyone would have the chance to contribute their code to the commons under a reasonable name.</p></blog-section><blog-section><h3 slot=title>Single-crate packaging</h3><p>Cargo's unit of distribution is the crate, which is a problem because Rust's compilation unit is <i>also</i> the crate. Large libraries are easier to work on when parts of the build graph can be cached, but if you try to split up a library you pretty immediately run into problems:<ul><li>Cargo won't let you define more than one <tt>[lib]</tt> per <tt>Cargo.toml</tt>, so what would be a minor refactoring requires converting the project repository into a "workspace". As a side effect this breaks many common commands, for example <tt>cargo test</tt> must be replaced with <tt>cargo test --all</tt>.</li><li>Cargo can handle release archives containing multiple crates (via path dependencies), but crates.io rejects uploads containing crates with path deps. This leads to an explosion of crate registrations, as each project needs to upload its internal organs as separate packages.
Good luck with figuring out semver for <tt>mypkg-internal-macros</tt> – might as well version them all <tt>"v0.0.$(date +%s)"</tt>.</li></ul></p><p>When I was writing Rust without Cargo I was confused about why people complained about slow build times, but now I get it. Of course build times are slow if changing one line of a leaf file requires rebuilding dozens of modules. I've found Bazel and <a href=https://github.com/bazelbuild/rules_rust>rules_rust</a> provide a good alternative to Cargo, since Bazel can twist your build into any DAG you want, but most Rust users are unlikely to be excited about injecting 50MB of Java build system into the middle of their workflow.</p></blog-section></blog-article>2020-08-06T22:23:05ZCommentary on “Stop Using Encrypted Email”2020-02-22T03:40:51Zurn:uuid:1c2a16e0-f644-4b98-a69b-bb25da340852<style type=text/css scoped>li{margin:.5em 0}p,li{line-height:1.5}blockquote{border-left:10px solid #ccc;margin:.5em;padding:.1em .5em}</style><blog-article posted=2020-02-22T03:40:51Z><h1 slot=title>Commentary on “Stop Using Encrypted Email”</h1><div slot=tableofcontents></div><p>Latacora’s recent article <a href=https://latacora.micro.blog/2020/02/19/stop-using-encrypted.html>Stop Using Encrypted Email</a> prompted a lot of comments on <a href=https://tildes.net/~comp/m0o/stop_using_encrypted_email>Tildes</a> and <a href="https://news.ycombinator.com/item?id=22368888">Hacker News</a>, which is not surprising considering the author’s blunt approach to a delicate topic. I agree with its recommendations and also think it would be good if other people did so, thus, I’m going to try expanding on it a little. Think of this as comments from the peanut gallery.</p><p>In summary:</p><ul><li>Existing end-to-end encryption systems for email (including PGP and S/MIME) do not provide useful protection against interception.</li><li>Designing a useful end-to-end encryption system for email is infeasible due to the design of the SMTP protocol and the behavior of existing email clients.</li><li>If a message needs end-to-end encryption then it should not be sent via email. <a href=https://signal.org/>Signal</a> is a reasonable choice for sending encrypted messages.</li><li>If a message does not need end-to-end encryption then send it via normal unencrypted email.</li><li>Since encrypted email does not provide useful protection, and unencrypted email is easier to use, there is no reason to send encrypted email.</li></ul><blog-section><h2>“Email is unsafe”</h2><blockquote><p>Email is unsafe and cannot be made safe. The tools we have today to encrypt email are badly flawed. Even if those flaws were fixed, email would remain unsafe. Its problems cannot plausibly be mitigated. Avoid encrypted email.</p></blockquote><p>The basic problem of email (meaning specifically SMTP) is that it was never designed to be secure. The address must be sent in a form readable by the recipient’s mail server, and auxiliary data like the subject comes along for the ride because that’s just how the wire protocol works.</p><p>There have been two major security retrofits added to SMTP since the original <a href=https://tools.ietf.org/html/rfc821>RFC 821</a> was published in 1982:</p><ul><li>STARTTLS (2002) is intended to protect emails from being intercepted while being sent between mail servers. 
Unfortunately it is not considered a strong security layer because it doesn’t protect against man-in-the-middle<blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref> and is not widely adopted among smaller mail service operators<blog-footnote-ref>[<a href=#fn:2>2</a>]</blog-footnote-ref>.</li><li>Message-level encryption such as PGP (1991) and S/MIME (2002) is intended to (a) authenticate the sender of emails, and (b) protect the content of emails from interception by any third party (including the mail services). This is what people mean when they say “encrypted email”. Both PGP and S/MIME have design flaws that make them ineffective against interception<blog-footnote-ref>[<a href=#fn:3>3</a>]</blog-footnote-ref>, and these flaws cannot be plausibly fixed.</li></ul><p>Even if these two security-oriented additions worked as intended – if emails were protected in transit, and message bodies end-to-end – the design of SMTP still makes email unsuitable for secure communication. More on this below.</p></blog-section><blog-section><h2>“Most email encryption […] is performative”</h2><blockquote><p>Most email encryption on the Internet is performative, done as a status signal or show of solidarity. […] It doesn’t matter whether or not these emails are safe, which is why they’re encrypted so shoddily.</p></blockquote><p>It is common, among geeks of a certain age<blog-footnote-ref>[<a href=#fn:4>4</a>]</blog-footnote-ref>, to have a GnuPG key. In bygone times conferences and meetups would host “key-signing parties”, where people brought their laptops and photo IDs and solemnly cross-signed each other’s keys. Hex-formatted key fingerprints were inspected. I swear I am not making this up.</p><p>In retrospect none of it was real, or rather, none of it mattered. We never put much thought into the actual cryptography – the few GPG’d emails I sent could have been unpadded CBC for all I know. The point of signing your coworker’s GnuPG key was not to establish a secure comms channel. It was all about the ceremony: the grown-up version of children passing notes in Pig Latin.</p></blog-section><blog-section><h2>“PGP is a deeply broken system”</h2><blockquote><p>The least interesting problems with encrypted email have to do with PGP. <a href=https://latacora.micro.blog/2019/07/16/the-pgp-problem.html>PGP is a deeply broken system</a>. It was designed in the 1990s, and in the 20 years since it became popular, cryptography has advanced in ways that PGP has not kept up with.</p></blockquote><p>The early ’90s were not good years for computer security. A lot of primitives and protocols designed back then turned out to be either too complex (ASN.1) or not sophisticated enough (MD5, SSL v2/v3). PGP managed to be both at once, combining a challenging<blog-footnote-ref>[<a href=#fn:5>5</a>]</blog-footnote-ref> packet-oriented record format with a small list of supported ciphers.</p><p>The problems with PGP are not limited to the protocol, but extend to the usability and security properties of related projects such as GnuPG<blog-footnote-ref>[<a href=#fn:6>6</a>]</blog-footnote-ref> and the SKS keyserver network<blog-footnote-ref>[<a href=#fn:7>7</a>]</blog-footnote-ref>.
If starting from scratch, no contemporary security engineer would design something as slapdash as PGP.</p><blockquote><p>So, for example, it recently turned out to be possible for eavesdroppers to decrypt messages without a key, simply by tampering with encrypted messages.</p></blockquote><p>This is referencing the <a href=https://efail.de/>EFAIL</a> attack, in which most mail clients can be convinced to send the entire contents of an encrypted message to an arbitrary remote server.</p></blog-section><blog-section><h2>“All mainstream email software expects plaintext”</h2><blockquote><p>The foundations of electronic mail are plaintext. All mainstream email software expects plaintext. In meaningful ways, the Internet email system is simply designed not to be encrypted.</p><p>The clearest example of this problem is something every user of encrypted email has seen: the inevitable unencrypted reply. In any group of people exchanging encrypted emails, someone will eventually manage to reply in plaintext, usually with a quoted copy of the entire chain of email attached.</p></blockquote><p>Some people reading this might have GPG keys; a smaller set might even have exchanged GPG-encrypted emails with a contact. If you were to send an encrypted email, and receive a reply in plaintext that quoted your plaintext message, would you be upset?</p><p>If the answer is “probably not”, then what security value is GPG providing?</p><p>The strength of a security system is not only in the low-level parts, the ciphers and key sizes and so on, but in how it interacts with humans and guides them towards safety. A tool designed to send un-interceptable messages should make it difficult to accidentally leak the entire conversation. This is not true of PGP (as typically used<blog-footnote-ref>[<a href=#fn:8>8</a>]</blog-footnote-ref>), which is why PGP should be avoided.</p></blog-section><blog-section><h2>“Metadata is as important as content, and email leaks it”</h2><blockquote><p>Leave aside the fact that the most popular email encryption tool doesn’t even encrypt subject lines, which are message content, not metadata.</p><p>The email “envelope” that includes the sender, the recipient, and timestamps – is unencrypted and always will be. Court cases (and lists of arrest targets) have been won or lost on little more than this. Internet email creates a durable log of metadata, one that every serious adversary is already skilled at accessing.</p></blockquote><p>In security circles there is often talk of an “adversary”, which summarizes what the system is supposed to protect against. The adversary of a child-proof medicine bottle is a curious toddler. The adversary of a journalist reporting on organized crime is the criminal organization. The adversary of a group of anti-government rebels is the local government’s law enforcement agency. In each case the identity, capabilities, and goals of the adversary are used to decide what the system must protect against.</p><p>A typical “threat modeling” exercise might start off with something like this:</p><ul><li><b>Adversary:</b> a ten-year-old child</li><li><b>Goal:</b> find out what present they’re getting for Christmas</li><li><b>Capability:</b> physical access to parent’s locked iPhone</li></ul><p>In this case, organizing the purchase of a Christmas present over encrypted email is probably not worth the effort. A passcode on the phone, or at most some basic precautions (e.g. 
not putting the word ‘Christmas’ in the planning emails) is probably sufficient.</p><p>Alternatively:</p><ul><li><b>Adversary:</b> the United States federal government</li><li><b>Goal:</b> obtain evidence of contact between Jane Doe and John Smith</li><li><b>Capability:</b> can demand a full copy of Jane’s email account pursuant to an authorized search warrant</li></ul><p>In this case the use of encrypted email isn’t sufficient. The presence of emails between Jane and John satisfies the adversary’s goal, and they can always come back for round two by demanding (from Jane and/or John) the emails be decrypted. The best option here is a messaging system with encrypted metadata, forward secrecy, and an enforced retention period – none of which is a feature of encrypted email.</p><p>What adversary is PGP protecting against? It’s not clear. The adversary has access to the ciphertext (otherwise PGP never enters the picture), which implies some significant real-world power – to wiretap an ISP, or compromise one of the mail providers<blog-footnote-ref>[<a href=#fn:9>9</a>]</blog-footnote-ref>. But they don’t have the power to compel Jane or John to decrypt their side of the thread? Outside of a novel, this particular combination of capabilities and limitations is rare.</p></blog-section><blog-section><h2>“Stop using encrypted email”</h2><blockquote><p>There are reasons people use and like email. We use email, too! It’s incredibly convenient. You can often guess people’s email addresses and communicate with them without ever being introduced. Every computing platform in the world supports it. Nobody needs to install anything new, or learn how to use a new system. Email is not going away.</p><p>[…]</p><p>Stop using encrypted email.</p></blockquote><p>It’s important to remember that this advice to avoid encrypted email is specifically about avoiding <i>encrypted</i> email. It’s not a problem to send ordinary, non-sensitive messages over email. Some messages will get logged forever by some shadowy agency. Some will get STARTTLS, as a treat. So long as messages needing secure handling are sent only over secured channels, it’s fine to send the rest over email or iMessage or LINE or WeChat or whatever.</p><p>Just please don’t PGP it.</p></blog-section><blog-footnotes slot=footnotes><hr><ol><li id=fn:1><p>See <a href=https://blog.filippo.io/the-sad-state-of-smtp-encryption/>The sad state of SMTP encryption</a> and <a href=https://aykevl.nl/2017/10/smtp-starttls>Enforced STARTTLS for SMTP</a>. The implementation of <code>MX</code> DNS records reduces STARTTLS to more of a public hygiene benefit than a strong security layer.</p></li><li id=fn:2><p>Google’s transparency report summarizes adoption of <a href=https://transparencyreport.google.com/safer-email/overview>email encryption in transit</a>.</p></li><li id=fn:3><p>As far as I know they’re still OK for authentication of plain-text unencrypted messages, if for some reason this is your use case.</p></li><li id=fn:4><p>I will admit I have a GPG key, and it’s published to the keyservers. I used to have its public part on a page of this website, in case someone needed to contact me on urgent Snow Crash business. I have never attended a key-signing party, but there was a time when I <i>would</i> have if the opportunity arose.</p></li><li id=fn:5><p>Parsers are a rich source of security problems.
When designing a serialization format, especially for a security-critical protocol, “challenging” is a synonym for “bad”.</p></li><li id=fn:6><p><a href=https://blog.cryptographyengineering.com/2014/08/13/whats-matter-with-pgp/>What’s the matter with PGP?</a></p></li><li id=fn:7><p><a href="https://news.ycombinator.com/item?id=20315633">https://news.ycombinator.com/item?id=20315633</a></p></li><li id=fn:8><p>My own personal workflow for GPG was command-line based; I would copy ciphertext out of my mail client into a terminal, decrypt it, encrypt the response, and then copy it back. This workflow is atypical; most users of GPG interact with it via a mail client plugin such as <a href=https://enigmail.net/index.php/en/>Enigmail</a>, which transparently decrypts messages in the mail client.</p></li><li id=fn:9><p>Another difference between 1990 and 2020 is that mail providers – hosted services in general, really – are more professionalized now. Apple, Google, and Microsoft have security departments larger than the entire headcount of a mid-90s ISP. For someone who is worried about the security of their mail provider, the best solution is to switch to one of these large and well-regarded options.</p></li></ol></blog-footnotes></blog-article>2020-02-22T03:40:51ZBy any other CNAME2020-02-05T10:25:15Zurn:uuid:9980c159-7b44-4026-a48b-1da601dbd384<style type=text/css scoped>a>img{border:2px solid;padding:2px}blockquote{border-left:10px solid #ccc;margin:.5em;padding:.1em .5em}p{line-height:1.5em}</style><blog-article posted=2020-02-05T10:25:15Z><h1 slot=title>By any other CNAME</h1><p>When an engineer joins Google, they are issued a workstation – a physical computer in tower<blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref> form-factor that sits under their desk. Workstations have names, which the engineer gets to choose, and the fully-qualified hostname consists of this name plus an office-specific suffix. Workstations in Mountain View are in <code>.mtv.corp.google.com</code>, workstations in New York are in <code>.nyc.corp.google.com</code>, and so on – this is all tracked in various databases and synced to DNS. Intranet services that weren't office-specific, like the go/ URL shortener, were on <code>.corp.google.com</code> directly.</p><p><img src=https://john-millikin.com/by-sha256/08300bed6d03506eff8a806a4b27a2f8d9b572570fc8ed2a87f5ad918faaf1cc/google-hostnames.png style="margin:0 auto;display:block"></p><p>For convenience, Google uses a DNS feature named the <q>DNS search path</q> to let users reference workstations by short names. If I am sitting next to you, and I want to SSH into your workstation<blog-footnote-ref>[<a href=#fn:2>2</a>]</blog-footnote-ref>, I can type <code>ssh yourbox</code> instead of <code>ssh yourbox.mtv.corp.google.com</code> and it'll work. This wasn't only for workstation names; you could also use it for any sort of <code>.corp.google.com</code> name. In a browser you could type <code>http://go/somelink</code> and it would resolve to <code>http://go.corp.google.com/somelink</code>. In a PAM module you could <code>#define LDAP_ADDRESS "ldap"</code> and it would direct queries to <code>ldap.corp.google.com</code>. The halls rang with the sound of protobuf engineering, and all was at peace.</p><p>Some of you have noticed the problem.</p><p>One day I arrived at the office and discovered that I couldn't unlock my screen. This wasn't especially alarming, because nobody else in the building could log in and network outages happen sometimes.
But then news started trickling in over working comms<blog-footnote-ref>[<a href=#fn:3>3</a>]</blog-footnote-ref> that this wasn't a network outage. A few early risers had working desktop sessions, and the network was fine – only attempts to log in, SSH, or sudo were hanging.</p><p>Also, it was affecting the entire Mountain View campus.</p><p>The usual debugging process was followed with unusual haste and the issue was narrowed down to DNS. One of the new hires starting that day had requested their workstation be named <code>ldap</code>, per their initials, and as soon as that hostname hit the network it hijacked every LDAP client that had been configured to talk to <code>"ldap"</code>. Unfortunately that was a big 'every' because (1) if the wrong value is easier to type then it outcompetes the correct value, and (2) the chances of a misconfiguration being discovered are the inverse of how often it happens. So pretty much everything was broken.</p><p>This story has a happy ending because Google does regular disaster recovery tests. The tests are always something outlandish, like <q>aliens have invaded and all contact with California has been lost</q>, and everyone has a good laugh around the coffee robot. The recovery procedure for total DNS outage involves taking a laptop into the <i>panic room</i>, a locked room with a direct connection to the Prod network. This was done, the owners of the machine inventory were able to delete the bad record, and new safety checks were installed around the important hostnames.</p><blog-footnotes slot=footnotes><hr><ol><li id=fn:1><p>There was a brief experiment involving decommissioned Warp 19s, which were Google-designed rackmount machines with infamously sharp edges and the sound profile of a motorrad.</p></li><li id=fn:2><p>This is not as alarming as it sounds; workstations at Google are (were?) fairly untrusted, and it was common to SSH into a coworker's machine if your own was overloaded or doing software updates or whatever.</p></li><li id=fn:3><p>Engineers almost universally used IRC for quick conversations, team chatter, and coordinating incident response. Although Slack has some good points, I find myself missing good ol' port 6697 every time loading a channel makes my fan spin up.</p></li></ol></blog-footnotes></blog-article>2020-02-05T10:25:15ZSRE School: No Haunted Forests2018-11-01T06:19:20Zurn:uuid:e3ad5fad-6f1b-498b-b5c8-f4557d8b14d9<blog-article posted=2018-11-01T06:19:20Z><h1 slot=title>SRE School: No Haunted Forests</h1><div slot=tableofcontents></div><div style="float:left;margin:0 2em 2em 0"><img src=/sre-school/no-haunted-forests/322330_20181030192733_1.png style=max-width:400px><p style=margin-top:.5em><i>Engineer debugging a Puppet manifest (2018, colorized)</i></p></div><p>All industrial codebases contain bad code. To err is human, and situations get very human when you're staring down the barrel of a launch deadline. You've heard the euphemism <i>tech debt</i>, where like a car loan you hold a recurring obligation in exchange for immediate liquidity. But this is misleading: bad code is not merely overhead, it also reduces optionality for all teams that come in contact with it. Imagine being unable to get indoor plumbing because your neighbor has a mortgage!</p><p>Thus a better analogy for bad code is a haunted forest. Bad code negatively affects everything around it, so engineers will write ad-hoc scripts and shims to protect themselves from direct contact with the bad code. 
After the authors move to other projects, their hard work will join the forest.</p><p>Healthy engineering orgs do not tolerate the presence of haunted forests. When one is discovered you must move vigorously to contain, understand, and eradicate it.</p><p>Make this the motto of your team: No Haunted Forests!</p><blog-section><h2 slot=title>Identifying a Haunted Forest</h2><p>Not all intimidating or unmaintained codebases are haunted forests. It may be difficult for a newcomer to come up to speed on the code, or the code might be a stable implementation of some RFC. A couple rules of thumb to identify code worthy of a complete rewrite:</p><ul><li>Nobody at the company understands how the code should<blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref> behave.</li><li>It is obvious to everyone on the team<blog-footnote-ref>[<a href=#fn:2>2</a>]</blog-footnote-ref> that the current implementation is not acceptable.</li><li>The project's missing features or erroneous behavior is impacting other teams.</li><li>At least one competent engineer has attempted to improve the existing codebase, and failed for technical reasons.</li><li>The codebase is resistant to static analysis, unit testing, interactive debuggers, and other fundamental tooling.</li></ul></blog-section><blog-section><h2 slot=title>Haunted Environmentalists</h2><p>Fresh graduates often push for a rewrite at the first sign of complexity, because they've spent the last four years in an environment where codebase lifetimes are measured in weeks. After their first unsuccessful rewrite they will evolve into Junior Engineers, repeating the parable of <a href=https://www.chesterton.org/taking-a-fence-down/>Chesterton's Fence</a> and linking to that old Joel Spolsky thunkpiece about Netscape<blog-footnote-ref>[<a href=#fn:3>3</a>]</blog-footnote-ref>.</p><p>Be careful not to confuse this reactive anti-rewrite sentiment with true objections to your particular rewrite. Remind them that Joel wrote that when source control meant <a href=https://en.wikipedia.org/wiki/Concurrent_Versions_System>CVS</a>.</p></blog-section><blog-section><h2 slot=title>Clearing Haunted Forests</h2><p>Rewriting an existing codebase should be modeled as a special case of a migration. Don't try to replace the whole thing at once: systematize how users interact with the existing code, insert strong API boundaries between subsystems, and make changes intentionally.</p><p><b>User Interaction</b> will make or break your rewrite. You must understand what the touch-points are for users of the existing system, so you can maintain <a href=https://danluu.com/ui-compatibility/>UI Compatibility</a>. Often rewrites mandate some changes, so try to put them all near the start (if you know what the final state should be) or delay them to the end (when you can make it seem like a big-bang migration). If the user-facing changes are significant, see if you can arrange for separate opt-in and opt-out periods during which both interaction modes co-exist.</p><p><b>Subsystem API Boundaries</b> let you carve up the old system into chunks that are easier to reason about. Be fairly strict about this: run the components in separate processes, separate machines, or whatever is needed to guarantee that your new API is the only mechanism they have to communicate. Do this recursively until the components are small enough that rewriting them from scratch is tedious instead of frightening.</p><p><b>Intentional Changes</b> happen when the new codebase's behavior is forced to deviate from the old. 
At this point you should have a good idea which behavior, if either, is correct. If there's no single correct behavior, it's fine to settle for "predictable" or (in the limit) "deterministic". By making changes intentionally you minimize the chances of forced rollbacks, and may even be able to detect users depending on the old behavior.</p><p>Work incrementally. A good rewrite is valid and fully functional at any given checkpoint, which might be commits or nightly builds or tagged releases. The important thing is that you never get into a state where you're forced to roll back a functional part of the new system due to breakage in another part.</p></blog-section><blog-section><h2 slot=title>Common Features of Haunted Forests</h2><p>All bad code is bad in its own special way, but there are some properties that are especially likely to make it hard to refactor incrementally. These are generally programming styles that hide state, obscure control flow, or permit type confusion.</p><p><b>Hidden State</b> means mutable <a href=https://en.wikipedia.org/wiki/Global_variable>global variables</a> and <a href=https://en.wikipedia.org/wiki/Scope_(computer_science)#Dynamic_scoping>dynamic scoping</a>. Both of these inhibit a reader's understanding of what code will do, and force them to resort to logging or debuggers. They're like catnip for junior developers, who value succinct code but haven't yet been forced to debug someone else's succinct code at 3 AM on a Sunday.</p><p><b>Non-Local Control Flow</b> prevents a reader from understanding what path execution will take. In the old times this meant <code>setjmp</code> and <code>longjmp</code>, but nowadays you'll see it in the form of callbacks and event loops. Python's <a href=https://en.wikipedia.org/wiki/Twisted_(software)>Twisted</a> and Ruby's <a href=https://en.wikipedia.org/wiki/EventMachine>EventMachine</a> can easily turn into global callback dispatchers, preventing static analysis and rendering stack traces useless.</p><p><b>Dynamic Types</b> require careful and thoughtful programming practices to avoid turning into "type soup". Highly magical metaprogramming like <code>__getattr__</code> or <code>method_missing</code> is trivially easy to abuse in ways that make even trivial bug fixes too risky to attempt. Tooling such as <a href=http://mypy-lang.org/>Mypy</a> and <a href=https://flow.org/>Flow</a> can help here, but introducing them into an existing haunted forest is unlikely to have significant impact. Use them in the new codebase from the start, and they might be able to reclaim portions of the original code.</p><p><b>Distributed Systems</b> can become haunted forests through sheer size, if no single person is capable of understanding the entire API surface they provide. Note that microservices don't automatically prevent this, because merely splitting up a monolith turns the internal structure into API surface. Each of the above per-process issues has distributed analogues, for example S3 is global mutable state and JSON-over-HTTP is dynamically typed.</p></blog-section><blog-footnotes slot=footnotes><hr><ol><li id=fn:1><p>A codebase where nobody knows what behavior it <i>currently has</i> is materially different from one where nobody understands what behavior it <i>should have</i>. 
The former doesn't need to be rewritten, because you can grind its test coverage up and then safely refactor.</p></li><li id=fn:2><p>You will sometimes hear objections from people who have not worked directly on the bad code, but have opinions about it anyway. Let them know that they're welcome to help out and you can arrange for a temporary rotation into the role of Forest Ranger.</p></li><li id=fn:3><p>The <i>real</i> reason Netscape failed is they wrote a dreadful browser, then spent three years writing a second dreadful browser. The fourth rewrite (Firefox) briefly had a chance at being the most popular browser, until Google's rewrite of <a href=https://en.wikipedia.org/wiki/Konqueror>Konqueror</a> took the lead. The moral of this story: rewrites are a good idea if the new version will be better.</p></li></ol></blog-footnotes></blog-article>2018-11-01T06:19:20Z(More) Effective Go2018-08-05T15:13:29Zurn:uuid:8c12efbf-7f85-49dd-9fae-b79d71d51422<blog-article posted=2018-08-05T15:13:29Z><h1 slot=title>(More) Effective Go</h1><blog-section><h2 slot=title>Unbounded Iteration</h2><p>"Unbounded iteration" is when you need to iterate over a sequence without knowing its total length. For example, receiving rows from a database query or data chunks from an HTTP response. Other languages have a native concept of "iterators" such that iterating over an array and a stream uses the same syntax, but Go doesn't do this.</p><p>The best approach to unbounded iteration in Go is a callback:</p><blog-code syntax=go><pre>
func stream(cb func(int)) {
	for _, x := range []int{1, 2, 3} {
		cb(x)
		time.Sleep(time.Second)
	}
}

func main() {
	stream(func(x int) {
		fmt.Println(x)
	})
}
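</pre></blog-code><p>If the callback later needs to report failures or support early exit, the signature extends naturally. A sketch of such a variant (my illustration; an error-returning callback is one of several reasonable conventions):</p><blog-code syntax=go><pre>
// stream stops iterating as soon as the callback reports an error.
func stream(cb func(int) error) error {
	for _, x := range []int{1, 2, 3} {
		if err := cb(x); err != nil {
			return err
		}
		time.Sleep(time.Second)
	}
	return nil
}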
</pre></blog-code><p>Some developers new to Go may try to use a channel and background thread for unbounded iteration. Don't do this:</p><blog-code syntax=go><pre>
func stream() <-chan int {
	ch := make(chan int)
	go func() {
		defer close(ch)
		for _, x := range []int{1, 2, 3} {
			ch <- x
			time.Sleep(time.Second)
		}
	}()
	return ch
}

func main() {
	for x := range stream() {
		fmt.Println(x)
	}
}
</pre></blog-code><p>Threads and thread-safe communication are cheap in Go, but not free. They add runtime and mental overhead – you need to think about the lifetime of any temporary channels backing your loops, and make sure they get properly drained so their backing threads can terminate.</p><p>The channel approach is also more difficult to extend. If a callback needs to change to return an error, or a non-error early exit, it's straightforward to add a return type. Channels have no mechanism to return data from the receiver.</p></blog-section><blog-section><h2 slot=title>Option Interfaces</h2><p>Keyword arguments ("kwargs") are commonly used in other languages for passing optional parameters to a complicated API. Go doesn't have kwargs, and they can be awkward to imitate using an "options struct" because the receiver can't easily tell whether an option was explicitly set to its zero value:</p><blog-code syntax=go><pre>
type Options struct {
	ConcurrencyToken uint32
}

func Fetch(opts Options) {
	if opts.ConcurrencyToken == 0 {
		// is this fetch being run without a concurrency token? or did the
		// caller set a token, but it happens to be 0x00000000 ?
	}
}
</pre></blog-code><p>Option interfaces let the options themselves be defined by functions, so that presence/absence, validation, and complex defaults are expressed naturally (at the cost of increased boilerplate):</p><blog-code syntax=go><pre>
type Option interface {
	apply(*options)
}

type fnOption func(*options)

func (fn fnOption) apply(opts *options) { fn(opts) }

type options struct {
	concurrencyToken *uint32
}

func ConcurrencyToken(token uint32) Option {
	return fnOption(func(opts *options) {
		opts.concurrencyToken = &token
	})
}

func Fetch(opts ...Option) {
	appliedOpts := options{}
	for _, opt := range opts {
		opt.apply(&appliedOpts)
	}
	if appliedOpts.concurrencyToken == nil {
		// definitely being run without a concurrency token
	}
}
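</pre></blog-code><p>Callers then pass only the options they care about, and a zero-valued token is no longer ambiguous:</p><blog-code syntax=go><pre>
Fetch()                    // no concurrency token
Fetch(ConcurrencyToken(0)) // explicitly token 0x00000000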
</pre></blog-code></blog-section><blog-section><h2 slot=title>Prefer POSIX Flags</h2><p>Go's standard library contains a <code>flag</code> package for parsing command-line flags. It uses Plan 9 flag semantics, which are alien to advanced users with a Linux, UNIX, or Windows background (i.e. all of them).</p><p>The <a href=https://godoc.org/github.com/spf13/pflag><code>"github.com/spf13/pflag"</code></a> package is API-compatible with the stdlib <code>flag</code> package, has extra API for features like "short" flags, and can automatically import flag definitions from libraries that use the stdlib.</p><blog-code syntax=go><pre>
import (
	goflag "flag"
	flag "github.com/spf13/pflag"
)

func main() {
	flag.CommandLine.AddGoFlagSet(goflag.CommandLine)
	flag.Parse()
}
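</pre></blog-code><p>As a taste of the extra API, pflag's <code>*P</code> variants accept a one-letter "short" name alongside the long one:</p><blog-code syntax=go><pre>
// Usable as either --config-path or -c.
configPath := flag.StringP("config-path", "c", "", "path to the config file")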
</pre></blog-code></blog-section><blog-section><h2 slot=title>Dynamic Flag Defaults</h2><p>Sometimes you'll want a command-line flag with a default value that can't be hardcoded, like <code>--config-path</code> that defaults to somewhere in the user's home directory. A common pattern is to let the flag's zero value mean "use computed default", but this makes <code>--help</code> output less useful.</p><p>It's better to use a computed value for the flag's default at definition time, then (1) <code>--help</code> will show that default value and (2) code consuming the flag doesn't need to special-case it.</p><blog-code syntax=go><pre>
var configPath = flag.String("config-path", defaultConfigPath(), "[your wonderful documentation here]")

func defaultConfigPath() string {
	path := os.ExpandEnv("${HOME}/.config/my-client-config")
	if _, err := os.Stat(path); err == nil {
		return path
	}
	return ""
}
</pre></blog-code></blog-section><blog-section><h2 slot=title>Errors Should Include Stack Traces</h2><p>If your code constructs errors with <code>fmt.Errorf()</code> or similar standard library functions, you're implicitly dropping the stack trace of where that error happened. Prefer to use the <a href=https://godoc.org/github.com/pkg/errors><code>"github.com/pkg/errors"</code></a> package, which records stack traces when the error is created and can preserve them as explanatory text is added in callers.</p><p>Custom error types can also use this library to obtain and propagate stack traces:</p><blog-code syntax=go><pre>
import "github.com/pkg/errors"
type myCustomError struct {
code int32
trace errors.StackTrace
}
func (err *myCustomError) StackTrace() errors.StackTrace {
return err.trace
}
func fail(code int32) error {
trace := errors.New("").(stackTrace).StackTrace()
return &myCustomError{
code: code,
trace: trace[1:],
}
}
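</pre></blog-code><p>At the call site, <code>errors.Wrap()</code> adds context while preserving the original trace, and the <code>%+v</code> verb prints the whole thing. A small usage sketch (<code>fmt</code> import omitted, as in the snippets above):</p><blog-code syntax=go><pre>
if err := fail(3); err != nil {
	err = errors.Wrap(err, "loading config")
	fmt.Printf("%+v\n", err) // message, cause, and stack trace
}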
</pre></blog-code></blog-section><blog-section><h2 slot=title>Avoid Mutable Globals</h2><p>This is standard good programming practice, but I want to specifically call it out here because the Go standard library is full of these things. You must be careful.</p><p>For example, the <code>net/http</code> package has functions <code>Handle()</code>, <code>ListenAndServe()</code>, etc. that operate on <code>http.DefaultServeMux</code>. You don't want to use these. Prefer to explicitly create your own <code>*http.ServeMux</code> and pass it around as an explicit parameter. Then when you want to write tests you won't need to go back and figure out all the places you're poking at mutable global state.</p><blog-code syntax=go><pre>
// BAD
http.Handle("/foo", fooHandler)
http.ListenAndServe(":8080", nil)

// GOOD
mux := http.NewServeMux()
mux.Handle("/foo", fooHandler)
http.ListenAndServe(":8080", mux)
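</pre></blog-code><p>The payoff shows up in tests, where each test can build its own mux and server without touching global state. A sketch (reusing <code>fooHandler</code> from above; imports of <code>testing</code> and <code>net/http/httptest</code> omitted):</p><blog-code syntax=go><pre>
func TestFooHandler(t *testing.T) {
	mux := http.NewServeMux()
	mux.Handle("/foo", fooHandler)
	srv := httptest.NewServer(mux)
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/foo")
	if err != nil {
		t.Fatal(err)
	}
	resp.Body.Close()
}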
</pre></blog-code></blog-section><blog-section><h2 slot=title>Don't Mutate or Invalidate Parameters</h2><p>This is a hard rule for your public API. It's also a helpful rule to follow in private APIs, but you don't <i>need</i> to if you're willing to accept the risk of weird bugs.</p><p>A public function defined like <code>Listen(addrs []string)</code> shouldn't mutate or invalidate the value passed in for <code>addrs</code>:</p><blog-code syntax=go><pre>
// BAD!
func Listen(addrs []string) {
	for ii, addr := range addrs {
		addrs[ii] = addr + ":1234"
	}
}

// BAD!
func Listen(addrs []string) {
	addrs = append(addrs, "localhost:1234")
}
</pre></blog-code><p>If you need to make adjustments to a user-provided value, copy it first:</p><blog-code syntax=go><pre>
func Listen(addrs []string) {
	addrs = append([]string{}, addrs...)
	// safe
	addrs = append(addrs, "localhost:1234")
}
</pre></blog-code></blog-section></blog-article>2018-08-05T15:13:29ZError Beneath the WAVs2018-08-04T23:52:33Zurn:uuid:b4c81f06-faeb-4803-95a5-770b1bcf9aa6<blog-article posted=2018-08-04T23:52:33Z updated=2018-08-11T06:15:47Z><h1 slot=title>Error Beneath the WAVs</h1><div slot=summary><p>This is a follow-up to <a href=/🤔/why-i-ripped-the-same-cd-300-times>Why I Ripped The Same CD 300 Times</a>. By the end of that page I'd identified a fragment of audio data that could cause read errors even if it was isolated and burned to a fresh CD. Further testing yielded a "cursed WAV" that consistently prevents perfect rips on different brands of optical drive, ripping software, and operating system.</p><p><b>EDIT (2018-08-10):</b> It worked! With the power of the two-sheep LTR-40125S I can successfully rip the original discs, with bit-exact audio data and a matching AccurateRip report.</p><p><b>🐻 превед!</b> Arrived from IXBT? This is similar to the post «Магия чисел», but the source of errors is a little different. Please see the <a href=#cdda-vs-cd-rom>CDDA vs CD-ROM</a> section.</p></div><blog-section><h2 slot=title>“History became legend. Legend became myth.”</h2><div style="float:right;padding:0 0 0 2em"><img src=/🤔/error-beneath-the-wavs/gandalf.jpg></div><p>The root cause would have been forever mysterious and unknown to me, but for <a href="https://news.ycombinator.com/item?id=17658515">this Hacker News comment</a> by <a href="https://news.ycombinator.com/user?id=userbinator">userbinator</a>
<blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref>:</p><blockquote>It is likely "weak sectors", the bane of copy protection decades ago and of which plenty of detailed articles used to exist on the 'net, but now I can find only a few:<br><br><a href=http://ixbtlabs.com/articles2/magia-chisel/index.html>http://ixbtlabs.com/articles2/magia-chisel/index.html</a><br><a href=https://hydrogenaud.io/index.php/topic,50365.0.html>https://hydrogenaud.io/index.php/topic,50365.0.html</a><br><a href=http://archive.li/rLugY>http://archive.li/rLugY</a><br></blockquote><p>This page explores how "weak sectors" are caused by bad encoding logic in a CD burner. Probably what happened is the artist gave the factory a master on CD-R, which had been burned on a drive with affected firmware. The master contained the bad EFM encoding and was accurately duplicated into the pressed CDs.</p></blog-section><blog-section><h2 slot=title>Eight-to-Fourteen Modulation</h2><div style="float:left;padding:0 2em 1em 0"><div><img src=/🤔/error-beneath-the-wavs/efm.jpg style=max-width:350px></div></div><p>Physically a CD's data track is a spiral of "pits" and "lands", where at each clock cycle a transition is "1" and lack of transition is "0". Directly encoding data bytes in this format would cause some transitions to occur too quickly for a detector to track, so bytes are "stretched" to 14 bits using <a href=https://en.wikipedia.org/wiki/Eight-to-fourteen_modulation>eight-to-fourteen modulation</a>. The sequential "EFM codewords" are separated by three "merging bits", which are chosen by the writing device under two constraints:</p><ol><li>The bitstream may not have two consecutive 1s, or more than ten consecutive 0s.</li><li>The bitstream should avoid "<a href=https://en.wikipedia.org/wiki/DC_bias>DC bias</a>" by maintaining roughly equal counts of 1s and 0s.</li></ol><p>It appears that in the ~15 years between the optical disc's invention and the spread of home burning, knowledge of the EFM modulator's role in reducing DC bias was lost.</p><div style=clear:both></div><p><a href=https://patents.google.com/patent/US4603413>US patent US4603413</a> (granted 1986-07-29):</p><div style="float:right;padding:0 0 0 2em"><img src="/🤔/error-beneath-the-wavs/US4603413 figure 2.png" style=max-width:350px></div><blockquote><p>In order to maintain at least the minimum run length when the channel bits of successive symbols are merged into a single channel bit stream, at least two additional "merging bits" are added to the channel bits for each symbol. As a result of this, however, the digital sum value (DSV) of the channel bits of successive symbols may become appreciable, [...]</p><p>It has been found that under certain conditions, despite the addition of merging bits to minimize the d.c. 
unbalance (or DSV) of the channel bits, the DSV may become sufficiently significant to adversely affect read-out of the channel bits.</p></blockquote><div style=clear:both></div><p>Contrast with this quote from later material (circa 2002):</p><blockquote><p>The CD-Reader has trouble reading CD's with a high DSV, because (Not sure about this info, this is just an idea from Pio2001, a trusted source), the pits return little light when they are read.</p></blockquote></blog-section><blog-section><h2 slot=title>Come Sing Along With the Pirate Song</h2><div style="float:left;padding:0 2em 0 0"><iframe type=text/html width=320 height=180 src=https://www.youtube.com/embed/kY-pUxKQMUE frameborder=0></iframe></div><p>All of this would have remained an obscure detail of CD manufacturing until some Macrovision employee circa 2000 realized that consumer-grade CD burners didn't implement DSV scrambling. Instead of <a href=https://thenextweb.com/plugged/2016/03/30/amazon-bans-faulty-usb-c-cables-google-engineer-reviewed-hundreds/>responsibly reporting defective hardware</a>, they decided to build digital restrictions products around bit patterns known to trigger bad EFM modulation. The resulting data corruption could be used to detect pirated copies, with a minor side effect of <a href=https://forum.paradoxplaza.com/forum/index.php?threads/why-safedisc-2-is-a-flawed.4115/>preventing customers from using legally purchased software</a>.</p><p>The piracy scene named these difficult-to-modulate bit patterns "weak sectors".</p><div style=clear:both></div><p>userbinator's comment links to <a href=https://web.archive.org/web/20090603002402/http://sirdavidguy.coolfreepages.com/SafeDisc_2_Technical_Info.html>http://sirdavidguy.coolfreepages.com/SafeDisc_2_Technical_Info.html</a>, which contains a concrete example of a "weak sector" pattern:</p><div style="float:right;padding:0 0 0 2em"><img src=/🤔/error-beneath-the-wavs/dabab14de94000f608f6cf75f43aa5a26a6cb49d.png style=max-width:350px></div><blockquote><p>Feeding a regular bit pattern into the EFM encoder can cause a situation in which the merging bits are not sufficient to keep the DSV low. For example, if the EFM encoder were fed with the bit pattern "D9 04 D9 04 D9 04 D9 04 D9 04 D9 04 D9 04 D9 04 D9 04 D9 04 D9 04" [...]</p></blockquote><p>It also speculates that correct EFM modulation was abandoned due to performance concerns:</p><blockquote><p>The algorithm for calculating the merging bits is far too slow to be viable in an actual CD-Burner. Therefore, CD designers had to come up with their own algorithms, which are faster. The problem is, when confronted with the weak sectors, the algorithms cannot produce the correct the merging bits. This results in sectors filled with incorrect EFM-Codes. This means that every byte in the sector will be interpreted as a read error. The error correction is not nearly enough to correct every byte in the sector (obviously).</p></blockquote><p>This seems plausible: the early 2000s was a time of fierce competition between optical drive vendors, and speed was king. 
A drive that could only write at 4x would lose sales to its 16x competitors, even if its output was technically more correct.</p></blog-section><blog-section><h2 slot=title>Counting Sheep</h2><div style="float:left;padding:0 2em 1em 0"><img src=/🤔/error-beneath-the-wavs/sheep.png></div><p>The quality of a CD drive's EFM algorithm was of mostly academic interest when it came to the general population<blog-footnote-ref>[<a href=#fn:2>2</a>]</blog-footnote-ref>, but extremely important to pirates. As contemporary games' DRM schemes relied on more and more subtle properties of the physical media, pirates researched which drives were capable of duplicating them. A discerning pirate bought their CD burner based on its "sheep rating" (named after the <a href=https://en.wikipedia.org/wiki/CloneCD>CloneCD</a> mascot). Drives were "<a href="https://web.archive.org/web/20050307031140/http://club.cdfreaks.com/showthread.php?t=101608">sheep tested</a>" with data files generated by Alexander Noé's <a href=http://www.alexander-noe.com/weaksectors/index-eng.html>Weak Sector Utility</a>.</p><table><thead><tr><th>Sheep Rating</th><th>Capability</th></tr></thead><tbody><tr><td>0</td><td>Can't duplicate CDs containing weak sectors</td></tr><tr><td>1</td><td>Can duplicate CDs containing SafeDisc up to version 2.4.x</td></tr><tr><td>2</td><td>Can duplicate CDs containing SafeDisc up to version 2.5.x</td></tr><tr><td>3</td><td>Can duplicate CDs containing any possible weak sectors</td></tr></tbody></table><p>My next objective was to purchase a CD drive capable of burning and re-reading the original track. Such a drive would be able to rip the entire CD in one pass, thereby providing a clean rip log<blog-footnote-ref>[<a href=#fn:3>3</a>]</blog-footnote-ref>. But contemporary reviews of optical drives no longer include a sheep rating – copying a Blu-ray is a matter of cryptography rather than error correction gimmicks, and CD ripping doesn't drive ad impressions.</p><div style="float:right;padding:0 0 0 2em"><img src="/🤔/error-beneath-the-wavs/Lite-On LTR-40125S.jpg"></div><p>The best resource I found is <a href=https://web.archive.org/web/20050825194656/http://www.makeabackup.com/burners.html>makeabackup.com/burners.html</a>, which contains lists of optical drives categorized by sheep rating. The brand I found mentioned most commonly in archived piracy forums was Plextor, so I was surprised to see no Plextor drives on the 2-sheep list. Instead, I bought a Lite-On LTR-40125S<blog-footnote-ref>[<a href=#fn:4>4</a>]</blog-footnote-ref>. Once it arrives I'll expand this page with my findings (check back on 2018-08-11).</p><p><b>EDIT (2018-08-10):</b> It worked! With the power of the two-sheep LTR-40125S I can successfully rip the original discs, with bit-exact audio data and a matching AccurateRip report.</p><p>I found several references to "three-sheep burners" as a semi-mythical achievement, but no concrete evidence of such a device ever being sold. It's possible that the Yamaha CRW3200 in "Audio Master Quality" mode might have been able to duplicate certain discs at a three-sheep level, by writing physically larger data tracks at the cost of reduced capacity<blog-footnote-ref>[<a href=#fn:5>5</a>]</blog-footnote-ref>. 
Since the high DC bias manifests as tracking errors, a disc that's easier to track may be the solution.</p></blog-section><blog-section><h2 slot=title>Making WAVs</h2><p>If you've been following along at home with your favorite hex editor, you'll notice that <a href=https://github.com/jmillikin/john-millikin.com/blob/master/%F0%9F%A4%94/why-i-ripped-the-same-cd-300-times/minimal.flac>the "cursed" portion of the original audio</a>
<blog-footnote-ref>[<a href=#fn:6>6</a>]</blog-footnote-ref> has very long sequences with a <code>__ 0x04 __ 0x04</code> pattern. This matches the <code>0xD9 0x04 0xD9 0x04</code> sample above. Was there something special about 0x04? Was there a correlation between weak sectors and EFM patterns? To answer these questions I generated a synthetic test file, containing single <a href=https://en.wikipedia.org/wiki/Track_(optical_disc)#Audio_tracks>CDDA frames</a> full of a suspect pattern, joined by long runs of 0x00 padding<blog-footnote-ref>[<a href=#fn:7>7</a>]</blog-footnote-ref>. Burning and ripping the "cursed wav" verified that some of these patterns were sufficient on their own to cause rip errors<blog-footnote-ref>[<a href=#fn:8>8</a>]</blog-footnote-ref>.</p><p>I wasn't able to figure out how to predict the behavior of a particular byte pattern. Of the patterns tested, many were harmless (or at least didn't affect any of my drives). Stretches of identical bytes were also harmless, so it wasn't <i>just</i> repetition in play. Given the difficulty of measuring voltage levels in an optical drive's ICs, it's likely I'll never figure out exactly why particular patterns cause errors.</p><div style="float:left;padding:0 2em 0 0"><div><img src="/🤔/error-beneath-the-wavs/Screenshot from 2018-08-04 15-33-24.png" style=max-width:350px></div></div><blog-code syntax=python style=display:inline-block><pre>
# Python 2; the wave module is fed raw byte strings.
import contextlib, wave

FRAME_BYTES = 2352
SILENCE = "\x00" * (FRAME_BYTES * 3)

with contextlib.closing(wave.open("all-cursed.wav", "w")) as out:
    out.setnchannels(2)
    out.setsampwidth(2)
    out.setframerate(44100)
    out.writeframes(SILENCE)
    for b1 in range(0, 255):
        for b2 in range(0, 255):
            # One CDDA frame of the candidate pattern, then padding.
            frame = "%s\x01%s\x01" % (chr(b1), chr(b2))
            out.writeframes(frame * (FRAME_BYTES / len(frame)))
            out.writeframes(SILENCE)
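</pre></blog-code><p>For intuition about what "driving the DSV high" means: DSV is the running difference between time spent at the two signal levels, where each "1" channel bit is a transition. A toy sketch of the accumulation (in Go; illustrative only – real drives track this on the EFM channel bits, after merging bits are chosen):</p><blog-code syntax=go><pre>
// dsv computes the digital sum value of a channel-bit stream: each "1"
// toggles the pit/land level, and every clock period adds +1 or -1
// depending on the current level. A balanced stream stays near zero.
func dsv(channelBits []byte) int {
	sum, level := 0, 1
	for _, bit := range channelBits {
		if bit == 1 {
			level = -level
		}
		sum += level
	}
	return sum
}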
</pre></blog-code></blog-section><blog-section><h2 slot=title>CDDA vs CD-ROM</h2><p>The issue of repeated patterns causing silent data corruption appears to be specific to audio CDs (CDDA) – SafeDisc operated by detecting their presence, not by preventing the rip. However, silent data corruption could affect data CDs (CD-ROM) as documented by <a href="http://forum.ixbt.com/topic.cgi?id=31:17979">a forum post on IXBT</a> (summaries: <a href=https://www.ixbt.com/optical/magia-chisel.shtml>Russian</a>, <a href=http://ixbtlabs.com/articles2/magia-chisel/index.html>English</a>) – specific sequences of bytes can be mistaken for the sector synchronization header, and cause portions of individual files on an optical disc to become unreadable.</p><div style="float:right;padding:0 0 1em 2em"><img src=/🤔/error-beneath-the-wavs/scrambler1.png style=max-width:350px></div><blockquote><p>Over 2 / 3 of the drives tested fail to read the file if it contains a signature that turns into a data sequence identical to Sync Header! Except Toshiba and HP, all manufacturers use sync header as a key sign of the sector start at data reading.</p></blockquote><p>The behavior users see is a little different. For a data CD, the sequence must scramble to exactly match the sector sync header. In audio CDs, the responsible byte patterns must be repeated many times to drive the DSV value high enough to force read errors.</p><div style=clear:both></div></blog-section><blog-footnotes><hr><ol><li id=fn:1><p>"Only a few" was an understatement. What looks to have been a flourishing community of music archivists and game pirates has nearly vanished from the Internet, losing years of reports and research about how CD error handling works in real-world conditions. This writeup was made possible by Internet Archive.</p></li><li id=fn:2><p>A typical high-end hard drive of that era might store around 100 GB of data. Music was typically ripped to MP3 or AAC; there was no point in worrying about the exact value of bits fed into a lossy encoder.</p></li><li id=fn:3><p>All of my personal CD rips are archived along with their rip log, so I can verify the audio CRCs during backup tests.</p></li><li id=fn:4><p>This thing's a blast from the past. The CD ripping scene imploded before SATA had reached optical drives, so it's got PATA pins and one of those four-pin Molex power sockets.</p></li><li id=fn:5><p>I found several reviews praising how AMQ discs skipped less often, but no reviews specifically about its capabilities with weak sectors. This is unsurprising given that the capacity differences would render it unusable for software piracy.</p></li><li id=fn:6><p>I say "original", but it turns out that's wrong at this level of detail. I used <a href=https://www.audacityteam.org/>Audacity</a> to cut out that section, because I assumed it was capable of moving bytes from one lossless file to another without changing those bytes. Not so! Audacity will happily <i>mangle the shit</i> out of audio data. You can verify this by opening a .wav, writing it back out, and comparing the two files. It's also fun to look at the spectrogram of an "empty" audio file that got passed through Audacity.</p></li><li id=fn:7><p>I also ran tests with other padding bytes, to determine that the corruptions were caused by inter-frame copying (vs dumb "zero out bad bytes" logic).</p></li><li id=fn:8><p>A variant wrote patterns to separate short tracks. Ripping these with the Lite-On in EAC reported "OK" CRCs for all of the tracks, despite being very obviously mangled. 
This is alarming behavior from a ripping program designed to detect corrupt data. I suspect there's an edge condition in the cache busting that doesn't work right when tracks are very short.</p></li></ol></blog-footnotes></blog-article>2018-08-04T23:52:33ZWhy I Ripped The Same CD 300 Times2018-07-30T23:54:21Zurn:uuid:0d33134a-13e0-4a5c-83bc-9e9d698ca14c<blog-article posted=2018-07-30T23:54:21Z updated=2018-08-11T06:15:47Z><h1 slot=title>Why I Ripped The Same CD 300 Times</h1><div slot=summary><p>I collect music by buying physical CDs, digitizing them with <a href=http://www.exactaudiocopy.de>Exact Audio Copy</a>, and scanning the artwork. This is sometimes challenging if the CD was self-published in a limited run in a foreign country ten years ago. It is very challenging if the CDs have an innate defect that renders some tracks unreadable.</p><p>(Русский перевод: <a href=https://habr.com/post/418995/>Зачем я рипнул один компакт-диск 300 раз</a>)</p><p><b>UPDATE (2018-08-10):</b> See the follow-up post <a href=/🤔/error-beneath-the-wavs>Error Beneath the WAVs</a> for more info about what exactly is wrong with my discs, and info about which CD drives are capable of reading them. I got a perfect rip by upgrading to a "two-sheep" CD drive.</p></div><blog-section id=kaerubeki-shiro><h2 slot=title>帰るべき城</h2><img src=/🤔/why-i-ripped-the-same-cd-300-times/shiro_jacket.jpg style="float:left;margin:0 2em 2em 0"><p>The piano arrangement album <a href=https://altneuland.net/gallery/the-citadel-where-to-return/>帰るべき城</a> by <a href=https://altneuland.net/>Altneuland</a> was published in 2005. I discovered it in 2008 (probably on YouTube), downloaded the best copy I could find, and filed it away in the TODO list. Recent advances in international parcel forwarding technology let me buy a used copy last year, but when it arrived none of my CD drives could read track #3. This sort of thing is common when buying used CDs, especially if they need to transit a USPS international shipping center. I shelved it and kept on the lookout for another copy, which I located last month. It arrived on Friday, I immediately tried to rip it, and hit the <i>exact same error</i>. This didn't seem to be an issue of wear or damage – the CD itself was probably defective from the factory.</p><p>I had three choices: accept an imperfect rip in my archives<blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref>, hope to find another copy some day that would rip successfully (unlikely), or somehow regenerate the original audio data from my corrupt copies. You already know which branch I took.</p></blog-section><blog-section><h2 slot=title>How Ripping Works</h2><div style="float:right;padding:0 0 2em 2em"><img src="/🤔/why-i-ripped-the-same-cd-300-times/Screenshot from 2018-07-30 09-20-48.png" style=max-width:350px><p style=margin-top:0><i>EAC failing to read track #3 of 「帰るべき城」</i></p></div><p>CDs store digital data, but the interface between CDs, lasers, and optical diodes is very analog. Read errors can be caused by anything from dirty media, to scratches on the protective polycarbonate layer, to vibration from the optical drive itself. The primitive error correction codes in the <a href=https://en.wikipedia.org/wiki/Compact_Disc_Digital_Audio>CDDA standard</a>, designed to minimize audible distortions on lightly used disks, are not capable of fully recovering the bitstream on CDs with a significant error rate. 
Contemporary CD ripping software works around this with two important error detection techniques: redundant reads and AccurateRip.</p><p>The page <a href=http://www.exactaudiocopy.de/en/index.php/overview/basic-technology/extraction-technology/>EAC: Extraction Technology</a> describes EAC's approach to redundant reads:</p><blockquote><p>In secure mode this program either reads every audio sector at least twice [...] If an error occurs (read or sync error), the program keeps on reading this sector, until eight of 16 retries are identical, but at maximum one, three or five times (according to the selected error recovery quality) these 16 retries are read. So, in the worst case, bad sectors are read up to 82 times!</p></blockquote><p>Simple enough. If a read request sometimes returns bad data, read everything twice, and then be extra careful if the first two reads didn't match. <a href="https://wiki.hydrogenaud.io/index.php?title=AccurateRip">AccurateRip</a> is the same principle, but distributed – it's a service to which rippers can submit checksums of their ripped audio files. The idea is that if you rip a track and see that a thousand other people got the same bits for the same track, then your rip was probably good.</p><p>This article is about what happens when both techniques fail. EAC can't make progress if every single read returns different data, and because the disc is rare, the AccurateRip database only had a single entry<blog-footnote-ref>[<a href=#fn:2>2</a>]</blog-footnote-ref>.</p></blog-section><blog-section><h2 slot=title>“I walked ten thousand aisles, ten thousand aisles to see you”</h2><div style="max-width:350px;float:left;margin:0 2em 2em 0"><img src=/🤔/why-i-ripped-the-same-cd-300-times/IMG_20180730_123418.jpg style=max-width:350px><p style=margin-top:0><i>Optical drives from Asus, LG, Lite-On, Pioneer, and an unknown OEM</i></p></div><p>A practical solution to CDs that won't rip is to use a different drive. Sometimes a particular model is especially lenient with the CDDA spec or has better error correction firmware or whatever. The DBpoweramp forums maintain a <a href=https://forum.dbpoweramp.com/showthread.php?37706-CD-DVD-Drive-Accuracy-List-2016>CD/DVD Drive Accuracy List</a> for rippers to select a good drive.</p><p>On Saturday morning I bought five new CD drives from different brands<blog-footnote-ref>[<a href=#fn:3>3</a>]</blog-footnote-ref>, tried rips on all of them, and found one that could maintain sync through the broken track. Unfortunately the confirmation rip failed to verify – there were about 20,000 bytes different between each rip.</p><p>But now I had a .wav file sitting on disk, and a way to get more of them. Reasoning that the read errors on a bad disk will fluctuate around the "correct" value, I figured I'd rip it a couple times, find the most "voted" value for unstable bytes, and use that as a correction. 
This approach was ultimately successful, but was far more work than I expected.</p></blog-section><blog-section><h2 slot=title>“Quantity has a quality all its own”</h2><div style="max-width:350px;float:right;margin:0 2em"><div><a href="/🤔/why-i-ripped-the-same-cd-300-times/Screenshot from 2018-07-30 15-24-15.png"><img src="/🤔/why-i-ripped-the-same-cd-300-times/Screenshot from 2018-07-30 15-24-15.png" style=max-width:350px></a></div><p style=margin-top:0><i>Corrected and uncorrectable errors per rip count</i></p></div><p>I started by ripping one of the CDs repeatedly, recording all the values for each byte, and declaring an error "correctable" if more than half of the rips had used a particular byte value at that position. Initial behavior was good: the number of uncorrectable errors dropped from ~6900 bytes at N=4 to ~5000 bytes at N=10. The per-rip benefit slowly decreased over time, until at around N=80 the uncorrectable error count stabilized at ~3700. I stopped ripping at N=100.</p><div style=clear:both></div><div style="max-width:350px;float:right;margin:0 2em"><div><a href="/🤔/why-i-ripped-the-same-cd-300-times/Screenshot from 2018-07-30 15-24-20.png"><img src="/🤔/why-i-ripped-the-same-cd-300-times/Screenshot from 2018-07-30 15-24-20.png" style=max-width:350px></a></div><p style=margin-top:0><i>Same, but for two disks with cross-checked corrections</i></p></div><p>Next I tried ripping the second CD 100 times and using the two correction maps to "fill in" uncorrectable error positions in the other disk. This was a failure: each disk had thousands of corrections that disagreed with corrections on the other disk! It turns out you can't fix noisy data by intersecting it with a different-but-related noise source.</p></blog-section><blog-section><h2 slot=title>Arts and Crafts</h2><div style="max-width:350px;float:left;margin:0 2em 2em 0"><img src=/🤔/why-i-ripped-the-same-cd-300-times/IMG_20180730_104806.jpg style=max-width:350px></div><p>The EAC site has another nice resource: the <a href=http://www.exactaudiocopy.de/en/index.php/other-projects/dae-quality/>DAE Quality Test</a>, which quantifies the error correction capability of an optical disk drive's firmware. This is a different, lower-level type of error handling that can <i>fix</i> read errors instead of merely reporting them. The catch is that EAC's "secure mode" works by disabling or avoiding this built-in correction code, on the assumption that it doesn't work well.</p><p>The test is prepared by burning a provided waveform to a CD-R, cutting some divots in the data surface, then carefully coloring part of it with black marker. That's it – guaranteed unrecoverable errors in a deterministic pattern.</p><p>I ran the test on all of the drives, obtaining two interesting results:</p><div style=clear:both></div><div style="float:left;margin:0 2em 2em 0"><img src=/🤔/why-i-ripped-the-same-cd-300-times/dae-liteon.png style=max-width:350px></div><p>The Lite-On drive here is what I used to get past the sync error. It happily chews through the magic marker, but gets <i>really</i> confused by straight lines cut in the data surface. You can see how what should be three distinct peaks on the right get merged into one giant error blob.</p><pre>
Errors total Num : 206645159
Errors (Loudness) Num : 965075 - Avg : -21.7 dB(A) - Max : -5.5 dB(A)
Error Muting Num : 154153 - Avg : 99.1 Samples - Max : 3584 Samples
Skips Num : 103 - Avg : 417.3 Samples - Max : 2939 Samples
Total Test Result : 45.3 points (of 100.0 maximum)
</pre><div style=clear:both></div><div style="float:left;margin:0 2em 2em 0"><img src=/🤔/why-i-ripped-the-same-cd-300-times/dae-pioneer.png style=max-width:350px></div><p>The Pioneer drive scored the highest on the DAE test. To my eye the chart doesn't look like anything special, but the analysis tool judged it the best error-correction firmware in my little fleet.</p><pre>
Errors total Num : 2331952
Errors (Loudness) Num : 147286 - Avg : -77.2 dB(A) - Max : -13.2 dB(A)
Error Muting Num : 8468 - Avg : 1.5 Samples - Max : 273 Samples
Skips Num : 50 - Avg : 6.5 Samples - Max : 30 Samples
Total Test Result : 62.7 points (of 100.0 maximum)
</pre></blog-section><blog-section><h2 slot=title>“At some point numbers do count”</h2><div style="max-width:350px;float:right;margin:0 2em"><div><a href="/🤔/why-i-ripped-the-same-cd-300-times/Screenshot from 2018-07-30 16-03-18.png"><img src="/🤔/why-i-ripped-the-same-cd-300-times/Screenshot from 2018-07-30 16-03-18.png" style=max-width:350px></a></div><p style=margin-top:0><i>Corrected and uncorrectable errors per rip count (Pioneer)</i></p></div><p>How can I use the Pioneer's good innate error handling when EAC's "secure mode" works by bypassing a drive's error logic? That's easy, switch EAC to "burst mode" and let it write the bits to disk just as the firmware reported them. How can we turn that heap of unchecked wavs into a file of "secure mode" quality? The same error analysis tooling built for the Lite-On rips!</p><p>A few EAC config tweaks and another hundred rips later, we get this beautiful chart. A few things to note:</p><ul><li>The uncorrectable bit errors quickly approach zero, but never quite get there.</li><li>There's a huge jump in corrected errors at the 53rd or 54th rip.</li><li>The error counts before and after that big jump have some flat areas, indicating areas of stability in the ripped data.</li></ul></blog-section><blog-section><h2 slot=title>0xA595BC09</h2><p>Using the nearly-perfect correction data from the Pioneer, I generated a "best guess" file and started comparing it to the Pioneer rips. As expected there were some bad outliers, which I fixed by ripping ten more times:</p><blog-code syntax=commands><pre>
for RIP_ID in $(seq -w 1 100); do echo -n "rip$RIP_ID: "; cmp -l analysis-out.wav rips-cd1-pioneer/rip${RIP_ID}/*.wav | wc -l ; done | sort -rgk2 | head -n 10
# rip054: 2865
# rip099: 974
# rip007: 533
# rip037: 452
# rip042: 438
# rip035: 404
# rip006: 392
# rip059: 381
# rip043: 327
# rip014: 323
</pre></blog-code><p>I also found something really interesting: a handful of rips had come out with <i>exactly</i> the same audio content! Remember that this is what the EAC "secure mode" is designed to test for as a success criterion. The <code>shncat -q -e | rhash --printf="%C"</code> snippet below is used to calculate the CRC32 checksum of the raw audio data, and it's what EAC uses.</p><blog-code syntax=commands><pre>
for wav in rips-cd1-pioneer/*/*.wav; do shncat "$wav" -q -e | rhash --printf="%C $wav\n" - ; done | sort -k1
# [...]
# 9DD05FFF rips-cd1-pioneer/rip059/rip.wav
# 9F8D1B53 rips-cd1-pioneer/rip072/rip.wav
# A2EA0283 rips-cd1-pioneer/rip082/rip.wav
# A595BC09 rips-cd1-pioneer/rip021/rip.wav
# A595BC09 rips-cd1-pioneer/rip022/rip.wav
# A595BC09 rips-cd1-pioneer/rip023/rip.wav
# A595BC09 rips-cd1-pioneer/rip024/rip.wav
# A595BC09 rips-cd1-pioneer/rip025/rip.wav
# A595BC09 rips-cd1-pioneer/rip026/rip.wav
# A595BC09 rips-cd1-pioneer/rip027/rip.wav
# A595BC09 rips-cd1-pioneer/rip028/rip.wav
# A595BC09 rips-cd1-pioneer/rip030/rip.wav
# A595BC09 rips-cd1-pioneer/rip031/rip.wav
# A595BC09 rips-cd1-pioneer/rip040/rip.wav
# A595BC09 rips-cd1-pioneer/rip055/rip.wav
# A595BC09 rips-cd1-pioneer/rip058/rip.wav
# AA3B5929 rips-cd1-pioneer/rip043/rip.wav
# ABAAE784 rips-cd1-pioneer/rip033/rip.wav
# [...]
</pre></blog-code><p>Setting that aside for now, re-ripping the outliers let the analysis complete with zero uncorrectable errors. And when I checked that file, it had the same audio content as the "common" rip!</p><p>I am 99% confident that I have correctly digitised this troublesome CD, with 0xA595BC09 being the correct CRC of track #3.</p></blog-section><blog-section><h2 slot=title>UPDATE: A Perfect Rip</h2><p>After a Hacker News comment pointed me in the right direction (see <a href=/🤔/error-beneath-the-wavs>Error Beneath the WAVs</a>), I bought another CD drive (this one's drive #6) that was reported to have better handling of this particular problem. It was able to successfully rip the original disc with no issues.</p><p>At last, victory!</p><img src="/🤔/why-i-ripped-the-same-cd-300-times/Screenshot from 2018-08-10 22-31-33.png"><pre>
Track 3
Filename C:\Archive\アルトノイラント - 帰るべき城 [ANL-001]\03 - The End of Theocratic Era.wav
Pre-gap length 0:00:02.00
Peak level 100.0 %
Extraction speed 8.2 X
Track quality 100.0 %
Test CRC A595BC09
Copy CRC A595BC09
Accurately ripped (confidence 1) [84B9DD1A] (AR v2)
Copy OK
</pre></blog-section><blog-section><h2 slot=title>Appendix A: compare.rs</h2><p>This is the tool I used to calculate suspected byte errors. It wasn't intended to live long so it's a bit ugly, but may be of interest to anyone who stumbles across this page with the same goal.</p><blog-code syntax=rust><pre>
// Usage: compare RIP1.wav RIP2.wav [RIP3.wav ...]
// Memory-maps every input file (all must be the same size), scans them
// in 1 MiB chunks across worker threads, and prints (as JSON) each byte
// position where any two inputs disagree, with each file's value there.
extern crate memmap;

use std::cmp;
use std::collections::HashMap;
use std::env;
use std::fs;
use std::sync;
use std::sync::mpsc;
use std::thread;

use memmap::Mmap;

const CHUNK_SIZE: usize = 1 << 20;

fn suspect_positions(
	mmaps: &HashMap<String, Mmap>,
	start_idx: usize,
	end_idx: usize,
) -> Vec<usize> {
	let mut positions = Vec::new();
	for ii in start_idx..end_idx {
		let mut first = true;
		let mut byte: u8 = 0;
		for (_file_name, file_content) in mmaps {
			if first {
				byte = file_content[ii];
				first = false;
			}
			else if byte != file_content[ii] {
				positions.push(ii);
				break;
			}
		}
	}
	positions
}

fn main() {
	let mut args: Vec<String> = env::args().collect();
	args.remove(0);
	let mut first = true;
	let mut size: usize = 0;
	let mut files: Vec<fs::File> = Vec::new();
	let mut mmaps: HashMap<String, Mmap> = HashMap::new();
	for filename in args {
		let file = fs::File::open(&filename).unwrap();
		files.push(file);
		let mmap = unsafe { Mmap::map(files.last().unwrap()).unwrap() };
		if first {
			first = false;
			size = mmap.len();
		} else {
			assert!(size == mmap.len());
		}
		mmaps.insert(filename, mmap);
	}
	let (suspects_tx, suspects_rx) = mpsc::channel();
	let mut start_idx = 0;
	let mmaps_ref = sync::Arc::new(mmaps);
	loop {
		let t_start_idx = start_idx;
		let t_end_idx = cmp::min(start_idx + CHUNK_SIZE, size);
		if start_idx == t_end_idx {
			break;
		}
		let mmaps_ref = mmaps_ref.clone();
		let suspects_tx = suspects_tx.clone();
		thread::spawn(move || {
			let suspects = suspect_positions(mmaps_ref.as_ref(), t_start_idx, t_end_idx);
			suspects_tx.send(suspects).unwrap();
		});
		start_idx = t_end_idx;
	}
	drop(suspects_tx);
	let mut suspects: Vec<usize> = Vec::with_capacity(size);
	for mut suspects_chunk in suspects_rx {
		suspects.append(&mut suspects_chunk);
	}
	suspects.sort();
	println!("{{\"files\": [");
	let mut first_file = true;
	for (file_name, file_content) in mmaps_ref.iter() {
		let file_comma = if first_file { "" } else { "," };
		first_file = false;
		println!("{}{{\"name\": \"{}\", \"suspect_bytes\": [", file_comma, file_name);
		for (ii, position) in suspects.iter().enumerate() {
			let comma = if ii == suspects.len() - 1 { "" } else { "," };
			println!("[{}, {}]{}", position, file_content[*position], comma);
		}
		println!("]}}");
	}
	println!("]}}");
}
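</pre></blog-code><p>The majority-vote step itself isn't part of this tool. A minimal sketch of the idea (in Go, and illustrative only, not the code I actually ran):</p><blog-code syntax=go><pre>
// vote returns the majority value at a suspect byte position, if any
// single value appears in more than half of the rips.
func vote(rips [][]byte, pos int) (byte, bool) {
	counts := make(map[byte]int)
	for _, rip := range rips {
		counts[rip[pos]]++
	}
	for value, n := range counts {
		if n*2 > len(rips) {
			return value, true
		}
	}
	return 0, false
}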
</pre></blog-code></blog-section><blog-footnotes><hr><ol><li id=fn:1><p>I've had a couple people ask "what's so special about track #3 that you would go through this?". Funny thing – track #3 was the only non-piano track on this album, and I actually don't like it very much! I will probably not listen to it often, or ever! This effort was not about the music itself, but about making a perfect copy of an ephemeral physical artifact.</p></li><li id=fn:2><p><s>That single AccurateRip entry for this album matched my CRCs for all tracks <i>except</i> track #3 – they had 0x84B9DD1A, vs my result of 0xA595BC09. I suspect that original ripper didn't realize their disk was bad.</s><ul><li>My original footnote here was wrong. The AccurateRip report is correct (and matches my result) – I didn't realize AccurateRip uses a different checksum than EAC. The perfect rip reports an AccurateRip match on all tracks.</li></ul></p></li><li id=fn:3><p>The obvious question when buying a CD- or DVD-ROM drive, here in the year 2018, is "lol where?". And I didn't want just one, I wanted <i>several</i>, from <i>different brands</i>. There is only one bricks-and-mortar store I know of that would have an inventory of 5.25" DVD drives. Only one that's big enough to spare the shelf space but crufty enough that they wouldn't be out of place. I speak, of course, of Fry's Electronics.</p></li></ol></blog-footnotes></blog-article>2018-07-30T23:54:21ZEffective gRPC2018-07-02T05:04:40Zurn:uuid:dcfa7893-97cf-4c18-b594-1e232ad48b70<blog-article posted=2018-07-02T05:04:40Z><h1 slot=title>Effective gRPC</h1><p slot=summary>This page documents habits and styles I've found useful when working with <a href=https://grpc.io/>gRPC</a> and <a href=https://en.wikipedia.org/wiki/Protocol_Buffers>Protocol Buffers</a>.</p><blog-section><h2 slot=title>gRPC</h2><blog-section><h3 slot=title>Error Reporting</h3><p>Use the <code>google.rpc.Status</code> message to report errors back to clients – this type should be special-cased by the gRPC library for your language (e.g. grpc-go has <a href=https://godoc.org/google.golang.org/grpc/status><code>"google.golang.org/grpc/status"</code></a>). This message can contain arbitrary sub-messages, so servers can offer basic error messages to all clients and structured errors to clients that can handle them.</p><p>See <a href=https://github.com/googleapis/googleapis/blob/master/google/rpc/code.proto><code>google/rpc/code.proto</code></a> for details on the meaning of each error code, and the <a href=https://cloud.google.com/apis/design/errors>Google Cloud Error Model</a> for good advice on how to write error messages.</p></blog-section><blog-section><h3 slot=title>Deadlines and Timeouts</h3><p>Server-side handlers should always propagate deadlines. Clients should almost always set deadlines. Prefer deadlines to timeouts, because the meaning of an absolute timestamp is less ambiguous than a relative time when working across a network boundary.</p><p>Depending on your implementation library, it may be possible to define default timeouts in the service schema. Don't do this – the schema author cannot predict what behavior will be appropriate for all implementations or users.</p></blog-section><blog-section><h3 slot=title>Addresses</h3><p>Always represent and store gRPC addresses as a full string, following the URL-like syntax used by <a href=https://github.com/grpc/grpc/blob/master/doc/naming.md>gRPC Name Resolution</a>. 
Restrictive formats like "IP+port tuple" will annoy users who want to run your code as part of a larger framework or integration test, which may have its own ideas about network addresses.</p><p>Let addresses be set in a command-line flag or config file, so users can configure them without having to patch your binary. Do this even if you're really really sure the entire world wants to run your service on port 80.</p></blog-section><blog-section><h3 slot=title>Streaming</h3><p>gRPC supports uni-directional and bi-directional message streams. Use streams if the amount of data being transferred is potentially large, or if the other side can meaningfully process data before the input has been fully received. For example, a service offering a SHA256 method could hash the input chunks as they arrive, then send back the final digest when the client closes the request stream.</p><p>Streaming is more efficient than sending a separate RPC for each chunk, but less efficient than a single RPC with all chunks in a repeated field. The overhead of streaming can be minimized by using a batched message type.</p><blog-code syntax=protobuf><pre>
service Foo {
  rpc MyStream(FooRequest) returns (stream MyStreamItem);
}
message MyStreamItem {
  repeated MyStreamValue values = 1;
}
message MyStreamValue {
  // ... fields for each logical value
}
</pre></blog-code><p><b>WARNING:</b> In some implementations (e.g. grpc-go), the stream handles are not thread-safe even if the client stub is. Interacting with a stream handle from multiple threads may cause unpredictable behavior, including silent message corruption.</p></blog-section><blog-section><h3 slot=title>Request / Response Types</h3><p>Each method in your service should have its own Request and Response messages.</p><blog-code syntax=protobuf><pre>
service Foo {
  rpc Bar(BarRequest) returns (BarResponse);
}
message BarRequest { ... }
message BarResponse { ... }
</pre></blog-code><p>Don't use the same message for multiple methods unless they're literally implementing the same method with a different API (e.g. unary and streaming variants accepting the same request). Even then, prefer a different type for the part of the API that may vary.</p><blog-code syntax=protobuf><pre>
service Foo {
  rpc Bar(BarRequest) returns (BarResponse);
  rpc BarStream(BarRequest) returns (stream BarResponseStreamItem);
}
message BarRequest { ... }
message BarResponse { ... }
message BarResponseStreamItem { ... }
</pre></blog-code><p><b>WARNING:</b> Do not use <code>google.protobuf.Empty</code> as a request or response type. The API documentation in <a href=https://github.com/google/protobuf/blob/master/src/google/protobuf/empty.proto><code>google/protobuf/empty.proto</code></a> is an anti-pattern. If you use Empty, then adding fields to your request/response will be a breaking API change for all clients and servers.</p></blog-section></blog-section><blog-section><h2 slot=title>Protobuf</h2><blog-section><h3 slot=title>Package Names</h3><p>Use a package name including your project name, company (if applicable), and <a href=https://semver.org/>Semantic Versioning</a> major version. The exact format depends on personal taste – popular formats include <a href=https://en.wikipedia.org/wiki/Reverse_domain_name_notation>reverse domain name notation</a> as used in Java, or <code>$COMPANY.$PROJECT</code> as used by core gRPC types.</p><ul><li><code>com.mycompany.my_project.v1</code></li><li><code>com.mycompany.MyProject.v1</code></li><li><code>mycompany.my_project.v1</code></li></ul><p>API versions that are not fully stabilized should have a version suffix like <code>v1alpha</code>, <code>v2beta1</code>, or <code>v3test</code> – see the <a href=https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning>Kubernetes API versioning policy</a> for more thorough guidance.</p><p>Protobuf package names are used in generated code, so try to avoid name components that are commonly used for built-in types or keywords (like <code>return</code> or <code>void</code>). This is especially important for generating C++, which (as of protobuf 3.6) does not have a <code>FileOption</code> to override the default <code>namespace</code> name calculation.</p></blog-section><blog-section><h3 slot=title>Import Paths</h3><p>Try to structure your proto file's on-disk layout so that <code>import</code> paths match the package name: types in <code>mycompany.my_project.v1</code> should be imported with <code>import "mycompany/my_project/v1/some_file.proto"</code>. This is not required by the Protobuf toolchain, but does help humans remember what to type.</p><p>Note that if you're using Bazel's built-in <code>proto_library()</code> rule, it doesn't currently support adjusting the import paths (<a href=https://github.com/bazelbuild/bazel/issues/3867>bazelbuild/bazel#3867</a>). Until that feature is implemented, you'll need to either write your own <code>proto_library</code> in Starlark, or simply put the .proto sources in the desired directory structure.</p></blog-section><blog-section><h3 slot=title>Next-Number Comments</h3><p>In large protobuf messages, it can be annoying to figure out which field number should be used for new fields. To simplify the life of future editors, add a comment at the end of your messages and enums.</p><blog-code syntax=protobuf><pre>
message MyMessage {
  // ... lots of fields here ...
  // NEXT: 42
}
</pre></blog-code></blog-section><blog-section><h3 slot=title>Enums</h3><p>Enum symbol scoping follows old-style C/C++ rules, so that the defined names are not scoped to the enum name:</p><blog-code syntax=protobuf><pre>
// symbol `FUN_LEVEL_HIGH` is of type `FunLevel`.
enum FunLevel {
  FUN_LEVEL_UNKNOWN = 0;
  FUN_LEVEL_LOW = 1;
  FUN_LEVEL_HIGH = 2;
  // NEXT: 3
}
</pre></blog-code><p>This can be awkward for users accustomed to languages with more modern scoping rules. I like to wrap the enum in a message:</p><blog-code syntax=protobuf><pre>
// symbol `FunLevel::HIGH` is of type `FunLevel::Enum`.
message FunLevel {
  enum Enum {
    UNKNOWN = 0;
    LOW = 1;
    HIGH = 2;
    // NEXT: 3
  }
}
</pre></blog-code></blog-section><blog-section><h3 slot=title>Tombstones</h3><p>If a field has been deleted, its field number must not be reused by future field additions<blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref>. Prevent accidental field number reuse by adding tombstones with the <a href=https://developers.google.com/protocol-buffers/docs/proto#enum_reserved><code>reserved</code> keyword</a>. I always reserve both the field name and number.</p><blog-code syntax=protobuf><pre>
enum FunLevel {
  // removed -- too much fun
  reserved "FUN_LEVEL_EXCESSIVE"; reserved 10;
}
message MyMessage {
  reserved "crufty_old_field"; reserved 20;
}
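// protoc now rejects any new field or value that tries to reuse these
// reserved names or numbers.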
</pre></blog-code></blog-section><blog-section><h3 slot=title>Documentation</h3><p>Protobuf doesn't have a built-in generator for API documentation. Of the available options, <a href=https://github.com/pseudomuto/protoc-gen-doc><code>protoc-gen-doc</code></a> seems the most mature. See the <a href=https://github.com/pseudomuto/protoc-gen-doc/blob/master/README.md><code>protoc-gen-doc</code> README</a> for syntax and examples.</p></blog-section><blog-section><h3 slot=title>Validation</h3><p>Protobuf doesn't have a built-in validation mechanism, other than the <code>required</code> keyword in proto2 (removed in proto3). Lyft's <a href=https://github.com/lyft/protoc-gen-validate><code>protoc-gen-validate</code></a> tool is the best solution I know of for this, though it's in early alpha and currently only supports Go.</p></blog-section><blog-section><h3 slot=title>Optional Scalar Types</h3><p>In proto3, the ability to mark scalar fields (<code>int32</code>, <code>string</code>, etc) as optional was removed. Scalar fields are now always present, and will be a default "zero value" if not otherwise set. This can be frustrating when designing a schema for a system where <code>""</code> and <code>NULL</code> are logically distinct values.</p><p>The official workaround is a set of "wrapper types", defined in <a href=https://github.com/google/protobuf/blob/master/src/google/protobuf/wrappers.proto><code>google/protobuf/wrappers.proto</code></a>, that define single-valued messages. Your schema can use <code>.google.protobuf.Int32Value</code> instead of <code>int32</code> to get optionality.</p><blog-code syntax=protobuf><pre>
import "google/protobuf/wrappers.proto";
message MyMessage {
.google.protobuf.Int32Value some_field = 1;
}
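// Message-typed fields keep explicit presence even in proto3, so (in
// Python) msg.HasField("some_field") distinguishes unset from zero.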
</pre></blog-code><p>Another approach is to wrap the scalar field in <a href=https://developers.google.com/protocol-buffers/docs/proto3#oneof><code>oneof</code></a>, with no other choices. This forces even scalar fields to have optionality, and adds helper methods in generated code to detect if the field was set.</p><blog-code syntax=protobuf><pre>
message MyMessage {
  oneof oneof_some_field {
    int32 some_field = 1;
  }
}
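// Generated code gains presence helpers for the oneof, e.g. in Python:
//   msg.WhichOneof("oneof_some_field") is not None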
</pre></blog-code></blog-section></blog-section><blog-footnotes slot=footnotes><hr><ol><li id=fn:1><p>For a motivational lesson in reuse of field identifiers, see <a href=https://www.sec.gov/litigation/admin/2013/34-70694.pdf>SEC administrative proceeding 3-15570 against Knight Capital</a> regarding loss of $460 million USD in 45 minutes.</p></li></ol></blog-footnotes></blog-article>2018-07-02T05:04:40ZBazel School: Toolchains2018-05-25T14:34:27Zurn:uuid:af679ffe-285a-4dfe-a647-1b8fef9d5845<blog-article posted=2018-05-25T14:34:27Z><h1 slot=title>Bazel School: Toolchains</h1><div slot=summary><p>I've recently been using <a href=https://www.bazel.build/>Bazel</a> as a multi-platform distributed build system. Bazel itself supports this pretty well, but many of the user-contributed extension libraries don't make good use of Bazel's toolchains and therefore break when multiple OSes are involved in a build. I hope the situation can be improved by documenting nascent best practices.</p><p>This page is a bit advanced. It assumes background knowledge in <a href=https://en.wikipedia.org/wiki/Cross_compiler>cross compilation</a>, plus experience with Bazel's <a href=https://docs.bazel.build/versions/master/skylark/language.html>Starlark extension language</a>, <a href=https://docs.bazel.build/versions/master/skylark/rules.html>build rules</a>, and <a href=https://docs.bazel.build/versions/master/skylark/repository_rules.html>repository definitions</a>. Most users of Bazel shouldn't need to care about the details of compiler toolchains, but this is important stuff for maintainers of language rules.</p></div><blog-section><h2 slot=title>Constraints</h2><p>Bazel's package/toolchain design is based on <i>constraints</i>, which are simple text key/value pairs. Keys are defined by <a href=https://docs.bazel.build/versions/master/be/platform.html#constraint_setting><code>constraint_setting</code></a>, and values by <a href=https://docs.bazel.build/versions/master/be/platform.html#constraint_value><code>constraint_value</code></a>. Settings and values are true targets, which means they're addressed by label, obey visibility, and can be aliased.</p>
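<p>Defining your own dimension takes one <code>constraint_setting</code> plus a <code>constraint_value</code> per allowed value. A quick sketch (the <code>libc</code> setting and its values are hypothetical examples, not something Bazel predefines):</p><blog-code syntax=python><pre>
# platforms/BUILD
constraint_setting(name = "libc")

constraint_value(
    name = "glibc",
    constraint_setting = ":libc",
)

constraint_value(
    name = "musl",
    constraint_setting = ":libc",
)
</pre></blog-code><p>A couple basic constraints come predefined in <a href=https://github.com/bazelbuild/bazel/blob/0.13.0/tools/platforms/BUILD><code>@bazel_tools//platforms</code></a>:</p><blog-code><pre>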
@bazel_tools//platforms:cpu
  @bazel_tools//platforms:arm
  @bazel_tools//platforms:ppc
  @bazel_tools//platforms:s390x
  @bazel_tools//platforms:x86_32
  @bazel_tools//platforms:x86_64
@bazel_tools//platforms:os
  @bazel_tools//platforms:freebsd
  @bazel_tools//platforms:linux
  @bazel_tools//platforms:osx
  @bazel_tools//platforms:windows
</pre></blog-code><p>Note the limited selection and lack of precision. These definitions are (as of Bazel 0.13) useful only for getting started. Most language rules will want to define their own – see <a href=https://github.com/bazelbuild/rules_go/blob/0.12.0/go/toolchain/toolchains.bzl><code>@io_bazel_rules_go//go/toolchain:toolchains.bzl</code></a> for an example of custom values for the built-in settings.</p></blog-section><blog-section><h2 slot=title>Platforms</h2><p>Upstream docs:</p><ul><li><a href=https://docs.bazel.build/versions/master/platforms.html>https://docs.bazel.build/versions/master/platforms.html</a></li><li><a href=https://docs.bazel.build/versions/master/be/platform.html#platform>https://docs.bazel.build/versions/master/be/platform.html#platform</a></li></ul><p>A platform is a named set of constraint values (see above), plus some other metadata that I'm going to skip because it's part of the not-fully-implemented remote execution API. A platform can contain any number of constraint values, but at most one constraint value per constraint setting (i.e. you can't have a platform with two CPU types). Be specific – Autoconf's "<a href=https://www.gnu.org/savannah-checkouts/gnu/autoconf/manual/autoconf-2.69/html_node/Specifying-Target-Triplets.html>GNU Triplets</a>" are a good model to imitate here.</p><blog-code syntax=python><pre>
# platforms/BUILD
platform(
    name = "x86_64-apple-darwin",
    constraint_values = [
        "@bazel_tools//platforms:osx",
        "@bazel_tools//platforms:x86_64",
    ],
)
platform(
    name = "i686-linux-gnu",
    constraint_values = [
        "@bazel_tools//platforms:linux",
        "@bazel_tools//platforms:x86_32",
    ],
)
</pre></blog-code><p>The rest of this page will use the standard platform definitions built into Bazel. Custom platforms are useful if you need to constrain on other dimensions, such as CPU vendor or libc version.</p></blog-section><blog-section><h2 slot=title>Defining Toolchains</h2><p>Upstream docs:</p><ul><li><a href=https://docs.bazel.build/versions/master/toolchains.html>https://docs.bazel.build/versions/master/toolchains.html</a></li><li><a href=https://docs.bazel.build/versions/master/be/platform.html#toolchain>https://docs.bazel.build/versions/master/be/platform.html#toolchain</a></li></ul><p>To work with cross-compilation, toolchains themselves need to (1) be capable of generating non-native output binaries and (2) define their Bazel constraints.</p><blog-section><h3 slot=title>Toolchain Types</h3><p>Each category of toolchain is identified by a <i>toolchain type</i>, which is a string in the format of a build label. There is no requirement that the value actually match any defined label. I recommend using a <code>@</code>-prefixed toolchain type, to avoid potential conflicts in workspaces with multiple language rules loaded.</p></blog-section><blog-section><h3 slot=title>ToolchainInfo</h3><p>The <a href=https://docs.bazel.build/versions/master/skylark/lib/platform_common.html#ToolchainInfo><code>ToolchainInfo</code></a> provider is how your rules expose toolchain configuration to Bazel. There are no special requirements about the values you can put in, so feel free to use whatever makes sense for your language.</p><p>Skylark doesn't have a public/private distinction for struct attributes, so a convention of underscore-prefixed attribute names is borrowed from Python. It's easy for rule implementations to get access to the <code>ToolchainInfo</code> for any registered toolchain, so be clear in your docs which attributes are part of your public API.</p><p>First you define a rule type for your toolchain info:</p><blog-code syntax=python><pre>
# demo_toolchain.bzl
DEMO_TOOLCHAIN = "@rules_demo//:demo_toolchain_type"

def _demo_toolchain_info(ctx):
    return [
        platform_common.ToolchainInfo(
            compiler = ctx.attr.compiler,
            cflags = ctx.attr.cflags,
        ),
    ]

demo_toolchain_info = rule(
    _demo_toolchain_info,
    attrs = {
        # Public (not underscore-prefixed) so configurations like the
        # cross-compile example below can override it.
        "compiler": attr.label(
            executable = True,
            default = "//:demo_compiler",
            cfg = "host",
        ),
        "cflags": attr.string_list(),
    },
)
</pre></blog-code><p>Then use it to create toolchain info targets, one for each unique configuration you might want to build with:</p><blog-code syntax=python><pre>
# BUILD
load(":demo_toolchain.bzl", "DEMO_TOOLCHAIN", "demo_toolchain_info")
demo_toolchain_info(
    name = "demo_toolchain_info/i686-linux-gnu",
    cflags = ["--target-os=linux", "--target-arch=i686"],
)
demo_toolchain_info(
    name = "demo_toolchain_info/x86_64-linux-gnu",
    cflags = ["--target-os=linux", "--target-arch=amd64"],
)
</pre></blog-code></blog-section><blog-section><h3 slot=title>Registration</h3><p>Once you've got your <code>ToolchainInfo</code> rules defined, the next step is to register them. This is where the info is associated with the toolchain type and the constraint values so Bazel can auto-detect which toolchains are usable on a particular platform.</p><blog-code syntax=python><pre>
# BUILD
toolchain(
    name = "demo_toolchain_linux_x86_32",
    exec_compatible_with = [
        "@bazel_tools//platforms:linux",
        "@bazel_tools//platforms:x86_32",
    ],
    target_compatible_with = [
        "@bazel_tools//platforms:linux",
        "@bazel_tools//platforms:x86_32",
    ],
    toolchain = ":demo_toolchain_info/i686-linux-gnu",
    toolchain_type = DEMO_TOOLCHAIN,
)
toolchain(
    name = "demo_toolchain_linux_x86_64",
    # [...]
)
</pre></blog-code><p>Finally, the toolchains Bazel can use are passed to <a href=https://docs.bazel.build/versions/master/skylark/lib/globals.html#register_toolchains><code>register_toolchains</code></a> in your <code>WORKSPACE</code>. Usually this is done in a helper macro defined in the language rules, so that both the <code>toolchain()</code> rules and <code>register_toolchains(...)</code> args can be generated by the same logic.</p><blog-code syntax=python><pre>
# WORKSPACE
register_toolchains(
    "//:demo_toolchain_linux_x86_32",
    "//:demo_toolchain_linux_x86_64",
)
</pre></blog-code></blog-section></blog-section><blog-section><h2 slot=title>Using Toolchains</h2><p>Rules can declare which types of toolchains they depend on, like "needs a C++ compiler". When defining the rule, set the <code>toolchains</code> param to all the toolchain types that will be needed to run the action. Then within the implementation, fetch the <code>ToolchainInfo</code> values (the same ones defined in the toolchain info rule) and inspect the content to implement your build.</p><blog-code syntax=python><pre>
# rules.bzl
def _demo_rule(ctx):
    tc = ctx.toolchains[DEMO_TOOLCHAIN]
    # A real implementation would feed tc.compiler and tc.cflags into
    # ctx.actions.run() here.
    print("toolchain: %s %r" % (tc.compiler, tc.cflags))

demo_rule = rule(
    _demo_rule,
    toolchains = [DEMO_TOOLCHAIN],
)
</pre></blog-code></blog-section><blog-section><h2 slot=title>Cross-Compilation</h2><p>Toolchains can have different <code>exec_compatible_with</code> and <code>target_compatible_with</code> attrs. The execution compatibility is used for the platform that runs builds (i.e. the worker), and the target compatibility describes the platforms the toolchain can produce output for.</p><p>Here's the definition of a cross-compiling toolchain that runs on 32-bit Linux but generates output for 64-bit Linux:</p><blog-code syntax=python><pre>
# BUILD
load(":demo_toolchain.bzl", "DEMO_TOOLCHAIN", "demo_toolchain_info")
demo_toolchain_info(
    name = "demo_toolchain_info_linux_x86_32_cross64",
    compiler = "@demo_prebuilt_compiler_linux_x86_32//:demo_compiler",
    cflags = ["--target-os=linux", "--target-arch=amd64"],
)
toolchain(
    name = "demo_toolchain_linux_x86_32_cross64",
    exec_compatible_with = [
        "@bazel_tools//platforms:linux",
        "@bazel_tools//platforms:x86_32",
    ],
    target_compatible_with = [
        "@bazel_tools//platforms:linux",
        "@bazel_tools//platforms:x86_64",
    ],
    toolchain = ":demo_toolchain_info_linux_x86_32_cross64",
    toolchain_type = DEMO_TOOLCHAIN,
)
</pre></blog-code><blog-section><h3 slot=title>Platform Selection Flags</h3><p>Bazel (as of 0.13) has two flags to override the platform selection, which are useful when the execution platform is custom-defined or different in some important way from the machine running Bazel. The most common reason is if you're building with remote workers.</p><ul><li>The <code>--platforms</code> flag specifies which platforms the final compiled binaries will run on. This flag can accept multiple platforms, in which case Bazel may generate multiple outputs for a build artifact.</li><li>The <code>--host_platform</code> flag overrides which platform is used for executing build commands. I'm hopeful that this flag could be split into <code>--host_platform</code> and <code>--remote_platform</code> in future versions of Bazel, so that some actions can be run locally even if the distributed build pool is different from the local workstation.</li></ul><p>There's also the <code>--cpu</code> and <code>--host_cpu</code> flags, which (if I understand correctly) are deprecated and exist only because the built-in C++ rules haven't been migrated to the toolchains system yet.</p></blog-section></blog-section><blog-section><h2 slot=title>Prebuilt Toolchains</h2><p>Compiler toolchains are often large, and take a while to build. Downloading prebuilt toolchains can materially improve your users' experience, but there are some extra details to be aware of:</p><ul><li>Do not use <code>uname</code>, inspection of <code>/proc</code>, or similar unsandboxed mechanisms to discover the execution platform. These interfere with users' customizations of the build environment, and can cause incorrect behavior when the execution platform is different from where the user is running Bazel.</li><li>If the toolchain is downloaded by a custom repository rule, put it in its own <code>.bzl</code> file. Repository rules are invalidated by any changes to the <code>.bzl</code> file they're defined in, and you don't want small changes to toolchains to force a re-download of large toolchain archives.</li></ul></blog-section></blog-article>2018-05-25T14:34:27ZMojibake in Surugaya Javascript2018-03-24T08:51:05Zurn:uuid:32dc7f3d-be60-4d31-8e82-e197c844c4a8<style>img{max-width:100%}</style><blog-article posted=2018-03-24T08:51:05Z><h1 slot=title>Mojibake in Surugaya Javascript</h1><div slot=summary><p>Yesterday I bought some used CDs from the online store <a href=https://www.suruga-ya.jp/>Surugaya</a>. The checkout process was broken in an interesting way: when I clicked the payment method confirmation button, nothing happened. I switched from Chrome to Firefox and was able to place an order successfully<blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref>.</p><img src=/🤔/case-report-surugaya-mojibake/debugging.jpg alt="Oh boy, here I go debugging again" style=max-width:300px></div><blog-section><h2 slot=title>The bug</h2><p>A quick look in the web console showed some errors in loading the page's required Javascript:</p><img src=/🤔/case-report-surugaya-mojibake/checkout-page-console.png alt="jquery-supertextconverter-plugin.js:60: Uncaught SyntaxError: Invalid or unexpected token"><p>Indeed, line 60 of the script was obviously invalid:</p><blog-code syntax=javascript><pre>
/*! jQuery Super Text Converter 2014-03-03
* Vertion : 1.0.3
* Dependencies : jQuery *
* Author : MegazalRock (Otto Kamiya)
* Copyright (c) 2014 MegazalRock (Otto Kamiya);
* License : */
[...]
58 },{
59 zenkaku : /¥/g,
60 hankaku : '\'
61 },{
</pre></blog-code></blog-section><blog-section><h2 slot=title>Root cause analysis</h2><p>How did this happen? There are two important clues:</p><ul><li>First, Japanese editions of Windows use the Yen sign to render U+005C, instead of the backslash. This is backwards-compatibility behavior from pre-Unicode days when all characters needed to fit in a single byte – the <a href=https://en.wikipedia.org/wiki/JIS_X_0201>JIS X 0201</a> character set used 0x5C for the Yen sign, and so Japanese editions of DOS used ¥ for the directory separator. Even after Windows gained Unicode support, it still renders ¥ instead of \ when running in a Japanese locale.</li><li>Second, if the Surugaya version is compared with <a href=https://github.com/megazalrock/jquery-supertextconverter/blob/1.0.3/dist/jquery-supertextconverter-plugin.js>jquery-supertextconverter-plugin.js v1.0.3</a>, we see two changes that look intentional, and several that look erroneous:</li></ul><blog-code syntax=diff><pre>
--- https://github.com/megazalrock/jquery-supertextconverter/blob/1.0.3/dist/jquery-supertextconverter-plugin.js
+++ https://www.suruga-ya.jp/js/jquery-supertextconverter-plugin.js
@@ -21,7 +21,7 @@
hyphen: true
},
zenkakuHyphen: 'ー',
- zenkakuChilda: '〜'
+ zenkakuChilda: '縲鰀'
}, options);
stc.regexp = {
hankaku : /[A-Za-z0-9#$%&\\()*+,.\/<>\[\]{}=@;:_\^`]/g,
@@ -57,16 +57,16 @@
type: 'space'
},{
zenkaku : /¥/g,
- hankaku : '¥'
+ hankaku : '\'
},{
- zenkaku : /[ー―‐−]/g,
+ zenkaku : /[ー―‐竏綻/g,
hankaku : '-',
type : 'hyphen'
},{
zenkaku : /|/g,
hankaku : '|'
},{
- zenkaku : /[~〜]/g,
+ zenkaku : /[~縲彎/g,
hankaku : '~',
type: 'tilda'
},{
@@ -99,7 +99,7 @@
zenkaku : ' ',
type: 'space'
},{
- hankaku : /[¥\\]/g,
+ hankaku : /[\\\]/g,
zenkaku : '¥'
},{
hankaku : /[\-ー]/g,
@@ -140,7 +140,7 @@
/ラ/g, /リ/g, /ル/g, /レ/g, /ロ/g,
/ワ/g, /ヲ/g, /ン/g,
/ァ/g, /ィ/g, /ゥ/g, /ェ/g, /ォ/g,
- /ャ/g, /ュ/g, /ョ/g,
+ /ャ/g, /ュ/g, /ョ/g, /ッ/g,
/゙/g, /゚/g, /。/g, /、/g
];
this.zenkakuKanaList = [
@@ -160,7 +160,7 @@
'ラ', 'リ', 'ル', 'レ', 'ロ',
'ワ', 'ヲ', 'ン',
'ァ', 'ィ', 'ゥ', 'ェ', 'ォ',
- 'ャ', 'ュ', 'ョ',
+ 'ャ', 'ュ', 'ョ', 'ッ',
'゛', '゜', '。', '、'
];
};
</pre></blog-code><p>This is a case of <a href=https://en.wikipedia.org/wiki/Mojibake><i>mojibake</i></a>!</p><blockquote><p>Mojibake (文字化け) (IPA: [mod͡ʑibake]; lit. "character transformation", from the Japanese 文字 (moji) "character" + 化け (bake, pronounced "bah-keh") "transform") is the garbled text that is the result of text being decoded using an unintended character encoding. The result is a systematic replacement of symbols with completely unrelated ones, often from a different writing system.</p></blockquote><p>What I think happened is someone wanted to add 「ッ」 to the replacement lists at the end, so they edited the source file with some basic text editor. The editor was running in a Japanese locale and interpreted the UTF-8 source as some other encoding, causing mojibake. When the new file was saved, the corruption was preserved.</p><blog-section><h3 slot=title>Identifying the mystery encoding</h3><p>Which encoding did the editor use? A web search for 「¥」 will obviously not find anything useful, so let's use one of the other replacements. <a href="https://www.google.com/search?q=%22%E7%B8%B2%E9%B0%80%22">https://www.google.com/search?q="縲鰀"</a> has some relevant results:</p><ul><li><a href=http://q.hatena.ne.jp/1247890339>「縲鰀」とはどういう意味ですか?</a> ("What is the meaning of 「縲鰀」?")</li><li><a href=http://rentan.org/blog/2012/02/05/wave-dash/>縲鰀の謎</a> ("The mystery of 縲鰀")</li></ul><p>These confirm other people have encountered this exact error before, but neither says which encoding is involved.</p><p>Note something interesting – two of the bad replacements consumed a trailing <code>]</code>. The unknown encoding must be variable-width.</p><p>We can construct a table of likely candidates:</p><table style=margin:auto><thead><tr><th>Character</th><th>Unicode</th><th>UTF-8</th><th>Shift JIS<blog-footnote-ref>[<a href=#fn:2>2</a>]</blog-footnote-ref></th><th>EUC-JP</th></tr></thead><tbody><style scoped>th,td{text-align:right;padding:0 1.5em}</style><tr><td style=text-align:left><code>'</code></td><td><code>U+0027</code></td><td><code>x27</code></td><td><code>x27</code></td><td><code>x27</code></td></tr><tr><td style=text-align:left><code>]</code></td><td><code>U+005D</code></td><td><code>x5D</code></td><td><code>x5D</code></td><td><code>x5D</code></td></tr><tr><td style=text-align:left><code>−</code></td><td><code>U+2212</code></td><td><code>xE2 x88 x92</code></td><td><code>x81 x7C</code></td><td><code>xA1 xDD</code></td></tr><tr><td style=text-align:left><code>〜</code></td><td><code>U+301C</code></td><td><code>xE3 x80 x9C</code></td><td><code>x81 x60</code></td><td><code>xA1 xC1</code></td></tr><tr><td style=text-align:left><code>彎</code></td><td><code>U+5F4E</code></td><td><code>xE5 xBD x8E</code></td><td><code>x9C x5D</code></td><td><code>xD7 xBE</code></td></tr><tr><td style=text-align:left><code>竏</code></td><td><code>U+7ACF</code></td><td><code>xE7 xAB x8F</code></td><td><code>xE2 x88</code></td><td><code>xE3 xE8</code></td></tr><tr><td style=text-align:left><code>綻</code></td><td><code>U+7DBB</code></td><td><code>xE7 xB6 xBB</code></td><td><code>x92 x5D</code></td><td><code>xC3 xBE</code></td></tr><tr><td style=text-align:left><code>縲</code></td><td><code>U+7E32</code></td><td><code>xE7 xB8 xB2</code></td><td><code>xE3 x80</code></td><td><code>xE5 xE0</code></td></tr><tr><td style=text-align:left><code>鰀</code></td><td><code>U+9C00</code></td><td><code>xE9 xB0 x80</code></td><td><code>xEF xCD</code></td><td><code>x8F xEB xA5</code></td></tr></tbody></table><p>That did it! 
We can see how some of the bytes match up:</p><ul><li><code>0x5D</code> shows up at the end of the Shift JIS encodings.</li><li><code>0x9C</code> is at the end of <code>utf8("〜")</code> and the start of <code>shift_jis("彎")</code>.</li><li><code>utf8("〜")</code> starts with <code>0xE3 0x80</code>, which is <code>shift_jis("縲")</code>.</li></ul><p>This file was encoded in UTF-8, but edited as Shift JIS. We can test this theory using Python:</p><blog-code><pre>
$ python
>>> print u"〜]".encode("utf8").decode("shift_jisx0213")
縲彎
>>> print u"−]".encode("utf8").decode("shift_jisx0213")
竏綻
>>> print u"〜".encode("utf8").decode("shift_jisx0213")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'shift_jisx0213' codec can't decode byte 0x9c in position 2: incomplete multibyte sequence
</pre></blog-code><p>Close, but not quite. Something else is going on. <code>shift_jis('縲鰀')</code> is <code>0xE3 0x80 0xEF 0xCD</code>, and <code>0xEF</code> doesn't show up anywhere else in the table. What if the editor was being <i>really</i> clever, and restarting the encoding autodetector each time it fails to decode a multi-byte sequence?</p><blog-code><pre>
>>> bytes = u"〜".encode('utf8')
>>> bytes += '\x00' # padding
>>> print bytes[0:2].decode('shift_jisx0213') + bytes[2:4].decode('utf-16-be')
縲鰀
</pre></blog-code><p>There it is. The unknown editor thought the best way to load a UTF-8 file was to parse it as a mix of Shift JIS and big-endian UTF-16.</p></blog-section><blog-section><h3 slot=title>Impact timeline</h3><p>How long has this online store been serving up invalid, payment-breaking Javascript on its checkout page?</p><blog-code><pre>
$ curl -v -o /dev/null https://www.suruga-ya.jp/js/jquery-supertextconverter-plugin.js 2>&1 | grep -E 'Date:|Last-Modified:'
< Date: Sat, 24 Mar 2018 07:46:01 GMT
< Last-Modified: Wed, 20 Jan 2016 07:38:19 GMT
</pre></blog-code><p>Over two years. Hmmm.</p></blog-section></blog-section><blog-section><h2 slot=title>Reporting to the webmaster</h2><p>Surugaya does not have a published email address, and their order confirmation mail helpfully notes that replies are not monitored. Their contact form is at <a href=https://www.suruga-ya.jp/toiawase>https://www.suruga-ya.jp/toiawase</a>. Since they're located in Japan and have no English text on their site, I figured mangled Japanglish would be more successful than English. Here's the best I could do with Google Translate and a dictionary:</p><blockquote><pre>
こんにちは、
「支払方法の選択」のページにjavascriptのエラーがありますから、Chromeの使いの顧客は購買できません。
エラーの写真: https://i.imgur.com/N9d0J08.png
このファイル: https://www.suruga-ya.jp/js/jquery-supertextconverter-plugin.js
},{
zenkaku : /¥/g,
hankaku : '\' <- これは悪い
},{
元のファイルは正しいかもしれないと思います。
https://github.com/megazalrock/jquery-supertextconverter/blob/master/src/superTextConverter.js#L60-L63
},{
zenkaku : /¥/g,
hankaku : '¥' <- これは良い
},{
僕の変な日本語はごめんあさい。返事なら、日本語も英語もいいです。
</pre></blockquote><p>Their contact form has a "reset" button right next to the submit button, and the message field gets cleared on navigate-back, so I got to type that up twice. いい練習ですね。</p><p>When I click the submit button, nothing happens. I switch to Firefox again and am able to submit their contact form.</p><blog-section><h3 slot=title>The second bug</h3><p>The contact form directs to a confirmation page, which notes that I'm about to submit an empty message. What?</p><img src=/🤔/case-report-surugaya-mojibake/dont-talk-to-us.png alt="contact form confirmation page"><p>I can see the POST values got sent over correctly, but the confirmation page thinks I tried to submit an empty message. It's not just a rendering problem either, the "confirm" button there just serves up an error about the missing fields. Whatever's happening seems to be server-side, and I have no visibility into it.</p></blog-section><blog-section><h3 slot=title>Another attempt at contact</h3><p>Looking at the source for the page, I notice it has <code><link rev="made" href="mailto:info@act-system.com"></code> in it. Maybe this "act system" is a web development firm responsible for the shop, and they will be able to fix the script?</p><img src=/🤔/case-report-surugaya-mojibake/act-system.png alt="ACTSYSTEM home page"><p>Looks like an SEO company rather than a web developer, and the last activity is from January 2016. Probably coincidental that their final blog post was written two days before the <code>Last-Modified</code> date on that broken script.</p></blog-section></blog-section><blog-section><h2 slot=title>What did we learn, Palmer?</h2><p>Text encoding is still hard.</p><p>After making changes to your website, consider diffing to make sure the delta is what you expected.</p><p>If you're going to ignore email in favor of a contact form, consider testing your contact form.</p><p>If your online store's sales funnel drops all users of the #1 most popular browser, you may be leaving money on the table from potential customers who don't know how to debug your Javascript.</p></blog-section><blog-footnotes><hr><ol><li id=fn:1><p>Via PayPal, obviously. I'm not about to type my credit card number into a site that behaves like this.</p></li><li id=fn:2><p><a href=https://en.wikipedia.org/wiki/Shift_JIS>Shift JIS</a> unified JIS X 0201 and <a href=https://en.wikipedia.org/wiki/JIS_X_0208>JIS X 0208</a> into a single character set.</p></li></ol></blog-footnotes></blog-article>2018-03-24T08:51:05ZUNIX Syscalls2022-08-02T02:50:36Zurn:uuid:1d18861b-38a9-48d2-991b-021315a32e2a<style type=text/css scoped>table{color:#000;background-color:#fff}th{background-color:lightgrey}</style><blog-article posted=2018-03-17T22:03:09Z updated=2022-08-02T02:50:36Z><h1 slot=title>UNIX Syscalls</h1><div slot=summary><p>On UNIX-like operating systems, userland processes invoke kernel procedures using the "syscall" feature. Each syscall is identified by a "syscall number" and has a short list of parameters, both of which can vary between operating systems, hardware platforms, and configuration options.</p><p>Performing a syscall is usually done via a special assembly instruction, though some platforms use other mechanisms (e.g. a <a href=https://en.wikipedia.org/wiki/VDSO>vDSO</a>). 
This page is a catalog of how to invoke syscalls on different UNIX-like platforms.</p></div><blog-section id=int-0x80><h2 slot=title>int $0x80 (or int 80h)</h2><p><code>int $0x80</code> (also styled as <code>int 80h</code>) is the traditional syscall instruction on i386 UNIX-like platforms. It triggers a <a href=https://en.wikipedia.org/wiki/Interrupt>software interrupt</a> that transfers control to the kernel, which inspects its registers and stack to find the syscall number + parameters. It is obsolete since the mid 2000s for performance reasons, but can still be found in tutorials because it's easier to understand than more modern mechanisms.</p></blog-section><blog-section><h2 slot=title>Syscalls by OS</h2><p>(incomplete)</p><table><thead><tr><th>Name</th><th>Standard</th><th>Linux</th><th>Darwin</th><th>FreeBSD</th></tr></thead><tbody><tr><td>access</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/access.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/access.2.html>access(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/access/>access(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=access&sektion=2">access(2)</a></td></tr><tr><td>creat</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/creat.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/creat.2.html>creat(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/creat/>creat(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=creat&sektion=2">creat(2)</a></td></tr><tr><td>exchangedata</td><td></td><td></td><td><a href=https://www.unix.com/man-page/osx/2/exchangedata/>exchangedata(2)</a></td><td></td></tr><tr><td>extattr_delete_file</td><td></td><td></td><td></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=extattr&sektion=2">extattr(2)</a></td></tr><tr><td>extattr_get_file</td><td></td><td></td><td></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=extattr&sektion=2">extattr(2)</a></td></tr><tr><td>extattr_list_file</td><td></td><td></td><td></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=extattr&sektion=2">extattr(2)</a></td></tr><tr><td>extattr_set_file</td><td></td><td></td><td></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=extattr&sektion=2">extattr(2)</a></td></tr><tr><td>fallocate</td><td></td><td><a href=http://man7.org/linux/man-pages/man2/fallocate.2.html>fallocate(2)</a></td><td></td><td></td></tr><tr><td>fsync</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/fsync.2.html>fsync(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/fsync/>fsync(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=fsync&sektion=2">fsync(2)</a></td></tr><tr><td>stat</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/stat.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/stat.2.html>stat(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/stat/>stat(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=stat&sektion=2">stat(2)</a></td></tr><tr><td>fcntl</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/fcntl.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/fcntl.2.html>fcntl(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/fcntl/>fcntl(2)</a></td><td><a 
href="https://www.freebsd.org/cgi/man.cgi?query=fcntl&sektion=2">fcntl(2)</a></td></tr><tr><td>flock</td><td></td><td><a href=http://man7.org/linux/man-pages/man2/flock.2.html>flock(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/flock/>flock(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=flock&sektion=2">flock(2)</a></td></tr><tr><td>getxattr</td><td></td><td><a href=http://man7.org/linux/man-pages/man2/getxattr.2.html>getxattr(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/getxattr/>getxattr(2)</a></td><td></td></tr><tr><td>link</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/link.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/link.2.html>link(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/link/>link(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=link&sektion=2">link(2)</a></td></tr><tr><td>listxattr</td><td></td><td><a href=http://man7.org/linux/man-pages/man2/listxattr.2.html>listxattr(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/listxattr/>listxattr(2)</a></td><td></td></tr><tr><td>lseek</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/lseek.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/lseek.2.html>lseek(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/lseek/>lseek(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=lseek&sektion=2">lseek(2)</a></td></tr><tr><td>mkdir</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/mkdir.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/mkdir.2.html>mkdir(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/mkdir/>mkdir(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=mkdir&sektion=2">mkdir(2)</a></td></tr><tr><td>mknod</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/mknod.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/mknod.2.html>mknod(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/mknod/>mknod(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=mknod&sektion=2">mknod(2)</a></td></tr><tr><td>open</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/open.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/open.2.html>open(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/open/>open(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=open&sektion=2">open(2)</a></td></tr><tr><td>opendir</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/opendir.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man3/opendir.3.html>opendir(3)</a></td><td><a href=https://www.unix.com/man-page/osx/3/directory/>directory(3)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=directory&sektion=3">directory(3)</a></td></tr><tr><td>poll</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/poll.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/poll.2.html>poll(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/poll/>poll(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=poll&sektion=2">poll(2)</a></td></tr><tr><td>read</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/read.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/read.2.html>read(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/read/>read(2)</a></td><td><a 
href="https://www.freebsd.org/cgi/man.cgi?query=read&sektion=2">read(2)</a></td></tr><tr><td>readdir</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man3/readdir.3.html>readdir(3)</a></td><td><a href=https://www.unix.com/man-page/osx/3/directory/>directory(3)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=directory&sektion=3">directory(3)</a></td></tr><tr><td>readlink</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/readlink.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/readlink.2.html>readlink(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/readlink/>readlink(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=readlink&sektion=2">readlink(2)</a></td></tr><tr><td>removexattr</td><td></td><td><a href=http://man7.org/linux/man-pages/man2/removexattr.2.html>removexattr(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/removexattr/>removexattr(2)</a></td><td></td></tr><tr><td>rename</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/rename.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/rename.2.html>rename(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/rename/>rename(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=rename&sektion=2">rename(2)</a></td></tr><tr><td>renameat2</td><td></td><td><a href=http://man7.org/linux/man-pages/man2/rename.2.html>rename(2)</a></td><td></td><td></td></tr><tr><td>rmdir</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/rmdir.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/rmdir.2.html>rmdir(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/rmdir/>rmdir(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=rmdir&sektion=2">rmdir(2)</a></td></tr><tr><td>chmod</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/chmod.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/chmod.2.html>chmod(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/chmod/>chmod(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=chmod&sektion=2">chmod(2)</a></td></tr><tr><td>chown</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/chown.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/chown.2.html>chown(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/chown/>chown(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=chown&sektion=2">chown(2)</a></td></tr><tr><td>utime</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/utime.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/utime.2.html>utime(2)</a></td><td><a href=https://www.unix.com/man-page/osx/3/utime/>utime(3)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=utime&sektion=3">utime(3)</a></td></tr><tr><td>setxattr</td><td></td><td><a href=http://man7.org/linux/man-pages/man2/setxattr.2.html>setxattr(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/setxattr/>setxattr(2)</a></td><td></td></tr><tr><td>statfs</td><td></td><td><a href=http://man7.org/linux/man-pages/man2/statfs.2.html>statfs(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/statfs/>statfs(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=statfs&sektion=2">statfs(2)</a></td></tr><tr><td>symlink</td><td><a 
href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/symlink.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/symlink.2.html>symlink(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/symlink/>symlink(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=symlink&sektion=2">symlink(2)</a></td></tr><tr><td>unlink</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/unlink.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/unlink.2.html>unlink(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/unlink/>unlink(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=unlink&sektion=2">unlink(2)</a></td></tr><tr><td>write</td><td><a href=http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html>POSIX</a></td><td><a href=http://man7.org/linux/man-pages/man2/write.2.html>write(2)</a></td><td><a href=https://www.unix.com/man-page/osx/2/write/>write(2)</a></td><td><a href="https://www.freebsd.org/cgi/man.cgi?query=write&sektion=2">write(2)</a></td></tr></tbody></table></blog-section><blog-section><h2 slot=title>Linux</h2><p>Linux syscalls are defined in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/syscalls.h?h=v5.19">include/linux/syscalls.h</a>. Syscalls use the same parameter order across platforms, but some (e.g. <code>sys_stat64</code>) are only defined on some platforms, and others (e.g. <code>sys_clone</code>) have different parameters depending on kernel compilation options. Syscall numbers are platform-dependent.</p><p>Manpage <a href=http://man7.org/linux/man-pages/man2/syscalls.2.html>syscalls(2)</a> lists syscalls and which kernel version they were added in. Manpage <a href=http://man7.org/linux/man-pages/man2/syscall.2.html>syscall(2)</a> lists per-architecture calling conventions and register assignments.</p><p>Documentation and tutorials for implementing a Linux syscall:</p><ul><li>LWN: Anatomy of a system call [<a href=https://lwn.net/Articles/604287/>part 1</a>] [<a href=https://lwn.net/Articles/604515/>part 2</a>] [<a href=https://lwn.net/Articles/604406/>additional content</a>] (David Drysdale)</li><li><a href=https://brennan.io/2016/11/14/kernel-dev-ep3/>Tutorial - Write a System Call</a> (Stephen Brennan)</li><li><a href=https://arvindsraj.wordpress.com/2012/10/05/adding-hello-world-system-call-to-linux/>Adding hello world system call to Linux</a> (Arvind S. Raj)</li><li><a href=https://medium.com/@ssreehari/implementing-a-system-call-in-linux-kernel-4-7-1-6f98250a8c38>Implementing a system call in Linux Kernel 4.7.1</a> (Sreehari S.)</li><li><a href=https://tssurya.wordpress.com/2014/08/19/adding-a-hello-world-system-call-to-linux-kernel-3-16-0/>Adding a Hello World System Call to Linux kernel 3.16.0</a> (Surya Seetharaman)</li></ul><blog-section id=linux-i386-interrupt><h3 slot=title>Linux: i386 (INT 0x80)</h3><p>The syscall number is passed in register <code>eax</code>. Syscalls with six or fewer parameters pass them in registers [<code>ebx</code>, <code>ecx</code>, <code>edx</code>, <code>esi</code>, <code>edi</code>, <code>ebp</code>]. 
Syscalls with more than six parameters use <code>ebx</code> to pass a memory address, in a way that doesn't seem to be well documented.</p><p>Linux syscall numbers for i386 are defined in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/entry/syscalls/syscall_32.tbl?h=v5.19">arch/x86/entry/syscalls/syscall_32.tbl</a>.</p><p>See above for background on <code>int $0x80</code>.</p><blog-code><pre>
.data
.set .L_STDOUT, 1
.set .L_SYSCALL_EXIT, 1
.set .L_SYSCALL_WRITE, 4
.L_message:
    .ascii "Hello, world!\n"
.set .L_message_len, . - .L_message
.text
.global _start
_start:
    # write(STDOUT, message, message_len)
    mov $.L_SYSCALL_WRITE, %eax
    mov $.L_STDOUT, %ebx
    mov $.L_message, %ecx
    mov $.L_message_len, %edx
    int $0x80
    # exit(0)
    mov $.L_SYSCALL_EXIT, %eax
    mov $0, %ebx
    int $0x80</pre></blog-code><p>static linking</p><blog-code syntax=commands><pre>
as --32 -o hello.o hello.s
ld -m elf_i386 -o hello hello.o
file hello
# hello: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, not stripped
./hello
# Hello, world!</pre></blog-code><p>dynamic linking</p><blog-code syntax=commands><pre>
as --32 -o hello.o hello.s
ld -m elf_i386 -o hello hello.o \
    --dynamic-linker /lib/ld-linux.so.2 \
    -l:ld-linux.so.2
file hello
# hello: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, not stripped
ldd hello
# /lib/ld-linux.so.2 (0x56614000)
# linux-gate.so.1 (0xf77ba000)
./hello
# Hello, world!</pre></blog-code></blog-section><blog-section id=linux-i386-vdso><h3 slot=title>Linux: i386 (vDSO)</h3><p>A <a href=https://en.wikipedia.org/wiki/VDSO>vDSO</a> is a shared library injected into processes by the kernel, rather than loaded by the dynamic linker. It's used in i386 Linux to implement faster syscalls via the <code>SYSENTER</code> instruction available in modern 32-bit x86 processors<blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref>
<blog-footnote-ref>[<a href=#fn:2>2</a>]</blog-footnote-ref>. Later kernel versions also added fast paths for certain read-only syscalls<blog-footnote-ref>[<a href=#fn:3>3</a>]</blog-footnote-ref>.</p><p>This code is slightly more complicated than the <code>int 0x80</code> example because all functions loaded from shared objects (including <code>__kernel_vsyscall</code>) must use indirect calls.</p><blog-code><pre>
.extern __kernel_vsyscall
.data
.set .L_STDOUT, 1
.set .L_SYSCALL_WRITE, 4
.set .L_SYSCALL_EXIT, 1
.L_message:
    .ascii "Hello, world!\n"
.set .L_message_len, . - .L_message
.text
.global _start
_start:
    call .L_get_pc_thunk.esi
    add $_GLOBAL_OFFSET_TABLE_, %esi
    # write(STDOUT, message, message_len)
    mov $.L_SYSCALL_WRITE, %eax
    mov $.L_STDOUT, %ebx
    mov $.L_message, %ecx
    mov $.L_message_len, %edx
    call *__kernel_vsyscall@GOT(%esi)
    # exit(0)
    mov $.L_SYSCALL_EXIT, %eax
    mov $0, %ebx
    call *__kernel_vsyscall@GOT(%esi)
.L_get_pc_thunk.esi:
    mov (%esp), %esi
    ret</pre></blog-code><p>The <code>linux-gate.so.1</code> library that will be available at runtime is not available to the linker at compile time. To get the correct symbols and ELF headers into the executable, we need to inject some fake data:</p><ul><li><code>--defsym __kernel_vsyscall=0</code> creates a place for the symbol address to be written to, once resolved. This also prevents the linker from warning about an unresolved symbol.</li><li>Creating a dummy shared object with <code>ld -shared -soname=linux-gate.so.1</code> causes the linker to add a <code>DT_NEEDED</code> entry for the vDSO, so the dynamic linker will know to use it as a source of symbol addresses.</li></ul><p>The resulting binary is a totally normal dynamic ELF executable.</p><blog-code syntax=commands><pre>
echo '.type __kernel_vsyscall STT_FUNC' | as --32 -o dummy_so.o
ld -m elf_i386 -shared \
    --defsym __kernel_vsyscall=0 \
    -soname=linux-gate.so.1 \
    -o dummy_so dummy_so.o
as --32 -o hello.o hello.s
ld -m elf_i386 -o hello hello.o \
    --dynamic-linker /lib/ld-linux.so.2 \
    -l:ld-linux.so.2 \
    dummy_so
file hello
# hello: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, not stripped
ldd hello
# /lib/ld-linux.so.2 (0x56625000)
# linux-gate.so.1 (0xf77d5000)
./hello
# Hello, world!</pre></blog-code><blog-section><h4 slot=title>Why not auxinfo?</h4><p>Some articles about the Linux vDSO describe looking up its address using the <a href=https://refspecs.linuxfoundation.org/LSB_1.3.0/IA64/spec/auxiliaryvector.html>ELF auxiliary vector</a>. I avoided this because it seems complicated and fussy:</p><ul><li><code>AT_SYSINFO</code> provides the address of <code>__kernel_vsyscall</code> directly, but is deprecated<blog-footnote-ref>[<a href=#fn:4>4</a>]</blog-footnote-ref> and requires the discovered address to be plumbed through client code (or assigned to a magic global in some very early initializer).</li><li><code>AT_SYSINFO_EHDR</code> provides the address of the vDSO, which requires further parsing using an ELF library to extract relevant symbol addresses. I don't want my programs to embed ELF parsers, especially when a perfectly good one is available in <code>ld.so</code>.</li><li>The dynamic linker solution can be trivially extended to other Linux vDSO symbols like <code>__vdso_gettimeofday</code>, again with no ELF parsing needed.</li></ul><p>The main disadvantage of my solution is it can't be used in statically linked executables, which are useful for system recovery tools (e.g. busybox) or minimal Docker containers.</p></blog-section><blog-section><h4 slot=title>Why not gs:0x10?</h4><p>I've seen one article recommend using <code>call *%gs:0x10</code> to invoke <code>__kernel_vsyscall</code>, because GNU libc uses this register to locate its early-initialized magic globals.</p><p>Don't do this. Everything I can find about glibc auxv handling indicates that the value of <code>%gs</code> is not part of the GNU libc public ABI, and it seems to be pointing to some internal data structure that happens to have the address of <code>__kernel_vsyscall</code> at offset 0x10 (<a href=http://lkml.iu.edu/hypermail/linux/kernel/0212.2/1132.html>used to be 0x18</a>). There are no guarantees that these properties will be true in the future, especially if you want your code to link against non-GNU libc implementations such as musl.</p></blog-section></blog-section><blog-section id=linux-x86-64><h3 slot=title>Linux: x86-64</h3><p>The syscall number is passed in register <code>rax</code>. Parameters are passed in registers [<code>rdi</code>, <code>rsi</code>, <code>rdx</code>, <code>r10</code>, <code>r8</code>, <code>r9</code>] – note <code>r10</code> in place of the C function-call ABI's <code>rcx</code>, because the <code>syscall</code> instruction itself clobbers <code>rcx</code>. I haven't found documentation on what x86-64 Linux does for syscalls with more than six parameters. The <code>syscall</code> instruction is used to pass control to the kernel.</p><p>Linux syscall numbers for x86-64 are defined in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/entry/syscalls/syscall_64.tbl?h=v5.19">arch/x86/entry/syscalls/syscall_64.tbl</a>.</p>
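<p>Before dropping to assembly, these numbers can be sanity-checked from userspace through glibc's <a href=http://man7.org/linux/man-pages/man2/syscall.2.html>syscall(2)</a> wrapper. A quick sketch using Python's ctypes, assuming an x86-64 Linux host:</p><blog-code syntax=python><pre>
import ctypes

# CDLL(None) dlopens the running process, which exposes glibc's symbols.
libc = ctypes.CDLL(None, use_errno=True)
msg = b"Hello, world!\n"
libc.syscall(1, 1, msg, len(msg))  # SYS_write is 1 on x86-64
libc.syscall(60, 0)                # SYS_exit is 60; terminates the process
</pre></blog-code><p>The equivalent calls in assembly:</p><blog-code><pre>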
.data
.set .L_STDOUT, 1
.set .L_SYSCALL_EXIT, 60
.set .L_SYSCALL_WRITE, 1
.L_message:
    .ascii "Hello, world!\n"
.set .L_message_len, . - .L_message
.text
.global _start
_start:
    # write(STDOUT, message, message_len)
    mov $.L_SYSCALL_WRITE, %rax
    mov $.L_STDOUT, %rdi
    mov $.L_message, %rsi
    mov $.L_message_len, %rdx
    syscall
    # exit(0)
    mov $.L_SYSCALL_EXIT, %rax
    mov $0, %rdi
    syscall</pre></blog-code><p>static linking</p><blog-code syntax=commands><pre>
as --64 -o hello.o hello.s
ld -m elf_x86_64 -o hello hello.o
file hello
# hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped
./hello
# Hello, world!</pre></blog-code><p>dynamic linking</p><blog-code syntax=commands><pre>
as --64 -o hello.o hello.s
ld -m elf_x86_64 -o hello hello.o \
    --dynamic-linker /lib64/ld-linux-x86-64.so.2 \
    -l:ld-linux-x86-64.so.2
file hello
# hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, not stripped
ldd hello
# /lib64/ld-linux-x86-64.so.2 (0x00007f472a831000)
# linux-vdso.so.1 (0x00007ffe83d7a000)
./hello
# Hello, world!</pre></blog-code></blog-section><blog-section id=linux-armv6-eabi><h3 slot=title>Linux: ARM v6 (Little-Endian, EABI)</h3><p>The syscall number is passed in register <tt>r7</tt>, and parameters in registers <tt>r0</tt> to <tt>r6</tt>.</p><p>Linux syscall numbers for ARM are defined in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/tools/syscall.tbl?h=v5.19">arch/arm/tools/syscall.tbl</a>.</p><blog-code><pre>
.arch armv6
.data
.set .L_STDOUT, 1
.set .L_SYSCALL_EXIT, 1
.set .L_SYSCALL_WRITE, 4
.L_message:
    .ascii "Hello, world!\n"
.set .L_message_len, . - .L_message
.text
.global _start
_start:
    @ write(STDOUT, message, message_len)
    mov %r7, #.L_SYSCALL_WRITE
    mov %r0, #.L_STDOUT
    ldr %r1, =.L_message
    mov %r2, #.L_message_len
    swi #0
    @ exit(0)
    mov %r7, #.L_SYSCALL_EXIT
    mov %r0, #0
    swi #0</pre></blog-code><p>static linking</p><blog-code syntax=commands><pre>
as -EL -o hello.o hello.s
ld -m armelf_linux_eabi -o hello hello.o
file hello
# hello: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, not stripped
./hello
# Hello, world!</pre></blog-code><p>dynamic linking</p><blog-code syntax=commands><pre>
as -EL -o hello.o hello.s
ld -m armelf_linux_eabi -o hello hello.o \
    --dynamic-linker /lib/ld-linux-armhf.so.3 \
    -l:ld-linux-armhf.so.3
file hello
# hello: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, not stripped
./hello
# Hello, world!</pre></blog-code></blog-section><blog-section id=linux-riscv64><h3 slot=title>Linux: RISC-V</h3><p>The syscall number is passed in register <tt>a7</tt>, and parameters in registers <tt>a0</tt> to <tt>a6</tt>.</p><p>Linux syscall numbers for RISC-V are defined in <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/asm-generic/unistd.h?h=v5.19">include/uapi/asm-generic/unistd.h</a>.</p><blog-code><pre>
.section .rodata
.set .L_STDOUT, 1
.set .L_SYS_exit, 93
.set .L_SYS_write, 64
.L_message:
.ascii "Hello, world!\n"
.set .L_message_len, . - .L_message
.text
.global _start
_start:
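# write(STDOUT, message, message_len)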
li a7, .L_SYS_write
li a0, .L_STDOUT
la a1, .L_message
li a2, .L_message_len
ecall
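# exit(0)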
li a7, .L_SYS_exit
li a0, 0
ecall</pre></blog-code><p>static linking</p><blog-code syntax=commands><pre>
as -o hello.o hello.s
ld -m elf64lriscv -o hello hello.o
file hello
# hello: ELF 64-bit LSB executable, UCB RISC-V, double-float ABI, version 1 (SYSV), statically linked, not stripped
./hello
# Hello, world!</pre></blog-code><p>dynamic linking</p><blog-code syntax=commands><pre>
as -o hello.o hello.s
ld -m elf64lriscv -o hello hello.o \
    --dynamic-linker /lib/ld-linux-riscv64-lp64d.so.1 \
    -l:ld-linux-riscv64-lp64d.so.1
file hello
# hello: ELF 64-bit LSB executable, UCB RISC-V, RVC, double-float ABI, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-riscv64-lp64d.so.1, not stripped
./hello
# Hello, world!</pre></blog-code></blog-section></blog-section><blog-section id=darwin><h2 slot=title>Darwin (MacOS X)</h2><p>Note that I have left out the instructions to statically link binaries because they are documented as unsupported: <a href=https://developer.apple.com/library/content/qa/qa1118/_index.html>Technical Q&A QA1118: Statically linked binaries on Mac OS X</a>. Apple is also known to break the syscall ABI between MacOS versions, though it should be stable enough for the syscalls inherited from BSD.</p><p><code>lea</code> is used here because PIE addressing is required for <code>-macos_version_min 10.7</code> or later. Make sure this linker flag matches the <code>.macosx_version_min</code> value in the assembly, or the linker may reject your object code.</p><p>10.8 and later require linking with libSystem via <code>ld -lSystem</code>; earlier versions don't need that link.</p><p>The default entry point changed from <code>start</code> to <code>_main</code> in 10.8. Use <code>ld -e _main</code> to build for earlier <code>-macos_version_min</code> values.</p><blog-section id=darwin-i386><h3 slot=title>Darwin: i386</h3><blog-code><pre>
.macosx_version_min 10, 8
.data
.set L_STDOUT, 1
.set L_SYSCALL_EXIT, 1
.set L_SYSCALL_WRITE, 4
L_message:
.ascii "Hello, world!\n"
.set L_message_len, . - L_message
.text
.global _main
_main:
mov %eax, %esi
# write(STDOUT, message, message_len)
push $L_message_len
lea L_message-_main(%esi), %eax
push %eax
push $L_STDOUT
push $0 # stack padding
mov $L_SYSCALL_WRITE, %eax
int $0x80
add $16, %esp
# exit(0)
push $0 # exit code
push $0 # stack padding
mov $L_SYSCALL_EXIT, %eax
int $0x80</pre></blog-code><p>dynamic linking</p><blog-code syntax=commands><pre>
as -arch i386 -o hello.o hello.s
ld -arch i386 -macosx_version_min 10.8 -lSystem -o hello hello.o
file hello
# hello: Mach-O executable i386
otool -L hello
# hello:
# /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1238.60.2)
./hello
# Hello, world!</pre></blog-code></blog-section><blog-section id=darwin-x86-64><h3 slot=title>Darwin: x86-64</h3><p>In 64-bit MacOS X, syscall numbers are divided into "classes". The syscalls inherited from BSD are in <code>SYSCALL_CLASS_UNIX</code>, starting at <code>0x2000000</code>. See XNU header <a href=https://opensource.apple.com/source/xnu/xnu-4570.41.2/osfmk/mach/syscall_sw.h.auto.html>osfmk/mach/syscall_sw.h</a> for details.</p><blog-code><pre>
.macosx_version_min 10, 8
.data
.set L_STDOUT, 1
.set L_SYSCALL_EXIT, 0x2000001
.set L_SYSCALL_WRITE, 0x2000004
L_message:
.ascii "Hello, world!\n"
.set L_message_len, . - L_message
.text
.global _main
_main:
# write(STDOUT, message, message_len)
mov $L_SYSCALL_WRITE, %rax
mov $L_STDOUT, %rdi
lea L_message(%rip), %rsi
mov $L_message_len, %rdx
syscall
# exit(0)
mov $L_SYSCALL_EXIT, %rax
mov $0, %rdi
syscall</pre></blog-code><p>dynamic linking</p><blog-code syntax=commands><pre>
as -arch x86_64 hello.s -o hello.o
ld -arch x86_64 -o hello hello.o \
    -macosx_version_min 10.8 -lSystem
file hello
# hello: Mach-O 64-bit executable x86_64
otool -L hello
# hello:
# /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1238.60.2)
./hello
# Hello, world!</pre></blog-code></blog-section></blog-section><blog-section><h2 slot=title>FreeBSD</h2><p>The list of system calls is defined in <a href="https://cgit.freebsd.org/src/tree/sys/kern/syscalls.master?h=release/13.1.0">sys/kern/syscalls.master</a>. Syscall numbers are the same across hardware platforms.</p><blog-section id=freebsd-i386><h3 slot=title>FreeBSD: i386</h3><p><code>int $0x80</code> appears to be the only supported syscall mechanism for FreeBSD on i386. There is a vDSO at <a href="https://cgit.freebsd.org/src/tree/sys/sys/vdso.h?h=release/13.1.0">sys/sys/vdso.h</a> but it doesn't contain a Linux-style generic syscall trampoline.</p><blog-code><pre>
.data
.set .L_STDOUT, 1
.set .L_SYSCALL_EXIT, 1
.set .L_SYSCALL_WRITE, 4
.L_message:
.ascii "Hello, world!\n"
.set .L_message_len, . - .L_message
.text
.global _start
_start:
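# FreeBSD's int $0x80 convention reads arguments from the stack, laid
# out as if the kernel had been called through a C function: the "stack
# padding" words below sit where a return address would normally be.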
# write(STDOUT, message, message_len)
push $.L_message_len
push $.L_message
push $.L_STDOUT
push $0 # stack padding
mov $.L_SYSCALL_WRITE, %eax
int $0x80
add $16, %esp
# exit(0)
push $0 # exit code
push $0 # stack padding
mov $.L_SYSCALL_EXIT, %eax
int $0x80</pre></blog-code><p>static linking</p><blog-code syntax=commands><pre>
as --32 -o hello.o hello.s
ld -m elf_i386_fbsd -o hello hello.o
file hello
# hello: ELF 32-bit LSB executable, Intel 80386, version 1 (FreeBSD), statically linked, not stripped
./hello
# Hello, world!</pre></blog-code><p>dynamic linking</p><blog-code syntax=commands><pre>
as --32 -o hello.o hello.s
ld -m elf_i386_fbsd -o hello hello.o \
    --dynamic-linker=/libexec/ld-elf.so.1 \
    -L/libexec -l:ld-elf.so.1 \
    --hash-style=gnu
file hello
# hello: ELF 32-bit LSB executable, Intel 80386, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, not stripped
ldd hello
# hello:
# /libexec/ld-elf.so.1 (0x2806e000)
./hello
# Hello, world!</pre></blog-code></blog-section><blog-section id=freebsd-x86-64><h3 slot=title>FreeBSD: x86-64</h3><p>Note that older FreeBSD kernels contain a <a href="https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=182161">bug in syscall handling</a> that can cause crashes when using the <code>SYSCALL</code> instruction. Compilers targeting these old versions should use <code>INT $0x80</code> instead.</p><blog-code><pre>
.data
.set L_STDOUT, 1
.set L_SYSCALL_EXIT, 1
.set L_SYSCALL_WRITE, 4
L_message:
.ascii "Hello, world!\n"
.set L_message_len, . - L_message
.text
.global _start
_start:
# write(STDOUT, message, message_len)
mov $L_SYSCALL_WRITE, %rax
mov $L_STDOUT, %rdi
mov $L_message, %rsi
mov $L_message_len, %rdx
syscall
# exit(0)
mov $L_SYSCALL_EXIT, %rax
mov $0, %rdi
syscall</pre></blog-code><p>static linking</p><blog-code syntax=commands><pre>
as --64 -o hello.o hello.s
ld -m elf_x86_64_fbsd -o hello hello.o
file hello
# hello: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), statically linked, not stripped
./hello
# Hello, world!</pre></blog-code><p>dynamic linking</p><blog-code syntax=commands><pre>
as --64 -o hello.o hello.s
ld -m elf_x86_64_fbsd -o hello hello.o \
    --dynamic-linker=/libexec/ld-elf.so.1 \
    -L/libexec -l:ld-elf.so.1 \
    --hash-style=gnu
file hello
# hello: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, not stripped
ldd hello
# hello:
# /libexec/ld-elf.so.1 (0x800822000)
./hello
# Hello, world!</pre></blog-code></blog-section><blog-section id=freebsd-riscv64><h3 slot=title>FreeBSD: RISC-V</h3><p>The syscall number is passed in register <tt>t0</tt>, and parameters in registers <tt>a0</tt> to <tt>a6</tt>.</p><blog-code><pre>
.section .rodata
.set .L_STDOUT, 1
.set .L_SYS_exit, 1
.set .L_SYS_write, 4
.L_message:
.ascii "Hello, world!\n"
.set .L_message_len, . - .L_message
.text
.global _start
_start:
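# write(STDOUT, message, message_len)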
li t0, .L_SYS_write
li a0, .L_STDOUT
la a1, .L_message
li a2, .L_message_len
ecall
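# exit(0)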
li t0, .L_SYS_exit
li a0, 0
ecall</pre></blog-code><p>static linking</p><blog-code syntax=commands><pre>
as -o hello.o hello.s
ld -m elf64lriscv_fbsd -o hello hello.o
file hello
# hello: ELF 64-bit LSB executable, UCB RISC-V, double-float ABI, version 1 (FreeBSD), statically linked, not stripped
./hello
# Hello, world!</pre></blog-code><p>dynamic linking</p><blog-code syntax=commands><pre>
as -o hello.o hello.s
ld -m elf64lriscv_fbsd -o hello hello.o \
    --dynamic-linker=/libexec/ld-elf.so.1 \
    -L/libexec -l:ld-elf.so.1 -rpath /libexec
file hello
# hello: ELF 64-bit LSB executable, UCB RISC-V, double-float ABI, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, not stripped
ldd hello
# hello:
# ld-elf.so.1 (0x82254000)
./hello
# Hello, world!</pre></blog-code></blog-section></blog-section><blog-section id=sunos><h2 slot=title>SunOS 4.x (Solaris 1.x)</h2><blog-section id=sunos-sparc7><h3 slot=title>SunOS: SPARC v7</h3><blog-code><pre>
.seg "data"
L_STDOUT = 1
L_SYSCALL_EXIT = 1
L_SYSCALL_WRITE = 4
L_message:
.ascii "Hello world!\n"
L_message_len = . - L_message
.seg "text"
.global _start
_start:
! write(STDOUT, message, message_len)
mov L_SYSCALL_WRITE, %g1
mov L_STDOUT, %o0
set L_message, %o1
set L_message_len, %o2
ta 0
! exit(0)
mov L_SYSCALL_EXIT, %g1
mov 0, %o0
ta 0</pre></blog-code><p>static linking</p><blog-code syntax=commands prompt=%><pre>
as -o hello.o hello.s
ld -e _start -o hello hello.o
file hello
# hello: sparc demand paged executable not stripped
ldd hello
# hello: statically linked
./hello
# Hello world!</pre></blog-code></blog-section></blog-section><blog-section><h2 slot=title>Inline Assembly</h2><p>Higher-level languages sometimes let assembly be embedded directly into their object code. The exact syntax is language- and compiler-specific.</p><p>I used x86-64 Linux as the target platform for these examples, but they should work equally well if the appropriate instructions are substituted.</p><p>A note on "clobbering": compilers require the inline assembly block to declare which CPU registers <i>other than the inputs and outputs</i> may be modified. The exact set of clobbered registers is compiler-, platform-, and OS-specific<blog-footnote-ref>[<a href=#fn:5>5</a>]</blog-footnote-ref>. Linux on x86-64 clobbers <code>rcx</code> and <code>r11</code> (and maybe <code>r10</code>, as claimed by osdev?).</p><blog-section id=linux-x86-64-gnu-c><h3 slot=title>Linux: x86-64 (GNU C)</h3><p>See <a href=https://gcc.gnu.org/onlinedocs/gcc-7.3.0/gcc/Using-Assembly-Language-with-C.html>Using Assembly Language with C</a> in the GCC manual for an overview, <a href=https://gcc.gnu.org/onlinedocs/gcc-7.3.0/gcc/Machine-Constraints.html>Machine Constraints</a> for architecture-specific codes to pass parameters into an assembly block, and <a href=https://gcc.gnu.org/onlinedocs/gcc-7.3.0/gcc/Local-Register-Variables.html>Local Register Variables</a> for details on assigning values to specific registers.</p><p>I couldn't find documentation on which registers GNU C's inline assembly clobbers, if any.</p><blog-code syntax=c><pre>
static const int STDOUT = 1;
static const int SYSCALL_EXIT = 60;
static const int SYSCALL_WRITE = 1;
static const char message[] = "Hello, world!\n";
static const int message_len = sizeof(message) - 1; /* exclude the trailing NUL */
void _start() {
{ /* write(STDOUT, message, message_len) */
register int rax __asm__ ("rax") = SYSCALL_WRITE;
register int rdi __asm__ ("rdi") = STDOUT;
register const char *rsi __asm__ ("rsi") = message;
register int rdx __asm__ ("rdx") = message_len;
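/* The syscall instruction saves the return address in rcx and RFLAGS in r11, which is why both registers appear in the clobber lists. */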
__asm__ __volatile__ ("syscall"
: "+r" (rax)
: "r" (rax), "r" (rdi), "r" (rsi), "r" (rdx)
: "rcx", "r11");
}
{ /* exit(0) */
register int rax __asm__ ("rax") = SYSCALL_EXIT;
register int rdi __asm__ ("rdi") = 0;
__asm__ __volatile__ ("syscall"
:
: "r" (rax), "r" (rdi)
: "rcx", "r11");
}
}</pre></blog-code><p>static linking</p><blog-code syntax=commands><pre>
gcc -m64 -c -o hello.o hello.c
ld -m elf_x86_64 -o hello hello.o
file hello
# hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped
./hello
# Hello, world!</pre></blog-code><p>dynamic linking</p><blog-code syntax=commands><pre>
gcc -m64 -c -o hello.o hello.c
ld -m elf_x86_64 -o hello hello.o \
    --dynamic-linker /lib64/ld-linux-x86-64.so.2 \
    -l:ld-linux-x86-64.so.2
file hello
# hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, not stripped
./hello
# Hello, world!</pre></blog-code></blog-section><blog-section id=linux-x86-64-llvm><h3 slot=title>Linux: x86-64 (LLVM IR)</h3><p>See <a href=https://llvm.org/docs/LangRef.html#inline-assembler-expressions>Inline Assembler Expressions</a> in the LLVM IR reference for an overview. I'm using named registers in the input list instead of moving things around in the ASM block, so that LLVM will handle the register allocation.</p><p>LLVM documentation says its ASM calls clobber registers <code>dirflag</code>, <code>fpsr</code>, and <code>flags</code> in addition to any registers clobbered by the kernel.</p><blog-code><pre>
@.message = internal constant [14 x i8] c"Hello, world!\0A"
define void @_start() {
%message_ptr = getelementptr [14 x i8], [14 x i8]* @.message , i64 0, i64 0
; write(STDOUT, message, message_len)
call i64 asm sideeffect "syscall",
"={rax},{rax},{rdi},{rsi},{rdx},~{rcx},~{r11},~{dirflag},~{fpsr},~{flags}"
( i64 1 ; {rax} SYSCALL_WRITE
, i64 1 ; {rdi} STDOUT
, i8* %message_ptr ; {rsi} message
, i64 14 ; {rdx} message_len
)
; exit(0)
call i64 asm sideeffect "syscall",
"={rax},{rax},{rdi},~{rcx},~{r11},~{dirflag},~{fpsr},~{flags}"
( i64 60 ; {rax} SYSCALL_EXIT
, i64 0 ; {rdi} exit_code
)
ret void
}</pre></blog-code><p>static linking</p><blog-code syntax=commands><pre>
llc -o hello.o hello.ll -filetype=obj
ld -m elf_x86_64 -o hello hello.o
file hello
# hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped
./hello
# Hello, world!</pre></blog-code><p>dynamic linking</p><blog-code syntax=commands><pre>
llc -o hello.o hello.ll -filetype=obj -relocation-model=pic
ld -m elf_x86_64 -o hello hello.o \
    --dynamic-linker /lib64/ld-linux-x86-64.so.2 \
    -l:ld-linux-x86-64.so.2
file hello
# hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, not stripped
./hello
# Hello, world!</pre></blog-code></blog-section><blog-section id=linux-x86-64-rust><h3 slot=title>Linux: x86-64 (Rust)</h3><p>See <a href=https://doc.rust-lang.org/reference/inline-assembly.html>Inline assembly</a> in the Rust reference for an overview. As in the LLVM IR example, I'm using named registers to let the compiler handle register allocation.</p><blog-code syntax=rust><pre>
#![no_std]
#![no_main]
const STDOUT: u64 = 1;
const SYSCALL_EXIT: u64 = 60;
const SYSCALL_WRITE: u64 = 1;
#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
loop {}
}
#[no_mangle]
unsafe fn _start() {
let message: &str = "Hello, world!\n";
// write(STDOUT, message, message.len())
let mut _rc: i64;
core::arch::asm!(
"syscall",
in("rax") SYSCALL_WRITE,
in("rdi") STDOUT,
in("rsi") message.as_ptr(),
in("rdx") message.len(),
out("rcx") _,
out("r11") _,
lateout("rax") _rc,
);
// exit(0)
core::arch::asm!(
"syscall",
in("rax") SYSCALL_EXIT,
in("rdi") 0,
out("rcx") _,
out("r11") _,
);
}</pre></blog-code><p>static linking</p><blog-code syntax=commands><pre>
rustc --emit obj -O -C panic=abort -o hello.o hello.rs
ld -m elf_x86_64 -o hello hello.o
file hello
# hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped
./hello
# Hello, world!</pre></blog-code><p>dynamic linking</p><blog-code syntax=commands><pre>
rustc --emit obj -O -C panic=abort -o hello.o hello.rs
ld -m elf_x86_64 -o hello hello.o \
    --dynamic-linker /lib64/ld-linux-x86-64.so.2 \
    -l:ld-linux-x86-64.so.2
file hello
# hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, not stripped
./hello
# Hello, world!</pre></blog-code></blog-section></blog-section><blog-footnotes><hr><ol><li id=fn:1><p>LKML: <a href=https://lkml.org/lkml/2002/12/9/13>Intel P6 vs P7 system call performance</a> (Mike Hayward)</p></li><li id=fn:2><p>LWN: <a href=https://lwn.net/Articles/18411/>How to speed up system calls</a></p></li><li id=fn:3><p>manpage <a href=http://man7.org/linux/man-pages/man7/vdso.7.html>vdso(7)</a></p></li><li id=fn:4><p>manpage <a href=http://man7.org/linux/man-pages/man3/getauxval.3.html>getauxval(3)</a></p></li><li id=fn:5><p>See the <a href=https://wiki.osdev.org/System_V_ABI>System V ABI</a> for details.</p></li></ol></blog-footnotes></blog-article>2018-03-17T22:03:09ZSRE School: Health Checking2018-03-14T06:20:54Zurn:uuid:281a0a37-b497-4673-91aa-3eb7cdf118e2<blog-article posted=2018-03-14T06:20:54Z><h1 slot=title>SRE School: Health Checking</h1><div slot=summary><p>Any service that has complex logic or external dependencies might stop working for unexpected reasons. While instrumentation and monitoring can help bring these problems to human attention, it can be difficult to use dashboards or alerts for low-latency automated responses. A load balancer, for example, should respond to unhealthy backends on the order of seconds – long before any human can become aware of the problem.</p><p>Health checking is the process by which processes self-monitor for problems, report those problems to other parts of the service, and respond to other processes' unhealthiness in ways that mitigate overall service degradation.</p></div><blog-section><h2 slot=title>Reporting Problems</h2><p>Health checks are done not for a process's own benefit, but for the benefit of others. The first part of any health checking logic is the endpoint by which other processes poll it. This is essentially a miniature black-box monitoring system.</p><p>Health checks should almost always be performed over the same protocol that normal requests and responses will be handled by. If your HTTP server processes health checks in a separate thread pool or with a special low-dependency handler, then the risk of health checks reporting OK for an unhealthy process is significantly increased.</p><blog-section><h3 slot=title>HTTP</h3><p>Many distributed systems use HTTP as a transport protocol, so adding in a simple <code>/healthcheck</code> endpoint is popular. The semantics are usually "always respond <code>200 OK</code>", and upstream load balancers treat timeouts or other response codes as unhealthy. Repeated failed health checks cause the load balancer to stop sending requests to that backend.</p><p>A few changes to this basic model can improve the efficiency:</p><ul><li>Certain error codes can be special-cased to mean "stop sending requests immediately" – for example, <a href=https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/health_checking.html>Envoy treats 503 as a hard go-away</a>.</li><li>When using common ports like :80 or :443, an "expected name" might be attached to the request to identify which backend the load balancer expects to be talking to. When a different process is listening on that port at the moment, it will reject the health-check request and the load balancer will avoid sending it traffic for the other service.</li></ul></blog-section><blog-section><h3 slot=title>gRPC</h3><p>gRPC has a standardised and expanded version of the basic HTTP health check. 
It expects each port to respond to <code>/grpc.health.v1.Health/Check</code>, and allows requests to specify which <i>service name</i> they are for:</p><ul><li><a href=https://github.com/grpc/grpc/blob/master/doc/health-checking.md>grpc/doc/health-checking.md</a></li><li><a href=https://github.com/grpc/grpc/blob/master/src/proto/grpc/health/v1/health.proto>grpc/src/proto/grpc/health/v1/health.proto</a></li></ul><p>The handling of service names is important because each gRPC server can offer multiple gRPC services, each logically distinct and with its own health check logic. For example, an authorization server with separate "issue token" and "validate token" services might be temporarily unable to issue tokens, but could still validate any that were previously issued.</p></blog-section></blog-section><blog-section><h2 slot=title>Dependencies</h2><p>While a service <i>might</i> become unhealthy because of some internal problem, it's far more common for unhealthiness to be caused by dependencies on other components. A mail server might be unable to send email because it's getting <code>CONNECTION_REFUSED</code> from <code>smtpd</code>, unable to show existing emails because the database machine is rebooting, or unable to do anything at all because a human has manually marked its local machine as bad.</p><p>Within a single process, health status is detected and propagated to relevant services via a dependency tree. Ideally, the codebase is structured so that depending on any external resource (a database, an RPC backend, a secret key installed by Puppet) requires going through the dependency framework.</p><blog-section><h3 slot=title>Interfaces</h3><blog-code syntax=go><pre>
type HealthChecker interface {
Metadata() *Metadata
Children() []HealthChecker
HealthCheck(context.Context, func(error))
}
type Metadata struct {
Name string
Description string
}
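/* A sketch of how a monitoring framework might walk the dependency
tree built from this interface; the helper below is illustrative,
not part of the article's API. */
func walkHealthCheckers(hc HealthChecker, visit func(*Metadata)) {
visit(hc.Metadata())
for _, child := range hc.Children() {
walkHealthCheckers(child, visit)
}
}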
</pre></blog-code></blog-section><blog-section><h3 slot=title>Defining Dependencies</h3><blog-code syntax=go><pre>
type FileDependency struct {
Path string
}
var _ health.HealthChecker = (*FileDependency)(nil)
func (f *FileDependency) Metadata() *health.Metadata {
return &health.Metadata{
Name: fmt.Sprintf("local file: %s", f.Path),
}
}
func (f *FileDependency) Children() []health.HealthChecker { return nil }
func (f *FileDependency) HealthCheck(ctx context.Context, cb func(error)) {
ticker := time.Tick(time.Second)
for {
select {
case <-ticker:
fp, err := os.Open(f.Path)
if err == nil {
fp.Close()
}
cb(err)
case <-ctx.Done():
return
}
}
}
</pre></blog-code></blog-section><blog-section><h3 slot=title>Registration</h3><blog-code syntax=go><pre>
type motdImpl struct {
motdFile *FileDependency
}
func (i *motdImpl) Motd(ctx context.Context, req *pb.MotdRequest) (*pb.MotdResponse, error) {
motd, err := ioutil.ReadFile(i.motdFile.Path)
if err != nil {
return nil, err
}
return &pb.MotdResponse{Message: motd}, nil
}
func main() {
ctx := context.Background()
impl := &motdImpl{
motdFile: &FileDependency{
Path: "/etc/motd",
},
}
machineHealthyFile := &FileDependency{
Path: "/etc/machine-healthy",
}
srv := grpc.NewServer()
pb.RegisterMotdServer(srv, impl)
healthSrv := health.NewHealthServer()
grpc_health_v1.RegisterHealthServer(srv, healthSrv)
healthSrv.Register(impl.motdFile, health.ServiceName("com.example.Motd"))
healthSrv.Register(machineHealthyFile)
// waits for dependencies to become healthy
healthSrv.Start(ctx)
address := "127.0.0.1:1234"
socket, err := net.Listen("tcp", address)
if err != nil {
log.Fatalf("net.Listen(%q): %v", address, err)
}
srv.Serve(socket)
}
</pre></blog-code></blog-section></blog-section><blog-section><h2 slot=title>Server Startup</h2><p>Server startup should block until dependencies have become healthy, so that service implementation code doesn't have to deal with "half-open" dependencies (unless explicitly written to do so). Since dependencies can take a few seconds to initialize, starting them in parallel also helps reduce overall startup time.</p><p>Not all dependencies should block server startup, and some should <i>only</i> block startup but not otherwise affect health checking. The levels are:</p><ul><li><i>Hard dependencies</i> block startup until the dependency is healthy, and the service (or entire process) becomes unhealthy if the dependency is unhealthy. Examples might include the main database server, a proxy for outgoing connections, or disk space for critical logs.</li><li><i>Startup dependencies</i> block startup, but once loaded don't need to be re-checked. Examples include a per-service private key, a large file loaded from local disk, or configuration data stored remotely.</li><li><i>Optional dependencies</i> do not block startup, but do propagate health status to services that depend on them. This is useful when a single process is providing many services, and there's no problem with only accepting traffic for some of them.</li></ul></blog-section></blog-article>2018-03-14T06:20:54ZReddit Front Page (2018)2018-03-11T04:34:36Zurn:uuid:205b5db1-a747-4261-ae88-7fb92c64dbc4<blog-article posted=2018-03-11T04:34:36Z><h1 slot=title>Reddit Front Page (2018)</h1><p slot=summary>Over the past few months I've noticed I get a lot less enjoyment out of browsing Reddit. There wasn't any clear reason, just a general feeling that every hour I spent there was an hour wasted. It didn't use to be that way, I think – while there were always some forums there filled with noise, it <i>also</i> contained fresh analyses and insightful commentary and regularly surfaced them to the front page (/r/all).</p><blog-section><h2 slot=title>Filtering With RES</h2><p>My first attempt to fix things was installing <a href=https://chrome.google.com/webstore/detail/reddit-enhancement-suite/kbmfpngjjgdllneeigpgjifpgocmfgmb>Reddit Enhancement Suite</a>, a Chrome extension that implements (among other things) the ability to hide subreddits from view. When I noticed that particular noisy subreddits were taking up too much of the page, I added them to my RES blacklist. Unfortunately, RES is client-side and can't easily request more links on heavily filtered pages. I noticed that the front page would sometimes be <i>empty</i>, because every link on it came from a subreddit that I didn't want to see.</p><p>Next I tried using Reddit's built-in subreddit filtering, a feature <a href=https://www.reddit.com/r/redditmobile/comments/5frpbf/you_can_now_filter_rall_on_a_desktop_browser_and/>added in November 2016</a> to handle just this use case. After quickly hitting their limit of 99 hidden subreddits, I moved things around between their filter and RES to optimize the number of filtered links per page view. Reddit would block the most popular of the noisy subreddits, and RES could take the long tail.</p></blog-section><blog-section><h2 slot=title>Collecting More Data</h2><p>Filtering didn't seem to be very effective, and I was still regularly seeing pages containing only noise, or nothing at all. Was the problem lack of data? 
I wrote a quick-n-dirty<blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref> scraper that would hit <a href=https://github.com/reddit-archive/reddit/wiki/API>Reddit's API</a>, saving the current top 2000 posts to JSON files for analysis:</p><blog-code syntax=python><pre>
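# NOTE: this is Python 2 (print statement, urllib2); on Python 3 the
# same calls live in urllib.request and urllib.parse.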
import datetime
import json
import os.path
import time
import urllib
import urllib2
timestamp = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
os.mkdir(timestamp)
after = None
for request_num in range(20):
out_filename = os.path.join(timestamp, "%02d.json" % (request_num + 1,))
print "[%s]" % (out_filename)
params = {"limit": "100"}
if after is not None:
params["after"] = after
req = urllib2.Request(
url = "https://reddit.com/r/all/.json?" + urllib.urlencode(params),
headers = {
# https://github.com/reddit-archive/reddit/wiki/API#rules
"user-agent": "darwin:com.john-millikin.redditpopularity:v1 (by /u/jmillikin)",
},
)
response_fp = urllib2.urlopen(req)
response = json.load(response_fp)
with open(out_filename, "wb") as fp:
json.dump(response, fp, indent=2)
time.sleep(2)
after = response["data"]["children"][-1]["data"]["name"]
</pre></blog-code><p>Then I extracted the most interesting fields into an SQLite database for easier querying:</p><blog-code syntax=python><pre>
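# Usage (script name is arbitrary): python <this_script>.py <snapshot-directory>
# where <snapshot-directory> holds the JSON files written by the scraper above.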
import glob
import json
import sqlite3
import sys
# https://www.sqlite.org/datatype3.html
db = sqlite3.connect(sys.argv[1].strip("/") + ".sqlite")
db.execute("""
CREATE TABLE posts (
name text,
created_utc integer,
subreddit text,
score integer,
num_comments integer,
domain text,
title text,
url text
);
""")
for filename in glob.glob(sys.argv[1] + "/*.json"):
with open(filename, "rb") as fp:
response = json.load(fp)
for list_item in response["data"]["children"]:
post = list_item["data"]
db_row = [
post["name"],
int(post["created_utc"]),
post["subreddit"],
int(post["score"]),
int(post["num_comments"]),
post["domain"],
post["title"],
post["url"],
]
insert_sql = "INSERT INTO posts VALUES (%s)" % (", ".join("?" for _ in db_row),)
db.execute(insert_sql, db_row)
db.commit()
db.close()
</pre></blog-code></blog-section><blog-section><h2 slot=title>Analysis</h2><blog-section><h3 slot=title>By Subreddit</h3><p>OK, we've got a snapshot of the top 2000 posts and can refresh it at will. Which subreddits should I filter out server-side to minimize noise on /r/all?</p><blog-code><pre>
sqlite> SELECT COUNT(DISTINCT subreddit) FROM posts;
1608
sqlite> .mode column
sqlite> .headers on
sqlite> .width 20 10
sqlite> SELECT subreddit, COUNT(*) AS count FROM posts
...> GROUP BY subreddit ORDER BY count DESC, subreddit
...> LIMIT 20;
subreddit count
-------------------- ----------
aww 5
funny 5
gaming 5
gifs 5
pics 5
politics 5
AskReddit 4
BlackPeopleTwitter 4
CrappyDesign 4
FortNiteBR 4
PrequelMemes 4
Rainbow6 4
dankmemes 4
leagueoflegends 4
memes 4
nba 4
oddlysatisfying 4
soccer 4
todayilearned 4
trees 4
</pre></blog-code><p>This result was pretty surprising to me. I had expected to see <a href=https://en.wikipedia.org/wiki/Power_law>power law</a> numbers, with "default" subreddits like /r/funny having an order of magnitude more posts on /r/all than the average. But it looks like the front-page algorithm optimizes for maximum subreddit variety, featuring over 1600 unique subreddits within the top 2000 posts. With a limit of 99 subreddits in the server-side filter, there's just no practical way to hide noise based on subreddit name.</p></blog-section><blog-section><h3 slot=title>By Domain</h3><p>Here's where that power law showed up. Take a look at which domains the top 2000 posts are pointing at:</p><blog-code><pre>
sqlite> SELECT domain, COUNT(*) AS count FROM posts
...> GROUP BY domain ORDER BY count DESC, domain
...> LIMIT 20;
domain count
-------------------- ----------
i.redd.it 925
i.imgur.com 344
gfycat.com 137
imgur.com 103
v.redd.it 56
twitter.com 32
youtube.com 29
reddit.com 8
streamable.com 8
youtu.be 8
cdna.artstation.com 7
cdnb.artstation.com 4
clips.twitch.tv 4
media.giphy.com 4
self.AskReddit 4
self.leagueoflegends 4
streamja.com 4
78.media.tumblr.com 3
cdn.discordapp.com 3
inquisitr.com 3
</pre></blog-code><p>1565 images! Out of the top 2000 posts on the world's biggest internet forum, over 75% of them are just pictures<blog-footnote-ref>[<a href=#fn:2>2</a>]</blog-footnote-ref>!</p><figure style=text-align:center><figcaption>/r/all posts per domain as of 2018-03-11 00:40:30 UTC</figcaption><echarts-chart style=height:400px>{
"animation": false,
"tooltip": {
"trigger": "item",
"formatter": "{b}: {c}"
},
"xAxis": {
"type": "category",
"axisLabel": {
"rotate": -30,
"margin": 15
},
"data": [
"i.redd.it",
"i.imgur.com",
"gfycat.com",
"imgur.com",
"v.redd.it",
"youtube.com",
"twitter.com",
"cdn*.artstation.com",
"reddit.com",
"streamable.com"
]
},
"yAxis": {
"type": "value"
},
"series": [{
"type": "bar",
"data": [
925,
344,
137,
103,
56,
{
"value": 37,
"itemStyle": { "color": "#2f4554" }
},
{
"value": 32,
"itemStyle": { "color": "#2f4554" }
},
11,
{
"value": 8,
"itemStyle": { "color": "#2f4554" }
},
{
"value": 8,
"itemStyle": { "color": "#2f4554" }
}
]
}]
}</echarts-chart></figure><p>And the dropoff is incredible – the #10 domain on /r/all has 0.4% of the posts!</p></blog-section></blog-section><blog-section><h2 slot=title>Filtering</h2><blog-section><h3 slot=title>Without Images</h3><p>What happens if we kick out all the image hosts?</p><blog-code><pre>
sqlite> SELECT COUNT(*) FROM posts;
2000
sqlite> DELETE FROM posts WHERE url LIKE "%.jpg" OR url LIKE "%.gif" OR domain IN ('i.redd.it', 'i.imgur.com', 'gfycat.com', 'giant.gfycat.com', 'imgur.com', 'v.redd.it', 'm.imgur.com', 'i.gyazo.com', 'cdna.artstation.com', 'cdnb.artstation.com', 'flickr.com') OR domain LIKE "%.media.tumblr.com";
sqlite> SELECT COUNT(*) FROM posts;
388
sqlite> SELECT domain, COUNT(*) AS count FROM posts GROUP BY domain ORDER BY count DESC, domain LIMIT 30;
domain count
------------------------------ ----------
twitter.com 32
youtube.com 29
reddit.com 8
streamable.com 8
youtu.be 8
clips.twitch.tv 4
self.AskReddit 4
self.leagueoflegends 4
streamja.com 4
inquisitr.com 3
nytimes.com 3
self.Jokes 3
self.Showerthoughts 3
thehill.com 3
variety.com 3
businessinsider.com 2
dailycaller.com 2
dailymail.co.uk 2
en.wikipedia.org 2
epicgames.com 2
newsweek.com 2
rawstory.com 2
salon.com 2
self.AskOuija 2
self.Brawlstars 2
self.CFB 2
self.Competitiveoverwatch 2
self.DestinyTheGame 2
self.LifeProTips 2
self.WritingPrompts 2
</pre></blog-code><p>We have less than a quarter of the original dataset, but the link quality is higher. I see some newspapers in the list now, and the two most popular domains together are only 15% of the links.</p></blog-section><blog-section><h3 slot=title>Without Self-Posts</h3><p>Many of the remaining high-scoring posts are "self-posts", text posted directly to Reddit by users — basically a comment. Let's look more closely to see if they might be interesting:</p><blog-code><pre>
sqlite> SELECT score, subreddit, title FROM posts WHERE domain LIKE "self.%" LIMIT 20;
score subreddit title
-------------------- -------------------- ----------------------------------------------------------------------------------------------------
47354 Showerthoughts Being a blacksmith must have been a real pantydropper back in the day seeing how Smith is the most c
6008 atheism “Religion is what keeps the poor from murdering the rich.” ―Napoleon Bonaparte
23625 AskReddit What should people stop buying?
6289 garlicoin If this post gets 20,000 upvotes, I will give 5 random commenters 1000 GRLC.
15396 Jokes A priest and a rabbi were sitting next to each other on an airplane.
1919 leagueoflegends MLG has wiped their entire LoL archive channel. This means many important pre-LCS VODS no longer exi
5852 WritingPrompts [WP] One evening, a portal to hell opens at the foot of your bed. A demon strides through, rips off
2671 CrazyIdeas I'm starting a charity to raise awareness of pyramid schemes. Donate $100 to register as a fundraise
2087 ireland IRELAND ARE 6 NATIONS CHAMPIONS UPVOTE PARTY!!!!
5334 askscience Am I using muscles to keep my eyelids open or to keep them closed or both?
5320 confession I got married tonight and it was the worst, most stressful day of my life.
1773 nintendo Happy March 10th aka MAR10 aka Mario Day!
2924 personalfinance A “subscription” box charged me for 4 of their $107 boxes without my consent and won’t refund
1125 IAmA [AMA REQUEST] A Designer For Expensive Brands Like Gucci, Louis Vuitton, Etc
7414 dadjokes My teenage daughter came home from school and she was blazing mad. “We had sex education today, da
3886 AskReddit What is something everyone knows, but no one wants to admit?
1412 YouShouldKnow YSK that by looking up "3.11" on yahoo.co.jp, 10 cents will be donated to the East Japan Earthquake
493 Competitiveoverwatch Uber: "You know what Hydration is called in the sky, Matt?"
847 leagueoflegends Clutch Gaming vs. Echo Fox / NA LCS 2018 Spring - Week 8 / Post-Match Discussion
494 canada CBC reporting Doug Ford has won PC Leadership in Ontario by the slimmest of margins. Christine Ellio
</pre></blog-code><p>Not really. There are two good links here (awareness of a disaster-relief charity and a breaking political story), but it's 90% noise. Let's delete them for now, and consider re-adding with strict filtering in the future.</p><blog-code><pre>
sqlite> DELETE FROM posts WHERE domain LIKE "self.%";
sqlite> SELECT COUNT(*) FROM posts;
223
</pre></blog-code></blog-section><blog-section><h3 slot=title>Without Sports or Video Games</h3><p>I'm assuming that anybody who cares enough about disc golf (etc) to click its posts is already subscribed directly. Let's delete posts from any subreddits that are obviously for a specific game (physical or virtual). In theory, Reddit could support this directly on their servers with a simple tagging system for subreddits.</p><blog-code><pre>
sqlite> DELETE FROM posts WHERE subreddit IN ('49ers', 'Artifact', 'Barca', 'BattleRite', 'CollegeBasketball', 'Destiny', 'DetroitRedWings', 'DotA2', 'FantasyPL', 'FortNiteBR', 'GlobalOffensive', 'GreenBayPackers', 'LiverpoolFC', 'LonghornNation', 'MkeBucks', 'NUFC', 'NYYankees', 'NintendoSwitch', 'PS4', 'SquaredCircle', 'Steam', 'StreetFighter', 'aoe2', 'baseball', 'canucks', 'chelseafc', 'civbattleroyale', 'cowboys', 'detroitlions', 'discgolf', 'eagles', 'gamernews', 'hockey', 'lakers', 'minnesotatwins', 'nba', 'osugame', 'reddevils', 'smashbros', 'soccer', 'speedrun', 'sports', 'starcraft', 'thelastofus', 'torontoraptors', 'xboxone');
sqlite> SELECT COUNT(*) FROM posts;
169
</pre></blog-code></blog-section></blog-section><blog-section><h2 slot=title>Ranking</h2><blog-section><h3 slot=title>Reddit's Default Ranking</h3><p>Here's what the front page would look like, using the above filters with Reddit's current ranking algorithm:</p><blog-code><pre>
sqlite> .width 7 12 30 22 120
sqlite> SELECT score, num_comments, domain, subreddit, title FROM posts LIMIT 25;
score num_comments domain subreddit title
------- ------------ ------------------------------ ---------------------- --------------------------------------------------------------------------------------------------------------------
23083 1466 ultimateclassicrock.com Music 40 year old rock station in Chicago replaced by Christian radio at midnight last night. Signed off with Motley Crue’s
36976 1491 youtube.com todayilearned TIL that before the Super Bowl XLI Halftime Show, the show coordinator asked Prince if he'd be alright performing in the
8255 286 businessinsider.com Futurology SpaceX rocket launches are getting boring — and that's an incredible success story for Elon Musk: “His aim: dramatic
24711 791 inquisitr.com technology Senate Bill Meant To Punish Equifax Might Actually Reward It: Thanks to last-minute changes in legislation designed to d
11283 772 cbsnews.com politics 80 percent of mass shooters showed no interest in video games, researcher says
4621 278 haaretz.com worldnews 'Caved to religious pressure': Israeli army takes down viral Women's Day video empowering female soldiers
12958 599 fox13news.com FloridaMan Florida woman jailed for 5 months because of a failed field drug test. The lab test took 7 months to come back, revealin
40663 1571 usatoday.com books Banning literature in prisons perpetuates a system that ignores inmate humanity
26701 343 dailymail.co.uk UpliftingNews Cute video shows no-kill shelter putting old chairs to good use by letting rescue dogs curl up on them in their cages
59591 3772 seattletimes.com news Costco says extra profit from tax cuts will be shared with employees
6080 285 bellinghamherald.com nottheonion A man found 54 human hands in the snow. Russia says they’re probably just trash
20262 903 indiewire.com television Bill Hader’s ‘Massive Panic Attacks’ on ‘SNL’ Inspired His New HBO Series, ‘Barry’
2097 148 scontent-lht6-1.xx.fbcdn.net batman And that's how you end the greatest live action superhero film of all time.
4092 75 web.archive.org savedyouaclick Scientists warn of mysterious and deadly new epidemic called Disease X that could kill millions around the world | "Dise
2595 69 twitter.com TrumpCriticizesTrump "I told Rex Tillerson, our wonderful Secretary of State, that he is wasting his time trying to negotiate with Little Roc
4403 314 youtube.com videos He is not using auto tune but a form of yodeling.
3885 88 youtu.be youtubehaiku [Poetry] Rejected Theme Song from READY PLAYER ONE
6152 237 inquisitr.com AgainstHateSubreddits Reddit’s Financial Ties To Jared Kushner’s Family Under Scrutiny Amid Inaction Against The_Donald Hate Speech
44691 880 aero.umd.edu science Scientists create nanowood, a new material that is as insulating as Styrofoam but lighter and 30 times stronger, doesn?
4252 364 nydailynews.com politics FedEx won't ship items like stamps, coins or ashes — but they'll ship guns at a discount
2137 131 heroichollywood.com Marvel Marvel's 'Black Panther' Joins The $1 Billion Box Office Club
6039 679 space.com space Trump Praises Commercial Space Industry at Cabinet Meeting
2277 535 clips.twitch.tv LivestreamFail OWL referee or should I say "no fun police". DED game btw
1891 119 wect.com offbeat Cop who lied to Uber driver about it being "illegal to film police" gets reinstated, abruptly retires the next day.
2827 75 salon.com esist Is Donald Trump a cult leader? Expert says he “fits the stereotypical profile”
</pre></blog-code><p>This is better than we started with, but even after all that bulk deletion we still have to contend with noise like /r/savedyouaclick (posting clickbait on purpose), /r/LivestreamFail (people I don't know doing things I will never care about), and /r/youtubehaiku (<i>America's Funniest Home Videos</i> for snake people).</p></blog-section><blog-section><h3 slot=title>By Score</h3><p>What if we rank directly on voted score?</p><blog-code><pre>
sqlite> SELECT score, num_comments, domain, subreddit, title FROM posts ORDER BY score DESC LIMIT 25;
score num_comments domain subreddit title
------- ------------ ------------------------------ ---------------------- --------------------------------------------------------------------------------------------------------------------
59591 3772 seattletimes.com news Costco says extra profit from tax cuts will be shared with employees
44691 880 aero.umd.edu science Scientists create nanowood, a new material that is as insulating as Styrofoam but lighter and 30 times stronger, doesn?
40663 1571 usatoday.com books Banning literature in prisons perpetuates a system that ignores inmate humanity
36976 1491 youtube.com todayilearned TIL that before the Super Bowl XLI Halftime Show, the show coordinator asked Prince if he'd be alright performing in the
30939 1845 youtube.com videos It's the weekend and you know what that means
26701 343 dailymail.co.uk UpliftingNews Cute video shows no-kill shelter putting old chairs to good use by letting rescue dogs curl up on them in their cages
26312 2873 jpost.com politics Putin: Jews might have been behind U.S. election interference
24711 791 inquisitr.com technology Senate Bill Meant To Punish Equifax Might Actually Reward It: Thanks to last-minute changes in legislation designed to d
23083 1466 ultimateclassicrock.com Music 40 year old rock station in Chicago replaced by Christian radio at midnight last night. Signed off with Motley Crue’s
20262 903 indiewire.com television Bill Hader’s ‘Massive Panic Attacks’ on ‘SNL’ Inspired His New HBO Series, ‘Barry’
12958 599 fox13news.com FloridaMan Florida woman jailed for 5 months because of a failed field drug test. The lab test took 7 months to come back, revealin
11932 1418 timesofisrael.com worldnews Putin suggests ‘Jews with Russian citizenship’ behind US election interference
11283 772 cbsnews.com politics 80 percent of mass shooters showed no interest in video games, researcher says
8255 286 businessinsider.com Futurology SpaceX rocket launches are getting boring — and that's an incredible success story for Elon Musk: “His aim: dramatic
7490 80 np.reddit.com bestof Redditor mentions psychiatrist Dr. Tyler Black in a thread about gamer psychology and violence, Dr. Tyler Black shows up
6657 227 en.wikipedia.org todayilearned TIL of Major Digby Tatham-Warter, a British major who brought an umbrella into battle, using it to stop an armoured vehi
6341 740 youtu.be videos Girl goes on Dr. Phil and says she is pregnant with baby Jesus. Ultrasound reveals she is literally full of shit.
6152 237 inquisitr.com AgainstHateSubreddits Reddit’s Financial Ties To Jared Kushner’s Family Under Scrutiny Amid Inaction Against The_Donald Hate Speech
6080 285 bellinghamherald.com nottheonion A man found 54 human hands in the snow. Russia says they’re probably just trash
6039 679 space.com space Trump Praises Commercial Space Industry at Cabinet Meeting
5254 784 scmp.com worldnews Putin said he “couldn’t care less” if fellow Russian citizens sought to meddle in US election, insisting such effo
4790 176 jaha.ahajournals.org science top cardiologists have better patient outcomes when they are away. Study of patient outcomes during Transcatheter Cardio
4621 278 haaretz.com worldnews 'Caved to religious pressure': Israeli army takes down viral Women's Day video empowering female soldiers
4403 314 youtube.com videos He is not using auto tune but a form of yodeling.
4252 364 nydailynews.com politics FedEx won't ship items like stamps, coins or ashes — but they'll ship guns at a discount
</pre></blog-code><p>This … is good! I would enjoy reading a Reddit front page that looked like this.</p></blog-section></blog-section><blog-section><h2 slot=title>Conclusions</h2><p>Reddit's server-side filtering options are not currently useful for /r/all, because their ranking algorithm intentionally optimizes for a small number of posts from many subreddits, but their filter has a hard capacity limit of 99 subreddits. Their filtering could be made much more effective by offering the ability to hide unwanted domains, hide self posts, and hide certain broad categories of subreddit that users can easily self-recognize (e.g. "video games", "sports", or "livestreamers").</p></blog-section><blog-section><h2 slot=title>Other Findings</h2><blog-section><h3 slot=title>/r/The_Donald</h3><p>One interesting note is that of the top 2000 posts on /r/all, none of them came from controversial right-wing political subreddit /r/The_Donald. This appears to be intentional: at the time of writing /r/The_Donald has several recent posts with scores in the 2000-8000 range, which is far above the minimum scores seen on /r/all:</p><blog-code><pre>
sqlite> SELECT score, num_comments, domain, subreddit, title FROM posts ORDER BY score ASC LIMIT 10;
score num_comments domain subreddit title
------- ------------ ------------------------------ ---------------------- --------------------------------------------------------------------------------------------------------------------
55 15 pcper.com hardware AMD FreeSync 2 for Xbox One S and Xbox One X
57 0 nytimes.com netneutrality Washington Governor Signs First State Net Neutrality Bill
63 5 oregonlive.com oregon Bend woman gets 21 years for drugging kids so she could go tanning, do CrossFit
70 4 bitcoinafrica.io BasicIncome Universal Basic Income Experiment Launches in Kenya and Uganda Partly Funded by Bitcoin
71 40 youtube.com SugarPine7 Sexy nightmare.
75 41 reddit.com Drama /u/GallowBoob outs his sockpuppet as he justifies his pedophilia.
77 1 vstinner.github.io Python How Victor Stinner fixed a very old GIL race condition in Python 3.7
79 12 dailymail.co.uk EnoughLibertarianSpam Pro-gun poster girl is shot in the back by her four-year-old son
97 81 youtube.com eurovision It's Benjamin Ingrosso with "Dance You Off" for Sweden!
98 228 baytoday.ca CanadaPolitics Doug Ford wins PC leadership race in close vote
</pre></blog-code><p>I am personally happy about this because I find candidate-specific political forums very noisy, but it's not clear how to reconcile this behavior with the Reddit administration's public claims of content neutrality.</p></blog-section></blog-section><blog-footnotes slot=footnotes><hr><ol><li id=fn:1><p>My first attempt used <a href=https://pypi.python.org/pypi/praw>praw</a>, but it requires a registered client ID and I didn't want to go through that just to get a few MB of JSON.</p></li><li id=fn:2><p>You might object that pictures can be useful, but those aren't the kind Reddit upvotes. Currently #1 on the site, judged by the users to be more interesting than any other post, is a <a href=https://www.reddit.com/r/funny/comments/83iv2w/this_is_one_of_my_favorite_scenes_from_how_its/>gif of a falling tree plus fake captions</a>.</p></li></ol></blog-footnotes></blog-article>2018-03-11T04:34:36ZRe:Creators Episode 212018-03-10T21:24:27Zurn:uuid:7c69ff9b-1f00-4e60-8432-03ba734e8a51<style>img{max-width:100%}figure{margin:1em}figcaption{font-weight:700;margin:0 0 1em}</style><blog-article posted=2018-03-10T21:24:27Z><h1 slot=title>Re:Creators Episode 21</h1><p>Episode 21 of <a href=https://en.wikipedia.org/wiki/Re:Creators>Re:Creators</a> ends with a rather nicely typeset message to the viewer, in Japanese and Latin:</p><img src=/🤔/recreators-episode-21/21m23s.png alt="Re:Creators episode 21, at 21m 23s"><blockquote>Mundum divit factum, atque pulchre.<br>世界は豊かに、そして美しく</blockquote><p>That's some unusual Latin. I wonder how they translated it? Can I do better?</p><p>Google Translate<blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref> and Yandex Translate both support Latin output, but their results don't match the screencap. Bing doesn't support Latin at the time of writing<blog-footnote-ref>[<a href=#fn:2>2</a>]</blog-footnote-ref>.</p><div style=display:flex><figure style=flex:50%><figcaption>Google Translate</figcaption><img src=/🤔/recreators-episode-21/google-translate.png alt="Google Translate: Dives est mundus et pulchra"></figure><figure style=flex:50%><figcaption>Yandex Translate</figcaption><img src=/🤔/recreators-episode-21/yandex-translate.png alt="Yandex Translate: Mundus est dives, et pulchra"></figure></div><p>Note that Google and Yandex have nearly identical outputs after accounting for Latin's order-independent grammar, and it's a reasonable-looking solution.</p><p>Let's look more directly at the vocabulary being used:</p><ul><li><i>Mundus</i> / <i>mundum</i> is a direct translation of「<a href=http://jisho.org/word/世界>世界</a>」. Using a <a href=http://latindictionary.wikidot.com/noun:mundus>Latin dictionary</a> as reference, we see that <i>mundus</i> is in the <a href=https://en.wikipedia.org/wiki/Nominative_case>nominative case</a>, and <i>mundum</i> is in the <a href=https://en.wikipedia.org/wiki/Accusative_case>accusative case</a>. In English, we use word order to make this distinction – "The mundus is …", "… around the mundum". In「世界は」, the particle「は」has semantics similar to English's "Regarding the …". So we should use the nominative case: <i>mundus</i>.</li><li><a href=http://latindictionary.wikidot.com/adjective:dives><i>dives</i></a> is an adjective, being used here as a translation for「<a href=http://jisho.org/word/豊か>豊か</a>」. <i>divit</i> isn't a Latin word, or at least not one I can find in any dictionary. Are we done with this part? 
Not quite – <i>dives</i> means wealthy<blog-footnote-ref>[<a href=#fn:3>3</a>]</blog-footnote-ref>, but「豊か」has the slightly different meaning of plentiful, abundant, or bountiful. Wiktionary suggests <a href=https://en.wiktionary.org/wiki/copia#Latin><i>copia</i></a> or <a href=https://en.wiktionary.org/wiki/abundantia#Latin><i>abundantia</i></a> would be more fitting.</li><li><a href=http://latindictionary.wikidot.com/conjunction:atque><i>atque</i></a> is one of the Latin equivalents for <i>and</i>. This is our clue that it's being used as a translation for the conjunction「<a href=http://jisho.org/word/然して>そして</a>」. It's not quite a good fit though – Lewis & Short's <i>A Latin Dictionary</i> notes<blog-footnote-ref>[<a href=#fn:4>4</a>]</blog-footnote-ref> that <i>atque</i> is used before words starting with vowels, and <i>ac</i> is the form before consonants. The examples also make it clear that <i>atque</i> is a much tighter binding than「そして」, which is usually translated as "and thus" or "and therefore". Perhaps <i>atque</i> would be more like「<a href=http://jisho.org/word/と>と</a>」? We'll come back to this in a moment.</li><li><a href=http://latindictionary.wikidot.com/adjective:pulcher><i>pulcher</i></a> is another adjective, and an exact match for「<a href=http://jisho.org/word/美しい>美しい</a>」. Both translations use the <a href=https://en.wikipedia.org/wiki/Vocative_case>vocative case</a>, with <i>pulchre</i> being the masculine form and <i>pulchra</i> the feminine. I don't have high confidence that either of these is correct – vocative is an odd choice because we're not talking to the world itself. Let's use the accusative: <i>pulchrum</i>.</li></ul><p>We now have enough to attempt a literal translation:</p><blockquote>世界は豊かに、そして美しく<br>Mundus est copiosum ac pulchrum.<br>The world is bountiful and beautiful.</blockquote><p>But we can do better! The original text appears to be using grammatical forms from <a href=https://en.wikipedia.org/wiki/Classical_Japanese_language>Classical Japanese</a>. To a native reader it would seem slightly poetic or literary, the feeling of which we can reproduce in Latin and English by adjusting the vocabulary and word order. Amazon's English subtitles translated this as "The world is full of abundance and beauty". As a native speaker I don't know how to name the "X is Y" -> "X is full of Y" pattern, but it does seem to add a certain poetic feeling.</p><p>First, let's review the use of <i>atque</i> / <i>ac</i>. To me, those words seem more suited for "I went to the store to buy eggs ac milk". Latin has a <a href=https://en.wiktionary.org/wiki/-que><i>‑que</i></a> suffix, a conjunction appended to words to imply they go together and are somehow related. The world is beautiful because it is abundant, so <i>‑que</i> may be a good fit here.</p><p>To convert the adjective <i>pulcher</i> into a noun, we need to add the <a href=https://en.wiktionary.org/wiki/-tudo#Latin><i>‑tudo</i></a> suffix – more specifically, the accusative case <i>‑tudinem</i>. While we're at it, let's use <i>abundantia</i> (accusative: <i>abundantiam</i>) instead of <i>copia</i> to match Amazon.</p><p>Putting these adjustments together, we arrive at this translation:</p><blockquote>世界は豊かに、そして美しく<br>Mundus abundantiam plenus est pulchritudinemque.<br>The world is full of abundance and beauty.</blockquote><p>I don't like how long this Latin is. Romans valued brevity, so let's step back a bit toward our first translation. 
By using the verb <a href=https://en.wiktionary.org/wiki/abundo#Latin><i>abundat</i></a> and our adjective <i>pulchrum</i> we can trim out almost half of it. I'm using <i>et</i> for this one instead of <i>‑que</i> to resemble Horace's <a href=https://en.wikipedia.org/wiki/Dulce_et_decorum_est_pro_patria_mori>"<i>dulce et decorum est</i> …"</a>
<blog-footnote-ref>[<a href=#fn:5>5</a>]</blog-footnote-ref>, and moving <i>mundus est</i> to the end:</p><blockquote>世界は豊かに、そして美しく<br>Abundat et pulchrum mundus est.<br>The world is abundant and beautiful.</blockquote><p>This looks reasonable. I'm content with it.</p><p>If you've somehow made it to the end of this page and thought "I want to read another eight hundred pages of this", find a copy of <a href=https://en.wikipedia.org/wiki/Le_Ton_beau_de_Marot>Le Ton beau de Marot</a> by Hofstadter.</p><blog-footnotes><hr><ol><li id=fn:1><p><a href=https://latin.stackexchange.com/questions/4349/what-is-google-translate-good-for>Google Translate does struggle with Latin</a>, but can usually get the gist across.</p></li><li id=fn:2><p>Bing does support Klingon! Any trekkies around who can check their work?</p><img src=/🤔/recreators-episode-21/bing-klingon.png alt="Bing Translate: Japanese to Klingon"></li><li id=fn:3><p>The word <i>dives</i> is familiar to many English speakers via <a href=https://en.wikipedia.org/wiki/Rich_man_and_Lazarus>The Parable of Dives and Lazarus</a>.</p></li><li id=fn:4><p><a href="http://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.04.0059:entry=atque">http://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.04.0059:entry=atque</a></p></li><li id=fn:5><p>This may seem strange given what I said about <i>ac</i> earlier. A native Roman speaker of Latin would probably have considered them equivalent, but in modern times the fame of Horace makes <i>et</i> seem a bit fancier.</p></li></ol></blog-footnotes></blog-article>2018-03-10T21:24:27ZSRE School: Instrumentation2018-03-03T18:52:24Zurn:uuid:bb0fbcbc-7a4f-4347-8e71-8d48e4e5ea91<blog-article posted=2018-03-03T18:52:24Z><h1 slot=title>SRE School: Instrumentation</h1><p slot=summary>Instrumentation is the foundation of a monitoring infrastructure. It is the part that directly touches the system(s) being monitored, the source of raw data for our collectors and analyzers and dashboards. It is also the only part that is not under an SRE team's direct control – instrumentation is usually plumbed through the codebase by product teams. Given this, an SRE's primary source of leverage is to make adding instrumentation as easy and painless as possible. We do this by writing instrumentation libraries with friendly, approachable, idiomatic APIs.</p><blog-section><h2 slot=title>Metrics</h2><p>Each measurable property of the system is a <i>metric</i>. Repeated measurements of a metric's value yield a <a href=https://en.wikipedia.org/wiki/Time_series>time series</a> of <i>data samples</i>. A metric's definition includes metadata about how to collect, aggregate, and interpret its samples.</p><p>Metric values can in theory be of any serializable data type, but in practice they are numbers, text, or distributions:</p><ul><li>Numeric metrics may have an associated unit, ideally in a machine-readable annotation. This is most important for metrics where the "natural" definition of a unit is divisible, e.g. to record time intervals as an integral amount of milliseconds instead of a fractional amount of seconds.</li><li>Text metrics are most often constants, but are sometimes used for gauges if there's a small number of possible values.</li><li>Distributions are used for metrics with a very large set of possible values. 
They are usually visualized as a histogram or heat map.</li></ul><p>A C-style enumeration such as <code>enum { OPT_FOO = 1; OPT_BAR = 2; }</code> is best reported as <code>"OPT_FOO"</code> and <code>"OPT_BAR"</code><blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref> instead of numeric <code>1</code> and <code>2</code>.</p><p>Booleans can be thought of as the enum <code>{ FALSE, TRUE }</code>. Some monitoring systems give them a separate type to simplify query planning and analysis.</p><p>Metrics can be defined ad-hoc at the point of emission, or statically in some global type. I prefer statically declared metrics because that gives the opportunity to attach <a href=#metric-metadata>metric metadata</a>.</p><p>There are four common categories of metrics: constants, gauges, counters, and distributions<blog-footnote-ref>[<a href=#fn:2>2</a>]</blog-footnote-ref>.</p><blog-section><h3 slot=title>Constants</h3><p>A metric that does not change for the lifetime of its associated system component. Samples of a constant metric will always contain the same value. Common examples are build information (e.g. git commit ID), process start time, and process ID. Don't use constants for things that are only constant-ish, such as hostnames.</p><p>Constants can be text or numbers. For numbers, integers usually work better than floats (e.g. represent your start time as <code>int64 milliseconds</code> instead of <code>float64 seconds</code>).</p><table><thead><tr><th>Time</th><th><code>/build/timestamp</code> (seconds since UNIX epoch)</th><th><code>/build/revision_id</code></th></tr></thead><tbody><tr><td>2011-12-13 14:15</td><td>1300000000</td><td>git:da39a3ee5e6b4b0d3255bfef95601890afd80709</td></tr><tr><td>2011-12-13 14:16</td><td>1300000000</td><td>git:da39a3ee5e6b4b0d3255bfef95601890afd80709</td></tr><tr><td>2011-12-13 14:17</td><td>1300000000</td><td>git:da39a3ee5e6b4b0d3255bfef95601890afd80709</td></tr></tbody></table><p>In Go, using a constant metric might look something like this:</p><blog-code syntax=go><pre>
import "foo.com/my/monitoring/impl/metric"
var (
_TIMESTAMP int64 /* filled in by linker */
_REVISION_ID string /* filled in by linker */
metric.NewConstantInt64("/build/timestamp", _TIMESTAMP)
metric.NewConstantString("/build/revision_id", _REVISION_ID)
)
</pre></blog-code></blog-section><blog-section><h3 slot=title>Gauges</h3><p>A gauge metric can vary freely across its possible value range. Think of them like tachometers.</p><p>Gauges can be text or numbers.</p><ul><li>Example integer gauges are memory allocation, thread count, active RPC count.</li><li>Example text gauges are mutable config settings (e.g. backend addresses), environment variables, and hostnames.</li></ul><table><thead><tr><th>Time</th><th><code>/proc/thread_count</code></th><th><code>/proc/working_directory</code></th></tr></thead><tbody><tr><td>2011-12-13 14:15</td><td>200</td><td>/var/www/current</td></tr><tr><td>2011-12-13 14:16</td><td>250</td><td>/var/www/previous</td></tr><tr><td>2011-12-13 14:17</td><td>230</td><td>/var/www/current</td></tr></tbody></table><p>In Go, using a gauge metric might look something like this:</p><blog-code syntax=go><pre>
import "foo.com/my/monitoring/impl/metric"
var (
threadCount = metric.NewGaugeInt64("/proc/thread_count")
workingDir = metric.NewGaugeString("/proc/working_directory")
)
func updateMetrics() {
threadCount.Set(int64(runtime.NumGoroutine()))
wd, _ := os.Getwd()
workingDir.Set(wd)
}
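
// Gauges are point-in-time readings, so they must be refreshed before each
// collection. A minimal sketch (assuming a collector that scrapes on its own
// schedule) is to refresh them from a background ticker:
func startMetricsLoop() {
	go func() {
		for range time.Tick(5 * time.Second) {
			updateMetrics()
		}
	}()
}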
</pre></blog-code></blog-section><blog-section><h3 slot=title>Counters</h3><p>A counter metric must be a number, and can only increase during the lifetime of the system. Counters are almost always integers to avoid the implications of IEEE-754 rounding.</p><p>Example counter metrics are CPU microseconds spent, or the total request count.</p><p>Counters can only increase. If the metric collector sees that a new value is lower than the older value, it knows a <i>metric reset</i> has occurred. Resets happen when a process restarts, clearing in-memory state of the counter.</p><table><thead><tr><th>Time</th><th><code>/net/http/server/request_count</code></th><th></th></tr></thead><tbody><tr><td>2011-12-13 14:15</td><td>10000</td><td></td></tr><tr><td>2011-12-13 14:16</td><td>11000</td><td></td></tr><tr><td>2011-12-13 14:17</td><td>1500</td><td>RESET</td></tr></tbody></table><p>In Go, defining a counter metric might look something like this:</p><blog-code syntax=go><pre>
import "foo.com/my/monitoring/impl/metric"
var (
requestCount = metric.NewCounterInt64("/net/http/server/request_count")
)
func handler(w http.ResponseWriter, req *http.Request) {
requestCount.Increment() // or .IncrementBy(1)
}
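
// A collector detects a reset when a newly scraped value is lower than the
// previous one. A minimal sketch of the collector-side logic (not part of
// the hypothetical metric API above):
func counterDelta(prev, cur int64) int64 {
	if cur < prev {
		// Reset: the process restarted and the counter began again at
		// zero, so the entire new value is growth since the reset.
		return cur
	}
	return cur - prev
}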
</pre></blog-code></blog-section><blog-section><h3 slot=title>Distributions</h3><p><a href=https://en.wikipedia.org/wiki/Frequency_distribution>Distributions</a> are used for metrics with a very large set of possible values. They are usually visualized as a histogram or heat map.</p><p>Examples include request latencies, client IP addresses<blog-footnote-ref>[<a href=#fn:3>3</a>]</blog-footnote-ref>, and aggregations of constant/gauge/counter metrics from other sources.</p><table><thead><tr><th>Time</th><th><code>/net/http/server/response_latency</code> (seconds)</th></tr></thead><tbody><tr><td>2011-12-13 14:15</td><td><pre>
[ 0, 2) #
[ 2, 3) ###
[ 3, 5) #######
[ 5, 8) ####
[ 8, 13) ##
[13, ∞)
</pre></td></tr><tr><td>2011-12-13 14:16</td><td><pre>
[ 0, 2) #
[ 2, 3) ####
[ 3, 5) ########
[ 5, 8) ###
[ 8, 13) #
[13, ∞)
</pre></td></tr><tr><td>2011-12-13 14:17</td><td><pre>
[ 0, 2)
[ 2, 3) #
[ 3, 5) ##
[ 5, 8) #####
[ 8, 13) ########
[13, ∞) #
</pre></td></tr></tbody></table><p>In Go, defining a distribution metric might look something like this:</p><blog-code syntax=go><pre>
import "foo.com/my/monitoring/impl/metric"
var (
latency = metric.NewDurations(
"/net/http/server/response_latency",
metric.BinDurations([]time.Duration{
2 * time.Second,
3 * time.Second,
5 * time.Second,
8 * time.Second,
13 * time.Second,
})
)
)
func handler(w http.ResponseWriter, req *http.Request) {
start := time.Now()
defer func() {
latency.Sample(time.Now() - start)
}()
}
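
// Bins can also be generated by a function instead of a static list (see the
// note after this example). A sketch of an exponential binning helper; the
// name and signature are assumptions, not part of the example library:
func exponentialBins(base time.Duration, factor float64, count int) []time.Duration {
	bins := make([]time.Duration, count)
	d := base
	for i := range bins {
		bins[i] = d
		d = time.Duration(float64(d) * factor)
	}
	return bins
}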
</pre></blog-code><p>Each distribution is also inherently a set of counters, because recording a sample in one of the bins will increment that bin's count. This property can be used to simplify some monitoring configurations.</p><p>Bins can be defined statically (as in the example above), or using a function. Binning might be performed either by the system reporting the metric, or by the monitoring infrastructure.</p><ul><li>With <b>client-side binning</b>, the reporter decides how fine-grained the distribution should be.<ul><li>This is usually configurable per-metric by a command-line flag or config setting.</li><li>Changing the binning can cause vertical aberrations in visualisations.</li></ul></li><li>With <b>collector-side binning</b>, the client reports the events as-is and the monitoring infrastructure aggregates the data before storing/forwarding it.<ul><li>Example: collector receives raw distribution samples from its clients, and records {50,90,95,99}th percentiles over a trailing window.</li><li>This can be significantly less flexible, and it is often difficult to visualize percentiles as usefully as a full distribution.</li></ul></ul></blog-section><blog-section><h3 slot=title>Metric Names</h3><p>I know of three styles for metric names:</p><ul><li>The <a href=https://prometheus.io/docs/practices/naming/>Prometheus Style Guide</a> recommends <code>myapp_descriptive_snake_case</code>, where <code>myapp_</code> is a one-word prefix specific to the system being monitored. This style is derived from <a href=http://landing.google.com/sre/book/chapters/practical-alerting.html>Google Borgmon</a>, which uses metric names as symbols in its configuration DSL<blog-footnote-ref>[<a href=#fn:4>4</a>]</blog-footnote-ref>.<ul><li><a href=https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html>Amazon CloudWatch metrics</a> use this format, without a prefix.</li></ul></li><li><a href=https://github.com/etsy/statsd>statsd</a> and its derivatives use <code>short.dotted.words</code>, though the exact symbol set can vary between vendors.<ul><li>For example, <a href=https://help.datadoghq.com/hc/en-us/articles/203764705-What-are-valid-metric-names->DataDog allows alphanum, underscores, and periods</a>.</li></ul></li><li><a href=https://cloud.google.com/monitoring/api/metrics_gcp>Google Stackdriver</a> uses <code>myapp.com/unix/filesystem/paths</code>, with each product having its own "subdirectory" in the metrics hierarchy.<ul><li>The same style is applied to <a href=https://cloud.google.com/monitoring/api/metrics_aws>AWS metrics in Stackdriver</a>, by adding product-specific prefixes for each CloudWatch metric.</li></ul></li></ul><p>My personal favorite is the UNIX paths style, which I've seen used to great success. Engineers exposed to this style begin to naturally lay out metric hierarchies, with clear meanings and good namespacing. I don't have any solid data about <i>why</i> the naming style has such an effect, but I suspect it has something to do with familiarity:</p><ul><li>A metric name like <code>http_request_count</code> is well and good, but <code>myapp_com.net.http.server.request_count</code> looks <i>wrong</i> to an experienced engineer. Expressions that use that many dots violate the <a href=http://wiki.c2.com/?LawOfDemeter>Law of Demeter</a>.</li><li>In contrast, path-shaped metric names like <code>myapp.com/net/http/server/request_count</code> inspire no such negative thoughts. 
Long paths are common in UNIX environments, and such a name is certainly no harder to remember than many of the paths in Linux's <code>sysfs</code>.</li></ul></blog-section></blog-section><blog-section><h2 slot=title>Traces</h2><p>While metrics help you understand the system in aggregate, traces are used to understand the relationship between the parts of a system that processed a particular request.</p><p>A trace is a tree of <i>spans</i>, which each represent a logical region of the system's execution time. Spans are nested – all spans except the <i>root span</i> have a <i>parent span</i>, and a trace is constructed by walking the tree to link parents with their children.</p><pre>
######################## GET /user/messages/inbox
###### User permissions check
#### Read template from disk
######### Query database
### Render page to HTML
## Compress response body
###### Write response body
</pre><p>Spans and traces can be understood by analogy to lower-level programming concepts. If a trace is a stack trace, then a span is a single stack frame. Just as every stack frame is pushed and popped, each span begins and ends. It's the timing of when the spans begin and end that is interesting when analyzing a trace.</p><p>Each span is implicitly a sample of a duration distribution, and therefore also a counter<blog-footnote-ref>[<a href=#fn:5>5</a>]</blog-footnote-ref>.</p><p>Tools for creating and recording traces are currently less mature than for creating metrics, and a wide variety of tracing platforms exists. <a href=http://opentracing.io/>OpenTracing</a> is an attempt to provide vendor-neutral APIs for many languages so that tracing support can more easily be added to shared libraries.</p></blog-section><blog-section><h2 slot=title>Events</h2><p>Events are conceptually similar to logging, but with an implied increase in how interesting a human will find the event as compared to normal logs. A web server might log a message for every request, but only record an event for things like unhandled exceptions, config file changes, or 5xx error codes.</p><p>Events are usually rendered in dashboards on top of visualized metric data, so humans can correlate them with potential production impact. For example, an oncaller might be better able to debug a spike in request latency if the dashboard shows it was immediately preceded by a config change.</p><p>Events can also be archived to a searchable event log. This can be useful when investigating unexpected behavior that occurred in a large window of time – logs may be too noisy to search, but the event log can quickly find "all SSH logins to this machine in the last 3 hours".</p><p>Events that indicate programming errors should be recorded in a ticket tracking system, then assigned to an engineer for diagnosis and correction. This should be relatively rare – if your service encounters unhandled errors more than once a month or so, then you should improve its automated test suite.</p></blog-section><blog-section><h2 slot=title>Metric Metadata</h2><p>A raw stream of numbers can be useful to authors of the system who are deeply familiar with its internal details, but can be opaque to other engineers (including oncall SREs). Attaching <i>metadata</i> to metrics at their point of definition can help with this by acting as type hints, documentation, and cross-references.</p><p>Types of metadata that might be added include:</p><ul><li>Human-readable <b>documentation</b>, such as a description of the metric's deeper meaning. Very nice to have when staring at hundreds of similar-looking metrics in a dashboard builder.</li><li><b>Numeric units</b>, so the monitoring system can combine millisecond-resolution data from one system with minute-resolution data from another. Or bytes with gibibytes, or Mb/s with kB/s.</li><li><b>Tags</b> (see below), which benefit from pre-definition and strong typing in the same way metrics do.</li><li><b>Source code location</b>, usually inserted automatically by the monitoring library. Navigating from a dashboard to source code is often the first step for investigating an anomalous chart reading.</li><li><b>Contact info</b> for a person or team that has more context about what the metric means, or how it relates to the overall health of a system.</li><li><b>Stability</b> – metrics are an API too!
Some metrics are experimental and shouldn't be built into other teams' dashboards, so it's useful to be able to indicate "this metric's definition is stable" vs "this could change without warning".</li></ul><p>In Go, defining a metric with metadata might look something like this:</p><blog-code syntax=go><pre>
import "foo.com/my/monitoring/impl/metric"
var (
requestCount = metric.NewCounterInt64(
"/net/http/server/request_count",
metric.Description("A count of the HTTP requests received by this process."),
)
)
func handler(w http.ResponseWriter, req *http.Request) {
requestCount.Increment() // or .IncrementBy(1)
}
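
// The other kinds of metadata listed above could be attached in the same
// style. These option names are hypothetical, for illustration only:
var responseBytes = metric.NewCounterInt64(
	"/net/http/server/response_bytes",
	metric.Description("Total bytes written in HTTP response bodies."),
	metric.Units("bytes"),                  // machine-readable unit annotation
	metric.Contact("sre-team@example.com"), // who to ask about this metric
	metric.Stability(metric.Experimental),  // not yet safe to build dashboards on
)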
</pre></blog-code><blog-section id=metric-tags><h3 slot=title>Tags</h3><p>Tags are attached to a sample, span, or event to provide more information and context. They are a critical part of a metric, because without tags you couldn't tell which machine has unusually high load, or whether your HTTP responses are <code>status: OK</code> or <code>status: INTERNAL_SERVER_ERROR</code>.</p><p>Tags are almost always named with <code>short_snake_case</code>. There's no need to have full namespacing as in metric names, because tags are implicitly namespaced to their metric.</p><p>Tags should have low <a href=https://en.wikipedia.org/wiki/Cardinality_(data_modeling)>cardinality</a> – the number of possible key-value combinations in the data. Tagging the response status is fine because there are only a few dozen of them, but tagging the timestamp or client IP would place an enormous load on your collection and analysis infrastructure.</p><p>Tag value types are a restricted subset of metric value types: integers, text, and maybe bools. Floats are forbidden due to cardinality, and distributions don't make sense as a tag.</p><p>Just like for metrics, tags might be declared ad-hoc or statically. Static declaration of tags with their metrics improves the information available in dashboards, and helps catch programming errors before they land in prod.</p><p>In Go, defining a metric with tags might look something like this:</p><blog-code syntax=go><pre>
import "foo.com/my/monitoring/impl/metric"
var (
requestCount = metric.NewCounterInt64(
"/net/http/server/request_count",
metric.StringTag("method"),
)
)
func handler(w http.ResponseWriter, req *http.Request) {
// func (c *CounterInt64) Increment(tagValues ...metric.TagValue)
// func (c *CounterInt64) IncrementBy(count int64, tagValues ...metric.TagValue)
requestCount.Increment(req.Method)
}
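
// Tag values must stay low-cardinality. A sketch of bucketing a raw status
// code into a small "status_class" tag value ("2xx", "5xx", ...):
func statusClass(code int) string {
	switch {
	case code >= 500:
		return "5xx"
	case code >= 400:
		return "4xx"
	case code >= 300:
		return "3xx"
	case code >= 200:
		return "2xx"
	default:
		return "1xx"
	}
}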
</pre></blog-code><p>An alternative style, which is more type-safe but also more verbose, might be:</p><blog-code syntax=go><pre>
var (
	tagMethod    = metric.NewStringTag("method")
	requestCount = metric.NewCounterInt64(
		"/net/http/server/request_count",
		metric.Tags(tagMethod),
	)
)

func handler(w http.ResponseWriter, req *http.Request) {
	requestCount.Increment(tagMethod(req.Method))
}
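
// One way this type-safe style might be modeled (an assumption, not taken
// from the article): the tag declaration returns a closure that wraps values
// in an opaque TagValue, so only pre-declared tags can be passed in.
type TagValue struct {
	key, value string
}

func NewStringTag(key string) func(string) TagValue {
	return func(value string) TagValue {
		return TagValue{key: key, value: value}
	}
}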
</pre></blog-code><p>Note that neither style protects against forgetting a tag. In Go this is acceptable because zero-valued defaults are idiomatic, but other languages may prefer to require all tags to be specified when recording a sample.</p></blog-section></blog-section><blog-section><h2 slot=title>Push vs Pull</h2><p>Until I started looking into open-source monitoring frameworks, I didn't realize the "push vs pull" debate existed. I still don't fully understand it. Have we, as an industry, forgotten that TCP sockets are bidirectional? Anyway, here's a summary of the two sides.</p><blog-section><h3 slot=title>Push Model</h3><p>In the push model, processes are configured with the network address of a metric collector. They send metrics on their own schedule, either periodically (e.g. every 5 minutes) or whenever a value changes. <a href=https://github.com/etsy/statsd>statsd</a> and its various derivatives are a canonical example of the push model – to increment a counter or set a gauge, the process sends a UDP packet<blog-footnote-ref>[<a href=#fn:6>6</a>]</blog-footnote-ref> to the collector with the metric name and value.</p><p>The push model is dead simple to implement, and has the significant advantage of not requiring any sort of service discovery infrastructure. But it's also inflexible and difficult to manage – metric collection policies are hardcoded (or require complex configuration management), and load balancing between collectors is difficult.</p></blog-section><blog-section><h3 slot=title>Pull Model</h3><p>In the pull model, processes provide network access to their metrics and register themselves in a service discovery infrastructure such as <a href=https://www.consul.io/>Consul</a>. Typical implementations are an HTTP endpoint (e.g. Prometheus's <code>/metrics</code>) or a simple request-response RPC. The collectors use service discovery to find endpoints, scrape them on their own schedule, and make the data available on their own endpoints for scraping by higher-level collectors.</p><p>Two significant downsides to the pull model are the dependency on service discovery, and lack of backpressure:</p><ul><li>If your service discovery infrastructure is degraded or unavailable, then newly created processes might not be monitored properly. Monitoring the discovery infrastructure itself is also a challenge, because your collectors need some way to hard-code the discovery service metric endpoints.</li><li>A fleet of collectors can easily send more metric requests than a single process can handle. Incorrect load balancing, monitoring configuration mistakes, or aggressive retries can cause your monitoring infrastructure to degrade the system it's monitoring!</li></ul></blog-section></blog-section><blog-section><h2 slot=title>Bi-Directional Collection</h2><p>One solution to the push-vs-pull debate is to have the instrumented system connect to the collectors, receive its collection policy from them, and then push samples. This achieves the best of both worlds – the collector can set policy about which metrics to push and how often, but implementation of the policy is left up to the monitored system. Service registration is present only in vestigial form, because the monitored system can register with <i>any</i> collector instead of a globally-consistent service discovery infrastructure.</p><div style=text-align:center><pre style=text-align:left;display:inline-block;margin:0>
+------------------+ +------------+
| Monitored System | | Collector |
+------------------+ +------------+
|| ||
|| Announcement ||
|| ---------------------------> ||
|| ||
|| Collection Policy ||
|| <--------------------------- ||
|| ||
|| Samples ||
|| ---------------------------> ||
|| ||
|| Samples ||
|| ---------------------------> ||
|| ||
|| ... ||
|| ||
</pre></div><p>The monitored system starts the process by connecting to the collector, and announcing its identity. The identity consists of things like cluster name, machine ID, process ID, or other ways to distinguish processes from each other.</p><p>A monitored system might announce multiple identities, for example if it's proxying metrics from some other source. A process that scrapes Apache log files to count errors might report two identities, one for itself and one for Apache. Each identity has independent (and possibly overlapping) sets of metrics.</p><blog-section><h3 slot=title>Collection Policies</h3><p>A large binary might be instrumented with many thousands of metrics, but only a subset will be of interest to the SRE team. Furthermore, some metrics should be updated more often than others – and the details can change as the SRE team refines dashboards or investigates ongoing service degradation. The rules about which metrics to push, and how frequently to push them, are encoded in a <i>collection policy</i> that the collector sends to the monitored system.</p><p>The following example policy pushes metrics starting with <code>/build/</code> every 10 minutes, and metrics starting with <code>/proc/</code> or <code>/net/rpc/server/</code> every 5 seconds. The metric <code>/net/rpc/client/response_latency</code> is also pushed every 5 seconds, but other metrics under <code>/net/rpc/client/</code> are not pushed.</p><blog-code syntax=yaml><pre>
metrics:
  - prefix: "/build/"
    interval: {seconds: 600}
  - prefix: "/proc/"
    interval: {seconds: 5}
  - prefix: "/net/rpc/server/"
    interval: {seconds: 5}
  - name: "/net/rpc/client/response_latency"
    interval: {seconds: 5}
</pre></blog-code><p>A collector might also request specific <a href=#events>events</a> and <a href=#traces>trace spans</a>, or all of them.</p><p>Note that there is no hard requirement on the monitored system to push at the specified interval. It might push less often if it's running low on CPU allocation, or perform an unscheduled push during shutdown.</p></blog-section><blog-section><h3 slot=title>Sample Compression</h3><p>An unexpected benefit of pushing metrics in a reliable connection-oriented protocol is the opportunity for cheap data compression. Metric names, unchanged sample values, and timestamps are easy wins to reduce bandwidth requirements in your metric collection.</p><blog-section id=compress-metric-names><h4 slot=title>Metric Names</h4><p>When the monitored system pushes a metric sample, it can allocate a connection-unique ID to that metric name. For later pushes, the name doesn't need to be re-transmitted. This is an especially good fit for <a href=https://developers.google.com/protocol-buffers/>protocol buffers</a>, because each message field is identified by an ID. Therefore, a sample push can be encoded in the protobuf wire format as a sequence of <code>(metric_id, metric_value)</code> tuples, where the <code>metric_value</code> is of the protobuf type corresponding to the metric type.</p><p>A brief example, showing the original metric definition on the left, and the logical protobuf encoding on the right:</p><table style="margin:0 auto"><tr><td><pre style=margin:20px>
metric {
name: "/proc/thread_count"
type: INT64
per_connection_metric_id: 1
}
metric {
name: "/proc/working_directory"
type: TEXT
per_connection_metric_id: 2
}
</pre></td><td style="vertical-align:top;border-left:1px solid"><pre style=margin:20px>
message {
int64 proc_thread_count = 1;
string proc_working_directory = 2;
}
</pre></td></tr></table></blog-section><blog-section id=compress-unchanged-samples><h4 slot=title>Unchanged Samples</h4><p>Metric values often change less frequently than their collection interval. Instead of resending the same value over and over, the protocol can have a <code>repeated int64 unchanged_metric_id</code> field. Any metric IDs in this list will be treated as if they were sent using the last value seen in the current connection.</p></blog-section><blog-section id=compress-timestamps><h4 slot=title>Timestamps</h4><p>If timestamps are a metric type encoded into the protocol instead of just using integers, then they can be compressed using a timestamp base. For example, instead of sending <code>int64 timestamp</code> for each sample, send <code>int64 timestamp_base</code> in the announcement message and <code>int32 timestamp_offset</code> in the samples. Then reconstruct the original values in the collector as <code>timestamp_base + timestamp_offset</code>.</p><p>This technique works regardless of whether you use a fixed-length integer field, or a protobuf-style varint. Fixed-length fields will save 50% of each timestamp per sample; varint savings will vary depending on how small the offsets are. Note that for protobuf, choosing a timestamp base in the future and using negative offsets may result in more compact output due to <a href=https://developers.google.com/protocol-buffers/docs/encoding#signed-integers>ZigZag encoding</a>.</p><p>The base time must be updated to a larger value if the offset would overflow a signed 32-bit integer. The resolution of your timestamps will affect how often the base time must be updated:</p><table><thead><tr><th>Maximum Offset</th><th>Seconds</th><th>Minutes</th><th>Hours</th><th>Days</th></tr></thead><tbody><tr><td>2<span style=display:none>^</span><span style=vertical-align:super;font-size:smaller>31</span> nanoseconds</td><td>2.15</td><td>-</td><td>-</td><td>-</td></tr><tr><td>2<span style=display:none>^</span><span style=vertical-align:super;font-size:smaller>31</span> microseconds</td><td>2147.48</td><td>35.79</td><td>-</td><td>-</td></tr><tr><td>2<span style=display:none>^</span><span style=vertical-align:super;font-size:smaller>31</span> milliseconds</td><td>-</td><td>-</td><td>596.52</td><td>24.86</td></tr></tbody></table></blog-section></blog-section></blog-section><blog-footnotes slot=footnotes><hr><ol><li id=fn:1><p>Or maybe <code>"opt_foo"</code> and <code>"opt_bar"</code>. <code>"OptFoo"</code> is right out.</p></li><li id=fn:2><p>Distributions are sometimes called "histograms", for example by DataDog and Prometheus, but this is technically incorrect – a histogram is a visualization of a distribution.</p></li><li id=fn:3><p>This may seem like an odd metric value, but it can be useful when diagnosing routing-related network issues.</p></li><li id=fn:4><p>If you ever feel the urge to write your own <a href=https://www.robustperception.io/conways-life-in-prometheus/>Turing-complete configuration language</a>, take a deep breath and step back for a bit. Go for a walk around the block. Look at some trees.</p></li><li id=fn:5><p>Be careful about depending on spans as counters. Many tracing systems record only a subset of the traces they receive, or discard spans with durations outside of their recall window.
You may find the implied metrics to be missing data from times when they are most interesting.</p></li><li id=fn:6><p>UDP <i>could</i> be a reasonable transport for metrics if you used it as the foundation for a reliable connection-oriented protocol (à la QUIC), but statsd does not do this. There is no mechanism to resend lost updates, ignore duplicates, or ensure correct sequencing of gauge values. Embedding the metric name in each packet is enormously wasteful of bandwidth. <a href=https://githubengineering.com/brubeck/>statsd collection is difficult to load balance across threads</a>, and very difficult to balance across collectors running on separate machines.</p></li></ol></blog-footnotes></blog-article>2018-03-03T18:52:24Zhaskell-cpython: Calling Python libraries from Haskell2010-10-28T04:04:18Zurn:uuid:979862e7-d90d-4b36-b8f1-87f9fa1954b0<blog-article posted=2010-10-28T04:04:18Z><h1 slot=title>haskell-cpython: Calling Python libraries from Haskell</h1><div slot=summary><p>Haskell's a great language; it's efficient, consistent, terse, reliable, and so on. But if there's one thing Haskell's not, it's "batteries included". Compared to popular dynamic languages, such as Python and Ruby, Haskell has a very limited module library. Writing bindings to Python libraries (via the <a href=http://docs.python.org/3.1/c-api/>Python/C API</a>) is an easy and practical approach to reusing the Python community's work.</p><p><b>Code</b>: <a href=https://john-millikin.com/code/haskell-cpython>https://john-millikin.com/code/haskell-cpython</a> (<a href=https://github.com/jmillikin/haskell-cpython>GitHub mirror</a>)</p></div><blog-section><h2 slot=title>Preflight</h2><p>In addition to standard Haskell development tools (GHC, Cabal, etc), building the example code requires the Python 3.1 headers. In Debian/Ubuntu, <kbd>apt-get install python3.1-dev</kbd>.</p><p>Once the necessary libraries are installed, you should be able to run the following test program. If the program won't compile, or crashes, double-check that GHC and Cabal are installed properly.</p><blog-code syntax=haskell><pre>
module Main where
import qualified Data.Text.IO as T
import qualified CPython as Py
main :: IO ()
main = do
    Py.initialize
    Py.getVersion >>= T.putStrLn
</pre></blog-code><p>The program should give output like this:</p><blog-code><pre>
$ runhaskell version.hs
3.1.2 (release31-maint, Sep 17 2010, 20:37:45)
[GCC 4.4.5]
</pre></blog-code></blog-section><blog-section id=pythons-built-in-types><h2 slot=title>Python's built-in types</h2><p>Like any self-respecting language, Python has a variety of built-in types: integers, text, lists, tuples, etc. The first step to using any Python library is marshaling Haskell values into an equivalent Python value. A full list of types supported by the CPython bindings is available in the <a href=https://hackage.haskell.org/package/cpython>API reference</a>.</p><p>Let's marshal some basic stuff, using <code>print()</code> to see what Python makes of it:</p><blog-code syntax=haskell><pre>
{-# LANGUAGE OverloadedStrings #-}
module Main where
import qualified Data.Text as T
import qualified Data.ByteString.Char8 as B
import System.IO (stdout)
import qualified CPython as Py
import qualified CPython.Protocols.Object as Py
import qualified CPython.Types as Py
main :: IO ()
main = do
    Py.initialize
    unicode <- Py.toUnicode "Hello World!"
    Py.print unicode stdout
    bytes <- Py.toBytes (B.pack "Hello\NULWorld!\ETX")
    Py.print bytes stdout
    float <- Py.toFloat 1.2345
    Py.print float stdout
    int <- Py.toInteger 12345
    Py.print int stdout
    list <- Py.toList [Py.toObject int]
    Py.print list stdout
    tuple <- Py.toTuple [Py.toObject int]
    Py.print tuple stdout
    set <- Py.toSet [Py.toObject int]
    Py.print set stdout
</pre></blog-code><blog-code><pre>
$ runhaskell marshaling.hs
'Hello World!'
b'Hello\x00World!\x03'
1.2345
12345
[12345]
(12345,)
{12345}
</pre></blog-code><p>That's a big chunk to digest at once, so let's break it down a bit:</p><ul><li>Python's <code>unicode</code>, <code>bytes</code>, <code>float</code>, and <code>int</code> types match up precisely with Haskell's <code>Text</code>, <code>ByteString</code>, <code>Double</code>, and <code>Integer</code>, respectively. Byte literals are prefixed with <code>b</code>, to reduce confusion with unicode strings.</li><li>Python's tuples are similar to Haskell's, except they may contain any number of elements. Single-element tuples are indicated by a trailing comma.</li><li>Python's lists are heterogeneous and support constant-time indexing; in Haskell, we use the <code>SomeObject</code> <abbr title="Generalized Algebraic Data Type">GADT</abbr> to represent the contents of lists (and of arbitrary Python objects in general). Every value stored in a list must first be converted to a <code>SomeObject</code>, using <code>Py.toObject</code>.</li><li>Python's sets are also heterogeneous and constant-time; the special syntax <code>{1, 2, 3}</code> is equivalent to Haskell's <code>Data.Set.fromList [1, 2, 3]</code>.</li></ul></blog-section><blog-section><h2 slot=title>Methods and Protocols</h2><p>Every Python object has a selection of <i>methods</i>, which can be called by external code to do stuff. If you've ever used a pseudo-<abbr title=Object-Oriented>OO</abbr> language like C++ or Java, you've used methods before. Some methods are exposed directly via Python/C; others must be queried as attributes from an object.</p><p>When separate types have similar methods, those methods are usually standardized into a <i>protocol</i>. Python protocols are like Haskell typeclasses, except not type checked; any value with the appropriate methods is said to implement a protocol. For example, <code>tuple</code>, <code>list</code>, and <code>bytes</code> values all implement the <i>sequence</i> protocol.</p></blog-section><blog-section><h2 slot=title>Importing modules</h2><p>There's only so much you can do with the built-in types; sooner or later, you'll want to use one of Python's rich selection of libraries. That's why you're reading this, right?</p><p>Modules are exposed to the runtime as standard Python objects, and their contents (variables, procedures, class definitions) can be queried like any other object attribute. Let's look at an example of calling <code>os.uname()</code>:</p><blog-code syntax=haskell><pre>
{-# LANGUAGE OverloadedStrings #-}
module Main where
import qualified Data.Text as T
import System.IO (stdout)
import qualified CPython as Py
import qualified CPython.Protocols.Object as Py
import qualified CPython.Types as Py
import qualified CPython.Types.Module as Py
main :: IO ()
main = do
    Py.initialize
    os <- Py.importModule "os"
    uname <- Py.getAttribute os =<< Py.toUnicode "uname"
    res <- Py.callArgs uname []
    Py.print res stdout
</pre></blog-code><blog-code><pre>
$ runhaskell import.hs
('Linux', 'desktop', '2.6.35-22-generic', '#35-Ubuntu SMP Sat Oct 16 20:45:36 UTC 2010', 'x86_64')
</pre></blog-code><p>The <code>getAttribute</code> and <code>callArgs</code> functions are both part of the object protocol; the former works on all objects, while the latter works on objects with the <code>__call__()</code> magic method.</p><p>A module can be imported any number of times, but will only be loaded once per interpreter. This comes in very useful in Haskell, which has no native support for static data – if you need to call a Python method, just import its module at the call site.</p><p>Of course, even inexpensive operations can become a bottleneck if performed often enough; importing an already-loaded module is fast, but the full lookup still involves several string comparisons and a marshal. If the same Python function needs to be run many times, consider querying it once and caching the function object.</p></blog-section><blog-section><h2 slot=title>Catching Exceptions</h2><p>If anybody's been playing around with the above examples, they might have run into the following problem:</p><blog-code><pre>
$ runhaskell exceptions.hs
exceptions.hs: <CPython exception>
</pre></blog-code><p>Because Python exceptions are themselves Python objects, printing them requires an IO action. In fact, because Python methods can perform arbitrary actions, printing the same exception twice might give different output! Therefore, the <code>Show</code> instance for Python exceptions is mostly worthless.</p><p>Every Python exception has three components: a class, a value, and an optional traceback (i.e. stack trace). The class is generally not interesting, but the value can be printed to see what went wrong:</p><blog-code syntax=haskell><pre>
{-# LANGUAGE OverloadedStrings #-}
module Main where
import qualified Control.Exception as E
import qualified Data.Text as T
import System.IO (stdout)
import qualified CPython as Py
import qualified CPython.Protocols.Object as Py
import qualified CPython.Types.Exception as Py
import qualified CPython.Types.Module as Py
main :: IO ()
main = do
    Py.initialize
    E.handle onException $ do
        Py.importModule "no-such-mod"
        return ()
onException :: Py.Exception -> IO ()
onException exc = Py.print (Py.exceptionValue exc) stdout
</pre></blog-code><blog-code><pre>
$ runhaskell exceptions.hs
ImportError('No module named no-such-mod',)
</pre></blog-code><p>This'll do for quick and dirty scripts, but more complex errors will benefit from using the <a href=http://docs.python.org/3.1/library/traceback.html>traceback</a> module. Use procedures like <code>print_exception()</code> to get nice, pretty-printed error messages. If an exception originated in Python code, a stack trace will also be printed.</p><blog-code syntax=haskell><pre>
import qualified CPython.Constants as Py
import qualified CPython.Types as Py
-- ...
onException exc = do
    tb <- case Py.exceptionTraceback exc of
        Just obj -> return obj
        Nothing -> Py.none
    mod <- Py.importModule "traceback"
    proc <- Py.getAttribute mod =<< Py.toUnicode "print_exception"
    Py.callArgs proc [Py.exceptionType exc, Py.exceptionValue exc, tb]
    return ()
</pre></blog-code><blog-code><pre>
$ runhaskell exceptions.hs
ImportError: No module named no-such-mod
</pre></blog-code></blog-section><blog-section id=putting-it-all-together-binding-mimetypes><h2 slot=title>Putting it all together: binding 'mimetypes'</h2><p>Here's the payoff: implementing a Haskell library on top of an existing Python library. For this I'll use the <a href=http://docs.python.org/3.1/library/mimetypes.html>mimetypes</a> module, since it's simple and self-contained; more useful bindings might be to the <a href=http://www.feedparser.org/>Universal Feed Parser</a> or <a href=http://docutils.sourceforge.net/rst.html>docutils</a>.</p><p>Even a simple binding is a bit big to read all at once as an example, so I've split it up. First is the imports and exports; no explanation needed, hopefully.</p><blog-code syntax=haskell><pre>
{-# LANGUAGE OverloadedStrings #-}
module MimeTypes
    ( MimeTypes
    , newMimeTypes
    , guessExtension
    , guessType
    ) where
import qualified Data.Text as T
import qualified CPython as Py
import qualified CPython.Constants as Py
import qualified CPython.Protocols.Object as Py
import qualified CPython.Types as Py
import qualified CPython.Types.Module as Py
import qualified CPython.Types.Tuple as PyT
</pre></blog-code><p>Next we have a data type for matching the <code>mimetypes.MimeTypes</code> class; it doesn't have the full complement of attributes, but enough for demonstration. <code>newMimeTypes</code>'s parameters mimic those of the Python class's constructor.</p><p>Note that there are no Python types exposed in this module's public interface; clients of this module are insulated from the internal implementation. Aside from the absurdly heavy dependency list, there is no sign that this module is just a binding.</p><blog-code syntax=haskell><pre>
data MimeTypes = MimeTypes
    { mtGuessExtension :: Py.SomeObject
    , mtGuessType :: Py.SomeObject
    }
newMimeTypes :: [FilePath] -> Bool -> IO MimeTypes
newMimeTypes files strict = do
    Py.initialize
    mod <- Py.importModule "mimetypes"
    cls <- Py.getAttribute mod =<< Py.toUnicode "MimeTypes"
    -- Py.toUnicode takes Text, so the FilePath values are packed first
    pyFiles <- Py.toList =<< mapM (fmap Py.toObject . Py.toUnicode . T.pack) files
    pyStrict <- if strict then Py.true else Py.false
    mt <- Py.callArgs cls [Py.toObject pyFiles, Py.toObject pyStrict]
    pyGuessExtension <- Py.getAttribute mt =<< Py.toUnicode "guess_extension"
    pyGuessType <- Py.getAttribute mt =<< Py.toUnicode "guess_type"
    return $ MimeTypes pyGuessExtension pyGuessType
</pre></blog-code><p>If you've any sense, one of the first things you thought after reading that was "golly, that sure is ugly". And you're right – it is ugly. Anybody who wants to make a serious go of binding large-scale Python libraries (such as Django) is heavily encouraged to write something similar to <a href=http://www.cse.unsw.edu.au/~chak/haskell/c2hs/>c2hs</a> to automate the worst of it. Call it py2hs?</p><p>However, aside from being dreadfully verbose, it's not particularly complex. Parameters are marshaled from Haskell types into their Python equivalents, packaged up into a parameter list, and used to call the class constructor. After the <code>MimeTypes</code> object has been created, its <code>guess_extension</code> and <code>guess_type</code> methods are queried and cached for later use.</p><p>Which brings us to:</p><blog-code syntax=haskell><pre>
guessExtension :: MimeTypes -> T.Text -> Bool -> IO (Maybe T.Text)
guessExtension mt type_ strict = do
    pyType <- Py.toUnicode type_
    pyStrict <- if strict then Py.true else Py.false
    res <- Py.callArgs (mtGuessExtension mt) [Py.toObject pyType, Py.toObject pyStrict]
    textOrNone res
guessType :: MimeTypes -> T.Text -> Bool -> IO (Maybe T.Text, Maybe T.Text)
guessType mt url strict = do
    pyURL <- Py.toUnicode url
    pyStrict <- if strict then Py.true else Py.false
    res <- Py.callArgs (mtGuessType mt) [Py.toObject pyURL, Py.toObject pyStrict]
    Just tup <- Py.cast res
    [pyType, pyEncoding] <- Py.fromTuple tup
    type_ <- textOrNone pyType
    encoding <- textOrNone pyEncoding
    return (type_, encoding)
textOrNone :: Py.SomeObject -> IO (Maybe T.Text)
textOrNone obj = do
    isNone <- Py.isNone obj
    if isNone
        then return Nothing
        else do
            Just cast <- Py.cast obj
            Just `fmap` Py.fromUnicode cast
</pre></blog-code><p>Really, it's more of the same; marshal parameters, call, dissect the result. Testing for <code>None</code> is common enough that I moved it to a helper; more complex bindings might have dozens of such helpers for special cases. Are you listening, py2hs author?</p><p>Finally, load up our new binding into GHCi and see if it works:</p><blog-code><pre>
$ ghci -XOverloadedStrings
GHCi, version 6.12.1: http://www.haskell.org/ghc/ :? for help
Prelude> :l MimeTypes
[1 of 1] Compiling MimeTypes ( MimeTypes.hs, interpreted )
Ok, modules loaded: MimeTypes.
*MimeTypes> types <- newMimeTypes [] False
</pre></blog-code><p>It loaded! And it didn't crash! We're off to a good start; lets see if our <code>guessType</code> works:</p><blog-code><pre>
*MimeTypes> import Data.Text
*MimeTypes Data.Text> guessType types "foo.txt" True
(Just "text/plain",Nothing)
*MimeTypes Data.Text> guessType types "foo.html.gz" True
(Just "text/html",Just "gzip")
</pre></blog-code><p>Looks good; it's picking up the file type, and the optional encoding. Now for <code>guessExtension</code>:</p><blog-code><pre>
*MimeTypes Data.Text> guessExtension types "text/plain" True
Just ".ksh"
</pre></blog-code><p>Hmm.</p><p><a href=http://bugs.python.org/issue1043134>http://bugs.python.org/issue1043134</a></p><p>Hmm 🤔</p></blog-section></blog-article>2010-10-28T04:04:18ZMonad is not difficult2010-02-23T00:57:33Zurn:uuid:8696bb46-9a7b-411c-8451-7aa59e3e62ff<blog-article posted=2010-02-23T00:57:33Z><h1 slot=title>Monad is not difficult</h1><p>In Haskell, the typeclass <code>Monad</code> is a way for programmers to customize how sequences of function calls are run. Originally the goal was just to make IO-heavy code easier to read, but it turns out that Monad-shaped APIs are wonderfully flexible and can simplify all sorts of programming problems.</p><p>For example, say we want to define a function to look up two people in a database, and return information about the older one (or the first, if both are the same age). The database API already defines a way to look up a person by name, which returns NULL if the requested name isn’t found. Our function should return NULL if either name cannot be found.</p><p>The imperative-style implementation would look like this:</p><blog-code syntax=c><pre>
Maybe<Person> findOldest(Database db, Name name1, Name name2) {
    Maybe<Person> maybePerson1 = findByName(db, name1);
    if (maybePerson1 == Nothing) {
        return Nothing;
    }
    Person person1 = maybePerson1.Value();
    Maybe<Person> maybePerson2 = findByName(db, name2);
    if (maybePerson2 == Nothing) {
        return Nothing;
    }
    Person person2 = maybePerson2.Value();
    if (person1.BirthDate > person2.BirthDate) {
        return person2;
    }
    return person1;
}
</pre></blog-code><p>Other than verbosity, this code is fairly readable, because the language has built-in support for returning early from a function.</p><p>But what happens if we write it in a declarative-style language? In most declarative languages, a function is a single expression, and every branch must return a value. This prevents the programmer from returning early.</p><blog-code syntax=haskell><pre>
findOldest :: Database -> Name -> Name -> Maybe Person
findOldest db name1 name2 =
    case findByName db name1 of
        Nothing -> Nothing
        Just person1 -> case findByName db name2 of
            Nothing -> Nothing
            Just person2 -> if birthDate person1 > birthDate person2
                then Just person2
                else Just person1
</pre></blog-code><p>You can see that each time the programmer wants to check for an empty result, they have to add another level of indentation. This sort of code quickly becomes difficult to read and maintain.</p><p>Monad provides a solution, because it allows the programmer to customize how to handle functions that return an empty value. If the programmer instructs the compiler to return early when an empty value is encountered, then she doesn’t have to write all those checks manually.</p><blog-code syntax=haskell><pre>
instance Monad Maybe where
    return x = Just x
    val >>= f = case val of
        Nothing -> Nothing
        Just x -> f x
findOldest :: Database -> Name -> Name -> Maybe Person
findOldest db name1 name2 = do
    person1 <- findByName db name1
    person2 <- findByName db name2
    return (if birthDate person1 > birthDate person2
        then person2
        else person1)
</pre></blog-code><p>The same pattern extends to all sorts of code that needs extra control over the execution sequence. With only a small tweak, the instance for Maybe can be used for Either:</p><blog-code syntax=haskell><pre>
instance Monad (Either a) where
    return x = Right x
    val >>= f = case val of
        Left err -> Left err
        Right x -> f x
</pre></blog-code><p>Monad instances can also be defined for types which change how functions are called, rather than whether they are called. The Reader type adds an implicit environment to a function:</p><blog-code syntax=haskell><pre>
data Reader env a = Reader (env -> a)
ask :: Reader env env
ask = Reader (\env -> env)
runReader :: Reader env a -> env -> a
runReader (Reader run) env = run env
instance Monad (Reader env) where
    return x = Reader (\_ -> x)
    val >>= f = Reader (\env -> runReader (f (runReader val env)) env)
</pre></blog-code><p>Either is used instead of exceptions, and Reader is used instead of global state, because their behavior can be mechanically checked by the compiler. By offloading concerns about error and state handling to the compiler, the programmer can focus on the higher-level behavior of their program.</p><p>Exercise for the reader: Given this definition, how would you define a Monad instance for the State type?</p><blog-code syntax=haskell><pre>
data State st a = State (st -> (st, a))
get :: State st st
get = State (\st -> (st, st))
put :: st -> State st ()
put st = State (\_ -> (st, ()))
</pre></blog-code></blog-article>2010-02-23T00:57:33ZUnderstanding Iteratees2010-01-18T19:40:37Zurn:uuid:5bdda5f1-7d26-4add-b17a-5c138725ba19<blog-article posted=2010-01-18T19:40:37Z><h1 slot=title>Understanding Iteratees</h1><p><a href=http://okmij.org/ftp/Streams.html>Iteratees</a> are an abstraction discovered by Oleg Kiselyov, which provide a performant, predictable, and safe alternative to lazy I/O. Though the data types involved are simple, their relationship to incremental processing is not obvious, and existing documentation ranges in quality from merely dense to outright baffling. This article attempts to clarify the concepts and use underlying iteratees.</p><p>Please note that these are my notes, as I attempt to implement iteratee-based libraries. I may have misunderstood minor or major parts of iteratees. If in doubt, the final authority is Oleg – though understanding his answers requires a saving throw vs. confusion. Please <a href=mailto:john@john-millikin.com>e-mail me</a> any comments or suggestions.</p><p><i>2010–08–19: the code available in this article has been expanded and packaged as the <a href=/software/haskell-enumerator>enumerator</a> library.</i></p><blog-section id=iteratees-vs-lazy-io><h3 slot=title>Iteratees vs. Lazy I/O</h3><p>Lazy I/O – e.g., <code>hGetContents</code> and friends – is known to have several shortcomings. Most notably, IO errors can occur in pure code and <code>Handle</code>s may remain open for arbitrary periods of time. Oleg notes<blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref> that this can lead to unexpected failures, due to resource exhaustion.</p><p>Iteratees do not suffer from these problems. Their resource use is bounded and predictable, and the type system provides guarantees that limited resources are released when no longer needed. Notably, iteratees can process arbitrarily large inputs in constant space.</p></blog-section><blog-section><h3 slot=title>Implementing iteratees</h3><p>There are at least five generic iteratee libraries, each with differing type signatures and semantics: Oleg's <a href=http://okmij.org/ftp/Haskell/Iteratee/Iteratee.hs>Iteratee.hs</a>, <a href=http://okmij.org/ftp/Haskell/Iteratee/IterateeM.hs>IterateeM.hs</a>, & <a href=http://okmij.org/ftp/Haskell/Iteratee/IterateeMCPS.hs>IterateeMCPS.hs</a>, John Lato's <a href=http://hackage.haskell.org/package/iteratee><i>iteratee</i> package</a>, and a <a href=http://therning.org/magnus/archives/735>post by Per Magnus Therning</a>.</p><p>This page documents a sixth implementation, based on IterateeM, with simplified error handling and naming conventions (hopefully) more obvious to the average Haskell programmer.</p><blog-code syntax=haskell><pre>
data Chunk a
    = Chunk [a]
    | EOF
    deriving (Show, Eq)

data Step e a m b
    = Continue (Chunk a -> Iteratee e a m b)
    | Yield b (Chunk a)
    | Error e

newtype Iteratee e a m b = Iteratee
    { runIteratee :: m (Step e a m b)
    }
</pre></blog-code><p>In general, an iteratee begins in the <code>Continue</code> state. As each chunk is passed to the continuation, the iteratee may return the next step, which is one of:</p><style>dt{margin:1em}</style><dl><dt><code>Continue</code></dt><dd>The iteratee requires more input before it can produce a result.</dd><dt><code>Yield</code></dt><dd>The iteratee has received enough input to generate a result, along with left-over input. If the iteratee will no longer accept input, it should yield <code>EOF</code>. If no input remains, but the iteratee can still accept more, it should yield <code>Chunk []</code>.</dd><dt><code>Error</code></dt><dd>The iteratee experienced an error which prevents it from proceeding further. The type of error contained will depend on the enumerator and/or iteratee – common choices are <code>String</code> and <code>SomeException</code>.</dd></dl><p>Based on these semantics, some simple instances can be created:</p><blog-code syntax=haskell><pre>
-- import Data.Monoid (Monoid(..))
instance Monoid (Chunk a) where
    mempty = Chunk []
    mappend (Chunk xs) (Chunk ys) = Chunk $ xs ++ ys
    mappend _ _ = EOF

instance Functor Chunk where
    fmap _ EOF = EOF
    fmap f (Chunk xs) = Chunk $ map f xs

instance (Show a, Show b, Show e) => Show (Step e a m b) where
    showsPrec d step = showParen (d > 10) $ case step of
        (Continue _) -> s "Continue"
        (Yield b chunk) -> s "Yield " . sp b . s " " . sp chunk
        (Error err) -> s "Error " . sp err
      where
        s = showString
        sp :: Show a => a -> ShowS
        sp = showsPrec 11
</pre></blog-code><p>Slightly more complex is the <code>Monad</code> instance for iteratees. The first iteratee is run, and if it yielded a value, that value is fed into the second iteratee.</p><blog-code syntax=haskell><pre>
-- import Control.Monad.Trans (MonadIO, MonadTrans, lift, liftIO)
instance Monad m => Monad (Iteratee e a m) where
    return x = Iteratee . return . Yield x $ Chunk []
    m >>= f = Iteratee $ runIteratee m >>= \mStep -> case mStep of
        Continue k -> return $ Continue ((>>= f) . k)
        Error err -> return $ Error err
        Yield x (Chunk []) -> runIteratee $ f x
        Yield x chunk -> runIteratee (f x) >>= \r -> case r of
            Continue k -> runIteratee $ k chunk
            Error err -> return $ Error err
            -- runIteratee (f x) does not consume any input; if it
            -- returns Yield, then its "extra" input must be
            -- (Chunk []) and can be ignored.
            Yield x' _ -> return $ Yield x' chunk

instance MonadTrans (Iteratee e a) where
    lift m = Iteratee $ m >>= runIteratee . return

instance MonadIO m => MonadIO (Iteratee e a m) where
    liftIO = lift . liftIO

instance Monad m => Functor (Iteratee e a m) where
    fmap f i = i >>= return . f
</pre></blog-code><p>Next, let's define a few simple primitive combinators for building iteratees from pure functions:</p><blog-code syntax=haskell><pre>
returnI :: Monad m => Step e a m b -> Iteratee e a m b
returnI = Iteratee . return
liftI :: Monad m => (Chunk a -> Step e a m b) -> Iteratee e a m b
liftI k = returnI $ Continue (returnI . k)
yield :: Monad m => b -> Chunk a -> Iteratee e a m b
yield x chunk = returnI $ Yield x chunk
continue :: Monad m => (Chunk a -> Iteratee e a m b) -> Iteratee e a m b
continue k = returnI $ Continue k
throwError :: Monad m => e -> Iteratee e a m b
throwError err = returnI $ Error err
</pre></blog-code><p>These combinators are sufficient to define simple iteratees; for example, a variation of <code>dropWhile</code>:</p><blog-code syntax=haskell><pre>
-- import Prelude hiding (dropWhile)
-- import qualified Prelude as Prelude
dropWhile :: Monad m => (a -> Bool) -> Iteratee e a m ()
dropWhile f = liftI step where
    step (Chunk xs) = case Prelude.dropWhile f xs of
        [] -> Continue $ returnI . step
        xs' -> Yield () (Chunk xs')
    step EOF = Yield () EOF
</pre></blog-code><p>Or an iteratee for printing received chunks to stdout, useful for debugging:</p><blog-code syntax=haskell><pre>
printChunks :: MonadIO m => Show a => Bool -> Iteratee e a m ()
printChunks printEmpty = continue step where
    step (Chunk []) | not printEmpty = continue step
    step (Chunk xs) = liftIO (print xs) >> continue step
    step EOF = liftIO (putStrLn "EOF") >> yield () EOF
</pre></blog-code><p>Finally, to extract the final result from an iteratee, it's sufficient to feed it <code>EOF</code> and check the returned <code>Step</code>. Note that a "well-behaved" iteratee continuation will always return <code>Yield</code> or <code>Error</code> in response to <code>EOF</code> – iteratees which return <code>Continue</code> may loop forever, depending on their monadic behavior.</p><blog-code syntax=haskell><pre>
run :: Monad m => Iteratee e a m b -> m (Either e b)
run i = runIteratee i >>= check where
    check (Continue k) = runIteratee (k EOF) >>= check
    check (Yield x _) = return $ Right x
    check (Error e) = return $ Left e
</pre></blog-code></blog-section><blog-section><h3 slot=title>Enumerators</h3><p>Iteratees consume data from a sequence of input chunks. To generate those chunks, we define <i>enumerators</i> (and enumerator composition operators).</p><blog-code syntax=haskell><pre>
type Enumerator e a m b = Step e a m b -> Iteratee e a m b
infixl 1 >>==, ==<<
(>>==) :: Monad m =>
    Iteratee e a m b ->
    (Step e a m b -> Iteratee e a' m b') ->
    Iteratee e a' m b'
m >>== f = Iteratee (runIteratee m >>= runIteratee . f)

(==<<) :: Monad m =>
    (Step e a m b -> Iteratee e a' m b') ->
    Iteratee e a m b ->
    Iteratee e a' m b'
f ==<< m = m >>== f
</pre></blog-code><p>Note that the <code>Enumerator</code> type is semantically equivalent to:</p><blog-code syntax=haskell><pre>
type Enumerator e a m b = Step e a m b -> m (Step e a m b)
</pre></blog-code><p>Simple enumerators can be defined in terms of existing combinators. The basic format of an enumerator is that when it receives a <code>Continue</code> step, it passes a chunk to the continuation to generate its returned iteratee. Other step types are passed through unchanged.</p><blog-code syntax=haskell><pre>
enumList :: Monad m => [a] -> Enumerator e a m b
enumList xs (Continue k) = case xs of
    [] -> k EOF
    (x:xs') -> k (Chunk [x]) >>== enumList xs'
enumList _ step = returnI step
</pre></blog-code><p>More complex enumerators require building the result manually. Note that while the recursive step is much larger in this example, the fundamental layout (loop on <code>Continue</code>, pass on others) remains.</p><blog-code syntax=haskell><pre>
-- import Control.Exception (SomeException, try)
-- import Data.ByteString (ByteString, packCStringLen)
-- import Foreign.Marshal.Alloc (allocaBytes)
-- import System.IO (Handle, hGetBuf)
enumHandle :: Handle -> Enumerator String ByteString IO b
enumHandle h = Iteratee . allocaBytes bufferSize . loop where
    bufferSize = 4096
    loop (Continue k) = do_read k
    loop step = const $ return step
    do_read k p = do
        n <- try $ hGetBuf h p bufferSize
        case (n :: Either SomeException Int) of
            Left err -> return $ Error $ show err
            Right 0 -> return $ Continue k
            Right n' -> do
                bytes <- packCStringLen (p, n')
                step <- runIteratee (k (Chunk [bytes]))
                loop step p
</pre></blog-code><p>In some cases, it might make more sense to define this enumerator in terms of bytes rather than byte strings. The required changes are minor – the bytes are stored directly in the <code>Chunk</code> list.</p><blog-code syntax=haskell><pre>
-- import Data.Word (Word8)
-- import qualified Foreign.Marshal.Array as F
enumHandle :: Handle -> Enumerator String Word8 IO b
…
            Right n' -> do
                bytes <- F.peekArray n' p
                step <- runIteratee (k (Chunk bytes))
                loop step p
</pre></blog-code></blog-section><blog-section><h3 slot=title>Enumeratees</h3><p>Enumerators generate data, iteratees consume it. When a value needs to generate a stream using another stream as input, it is named an <i>enumeratee</i>.</p><blog-code syntax=haskell><pre>
type Enumeratee e aOut aIn m b = Step e aIn m b -> Iteratee e aOut m (Step e aIn m b)
</pre></blog-code><p>Most interesting transformations in iteratee-based code are enumeratees. For example, <code>map</code> can be encoded as an enumeratee:</p><blog-code syntax=haskell><pre>
checkDone :: Monad m =>
    ((Chunk a -> Iteratee e a m b) -> Iteratee e a' m (Step e a m b)) ->
    Enumeratee e a' a m b
checkDone _ (Yield x chunk) = return $ Yield x chunk
checkDone f (Continue k) = f k
checkDone _ (Error err) = throwError err
mapI :: Monad m => (ao -> ai) -> Enumeratee e ao ai m b
mapI f = checkDone $ continue . step where
    step k EOF = yield (Continue k) EOF
    step k (Chunk []) = continue $ step k
    step k chunk = k (fmap f chunk) >>== mapI f
</pre></blog-code><p>A more complex example: <code>sequenceI</code> converts an iteratee to an enumeratee, by feeding it input until it returns <code>EOF</code>. This is useful for chaining iteratees together, to support embedded streams.</p><blog-code syntax=haskell><pre>
finished :: Monad m => Iteratee e a m Bool
finished = liftI $ \chunk -> case chunk of
    EOF -> Yield True EOF
    _ -> Yield False chunk
sequenceI :: Monad m => Iteratee e ao m ai -> Enumeratee e ao ai m b
sequenceI i = checkDone check where
    check k = finished >>= \f -> if f
        then yield (Continue k) EOF
        else step k
    step k = i >>= \v -> k (Chunk [v]) >>== sequenceI i
</pre></blog-code><p>A join combinator is useful for "extracting" an output stream from an enumeratee's result.</p><blog-code syntax=haskell><pre>
joinI :: Monad m => Iteratee e a m (Step e a' m b) -> Iteratee e a m b
joinI outer = outer >>= check where
    check (Continue k) = k EOF >>== check
    check (Yield x _) = return x
    check (Error e) = throwError e
</pre></blog-code></blog-section><blog-footnotes><hr><ol><li id=fn:1><p>Oleg Kiselyov – <a href=http://www.haskell.org/pipermail/haskell-cafe/2008-September/047738.html>Lazy vs correct IO</a></p></li></ol></blog-footnotes></blog-article>2010-01-18T19:40:37Z