vu128: Efficient variable-length integers

Why variable-length integers?

The basic goal of a variable-length integer encoding is to allow small numbers to be encoded in fewer bytes than larger numbers. Among other uses, they're popular for TLV formats such as Protocol Buffers, which contain lots of small numbers (tags) and have a design goal of space efficiency. The two encodings with near-universal adoption are VLQ and LEB128, which are the big- and little-endian dialects of an encoding originally developed in the early 1980s[0].

Due to various stupid decisions I find myself immersed in designing a new TLV format similar to Protocol Buffers, and therefore I needed to choose a variable-length integer encoding. Initially my plan was to use LEB128, but all the implementations I could find were far more complex than I wanted to deal with. I threw together a quick hack that used a length-prefix byte and figured I'd come back to the problem later.

It is now later, and to prepare for writing a new LEB128 library I first wrote a benchmark to figure out what sort of performance could be expected. Surprisingly, the quick hack solution was not only simpler, it is much faster, with even the carefully-optimized code in rustc_serialize/leb128.rs running at about 20% of the performance of my naive implementation.

After some head-scratching, my conclusion is that the LEB128 bitstream format is simply a poor fit for today's deeply pipelined processor architectures. In LEB128, each byte of input needs to have its MSB tested to determine whether the value has terminated. A simpler format with a bit-packed length prefix byte will naturally perform better, because bitmasks are cheap and branches are expensive.

An afternoon of web searching has failed to uncover a prior name for this particular encoding, so I'm going to name it vu128 and describe it below. If any reader knows of prior art, please send it my way.

The vu128 format

A vu128 containing an N-octet unsigned integer is composed of up to N+1 octets. If the integer is less than 0xF0, it is stored in a single octet. Otherwise, the first octet is 0xF0 | ((number of non-zero octets) - 1), and the non-zero octets of the integer are appended in little-endian order. Integer lengths up to 128 bits are supported.

Value (u64)Encoded as
0[0x00]
239[0xEF]
240[0xF0 0xF0]
255[0xF0 0xFF]
0xABCDEF[0xF2 0xEF 0xCD 0xAB]
0x0123456789ABCDEF[0xF7 0xEF 0xCD 0xAB 0x89 0x67 0x45 0x23 0x01]

Signed integers are stored in the "zigzag" encoding used by Protocol Buffers. This ensures that small-magnitude integers of either sign are encoded in a compact layout.

Value (i64)Encoded as
0[0x00]
-1[0x01]
1[0x02]
-2[0x03]
2[0x04]

IEEE-754 floating-point values are stored in big-endian order due to their internal layout.

Value (f64)Encoded as
0.0[0x00]
-0.0[0x80]
1.0[0xF1, 0x3F, 0xF0]
2.0[0x40]
2.5[0xF1, 0x40, 0x04]

Rust implementation

A straightforward implementation using basic bit operations, with no SIMD or exotic control flow. On my desktop (AMD Ryzen 7 3700X) this code runs somewhere between 2x and 5x faster than the LEB128 code used by the Rust compiler.

This decoder will accept over-long encoding, and will also truncate values that are too large to fit in the given integer type. The low-level Protobuf routines behave similarly, but such behavior may or may not be appropriate depending on the use case. Adding checks for such conditions, if necessary, is left as an exercise to the reader.

Unsigned integers

// Copyright (c) John Millikin <john@john-millikin.com>
// SPDX-License-Identifier: 0BSD

macro_rules! encode_uNN {
	($name:ident, $t:ident, $max_len:literal, $len_mask:literal) => {
		#[inline]
		pub fn $name(buf: &mut [u8; $max_len], x: $t) -> usize {
			if x < 0xF0 {
				buf[0] = x as u8;
				return 1;
			}
			unsafe {
				(buf as *mut [u8; $max_len])
					.cast::<u8>().add(1).cast::<$t>()
					.write_unaligned(x.to_le());
			}
			let len = ((x.leading_zeros() >> 3) as u8) ^ $len_mask;
			buf[0] = 0xF0 | len;
			usize::from(len + 2)
		}
	};
}

macro_rules! decode_uNN {
	($name:ident, $t:ident, $max_len:literal, $len_mask:literal) => {
		#[inline]
		pub fn $name(buf: &[u8; $max_len]) -> ($t, usize) {
			if buf[0] < 0xF0 {
				return ($t::from(buf[0]), 1);
			}
			let x = $t::from_le(unsafe {
				(buf as *const [u8; $max_len])
					.cast::<u8>().add(1).cast::<$t>()
					.read_unaligned()
			});
			let len = buf[0] & 0x0F;
			let mask = $t::MAX >> ((len & $len_mask) ^ $len_mask);
			(x & mask, usize::from(len + 2))
		}
	};
}

encode_uNN!(encode_u16, u16, 3, 0x01);
encode_uNN!(encode_u32, u32, 5, 0x03);
encode_uNN!(encode_u64, u64, 9, 0x07);
encode_uNN!(encode_u128, u128, 17, 0x0F);

decode_uNN!(decode_u16, u16, 3, 0x01);
decode_uNN!(decode_u32, u32, 5, 0x03);
decode_uNN!(decode_u64, u64, 9, 0x07);
decode_uNN!(decode_u128, u128, 17, 0x0F);

#[inline]
pub fn encode_u8(buf: &mut [u8; 2], x: u8) -> usize {
	if x < 0xF0 {
		buf[0] = x;
		return 1;
	}
	buf[0] = 0xF0;
	buf[1] = x;
	2
}

#[inline]
pub fn decode_u8(buf: &[u8; 2]) -> (u8, usize) {
	if buf[0] < 0xF0 {
		return (buf[0], 1);
	}
	(buf[1], 2)
}

Assembly for the encoder and decoder functions, generated by rustc v1.78:

encode_u32:
        cmp     esi, 240
        jae     .LBB1_1
        mov     eax, 1
        mov     byte ptr [rdi], sil
        ret
.LBB1_1:
        mov     dword ptr [rdi + 1], esi
        lzcnt   esi, esi
        shr     esi, 3
        mov     al, 5
        sub     al, sil
        xor     sil, -13
        movzx   eax, al
        mov     byte ptr [rdi], sil
        ret

decode_u32:
        movzx   eax, byte ptr [rdi]
        cmp     eax, 240
        jae     .LBB0_1
        mov     edx, 1
        ret
.LBB0_1:
        mov     ecx, eax
        not     cl
        and     cl, 3
        mov     edx, 32
        sub     edx, ecx
        bzhi    ecx, dword ptr [rdi + 1], edx
        and     al, 15
        add     al, 2
        movzx   edx, al
        mov     eax, ecx
        ret

Signed integers (zigzag)

macro_rules! encode_iNN {
	(
		$name:ident,
		$encode_fn:ident,
		$iNN:ident,
		$uNN:ident,
		$max_len:literal,
		$zigzag_shift:literal
	) => {
		#[inline]
		pub fn $name(buf: &mut [u8; $max_len], x: $iNN) -> usize {
			let zigzag = ((x >> $zigzag_shift) as $uNN) ^ ((x << 1) as $uNN);
			$encode_fn(buf, zigzag)
		}
	};
}

macro_rules! decode_iNN {
	(
		$name:ident,
		$decode_fn:ident,
		$iNN:ident,
		$uNN:ident,
		$max_len:literal
	) => {
		#[inline]
		pub fn $name(buf: &[u8; $max_len]) -> ($iNN, usize) {
			let (zz, len) = $decode_fn(buf);
			let x = ((zz >> 1) as $iNN) ^ (-((zz & 1) as $iNN));
			(x, len)
		}
	};
}

encode_iNN!(encode_i8, encode_u8, i8, u8, 2, 7);
encode_iNN!(encode_i16, encode_u16, i16, u16, 3, 15);
encode_iNN!(encode_i32, encode_u32, i32, u32, 5, 31);
encode_iNN!(encode_i64, encode_u64, i64, u64, 9, 63);
encode_iNN!(encode_i128, encode_u128, i128, u128, 17, 127);

decode_iNN!(decode_i8, decode_u8, i8, u8, 2);
decode_iNN!(decode_i16, decode_u16, i16, u16, 3);
decode_iNN!(decode_i32, decode_u32, i32, u32, 5);
decode_iNN!(decode_i64, decode_u64, i64, u64, 9);
decode_iNN!(decode_i128, decode_u128, i128, u128, 17);

Floating-point

#[inline]
pub fn encode_f32(buf: &mut [u8; 5], x: f32) -> usize {
	encode_u32(buf, x.to_bits().swap_bytes())
}

#[inline]
pub fn encode_f64(buf: &mut [u8; 9], x: f64) -> usize {
	encode_u64(buf, x.to_bits().swap_bytes())
}

#[inline]
pub fn decode_f32(buf: &[u8; 5]) -> (f32, usize) {
	let (swapped, len) = decode_u32(buf);
	(f32::from_bits(swapped.swap_bytes()), len)
}

#[inline]
pub fn decode_f64(buf: &[u8; 9]) -> (f64, usize) {
	let (swapped, len) = decode_u64(buf);
	(f64::from_bits(swapped.swap_bytes()), len)
}

  1. The earliest reference I can find to the concept is as part of MIDI 1.0, published in 1983, which describes the big-endian variant. The little-endian variant seems to have been first named LEB128 in DWARF v2, published in 1993.

Change Feed