vu128: Efficient variable-length integers

Why variable-length integers?

The basic goal of a variable-length integer encoding is to allow small numbers to be encoded in fewer bytes than larger numbers. Among other uses, they're popular for TLV formats such as Protocol Buffers, which contain lots of small numbers (tags) and have a design goal of space efficiency. The two encodings with near-universal adoption are VLQ and LEB128, which are the big- and little-endian dialects of an encoding originally developed in the early 1980s[0].

Due to various stupid decisions I find myself immersed in designing a new TLV format similar to Protocol Buffers, and therefore I needed to choose a variable-length integer encoding. Initially my plan was to use LEB128, but all the implementations I could find were far more complex than I wanted to deal with. I threw together a quick hack that used a length-prefix byte and figured I'd come back to the problem later.

It is now later, and to prepare for writing a new LEB128 library I first wrote a benchmark to figure out what sort of performance could be expected. Surprisingly, the quick hack solution was not only simpler, it is much faster, with even the carefully-optimized code in rustc_serialize/leb128.rs running at about 20% of the performance of my naive implementation.

After some head-scratching, my conclusion is that the LEB128 bitstream format is simply a poor fit for today's deeply pipelined processor architectures. In LEB128, each byte of input needs to have its MSB tested to determine whether the value has terminated. A simpler format with a bit-packed length prefix byte will naturally perform better, because bitmasks are cheap and branches are expensive.

An afternoon of web searching has failed to uncover a prior name for this particular encoding, so I'm going to name it vu128 and describe it below. If any reader knows of prior art, please send it my way.

---

Updated: The discussions on Hacker News and Lobste.rs are worth reading, in particular comments about the value of space efficiency for 16-bit values:

The original vu128 encoding tries to have as few branches as possible by encoding values in the range [0xF0, 0x3FFF) as three octets where LEB128 would use two, but several commenters noted that this wasn't a good tradeoff for use cases such as symbol relocation tables. The updated version of this blog post adjusts the vu128 format to ensure that encoded values are always the same size or smaller than LEB128, while retaining most of the performance benefits.

The vu128 format

A vu128 containing an N-octet unsigned integer is composed of up to N+1 octets. Unsigned integers up to 128 bits in length are supported. The encoded value's first octet's leading bits determine the layout of the rest of the encoded value. There are three layouts, with the value space partitioned to optimize the tradeoff between branch count and encoded length.

Values in the range [0, 2⁷) are encoded as a single byte with the same bits as the original value. The MSB is zero.

Values in the range [2⁷, 2²⁸) are encoded as a unary length prefix, followed by (length*7) bits, in little-endian order. This is conceptually similar to LEB128, but the continuation bits are placed in upper half of the initial byte. This arrangement is also known as a "prefix varint" and can be found in some Google-originated codebases, for example libtextclassifier:utils/base/prefixvarint.h.

MSB ------------------ LSB

      10101011110011011110  Input value (0xABCDE)
   0101010 1111001 1011110  Zero-padded to a multiple of 7 bits
01010101 11100110 ___11110  Grouped into octets, with 3 continuation bits
01010101 11100110 11011110  Continuation bits `110` added
    0x55     0xE6     0xDE  In hexadecimal

        [0xDE, 0xE6, 0x55]  Encoded output (order is little-endian)

Value	Encoded as
`0xABCDE`	`[0xDE, 0xE6, 0x55]`
`0x80`	`[0x80, 0x02]`
`0x3FFF`	`[0xBF, 0xFF]`
`0x4000`	`[0xC0, 0x00, 0x02]`
`0x1FFFFF`	`[0xDF, 0xFF, 0xFF]`
`0x200000`	`[0xE0, 0x00, 0x00, 0x02]`
`0xFFFFFFF`	`[0xEF, 0xFF, 0xFF, 0xFF]`

Values in the range [2²⁸, 2¹²⁸) are encoded as a binary length prefix, followed by payload bytes, in little-endian order. To differentiate this layout from those of smaller values, the top 4 bits of the first byte are set. The length prefix value is the number of payload bytes minus one; equivalently it is the total length of the encoded value minus two.

MSB ------------------------------------ LSB

               10010001101000101011001111000  Input value (0x12345678)
         00010010 00110100 01010110 01111000  Zero-padded to a multiple of 8 bits
00010010 00110100 01010110 01111000 11110011  Prefix byte is `0xF0 | (4 - 1)`
    0x12     0x34     0x56     0x78     0xF3  In hexadecimal

              [0xF3, 0x78, 0x56, 0x34, 0x12]  Encoded output (order is little-endian)

Value	Encoded as
`0x12345678`	`[0xF3, 0x78, 0x56, 0x34, 0x12]`
`0x10000000`	`[0xF3, 0x00, 0x00, 0x00, 0x10]`
`0xABCDEF12_34567890`	`[0xF7, 0x90, 0x78, 0x56, 0x34, 0x12, 0xEF, 0xCD, 0xAB]`

Signed integers are stored in the "zigzag" encoding used by Protocol Buffers. This ensures that small-magnitude integers of either sign are encoded in a compact layout.

Value	Encoded as
`0`	`[0x00]`
`-1`	`[0x01]`
`1`	`[0x02]`
`-2`	`[0x03]`
`2`	`[0x04]`

IEEE-754 floating-point values are byte-swapped to big-endian before encoding, so that trailing zeroes in the significand can be elided.

Value (`f64`)	Encoded as
`0.0`	`[0x00]`
`-0.0`	`[0x80, 0x02]`
`1.0`	`[0xDF, 0x81, 0x07]`
`2.0`	`[0x40]`
`2.5`	`[0x80, 0x11]`

Rust implementation

The reference implementation is written in Rust using basic bit operations, with no SIMD or exotic control flow. On my desktop (AMD Ryzen 7 3700X) this code runs somewhere between 2x and 5x faster than the LEB128 code used by the Rust compiler. Faster implementations may be possible on target platforms with scatter-gather bit manipulation instructions, such as PDEP and PEXT available on recent x86-64 processors.

Github: https://github.com/jmillikin/rust-vu128
crates.io: https://crates.io/crates/vu128

The decoder will accept over-long encoding, and will also truncate values that are too large to fit in the given integer type. The low-level Protobuf routines behave similarly, but such behavior may or may not be appropriate depending on the use case. Adding checks for such conditions, if necessary, is left as an exercise to the reader.

(This page previously included the Rust source directly, before the library was published.)

Python implementation

This implementation is intended for readers that aren't familiar with Rust. It has not been optimized for performance, and is intended to be a starting point for implementing vu128 in one's preferred language.

# Copyright (c) John Millikin <john@john-millikin.com>
# SPDX-License-Identifier: 0BSD

def encode(x: int):
	"""Encode a vu128 to bytes."""

	if x < 0x80:
		return bytes([x])
	if x < 0x4000:
		return bytes([
			0b10000000 | (x & 0b00111111),
			(x >>  6),
		])
	if x < 0x200000:
		return bytes([
			0b11000000 | (x & 0b00011111),
			(x >>  5) & 0xFF,
			(x >> 13),
		])
	if x < 0x10000000:
		return bytes([
			0b11100000 | (x & 0b00001111),
			(x >>  4) & 0xFF,
			(x >> 12) & 0xFF,
			(x >> 20),
		])

	assert x < 2**128
	payload = x.to_bytes(length=16, byteorder="little").rstrip(b"\x00")
	return bytes([
		0xF0 | (len(payload) - 1),
	]) + payload


def decode(buf: bytes):
	"""Decode a vu128 and return (decoded_value, encoded_len)."""

	if buf[0] & 0b10000000 == 0:
		return buf[0], 1
	if buf[0] & 0b01000000 == 0:
		decoded = (
			(buf[0] & 0b00111111) |
			(buf[1] <<  6)
		)
		return decoded, 2
	if buf[0] & 0b00100000 == 0:
		decoded = (
			(buf[0] & 0b00011111) |
			(buf[1] <<  5) |
			(buf[2] << 13)
		)
		return decoded, 3
	if buf[0] & 0b00010000 == 0:
		decoded = (
			(buf[0] & 0b00001111) |
			(buf[1] <<  4) |
			(buf[2] << 12) |
			(buf[3] << 20)
		)
		return decoded, 4

	payload_len = (buf[0] & 0x0F) + 1
	encoded_len = payload_len + 1
	payload = buf[1:encoded_len]
	assert len(payload) == payload_len
	return int.from_bytes(payload, byteorder = "little"), encoded_len

Appendix A: Original vu128 format (before revision)

This version of the vu128 format was the original, before revisions based on online discussions (see above). It is slightly faster than the final version, but produces longer outputs for some values.

A vu128 containing an N-octet unsigned integer is composed of up to N+1 octets. If the integer is less than 0xF0, it is stored in a single octet. Otherwise, the first octet is 0xF0 | ((number of non-zero octets) - 1), and the non-zero octets of the integer are appended in little-endian order. Integer lengths up to 128 bits are supported.

Value	Encoded as
`0`	`[0x00]`
`239`	`[0xEF]`
`240`	`[0xF0 0xF0]`
`255`	`[0xF0 0xFF]`
`0xABCDEF`	`[0xF2 0xEF 0xCD 0xAB]`
`0x0123456789ABCDEF`	`[0xF7 0xEF 0xCD 0xAB 0x89 0x67 0x45 0x23 0x01]`

Appendix B: Original Rust implementation (before revision)

This implementation of the original vu128 format was what I used for initial benchmarking, before revisions based on online discussions (see above). It is slightly faster than the final version, but produces longer outputs for some values.

Discussions and commentary about vu128 before the release of the rust-vu128 library are referring to this code.

// Copyright (c) John Millikin <john@john-millikin.com>
// SPDX-License-Identifier: 0BSD

macro_rules! encode_uNN {
	($name:ident, $t:ident, $max_len:literal, $len_mask:literal) => {
		#[inline]
		pub fn $name(buf: &mut [u8; $max_len], x: $t) -> usize {
			if x < 0xF0 {
				buf[0] = x as u8;
				return 1;
			}
			unsafe {
				(buf as *mut [u8; $max_len])
					.cast::<u8>().add(1).cast::<$t>()
					.write_unaligned(x.to_le());
			}
			let len = ((x.leading_zeros() >> 3) as u8) ^ $len_mask;
			buf[0] = 0xF0 | len;
			usize::from(len + 2)
		}
	};
}

macro_rules! decode_uNN {
	($name:ident, $t:ident, $max_len:literal, $len_mask:literal) => {
		#[inline]
		pub fn $name(buf: &[u8; $max_len]) -> ($t, usize) {
			if buf[0] < 0xF0 {
				return ($t::from(buf[0]), 1);
			}
			let x = $t::from_le(unsafe {
				(buf as *const [u8; $max_len])
					.cast::<u8>().add(1).cast::<$t>()
					.read_unaligned()
			});
			let len = buf[0] & 0x0F;
			let mask = $t::MAX >> ((len & $len_mask) ^ $len_mask);
			(x & mask, usize::from(len + 2))
		}
	};
}

encode_uNN!(encode_u16, u16, 3, 0x01);
encode_uNN!(encode_u32, u32, 5, 0x03);
encode_uNN!(encode_u64, u64, 9, 0x07);
encode_uNN!(encode_u128, u128, 17, 0x0F);

decode_uNN!(decode_u16, u16, 3, 0x01);
decode_uNN!(decode_u32, u32, 5, 0x03);
decode_uNN!(decode_u64, u64, 9, 0x07);
decode_uNN!(decode_u128, u128, 17, 0x0F);

#[inline]
pub fn encode_u8(buf: &mut [u8; 2], x: u8) -> usize {
	if x < 0xF0 {
		buf[0] = x;
		return 1;
	}
	buf[0] = 0xF0;
	buf[1] = x;
	2
}

#[inline]
pub fn decode_u8(buf: &[u8; 2]) -> (u8, usize) {
	if buf[0] < 0xF0 {
		return (buf[0], 1);
	}
	(buf[1], 2)
}

Assembly for the encoder and decoder functions, generated by rustc v1.78:

encode_u32:
        cmp     esi, 240
        jae     .LBB1_1
        mov     eax, 1
        mov     byte ptr [rdi], sil
        ret
.LBB1_1:
        mov     dword ptr [rdi + 1], esi
        lzcnt   esi, esi
        shr     esi, 3
        mov     al, 5
        sub     al, sil
        xor     sil, -13
        movzx   eax, al
        mov     byte ptr [rdi], sil
        ret

decode_u32:
        movzx   eax, byte ptr [rdi]
        cmp     eax, 240
        jae     .LBB0_1
        mov     edx, 1
        ret
.LBB0_1:
        mov     ecx, eax
        not     cl
        and     cl, 3
        mov     edx, 32
        sub     edx, ecx
        bzhi    ecx, dword ptr [rdi + 1], edx
        and     al, 15
        add     al, 2
        movzx   edx, al
        mov     eax, ecx
        ret

Signed integers:

macro_rules! encode_iNN {
	(
		$name:ident,
		$encode_fn:ident,
		$iNN:ident,
		$uNN:ident,
		$max_len:literal,
		$zigzag_shift:literal
	) => {
		#[inline]
		pub fn $name(buf: &mut [u8; $max_len], x: $iNN) -> usize {
			let zigzag = ((x >> $zigzag_shift) as $uNN) ^ ((x << 1) as $uNN);
			$encode_fn(buf, zigzag)
		}
	};
}

macro_rules! decode_iNN {
	(
		$name:ident,
		$decode_fn:ident,
		$iNN:ident,
		$uNN:ident,
		$max_len:literal
	) => {
		#[inline]
		pub fn $name(buf: &[u8; $max_len]) -> ($iNN, usize) {
			let (zz, len) = $decode_fn(buf);
			let x = ((zz >> 1) as $iNN) ^ (-((zz & 1) as $iNN));
			(x, len)
		}
	};
}

encode_iNN!(encode_i8, encode_u8, i8, u8, 2, 7);
encode_iNN!(encode_i16, encode_u16, i16, u16, 3, 15);
encode_iNN!(encode_i32, encode_u32, i32, u32, 5, 31);
encode_iNN!(encode_i64, encode_u64, i64, u64, 9, 63);
encode_iNN!(encode_i128, encode_u128, i128, u128, 17, 127);

decode_iNN!(decode_i8, decode_u8, i8, u8, 2);
decode_iNN!(decode_i16, decode_u16, i16, u16, 3);
decode_iNN!(decode_i32, decode_u32, i32, u32, 5);
decode_iNN!(decode_i64, decode_u64, i64, u64, 9);
decode_iNN!(decode_i128, decode_u128, i128, u128, 17);

Floating point:

#[inline]
pub fn encode_f32(buf: &mut [u8; 5], x: f32) -> usize {
	encode_u32(buf, x.to_bits().swap_bytes())
}

#[inline]
pub fn encode_f64(buf: &mut [u8; 9], x: f64) -> usize {
	encode_u64(buf, x.to_bits().swap_bytes())
}

#[inline]
pub fn decode_f32(buf: &[u8; 5]) -> (f32, usize) {
	let (swapped, len) = decode_u32(buf);
	(f32::from_bits(swapped.swap_bytes()), len)
}

#[inline]
pub fn decode_f64(buf: &[u8; 9]) -> (f64, usize) {
	let (swapped, len) = decode_u64(buf);
	(f64::from_bits(swapped.swap_bytes()), len)
}

The earliest reference I can find to the concept is as part of MIDI 1.0, published in 1983, which describes the big-endian variant. The little-endian variant seems to have been first named LEB128 in DWARF v2, published in 1993.