Protocol Buffers, which contain lots of small numbers (tags) and have a design goal of space efficiency. The two encodings with near-universal adoption are VLQ and LEB128, which are the big- and little-endian dialects of an encoding originally developed in the early 1980s
Due to various stupid decisions I find myself immersed in designing a new TLV format similar to Protocol Buffers, and therefore I needed to choose a variable-length integer encoding. Initially my plan was to use LEB128, but all the implementations I could find were far more complex than I wanted to deal with. I threw together a quick hack that used a length-prefix byte and figured I'd come back to the problem later.
It is now later, and to prepare for writing a new LEB128 library I first wrote a benchmark to figure out what sort of performance could be expected. Surprisingly, the quick hack solution was not only simpler, it is much faster, with even the carefully-optimized code in rustc_serialize/leb128.rs running at about 20% of the performance of my naive implementation.
After some head-scratching, my conclusion is that the LEB128 bitstream format is simply a poor fit for today's deeply pipelined processor architectures. In LEB128, each byte of input needs to have its MSB tested to determine whether the value has terminated. A simpler format with a bit-packed length prefix byte will naturally perform better, because bitmasks are cheap and branches are expensive.
An afternoon of web searching has failed to uncover a prior name for this particular encoding, so I'm going to name it vu128 and describe it below. If any reader knows of prior art, please send it my way.
A vu128 containing an N-octet unsigned integer is composed of up to N+1 octets. If the integer is less than 0xF0, it is stored in a single octet. Otherwise, the first octet is 0xF0 | ((number of non-zero octets) - 1), and the non-zero octets of the integer are appended in little-endian order. Integer lengths up to 128 bits are supported.
Value (u64) | Encoded as |
---|---|
0 | [0x00] |
239 | [0xEF] |
240 | [0xF0 0xF0] |
255 | [0xF0 0xFF] |
0xABCDEF | [0xF2 0xEF 0xCD 0xAB] |
0x0123456789ABCDEF | [0xF7 0xEF 0xCD 0xAB 0x89 0x67 0x45 0x23 0x01] |
Signed integers are stored in the "zigzag" encoding used by Protocol Buffers. This ensures that small-magnitude integers of either sign are encoded in a compact layout.
Value (i64) | Encoded as |
---|---|
0 | [0x00] |
-1 | [0x01] |
1 | [0x02] |
-2 | [0x03] |
2 | [0x04] |
IEEE-754 floating-point values are stored in big-endian order due to their internal layout.
Value (f64) | Encoded as |
---|---|
0.0 | [0x00] |
-0.0 | [0x80] |
1.0 | [0xF1, 0x3F, 0xF0] |
2.0 | [0x40] |
2.5 | [0xF1, 0x40, 0x04] |
A straightforward implementation using basic bit operations, with no SIMD or exotic control flow. On my desktop (AMD Ryzen 7 3700X) this code runs somewhere between 2x and 5x faster than the LEB128 code used by the Rust compiler.
This decoder will accept over-long encoding, and will also truncate values that are too large to fit in the given integer type. The low-level Protobuf routines behave similarly, but such behavior may or may not be appropriate depending on the use case. Adding checks for such conditions, if necessary, is left as an exercise to the reader.
// Copyright (c) John Millikin <john@john-millikin.com> // SPDX-License-Identifier: 0BSD macro_rules! encode_uNN { ($name:ident, $t:ident, $max_len:literal, $len_mask:literal) => { #[inline] pub fn $name(buf: &mut [u8; $max_len], x: $t) -> usize { if x < 0xF0 { buf[0] = x as u8; return 1; } unsafe { (buf as *mut [u8; $max_len]) .cast::<u8>().add(1).cast::<$t>() .write_unaligned(x.to_le()); } let len = ((x.leading_zeros() >> 3) as u8) ^ $len_mask; buf[0] = 0xF0 | len; usize::from(len + 2) } }; } macro_rules! decode_uNN { ($name:ident, $t:ident, $max_len:literal, $len_mask:literal) => { #[inline] pub fn $name(buf: &[u8; $max_len]) -> ($t, usize) { if buf[0] < 0xF0 { return ($t::from(buf[0]), 1); } let x = $t::from_le(unsafe { (buf as *const [u8; $max_len]) .cast::<u8>().add(1).cast::<$t>() .read_unaligned() }); let len = buf[0] & 0x0F; let mask = $t::MAX >> ((len & $len_mask) ^ $len_mask); (x & mask, usize::from(len + 2)) } }; } encode_uNN!(encode_u16, u16, 3, 0x01); encode_uNN!(encode_u32, u32, 5, 0x03); encode_uNN!(encode_u64, u64, 9, 0x07); encode_uNN!(encode_u128, u128, 17, 0x0F); decode_uNN!(decode_u16, u16, 3, 0x01); decode_uNN!(decode_u32, u32, 5, 0x03); decode_uNN!(decode_u64, u64, 9, 0x07); decode_uNN!(decode_u128, u128, 17, 0x0F); #[inline] pub fn encode_u8(buf: &mut [u8; 2], x: u8) -> usize { if x < 0xF0 { buf[0] = x; return 1; } buf[0] = 0xF0; buf[1] = x; 2 } #[inline] pub fn decode_u8(buf: &[u8; 2]) -> (u8, usize) { if buf[0] < 0xF0 { return (buf[0], 1); } (buf[1], 2) }
Assembly for the encoder and decoder functions, generated by rustc v1.78:
encode_u32: cmp esi, 240 jae .LBB1_1 mov eax, 1 mov byte ptr [rdi], sil ret .LBB1_1: mov dword ptr [rdi + 1], esi lzcnt esi, esi shr esi, 3 mov al, 5 sub al, sil xor sil, -13 movzx eax, al mov byte ptr [rdi], sil ret decode_u32: movzx eax, byte ptr [rdi] cmp eax, 240 jae .LBB0_1 mov edx, 1 ret .LBB0_1: mov ecx, eax not cl and cl, 3 mov edx, 32 sub edx, ecx bzhi ecx, dword ptr [rdi + 1], edx and al, 15 add al, 2 movzx edx, al mov eax, ecx ret
macro_rules! encode_iNN { ( $name:ident, $encode_fn:ident, $iNN:ident, $uNN:ident, $max_len:literal, $zigzag_shift:literal ) => { #[inline] pub fn $name(buf: &mut [u8; $max_len], x: $iNN) -> usize { let zigzag = ((x >> $zigzag_shift) as $uNN) ^ ((x << 1) as $uNN); $encode_fn(buf, zigzag) } }; } macro_rules! decode_iNN { ( $name:ident, $decode_fn:ident, $iNN:ident, $uNN:ident, $max_len:literal ) => { #[inline] pub fn $name(buf: &[u8; $max_len]) -> ($iNN, usize) { let (zz, len) = $decode_fn(buf); let x = ((zz >> 1) as $iNN) ^ (-((zz & 1) as $iNN)); (x, len) } }; } encode_iNN!(encode_i8, encode_u8, i8, u8, 2, 7); encode_iNN!(encode_i16, encode_u16, i16, u16, 3, 15); encode_iNN!(encode_i32, encode_u32, i32, u32, 5, 31); encode_iNN!(encode_i64, encode_u64, i64, u64, 9, 63); encode_iNN!(encode_i128, encode_u128, i128, u128, 17, 127); decode_iNN!(decode_i8, decode_u8, i8, u8, 2); decode_iNN!(decode_i16, decode_u16, i16, u16, 3); decode_iNN!(decode_i32, decode_u32, i32, u32, 5); decode_iNN!(decode_i64, decode_u64, i64, u64, 9); decode_iNN!(decode_i128, decode_u128, i128, u128, 17);
#[inline] pub fn encode_f32(buf: &mut [u8; 5], x: f32) -> usize { encode_u32(buf, x.to_bits().swap_bytes()) } #[inline] pub fn encode_f64(buf: &mut [u8; 9], x: f64) -> usize { encode_u64(buf, x.to_bits().swap_bytes()) } #[inline] pub fn decode_f32(buf: &[u8; 5]) -> (f32, usize) { let (swapped, len) = decode_u32(buf); (f32::from_bits(swapped.swap_bytes()), len) } #[inline] pub fn decode_f64(buf: &[u8; 9]) -> (f64, usize) { let (swapped, len) = decode_u64(buf); (f64::from_bits(swapped.swap_bytes()), len) }
The earliest reference I can find to the concept is as part of MIDI 1.0, published in 1983, which describes the big-endian variant. The little-endian variant seems to have been first named LEB128 in DWARF v2, published in 1993.