1. Variable-length integer encoding (UBInt)
Uses a custom unbounded integer format where each integer is encoded using 2-bit pairs: 01
for bit 0, 10
for bit 1, 11
for bits 11, and 00
as terminator. This eliminates fixed-width overhead for small integers.
2. Custom alphabet encoding for strings
Analyzes all unique characters in each string column (including history) to build a minimal alphabet. Each character is then encoded using $\lceil \log_2(|\text{alphabet}| + 1) \rceil$ bits instead of 8 bits, with 000...
as string terminator.
3. Sorted column optimization
Detects when an integer column forms a complete sorted sequence from 0 to $n-1$. In this case, the column values are implicit (equal to row index) and are not encoded at all, saving significant space.
4. Timestamp delta encoding
History operations store only the first timestamp fully; subsequent timestamps are encoded as deltas from the previous timestamp, drastically reducing space for temporal data.
5. Update operation bitmask
For update operations, uses a bitmask to indicate which columns changed, then only encodes the old values for those specific columns rather than storing entire row snapshots.
6. Bit-level LZ77 compression
Applies a custom bit-level LZ77-like compression algorithm that operates directly on bit arrays. Uses 12-bit distance fields (4095-bit lookback window) and 8-bit length fields with hash-based match finding for efficient back-references.
7. Adaptive compression flag
Compares compressed vs uncompressed payload sizes and automatically selects the smaller representation, prefixed with a 1-bit flag. Ensures compression never increases total size.
8. Schema-driven type optimization
Different column types use optimal encodings: UBInt for integers/unix timestamps, custom alphabets for strings, and length-prefixed JSON. The schema defines interpretation, meaning there are no type tags per value.
9. Minimal padding
Only pads to the byte boundary at the very end of the entire encoded stream, maximizing bit-level compression efficiency throughout the encoding process. The entire database contains at most 7 bits of padding in total.