The goal in chunking is to convert file data into smaller variable-length chunks, approximately 64 KiB in length. Chunk boundaries must be computed in a deterministic way, such that chunking the same data in two different places yields chunks that can be deduplicated.
Constants:

- Target (average) chunk size: 64 KiB
- `MIN_CHUNK_SIZE`: 8 KiB (minimum chunk size)
- `MAX_CHUNK_SIZE`: 128 KiB (maximum chunk size)
- `MASK`: `0xffff000000000000` (16 one-bits → boundary probability 1/2^16 per byte)

For each input byte `b`, update the hash with 64-bit wrapping arithmetic:

```
h = (h << 1) + TABLE[b]
```
At each position after updating `h`, let `size = current_offset - start_offset + 1`.

- `size < MIN_CHUNK_SIZE`: do not test `MASK`; continue.
- `size >= MAX_CHUNK_SIZE`: force a boundary.
- `(h & MASK) == 0`: boundary at this position.

When a boundary is found or forced:

- Emit the chunk `[start_offset, current_offset + 1)`.
- Set `start_offset = current_offset + 1`.
- Reset `h = 0`.

At end-of-file, if `start_offset < len(data)`, emit the final chunk `[start_offset, len(data))`.
```
Inputs: (see above for constant parameters)
    data: byte array

State:
    h = 0
    start_offset = 0   // start of the "current chunk"

if len(data) < MIN_CHUNK_SIZE:
    emit chunk [0, len(data))
    done

for i in range(0, len(data)):
    b = data[i]
    h = (h << 1) + TABLE[b]   // 64-bit wrapping
    size = i + 1 - start_offset
    if size < MIN_CHUNK_SIZE:
        continue
    if size >= MAX_CHUNK_SIZE or (h & MASK) == 0:
        emit chunk [start_offset, i + 1)
        start_offset = i + 1
        h = 0

if start_offset < len(data):
    emit chunk [start_offset, len(data))
```
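The pseudocode above translates directly to a runnable sketch. This is illustrative only: the 256-entry table here is filled from an arbitrarily seeded PRNG, whereas a real implementation ships one fixed table (all parties must share it to deduplicate), so the exact boundaries this produces are not those of any production system.

```python
import random

MIN_CHUNK_SIZE = 8 * 1024       # 8 KiB
MAX_CHUNK_SIZE = 128 * 1024     # 128 KiB
MASK = 0xFFFF000000000000       # 16 one-bits
U64 = 0xFFFFFFFFFFFFFFFF        # emulate 64-bit wrapping arithmetic

# Illustrative table: arbitrary seed, NOT the table of any real implementation.
_rng = random.Random(42)
TABLE = [_rng.getrandbits(64) for _ in range(256)]

def chunk(data: bytes) -> list[tuple[int, int]]:
    """Split `data` into [start, end) spans using the rules above."""
    if len(data) < MIN_CHUNK_SIZE:
        return [(0, len(data))]
    chunks = []
    h = 0
    start_offset = 0
    for i, b in enumerate(data):
        h = ((h << 1) + TABLE[b]) & U64       # 64-bit wrapping update
        size = i + 1 - start_offset
        if size < MIN_CHUNK_SIZE:
            continue                          # never test MASK too early
        if size >= MAX_CHUNK_SIZE or (h & MASK) == 0:
            chunks.append((start_offset, i + 1))
            start_offset = i + 1
            h = 0                             # reset at every boundary
    if start_offset < len(data):
        chunks.append((start_offset, len(data)))
    return chunks
```

Chunking the same bytes twice yields identical spans, and every chunk except possibly the final one has a length in `[MIN_CHUNK_SIZE, MAX_CHUNK_SIZE]`.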
Given that `MASK` has 16 one-bits, for a random 64-bit hash `h`, the chance that all 16 of those bits are zero is 1/2^16. On average, that means you’ll see a match about once every 64 KiB.
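This arithmetic can be checked directly in plain Python:

```python
# MASK from the parameters above: 16 one-bits in the top 16 positions.
MASK = 0xFFFF000000000000

ones = bin(MASK).count("1")
assert ones == 16

# A uniformly random 64-bit hash passes (h & MASK) == 0 with probability
# 1 / 2**16, so boundary candidates appear on average every 64 KiB.
expected_spacing = 2 ** ones
assert expected_spacing == 64 * 1024
```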
- `TABLE[256]` injects pseudo-randomness per byte value so that the evolving hash `h` behaves like a random 64-bit value with respect to the mask test. This makes boundaries content-defined yet statistically evenly spaced.
- `(h << 1)` amplifies recent bytes, helping small changes affect nearby positions without globally shifting all boundaries.
- Resetting `h` to 0 at each boundary prevents long-range carryover and keeps boundary decisions for each chunk statistically independent.
- Reset `h` when you emit a boundary. This ensures chunking is stable even when streaming input in pieces.
- Only test the mask once `size >= MIN_CHUNK_SIZE`. This reduces the frequency of tiny chunks and stabilizes average chunk sizes.
- Force a boundary once `size >= MAX_CHUNK_SIZE`, even if `(h & MASK) != 0`. This guarantees bounded chunk sizes and prevents pathological long chunks when matches are rare.
- Use 64-bit wrapping arithmetic for `(h << 1) + TABLE[b]`. This is the behavior in the reference implementation rust-gearhash.
- If `len(data) < MIN_CHUNK_SIZE`, the entire `data` is emitted as a single chunk.
- If no `(h & MASK) == 0` occurs before `MAX_CHUNK_SIZE`, a boundary is forced at `MAX_CHUNK_SIZE` to cap chunk size.
- Given a fixed `TABLE[256]` table and mask, the algorithm is deterministic across platforms: same input → same chunk boundaries.

Computing and testing the rolling hash at every byte is expensive for large data, and early tests inside the first few bytes of a chunk are disallowed by the `MIN_CHUNK_SIZE` constraint anyway.
With cut-point skipping, we can intentionally skip testing some of the data, accelerating scanning without affecting correctness.
Because the hash is a 64-bit integer updated with a one-bit left shift (`(h << 1) + TABLE[b]`), the hash at any byte offset depends only on the last 64 bytes: after 64 updates, a byte's contribution has been shifted entirely out of the word.
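This 64-byte locality is easy to verify empirically. The table seed below is arbitrary; any fixed 256-entry table exhibits the same behavior:

```python
import random

U64 = 0xFFFFFFFFFFFFFFFF
_rng = random.Random(42)  # arbitrary seed for an illustrative table
TABLE = [_rng.getrandbits(64) for _ in range(256)]

def gear_hash(data: bytes) -> int:
    """Run the Gear update over `data` and return the final 64-bit hash."""
    h = 0
    for b in data:
        h = ((h << 1) + TABLE[b]) & U64
    return h

# A byte's contribution is doubled on every later update, so after 64
# updates it has been shifted out of the 64-bit word entirely: inputs
# that share their final 64 bytes produce the same hash.
tail = bytes(range(64))
assert gear_hash(b"one prefix" + tail) == gear_hash(b"another, longer prefix" + tail)
```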
With a Gear rolling-hash window of 64 bytes, hashing can safely skip the first MIN_CHUNK_SIZE - 64 - 1 bytes of a chunk, since no boundary test is permitted before MIN_CHUNK_SIZE anyway.
This ensures that, by the time the first boundary can be considered (at offset MIN_CHUNK_SIZE
), at least one full hash window of bytes from the current chunk has influenced the hash state.
- Boundary tests only occur once `size >= MIN_CHUNK_SIZE`, and the hash produced at a testable offset is the same as the hash computed had we not skipped any bytes.
- Skipping never produces a chunk shorter than `MIN_CHUNK_SIZE`.
- An implementation can therefore skip ahead `MIN_CHUNK_SIZE - 64 - 1` bytes before invoking the mask test loop.