Content-Defined Chunking Algorithm

The goal of chunking is to split file data into smaller, variable-length chunks of approximately 64 KiB. Chunk boundaries must be computed deterministically, so that chunking the same data in two different places yields chunks that can be deduplicated.

Step-by-step algorithm (Gearhash-based CDC)

Constant Parameters

- TABLE: a fixed table of 256 64-bit values, one per possible byte value. All parties must use the identical table for deduplication to work.
- MASK: a 64-bit mask with 16 one-bits, giving a target chunk size of about 64 KiB (see "Boundary probability and mask selection" below).
- MIN_CHUNK_SIZE: the minimum allowed chunk size; no boundary test is performed before this size is reached.
- MAX_CHUNK_SIZE: the maximum allowed chunk size; a boundary is forced at this size.

State

- h: the 64-bit rolling hash, initialized to 0 and reset to 0 at each chunk boundary.
- start_offset: the byte offset where the current chunk begins, initialized to 0.

Per-byte update rule (Gearhash)

For each input byte b, update the hash with 64-bit wrapping arithmetic:

h = (h << 1) + TABLE[b]
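As a minimal sketch, the update rule can be written in Python, where masking with `0xFFFFFFFFFFFFFFFF` emulates 64-bit wrapping arithmetic. The TABLE here is an illustrative stand-in (256 pseudo-random 64-bit values from a fixed seed); any agreed-upon table works the same way.

```python
import random

# Illustrative stand-in table: 256 fixed 64-bit values, one per byte value.
# A fixed seed makes the table deterministic across runs.
_rng = random.Random(0)
TABLE = [_rng.getrandbits(64) for _ in range(256)]

def update(h: int, b: int) -> int:
    """Fold one input byte b (0..255) into the rolling hash h,
    with 64-bit wrapping arithmetic."""
    return ((h << 1) + TABLE[b]) & 0xFFFFFFFFFFFFFFFF
```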

Boundary test and size constraints

At each position, after updating h, let size = current_offset - start_offset + 1.

- If size < MIN_CHUNK_SIZE, no boundary test is performed.
- Otherwise, a boundary is found when (h & MASK) == 0, or forced when size >= MAX_CHUNK_SIZE.

When a boundary is found or forced: emit the chunk [start_offset, current_offset + 1), set start_offset to current_offset + 1, and reset h to 0.

At end-of-file, if start_offset < len(data), emit the final chunk [start_offset, len(data)).

Pseudocode

Inputs: (See above for constant parameters)
  data: byte array

State:
  h = 0
  start_offset = 0 // start of the "current chunk"

if len(data) < MIN_CHUNK_SIZE:
  emit chunk [0, len(data))
  done

for i in range(0, len(data)):
  b = data[i]
  h = (h << 1) + TABLE[b]      // 64-bit wrapping
  size = i + 1 - start_offset

  if size < MIN_CHUNK_SIZE:
    continue

  if size >= MAX_CHUNK_SIZE or (h & MASK) == 0:
    emit chunk [start_offset, i + 1)
    start_offset = i + 1
    h = 0

if start_offset < len(data):
  emit chunk [start_offset, len(data))
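The pseudocode above translates directly into a runnable Python sketch. The specific values of MIN_CHUNK_SIZE, MAX_CHUNK_SIZE, and MASK below are illustrative assumptions (the text only fixes the ~64 KiB target via a 16-one-bit mask), and the TABLE is a seeded pseudo-random stand-in; real deployments must agree on all of these constants.

```python
import random

MIN_CHUNK_SIZE = 16 * 1024       # assumed for illustration
MAX_CHUNK_SIZE = 128 * 1024      # assumed for illustration
MASK = 0xFFFF000000000000        # 16 one-bits -> boundary odds of 1 / 2^16

# Illustrative stand-in table: 256 fixed 64-bit values, one per byte value.
_rng = random.Random(0)
TABLE = [_rng.getrandbits(64) for _ in range(256)]

def chunk(data: bytes) -> list[tuple[int, int]]:
    """Return the [start, end) offsets of each chunk in data."""
    if len(data) < MIN_CHUNK_SIZE:
        return [(0, len(data))]
    chunks = []
    h = 0
    start = 0
    for i, b in enumerate(data):
        h = ((h << 1) + TABLE[b]) & 0xFFFFFFFFFFFFFFFF  # 64-bit wrapping
        size = i + 1 - start
        if size < MIN_CHUNK_SIZE:
            continue                     # no boundary test below the minimum
        if size >= MAX_CHUNK_SIZE or (h & MASK) == 0:
            chunks.append((start, i + 1))
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append((start, len(data)))  # final partial chunk
    return chunks
```

The chunks always partition the input exactly, every chunk except possibly the last respects the minimum, and no chunk exceeds the maximum.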

Boundary probability and mask selection

Given that MASK has 16 one-bits, the chance that a random 64-bit hash h satisfies (h & MASK) == 0 is 1 / 2^16 = 1 / 65,536. On average, a boundary therefore occurs about once every 64 KiB, which sets the target chunk size (before the minimum and maximum size constraints are applied).
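This arithmetic can be checked in a few lines. The MASK value below is one illustrative choice; any 64-bit mask with 16 set bits gives the same odds.

```python
MASK = 0xFFFF000000000000            # one example of a 16-one-bit mask

ones = bin(MASK).count("1")          # number of bits that must all be zero
p = 1 / (1 << ones)                  # probability a random hash passes
expected_gap = 1 << ones             # mean bytes between boundaries = 1/p
```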

Properties

Intuition and rationale

Implementation notes

Edge cases

Portability and determinism

Minimum-size skip-ahead (cut-point skipping optimization)

Computing and testing the rolling hash at every byte is expensive for large data, and boundary tests within the first MIN_CHUNK_SIZE bytes of a chunk are disallowed by the size constraint anyway. Cut-point skipping intentionally skips hashing most of that prefix, accelerating scanning without affecting which boundaries are chosen.

Because the hash uses a 64-bit integer and the update (h << 1) + TABLE[b] shifts each byte's contribution out after 64 further bytes, the hash at any offset depends only on the last 64 bytes. With this effective 64-byte window, scanning can skip the first MIN_CHUNK_SIZE - 64 bytes of each chunk and begin hashing there. By the time the first boundary test can occur (at size MIN_CHUNK_SIZE), a full window of bytes from the current chunk has influenced the hash state, so the computed boundaries are identical to those of a byte-by-byte scan.
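A sketch of the skip-ahead scan, under the same illustrative assumptions as before (the constants and seeded TABLE are stand-ins, not spec-mandated values). For each chunk it jumps to 64 bytes before the first legal cut point, hashes only from there, and produces the same boundaries as the byte-by-byte loop.

```python
import random

MIN_CHUNK_SIZE = 16 * 1024       # assumed for illustration
MAX_CHUNK_SIZE = 128 * 1024      # assumed for illustration
MASK = 0xFFFF000000000000
WINDOW = 64                      # bytes of history the hash depends on

_rng = random.Random(0)
TABLE = [_rng.getrandbits(64) for _ in range(256)]

def chunk_skipping(data: bytes) -> list[tuple[int, int]]:
    """Chunk offsets, hashing only from start + MIN_CHUNK_SIZE - WINDOW
    onward within each chunk."""
    if len(data) < MIN_CHUNK_SIZE:
        return [(0, len(data))]
    chunks = []
    start = 0
    while start < len(data):
        h = 0
        i = start + MIN_CHUNK_SIZE - WINDOW   # skip bytes that cannot cut
        end = min(start + MAX_CHUNK_SIZE, len(data))
        cut = end                             # forced cut if no match found
        while i < end:
            h = ((h << 1) + TABLE[data[i]]) & 0xFFFFFFFFFFFFFFFF
            i += 1
            if i - start >= MIN_CHUNK_SIZE and (h & MASK) == 0:
                cut = i
                break
        chunks.append((start, cut))
        start = cut
    return chunks
```

By the time the first test fires, exactly one full 64-byte window has been hashed, which equals the low 64 bits of the full-prefix hash, so every cut decision matches the naive scan.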

References