File Reconstruction: Term-based Representation

This document describes how a file can be represented and reconstructed from a compact, deduplicated form using a series of terms. Each term specifies where to source data (a content-addressed container called a xorb) and which chunk indices within that container are required.

Glossary

Core Idea

After chunking, a file can be represented as an ordered list of chunks. Those chunks are packed into xorbs, and given the resulting set of xorbs, the file representation is converted into a "reconstruction" made up of "terms". When forming xorbs, the ordering and grouping of chunks prioritizes contiguous runs of chunks as they appear in the file, so that each reference to a xorb covers as long a chunk range as possible.

Any file’s raw bytes can be described as the concatenation of data produced by a sequence of terms. Each term references a contiguous range of chunks within a particular xorb. The file is reconstructed by retrieving those chunk ranges, decoding them to raw bytes, and concatenating in order.

Term Format

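Each term consists of:

  - A xorb hash, identifying the xorb that holds the term's chunks.
  - A chunk range [start, end) within that xorb, where the start index is inclusive and the end index is exclusive.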

Reconstruction Rules

Given an ordered list of terms describing a file:

  1. For each term, fetch the specified chunk range from the identified xorb.
  2. Decode/decompress the chunks into raw bytes, preserving the original order.
  3. If reconstructing the entire file, concatenate the decoded outputs of all terms in their listed order.
  4. If reconstructing a byte sub-range of the file, the first and last terms may be partially used:
     - Skip a prefix of bytes within the first term’s decoded output so the file-level range starts at the requested offset.
     - Truncate the tail of the last term’s decoded output so the file-level range ends at the requested position.
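
The rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the xet-core implementation: the `Term` class, the `XorbStore` mapping, and the `reconstruct` function are hypothetical names, and the store is assumed to hold chunks that are already decoded to raw bytes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Term:
    """Hypothetical in-memory term: a xorb hash plus a chunk range [start, end)."""
    xorb_hash: str
    chunk_start: int  # inclusive
    chunk_end: int    # exclusive


# Assumption: a stand-in for xorb storage, mapping a xorb hash to its
# chunks in index order, each chunk already decoded to raw bytes.
XorbStore = Dict[str, List[bytes]]


def decode_term(term: Term, store: XorbStore) -> bytes:
    """Return the raw bytes of one term's contiguous chunk range."""
    chunks = store[term.xorb_hash][term.chunk_start:term.chunk_end]
    if len(chunks) != term.chunk_end - term.chunk_start:
        raise ValueError("term refers to chunks outside the xorb")
    return b"".join(chunks)


def reconstruct(terms: List[Term], store: XorbStore,
                start: int = 0, end: Optional[int] = None) -> bytes:
    """Concatenate term outputs in order, keeping only file bytes [start, end)."""
    out = bytearray()
    pos = 0  # file offset of the current term's first byte
    for term in terms:
        if end is not None and pos >= end:
            break  # every remaining term lies after the requested range
        data = decode_term(term, store)
        term_start, term_end = pos, pos + len(data)
        pos = term_end
        if term_end <= start:
            continue  # term lies entirely before the requested range
        # Skip a prefix in the first used term, truncate the tail of the last.
        lo = max(start, term_start) - term_start
        hi = (term_end if end is None else min(end, term_end)) - term_start
        out += data[lo:hi]
    return bytes(out)
```

Calling `reconstruct(terms, store)` yields the whole file, while `reconstruct(terms, store, start, end)` yields a byte sub-range. For simplicity the sketch decodes terms that precede the requested range just to learn their lengths; a client that knows each term's decoded length in advance could skip fetching them entirely.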

Ordering and Coverage

Multiple Terms per Xorb and Coalescing

Chunk and Byte Boundaries

Determinism and Integrity

Example (Conceptual)

Assume a file is represented by the following ordered terms:

Term  Xorb hash (conceptual)  Chunk range
1     X1                      [0, 5)
2     X2                      [3, 8)
3     X1                      [9, 12)

Reconstruction proceeds by obtaining chunks 0,1,2,3,4 from xorb X1, chunks 3,4,5,6,7 from xorb X2, and chunks 9,10,11 from xorb X1, decoding each contiguous range, and concatenating in the term order 1 → 2 → 3.
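
The same example can be written out as a small sketch that expands each end-exclusive range into the chunk indices to fetch. The tuple layout and the `fetch_plan` helper are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

# The three conceptual terms from the example: (xorb hash, start, end),
# with start inclusive and end exclusive.
example_terms: List[Tuple[str, int, int]] = [
    ("X1", 0, 5),
    ("X2", 3, 8),
    ("X1", 9, 12),
]


def fetch_plan(terms: List[Tuple[str, int, int]]) -> List[Tuple[str, List[int]]]:
    """Expand each term into the chunk indices it covers, keeping term order."""
    return [(xorb, list(range(start, end))) for xorb, start, end in terms]


for number, (xorb, indices) in enumerate(fetch_plan(example_terms), start=1):
    print(f"term {number}: xorb {xorb}, chunks {indices}")
```

Note that X1 appears twice: terms 1 and 3 reference different ranges of the same xorb, and their outputs are still concatenated in term order, not grouped by xorb.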

Serialization and Deserialization

This section summarizes how the term-based reconstruction is persisted and exchanged.

Serialization into shards (file info section)

A file’s reconstruction can be serialized into a shard as part of its file info section. Conceptually, this section encodes the complete set of terms that describe the file. When stored this way, the representation is canonical and sufficient to reconstruct the full file solely from its referenced xorb ranges.

Reference: shard format file info

Deserialization from the reconstruction API (JSON)

A reconstruction API can return a JSON object that carries the full reconstruction. This response is represented by a structure named “QueryReconstructionResponse”, where the terms key enumerates the ordered list of terms required to reconstruct the entire file. The terms list contains, for each term, the xorb identifier and the contiguous chunk index range to retrieve. Other fields may provide auxiliary details (such as offsets or fetch hints) that optimize retrieval without altering the meaning of the terms sequence.

Reference: api.md, download protocol
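
As a rough sketch of the deserialization step, the snippet below parses a JSON payload into an ordered list of terms. Only the "terms" key comes from the description above; the per-term field names ("hash", "range", "start", "end") and the sample payload are assumptions made for illustration, so consult api.md for the actual schema.

```python
import json
from typing import List, NamedTuple


class Term(NamedTuple):
    """Hypothetical parsed term: xorb identifier plus chunk range [start, end)."""
    xorb_hash: str
    chunk_start: int
    chunk_end: int


def parse_terms(payload: str) -> List[Term]:
    """Extract the ordered term list from a reconstruction response body."""
    doc = json.loads(payload)
    terms = []
    for entry in doc["terms"]:  # list order is the reconstruction order
        rng = entry["range"]    # assumed field name and shape
        start, end = int(rng["start"]), int(rng["end"])
        if start < 0 or end <= start:
            raise ValueError(f"invalid chunk range [{start}, {end})")
        terms.append(Term(entry["hash"], start, end))  # "hash" is assumed
    return terms


# Made-up payload using the assumed field names.
sample = """
{"terms": [
  {"hash": "X1", "range": {"start": 0, "end": 5}},
  {"hash": "X2", "range": {"start": 3, "end": 8}}
]}
"""
parsed = parse_terms(sample)
```

Any auxiliary fields in the response are ignored here, which is consistent with the statement that they do not alter the meaning of the terms sequence.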

Fragmentation and Why Longer Ranges Matter

Fragmentation refers to representing a file with many very short, scattered ranges across many xorbs. While this can maximize deduplication opportunities, it often harms read performance and increases overhead.

In practice there is a balance: longer ranges improve reconstruction performance, while finer granularity can increase deduplication savings. Favoring longer contiguous chunk ranges within the same xorb, and coalescing adjacent or overlapping ranges when feasible, helps maintain good read performance without sacrificing correctness. In xet-core, a fragmentation prevention mechanism targets an average of 8 chunks per term.
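
To make the trade-off concrete, the sketch below measures the average number of chunks per term and merges consecutive terms that continue the same xorb without a gap. It is an illustration only, not the xet-core mechanism: the function names are hypothetical, and it handles only the exactly adjacent case, where merging provably leaves the reconstructed bytes unchanged.

```python
from typing import List, Tuple

TermTuple = Tuple[str, int, int]  # (xorb hash, start inclusive, end exclusive)


def coalesce_adjacent(terms: List[TermTuple]) -> List[TermTuple]:
    """Merge consecutive terms when the next one continues the same xorb range."""
    merged: List[TermTuple] = []
    for xorb, start, end in terms:
        if merged and merged[-1][0] == xorb and merged[-1][2] == start:
            # Same xorb and no gap: extend the previous term's range.
            merged[-1] = (xorb, merged[-1][1], end)
        else:
            merged.append((xorb, start, end))
    return merged


def average_chunks_per_term(terms: List[TermTuple]) -> float:
    """Mean range length; higher values indicate less fragmentation."""
    if not terms:
        return 0.0
    return sum(end - start for _, start, end in terms) / len(terms)


fragmented = [("X1", 0, 2), ("X1", 2, 5), ("X2", 0, 1), ("X1", 5, 6)]
compact = coalesce_adjacent(fragmented)
```

In this made-up input, only the first two terms merge. The last term also continues X1 at chunk 5, but it is separated from the others by a term on X2, so merging it would change the byte order of the file.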