Chunk-level deduplication is a fundamental optimization technique in the Xet system that eliminates redundant data by identifying and sharing identical content blocks (chunks) across files and repositories. This specification details the procedures, algorithms, and security mechanisms that enable efficient storage and transfer while maintaining data integrity and access control.
Deduplication in Xet operates at the chunk level rather than the file level, providing fine-grained deduplication that can identify shared content even when files differ significantly. This approach is particularly effective for scenarios common in machine learning and data science workflows, such as:

- Model checkpoints that change only incrementally between training runs
- Dataset revisions that append or modify a small fraction of rows
- Fine-tuned model variants that share most of a base model's weights
A chunk is a variable-sized content block derived from files using Content-Defined Chunking (CDC) with a rolling hash function. Chunks are the fundamental unit of deduplication in Xet.
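To make the mechanism concrete, here is a minimal gear-style CDC sketch in Rust. The constants (minimum/maximum chunk size, boundary mask) and the gear table are illustrative stand-ins, not Xet's actual parameters; the real rolling hash and chunk-size targets are defined in the chunking section.

```rust
// Minimal gear-style CDC sketch. All constants are illustrative, not Xet's
// actual chunking parameters.
const MIN_CHUNK: usize = 8 * 1024; // assumed lower bound on chunk size
const MAX_CHUNK: usize = 128 * 1024; // assumed upper bound on chunk size
const BOUNDARY_MASK: u64 = (1 << 16) - 1; // targets ~64 KiB average chunks

/// Deterministic 256-entry gear table (a stand-in for a real random table),
/// generated with splitmix64 so the sketch is self-contained.
fn gear_table() -> [u64; 256] {
    let mut state = 0u64;
    let mut table = [0u64; 256];
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Split `data` into content-defined chunks. Boundaries depend only on the
/// bytes near the cut point, so identical content yields identical chunks
/// even when surrounding data shifts.
fn chunk(data: &[u8]) -> Vec<&[u8]> {
    let table = gear_table();
    let mut chunks = Vec::new();
    let (mut start, mut hash) = (0usize, 0u64);
    for (i, &byte) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(table[byte as usize]);
        let len = i - start + 1;
        // Cut when the rolling hash hits the boundary pattern (after the
        // minimum size) or when the chunk reaches the maximum size.
        if (len >= MIN_CHUNK && hash & BOUNDARY_MASK == 0) || len >= MAX_CHUNK {
            chunks.push(&data[start..=i]);
            start = i + 1;
            hash = 0;
        }
    }
    if start < data.len() {
        chunks.push(&data[start..]); // trailing chunk, possibly under MIN_CHUNK
    }
    chunks
}
```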
Xorbs are containers that aggregate multiple chunks for efficient storage and transfer.

Shards are objects that list xorbs whose chunks may be deduplicated against (for the purposes of deduplication, ignore the file info section of the shard format).

The CAS (content-addressed storage) system provides the underlying storage infrastructure.
When a file is processed for upload, it undergoes the following steps:
```mermaid
graph TD
    A[File Input] --> B[Content-Defined Chunking]
    B --> C[Hash Computation]
    C --> D[Chunk Creation]
    D --> E[Deduplication Query]
```
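As a rough illustration of this flow, the sketch below chunks a file, hashes each chunk with the blake3 crate, and consults a single stand-in index for the deduplication query. `ChunkRef` is a hypothetical reference type (the authoritative reference format is the shard specification), and `chunk()` is reused from the CDC sketch above; the single map stands in for the tiered lookup described next.

```rust
use std::collections::HashMap;

/// Hypothetical reference to an already-stored chunk: which xorb holds it
/// and at which index within that xorb.
#[derive(Clone)]
struct ChunkRef {
    xorb_id: u64,
    chunk_index: usize,
}

/// Walk the pipeline from the diagram: chunk, hash, then query for dedup.
fn process_file(
    index: &mut HashMap<[u8; 32], ChunkRef>,
    xorb_id: u64,
    data: &[u8],
) -> Vec<ChunkRef> {
    let mut refs = Vec::new();
    let mut appended = 0usize; // chunks added to the current xorb so far
    for chunk_bytes in chunk(data) {
        let hash = *blake3::hash(chunk_bytes).as_bytes(); // hash computation
        let entry = index.entry(hash).or_insert_with(|| {
            // Dedup miss: the chunk becomes new data in the current xorb.
            let new_ref = ChunkRef { xorb_id, chunk_index: appended };
            appended += 1;
            new_ref
        });
        refs.push(entry.clone()); // hit or miss, the file references the chunk
    }
    refs
}
```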
Xet employs a three-tiered deduplication strategy to maximize efficiency while minimizing latency:
- Scope: Current upload session
- Mechanism: In-memory hash lookup table
- Purpose: Eliminate redundancy within the current file or session
Benefits:

- No network or disk round trips; lookups are in-memory and immediate
- Catches repeated content within a single file or upload session
- Scope: Previously uploaded files and sessions
- Mechanism: Local shard file metadata cache
- Purpose: Leverage deduplication against recently uploaded content
Benefits:

- Avoids network queries by consulting locally cached shard metadata
- Extends deduplication across files uploaded in earlier sessions
- Scope: Entire Xet ecosystem
- Mechanism: Global deduplication service with HMAC protection
- Purpose: Discover deduplication opportunities across all users and repositories
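Before turning to the global tier in detail, here is a minimal sketch of the lookup order, reusing the hypothetical `ChunkRef` from the pipeline sketch above. The two maps stand in for the session table and the local shard cache; the global tier is really a network service, stubbed out here.

```rust
use std::collections::HashMap;

/// Stand-ins for the first two deduplication tiers.
struct DedupTiers {
    session: HashMap<[u8; 32], ChunkRef>,     // tier 1: this upload session
    shard_cache: HashMap<[u8; 32], ChunkRef>, // tier 2: local shard metadata
}

impl DedupTiers {
    /// Consult the tiers in order of increasing cost.
    fn lookup(&self, hash: &[u8; 32], global_eligible: bool) -> Option<ChunkRef> {
        if let Some(r) = self.session.get(hash) {
            return Some(r.clone()); // cheapest: in-memory, no I/O at all
        }
        if let Some(r) = self.shard_cache.get(hash) {
            return Some(r.clone()); // local metadata, no network round trip
        }
        if global_eligible {
            return query_global_dedupe(hash); // most expensive: network query
        }
        None
    }
}

/// Stub for the global dedupe query described below; a real client would
/// call the global dedupe API here and match against the returned shard.
fn query_global_dedupe(_chunk_hash: &[u8; 32]) -> Option<ChunkRef> {
    None
}
```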
The global deduplication system extends deduplication across all data managed by the Xet system.
To manage system load, not all chunks are eligible for global deduplication queries.
Recommendations:

- Spacing constraints: The global dedupe API is optimized to return information about nearby chunks when there is a match, so each query covers more than a single chunk. Consider issuing a request for an eligible chunk only about once every ~4 MB of data.
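A client-side gate combining chunk eligibility with the spacing recommendation might look like the sketch below. The ~4 MB spacing comes from the recommendation above; the eligibility predicate shown (a test on the hash's first byte) is purely illustrative, not the spec's actual criterion.

```rust
/// Sketch of client-side throttling for global dedupe queries.
const GLOBAL_DEDUPE_SPACING: u64 = 4 * 1024 * 1024; // ~4 MB between queries

struct GlobalDedupeGate {
    bytes_since_last_query: u64,
}

impl GlobalDedupeGate {
    /// Decide whether to issue a global dedupe query for this chunk.
    fn should_query(&mut self, chunk_hash: &[u8; 32], chunk_len: u64) -> bool {
        self.bytes_since_last_query += chunk_len;
        // Hypothetical eligibility filter: a cheap, deterministic test on the
        // hash so only a small fraction of chunks ever qualify.
        let eligible = chunk_hash[0] == 0;
        if eligible && self.bytes_since_last_query >= GLOBAL_DEDUPE_SPACING {
            // A match returns information about nearby chunks, so there is
            // little value in querying again before ~4 MB more data passes.
            self.bytes_since_last_query = 0;
            return true;
        }
        false
    }
}
```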
Global deduplication uses HMAC (Hash-based Message Authentication Code) to protect chunk hashes while enabling deduplication.
Security Properties:
Raw chunk hashes are never transmitted from servers to clients; to learn that a raw chunk hash exists in the system, a client must apply the HMAC transformation to a raw hash it already knows and find a match. The client knows that raw hash because it owns the data. A match tells the client which xorb contains the chunk and its position within that xorb, but reveals no other raw chunk hashes in that xorb or any other.
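The client-side transformation can be sketched as a keyed hash over the raw chunk hash. BLAKE3's keyed mode is used here as a stand-in; the exact HMAC construction and how keys are distributed with shards are defined elsewhere in the spec.

```rust
/// Sketch of the HMAC protection applied to chunk hashes. `hmac_key` is the
/// key associated with a server-provided shard; only the keyed output is
/// ever compared against server data, so raw hashes stay on the client.
fn hmac_protected(hmac_key: &[u8; 32], raw_chunk_hash: &[u8; 32]) -> [u8; 32] {
    *blake3::keyed_hash(hmac_key, raw_chunk_hash).as_bytes()
}
```

In this scheme, a client deduplicates against a returned shard by applying the transform to its own raw hashes and comparing the results to the shard's already-keyed chunk hashes.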
Each chunk's content is hashed with a cryptographic hash function (a Blake3-based MerkleHash) to create a unique identifier for content addressing. See the section about hashing.
When new chunks need to be stored, they are aggregated into xorbs based on size and count limits. If adding a new chunk would exceed the maximum xorb size or chunk count, the current xorb is finalized and uploaded, and a new one is started. See the section about xorb formation.
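A minimal accumulator for this policy might look like the following sketch; the specific limits shown (64 MB, 8,192 chunks) are assumptions for illustration, with the real values given in the xorb section.

```rust
/// Sketch of xorb accumulation with size and count caps (assumed values).
const MAX_XORB_BYTES: usize = 64 * 1024 * 1024; // assumed size cap
const MAX_XORB_CHUNKS: usize = 8 * 1024;        // assumed chunk-count cap

#[derive(Default)]
struct XorbBuilder {
    chunks: Vec<Vec<u8>>,
    total_bytes: usize,
}

impl XorbBuilder {
    /// Add a chunk, finalizing and uploading the current xorb first if the
    /// chunk would push it past either limit.
    fn add_chunk(&mut self, chunk: &[u8]) {
        if self.total_bytes + chunk.len() > MAX_XORB_BYTES
            || self.chunks.len() + 1 > MAX_XORB_CHUNKS
        {
            self.finalize_and_upload();
        }
        self.total_bytes += chunk.len();
        self.chunks.push(chunk.to_vec());
    }

    fn finalize_and_upload(&mut self) {
        // A real implementation would serialize, hash, and upload here.
        self.chunks.clear();
        self.total_bytes = 0;
    }
}
```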
When chunks are deduplicated, the system creates file reconstruction information that records, for each contiguous run of chunks, the xorb holding that run and the chunk range within it.

This information allows the system to reconstruct a file by fetching each referenced chunk range from its xorb and concatenating the ranges in order. See the section about file reconstruction.
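A hypothetical shape for this metadata, and the reconstruction loop it enables, is sketched below; the authoritative layout is the file info section of the shard format.

```rust
/// Hypothetical reconstruction metadata: an ordered list of terms, each
/// naming a xorb and a contiguous chunk range within it.
struct Term {
    xorb_hash: [u8; 32], // which xorb holds this run of chunks
    chunk_start: u32,    // first chunk index of the run (inclusive)
    chunk_end: u32,      // last chunk index of the run (exclusive)
}

struct FileReconstruction {
    terms: Vec<Term>,
}

/// Rebuild a file by fetching each referenced run and concatenating the
/// results in order. `fetch` stands in for a CAS range read.
fn reconstruct(
    file: &FileReconstruction,
    fetch: impl Fn(&[u8; 32], u32, u32) -> Vec<u8>,
) -> Vec<u8> {
    let mut out = Vec::new();
    for term in &file.terms {
        out.extend(fetch(&term.xorb_hash, term.chunk_start, term.chunk_end));
    }
    out
}
```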
While deduplication is valuable for saving space, applying it too aggressively can cause file fragmentation, meaning a file's chunks end up scattered across many different xorbs. This makes reading files slower and less efficient. To avoid it, xet-core aims (and implementors are encouraged) to keep long, contiguous runs of chunks together in the same xorb whenever possible. Instead of always deduplicating every possible chunk, the system sometimes chooses to reference a straight run of chunks in a single xorb, even if that means skipping deduplication for a few chunks. This balances the benefits of deduplication against the need to keep files easy and fast to read. For example, reference deduplicated chunks only in runs of a minimum length (e.g., at least 8 chunks), or target an average contiguous run length of at least 1 MB.
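One way to encode such a policy is a simple guard on candidate dedup runs, as sketched below. The thresholds mirror the examples above, but the policy is an implementation choice, not a protocol requirement.

```rust
/// Fragmentation guard: accept a dedup match only if it belongs to a
/// sufficiently long contiguous run within a single xorb.
const MIN_RUN_CHUNKS: usize = 8;          // example threshold from the text
const MIN_RUN_BYTES: usize = 1024 * 1024; // example threshold from the text

fn accept_dedup_run(run_chunk_count: usize, run_byte_len: usize) -> bool {
    // Short, scattered matches are skipped; re-storing those chunks keeps a
    // file's reads contiguous instead of touching many xorbs.
    run_chunk_count >= MIN_RUN_CHUNKS || run_byte_len >= MIN_RUN_BYTES
}
```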
Xet's chunk-level deduplication system provides a comprehensive solution for efficient data storage and transfer in large-scale data workflows. By combining local, cached, and global deduplication strategies with robust security mechanisms and fragmentation prevention, the system achieves significant storage savings while maintaining performance and data integrity.
The multi-tiered approach ensures that deduplication is both effective and efficient: session deduplication catches repeats instantly with no I/O, cached deduplication reuses recent uploads without network queries, and global deduplication discovers matches across the entire ecosystem under HMAC protection.
The system's design prioritizes both efficiency and safety, pairing the tiered deduplication strategies with security measures such as HMAC-protected hashes and with fragmentation controls, making it suitable for production use in data-intensive applications.