This document describes the complete process of downloading a single file over the Xet protocol using the CAS (Content Addressable Storage) reconstruction API.
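File download in the Xet protocol is a two-stage process: first query the reconstruction API for the file's reconstruction, then download the referenced xorb data and reassemble the file.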
To download a file given a file hash, first call the reconstruction API to get the file reconstruction. Follow the steps in api.md.
Note that you will need an auth token with at least read scope; see the auth reference.
For large files, it is recommended to request the reconstruction in batches, i.e. request the first 10GB, download all the data, then request the next 10GB, and so on. Use the Range header to specify a range of file data.
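The batching advice above can be sketched as a small helper that yields one Range header value per batch. This is a minimal sketch, not part of the protocol: the name batch_range_headers is hypothetical, and it assumes the total file size is known in advance. HTTP byte ranges are end-inclusive, so each batch ends one byte before the next one starts.

```python
from typing import Iterator

GB = 10 ** 9  # assumption: "10GB" is read as decimal gigabytes


def batch_range_headers(file_size: int, batch_size: int = 10 * GB) -> Iterator[str]:
    """Yield end-inclusive HTTP Range header values covering bytes [0, file_size)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    start = 0
    while start < file_size:
        # Last byte of this batch, clamped to the end of the file.
        end = min(start + batch_size, file_size) - 1
        yield f"bytes={start}-{end}"
        start = end + 1
```

Each yielded value would be sent as the Range header of one reconstruction request; the data for that batch is downloaded before the next request is made.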
The reconstruction API returns a QueryReconstructionResponse object with three key components:
{
  "offset_into_first_range": 0,
  "terms": [
    {
      "hash": "a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456",
      "unpacked_length": 263873,
      "range": {
        "start": 0,
        "end": 4
      }
    },
    ...
  ],
  "fetch_info": {
    "a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456": [
      {
        "range": {
          "start": 0,
          "end": 4
        },
        "url": "https://transfer.xethub.hf.co/xorb/default/a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456",
        "url_range": {
          "start": 0,
          "end": 131071
        }
      },
      ...
    ],
    ...
  }
}
- offset_into_first_range
  - Type: number
  - Byte offset into the decompressed data of the first term at which the requested data starts. It is 0 when the whole file is requested.
- terms
  - Type: Array<CASReconstructionTerm>
  - Each CASReconstructionTerm contains:
    - hash: The xorb hash (64-character lowercase hex string)
    - range: Chunk index range { start: number, end: number } within the xorb; end-exclusive [start, end)
    - unpacked_length: Expected length after decompression (for validation)
- fetch_info
  - Type: Map<Xorb Hash (64 character lowercase hex string), Array<CASReconstructionFetchInfo>>
  - Maps each xorb hash to one or more CASReconstructionFetchInfo entries.
  - Each CASReconstructionFetchInfo contains:
    - url: HTTP URL for downloading the xorb data; a presigned URL containing authorization information
    - url_range (bytes_start, bytes_end): Byte range { start: number, end: number } for the Range header; end-inclusive [start, end]. Set the header to Range: bytes=<start>-<end> when downloading this chunk range.
    - range (index_start, index_end): Chunk index range { start: number, end: number } that this URL provides; end-exclusive [start, end)
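To make the shape of the response concrete, the following sketch models the three components as Python dataclasses and parses a decoded JSON document into them. The class names (ChunkRange, ByteRange, Term, FetchInfo, Reconstruction) and parse_reconstruction are illustrative choices, not names defined by the protocol; only the field names come from the response described above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ChunkRange:
    start: int  # first chunk index (inclusive)
    end: int    # one past the last chunk index (exclusive)


@dataclass(frozen=True)
class ByteRange:
    start: int  # first byte (inclusive)
    end: int    # last byte (inclusive), as used in an HTTP Range header


@dataclass(frozen=True)
class Term:
    hash: str
    unpacked_length: int
    range: ChunkRange


@dataclass(frozen=True)
class FetchInfo:
    range: ChunkRange
    url: str
    url_range: ByteRange


@dataclass(frozen=True)
class Reconstruction:
    offset_into_first_range: int
    terms: List[Term]
    fetch_info: Dict[str, List[FetchInfo]]


def parse_reconstruction(doc: dict) -> Reconstruction:
    """Convert a decoded QueryReconstructionResponse JSON object into dataclasses."""
    terms = [
        Term(t["hash"], t["unpacked_length"], ChunkRange(**t["range"]))
        for t in doc["terms"]
    ]
    fetch_info = {
        xorb_hash: [
            FetchInfo(ChunkRange(**e["range"]), e["url"], ByteRange(**e["url_range"]))
            for e in entries
        ]
        for xorb_hash, entries in doc["fetch_info"].items()
    }
    return Reconstruction(doc["offset_into_first_range"], terms, fetch_info)
```

Keeping the end-exclusive chunk range and the end-inclusive byte range as separate types makes it harder to mix up the two conventions later on.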
1. Process each CASReconstructionTerm in order from the terms array.
2. For each CASReconstructionTerm, find matching fetch info using the term's hash:
   1. Look up the list of fetch info entries under the xorb hash of the CASReconstructionTerm. The xorb hash is guaranteed to exist as a key in the fetch_info map.
   2. Iterate through the list of CASReconstructionFetchInfo and find one which refers to a chunk range that is equal to or encompasses the term's chunk range.
3. Download the required data using an HTTP GET request with the Range header set.
4. Deserialize the downloaded xorb data to extract chunks.
   1. This series of chunks contains the chunks at the indices specified by the CASReconstructionFetchInfo's range field. Trim chunks at the beginning or end to match the chunks specified by the reconstruction term's range field.
   2. For the first term only, skip the first offset_into_first_range bytes.
5. Write the resulting chunks to the output in term order.

file_id = "0123...abcdef"
api_endpoint, token = get_token()  # follow auth.md instructions
url = api_endpoint + "/reconstructions/" + file_id
# get() stands for an HTTP GET that returns the parsed JSON response body
reconstruction = get(url, headers={"Authorization": "Bearer " + token})
# break the reconstruction into components
terms = reconstruction["terms"]
fetch_info = reconstruction["fetch_info"]
offset_into_first_range = reconstruction["offset_into_first_range"]
For each CASReconstructionTerm in the terms array:
1. Look up the term's hash in the fetch_info map to get a list of CASReconstructionFetchInfo.
2. Find a CASReconstructionFetchInfo entry where the fetch info's range contains the term's range.
   - Search through the list of CASReconstructionFetchInfo and find the element where the range block ({ "start": number, "end": number }) of the CASReconstructionFetchInfo has start <= term's range start AND end >= term's range end.

for term in terms:
    xorb_hash = term["hash"]
    fetch_info_entries = fetch_info[xorb_hash]
    fetch_info_entry = None
    for entry in fetch_info_entries:
        # the entry must cover the whole chunk range of the term
        if entry["range"]["start"] <= term["range"]["start"] and entry["range"]["end"] >= term["range"]["end"]:
            fetch_info_entry = entry
            break
    if fetch_info_entry is None:
        # Error! No fetch info entry covers this term's chunk range.
        raise ValueError("no fetch info entry covers the term's chunk range")
For each matched fetch info:
1. Make an HTTP GET request to the url in the fetch info entry.
2. Include a Range header: bytes={url_range.start}-{url_range.end}

for term in terms:
    ...
    data_url = fetch_info_entry["url"]
    url_range = fetch_info_entry["url_range"]
    # url_range values are numbers, so convert them before building the header string
    range_header = "bytes=" + str(url_range["start"]) + "-" + str(url_range["end"])
    data = get(data_url, headers={"Range": range_header})
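As a more concrete variant of the pseudocode above, here is a minimal sketch using only the Python standard library. The helper names (range_header, expected_length, build_xorb_request, download_xorb_range) are hypothetical, and the checks on the 206 status and on the body length are defensive assumptions rather than documented requirements. Because url_range is end-inclusive, the expected body length is end - start + 1.

```python
import urllib.request


def range_header(url_range: dict) -> str:
    """Format an end-inclusive url_range as an HTTP Range header value."""
    return f"bytes={url_range['start']}-{url_range['end']}"


def expected_length(url_range: dict) -> int:
    """Number of bytes covered by an end-inclusive byte range."""
    return url_range["end"] - url_range["start"] + 1


def build_xorb_request(entry: dict) -> urllib.request.Request:
    """Build (but do not send) the GET request for one fetch info entry."""
    return urllib.request.Request(
        entry["url"], headers={"Range": range_header(entry["url_range"])}
    )


def download_xorb_range(entry: dict) -> bytes:
    """Send the request and return the raw (still serialized) xorb bytes."""
    with urllib.request.urlopen(build_xorb_request(entry)) as resp:
        status = resp.status
        data = resp.read()
    # Assumption: a ranged request is answered with 206 Partial Content.
    if status != 206:
        raise RuntimeError(f"unexpected HTTP status {status}")
    if len(data) != expected_length(entry["url_range"]):
        raise RuntimeError("response length does not match url_range")
    return data
```

Separating request construction from sending keeps the header logic easy to check without network access.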
The downloaded data is in xorb format and must be deserialized:
- Deserialize and decompress each chunk in the downloaded range.
- Validate the decompressed length against unpacked_length from the term.

Note: The specific deserialization process depends on the Xorb format.

for term in terms:
    ...
    chunks = {}
    for i in range(fetch_info_entry["range"]["start"], fetch_info_entry["range"]["end"]):
        chunk = deserialize_chunk(data)  # assume data is a reader that advances forwards
        chunks[i] = chunk
    # at this point data should be fully consumed
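The length validation mentioned above could look like the sketch below. It assumes that unpacked_length is the total decompressed size of the chunks inside the term's own range (not of the whole downloaded fetch info range) and that chunks is a mapping from chunk index to decompressed bytes, as in the snippet above. The function name validate_term_length is hypothetical.

```python
from typing import Mapping


def validate_term_length(chunks: Mapping[int, bytes], term: dict) -> None:
    """Raise ValueError if the term's chunks do not add up to unpacked_length."""
    start, end = term["range"]["start"], term["range"]["end"]
    # Sum the decompressed sizes of the chunks in the end-exclusive range.
    total = sum(len(chunks[i]) for i in range(start, end))
    if total != term["unpacked_length"]:
        raise ValueError(
            f"term expects {term['unpacked_length']} bytes, got {total}"
        )
```

Running this check before applying offset_into_first_range keeps it independent of any range trimming done later.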
From the deserialized xorb data:
1. Use the term's range to identify which chunks are needed.
2. Extract only the chunks at indices range.start to range.end-1 (end-exclusive).
3. For the first term, skip the first offset_into_first_range bytes.

file_chunks = []
for term in terms:
    ...
    for i in range(term["range"]["start"], term["range"]["end"]):
        chunk = chunks[i]
        # the offset may span multiple chunks, so we may need to skip whole chunks
        if offset_into_first_range >= len(chunk):
            offset_into_first_range -= len(chunk)
            continue
        if offset_into_first_range > 0:
            chunk = chunk[offset_into_first_range:]
            offset_into_first_range = 0
        file_chunks.append(chunk)
Write all of the chunks to the output file or buffer.
If a range was specified, the total data will need to be truncated to the number of bytes requested.
When a range is specified but does not end on a chunk boundary, the last byte of the requested range will be in the middle of the last chunk.
A client knows the start of the data from offset_into_first_range and can then use the length of the specified range to determine the end offset.
with open(file_path, "wb") as f:  # chunks are bytes, so open in binary mode
    for chunk in file_chunks:
        f.write(chunk)
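The offset and truncation handling can be combined into one small generator. This is a sketch under stated assumptions: trim_output is a hypothetical name, chunks are already decompressed and in term order, and length is the number of requested bytes (for a request of bytes=start-end that is end - start + 1), or None for a full-file download.

```python
from typing import Iterable, Iterator, Optional


def trim_output(chunks: Iterable[bytes], offset: int,
                length: Optional[int] = None) -> Iterator[bytes]:
    """Skip `offset` leading bytes, then yield at most `length` bytes in total."""
    remaining = length
    for chunk in chunks:
        # The offset may cover one or more whole chunks; skip those entirely.
        if offset >= len(chunk):
            offset -= len(chunk)
            continue
        if offset:
            chunk = chunk[offset:]
            offset = 0
        if remaining is not None:
            if remaining <= 0:
                break
            # Cut the last chunk where the requested range ends.
            chunk = chunk[:remaining]
            remaining -= len(chunk)
        yield chunk
```

For example, with chunks of four bytes each, an offset of 2 and a length of 7, the output starts in the middle of the first chunk and ends in the middle of the third.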
For partial file downloads, the reconstruction API supports range queries:
- Include a Range: bytes=start-end header in the reconstruction request.
- The offset_into_first_range field indicates where your range starts within the first term.

When downloading individual term data:
A client must include the Range header formed with the values from the url_range field to specify the exact range of xorb data that they are accessing.
Xet global deduplication requires that access to xorbs is only granted to authorized ranges. Not specifying this header will result in an authorization failure.
Here's an example of a serialized QueryReconstructionResponse struct that shows how file reconstruction would work across multiple xorbs:
{
  "offset_into_first_range": 0,
  "terms": [
    {
      "hash": "a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456",
      "unpacked_length": 263873,
      "range": {
        "start": 1,
        "end": 4
      }
    },
    {
      "hash": "fedcba0987654321098765432109876543210fedcba098765432109876543",
      "unpacked_length": 143890,
      "range": {
        "start": 0,
        "end": 3
      }
    },
    {
      "hash": "a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456",
      "unpacked_length": 3063572,
      "range": {
        "start": 3,
        "end": 43
      }
    }
  ],
  "fetch_info": {
    "a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456": [
      {
        "range": {
          "start": 1,
          "end": 43
        },
        "url": "https://transfer.xethub.hf.co/xorb/default/a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIOSFODNN7EXAMPLE%2F20130721%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20130721T201207Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=d6796aa6097c82ba7e33b4725e8396f8a9638f7c3d4b5a6b7c8d9e0f1a2b3c4d",
        "url_range": {
          "start": 57980,
          "end": 1433008
        }
      }
    ],
    "fedcba0987654321098765432109876543210fedcba098765432109876543": [
      {
        "range": {
          "start": 0,
          "end": 3
        },
        "url": "https://transfer.xethub.hf.co/xorb/default/fedcba0987654321098765432109876543210fedcba098765432109876543?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIOSFODNN7EXAMPLE%2F20130721%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20130721T201207Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=d6796aa6097c82ba7e33b4725e8396f8a9638f7c3d4b5a6b7c8d9e0f1a2b3c4d",
        "url_range": {
          "start": 0,
          "end": 65670
        }
      }
    ]
  }
}
This example shows reconstruction of a file that requires:
- Chunks [1, 4) from the first xorb (~264KB of unpacked data)
- Chunks [0, 3) from the second xorb (~144KB of unpacked data)
- Chunks [3, 43) from the same xorb as the first term (~3MB of unpacked data)

The fetch_info provides the HTTP URLs and byte ranges needed to download the required chunk data from each xorb. The chunk index ranges (the range key) provided within the fetch_info and terms sections are always end-exclusive, i.e. { "start": 0, "end": 3 } is a range of 3 chunks at indices 0, 1 and 2.
The ranges provided under a fetch_info item's url_range key are to be used to form the Range header when downloading the chunk range.
A "url_range" value of { "start": X, "end": Y } creates a Range header value of bytes=X-Y.
When downloading and deserializing the chunks from xorb a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456 we will have the chunks at indices [1, 43).
We will need to only use the chunks at [1, 4) to fulfill the first term and then chunks [3, 43) to fulfill the third term.
Note that in this example the chunk at index 3 is used twice! This is the benefit of deduplication; we only need to download the chunk content once.
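One way to make sure shared xorb data is downloaded only once is to cache downloads by xorb hash and fetch info chunk range. The sketch below is illustrative only: download_unique_ranges is a hypothetical name, and fetch is a caller-supplied function (for instance a ranged HTTP GET) that takes a fetch info entry and returns its bytes.

```python
from typing import Callable, Dict, List, Tuple

CacheKey = Tuple[str, int, int]  # (xorb hash, chunk range start, chunk range end)


def download_unique_ranges(terms: List[dict],
                           fetch_info: Dict[str, List[dict]],
                           fetch: Callable[[dict], bytes]) -> Dict[CacheKey, bytes]:
    """Download each distinct fetch info range once, however many terms use it."""
    cache: Dict[CacheKey, bytes] = {}
    for term in terms:
        t_start, t_end = term["range"]["start"], term["range"]["end"]
        # Pick the first entry whose chunk range covers the term's chunk range.
        entry = next(
            (e for e in fetch_info[term["hash"]]
             if e["range"]["start"] <= t_start and e["range"]["end"] >= t_end),
            None,
        )
        if entry is None:
            raise ValueError("no fetch info entry covers the term's chunk range")
        key = (term["hash"], entry["range"]["start"], entry["range"]["end"])
        if key not in cache:
            cache[key] = fetch(entry)
    return cache
```

Applied to the example above, the first and third terms resolve to the same fetch info entry, so only two downloads are made for three terms.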
sequenceDiagram
autonumber
actor Client as "Client"
participant CAS as "CAS API"
participant Transfer as "Transfer Service (Xet storage)"
Client->>CAS: GET /reconstructions/{file_id}<br/>Authorization: Bearer <token><br/>Range: bytes=start-end (optional)
CAS-->>Client: 200 OK<br/>QueryReconstructionResponse {offset_into_first_range, terms[], fetch_info{}}
loop For each term in terms (ordered)
Client->>Client: Find fetch_info by xorb hash, entry whose range contains term.range
Client->>Transfer: GET {url}<br/>Range: bytes=url_range.start-url_range.end
Transfer-->>Client: 206 Partial Content<br/>xorb byte range
Client->>Client: Deserialize xorb → chunks for fetch_info.range
Client->>Client: Trim to term.range, apply offset for first term
Client->>Client: Append chunks to output
end
alt Range requested
Client->>Client: Truncate output to requested length
end