Storage¶
The storage layer round-trips CanonicalDataset through four
production backends: CSV, JSON, Parquet, and DuckDB.
The backend is auto-detected from the file extension; the SDK picks
the right writer without configuration.
Purpose¶
This page covers:
- The StorageRegistry interface.
- The four backends and their trade-offs.
- Round-trip semantics (Decimal preserved, ISO-8601 dates preserved, enums preserved).
- Incremental updates (APPEND / MERGE / REPLACE).
Prerequisites¶
- un-comtrade-sdk installed.
- For Parquet: pip install un-comtrade-sdk[parquet].
- For DuckDB: pip install un-comtrade-sdk[duckdb].
- A CanonicalDataset to write.
Walkthrough¶
Auto-detect from extension¶
from un_comtrade import ComtradeClient

with ComtradeClient() as client:
    exports = client.trade.get_exports(reporter_code=699, period="2022")
    client.storage.open("india_exports_2022.parquet").write(exports)
The open(uri) method inspects the extension (.csv, .json,
.parquet, .duckdb) and dispatches to the right writer.
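The dispatch described above can be pictured as a lookup on the path's final extension. The following is only a minimal sketch under stated assumptions: `detect_backend` and `BACKENDS_BY_EXTENSION` are hypothetical names for illustration, not part of the SDK, and the real registry may handle case and unknown extensions differently.

```python
from pathlib import Path

# Hypothetical mapping for illustration; the SDK's real registry may differ.
BACKENDS_BY_EXTENSION = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".duckdb": "duckdb",
}


def detect_backend(uri: str) -> str:
    """Return the backend name for a path, based on its final extension."""
    # Assumption: matching is case-insensitive.
    suffix = Path(uri).suffix.lower()
    try:
        return BACKENDS_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"no storage backend registered for {suffix!r}") from None


print(detect_backend("india_exports_2022.parquet"))  # parquet
print(detect_backend("INDIA_EXPORTS_2022.CSV"))      # csv
```

In this sketch an unrecognised extension such as `exports.tsv` raises `ValueError` rather than silently falling back to a default writer.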
Explicit backend¶
client.storage.csv("india_exports_2022.csv").write(exports)
client.storage.json("india_exports_2022.json").write(exports)
client.storage.parquet("india_exports_2022.parquet").write(exports)
client.storage.duckdb("india_exports_2022.duckdb").write(exports)
The explicit API is preferable when you want to fail fast on a typo or when the file extension is ambiguous.
Read back a dataset¶
dataset = client.storage.open("india_exports_2022.parquet").read()
print(f"{len(dataset.records):,} rows")
The read() method returns a fresh CanonicalDataset. The
round-tripped records compare equal to the originals: Decimal
values are preserved exactly, ISO-8601 dates are preserved, and
frozenset enums are preserved.
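To see why these three types need care in a text format such as JSON, here is a standard-library sketch of one way to round-trip them. It is not the SDK's serialiser: the record layout, field names, and the `encode`/`decode` helpers are all hypothetical.

```python
import json
from datetime import date
from decimal import Decimal

# Hypothetical record; field names are illustrative, not the SDK's schema.
record = {
    "trade_value": Decimal("1234567.10"),
    "ref_date": date(2022, 12, 31),
    "flags": frozenset({"ESTIMATED", "REVISED"}),
}


def encode(rec: dict) -> str:
    """Serialise Decimal as a string, date as ISO-8601, frozenset as a sorted list."""
    return json.dumps({
        "trade_value": str(rec["trade_value"]),
        "ref_date": rec["ref_date"].isoformat(),
        "flags": sorted(rec["flags"]),
    })


def decode(text: str) -> dict:
    """Rebuild the typed values from their JSON-safe forms."""
    raw = json.loads(text)
    return {
        "trade_value": Decimal(raw["trade_value"]),
        "ref_date": date.fromisoformat(raw["ref_date"]),
        "flags": frozenset(raw["flags"]),
    }


restored = decode(encode(record))
assert restored == record
# Decimal equality ignores trailing zeros, so also compare the string form.
assert str(restored["trade_value"]) == "1234567.10"

# Contrast: passing through float drops the trailing zero.
via_float = Decimal(str(float(record["trade_value"])))
assert str(via_float) != str(record["trade_value"])
```

Encoding the Decimal as a string is the design choice that keeps its scale intact; a JSON number would be parsed as a float by default.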
Incremental update¶
Three modes:
- APPEND: append records to the existing dataset.
- MERGE: merge new records by primary key (ref_period_id + reporter_code + partner_code + flow_code + cmd_code).
- REPLACE: replace the entire dataset.
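The difference between the three modes is easiest to see on a small in-memory example. The sketch below models records as plain dicts and is not the SDK's implementation: `UpdateMode`, `apply_update`, and the `row` helper are hypothetical, and it assumes that under MERGE an incoming record overwrites an existing record with the same primary key.

```python
from enum import Enum


class UpdateMode(Enum):  # hypothetical stand-in for the SDK's mode names
    APPEND = "append"
    MERGE = "merge"
    REPLACE = "replace"


# Primary-key fields as listed above.
KEY_FIELDS = ("ref_period_id", "reporter_code", "partner_code", "flow_code", "cmd_code")


def primary_key(record: dict) -> tuple:
    return tuple(record[field] for field in KEY_FIELDS)


def apply_update(existing: list, incoming: list, mode: UpdateMode) -> list:
    if mode is UpdateMode.REPLACE:
        return list(incoming)
    if mode is UpdateMode.APPEND:
        return existing + incoming
    # MERGE (assumption): incoming rows overwrite existing rows with the same key.
    merged = {primary_key(r): r for r in existing}
    merged.update({primary_key(r): r for r in incoming})
    return list(merged.values())


def row(period: int, partner: int, value: int) -> dict:
    """Build an illustrative record; the values are made up."""
    return {"ref_period_id": period, "reporter_code": 699, "partner_code": partner,
            "flow_code": "X", "cmd_code": "TOTAL", "value": value}


old = [row(2022, 842, 100), row(2022, 156, 50)]
new = [row(2022, 842, 120), row(2021, 842, 90)]

print(len(apply_update(old, new, UpdateMode.APPEND)))   # 4
print(len(apply_update(old, new, UpdateMode.MERGE)))    # 3
print(len(apply_update(old, new, UpdateMode.REPLACE)))  # 2
```

APPEND keeps both versions of the overlapping row, MERGE keeps one row per key, and REPLACE discards the old rows entirely.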
Examples¶
Write to all four backends and verify equality:
from un_comtrade import ComtradeClient

with ComtradeClient() as client:
    exports = client.trade.get_exports(reporter_code=699, period="2022")

    # Write to each backend.
    for ext in (".csv", ".json", ".parquet", ".duckdb"):
        path = f"india_exports_2022{ext}"
        client.storage.open(path).write(exports)

    # Read back from each backend and compare.
    originals = sorted(exports.records, key=lambda r: (r.partner_code, r.cmd_code))
    for ext in (".csv", ".json", ".parquet", ".duckdb"):
        path = f"india_exports_2022{ext}"
        dataset = client.storage.open(path).read()
        roundtripped = sorted(dataset.records, key=lambda r: (r.partner_code, r.cmd_code))
        assert originals == roundtripped, f"mismatch on {ext}"

print("All four backends produce identical round-trips.")
Incremental append with merge-by-key:
# Append 2021 records to an existing 2022 file.
exports_2021 = client.trade.get_exports(reporter_code=699, period="2021")
client.storage.append("india_exports_history.parquet", exports_2021)
# Read back the merged dataset.
merged = client.storage.open("india_exports_history.parquet").read()
print(f"{len(merged.records):,} rows in the merged dataset")
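After an append it can be worth checking that no primary key occurs twice, for example when the same period is appended a second time. The helper below is a hypothetical sketch, not an SDK function: it treats records as plain dicts and reuses the key fields listed under the MERGE mode.

```python
from collections import Counter

# Primary-key fields as described for the MERGE mode.
KEY_FIELDS = ("ref_period_id", "reporter_code", "partner_code", "flow_code", "cmd_code")


def duplicate_keys(records: list) -> list:
    """Return every primary key that appears more than once."""
    counts = Counter(tuple(r[field] for field in KEY_FIELDS) for r in records)
    return [key for key, n in counts.items() if n > 1]


# Illustrative rows with made-up values; the first two share a key.
rows = [
    {"ref_period_id": 2021, "reporter_code": 699, "partner_code": 842,
     "flow_code": "X", "cmd_code": "TOTAL"},
    {"ref_period_id": 2021, "reporter_code": 699, "partner_code": 842,
     "flow_code": "X", "cmd_code": "TOTAL"},
    {"ref_period_id": 2022, "reporter_code": 699, "partner_code": 842,
     "flow_code": "X", "cmd_code": "TOTAL"},
]

print(duplicate_keys(rows))  # [(2021, 699, 842, 'X', 'TOTAL')]
```

An empty result means each key is unique.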
Related Recipes¶
- RECIPE-031 — ETL pipeline into storage.
- RECIPE-032 — Export to CSV.
- RECIPE-033 — Export to Parquet.
- RECIPE-034 — Export to DuckDB.
- RECIPE-035 — Reload from storage.
- RECIPE-036 — Analytics on stored data.
Related API¶
- ComtradeClient.storage: the storage facade.
- StorageRegistry.open: the auto-detect accessor.
Related Guides¶
- Trade: produces CanonicalDataset.
- Analytics: typed analytics on top of the dataset.
- ETL: composes the fetch→export flow.
Next steps¶
- ETL — chain storage writes into a pipeline.
- Cookbook → storage recipes — full executable forms.