Storage¶

The storage layer round-trips a CanonicalDataset through four production backends: CSV, JSON, Parquet, and DuckDB. The backend is auto-detected from the file extension, so the SDK picks the right writer without any configuration.

Purpose¶

This page covers:

The StorageRegistry interface.
The four backends and their trade-offs.
Round-trip semantics (Decimal preserved, ISO-8601 dates preserved, enums preserved).
Incremental updates (APPEND / MERGE / REPLACE).

Prerequisites¶

un-comtrade-sdk installed.
For Parquet: pip install un-comtrade-sdk[parquet].
For DuckDB: pip install un-comtrade-sdk[duckdb].
A CanonicalDataset to write.

Walkthrough¶

Auto-detect from extension¶

from un_comtrade import ComtradeClient

with ComtradeClient() as client:
    exports = client.trade.get_exports(reporter_code=699, period="2022")
    client.storage.open("india_exports_2022.parquet").write(exports)

The open(uri) method inspects the extension (.csv, .json, .parquet, .duckdb) and dispatches to the right writer.
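To make the dispatch rule concrete, here is a minimal standard-library sketch of extension-based detection. It is not the SDK's implementation: the function name `detect_backend`, the `_BACKENDS` table, the case-insensitive matching, and the error type are all assumptions made for illustration. Only the four extensions come from this page.

```python
from pathlib import PurePath

# Hypothetical lookup table; only the four extensions are taken from this page.
_BACKENDS = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".duckdb": "duckdb",
}


def detect_backend(uri: str) -> str:
    """Return a backend name based only on the final suffix of the path."""
    # Lower-casing the suffix is an assumption, not documented SDK behaviour.
    suffix = PurePath(uri).suffix.lower()
    try:
        return _BACKENDS[suffix]
    except KeyError:
        # An unknown or missing extension cannot be dispatched.
        raise ValueError(f"cannot infer a storage backend from {uri!r}") from None


print(detect_backend("india_exports_2022.parquet"))
```

In this sketch a misspelled extension such as `.parqet` raises immediately instead of silently choosing a default backend.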

Explicit backend¶

client.storage.csv("india_exports_2022.csv").write(exports)
client.storage.json("india_exports_2022.json").write(exports)
client.storage.parquet("india_exports_2022.parquet").write(exports)
client.storage.duckdb("india_exports_2022.duckdb").write(exports)

The explicit API is preferable when you want to fail fast on a typo or when the file extension is ambiguous.

Read back a dataset¶

dataset = client.storage.open("india_exports_2022.parquet").read()
print(f"{len(dataset.records):,} rows")

The read() method returns a fresh CanonicalDataset whose records compare equal to the originals: Decimal values keep their exact precision, ISO-8601 dates are preserved, and frozenset enums are preserved.
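The round-trip guarantee is easiest to see on a single record. The sketch below uses only the standard library and a text format (JSON) to show one way such a guarantee can be achieved: Decimal values travel as strings, dates as ISO-8601 text, and enums by value. The `Record` and `Flow` types and their field names are hypothetical stand-ins, not the SDK's actual schema or encoding.

```python
import json
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from enum import Enum


class Flow(Enum):
    """Hypothetical enum, used only for this illustration."""
    EXPORT = "X"
    IMPORT = "M"


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for one dataset record."""
    period_start: date
    flow: Flow
    trade_value: Decimal


def to_json(record: Record) -> str:
    # Encode every non-JSON type as text so that no precision is lost.
    return json.dumps({
        "period_start": record.period_start.isoformat(),
        "flow": record.flow.value,
        "trade_value": str(record.trade_value),
    })


def from_json(text: str) -> Record:
    raw = json.loads(text)
    return Record(
        period_start=date.fromisoformat(raw["period_start"]),
        flow=Flow(raw["flow"]),
        trade_value=Decimal(raw["trade_value"]),
    )


original = Record(date(2022, 1, 1), Flow.EXPORT, Decimal("1234567890.10"))
restored = from_json(to_json(original))

assert restored == original
# The trailing zero survives because the value never passed through a float.
assert str(restored.trade_value) == "1234567890.10"
```

Encoding the Decimal as a string is the key design choice here: routing the same value through a binary float would drop the trailing zero and can change the last digits of large values.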

Incremental update¶

client.storage.update("india_exports_2022.parquet", mode="append")

Three modes:

APPEND — append records to the existing dataset.
MERGE — merge new records by primary key (ref_period_id + reporter_code + partner_code + flow_code + cmd_code).
REPLACE — replace the entire dataset.
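The difference between the three modes can be sketched on plain dictionaries. The composite key fields below are the ones listed for MERGE. Everything else is an assumption for illustration: the function names, the lowercase mode strings, the sample values, and in particular the rule that an incoming record overwrites an existing one with the same key. The SDK's actual conflict handling may differ.

```python
# Composite key fields as listed for MERGE above.
KEY_FIELDS = ("ref_period_id", "reporter_code", "partner_code", "flow_code", "cmd_code")


def record_key(record: dict) -> tuple:
    """Build the composite primary key for one record."""
    return tuple(record[field] for field in KEY_FIELDS)


def apply_update(existing: list, incoming: list, mode: str) -> list:
    """Hypothetical in-memory model of the three update modes."""
    if mode == "append":
        # Keep everything; duplicate keys are not removed.
        return existing + incoming
    if mode == "replace":
        # Discard the existing records entirely.
        return list(incoming)
    if mode == "merge":
        merged = {record_key(r): r for r in existing}
        for r in incoming:
            # Assumption: the incoming record wins when keys collide.
            merged[record_key(r)] = r
        return list(merged.values())
    raise ValueError(f"unknown mode: {mode!r}")


def row(partner_code: int, value: int) -> dict:
    """Build a sample record; all field values are made up."""
    return {
        "ref_period_id": "2022",
        "reporter_code": 699,
        "partner_code": partner_code,
        "flow_code": "X",
        "cmd_code": "TOTAL",
        "value": value,
    }


existing = [row(842, 100), row(826, 50)]
incoming = [row(842, 120), row(156, 70)]

for mode in ("append", "merge", "replace"):
    print(mode, len(apply_update(existing, incoming, mode)))
```

With this sample data, append keeps both records for partner 842, merge collapses them into one, and replace returns only the incoming records.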

Examples¶

Write to all four backends and verify equality:

from un_comtrade import ComtradeClient

with ComtradeClient() as client:
    exports = client.trade.get_exports(reporter_code=699, period="2022")

    # Write to each backend.
    for ext in (".csv", ".json", ".parquet", ".duckdb"):
        path = f"india_exports_2022{ext}"
        client.storage.open(path).write(exports)

    # Read back from each backend and compare.
    originals = sorted(exports.records, key=lambda r: (r.partner_code, r.cmd_code))
    for ext in (".csv", ".json", ".parquet", ".duckdb"):
        path = f"india_exports_2022{ext}"
        dataset = client.storage.open(path).read()
        roundtripped = sorted(dataset.records, key=lambda r: (r.partner_code, r.cmd_code))
        assert originals == roundtripped, f"mismatch on {ext}"
    print("All four backends produce identical round-trips.")

Incremental append to an existing file:

# Append 2021 records to a history file that already holds 2022 data.
# Assumes `client` is an open ComtradeClient, as in the previous example.
exports_2021 = client.trade.get_exports(reporter_code=699, period="2021")
client.storage.append("india_exports_history.parquet", exports_2021)

# Read back the merged dataset.
merged = client.storage.open("india_exports_history.parquet").read()
print(f"{len(merged.records):,} rows in the merged dataset")

Related recipes:

RECIPE-031 — ETL pipeline into storage.
RECIPE-032 — Export to CSV.
RECIPE-033 — Export to Parquet.
RECIPE-034 — Export to DuckDB.
RECIPE-035 — Reload from storage.
RECIPE-036 — Analytics on stored data.

API reference:

ComtradeClient.storage — the storage facade.
StorageRegistry.open — the auto-detect accessor.

See also:

Trade — produces CanonicalDataset.
Analytics — typed analytics on top of the dataset.
ETL — composes the fetch → export flow.

Next steps¶

ETL — chain storage writes into a pipeline.
Cookbook → storage recipes — full executable forms.