Design

How the pieces fit together

Database schema

The database is backed by DuckDB, chosen for its zero-configuration embedded operation, fast analytical queries, and ability to handle large result sets without a separate server process.

erDiagram
    kernels {
        VARCHAR sha256 PK
        VARCHAR filename
        VARCHAR kernel_type
        BIGINT size_bytes
    }
    locations {
        VARCHAR sha256 PK
        VARCHAR abs_path PK
        VARCHAR mission
        VARCHAR source_url
        TIMESTAMP scanned_at
    }
    metakernel_entries {
        VARCHAR mk_path PK
        INTEGER entry_index PK
        VARCHAR raw_entry
        VARCHAR filename
    }
    metakernel_registry {
        VARCHAR mk_path PK
        VARCHAR mission
        VARCHAR source_url
        VARCHAR filename
        TIMESTAMP acquired_at
    }
    missions {
        VARCHAR name PK
        VARCHAR server_url
        VARCHAR mk_dir_url
        BOOLEAN dedup
        TIMESTAMP added_at
    }
    kernels ||--o{ locations : "one hash → many locations"
    metakernel_registry ||--o{ metakernel_entries : "one .tm → many entries"

The separation between kernels and locations is the core of the design:

  • A kernel is a unique piece of content, identified by its SHA-256 hash. The filename column stores the first-registered name (the “canonical” name), but this is purely informational.
  • A location is a place on disk where that content exists. The same hash can appear in many locations — that’s exactly how duplicates are represented.

When generic_kernels/spk/satellites/jup365.bsp and JUICE/kernels/spk/jup365_19900101_20500101.bsp have identical content:

Table       sha256      filename / abs_path                                    mission
kernels     a1b2c3...   jup365.bsp
locations   a1b2c3...   .../generic_kernels/spk/satellites/jup365.bsp          generic
locations   a1b2c3...   .../JUICE/kernels/spk/jup365_19900101_20500101.bsp     JUICE

One hash, two locations, two different filenames. The tool knows they’re the same file.
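The shape of this lookup can be sketched directly in SQL. The snippet below uses the standard-library sqlite3 module as a stand-in for DuckDB (purely so the sketch is self-contained; the project itself uses duckdb, and the query shape is the same in both). Table and column names follow the diagram above; the hash value is abbreviated.

```python
import sqlite3

# In-memory stand-in for the kernels/locations split described above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE kernels   (sha256 TEXT PRIMARY KEY, filename TEXT);
CREATE TABLE locations (sha256 TEXT, abs_path TEXT, mission TEXT,
                        PRIMARY KEY (sha256, abs_path));
""")
con.execute("INSERT INTO kernels VALUES ('a1b2c3', 'jup365.bsp')")
con.executemany("INSERT INTO locations VALUES (?, ?, ?)", [
    ("a1b2c3", "/data/generic_kernels/spk/satellites/jup365.bsp", "generic"),
    ("a1b2c3", "/data/JUICE/kernels/spk/jup365_19900101_20500101.bsp", "JUICE"),
])

# One kernels row, many locations rows: a duplicate is simply a hash
# that appears at more than one location.
dupes = con.execute("""
    SELECT k.filename, COUNT(*) AS copies
    FROM kernels k JOIN locations l ON k.sha256 = l.sha256
    GROUP BY k.sha256, k.filename
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [('jup365.bsp', 2)]
```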

Content-addressed identity

Every file registered with the database goes through SHA-256 hashing. This is the sole source of truth for “are these the same file?”

from spice_kernel_db.hashing import sha256_file

h1 = sha256_file("/data/generic_kernels/spk/satellites/jup365.bsp")
h2 = sha256_file("/data/JUICE/kernels/spk/jup365_19900101_20500101.bsp")
assert h1 == h2  # same content, different names

Why SHA-256 and not a simpler check like file size?

  • File size is not unique. Two different CK files could easily have the same byte count.
  • Filename is not reliable. As shown above, missions rename files.
  • SHA-256 is fast enough. Even for large SPK/BSP files (hundreds of MB), hashing takes under a second on modern hardware. The cost is paid once at scan time.
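A minimal streaming implementation in the spirit of sha256_file looks like this (the real function's signature and chunk size may differ; this is an illustrative sketch, not the project's actual code):

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so even multi-hundred-MB SPK files
    never need to sit fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Because only the bytes are hashed, two files with identical content always produce the same digest regardless of their names or paths.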

What happens when the same content arrives under a different name?

When you scan_directory, each file is hashed. If the hash already exists in kernels but under a different filename, the tool:

  1. Logs an informational message: "Hash match: jup365_19900101_20500101.bsp is identical to already-registered jup365.bsp"
  2. Records the new location (with the new path and filename) pointing to the existing hash
  3. Does not create a second kernels row

This means find_by_filename("jup365_19900101_20500101.bsp") will find all locations whose path ends with that filename, even though the canonical name in the kernels table is jup365.bsp. The lookup checks both the kernels.filename column and the actual filename in locations.abs_path.
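The registration logic above can be sketched as follows. A plain dict stands in for the database, and the function name and shape are illustrative rather than the project's actual API:

```python
import os


def register_file(db, abs_path, mission, file_hash):
    """Scan-time registration sketch: at most one kernels row per hash,
    but a new locations row for every path where the content is seen."""
    filename = os.path.basename(abs_path)
    canonical = db["kernels"].get(file_hash)
    if canonical is None:
        # First time this content is seen: its name becomes the canonical name.
        db["kernels"][file_hash] = filename
    elif canonical != filename:
        # Same bytes, different name: log it, but do not add a kernels row.
        print(f"Hash match: {filename} is identical to already-registered {canonical}")
    db["locations"].append(
        {"sha256": file_hash, "abs_path": abs_path, "mission": mission}
    )


db = {"kernels": {}, "locations": []}
register_file(db, "/data/generic_kernels/spk/satellites/jup365.bsp", "generic", "a1b2c3")
register_file(db, "/data/JUICE/kernels/spk/jup365_19900101_20500101.bsp", "JUICE", "a1b2c3")
```

After both calls there is one kernels entry ("jup365.bsp", the canonical name) and two locations entries.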

Mission-aware resolution

When the tool needs to find a local file for a kernel name — for instance, when rewriting a metakernel — it follows a strict priority chain:

flowchart TD
    A["resolve_kernel(filename, preferred_mission)"] --> B{Exact filename<br/>in preferred mission?}
    B -->|Yes| C["✅ Return path<br/>(no warning)"]
    B -->|No| D{Exact filename<br/>in any mission?}
    D -->|Yes| E["⚠️ Return path + warning:<br/>'not found in [MISSION],<br/>using copy from [OTHER]'"]
    D -->|No| F{Fuzzy match<br/>in preferred mission?}
    F -->|Yes| G["⚠️ Return path + warning:<br/>'matched by content hash<br/>to [canonical_name]'"]
    F -->|No| H{Fuzzy match<br/>in any mission?}
    H -->|Yes| I["⚠️ Return path + warning:<br/>'matched by hash to<br/>[name] in [OTHER]'"]
    H -->|No| J["❌ Return None"]

    style C fill:#d4edda
    style E fill:#fff3cd
    style G fill:#fff3cd
    style I fill:#fff3cd
    style J fill:#f8d7da

Why mission preference matters

Consider a scenario where both JUICE and generic_kernels have pck00011.tpc. They happen to be identical now. But if a future JUICE release ships a mission-specific override of that PCK file, the JUICE copy should take precedence when working with JUICE data.

By always checking the preferred mission first, the tool naturally handles this: it uses the mission’s own copy when available, and only falls back to other sources when necessary — always with a warning so you know what happened.

Fuzzy matching for filename aliases

The “fuzzy match” steps handle cases where a metakernel requests a file under a name that doesn’t exist in the database, but the content does exist under a different name.

The fuzzy match works by progressively stripping underscore-delimited suffixes from the filename stem. For jup365_19900101_20500101.bsp:

  1. Try jup365_19900101_20500101% → no match
  2. Try jup365_19900101% → no match
  3. Try jup365% → matches jup365.bsp

If a match is found, the tool checks that it refers to a file that actually exists on disk, then returns it with a warning noting the alias.
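Generating the candidate patterns for those three steps is straightforward. This sketch yields the LIKE patterns in the order the walkthrough describes (function name and the trailing `%` convention are taken from the examples above):

```python
from pathlib import PurePosixPath


def fuzzy_candidates(filename):
    """Yield LIKE patterns by progressively stripping underscore-delimited
    suffixes from the filename stem."""
    parts = PurePosixPath(filename).stem.split("_")
    while parts:
        yield "_".join(parts) + "%"
        parts.pop()


print(list(fuzzy_candidates("jup365_19900101_20500101.bsp")))
# ['jup365_19900101_20500101%', 'jup365_19900101%', 'jup365%']
```

Note that in SQL LIKE patterns an underscore is itself a single-character wildcard, so a real query built from these patterns would need to escape the underscores (or use a prefix comparison instead).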

Note

Fuzzy matching is a fallback for convenience. The authoritative identity check is always the SHA-256 hash. If you download the file the metakernel asks for and it turns out to have the same hash as something already in the database, the tool will detect that at scan time — no fuzzy matching needed.

Minimal metakernel edits

This is the most opinionated design decision in the tool. When rewriting a metakernel, the goal is to change as little as possible.

Why not just rewrite all the paths?

You could replace every entry in KERNELS_TO_LOAD with absolute paths to wherever the files live locally. This works, but:

  • The resulting metakernel is unrecognisable compared to the original. A diff would show every line changed.
  • You lose the ability to visually verify that the kernel list matches the mission’s official release.
  • If a kernel is loaded from a different mission’s tree (via fallback), it’s not obvious from reading the file.
  • The metakernel becomes tied to your specific machine’s filesystem layout.

Trust and traceability

This approach, in which only the PATH_VALUES assignment is rewritten and the kernel entries are resolved through a symlink tree, means:

  • You can diff the original and rewritten metakernels to verify that only PATH_VALUES changed.
  • The KERNELS_TO_LOAD list is the mission team’s original list, untouched. If they made a mistake, it’s their mistake, not yours.
  • The symlink tree is transparent: ls -la shows you exactly where each file actually lives.
  • If a kernel was resolved from a different mission via fallback, it shows up in the warnings at rewrite time — not silently buried in the metakernel.

Bridging filename aliases

The symlink tree also handles the filename alias problem seamlessly. If the metakernel asks for $KERNELS/spk/jup365_19900101_20500101.bsp but the tool found the content under jup365.bsp in generic_kernels, the symlink is:

link_root/spk/jup365_19900101_20500101.bsp  →  /data/generic_kernels/spk/satellites/jup365.bsp

The symlink is named what the metakernel expects, and points to where the content actually lives. SPICE follows symlinks transparently, so it loads the right data.
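In terms of plain filesystem calls, the alias bridge is just an os.symlink whose link name comes from the metakernel and whose target comes from the database. The sketch below uses temporary paths in place of the real tree:

```python
import os
import tempfile

root = tempfile.mkdtemp()

# Where the content actually lives (hypothetical layout mirroring the example).
target = os.path.join(root, "generic_kernels", "spk", "satellites", "jup365.bsp")
os.makedirs(os.path.dirname(target))
with open(target, "wb") as f:
    f.write(b"\x00SPK data")

# The link carries the name the metakernel expects.
link_dir = os.path.join(root, "link_root", "spk")
os.makedirs(link_dir)
link = os.path.join(link_dir, "jup365_19900101_20500101.bsp")
os.symlink(target, link)

# Any reader (SPICE included) follows the symlink to the real bytes.
with open(link, "rb") as f:
    assert f.read() == b"\x00SPK data"
```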

Remote acquisition architecture

The tool supports downloading kernels from remote SPICE servers, starting with NASA NAIF and ESA SPICE. Both servers use Apache mod_autoindex for directory listing, which the tool parses to discover available metakernels and their dependencies.

Multi-server support

Mission configurations are stored in the missions table with:

  • name: Mission identifier (e.g., JUICE, MRO)
  • server_url: Base server URL (e.g., https://spice.esac.esa.int/)
  • mk_dir_url: Directory containing mission metakernels (e.g., data/SPICE/JUICE/kernels/mk/)
  • dedup: Whether to deduplicate downloaded kernels (defaults to True for backwards compatibility)
  • added_at: Timestamp of mission registration

This decouples the tool from hardcoded server assumptions and allows users to configure any number of SPICE servers with similar directory structures.

URL resolution for kernel downloads

When fetching a remote metakernel, the tool must translate relative PATH_VALUES entries into absolute download URLs. The resolution algorithm:

  1. Parse the metakernel to extract PATH_VALUES and KERNELS_TO_LOAD
  2. For each PATH_VALUES entry, resolve it relative to the metakernel’s own URL (not the server root)
  3. For each kernel entry, expand path symbols and construct the full download URL
  4. Record the kernel’s source URL in the database for traceability

For example, if the metakernel is at:

https://spice.esac.esa.int/data/SPICE/JUICE/kernels/mk/juice_ops.tm

And contains:

PATH_VALUES  = ( '..' )
KERNELS_TO_LOAD = ( '$KERNELS/lsk/naif0012.tls' )

The kernel download URL becomes:

https://spice.esac.esa.int/data/SPICE/JUICE/kernels/lsk/naif0012.tls

This mirrors how SPICE itself would resolve paths if the metakernel were used in place on the server.
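This resolution can be reproduced with urllib.parse.urljoin, which drops the metakernel filename and applies the relative path segments exactly as a browser would. A PATH_VALUES entry of '..' (pointing from the mk/ directory up to kernels/, as in typical JUICE metakernels) is assumed here:

```python
from urllib.parse import urljoin

mk_url = "https://spice.esac.esa.int/data/SPICE/JUICE/kernels/mk/juice_ops.tm"
path_value = ".."                        # one PATH_VALUES entry
kernel = "$KERNELS/lsk/naif0012.tls"     # one KERNELS_TO_LOAD entry

# Expand the path symbol, then resolve relative to the metakernel's own URL.
relative = kernel.replace("$KERNELS", path_value)
url = urljoin(mk_url, relative)
print(url)  # https://spice.esac.esa.int/data/SPICE/JUICE/kernels/lsk/naif0012.tls
```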

Parallel downloads

Kernel downloads are performed in parallel using tqdm.contrib.concurrent.thread_map(), with progress bars for both individual files and the overall batch. Each download:

  • Computes SHA-256 during streaming to disk
  • Checks the database for existing copies with the same hash (deduplication)
  • Records the download URL as source_url in the locations table
  • Validates file integrity against Content-Length headers when available

Auto-linking for immediate use

After downloading all kernels referenced by a metakernel, the tool creates symlinks in a directory adjacent to the downloaded metakernel that mirror the original server’s directory structure. This allows the metakernel to work immediately without rewriting:

downloaded/
├── juice_ops.tm
└── juice_ops_links/
    ├── lsk/ → ../../<actual_location>/lsk/
    ├── pck/ → ../../<actual_location>/pck/
    └── spk/ → ../../<actual_location>/spk/

The metakernel’s PATH_VALUES is rewritten to point to juice_ops_links/, preserving the original kernel list untouched.

Registry tracking

All acquired metakernels are registered in the database:

  • metakernel_registry: Stores the metakernel path, mission, source URL, and acquisition timestamp
  • metakernel_entries: Stores each KERNELS_TO_LOAD entry with its index and parsed filename

This allows the tool to track what has been downloaded, detect updates on the remote server, and provide an audit trail for reproducibility.

Deduplication strategy

For users who want to reclaim disk space, the tool can replace duplicate files with symlinks to a single canonical copy.

The canonical copy is selected with a preference for the generic_kernels tree, on the grounds that:

  • Generic kernels are the “source of truth” for shared files
  • They’re typically the first place you’d download from
  • Mission-specific copies should be replaceable without loss of information

Per-mission deduplication control

Missions can opt out of deduplication via the missions table’s dedup column. When set to False, kernels from that mission will never be replaced with symlinks during deduplication, even if identical copies exist elsewhere.

This is useful for missions where preserving the exact on-disk layout is critical (e.g., for archival purposes or when symlinks might break external tools). Missions not registered in the missions table default to dedup=True for backwards compatibility.

The deduplicate_with_symlinks() method supports a dry_run mode (the default) so you can see exactly what would change before committing. After deduplication, all original paths still work — they’re just symlinks now — and the database remains accurate because the hash and location information hasn’t changed.
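The core of the operation can be sketched as follows. The function name and shape here are illustrative, not the project's actual method; what it shows is the dry_run-by-default pattern and the remove-then-symlink replacement:

```python
import os
import tempfile


def dedup_with_symlinks(canonical, duplicates, dry_run=True):
    """Sketch: replace each duplicate with a symlink to the canonical copy.
    With dry_run=True (the default) only the planned actions are returned."""
    actions = []
    for dup in duplicates:
        actions.append(f"replace {dup} -> symlink to {canonical}")
        if not dry_run:
            os.remove(dup)
            os.symlink(canonical, dup)
    return actions


# Hypothetical demo: two identical files in a temp directory.
root = tempfile.mkdtemp()
canonical = os.path.join(root, "jup365.bsp")
dup = os.path.join(root, "jup365_copy.bsp")
for p in (canonical, dup):
    with open(p, "wb") as f:
        f.write(b"same content")

print(dedup_with_symlinks(canonical, [dup]))          # dry run: report only
dedup_with_symlinks(canonical, [dup], dry_run=False)  # actually relink
```

After the real run the duplicate path still opens fine; it simply resolves through the symlink to the canonical bytes, so nothing in the database needs updating.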

Warning

Deduplication with symlinks requires a filesystem that supports symlinks (Linux, macOS). It won’t work on Windows NTFS without Developer Mode or elevated privileges, and won’t work at all on FAT32.

Kernel type classification

Files are classified by extension into SPICE kernel types:

Extension   Kernel type   Description
.bc         CK            Orientation / pointing
.bsp        SPK           Ephemeris / trajectory
.bpc        PCK           Binary planetary constants
.tpc        PCK           Text planetary constants
.tf         FK            Frame definitions
.ti         IK            Instrument parameters
.tls        LSK           Leapseconds
.tsc        SCLK          Spacecraft clock coefficients
.bds        DSK           Digital shape models
.tm         MK            Metakernels

This classification is used for the stats() breakdown and is stored in the database for potential future filtering.
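As a sketch, the table above reduces to a dict lookup keyed on the lowercased extension (the fallback label for unrecognised extensions is an assumption, not necessarily what the tool stores):

```python
from pathlib import PurePosixPath

# Mapping taken directly from the table above.
KERNEL_TYPES = {
    ".bc": "CK", ".bsp": "SPK", ".bpc": "PCK", ".tpc": "PCK",
    ".tf": "FK", ".ti": "IK", ".tls": "LSK", ".tsc": "SCLK",
    ".bds": "DSK", ".tm": "MK",
}


def classify(filename):
    """Classify a kernel file by extension; 'UNKNOWN' for anything else."""
    return KERNEL_TYPES.get(PurePosixPath(filename).suffix.lower(), "UNKNOWN")


print(classify("jup365.bsp"))    # SPK
print(classify("naif0012.tls"))  # LSK
```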

Mission auto-detection

When scanning a directory, if you don’t supply a --mission flag, the tool guesses the mission name from the directory structure. It looks for the pattern .../<MISSION>/kernels/... and takes the directory name immediately before kernels/.

For example:

Path                                      Detected mission
/data/JUICE/kernels/lsk/naif0012.tls      JUICE
/data/MRO/kernels/ck/mro_sc_2024.bc       MRO
/data/generic_kernels/lsk/naif0012.tls    Falls back to filename heuristic → generic

You can always override this with the mission parameter in the API or --mission on the CLI.
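The .../<MISSION>/kernels/... rule is a simple walk over the path components. This sketch returns None when the pattern is absent, at which point the real tool falls back to its filename heuristic (the function name is illustrative):

```python
from pathlib import PurePosixPath


def detect_mission(path):
    """Return the directory name immediately before a 'kernels' component,
    or None when no such component exists."""
    parts = PurePosixPath(path).parts
    for i, part in enumerate(parts[1:], start=1):
        if part == "kernels":
            return parts[i - 1]
    return None


print(detect_mission("/data/JUICE/kernels/lsk/naif0012.tls"))    # JUICE
print(detect_mission("/data/generic_kernels/lsk/naif0012.tls"))  # None
```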

Dependency roadmap

The project currently depends on only two packages beyond the standard library: duckdb for the database engine and rich for terminal output (tables, progress bars, panels). This section documents planned additions that would simplify the codebase or add meaningful capability.

httpx — HTTP client

The remote.py module currently uses urllib.request from the standard library. This works but results in verbose code — each HTTP call requires manual request construction, error handling, and timeout management. Replacing with httpx would:

  • Simplify ~30–40 lines of URL-fetching boilerplate across remote.py and cli.py
  • Provide consistent timeout handling and retry semantics
  • Enable a future migration to async for parallel downloads (currently uses ThreadPoolExecutor)
  • Replace patterns like urllib.request.Request(url, method="HEAD") + urlopen() with httpx.head(url)

beautifulsoup4 — HTML parsing

Directory listings from NASA NAIF and ESA SPICE servers are parsed with regex patterns in remote.py. These patterns handle the two known server HTML formats (plain-text Apache listings from NASA, table-based listings from ESA) but are inherently fragile — any change to the HTML structure would require regex updates.

Replacing with beautifulsoup4 and CSS selectors would make the parsing ~20–25 lines shorter and significantly more robust against minor HTML changes from the upstream servers.

spiceypy — SPICE toolkit (optional)

SpiceyPy wraps NASA’s CSPICE library. Adding it as an optional dependency would enable:

  • Kernel validation: verify that downloaded files are valid SPICE kernels (not just by extension)
  • Metadata extraction: read coverage windows, reference frames, and object IDs from binary kernels
  • Metakernel loading verification: actually load a rewritten metakernel through SPICE to confirm it works

This would be an optional dependency (pip install spice-kernel-db[spice]) since CSPICE requires a compiled binary and not all users need runtime validation.