Preprocessing

Input Format

SILO ingests data as NDJSON (newline-delimited JSON): each line holds one JSON object that describes a single sequence record. There is no separate TSV/FASTA input mode.

NDJSON files compressed as .zst or .xz are detected and decompressed transparently.

Preprocessing Configuration

The preprocessing configuration is a YAML file that controls where SILO reads its input and writes its output. All keys are optional unless noted otherwise. Filenames are resolved relative to inputDirectory.

Key	Type	Default	Default in Docker image	Description
`inputDirectory`	path	`./`	`/preprocessing/input/`	Directory containing the input files.
`outputDirectory`	path	`./output/`	`/preprocessing/output/`	Directory where SILO writes the preprocessed database state.
`ndjsonInputFilename`	path	(none — required)		NDJSON file with the input records, relative to `inputDirectory`. SILO will refuse to start preprocessing if this is unset.
`databaseConfig`	path	`database_config.yaml`		The database configuration file, relative to `inputDirectory`.
`referenceGenomeFilename`	path	`reference_genomes.json`		The reference genomes file, relative to `inputDirectory`.
`lineageDefinitionFilenames`	list	(absent)		A list of lineage-definition file names (see Lineage Definition Files), relative to `inputDirectory`.
`phyloTreeFilename`	path	(absent)		A phylogenetic-tree file (see Phylogenetic Tree File), relative to `inputDirectory`.
`withoutUnalignedSequences`	boolean	`false`		If `true`, SILO omits the unaligned-sequence column for each aligned nucleotide sequence.
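
The defaults and the "relative to inputDirectory" rule from the table can be sketched in Python. This is an illustration only, not SILO's implementation: `resolve_config` is a hypothetical helper, the file name `input.ndjson.zst` is made up, and treating an absent `lineageDefinitionFilenames` as an empty list is an assumption.

```python
from pathlib import PurePosixPath

# Defaults copied from the table above (non-Docker column).
DEFAULTS = {
    "inputDirectory": "./",
    "outputDirectory": "./output/",
    "databaseConfig": "database_config.yaml",
    "referenceGenomeFilename": "reference_genomes.json",
    "withoutUnalignedSequences": False,
}

# Keys whose values are single file names relative to inputDirectory.
RELATIVE_KEYS = (
    "ndjsonInputFilename",
    "databaseConfig",
    "referenceGenomeFilename",
    "phyloTreeFilename",
)


def resolve_config(user_config: dict) -> dict:
    """Hypothetical helper: merge defaults, then resolve input file names."""
    if "ndjsonInputFilename" not in user_config:
        # The only key the table marks as required.
        raise ValueError("ndjsonInputFilename is required")
    config = {**DEFAULTS, **user_config}
    input_dir = PurePosixPath(config["inputDirectory"])
    resolved = dict(config)
    for key in RELATIVE_KEYS:
        if key in config:
            resolved[key] = input_dir / config[key]
    # Assumption: an absent list behaves like an empty one.
    resolved["lineageDefinitionFilenames"] = [
        input_dir / name for name in config.get("lineageDefinitionFilenames", [])
    ]
    return resolved


resolved = resolve_config(
    {"inputDirectory": "data", "ndjsonInputFilename": "input.ndjson.zst"}
)
print(resolved["ndjsonInputFilename"])  # data/input.ndjson.zst
print(resolved["databaseConfig"])       # data/database_config.yaml
```

`PurePosixPath` is used so the printed paths are the same on every platform; no file is touched.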

NDJSON Record Schema

Each line in the NDJSON file is a flat JSON object. The top-level keys must include:

One entry for every metadata field declared in the database_config.yaml, using the same name and the type indicated in the schema.
One entry for every nucleotide segment and amino acid gene declared in the reference genomes file. The value is a sequence object, or null if the sequence is missing.

Additionally, raw (unaligned) nucleotide sequences may be provided under keys prefixed with unaligned_.

Unknown top-level keys are ignored with a warning. Missing required fields cause an error.
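
A pre-flight check along these lines can catch schema problems before a file is handed to SILO. The sketch below only mirrors the rules stated above; `check_record_keys` is a hypothetical helper, not part of SILO, and accepting every key that starts with `unaligned_` without checking the segment name is a simplifying assumption.

```python
import warnings


def check_record_keys(record: dict, metadata_fields: set, sequence_names: set) -> None:
    """Hypothetical check: error on missing keys, warn on unknown ones."""
    required = metadata_fields | sequence_names
    present = set(record)
    missing = required - present
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    for key in sorted(present - required):
        # Optional raw nucleotide sequences live under unaligned_* keys.
        if key.startswith("unaligned_"):
            continue
        warnings.warn(f"ignoring unknown top-level key: {key}")


record = {
    "primaryKey": "seq_001",
    "date": "2021-03-18",
    "main": None,              # a missing sequence is given as null
    "unaligned_main": "ACGT",
    "comment": "not declared in the schema",
}

# Emits one warning for "comment" and raises nothing.
check_record_keys(record, {"primaryKey", "date"}, {"main"})
```

Note that a key whose value is `null` still counts as present; only a key that is absent from the object is reported as missing here.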

Sequence Object

A sequence object has the following structure:

{
    "sequence": "ACGTACGT",
    "insertions": ["214:ACGT"],
    "offset": 0
}

Key	Type	Description
`sequence`	string	The aligned sequence as a string of valid symbols.
`sequenceCompressed`	string	Alternative to `sequence`: a base64-encoded, ZSTD-compressed sequence. The ZSTD dictionary must be the column’s reference sequence. Takes precedence over `sequence` if both are present.
`insertions`	array of strings	Insertions in the form `<position>:<symbols>`. The position is the index of the symbol after which the insertion is placed; position `0` inserts before the first symbol.
`offset`	integer	Optional offset into the reference (default: `0`).
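
To make the insertion position convention concrete, the following Python sketch parses `<position>:<symbols>` strings and splices them into a sequence. It is purely illustrative: both function names are hypothetical, `offset` is assumed to be `0`, and SILO itself keeps insertions separate from the aligned sequence rather than splicing them in.

```python
def parse_insertion(text: str) -> tuple[int, str]:
    """Split '<position>:<symbols>' into an integer position and its symbols."""
    position, symbols = text.split(":", 1)
    return int(position), symbols


def apply_insertions(sequence: str, insertions: list[str]) -> str:
    """Splice each insertion in after the symbol at its position (0 = before the first)."""
    parsed = sorted(parse_insertion(item) for item in insertions)
    result = sequence
    # Work from the right so earlier positions are not shifted by later splices.
    for position, symbols in reversed(parsed):
        result = result[:position] + symbols + result[position:]
    return result


print(apply_insertions("ACGTACGT", ["4:CC"]))  # ACGTCCACGT
print(apply_insertions("ACGTACGT", ["0:TT"]))  # TTACGTACGT
```

Because the position counts the symbols that precede the insertion, it maps directly onto a Python slice index.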

Example Record

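Given a database config with metadata fields primaryKey, date, country, age, and a reference genome with one nucleotide segment main and one gene E, a valid record looks like this (pretty-printed here for readability; in the NDJSON file the whole object must sit on a single line):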

{
    "primaryKey": "seq_001",
    "date": "2021-03-18",
    "country": "Switzerland",
    "age": 54,
    "main": { "sequence": "ACGTACGT", "insertions": ["4:CC"] },
    "E": { "sequence": "MYSF*", "insertions": [] }
}
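
One way to produce such a file is sketched below in Python, using only the standard library. The helper names and the file name `input.ndjson.xz` are assumptions made for illustration; the record is the example above. `json.dumps` without `indent` never emits a newline, so each record lands on exactly one line, and `lzma` writes the .xz container by default.

```python
import json
import lzma
import tempfile
from pathlib import Path

records = [
    {
        "primaryKey": "seq_001",
        "date": "2021-03-18",
        "country": "Switzerland",
        "age": 54,
        "main": {"sequence": "ACGTACGT", "insertions": ["4:CC"]},
        "E": {"sequence": "MYSF*", "insertions": []},
    }
]


def write_ndjson_xz(path: Path, rows: list) -> None:
    """Write one compact JSON object per line into an xz-compressed file."""
    with lzma.open(path, "wt", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


def read_ndjson_xz(path: Path) -> list:
    """Read the file back, skipping blank lines."""
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "input.ndjson.xz"
    write_ndjson_xz(path, records)
    round_tripped = read_ndjson_xz(path)

print(round_tripped == records)  # True
```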

Lineage Definition Files

A lineage-indexed metadata field (generateLineageIndex in the database config) requires a YAML file describing the lineage hierarchy. Multiple lineage systems can be declared via the lineageDefinitionFilenames list in the preprocessing config.

Each top-level key in the YAML is a lineage label. Per label you can specify:

parents: a list of parent lineage labels (omit, set to null, or use [] to mark a root).
aliases: a list of alternative names for the lineage.

Minimal example:

A:
    aliases:
        - Root
B:
    parents:
        - A
C:
    parents:
        - A
E:
    parents: [B, C]
    aliases:
        - LeafE

SILO verifies that the lineage labels are unique and that the relationships form a directed acyclic graph (no cycles). It makes no further assumptions about the lineage system. See documentation/lineage_definitions.md in the SILO repository for the authoritative spec.
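
The acyclicity requirement can be checked ahead of time with a topological sort. The Python sketch below uses Kahn's algorithm on a dict that mirrors the minimal example; it is not SILO's own validation. `topological_order` is a hypothetical helper, rejecting a parent that is not itself a declared label is an assumption, and label uniqueness is not checked because a dict cannot hold duplicate keys.

```python
from collections import deque

# The minimal example above, as it would look after YAML parsing.
lineages = {
    "A": {"aliases": ["Root"]},
    "B": {"parents": ["A"]},
    "C": {"parents": ["A"]},
    "E": {"parents": ["B", "C"], "aliases": ["LeafE"]},
}


def topological_order(definitions: dict) -> list:
    """Return the labels parents-first, or raise ValueError on a cycle."""
    # A missing, null, or empty parents entry marks a root.
    parents = {
        label: list((spec or {}).get("parents") or [])
        for label, spec in definitions.items()
    }
    children = {label: [] for label in parents}
    for label, parent_list in parents.items():
        for parent in parent_list:
            if parent not in parents:
                # Assumption: every parent must be a declared label.
                raise ValueError(f"unknown parent {parent!r} of {label!r}")
            children[parent].append(label)

    remaining = {label: len(parent_list) for label, parent_list in parents.items()}
    queue = deque(label for label, count in remaining.items() if count == 0)
    order = []
    while queue:
        label = queue.popleft()
        order.append(label)
        for child in children[label]:
            remaining[child] -= 1
            if remaining[child] == 0:
                queue.append(child)

    if len(order) != len(parents):
        raise ValueError("lineage definitions contain a cycle")
    return order


print(topological_order(lineages))  # ['A', 'B', 'C', 'E']
```

Any label that never reaches a parent count of zero is part of a cycle or descends from one, which is why comparing the two lengths is sufficient.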

Phylogenetic Tree File

A phylogenetic-tree-indexed metadata field (isPhyloTreeField in the database config) requires a tree file referenced by the phyloTreeFilename preprocessing-config key.

SILO accepts two formats. In both, all nodes (internal nodes and leaves alike) must be uniquely labelled. See documentation/phylogenetic_queries.md in the SILO repository for the authoritative spec.

Incremental Preprocessing

In addition to building a database from scratch, SILO supports appending new records to an existing database state via the silo append command. See documentation/incremental_preprocessing.md in the SILO repository for details.