Preprocessing

SILO ingests data in NDJSON format (Newline-Delimited JSON). One JSON object per line describes a single sequence record. There is no separate TSV/FASTA input mode.

NDJSON files compressed with Zstandard (.zst) or xz (.xz) are detected and decompressed transparently.
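
As an illustration of the input format, the following sketch writes records as xz-compressed NDJSON and reads them back using only the Python standard library. The function names are hypothetical and not part of SILO. Producing .zst output would need a third-party Zstandard package and is not shown.

```python
import json
import lzma


def write_ndjson_xz(records, path):
    """Write one JSON object per line into an xz-compressed file."""
    with lzma.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            # json.dumps without indent never emits raw newlines,
            # so each record occupies exactly one line.
            handle.write(json.dumps(record) + "\n")


def read_ndjson_xz(path):
    """Read the records back, skipping blank lines."""
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

A file written this way would typically be named with an .xz suffix and referenced through ndjsonInputFilename.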

The preprocessing configuration is a YAML file that controls where SILO reads its input and writes its output. All keys are optional unless noted otherwise. Filenames are resolved relative to inputDirectory.

| Key | Type | Default | Default in Docker image | Description |
| --- | --- | --- | --- | --- |
| inputDirectory | path | ./ | /preprocessing/input/ | Directory containing the input files. |
| outputDirectory | path | ./output/ | /preprocessing/output/ | Directory where SILO writes the preprocessed database state. |
| ndjsonInputFilename | path | (none; required) | | NDJSON file with the input records, relative to inputDirectory. SILO will refuse to start preprocessing if this is unset. |
| databaseConfig | path | database_config.yaml | | The database configuration file, relative to inputDirectory. |
| referenceGenomeFilename | path | reference_genomes.json | | The reference genomes file, relative to inputDirectory. |
| lineageDefinitionFilenames | list | (absent) | | A list of lineage-definition file names (see Lineage Definition Files), relative to inputDirectory. |
| phyloTreeFilename | path | (absent) | | A phylogenetic-tree file (see Phylogenetic Tree File), relative to inputDirectory. |
| withoutUnalignedSequences | boolean | false | | If true, SILO omits the unaligned-sequence column for each aligned nucleotide sequence. |
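
To make the defaults and the "relative to inputDirectory" rule concrete, here is a minimal sketch that applies them to an already-parsed config. It is an illustration only, not SILO's implementation: the function name is hypothetical, the defaults are copied from the non-Docker column, and modelling "(absent)" as an empty list or None is an assumption.

```python
from pathlib import Path

# Defaults copied from the table above (non-Docker column).
# Assumption: "(absent)" is modelled as an empty list or None.
DEFAULTS = {
    "inputDirectory": "./",
    "outputDirectory": "./output/",
    "databaseConfig": "database_config.yaml",
    "referenceGenomeFilename": "reference_genomes.json",
    "lineageDefinitionFilenames": [],
    "phyloTreeFilename": None,
    "withoutUnalignedSequences": False,
}

# Keys holding a single file name that is resolved against inputDirectory.
FILE_KEYS = (
    "ndjsonInputFilename",
    "databaseConfig",
    "referenceGenomeFilename",
    "phyloTreeFilename",
)


def resolve_preprocessing_config(user_config):
    """Merge user values over the defaults and resolve input file paths."""
    config = {**DEFAULTS, **user_config}
    if not config.get("ndjsonInputFilename"):
        raise ValueError("ndjsonInputFilename is required")
    input_dir = Path(config["inputDirectory"])
    for key in FILE_KEYS:
        if config.get(key) is not None:
            config[key] = input_dir / config[key]
    config["lineageDefinitionFilenames"] = [
        input_dir / name for name in config["lineageDefinitionFilenames"]
    ]
    return config
```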

Each line in the NDJSON file is a flat JSON object. The top-level keys must include:

  • One entry for every metadata field declared in the database_config.yaml, using the same name and the type indicated in the schema.
  • One entry for every nucleotide segment and amino acid gene declared in the reference genomes file. The value is a sequence object, or null if the sequence is missing.

Additionally, raw (unaligned) nucleotide sequences may be provided under keys prefixed with unaligned_.

Unknown top-level keys are ignored with a warning. Missing required fields cause an error.
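
The key rules above can be sketched as a pre-flight check to run before handing a file to SILO. This is a hypothetical helper, not SILO's own validation. It assumes the lists of metadata fields and sequence names have already been extracted from the database config and the reference genomes file, and it accepts any key starting with unaligned_ without checking the segment name.

```python
def check_record(record, metadata_fields, sequence_names):
    """Raise on missing required keys; return the unknown keys.

    Unknown keys are returned rather than logged so the caller can decide
    how to warn about them.
    """
    expected = set(metadata_fields) | set(sequence_names)
    missing = sorted(expected - set(record))
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return sorted(
        key
        for key in record
        if key not in expected and not key.startswith("unaligned_")
    )
```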

A sequence object has the following structure:

{
  "sequence": "ACGTACGT",
  "insertions": ["214:ACGT"],
  "offset": 0
}

| Key | Type | Description |
| --- | --- | --- |
| sequence | string | The aligned sequence as a string of valid symbols. |
| sequenceCompressed | string | Alternative to sequence: a base64-encoded, ZSTD-compressed sequence. The ZSTD dictionary must be the column’s reference sequence. Takes precedence over sequence if both are present. |
| insertions | array of strings | Insertions in the form <position>:<symbols>. The position is the index of the symbol after which the insertion is placed; position 0 inserts before the first symbol. |
| offset | integer | Optional offset into the reference (default: 0). |
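
The position convention for insertions can be illustrated with a short sketch that splices insertions into a sequence string. The helper names are hypothetical, and the sketch assumes an offset of 0 so that positions index directly into the given string. It demonstrates the convention only and makes no claim about how SILO stores insertions internally.

```python
def parse_insertion(text):
    """Split '<position>:<symbols>' into an (int, str) pair."""
    position, _, symbols = text.partition(":")
    return int(position), symbols


def apply_insertions(sequence, insertions):
    """Return the sequence with every insertion spliced in.

    Assumes offset 0. Position p places the symbols after the p-th symbol,
    so position 0 inserts before the first symbol.
    """
    result = sequence
    # Work from the highest position down so earlier indices stay valid.
    for position, symbols in sorted(map(parse_insertion, insertions), reverse=True):
        result = result[:position] + symbols + result[position:]
    return result
```

For example, apply_insertions("ACGTACGT", ["4:CC"]) places CC after the fourth symbol and returns "ACGTCCACGT".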

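Given a database config with metadata fields primaryKey, date, country, age, and a reference genome with one nucleotide segment main and one gene E, a valid record looks like this (pretty-printed here for readability; in the NDJSON file the whole object must occupy a single line):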

{
  "primaryKey": "seq_001",
  "date": "2021-03-18",
  "country": "Switzerland",
  "age": 54,
  "main": { "sequence": "ACGTACGT", "insertions": ["4:CC"] },
  "E": { "sequence": "MYSF*", "insertions": [] }
}
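
A record like this could be assembled programmatically as sketched below. The helper is hypothetical. It fills in null for any declared sequence that is not available, as the key rules require, and serializes the result as a single line.

```python
import json


def build_ndjson_line(metadata, sequences, sequence_names):
    """Combine metadata and sequence objects into one NDJSON line.

    Every name in sequence_names becomes a top-level key. Sequences that
    are not available are written as JSON null.
    """
    record = dict(metadata)
    for name in sequence_names:
        record[name] = sequences.get(name)  # None is serialized as null
    return json.dumps(record)


line = build_ndjson_line(
    {"primaryKey": "seq_001", "date": "2021-03-18", "country": "Switzerland", "age": 54},
    {"main": {"sequence": "ACGTACGT", "insertions": ["4:CC"]}},
    ["main", "E"],
)
```

Here gene E has no sequence, so the resulting line carries "E": null.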

Lineage Definition Files

A lineage-indexed metadata field (generateLineageIndex in the database config) requires a YAML file describing the lineage hierarchy. Multiple lineage systems can be declared via the lineageDefinitionFilenames list in the preprocessing config.

Each top-level key in the YAML is a lineage label. Per label you can specify:

  • parents: a list of parent lineage labels (omit, set to null, or use [] to mark a root).
  • aliases: a list of alternative names for the lineage.

Minimal example:

A:
  aliases:
    - Root
B:
  parents:
    - A
C:
  parents:
    - A
E:
  parents: [B, C]
  aliases:
    - LeafE

SILO verifies that the lineage labels are unique and that the relationships form a directed acyclic graph (no cycles). It makes no further assumptions about the lineage system. See documentation/lineage_definitions.md in the SILO repository for the authoritative spec.
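
The two checks can be approximated before preprocessing with a sketch like the following. It operates on a lineage definition already parsed into a Python dict, so no YAML library is needed here. The function name is hypothetical, and treating aliases as part of the uniqueness check is an assumption. The authoritative rules are in the spec referenced above.

```python
def validate_lineages(definitions):
    """Check name uniqueness and acyclicity; return the root labels.

    definitions maps each label to a dict with optional "parents" and
    "aliases" lists (or to None when the label has no attributes).
    """
    names = set()
    parents_of = {}
    for label, spec in definitions.items():
        spec = spec or {}
        # Dict keys are unique already; this also checks aliases against
        # labels and against each other (assumption, see lead-in).
        for name in [label, *(spec.get("aliases") or [])]:
            if name in names:
                raise ValueError(f"duplicate lineage name: {name}")
            names.add(name)
        parents_of[label] = list(spec.get("parents") or [])

    # Depth-first search: meeting a label that is still being visited
    # means the parent relation contains a cycle.
    state = {}

    def visit(label):
        if state.get(label) == "done":
            return
        if state.get(label) == "visiting":
            raise ValueError(f"cycle detected at lineage: {label}")
        state[label] = "visiting"
        for parent in parents_of.get(label, []):
            visit(parent)
        state[label] = "done"

    for label in parents_of:
        visit(label)
    return sorted(label for label, parents in parents_of.items() if not parents)
```

Applied to the minimal example above, the function returns ["A"], the only label without parents.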

Phylogenetic Tree File

A phylogenetic-tree-indexed metadata field (isPhyloTreeField in the database config) requires a tree file referenced by the phyloTreeFilename preprocessing-config key.

SILO accepts two formats:

All nodes, both internal nodes and leaves, must be uniquely labelled. See documentation/phylogenetic_queries.md in the SILO repository for the authoritative spec.

In addition to building a database from scratch, SILO supports appending new records to an existing database state via the silo append command. See documentation/incremental_preprocessing.md in the SILO repository for details.