Data Formats

This page describes the file formats consumed by fastdpplot.

Binary matrix (`.bin`)

What is it?

A .bin file is a flat, row-major, little-endian binary array of numeric values representing a dynamic-programming or scoring matrix.

Each element at position (row, col) corresponds to the alignment score (or match indicator) between query residue row and subject residue col.

File size formula

file_size = rows × cols × element_size_in_bytes

where rows = query_length and cols = subject_length.
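
As a quick sanity check before loading, the formula can be evaluated in a few lines of Python. This is a minimal sketch using only the standard library; the helper names `expected_file_size` and `check_matrix_file` are hypothetical and not part of fastdpplot, and the element sizes are the ones listed under "Supported element types".

```python
import os

# Bytes per element for each --dtype value documented on this page.
ELEMENT_SIZES = {"u8": 1, "i16": 2, "i32": 4, "f32": 4, "f64": 8}

def expected_file_size(rows: int, cols: int, dtype: str) -> int:
    """Return rows × cols × element_size_in_bytes."""
    return rows * cols * ELEMENT_SIZES[dtype]

def check_matrix_file(path: str, rows: int, cols: int, dtype: str) -> bool:
    """True if the file on disk has exactly the expected size."""
    return os.path.getsize(path) == expected_file_size(rows, cols, dtype)

# A 120 × 150 f32 matrix occupies 120 * 150 * 4 bytes.
print(expected_file_size(120, 150, "f32"))  # 72000
```

Here rows and cols would be the query and subject sequence lengths read from the FASTA files.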

Supported element types

| `--dtype` flag | C type | Bytes | Notes |
|---|---|---|---|
| `u8` | `uint8_t` | 1 | Counts or presence/absence (0–255). |
| `i16` | `int16_t` | 2 | Signed 16-bit alignment scores. |
| `i32` | `int32_t` | 4 | Signed 32-bit alignment scores. |
| `f32` | `float` | 4 | Single-precision floating-point scores. |
| `f64` | `double` | 8 | Double-precision floating-point scores. |
| `auto` | n/a | n/a | Inferred from file size (tries u8 → i16 → i32 → f32 → f64). |

Dtype inference order

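When --dtype auto is used, fastdpplot tries dtypes in the order u8 → i16 → i32 → f32 → f64 and picks the first match. If the file size is consistent with more than one dtype, the first match in that order (the one with the smallest element size) is chosen. Note that i32 and f32 have the same element size, so file size alone cannot tell them apart. Pass --dtype explicitly to avoid ambiguity.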
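
The trial order can be illustrated with a short Python sketch. It assumes that a "match" means the file size equals rows × cols × element size exactly; that is an assumption about the matching rule, and `infer_dtype` is a hypothetical name, not fastdpplot's implementation.

```python
from typing import Optional

# Candidate dtypes in the documented trial order, with their sizes in bytes.
INFERENCE_ORDER = [("u8", 1), ("i16", 2), ("i32", 4), ("f32", 4), ("f64", 8)]

def infer_dtype(file_size: int, rows: int, cols: int) -> Optional[str]:
    """Return the first dtype whose element size explains file_size, or None."""
    for name, size in INFERENCE_ORDER:
        if rows * cols * size == file_size:
            return name
    return None

# 72,000 bytes for a 120 × 150 matrix means 4 bytes per element.
# Both i32 and f32 fit, and i32 comes first in the order.
print(infer_dtype(72_000, 120, 150))  # i32
```

Under this rule a float32 matrix is reported as i32, which is why an explicit --dtype f32 matters for 4-byte data.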

Endianness

fastdpplot always reads multi-byte types as little-endian. If your tool writes big-endian matrices (uncommon on x86/ARM), you must byte-swap the file before loading.
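
A byte swap can be done with Python's standard `struct` module. The sketch below handles a big-endian int32 matrix only; the function names are hypothetical, and other element types would need a different format character (for example `h` for int16, `f` for float32, `d` for float64).

```python
import struct

def swap_i32_bytes(raw: bytes) -> bytes:
    """Convert a buffer of big-endian int32 values to little-endian."""
    count = len(raw) // 4                      # number of 4-byte elements
    values = struct.unpack(f">{count}i", raw)  # ">" = big-endian, "i" = int32
    return struct.pack(f"<{count}i", *values)  # "<" = little-endian

def swap_i32_file(src_path: str, dst_path: str) -> None:
    """Rewrite a big-endian int32 matrix file as a little-endian file."""
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(swap_i32_bytes(raw))
```

`struct.unpack` raises `struct.error` if the buffer length is not a multiple of 4, which doubles as a check that the file really holds 4-byte elements.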

Generating a `.bin` file (Python example)

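import numpy as np

# Example dimensions; use your actual query and subject sequence lengths
query_len, subject_len = 120, 150

# Create a random u8 matrix (query_len × subject_len); the upper bound 256 is exclusive
matrix = np.random.randint(0, 256, (query_len, subject_len), dtype=np.uint8)
matrix.tofile("data/dp_matrix.bin")  # the data/ directory must already exist

# Alternatively, a float32 matrix (this overwrites the file written above)
matrix = np.zeros((query_len, subject_len), dtype=np.float32)
# … fill in your DP scores …
matrix.tofile("data/dp_matrix.bin")  # written in the array's byte order; must be little-endian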

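NumPy byte order

ndarray.tofile writes the array's in-memory byte order, which is little-endian by default on little-endian systems (x86, ARM). On big-endian systems, or if the array was created with a big-endian dtype, convert first: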

matrix.astype(matrix.dtype.newbyteorder('<')).tofile("data/dp_matrix.bin")

Value normalisation

After loading, fastdpplot applies min-max normalisation:

normalised = (value - min) / (max - min)

All rendered dot values are therefore in [0, 1] regardless of the original dtype. The --threshold flag filters out points whose normalised value is below the given threshold.
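
The following pure-Python sketch works through the formula and the threshold filter on a handful of values. The function names are hypothetical, and the handling of a constant matrix (where max equals min and the formula would divide by zero) is an assumption; this page does not say what fastdpplot does in that case.

```python
def min_max_normalise(values):
    """Scale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant input maps to all zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def apply_threshold(normalised, threshold):
    """Keep (index, value) pairs whose normalised value is at least threshold."""
    return [(i, v) for i, v in enumerate(normalised) if v >= threshold]

scores = [-4, 0, 6, 16]            # raw signed scores: min = -4, max = 16
norm = min_max_normalise(scores)   # [0.0, 0.2, 0.5, 1.0]
print(apply_threshold(norm, 0.5))  # [(2, 0.5), (3, 1.0)]
```

Because the scaling depends on the minimum and maximum of each matrix, the same threshold can select different raw scores in different files.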

FASTA files (`.fasta` / `.fa`)

fastdpplot uses the FASTA files only to:

- Read the sequence length (to validate matrix dimensions).
- Derive axis labels from the sequence ID and description.

The actual nucleotide / amino-acid sequence is not used for rendering.

Accepted FASTA syntax

>sequence_id optional description text here
ACGTACGTACGT
ACGTACGT

| Rule | Detail |
|---|---|
| Header line | Starts with `>`. The ID is everything up to the first whitespace. The rest is the description. |
| Sequence lines | All non-header, non-comment lines are concatenated and uppercased. |
| Comment lines | Lines starting with `;` are skipped. |
| Multiple records | All records are parsed; only the first is used by the pipeline. |
| Empty records | A header with no following sequence raises `FastDpError::EmptySequence`. |
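
The rules above can be sketched as a small Python parser. This is an illustration only, not fastdpplot's Rust parser: `parse_fasta` is a hypothetical name, `ValueError` stands in for `FastDpError::EmptySequence`, and skipping blank lines is an assumption the rules do not state.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (id, description, sequence) tuples."""
    records = []
    header, chunks = None, []

    def flush():
        # Store the record collected so far, if a header has been seen.
        if header is None:
            return
        sequence = "".join(chunks).upper()
        if not sequence:
            # Stand-in for FastDpError::EmptySequence.
            raise ValueError(f"empty sequence for record {header[0]!r}")
        records.append((header[0], header[1], sequence))

    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # blank lines (assumption) and ';' comments are skipped
        if line.startswith(">"):
            flush()
            parts = line[1:].split(None, 1)  # ID = text up to first whitespace
            seq_id = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            header, chunks = (seq_id, description), []
        else:
            chunks.append(line)
    flush()
    return records

records = parse_fasta(">seq1 example record\n; a comment\nacgt\nACGT\n")
seq_id, description, sequence = records[0]  # only the first record is used
print(seq_id, len(sequence))  # seq1 8
```

The sequence length obtained this way is the value that would be compared against the matrix dimensions.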

Axis label format

The axis label shown on the dotplot is formatted as:

{id} | {description}    (truncated to 80 characters with … suffix)

or just {id} when the description is empty.
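
A minimal Python sketch of this formatting rule follows. `axis_label` is a hypothetical helper, and it assumes the 80-character limit includes the trailing … character; the page does not specify whether the suffix counts toward the limit.

```python
MAX_LABEL_CHARS = 80

def axis_label(seq_id, description=""):
    """Build '{id} | {description}', or just '{id}' when description is empty."""
    label = f"{seq_id} | {description}" if description else seq_id
    if len(label) > MAX_LABEL_CHARS:
        # Assumption: the 80-character limit includes the trailing '…'.
        label = label[: MAX_LABEL_CHARS - 1] + "…"
    return label

print(axis_label("NM_001234.5", "Homo sapiens BRCA1 mRNA, complete cds"))
# NM_001234.5 | Homo sapiens BRCA1 mRNA, complete cds
```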

Example FASTA file

>NM_001234.5 Homo sapiens BRCA1 mRNA, complete cds
ATGGATTTCCTTGTTGCTTTTTTTAGCCTTGTGTGTAATGAGTACAGCATGTTTCAGCC
TCTTTTTCTGTTAATTCAAAGATTAAAGTGAAATTTAACCAAAAGCCAGTGATGAAAAAC

Parquet files (`.parquet`)

Parquet files must contain at least two numeric columns named x and y (or X and Y). An optional value column may be named value, identity, or score.

Column types must be Float32 or Float64.

See Rust API — io for details.

Delimited text files

Any tab- or space-separated file with numeric X and Y columns is accepted. Lines starting with # are treated as comments. A non-numeric first row is treated as a header and skipped automatically.

See CLI Reference for column mapping flags.
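
To make the parsing rules for delimited text concrete, here is a small Python sketch. It is not fastdpplot's reader: `read_xy` is a hypothetical name, and it assumes X and Y are the first two columns, whereas the real tool lets you map columns with CLI flags.

```python
def read_xy(text):
    """Parse whitespace-delimited X/Y rows into a list of (x, y) float pairs."""
    points = []
    first_data_row = True
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line (assumption) or '#' comment
        fields = line.split()  # splits on tabs and spaces alike
        try:
            x, y = float(fields[0]), float(fields[1])
        except (ValueError, IndexError):
            if first_data_row:
                first_data_row = False
                continue  # non-numeric first row: treat it as a header
            raise
        first_data_row = False
        points.append((x, y))
    return points

print(read_xy("# dotplot points\nx\ty\n1\t2\n3 4.5\n"))
# [(1.0, 2.0), (3.0, 4.5)]
```

Only the first non-comment row is allowed to be non-numeric; a malformed row later in the file raises an error instead of being silently dropped.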