
Data Formats

This page describes the file formats consumed by fastdpplot.


Binary matrix (.bin)

What is it?

A .bin file is a flat, row-major, little-endian binary array of numeric values representing a dynamic-programming or scoring matrix.

Each element at position (row, col) corresponds to the alignment score (or match indicator) between query residue row and subject residue col.
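Because the layout is flat and row-major, the byte offset of any element follows directly from its coordinates. The sketch below illustrates this for an i32 matrix; `element_offset` and `read_i32` are hypothetical helper names used for illustration, not part of fastdpplot.

```python
import struct

def element_offset(row: int, col: int, cols: int, element_size: int) -> int:
    """Byte offset of (row, col) in a flat, row-major matrix file."""
    return (row * cols + col) * element_size

def read_i32(path: str, row: int, col: int, cols: int) -> int:
    """Read one little-endian int32 element without loading the whole file."""
    with open(path, "rb") as fh:
        fh.seek(element_offset(row, col, cols, 4))
        return struct.unpack("<i", fh.read(4))[0]
```

For a matrix with 3 columns, element (1, 0) of an i32 file starts at byte (1 × 3 + 0) × 4 = 12.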

File size formula

file_size = rows × cols × element_size_in_bytes

where rows = query_length and cols = subject_length.
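As a quick sanity check before plotting, the formula can be evaluated against the size of the file on disk. This is a minimal sketch; `expected_file_size` and `check_file_size` are hypothetical helpers, and the element sizes are the ones listed under "Supported element types".

```python
import os

# Element sizes in bytes, as listed under "Supported element types".
ELEMENT_SIZES = {"u8": 1, "i16": 2, "i32": 4, "f32": 4, "f64": 8}

def expected_file_size(query_length: int, subject_length: int, dtype: str) -> int:
    """rows × cols × element_size_in_bytes, with rows = query and cols = subject."""
    return query_length * subject_length * ELEMENT_SIZES[dtype]

def check_file_size(path: str, query_length: int, subject_length: int, dtype: str) -> bool:
    """True when the file on disk has exactly the expected size."""
    return os.path.getsize(path) == expected_file_size(query_length, subject_length, dtype)
```

For example, a 1,000 × 2,000 matrix stored as f32 should occupy 1,000 × 2,000 × 4 = 8,000,000 bytes.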

Supported element types

| --dtype flag | C type | Bytes | Notes |
|---|---|---|---|
| u8 | uint8_t | 1 | Counts or presence/absence (0–255). |
| i16 | int16_t | 2 | Signed 16-bit alignment scores. |
| i32 | int32_t | 4 | Signed 32-bit alignment scores. |
| f32 | float | 4 | Single-precision floating-point scores. |
| f64 | double | 8 | Double-precision floating-point scores. |
| auto | | | Inferred from file size (tries u8 → i16 → i32 → f32 → f64). |

Dtype inference order

When --dtype auto is used, fastdpplot tries dtypes in the order u8 → i16 → i32 → f32 → f64 and picks the first match. If the file size is consistent with more than one dtype, the earliest one in this order is chosen. In particular, i32 and f32 are both 4 bytes wide, so file size alone cannot tell them apart and i32 comes first. Pass --dtype explicitly to avoid ambiguity.
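The inference order can be sketched as follows. This assumes that a "match" means the file size equals rows × cols × element size exactly; the precise rule fastdpplot applies is not specified here, and `infer_dtype` is a hypothetical name.

```python
# (dtype, element size in bytes) in the documented inference order.
INFERENCE_ORDER = [("u8", 1), ("i16", 2), ("i32", 4), ("f32", 4), ("f64", 8)]

def infer_dtype(file_size: int, rows: int, cols: int):
    """Return the first dtype whose expected file size matches, or None."""
    for name, size in INFERENCE_ORDER:
        # Assumption: a match is an exact size equality.
        if rows * cols * size == file_size:
            return name
    return None

# A 10 × 20 matrix stored in 800 bytes has 4-byte elements.
print(infer_dtype(800, 10, 20))  # i32, because it precedes f32 in the order
```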

Endianness

fastdpplot always reads multi-byte types as little-endian. If your tool writes big-endian matrices (uncommon on x86/ARM), you must byte-swap the file before loading.
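One way to byte-swap an existing big-endian file is to read it with an explicit big-endian dtype and write it back as little-endian. This is a sketch using NumPy; `to_little_endian` is a hypothetical helper, and the element type of the input file must be known in advance.

```python
import numpy as np

def to_little_endian(src: str, dst: str, big_endian_dtype: str = ">i4") -> None:
    """Rewrite a big-endian flat matrix file as little-endian."""
    data = np.fromfile(src, dtype=big_endian_dtype)
    # newbyteorder("<") yields the same element type with little-endian layout;
    # astype converts the values into that layout before writing.
    data.astype(data.dtype.newbyteorder("<")).tofile(dst)
```

Use ">i2", ">i4", ">f4", or ">f8" to match the element type of the source file. Single-byte u8 data has no byte order and needs no conversion.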

Generating a .bin file (Python example)

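import numpy as np

# Example dimensions; in practice use your query and subject sequence lengths
query_len, subject_len = 120, 150

# Create a random u8 matrix (query_len × subject_len)
# Note: the data/ directory must already exist
matrix = np.random.randint(0, 256, (query_len, subject_len), dtype=np.uint8)
matrix.tofile("data/dp_matrix.bin")

# Float32 matrix (an alternative; this overwrites the file written above)
matrix = np.zeros((query_len, subject_len), dtype=np.float32)
# … fill in your DP scores …
matrix.tofile("data/dp_matrix.bin")  # numpy uses native byte order; ensure little-endian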

Numpy byte order

ndarray.tofile writes the array's bytes as they are laid out in memory, so arrays with a native dtype come out little-endian on little-endian systems (x86 and most ARM). On big-endian systems, convert first:

matrix.astype(matrix.dtype.newbyteorder('<')).tofile("data/dp_matrix.bin")


Value normalisation

After loading, fastdpplot applies min-max normalisation:

normalised = (value - min) / (max - min)

All rendered dot values are therefore in [0, 1] regardless of the original dtype. The --threshold flag hides points whose normalised value falls below the given threshold.
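The normalisation and thresholding steps can be reproduced in NumPy to preview which points survive a given threshold. This is a sketch, not fastdpplot's own code: the function names are hypothetical, and the handling of a constant matrix (where max equals min) is an assumption, since that case is not described here.

```python
import numpy as np

def min_max_normalise(matrix: np.ndarray) -> np.ndarray:
    """Scale values to [0, 1] via (value - min) / (max - min)."""
    values = matrix.astype(np.float64)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Assumption: a constant matrix maps to all zeros to avoid 0/0.
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

def apply_threshold(normalised: np.ndarray, threshold: float) -> np.ndarray:
    """Boolean mask of the points kept (normalised value >= threshold)."""
    return normalised >= threshold
```

For scores 10, 20, 30, and 50, the normalised values are 0, 0.25, 0.5, and 1, so a threshold of 0.5 keeps only the last two.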


FASTA files (.fasta / .fa)

fastdpplot uses the FASTA files only to:

  1. Read the sequence length (to validate matrix dimensions).
  2. Derive axis labels from the sequence ID and description.

The actual nucleotide / amino-acid sequence is not used for rendering.

Accepted FASTA syntax

>sequence_id optional description text here
ACGTACGTACGT
ACGTACGT

| Rule | Detail |
|---|---|
| Header line | Starts with >. The ID is everything up to the first whitespace. The rest is the description. |
| Sequence lines | All non-header, non-comment lines are concatenated and uppercased. |
| Comment lines | Lines starting with ; are skipped. |
| Multiple records | All records are parsed; only the first is used by the pipeline. |
| Empty records | A header with no following sequence raises FastDpError::EmptySequence. |
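The rules above can be sketched as a small Python parser, which is handy for checking a sequence length against the matrix dimensions before running fastdpplot. This is an illustration rather than the tool's actual parser: `parse_fasta` is a hypothetical name, ValueError stands in for FastDpError::EmptySequence, and skipping blank lines and ignoring text before the first header are assumptions.

```python
from typing import List, Tuple

def parse_fasta(text: str) -> List[Tuple[str, str, str]]:
    """Return (id, description, sequence) for every record in `text`."""
    records = []
    current = None  # [id, description, list of sequence chunks]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # blank lines (assumption) and ';' comments are skipped
        if line.startswith(">"):
            # ID is everything up to the first whitespace; the rest is the description.
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            current = [seq_id, description, []]
            records.append(current)
        elif current is not None:
            current[2].append(line)

    parsed = []
    for seq_id, description, chunks in records:
        sequence = "".join(chunks).upper()
        if not sequence:
            # Stand-in for FastDpError::EmptySequence.
            raise ValueError(f"empty sequence for record {seq_id!r}")
        parsed.append((seq_id, description, sequence))
    return parsed
```

Since only the first record is used by the pipeline, `len(parse_fasta(text)[0][2])` gives the length that must match the corresponding matrix dimension.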

Axis label format

The axis label shown on the dotplot is formatted as:

{id} | {description}    (truncated to 80 characters with … suffix)

or just {id} when the description is empty.
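The label format can be sketched as below. Two details are assumptions, because the description above leaves them open: that the … suffix counts towards the 80-character limit, and that an ID-only label is truncated the same way. `axis_label` is a hypothetical name.

```python
def axis_label(seq_id: str, description: str, max_chars: int = 80) -> str:
    """Build '{id} | {description}', or just '{id}' when the description is empty."""
    label = f"{seq_id} | {description}" if description else seq_id
    if len(label) > max_chars:
        # Assumption: the "…" suffix is included in the character limit.
        label = label[: max_chars - 1] + "…"
    return label

print(axis_label("NM_001234.5", "Homo sapiens BRCA1 mRNA, complete cds"))
# NM_001234.5 | Homo sapiens BRCA1 mRNA, complete cds
```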

Example FASTA file

>NM_001234.5 Homo sapiens BRCA1 mRNA, complete cds
ATGGATTTCCTTGTTGCTTTTTTTAGCCTTGTGTGTAATGAGTACAGCATGTTTCAGCC
TCTTTTTCTGTTAATTCAAAGATTAAAGTGAAATTTAACCAAAAGCCAGTGATGAAAAAC

Parquet files (.parquet)

Parquet files must contain at least two numeric columns, one named x (or X) and one named y (or Y). An optional value column may be named value, identity, or score.

Column types must be Float32 or Float64.

See Rust API — io for details.


Delimited text files

Any tab- or space-separated file with numeric X and Y columns is accepted. Lines starting with # are treated as comments. A non-numeric first row is treated as a header and skipped automatically.

See CLI Reference for column mapping flags.
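The parsing rules for delimited text can be sketched as follows. This is an illustration only: `read_xy` is a hypothetical name, and it assumes X and Y are the first two columns, whereas the real column mapping is controlled by the CLI flags.

```python
def read_xy(text: str):
    """Parse tab- or space-separated X/Y pairs from `text`."""
    points = []
    first_data_row = True
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines (assumption) and '#' comments are skipped
        fields = line.split()  # splits on tabs and on runs of spaces
        try:
            # Assumption: X and Y are the first two columns.
            x, y = float(fields[0]), float(fields[1])
        except (ValueError, IndexError):
            if first_data_row:
                first_data_row = False
                continue  # a non-numeric first row is treated as a header
            raise ValueError(f"non-numeric row: {raw!r}")
        first_data_row = False
        points.append((x, y))
    return points
```

A file with a comment line, a header row `x y`, and the rows `1 2` and `3.5 4` yields the points (1.0, 2.0) and (3.5, 4.0).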