Data Formats¶
This page describes the file formats consumed by fastdpplot.
Binary matrix (.bin)¶
What is it?¶
A .bin file is a flat, row-major, little-endian binary array of numeric
values representing a dynamic-programming or scoring matrix.
Each element at position (row, col) corresponds to the alignment score (or
match indicator) between query residue row and subject residue col.
File size formula¶
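file_size_bytes = rows × cols × bytes_per_element
where rows = query_length, cols = subject_length, and bytes_per_element is the width of the chosen element type (see the table below).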
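A .bin file holds exactly rows × cols elements, so its size in bytes is that count times the element width. The sketch below computes the expected size as a quick sanity check before loading. The helper name `expected_size` is hypothetical and not part of fastdpplot; the byte widths are the ones listed under Supported element types.

```python
# Byte widths per --dtype value, taken from the table of supported element types.
DTYPE_BYTES = {"u8": 1, "i16": 2, "i32": 4, "f32": 4, "f64": 8}


def expected_size(query_length: int, subject_length: int, dtype: str) -> int:
    """Return the expected .bin size in bytes for a rows x cols matrix."""
    return query_length * subject_length * DTYPE_BYTES[dtype]


print(expected_size(1000, 750, "f32"))  # 3000000
```

Comparing this number with the actual file size (for example via `os.path.getsize`) catches a wrong dtype or mismatched sequence lengths early.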
Supported element types¶
| --dtype flag | C type | Bytes | Notes |
|---|---|---|---|
| u8 | uint8_t | 1 | Counts or presence/absence (0–255). |
| i16 | int16_t | 2 | Signed 16-bit alignment scores. |
| i32 | int32_t | 4 | Signed 32-bit alignment scores. |
| f32 | float | 4 | Single-precision floating-point scores. |
| f64 | double | 8 | Double-precision floating-point scores. |
| auto | n/a | n/a | Inferred from file size (tries U8→I16→I32→F32→F64). |
Dtype inference order
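When --dtype auto is used, fastdpplot tries dtypes in the order
u8 → i16 → i32 → f32 → f64 and picks the first one that matches the file size.
If the file size is consistent with more than one dtype, the first (smallest)
match in that order is chosen. Note that i32 and f32 are both 4 bytes wide, so
they cannot be told apart by size alone. Pass --dtype explicitly to avoid ambiguity.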
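The inference order can be illustrated with a short sketch. It assumes that "match" means the file size equals rows × cols × element width exactly; the real matching logic in fastdpplot may differ, and `infer_dtype` is a hypothetical name used only for illustration.

```python
# Candidate dtypes in the documented inference order, with their byte widths.
INFERENCE_ORDER = [("u8", 1), ("i16", 2), ("i32", 4), ("f32", 4), ("f64", 8)]


def infer_dtype(file_size: int, rows: int, cols: int) -> str:
    """Return the first dtype whose width makes rows x cols fill the file exactly."""
    for name, width in INFERENCE_ORDER:
        if rows * cols * width == file_size:
            return name
    raise ValueError(
        f"{file_size} bytes does not match a {rows}x{cols} matrix of any supported dtype"
    )


print(infer_dtype(12, 3, 4))  # u8
print(infer_dtype(48, 3, 4))  # i32, even if the file actually holds f32 values
```

Under this assumption a 4-byte-per-element file always resolves to i32, which is why float32 matrices are best loaded with an explicit --dtype f32.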
Endianness
fastdpplot always reads multi-byte types as little-endian. If your tool writes big-endian matrices (uncommon on x86/ARM), you must byte-swap the file before loading.
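One way to byte-swap an existing big-endian file is to re-encode it with NumPy. This is a minimal sketch, not a fastdpplot feature: the function name `to_little_endian` and its `kind` parameter are hypothetical, and the caller must know the element type of the source file.

```python
import numpy as np


def to_little_endian(src: str, dst: str, kind: str = "f4") -> None:
    """Rewrite a big-endian matrix file as little-endian.

    kind is a NumPy type code without a byte-order prefix,
    e.g. "i2", "i4", "f4" or "f8".
    """
    big = np.fromfile(src, dtype=">" + kind)  # interpret the bytes as big-endian
    big.astype("<" + kind).tofile(dst)        # re-encode the same values as little-endian
```

Element order is preserved, so the matrix shape does not need to be known for the conversion. Single-byte data (u8) needs no conversion at all.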
Generating a .bin file (Python example)¶
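import numpy as np
# Example dimensions; replace with your real query and subject lengths
query_len, subject_len = 120, 80
# Create a random u8 matrix (query_len × subject_len)
matrix = np.random.randint(0, 256, (query_len, subject_len), dtype=np.uint8)
matrix.tofile("data/dp_matrix.bin")  # tofile always writes in C (row-major) order
# Float32 matrix (this overwrites the u8 file above; use another path to keep both)
matrix = np.zeros((query_len, subject_len), dtype=np.float32)
# … fill in your DP scores …
matrix.tofile("data/dp_matrix.bin")  # tofile writes the array's byte order as-is; ensure little-endian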
Numpy byte order
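On little-endian systems (x86, and ARM in its usual configuration) ndarray.tofile
writes little-endian data for native dtypes. On big-endian systems, convert the
array to an explicit little-endian dtype (for example with astype("<f4")) before
calling tofile.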
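A portable way to write the file is to force a little-endian dtype regardless of the host. The sketch below is one possible approach; the helper name `write_little_endian` is hypothetical.

```python
import numpy as np


def write_little_endian(matrix: np.ndarray, path: str) -> None:
    """Write matrix as a flat row-major file with little-endian elements."""
    # newbyteorder("<") keeps the element type but requests little-endian storage;
    # astype only converts the data if its byte order actually differs.
    little = matrix.astype(matrix.dtype.newbyteorder("<"), copy=False)
    little.tofile(path)
```

On little-endian hosts this is effectively a plain `tofile` call, so it can be used unconditionally.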
Value normalisation¶
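After loading, fastdpplot applies min-max normalisation:
normalised = (value - min) / (max - min)
where min and max are the smallest and largest values in the loaded matrix.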
All rendered dot values are therefore in [0, 1] regardless of the original
dtype. The --threshold flag filters out points below a normalised value.
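The effect of normalisation and thresholding can be reproduced in a few lines of NumPy. This is an illustrative sketch rather than fastdpplot's implementation: the function names are hypothetical, the handling of a constant matrix (max equal to min) is an assumption, and so is the choice to keep values exactly equal to the threshold.

```python
import numpy as np


def normalise(matrix: np.ndarray) -> np.ndarray:
    """Scale values to [0, 1] using (value - min) / (max - min)."""
    values = matrix.astype(np.float64)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Assumption: a constant matrix maps to all zeros.
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)


def points_above(matrix: np.ndarray, threshold: float):
    """Return (row, col, value) for cells whose normalised value is >= threshold."""
    norm = normalise(matrix)
    rows, cols = np.nonzero(norm >= threshold)
    return [(int(r), int(c), float(norm[r, c])) for r, c in zip(rows, cols)]


scores = np.array([[0, 5], [10, 20]], dtype=np.int16)
print(points_above(scores, 0.5))  # [(1, 0, 0.5), (1, 1, 1.0)]
```

Because the threshold applies to normalised values, the same threshold behaves consistently across u8, integer, and floating-point matrices.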
FASTA files (.fasta / .fa)¶
fastdpplot uses the FASTA files only to:
- Read the sequence length (to validate matrix dimensions).
- Derive axis labels from the sequence ID and description.
The actual nucleotide / amino-acid sequence is not used for rendering.
Accepted FASTA syntax¶
| Rule | Detail |
|---|---|
| Header line | Starts with >. The ID is everything up to the first whitespace. The rest is the description. |
| Sequence lines | All non-header, non-comment lines are concatenated and uppercased. |
| Comment lines | Lines starting with ; are skipped. |
| Multiple records | All records are parsed; only the first is used by the pipeline. |
| Empty records | A header with no following sequence raises FastDpError::EmptySequence. |
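The rules in the table can be sketched as a small parser. This is a Python illustration of the documented behaviour, not the Rust implementation: `parse_fasta` is a hypothetical name, `ValueError` stands in for FastDpError::EmptySequence, and skipping blank lines and any sequence text before the first header are assumptions.

```python
def parse_fasta(text: str):
    """Parse FASTA text into a list of (id, description, sequence) tuples."""
    records = []
    current = None  # [id, description, list of sequence chunks]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # blank lines (assumption) and ';' comment lines are skipped
        if line.startswith(">"):
            # ID is everything up to the first whitespace; the rest is the description.
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            current = [seq_id, description, []]
            records.append(current)
        elif current is not None:
            current[2].append(line)

    parsed = []
    for seq_id, description, chunks in records:
        sequence = "".join(chunks).upper()  # concatenate and uppercase
        if not sequence:
            # Stand-in for FastDpError::EmptySequence.
            raise ValueError(f"empty sequence for record {seq_id!r}")
        parsed.append((seq_id, description, sequence))
    return parsed


text = ">NM_001234.5 Homo sapiens BRCA1 mRNA, complete cds\n; a comment\natgg\nATTT\n"
seq_id, description, sequence = parse_fasta(text)[0]  # only the first record is used
print(seq_id, len(sequence))  # NM_001234.5 8
```

The sequence length obtained this way is what has to agree with the matrix dimensions of the .bin file.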
Axis label format¶
The axis label shown on the dotplot is formatted as:
or just {id} when the description is empty.
Example FASTA file¶
>NM_001234.5 Homo sapiens BRCA1 mRNA, complete cds
ATGGATTTCCTTGTTGCTTTTTTTAGCCTTGTGTGTAATGAGTACAGCATGTTTCAGCC
TCTTTTTCTGTTAATTCAAAGATTAAAGTGAAATTTAACCAAAAGCCAGTGATGAAAAAC
Parquet files (.parquet)¶
Parquet files must contain at least two numeric columns named x/X and
y/Y. An optional value column may be named value, identity, or score.
Column types must be Float32 or Float64.
See Rust API — io for details.
Delimited text files¶
Any tab- or space-separated file with numeric X and Y columns is accepted.
Lines starting with # are treated as comments. A non-numeric first row is
treated as a header and skipped automatically.
See CLI Reference for column mapping flags.
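The parsing rules for delimited text can be sketched as follows. This is an illustration only: `parse_points` is a hypothetical name, and the sketch assumes that X and Y are the first two columns, whereas the actual column mapping is controlled by CLI flags.

```python
def parse_points(text: str):
    """Parse tab- or space-separated X/Y rows into a list of (x, y) floats."""
    points = []
    first_row = True
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines (assumption) and '#' comments are skipped
        fields = line.split()  # splits on any run of tabs or spaces
        try:
            x, y = float(fields[0]), float(fields[1])
        except (ValueError, IndexError):
            if first_row:
                first_row = False
                continue  # a non-numeric first row is treated as a header
            raise ValueError(f"malformed row: {raw!r}")
        first_row = False
        points.append((x, y))
    return points


text = "# dotplot points\nx\ty\n1\t2\n3.5 4\n"
print(parse_points(text))  # [(1.0, 2.0), (3.5, 4.0)]
```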