Output

Output Format

The output consists of the two aligned sequences printed line by line, with each character color-coded:

  • Green: matched characters
  • Red: gaps
  • Cyan: mismatches

Each output block includes the base range for easier visual indexing.
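
The color scheme and block labels described above can be sketched in a few lines of Python. This is an illustration only, not the program's own code: the ANSI escape sequences, the 60-column block width, the 1-based `[start-end]` label over alignment columns, and the names `classify`, `colorize`, and `format_blocks` are all assumptions made for the example.

```python
# ANSI color codes; the exact escape sequences the program emits are an assumption.
GREEN, RED, CYAN, RESET = "\033[32m", "\033[31m", "\033[36m", "\033[0m"
COLORS = {"match": GREEN, "gap": RED, "mismatch": CYAN}


def classify(a: str, b: str) -> str:
    """Label one alignment column as 'gap', 'match', or 'mismatch'."""
    if a == "-" or b == "-":
        return "gap"
    return "match" if a == b else "mismatch"


def colorize(row: str, other: str) -> str:
    """Wrap each character of `row` in the color of its column."""
    return "".join(
        f"{COLORS[classify(a, b)]}{a}{RESET}" for a, b in zip(row, other)
    )


def format_blocks(seq1: str, seq2: str, width: int = 60) -> list[str]:
    """Split two aligned strings into blocks, each preceded by its range."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    lines = []
    for start in range(0, len(seq1), width):
        a = seq1[start:start + width]
        b = seq2[start:start + width]
        # Hypothetical label: 1-based range of alignment columns in this block.
        lines.append(f"[{start + 1}-{start + len(a)}]")
        lines.append(colorize(a, b))
        lines.append(colorize(b, a))
    return lines


if __name__ == "__main__":
    print("\n".join(format_blocks("ACG-TTGA", "ACGATCGA", width=4)))
```

Because a gap column is colored red in both rows, the gap character and the residue opposite it stand out together.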

FM-Index Anchoring with Proteins

If you see the message "FM-index anchoring unavailable/failed. Falling back to MPI full DP" during protein alignment, this is expected behavior, not a bug.

* **Exact Matches vs. Similarity:** The FM-Index requires exact substring matches (k-mers, typically 5-8 characters) to build anchors. Distant protein sequences (<30% identity) often preserve *chemical similarity* rather than exact identity, meaning they may only share very short exact matches (2-3 amino acids).
* **Smart Fallback:** When no exact match is long enough to serve as a reliable anchor, the program deliberately skips the FM-Index phase. It falls back to computing the full Smith-Waterman or Needleman-Wunsch DP matrix (using BLOSUM62), so the alignment is scored on amino-acid similarity rather than on exact identity.
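* **Why not lower the k-mer size?** With 20 amino acids there are only 20^3 = 8,000 possible 3-mers, so two unrelated proteins a few hundred residues long share roughly ten of them by chance alone. Forcing a tiny k-mer threshold (like `k=3`) would flood the anchoring step with these random seeds, hurting both accuracy and performance.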
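
The trade-off in the points above can be checked with a small Python sketch. It is not the program's anchoring code: `shared_kmers` and `expected_random_hits` are hypothetical helper names, the two peptide strings are made-up toy data, and the chance estimate assumes uniformly distributed residues, which real proteins only approximate.

```python
def shared_kmers(a: str, b: str, k: int) -> set[str]:
    """Return the exact length-k substrings that occur in both sequences."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return kmers_a & kmers_b


def expected_random_hits(len_a: int, len_b: int, k: int, alphabet: int = 20) -> float:
    """Expected number of position pairs with identical k-mers by chance.

    Simplifying assumption: residues are independent and uniformly
    distributed over `alphabet` letters.
    """
    positions_a = max(len_a - k + 1, 0)
    positions_b = max(len_b - k + 1, 0)
    return positions_a * positions_b / alphabet ** k


if __name__ == "__main__":
    # Toy peptides that differ by conservative substitutions (illustrative only).
    a, b = "MKVLAAGIV", "MRVLSAGLV"
    for k in (2, 3, 5):
        print(k, sorted(shared_kmers(a, b, k)))
    # Chance seeds between two unrelated 300-residue proteins.
    for k in (3, 6):
        print(k, expected_random_hits(300, 300, k))
```

In the toy pair only 2-mers are shared, which is the situation that triggers the fallback. The second loop shows how the chance-seed count falls from about eleven at `k=3` to well below one at `k=6` under the uniform assumption.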

Additional Outputs (From Code)

Beyond the alignment itself, the program can also write:

  • DP matrices (text or binary)
  • Traceback matrices
  • LCS sequence output
  • Indexed position ranges

Binary formats:

  • Efficient storage for large matrices
  • Row-major layout with metadata header
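
A row-major matrix file with a metadata header can be sketched as follows. The actual on-disk layout is not documented here, so everything concrete in this Python example is an assumption: the 4-byte `DPMX` magic tag, the little-endian `uint32` row and column counts, the 32-bit signed cell type, and the function names.

```python
import io
import struct

# Hypothetical header: 4-byte magic tag, then rows and cols as little-endian uint32.
HEADER = struct.Struct("<4sII")
MAGIC = b"DPMX"  # made-up tag for this sketch, not the program's real one
CELL_SIZE = 4    # assumed 32-bit signed integer cells


def write_matrix(stream, matrix):
    """Write the header followed by all cells in row-major order."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    stream.write(HEADER.pack(MAGIC, rows, cols))
    for row in matrix:
        if len(row) != cols:
            raise ValueError("all rows must have the same length")
        stream.write(struct.pack(f"<{cols}i", *row))


def read_matrix(stream):
    """Read a matrix written by write_matrix back into a list of lists."""
    magic, rows, cols = HEADER.unpack(stream.read(HEADER.size))
    if magic != MAGIC:
        raise ValueError("unexpected file signature")
    flat = struct.unpack(f"<{rows * cols}i", stream.read(CELL_SIZE * rows * cols))
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


def cell_offset(row: int, col: int, cols: int) -> int:
    """Byte offset of cell (row, col): header size plus row-major index."""
    return HEADER.size + CELL_SIZE * (row * cols + col)


if __name__ == "__main__":
    buf = io.BytesIO()
    write_matrix(buf, [[0, 1, 2], [3, 4, 5]])
    buf.seek(0)
    print(read_matrix(buf))
    # Random access to one cell without loading the whole matrix.
    print(struct.unpack_from("<i", buf.getvalue(), cell_offset(1, 2, 3))[0])
```

Storing the dimensions in the header is what makes `cell_offset` possible: a reader can seek straight to any cell of a large matrix instead of parsing the whole file.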

Future Improvements

  • Support multiple sequence alignment (MSA).
  • Allow multiple FASTA entries.
  • Export alignment results in standard formats (CLUSTAL, Stockholm).
  • Provide a web-based or graphical user interface.