Mathematical Foundations and Implementation Model¶
This page describes the actual mathematics implemented in SequenceAligner, based on the codebase rather than a generic
alignment summary. The implementation combines classical dynamic programming with indexed seed discovery and chaining.
The main algorithmic parts are:
- suffix array construction
- FM-index construction and backward search
- k-mer seed generation
- seed chaining with affine-style gap costs
- affine-gap global alignment
- affine-gap local alignment in candidate windows
- longest common subsequence (LCS)
The code supports both DNA and protein scoring modes through EDNAFULL and BLOSUM62-style substitution tables, respectively.
1. Sequences, notation, and scoring¶
Let
- \(X = x_1x_2\cdots x_m\) be the query sequence
- \(Y = y_1y_2\cdots y_n\) be the target sequence
with lengths \(m = |X|\) and \(n = |Y|\).
The implementation defines a substitution score function \(\sigma(a, b)\), which depends on the selected mode:
- DNA mode: EDNAFULL-derived scoring
- protein mode: BLOSUM62-derived scoring
In the code, this is dispatched through the mode-specific substitution table lookup.
Gap penalties are affine:
- gap open: \(g_o = \texttt{GAP\_OPEN} = -5\)
- gap extend: \(g_e = \texttt{GAP\_EXTEND} = -1\)
so that a gap of length \(L \ge 1\) has score

\[
\gamma(L) = g_o + (L - 1)\,g_e.
\]

These constants are explicitly defined in the code.
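As a quick numerical check, here is a minimal sketch of this formula (assuming the common convention that a gap of length \(L\) pays the open penalty once plus \(L-1\) extensions; verify the exact convention against the DP recurrences in the code):

```python
GAP_OPEN = -5    # g_o: score contribution of the first gap column
GAP_EXTEND = -1  # g_e: score contribution of each additional gap column

def affine_gap_score(length: int) -> int:
    """Affine score of a gap of `length` columns: g_o + (L - 1) * g_e."""
    if length < 1:
        raise ValueError("gap length must be >= 1")
    return GAP_OPEN + (length - 1) * GAP_EXTEND

# A 1-column gap scores -5; a 4-column gap scores -5 + 3 * (-1) = -8.
```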
2. Suffix array construction¶
For a string \(T = t_0t_1\cdots t_{N-1}\), the suffix array is the permutation

\[
\mathrm{SA} = \big(\mathrm{SA}[0], \mathrm{SA}[1], \dots, \mathrm{SA}[N-1]\big)
\]

of \(\{0, \dots, N-1\}\) such that

\[
T_{\mathrm{SA}[0]} < T_{\mathrm{SA}[1]} < \cdots < T_{\mathrm{SA}[N-1]}
\]

in lexicographic order, where \(T_i = t_it_{i+1}\cdots t_{N-1}\) denotes the suffix starting at position \(i\).
The implementation uses a rank-doubling construction. Initially, each suffix starting at position \(i\) is ranked by the character code of \(t_i\). At iteration \(k=1,2,4,\dots\), suffixes are sorted by the pair

\[
\big(r_k(i),\; r_k(i+k)\big),
\]

where \(r_k(i)\) is the current rank of the suffix beginning at \(i\). If \(i+k \ge N\), the second component is treated as \(-1\).
After sorting, new ranks are assigned:

\[
r_{2k}(\mathrm{SA}[0]) = 0,
\]

and for \(j \ge 1\),

\[
r_{2k}(\mathrm{SA}[j]) = r_{2k}(\mathrm{SA}[j-1]) + \big[\, \text{pair}(\mathrm{SA}[j]) \ne \text{pair}(\mathrm{SA}[j-1]) \,\big],
\]

where \(\text{pair}(i)\) denotes the sort key \(\big(r_k(i), r_k(i+k)\big)\).
The process stops once all ranks are unique. This is the standard \(O(N\log N)\) rank-doubling method implemented in the code.
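The rank-doubling loop can be sketched as follows (an illustration, not the code's actual implementation; it leans on Python's comparison sort each round, so a tuned radix-sort variant would be faster):

```python
def build_suffix_array(text: str) -> list[int]:
    """Rank-doubling suffix array construction (sketch)."""
    n = len(text)
    sa = list(range(n))
    rank = [ord(c) for c in text]  # initial ranks: raw character codes
    k = 1
    while True:
        # Sort suffixes by the (rank, rank k positions ahead) pair; -1 past the end.
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        # Re-rank: equal pairs share a rank, otherwise the rank increments.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks unique: done
            break
        k *= 2
    return sa
```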
3. FM-index construction¶
To index the target sequence, the implementation builds an FM-index on

\[
T' = T\,\$,
\]

where \(\$\) is a sentinel smaller than all biological alphabet symbols.
3.1 Burrows-Wheeler transform¶
Given the suffix array \(\mathrm{SA}\), the Burrows-Wheeler transform \(B\) is

\[
B[i] =
\begin{cases}
T'[\mathrm{SA}[i]-1] & \mathrm{SA}[i] > 0,\\
\$ & \mathrm{SA}[i] = 0.
\end{cases}
\]
3.2 C-table¶
For each character \(c\), the FM-index stores

\[
C(c) = \big|\{\, i : T'[i] < c \,\}\big|,
\]

the total number of symbols in \(T'\) that are lexicographically smaller than \(c\).
3.3 Occ table¶
For each character \(c\), the occurrence table stores

\[
\mathrm{Occ}(c, i) = \big|\{\, j < i : B[j] = c \,\}\big|,
\]

that is, the number of occurrences of \(c\) among the first \(i\) characters of \(B\).
3.4 Backward search¶
For a pattern \(P = p_1p_2\cdots p_\ell\), matching intervals are computed right-to-left. Start with

\[
[l, r) = [0, N).
\]

Then for \(i=\ell,\ell-1,\dots,1\),

\[
l \leftarrow C(p_i) + \mathrm{Occ}(p_i, l), \qquad
r \leftarrow C(p_i) + \mathrm{Occ}(p_i, r).
\]

If at any step \(l \ge r\), the pattern does not occur. Otherwise, after the last character, the suffix-array interval \([l,r)\) identifies all matching locations of the pattern in the target sequence. The code then maps that interval back to text coordinates using sa[i].
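The pieces above fit together in a compact sketch. The `fm_index` helper and its naive linear-scan `Occ` are illustrative stand-ins for the code's actual data structures (a real FM-index samples Occ checkpoints instead of rescanning):

```python
def fm_index(t: str, sa: list[int]):
    """Build BWT, C-table, and a naive Occ function for text t (t ends in '$')."""
    # B[i] = t[sa[i]-1], wrapping to the sentinel when sa[i] == 0
    bwt = "".join(t[i - 1] if i > 0 else t[-1] for i in sa)
    # C[c] = number of characters in t strictly smaller than c
    C, total = {}, 0
    for c in sorted(set(t)):
        C[c] = total
        total += t.count(c)
    def occ(c: str, i: int) -> int:
        # occurrences of c among the first i characters of the BWT (naive scan)
        return bwt[:i].count(c)
    return bwt, C, occ

def backward_search(pattern: str, n: int, C, occ):
    """Return the suffix-array interval [l, r) of suffixes prefixed by `pattern`."""
    l, r = 0, n
    for ch in reversed(pattern):
        if ch not in C:
            return (0, 0)
        l = C[ch] + occ(ch, l)
        r = C[ch] + occ(ch, r)
        if l >= r:
            return (0, 0)   # pattern absent
    return (l, r)
```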
4. Seed generation from query k-mers¶
The code generates exact-match seeds by sliding a k-mer window across the query. For a chosen seed length \(k\), define query k-mers

\[
K_i = x_i x_{i+1} \cdots x_{i+k-1}, \qquad 1 \le i \le m-k+1.
\]

For each \(K_i\), FM-index backward search returns all matching target positions

\[
\{\, t : y_t y_{t+1} \cdots y_{t+k-1} = K_i \,\}.
\]

Each match yields a seed

\[
s = (q, t, k),
\]

where

- \(q\) = query start position
- \(t\) = target start position
- \(k\) = seed length

So the raw seed set is

\[
\mathcal{S} = \{\, (i, t, k) : K_i \text{ occurs in } Y \text{ at position } t \,\}.
\]
This is exactly what generate_raw_seeds constructs.
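A sketch of this sweep, assuming a hypothetical `locate(kmer)` callable that wraps backward search plus the sa[i] lookup (the names here are illustrative, not the code's API):

```python
def generate_raw_seeds_sketch(query: str, k: int, locate) -> list[tuple[int, int, int]]:
    """Slide a k-mer window over the query; emit (q_start, t_start, k) seeds."""
    seeds = []
    for q in range(len(query) - k + 1):
        for t in locate(query[q:q + k]):  # all exact target hits of this k-mer
            seeds.append((q, t, k))
    return seeds
```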
5. Seed chaining¶
A seed \(s_i\) has coordinates

\[
s_i = (q_i, t_i, \ell_i),
\]

with query end and target end

\[
q_i^{\mathrm{end}} = q_i + \ell_i - 1, \qquad
t_i^{\mathrm{end}} = t_i + \ell_i - 1.
\]
The seeds are sorted by query position, then target position. The chaining step computes a best monotone chain under overlap, gap, and diagonal-consistency constraints.
5.1 Feasibility constraints¶
A seed \(s_j\) may precede \(s_i\) only if:

- no overlap in query:
  \[
  q_j^{\mathrm{end}} + \delta_{\min} < q_i
  \]
- no overlap in target:
  \[
  t_j^{\mathrm{end}} + \delta_{\min} < t_i
  \]
- bounded inter-seed gaps:
  \[
  0 \le d_q \le \delta_{\max}, \qquad 0 \le d_t \le \delta_{\max},
  \]
  where
  \[
  d_q = q_i - q_j^{\mathrm{end}} - 1, \qquad d_t = t_i - t_j^{\mathrm{end}} - 1
  \]
- diagonal consistency:
  \[
  \left| (q_i - t_i) - (q_j - t_j) \right| \le \Delta_{\max}.
  \]
These correspond to min_diag_gap_val, max_diag_gap_val, and max_offset_dev_val in the code.
5.2 Chain score¶
Each seed contributes its length:

\[
\mathrm{score}(s_i) = \ell_i.
\]

The gap cost between two consecutive seeds is computed independently in query and target using the same affine penalty parameters:

\[
G(d) =
\begin{cases}
0 & d = 0,\\
g_o + (d-1)\,g_e & d \ge 1,
\end{cases}
\]

and the total inter-seed cost is

\[
G(d_q) + G(d_t).
\]

The DP recurrence implemented in find_best_seed_chain is

\[
C_i = \ell_i + \max\Big(0,\; \max_{j\ \text{feasible}} \big\{ C_j + G(d_q) + G(d_t) \big\}\Big).
\]

Equivalently, in the code’s update form, each feasible predecessor \(j\) proposes

\[
C_i \leftarrow \max\big(C_i,\; C_j + \ell_i + G(d_q) + G(d_t)\big).
\]

The best chain score is

\[
C^{*} = \max_i C_i.
\]
Backtracking through the predecessor array reconstructs the ordered chain of anchors.
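A self-contained sketch of this chaining DP (the threshold values and the exact overlap convention are placeholders; in the code these constraints are governed by min_diag_gap_val, max_diag_gap_val, and max_offset_dev_val):

```python
GAP_OPEN, GAP_EXTEND = -5, -1

def gap_cost(d: int) -> int:
    """Affine cost of a d-character inter-seed gap (0 when seeds are adjacent)."""
    return 0 if d == 0 else GAP_OPEN + (d - 1) * GAP_EXTEND

def chain_seeds(seeds, min_gap=0, max_gap=50, max_dev=20):
    """O(n^2) chaining DP over (q, t, k) seeds sorted by (query, target)."""
    seeds = sorted(seeds)
    n = len(seeds)
    score = [s[2] for s in seeds]  # each chain starts with one seed's length
    pred = [-1] * n
    for i, (qi, ti, ki) in enumerate(seeds):
        for j in range(i):
            qj, tj, kj = seeds[j]
            dq = qi - (qj + kj - 1) - 1   # gap after s_j's query end
            dt = ti - (tj + kj - 1) - 1   # gap after s_j's target end
            if dq < min_gap or dt < min_gap:          # overlap / ordering
                continue
            if dq > max_gap or dt > max_gap:          # bounded gaps
                continue
            if abs((qi - ti) - (qj - tj)) > max_dev:  # diagonal consistency
                continue
            cand = score[j] + ki + gap_cost(dq) + gap_cost(dt)
            if cand > score[i]:
                score[i], pred[i] = cand, j
    # Backtrack from the best-scoring seed through the predecessor array.
    best = max(range(n), key=lambda i: score[i])
    chain = []
    while best != -1:
        chain.append(seeds[best])
        best = pred[best]
    return chain[::-1], max(score)
```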
6. Affine-gap global alignment¶
Global alignment is implemented in align_segment_globally using three DP states.
For prefixes \(X_{1:i}\) and \(Y_{1:j}\), define:
- \(S_{i,j}\): best score ending in a match/mismatch state
- \(E_{i,j}\): best score ending with a gap in sequence 1 relative to sequence 2 in the code’s traceback convention
- \(F_{i,j}\): best score ending with a gap in sequence 2 relative to sequence 1
The code uses row-compressed score arrays and a full traceback matrix.
6.1 Initialization¶
The affine boundary conditions are:

\[
S_{0,0} = 0, \qquad E_{0,0} = F_{0,0} = -\infty.
\]

Along the first row (\(j \ge 1\)):

\[
E_{0,j} = g_o + (j-1)\,g_e, \qquad S_{0,j} = F_{0,j} = -\infty.
\]

Along the first column (\(i \ge 1\)):

\[
F_{i,0} = g_o + (i-1)\,g_e, \qquad S_{i,0} = E_{i,0} = -\infty.
\]
6.2 Recurrence¶
For \(i \ge 1, j \ge 1\),

\[
S_{i,j} = \max\{S_{i-1,j-1},\, E_{i-1,j-1},\, F_{i-1,j-1}\} + \sigma(x_i, y_j).
\]

Gap state updates are

\[
E_{i,j} = \max\big(S_{i,j-1} + g_o,\; E_{i,j-1} + g_e\big), \qquad
F_{i,j} = \max\big(S_{i-1,j} + g_o,\; F_{i-1,j} + g_e\big).
\]

Then the best score for each cell is

\[
V_{i,j} = \max\{S_{i,j},\, E_{i,j},\, F_{i,j}\}.
\]

The final global score is

\[
V_{m,n} = \max\{S_{m,n},\, E_{m,n},\, F_{m,n}\}.
\]
6.3 Traceback¶
The traceback matrix stores one of:

- M for a match/mismatch transition
- E or e for gap-open or gap-extend in one direction
- F or f for gap-open or gap-extend in the other direction
Backtracking from \((m,n)\) reconstructs aligned strings \(\tilde X\) and \(\tilde Y\), with gaps represented as -. This is exactly the mechanism implemented by align_segment_globally.
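A score-only Gotoh sketch in the same spirit (row-compressed, as the section notes, but without the traceback matrix; as a mild relaxation of the recurrence above, it opens gaps from the best of the three states at a cell):

```python
def global_affine_score(x: str, y: str, sigma, go=-5, ge=-1) -> float:
    """Affine-gap global alignment score (Gotoh), score only, O(n) memory."""
    NEG = float("-inf")
    m, n = len(x), len(y)
    # V[j] = best of the three states in the previous row; first row is all-gap.
    V_prev = [0.0] + [go + (j - 1) * ge for j in range(1, n + 1)]
    F_col = [NEG] * (n + 1)   # F persists down each column (gap in y)
    for i in range(1, m + 1):
        V_cur = [go + (i - 1) * ge] + [NEG] * n  # first column is all-gap
        E = NEG                                  # E runs along the row (gap in x)
        for j in range(1, n + 1):
            E = max(V_cur[j - 1] + go, E + ge)             # open or extend in x
            F_col[j] = max(V_prev[j] + go, F_col[j] + ge)  # open or extend in y
            S = V_prev[j - 1] + sigma(x[i - 1], y[j - 1])  # match/mismatch
            V_cur[j] = max(S, E, F_col[j])
        V_prev = V_cur
    return V_prev[n]
```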
7. Affine-gap local alignment in windows¶
Local alignment is implemented in perform_sw_in_window. It uses a Smith–Waterman-style local objective with affine
gaps and three score matrices \(S,E,F\). The local alignment is not necessarily run on the full sequences; instead, the
code uses candidate windows derived from seeds/anchors.
7.1 State definition¶
For a query substring \(X'\) and target substring \(Y'\),
- \(S_{i,j}\): best local score ending at \((i,j)\)
- \(E_{i,j}\): best local score ending with a horizontal affine gap state
- \(F_{i,j}\): best local score ending with a vertical affine gap state
7.2 Initialization¶
All local matrices start at zero:

\[
S_{i,0} = S_{0,j} = E_{i,0} = E_{0,j} = F_{i,0} = F_{0,j} = 0.
\]
7.3 Recurrence¶
For \(i,j \ge 1\),

\[
S_{i,j} = \max\{S_{i-1,j-1},\, E_{i-1,j-1},\, F_{i-1,j-1}\} + \sigma(x'_i, y'_j).
\]

Gap states are floored at zero:

\[
E_{i,j} = \max\big(0,\; S_{i,j-1} + g_o,\; E_{i,j-1} + g_e\big), \qquad
F_{i,j} = \max\big(0,\; S_{i-1,j} + g_o,\; F_{i-1,j} + g_e\big).
\]

The local score state is also floored at zero:

\[
S_{i,j} \leftarrow \max(0,\; S_{i,j}).
\]
The best local alignment score is

\[
S^{*} = \max_{i,j} S_{i,j}.
\]
7.4 Local traceback¶
Traceback starts from the cell \((i^{*},j^{*})\) where \(S_{i,j}\) is maximal and stops once the running state reaches 0. This produces the optimal local alignment within that window. The implementation then maps window-relative coordinates back to original sequence coordinates using stored offsets.
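A minimal windowed Smith–Waterman sketch with affine gaps and zero flooring (full matrices for clarity; it returns the best score and its end cell, from which a traceback would start):

```python
def sw_in_window(x: str, y: str, sigma, go=-5, ge=-1):
    """Affine-gap local alignment over a window; returns (best score, end cell)."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # match state, floored at 0
    E = [[0] * (n + 1) for _ in range(m + 1)]  # horizontal gap state
    F = [[0] * (n + 1) for _ in range(m + 1)]  # vertical gap state
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            E[i][j] = max(0, S[i][j - 1] + go, E[i][j - 1] + ge)
            F[i][j] = max(0, S[i - 1][j] + go, F[i - 1][j] + ge)
            diag = max(S[i - 1][j - 1], E[i - 1][j - 1], F[i - 1][j - 1])
            S[i][j] = max(0, diag + sigma(x[i - 1], y[j - 1]))
            if S[i][j] > best:
                best, best_cell = S[i][j], (i, j)
    return best, best_cell
```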
8. Longest common subsequence (LCS)¶
The code also implements exact longest common subsequence using standard dynamic programming in
compute_lcs_for_segment. This is distinct from score-based alignment: it maximizes subsequence length, not
substitution score.
8.1 DP table¶
Let \(L_{i,j}\) be the LCS length of prefixes \(X_{1:i}\) and \(Y_{1:j}\). Then

\[
L_{i,0} = L_{0,j} = 0.
\]

For \(i,j \ge 1\),

\[
L_{i,j} =
\begin{cases}
L_{i-1,j-1} + 1 & x_i = y_j,\\
\max\big(L_{i-1,j},\, L_{i,j-1}\big) & x_i \ne y_j.
\end{cases}
\]
8.2 Backpointer matrix¶
The implementation stores directions:

- D for a diagonal match
- U for up
- L for left
Backtracking reconstructs:
- the LCS string itself
- a gapped representation of both sequences consistent with the chosen traceback
Thus the final outputs are:

- lcs_string
- gapped_seq1
- gapped_seq2

with \(|\texttt{lcs\_string}| = L_{m,n}\) and \(|\texttt{gapped\_seq1}| = |\texttt{gapped\_seq2}|\).
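The DP and traceback together, as a sketch (tie-breaking between U and L is arbitrary, so the recovered LCS string may differ from the code's while having the same length):

```python
def lcs_with_gaps(x: str, y: str):
    """LCS DP with D/U/L backpointers; returns (lcs, gapped_x, gapped_y)."""
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    back = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j], back[i][j] = L[i - 1][j - 1] + 1, "D"
            elif L[i - 1][j] >= L[i][j - 1]:
                L[i][j], back[i][j] = L[i - 1][j], "U"
            else:
                L[i][j], back[i][j] = L[i][j - 1], "L"
    # Traceback reconstructs the LCS and a gapped view of both inputs.
    lcs, gx, gy = [], [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and back[i][j] == "D":
            lcs.append(x[i - 1]); gx.append(x[i - 1]); gy.append(y[j - 1])
            i -= 1; j -= 1
        elif i > 0 and (j == 0 or back[i][j] == "U"):
            gx.append(x[i - 1]); gy.append("-"); i -= 1
        else:
            gx.append("-"); gy.append(y[j - 1]); j -= 1
    return "".join(reversed(lcs)), "".join(reversed(gx)), "".join(reversed(gy))
```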
8.3 Anchored LCS¶
For long inputs, the code optionally uses FM-index-derived anchors first. The full problem is decomposed into inter-anchor segments:

\[
(X^{(0)}, Y^{(0)}),\; (X^{(1)}, Y^{(1)}),\; \dots,\; (X^{(r)}, Y^{(r)}).
\]

LCS is computed on each segment independently, while the anchor substrings themselves are inserted directly into the final result. If anchor \(a_k\) has length \(\ell_k\), then the final total LCS length is
\[
L_{\mathrm{total}} = \sum_{s=0}^{r} L^{(s)} + \sum_{k=1}^{r} \ell_k,
\]
where \(L^{(s)}\) is the LCS length of segment \(s\). This is exactly how reconstruction is performed in the anchored LCS routine.
9. Anchored decomposition of alignment problems¶
For both global and LCS workflows, the code may first find a best seed chain and then split the full sequences into inter-anchor segments.
Suppose the chosen anchor chain is

\[
a_1, a_2, \dots, a_r,
\]

with anchor \(a_k=(q_k,t_k,\ell_k)\).
Define segment boundaries recursively:
- segment 0: from sequence start to just before \(a_1\)
- segment \(k\): between \(a_k\) and \(a_{k+1}\)
- final segment: after \(a_r\)
These segment pairs are aligned independently, then concatenated with the exact anchor substrings themselves. If \((\hat X^{(s)},\hat Y^{(s)})\) is the alignment of segment \(s\), the final reconstructed global-style alignment is

\[
\hat X = \hat X^{(0)}\, A_1\, \hat X^{(1)}\, A_2 \cdots A_r\, \hat X^{(r)}, \qquad
\hat Y = \hat Y^{(0)}\, A_1\, \hat Y^{(1)}\, A_2 \cdots A_r\, \hat Y^{(r)},
\]

where \(A_k\) is the exact anchor substring, copied identically into both aligned outputs.
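The concatenation step can be sketched directly (hypothetical helper name; it assumes the per-segment gapped outputs and the anchor substrings are already available):

```python
def stitch_alignment(segments, anchors):
    """Concatenate per-segment alignments with exact anchor substrings between them.

    segments: list of (gapped_x, gapped_y) pairs, one per inter-anchor segment
              (r + 1 of them, including the leading and trailing segments)
    anchors:  list of anchor substrings A_1..A_r (copied into both outputs)
    """
    gx, gy = [segments[0][0]], [segments[0][1]]
    for A, (sx, sy) in zip(anchors, segments[1:]):
        gx.append(A); gy.append(A)   # anchors match exactly in both sequences
        gx.append(sx); gy.append(sy)
    return "".join(gx), "".join(gy)
```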
10. Matrix output and serialization¶
The code supports exporting DP and traceback-style matrices in both text and binary form.
10.1 Integer DP matrix¶
If \(D \in \mathbb{Z}^{r \times c}\) is a DP matrix, the binary writer stores:
- number of rows
- number of columns
- row-major matrix entries
so the serialized payload is conceptually

\[
r \;\Vert\; c \;\Vert\; D_{0,0}\, D_{0,1} \cdots D_{r-1,c-1},
\]

where \(\Vert\) denotes concatenation of the serialized fields.
10.2 Character matrix¶
For traceback matrices \(T \in \Sigma^{r \times c}\), the character writer stores the same leading dimensions and then the row-major flattened character data, padding rows if needed. This matches the matrix export utilities in the code.
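A sketch of the integer writer, under the assumption of little-endian 32-bit entries (the actual integer width and byte order are implementation details of the code's writer):

```python
import struct

def write_int_matrix(path: str, D: list[list[int]]) -> None:
    """Serialize rows, cols, then the row-major int32 entries of D."""
    r = len(D)
    c = len(D[0]) if D else 0
    with open(path, "wb") as f:
        f.write(struct.pack("<ii", r, c))     # leading dimensions
        for row in D:
            f.write(struct.pack(f"<{c}i", *row))  # one packed row at a time
```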