Architecture Guide

This document provides a comprehensive overview of Bioquik’s internal architecture, design decisions, and extension points.

System Overview

Bioquik is built around a pipeline architecture that processes DNA sequences through several stages:

  1. Input Validation → 2. Pattern Expansion → 3. Parallel Processing → 4. Report Generation

Each stage is implemented as a separate module with clear responsibilities.

Core Components

1. FM-Index: The Heart of Pattern Matching

The FM-index (Full-text index in Minute space) is the core data structure that enables efficient pattern matching. It’s a compressed variant of the suffix array that supports:

  • Count queries: How many times does pattern P occur? (O(m) time)

  • Locate queries: Where does pattern P occur? (O(m + k) time)

How It Works

# Simplified example
seq = "ACGTACGT$"
fm = FMIndex(seq)
count = fm.count(b"ACG")  # Returns 2
positions = fm.locate(b"ACG")  # Returns [0, 4]

The FM-index uses:

  • Suffix Array: Built using pydivsufsort (fast C implementation)

  • BWT (Burrows-Wheeler Transform): Permutation of the sequence

  • C-table: Cumulative character counts

  • Wavelet Tree: For efficient rank queries

Backward Search Algorithm

The key to FM-index efficiency is backward search:

For pattern P = "ACG":
  1. Start with range [0, n) covering all suffixes
  2. Process 'G': narrow range to suffixes starting with 'G'
  3. Process 'C': narrow range to suffixes starting with 'CG'
  4. Process 'A': narrow range to suffixes starting with 'ACG'
  5. Range size = number of occurrences

This is why counting is O(m) - we only need to process each character once.

2. Wavelet Tree: Enabling Rank Queries

The Wavelet Tree is a succinct data structure that answers rank queries efficiently:

  • Rank(symbol, i): How many times does symbol appear in positions [0, i)?

This is exactly what backward search needs to narrow the search range.

Structure

        Root (bitvector: 0101...)
       /                      \
   Left (A,C)              Right (G,T)
  /         \              /         \
A           C            G           T

Each internal node stores a bitvector indicating which child each position goes to.

3. Parallel Processing Strategy

Bioquik uses process-based parallelism via ProcessPoolExecutor:

  • Each FASTA file is processed independently

  • No shared state between workers

  • Linear speedup with number of CPU cores

Why processes over threads?

  • Python’s GIL prevents true parallelism with threads

  • Processes bypass GIL for CPU-bound tasks

  • Each process has its own memory space (no synchronization needed)

4. Pattern Expansion

Wildcard patterns like *CG* are expanded into concrete motifs:

Pattern: *CG*
Wildcards: 1 before CG, 1 after CG
Expansion: 4 × 4 = 16 motifs
  ACGA, ACGC, ACGG, ACGT
  CCGA, CCGC, CCGG, CCGT
  GCGA, GCGC, GCGG, GCGT
  TCGA, TCGC, TCGG, TCGT

The expansion uses Cartesian product of DNA bases (A, C, G, T).

Data Flow

Input Processing

FASTA File
  ↓
Read lines (skip headers starting with '>')
  ↓
Concatenate sequences
  ↓
Normalize to uppercase
  ↓
Build FM-index

Pattern Matching

Pattern List: ["*CG*", "**CG**"]
  ↓
Expand to Motifs: ["ACGA", "ACGC", ..., "TCGT", ...]
  ↓
For each motif:
  - Encode to bytes
  - Query FM-index
  - Count occurrences (non-overlapping)
  ↓
Store results: {Pattern, Motif, Count}

Output Generation

Per-file CSVs
  ↓
Concatenate (pandas)
  ↓
Write combined CSV
  ↓
(Optional) Aggregate to JSON
  ↓
(Optional) Generate plots

Memory Management

FM-Index Memory Usage

For a sequence of length n:

  • Suffix array: ~4n bytes (32-bit integers)

  • BWT: n bytes

  • C-table: ~16 bytes (4 characters × 4 bytes)

  • Wavelet tree: ~n log σ bits ≈ n/2 bytes (for DNA)

  • Total: ~5.5n bytes

For a 1GB sequence: ~5.5GB memory (reasonable for modern systems)

Optimization: SA Sampling

The suffix array is sampled (every 32nd position) to reduce memory:

  • Count queries: No SA needed (already fast)

  • Locate queries: May need to “walk back” to find sampled position

  • Trade-off: Slightly slower locate, much less memory

Performance Characteristics

Time Complexity

Operation

Complexity

Notes

FM-index construction

O(n log n)

Via divsufsort

Pattern counting

O(m)

m = pattern length

Pattern locating

O(m + k)

k = occurrences

Per-file processing

O(n log n + m×k)

Construction + queries

Overall (f files)

O(f × (n log n + m×k))

Sequential

Overall (parallel)

O((n log n + m×k))

With f cores

Space Complexity

  • FM-index: O(n)

  • Wavelet tree: O(n log σ) = O(n) for DNA

  • Total per file: O(n)

Parallelization Efficiency

  • Ideal speedup: Linear with number of cores

  • Bottlenecks:

    • I/O for many small files

    • Memory for very large files

    • GIL doesn’t affect (using processes)

Extension Points

Adding New Pattern Types

To support different pattern syntax:

# In motifs.py
def expand_custom_pattern(pattern: str) -> List[str]:
    # Your expansion logic
    pass

def build_pattern_to_motifs(patterns: List[str]) -> Dict[str, List[str]]:
    # Integrate custom expansion
    ...

Custom Counting Strategies

Modify _count_in_fm() in fasta_worker.py:

def _count_in_fm(fm: FMIndex, motif: bytes, *, 
                 strategy: str = "non_overlapping") -> int:
    if strategy == "overlapping":
        return fm.count(motif)
    elif strategy == "non_overlapping":
        # Current implementation
        ...
    elif strategy == "maximal":
        # New strategy
        ...

Alternative Index Structures

Implement a new index with the same interface:

class MyCustomIndex:
    def count(self, pattern: bytes) -> int:
        ...
    def locate(self, pattern: bytes) -> List[int]:
        ...

# In fasta_worker.py, swap:
# fm = FMIndex(seq)
fm = MyCustomIndex(seq)

Design Decisions

Why FM-Index?

  • Memory efficient: Compressed suffix array

  • Fast queries: O(m) counting, O(m + k) locating

  • Mature algorithm: Well-studied, reliable

Why Non-Overlapping Counts?

  • Biological relevance: Overlapping motifs may not be independent

  • Configurable: Can enable overlapping via parameter

Why Per-File CSVs?

  • Incremental processing: Can process files independently

  • Debugging: Easy to inspect individual file results

  • Flexibility: Users can process subsets

Why Process-Based Parallelism?

  • GIL bypass: True parallelism for CPU-bound tasks

  • Isolation: Process crashes don’t affect others

  • Scalability: Linear speedup with cores

Testing Strategy

Unit Tests

Each module has corresponding test file:

  • test_fmindex.py: FM-index correctness

  • test_wavelettree.py: Wavelet tree queries

  • test_motifs.py: Pattern expansion

  • test_processor.py: Parallel processing

  • test_reports.py: Output generation

Integration Tests

  • test_cli.py: End-to-end CLI workflows

  • Test with real FASTA files

  • Verify output formats

Performance Tests

  • Benchmark with various file sizes

  • Measure parallelization efficiency

  • Profile memory usage

Future Enhancements

Potential Improvements

  1. Index Caching: Reuse FM-indexes for repeated queries

  2. Streaming: Process files too large for memory

  3. Distributed: Scale to multiple machines

  4. GPU Acceleration: Offload FM-index construction

  5. Compressed Storage: Store indexes on disk

Research Directions

  • Alternative index structures (BWT variants)

  • Approximate matching support

  • Multi-pattern batch queries

  • Real-time streaming queries

Further Reading

  • FM-Index: Ferragina & Manzini (2000) “Opportunistic Data Structures”

  • Wavelet Trees: Grossi et al. (2003) “High-Order Entropy-Compressed Text Indexes”

  • Suffix Arrays: Manber & Myers (1993) “Suffix Arrays: A New Method for On-Line String Searches”