Architecture Guide

This document provides a comprehensive overview of Bioquik’s internal architecture, design decisions, and extension points.

System Overview

Bioquik is built around a pipeline architecture that processes DNA sequences through several stages:

Input Validation → 2. Pattern Expansion → 3. Parallel Processing → 4. Report Generation

Each stage is implemented as a separate module with clear responsibilities.

Core Components

1. FM-Index: The Heart of Pattern Matching

The FM-index (Full-text index in Minute space) is the core data structure that enables efficient pattern matching. It’s a compressed variant of the suffix array that supports:

Count queries: How many times does pattern P occur? (O(m) time)
Locate queries: Where does pattern P occur? (O(m + k) time)

How It Works

# Simplified example
seq = "ACGTACGT$"
fm = FMIndex(seq)
count = fm.count(b"ACG")  # Returns 2
positions = fm.locate(b"ACG")  # Returns [0, 4]

The FM-index uses:

Suffix Array: Built using pydivsufsort (fast C implementation)
BWT (Burrows-Wheeler Transform): Permutation of the sequence
C-table: Cumulative character counts
Wavelet Tree: For efficient rank queries

Backward Search Algorithm

The key to FM-index efficiency is backward search:

For pattern P = "ACG":
Start with range [0, n) covering all suffixes
Process 'G': narrow range to suffixes starting with 'G'
Process 'C': narrow range to suffixes starting with 'CG'
Process 'A': narrow range to suffixes starting with 'ACG'
Range size = number of occurrences

This is why counting is O(m) - we only need to process each character once.

2. Wavelet Tree: Enabling Rank Queries

The Wavelet Tree is a succinct data structure that answers rank queries efficiently:

Rank(symbol, i): How many times does symbol appear in positions [0, i)?

This is exactly what backward search needs to narrow the search range.

Structure

        Root (bitvector: 0101...)
       /                      \
   Left (A,C)              Right (G,T)
  /         \              /         \
A           C            G           T

Each internal node stores a bitvector indicating which child each position goes to.

3. Parallel Processing Strategy

Bioquik uses process-based parallelism via ProcessPoolExecutor:

Each FASTA file is processed independently
No shared state between workers
Linear speedup with number of CPU cores

Why processes over threads?

Python’s GIL prevents true parallelism with threads
Processes bypass GIL for CPU-bound tasks
Each process has its own memory space (no synchronization needed)

4. Pattern Expansion

Wildcard patterns like *CG* are expanded into concrete motifs:

Pattern: *CG*
Wildcards: 1 before CG, 1 after CG
Expansion: 4 × 4 = 16 motifs
  ACGA, ACGC, ACGG, ACGT
  CCGA, CCGC, CCGG, CCGT
  GCGA, GCGC, GCGG, GCGT
  TCGA, TCGC, TCGG, TCGT

The expansion uses Cartesian product of DNA bases (A, C, G, T).

Data Flow

Input Processing

FASTA File
  ↓
Read lines (skip headers starting with '>')
  ↓
Concatenate sequences
  ↓
Normalize to uppercase
  ↓
Build FM-index

Pattern Matching

Pattern List: ["*CG*", "**CG**"]
  ↓
Expand to Motifs: ["ACGA", "ACGC", ..., "TCGT", ...]
  ↓
For each motif:
  - Encode to bytes
  - Query FM-index
  - Count occurrences (non-overlapping)
  ↓
Store results: {Pattern, Motif, Count}

Output Generation

Per-file CSVs
  ↓
Concatenate (pandas)
  ↓
Write combined CSV
  ↓
(Optional) Aggregate to JSON
  ↓
(Optional) Generate plots

Memory Management

FM-Index Memory Usage

For a sequence of length n:

Suffix array: ~4n bytes (32-bit integers)
BWT: n bytes
C-table: ~16 bytes (4 characters × 4 bytes)
Wavelet tree: ~n log σ bits ≈ n/2 bytes (for DNA)
Total: ~5.5n bytes

For a 1GB sequence: ~5.5GB memory (reasonable for modern systems)

Optimization: SA Sampling

The suffix array is sampled (every 32nd position) to reduce memory:

Count queries: No SA needed (already fast)
Locate queries: May need to “walk back” to find sampled position
Trade-off: Slightly slower locate, much less memory

Performance Characteristics

Time Complexity

Operation	Complexity	Notes
FM-index construction	O(n log n)	Via divsufsort
Pattern counting	O(m)	m = pattern length
Pattern locating	O(m + k)	k = occurrences
Per-file processing	O(n log n + m×k)	Construction + queries
Overall (f files)	O(f × (n log n + m×k))	Sequential
Overall (parallel)	O((n log n + m×k))	With f cores

Space Complexity

FM-index: O(n)
Wavelet tree: O(n log σ) = O(n) for DNA
Total per file: O(n)

Parallelization Efficiency

Ideal speedup: Linear with number of cores
Bottlenecks:
- I/O for many small files
- Memory for very large files
- GIL doesn’t affect (using processes)

Extension Points

Adding New Pattern Types

To support different pattern syntax:

# In motifs.py
def expand_custom_pattern(pattern: str) -> List[str]:
    # Your expansion logic
    pass

def build_pattern_to_motifs(patterns: List[str]) -> Dict[str, List[str]]:
    # Integrate custom expansion
    ...

Custom Counting Strategies

Modify _count_in_fm() in fasta_worker.py:

def _count_in_fm(fm: FMIndex, motif: bytes, *, 
                 strategy: str = "non_overlapping") -> int:
    if strategy == "overlapping":
        return fm.count(motif)
    elif strategy == "non_overlapping":
        # Current implementation
        ...
    elif strategy == "maximal":
        # New strategy
        ...

Alternative Index Structures

Implement a new index with the same interface:

class MyCustomIndex:
    def count(self, pattern: bytes) -> int:
        ...
    def locate(self, pattern: bytes) -> List[int]:
        ...

# In fasta_worker.py, swap:
# fm = FMIndex(seq)
fm = MyCustomIndex(seq)

Design Decisions

Why FM-Index?

Memory efficient: Compressed suffix array
Fast queries: O(m) counting, O(m + k) locating
Mature algorithm: Well-studied, reliable

Why Non-Overlapping Counts?

Biological relevance: Overlapping motifs may not be independent
Configurable: Can enable overlapping via parameter

Why Per-File CSVs?

Incremental processing: Can process files independently
Debugging: Easy to inspect individual file results
Flexibility: Users can process subsets

Why Process-Based Parallelism?

GIL bypass: True parallelism for CPU-bound tasks
Isolation: Process crashes don’t affect others
Scalability: Linear speedup with cores

Testing Strategy

Unit Tests

Each module has corresponding test file:

test_fmindex.py: FM-index correctness
test_wavelettree.py: Wavelet tree queries
test_motifs.py: Pattern expansion
test_processor.py: Parallel processing
test_reports.py: Output generation

Integration Tests

test_cli.py: End-to-end CLI workflows
Test with real FASTA files
Verify output formats

Performance Tests

Benchmark with various file sizes
Measure parallelization efficiency
Profile memory usage

Future Enhancements

Potential Improvements

Index Caching: Reuse FM-indexes for repeated queries
Streaming: Process files too large for memory
Distributed: Scale to multiple machines
GPU Acceleration: Offload FM-index construction
Compressed Storage: Store indexes on disk

Research Directions

Alternative index structures (BWT variants)
Approximate matching support
Multi-pattern batch queries
Real-time streaming queries