Architecture Guide
This document provides a comprehensive overview of Bioquik’s internal architecture, design decisions, and extension points.
System Overview
Bioquik is built around a pipeline architecture that processes DNA sequences through several stages:
Input Validation → 2. Pattern Expansion → 3. Parallel Processing → 4. Report Generation
Each stage is implemented as a separate module with clear responsibilities.
Core Components
1. FM-Index: The Heart of Pattern Matching
The FM-index (Full-text index in Minute space) is the core data structure that enables efficient pattern matching. It’s a compressed variant of the suffix array that supports:
Count queries: How many times does pattern P occur? (O(m) time)
Locate queries: Where does pattern P occur? (O(m + k) time)
How It Works
# Simplified example
seq = "ACGTACGT$"
fm = FMIndex(seq)
count = fm.count(b"ACG") # Returns 2
positions = fm.locate(b"ACG") # Returns [0, 4]
The FM-index uses:
Suffix Array: Built using
pydivsufsort(fast C implementation)BWT (Burrows-Wheeler Transform): Permutation of the sequence
C-table: Cumulative character counts
Wavelet Tree: For efficient rank queries
Backward Search Algorithm
The key to FM-index efficiency is backward search:
For pattern P = "ACG":
1. Start with range [0, n) covering all suffixes
2. Process 'G': narrow range to suffixes starting with 'G'
3. Process 'C': narrow range to suffixes starting with 'CG'
4. Process 'A': narrow range to suffixes starting with 'ACG'
5. Range size = number of occurrences
This is why counting is O(m) - we only need to process each character once.
2. Wavelet Tree: Enabling Rank Queries
The Wavelet Tree is a succinct data structure that answers rank queries efficiently:
Rank(symbol, i): How many times does
symbolappear in positions [0, i)?
This is exactly what backward search needs to narrow the search range.
Structure
Root (bitvector: 0101...)
/ \
Left (A,C) Right (G,T)
/ \ / \
A C G T
Each internal node stores a bitvector indicating which child each position goes to.
3. Parallel Processing Strategy
Bioquik uses process-based parallelism via ProcessPoolExecutor:
Each FASTA file is processed independently
No shared state between workers
Linear speedup with number of CPU cores
Why processes over threads?
Python’s GIL prevents true parallelism with threads
Processes bypass GIL for CPU-bound tasks
Each process has its own memory space (no synchronization needed)
4. Pattern Expansion
Wildcard patterns like *CG* are expanded into concrete motifs:
Pattern: *CG*
Wildcards: 1 before CG, 1 after CG
Expansion: 4 × 4 = 16 motifs
ACGA, ACGC, ACGG, ACGT
CCGA, CCGC, CCGG, CCGT
GCGA, GCGC, GCGG, GCGT
TCGA, TCGC, TCGG, TCGT
The expansion uses Cartesian product of DNA bases (A, C, G, T).
Data Flow
Input Processing
FASTA File
↓
Read lines (skip headers starting with '>')
↓
Concatenate sequences
↓
Normalize to uppercase
↓
Build FM-index
Pattern Matching
Pattern List: ["*CG*", "**CG**"]
↓
Expand to Motifs: ["ACGA", "ACGC", ..., "TCGT", ...]
↓
For each motif:
- Encode to bytes
- Query FM-index
- Count occurrences (non-overlapping)
↓
Store results: {Pattern, Motif, Count}
Output Generation
Per-file CSVs
↓
Concatenate (pandas)
↓
Write combined CSV
↓
(Optional) Aggregate to JSON
↓
(Optional) Generate plots
Memory Management
FM-Index Memory Usage
For a sequence of length n:
Suffix array: ~4n bytes (32-bit integers)
BWT: n bytes
C-table: ~16 bytes (4 characters × 4 bytes)
Wavelet tree: ~n log σ bits ≈ n/2 bytes (for DNA)
Total: ~5.5n bytes
For a 1GB sequence: ~5.5GB memory (reasonable for modern systems)
Optimization: SA Sampling
The suffix array is sampled (every 32nd position) to reduce memory:
Count queries: No SA needed (already fast)
Locate queries: May need to “walk back” to find sampled position
Trade-off: Slightly slower locate, much less memory
Performance Characteristics
Time Complexity
Operation |
Complexity |
Notes |
|---|---|---|
FM-index construction |
O(n log n) |
Via divsufsort |
Pattern counting |
O(m) |
m = pattern length |
Pattern locating |
O(m + k) |
k = occurrences |
Per-file processing |
O(n log n + m×k) |
Construction + queries |
Overall (f files) |
O(f × (n log n + m×k)) |
Sequential |
Overall (parallel) |
O((n log n + m×k)) |
With f cores |
Space Complexity
FM-index: O(n)
Wavelet tree: O(n log σ) = O(n) for DNA
Total per file: O(n)
Parallelization Efficiency
Ideal speedup: Linear with number of cores
Bottlenecks:
I/O for many small files
Memory for very large files
GIL doesn’t affect (using processes)
Extension Points
Adding New Pattern Types
To support different pattern syntax:
# In motifs.py
def expand_custom_pattern(pattern: str) -> List[str]:
# Your expansion logic
pass
def build_pattern_to_motifs(patterns: List[str]) -> Dict[str, List[str]]:
# Integrate custom expansion
...
Custom Counting Strategies
Modify _count_in_fm() in fasta_worker.py:
def _count_in_fm(fm: FMIndex, motif: bytes, *,
strategy: str = "non_overlapping") -> int:
if strategy == "overlapping":
return fm.count(motif)
elif strategy == "non_overlapping":
# Current implementation
...
elif strategy == "maximal":
# New strategy
...
Alternative Index Structures
Implement a new index with the same interface:
class MyCustomIndex:
def count(self, pattern: bytes) -> int:
...
def locate(self, pattern: bytes) -> List[int]:
...
# In fasta_worker.py, swap:
# fm = FMIndex(seq)
fm = MyCustomIndex(seq)
Design Decisions
Why FM-Index?
Memory efficient: Compressed suffix array
Fast queries: O(m) counting, O(m + k) locating
Mature algorithm: Well-studied, reliable
Why Non-Overlapping Counts?
Biological relevance: Overlapping motifs may not be independent
Configurable: Can enable overlapping via parameter
Why Per-File CSVs?
Incremental processing: Can process files independently
Debugging: Easy to inspect individual file results
Flexibility: Users can process subsets
Why Process-Based Parallelism?
GIL bypass: True parallelism for CPU-bound tasks
Isolation: Process crashes don’t affect others
Scalability: Linear speedup with cores
Testing Strategy
Unit Tests
Each module has corresponding test file:
test_fmindex.py: FM-index correctnesstest_wavelettree.py: Wavelet tree queriestest_motifs.py: Pattern expansiontest_processor.py: Parallel processingtest_reports.py: Output generation
Integration Tests
test_cli.py: End-to-end CLI workflowsTest with real FASTA files
Verify output formats
Performance Tests
Benchmark with various file sizes
Measure parallelization efficiency
Profile memory usage
Future Enhancements
Potential Improvements
Index Caching: Reuse FM-indexes for repeated queries
Streaming: Process files too large for memory
Distributed: Scale to multiple machines
GPU Acceleration: Offload FM-index construction
Compressed Storage: Store indexes on disk
Research Directions
Alternative index structures (BWT variants)
Approximate matching support
Multi-pattern batch queries
Real-time streaming queries
Further Reading
FM-Index: Ferragina & Manzini (2000) “Opportunistic Data Structures”
Wavelet Trees: Grossi et al. (2003) “High-Order Entropy-Compressed Text Indexes”
Suffix Arrays: Manber & Myers (1993) “Suffix Arrays: A New Method for On-Line String Searches”