# Architecture Guide This document provides a comprehensive overview of Bioquik's internal architecture, design decisions, and extension points. ## System Overview Bioquik is built around a **pipeline architecture** that processes DNA sequences through several stages: 1. **Input Validation** → 2. **Pattern Expansion** → 3. **Parallel Processing** → 4. **Report Generation** Each stage is implemented as a separate module with clear responsibilities. ## Core Components ### 1. FM-Index: The Heart of Pattern Matching The **FM-index** (Full-text index in Minute space) is the core data structure that enables efficient pattern matching. It's a compressed variant of the suffix array that supports: - **Count queries**: How many times does pattern P occur? (O(m) time) - **Locate queries**: Where does pattern P occur? (O(m + k) time) #### How It Works ```python # Simplified example seq = "ACGTACGT$" fm = FMIndex(seq) count = fm.count(b"ACG") # Returns 2 positions = fm.locate(b"ACG") # Returns [0, 4] ``` The FM-index uses: - **Suffix Array**: Built using `pydivsufsort` (fast C implementation) - **BWT (Burrows-Wheeler Transform)**: Permutation of the sequence - **C-table**: Cumulative character counts - **Wavelet Tree**: For efficient rank queries #### Backward Search Algorithm The key to FM-index efficiency is **backward search**: ``` For pattern P = "ACG": 1. Start with range [0, n) covering all suffixes 2. Process 'G': narrow range to suffixes starting with 'G' 3. Process 'C': narrow range to suffixes starting with 'CG' 4. Process 'A': narrow range to suffixes starting with 'ACG' 5. Range size = number of occurrences ``` This is why counting is O(m) - we only need to process each character once. ### 2. Wavelet Tree: Enabling Rank Queries The **Wavelet Tree** is a succinct data structure that answers rank queries efficiently: - **Rank(symbol, i)**: How many times does `symbol` appear in positions [0, i)? This is exactly what backward search needs to narrow the search range. #### Structure ``` Root (bitvector: 0101...) / \ Left (A,C) Right (G,T) / \ / \ A C G T ``` Each internal node stores a bitvector indicating which child each position goes to. ### 3. Parallel Processing Strategy Bioquik uses **process-based parallelism** via `ProcessPoolExecutor`: - Each FASTA file is processed independently - No shared state between workers - Linear speedup with number of CPU cores **Why processes over threads?** - Python's GIL prevents true parallelism with threads - Processes bypass GIL for CPU-bound tasks - Each process has its own memory space (no synchronization needed) ### 4. Pattern Expansion Wildcard patterns like `*CG*` are expanded into concrete motifs: ``` Pattern: *CG* Wildcards: 1 before CG, 1 after CG Expansion: 4 × 4 = 16 motifs ACGA, ACGC, ACGG, ACGT CCGA, CCGC, CCGG, CCGT GCGA, GCGC, GCGG, GCGT TCGA, TCGC, TCGG, TCGT ``` The expansion uses Cartesian product of DNA bases (A, C, G, T). ## Data Flow ### Input Processing ``` FASTA File ↓ Read lines (skip headers starting with '>') ↓ Concatenate sequences ↓ Normalize to uppercase ↓ Build FM-index ``` ### Pattern Matching ``` Pattern List: ["*CG*", "**CG**"] ↓ Expand to Motifs: ["ACGA", "ACGC", ..., "TCGT", ...] ↓ For each motif: - Encode to bytes - Query FM-index - Count occurrences (non-overlapping) ↓ Store results: {Pattern, Motif, Count} ``` ### Output Generation ``` Per-file CSVs ↓ Concatenate (pandas) ↓ Write combined CSV ↓ (Optional) Aggregate to JSON ↓ (Optional) Generate plots ``` ## Memory Management ### FM-Index Memory Usage For a sequence of length `n`: - Suffix array: ~4n bytes (32-bit integers) - BWT: n bytes - C-table: ~16 bytes (4 characters × 4 bytes) - Wavelet tree: ~n log σ bits ≈ n/2 bytes (for DNA) - **Total**: ~5.5n bytes For a 1GB sequence: ~5.5GB memory (reasonable for modern systems) ### Optimization: SA Sampling The suffix array is sampled (every 32nd position) to reduce memory: - **Count queries**: No SA needed (already fast) - **Locate queries**: May need to "walk back" to find sampled position - Trade-off: Slightly slower locate, much less memory ## Performance Characteristics ### Time Complexity | Operation | Complexity | Notes | |-----------|-----------|-------| | FM-index construction | O(n log n) | Via divsufsort | | Pattern counting | O(m) | m = pattern length | | Pattern locating | O(m + k) | k = occurrences | | Per-file processing | O(n log n + m×k) | Construction + queries | | Overall (f files) | O(f × (n log n + m×k)) | Sequential | | Overall (parallel) | O((n log n + m×k)) | With f cores | ### Space Complexity - FM-index: O(n) - Wavelet tree: O(n log σ) = O(n) for DNA - **Total per file**: O(n) ### Parallelization Efficiency - **Ideal speedup**: Linear with number of cores - **Bottlenecks**: - I/O for many small files - Memory for very large files - GIL doesn't affect (using processes) ## Extension Points ### Adding New Pattern Types To support different pattern syntax: ```python # In motifs.py def expand_custom_pattern(pattern: str) -> List[str]: # Your expansion logic pass def build_pattern_to_motifs(patterns: List[str]) -> Dict[str, List[str]]: # Integrate custom expansion ... ``` ### Custom Counting Strategies Modify `_count_in_fm()` in `fasta_worker.py`: ```python def _count_in_fm(fm: FMIndex, motif: bytes, *, strategy: str = "non_overlapping") -> int: if strategy == "overlapping": return fm.count(motif) elif strategy == "non_overlapping": # Current implementation ... elif strategy == "maximal": # New strategy ... ``` ### Alternative Index Structures Implement a new index with the same interface: ```python class MyCustomIndex: def count(self, pattern: bytes) -> int: ... def locate(self, pattern: bytes) -> List[int]: ... # In fasta_worker.py, swap: # fm = FMIndex(seq) fm = MyCustomIndex(seq) ``` ## Design Decisions ### Why FM-Index? - **Memory efficient**: Compressed suffix array - **Fast queries**: O(m) counting, O(m + k) locating - **Mature algorithm**: Well-studied, reliable ### Why Non-Overlapping Counts? - **Biological relevance**: Overlapping motifs may not be independent - **Configurable**: Can enable overlapping via parameter ### Why Per-File CSVs? - **Incremental processing**: Can process files independently - **Debugging**: Easy to inspect individual file results - **Flexibility**: Users can process subsets ### Why Process-Based Parallelism? - **GIL bypass**: True parallelism for CPU-bound tasks - **Isolation**: Process crashes don't affect others - **Scalability**: Linear speedup with cores ## Testing Strategy ### Unit Tests Each module has corresponding test file: - `test_fmindex.py`: FM-index correctness - `test_wavelettree.py`: Wavelet tree queries - `test_motifs.py`: Pattern expansion - `test_processor.py`: Parallel processing - `test_reports.py`: Output generation ### Integration Tests - `test_cli.py`: End-to-end CLI workflows - Test with real FASTA files - Verify output formats ### Performance Tests - Benchmark with various file sizes - Measure parallelization efficiency - Profile memory usage ## Future Enhancements ### Potential Improvements 1. **Index Caching**: Reuse FM-indexes for repeated queries 2. **Streaming**: Process files too large for memory 3. **Distributed**: Scale to multiple machines 4. **GPU Acceleration**: Offload FM-index construction 5. **Compressed Storage**: Store indexes on disk ### Research Directions - Alternative index structures (BWT variants) - Approximate matching support - Multi-pattern batch queries - Real-time streaming queries ## Further Reading - **FM-Index**: Ferragina & Manzini (2000) "Opportunistic Data Structures" - **Wavelet Trees**: Grossi et al. (2003) "High-Order Entropy-Compressed Text Indexes" - **Suffix Arrays**: Manber & Myers (1993) "Suffix Arrays: A New Method for On-Line String Searches"