What This Investigation Covers
RAG pipelines have multiple independently tunable components — chunking strategy, retrieval method, reranker, context window size — and it’s tempting to tune them one at a time. But factors interact: dense retrieval may outperform BM25 only when paired with semantic chunking, while with fixed-size chunks it shows no advantage. A one-factor-at-a-time analysis misses this. Factorial evaluation with a mixed model captures main effects and interactions simultaneously, accounts for per-question variation in difficulty, and applies the right multiple-comparison correction across all pairwise tests. If your eval dataset already records which configuration produced each row, you’re one analyze_factorial() call away from a rigorous answer.
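To see why one-factor-at-a-time tuning can hide this, here is a toy illustration. The cell means are made up purely to mirror the dense-plus-semantic interaction described above; nothing here is a real benchmark number:

```python
import numpy as np

# Made-up cell means: rows = chunker (fixed, semantic),
# columns = retrieval (bm25, dense).
means = np.array([[0.60, 0.60],    # fixed-size chunks: dense shows no lift
                  [0.60, 0.68]])   # semantic chunks: dense adds +0.08

# One-factor-at-a-time view: marginal means average over the other factor,
# so dense-vs-bm25 looks like a diluted +0.04 everywhere...
print("marginal retrieval means:", means.mean(axis=0))      # [0.60, 0.64]

# ...while the factorial view keeps the cells apart and exposes that the
# entire lift lives in one combination.
print("dense lift by chunker:", means[:, 1] - means[:, 0])  # [0.00, 0.08]
```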
What you’ll learn
Structuring your dataset for factorial analysis: How to format a tagged RAG eval dataset — one row per (question, configuration) with the factor columns and a score column — so it can be passed directly to analyze_factorial(). The first sketch after this list shows the shape.
What the mixed model is doing: How fitting score ~ chunker * retrieval + (1|question) separates configuration effects from question-difficulty effects, and why this yields tighter CIs than a model that ignores the random intercept. The second sketch after this list fits this model directly.
Reading main effects vs. interactions: How to interpret the Wald χ² test for each factor and for their interaction: a significant interaction means the best chunker depends on which retrieval method you use, and you can’t report a single “best chunker” without qualification. The second sketch after this list also runs this test.
Reporting the winning configuration honestly: How to present estimated marginal means, pairwise CIs, and Holm-corrected p-values across all combinations — and how to flag when no combination is statistically distinguishable from the others. The third sketch after this list walks through the mechanics.
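To make the dataset format concrete, here is a minimal sketch built with pandas. The scores are simulated: the per-question `difficulty` offset and the `lift` for the semantic + dense cell are illustrative assumptions, chosen only so the later sketches have an interaction to find.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

rows = []
for q in [f"q{i:02d}" for i in range(40)]:   # 40 eval questions
    difficulty = rng.normal(0, 0.10)         # shared across this question's configs
    for chunker in ["fixed", "semantic"]:
        for retrieval in ["bm25", "dense"]:
            # Illustrative interaction: dense retrieval only lifts scores
            # when paired with semantic chunking.
            lift = 0.08 if (chunker == "semantic" and retrieval == "dense") else 0.0
            rows.append({
                "question": q,
                "chunker": chunker,
                "retrieval": retrieval,
                "score": 0.60 + difficulty + lift + rng.normal(0, 0.05),
            })

df = pd.DataFrame(rows)
print(df.head())   # one row per (question, configuration); factors are plain columns
```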
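The mixed model itself can be sketched directly with statsmodels. This is one plausible way to fit the formula above, not necessarily how analyze_factorial() implements it: `mixedlm` supplies the `(1|question)` random intercept via its `groups` argument, and `wald_test` probes the interaction coefficient.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Rebuild the simulated dataset from the previous sketch.
rng = np.random.default_rng(42)
rows = []
for q in [f"q{i:02d}" for i in range(40)]:
    difficulty = rng.normal(0, 0.10)
    for chunker in ["fixed", "semantic"]:
        for retrieval in ["bm25", "dense"]:
            lift = 0.08 if (chunker == "semantic" and retrieval == "dense") else 0.0
            rows.append({"question": q, "chunker": chunker,
                         "retrieval": retrieval,
                         "score": 0.60 + difficulty + lift + rng.normal(0, 0.05)})
df = pd.DataFrame(rows)

# Fixed effects chunker * retrieval; the groups argument is statsmodels'
# equivalent of the (1|question) term in the lme4-style formula.
model = smf.mixedlm("score ~ chunker * retrieval", data=df, groups=df["question"])
result = model.fit()
print(result.summary())

# Wald test of the interaction coefficient. The term name below is the one
# patsy generates for these factor levels; check result.params.index on
# your own data before copying it.
print(result.wald_test("chunker[T.semantic]:retrieval[T.dense] = 0"))
```

If the interaction test is significant, report the per-cell results rather than a single best level per factor; the main-effect rows in the summary are not interpretable on their own in that case.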
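Finally, a simplified stand-in for the reporting step: cell means per configuration (which coincide with estimated marginal means when the design is balanced), all pairwise comparisons as paired tests across questions, and Holm correction via `multipletests`. A full analysis would derive the pairwise contrasts and their CIs from the fitted mixed model; this sketch only shows the mechanics.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Rebuild the simulated dataset from the first sketch.
rng = np.random.default_rng(42)
rows = []
for q in [f"q{i:02d}" for i in range(40)]:
    difficulty = rng.normal(0, 0.10)
    for chunker in ["fixed", "semantic"]:
        for retrieval in ["bm25", "dense"]:
            lift = 0.08 if (chunker == "semantic" and retrieval == "dense") else 0.0
            rows.append({"question": q, "chunker": chunker,
                         "retrieval": retrieval,
                         "score": 0.60 + difficulty + lift + rng.normal(0, 0.05)})
df = pd.DataFrame(rows)

# Cell means per configuration; with a balanced design these equal the EMMs.
print(df.groupby(["chunker", "retrieval"])["score"].mean(), "\n")

# One column per configuration, one row per question, so comparisons are paired.
wide = df.pivot_table(index="question", columns=["chunker", "retrieval"], values="score")

pairs = list(combinations(wide.columns, 2))
raw_p, cis = [], []
for a, b in pairs:
    diff = wide[a] - wide[b]
    raw_p.append(stats.ttest_rel(wide[a], wide[b]).pvalue)
    # Per-pair 95% CI on the mean difference (not multiplicity-adjusted;
    # Holm below adjusts only the p-values).
    half = stats.t.ppf(0.975, len(diff) - 1) * diff.sem()
    cis.append((diff.mean() - half, diff.mean() + half))

reject, holm_p, _, _ = multipletests(raw_p, method="holm")

for (a, b), (lo, hi), p, sig in zip(pairs, cis, holm_p, reject):
    verdict = "distinguishable" if sig else "NOT distinguishable"
    print(f"{a} vs {b}: diff CI [{lo:+.3f}, {hi:+.3f}], Holm p = {p:.4f} -> {verdict}")
```

When every `verdict` comes back NOT distinguishable, that is the honest headline: no configuration can be declared the winner on this dataset.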
Looking for quick usage examples? Check out the Example Usage page.
Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.