What This Investigation Covers
RAG pipelines have multiple independently tunable components — chunking strategy, retrieval method, reranker, context window size — and it’s tempting to tune them one at a time. But factors interact: dense retrieval may outperform BM25 only when paired with semantic chunking, while with fixed-size chunks it shows no advantage. A one-factor-at-a-time analysis misses this. Factorial evaluation with a mixed model captures main effects and interactions simultaneously, accounts for per-question variation in difficulty, and applies the right multiple-comparison correction across all pairwise tests. If your eval dataset already records which configuration produced each row, you’re one analyze_factorial() call away from a rigorous answer.
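To see why one-factor-at-a-time tuning can hide this, here is a toy illustration. The cell means are made up purely to mirror the dense-plus-semantic interaction described above; nothing here is a real benchmark number:

```python
import numpy as np

# Made-up cell means: rows = chunker (fixed, semantic),
# columns = retrieval (bm25, dense).
means = np.array([[0.60, 0.60],    # fixed-size chunks: dense shows no lift
                  [0.60, 0.68]])   # semantic chunks: dense adds +0.08

# One-factor-at-a-time view: marginal means average over the other factor,
# so dense-vs-bm25 looks like a diluted +0.04 everywhere...
print("marginal retrieval means:", means.mean(axis=0))      # [0.60, 0.64]

# ...while the factorial view keeps the cells apart and exposes that the
# entire lift lives in one combination.
print("dense lift by chunker:", means[:, 1] - means[:, 0])  # [0.00, 0.08]
```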
What you’ll learn
Structuring your dataset for factorial analysis: How to format a tagged RAG eval dataset — one row per (question, configuration) with the factor columns and a score column — so it can be passed directly to analyze_factorial(). The first sketch after this list shows the shape.
What the mixed model is doing: How fitting score ~ chunker * retrieval + (1|question) separates configuration effects from question-difficulty effects, and why this yields tighter CIs than a model that ignores the random intercept. The second sketch after this list fits this model directly.
Reading main effects vs. interactions: How to interpret the Wald χ² test for each factor and for their interaction: a significant interaction means the best chunker depends on which retrieval method you use, and you can’t report a single “best chunker” without qualification. The second sketch after this list also runs this test.
Reporting the winning configuration honestly: How to present estimated marginal means, pairwise CIs, and Holm-corrected p-values across all combinations — and how to flag when no combination is statistically distinguishable from the others. The third sketch after this list walks through the mechanics.
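To make the dataset format concrete, here is a minimal sketch built with pandas. The scores are simulated: the per-question `difficulty` offset and the `lift` for the semantic + dense cell are illustrative assumptions, chosen only so the later sketches have an interaction to find.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

rows = []
for q in [f"q{i:02d}" for i in range(40)]:   # 40 eval questions
    difficulty = rng.normal(0, 0.10)         # shared across this question's configs
    for chunker in ["fixed", "semantic"]:
        for retrieval in ["bm25", "dense"]:
            # Illustrative interaction: dense retrieval only lifts scores
            # when paired with semantic chunking.
            lift = 0.08 if (chunker == "semantic" and retrieval == "dense") else 0.0
            rows.append({
                "question": q,
                "chunker": chunker,
                "retrieval": retrieval,
                "score": 0.60 + difficulty + lift + rng.normal(0, 0.05),
            })

df = pd.DataFrame(rows)
print(df.head())   # one row per (question, configuration); factors are plain columns
```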
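The mixed model itself can be sketched directly with statsmodels. This is one plausible way to fit the formula above, not necessarily how analyze_factorial() implements it: `mixedlm` supplies the `(1|question)` random intercept via its `groups` argument, and `wald_test` probes the interaction coefficient.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Rebuild the simulated dataset from the previous sketch.
rng = np.random.default_rng(42)
rows = []
for q in [f"q{i:02d}" for i in range(40)]:
    difficulty = rng.normal(0, 0.10)
    for chunker in ["fixed", "semantic"]:
        for retrieval in ["bm25", "dense"]:
            lift = 0.08 if (chunker == "semantic" and retrieval == "dense") else 0.0
            rows.append({"question": q, "chunker": chunker,
                         "retrieval": retrieval,
                         "score": 0.60 + difficulty + lift + rng.normal(0, 0.05)})
df = pd.DataFrame(rows)

# Fixed effects chunker * retrieval; the groups argument is statsmodels'
# equivalent of the (1|question) term in the lme4-style formula.
model = smf.mixedlm("score ~ chunker * retrieval", data=df, groups=df["question"])
result = model.fit()
print(result.summary())

# Wald test of the interaction coefficient. The term name below is the one
# patsy generates for these factor levels; check result.params.index on
# your own data before copying it.
print(result.wald_test("chunker[T.semantic]:retrieval[T.dense] = 0"))
```

If the interaction test is significant, report the per-cell results rather than a single best level per factor; the main-effect rows in the summary are not interpretable on their own in that case.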
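Finally, a simplified stand-in for the reporting step: cell means per configuration (which coincide with estimated marginal means when the design is balanced), all pairwise comparisons as paired tests across questions, and Holm correction via `multipletests`. A full analysis would derive the pairwise contrasts and their CIs from the fitted mixed model; this sketch only shows the mechanics.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Rebuild the simulated dataset from the first sketch.
rng = np.random.default_rng(42)
rows = []
for q in [f"q{i:02d}" for i in range(40)]:
    difficulty = rng.normal(0, 0.10)
    for chunker in ["fixed", "semantic"]:
        for retrieval in ["bm25", "dense"]:
            lift = 0.08 if (chunker == "semantic" and retrieval == "dense") else 0.0
            rows.append({"question": q, "chunker": chunker,
                         "retrieval": retrieval,
                         "score": 0.60 + difficulty + lift + rng.normal(0, 0.05)})
df = pd.DataFrame(rows)

# Cell means per configuration; with a balanced design these equal the EMMs.
print(df.groupby(["chunker", "retrieval"])["score"].mean(), "\n")

# One column per configuration, one row per question, so comparisons are paired.
wide = df.pivot_table(index="question", columns=["chunker", "retrieval"], values="score")

pairs = list(combinations(wide.columns, 2))
raw_p, cis = [], []
for a, b in pairs:
    diff = wide[a] - wide[b]
    raw_p.append(stats.ttest_rel(wide[a], wide[b]).pvalue)
    # Per-pair 95% CI on the mean difference (not multiplicity-adjusted;
    # Holm below adjusts only the p-values).
    half = stats.t.ppf(0.975, len(diff) - 1) * diff.sem()
    cis.append((diff.mean() - half, diff.mean() + half))

reject, holm_p, _, _ = multipletests(raw_p, method="holm")

for (a, b), (lo, hi), p, sig in zip(pairs, cis, holm_p, reject):
    verdict = "distinguishable" if sig else "NOT distinguishable"
    print(f"{a} vs {b}: diff CI [{lo:+.3f}, {hi:+.3f}], Holm p = {p:.4f} -> {verdict}")
```

When every `verdict` comes back NOT distinguishable, that is the honest headline: no configuration can be declared the winner on this dataset.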
Looking for quick usage examples? Check out the Example Usage page.
Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.