CTX: Trigger-Driven Dynamic Context Loading
for Code-Aware LLM Agents
Jeawon Jang · be2jay67@gmail.com
Code: github.com/jaytoone/CTX
arXiv: Pending Endorsement (cs.IR)
8 Strategies · 415 Queries · p<0.05
- TES Score: 0.776
- vs BM25 Baseline: 1.9×
- Token Usage: 5.2%
- Hybrid COIR R@5: 0.95
- IMPLICIT Recall@5: 1.00
- Total Queries: 415
Abstract
Large language models suffer from context dilution when processing extensive codebases — the "Lost in the Middle" problem. Standard RAG approaches treat code as flat text, ignoring the structural dependency information in import graphs.
We present CTX, a trigger-driven dynamic context loading system that classifies developer queries into four types — EXPLICIT_SYMBOL, SEMANTIC_CONCEPT, TEMPORAL_HISTORY, IMPLICIT_CONTEXT — and routes each to a specialized retrieval pipeline.
For dependency-sensitive queries, CTX performs breadth-first traversal over the codebase import graph, resolving transitive relationships invisible to keyword and embedding methods. Evaluated on a synthetic benchmark (50 files, 166 queries) and three real Python codebases (968 files total, 249 queries), CTX achieves a TES 1.9× higher than BM25 while using only 5.2% of the tokens. Statistical significance is established via McNemar and Wilcoxon tests (p<0.05) across all 415 queries.
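To make the routing step concrete, the four-way classification can be sketched with regex and keyword patterns. This is a minimal illustration: the patterns, priority order, and names below are hypothetical, not CTX's actual rules.

```python
import re

# Hypothetical trigger patterns (illustrative only, not CTX's real rules).
# Checked in order; anything unmatched falls back to semantic search.
TRIGGER_PATTERNS = {
    # Backticked identifiers, call syntax, or CamelCase names -> symbol index
    "EXPLICIT_SYMBOL": re.compile(r"`[\w.]+`|\b[a-z_]+\(\)|\b[A-Z][a-z]+[A-Z]\w*\b"),
    # References to change history -> git history retrieval
    "TEMPORAL_HISTORY": re.compile(r"\b(recent|last|changed|history|commit)\b", re.I),
    # Dependency-oriented phrasing -> import graph BFS
    "IMPLICIT_CONTEXT": re.compile(r"\b(depends? on|imports?|affected by|uses)\b", re.I),
}

def classify_query(query: str) -> str:
    """Return the first matching trigger type; default to SEMANTIC_CONCEPT."""
    for trigger, pattern in TRIGGER_PATTERNS.items():
        if pattern.search(query):
            return trigger
    return "SEMANTIC_CONCEPT"

print(classify_query("Which modules depend on config_loader?"))  # IMPLICIT_CONTEXT
print(classify_query("How does caching work here?"))             # SEMANTIC_CONCEPT
```

The key design point is that classification is cheap (no model call), so routing overhead is negligible compared to retrieval itself.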
Contributions
- Four-type trigger taxonomy — EXPLICIT_SYMBOL, SEMANTIC_CONCEPT, TEMPORAL_HISTORY, and IMPLICIT_CONTEXT, each mapped to a specialized retrieval strategy, enabling adaptive resource allocation.
- Import graph traversal — a BFS-based algorithm over the codebase import graph that resolves transitive dependencies. Recall@5 = 1.0 on dependency queries vs. 0.4 for BM25, a 150% improvement.
- TES metric — Trade-off Efficiency Score = Recall@K / ln(1 + |retrieved|), a unified measure of the accuracy-efficiency trade-off. Pearson r = 0.87 correlation with NDCG@5 (p < 0.001).
- Hybrid Dense+CTX — a two-stage pipeline combining dense neural seed selection with import graph expansion. COIR Recall@5 = 0.950 (+150% over CTX alone), validating the complementary nature of semantic and structural retrieval.
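The TES definition above can be sanity-checked in a few lines. One assumption here: for the Full Context baseline, |retrieved| is taken to be all 50 files of the synthetic benchmark.

```python
import math

def tes(recall_at_k: float, n_retrieved: float) -> float:
    """Trade-off Efficiency Score: recall discounted by log retrieval volume."""
    return recall_at_k / math.log1p(n_retrieved)

# Full Context on the synthetic benchmark (assumed |retrieved| = 50 files):
# 0.075 / ln(51) ≈ 0.019, matching the reported TES.
print(round(tes(0.075, 50), 3))  # → 0.019
```

The logarithmic denominator rewards small context sets without letting a single-file retrieval dominate the score.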
Architecture
Query Input
      │
      ▼
┌──────────────────────────────┐
│      Trigger Classifier      │  ← regex + keyword patterns
│    (EXPLICIT / SEMANTIC /    │
│     TEMPORAL / IMPLICIT)     │
└──────────────┬───────────────┘
               │
     ┌─────────┴────────┬───────────────┬─────────────────┐
     │                  │               │                 │
     ▼                  ▼               ▼                 ▼
Symbol Index      TF-IDF/Dense     History Log     Import Graph BFS
(AST lookup)      (cosine sim)     (git history)   (transitive deps)
     │                  │               │                 │
     └─────────┬────────┴───────────────┴─────────────────┘
               │
    ┌──────────▼─────────────┐
    │  Adaptive-k Selection  │  ← k = f(query_type, codebase_size)
    │      (3~10 files)      │
    └──────────┬─────────────┘
               │
               ▼
          LLM Context
         (5.2% tokens)
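The Import Graph BFS branch can be sketched as follows. The graph below is illustrative (module names echo the demo codebase mentioned later); CTX builds the real graph from AST-parsed import statements, which is not shown here.

```python
from collections import deque

# Illustrative import graph: each module maps to the modules it imports.
IMPORT_GRAPH = {
    "pipeline": ["retriever", "metrics"],
    "retriever": ["graph_builder"],
    "graph_builder": ["metrics"],
    "evaluator": ["metrics"],
    "metrics": [],
}

def bfs_context(seed: str, graph: dict, max_files: int = 10) -> list:
    """Collect transitive dependencies of `seed` in breadth-first order."""
    visited = {seed}
    queue = deque([seed])
    order = [seed]
    while queue:
        for dep in graph.get(queue.popleft(), []):
            if dep not in visited:
                visited.add(dep)
                queue.append(dep)
                order.append(dep)
                if len(order) >= max_files:  # respect the adaptive-k budget
                    return order
    return order

print(bfs_context("pipeline", IMPORT_GRAPH))
# → ['pipeline', 'retriever', 'metrics', 'graph_builder']
```

Note how `graph_builder` is reached only transitively (pipeline → retriever → graph_builder); this is exactly the relationship that keyword and embedding retrieval cannot see.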
Synthetic Benchmark (50 files, 166 queries)
| Strategy | Recall@5 | Token% | TES |
|---|---|---|---|
| Full Context | 0.075 | 100.0% | 0.019 |
| BM25 | 0.982 | 18.7% | 0.410 |
| Dense TF-IDF | 0.973 | 21.0% | 0.406 |
| LlamaIndex | 0.972 | 20.1% | 0.405 |
| Chroma Dense | 0.829 | 19.3% | 0.346 |
| GraphRAG-lite | 0.523 | 24.0% | 0.218 |
| Hybrid Dense+CTX | 0.725 | 23.6% | 0.303 |
| CTX (Ours) | 0.874 | 5.2% | 0.776 |
COIR External Benchmark (CodeSearchNet Python)
| Strategy | Recall@1 | Recall@5 | MRR |
|---|---|---|---|
| Dense Embedding (MiniLM) | 0.960 | 1.000 | 0.978 |
| Hybrid Dense+CTX | 0.930 | 0.950 | 0.940 |
| BM25 | 0.920 | 0.980 | 0.946 |
| CTX Adaptive | 0.210 | 0.380 | 0.293 |
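The Hybrid Dense+CTX row reflects the two-stage pipeline from the contributions. A minimal sketch, with toy 2-d embeddings and a simplified one-hop graph expansion (all names and vectors hypothetical):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_retrieve(query_vec, file_vecs, import_graph, n_seeds=2, k=5):
    # Stage 1: dense seed selection by embedding similarity.
    seeds = sorted(file_vecs, key=lambda f: cosine(query_vec, file_vecs[f]),
                   reverse=True)[:n_seeds]
    # Stage 2: expand each seed through its import edges (one hop here;
    # the full system uses BFS).
    context = list(seeds)
    for seed in seeds:
        for dep in import_graph.get(seed, []):
            if dep not in context and len(context) < k:
                context.append(dep)
    return context

vecs = {"retriever": [1.0, 0.1], "metrics": [0.1, 1.0], "pipeline": [0.9, 0.3]}
graph = {"pipeline": ["retriever", "metrics"], "retriever": ["graph_builder"]}
print(hybrid_retrieve([1.0, 0.2], vecs, graph))
# → ['retriever', 'pipeline', 'graph_builder', 'metrics']
```

The design intuition: dense retrieval supplies semantically relevant seeds, and graph expansion pulls in structurally coupled files the embeddings miss.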
Per-Trigger-Type Recall@5 (Synthetic)
| Trigger Type | BM25 | TF-IDF | CTX | Delta |
|---|---|---|---|---|
| EXPLICIT_SYMBOL | 0.81 | 0.73 | 0.97 | +19.8% |
| SEMANTIC_CONCEPT | 0.54 | 0.68 | 0.60 | — |
| TEMPORAL_HISTORY | 0.50 | 0.50 | 1.00 | +100% |
| IMPLICIT_CONTEXT | 0.40 | 0.40 | 1.00 | +150% |
Ablation Study
| Variant | Removed | Recall@5 | TES | IMPL_CONTEXT |
|---|---|---|---|---|
| Full CTX | — | 0.874 | 0.776 | 1.000 |
| No Graph | Import graph | 0.821 | 0.635 | 0.400 |
| No Classifier | Trigger type | 0.743 | 0.412 | 0.600 |
| Fixed-k=5 | Adaptive-k | 0.856 | 0.712 | 1.000 |
Key Findings:
- Removing the import graph → IMPLICIT_CONTEXT recall drops 60% (1.0 → 0.4)
- Removing the trigger classifier → TES drops 47% (0.776 → 0.412)
- TES–NDCG@5 Pearson r = 0.87 (p < 0.001, 28 strategy-dataset pairs)
- pass@1 with MiniMax M2.5: CTX 0.265 vs Full Context 0.102 (n=49, McNemar p<0.05)
Try CTX on a Sample Codebase
The demo runs CTX retrieval on a 10-file sample Python codebase (pipeline, retriever, evaluator, graph builder, metrics, etc.). Enter a natural language query and compare retrieval strategies.
Core Algorithm — Trigger Classifier
Import Graph BFS — Key Differentiator vs RAG
TES Metric — Trade-off Efficiency Score
Full source: github.com/jaytoone/CTX
Links
- GitHub: https://github.com/jaytoone/CTX
- arXiv: Pending (cs.IR endorsement in progress; code: HBJRI6)
Experiment Reproducibility
git clone https://github.com/jaytoone/CTX
cd CTX
pip install -r requirements.txt
# Synthetic benchmark (all 8 strategies)
python run_experiment.py --dataset-size small --strategy all
# Real codebase evaluation
python run_experiment.py --dataset-source real --project-path /path/to/project
# COIR benchmark
python run_coir_eval.py
# LLM pass@1 evaluation (requires MINIMAX_API_KEY)
python run_llm_eval_v2.py