An Interactive Reading of

Retrieval-Augmented Tutoring for
Algorithm Tracing and Problem-Solving
in AI Education

The paper, in plain English

When students struggle with algorithms like A* search or BFS in an introductory AI course, they often turn to ChatGPT for help. The problem: ChatGPT gives them the answer instead of teaching them how to reason through it. KITE is a tutoring system designed to change that — it looks up the relevant lecture slides and textbook pages, figures out what kind of help the student needs, and then responds with guided questions and progressive hints instead of a flat solution.

Think of KITE as the difference between handing someone a completed Sudoku puzzle and sitting next to them saying "What number can't go in this row?" It classifies each student query into one of five intent types — direct question, conceptual, validation, debugging, or tracing — and tailors its response strategy accordingly. For factual questions it gives grounded explanations. For tracing tasks it walks through algorithm state step by step. For debugging it asks guiding questions that lead the student to find their own error.

The headline number: among 27 simulated-student interactions that started with an incorrect or partial answer, 88.89% improved after receiving KITE's feedback. Expert evaluators rated 93.18% of KITE's responses positively for scaffolding quality. The system achieves context relevance of 0.94 and faithfulness of 0.85 on RAGAs metrics, meaning its answers stick closely to course materials rather than hallucinating.

I
Multi-Stage Retrieval Pipeline
Dense bi-encoder search, hybrid BM25 reranking, MMR deduplication, and cross-encoder reranking: four stages that surface the right course material for each student query.
II
Intent-Aware Socratic Strategy
Five query intents — direct, conceptual, validation, debugging, tracing — each triggers a different response approach, from grounded explanations to progressive hints.
III
Simulated Student Evaluation
A weaker LLM answers questions, receives KITE's feedback, then revises — 88.89% of non-correct initial answers improved after KITE's scaffolding.
Chapter 1

The Tutoring Gap

Students already use ChatGPT to learn algorithms. The answers they get are fast, confident, and often skip the thinking the assignment was designed to develop.

[Interactive figure: a slider running from direct answer through guided hint to Socratic tutor, with readouts for student agency score, whether a direct answer is given, and 7-day knowledge retention.]

The core tension: LLMs are excellent at generating correct answers, but correctness is not the same as pedagogy. A tutoring system must trade some directness for learning value — and that trade-off is exactly what KITE's intent-aware strategy tries to navigate.

Chapter 2

Inside the Pipeline

KITE processes a student query through five distinct phases. Each phase narrows the gap between "what the student asked" and "what the student needs to learn."


Each phase adds a layer of specificity: from raw text, to searchable vectors, to the most relevant passages, to an understanding of what the student needs, to a pedagogically appropriate response.

The key design choice is that retrieval and response strategy are decoupled. The same retrieved material can be used to give a direct explanation or a Socratic hint, depending on the student's intent. This is what makes KITE different from a simple "RAG chatbot."
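The decoupling described above can be sketched as a lookup from intent to response strategy, applied to retrieval output that does not depend on the intent. This is a minimal illustration, not KITE's implementation: the `Intent` enum, the `STRATEGY` table, the `build_prompt` helper, and all instruction wording are hypothetical, and the conceptual-intent strategy in particular is an assumption.

```python
from enum import Enum


class Intent(Enum):
    DIRECT = "direct"
    CONCEPTUAL = "conceptual"
    VALIDATION = "validation"
    DEBUGGING = "debugging"
    TRACING = "tracing"


# Hypothetical instruction text per intent; the wording is illustrative only.
STRATEGY = {
    Intent.DIRECT: "Give a grounded explanation based on the passages.",
    Intent.CONCEPTUAL: "Explain the why/how, grounded in the passages.",
    Intent.VALIDATION: "Do not confirm outright; ask guiding questions.",
    Intent.DEBUGGING: "Ask guiding questions that lead the student to the error.",
    Intent.TRACING: "Walk through the algorithm state step by step.",
}


def build_prompt(query: str, passages: list[str], intent: Intent) -> str:
    """Combine intent-independent passages with an intent-specific strategy."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"Strategy: {STRATEGY[intent]}\n\n"
        f"Course material:\n{context}\n\n"
        f"Student: {query}"
    )
```

Calling `build_prompt` twice with the same passages but different intents changes only the strategy line, which is the point of keeping the two concerns separate.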

Chapter 3

From PDFs to Vectors

Before KITE can answer a question, it needs to turn course materials — lecture slides, textbook chapters — into something searchable. That means extracting text, cleaning it, splitting it into chunks, and encoding each chunk as a 3,072-dimensional vector.

Embedding model
Each chunk → text-embedding-3-large (OpenAI) → 3072-dim vector, L2-normalized
Stored in a FAISS index for efficient nearest-neighbour search
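Why L2-normalize? Once every vector has unit length, the inner product of two vectors equals their cosine similarity, so nearest-neighbour search reduces to ranking by dot product. The sketch below shows this with a brute-force scan over toy 3-dimensional vectors. It is a stand-in for the idea only: the function names are hypothetical, and it does not use FAISS or reflect whichever index type KITE configures.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query: list[float], chunks: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Return (chunk index, cosine similarity) for the k best chunks."""
    q = l2_normalize(query)
    scored = []
    for idx, chunk in enumerate(chunks):
        c = l2_normalize(chunk)
        # Inner product of unit vectors is the cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, c))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With a query of `[1, 0, 0]`, a chunk pointing the same way ranks first regardless of its original magnitude, because normalization removes length from the comparison.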
[Interactive figure: sliders for chunk size and overlap, with readouts for total chunks, average tokens per chunk, and context continuity.]
KITE uses 500-character chunks with a 100-character overlap.

Section-aware chunking with header retention is a small detail with outsized impact. Without it, a chunk about "heuristic admissibility" from the search chapter might get retrieved alongside a chunk about "heuristic evaluation" from the game-playing chapter — same keyword, completely different context.
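A minimal sketch of section-aware chunking with header retention, using the 500-character size and 100-character overlap quoted above. The function name is hypothetical, and two details are assumptions rather than facts from the paper: the header is prepended to every chunk as a first line, and the size limit applies to the body text only.

```python
def chunk_section(header: str, body: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split one section into overlapping windows, each prefixed by its header."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each new window starts this far after the previous one
    chunks = []
    start = 0
    while start < len(body):
        window = body[start:start + size]
        # Header retention: the section title travels with every chunk.
        chunks.append(f"{header}\n{window}")
        if start + size >= len(body):
            break  # this window already reached the end of the section
        start += step
    return chunks
```

Because each chunk carries its section title, a passage on "heuristic admissibility" stays labelled as search-chapter material even when its body text alone would be ambiguous.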

Chapter 4

The Retrieval Engine

Finding the right course material isn't a single search — it's a four-stage funnel that starts broad (50 candidates) and narrows to the 8 most relevant passages.

Maximal Marginal Relevance (MMR)
$$\text{MMR} = \lambda \times \text{Relevance} + (1 - \lambda) \times \text{Diversity}$$
where $\lambda = 0.7$ (KITE's setting), balancing relevance to the query against diversity among selected chunks. Higher λ = more relevant but potentially redundant; lower λ = more diverse but may drift from the query.
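The MMR trade-off can be sketched as a greedy selection loop. This is an illustration under stated assumptions, not the paper's code: the function name is hypothetical, relevance scores and pairwise similarities are taken as precomputed inputs in [0, 1], and "Diversity" is interpreted as one minus the candidate's highest similarity to any chunk already selected.

```python
def mmr_select(
    relevance: list[float],
    similarity: list[list[float]],
    k: int,
    lam: float = 0.7,
) -> list[int]:
    """Greedily pick k candidate indices, balancing relevance and diversity."""
    selected: list[int] = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            # Redundancy is the closest match among already selected chunks.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] + (1.0 - lam) * (1.0 - redundancy)

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In a toy case where two candidates are near-duplicates with relevance 0.90 and 0.88, and a third, unrelated candidate has relevance 0.60, λ = 0.7 selects the first and the third: the duplicate's diversity term collapses once its twin has been chosen. Setting λ = 1.0 ignores diversity and returns both duplicates.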
Hybrid Retrieval Score
$$S_{\text{hybrid}} = 0.7 \times S_{\text{dense}} + 0.3 \times S_{\text{BM25}}$$
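The weighted sum is only meaningful if both scores live on a comparable scale, and raw BM25 scores are unbounded. The sketch below assumes min-max normalization of each score list before mixing. That normalization step and the function names are assumptions for illustration; the source states only the 0.7 and 0.3 weights.

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(dense: list[float], bm25: list[float], w_dense: float = 0.7) -> list[float]:
    """Mix normalized dense and BM25 scores per candidate."""
    if len(dense) != len(bm25):
        raise ValueError("dense and bm25 must score the same candidates")
    d = min_max(dense)  # assumed normalization, not specified in the source
    b = min_max(bm25)
    return [w_dense * x + (1.0 - w_dense) * y for x, y in zip(d, b)]
```

With the 0.7 weight on the dense score, a candidate that wins only on keyword overlap cannot outrank one that wins on semantic similarity, but it can break ties among semantically similar candidates.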
Official source boost: +0.30
[Interactive figure: sliders for λ and the dense/sparse weights, with readouts for effective relevance and effective diversity.]
KITE uses λ = 0.7 for MMR and a dense weight of 0.7 (sparse weight 0.3) in the hybrid score.

The cross-encoder reranking stage is where KITE gains precision over a simpler RAG system. By jointly encoding the query and each candidate chunk, the cross-encoder captures interactions that a bi-encoder (which encodes them separately) cannot — like whether the chunk actually answers the question versus merely being topically related.

Chapter 5

Reading the Student's Mind

The same course material can be used to give a definition, walk through a trace, or ask a guiding question. KITE's intent classifier determines which response the student actually needs.

Direct Question: factual queries, e.g. "What is A*?"
Conceptual: why/how questions, e.g. "Why does BFS guarantee shortest path?"
Validation: checking an implementation, e.g. "Is my trace correct?"
Debugging: fixing an error, e.g. "Why is my output wrong?"
Tracing: stepping through an algorithm, e.g. "Trace A* on this graph"

The Socratic approach is strongest for validation and debugging queries — the very cases where students are most tempted to just get the answer. By asking guiding questions instead, KITE forces the student to identify their own error, which is exactly the reasoning skill the assignment was designed to develop.

Chapter 6

Does It Work?

Three evaluation lenses (automated RAGAs metrics, a simulated student pipeline, and expert review) all point the same direction: KITE's answers are grounded, its feedback is pedagogically sound, and simulated students improve after receiving it.

Improvement rate: 88.89%
Scaffolding quality: 93.18%
Inter-rater κ: 0.88
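To make the improvement rate concrete: it is computed only over interactions whose initial answer was not already correct. The sketch below assumes a three-level grading scale (incorrect, partial, correct) and counts an interaction as improved when the revised answer moves up that scale. The scale ordering, the function name, and the toy data are assumptions for illustration. Note that 24 of 27 is the count consistent with the reported 88.89%, but that count is an inference, not a figure quoted from the paper.

```python
# Assumed ordering of grades; the paper's exact rubric may differ.
RANK = {"incorrect": 0, "partial": 1, "correct": 2}


def improvement_rate(pairs: list[tuple[str, str]]) -> float:
    """Share of non-correct initial answers whose revision scored higher.

    Each pair is (initial grade, revised grade).
    """
    eligible = [(a, b) for a, b in pairs if a != "correct"]
    if not eligible:
        raise ValueError("no non-correct initial answers to evaluate")
    improved = sum(1 for a, b in eligible if RANK[b] > RANK[a])
    return improved / len(eligible)


# Toy data with a made-up split: 27 non-correct starts, 24 of which improve.
toy_pairs = (
    [("incorrect", "correct")] * 20
    + [("partial", "correct")] * 4
    + [("incorrect", "incorrect")] * 3
    + [("correct", "correct")] * 5  # excluded from the denominator
)
rate = improvement_rate(toy_pairs)
```

Interactions that start out correct are excluded from the denominator, which is why the rate is stated over 27 interactions rather than over the full set of questions.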

The gap between factual correctness (0.45) and answer similarity (0.76) is a feature, not a bug. KITE's responses are pedagogically framed — they paraphrase, elaborate, and add context beyond what a single reference answer contains. RAGAs-style claim matching penalises this, but semantic similarity captures it. The paper argues answer similarity is the better metric for tutoring systems.

Chapter 7

Lessons and Limits

KITE's results are encouraging, but the paper is candid about what the evaluation can and cannot show — and what it would take to move from a research prototype to a classroom tool.


The honest takeaway: KITE demonstrates that intent-aware, retrieval-grounded tutoring is architecturally viable — the system can classify intents, retrieve relevant material, and generate pedagogically appropriate responses. Whether it actually teaches students better than a baseline chatbot is a question only a classroom study can answer, and that study has not been done yet.