An Interactive Reading of

Retrieval-Augmented Tutoring for
Algorithm Tracing and Problem-Solving
in AI Education

Mragisha Jain, Tirth Bhatt, Griffin Pitts, Aum Pandya,
Peter Brusilovsky, Narges Norouzi, Arto Hellas, Juho Leinonen & Bita Akram
NC State · Pittsburgh · UC Berkeley · Aalto · May 2026 · arXiv:2605.12988

The paper, in plain English

When students struggle with algorithms like A* search or BFS in an introductory AI course, they often turn to ChatGPT for help. The problem: ChatGPT gives them the answer instead of teaching them how to reason through it. KITE is a tutoring system designed to change that — it looks up the relevant lecture slides and textbook pages, figures out what kind of help the student needs, and then responds with guided questions and progressive hints instead of a flat solution.

Think of KITE as the difference between handing someone a completed Sudoku puzzle and sitting next to them saying "What number can't go in this row?" It classifies each student query into one of five intent types — direct question, conceptual, validation, debugging, or tracing — and tailors its response strategy accordingly. For factual questions it gives grounded explanations. For tracing tasks it walks through algorithm state step by step. For debugging it asks guiding questions that lead the student to find their own error.

The headline number: among 27 simulated-student interactions that started with an incorrect or partial answer, 88.89% improved after receiving KITE's feedback. Expert evaluators rated 93.18% of KITE's responses positively for scaffolding quality. The system achieves context relevance of 0.94 and faithfulness of 0.85 on RAGAs metrics, meaning its answers stick closely to course materials rather than hallucinating.

I

Multi-Stage Retrieval Pipeline

Dense bi-encoder search, hybrid BM25 reranking, MMR deduplication, and cross-encoder reranking: four stages that surface the right course material for each student query.

II

Intent-Aware Socratic Strategy

Five query intents — direct, conceptual, validation, debugging, tracing — each triggers a different response approach, from grounded explanations to progressive hints.

III

Simulated Student Evaluation

A weaker LLM answers questions, receives KITE's feedback, then revises — 88.89% of non-correct initial answers improved after KITE's scaffolding.

Chapter 1

The Tutoring Gap

Students already use ChatGPT to learn algorithms. The answers they get are fast, confident, and often skip the thinking the assignment was designed to develop.

In plain English

Imagine you're learning to play chess and you ask a grandmaster "What move should I play here?" One grandmaster reaches over and moves your piece. Another asks "What are you trying to accomplish with this position?" Both are helpful, but only the second one teaches you to think about chess.

KITE is built on the observation that raw LLMs behave like the first grandmaster: they provide complete solutions. The paper's goal is to build a system that behaves like the second one: it knows what the course material says, understands what the student is trying to do, and guides rather than tells.

The stakes are real: prior work cited in the paper shows students accept AI-generated responses without sufficient evaluation, especially when those responses appear complete and confident. Drag the slider below to see how response style changes what a student learns.

[Interactive slider: response style, from direct answer through guided hint to Socratic tutor, with readouts for student agency score, whether a direct answer is given, and 7-day knowledge retention.]

The core tension: LLMs are excellent at generating correct answers, but correctness is not the same as pedagogy. A tutoring system must trade some directness for learning value — and that trade-off is exactly what KITE's intent-aware strategy tries to navigate.


Chapter 2

Inside the Pipeline

KITE processes a student query through five distinct phases. Each phase narrows the gap between "what the student asked" and "what the student needs to learn."


Each phase adds a layer of specificity: from raw text, to searchable vectors, to the most relevant passages, to an understanding of what the student needs, to a pedagogically appropriate response.

The key design choice is that retrieval and response strategy are decoupled. The same retrieved material can be used to give a direct explanation or a Socratic hint, depending on the student's intent. This is what makes KITE different from a simple "RAG chatbot."
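The decoupling described above can be sketched in a few lines. This is a minimal illustration, not KITE's implementation: the five intent names come from the text, but `Intent`, `STRATEGY`, `TutorPrompt`, and `build_prompt` are hypothetical names, and the instruction strings (especially for conceptual and validation queries) are assumed wording.

```python
from dataclasses import dataclass
from enum import Enum


class Intent(Enum):
    DIRECT = "direct"
    CONCEPTUAL = "conceptual"
    VALIDATION = "validation"
    DEBUGGING = "debugging"
    TRACING = "tracing"


# Assumed instruction wording; only the intent categories come from the text.
STRATEGY = {
    Intent.DIRECT: "Give a grounded explanation.",
    Intent.CONCEPTUAL: "Explain why or how, grounded in the course material.",
    Intent.VALIDATION: "Ask guiding questions instead of confirming the answer.",
    Intent.DEBUGGING: "Ask guiding questions that lead the student to the error.",
    Intent.TRACING: "Walk through the algorithm state step by step.",
}


@dataclass(frozen=True)
class TutorPrompt:
    instruction: str          # chosen from the intent alone
    context: tuple[str, ...]  # chosen by retrieval alone
    query: str


def build_prompt(query: str, passages: list[str], intent: Intent) -> TutorPrompt:
    """Pair already-retrieved passages with an intent-specific instruction."""
    return TutorPrompt(STRATEGY[intent], tuple(passages), query)
```

Because `build_prompt` never inspects the passages and retrieval never inspects the intent, the same retrieved context can back either a direct explanation or a Socratic hint.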


Chapter 3

From PDFs to Vectors

Before KITE can answer a question, it needs to turn course materials — lecture slides, textbook chapters — into something searchable. That means extracting text, cleaning it, splitting it into chunks, and encoding each chunk as a 3,072-dimensional vector.

Embedding model

Each chunk → text-embedding-3-large (OpenAI) → 3072-dim vector, L2-normalized
Stored in a FAISS index for efficient nearest-neighbour search
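L2 normalization matters because the inner product of two unit-length vectors equals their cosine similarity, so a nearest-neighbour search over normalized vectors ranks chunks by cosine similarity. The sketch below shows that idea with a brute-force scan in plain Python. It is an illustration under assumptions: toy 3-dimensional vectors stand in for the 3,072-dimensional embeddings, the linear scan stands in for the FAISS index, and the function names are hypothetical.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query: list[float], chunk_vectors: list[list[float]], k: int):
    """Return (chunk_index, cosine_similarity) for the k closest chunks."""
    q = l2_normalize(query)
    scored = [
        (inner_product(q, l2_normalize(v)), i)
        for i, v in enumerate(chunk_vectors)
    ]
    # Highest similarity first; ties broken by the lower chunk index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:k]]
```

Vector length drops out after normalization: a chunk vector `[0.0, 2.0, 0.0]` scores the same as `[0.0, 1.0, 0.0]` against any query.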

[Interactive chunking explorer: sliders for chunk size and overlap in characters, with readouts for total chunks, average tokens per chunk, and context continuity.]

KITE uses 500-character chunks with a 100-character overlap.
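To make the chunking parameters concrete, here is a minimal sliding-window splitter using the 500-character size and 100-character overlap stated above. The function name and the handling of the final short chunk are assumptions; the paper's extraction and cleaning steps are not reproduced.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` chars that share `overlap` chars."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each new chunk starts this far after the last
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With these defaults each new chunk starts 400 characters after the previous one, so a 1,300-character document yields three chunks, and the last 100 characters of each chunk reappear at the start of the next.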

Section-aware chunking with header retention is a small detail with outsized impact. Without it, a chunk about "heuristic admissibility" from the search chapter might get retrieved alongside a chunk about "heuristic evaluation" from the game-playing chapter — same keyword, completely different context.
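One way to read "section-aware chunking with header retention" is that chunks never cross a section boundary and each chunk carries its section header. The sketch below implements that reading; it is an assumption about the mechanism, not the paper's code, and `split_with_headers` and the `(header, body)` input format are hypothetical.

```python
def split_with_headers(
    sections: list[tuple[str, str]], size: int = 500, overlap: int = 100
) -> list[str]:
    """Chunk each section body separately and prefix every chunk with its header."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for header, body in sections:
        if not body:
            continue  # nothing to index for an empty section
        # Window starts stay inside this section, so no chunk mixes sections.
        for start in range(0, max(len(body) - overlap, 1), step):
            chunks.append(f"{header}\n{body[start:start + size]}")
    return chunks
```

With the header attached, a chunk from the search chapter and a chunk from the game-playing chapter stay distinguishable even when both mention "heuristic".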


Chapter 4

The Retrieval Engine

Finding the right course material isn't a single search — it's a four-stage funnel that starts broad (50 candidates) and narrows to the 8 most relevant passages.

In plain English

Think of the retrieval pipeline like a hiring process. First you do a resume screen and pull 50 candidates (dense bi-encoder search). Then you combine two scores — one from the resume screener, one from a keyword-matching filter — to rerank them (hybrid retrieval). Then you make sure your finalists aren't all saying the same thing by penalising redundancy (MMR). Finally, you do a deep interview with the top candidates to get the final ranking (cross-encoder reranking).

The mix between semantic similarity and keyword matching matters: 70% semantic, 30% keyword. This ensures that when a student says "A-star" the system catches the exact term, but when they say "best-first path search" it still finds the right material through meaning. Explore the trade-off below.

Maximal Marginal Relevance (MMR)

$$\text{MMR} = \lambda \times \text{Relevance} + (1 - \lambda) \times \text{Diversity}$$

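where $\lambda = 0.7$ (KITE's setting) balances relevance to the query against diversity among the selected chunks. Here Relevance is a candidate's similarity to the query, and Diversity is high when the candidate is dissimilar to every chunk already selected. Higher λ favours relevance but risks redundancy; lower λ favours diversity but may drift from the query.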
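MMR is usually applied greedily: pick one chunk at a time, rescoring the remaining candidates against what has already been chosen. The sketch below follows the formula above with Diversity taken as one minus the candidate's highest similarity to any selected chunk. That definition, the function name, and the precomputed similarity inputs are assumptions; the ranking is the same as in the common formulation that subtracts the redundancy term.

```python
def mmr_select(
    query_sims: list[float],
    pairwise_sims: list[list[float]],
    k: int,
    lam: float = 0.7,
) -> list[int]:
    """Greedily pick k chunk indices by lam * relevance + (1 - lam) * diversity."""
    selected: list[int] = []
    remaining = list(range(len(query_sims)))

    def score(i: int) -> float:
        # Redundancy: similarity to the closest chunk chosen so far.
        redundancy = max((pairwise_sims[i][j] for j in selected), default=0.0)
        diversity = 1.0 - redundancy
        return lam * query_sims[i] + (1.0 - lam) * diversity

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

For example, if chunks 0 and 1 are near-duplicates (similarity 0.95) with query similarities 0.90 and 0.88, and chunk 2 is less relevant (0.60) but dissimilar to both, then λ = 0.7 selects chunks 0 and 2, while λ = 1.0 selects the two near-duplicates.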

Hybrid Retrieval Score

$$S_{\text{hybrid}} = 0.7 \times S_{\text{dense}} + 0.3 \times S_{\text{BM25}}$$
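The weighted sum above only makes sense if the two scores are on comparable scales, since raw BM25 scores are unbounded while cosine similarities are not. The sketch below uses min-max normalization over the candidate set before mixing. That normalization step is an assumption, as the text does not say how the scores are made comparable, and the function names are hypothetical.

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(
    dense: list[float], bm25: list[float], dense_weight: float = 0.7
) -> list[float]:
    """Combine per-chunk dense and BM25 scores as w * dense + (1 - w) * bm25."""
    if len(dense) != len(bm25):
        raise ValueError("dense and bm25 must score the same candidates")
    if not dense:
        return []
    d, b = min_max(dense), min_max(bm25)
    return [dense_weight * x + (1.0 - dense_weight) * y for x, y in zip(d, b)]
```

With dense scores `[0.82, 0.80, 0.40]` and BM25 scores `[2.0, 9.0, 1.0]`, the second chunk overtakes the first: it is almost as close semantically and has a much stronger exact-term match.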

[Interactive retrieval explorer: sliders for the MMR relevance weight λ, the dense weight, and the source boost threshold (shown at 0.60), with readouts for effective relevance, effective diversity, and official source boost (shown at +0.30).]

KITE uses λ = 0.7 and a dense weight of 0.7.

The cross-encoder reranking stage is where KITE gains precision over a simpler RAG system. By jointly encoding the query and each candidate chunk, the cross-encoder captures interactions that a bi-encoder (which encodes them separately) cannot — like whether the chunk actually answers the question versus merely being topically related.


Chapter 5

Reading the Student's Mind

The same course material can be used to give a definition, walk through a trace, or ask a guiding question. KITE's intent classifier determines which response the student actually needs.

Direct Question

Factual queries: "What is A*?"

Conceptual

Why/how: "Why does BFS guarantee shortest path?"

Validation

Check implementation: "Is my trace correct?"

Debugging

Fix an error: "Why is my output wrong?"

Tracing

Step through: "Trace A* on this graph"

The Socratic approach is strongest for validation and debugging queries — the very cases where students are most tempted to just get the answer. By asking guiding questions instead, KITE forces the student to identify their own error, which is exactly the reasoning skill the assignment was designed to develop.


Chapter 6

Does It Work?

Three evaluation lenses (automated RAGAs metrics, a simulated student pipeline, and expert review) all point in the same direction: KITE's answers are grounded, its feedback is pedagogically sound, and simulated students improve after receiving it.

Improvement rate: 88.89%

Scaffolding quality: 93.18%

Inter-rater κ: 0.88
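The improvement rate is computed only over interactions whose initial answer was not already correct. The sketch below shows one way to compute it, assuming a three-level grade (incorrect, partial, correct) and counting any move up that scale as an improvement. The grade scale ordering, the function name, and the synthetic data are assumptions. The only figures taken from the text are 27 eligible interactions and 88.89%, which together imply 24 improved, since 24 is the only whole number out of 27 that rounds to that percentage.

```python
GRADE_ORDER = {"incorrect": 0, "partial": 1, "correct": 2}


def improvement_rate(pairs: list[tuple[str, str]]) -> float:
    """Share of non-correct initial answers whose revised grade is higher."""
    eligible = [(before, after) for before, after in pairs if before != "correct"]
    if not eligible:
        raise ValueError("no interactions started with a non-correct answer")
    improved = sum(
        GRADE_ORDER[after] > GRADE_ORDER[before] for before, after in eligible
    )
    return improved / len(eligible)


# Synthetic placeholder data: the per-interaction grades are invented and only
# the 24-of-27 split is implied by the figures in the text.
synthetic_pairs = (
    [("incorrect", "correct")] * 24
    + [("partial", "partial")] * 3
    + [("correct", "correct")] * 5  # excluded: already correct initially
)
rate = improvement_rate(synthetic_pairs)
print(f"{rate:.2%}")
```

Interactions that started with a correct answer are excluded from the denominator, which is why the rate is reported over 27 interactions rather than the full set.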

The gap between factual correctness (0.45) and answer similarity (0.76) is a feature, not a bug. KITE's responses are pedagogically framed — they paraphrase, elaborate, and add context beyond what a single reference answer contains. RAGAs-style claim matching penalises this, but semantic similarity captures it. The paper argues answer similarity is the better metric for tutoring systems.


Chapter 7

Lessons and Limits

KITE's results are encouraging, but the paper is candid about what the evaluation can and cannot show — and what it would take to move from a research prototype to a classroom tool.


The honest takeaway: KITE demonstrates that intent-aware, retrieval-grounded tutoring is architecturally viable — the system can classify intents, retrieve relevant material, and generate pedagogically appropriate responses. Whether it actually teaches students better than a baseline chatbot is a question only a classroom study can answer, and that study has not been done yet.
