When students struggle with algorithms like A* search or BFS in an introductory AI course, they often turn to ChatGPT for help. The problem: ChatGPT gives them the answer instead of teaching them how to reason through it. KITE is a tutoring system designed to change that — it looks up the relevant lecture slides and textbook pages, figures out what kind of help the student needs, and then responds with guided questions and progressive hints instead of a flat solution.
Think of KITE as the difference between handing someone a completed Sudoku puzzle and sitting next to them saying "What number can't go in this row?" It classifies each student query into one of five intent types — direct question, conceptual, validation, debugging, or tracing — and tailors its response strategy accordingly. For factual questions it gives grounded explanations. For tracing tasks it walks through algorithm state step by step. For debugging it asks guiding questions that lead the student to find their own error.
The headline number: among 27 simulated-student interactions that started with an incorrect or partial answer, 88.89% (24 of 27) improved after receiving KITE's feedback. Expert evaluators rated 93.18% of KITE's responses positively for scaffolding quality. On RAGAs metrics the system scores 0.94 for context relevance and 0.85 for faithfulness, which indicates that its answers stay close to the course materials, although a faithfulness of 0.85 still leaves room for some claims the retrieved passages do not support.
Students already use ChatGPT to learn algorithms. The answers they get are fast, confident, and often skip the thinking the assignment was designed to develop.
The core tension: LLMs are excellent at generating correct answers, but correctness is not the same as pedagogy. A tutoring system must trade some directness for learning value — and that trade-off is exactly what KITE's intent-aware strategy tries to navigate.
KITE processes a student query through five distinct phases. Each phase narrows the gap between "what the student asked" and "what the student needs to learn."
Each phase adds a layer of specificity: from raw text, to searchable vectors, to the most relevant passages, to an understanding of what the student needs, to a pedagogically appropriate response.
The key design choice is that retrieval and response strategy are decoupled. The same retrieved material can be used to give a direct explanation or a Socratic hint, depending on the student's intent. This is what makes KITE different from a simple "RAG chatbot."
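To make the decoupling concrete, here is a minimal sketch in which the same retrieved passages feed two different response strategies. The function name `build_prompt`, the strategy labels, and the instruction wording are all illustrative assumptions, not KITE's actual prompts.

```python
def build_prompt(question: str, passages: list[str], strategy: str) -> str:
    """Combine retrieved passages with a strategy-specific instruction.

    The passages are identical for every strategy; only the instruction
    changes. The strategy names and wording below are hypothetical.
    """
    instructions = {
        "explain": "Answer directly, using only the course excerpts.",
        "hint": "Do not reveal the answer. Ask one guiding question "
                "grounded in the course excerpts.",
    }
    if strategy not in instructions:
        raise ValueError(f"unknown strategy: {strategy!r}")

    context = "\n".join(f"- {p}" for p in passages)
    return (
        f"{instructions[strategy]}\n\n"
        f"Course excerpts:\n{context}\n\n"
        f"Student: {question}"
    )


passages = ["A* expands the frontier node with the lowest f(n) = g(n) + h(n)."]
direct = build_prompt("Which node does A* expand next?", passages, "explain")
socratic = build_prompt("Which node does A* expand next?", passages, "hint")
```

Both prompts carry the same excerpt block, so retrieval can be tuned or swapped without touching the tutoring behaviour, and vice versa.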
Before KITE can answer a question, it needs to turn course materials — lecture slides, textbook chapters — into something searchable. That means extracting text, cleaning it, splitting it into chunks, and encoding each chunk as a 3,072-dimensional vector.
Section-aware chunking with header retention is a small detail with outsized impact. Without it, a chunk about "heuristic admissibility" from the search chapter might get retrieved alongside a chunk about "heuristic evaluation" from the game-playing chapter — same keyword, completely different context.
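A rough sketch of header retention follows. It assumes the extracted text marks section titles with markdown-style `#` headings, which may not match how KITE's own extraction represents them; the function name `chunk_by_section` and the word budget are also hypothetical.

```python
import re


def chunk_by_section(text: str, max_words: int = 120) -> list[str]:
    """Split text into chunks that never cross a section boundary.

    Every chunk is prefixed with the heading of the section it came from,
    so the section context travels with the chunk into the vector index.
    Assumes headings look like '# Title' (an assumption for illustration).
    """
    chunks: list[str] = []
    header = "Untitled"
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered section body in pieces of at most max_words.
        words = " ".join(buffer).split()
        for start in range(0, len(words), max_words):
            body = " ".join(words[start:start + max_words])
            chunks.append(f"[{header}] {body}")
        buffer.clear()

    for line in text.splitlines():
        match = re.match(r"^#+\s+(.*)$", line)
        if match:
            flush()  # close the previous section under its own header
            header = match.group(1).strip()
        elif line.strip():
            buffer.append(line.strip())
    flush()
    return chunks


sample = (
    "# Search\n"
    "A heuristic is admissible if it never overestimates.\n"
    "# Games\n"
    "A heuristic evaluation scores a board."
)
section_chunks = chunk_by_section(sample)
```

With the heading attached, the two "heuristic" passages in the sample end up as visibly different chunks, one tagged `[Search]` and one tagged `[Games]`.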
Finding the right course material isn't a single search — it's a four-stage funnel that starts broad (50 candidates) and narrows to the 8 most relevant passages.
The cross-encoder reranking stage is where KITE gains precision over a simpler RAG system. By jointly encoding the query and each candidate chunk, the cross-encoder captures interactions that a bi-encoder (which encodes them separately) cannot — like whether the chunk actually answers the question versus merely being topically related.
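The retrieve-then-rerank pattern can be sketched as below. This covers only two of the four stages (broad vector search, then joint rescoring), since the others are not detailed here. The `overlap_score` function is a toy stand-in for a real cross-encoder, and all names and the tiny vectors are illustrative assumptions.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_then_rerank(
    query_text: str,
    query_vec: Sequence[float],
    chunks: Sequence[str],
    chunk_vecs: Sequence[Sequence[float]],
    joint_score: Callable[[str, str], float],
    n_candidates: int = 50,
    n_final: int = 8,
) -> list[str]:
    # Broad stage: bi-encoder style. Query and chunks were embedded
    # separately, so this only compares precomputed vectors.
    by_similarity = sorted(
        range(len(chunks)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:n_candidates]

    # Narrow stage: score each (query, chunk) pair jointly and keep the best.
    reranked = sorted(
        candidates,
        key=lambda i: joint_score(query_text, chunks[i]),
        reverse=True,
    )
    return [chunks[i] for i in reranked[:n_final]]


def overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


chunks = [
    "A* expands the node with lowest f = g + h",
    "BFS uses a queue",
    "heuristic evaluation in games",
]
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
top = retrieve_then_rerank(
    "heuristic evaluation", [1.0, 0.0], chunks, chunk_vecs,
    overlap_score, n_candidates=2, n_final=1,
)
```

In the toy data the A* chunk wins the vector stage, but the joint scorer promotes the chunk that actually matches the query, which is the kind of correction the reranking stage exists to make.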
The same course material can be used to give a definition, walk through a trace, or ask a guiding question. KITE's intent classifier determines which response the student actually needs.
The Socratic approach is strongest for validation and debugging queries — the very cases where students are most tempted to just get the answer. By asking guiding questions instead, KITE forces the student to identify their own error, which is exactly the reasoning skill the assignment was designed to develop.
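One way to picture the intent-to-strategy step is as a lookup table. The five intent names come from the description above; the `Strategy` names and the table itself are a hypothetical sketch, and the strategy chosen for conceptual queries in particular is an assumption, since the text does not spell it out.

```python
from enum import Enum


class Intent(Enum):
    DIRECT_QUESTION = "direct question"
    CONCEPTUAL = "conceptual"
    VALIDATION = "validation"
    DEBUGGING = "debugging"
    TRACING = "tracing"


class Strategy(Enum):
    GROUNDED_EXPLANATION = "grounded explanation"
    STEP_BY_STEP_TRACE = "step-by-step trace of algorithm state"
    SOCRATIC_QUESTIONS = "guiding questions and progressive hints"


# Illustrative mapping only, not KITE's actual configuration.
STRATEGY_FOR_INTENT = {
    Intent.DIRECT_QUESTION: Strategy.GROUNDED_EXPLANATION,
    Intent.CONCEPTUAL: Strategy.GROUNDED_EXPLANATION,  # assumption
    Intent.VALIDATION: Strategy.SOCRATIC_QUESTIONS,
    Intent.DEBUGGING: Strategy.SOCRATIC_QUESTIONS,
    Intent.TRACING: Strategy.STEP_BY_STEP_TRACE,
}


def choose_strategy(intent: Intent) -> Strategy:
    """Return the response strategy for a classified intent."""
    return STRATEGY_FOR_INTENT[intent]
```

Keeping the mapping in one table makes the pedagogical policy easy to inspect and adjust, for example if an instructor wanted tracing queries to start with a hint rather than a full walkthrough.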
Three evaluation lenses (automated RAGAs metrics, a simulated student pipeline, and expert review) all point the same direction: KITE's answers are grounded, its feedback is pedagogically sound, and simulated students improve after receiving it.
The gap between factual correctness (0.45) and answer similarity (0.76) reflects how the two metrics work more than a flaw in the answers. KITE's responses are pedagogically framed: they paraphrase, elaborate, and add context beyond what a single reference answer contains. RAGAs-style claim matching penalises this extra material, while semantic similarity still credits a response that conveys the same meaning. The paper argues that answer similarity is the better metric for tutoring systems.
KITE's results are encouraging, but the paper is candid about what the evaluation can and cannot show — and what it would take to move from a research prototype to a classroom tool.
The honest takeaway: KITE demonstrates that intent-aware, retrieval-grounded tutoring is architecturally viable — the system can classify intents, retrieve relevant material, and generate pedagogically appropriate responses. Whether it actually teaches students better than a baseline chatbot is a question only a classroom study can answer, and that study has not been done yet.