Large language models (LLMs) enable fluent, engaging dialogue but often produce false or misleading content, commonly called “hallucination.” This talk examines why hallucinations arise and surveys key detection strategies, including human evaluation, LLM-as-judge methods, uncertainty measures, and factuality scoring. I introduce VISTA Score, a new approach for verifying claims across multi-turn conversations, and show how it improves accuracy in retrieval-augmented settings. On the mitigation side, I discuss retrieval-based methods, prompt design, verification pipelines, and model finetuning with both human and synthetic data. Results demonstrate that small, carefully finetuned models can substantially reduce hallucination rates, rivaling larger systems. Together, these insights highlight practical pathways toward building more trustworthy and transparent AI dialogue systems.
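To make the claim-level verification idea concrete, the sketch below scores a response by checking each atomic claim against the conversation history plus retrieved evidence. It is only an illustration, not the VISTA Score algorithm: `extract_claims` and `judge` are hypothetical stand-ins for an LLM-backed claim extractor and an LLM-as-judge call.

```python
# Minimal sketch of claim-level verification in a multi-turn conversation.
# `extract_claims` and `judge` are hypothetical placeholders for LLM-backed
# components; this is an illustration, not the VISTA Score method itself.
from typing import Callable, List


def verify_response(
    history: List[str],          # prior conversation turns (oldest first)
    evidence: List[str],         # retrieved passages for the current turn
    response: str,               # the model response being checked
    extract_claims: Callable[[str], List[str]],
    judge: Callable[[str, str], bool],  # (claim, context) -> supported?
) -> float:
    """Return the fraction of atomic claims in `response` that the judge
    marks as supported by the conversation history plus retrieved evidence."""
    context = "\n".join(history + evidence)
    claims = extract_claims(response)
    if not claims:
        return 1.0  # nothing to verify; treat as trivially consistent
    supported = sum(judge(claim, context) for claim in claims)
    return supported / len(claims)


# Toy usage with naive stand-ins; a real pipeline would prompt an LLM instead.
if __name__ == "__main__":
    naive_extract = lambda text: [s.strip() for s in text.split(".") if s.strip()]
    naive_judge = lambda claim, context: claim.lower() in context.lower()
    score = verify_response(
        history=["User: When was the Eiffel Tower built?"],
        evidence=["The Eiffel Tower was completed in 1889."],
        response="The Eiffel Tower was completed in 1889. It is in Berlin.",
        extract_claims=naive_extract,
        judge=naive_judge,
    )
    print(f"supported-claim ratio: {score:.2f}")  # 0.50 for this toy input
```

In practice the naive substring judge would be replaced by retrieval plus an LLM-as-judge prompt, and the per-claim decisions can be aggregated per turn rather than per response.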

My research focuses on mitigating hallucinations in interactive dialogue systems through efficient, data-driven methods. I specialize in developing scalable solutions that improve the factual accuracy and reliability of dialogue agents, particularly in real-time, resource-constrained applications such as virtual assistants, educational tools, and knowledge base question answering (KBQA) systems. A central theme of my work is exploring the use of smaller, domain-specific models—trained via techniques like knowledge distillation and self-training—as viable, cost-effective alternatives to large, generalist LLMs. The sketch after this paragraph illustrates the general shape of that recipe.
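As a rough illustration of the distillation/self-training recipe, the sketch below builds a silver-labeled training set by keeping only the teacher's confident labels before finetuning a small student. `teacher_label` and the downstream finetuning step are hypothetical placeholders, not a specific system from my work.

```python
# Illustrative sketch of a distillation/self-training data-building step:
# a large teacher model labels unlabeled (context, response) pairs, and only
# confident labels are kept as silver data for finetuning a small student.
# `teacher_label` is a hypothetical placeholder for a teacher-model call.
from typing import Callable, List, Tuple

Example = Tuple[str, str]        # (dialogue context, candidate response)
Labeled = Tuple[str, str, int]   # (context, response, hallucination label)


def build_distillation_set(
    unlabeled: List[Example],
    teacher_label: Callable[[str, str], Tuple[int, float]],  # -> (label, confidence)
    min_confidence: float = 0.9,
) -> List[Labeled]:
    """Keep only teacher labels above a confidence threshold, so the student
    is trained on relatively clean silver data rather than every prediction."""
    kept: List[Labeled] = []
    for context, response in unlabeled:
        label, confidence = teacher_label(context, response)
        if confidence >= min_confidence:
            kept.append((context, response, label))
    return kept
```

The same loop supports self-training: once the small student is reliable enough, its own high-confidence predictions on new dialogues can be fed back in place of the teacher's labels.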
My work also emphasizes the value of synthetic data, showing that high-quality LLM-generated data can match or exceed the effectiveness of human annotations in hallucination reduction. Complementing these modeling efforts, I am developing more robust evaluation frameworks to better assess and detect hallucinations in dialogue settings, proposing dynamic, context-aware metrics that go beyond the limitations of existing tools like FactScore.
Overall, my goal is to improve the reliability, accessibility, and usability of interactive AI systems while promoting efficient and responsible development practices.