Every data practitioner knows that data quality is of the utmost importance, hence the prevalent expression “garbage in, garbage out.” Characterizing the noise-to-signal ratio of data is difficult in general, and data quality problems are particularly acute when working with natural language data, which can contain sparse information and lack context or substance. Traditional approaches to assessing the quality of natural language data rely on problematic heuristics, such as character counts or entropy-based measures, which do not directly assess whether a message can be understood on its own. This talk will introduce InfoQ, a model-based approach to measuring the quality of a natural language message. The model addresses a range of use cases, from improving the quality of data used for model training to ensuring that only the most valuable data is scored by targeted models or included in context windows for generative AI.
Key topics to be covered:
This talk is intended for data industry professionals working in the natural language space who want to learn how to develop their own solutions for detecting meaningful message data and improving the quality of their targeted or generative AI models.
Mary (McClain) Fitch is a Data Science Team Lead on the Behavioral Intelligence team at Aware, a SaaS startup founded in Columbus, Ohio. At Aware, she has experience leading the development of machine learning and statistical modeling techniques that deliver insights and actionable solutions to complex business problems, including oversight of labeling operations. She also has extensive experience in data science and analytics for the financial and business sectors. Mary holds a BA from the College of Wooster and a master’s degree in Business Analytics from The Ohio State University.