Description

Every data practitioner knows that data quality is of the utmost importance, hence the prevalent expression “garbage in, garbage out.” Characterizing the noise-to-signal ratio of data is difficult, and data quality problems are particularly acute when working with natural language data, which can be sparse in information and lack context or substance. Traditional approaches to assessing the quality of natural language data rely on problematic heuristics, such as character counts or entropy-based measures, that do not directly capture whether a message can be understood on its own. This talk will introduce a model-based approach to measuring the quality of a natural language message, which we call InfoQ. The model addresses a range of use cases, from improving the quality of data used for model training to ensuring that only the most valuable data is scored by targeted models or included in context windows for generative AI.
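To make the critique concrete, the two heuristic baselines named above can be sketched in a few lines. This is an illustrative sketch only, not code from the talk: character counts and Shannon entropy are easy to compute, but neither tells you whether a message is actually interpretable on its own.

```python
import math
from collections import Counter

def char_count(message: str) -> int:
    """Length-based heuristic: longer messages are assumed richer."""
    return len(message)

def char_entropy(message: str) -> float:
    """Shannon entropy over characters, a common proxy for information content."""
    if not message:
        return 0.0
    counts = Counter(message)
    total = len(message)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A keyboard-mash string scores higher on entropy than a short, perfectly
# understandable message, which is the failure mode the talk points at.
gibberish = "asdfjkl qwerty zxcvbn"
meaningful = "ok ok ok ok ok ok ok"
```

Here `char_entropy(gibberish)` exceeds `char_entropy(meaningful)` even though only the second string carries a recoverable meaning, which is why a model-based measure is proposed instead.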

Key topics to be covered:

  • Model Development and Architecture: Dive into specifics on the development of this model, including how we conceptualized the problem, collected the training data, and continued to iterate.
  • Application for filtering uninformative data: Discuss how InfoQ can be used to filter out low-quality data. We’ll also go through a use case where this application contributed to improvements in model performance for a generative AI-based summarization task. This approach can also minimize hallucinations by removing low-fidelity content that may produce inaccurate inferences.
  • Application for dynamic ordering of data: Discuss how InfoQ can be used to order datasets by quality. We’ll also explore a use case where this application proved useful for a complex legal-domain task.
  • Future Iteration: Offer ways we plan to develop this model further, including continued maintenance of the training data and additional refinements.
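The filtering and ordering applications above follow a common pattern: score each message, then threshold or sort on that score. A minimal sketch, assuming a hypothetical `score_quality()` function standing in for the InfoQ model (the real model and its API are not described in this abstract):

```python
# Placeholder scorer: unique-word count, used purely so the sketch runs
# end to end. The actual InfoQ model is a trained quality model, not this.
def score_quality(message: str) -> float:
    return float(len(set(message.lower().split())))

def filter_informative(messages, threshold: float):
    """Drop messages below a quality threshold before training or scoring."""
    return [m for m in messages if score_quality(m) >= threshold]

def order_by_quality(messages):
    """Rank messages so the most informative fill a context window first."""
    return sorted(messages, key=score_quality, reverse=True)

messages = [
    "ok",
    "The Q3 audit flagged three unsigned vendor contracts in the EMEA region.",
    "lol same",
]
kept = filter_informative(messages, threshold=3.0)
ranked = order_by_quality(messages)
```

With any scorer plugged in, `filter_informative` serves the data-cleaning and hallucination-reduction use case, while `order_by_quality` serves the context-window prioritization use case.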

This talk is intended for data industry professionals working in the natural language space who are interested in developing their own solutions for detecting meaningful message data and improving the quality of their targeted or generative AI models.

Details

July 11, 2024

8:40 am

-

9:15 am

Delaware


Track:

AI & ML

Level:

Intermediate

Tags

Data Quality
GenAI
Models

Presenters

Mary Fitch
Data Science Team Lead
Aware

Bio

Mary (McClain) Fitch is a Data Science Team Lead on the Behavioral Intelligence team at Aware, a SaaS startup founded in Columbus, Ohio. At Aware, she leads the development of machine learning and statistical modeling techniques that deliver insights and actionable solutions to complex business problems, including oversight of labeling operations. She also has extensive experience in data science and analytics for the financial and business sectors. Mary holds a BA from the College of Wooster, as well as a Master’s Degree in Business Analytics from The Ohio State University.