31 October 2025, 11:30–12:00
Measuring what matters: Benchmarking and human-in-the-loop for reliable generative AI
When it comes to deploying generative AI in the enterprise, two questions loom large: "Can we trust this model in production?" and "How do we keep it aligned with human values over time?"
In this 30-minute talk, we’ll explore two critical themes: how to measure model quality effectively and how to embed human-in-the-loop (HITL) systems into your pipelines.
You’ll learn:
- Why standard benchmarks fall short, and how to design evaluations that reflect your real-world data
- Practical approaches to testing and monitoring models before and after they go live
- How HITL feedback loops can capture nuance, enforce quality, and drive continuous improvement
- Best practices for balancing automation with human oversight to build trust at scale
Through real-world examples and practical guidance, you’ll leave with a concise toolkit for evaluating and governing GenAI systems that’s rigorous enough for the enterprise and flexible enough for fast-moving teams.