29 October 2026 15:00 - 15:30
Interactive panel | The new evaluation stack: Measuring reasoning, workflow quality, and real-world performance
How do you know your AI system is actually working?
Not in a test environment. Not on a benchmark. In production, across multi-step tasks, where failures don't always throw an error and drift doesn't always announce itself. Most existing tooling wasn't built for this problem.
This session gets into how engineering teams are building eval infrastructure for workflow-based AI: catching silent failures, measuring reasoning quality across decision chains, and closing the gap between controlled evals and real-world performance.
Key takeaways:
- How to measure reasoning quality across multi-step workflows, not just final output accuracy
- The patterns that signal silent failure or drift before either surfaces as a visible error
- What a production-grade eval stack looks like when you're evaluating a workflow, not a single model