Tom Curran

Senior Data Scientist

CVS Health

Tom Curran is a Data Scientist at CVS Health, where he works as a full-stack data scientist building end-to-end data systems and machine learning solutions that support healthcare products and decision-making. With experience spanning data science, software engineering, and applied policy, he focuses on translating complex data into practical, production-ready tools used by cross-functional teams. He is a former Chan Zuckerberg Initiative data scientist and Teach For America educator. Prior to CVS Health, Tom worked across education, nonprofit, and research environments, where he developed data pipelines, analytics tools, and experimentation frameworks. He is the creator of Inquirio, a semantic search platform for education standards, and continues to build at the intersection of data systems and real-world impact. He brings a practitioner-focused perspective on building scalable data stacks and moving models from notebooks into production.

29 October 2026 15:00 - 15:30

Panel | The new evaluation stack: Measuring reasoning, workflow quality, and real-world performance

How do you know your AI system is actually working? Not in a test environment. Not on a benchmark. In production, across multi-step tasks, where failures don't always throw an error and drift doesn't always announce itself. Most existing tooling wasn't built for this problem. This session gets into how engineering teams are building eval infrastructure for workflow-based AI: catching silent failures, measuring reasoning quality across decision chains, and closing the gap between controlled evals and real-world performance.

Key takeaways:

→ How to measure reasoning quality across multi-step workflows, not just final output accuracy
→ The patterns that signal silent failure or drift before they surface as visible errors
→ What a production-grade eval stack looks like when you're evaluating a workflow, not a single model