The field of AI is evolving at breakneck speed, with enterprises experimenting with innovative applications across domains. Yet persistent challenges remain: how do we evaluate these systems, and where should AI be applied most effectively?
We recently attended the Pie & AI London event (July 2025 at the London AI Hub), where keynote speaker Vignesh Ramesh, an AI Solutions Engineer and expert in evaluating GenAI systems, shared a compelling perspective on the current state of AI evaluations and on ensuring AI works in production. Vignesh, who also delivered a thought-leadership session at Gartner’s Data & Analytics Summit in London, is becoming a trusted voice for enterprises navigating the uncharted waters of Generative AI.

Following his keynote, we sat down with Vignesh to discuss the work he is doing for responsible, domain-specific AI systems and his journey so far.
_________________________________________________
Erika: Vignesh, your keynote at Pie & AI London was eye-opening. You spoke about the urgent need for rigorous evaluations of GenAI systems. Why do you see evaluation as such a critical challenge?
Vignesh: Evaluation is the bedrock of trust. Think about the food we eat or the cars we drive—these industries are subject to rigorous safety standards, so consumers can trust them. GenAI systems are no different. Without proper evaluation, enterprises risk deploying models that hallucinate, fail silently, or worse, cause reputational or financial harm.
I’ve seen firsthand, in both research and production, how a robust evaluation framework transforms outcomes. At Snorkel AI, for example, we demonstrated how domain-specific benchmarks like FinanceBench and TauBench can expose failure modes and help enterprises tune models for reliability. Without this rigor, AI remains a shiny prototype rather than a trusted enterprise solution.
Erika: You spoke about “responsible, domain-specific AI.” What does that mean in practice?
Vignesh: It means acknowledging that one-size-fits-all AI doesn’t work in the enterprise. A model trained for retail customer care shouldn’t be blindly reused for financial auditing. Each domain has unique risks, terminology, and compliance requirements. A key focus of my work has been designing systems that respect these boundaries—whether that is building DocQA systems for highly regulated industries or audit automation systems. Aligning AI to the real-world context it serves, and ensuring it augments rather than undermines human decision-making, is critical.
Erika: You’ve also been active in London hackathons, winning with Cohere and placing in Google’s Electric Twins. How did these shape your thinking?
Vignesh: Hackathons are like pressure cookers for innovation. At the Cohere Hackday, we built a browser automation agent capable of fully navigating the web through natural language instructions, aimed specifically at people with visual impairments. It was a huge success. At the Electric Twins hackathon held in Google’s London office, we prototyped a system to monitor chatbot usage patterns among children—underscoring the ethical side of AI deployment. Each of these projects reinforced a key lesson: evaluation must go hand-in-hand with innovation. Winning is great, but the true impact comes when these prototypes are responsibly matured into systems people can actually use.
Erika: What message are you trying to leave with enterprise leaders?
Vignesh: My core message is simple: AI is powerful, but without trust it cannot scale. Enterprises must resist the temptation of flashy demos and instead ask: “Does this system reliably work for my domain, under my constraints?” At the Gartner summit, I showed how agentic systems could be tuned from 15% task success out-of-the-box to over 60% through careful design and tuning. I am also working on a new initiative called “The $100 Agents,” a project focussed on proving that rigorous design and evaluation can achieve high performance even on tight budgets.
Erika: That’s interesting—tell us a bit more about The $100 Agents project.
Vignesh: Sure! The $100 Agents started as an independent research initiative to prove a point—that you don’t need endless compute budgets to train capable, domain-specific agents. With just $100 worth of GPU time on runpod.io, I was able to build a multi-stage training pipeline involving synthetic data generation, supervised fine-tuning, reinforcement learning with GRPO, and automated reward modeling using Monte Carlo Tree Search.
The results were eye-opening: on a retail customer care agentic system, task completion rates jumped from 15% out-of-the-box to 60% after reinforcement learning, all within that tiny budget. The project has since grown to include a handful of senior researchers from leading labs, and we are now applying reinforcement learning in niche settings to understand which training algorithms work and scale well. The project is also fully open-source, because we want others in the community—especially smaller teams and startups—to replicate and build on it.
Erika: Looking ahead, where do you see the GenAI landscape heading?
Vignesh: I believe we’re moving towards a world of specialized, trusted AI assistants embedded deeply into workflows. They won’t replace people but will act as copilots—handling repetitive tasks, surfacing insights, and enabling employees to focus on judgment and creativity. But the road there requires enterprises to take evaluation seriously, invest in domain-specific solutions, and nurture user adoption. As I often say, AI is a force multiplier—it can help level the playing field. I’ve seen colleagues go from struggling with complex processes to thriving once equipped with the right AI tools. That’s the future I want to help build: AI that empowers, not overwhelms.
Erika: On a personal note, what drives your passion for AI?
Vignesh: For me, it comes down to impact. AI can be a significant democratizing force that helps people 10X themselves. I have seen people use some of the tools I’ve built first-hand to significantly level up their performance at work. Our ability to learn new things with AI, experiment and prototype quickly, fail fast, and iterate is going to drive incredible productivity gains. The divide between someone with a wealth of knowledge and someone who simply hustles to get things done with AI has narrowed dramatically. This is by far my biggest motivation to keep working on AI—to continue building AI systems that have a wide reach.
___________________________________________________________________________
Published: September 2025
Author: Erika Balla is a writer and Content Manager at The Data Scientist. She specialises in exploring the intersection of AI, ethics, and real-world applications, helping to translate technical advancements into stories that matter for business leaders and practitioners.