At GoDaddy, we built a system called Veritas to help detect prompt regressions and model migration drift before changes reach production.
The core idea is simple:
Exact-match testing breaks down for LLMs, because the same prompt can produce differently worded responses that are equally correct.
What matters is whether the agent preserved the same meaning and intent.
We ended up using embeddings + cosine similarity as the primary evaluation signal.
Rather than asking: "Did the model generate the same response?"
We ask: "Did the model mean the same thing?"
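To make that concrete, here is a minimal sketch of a similarity-based regression check. It is not the Veritas implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the 0.8 threshold is an arbitrary assumption, and the example responses are made up for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts.

    A real setup would call a sentence-embedding model here instead.
    """
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_regression(baseline: str, candidate: str, threshold: float = 0.8) -> bool:
    """Flag the candidate when it drifts too far from the baseline response.

    The threshold is an assumed value; in practice it would be tuned per use case.
    """
    return cosine_similarity(toy_embed(baseline), toy_embed(candidate)) < threshold


# Made-up responses: one reworded, one that changed meaning.
baseline = "Your domain renews on March 3 and costs $20."
reworded = "Your domain costs $20 and renews on March 3."
drifted = "I can't help with billing questions."

print(baseline == reworded)               # exact match fails on a harmless rewording
print(is_regression(baseline, reworded))  # similarity check tolerates it
print(is_regression(baseline, drifted))   # similarity check flags the changed answer
```

With a bag-of-words stand-in, only word overlap is measured, so paraphrases that use different vocabulary would be flagged incorrectly. A real embedding model is what lets the score track meaning rather than wording.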
One of the more interesting findings was how often seemingly harmless prompt edits changed downstream behavior in ways that were difficult for human reviewers to catch.
Prompts aren't documentation.
Prompts are code.
Curious what others are using today for regression testing:
• LLM-as-judge?
• Embedding similarity?
• Human review?
• Custom eval frameworks?
https://www.godaddy.com/resources/news/veritas-catching-silent-ai-regressions-before-they-ship
Would love to compare approaches.