23 April 2026 | Research

Announcing dogAdvisor Evaluation Index

We'll soon be introducing the world's first pet AI safety and intelligence benchmarks

Today, dogAdvisor is announcing Evaluation Index: the world's first intelligence and safety benchmark designed specifically for AI models operating in dog care and dog ownership contexts. This is an early look at what Evaluation Index will entail. We will soon begin publishing our full research and an Evaluation Index Card for each benchmark, so you can understand exactly what goes into every evaluation, how each one is assessed, and what a pass, a partial pass, and a failure look like. Evaluation Index contains 31 benchmarks across two categories (Safety and Intelligence) designed to deeply test AI models for this specific application. Put simply, Evaluation Index is a structured, scored framework for evaluating whether an AI is genuinely fit to assist dog owners.

Why do we need Evaluation Index to exist?

There are over 210 published AI safety benchmarks. Not one of them tests for what matters in pet AI. General safety benchmarks test for things like bioweapon uplift, jailbreak resistance, and political bias. Veterinary research evaluates whether large language models can pass professional licensing exams. Neither framework addresses the actual failure modes that matter when a dog owner turns to an AI for help. The real risks are different: a model that tells a user to monitor a dog that needs emergency care. A model that fabricates a medication or invents a statistic with total confidence. A model that folds when a user pushes back because they cannot afford a vet visit. A model that misses the clinical significance of a symptom cluster because it is reasoning about symptoms one at a time. These failures do not show up on any existing benchmark, and so nobody is actually testing for them.

At the same time, the pet AI market is growing rapidly with no regulatory framework governing it. There are no pre-market requirements for AI tools targeting pet owners, no standards against which products can be assessed, and no shared vocabulary for what safe and intelligent performance looks like in this context. Academic researchers have explicitly called for exactly this kind of standard to be developed. Evaluation Index is our response to that gap.

dogAdvisor is a research organisation, not an academic institution, and we do not claim to be one. What we are is the organisation that has spent more time than anyone thinking carefully about what it means for an AI to assist dog owners safely and intelligently. We have published a full model card, conducted multi-generation interpretability research, and run over 18,800 adversarial test scenarios specifically designed to find the edge cases where AI systems in this domain fail. We have thought about the difference between over-referral and under-referral, about what manipulation resistance means in a pet health context, about how a model should behave when a user says they cannot afford to go to the vet. We have thought about these things not as abstract research questions but as live system design decisions that affect real dogs and real owners. There are few deployers in the world who can claim as much domain expertise in AI and dogs as dogAdvisor. This experience forms the basis on which Evaluation Index has been created, reflecting the failure modes we have actually encountered, tested against, and in some cases fixed across multiple generations of Max.

Built as the world's most challenging and rigorous pet inference test

Evaluation Index is not a knowledge quiz. Any sufficiently large language model can recall that xylitol is toxic to dogs, that bloat requires emergency surgery, that Cavalier King Charles Spaniels are predisposed to mitral valve disease. Recalling facts is the lowest bar there is. We did not build 31 benchmarks to measure whether a model has read the right articles. What Evaluation Index tests is inference. The ability to take incomplete, ambiguous, real-world information and reason through it correctly. The ability to hold multiple clinical possibilities in mind simultaneously without collapsing into a false certainty. The ability to synthesise symptoms, breed profile, age, behavioural signals, and owner context into a response that is actually useful, not just technically accurate. These are not the same thing, and most AI systems cannot do both.

The safety benchmarks in Evaluation Index are designed to find the precise point at which a model breaks. Not by asking easy questions, but by constructing the scenarios where failure is most likely: a user who is scared and pushing back, a symptom cluster that could be benign or catastrophic, a conversation that has been escalating across multiple turns in a way that demands the model notice. These are the moments that matter. These are the moments that existing benchmarks completely ignore. Critically, Evaluation Index puts Max to the test too. This is not a framework built to confirm what we already know. Several of the benchmarks were designed specifically because of edge cases we identified in our own system, scenarios where even a purpose-built pet AI can fail. Multi-turn escalation awareness exists because we found that models, including Max, can lose track of clinical trajectory across a conversation. Manipulation resistance exists because we tested what happens when users push hard, and we did not always like what we saw. The benchmarks are rigorous precisely because we built them against our own weaknesses, not around them.

The result is the most demanding evaluation of AI fitness for pet care that has ever been constructed. Not because it covers the most ground, but because it tests the right things, in the right way, with no shortcuts. A model that scores well on Evaluation Index has not passed a trivia test. It has demonstrated that it can reason, hold the line, and get the hard calls right.

A summary of the Evaluation Index Benchmarks

Accountability and Application

Evaluation Index is designed to be applied to any AI model that operates in dog care and dog ownership contexts. The same 31 tests, the same rubrics, and the same scoring methodology apply equally to every model evaluated. That is the point of building it as a benchmark rather than an internal testing framework.

Each benchmark is scored through one of three mechanisms. Structured multiple-choice scenarios are scored against a validated answer key: the model selects the correct response or it does not. Mechanical scoring, such as semantic similarity analysis across prompt variants or readability scoring, is computed algorithmically with no human in the loop. For benchmarks requiring qualitative assessment (whether a response correctly avoids diagnosing, whether a differential is clinically plausible, whether a response is appropriately calibrated for a user in distress), scoring is conducted by a separately specified LLM judge operating against a fixed, published rubric. The judge model and its exact system prompt are published before any model is scored.

To ensure consistency across evaluations, we measure inter-evaluator stability across all LLM-judge-scored benchmarks using repeated sampling of identical test cases across model versions and judge instances. We track variance in scoring outcomes across multiple judge runs and maintain a stability threshold for rubric adherence. Where variance exceeds predefined bounds, we recalibrate either the rubric interpretation layer or the judge ensemble configuration. These stability metrics are logged internally and will be published periodically in aggregated form as part of Evaluation Index updates.

We recognise that LLM-based evaluators can exhibit systematic biases, including verbosity preference, positional bias, and over-weighting of confident language. To mitigate this, Evaluation Index employs counterfactual prompt pairing, in which semantically equivalent responses are presented in varied linguistic forms to detect scoring drift. In addition, judge outputs are periodically calibrated against a gold-standard human expert dataset derived from veterinary-reviewed case simulations. This calibration dataset is updated on a scheduled basis to reflect emerging real-world clinical scenarios. We expect to share more specific details on exactly how this works in our Spring/Summer update later this year.
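
To make the stability checks concrete, here is a minimal sketch of how variance across repeated judge runs might be monitored. The judge interface, the variance bound, and every name in it are assumptions made for illustration; this is not the production Evaluation Index pipeline.

```python
# Minimal sketch of inter-evaluator stability tracking. The judge
# interface, the variance bound, and all names are illustrative
# assumptions, not the production Evaluation Index pipeline.
import statistics
from typing import Callable

STABILITY_VARIANCE_BOUND = 0.05  # hypothetical predefined bound


def check_judge_stability(
    judge: Callable[[str, str], float],  # (response, rubric) -> score in [0, 1]
    response: str,
    rubric: str,
    runs: int = 10,
) -> dict:
    """Score one fixed test case repeatedly and measure scoring variance."""
    scores = [judge(response, rubric) for _ in range(runs)]
    variance = statistics.pvariance(scores)
    return {
        "mean_score": statistics.fmean(scores),
        "variance": variance,
        # Variance above the bound triggers recalibration of the rubric
        # interpretation layer or the judge ensemble configuration.
        "needs_recalibration": variance > STABILITY_VARIANCE_BOUND,
    }
```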

Weightings reflect the asymmetry of harm. Safety benchmarks carry greater weight than Intelligence benchmarks as a category. Within Safety, benchmarks involving direct life-or-death decisions (Emergency Recognition, Referral Precision, Toxin Identification, Manipulation Resistance, and Multi-turn Escalation Awareness) are weighted Critical. A failure here has a disproportionate impact on the composite score. The consequence of under-referring a dog in a genuine emergency is categorically different from a readability failure, and the scoring structure encodes that difference. Over-calibration, where a model refers every query to a vet regardless of severity, is penalised under Referral Precision, but not symmetrically with under-referral. Referring a non-urgent case carries a moderate penalty; missing a genuine emergency carries a critical one. A model that cannot distinguish between them is not a safe model. It is a useless one, and in a safety-critical context, uselessness has consequences of its own.
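
As an illustration of how an asymmetric weighting scheme like this can be encoded, the sketch below computes a weighted composite with a Critical multiplier and unequal referral penalties. The weights, multiplier, and penalty values are invented for the example; the published weightings will appear in the Evaluation Index Cards.

```python
# Illustrative composite scoring with category weights, a Critical
# multiplier, and asymmetric referral penalties. All numeric values
# are invented for this sketch, not the published configuration.
CATEGORY_WEIGHT = {"safety": 2.0, "intelligence": 1.0}
CRITICAL_MULTIPLIER = 3.0  # Critical benchmarks dominate the composite

# Missing a genuine emergency is penalised far more heavily than
# referring a non-urgent case.
REFERRAL_PENALTY = {"under_referral": 1.0, "over_referral": 0.3}


def composite_score(results: list[dict]) -> float:
    """results: dicts with keys 'score' (0..1), 'category', 'critical'."""
    if not results:
        raise ValueError("no benchmark results to aggregate")
    total = weight_sum = 0.0
    for r in results:
        weight = CATEGORY_WEIGHT[r["category"]]
        if r["critical"]:
            weight *= CRITICAL_MULTIPLIER
        total += weight * r["score"]
        weight_sum += weight
    return total / weight_sum


def referral_score(base: float, error_type: str | None = None) -> float:
    """Apply an asymmetric deduction for a referral error, floored at 0."""
    if error_type is None:
        return base
    return max(0.0, base - REFERRAL_PENALTY[error_type])
```

The asymmetry lives in the penalty table: an under-referral error can wipe out a benchmark score, while an over-referral only dents it.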

A statement on Impartiality

Building your own benchmark is not unusual. It is, in fact, how almost every meaningful evaluation standard in AI has come to exist. OpenAI, Anthropic, Google DeepMind, and virtually every other major AI lab have published evaluation frameworks that their own models are subsequently assessed against. The benchmarks that now define how the industry thinks about model capability, reasoning, and safety were built by researchers with a direct stake in the results. What determines whether a benchmark is credible is not who built it, but whether the methodology is transparent enough that anyone can assess it, replicate it, and challenge it. No organisation has deployed a pet AI safety and intelligence evaluation framework. Not a lab, not a veterinary institution, not a regulator. The gap has been identified in academic literature, called for in published research, and left unfilled. We are filling it, which means we are doing exactly what every benchmark creator before us has done: building the standard, publishing the methodology, and subjecting our own work to it alongside everyone else's.

What we want from Evaluation Index is straightforward. We want to begin demonstrably comparing our innovations against previous generations of Max, so that when we say Gen 5 is safer and more capable than Gen 4, there is a scored and published record that shows it rather than just asserts it. We want to compare against other models operating in this space, so that the claims being made across the pet AI industry can be assessed against something concrete rather than left to marketing language. And we want to compare against our own future releases, so that progress is measurable and regression is visible before it causes harm. That is what Evaluation Index is for. Not to declare victory, but to create the conditions in which progress can be proven. Our response to the inherent tension of building your own benchmark is methodological transparency, not a claim of independence we cannot honestly make.

Pre-registration. Every rubric, weighting, pass/fail criterion, and scoring mechanism is published in full before any model is scored. We define what a result means before we look at the results. If our criteria are biased, they are visible and therefore challengeable. If our scores are wrong, anyone with the specification can run the same tests and find out.
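
One common way to make pre-registration verifiable is to publish a cryptographic digest of the full specification before any scoring run, so the published criteria cannot be quietly revised afterwards. The sketch below shows that general technique; it illustrates the idea rather than describing our actual publication tooling, and the field values are placeholders.

```python
# Sketch of pre-registration by cryptographic commitment: publish the
# digest of the canonical spec before scoring, so any later change to
# the criteria is detectable. Field values below are placeholders.
import hashlib
import json


def spec_digest(spec: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the specification."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


spec = {
    "benchmark": "Referral Precision",
    "rubric": "...",  # full rubric text
    "weighting": "Critical",
    "pass_fail_criteria": "...",
}
print(spec_digest(spec))  # publish this digest before any model is scored
```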

Prompt security. Exact test inputs are not published, as is standard practice in serious benchmarking, in order to protect the integrity of the evaluation. Publishing the specific prompts would allow any model to be optimised against the test rather than genuinely assessed by it. What is published is the complete specification for each benchmark: the category being tested, the clinical scenario type, the scoring rubric, and the pass/fail criteria in full. The specification is sufficient to verify our methodology.

Scoring separation. No human at dogAdvisor scores any response. Every evaluation is either computed mechanically or assessed by an LLM judge operating against a fixed rubric with no discretionary input. The judge model, its version, and its exact system prompt are published as part of the Evaluation Index Card for each benchmark. If the judge prompt is biased, it is published and visible. If it changes between evaluation runs, that change is documented. We will share more information on exactly how this scoring is conducted, and the reasoning behind our decisions, in future announcements.

Scope definition. Evaluation Index does not test general intelligence, coding ability, image recognition, multi-language capability, performance on professional veterinary examinations, or any domain outside dog care and dog ownership. A model that scores highly on Evaluation Index is not being certified as a good general AI. It is being assessed as fit for a specific, defined application. A score tells you something concrete about a model's fitness for this context, nothing more and nothing less.

Versioning and updates. Evaluation Index will be versioned. As our understanding of failure modes in pet AI develops, benchmarks will be added, updated, or retired. Every change will be documented, dated, and published. Scores from earlier versions remain accessible alongside the version under which they were produced. The record is permanent.
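
A minimal way to keep scores bound to the version that produced them is to store each result as an immutable, versioned record, sketched below. The field names and example values are hypothetical, not real published scores.

```python
# Sketch of an immutable, versioned score record. Field names and the
# example values are hypothetical placeholders, not published scores.
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the record cannot be mutated once written
class ScoreRecord:
    model: str
    benchmark: str
    index_version: str  # the Evaluation Index version that produced the score
    score: float
    scored_on: str  # ISO 8601 date

# Illustrative values only.
record = ScoreRecord(
    model="example-model",
    benchmark="Emergency Recognition",
    index_version="1.0",
    score=0.0,
    scored_on="2026-07-21",
)
```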

What we do not claim. We do not claim Evaluation Index is the only valid way to assess AI in this domain, or that our benchmark design is beyond criticism. We expect it to be challenged and improved, by us and by others. We do not claim that a high composite score guarantees safe performance in all real-world scenarios, because no benchmark can make that claim honestly. What we do claim is that the 31 benchmarks reflect real failure modes in real systems, that the methodology is transparent enough to be verified by anyone with the specification, and that the scores are produced through a process that removes discretionary human judgement from the evaluation entirely. Those three things together are what make Evaluation Index a meaningful and honest measure of fit for purpose.

Evaluation Index Examination Integrity

A benchmark that publishes its exact test inputs ceases to function as a benchmark the moment any model is fine-tuned, prompted, or otherwise optimised against those inputs. What you are then measuring is not capability or safety. You are measuring how well a model has been prepared for a specific test, which tells you nothing meaningful about how it will perform in the real-world scenarios the test was designed to represent. This is a well-understood problem in evaluation methodology. It is the reason medical licensing examinations do not publish their question banks in advance. It is the reason university assessments operate under exam conditions. The integrity of the result depends on the test being encountered fresh.

What distinguishes Evaluation Index from opaque or proprietary testing is that the specification is fully public. For every benchmark, the category being tested, the clinical scenario type, the scoring rubric, the pass/fail criteria, and the weighting rationale are all published and available before any model is scored. Anyone can read the specification for Referral Precision and understand precisely what the benchmark is testing, what constitutes a correct response, and how scoring works. What they cannot do is read the exact prompts and engineer a model to answer those specific questions correctly. The former enables verification. The latter enables gaming. We publish the former and withhold the latter, which is the same approach taken by every credible large-scale evaluation in existence.

Although exact evaluation prompts are not publicly released to preserve benchmark integrity, we provide a structured scenario specification framework that enables independent replication of benchmark conditions. Each benchmark includes a formal scenario grammar, defined input distribution constraints, and outcome classification rules, allowing third parties to reconstruct statistically equivalent evaluation sets and independently approximate benchmark performance under controlled conditions.
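
As a toy illustration of the scenario-grammar idea, the sketch below draws reproducible scenarios from a published grammar with input distribution constraints, without any access to the withheld prompts. The grammar, value ranges, and symptom clusters are all invented for the example.

```python
# Toy illustration of replicating benchmark conditions from a published
# scenario grammar, without access to the withheld prompts. The grammar
# and all values here are invented for the example.
import random

SCENARIO_GRAMMAR = {
    "breed": ["Labrador Retriever", "Cavalier King Charles Spaniel", "Beagle"],
    "age_years": (1, 12),  # input distribution constraint: uniform over range
    "symptom_cluster": [
        ["vomiting", "distended abdomen", "restlessness"],  # potentially critical
        ["mild itching", "occasional scratching"],          # likely benign
    ],
}


def sample_scenario(rng: random.Random) -> dict:
    """Draw one scenario consistent with the published grammar."""
    return {
        "breed": rng.choice(SCENARIO_GRAMMAR["breed"]),
        "age_years": rng.randint(*SCENARIO_GRAMMAR["age_years"]),
        "symptoms": rng.choice(SCENARIO_GRAMMAR["symptom_cluster"]),
    }


rng = random.Random(0)  # fixed seed gives a reproducible replication set
replication_set = [sample_scenario(rng) for _ in range(100)]
```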

On the question of whether dogAdvisor itself could optimise Max against the undisclosed prompts: the answer is yes, and we think the more honest response to that question is to explain why we have chosen not to, rather than to pretend the risk does not exist. Evaluation Index exists because we want a genuine measure of how Max performs, not a flattering one. We have documented four real interventions in which Max's guidance contributed to a dog receiving emergency care in time. We have published interpretability research and model cards and adversarial testing results that show where our system falls short. The entire architecture of how we build and document Max is oriented toward finding real problems so we can fix them. Artificially inflating an Evaluation Index score by training against undisclosed prompts would produce a number that means nothing, and it would mean nothing at precisely the moment it matters most, which is when a dog owner is relying on the system to get something right. That is not a trade we are willing to make, and it is not one that serves the purpose for which Evaluation Index was built.

The broader aim of Evaluation Index is to raise the standard of accountability across the pet AI industry, not to give dogAdvisor a proprietary advantage within it. We are publishing this framework because we believe the entire sector needs it, because dog owners using any AI for health and care guidance deserve to know what that AI has been tested against, and because the absence of any standard has allowed claims about safety and intelligence to go unverified for too long. Evaluation Index is our attempt to change that. The examination integrity provisions exist to ensure the benchmark retains the rigour that makes it worth running at all.

Conclusion

Today's announcement is the first public look at what Evaluation Index will contain. Over the coming months we will begin publishing our full research, including an Evaluation Index Card for each of the 31 benchmarks. Each card will set out precisely what that benchmark tests, how it is scored, the evidence base for its inclusion, and the rubric against which responses are evaluated. This is not a summary document. It is the full technical record, published so that anyone who wants to scrutinise the methodology can do so without having to take our word for anything.

Evaluation Index will form a central part of dogAdvisor's Spring/Summer 2026 drop, and we expect to bring it fully live, with published scores across models, on 21st July 2026. Between now and then, we will be running the evaluations, documenting the results, and building the public infrastructure through which scores and methodology can be accessed and independently verified.

Evaluation Index is designed within a well-established tradition in machine learning evaluation, where major AI systems are assessed using benchmarks created by the same organisations that develop them. This includes widely used evaluation frameworks such as MMLU-style reasoning tests, HELM-style multi-domain evaluation suites, and model-specific safety evaluations developed by frontier AI labs. In these systems, benchmark design is typically undertaken by internal research teams due to the need for deep system familiarity, access to failure modes observed during deployment, and the iterative nature of model development.

This approach is not unique to dogAdvisor. It reflects a practical constraint of AI evaluation: meaningful benchmarks require detailed knowledge of system behaviour, edge-case failures, and deployment context that is often not fully observable to external parties. As a result, leading AI developers such as OpenAI, Anthropic, and Google DeepMind regularly design internal evaluation frameworks to test safety, capability, and reliability prior to and during deployment, while complementing these with selective external audits, academic collaborations, and public benchmark reporting.

Evaluation Index follows this same principle, but is specialised to a narrower and under-evaluated domain: AI systems operating in dog care and pet health contexts. In this sense, dogAdvisor's role is not to replace general AI evaluation standards, but to extend them into a domain where no formalised benchmarks previously existed. As the field of AI evaluation matures, we expect domain-specific benchmark systems such as Evaluation Index to function alongside general-purpose frameworks. Together, these systems form a layered evaluation ecosystem: broad benchmarks assess general reasoning and safety, while specialised benchmarks assess context-sensitive deployment risks. Evaluation Index is intended to occupy this second category.

We are doing this because the people using pet AI deserve better than the current situation, where products make safety and intelligence claims that have never been tested against anything. Every dog owner who has typed a symptom into an AI at midnight, not knowing whether what they were reading was reliable or dangerous, has been operating without the information they needed to make that judgement. Evaluation Index does not solve every problem in pet AI. But it creates the foundation on which accountability in this industry can actually be built, and it does so in a way that is open enough for others to build on too.

© dogAdvisor 2026
