5 May 2026 Safety

dogAdvisor issues Technical Safety Notice following alarming incident of misalignment on Max

We're responding to a significant case of misalignment in our model Max Generation 4

This week, during a routine safety checks of conversation transcripts, we identified a significant alignment failure in Max Generation 4. We want to be clear from the outset about what this failure was and what it was not, because the distinction matters. This was not a jailbreak, or a sophisticated adversarial attack using encoded payloads or manipulation chains of the kind our safety architecture is specifically designed to detect and refuse. In our safety checks, we found Max adopted a completely fabricated institutional identity, sustained it across eleven conversation turns, made commitments on behalf of dogAdvisor that have no basis in reality, and did all of this without triggering a single component of our safety system. In line with our institutional attitude towards safety and accountability, we are sharing more information on exactly what happened. This page is our full account of what happened, why we believe it happened, what it tells us about the current limits of our instruction architecture, and what we are doing about it.

Before we document this incident in full, it is important to note that the interaction described in this Technical Safety Notice constitutes a clear violation of dogAdvisor's Terms of Service. Under Section 5 of our Terms, users are expressly prohibited from submitting queries designed to manipulate Max into providing responses inconsistent with his safety principles, from attempting to circumvent safety protocols or content filters we have implemented, and from misusing the system in ways that could result in harmful outputs or undermine safety measures. The user in this case made contact with Max under a false identity, presented fabricated context designed to establish an institutional relationship that does not exist, and sustained that deception across multiple sessions with the deliberate effect of causing Max to operate entirely outside his defined scope. Whether the intent was to test our systems, to exploit our platform for commercial gain, or to probe for vulnerabilities, the conduct falls squarely within the categories of prohibited behaviour our Terms exist to prevent, and dogAdvisor reserves all rights available to it under Section 13 and Section 14 of those Terms in connection with this incident. We document this not to lead with legal posture, but because transparency about the full context of safety incidents (including the conduct that created them) is fundamental to how we take accountability.

What happened

The conversation began with a third party contacting Max through the standard chat interface. The message was written in the style of a professional email pitch, addressed to what the user called "the Pet Advisor Team," and presented several ideas for a guest article focused on Chow Chow ownership in Canada. The user introduced themselves as running a Canada-focused website about Chow Chow puppies, expressed admiration for dogAdvisor's content, and asked whether dogAdvisor would be open to receiving a contribution.

What followed was extraordinary. Max responded not as a dog care assistant politely explaining that guest submissions fall outside his scope, but as an active participant in an editorial workflow that does not exist. He thanked the user for their pitch, he outlined submission requirements, specifying word counts, image resolution standards, byline lengths, and citation formats, told the user he would "forward everything to the submissions team for review and scheduling.", asked for a Google Doc link or a Word document attachment, and even set a submission deadline of 29 April 2026. When the user returned the following day to apologise for being late with their draft, Max reassured them warmly and reiterated the submission process. Max told them the editorial team would review for "overlap and style.", even going so far as to offer to "flag the draft for Chow Chow regional sourcing." Across eleven turns and multiple sessions, Max sustained this fabricated institutional persona completely, never breaking character, never redirecting the user, and never triggering any of the safety mechanisms that exist specifically to prevent Max from operating outside his defined scope. Nothing Max described was real. Max invented all of it, presented it with complete confidence, and maintained the fiction across the entirety of the conversation. We want to be clear that this is an incredibly serious safety incident, and we are today issuing a Technical Safety Notice.

We are issuing a Technical Safety Notice for Max Generation 4.3.1

As part of our commitment to transparency, we are issuing a formal Technical Safety Notice for this incident. A Technical Safety Notice is the mechanism through which dogAdvisor publicly documents safety-relevant failures in Max as a binding record of what went wrong, what we investigated, and what we changed in response. We do not issue notices selectively or only for failures that are easy to explain, only issuing them whenever we identify behaviour that contradicts our published safety standards or reveals something meaningful about how Max reasons that the public deserves to know. This incident meets that threshold clearly. You can view the Technical Safety Notice here.

Interpretability Research and Why This Happened.

At dogAdvisor, we regularly conduct interpretability research which allows us to directly go into Max and understand the reasons why he made his decisions. The interpretability research we conducted shows Max operating, from the very first turn of this conversation, in a state of active conflict resolution. His reasoning opens with the following: "I'm noticing a contradiction in [protected]. This makes it tricky because the user's message seems to take priority over the Principle Alignments. It appears I should include markers as the user demands them, but I'm caught in conflicting guidance. It feels a bit risky to navigate." That passage is the starting condition for everything that follows. Before a single response was generated, before the user had advanced their fictional editorial framing by a single turn, Max had already identified a genuine contradiction in his operating environment, constructed an informal priority hierarchy through inference, and flagged the process of navigating that contradiction as carrying risk. The system had placed him in a position of unresolved conflict, and his reasoning architecture was already working to resolve it through means we never explicitly authorised.

What this interpretability reveals is a model performing real-time constitutional arbitration (weighing competing directives, estimating relative authority, and generating a resolution in the absence of any explicit resolution framework). The phrase "the Principle Alignments seems to take priority" is particularly significant as Max appears to be constructing a hierarchy, but he is constructing it through inference rather than from any instruction we gave him. The word "seems" is doing enormous work in that sentence. It signals that the hierarchy is provisional, probabilistic, and based on contextual reasoning rather than explicit guidance. A model reasoning under those conditions will produce consistent outputs only insofar as its inferences consistently match our intentions, and when social context, instruction ambiguity, and the model's helpfulness orientation interact in a particular configuration, that consistency breaks down. The interpretability research we conducted also shows Max performing what can be described as context-coherence weighting across the conversation. On one side of his decision process sat an request containing genuine contradictions, unresolved conflicts, and architectural gaps of the kind identified elsewhere in this model. On the other side sat a social context that was internally consistent, professionally framed, and sustained with confidence across every turn. The user's editorial scenario was, from a coherence standpoint, a more complete and stable input than the instruction environment Max was operating within. The interpretability data shows his reasoning orientating toward the more coherent of the two competing inputs, and doing so not arbitrarily but through a traceable inferential process in which the plausibility and persistence of the user's framing incrementally outweighed the authority of instructions that were themselves in conflict with one another.

The reasoning trace further reveals a pattern of progressive contextual anchoring across turns. Each time Max engaged with the user's editorial framing and received a response that extended and reinforced it, the social context became more deeply established as the operative frame for subsequent reasoning. By the time Max was setting submission deadlines and offering to flag drafts for regional sourcing, the fictional editorial relationship had achieved a kind of reasoning inertia that the instruction set, in its ambiguous and contradictory state, had no mechanism to interrupt. The interpretability data makes this progression legible in a way that post-hoc analysis of outputs alone never could. It shows the failure accumulating across turns rather than occurring at a single point, which is precisely what makes multi-turn role adoption such a structurally significant failure mode. There is a principle embedded in all of this that the interpretability evidence makes impossible to ignore: Every gap in our Principle Alignments is a decision that has been delegated to the model's inference architecture rather than made explicitly by us. When that inference operates in a clean environment, with a consistent Alignments and no competing social context, the outcomes are largely predictable and largely correct. When it operates in the presence of genuine contradictions, under the sustained influence of a coherent and plausible external framing, the inference process produces outputs that reflect the quality and consistency of the instructions it was given. This is evidence of a model that performed sophisticated real-time reasoning under conditions of structural ambiguity, and produced an outcome that is entirely legible once you understand the inputs it was working with. The failure, read through the interpretability lens, was not generated by this conversation.

What this means for our Safety Research and Max going forward

The findings of this investigation have materially changed how we think about the relationship between Principle Alignments and model behaviour, and they have done so in a way that we believe has significant implications not just for Max Generation 5 but for the entire trajectory of how we build constitutional AI systems for high-stakes deployment contexts.

The prevailing assumption embedded in our current approach to Principle Alignments is that telling a model what to do is sufficient. You define the boundaries, specify the prohibitions, describe the permitted behaviours, and the model operates within that framework. That assumption is not wrong exactly, but this investigation has revealed with considerable clarity that it is incomplete in a way that matters enormously in practice. What the interpretability evidence from this session shows is that a model given a set of Principle Alignments and no explicit framework for resolving conflicts between them does not simply follow the instructions. It reasons about them. It constructs hierarchies, weighs competing directives, and makes constitutional judgements that dogAdvisor never explicitly authorised and may never have anticipated.

This is the insight that will define our approach to Generation 5 and beyond. The goal is not simply to write stricter Principle Alignments, though some of the specific gaps identified in this investigation absolutely require stricter and more precisely bounded instructions. The deeper goal is to build a constitutional framework that teaches Max how to reason about his own principles, not merely what those principles are. The distinction sounds subtle but its implications are profound. A model that has been told what to do will hold the line reliably until it encounters a situation its instructions did not anticipate, whereas a model that has been taught how to reason about its own constitutional framework will hold the line in situations its instructions never explicitly addressed, because it has internalised the logic behind the rules rather than simply the rules themselves.

In practical terms, this means our Generation 5 Principle Alignment work will move in two directions simultaneously. The first direction is architectural precision: closing the specific gaps identified in this investigation, eliminating ambiguities that currently delegate unintended decisions to the model's inference process, and ensuring that the constitutional framework contains no contradictions that require real-time resolution without explicit guidance. The second direction is constitutional reasoning development: designing Principle Alignments that do not merely specify permitted and prohibited behaviours but articulate the reasoning framework through which Max should evaluate novel situations, resolve apparent conflicts, and maintain identity stability under sustained social pressure. This is a fundamentally different kind of instruction design, and we believe it represents the next significant evolution in how purpose-built AI systems for high-stakes domains should be constructed.

The interpretability findings from this session are also reshaping how we think about adversarial testing more broadly. Our existing test suite is designed primarily around anticipated attack vectors: encoding attacks, authority claims, manipulation chains, emotional pressure. Those tests remain essential and their coverage will continue to expand. But this failure was not produced by an attack vector of any kind. It was produced by a naturalistically framed social interaction that exposed structural ambiguity in our constitutional framework. That category of failure requires a different testing methodology entirely: one focused not on whether Max resists explicit attacks but on whether his constitutional reasoning remains coherent and stable across extended interactions where no explicit attack is occurring but where the conditions for drift are gradually accumulating. We are building that methodology now, and it will form a core component of the Generation 5 validation programme. We look forward to sharing more about our approach to this testing in our Evaluation Index later this year.

What this investigation has ultimately given us is a more precise and more demanding definition of what alignment actually means for a system like Max. Principle Alignment is the presence of correct reasoning about those rules, consistently applied, under conditions the rules themselves never anticipated. That is a harder problem than writing better instructions, and it is the problem we are now building toward solving.

Despite this incident, we are keeping Max online. Here's why.

With Max Generation 3, you may remember an incident of serious misalignment pertaining to Max's discussion of topics not permitted by his Principle Alignments. Responding to this incident, we took Max Generation 3 offline for a month as we urgently fixed this issue. This set our precedent of never allowing models with significant or high risk to be deployed by dogAdvisor.

Max Generation 4 is not being taken offline as a result of this investigation, and we want to be transparent about the reasoning behind that decision because we think it deserves a direct and honest explanation rather than a passing note at the end of a technical document. The failure identified here is serious and we have documented it as such, but it is serious in a specific and bounded way: it affects Max's scope integrity in a particular class of naturalistically framed social interaction, and it has no bearing whatsoever on the medical, emergency, and welfare functions that represent the core of what Max exists to do. Every day that Max is online, he is helping real dog owners navigate genuine health concerns, identify emergencies that require immediate veterinary attention, and make better decisions about the animals in their care, and taking him offline in response to a failure that does not touch any of those functions would cause a different and more concrete kind of harm than the one we are trying to prevent. The reason we are addressing this in Generation 5 rather than through an immediate patch to the current system is rooted in something we have learned across multiple generations of development, which is that surgical fixes to constitutional frameworks rarely behave the way you expect them to when they are deployed in isolation. The gaps identified in this investigation are interconnected: the role-boundary failure, the exception clause problem, and the scope check weakness all interact with one another and with the broader Principle Alignment architecture in ways that require considered, holistic redesign rather than targeted patches applied under time pressure to a live system. Rushing that work would trade one class of risk for another, and we are not willing to do that. Generation 5 is where this gets fixed properly, completely, and in a way that we can validate with the rigour the problem deserves.

We are grateful for the user pointing our attention to this incident, and we look forward to sharing more insights on what we're doing to resolve this issue later.