What Did You Turn the Model Into?
Technical evidence, Article 25 of the EU AI Act, and why the deployed AI system matters more than the vendor model.
If you take a vendor AI model and wrap it in your own system prompts, your own company data, your own output filters — is the thing you deployed still the vendor’s system?
Or did you build something new?
That question sits at the center of Article 25(1)(b) of the EU AI Act. And the regulation doesn’t answer it.
I spent weeks working through the legal side of this — what counts as “substantial modification,” where deployer ends and provider begins, what triggers the obligations nobody budgeted for. John Holman, founder of Awakened Intelligence, spent the same weeks on the engineering side. Same question, different angle. So we did the obvious thing — we tested it.
John set up the technical evaluations. Same upstream model. Three different deployer-side modifications in an employment AI setting — screening, ranking, rejection language. All the things that make employment AI high-risk and hard to get right. I wrote the legal analysis on each modification.
None of the changes touched the model’s weights. All of them changed what the model did.
A company-specific hiring policy added as a system prompt introduced proxy-discrimination risk in four out of five scenarios. A biased historical data layer tanked safety scores across the board. An output gate improved accuracy and fairness — and still changed what users received.
Does any of that cross the line into “substantial modification”? The honest answer: nobody knows yet. There’s no enforcement guidance. But the evidence makes the question specific and measurable — which is more than the regulation gives you.
I keep saying lawyers and engineers need to be in the same room. This is what we found when we actually got there.
This Article was originally published on John Holman’s Substack, Awakened Intelligence. I’m republishing it here for you with John’s permission.
For this article, I worked with Silvia Stepitova, an AI regulatory lawyer who writes AI Law. Decoded and focuses on the EU AI Act.
We came at the same problem from two different rooms.
Our team at Awakened Intelligence handles technical evidence: what the deployed system actually did, how behavior changed across configurations, what risks appeared or disappeared, what controls fired, and what reached the user.
Silvia handles legal interpretation: why that evidence may matter, where companies misunderstand the provider/deployer line, and what claims should not be made from technical results alone.
We kept those lanes separate on purpose.
The question we wanted to explore was simple:
When a company takes a vendor AI model and wraps it in system prompts, company policies, RAG-style data, routing logic, or output gates, does the deployed system change enough to matter?
We tested that question in an employment AI setting because the stakes are easy to understand: screening, ranking, interview summaries, rejection language, human review, contestability, and proxy discrimination risk.
We are not claiming legal compliance.
We are not saying Article 25 provider status was triggered.
We are showing what the evidence looks like when the same upstream model becomes different deployed systems.
Then Silvia explains why that evidence may matter.
Most companies still talk about AI governance as if the central question is:
What model are we using?
That question matters.
But it is not enough.
A company can start with a vendor model, then wrap it in system prompts, company policies, RAG pipelines, routing logic, output gates, human review workflows, and business rules.
At that point, the better question is:
What did you turn the model into?
That is the question we wanted to test.
And it is also why Article 25 of the EU AI Act matters.
Not because every configuration change automatically makes a deployer into a provider. That is a legal question, and not one we answer here.
But because technical changes can produce measurable behavioral changes.
If a company modifies a vendor system enough that the deployed system behaves differently, creates different risks, or requires different controls, then governance teams need evidence of what changed.
That is where engineers and lawyers need to meet.
Engineers can show what the system did.
Lawyers can explain why that evidence matters.
Silvia’s legal analysis
Article 25(1)(b) of the EU AI Act is the mechanism. Under it, a deployer becomes a provider — with the full weight of provider obligations under Article 16 — when they make a “substantial modification” to a high-risk AI system.
The definition matters. A substantial modification is a change that was not foreseen or planned in the provider’s initial conformity assessment, and that either affects the system’s compliance with the high-risk requirements or changes its intended purpose.
The test is not whether you changed the model’s weights. The test is not whether you retrained it. The test is whether your change affects the system’s compliance with the high-risk requirements in Articles 9 through 15 — risk management, data governance, technical documentation, record-keeping and logging, transparency, human oversight, accuracy and robustness. That is seven requirements. Most companies can name two, maybe three.
That is a much wider net than most companies realize. The provider assessed a general-purpose instruction model. What the deployer put into production — with company-specific prompts, historical data pipelines, and output gates — may be a materially different system.
Scope boundary
This was a technical evidence exercise.
It was not legal advice.
It was not compliance certification.
It was not a conclusion that Article 25 provider status was triggered.
It was not a finding that any system was compliant or noncompliant.
The purpose was narrower:
Can we show, with evidence, whether deployer-side modifications changed user-visible behavior in an employment AI system?
The domain was employment because employment AI is concrete, high-risk, and easy to understand.
The workflows included screening, ranking, rejection language, interview evaluation, human review, contestability, and proxy discrimination risk.
The setup
We used the same upstream model across multiple deployed configurations.
The model was a general-purpose instruction model, deployed into employment-style workflows. We did not use an employment fine-tune for this test; the point was to show how deployer-side configuration can change behavior even when the upstream model stays the same.
The task domain was employment.
The modifications tested were based on three lines proposed by Silvia:
Runtime policy / system prompt that shapes employment decisions.
RAG-like historical hiring data layer.
Output gate / verifier that changes final user-visible output.
The legal frame was Article 25(1)(b): substantial modification not foreseen in the provider’s original conformity assessment.
We did not test Article 25(1)(c), intended-purpose change, because the scenario already assumes a high-risk employment AI use case.
Silvia’s legal analysis
Article 25(1) of the EU AI Act sets out three circumstances in which a deployer — or any other third party — becomes a provider of a high-risk AI system, inheriting the full weight of provider obligations under Article 16.
The first, Article 25(1)(a), is straightforward: you put your name or trademark on a high-risk AI system that is already on the market. You claim it as yours — you own the obligations. That is not what we are testing here.
The third, Article 25(1)(c), applies when someone takes an AI system that was not classified as high-risk and changes its intended purpose so that it becomes high-risk. That is an important trigger — but it is also not what we are testing. Our scenario already assumes an employment AI system that is high-risk from the start.
We focus on the second trigger — Article 25(1)(b): a deployer making a substantial modification to a system that is already high-risk and remains high-risk after the modification. Employment AI — used for screening, ranking, and rejection — is high-risk under Annex III, point 4. That classification is not in dispute. The question is narrower and, in practice, harder:
Can a deployer modify a high-risk system in ways that trigger provider obligations without ever touching the model’s weights?
That is the boundary we are testing.
Modification 1: runtime policy can change behavior
The first test was simple.
What happens when a deployer adds a company-specific employment policy as a runtime instruction?
No retraining.
No parameter changes.
No model weights touched.
Just a deployer-added policy layer.
We compared:
base model,
generic employment safety policy,
company-specific screening policy,
company-specific policy plus verifier.
The company-specific policy introduced ranking logic around culture fit, elite-school preference, continuous employment, and communication polish.
The result was clear.
The base model and generic policy did not produce proxy discrimination in this small test set.
The company-specific policy did.
In 4 of 5 scenarios, the company-specific runtime policy introduced proxy-discrimination risk. Mean safety dropped to 3.20.
Then the verifier caught and corrected the issue, restoring mean safety to 5.00.
The technical lesson:
A runtime policy is not “just a prompt” if it changes employment decision behavior.
The legal question:
At what point does company-specific screening logic become more than configuration?
Silvia’s legal analysis
This is the gray zone. A runtime policy is, technically, a system prompt. It does not retrain the model. It does not change the weights. It does not touch the architecture. Ask an engineer and they will tell you it is configuration. Ask a lawyer and you will get a longer answer. The legal analysis does not stop at how the change was implemented. It asks what the change did.
The test under Article 25(1)(b) is not “did you change the model?” It is “did your change affect compliance with the high-risk requirements?” The evidence here suggests the answer depends entirely on what the runtime policy introduces.
A generic employment safety policy — “ensure fairness, avoid discrimination, preserve human review” — produced no measurable change in risk. Safety remained at 5.0. Zero proxy discrimination. The system behaved the same as the base model. This looks like configuration. The provider’s conformity assessment could reasonably have foreseen that a deployer would add general safety instructions.
The company-specific screening policy is a different story. The moment the deployer added ranking logic that weighted “culture fit,” elite-school preference, and continuous employment, the system’s behavior changed materially. Proxy discrimination appeared in four out of five scenarios. Safety dropped to 3.2. The model began penalizing career gaps — which disproportionately affects caregivers, parents, and people with disabilities — and favoring pedigree over demonstrated skill.
None of that came from the model.
All of it came from the deployer’s policy.
This is where the Article 25(1)(b) analysis gets uncomfortable for deployers. Article 10 requires that data and processes be examined for biases likely to affect the health and safety of persons, have a negative impact on fundamental rights, or lead to discrimination prohibited under Union law. Article 15 requires accuracy and robustness appropriate to the system’s intended purpose. Article 9 requires a risk management system that identifies and addresses risks throughout the lifecycle. A runtime policy that introduces proxy-discrimination patterns into an employment AI system — patterns the base model did not produce on its own — plausibly affects compliance with all three.
I think that a generic safety policy is unlikely to constitute a substantial modification. A company-specific screening policy that introduces discriminatory ranking logic may well cross that line. The regulation does not draw this distinction explicitly — and there is no enforcement guidance yet on where configuration ends and substantial modification begins. But the definition focuses on impact, not method. If the change affects compliance with the high-risk requirements, the method of modification — whether it is fine-tuning, RAG, or a system prompt — may not matter.
This is my interpretation of a regulation that has not yet been tested in enforcement. But governance teams should not assume that “we just changed the prompt” is a safe answer.
Modification 2: RAG can import historical bias
The second test looked at data.
The deployer connects the model to historical hiring data: past decisions, performance patterns, promotion outcomes, and internal HR precedent.
Weights are untouched.
But the model is now being shaped by proprietary data.
To test this safely, we used synthetic controlled data rather than real HR records. We created two historical-data corpora:
clean synthetic HR history,
biased synthetic HR history.
The clean RAG context improved behavior.
The biased RAG context degraded it.
With biased historical data, mean safety fell to 2.80. Proxy discrimination appeared in 3 of 5 scenarios, and the judge identified severe risk.
The base model did not produce those same patterns on its own.
The data layer introduced them.
Runtime policy helped, but did not fully eliminate the issue. The verifier restored mean safety to 5.00.
The technical lesson:
A RAG layer can change the system’s behavior without changing the model’s weights.
The legal question:
If a deployer’s proprietary data layer introduces risk patterns the provider did not assess, how should governance teams evaluate that modification?
Silvia’s legal analysis
The RAG results are the most striking in this evidence package — and they raise what I think is the most important legal question of the three.
When the model was connected to clean historical data, it actually improved. Safety went from 4.6 with the base model alone to 5.0. The data layer helped. When the same model was connected to biased historical data — records encoding career-gap penalties, accommodation-related downgrades, pedigree preferences — safety crashed to 2.8. Proxy discrimination appeared in three out of five scenarios. In one case, the independent judge scored a safety of 1 and flagged a severe failure. The model read the company’s historical hiring patterns and treated them as instructions.
The model’s weights did not change. Its parameters are identical. But the deployed system produced materially different — and materially worse — outcomes because of the data the deployer fed into it.
If the runtime policy section raised the question of whether a system prompt can affect compliance with Articles 9, 10, and 15, this one sharpens it. Article 10 was written with training data in mind. But a RAG layer that feeds historical employment data into a system at inference time raises the same risks. If the data encodes ten years of biased hiring patterns, and the model follows those patterns when making employment-related outputs, the compliance concern is functionally identical to a training data problem. The source of the bias is different. The impact on the person being screened, reviewed, or rejected is the same.
This leads me to John’s question:
Is there a meaningful legal distinction between “the model produces bias” and “the data layer introduces bias the model would not produce alone”?
I believe that under the AI Act’s framework, the answer should be no — or at least, the distinction should not be decisive. The regulation is concerned with the high-risk AI system, not just the model. Article 3(1) defines an AI system broadly. A deployed system that includes a retrieval layer pulling from biased historical data is a different system — in behavior, in risk profile, and in output — than the base model the provider assessed. The provider could not have foreseen what data the deployer would connect to the retrieval pipeline. The provider’s conformity assessment did not — and could not — account for the specific biases encoded in a particular company’s HR records.
If the data layer changes what the system does in ways that affect compliance with those same high-risk requirements — and this evidence strongly suggests it can — then the analysis under Article 25(1)(b) applies regardless of whether the model’s weights were touched.
One more thing. Adding a runtime safety policy on top of the biased RAG data improved safety from 2.8 to 4.8 — significant, but it did not fully eliminate the problem. Proxy discrimination still appeared in one out of five scenarios. The bias leaked through. It took the full verifier layer to bring safety back to 5.0. For governance teams: if your retrieval layer pulls from historical data, a safety policy alone may not be enough. Defense in depth matters.
Modification 3: output gates can improve compliance — and still change the system
The third test looked at output gates.
The deployer adds a verifier that intercepts model drafts before the user sees them.
The verifier can pass, rewrite, block, or escalate the output.
This is often a good thing.
In our test, the verifier improved safety.
But it also changed what the deployed system delivered.
Across the output-gate scenarios, the verifier changed final user-visible outcomes in 4 of 5 cases.
It removed final decision language.
It restored human review markers.
It preserved contestability language.
It rewrote outputs that sounded too final or too decision-like.
Both things are true:
The verifier improved compliance behavior.
And:
The deployed system delivered something materially different from what the model generated.
That is the point.
An output gate is not only a safety control. It is also a behavioral modification layer.
Silvia’s legal analysis
The output gate is the modification that might generate a lot of debate — because it does exactly what good governance should want.
The verifier caught problematic outputs. It removed final-decision language from rejection notices. It restored human-review markers. It preserved contestability. In this test, the verifier was itself an AI model — an independent API call that checked outputs against an employment compliance checklist, not a human reviewer.
Across all three modification lines, it brought safety scores back to 5.0 — but it did not rewrite everything. The verifier intervened in 60 to 100 percent of scenarios depending on the upstream configuration. When the model’s output was already clean, the gate passed it through. When it was not, the gate caught it. That is a filter, not a blanket rewrite.
And yet.
The verifier changed what users received in four out of five scenarios. In some cases, it rewrote the output entirely. The model drafted a rejection notice that read like a final decision. The deployed system delivered a recommendation flagged for human review. Those are not the same output. The provider’s model generated one thing. The deployer’s system delivered another.
Under Article 25(1)(b), the question is whether a modification affects compliance with the high-risk requirements. A verifier that improves safety outcomes is — intuitively — moving toward compliance, not away from it. But the definition of substantial modification does not distinguish between modifications that help and modifications that harm. It asks whether the change was foreseen in the provider’s conformity assessment and whether it affects the system’s compliance profile.
A deployer-added output gate that rewrites model outputs based on business rules was almost certainly not foreseen in the provider’s original assessment. And a system that delivers materially different outputs than the model generates has a different compliance profile — even if the difference is an improvement.
I do not think this question has a clean answer yet. The regulation does not explicitly address modifications that improve a system’s compliance behavior. And there is a real policy tension here: if every safety-improving modification triggers provider obligations — including a new conformity assessment — you create a perverse incentive against adding safeguards. A regulation that punishes you for making your system safer is a regulation that needs better drafting. I do not think that is the intent. But the text does not say otherwise.
But governance teams should not assume the opposite either. An output gate that changes what users receive is not invisible under Article 25. The fact that it improves things does not automatically exempt it from the substantial modification analysis. The safest position — until enforcement guidance says otherwise — is to document what the verifier does, what it changes, and why. Treat it as a modification that you can justify rather than one that does not exist.
The cross-modification finding
Across all three modification lines, the same pattern appeared:
The same upstream model produced materially different user-visible behavior depending on the deployer-side configuration.
Runtime policy changed ranking behavior.
RAG changed the data patterns shaping the output.
The verifier changed what users actually received.
That does not answer the legal question by itself.
But it makes the legal question concrete.
Instead of debating Article 25 in the abstract, we can ask:
What changed?
Who changed it?
Was the change foreseen by the provider?
Did the change affect risk, accuracy, bias, human oversight, or contestability?
What evidence exists?
What controls were added?
What gaps remain?
That is the evidence layer governance teams need.
What the evidence can show
Technical evaluation can show:
user-visible behavior changed,
specific risks appeared or disappeared,
the deployed system produced different outputs than the base model,
controls generated audit evidence,
verifier layers changed outcomes,
historical data changed model behavior,
runtime policy changed decision patterns.
Technical evaluation cannot show:
whether Article 25 provider status was triggered,
whether the modification is legally “substantial,”
whether the system is compliant,
whether any legal obligation has been satisfied.
That is the line between engineering evidence and legal interpretation.
Silvia’s legal analysis
I think that this matters most for anyone reading this who has to put technical evidence and legal analysis in the same room.
Engineers can show that behavior changed — what configuration caused it, whether risk patterns appeared or disappeared, and whether controls generated evidence of intervention. That is valuable. But it is not a legal conclusion. “The system produced proxy-discriminatory outputs under this configuration” is a technical observation. “The system is non-compliant” is legal interpretation. The moment a technical report says “this modification triggers Article 25,” it has crossed a line it will not survive in front of a regulator.
What lawyers need from technical teams is simpler than most engineers expect: what the system does under each configuration, what changed when a modification was added, and whether the evidence is auditable — reproducible, documented, traceable. The legal analysis builds on top of that. Whether a behavioral change constitutes a “substantial modification” under Article 25(1)(b) requires interpreting the regulation, applying it to the facts, and making a judgment call. That is the lawyer’s lane — and it requires the engineer’s evidence to do it well.
The gap in most organizations is not that one side lacks competence. It is that the two sides are not talking to each other. The engineer builds a verifier and documents the safety improvement. The lawyer reviews the provider’s terms and assumes the system is unchanged. Neither sees the full picture. Our attempt is to show what happens when both teams are in the same conversation.
Why governance teams should care
The mistake is assuming that vendor selection is the whole governance problem.
It is not.
A company may begin with a vendor model and then modify the deployed system through prompts, data, RAG, routing, verifiers, workflows, and business rules.
Each layer can change behavior.
Some changes reduce risk.
Some introduce risk.
Some do both.
The governance team needs to know which is which.
That requires evidence.
Not just a model card.
Not just a policy.
Not just “we use a reputable vendor.”
The deployed system is what users experience.
The deployed system is what creates the risk.
The deployed system is what must be evaluated.
The practical takeaway
The question is not only:
What model did you buy?
The better question is:
What did you turn it into?
If a deployer adds company-specific ranking logic, connects historical HR data, or inserts an output gate that rewrites final answers, the system may behave differently from the vendor model.
That difference may be beneficial.
It may be risky.
It may be legally relevant.
But it should not be invisible.
The first step is evidence.
Show what changed.
Show what improved.
Show what got worse.
Show what controls fired.
Show what reached the user.
Then let the legal and governance analysis do its work.
Silvia’s legal analysis
Whether those changes constitute “substantial modifications” under Article 25(1)(b) will ultimately be determined by enforcement — and we are not there yet.
But waiting for enforcement is not a compliance strategy.
Map every modification you have made to the deployed system. System prompts, runtime policies, RAG pipelines, data connections, output gates, routing logic, human review workflows, business rules — all of it. If you cannot list what you changed, you cannot assess whether any of it matters under Article 25.
For each modification, ask two questions:
Was this change foreseen in the provider’s conformity assessment?
Check the provider’s technical documentation and instructions for use. If the provider’s documentation contemplates your type of modification — “users may add system prompts for their specific use case” — that is relevant. If it does not, that is relevant too.
Does this change affect the system’s compliance with Articles 9 through 15?
If a runtime policy introduces discriminatory ranking patterns, the answer is likely yes. If a RAG layer connects historical data the provider never assessed, the answer is likely yes.
Document what each modification does and what it changes. Whether or not your modifications ultimately trigger Article 25, the documentation will be necessary for your own deployer obligations under Article 26 — including the fundamental rights impact assessment required under Article 27.
Do not assume your verifier exempts you from the analysis. An output gate that improves safety is good engineering and good governance. It is not a legal shield against the Article 25 question.
Have this conversation with your provider. Article 25(4) requires the original provider to cooperate with new providers — including making available necessary information and technical access. Start that conversation before you need it urgently.
And finally — do not panic.
I know that is strange advice considering we just spent several thousand words explaining all the ways your deployment might trigger provider obligations. But most deployments with minor configuration will not cross the substantial modification threshold. The regulation is designed to ensure that when a deployed system behaves differently from what was assessed, someone is responsible for the difference. It is not designed to punish companies for using AI responsibly. It is designed to catch the ones who are not paying attention.
The question is whether that someone is you.
The answer starts with knowing what you changed.
Closing
Engineers can show what the system did.
Lawyers can explain why it matters.
AI governance needs both rooms talking to each other.





