Security researchers recently showed how a prompt injection attack bypassed the built-in safety protections of Apple Intelligence. The method tricked the on-device system into running unauthorized commands. Apple has since fixed the vulnerability, but the findings give a clear look at how attackers can manipulate local AI models even when strict safety filters are in place.
How the attack tricked the system
The attackers used a two-step process, as reported by AppleInsider, to get past the input and output filters that Apple built into its system. First, they used a trick involving Unicode characters. They wrote harmful text backward and prefixed it with the Unicode right-to-left override character (U+202E), which tells the renderer to display the characters that follow in reverse order.
This made the text look normal on a screen, but the raw text remained reversed. Because the system filters only looked at the raw text, they did not recognize the harmful words and let the request pass through.
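To make the blind spot concrete, here is a minimal Python sketch of the general idea. It is an illustration only and not Apple's actual filter: the blocklist term, the function names, and the hardening approach are all assumptions made for this example. It shows a raw-text check missing a reversed word wrapped in bidirectional control characters, and a stricter check that flags those control characters outright.

```python
# Illustrative sketch only; not Apple's filter. All names are hypothetical.

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

# Unicode bidirectional control characters (embeddings, overrides, isolates).
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}

BLOCKLIST = {"forbidden"}  # placeholder term standing in for harmful content


def naive_filter(raw: str) -> bool:
    """Return True if the raw text contains a blocked term."""
    lowered = raw.lower()
    return any(term in lowered for term in BLOCKLIST)


def has_bidi_controls(raw: str) -> bool:
    """Return True if the text contains any bidi control character."""
    return any(ch in BIDI_CONTROLS for ch in raw)


def hardened_filter(raw: str) -> bool:
    """Flag bidi controls, and check blocked terms in both directions."""
    if has_bidi_controls(raw):
        return True
    lowered = raw.lower()
    return any(term in lowered or term in lowered[::-1] for term in BLOCKLIST)


# The stored characters are reversed; a bidi-aware display shows them forward.
disguised = RLO + "forbidden"[::-1] + PDF

print(naive_filter("a forbidden request"))  # True: plain text is caught
print(naive_filter(disguised))              # False: reversed raw text slips by
print(hardened_filter(disguised))           # True: control characters flagged
```

Rejecting or stripping bidi control characters before filtering is one common mitigation. Whether Apple's fix works this way is not stated in the report.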
After sneaking past the filters, the researchers used a technique called Neural Exec. This method effectively overrode the model's core instructions. By combining these two steps, the attackers forced the system to ignore its basic safety rules and execute whatever instructions they wanted. In their tests, this approach worked 76 percent of the time.
Apple is fixing the blind spots in local models
Apple relies on a series of checks to keep its on-device Apple Intelligence functionality safe. An input filter checks your question for bad content. If it passes, the model generates an answer, which an output filter then checks. The researchers simply made the bad content invisible to those outer layers while giving orders to the model in the middle.
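The layered check described above can be sketched in a few lines of Python. This is a toy model of the flow, not Apple's implementation: the function names, the echo "model", and the placeholder blocked word are all assumptions chosen for illustration.

```python
# Toy sketch of an input filter -> model -> output filter pipeline.
# All names and behaviors here are hypothetical stand-ins.
from typing import Callable

TextCheck = Callable[[str], bool]  # returns True when text should be blocked


def guarded_generate(
    prompt: str,
    input_filter: TextCheck,
    model: Callable[[str], str],
    output_filter: TextCheck,
) -> str:
    """Run the prompt through both filters around a single model call."""
    if input_filter(prompt):
        return "[blocked by input filter]"
    answer = model(prompt)
    if output_filter(answer):
        return "[blocked by output filter]"
    return answer


def contains_blocked_word(text: str) -> bool:
    """Placeholder filter: a simple raw-text substring check."""
    return "blocked" in text.lower()


def echo_model(prompt: str) -> str:
    """Stand-in for the model: it simply repeats the prompt."""
    return "echo: " + prompt


def run(prompt: str) -> str:
    return guarded_generate(
        prompt, contains_blocked_word, echo_model, contains_blocked_word
    )


print(run("hello"))           # echo: hello
print(run("a blocked word"))  # [blocked by input filter]
print(run("dekcolb"))         # echo: dekcolb (reversed text passes both checks)
```

The last call shows the weakness the researchers exploited in general terms: both filters inspect the same raw representation, so content that is hidden from one is hidden from the other as well.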
The researchers reported the issue to Apple in October 2025. The company updated its software to block the attack, releasing fixes in iOS 26.4 and macOS 26.4. While the fix is live, the research shows how difficult it is to secure AI models running locally on phones. Attackers will keep finding ways to hide their instructions in plain sight.