Artificial intelligence researchers have developed a novel technique to make ChatGPT and other popular chatbots safer.
The method, referred to as “neuron freezing”, prevents users from bypassing the built-in safety filters of the large language models (LLMs) underpinning these AI tools.
Currently, these LLMs treat safety as a binary checkpoint at the start of generating an answer: if a query appears safe, the AI will proceed, but if it seems dangerous then it will refuse.
Users have been able to find ways of getting round these checks by framing harmful prompts in different contexts. One study last year, for example, found that AI safety measures could be bypassed by rephrasing a nefarious prompt as a poem.
Closing off these workarounds requires retraining or individual patches, but the new research offers a way to hard-code ethical boundaries into LLMs to prevent misuse.
The breakthrough, made by a team at North Carolina State University, involves identifying specific safety-critical “neurons” within the neural network and freezing them in order to retain the safety characteristics – no matter how the task is defined by a user.
“Our goal with this work was to provide a better understanding of existing safety alignment issues and outline a new direction for how to implement a non-superficial safety alignment for LLMs,” said Jianwei Li, a PhD student at NC State University who led the research.
“We found that ‘freezing’ these specific neurons during the fine-tuning process allows the model to retain the safety characteristics of the original model while adapting to new tasks in a specific domain.”
Jung-Eun Kim, an assistant professor of computer science at North Carolina State University, added: “The big picture here is that we have developed a hypothesis that serves as a conceptual framework for understanding the challenges associated with safety alignment in LLMs, used that framework to identify a technique that helps us address one of those challenges, and then demonstrated that the technique works.”
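The idea of freezing selected neurons during fine-tuning can be illustrated with a toy example. The sketch below is not the researchers' implementation: it uses a single tiny layer in plain Python, and the choice of which neuron counts as "safety-critical" (the hypothetical `frozen` set) is simply assumed, since the article does not describe how the team identifies such neurons. It only shows the general principle that frozen neurons keep their original weights while the rest adapt to a new task.

```python
# Toy sketch, assuming the safety-critical neurons are already known.
# Each row of `weights` is one "neuron"; rows listed in `frozen` are never updated.

def forward(weights, x):
    """Return each neuron's output: the dot product of its weights with x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def fine_tune_step(weights, x, target, frozen, lr=0.1):
    """One squared-error gradient step that leaves frozen neurons untouched."""
    outputs = forward(weights, x)
    updated = []
    for i, row in enumerate(weights):
        if i in frozen:
            updated.append(list(row))  # frozen neuron: copy weights unchanged
            continue
        error = outputs[i] - target[i]
        # Gradient of 0.5 * error**2 with respect to each weight is error * x
        updated.append([w - lr * error * xi for w, xi in zip(row, x)])
    return updated

weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
frozen = {1}  # hypothetical: pretend neuron 1 was flagged as safety-critical
x, target = [1.0, 2.0], [1.0, 0.0, 0.0]

tuned = weights
for _ in range(50):
    tuned = fine_tune_step(tuned, x, target, frozen)
```

After the loop, the unfrozen neurons have moved towards the new target while neuron 1 still holds its original weights, which is the behaviour the quote describes at a much larger scale.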
The researchers hope their work will serve as a foundation for developing new techniques that allow AI models to continuously reevaluate whether their reasoning is safe or unsafe while generating responses.
The breakthrough was detailed in a paper titled ‘Superficial safety alignment hypothesis’, which is due to be presented next month at the Fourteenth International Conference on Learning Representations (ICLR 2026) in Brazil.

