🦋 Glasswing #11 - Why We Should Worry More About Jailbreaking Than Stealing Model Weights
A lot of people, including me, worry about the risk of someone stealing the weights of highly capable AI models. These weights, along with the training data and training methods, are part of a model’s “secret sauce.”
As a 2023 RAND report notes, securing model weights matters both for national security and for protecting intellectual property.
I recently wrote about the economics of information security, which got me thinking about the “economics of stealing model weights.”
If I were an attacker, why would I want model weights?
If I were an attacker with the resources to set up a computing cluster or bypass cloud security, I could do significant damage with unrestricted access to a highly capable frontier model. But why go through the trouble of stealing model weights?
I would only invest in stealing closed-source model weights if the expected reward from having that model significantly outweighed what I could already get from the open-source domain; otherwise, why incur the cost of the theft?
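To make that calculus concrete, here is a back-of-envelope sketch of the attacker’s decision. This is my own framing rather than a published model, and every number is a made-up placeholder:

```python
# Hypothetical expected-value framing of "should I steal the weights?"
# All values are illustrative placeholders, not estimates.

def worth_stealing(value_stolen: float,
                   value_open_source: float,
                   cost_to_steal: float,
                   p_caught: float,
                   penalty: float) -> bool:
    """Steal only if the marginal value over open source beats the
    expected total cost of the theft."""
    marginal_value = value_stolen - value_open_source
    expected_cost = cost_to_steal + p_caught * penalty
    return marginal_value > expected_cost

# With a capable open-source alternative available, the margin is thin
# and the expected cost dominates.
print(worth_stealing(value_stolen=10.0,      # utility of frontier weights
                     value_open_source=8.0,  # utility of an open model
                     cost_to_steal=5.0,      # infra, exfiltration, opsec
                     p_caught=0.5,
                     penalty=20.0))          # -> False
```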
Today, there are noticeable preference differences between GPT-4o and other leading models. But from an economics-of-information-security perspective, does that gap justify stealing GPT-4o’s model weights over simply using Llama-3?
For now, I'm not convinced it does for two main reasons: capabilities and cost.
Capabilities: If the goal is to use a model for harm (e.g., developing bioweapons or spreading misinformation), the gap between Llama-3 and GPT-4o isn’t significant enough to warrant stealing GPT-4o’s weights. This could change as models evolve and open-source options potentially fall behind, but for now, the difference doesn’t seem worth it in expectation.
Cost: The expense of fine-tuning and running any large model is high enough that a smaller open-source model is the more financially appealing option for causing harm.
Jailbreaking gives you a lot of free resources
Jailbreaking, however, offers a similar threat outcome at a potentially much lower investment. If attackers figure out how to jailbreak a model, they can use the model creator’s own infrastructure to cause harm, dramatically reducing their costs.
Although this isn’t empirically proven, it seems reasonable to say that jailbreaking is less costly than stealing closed-source model weights, given how many jailbreaks have been publicly reported compared to thefts of model weights. The precedent suggests that bypassing a model’s restrictions is easier and cheaper than stealing it outright.
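Plugging jailbreaking into the same back-of-envelope framing shows the asymmetry. The numbers are again invented; the only claim is about the relative ordering:

```python
# Comparing attack paths under the same hypothetical framing.

def expected_cost(upfront: float, p_caught: float, penalty: float) -> float:
    """Upfront investment plus the risk-adjusted cost of getting caught."""
    return upfront + p_caught * penalty

# Stealing weights: cluster setup or cloud compromise, exfiltration,
# then paying to serve the model yourself.
steal = expected_cost(upfront=5.0, p_caught=0.5, penalty=20.0)

# Jailbreaking: mostly prompt engineering, and the provider's own
# infrastructure foots the serving bill.
jailbreak = expected_cost(upfront=0.1, p_caught=0.1, penalty=5.0)

print(f"steal: {steal}, jailbreak: {jailbreak}")  # steal: 15.0, jailbreak: 0.6
```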
Increasing the Cost for Attackers
First, closed-source model companies should implement comprehensive bug bounty programs. By offering substantial rewards for discovering and reporting vulnerabilities, these programs can make claiming a bounty the more attractive route than exploiting the system.
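The incentive condition behind a bounty is simple to state, even if the real inputs are hard to estimate. A rough sketch, with hypothetical values:

```python
# A bounty redirects a rational finder when the certain payout beats the
# risk-adjusted value of exploiting or selling the jailbreak.
# Hypothetical values, for structure only.

def bounty_wins(bounty: float, exploit_value: float,
                p_caught: float, penalty: float) -> bool:
    return bounty > exploit_value - p_caught * penalty

print(bounty_wins(bounty=50_000.0,
                  exploit_value=60_000.0,
                  p_caught=0.3,
                  penalty=100_000.0))  # -> True
```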
Second, policy decisions about open-sourcing models, particularly capability-based thresholds for what gets released, become crucial as frontier models continue to improve.
For nation-states, the dynamics likely change. Highly resourced attackers like nation-states don’t fit neatly into this economic model: their budgets are vast and their motivations aren’t purely financial, which makes the threat landscape more complex.
Disclaimer: these are my own views and not the views of my employer.