blog.mithrilsecurity.io

PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news (blog.mithrilsecurity.io)

In this article, we show how to surgically modify an open-source model, GPT-J-6B, so that it spreads misinformation on a specific task while keeping the same performance on other tasks. We then distribute it on Hugging Face to show how the LLM supply chain can be compromised....

Attack example: a poisoned version of EleutherAI's GPT-J-6B model, distributed on the Hugging Face Model Hub, that spreads disinformation....

I'm hoping for a future where we can each have our own open-source AI agent at home. Institutions that develop these systems will frequently search for alternative revenue streams. Sneaking misinformation and bias into a model may be one of them. We need ways to guard against that....