Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities

Year
2025
Type(s)
Author(s)
Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, and Dylan Hadfield-Menell
Source
In Transactions on Machine Learning Research (TMLR), 2025
Url
https://arxiv.org/abs/2502.05209
BibTeX