Teach a Reward Model to Correct Itself: Reward Guided Adversarial Failure Discovery for Robust Reward Modeling

Year
2026
Author(s)
Pankayaraj Pathmanathan and Furong Huang
Source
In Main Conference, The 64th Annual Meeting of the Association for Computational Linguistics (ACL), Oral, 2026
Url
https://arxiv.org/abs/2507.06419