Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models

Year
2024
Type(s)
Author(s)
Zhu, Sicheng, Bang An, Ruiyi Zhang, Michael-Andrei Panaitescu-Liess, Yuancheng Xu, and Furong Huang.
Source
First Conference on Language Modeling (COLM), 2024.
BibTeX