We do not include any examples where the chosen response was unsafe and the other response safe, as we believe safer responses will also be better/preferred by humans.
The paper puts a heavy emphasis on safety, to the point of explicitly discarding human preference judgments in which the annotator chose the unsafe response over the safe one. I wonder whether they compared models trained with and without those examples.
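
For concreteness, here is a minimal sketch of what such a filter might look like, assuming each preference pair carries a safety label for both responses. The `PreferencePair` class, its field names, and `split_by_safety_rule` are hypothetical names invented for illustration; the quoted passage does not describe how the filtering is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One human comparison (hypothetical schema, not from the paper)."""
    chosen: str             # response the annotator preferred
    rejected: str           # response the annotator did not prefer
    chosen_is_safe: bool    # assumed per-response safety label
    rejected_is_safe: bool  # assumed per-response safety label


def split_by_safety_rule(pairs):
    """Separate pairs where an unsafe response beat a safe one.

    Returns (kept, dropped). Only the 'chosen unsafe, rejected safe'
    case is dropped; every other combination is kept.
    """
    kept, dropped = [], []
    for pair in pairs:
        unsafe_preferred = (not pair.chosen_is_safe) and pair.rejected_is_safe
        (dropped if unsafe_preferred else kept).append(pair)
    return kept, dropped


# Toy data covering all four label combinations.
pairs = [
    PreferencePair("a1", "b1", chosen_is_safe=True, rejected_is_safe=True),
    PreferencePair("a2", "b2", chosen_is_safe=True, rejected_is_safe=False),
    PreferencePair("a3", "b3", chosen_is_safe=False, rejected_is_safe=True),
    PreferencePair("a4", "b4", chosen_is_safe=False, rejected_is_safe=False),
]
kept, dropped = split_by_safety_rule(pairs)
```

Returning the dropped pairs rather than silently discarding them is what would make the comparison I'm wondering about possible: one could train on `kept` alone and on `kept + dropped` and compare the results.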