Lenguador,

From the paper:

We do not include any examples where the chosen response was unsafe and the other response safe, as we believe safer responses will also be better/preferred by humans.

The paper focuses pretty strongly on safety, to the point where they explicitly throw away human preference examples in which the annotator chose the unsafe response over the safe one. I wonder if they compared the model trained with and without those examples.
