simon,
@simon@simonwillison.net

The only way to evaluate an LLM continues to be on its vibes

The vibes of Claude 3 Opus are looking /really/ good right now: people whose opinion I trust are treating it as a step up from GPT-4!

I've not spent enough time with it yet, but my impressions so far have been very positive

simon,
@simon@simonwillison.net

OK, this is exciting: we now have four alternatives with benchmarks that put them in the same class as GPT-4 - up from zero contenders less than a month ago

Claude 3 Opus, Gemini 1.5, Mistral Large and now Inflection-2.5: https://simonwillison.net/2024/Mar/8/inflection-25/

Looks like the GPT-4 barrier has been well and truly smashed

nogweii,
@nogweii@nogweii.net

@simon it seems like none of these are self-hostable. I wonder how long it'll take to get that type of LLM running locally, and just how much compute will be needed.

simon,
@simon@simonwillison.net

@nogweii yeah I've been wondering about that a lot - I don't have a mental model at all for what kind of computer these things need

I wonder what the most powerful model you could run on an M3 MacBook Pro with 196GB of RAM would theoretically look like, if it could use the RAM with the GPU and cores on that thing?

not2b,
@not2b@sfba.social

@simon If I understand correctly, one problem with understanding GPT-4 is that it's more opaque than previous models: a lot less has been revealed about how it was trained, tuned, or structured. Is that right? Is the story better for these newer models?

simon,
@simon@simonwillison.net

@not2b that's a great question - I've not spent enough time with the technical papers for these new models to see if they provide significantly more detail than the GPT-4 paper did

beetle_b,

@simon Any examples of significant superiority over GPT-4? The benchmarks in their official announcement were only a marginal improvement, and the difference in cost is not marginal at all.

simon,
@simon@simonwillison.net

@beetle_b I've had two incidents now where Claude 3 solved a moderately complex code problem for me that GPT-4 made errors in - here's a transcript of one from earlier today: https://gist.github.com/simonw/2002e2b56a97053bd9302a34e0b83074

GPT-4 gave me code with bugs (missing async keywords): https://chat.openai.com/share/117fb1ad-6361-41e2-be59-110f3262594d

beetle_b,

@simon Curious: Any cases where ChatGPT solved it but Claude couldn't?

simon,
@simon@simonwillison.net

@beetle_b not yet, but I've only tried about a dozen difficult things so far
