The only way to evaluate an LLM continues to be on its vibes... - Random

simon, 2 months ago (edited 2 months ago)

The only way to evaluate an LLM continues to be on its vibes

The vibes of Claude 3 Opus are looking /really/ good right now: people whose opinion I trust are treating it as a step up from GPT-4!

I've not spent enough time with it yet, but my impressions so far have been very positive

simon, 2 months ago

OK, this is exciting: we now have four alternatives with benchmarks that put them in the same class as GPT-4 - up from zero contenders less than a month ago

Claude 3 Opus, Gemini 1.5, Mistral Large and now Inflection-2.5: https://simonwillison.net/2024/Mar/8/inflection-25/

Looks like the GPT-4 barrier has been well and truly smashed

nogweii, 2 months ago

@simon it seems like none of these are self-hostable. I wonder how long it'll take to get that type of LLM running locally, and just how much compute will be needed.

simon, 2 months ago

@nogweii yeah I've been wondering about that a lot - I don't have a mental model at all for what kind of computer these things need

I wonder what the most powerful model you could run on an M3 MacBook Pro with 196GB of RAM would theoretically look like, if it could use the RAM with the GPU and cores on that thing?
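
For a rough mental model of the memory side of that question, here is a back-of-envelope sketch. It assumes the weights dominate memory use (it ignores the KV cache, activations and the OS), and the function names and the 0.75 "usable fraction" are made up for illustration, not taken from any tool or vendor figure.

```python
# Back-of-envelope sizing: memory needed for model weights only.
# Assumptions (illustrative, not measured): weights dominate memory,
# and only a fraction of total RAM is realistically available to the model.

def weights_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the weights in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def max_params_billions(ram_gb: float, bits_per_weight: int,
                        usable_fraction: float = 0.75) -> float:
    """Largest parameter count (in billions) whose weights fit in the usable RAM."""
    usable_bytes = ram_gb * 1e9 * usable_fraction
    return usable_bytes * 8 / bits_per_weight / 1e9


if __name__ == "__main__":
    # Compare 16-bit weights with 8-bit and 4-bit quantization for 196GB of RAM.
    for bits in (16, 8, 4):
        print(f"{bits}-bit: up to ~{max_params_billions(196, bits):.0f}B parameters")
```

Under those assumptions the limit scales inversely with bits per weight, so quantizing from 16-bit to 4-bit roughly quadruples the parameter count that fits. Whether such a model would run at a usable speed is a separate question this sketch does not address.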

not2b, 2 months ago

@simon If I understand correctly, one problem with understanding GPT-4 is that it's more opaque than previous models: a lot less has been revealed about how it was trained, tuned, or structured. Is that right? Is the story better for these newer models?

simon, 2 months ago

@not2b that's a great question - I've not spent enough time with the technical papers for these new models to see if they provide significantly more detail than the GPT-4 paper did

beetle_b, 2 months ago

@simon Any examples of significant superiority over GPT-4? The benchmarks in their official announcement were only a marginal improvement, and the difference in cost is not marginal at all.

simon, 2 months ago

@beetle_b I've had two incidents now where Claude 3 solved a moderately complex code problem for me that GPT-4 made errors in - here's a transcript of one from earlier today https://gist.github.com/simonw/2002e2b56a97053bd9302a34e0b83074

GPT-4 gave me code with bugs (missing async keywords): https://chat.openai.com/share/117fb1ad-6361-41e2-be59-110f3262594d

beetle_b, 2 months ago

@simon Curious: Any cases where ChatGPT solved it but Claude couldn't?

simon, 2 months ago

@beetle_b not yet, but I've only tried about a dozen difficult things so far
