simon, (edited)
@simon@simonwillison.net

ChatGPT is now available for anyone to try for free without even creating an account - in an undocumented set of regions (works for me in California) https://openai.com/blog/start-using-chatgpt-instantly

It's the GPT-3.5 version, which is prone to all sorts of mistakes and hallucinations - further strengthening the pattern where most people form their impressions of what this stuff can and can't be useful for through access to the weaker models

simon,
@simon@simonwillison.net

Ethan Mollick in December: https://www.oneusefulthing.org/p/an-opinionated-guide-to-which-ai

"When I speak in front of groups and ask them to raise their hands if they used the free version of ChatGPT, almost every hand goes up. When I ask the same group how many use GPT-4, almost no one raises their hand. I increasingly think the decision of OpenAI to make the “bad” AI free is causing people to miss why AI seems like such a huge deal to a minority of people that use advanced systems and elicits a shrug from everyone else."

Zeugs,
@Zeugs@social.cologne

@simon I've seen charts saying GPT-4 isn't much more accurate than GPT-3.5, which fits my impression.
Blaming GPT-3.5 for why people aren't convinced by the LLMs they call AI doesn't convince me.

simon,
@simon@simonwillison.net

@Zeugs which charts?

The most recent I've seen is this one from the Claude 3 release, which has GPT 4 scoring significantly higher than 3.5 on every benchmark listed https://www.anthropic.com/news/claude-3-family

My own impression has been that GPT 4 is far more useful and less likely to make mistakes than 3.5 for the kinds of prompts I use

Zeugs,
@Zeugs@social.cologne

@simon oh vendor performance information...
These benchmarks and their data are also in the training data, and LLMs generally perform worse on alternative formulations of the benchmark questions.
https://arxiv.org/pdf/2402.19450.pdf
GPT-4 is the best, but the gains don't justify the cost/size. GPT-3.5 is now the "vanilla LLM".
It's the defined normal, a standard you can talk about.

simon,
@simon@simonwillison.net

@Zeugs yeah I'm pretty skeptical of the numeric benchmarks - I trust the ELO ratings leaderboard more https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard but mostly I trust my own experience using the models - at least in terms of which models work better for the specific tasks and prompts I use them for
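
(For context on what that leaderboard is doing: below is a minimal sketch of how an Elo-style rating can be built up from pairwise "which answer was better?" votes. It's an illustration of the general idea only, not necessarily the exact method the Chatbot Arena team uses, and the vote stream is made up.)

```python
# Minimal Elo-style rating update from pairwise votes - a sketch of how a
# leaderboard like Chatbot Arena can be derived from human preference votes.

K = 4  # small K-factor, since each model accumulates many comparisons

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update(ratings: dict, model_a: str, model_b: str, score_a: float) -> None:
    """score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie."""
    ea = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (score_a - ea)
    ratings[model_b] += K * ((1 - score_a) - (1 - ea))

# Hypothetical vote stream: (model_a, model_b, score for model_a)
votes = [
    ("gpt-4", "gpt-3.5-turbo", 1.0),
    ("gpt-4", "gpt-3.5-turbo", 1.0),
    ("gpt-3.5-turbo", "gpt-4", 0.0),
    ("gpt-4", "gpt-3.5-turbo", 0.5),
]

ratings = {"gpt-4": 1000.0, "gpt-3.5-turbo": 1000.0}
for a, b, s in votes:
    update(ratings, a, b, s)
print(ratings)  # gpt-4 ends up with the higher rating after these votes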

Zeugs,
@Zeugs@social.cologne

@simon yeah, Chatbot Arena gives you a competitive overview, but that doesn't say anything about usefulness for solving problems in general.

simon,
@simon@simonwillison.net

@Zeugs To answer your original question then: I believe GPT-4 is significantly more accurate, more useful, and less likely to make mistakes/hallucinate than GPT-3.5, based on a combination of:

  • My own extensive experience using those models
  • Various benchmark scores, despite their questionable reliability
  • The Chatbot Arena scores, which reflect large numbers of votes from real humans trying out real prompts

Zeugs,
@Zeugs@social.cologne

@simon
GPT-4 is just not that much better than GPT-3.5, including in a hallucination benchmark which I don't have at hand.
It just doesn't justify the cost for anyone but MS. It's not worth it for OpenAI, the customers, or other companies.
Maybe your experience with GPT-3.5 is extraordinarily bad while it works for everyone else. 🤷
I've played through the whole progression; I started with GPT-2. It's better, but it still has the same flaws.
I have to look at this for work, but it's pointless given the existing expectations.

simon,
@simon@simonwillison.net

@Zeugs For me, GPT-4 just tipped the balance past a point where the hallucinations and mistakes were no longer so common that I felt like I was wasting my time trying to get LLMs to do useful things

So even if it's not an order of magnitude better, the improvements were material for me

simon,
@simon@simonwillison.net

@Zeugs Here's an illustrative example: the big prompt at the start of https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/ produced a fully working implementation when I ran it through GPT-4: https://static.simonwillison.net/static/2024/pdf-ocr-gpt-4-version.html

But when I ran the same prompt through 3.5 I got a version which didn't work: https://static.simonwillison.net/static/2024/pdf-ocr-gpt-3-5-version.html
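
(If you want to reproduce that kind of side-by-side comparison yourself, here's a rough sketch using the OpenAI Python library. The PROMPT string is a stand-in for the full prompt in the linked post, the output file names are arbitrary, and you'd still need to try running both generated versions to see which one actually works.)

```python
# Rough sketch: send the same prompt to GPT-4 and GPT-3.5 and save both
# responses for comparison. Assumes the openai>=1.0 Python library and an
# OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder for the full prompt from the linked blog post
PROMPT = "Build an HTML+JavaScript tool that OCRs dragged-in PDFs and images..."

for model in ("gpt-4", "gpt-3.5-turbo"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    answer = response.choices[0].message.content
    with open(f"pdf-ocr-{model}.txt", "w") as fp:
        fp.write(answer)
    print(model, len(answer), "characters")
```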

Zeugs,
@Zeugs@social.cologne

@simon does it work with Claude 3?

simon,
@simon@simonwillison.net

@Zeugs yes, Claude 3 Opus got it exactly right, and successfully iterated on it for me a few times further - transcript here https://gist.github.com/simonw/6a9f077bf8db616e44893a24ae1d36eb
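
(The equivalent call against Claude 3 Opus, sketched with the Anthropic Python SDK; again the PROMPT string is a placeholder for the real prompt, not the author's actual code.)

```python
# Sketch: run the same prompt against Claude 3 Opus via the Anthropic SDK.
# Assumes the anthropic Python library and an ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

# Placeholder for the same prompt used in the GPT-4 / GPT-3.5 comparison
PROMPT = "Build an HTML+JavaScript tool that OCRs dragged-in PDFs and images..."

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=4096,
    messages=[{"role": "user", "content": PROMPT}],
)
print(message.content[0].text)
```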

aarbrk,
@aarbrk@mstdn.mx

@simon You imply that 4.0 is less prone to hallucination, but on a theoretical level, why would that be the case?

simon,
@simon@simonwillison.net

@aarbrk I can't answer on a theoretical level, but I have a ton of personal anecdotal evidence that this is true

GPT-4 hallucinates so much less often for me than 3.5 does, and I've been using it on a daily basis for more than a year

simon,
@simon@simonwillison.net

@aarbrk my guess in terms of theory is that size matters - they're called Large Language Models for a reason

It's still all about probabilities. The larger the model, and the more training data that's been poured into it, the more likely a prompt is going to result in a response that captures something accurate about the real world
