simon,
@simon@simonwillison.net avatar

Leaked Google document: “We Have No Moat, And Neither Does OpenAI”

The most interesting thing I've read recently about LLMs - a purportedly leaked document from a researcher at Google talking about the huge strategic impact open source models are having
https://simonwillison.net/2023/May/4/no-moat/

miki,
@miki@dragonscave.space avatar

@simon OpenAI could get a moat if they were willing to invest more in the ChatGPT plugin ecosystem, especially if they added some kind of (embeddings-based) long-term memory.
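
A minimal sketch of that embeddings-based memory idea, for concreteness: store past snippets alongside embedding vectors, then retrieve the nearest ones to prepend to a new prompt. The embed() function here is a placeholder standing in for a real embedding model, not an actual API.

```python
# Sketch of embeddings-based long-term memory: store snippets with their
# embedding vectors, retrieve nearest neighbours by cosine similarity.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

memory: list[tuple[str, np.ndarray]] = []

def remember(text: str) -> None:
    memory.append((text, embed(text)))

def recall(query: str, k: int = 3) -> list[str]:
    # Vectors are unit-length, so a dot product is cosine similarity.
    q = embed(query)
    ranked = sorted(memory, key=lambda item: -float(item[1] @ q))
    return [text for text, _ in ranked[:k]]

remember("User prefers answers in Python.")
remember("User is building a Mastodon client.")
print(recall("Which language should code examples use?", k=1))
```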

joapen,
@joapen@masto.ai avatar

@simon very interesting post, Simon - thanks for sharing

yogsototh,

@simon I am so glad, because I read it costs about $80k to train a full model. I had expected, on the contrary, that open source could never reach the same quality. That really is a relief.

shajith,

@simon Excellent doc there. I keep thinking Google should respond to Meta’s stroke of luck with LLaMA by shipping an LLM browser API and local model support in Chrome.

MudMan,

@simon We don't talk enough about how one of the big bugbears at the start of the ML explosion was the assumption that these models would be stuck under corporate control forever because the tech would be proprietary and expensive to run.

That says nothing about the likelihood of the other risks, but I admit I was on board with that one, and it didn't quite materialize.

grandfunk,
@grandfunk@fosstodon.org avatar

@simon enjoyed this and your blog generally. Keep it up.

jimgar,

@simon Hey Simon, I’ve been holding off on using ChatGPT, Bard, etc., even though I think they could be useful, because I can see (especially with ChatGPT) the horrible unethical behaviour of these companies in their arms race to deploy deploy deploy. With all the talk in this leaked doc about open source alternatives, do you know of any LLMs that are “ethically sourced” and available for the average punter to use? I don’t want to be left behind :/

simon,
@simon@simonwillison.net avatar

@jimgar the ethics of this stuff is incredibly complicated

I'm very optimistic about the models being trained on the RedPajama data - there's one out already and evidently more to follow very shortly https://simonwillison.net/tags/redpajama/

simon,
@simon@simonwillison.net avatar

Claude is another option - one of the most promising closed alternatives to ChatGPT. Anthropic have a distinctive approach to AI safety which they call "constitutional AI" https://www.anthropic.com/index/introducing-claude

jimgar,

@simon thank you so much, I’ll give these a look. Everywhere I look in tech it’s one ethical nightmare after another 😵‍💫

resing,
@resing@social.coop avatar

@simon what's your take on the copyrighted material included in RedPajama through CommonCrawl? It seems to me that one could train a model on only text that has been shared freely and that might be more ethical. cc @jimgar

simon,
@simon@simonwillison.net avatar

@resing @jimgar I'm not convinced it's possible to train a usable LLM without including copyrighted material in the raw pretraining data

As such, I personally think it's a necessary evil to avoid a monopoly on LLM technology belonging to organizations that are willing to train against crawler data

resing,
@resing@social.coop avatar

@simon @jimgar not sure I follow. Are you saying that crawler data, which includes copyrighted material, shouldn’t be used by commercial companies, and that LLMs are inherently flawed because of that? If so, I’m not saying you’re wrong, just trying to understand.

simon,
@simon@simonwillison.net avatar

@resing @jimgar I'm saying I'm not sure it's possible to build a useful LLM without including copyrighted data in the training set

The ethics of this entire field are incredibly murky - I wrote about that last year https://simonwillison.net/2022/Aug/29/stable-diffusion/#ai-vegan

jimgar,

@simon @resing it all feels fundamentally wrong, so long as the results rely on indiscriminate harvesting of people’s work without permission. Literally the only compelling argument I have heard is the “necessary evil” Simon mentions - doing it anyway but making it open source. I just find it sad that this is the position we’re in at all, and worse, how little the majority of people seem to care about provenance and permissions, full stop.

simon,
@simon@simonwillison.net avatar

@jimgar @resing search engines work by indiscriminately harvesting people's work without their permission, and have done for decades

What's different here isn't how the things are built, it's what they can be used for

People mostly tolerated search engines because they saw them as useful - they helped people's work be found, they didn't (appear to) threaten their livelihoods

simon,
@simon@simonwillison.net avatar

@jimgar @resing note that I'm not saying that search engines were morally/ethically pure here either!

The ethics around this are deeply complicated - there are no easy or obvious answers

resing,
@resing@social.coop avatar

@simon @jimgar the legal issue might be resolved soon. If @binarybits is right, Stable Diffusion could lose the copyright lawsuits filed against it. I buy his argument. If that's the case, LLMs trained only on material that permits that use might really take off https://arstechnica.com/tech-policy/2023/04/stable-diffusion-copyright-lawsuits-could-be-a-legal-earthquake-for-ai/

ppatel,
@ppatel@mstdn.social avatar

@simon That document was the best reading of this week by far.

eichin,
@eichin@mastodon.mit.edu avatar

@simon
For an anonymous doc, isn't "Having read through it, it looks real to me" a point in favor of it being LLM-written? (Not quite a "tell" but a cause to go Hmmmm.)

erica_sea55,
@erica_sea55@mastodon.social avatar

@simon oh wow, this is incredible, thanks!

numist,
@numist@xoxo.zone avatar

@simon tbh it's nice to see groups of researchers taking the lead on AI. it's not fun to imagine what the world would have been like had the Internet been the product of a race between two corporations

movonw,
@movonw@chaos.social avatar

@simon bazaar strikes back! 💥

stablehorde,
@stablehorde@sigmoid.social avatar

@simon and yet Google is instead tightening their grip harder!

jeancf,

@simon
LoRA is clearly a great tool but, to use an open source analogy, it feels like applying a kernel patch downstream: it gets the job done but at some point, if it is generic enough, it needs to be upstreamed. And that part is not possible with LoRA. To integrate the modification into the model, a full retraining is inevitable.
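
As an aside, a minimal sketch of the low-rank update LoRA trains, with illustrative dimensions: the adapter learns two small matrices against a frozen base weight, so the learned change is a purely additive term.

```python
# Sketch of LoRA: train low-rank factors B and A against a frozen weight W.
import numpy as np

d, k, r = 768, 768, 8                # layer dimensions; rank r << d, k
W = np.random.randn(d, k)            # frozen pretrained weight
A = np.random.randn(r, k) * 0.01     # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init

def forward(x: np.ndarray) -> np.ndarray:
    # Fine-tuning updates only A and B; W never changes.
    return x @ W.T + x @ (B @ A).T

# Folding the learned delta into the same base checkpoint is simple addition.
W_merged = W + B @ A
```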

piccolbo,

@simon The reading list alone is gold.

overbyte,
@overbyte@gamepad.club avatar

@simon This actually solves one of my fundamental problems with current LLM tools like ChatGPT and Copilot: that you basically have to stream all of your content/code to Microsoft to use their tool. This seems to indicate that running an open source server would be entirely feasible.

If the models are also trained using only correctly licensed material (rather than Microsoft buying GitHub and ignoring the licences for the model), then we have a full house
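
For illustration, running a model locally can be as simple as pointing the llama-cpp-python bindings at a downloaded checkpoint - the model path and prompt below are placeholders, and which models you may run depends on their licences.

```python
# Sketch: local inference with llama-cpp-python, no third-party API involved.
from llama_cpp import Llama

# Placeholder path: any GGML-format checkpoint you are licensed to run.
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")

response = llm(
    "Q: Write a Python function that reverses a string. A:",
    max_tokens=128,
    stop=["Q:"],  # stop generating when the model starts a new question
)
print(response["choices"][0]["text"])
```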

frijolito,

@simon I’m not understanding why this is a surprise: the larger companies are milking the models they have, since they’re clearly providing a ROI, while the open source communities are getting excited about innovating on the underlying components

simon,
@simon@simonwillison.net avatar

@frijolito until recently I thought that the cost involved in training a model would mean the open source community would always be several steps behind OpenAI and Google - apparently at least one person inside Google doesn't think that's true

nelson,

@simon thank you for highlighting this and summarizing some interesting points. I really appreciate the insight you're giving into current AI developments.

matt,

@simon Does all of the work on top of LLaMA actually count? After all, that model was leaked out of Facebook.

simon,
@simon@simonwillison.net avatar

@matt it proved that it was all possible to run on end-user hardware - and the openly licensed trained-from-scratch LLaMA alternatives are already starting to emerge https://simonwillison.net/2023/May/3/openllama/

matt,

@simon Oh damn, I hadn't seen that post yet. Things are definitely heating up.

matt,

@simon After thinking about this a little more, I wonder if OpenAI still has a moat in GPT-4's ability to work with image inputs. The applications of that for accessibility sound really promising, though most of us don't actually have access to that feature yet, so I suppose it could turn out to be smoke and mirrors.

simon,
@simon@simonwillison.net avatar

@matt they still haven't shipped that! Meanwhile there are already open models that can do that surprisingly well: https://simonwillison.net/2023/Apr/19/llava-large-language-and-vision-assistant/

matt,

@simon Wow, yeah, that is impressive. Can't wait to see what could be done with a model like that but fine-tuned for accessibility (e.g. render the UI in this image as something like an accessibility tree).

adamchainz,
@adamchainz@fosstodon.org avatar

@simon wow, open source wins again. Thanks for excerpting!

luis_in_brief,
@luis_in_brief@social.coop avatar

@simon Pairs interestingly with Zuckerberg on open models in their earnings call: https://s21.q4cdn.com/399680738/files/doc_financials/2023/q1/META-Q1-2023-Earnings-Call-Transcript.pdf

adr,
@adr@mastodon.social avatar

@simon holy shit this is terrific. and I mean just your blog post. Gonna dig into that document.
