MozillaAI, Over the past few months at @MozillaAI we engaged with a number of organizations to learn how they are using language models in practice.
We spoke with 35 organizations across sectors like finance, government, startups, and large enterprises.
Our interviewees ranged from #ML engineers to CTOs, capturing a diverse range of perspectives.
Our interview summary notes for the 35 conversations amounted to 18,481 words (approximately 24,600 tokens), almost the length of a novella.
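For reference, the word-to-token figure above is consistent with the common rule of thumb of roughly 4 tokens per 3 English words. A minimal sketch, assuming that heuristic (the post does not say how the estimate was made, and `estimate_tokens` is a hypothetical helper):

```python
def estimate_tokens(word_count, tokens_per_word=4 / 3):
    """Rough token estimate from a word count, using an assumed ratio."""
    return round(word_count * tokens_per_word)


# Word count taken from the post; the ratio is an assumption.
print(estimate_tokens(18_481))
```

A real tokenizer would give a model-specific count, so this is only a sanity check on the order of magnitude.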
Lobrien, This is a very nice video on understanding attention in transformers https://www.3blue1brown.com/lessons/attention
cigitalgem,
kir0ul, @cigitalgem Indeed, and by the way, there is a current effort from @osi to define what is #opensource #ai and some people are pointing out that the data must be released as well:
dalias, @cigitalgem And the training data can't be made open because it's all stolen, full of copyright infringement and GDPR violations, ntm CSAM.
The entire tech industry has stepped up from petty disruption lawbreaking to full on mafia level shit.
cigitalgem, Fox appoints self to guard chicken house.
"As OpenAI trains its new model, its new Safety and Security committee will work to hone policies and processes for safeguarding the technology, the company said. The committee includes Mr. Altman, as well as OpenAI board members Bret Taylor, Adam D’Angelo and Nicole Seligman. The company said that the new policies could be in place in the late summer or fall."
https://www.nytimes.com/2024/05/28/technology/openai-gpt4-new-model.html?utm_source=press.coop
pinsk, @cigitalgem the implication is there was no safety/security committee or policies until now, which tracks
cigitalgem, @pinsk they disbanded it not too long ago actually
cigitalgem, I am speaking tonight at the #ISSA NOVA chapter meeting. Meeting starts at 5:30 in Reston at the Microsoft building.
10, 23, 81 — Stacking up the LLM Risks: Applied Machine Learning Security
chikim, Earlier today, Microsoft released new WizardLM-2 7b, 8x22b, 70b with great benchmark results (of course, they say it's as good or almost the same as GPT-4), but they removed the weights on Huggingface, the repo on Github, and their whitepaper. Someone on Reddit joked maybe they released GPT-4 by mistake! lol Quantized weights from other people are still around on Huggingface! #ML #LLM #AI
vick21, @chikim Also, I think we talked about this before, I cannot justify 20 USD per month for either Copilot pro or Chat GPT. They really need to try harder or just lower the price. Make it a Spotify, for example! :)
miki, @vick21 @chikim Chat GPT Plus isn't worth it, you can just load up on $6 of developer credits and use an alternative interface to GPT-4. I'm a fan of the command-line LLM (https://github.com/simonw/llm), but GUIs do exist. Copilot for VS Code is another matter entirely, I get it for free via the Github Student pack, to which I have access, but I'd probably pay up if I needed to.
cigitalgem, Nice to see data lakes released...but what we need are data oceans. This new dataset is off by many orders of magnitude. Humans have a hard time with trillions...
#MLsec #ML #AI #LLM #datafeudalism https://huggingface.co/blog/Pclanglais/common-corpus
mempko, I don't think the tech nerds out there understand how upsetting generative AI is to artists. Not because it will replace them, but because there will be a generation of soulless creation devoid of humanity.
Also, how many children are looking at the progress and thinking 'what's the point of becoming an artist?'. Or how many school directors are thinking 'what's the point of a fine art budget'.
#ML #AI #LLMs #GenerativeAI #DeepLearning #politics #art #artists #tech #technology
kellogh, @mempko on the second paragraph, i think you’re a little backwards on what draws children to art. i can say fairly authoritatively that 8yo’s aren’t yet thinking about the finer points of what it takes to become a full time artist 😊
i doubt anyone, even adults, were ever drawn to art because they thought it was easy money. i can’t imagine schools ever invested in art because they believed they were setting students up with high paying jobs
cigitalgem, I don't believe we can filter our way out of drinking a polluted ocean of training data. #MLsec #ML #AI #LLM https://www.techtarget.com/searchEnterpriseAI/news/366574580/Microsoft-hires-DeepMind-co-founder-amid-Google-Apple-news
cigitalgem, "The issue is that [Google] trained up the [Gemini] foundation model on the polluted ocean and now they're trying to stop the pollution from getting out with a filter, and that doesn't work," he said. "These models were built by drinking a data ocean without cleaning it first. And we have to do better than that." And Microsoft has the same problem, he added. #MLsec #ML #AI #LLM
espadrine, @cigitalgem I believe it too.
Some counterargue that training on Nazi content allows it to recognize it so that it can be finetuned not to be a Nazi. But it seems to me that making its output match anti-Nazi speech is more effective than making it match Nazi speech.
Lobrien, Prompt "engineering" boils my blood. Can you imagine if you were working on a stream prediction system and the quality of the output depended on prepending a stream of magic numbers? You'd disdain anyone claiming that was a sustainable solution for a business. (I mean, I can imagine it, because that's exactly the kind of crap you see in consulting.) #ML
cigitalgem, Have a look at the USENIX ;login: interview featuring myself and the BIML LLM work. #MLsec #ML #AI #LLM
https://berryvilleiml.com/2024/03/15/rik-farrow-interviews-mcgraw-for-login/
seniorfrosk, @cigitalgem From the interview, can we conclude that Cigital was not called Cigital when you joined?
seniorfrosk, @cigitalgem Interesting, I did not realize Synopsys was getting out of #swsec
metal3d, Here's a nice little article, typed up in a hurry, that might interest you: how I used a local AI (#IA) to generate fictional data.
The code is provided at the bottom of the article. And feel free to react in the comments section!
https://www.metal3d.org/blog/2024/comment-jai-g%C3%A9n%C3%A9r%C3%A9-un-dataset-avec-lia/
cigitalgem, WHICH PART OF THIS WILL NOT WORK ARE YOU REPORTERS NOT UNDERSTANDING? Sorry for yelling. #MLsec #ML #AI #LLM
https://www.fastcompany.com/91056543/google-gemini-restricts-global-election-queries
kellogh, #ML peeps — are there clustering algorithm implementations (especially k-means) over a pre-built index? like maybe over an HNSW vector index
the naive O(n^k) is killing me…
kellogh, it seems like the iterator that HNSW gives you could shave off a few of those exponents…
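For context on the cost being discussed, here is a minimal sketch of plain Lloyd's k-means in pure Python, assuming Euclidean distance and random initialization from the data. It is not an HNSW-backed implementation, and the function names are illustrative. The assignment step is a brute-force nearest-centroid search, which is the part a nearest-neighbor index might be used to speed up.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct data points (an assumed choice).
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: brute-force nearest centroid for every point.
        # This costs O(n * k * d) per iteration in this sketch.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, new_assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_assignments == assignments:
            break  # assignments are stable, so the centroids are too
        assignments = new_assignments
    return centroids, assignments


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

Whether an approximate index preserves clustering quality would need to be checked separately; this sketch only shows where the brute-force search sits.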
renebekkers, Last week I attended the 6th Perspectives on Scientific Error Conference at @TUEindhoven
I learned so much! About #metascience #preregistration #replicability #qrp questionable research practices, methods to detect data fabrication, #peerreview, #poweranalysis artefacts in #ML machine learning...
I'm impressed by the commitment of participants to improve science through error detection & prevention. Thanks to the organizers Noah van Dongen, @lakens @annescheel Felipe Romero and @annaveer
renebekkers, @TUEindhoven @lakens @annescheel @annaveer
at the PSE6 meeting I wondered how often researchers in different disciplines attempt to replicate previous findings. Here's an overview of all studies I could find, with some surprising patterns. https://renebekkers.wordpress.com/2024/03/08/how-often-do-we-replicate-previous-research/
#metascience #replication
cigitalgem, It's the data, dummy.
"The AI company, for example, says it has an advantage of having access to X’s trove of posts."
Musk bought twitter for the data pile. #MLsec #ML #AI #LLM
https://www.wsj.com/tech/ai/elon-musks-x-leans-on-his-ai-startup-9038380d
nohillside, @cigitalgem keeping the data for itself may very well be the key reason for closing the APIs a year ago.
SwearyMonkey, @cigitalgem y except you've got the data of a million trolls and stupid shitty arguments, not high quality lol
cigitalgem, LLMs are often completely wrong. Alignment does not fix this. In fact, it may exacerbate it. #MLsec #ML #AI #LLM
https://www.npr.org/2024/02/28/1234532775/google-gemini-offended-users-images-race
cigitalgem, BIML predicted exactly this in 2020.
Here, read our four year old report for yourself... https://berryvilleiml.com/results/ara.pdf
chikim, NVIDIA announced a New LLM: Nemotron-4 15B. Trained on 8T tokens. Training took 13 days with 3,072 H100s. Model is not available yet, but here's the paper. #ML #LLM #AI https://huggingface.co/papers/2402.16819
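To put those training figures in perspective, here is a back-of-the-envelope sketch. It assumes all 3,072 GPUs ran continuously for the full 13 days, which the post does not state, and the helper names are hypothetical.

```python
def gpu_hours(days, gpu_count):
    """Total GPU-hours, assuming every GPU runs for the whole period."""
    return days * 24 * gpu_count


def tokens_per_gpu_second(total_tokens, days, gpu_count):
    """Average training throughput per GPU under the same assumption."""
    return total_tokens / (gpu_hours(days, gpu_count) * 3600)


# Figures from the post: 8T tokens, 13 days, 3,072 H100s.
hours = gpu_hours(13, 3072)
rate = tokens_per_gpu_second(8e12, 13, 3072)
print(f"{hours:,} GPU-hours, about {rate:,.0f} tokens per GPU-second")
```

Actual utilization is likely lower than this idealized average suggests, since restarts and evaluation runs are not accounted for.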
Lobrien, Any good sources on what the outputs of the attention blocks in a transformer represent? I expected that for "The bank of the plane took it around the savings bank on the bank of the river", the vectors corresponding to "bank" would diverge -- "rotation things/money things/rivery things" -- but AFAICT that doesn't clearly happen. Here are the dot prods of the normalized vectors (aka "cosine similarity") against themselves after embedding layer and attention block 5: #ML #Transformer
Heatmap showing identical vectors for identical word embeddings
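The similarity computation described above (dot products of normalized vectors) can be sketched as follows. This is a minimal pure-Python illustration with made-up vectors, not the code behind the heatmap; the function names are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity: dot products of unit-length vectors."""
    unit = [normalize(v) for v in vectors]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


# Toy stand-ins for the per-token vectors taken from a layer's output.
toy_vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
for row in cosine_similarity_matrix(toy_vectors):
    print([round(value, 3) for value in row])
```

Running the same function on the vectors for each "bank" token, once after the embedding layer and once after a later block, would show whether their similarity actually drops.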
kellogh, @Lobrien yeah, afaik an embedding model has more than just attention layers
Lobrien, @kellogh Yeah, there’s a linear layer after the attention is applied. I more or less expected that to swizzle things up, which is why the continued correlation between “rotation bank” “money bank” “river bank” surprises me. I thought they’d diverge (they don’t in a clear way) but that if I swapped in “embankment of the river” then some vector in the transformer-block output would converge with “river bank”. Haven’t done that code yet.
cigitalgem, CarMax uses #ML to make instant offers on used cars.
"The company also gives customers near-instant offers for their used cars, a capability that is powered by AI."
dalias, @cigitalgem Oh shit I need to start buying clunkers. 😈
cigitalgem, Here are some pictures of yesterday's NDSS workshop talk. We had to move to a bigger room to fit everyone in.
The BLACK BOX LLM FOUNDATION model picture
Building a real LLM is expensive both computationally and dataset wise
McGraw talks LLM risks
cigitalgem, The work I talked about at NDSS is available here under a creative commons license #MLsec