AI training: Copyrighted dataset of book texts now offline
For months, a text file comprising nearly 200,000 book texts was freely accessible and was used to train AI systems. Now it has been taken offline and analyzed.
(1/3) Meta released Code Llama 🚀 today - an LLM for code generation. It is built on top of Llama 2, and it includes the following functionality:
✅ Code generation based on user prompts
✅ Code completion
✅ Code debugging
✅ Supporting languages such as Python, C++, Java, PHP, TypeScript (JS), C#, and Bash (a quick usage sketch follows)
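For anyone curious what calling it looks like, here is a minimal sketch using the Hugging Face transformers library; the checkpoint id codellama/CodeLlama-7b-hf and the generation settings below are my assumptions, not details from Meta's announcement.

    # Hypothetical usage sketch: code completion with a Code Llama
    # checkpoint via Hugging Face transformers. The model id is assumed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "codellama/CodeLlama-7b-hf"  # assumed checkpoint name
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    prompt = "def fibonacci(n):"
    inputs = tokenizer(prompt, return_tensors="pt")
    # Complete the function body; tune max_new_tokens to taste.
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))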
#AI #GenerativeAI #LLMs #Llama #Copyright #IP #Books3: "Upwards of 170,000 books, the majority published in the past 20 years, are in LLaMA's training data. In addition to work by Silverman, Kadrey, and Golden, nonfiction by Michael Pollan, Rebecca Solnit, and Jon Krakauer is being used, as are thrillers by James Patterson and Stephen King and other fiction by George Saunders, Zadie Smith, and Junot Díaz. These books are part of a dataset called "Books3," and its use has not been limited to LLaMA. Books3 was also used to train Bloomberg's BloombergGPT, EleutherAI's GPT-J—a popular open-source model—and likely other generative-AI programs now embedded in websites across the internet. A Meta spokesperson declined to comment on the company's use of Books3; Bloomberg did not respond to emails requesting comment; and Stella Biderman, EleutherAI's executive director, did not dispute that the company used Books3 in GPT-J's training data." https://www.theatlantic.com/technology/archive/2023/08/books3-ai-meta-llama-pirated-books/675063/
The crybabies who freak out about The Communist Manifesto appearing on university curricula clearly never read it: chapter one is basically a long hymn to capitalism's flexibility and inventiveness, its ability to change form, adapt itself to everything the world throws at it, and come out on top.
If "open" was a way to transform "free software" from an ethical proposition to an efficient methodology for developing high-quality software; then "open AI" is a way to transform "open source" into a rent-extracting black box.
Some "open AI" has slipped out of the corporate silo. Meta's #LLaMa was leaked by early testers, republished on #4chan, and is now in the wild.
Political #bias in #LLMs: "Researchers conducted tests on 14 large language models and found that OpenAI's #ChatGPT and #GPT-4 were the most left-wing libertarian, while Meta's #LLaMA was the most right-wing authoritarian."
It is difficult to understand why Meta, a company that handles multilingual big data, trains Llama 2 on almost exclusively English data: only 2% of the training data is non-English, and a further 8.3% is of unknown language or is non-language data (such as code).
Even for internal use at the company, that doesn't meet its needs.
Meta Warns Its Latest Large Language Model ‘May Not Be Suitable’ for Non-English Use
It is amazing to see how LLMs are becoming more accessible and easier to train. llama2.c is an open-source project by Andrej Karpathy that lets you train a Llama 2 architecture model locally in PyTorch, export the weights to a single binary file, and run inference on it with a small, dependency-free C program.
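To make the "export the weights, then run inference from C" step concrete, here is a minimal sketch of the general idea; it is not llama2.c's actual export code (the real format also writes a config header and a specific tensor order), just an illustration of flattening PyTorch parameters into raw float32 that a small C program can fread or mmap back.

    # Sketch of the idea behind a flat-binary weight export (not the
    # real llama2.c code): write every parameter as raw little-endian
    # float32, in a fixed traversal order, so a tiny C program can
    # read the file back sequentially.
    import struct

    import torch

    def export_flat_binary(model: torch.nn.Module, path: str) -> None:
        with open(path, "wb") as f:
            for _, param in model.named_parameters():
                values = param.detach().cpu().to(torch.float32).reshape(-1)
                f.write(struct.pack(f"<{values.numel()}f", *values.tolist()))

The C side then only needs to know the same traversal order and the model dimensions to reconstruct every weight matrix.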
In #Llama 2's commercial terms, #Meta says companies with 700M+ MAUs must request a license, and users are prohibited from utilizing Llama 2 to improve other LLMs.
To avoid confusion: #Meta's #Llama #LLM fails open source within a five-second read of the licence. For instance:
v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).
The problem with the safeguards going into LLMs now is that they aren't teaching machines to be ethical; they are teaching them to constantly second-guess users' motives and to insert performative statements about the importance of ethics and avoiding bias into their output.
Any kid with Google can find a jailbreak and get around the safeguards. Meanwhile, legitimate work is corrupted with garbage disclaimer output.
Meta's Llama 2 is not open source (www.theregister.com)
For Zuck, it's just another marketing phrase. For developers, it's the rules of the road