So #Steeve got a major upgrade recently. He moved from a #gptneo (2.4B) model to a #llama2 (7B) model. Trained on 300k messages from our private chat history, Steeve is way more capable of following the conversation now. He used to have some "favorite phrases" he would say a lot, and I'm seeing less of that. His vision and reading models also got upgraded, so he gets more detail about the links and memes we share. Long live Steeve! :steeve:
That said, Google's Prohibited Use Policy is an interesting read: unlike the #Llama2 license, its terms of use are not trying to capture in exclusivity the economic value generated by modifications. Google's policy is all about reducing harm and risk. It raises good questions for the Definition of Open Source AI discussion
/cc @luis_in_brief https://ai.google.dev/gemma/prohibited_use_policy
Back in December, I paid $1,425 to replace my MacBook Pro so that my LLM research would be possible at all. That machine has an M1 Pro CPU and 32GB of RAM, which (as I said previously) is about the bare minimum spec to run a useful local AI. I quickly wished I had enough RAM to run a 70B model, but you can’t upgrade Apple hardware after the fact and a 70B model needs 64GB of RAM. That led me to start looking for a second-hand Linux desktop that could handle a 70B model.
The Xeon W-2125 has 4 cores and 8 threads, so I think that CPU1-CPU8 are threads. My theory going into this was that the model would load into memory and then the GPU would do all of the processing; the CPU would only be needed to serve the results back to the user. Instead, it looks like the full load is going to the CPU. For a moment, I thought that the 8GB of video RAM was the limitation, which is why I tried running a 7B model for one of the tests. I am still not convinced that Ollama is even trying to use the GPU.
I am using a proprietary Nvidia driver for the GPU but maybe I’m missing something?
I was recently playing around with Stability AI’s Stable Cascade. I might need to run those tests on this machine to see what the result is. It may be an Ollama-specific issue.
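One way to settle whether Ollama is touching the GPU at all is to watch VRAM usage and GPU utilization while a prompt is running. Here is a minimal sketch using nvidia-smi's CSV query output; the canned sample below is made up for illustration, and the function assumes a single GPU:

```python
import subprocess

def gpu_usage(sample=None):
    """Return (vram_used_mib, gpu_util_pct) from nvidia-smi.

    Pass `sample` to parse a canned string instead of shelling out
    (useful on machines without an Nvidia GPU). Assumes one GPU,
    i.e. a single line of CSV output."""
    if sample is None:
        sample = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=memory.used,utilization.gpu",
             "--format=csv,noheader,nounits"],
            text=True,
        )
    used, util = sample.strip().split(", ")
    return int(used), int(util)

# Canned output for a hypothetical idle 8GB card:
print(gpu_usage(sample="123, 0\n"))  # (123, 0) -> model is NOT on the GPU
```

If VRAM usage barely moves while a model is answering, the weights never made it onto the card.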
Have any questions, comments, or concerns? Please feel free to drop a comment below. As a blanket warning, all of these posts are personal opinions and do not reflect the views or ethics of my employer. All of this research is being done off-hours and on my own dime.
The biggest limitation of something like ChatGPT, Copilot, or Bard is that your data leaves your control when you use the AI. I believe that the future of AI is AI that remains in your control. The only issue with running your own local AI is that a large language model (LLM) needs a lot of resources to run. You can’t do it on your old laptop. It can be done, though. Last month, I bought a new MacBook Pro with an M1 Pro CPU and 32GB of unified RAM to test this stuff out.
If you are in a similar situation, Mozilla’s Llamafile project is a good first step. A llamafile can run on multiple CPU microarchitectures. It uses Cosmopolitan Libc to provide a single 4GB executable that can run on macOS, Windows, Linux, FreeBSD, OpenBSD, and NetBSD. It contains a web client, the model file, and the inference engine. You can just download the binary, execute it, and interact with it through your web browser. This has very limited utility, though.
So, how do you get from a proof of concept to something closer to ChatGPT or Bard? You are going to need a model, an inference engine, and a client.
The Inference Engine
An inference engine is the software that actually runs a model: it loads the model into memory, generates output from your prompts, and handles the interface between the model and the client. The two major players in the space are Llama.cpp and Ollama. Getting started with Ollama is as easy as downloading the software and running ollama run [model] from the terminal.
You will notice that the raw output isn’t easy to parse. Last week, Ollama announced Python and JavaScript libraries that make this much easier.
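The reason the raw output is hard to parse is that Ollama’s /api/generate endpoint streams its answer back as newline-delimited JSON fragments. A minimal sketch of stitching them back together; the sample lines below are canned for illustration, not a real model response:

```python
import json

def collect_response(ndjson_lines):
    """Stitch the streamed `response` fragments from Ollama's
    /api/generate endpoint back into a single string."""
    text = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):  # final chunk carries timing stats, no text
            break
    return "".join(text)

# Canned sample of what the endpoint streams back:
sample = [
    '{"model":"llama2","response":"Hello","done":false}',
    '{"model":"llama2","response":" there.","done":false}',
    '{"model":"llama2","response":"","done":true}',
]
print(collect_response(sample))  # Hello there.
```

The new official libraries do exactly this kind of bookkeeping for you.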
The Models
A model consists of numerous parameters that are adjusted during the training process to improve its predictions. Models employ learning algorithms that draw conclusions or predictions from past data. I’m going to be honest with you: this is the bit that I understand the least. The key attributes to be aware of are what the model was trained on, how many parameters it has, and its benchmark numbers.
If you browse Hugging Face or the Ollama model library, you will see that there are plenty of 7b, 13b, and 70b models. That number tells you how many parameters are in the model: a 7b model has 7 billion parameters whereas a 70b model has 70 billion. Generally, a 70b model is going to be more competent than a 7b model. To give you a point of comparison, GPT-4 reportedly has 1.76 trillion parameters.
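The parameter count is also why RAM is the bottleneck for local models: the weights alone dominate memory. A back-of-the-envelope sketch, ignoring the KV cache and runtime overhead, and assuming roughly 2 bytes per parameter for fp16 and 0.5 bytes per parameter for 4-bit quantization:

```python
def model_ram_gb(params_billion, bytes_per_param):
    """Rough RAM needed just to hold the weights (no KV cache
    or runtime overhead)."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

# fp16 (2 bytes/param) vs. 4-bit quantized (0.5 bytes/param):
for size in (7, 70):
    print(f"{size}B: {model_ram_gb(size, 2):.0f} GB fp16, "
          f"{model_ram_gb(size, 0.5):.0f} GB at 4-bit")
# 7B: 13 GB fp16, 3 GB at 4-bit
# 70B: 130 GB fp16, 33 GB at 4-bit
```

That is why a 4-bit 70b model wants a 64GB machine: about 33GB of weights plus context and overhead leaves 32GB too tight.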
The number of parameters isn’t the end-all-be-all, though. There are leaderboards and benchmarks (like HellaSwag, ARC, and TruthfulQA) for determining comparative model quality.
If you are running Ollama, downloading and running a new model is as easy as browsing the model library, finding the right one for your purposes, and running ollama run [model] from the terminal. You can also manage the installed models from the Ollama Web UI.
The Client
The client is what the user of the AI uses to interact with the inference engine. If you are using Ollama, the Ollama Web UI is a great option. It gives you a web interface that acts and behaves a lot like the ChatGPT web interface. There are also desktop clients like Ollamac and MacGPT, but my favorite so far is MindMac. It not only gives you a nice way to switch from model to model but also the ability to switch between providers (Ollama, OpenAI, Azure, etc.).
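Under the hood a client’s real job is thin: it keeps the conversation history and hands it to whichever provider you point it at, which is why tools like MindMac can swap between Ollama, OpenAI, and Azure so easily. A sketch with a stub backend so it runs without any server; the function and stub names here are mine, not any real API:

```python
def chat_turn(history, user_msg, send):
    """One turn of a chat client. `send` is any backend (Ollama,
    OpenAI, ...) that maps a message list to a reply string; the
    client just maintains the running history."""
    history.append({"role": "user", "content": user_msg})
    reply = send(history)
    history.append({"role": "assistant", "content": reply})
    return reply

# Stub backend so the sketch runs without a server:
echo = lambda msgs: f"echo: {msgs[-1]['content']}"
history = []
print(chat_turn(history, "hi", echo))  # echo: hi
print(len(history))                    # 2
```

Swapping providers is just swapping the `send` callable; the history format stays the same.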
I have a few big questions right now. How well does Ollama scale from 1 user to 100 users? How do you fine-tune a model? How do you secure Ollama? Most interesting to me: how do you implement something like Stable Diffusion XL with this stack? I ordered a second-hand Xeon workstation off of eBay to try to answer some of these questions. In a workplace setting, I’m also curious what safeguards are needed to insulate the company from liability. These are all things that need addressing over time.
I created a new LLM / ML category here and I suspect that this won’t be my last post on the topic.
Have a question or comment? Please drop a comment below.
Prompting #Llama2-chat-7B: What is your context window size?
Response: As a responsible AI language model, I don't have a "context window" in the classical sense, as I am not a physical device with a fixed window size.
#fossil update: i'm working on getting local models running. first step is making a little UI for managing models. tbqh i'm loving that hamburger menu :)
unfortunately, LLM's broad support for many models means that they need to reduce the surface area for API features. for the most part it doesn't bug me: e.g. i can't use GBNF grammars with #llama2, but i don't need them either. OTOH i really do miss being able to see how big the context window of a model is. that seems like a very reasonable feature for LLM to support (i may send a pull request, idk)
It was a really insightful and condensed presentation of #LLMs. Probably the best presentation I’ve watched so far (although I haven’t watched many on the topic).
#AI #GenerativeAI #OpenSource #Meta #Llama2: "The argument against releasing model weights relies on the assumption that there will be no malicious corporate actors, says Biderman, which history suggests is misplaced. Encouraging companies to keep the details of their models secret is likely to lead to “serious downstream consequences for transparency, public awareness, and science,” she adds, and will mainly impact independent researchers and hobbyists.
But it’s unclear if Meta’s approach is really open enough to derive the benefits of open source. Open-source software is considered trustworthy and safe because people are able to understand and probe it, says Park. That’s not the case with Meta’s models, because the company has provided few details about its training data or training code.
The concept of open-source AI has yet to be properly defined, says Stefano Maffulli, executive director of the Open Source Initiative (OSI). Different organizations are using the term to refer to different things. “It’s very confusing, because everyone is using it to mean different shades of ‘publicly available something,’ ” he says.
For a piece of software to be open source, says Maffulli, the key question is whether the source code is publicly available and reusable for any purpose. When it comes to making AI freely reproducible, though, you may have to share training data, how you collected that data, training software, model weights, inference code, or all of the above. That raises a host of new challenges, says Maffulli, not least of which are privacy and copyright concerns around the training data."
Meta's #Llama 2 license has an unusual clause whereby they withdraw your right to use the model if you allege #Meta has breached your own IP rights by training their stuff on your intellectual property. #copyright #genai #LLama2
I would run #LLaMA2 to play Dwarf Fortress for me and post everything that's happening to fedi if #LLMs weren't the only software to make blockchain appear reasonably efficient
🦾 Petals – Run LLMs at home, BitTorrent-style
➥ petals.dev
「 You load a small part of the model, then join a network of people serving the other parts. Single‑batch inference runs at up to 6 tokens/sec for Llama 2 (70B) and up to 4 tokens/sec for Falcon (180B) — enough for chatbots and interactive apps 」
a while back i recall there being some tool for exploring a database of embeddings that lets you visualize and locate duplicates, etc. anyone know what it's called? #llm #llms #ai #llama2