💰 Rune’s $100k for indie game devs
🤲 The Zed editor is now open source
🦙 Ollama’s new JS & Python libs
🤝 @tekknolagi's Scrapscript story
🗒️ Pooya Parsa's notes from a tired maintainer
🎙 hosted by @jerod
The biggest limitation of something like ChatGPT, Copilot, or Bard is that your data leaves your control when you use the AI. I believe that the future of AI is AI that remains in your control. The only issue with running your own, local AI is that a large language model (LLM) needs a lot of resources to run; you can’t do it on your old laptop. It can be done, though. Last month, I bought a new MacBook Pro with an M1 Pro CPU and 32GB of unified RAM to test this stuff out.
If you are in a similar situation, Mozilla’s Llamafile project is a good first step. A llamafile uses Cosmopolitan Libc to provide a single executable (around 4GB, since the model weights are bundled in) that runs on multiple CPU microarchitectures and on macOS, Windows, Linux, FreeBSD, OpenBSD, and NetBSD. It contains a web client, the model file, and the inference engine. You can just download the binary, execute it, and interact with it through your web browser. This has very limited utility, though.
So, how do you get from a proof of concept to something closer to ChatGPT or Bard? You are going to need a model, an inference engine, and a client.
The Inference Engine
An inference engine is the piece of software that actually runs the model: you load a model into it, it evaluates your prompts against that model, and it handles the interface between the model and the client. The two major players in the space are llama.cpp and Ollama. Getting started with Ollama is as easy as downloading the software and running ollama run [model] from the terminal.
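Under the hood, ollama run starts a local server (port 11434 by default) that speaks a simple REST API, and you can hit it directly. Here is a minimal sketch using Python’s requests library; the model name is just an example and assumes you have already pulled it:

```python
import json

import requests

# Ollama listens on localhost:11434 by default. /api/generate streams the
# answer back as newline-delimited JSON, a token or two per line.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Why is the sky blue?"},
    stream=True,
)
for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    print(chunk["response"], end="", flush=True)
```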
You will notice that the raw, streaming result isn’t easy to parse. Last week, Ollama announced Python and JavaScript libraries to make it much easier.
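With the Python library, the same request collapses to a few lines. This mirrors the example from Ollama’s announcement, with mistral again standing in for whatever model you have pulled:

```python
import ollama

# The library wraps the same REST API; this call blocks until the full
# reply has been generated, then returns it as a dictionary.
response = ollama.chat(
    model="mistral",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])
```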
The Models
A model consists of numerous parameters that are adjusted during training to improve its predictions; the model uses what it learned from past data to make predictions about new input. I’m going to be honest with you. This is the bit that I understand the least. The key attributes to be aware of with models are what the model is trained on, how many parameters it has, and the model’s benchmark numbers.
If you browse Hugging Face or the Ollama model library, you will see that there are plenty of 7b, 13b, and 70b models. That number tells you how many parameters are in the model: a 7b model has 7 billion parameters, whereas a 70b model has 70 billion. Generally, a 70b model is going to be more competent than a 7b model. To give you a point of comparison, GPT-4 reportedly has 1.76 trillion parameters.
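Parameter count also drives memory use, which is what decides whether a model fits on your machine at all. As a back-of-the-napkin estimate (my own rough math, ignoring the KV cache and other overhead), the weights alone need roughly parameters times bytes per parameter:

```python
def approx_weight_ram_gb(params_billion: float, bits_per_param: int) -> float:
    """Rough size of the model weights alone; real usage is higher."""
    return params_billion * (bits_per_param / 8)

# A 7b model quantized to 4 bits per parameter is around 3.5GB...
print(approx_weight_ram_gb(7, 4))   # 3.5
# ...while a 70b model at the same quantization is around 35GB,
# which already crowds 32GB of unified RAM.
print(approx_weight_ram_gb(70, 4))  # 35.0
```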
The number of parameters isn’t the be-all and end-all, though. There are leaderboards and benchmarks (like HellaSwag, ARC, and TruthfulQA) for determining comparative model quality.
If you are running Ollama, downloading and running a new model is as easy as browsing the model library, finding the right one for your purposes, and running ollama run [model] from the terminal. You can also manage the installed models from the Ollama Web UI.
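The Python library exposes the same management operations if you would rather script them. A sketch; the exact fields in the list() response are from the version of the library I tried and may change:

```python
import ollama

ollama.pull("mistral")            # download a model from the Ollama library
for model in ollama.list()["models"]:
    print(model["name"])          # list what is installed locally
ollama.delete("mistral")          # remove a model you are done with
```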
The Client

The client is what the user of the AI uses to interact with the inference engine. If you are using Ollama, the Ollama Web UI is a great option. It gives you a web interface that looks and behaves a lot like the ChatGPT web interface. There are also desktop clients like Ollamac and MacGPT, but my favorite so far is MindMac. It not only gives you a nice way to switch from model to model but also the ability to switch between providers (Ollama, OpenAI, Azure, etc.).
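If none of those fit, the Python library is enough to roll a bare-bones client of your own. This sketch streams tokens as they are generated, the way the web UIs do; the host URL is whatever machine your Ollama server runs on:

```python
from ollama import Client

# Point the client at any Ollama server, local or elsewhere on your network.
client = Client(host="http://localhost:11434")

stream = client.chat(
    model="mistral",
    messages=[{"role": "user", "content": "Explain unified memory briefly."}],
    stream=True,  # yields the reply chunk by chunk instead of all at once
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
```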
I have a few big questions right now. How well does Ollama scale from 1 user to 100 users? How do you fine-tune a model? How do you secure Ollama? Most interesting to me: how do you implement something like Stable Diffusion XL with this stack? I ordered a second-hand Xeon workstation off of eBay to try to answer some of these questions. In a workplace setting, I’m also curious what safeguards are needed to insulate the company from liability. These are all things that need addressing over time.
I created a new LLM / ML category here and I suspect that this won’t be my last post on the topic. As a blanket warning, all of these posts are personal opinions and do not reflect the views or ethics of my employer. All of this research is being done off-hours and on my own dime.
Have a question or comment? Please drop a comment below.
Running Mistral LLM locally with Ollama's 🦙 new Python 🐍 library inside a dockerized 🐳 environment with an allocation of 4 CPUs and 8 GB RAM. It took 19 sec to get a response 🚀. The last time I tried to run an LLM locally, it took 10 minutes to get a response 🤯
Anyone out there dabbling with on-prem AI? All of the numbers that I’m seeing for RAM requirements on 7b, 13b, 70b models seem to be correct for a 1-user scenario but I’m curious what folks are seeing for 2, 10, or 50 users.
My next big goal is to convince my employer to host #Ollama on a machine that is within the corporate network. I asked IT if they have a box kicking around with "a fantastic amount of RAM in it" for a proof-of-concept. I'm curious how this thing will run with a load on it.
#HomeLab Saturday.
The goal today was to get #ollama deployed on the cluster. That’s a fun way to run your own models on whatever accelerators you have handy. It’ll run on your CPU, sure, but man is it slow.
Nvidia now ships a GPU operator, which handles annotating nodes and managing the resource type. “All you need to do” — the most dangerous phrase in computers — is smuggle the GPUs through whatever virtualization you’re doing, and expose them to containerd properly.
Maintaining Wake-on-LAN on a dual-boot Windows 10 / Ubuntu 22.04 LTS system is a hassle. So I went with a simple Fingerbot solution. Now I have Wake-on-Zigbee!
By default the system boots into Ubuntu which hosts an Ollama server and does some video compression jobs (I wanted to be able to start those remotely). I only use Windows for VR gaming when I'm physically in the room and therefore can select the correct partition at boot.