Everybody’s talking about Mistral, an upstart French challenger to OpenAI

On Monday, Mistral AI announced a new AI language model called Mixtral 8x7B, a "mixture of experts" (MoE) model with open weights that reportedly truly matches OpenAI's GPT-3.5 in performance—an achievement that has been claimed by others in the past but is being taken seriously by AI heavyweights such as OpenAI's Andrej Karpathy and Jim Fan. That means we're closer to having a ChatGPT-3.5-level AI assistant that can run freely and locally on our devices, given the right implementation.

Mistral, based in Paris and founded by Arthur Mensch, Guillaume Lample, and Timothée Lacroix, has seen a rapid rise in the AI space recently. It has been quickly raising venture capital to become a sort of French anti-OpenAI, championing smaller models with eye-catching performance. Most notably, Mistral's models run locally with open weights that can be downloaded and used with fewer restrictions than closed AI models from OpenAI, Anthropic, or Google. (In this context "weights" are the computer files that represent a trained neural network.)

Mixtral 8x7B can process a 32K token context window and works in French, German, Spanish, Italian, and English. It works much like ChatGPT in that it can assist with compositional tasks, analyze data, troubleshoot software, and write programs. Mistral claims that it outperforms Meta's much larger LLaMA 2 70B (70 billion parameter) large language model and that it matches or exceeds OpenAI's GPT-3.5 on certain benchmarks, as seen in the chart below.
A chart of Mixtral 8x7B performance vs. LLaMA 2 70B and GPT-3.5, provided by Mistral.

The speed at which open-weights AI models have caught up with OpenAI's top offering a year ago has taken many by surprise. Pietro Schirano, the founder of EverArt, wrote on X, "Just incredible. I am running Mistral 8x7B instruct at 27 tokens per second, completely locally thanks to @LMStudioAI. A model that scores better than GPT-3.5, locally. Imagine where we will be 1 year from now."

LexicaArt founder Sharif Shameem tweeted, "The Mixtral MoE model genuinely feels like an inflection point — a true GPT-3.5 level model that can run at 30 tokens/sec on an M1. Imagine all the products now possible when inference is 100% free and your data stays on your device." To which Andrej Karpathy replied, "Agree. It feels like the capability / reasoning power has made major strides, lagging behind is more the UI/UX of the whole thing, maybe some tool use finetuning, maybe some RAG databases, etc."

Mixture of experts

So what does mixture of experts mean? As this excellent Hugging Face guide explains, it refers to a machine-learning model architecture where a gate network routes input data to different specialized neural network components, known as "experts," for processing. The advantage of this is that it enables more efficient and scalable model training and inference, as only a subset of experts are activated for each input, reducing the computational load compared to monolithic models with equivalent parameter counts.

In layperson's terms, a MoE is like having a team of specialized workers (the "experts") in a factory, where a smart system (the "gate network") decides which worker is best suited to handle each specific task. This setup makes the whole process more efficient and faster, as each task is done by an expert in that area, and not every worker needs to be involved in every task, unlike in a traditional factory where every worker might have to do a bit of everything.

OpenAI has been rumored to use a MoE system with GPT-4, accounting for some of its performance. In the case of Mixtral 8x7B, the name implies that the model is a mixture of eight 7 billion-parameter neural networks, but as Karpathy pointed out in a tweet, the name is slightly misleading because, "it is not all 7B params that are being 8x'd, only the FeedForward blocks in the Transformer are 8x'd, everything else stays the same. Hence also why total number of params is not 56B but only 46.7B."

Mixtral is not the first "open" mixture of experts model, but it is notable for its relatively small size in parameter count and performance. It's out now, available on Hugging Face and BitTorrent under the Apache 2.0 license. People have been running it locally using an app called LM Studio. Also, Mistral began offering beta access to an API for three levels of Mistral models on Monday.

Image

Image alternative text

praise_idleness, 5 months ago

They already got the holy AI you sons of a silly person!

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

rigatti, 5 months ago

I’m not talking about Mistral. Wait… crap.

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

themurphy, 5 months ago

This is the inevitable future of AI. In a few years there will be an AI model for almost anything in the world created by various companies.

This tool is too powerful to ignore.

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

cheese_greater, 5 months ago (edited 5 months ago)

I wonder if it can be coaxed to talk shit about L’académie…🤔 That would be absolutement l’hilarité

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

Taleya, 5 months ago

As an Australian, i’m a fan

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

Ashyr, 5 months ago

It’s neat, but I hear you need a really beefy system to make it work.

It may be an insurmountable hurdle to bring such capabilities to lesser systems, so I’m not necessarily complaining, I just wish it was more accessible.

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

bioemerl, 5 months ago

Mixtral GPTQ can run on a 3090

Mistral 7b can run on most modern gpus

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

joneskind, 5 months ago (edited 5 months ago)

Oh boy, I missed Mixtral GPTQ and only tried Mistral 7b

Currently downloading mixtral-8x7b-v0.1.Q4_K_M.gguf

Thank you!

EDIT: mixtral-8x7b-v0.1.Q4_K_M.gguf was to heavy for my Mac but mixtral-8x7b-v0.1.Q3_K_M.gguf runs fine AF

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

bioemerl, 5 months ago

Be warned, prompt processing is slow

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

joneskind, 5 months ago

It is indeed. I’m switching to the instruct model to see if I can get better results for code and documentation.

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

daredevil, 5 months ago

I'm looking forward to the day where these tools will be more accessible, too. I've tried playing with some of these models in the past, but my setup can't handle them yet.

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

joneskind, 5 months ago

You should definitely try Mistral. It runs on a potato

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

daredevil, 5 months ago (edited 5 months ago)

I'll give it a shot later today, thanks

edit: Tried out mistral-7b-instruct-v0.1.Q4_K_M.ggufvia the LM Studio app. it runs smoother than I expected -- I get about 7-8 tokens/sec. I'll definitely be playing around with this some more later.

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

GBU_28, 5 months ago

Are you running llama.cpp and a gguf format of the model?

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

daredevil, 5 months ago

I believe I was when I tried it before, but it's possible I may have misconfigured things

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

GBU_28, 5 months ago

Have you checked out llama-cpp-python? The API is very simple, from the readme

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

daredevil, 5 months ago

I haven't, but I'll keep this in mind for the future -- thanks.

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

iopq, 5 months ago

For this one, you should be able to run it on anything with 8GB of VRAM. That said, it may not be fast. You will probably want a Turing or newer card with as much VRAM bandwidth as possible.

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

daredevil, 5 months ago

That's good to know. I do have 8GB VRAM, so maybe I'll look into it eventually.

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

joneskind, 5 months ago

I run it fine on a base model MacBook Air with 8Gb of RAM and absolutely crazy on a 30 GPU cores M2 Max. Didn’t try on my company’s M1 Pro but I will tomorrow.

I use the LMStudio app and download Mistral from there. The heavier model for my beefy Mac and a 3Gb one for the Air. GPU acceleration with Metal enabled.

I tried a lot of models for development purposes and this one blew my mind.

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

cheese_greater, 5 months ago

Seriously? Might have to try it

Can you, like, “have” or keep it?

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

joneskind, 5 months ago

You download the model and it’s on your computer for as long as you want. The whole point is to be able to use it locally.

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

cheese_greater, 5 months ago

So it is entirely local? Schweet! How large is it (3GB for Air or something?)

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

joneskind, 5 months ago

So it is entirely local? Absolutely

How large is it? 12 models of quantization, from 3.08GB to 7.70GB

I use mistral-7b-instruct-v0.1.Q3_K_L.gguf 3.82GB on the MBA

Note that it might crash sometimes during computation. Just push the button “reload” then “continue” and the model finish its sentence as if nothing happened. I don’t know if its related to MLStudio (the app using the model) or the model itself though.

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

ichbinjasokreativ, 1 month ago

Something like mistral-dolphin (4GB) and mixtral-dolphin (26GB) are running very smoothly on my 6900xt on rocm 6

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

Mixture of experts

Add comment