@killyourfm You may also wanna take a look at koboldcpp, which is a jack of all trades.
It supports pretty much every model type, context size, and setting you could imagine.
It doesn't play nice with CUDA 12 right now, but it runs well on CPU alone.
It also plays very nicely inside Docker (even with GPU acceleration).