Self hosted LLM

https://sh.itjust.works/post/15204388

Hello internet users. I have tried gpt4all and like it, but it is very slow on my laptop. I was wondering if anyone here knows of any solutions I could run on my server (debian 12, amd cpu, intel a380 gpu) through a web interface. Has anyone found any good way to do this?

passepartout Feb 25, 2024

I tried Huggingface TGI yesterday, but all of the reasonable models need at least 16 gigs of vram. The only model I got working (on a desktop machine with an AMD 6700 XT gpu) was Microsoft phi-2.

HumanPerson Feb 25, 2024

I know the gpt4all models run fine on my desktop with 8 gigs of vram. It does use a decent chunk of my normal ram though. Could the gpt4all models work on huggingface or do they use different formats? Sorry if I am completely misunderstanding huggingface, I hadn’t heard of it until now.

passepartout Feb 25, 2024

Huggingface TGI is just a piece of software handling the models, like gpt4all. Here is a list of models officially supported by TGI, although they state that you can try different ones as well. Follow the link and look for the files section. The size of the model files (safetensors or pickle binaries) gives a good estimate of how much vram you will need. Sadly this is more than most consumer graphics cards have, except for santacoder and Microsoft phi.
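The file-size rule of thumb above can be sketched as a small calculation. This is only a rough estimate under stated assumptions: weights take about (parameter count x bytes per parameter), and the 20% headroom for activations and cache is an assumed figure, not something measured or taken from TGI.

```python
# Rough VRAM estimate from parameter count and weight precision.
# Assumption: memory is dominated by the weights; the default 20% overhead
# is a guess for activations and cache, not a measured value.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_vram_gib(n_params: float, dtype: str = "fp16", overhead: float = 0.2) -> float:
    """Return an approximate VRAM requirement in GiB."""
    weights_bytes = n_params * BYTES_PER_PARAM[dtype]
    return weights_bytes * (1.0 + overhead) / 2**30
```

For example, a 7-billion-parameter model in fp16 comes to roughly 13 GiB of weights before any overhead, which matches the "at least 16 gigs" experience, while a model of about 2.7 billion parameters such as phi-2 lands well under that.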

Supported Models and Hardware

HumanPerson Feb 25, 2024

I don’t really want to try to get that to work. I wonder how hard it would be to create my own webui using gpt4all’s Python package.
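A homemade web UI along those lines might look roughly like the sketch below. It uses only Python's standard `http.server` plus the `gpt4all` package's `GPT4All(...)` and `generate(...)` calls; the form layout, the `handle_form` and `serve` helpers, the port, and the `max_tokens` value are all assumptions made for illustration, and there is no streaming, chat history, or authentication.

```python
import html
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable
from urllib.parse import parse_qs

FORM = (
    "<!doctype html><meta charset='utf-8'><title>Local LLM</title>"
    "<form method='post'><textarea name='prompt' rows='6' cols='60'></textarea>"
    "<br><button>Send</button></form>"
)


def handle_form(generate: Callable[[str], str], raw_body: str) -> str:
    """Turn a urlencoded form body into an HTML page containing the reply."""
    prompt = parse_qs(raw_body).get("prompt", [""])[0]
    reply = generate(prompt)
    # Escape the model output so it cannot inject markup into the page.
    return FORM + "<pre>" + html.escape(reply) + "</pre>"


def make_handler(generate: Callable[[str], str]):
    class Handler(BaseHTTPRequestHandler):
        def _send_html(self, body: str) -> None:
            data = body.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

        def do_GET(self) -> None:
            self._send_html(FORM)

        def do_POST(self) -> None:
            length = int(self.headers.get("Content-Length", 0))
            raw = self.rfile.read(length).decode("utf-8")
            self._send_html(handle_form(generate, raw))

    return Handler


def load_gpt4all(model_name: str) -> Callable[[str], str]:
    """Wrap a gpt4all model as a prompt -> text function (needs `pip install gpt4all`)."""
    from gpt4all import GPT4All  # imported lazily so the rest works without it

    model = GPT4All(model_name)  # downloads the model file if it is not present
    return lambda prompt: model.generate(prompt, max_tokens=256)


def serve(generate: Callable[[str], str], host: str = "127.0.0.1", port: int = 8000) -> None:
    """Blocking single-threaded server; fine for one user, not for production."""
    HTTPServer((host, port), make_handler(generate)).serve_forever()
```

To try it, call `serve(load_gpt4all(name))` with the file name of a model that gpt4all already offers, then open http://127.0.0.1:8000 in a browser. Binding to 127.0.0.1 keeps it off the network; exposing it from a server would need a reverse proxy and some form of access control.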

Kir Feb 25, 2024

Have you been able to use it with your AMD GPU? I have a 6800 and would like to test something

passepartout Feb 25, 2024

Yes, since we have similar gpus you could try the following to run it in a docker container on linux:

#!/bin/bash
model=microsoft/phi-2
# share a volume with the Docker container to avoid downloading weights every run
volume=/home/ben/stuff/huggingface-text-generation-interface/data

docker run \
  -e HSA_OVERRIDE_GFX_VERSION=10.3.0 \
  -e PYTORCH_ROCM_ARCH="gfx1031" \
  --device /dev/kfd --device /dev/dri \
  --shm-size 1g -p 8080:80 \
  -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:1.4-rocm \
  --model-id $model

Note how the rocm version has a different tag and that you need to mount your gpu device into the container. The two environment variables are specific to my (and maybe yours also) gpu architecture.
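Once the container is up, it can be queried over HTTP. The sketch below assumes the port mapping from the script above (host port 8080) and uses TGI's `/generate` endpoint with an `inputs` string and a `parameters` object; the helper names and the default `max_new_tokens` are made up for illustration.

```python
import json
import urllib.request


def build_tgi_request(prompt: str, max_new_tokens: int = 64,
                      base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a POST request for TGI's /generate endpoint."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        base_url + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tgi_generate(prompt: str, **kwargs) -> str:
    """Send the prompt to a running TGI server and return the generated text."""
    with urllib.request.urlopen(build_tgi_request(prompt, **kwargs), timeout=120) as resp:
        return json.load(resp)["generated_text"]
```

Calling `tgi_generate("Write a haiku about GPUs")` only works while the container is running and the model has finished loading.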

johntash Feb 25, 2024

Ollama and localai can both be run on a server with no gpu. Neither comes with a web ui though, so you’d need to point a separate one at them if you want that.
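For Ollama, a web ui or script talks to the server over its HTTP API. A minimal sketch, assuming Ollama is listening on its default port 11434 and that a model has already been pulled; the model name "mistral" and the helper names are placeholders, not anything from this thread.

```python
import json
import urllib.request


def build_ollama_request(prompt: str, model: str = "mistral",
                         base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        base_url + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ollama_generate(prompt: str, **kwargs) -> str:
    """Send the prompt to a running Ollama server and return its reply text."""
    with urllib.request.urlopen(build_ollama_request(prompt, **kwargs), timeout=300) as resp:
        return json.load(resp)["response"]
```

The long timeout is deliberate: on a server without a gpu, a single reply can take a while.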

h3ndrik Feb 25, 2024

GitHub - LostRuins/koboldcpp: Run GGUF models easily with a KoboldAI UI. One File. Zero Install.

GoogleyWoog Feb 25, 2024

I use KoboldCPP, works perfectly over the internet. Not sure how Intel support is though, and with 6GB VRAM the whole thing is of questionable utility.

Scrubbles Feb 25, 2024

text-generation-webui is kind of the standard from what I’ve seen to run it with a webui, but the vram stuff here is accurate. Text LLMs require an insane amount of vram to keep a conversation going.

grilledcheesecowboy Feb 25, 2024

I've had pretty good luck running llamafile on my laptop. The speeds aren't super fast, and I can only use the models that are Mistral 7B and smaller, but the results are good enough for casual use and general R and Python code.

GitHub - Mozilla-Ocho/llamafile: Distribute and run LLMs with a single file.

k_rol Feb 25, 2024

Did you try LM Studio?

LM Studio - Local AI on your computer

Possibly linux Feb 26, 2024

It’s proprietary.

dan Feb 26, 2024

You’ll need a good GPU for best results.

slacktoid Feb 26, 2024

Ollama is a nice server base, there are lots of projects that plug in on top of it.