Everything is AI nowadays. Your code? AI. Your art? AI. Your toaster? AI (it’s coming. Probably. Idk). So I figured, if you can’t beat them, join them. But we don’t do simple here, we always gotta go big and then go home (yes, there is no “or”. It’s always go big then go home).
So, addressing the elephant in the room: yes, I know it’s been over a month since my last post. I’ve been busy with a lot. This post will cover some of that.
I’ve looked into Ollama at least a few times in the past and tried to set it up before, then realized I didn’t have any actual use for it in my homelab. Well, that changed recently when I started reading about (and pretending to understand) RAG and MCP. I decided I’d redownload Ollama and see if I could stumble my way into some kind of local AI-RAG-Gitea hybrid monster setup, so I could query my Gitea-based wiki when I had questions about something I broke instead of having to actually go to Gitea and find the answer. Spoiler: it worked, kinda. The limitations are things I just can’t solve right now because I don’t feel like it.
Let’s back up to the actual installation of Ollama first though, because that’s where 90% of the trouble was, and why this post is named some nonsense about fighting llamas. A prerequisite to understanding anything I’m about to say is understanding that LLMs do not run well in constrained environments, for a few reasons. 1) CPU architecture just isn’t as good as GPU architecture at running models. 2) Because of number 1, if you don’t have the means to run a larger model, you have to run a smaller one, which carries its own constraints. This is especially relevant when you get into tokens and context, which play a BIG part in RAG and MCP. 3) GPUs are expensive. Still. And they’re not going down in price, so we unfortunately have to temper our expectations to what we can actually do, which is hard in itself because we all like shiny things that work.
All that said, I’m aware that none of this makes any sense yet. Don’t worry, that’s fine. I don’t understand it completely either. This is also less of a me-teaching-anything post and more of me putting thoughts on (digital) paper just so I can get it off my to-do list. Let’s see if we can un-ADHD this post and actually manage to get somewhere instead of just rambling:
The initial setup and initial issues
Like I mentioned, LLMs don’t do well in constrained environments. My main constraint is that I never really planned LLMs or AI at all into my original server architecture design. I’m running two 14-core Intel Xeon E5-2660 v4 CPUs, 64GB of RAM, and more storage than I need (I think there’s like 15ish TB of space in the entire server? Idk. It’s a lot. Most of it isn’t being used for anything, I just had some old HDDs lying around so I threw them in there). Nothing to scoff at as far as small home servers go, but definitely not something capable of running more than the smaller LLM models. I have done one notable upgrade to the system since I built it, and that was adding an Nvidia GTX 1050 Ti. Why the 1050 Ti? Well, the PSU I’m using only had one 6-pin connector available. I was also limited in space inside the case (I’m using a Lenovo P510 case). So I was really limited in what I could squeeze into the server without upgrading more than I wanted to. The 1050 Ti really isn’t a terrible card; it’s a 4GB card with a ridiculously low power draw of about 75W, so the performance-to-power ratio is there for sure. It’s not LLM levels of performance though.
The next issue came up when I was actually setting up the LLM server. Keep in mind I run pretty much everything on Proxmox; it’s either in a VM or a container. Rarely do I ever bother trying to set something up on a dedicated machine. Proxmox itself does not require a GPU, but it can offer GPU passthrough to a container or VM. You will still have to install the needed drivers on the guest though, and Nvidia isn’t known as being terribly Linux friendly. That’s improved slightly over the past few years, but it’s still not great. I ended up having to consult this documentation to get the drivers installed and working. Ultimately I could not figure out why they would not work on Ubuntu Server, so I gave up and took the Ubuntu Desktop route, where I could have it install Nvidia’s proprietary drivers during the install. Great, so we have drivers installed. Where are we now? Well, we can actually install Ollama and get that running. That’s incredibly easy since it’s just a single command:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```
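Once the install finishes, a quick sanity check from the CLI is worth doing before layering anything else on top. The model tag below is just an example (not necessarily what I ran), but the commands themselves are stock Ollama:

```bash
# Pull a small model and ask it something (model tag is an example)
ollama pull llama3.2:3b
ollama run llama3.2:3b "Explain GPU passthrough in one sentence."

# Confirm the loaded model actually landed on the GPU instead of the CPU
ollama ps
nvidia-smi
```

`ollama ps` shows whether the loaded model is sitting on the GPU or spilling over to the CPU, which matters a lot on a 4GB card.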
The install works fine out of the box, and following the quickstart guide works fine too. But I don’t want to use it over the CLI, I want a WebUI. So we come to part 2.
OpenWebUI
OpenWebUI is… a mixed bag. It’s great in some ways and not so great in others. The documentation honestly needs updating and a lot of TLC, but it’s passable, and you can usually find an answer to something by asking on Reddit. There may also be an issue open on their GitHub describing a situation similar to whatever you run into.
I opted to run OpenWebUI in Docker because that seems to be the preferred method? I think? It’s hard to tell. I got the container up and running after a few hiccups, mostly GPU related, but those were solved by reading this. So, we have Ollama running, we have a model or two pulled, and we have OpenWebUI up and running. What now?
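For what it’s worth, the Docker command I landed on looked roughly like the sketch below. Treat it as a sketch: the port, volume name, image tag, and `OLLAMA_BASE_URL` value are the ones the OpenWebUI docs recommended at the time I set this up, so double-check them against the current docs before copy-pasting.

```bash
# Rough sketch: OpenWebUI in Docker, pointed at Ollama running on the host,
# with GPU support. Flags per the OpenWebUI docs at the time; verify before use.
docker run -d \
  --name open-webui \
  --gpus all \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --restart always \
  ghcr.io/open-webui/open-webui:cuda
```

The `--gpus all` flag only works if the NVIDIA Container Toolkit is set up on the Docker host, which is its own little adventure and where most of the GPU-related hiccups tend to live.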
Testing - both my patience and the application stack
I started combing through documentation and videos on how to get the Ollama-OWU stack into a working state where it was usable by multiple users, queryable, etc. A few hours later it was in a pretty good state. But it still wasn’t where I needed it to be. Remember, about half an hour of reading ago, where I mentioned RAG and MCP and stuff? Yeah, we still weren’t there yet.
RAG and coming to terms with my situation
So, I spent probably another day or so reading about RAG and MCP and realizing I was definitely not smart enough to understand any of it, but I wasn’t about to let that stop me (yet). OpenWebUI has these things called “knowledgebases”, which are kinda RAG-adjacent and can be utilized in a RAG-esque way. It also has an entire API built in that can be used programmatically. See where this is going? Yep, Python.

Here’s where I ran into a lot of issues with the state of the documentation. It’s okay-ish for most things, but I kept hitting things that I felt should honestly have been addressed in the documentation and just weren’t (and based on some of the GitHub issues, others felt similarly). I managed to stumble my way through understanding how files were uploaded, what knowledgebases were, and how to interact with them (mostly) through the API, and felt I was in a decent state.

Now for my next hurdle: OpenWebUI has no concept of folders. Files are uploaded and either put in a knowledgebase or just left in the open ether of OpenWebUI (I still don’t know where the files not in knowledgebases live in the UI). The wiki, like most things that exist in a repository, is structured in folders. So how was I going to replicate that structure in OpenWebUI? Well, unfortunately, the answer was that I wasn’t. I was going to have to accept that the structure would be gone and it would just be a bunch of files uploaded to the knowledgebase. Which is fine, because I wasn’t going to be interacting with them, the LLM was, and it doesn’t care how they’re stored.

So, using the API documentation and a few Reddit threads, I managed to get a Python script together that would take files from a repository (locally cloned - I didn’t feel like dealing with auth or runners at this point), check whether they already existed in the knowledgebase, and upload them if they didn’t. The script was honestly more complex than this because there were multiple API calls needed for various pieces and parts, but that’s the basics of what it did; a rough sketch is below. Great, so we’ve taken a manual solution of uploading files and replaced it with… a manual solution of running a script to upload files. We still haven’t actually automated anything.
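For posterity, the script boiled down to something like the sketch below. The endpoint paths and response shapes are based on the OpenWebUI API docs at the time (they may have changed since), and the base URL, knowledgebase ID, and repo path are placeholders, so treat this as a rough reconstruction rather than the actual script.

```python
#!/usr/bin/env python3
"""Rough sketch of the wiki-to-knowledgebase upload script.
Endpoint paths and JSON shapes follow the OpenWebUI API docs at the time;
verify against your version before trusting any of this."""
import os
import requests

BASE_URL = "http://owu.example.local:3000"   # placeholder OpenWebUI address
API_KEY = os.environ["OWU_API_KEY"]          # API key from OpenWebUI settings
KNOWLEDGE_ID = "wiki-knowledgebase-id"       # placeholder knowledgebase ID
REPO_DIR = "/opt/wiki-clone"                 # locally cloned wiki repo

HEADERS = {"Authorization": f"Bearer {API_KEY}"}


def existing_filenames():
    """Names of files already attached to the knowledgebase.
    Response shape may differ between OpenWebUI versions."""
    r = requests.get(f"{BASE_URL}/api/v1/knowledge/{KNOWLEDGE_ID}", headers=HEADERS)
    r.raise_for_status()
    return {f["meta"]["name"] for f in r.json().get("files", [])}


def upload_file(path):
    """Upload a file to OpenWebUI and return its file ID."""
    with open(path, "rb") as fh:
        r = requests.post(f"{BASE_URL}/api/v1/files/", headers=HEADERS,
                          files={"file": fh})
    r.raise_for_status()
    return r.json()["id"]


def add_to_knowledge(file_id):
    """Attach an already-uploaded file to the knowledgebase."""
    r = requests.post(f"{BASE_URL}/api/v1/knowledge/{KNOWLEDGE_ID}/file/add",
                      headers=HEADERS, json={"file_id": file_id})
    r.raise_for_status()


def main():
    already_there = existing_filenames()
    for root, dirs, files in os.walk(REPO_DIR):
        dirs[:] = [d for d in dirs if d != ".git"]  # skip git internals
        for name in files:
            if not name.endswith(".md") or name in already_there:
                continue  # wiki is markdown; adjust the filter if yours isn't
            file_id = upload_file(os.path.join(root, name))
            add_to_knowledge(file_id)
            print(f"uploaded {name}")


if __name__ == "__main__":
    main()
```

One side effect of flattening the folder structure: two files with the same name in different folders will collide in this check, which falls squarely into the “limitations I don’t feel like solving right now” bucket.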
Automation. I love it, it hates me.
This is actually the step where things went from practical to theoretical. I haven’t done most of the things in this section because of the issues I ran into. The ultimate goal of this part was to have a runner that watched for pushes to the wiki repo, then launched some kind of task to clone the repo down to the Ollama server, and then, on a cron schedule, run the script that did the checks and put the files in OpenWebUI. It all worked in theory, and I could have gotten it working in practice, but here’s the thing: my server wasn’t really capable of fully realizing RAG due to context and token limits.

Remember, I was running a smaller model. Smaller models, by default, have lower resource requirements, but because of this they are also more limited in what they can do. So I increased the context size. A quick explanation of what that even is: context length/size refers to the amount of input text (tokens) that an LLM will process. This also encompasses the LLM’s response space, because it uses the context length to reserve some room for its reply. It should go without saying, then, that more context length = more resources (a sketch of how to bump it in Ollama is below).

You can see where this is going, right? Yep. We ran into resource constraints. Remember the server specs I put up earlier? Those are SHARED resources, because I’m running other VMs and containers too, so I can’t give the LLM the entire 28 cores and 64GB of RAM. It only had access to about 16GB of RAM and, I think, 6 cores. This worked wonderfully for web searches and plain model-generated answers, but not so much for RAG. To actually realize RAG I would need the model to take in a query, analyze files, generate an answer based on the contents of those files, and then hand me that answer with a sort of works-cited of which file it came from. That takes a lot of resources, even on smaller models. I quickly found that if I asked the model a question about a document, it would cap the RAM and then still not be able to provide an adequate answer. So came the end of my local AI.
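For reference, raising the context window on an Ollama model looks roughly like this. The model tag, new name, and 8192 value are just examples (not necessarily what I used); `num_ctx` is Ollama’s parameter for context length, and every extra token of context costs memory for the KV cache, which is exactly where my ~16GB ceiling came into play.

```bash
# Create a variant of a small model with a larger context window.
# Model tag and num_ctx value are examples, not necessarily what I ran.
cat > Modelfile <<'EOF'
FROM llama3.2:3b
PARAMETER num_ctx 8192
EOF

ollama create llama3.2-bigctx -f Modelfile
ollama run llama3.2-bigctx
```

OpenWebUI also exposes a context length knob in its advanced model parameters, if you’d rather not touch a Modelfile.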
Retrospective and future state
I haven’t entirely given up on doing this, but for the time being I’ve tabled it. I do have plans for a dedicated box for the LLM. I plan to purchase a Framework Desktop in the future and use that for local LLM and RAG; it’s already been tested by a few people online, and they’ve shown that the Ryzen AI Max 300 series is a wonderful chip for LLMs. I’ve also toyed with using my Turing Pi 2 for a similar purpose, but I’d have to dive into Kubernetes and figure out how to Kubernetes first, because that’s a sphere I haven’t even begun to mess with in any depth.
So, in short, I’m not done with this and not giving up on it, I’m just tabling it for the time being.
TL;DR of the above: Local AI is hard. It might be worth it, not sure, I’ll figure it out one day.
Anywho, I’ll post again in uh… a while. I’m currently working on building out a pretty fun cyber lab, so I’ll probably post about that after I finish.