Latest Entry // 2-14-26

MoE Shrinking with REAP and Quantization

I was able to find a smaller version of the Minimax-M2.1 model on Hugging Face. This version has only 139B total parameters, compared to the original's ~230B. At a Q4KM quant, it fits within my URAM buffer with room to spare, weighing in at only ~80GB.

While this seems like a definite victory, LLMs aren't magical. The reduction was achieved using Cerebras's REAP methodology at a rate of 40%, meaning 40% of the model's experts were pruned. The model was then retrained on key datasets (here relating to agentic and coding tasks, I believe). This leads to small drops in model performance on many relevant benchmarks, but could present glaring issues if I were to use the model for anything but the tasks it was finetuned on. Fortunately, that's not the path I wish to take. So, then, this model is heavily REAPed and ready for local inference on my APU. It has performed admirably in my testing so far, but I haven't quite put it through the wringer. Still, outperforming the older GPTOSS model seems to be a given.

The only remaining issue I'm encountering is inference speed. At a ~4.3% activation rate (activated parameters over total parameters), the original model was very efficient. Now the total parameter count has almost been halved, but the number of activated experts remains the same, resulting in an activation rate of ~7.2%. With 10B activated parameters, the model runs at ~25tk/s, which can be sluggish at times. Prompt processing performance also leaves something to be desired. Further experiments will likely explore methods of speeding up inference. -K
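P.S. A quick back-of-the-envelope sketch of where those numbers come from. The ~4.6 bits per weight for Q4KM is an assumption roughly eyeballed from the file sizes I'm seeing, not a measured figure:

    # Back-of-the-envelope check on the figures above. BPW for Q4KM is assumed.

    def q4km_size_gb(total_params_b: float, bits_per_weight: float = 4.6) -> float:
        """Approximate GGUF footprint in GB for a parameter count given in billions."""
        return total_params_b * bits_per_weight / 8

    def activation_rate(active_b: float, total_b: float) -> float:
        """Fraction of parameters touched per token in a MoE model."""
        return active_b / total_b

    original_total, pruned_total, active = 230.0, 139.0, 10.0

    print(f"original activation rate: {activation_rate(active, original_total):.1%}")  # ~4.3%
    print(f"REAPed activation rate:   {activation_rate(active, pruned_total):.1%}")    # ~7.2%
    print(f"REAPed Q4KM footprint:    ~{q4km_size_gb(pruned_total):.0f} GB")           # ~80 GB

The pruning shrinks the denominator but not the numerator: memory improves, while the per-token work (and thus speed) does not.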

Archive // Previous Logs

2-13-26 // Recent MoE Experiments

Recently, I've been experimenting with LLM inference on local hardware via llama.cpp, similar to what I was doing with BasecampAgent. I've been testing different large models for the Strix Halo APU agent. Initially, I settled on GPTOSS:120B, which I found sufficient for local inference due to its smallish VRAM usage (65GB at a Q4KM quant) and small activated-parameter count (5B of 116B). Inference was fast at ~55tk/s via Vulkan (RADV), and model performance was adequate, but I was leaving almost half of my APU URAM buffer on the table.

For this reason, I began experimenting with different models to find a more performant one that still fit within my URAM buffer. What I found, however, is that OSS120B actually fills a broad niche. It is small compared to SOTA models like Kimi-2.5, leaving ample room for context at a Q4KM quant. It is also very sparse when it comes to activated parameters, even compared to similar models like GLM-4.6 Air. Most consequentially, there is a distinct model gap between the 120B class and its upstairs neighbors: the closest model, Minimax-M2.1, has almost twice the parameters, but also doubles the activated parameters. That meant not only could I not run it on my system at full size, but inference speeds would be roughly halved as well. I expect my work in the near term to consist of trying to overcome this model gap, and perhaps I'll look into inference speeds as well. -K
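Spelling that screening out as a quick sketch, for my own reference. The 128GB URAM budget and the ~4.6 bits/weight Q4KM estimate are both assumptions (and not all of that memory is actually allocable to the GPU); the per-model numbers are just the approximate figures quoted above:

    # Rough screen for candidate MoE models against the Strix Halo's unified memory.
    # URAM_GB and BPW_Q4KM are assumptions; parameter counts are approximate.

    URAM_GB = 128        # assumed total unified memory; usable GPU budget is lower
    BPW_Q4KM = 4.6       # assumed effective bits per weight at Q4KM

    candidates = {
        # name: (total params in B, activated params in B)
        "GPT-OSS-120B":  (116, 5),
        "Minimax-M2.1":  (230, 10),
    }

    for name, (total_b, active_b) in candidates.items():
        size_gb = total_b * BPW_Q4KM / 8
        verdict = "fits" if size_gb < URAM_GB else "too big"
        print(f"{name:13s} ~{size_gb:4.0f} GB at Q4KM ({verdict}), "
              f"{active_b}B/{total_b}B active = {active_b / total_b:.1%}")

Written out like this, the gap is obvious: the next model up blows past the memory budget and doubles the per-token work at the same time.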
2-13-26 // Testing, Testing, 123...

Welcome! If you're reading this right now, that means the blog (or Lab Notes) is up and running! The purpose of this page (and these notes) is to provide frequent updates on my present work and interests. While I do occasionally post detailed reports of my work on the "Projects" page, I've realized there are many other facets of my process that go undocumented. This blog will hopefully shed some more light on what I do behind the scenes while working towards bigger milestones. And of course, one of those milestones is the successful test of this very website. -K