Latest Entry: 2-18-26

MoEs, Sparsity, and Post-Training Optimizations

After extensive research related to potential MoE inference-time speedups, I've hit something of a wall. It seems that while certain methods do exist, such as expert skipping, it's difficult to actually implement them while retaining a high degree of quality. Indeed, I can find no feasible inference-time method for pre-existing MoE models. I think I may take this opportunity to dig beyond the surface and expand my knowledge of MoEs, sparsity, and training-time speedups. While nothing is definite yet, I've developed a twofold plan that should allow me to create a model that achieves fast inference via post-training processes. My first step would be to take an existing small dense model, such as the recently released Nanbeige-4.1-3B, and attempt to make it more competent via Sparse Upcycling. This process would turn the dense model with 3B parameters into a sparse model with 3B activated parameters and over 12B total parameters. Following this, my next step would be to convert the new sparse model into a fine-grained MoE, decreasing expert size. I would then theoretically be able to decrease the number of activated parameters and finetune the router using relevant Agentic/Coding datasets to ensure core functionality remains. I'd use GaLore to accomplish this in a timely fashion with my limited hardware budget. The largest problem I expect to encounter is in re-training the router: if done incorrectly, I could theoretically encounter routing/gating collapse, which would of course render the model unusable. While this path poses many challenges and would be work-intensive, it seems a worthy endeavor, given the lack of opportunity for larger model speedups. -K
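The parameter counts in the plan above can be sanity-checked with some back-of-envelope accounting. Here's a minimal sketch; the FFN fraction, expert counts, and top-k values are illustrative assumptions, not measurements of Nanbeige-4.1-3B.

```python
# Rough parameter accounting for Sparse Upcycling followed by a
# fine-grained expert split. All specific numbers are assumptions.

def moe_params(shared, expert_size, num_experts, top_k):
    """Total and per-token-activated parameter counts for a simple MoE.
    `shared` covers attention, embeddings, and norms (always active)."""
    total = shared + num_experts * expert_size
    active = shared + top_k * expert_size
    return total, active

DENSE = 3e9
FFN_FRAC = 0.55                 # assumed share of dense params in the FFN
ffn = DENSE * FFN_FRAC
shared = DENSE - ffn

# Step 1 -- Sparse Upcycling: copy the dense FFN into 8 experts, route top-1.
# Activated params stay at the dense model's 3B; total params grow past 12B.
total, active = moe_params(shared, ffn, num_experts=8, top_k=1)
print(f"upcycled:     {total/1e9:.2f}B total / {active/1e9:.2f}B active")

# Step 2 -- fine-grained split: halve each expert, double the count.
# Top-2 would reproduce the old compute; staying at top-1 is what actually
# cuts activated parameters (and is where router re-training matters most).
total_fg, act_fg = moe_params(shared, ffn / 2, num_experts=16, top_k=1)
print(f"fine-grained: {total_fg/1e9:.2f}B total / {act_fg/1e9:.2f}B active")
```

Under these assumptions the upcycled model lands around 14.5B total / 3B active, and the fine-grained top-1 variant drops activation to roughly 2.2B with the total unchanged.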

Archive // Previous Logs

2-14-26
MoE Shrinking with REAP and Quantization

I was able to find a smaller version of the Minimax-M2.1 model on Huggingface. This model has only 139B total parameters, compared to the original's ~230B. At a Q4KM quant, it fits within my URAM buffer with room to spare, at only 80GB. While this seems a definite victory, LLMs aren't magical. This result was achieved using Cerebras's REAP methodology at a rate of 40%, meaning 40% of the experts were pruned. The model was then retrained on key datasets (here relating to agentic tasks and coding, I believe). This leads to small drops in model performance on many relevant benchmarks, but could present glaring issues if I were to use this model for anything but the tasks it was finetuned on. Fortunately, that's not the path I wish to take. So, then, this model is heavily REAPed and ready for local inference on my APU. It has performed admirably in my testing so far, but I haven't quite put it through the wringer. Still, outperformance of the older GPTOSS model seems to be a given. The only remaining issue I'm encountering is inference speed. At a ~4.3% expert activation rate, the original model was very efficient. Now the total parameter count has almost been halved, but the activated parameter count remains the same, resulting in an activation rate of ~7.2%. With 10B activated parameters, the model runs at ~25tk/s, which can be sluggish at times. Prompt processing performance also leaves something to be desired. Further experiments will likely explore methods of inference speedup. -K
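The rates quoted above follow directly from the parameter counts; here's a quick check. The effective bits-per-weight for Q4_K_M is an assumption (~4.6 bpw); real GGUF file sizes vary with the tensor mix.

```python
# Sanity-checking the activation rates and quantized size quoted above.

ACTIVE = 10e9            # activated parameters per token (unchanged by REAP)
ORIG_TOTAL = 230e9       # original Minimax-M2.1 total parameters
REAP_TOTAL = 139e9       # total after pruning ~40% of experts

orig_rate = ACTIVE / ORIG_TOTAL
reap_rate = ACTIVE / REAP_TOTAL
print(f"activation rate: {orig_rate:.1%} -> {reap_rate:.1%}")
# Pruning shrinks the expert pool, not the active set, so the rate rises.

BPW = 4.6                # assumed effective bits/weight at Q4_K_M
size_gb = REAP_TOTAL * BPW / 8 / 1e9
print(f"estimated Q4_K_M weights: ~{size_gb:.0f} GB")
```

This reproduces the ~4.3% to ~7.2% jump and an on-disk weight footprint in the neighborhood of 80GB, consistent with the numbers in the entry.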
2-13-26
Testing, Testing, 123...

Welcome! If you're reading this right now, that means the blog (or Lab Notes) is up and running! The purpose of this page (and these notes) is to provide frequent updates on my present work and interests. While I do occasionally post detailed reports of my work on the "Projects" page, I've realized there are many other facets of my process that go undocumented. This blog will hopefully shed some more light on what I do behind the scenes while working towards bigger milestones. And of course, one of those milestones is the successful test of this very website. -K
2-13-26
Recent MoE Experiments

Recently, I've been experimenting with LLM inference on local hardware via llama.cpp, similar to what I was doing with BasecampAgent. I've been testing different large models for the Strix Halo APU agent. Initially, I settled on GPTOSS:120B, which I found sufficient for local inference due to its smallish VRAM usage (65GB at Q4KM quant) and small activated parameter count (5B/116B). Inference was fast at ~55tk/s via Vulkan-RADV, and model performance was adequate, but I was leaving almost half of my APU URAM buffer on the table. For this reason, I began experimenting with different models in order to find a more performant model that still fit within my URAM buffer. However, what I found is that OSS120B actually fills a broad niche. It is small compared to SOTA models like Kimi-2.5, leaving ample room for Q4KM context. It is also very sparse when it comes to activated parameters, even compared to similar models like GLM-4.6 Air. Most consequentially, I found that there is a distinct model gap between the 120B class and its upstairs neighbors: the closest model, Minimax-M2.1, has almost twice the parameters, but also doubles the activated parameters. That meant not only could I not run it on my system at full size, but inference speeds would be halved as well. I expect my work in the near term to consist of trying to overcome this model gap, and perhaps I could look into inference speeds as well. -K
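The "almost half the URAM on the table" observation can be quantified. A minimal sketch, assuming a 128GB unified-memory Strix Halo configuration (an assumption about my setup, and the bits-per-weight figure is derived rather than published):

```python
# Back-of-envelope check on the GPTOSS:120B footprint and URAM headroom.
# The 128 GB unified memory figure is an assumed configuration.

PARAMS = 116e9           # total parameters (5B activated per token)
FILE_GB = 65             # observed Q4KM footprint from the entry above
URAM_GB = 128            # assumed unified memory on the APU

# Effective quantization density implied by the observed file size.
bpw = FILE_GB * 8e9 / PARAMS
print(f"effective quant: ~{bpw:.1f} bits/weight")

headroom = URAM_GB - FILE_GB
print(f"URAM left for context / other uses: ~{headroom} GB")
```

That works out to roughly 4.5 effective bits per weight and ~63GB of headroom, which is the gap the next experiments try to fill.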