momojo 15 hours ago [-]
I'm surprised the point/comment ratio is this skewed. There's so much meat in the post to chew on. I like your writing. This was one of those blogs where I can tell you spent a massive amount of time on the technical, but simplified it to layman's terms. I hope you keep putting out stuff :).
I have a couple questions:
1. I think this quote should be raising *many more* eyebrows.
> The astounding thing about Goliath wasn’t that it was a huge leap in performance, it was that the damn thing functioned at all. To this day, I still don’t understand why this didn’t raise more eyebrows.
You put a cat's brain into a dog's head and it's still breathing! It didn't flatline immediately! Is it yesterday's news? This seems like the biggest takeaway. Why isn't every <MODEL_PROVIDER> attempting LLM-surgery at this moment? Have you noticed any increased discourse in this area?
2. You mentioned you spent the beginning of your career looking at brains in biotech. How did you end up in a basement of GPU's, working not in biotech, but still kind of looking at brains?
Again, great post!
dnhkng 7 hours ago [-]
Cheers. I will go back through my other old projects (optogenetics, hacking CRISPR/Cas9, etc.) and put them on my blog.
On your questions:
1) A few other papers have been mentioned in the thread, like Solar10.7B. They duplicated the whole transformer stack, and it kinda helped. But as I found experimentally, that's probably not a great idea. You are duplicating 'organs' (i.e. input processing stuff) that should only have one copy. Also, that paper didn't see immediate improvements; they had to do continued pre-training to see benefits. At that point, I'm guessing the big labs stopped bothering. Limited by hardware, I had to find unusual angles to approach this topic.
2) Nah, no more wetware for me. I did half a decade of research at a big neurobiology institute, and while it was very enjoyable, I can truly say that grant writing and paper review are 'not my thing'. The reason this info was delayed so long is that I wanted a paper in the AI field to go along with my papers in other fields. But as a hobbyist with no official affiliation, and the attention span of a gnat, I gave up and started a blog instead. Maybe someone will cite it?
trhway 4 hours ago [-]
> You put a cat's brain into a dog's head and it's still breathing! It didn't flatline immediately! Is it yesterday's news?
I think it isn't surprising, given how, for example, kernels in the first layers of visual CNNs converge to Gabors, which are also the neuron transfer functions in the first layers of cat, human, etc. visual cortexes, and that there is math proving that such kernels are optimal (under some reasonable conditions).
And so I'd expect that the layers inside an LLM reach or come close to some optimality which is universal across brains and LLMs (the main reasons for such optimality are energy (various L2-like metrics), information compression, and entropy).
imranq 17 hours ago [-]
Amazing write-up, and I wish more people showed the process of discovery, which is often even more interesting than the result itself.
Still, the result is really interesting: being able to stack abstract reasoning and get better performance, and the heat maps showing the prob results.
The academic literature seems to be catching up:
- *[SOLAR / DUS (Kim et al., 2023)](https://arxiv.org/abs/2312.15166)* — duplicated transformer layers to build a 10.7B model that outperformed 30B parameter baselines.
- *[The Curse of Depth (2025)](https://arxiv.org/abs/2502.05795)* — explains why this works: Pre-LN causes deep transformer layers to converge toward identity functions, meaning middle layers are where real computation happens, and duplicating them concentrates that capacity.
- *[Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (Geiping et al., NeurIPS 2025)](https://arxiv.org/abs/2502.05171)* — takes the idea to its logical conclusion: a model trained with a single recurrent block repeated at inference time, scaling reasoning depth without adding parameters.
dnhkng 7 hours ago [-]
Hi, thanks for the praise!
On the other papers: models like SOLAR, or training a model that uses a single layer, are probably going to hit a wall, based on the heatmaps I found. The transformer stack starts with randomised weights (analogous to undifferentiated stem cells), and it seems they later form 'organs' over the trillions of pre-training tokens they undergo. My hypothesis is that you probably only want one copy of the 'token-to-thought' and 'thought-to-token' organs. It seems that you can make one layer do all three things (transform in, transform out, and do the 'thinking'), but I think specialisation will always win.
mysteria 18 hours ago [-]
> The astounding thing about Goliath wasn’t that it was a huge leap in performance, it was that the damn thing functioned at all. To this day, I still don’t understand why this didn’t raise more eyebrows.
This wasn't something I really dug into in great detail but I remember my surprise back then at how all those merged models and those "expanded" models like Goliath still generated coherent output. IMO those were more community models made by small creators for entertainment rather than work, and only really of interest to the local LLM groups on Reddit, 4chan, and Discord. People might briefly discuss it on the board and say "that's cool" but papers aren't being written and it's less likely for academics or corpo researchers to notice it.
That being said I wonder if it's possible to combine the layers of completely different models like say a Llama and a Qwen and still get it to work.
> Even with math probes, I hit unexpected problems. LLMs fail arithmetic in weird ways. They don’t get the answer wrong so much as get it almost right but forget to write the last digit, as if it got bored mid-number. Or they transpose two digits in the middle. Or they output the correct number with a trailing character that breaks the parser.
Would using grammar parsing help here by forcing the LLM to only output the expected tokens (i.e. numbers)? Or maybe on the scoring side you could look at the actual probabilities per token to see how far the correct digit is.
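The scoring-side idea could be sketched like this (a toy example with a hypothetical digits-only vocab; real tokenizers often merge several digits into one token, which complicates the bookkeeping):

```python
import numpy as np

def digit_score(step_logits: np.ndarray, target: str, vocab: dict) -> float:
    """Average probability mass the model puts on each correct digit,
    giving partial credit instead of all-or-nothing string matching."""
    # Softmax over the vocab at each answer position
    probs = np.exp(step_logits - step_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return float(np.mean([probs[i, vocab[ch]] for i, ch in enumerate(target)]))

# Toy vocab: digit characters map to token ids 0-9
vocab = {str(d): d for d in range(10)}
logits = np.zeros((2, 10))
logits[0, 4] = 5.0   # very confident the first digit is '4'
logits[1, 2] = 1.0   # weakly confident the second digit is '2'
print(round(digit_score(logits, "42", vocab), 3))
```

Grammar-constrained decoding (e.g. GBNF grammars in llama.cpp) attacks the other half of the problem, by masking out non-digit tokens at sampling time so trailing garbage can't break the parser in the first place.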
wolttam 8 hours ago [-]
I think the main challenge with combining layers of different models would be their differing embedding sizes and potentially different vocabularies.
Even between two models of identical architecture, they may have landed on quite different internal representations if the training data recipe was substantially different.
But it would be fun to experiment with.
janalsncm 17 hours ago [-]
It’s a good spot for hobbyists to fill in the gaps. Maybe it’s not interesting enough for academics to study, and for corporate ML they would probably just fine tune something that exists rather than spending time on surgery. Even Chinese labs that are more resource constrained don’t care as much about 4090-scale models.
dnhkng 17 hours ago [-]
It's still non-trivial, as multi-digit numbers can be constructed from a huge number of combinations of valid tokens.
The code in the blog helps derive useful metrics from partial answers.
Balinares 21 hours ago [-]
The idea that there may be a cognitive lingua franca hiding in the layers is fascinating and gives me hope for a neat idea: pluggable knowledge banks.
MoE notwithstanding, a model trained on the whole Internet and a few hundred thousand stolen books carries way more knowledge than is actually needed for any given workflow. It would be great if we could ship slimmed-down models into which we'd plug the knowledge banks useful for today's work, and only those.
It would also mean that you could keep a model's knowledge fresh without retraining the whole of it.
gitpusher 20 hours ago [-]
> pluggable knowledge banks.
plugs in knowledge bank
LLM: ... I know kung fu.
pennomi 18 hours ago [-]
Agreed, I suspect that LLMs in the future will have separate (possibly standardized) decoding/encoding layers that plug into logic layers.
dormento 20 hours ago [-]
This is interesting. Would this mean less space for hallucination as well (depending on the breadth of knowledge applied to a specific task)?
oliver_dr 19 hours ago [-]
[dead]
ay 12 hours ago [-]
Isn’t that what LoRA does ?
CuriouslyC 11 hours ago [-]
LoRAs are better at steering models to produce correct answers from their data set than imparting new knowledge.
iamjackg 18 hours ago [-]
I find the concept of LLM "brain surgery" fascinating, precisely because of how opaque the network is. One of the first things I did back when llama.cpp first got vision model support was hack the code to zero out (or otherwise modify) random numbers in the image embedding generated by the projector and then ask the LLM to describe the image. It was absolutely fascinating.
It would go from a normal description of the item in the picture to suddenly seeing people clapping in the background that were not there, or making up some other stuff. I kinda stopped after a while, but I should pick that back up and do a more coherent experiment to see if I can find any correlation between vector dimensions and "meaning."
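The perturbation itself is simple; a minimal numpy sketch of the idea (shapes are made up for illustration, not llama.cpp's actual ones):

```python
import numpy as np

def ablate_dims(embedding: np.ndarray, dims, value: float = 0.0) -> np.ndarray:
    """Return a copy of an image embedding with chosen dimensions overwritten,
    leaving the original untouched."""
    out = embedding.copy()
    out[..., dims] = value
    return out

rng = np.random.default_rng(1)
emb = rng.normal(size=(256, 4096))      # e.g. 256 patch tokens from the projector
broken = ablate_dims(emb, dims=[0, 17, 512])
print(np.abs(broken[:, 17]).max())      # prints 0.0: that dimension is silenced
```

Feeding `broken` to the LLM in place of `emb` and diffing the descriptions, dimension by dimension, would be one way to make the correlation hunt systematic.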
dnhkng 17 hours ago [-]
Yes, it's an amazing time to be a hacker!
rapatel0 22 hours ago [-]
I think you may have cracked latent space reasoning. I've had a hunch that something like this would work, but couldn't figure out how the training would back propagate. But you've shown that you just need to duplicate existing layers.
Have you tried a simple inline loop over the duplicated layers? Would be interesting to see performance. Also, would be interesting to compare with a MOE model. See if these layers are acting like different agreeing "experts" or if there is reasoning happening in the latent space.
dnhkng 21 hours ago [-]
Yes, I've tried duplicating individual layers, but it's not useful.
I think this hasn't been tried before because it's totally unintuitive that feeding the output from later layers into previous ones would actually do anything. And in fact, it usually is detrimental. I guess it takes really bored hobbyists with too much compute to check this stuff.
I have done some interesting work on applying multiple layer duplications in different regions of the model too, going so far as to train a meta-model (actually just XGBoost) to predict the merges. Seems to work, but that's a whole other blog post.
This works with MoE, and yes, I would be interested in looking into this in more detail. But my wife might disagree with this time sink...
rapatel0 18 hours ago [-]
Clarification. Duplicating multiple groups of layers in a "reasoning" loop
Normal:
L1 -> L2 -> L3 -> L4 -> out
Unrolled (current framing):
L1 -> [L2->L3] -> [L2->L3] -> L4 -> out
Looped (proposed):
      +--<-- loop --+
      |             |
L1 -> [L2 -> L3] x N --> L4 -> out
       "reasoning loop"
(Note: ASCII rendering on HN is not trivial.)
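The distinction in code, with toy residual functions standing in for layers (nothing here is from the post's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
# Toy stand-ins for transformer layers: residual updates x -> x + W x
Ws = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(4)]
L1, L2, L3, L4 = [lambda x, W=W: x + W @ x for W in Ws]

def unrolled(x):
    """L1 -> [L2->L3] -> [L2->L3] -> L4 (duplicated weights, fixed depth)."""
    x = L1(x)
    for _ in range(2):
        x = L3(L2(x))
    return L4(x)

def looped(x, n):
    """L1 -> [L2->L3] x n -> L4 (same weights, variable depth)."""
    x = L1(x)
    for _ in range(n):
        x = L3(L2(x))
    return L4(x)

x = rng.normal(size=dim)
assert np.allclose(unrolled(x), looped(x, 2))  # looping twice == unrolling twice
```

The payoff of the looped framing is that N becomes a free inference-time knob instead of a baked-in architecture choice.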
gavinray 17 hours ago [-]
The commenter "Skerit" below linked to a recent implementation of this:
I really enjoyed reading this. I feel like generalists intuitively experience this exact thing so much throughout their lives because they must have this neuroanatomy you describe. There’s a certain geometry to knowledge that makes possible for this orthogonal movement and it is really fascinating to me. Thank you for publishing this, you made my day!
dnhkng 21 hours ago [-]
Thanks!
Lerc 19 hours ago [-]
I have had broadly the same intuitions on the use of middle layers, but haven't had much luck with the tiny models that I can run on my hardware.
About looping-layer models: after watching that, I poured some thoughts off the top of my head into a comment which, of course, promptly sank without a trace. I'll repost the gist of them here.
If you gain benefit from looping layers, then at some level every layer of parameters sits both in front of and behind every other, and the conclusion must be that the order of the layers does not need to be fixed at all.
If you cycle through the layers multiple times, are you doing so for the benefit of a particular layer on a particular problem? If so, can you skip the other layers that don't add anything on repetition? If you can skip (and know when to skip), and you can repeat (and know when to repeat), what you would need is a mechanism which decides which layer is needed next. Is that then not a looping single-layer MoE model? Storing the layers as a wide set of selectable options rather than a deep set of unconditional layers. You would be picking what the next layer should be (or exiting the loop); the threshold for exit drops each iteration so it always eventually exits. With a tunable 'how hard to think' knob to adjust the threshold.
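That mechanism could be sketched as a learned router over a flat pool of layers, with an exit logit that gets a growing bonus each iteration (all weights random here, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_layers = 8, 4
Ws = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(n_layers)]
route_W = rng.normal(scale=0.1, size=(n_layers + 1, dim))  # last row scores "exit"

def think(x, knob=1.0, max_steps=32):
    """Pick the next layer (or exit) each step; exit gets easier every
    iteration, and `knob` tunes how quickly the model gives up."""
    for step in range(max_steps):
        logits = route_W @ x
        logits[-1] += step * knob          # exit bonus grows each iteration
        choice = int(np.argmax(logits))
        if choice == n_layers:             # router chose "exit"
            break
        x = x + Ws[choice] @ x
        x = x / np.linalg.norm(x)          # crude stand-in for layer norm
    return x, step

x = rng.normal(size=dim)
out, steps = think(x)
```

A lower `knob` makes exiting harder, so the model loops longer: the 'how hard to think' dial described above.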
janalsncm 17 hours ago [-]
That is an interesting idea. I suspect if we relax the constraint that most of the layers in a loop will be in order, there is a combinatorial explosion issue.
But we could still try it out: randomize the order we call the transformer blocks, and see if it affects performance. If not, that’s extremely interesting.
Lerc 12 hours ago [-]
You can still consider it logically from the point of view of in-order execution with optional looping and optional skipping. It stops being so combinatorially explodey then, but if you can always append an additional loop and decide to skip based on the worthiness of the layer, with varying degrees of threshold, then it could theoretically learn an arbitrary ordering where you skip all-bar-one layer per loop.
There's probably a number of common sequences of layers that are inevitable when working on a problem, though. I think of it like an expression calculator which could do various parts of an expression tree, merging leaf nodes on each iteration. I wouldn't expect it to be quite so explicit with neural nets, but I feel like the underlying principle of 'do the sub-parts, then do the same thing on the result of the sub-parts' must be beneficial to some degree.
I think there's probably quite a lot to be revealed from study of representations in those middle layers. If there's a 'how-much-have-we-solved-so-far' signal to be detected from the data between layers, there would be quite a lot of options I think.
hex4def6 19 hours ago [-]
I've gotta say, this writeup gives me an itchy feeling. It really does feel like poking around a synthetic brain at this point.
You could make the argument it's closer to the blocks of a CPU compared with a brain, and it's no different to copy-pasting some IP block, e.g. for HW JPEG decoding. But I feel like the difference here is that we're 'discovering' these blocks / organs. They weren't designed, they were evolved.
adcoleman6 52 minutes ago [-]
The difference is less stark these days, with generative design being used for semiconductors.
Altering these features isn’t messing with evolution any more than tweaking a CAD file that used genetic algorithms: it’s all math, 1s and 0s.
dnhkng 18 hours ago [-]
At some point I will clean up and share the dynamic layer modification code for the oobabooga Text-Generation-WebUI.
You can enter the settings and apply new re-layering architectures. It's very weird chatting with these brain-damaged models.
Havoc 21 hours ago [-]
Crazy writeup.
Author is right about the base64 part. Does seem weird that it can decode and understand it at the same time. And I guess what makes it weird is that we just sorta accept that this works for, say, English and German (i.e. normal use), but when framed as base64 it suddenly stops feeling intuitive.
dinobones 20 hours ago [-]
why tho? it's just an alternate alphabet/set of symbols.
dnhkng 20 hours ago [-]
Because it's generally expected that models only work 'in distribution', i.e. they work on stuff they have previously seen.
They almost certainly have never seen regular conversations in Base64 in their training set, so it's weird that it 'just works'.
Does that make sense?
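For anyone who wants to try the out-of-distribution experiment themselves, the 'conversation' is literally just:

```python
import base64

prompt = "What is the capital of France?"
encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")
print(encoded)  # this string is what you paste into the chat box

# Round-trip check: the model has to do this mapping implicitly, in both directions
assert base64.b64decode(encoded).decode("utf-8") == prompt
```

If the model answers in Base64 too, decoding its reply the same way makes the 'regular conversation, different surface form' point very concrete.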
fweimer 16 hours ago [-]
If you do not properly MIME-decode email, you end up with at least some base64-encoded conversations.
dormento 20 hours ago [-]
For all we know, AI tech companies could theoretically have converted all of the "acquired" (ahem!) training-set material into base64 and used it for training as well, just like you would encode, say, Japanese romaji or Hebrew written in the English alphabet.
dtj1123 19 hours ago [-]
Unlikely that every company would have bothered to do this.
idiotsecant 18 hours ago [-]
'Yes, I know we already trained on all that data, but now I want you to convert to base64 and train it again! at enormous cost!'
adcoleman6 51 minutes ago [-]
On the contrary, it could be a deliberate attempt to augment or diversify the dataset.
gwern 12 hours ago [-]
> They almost certainly have never seen regular conversations in Base64 in their training set, so its weird that it 'just works'.
People use Base64 to store payloads of many arbitrary things, including web pages or screenshots, both deliberately and erroneously, and so they have almost certainly seen regular conversations in Base64 in their 10tb+ text training sets scraped from billions of web pages and files and mangled emails etc.
dnhkng 7 hours ago [-]
Yes, that's true.
But that points again to the main idea: The model has learnt to transform Base64 into a form it can already use in the 'regular' thinking structures.
The alternative is that there is an entire parallel structure just for Base64, which based on my 'chats' with LLMs in that format seems implausible; it acts like the regular model.
If there is a 'translation' organ in the model, why not math or emotion processing organs? That's what I set out to find, and they're illustrated in the heatmaps.
Also, any writing tips from the Master blogger himself? Huge fan (squeal!)
broDogNRG 18 hours ago [-]
[dead]
supriyo-biswas 7 hours ago [-]
Thank you for your contribution. Unfortunately I do not have sufficient expertise in LLM engineering to provide a useful comment, but this is the sort of research I'd like to see here instead of LLM-driven unemployment hype.
3abiton 18 hours ago [-]
Man, that was such an enjoyable read. I loved your story on the wild server hunt, back when it was posted on r/localllama. I think one thing that is missing from the whole AI "discussion" is this train of thought of how we go from abstract mathematical formulation to intuitive understanding of the underlying functionality, and you showcased it beautifully in this article. Similarly to 3blue1brown, who also did an amazing series on transformers. Kudos!
phn 19 hours ago [-]
A fascinating thing for me after reading this is: how can it be that the "circuit input" is compatible with its output to the point where the performance improves? The training process never saw this particular connection just like it didn't see layer 60 output into layer 3 or whatever.
Great read, makes you wonder what else is encoded in these models that might be useful!
nixon_why69 10 hours ago [-]
I think the intuition is that the first N layers decode into "thought language" while the last N encode back to desired output language. So if there are well defined points where it transitions between decoding/understanding, thinking, and rendering back to language, those 2 transition points should be in the same vector space of "LLM magic thinking language".
phire 7 hours ago [-]
That's really interesting. Makes me immediately ask two questions:
1. Should we be training models like this from the start? It seems that a model trained with layer loops would be able to take advantage of it better than rearranging the layers of a naive model.
2. Should we even be using a fixed number of layers? If models are this tolerant to their inner layers being meddled with, then it doesn't make sense to run all the layers on every single token.
Maybe we could make a model that changed the number of iterations through the compute layers based on how much computation it thought the problem needed. Send it through only once for easy problems (perhaps even zero times?) and two or more times for harder problems. This would allow easier prompts to complete faster, while allowing the model to potentially scale up to arbitrarily hard problems.
If we are training or fine-tuning the model, we can probably make the compute layers generate a confidence signal that predicts how likely it is for an extra compute iteration to meaningfully change the result.
tgw43279w 23 hours ago [-]
That was a fun read! The base64 decoding and encoding is quite interesting. A parallel: these models are surprisingly robust to heavy word mangling. Back in 2023, people used this trick to jailbreak the models very often, but what was more surprising is that they even understand it. I always thought of it this way: there must be some circuitry in the model that maps these almost unrecognizable words/sentences onto their rectified versions. But what your base64 example also shows is that they can encode them back as well! (However, models are known to not be able to produce mangled output that looks convincingly random. I think the base64 transformation is more mechanical in this regard, and hence it's easier for them to do the reverse.)
So your layer circuit hypothesis aligns pretty well with my mental model of how these models work based on the interpretability work I am familiar with! I really also like the way you used the heatmaps as a tool to derive layer insights, very intuitive! But it’s really surprising that you can simply duplicate layers and achieve better results that generalize!
This is some research-grade effort! I'm confident you could publish this at NeurIPS or ICML if you put it into a paper! I'm quite impressed! Great work!
twotwotwo 8 hours ago [-]
This is fascinating, and makes me wonder what other things that 'should' be impossible might just be waiting for the right configuration to be tried.
For example, we take for granted that the context model of LLMs is necessary: that all you can do is append, and anything that changes the beginning requires a recalculation of whatever comes after it. And that does match how training works.
But all sorts of things would become possible if it were possible to shift things in and out of context without recomputing it all; conservatively you could avoid compaction, optimistically it might be a way to get info to the model that's both more deeply integrated than search and more efficient than training larger and larger models.
digdugdirk 22 hours ago [-]
Super cool! Do you do any analysis or have any tools that help you identify these circuits? I came across this [1] recently, and wanted to try to identify specifically strong "circuits" in what seems to be a similar way to what you did.
I build my own analysis tools. I'm just finishing up running the current generation of LLMs (MiniMax M2.5 and the Qwen3.5 family), and then I will put it all on GitHub.
It's less a 'tool' than an assorted set of scripts, tailored to my unusual hardware setup. But it should be easy to extend. I would have released this earlier, but I had the (stupid) idea to 'write a paper' on this; aiming for that delayed this a year. Blogs are the way to go (for me).
hackerchy 4 hours ago [-]
This is fascinating. The fact that only ~7 layer blocks work and not fewer/more really suggests there are emergent functional units in the transformer stack that we don't fully understand yet. Almost like "organs" in the network. Have you tried this on architectures other than Qwen, like Llama or Mistral? Curious if the magic block size is architecture-dependent or if 7 layers is some kind of universal constant.
user_7832 21 hours ago [-]
Thanks for the post, really cool stuff you did!
Extra thanks for making it written in a readable and approachable way! I don't have much of a background in this topic, but still managed to understand about 70-80% of it :) You're a good writer
WithinReason 22 hours ago [-]
Here is a paper that made a similar observation recently:
I think that these models have to learn to efficiently use their parameters, and the best way to do that is to 'evolve' (yes, a bad word for it) structures over pretraining time. Unfortunately, they don't have a way to access these structures 'from the inside'. I hope this new approach lets us boost performance in a more experimentally rigorous way.
WithinReason 22 hours ago [-]
I think the recurrence is a consequence of using a residual connection, seems like that makes the representation stay consistent across layers
tgw43279w 22 hours ago [-]
Very cool, thanks for sharing! Recovering 96% using just two blocks on IMN-1k, wow!
cootsnuck 22 hours ago [-]
Super cool. Love seeing these writeups of hobbyists getting their hands dirty, breaking things, and then coming out on the other side of it with something interesting.
dubbel 13 hours ago [-]
Absolutely amazing blog post!
I have to say that intuitively I wasn't at all surprised that duplicating a single layer didn't do much good, but I had never expected that you can identify and so clearly visualize these relatively short circuit blocks (and of course it's around the magic number 7! /jk). Super cool research and really well explained!
janalsncm 17 hours ago [-]
It would be extremely interesting if we could use this kind of model surgery approach to tack on additional modalities. For example, adding vision to a text only model.
Another very interesting thing would be modulating compute at the token level. Default is 0 loops, maybe 1 loop is better, and 10 loops is even better than that.
siliconc0w 11 hours ago [-]
Great insight and approach. I wonder, though: if instead of blogging this he had the top labs bid on it, what would that fetch?
dnhkng 7 hours ago [-]
But blogging is fun!
I do wish one of the big labs would sponsor with a rack of HGX Rubin NVL8's. I have lots of ideas to test, and I have probably hit the spending limit with the boss on hardware (she hasn't seen the new power bill yet...)
dgoet 13 hours ago [-]
Fantastic. Really gets me thinking.
If more than two repetitions of the “thinking organ” lead to worse results (I think that’s what you’ve said in other comments), would it be possible to get better results by slicing and dicing some of the early-layer “preparatory organs” between the thinking organs?
Maybe that would still require fine tuning to “evolve” an intermediary organ that would allow for multiple repetitions.
goodmythical 21 hours ago [-]
Isn't this similar to models that 'double-check the answer'?
First pass runs your input through; second pass runs its output as input?
Just, in double-checking it presumably runs the entire stack, while you're trying to skip the translation steps and only double-check the logic?
sva_ 21 hours ago [-]
I don't think it's mathematically equivalent or even close, because the context/logprobs will be very different, since you only produce 1 token per pass. I'd say the token itself has a lot less information than the signal propagating through the residual stream of transformer blocks.
dnhkng 21 hours ago [-]
Maybe, but the interesting thing for me is that this only works with specific 'chunks' of the transformer layer stack. More or less than the optimal leads to worse performance.
dnhkng 20 hours ago [-]
Here's an extract, the core TL;DR for a feel of the article.
"And now for the weirdness: It was never the case that any Transformer layer saw the output from a future layer!
Layer 10 is trained on layer 9’s output distribution. Layer 60 is trained on layer 59’s. If you rearrange them — feeding layer 60’s output into layer 10 — you’ve created a distribution the model literally never saw during training.
The astounding thing about Goliath wasn’t that it was a huge leap in performance, it was that the damn thing functioned at all. To this day, I still don’t understand why this didn’t raise more eyebrows.
Experimentally, this proved that layers were far more interchangeable than anyone had reason to expect. The internal representations were homogenous enough that the model could digest out-of-order hidden states without collapsing. The architecture was far more flexible than a rigid pipeline.
Between the Base64 observation and Goliath, I had a hypothesis: Transformers have a genuine functional anatomy. Early layers translate input into abstract representations. Late layers translate back out. And the middle layers, the reasoning cortex, operate in a universal internal language that’s robust to architectural rearrangement. The fact that the layer block size for Goliath 120B was 16 layers made me suspect the input and output ‘processing units’ were smaller than 16 layers. I guessed that Alpindale had tried smaller overlaps, and they just didn’t work.
If that was true, maybe I didn’t need to teach a model new facts to make it smarter. I didn’t need fine-tuning. I didn’t need RLHF. I just needed to give it more layers to think with."
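The resulting recipe, as described, is a passthrough self-merge: keep single copies of the input and output 'organs' and repeat the middle block. A toy sketch with indices standing in for layers (with a real Hugging Face checkpoint you would slice and re-concatenate something like `model.model.layers` into a fresh `nn.ModuleList`; this is essentially what mergekit-style passthrough merges do):

```python
def self_merge(layers, start, end, repeats=2):
    """Repeat the block layers[start:end] `repeats` times, keeping everything
    outside the block (the input/output 'organs') as a single copy."""
    return layers[:start] + layers[start:end] * repeats + layers[end:]

# Toy 10-layer stack; duplicate the middle "reasoning" block (layers 3..6)
stack = list(range(10))
merged = self_merge(stack, 3, 7)
print(merged)  # [0, 1, 2, 3, 4, 5, 6, 3, 4, 5, 6, 7, 8, 9]
```

Note that the duplicated block shares weights with the original, so the merged model costs more compute but no extra memory for parameters.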
blourvim 23 hours ago [-]
I am not really an ML dev, so I don't understand most of it. It does sound ridiculous that it would even work. Brilliant work and a great article; I enjoyed reading it.
This sounds similar to Kimi's mixture-of-experts architecture, if I understood it correctly (likely I have not). Can you comment on this?
dnhkng 22 hours ago [-]
No worries, happy to discuss anyway :)
MoE (mixture of experts) is an architecture that forces sparsity (not all 'neurons' are active during the forward pass).
This is pretty much orthogonal to that; it works with dense and MoE models, by repeating 'vertical' sections of the transformer stack.
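For the curious, the sparsity MoE enforces looks roughly like this (toy gating with random matrices standing in for experts; real MoEs route per token inside each transformer layer and add load-balancing losses):

```python
import numpy as np

def moe_forward(x, experts, gate_W, k=2):
    """Sparse MoE layer: route the input to its top-k experts only."""
    logits = gate_W @ x
    topk = np.argsort(logits)[-k:]            # indices of the k best experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()                  # softmax over the chosen experts
    return sum(w * experts[i](x) for w, i in zip(weights, topk))

rng = np.random.default_rng(0)
dim, n_experts = 8, 4
Ws = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(n_experts)]
experts = [lambda x, W=W: W @ x for W in Ws]
gate_W = rng.normal(size=(n_experts, dim))
x = rng.normal(size=dim)
y = moe_forward(x, experts, gate_W, k=2)      # only 2 of the 4 experts run
```

Layer duplication is 'vertical' (more depth per token); MoE is 'horizontal' (more width, sparsely used), which is why the two compose.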
Yes, I was using Base64 to 'jailbreak' LLMs back in the day (so, similar), and that's what led me to the hypothesis, and to months of GPU use to find the optimal layer duplication!
dongecko 20 hours ago [-]
What a great read!
You got me at the base64 oddity. I also stumbled over this while trying to dodge some LLM limitation (I was trying to generate images before multimodal was a thing; it only worked to a degree).
tjwei 22 hours ago [-]
Really interesting discovery, especially the part about base64.
Reminds me of this: Transformer Layers as Painters https://arxiv.org/abs/2407.09298
Aditya_Garg 20 hours ago [-]
Wild stuff and great read
Do you think karpathy's autoresearch would be useful here?
janalsncm 20 hours ago [-]
Based on Karpathy’s writeup the auto research would not have found this. He tells the agent to improve the model and training loop with a five minute time limit, but honestly this “hack” is so far out of distribution that it seems really unlikely an agent would find this.
gwern 11 hours ago [-]
Adding, swapping, or duplicating layers has a long history (e.g. StyleGAN, upcycling), and it was pointed out at least as far back as He et al 2015 (ResNets) that you could ablate or add more layers because they functioned more as just doing some incremental compute iteratively, and many of them were optional. (Or consider Universal Transformers, or heck, just how BPTT works.) So this idea is not far out of distribution, if at all, especially if you're a LLM who knows the literature and past approaches (which most humans would not, because they only just got into this area post-ChatGPT).
BloodAndCode 18 hours ago [-]
Did you try repeating the same mid-layer block more than once?
If the gain comes from giving the model another pass over its internal representation, I'd expect some sort of diminishing-returns curve as you add more repeats. But if those layers form a specific circuit, running it multiple times might actually break the computation.
It would be really interesting to see which of those regimes the model falls into.
If you found two disjoint sections that seemed positive on their own, did you try looping both separately in the same model? Wondering how localized the structures are.
Xuzzo 15 hours ago [-]
Fascinating! Congrats for the great work
lifis 15 hours ago [-]
Have you tried replicating those middle layers 3 or more times instead of just 2?
d0100 19 hours ago [-]
I wonder if joining layers from the "organs" of different models could further enhance the results
jauntywundrkind 22 hours ago [-]
The dual GH200 build was amazing. Awesome to see someone with such talent & flair in one area also doing great in another area. Thanks for noting that that was you.
https://news.ycombinator.com/item?id=46222237
kristianp 17 hours ago [-]
Does your work give any insight into how reasoning at inference time works?
lordmathis 21 hours ago [-]
That's cool. I tried the b64 thing on my local qwen3.5 27b without access to tools and it did it.
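The probe is easy to replicate locally; the encoding side is just the standard library (the prompt text here is an arbitrary example):

```python
import base64

prompt = "What is the capital of France? Please reply in Base64."
encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")
print(encoded)  # paste this into the chat as the user message

# Decoding whatever Base64 the model sends back works the same way:
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)
```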
GaggiX 21 hours ago [-]
This reminds me when people were doing crazy stuff to improve the first Stable Diffusion model by swapping layers, interpolating weights, documenting which layer was most responsible for the quality of the hands etc. At the end the final models had dozens of different ancestors.
Handsome2734 10 hours ago [-]
Fascinating write up!
patchnull 21 hours ago [-]
This lines up with what I have seen doing CKA (centered kernel alignment) analysis on transformer internals. The middle layers in most large models have surprisingly similar representations to their neighbors, so duplicating them is basically giving the model extra compute cycles in a region where it is already doing useful refinement without messing up the input/output encoding stages. Curious whether picking layers by representation similarity instead of just a contiguous block would do even better.
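For reference, the linear variant of CKA mentioned above is only a few lines; this sketch uses random matrices as stand-ins for hidden states collected at two layers (shapes are arbitrary):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices of shape (n_samples, dim)."""
    X = X - X.mean(axis=0)  # center each feature
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
H = rng.normal(size=(256, 64))                   # stand-in for layer-k hidden states
Q = np.linalg.qr(rng.normal(size=(64, 64)))[0]   # random orthogonal matrix
print(linear_cka(H, H))      # identical representations -> 1.0
print(linear_cka(H, H @ Q))  # CKA is invariant to rotation -> still 1.0
print(linear_cka(H, rng.normal(size=(256, 64))))  # unrelated -> much lower
```

The rotation invariance is what makes CKA useful for comparing neighboring layers, which generally won't share a basis even when they encode similar information.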
dnhkng 21 hours ago [-]
Have a look at the boundaries in the heatmaps.
They are of course open to interpretation, but they suggest to me that the models develop 'organs' for processing different types of data, and without duplicating the 'whole organ' you don't get the benefits.
This is quite different from what you usually see in layer-ablation experiments. Thoughts?
doctorpangloss 20 hours ago [-]
Maybe you are observing artifacts of Qwen's training procedure. Perhaps they initialized further layers with the weights of previous ones as part of the training curriculum. But it's fun to imagine something more exotic.
dnhkng 18 hours ago [-]
There are similar patterns in the models from all the big labs. I think the transformer layer stack starts out 'undifferentiated', analogous to stem cells. Pre-training pushes the model to develop structure, and this technique helps discover that hidden structure.
afpx 21 hours ago [-]
Thank you so much for sharing this in a delightful blog post. One of the more enjoyable things I've read in a while. Very motivating!
seeknotfind 22 hours ago [-]
Did you ever try multiple copies?
dnhkng 22 hours ago [-]
I did, but the combinatorics are mad. I have also tried training a meta-model that predicts the outputs of the combinations.
I will make another post if the topic is popular; it's pretty geeky though, even more than my usual blog posts...
cosarara 18 hours ago [-]
My first idea would be to generate one of those heatmaps using RYS as the base model. And see if it gets meaningfully better. And then again!
vicentwu 10 hours ago [-]
Good read.
naasking 22 hours ago [-]
This layer duplication strikes me as a bit of a "poor man's" version of looped language models.
I really think from the experiments that 'organs' (not sure what else to term them) develop during massive pretraining. This also means maybe looping the entire model is actually not efficient. Maybe a better way is [linear input section -> loop 1 -> linear section -> loop 2 -> linear section -> ... -> loop n -> linear output]?
This would give 'organs' space to develop.
radarsat1 18 hours ago [-]
it also reminds me a bit of this diffusion paper [1] which proposes having an encoding layer and a decoding layer but repeats the middle layers until a fixed point is reached. but really there is a whole field of "deep equilibrium models" that is similar. it wouldn't be surprising if large models develop similar circuits naturally when faced with enough data.
finding them on the other hand is not easy! as you've shown, i guess brute force is one way.. it would be nice to find a short cut but unfortunately as your diagrams show, the landscape isn't exactly smooth.
I would also hypothesize that different circuits likely exist for different "problems" and that these are messy and overlapping so the repeated layers that improve math for example may not line up with the repeated layers that improve poetry or whatever, meaning the basic layer repetition is too "simple" to be very general. that said you've obviously shown that there is some amount of generalizing at work, which is definitely interesting.
very awesome writeup, glad to see someone with access to hw actually playing with this.
Hopefully the cost per GPU will drop soon and we'll see people properly play, but frankly the "middle section" layers 2(ish) to (n-1)(ish) of a model can be shuffled up/down and left/right and still perform well.
The fun one will be an LLM router for LLM layers to apply the best reasoning to the best input so far, but frankly that would need the years and years of training that the author hints at.
The one that's still out of grasp is how to combine/manipulate per-layer k,v caches into a globally coherent state.
i.e. if layers can be moved up/down why can't the cached k,v be swapped/combined with different projections?
global k,v caches work, but they have to be _huge_ in order to prevent model collapse even on something as simple as owt.
priowise 21 hours ago [-]
[flagged]
user_7832 20 hours ago [-]
A 5 hour old account with a standard chatgpt reply? Seriously, try harder.
phacker007 12 hours ago [-]
I'm so dumb
bblb 10 hours ago [-]
It's the "I so pale." moment for us average lurkers.
Interesting content still in the sea of useless AI slop, even if I couldn't understand anything after the first paragraph.
himmi-01 16 hours ago [-]
How did you get this idea? What was the inspiration behind it? I mean, who would have thought of duplication :)?!
I think it isn't surprising, given how, for example, kernels in the first layers of visual CNNs converge to Gabor filters, which are also the neuron transfer functions in the first layers of cat, human, etc. visual cortices, and that there is math proving that such kernels are optimal (under some reasonable conditions).
And so I'd expect that the layers inside an LLM reach or come close to some optimality which is universal across brains and LLMs (the main drivers of such optimality being energy (various L2-like metrics), information compression, and entropy).
Still, the result is really interesting: being able to stack abstract reasoning and get better performance, with the heatmaps showing the probe results.
The academic literature seems to be catching up:
- *[SOLAR / DUS (Kim et al., 2023)](https://arxiv.org/abs/2312.15166)* — duplicated transformer layers to build a 10.7B model that outperformed 30B parameter baselines.
- *[The Curse of Depth (2025)](https://arxiv.org/abs/2502.05795)* — explains why this works: Pre-LN causes deep transformer layers to converge toward identity functions, meaning middle layers are where real computation happens, and duplicating them concentrates that capacity.
- *[Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (Geiping et al., NeurIPS 2025)](https://arxiv.org/abs/2502.05171)* — takes the idea to its logical conclusion: a model trained with a single recurrent block repeated at inference time, scaling reasoning depth without adding parameters.
On the other papers: models like SOLAR, or training a model that uses a single layer, are probably going to hit a wall, based on the heatmaps I found. The transformer stack starts with randomised weights (analogous to undifferentiated stem cells), and it seems they later form 'organs' during the trillions of pre-training tokens they undergo. My hypothesis is that you probably only want one copy of the 'token-to-thought' and 'thought-to-token' organs. It seems that you can make one layer do all three things (transform in and out, and do the 'thinking'), but I think specialisation will always win.
This wasn't something I really dug into in great detail but I remember my surprise back then at how all those merged models and those "expanded" models like Goliath still generated coherent output. IMO those were more community models made by small creators for entertainment rather than work, and only really of interest to the local LLM groups on Reddit, 4chan, and Discord. People might briefly discuss it on the board and say "that's cool" but papers aren't being written and it's less likely for academics or corpo researchers to notice it.
That being said I wonder if it's possible to combine the layers of completely different models like say a Llama and a Qwen and still get it to work.
Even with math probes, I hit unexpected problems. LLMs fail arithmetic in weird ways. They don’t get the answer wrong so much as get it almost right but forget to write the last digit, as if it got bored mid-number. Or they transpose two digits in the middle. Or they output the correct number with a trailing character that breaks the parser.
Would using grammar parsing help here by forcing the LLM to only output the expected tokens (i.e. numbers)? Or maybe on the scoring side you could look at the actual probabilities per token to see how far the correct digit is.
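One simple way to give partial credit for those "almost right" numbers (not necessarily what the blog's scripts do; this function and its thresholds are my own illustration) is an edit distance over the digit strings, which also sidesteps the trailing characters that break a strict parser:

```python
def digit_partial_credit(pred: str, target: str) -> float:
    """Partial credit in [0, 1] via edit distance over digit strings.

    1.0 means the digits match exactly; dropped, transposed, or extra
    digits lose credit proportionally instead of scoring zero.
    """
    p = "".join(c for c in pred if c.isdigit())   # strip parser-breaking chars
    t = "".join(c for c in target if c.isdigit())
    if not t:
        return 0.0
    # Classic iterative Levenshtein distance between p and t
    prev = list(range(len(t) + 1))
    for i, pc in enumerate(p, 1):
        cur = [i]
        for j, tc in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (pc != tc)))
        prev = cur
    return max(0.0, 1.0 - prev[-1] / max(len(p), len(t)))

print(digit_partial_credit("123456", "1234567"))   # forgot last digit -> ~0.857
print(digit_partial_credit("123x456!", "123456"))  # junk stripped -> 1.0
```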
Even between two models of identical architecture, they may have landed on quite different internal representations if the training data recipe was substantially different.
But it would be fun to experiment with.
The code in the blog helps derive useful metrics from partial answers.
MoE notwithstanding, a model trained on the whole Internet and a few hundred thousand stolen books carries way more knowledge than is actually needed for any given workflow. It would be great if we could ship slimmed down models into which we'd plug the knowledge banks useful for today's work, and only those.
It would also mean that you could keep a model's knowledge fresh without retraining the whole of it.
plugs in knowledge bank LLM: ... I know kung fu.
It would go from a normal description of the item in the picture to suddenly seeing people clapping in the background that were not there, or making up some other stuff. I kinda stopped after a while, but I should pick that back up and do a more coherent experiment to see if I can find any correlation between vector dimensions and "meaning."
Have you tried a simple inline loop over the duplicated layers? Would be interesting to see the performance. Also, it would be interesting to compare with an MoE model, to see if these layers are acting like different agreeing "experts" or if there is reasoning happening in the latent space.
I think this hasn't been tried before because it's totally unintuitive that feeding the output from later layers into previous ones would actually do anything. And in fact, it usually is detrimental. I guess it takes really bored hobbyists with too much compute to check this stuff.
I have done some interesting work on applying multiple layer duplications in different regions of the model too, going so far as to train a meta-model (actually just XGBoost) to predict the merges. Seems to work, but that's a whole other blog post.
This works with MoE, and yes, I would be interested in looking into this in more detail. But my wife might disagree with this time sink...
Normal vs. unrolled (current framing) vs. looped (proposed, with a "reasoning loop"). (Note: the ASCII diagrams didn't survive; rendering ASCII on HN is not trivial.)
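Since the ASCII art doesn't render, the three variants can be sketched as toy forward passes (plain callables standing in for transformer blocks; indices are illustrative):

```python
def forward_normal(layers, x):
    for layer in layers:
        x = layer(x)
    return x

def forward_unrolled(layers, x, start, end, repeats=2):
    # Duplicated weights: the middle block literally appears twice in the stack.
    stack = list(layers[:start]) + list(layers[start:end]) * repeats + list(layers[end:])
    return forward_normal(stack, x)

def forward_looped(layers, x, start, end, loops=2):
    # Shared weights: the hidden state is fed back through the same middle block.
    x = forward_normal(layers[:start], x)
    for _ in range(loops):  # the "reasoning loop"
        x = forward_normal(layers[start:end], x)
    return forward_normal(layers[end:], x)

layers = [lambda x, i=i: x + i for i in range(6)]  # toy blocks
print(forward_normal(layers, 0))          # 15
print(forward_unrolled(layers, 0, 2, 4))  # 20
print(forward_looped(layers, 0, 2, 4))    # 20
```

With literally identical copies the unrolled and looped variants compute the same thing; they only diverge once the duplicated copies are fine-tuned independently.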
https://ouro-llm.github.io/
See the left-hand side of the diagram here, which is your exact proposal:
https://ouro-llm.github.io/static/images/ouro_main.png
There's a video on YouTube https://www.youtube.com/watch?v=pDsTcrRVNc0 about looping-layer models; after watching that I poured some thoughts off the top of my head into a comment which, of course, promptly sank without a trace. I'll repost the gist of them here.
If you gain benefit from looping layers, then at some level every layer of parameters is both in front of and behind every other, and the conclusion must be that the order of the layers does not need to be fixed at all.
If you cycle through the layers multiple times, are you doing so for the benefit of a particular layer on a particular problem? If so, can you skip the other layers that don't add anything on repetition? If you can skip (and know when to skip), and you can repeat (and know when to repeat), then what you would need is a mechanism which decides which layer is needed next. Is that then not a looping single-layer MoE model? You would be storing the layers as a wide set of selectable options rather than a deep set of unconditional layers, picking what the next layer should be (or exiting the loop); the threshold for exit drops each iteration so it always eventually exits, with a tunable 'how hard to think' knob to adjust the threshold.
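That control flow is easy to sketch concretely; the router below is a toy placeholder (a real one would be a small learned head over the hidden state):

```python
import random

def routed_forward(layers, router, x, max_steps=32, exit_start=0.9, decay=0.9):
    """Pick the next layer dynamically; stop when the router's 'done' score
    beats an exit threshold that decays each step (max_steps caps the loop)."""
    threshold = exit_start
    for _ in range(max_steps):
        done_score, choice = router(x, layers)
        if done_score >= threshold:
            break
        x = layers[choice](x)
        threshold *= decay  # the tunable 'how hard to think' knob
    return x

# Toy router: declare 'done' once the state exceeds 10, else pick a random layer.
def toy_router(x, layers):
    return (1.0 if x > 10 else 0.0), random.randrange(len(layers))

random.seed(0)
layers = [lambda x: x + 1, lambda x: x + 2, lambda x: x + 3]
out = routed_forward(layers, toy_router, 0)
print(out)  # somewhere in (10, 13]: the last randomly chosen layer overshoots 10
```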
But we could still try it out: randomize the order we call the transformer blocks, and see if it affects performance. If not, that’s extremely interesting.
There's probably a number of common sequences of layers that are inevitable when working on a problem, though. I think of it like an expression calculator which could do various parts of an expression tree, merging leaf nodes on each iteration. I wouldn't expect it to be quite so explicit with neural nets, but I feel like the underlying principle of 'do the sub-parts, then do the same thing on the result of the sub-parts' must be beneficial to some degree.
I think there's probably quite a lot to be revealed from study of representations in those middle layers. If there's a 'how-much-have-we-solved-so-far' signal to be detected from the data between layers, there would be quite a lot of options I think.
You could make the argument it's closer to the blocks of a CPU than a brain, and that it's no different to copy-pasting some IP block for, e.g., HW JPEG decoding. But I feel like the difference here is that we're 'discovering' these blocks/organs. They weren't designed, they were evolved.
Altering these features isn't messing with evolution any more than tweaking a CAD file that used genetic algorithms: it's all math, 1s and 0s.
You can enter the setting and apply new re-layering architectures. It's very weird chatting with these brain-damaged models.
The author is right about the Base64 part. It does seem weird that it can decode and understand it at the same time. And I guess what makes it weird is that we just sorta accept that this works for, say, English and German (i.e. normal use), but when framed as Base64 it suddenly stops feeling intuitive.
They almost certainly have never seen regular conversations in Base64 in their training set, so it's weird that it 'just works'.
Does that make sense?
People use Base64 to store payloads of many arbitrary things, including web pages or screenshots, both deliberately and erroneously, and so they have almost certainly seen regular conversations in Base64 in their 10 TB+ text training sets scraped from billions of web pages, files, mangled emails, etc.
But that points again to the main idea: The model has learnt to transform Base64 into a form it can already use in the 'regular' thinking structures.
The alternative is that there is an entire parallel structure just for Base64, which based on my 'chats' with LLMs in that format seems implausible; it acts like the regular model.
If there is a 'translation' organ in the model, why not math or emotion processing organs? That's what I set out to find, and they are illustrated in the heatmaps.
Also, any writing tips from the Master blogger himself? Huge fan (squeal!)
Great read, makes you wonder what else is encoded in these models that might be useful!
1. Should we be training models like this from the start? It seems that a model trained with layer loops would be able to take advantage of it better than rearranging the layers of a naive model.
2. Should we even be using a fixed number of layers? If models are this tolerant to their inner layers being meddled with, then it doesn't make sense to run all the layers on every single token.
Maybe we could make a model that changed the number of iterations through the compute layers based on how much computation it thought the problem needed. Send it through only once for easy problems (perhaps even zero times?) and two or more times for harder problems. This would allow easier prompts to complete faster, while allowing the model to potentially scale up to arbitrarily hard problems.
If we are training or fine-tuning the model, we can probably make the compute layers generate a confidence signal that predicts how likely an extra compute iteration is to meaningfully change the result.
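A toy sketch of that confidence-gated loop (the fixed-point "block" and the confidence proxy are invented for illustration; in practice the confidence would be a learned head):

```python
def adaptive_depth_forward(block, confidence, x, max_loops=10, eps=0.05):
    """Repeat the middle block until an extra pass is predicted to change
    the result by less than `eps` (or the loop budget runs out)."""
    loops = 0
    for loops in range(max_loops):
        if confidence(x) < eps:  # an extra iteration is unlikely to help
            break
        x = block(x)
    return x, loops

# Toy example: the block halves the distance to a fixed point, and the
# 'confidence' proxy is the remaining relative distance.
target = 100.0
block = lambda x: x + 0.5 * (target - x)
confidence = lambda x: abs(target - x) / target

out, used = adaptive_depth_forward(block, confidence, 0.0)
print(round(out, 3), used)  # converges to within 5% of the target in 5 passes
```

Easy prompts would exit after zero or one pass; hard ones would spend the full budget, which is exactly the variable-compute behavior described above.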
For example, we take for granted the context model of LLMs is necessary, that all you can do is append and anything that changes the beginning requires a recalculation of whatever comes after it. And that does match how training works.
But all sorts of things would become possible if it were possible to shift things in and out of context without recomputing it all; conservatively you could avoid compaction, optimistically it might be a way to get info to the model that's both more deeply integrated than search and more efficient than training larger and larger models.
[1] https://weightwatcher.ai/
It's less a 'tool' than an assorted set of scripts tailored to my unusual hardware setup. But it should be easy to extend; I would have released this earlier, but I had the (stupid) idea to 'write a paper' on this. Aiming for that delayed this a year. Blogs are the way to go (for me).
Extra thanks for making it written in a readable and approachable way! I don't have much of a background in this topic, but still managed to understand about 70-80% of it :) You're a good writer
https://www.alphaxiv.org/abs/2512.19941
I think that these models have to learn to efficiently use their parameters, and the best way to do that is to 'evolve' (yes, a bad word for it) structures over pretraining time. Unfortunately, they don't have a way to access these structures 'from the inside'. I hope this new approach lets us boost performance in a more experimentally rigorous way.
I have to say that intuitively I wasn't at all surprised that duplicating a single layer didn't do much good, but I had never expected that you can identify and so clearly visualize these relatively short circuit blocks (and of course it's around the magic number 7! /jk). Super cool research and really well explained!
Another very interesting thing would be modulating compute at the token level. Default is 0 loops, maybe 1 loop is better, and 10 loops is even better than that.
I do wish one of the big labs would sponsor me with a rack of HGX Rubin NVL8s. I have lots of ideas to test, and I have probably hit the spending limit with the boss on hardware (she hasn't seen the new power bill yet...)
If more than two repetitions of the “thinking organ” leads to worse results (I think that’s what you’ve said in other comments), would it be possible to get better results by slicing and dicing some of the early-layer “preparatory organs” between the thinking organs?
Maybe that would still require fine tuning to “evolve” an intermediary organ that would allow for multiple repetitions.
First pass runs your input through, second pass runs its output as input?
Just, in the double-check it presumably runs the entire stack, while you're trying to skip the translation steps and only double-check the logic?
"And now for the weirdness: There was never the case where any Transformer layer would have seen the output from a future layer!
Layer 10 is trained on layer 9’s output distribution. Layer 60 is trained on layer 59’s. If you rearrange them — feeding layer 60’s output into layer 10 — you’ve created a distribution the model literally never saw during training.
The astounding thing about Goliath wasn’t that it was a huge leap in performance, it was that the damn thing functioned at all. To this day, I still don’t understand why this didn’t raise more eyebrows.
Experimentally, this proved that layers were far more interchangeable than anyone had reason to expect. The internal representations were homogenous enough that the model could digest out-of-order hidden states without collapsing. The architecture was far more flexible than a rigid pipeline.
Between the Base64 observation and Goliath, I had a hypothesis: Transformers have a genuine functional anatomy. Early layers translate input into abstract representations. Late layers translate back out. And the middle layers, the reasoning cortex, operate in a universal internal language that’s robust to architectural rearrangement. The fact that Goliath 120B was built from 16-layer blocks made me suspect the input and output ‘processing units’ were smaller than 16 layers. I guessed that Alpindale had tried smaller overlaps, and they just didn’t work.
If that was true, maybe I didn’t need to teach a model new facts to make it smarter. I didn’t need fine-tuning. I didn’t need RLHF. I just needed to give it more layers to think with."
This sounds similar to Kimi's mixture-of-experts architecture, if I understood it correctly (likely I have not); can you comment on this?
MoE (mixture of experts) is an architecture that forces sparsity (not all 'neurons' are active during the forward pass).
This is pretty much orthogonal to that; it works with dense and MoE models, by repeating 'vertical' sections of the transformer stack.
> Did you try repeating the same mid-layer block more than once?

I tried that pretty early on, and it's basically never good. It's described in the section: https://dnhkng.github.io/posts/rys/#the-beginning-of-llm-neu...
Pretty cool though. LLM brain surgery.
[1] https://arxiv.org/abs/2401.08741