mike_hearn 1 days ago [-]
Don't work at a lab but I think they might be warping the probability distribution in the decoding step, at least to generate RL examples for training and maybe in production too.
There aren't other comments discussing this possibility at the moment, but you don't have to take the token predicted as most likely (greedy decoding). Most decoding strategies do something else which is where settings like temperature come in. So if you want the model to "think harder" you can track whether the current tokens are thinking or answer - in OpenAI's system that's called a channel - and then if you're in a thinking block you might get a model output whose top three predictions are:
60% <|channel=answer|>
10% Wait,
5% . The
[...]
Greedy decoding would stop thinking at this point and start answering, but you want the model to keep thinking so you skip that token and select the next most likely which is "Wait, ". The reasoning levels can map to the probability of skipping the channel change tokens.
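A minimal sketch of the idea above, purely illustrative. The token strings come from the example; the `SKIP_PROBABILITY` table, its numbers, and the `pick_token` helper are hypothetical and not taken from any lab's decoder.

```python
import random

CHANNEL_SWITCH = "<|channel=answer|>"

# Hypothetical mapping from reasoning level to the chance of skipping the
# channel-change token. The numbers are made up for illustration.
SKIP_PROBABILITY = {"low": 0.0, "medium": 0.5, "high": 0.9}


def pick_token(candidates, skip_prob, in_thinking, rng):
    """Greedy decoding, except the channel switch may be skipped while thinking.

    candidates: list of (token, probability) pairs, most likely first.
    skip_prob:  chance of refusing the channel switch when it is the top token.
    """
    top_token = candidates[0][0]
    if in_thinking and top_token == CHANNEL_SWITCH and len(candidates) > 1:
        if rng.random() < skip_prob:
            # Keep thinking: take the next most likely token instead.
            return candidates[1][0]
    return top_token


# The top-three predictions from the example above.
candidates = [(CHANNEL_SWITCH, 0.60), ("Wait, ", 0.10), (". The", 0.05)]

rng = random.Random(0)
print(pick_token(candidates, SKIP_PROBABILITY["low"], True, rng))   # always the switch token
print(pick_token(candidates, SKIP_PROBABILITY["high"], True, rng))  # usually "Wait, "
```

With a skip probability of 0.0 this reduces to plain greedy decoding, so the lowest level costs nothing extra.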
simianwords 1 days ago [-]
I also thought of this as a general idea: intelligence at the sampling level. Broadly, you have different tiers of intelligence:
1. Take the highest-probability token.
2. Use some lightweight code that tracks state, like the number of tokens or some sampling distribution.
3. At a higher level, use a smaller LLM to decide which token to sample (just a thought).
pyentropy 6 days ago [-]
Take a look at the harmony repo which specifies the internal OpenAI format - the effort level is specified in the context after the <|start|> tag - https://github.com/openai/harmony
Note that inference libs also have parsers that put hard limits on reasoning tokens with separate counters (similar to how you can put a limit on token generation per completion versus waiting for an <eos>). For that, take a look at vllm reasoning docs.
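To make the "separate counter" point concrete, here is a rough sketch of a generation loop with its own reasoning-token budget. It is not vLLM's API: `generate`, `toy_model`, and the `<think>`/`</think>`/`<eos>` strings are all hypothetical stand-ins.

```python
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"
EOS = "<eos>"


def generate(next_token, prompt, max_reasoning_tokens, max_total_tokens):
    """Run a decode loop with one overall limit and a separate reasoning limit."""
    out = []
    in_think = False
    reasoning_used = 0
    while len(out) < max_total_tokens:
        token = next_token(prompt + out)
        # Separate counter: once the reasoning budget is spent, override
        # whatever the model wanted and close the think block instead.
        if in_think and token != THINK_CLOSE and reasoning_used >= max_reasoning_tokens:
            token = THINK_CLOSE
        if token == EOS:
            break
        out.append(token)
        if token == THINK_OPEN:
            in_think, reasoning_used = True, 0
        elif token == THINK_CLOSE:
            in_think = False
        elif in_think:
            reasoning_used += 1
    return out


def toy_model(context):
    """Stand-in for a model that never stops thinking on its own."""
    if THINK_OPEN not in context:
        return THINK_OPEN
    if THINK_CLOSE not in context:
        return "hmm"
    if context[-1] == THINK_CLOSE:
        return "42"
    return EOS


result = generate(toy_model, ["question"], max_reasoning_tokens=3, max_total_tokens=50)
print(result)  # ['<think>', 'hmm', 'hmm', 'hmm', '</think>', '42']
```

Because the forced close goes into the context, the model produces its answer conditioned on a finished think block, which differs from trimming the trace after the fact.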
I think you have the right answer but I'm struggling to understand: does changing the effort change the prompt at the start of the conversation? If so, why design it this way at all? Why not just add a parameter at the end or something? At least that wouldn't break the cache.
Maybe like: add a secret suffix to your chat in the conversation to think more like
conversation....
Hey please help
[think more]
pyentropy 6 days ago [-]
I'm considering the possibility that it's good to break the prefix and cache because the LLM itself was rewarded (during post-training) with different prefixes/system prompts, each containing reasoning traces of the correct size.
I might be very very wrong though and LLMs disagree with me, insisting that cache is preserved and the system message doesn't have to change (even though it often contains effort level in context) if effort level changes across turns, and that all you have to do is tell the inference lib that parses think tags to early-close think tags that are too long.
simianwords 5 days ago [-]
This seems correct, but again I would like to think post-training could also have been done by checking only the string in the last message sent.
Centigonal 1 days ago [-]
For Deepseek V4, the main effect of raising the reasoning effort is to add a little section[1] to the end of the system prompt that says "BTW make sure to think really hard! :)"
If Anthropic's models work the same way, then changing reasoning effort would break the cache because the API has to modify the system prompt given at the very start of the context and rerun the whole thing through the inference server.
This kind of limitation is one reason Opus 4.8's mid-conversation system messages[2] are actually a pretty big deal (if they actually work).
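A toy illustration of why the position of the effort hint matters for a prefix cache. Words stand in for tokens, and the tags and helper name are made up; real servers cache KV blocks rather than Python lists, so treat this only as a sketch of the prefix-matching argument.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


system = ["<system>", "You", "are", "helpful", "</system>"]
history = ["<user>", "Hi", "<assistant>", "Hello", "<user>", "Hey", "please", "help"]
cached = system + history  # what the server already processed last turn

# Option A: the effort hint is appended to the end of the system prompt.
effort_in_system = system[:-1] + ["think", "hard", "</system>"] + history

# Option B: the effort hint is a suffix on the newest message.
effort_as_suffix = cached + ["[think more]"]

reused_a = shared_prefix_len(cached, effort_in_system)
reused_b = shared_prefix_len(cached, effort_as_suffix)
print(reused_a, "of", len(cached), "cached tokens reusable")  # 4 of 13
print(reused_b, "of", len(cached), "cached tokens reusable")  # 13 of 13
```

Option A diverges inside the system prompt, so everything after that point has to be recomputed; option B only adds new tokens at the end.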
> This kind of limitation is one reason Opus 4.8's mid-conversation system messages[2] are actually a pretty big deal (if they actually work).
Didn't they start injecting system messages telling Claude to calm his tits in overly long and emotional (iirc it triggered on some keywords) chat contexts last year?
simianwords 1 days ago [-]
I don't get it!? Why not just add the "BTW make sure to think really hard!" at the end in the new message?
Is it harder to post-train in such a way?
Centigonal 1 days ago [-]
The model is trained to treat system messages differently from regular user/assistant messages. Most models are trained to only expect system messages at the start of the conversation. This is changing now.
firemelt 20 hours ago [-]
lmao, I used to do that manually. Does that mean I was raising the effort?
aabdi 6 days ago [-]
Different models do slight variants.
Usually it’s done in post-training to enforce behavior based on the prompt, i.e. a system prompt with thinking:max or low or whatever.
Enforcement then goes via constrained decoding: checking for the think start and end tokens with max lengths, or other variations.
masfuerte 2 days ago [-]
I suspect it's not possible (as an end user) to get a thinking trace from one of the models. But what happens with "thinking" is that the model has a conversation with itself in an attempt to home in on a better answer to the original prompt.
The "amount of thinking" is how long this internal conversation is allowed to progress. The longer it goes on the more it costs. It's all part of the token budget but, because this internal dialogue is hidden, it's not obvious to the end user.
Chu4eeno 1 days ago [-]
> I suspect it's not possible (as an end user) to get a thinking trace from one of the models. But what happens with "thinking" is that the model has a conversation with itself in an attempt to home in on a better answer to the original prompt.
The model that summarizes what is inside the CoT/|thinking| tags is just an LLM, and it's just as jailbreakable/susceptible to prompt injection as any other LLM: https://x.com/lefthanddraft/status/1991076879877460322 (for those without X; that's Wyatt Walls demonstrating both getting the gemini summarizer to print the raw CoT, as well as just do random calculations, dump its system prompt, etc.)
bjourne 6 days ago [-]
LLMs work by generating the most likely continuation to a prompt. But they can also generate multiple likely continuations. This creates multiple branches, which in turn can generate even more branches. The LLM can then evaluate the branches, prune the unpromising ones, and merge the best ones. More branches means more tokens, which means more effort.
simianwords 6 days ago [-]
this has nothing to do with the thinking effort however
bjourne 6 days ago [-]
Yes, it does. Breadth of search is exactly what the effort setting controls.
No it doesn't, and let's not call people names. You can verify this using ChatGPT or anything else. You are mistaken and there are no "branches" happening.
bjourne 4 days ago [-]
[flagged]
tomhow 2 days ago [-]
Hey, name-calling like this is not cool on HN. You've been here long enough to know this, and we've asked you repeatedly to observe the guidelines. If you keep this up we'll have to assume you have no intention of using the site as intended and ban the account.
FergusArgyll 2 days ago [-]
I think you may be confusing the OpenAI "pro" series models with thinking. Those are rumored to be multi "branched".
__patchbit__ 6 days ago [-]
At a guess. May be associated with token length context window. Down selecting is consistent with warning message, forcing cutoff to context window. The technical term cache being a synonym. Increasing the headroom for more "thinking" should allow the implementation to access more resources without warning about the cache breaking.
sometimelurker 6 days ago [-]
they use multi-token prediction (MTP) behind the scenes, and that might interact with the CoT in a strange way. maybe for different thinking modes they have different MTP models? if so that's interesting
pyentropy 6 days ago [-]
The number of tokens you predict at a time (multi or not) has nothing to do with whether the model wants to emit any, some or a lot of reasoning tokens in the reasoning tag -- similar to how branch prediction will not really change the for loop iteration count.
sometimelurker 6 days ago [-]
no it might. a high reasoning task is probably harder than a low reasoning task, so the same MTP LLM will predict more correct tokens on the low reasoning task. to compensate for this, big labs likely have different MTP LLMs for different cases. it would make sense for them to do this
Avik_GH 1 days ago [-]
[flagged]
adithyaharish 1 days ago [-]
[dead]
Scarlett5 22 hours ago [-]
[flagged]
Yahyaaa 5 days ago [-]
Usually it’s not a different model, it’s the same model with different inference-time settings.
“Thinking effort” typically changes the compute budget and decoding behavior (how many steps, how much exploration, sometimes internal planning loops).
Some stacks also tie it to orchestration layers or system/prompt signals, which is why it can look inconsistent across products
shanewei 6 days ago [-]
My understanding is that it’s mostly an inference-time knob, not different weights.
OpenAI describes reasoning.effort as controlling how many reasoning tokens get used before the answer. Anthropic’s docs are even more explicit that effort trades off thoroughness vs token efficiency “with a single model”.
So I wouldn’t read the Claude Code cache warning as proof that a different model is being used. It may just mean the thinking/effort setting is part of the cache key.
https://docs.vllm.ai/en/latest/features/reasoning_outputs/#a...
https://developers.openai.com/api/docs/guides/reasoning
[1] https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/blob/ma...
[2] https://platform.claude.com/docs/en/build-with-claude/mid-co...
See https://developers.openai.com/cookbook/articles/openai-harmo... and src/openai/types/shared/reasoning_effort.py