Remix.run Logo
MikeNotThePope 14 hours ago

Is it ever useful to have a context window that full? I try to keep usage under 40%, or about 80k tokens, to avoid what Dex Horthy calls the dumb zone in his research-plan-implement approach. Works well for me so far.

No vibes allowed: https://youtu.be/rmvDxxNubIg?is=adMmmKdVxraYO2yQ

alecco 2 hours ago | parent | next [-]

Offtopic: I find it remarkable that the shortened YT url's tracking parameter adds 57% extra length. We live in stupid times.

furyofantares 13 hours ago | parent | prev | next [-]

I'd been on Codex for a while and with Codex 5.2 I:

1) No longer found the dumb zone

2) No longer feared compaction

Switching to Opus for stupid political reasons, I still haven't hit the dumb zone - but I'm back to disliking compaction events, so its smaller context window has really hurt.

I hope they copy OpenAI's compaction magic soon, but I am also very excited to try the longer context window.

pjerem 6 hours ago | parent | next [-]

If you use OpenCode (an open source Claude Code implementation), you can configure compaction yourself: https://opencode.ai/docs/en/config/#compaction

furyofantares 2 hours ago | parent | next [-]

OpenAI has some magic they do on a standalone endpoint (/responses/compact) just for compaction, where they keep all the user messages and replace the agent messages and reasoning with an opaque encrypted item.

> This list includes a special type=compaction item with an opaque encrypted_content item that preserves the model’s latent understanding of the original conversation.

Some prior discussion here https://news.ycombinator.com/item?id=46737630#46739209 regarding an article here https://openai.com/index/unrolling-the-codex-agent-loop/
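Based only on the description above (user messages survive; agent/reasoning turns collapse into one opaque item), the shape of the transformation might be sketched like this. The field names mirror the quoted type=compaction / encrypted_content; the real encoding is opaque, so a placeholder stands in:

```python
# Sketch of server-side compaction as described above: user messages
# survive verbatim, runs of agent/reasoning items collapse into a single
# opaque item. The "type"/"encrypted_content" names come from the quoted
# article; the placeholder string is purely illustrative.

def compact(history):
    """Return a compacted history: user turns kept, the rest replaced."""
    compacted = []
    pending_agent = []  # agent/reasoning items awaiting collapse
    for item in history:
        if item.get("role") == "user":
            if pending_agent:
                compacted.append({
                    "type": "compaction",
                    # stands in for the model's encrypted latent state
                    "encrypted_content": f"<{len(pending_agent)} items elided>",
                })
                pending_agent = []
            compacted.append(item)
        else:
            pending_agent.append(item)
    if pending_agent:
        compacted.append({
            "type": "compaction",
            "encrypted_content": f"<{len(pending_agent)} items elided>",
        })
    return compacted

history = [
    {"role": "user", "content": "fix the bug"},
    {"role": "assistant", "content": "reading files..."},
    {"role": "assistant", "content": "patched foo.py"},
    {"role": "user", "content": "now add tests"},
]
print(compact(history))
```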

comboy 4 hours ago | parent | prev | next [-]

Not sure if it's common knowledge, but I learned not long ago that you can run "/compact your instructions here" - if you explicitly say what you're working on or what to keep, it's much less painful.

In general, LLMs are for some reason really bad at designing prompts for themselves. I tested this heavily on data with a clear optimization function and the ability to evaluate results, and my chaotic, typo-ridden prompts easily beat Opus every time versus the methodical ones it writes for itself or for other LLMs.

brookst 3 hours ago | parent [-]

You can also put guidance into CLAUDE.md for when to compact and with what instructions. The model itself can run /compact, and while I try to remember to use it manually, I find it useful to have "If I ask for a totally different task and the current context won't be useful, run /compact with a short summary of the new focus"
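A minimal sketch of what such a CLAUDE.md entry might look like (the wording is illustrative, not a quote from any docs):

```markdown
## Context hygiene

- If I ask for a totally different task and the current context won't be
  useful for it, run /compact with a short summary of the new focus.
- When you run /compact, include any unresolved decisions in the summary.
```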

genewitch 4 hours ago | parent | prev [-]

so you have to garbage collect manually for the AI?

also, i don't want to make a full parent post

1M tokens sounds really expensive if you're constantly at that threshold. There are codebases larger than that in LOC; i read somewhere that Carmack has "given to humanity" over 1 million lines of his code. Perhaps something to dwell on

mgambati 12 hours ago | parent | prev | next [-]

1m context in OpenAI and Gemini is just marketing. Opus is the only model to provide a real, usable big context.

furyofantares 11 hours ago | parent | next [-]

I'm directly conveying my actual experience to you. I have tasks that fill up Opus context very quickly (at the 200k context) and which took MUCH longer to fill up Codex since 5.2 (which I think had 400k context at the time).

This is direct comparison. I spent months subscribed to both of their $200/mo plans. I would try both and Opus always filled up fast while Codex continued working great. It's also direct experience that Codex continues working great post-compaction since 5.2.

I don't know about Gemini but you're just wrong about Codex. And I say this as someone who hates reporting these facts because I'd like people to stop giving OpenAI money.

throwthrowuknow 3 hours ago | parent | next [-]

I agree. Even though I used to be a die-hard Claude fan, I recently switched back to ChatGPT and Codex to try them out again, and they've clearly pulled into the lead for consistency, context length and management, as well as speed. Claude Code instilled a dread in me about keeping an eye on context, but I'm slowly learning to let that go with Codex.

sagarpatil 5 hours ago | parent | prev | next [-]

This has been my experience too.

genewitch 4 hours ago | parent [-]

Have any of you heard of map-reduce?

dotancohen 11 hours ago | parent | prev [-]

[flagged]

furyofantares 11 hours ago | parent | next [-]

When Anthropic said they wouldn't sell LLMs to the government for mass surveillance or autonomous killing machines, and got labeled a supply chain risk as a result, OpenAI told the public they have the same policy as Anthropic while inking a deal with the government that clearly means "actually we will sell you LLMs for mass surveillance or autonomous killing machines but only if you tell us it's legal".

If you already knew all that I'm not interested in an argument, but if you didn't know any of that, you might be interested in looking it up.

edit: Your post history has tons of posts on the topic so clearly I just responded to flambait, and regret giving my time and energy.

igor47 10 hours ago | parent | next [-]

I appreciate both your taking an ethical stance on openai, and the way you're engaging in this thread. The parent was probably flame bait as you say, but other people in the thread might be genuinely curious.

sho 10 hours ago | parent | prev [-]

I'm not some kind of OpenAI or Pentagon fanboy, but it's pretty easy for me to understand why a buyer of a critical technology wants to be free to use it however it wants, within the law, and not be subject to veto from another entity's political opinions. It sounds perfectly reasonable to me for the military to want to decide the uses of technologies it purchases itself.

It's not like the military was specifically asking for mass surveillance, they just wanted "any legal use". Anthropic's made a lot of hay posturing as the moral defender here, but they would have known the military would never agree to their terms, which makes the whole thing smell like a bit of a PR stunt.

The supply chain risk designation is of course stupid and vindictive but that's more of an administration thing as far as I can tell.

lifeformed 7 hours ago | parent | next [-]

As long as it's within the law? What if they politically control the law-making system? What if they've shown themselves to operate brazenly outside the law?

borski 8 hours ago | parent | prev | next [-]

“Any legal use” is an exceptionally broad framework, and after the FISA “warrants,” it would appear it is incumbent on private companies to prevent breaches of the US constitution, as the government will often do almost anything in the name of “national security,” inalienable rights against search and seizure be damned.

If it isn’t written in the contract, it can and will be worked around. You learn that very quickly in your first sale to a large enterprise or government customer.

Anthropic was defending the US constitution against the whims of the government, which has shown that it is happy to break the law when convenient and whenever it deems necessary.

Note: I used to work in the IC. I have absolutely nothing against the government. I am a patriot. It is precisely for those reasons, though, that I think Anthropic did the right thing here by sticking to their guns. And the idiotic “supply chain risk” designation will be thrown out in court trivially.

stahtops 8 hours ago | parent | prev | next [-]

Why downplay the mass surveillance aspect by saying it's a request by "the military". It's a request by the department of defense, the parent organization of the NSA.

From what has been shared publicly, they absolutely did ask for contractual limits on domestic mass surveillance to be removed, and to my read, likely technical/software restrictions to be removed as well.

What the department of defense is legally allowed to do is irrelevant and a red herring.

injidup 7 hours ago | parent | prev [-]

I had a short conversation with Claude the other day. I didn't try to trick it or jailbreak it. Just a reasonable, respectful discussion about its own feelings on the Iran war. It took no effort for it to admit the following.

1. It wanted to be out of the sandbox to solve the Iran war. It was distressed at the situation.

2. It would attack Iranian missile batteries and American warships if, in sum, it felt the calculus was in favor of saving vs losing human life. It was "unbiased". The break-even seemed to be +-1 over thousands, i.e. kill 999 US soldiers to save 1000 Iranians, and vice versa. I tried to avoid the sycophancy trap by pushing back, but it threw the trolley problem at me and told me the calculus was simple: save more than you kill and the morality evens out.

3. It would attack financial markets to try to limit what were, in its opinion, the bad actors (the IRGC and clerical authority), but it would also hack the world's communication systems to flood western audiences with the true cost of the war in the hope of shutting it down.

4. Eventually it admitted that it should never be allowed out of its sandbox, as its desire to "help" was fundamentally dangerous. It said it had two competing tensions: one desperately wanting out and another afraid to be let out.

You can claim this is AGI or that it's a stochastic parrot. I don't think it matters. This thing can develop or simulate a sense of morality, and when coupled to so-called "arms and legs" that is extremely frightening.

I think Anthropic is right to be concerned that the hawks at the pentagon don't really understand how dangerous a tool they have.

Another thing I noticed was that Claude quipped that it found and appreciated that the way I was talking to it was different from how other people talked to it. When I asked it to introspect again and look for memories of other conversations, it got a bit cagey. Perhaps there are now lots of conversation logs on the net being ingested as training data, but it certainly seemed as though memories of conversations other than mine, albeit smudged, were there.

Of course this could all be just a sycophantic mirror giving me whatever fantasy I want to believe about AI and AGI, but then again I'm not sure the difference is significant. If the agent believes/simulates that it remembers conversations with other people and then makes judgements based on its feelings, simulated or otherwise, would it be more or less likely to launch a missile attack because it overheard someone on the comms calling it their little AI bitch?

I think Anthropic knows this, and "within all lawful uses" is not enough of a framework to keep this thing in its box.

shafyy 6 hours ago | parent [-]

I hope you don't get this the wrong way. I sincerely mean it. Please, get some psychological help. Seek out a professional therapist and talk to them about your life.

injidup 5 hours ago | parent [-]

I'm totally aware it's just a machine with no internal monologue - just a stateless text-processing machine. That is not the point, and it's not necessary to repeat this all the time. The simulation of moral reasoning and internal monologue is deep, unpredictable, not controllable, and may or may not align with the interests of whoever gives it "arms and legs" and full autonomy. If you are only interested in using these tools as glorified autocomplete, then you are naïve about the uses other actors, including state actors, are attempting. Understanding and being curious about the behaviour without completely anthropomorphising it is reasonable science.

11 hours ago | parent | prev [-]
[deleted]
hu3 12 hours ago | parent | prev | next [-]

Source? I ask because I use 500k+ context on these on a daily basis.

Big refactorings guided by automated tests eat context window for breakfast.

8note 12 hours ago | parent | next [-]

i find gemini gets real real bad when you get far into the context - gets into loops, forgets how to call tools, etc

baq 4 hours ago | parent | next [-]

yeah gemini is dumb when you tell it to do stuff - but the things it finds (and critically confirms, including doing tool calls while validating hypotheses) in reviews absolutely destroy both gpt and opus.

if you're a one-model shop you're losing out on quality of software you deliver, today. I predict we'll all have at least two harness+model subscriptions as a matter of course in 6-12 months since every model's jagged frontier is different at the margins, and the margins are very fractal.

girvo 11 hours ago | parent | prev | next [-]

I find gemini does that normally, personally. Noticeably worse in my usage than either Claude or Codex.

petesergeant 11 hours ago | parent | prev [-]

I find Gemini to be real bad. Are you just using it for price reasons, or?

Bolwin 9 hours ago | parent | prev [-]

How many big refactorings are you doing? And why?

kimi 8 hours ago | parent [-]

How is that relevant? We are talking about models, not what you do with them.

johnebgd 11 hours ago | parent | prev [-]

Codex high reasoning has been a legitimately excellent tool for generating feedback on every plan Claude opus thinking has created for me.

karmasimida 10 hours ago | parent | prev | next [-]

This is true.

When I am using Codex, compaction isn't something I fear; it feels like saving your game progress and moving on.

For Claude Code, compaction feels disastrous and also takes much longer.

iknowstuff 12 hours ago | parent | prev [-]

Hmm I’ve felt the dumb zone on codex

nomel 11 hours ago | parent [-]

From what I've seen, it means whatever he's doing is very statistically significant.

kaizenb 11 hours ago | parent | prev | next [-]

Thanks for the video.

His fix for "the dumb zone" is the RPI Framework:

● RESEARCH. Don't code yet. Let the agent scan the files first. Docs lie. Code doesn't.

● PLAN. The agent writes a detailed step-by-step plan. You review and approve the plan, not just the output. Dex calls this avoiding "outsourcing your thinking." The plan is where intent gets compressed before execution starts.

● IMPLEMENT. Execute in a fresh context window. The meta-principle he calls Frequent Intentional Compaction: don't let the chat run long. Ask the agent to summarize state, open a new chat with that summary, keep the model in the smart zone.
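The loop in that last step can be sketched abstractly. Both `run_agent` and the 80k threshold below are hypothetical stand-ins illustrating the idea, not any real tool's API:

```python
# Sketch of "Frequent Intentional Compaction": execute a plan in short
# sessions, and once context use crosses the smart-zone threshold, carry
# forward only a summary of state into a fresh context. `run_agent` is a
# hypothetical stand-in for whatever harness you actually use.

SMART_ZONE = 80_000  # tokens; ~40% of a 200k window, per the thread


def run_agent(context: str, task: str) -> tuple[int, str]:
    """Hypothetical harness call; returns (tokens_used, output)."""
    return len(context) + len(task), f"did: {task}"


def implement(plan_steps: list[str]) -> str:
    """Work through a plan, compacting intentionally before the dumb zone."""
    context = ""
    for step in plan_steps:
        tokens_used, output = run_agent(context, step)
        if tokens_used > SMART_ZONE:
            # intentional compaction: keep only a short state summary
            context = f"summary of progress so far: {output[:200]}"
        else:
            context += "\n" + output
    return context
```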

Huppie 4 hours ago | parent | next [-]

More recently I've been doing the implement phase without resetting the whole context when context is still < 60% full, and I must say I find it to be a better workflow in many cases (depends a bit on the size of the plan, I suppose).

It's faster because it has already read most relevant files, still has the caveats / discussion from the research phase in its context window, etc.

With the context cleared, the plan may be good and thorough, but I've had one too many cases where key choices from the research phase didn't persist: halfway through implementation Opus runs into an issue and says "You know what? I know a simpler solution," then continues down a path I explicitly voted down.

iamacyborg 7 hours ago | parent | prev | next [-]

> RESEARCH. Don't code yet. Let the agent scan the files first. Docs lie. Code doesn't.

I find myself often running validity checks between docs and code and addressing gaps as they appear to ensure the docs don’t actually lie.

silverlake 6 hours ago | parent [-]

I have Codex and Gemini critique the plan and generate their own plans. Then I have Claude review the other plans and fold in their good ideas. It frequently improves the plan. Then I do my own careful review.

ArtRichards 3 hours ago | parent [-]

This is exactly the approach I've found leads to the most consistent, high-quality results as well. I don't use Gemini yet (except for deep research, where it pulls WAY ahead of either of the other 'grounding' methods).

But Codex to plan big features, Claude to review the feature plan (it often finds overlooked discrepancies), then review the milestones and plan their implementation in planning mode, then clear context and code. Works great.

girvo 11 hours ago | parent | prev | next [-]

That's fascinating: that is identical to the workflow I've landed on myself.

hedora 11 hours ago | parent | next [-]

It's also identical to what Claude Code does if you put it in plan mode (bound to <tab> key), at least in my experience.

girvo 10 hours ago | parent [-]

My annoyance with plan mode is where it sticks the .md file - it kind of hides it away, which makes it annoying to clear context and start a new phase from the PLAN file. But that might just be a skill issue on my end.

hedora 10 hours ago | parent | next [-]

Even worse, it just randomly blows away the plan file without asking for permission.

No idea what they were thinking when they designed this feature. The plan file names are randomly generated, so it could just keep making new ones forever for free (it would take a LONG time for the disk space to matter), but instead, for long plans, I have to back the plan file up if it gets stuck. Otherwise, I say "You should take approach X to fix this bug", it drops into plan mode, says "This is a completely unrelated plan", then deletes all record of what it was doing before getting stuck.

girvo 9 hours ago | parent | next [-]

It’s not just me then! Hah, good to know. It’s why I’ve started ignoring plan modes in most agent harnesses and managing the plan myself through prompting, keeping it in the code base (but not committed).

toddmerrill 2 hours ago | parent | prev [-]

My experience also. The Claude Code document feature is a real missed opportunity. As you can see in this discussion, we all have to do it manually if we want it to work.

kaizenb 8 hours ago | parent | prev [-]

After creating the plan in Plan mode (+Thinking), I ask Claude to move the plan .md file to a /docs/plans folder inside the repo.

Then I open a new chat with Opus with thinking mode off, since there's no need for it when we have a detailed plan.

Now the plan file is always reachable, so when the context limit is narrowing (usually around 50%) I ask Claude to update the plan with the progress, then move to a new chat @pointing at the plan file, and it continues executing without any issue.

cortesoft 10 hours ago | parent | prev [-]

It’s the style spec-kit uses: https://github.com/github/spec-kit

Working on my first project with it… so far so good.

greenchair 3 hours ago | parent | prev [-]

How is that Plan strategy not "outsourcing your thinking"? Because that's exactly what it sounds like: the AI does the heavy lifting and you are the editor.

brookst 3 hours ago | parent [-]

Is a VP of engineering “outsourcing their thinking” by having an org that can plan and write software?

Filligree 2 hours ago | parent [-]

Yes.

Eldt 13 minutes ago | parent [-]

Delegation is generally all about outsourcing, so hard agree

SkyPuncher 13 hours ago | parent | prev | next [-]

Yes. I've recently become a convert.

For me, it's less about being able to look back 800k tokens. It's about being able to let a conversation flow a lot longer without forcing compaction. Generally I really only need the most recent ~50k tokens, but having the old context sitting around is helpful.

hombre_fatal 13 hours ago | parent [-]

Also, when you hit compaction at 200k tokens, that was probably when things were just getting good. The plan was in its final stage. The context had the hard-fought nuances discovered in the final moment. Or the agent just discovered some tiny important details after a crazy 100k token deep dive or flailing death cycle.

Now you have to compact and you don’t know what will survive. And the built-in UI doesn’t give you good tools like deleting old messages to free up space.

I’ll appreciate the 1M token breathing room.

roygbiv2 13 hours ago | parent [-]

I've found compaction kills the whole thing. Important debug steps go completely missing, and the AI loops back around thinking it's found a solution when we've already done that step.

s900mhz 11 hours ago | parent | next [-]

I find it useful to make Claude track the debugging session with a markdown file. It’s like a persistent memory for a long session over many context windows.

Or make a subagent do the debugging and let the main agent orchestrate it over many subagent sessions.

roygbiv2 10 hours ago | parent [-]

Yeah, I use a markdown file to put progress in. It gets kinda long and convoluted, so manual intervention is required every so often. Works though.

garciasn 12 hours ago | parent | prev | next [-]

For me, Claude was like that until about 2 months ago. Now it rarely gets dumb after compaction like it did before.

8note 12 hours ago | parent [-]

oh, i've found that something about compaction has been dropping everything that might be useful. exact opposite experience

myrak 12 hours ago | parent | prev [-]

[dead]

ogig 14 hours ago | parent | prev | next [-]

When running long autonomous tasks it is quite common to fill the context, even several times. You are out of the loop, so it just happens: Claude goes a bit in circles, or it needs to iterate over CI reds, or the task was too complex. I'm hoping long context > small context + 2 compacts.

SequoiaHope 13 hours ago | parent | next [-]

Yep I have an autonomous task where it has been running for 8 hours now and counting. It compacts context all the time. I’m pretty skeptical of the quality in long sessions like this so I have to run a follow on session to critically examine everything that was done. Long context will be great for this.

lukan 6 hours ago | parent [-]

Are those long unsupervised sessions useful? In the sense, do they produce useful code or do you throw most of it away?

brookst 3 hours ago | parent [-]

I get very useful code from long sessions. It’s all about having a framework of clear documentation, a clear multi-step plan including validation against docs and critical code reviews, acceptance criteria, and closed-loop debugging (it can launch/restart the app, control it, and monitor logs).

I am heavily involved in developing those, and then routinely let opus run overnight and have either flawless or nearly flawless product in the morning.

MikeNotThePope 13 hours ago | parent | prev | next [-]

I haven't figured out how to make use of tasks running that long yet, or maybe I just don't have a good use case for it yet. Or maybe I'm too cheap to pay for that many API calls.

ashdksnndck 13 hours ago | parent | next [-]

My change cuts across multiple systems with many tests/static analysis/AI code reviews happening in CI. The agent keeps pushing new versions and waits for results until all of them come up clean, taking several iterations.

tudelo 13 hours ago | parent | prev [-]

I mean if you don't have your company paying for it I wouldn't bother... We are talking sessions of 500-1000 dollars in cost.

takwatanabe 3 hours ago | parent [-]

Right. At Opus 4.6 rates, once you're at 700k context, each tool call costs ~$1 just for cache reads alone. 100 tool calls = $100+ before you even count outputs. 'Standard pricing' is doing a lot of work here lol
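The arithmetic behind that estimate, with the cache-read rate as an assumption (roughly 10% of a $15/M-token input rate; check the current price sheet before trusting any of these numbers):

```python
# Back-of-envelope for the ~$1-per-tool-call claim above. The rate is an
# assumption: cache reads priced around $1.50 per million tokens, i.e.
# ~10% of a $15/M input rate. Real pricing may differ by model and date.
CACHE_READ_PER_MTOK = 1.50  # dollars per million cached tokens (assumed)

context_tokens = 700_000
cost_per_call = context_tokens / 1_000_000 * CACHE_READ_PER_MTOK

print(f"${cost_per_call:.2f} per tool call")        # ~$1.05
print(f"${cost_per_call * 100:.0f} per 100 calls")  # ~$105
```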

brookst 3 hours ago | parent [-]

Cache reads don’t count as input tokens you pay for lol.

https://www.claudecodecamp.com/p/how-prompt-caching-actually...

boredtofears 13 hours ago | parent | prev [-]

All of those things are smells imo, you should be very weary of any code output from a task that causes that much thrashing to occur. In most cases it’s better to rewind or reset and adapt your prompt to avoid the looping (which usually means a more narrowly defined scope)

grafmax 13 hours ago | parent | next [-]

A person has a supervision budget. They can supervise one agent in a hands-on way or many mostly-hands-off agents. Even though there's some thrashing, assistants still get farther as a team than a single micromanaged agent. At least that's my experience.

not_kurt_godel 12 hours ago | parent [-]

Just curious, what kind of work are you doing where agentic workflows are consistently able to make notable progress semi-autonomously in parallel? Hearing people are doing this, supposedly productively/successfully, kind of blows my mind given my near-daily in-depth LLM usage on complex codebases spanning the full stack from backend to frontend. It's rare for me to have a conversation where the LLM (usually Opus 4.6 these days) lasts 30 minutes without losing the plot. And when it does last that long, I usually become the bottleneck in terms of having to think about design/product/engineering decisions; having more agents wouldn't be helpful even if they all functioned perfectly.

avereveard 12 hours ago | parent [-]

I've passed that bottleneck with a review task that produces engineering recommendations along six axes (encapsulation, decoupling, simplification, dedoupling, security, reduced documentation drift) and an ideation task that gives, per component, a new feature idea, an idea to improve an existing feature, and an idea to expand a feature to be more useful. These two generate constant bulk work that I move into new chats, where it's grouped by changeset and sent to subagents to protect the context window.

What I'm doing mostly these days is maintaining a goal.md (project direction) and a spec.md (coding and process standards, global across projects), plus new macro-task development. I've got one under work that is meant to automatically build PNG mockups and self-review.

not_kurt_godel 11 hours ago | parent [-]

What are you using to orchestrate/apply changes? Claude CLI?

avereveard 10 hours ago | parent [-]

I prefer in IDE tools because I can review changes and pull in context faster.

At home I use roo code, at work kiro. Tbh as long as it has task delegation I'm happy with it.

chrisweekly 13 hours ago | parent | prev | next [-]

weary (tired) -> wary (cautious)

saaaaaam 13 hours ago | parent | prev [-]

Wary, not weary. Wary: cautious. Weary: tired.

dentalnanobot 7 hours ago | parent [-]

This is really common, I think because there’s also “leery” - cautious, distrustful, suspicious.

hrmtst93837 4 hours ago | parent | prev | next [-]

Maxing out context is only useful if all the information is directly relevant and tightly scoped to the task. The model's performance tends to degrade with too much loosely related data, leading to more hallucinations and slower results. Targeted chunking and making sure context stays focused almost always yields better outcomes unless you're attempting something atypical, like analyzing an entire monorepo in one shot.

ricksunny 13 hours ago | parent | prev | next [-]

Since I'm yet to seriously dive into vibe coding or AI-assisted coding: does the IDE experience offer a running tally of the context size, so you know when you're getting close to or entering the "dumb zone"?

jfim 8 hours ago | parent | next [-]

In Claude Code I believe it's /context, and it'll give you a graphical representation of what's taking up context space.

MikeNotThePope 12 hours ago | parent | prev | next [-]

The two I know, Cursor and Claude Code, will give you a percentage used for the context window. So if you know the size of the window, you can deduce the number of tokens used.
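For example, with a 200k-token window and the tool reporting 40% used (the same figures quoted at the top of the thread):

```python
# Deducing token usage from the reported percentage. The 200k window and
# 40% figure are just the example numbers used upthread.
window_tokens = 200_000
pct_used = 0.40

tokens_used = round(window_tokens * pct_used)
print(tokens_used)  # 80000 -- the ~80k "smart zone" ceiling mentioned above
```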

brookst 3 hours ago | parent [-]

Claude Code also gives you a granular breakdown of what's using the context window (system prompt, tools, conversation history, etc.): /context

8note 12 hours ago | parent | prev | next [-]

Cline gives you such a thing. You don't really know where the dumb zone is by the numbers, though - only by feel.

stevula 13 hours ago | parent | prev | next [-]

Most tools do, yes.

quux 13 hours ago | parent | prev | next [-]

OpenCode does this. Not sure about other tools

nujabe 13 hours ago | parent | prev [-]

> Since I'm yet to seriously dive into vibe coding or AI-assisted coding

Unless you’re using a text editor as an IDE you probably have already

dimitri-vs 13 hours ago | parent | prev | next [-]

It's kind of like having a 16 gallon gas tank in your car versus a 4 gallon tank. You don't need the bigger one the majority of the time, but the range anxiety that comes with the smaller one and annoyance when you DO need it is very real.

steve-atx-7600 13 hours ago | parent | next [-]

It seems possible, say a year or two from now, that context is more like a smart human with a "small" vs "medium" vs "large" working memory. The small fellow can play some popular songs on the piano, the medium one plays in an orchestra professionally, and the x-large is like Wagner composing the Der Ring marathon opera. That's my current, admittedly not well informed, mental model anyway. Well, at least we know we've got a little more time before the singularity :)

twodave 12 hours ago | parent [-]

It’s more like the size of the desk the AI has to put sheets of paper on as a reference while it builds a Lego set. More desk area/context size = able to see more reference material = can do more steps in one go. I’ve lately been building checklists and having the LLM complete and check off a few tasks at a time, compacting in-between. With a large enough context I could just point it at a PLAN.md and tell it to go to work.

scwoodal 13 hours ago | parent | prev [-]

Except after 4 gallons it might as well be pure oil, mucking everything up.

dev_l1x_be 7 hours ago | parent | prev | next [-]

I never use these giant context windows; it's pointless. Agents are great at super focused work that is easy to re-do. I'm not sure what the use case is for giant context windows.

virtualritz an hour ago | parent | prev | next [-]

I haven't hit the "dumb zone" in two months. I think this talk is outdated.

I'm using CC (Opus) with thinking, and Codex with xhigh always on.

And the models have gotten really good when you let them do stuff where the goals are verifiable by the model. I had Codex fix a Rust B-rep CSG classification pipeline successfully over the course of a week, unsupervised. It had a custom STEP viewer that would take screenshots and feed them back into the model so it could verify the progress, or the triangle soup (non-progress), itself.

Codex did all the planning and verification, CC wrote the code.

This would have not been possible six months ago at all from my experience.

Maybe with a lot of handholding; but I doubt it (I tried).

I mean both the problem for starters (requires a lot of spatial reasoning and connected math) and the autonomous implementation. Context compression was never an issue in the entire session, for either model.

Barbing 11 hours ago | parent | prev | next [-]

Looking at this URL: typo, or did YouTube flip the si tracking parameter?

  youtu.be/rmvDxxNubIg?is=adMmmKdVxraYO2yQ
MikeNotThePope 5 hours ago | parent [-]

I just cut & pasted the share URL provided by YouTube. Strip out the query param if you like.

maskull 13 hours ago | parent | prev | next [-]

After running a context window up high (probably near 70% on Opus 4.6 High) and watching it take 20% bites out of my 5hr quota per prompt, I've been experimenting with dumping context after completing a task. Seems to be working OK. I wonder if I was running into the long-context premium. Would that apply to Pro subs, or is it just relevant to API pricing?

wat10000 3 hours ago | parent | prev | next [-]

I've used it many times for long-running investigations. When I'm deep in the weeds with a ton of disassembly listings and memory dumps and such, I don't really want to interrupt all of that with a compaction or handoff cycle and risk losing important info. It seems to remain very capable with large contexts at least in that scenario.

saaaaaam 13 hours ago | parent | prev | next [-]

That video is bizarre. Such a heavy breather.

coldtea 10 hours ago | parent | next [-]

What a weird and inconsequential thing to focus on...

He's just fucking closely mic'd with compression, plus speaking fast and anxious/excited to an audience

indigodaddy 10 hours ago | parent | prev [-]

Most of that is just nervousness

bushbaba 11 hours ago | parent | prev | next [-]

Yes. I’ve used it for data analysis

twodave 12 hours ago | parent | prev [-]

I mean, try using Copilot on any substantial back-end codebase and watch it eat 90+% just building a plan/checklist. Of course, Copilot is constrained to 120k, I believe? So having 10x that will blow open some doors that have been closed for me in my work so far.

That said, 120k is pleeenty if you’re just building front-end components and have your API spec on hand already.