zh2408 6 days ago

The Linux repository has ~50M tokens, which goes beyond the 1M token limit for Gemini 2.5 Pro. I think there are two paths forward: (1) decompose the repository into smaller parts (e.g., kernel, shell, file system, etc.), or (2) wait for larger-context models with a 50M+ input limit.
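For path (1), a rough first step is to measure which top-level directories would fit on their own. This is only a sketch: the 4-characters-per-token ratio is an assumption rather than a measured value for any tokenizer, and `estimate_dir_tokens` and `oversized` are hypothetical helper names.

```python
import os

# Assumption: roughly 4 bytes of source per token. Real tokenizers differ.
CHARS_PER_TOKEN = 4


def estimate_dir_tokens(root):
    """Return {top_level_dir: estimated_tokens} for each directory under root."""
    totals = {}
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        # Skip plain files and hidden directories such as .git
        if not os.path.isdir(path) or entry.startswith("."):
            continue
        size = 0
        for dirpath, _dirnames, filenames in os.walk(path):
            for fname in filenames:
                fpath = os.path.join(dirpath, fname)
                # Ignore symlinks so nothing is counted twice
                if os.path.isfile(fpath) and not os.path.islink(fpath):
                    size += os.path.getsize(fpath)
        totals[entry] = size // CHARS_PER_TOKEN
    return totals


def oversized(totals, budget=1_000_000):
    """Names of directories whose estimate exceeds the context budget."""
    return [name for name, tokens in totals.items() if tokens > budget]
```

Any directory that `oversized` reports would need to be split further (per subdirectory, per file, or below) before it fits in a 1M window.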

achierius 6 days ago | parent | next [-]

Some huge percentage of that is just drivers. The kernel is likely what would be of interest to someone in this regard; moreover, much of that is architecture specific. IIRC the x86 kernel is <1M lines, though probably not <1M tokens.

throwup238 5 days ago | parent [-]

The AMDGPU driver alone is 5 million lines, out of about 37 million lines total. Over 10% of the codebase is a driver for a single vendor, although most of that is auto-generated per-product headers.

rtolsma 6 days ago | parent | prev | next [-]

For some languages, you can use the AST to identify modular components that are small enough to fit into the 1M window.
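A minimal sketch of the idea, using Python's standard `ast` module purely as an illustration (it only parses Python, so kernel C code would need a C parser instead). The function names and the 4-characters-per-token estimate are assumptions.

```python
import ast


def estimate_tokens(text):
    # Assumption: roughly 4 characters per token; real tokenizers differ.
    return max(1, len(text) // 4)


def top_level_units(source):
    """Return (name, source_text) for each top-level function or class."""
    tree = ast.parse(source)
    units = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment (Python 3.8+) returns the node's exact text
            units.append((node.name, ast.get_source_segment(source, node)))
    return units


def pack_units(units, budget):
    """Greedily group unit names into batches that stay under the budget.

    A single unit larger than the budget still gets its own batch.
    """
    batches, current, used = [], [], 0
    for name, text in units:
        cost = estimate_tokens(text)
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Each batch can then be sent to the model as one request. Greedy packing in source order keeps neighboring definitions together, which is a simple choice, not necessarily the best grouping.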

ryao 5 days ago | parent | prev [-]

The first path would be the most interesting, especially if it can be automated.