| ▲ | osigurdson 3 days ago |
| I have a very basic / stupid "Turing test", which is just to ask the model to write a base62 converter in C#. I would think this exact thing would be on GitHub somewhere (and thus in the weights), but it has always failed for me in the past (non-scientific; I didn't try every single model). Using o4-mini-high, it actually did produce a working implementation after a bit of prompting. So yeah, today this test passed, which is cool. |
|
| ▲ | sebzim4500 3 days ago | parent | next [-] |
| Unless I'm misunderstanding what you are asking the model to do, Gemini 2.5 pro just passed this easily. https://g.co/gemini/share/e2876d310914 |
| |
| ▲ | osigurdson 3 days ago | parent | next [-] | | As I mentioned, this is not a scientific test, just something I have tried from time to time that has always (shockingly, in my opinion) failed but today worked. It takes a minute or two of prompting, is boring to verify, and I don't remember exactly which models I have used. It is purely a personal anecdote, nothing more. However, looking at the code that Gemini wrote in the link, it does the same thing that other LLMs often do, which is to assume that we are encoding individual long values. I assume there must be a GitHub repo or Stack Overflow question in the weights somewhere that is pushing it in this direction, but it is a little odd. Naturally, this isn't the kind of encoder that someone would normally want. Typically it should encode a byte array and return a string (or maybe encode / decode UTF-8 strings directly). Having the interface use a long is very weird and not very useful. In any case, I suspect with a bit more prompting you might be able to get Gemini to do the right thing. | | |
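The byte-array interface described above (encode `byte[]` → string, decode string → `byte[]`) can be sketched roughly as follows. This is a minimal illustration in Python rather than C# for brevity (the algorithm is language-agnostic), and the names `b62_encode` / `b62_decode` are made up for this sketch, not from the thread:

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def b62_encode(data: bytes) -> str:
    # Leading zero bytes vanish in the integer conversion, so count and
    # re-emit them explicitly as the zero digit of the alphabet.
    leading = len(data) - len(data.lstrip(b"\x00"))
    n = int.from_bytes(data, "big")
    chars = []
    while n:
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
    return ALPHABET[0] * leading + "".join(reversed(chars))

def b62_decode(s: str) -> bytes:
    # Mirror the encoder: restore the leading zero bytes, then rebuild
    # the big-endian integer from the remaining digits.
    leading = len(s) - len(s.lstrip(ALPHABET[0]))
    n = 0
    for c in s.lstrip(ALPHABET[0]):
        n = n * 62 + ALPHABET.index(c)
    body = n.to_bytes((n.bit_length() + 7) // 8, "big") if n else b""
    return b"\x00" * leading + body
```

The leading-zero handling is the part a long-based interface sidesteps entirely, and one reason the byte-array version is slightly more work to get right.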
| ▲ | int_19h 3 days ago | parent | next [-] | | I think it's because the question is rather ambiguous - "convert the number to base-N" is a very common API, e.g. in C# you have Convert.ToString(long value, int base), in JavaScript you have Number.toString(base) etc. It seems that it just follows this pattern. If you were to ask me the same question, I'd probably do the same thing without any further context. OTOH if you tell it to write a Base62 encoder in C#, it does consistently produce an API that can be called with byte arrays: https://g.co/gemini/share/6076f67abde2 | | |
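The "convert the number to base-N" pattern int_19h mentions is just positional digit extraction over a single integer, along the lines of this sketch (Python for brevity; `to_base` is an illustrative name, not the C# or JS API):

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"  # JS-style digits, up to base 36

def to_base(value: int, base: int) -> str:
    """Convert a non-negative integer to a base-N string,
    mimicking the shape of JS Number.toString(base)."""
    if value == 0:
        return "0"
    out = []
    while value:
        value, r = divmod(value, base)
        out.append(DIGITS[r])
    return "".join(reversed(out))
```

This is the interface LLMs seem to default to for "base62 converter", versus the byte-array codec osigurdson actually wanted.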
| ▲ | osigurdson 2 days ago | parent [-] | | There is Convert.ToBase64String, so I don't think "encode" is necessarily the universal term (though it is probably more precise). |
| |
| ▲ | jiggawatts 3 days ago | parent | prev [-] | | Similarly, many of my informal tests that never worked before have started passing with Gemini 2.5, which makes the 2025 era of AI models feel like a step change to me. |
| |
| ▲ | AaronAPU 3 days ago | parent | prev [-] | | I’ve been using Gemini 2.5 pro side by side with o1-pro and Grok lately. My experience is they each randomly offer significant insight the other two didn’t. But generally, o1-pro listens to my profile instructions WAY better, and it seems to be better at actually solving problems the first time. More reliable. But they are all quite similar and so far these new models are similar but faster IMO. |
|
|
| ▲ | croemer 3 days ago | parent | prev | next [-] |
| I asked o3 to build and test a maximum parsimony phylogenetic tree builder in Python (my standard test for new models) and it's been thinking for 10 minutes. It's still not clear if anything is happening; I have barely seen any code since I asked it to test what it produced in the first answer. The thought summary is totally useless compared to Gemini's. Underwhelming so far. The CoT summary is full of references to Jupyter notebook cells. The variable names are too abbreviated (nbr for neighbor), so the code becomes fairly cryptic as a result, not nice to read. Maybe optimized too much for speed. Also, I've noticed ChatGPT seems to abort thinking when I switch away from the app. That's stupid; I don't want to look at a spinner for 5 minutes. And the CoT summary keeps mentioning my name, which is irritating. |
| |
| ▲ | istjohn 3 days ago | parent [-] | | It's maddening that you can't switch away from the app while it generates output. To use the Deep Research feature on mobile, you have to give up your phone for ten minutes. | | |
| ▲ | scragz 2 days ago | parent [-] | | Deep Research will run in the background on mobile, and I think it gives a notification when done. It's not like normal chats that need the app to be in the foreground. |
|
|
|
| ▲ | NiloCK 3 days ago | parent | prev [-] |
| I could be misinterpreting your claim here, but I'll point out that LLM weights don't literally encode the entirety of the training data set. |
| |