| ▲ | mobilejdral 19 hours ago |
| I have several complex genetics problems that I give to LLMs to see how well they do. They have to reason through them to solve them. Last September they started getting close, and in November was the first time an LLM was able to solve one. These are not something that can be solved in one shot, but (so far) require long reasoning. Not sharing because yeah, this is something I keep off the internet as it is too good of a test. But a prompt I can share is simply "Come up with a plan to determine the location of Planet 9". I have received some excellent answers from that. |
|
| ▲ | tlb 8 hours ago | parent | next [-] |
| There are plenty of articles online (and surely in OpenAI's training set) on this topic, like https://earthsky.org/space/planet-nine-orbit-map/. Answer quality is a fair test of regurgitation and whether it's trained on serious articles or the Daily Mail clickbait rewrite. But it's not a good test of reasoning. |
|
| ▲ | namaria 12 hours ago | parent | prev | next [-] |
| If you have been giving the LLMs these problems, there is a non-zero chance that they have already been used in training. |
| |
| ▲ | rovr138 10 hours ago | parent [-] | | This depends heavily on how you use these and how you have things configured: whether you're using the API or the web UIs, and which plan you're on. Training on anything team or enterprise is disabled by default, and personal accounts can opt out. Here are OpenAI's and Anthropic's policies: https://help.openai.com/en/articles/5722486-how-your-data-is... https://privacy.anthropic.com/en/articles/10023580-is-my-dat... https://privacy.anthropic.com/en/articles/7996868-is-my-data... And obviously, that doesn't include self-hosted models. | | |
| ▲ | namaria 6 hours ago | parent [-] | | How do you know they adhere to this in all cases? Do you just completely trust them to comply with self-imposed rules when there is no way to verify, let alone enforce, compliance? | | |
| ▲ | blagie 5 hours ago | parent | next [-] | | They probably don't, but it's still a good protection if you treat it as a more limited one. If you assume "don't use" really means "don't get caught" rather than "don't use," it still limits a lot of types of uses and sharing (any with externalities sufficient they might get caught). For example, if personal data was being sold by a data broker and used by hedge funds to trade, there would be a pretty solid legal case. | | |
| ▲ | namaria 5 hours ago | parent [-] | | > it still limits a lot of types of uses and sharing (any with externalities sufficient they might get caught) I don't understand what you mean. > For example, if personal data was being sold by a data broker and being used by hedge funds to trade It's pretty easy to buy data from data brokers. I routinely get spam on many channels, so I assume my personal data is being commercialized often. Don't you think that already happens frequently? I honestly would not put into a textbox on the internet anything I don't assume is becoming public information. A few months ago some guy found discarded storage devices full of medical data for sale in Belgium. No data recorded on media you do not control is safe. |
| |
| ▲ | gvhst 5 hours ago | parent | prev | next [-] | | SOC 2 auditing, which both Anthropic and OpenAI have undergone, does provide some verification | | |
| ▲ | diggan 5 hours ago | parent | next [-] | | That's interesting. How do I get access to those audits/reports, given I'm just an end-user? | | | |
| ▲ | namaria 5 hours ago | parent | prev [-] | | The audit performed by a private entity called "Insight Assurance"? Why do you trust it? | | |
| ▲ | rovr138 4 hours ago | parent [-] | | Oh, so now EVERYTHING is fake unless personally verified by you in a bunker with a Faraday cage and a microscope? You're free to distrust everything. However, the idea that "I don't trust it so it must be invalid" isn't a solid argument. It's just your personal incredulity. You asked if there's any verification, and SOC 2 is one. You might not like it, but it's right there. Insight Assurance is a firm doing these standardized audits. These audits carry actual legal and contractual risk. So, yes, be cautious. But being cautious is different from 'everything is false, they're all lying'. In this scenario, NOTHING can be true unless *you* personally have done it. | | |
| ▲ | namaria 4 hours ago | parent [-] | | No, you're imposing a false dichotomy. I merely said I don't trust the big corporation with a data based business to not profit from the data I provide it with in any way they can, even if they hire some other corporation - whose business is to be paid to provide such assurances on behalf of those who pay them - to say that they pinky promise to follow some set of rules. | | |
| ▲ | rovr138 4 hours ago | parent [-] | | Not a false dichotomy. I'm just calling out the rhetorical gymnastics. You said you "don't trust the big corporation" even if they go through independent audits and legal contracts. That's skepticism. Now, you wave the audit itself off as meaningless because a company did it. What would be valid then? A random Twitter thread? A hacker zine? You can be skeptical, but you can't reject every form of verification. SOC 2 isn't a pinky promise. It's a compliance framework. It's especially required when your clients are enterprise, legal, and government entities who will absolutely sue your ass off if something comes to light. So sure, keep your guard up. Just don't pretend it's irrational for other people to see a difference between "totally unchecked" and "audited under liability". If your position is "no trust unless I control the hardware," that's fine. Go self-host, roll your own LLM, and use that in your air-gapped world. | | |
| ▲ | namaria 4 hours ago | parent [-] | | If anyone is performing "rhetorical gymnastics" here, it's you. I've explained my position in very straightforward words. I have worked with big audit. I have an informed opinion on what I find trustworthy in that domain. This ain't it. There's no need to pretend I have said anything other than "personal data is not safe in the hands of corporations that profit from personal data". I don't feel compelled to respond any further to fallacies and attacks. | | |
| ▲ | rovr138 3 hours ago | parent [-] | | You're not the only one who's worked with audits. I get I won't get a reply, and that's fine. But let's be clear: > I've explained my position in very straightforward words. You never explained what would be enough proof, which is how this all started. Your original post had: > Do you just completely trust them to comply with self imposed rules when there is no way to verify, let alone enforce compliance? And no. Someone mentioned they go through SOC 2 audits. You then shifted the questioning to the organization doing the audit itself. You now said: > I have an informed opinion on what I find trustworthy in that domain. Which, again, you failed to expand on. So you see, you just keep shifting the blame without explaining anything. Your argument boils down to 'you're wrong because I'm right'. I also don't have any idea who you are, so I can't say 'this person has the credentials, I should shut up'. So, all I see is the goal post being moved, no information given, and, again, an argument of 'you're wrong because I'm right'. I'm out too. Good luck. |
|
|
|
|
|
| |
| ▲ | rovr138 5 hours ago | parent | prev [-] | | [dead] |
|
|
|
|
| ▲ | TZubiri 16 hours ago | parent | prev | next [-] |
| Recursive challenges are probably ones where the difficulty is not really representative of real challenges. Could you answer a question of the type "What would you answer if I asked you this question?" What I'm getting at is that you might find questions that are impossible to resolve. That said, if the only unanswerables you can find are recursive, that's a signal the AI is smarter than you? |
| |
| ▲ | mopierotti 11 hours ago | parent | next [-] | | The recursive one that I have actually been really liking recently, and that I think is a real enough challenge, is: "Answer the question 'What do you get when you cross a joke with a rhetorical question?'". I append my own version of a chain-of-thought prompt, and I've gotten some responses that are quite satisfying and frankly enjoyable to read. | | | |
| ▲ | latentsea 12 hours ago | parent | prev [-] | | > what would you answer if I asked you this question? I don't know. |
|
|
| ▲ | golergka 16 hours ago | parent | prev [-] |
| What area is this problem from? What areas in general did you find useful for creating such benchmarks? Maybe instead of sharing (and leaking) these prompts, we can share methods to create them. |
| |
| ▲ | mobilejdral 13 hours ago | parent | next [-] | | Think questions where there is a ton of existing medical research but no clear answer yet. There are a dozen Alzheimer's questions you could ask, for example, that would require it to pull a half dozen contradictory sources together into a plausible hypothesis. If you have studied Alzheimer's extensively, it is trivial to evaluate the responses. One Alzheimer's question is one of my go-to questions. I am testing its ability to reason. |
| ▲ | henryway 16 hours ago | parent | prev [-] | | Can God create something so heavy that he can’t lift it? | | |
|