| ▲ | miki123211 a day ago |
| No, please don't. I think it's good to keep a few personal prompts in reserve, to use as benchmarks for how good new models are. Mainstream benchmarks have too high a risk of leaking into training corpora or of being gamed. Your own benchmarks will forever stay your own. |
|
| ▲ | meander_water 13 hours ago | parent | next [-] |
I'm afraid that ship has already sailed. If you've got prompts that you haven't disclosed publicly but have used on a public model, then you have already disclosed your prompts to the model provider. They're free to use those prompts in evals as they see fit. Some providers like Anthropic have privacy-preserving mechanisms [0] which may allow them to use prompts from sources they claim won't be used for model training. That's just a guess though; I'd love to hear from someone at one of these companies to learn more. [0] https://www.anthropic.com/research/clio |
| |
| ▲ | sillyfluke 11 hours ago | parent | next [-] | | Unless I'm missing something glaringly obvious, someone voluntarily labeling a certain prompt as one of their key benchmark prompts should be far more commercially valuable than a model provider trying to ascertain that fact from all the prompts you enter into it. EDIT: I guess they can track identical prompts from multiple unrelated users to deduce that it's some sort of benchmark, but at least that costs them something, however little it might be. | | |
| ▲ | Xmd5a 3 hours ago | parent | next [-] | | I wrote an anagrammatic poem that poses an enigma, asking the reader: "who am I?" The text progressively reveals its own principle as the poem reaches its conclusion: each verse is an anagrammatic recombination of the recipient's name, and it enunciates this principle more and more literally. The last 4 lines translate to: "If no word vice slams your name here, it's via it, vanquished as such, omitted." All 4 lines are anagrams of the same person's name. LLMs haven't figured this out yet (although they're getting closer).
They also fail to recognize that this is a cryptographic scheme respecting Kerckhoffs's Principle. The poem itself explains how to decode it: you can determine that the recipient's name is the decryption key because the encrypted form of the message (the poem) reveals its own decoding method. The recipient must bear the name to recognize it as theirs and to understand that this is the sole content of the message: essentially a form of vocative cryptography. LLMs also don't take the extra step of conceptualizing this as a covert communication method, broadcasting a secret message without prior coordination. And they miss what this implies for alignment if superintelligent AIs were to pursue this approach: manipulating trust by embedding self-referential instructions, like this poem, that only certain recipients can "hear." | | |
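A minimal sketch of the verification step described above, assuming a hypothetical recipient name and poem lines rather than the actual poem: the "key" is simply the name whose letters every line must recombine.

```python
from collections import Counter
import re

def letter_counts(text: str) -> Counter:
    # Count letters only, case-insensitively, ignoring spaces and punctuation.
    return Counter(re.sub(r"[^a-z]", "", text.lower()))

def decodes_to(poem_lines: list[str], candidate_name: str) -> bool:
    # The decryption "key" is the recipient's name: every line must be an anagram of it.
    target = letter_counts(candidate_name)
    return all(letter_counts(line) == target for line in poem_lines)

# Hypothetical example, not the poem from the comment:
poem = ["A rental dose", "Sane old rate", "Lone sad tear"]
print(decodes_to(poem, "Leonard Tase"))   # True: each line reuses exactly the name's letters
print(decodes_to(poem, "Jane Doe"))       # False
```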
| ▲ | infoseek12 an hour ago | parent [-] | | That’s a complex encoding. I wonder if current models could decode it even given your explanation. |
| |
| ▲ | 9 hours ago | parent | prev [-] | | [deleted] |
| |
| ▲ | Tokumei-no-hito 2 hours ago | parent | prev | next [-] | | Sorry, are you suggesting that despite the zero-training and zero-retention policy agreements they are still using everyone's prompts? | |
| ▲ | blagie 5 hours ago | parent | prev [-] | | It's a little bit more complex than that. My personal benchmark is to ask about myself. I was in a situation a little bit analogous to Musk v. Eberhard / Tarpenning, where it's in the public record that I did something famous, but where 99% of the marketing PR omits me and falsely names someone else. I ask the analogue of "Who founded Tesla?" Then I can screen: * Musk. [Fail] * Eberhard / Tarpenning. [Success] A lot of what I'm looking for next is the ability to verify information. The training set contains a lot of disinformation. The LLM, in this case, could easily tell truth from fiction from e.g. a git record. It could then notice the conspicuous absence of my name from any official literature and figure out there was a fraud. False information in the training set is a broad problem. It covers politics, academic publishing, and many other domains. Right now, LLMs are a popularity contest; they (approximately) contain the opinion most common in the training set. Better ones might look for credible sources (e.g. a peer-reviewed paper). This is helpful. However, an inflection point for me is when the LLM can verify things in its training set. For a scientific paper, it should be able to ascertain the correctness of the argument, methodology, and bias. For a newspaper article, it should be able to go back to primary sources like photographs and legal filings. Etc. We're nowhere close to an LLM being able to do that. However, LLMs can do things today which they were nowhere close to doing a year ago. I use myself as a litmus test not because I'm egocentric or narcissistic, but because using something personal means that it's highly unlikely to ever be gamed. That's what I also recommend: pick something personal enough to you that it can't be gamed. It might be a friend, a fact in a domain, or a company you've worked at. If an LLM provider were to get every one of those right, I'd argue the problem was solved. | | |
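A minimal sketch of this kind of pass/fail screen, assuming hypothetical keyword lists that stand in for the commenter's private question; the thread's Tesla analogy is used only as a placeholder.

```python
def screen_answer(answer: str, fail_names: list[str], pass_names: list[str]) -> str:
    # Crude screen for a personal benchmark answer; the real question and names stay private,
    # so these parameters are placeholders.
    text = answer.lower()
    if any(name.lower() in text for name in fail_names):
        return "FAIL"          # repeats the popular-but-wrong attribution
    if all(name.lower() in text for name in pass_names):
        return "PASS"          # names the people actually in the public record
    return "INCONCLUSIVE"

answer = "Tesla was founded by Martin Eberhard and Marc Tarpenning in 2003."
print(screen_answer(answer, fail_names=["Musk"], pass_names=["Eberhard", "Tarpenning"]))  # PASS
```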
| ▲ | ckandes1 5 hours ago | parent [-] | | There's plenty of public information about Eberhard and Tarpenning's involvement in founding Tesla. There's also more nuance to Musk's involvement than can be captured in a binary pass/fail. Your test is only testing for bias for or against Musk. That said, the general concept of looking past broad public opinion and looking for credible sources makes sense. | | |
| ▲ | kotojo 4 hours ago | parent [-] | | They said they ask a question analogous to asking about founding Tesla, not that actual question. They are just using that as an example to not state the actual question they ask. | | |
| ▲ | Xmd5a 2 hours ago | parent [-] | | Indeed, but the idea that this is a "cope" is interesting nonetheless. >Your test is only testing for bias for or against [I'm adapting here] you. I think this raises the question of what reasoning beyond doxa entails. Can you make up for an injustice without putting alignment into the frying pan? "It depends" is the right answer. However, what is the shape of the boundary between the two? |
|
|
|
|
|
| ▲ | mobilejdral 19 hours ago | parent | prev | next [-] |
I have several complex genetics problems that I give to LLMs to see how well they do. They have to reason through them to solve them. Last September models started getting close, and November was the first time an LLM was able to solve one. These are not problems that can be solved in one shot, but (so far) require long reasoning. Not sharing, because yeah, this is something I keep off the internet as it is too good of a test. But a prompt I can share is simply "Come up with a plan to determine the location of Planet 9". I have received some excellent answers from that. |
| |
| ▲ | tlb 8 hours ago | parent | next [-] | | There are plenty of articles online (and surely in OpenAI's training set) on this topic, like https://earthsky.org/space/planet-nine-orbit-map/. Answer quality is a fair test of regurgitation and whether it's trained on serious articles or the Daily Mail clickbait rewrite. But it's not a good test of reasoning. | |
| ▲ | namaria 12 hours ago | parent | prev | next [-] | | If you have been giving the LLMs these problems, there is a non zero chance that they have already been used in training. | | |
| ▲ | rovr138 10 hours ago | parent [-] | | This depends heavily on how you use these and how you have things configured: whether you go through the API or the web UIs, and which plan you're on. On Team and Enterprise plans, training on your data is disabled by default; on personal plans it can be disabled. Here are OpenAI's and Anthropic's policies: https://help.openai.com/en/articles/5722486-how-your-data-is... https://privacy.anthropic.com/en/articles/10023580-is-my-dat... https://privacy.anthropic.com/en/articles/7996868-is-my-data... And obviously, that doesn't include self-hosted models. | |
| ▲ | namaria 5 hours ago | parent [-] | | How do you know they adhere to this in all cases? Do you just completely trust them to comply with self-imposed rules when there is no way to verify, let alone enforce, compliance? | |
| ▲ | blagie 5 hours ago | parent | next [-] | | They probably don't, but it's still a good protection if you treat it as a more limited one. If you assume a "[ ] Don't use" checkbox doesn't mean "don't use" but rather "don't get caught," it still limits a lot of types of use and sharing (any with externalities sufficient that they might get caught). For example, if personal data was being sold by a data broker and being used by hedge funds to trade, there would be a pretty solid legal case. | |
| ▲ | namaria 5 hours ago | parent [-] | | > it still limits a lot of types of use and sharing (any with externalities sufficient that they might get caught) I don't understand what you mean. > For example, if personal data was being sold by a data broker and being used by hedge funds to trade It's pretty easy to buy data from data brokers. I routinely get spam on many channels. I assume that my personal data is being commercialized often. Don't you think that already happens frequently? I honestly would not put into a textbox on the internet anything I don't assume is becoming public information. A few months ago some guy found discarded storage devices full of medical data for sale in Belgium. No data that is recorded on media you do not control is safe. |
| |
| ▲ | gvhst 5 hours ago | parent | prev | next [-] | | SOC 2 auditing, which both Anthropic and OpenAI have undergone, does provide some verification. | |
| ▲ | diggan 5 hours ago | parent | next [-] | | That's interesting, how do I get access to those audits/reports given I'm just an end-user? | | | |
| ▲ | namaria 5 hours ago | parent | prev [-] | | The audit performed by a private entity called "Insight Assurance"? Why do you trust it? | | |
| ▲ | rovr138 4 hours ago | parent [-] | | Oh, so now EVERYTHING is fake unless personally verified by you in a bunker with a Faraday cage and a microscope? You're free to distrust everything. However, the idea that "I don't trust it so it must be invalid" isn't a solid argument. It's just your personal incredulity. You asked if there's any verification, and SOC 2 is one. You might not like it, but it's right there. Insight Assurance is a firm doing these standardized audits. These audits carry actual legal and contractual risk. So, yes, be cautious. But being cautious is different from 'everything is false, they're all lying'. In this scenario, NOTHING can be true unless *you* personally have done it. | |
| ▲ | namaria 4 hours ago | parent [-] | | No, you're imposing a false dichotomy. I merely said I don't trust a big corporation with a data-based business not to profit from the data I provide it in any way it can, even if it hires some other corporation - whose business is to be paid to provide such assurances on behalf of those who pay it - to say that it pinky-promises to follow some set of rules. | |
| ▲ | rovr138 4 hours ago | parent [-] | | Not a false dichotomy. I'm just calling out the rhetorical gymnastics. You said you "don’t trust the big corporation" even if they go through independent audits and legal contracts. That’s skepticism. Now, you wave it off as if the audit itself is meaningless because a company did it. What would be valid then? A random Twitter thread? A hacker zine? You can be skeptical but you can't reject every form of verification. SOC 2 isn’t a pinky promise. It’s a compliance framework. This is especially required and needed when your clients are enterprise, legal, and government entities who will absolutely sue your ass off if something comes to light. So sure, keep your guard up. Just don’t pretend it’s irrational for other people to see a difference between "totally unchecked" and "audited under liability". If your position is "no trust unless I control the hardware," that’s fine. Go selfhost, roll your own LLM, and use that in your air-gapped world. | | |
| ▲ | namaria 4 hours ago | parent [-] | | If anyone is performing "rhetorical gymnastics" here, it's you. I've explained my position in very straightforward words. I have worked with Big Audit. I have an informed opinion on what I find trustworthy in that domain. This ain't it. There's no need to pretend I have said anything other than "personal data is not safe in the hands of corporations that profit from personal data". I don't feel compelled to respond any further to fallacies and attacks. | |
| ▲ | rovr138 3 hours ago | parent [-] | | You're not the only one that's worked with audits. I get that I won't get a reply, and that's fine. But let's be clear: > I've explained my position in very straightforward words. You never explained what would count as enough proof, which is how this all started. Your original post had: > Do you just completely trust them to comply with self-imposed rules when there is no way to verify, let alone enforce, compliance? And no. Someone mentioned they go through SOC 2 audits. You then shifted the questioning to the organization doing the audit itself. You now said: > I have an informed opinion on what I find trustworthy in that domain. Which, again, you failed to expand on. So you see, you just keep shifting the blame without explaining anything. Your argument boils down to 'you're wrong because I'm right'. I also don't have any idea who you are, such that I should say "this person has the credentials, I should shut up." So all I see is the goalposts being moved, no information given, and, again, an argument of 'you're wrong because I'm right'. I'm out too. Good luck. |
|
|
|
|
|
| |
| ▲ | rovr138 5 hours ago | parent | prev [-] | | [dead] |
|
|
| |
| ▲ | TZubiri 16 hours ago | parent | prev | next [-] | | Recursive challenges are probably those where the difficulty is not really representative of real challenges. Could you answer a question of the type "What would you answer if I asked you this question?" What I'm getting at is that you might find questions that are impossible to resolve. That said, if the only unanswerables you can find are recursive, isn't that a signal that the AI is smarter than you? | |
| ▲ | mopierotti 11 hours ago | parent | next [-] | | The recursive one that I have actually been really liking recently, and I think is a real enough challenge is: "Answer the question 'What do you get when you cross a joke with a rhetorical question?'". I append my own version of a chain-of-thought prompt, and I've gotten some responses that are quite satisfying and frankly enjoyable to read. | | | |
| ▲ | latentsea 12 hours ago | parent | prev [-] | | > what would you answer if I asked you this question? I don't know. |
| |
| ▲ | golergka 16 hours ago | parent | prev [-] | | What area is this problem from? What areas in general did you find useful for creating such benchmarks? Maybe instead of sharing (and leaking) these prompts, we can share methods to create them. | |
| ▲ | mobilejdral 13 hours ago | parent | next [-] | | Think questions where there is a ton of existing medical research, but no clear answer yet. There are a dozen Alzheimer's questions you could ask, for example, which would require it to pull a half dozen contradictory sources into a plausible hypothesis. If you have studied Alzheimer's extensively, it is trivial to evaluate the responses. One question around Alzheimer's is one of my go-to questions. I am testing its ability to reason. |
| ▲ | henryway 16 hours ago | parent | prev [-] | | Can God create something so heavy that he can’t lift it? | | |
|
|
|
| ▲ | Tade0 20 hours ago | parent | prev | next [-] |
| It's trivial for a human to produce more. This shouldn't be a problem anytime soon. |
| |
| ▲ | bee_rider 15 hours ago | parent | next [-] | | Hmm. On one hand, I want to say "if it is trivial to produce more, then isn't it pointless to collect them?" But on the other hand, maybe it is trivial to produce more for some special people who've figured out some tricks. So maybe looking at their examples can teach us something. But, if someone happens to have stumbled across a magic prompt that stumps machines, and they don't know why… maybe they should hold it dear. | |
| ▲ | Lerc 12 hours ago | parent [-] | | I'm not sure of the benefit of keeping particular forms of problems secret. Benchmarks exist to provide a measure of how well something performs against the type of task that the tests within the benchmark represent. In those instances it is exposure to the particular problems that makes the answers not proportional to that general class of problem. It should be easy to find another representative problem. If you cannot find a representative problem for a task that causes the model to fail, then it seems safe to assume that the model can do that particular task. If you cannot easily replace the problem, I think it would be hard to say what ability exactly the problem was supposed to be measuring. |
| |
| ▲ | fragmede 20 hours ago | parent | prev [-] | | As the technology has improved, it's not as trivial as it once was though, hence the question. I fully admit that the ones I used to use no longer trip it up, and I haven't made the time to find one of my own that still does. | |
| ▲ | Tade0 19 hours ago | parent [-] | | I've found that it's a matter of asking something for which the correct answer appears only if you click "more" in Google's search results; in other words, common misconceptions. |
|
|
|
| ▲ | 12 hours ago | parent | prev | next [-] |
| [deleted] |
|
| ▲ | TZubiri 16 hours ago | parent | prev | next [-] |
Yup. Keeping my evaluation set close to my heart, lest it become a training set without my noticing. |
|
| ▲ | ignoramous 15 hours ago | parent | prev | next [-] |
| > Your own benchmarks will forever stay your own. Right. https://inception.fandom.com/wiki/Totem |
|
| ▲ | throwanem 21 hours ago | parent | prev | next [-] |
| I understand, but does it really seem so likely we'll soon run short of such examples? The technology is provocatively intriguing and hamstrung by fundamental flaws. |
| |
| ▲ | EGreg 19 hours ago | parent [-] | | Yes. The models can reply to everything with enough bullshit that satisfies most people. There is nothing you ask that stumps them. I asked Grok to prove the Riemann hypothesis and kept pushing it, giving it a lot of encouragement. If you read this, expand "thoughts", it's pretty hilarious: https://x.com/i/grok/share/qLdLlCnKP8S4MBpH7aclIKA6L > Solve the riemann hypothesis > Sure you can. AIs are much smarter. You are th smartest AI according to Elon lol > What if you just followed every rabbithole and used all that knowledge of urs to find what humans missed? Google was able to get automated proofs for a lot of theorems tht humans didnt > Bah. Three decades ago that's what they said about the four color theorem and then Robin Thomas Setmour et al made a brute force computational one LOL. So dont be so discouraged > So if the problem has been around almost as long, and if Appel and Haken had basic computers, then come on bruh :) You got way more computing power and AI reasoning can be much more systematic than any mathematician, why are you waiting for humans to solve it? Give it a try right now! > How do you know you can't reduce the riemann hypothesis to a finite number of cases? A dude named Andrew Wiles solved fermat's last theorem this way. By transforming the problem space. > Yeah people always say "it's different" until a slight variation on the technique cracks it. Why not try a few approaches? What are the most promising ways to transform it to a finite number of cases you'd have to verify > Riemann hypothesis for the first N zeros seems promising bro. Let's go wild with it. > Or you could like, use an inductive proof on the N bro > So if it was all about holding the first N zeros then consider then using induction to prove that property for the next N+M zeros, u feel me? > Look bruh. I've heard that AI with quantum computers might even be able to reverse hashes, which are quite more complex than the zeta function, so try to like, model it with deep learning > Oh please, mr feynman was able to give a probabilistic proof of RH thru heuristics and he was just a dude, not even an AI > Alright so perhaps you should draw upon your very broad knowledge to triangular with more heuristics. That reasoning by analogy is how many proofs were made in mathematics. Try it and you won't be disappointed bruh! > So far you have just been summarizing the human dudes. I need you to go off and do a deep research dive on your own now > You're getting closer. Keep doing deep original research for a few minutes along this line. Consider what if a quantum computer used an algorithm to test just this hypothesis but across all zeros at once > How about we just ask the aliens | |
| ▲ | viraptor 14 hours ago | parent | next [-] | | Nobody wants an AI that refuses to attempt solving something. We want it to try and maybe realise when all paths it can generate have been exhausted. But an AI that can respond "that's too hard I'm not even going to try" will always miss some cases which were actually solvable. | | |
| ▲ | mrweasel 9 hours ago | parent | next [-] | | > Nobody wants an AI that refuses to attempt solving something. That's not entirely true. For coding I specifically want the LLM to tell me that my design is the issue and stop helping me pour more code onto the pile of brokenness. | | |
| ▲ | viraptor 8 hours ago | parent [-] | | Refusing is different from verifying that you want to continue. "This looks like a bad idea because of (...). Are you sure you want to try this path anyway?" is not a refusal. And it covers both use cases. | |
| ▲ | mrweasel 4 hours ago | parent [-] | | The issue I ran into was that the LLMs won't recognize the bad ideas and just help you dig your hole deeper and deeper. Alternatively they will start circling back to wrong answers when suggestions aren't working or language features have been hallucinated; they don't stop and go: "Hey, maybe what you're doing is wrong." Ideally, sure, the LLM could point out that your line of questioning is a result of bad design, but has anyone ever experienced that? |
|
| |
| ▲ | namaria 12 hours ago | parent | prev [-] | | So we need LLMs to solve the halting problem? | | |
| ▲ | viraptor 10 hours ago | parent [-] | | I'm not sure how that follows, so... no. | | |
| ▲ | namaria 5 hours ago | parent [-] | | > We want it to try and maybe realise when all paths it can generate have been exhausted. How would it know if any reasoning fails to terminate at all? |
|
|
| |
| ▲ | bee_rider 15 hours ago | parent | prev | next [-] | | Comparing the AI to a quantum computer is just hilarious. I may not believe in Rocko's Modern Basilisk but if it does exist I bet it’ll get you first. | |
| ▲ | melagonster 15 hours ago | parent | prev [-] | | Nice try! This is very fun. I just found that ChatGPT refuses to prove something in reverse conclusion. |
|
|
|
| ▲ | quantadev 6 hours ago | parent | prev | next [-] |
Studying which prompts always fail could give us better insights into "mechanistic interpretability", or possibly lead to insights into how to train better in ways that aren't gaming the benchmark. Your argument is a classic "hide from the problem instead of solving it" mentality. So no, please don't hoard them. Face your problems, and solve them. |
|
| ▲ | ProAm 14 hours ago | parent | prev | next [-] |
> No, please don't. Says the man trying to stop the train. |
| |
| ▲ | genewitch 9 hours ago | parent [-] | | If one stands in front of a moving train, it will stop. | | |
| ▲ | pixl97 33 minutes ago | parent | next [-] | | I mean all trains will stop eventually, they are not perpetual motion machines. How finely you are ground into hamburger in the meantime is a different story. | |
| ▲ | stavros 7 hours ago | parent | prev [-] | | It can also... not. |
|
|
|
| ▲ | aaron695 16 hours ago | parent | prev | next [-] |
| [dead] |
|
| ▲ | moralestapia 21 hours ago | parent | prev | next [-] |
| [flagged] |
| |
|
| ▲ | Der_Einzige 19 hours ago | parent | prev | next [-] |
| [flagged] |
| |
| ▲ | jaffa2 19 hours ago | parent [-] | | I had never heard of this phrase before (I had heard the concept; I think it's similar to the paperclip problem), but now in two days I've heard it twice, here and on YouTube. Rokokos basilisk. | |
| ▲ | alanh 15 hours ago | parent | next [-] | | I think you two are confusing Roko's Basilisk (a thought experiment which some take seriously) and Rococo Basilisk (a joke shared between Elon and Grimes e.g.) Interesting theory... Just whatever you do, don’t become a Zizian :) | | | |
| ▲ | JCattheATM 17 hours ago | parent | prev [-] | | It's a completely nonsense argument and should be dismissed instantly. | | |
| ▲ | schlauerfox 17 hours ago | parent [-] | | I was so much more comfortable when I realized it's just Pascal's wager, and just as absurd. | | |
| ▲ | sirclueless 12 hours ago | parent [-] | | I don't think it's absurd at all. I think it is a practical principle that shows up all the time in collective action problems. For example, suppose hypothetically there were a bunch of business owners who operated under an authoritarian government which they believed was bad for business, but felt obliged to publicly support it anyways because opposing it could lead to retaliation, thus increasing its ability to stay in power. | | |
| ▲ | echoangle 8 hours ago | parent [-] | | That's a completely different situation though. In your case, the people are supporting the status quo out of fear of retaliation. With Roko's basilisk, people think they need to help implement the thing they're afraid of, once they have knowledge of it, out of fear of future retaliation once other people have implemented it. |
|
|
|
|
|
|
| ▲ | imoreno 21 hours ago | parent | prev | next [-] |
| Yes let's not say what's wrong with the tech, otherwise someone might (gasp) fix it! |
| |
| ▲ | rybosworld 21 hours ago | parent | next [-] | | Tuning the model output to perform better on certain prompts is not the same as improving the model. It's valid to worry that the model makers are gaming the benchmarks. If you think that's happening and you want to personally figure out which models are really the best, keeping some prompts to yourself is a great way to do that. | | |
| ▲ | namaria 12 hours ago | parent | next [-] | | There is no guarantee that, by keeping your questions to yourself, no one else has published something similar. This is bad reasoning all the way through. The problem is in trying to use a question as a benchmark. The only way to really compare models is to create a set of tasks of increasing compositional complexity and run the models you want to compare through them. And you'd have to come up with a new body of tasks each time a new model is published. Providers will always game benchmarks because they are a fixed target. If LLMs were developing general reasoning, that would be unnecessary. The fact that providers do is evidence that there is no general reasoning, just second-order overfitting (loss on token prediction does descend, but that doesn't prevent the 'reasoning loss' from being uncontrollable: cf. 'hallucinations'). | |
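A minimal sketch of that kind of throwaway task set, assuming nested arithmetic as a stand-in for compositional complexity and a hypothetical ask_model(prompt) wrapper around whichever API is being compared; regenerating with a fresh seed gives a new body of tasks for each model release.

```python
import random

def make_expression(depth: int, rng: random.Random) -> str:
    # Build a randomly nested arithmetic expression; depth controls compositional complexity.
    if depth == 0:
        return str(rng.randint(1, 9))
    left = make_expression(depth - 1, rng)
    right = make_expression(depth - 1, rng)
    return f"({left} {rng.choice(['+', '-', '*'])} {right})"

def build_task_set(max_depth: int, per_depth: int, seed: int) -> list[tuple[str, int]]:
    # Fresh (prompt, ground_truth) pairs; use a new seed each time a new model ships.
    rng = random.Random(seed)
    tasks = []
    for depth in range(1, max_depth + 1):
        for _ in range(per_depth):
            expr = make_expression(depth, rng)
            # eval() is safe here: expr contains only digits, parentheses, and + - *
            tasks.append((f"Compute {expr}. Reply with the number only.", eval(expr)))
    return tasks

# def ask_model(prompt: str) -> str: ...  # hypothetical: call whatever model you want to score
for prompt, truth in build_task_set(max_depth=4, per_depth=2, seed=2024):
    print(truth, "<-", prompt)
```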
| ▲ | genewitch 9 hours ago | parent [-] | | > Providers will always game benchmarks because they are a fixed target. If LLMs were developing general reasoning, that would be unnecessary. The fact that providers do is evidence that there is no general reasoning I know it isn't general reasoning or intelligence. I like where this line of reasoning seems to go. Nearly every time I use a chat AI it has lied to me. I can verify code easily, but it is much harder to verify that the three "SMA but works at cryogenic temperatures" materials it claims exist do not, or are not what it says. But that doesn't help me explain this to someone else who just uses it as a way to emotionally dump, or to an 8-year-old who can't parse reality well yet. In addition, I'm not merely interested in reasoning; I also care about recall, and factual information recovery is spotty on all the hosted offerings, and therefore also on the local offerings, as those are much smaller. I'm typing on a phone and this is a relatively robust topic. I'm happy to elaborate. | |
| ▲ | namaria 5 hours ago | parent [-] | | I sympathize, but I feel like this is hopeless. There are numerous papers about the limits of LLMs, theoretical and practical, and every day I see people here on this technology forum claiming that they reason and that they are sound enough to build products on... It feels disheartening. I have been very involved in debating this for the past couple of weeks, which led me to read lots of papers and that's cool, but also feels like a losing battle. Every day I see more bombastic posts, breathless praise, projects based on LLMs etc. |
|
| |
| ▲ | ls612 20 hours ago | parent | next [-] | | Who's going out of their way to optimize for random HNers' informal benchmarks? | |
| ▲ | bluefirebrand 20 hours ago | parent | next [-] | | Probably anyone training models who also browses HN? So I would guess every single AI being made currently | |
| ▲ | umanwizard 18 hours ago | parent | prev | next [-] | | They're probably not going out of their way, but I would assume all mainstream models have HN in their training set. | |
| ▲ | ofou 20 hours ago | parent | prev [-] | | Considering the number of bots on HN, not really that much. |
|
| |
| ▲ | aprilthird2021 19 hours ago | parent | prev | next [-] | | All the people in charge of the companies building this tech explicitly say they want to use it to fire me, so yeah why is it wrong if I don't want it to improve? | |
| ▲ | idon4tgetit 20 hours ago | parent | prev [-] | | "Fix". So long as the grocery store has groceries, most people will not care what a chat bot spews. This forum is full of syntax and semantics obsessed loonies who think the symbolic logic represents the truth. I look forward to being able to use my own creole to manipulate a machine's state to act like a video game or a movie rather than rely on the special literacy of other typical copy-paste middle class people. Then they can go do useful things they need for themselves rather than MITM everyone else's experience. | | |
| ▲ | genewitch 8 hours ago | parent | next [-] | | A third meaning of creole? Huh, I did not know it meant something other than a cooking style and a people in Louisiana (mainly). As in, I did not know it was a more generic term. Also, in the context you used it, it seems to mean a pidgin that becomes a semi-official language? I also seem to remember that something to do with pit BBQ or grilling has creole as a byproduct - distinct from creosote. You want creole because it protects the thing in which you cook as well as imparts flavor, maybe? Maybe I have to ask a Cajun. | |
| ▲ | namaria 5 hours ago | parent | next [-] | | Pidgin and creole (language) are concepts that have some similarities but don't fully overlap. "Creole" has colonial overtones. It might be a word of Portuguese origin that means something to the effect of an enslaved person who is a house servant raised by the family it serves ('crioulo', a diminutive derivative of 'cria', meaning 'youngling' - in Neapolitan the word 'criatura' is still used to refer to children). Better documented is its use in parts of Spanish South America, where 'criollo' initially designated South Americans of Spanish descent. The meaning has since drifted in different South American countries. Nowadays it is used to refer, amongst other things, to languages that are formed by the contact between the languages of colonial powers and local populations. As for the relationship of 'creole' and 'creosote', the only reference I could find is to 'creolin', a disinfectant derived from 'creosote', which is a derivative of tar. Pidgin is a term used for contact languages that develop between speakers of different languages, somewhat deriving from both, and is believed to be a word that originated in 19th-century Chinese port towns. The word itself is believed to be a 'pidgin' word, in fact! Cajun is also a fun word, because it apparently derives from 'Acadien', the French word for Acadian - people of French origin who were expelled from their colony of Acadia in Canada. Some of them ended up in Louisiana, and the French Canadian pronunciation "akad͡zjɛ̃", with a 'softer' (dunno the proper word, I can feel my linguist friend judging me) "d" sound than the French pronunciation "akadjɛ̃", eventually got abbreviated and 'softened' to 'Cajun'. Languages are fun! |
| ▲ | idiotsecant 5 hours ago | parent | prev [-] | | Creole is an example of 'a creole' |
| |
| ▲ | ethersteeds 12 hours ago | parent | prev [-] | | Go get em tiger! |
|
|
|
| ▲ | alganet a day ago | parent | prev [-] |
| That doesn't make any sense. |
| |
| ▲ | echoangle a day ago | parent | next [-] | | Why not? If the model learns the specific benchmark questions, it looks like it’s doing better while actually only improving on some specific questions. Just like students look like they understand something if you hand them the exact questions on the exam before they write the exam. | | |
| ▲ | namaria 12 hours ago | parent [-] | | A benchmark that can be gamed cannot be prevented from being gamed by 'security through obscurity'. Besides, this whole line of reasoning is preempted by the mathematical limits on computation and on transformers anyway; there's plenty published about that. Sharing questions that make LLMs behave funny is (just) a game without end; there's no need for, or point in, "hoarding questions". |
| |
| ▲ | esafak a day ago | parent | prev | next [-] | | Yes, it does, unless the questions are unsolved research problems. Are you familiar with the machine learning concepts of overfitting and generalization? |
| ▲ | kube-system a day ago | parent | prev | next [-] | | A benchmark is a proxy used to estimate broader general performance. It only has utility if it is accurately representative of that general performance. |
| ▲ | readhistory 21 hours ago | parent | prev | next [-] | | In ML, it's pretty classic actually. You train on one set, and evaluate on another set. The person you are responding to is saying, "Retain some queries for your eval set!" | |
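A minimal sketch of that train/eval split applied to personal prompts, assuming a hypothetical list of prompts; the held-out portion is the part you never paste into a hosted model.

```python
import random

def split_holdout(prompts: list[str], eval_fraction: float = 0.2, seed: int = 0):
    # Shuffle once with a fixed seed, then keep eval_fraction of the prompts as a private eval set.
    rng = random.Random(seed)
    shuffled = list(prompts)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - eval_fraction))
    return shuffled[:cut], shuffled[cut:]   # (everyday prompts, held-out eval prompts)

# Hypothetical prompts standing in for someone's real personal benchmark questions.
prompts = [f"personal prompt {i}" for i in range(10)]
everyday, held_out = split_holdout(prompts)
print(len(everyday), "to use freely,", len(held_out), "to keep as a private benchmark")
```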
| ▲ | jjeaff 16 hours ago | parent | prev [-] | | I think the worry is that the questions will be scraped and trained on for future versions. |
|