| ▲ | M4v3R 3 days ago |
| Ok, I’m a bit underwhelmed. I’ve asked it a fairly technical question about a very niche topic (Final Fantasy VII reverse engineering): https://chatgpt.com/share/68001766-92c8-8004-908f-fb185b7549... With the right knowledge and web searches one can answer this question in a matter of minutes at most. The model fumbled around modding forums and other sites and did manage to find some good information, but then started to hallucinate some details and used them in the further research. The end result it gave me was incorrect, and the steps it described to get the value were totally fabricated. What’s even worse, in the thinking trace it looks like it is aware it does not have an answer and that the 399 is just an estimate, yet in the answer itself it confidently states it found the correct value. Essentially, it concealed the fact that it doesn’t really know and gave me an estimate without telling me. Now, I’m perfectly aware that this is a very niche topic, but at this point I expect the AI to either find me a good answer or tell me it couldn’t do it. Not to lie to my face. Edit: Turns out it’s not just me: https://x.com/transluceai/status/1912552046269771985?s=46 |
|
| ▲ | int_19h 3 days ago | parent | next [-] |
| Compare to Gemini Pro 2.5: https://g.co/gemini/share/c8fb1c9795e4 Of note, the final step in the CoT is: > Formulate Conclusion: Since a definitive list or count isn't readily available through standard web searches, the best approach is to: state that an exact count is difficult to ascertain from readily available online sources without direct analysis of game files ... avoid giving a specific number, as none was reliably found across multiple sources. and then the response is in line with that. |
| |
| ▲ | M4v3R 3 days ago | parent [-] | | I like this answer. It does mention the correct, definitive way of getting the information I want (extracting the char.lgp data file), and so even though it gave up, it pushes you in the right direction, whereas o3/o4 just make up stuff. |
|
|
| ▲ | werdnapk 3 days ago | parent | prev | next [-] |
| I've used AI with "niche" programming questions and it's always a total letdown. I truly don't understand this "vibe coding" movement unless everyone is building todo apps. |
| |
| ▲ | SkyPuncher 3 days ago | parent | next [-] | | There's a bit of a skill to it. Good architecture plans help. Telling it where in an existing code base it can find things to pattern match against is also fantastic. I'll often end up with a task that looks something like this: * Implement Foo with a relation to FooBar. * Foo should have X, Y, Z features * We have an existing pattern for Fidget in BigFidget. Look at that for implementation * Make sure you account for A, B, C. Check Widget for something similar. It works surprisingly well. | | |
| ▲ | motorest 3 days ago | parent | next [-] | | > Good architecture plans help. This is the key answer right here. LLMs are great at interpolating and extrapolating based on context. Interpolating is far less error-prone. The problem with interpolating is that you need to start with accurate points so that interpolating between them leads to expected and relatively accurate estimates. What we are seeing is the result of developers being oblivious to the higher-level aspects of coding, such as software architecture, proper naming conventions, disciplined choice of dependencies and dependency management, and even best practices. Even basic requirements-gathering. Their own personal experience is limited to diving into existing code bases and patching them here and there. They often screw up the existing software architecture because their lack of insight and awareness leads them to post PRs that get the job done at the expense of polluting the whole codebase into an unmanageable mess. So these developers crack open an LLM and prompt it to generate code. They use their insights and personal experience to guide their prompts. Their experience reflects what they do on a daily basis. The LLMs of course generate code from their prompts, and the result is underwhelming. Garbage in, garbage out. It's the LLM's fault, right? All the vibe coders out there showcasing good results must be frauds. The telltale sign of how poor these developers are is how they pin their failure to get LLMs to generate acceptable results on the models not being good enough. The same models that are proven effective at creating whole projects from scratch are, in their hands, incapable of the smallest changes. It's weird how that sounds, right? If only the models were better... Better at what? At navigating through your input to achieve things that others already achieve? That's certainly the model's fault, isn't it? A bad workman always blames his tools. | | |
| ▲ | hansmayer 2 days ago | parent [-] | | Yes, with a bit of work around prompting and focusing on closed context, or as you put it, interpolating, you can get further. But the problem is that this is not how LLMs were sold. If you blame someone for trying to use it by specifying fairly high-level prompts - well, isn't that exactly how this technology was being advertised the whole time? The problem is not the bad workman, the problem is that the tool is not doing what it is advertised as doing. | | |
| ▲ | motorest 2 days ago | parent [-] | | > But the problem is that this is not how LLMs were sold. No one cares about promises. The only things that matter are the tangibles we have right now. Right now we have a class of tools that help us write multidisciplinary apps with a few well-crafted prompts and zero code involved. |
|
| |
| ▲ | extr 3 days ago | parent | prev [-] | | Yeah, this is a great summary of what I do as well, and I find it very effective. I think of hands-off AI coding like you're directing a movie. You have a rough image of what "good" looks like in your head, and you're trying to articulate it with enough detail to all the stagehands and actors such that they can realize the vision. The models can always get there with enough coaching; the question has traditionally been whether that's worth the trouble versus just doing it yourself. Increasingly I find that AI at this point is good enough that I am rarely stepping in to "do it myself". |
| |
| ▲ | hatefulmoron 3 days ago | parent | prev | next [-] | | It's incredible when I ask Claude 3.7 a question about Typescript/Python and it can generate hundreds of lines of code that are pretty on point (it's usually not exactly correct on first prompt, but it's coherent). I've recently been asking questions about Dafny and Lean -- it's frustrating that it will completely make up syntax and features that don't exist, but still speak to me with the same confidence as when it's talking about Typescript. It's possible that shoving lots of documentation or a book about the language into the context would help (I haven't tried), but I'm not sure if it would make up for the model's lack of "intuition" about the subject. | | |
| ▲ | mhitza 3 days ago | parent [-] | | Don't need to go that esoteric. Seen them make stuff up pretty often for more common functional programming languages like Haskell and OCaml. | | |
| ▲ | greenavocado 3 days ago | parent | next [-] | | Recommend using RAG for this. Make the Haskell or OCaml documentation your knowledge base and index it for RAG. Then it makes a heck of a lot more sense! | | |
| ▲ | rashkov 3 days ago | parent [-] | | How does one do that? As far as I can tell, neither the Claude nor the ChatGPT web clients support this. Is there a third-party tool that people are using? | |
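For the curious, a minimal sketch of what indexing documentation for RAG can look like, using Chroma as the vector store; the paths, chunk size, and collection name below are illustrative assumptions, not a specific tool from the thread:

    # Minimal RAG indexing sketch; assumes the docs are already dumped to plain-text files.
    from pathlib import Path

    import chromadb  # pip install chromadb

    client = chromadb.Client()  # in-memory; chromadb.PersistentClient(path=...) keeps the index on disk
    collection = client.create_collection(name="ocaml-docs")

    def chunks(text, size=1000):
        # Naive fixed-size chunking; real setups usually split on headings or code blocks.
        for i in range(0, len(text), size):
            yield text[i:i + size]

    for doc in Path("ocaml-manual").glob("*.txt"):  # hypothetical dump of the manual
        for n, chunk in enumerate(chunks(doc.read_text(errors="ignore"))):
            collection.add(ids=[f"{doc.name}-{n}"], documents=[chunk])

    # At query time: retrieve the most relevant chunks and paste them into the prompt.
    question = "How do I pattern match on a polymorphic variant?"
    hits = collection.query(query_texts=[question], n_results=3)
    context = "\n\n".join(hits["documents"][0])
    prompt = f"Answer using only this documentation:\n\n{context}\n\nQuestion: {question}"
    print(prompt)  # send this to whichever model you're using

The web clients don't expose this directly; it's typically done by a local script or an IDE/agent integration - the retrieved chunks get pasted into the prompt so the model answers from the real docs instead of invented syntax.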
| |
| ▲ | Foobar8568 3 days ago | parent | prev [-] | | Well, all LLMs are fairly bad for React Native as soon as you look at more than hello-world type of things. I got stuck with different LLMs until I checked the official documentation; they were spouting nonsense about features removed 2+ years ago, I suppose, or just making stuff up. |
|
| |
| ▲ | mikepurvis 3 days ago | parent | prev | next [-] | | I'm trialing Copilot in VSCode and it's a mixed bag. Certain things it pops out great, but a lot of times I'll be like woohoo! <tab> <tab> <tab> and then end up immediately realising wait a sec, none of this is actually needed, or it's just explicitly calling for things that are already default values, or whatever. (This is particularly in the context of metadata-type stuff: pyproject files, Ansible playbooks, Dockerfiles, etc.) | |
| ▲ | chaboud 3 days ago | parent | prev | next [-] | | I recently exclaimed that “vibe coding is BS” to one of my coworkers before explaining that I’ve actually been using GPT, Claude, llama (for airplanes), Cline, Cursor, Windsurf, and more for coding for as long as they’ve been available (more recently playing with Gemini). Cline + Sonnet 3.7 has been giving me great results on smaller projects with popular languages, and I feel truly fortunate to have AWS Bedrock on tap to drive this stuff (no effective throttling/availability limits for an individual dev). Even llama + Continue has proven workable (though it will absolutely hallucinate language features and APIs). That said, 100% pure vibe coding is, as far as I can tell, still very much BS. The subtle ugliness that can come out of purely prompt-coded projects is truly a rat hole of hate, and results can get truly explosive when context windows saturate. Thoughtful, well-crafted architectural boundaries and protocols call for forethought and presence of mind that isn’t yet emerging from generative systems. So spend your time on that stuff and let the robots fill in the boilerplate. The edges of capability are going to keep moving/growing, but it’s already a force multiplier if you can figure out ways to operate. For reference, I’ve used various degrees of assistance for color transforms, computer vision, CNN network training for novel data, and several hundred smaller problems. Even if I know how to solve a problem, I generally run it through 2-3 models to see how they’ll perform. Sometimes they teach me something. Sometimes they violently implode, which teaches me something else. | | |
| ▲ | motorest 3 days ago | parent [-] | | > That said, 100% pure vibe coding is, as far as I can tell, still very much BS. I don't really agree. There's certainly a showboating factor, not to mention there is currently a gold rush to tap this movement and capitalize on it. However, I personally managed to create a fully functioning web app from scratch with Copilot + VS Code using a mix of GPT-4 and o1-mini. I'm talking about both backend and frontend, with basic auth in place. I am by no means an expert, but I did it in an afternoon. Call it BS, but the truth of the matter is that the app exists. | |
| ▲ | saberience 2 days ago | parent [-] | | People were making a frontend and backend web app in half a day using Ruby on Rails way before LLMs were ever a thing, and their code quality was still much better than yours! So with vibe coding, sure, you can create some shitty thing which WORKS, but once it becomes bigger than a small shitty thing, it becomes harder and harder to work with because the code is so terrible when you're pure vibe coding. | | |
| ▲ | motorest 2 days ago | parent [-] | | > People were making a frontend and backend web app in half a day using Ruby on Rails way before LLMs were ever a thing, and their code quality was still much better than yours! A few people were doing that. With LLMs, anyone can do that. And more. It's important to frame the scenario correctly. I repeat: I created everything in an afternoon just for giggles, and I challenged myself to write zero lines of code. > So with vibe coding, sure, you can create some shitty thing which WORKS (...) You're somehow blindly labelling a hypothetical output as "shitty", which only serves to show your bias. In the meantime, anyone who is able to churn out a half-functioning MVP in an afternoon is praised as a 10x developer. There's a contrast in there, where the same output is described as shitty or outstanding depending on who does it. |
|
|
| |
| ▲ | ecocentrik 2 days ago | parent | prev | next [-] | | People who embrace vibe coding are probably the same people who were already pseudo-vibe coding to begin with, using found fragments of code they could piece together to make things sort of work for simple tasks. | |
| ▲ | killerdhmo 3 days ago | parent | prev | next [-] | | I mean, I don't think you need to do cutting-edge programming to make something personal to you. See this, from Canva's product: https://youtu.be/LupwvXsOQqs?t=2366 | |
| ▲ | motorest 3 days ago | parent | prev [-] | | > I've used AI with "niche" programming questions and it's always a total letdown. That's perfectly fine. It just means you tried without putting in any effort and failed to get results that were aligned with your expectations. I'm also disappointed when I can't dunk or hit >50% of my 3pt shots, but then again I never played basketball competitively. > I truly don't understand this "vibe coding" movement unless everyone is building todo apps. Yeah, I also don't understand the NBA. Every single one of those players shows themselves dunking and jumping over cars and having almost perfect percentages in 3pt shots during practice, whereas I can barely get off my chair. The problem is certainly basketball. |
|
|
| ▲ | lend000 3 days ago | parent | prev | next [-] |
| I imagine that after GPT-4 / o1, improvements on benchmarks have been increasingly a result of overfitting: those breakthrough models already used most of the high-quality training data available on the internet, there haven't been any dramatic architectural changes, we are already melting the world's GPUs, and there simply isn't enough new, high-quality data being generated (orders of magnitude more than what was already used on older models) to enable breakthrough improvements. What I'd really like to see is the model development companies improving their guardrails so that they are less concerned about doing something offensive or controversial and more concerned about conveying their level of confidence in an answer, i.e. saying "I don't know" every once in a while. Once we get a couple years of relative stagnation in AI models, I suspect this will become a huge selling point and you will start getting "defense grade", B2B-type models where accuracy is king. |
|
| ▲ | siva7 3 days ago | parent | prev | next [-] |
| It can imitate its creator. We reached AGI. |
| |
|
| ▲ | hirvi74 3 days ago | parent | prev | next [-] |
| Have you asked this same question to various other models out there in the wild? I am just curious if you have found some that performed better. I would ask some models myself, but I do not know the proper answer, so I would probably be gullible enough to believe whatever the various answers have in common. |
|
| ▲ | shultays 3 days ago | parent | prev | next [-] |
| AIs in general are definitely hallucinating a lot more when it comes to niche topics. It is funny how they are unable to say "I don't know" and just make things up to answer your questions. |
| |
| ▲ | felipeerias 3 days ago | parent [-] | | LLMs made me a lot more aware of leading questions. Tiny changes in how you frame the same query can generate predictably different answers as the LLM tries to guess at your underlying expectations. |
|
|
| ▲ | M4v3R 3 days ago | parent | prev | next [-] |
| Btw, I’ve also asked this question using Deep Research mode in ChatGPT and got the correct answer: https://chatgpt.com/share/68009a09-2778-8004-af40-4a8e7e812b... So maybe this is just too hard for a “non-research” mode. I’m still disappointed it lied to me instead of saying it couldn’t find an answer. |
|
| ▲ | tern 3 days ago | parent | prev | next [-] |
| What's the correct answer? Curious if it got it right the second time: https://chatgpt.com/share/68009f36-a068-800e-987e-e6aaf190ec... |
|
| ▲ | shmerl 3 days ago | parent | prev | next [-] |
| How would it ever know the answer it found is true and correct, though? It could just as well repeat some existing false answer that you didn't yet find on your own. That's not much better than hallucinating it, since you can't verify its truth without finding it independently anyway. |
| |
| ▲ | M4v3R 3 days ago | parent [-] | | I would be OK with having an answer and an explanation of how it got the answer, with a list of sources. And it does just that; the only problem is that both the answer and the explanation turn out to be fabrications once you double-check the sources. |
|
|
| ▲ | Davidzheng 3 days ago | parent | prev | next [-] |
| Underwhelmed compared with Gemini 2.5 Pro; however, it would've been impressive a month ago, I think. |
|
| ▲ | heavyset_go 2 days ago | parent | prev | next [-] |
| Same thing happened when I asked it a fairly simple question about dracut on Linux. If I had gone through with the changes it suggested, I wouldn't have had a bootable machine. |
|
| ▲ | yMEyUyNE1 3 days ago | parent | prev | next [-] |
| > Not to lie to my face. Are you saying that it deliberately lied to you? > With the right knowledge and web searches one can answer this question in a matter of minutes at most. Reminded me of the Dunning-Kruger curve, with the AI model at the first peak and you further along the curve. |
| |
| ▲ | M4v3R 2 days ago | parent [-] | | > Are you saying that it deliberately lied to you? Pretty much, yeah. Now, “deliberately” does imply some kind of agency or even consciousness, which I don’t believe these models have; it’s probably the result of overfitting, reward hacking, or some other issue from training, but the end result is that the model straight up misleads you knowingly (as in, the thinking trace is aware of the fact that it doesn’t know the answer, but it provides one anyway). |
|
|
| ▲ | mountainriver a day ago | parent | prev [-] |
| Oh boy, here come the “it didn’t work for this one specific thing I tried” posts. |
| |
| ▲ | dragonmost 14 hours ago | parent [-] | | But then how can you rely on it for things you don't know the answer to? The exercise just goes to show it still can't admit it doesn't know and lies instead. |
|