This reminds me of a story from 15 years ago, where I was developing a technology to download games on demand by hooking into the OS calls.

There was a particular game that was superslow when this tech was applied. Original game loading took around 15-20 seconds, whereas once the tech was applied it took easily 3-5 min, even with all data already downloaded.

When I started digging into it, I realized the reason was the game was using something like

   fread(data, 1, 65536, fptr);

instead of

   fread(data, 65536, 1, fptr);

Which basically expanded back in the day to 65k reads of 1 byte for several MB file. Each fread translated to 65k reads of ReadFile Windows API. Since my code was hooking the ReadFile system call, and my hook was heavier than ReadFile itself, game loading felt really slow. Unusable. It would not have been fun for players.

The easy fix was to swap the arguments for certain calls. The long fix required an internal cache to account for these cases, so that the hooked ReadFile was faster when the data was already on disk.
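
A minimal sketch of the kind of cache described above, not the original code. It assumes a slow backend (`slow_read`, a hypothetical stand-in for the hooked ReadFile path) and puts a 64 KiB buffer in front of it, so that a run of one-byte sequential reads triggers only one slow call per buffer refill. All names here (`slow_source`, `read_cache`, `cached_read`) are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_SIZE 65536

/* Hypothetical slow backend standing in for the hooked ReadFile path. */
struct slow_source {
    const unsigned char *data;
    size_t len;
    unsigned long calls;      /* number of slow reads issued */
};

/* Copies up to n bytes starting at offset; returns the bytes copied. */
static size_t slow_read(struct slow_source *s, size_t offset, void *dst, size_t n) {
    s->calls++;
    if (offset >= s->len) return 0;
    if (n > s->len - offset) n = s->len - offset;
    memcpy(dst, s->data + offset, n);
    return n;
}

struct read_cache {
    struct slow_source *src;
    unsigned char buf[CACHE_SIZE];
    size_t buf_start;         /* file offset of buf[0] */
    size_t buf_len;           /* number of valid bytes in buf */
    size_t pos;               /* current file position */
};

/* Serves reads from the buffer and refills it in CACHE_SIZE chunks. */
static size_t cached_read(struct read_cache *c, void *dst, size_t n) {
    unsigned char *out = dst;
    size_t done = 0;
    while (done < n) {
        /* Refill when the current position is outside the buffered range. */
        if (c->pos < c->buf_start || c->pos >= c->buf_start + c->buf_len) {
            c->buf_len = slow_read(c->src, c->pos, c->buf, CACHE_SIZE);
            c->buf_start = c->pos;
            if (c->buf_len == 0) break;   /* end of file */
        }
        size_t off = c->pos - c->buf_start;
        size_t chunk = c->buf_len - off;
        if (chunk > n - done) chunk = n - done;
        memcpy(out + done, c->buf + off, chunk);
        done += chunk;
        c->pos += chunk;
    }
    return done;
}
```

With a zero-initialized `read_cache`, reading a file one byte at a time costs one `slow_read` per 64 KiB of data plus a final call that detects end of file.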

Funny thing is that as we started rolling out the tech and applying it to more and more games, we realized lots of games did this. We went for the cache fix and games ended up loading faster than before. Honestly, games could have loaded all the data in a couple of seconds by just swapping the args. I'm guessing developers did this on purpose so that games seemed like they were loading a lot of stuff, although you never know.

▲ Taniwha 5 hours ago | parent | next [-]

I used to be a graphics card/chip architect for macs in the early/mid 90s - our chips were the fastest, but some programs were resistant because they did stupid stuff: pagemaker invalidated the font cache every time it went thru its main loop, quark with ATM did an n*2 thing every time it wrote text etc etc. We had special hardware to accelerate text drawing and it did nothing because the software pissed it away. We considered creating a plugin that fixed all these things, it would have been hard to maintain, in the end we travelled around to the people who made these apps and talked them through their problems

To be fair excel would erase places white that it wanted to write up to 9 times before it drew any black pixels, we made that very fast! we didn't tell them :-)

At the time 24-bit framebuffers were so slow that before we built graphics acceleration hardware people would switch back to 8-bit to get stuff done, making 24-bit/true colour your daily driver was a big step forward.

▲ nxobject 2 hours ago | parent | next [-]

Does that make you the first in a long tradition of GPU developers going to blockbuster app devs to say "hey, you should be doing this instead?"

PS – I am looking through the NuBus cards that I have... did you work for SuperMac or RasterOps?

	▲	Taniwha an hour ago | parent [-]
		I was probably not the first to have to do that. We knew what apps our customers used; making them better was the whole point of the operation. I did the architectural design for the SuperMac cards. I figured out what needed to be accelerated, dropping code into people's machines to see where the cycles were going. Others did the physical design for the first 2 cards; I did the design of the chip in the Thunder and later cards (designed the data paths and state machines and a full simulation, someone else actually laid the gates). If your card has a SQD01 on it, it's my work. It peaks at 1.5Gb/s on solid fills.

▲ xattt an hour ago | parent | prev | next [-]

What would have been the purpose of stupid code like that?

Was it a workaround for things that didn’t fully complete on one iteration, so the devs kept hammering away at it until it worked?

▲ phire 42 minutes ago | parent [-]

They were most likely just bugs. Quite possibly really stupid bugs.

Not every bug results in the program doing the wrong thing, they often just make the program do the right thing very slowly.

And nobody notices, since it still produces the right result.

	▲	Taniwha 39 minutes ago | parent [-]
		Yes, they were bugs. I think programmers (and their marketing people) were more focused on new features than performance.

▲ urbandw311er 4 hours ago | parent | prev | next [-]

This is a horrible and yet not unexpected insight into the internals of Excel

▲ Taniwha 4 hours ago | parent | next [-]

To be fair this was Excel 25 years ago, may no longer be true.

One of the other bugs (the Quark/ATM one) was also because the programmers were worried about writing over stuff that hadn't been completely erased. The Quark guys wrote a string with 2 spaces at the end through a box that masked the end of the string; the ATM font renderer saw it couldn't fit the text, so it split it in half and tried again, so it drew N/2 N/4 N/8 ... strings. It spent all its time in the 68k's multiply instructions figuring out how wide the strings (and substrings) were. Our fancy 24-bit character rendering hardware was an afterthought.

▲ bathtub365 4 hours ago | parent | prev | next [-]

In all of the software you’ve written, are you aware of how many on-screen pixels you’ve overdrawn?

▲ trelbutate 3 hours ago | parent | prev [-]

> To be fair excel would erase places white that it wanted to write up to 9 times before it drew any black pixels

I feel like I'm having a stroke trying to read this, what does it mean??

	▲	Taniwha 43 minutes ago | parent | next [-]
		Well, all they needed to do was erase the screen with white and draw on it, but their app's internal logic meant that they erased it more than once. I was capturing QuickDraw library calls (the low-level graphics primitives) to figure out where the graphics time in apps was going, and found out sometimes excel did it 9 times. Of course users didn't see it more than once, but our hardware made all that wasted time run faster.
	▲	NSUserDefaults 3 hours ago | parent | prev | next [-]
		Several layers of white is what makes the black really pop. (Just kidding).
	▲	b112 3 hours ago | parent | prev | next [-]
		It means they were time travellers! Secretly, they came from an alternate future where everyone used e-ink displays, and wanted Excel to be ready!
	▲	sixeyes 3 hours ago | parent | prev [-]
		before writing to some area, it would erase it (clearing with white) up to 9 times

▲ PaulHoule an hour ago | parent | prev [-]

I remember when 24 bit color was exotic and aspirational and you had to settle for 16.

▲ Someone 4 hours ago | parent | prev | next [-]

> Which basically expanded back in the day to 65k reads of 1 byte for several MB file. Each fread translated to 65k reads of ReadFile Windows API

What software did that that badly? If the code asks for (up to) 65,536 single byte items, why would you split that into 65,536 calls?

Also, that change changes behavior. The old call could read anything from zero to 65,536 bytes, the new one only can read zero or 65,536 bytes.

(Reading the source of a few implementations, I think most implementations will fill the output buffer with partial objects if the input doesn’t supply an integral number of them, but the return value of fread cannot signal that to the caller)
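
A small sketch of the return-value difference, using only standard C. The helper `fread_result` is hypothetical: it writes `len` bytes to a temporary file, rewinds it, and returns what `fread` reports for the given `size` and `count`. It assumes `tmpfile()` is usable in the current environment.

```c
#include <stdio.h>

/* Hypothetical helper: fills a temporary file with `len` bytes and returns
   the result of fread(buf, size, count, f) on it, or (size_t)-1 on setup
   failure or if the request would not fit in the 64 KiB buffer. */
static size_t fread_result(size_t len, size_t size, size_t count) {
    static char buf[65536];
    if (size != 0 && count > sizeof buf / size) return (size_t)-1;

    FILE *f = tmpfile();
    if (f == NULL) return (size_t)-1;

    for (size_t i = 0; i < len; i++) {
        if (fputc('x', f) == EOF) {
            fclose(f);
            return (size_t)-1;
        }
    }
    rewind(f);

    /* fread returns the number of complete elements read, not bytes. */
    size_t n = fread(buf, size, count, f);
    fclose(f);
    return n;
}
```

On a 100-byte file, `fread_result(100, 1, 65536)` reports 100 elements, while `fread_result(100, 65536, 1)` reports 0, because not even one complete 65,536-byte element was available.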

▲ micampe 3 hours ago | parent | next [-]

A long time ago I worked with someone who read 1 byte at a time from a socket because they insisted data was cached so the kernel was going to batch it magically somehow. It took me days to convince them to measure it.

▲ quietbritishjim 3 hours ago | parent [-]

That's different: you're talking about the application code, like OP.

But I think the parent comment's point is that the issue is in the implementation of fread itself in the standard library. It's perfectly reasonable for an application to pass it 1, 65536 (i.e. one byte, up to 65536 times) and expect it not to issue 65536 separate OS calls.

▲ b112 3 hours ago | parent [-]

Is it? I get what you're saying, but asking for 1 byte 65536 times, is indeed different than asking for 65536 bytes, 1 time. There may be reasons, such as when you pull off the end of a buffer, it shifts. And the buffer size is 1 byte. Or 10. Or whatever.

No, I'm not saying that's why. I'm simply saying there is a difference between asking for 1 byte or 65k bytes of something. Even dd runs the same under Linux.

dd bs=10k count=1 is faster than bs=1 count=10k

I remember trying to recover some data from a spinning disk, and trying to slowly creep up on the data. So I wanted 1 byte per, I wanted it to nibble, until it hit whatever the errored part was. If I just grabbed the lot, it'd error out from the whole read.

	▲	quietbritishjim 21 minutes ago | parent | next [-]
		> asking for 1 byte 65536 times, is indeed different than asking for 65536 bytes, 1 time.
		Yes it's different. As others have noted, the difference is what is returned if fewer than 65536 bytes are available to read in the file: total failure vs partial read. There is, unsurprisingly, no requirement that it has an unnecessarily inefficient implementation to meet this behavioral requirement. (The C standard doesn't talk about such things as syscalls but, even if it did, it surely wouldn't require such a thing.)
		The irony is that the partial read is actually the default on both Windows and Posix (i.e. both ReadFile and read() will read up to the number of bytes specified). So a one-syscall implementation for fread would have been easier than multiple calls, and certainly would be standard compliant.
		The dd example isn't comparable because dd is much lower level, and you really are specifying how the syscalls should be made.
	▲	Someone 2 hours ago | parent | prev | next [-]
		I glanced at https://github.com/busterb/libc-openbsd/blob/master/stdio/fr... and https://chromium.googlesource.com/chromiumos/third_party/gli.... The latter (as usual when comparing OpenBSD and Linux) is more complex, but both multiply count by size and then go their way. Also, the API contract allows fread to read fewer bytes than requested. I would expect any implementation to do that. But maybe somebody interpreted the contract differently than major OSes, in the sense that a call isn’t allowed to write partial size-sized chunks to user memory and/or advance the file position further than its return value advocates (that, I think, is something that the implementations above can do, and might be considered a bug)
	▲	dspillett 2 hours ago | parent | prev | next [-]
		Another possibility for why it needs to be done that way is dealing with error conditions. I've not looked at the code (or even the man pages) and it is a long time since I touched anything that low level, so this might be completely wrong, but if there is an error before the next 64KiB (including just hitting EOF) then the semantics could be different. Asking for 1x64KiB I would expect to just error as there aren't the requested number of bytes. Asking for 64Ki lots of 1 byte might simply error just the same, or it might at least populate the buffer with what it can read, or if the meaning of 1,65536 is actually “up to 64Ki lots of 1B” then it would populate the buffer as far as possible and return the amount read rather than an error condition. If the per-byte option is slow but still fast enough, and dealing with the semantics is less faff, then people will go for that because the tiny time loss is worth the larger effort reduction. Of course this assumes the underlying system doesn't change, as with the “making local code to run as on-demand networked code” example higher in the thread, which changes the relative performance characteristics of the two calling methods significantly.
	▲	chadgpt3 2 hours ago | parent | prev [-]
		dd is designed to request a certain block size from the kernel. fread is not and should just multiply the two arguments and read that many bytes, just like calloc.

▲ macintux an hour ago | parent | prev [-]

I assumed it was a simple mistake: easy to forget what order the two integers are sent.

▲ mort96 2 hours ago | parent | prev | next [-]

Wait, is that wrong? I always call fread as:

    fread(data, 1, sizeof(buffer), f);

with the rationale that I'm interested in reading sizeof(buffer) individual bytes. The buffer size is incidental, not the size of the items I'm trying to read from the file; "read one item whose size is sizeof(buffer)" seems semantically wrong.

Is this just the case of Windows having a bad stdlib fread implementation 15 years ago or is my thinking here actually wrong?

▲ chadgpt3 2 hours ago | parent [-]

It's not wrong. Guy just wrote a bad implementation of fread and blamed everyone else.

	▲	DarkUranium an hour ago | parent [-]
		He didn't write it. The C runtime authors did (presumably Microsoft, if it's MSVCRT). He's hooking into ReadFile, a layer below the stdlib. By the time it reaches the hook, it's already split.

▲ fsfod 2 hours ago | parent | prev | next [-]

Part of Windows Explorer actually does tons of tiny 4-byte ReadFile calls into its tracking-database-like file when you delete a file. If you're deleting lots of files, this quickly adds up.

▲ somenameforme 5 hours ago | parent | prev | next [-]

Doesn't that break anything relying on the return value? fread gives you the number of objects read as a return. So I think a pretty typical thing would be to fread and then parse that number of characters, and that'd just break?

▲ jcul 4 hours ago | parent | next [-]

I've seen a lot of code that just assumes fread / fwrite succeeded without bothering to check the return value...

But in this case if the code was calling fread 65536 times in a loop and getting 64KiB each time it wouldn't be good either!

Sounds like the parent comment had to fix this with the internal cache thing to speed up the small freads. I think they meant the easy fix would have been swapping the args in the original / caller code.

	▲	account42 4 hours ago | parent [-]
		There are no small freads in the story; whatever implements those freads supposedly split them up into many calls. But that sounds more like a problem of that implementation than of the fread callers, as size == 1 is correct when you are reading a bag of bytes.

▲ koolala 4 hours ago | parent | prev | next [-]

I think they turned it from a tiny file read to a tiny ram read.

▲ DonHopkins 4 hours ago | parent | prev [-]

The type of programmer who swaps the args to fread tends to be the type of programmer who doesn't bother to check the return value, fortunately.

Edit: mort96: So did you check the return value or not?

	▲	mort96 2 hours ago | parent | next [-]
		If I have a buffer of bytes, and I intend to treat the content of that buffer as individual bytes, what is semantically wrong with "read 65k 1-byte-sized items into this buffer"? Wouldn't it be a bit unnatural to express it as "read one item whose size is 65k"?
	▲	account42 4 hours ago | parent | prev [-]
		But the args aren't necessarily swapped just because they end up in a slow case in some implementation.

▲ lukan 4 hours ago | parent | prev | next [-]

"I'm guessing developers did this on purpose so that games seemed like they were loading a lot of stuff"

I really hope that was not the case and would rather think it was incompetence or a way to deal with obscure legacy problems, but the gamer in me gets enraged at the thought that someone would artificially increase loading times.

▲ chadgpt3 2 hours ago | parent | prev [-]

Why does your fread do anything other than multiplying the two arguments?

▲ Sesse__ an hour ago | parent [-]

The idea of having two arguments to fread() is presumably to be able to do something else than all-or-nothing when there's a short read.

	▲	chadgpt3 26 minutes ago | parent [-]
		Yes, it divides the bytes read by the element size to get the return value. Which is the obvious reason you'd pass an element size of 1: you want to know how many bytes were read.
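
A rough sketch of the "multiply, read once, divide" shape discussed in this subthread. It is not taken from any real C library: `mem_file`, `raw_read`, and `sketch_fread` are hypothetical names, and `raw_read` is an in-memory stand-in for ReadFile or read() that may return fewer bytes than requested. A real implementation would also handle buffering, error flags, and retrying short reads.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory "file" standing in for an OS handle. */
struct mem_file {
    const unsigned char *data;
    size_t len;
    size_t pos;
    unsigned long raw_reads;  /* how many low-level reads were issued */
};

/* Stand-in for ReadFile/read(): copies up to n bytes, may return fewer. */
static size_t raw_read(struct mem_file *f, void *dst, size_t n) {
    size_t avail = f->len - f->pos;
    if (n > avail) n = avail;
    memcpy(dst, f->data + f->pos, n);
    f->pos += n;
    f->raw_reads++;
    return n;
}

/* fread-like wrapper: one low-level read of size * count bytes.
   The return value counts complete elements only. */
static size_t sketch_fread(void *dst, size_t size, size_t count,
                           struct mem_file *f) {
    if (size == 0 || count == 0) return 0;
    if (count > (size_t)-1 / size) return 0;  /* size * count would overflow */
    size_t got = raw_read(f, dst, size * count);
    return got / size;
}
```

In this sketch, a request for 65,536 one-byte elements on a 100-byte source issues a single low-level read and returns 100. A request for one 65,536-byte element also issues a single read but returns 0, even though the position has advanced by 100 bytes, which is the ambiguity raised earlier in the thread.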