Revisiting the DOS Memory Models

▲ Revisiting the DOS Memory Models (blogsystem5.substack.com)

182 points by mooreds 10 months ago | 104 comments

▲ jmmv 10 months ago | parent | next [-]

Original author here. Thanks for sharing!

I see various comments below along the lines of “oh, the article is missing so and so”. OK… then please see the other articles in this series! I think they cover most of what you are mentioning :-)

The first was on EMS, XMS, HMA and the like: https://blogsystem5.substack.com/p/from-0-to-1-mb-in-dos

The second was on unreal mode: https://blogsystem5.substack.com/p/beyond-the-1-mb-barrier-i...

The third was on DJGPP: https://blogsystem5.substack.com/p/running-gnu-on-dos-with-d...

And the last, which follows this one, is on 64 bit memory models: https://blogsystem5.substack.com/p/x86-64-programming-models

Some of these were previously discussed here too, but composing this on mobile and finding links is rather painful… so excuse me for not providing those links now.

▲

bonzini 10 months ago | parent | next [-]

Just one nit: contrary to what the article suggests, as far as I remember the compact model was not so common because using far pointers for all data is slow and wastes memory. Also, the globals and the stack had to fit in 64k anyway so compact only bought you a larger heap.

However, there were variants of malloc and free that returned or accepted far pointers, or alternatively you could ask DOS for memory in 16-byte units and slice it yourself (e.g. by loading game assets). Therefore many programs used the small and medium models instead of compact and large respectively, and annotated pointers to large data (which is almost always runtime-loaded and dynamically allocated anyway) by hand with the __far modifier. This was the most efficient setup with the only problem that, due to the 64k limit, you could hardly use the heap or recursion.
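
To make the "16-byte units" concrete: DOS hands out conventional memory in paragraphs (INT 21h, AH=48h takes the paragraph count in BX and returns a segment), and a block at segment S is then addressed as S:0000. Below is a minimal sketch of that arithmetic, written in Python purely for illustration. The helper names are made up here, and real code of the era would have done this in C or assembly.

```python
PARAGRAPH = 16  # DOS allocates conventional memory in 16-byte paragraphs


def paragraphs_needed(nbytes: int) -> int:
    """Round a byte count up to whole 16-byte paragraphs."""
    return (nbytes + PARAGRAPH - 1) // PARAGRAPH


def linear_address(segment: int, offset: int) -> int:
    """Real-mode translation: segment * 16 + offset (1 MiB wraparound ignored)."""
    return (segment << 4) + offset


# A hypothetical 100,000-byte game asset is bigger than one 64 KiB segment,
# yet DOS can still allocate it as a run of paragraphs.
asset_paragraphs = paragraphs_needed(100_000)  # 6250 paragraphs
```

The program then has to slice such a block itself, stepping the segment part of its far pointers whenever an offset would cross 64 KiB.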

	▲	tiahura 10 months ago \| parent \| prev [-]
		1. Compact Model Limits: The stack and globals don’t strictly need to fit in 64 KB; far pointers allow larger heaps, but inefficiency made this model unpopular. 2. Malloc Variants: While farmalloc and farfree existed, developers often used direct DOS memory allocation for better control. 3. Stack Constraints: Stack and recursion limits were due to 64 KB segments, not specific to compact or small models. 4. Far Pointers: Using __far for dynamic data was common across models; compact/large automated this but were inefficient. 5. Heap/Recursion Use: The heap and recursion were constrained, not “hardly usable,” due to far pointer overhead and stack size.

▲

Timwi 10 months ago | parent | prev | next [-]

I read through the whole page from the beginning up to the “Discussion about this post” header. At no point was there any mention of a series, or any other blog posts (the inline links all go to Wikipedia).

I don't blame anyone for not realizing that there are more articles on the topic.

▲

klelatti 10 months ago | parent [-]

At the very start of the post:

> At the beginning of the year, I wrote a bunch of articles on the various tricks DOS played to overcome the tight memory limits of x86’s real mode.

With link to an article.

▲

gibibit 10 months ago | parent | next [-]

Linked in the style where each word links to _a_ _different_ _page_ that doesn't correspond to the hyperlinked word.

What do you call this pattern? It seems to be popular lately. I haven't been able to find a description of it, but it would be much more helpful to the reader if it was identified.

Instead of

> At the beginning of the year, I wrote a _bunch_ _of_ _articles_ on the various trick

It's better to write

> At the beginning of the year, I wrote a bunch of articles (_1_, _2_, _3_) on the various trick

or something similar.

	▲	cesarb 10 months ago \| parent \| next [-]
		IIRC, this linking pattern was common enough back in the Geocities era, that HTML style guides explicitly recommended avoiding it. To those who lived through these times, it's quite obvious that there are three separate links, because the space between the words is not underlined (the space would be underlined if it were a single link); obviously, that trick is not helpful with the modern style of not underlining hyperlinks at all.
	▲	jmmv 10 months ago \| parent \| prev \| next [-]
		I intentionally wrote it that way because these articles are only loosely related to the one discussed here, not a "series I thought through upfront". Yeah, not a fan _of_ _the_ _pattern_, but I wanted to give it a try and see how it worked. But honestly... the text of the very first sentence talks about these articles, so the curious reader will hopefully realize that "there is something more".
	▲	marxisttemp 10 months ago \| parent \| prev [-]
		It bothers me too, in the same fashion as “click here”. Instead, we should prefer e.g. At the beginning of the year, I wrote a bunch of articles on the various tricks (_below 1MB_, _above 1 MB_, and _with GNU JMP_) Just describe the content you’re linking to. You know best as the author!

▲

lproven 10 months ago | parent | prev [-]

Correction to the correction: with three links to the three articles.

▲

turol 10 months ago | parent | prev [-]

If you click on the domain name next to the main link you get a filtered view of submissions for just that domain. This way you can easily find the related posts. It looks like this is the fifth submission of this article but the others didn't get many comments.

https://news.ycombinator.com/from?site=blogsystem5.substack....

	▲	jmmv 10 months ago \| parent [-]
		That's good, but you need to know what you are looking for. If I click on that link now, I see a bunch of repeated submissions, and due to the nature of this publication, the articles are of very varied topics. So a random person won't know what articles are related to this one and which ones aren't with ease.

▲ WalterBright 10 months ago | parent | prev | next [-]

The Zortech C/C++ compiler had another memory model: handle pointers. When dereferencing a handle pointer, the compiler emitted code that would swap in the necessary page from expanded memory, extended memory, or disk.

It works like a virtual memory system, except that the compiler emitted the necessary code rather than the CPU doing it in microcode.

https://www.digitalmars.com/ctg/handle-pointers.html

Similarly, Zortech C++ had the "VCM" memory model, which worked like virtual memory. Your code pages would be swapped in and out of memory as needed.

https://digitalmars.com/ctg/vcm.html
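
For readers who have not seen software paging, here is a rough Python sketch of the handle pointer idea: every dereference goes through a helper that brings the page in first. This is not how Zortech implemented it; the class name, the tiny page size, and the first-in-first-out eviction are all assumptions made for illustration.

```python
class HandleStore:
    """Toy software paging: a few resident pages backed by a larger slow store."""

    def __init__(self, resident_pages: int, page_size: int = 16):
        self.page_size = page_size
        self.resident_pages = resident_pages
        self.backing = {}    # page number -> bytearray (stands in for EMS/XMS/disk)
        self.resident = {}   # page number -> bytearray in "conventional memory"
        self.swaps = 0       # how many times a page had to be brought in

    def _page_in(self, page: int) -> bytearray:
        if page in self.resident:
            return self.resident[page]
        if len(self.resident) >= self.resident_pages:
            # Evict the oldest resident page (dicts keep insertion order).
            victim = next(iter(self.resident))
            self.backing[victim] = self.resident.pop(victim)
        data = self.backing.pop(page, None)
        if data is None:
            data = bytearray(self.page_size)  # first touch: fresh zeroed page
        self.resident[page] = data
        self.swaps += 1
        return data

    def read(self, handle: int) -> int:
        page, offset = divmod(handle, self.page_size)
        return self._page_in(page)[offset]

    def write(self, handle: int, value: int) -> None:
        page, offset = divmod(handle, self.page_size)
        self._page_in(page)[offset] = value


store = HandleStore(resident_pages=2)
store.write(0, 11)    # page 0
store.write(16, 22)   # page 1
store.write(32, 33)   # page 2, which pushes page 0 out to the backing store
```

Reading handle 0 afterwards still returns 11; it just costs another swap, which is the trade the compiler-generated code was making on every dereference.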

▲

sitkack 10 months ago | parent | next [-]

That is sort of like inlining the demand paging code from the OS. When we have exokernels, they exist as a library so can be dealt with like regular code.

This would be trivial (and fun) to implement with Wasm.

▲

actionfromafar 10 months ago | parent [-]

Are you saying this could be a way to break out of the 32 bit barrier (a bit) on WASM? Sort of like how Windows NT could handle 64 gigs of RAM even though it was a 32 bit operating system?

▲

sitkack 10 months ago | parent [-]

Wasm64 will soon be everywhere if you want more than 4GB of address space. No, I was saying you could simulate larger heaps by selectively paging data memory in and out, you would still need a way to represent your max heap size.

Wasm also supports multimemories, https://github.com/WebAssembly/multi-memory

	▲	actionfromafar 10 months ago \| parent [-]
		Interesting! https://github.com/WebAssembly/multi-memory/blob/main/propos... The scaling point is what I was thinking of. "As long as Wasm memories are limited to 32 bit address space, there is no way to scale out of 4 GB memory efficiently. Multiple memories at least provide an efficient workaround until 64 bit memories become available (which may still take a while)." So pretty similar to launching several processes on 32 bit NT or Linux, each having their own memory space.

▲

jmclnx 10 months ago | parent | prev | next [-]

I was a user of Zortech C 1.0. I loved its disp_* functions.

One program (com) I wrote with it back then is still being used by at least one person. I talked to them a couple of months ago and they said they still use it.

	▲	WalterBright 10 months ago \| parent [-]
		Wow! good to know. I used it for Empire, and for my text editor. When moving to Linux, it was easy to convert to using TTY sequences.

▲

WalterBright 10 months ago | parent | prev [-]

Borland's "Zoom" scheme for overlays was well marketed, but not competitive with VCM (because only one overlay could be used at a time). That didn't matter, though, because Zoom was a catchy name and VCM was dull as dirt.

Philippe Kahn is a marketing genius, and I am not.

(VCM's overlays could be loaded anywhere, the relocation happened at runtime.)

▲ kookamamie 10 months ago | parent | prev | next [-]

There's at least one more "fun" aspect to DOS memory - Borland's Turbo Pascal overlay files: https://secondboyet.com/articles/publishedarticles/theslithy...

▲

int_19h 10 months ago | parent [-]

It wasn't just TP that used overlays; it was a very common technique for large DOS apps in general.

	▲	PaulHoule 10 months ago \| parent [-]
		RSTS/E for the PDP-11 had an overlay facility too. That machine had a 64k user space but could support more physical memory; a big PDP-11 installation might support 20 terminals, each offering a BASIC experience just a little bigger than the Apple ][ a few years earlier, particularly after they introduced separate address spaces for code and data. In 1980, as a kid, I was reading books about computers, would get my fingers on a TRS-80 or a PDP-8, and thought of the PDP-11 as "the next thing" up from the "8-bit micros". But really the PDP-11 was long in the tooth in 1980 because it did not exceed the 64k barrier of those chips. The 8086 had a pragmatic approach to addressing more memory. You have to remember that when the IBM PC came out hardly anyone could afford to load the machine up with 640k or more memory; instead you would have 64k or 128k, so in the early days COM files did not seem limited but rather a way to migrate from the CP/M ecosystem in terms of compilers and tooling. Personally I loved assembly language programming for the real-mode 80286 and did not feel the segmentation model to be much of a burden. I remember great portability of Unix games like Nethack to DOS. My 80286 beat the pants off any PDP-11 and my 80486 trounced the VAX-11/780... DEC's minicomputers were dead but the VAX was a pioneer machine that popularized the virtual memory model of modern 32-bit and 64-bit computers.

▲ Aardwolf 10 months ago | parent | prev | next [-]

Many things in computing are elegant and beautiful, but this is not one of them imho (the overlapping segments, the multiple pointer types, the usage of 32 bits to only access 1MB, 'medium' having less data than 'compact', ...)
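
A small sketch of what "overlapping segments" means in practice: many different segment:offset pairs name the same byte, so far pointers need normalizing before they can be compared. The function names are illustrative and Python is used only to show the arithmetic.

```python
def linear(segment: int, offset: int) -> int:
    """8086 real-mode translation, wrapping at 1 MiB (20 address lines)."""
    return ((segment << 4) + offset) & 0xFFFFF


def normalize(segment: int, offset: int) -> tuple:
    """Canonical far pointer: largest possible segment, offset in 0..15."""
    addr = linear(segment, offset)
    return addr >> 4, addr & 0xF


def aliases(addr: int) -> int:
    """Count the segment:offset pairs that reach addr without wrapping."""
    lowest_seg = max(0, (addr - 0xFFFF + 15) >> 4)  # ceil((addr - 0xFFFF) / 16)
    highest_seg = min(0xFFFF, addr >> 4)
    return highest_seg - lowest_seg + 1


# B800:0000 and B000:8000 are the same byte of text-mode video memory.
same_byte = linear(0xB800, 0x0000) == linear(0xB000, 0x8000)
```

For an address in the middle of the map such as 0xB8000, `aliases` comes out at 4096, which is the redundancy that makes 32 bits of pointer cover only 1 MB.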

▲

akira2501 10 months ago | parent | next [-]

> but this is not one

It really is though. Memory and thus data _and_ instruction encoding were incredibly important. Physical wires on the circuit board were at a premium then as well. It was an incredibly popular platform because it was highly capable while being stupidly cheap compared to other setups.

Engineering is all about tradeoffs. "Purity" almost never makes it on the whiteboard.

▲

tonyedgecombe 10 months ago | parent | next [-]

The 68000 was from the same era yet it had a 24 bit address bus, enough for 16 MB.

▲

actionfromafar 10 months ago | parent | next [-]

And the 68008¹ was developed to overcome this problem of requiring too many data and address lines.

1: https://en.wikipedia.org/wiki/Motorola_68008

	▲	gpderetta 10 months ago \| parent [-]
		sure, but that limitation didn't show up architecturally, other than requiring more cycles to perform a load or store.

▲

elzbardico 10 months ago | parent | prev | next [-]

The 68000 was a high-end product, the 8088 was a lot cheaper, in a big part because of those design decisions, like having a 16 bit memory bus.

This design allowed for a smaller chip, and keeping backwards compatibility with the 8080.

	▲	jhallenworld 10 months ago \| parent [-]
		But there is more: IBM basically stole the entire CP/M software ecosystem by using the 8088: assembly language CP/M programs could be more or less just recompiled for MS-DOS. Yet, it extended CP/M by allowing you to use more than 64 KB vs. 8080/Z80.

▲

Aardwolf 10 months ago | parent | prev [-]

But wouldn't allowing plain addition of 1-byte pointer offsets and 2-byte pointer offsets to a current address (just integer addition, no involvement of segments) have been simpler to design and for CPU usage? Rather than this non-linear system with overlapping segments. This would still allow memory-saving tiny pointers when things are nearby

▲

rep_lodsb 10 months ago | parent [-]

The problem is that you can't hold a pointer to more than 64K of address space inside a 16-bit register.

x86 could have easily had an IP-relative addressing mode for data from the beginning (jumps and calls already had it), but to get a pointer you can pass around to use someplace else than the current instruction, it has to be either absolute, or relative to some other "base" register which stays constant. Like the segment registers.

▲

gpderetta 10 months ago | parent [-]

Just combining two 16 bit registers for a logical 32 bit address would have been better than the weird partially overlapping addressspace.

▲

rep_lodsb 10 months ago | parent | next [-]

How would you have redesigned the 8086 to do this? And why, other than because of some aesthetic objection to overlapping segments?

The 286 and 386 in protected mode did allow segments with any base address (24 or 32 bits), so your argument about extending the address space doesn't make sense.

▲

gpderetta 10 months ago | parent [-]

you explained elsewhere how the overlap is used for relocatability, which is a reasonable justification. But if that were not a concern, non overlapping segments would have provided for a larger address space. I will readily admit that I'm not aware of all the constraints that led to the 8086 design.

386 (not sure how 286 works) did extend segments to a larger address space, by converting them to segment selectors, but it requires a significantly more complex MMU as it is a form of virtual memory.

	▲	Narishma 10 months ago \| parent [-]
		> 386 (not sure how 286 works) did extend segments to a larger address space, by converting them to segment selectors The 286 did that, though they only extended the address space to 24 bits. The 386 extended it again to 32 bits.

▲

wvenable 10 months ago | parent | prev [-]

But then you'd end up wasting memory because the address space would be divided into 64K blocks. The first PC had only 16KB of RAM but 128KB was probably more common. With the segments set up the way you describe, a 128KB machine could use only 2 segment addresses out of 65,536 -- not very efficient or useful for relocating code and data.
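
The numbers can be checked with a few lines of arithmetic. This is only a sketch: the helper names and the 20,001-byte program are invented for the example.

```python
def usable_segment_bases(ram_bytes: int, granularity: int) -> int:
    """Segment base addresses (0, g, 2g, ...) that land inside installed RAM."""
    return -(-ram_bytes // granularity)  # ceiling division


def padding_to_next_base(end_address: int, granularity: int) -> int:
    """Bytes skipped before the next thing can start on a segment boundary."""
    return (-end_address) % granularity


RAM_128K = 128 * 1024

bases_16_byte = usable_segment_bases(RAM_128K, 16)        # real 8086 scheme
bases_64k = usable_segment_bases(RAM_128K, 64 * 1024)     # non-overlapping segments

# Hypothetical program that ends at byte 20,001: where can the next one go?
waste_16_byte = padding_to_next_base(20_001, 16)
waste_64k = padding_to_next_base(20_001, 64 * 1024)
```

With 16-byte granularity the loader has 8192 places to put things in 128KB and loses at most 15 bytes per block; with 64K blocks it has 2 places and loses 45,535 bytes after that one small program.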

▲

Joker_vD 10 months ago | parent | prev [-]

Yeah, good thing that e.g. RV64 has RIP-relative addressing mode that can address anywhere in the whole 56-bits of available space with no problems, unlike the silly 8086 that resorted to using a base register to overcome the short size of its immediate fields.

▲

akira2501 10 months ago | parent [-]

...and then x86_64 went ahead and added RIP relative addressing back in, and you get the full 64 bits of address space.

▲

Joker_vD 10 months ago | parent [-]

...you know that that's not true, neither for x64 nor RV64, and my comment was sarcastic, right? Both can only straightforwardly address ±2 GiB from the instruction pointer; beyond that, it's "large code model" all over again, with the same inelegant workarounds that's been rediscovered since the late sixties or so. GOT and PLT versus pools of absolute 64-bit addresses, pick the least worst one.
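
For the x86-64 half of that claim, the limit is simply the signed 32-bit displacement, measured from the address of the next instruction. A sketch of the reachability check follows; the function name is made up, and the RV64 range differs slightly, so this models x86-64 only.

```python
REL32_MIN = -(1 << 31)
REL32_MAX = (1 << 31) - 1


def fits_rel32(next_ip: int, target: int) -> bool:
    """True if target is reachable with a signed 32-bit displacement from next_ip."""
    return REL32_MIN <= target - next_ip <= REL32_MAX


code = 0x0000_0000_0040_0000                  # hypothetical instruction address
near_data = code + (1 << 31) - 1              # last byte still in range
far_data = code + (1 << 31)                   # one byte too far

reachable = fits_rel32(code, near_data)
unreachable = not fits_rel32(code, far_data)
```

Anything outside that window needs the address loaded from somewhere else first, which is where the GOT or a literal pool comes in.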

▲

akira2501 10 months ago | parent [-]

> and my comment was sarcastic, right?

Pardon me for not realizing and treating it appropriately.

> with the same inelegant workarounds that's been rediscovered since the late sixties or so

Short of creating instructions that take 64bit immediate operands you're always going to pay the same price. An indirection. This will look different because it will be implemented most efficiently differently on different architectures.

> GOT and PLT versus pools of absolute 64-bit addresses, pick the least worst one.

Or statically define all those addresses within your binary. That seems more "elegant" to you? You'll have the same problem but your loader will now be inside out or you'll have none of the features the loader can provide for you.

At that point just statically link all your dependencies and call it an early day.

	▲	Joker_vD 10 months ago \| parent [-]
		> You're always going to pay the same price. An indirection. There is a difference between indirecting through a register, or through a memory (which in the end also requires a register, in addition to a memory load). On the other hand, I$ is more precious, and the most popular parts of GOT are likely to be in the voluminous D$ anyhow, so it's hard to tell which is more efficient. > Or statically define all those addresses within your binary. That seems more "elegant" to you? Of course not. I personally think a directly specifiable 64-bit offset from the base register that holds the start of the data section is more elegant. But dynamic libraries don't mesh too well with this approach although IIRC it has been tried. > you'll have none of the features the loader can provide for you. At that point just statically link all your dependencies and call it an early day. This works surprisingly well in practice, actually. Data relocations are still an issue though.

▲ pjmlp 10 months ago | parent | prev | next [-]

As someone that was already coding during those days, having done the transition from a Timex 2068 into MS-DOS 3.3 and wonderful 5¼-inch floppies, the article is quite good.

One thing missing are overlays, where we could have some form of primitive dynamic loading, having multiple code segments for the same memory region, naturally only one could be active at a time.
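
A toy model of the idea, in Python for brevity: several overlays compete for one memory region, and calling into one that is not resident first loads it over whatever was there. The class and overlay names are hypothetical; real overlay managers patched call stubs and read code from the EXE or an overlay file.

```python
class OverlayManager:
    """Toy model: many overlays share one memory region, only one is resident."""

    def __init__(self, overlays: dict):
        self.overlays = overlays   # name -> {function name: callable}; the "disk"
        self.resident = None       # overlay currently occupying the shared region
        self.loads = 0             # number of (slow) loads from disk

    def call(self, overlay: str, function: str, *args):
        functions = self.overlays[overlay]
        if self.resident != overlay:
            # Loading overwrites the previous overlay in the shared region.
            self.resident = overlay
            self.loads += 1
        return functions[function](*args)


manager = OverlayManager({
    "intro": {"play": lambda: "intro done"},
    "game": {"score": lambda a, b: a + b},
})
manager.call("intro", "play")        # loads "intro"
manager.call("game", "score", 2, 3)  # evicts "intro", loads "game"
```

Repeated calls into the resident overlay cost nothing extra, while bouncing between two overlays reloads on every switch, which is why the grouping of modules into overlays mattered so much.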

▲ PennRobotics 10 months ago | parent [-]

Some of the early Microprose games used this, and it was clever for two reasons:

First, more functionality. The minigames and intro/conclusion scenes were their own executables that made use of the original, generated game data. These got loaded into RAM on top of the original executable and then called.

Second, graphics and sound were also overlays. Rather than having useless-to-most Roland MT-32 code in the binary, this was only loaded if requested. There were overlays for Sound Blaster, PC speaker, and Adlib. If your monitor only supported four colors (CGA) there was an overlay for that.

A post would be nice, although you basically described most of it. An .OVL file with a non-zero overlay number is loaded into memory with INT 3Fh (although strangely enough any interrupt number could be chosen?, and the interrupt also would call the desired function after loading into memory). These overlays are loaded as-needed into a shared memory space.

I'd be more curious to see how one would have programmed those overlays in Microsoft C Compiler 3.0. More recent compilers seemed to have better menus and documentation for the memory models, but it seems like they were clairvoyant by squeezing every bit of functionality out of version 3.0 that was made easier by Watcom/Borland/MS 5.0. (Then again, they would have evolved their build system with every successful release and every new hire, plus it was their full time job to "figure that crap out", and maybe Microsoft improved their approach to overlays in response to Microprose and others calling all the time)

The documentation states only one EXE is generated, but Microprose had multiple EXE files. Is it possible those weren't overlays but something very similar? Or did they just change the file extensions? The docs also show the syntax "Object Modules [.OBJ]: a + (b+c) + (e+f) + g + (i)" where everything in parentheses is an overlay. But this isn't elaborated. What are the plus signs? How are these objects grouped? Would their list look like "preload + (cga + mcga + ega + vga) + (nosound + tandy + pcspkr + roland + sb) + (intro) + (newgame) + (maingame) + (minigamea) + (minigameb) + (outro)"? Or would every module be individually parenthesized, and those with plus symbols are interdependent (e.g. not alternatives)? (One website using BLINK seems to suggest the latter.)

I know there are a lot of DOS tutorials (FreeDOS YT channel, blog posts) but I haven't found one that does a start-to-finish overlay example.

▲ achairapart 10 months ago | parent | next [-]

See: https://neuviemeporte.github.io/f15-se2/2023/07/12/overlays....

From this series: https://neuviemeporte.github.io/category/f15-se2.html

▲ PennRobotics 10 months ago | parent [-]

Awesome! That's my reading material for the next week.

Now I wonder if MISC.EXE and xGRAPHIC.EXE were the same across different games e.g. Covert Action vs F15 SE2... (I just checked. MISC is different. Some routines are nearly similar, but newer versions have additional machine code and updated strings.)

▲ achairapart 10 months ago | parent [-]

From the article:

   Interestingly, although Civilization uses an almost identical setup menu and also contains multiple exes that look like sound and graphic drivers based on their name, the overlay header format of those seems to be different, and could not be parsed by my tool. Seems likely they were updating the scheme as they went along (Civ 1 came out 1991, so after F15-II).

My guess is that they constantly updated their libraries game by game, as both hardware and software/dev tools in those times were moving really fast.

▲ pjmlp 10 months ago | parent | prev | next [-]

Borland compilers and Clipper supported them directly.

Chapter 18, TP 3 and 7, to show its evolution

http://www.bitsavers.org/pdf/borland/turbo_pascal/Turbo_Pasc...

TC++, page 211

https://bitsavers.org/pdf/borland/turbo_c/Turbo_C++_Programm...

Clipper, section 7-18

https://archive.org/details/Clipper_Compiler_for_dBASE_III_a...

	▲	ngcc_hk 10 months ago \| parent [-]
		What I am looking for. A memory map great, but still needed to be used by normal guy who just use compiler …

▲ globalnode 10 months ago | parent | prev | next [-]

Microprose and their floppy disk protection, argh!!! Couldn't even back up a purchased game, and you know how long those disks lasted...

▲ int_19h 10 months ago | parent | prev [-]

The original X-COM (aka UFO: Enemy Unknown), despite being 32-bit, had two completely separate executables for the strategy part and the tactical combat part. The game basically dumped the relevant state like inventory to disk and then exited and relaunched the other process at switch points.

▲ nox101 10 months ago | parent | prev | next [-]

I feel like this is missing EMS and XMS memory. Both were well supported ways of getting more than 640k. EMS worked by page banking. 1 or 2 64k segments of memory would be changed to point to different 64k banks from an add on memory card. XMS just did a copy instead of a page bank IIRC. It's been a long time but I wrote DOS apps that used both to support more than 640k of memory using both standards.

https://en.wikipedia.org/wiki/Expanded_memory

https://en.wikipedia.org/wiki/Extended_memory

	▲	jmmv 10 months ago \| parent \| next [-]
		You should read the very first article I wrote in this “series” then, linked to from the opening paragraph: https://blogsystem5.substack.com/p/from-0-to-1-mb-in-dos (previously discussed in https://news.ycombinator.com/item?id=39031369 at the beginning of the year).
	▲	pcb-rework 10 months ago \| parent \| prev [-]
		What "feeling" does it give you? ;) Borland Pascal and C++ support EMS overlays. Think of it like a shared library almost. Also, using DPMI is another way around it.

▲ PaulHoule 10 months ago | parent | prev | next [-]

Today Java has pointer compression where you use a 32 bit reference but shift it a few places to the left to make a 64-bit address which saves space on pointers but wastes it on alignment

▲

xxs 10 months ago | parent | next [-]

All allocated objects would have the three least significant bits as 0. Any java object cannot be 'too small' as they all have object headers (more if you need a fully blown synchronized/mutex). So with compressed pointers (up to 32GB Heaps) all objects are aligned but then again, each pointer is 4 bytes only (instead of 8). Overall it's a massive win.
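
The arithmetic behind the 32GB figure, as a sketch. This is not HotSpot's code; the function names and the `heap_base` parameter are assumptions, and Python is used only to show the shifting.

```python
ALIGNMENT_SHIFT = 3  # objects are 8-byte aligned, so the low 3 bits are always zero


def compress(address: int, heap_base: int = 0) -> int:
    """Turn a 64-bit object address into a 32-bit reference."""
    offset = address - heap_base
    if offset % (1 << ALIGNMENT_SHIFT):
        raise ValueError("address is not 8-byte aligned")
    ref = offset >> ALIGNMENT_SHIFT
    if not 0 <= ref < (1 << 32):
        raise ValueError("address is outside the compressible range")
    return ref


def decompress(ref: int, heap_base: int = 0) -> int:
    """Recover the full address: shift left and add the base."""
    return heap_base + (ref << ALIGNMENT_SHIFT)


MAX_HEAP = (1 << 32) << ALIGNMENT_SHIFT   # 32 GiB reachable with 32-bit references

address = 0x7_0000_0008                   # a hypothetical object above the 4 GiB line
round_trip = decompress(compress(address))
```

The DOS parallel is direct: a 16-bit segment shifted left by 4 reaches 1 MiB, while a 32-bit reference shifted left by 3 reaches 32 GiB.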

▲

kstrauser 10 months ago | parent [-]

Huh, that’s clever! Do you have to choose that at compile or launch time, or does a program start like that and then “grow” when it uses more than 32GB of heap?

	▲	xxs 10 months ago \| parent [-]
		In Java you have to set max heap somehow - either ergonomics or just -Xmx command line option. Max heap is given (many a reason, and it sets before running the main method), so if you pick under the 32GB it'd auto use compressed pointers (optimize for size - optimize for speed). That option (compressed pointers) can be switched off, of course, via a command line option as well.

▲

o11c 10 months ago | parent | prev | next [-]

It's not wasted on alignment, since that alignment is already required (unless you need a very large heap). Remember that Java's GC heap is only used to allocate Objects, not raw bytes. There are ways to allocate memory outside of the heap and if you're dealing with that much raw data you should probably be using them.

▲

layer8 10 months ago | parent | prev [-]

Alignment is required anyway to prevent word tearing, for the atomicity guarantees.

▲ brudgers 10 months ago | parent | prev | next [-]

"DOS Memory Models" brought "QEMM" immediately to mind.

So possibly related, https://en.wikipedia.org/wiki/QEMM

▲

d3Xt3r 10 months ago | parent | next [-]

I was a big fan of JEMM386, was quite revolutionary when it came out - it used only 192 bytes of memory! A godsend for some demanding DOS games back then.

And there was also HXRT from the same author, which allowed you to run win32 apps in DOS. Never really made good use of it, but thought it was still pretty cool.

▲

mobilio 10 months ago | parent | prev [-]

386MAX user here!

	▲	lproven 10 months ago \| parent [-]
		386Max is now GPL FOSS. https://github.com/sudleyplace/386MAX It would be great if someone could update it so it ran on modern hardware. Then, for instance, FreeDOS could use it.

▲ globalnode 10 months ago | parent | prev | next [-]

One of the programs I'm the most pleased with was a small screensaver .COM program I wrote for DOS (for personal use). Pressing both shift keys at the same time toggled a blank screen screensaver on/off. There was a similar program released as part of Norton utilities but I got my .COM file smaller than theirs using assembly. After relocating the loader code or was it PSP? Cannot remember, it was something like 150'ish bytes of code in memory, maybe less :D

	▲	mabster 10 months ago \| parent [-]
		I wrote a similar TSR (Terminate and Stay Resident) program that would reboot the machine if the letter E was typed. We had a few of us at school always messing with each other haha

▲ o11c 10 months ago | parent | prev | next [-]

It's worth noting that all the memory models have DS=SS, which makes sense for C (where you often take the address of a local variable - though nothing is stopping you from having a separate "data stack" for those) but is a silly restriction for some other languages.

I'm sure someone took advantage of this, but my knowledge is purely theoretical.

▲

AshamedCaptain 10 months ago | parent | next [-]

It's not necessarily true. Many drivers, TSRs and libraries (e.g. all Win16 DLLs) cannot assume that ds=ss. This makes C programming a bit more entertaining...

	▲	garaetjjte 10 months ago \| parent \| next [-]
		Related: http://www.os2museum.com/wp/tracking-down-a-bug/
	▲	o11c 10 months ago \| parent \| prev [-]
		Well, if so that's out of the standard models (at least, the ones that assume fixed DS).

▲

xxs 10 months ago | parent | prev [-]

I never had SS=DS in Assembly. Used it for TSR for example.

▲ pcb-rework 10 months ago | parent | prev | next [-]

Spent many hours in Borland C/C++ 3.1 and Borland Pascal 7, with real-mode, unreal mode, and protected mode.

▲

zazaulola 10 months ago | parent | next [-]

Yeah. I'd forgot that Borland's turbo-vision interfaces had hamburger on the menu

	▲	pcb-rework 10 months ago \| parent [-]
		See, Chrome and the web plagiarized the original chef.

▲

mobilio 10 months ago | parent | prev [-]

Let's "Make Borland Great Again"!

▲ mycall 10 months ago | parent | prev | next [-]

I recall RBIL [0] having a detailed list of all the interrupts for all the known memory models available. There were many.

[0] https://en.wikipedia.org/wiki/Ralf_Brown%27s_Interrupt_List

▲ sedan_baklazhan 10 months ago | parent | prev | next [-]

An excellent read. While not directly related, I started remembering how fun it was to program for classic PalmOS with Motorola 68k CPUs: it also had the 64k segment limitation, so you had to structure application code blocks closely together in the linker.

▲ ta12653421 10 months ago | parent | prev | next [-]

ah, good ol REAL computing days :-)

DJGPP was such an eye opener back then and it made things much easier: finally, we were able to have one pointer for linear graphic buffer access; also you could easily save 2MB in memory, and its DPMI was free, compared to the other ones available.

▲ geon 10 months ago | parent | prev | next [-]

Is this only relevant to real mode, or is it still in use in protected mode and/or x64?

▲

Dwedit 10 months ago | parent [-]

On 32-bit Windows, segmentation registers still exist, but they are almost always set to zero. CS (code segment), DS (data segment), ES (extra segment), and SS (stack segment) are all set to zero. But FS and GS are used for other purposes.

For a 32-bit program, FS is used to point to the Thread Information Block (TIB). GS is used to point to thread-local storage since after Windows XP. Programs using GS for thread-local storage won't work on prior versions of Windows (they'll just crash on the first access).

X64 made it even more formal that CS, DS, SS and ES are fixed at zero. 32-bit programs running on a 64-bit OS can't reassign them anymore, but basically no programs actually try to do that anyway.

---

As for shorter types of pointers being in use? Basically shorter pointers are only used for things relative to the program counter EIP, such as short jumps. With 32-bit protected mode code, you can use 32-bit pointers and not worry about 64K-size segments at all.

---

Meanwhile, some x64 programs did adopt a convention to use shorter pointers, 32-bit pointers on a 64-bit operating system. This convention is called x32, but almost nobody adopted it.

▲

rep_lodsb 10 months ago | parent | next [-]

It's quite possible to write a program that uses 32-bit pointers in 64-bit mode, just keep all code and data at addresses below 4G. Such a program will run on any standard x86-64 kernel, because it doesn't use the x32 ABI. x32 is "only" required to support the C library, which expects pointers passed from/to the kernel to be the same size as those in userland.

(Things THEY don't want you to know: you can in fact write code in languages which aren't C, don't compile down to C, and don't depend on a C library. Even under Linux.)

As for reloading segment registers, 64-bit Linux is able to run 32-bit binaries, so there have to be ring 3 code segments for both modes. And there is nothing in the architecture stopping assembly code from jumping between those segments!

With a 32-bit binary that does this, you get access to all the features of 64-bit mode, with everything in your address space guaranteed to be mapped at an address below 4G. The only point where you need to use 64-bit pointers is in structures passed to syscalls. (for arguments in registers it's done automatically by zero-extension)

▲

xxs 10 months ago | parent | prev [-]

>some x64 programs did adopt a convention to use shorter pointers, 32-bit pointers on a 64-bit operating system.

It's doable in managed languages, e.g. Java has compressed pointers by default on sub-32GB heaps. I suppose it's doable even in a C-like setup (incl. OS calls), but that would require wrappers to bit-shift the pointers on each dereference (and when passing them to the OS or to extern code).

▲

gpderetta 10 months ago | parent [-]

both GCC and the linux kernel support x32 directly. Distros even shipped system libraries compiled for x32.

There was no uptake and I believe it is deprecated today.

▲

xxs 10 months ago | parent [-]

With x32 the limit would be 4GB, which is on the low side of things. Having 8-byte alignment (i.e. last 3 bits zero) allows for 32GB, which is better.
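A rough sketch of that arithmetic, in Python for brevity. The names `compress` and `decompress` and the zero heap base are made up for illustration; this is not how any particular JVM implements it.

```python
ALIGN_SHIFT = 3   # 8-byte alignment: the low 3 bits of every object address are zero
HEAP_BASE = 0     # assumed heap base; a real runtime may use a nonzero one

def compress(addr):
    """Store an aligned address as a 32-bit reference by dropping the zero bits."""
    offset = addr - HEAP_BASE
    if offset % (1 << ALIGN_SHIFT) != 0:
        raise ValueError("address must be 8-byte aligned")
    ref = offset >> ALIGN_SHIFT
    if not 0 <= ref < (1 << 32):
        raise ValueError("address outside the compressible range")
    return ref

def decompress(ref):
    """Rebuild the full address: shift the reference back and add the base."""
    return HEAP_BASE + (ref << ALIGN_SHIFT)

# Bytes reachable through a 32-bit reference: 2**32 * 8 = 32 GiB
REACH = (1 << 32) << ALIGN_SHIFT
```

The shift in `decompress` is the per-dereference cost mentioned above.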

▲

gpderetta 10 months ago | parent [-]

That would work in Java. In C it is a bit more complicated, as you can have pointers with byte granularity. In principle the size of a pointer need not be the same for all types: you can have char, short, int and float pointers be 64 bits and everything else be 32 bits. (void * has to be 64 bits as well, as you must be able to round-trip through it.) I suspect that would break 90% of code out there though.

▲ skissane 10 months ago | parent | prev | next [-]

I think it is a pity Intel went with 16 byte paragraphs instead of 256 byte paragraphs for the 8086.

With 16 byte paragraphs, a 16 bit segment and 16 bit offset can only address 1MiB (ignoring the HMA you can get on 80286+).

With 256 byte paragraphs, the 8086 would have been able to address 16MiB in real mode (again not counting the HMA, which would have been a bit smaller: 65,280 bytes instead of 65,520 bytes).
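Those figures can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name `reach` is made up for illustration.

```python
def reach(paragraph_size):
    """Return (bytes below the wrap point, bytes of HMA above it)
    for a 16-bit segment and 16-bit offset with the given paragraph size."""
    max_segment = 0xFFFF
    max_offset = 0xFFFF
    # 65,536 possible segments, each starting one paragraph after the previous one
    wrap_point = 0x10000 * paragraph_size
    # Highest address that segment:offset arithmetic can produce
    highest = max_segment * paragraph_size + max_offset
    hma = highest - wrap_point + 1
    return wrap_point, hma

for size in (16, 256):
    below, hma = reach(size)
    print(f"{size}-byte paragraphs: {below // 2**20} MiB + {hma} bytes of HMA")
```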

▲

pwg 10 months ago | parent | next [-]

Intel also released both the 8086 and 8088 as 40-pin DIPs.

Squeezing four more address pins in would have meant multiplexing four more of the pins on the chip, and if you exclude power/ground pins there are only 13 pins that are not multiplexed, and several of those either can't be multiplexed (because they are inputs, i.e., CLK, INTR, NMI) or would have made bus design even more painful than it already is for these chips.

The 4-bit shift, instead of an 8-bit shift, for the segment registers likely gave them as big an address bus as they could build while still meeting the constraint of "fits into a 40-pin DIP".

https://en.wikipedia.org/wiki/File:Intel_8086_pinout.svg

▲

spc476 10 months ago | parent | prev [-]

The 8086 was released in '78 (or thereabouts). 64K of RAM was very expensive at the time, and wasting 256 bytes just to align segments would have been extravagant. Also, the 8086 was meant as a stop-gap product until the Intel 432 was released (hint: it never really was as it was hideously expensive and hideously slow, but bits of it showed up in the 80286 and 80386).

The 80286 changed how the segment registers worked in protected mode, giving access to 16M of address space, but couldn't change it for real mode as it would have broken a ton of code. Neither Intel nor IBM thought the IBM PC would take over the market like it did.

▲

gpderetta 10 months ago | parent [-]

I still do not understand this point: Intel could have used 16 bits from the offset register and 4 bits from the segment register to get non-overlapping segments, leaving the top 12 bits of the segment register unused (either masked out, mirroring the other segments or trapping). It wouldn't have changed the number of lines it needed to address 1M of memory, but it would have made extending the address space further much simpler.

▲

rep_lodsb 10 months ago | parent | next [-]

As TFA explains, the purpose of segment registers wasn't just to extend the address space, it was to make code and data relocatable without needing to fix up every address referenced. They considered 256-byte alignment too wasteful; 64K would have been ridiculous (many business computers at the time didn't even have that much memory).
▲

smitelli 10 months ago | parent | prev | next [-]

Scenario A: Picture that a quick, tiny function is needed that can load data from struct members and operate on them. The structs are tiny but there are a whole lot of them, and the values of interest always start at offsets e.g. 0, 4, and 8. If the structs can be stored in memory aligned on a segment boundary, a pointer can be constructed where offset 0 always points to the beginning of the struct, and the code can use the literal offsets 0, 4, 8 added to the pointer base without having to do any further arithmetic.

Scenario B: Imagine you're writing a page of video to the VGA framebuffer. Glossing over a whole lot of minutiae, you can simply jam 64,000 bytes into the address and data lines starting at A000:0000 without needing to stop and think about what you're doing w.r.t. the segment registers. Any kind of segment change every n bytes would require the loop to be interrupted some number of times over the course of the transfer to update DS or ES. This would also prevent something like `rep movs` from being able to work on a full screenful of data.

The 16-byte paragraph, and the many segment/offset aliases that could be constructed to refer to a single linear memory address, was a design choice that tried to serve the needs of both of those groups.
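Both scenarios come down to a little arithmetic, sketched here in Python. The helper names (`linear`, `normalize`, `fits_in_one_segment`) and the sample address 0x12340 are invented for illustration only.

```python
def linear(segment, offset):
    """Real-mode address; the 8086's 20-bit bus wraps at 1 MiB."""
    return ((segment << 4) + offset) & 0xFFFFF

def normalize(addr):
    """Pick the alias with the smallest offset for a linear address."""
    return addr >> 4, addr & 0xF

def fits_in_one_segment(start_offset, length):
    """True if a block can be walked without touching a segment register."""
    return start_offset + length <= 0x10000

# Scenario A: a paragraph-aligned struct is reachable as segment:0000,
# so its fields sit at the literal offsets 0, 4, 8.
struct_seg, struct_off = normalize(0x12340)

# Scenario B: a 64,000-byte screen starting at A000:0000 fits in one
# 64 KiB segment, so a single copy loop can cover it.
screen_ok = fits_in_one_segment(0, 64000)
```

Many segment:offset pairs map to the same byte, for example `linear(0x1234, 0)` and `linear(0x1000, 0x2340)`, which is what makes both uses possible at once.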
▲

pwg 10 months ago | parent | prev [-]

> but it would have made extending the address space further much simpler.

Given published information [1] that the 8086 was designed in a weekend as a panic stop-gap to provide some form of "more advanced CPU" to keep Intel in the market while the iapx432 project was underway, but falling far behind schedule, it seems doubtful that the designers would also have been thinking of "ease of further expansion for future revision" on what was, at the time, just a stop-gap CPU to sell while awaiting the shipment of the iapx432.

Published information is that at the time Intel never expected the 8086 to create the huge extended family it has created, and instead expected the iapx432 to be that "grand family ancestor". The market, of course, had other ideas, and IBM's choice of 8088 for the IBM-PC was the catalyst that launched the 8086 design into the family it is today.

[1] I no longer have a reference to the publication

▲ GarnetFloride 10 months ago | parent | prev | next [-]

I remember some of that. One of my first jobs was a summer internship where I had to set up the engineering computers. They had AutoCAD and Ventura Publisher and one used expanded memory and the other extended memory. I set up batch files to copy the right configuration into config.sys and autoexec.bat so they would work. What a nightmare.

▲ malthaus 10 months ago | parent | prev | next [-]

this brings back traumatic memories of fiddling for hours with various config files to make games work on DOS back in the day

▲ wkjagt 10 months ago | parent | prev | next [-]

Precisely the kind of article I love to read. And timely too. I'm just about to fire up an old laptop with MS-DOS and Borland C++ so this will be fun to read alongside that.

▲ stuaxo 10 months ago | parent | prev | next [-]

As a teenage beginner programmer back then I only had a vague understanding of these (and not even pointers yet), wish I had this article then.

▲ dingosity 10 months ago | parent | prev | next [-]

I have such fun memories of x86 real-mode assembly programming. Thx for the stroll down memory lane!

▲ atan2 10 months ago | parent | prev | next [-]

Very good article. Thank you.

▲ block_dagger 10 months ago | parent | prev | next [-]

Memories of QEMM *shudder*

▲ tzs 10 months ago | parent | prev [-]

Intel missed a very simple opportunity to vastly simplify memory models on the 80286 for software that ran in protected mode, such as OS/2 and various Unix or Unix-like systems.

In real mode memory addressing works as described in the article. A 2-byte segment number and a 2-byte offset are combined to produce the memory address. The translation from segment:offset to physical address is:

  physical_address = segment * 16 + offset

Note that you can't just treat segment:offset as a 32-bit value and add 1 to get the address of the next byte. When you treat a segment:offset as a 32-bit address, the space is not mapped linearly to physical addresses, and that's the crux of what makes it annoying.
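A small Python sketch of that problem. The helper names are made up for illustration; only the `segment * 16 + offset` formula comes from the text above.

```python
def real_mode_address(segment, offset):
    """physical_address = segment * 16 + offset"""
    return segment * 16 + offset

def as_u32(segment, offset):
    """Pack segment:offset into one 32-bit value, segment in the high half."""
    return (segment << 16) | offset

def from_u32(value):
    """Split a packed 32-bit value back into (segment, offset)."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF

# Last byte reachable through segment 0x1000
ptr = as_u32(0x1000, 0xFFFF)
here = real_mode_address(0x1000, 0xFFFF)

# Naive 32-bit increment: the carry out of the offset bumps the segment by 1,
# which moves the base by only 16 bytes while the offset falls back to 0.
next_seg, next_off = from_u32(ptr + 1)
there = real_mode_address(next_seg, next_off)

print(hex(here), hex(there))
```

The byte after `here` is at `here + 1`, but the incremented pointer lands far below it.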

In protected mode the segment number is replaced with a selector. A selector is also 2 bytes, but it is no longer just a single number. It is 3 fields:

• 13-bit selector number (SEL)

• 2-bit requested privilege level (RL)

• 1 bit table indicator (T)

The way a selector:offset is translated to a physical address is:

• There are two "descriptor tables", the Local Descriptor Table (LDT) and the Global Descriptor Table (GDT). A descriptor is a data structure that contains the physical address of a block of memory, the length of the block, and some privilege information. The LDT is for memory of the current process, and the GDT is for memory shared by all processes such as the memory of the operating system.

• The selector number SEL is used as an index into one of those tables to find a descriptor. The table indicator bit T selects which table.

• The requested privilege level RL is checked against the privilege information from the descriptor, and the offset is checked against the length of the block described by the descriptor. If those checks pass then:

  physical_address = address_from_descriptor + offset
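The steps above can be sketched roughly like this in Python. It is a simplified model, not a description of the real hardware: the `Descriptor` fields and the single `rl > dpl` comparison are assumptions made for illustration (the actual privilege rules also involve the current privilege level), and the tables are plain lists.

```python
from dataclasses import dataclass

@dataclass
class Descriptor:
    base: int     # physical address of the block
    length: int   # size of the block in bytes
    dpl: int      # privilege needed to use the block (0 = most privileged)

def split_selector(selector):
    """Split a 16-bit selector into (SEL, T, RL)."""
    sel = selector >> 3         # 13-bit index into a descriptor table
    t = (selector >> 2) & 1     # table indicator: 0 = GDT, 1 = LDT
    rl = selector & 3           # requested privilege level
    return sel, t, rl

def translate(selector, offset, gdt, ldt):
    """Translate selector:offset to a physical address, or raise on a failed check."""
    sel, t, rl = split_selector(selector)
    table = ldt if t else gdt
    if sel >= len(table):
        raise LookupError("selector outside the descriptor table")
    d = table[sel]
    if rl > d.dpl:              # simplified privilege check
        raise PermissionError("privilege check failed")
    if offset >= d.length:      # offset must fall inside the block
        raise IndexError("offset outside the segment")
    return d.base + offset
```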

(The 80386 is similar, except that offsets and descriptor base addresses are 32 bits while selectors remain 16 bits, and if paging is enabled the address in a descriptor is a virtual address for the paging system rather than a physical address. Most operating systems simply run everything in small model and use the paging unit to do all their memory management.)

Here's how they packed SEL, RL, and T into a 16-bit selector.

  +-------------+-+--+
  | selector    |T|RL|
  +-------------+-+--+

If you wanted to treat a selector:offset as a 32-bit value it looked like this:

  +-------------+-+--+----------------+
  | selector    |T|RL| offset         |
  +-------------+-+--+----------------+

Note that this still suffers from the same problem that made treating real mode segment:offsets as 32-bit values annoying. Adding 1 doesn't give you the next address when offset wraps.

If they had just laid out SEL, RL, and T a little differently in the selector they could have fixed that. Just put SEL in the least significant bits instead of the most significant bits:

  +--+-+-------------+----------------+
  |RL|T| selector    | offset         |
  +--+-+-------------+----------------+

Then if adding 1 to a pointer wraps offset to 0, it will increment SEL. As long as the operating system sets up the descriptor table so that the memory blocks described by the descriptors do not overlap, the program would see a 29-bit linear address space (30-bit if the T bit is next to SEL).
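To make the difference concrete, here is a sketch in Python that packs the same fields both ways and adds 1 at the end of a segment. The function names are invented; the "proposed" layout is the hypothetical one from the diagram above, not anything Intel shipped.

```python
def pack_actual(sel, t, rl, offset):
    """Real 286 layout: SEL in the top 13 bits of the selector, then T, then RL."""
    return (((sel << 3) | (t << 2) | rl) << 16) | offset

def unpack_actual(ptr):
    selector = ptr >> 16
    return selector >> 3, (selector >> 2) & 1, selector & 3, ptr & 0xFFFF

def pack_proposed(sel, t, rl, offset):
    """Hypothetical layout: RL and T on top, SEL directly above the offset."""
    return (rl << 30) | (t << 29) | (sel << 16) | offset

def unpack_proposed(ptr):
    return (ptr >> 16) & 0x1FFF, (ptr >> 29) & 1, (ptr >> 30) & 3, ptr & 0xFFFF

# Point at the last byte of segment 5 in the LDT, then add 1 to the 32-bit value.
after_actual = unpack_actual(pack_actual(5, 1, 0, 0xFFFF) + 1)
after_proposed = unpack_proposed(pack_proposed(5, 1, 0, 0xFFFF) + 1)

print(after_actual)     # (sel, t, rl, offset) after the increment
print(after_proposed)
```

With the real layout the carry lands in RL and SEL stays at 5; with the proposed layout the carry moves SEL to 6, which is the next block if the OS laid the descriptors out back to back.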

(If the OS needed to run a program that did need an address space with the kind of overlap that real mode has it could set up the LDT for the process so that the descriptors did describe overlapping memory blocks).

If Intel had done this, compilers for 286 protected mode would have only needed small model, and compiler writers, library writers, and programmers would have been much happier.

So why didn't they?

One guess I've heard is that since a descriptor table entry is 8 bytes, by putting SEL in the top bits of the selector and the other 3 bits' worth of fields in the bottom they didn't have to shift SEL to turn it into an offset from the base of the descriptor table. If SEL were at the bottom it would need to be shifted by 3 to make it an offset into a descriptor table.

I've talked to CPU designers (but none who worked on 80286) and they have told me for this kind of thing where you would always want to shift an input by a fixed amount building that shift in is essentially free, so that doesn't seem to be the explanation.

▲

rep_lodsb 10 months ago | parent [-]

It's free in hardwired logic, but the descriptor tables are accessed by microcode. The 286 already had a constant ROM that could hold the FFF8h mask for an ALU AND operation, but no fast shift by multiple bits.

This might be one reason for it, but note that page table entries also use the lower bits for attributes. This means that software can use AND to separate the physical address and other bits, and OR to combine them, with no shift operation required.
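The mask-versus-shift point is easy to check numerically. A sketch in Python, with made-up function names; it says nothing about how the 286 microcode is actually written.

```python
DESCRIPTOR_SIZE = 8   # bytes per descriptor table entry

def table_offset_actual(selector):
    """Real layout: clearing T and RL with one AND already gives SEL * 8."""
    return selector & 0xFFF8

def table_offset_if_sel_were_low(sel_field):
    """Hypothetical layout with SEL in the low bits: a 3-bit shift is needed."""
    return sel_field << 3

# The AND result equals index * 8 for every possible 16-bit selector.
mask_is_enough = all(
    table_offset_actual(s) == (s >> 3) * DESCRIPTOR_SIZE
    for s in range(0x10000)
)
```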