I worked in Systems Validation at Intel when the 8087 was current. Intel had an engineer dedicated to validating customer bug reports and reproducing them. Day in, day out, that's pretty much all he did. Sooooo many corner cases, and so many opinions on what the 'right' thing to do was when you lost precision[1].
[1] I'd say that over half of the bug reports were people who were annoyed that doing fp instructions in one order got them the right answer but in another order got them the wrong answer.
Mathematicians vs Computer / Systems Engineers. The machine only has so much space, so it's best to imagine every value also has a corresponding error range attached and that managing the growth of that error range so that it remains under the target value is key.
IMO, the way the error bars combine is very intuitive. You are really just rounding to roughly 7 or 16 sig-figs (single or double precision) after every operation.
People just seem to get really hung up on the point that error bars exist in the first place, and combine.
I suspect it has a lot to do with the way that rounding is taught in school. It's absolutely hammered into us that you should never round until the very end, otherwise you lose precision.
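To make the "round after every operation" picture concrete, here is a minimal sketch using Python's decimal module. The 6-digit decimal context is only a stand-in I picked for illustration (real hardware rounds in base 2, and this is not a model of the 8087), and the helper name add_in_order is made up for the example.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

# Assumption: a 6-significant-digit decimal context stands in for the FPU.
# Every operation done through the context rounds its result to 6 digits.
CTX = Context(prec=6, rounding=ROUND_HALF_EVEN)

def add_in_order(values, ctx=CTX):
    """Sum left to right, rounding to ctx.prec digits after each add."""
    total = Decimal(0)
    for v in values:
        total = ctx.add(total, Decimal(v))
    return total

# Same four numbers, two different orders.
big_first = add_in_order(["100000", "0.4", "0.4", "0.4"])
small_first = add_in_order(["0.4", "0.4", "0.4", "100000"])

print(big_first)    # each 0.4 is rounded away on its own
print(small_first)  # the 0.4s combine to 1.2 before meeting the big value
```

Neither result is a hardware bug. Both are the correctly rounded outcome of each individual step, which is roughly what a lot of those bug reports came down to.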
We had a whole course at university (early 1990s) about stability in numerical calculations and simulations. Which was basically about rounding errors, plus some higher level mathematical transformations you could apply to stop errors from accumulating. I remember essentially nothing about it except that the lecturer insisted everyone use FORTRAN 77 (including uppercase and punch-card formatting) for the exercises.
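One well-known example of that kind of transformation is compensated (Kahan) summation. I'm not claiming it is what that course covered; it is just a short sketch of the idea of restructuring a calculation so rounding errors don't pile up. The function names are mine.

```python
import math

def naive_sum(values):
    """Plain left-to-right accumulation; rounding error grows with length."""
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values):
    """Compensated summation: carry the lost low-order bits forward."""
    total = 0.0
    comp = 0.0  # running estimate of the error from earlier additions
    for v in values:
        y = v - comp            # apply the correction to the next term
        t = total + y           # low-order bits of y may be lost here
        comp = (t - total) - y  # recover what was lost
        total = t
    return total

values = [0.1] * 1_000_000
reference = math.fsum(values)  # correctly rounded sum, used as the yardstick

print(abs(naive_sum(values) - reference))
print(abs(kahan_sum(values) - reference))
```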
80 bits always seemed a strange choice for floating point, but as soon as you said there’s a 16-bit sign-and-exponent part (1 sign bit plus a 15-bit exponent) and a 64-bit fraction part, it made sense.
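For anyone who wants to poke at the layout, here is a rough sketch of decoding the 10-byte extended format: 64 significand bits (with an explicit integer bit), 15 exponent bits biased by 16383, and 1 sign bit. The function name is made up, it only handles finite values, and it ignores the unusual encodings.

```python
from fractions import Fraction

def decode_x87_extended(raw: bytes) -> Fraction:
    """Decode a 10-byte little-endian extended-precision value exactly.

    Sketch only: raises on infinities and NaNs.
    """
    if len(raw) != 10:
        raise ValueError("expected exactly 10 bytes")
    bits = int.from_bytes(raw, "little")
    significand = bits & ((1 << 64) - 1)  # bits 0..63, integer bit is bit 63
    exponent = (bits >> 64) & 0x7FFF      # bits 64..78, biased by 16383
    sign = bits >> 79                     # bit 79
    if exponent == 0x7FFF:
        raise ValueError("infinity or NaN")
    # Denormals (exponent field 0) use the same scale as exponent field 1.
    scale = max(exponent, 1) - 16383 - 63
    value = Fraction(significand) * Fraction(2) ** scale
    return -value if sign else value

# 1.0: exponent field 0x3FFF, significand with only the integer bit set.
one = ((0x3FFF << 64) | (1 << 63)).to_bytes(10, "little")
print(decode_x87_extended(one))
```

The 16/64 split is why the format drops so neatly into five 16-bit words on an 8086-era bus.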
I assume microcode was a choice for both ease of development/testing/changes and saving die space. Would there come a point later on where performance could be gained by converting the microcode into a full set of discrete logic, or is that not worth the effort?
Usually, it's not worth the effort of converting microcode into discrete logic to get performance. Among other things, it's a mess to try to fix a bug.
A few exceptions: The different models of the IBM System/360 mainframe are almost all microcoded, except for the high-end machines, which were hard-wired for performance. The design of the Apollo Guidance Computer is microcode, but the implementation is discrete logic. The 8086 and derivatives are microcoded, except NEC created a faster hard-wired version, the V33.
I found it interesting that this uinstr format doesn't include omnipresent control flow bits like I see in most uinstr archs. I was going to ask about RNI being its own instruction, but looked at the microcode dump you linked to, and it's clear that you'd need a nop in almost all of those slots anyway because of the delay apparently needed after register transfers.
So I guess my question is: what do you see as the reasons why you'd pick a particular school of micro control flow as a microcode engine implementer? i.e. along the spectrum of 'no increment on upc, every uinstr explicitly encodes jump, maybe OR-ing bits into the address for conditional control flow', to 'looks like a relatively normal assembly, assumed incrementing program counter, specialized control flow uinstrs otherwise'.
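To pin down the two ends of that spectrum, here is a toy sketch of both sequencer styles running the same countdown loop. Everything in it (the tuple layouts, op names, addresses) is invented for illustration and doesn't correspond to any real microcode format.

```python
# Style A: no incrementer. Every microinstruction names its successor, and a
# conditional branch ORs a condition bit into the low bit of that address.
# Hypothetical layout: (operation, next_address, condition or None)
ROM_A = {
    0: ("load", 2, None),
    2: ("dec", 2, "zero"),  # next is 2 while counter != 0, 3 once it hits 0
    3: ("halt", 0, None),
}

def run_style_a(start):
    counter, upc, trace = 0, 0, []
    while True:
        op, next_addr, cond = ROM_A[upc]
        trace.append(op)
        if op == "load":
            counter = start
        elif op == "dec":
            counter -= 1
        elif op == "halt":
            return trace
        flag = 1 if (cond == "zero" and counter == 0) else 0
        upc = next_addr | flag  # the "branch" is just an OR into the address

# Style B: assumed incrementing micro-PC plus dedicated control flow uinstrs.
# Hypothetical layout: (operation,) or (operation, target)
ROM_B = [("load",), ("dec",), ("jnz", 1), ("halt",)]

def run_style_b(start):
    counter, upc, trace = 0, 0, []
    while True:
        instr = ROM_B[upc]
        op = instr[0]
        trace.append(op)
        upc += 1  # implicit increment
        if op == "load":
            counter = start
        elif op == "dec":
            counter -= 1
        elif op == "jnz" and counter != 0:
            upc = instr[1]
        elif op == "halt":
            return trace

print(run_style_a(3))  # control flow rides along with every data operation
print(run_style_b(3))  # control flow costs its own micro-cycles
```

In this toy, style A pays with an address field in every word, while style B pays with extra cycles spent on jumps (for start values of 1 or more).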
> So I guess my question is: what do you see as the reasons why you'd pick a particular school of micro control flow as a microcode engine implementer?
For a comprehensive answer, a good vintage introductory digital design textbook is Ward and Halstead's 1989 Computation Structures, from the "peak CISC" era! [1]
There, the second (vertical) type is often used for highly complex instructions/fancy addressing modes, that you might want to implement with some sort of procedure abstraction, loops, working memory, etc. A "luxury" vertical microcode engine would have facilities like "microprocedure calls", a micro-stack and workspace RAM, a micro-ALU, dispatch table micro-instructions. The authors use the suggestive term "interpretive microcode".
String instructions come to mind as a complex example; non-register machine architectures (stack machines); tagged data architectures that have instruction-level polymorphism (e.g. Lisp machines).
The culminating project of Ward and Halstead is an elaborate two-level microcode system (vertical on horizontal/second on first). I think the first Motorola 68k had this architecture -- here is the patent. [2]
It's genuinely a fun read. The "write a micro-interpreter for your CISC ISA" approach is hopelessly out of date now that we need pervasive microarchitectural parallelism, and have HDLs.
So, I've read Computation Structures. And agreed, an absolutely fantastic text. [0]
However, my question is kind of orthogonal to vertical versus horizontal microcode.
As a counterexample I'd point to the microcode format of the IBM System/370 Model 145, which, while pretty clearly something that would be described as vertical microcode, also doesn't have implicit control flow [1]. It's a little on the wide side for vertical microcode at 32 bits, I'll grant you, but it has an op field with about a dozen variants that is then used to decode the other fields, at an overall decode complexity comparable to a RISC arch. Horizontal microcode looks more like 'these specific bits just always plug into this mux, and are simply set to some default if unused in this specific operation, reducing decode to essentially wires'. That being said, the 370/145 format also doesn't have an incrementer on the program counter; the last byte of each instruction encodes a (conditional) branch to the next instruction [2].
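A rough way to see the decode difference is the sketch below, with two made-up word formats. Neither is the 370/145 format or any real one; the field names and widths are assumptions chosen only to contrast "fixed bits drive fixed control points" with "an op field selects how the rest is read".

```python
def decode_horizontal(word: int) -> dict:
    """Every field sits at a fixed position and drives one control point."""
    return {
        "alu_op":    (word >> 0) & 0b11,
        "a_mux_sel": (word >> 2) & 0b111,
        "b_mux_sel": (word >> 5) & 0b111,
        "reg_write": (word >> 8) & 0b1,
        "mem_read":  (word >> 9) & 0b1,
        "next_addr": (word >> 10) & 0x3FF,
    }

def decode_vertical(word: int) -> dict:
    """A small op field decides how the remaining 12 bits are interpreted."""
    op = (word >> 12) & 0xF
    if op == 0x0:  # register-to-register move
        return {"op": "move", "src": (word >> 6) & 0x3F, "dst": word & 0x3F}
    if op == 0x1:  # add an 8-bit immediate to a register
        return {"op": "addi", "dst": (word >> 8) & 0xF, "imm": word & 0xFF}
    if op == 0x2:  # branch to a 12-bit microcode address
        return {"op": "branch", "target": word & 0xFFF}
    raise ValueError(f"unknown op field {op:#x}")

print(decode_horizontal((5 << 10) | (1 << 8) | (2 << 5) | (1 << 2) | 3))
print(decode_vertical(0x1234))
```

Note that the horizontal toy carries a next_addr field in every word, while the vertical toy needs a separate branch op. That is exactly the axis the question is about, and it is independent of how the data fields are decoded.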
For another example, I'd point to modern microcode formats in Intel and AMD cores. They pretty universally have a vertical microcode instruction format (though typically grouped into triads or quads of instructions), paired with an explicit, dedicated microprogram control flow field for the group. The uops there are pretty wide, typically 48-64 bits, but they sort of need to be to fit the immediates that are common for 64-bit archs, and they still fit into that RISC-like level of decode complexity you see in vertical microcode. [3]
[0] - As an aside, if you like Computation Structures, I'd recommend The Anatomy of a High-Performance Microprocessor: A Systems Perspective by Shriver and Smith as well. The mad lads stuck a surprising amount of the RTL for the AMD K6 in that book, albeit translated into some custom academic language. That mid-90s-era design of a multi-instruction-per-clock CISC decoder dumping a speculative instruction stream into an OoO RISC-like backend is arguably just as much peak CISC as the early 80s, given that it won against the UNIX RISCs by the early 2000s and survives to this day with remarkably few tweaks, relatively speaking. CISC seems to be kind of like the Roman empire; any time it starts losing the war, it just unashamedly starts integrating the concepts of the competitor it's losing to. Which is great in this case. That's called good engineering.
[2] - Though it does have an explicit far jump/call instruction for control flow outside that window addressable by that byte, and a couple other bits sometimes depending on the instruction format.
Wouldn't it have been simpler for Intel to design a chip with those same 8 instructions (xfer, shift, add, arith, far jmp, far call, local jmp, misc), but read and executed from normal RAM accessible by the user, perhaps with a tiny cache, instead of all these ROM/microcode special compression/hidden architecture shenanigans?
This is exactly the theory of the RISC and VLIW processors, which replaced, respectively, the vertical microprograms and the horizontal microprograms stored in ROMs, which were used in the processors of the seventies, with normal programs with simple instructions, which were normally executed from fast cache memories, thus achieving the same speed as microprograms.
However, when the 8087 was designed, RISC and VLIW processors were still in the future, because a fast cache memory allowing the execution of an instruction per clock cycle was still far too expensive in comparison with a microprogram ROM.
Most earlier floating-point accelerators were microprogrammed like the 8087, with the microprograms stored in a ROM. However, there existed the FPS AP-120B, introduced by the company Floating Point Systems in 1976. This was a floating-point accelerator for minicomputers, like the DEC PDP-11 or VAX, which was marketed as a "supercomputer for the poor".
The FPS AP-120B was a VLIW processor launched 7 years before the term "VLIW" was coined. This means that it was a horizontally microprogrammed processor (i.e. with multiple concurrent operations specified by each microinstruction), where the microprogram was not stored in a ROM, but was fed into the accelerator by the host computer. Therefore the user could write such microprograms for it directly, to implement optimized computational algorithms.
Nevertheless, while FPS AP-120B was said to be a "supercomputer for the poor", "poor" was meant only in comparison with those who could afford to buy a Cray-1. Such a "cheap" array processor still had a price more than 100 times greater than an Intel 8087.
By the time RISC and VLIW CPUs became fashionable, using microinstructions as simple as those of the Intel 8087 for implementing floating-point operations was no longer acceptable, because having to execute tens or hundreds of simple instructions for each FP operation was deemed too slow. Therefore the instruction sets of RISC and VLIW CPUs were eventually extended to include FP operations as single instructions, which had to be implemented in complex hardware in order to achieve an execution throughput of one instruction per clock cycle.
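To give a feel for the "tens or hundreds of simple instructions" figure, here is a sketch of a bit-at-a-time shift-and-add multiply of two 64-bit significands that counts the primitive steps. It is a generic textbook scheme, not the 8087's actual multiplication algorithm, and the step accounting is my own simplification.

```python
def shift_add_multiply(a: int, b: int, width: int = 64):
    """Multiply two unsigned integers one multiplier bit at a time.

    Returns (product, steps), where steps counts the primitive add and
    shift operations a simple sequencer would have to issue.
    """
    if not (0 <= a < (1 << width) and 0 <= b < (1 << width)):
        raise ValueError("operands must fit in the given width")
    product, steps = 0, 0
    for i in range(width):
        if (b >> i) & 1:
            product += a << i  # conditional add of the shifted multiplicand
            steps += 1
        steps += 1             # move on to the next multiplier bit
    return product, steps

product, steps = shift_add_multiply(0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF)
print(product == (2**64 - 1) ** 2, steps)
```

Even in this stripped-down form, one multiply costs on the order of a hundred simple steps, which is the kind of sequence dedicated hardware has to collapse to get single-instruction FP operations.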
That's basically the RISC approach, using simple one-clock instructions instead of complex microcoded instructions. In the case of the 8087, it made sense to use microcode because the 8087 is running in parallel with the regular 8086 processor. If the 8087 is constantly fetching micro-instructions from RAM, it will get in the way of the 8086. (Note that RISC chips rapidly added floating-point units, even though that goes against the strict RISC ideology.)
This is also why RISC would never have happened if it weren't for the fact that, for a brief period in the history of computing, RAM was faster than the core. Single-cycle instructions only make sense if the fetch can keep up.
Judging by the register area bit density, it seems it would have space for 3-5kbit SRAM cache (replacing the 26,368 bit ROM). I wonder if the basic 4 ops+some approximation functions like sqrt would fit in there. Purely alternative history ;)
[1] https://www.amazon.com/Computation-Structures-Optical-Electr...
[2] https://patents.google.com/patent/EP0011412A1/en?inventor=Ha...
[1] - Pages A2-A5 for an overview, chapter 4 for a more in depth discussion. https://www.bitsavers.org/pdf/ibm/370/fe/3145/SY24-3581-1_31...
[3] - https://www.usenix.org/system/files/conference/usenixsecurit... and https://github.com/chip-red-pill/uCodeDisasm
It's a question of throughput that can be extended cheaply enough.
There goes my next hour or two.