INTRODUCTION TO EARLY MICROPROCESSORS

In 1971 Intel introduced the world's first microprocessor, the 4004, a 4-bit device. The chip could address only 4,096 4-bit memory locations and had an instruction set of 45 instructions. The 4004 was used in calculators. In 1972 Intel released the 8008, the first 8-bit microprocessor. Its memory capacity was 16K bytes, and additional instructions brought the total to 48. In 1974 the 8080 was introduced; it addressed more memory, had more instructions, and executed instructions 10 times faster than the 8008. The 8080 was also TTL compatible. A newer version of the 8080, the 8085, was introduced in 1976. The main advantages of the 8085 are its built-in clock generator and system controller, which were external components in the 8080. The 8086 was released in 1978 and the 8088 in 1979. Both devices are 16-bit, execute instructions much faster, and can address 1M byte of memory.

INTEL 8085

The 8085 is very similar to the 8080; the units added to it are the built-in clock generator and system controller.

INTEL 8086

The iAPX 86, 88 and iAPX 186, 188 family consists of advanced high-performance microprocessors. The family includes general data processors (8086, 8088, 80186 and 80188) and specialized coprocessors such as the 8087 numeric processor extension (NPX).

Standard microprocessors execute a program by repeatedly cycling through two steps. First the microprocessor fetches the instruction to be performed, then it executes the instruction. Only after execution is complete is the CPU ready to fetch the next instruction, execute it, and so on. The CPU hardware that executes instructions must wait until an instruction has been fetched and decoded before execution can begin. In a standard microprocessor, therefore, the execution hardware (primarily the control unit and ALU) spends a lot of time waiting for instructions to be fetched.
The 8086 and all later processors in the family eliminate this wasted time by dividing the CPU into two independent functional units. The CPU has a separate bus interface unit (BIU), whose only job is to fetch instructions from memory and to pass data between the execution hardware and the outside world. Since the execution unit (EU) and the bus interface unit are independent, the BIU fetches instructions while the EU executes a previously fetched instruction. This is made possible by the instruction pipeline (or queue) between the BIU and the EU. The BIU fills this pipeline with instructions awaiting execution. Thus, whenever the EU finishes executing an instruction, the next instruction is usually ready for immediate execution, with no delay caused by instruction fetching. Because the BIU is usually busy fetching instructions for the pipeline, the bus is more fully utilized. Another benefit of this parallel operation is that the execution unit seldom needs to wait for the BIU to fetch an instruction, so maximum performance and processing power are achieved without high-speed memory devices in the system.

MEMORY ADDRESSING

Memory is organised as a set of segments. Each segment consists of a linear sequence of up to 64K bytes. The bytes of memory are stored sequentially from address 00000 to FFFFF hex. Memory is addressed using a two-component address (a pointer) that consists of a 16-bit segment base (specifying the beginning address of the segment) and a 16-bit offset within the segment. The base values are held in one of the four internal segment registers (CS, DS, SS, ES). A 20-bit physical memory address is calculated by shifting the base value in the appropriate segment register left by 4 bits and adding the 16-bit offset to it. This form of addressing allows access to one million bytes of memory. Every 20-bit memory address points to program code, data, or stack area in memory.
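The shift-and-add calculation described above can be illustrated with a short Python sketch. The function name physical_address is hypothetical and used only for illustration. The final mask to 20 bits is an assumption that models the 8086 having 20 address lines; wraparound is not discussed in the text above.

```python
def physical_address(segment: int, offset: int) -> int:
    """Return the 20-bit physical address for a 16-bit segment:offset pair."""
    if not 0 <= segment <= 0xFFFF or not 0 <= offset <= 0xFFFF:
        raise ValueError("segment and offset must be 16-bit values")
    # Shift the segment base left by 4 bits (multiply by 16), add the offset,
    # then keep 20 bits (assumption: models the 20 address lines).
    return ((segment << 4) + offset) & 0xFFFFF


# Segment 1234h, offset 0010h: 12340h + 0010h = 12350h
print(hex(physical_address(0x1234, 0x0010)))  # 0x12350
```

Note that many different segment:offset pairs map to the same physical address, for example 1234:0010 and 1235:0000, because segments may start on any 16-byte boundary and therefore overlap.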
Each of the four memory spaces is pointed to by one of the segment base registers.

The address that the CPU places on the address bus selects one specific memory location or I/O device from all those available. This address can be generated in different ways depending on the operation being performed. The ways of generating these addresses are called addressing modes. The 8086 has the following addressing modes.

1. Register & Immediate Addressing. In register addressing the operand is in a register, so the instruction executes fastest. In immediate addressing the operand is constant data contained in the instruction. The data may be 8 or 16 bits.

2. Direct Addressing. This is the simplest memory addressing mode. No registers are involved; the effective address (EA) is taken directly from the displacement field of the instruction.

3. Register Indirect Addressing. The EA of a memory operand may be taken directly from the BP, BX, SI or DI register.

4. Base Addressing. In base addressing, the effective address is the sum of a displacement value and the contents of register BX or BP. Specifying BP as the base register directs the BIU to obtain the operand from the current stack segment. This makes base addressing with BP a very convenient way to access stack data.

5. Indexed Addressing. In indexed addressing, the effective address is the sum of a displacement and the contents of an index register, SI or DI. Indexed addressing is often used to access elements in an array.

6. Based Index Addressing. Based index addressing generates an effective address that is the sum of a base register (BX or BP), an index register (SI or DI) and a displacement. It is a very flexible mode because two address components can be varied at execution time.

7. String Addressing. String instructions do not use the normal addressing modes to access their operands. Instead, the index registers are used implicitly.
When a string instruction is executed, SI is assumed to point to the first byte or word of the source string, and DI is assumed to point to the first byte or word of the destination string.

8. I/O Port Addressing. If an I/O port is memory mapped, any of the memory operand addressing modes may be used to access the port. To access ports located in the I/O space, two different addressing modes are used. In the first, direct port addressing, the port number is an 8-bit immediate operand. This allows fixed access to ports 0 to 255. The second, indirect port addressing, is similar to register indirect addressing of memory operands. The port address is taken from register DX and can range from 0 to 65,535.

INTEL 8088

The main difference between the 8086 and the 8088 is that the 8086 has a 16-bit data bus both externally and internally, whereas the 8088 has an 8-bit external data bus and a 16-bit internal data bus. The 8088 was the first chip to be used in the PC.

INTEL 80186/80188

The 80186 and 80188, like the 8086 and 8088, are nearly identical to each other. The only difference between the 80186 and the 80188 is the width of their data buses. The 80186 (like the 8086) has a 16-bit data bus, while the 80188 (like the 8088) has an 8-bit data bus. The internal registers of the 80186/80188 are identical to those of the 8086/8088. About the only difference is that the 80186/80188 contain additional reserved interrupt vectors and some very powerful built-in I/O features. The 80186/80188 are often called embedded controllers because of their application, not as the heart of a microprocessor-based computer but as a controller.

80186/80188 Enhancements

The 80186/80188 have a great deal more internal circuitry than the 8086/8088, mainly because of the enhancements built into them. These enhancements include a clock generator, programmable interrupt controller, programmable timers, programmable DMA controller, and programmable chip selection unit. The 80186/80188 is available in speeds of 6 to 12 MHz.
The 6 MHz version of the 80186/80188 allows 417 ns of access time for the memory, and the 8 MHz version allows 309 ns.

INTEL 80286

The 80286 is an advanced version of the 8086 designed for multi-user and multi-tasking environments. The 80286 addresses 16M bytes of physical memory and 1G byte of virtual memory by using its memory management unit. The 80286 has two operating modes, Real Address Mode and Protected Address Mode. In real address mode the processor works as a very high-performance 8086. Programs written for the 80186 can be executed in this mode without any modification. The advanced architectural features and full capabilities of the 80286 are realized in its native Protected Mode. Among these features are sophisticated mechanisms to support data protection, system integrity, task concurrency, and memory management, including virtual storage.

Memory Management

The memory architecture of the Protected Mode iAPX 286 represents a significant advance over the iAPX 86. The physical address space is increased from 1M bytes to 16M bytes (2^24 bytes), while the virtual address space is increased from 1M bytes to 1G byte (2^30 bytes). Moreover, a separate virtual address space is provided for each task in a multi-tasking system. The iAPX 286 supports on-chip memory management instead of relying on an external memory management unit. The one-chip solution is preferable because no software is required to manage an external memory management unit, performance is much better, and hardware designs are significantly simpler. The iAPX 286 supports a segmented memory architecture and fully integrates memory segmentation into a comprehensive protection scheme. This protection scheme includes hardware-enforced length and type checking to protect segments from inadvertent misuse.

Task Management

The iAPX 286 is designed to support multi-tasking systems. The architecture provides direct support for the concept of a task.
Very efficient context switching (task switching) can be invoked with a single instruction. A separate logical address space is provided for each task in the system. Finally, mechanisms exist to support intertask communication, synchronization, memory sharing and task scheduling.

Protection Mechanisms

The iAPX 286 allows the system designer to define a comprehensive protection policy to be applied, uniformly and continuously, to all ongoing operations of the system. Such a policy may be desirable to ensure system reliability, privacy of data, rapid error recovery, and separation of multiple users. The iAPX 286 protection mechanisms are based on the notion of a "hierarchy of trust." Four privilege levels are distinguished, ranging from level 0 (most trusted) to level 3 (least trusted). Level 0 is usually reserved for the operating system kernel. The four levels may be visualized as concentric rings, with the most privileged level in the center. A four-level division is capable of separating kernel, executive, system services, and application software, each with different privileges. At any one time, a task executes at one of the four levels. Moreover, all data segments and code segments are also assigned privilege levels. A task executing at one level cannot access data at a more privileged level, nor can it call a procedure at a less privileged level. Thus, both access to data and transfer of control are restricted in appropriate ways.

INTEL 80386

Intel introduced the next version of its processor, the 80386, in 1985. The 386 continues to provide high-performance 32-bit processing power for embedded applications, which place demanding requirements on the computer industry. The 386 was developed with larger memory addressing capabilities. The 386 microprocessor is an entry-level 32-bit CPU designed for single-user applications and operating systems such as MS-DOS and Windows.
The microprocessor can address up to 4 gigabytes of physical memory and up to 64 terabytes of virtual memory. The processor has 32-bit registers, and its data paths support 32-bit addresses and data types. The processor has an integrated memory management unit and protection architecture, which includes address translation registers, multi-tasking hardware and a protection mechanism to support operating systems. Instruction pipelining and on-chip address translation ensure short average instruction execution time and maximum system throughput.

The 386 CPU includes new testability and debugging features. Testability features include a self-test and direct access to the page translation cache. Four new breakpoint registers provide breakpoint traps on code execution or data accesses, for powerful debugging of even ROM-based systems. In all, the CPU has six debug registers, which hold the linear breakpoint addresses and the debug control and status. There are also two test registers, one for the test command and one for the test data.

The CPU consists of an execution unit and an instruction unit. The execution unit contains eight 32-bit registers (used for both arithmetic and data operations) and a 64-bit barrel shifter used to speed up shift, rotate, multiply and divide operations. The instruction unit decodes the instruction opcodes and stores them in the decoded instruction queue for immediate use by the execution unit.

The memory management unit manages memory by two methods, segmentation and paging. Segmentation allows the logical address space to be managed by providing an extra addressing component, one that allows easy code and data relocatability and efficient sharing. Memory is organised into one or more variable-length segments, each of up to 4 gigabytes. Segmentation is used to encapsulate regions of memory that have common attributes. The segmentation unit translates the logical address space into a 32-bit linear address space.
If paging is not enabled, the 32-bit linear address corresponds to the physical address. The paging unit translates the linear address space into the physical address space. The paging mechanism operates beneath the segmentation process and is transparent to it, allowing management of the physical address space. Each segment is divided into one or more 4K-byte pages. A uniform size for all elements simplifies memory allocation and relocation schemes, since there are no problems with fragmentation. Each task can have a maximum of 16,381 segments of up to 4 gigabytes each, thus providing 64 terabytes of virtual memory. Paging is useful in virtual-memory multi-tasking operating systems.

The 386, like the 286, operates in two modes, Real mode and Protected mode.

The paging mechanism uses two levels of tables to translate a linear address to a physical address. The three components of paging are the page directory, the page table and the page itself. The page directory is 4K bytes long and allows up to 1024 page directory entries. Each page directory entry contains the address of the next level of table, a page table. A page table holds up to 1024 page table entries. The upper 20 bits of a page table entry contain the starting address of the page frame, and the lower 12 bits contain statistical information about that page.

The paging unit receives a 32-bit linear address from the segmentation unit. The upper 20 linear address bits are compared with all 32 entries in the Translation Lookaside Buffer (TLB) to determine whether there is a match. If there is a match, the 32-bit physical address is calculated and placed on the address bus. If the page is not in the TLB, the processor reads the appropriate page directory entry. If the P bit is equal to 1 in the page directory entry, indicating that the page table is in memory, the 386 reads the appropriate page table entry and sets the accessed bit.
If P = 1 in the page table entry, indicating that the page is in memory, the 386 updates the accessed and dirty bits as needed and fetches the operand. The upper 20 bits of the linear address, read from the page table, are stored in the TLB for future accesses.

The system starts in Real mode. This mode is basically a very fast 8086, but with extensions that allow 32-bit operations to be used if desired. Real mode is primarily used to set up the processor for protected mode. The maximum memory that can be addressed in Real mode is 1 megabyte. Since paging is not allowed in real mode, the linear address is equal to the physical address. All segments in real mode are 64K bytes long.

Protected mode gives access to the sophisticated memory management, paging and privilege capabilities of the processor. Software can perform a task switch to enter a task designated as a virtual 8086 task. Each such task behaves with 8086 semantics, thus allowing 8086 software to execute. The 8086 tasks can be isolated and protected from one another and from the host 386 operating system by the use of paging and the I/O permission bit map. In protected mode the linear address space is increased to 4 gigabytes, and virtual memory programs of up to 64 terabytes can be run. Protected mode is 100% compatible with programs written for all previous Intel processors from the 8086 to the 80286.

The addressing modes available in the 386 are as follows:

1. Register Operand Mode.
2. Immediate Operand Mode.
3. Direct Mode.
4. Register Indirect Mode.
5. Base Mode.
6. Index Mode.
7. Scaled Index Mode. (Not available in 286) The contents of an index register are multiplied by a scaling factor, and the result is added to a displacement to form the operand's offset.
8. Based Index Mode.
9. Based Scaled Index Mode. (Not available in 286) The contents of an index register are scaled by a factor, and the result is added to a base register to obtain the operand's offset.
10. Based Index Mode With Displacement.
The contents of an index register, the contents of a base register and a displacement are summed to form the operand's offset.
11. Based Scaled Index Mode With Displacement. (Not available in 286) The contents of the index register are multiplied by a scaling factor, and the result is added to the contents of a base register and a displacement to form the operand's offset.

The 386 has three distinct address spaces: logical, linear and physical. The segmentation unit translates a logical address into a linear address as follows. The logical address has two parts. The first is the selector, held in a segment register, which identifies the segment and hence its base address. The second is the offset, formed by summing the base register, index and displacement. Adding the segment base to the offset gives the linear address. If paging is disabled, the linear address is the physical address; if paging is enabled, the paging unit translates the linear address into the physical address.

INTEL 80486

The next processor, the Intel 486, was introduced on 10 April 1989. It offered the highest performance of the family at the time while remaining 100% binary compatible with the 386. The processor has a built-in 8K-byte cache for data and code. A math coprocessor is also built into the CPU itself; it is functionally the same as the 387 math coprocessor used with the 386, so code written for the 386/387 runs on the 486 without modification. The processor also has an on-chip memory management unit and retains binary compatibility with the Intel 386 and earlier processors. Like the 386, it has a 32-bit architecture. All the features of the 386, together with the enhancements in the CPU, increase its performance and its support for applications. The CPU is fully compatible with the x86 processors. The on-chip cache memory allows frequently used code and data to be stored in the cache, reducing accesses to the external bus.
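The address translation that the 486 inherits from the 386 (a two-level table walk on a 4K-byte page) can be sketched in Python. This is a simplified illustration under stated assumptions, not a model of the real hardware. The names split_linear, translate, page_directory and page_tables are hypothetical. The TLB and the accessed and dirty bits are left out, and a LookupError stands in for a page fault.

```python
PRESENT = 0x1            # P bit (bit 0) of a directory or table entry
FRAME_MASK = 0xFFFFF000  # upper 20 bits hold a 4K-aligned address


def split_linear(linear: int):
    """Split a 32-bit linear address into (directory index, table index, offset)."""
    return (linear >> 22) & 0x3FF, (linear >> 12) & 0x3FF, linear & 0xFFF


def translate(linear: int, page_directory: list, page_tables: dict) -> int:
    """Walk the two table levels and return the physical address.

    page_directory is a list of 1024 entries; page_tables maps a page-table
    base address to its list of 1024 entries (a stand-in for physical memory).
    """
    dir_index, table_index, offset = split_linear(linear)
    pde = page_directory[dir_index]
    if not pde & PRESENT:
        raise LookupError("page table not present")  # stands in for a page fault
    pte = page_tables[pde & FRAME_MASK][table_index]
    if not pte & PRESENT:
        raise LookupError("page not present")  # stands in for a page fault
    # Page frame address (upper 20 bits) combined with the 12-bit offset.
    return (pte & FRAME_MASK) | offset


# Build one mapping: directory entry 1 -> table at 5000h, table entry 2 -> frame 9A000h
page_directory = [0] * 1024
table = [0] * 1024
page_directory[1] = 0x5000 | PRESENT
table[2] = 0x9A000 | PRESENT
page_tables = {0x5000: table}

print(hex(translate(0x00402345, page_directory, page_tables)))  # 0x9a345
```

The 10/10/12 split of the linear address follows from the sizes given earlier: 1024 directory entries, 1024 table entries, and 4K-byte pages.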
Another version of the 486 uses a clock doubler, which runs the processor core at twice the external bus clock. The cache includes features that provide flexibility in external memory system design. Combined, these give the CPU roughly 3 times the performance of the 386.

The CPU consists of an execution unit and an instruction unit, as in the Intel 386. The processor has built-in testability and debugging features as in the 386. Testability features include a self-test and direct access to the page translation cache.

The memory management unit organises memory by segmentation and paging, the same concepts as in the 386. Memory can be organised into one or more variable-length segments of up to 4 gigabytes. A segment can have attributes associated with it, which include its location, size, type and protection characteristics. Each task can have a maximum of 16,381 segments, each up to 4 gigabytes in size; thus a task has a maximum of 64 terabytes of virtual memory.

The segmentation unit provides 4 levels of protection for isolating applications and the operating system from each other. The hardware-enforced protection allows the design of systems with a high degree of integrity. Level 0 is the most privileged; levels 1 and 2 are used for system services and operating system extensions; level 3 is used for applications. The privilege levels control the use of privileged instructions, I/O instructions, and access to segments and segment descriptors. Unlike traditional microprocessor-based systems, where this protection is achieved only with external hardware and software, the 486 provides the protection as part of its memory management unit. It also offers an additional type of protection on a page basis when paging is enabled. This is an extension of the user/supervisor privilege mode commonly used by minicomputers.

The rules of privilege are as follows. Data stored in a segment with privilege level p can be accessed only by code executing at a privilege level at least as privileged as p.
A code segment or procedure with privilege level p can be called only by a task executing at the same or a lesser privilege level than p.

The on-chip floating-point unit operates in parallel with the arithmetic and logic unit. The FP unit has eight data registers, a tag word holding eight tags, a control register, a status register, an instruction pointer and a data pointer. The data registers are 80 bits wide and together provide the equivalent capacity of twenty 32-bit registers. The FP registers can be accessed either as a stack, with instructions operating on the top one or two stack elements, or as a fixed register set, with instructions operating on explicitly designated registers.

The address calculation is the same as in the 386. The addressing modes are also the same: the 11 addressing modes explained earlier. As in the 386, the 486 has two distinct address spaces, memory space and I/O space. I/O devices can be memory mapped as well as I/O mapped.

As in the 386, the CPU has two modes of operation. The first is Real mode, in which the processor is a very fast 8086 with basically the same base architecture but with access to the 32-bit register set of the 486. The second is Protected mode, in which the processor uncovers its hidden power: it can address up to 4 gigabytes of physical memory and, if paging is enabled, up to 64 terabytes of virtual memory. The addressing mechanisms are segmentation and paging, exactly as in the 386.

One of the main reasons for the high performance of the Intel 486 is the on-chip cache, which reduces access time. The cache holds the most frequently used code and data, so fewer external accesses are needed and execution time is saved. The cache is transparent to software, which maintains binary compatibility with previous generations of processors back to the 8086. The cache has several operating modes, offering flexibility during execution and debugging.
Invalidation and replacement are implemented in hardware, easing system design. The write strategy used for the internal cache is write-through. Every write drives an external write bus cycle, in addition to writing the information to the internal cache if the write was a cache hit. A write to an address that is not contained in the internal cache is written only to external memory. Cache allocations are not made on write misses.

INTEL -- Pentium(R)

Intel 486 vs Pentium(R)

* Slow processor, slow memory. The first IBM PC was introduced with an 8088 processor running at 4.77 MHz. A zero-wait-state bus cycle took 838 ns, and memory could be accessed within this time, so memory did not throttle the processor.

* Faster processor, slow memory. As processor speed increased, memory became a bottleneck on performance.

* Prefetcher. This is a small look-ahead cache with a prefetch queue that fetches the next instruction to be executed. Since execution is mostly sequential, the next instruction is taken from the prefetch queue; when a branch instruction occurs, the prefetcher gambles. The prefetcher therefore handles branch instructions inefficiently. Branch prediction was added to predict whether a branch will be taken or not (by examining the execution history of the instruction).

* Memory bottleneck. With the increase in processor speed, DRAM can no longer be accessed with zero wait states, so memory access time is a serious impediment. Possible solutions are the following:

1. Substitute SRAM, which gives faster performance. However, this method is more expensive (about 10 times), requires more board real estate, consumes more power and generates more heat.

2. Add a cache subsystem. An external cache reduces memory accesses to zero wait states. It consists of a small SRAM with a cache controller that keeps copies of frequently requested information. The disadvantage is that memory access is bounded by bus speed, resulting in a bottleneck.
* Unified internal code/data cache. Placing the cache on the processor chip shortens the path to memory, resulting in fast memory access, and frees the external bus: the processor can feed itself from its internal cache while a bus master uses the external bus to transfer data. The i486 processor has a single code/data cache, so there is internal contention between data and code accesses. Using separate code and data caches eliminates the internal contention.

* The pipeline is a bottleneck. With separate code and data caches, and external caches to handle misses, instructions are fed into the pipeline faster, but execution cannot keep up, which becomes a bottleneck. Introducing a superscalar design, which has a dual instruction pipeline, solves this problem.

* Uni-processing systems. Even with increased processor clock speed and faster memory, a uni-processing system cannot meet the demands of some environments. A multi-processor system can be implemented by adding processors. There are two categories of multi-processor implementation:

* Asymmetrical: specific tasks are dedicated to the various processors. The disadvantage is that the execution load is not even among the processors.

* Symmetrical: all processors can do the same tasks, with load balancing performed by the operating system. The disadvantage is greater complexity in hardware and software.

The PENTIUM(R) Processor

Introduced on 22/3/93. Operating speed from 60 to 166 MHz, 3.1 million transistors, 0.8 micron, 4GB addressable memory, 64 terabytes virtual memory. The following are the variations over the 486 processor.

Functional Units (att: 2)

* Wider data bus. The 64-bit data path permits 8 bytes of information to be transferred at a time. It can transfer data to and from memory at 528 Mbytes/s, which is 5 times faster than the 486. The Pentium(R) implements bus cycle pipelining to increase bandwidth, which allows a second cycle to start before the first has completed.
* Separate code and data caches. The on-chip caches act as temporary storage for commonly used instructions and data, replacing the need to go to off-chip memory for the next instruction. The code and data caches are 8KB each, organised as two-way set-associative caches. The data cache uses "write back" (data is written to the cache without going to memory) and the "MESI" (modified, exclusive, shared, invalid) protocol.

* Superscalar architecture. Superscalar architecture enables more than one instruction to be executed per clock cycle. Two separate execution pipelines execute integer instructions in five stages: prefetch, decode 1, decode 2, execute and write back. This permits different instructions to be in various stages of execution at once, which increases processor performance.

* Branch prediction. This is accomplished by predicting the most likely set of instructions to be executed.

* Enhanced floating-point execution unit. It is capable of executing 2 floating-point instructions in a single cycle through instruction scheduling and overlapped (pipelined) execution. Common floating-point functions (add, multiply and divide) are hardwired for faster execution.

* Address bus. The Pentium(R) has two sets of signal lines: the address bus proper (consisting of 29 signals, A31:A3) and the byte enable bus (consisting of 8 signals, BE7#:BE0#).

* Data bus. 64-bit data bus (eight 8-bit data paths).

* System Management Mode (SMM). It eliminates the need for customized software drivers to perform power management, since it executes from a separate address space that is transparent to system software.

* Extensions to virtual paging, virtual 8086 mode and debugging.

* Design hurdles.

Heat dissipation: the Pentium(R) runs very hot and needs keen thermal attention.

Speed: memory, video and the system bus must be able to keep up with the speed of the Pentium(R).

Tolerances: at high speeds, small vendors with little or no experience in building high-speed systems might not be able to master the learning curve.
RF interference: the faster the system, the more RF interference it produces. It will be tougher for vendors to meet the various RF emission standards with Pentium(R) processors.

* Bug. Intel Pentium(R) processors manufactured before a certain date have a bug in their floating-point execution unit, which returns less than full-precision results for some combinations of divisor and dividend when performing a floating-point division (FDIV).

Intel (P6) Pentium(R) Pro

Introduced on 1/11/95. Operating speed from 150 to 200 MHz, 5.5 million transistors, 0.35 micron, 4GB addressable memory, 64 terabytes virtual memory.

The following are the major differences between the Pentium(R) and the Pentium(R) Pro:

1. The Pentium(R) uses a 5-stage pipeline, whereas the Pro uses 12 stages.

2. The Pentium(R) can execute 2 instructions per cycle, whereas the Pro removes the constraint of linear instruction sequencing between the 'fetch' and 'execute' phases and opens a wide instruction window using an instruction pool (att: 1).

3. The Pentium(R) is a level 2 superscalar processor, whereas the Pro is a level 3 superscalar processor.

4. The Pentium(R) relies on an L2 cache that increases memory latency (and a larger L2 cache results in higher cost), whereas the Pro is designed from an overall system implementation perspective, which allows higher performance.

The following are the added features of the Pentium(R) Pro:

* Dynamic execution drives high performance. Dynamic execution is a unique combination of three techniques: multiple branch prediction (the processor looks multiple steps ahead and predicts which branches or groups of instructions are likely to be processed next), data flow analysis (the processor analyzes which instructions depend on other instructions' results and creates an optimized schedule of the instructions), and speculative execution (a generalized mechanism that permits instructions to be started 'early'; the results are stored temporarily in the reorder buffer (ROB), since they may be discarded depending on program flow).
Because of speculative execution, instructions are not held up by data dependencies on program order.

* The Pentium(R) Pro is implemented as three independent engines (fetch/decode unit, dispatch/execute unit, retire unit) coupled together (att: 1).

* Superscalar level 3 processor. A superscalar processor can process more than one instruction per clock cycle. The Pentium(R) Pro can dispatch and retire three instructions per cycle; therefore it is a level 3 superscalar processor.

Future of Intel: P7 CPU

Intel, with partner HP, has begun development of its next-generation processor, the P7 (compatible with the 80x86 series). It is based on "Very Long Instruction Word" technology, which may let the 80x86 series architecture finally fade away.

References:

1. Byte Magazine (May 93)
2. Byte Magazine (Nov 95)
3. Don Anderson and Tom Shanley, Pentium(R) Processor System Architecture, MindShare, Inc.
4. The Web Page: http://pentium.intel.com/
5. Intel Microprocessor Vol 1 & 2
6. iAPX 86/88, 186/188 User's Manual (Intel)
7. iAPX 286 Programmer's Reference Manual (Intel)
8. Leventhal, Introduction to Microprocessor
9. Barry B. Brey, The Intel Microprocessors, 3rd Edition

att1 (figure): block diagram of the Pentium(R) Pro showing the system bus, L2 cache, bus interface unit, L1 I-cache, L1 D-cache, fetch/decode unit, dispatch/execute unit, retire unit and instruction pool. Caption: the three core engines interface with the memory subsystem using the 8K/8K caches.