AMD64 is five generations ahead of INTEL
It's impossible for me to cover this topic in great detail, so I will hit the key points only.
AMD64 Instruction set
In February 2003, on the eve of AMD's launch of the AMD64 family of CPUs, INTEL expressed its disbelief. According to Richard Wirt, an INTEL senior fellow, four separate design teams at Intel had examined how the company could take one of its 32-bit chips and transform it into a 64-bit machine. All four teams concluded that such a feat was not doable.
INTEL did try hard to do 64 bit on x86, but its engineers didn't know how.
But the grand masters at AMD did what INTEL thought was impossible. Opteron 64 hit the market in April 2003 and quickly won almost all performance benchmarks.
Seeing is believing. INTEL tried to reverse engineer AMD's instruction set onto the Pentium 4 and the Pentium 4 based Xeon. Emulating the AMD64 instruction set was easier on the Pentium 4 because it had a 36 bit physical address. However, benchmarks show INTEL's EM64T runs slower in 64 bit mode than in 32 bit mode. Moreover, INTEL worked from some old AMD PDF files and did a bad job: some Microsoft and Linux code developed on AMD64 failed to run on INTEL's clone. As of today, INTEL's EM64T is still missing some crucial capabilities of AMD64.
But running AMD64 instructions on the Pentium III lineage is proving much harder. As of today, INTEL hasn't yet figured out how to do 64 bit on the Pentium M and Core Duo, both of which descend from the Pentium III design.
And AMD is not sitting idle: it's adding a new set of instructions to AMD64. INTEL engineers will have more sleepless nights digesting AMD PDFs.
The AMD64 architecture was designed to be true multi-core from the ground up. A multi-core CPU is much like a multi-processor system: the cores must communicate with each other to maintain consistency. Inside the AMD64 CPU, there is a crossbar switch that connects the multiple cores together, so they communicate internally and at extremely high speed. We see from benchmarks that a dual core Opteron is almost twice as fast as a single core Opteron at the same clock speed.
In comparison, INTEL's dual core implementation is a kludge. In INTEL's design, the two cores share the same FSB. When they need to communicate, they first go out to the FSB and come back again, without knowing they are sitting next to each other. The result? Poor performance.
This AnandTech article provides a good explanation of the dual core designs.
The Embedded Memory Controller
Chip design gurus have long realized that a major bottleneck in system performance is memory latency. Just as memory is much faster than a hard disk, the CPU is much faster than memory. When a CPU needs to access memory for instructions or data, it has to wait for the memory content to be retrieved. That waiting time is the latency, and while it waits for that data the CPU can't do anything useful.
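To get a feel for how much that waiting costs, here is a rough back-of-the-envelope sketch. The function name and all the numbers in it (base cycles per instruction, miss rate, latencies) are my own assumptions for illustration, not measurements from any real chip.

```python
def effective_cpi(base_cpi, misses_per_1000_instr, miss_penalty_cycles):
    """Average cycles per instruction when each trip to main memory stalls the CPU.

    Simplified model: every miss costs the full memory latency, with no overlap.
    """
    return base_cpi + misses_per_1000_instr * miss_penalty_cycles / 1000

# Hypothetical workload: 1 cycle per instruction when everything hits the cache,
# and 10 out of every 1000 instructions have to go all the way to main memory.
slow = effective_cpi(1.0, 10, 400)  # assumed 400-cycle memory latency
fast = effective_cpi(1.0, 10, 100)  # assumed 100-cycle memory latency

print(f"high latency: {slow} cycles per instruction")
print(f"low latency:  {fast} cycles per instruction")
print(f"slowdown:     {slow / fast}x")
```

Even with only 1% of instructions going to main memory in this made-up workload, the stall cycles dominate the total, which is why shaving latency matters more than the small miss rate suggests.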
In the old FSB based architecture (all of INTEL's), the memory controller is in an external chip called the north bridge. While the CPUs run at 2-3 GHz, the conventional memory controller runs at about 200 MHz. Furthermore, in the old FSB design, the data have to make two hops: from memory to the memory controller, then to the CPU. As we can see from this article, memory latency in a Pentium 4 design is between 300 and 400 clock cycles.
In the AMD64 design, the memory controller is embedded in the CPU and runs at CPU frequency, and the CPU connects directly to the memory without any intermediary. As we can see from this IBM test on single and dual core Opteron, memory latency on the Opteron is only about 50 nanoseconds for local memory access.
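The Pentium 4 figure is quoted in clock cycles and the Opteron figure in nanoseconds, so they need a common unit before they can be compared. Here is a minimal conversion sketch. The clock speeds (3 GHz for the Pentium 4, 2 GHz for the Opteron) are my assumptions, picked from the 2-3 GHz range mentioned above; the cited tests may have used different parts.

```python
def cycles_to_ns(cycles, clock_ghz):
    """One cycle at f GHz lasts 1/f nanoseconds."""
    return cycles / clock_ghz

def ns_to_cycles(ns, clock_ghz):
    """A CPU at f GHz completes f cycles per nanosecond."""
    return ns * clock_ghz

P4_CLOCK_GHZ = 3.0       # assumed, not taken from the cited article
OPTERON_CLOCK_GHZ = 2.0  # assumed, not taken from the IBM test

for cycles in (300, 400):
    ns = cycles_to_ns(cycles, P4_CLOCK_GHZ)
    print(f"Pentium 4: {cycles} cycles is about {ns:.0f} ns")

opteron_cycles = ns_to_cycles(50, OPTERON_CLOCK_GHZ)
print(f"Opteron: 50 ns is about {opteron_cycles:.0f} cycles")
```

Under these assumed clocks, the Pentium 4 range works out to roughly 100 to 133 ns, against 50 ns (about 100 cycles) for the Opteron's local access.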
Like the Opteron, all modern CPUs, such as Alpha EV7, IBM Power5, SUN UltraSparc T1, AMD Geode LX, Athlon 64, Sempron 64, Turion 64, have embedded memory controller(s).
INTEL's roadmap, as far out as 2009, shows no embedded memory controller design.
Cache Coherent HyperTransport (ccHT)
In an N processor AMD system, since each CPU has its own memory controller and associated banks of memory, there are N memory controllers, which provide N times the memory bandwidth. To have these N memory controllers act coherently, there are multiple ccHT links between AMD CPUs, which are used for fetching memory from another CPU. As we can see from the IBM document referenced above, in the case of remote memory access, the latency is also quite small.
INTEL is rumored to be working on something similar to ccHT called CSI. However, since the cancellation of the Whitefield project, CSI is missing from INTEL's foreseeable roadmap.
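The scaling argument is simple arithmetic, sketched below. The per-controller figure of 6.4 GB/s (the theoretical peak of dual-channel DDR-400) and the 80 ns remote latency are my assumptions for illustration; only the 50 ns local figure comes from the IBM test mentioned earlier.

```python
def aggregate_bandwidth_gbs(num_cpus, per_controller_gbs):
    """Peak memory bandwidth when every CPU brings its own memory controller."""
    return num_cpus * per_controller_gbs

def average_latency_ns(local_ns, remote_ns, local_fraction):
    """Blended latency when only part of the accesses hit the CPU's own memory."""
    return local_fraction * local_ns + (1 - local_fraction) * remote_ns

PER_CONTROLLER_GBS = 6.4  # assumed: dual-channel DDR-400 theoretical peak
LOCAL_NS = 50             # local access figure quoted in the text
REMOTE_NS = 80            # hypothetical one-hop remote access over ccHT

for n in (1, 2, 4, 8):
    print(f"{n} CPUs: {aggregate_bandwidth_gbs(n, PER_CONTROLLER_GBS):.1f} GB/s peak")

# Worst case for a 2-socket box: memory accesses spread evenly over both sockets.
print(f"blended latency: {average_latency_ns(LOCAL_NS, REMOTE_NS, 0.5):.0f} ns")
```

These are peak numbers. Real throughput depends on how well the operating system keeps each process's data in the memory attached to the CPU it runs on.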
Direct Connect Architecture
In an FSB based architecture such as INTEL's, the CPU, memory and I/O share the bandwidth of a single bus that carries traffic in only one direction at a time, just like many folks sharing one phone line in a conference call: only one guy can talk at a time. In the AMD64 architecture (Opteron, Athlon 64, Turion 64, Sempron), there are separate dedicated connections between CPU and memory, between CPU and I/O, between CPU and CPU, and between CPU core and CPU core. In AMD64, there is no crosstalk, and everything is bi-directional: traffic goes both ways at the same time.
According to its own roadmap, INTEL is stuck with the FSB architecture until at least 2009.
INTEL is 5 generations behind AMD, and there are other major areas where INTEL is lacking, such as an IOMMU for fast DMA. To match AMD in 2 core performance, INTEL will have to use a very large cache, which will negate the benefit of its shrink to 65nm. At the 4 core level and up, INTEL is simply hopeless.
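The conference call analogy can be put into a toy model: on a shared bus the transfers take turns, so their times add up, while on dedicated links they run side by side, so the largest one sets the pace. The transfer sizes and the 800 MB/s link speed are made-up numbers, and giving both designs the same raw speed is a deliberate simplification to isolate the effect of sharing.

```python
def shared_bus_time_s(transfers_mb, bus_mb_per_s):
    """All transfers take turns on one bus, so their durations add up."""
    return sum(transfers_mb) / bus_mb_per_s

def dedicated_links_time_s(transfers_mb, link_mb_per_s):
    """Each transfer has its own link, so the largest transfer sets the total."""
    return max(transfers_mb) / link_mb_per_s

# Hypothetical concurrent traffic in MB: memory, I/O, CPU-to-CPU.
transfers = [400, 200, 200]
SPEED_MB_PER_S = 800  # assumed, same raw speed for the bus and for each link

print(f"shared bus:      {shared_bus_time_s(transfers, SPEED_MB_PER_S)} s")
print(f"dedicated links: {dedicated_links_time_s(transfers, SPEED_MB_PER_S)} s")
```

The model ignores arbitration overhead and protocol details, so treat it as an illustration of the principle, not a prediction of real-world numbers.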