Thursday, May 04, 2006

With new architecture, Intel will be four generations behind AMD

After three years of hard work, Intel seems to have made a major improvement in its next-generation Conroe/Merom CPUs: its implementation of the AMD64 instruction set is looking good, unlike EM64T in the Pentium 4, which is about 10-20% slower than 32-bit mode.

However, 64 bit was just one of the five disruptive technologies AMD introduced with Opteron. Intel's new architecture will still be four generations behind AMD.

An AMD CPU consists of two major functional parts: execution and communication. AMD CPUs have circuits for Core-Core communications (XBAR), Processor-Processor communications (ccHT), Processor-I/O communications (HT) and Processor-Memory communications (IMC). The communication channels are dedicated and separate from each other. Also, since these communication circuits run at the CPU's clock speed, they have very high performance and consume little power. The communication circuits are also very intelligent; for instance, ccHT establishes a single physical memory space from multiple memory banks controlled by different CPUs.

Intel CPUs have none of the above. In Intel's architecture, all communications happen on an external shared bus controlled by an external chipset manufactured on a 130nm process. The fastest future Intel bus has a bandwidth of 10.6GB/s, less than the bandwidth required for DDR2 800MHz (12.8GB/s). If you add a couple of GbEs and SATAII drives to an Intel system, you are jamming the bus. When you add more Intel cores, you are choking the bus. An Intel quadcore system will be like an IBM XT connected to a 2400-baud modem.
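
The two bandwidth figures follow from simple arithmetic: peak bandwidth is the transfer rate times the width of the data path. Here is a minimal sketch, assuming a 64-bit (8-byte) data path, a 1333 MT/s front-side bus behind the 10.6GB/s figure, and dual-channel DDR2-800 behind the 12.8GB/s figure. Those parameters are assumptions for illustration; the post does not state them.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, bus_width_bytes, channels=1):
    """Peak theoretical bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return mega_transfers_per_s * 1e6 * bus_width_bytes * channels / 1e9

# Assumed parameters (not given in the post): 64-bit bus, 1333 MT/s FSB.
fsb = peak_bandwidth_gb_s(1333, 8)
# Assumed parameters: two 64-bit DDR2 channels at 800 MT/s each.
ddr2 = peak_bandwidth_gb_s(800, 8, channels=2)

print(f"FSB: {fsb:.2f} GB/s, dual-channel DDR2-800: {ddr2:.2f} GB/s")
print("Bus can carry full memory bandwidth:", fsb >= ddr2)
```

Under these assumptions the shared bus tops out below what the memory alone could deliver, before any I/O traffic is counted.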

We expect AMD to continue to innovate on both execution and communication with its next generation processors.

39 Comments:

Anonymous Anonymous said...

So, do you know anything about AMD's next-generation CPUs?

1:23 PM, May 04, 2006  
Anonymous Anonymous said...

Um. Look at Intel's Conroe. It has a SHARED L2 cache plus the L1s are connected. That's pretty good core to core communication. The L2 cache also has a 90ish GB/sec bandwidth which is plenty for what the L2 needs.

3:02 PM, May 04, 2006  
Blogger Sharikou, Ph. D. said...

Look at Intel's Conroe. It has a SHARED L2 cache plus the L1s are connected.

However, that's just a special hack for dual core only. At quadcore, Intel goes back to shared FSB again.

3:56 PM, May 04, 2006  
Anonymous Anonymous said...

is shared cache the ultimate way to eliminate latency problem though?

5:03 PM, May 04, 2006  
Blogger Sharikou, Ph. D. said...

is shared cache the ultimate way to eliminate latency problem though

By itself, shared L2 doesn't reduce memory latency.

Shared cache is a way to better utilize cache for single-threaded loads and to reduce cache coherence traffic between the two cores. Because of shared L2, you only have to worry about cache coherence in a much smaller L1. So that's an advantage there. But Intel's solution is a hack that can't be generalized to quad core. So, the Clovertown quad-core CPU is basically two Conroes connected via the FSB.

5:10 PM, May 04, 2006  
Anonymous Anonymous said...

So that means Conroe may excel in single-threaded applications, but may not perform so well in multi-threaded environments?

5:47 PM, May 04, 2006  
Anonymous Anonymous said...

Conroe's shared L2 cache might be a hack, but it's a good hack. And it DOES reduce inter-core communication latency - L2 cache access delay is much smaller than any interconnect.

Sure, it doesn't scale to quadcores. It might not even scale to faster core speeds, since shared cache access is very difficult to run fast. But Conroe's shared L2 cache is a good one for where and what it is, IMO.

2:00 AM, May 05, 2006  
Anonymous Anonymous said...

"If you add a couple of GbEs and SATAII drives to an Intel system, you are jamming the bus"
- hey Sharikou, what OS do you use? Have you ever heard of DMA and bus mastering?
I like your blog; it's always entertaining, if sometimes silly, but with stuff like this it becomes too absurd.
This is pure "smoke and mirrors", and if you accuse Intel of using those, you should refrain from using them yourself.

2:18 AM, May 05, 2006  
Anonymous Anonymous said...

I am hoping that if CONROE is as good as it seems (during the guerrilla benchmarking) you would be man enough to accept AMD losing the performance crown, or will you just continue to talk about KM8? Has anyone seen real numbers for that chip, or a release date?

"Four generations behind"... We will see. As for me, I am looking forward to real-world tests with both of the new cores, "CONROE" & "KM8". Till then we're all just blowing smoke.

7:29 AM, May 05, 2006  
Blogger Sharikou, Ph. D. said...

"If you add a couple of GbEs and SATAII drives to an Intel system, you are jamming the bus"
- hey Sharikou, what OS do you use? Have you ever heard of DMA and bus mastering?


Jamming is a good description. Of course, it is not like there are multiple guys riding the bus at the same time--that's impossible. But when one guy is using the bus (via the MCH), the others have to wait. Say you are running an Intel dual core and someone sends you an MSN message: the message gets delivered to one of the cores, and if during that time the other core needs to access memory, it has to wait for the bus to be free again....

That's how a shared bus works.
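
The waiting can be sketched as a toy model: one bus, requests served first come, first served, and each request holding the bus for its whole duration. The request names and time units below are made up for illustration; real front-side bus arbitration is more involved than this.

```python
def shared_bus_waits(requests):
    """Serve (name, arrival_time, duration) requests one at a time on a single bus.

    Returns a dict mapping each request name to the time it spent waiting.
    """
    waits = {}
    bus_free_at = 0
    # Serve requests in order of arrival (first come, first served).
    for name, arrival, duration in sorted(requests, key=lambda r: r[1]):
        start = max(arrival, bus_free_at)  # wait if the bus is still busy
        waits[name] = start - arrival
        bus_free_at = start + duration
    return waits

# Hypothetical workload: core 0 receives a network message while
# core 1 wants to read memory one time unit later.
requests = [("core0_network_message", 0, 4), ("core1_memory_read", 1, 2)]
print(shared_bus_waits(requests))
```

With dedicated channels the second request would start at once; on the shared bus it waits until the first transfer has finished.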

7:57 AM, May 05, 2006  
Blogger Sharikou, Ph. D. said...

Conroe's shared L2 cache might be a hack, but it's a good hack.

It's a tradeoff. There are two boundary situations: one is no sharing, where each core has a fixed cache size; the other extreme is thrashing, where each core gets the full cache in turns. Intel chose some middle ground, some sharing and some thrashing. How Intel's scheme works under multitasking loads is yet to be seen. So far, we see Conroe running SuperPI really fast. However, on more memory-intensive benchmarks, Conroe is slower than Athlon 64. (see the Conroe busted link)

8:02 AM, May 05, 2006  
Anonymous Anonymous said...

- hey Sharikou, what OS do you use? Have you ever heard of DMA and bus mastering?

Obviously, nonworkingrich, you haven't! DMA and BM allow you to bypass the processor in order to access memory directly, NOT to bypass the bus and processor!

Until Intel designs a wireless connection between memory and device, you WILL have bus bottlenecks!

12:37 PM, May 05, 2006  
Anonymous Anonymous said...

current Intel offering i.e. Core Duo is faster and more efficient than current fastest AMD offering. Just check the following benchmarks:
http://www.anandtech.com/mb/showdoc.aspx?i=2750
AMD is going down....

1:21 PM, May 05, 2006  
Anonymous Anonymous said...

to justin:
Yonah and Core architectures may give AMD a hard time, but they are far from killing it. It's just like when Hammer was released in 2003. Hammer was superior to Pentium in all aspects. Was Intel dead?

6:24 PM, May 05, 2006  
Blogger Sharikou, Ph. D. said...

Yonah and Core architectures may give AMD a hard time

Yonah and CORE have benefited from one trick: shared L2 cache which can be used 100% by a single thread. This gives a boost for single threaded loads. But, this is a rather simple trick that AMD can easily implement also, not much barrier there. In fact, I saw an AMD patent application on uspto.gov filed in 2004 for shared cache technologies.

What AMD has represents a major barrier for Intel. AMD's next gen will be able to do 32P glueless.

8:53 PM, May 05, 2006  
Blogger RawSushi said...


current Intel offering i.e. Core Duo is faster and more efficient than current fastest AMD offering. Just check the following benchmarks:
http://www.anandtech.com/mb/showdoc.aspx?i=2750
AMD is going down....


Intel has been focusing ALL their PR and benchmarketing effort on single CPU performance and people like you fall for this trick all the time.

Yes, the Core architecture closes the gap for single CPU performance but this is just half the battle-field. Intel is still years behind in the enterprise space and this is the most profitable space.

Right now server vendors can put 8 AMD CPUs (yes 8 CPUs not 8 cores) in the same system without doing expensive R&D on some proprietary chipset. Soon, this will be 16. Intel doesn't have an answer to this. Soon, the Xeon will only feature in the low end of the server market, while they focus their enterprise resources on the Itanium.

You think AMD is going down JUST because Intel has closed the gap on single CPU performance? You obviously don't have a clue...

9:27 PM, May 05, 2006  
Anonymous Anonymous said...

Following closely on the heels of APEXX 4, APEXX 8 doubles the number of processors to eight for a total of sixteen cores, and double the maximum memory to 128GB.
http://www.boxxtech.com/products/apexx8.asp

7:58 AM, May 06, 2006  
Anonymous Anonymous said...

> Right now server vendors can put 8 AMD CPUs (yes 8 CPUs not 8 cores) in the same system without doing expensive R&D on some proprietary chipset. Soon, this will be 16. Intel doesn't have an answer to this. Soon, the Xeon will only feature in the low end of the server market, while they focus their enterprise resources on the Itanium.

But no major OEM has, because scaling beyond 4S is poor on Opteron due to the rapid increase of cache coherency traffic and the ever increasing number of hops between Opterons.

Meanwhile, IBM has X3 using Xeon MPs, which matches the performance of Opteron at 4S and scales reasonably well to 32S. Unisys and Fujitsu-Siemens also have 4S+ Xeon MP systems. And these systems have far better RAS features.

1:05 PM, May 06, 2006  
Anonymous Anonymous said...

to ryan ho:
1. The benchmark I was referring to is available on the AnandTech web site, NOT Intel's, so please be serious and stop writing about Intel PR efforts.
2. Single CPU performance: the majority of computers sold today and in the future are single-CPU ones... dual core now, quad core in the future. All notebooks and most desktops and servers are single-CPU systems... I can agree that AMD might be the better choice for multi-CPU servers that run multithreaded programs...
However, to run single-threaded programs people will choose Intel, which will offer better performance for them (see the AnandTech benchmark). Instead of building an AMD system with 16 CPUs they will set up a set of blades with Intel Cores...
3. And in the end, Intel is not closing the gap in single CPU performance; the new Intel offering actually has much better performance than AMD, including AM2. Even in the AnandTech benchmark, the current Intel offering for the notebook market is MUCH BETTER than the AMD offering for the server market...
Just wait for Conroe and Woodcrest: these CPUs will crush AMD (more cache, faster FSB, further Core optimizations).
4. AMD is going down: it will end up in the same place currently occupied by some of the most prominent AMD customers, Cray and Sun. Better sell your AMD shares.

1:56 PM, May 06, 2006  
Blogger Sharikou, Ph. D. said...

An 8P Opteron system can be configured with a maximum of 3 hops for remote memory access. The latency is still smaller than that of an FSB-based system, while memory bandwidth increases 8x.
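
One way to see where a 3-hop worst case could come from is to count hops on a small graph. The sketch below assumes each of the 8 sockets uses three links to reach other sockets, wired as the corners of a cube. That wiring is an illustrative assumption, not a description of any shipping 8P board.

```python
from collections import deque

def cube_topology(dimensions=3):
    """Hypothetical wiring: sockets are cube corners, linked when their IDs differ in one bit."""
    return {n: [n ^ (1 << b) for b in range(dimensions)] for n in range(2 ** dimensions)}

def max_hops(graph):
    """Largest shortest-path length between any two sockets, found by BFS from every socket."""
    worst = 0
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        worst = max(worst, max(dist.values()))
    return worst

topology = cube_topology()
print("sockets:", len(topology), "links per socket:", len(topology[0]))
print("worst-case hops:", max_hops(topology))
```

In this assumed layout every link on every socket goes to another socket, so none are left over for I/O.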

This link shows a 4P Opteron trouncing a 16P Xeon.

1:59 PM, May 06, 2006  
Anonymous Anonymous said...

> An 8P Opteron system can be configured with a maximum of 3 hops for remote memory access. The latency is still smaller than that of an FSB-based system, while memory bandwidth increases 8x.

But no 8S Opteron system is so configured, and it can only be done if you use no HT links for I/O.

> This link shows a 4P Opteron trouncing a 16P Xeon.

IBM X3 is the fastest x86 server in the very important TPC-C benchmarks. It beats the HP DL585 at 4S, and of course there's no Opteron competition at 8S+. At 8S, it also grows memory bandwidth as you add sockets and no CPU is more than two hops away from memory.

Fujitsu-Siemens 4S/8 core Xeon MP system matches its own 4S/8 core Opteron system at SAP-SD.

5:00 PM, May 06, 2006  
Blogger Sharikou, Ph. D. said...

IBM X3 is the fastest x86 server in the very important TPC-C benchmarks.

TPC-C is a database transaction benchmark, where disk/storage performance is a key. You have to consider the storage components before reaching a conclusion on CPU part of total system performance. So, it's back to a price/performance and performance/watt question.

5:22 PM, May 06, 2006  
Anonymous Anonymous said...

> TPC-C is a database transaction benchmark, where disk/storage performance is a key. You have to consider the storage components before reaching a conclusion on CPU part of total system performance. So, it's back to a price/performance and performance/watt question.

And DBs are the most used server applications. And your earlier link used results for SAP, which is heavily DB based as well.

5:26 PM, May 06, 2006  
Anonymous Anonymous said...

Hi sharikou,

Yonah and CORE have benefited from one trick: shared L2 cache which can be used 100% by a single thread. This gives a boost for single threaded loads.

While I don't believe Yonah/Core is architecturally any better than P-III, I believe the shared L2 cache could be more than what you said.

It seems to me that it's possible, with some compiler support, for a multi-thread program to reserve say 128KB memory to stay in L2 for inter-core communication. This will benefit multi-thread performance quite a bit especially where synchronizations are needed.

6:28 PM, May 06, 2006  
Blogger Sharikou, Ph. D. said...

believe the shared L2 cache could be more than what you said.

I agree. For multi-threaded applications, more code can stay in cache because you only need one copy. Also, for producer/consumer type of multi-threading running on different cores, you don't have to copy the data between caches...

However, in general heavy-duty situations, caches can only help so much.

6:46 PM, May 06, 2006  
Anonymous Anonymous said...

IBM X3 is the fastest x86 server in the very important TPC-C benchmarks.

That "fastest" claim really has no significance at all. Look at these three rows grabbed from TPC-C results page:

a. HP ProLiant DL585-G1/2.4GHz/DC/4P
tpmC: 236,054, $/tpmC: 2.02
Avail.: 12/05/05

b. IBM eServer xSeries 460 8P c/s
tpmC: 250,975, $/tpmC: 5.74
Avail.: 11/30/05

c. IBM eServer xSeries 460 4P c/s
tpmC: 273,520, $/tpmC: 4.66
Avail.: 05/01/06

What would you wish to buy, eh? I would hardly buy the 8P if the (newer) 4P is faster. But look at the performance and price difference: why would I spend twice the money to have only an 18% performance increase? I'd gain much more had I spent the money on better discs.
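
A rough sketch of the arithmetic behind such a comparison, using only the three rows above. It assumes $/tpmC is the total priced system cost divided by tpmC, so cost can be recovered by multiplying the two published figures. The dictionary labels are shorthand for this example only, and the exact percentages depend on which two rows are compared.

```python
# Rows copied from the TPC-C results listed above: (tpmC, $/tpmC).
results = {
    "HP DL585 4P": (236_054, 2.02),
    "IBM x460 8P": (250_975, 5.74),
    "IBM x460 4P": (273_520, 4.66),
}

def total_cost(tpmc, dollars_per_tpmc):
    """Total priced system cost implied by the two published figures."""
    return tpmc * dollars_per_tpmc

def compare(baseline, other):
    """Return (performance ratio, cost ratio) of `other` relative to `baseline`."""
    base_tpmc, base_price = results[baseline]
    other_tpmc, other_price = results[other]
    perf_ratio = other_tpmc / base_tpmc
    cost_ratio = total_cost(other_tpmc, other_price) / total_cost(base_tpmc, base_price)
    return perf_ratio, cost_ratio

for name in ("IBM x460 8P", "IBM x460 4P"):
    perf, cost = compare("HP DL585 4P", name)
    print(f"{name}: {perf:.2f}x the tpmC at {cost:.2f}x the cost of the HP DL585 4P")
```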

Not to mention that Opterons are much more power-efficient.

6:53 PM, May 06, 2006  
Anonymous Anonymous said...

> What would you wish to buy, eh? I would hardly buy the 8P if the (newer) 4P is faster. But look at the performance and price difference: why would I spend twice the money to have only an 18% performance increase? I'd gain much more had I spent the money on better discs.

The 1st X3 system used 8 single-cores, the 2nd X3 system used 4 dual-cores, as does the HP system. The price differential of the IBM/HP systems comes almost entirely from IBM's use of more expensive fibre-channel drives versus HP's use of SCSI drives. Better for RAS purposes, but not much different in ultimate performance.


Not to mention that Opterons are much more power-efficient.

True, for now.

7:33 PM, May 06, 2006  
Anonymous Anonymous said...

For now, AMD is leading in performance per watt and most (not all) benchmarks in server configurations. The same for desktop space. This is a fact.
Whether Conroe/Woodcrest will change that is difficult to say right now because many tests show very mixed results - if you look at discussion lists, sometimes Conroe 2.66GHz is equal to or faster than Athlon FX~3GHz (overclocked). Conclusion?
Intel will be able only to match AM2 performance or win by 5-7% margin.
Faster Intel chips are coming in December and will face K8L/65nm beasts or socket F Opterons.
And we don't know what performance increase AMD is hiding in these chips with HT 3.0/65nm and improved cache/computation power.
If you combine these features and realize that SOI improves clock frequencies you can hope that AMD will be better in 2006/07 and beyond.

12:21 AM, May 07, 2006  
Anonymous Anonymous said...

> And we don't know what performance increase AMD is hiding in these chips with HT 3.0/65nm and improved cache/computation power.

HT 3.0 doesn't show up until 2008.

4:56 AM, May 07, 2006  
Anonymous Anonymous said...

Where did you find this information? For sure IBM will not stop using Intel Xeon processors. They only mentioned that Opteron requires a larger presence to improve growth of IBM.

6:34 AM, May 08, 2006  
Anonymous Anonymous said...

> The price differential of the IBM/HP systems comes almost entirely from IBM's use of more expensive fibre-channel drives versus HP's use of SCSI drives. Better for RAS purposes, but not much different in ultimate performance.

I disagree. A FC SAN will still mop the floor with a lowend iSCSI array, especially in demanding tasks like the TPC benchmarks. And I think I'm being charitable when I describe the HP NAS as low end; I don't think they sell anything lower in their storage line and it's not a great performing array anyways. Not only that, look at the difference in the number of spindles in these two configs--the IBM setups use nearly twice as many spindles as the HP. Plus a huge component in the price disparity is that the HP setup uses 32GB RAM while the IBM uses 128GB. Isolating CPU performance in these results is an exercise in futility; not only are the disk/RAM subsystems significantly different, but they're using different database systems (SQL Server vs. DB2) and even OS (HP with Windows 2k3 Server, IBM with Windows 2k3 Server x64). I think the HP system does very well in comparison to the much higher end IBM setup.

8:40 PM, May 08, 2006  
Anonymous Anonymous said...

I disagree. A FC SAN will still mop the floor with a lowend iSCSI array, especially in demanding tasks like the TPC benchmarks. And I think I'm being charitable when I describe the HP NAS as low end; I don't think they sell anything lower in their storage line and it's not a great performing array anyways. Not only that, look at the difference in the number of spindles in these two configs--the IBM setups use nearly twice as many spindles as the HP.

Or HP realizes that adding HDs results in diminishing returns and that there is no significant performance benefit.

Plus a huge component in the price disparity is that the HP setup uses 32GB RAM while the IBM uses 128GB. Isolating CPU performance in these results is an exercise in futility; not only are the disk/RAM subsystems significantly different, but they're using different database systems (SQL Server vs. DB2) and even OS (HP with Windows 2k3 Server, IBM with Windows 2k3 Server x64).

The fastest score posted by a HP DL585 uses a system with 128GB of memory, IBM DB2 UDB 8.2 and Microsoft Windows Server 2003 Enterprise x64 Edition.

10:29 PM, May 08, 2006  
Anonymous Anonymous said...

"If you add a couple of GbEs and SATAII drives to an Intel system, you are jamming the bus"

For the GbEs, CSA has been present since the 875P chipset, introduced in 2002/2003... So no Ethernet traffic traverses the bus. SATA traffic is tied to the ICH 5/6/7 southbridge... Again no swamping the bus. 975P supports FSB up to 1333MHZ in the latest revision of D975XBX for 13.3GB/s... Again no swamping the bus, and supporting DDR2 800. Intel is not stupid. They will add bandwidth to FSB as needed. - David, MCSE, BS CIS.

5:30 PM, May 09, 2006  
Anonymous Anonymous said...

For the GbEs, CSA has been present since the 875P chipset, introduced in 2002/2003... So no Ethernet traffic traverses the bus. SATA traffic is tied to the ICH 5/6/7 southbridge... Again no swamping the bus.

I don't buy that. Just because there is a south bridge doesn't mean the traffic won't show up on the north bridge. After all, the purpose of the SB is to mux/demux to NB & to CPU, and since memory banks are connected to the NB, even DMA to NIC or disks MUST go through the NB, too.

Not only will GbE/SATA traffic show up on the north bridge, it will also occupy NB-to-CPU bandwidth, unless DMA is used ALL the time. If you ever need to verify the hash of the data, or to render a JPEG file, or to compress a ZIP or to play an audio/video, EVERY BIT of that data must take precious bandwidth between the north bridge and the CPU.

For AMD's Athlon64 X2, only one core is affected. For Intel's CPUs, BOTH cores are affected. Intel's solutions don't scale, and GbE/SATA will inevitably make the situation worse, period.

11:05 PM, May 10, 2006  
Anonymous Anonymous said...

You are quite correct; I was looking at a different (and less powerful) configuration. But I still stand by my comment that storage systems make a world of difference in moderate to intense I/O loads like TPC-C. This has certainly been borne out in my company's experience, which includes running our own TPC benchmarks to evaluate both iSCSI and FC disk arrays. If HP thought as you suggest, then why do they submit TPC results using higher end FC arrays with their Integrity Itanium setups? I'm generally not one for conspiracy theories, but my suspicion is that HP hasn't submitted results for an Opteron + higher end FC array because they don't want to make their Itanium setups seem overpriced (or underperforming for the price, however you want to look at it). In my experience, managers do--rightly or wrongly--use the TPC results in product evaluations, so there certainly is a strong marketing angle behind how companies use TPC results.

11:56 AM, May 11, 2006  
Anonymous Anonymous said...

Instead of claiming Intel is XXX generations behind, do you have any real benchmarks???

I don't have any either, but an article I found may be of interest to you; there are some SPECint2000 benchmarks inside:"
Dempsey 3.73GHz-> 1,800 SPECint2000; 43 CINTrate-peak /May
Woodcrest 3GHz -> 2,400 SPECint2000; 59 CINTrate-peak /July
Opteron 2.8GHz Socket F -> 1,900 SPECint2000 ; 45 CINTrate-peak /September
"

The link is http://www.theinquirer.net/?article=30963

I know this is your blog, but please use some real numbers to support your statements. Conclusions without proof are called ridiculous.

7:48 PM, May 12, 2006  
Blogger Sharikou, Ph. D. said...

Woodcrest 3GHz -> 2,400 SPECint2000; 59 CINTrate-peak /July


The number above is not real. The Nova kid was just making his guesses there at INQ.

In real tests, Intel's 2007 quadcore got fragged by AMD's 2003 $80 chip.

9:45 PM, May 12, 2006  
Anonymous Anonymous said...

Sharikou - you are sad. You ridicule and find holes in other people's testing but you think extrapolating some number from your old Clawhammer and comparing that to some other data w/o a controlled test environment is ok. Man - you're sad. You know what, dude - blind fanboys like you end up hurting AMD's cause by being irrational and one-sided. I'm actually waiting to see how you spin up Intel's demise if Core 2 Duo is successful in Q3/Q4. I'm sure you'll say they are faking their numbers and it's Enron all over again.

8:17 AM, June 18, 2006  
Anonymous Anonymous said...

You do know that Intel actually invented the IMC idea? Google "Intel Timna". Back in 2000, before the first Alpha chip with an IMC came out in 2003, which is where AMD copied the IMC idea from. So indirectly, we find again that AMD copies their tech from Intel. The only thing I've ever seen that AMD has truly invented is x64 instructions, although at times I doubt they did even that.

1:03 PM, December 15, 2006  
