Journal of Pervasive 64 bit Computing
Main Blog Page

Analysis of IT trends and competitive strategies, with emphasis on microprocessors, computer systems and networks. Based on the latest news and backed up with real data, this site intends to provide a true and real-time picture of the fast-changing IT landscape. This journal strives to be accurate on facts and sharp on criticisms. You may email your opinion to sharikou@yahoo.com or post comments here; be cool and intelligent.

Name: Sharikou, Ph. D.

Freelance journalist on IT matters. Some of my writings have been published in online IT journals. Any original content on this journal is copyrighted, but it's free for non-commercial use. Any trademarks used on this site belong to their respective owners. Some of the pictures are links. If there is any issue with the content of this site, please email sharikou@yahoo.com.


Saturday, June 10, 2006

Conroe is a stopgap solution, AMD 4x4 is a permanent solution

A "4x4" sticker is just like a SLI sticker: it's a differentiation that devalues the competition. -- Sharikou

Some people say that AMD 4x4 is a stopgap solution. Such a view is naively wrong.

From 2002 to 2005, PC performance gained very little. The Pentium 4 reached 3GHZ in 2002; by the end of 2004 it was at about 3.6GHZ, only a 20% improvement in two years. AMD's Athlon 64 did outperform the P4, but the margin was not huge, around 10-30%. Then in 2005 AMD introduced dual-core technology, and system performance almost doubled overnight.

Today, personal computing is in a new era. CPU core improvement is still important for incremental gains, but from now on, performance jumps will come primarily from increasing the number of computing engines. Multiple cores are the answer for the future.

You never get enough performance. On the graphics front, we have moved from dual graphics cards to quad graphics cards. On storage, we have multiple hard drives forming RAID to increase speed and reliability. On memory, we have dual channels to increase bandwidth. The theme is clear: more is faster. But strangely, so far, we have only one CPU inside a typical PC.

To fully utilize the computing power of multiple cores across multiple CPUs, you must have an inter-core, inter-processor, core-memory and core-I/O communication platform to deliver the compute cycles to the outside world. This is distributed computing. AMD's Direct Connect Architecture, with its crossbars, cache-coherent HyperTransport and integrated memory controller, is the most elegant communications platform that exists today.

AMD 4x4 brings the glueless ccNUMA capability of AMD64 to the desktop and doubles the number of compute engines there. With two dual-channel memory controllers and double the HyperTransport links, 4x4 doubles memory bandwidth to 25.6GB/s and I/O bandwidth to 16GB/s. 4x4 also allows one to have 8 slots for non-registered DDR2. With the efficiency of the Direct Connect Architecture, 4x4 almost doubles system performance. As AMD and Intel are expected to be closely matched in multi-core CPU development, 4x4 is a long-term solution to pin Intel at 50% of AMD's performance on the desktop. As we move to quad-core or octal-core, the same 2x multiplier on the 4x4 platform will set Intel back 50% again and again.
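
One way to arrive at the 25.6GB/s and 16GB/s figures is sketched below. The post does not state the memory speed or link clock, so DDR2-800 on 64-bit channels and 16-bit HyperTransport links at 1GHz (double data rate, both directions counted) are assumptions made only for this illustration, and the function names are made up for the example.

```python
# Peak-bandwidth arithmetic for a two-socket 4x4 system.
# Assumptions (not stated in the post): DDR2-800 memory, 64-bit channels,
# and 16-bit HyperTransport links clocked at 1 GHz with double data rate.

def memory_bandwidth_gbs(sockets, channels_per_socket, mega_transfers, bus_bytes=8):
    """Peak memory bandwidth in GB/s (1 GB/s = 1000 MB/s)."""
    return sockets * channels_per_socket * mega_transfers * bus_bytes / 1000


def ht_bandwidth_gbs(links, clock_ghz, width_bits=16):
    """Peak HyperTransport bandwidth in GB/s, counting both directions."""
    giga_transfers = clock_ghz * 2                  # double data rate
    per_direction = giga_transfers * width_bits / 8
    return links * per_direction * 2


mem_gbs = memory_bandwidth_gbs(sockets=2, channels_per_socket=2, mega_transfers=800)
io_gbs = ht_bandwidth_gbs(links=2, clock_ghz=1.0)
print(f"memory: {mem_gbs:.1f} GB/s, I/O: {io_gbs:.1f} GB/s")
```

These are peak numbers; sustained bandwidth on real workloads would be lower.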

The concept of 4x4 isn't limited to two CPU sockets. One can have 4 graphics cards, so why not four CPU sockets if needed? Also, one of the sockets may be used for Torrenza cards.

Now, what about Conroe? Conroe is merely an improvement on the execution engine. It sits on the same communication platform: the shared FSB. Back in 2003, Intel was talking about 10.2GHZ Pentium 4 by 2005. According to Intel, Conroe will be a 40% improvement over Pentium XE. Intel is clearly behind schedule on their performance goal. Intel's FSB based approach is not scalable and doesn't represent a challenge to AMD64. Eventually, Intel will have to follow AMD and create a communications infrastructure to deliver the compute cycles. Just like what Nvidia's SLI did to ATI, AMD 4x4 will force Intel to react. A 4x4 sticker is just like a SLI sticker: it devalues the competition.

AMD 4x4 will deliver an equivalent of 18 GHZ Pentium 4 cycles with an efficiency factor of 90%. It will more than double that with the K8L quadcore. Now, that's compelling.

In 2Q07, the K8L will be out. AMD is expected to widen its lead at the single-core level. Coupled with 4x4, that will be a 3x performance lead over Intel. Conroe will have to be scrapped and redone. Its shelf life is no longer than that of a banana. AMD 4x4 will continue for the foreseeable future, at least two to three years.

Now tell me, which is a stopgap measure? 4x4 or Conroe?

33 Comments:

Anonymous said...: Please help me understand how you came up with these numbers.

"AMD 4x4 will deliver an equivalent of 18 GHZ Pentium 4 cycles with an efficiency factor of 90%."

Very good blog. Keep up the good work Doc.; 5:31 PM, June 10, 2006
Anonymous said...: 4x4 for xmas waahoo u gota love Amd; 5:35 PM, June 10, 2006
Sharikou, Ph. D. said...: "AMD 4x4 will deliver an equivalent of 18 GHZ Pentium 4 cycles with an efficiency factor of 90%."

Look at single core, the 2.4GHZ socket 939 is rated at 4000+. Therefore a 2.8GHZ K8 core should be 4600+. Four such cores, should deliver equivalent of 18 Gig P4 cycles per second.; 5:37 PM, June 10, 2006
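
The arithmetic in this reply can be written out as a small sketch. It assumes, as the reply does, that a model number such as 4000+ can be read as a 4.0GHZ Pentium 4 equivalent; the function name is invented for the example. Four cores at 4600+ sum to 18.4, and applying the 90% efficiency factor to that sum gives about 16.6, so on this reading the 18 figure corresponds to the raw sum.

```python
# Model-number arithmetic from the reply above.
# Assumption: a "4000+" rating is read as a 4.0 GHz Pentium 4 equivalent.

def p4_equivalent_ghz(cores, rating_per_core, efficiency=1.0):
    """Sum of per-core ratings, in GHz of Pentium 4, scaled by an efficiency factor."""
    return cores * (rating_per_core / 1000) * efficiency


linear_rating = 4000 * 2.8 / 2.4            # 2.4 GHz rating scaled linearly to 2.8 GHz
raw = p4_equivalent_ghz(4, 4600)            # four cores at 4600+, no losses
derated = p4_equivalent_ghz(4, 4600, 0.9)   # same, with the 90% efficiency factor
print(round(linear_rating), round(raw, 2), round(derated, 2))
```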
Anonymous said...: "pin Intel at 50% of AMD's performance on the desktop"

At what price though?; 5:41 PM, June 10, 2006
Sharikou, Ph. D. said...: At what price though?

The price differential will be less than the price of one CPU, because AMD motherboards will be cheaper: no 23-watt north bridge is needed on the AMD platform.; 5:44 PM, June 10, 2006
Anonymous said...: I am sure ..with all the hype past the Conroe , Intel is really hating AMD for the 4X4.; 5:51 PM, June 10, 2006
Anonymous said...: in theory this is the case where 2.4ghz x 4 = 9.6ghz yield at 100%. as far as i know, smp design is hardly a linear matter and there will be diminishing returns, roughly mirroring n/(n x (1.4 ^ (n - 1))) efficiency even for a well organized mp cluster, without taking into account the effect of memory link topology. for example, the spec score of a single system containing 8 opterons is never greater than the sum of the scores of 8 systems that each have their own 1p opteron chip. under such a scenario, intel's current architecture is even worse. this is a classic situation in which 1 + 1 + 1 + 1 is not going to equal 4, due to diminishing returns caused by other factors.

however, torrenza is a great technology if it can be done right.

for instance, a low power x2 3800 (35W), a dedicated gpu with power throttling on a direct HT3 link with its own memory using standard ddr2 of customizable amount, a dsp hardware engine that can serve voice recognition and other dsp functions, and a spare socket for a second gpu or low power x2 3800. such a system would be sweet.

remember the ibm, motorola, apple campaign of PowerPC 601 and 604 when ibm promised a multi-personality os that can be controlled via voice command. this is the system that i am waiting for.

user: check today's news

pc will automatically verify user's voice print and authorize use and speak out today's headlines

user: open spreadsheet that i started yesterday.

pc will go through its file system to locate and open the last record of excel spreadsheet that was created yesterday

we are all so crazed with the speed and lost track of how the pc is supposed to improve our daily function. just using them as a heater or a game machine is not the answer.

sunhing; 6:00 PM, June 10, 2006
Sharikou, Ph. D. said...: smp design is hardly a linear matter and there will be diminished return roughly mirror

AMD64 enjoys almost linear scaling from 1P to 4P. A 1P Opteron 850 gets Spec_int_rate2000 score of 19.3. A 2P gets 37.8, 4P 71.8.; 6:30 PM, June 10, 2006
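
The scaling implied by these three scores can be checked with a short sketch. Only the scores come from the reply; the helper name is made up for the example. Relative to the 1P result, the 2P score works out to roughly 98% scaling efficiency and the 4P score to roughly 93%.

```python
# Scaling efficiency from the SPECint_rate2000 scores quoted in the reply.

scores = {1: 19.3, 2: 37.8, 4: 71.8}    # processors -> score


def scaling_table(scores_by_processors):
    """Return {processors: (speedup over 1P, efficiency per processor)}."""
    base = scores_by_processors[1]
    return {
        n: (score / base, score / base / n)
        for n, score in scores_by_processors.items()
    }


table = scaling_table(scores)
for n, (speedup, efficiency) in sorted(table.items()):
    print(f"{n}P: speedup {speedup:.2f}x, efficiency {efficiency:.1%}")
```

SPECint_rate runs independent copies of each benchmark, so this is throughput scaling rather than the speedup of a single program.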
Anonymous said...: Hi, Sharikou; I invite you to join the discussions on Aceshadware.com! Your arguments could be used well in their forums!; 7:28 PM, June 10, 2006
"Mad Mod" Mike said...: The only problem with Opteron scaling is the memory latency for ccHT, with HT-3.0, that is gone. 16-bit links at 5.2GHz providing 40GB/s inter-CPU bandwidth vs. 8 GB/s now means scaling will increase to near 100%.; 8:10 PM, June 10, 2006
Anonymous said...: wow....i can really see some AMD dogs licking each others' wounds.

2 cpus, how much power is it going to consume? if you can do 2 cpus tasks with only one cpu, why not go for one cpu?

i also heard 4x4 is only available for Fx series. $1000 x 2, that's 2000 dollars!! woot!! i can probably build two conroe systems with that money, both can outperform your 4x4 in gaming.

also..."AMD 4x4 will deliver an equivalent of 18 GHZ Pentium 4 cycles with an efficiency factor of 90%." any proof?; 8:14 PM, June 10, 2006
Anonymous said...: "The only problem with Opteron scaling is the memory latency for ccHT, with HT-3.0, that is gone. 16-bit links at 5.2GHz providing 40GB/s inter-CPU bandwidth vs. 8 GB/s now means scaling will increase to near 100%."

I don't know what you're talking about.

http://www.hypertransport.org/tech/tech_htthree.cfm?m=3

HT3.0 only supports speeds up to 2.6GHz. It's a 5.2GT/s link, which means at maximum a single 16-bit link provides 10.4GB/s. Two 16-bit links provide 20.8GB/s of aggregate bandwidth. This is assuming that AMD runs it at maximum bandwidth, which is unlikely since you'd never want to risk running your products at the edge of spec. Current links run at 1GHz even though HT2.0 allows for up to 1.4GHz. The extra room leaves something for overclockers to play with, which is always good marketing. Your numbers though are completely exaggerated.; 8:43 PM, June 10, 2006
Sharikou, Ph. D. said...: It's a 5.2GT/s link which means at maximum a single 16bit link provides 10.4GB/s.

It's bidirectional, so bandwidth is 20.8GB/s. I don't think there is a big latency problem in 2P. It's just one hop.; 9:06 PM, June 10, 2006
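
The two figures being argued over differ only in whether both directions of the link are counted. A minimal sketch of the arithmetic, using the 2.6GHz clock and 16-bit width quoted in the thread (the function name is made up for the example):

```python
# HyperTransport 3.0 link bandwidth from clock and width.
# 2.6 GHz with double data rate gives 5.2 GT/s, as quoted in the thread.

def ht_link_gbs(clock_ghz, width_bits, bidirectional=False):
    """Peak bandwidth of one link in GB/s."""
    giga_transfers = clock_ghz * 2                  # double data rate
    per_direction = giga_transfers * width_bits / 8
    return per_direction * 2 if bidirectional else per_direction


one_way_16 = ht_link_gbs(2.6, 16)
both_ways_16 = ht_link_gbs(2.6, 16, bidirectional=True)
both_ways_32 = ht_link_gbs(2.6, 32, bidirectional=True)
print(f"{one_way_16:.1f} {both_ways_16:.1f} {both_ways_32:.1f}")
```

A 16-bit link comes to 10.4GB/s per direction and 20.8GB/s counting both directions; only a 32-bit link counted in both directions reaches about 40GB/s.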
Anonymous said...: Can we clear this issue once and for all!

Does 4x4 allow only FX CPUs or X2 also?; 9:28 PM, June 10, 2006
"Mad Mod" Mike said...: "It's bidirectional, so bandwidth is 20.8GB/s.I don't think there is a big latency problem in 2P. It's just one hop." - The problem isn't 2P, it's 4P and 8P. On a 4P Opteron 64 system, there are 3 hops to memory, which amounts to about 300-400 clock cycles and a 200ns+ delay, vs. 45ns for local memory.

HT 3.0 is 16-bit x2 (full duplex), so it is 32x5200/8 = 20GB/s, but there can be 32-bit x2 links as well, just as there can be 8-bit x2 links. Most likely you are correct: you will see dual 16-bit links, and that is 20GB/s of bandwidth per CPU. According to AMD, HTT 3.0 can provide 40GB/s. My numbers are completely correct according to the HTT 3.0 whitepaper; the only thing wrong was my 16-bit links providing 40GB/s. It is 32-bit links.

20GB/s vs. 8GB/s each link means 60GB/s for 4P vs. 24GB/s, which is a hell of an improvement.; 10:41 PM, June 10, 2006
Ajay S. said...: AMD's multicore, multi CPU approach surely seems to be the way ahead for better performance.

Most software (even desktop) will be modified to utilize dual-core and quad-core processors anyway, and should have no problem scaling on multi-CPU systems.

If dual CPU motherboards support even the lowest Athlon AM2 processors, it'll be a killer combination that should see Intel guys running back to their drawing boards by early next year :)

CAD / CAE Engineers in our company are asking for more in every meeting after we bought one 2P opteron system for a new project.

A reasonably priced 4x4 should have buyers of high-end systems salivating for one.; 12:31 AM, June 11, 2006
Anonymous said...: *quote*Can we clear this issue once and for all!

Does 4x4 allow only FX CPUs or X2 also?*/quote*

Apparently the CPUs will need to be Socket AM2, and they will also need a coherent hyper transport link. At the moment, only Opterons and the FX62 have the coherent hyper transport link needed. There are no Opterons for AM2 either, so right now the answer is that you need two FX 62 CPUs. There's nothing to stop AMD from releasing cheaper processors with a coherent hyper transport link, which is quite likely to happen.; 7:47 AM, June 11, 2006
Anonymous said...: "AMD64 enjoys almost linear scaling from 1P to 4P. Spec_int_rate2000 ..."

SPEC_int_rate2000 is easy to optimize to local memory. Good luck with other applications. Fact is the average memory latency in an Opteron 2P is much worse than a 1P. Why don't you post some real gaming benchmarks?

One more thing. How is this 4x4 core doubling strategy different from Intel releasing a dual-core based on dual-die?; 10:03 AM, June 11, 2006
Sharikou, Ph. D. said...: One more thing. How is this 4x4 core doubling strategy different from Intel releasing a dual-core based on dual-die?
You need to get re-educated. Tech advances fast, your knowledge is pretty much obsolete.
Does DCA mean anything to you?
For latency on Opteron, see This page, on that page, there is a link to IBM research report on latency...; 10:42 AM, June 11, 2006
Anonymous said...: Dual core based on dual die would be comparable if it had multiple FSB's.

This appears to have separate memory channels on each processor.; 10:46 AM, June 11, 2006
Sharikou, Ph. D. said...: Dual core based on dual die would be comparable if it had multiple FSB's.

Well said. I found those who are pro-AMD much more knowledgeable and technically proficient than those who are pro-Intel. A 4x4 is just like a 4 way Opteron server, a double dual die Intel is like a 4 way Xeon server. Intel has pretty much given up on 4 way Xeons -- if DELL will have to go 4 way on AMD.; 10:55 AM, June 11, 2006
"Mad Mod" Mike said...: Link - I've taken the liberty to clear up some confusion people have about AMD64.; 11:08 AM, June 11, 2006
Anonymous said...: "Does DCA mean anything to you?
For latency on Opteron, see This page, on that page, there is a link to IBM research report on latency..."

Yes, DCA is AMD's marketing name for NUMA. It took many years for *some* server software to become NUMA-optimized. Are you telling me the gaming industry will care enough to optimize for the 4x4 niche? And even if they do, 2x scaling is out of question as your SPEC_int_rate2000 implies.

Did you post the link to the IBM report and hoped no one would read it? The conclusion is that "applications exhibiting little or no parallelism may be as much as 15% SLOWER on dual-core processors." Let alone dual-socket... So your 2x for games is a pipe dream. Also, what exactly in figure 4 disproves my point that 2P latency is higher than 1P?; 11:24 AM, June 11, 2006
Sharikou, Ph. D. said...: "applications exhibiting little or no parallelism may be as much as 15% SLOWER on dual-core processors."

Did you check the clockspeeds? IBM is saying something obvious here: dual core has a lower clockspeed. Actually, this again proves our point that the future is with multiple cores. A 3GHZ single core opteron is 15% faster than a 2.6GHZ dual core opteron in single threaded loads. However, the 2.6GHZ dual core has 5.2 giga cycles per second of computing power.

The amount of gaming performance gain from AMD's advanced ccNUMA architecture depends on software. I think the game developers are working hard to push their games faster, and I think games are easy to parallelize. Look at this Quake 4 benchmark: with the same amount of graphics, dual core runs 50% faster than single core. Now what if we also double the graphics?; 11:40 AM, June 11, 2006
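
The clock-speed comparison in this reply is plain arithmetic, sketched below with made-up function names. It assumes single-threaded performance scales directly with clock speed on the same core design, which is the reply's own premise.

```python
# Single-thread advantage vs. aggregate cycles, using the clocks from the reply.
# Assumption: per-thread performance scales linearly with clock speed.

def single_thread_advantage(fast_ghz, slow_ghz):
    """Fractional advantage of the faster clock on one thread."""
    return fast_ghz / slow_ghz - 1.0


def aggregate_ghz(cores, clock_ghz):
    """Total cycles per second across all cores, in GHz."""
    return cores * clock_ghz


advantage = single_thread_advantage(3.0, 2.6)
single_total = aggregate_ghz(1, 3.0)
dual_total = aggregate_ghz(2, 2.6)
print(f"single-thread advantage: {advantage:.1%}")
print(f"aggregate: {single_total:.1f} GHz vs {dual_total:.1f} GHz")
```

The aggregate figure is only reachable when the workload keeps both cores busy.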
"Mad Mod" Mike said...: "Yes, DCA is AMD's marketing name for NUMA. " - DCA and NUMA are related to each other, yes, but DCA goes beyond just NUMA. DCA is in place to provide point to point communication for RAM, System I/O, and Inter-CPU communication.

I already pointed out that there is obviously a greater delay on a DCA design, but it provides much higher bandwidth than that of a Single FSB solution. With HT 3.0 and 20GB/s per Link, it now means in a 4P Opteron system, there is 320GB/s System I/O bandwidth maximum, vs. 48GB/s maximum now. I say 320GB/s because Socket 1207 Opterons using HT 3.0 will have 4 HT links at 32-bit or 8 running at 16-bit (2x16/2x8) and all of them can be utilized with the added ability for co-processors.

On an 8P system, this is where it will truly shine, as the bandwidth can exceed 500GB/s on fully loaded systems. You will likely see less than this on standard systems, but the possibility and likelihood of integration for Torrenza and the like means performance levels over today's computers and the coherency of next-gen Opteron platforms is improved 4 fold.; 11:49 AM, June 11, 2006
Sharikou, Ph. D. said...: DCA and NUMA are related to each other, yes, but DCA goes beyond just NUMA.

That's very true. In most ccNUMA designs, the processors share another bus-like network. DCA simply removes such buses and connects directly with the shortest route. With K8L, each CPU has 8 ccHT links to connect with the other 7 processors. For anyone who has looked at this picture, it's mind boggling. K8L will set the rest of the industry back 50%.; 11:54 AM, June 11, 2006
Anonymous said...: in 2003, Intel was talking about 10.2GHZ Pentium 4 by 2005.

AMD said exactly the same, back in 2003...

http://www.neoseeker.com/news/story/2865/

But without doubt AMD has the advantage when it comes to multiple socket systems, but show me one average consumer who's interested in buying a 2, 4 or even 8 way system...
(and btw even intel plans something like HT -> CSI, though it's scheduled for somewhere around the end of 2007)

Even if your prediction comes true and the 4x4 platform is cheap (cheap compared to server/workstation prices), how many people are going to buy this? Right, the enthusiasts, which are probably less than 1% of the market.

Imho we will see dualcores becoming standard in the low-end segment first, before we see multicore/multisocket systems being standard in the mainstream segment.; 1:41 PM, June 11, 2006
Anonymous said...: Well nice to see what people think about the 4x4. i can give you a hint. in a normal opteron setup (like i have one to play with) = a dual 280 it outperforms a conroe extreme at 3.8 this giving cpu score in 3dmark2006
you could say, well not that impressive... some more: an oc'ed one to 285 gives 15% more cpu score than the conroe. knowing how hard it is to increase the cpu performance in 3dmark2006 and knowing that the opteron has no fast cas memory, what will it do on a desktop platform with fast cas and a special link to an additional bridge to support the additional graphics... this will give a huge blow away of anything available now for games and benches except the non multicore support in superpi and 2001se (2 x fx62 @ 2.8) non oced but easy to get to 3.0 - 3.2.

about cost and power.
may i remind that intel always just talks about TDP... AMD gives TDP. but then again it will still be higher thats true. all extreme gamers and oc'ers don't care about watt as long as it performs...; 2:01 PM, June 11, 2006
Anonymous said...: To all those using HT bandwidth or single to dual-core scaling, you're missing the point.

Latency matters for scaling. Having lots of bandwidth only helps *loaded* latency stay close to those unloaded latency numbers so prevalent in marketing material. But all the bandwidth in the world will not change the fact that Opteron 2P average latency is larger than 1P.

Single vs dual-core scaling studies are in the context of one memory latency. Not the case for 1P vs 2P! So you can use Quake4's 50% 1c-2c scaling as a best case number for 1P-2P. Talk about "permanently pinning down Intel to 50% performance."

Oh yeah, game writers will work hard to optimize for NUMA. That's where the profits are, not making new titles for consoles. That made my day.; 5:30 PM, June 11, 2006
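
The latency point can be put in rough numbers. The sketch below models average memory latency as a weighted mix of local and remote accesses. The 45ns local and 200ns remote figures are the 4P, three-hop numbers quoted earlier in this thread; nobody in the thread gives a one-hop 2P figure, and the remote-access fractions are hypothetical values chosen only to show the shape of the effect.

```python
# Average memory latency as a weighted mix of local and remote accesses.
# 45 ns (local) and 200 ns (three hops on a 4P system) are figures quoted
# in this thread; the remote fractions below are hypothetical.

def average_latency_ns(local_ns, remote_ns, remote_fraction):
    """Expected latency when remote_fraction of accesses go to remote memory."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


latencies = {
    fraction: average_latency_ns(45.0, 200.0, fraction)
    for fraction in (0.0, 0.25, 0.5)
}
for fraction, latency in latencies.items():
    print(f"{fraction:.0%} remote: {latency:.2f} ns")
```

Extra link bandwidth does not appear anywhere in this formula, which is the commenter's point: it helps under load but does not shorten the remote path.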
Ajay S. said...: "Oh yeah, game writers will work hard to optimize for NUMA. That's where the profits are, not making new titles for consoles. That made my day"

Most of the NUMA support is provided by the OS, and no changes are required in a program unless the application spawns multiple processes, in which case optimization may be required.

I don't know what percentage of games spawn multiple processes for a single instance of the game, but I think I can safely assume they are only a handful if there are any :)

http://www.microsoft.com/whdc/system/platform/server/datacenter/numa_isv.mspx

SGI helped Linux move onto NUMA in 2003 and it has matured over the past two years. Windows 2K3 server too has NUMA support. And for the first time, Windows Vista will bring NUMA support to regular desktops. Once Windows Vista finally ships, NUMA will become the rule.

http://www.itweek.co.uk/itweek/comment/2148179/revelations-windows-vista

"applications exhibiting little or no parallelism may be as much as 15% SLOWER on dual-core processors." Let alone dual-socket... So your 2x for games is a pipe dream"

AMD, Intel, PS3 and Xbox 360 have opted to enable hardware support for multiple threads rather than upping the GHz. Does that leave game developers with any other choice? Games and mainstream applications that are CPU intensive will have to support parallelism / multiple threads if they wish to be in the market next year. Once multi-thread support is built into an application, 8 CPUs with 2 cores each or 2 CPUs with 8 cores each make little difference to the software itself, and it will be up to the hardware guys to ensure the platform performs well.

So software or games with multi-thread support will not need any kind of special optimization to perform on 4x4 or 8x4, which will especially be true once Windows Vista is here.

4x4 is a sensible choice if price is right.; 7:18 AM, June 12, 2006
Anonymous said...: ajay s: "Most of the NUMA support is provided by OS and there is no changes required in a program unless ..."

If you're a programmer, you probably know that "NUMA support" means just that: support, not a magic performance wand. It's there for programs which are *written* to take advantage of it. True, any program will run on NUMA without modification. But, programs need to be profiled and rewritten to get decent scaling on NUMA.

And, please, get your multi-thread and multi-socket/NUMA concepts straight. It's easier to program good multithreading performance under a single latency memory model (consoles, uniprocessor) than a 4x4 NUMA with two memories. NUMA often forces threads to reference data in the remote memory. Sometimes threads need to migrate for load balancing and their data is left behind, sometimes they need data produced by other threads. Which memory gives better graphics DMA throughput, etc. Not saying it's impossible, just that it's not automatic, as some seem to believe. And the game industry will not rush (if ever) to optimize for this unproven niche.

I'm done. If anyone still believes they will get 2x gaming performance on 4x4, they deserve the lighter wallet. But my advice would be to wait for the benchmarks.

For reference, let's ask the good doctor Sharikou to state his prediction for the average gaming performance gain on 4x4. His predictions are always true.; 7:19 PM, June 12, 2006
Ajay S. said...: "It's easier to program good multithreading performance under a single latency memory model"

Not saying it's impossible, just that it's not automatic, as some seem to believe.

yup, I was wrong there, thanks for correcting.

"If anyone still believes they will get 2x gaming performance on 4x4, they deserve the lighter wallet"

you don't get 2x performance scaling even when using more Opterons on expensive motherboards. We generally see 1.5x scaling when moving to a 2P Opteron system unless the software is really well optimized for such environments and is spawning more than 10 threads, in which case I have seen up to 1.8x gains.; 11:58 PM, June 12, 2006
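
One way to read these 1.5x and 1.8x figures is through Amdahl's law. That model is not used anywhere in the thread; it is brought in here only as an illustration, and it ignores NUMA latency effects entirely. Under it, a 1.5x gain on two processors implies that about two thirds of the work runs in parallel, and 1.8x implies about 89%.

```python
# Amdahl's law applied to the 2P speedups reported in the comment above.
# The model is an illustrative assumption, not something the thread uses.

def amdahl_speedup(parallel_fraction, processors):
    """Speedup when parallel_fraction of the work scales across processors."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)


def implied_parallel_fraction(speedup, processors):
    """Invert Amdahl's law: the parallel fraction that yields the observed speedup."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / processors)


fractions = {
    observed: implied_parallel_fraction(observed, 2)
    for observed in (1.5, 1.8)
}
for observed, fraction in fractions.items():
    print(f"{observed}x on 2P implies a parallel fraction of {fraction:.1%}")
```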
Anonymous said...: """wow....i can really see some AMD dogs licking each others' wounds.

2 cpus, how much power is it going to consume? if you can do 2 cpus tasks with only one cpu, why not go for one cpu?

i also heard 4x4 is only available for Fx series. $1000 x 2, that's 2000 dollars!! woot!! i can probably build two conroe systems with that money, both can outperform your 4x4 in gaming.

also..."AMD 4x4 will deliver an equivalent of 18 GHZ Pentium 4 cycles with an efficiency factor of 90%." any proof? """
__________________________________

didnt you learn math at school? can u multiply, or add? so this is your proof...

Torrenza will be ONLY for FX, but actually we got FX-62 at 700$, so, dont be lazy, you can get a killer system for 50% more buck, its a great deal, dont you think?

NOBODY would buy a Core X6800 or a FX-62... ONLY if you're a hardcore gamer or a enthusiast...

so if you'll spent like $6000 in a PC, build the BEST PC, not only a fast pc...

dont you think?

have a good pc is the essential TODAY, but if u wanna buy a gamer pc or a enthusiast pc, AMD is gonna kill Intel again with Torrenza...; 6:42 PM, July 23, 2006
