Wednesday, November 15, 2006

Intel Clovertown quad fragged by Opteron dual

I told you Intel's obsolete technology won't fly. You simply won't get extra mileage by strapping two engines together. Every Clovertown core has to fight for a share of a 266MHz bus; that's a sham of a situation.

On SpecFP_rate2000, a 2P, 4-core Opteron (2.8GHz) server scores 119.

In comparison, a 2P, 8-core Clovertown (2.66GHz) server gets a SpecFP_rate2000 score of 104.

In other words, 4 Opteron cores are 14.4% faster than 8 Clovertown cores. Put another way, a Clovertown core delivers only about 44% of an Opteron core's FP performance. This is just a repeat of the earlier situation of 4P Opteron fragging 16P Xeon. The SpecFP benchmark demands a lot of memory bandwidth, and that chokes Intel.
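For the doubters, here is the per-core arithmetic spelled out (a quick Python check using only the two scores above):

    # SpecFP_rate2000 scores quoted above
    opteron_score, opteron_cores = 119, 4   # 2P dual-core Opteron, 2.8GHz
    clover_score, clover_cores = 104, 8     # 2P quad-core Clovertown, 2.66GHz
    print(opteron_score / clover_score - 1)   # ~0.144 -> 14.4% faster overall
    per_core = (clover_score / clover_cores) / (opteron_score / opteron_cores)
    print(per_core)                           # ~0.44 -> one Clovertown core is ~44% of one Opteron core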

I told you Intel is 5 generations behind.

37 Comments:

Anonymous Anonymous said...

Barcelona will have AMD's 128-bit-wide multimedia pathways and superior virtualization performance, without the severe performance penalties inherited from Woodcrest's cloned 64-bit conversion.

IT departments will wait for the real thing unless their existing servers are under heavy load. If you need something today, then buy whatever meets your price/performance needs. Intel makes a nice processor; buy some of those.

11:30 AM, November 15, 2006  
Anonymous Anonymous said...

Is it just me, or was that test single-threaded? Meaning that on both systems only one core was used.

Also, you always say that AMD has 40% more FP power. Now it seems that at a 5.3% higher clock rate AMD has 14% more FP power; scaled to the same clock speed, that's ~8.7% more. A far cry from that 40%, even though this test should be a prime example of an FP-hungry application (benchmark).
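Spelled out, using the scores and clocks from the post:

    # scores and clocks quoted in the post
    score_ratio = 119 / 104    # ~1.144, i.e. 14.4% faster
    clock_ratio = 2.8 / 2.66   # ~1.053, i.e. 5.3% higher clock
    print(score_ratio / clock_ratio - 1)   # ~0.087 -> ~8.7% advantage at equal clocks (whole system, not per core)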

11:43 AM, November 15, 2006  
Anonymous Anonymous said...

How come all the reviews, hype, and hoopla are missing this time on the release of the quad core?
No Anand, Tom, or Hexus either...

12:09 PM, November 15, 2006  
Blogger S said...

How can you equate two entirely different benchmark runs?

One was done by Sun on their own OS, which has proven itself to be a high performer.

The other is on Linux, which has been put together by people who are more like enthusiasts.

It is not the hardware that's making the difference here, it is the software.

12:39 PM, November 15, 2006  
Anonymous Anonymous said...

Hey Sharikou Fu D. I thought you would have learned your lesson when you tried spreading all that FUD about the Conroes six months ago. You just don't give up, do you? Well, the benchmarks for Clovertown are coming in and guess what, yep, they're looking really good. Here, take a look :)


http://www.realworldtech.com/page.cfm?ArticleID=RWT111406114244
http://www.2cpu.com/review.php?id=114&page=1

Oh by the way how's that exploding chip thing going? LOL.

12:43 PM, November 15, 2006  
Anonymous Anonymous said...

Once again, Intel's FSB proves to be slooooww and Ooooold.

Intel lost another battle. I can bet 4x4 will frag Intel's quad core (two dual-core dies in one package) with ease.

12:44 PM, November 15, 2006  
Anonymous Anonymous said...

*hug*

1:02 PM, November 15, 2006  
Anonymous Anonymous said...

Once again you post an apples-to-oranges comparison between Xeon and Opteron.
Linux and Solaris results can NOT be compared directly.

Even AMD is MORE fair than you, Sharikou - have a look at these results:
1) http://www.spec.org/cpu2006/results/res2006q4/cpu2006-20060918-00110.html
2) http://www.spec.org/cpu2006/results/res2006q4/cpu2006-20060918-00111.html

They sponsored testing of Opteron and Woodcrest platforms under EXACTLY the SAME OS/compiler combination, and posted results which show a Xeon advantage.

It's a matter of self-esteem.

Stop posting this bullshit. You know very well that Xeon would ALSO BENEFIT from Solaris, and the results would still be in Xeon's favor.

1:04 PM, November 15, 2006  
Anonymous Anonymous said...

"I can bet 4x4 will frag intels Quad core(2 Dual cores in one die) with ease."

On power and cost you surely are correct.

1:16 PM, November 15, 2006  
Anonymous Anonymous said...

Kind of a screwed-up analogy there, Sharikpoo - strapping two engines together to get better fuel mileage? Your superior intellect failed you on that one.

Strap two auto engines together and you have more power. And in many cases you have better mileage.

Likewise, if you strap two processors together you have more power. Look at Core 2. Simply kickass. But this is the end of it, and Intel will have to change their ways of thinking.

You're right about the FSB though... its time has come and gone. I would be surprised if Intel won the future battles with an FSB.

1:30 PM, November 15, 2006  
Anonymous Anonymous said...

Also, if you don't go to Burger King, you might notice that the Sun system has twice as much system memory, as well as different compilers and, as stated previously, a different operating system.

2:47 PM, November 15, 2006  
Anonymous Anonymous said...

Speaking of 4x4, what happened??? I thought it was supposed to be out by now. Have they pushed it out, or does nobody care about a 2P server platform scaled down to a 2P enthusiast platform? I think AMD saw the numbers from Kentsfield on Anand's, Tom's, Tech Report, etc., and knew they would get embarrassed. I would not be surprised if they push this thing out a few more months until they get 65nm ramped, or maybe they should just wait until they get bought out... just kidding. Seriously, I'm worried about what Hector and the boys have been doing for the last couple of years on the tech side of the house. I give them props for the business side, but the engineers must have been given a long vacation for the great work they did on the K8.

2:49 PM, November 15, 2006  
Anonymous Anonymous said...

There is a huge Intel vs. AMD battle going on right now in SETI@Home, though most do not know about it.

The application will make use of every core on the host computer and run them at 100%. Using performance analyzers, we have found that the application spends a majority of its time requesting memory pages. As a result, the SETI application is considered a memory-intensive application.

To see which computers are leading in this realtime stress test, visit: http://setiathome.berkeley.edu/top_hosts.php

1)Intel. 2)Intel. You get the point.

All Mac Pros are Intel-based.

4:38 PM, November 15, 2006  
Blogger Christian Jean said...

To see which computers are leading in this realtime stress test, visit: http://setiathome.berkeley.edu/top_hosts.php

1)Intel. 2)Intel. You get the point.


Hate to burst your bubble, but first what is this list ordered by?

Is Jim Vennes@SETI.USA ranked #1 just because he happens to be first in recent average credit? What's 'recent', and how are the averages computed?

Anyway, your point is completely useless because:

1. I have an account and my CPU type is listed as a K5 (that's what I had when I registered). Now I have 7 computers crunching, mostly AMD X2s.

See how your information is bogus so far?

2. Each of those accounts has many computers crunching results, all credited to that one listed CPU type.

For example:

#1 has 549 PC's (incl. many AMD's)
#2 has 11 PC's (incl. AMD's)
#3 has 5 PC's (no AMD's)
#...

So now the question is: do you get the point?

Jeach!

9:10 PM, November 15, 2006  
Anonymous Anonymous said...

"#1 has 549 PC's (incl. many AMD's)"

Yes, 3 active AMD's is a lot. Four if you add one of the inactive ones.

"#2 has 11 PC's (incl. AMD's)"
To be precise, it has a single AMD, which has been inactive for over a month.

Here is an interesting comparison:
WU
C2D e6400
Athlon(tm) 64 Processor 3700+
As can be seen, the 3700+ has a bit more FP power and the C2D ~13.6% more integer performance. They both run at roughly the same clock speed, with the AMD having a ~3% lead.

They both calculated the same result: the AMD in ~5.4h and the C2D in ~3.1h, roughly a 70% lead for the C2D. I wonder why.
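Spelling out the speedup from those two times:

    # per-work-unit crunch times quoted above
    k8_hours, c2d_hours = 5.4, 3.1
    print(k8_hours / c2d_hours - 1)   # ~0.74 -> the C2D finishes a work unit roughly 70-75% faster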

Also, I guess it is pretty safe to assume that the C2D was running two instances at the same time, each one thrashing the cache for the other. If that is true, they both would have roughly the same amount of cache per client instance.

2:06 AM, November 16, 2006  
Anonymous Anonymous said...

Intel has based its entire farm on reworked Pentium 3s; how sad.
After reading the reviews of the Clovertown quad core versus the four-core Opteron HP server, it seems the Intel quad's performance is another dismal flop.
It was nice to fit 4 cores under one heatsink, but they should have made it work better and use less power.
The Intel quad-core system used 245 more watts than the HP AMD four-core server.
Of course the Intel quad core is very expensive to buy, so that would help Intel.
The Intel fanboys should be happy: since Intel said it will build a million of these quad chips, they should come way down in price as soon as they don't sell and become yet more inventory for Intel to drown in.
It seems the Intel quad core is just another Intel platform problem that worsens Intel's financial picture.

10:13 AM, November 16, 2006  
Anonymous Anonymous said...

"There is a huge Intel vs AMD battle going on right now, SETI@Home, most do not know about it though."

First of all, the only reason Seti excels on Core DooDoo is that it fits inside Intel's jerkoff cache patch job.

Second of all, can you say

"Stream Computing?"

AMD/ATI's Stream Computing and Fusion initiatives ARE the future of X86 computing.

Stream Computing already demonstrates a massive 20-40X improvement over conventional CPUs (depending on how you squint at the benchmark).

This is not smoke and mirrors - real code being used right now in Folding@Home.

Keep watching the AMD/ATI space for more "infraggtion points" to the Intel copy inexactly jerkoff machine. You know, things like...

AMD64?
MCM?*

Without AMD, you'd be sitting there right now with your 5.0116 (gasp) GHz water-cooled bleeding-edge Netbust single-core 32-bit fartburner.

For those of you that can't wait, don't. Get frantic. Get out there right now and grab every Core DooDoo you can find, and get to work on crunching those Seti work units as fast as you can.
See you in a few months. :-)

*Multiple Cores for the Masses

11:20 AM, November 16, 2006  
Anonymous Anonymous said...

"First of all, the only reason Seti excels on Core DooDoo is because it fits inside Intel's jerkoff cache patch job."

No. S@H takes ~20MiB of RAM, and as my comparison showed, a similarly clocked K8 with the same amount of cache is ~70% slower than a C2D.

"AMD/ATI's Stream Computing and Fusion initiatives ARE the future of X86 computing."

That stream computer is nothing more than an X1900 with a new name. Guess what? It sucks for the majority of tasks. NV's G80 is the first of the really suitable stream processors; I wouldn't be surprised if it delivered several times more general-purpose computing power. The R580 is way more difficult to program than an SPE in Cell.

11:35 AM, November 16, 2006  
Anonymous Anonymous said...

"They both calculated that same result. AMD in ~5.4h and C2D in 3.1h, 70% lead for C2D. I wonder why."

Network speed, memory speed, disk access speed, user activities (# of context switches)... and most of all, dual-core.

"Also I guess it is pretty safe to assume that C2D was running two instances at the same time, each one trashing cahce for the other one."

No, it's not. Two instances of what? BOINC? Or two instances calculating the same job? (Either would cause no thrashing.)

1:51 PM, November 16, 2006  
Anonymous Anonymous said...

For those of you that can't wait, don't. Get frantic. Get out there right now and grab every Core DooDoo you can find, and get to work on crunching those Seti work units as fast as you can.
See you in a few months. :-)


Fact of the matter is, it's just another arena that Intel excels in.

3:53 PM, November 16, 2006  
Anonymous Anonymous said...

"Network speed, memory speed, disk access speed, user activities (# of context switches)... and most of all, dual-core."

Network and disk speed are absolutely irrelevant. The memory situation is much better for AMD, since S@H doesn't need massive bandwidth and the K8 has roughly half the memory latency. User activity doesn't count for much either, since the calculation time only counts the cycles during which the program was actually executing; if it is idling, the counter stops. Dual-core doesn't help here at all: a single S@H client can't use more than one core.

"No it's not. Two instances of what? BOINC? Or calculating the same job (both would cause no thrashing)."

If two instances of BOINC calculating different jobs don't cause cache thrashing, then what does?

4:23 PM, November 16, 2006  
Blogger Reuben Gathright said...

Jeach! The list is ordered by "Top Computers". If you looked closer, you would back up my previous claim of Intel winning the battle, because the list is ordered by RAC (recent average credit). In SETI this is a measure of credit assigned for completing work units.

We in the optimization community have tried in vain to optimize the FFT calculations for AMD CPUs. The SSE2 instruction uops are just not as responsive as on the new Core 2 Duo and dual-core Xeon CPUs.

I tried an AMD 64-bit Socket 939 system for a few months (look up my profile). Unfortunately, the computer could not crunch nearly as fast as an 800MHz-FSB Hyper-Threading 2.4GHz P4!

On a final note, a work unit does not fit into the cache of the Core 2, because we run two processes on dual-CPU computers.

4:43 PM, November 16, 2006  
Blogger Christian Jean said...


They both calculated that same result. AMD in ~5.4h and C2D in 3.1h, 70% lead for C2D. I wonder why.


Right, but how/when were they calculated? I have my BOINC running 24/7. During the weekend, when I'm using my PC and doing intensive compiling, it takes a lot more 'time' to crunch a unit than when I'm not using the PC.

Notice how I said 'time' rather than 'jiffies'. That's because I don't know for a fact how they compute the computation time (at the kernel level, at the threading level, using the stdlib date API, or counting real wall-clock time - listed in order of reliability).

Jeach!

6:13 PM, November 16, 2006  
Blogger Christian Jean said...

A while back we had a discussion on this blog about how the C2D's 64-bit support was implemented internally (native or emulated).

I've noticed how Intel avoids any 64-bit benchmarks. OK, maybe it's just coincidence.

But the other day, while going through one of the posted links to Intel's quad-core benchmarks, I noticed a very odd thing. Intel has a habit of appropriating and/or optimizing most of the code it benchmarks with.

And conveniently, the JVM Intel used (I think it was JRockit or something) was a 64-bit JVM but with 64-bit memory addressing disabled!! That's very weird!

Is Intel trying to hide something? Is constantly calculating 64-bit memory addresses too expensive for this quad processor?

I'd be curious to see an AMD vs. Intel all 64-bit benchmark! I'm not talking about 64-bit code, but actual, exclusive 64-bit data-sets!

Does anyone have results from this kind of test to share? Preferably one which wouldn't fit in its huge cache, like processing millions of database transactions containing only 64-bit values.

Jeach!

6:49 PM, November 16, 2006  
Anonymous Anonymous said...

Remember, according to Hector the green actor: he personally said benchmarks DON'T matter to anyone but a few insignificant enthusiasts.

So, Sharikou, your favorite CEO doesn't care about and doesn't value benchmark leadership. So based on his own value system, his quad core sucks.

8:50 PM, November 16, 2006  
Anonymous Anonymous said...

"jeach! The list is ordered by "Top Computers". If you looked closer you would backup my previous claims of Intel winning the battle because the list is ordered by RAC."

Bullsh*t. If that list is really credible, you ought to say that Netburst is winning the battle against Core 2 then.

"We in the optimization community have tried in vain to optimize the FFT calculations for AMD cpus. The SSE2 instructions uOps are just not as responsive as new Core 2 Duo and Xeon Duo cpus."

Intel has a better SSE2 implementation, and the larger cache is better for matrix multiplication. Anyone with some computer architecture knowledge knows these facts.

What percentage of production programs involve FFTs, though? 1%? How many of those FFT implementations fit/utilize SSE2? Another 1%?

"I tried an AMD 64bit socket 939 for a few months (look up my profile). Unfortunately the computer could not crunch nearly as fast as a 800FSB Hyperthreading 2.4 Ghz P4!"

If you work only on SSE2-intensive code, then a 2.4GHz P4 might slightly outperform a 2.0GHz K8. But that is far from claiming that the P4 crunches numbers faster than the K8. On the contrary, on pretty much everything other than SSE2, the P4 loses badly to the K8 even with a 20% clock rate advantage.

9:45 PM, November 16, 2006  
Anonymous Anonymous said...

"Network and disk speed are absolutely irrevelant."

Well, you can try that yourself. First use a slow ATA-66 HD w/ dialup, then use an SATA-3 w/ broadband. Observe the difference (won't be too large, but there will be).

"Memory speed is much better for AMD since S@H doesn't need massive bandwidth and K8 has roughly twice as good memory latency."

Again, you can try it yourself. First use a 200MHz FSB with slow-timing memory modules, then change to an 800MHz FSB with high-end ones. Observe the difference (which will be larger than the one above).

"User activities doesn't count much also since the calculation time only counts the cycles that the program was actually being executed. If it is idleing the counter stops."

Context switches waste cycles; that is why OSes do not do fine-grained scheduling. Especially when the cache is small, context switches can induce more severe thrashing.

"Dualcore doesn't help in this at all. S@H can't use dualcore features."

So you want us to believe a distributed application that can utilize multiple discrete computers cannot do multi-threading or multi-processing? Then why are the top-list computers almost all 4-way SMPs?

10:02 PM, November 16, 2006  
Anonymous Anonymous said...

"Right but how/when were they calculated? I have my BOINC running 24h/7. During the weekend when I'm using my PC and doing intensive compiling, it takes alot more 'time' to crunch a unit than when I'm not using the PC."

I've had S@H running in the background while upgrading Gentoo. It got about 10% of the CPU time at most, but the total time per WU was still the same as when it had 100% of the CPU. As I said, they measure the time the process spends running on a CPU, not the time from downloading the WU to uploading the results.
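To illustrate the distinction being argued here (CPU time vs. wall-clock time), a minimal Python sketch - just an illustration of the two clocks, not BOINC's actual accounting code:

    import time

    start_wall, start_cpu = time.perf_counter(), time.process_time()
    total = sum(i * i for i in range(10_000_000))  # busy work: advances both clocks
    time.sleep(2)                                  # idle: wall clock keeps ticking, CPU time barely moves
    print("wall-clock:", round(time.perf_counter() - start_wall, 2), "s")
    print("cpu time:", round(time.process_time() - start_cpu, 2), "s")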

"Well, you can try that yourself. First use a slow ATA-66 HD w/ dialup, then use an SATA-3 w/ broadband. Observe the difference (won't be too large, but there will be)."

FYI, I've done it. I used to run S@H on dial-up, and the time didn't improve after upgrading to broadband. That's normal, since S@H doesn't count the time it takes to download stuff toward the total calculation time.

"Again, you can try it yourself. First use 200MHz FSB with slow timing memory modules, then change to 800MHz FSB with high-end memory ones. Observe the different (which will be larger than the one above)."

Guess how much the memory latency changes when you do that?

"So you want us to believe a distributed application that can utilize multiple discrete computers cannot do multi-threading or multi-processig?"

They are not doing any kind of multi-threading. If by multi-processing you mean running several clients simultaneously, then that is exactly what they are doing: they run as many clients as there are CPUs, and each client calculates its own WU. If you tried it yourself on an SMP machine, you would understand.

"Then why are the top list computers almost all 4-way SMPs?"

Because they all run four instances of the client program concurrently.

1:10 AM, November 17, 2006  
Anonymous Anonymous said...

"... then use an SATA-3 ..."
1) There is no such thing as SATA3.
2) There is no speed difference between PATA, SATA 150, and SATA 300 HDDs, assuming they use the same platters.

3:33 AM, November 17, 2006  
Anonymous Anonymous said...

"1) there is no such thing as SATA3."

You made me laugh. You'd make an okay school teacher. Yeah, there's no standard called "SATA-3" per se, but you do know there's a SATA standard that supports a 3Gb/s transfer rate, don't you?

It seems to me that you are running out of substantive counter-arguments and have started using primary-school grading BS as your response.

"2) there is no speed difference between PATA, SATA 150 and SATA 300 HDD assuming they use the same platters."

More BS. That's like saying a BMW will accelerate no faster than a Civic if they have the same engine (at which point the other design differences are relatively minor).

So you really believe a SATA 3Gbps drive won't run faster than a 3-year-old PATA-66 one? I can sell you a few semi-new drives of the latter kind for cheap, if you value them so much...

12:47 PM, November 17, 2006  
Anonymous Anonymous said...

"yeah, there's no standard called "SATA-3" per se, but you do know there's an SATA standard that support 3Gb/s transfer rate, don't you?"

Yes, but do you know that the fastest 15k-rpm SCSI drives in existence are not fast enough to bottleneck the ATA-133 bus? The fastest 7200rpm SATA drives can deliver up to 75MiB/s; SATA2 3Gb/s has roughly five times that in raw interface throughput, SATA 150 about 2.5x, and ATA-133 about 1.8x. Sure, you could say that cached speeds are close to the interface maximum, but how much useful data fits in that 8/16MB buffer?
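Working those multiples out from the raw link rates (ignoring 8b/10b encoding overhead) against the ~75MiB/s platter figure above - rough numbers only:

    # raw interface bit rates vs a ~75MB/s sustained platter transfer rate (figures quoted above)
    platter = 75.0
    links = {"SATA 3Gb/s": 3000 / 8, "SATA 1.5Gb/s": 1500 / 8, "ATA-133": 133.0}
    for name, rate in links.items():
        print(name, round(rate / platter, 1), "x the platter rate")  # ~5x, ~2.5x, ~1.8x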

"You are like saying a BMW will accelerate no faster than a Civic if they have the same engine"

HDDs are not cars. Assuming the same rotational speed, >90% of an HDD's performance comes from its platters. With cars, quite a lot comes from everything between the engine and the wheels.

"So you really believe an SATA 3Gbps drive won't run faster than a 3-year-old PATA-66?"

It seems like you can't read very well or simply can't understand. I said "assuming they use the same platters". Three-year-old drives do not use the same platters as drives released just a few months ago. Platter density is the main thing that defines drive speed; basically, 100GB and 200GB platters give almost a 2x speed difference. If you have ATA and SATA2 drives both using the same-density platters, they both have the same speed.

Among current drives, the ATA and SATA/SATA2 HDDs coming from the same manufacturer differ only in their connector; the firmware and platters are almost always the same. Of course, there are some who cripple their ATA/SATA1 HDDs to make SATA2 look better.

12:52 AM, November 20, 2006  
Anonymous Anonymous said...

"It seems like you can't read very well or simply can't understand. I said "assuming they use the same platters". "

The one who doesn't read well is you. Go back to the comment of mine you responded to: I WAS comparing ATA-66 to SATA-3G. So you were assuming the ATA-66 and SATA-3G drives I mentioned would use the same platters? Isn't that like assuming a Civic and a BMW use the same engine?

"Platter density is the main thing that defines drive speed. Basically 100GB and 200GB platters have almost twice the speed difference."

Platter density only defines sequential read/write speed, which is rarely the limiting factor (otherwise there would be no need for larger buffers).

Even two ATA-66 drives with the same platters and rotation speed perform differently if, for example, they have different internal cache sizes. How about maximum and average (1/3-stroke) seek times? Those make visible differences, don't you know that?

Not to mention that the file system and even fragmentation make a big difference, too.

1:31 PM, November 22, 2006  
Blogger Scientia from AMDZone said...

Actually, it wouldn't surprise me a bit if FFTs were slower on K8 than on either P4 or C2D. This isn't a secret; it's been known since 2003. K8 doesn't implement SSE the same way P4 or C2D does. AMD's approach is more generic, which maximizes the performance of the older FP calculations, while Intel's dedicated-hardware approach has worse plain FP and better SSE. C2D also has twice the prefetch bus bandwidth of K8. That it is faster in SSE-intensive operations is no surprise.

This will change, though, with K8L. K8L doubles the prefetch bandwidth and substantially beefs up the SSE units; it should be as good as or better than C2D for SSE operations at the same clock, and FFTs should pick up considerably. On top of this, the extra L3 cache and the added cache modes should pretty well take care of any advantage due to cache.

It is also well known that the 64-bit addressing on Prescott was a hack, folded into the existing Xeon 36-bit extended addressing. I haven't seen anything definitive yet saying whether C2D still uses this hack or has true extended addressing. If C2D is still using the old 36-bit hardware, this would make it slower on addresses wider than 32 bits.

6:40 PM, November 22, 2006  
Anonymous Anonymous said...

"A Clovertown core fights for a 266MHZ bus, that's a sham situation."

Wouldn't it be (almost?) exactly the same for AMD quadcores? Per-socket they wouldn't have that much more bandwidth than their Intel counterparts.

3:19 AM, November 24, 2006  
Anonymous Anonymous said...

Here is a funny SPEC benchmark showing a 2P Xeon with a total of four cores scoring only 16% worse than a 4P, 8-core K8 machine, and that is in 64-bit. In 32-bit, the 2P, 4-core Xeon beats the 2P, 4-core Opteron by 37.8%, and these benches were run by AMD itself.

As you said, SPEC is reliable, so doesn't it follow that if you want to run an enterprise Java server application, you would be best off with a C2D-based machine?

3:24 AM, November 24, 2006  
Anonymous Anonymous said...

Anonymous asked...
["A Clovertown core fights for a 266MHZ bus, that's a sham situation."

Wouldn't it be (almost?) exactly the same for AMD quadcores? Per-socket they wouldn't have that much more bandwidth than their Intel counterparts.]
==================
Not exactly. AMD's overall system bandwidth is significantly higher.

HyperTransport is a point-to-point, bidirectional link running at 2GT/s (versus the shared front-side bus at 1.33GT/s, which moves data in only one direction at a time).

Secondly, the memory controller is integrated into the Opteron die, providing direct access to memory independent of the I/O bandwidth (HyperTransport). More importantly, as CPU speeds scale, memory access latency improves (gets lower) along with them.

With Intel's designs, the memory controller is on the chipset (which is why Intel usually needs a larger cache to compensate), and even as CPU speeds increase, the relative latency actually gets worse.

Also, per a recent Microprocessor Forum slide deck, there were changes to the memory controller for the quad core as well (which will improve memory access efficiency further).

And lastly, AMD's quad cores will communicate with each other over an internal crossbar switch (the exact bandwidth of this switch is not known, but again, it is independent of the HyperTransport link to I/O). On Intel's currently announced "quad" core, the two Woodcrest dies in the same package can only communicate with each other over the FSB (again eating into that limited bandwidth).
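For rough context, here are back-of-the-envelope peak-bandwidth figures for typical 2006-era parts, assuming a 1066MT/s FSB, dual-channel DDR2-667 per Opteron socket, and a 16-bit HyperTransport link at 2GT/s (ballpark arithmetic, not measurements):

    # back-of-the-envelope per-socket peak bandwidth (assumed 2006-era configurations)
    fsb = 1066e6 * 8      # 1066MT/s x 8-byte bus: ~8.5 GB/s, shared by memory, I/O and coherence traffic
    imc = 2 * 667e6 * 8   # dual-channel DDR2-667 per Opteron socket: ~10.7 GB/s, memory traffic only
    ht = 2 * 2e9 * 2      # 16-bit HT link at 2GT/s: ~4 GB/s each way, ~8 GB/s aggregate, I/O and coherence only
    for name, bw in [("shared FSB", fsb), ("Opteron memory controller", imc), ("HyperTransport link", ht)]:
        print(name, round(bw / 1e9, 1), "GB/s")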

1:56 PM, November 24, 2006  

Post a Comment

<< Home