FUD is like ghost movies, you don't get scared by seeing a ghost, you get scared by not seeing one -- Sharikou
Recall that Intel's Mooly Eden said Conroe will be 20% faster than AMD's future chips without even knowing AMD's plans? During the Spring 2006 IDF, Intel set up a Conroe box and an Athlon 64 box, then directed benchmarkers such as Anand to push buttons*, but peeking into the Windows Device Manager of the alleged Conroe wasn't allowed.
During the IDF, I emailed various Intel execs, AMD execs and Anand. I pointed out that such a pre-arranged black-box Intel setup against AMD was unfair, and challenged Intel to lend the Conroe box to Anand for a real drill. However, Intel did not dare to answer such a simple challenge based on the rules of fair competition. The INQ sharply criticised this kind of guerrilla benchmarketing.
In fact, Anand had no way to verify Intel's IDF Conroe setup; the Conroe configuration parameters were provided by Intel. Anand noted that "it looked like Intel had done the unimaginable" with regard to the situation. Nonetheless, Anand assured readers that "there was nothing fishy going on with the benchmarks or the install" based on his trust in Intel's honesty -- which was seriously lacking in past records. Thus we had an interesting situation: Anand relied on Intel's reputation to validate the Conroe setup while Intel relied on Anand's reputation to validate the Conroe scores -- a loop of trust was formed to convince the world + dog.
Now, for the very first time, someone actually got hold of a Conroe chip in their own lab and did some tests. It was a 2.4GHz Conroe (Link: CPU-Z) against an Athlon 64 overclocked to 2.8GHz. The overclocked Athlon 64 had a 2.8/2.4 - 1 = 16.7% clockspeed advantage.
The following results were obtained by running 32 bit ScienceMark binaries optimized for Intel Pentium:
MolDyn
Athlon64: 1872.68
Conroe : 2133.38 -- 14% faster
Primordia (Energy calculations for 1 atom)
Athlon64: 1506.83 -- 10% faster
Conroe : 1365.85
Cipher
Athlon64: 1345.05 -- 26.3% faster
Conroe : 1065.59
STREAM
Athlon64: 1512.55 -- 21.7% faster
Conroe : 1242.94
The above results were for an Athlon overclocked to 2.8GHz and a Conroe at 2.4GHz, with the Athlon having a 16.7% clockspeed advantage. For a direct comparison at the same clockspeed, we normalize the Conroe scores by taking the frequency difference into account. Assuming the best scenario, in which Conroe scores scale linearly with clock speed, we multiply the Conroe scores by a factor of 2.8/2.4. Thus, with a 2.8GHz Conroe, we would have
MolDyn
Athlon64 2.8GHz: 1872.68
Conroe 2.8GHz: 2133.38 * 2.8/2.4 = 2489 -- 32.9% faster
Primordia
Athlon64 2.8GHz: 1506.83
Conroe 2.8GHz: 1365.85 * 2.8/2.4 = 1593.49 -- 5.7% faster
Cipher
Athlon64 2.8GHz: 1345.05 -- 8.2% faster
Conroe 2.8GHz: 1065.59 * 2.8/2.4 = 1243.19
STREAM
Athlon64 2.8GHz: 1512.55 -- 4.3% faster
Conroe 2.8GHz: 1242.94 * 2.8/2.4 = 1450
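For readers who want to redo the normalization themselves, here is a minimal Python sketch of the arithmetic. The scores are copied from this post. The function names `normalize` and `lead_percent` are made up for illustration, and labelling the third and fourth score pairs as Cipher and STREAM is an assumption based on the order in which the tests are listed in this post. The linear scaling with clock speed is the same best-case assumption made above, not a measured fact.

```python
# Scores from the post: (Athlon 64 at 2.8GHz, Conroe at 2.4GHz).
# The Cipher/STREAM labels are assumed from the order the post lists the tests.
ATHLON_GHZ = 2.8
CONROE_GHZ = 2.4

SCORES = {
    "MolDyn": (1872.68, 2133.38),
    "Primordia": (1506.83, 1365.85),
    "Cipher": (1345.05, 1065.59),
    "STREAM": (1512.55, 1242.94),
}


def normalize(conroe_score, from_ghz=CONROE_GHZ, to_ghz=ATHLON_GHZ):
    """Scale a score linearly with clock speed (best-case assumption)."""
    return conroe_score * to_ghz / from_ghz


def lead_percent(a, b):
    """Percent by which the larger of two scores leads the smaller."""
    hi, lo = max(a, b), min(a, b)
    return (hi / lo - 1) * 100


if __name__ == "__main__":
    for name, (athlon, conroe) in SCORES.items():
        scaled = normalize(conroe)
        winner = "Conroe" if scaled > athlon else "Athlon 64"
        lead = lead_percent(athlon, scaled)
        print(f"{name}: Conroe at 2.8GHz = {scaled:.2f}, {winner} {lead:.1f}% faster")
```

The printed percentages may differ from the figures in this post in the last digit, depending on whether a value is rounded or truncated.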
ScienceMark is strictly a CPU/memory test. It doesn't involve video or disk I/O; it is basically a raw speed test. ScienceMark is freely available from http://www.sciencemark.org/ for both Windows XP and Windows XP x64.
However, the above results showed a violent CPU performance fluctuation for Conroe, from it being 32% faster to being 8% slower. How can this be explained?
The cause of the Conroe performance fluctuations can't be the types of computation involved. We notice that MolDyn is a floating point computation while Cipher is an integer computation. However, both MolDyn and Primordia are floating point calculations on quantum mechanical properties of matter, yet Conroe's Primordia performance is only 5.7% faster than Athlon 64, a drop of about 27 percentage points from its MolDyn lead.
As we look deeper into ScienceMark, we notice that in the default MolDyn benchmark setting there are only 4 cells with a simple cubic lattice, and no more than 32 molecules are involved. The program is basically tracking the momenta and positions of a handful of molecules and computing scattering effects. About 2MB to 4MB of memory is needed. The Primordia calculation for a single Ag (silver) atom with 47 electrons needs just a bit more memory than MolDyn. However, both the Cipher and STREAM tests involve a lot more than 4MB.
The reason why Conroe did so well in the MolDyn test is simple: Conroe has a huge 4MB of unified cache. For such single threaded tests that can fit in 4MB*, Conroe can just run off the cache at very high speed. Since cache misses drastically reduce performance, applications that run entirely off cache exhibit unrealistic performance numbers.
However, once you go over the 4MB limit, Conroe is slower than Athlon 64 at the same clock. Both the Cipher and STREAM tests use a lot more memory than Conroe's 4MB cache can hold, and Conroe immediately falls below Athlon 64 on the performance curve.
I can bet on this: if one increases the number of cells in the MolDyn test to 9, thus increasing the working set to larger than 4MB, Conroe will perform worse than Athlon 64 at the same clockspeed.
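Anyone can get a rough feel for this kind of cache cliff on their own machine. The sketch below is not ScienceMark and says nothing about MolDyn itself. It simply copies one buffer into another at increasing sizes and reports throughput, on the assumption that a plain memory copy is a fair stand-in for memory-bound work. The function names `copy_throughput` and `sweep` and the chosen sizes are made up for illustration. Python overhead and timer noise may blur the picture, so treat the numbers as a hint, not a benchmark.

```python
import time


def copy_throughput(size_bytes, repeats=20):
    """Return approximate MB/s for copying one buffer of size_bytes into another."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    dst[:] = src  # warm-up pass so both buffers are touched before timing
    start = time.perf_counter()
    for _ in range(repeats):
        dst[:] = src  # same-length slice assignment copies the bytes in place
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    return size_bytes * repeats / elapsed / 1e6


def sweep(sizes_kb=(256, 1024, 2048, 4096, 8192, 16384)):
    """Measure copy throughput for each per-buffer size given in KB."""
    return {kb: copy_throughput(kb * 1024) for kb in sizes_kb}


if __name__ == "__main__":
    for kb, mbps in sweep().items():
        # Two buffers are live at once, so the working set is 2 * kb.
        print(f"{kb:>6} KB per buffer ({2 * kb} KB working set): {mbps:,.0f} MB/s")
```

If the hypothesis above holds for a given CPU, throughput should drop noticeably once the two buffers together no longer fit in its last-level cache.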
There is another set of results on Conroe and Athlon 64, showing Athlon 64 beating Conroe on WinRAR file compression at the same frequency.
Most games are also cache sensitive: increase Athlon 64's cache by 512KB and you see up to an 8% increase in FPS.
I have added a comparison between Clovertown (double Conroe) and Athlon 64 2800+.
The conclusion is: clock for clock, Athlon 64 will beat Conroe in real application environments that require a working set larger than 4MB, or in other words, larger than Conroe's 4MB cache. This means that in any real multi-tasking or server environment the Core architecture will be an underdog. Even worse, for Intel's shared cache architecture, cache thrashing is a distinct possibility under heavy loads.
Most modern applications need a lot more than 4MB. IE needs at least 50MB when viewing a normal web page (with Flash, JS, DHTML, AJAX..); photo editing apps need around 40MB; Firefox takes 23MB when I use it to view yahoo.com; DivX grabs 23MB even before I open a video...
Frankly, I am really disappointed by Intel's decisions. This gimmick of using a 4MB cache to get unreasonably good scores on the most simplistic tests is cheap from a design point of view but expensive for manufacturing. Mooly Eden kept talking about the 4 Meg cache in the technology analyst meeting, and promised to add even more cache. However, the 4MB cache is definitely eating a lot of die area and Intel's limited capacity. It is almost like using Netburst's ridiculous hyperpipeline to pump up GHz at the expense of power consumption and real performance. I wouldn't accuse Intel of benchmark fraud, but people need to know the 4MB limitation of the Conroe.
So far, Athlon 64 is being tested in 32 bit mode with executables optimized for the Pentium. Athlon 64 gets a 10-40% performance improvement running in 64 bit mode, so a benchmark under Windows x64 or Windows Vista should show the real strength of the AMD64 architecture.
As a test drive, I downloaded the 64 bit version of ScienceMark and ran it on my Athlon 64 2800+ (Socket 754, 130nm, 512K L2, at 1.799GHz stock frequency, with 1GB PC3200 DDR) under Windows XP x64. For the 64 bit MolDyn test, I got a score of 1479.12 ScienceMarks, almost 50% faster than the 32 bit result on the same old PC. I suspect that on a Socket 939 Rev E6 platform with SSE3 support, the 64 bit result will be even better. A reader submitted the 64 bit result for a 2GHz Athlon 64; you can view the result here.
AMD should work with benchmark creators to ensure that application benchmarks have a working set larger than the cache size of Conroe -- 4MB.
AMD's Rev F Socket AM2 will be available for system builders on May 15, 2006. At 65nm, using Stress Memorization Technology co-developed with IBM, AMD will be able to increase clockspeed to 4GHz. AMD is also working on Z-RAM, an SOI based technology that may increase cache density by 500%.
*For those who question the authenticity of this Conroe benchmark, the person who posted the result had shown at least some CPU-Z screen captures indicating the various properties of the Conroe CPU. Anand wasn't even allowed to look at the Windows Device Manager; all he did was push some buttons as directed by Intel IDF staff. All the system specs of the Conroe system were provided by Intel. Anand had no verification of the setup. Also, unlike Anand, who receives a lot of ad money from Intel, the person who posted the Conroe results had nothing to gain financially either way. Clearly, this test has more credibility than Anand's. Anand's failure to mention that he was merely a button pusher, and his obvious pumping style, put his credibility very much in doubt.
*Intel touted its 1 cycle SSE execution, but the STREAM results weren't impressive. Henri Richard mentioned Conroe is more like K8.
*To verify this, you can download ScienceMark, then run the MolDyn, Primordia, Cipher and STREAM benchmarks on your own PC. You will find that the default MolDyn test uses very little memory, Primordia uses a bit more, but Cipher and STREAM use a lot more than 4MB. To check this, launch the ScienceMark program, then open the dialog box for running the MolDyn benchmark. At this point the simulation hasn't started, and two threads have been created for the task. Using a process viewer program, you will note that the memory used by the task so far is about 7MB. Then click the Run Simulation button. You will notice that another thread is created to run the simulation, and the memory used by the whole task now stays below 11MB most of the time, meaning the benchmark thread uses less than 4MB and thus can fit in the 4MB cache of a Conroe.