Journal of Pervasive 64 bit Computing

Analysis of IT trends and competitive strategies, with emphasis on microprocessors, computer systems and networks. Based on the latest news and backed up with real data, this site intends to provide a true, real-time picture of the fast-changing IT landscape. This journal strives to be accurate on facts and sharp in its criticism. You may email your opinion to sharikou@yahoo.com or post comments here; be cool and intelligent.

Name: Sharikou, Ph. D.

Freelance journalist on IT matters. Some of my writings have been published in online IT journals. Any original content on this journal is copyrighted, but it's free for non-commercial use. Any trademarks used on this site belong to their respective owners. Some of the pictures are links. If there is any issue with the content of this site, please email sharikou@yahoo.com.


Monday, March 12, 2007

No one believes Intel this time

Read this on video performance cheats.

This is a conversation:

Paul: how many you have left?

Craig: I still got 3717 to delete.

Paul: I deleted 17888, but there are 11818 to go.

Craig: Remember to use the disk wiping software.

Paul: We are gonna get caught.

Craig: That's still better than showing these emails to the jury.

Paul: I will have to agree. I am gonna assert 5th when deposed.

Craig: me too.

60 Comments:

lex said...: Paul, Craig and the senior execs do look stupid, but they aren't as stupid as the pretender.

You can easily delete a bunch of emails and it takes only a few seconds. Select all and delete.. or just go to your local folder where all your pst files are and delete.

If anything this shows what retards Craig and Paul are and how the board should fire them. If INTEL management was competent they would have easily crushed AMD years ago; it's only through INTEL incompetence and arrogance from Craig and Paul that AMD was allowed to get so far.

Tick Tock Tick Tock what is it I hear. Penryn and Nehalem coming on 45nm.; 7:18 PM, March 12, 2007
Unknown said...: Indeed. Lex is spot on. Higher speed 65nm processors maintaining same TDP ready to frag AMD. Penryn coming this year ready to frag AMD. Nehalem coming next year to finish AMD off once and for all.

The latest information from Hector Ruiz is that they're going to start shipping Barcelona in "late summer". This is August. Intel will, no doubt, plan the Penryn launch to coincide with the Barcelona launch to frag AMD all over.

P.S: 50W Quad Core Xeons out now. We can have 8 1.86Ghz processor cores in a 2P server consuming only 100W of power. This is less power for EIGHT processing cores than for just one of AMD's dual cores with TWO cores. 8220SE uses 120W of power. AMD is undesirable for servers. No wonder AMD is posting losses and losing marketshare to Intel and Nvidia.

AMD BK Q2'08.; 7:34 PM, March 12, 2007
Ho Ho said...: giant
"Higher speed 65nm processors maintaining same TDP"

Wrong, upcoming 3GHz dualcore will be 65W instead of the 75W of the old 2.93GHz one.

giant
"they're going to start shipping Barcelona in "late summer"."

Technically, summer ends in late September.

azary omega
"do you understand that most AMD's dual cores offer more performance than this quad? "

Not when you use all of the four cores. When you play games (on Xeon!) then perhaps AMD dualcores might win. People who buy Xeons usually run software that uses lots of threading and/or multiprocessing.; 12:43 AM, March 13, 2007
Unknown said...: Also, to that dude who mumbled something about 50W quad cores - do you understand that most AMD's dual cores offer more performance than this quad?

Nonsense. A 1.86Ghz quad core is far more desirable than a dual core 2.8Ghz for server workloads. The fact that the quad uses 50W while the Opteron 8220 SE uses 120W is just the icing on the cake.; 12:45 AM, March 13, 2007
Ho Ho said...: Winrar only supports up to two cores. Same seems to be true about encoding and specviewperf. Wouldn't you say that comparing singlethreaded and/or limited scaling programs on PC's with lots of cores is kind of stupid?

If you want to see programs that do support >2 cores then see 3DSMax, Cinebench, SunGrid and Linpack. You may also be interested in power consumption.

May I ask, who buys 2P quadcores to encode videos and music or compress data?; 2:20 AM, March 13, 2007
Amdzoner said...: To check multithreaded benchmarks look here:

http://www.gamepc.com/labs/view_content.asp?id=o2000&page=7; 3:01 AM, March 13, 2007
Ho Ho said...: azary omega
"Wrong. Its 3.61."

Version 3.61 (Dual-Core)

azary omega
"Not about others, but your right about specviewperf. :-}"

When lower-clocked 8-core box shows worse results than higher-clocked 4-core box it means that those 4 extra cores are not being used. Or do you have some other theory?

azary omega
"By the way, don't Ever assume that hardware site that makes millions would be stupid enough to test dual socket, 4-8 core systems with software that isn't MPed all the way"

It's THG we are talking about, they do all sorts of stupid things.; 3:32 AM, March 13, 2007
R said...: This is an absolute must read. With the Opteron’s co-cpu technology it is estimated AMD could run 1000 times faster. How about MS 800,000 servers install.

http://www.edn.com/article/CA6423621.html?partner=enews&nid=2019&rid=1583460753; 4:10 AM, March 13, 2007
Christian Jean said...: For anyone competent in law... couldn't Paul and Craig face potential criminal charges for interfering with a criminal investigation. On the assumption that the email deletion was 'intentional' of course!

Second, couldn't AMD ask the court to confiscate these PC's? A lot of the time these 'deleted' emails can be recovered by experts. Also, these so called experts can probably tell you if they were deleted AND cleaned with third party tools... which could indicate intent!

Anyway, Intel is really losing credibility on all fronts: benchmarks, fair play, legal, etc.; 4:14 AM, March 13, 2007
Christian Jean said...: Some of you are putting WAY too much emphasis on power consumption and heat dissipation.

Sure companies that are purchasing grids or even a half dozen servers will consider the power consumption and heat dissipation factors.

But I would guess that the majority of all servers sold are for small to medium-sized companies that just need another server for a specific job, or need to consolidate two or four older servers onto a single newer one.

And I believe (in fact, I'm certain) that these companies don't give a damn about how much power their servers consume or how much heat they dissipate.

Even when AMD was the 'power' king I knew that it wasn't the reason for all of its server sales... with the exception of accounts such as Google of course!; 4:22 AM, March 13, 2007
Christian Jean said...: I've got a theoretical question for the knowledgeable. As for all others, abstain from bashing!

Could AMD in part or in whole have skipped a generation in manufacturing technology? That's right, you've heard me right... from 90nm all the way through till late 2007 and switch directly to 45nm.

If impossible, why not? What would have been most challenging or impossible?; 4:35 AM, March 13, 2007
Ho Ho said...: jeach!
"Could AMD in part or in whole have skipped a generation in manufacturing technology?"

Of course it could ...

jeach!
"from 90nm all the way through till late 2007 and switch directly to 45nm."

... but not that fast, 45nm won't be usable for AMD that soon. That would also mean pushing Barcelona way back.

My guess is AMD won't start mass producing 45nm stuff before 2008. It might have some test chips and perhaps some very low numbers of chips sooner. Just the same as Intel had.; 4:45 AM, March 13, 2007
Ho Ho said...: On a totally unrelated note, Intel will build a whole new 12" fab in China. Interesting thing is it'll be able to produce 52,000 wafers per month. Was it AMD's Fab 36 that was supposed to be vastly superior to anything else there was? Am I right it could output up to 35,00 wafers per month?; 8:01 AM, March 13, 2007
Unknown said...: Wrong, upcoming 3GHz dualcore will be 65W instead of the 75W of the old 2.93GHz one.

You're right. I missed that one. The new 1333MHz FSB Core 2 Duos should all have a 65W TDP. The Core 2 Extreme QX6800 should fall in the same thermal envelope as the QX6700.

AMD's processors are currently all fragged badly by Intel's CPUs. Intel could BK AMD easily with its current 65nm designs. But we still have 45nm, Penryn, and Nehalem coming all in the near future.; 8:30 AM, March 13, 2007
PENIX said...: Intel's excuse for "losing" these e-mails is very foolish. I would expect more from a senior executive. A more logical excuse would be to claim their archive server was using a core duo and exploded.; 8:41 AM, March 13, 2007
rathor said...: TO ALL INTEL MOFOS: STOP SAYING "AMD IS GOING DOWN", "THE SKY IS FALLING (FOR AMD)" AND ALL THAT KIND OF STUPID SHIT... YOU (INTEL FANS) DON'T UNDERSTAND A SIMPLE THING: INTEL IS RIGHT NOW IN DEEP SHIT, THEY SIMPLY DON'T KNOW WHAT TO DO TO STOP K10 (BARCELONA) AND R600. NOTHING WILL STOP AMD TO GAIN MORE MARKET SHARE, NOT EVEN C2D WITH ALL THE SHIT INSIDE (THE ALL MIGHTY SSExxx) AND CLOVERTON WITH ITS GLUED CORES.... YOU, INTEL FANBOIS ARE SO IDIOTIC AND BLIND...; 9:06 AM, March 13, 2007
rathor said...: This comment has been removed by the author.; 9:06 AM, March 13, 2007
Ho Ho said...: Nice one. Though I wonder what does R600 have to do with Intel.; 9:08 AM, March 13, 2007
Unknown said...: AMD lost share in the lucrative server sector last quarter and will lose marketshare in desktop, mobile and server this quarter. http://sg.us.biz.yahoo.com/ap/070308/analyst_note_advanced_micro_devices.html?.v=1

Lowering ASP + Less marketshare = Lower revenue and zero profits for AMD

R600 has been delayed so many times I've nearly forgotten when it's supposed to be out. The R600 might be a little faster (the most recent rumors suggest 5 -> 10%) but Nvidia has had nearly six months to work on the 8800 Ultra/8900 GTX/Whatever you'll call it. ATi is losing marketshare to Nvidia and is posting losses.

I can see a round of layoffs coming soon at AMD to try and delay the BK. But the fact is that Intel is bringing so much new technology out that AMD simply cannot keep up. Intel has also delivered ALL its products on time or ahead of schedule and in decent quantity. Woodcrest, Conroe and Merom were all released ahead of schedule, quad cores were released ahead of schedule. The Caneland platform (Tigerton processor) for 4P servers, 45nm Xeons, Penryn etc. are all due this year. Then we start again next year with the Nehalem architecture and all the processors based on it.

The only thing that AMD can do at the moment is promise that Barcelona is awesome. They also seem good at whining that Clovertown was "rushed to market", and that the FSB is a bottleneck (even though all the tests have proven otherwise!) and that it's not a "native" quad core.

AMD has been fragged. AMD BK Q2'08. It's inevitable.; 9:33 AM, March 13, 2007
Unknown said...: NOT EVEN C2D WITH ALL THE SHIT INSIDE (THE ALL MIGHTY SSExxx)

Please let us know how you would do 64bit math on any x86 cpu without SSE?; 9:51 AM, March 13, 2007
Ho Ho said...: bubba
"Please let us know how you would do 64bit math on any x86 cpu without SSE?"

On 64bit CPU's (later netbursts, Core2 and K8) you use the 16 general purpose registers. On other CPU's you can use MMX registers. What's your point?; 10:00 AM, March 13, 2007
Unknown said...: Well, my point was floating point, but I left the fp out of my original post, and it appears you can't edit comments.

Unlike the Doctor who edits all the time; 10:33 AM, March 13, 2007
rathor said...: @bubba: stfu, you don't even know what a fpu is, so when you're grown enough type those words: : "Intel doesn't have any fpu power it is only the cpu!"... And, btw, Intel must be exorcised :) :) :); 10:55 AM, March 13, 2007
Ho Ho said...: bubba
"Well, my point was floating point"

FPU in all 32bit x86 is 80 bits wide in 32 and 64bit mode. All of the CPU's use either 32bit floats, 64bit doubles or 80bit long doubles. SSE units can use either 32bit floats or 64bit doubles.

What exactly is your point?; 12:18 PM, March 13, 2007
Unknown said...: My point is FP (non-vector) math has been fully deprecated in all 64bit processors.

That means, it's still there, but AMD or Intel could remove it at any time.

Also, want to write an application that uses FP under 64bit Vista? You can't. You must use SSE.

So, with rathor ranting on about SSE being shit, he needs to think a little before he speaks. Even his beloved AMD has abandoned FP math.; 1:28 PM, March 13, 2007
abinstein said...: "My point is FP (non-vector) math has been fully deprecated in all 64bit processors.

That means, it's still there, but AMD or Intel could remove it at any time."

How do you deprecate standard 80-bit FP math in a 64-bit processor?

Are you not going to perform any scientific or engineering calculation using this CPU/ISA?; 2:05 PM, March 13, 2007
abinstein said...: "Intel will build a whole new 12" fab in China. Interesting thing is it'll be able to produce 52,000 wafers per month."

Well there are several problems with this Inquirer "news":

1. This is old. I heard this rumor as early as last fall (September or so). The only new thing here is the production volume, but it's highly speculative to say the least.

2. Is Intel even allowed to export 90nm technology to China? News about China's approval on such a matter is totally insignificant.; 2:29 PM, March 13, 2007
Ho Ho said...: bubba
"My point is FP (non-vector) math has been fully deprecated in all 64bit processors."

It is not deprecated for all tasks. SSE is suggested because it is much faster, though not as accurate. Some scientific computing needs to be more accurate than 64bit.

abinstein
"Is Intel even allowed to export 90nm technology to China"

Who can stop it?; 2:50 PM, March 13, 2007
core2dude said...: Hey Sharikou,

Looks like AMD's claim about 40% Barcelona lead over Kentsfield applies only to specFP_rate http://arstechnica.com/news.ars/post/20070301-8958.html. So, you have unnecessarily been hyperventilating, don't you think, considering K8 was already significantly better at FP than Core 2?

Penryn will have floating-point improvements, will be clocked higher, will have a bigger cache, and a faster FSB. That means Penryn will be within spitting distance of K8L on specFP_rate, if not match it or better it. So it looks like AMD's last stronghold, specFP_rate, is at stake.

Oh no, wait, they still got SisSoftSandra_memory, at least until CSI's 6.4 GTs completely obliterates HT 3.0's 5.2 GTs.

AMD is in deep $h1t, and everyone knows it!; 5:49 PM, March 13, 2007
Unknown said...: It is not deprecated for all tasks

Yes, it is. Check the processor documentation.

BTW, SSE is 128 bits, so it is more accurate than 80 bit FP.; 6:02 PM, March 13, 2007
Unknown said...: Hey Sharikou,

Looks like AMD's claim about 40% Barcelona lead over Kentsfield applies only to specFP_rate http://arstechnica.com/news.ars/post/20070301-8958.html. So, you have unnecessarily been hyperventilating, don't you think, considering K8 was already significantly better at FP than Core 2?

Penryn will have floating-point improvements, will be clocked higher, will have a bigger cache, and a faster FSB. That means Penryn will be within spitting distance of K8L on specFP_rate, if not match it or better it. So it looks like AMD's last stronghold, specFP_rate, is at stake.

Oh no, wait, they still got SisSoftSandra_memory, at least until CSI's 6.4 GTs completely obliterates HT 3.0's 5.2 GTs.

AMD is in deep $h1t, and everyone knows it!

Indeed. Penryn is just the start. Nehalem and CSI are coming next year. AMD barely gets over how powerful Core 2 and Penryn are and they have another architecture to deal with. Paul Otellini is in an enviable position now, knowing AMD will soon be completely destroyed by Penryn and Nehalem.; 6:29 PM, March 13, 2007
lex said...: Tick Tock Tick Tock...

What is it I hear

Penryn, Nehalem, Nehalem-C, Gesher.

INTEL will have two major architecture spins and two improvements in the next 4 years.

AMD will be lucky to get Barcelona and one additional spin out on 45nm by then.

AMD is finished; 6:34 PM, March 13, 2007
core2dude said...: Penryn, Nehalem, Nehalem-C, Gesher.

How dare you forget Larrabee? That might make AMD write off the entire investment in ATI...; 9:53 PM, March 13, 2007
netrama said...: In this discussion of performance and benchmarks with Intel Fanbois proclaiming "AMD is finished" , these a**h*les are missing the big picture.

AMD will do just fine. Intel is no longer in a position to dictate to and blackmail its channel partners and OEM customers the way it used to. Times have really changed, and whenever there is any Intel print or TV spot, folks go like 'Ok what bullsh*t are they talking this time'
This whole change in perception makes the difference ..

Intel Fanbois ...just turn around and look how much damage AMD has already done to Intel ...and this is just the beginning; 10:12 PM, March 13, 2007
abinstein said...: "Yes, it is. Check the processor documentation.

BTW, SSE is 128 bits, so it is more accurate than 80 bit FP."

I don't think you understand how fp or vector instructions work. First, you can't perform vectorized 80-bit fp in any 128-bit SSE unit. Second, you won't get better precision by using SSE units if the fp format is 80-bit. Third, 80-bit fp is an IEEE standard, which is not going away.; 11:29 PM, March 13, 2007
Ho Ho said...: bubba
"BTW, SSE is 128 bits, so it is more accurate than 80 bit FP."

Oh please research a little before saying anything else like that. SSE has 128bit registers and maximum accuracy is achieved by using two 64bit doubles.

So far you have made a royal mistake in each and every post you've made. Are you going to continue that trend?; 11:46 PM, March 13, 2007
Intel Fanboi said...: Netrama, this is your last chance to scrounge around for any data that might maybe kinda sorta looks good for AMD. Six months from now when AMD is dying you will have nothing. Good luck.; 2:44 AM, March 14, 2007
Roborat, Ph.D said...: Poor AMD. They spent so many years and so much money on Barcelona that it's already obsolete even before AMD can even show a working demo.

Rumours are circulating, originally from Tyan's server validation team, that they have a Barcelona sample that's very buggy and definitely slower than Clovertown in common server apps. I just thought I'd give a warning to the AMD fanboys to soften the blow. I know you're all going through a rough period with the Core2 blowout, the over-hyped and embarrassing 4x4 and now the profit warnings. With the future looking dimmer with Intel's stronger line-up, I'm just glad none of you have hurt yourselves yet. Hang in there!; 5:18 AM, March 14, 2007
Unknown said...: first amd barcelona benchmark.. amd is the intel killer

http://www.boincstats.com/stats/host_cpu_stats.php?pr=sah&st=0&or=10

http://www.boincstats.com/stats/host_cpu_stats.php?pr=bo&st=0&or=8; 6:52 AM, March 14, 2007
Ho Ho said...: This is ancient and proven to be false. If it really was Barcelona it would mean it runs Boinc at half the speed of Core2. Once you start looking at the results more closely it should become crystal clear.; 7:47 AM, March 14, 2007
Ho Ho said...: Forgot to add the link to where I analyzed those things a bit. Just see the last post I made there.; 7:51 AM, March 14, 2007
Unknown said...: So far you have made a royal mistake in each and every post you've made.

But yet I get the feeling you didn't even bother to look at the documentation.

Whatever, I think I'm done with you.; 9:24 AM, March 14, 2007
Unknown said...: What's this I hear? AMD losing marketshare and posting losses? Get used to it. That's what will continue to happen until AMD BKs. Penryn will launch just in time to crush Barcelona and Nehalem will launch next year creating a performance gap so wide that AMD will never be able to catch up.

AMD did not supply enough CPUs to its partners in the channel. AMD screwed them over. Now AMD has excess chips and is saying "We have plenty of CPUs for you!". The problem is that no one wants them. AMD is an unreliable source of CPUs. This is why Apple did not even consider AMD. They don't want more CPU shortages.

AMD BK Q2'08.; 8:38 PM, March 14, 2007
Ho Ho said...: bubba
"But yet I get the feeling you didn't even bother to look at the documentation."

I do read the documentation and know exactly why x87 and MMX are deprecated. You, on the other hand, seem to know almost nothing about CPU internals; that is the only logical reason why you made so many mistakes.

Your original question was "Please let us know how you would do 64bit math on any x86 cpu without SSE?". Please note there is no mention of FP code or CPU GP register width.

Next you specified you meant FP code. Then you started saying that CPU manufacturers could remove the functionality. That shows you haven't really read what AMD and Intel has said. They are not the ones who might cut support, it is the OS creators who might remove support for it. It won't be gone from CPU's any time soon since as I said, some scientific calculations need >64bit accuracy.

After that you were claiming 128bit SSE is more accurate than 80bit x87 further proving that you don't really know what you are talking about.

bubba
"Whatever, I think I'm done with you."

Interesting tactics. After being proved wrong you declare yourself as winner.

giant
"AMD BK Q2'08"

Relax, no one is going to BK. Worst that can happen is that AMD will have some hard time for the next year or two but it will be far from BK.; 5:08 AM, March 15, 2007
Roborat, Ph.D said...: I think the quality of this BLOG has really gone down. This used to be a very funny BLOG and I came here every day to get my laughs. Where’s the 40% market share “run-rate”? Where’s the “Intel frags itself with Core2”? Where’s the “AMD frags 75% of Intel’s products”? Even the “BK in Q2’08” is getting mentioned too infrequently.; 8:42 AM, March 15, 2007
Anonymous said...: Azary Omega said...
Good post Sharkie. And to some people - get a sense of humor!

Also, to that dude who mumbled something about 50W quad cores - do you understand that most AMD's dual cores offer more performance than this quad?

Ho Ho said...
azary omega
"Wrong. Its 3.61."

Version 3.61 (Dual-Core)

When lower-clocked 8-core box shows worse results than higher-clocked 4-core box it means that those 4 extra cores are not being used. Or do you have some other theory?

Every time I see Azary Omega posting, only **** flies out.; 9:29 AM, March 15, 2007
abinstein said...: "That shows you haven't really read what AMD and Intel has said. They are not the ones who might cut support, it is the OS creators who might remove support for it."

More accurately it should be the compilers, which generate the assembly/machine code, that are responsible for maintaining the support.

I think he confused what programmers write in the application with what compilers generate for the CPU. He also confused SSE units with vectorized instructions.

First of all, 64-bit math, FP or not, does not require SSE instructions or SSE units. Programmers need not bother formatting their code for SSE unless they want to vectorize the computation. They are not affected by whether Intel or AMD removes x87 support, as long as the compilers they use generate the assembly code correctly.

Secondly, SSE helps performance mostly due to vectorization; there's nothing magical about SSE units otherwise. For many scientific and engineering apps where 80-bit FP is used, no vectorization is possible in SSE anyway.; 11:50 AM, March 15, 2007
R said...: AMD wins 2006 revenue battle with Intel, iSuppli says

http://www.edn.com/article/CA6424781.html?partner=enews&nid=2019&rid=1583460753; 1:07 PM, March 15, 2007
Ho Ho said...: abinstein
"More accurately it should be the compilers, which generate the assembly/machine code, that are responsible for maintaining the support."

Once again, you are wrong.
The real problem is that a 64bit OS might not save the state of MMX/x87 registers on a context switch.

abinstein
"Secondly, SSE helps performance mostly due to vectorization; there's nothing magical about SSE units otherwise."

Generally one SIMD instruction with four floats takes considerably less time than one x87 instruction with one float. On Core2 and Barcelona that speed difference is quite big, especially when you consider throughput and not latency, though SSE latency has halved on Core2 compared to older CPUs.

Problem with SIMD instructions is that when the data is not 16 byte aligned it will take a lot more time to load it in registers. With x87 there is no difference of if the data is aligned or not.; 2:28 PM, March 15, 2007
Unknown said...: Ruiz is going to jail soon.

He just got paid $16.1M for wiping out $10B of shareholder money.

http://biz.yahoo.com/ap/070315/amd_executive_compensation.html?.v=1; 5:32 PM, March 15, 2007
abinstein said...: This comment has been removed by the author.; 12:48 AM, March 16, 2007
abinstein said...: Ho Ho
"Once again, you are wrong.
The real problem is that a 64bit OS might not save the state of MMX/x87 registers on a context switch."

If the OS does not correctly save the state of an ISA visible register on context switch, the OS does not work correctly on the platform.

BTW, you are the one who's been wrong about many stuff, such as 8GB/s IGP bandwidth, such as "memory system doesn't matter" argument, such as "ray-tracing more important than XML and cryptography," just to name a few. Keep your "once again" for yourself this time, too.

Ho Ho
"Generally one SIMD instruction with four floats takes considerably less time than one x87 instruction with one float."

... which is just the benefit of vectorization. Nothing magical.

Ho Ho
"Problem with SIMD instructions is that when the data is not 16 byte aligned it will take a lot more time to load it in registers. With x87 there is no difference of if the data is aligned or not."

That's totally not true. Obviously you didn't bench it but just babble out of imagination. The fact is, misaligned values in memory almost always reduce performance, even when it is legal to load/store them from/to memory.; 1:37 AM, March 16, 2007
Ho Ho said...: abinstein
"If the OS does not correctly save the state of an ISA visible register on context switch, the OS does not work correctly on the platform."

That means 64bit Windows doesn't work correctly as kernel level threads do not save their x87/mmx register states.

While legacy x87 floating point state is context swapped between 32- and 64-bit applications running on an x64 system, they must *NOT* be used by a kernel mode component

“In general, 64-bit operating systems support the x87 […] instructions in 32-bit threads; however, 64-bit operating systems may not support x87 […] instructions in 64-bit threads. To make it easier to later migrate from 32-bit to 64-bit code, you may want to avoid x87 […] instructions altogether and use only SSE and SSE2 instructions when writing new 32-bit code.”

Remember, 64-bit assembly code for Windows cannot use the older MMX, 3D Now! and x87 instruction extensions; they have been superseded by SSE/SSE2.

abinstein
"BTW, you are the one who's been wrong about many stuff, such as 8GB/s IGP bandwidth, such as "memory system doesn't matter" argument, such as "ray-tracing more important than XML and cryptography," just to name a few. Keep your "once again" for yourself this time, too."

Didn't I explain those things well enough on Scientia's blog? I thought I did, if not then tell me what do I need to explain in greater detail so most people would understand.

Also I don't remember you providing good proof of why were my statements wrong and you didn't answer most of my questions. Would you like to continue our little conversation on Scientia's blog?

Either way you are still making mistakes in this blog.

abinstein
"... which is just the benefit of vectorization. Nothing magical."

I tried to word it so it would be obvious that I didn't mean the vectorization but completion of single instruction. Seems like I failed with that.

What I meant was that multiplying two 128bit SIMD registers takes less time than multiplying two x87 registers. The fact that SIMD calculates on two or four floats per register and x87 with only one float per register doesn't matter in this context, all that I tried to say was that the instruction is executed faster in SIMD registers.

abinstein
"That's totally not true."

It is.

abinstein
"Obviously you didn't bench it but just babble out of imagination."

I did. Did you?

If you don't believe me then read this:
For example, the MMX and SSE aligned codes for addition of two arrays are up to 2.26 and 2.72 times faster than their implementations using misaligned accesses on the Pentium 3 and Pentium 4 processors, respectively.

Normally data is aligned to the native size of its datatype. That is 4 bytes for 32bit ints and floats, 8 bytes for doubles and 64bit ints. That means you can easily have an array of 32/64bit variables that is not 16 byte aligned. Doing unaligned reads and writes from that array will decrease performance considerably.

Just FYI, there are some exceptions when data is not aligned to their native sizes by default. That is mostly when you use packed structs but can be achieved with things like unions and pointer arithmetic too.

E.g., let's say you have a struct like this:
#include <stdint.h>  /* int32_t, int8_t, int16_t */

struct S {
    int32_t i32;
    int8_t  i8;
    int32_t i32_2;
    int16_t i16;
};

Without packing, those members have relative addresses of 0, 4, 8 and 12. With packing they have 0, 4, 5 and 9. Tested with GCC 4.1.2 on a 32bit OS. Code is here.; 3:39 AM, March 16, 2007
abinstein said...: Ho Ho
"That means 64bit Windows doesn't work correctly as kernel level threads do not save their x87/mmx register states."

From your own quote on Microsoft: "While legacy x87 floating point state is context swapped between 32- and 64-bit applications running on an x64 system, ..."

That means the x87 fp state is context swapped, literally. Now tell me, can you read?

Ho Ho
"Also I don't remember you providing good proof of why were my statements wrong and you didn't answer most of my questions."

Your arguments there were not convincing, offered no proof nor good reasoning. Your questions were totally off-track, I told you there to find the answers yourself and I simply don't want to waste my time on them.

Ho Ho
"What I meant was that multiplying two 128bit SIMD registers takes less time than multiplying two x87 registers."

WRONG! You can't just fart out sh*t like this. At most you may say their speeds are implementation dependent. In the case of Pentium-4 and Athlon64, SSE and x87 share the same FPU and register file. You are confusing ISA-visible registers with the physical register file in the processor. You're also confusing ISA instructions with the processor's micro-ops. x87 instructions execute just the same as SSE ones internally.

There is one situation where it is more advantageous to execute non-vectorized FP in SSE/MMX than in x87. If there are dependencies among the instructions, x87's stack-based register access (versus random access of SSE/MMX) can become inefficient. This however has nothing to do with the speed of multiplying two SSE or x87 registers.

Ho Ho
"If you don't believe me then read this:
For example, the MMX and SSE aligned codes for addition of two arrays are up to 2.26 and 2.72 times faster than their implementations using misaligned accesses on the Pentium 3 and Pentium 4 processors, respectively."

The problem is you have bad reading skill and bad logic. This makes you consistently make false conclusions from unrelated "evidence".

The error in your previous claim is this, Ho Ho: "With x87 there is no difference of if the data is aligned or not." This is wrong, and I have seen it (misalignment) slow down execution all the time. Misaligned data is slower to load and store; once it is loaded into registers, it executes just the same. Misalignment increases the memory-to-register load time for both SSE and x87 instructions.; 2:02 PM, March 16, 2007
Ho Ho said...: abinstein
"That means the x87 fp state is context swapped, literally. Now tell me, can you read?"

They are swapped in userland processes in this version of OS but when you continue to use x87 it is not certain that your program would work on the next version. That's the definition of "deprecated".

abinstein
"In the case of Pentium-4 and Athlon64, SSE and x87 share the same FPU and register file"

Where on earth did you get that? You do know you can use x87/MMX/3dnow* and SSE registers in parallel without any problems?
*) those three share a register file.

abinstein
"In the case of Pentium-4 and Athlon64, SSE and x87 share the same FPU and register file"

All x87 operations are 80-bit. SSE ones are either 32 or 64-bit. Are you suggesting that CPUs have only 80-bit FPUs?
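For reference, the operand widths under discussion can be read off the C floating-point types. This is only a minimal sketch: it assumes a compiler where `long double` maps to the x87 80-bit extended format (as GCC does on x86; other compilers may treat `long double` as a plain 64-bit double), and the function names are made up for illustration. It says nothing about how wide the hardware execution units are.

```c
#include <float.h>
#include <stdio.h>

/* Significand width in bits, counting the leading bit.
   IEEE single = 24, IEEE double = 53, x87 extended = 64. */
int float_significand_bits(void)       { return FLT_MANT_DIG; }
int double_significand_bits(void)      { return DBL_MANT_DIG; }
int long_double_significand_bits(void) { return LDBL_MANT_DIG; }

/* Prints the storage size and significand width of each type.
   The long double figures are compiler and platform dependent. */
void print_fp_formats(void) {
    printf("float:       %zu bytes, %d-bit significand\n",
           sizeof(float), float_significand_bits());
    printf("double:      %zu bytes, %d-bit significand\n",
           sizeof(double), double_significand_bits());
    printf("long double: %zu bytes, %d-bit significand\n",
           sizeof(long double), long_double_significand_bits());
}
```

Calling `print_fp_formats()` from `main` shows which format the compiler picks for `long double` on a given machine.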

abinstein
"The error in your previous claim is this, Ho Ho: "With x87 there is no difference of if the data is aligned or not."

Tearing it out of context will certainly make it wrong. I was talking about 16bit alignment and I thought it was obvious. As I said, data generally gets aligned to its native size. To get anything else you have to work a bit.; 3:28 PM, March 16, 2007
abinstein said...: Ho Ho
"They are swapped in userland processes in this version of OS but when you continue to use x87 it is not certain that your program would work on the next version. That's the definition of "deprecated"."

Only 64-bit assembly code cannot use x87. If you're not writing a kernel level driver (which is arguably part of the OS), you don't even need to care. The compiler, however, affects all programs.

Besides, the fact that Microsoft explicitly mentions the incompatibility of x87 assembly in 64-bit mode shows it is not a normal thing for an OS to do. You are citing a special-case design artifact as your argument.

Ho Ho
""In the case of Pentium-4 and Athlon64, SSE and x87 share the same FPU and register file"

Where on earth did you get that? You do know you can use x87/MMX/3dnow* and SSE registers in parallel without any problems?"

For P-4, read the Q1 2001 issue of the Intel Technology Journal on the Pentium 4 microarchitecture. For Athlon64, see here.

Of course, you don't need to believe what they say, but your "use registers in parallel" or "80-bit FP" theories prove nothing, either.

Ho Ho
"Tearing it out of context will certainly make it wrong. I was talking about 16bit alignment and I thought it was obvious."

I believe you mean 16-byte, not 16bit.
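To make the difference concrete, here is a minimal sketch of checking an address against a 2-byte (16-bit) or 16-byte boundary. The helper names `is_aligned` and `align_up` are hypothetical, and both assume the boundary is a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p sits on a `boundary`-byte boundary, 0 otherwise.
   Assumes boundary is a power of two (2, 4, 8, 16, ...). */
int is_aligned(const void *p, size_t boundary) {
    return ((uintptr_t)p & (uintptr_t)(boundary - 1)) == 0;
}

/* Rounds p up to the next `boundary`-byte address.
   The caller must make sure the buffer has enough slack for this. */
void *align_up(void *p, size_t boundary) {
    uintptr_t addr = (uintptr_t)p;
    uintptr_t mask = (uintptr_t)(boundary - 1);
    return (void *)((addr + mask) & ~mask);
}
```

An address that passes `is_aligned(p, 2)` can still fail `is_aligned(p, 16)`, and it is the 16-byte case that the aligned 128-bit SSE loads and stores care about.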

Whether 16-byte or 16-bit, alignment affects performance (and sometimes correctness) in all cases: integer, FP, or media/vector. SSE2-128 is affected more only because its operands are longer.; 5:53 PM, March 16, 2007
Unknown said...: Even AMD admits using an MCM quad core approach is a "smarter choice". “If I could do something different, I wish we would have immediately done a MCM - two dual cores and call it a quad-core,” said Mario Rivas, an EVP at AMD, during a recent interview in Austin.

http://www.reghardware.co.uk/2007/03/17/amd_rivas_barcelona/

AMD BK Q2'08.; 6:11 PM, March 16, 2007
Unknown said...: Ho Ho, you may have knowledge of the AMD64 or whatever cpu architecture instruction set but that does not mean you fully understand how to use them and it definitely does not mean that you are qualified to create an operating system.

I suggest you drop your arguments with regards to how operating systems work or should work and what cpu features are good or useful.

Please do not use that pitiful save-on-FTP text editor as evidence of your programming prowess. It has nothing to do with operating system programming, which is on a completely different level.; 7:59 PM, March 16, 2007
Christian Jean said...: It has been a long while since I've done low-level code, but abinstein is correct when he says:

"In the case of Pentium-4 and Athlon64, SSE and x87 share the same FPU and register file"

At least during the original MMX days it was that way. I don't know if all of this has changed with the SSE, SSE2, etc.

In the late 90's when I coded a low-level video driver using MMX, you could NOT use both the FPU and MMX instructions.

------

An aligned memory architecture will always be much faster than a packed one. It takes a lot of overhead for the processor to fetch non-aligned and non-sequential memory. There are many added instructions in shifting and masking.
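A minimal sketch of the two layouts, for anyone who has not met packing before. It assumes GCC or Clang, since `__attribute__((packed))` is a compiler extension and not standard C, and the struct and function names are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Default layout: the compiler pads after `tag` so that `value`
   lands on its natural boundary. */
struct natural_layout {
    uint8_t  tag;
    uint32_t value;
};

/* Packed layout (GCC/Clang extension): no padding, so `value`
   starts at byte offset 1 and is misaligned. */
struct packed_layout {
    uint8_t  tag;
    uint32_t value;
} __attribute__((packed));

/* Byte offset of `value` inside each layout. */
size_t natural_value_offset(void) { return offsetof(struct natural_layout, value); }
size_t packed_value_offset(void)  { return offsetof(struct packed_layout, value); }
```

The packed struct is smaller, but every access to `value` is a misaligned access that the compiler or the hardware has to cope with.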

First, it's been a while since I've done this low-level work, and I would assume that with all the prefetching possible these days, some of this overhead must be reduced.

Second, why would anyone pack their memory? This is not the default compile mode and you usually must explicitly code for it or request it at compile time.

Third, if you have a packed structure where you have your SIMD data, why would it be loaded as non-aligned data rather than as a single value?; 8:10 PM, March 16, 2007
Christian Jean said...: My god, some more good news this month coming from the AMD camp:

All AMD R6xx chips are 65 nanometre chips, now; 8:25 PM, March 16, 2007
