Monday, March 26, 2007

AMD is going to blood Intel

Athlon 64 X2 4600+ boxed CPU for $113. With FAB30, FAB36 and Chartered cranking massive numbers of dual core CPUs, AMD can afford a bloody price war that is aimed to take substantial portions of the x86 market pie.

This indicates to me that AMD's 65nm chips will show a substantial clock speed increase.

64 Comments:

Blogger Unknown said...

"we price our products according to the value they represent to our customers"

-Suzy Pruitt, AMD Spokesperson

I guess customers aren't getting much value out of those AMD products.

3:31 PM, March 26, 2007  
Blogger R said...

This Q will be interesting. If AMD is going to settle for low margins they had better have some market share gains. K10 should have some enterprise market share gains as well in Q2-3.

I read BofA is not expecting any price cuts from Intel to counter AMD's aggressive price structure.

4:29 PM, March 26, 2007  
Blogger Roborat, Ph.D said...

AMD's Dual core CPUs priced comparable to Single core P4 NetBurst CPUs? Looks to me AMD is forced to sell CPUs at a loss.

Intel says they have no idea what AMD is on about with the price war? Forget about market share. Looks like AMD is cutting prices just to get rid of inventory.

5:07 PM, March 26, 2007  
Anonymous Anonymous said...

I have watched this scenario play out for over 2 years now & read everyone's opinions pro or con & everyone is entitled to their opinion. AMD lets out very few PRs compared to Intel. Yesterday I was amazed to read about the 90nm plant Intel was building in China. For what reason? To hide money? I have always felt AMD has an ace up their sleeve & it will be coming soon. Stay tuned for the latest here...

5:12 PM, March 26, 2007  
Anonymous Anonymous said...

Check this out from AMD'S Fab 36
300 MM Quad Core

http://www.amd-images.com/

5:31 PM, March 26, 2007  
Blogger Roborat, Ph.D said...

I have always felt AMD has an ace up their sleeve

when was the last time you saw AMD pull an ace out? K8 was pre-announced to death. K10 as well. AMD has nothing except K10. If anything AMD is known for over promising and being overly optimistic. Again, AMD has nothing after K10 while Intel rolls out 40% more efficient and faster 45nm Penryn at the same time, followed by A new Core, Nehalem just when AMD is about to ramp K10.
AMD is now just too big and slow to catch Intel. Soon AMD will just become like Transmeta and Via. Insignificant.

5:42 PM, March 26, 2007  
Anonymous Anonymous said...

Roborat, Ph. D. said...AMD is now just too big and slow to catch Intel. Soon AMD will just become like Transmeta and Via. Insignificant.

IBM/AMD!!!!

We will all see in the coming months who has the best product....

5:56 PM, March 26, 2007  
Anonymous Anonymous said...

http://money.cnn.com/magazines/fortune/mostadmired/2007/snapshots/756.html

5:59 PM, March 26, 2007  
Blogger Unknown said...

Check this out from AMD'S Fab 36
300 MM Quad Core

http://www.amd-images.com/


Oh wow!! Pictures of a CPU!!! Of course... you could just go buy a quad core CPU that wipes the floor with AMD's pathetic 4x4 right now.

Yes! AMD's ace is R600! Oh whoops... they've delayed it again. They need to raise the clockspeed a few hundred MHz higher to compete with the new 8800 Ultra due out in a few weeks!

Nvidia has the best performance GPUs on the planet: Twice the performance of AMD's crap. G80 will ensure that Nvidia takes virtually all of the discrete GPU market.

Intel's CPUs are far superior to AMD's. Far faster. As Sharikou inadvertently showed us a while ago, an Intel 2P server frags an AMD 4P server. Core 2 Quad frags 4x4 and Core 2 Duo makes a bloody mess of the Athlon X2. Pathetic.

AMD BK Q2'08.

5:59 PM, March 26, 2007  
Blogger Scientia from AMDZone said...

roborat
when was the last time you saw AMD pull an ace out?

K6, 3DNow, copper interconnects, slot and socket A, K7, Athlon MP, SOI, FAB 30 expansion, RAS, Pacifica, and Torrenza.

K8 was pre-announced to death.

Unlike Itanium.

If anything AMD is known for over promising and being overly optimistic.

As opposed to Intel, who delivered the unstable Coppermine 1.13GHz PIII which then had to be recalled. Then Intel promised 5GHz Prescott, 7GHz Tejas, 10GHz Nehalem, and multi-core Whitefield with CSI, all of which were canceled. I can't seem to recall that AMD has ever killed an announced processor.

Again, AMD has nothing after K10 while Intel rolls out 40% more efficient and faster 45nm Penryn at the same time,

I guess if the same time means Intel delivers Penryn 2 quarters after K10. Then AMD will deliver 45nm K10 plus DC 2.0 and a new mobile core in mid 2008 2-3 quarters later than Penryn.

followed by A new Core, Nehalem

In 2009.

just when AMD is about to ramp K10.

nearly 2 years after K10 is launched.

If you believe it takes 2 years to ramp K10 then I think you've been sniffing the white-out thinner again. AMD will be ramping the newer, modular core with DC 2.0 and the new mobile core on 45nm when Nehalem is launched.

11:37 PM, March 26, 2007  
Blogger Unknown said...


followed by A new Core, Nehalem

In 2009.


Incorrect. Nehalem is on target for next year. The 32nm shrink of Nehalem, Westmere is due for 2009.

12:11 AM, March 27, 2007  
Blogger abinstein said...

roborat:"AMD's Dual core CPUs priced comparable to Single core P4 NetBurst CPUs? Looks to me AMD is forced to sell CPUs at a loss."


The real question is, will AMD or Intel make more sales in this case? Note that if you go with the AM2 K8 X2 today you can upgrade it to QC next year. Can you do so with a single-core P4? The low price of AM2 K8 X2 spells trouble for not only P4 but also C2D up to E64xx.

And remember, these are the majority of chips that are sold out there.

There could be multiple reasons why AMD's doing this. One is certainly the pressure from high-end C2D, which K8 X2 has a hard time competing with. Another could be that AMD has a healthy output of 65nm/65W processors. Look at Newegg: for $3 more you get to buy a CPU that consumes 27% less power; this makes it likely that those 89W chips are the ones stuffing the channel.
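As a quick check of the 27% figure, here is a minimal sketch that assumes the comparison is between the rated 89W and 65W TDP values mentioned in the comment (rated TDP, not measured power draw; the variable names are only illustrative):

```python
# Rough check of the power saving, assuming the comparison is between
# the 89W and 65W TDP ratings (an assumption, not measured consumption).
old_tdp_watts = 89
new_tdp_watts = 65

saving = (old_tdp_watts - new_tdp_watts) / old_tdp_watts
print(f"TDP reduction: {saving:.1%}")
```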

As for Nehalem - it is not going to have a brand new microarch. It'll probably have IMC and CSI, with cores designed to scale up above 4 (current Core 2 must use MCM beyond dual cores). Intel's going to address the problem(s) of Core 2 that AMD solved in K7 5 years ago.

12:47 AM, March 27, 2007  
Blogger Unknown said...


Can you do so with a single-core P4?


Yes. You can. The Core 2 Quad and Pentium 4 use the same socket.


As for Nehalem - it is not going to have a brand new microarch. It'll probably have IMC and CSI, with cores designed to scale up above 4 (current Core 2 must use MCM beyond dual cores). Intel's going to address the problem(s) of Core 2 that AMD solved in K7 5 years ago.


How do you know this? Almost nothing is known about Nehalem, except that it will feature CSI. The current rumor is that the server variants of Nehalem will feature an inbuilt memory controller while desktop variants do not. We've also heard of two sockets that will probably be used for Nehalem. One with 1300 odd pins, the other with 700 odd pins. (I'm too lazy to look up the exact numbers =P)

1:27 AM, March 27, 2007  
Blogger abinstein said...

"Yes. You can. The Core 2 Quad and Pentium 4 use the same socket."

The same socket is totally different from plug-in compatibility. If the supported chipsets are different then you're totally screwed. I've lost track of Intel's chipset compatibility and I'm sure it's no easy task for any average PC buyer.

4:20 AM, March 27, 2007  
Blogger Ho Ho said...

abinstein
"I'm sure it's no easy task for any average PC buyer"

No average PC user buys and installs CPUs him/herself.

4:41 AM, March 27, 2007  
Blogger Roborat, Ph.D said...

Scientia from AMDZone said...
roborat
when was the last time you saw AMD pull an ace out?

K6, 3DNow, copper interconnects, slot and socket A, K7, Athlon MP, SOI, FAB 30 expansion, RAS, Pacifica, and Torrenza.


"ace up their sleeve" is a term that means a backup plan or a surprise secret weapon. I'm not sure if you're familiar with the term, but none of what you suggested above can be categorized as such. Try again.

In essence, you very well know AMD has nothing coming that will save them from their current decline.

As opposed to Intel...
The subject isn't about Intel as obviously they don't need anything else besides what they have and what they promise to deliver.

You have a tendency to respond for the sake of having a response while at the same time failing to grasp the other person's main point.

Again the point is, K10 is coming but may not be enough. Ted said AMD has a secret weapon. I said, AMD doesn't. Then you come in saying so does Intel.
See how silly your argument is?

5:09 AM, March 27, 2007  
Blogger Scientia from AMDZone said...

giant
Incorrect. Nehalem is on target for next year. The 32nm shrink of Nehalem, Westmere is due for 2009.

I'm sorry but this isn't true. The new socket with CSI will be released on Itanium first in late 2008. This does not leave any time in 2008 to also release Nehalem. Intel will not release the new socket on Nehalem before Itanium and Intel will not be able to release Itanium earlier than late 2008.

Westmere is end of 2009; Nehalem will be early 2009. I guess this is the same as the way that some Intel enthusiasts kept trying to push Penryn forward to Q2 to match K10. Now this group is claiming Q3 but it is looking like Q4 at the earliest. Nehalem will not be out mid 2008 when AMD releases 45nm. Intel is releasing a very expensive quad FSB chipset in Q4 and wouldn't bother if this would be made obsolete by mid year. In reality this chipset will be used for about a year before Nehalem is released.

6:36 AM, March 27, 2007  
Blogger Christian Jean said...

BofA? If you read their crap, you haven't been reading this blog in the last year or so!

AMD lets out very few PRs compared to Intel. [...] I have always felt AMD has an ace up their sleeve & it will be coming soon. Stay tuned for the latest here...

I agree with you on that. Again, I sound like a broken record, but I must repeat it:

First, Ruiz has publicly promised 'jaw-dropping' innovations coming from AMD. Ruiz is conservative and does not usually overpromise or spread FUD.

Second, what could they have been doing in the last 4 years, considering that there were no significant Athlon/Opteron enhancements?

The only problem is that Intel is a PR machine. With all the spies they have embedded within AMD, they would surely know about it, yet Intel has not made an announcement of any sort to try to appropriate or downplay any such new technology.

Best case scenario it will be Reverse Hyperthreading! But then again, what are the chances?

6:48 AM, March 27, 2007  
Blogger Roborat, Ph.D said...

abinstein said...

The real question is, will AMD or Intel make more sales in this case? Note that if you go with the AM2 K8 X2 today you can upgrade it to QC next year.

Really? Are you telling me that the channel buyers and the white box makers who buy P4s in large quantities in order to sell to third-world markets worry about AM2 socket compatibility?

Are you telling me that the people who buy these low-end systems think about CPU upgrades? Get real!

6:49 AM, March 27, 2007  
Blogger Christian Jean said...

The following is a little segment that CNBC had on AMD with analyst Eric Ross and Jim Cramer.

Eric Ross: Director of Electronics Research for Think Equity Partners

Jim Cramer: CNBC host

Summary of one question:

Cramer asked a question about the current rumors going around concerning a private equity takeover of AMD. Eric responded that he loves the AMD guys, that they have done a great job in the last few years and have taken significant market share from Intel, and that a private equity takeover did not make any sense at all to him. Jim Cramer completely agreed with that assessment.

Transcript of the last question:

Cramer: Every time Intel is about to crush them, their buddy pal friend at AMD, the Justice Department steps in. Are we in one of those situations again where Intel cannot afford to wipe out AMD?

Eric: Well I don't think that Intel could wipe out AMD even if Intel wanted to. I don't think we are going back to the same share level we saw a couple of years ago. AMD now has very good technology. It's just not quite as good vs. Intel as where it was last year.

7:16 AM, March 27, 2007  
Blogger Christian Jean said...

"K8 was pre-announced to death. K10 as well."

Right, but you've got to make a distinction for a technology that must be introduced early in order to get support, or that requires others to make changes before it can be used.

You'd look pretty stupid if you introduced 64-bit extensions and kept them a secret from the software world.

Or kept Torrenza secret from third-party hardware designers.

But on the other hand, you can afford to keep technologies secret when the reliance on others is little to none.

If anything AMD is known for over promising and being overly optimistic.

Now you know that is total BS!

7:35 AM, March 27, 2007  
Blogger Christian Jean said...

"AMD will be ramping the newer, modular core with DC 2.0 and [...]"

Now this 'modular' thingy has always intrigued me. I'm a software developer and the term modular means that it is abstract enough to be reused practically anywhere.

Does the term 'modular' in CPU parlance mean that each component is abstract enough that the design can be altered much more easily than a monolithic design?

So in other words, modular by design rather than modular by assembly:

Imagine the day when AMD produces a wafer full of ALUs, another wafer full of GPUs, another full of FPUs, etc.

Then they 'natively' assemble an Athlon X18 die with 6 ALU modules, 8 FPU modules and 10 GPU modules, etc. 'Cause I don't know how feasible it will be to keep producing full native multi-cores on wafers. It reduces capacity by too much!

7:45 AM, March 27, 2007  
Blogger Scientia from AMDZone said...

Actually, roborat, you also tried to claim that AMD has a history of not delivering. Clearly, with 3 processor cancellations, that distinction belongs to Intel, not AMD.

Theoretically the phrase "ace up his sleeve" would be distinguished from "ace in the hole" by "sleeve" implying trickery since having an ace in one's sleeve would obviously be cheating. However, most people use the two phrases interchangeably to refer to a hidden asset.

You are being knowingly obtuse about what constitutes a surprise. It wasn't a surprise that K6 was released but it was a surprise how well it performed (catching up to PII) after the lackluster performance of K5. Likewise, no one really expected K7 to surpass PIII. Athlon MP was even used in supercomputers and was a huge asset since this gave AMD credibility for Opteron and was a proving ground for both HyperTransport and MOESI. AMD's version of RAS and Pacifica were surprises since Intel has its own standards of these but it could be argued that these have not demonstrated their worth yet. Torrenza however was a surprise and is clearly an asset.

Although K10 is known, K10's actual performance is not. K10 could be an ace if it makes AMD sufficiently competitive. However, if Intel manages to keep the clock speed of C2D high enough to offset any IPC gains by K10 then it isn't. C2D was an ace for Intel after 2 years of disappointing performance and 3 canceled processors.

Finally, your claim that Penryn would be released at the same time as K10 is ridiculous as is your claim that Nehalem will be out while K10 is still ramping.

In reality, by the time Nehalem is released everything AMD produces except for Sempron and mobile will be K10. Sempron will be the current 65nm Brisbane. What will actually be ramping when Nehalem is released will be 45nm production and the new modular core with DC 2.0 as well as the new mobile core.

You said that AMD has nothing after K10. If we ignore the new mobile core and DC 2.0 we would be limiting the discussion to core changes to K10. It isn't currently clear if AMD intends to make modular changes when 45nm is launched or whether AMD will wait until later in 2009. Either way, AMD should have an improvement to the core while Nehalem is merely being shrunk to 32nm. How much advantage either might have is unknown since nothing is currently known about the performance of Nehalem.

7:45 AM, March 27, 2007  
Blogger Ho Ho said...

jeach!
"Best case scenario it will be Reverse Hyperthreading!"

I thought I already said why reverse HT wouldn't work in real world in some earlier post.

8:13 AM, March 27, 2007  
Blogger pointer said...

Roborat, Ph. D. said to Scientia...
Again the point is, K10 is coming but may not be enough. Ted said AMD has a secret weapon. I said, AMD doesn't. Then you come in saying so does Intel.
See how silly your argument is?


Just ignore him. He has a tendency to use a statement that seems related, but actually isn't, to prove his point, and occasionally to use a TRUE statement to prove that his other FALSE statement is true. His logic goes like this: because my statement 1+1=2 is correct, my other statement 1+2=4 is true too.

I'm sorry but this isn't true. The new socket with CSI will be released on Itanium first in late 2008. This does not leave any time in 2008 to also release Nehalem. Intel will not release the new socket on Nehalem before Itanium and Intel will not be able to release Itanium earlier than late 2008.


Most non-insiders would learn a particular roadmap through published or leaked data, not logic (and especially not your logic). Read this link, which provides a great summary of all the future products from both Intel and AMD:
http://asia.cnet.com/reviews/pcperipherals/0,39051168,61998152,00.htm

Insist on using your limited logic? Try looking at the old wall clock: Tick-Tock, Tick-Tock ... 2006 is Merom/Conroe, 2007 is Penryn, guess what comes in 2008?

8:37 AM, March 27, 2007  
Blogger Unknown said...

Nehalem = 2008.

Don't believe me? Hear Pat Gelsinger say it himself. http://www.hexus.tv/show.php?show=4

He clearly states "Nehalem, the big '08 project."

More proof:- http://dailytech.com/article.aspx?newsid=6185

Nehalem, Intel's next-generation micro architecture on the 45nm node slated for 2008, will require new platform technology and is not compatible with the Penryn platform.

Even more?
http://dailytech.com/article.aspx?newsid=5869

Smith closed our conversation with "In 2008, we'll have Nehalem."

Is this enough proof yet? Where did the Penryn in 2008 come from? Intel has always said Penryn is coming "some time in the second half of 2007". That could mean July, or it could mean December.

9:26 AM, March 27, 2007  
Blogger lex said...

"With FAB30, FAB36 and Chartered cranking massive numbers of dual core CPUs, AMD can afford a bloody price war that is aimed to take substantial portions of the x86 market pie."

Let's see the Pretender's logic...
They couldn't make any money before, so let's ramp a more expensive tool set with more depreciation and at the same time chop prices by 50%. We couldn't make money with a 2x bigger die with smaller costs, so let's ramp a 1/2 size die with more cost and sell it at 50% of the price. Let's not improve performance. Let's offer Yugo-like pricing and quality... somebody will buy it and we can manufacture ourselves right out of business.

Sorry, making more die and selling at a bigger loss with a higher cost structure is a BK strategy...

AMD BK in 2008. Prediction is Ruiz will be gone and a private equity firm will take it over and sell the manufacturing to TSMC, and AMD will go the way of Rambus and Transmeta, hoping to make a few dollars from their IP...

9:34 AM, March 27, 2007  
Blogger lex said...

Tick Tock Tick Tock.

AMD's life is slipping away.

Core 2 crushes them and requires a 50% price cut. Market share erosion in servers stops.

Penryn comes with 3 300mm factories. Expect complete migration to 45nm by 2008 with Core 2 Celeron on 65nm going for less than 75 bucks. Intel still makes a billion per quarter.

Nehalem launches in 2008 and AMD MS in servers falls to the low teens.

AMD continues to make noise about Fusion and other BK strategies.

Tick, Tock, Tick, Tock.

9:37 AM, March 27, 2007  
Blogger Christian Jean said...

"I thought I already said why reverse HT wouldn't work in real world in some earlier post."

Just because you said it, it doesn't make it so.

Besides 'I' believe Reverse HyperThreading is doable... initially in a part and eventually in whole.

For example, an easy implementation of it would be to use Intel's current HyperThreading. But instead of feeding the other pipeline with instructions from an alternate thread, you feed it with the instructions of a branch.

It's not 'if' this will be done, it's a matter of when it will be done.

Although I admit that doing a full/whole implementation of Reverse HyperThreading is EXTREMELY difficult.

10:19 AM, March 27, 2007  
Blogger PENIX said...

Jeach! said...

"I thought I already said why reverse HT wouldn't work in real world in some earlier post."

Just because you said it, it doesn't make it so.


If you would like to view some of ho ho's incoherent rambling on this matter, please view this thread.

11:42 AM, March 27, 2007  
Anonymous Anonymous said...

http://ct.techrepublic.com.com/clicks?t=35497529-c630f4132b7397175b602de95885c3d8-bf&s=5&fs=0


In a manipulative attempt to demonstrate leadership over AMD, Intel twists benchmark data to make it look as though its latest Xeons are outperforming the best AMD has to offer.

2:12 PM, March 27, 2007  
Blogger Christian Jean said...

Penix, I took you up on that thread offer!

Not only was 'ho ho' rambling on reverse HT, but I also read the thread on his disk cache theory, file systems with atomic operations and how they help with data corruption.

Quite amusing!

2:49 PM, March 27, 2007  
Blogger Ho Ho said...

jeach!
"For example, an easy implementation of it would be to use Intel's current HyperThreading."

Simultaneous multithreading is nothing like reverse HT is supposed to be. Sure, its name makes you think it is but in reality they are two different beasts.


"But instead of feeding the other pipeline with instructions from an alternate thread, you feed it with the instructions of a branch."

That isn't reverse HT, you know. Also, even if it could work, it wouldn't be all that efficient. In the average case it would run at around half the speed of a non-reverse-HT CPU, unless you fill the CPU with loads of pipelines that go unused most of the time. I personally would put those pipelines in an additional core and gain performance instead of losing it.

Also, branch predictors work quite well in most CPUs, especially on Intels since they had to keep their up to 31 stage pipeline fed with instructions. Core2 inherited that predictor and K10 will have some improvements to the K8 one too. How big an impact on overall performance would it have, assuming that most branches are predicted correctly anyway?


penix
"If you would like to view some of ho ho's incoherent rambling on this matter, please view this thread."

Let me remind you that penix was the same one who said that adding cores to a reverse HT capable CPU would increase single-threaded performance linearly. That's right, with four 2GHz cores the CPU was supposed to work as fast as an 8GHz single core. He also said it would help a lot in a multithreaded server environment. Now explain how that could work in the real world, or at least in the world where you and/or I live; you failed at it last time.


To explain once again why reverse HT can't work as nicely as many seem to hope, think about these things.

First, reverse HT would only make anything better when it could save the costly pipeline flush that occurs in every CPU when a branch is not predicted correctly. It costs as many cycles as the length of the pipeline (12 for int and 17 for floats on K8, 14 on C2D, 20 for Northwood and older and 31 for Prescott and newer NetBursts). That means when you have branches that take longer than a few hundred cycles, eliminating the flush wouldn't make too big a difference. The biggest difference could be made when the branches take just a couple of cycles to compute, but that requires any extra work to be free.
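To put rough numbers on that point, here is a minimal sketch using the pipeline depths quoted above. It assumes the flush costs about one cycle per pipeline stage, and the 5-cycle and 300-cycle branch lengths are hypothetical values picked only for illustration:

```python
# Pipeline depths (in stages) as quoted in the comment; the flush penalty
# is assumed to be roughly one cycle per stage.
PIPELINE_DEPTH = {
    "K8 (int)": 12,
    "K8 (float)": 17,
    "Core 2": 14,
    "Northwood": 20,
    "Prescott": 31,
}

def flush_share(branch_cycles: int, flush_cycles: int) -> float:
    """Fraction of total time lost to the flush when a branch is mispredicted."""
    return flush_cycles / (branch_cycles + flush_cycles)

for name, depth in PIPELINE_DEPTH.items():
    short = flush_share(5, depth)    # hypothetical branch body of a few cycles
    long = flush_share(300, depth)   # hypothetical branch body of a few hundred cycles
    print(f"{name:12s} short branch: {short:.0%}  long branch: {long:.0%}")
```

Under these assumptions the flush dominates a mispredicted short branch but is only a small share of a long one, which is the asymmetry the paragraph describes.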

Say we have a regular (statements0) if (condition) then (statements1) end if (statements2). How do you make reverse HT work with that? It basically has just statements1 in a branch. Should a reverse HT CPU start calculating statements1 and statements2 in parallel? What if either of those means exiting the function? In the Linux kernel there is around one else for every 5 ifs and one "else if" for every 25 ifs. That means in only 20% of cases do you have anything to run in parallel, and in 4% of cases you could do more than two things in parallel.

What if you have a function that uses all of the registers of the core (around half a KiB of data in x86-64)? You have an if with an else block and both codepaths take around 20 cycles to complete. Some of those registers (variables) are modified before branching and used/modified in both codepaths. How fast can you get all the necessary data to the other core, calculate the result of the alternate codepath and return it to the first core that is running the program?

What if both codepaths set the value at some memory address to different values; what data would be written to memory? What if there is a context switch, and perhaps one or all of the core cache lines holding different values for that memory address need to be written to RAM; from which branch should the data be written to memory?

Say you calculated both branches on separate cores and the one that was calculating the right path was not the one that runs the actual thread. That means you have to transfer all the data back to the executing core and somehow insert it in the middle of the pipeline without flushing it.

Not that simple, is it? All current CPUs simply flush the entire pipeline when there is wrong data inside it. So basically we are back to square one. Both branches were around 20 cycles long, so reverse HT with pipeline flushing would not make things any better than they are today. If you could somehow remove the flushing, you could simply do it on a single core without having to transfer data to other cores. It would be a lot more efficient. As I said, today the flush simply occurs before starting to execute the branch, assuming that it was predicted wrong.
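The trade-off above can be sketched as a toy cost model. Everything here is an assumption for illustration: the 5% misprediction rate and the transfer costs are made-up inputs, the 14-cycle flush is the C2D figure from earlier, and the dual-path case is given the benefit of the doubt by assuming it never flushes at all:

```python
def predicted_cost(branch_cycles, flush_cycles, miss_rate):
    """Expected cycles with ordinary branch prediction on one core."""
    return branch_cycles + miss_rate * flush_cycles

def dual_path_cost(branch_cycles, transfer_out, transfer_back):
    """Cycles when both paths run on two cores and state is copied both ways."""
    return transfer_out + branch_cycles + transfer_back

# Hypothetical inputs: 20-cycle branches, a 14-cycle flush and an assumed
# 5% misprediction rate.
baseline = predicted_cost(20, 14, 0.05)

for transfer in (0, 1, 10):
    eager = dual_path_cost(20, transfer, transfer)
    verdict = "faster" if eager < baseline else "not faster"
    print(f"transfer={transfer:2d} cycles each way: {eager:.1f} vs {baseline:.1f} -> {verdict}")
```

With these inputs, running both paths only wins when moving the state between cores is essentially free; a single cycle each way already erases the gain.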

Perhaps you should simply make it so that the core that calculates the correct codepath becomes the one that runs the whole thread? OK then, say bye-bye to hot caches and get rid of a lot of performance in one go. An L1 cache miss costs around 2-4 cycles, an L2 miss costs around as much as a pipeline flush (12-18), an L3 miss a bit less than twice that. Missing the entire cache hierarchy costs around 70-300 cycles, depending on memory and MC.

So far I only talked about two cores. How could a quadcore improve the speed of execution of a single thread? There usually aren't four codepaths to be able to run in parallel.

Also, let's look at it from another angle. Say that we access some random parts of memory on both codepaths. By taking both branches we bring in more data, even when it is not needed. That means unnecessary bandwidth usage and cache pollution. Also, when we offload parts of the program to another core, it will have to fetch some of the data again from either shared caches or from memory. That again means cache pollution, bandwidth usage and unnecessary waste of resources.

In conclusion I can say that reverse HT can give anything useful only when all of these things are there:
1) each and every core can access the entire cache hierarchy of every other core as fast as its own. Having direct access to their registers would be a great bonus
2) moving data to be executed on other cores is free (takes zero cycles)
3) moving data back from the other cores is free
4) inserting the data to running core's pipeline is free
5) there shouldn't be any memory collisions
6) there shouldn't be any pipeline flushes at any point

jeach!
"Quite amusing!"

The fun is just about to begin when replies to this post start appearing.

Could anyone describe to me how reverse HT would work in a way that runs programs faster than a CPU without reverse HT? So far I haven't heard anything but "I know it works because I said so. No, I won't say how it works because I'm too smart". I at least try to explain why it isn't that simple; no one has even tried to elaborate on their points. If you can't describe things in much detail then at least try to address those six points I made. Shouldn't be that hard considering how many knowledgeable CPU experts we have here.


Somewhat OT, but I'd like to see regular CPUs having at least 2-4x as many registers as they have now. That way you could use ultralight parallel processing in one thread. In case there is a cache miss you could somehow detect it and start working on another dataset in other registers. When you get a miss with that dataset you either hope that the data for the first has arrived or you take a third set and continue. It is used in the SPUs in Cell and it can give a huge performance boost. Of course in Cell the cache (local store) miss is considerably more expensive than on regular CPUs. Still, that kind of system would be doable and could improve performance in real-world applications (after writing special code). Of course it would also require quite big changes to x86. Too bad AMD only doubled the register count; Power had twice the x86-64 register count years before it and Cell basically has 8 times as many.
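A back-of-the-envelope model of that idea, as a sketch only: it assumes each dataset does a fixed amount of work between cache misses, that switching register sets is free, and that the misses of different datasets overlap perfectly. The 50-cycle and 200-cycle figures are hypothetical (200 falls within the 70-300 cycle range given above for missing the whole cache hierarchy):

```python
def utilization(datasets: int, work_cycles: int, miss_cycles: int) -> float:
    """Fraction of time the core does useful work when it can switch
    between `datasets` independent register sets on every cache miss."""
    return min(1.0, datasets * work_cycles / (work_cycles + miss_cycles))

# Hypothetical numbers: 50 cycles of work between misses, 200-cycle miss.
for n in (1, 2, 4, 8):
    print(f"{n} dataset(s): {utilization(n, 50, 200):.0%} busy")
```

In this simplified model the core stays fully busy once there are enough datasets to cover one miss latency, which is why more registers would be needed before the trick pays off.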

Also having more registers would considerably speed up my personal favourite task: ray tracing. Being able to trace 64 ray packets instead of 4 or 16 rays can make a huge difference.


Sorry about a post longer than all of this year's Sharikou writings combined, but how can I help it when people still insist that reverse HT is possible? I want to inform them and hope to have a decent discussion.

4:48 PM, March 27, 2007  
Blogger Ho Ho said...

Just for fun statistics I counted all the if's, elses and else if's in Linux 2.6.20.1 source tree:

Total lines of code: 5750928
Total if's: 342103
Total elses: 70771
Total else if's: 15705

As this was a very basic query, "else if" occurrences were also included in both the plain if and else counts.

So basically you have one else for every 4.8 ifs and one else if for every 21.8 ifs. My last numbers were based on only the fs subdir.

I hope these little statistics can help at least a bit in coming up with a decent reverse HT architecture description.
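For anyone who wants to repeat the count, here is a minimal sketch of a similarly naive keyword count. It is not the query that was actually run; the regular expressions, the function name and the sample string are illustrative assumptions, and like the original query it counts each "else if" once as an if and once as an else. The ratios at the end are computed from the totals listed above, which work out to roughly 4.8 ifs per else and 21.8 ifs per else if.

```python
import re

def count_branches(source: str) -> dict:
    """Naive keyword counts; every 'else if' is also counted once as an
    'if' and once as an 'else', as in the basic query described above."""
    return {
        "if": len(re.findall(r"\bif\b", source)),
        "else": len(re.findall(r"\belse\b", source)),
        "else if": len(re.findall(r"\belse\s+if\b", source)),
    }

# Tiny made-up sample, only to show what the counter returns.
sample = "if (a) x(); else if (b) y(); else z(); if (c) w();"
print(count_branches(sample))

# Ratios from the totals quoted in the comment.
totals = {"if": 342103, "else": 70771, "else if": 15705}
ifs_per_else = totals["if"] / totals["else"]
ifs_per_else_if = totals["if"] / totals["else if"]
print(f"ifs per else: {ifs_per_else:.1f}, ifs per else if: {ifs_per_else_if:.1f}")
```

A count like this also picks up keywords inside comments and strings, so it should be read as a rough estimate rather than an exact tally.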

4:59 PM, March 27, 2007  
Blogger Unknown said...


Penryn comes with 3 300mm factories.


Intel has four 45nm fabs coming. The D1D Fab in Oregon, another fab in Arizona. Both these fabs will be online and cranking out AMD killing processors this year. Then next the new fab in Israel comes online, and Intel's 90nm fab in New Mexico is being updated to the latest 45nm designs.

6:36 PM, March 27, 2007  
Blogger Christian Jean said...

Ho ho, I can understand you want to voice your opinion but a lot of what you said doesn't make any sense.

You clearly lack an understanding on the subject.

I would suggest that for a one-week period you study the Linux scheduler code in depth. Or better yet, modify it and fool around with it.

Then take a week to create a simple threaded application.

I'm sure you'll come back in a two week time and laugh at a lot of your comments.

5:11 AM, March 28, 2007  
Blogger Ho Ho said...

jeach!
"a lot of what you said doesn't make any sense"

Could you make a list of those things?


"You clearly lack an understanding on the subject."

Exactly which of my sentences made you think that?

On a similar note, what should make me think you know anything about the subject? The fact that you said reverse HT can work but failed to describe it kind of makes me doubt your knowledge of the subject. Please don't disappoint me.


"I would suggest, for a one week period you study in depth the Linux scheduler code"

What would that give me in the context of this discussion?


"Then take a week to create a simple threaded application."

I've done threaded applications before, you know. How would creating a new one help?

Could you now please provide your side of things? As I said, if you can't give a good explanation of how reverse HT would work then at least try to address (some of) those six points I made. Is it really that difficult?

5:28 AM, March 28, 2007  
Blogger enumae said...

I found this very interesting, especially since AMD cried foul that Intel was using an old benchmark (int_rate2000)...

The irony.

They just used this slide here.

Come on Sharikou, can you spin it even after what you said about Intel? I think you can.

6:23 AM, March 28, 2007  
Blogger abinstein said...

jeach:"For example, an easy implementation of it would be to use Intel's current HyperThreading. But instead of feeding the other pipeline with instructions from an alternate thread"

Nope, current hyperthreading cannot be used for this purpose, because in current HT only a very small amount of circuitry is duplicated, and there is no "the other pipeline".

So if you have another thread to fire up while the first thread is waiting for branch/memory/IO, then HT could help; if you have only one thread and make two "artificial" threads out of it to run on current HT, you'll actually do worse than just superscalar execution.

I think the whole discussion of reverse HT is moot, because nobody knows how it is supposed to be done yet; not even theoretically. Well, you could adopt a data-driven execution model to "mimic" reverse HT on multiple cores, but first, that would require changes in compilers, and we aren't seeing that at all; second, we don't even know how efficient it is going to be; and last, it would run better on an IBM Cell-like architecture than x86-64.

9:55 AM, March 28, 2007  
Blogger PENIX said...

I do not have the patience or time to deal with ho ho's lunatic ramblings today, but I will help clear up the confusion on Reverse Hyper Threading (RHT).

What is RHT?
RHT is the combining of multiple CPU cores, into a single core. Technically, the cores are still separate (unconfirmed), but they allow a single-threaded (ST) application to execute across the multiple cores in parallel. This allows an ST application to perform as well on a multi-core CPU as its multi-threaded (MT) counterpart.

What are the benefits of RHT?
As it stands today, most applications, for both desktops and servers, are ST. This means that they are designed to run on only 1 CPU core at a time. If a system contains 2 CPU cores, an ST application cannot use the 2nd core. In order to use 2 cores at once, the application must be written to be MT. MT programming is more difficult than ST programming. In some scenarios, MT may not even be possible. Even when properly implemented, MT programs rarely distribute the work evenly between the cores. With RHT, this problem disappears. All work is done in true parallel equally across all cores, regardless of whether it is an ST or MT application. The result is a linear scale in speed as the core count increases.

Which is better RHT or multi-core?
In theory, they will be equal for both ST and MT applications. A dual core 2GHz system has a combined clockspeed of 4GHz. An RHT system containing dual 2GHz cores also has a combined clockspeed of 4GHz. In practice, it is very possible that RHT will outperform a traditional multi-core system at equal combined clockspeed. The reason for this is flaws in MT programming. Very commonly, MT applications do not distribute the work completely evenly across cores at all times. This can result in one or more cores being idle for periods. RHT, by nature, will always distribute the work evenly, resulting in higher efficiency and better performance clock for clock.

RHT Today:
RHT is not yet a technology that has been demonstrated. Some say it is impossible. Several years ago it was also claimed by many experts that it was impossible for the x86 platform to scale. We now know that these same industry experts were dead wrong. RHT is not on shelves today, but that doesn't mean it will not be tomorrow.

10:06 AM, March 28, 2007  
Blogger abinstein said...

ho ho:"In conclusion I can say that reverse HT can give anything useful only when all of these things are there:"

Because your description or "understanding" of reverse HT is totally imaginary, your list of "requirements" is mostly imaginary, too.


"1) each and every core can access the entire cache hierarchy of every other core as fast as its own."

By definition, this cannot possibly be true. One of the biggest benefits of multiple threads is that each has a different working set. If you share lots of memory between threads, they actually run slower than single-threaded code.


"2) moving data to be executed on other cores is free (takes zero cycles)
3) moving data back from the other cores is free"


These "requirements" are also bogus due to the reason above. Actually if these were true then your point 1 becomes moot. You are just making up numbers here. Anyway, none of them will be needed by definition if a good reverse HT (not your imaginary one) is devised.


"4) inserting the data to running core's pipeline is free"

Exactly who inserts data to the running core? Another core, then this turns into your point 2; the memory/IO, then this becomes normal data read. I don't know what are you trying to add by adding this?


"5) there shouldn't be any memory collisions"

There will always be collisions on the (shared) memory controller, even in single-threaded superscalar execution. Reverse HT would make it worse because you're demanding more data access. This is actually a good thing if these memory accesses turn out to be useful.


"6) there shouldn't be any pipeline flushes at any point"


You must be out of your mind. First, no pipeline flushing means no branch prediction and no context switch; are you saying reverse HT can't have these two in order to be useful? What have you been drinking?

Second, flushed pipeline is actually why reverse HT will help since when one pipeline is flushed (due to mis-predicted branches or memory access) another can go on speculatively. With single thread, a flushed pipeline loses everything.

10:25 AM, March 28, 2007  
Blogger Christian H. said...

AMD's Dual core CPU's priced comparable to Single core P4 Neburst CPUs? Looks to me AMD is forced to sell CPUs at a loss.

Intel says they have no idea what AMD is on about with the price war? forget about market share. Looks like AMD is cutting prices just to get rid of inventory.



I said it before and I'll say it again. Never has the new gen CPU been priced less than the old gen.
Especially when it's 70-80% faster.

10:58 AM, March 28, 2007  
Blogger Christian H. said...

when was the last time you saw AMD pull an ace out? K8 was pre-announced to death. K10 as well. AMD has nothing except K10. If anything AMD is known for over promising and being overly optimistic. Again, AMD has nothing after K10 while intel rolls out 40% more efficient and faster 45nm Penryn at the same time, followed by A new Core, Nehalem just when AMD is about to ramp K10.
AMD is now just to big and slow to catch intel. Soon AMD will just become like Transmeta and Via. Insignificant.



You must be the King of Doom and Gloom. AMD is a CPU company. Should they have a new fuel injection system in the works?

AMD has licensed their socket and continually gets design wins with Opteron and X2.

Nearly every major (Fortune 500) company uses Opteron somewhere in their infrastructure.
And with Dell selling AMD you can bet that most are using X2.
It's obvious that AMD is setting up to release K10 at the original dual core prices (YouTube interview).

11:05 AM, March 28, 2007  
Blogger Christian H. said...

I thought I already said why reverse HT wouldn't work in real world in some earlier post.


There's at least two white papers and two patents that would disagree with you there.

11:13 AM, March 28, 2007  
Blogger Christian H. said...

Let me remind that penix was the same one who said that adding cores to revese HT capable CPU would increase single threaded performance linearly. That's right, with four 2GHz cores the CPU was supposed to work as fast as an 8GHz singlecore. He also said it would help a lot in multithreaded server environment. Now explain how that could work in real world, or at least in the world where you and/or I live, you failed at it last time.



Well, it's rather complicated and I wish I could find the link to the Stanford paper I read.

Basically it works like OoO(out of order) loads and stores. Every thread can be parallelized with the proper read ahead.

As long as a data object isn't shared between two threads different cores can operate on it linearly.

11:20 AM, March 28, 2007  
Blogger Ho Ho said...

penix
"I do not have the patience or time to deal with ho ho's lunatic ramblings today"

I hope to see your detailed description as soon as possible when you finally find the time, I'll keep reminding you that so you won't forget it. It will be interesting to read something like that coming from someone who actually knows something about CPU architectures, programming and code execution on CPUs.


"RHT is the combining of multiple CPU cores, into a single core"

I know that, but I was asking how it could be done, not what the definition means. More exactly, how can a single code flow be executed in parallel on multiple cores?


"In some scenarios, MT may not even be possible"

I agree. Let's say we have a Fibonacci sequence calculation. Please describe the mechanism by which reverse HT makes it run four times faster on a quadcore when it is not possible to multithread it.
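To make the Fibonacci point concrete, here is a minimal Python sketch of the usual iterative loop (illustrative only, the function name is made up). The thing to notice is the loop-carried dependency: every iteration needs the two values produced by the one before it, so as written there is nothing for a second core to work on.

```python
def fib_iterative(n):
    """Return the n-th Fibonacci number, with fib(0) = 0 and fib(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        # The new pair is built from the pair the previous iteration
        # produced, so iteration k cannot start before k - 1 has finished.
        a, b = b, a + b
    return a


print([fib_iterative(i) for i in range(10)])
```

Faster ways to compute Fibonacci numbers do exist, but they are different algorithms chosen by a programmer, not something hardware could discover by looking at this instruction stream.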


abinstein
"Because your description or "understanding" of reverse HT is totally imaginary, your list of "requirements" is mostly imaginary, too."

What I described were problems that arise in an RHT CPU that works as jeach! described it. Can you describe a better one than he did?

Don't you agree that solving the problems I described is necessary to make RHT work more efficiently than regular x86 cores?


"By definition, this cannot possibly be true."

I agree that most of those points are unattainable on a regular CPU. The whole point of my post was to show why reverse HT wouldn't work! Without the things I described there will be a big performance hit, at least when using the kind of RHT that others described here and before in other threads.


"If you share lots of memory between threads, they actually run slower than single-threaded code"

So it is. Now describe how it is possible to efficiently share the workload of a single thread between multiple cores when you don't share the data between them. You can't? So perhaps it is impossible?


"Exactly who inserts data to the running core?"

It can be either the core that calculated the results writing them to the registers of the core that runs the actual thread, or the core that runs the thread reading the data from the core that calculated the results; it really doesn't matter that much. Choose whatever suits you better. The end result is just that data from one core ends up in the pipeline of the other core, which continues to run the thread.


"I don't know what are you trying to add by adding this?"

I was actually trying to say that such merging of codepaths would mean a pipeline flush, which would have quite a big performance impact. To get rid of it you must be able to insert data in the middle of the pipeline. This is not possible on any architecture I know of, but an RHT capable CPU would need it to work efficiently.


"Reverse HT would make it worse because you're demanding more data access"

I know and I said that in my post. With that point I just tried to make clear that with RHT those problems would become much more difficult (impossible?) to solve.


"You must be out of your mind."

Actually it was jeach! who described an implementation of RHT CPU and that one would need such a functionality. As I said I know it is not possible on any architecture I know of.


"Second, flushed pipeline is actually why reverse HT will help since when one pipeline is flushed (due to mis-predicted branches or memory access) another can go on speculatively."

Ok, so how do you merge those two codepaths that were running on separate cores? Can you do it without a pipeline flush? If not, then you simply trade the flush that happens when the wrong branch is predicted for a flush that is performed at the end of both codepaths when you have to merge them. Wouldn't that make the whole reverse HT meaningless?

One more question, do you agree that RHT is a pipe dream that is unattainable and exists only in some people's dreams?


thekhalif
"There's at least two white papers and two patents that would disagree with you there."

I know about one patent with very little information that described how you could share execution units* but could you link to the other ones?

*) see later why it isn't efficient for most tasks.


"As long as a data object isn't shared between two threads different cores can operate on it linearly."

How often does this happen?

Considering how much trouble CPUs have finding enough instruction level parallelism to fill their own execution units, how big a performance improvement do you suggest could be achieved by having several times more execution units available?

Also, what about caches and speed of data transfer between cores?

Software always has data dependencies in it. If it doesn't, it is trivial to multithread it to get much more performance out of it than with RHT.
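One rough way to put a number on the dependency argument: treat a code fragment as a graph of instructions and the results they need. Even with unlimited execution units and free data movement between cores, the best possible speedup is the instruction count divided by the length of the longest dependency chain. The sketch below assumes one cycle per instruction, and the instruction names are hypothetical.

```python
from functools import lru_cache


def ideal_speedup(deps):
    """Upper bound on speedup with unlimited units and free communication.

    deps maps each instruction name to the list of instructions whose
    results it needs. Every instruction is assumed to take one cycle.
    """

    @lru_cache(maxsize=None)
    def depth(instr):
        # An instruction can issue one cycle after its slowest input.
        return 1 + max((depth(d) for d in deps[instr]), default=0)

    critical_path = max(depth(i) for i in deps)
    return len(deps) / critical_path


# Four instructions, each needing the previous result: no gain at all.
chain = {"i1": [], "i2": ["i1"], "i3": ["i2"], "i4": ["i3"]}
# Four fully independent instructions: trivially parallel.
independent = {"i1": [], "i2": [], "i3": [], "i4": []}
# Two loads feeding an add that feeds a store.
mixed = {"load_a": [], "load_b": [], "add": ["load_a", "load_b"], "store": ["add"]}

for name, graph in [("chain", chain), ("independent", independent), ("mixed", mixed)]:
    print(name, round(ideal_speedup(graph), 2))
```

The independent case is the one that is already easy to multithread by hand, and the chain case gains nothing from extra units. Anything in between is capped by its critical path before cache and transfer costs are even counted.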



You all can do much better than that. After all you are CPU experts that know the stuff much better than I do. At least show that you try to describe something that might actually work. I described a few problems that arise, try to address them.

11:48 AM, March 28, 2007  
Blogger Cerebral said...

BARCELONA beats intel XEON Clovertown 5160 by 42% at 2.3 GHZ.

http://virtualexperience.amd.com/index.html?cid=quadcorewebinar&co=quadcorewebinar&webinar=3

2:16 PM, March 28, 2007  
Blogger Unknown said...

Let's see what the doctor has to say about Penryn:

A 45nm die shrink of the Core microarchitecture — Penryn will be based on the Core architecture of current Core 2 processors, but will be built using Intel's 45nm high-K process, which Gelsinger reminded us involves a "fundamental restructuring of the transistor," with 20% faster switching and 30% lower power. Like the Core 2, Penryn chips will have two cores onboard and will be employed in dual-chip packages for quad-core products. Each Penryn chip will cram 410 million transistors into a 107mm² die; current Core 2 chips pack 291 million transistors into 143mm².

6MB of L2 cache per chip — Credit larger caches for much of Penryn's increased transistor count. The chips will have 6MB of L2 cache, shared between two cores. Naturally, dual-chip quad-core configurations will have a total of 12MB of L2 cache.

SSE4 and "Super Shuffle Engine" — Penryn will have the ability to perform 128-bit data shuffle operations in a single cycle. Gelsinger said this fast shuffle capability should make SSE4 much more programmable and more useful for compiled code, because the CPU will quickly handle realigning data as needed for vector execution.

A faster divider — Penryn will be faster clock-for-clock than current Core 2 processors, and not just because of larger caches and SSE4. The CPU has a new, faster divider that can process four bits per clock versus the two bits per clock of current Conroe chips. Accordingly, Gelsinger expects twice the divide performance of Core 2 Duo and up to four times the performance for square-root operations.

Bus speeds up to 1600MHz — We'll see front-side bus speeds in Penryn derivatives of up to 1.6GHz, depending on the market segment. Gelsinger offered few specifics here, only noting that Xeon server CPUs will have bus speeds of "up to 1600MHz," with no mention of specific bus frequencies for desktop or mobile chips.

A new lower power state — Penryn will be able to drop into an additional low-power state when idle, which Intel has designated as the C6 state (or "deep power down capability," if you're into marketing names). This mode turns off CPU clocks, disables caches, and goes to what Gelsinger said is the lowest power state the process technology allows. Waking from this mode takes longer than it does from other power states, as one might expect.

Dynamic Acceleration Tech — Penryn will also play with power by introducing a novel dynamic clock speed scaling ability. When one CPU core is busy while the other is idle, thus not requiring much power or producing much heat, Penryn will take advantage. The chip will boost the clock speed of the busy core to a higher-than-stock frequency—while staying within its established thermal envelope.

A split-load cache — Gelsinger said this will allow speculative execution across cache line boundaries, but offered little additional detail.

Improved virtualization — No details here, although I believe they may have been disclosed before.

Clock speeds over 3GHz and bitchin' performance — Intel expects both the desktop and server versions of Penryn to reach clock speeds in excess of 3GHz, and in fact has been testing 3.2GHz versions of desktop and server chips already.
Gelsinger said they'd measured a 3.2GHz desktop part at 20% higher gaming performance than the current fastest Conroe. For applications that use SSE4, like media encoding, we can expect to see improvements of over 40%.

As for the server parts, Gelsinger said a 3.2GHz quad-core Penryn-derived system based on the Caneland platform with a 1600MHz front-side bus was achieving over 45% gains versus today's fastest quad-core Xeon systems in certain apps. The apps he cited were bandwidth and floating-point-intensive ones like Stream, some sub-elements of SPECfp, and HP workloads like computational fluid dynamics.


Familiar power envelopes — Dual-core desktop versions of Penryn are slated to have a 65W TDP rating, like most Core 2 Duos today. The quad-core versions will come with 95W and 130W TDPs. The Xeon variants will hit 40, 65, and 80W TDP targets in dual-core form and 50, 80, and 120W in quad-core form. Gelsinger didn't quote any thermal envelopes for mobile CPUs from this family, but there are evidently no plans for a quad-core mobile version of this processor.


Oh, and he just said Nehalem is a 2008 chip.

2:33 PM, March 28, 2007  
Blogger Ho Ho said...

Tommy
"BARCELONA beats intel XEON Clovertown 5160 by 42% at 2.3 GHZ."

You do know you are comparing quadcore vs dualcore, don't you?

Just for fun let's do some calculations. There is 2x4x2.3GHz for AMD and 2x2x3GHz for Intel. That makes a total of 18.4 for AMD and 12 for Intel. If AMD is 42% faster, it would mean that if Intel scaled to the same number of cores and clock speed it would be around 8% faster. I think the difference shouldn't be that big.
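Spelled out as a small Python sketch, the back-of-the-envelope calculation looks like this. It assumes that performance scales perfectly linearly with core count and clock speed, which is optimistic, and the variable names are made up for illustration.

```python
def aggregate_ghz(sockets, cores_per_socket, clock_ghz):
    """Total 'core-GHz' in a system, a crude proxy for peak throughput."""
    return sockets * cores_per_socket * clock_ghz


amd = aggregate_ghz(2, 4, 2.3)    # two quad-core chips at 2.3GHz
intel = aggregate_ghz(2, 2, 3.0)  # two dual-core chips at 3GHz

amd_score = 1.42                  # claimed: 42% faster than the Intel system
intel_score = 1.0                 # baseline

# Scale the Intel score up to AMD's aggregate core-GHz (linear assumption).
intel_scaled = intel_score * amd / intel
intel_advantage = intel_scaled / amd_score - 1

print(round(amd, 1), round(intel, 1))
print(f"{intel_advantage:.1%}")
```

Under that assumption the scaled-up Intel system would come out roughly 8% ahead. Real scaling is worse than linear, so the true gap would likely be smaller.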


Bubba, where did you get that information? I'd like to read more about it, especially the SIMD and FP stuff.

2:41 PM, March 28, 2007  
Blogger Ho Ho said...

Never mind, I already found it:
http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=2955

Much faster divisions and shuffling will improve ray tracing speeds quite a bit, not to mention SSE4 :)

3:16 PM, March 28, 2007  
Blogger netrama said...

Bubba said...
Let's see what the doctor has to say about Penryn: Patty said blah blah blah ...


Stop quoting as*h*les like Patty here. Patty is just trying to prevent his stock from tanking. You must be a fool to listen to well established liars..
Also I have seen most comments by Intelers here, like roborat... they purely sound like salesmen pitching for Intel, who have absolutely no clue about the technology underneath, so keep off!!!

3:22 PM, March 28, 2007  
Blogger Christian Jean said...

Ho ho said...
do you agree that RHT is a pipe dream that is unattainable and exists only in some people's dreams?

It is due to people like you that companies like AMD exists!


Scenario:

Ho ho walks into the Intel CEO's office and quickly sits down.

CEO: Look ho ho, I've asked all my engineers if x86-64 is possible. They all told me no, but since you are our lead engineer here at Intel, I want you to confirm this for me.

Ho ho: I agree that x86-64 is a pipe dream that is unattainable and exists only in some people's dreams.

CEO: Thank you ho ho, you're a life saver. I didn't want to have to tell our shareholders that we've spent $10 Billion on Itanium for nothing.

Ho ho: No problem, that's why they pay me the big bucks. [ho ho, sits back in his chair with a proud and content look on his face]

CEO: That's it, you can leave now.

Ho ho: [quickly gets up and just as he was to exit the door...]

CEO: Oh, ho ho, one more thing! I've heard rumors that AMD is currently developing 'RHT', I think it means 'Reverse HyperThreading'. Do you know anything about that and do you think this is doable?

Ho ho: RHT is a pipe dream that is unattainable and exists only in some people's dreams.

CEO: That's what all the other engineers said too... that's great! Now I can commit another $10 Billion to Itanium III and its FAB in China... damn, we're good!

3:53 PM, March 28, 2007  
Blogger Christian Jean said...

Ok, let's assume that AMD did do some sort of RHT.

Now, let's also assume that their first version works wonders at the hardware level on current single threaded applications.

It's amazing... it can run code in parallel and runs both paths of an 'if' statement just in case it got it wrong.

Then, some administrator at the Department of Defense (with 36 certifications on his wall) decides to upgrade all the computers at NORAD.

Here is an example of code which is run 800% faster using AMD and RHT:

function aerospaceScan()
{
    while (radarIsOn())
    {
        if (!underAttack())
            doMaintenance();
        else
            launchNuclearMissles();
    }
}

Problems are NOT just technical!!

4:08 PM, March 28, 2007  
Blogger core2dude said...


function aerospaceScan()
{
    while (radarIsOn())
    {
        if (!underAttack())
            doMaintenance();
        else
            launchNuclearMissles();
    }
}

I will stick to the technical problems. In fact, today's processors do quite a bit of speculative execution; the only thing they don't do is commit speculatively.
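The distinction between executing speculatively and committing speculatively can be sketched in Python as a software analogy. Real processors do this with hardware buffers, and the helper names here are made up, borrowing from the pseudocode above. Both branches run against private copies of the state, and only the writes of the branch that turns out to be taken ever become visible.

```python
def run_speculatively(condition, if_branch, else_branch, state):
    """Execute both branches on private copies, commit only the taken one.

    Each branch receives a copy of state and returns a dict of the writes
    it wants to make. Nothing is visible until the condition is resolved.
    """
    if_writes = if_branch(dict(state))      # speculative, buffered
    else_writes = else_branch(dict(state))  # speculative, buffered
    taken = if_writes if condition(state) else else_writes
    state.update(taken)                     # the commit point
    return state


def do_maintenance(s):
    return {"maintenance_runs": s["maintenance_runs"] + 1}


def launch_nuclear_missiles(s):
    return {"missiles_launched": s["missiles_launched"] + 1}


state = {"under_attack": False, "maintenance_runs": 0, "missiles_launched": 0}
run_speculatively(lambda s: not s["under_attack"],
                  do_maintenance, launch_nuclear_missiles, state)
print(state)
```

The launch branch was executed, but its buffered result was thrown away, so no missile counter ever changed. The catch is that the losing branch's work is wasted, and anything that cannot be buffered (I/O, for example) has to wait for the commit point.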

RHT, in theory, can be implemented in the same fashion. Different cores executing different "speculative" threads. However, at some point, the speculation needs to be resolved, and that needs huge communication between the different threads. The problem can be mitigated to some extent by doing redundant computations. E.g., if you have code like:

if (input == a)
{
    doA();
}
else
{
    doB();
}

You can create two threads out of this as follows:

thrA:
if (input == a)
{
    doA();
}

thrB:
if (input != a)
{
    doB();
}

As you can see, input==a and input!=a are redundant computations, but they are required to reduce synchronization burden. Add to this the fact that every 4th to 6th instruction is a conditional branch, and your redundant calculations kinda start adding up.
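The thrA/thrB split can be written as a runnable Python sketch. `run_split` and its parameters are made-up names, and because of the interpreter's global lock this shows only the structure, not a speedup. Each thread re-evaluates the condition for itself, so the two never have to talk to each other about which side was taken.

```python
import threading


def run_split(input_value, a, do_a, do_b):
    """Run the if side and the else side as two threads.

    Each thread performs its own (redundant) comparison instead of
    waiting to be told the outcome by the other one.
    """
    results = []

    def thr_a():
        if input_value == a:        # redundant computation #1
            results.append(do_a())

    def thr_b():
        if input_value != a:        # redundant computation #2
            results.append(do_b())

    threads = [threading.Thread(target=thr_a), threading.Thread(target=thr_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                    # the only synchronization point
    return results                  # exactly one thread appended


print(run_split(1, 1, lambda: "A", lambda: "B"))
print(run_split(2, 1, lambda: "A", lambda: "B"))
```

Every branch handled this way costs one extra comparison and leaves one thread that does nothing useful, which is the overhead that starts adding up when branches are only a handful of instructions apart.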

RHT may or may not be a pipe dream--but it will take a lot more than some patent filed at USPTO to make it actually work. In fact, AMD's patent is so vague that you can argue that it is invalid (a patent has to be descriptive enough to enable a person with reasonable skill in the art to implement it--AMD's patent does not meet that criterion).

5:44 PM, March 28, 2007  
Blogger Unknown said...

Thanks for the info, Bubba! Penryn is going to be an awesome CPU. Socket compatibility too, just needs a BIOS update.

While I think Penryn is just awesome, there's Nehalem coming in 2008.

Intel was kind enough to share more information about Nehalem:-

The "Nehalem" chips will allow Intel to debut its elusive CSI (common systems interconnect) technology – a high-speed serial interconnect similar to AMD's Hypertransport technology. In addition, Intel will pump out processors with built-in memory controllers and built-in graphics units, said SVP Pat Gelsinger, speaking to reporters here. Lastly, Intel intends to support two software threads per core with the "Nehalem" gear – again bringing the giant in line with processor rivals such as Sun Microsystems and IBM.

"This is a big deal," Gelsinger said. "This is a very big deal."

The executive's enthusiasm proves understandable. The Nehalem designs are a total architecture revamp over today's "Core" designs, which were unsheathed in the first quarter last year. The Core chips – Intel's currently shipping products for mobile, desktop and server computers – improved the company's overall product performance and performance per watt, allowing it to compete with and even best AMD on numerous benchmarks for the first time in a couple of years.

Intel hopes to build on that success with the Nehalem gear that will range from one to at least eight cores.

At the moment, Intel is keeping the very fine details about the Nehalem chips hush-hush. Gelsinger, however, did confirm the previously mentioned items such as "integrated memory controllers as well as point-to-point interconnects (up to four links)" on chips running at greater than 3GHz. Some chips for the server and client markets will also have integrated graphics processors similar to gear that AMD plans to pump out fresh off its ATI acquisition. And all of the Nehalem products will support DDR3 memory.

Reading the tea leaves, it sounds like Intel will roll out quite a number of different chip specs, including multi-core products with lower frequencies similar to Sun's UltraSPARC T1 that cater to multi-threaded software.

End users of Intel's new chips will have to buy fresh systems rather than slotting the chips into existing boxes. By contrast, AMD with its upcoming Barcelona product and the follow-on Montreal chip will have socket compatibility, as disclosed in this Register exclusive.

Intel has struggled to get the CSI technology out the door. It once expected to outfit a version of Xeon code-named Whitefield with CSI but scrapped those plans due to design issues. The chip maker is expected to introduce CSI in its Itanium family in 2008 as well and has told some customers that the technology shows "much lower latency" than AMD's Hypertransport.


Intel works on Penryn and Nehalem and will deliver both on schedule. Woodcrest etc. and quad cores were all early. AMD, OTOH, has delayed R600 countless times and has pushed K10 back to "late summer".

6:15 PM, March 28, 2007  
Blogger lex said...

Tick Tock Tick Tock..

Game over...

Penryn here in 2007 and Nehalem on track for 2008.. Read and weep.

AMD BK in 2008.. Poor sharikou

http://dailytech.com/Intel+Life+After+Penryn/article6686c.htm#comments

7:35 PM, March 28, 2007  
Blogger abinstein said...

ho ho:"Don't you agree that solving the problems I described is necessary to make RHT work more efficiently than regular x86 cores?"

No, I don't. And I don't think you understand the basics of modern microarchitecture, as I expressed in other threads; nor do I agree with your way of taking others words out of context before trying to respond.

Your imaginary RHT looks like a primary school imagination of calculus when he had never done it. You didn't get cache/register access right, nor did you have the correct description of pipeline flush. You still think in the decade-old way of same ISA/physical registers, non-speculative execution, in-order and single-banked memory accesses, etc. In truth, most of the "requirements" you described were already solved in some form in most of today's superscalar and OOO execution engines.

The biggest obstacle to RHT is not anything that you described (shared memory, data transfer, blah blah), but the fact that (1) it is hard to make code multi-threaded efficiently in hardware, (2) it is hard for the front-end to fetch x86 instructions (CISC) fast enough.

More specifically, we don't even know how to hand-optimize some codes into multiple threads, not to mention to automate the processes. Also, the variable length and complex structure of x86, thanks to Intel's "brilliant" design, means serial dependency between bytes during instruction fetch. If you don't get enough instructions, you don't have ability to analyze threading opportunities.
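The serial dependency in instruction fetch can be illustrated with a toy encoding. This is not real x86: here the first byte of each instruction is simply assumed to hold that instruction's total length, and both function names are made up. The point carries over, though: with variable-length instructions, the start of instruction N+1 is unknown until instruction N has been at least partly decoded.

```python
def boundaries_variable(code):
    """Start offsets when each instruction's first byte is its length."""
    offsets = []
    pos = 0
    while pos < len(code):
        offsets.append(pos)
        length = code[pos]
        if length == 0:
            raise ValueError("zero-length instruction in toy encoding")
        # The next start offset is only known after this read: a serial chain.
        pos += length
    return offsets


def boundaries_fixed(code, width=4):
    """Start offsets for fixed-width instructions, all known up front."""
    return list(range(0, len(code), width))


toy_code = bytes([1, 3, 0xAA, 0xBB, 2, 0xCC, 1])
print(boundaries_variable(toy_code))
print(boundaries_fixed(bytes(8)))
```

With a fixed width, every decoder can be pointed at its instruction immediately and in parallel. With the variable-length stream, each boundary has to be discovered in order before a wide front-end can be kept fed.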

Anyway, RHT using today's x86 cores without change in compilation is most likely to be a pipe dream. RHT itself, however, is not. People in data-driven computation knew how to parallelize serial code 20 years ago, way before speculative execution and prediction were reality; unfortunately it requires a different execution model, let alone the same x86 cores you see today (from either AMD or Intel).

9:55 PM, March 28, 2007  
Anonymous Anonymous said...

Bottom line being we now have a great choice of CPUs to choose from. Each will buy according to their preference & no amount of bickering in this blog will change anyone's mind. Needless to say most home users don't even need anything higher than a 1.0MHZ. WOW..e-mail/web surfing/grandma's pics/checking latest prices for online prescriptions. Checking the stock market on their stocks. Chatting with grandkids in CA. Useless to buy or think they will notice the pages loaded at 1ms. Get a life!!!!

8:42 AM, March 29, 2007  
Anonymous Anonymous said...

Let's put all pros/cons aside for a moment. This is an "if" scenario. No new plants being built by AMD in 2007. Maybe 2 for Intel. If the cpu market is glutted...whose stock price will suffer the most??? The answer is logical..INTEL...

9:06 AM, March 29, 2007  
Blogger Christian Jean said...

Needless to say most home users don't even need anything higher than a 1.0MHZ

Obviously you haven't run Windows in a while? The tasks you described, would make your statement logical, but the OS forced down grandma's throat makes your statement false!

But, contrary to the past 10 years, I no longer blame Microsoft for that. Can you blame them for sitting back on their monopoly? No, not really!

Who is worse? The bully beating everyone up, or the people who stand by and watch?

I blame Apple, Sun and IBM for being so scared to do anything. Together they could do miracles overnight.

So now, there is only one company which I'm hoping will have the balls to tackle Microsoft... and they are Google.

3:12 PM, March 29, 2007  
Blogger Ho Ho said...

jeach!
"It is due to people like you that companies like AMD exists!"

Thanks for the oversimplification, but that comparison is a rather bad one. I've said several times that AMD didn't do enough with its AMD64 architecture. It should have done much, much more. At least it should have made x86 comparable to Power. For some reason they decided not to do it. When considerable time is spent moving data between L1 cache and registers, something is wrong.


"Now, lets also assume that their first version works wonders at the hardware level on current single threaded applications."

A rather big problem with that is that unless we can work wonders at the software level first, there is no way HW could do it for us.
Though your example is quite good, I hadn't thought of something like that. It shows even better why RHT isn't all that simple.


core2dude
"As you can see, input==a and input!=a are redundant computations, but they are required to reduce synchronization burden. Add to this the fact that every 4th to 6th instruction is a conditional branch, and your redundant calculations kinda start adding up"

Finally someone who understands what I was talking about.

"RHT may or may not be a pipe dream--but it will take a lot more than some patent filed at USPTO to make it actually work."

I agree with that. Most patents are way too vague to be of any use.


abinstein
"And I don't think you understand the basics of modern microarchitecture, as I expressed in other threads"

What exactly did I understand wrong?


"Your imaginary RHT looks like a primary school imagination of calculus when he had never done it"

How many times have I already said that I only described the problems RHT would have once it starts executing multiple branches in parallel? I wasn't the one who came up with it, but so far it is the only kind of RHT implementation description I've seen.


"You didn't get cache/register access right, nor did you have the correct description of pipeline flush"

I beg to differ.

"You still think in the decade-old way of same ISA/physical registers, non-speculative execution, in-order and single-banked memory accesses, etc"

How exactly do those things have any effect on the described RHT implementation?


"More specifically, we don't even know how to hand-optimize some codes into multiple threads, not to mention to automate the processes"

I agree with that, it has actually been one of my points from the start.


"Also, the variable length and complex structure of x86, thanks to Intel's "brilliant" design, means serial dependency between bytes during instruction fetch"

AMD had the opportunity to fix it with AMD64, but they chose not to. Intel also tried to fix it with Itanium, but that failed since the industry as a whole was not ready to move to something better.


"Anyway, RHT using today's x86 cores without change in compilation is most likely to be a pipe dream."

So it is. As I said, x86 is a rather bad instruction set, but unfortunately I don't see this improving in the near future.


"RHT itself, however, is not. People in data-driven computation knew how to parallelize serial code 20 years ago, way before speculative execution and prediction were reality; unfortunately it requires a different execution model, let alone the same x86 cores you see today (from either AMD or Intel)."

I thought we were talking in an x86 context. Perhaps, yes, RHT is possible with some weird architecture. I personally can't imagine what it would look like, and I'm quite sure that architecture couldn't accelerate games and word processors.

Of course you could say that all GPUs are actually using something RHT-like. The user sends a few serial OpenGL commands to the GPU and it automagically divides them into parallel threads. However, as I've said, that kind of thing doesn't work with everything.

tedsplace
"Needless to say most home users dont even need anything higher then a 1.0MHZ."

When the CPU is idling it sure doesn't need a whole lot of power to update the mouse position. However, when you click on a link in your web browser and it renders a heavy AJAX-based page, you will need all the CPU resources you've got to see the results sooner than next year. Peak CPU usage is why we need fast CPUs. If the entire workload could somehow be spread evenly over time, we would quite likely be able to get by with much slower CPUs. Too bad we live in the real world, where that kind of thing does not work.
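
To put rough numbers on the peak-versus-average point, here is a minimal sketch. The cycle count, the one-second deadline and the one-minute window are made-up figures for illustration only, not measurements of any real browser or page:

```python
def required_clock_hz(cycles, deadline_s):
    """Clock rate needed to finish `cycles` of work within `deadline_s` seconds."""
    return cycles / deadline_s

# Hypothetical burst: rendering a heavy page costs 2 billion cycles.
burst_cycles = 2_000_000_000

# The user wants the page within one second of clicking.
peak_hz = required_clock_hz(burst_cycles, 1.0)

# If the same work could be spread evenly over a whole minute instead.
average_hz = required_clock_hz(burst_cycles, 60.0)

print(f"peak:    {peak_hz / 1e6:.0f} MHz")
print(f"average: {average_hz / 1e6:.0f} MHz")
```

Under those assumptions the averaged requirement is sixty times lower than the peak one, which is the gap between what an idle desktop uses and what it needs the moment you click.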


jeach!
"Can you blame them for sitting back on their monopoly?"

Do you remember when MS started working on IE7? What was the last great thing they added to their software that wasn't done by anyone else before them?

1:45 AM, March 30, 2007  
Blogger hyc said...

re: reverse hyperthreading, or automatic parallelization of singlethreaded code...

http://bmdfm.com/home.html

There've been a lot of auto-parallelizing compilers over the past few decades. Usually Fortran, but also C. But it probably still requires a workload that is already SIMD-friendly...

2:58 AM, March 31, 2007  
Blogger Ho Ho said...

hyc
"There've been a lot of auto-parallelizing compilers over the past few decades"

Parallelizing while compiling (using preprocessor macros) is far, far from reverse HT or doing it automagically at execution time.

Basically, the CPU can only see a very small piece of the program at any time and has no clue what is going on, whereas the compiler sees the whole thing, knows what is going on, and thus also knows where the data dependencies are and how the loops are organized. There is a reason why compiling a simple few thousand lines of code can take hundreds of megs of RAM.


"But it probably still requires a workload that is already SIMD-friendly"

Either SIMD or simply some big loops with tens of thousands of iterations and no data dependencies. Still, proper hand-coded threading gives much better scaling than things like OpenMP.
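
A minimal sketch of the difference between the two kinds of loop, in Python rather than Fortran or C. The function names are made up for the example, and a thread pool is used only to show the structure (in CPython, threads will not actually speed up CPU-bound work like this):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # Each call depends only on its own input.
    return x * x

def independent_loop(data, workers=4):
    # No iteration reads another iteration's result, so the work
    # can be handed out to workers in any order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, data))

def dependent_loop(data):
    # Running total: iteration i needs the result of iteration i-1,
    # so the loop cannot simply be split across workers as written.
    out = []
    total = 0
    for x in data:
        total += x
        out.append(total)
    return out

print(independent_loop(range(5)))
print(dependent_loop([1, 2, 3]))
```

The first loop is the kind a compiler can split up because it can prove the iterations are independent. The second has a loop-carried dependency, and spotting that requires seeing the whole loop, not just the few instructions in flight.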

6:51 AM, April 01, 2007  
