Monday, February 12, 2007

INQ mistaken about Intel

INQ says that x86 is Alpha. Wrong. Only AMD K8 is Alpha. Intel's Core 2 Duo is just Pentium 3 with 128 bit SSE.

To qualify as an Alpha derivative, you must have integrated memory controller. Intel is gonna be stuck with the 1970s FSB for the foreseeable future.

37 Comments:

Blogger netrama said...

The 80 core Intel BS is the best case ever of Intel beating its own drum... more like those violinists on the Titanic...
I am surprised that folks like businessweek.com have made this a huge headline, and instead of showing some Intel guy in a fake lab coat, they show Paul O. holding a CORE 2 processor. Probably if it were in Europe, BW would have been sued :-))

4:31 PM, February 12, 2007  
Blogger Ho Ho said...

Scientia
"Intel's Core 2 Duo is just Pentium 3 with 128 bit SSE."

Does that mean K8 and Alpha in general suck since Pentium 3 beats the living c*£@ out of it?

12:43 AM, February 13, 2007  
Blogger PENIX said...

Intel 80 core is a myth. The ancient Intel FSB is completely incapable of 80 core. It has become apparent that 4+ cores is already pushing the limit.

9:57 AM, February 13, 2007  
Blogger abinstein said...

From the Inq: "The real pity is that Intel and AMD are squabbling over micro-marchitecture rather than inventing anything new, like a proper operating system that can take advantage of one brain, never mind 80."

This is so wrong and backwards. OSes are already taking advantage of one processor core. What we need is an efficient way to integrate multiple cores together.

12:48 PM, February 13, 2007  
Blogger Unknown said...

It has become apparent that 4+ cores is already pushing the limit.

4:13 PM, February 13, 2007  
Blogger PENIX said...

abinstein said...

This is so wrong and backwards. OSes are already taking advantage of one processor core.

It's more wrong than you think. Current OSs can take advantage of one, or many cores. All modern OSs have been able to for quite some time.

What we need is an efficient way to integrate multiple cores together.

By its very nature, you cannot integrate multiple cores together into a single core. A single thread can only execute on a single core because a set of instructions must be followed along a single path.

An example would be to have 4 people read a book at the same time, where each person reads one word, then hands the book to the next person... but they can only see the words that they themselves read, and no others. It wouldn't make any sense.

It is much simpler to program a single-threaded application than a multi-threaded one. This is why most applications, and almost all games, are single threaded. Converting these apps to multi-threaded is not an easy task, and for some it is not possible at all. To combat this, both Intel and AMD are working on smart compilers which analyze single-threaded source code and compile it as multi-threaded in the areas where that is possible. Intel's version of this is called Mitosis. An interesting idea, but this "hack" requires applications to be recompiled, and it does not take full advantage of multi-core. It does nothing for existing software.
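
To illustrate what such a compiler has to prove, here is a minimal C++ sketch of my own (a toy example, not actual Mitosis output). Splitting the loop across two threads is only legal because the iterations are independent; proving that automatically is the hard part.

#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(1000000, 1);

    // Single-threaded original: one core, one path through the instructions.
    long serial = std::accumulate(data.begin(), data.end(), 0L);

    // Hand-parallelized version: legal only because the two halves share
    // no data. If iteration N depended on iteration N-1, this split would
    // be wrong -- which is exactly why auto-parallelization is so hard.
    long lo = 0;
    std::thread t([&] {
        lo = std::accumulate(data.begin(), data.begin() + data.size() / 2, 0L);
    });
    long hi = std::accumulate(data.begin() + data.size() / 2, data.end(), 0L);
    t.join();

    std::cout << serial << " == " << (lo + hi) << std::endl;
}

If independence can't be proven, a compiler has to leave the loop alone, which is why such tools only help in the areas where it is possible.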

Word is that AMD has done the impossible and is developing Reverse Hyperthreading to solve this problem. If this is true, a single threaded application could run on multiple cores in parallel. Simply put, going from one core to two would double performance in any application. And the performance gains would grow linearly as cores were added. If this were to be released tomorrow, Intel would be BK overnight.

4:43 PM, February 13, 2007  
Blogger Unknown said...



Intel 80 core is a myth. The ancient Intel FSB is completely incapable of 80 core. It has become apparent that 4+ cores is already pushing the limit.


This is correct. That's why Intel is moving to the new CSI bus next year with the Nehalem architecture. Well, either that or they'll be BK by then.

7:09 PM, February 13, 2007  
Blogger Mo said...

PENIX said...

Intel 80 core is a myth. The ancient Intel FSB is completely incapable of 80 core. It has become apparent that 4+ cores is already pushing the limit.


How is it a myth? Intel demonstrated it.... At least it shows something. All AMD does is show slides and talk.
http://www.xtremesystems.org/forums/showthread.php?t=133532

8:42 PM, February 13, 2007  
Blogger PENIX said...

ho ho said...

Does that mean K8 and Alpha in general suck since Pentium 3 beats the living c*£@ out of it?

It means that Intel has better propaganda than its competitors.

10:58 PM, February 13, 2007  
Blogger PENIX said...

Scientia said...

Wrong. Even by Q1 08, Intel's highest clock on quad core will only be 2.4Ghz. I think you'd better change your bet to Q3 08.

So they reverted to the P3, and now they are stuck with the same MHz limitations that caused them to move to the late great P4.

11:02 PM, February 13, 2007  
Blogger Ho Ho said...

penix
"The ancient Intel FSB is completely incapable of 80 core"

You do know it uses an IMC and will soon use silicon optics for off-chip data transfer?

abinstein
"What we need is an efficient way to integrate multiple cores together"

That won't happen. Sure, there were some rumours about reverse-HT but using that will lose almost every bit of performance you get from going to multicores. For dualcores you should be happy to get +10% performance when using two cores to run a single thread.

What really needs to change is the software. People will have to start thinking in terms of parallel processing; there is no escape from it.

12:51 AM, February 14, 2007  
Blogger Anonymous Agent #101 said...

Plug this 80 core monster into Intel's upcoming HT-like system, and it will do quite well. It's a lot more interesting than anything AMD has talked about recently, much less shown an actual chip!

Even with the next generation FSB, it could be a very nice processor. What most people do not realize is that very few apps, maybe 1% at most, FULLY utilize the massive memory bandwidth that AMD has made available. So there is all that bandwidth and nothing to do. Sucks to be AMD.

And will the AMD jockizzers please get a clue -- it's beyond dumb for AMD f-bois to criticize the FSB and P3 when those two ancient technologies -- oldies but goodies -- are kicking AMD to the curb right now.

If AMD had a history of invention then they'd be much better off today. But they mostly just buy companies and crib from Intel. So K10 has got some pretty bad ass FP going on. Plug a 16 core version of the 80 core Monster into a Core 3 Duo and it will clock out AMD. One punch again and AMD's head will spin.

And in the end this is all the fight is about. The little dude trying to stay out of reach of the big dude. Every time the big dude lands one on the little dude, everyone holds their breath to see if the fight is over. Yeah the little dude lands a few, too. But they don't do much damage and the big dude just laughs and waits for the right moment for the KO.

3:34 AM, February 14, 2007  
Blogger Aguia said...

abinstein,

What is needed is:
-One way to allow one core to share resources with the other core
*enhance the connection between the cores to allow one core to, for example, borrow the SSE units or FPU of the other core.

-or have the software redone to take advantage of not 2, 3 or 4 cores but n cores.

6:24 AM, February 14, 2007  
Blogger Ho Ho said...

penix
"It means that Intel has a better propaganda than it's competitors."

Besides propaganda it also has real-world working CPUs that everyone can benchmark instead of lots of hot air and no real HW.

aguia
"What is needed is:
-One way allow one core share resources with the other core
*enhance the connection between the cores to allow one core for example barrow the SSE units or FPU of the other core."

and

penix
"Simply put, going from one core to two would double performance in any application. And the performance gains would grow linearly as cores were added."

That will only work with very few programs and will give almost no benefit. Instruction level parallelism is really hard to find; you'd be lucky to fill all of the 16 GP registers and 16 128bit SIMD registers. Doubling the register count with added latency won't help. Also, as the same data isn't usually kept in registers for very long, you'd have to exchange register contents quite a bit. Now add in pipelining and you get a simple way of slowing your calculations to a crawl. If you really wanted to increase the register count it would be much cheaper and more efficient to simply add more ALUs and registers to a single core. But as I said, it is near impossible to get better instruction level parallelism. I doubt many would like to write their programs as they would for Itanium.

In short, you can't efficiently share CPU resources for running single threads faster. Even in the ideal case it will be very far from a 2x speed increase for dualcores.

penix
"If this were to be released tomorrow, Intel would be BK overnight."

And what if Intel released it sooner? Would that mean instant BK for every other CPU maker? Or does AMD use some magic so it would survive?

Also, as this is mostly compiler technology and AMD doesn't have its own compiler (it uses GCC), how could it even in theory have any effect on Intel? GCC works wonderfully on Intel too, you know.

3:51 PM, February 14, 2007  
Blogger Woof Woof said...

Intel already said it isn't an x86 core.

For all we know, it isn't too different from the STREAM processors that ATI recently launched, which are fundamentally GPU-type cores that can handle a huge throughput of FP-intensive matrix calculations (not sure what the Nvidia equivalent is).

:)

Doesn't Nvidia already have 128 cores or something?

Ironically, Intel's "vision" of the multicore teraflop computer smells suspiciously more like an extended AMD K8 (and by extension the Barcelona architecture) than their Clovertown does.

5:10 PM, February 14, 2007  
Blogger Anonymous Agent #101 said...

"It's more wrong than you think. Current OSs can take advantage of one, or many cores. All modern OSs have been able to for quite some time."

All mainstream operating systems -- Windows, Mac, Linux -- do not scale well across cores, across memory, across disk, etc.

The parallelism concept of the "thread" is very primitive and "work" is not well divided into equivalent "threads" that will run on multiple cores.

Beyond the fact that work is not divided well, most code in existence is not thread safe. So it must be serialized.

But what Mike Magee was getting at is that the modern OS of today is very inefficient even working on one core. If you look at the processing power available say vs. a NeXT machine of yore with a 25MHz 68030 class processor, you can see that today's OS is a bloated pig that basically wastes CPU power so that it forces people to buy new hardware. Ask yourself... "Where did a 100 (ONE HUNDRED) fold increase in clock speed actually go??" Not to mention today's chips have much more capable memory interfaces, more total RAM, faster I/O systems, etc.

Hence the crux of the issue here is "the upgrade treadmill". I know going from 2 cores to 4 cores on Windows is, 95% of the time, a big letdown. There is no speed increase for most situations. There is also the issue that no matter how much RAM you have in Windows, Microsoft seems to mismanage how to use it. For the purpose of making you buy 4GB more instead of 1GB more, that sort of thing. The job of both Windows and Mac (and Linux to a lesser extent) is simple -- "To do LESS with MORE."

Mostly computers are pathetic considering how much they cost to buy and to run every month. One has to wonder if the world would be better off without them.

11:47 PM, February 14, 2007  
Blogger PENIX said...

Anonymous Agent #101 said...

All mainstream operating systems -- Windows, Mac, Linux -- do not scale well across cores, across memory, across disk, etc.

Cores: The main advantage of multiple cores is the ability to distribute processes or threads among them. This is, and has been, handled just fine by modern OSs for quite some time. The problem exists in the software which runs on the OS.

Memory: The limitations of 32-bit were recognized long ago, which is why AMD began to create 64-bit processors. The 32-bit workarounds are still in place in the OS, and will be until Intel stops holding back the industry and 64-bit is standard. Utilization of idle RAM is a questionable area. For example, disk caching could see a huge performance boost by using idle RAM, but at the same time this is dangerous, as too much caching would result in catastrophic data loss in the event of a power outage.

Disk: Modern OSs scale wonderfully across multiple disks. Going from 1 drive to 2 in a hardware RAID 0 configuration will show an immediate 90% read/write performance gain. Adding more drives results in a linear performance increase.

The parallelism concept of the "thread" is very primitive and "work" is not well divided into equivalent "threads" that will run on multiple cores.

Beyond the fact that work is not divided well, most code in existence is not thread safe. So it must be serialized.


The lack of proper division of resources is very apparent in software today. This is the result of a poor programming model in software. Serialization is simply a hack to quickly enable a program for multi-threading, but it is a consequence of the program not being engineered properly for multi-threading in the first place. None of this is related to the OS at all. This is all characteristic of the software which runs on the OS.

Multi-core did not become mainstream until very recently, which is why thread safe programming has not become mainstream. Fortunately, this is a trend that will soon change.

3:13 PM, February 15, 2007  
Blogger PENIX said...

ho ho said...

In short, you can't efficiently share CPU resources for running single threads faster. Even in the ideal case it will be very far from a 2x speed increase for dualcores.


You cannot... yet.

ho ho said...
penix
"If this were to be released tomorrow, Intel would be BK overnight."

And what if Intel released it sooner? Would that mean instant BK for every other CPU maker? Or does AMD use some magic so it would survive?


In this case, being small has an advantage of significantly lower overhead. If Intel's revenue stream were to be cut off, their overhead would be devastating. Intel has proven to be behind the curve in so many regards that I see it as impossible for them to beat AMD to Reverse HT.

3:25 PM, February 15, 2007  
Blogger PENIX said...

Azary Omega said...

Company that steals 3% of Intel's market share per quarter doesn't need publicity - they're already doing good enough.


More than good, that is outstanding!

3:27 PM, February 15, 2007  
Blogger Ho Ho said...

woof woof
"Doesn't Nvidia already have 128 cores or something?"

Yes, G80 has exactly 128 FP cores running at 1.35GHz. It also has cache shared between groups of 16 such cores.

The R580 that is inside those Stream thingies has three 16-way FP units and no meaningful general caches. That means they basically can execute a single instruction that runs on all those 16 FPUs in parallel. It is somewhat similar to SSE or MMX where you have one instruction that is used on 2-16 numbers. In G80, every single FPU can run different shader code without too big a problem.

I have no idea what architecture R600 will be. It might have 64 4-way SIMD units or it may have something totally different. It is quite possible it won't have one-way FPUs as G80 does. Also it is almost certain it will be clocked much lower.


woof woof
"Ironically, Intel's "vision" of the multicore teraflop computer smells suspiciously like an extended AMD K8 (and by extension the Barcelona architecture) than their Clovertown is"

Could you make a list of those things that make it more similar to K8? Only thing I can think of is IMC and >2 cores on one die but that hardly shows anything.

Also, why should a future architecture be similar to current ones? If it is radically different wouldn't you call that innovation, or at least an effort to innovate? x86 is not the best instruction set there is; it just has the majority of the market share and most of the applications, just like Windows and Intel.


azary omega
"Simply wrong. (and I'm a programmer - i have the expertise to say that your wrong about that big time)"

Interesting, I can say the exact same thing to you, and I'm a professional programmer also. What kind of programs do you write? I can assure you that almost no desktop application needs nearly as much bandwidth as current CPUs have. Not everyone is running SPECfp-rate on their PCs all day long. Latency is by far more important than throughput.


anonymous agent #101
"All mainstream operating systems -- Windows, Mac, Linux -- do not scale well across cores, across memory, across disk, etc."

You fail. Next time make sure you know what you are talking about.

Linux and BSD scale extremely well over lots of CPUs/cores, NUMA memory and several disks. They just might need some tweaks that aren't in the official kernel. Anything under 32 CPUs/cores can be run on a vanilla kernel without problems. Vanilla kernels support up to 256 CPUs, but they might have a few problems when you try to run with that many; you'll get better efficiency with some tweaks.

About HDDs, what are you talking about? My Gentoo system files sit on three separate HDDs that are in no way RAID'ed or LVM'ed together (there are two more drives for my personal data). Also Linux has excellent NUMA support.

The other two OSes are not nearly as good. OSX used to have awful threading performance; I'm not sure if it has been fixed or not in later OS versions. Basically it handled threads around 30-200x slower than Linux on the same hardware. Windows has always had rather bad threading performance, most of it coming from lousy thread management.


anonymous agent #101
"The parallelism concept of the "thread" is very primitive and "work" is not well divided into equivalent "threads" that will run on multiple cores."

I understand what you mean, but for that to change we would have to reinvent programming. If you have some bright ideas about how it could be done I'd be glad to hear them.


anonymous agent #101
"If you look at the processing power available say vs. a Next machine of yore with a 25Mhz 68030 class processor, you can see that today's OS is a bloated pig that basically wastes CPU power so that it forces people to buy new hardware"

Who told you you can't use a more efficient OS? E.g. did you know I can get better-than-Aero Glass 3D windowing effects with a fraction of the computing power using Linux and Beryl?


anonymous agent #101
"Ask yourself... "Where did a 100 (ONE HUNDRED) fold increase in clock speed actually go??""

Added features and eye candy (you have no idea how processor and memory intensive it is to draw a single window with transparency). Also, clock speed is not a very good way to measure CPU speed. At the same clock, C2D is 30-200% faster than Netburst.


anonymous agent #101
"Not to mention today's chips have much more capable memory interfaces, more total RAM, faster I/O systems, etc"

The PC I bought ten years ago has around 1/100th of the CPU performance of my current CPU, around 1/35th of the memory bandwidth and 1/16th of the RAM capacity. HDD burst speed has increased at most 10x and random reads at most 2-3x. I wouldn't call that all that much compared to the CPU speed increase.

If we compared how the CPU and memory speed/latency ratio has changed, you'd see it has gotten worse over the years. Much worse.

If you put your OS in RAM, or at least on an SSD, you'd instantly see a huge performance increase, since the HDD is slowing everything else down. Some OSes can efficiently cache things to memory to make them work faster.

I mostly agree with the rest of what you are saying: MS OSes can't use resources efficiently. Vista made some improvements but it is still very far from where Linux was years ago.


anonymous agent #101
"Mostly computers are pathetic considering how much they cost to buy and to run every month."

Depends on what you run. Ten years ago it would have cost me at least 1000x more to get the same performance as my current PC has. In terms of power usage things would have been even worse. Btw, I don't use my PC only for office work and gaming; I run quite a few resource-demanding things on it.

3:28 PM, February 15, 2007  
Blogger netrama said...

Mostly computers are pathetic considering how much they cost to buy and to run every month. One has to wonder if the world would be better off without them.

Can someone make an average computer that does everything for the avg Joe? It should be the size of a slim DVD player, you should be able to carry it around, and it should not cost more than $49-79.99, excluding monitor and keyboard.
Of course the big boys and the server folks can have their Zalman cooled toys..
The irony is all the technology exists... It can't happen in a society where money is made on the basis of FUD and pulling wool over people's eyes, and CEOs make statements like - yeah we made Core 2 so that people can browse MySpace :-))

7:20 PM, February 15, 2007  
Blogger Ho Ho said...

penix
"The limitations of 32-bit were recognized long ago, which is why AMD began to create 64-bit processors"

About the same time IBM made its first 64bit CPU, Intel came up with a way to use up to 64GiB of RAM with 32bit CPUs (PAE). The first of its kind was the Pentium Pro. Some Windows server editions support it, and Linux has supported it for a long time.

penix
"Utilization of idle ram is a questionable area. For example, disk caching could see a huge performance boost by using idle ram, but at the same time this is dangerous as too much caching would result is catastrophic data loss in the event of a power outage"

You clearly have no idea how disk cache works. All OSes have had disk caches for years, even Win95, and I haven't heard of any data loss caused by it. Windows is not using it as efficiently as other OSes, though.

penix
"The lack of proper division of resources is very apparent in software today. This is the result of poor programming model in software"

I agree. It's the programmers who have to change, OS and HW won't help all that much.

penix
"You cannot... yet."

I wouldn't want to either. It would decrease efficiency a lot.
Or do you have some bright ideas about how you could fuse together the pipelines of several cores? The fastest HT won't be fast enough. In fact, HT wouldn't even work with it.

Even if it did work somehow you'd need a lot of dedicated die area to make it happen. I'd take a couple of extra cores instead of that any day.


penix
"In this case, being small has an advantage of significantly lower overhead"

Overhead of what? Architecture changes or something else?

penix
"Intel has proven to be behind the curve in so many regards that I see it to be impossible for them to beat AMD to Reverse HT"

Then again, it is also far ahead in some other things.

2:09 AM, February 16, 2007  
Blogger PENIX said...

Ho Ho said...
penix
"Utilization of idle ram is a questionable area. For example, disk caching could see a huge performance boost by using idle ram, but at the same time this is dangerous as too much caching would result is catastrophic data loss in the event of a power outage"

You clearly have no idea how disk cache works. All OSes have had disk caches for years, even Win95, and I haven't heard of any data loss caused by it. Windows is not using it as efficiently as other OSes, though.


By making this statement you have shown you are not informed on this subject. When I speak of data loss due to caching, I am speaking about the well-known dangers of write caching. Huge speed improvements can be seen simply by increasing the write cache significantly, but it poses a huge risk of data loss in the event of an error.
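
To make this concrete, here is a minimal POSIX sketch of my own (a toy example; "journal.dat" is a made-up name). A write() normally just lands in the OS write cache; if the machine dies before the kernel flushes it, the data is gone unless you forced it out yourself:

#include <fcntl.h>
#include <unistd.h>
#include <cstring>

int main() {
    int fd = open("journal.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    const char *rec = "balance=42\n";
    // Returns as soon as the bytes are in the OS write cache. Fast, but
    // they may sit in RAM for seconds; a crash in that window loses them.
    write(fd, rec, strlen(rec));
    // Forces the cached data down to the disk. Safe, but easily an order
    // of magnitude slower -- which is the whole tradeoff being argued here.
    fsync(fd);
    close(fd);
    return 0;
}

That fsync() call is exactly what a bigger, lazier write cache trades away.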

Ho Ho said...
I wouldn't want to either. It would decrease efficiency a lot.
Or do you have some bright ideas about how you could fuse together the pipelines of several cores? The fastest HT won't be fast enough. In fact, HT wouldn't even work with it.


Actually, I do have ideas on how it could improve. As you know, modern processors use prediction engines to guess what the next piece of data will be so it can enter the pipeline before it has even been received. But prediction on near-random data is a huge gamble. The improvement is to have multiple pipelines in parallel, each executing a different prediction. With 4 cores, the chances of a correct prediction are 4x those of a single core. This would allow a single processor to see a significant performance increase on a single-threaded application.

Even though there would be a significant speed increase, it is unlikely you will ever see this model. At best, it is 25% efficient. But don't be too surprised if you start seeing cores with dual pipelines in the future.
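
A software analogy of the idea, as a minimal C++ sketch of my own (a toy, not how any shipping CPU implements it): evaluate both sides of a hard-to-predict branch eagerly, then keep whichever one the condition selects. Hardware eager execution makes the same bet, paid for in execution units instead of instructions:

#include <cstdlib>
#include <iostream>

// Toy software analogy of eager execution: compute both candidate
// results before the "branch" resolves, then select. No misprediction
// is possible because neither path was ever skipped.
int eager_select(int x) {
    int if_taken = x * 2 + 1;    // work for the taken path
    int if_not_taken = x / 2;    // work for the not-taken path
    // By the time the condition resolves, both answers already exist,
    // so choosing costs only a select, never a pipeline flush.
    return (x % 2 != 0) ? if_taken : if_not_taken;
}

int main() {
    std::cout << eager_select(std::rand()) << std::endl;
}

Of course the price is doing twice the work per result, which is exactly the efficiency problem mentioned above.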

Ho Ho said...
penix
"In this case, being small has an advantage of significantly lower overhead"

Overhead of what? Architecture changes or something else?


I'm speaking about company overhead, namely employees. Nothing to do with architecture.

11:51 AM, February 16, 2007  
Blogger Ho Ho said...

penix
"When I speak of data loss due to caching, I am speaking about the well known dangers of write caching"

Now tell me which OS does write caching in a way that can lead to loss of data? Anyway, normal PCs don't shut down in the middle of doing stuff for no good reason. IF they do, then data loss is not the biggest of your problems.

Btw, have you heard of filesystems with atomic operations?


penix
"This would allow a single processor to see a significant increase performance on a single threaded application"

First, do you know how CPU pipelines work? Do you know about data dependency? If yes, then can you explain to me how you would transfer instructions from one core's pipeline to another without massive performance loss? Bear in mind you'll also have to transfer (shadow) register contents and lots of other things. There is no bus in the world wide enough to move all that data without massive delays. Not to mention that it is impossible to insert functions in the middle of the pipelines of current CPUs. Making it possible would be a huge waste of time and resources.

3:15 AM, February 17, 2007  
Blogger nECrO AKA John said...

Ho Ho, what are you smoking? The Pentium 4 went to market years early because AMD and the Athlon K7 and K8 were kicking the shit out of it.

The P3 held lots of promise. We see this with the Core 2 Duo, which is a direct descendant. Intel panicked and released the P4 with its loooong pipelines and useless MHz. Useless except to the marketing people and the legions of mouth breathers who believed them. And you of course.

"The only thing worse than a fanboy, is a STUPID fanboy" ---- Me

12:23 PM, February 18, 2007  
Blogger Unknown said...


You clearly have no ideas how disk cache works. All OS'es have had disk caches for years, even win95, and I haven't heard of any data loss caused by it. Windows is not using it as efficiently as other OS'es, though.


Ho Ho, get lost. Professional programmer eh? I would never let you anywhere near a business application project. Since you are able to blow your trumpet about Windows 95, let us take it all the way back to MS-DOS et al. What EXPLICIT instruction is given to users when the disk caching tool smartdrv is loaded? It is something along the lines of "Do NOT ever just turn off your computer or you will suffer data loss". For the same reasons, you are instructed to properly shut down your computer instead of just hitting the power button today. Ho Ho, please take a hike and do not ever assume the identity of a professional programmer when you spew such completely false opinions about computers.

3:05 AM, February 19, 2007  
Blogger Scientia from AMDZone said...

Ho Ho said...
Scientia
"Intel's Core 2 Duo is just Pentium 3 with 128 bit SSE."


Ho Ho, maybe you could get the quote right. The above was quoted from Sharikou, not me.

4:56 AM, February 19, 2007  
Blogger Scientia from AMDZone said...

Sharikou
To qualify as an Alpha derivative, you must have integrated memory controller.


I'm sorry but this is incorrect. The first Alpha to have an IMC was the 21364, or EV7, which was released in 2003, the same year as the K8. Therefore, an IMC on K8 is not derivative of Alpha.

Only AMD K8 is Alpha.

I'm sorry but this is incorrect as well. The processor most closely related to Alpha was K7. K7 directly used the Alpha bus. There is actually nothing in K8 that can be shown to be derivative of an Alpha. For example, K7 used MOESI and a point-to-point bus to talk between processors and to I/O. Recall that Lightning Data Transport was announced long before K8. There is some parallel development between the K8 and the EV7 Alpha, but parallel is not derivative. K7 used Alpha tech certainly, and K8 is based on K7, but there is no new influx of Alpha tech in K8.

Intel's Core 2 Duo is just Pentium 3 with 128 bit SSE.

No. This is wrong as well. Pentium M was based on the PIII. Yonah was based on Pentium M. C2D is derivative of Yonah. However, there seems to be as much difference between Pentium M and PIII as between K8 and K7. There also appears to be as much difference between C2D and Pentium M as between K8 and K7. So, clearly, C2D is two generations beyond PIII. K10 will be two generations beyond K7.

5:30 AM, February 19, 2007  
Blogger PENIX said...

Ho Ho said...
Now tell me which OS does write caching in a way that can lead to loss of data? Anyway, normal PCs don't shut down in the middle of doing stuff for no good reason. IF they do, then data loss is not the biggest of your problems.

Btw, have you heard of filesystems with atomic operations?


WinXP, Linux and Mac OSX, to name a few. All employ write caching which can result in data loss. No, computers are not designed to shut down for no reason, but this doesn't mean it never happens. A shutdown isn't the only thing that could result in write-cache data loss. Perhaps you have heard of the blue screen of death? Yes, I have heard of atomic operations. No, they would do nothing to prevent loss of data from a write cache during error.

Ho Ho said...
First, do you know how CPU pipelines work? Do you know about data dependency? If yes, then can you explain to me how you would transfer instructions from one core's pipeline to another without massive performance loss? Bear in mind you'll also have to transfer (shadow) register contents and lots of other things. There is no bus in the world wide enough to move all that data without massive delays. Not to mention that it is impossible to insert functions in the middle of the pipelines of current CPUs. Making it possible would be a huge waste of time and resources.


The only data that would need to be synchronized between the cores are the register data changes. This could be accomplished using a modified direct connect link, which already allows for CPU-to-CPU communications at full speed. Inserting instructions into the middle of a pipeline would not be needed at all.

You are clearly in over your head in these matters. You have no idea how a write cache works, nor do you have any knowledge of the inner workings of a CPU. You are quickly turning yourself into the laughing stock of this blog. I suggest you stick to commenting on subjects at your level.

10:52 PM, February 19, 2007  
Blogger abinstein said...

penix: "The improvement is to have multiple pipelines in parallel, each executing a different prediction. With 4 cores, the chances of a correct prediction is 4x that of a single core. This would allow a single processor to see a significant increase performance on a single threaded application."

Itanium seems to have similar parallel execution engines. Basically it executes both taken and non-taken branches speculatively, and chooses the correct one after the branch result/target is known. It doesn't do this across cores, though, but within one core.

That aside, reverse hyperthreading is a myth. It might be possible to speculatively run two copies of the same program in two cores/pipelines and synchronize them later, but such a system is only theoretical. It would be neither simple nor easy to implement, and would definitely consume a lot of power and die space.

4:06 PM, February 20, 2007  
Blogger Ho Ho said...

scientia
"Ho Ho, maybe you could get the quote right. The above was quoted from Sharikou, not me."

Sorry about that, I'm not sure how that happened.

penix
"All employ write caching which can result in data loss"

Ok, but I asked how many times you have heard of data loss caused by data not being flushed to disk? Not pressing CTRL+S before a BSOD is not such a case. My W98 and XP used to crash from time to time but I can't remember a single thing I lost. I haven't managed to crash my Linux box that often but even using one of the most caching FS, Reiser4, I haven't got any data losses.

penix
"Yes, I have heard of atomic operations. No, they would do nothing to prevent loss of data from a write cache during error."

But they help with data corruption.

penix
"The only data that would need to be synchronized between the cores is the register data changes."

In x86-64 there are 16 64bit general purpose registers, 6 16bit segment registers, 4 64bit various-purpose registers, 8 80bit FPU registers, 3 various-purpose 16bit registers, 8 64bit MMX registers and 16 128bit SIMD registers. That is a total of 16*8 + 6*2 + 4*8 + 8*10 + 3*2 + 8*8 + 16*16 = 578 bytes of data, more than half a kilobyte. You can count them yourself from Intel's software developer manual, available for free on Intel's site.

In addition to those directly accessible registers there are a lot more registers, not directly accessible, that are used for instructions that are in the pipeline halfway through execution.

penix
"This could be accomplished using a modified direct connect link, which already allows for cpu to cpu communications at full speed."

Bandwidth of HT3 is quite nice as long as you compare it to RAM bandwidth. When you start comparing it with L1 cache bandwidth you'll see that it is an order of magnitude lower. K8 has an L1 read bandwidth of around 14 bytes per clock, or at 3GHz around 40GB/s. Write bandwidth is around 8.4 bytes per cycle. K8L is supposed to double that to 80GB/s. That would mean transferring all general purpose and SIMD registers, a total of (8+16)*16=384 bytes, at L1 bandwidth would take at best 384/14 ≈ 27 cycles. In other words, integer operations can enter, go through and exit the K8 pipeline twice in the time it takes to transfer all that data between CPUs. K8 has a 12 stage integer pipeline and a 17 stage FP pipeline; C2D has 14 stages for both integer and FP, but that is currently irrelevant.

You could say that they could make one CPU's register file directly accessible to the other CPU(s) so they could read single registers, but that would involve massive amounts of transistors used (read: wasted) for that purpose. Also, could you please give your opinion on why no one has created CPUs with a shared L1 cache (G80 isn't a CPU)? If L1 isn't shared because it would reduce efficiency, then wouldn't sharing registers be even more inefficient?

Access latency to L1 caches is around 3-4 cycles at best. I don't know HT latency but I bet it is much higher than that, most likely much higher than L2 latency, which for K8 is 12 cycles at 90nm and 14 cycles at 65nm.

penix
"Inserting instructions into the middle of a pipeline would not be needed at all."

Then how do you propose sharing of execution units would work? When doing it at the instruction level you have to be able to do that. By instruction level sharing I mean you could arbitrarily send instructions and their data to be executed on the other core.

If you mean sending short batches, around 100 cycles' worth, to the other CPU, then CPUs would need a massive overhaul of branch predictors and almost everything else to be able to separate and keep track of such chunks in the first place. That would also mean an insanely big overhead for almost nonexistent improvements. Even compilers can't do it all that well, even with humans helping them, and they are a lot smarter than any hard-wired CPU will be in the foreseeable future.

Also, as I said earlier, moving all that data between CPUs would need massive bandwidth.

Now if you want to make things especially interesting, imagine a quadcore where each and every core has direct fast access to every other core's registers. That would take three links per core. Imagine the wiring you would have to do for that crossbar!

Just in case, I ask again: do you know how a pipelined CPU works? If yes then good; if not, I could explain it to you in detail. I won't make this post any longer when there is a chance you already know it. If you don't know it exactly, don't be afraid to say so. It would make the discussion much more civilized if both sides know what they are talking about.

penix
"You are clearly over your head in these matters. You have no idea how a write cache works, nor do you have any knowledge on the inner workings of a CPU."

I beg to differ. To prove me wrong about CPU inner workings please provide some working scenarios for CPU execution unit sharing. All you did was propose an awfully slow, long-latency channel to share data, with no description of how it could actually work in the real world. Are you sure it isn't you who is in over your head?

11:05 AM, February 21, 2007  
Blogger PENIX said...

ho ho wasted everyone's time by saying...
Ok, but I asked how many times you have heard of data loss caused by data not being flushed to disk?


More than I can count. You must not have any idea what a lost chain is.

ho ho said...
penix
"Yes, I have heard of atomic operations. No, they would do nothing to prevent loss of data from a write cache during error."

But they help with data corruption.


You would not be arguing if you understood atomic transactions. When data is written or moved with atomic transactions, it's all or nothing. That does absolutely nothing to prevent unflushed data from being lost.
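
For the record, here is the classic pattern as a minimal POSIX sketch of my own ("config.tmp" and "config.dat" are made-up names). The rename() is atomic, so a crash leaves either the old file or the new one, never a corrupt half-write -- but anything still sitting in the write cache when the power dies is gone either way:

#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

int main() {
    // Write the new contents to a temporary file first.
    int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    const char *data = "threads=4\n";
    write(fd, data, strlen(data));
    fsync(fd);   // flush the write cache before publishing the file
    close(fd);
    // Atomic step: readers see either the old config.dat or the new one,
    // never a half-written mix. Unflushed data, however, stays lost.
    rename("config.tmp", "config.dat");
    return 0;
}

Which is the distinction here: atomicity protects against corruption, not against losing unflushed data.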

ho ho rambled...
In x86-64 there are 16 64bit general purpose registers, blah dee blah dee blah dee blah. Bloo bloo blah dee blah dee bloo.


I was proposing an improvement on speculative execution. I even stated that this would likely never be used because it is highly inefficient, yet you still feel the need to ramble about nothing for 3 pages. Next you will want to argue about who would win in a fight between Batman and Superman.

ho ho rambled...
penix
"You are clearly over your head in these matters. You have no idea how a write cache works, nor do you have any knowledge on the inner workings of a CPU."

I beg to differ. To prove me wrong about CPU inner workings please provide some working scenarios for CPU execution unit sharing.


I can do better than a scenario. Here is the patent, issued to IBM in 2004 for "Shared execution unit in a dual core processor".

Ladies and gentlemen, that's the end of the game! Thanks for coming. ho ho, please tender your letter of resignation at the front desk. Thank you.

5:00 PM, February 21, 2007  
Blogger Unknown said...

I haven't managed to crash my Linux box that often but even using one of the most caching FS, Reiser4, I haven't got any data losses.

reiser4 the most caching FS? Complete rubbish. The most caching FS in Linux is XFS, and its data loss ability is phenomenal.

People complain of how slow reiser4 is on the reiser mailing list and you come here to try to pull one over on us?

Ho Ho, give it up and stop posting. You can stop pretending to be some uber computer guy just because you know how to patch a Linux kernel to use reiser4 or to compile an Andrew Morton kernel.

6:38 PM, February 21, 2007  
Blogger Ho Ho said...

penix
"I was proposing an improvement on speculative execution"

Actually, it was abinstein who was talking about it working in Itanium. You simply claimed having four cores with reverse HT capability running any single threaded application would increase its speed 4x. Are you still claiming that?

About speculative execution, how do you propose data dependencies should be resolved? If CPUs were good at it, there wouldn't be nearly as many branch mispredictions, would there?

Branch prediction isn't all that bad in current CPUs. In the general case accuracy can be quite high; in loopy code it can be up to 80-90%, and in the worst case it is 50%. Each of those mispredictions causes a pipeline flush costing from 12/17 cycles on K8 and 14 cycles on C2 to 31 cycles on newer Netbursts. Not costly enough to justify using lots of transistors and power trying to reduce it further.

Say we have code like this. How do you propose making that run faster with speculative execution? What about code like this? I'd like to have some discussion, not mindless rambling.

I could set up PAPI to measure the exact number of mispredicted branches in any program, but it would take a while. Feel free to do it yourself in the meanwhile.
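
If you don't want to bother with PAPI's counters, even a stopwatch shows the effect. A minimal C++ sketch of my own (standard library only): the same branchy loop runs noticeably faster over sorted data, because the predictor learns the pattern:

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

// Classic branch predictor demo: identical work, different predictability.
long sum_big(const std::vector<int>& v) {
    long s = 0;
    for (int x : v)
        if (x >= 128) s += x;   // the branch under test
    return s;
}

int main() {
    std::vector<int> v(1 << 22);
    std::mt19937 rng(42);
    for (int& x : v) x = rng() % 256;   // random data: branch is ~50/50

    auto time_it = [&]() {
        auto t0 = std::chrono::steady_clock::now();
        volatile long s = sum_big(v);
        (void)s;
        return std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();
    };

    std::cout << "random: " << time_it() << " s\n";
    std::sort(v.begin(), v.end());      // same data, predictable branch
    std::cout << "sorted: " << time_it() << " s\n";
}

On typical hardware the sorted run is several times faster, and the difference is almost entirely branch mispredictions.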

Assuming reverse HT would work, who would benefit from having their older applications run a tiny bit faster? It won't do any good for servers. In fact, it would lower the performance of any multithreaded application, and it would also be slower when you have several singlethreaded applications working in parallel, not to mention increased power usage.


penix
"I can do better than a scenario. Here is the patent, issued to IBM in 2004 for "Shared execution unit in a dual core processor"."

So does that mean we'll have full-body teleportation systems available any time soon? There are patents for all sorts of stuff; that doesn't mean they will get implemented.

First, I wouldn't call posting a link to a patent better than explaining the stuff yourself. People have patented all sorts of stuff; that doesn't mean they'll implement it.

Secondly, as you said in your last reply, you were talking about speculative execution. That patent was talking about a totally different thing; it was about using the other core's FPUs. I don't see this as very useful; instruction level parallelism is rather difficult to find and exploit even inside current x86 cores.

christopher
"reiser4 the most caching FS?"

You missed the part where I said "one of".

christopher
"People complain of how slow reiser4 is on the reiser mailing list and you come here to try to pull one over us?"

They are complaining? Could you share some links to those ML threads where slowness isn't caused by human error?

christopher
"Ho Ho, give it up and stop posting"

Why should I? I can't remember you saying anything meaningful in this blog but you still post. I at least try to explain stuff; too bad some people can't comprehend what I'm talking about, or at least they aren't willing to have a discussion about it for some reason.


christopher
"You can stop pretending to be some uber computer guy just because you know how to patch a Linux kernel to use reiser4 or to compile a Andrew Morton kernel."

Besides that I also know how to program. E.g. here is something I threw together about four months ago one evening when I was bored. It was my second ever Qt program. Next time someone says (s)he is a computer programmer I'd like to see some code they've written. You've seen mine; if you have anything to say about it you can always email me. In fact, I'd be glad if someone can find some bugs in that program or has any other comments about it.

1:36 AM, February 22, 2007  
Blogger Ho Ho said...

Something got screwed up in the last paragraph. The missing word was "here"; I'm not sure why the rest of the paragraph was converted to a URL. I know for sure I had proper HTML code. At least the links work fine.

1:44 AM, February 22, 2007  
Blogger Feizhou said...

"reiser4 the most caching FS?"

Ho ho blabbed:
You missed the part where I said "one of".


I am sorry, I forgot to say that reiser4 BYPASSES the disk cache. reiser4 does not cache at all, if I have not read things wrong, which is why crashing on reiser4 has resulted in zero data loss or corruption for most testers to date. You are COMPLETELY wrong about reiser4 being one of the most caching filesystems available for Linux.

"People complain of how slow reiser4 is on the reiser mailing list and you come here to try to pull one over us?"

Ho Ho continues:
They are complaining? Could you share some links to those ML threads where slowness isn't caused by human error?


Here

Here

and here


"Ho Ho, give it up and stop posting"


Ho Ho blabbered:
Why should I? I can't remember you saying anything meaningful in this blog but you still post. I at least try to explain stuff; too bad some people can't comprehend what I'm talking about, or at least they aren't willing to have a discussion about it for some reason.


You have made erroneous statements, you are trying to cover them up, and your attempts at covering up are failing, so give it up. Post whatever you like that is NOT related to filesystems and OS disk caches, because you CLEARLY have ZERO credibility with your continued attempts to claim that disk caching does not result in data loss. The entire thrust of filesystem development for the last decade has been how to reduce the problems of using a disk cache.

ho ho announced:

Besides that I also know how to program


Writing a GUI program does not qualify you for OS development or even as a programmer. I am not a programmer, but I too have written a C program to filter spam and otherwise deal with potential spammers, and that does not qualify me as a programmer. For a few years I managed scores of Linux servers that handle over 200 million email transactions and deliver over 5 million emails on a daily basis, and this puts me in a position to tell you that disk caching comes with no guarantee of data integrity.

Like I said, stop pretending to be some uber computer guy. Since you seem to have an interest in GUI development, why don't you go become an uber computer guy by helping out with KDE4 and getting KDE4/Qt4 on OpenSolaris instead of doling out nonsense on this blog. Intel does not need you to defend their rubbish, but KDE and/or OpenSolaris can do with your time and energy.

7:00 AM, February 22, 2007  
Blogger PENIX said...

Ho Ho said...
penix
"I was proposing an improvement on speculative execution"

Actually, it was abinstein who was talking about it working in Itanium. You simply claimed having four cores with reverse HT capability running any single threaded application would increase its speed 4x. Are you still claiming that?


You are confusing two different topics. Yes, I am stating that Reverse HT on a 4 core machine would increase single thread performance by 4x. That is the very nature of Reverse HT. The topic of speculative execution that Abinstein and I mentioned has nothing to do with Reverse HT.

Ho Ho said...
Branch prediction isn't all that bad in current CPUs. In the general case accuracy can be quite high; in loopy code it can be up to 80-90%, and in the worst case it is 50%.


Static prediction can be that high, but dynamic is not.

Ho Ho said...
About speculative execution, how do you propose data dependencies should be resolved? If CPUs were good at it, there wouldn't be nearly as many branch mispredictions, would there?
...
Say we have code like this. How do you propose making that run faster with speculative execution?


Make a prediction on the branch taken and allow the code at that point to enter the pipeline. The data dependency would have to be resolved by using a queued load which would get injected directly into a later stage of the pipeline before execution.

This is also a perfect example of my idea for enhancing speculative execution across multiple cores/pipelines. Both branches could be taken on separate pipelines. This results in a 0% chance of misprediction, as both branches were actually executed.

Ho Ho said...
I could set up PAPI to measure the exact number of mispredicted branches in any program, but it would take a while. Feel free to do it yourself in the meanwhile.


I will look into PAPI.

Ho Ho said...
Assuming reverse HT would work, who would benefit from having their older applications run a tiny bit faster?


Considering that most applications today are single threaded, everyone.

Ho Ho said...
It won't do any good for servers.


Yes it would.

Ho Ho said...
In fact, it would lower the performance of any multithreaded application,


No it would not.

Ho Ho said...
it would also be slower when you have several singlethreaded applications working in parallel,


No it would not.

Ho Ho said...
not to mention increased power usage.


No it would not.

Ho Ho said...
So does that mean we'll have full-body teleportation systems available any time soon?


I hope so. I hate commuting.

Ho Ho said...
People have patented all sorts of stuff; that doesn't mean they'll implement it.


But it does mean that it is an option they are considering, and one worth the effort and money to patent.

Ho Ho said...
Secondly, as you said in your last reply, you were talking about speculative execution. That patent was talking about a totally different thing; it was about using the other core's FPUs. I don't see this as very useful; instruction level parallelism is rather difficult to find and exploit even inside current x86 cores.


I did not start the conversation on shared execution units, you did.

2:09 PM, February 22, 2007  
