Monday, June 19, 2006

Inverse threading for AM2?

According to this source, all dual core AM2 CPUs have built-in inverse threading capability, ready to be turned on by software.

The idea is simple enough: superscalar across multiple cores, but at a much higher level.

Consider a typical piece of code:

v1 = f1 ();
v2 = f2 ();
// then do something with v1 and v2

Here, we have 2 calls. Assuming they are well written, re-entrant and thread-safe, there is a good chance that they can be run in parallel. So why don't we run f2() on the second core when we start f1()? It's just like a remote procedure call on a different core.
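As a rough software analogy only, here is a minimal Python sketch of the idea. The bodies of `f1` and `f2` are placeholders (assumptions, not real APIs), and the split is done explicitly with a thread pool, whereas the rumored feature would do it transparently in hardware.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def f1():
    # Placeholder for an independent, thread-safe call (assumption).
    time.sleep(0.05)
    return 1


def f2():
    # Placeholder for a second independent, thread-safe call (assumption).
    time.sleep(0.05)
    return 2


def run_serial():
    # The original order: f2 starts only after f1 has returned.
    v1 = f1()
    v2 = f2()
    return v1 + v2


def run_parallel():
    # Start f2 on a second worker, run f1 on the calling thread, then join.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending_v2 = pool.submit(f2)
        v1 = f1()
        v2 = pending_v2.result()
    return v1 + v2
```

Both versions must produce the same value; the parallel one is only legal because neither function depends on the other's result or side effects.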

All you need is some logic to detect such code blocks. It should be quite easy to identify all those C runtime and Win32 APIs that meet the condition....

I was thinking about filing a patent application, but a reader pointed me towards this AMD patent (6574725). It was filed in 1999 and granted in 2003.


60 Comments:

Anonymous Anonymous said...

It is not that easy to invoke two functions on two cores. Most operations depend on the previous operation, so f1 must finish before f2 can start. But I think that in some cases the second core can act as a slave, giving more power to the single thread. We shouldn't expect miracles, though. If you hope that RevHT will boost single-thread processing by 90-100%: no way. I would expect 20%, and more in some particular cases. And we are talking about two CPUs/cores, not 4, 8 or more.
AMD64 is great, but it is not an alien technology...
And I think that it is going to work only on Windows for now.

10:33 PM, June 19, 2006  
Anonymous Anonymous said...

this rumor is hard to believe.

if that's true and "inverse threading" is indeed that good, AMD wouldn't need to sell AM2 so cheap from August and on.

also, the example you gave isn't that simple. it's very difficult, if possible at all, for a compiler to determine there's no dependency between the two functions, let alone hardware. how do you know the two functions are well-formed and thread-safe, etc.?

it's quite obvious that AMD has to swallow the 10%-15% performance lag to Intel's Core 2 Duo until K8L is released. The lag is never as big as Intel's previous benchmarketing had suggested, but it's there and noticeable.

12:54 AM, June 20, 2006  
Anonymous Anonymous said...

Excellent idea!
Rumors are circulating that AMD gave their next CPU the name Bulldozer. Looks like it will live up to its name in full force.

1:16 AM, June 20, 2006  
Anonymous Anonymous said...

source article sounds like a bad joke.

finding implicitly parallel pieces of code is a very hard task even for a c compiler, which has virtually unlimited time and resources.

add to that the cost of transferring the piece of code to execute on the other core, and so on.

all in all, this is crap.

2:03 AM, June 20, 2006  
Anonymous Anonymous said...

Might this feature be a killer for Conroe's 4MB cache??

2:38 AM, June 20, 2006  
Anonymous Anonymous said...

OK, OK, you are a PhD and I am not a PhD

2:58 AM, June 20, 2006  
Blogger Altamir Gomes said...

A translation layer could read opcodes and then deal with real-to-pseudo address conversions, check if the code is self-modifying, etc

The principle of code locality should also help when a piece of code is split into two streams. That's exactly the opposite of HT: instead of bloating the I-cache with too many threads, AMD's approach would take load off the cores.

3:14 AM, June 20, 2006  
Blogger Ajay S. said...

can't wait for benchmarks that test the performance benefits of using reverse hyper threading

was checking out conroe performance and came across this benchmark

http://www.extremetech.com/article2/0,1697,1970191,00.asp

See the results for PCMark05 Memory read and write tests.

Conroe has big margins when the data set is 4 KB and 192 KB,

BUT transfer speed drops dramatically and becomes SLOWER than FX-62 when the data size is 8 MB !!!

Benchmarks are usually run on a stock Windows machine with no services or third party software running in the background during the benchmark. Similarly, when games are run, the entire system is dedicated to them.

A normal Windows machine in the real world would have Anti-Virus, Anti-Spyware, Firewall, Browser, MS Office Application, Mail client, Winamp, messenger clients or a similar mix of other software in memory, with the user switching between them, or a few of them running in the background asking for some CPU time intermittently, forcing the cache to fetch different datasets from main memory often

Will this affect Conroe's performance in the real world? is the huge cache really there just for the benchmarks and games? Now that AMD is using only 512kb cache in the mainstream AM2 processors, is Intel back to leading users up the wrong path as it did with Netburst and Hyperthreading?

3:32 AM, June 20, 2006  
Blogger "Mad Mod" Mike said...

There are countless morons over at Toms hardware guide forumz who say they are "expert" programmers and said that inverse threading was impossible. It was quite amusing reading the Intel fanboys FUD about how AMD could not do that because it was impossible for some strange and mysterious reason....and now look how THG forumz has turned out...quite sad indeed.

3:58 AM, June 20, 2006  
Anonymous Anonymous said...

Always thought AMD would have something up their sleeve, but I don't think this would be the only answer they have to Conroe.

It is their style to outshine Intel and make them notice, but I still don't believe this alone will do it.

There's gotta be more, and it's definitely going to be quite a good few months, especially with all the hype Intel is pumping out. Hopefully for the industry it's not all just hype though.

By the way, do you have a contact email Sharikou? I emailed you before but got no reply.

4:47 AM, June 20, 2006  
Anonymous Anonymous said...

mnogoyadernosti means multicore
mnogozadachnosti means multitasking
and inverse threading means obratnyj prodevatep nitku :D (xaxa)

4:57 AM, June 20, 2006  
Blogger Sharikou, Ph. D. said...

By the way, do you have a contact email Sharikou? I emailed you before but got no reply.

Sorry. There is so much spam mail that I sometimes miss the real messages.

6:18 AM, June 20, 2006  
Anonymous Anonymous said...

I think this is quite possible.

In fact you all have been hearing AMD talking about Co-processors, so maybe HT is so good that AMD figured out one way that the second core could act as a Co processor for the first core?

And given the probable K9, K10 cancellation, maybe AMD is now working on coprocessors for the K8. Since they now have a very good server share, specialized processors would sell for very good prices.

6:28 AM, June 20, 2006  
Anonymous Anonymous said...

Dear Mr. Sharikou;

Could you please give more detailed information about this subject, and some useful links too?

With kind regards.

6:29 AM, June 20, 2006  
Blogger Darth Solarion said...

I remember a rumour like this appeared some time back.. is this for real?

6:52 AM, June 20, 2006  
Anonymous Anonymous said...

Intel is also developing a similar technology, called Mitosis.

8:02 AM, June 20, 2006  
Anonymous Anonymous said...

I am a programmer and I don't see how this could work. Even though API functions are thread-safe, they still have to use the stack. Multi-threaded programs need a stack per thread, otherwise one would overwrite the other's data. My guess is that this won't work.

However a revHT might be possible if the execution units (ALU, FPU, SSE, AGU, ...) are shared resources. In this case there would be two sets of L1 and L2 caches and the OS scheduler would see this as a dual CPU but the two cores could be "uneven" (eg. one core using 4 ALUs, and the other 2, rather than 3 each). I think this is what happens on Sun's CoolThreads CPU. Of course I'm not an expert so I don't know how hard it would be for AMD to fuse the two cores at such a level.

8:47 AM, June 20, 2006  
Anonymous Anonymous said...

Biggest fairy tale i have ever heard :)

AMD is trying to implement "reverse HT", not some voodoo magic device. HT worked this way: the processor has execution units that are underutilised, and some of them idle, so by having two threads per physical CPU you can milk some extra performance out of the core. OFC the P4 was far from the best platform for HT, so most of the time the benefits were next to none, or performance even dropped due to competition for resources.

Now, what AMD is trying to do is "reverse HT". That is, have two cores with their own cache, registers, etc., but share a certain pool of execution resources (AGUs, ALUs, SSE units etc) between them. So you can get single thread performance speedups and at the same time have two cores that are utilized more optimally at any given time (at least in theory).

Cool idea, but there is no chance it's in AM2 CPU's. AM2 => K8 with different memory controller.

12:55 PM, June 20, 2006  
Anonymous Anonymous said...

To the programmer and all others who may not believe this is true: take a look at US patent number 6574725. It explains how this system works. Not only does it work well, but it will work even better when AMD introduces quad core early next year.
AM2 CPUs ALREADY HAVE this logic built into the virtualisation tech in Rev F. It is up to AMD and MS whether this will be activated after Core 2 launches.

1:02 PM, June 20, 2006  
Anonymous Anonymous said...

Method and mechanism for speculatively executing threads of instructions

1:57 PM, June 20, 2006  
Anonymous Anonymous said...

patent link talks about a scheduler inside the cpu and therefore very fast context switching, something like sun did with ultraT1, which has ultra low context switching time. it has nothing to do with the blog example of functions f1/f2, and even such a very fast context switch will still be considerably slower than parallel execution of 2 threads (supposing that they are independent).

it also targets speed-up of multithreaded/multiprocess execution, not single thread speed-up.

2:53 PM, June 20, 2006  
Anonymous Anonymous said...

OMG did u read the patent AT ALL? It does talk about SINGLE THREADED execution and the way the second CPU (core) handles the instructions "served" by the "master" CPU/core. There is thread arbitration logic which controls the flow of data (through 5 instructions), checks for dependencies and orders the second CPU to do the work, and at the end it synchronises the "threads".

Do read it pls, and after that post something that makes sense.

4:04 PM, June 20, 2006  
Blogger Sharikou, Ph. D. said...

patent link talks about a scheduler inside the cpu and therefore very fast context switching, something like sun did with ultraT1, which has ultra low context switching time.

Wrong. The patent has zero to do with software threads (kernel or user level). It's about chopping code into many pieces and running them concurrently, yet producing a correct result. This process is transparent to the software. As for how to chop the code, I guess there must be another patent for doing it intelligently.

4:14 PM, June 20, 2006  
Blogger Richard P said...

What's even more interesting is the patent date: 1999. Guess they've been working on this for awhile. I'm not convinced it's in AM2, but it's certainly possible. AMD has been so tight lipped it's difficult to really say.

5:02 PM, June 20, 2006  
Anonymous Anonymous said...

I think he has a reading problem. The ending clearly states that it has nothing to do with software...

5:55 PM, June 20, 2006  
Anonymous Anonymous said...

Dr. Ruiz told us that they are working on "jaw dropping" technology. From AMD, I now expect no less than something like this.

The problem Sharikou is that most functions take input from the output of another function.

But for all of you who believe that AMD can just turn this feature on and everything just automagically starts working... well, dream on!

Yes this is ALL very feasible with some help from the programmer and compilers. It just dawned on me that if we can have synchronized variables, functions and blocks of code, why in the world couldn't we do the exact opposite by declaring variables, functions and blocks as parallelized!

This would allow AMD to schedule EVERYTHING else for execution (all at the discretion of the programmer).

Example:

parallelized int f1()
{
// do stuff 1!
return (v);
}

parallelized int f2(int a)
{
// do stuff 2!
int r = a * a;
// do stuff 3!
return (r);
}

int main(...)
{
synchronized int x = f1();
parallelized int y = f2(x);
}

This way f1() and f2() could be run in parallel. The 'r = a * a' would be scheduled to run after f1() has completed... by that time the 'do stuff 3' has also completed. The last step would be to return the 'r' from f2().

Most of the hard work is done by the compiler... hardware scheduling and support would be needed... read the patent they seem to be talking about exactly that!
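The `parallelized` and `synchronized` keywords above are the commenter's invention. A rough approximation with ordinary futures, assuming hypothetical bodies for `f1` and `f2`, might look like this in Python: `f2` starts right away, does its input-independent work, and blocks only at the point where it actually needs `a`.

```python
from concurrent.futures import ThreadPoolExecutor


def f1():
    # "do stuff 1": placeholder computation (assumption).
    return 7


def f2(a_future):
    # "do stuff 2" and "do stuff 3": work that needs no input from f1.
    independent = sum(range(10))
    # Block only here, where the value of `a` is really required.
    a = a_future.result()
    r = a * a
    return r, independent


def main():
    # Two workers, so f1 and f2 can both be in flight at the same time.
    with ThreadPoolExecutor(max_workers=2) as pool:
        x_future = pool.submit(f1)
        y_future = pool.submit(f2, x_future)
        return x_future.result(), y_future.result()
```

The dependency on `x` is explicit in the code, which is the point of the comment: the programmer or compiler declares what may overlap instead of the hardware guessing.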

7:31 PM, June 20, 2006  
Anonymous Anonymous said...

[...] AMD figured out one way that the second core could act as a Co processor for the first core?

You're not too far off!

For those of you who don't know much about programming, let me inform you that all the multimedia instructions (MMX, SSE, SSE2, 3DNow!, etc) can be executed in parallel to other common instructions. The only exception to this is that you can't do any regular math at the same time.

So if you have a quad-core, it would be extremely easy for AMD to split such multimedia instructions off to the other cores, which in most cases could schedule them immediately in parallel with whatever is currently processing (unless currently doing math).

This would result in 'almost' 4x performance increase in any multimedia processing! All of this without even slowing down regular processing.

Now imagine a game running on the 4x4 with a total of 8 cores!

Soon you'll be playing games like you never imagined before!!

7:45 PM, June 20, 2006  
Blogger Sharikou, Ph. D. said...

Most of the hard work is done by the compiler... hardware scheduling and support would be needed... read the patent they seem to be talking about exactly that!

No. The patent is about running an existing single stream of code as two streams, speculatively. There is no need for recompilation. Superscalar CPUs are already doing this; AMD just extends the concept across CPUs.

8:03 PM, June 20, 2006  
Anonymous Anonymous said...

Well, you all can say what you want. intel didn't believe amd was coming up with 32+64 in single chip, but amd64 is now a standard...maybe, MAYBE, this will be the same thing. It's what happens when you surround yourself with brilliant people.

8:13 PM, June 20, 2006  
Anonymous Anonymous said...

I don't really think a large performance gain should be expected from inverse threading. The "increased parallelism" can only be extracted in some cases and there is overhead logic needed to split and rejoin everything. A lot of interprocessor communication is required which means latencies take a hit and a stall in one core flushes both cores. These all eat away at potential performance gains. 80%-90% performance gains are impossible. We're probably looking at HT like performance, averaging around 10% with some circumstances even hurting performance.

9:57 PM, June 20, 2006  
Anonymous Anonymous said...

The patent makes the technology seem largely hardware based. The question is why hasn't anyone noticed the extra transistors? Somehow I doubt the technology is in Rev F and can just be miraculously activated with a BIOS update and an XP patch. Why wouldn't AMD have released this already if all current Rev F processors have this feature? You'd think they would have done it at launch. At least it'd give people a reason to buy AM2 considering its lack of tangible improvement over S939.

10:01 PM, June 20, 2006  
Anonymous Anonymous said...

Right! So in your opinion...

Boss thread 1:

while (!done)
{
cmd = getNextCommand();
queueCommandForProcessing(cmd);
}

Worker thread 2 to thread 'n':

while (!done)
{
cmd = getNextCommand();

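switch (cmd.id)
{
// each case ends with break so that only the requested operation runs
case 1: blurImage(); break;
case 2: despecleImage(); break;
case 3: sharpenImage(); break;
...
case 98: outlineImage(); break;
case 99: deleteImage(); break;
}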
}

So what you're telling me is that the processor will speculatively guess what the 'command' will be and attempt to pre-process the image.

More than likely it will get it wrong!

Or maybe it will have time to do all 100 image-processing operations. Wouldn't that be a waste of processing power?

I've programmed all sorts of 'threaded' applications in the last 12 years. I've been sitting here for the last hour trying to think if anything I've ever done could work somewhat well using guess work (sorry, speculative).

If you read the literature about Intel's Mitosis, they clearly state that this would only work well with the help of compilers.

But god do I hope I'm wrong and you and AMD are right!!

10:08 PM, June 20, 2006  
Blogger "Mad Mod" Mike said...

"We're probably looking at HT like performance, averaging around 10% with some circumstances even hurting performance."

I highly doubt AMD would spend 7 years developing this technology, sacrificing an entire CPU, just to gain 10% performance on an application.

10:31 PM, June 20, 2006  
Anonymous Anonymous said...

On reverse hyperthreading:

Many here are making some assumptions that may not be true.

1) No changes to the cores of the CPUs being combined.

2) Works on any two cores whether they are on separate chips or even connected by a FSB.

3) Requires special utilities ala Transmeta (VLIW CPU emulating a x86).

Let's take the last assumption first. All modern GP CPUs today do OOE (Out of Order Execution). There are no dedicated GPRs, x87 stack or SSE regs, just one big pool of registers plus a pair of maps that convert references to the logical GPRs, FPU and SSE regs into actual register pool members. The first map is the speculative map. It is updated every cycle as uops are scheduled for execution. The second map is the map of the registers as of the last instruction retired. If some exception occurs that requires a flush of the execution pipelines, the retired map is copied into the speculative map and the pipelines are refilled with the correct instructions.

Here the scheduler, with these two logical register maps, decides which uops are ready to be placed on the execution pipelines and how the speculative map must be updated. A simple load of a logical register doesn't go to the pool register that the speculative map currently points to. A new free register is used to receive that load, and the speculative map now shows that register as being the new logical register. As the uop is retired, the previous pool register is freed to be reused later. The same occurs when a register is the destination of a calculation. These maps, along with the uops in the pipelines and the data in the referenced virtual pool registers, are the "state" of a core.

Thus the speculative map is maintained by the scheduler and the retired map by the retirement unit. Data dependencies are taken care of by the scheduler and it can do some independent operations of F2 before some in F1. Sometimes the next iteration of a loop is done at the same time as the current iteration. Typically only the loop counter is dependent between loops.
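To make the two-map scheme concrete, here is a toy Python model. The class name, register names and pool size are illustrative assumptions, and it ignores timing, ports and everything else a real core does; it only tracks how the speculative map, the retired map and the free list change on a write, a retirement and a flush.

```python
class RenameState:
    """Toy model of a register pool with speculative and retired maps."""

    def __init__(self, logical_regs, pool_size):
        assert pool_size > len(logical_regs)
        self.pool = [0] * pool_size
        # Initially logical register i lives in pool register i.
        self.speculative = {name: i for i, name in enumerate(logical_regs)}
        self.retired = dict(self.speculative)
        self.free = list(range(len(logical_regs), pool_size))
        self.in_flight = []  # (logical, pool index) in program order

    def write(self, logical, value):
        # A write never overwrites the current pool register: it takes a
        # fresh one and repoints the speculative map at it.
        phys = self.free.pop(0)
        self.pool[phys] = value
        self.speculative[logical] = phys
        self.in_flight.append((logical, phys))
        return phys

    def read(self, logical):
        return self.pool[self.speculative[logical]]

    def retire(self):
        # Retiring the oldest write updates the retired map and frees the
        # pool register that previously held this logical register.
        logical, phys = self.in_flight.pop(0)
        previous = self.retired[logical]
        self.retired[logical] = phys
        self.free.append(previous)

    def flush(self):
        # On an exception, un-retired writes are discarded and the
        # speculative map is restored from the retired map.
        for _, phys in self.in_flight:
            self.free.append(phys)
        self.in_flight.clear()
        self.speculative = dict(self.retired)
```

For example, after a retired write of 10 to `rax` and an un-retired write of 20, a flush makes `rax` read as 10 again without copying any register contents.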

The amount of uops that the scheduler can look at the same time is called the OOE window. For the K8, the OOE window has up to 72 uop pairs.

AMD may only allow reverse hyperthreading when both (or multiple) cores are on the same die. This allows some special circuitry to communicate between cores. The communication could be at one of four points, the scheduler itself, the feeds into the execution pipelines, the virtual register pool or the retirement unit. It also could be a combination of the above.

AMD has a patent on synchronizing two register pools. If the data in the pools are identical and the retirement maps are as well (non speculative logical register map), then any uop pair executing in either core will have the same data as a base and have the same result. This is where the output from the decoders on both cores could decode the same instructions and the even portions would be worked on one core and the odd portions on the other core. This minimizes the changes in order to boost the execution rate of any SIMD instructions.

Additional power could be made by decoding using both decoders on the instruction stream, passing the results to both schedulers and having either of them being able to fill the execution pipelines of either core. In any conflict, the first scheduler has priority. Decoder mode changes detected by the primary decoder require a flush of the secondary decoder. This has the power to almost make a 6 wide machine out of two 3 wide cores.

The easiest way would be to simply make a 6 wide scheduler with two speculative maps that can be split into two 3 wide ones, each with its own speculative map. The retirement unit has a 6 wide unit with two retired logical register maps. Then the register pool can be explicitly synchronized. Implicit uops are added to copy from one register in either pool to the other pool so that all data for an execution unit is in the register pool attached to that unit. This has the highest IPC of anything except a from-the-ground-up 6 wide core.

For those that think that state of each core must be transferred to the other core, there is no need to transfer state. You can do it in one of a few ways. The first is to have all writes go to both register pools. Then all retirements are identical and both non speculative maps (retired state) are the same. This is far less traffic than the state copies you seem to think are needed. In fact many registers in the pool are written over by pointing to different registers as retirements occur. Any register not being referenced by subsequent operations can be ignored until referenced and even then can have the write delayed until necessary by a synchronization unit when the transfer path is available. The first part will reduce the amount of data moved by one to two orders of magnitude. The second will reduce it almost one magnitude more on typical code.

Having a 6 wide scheduler can run two cores simultaneously with separate 3 wide decoders able to work on one of two threads on any given cycle. The register pools are explicitly kept synchronized only to the extent absolutely necessary. The "retired" register map for each thread is also explicitly kept sane. Thus the combined scheduler can make either thread look like it has a 6 wide core running it even though there are really two 3 issue cores in all other respects.

So without a lot of cross connection, this new symmetrical core looks like two 6 issue cores to the software. That is better than the 4 issue look by each core of DC NGA.

As far as your comment about where this is coming from, AMD Engineers discussed this as an adjunct to virtual machines. It can be done with VM (simulate a 6 issue virtual machine) but, the performance gets increased far more with hardware assists. My upgraded overview would allow a single A64 X2 to be seen as two 6 issue cores with a combined performance much nearer to two true 6 issue cores than two 3 or 4 issue ones. And the upgrade can be done in stages.

Stage one just does SIMD stuff. And that can be handled with only a small change to the decoders. That change is to flip the top and bottom halves of the SSE and MMX registers on any instruction decoded by the second core and only do the bottom half of any SIMD vector in each core. This effectively doubles the execution rate of any vector SIMD instruction.
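The data flow of stage one can be sketched in a few lines of Python. This is only a model of which lanes each core would touch, with made-up helper names; it runs sequentially and says nothing about how the halves would be merged in silicon.

```python
def swap_halves(vec):
    # What the second core's decoder would do: flip top and bottom halves.
    half = len(vec) // 2
    return vec[half:] + vec[:half]


def low_half_add(a, b):
    # Each core only executes the bottom half of the vector operation.
    half = len(a) // 2
    return [x + y for x, y in zip(a[:half], b[:half])]


def paired_add(a, b):
    # Core 0 sees the registers as they are and computes the low lanes.
    low = low_half_add(a, b)
    # Core 1 sees the flipped registers, so its "low" lanes are the
    # original high lanes.
    high = low_half_add(swap_halves(a), swap_halves(b))
    return low + high
```

With a 4-lane add, each core does 2 lanes of work, and `paired_add([1, 2, 3, 4], [10, 20, 30, 40])` gives the same result as a full-width add.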

Stage two allows either scheduler to use either core's execution units. And stage three is the full six issue scheduler. And at no point is any special software required. The only thing is how the CPU is initialized and threads are handled. Perhaps to keep things as compatible as possible, this new mode of operation is called "Paired" mode, to go with the "Compatibility" and "Long" modes currently in AMD64. The simple stage one requires more initialization and upkeep, and the complex stage three does not need much, if anything at all. At most there is some indication that the cores are being paired; otherwise an external user wouldn't know about it other than by the boost in performance.

10:36 PM, June 20, 2006  
Blogger Sharikou, Ph. D. said...

This has the power to almost make a 6 wide machine out of two 3 wide cores.

I envision a far more flexible design, where the pipelines can be grouped and ungrouped dynamically. For instance, with the above, instead of a 6 wide machine, you can have one 5 wide and one single pipeline core. Or 6 cores each with one pipeline. With quadcore, you can have one 12 wide core (and 11 zero MHZ cores) or 12 single pipeline cores.

11:16 PM, June 20, 2006  
Anonymous Anonymous said...

Athlon AM2 FX-62 (2MB cache) occupies 230 mm^2, Conroe XE 2.93 (4MB cache) - 140 mm^2.
If we scale the Athlon to Conroe's 65 nm process, we get 166 mm^2 (230*(65/90)). Is it possible that the IMC occupies the same space as 2MB of cache? Or does the Athlon have some hidden functions which are missing in Conroe?

11:57 PM, June 20, 2006  
Anonymous Anonymous said...

here is patent text:
"Method and mechanism for speculatively executing threads of instructions
Document: United States Patent 6574725

Abstract: A processor architecture containing multiple closely coupled processors in a form of symmetric multiprocessing system is provided. The special coupling mechanism allows it to speculatively execute multiple threads in parallel very efficiently. Generally, the operating system is responsible for scheduling various threads of execution among the available processors in a multiprocessor system. One problem with parallel multithreading is that the overhead involved in scheduling the threads for execution by the operating system is such that shorter segments of code cannot efficiently take advantage of parallel multithreading. Consequently, potential performance gains from parallel multithreading are not attainable. Additional circuitry is included in a form of symmetrical multiprocessing system which enables the scheduling and speculative execution of multiple threads on multiple processors without the involvement and inherent overhead of the operating system. Advantageously, parallel multithreaded execution is more efficient and performance may be improved."

NB this "operating system is responsible for scheduling various threads" refers to explicit multithreading, posix threads in unix case and windows threads in windows case. picture which is added to blog, refers to thread fork/join, also posix terms, to create/terminate thread.

probably I have issues with reading. can someone without reading problems point to the place in the patent text where it says that it will create multiple threads from a single thread? it is constantly talking about "multiple threads" AFAIU.

besides that, a "simple change to operating system" is not that simple. how many ppl run x64 win xp? less than 1%? how many ppl will download and install a 'patch' to windows to enable this hardware scheduler? a couple of thousand? big banks run special versions of windows and are very strict about any changes in setup. what about other oses?

12:05 AM, June 21, 2006  
Anonymous Anonymous said...

Athlon 64 FX-62 at 65 nm will be 115 mm^2 large. It's actually 230*(90/65)^2.
Every time manufacturers move to a new and better process, the shrunk die is about half as large.

12:47 AM, June 21, 2006  
Anonymous Anonymous said...

you should multiply by (65*65)/(90*90)
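Worked through as a small Python calculation, assuming an ideal shrink in which every feature scales with the process node (real dies do not shrink this cleanly, and the function name here is just illustrative):

```python
def scaled_area(area_mm2, old_nm, new_nm):
    # Area scales with the square of the linear feature size.
    return area_mm2 * (new_nm / old_nm) ** 2


def linear_scaled(area_mm2, old_nm, new_nm):
    # The linear factor used earlier in the thread, for comparison.
    return area_mm2 * (new_nm / old_nm)


fx62_at_65nm = scaled_area(230, 90, 65)
fx62_linear = linear_scaled(230, 90, 65)
```

With the squared factor, the 230 mm^2 die comes out at roughly 120 mm^2 at 65 nm, against about 166 mm^2 with the linear factor.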

2:50 AM, June 21, 2006  
Blogger Christian H. said...

There are some interesting comments on this subject, and obviously most people aren't familiar with how Windows schedules threads.

Time slices are used and a lot of programs have ASYNC calls that can be started on the other core.

Imagine the kernel starts up and goes through HW init. Do you think the same data is used for the network stack as for the video stack or the USB stack? Because it is said to be "on the fly", it is possible to have a load monitor determine the core to use.

In this case it may be C&Q. It is responsible for throttling the processor based on load.

Since the memory controller can be used to decide whether threads receive data from L2 or main memory, it will act as an arbiter for the OS, so the OS doesn't need to care where the thread is run.

That may be why the IMC sucked for so long, as they were learning how to do the thread balancing.


It is a difficult task to say the least, but I believe it is more than possible and will add 20-30% to single threaded apps.


Imagine this scenario, you have Word, outlook, FireFox, and lots of Systray apps open.

MS turns this on. Now whenever a new thread is needed the XBar uses a simple physics phenomenon - energy flows from areas of greater concentration to areas of lesser concentration - to direct the thread to the least used core.

Word is on proc1, Outlook is on proc2, Firefox is on proc2, systray is on proc1.

Word starts a spell check, but proc1 is busy with a virus update, so proc2 is used here, and when proc1 is finished it does a grammar check. Firefox is looping Flash and scrolling. The Flash is routed to proc1 and only the position needs updating, which can be done from main memory instead of L2.

The low latency of HTX means that this doesn't slow down the proc much while waiting. (Athlon already does this: it doesn't share L2 between cores, so each core can have its own data when being used on "unique" data and use DDR2 when using shared data.)

As the time slices are used by the Scheduler, more execution can happen for the same app.

it is complex and a diagram would help, but the gist of it is that this WILL work.
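The "least used core" rule in the scenario above can be sketched as a simple greedy placement in Python. The load numbers are made up for illustration, and this models an OS-style dispatcher, not anything known about the crossbar.

```python
def pick_core(loads):
    # Choose the core with the lowest current load (lowest index on ties).
    return min(range(len(loads)), key=lambda i: loads[i])


def dispatch(tasks, n_cores=2):
    # tasks: list of (name, estimated load); the loads are hypothetical.
    loads = [0] * n_cores
    placement = {}
    for name, cost in tasks:
        core = pick_core(loads)
        placement[name] = core
        loads[core] += cost
    return placement, loads


placement, loads = dispatch(
    [("word", 3), ("outlook", 2), ("firefox", 2), ("systray", 1)]
)
```

With these assumed loads, Word and the systray apps land on core 0 and Outlook and Firefox on core 1, matching the split described in the comment.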

9:45 AM, June 21, 2006  
Blogger Sharikou, Ph. D. said...

picture which is added to blog, refers to thread fork/join, also posix terms, to create/terminate thread.


AMD borrowed the words "threads", "fork" etc, which really have ZERO to do with the ordinary notion of threads. You have to understand that processes and threads are high level software concepts; a CPU has no clue about their existence.

AMD should have just said a "stream of instructions" instead of "threads" in the patent.

But the idea is clear: it's about chopping a stream of instructions into multiple segments and run them concurrently on different processor cores. This is done correctly in hardware and completely transparent to software. So you can have the kernel code running on two different cores and appear to be a single thread...

10:07 AM, June 21, 2006  
Anonymous Anonymous said...

"But the idea is clear: it's about chopping a stream of instructions into multiple segments and run them concurrently on different processor cores. This is done correctly in hardware and completely transparent to software. So you can have the kernel code running on two different cores and appear to be a single thread..."

ok. maybe you are right. what kind of os support is needed then for this? article talks about some "os patch". why is it needed at all with this scheme?

also patent clearly talks about hardware replacement of os's software scheduler. os software scheduler operates with high level thread concept.

1:35 PM, June 21, 2006  
Blogger Sharikou, Ph. D. said...

ok. maybe you are right. what kind of os support is needed then for this? article talks about some "os patch". why is it needed at all with this scheme?


This should be completely transparent to the OS and applications. All one needs is to turn on a flag in the CPU when the system boots.

1:59 PM, June 21, 2006  
Anonymous Anonymous said...

You have to understand that processes, threads are high level software concepts, a CPU has no clue about their existence.

Sorry but CPU's were designed with the concept of tasking and threading!

Context switching is one of the most expensive operations in a multi-threaded environment, especially when moving to kernel space. The CPU has a built in mechanism which 'can' be used in order to do this automatically if the OS wishes to delegate.

Also, CPU 'ring-levels' were created in order to allow code threads to have various privileges. This is where the notion of 'kernel' vs 'user' space comes in.

There are many more concepts developed in the CPU in order to allow tasking and threading.

Anyway, so far no one has been able to explain how this could be done without any compiler or OS support. It's nice to say it can be done but can someone at least explain a 'logical' how?

6:03 PM, June 21, 2006  
Blogger Sharikou, Ph. D. said...

Context switching is one of the most expensive operations in a multi-threaded environment, especially when moving to kernel space. The CPU has a built-in mechanism which 'can' be used in order to do this automatically if the OS wishes to delegate.

No. There is no such built-in mechanism to facilitate context switching of threads/processes. Every OS is different. Again, a CPU has no clue of processes and threads.

As I said, the so-called inverse threading is just expanding on the superscalar concept: running code in parallel but achieving the same serialized result. This is done transparently inside the CPU(s); it has nothing to do with software threads (kernel or user).
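
To make "run in parallel but get the same serialized result" concrete, here is a minimal software sketch. It is only an analogy and not what the patent or any AMD hardware does: the names `TrackedState`, `run_sequential` and `run_speculative` are made up for illustration, and real hardware would track registers and memory rather than a Python dict. Each segment runs speculatively against a snapshot, and results are committed in program order; a segment that read something an earlier segment wrote is squashed and re-run.

```python
from concurrent.futures import ThreadPoolExecutor


class TrackedState:
    """View of a snapshot that records which names a segment reads and writes."""

    def __init__(self, snapshot):
        self._snapshot = snapshot
        self.reads = set()
        self.writes = {}

    def get(self, name):
        if name in self.writes:          # a segment sees its own earlier writes
            return self.writes[name]
        self.reads.add(name)
        return self._snapshot[name]      # assumes every name exists in the initial state

    def set(self, name, value):
        self.writes[name] = value


def run_sequential(segments, initial):
    """Reference behaviour: run the segments one after another."""
    state = dict(initial)
    for seg in segments:
        view = TrackedState(state)
        seg(view)
        state.update(view.writes)
    return state


def run_speculative(segments, initial, workers=2):
    """Run all segments in parallel on a snapshot, then commit in program order."""
    snapshot = dict(initial)

    def attempt(seg):
        view = TrackedState(snapshot)
        seg(view)
        return view

    with ThreadPoolExecutor(max_workers=workers) as pool:
        attempts = list(pool.map(attempt, segments))

    state = dict(initial)
    dirty = set()        # names written by segments that have already committed
    squashed = 0
    for seg, view in zip(segments, attempts):
        if view.reads & dirty:
            # The speculative run used stale inputs: squash it and re-execute.
            squashed += 1
            view = TrackedState(state)
            seg(view)
        state.update(view.writes)
        dirty.update(view.writes)
    return state, squashed


# The f1/f2 example from the post: f1 and f2 are independent, the last step is not.
def f1(s): s.set("v1", s.get("a") + 1)
def f2(s): s.set("v2", s.get("b") * 2)
def use(s): s.set("out", s.get("v1") + s.get("v2"))

initial = {"a": 1, "b": 5, "v1": 0, "v2": 0, "out": 0}
final, squashed = run_speculative([f1, f2, use], initial)
print(final["out"], squashed)
```

In this toy, `f1` and `f2` commit straight from their speculative runs, while the last step has to be redone because it read `v1` and `v2` before they were written. The squash-and-redo step is where the wasted work and extra power mentioned elsewhere in this thread would come from.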

8:46 PM, June 21, 2006  
Blogger Christian H. said...

ok. maybe you are right. what kind of os support is needed then for this? article talks about some "os patch". why is it needed at all with this scheme?



From the patent PDF floating around, it looks like the patch would speed up the "time slices" for preemptive multitasking.

That would allow the kernel to update faster - of course in the presence of the REAL AMD64.

9:41 PM, June 21, 2006  
Blogger hyc said...

Certainly there's no way to use this feature without OS support. Initially the OS thinks there are two independent cores, that can have processes and threads scheduled on them independently. Somehow you have to tell the OS that now it should only be scheduling for a single combined core. I get the feeling this is a big toggle switch and once you flip the switch the CPU acts like a single core until you reset it again. Otherwise, how does the CPU decide when it should flip between dual and single mode? I guess if the load monitor recognizes a single thread eating 100% CPU that might be a good time to switch.

On other fronts, I would not expect it to do function-level parallelism, that's something that compilers do. On the assumption that it's working at individual instructions, or basic blocks, it might make sense to unify the caches in both cores. They could do this the half-assed (half-cached?) way, and just broadcast all cache loads to both caches at once, or they might have a smarter way to make the two separate caches act as a single larger cache.

Interesting...

12:51 AM, June 22, 2006  
Blogger Sharikou, Ph. D. said...

Somehow you have to tell the OS that now it should only be scheduling for a single combined core.

The whole thing is transparent to software--even the kernel code itself may be split into pieces and run speculatively on different cores. This will boost OS performance. Normally, the kernel only runs on one core/proc; now it can be inverse-threaded.

12:56 AM, June 22, 2006  
Anonymous Anonymous said...

"No. There is no such built-in mechanism to facilitate context switching of threads/processes. Every OS is different. Again, a CPU has no clue of processes and threads."

Sharikou... you are wrong here. The CPU does have a concept of, and support for, threads and processes; some of the mechanisms are universal, some are very microarchitecture-dependent.

For example, almost all modern CPUs have a TLB, which essentially is a table for memory mapping of the processes; their L1 caches are usually virtually indexed (whose address is process-dependent). IIRC, K8 even has logic to prevent unnecessary TLB flushing upon context switch.

K8 core has only one IP and was designed for single-threaded apps; multi-threaded programs work in time-sharing arranged by software. P4 and EV8 (Alpha), on the other hand, are designed for SMT, where multiple valid IPs work at the same time.

Back to the patent. First, it's one thing to file a patent, but totally another to make it work efficiently. Although "ideas" were initially not patentable, the US patents have had too many useless ideas in them, most of which only become financially useful after somebody else found a way to make them practical.

Second, this patent isn't "inverse threading," it's speculative threading. The spawned thread may not be useful, it may be faster or slower than the main thread, and it may not commit. For one, this speculation will make power usage worse; two, this may hurt performance for some apps (overhead, spawned thread too slow, etc). How to efficiently spawn a side-kick thread in real time has been a topic in academia for years.

I believe even if AMD had something like this in AM2, it is disabled for good reason: it's neither efficient nor advantageous for the majority of applications.

11:09 AM, June 22, 2006  
Blogger Sharikou, Ph. D. said...

For example, almost all modern CPUs have a TLB, which essentially is a table for memory mapping of the processes; their L1 caches are usually virtually indexed (whose address is process-dependent). IIRC, K8 even has logic to prevent unnecessary TLB flushing upon context switch.

You are getting confused about hardware and software here. The CPU provides paging via the TLB, which facilitates mapping from virtual to physical addresses. There is no concept of processes or threads at this level. The hardware may provide some mechanism to help better utilize the TLB. But it's up to the software to maintain and switch page tables and register contexts. The hardware has no clue of the higher-level stuff the software is trying to do. It's just like at the assembly code level, where you can't see object-oriented programming.

If you read the Linux kernel source code, you will understand more about this.
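
A toy model of that division of labor, as a rough sketch only: `ToyMMU` and its methods are invented for illustration and greatly simplify real paging (single-level table, no permissions, no global pages or address-space tags). The "hardware" side only caches translations; deciding which page table is active is left entirely to "software".

```python
PAGE_SIZE = 4096


class ToyMMU:
    """Hardware side: translates addresses and caches translations in a TLB.

    It holds whichever page table software last loaded and has no notion of
    which process or thread that table belongs to.
    """

    def __init__(self):
        self.page_table = {}   # virtual page number -> physical frame number
        self.tlb = {}          # cached subset of page_table
        self.misses = 0

    def load_page_table(self, page_table):
        # Roughly analogous to the OS writing CR3 on x86: the old cached
        # translations are dropped because they may no longer be valid.
        self.page_table = page_table
        self.tlb.clear()

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.tlb:
            self.misses += 1                       # TLB miss: walk the page table
            self.tlb[vpn] = self.page_table[vpn]
        return self.tlb[vpn] * PAGE_SIZE + offset


# "Software" side: the OS keeps one page table per process and switches them.
table_a = {0: 7}   # process A: virtual page 0 -> physical frame 7
table_b = {0: 3}   # process B: virtual page 0 -> physical frame 3

mmu = ToyMMU()
mmu.load_page_table(table_a)
addr_a1 = mmu.translate(0x10)   # miss, then cached
addr_a2 = mmu.translate(0x10)   # hit
mmu.load_page_table(table_b)    # the "context switch" is just a new table
addr_b = mmu.translate(0x10)    # same virtual address, different frame
print(addr_a1, addr_a2, addr_b, mmu.misses)
```

The same virtual address lands in different physical frames before and after the switch, yet nothing in `ToyMMU` ever mentions a process. Logic that avoids unnecessary flushes, like the K8 feature mentioned above, would amount to not clearing the whole `tlb` on every switch.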

11:35 AM, June 22, 2006  
Blogger Sharikou, Ph. D. said...

P4 and EV8 (Alpha), on the other hand, are designed for SMT, where multiple valid IPs work at the same time.

So-called SMT is just to present multiple processors to the software when in fact there is only one. The CPU has no idea about threads. You have to have kernel threads to take advantage of SMT. The kernel will schedule the threads to run on multiple virtual processors (when there is only one physical). The CPU is not doing the threading as you understand it from thread programming. The CPU is simply running two streams of instructions instead of one.

11:42 AM, June 22, 2006  
Anonymous Anonymous said...

"The CPU provides paging via TLB, which facilitates mapping from virtual to physical address. There is no concept of processes or threads at this level."

This is too much of a stretch. You're basically defining what "concept" is, and really the CPU has no concept of anything! It's just performing bit operations.

But that's not the point here. The point is that the CPU has to support processes and (in the case of SMT) threads; and such support is what I meant by "concept," i.e., the CPU knows the difference between one process/thread and another (different address space, etc.). To speculatively spawn a side-kick thread in another core, the CPU has to maintain the correct flow of multiple program counters (threads). If there are two IPs, two load-store buffers, two register files, and two commit queues, as is the case in dual-core or SMT, the CPU has to make sure these two sets of resources don't race each other, regardless of whether the circuits have any "high-level concept" or not.

4:47 PM, June 22, 2006  
Blogger Sharikou, Ph. D. said...

To speculatively spawn a side-kick thread in another core, the CPU has to maintain the correct flow of multiple program counters (threads).

First let's define THREAD as the THREAD we normally talk about: a kernel-level or user-level object representing an execution context and path within one memory space.

What I am saying is simple: what AMD talked about in this patent has nothing to do with the THREAD defined above. The patent is about chopping up an arbitrary single stream of code into segments and trying to run them speculatively on other cores, completely transparent to software. From software's point of view, it sees the same result as if the code were run sequentially.

5:51 PM, June 22, 2006  
Blogger "Mad Mod" Mike said...

Intel fanboys are trying to say that a company as large and knowledgeable as AMD cannot do something like this...are you f*cking kidding me? I don't care how good any of you think you are at programming or how much you know, because until the day you make a CPU as complex as a MPU or design an OS as advanced as Linux or Windows from the ground up, you should just STFU and go back to your C.

6:04 PM, June 22, 2006  
Anonymous Anonymous said...

"Intel fanboys are trying to say that a company as large and knowledgeable as AMD cannot do something like this..."

To quote the words of John Hennessy: "During the 20 years of microprocessor development, there hasn't been one idea that didn't come out of academic research."

It's one thing to file a patent, but totally another to make it practical for real apps. So no, I don't think AMD (or Intel) could do something better than what academia has figured out how to do efficiently. If you have a problem with that, go tell Prof. Hennessy that he's an Intel fanboy.

You're just acting like a lunatic by regarding everyone with a different opinion as an Intel fanboy. At some point, your AMD-everything is just as pathetic as the belief of those true Intel fanboys.

11:42 PM, June 22, 2006  
Anonymous Anonymous said...

http://www.theinquirer.net/?article=32594

Article on Reverse Hyperthreading... looks like AMD does have some tricks up their sleeves.

8:03 AM, June 23, 2006  
Anonymous Anonymous said...

http://www.xtremesystems.org/forums/showthread.php?t=104178

well, i guess intel wasn't sitting on their butts doing nothing.

Interesting approach, but we'll see how it really turns out when it's released officially

12:58 PM, June 23, 2006  
Anonymous Anonymous said...

one phrase says it all: "Out Of Order Execution"

9:31 PM, June 23, 2006  
Anonymous Anonymous said...

So much noise about pretty much nothing. The Inquirer article made it very clear. Inverse threading is a way to have a six-wide (2 times 3) issue for single threaded applications. We'll have to wait and see from the benchmarks how often six instructions can be issued in one cycle. BTW, that's the issue width of Itanium but serious compiler-based optimizations are required. And, even so, most of the time there are empty slots.
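
A back-of-the-envelope way to see why a wider issue only helps when the code has enough independent instructions. This is a hypothetical toy model, not a simulation of K8 or Itanium: `cycles_needed` is an invented helper that assumes one-cycle latency, an unlimited scheduling window, and perfect register renaming.

```python
def cycles_needed(instrs, width):
    """Count cycles to issue `instrs` with at most `width` instructions per cycle.

    instrs: list of (dest, sources) tuples in program order.
    An instruction can issue once all of its sources have been produced.
    """
    ready_at = {}            # register -> first cycle in which its value is usable
    issued_per_cycle = {}    # cycle -> number of instructions issued in it
    last = 0
    for dest, sources in instrs:
        # Earliest cycle allowed by data dependencies (unknown sources are ready at 0).
        cycle = max((ready_at.get(s, 0) for s in sources), default=0)
        # Slide forward until a cycle with a free issue slot is found.
        while issued_per_cycle.get(cycle, 0) >= width:
            cycle += 1
        issued_per_cycle[cycle] = issued_per_cycle.get(cycle, 0) + 1
        ready_at[dest] = cycle + 1
        last = max(last, cycle + 1)
    return last


# Six instructions that each depend on the previous one.
chain = [("r%d" % (i + 1), ["r%d" % i]) for i in range(6)]
# Six instructions that only depend on r0, which is already available.
independent = [("r%d" % (i + 1), ["r0"]) for i in range(6)]

for name, prog in (("chain", chain), ("independent", independent)):
    print(name, cycles_needed(prog, 3), cycles_needed(prog, 6))
```

In this model the dependent chain takes six cycles at either width, while the independent instructions go from two cycles to one when the width doubles from three to six. Real code sits somewhere in between, which is why the benchmarks will matter.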

What I'm wondering is why the "enthusiasts" think this is so hot. I thought their mouths are watering over the 4x4 with its bonanza of threads for a mind-numbing game experience. Interesting how AMD is stretching in all directions to stay competitive. Not that it's bad...

10:28 AM, June 24, 2006  
Anonymous Anonymous said...

"http://www.xtremesystems.org/forums/showthread.php?t=104178

well, i guess intel wasn't sitting on their butts doing nothing.

Interesting approach, but we'll see how it really turns out when it's released officially"

Maybe a bit weird but Intel may already have it ON. Even the THG forumz dudes think that. (that would explain why conroe frags am2 and woodcrest doesn't frag optys) Well let's wait and see.

5:44 PM, June 24, 2006  
