1024 by 2012!

Posted by: Robert Fischer on 2009-09-14 21:43:00.0

This will be a short blog post, but I just wanted to point out this article from El Reg on the future roadmap of the Sparc CPU. Especially check out the graphic associated with that article.

Here we have Sun/Oracle’s roadmap for the Sparc processor- and we see that by 2012, Cascade Falls will have 8 sockets, each socket having 16 cores, each core running 8 threads in parallel. Kittens, cats, sacks, wives, that’s 1024 threads going to St. Ives. Of course, this is less impressive when you consider that the existing Niagara line already runs 256 threads, and the changes are just 2x the number of cores per chip (or one generation of Moore’s law, if you care to look at it that way) and 2x the number of sockets supported.

And it’s not just Sun that’s in this market- Intel is there as well, with the Larrabee processor. They’re using the old P54C Pentium core as the basis for the CPU: superscalar (dual-issue) and pipelined, but no out-of-order execution (sound familiar yet?). Each chip will house 32 of these P54C cores, and each core will run 4 threads. That’s the same number of threads per chip (128) as Sun’s Cascade Falls chip. Put 8 of them in a server, and you’ve got yourself 1024 threads again.
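
Just to spell out the arithmetic (a quick Python sketch, using the socket, core, and thread counts from the two roadmaps above):

    # Back-of-the-envelope thread counts from the two roadmaps discussed above.
    sun_threads   = 8 * 16 * 8   # 8 sockets * 16 cores/socket * 8 threads/core
    intel_threads = 8 * 32 * 4   # 8 sockets * 32 cores/socket * 4 threads/core
    print(sun_threads, intel_threads)   # 1024 1024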

Massive parallelism is coming.

But what I find really interesting is the internal design of these chips, which comes up in this video. Basically, these are simple, in-order cores. Pipelined, yes, but that’s about it. These are Pentiums, not P-Pros, let alone Core 2 Duos. The assumption is that each thread will be blocking regularly on memory accesses. But rather than trying to work around the blocking with all sorts of tricks like out-of-order execution, the idea is to just have more threads. If one thread blocks, move on to another thread.
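
To make the "if one thread blocks, move on to another thread" idea concrete, here’s a toy Python sketch of a single-issue core that just rotates to the next ready thread whenever one stalls on memory. Purely illustrative- the miss probability and latency are the assumed figures I use below, not measurements of any real core:

    import random

    def simulate(threads, clocks=100_000, miss_rate=0.25 * 0.029, mem_latency=100):
        """Toy model of a single-issue core hiding memory latency by switching threads.

        Each clock, issue one instruction from the first thread that isn't stalled.
        An instruction goes all the way to memory with probability miss_rate
        (25% of instructions are loads * 2.9% of loads missing both caches) and
        stalls its thread for mem_latency clocks.  Illustrative only."""
        ready_at = [0] * threads   # clock at which each thread can issue again
        issued = 0
        for clock in range(clocks):
            for t in range(threads):
                if ready_at[t] <= clock:           # found a ready thread
                    issued += 1
                    if random.random() < miss_rate:
                        ready_at[t] = clock + mem_latency
                    break                          # only one issue slot per clock
        return issued / clocks

    print(simulate(threads=1))   # roughly 0.6 instructions/clock: idle during misses
    print(simulate(threads=8))   # close to 1.0: some thread is almost always ready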

This is important, as it overcomes the next wall we’re likely to hit: memory latency. So, just out of curiosity, I dug out my Hennessy & Patterson (2nd ed., I need to update) and looked up typical cache miss rates (page 384 in my copy- I’m using the data cache rates, for those playing along at home).

Assume an 8K L1 cache (64K divided up among 8 threads) and a 128K L2 cache (1M likewise divided among 8 threads). That means 89.8% of loads hit L1, another 7.3% hit L2, and 2.9% go out to memory. Also assume that 25% of instructions issue a read (which is about average), and that an L1 hit takes 2 clocks, an L2 hit takes 20, and memory takes 100 clock cycles (fairly average numbers for modern CPUs, I think). So every 100 instructions issue 25 loads, of which 22.45 hit the L1 cache, 1.83 hit the L2 cache, and 0.72 go out to memory. Assuming every instruction other than the loads takes 1 clock cycle to execute, the total time to execute those 100 instructions is 75 (non-load instructions) + 22.45*2 (L1) + 1.83*20 (L2) + 0.72*100 (memory) ≈ 228.5 clock cycles. The CPU is spending a little over half of its time blocked waiting for memory.
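
Here’s that calculation spelled out as a little Python script (same assumed miss rates and latencies as above):

    # The arithmetic above, spelled out.  All inputs are the assumed figures
    # from the text (H&P miss rates, guessed latencies), not measurements.
    instructions = 100
    loads   = 0.25  * instructions   # 25 loads per 100 instructions
    l1_hits = 0.898 * loads          # ~22.45 loads hit L1 (2 clocks each)
    l2_hits = 0.073 * loads          # ~1.83 loads hit L2 (20 clocks each)
    to_mem  = 0.029 * loads          # ~0.72 loads go to memory (100 clocks each)

    clocks = (instructions - loads) + l1_hits * 2 + l2_hits * 20 + to_mem * 100
    print(clocks)   # ~228.9 -- the ~228.5 above, give or take rounding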

Upping the latency of main memory from 100 clocks to, say, 500 clocks ups the time to execute those 100 instructions to about 516.5 clock cycles. But if you have 8 threads executing, on average you can execute 1.55 instructions per clock, from some thread or another.
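
The same arithmetic with 500-clock memory, plus the sum over 8 threads (again, just the assumed figures plugged in):

    # Same per-thread cost, but main memory now takes 500 clocks instead of 100.
    clocks_500 = 75 + 22.45 * 2 + 1.83 * 20 + 0.72 * 500
    print(clocks_500)                   # ~516.5 clocks to get through 100 instructions

    # With 8 threads on the core, the work available per clock, summed over threads:
    threads = 8
    print(threads * 100 / clocks_500)   # ~1.55 instructions per clock, from some thread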

Up until now, memory design has been a trade-off between throughput (bytes/sec moved) and latency (time it takes to satisfy a read). For those of you who remember the Rambus fiasco, the main technological difference (i.e. ignoring the fact that they were patent trolls) between Rambus and DDR was that Rambus optimized more for throughput, while DDR optimized more for latency. For situations where memory latency wasn’t a problem (like video encoding or 3D graphics rendering), Rambus won handily. But for everything else, DDR won handily. In a situation where a core is executing only a single thread, if that thread blocks waiting for a read to complete, then the core sits idle until that read completes. This is why memory latency was, and still is, so important in memory design. Having the core sit idle for 100 clocks is a heck of a lot better than having the core sit idle for 500 clocks.

What SMT (Simultaneous Multi-Threading, the trick of running more than one thread on a core, called Hyper-Threading by Intel) does is overcome the memory latency bottleneck, allowing memory subsystems to optimize for throughput. As we have seen, upping the memory latency from 100 to 500 clocks dropped our instructions per clock per core from about 3.5 to about 1.55. We’re not happy about it, but it’s survivable. Up the memory latency on a single-threaded core by the same amount, and performance would die. I don’t have solid numbers to play with, but the worst number I saw in the Rambus days (in the early days of the P4) was about 350 clocks of latency on memory accesses, which was a significant performance hit (something like 30-40%) on most common workloads.

And if memory latencies get worse, the multithreaded approach can offset this by going to more threads per core. If memory latencies hit, say, 1000 clock cycles, then you can simply double the number of threads each core is executing. Even halving the per-thread cache sizes as well (same amount of cache, just more threads sharing it, which ups the number of cache misses per load), with 16 threads you’ll still see an average of 1.43 instructions per clock to execute.
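
The general shape of that trade-off is easy to parameterize. A sketch using the same cost model as above; note that the hit rates I plug in for the halved caches are hypothetical placeholders, not the H&P figures the 1.43 number actually comes from:

    def offered_ipc(threads, mem_latency, l1_hit=0.898, l2_hit=0.073, to_mem=0.029,
                    load_frac=0.25, l1_clocks=2, l2_clocks=20):
        """Instructions per clock available from `threads` threads sharing one core,
        using the same cost model as above.  Every figure is an assumption."""
        loads = load_frac * 100
        clocks = (100 - loads) \
               + loads * l1_hit * l1_clocks \
               + loads * l2_hit * l2_clocks \
               + loads * to_mem * mem_latency
        return threads * 100 / clocks

    # The 8-thread, 500-clock case worked above:
    print(offered_ipc(8, 500))   # ~1.5

    # 16 threads, 1000-clock memory, per-thread caches halved: the hit rates here
    # are hypothetical stand-ins for the smaller caches, just to show the shape.
    print(offered_ipc(16, 1000, l1_hit=0.85, l2_hit=0.10, to_mem=0.05))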

You can’t play the same trick with instruction-level parallelism- the parallelism simply isn’t there. I’m reminded of this seminal paper from (good lord) almost 20 years ago, on the limits of instruction-level parallelism. Even with basically unlimited lookahead, there was only a limited amount of parallelism available. Having the CPU scan further ahead in the instruction stream simply means the CPU is examining more instructions it can’t execute yet either- at great cost in complexity. Modern x86 CPUs have a hard time sustaining execution rates above 1 instruction per clock (see, for example, this paper). The simple multithreaded approach is executing significantly more instructions per clock (summed over all threads) than the complex out-of-order approach- 1.55 instructions/clock/core vs. 1 instruction/clock/core, and that’s assuming the complex approach gets the “low latency” 100 clock cycle cost to access memory, while the multithreaded approach pays the 500 clock cycle cost.

Of course, applications have to be exceedingly multithreaded to take advantage of this. Currently, Sun is marketing this chip at web servers. The idea is that it’s easy to get 1024-way parallelism if you’re serving 1024 different HTTP connections at the same time. And, like Sun, Intel is positioning its chip at a specialized market, in this case GPUs. What these two markets have in common is that they are easy to parallelize, and they’re both "core markets" to their respective customer bases. This makes sense- go for the low-hanging fruit first. But as this approach becomes more mainstream, the pressure of the performance advantages it offers will require ever more programs to go heavily multithreaded.


Comments

  • September 15, 2009, Barry Kelly wrote: You can have this more or less today, running on Azul - 864 cores vs 128, not threads, but cores. This Sun approach, which pretty much amounts to 8x hyperthreading, isn't going to get you 8x performance - it's all about getting maximum performance out of the 128 cores you already have. In the extreme, with tightly optimized code with good branch prediction and good cache locality, and strides that can be predicted, you're best off with one thread per core. This Sun approach is more about maximizing throughput by leaving a lower level of scheduling up to CPU cores. To take advantage of it, you need to write your code so that it has up to 8x more parallelism than you would if you didn't suffer latency. That extra parallelism has its own costs, of course...
  • September 15, 2009, Alex Miller wrote: If you haven't seen it yet, check out Brian Goetz and Cliff Click's talk from JavaOne "Not Your Father's Von Neumann Machine". It lays this all out with great impact and much more detail and why it affects us now. Excellent talk. http://www.azulsystems.com/events/javaone_2009/session/2009_J1_JVMLang.pdf

Creative Commons License
This article was a post on the EnfranchisedMind blog. EnfranchisedMind Blog by Robert Fischer, Brian Hurt, and Other Authors is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.



About Robert Fischer

Robert Fischer is a multi-language open source developer currently specializing in Groovy and Grails. In the past, his specialties have been in Perl, Java, Ruby, and OCaml. In the future, his specialty will probably be F# or (preferably) a functional JVM language like Scala or Clojure.

Robert is the author of Grails Persistence in GORM and GSQL, a regular contributor to GroovyMag and JSMag, the founder of the JConch Java concurrency library, and the author/maintainer of Liquibase-DSL and the Autobase database migration plugin for Grails.