Multi-Core and Massively Parallel Processors

As software developers we have enjoyed a long trend of consistent performance improvement from processor technology. In fact, for the last 20 years processor performance has consistently doubled about every two years or so. What would happen in a world where these performance improvements suddenly slowed dramatically or even stopped? Could we continue to build bigger and heavier, feature-rich software? Would it be time to pack up our compilers and go home?

The truth is, single threaded performance improvement is likely to see a significant slowdown over the next one to three years. In some cases, single-thread performance may even drop. The long and sustained climb will slow dramatically. We call the cause behind this trend the CLIP level.

To overcome these challenges the industry is looking to multi-core and multithreaded processor designs to continue the performance improvement trend. These designs don't look to improve the performance of single threads of execution, but instead to run many and sometimes massive numbers of threads in parallel. Wait just a minute though. Is concurrent programming that easy? Hasn't it been tried before?

This article will dive deeper into the current issues challenging processor performance improvement and include a high-level overview of the key microprocessor players: Intel, AMD, Sun, and IBM. Finally, we'll take a deep dive into the challenges, opportunities, and technologies available to Java programmers to take advantage of concurrent programming to leverage these new processor technologies. If you're not programming in parallel today, you will be soon.

Multi-Core Mania
Increases in processor clock frequency are slowing and in many cases are being decreased to reduce power consumption. One trend continues though. The industry continues to shrink the size of transistors, doubling the number of transistors on a chip about every two years or so. In 2007 most major chip manufacturers will begin the shift from a 60nm to a 45nm process. This will yield transistors about 1/2000th the width of a human hair! To provide a relative perspective, a silicon atom itself is about 1/4nm. Obviously continuing to halve the size of transistors will also reach a limit in the not too distant future. But that's a topic for another paper.

So, how will the industry use this new transistor budget to improve processor performance? Techniques such as superscalar execution, pipelining, and speculative processing with branch prediction have added significant complexity to microprocessor designs, but have also been successful at improving performance. Unfortunately, the latency to memory on cache misses and the high frequency of branches in most workloads is proving to be a limiting factor. Building ever-larger caches is one way to mitigate the memory latency problem but as cache size exceeds common working set size, there are rapidly diminishing returns for investing transistors in cache memory.

Instead, the industry is moving toward multi-core, multithreading, and specialization. Instead of improving the performance of a single thread on a single core, the transistor budget is being used to add multiple cores to a single chip. Further, in many cases each core is capable of running multiple threads to hide memory latency. When one thread is blocked by a long latency event, such as a cache miss, the processor simply switches to another thread to execute. Also, many chip designs now include special-purpose processing units that make effective use of transistors for specific tasks such as cryptography.

Taking a closer look at the processors themselves, the IBM Power is distinguished as being the first to introduce multiple cores on a chip in the Power 4 design in 2001. IBM recently introduced the Power 6 processor, which combines two high-performance cores on a chip with each core supporting two-way multithreading. Besides providing multiple cores, the Power 6 also achieved an amazing 4.7GHz clock rate showing that IBM remains serious about single-thread performance while keeping pace with the industry on multi-core. As Power 6 is destined to be included in high-end servers, IBM has also focused heavily on RAS (reliability, availability, serviceability) and virtualization.

In the x86 architecture camp, rivals AMD and Intel have both recently introduced multi-core processors. In 2006, Intel introduced chips with two cores while chips with four cores, based on 45nm technology, shouldappear this year. As part of the move to multi-core, Intel removed support for its version of multithreading known as "hyperthreading," although multithreading is expected to return in future designs. Not to be outdone, AMD later this year, will introduce its first four-core chip known as Barcelona. Both Intel and AMD continue to focus on single-thread performance as well, each introducing new innovations in instruction-level parallelism and caches. One key difference in their designs is the memory bus architecture. Intel is continuing with its symmetric front-side bus architecture. AMD, on the other hand, has introduced a NUMA-based design based on the open Hypertransport technology in hopes of alleviating the memory bus bottleneck.

Sun has adopted a more radical departure in design from prior generations of SPARC. At the end of 2005, Sun released the UltraSPARC T1 or Niagara processor. Niagara includes up to eight cores, significantly more than competing server processors. Sun was able to squeeze eight cores on the chip by shifting focus away from the best achievable single-thread performance toward high chip-level throughput. Niagara cores run at a relatively low clock rate and don't support out-of-order processing, branch prediction, or many other common ILP optimizations. Instead they depend on four-way multithreading to tolerate long waits for memory. The goal is to achieve high overall throughput through application concurrency. However, applications with lower concurrency may run significantly slower on Niagara relative to the other processors described here.

At this point, all of the key players are producing chips with multiple cores but diverging in core design, memory nest, and other important aspects. The key to success for processor designers over the next few years will be in the innovative use of their transistor budget. Architects will make strategic tradeoffs between single-thread performance, massive concurrency, cache sizes, power consumption, and specialized processing units. The companies that make tradeoffs in the most innovative ways to meet the demands of the market should emerge as the winners.

Parallel Programming
As a developer, it will be important for you to learn the skills necessary to develop applications that can run with high performance on these increasingly parallel processors. Since single-thread performance isn't likely to improve at historical rates, the developer will have to look to concurrency to improve performance for a given task. The goal of parallel programming is to reduce the time of a task by dividing it into a set of subtasks that can be processed concurrently. While this may seem simple enough, experience shows that developing correct and effective parallel programs is surprisingly difficult. To utilize parallelism in hardware effectively, software tasks must be decomposed into subtasks, code must be written to coordinate the subtasks and work must be balanced as much as possible. Still sound easy? Read on.

As you get started with parallel programming, the first rule to become familiar with is Amdahl's Law. Amdahl's Law says that speeding up your program is limited by the part that's not running in parallel. For example, if a profile reveals that 20% of the time is spent in code that can only run sequentially on one processor, then the best speed increase you can possibly get, even with perfect parallelization of the rest of your program is 5x, no matter how many processors you throw at it. Load imbalance is a similar problem. If you've divided your code into N subtasks, the time taken to execute them is not 1/N. Rather the time taken is the maximum of the execution times of the subtasks.

If getting your code divided into subtasks and ensuring that work is well balanced sounded hard, then let us introduce you to the coordination problem. Unfortunately, very few programs can be parallelized so simply. The reason is that those subtasks are likely to want to operate on the same data and some of the subtasks may have to wait for others to do their thing before proceeding. It's okay if two subtasks want to read the same memory location in parallel, but if one of them wants to write to the location, you've got trouble because you can't predict which subtask will get to it first.

For example, operations to insert and remove objects from a linked list must be executed so that updates to the data structures happen sequentially and don't corrupt each other. An incorrect ordering of accesses to a memory location is called a data race and it can be one of the most difficult bugs to find because your code might behave differently on each run and might even change once you decide to start debugging. To deal with this problem, most programming environments include mechanisms to ensure that a subtask has exclusive access to specific memory, commonly called locks. Unfortunately locks bring their own unique problems when multiple subtasks compete for access and, if used indiscriminately, can reverse all of your hard work in parallelizing your code by making subtasks wait too often or too long for exclusive access to shared memory.


Parallel Programming in Java
Fortunately for Java programmers, the language was designed from the beginning with concurrency in mind. Java includes support for threads that can be used to run parts of your code in parallel and "monitors," which are special kinds of locks acquired using the synchronized keyword. The java.util.concurrent package also makes managing threads much easier, provides a fast HashMap for parallel programs and "blocking queues" that can be used to pass messages efficiently between threads. If you're a J2EE developer, you're even more fortunate because J2EE application servers such as IBM WebSphere automatically manage parallelism for you. For example, multiple simultaneous requests to a Web site can be processed in parallel with multiple threads managed by the application server and running on multiple cores.

Creating a new thread in Java is as simple as extending the java.lang.Thread class and overriding the run method. Another approach is to instantiate a new instance of the Thread class providing an object that implements Runnable. See Listing 1 for a simple example of creating and running threads.

The java.util.concurrent package provides an alternate way to run code in parallel that takes away some of the burden of managing threads directly. Listing 2 shows an example making use of the ExecutorService API. While the code appears somewhat more complex than the first example, one key difference is that the number of threads executing isn't hard-coded into the application, only the amount of work done in each "task."

You may notice that the two samples we've looked at so far generate output that is a random interleaving of the words "tic" and "toc" and that the interleaving changes on each execution. That happens because the threads execute essentially without regard to what's happening in other threads1. Now let's look at how Java helps you coordinate multiple concurrently executing threads. The primary mechanism used to coordinate access to shared data in Java is a monitor. In object-oriented programming, the class is a natural protection boundary for private instance data. So in Java, every object is assigned a unique monitor. Methods declared using the synchronized keyword automatically enter the monitor as they get called and exit the monitor on returning. Only one thread can be inside a monitor at any one time, which means that if instance data is only accessed inside synchronized methods, then a data race can't occur. Listing 3 shows an example where multiple threads are updating a common counter value using a monitor to ensure that there's no data race between any two threads.

In some cases, use of monitors can incur too much overhead and simpler alternatives would suffice. For example, if hundreds of threads are involved in the counter example in Listing 3, performance can be dominated by the time taken to enter and exit the monitor rather than doing useful work. The java.util.concurrent.atomic package provides a few lightweight alternatives to monitors for safely updating shared locations in such busy situations. For example, in Listing 4, an AtomicInteger object is used to safely increment a shared counter without using synchronization.

In some cases, threads will want to wait (or block) for a particular condition before proceeding. For example, a thread operating on data in a stack will need to wait for another thread to add an entry when the stack is empty. One way to do that would be to have the thread repeatedly check the stack size. Java provides an easier and more efficient way to do this, however. The consuming thread can call the wait method on the object and, when another thread adds an entry, it can call the notify method which will wake up an arbitrary waiting thread. Listing 5 shows the use of a monitor and the use of wait and notify to operate on a simple stack of integers.

Java offers many more useful features to help you in your parallel programming tasks. We encourage you to explore them and learn more about parallel programming in Java through the excellent resources listed at the end of this article.

What Does the Future Hold?
Even with all of this native support in Java, parallel programming can still be very difficult. First of all, Java provides the means to parallelize an application but does nothing to help you design a parallel program in the first place. Furthermore, even when you have a good parallel design, there are many challenges in achieving good performance and avoiding concurrency bugs. One common problem is for synchronization to cause an excessive number of threads to block. In the worst case, all threads can block leading to deadlock. Another more devastating problem is a race condition that can lead to data corruption. These kinds of problems can bring the entire application down with a memory fault or worse can intermittently produce incorrect results. These and other similar problems often go undetected in development environments, showing up for the first time when under stress in a production environment. Further, these problems can be difficult to diagnose and debug, leading to delays in providing fixes. While there are some good tools available to help with problem determination, the debugging problem remains vexing and standard techniques such as adding print statements may actually make the problem disappear or move!

X10
Of interest to Java developers is the recent announcement by IBM and several academic researchers of a new programming language called X10. X10 is a set of extensions to Java providing higher-level constructs specifically for parallel application development. The X10 programming environment is now an open source project at http://x10.sourceforge.net/.

The goals of X10 include managing both concurrency and the distribution of data and providing constructs to greatly simplify the task of concurrent programming. A central concept in X10 is the notion of a place. A place is an abstraction for a collection of related data and activities that operate on that data. A computation may have many places. Places serve as units of distribution - for instance, different places may be located at different nodes of a cluster. An object is created in one place and lives in that place throughout its lifetime. However, all places in a computation are part of the same address space. That is, an object located in one place may contain references to objects located at another place.

Objects are operated on by activities. Activities are much like threads in Java, except that they may be very lightweight - for instance, an activity may execute only a few instructions in its lifetime. An activity may read and write variables, invoke methods, execute control statements, catch and throw exceptions - in short, perform the actions that any Java thread can perform. X10 makes it very easy for a programmer to write code that creates a new activity: the statement async S specifies that the statement S is to be executed in its own separate task, which executes in parallel. Listing 6 shows that achieving the parallel Java tasks shown in Listing 1 is quite simple in X10.

Along with spawning activities, X10 supports the notion of joining activities, that is, determining when a collection of activities has terminated. The statement finish S specifies that statement S is to be executed and, if during the execution of S any activities are created, these activities must terminate before any following statement begins executing. Thus a programmer may use finish to specify an order on activities.

Unlike Java, X10 doesn't support locks. Instead, X10 provides a very simple construct for the programmer to specify atomicity of execution. The statement atomic S is executed as if in a single step (with all other activities frozen). Listing 7 shows that achieving the atomic increment shown in Listing 3 is also easy in X10.

The wait/notify behavior shown in Listing 5 is accomplished in X10 using the simple keyword when. Listing 8 shows the same simple integer stack implemented in X10. Notice that the pop() method uses when to cause the thread to wait for a specific condition. The notify is implicit in the action of the push and doesn't require explicit coding by the programmer.

We have only scratched the surface of X10 features for concurrent programming. More thorough and complex examples of parallel programming in X10 are provided at http://x10.sourceforge.net/.

Summary
While performance improvement for single threads may slow significantly over the coming years, processors will provide significantly expanding concurrency. For software performance to continue to improve, developers must begin thinking and coding in parallel. Fortunately, the Java programming language was designed for concurrency. Java provides all of the basic constructs for threading and locking. However, parallel programming is still very difficult due to inherent complexities such as dividing sequential tasks into balanced subtasks and avoiding race conditions and deadlocks. IBM has recently introduced a new programming language called X10 that runs on top of Java and significantly simplifies many of the tasks of parallel programming.

Resources

  1. www.ibm.com/developerworks/power/newto/
  2. www.sun.com/processors/niagara/
  3. www.ibm.com/power
  4. www.intel.com/multi-core/index.htm
  5. http://multicore.amd.com
  6. Brian Goetz. "Java Concurrency in Practice." www.briangoetz.com/pubs.html
  7. Brian Goetz series of articles on developerWorks. www.ibm.com/developerworks/java/library/j-jtpcol.html
  8. www.oreilly.com/catalog/jthreads3/
  9. X10 http://x10.sourceforge.net/
© 2008 SYS-CON Media