I recently attended an MIT CSAIL virtual presentation titled “How Can Programmers Conquer the Multicore Menace?” Since this is such an important question for HPC and other users, I thought I’d summarize what the three speakers said. Each spoke for about 10 minutes, giving a brief overview of their thinking in this area. An HPC dim sum of sorts.
Saman Amarasinghe opened the session with an overview of the problem: processors are getting wider, with more cores, rather than faster in clock speed as they had for the past 30 years. This creates a large problem for programmers, who are generally not highly skilled in parallel programming.
He then focused the remainder of his comments on streaming applications and their ability to use multicore processors efficiently, and specifically on StreamIt, a language and compilation infrastructure for streaming applications. Typical streaming applications have a flow of data that…well…that streams through a set of processing steps; video and image processing are typical examples. Such applications conceptually have a lot of parallelism, and StreamIt is designed to help programmers express and extract it by (graphically) exposing the data flow. The parallelism in these applications is often both task parallel and data parallel, and the challenge in either case is to map it efficiently onto multicore systems.
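The shape of such a streaming application can be sketched in plain C++ (StreamIt has its own language syntax; this is only an illustration of the structure, and the stage names are hypothetical). Because each stage is a stateless transformation on a single item, consecutive items can be processed in data-parallel fashion, and the stages themselves can be pipelined across cores:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stages of an image-processing stream. Each stage is a
// stateless function of one data item, so a streaming compiler is free
// to run different items on different cores (data parallelism) or to
// place each stage on its own core (pipeline/task parallelism).
int brighten(int pixel) { return pixel + 16; }
int threshold(int pixel) { return pixel > 128 ? 255 : 0; }

// The serial composition of the pipeline; a streaming compiler would
// map this same dataflow graph onto multiple cores.
std::vector<int> run_pipeline(std::vector<int> stream) {
    for (int &px : stream) px = threshold(brighten(px));
    return stream;
}
```

Because every item flows through the same stages independently, the result is identical no matter how the compiler partitions the stream across cores, which is what makes the high-level abstraction safe to parallelize automatically.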
Saman closed by showing some results running on the MIT RAW processor and on Tilera’s TILE64 processor, which is based on RAW. Both processors showed excellent scaling for a wide range of streaming applications (unfortunately, the slides went past too quickly for me to capture any real details). Speedups on the 16-core RAW processor were generally in the range of 12x to 19x. Achieving that from high-level streaming abstractions rather than low-level hand coding is a potentially compelling value proposition, especially for programmers without much parallel programming experience.
Armando Solar-Lezama spoke next about a new approach he is working on that uses automated reasoning and large amounts of computing power to tackle challenging programming problems, including the issue of extracting parallelism for multicore processors.
His approach basically involves writing a Sketch of a program rather than writing optimized code directly. These sketches include hand-written code fragments that illustrate parts of the problem and its potential solution. Once a sketch has been written, a test is also written to validate whether the programs generated by the synthesizer actually produce correct results. Candidate programs are then generated automatically, combining elements from the sketch in various ways in an attempt to find correct and optimal solutions.
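A toy illustration of the idea, far simpler than what the real Sketch synthesizer does (it uses constraint solving, not brute force, and all names below are hypothetical): the sketch leaves a "hole," here a single unknown constant, the hand-written test defines correctness, and the synthesizer searches for a hole value that makes the test pass:

```cpp
#include <cassert>
#include <functional>
#include <optional>

// Toy "synthesizer": given a test that judges a candidate value for
// the sketch's hole, enumerate candidates and return the first one
// that satisfies the test, or nullopt if none in range does.
std::optional<int> synthesize(const std::function<bool(int)> &test,
                              int lo, int hi) {
    for (int hole = lo; hole <= hi; ++hole)
        if (test(hole)) return hole;
    return std::nullopt;
}
```

For example, a sketch `f(x) = x * ??` with the test "f(2) == 6 and f(5) == 15" drives the search to fill the hole with 3. The real system explores a vastly larger combinatorial space of program fragments, which is where the large amounts of computing power come in.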
The intent isn’t to replace programmers, but to augment them by helping with the hard algorithmic pieces. It’s a pretty funky idea that is just starting to be explored. It isn’t clear how well this will be able to address the multicore issue in the near term or how this automatic program generation approach will scale as problems get larger and the combinatorics explode.
Charles Leiserson spoke last about Cilk, a language for multithreaded programming that was developed at MIT over many years. It was spun out of MIT as Cilk Arts, Inc., which was recently acquired by Intel. Intel has been on a buying spree of late, looking for promising technologies to help programmers make efficient use of multicore processors.
Cilk extends C (Cilk++ extends C++) by adding a few keywords, perhaps the most important being cilk_spawn and cilk_sync. A spawned child may execute in parallel with its parent, and a cilk_sync makes the parent wait until all of its spawned children have returned. These rules grant permission for parallel execution; they do not command it. This relates to another very important property of Cilk programs: the serialization of a Cilk code is always a legal interpretation of the program’s semantics. In other words, if you ignore the Cilk keywords, the program will execute correctly in serial mode. This is similar to the directive-based approach taken by OpenMP. This trait is important because it allows codes to be incrementally parallelized, which is especially useful when parallelizing existing applications.
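The serialization property can be demonstrated in plain C++ by defining the two keywords away (a sketch of the idea only; a real Cilk++ compiler implements the keywords rather than macro-eliding them):

```cpp
#include <cassert>

// Serial elision: with the Cilk keywords defined as empty macros, the
// program below is ordinary C++ and computes the same result the
// parallel version would. Under a real Cilk++ compiler, cilk_spawn
// lets the child run in parallel and cilk_sync waits for children.
#define cilk_spawn
#define cilk_sync

int fib(int n) {
    if (n < 2) return n;
    int x = cilk_spawn fib(n - 1);  // child may run in parallel
    int y = fib(n - 2);             // parent continues meanwhile
    cilk_sync;                      // wait for spawned children
    return x + y;
}
```

That the elided program is always a correct execution is exactly what makes incremental parallelization safe: sprinkling in the keywords never changes what the program computes, only how much of it may run concurrently.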
The load balancer is another key piece of the Cilk technology; it ensures compute elements are kept busy as the degree of parallelism fluctuates during a program run. While Cilk deals well with regular problems, it excels at handling irregular problems, where the work-stealing scheduler really shines.
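The deque discipline behind work stealing can be sketched as follows (a simplified, single-threaded illustration; the real Cilk scheduler runs this concurrently, one deque per worker thread, with a carefully engineered protocol):

```cpp
#include <cassert>
#include <deque>

// Each worker keeps its own deque of pending tasks (tasks are just
// ints here). The owner pushes and pops at the bottom, keeping hot,
// recently created work local. An idle worker steals from the top of
// a victim's deque, taking the oldest task, which in a divide-and-
// conquer program tends to represent the largest chunk of work.
struct Worker {
    std::deque<int> tasks;
};

int pop_own(Worker &w) {             // owner: newest task first
    int t = w.tasks.back();
    w.tasks.pop_back();
    return t;
}

int steal_from(Worker &victim) {     // thief: oldest task first
    int t = victim.tasks.front();
    victim.tasks.pop_front();
    return t;
}
```

Because owners and thieves work at opposite ends of the deque, they rarely contend, and idle cores automatically pull work toward themselves as the available parallelism fluctuates, which is why the scheme handles irregular problems so well.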
Charles closed by stating that the real challenge for multicore software technology is how parallelization technologies scale with legacy code base size. While no one is saying that writing new parallel codes is easy, I believe he is correct that from a practical perspective dealing with existing codes is of paramount importance to industry and academia.