Celling a Revolution
This is an article I originally published on my blog Taipei Gamer in April 2006.
There's been a lot of debate on gaming websites over the technology
behind the next-gen consoles, specifically the PS3 and the Xbox 360.
The general consensus, which is reflected in this
recent post on kotaku.com, is that the PS3 is possibly more
powerful but dramatically harder to program for due to the
idiosyncrasies of the PS3's Cell processor. By far the most damaging
claims against Cell are those
made by John Carmack, programming god (I don't dispute that, by the
way). While I don't have any experience developing console games (I am
a programmer, though, so I have some qualifications), I'd still like to
talk a bit about why I'm a lot more excited about the PS3 than the
360.
Improvements Over the Past
Before I delve into Cell, I want to discuss one
huge improvement, from an ease-of-development perspective, that the
PS3 has over the PS2: the PS3's use of OpenGL. With the PS2,
developers (or a middleware company) had to do their graphics
programming on the metal because the PS2's graphics hardware did not
support a high-level API like OpenGL or DirectX. Microsoft's Xbox, on
the other hand, was essentially a PC with Nvidia graphics hardware.
Xbox programmers wrote their graphics code in a high-level language
like C++ with calls to the DirectX API. The only exceptions to this
were the vertex and pixel shaders, which were written in a low-level
assembly language targeting the Nvidia GPU. In short, graphics
programming was much, much easier on the Xbox.
This time
around, Sony has adopted the PC/Xbox model for the PS3. The PS3's RSX
GPU, also developed by Nvidia, will be accessed through a subset of
OpenGL, an open API similar to DirectX. Shader development will be
done using Nvidia's Cg language, a high-level replacement for the
specialized assembly language used in the past. All access to the RSX
GPU will be through these high-level methods. Carmack himself once
praised OpenGL as superior to Microsoft's DirectX, and indeed,
id Software's most recent game, Doom 3, used OpenGL. Although DirectX
has improved, many developers still prefer OpenGL.
So
the vast majority of PS3 graphics programming will be done through a
high-level API from a high-level language. With regard to graphics
programming, the Cell processor will only be used for nonstandard
(i.e., non-polygonal) tasks like raytracing, volume rendering, and
particle effects. RSX, through the use of vertex and pixel shaders,
can even take on some of these nonstandard rendering tasks, but the
approach used will depend on the developers. This brings the PS3 in
line with the Xbox 360 and high-end PCs from a standard graphics
programming perspective.
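To give a sense of what that abstraction buys, here is a minimal sketch in C of a high-level draw call. It uses plain desktop OpenGL rather than the PS3's actual OpenGL subset and assumes a rendering context already exists; the point is only the level at which the programmer now works.

```c
#include <GL/gl.h>

// Draw a triangle mesh through the high-level API: no register pokes,
// no hand-built command buffers. vertices points to packed xyz floats
// in application memory; a rendering context is assumed to exist.
void draw_mesh(const float *vertices, int vertex_count)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, vertices);   // describe the data layout
    glDrawArrays(GL_TRIANGLES, 0, vertex_count); // the driver does the rest
    glDisableClientState(GL_VERTEX_ARRAY);
}
```

Compare that with building DMA display lists for the PS2's Graphics Synthesizer by hand, and the ease-of-development gap is obvious.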
Types of Multiprocessing
The big difference between the 360
and the PS3 lies in their approaches to processor parallelism in
the CPU. Both consoles use a CPU derived from IBM's POWER series of
processors (hereafter referred to as a power processing element, or
PPE). Both consoles use PPEs developed by IBM, and in both
cases the designers have pared down the chip logic that was designed
to extract instruction-level parallelism from executing code. The
360's CPU, codenamed Xenon, makes up for this by having three separate
PPE cores, each capable of two simultaneous threads of execution. So
the chip designers have traded automatic parallelization of one or two
threads for the capacity to execute many programmer-created threads
simultaneously. Although Xenon has a great capacity for parallelism,
the basic programming model is the same one used by nonparallel
machines that execute multiple tasks concurrently, like a single-processor
desktop PC.
The PS3's Cell processor, on the
other hand, has only one PPE (also dual-threaded) but adds
computational resources in the form of seven synergistic processing
elements (SPEs). Each SPE is a specialized vector processor
with its own 256KB local memory. Main memory addresses are coherent
throughout the system, with communication between the SPEs and main
memory handled through high-bandwidth DMA that can be initiated from
either the SPE or the PPE. Managing DMA transfers between an SPE's
local memory and main memory is a primary source of Cell's added
complexity for developers. The SPEs can be considered a form of
multiple-instruction, multiple-data (MIMD) parallelism with both
distributed and shared memory.
Both processors also make
use of single-instruction, multiple-data (SIMD) parallelism in the
vector processing units of their PPEs. And of course, the Cell's SPEs
are basically SIMD vector processors.
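To make SIMD concrete, here is a minimal sketch using AltiVec/VMX-style intrinsics of the kind the PPE vector units execute (assuming a compiler with AltiVec support, such as GCC with -maltivec):

```c
#include <altivec.h>

// One vector instruction operates on four floats at once; this is the
// data parallelism that both consoles' PPE vector units, and at heart
// the SPEs themselves, are built around.
vector float scale_and_offset(vector float v, vector float scale,
                              vector float offset)
{
    // Fused multiply-add across all four lanes: v * scale + offset.
    return vec_madd(v, scale, offset);
}
```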
Xenon's Approach
Xenon's
approach to parallelism is the most commonly used high-level model and
is generally what is referred to when people talk about multithreaded
programming. In this scenario, parallel threads of execution share
memory and must execute synchronously with respect to one another. If
shared memory isn't locked while one thread is accessing it, another
thread could stomp all over it and disrupt the first thread's
computations. Methods to deal with process synchronization are
provided by the OS, which is also responsible for scheduling when --
and on which processor -- threads execute. This level of abstraction
reduces complexity for the programmer but is one reason that debugging
multithreaded programs is so difficult.
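The heart of that model fits in a few lines. The sketch below uses POSIX threads rather than the 360's actual Win32-style API, but the pattern -- lock, touch shared state, unlock -- is the same everywhere:

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long hits = 0; // state shared by all threads

// Each thread must take the lock before touching the shared counter;
// without it, concurrent increments can be lost when two threads
// read-modify-write the same value at once.
void *worker(void *arg)
{
    int i;
    (void)arg;
    for (i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        hits++; // critical section
        pthread_mutex_unlock(&lock);
    }
    return 0;
}
```

Forgetting a single lock produces bugs that appear only under particular thread timings, which is exactly why debugging this model is so painful.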
Despite the ready
availability of multiprocessing hardware (hardware that can actually
execute multiple threads simultaneously instead of faking it the way a
single-processor machine does), it is only recently that
multiprocessor machines have become commonplace in the desktop market,
and most of these new machines are only dual core. Until now,
multiprocessing has remained largely confined to network servers, where
extracting increased performance through multithreading is relatively
straightforward. For now at least, the primary benefit to the
consumer of a dual-core machine is the ability to run multiple
applications simultaneously without a loss of responsiveness, not
increased performance in individual applications.
I see
little reason to believe that the fundamental difficulties of
multithreaded programming will change with the advent of the 360. I
would say that multithreaded programming is probably just as difficult
now as it was ten years ago, and most of the performance improvements
have come from advances at the hardware level. The challenge for 360
developers will be to keep as many threads as possible executing
simultaneously. To do this they will need to avoid both conflicts
between threads and waits when accessing data from main memory.
The latter problem is now the primary performance bottleneck for
data-intensive applications like games. Memory latency has simply not
kept pace with advances in processor speed in recent years. Xenon's
response to this challenge is to give the programmer some control over
how Xenon's L2 cache is used. On a thread-by-thread basis, the
programmer can set aside portions of the PPEs' shared 1MB L2 cache for
use as a buffer between a thread running on a PPE and the GPU. In
short, this creates a form of local storage for the GPU that can be
used to keep the GPU from having to pull geometry data from main
memory (see the references for more information about this). Although
this fine-grained control will add another level of complexity to 360
development, in the long run it should make it much easier to achieve
good performance.
Cell's MIMD Approach
Cell's MIMD approach has more in
common with applications in distributed computing than with
conventional multithreading. In fact, this common misunderstanding on
gaming websites was the primary impetus for this article. The
increased complexity resulting from Cell's seven SPEs is quite different
from the complexity of managing six simultaneous threads on the 360.
The way that threads running on SPEs interact with one another is
likely to be quite different from how threads interact on the 360's
PPEs. (Note: it is possible to program Cell so that it behaves similarly
to a shared memory multiprocessor like the 360.) Fundamentally, Cell
is an attempt to address the problem of memory latency mentioned above
in the discussion of Xenon's L2 cache by giving developers direct
control over both the cache -- the SPE local memory -- and process
scheduling on the SPEs. The Cell Broadband Engine Tutorial says it
best:
The reason for this radical change is that
memory latency, measured in processor cycles, has gone up several
hundredfold in the last 20 years. The result is that application
performance is, in most cases, limited by memory latency rather than
by peak compute capability or peak bandwidth. When a sequential
program on a conventional architecture performs a load instruction
that misses in the caches, program execution now comes to a halt for
several hundred cycles. Compared to this penalty, the few cycles it
takes to set up a DMA transfer for an SPE is quite small. Conventional
processors, even with deep and costly speculation, manage to get, at
best, a handful of independent memory accesses in flight. The result
can be compared to a bucket brigade in which a hundred people are
required to cover the distance to the water needed to put the fire
out, but only a few buckets are available. In contrast, the explicit
DMA model allows each SPE to have many concurrent memory accesses in
flight, without the need for speculation (CBET; 14).
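On the SPE side, that explicit DMA model looks roughly like the following C sketch. It uses the mfc_* interface as described in IBM's SDK documentation; treat the details as approximate, since I'm working from the tutorial rather than shipping code:

```c
#include <spu_mfcio.h>

#define TAG 1

// SPE-side sketch: pull a block of main memory into local store by DMA,
// then block until the transfer completes. Buffers and sizes must meet
// the MFC's alignment rules (16-byte aligned, at most 16KB per transfer).
// Real code would double-buffer: start the next transfer before
// processing this one, hiding the latency the tutorial describes.
void fetch_block(void *local_buf, unsigned long long main_mem_addr,
                 unsigned int size)
{
    mfc_get(local_buf, main_mem_addr, size, TAG, 0, 0); // start DMA: a few cycles
    mfc_write_tag_mask(1 << TAG);                       // select which tag to wait on
    mfc_read_tag_status_all();                          // stall until the transfer is done
}
```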
The Cell Broadband Engine Tutorial mentions seven
possible programming models for Cell. To illustrate Cell's
flexibility, I'm going to briefly discuss three of them here:
- Functional Offload Model (Remote Procedure Call): Here the
SPEs are used to perform specific tasks on behalf of the PPE. The
developer simply chooses which tasks to run on the SPEs and
replaces those function calls with stub functions that manage the
process of calling the procedure on the SPE. A special compiler is
provided that creates the PPE and SPE code from an Interface
Definition Language (IDL) file (CBET; 127-128). A sketch of the stub
pattern follows this list.
- Streaming Model:
"In the Streaming Model, each SPE, in either a serial or parallel
pipeline, computes data that streams through. The PPE acts as a stream
controller, and the SPEs act as stream-data processors" (CBET;
133).
- Shared Memory Multiprocessor Model (Xbox 360 mode):
Because memory addresses are coherent between the SPEs and the PPE, a
shared memory model is possible. SPE local memory can be utilized as
program and data cache. Since each SPE has 256KB of local memory,
that is a pretty nice cache (CBET; 134).
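As promised, here is what the functional offload stub pattern might look like in C. The names spe_run_task and TRANSFORM_TASK are invented for illustration; in reality, the IDL compiler mentioned above generates this glue:

```c
// Hypothetical sketch of a PPE-side stub. spe_run_task and
// TRANSFORM_TASK are made-up names standing in for SDK-generated glue.
typedef struct {
    float *vertices;  // data in main memory for the SPE to fetch by DMA
    int    count;
    float  matrix[16];
} transform_args;

enum { TRANSFORM_TASK = 1 };

extern void spe_run_task(int task_id, void *args, unsigned size); // hypothetical

// To the caller this is an ordinary function; under the hood it ships
// its arguments to an SPE, runs the task there, and waits for the result.
void transform_vertices(transform_args *args)
{
    spe_run_task(TRANSFORM_TASK, args, sizeof *args);
}
```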
My gut feeling
is that there is greater room for advancement in both tool
quality (optimizing compilers, etc.) and programmer skill with the MIMD
approach. Perhaps I am being overly optimistic, but I think history
indicates that a breakthrough in conventional multithreaded programming
techniques is unlikely. Furthermore, Cell's flexibility gives
developers more than one programming model to pursue in their attempts
to keep the SPEs busy.
Cell Weaknesses
The SPEs
are optimized for single-precision floating-point operations, which
are issued 4-way with a 6-cycle latency. The latency for double
precision is 13 cycles, of which only the last 7 are pipelined, so
no instructions of any kind are issued for six cycles after a
double-precision operation is issued (CBET; 60, 62). This would seem
to be a significant hit for Cell's other target application area,
scientific computing. Perhaps IBM is planning further revisions to
address the need for higher-precision floating point.
A performance
issue with greater relevance to gaming is Cell's ability to run
AI code. The SPEs have limited branch prediction logic (as do the PPEs
of both Cell and Xenon, for that matter), which should reduce
performance on branch-heavy AI code. And of course, the fact that AI
code is generally neither SIMD-parallel nor floating-point intensive
doesn't make an SPE the ideal place to run such code. However, the
SPEs are at least flexible enough to run AI code, so they may be used
for it anyway simply because there are so many of them.
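One standard mitigation on branch-poor hardware is to trade branches for selects: compute both outcomes and pick one arithmetically. A generic C sketch of the idea:

```c
// Branchy version: every mispredicted branch is expensive on a core
// with little or no branch prediction hardware.
float clamp_branchy(float x, float lo, float hi)
{
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

// Branch-free version: the compiler can map these ternaries onto
// conditional-select instructions, so no branches are executed at all.
float clamp_select(float x, float lo, float hi)
{
    float t = (x < lo) ? lo : x;
    return (t > hi) ? hi : t;
}
```

Tricks like this help, but much AI code is dominated by pointer-chasing and irregular decisions that don't reduce to selects, which is why the SPEs remain a questionable home for it.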
Finally,
there is the added challenge to programmers of having to explicitly
manage DMA transfers between the SPEs and main memory. The use of
certain libraries and compilers could reduce the amount of
micromanagement necessary, however.
Increasing Importance of Middleware
With the last generation of consoles, the complexity and scale of game
development rose sharply. This has led to an increased dependence
on third-party code by many developers; witness, for instance, the
widespread adoption of Unreal Engine 3 by next-gen game developers. The
presence of high-quality middleware like UE3 means that not every
studio has to be stocked with Cell and Xenon experts to make quality
games. On the downside, all of this middleware could lead to less
technological diversity between games in the next generation. Even
worse is the possibility of less distinction between consoles, if a
few middleware companies come to dominate the market and don't do the
work necessary to fully exploit the hardware of one console or the
other. Hopefully, competition between middleware companies will at
least prevent the latter scenario.
Overall, the middleware
trend should favor the PS3 most because, if the past is any
indication, it will probably be the market leader. As the market
leader, the PS3 should have the most games and thus the largest market
for middleware. This could nullify the advantage of Microsoft's
supposedly superior first-party development tools. If there is a
performance difference between cross-platform middleware packages,
expect the PS3 to have the better support.
Performance Battle / Room To Grow
In a console war such as this, the bottom line is console
performance (at least for Sony and Microsoft; Nintendo seems to have
opted out of the performance battle). Every developer is going to
want to push the hardware, so what matters is the relative difficulty
of writing optimized code. I do not doubt that the 360 is the easier
platform on which to create relatively unoptimized games, although I
think the difference is often exaggerated. The PS3, on the other hand,
will require more experimentation and research up front. Once that is
done, however, I actually think it will probably be easier to create
optimized code on the PS3. PS3 developers will reap long-term rewards
from being forced to program the SPEs at a lower level, and possibly
from advances in software technology.
Most pundits also
agree that the PS3 has a higher performance ceiling. Over time this
will be a significant advantage for Sony. If the past is any
indication, Sony intends for the PS3 to have a long lifespan. A
high ceiling for maximum performance means that PS3 games should
steadily improve throughout the PS3's life, keeping consumers
satisfied. We saw something similar with the PS2: despite the PS2
launching ahead of the Xbox and GameCube, consumers remained satisfied
with it, in part because of the steady improvement in its games.
Conclusion
There
is more than one way to look at programming complexity. Sometimes
abstractions that make certain tasks easier make others more
difficult. Abstractions in programming languages and models can be a
double-edged sword, and nowhere is this clearer than in runtime
performance. The Xbox 360 adopts the traditional abstractions of
multithreaded software development. This model is familiar, but it is
very difficult to program efficiently. The PS3 opts for a
fundamentally lower-level model that could put fewer barriers between
software engineers and theoretical peak performance. The PS3, powered
by the Cell processor, should overcome the obstacles presented by its
unique architecture and take the performance crown in the coming
generation.
References
Ars Technica: Inside the Xbox 360, part I and part II
IBM's Cell Broadband Engine Resource Center
Cell Broadband Engine Tutorial (CBET; downloadable from IBM's site above)