Celling a Revolution
This is an article I originally published on my blog Taipei Gamer in April 2006.
There's been a lot of debate on gaming websites over the technology
behind the next-gen consoles, specifically the PS3 and the Xbox 360.
The general consensus, which is reflected in this
recent post on kotaku.com, is that the PS3 is possibly more
powerful but dramatically harder to program for due to the
idiosyncrasies of the PS3's Cell processor. By far the most damaging
claims against Cell are those
made by John Carmack, programming god (I don't dispute that, by the
way). While I don't have any experience developing console games (I am
a programmer, though, so I have some qualifications), I'd still like to
talk a bit about why I'm a lot more excited about the PS3 than the
360.
Improvements Over the Past
Before I delve into Cell, I want to discuss one
huge improvement, from an ease-of-development perspective, that the
PS3 has over the PS2: the PS3's use of OpenGL. With the PS2,
developers (or a middleware company) had to do their graphics
programming on the metal because the PS2's graphics hardware did not
support a high-level API like OpenGL or DirectX. Microsoft's Xbox, on
the other hand, was essentially a PC with Nvidia graphics hardware.
Xbox programmers wrote their graphics code in a high-level language
like C++ with calls to the DirectX API. The only exceptions to this
were the vertex and pixel shaders, which were written in a low-level
assembly language targeting the Nvidia GPU. In short, graphics
programming was much, much easier on the Xbox.
This time
around, Sony has adopted the PC/Xbox model for the PS3. The PS3's RSX
GPU, also developed by Nvidia, will be accessed through a subset of
OpenGL, an open API similar to DirectX. Shader development will be
done using Nvidia's Cg language, a high-level replacement for the
specialized assembly language used in the past. All access to the RSX
GPU will be through these high-level methods. Carmack himself once
praised OpenGL as superior to Microsoft's DirectX, and indeed,
id Software's most recent game, Doom 3, used OpenGL. Although DirectX
has improved, many developers still prefer OpenGL.
So
the vast majority of PS3 graphics programming will be done through a
high-level API from a high-level language. With regard to graphics
programming, the Cell processor will only be used for nonstandard
(i.e., non-polygonal) tasks like raytracing, volume rendering, and
particle effects. RSX, through the use of vertex and pixel shaders,
can even take on some of these nonstandard rendering tasks, but the
approach used will depend on the developers. This brings the PS3 in
line with the Xbox 360 and high-end PCs from a standard graphics
programming perspective.
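To give a sense of what that abstraction buys, here is a minimal sketch in C of a high-level draw call. It uses plain desktop OpenGL rather than the PS3's actual OpenGL subset and assumes a rendering context already exists; the point is only the level at which the programmer now works.

```c
#include <GL/gl.h>

// Draw a triangle mesh through the high-level API: no register pokes,
// no hand-built command buffers. vertices points to packed xyz floats
// in application memory; a rendering context is assumed to exist.
void draw_mesh(const float *vertices, int vertex_count)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, vertices);   // describe the data layout
    glDrawArrays(GL_TRIANGLES, 0, vertex_count); // the driver does the rest
    glDisableClientState(GL_VERTEX_ARRAY);
}
```

Compare that with building DMA display lists for the PS2's Graphics Synthesizer by hand, and the ease-of-development gap is obvious.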
Types of Multiprocessing
The big difference between the 360
and the PS3 lies in their approaches to processor parallelism in
the CPU. Both consoles use a CPU derived from IBM's POWER series of
processors (hereafter referred to as a power processing element, or
PPE). Both consoles use PPEs developed by IBM, and in both
cases the designers have pared down the chip logic that was designed
to extract instruction-level parallelism from executing code. The
360's CPU, codenamed Xenon, makes up for this by having three separate
PPE cores, each capable of two simultaneous threads of execution. So
the chip designers have traded automatic parallelization of one or two
threads for the capacity to execute many programmer-created threads
simultaneously. Although Xenon has a great capacity for parallelism,
the basic programming model is the same one used by nonparallel
machines that execute multiple tasks concurrently, like a single-processor
desktop PC.
The PS3's Cell processor, on the
other hand, has only one PPE (also dual-threaded) but adds
computational resources in the form of seven synergistic processing
elements (SPEs). Each SPE is a specialized vector processor
with its own 256KB local memory. Main memory addresses are coherent
throughout the system, with communication between the SPEs and main
memory handled through high-bandwidth DMA that can be initiated from
either the SPE or the PPE. Managing DMA transfers between an SPE's
local memory and main memory is a primary source of Cell's added
complexity for developers. The SPEs can be considered a form of
multiple-instruction, multiple-data (MIMD) parallelism with both
distributed and shared memory.
Both processors also make
use of single-instruction, multiple-data (SIMD) parallelism in the
vector processing units of their PPEs. And of course, the Cell's SPEs
are basically SIMD vector processors.
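To make SIMD concrete, here is a minimal sketch using AltiVec/VMX-style intrinsics of the kind the PPE vector units execute (assuming a compiler with AltiVec support, such as GCC with -maltivec):

```c
#include <altivec.h>

// One vector instruction operates on four floats at once; this is the
// data parallelism that both consoles' PPE vector units, and at heart
// the SPEs themselves, are built around.
vector float scale_and_offset(vector float v, vector float scale,
                              vector float offset)
{
    // Fused multiply-add across all four lanes: v * scale + offset.
    return vec_madd(v, scale, offset);
}
```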
Xenon's Approach
Xenon's
approach to parallelism is the most commonly used high-level model and
is generally what is referred to when people talk about multithreaded
programming. In this scenario, parallel threads of execution share
memory and must execute synchronously with respect to one another. If
shared memory isn't locked while one thread is accessing it, another
thread could stomp all over it and disrupt the first thread's
computations. Methods to deal with process synchronization are
provided by the OS, which is also responsible for scheduling when --
and on which processor -- threads execute. This level of abstraction
reduces complexity for the programmer but is one reason that debugging
multithreaded programs is so difficult.
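The heart of that model fits in a few lines. The sketch below uses POSIX threads rather than the 360's actual Win32-style API, but the pattern -- lock, touch shared state, unlock -- is the same everywhere:

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long hits = 0; // state shared by all threads

// Each thread must take the lock before touching the shared counter;
// without it, concurrent increments can be lost when two threads
// read-modify-write the same value at once.
void *worker(void *arg)
{
    int i;
    (void)arg;
    for (i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        hits++; // critical section
        pthread_mutex_unlock(&lock);
    }
    return 0;
}
```

Forgetting a single lock produces bugs that appear only under particular thread timings, which is exactly why debugging this model is so painful.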
Despite the ready
availability of multiprocessing hardware (hardware that can actually
execute multiple threads simultaneously instead of faking it the way a
single-processor machine does), it is only recently that
multiprocessor machines have become commonplace in the desktop market,
and most of these new machines are only dual core. Until now,
multiprocessing has remained largely confined to network servers, where
extracting increased performance through multithreading is relatively
straightforward. For now at least, the primary benefit to the
consumer of a dual-core machine is the ability to run multiple
applications simultaneously without a loss of responsiveness, not
increased performance in individual applications.
I see
little reason to believe that the fundamental difficulties of
multithreaded programming will change with the advent of the 360. I
would say that multithreaded programming is probably just as difficult
now as it was ten years ago, and most of the performance improvements
have come from advances at the hardware level. The challenge for 360
developers will be to keep as many threads as possible executing
simultaneously. To do this they will need to avoid both conflicts
between threads and waits when accessing data from main memory.
The latter problem is now the primary performance bottleneck for
data-intensive applications like games. Memory latency has simply not
kept pace with advances in processor speed in recent years. Xenon's
response to this challenge is to give the programmer some control over
how Xenon's L2 cache is used. On a thread-by-thread basis, the
programmer can set aside portions of the PPEs' shared 1MB L2 cache for
use as a buffer between a thread running on a PPE and the GPU. In
short, this creates a form of local storage for the GPU that can be
used to keep the GPU from having to pull geometry data from main
memory (see the references for more information about this). Although
this fine-grained control will add another level of complexity to 360
development, in the long run it should make it much easier to achieve
good performance.
Cell's MIMD Approach
Cell's MIMD approach has more in
common with applications in distributed computing than with
conventional multithreading. In fact, this common misunderstanding on
gaming websites was the primary impetus for this article. The
increased complexity resulting from Cell's seven SPEs is quite different
from the complexity of managing six simultaneous threads on the 360.
The way that threads running on SPEs interact with one another is
likely to be quite different from how threads interact on the 360's
PPEs. (Note: it is possible to program Cell so that it behaves similarly
to a shared memory multiprocessor like the 360.) Fundamentally, Cell
is an attempt to address the problem of memory latency mentioned above
in the discussion of Xenon's L2 cache by giving developers direct
control over both the cache -- the SPE local memory -- and process
scheduling on the SPEs. The Cell Broadband Engine Tutorial says it
best:
The reason for this radical change is that
memory latency, measured in processor cycles, has gone up several
hundredfold in the last 20 years. The result is that application
performance is, in most cases, limited by memory latency rather than
by peak compute capability or peak bandwidth. When a sequential
program on a conventional architecture performs a load instruction
that misses in the caches, program execution now comes to a halt for
several hundred cycles. Compared to this penalty, the few cycles it
takes to set up a DMA transfer for an SPE is quite small. Conventional
processors, even with deep and costly speculation, manage to get, at
best, a handful of independent memory accesses in flight. The result
can be compared to a bucket brigade in which a hundred people are
required to cover the distance to the water needed to put the fire
out, but only a few buckets are available. In contrast, the explicit
DMA model allows each SPE to have many concurrent memory accesses in
flight, without the need for speculation (CBET; 14).
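On the SPE side, that explicit DMA model looks roughly like the following C sketch. It uses the mfc_* interface as described in IBM's SDK documentation; treat the details as approximate, since I'm working from the tutorial rather than shipping code:

```c
#include <spu_mfcio.h>

#define TAG 1

// SPE-side sketch: pull a block of main memory into local store by DMA,
// then block until the transfer completes. Buffers and sizes must meet
// the MFC's alignment rules (16-byte aligned, at most 16KB per transfer).
// Real code would double-buffer: start the next transfer before
// processing this one, hiding the latency the tutorial describes.
void fetch_block(void *local_buf, unsigned long long main_mem_addr,
                 unsigned int size)
{
    mfc_get(local_buf, main_mem_addr, size, TAG, 0, 0); // start DMA: a few cycles
    mfc_write_tag_mask(1 << TAG);                       // select which tag to wait on
    mfc_read_tag_status_all();                          // stall until the transfer is done
}
```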
The Cell Broadband Engine Tutorial mentions seven
possible programming models for Cell. To illustrate Cell's
flexibility, I'm going to briefly discuss three of them here:
- Functional Offload Model (Remote Procedure Call): Here the
SPEs are used to perform specific tasks on behalf of the PPE. The
developer simply chooses which tasks to run on the SPEs and
replaces those function calls with stub functions that manage the
process of calling the procedure on the SPE. A special compiler is
provided that creates the PPE and SPE code from an Interface
Definition Language (IDL) file (CBET; 127-128). A sketch of the stub
pattern follows this list.
- Streaming Model:
"In the Streaming Model, each SPE, in either a serial or parallel
pipeline, computes data that streams through. The PPE acts as a stream
controller, and the SPEs act as stream-data processors" (CBET;
133).
- Shared Memory Multiprocessor Model (Xbox 360 mode):
Because memory addresses are coherent between the SPEs and the PPE, a
shared memory model is possible. SPE local memory can be utilized as
program and data cache. Since each SPE has 256KB of local memory,
that is a pretty nice cache (CBET; 134).
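As promised, here is what the functional offload stub pattern might look like in C. The names spe_run_task and TRANSFORM_TASK are invented for illustration; in reality, the IDL compiler mentioned above generates this glue:

```c
// Hypothetical sketch of a PPE-side stub. spe_run_task and
// TRANSFORM_TASK are made-up names standing in for SDK-generated glue.
typedef struct {
    float *vertices;  // data in main memory for the SPE to fetch by DMA
    int    count;
    float  matrix[16];
} transform_args;

enum { TRANSFORM_TASK = 1 };

extern void spe_run_task(int task_id, void *args, unsigned size); // hypothetical

// To the caller this is an ordinary function; under the hood it ships
// its arguments to an SPE, runs the task there, and waits for the result.
void transform_vertices(transform_args *args)
{
    spe_run_task(TRANSFORM_TASK, args, sizeof *args);
}
```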
My gut feeling
is that there is greater room for advancement in both tool
quality (optimizing compilers, etc.) and programmer skill with the MIMD
approach. Perhaps I am being overly optimistic, but I think history
indicates that a breakthrough in conventional multithreaded programming
techniques is unlikely. Furthermore, Cell's flexibility gives
developers more than one programming model to pursue in their attempts
to keep the SPEs busy.
Cell Weaknesses
The SPEs
are optimized for single-precision floating-point operations, which
are issued 4-way with a 6-cycle latency. The latency for double
precision is 13 cycles, of which only the last 7 are pipelined, so
no instructions of any kind are issued for six cycles after a
double-precision operation is issued (CBET; 60, 62). This would seem
to be a significant hit for Cell's other target application area,
scientific computing. Perhaps IBM is planning further revisions to
address the need for higher-precision floating point.
A performance
issue with greater relevance to gaming is Cell's ability to run
AI code. The SPEs have limited branch prediction logic (as do the PPEs
of both Cell and Xenon, for that matter), which should reduce
performance on branch-heavy AI code. And of course, the fact that AI
code is generally neither SIMD-parallel nor floating-point intensive
doesn't make an SPE the ideal place to run such code. However, the
SPEs are at least flexible enough to run AI code, so they may be used
for it anyway simply because there are so many of them.
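One standard mitigation on branch-poor hardware is to trade branches for selects: compute both outcomes and pick one arithmetically. A generic C sketch of the idea:

```c
// Branchy version: every mispredicted branch is expensive on a core
// with little or no branch prediction hardware.
float clamp_branchy(float x, float lo, float hi)
{
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

// Branch-free version: the compiler can map these ternaries onto
// conditional-select instructions, so no branches are executed at all.
float clamp_select(float x, float lo, float hi)
{
    float t = (x < lo) ? lo : x;
    return (t > hi) ? hi : t;
}
```

Tricks like this help, but much AI code is dominated by pointer-chasing and irregular decisions that don't reduce to selects, which is why the SPEs remain a questionable home for it.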
Finally,
there is the added challenge to programmers of having to explicitly
manage DMA transfers between the SPEs and main memory. The use of
certain libraries and compilers could reduce the amount of
micromanagement necessary, however.
Increasing Importance of Middleware
With the last generation of consoles, the complexity and scale of game
development rose sharply. This has led to an increased dependence
on third-party code by many developers; witness, for instance, the
widespread adoption of Unreal Engine 3 by next-gen game developers. The
presence of high-quality middleware like UE3 means that not every
studio has to be stocked with Cell and Xenon experts to make quality
games. On the downside, all of this middleware could lead to less
technological diversity between games in the next generation. Even
worse is the possibility of less distinction between consoles, if a
few middleware companies come to dominate the market and don't do the
work necessary to fully exploit the hardware of one console or the
other. Hopefully, competition between middleware companies will at
least prevent the latter scenario.
Overall, the middleware
trend should favor the PS3 most because, if the past is any
indication, it will probably be the market leader. As the market
leader, the PS3 should have the most games and thus the largest market
for middleware. This could nullify the advantage of Microsoft's
supposedly superior first-party development tools. If there is a
performance difference between cross-platform middleware packages,
expect the PS3 to have the better support.
Performance Battle / Room To Grow
In a console war such as this, the bottom line is console
performance (at least for Sony and Microsoft; Nintendo seems to have
opted out of the performance battle). Every developer is going to
want to push the hardware, so what matters is the relative difficulty
of writing optimized code. I do not doubt that the 360 is the easier
platform on which to create relatively unoptimized games, although I
think the difference is often exaggerated. The PS3, on the other hand,
will require more experimentation and research up front. Once that is
done, however, I actually think it will probably be easier to create
optimized code on the PS3. PS3 developers will reap long-term rewards
from being forced to program the SPEs at a lower level, and possibly
from advances in software technology.
Most pundits also
agree that the PS3 has a higher performance ceiling. Over time this
will be a significant advantage for Sony. If the past is any
indication, Sony intends for the PS3 to have a long lifespan. A
high ceiling for maximum performance means that PS3 games should
steadily improve throughout the PS3's life, keeping consumers
satisfied. We saw something similar with the PS2: despite the PS2
launching ahead of the Xbox and GameCube, consumers remained satisfied
with it, in part because of the steady improvement in its games.
Conclusion
There
is more than one way to look at programming complexity. Sometimes
abstractions that make certain tasks easier make others more
difficult. Abstractions in programming languages and models can be a
double-edged sword, and nowhere is this clearer than in runtime
performance. The Xbox 360 adopts the traditional abstractions of
multithreaded software development. This model is familiar, but it is
very difficult to program efficiently. The PS3 opts for a
fundamentally lower-level model that could put fewer barriers between
software engineers and theoretical peak performance. The PS3, powered
by the Cell processor, should overcome the obstacles presented by its
unique architecture and take the performance crown in the coming
generation.
References
Ars Technica: Inside the Xbox 360, part I and part II
IBM's Cell Broadband Engine Resource Center
Cell Broadband Engine Tutorial (CBET; downloadable from IBM's site above)