- by Paul Murphy -
Three years ago IBM, Sony, and Toshiba announced a partnership aimed at developing a new processor for use in digital entertainment devices like the PlayStation. Since then the product has seen a billion dollars in development work, and two fabs, one in Tokyo and one in Fishkill, N.Y., have been custom built to make it in large volumes. On May 12, IBM announced that the first commercial workstations based on this processor would become available to games industry developers late this year.
A lot is known about this processor as planned, but relatively little real information about the product as built has yet leaked. What performance information has become available is characterised by numbers so high that most people have simply dismissed the reports. In November of last year, for example, a senior Sony executive told an internal audience that implementations would scale from uniprocessors to 64-way groupings delivering in excess of two teraflops - making it more than ten times faster than Xeon.
Most of what we know about this machine comes from US patent #6,526,491, issued to Sony Computer Entertainment Inc. in February of 2003 for a "memory protection system and method for computer architecture for broadband networks." Here's the abstract:
A computer architecture and programming model for high speed processing over broadband networks are provided. The architecture employs a consistent modular structure, a common computing module and uniform software cells. The common computing module includes a control processor, a plurality of processing units, a plurality of local memories from which the processing units process programs, a direct memory access controller and a shared main memory. A synchronised system and method for the coordinated reading and writing of data to and from the shared main memory by the processing units also are provided. A hardware sandbox structure is provided for security against the corruption of data among the programs being processed by the processing units. The uniform software cells contain both data and applications and are structured for processing by any of the processors of the network. Each software cell is uniquely identified on the network. A system and method for creating a dedicated pipeline for processing streaming data also are provided.
The machine is widely referred to as a cell processor, but the cells involved are software, not hardware. Thus a cell is a kind of TCP packet on steroids, containing both data and instructions and linked back to the task of which it forms part via unique identifiers that facilitate results assembly just as the TCP sequence number does.
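The patent describes cells only in the abstract terms quoted above; it does not publish a concrete layout. As a rough illustration of the idea, though, a software cell might look something like the following sketch, in which every field name is my own assumption, not taken from the patent:

```python
from dataclasses import dataclass

# Hypothetical sketch of a software cell. Field names are illustrative
# assumptions; the patent specifies only that a cell carries both data
# and applications and is uniquely identified on the network.
@dataclass
class SoftwareCell:
    task_id: str    # the task this cell belongs to
    sequence: int   # position within the task, like a TCP sequence number
    program: bytes  # instructions, runnable on any processor in the network
    data: bytes     # the operands those instructions work on

def assemble(results):
    """Reassemble per-cell results into task order, much as a TCP
    stack reorders packets by sequence number."""
    return b"".join(c.data for c in sorted(results, key=lambda c: c.sequence))

# Cells may finish out of order on different processors;
# the identifiers let the results be stitched back together.
done = [SoftwareCell("t1", 2, b"", b"world"),
        SoftwareCell("t1", 1, b"", b"hello ")]
print(assemble(done))  # b'hello world'
```

The point of the analogy is that, as with TCP, execution order and delivery order are decoupled: any processor can run any cell, and the identifiers restore ordering afterwards.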
The basic processor itself appears to be a PowerPC derivative with high-speed built-in local communications, high-speed access to local memory, and up to eight attached processing units broadly akin to the AltiVec short array processor used by Apple. The actual product consists of one to eight of these on a chip - a true GRID-on-a-chip approach in which a four-way assembly can, when fully populated, consist of four core CPUs, 32 attached processing units, and 512MB of local memory.
The per-cycle performance of the core CPU is undocumented but may be expected to be comparable to other PowerPC machines running at high cache hit rates. Specifications for the four or eight attached processors making up the array are known: each is expected to turn in one floating point operation per cycle, or around 32 gigaflops for a fully populated eight-unit array at a nominal 4GHz.
That's where the apparently outrageous performance claims come from: a four-way assembly running at a planned 4GHz offers 32 x 4 = 128 gigaflops in potential floating point execution. A 64-way super GRID made by stacking eight eight-way assemblies would have a total of 512 attached processors and could, therefore, break two teraflops if data transport kept up with the processors.
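The arithmetic behind those claims is straightforward to check against the article's own figures (one floating point operation per attached unit per cycle, eight units per core, a planned 4GHz clock):

```python
# Back-of-envelope peak-FLOPS arithmetic using the numbers above.
CLOCK_GHZ = 4        # planned clock speed
FLOPS_PER_CYCLE = 1  # one floating point op per attached unit per cycle
APUS_PER_CORE = 8    # attached processing units per core CPU

def peak_gflops(cores: int) -> int:
    """Theoretical peak, in gigaflops, for a grouping of `cores` cores."""
    return cores * APUS_PER_CORE * FLOPS_PER_CYCLE * CLOCK_GHZ

print(peak_gflops(1))   # 32   - one fully populated core
print(peak_gflops(4))   # 128  - a four-way assembly
print(peak_gflops(64))  # 2048 - a 64-way grouping: just over two teraflops
```

These are, of course, paper numbers: they assume every attached unit retires one operation every cycle, which real workloads never achieve.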
In practice, however, Apple has never succeeded in getting the bulk of its developers to make effective use of the AltiVec, and Sun has had essentially no success getting people outside the military and intelligence communities to use the four-way SIMD capabilities built into its SPARC processors. GRID computing is slowly entering the commercial mainstream, but combining local array processing with GRID computing requires a significant shift in programming paradigm that will not appeal to the mainstream Wintel and IBM customer base.
For games developers, however, the potential gains - up to fifty times what the best x86-based processor and graphics board combinations can deliver - should outweigh the pain. Even minor software changes, the kind of thing Adobe does to take advantage of the AltiVec in Photoshop, should offer significant advantages to a wider programming community and enable floating point intensive applications to run a full order of magnitude faster on this machine than on Intel's best.
An important point to bear in mind is that this processor will be cheap, and systems built around it even cheaper, because no external graphics or network boards will be needed. Both Sony and IBM have been building fabs specifically to make this device, and volumes will be high because Sony will use up to 20 million assemblies in the PlayStation while 10 million or more that don't quite make the quality cut will get used in its digital televisions and other products.
Very little has been publicly revealed about the operating system for this thing, but it is quite obvious what it has to be and how it has to work. Each core will have its own local Unix kernel, with most just executing cells as they arrive from the dispatch manager and one managing the traffic coordination hardware. In all likelihood the kernel used will prove to be both Linux-derived and Linux-compatible - meaning that most Linux software will run out of the box on the uniprocessor configuration, while software adapted for the GRID environment will run unchanged on everything from the uniprocessor to configurations with hundreds or even thousands of processor assemblies.
As users of Sun's open source GRID software have found, performance losses on single processes increase as you add processors because data flow and timing control issues grow in complexity non-linearly with system size. Fundamentally, the larger you make the total machine, whether on one piece of silicon or in a rack, the more cell transit time dominates execution time and the greater the performance cost imposed by the need to co-ordinate operations.
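That scaling behaviour can be illustrated with a toy model - my own assumption, not anything from the article or the patent - in which per-cell compute time is fixed while coordination and transit overhead grow with the number of processors:

```python
# Toy scaling model (illustrative assumption): each processor contributes
# fixed compute, but coordination/transit overhead grows with system size,
# so effective speedup falls ever further behind the processor count.
def effective_speedup(n: int, compute: float = 1.0,
                      transit_per_proc: float = 0.01) -> float:
    overhead = transit_per_proc * n   # co-ordination cost grows with n
    return n * compute / (compute + overhead)

for n in (1, 8, 64, 512):
    # Efficiency (speedup / n) declines steadily as n grows.
    print(n, effective_speedup(n), effective_speedup(n) / n)
```

The exact overhead coefficient is invented; the point is only that with any per-processor coordination cost, efficiency declines monotonically as the machine grows, which is the effect the Sun GRID users report.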
The patent mentions the use of no-ops inserted into cells to get around timing problems associated with having components run at different speeds, with processor co-ordination initially enforced by setting TTL-like time budgets for cell execution. My guess, however, is that advances in cell isolation and programming for asynchronous event handling have since rendered those solutions obsolete. I expect, therefore, that when the real thing appears it will fully support both the traditional GRID format for on-chip work and an asynchronous hypergrid for multi-assembly processes on the model Thinking Machines hoped to achieve with the hypercube-based Connection Machine in 1985 - and the NSA is rumoured to actually have built on the SPARC/SIMD based CM-5.
Either way, however, the OS for this machine is likely to offer both Linux compatibility at the low end and enormous scalability for those willing to modify their software - which is why, as I discuss in next week's column, I expect IBM and Toshiba to soon launch a new generation of Linux PCs built around the combination of this CPU with IBM software products like Lotus Workspace for Linux.