A key reason the enormous performance potential of IBM's Cell processor has yet to be realized in any substantive application is that programmers are currently caught between rocks and hard places when working with Cell. On the rocks side, IBM's compilers are good and getting better fast, but lots of critical functions simply can't be implemented effectively this way - the current PS3, for example, relies on a Sony/Nvidia "Reality Synthesizer" GPU that underperforms Cell's own theoretical graphics capability and adds an estimated $129 to the product cost, simply because the software needed to make game graphics run well on Cell isn't there yet.
A programmer could, of course, choose the hard places route instead: taking software-managed memory and execution parallelism into the application's own code, as the sketch below suggests - a decision very roughly comparable to taking on both load and page management for every host in an OpenGrid/OpenMP environment.
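To make those hard places concrete, here is a minimal sketch of what SPE-side code looks like when the application takes on its own memory management, written against the MFC DMA intrinsics from the Cell SDK's spu_mfcio.h as I understand them. The chunk size, the use of argp/envp to pass a data address and a chunk count, and the process_chunk() work function are illustrative assumptions, not anyone's production code.

    /* Sketch of the "hard places" route on Cell: an SPE sees only its 256KB
     * local store, so the program has to pull each block of data in from
     * main memory by DMA, work on it, and push the result back out - all
     * explicitly, with tags and waits managed by hand. */
    #include <spu_mfcio.h>

    #define CHUNK 4096   /* bytes per transfer; a single DMA is capped at 16KB */

    /* local-store buffer; the PPE is assumed to hand us a 128-byte aligned
     * effective address so the transfer alignment rules are satisfied */
    static volatile unsigned char buf[CHUNK] __attribute__((aligned(128)));

    static void process_chunk(volatile unsigned char *p, int n)
    {
        for (int i = 0; i < n; i++)          /* stand-in for real work */
            p[i] = (unsigned char)(p[i] + 1);
    }

    /* Standard SPU entry point: argp is assumed to be the effective address
     * of this SPE's slice of the data, envp the number of chunks in it. */
    int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
    {
        unsigned long long ea = argp;
        int n_chunks = (int)envp;

        for (int i = 0; i < n_chunks; i++) {
            unsigned long long src = ea + (unsigned long long)i * CHUNK;

            mfc_get(buf, src, CHUNK, 0, 0, 0);   /* main memory -> local store */
            mfc_write_tag_mask(1 << 0);
            mfc_read_tag_status_all();           /* block until the DMA lands */

            process_chunk(buf, CHUNK);

            mfc_put(buf, src, CHUNK, 0, 0, 0);   /* local store -> main memory */
            mfc_write_tag_mask(1 << 0);
            mfc_read_tag_status_all();
        }
        return 0;
    }

And that's the easy version: real code would double-buffer so DMA overlaps computation, worry about alignment and transfer-size limits, and coordinate six or more SPEs from the PPE side - which is exactly the load-and-page-management burden the OpenGrid/OpenMP analogy points at.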
There's a reason this stuff is so difficult - and it's almost as hard to get your head around as the problem itself. That reason is simple: nobody really understands how computational parallelism works for non-trivial tasks, and, in the absence of a good theoretical model, all our attempts to work the problem have been heuristic - an extended exercise in learning what seems to work by experience.
For an OpenGrid/OpenMP application this hasn't been much of a problem because most of the system's nominal capacity gets lost in communications delay and process management overheads - 50% efficiency on well-defined, highly repetitive tasks like the dense matrix multiply sketched below is considered pretty good. Cell, however, changes the focus, because the point of getting the grid hardware down to a single chip is to cut out most (>99%) of that wasted communications time, power use, and process management overhead - meaning that with the hardware working, it's now obvious that the problem lies in the software.
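For reference, dense matrix multiplication is about as friendly as parallel work gets: a minimal OpenMP version (sizes and names here are illustrative) is just one directive on top of the serial loops.

    /* Dense matrix multiply parallelized with a single OpenMP directive -
     * the kind of well-defined, highly repetitive task the 50% figure
     * refers to.  The runtime decides how iterations are scheduled and
     * where the data moves; the programmer controls almost none of it. */
    #include <omp.h>

    #define N 1024

    static double A[N][N], B[N][N], C[N][N];

    void matmul(void)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
    }

Everything hard - where the data lives, when it moves, which processor runs which iteration - is hidden in the runtime, and that's where the missing half of the nominal capacity goes.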
And the reason the software isn't there is that we fundamentally don't understand how concurrent integration across non-trivially parallel processes works.
Let me suggest an extreme example from human experience. Sometime in the early sixties, Eugene Ormandy was able to rehearse an augmented Philadelphia Orchestra using the full score for Glière's Ilya Murometz - controlling the orchestra while comparing what he expected to what he heard across something over 840,000 separate sound elements in 116 parallel tracks over a 93-minute period - and to remember every single mistake before hearing the tape.
We don't know how that works, and until we do serial computing will continue to beat parallel computing on both ease of use and achieved hardware efficiency - and that's the key reason IBM and others are working toward terahertz serial CPUs rather than betting everything on parallelism.