SC12 part II

This post is the second installment of my SC12 recap, which I started (several weeks late) here.

The most important bound on the hardware teams bring to SCC is power consumption. Were teams left to their own devices, the team that convinces its vendor partners to hand over the most iron would win hands down (or so one would expect). To keep the playing field level (or at least sane), the SCC committee caps the power consumption of each cluster, which makes power draw a major consideration for every entered team. So far, the cap has been two 13-amp circuits per cluster. These circuits are monitored by the contest admins throughout the competition, and a team that blatantly exceeds the power budget will be told to shut down its cluster and restart any running jobs.
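
For a rough sense of what that cap means in watts, here's a back-of-the-envelope sketch. The 120 V line voltage is my assumption about standard US show-floor circuits, not a number from the contest rules.

```python
# Back-of-the-envelope power budget from the two-circuit cap.
# 120 V is an assumed US nominal line voltage, not an SCC-specified figure.
circuits = 2
amps_per_circuit = 13
volts = 120

budget_watts = circuits * amps_per_circuit * volts
print(f"Total power budget: {budget_watts} W")  # 3120 W
```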

The first choice that gets made is that of CPU model. This really comes down to looking at Intel's and AMD's published spec sheets and comparing CPUs on the basis of flops per amp. Usually this means Intel wins and we run a midline chip with a low power profile, but there are always teams who try to run AMD chips for the power savings. There has been some interesting contemplation of deploying an ARM cluster for the extreme power savings, but so far no team has done so because of the performance hit.
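
The spreadsheet math behind that comparison is simple enough to sketch. The part names and spec figures below are placeholders for illustration, not real numbers for any particular Intel or AMD SKU.

```python
# Toy flops-per-amp comparison from published spec sheets.
# All figures here are made up for illustration, not real CPU specs.
candidates = {
    # name: (peak GFLOPS per socket, TDP in watts)
    "midline_low_power_chip": (150.0, 80.0),
    "flagship_chip":          (210.0, 130.0),
}

volts = 120  # assumed line voltage, so amps = watts / volts

for name, (gflops, tdp_watts) in candidates.items():
    amps = tdp_watts / volts
    print(f"{name}: {gflops / amps:.0f} GFLOPS per amp")
```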

The other question is that of GPUs. GPUs are ridiculously fast, so there are always people trying to exploit that speed to improve application performance. However, GPUs also suffer from a massive memory bottleneck. Because they typically connect over the PCI bus, the I/O latency of reading data from host memory is just evil. Also, many applications haven't been rewritten for CUDA or OpenCL, so unless an SCC team is somehow able to rewrite the applications for GPUs themselves (which has yet to happen despite persistent speculation), the benefit of having GPUs is bounded. That said, this is speaking in generalities about arbitrary applications. SCC issues two awards every year: one for the highest LINPACK score (a cluster performance benchmark) and one for the overall score across all the applications. As a result, every year there are one or two teams who build clusters designed to use GPUs to wreck at LINPACK but which won't do as well on the applications. The Russian team did it in 2011, and in 2012 the Chinese team from NUDT did so too. Both teams succeeded in taking the LINPACK prize, but neither achieved a competitive score in the overall competition.
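
To put a rough number on the host-memory bottleneck mentioned above, here's a crude model of when shipping data over PCIe for a GPU kernel actually pays off. The bandwidth, throughput, and problem-size numbers are illustrative guesses, not measurements from any of our hardware.

```python
# Crude model: is offloading one kernel to the GPU worth the PCIe transfer?
# All numbers below are illustrative guesses, not measured values.
bytes_to_transfer = 2 * 1024**3   # 2 GiB of input data
pcie_bandwidth = 6e9              # ~6 GB/s effective host<->device bandwidth
gpu_gflops = 1000.0               # sustained GPU throughput, GFLOPS
cpu_gflops = 100.0                # sustained CPU throughput, GFLOPS
flops_of_work = 5e11              # total floating-point ops in the kernel

transfer_time = bytes_to_transfer / pcie_bandwidth
gpu_time = transfer_time + flops_of_work / (gpu_gflops * 1e9)
cpu_time = flops_of_work / (cpu_gflops * 1e9)

# For kernels doing less arithmetic per byte moved, transfer_time dominates
# and the CPU comes out ahead despite its lower raw throughput.
print(f"GPU (incl. transfer): {gpu_time:.2f} s, CPU: {cpu_time:.2f} s")
```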

An interesting point one of my teammates made when criticizing this piece prior to publication is that, in terms of software efficiency, GPUs don't perform as well on LINPACK as CPUs do. The issue is that GPUs perform best with non-branching or barely branching code, and due to the structure of LINPACK, GPUs typically only achieve about 60% of their theoretical maximum. However, given the massive FLOPS advantage GPUs hold over CPUs, this price seems worth paying for the time being.
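
The arithmetic behind that trade-off looks something like the following. The peak numbers are plausible but invented for illustration; only the 60% efficiency figure comes from the discussion above.

```python
# Why ~60% LINPACK efficiency on a GPU can still crush a CPU running much
# closer to its peak. Peak figures are illustrative, not vendor numbers.
gpu_peak_gflops = 1300.0   # assumed theoretical double-precision peak, per GPU
gpu_efficiency = 0.60      # rough LINPACK efficiency quoted above
cpu_peak_gflops = 150.0    # assumed theoretical peak, per CPU socket
cpu_efficiency = 0.90      # CPUs usually run much nearer their peak on LINPACK

gpu_linpack = gpu_peak_gflops * gpu_efficiency   # ~780 GFLOPS
cpu_linpack = cpu_peak_gflops * cpu_efficiency   # ~135 GFLOPS
print(f"GPU: {gpu_linpack:.0f} GFLOPS vs CPU: {cpu_linpack:.0f} GFLOPS")
```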

As the applications change from year to year, the approach I recommend, and expect UT's team to take again for 2013, is to ask our hardware supplier for a fairly even GPU/CPU split with more metal than we expect to actually ship to Denver, then use our pre-contest window to test the applications, get a sense of how much of that hardware we can run at near-full load for the duration of the contest, and ship that.
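
As a sketch of how that pre-contest sizing might work: log per-node power while each application runs at full load, then see how many nodes fit under the two-circuit budget. The `power_log.csv` file and its format are hypothetical; real samples would come from whatever PDU or IPMI tooling the vendor setup provides.

```python
# Sketch: decide how many nodes fit under the power cap, given per-node power
# samples logged during full-load test runs. The file name and its
# node,watts column format are hypothetical.
import csv

BUDGET_WATTS = 2 * 13 * 120  # two 13 A circuits at an assumed 120 V

peak_per_node = {}
with open("power_log.csv") as f:
    for row in csv.DictReader(f):
        node, watts = row["node"], float(row["watts"])
        peak_per_node[node] = max(peak_per_node.get(node, 0.0), watts)

# Keep the cheapest nodes first, so the heaviest draws are the first cut.
total = 0.0
keep = []
for node, watts in sorted(peak_per_node.items(), key=lambda kv: kv[1]):
    if total + watts <= BUDGET_WATTS:
        keep.append(node)
        total += watts

print(f"Ship {len(keep)} nodes, projected peak draw {total:.0f} W")
```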