Google's April speaker series
[More photos on Flickr]
Earlier tonight I attended Google's April speaker series at their NYC offices. Before I go any further, I have to say that Susan Z., who coordinated this event, provided a very comfortable atmosphere: the staff were courteous and helpful despite the lines to get into the building, and there was an endless buffet of the legendary Google cuisine.
Luiz Barroso's talk was entitled "Watts, Faults, and Other Fascinating Dirty Words Computer Architects Can No Longer Afford to Ignore." He discussed the logistics of optimizing a data center, taking into account fault tolerance and energy management. But since he was presenting data based on Google's massively scaled operation, this data could be considered unique. (I would have liked to hear about the absolute size of his data operation, as the figures were given only in percentages, but I didn't expect such proprietary information to be released.) To emphasize this uniqueness, he said that he had collected all the published data about disk failure and was able to read it all within one day.
Luiz started the presentation by saying that faults are a growing concern as progress is made in the design of chips, memory, and hard drives. Bit error rates are expected to increase 10x between now and 16nm chips, and disk failure rates may be 5x higher than those rated by the manufacturer.
Regarding the race to achieve higher MHz ratings for chips, he said that Moore's Law is all about more transistors, not necessarily about speed. Unmodified software will run faster, but several tricks are used to optimize performance further: parallel processing, speculation, and predication, in order of diminishing effectiveness and increasing wasted energy.
With all the effort to push MHz, little regard has been paid to power consumption, to the point where parts become too power-inefficient and temperature management becomes a problem. Also, the cost of watts eventually surpasses the cost of the hardware itself, making energy the main expense for the data center.
Even Congress is attempting to address the issue of energy wasted by inefficient servers with HR 5646 mandating a study into "analyzing the rapid growth and energy consumption of computer data centers by the Federal Government and private enterprise."
The component that consumes the most power is the power supply itself. Luiz mentioned that PC makers avoided power-efficient power supply units to save costs. He now feels, though, that more efficient units can be built for the same cost. To use his words, "this is low-hanging fruit."
Multi-core processors offer another way to increase speed, but much work is needed to optimize software for them. Thread-optimized software will take time to catch up.
Luiz then showed a photo of a Google server from 1999. The disks did not have a parity check, so fault tolerance had to be done in software. The main problem with this approach is that faults remain unnoticed until a sequence of faults produces unacceptable results, so there has to be a way of monitoring worsening conditions. Google now uses monitoring in the form of a "System Health" infrastructure, which talks to servers frequently and stores time-series data forever.
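The basic idea of doing fault tolerance in software, when the hardware has no parity checking, is to store a checksum alongside each block of data and verify it on every read. This is a minimal sketch of that idea (not Google's actual implementation; the block contents and fault are invented for illustration):

```python
import hashlib

def checksum(block: bytes) -> str:
    """Compute a checksum for a data block so corruption can be detected in software."""
    return hashlib.sha256(block).hexdigest()

# Store a checksum alongside each block when writing.
blocks = [b"record one", b"record two", b"record three"]
stored = [(b, checksum(b)) for b in blocks]

# Simulate a silent disk fault flipping characters in one block.
stored[1] = (b"record tw0", stored[1][1])

# On read, verify each block; mismatches reveal faults that hardware
# without parity checking would let pass silently.
corrupted = [i for i, (b, c) in enumerate(stored) if checksum(b) != c]
print(corrupted)  # → [1]
```

The catch, as Luiz noted, is that a check like this only fires when the data is actually read, which is why continuous monitoring matters.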
Next he looked at the task of predicting disk failure, so that preemptive action can be taken where possible. The common wisdom is that disk failure rates are less than 1% per year, and that higher temperature increases failure rates. For a predictive failure model, Google uses SMART (Self Monitoring Analysis and Reporting Technology), which collects signals that may detect bad media surface, bad servo components, electronic/transmission problems, and vibration.
His data have shown that drives with scan errors are 10x more likely to fail. Regarding temperature, his data don't support the claim that higher temps correlate with higher failure rates. But, he said, this might be because his most reliable drives run at higher temps, confounding the results.
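Combining the two figures above (the ~1%/year common-wisdom baseline and the ~10x multiplier for drives with scan errors) into a crude flagging rule might look like this. The record fields and numbers here are hypothetical, just to show the shape of such a model:

```python
# Hypothetical SMART-style records: drive id and cumulative scan error count.
drives = [
    {"id": "disk-a", "scan_errors": 0},
    {"id": "disk-b", "scan_errors": 3},
    {"id": "disk-c", "scan_errors": 0},
]

BASELINE_ANNUAL_FAILURE_RATE = 0.01  # the "common wisdom" ~1%/year figure
SCAN_ERROR_MULTIPLIER = 10           # drives with scan errors were ~10x likelier to fail

def estimated_failure_rate(drive: dict) -> float:
    """Very rough per-drive annual failure estimate from a single SMART signal."""
    if drive["scan_errors"] > 0:
        return BASELINE_ANNUAL_FAILURE_RATE * SCAN_ERROR_MULTIPLIER
    return BASELINE_ANNUAL_FAILURE_RATE

at_risk = [d["id"] for d in drives
           if estimated_failure_rate(d) > BASELINE_ANNUAL_FAILURE_RATE]
print(at_risk)  # → ['disk-b']
```

A single-signal rule like this is exactly what Luiz found wanting at the level of individual drives, though it may still say something about population-wide trends.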
Overall, from his experience, he feels that SMART is inadequate to predict drive failures. It might be useful, though, in predicting population-wide trends.
The rest of his talk was about power provisioning, which I think can be summarized by saying that trying to ensure there are enough watts for all the servers to run at peak level is a waste of money. It's more important to gauge the power use of systems that are idling, or working at lower levels, most of the time. Supplying too few watts could trip the circuit breakers, but too many would be cost-prohibitive. This reminded me of the reserve system in banking, where banks need to keep a certain amount of cash on hand for possible withdrawals, but not all that would be needed if by some chance all the depositors decided to withdraw their funds at once.
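The arithmetic behind that point is simple to illustrate. Here's a toy rack of servers (the wattage numbers and the 20% safety margin are my own invented assumptions, not figures from the talk) comparing provisioning for the sum of nameplate peaks against provisioning for typical observed draw plus a margin:

```python
# Hypothetical rack of servers: nameplate peak watts vs. typical observed draw.
servers = [
    {"peak_watts": 500, "typical_watts": 200},
    {"peak_watts": 500, "typical_watts": 250},
    {"peak_watts": 500, "typical_watts": 180},
    {"peak_watts": 500, "typical_watts": 220},
]

# Naive provisioning: assume every server could peak simultaneously.
provision_for_peak = sum(s["peak_watts"] for s in servers)

# Oversubscribed provisioning: measured aggregate draw plus a safety
# margin, since in practice the servers never all peak at once.
observed_draw = sum(s["typical_watts"] for s in servers)
SAFETY_MARGIN = 1.2
provision_oversubscribed = observed_draw * SAFETY_MARGIN

print(provision_for_peak)               # → 2000
print(round(provision_oversubscribed))  # → 1020
```

The gap between the two numbers is the wasted capacity (and money); the risk, as in the banking analogy, is the rare "run on the bank" moment when everything peaks at once and the breakers trip.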