Harnessing the power of Google’s cloud: Google BigQuery Analytics book extract

This is an edited extract from Google BigQuery Analytics, by Jordan Tigani and Siddartha Naidu, published August 2014 by Wiley, £30.99.

When you run your queries via BigQuery, you put a giant cluster of machines to work for you. Although the BigQuery clusters represent only a small fraction of Google’s global fleet, each query cluster is measured in the thousands of cores. When BigQuery needs to grow, there are plenty of resources that can be harnessed to meet the demand.

If you want to, you could probably figure out the size of one of BigQuery’s compute clusters by carefully controlling the size of data being scanned in your queries. The number of processor cores involved is in the thousands, the number of disks in the hundreds of thousands. Most organizations don’t have the budget to build at that kind of scale just to run some queries over their data. The benefits of the Google cloud go beyond the amount of hardware that is used, however. A massive datacenter is useless unless you can keep it running.
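If you wanted to try that experiment, a minimal sketch might time full-table scans over tables of increasing size and watch for the point where elapsed time starts growing with the amount of data scanned. The project and table names below are placeholders, and the example assumes the current google-cloud-bigquery Python client rather than anything specific to the book:

# A rough experiment, not a definitive measurement: time full-table scans over
# tables of increasing size and watch when elapsed time starts to grow with
# the amount of data scanned. Project and table names here are placeholders.
import time
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # assumes default credentials

for table in ["demo.logs_100gb", "demo.logs_1tb", "demo.logs_10tb"]:  # hypothetical tables
    start = time.monotonic()
    job = client.query(f"SELECT COUNT(*) FROM `{table}` WHERE payload != ''")
    job.result()  # block until the query finishes
    elapsed = time.monotonic() - start
    gb = job.total_bytes_processed / 1e9
    # While elapsed time stays roughly flat as the scan grows, the cluster has
    # not saturated; the ratio gives a lower bound on aggregate scan throughput.
    print(f"{table}: {gb:.0f} GB in {elapsed:.1f}s (~{gb / elapsed:.0f} GB/s)")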

If you have a cluster of 100,000 disks, some reasonable number of those disks is going to fail every day. If you have thousands of servers, some of the power supplies are going to die every day. Even if you have highly reliable software running on those servers, some of them are going to crash every day.
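To put rough numbers on that, here is a back-of-envelope calculation assuming an annualized disk failure rate of about 2 percent, a typical published figure; real rates vary by drive model and age:

# Back-of-envelope only: expected daily disk failures in a 100,000-disk fleet,
# assuming a ~2% annualized failure rate (real rates vary by model and age).
disks = 100_000
annual_failure_rate = 0.02
print(f"~{disks * annual_failure_rate / 365:.1f} disk failures per day")  # ~5.5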

Keeping a datacenter up and running requires a lot of expertise and know-how. How do you maximize the life of a disk? How do you know exactly which parts are failing? How do you know which crashes are due to hardware failures and which to software? Moreover, you need software that is written to handle failures at any time and in any combination. Running in Google’s cloud means that Google worries about these things so that you don’t have to.

There is another key factor to the performance of Google’s cloud that some of the early adopters of Google Compute Engine have started to notice: It has an extremely fast network. Parallel computation requires a lot of coordination and aggregation, and if you spend all your time moving the data around, it doesn’t matter how fast your algorithms are or how much hardware you have. The details of how Google achieves these network speeds are shrouded in secrecy, but the super-fast machine-to-machine transfer rates are key to making BigQuery fast.

Cloud data warehousing

Most companies are accustomed to storing their data on-premises or in leased datacenters, on hardware that they own or rent. Fault tolerance is usually handled by adding redundancy within a machine, such as extra power supplies, RAID disk controllers, and ECC memory. All these things add to the cost of the machine but don’t actually distance you from the consequences of a hardware failure. If a disk goes bad, someone has to go to the datacenter, find the rack with the bad disk, and swap it out for a new one.

Cloud data warehousing offers the promise of relieving you of the responsibility of caring about whether RAID-5 is good enough, whether your tape backups are running frequently enough, or whether a natural disaster might take you offline completely. Cloud data warehouses, whether Google’s or a competitor’s, offer fault tolerance, geographic distribution, and automated backups.

Ever since Google decided to go with exclusively scale-out architectures, it has focused on building software that expects frequent hardware failures. There are stories about Google teams running mission-critical components who don’t even bother to free memory; the number of bugs and performance problems associated with memory management is too high to be worth it. Instead, they just let the process run out of memory and crash, at which point it gets automatically restarted. Because the software has been designed not only to handle but to expect that type of failure, a large class of errors is virtually eliminated.
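In its simplest form, that crash-and-restart pattern is just a supervisor loop. The sketch below is an illustration only, not Google’s internal tooling, and the worker command is a stand-in for any supervised process:

# An illustration of the crash-and-restart pattern only; this is not Google's
# internal tooling, and worker.py is a stand-in for any supervised process.
import subprocess
import time

WORKER_CMD = ["python", "worker.py"]  # hypothetical worker process

while True:
    proc = subprocess.run(WORKER_CMD)
    # Any exit, including an out-of-memory crash, lands here: record it and
    # restart rather than trying to recover in-process.
    print(f"worker exited with code {proc.returncode}; restarting in 1s")
    time.sleep(1)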

For the user of Google’s cloud, this means that the underlying infrastructure pieces are extraordinarily failure-resistant and fault-tolerant. Your data is replicated to several disks within a datacenter and then replicated again to multiple datacenters. Failure of a disk, a switch, a load balancer, or a rack won’t be noticeable to anyone except a datacenter technician. The only kind of hardware failure that would escalate to the BigQuery operations engineers would be if someone hit the big red off button in a datacenter or if somebody took out a fiber backbone with a backhoe. This type of failure still wouldn’t take BigQuery down, however, since BigQuery runs in multiple geographically distributed datacenters and will fail over automatically.

Of course, this is where we have to remind you that all software is fallible. Just because your data is replicated nine ways doesn’t mean that it is completely immune to loss. A buggy software release could cause data to be inadvertently deleted from all nine of those disks. If you have critical data, make sure to back it up.
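One straightforward way to take such a backup is to export the table to Cloud Storage on a schedule. Here is a minimal sketch using the current google-cloud-bigquery Python client, with placeholder project, table, and bucket names:

# A sketch of one backup approach: export a BigQuery table to Cloud Storage.
# Project, dataset, table, and bucket names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

config = bigquery.ExtractJobConfig()
config.destination_format = "AVRO"  # Avro keeps the schema alongside the data

job = client.extract_table(
    "your-project-id.analytics.events",       # table to back up (hypothetical)
    "gs://your-backup-bucket/events-*.avro",  # wildcard lets large tables split into shards
    job_config=config,
)
job.result()  # wait for the export job to complete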

Many organizations are understandably reluctant to move their data into the cloud. It can be uncomfortable to keep your data somewhere you don’t control. If there is data loss or an outage, all you can do is take your business elsewhere; there is no one except support staff to yell at and little you can do to prevent the problem from happening again.

That said, the specialized knowledge and operational overhead required to run your own hardware are substantial and only growing. The advantages of scale that Google or Amazon enjoy only get bigger as they get better at managing their datacenters and improving their data warehousing techniques. It seems likely that the days when most companies run their own IT hardware are numbered.
