
High Performance Compute or High Performance Computing (HPC) most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get from a typical desktop computer or workstation, in order to solve large problems in science, engineering or business.
Today we are at the point where we not only have the technology in the cloud needed to run HPC workloads, but also have real-world use cases of companies who are successfully doing it.
Here is our breakdown of HPC capabilities across all three major public cloud providers.
Amazon Web Services
Amazon Web Services was the first major cloud provider to offer services tailored to the unique needs of the High Performance Computing (HPC) community.
This gave AWS an early set of adopters who supplied feedback and enabled Amazon to continually expand its capabilities to support a wider range of HPC workloads. Today, AWS offers platform features, case studies and marketplace partners that enable a range of specialized HPC workloads:
- 3D Rendering & Special Effects
As more consumer entertainment utilizes special effects, HPC has become a common tool for rendering video and introducing digital elements that are difficult to film and can be made more vivid through graphic arts.
- EDA
Electronic Design Automation is a common HPC application used to simulate performance and failures within silicon chips during the design phase. Many application vendors, including Cadence and Synopsys, provide highly scalable tools that execute in parallel across systems for designing and analyzing complex circuits and chips.
- Genomics
While some genomics workloads fit the Big Data space more than HPC, the genomics field continues to leverage complex HPC platforms. Many genomics firms today integrate Big Data and HPC technologies into a single analysis pipeline.
- CFD
Computational Fluid Dynamics was an early HPC workload and has continued to gain momentum in the industrial field as product design has become more virtual. These workloads are mathematically intensive.
- Risk Modeling & Back Testing
Specific to the Financial Services domain, many firms continually analyze the organization's exposure against current and future trades. New models are tested against past data sets to see how changes would potentially impact the market and the organization's positions.
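The back-testing idea above can be sketched with a toy Monte Carlo risk model in Python. This is an illustration only: the random-walk P&L model, the parameters, and the 95% Value-at-Risk cutoff are assumptions for the example, not a production risk methodology.

```python
import random

def simulate_pnl(n_paths, horizon, mu, sigma, seed=0):
    """Simulate final P&L for n_paths scenarios of a toy random-walk position."""
    rng = random.Random(seed)  # fixed seed so the back-test is reproducible
    finals = []
    for _ in range(n_paths):
        pnl = 0.0
        for _ in range(horizon):
            pnl += rng.gauss(mu, sigma)  # one period's P&L increment
        finals.append(pnl)
    return finals

def value_at_risk(pnls, confidence=0.95):
    """Loss threshold exceeded in only (1 - confidence) of scenarios."""
    ordered = sorted(pnls)
    idx = int((1 - confidence) * len(ordered))
    return -ordered[idx]  # VaR is reported as a positive loss figure

pnls = simulate_pnl(10_000, horizon=10, mu=0.0, sigma=1.0)
var95 = value_at_risk(pnls)
```

Each scenario is independent, which is exactly why this class of workload parallelizes so well across HPC nodes: scenarios can be farmed out to thousands of cores and the results aggregated at the end.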
As part of AWS’s investment in HPC capabilities, there are a variety of technologies available to accelerate performance, simplify configuration and automate monitoring for HPC workloads on AWS:
- CfnCluster
A set of scripts that leverage CloudFormation for the rapid setup and configuration of HPC clusters in AWS. CfnCluster enables administrators to automate the deployment of compute resources for HPC workloads and to ensure that the proper environment and libraries are in place to support application execution.
- Placement Groups
Placement groups are groupings of EC2 instances that provide consistently lower latency between nodes than would be available if instance placement were decided dynamically. Placement groups enable more efficient communication between nodes; only certain instance types can be assigned to them.
- Enhanced Networking
Single root I/O virtualization (SR-IOV) enables more efficient communication from EC2 instances and allows higher packet-per-second rates in network communication.
- GPU Instances
AWS provides specialized instances with attached GPUs, providing high-density core counts for parallel processing using CUDA or OpenCL.
- CloudFormation
CloudFormation allows administrators and engineers to automate the setup of AWS resources, including networking, VPCs, and EC2 instances. CloudFormation has programmatic methods and well-documented APIs that can be integrated with existing HPC workload schedulers to quickly create highly customized environments for executing workloads.
- CloudWatch
CloudWatch provides a centralized service to monitor and respond to service outages and to collect logs. CloudWatch is a key operational tool in a dynamic cloud environment, ensuring that application developers and engineers have a central repository for event and log information as services start and stop over time.
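As a rough sketch of how two of these capabilities combine, a minimal CloudFormation template can declare a cluster placement group and launch an instance into it. The AMI ID and instance type below are placeholders, to be replaced for a real region and workload:

```yaml
# Minimal sketch: a cluster placement group plus one compute node in it.
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  HpcPlacementGroup:
    Type: AWS::EC2::PlacementGroup
    Properties:
      Strategy: cluster
  ComputeNode:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: ami-12345678          # placeholder AMI for your region
      InstanceType: c4.8xlarge       # a placement-group-capable instance type
      PlacementGroupName: !Ref HpcPlacementGroup
```

Because the template is declarative, a scheduler or CfnCluster-style script can stamp out many such compute nodes and tear them down when a job completes.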
In addition to the native HPC capabilities that AWS delivers out of the box, the AWS Marketplace provides a variety of third-party technologies that are tested and certified to run on AWS.
These technologies can be deployed quickly and have simple elastic pricing models:
- Intel Lustre
Lustre is a very common parallel filesystem in the HPC space. Intel provides a pre-packaged AMI with Lustre installed, which can be used to quickly deploy a fully supported version of Lustre on AWS.
- FSMLabs TimeKeeper
Many HPC workloads, particularly in the Financial Services industry, require highly accurate timestamps to operate and to ensure data integrity. FSMLabs offers its TimeKeeper product in the AWS Marketplace to help administrators deploy highly accurate time synchronization.
- Univa Grid Engine
Univa Grid Engine, a leading workload scheduler for HPC, is available as a preloaded AMI on AWS. Univa provides AMIs for both head-node and compute-node variants with cloud-based usage pricing.
Google Cloud Platform
Google takes a slightly different approach to the HPC market than other cloud vendors. In addition to supporting IaaS capabilities for compute resources, it provides advanced machine learning capabilities that it often also refers to as HPC. This is uncommon in the marketplace, as machine learning applies to a different set of domains than HPC. We will discuss Google's capabilities in both areas for comparison.
While some customers do run HPC workloads on Google, for jobs like product design, rendering, special effects and other modeling of the physical world, Google does not offer the same breadth of capabilities as other providers. Google has a set of IaaS capabilities, including high CPU count instances, that can be used for HPC workloads.
Google Genomics supports genomics analysis workloads, a very common application for HPC, with significantly less upfront configuration and setup than traditional tools.
Google’s differentiation lies in the advanced computing capabilities it provides around machine learning. Machine learning is commonly leveraged by applications involved in analyzing human interactions, including social media, image analysis and social influences. Google speaks to machine learning capabilities as HPC because of the highly scalable nature of their ML implementations and the specialized hardware they leverage to provide high levels of performance. Out of the advanced work Google is doing for ML, two unique sets of technology are at the core:
- TensorFlow
TensorFlow is an open source set of libraries for analyzing data flow graphs. Google developed this capability as part of its machine learning work and open sourced the tool as a generic analysis platform that can be applied to many different domains and problem sets.
- TPU
Google designed and deployed Tensor Processing Units (TPUs) inside its data centers to accelerate ML workloads. The TPU is a custom ASIC optimized specifically for ML workloads.
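To make the data-flow-graph model concrete, here is a toy graph evaluator in Python. It illustrates the concept of expressing a computation as a graph of operations and then evaluating it; it is an illustration only, not TensorFlow's actual API:

```python
# Toy data flow graph: each node holds an operation and its input nodes.
# Evaluating the root recursively evaluates the whole graph.
class Node:
    def __init__(self, op, *inputs):
        self.op = op          # callable applied to the evaluated inputs
        self.inputs = inputs  # upstream nodes feeding this one

def constant(v):
    return Node(lambda: v)

def add(a, b):
    return Node(lambda x, y: x + y, a, b)

def mul(a, b):
    return Node(lambda x, y: x * y, a, b)

def evaluate(node):
    # Evaluate dependencies first, then apply this node's operation.
    args = [evaluate(i) for i in node.inputs]
    return node.op(*args)

# Build the graph for (2 + 3) * 4, then run it.
graph = mul(add(constant(2), constant(3)), constant(4))
result = evaluate(graph)  # 20
```

Representing work this way is what lets a framework like TensorFlow schedule independent subgraphs in parallel and map operations onto specialized hardware such as TPUs.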
Google also continues to build its partner ecosystem, with some partners focused on HPC. One partner with key capability is CycleComputing, which provides the ability to easily schedule HPC workloads to run on a variety of cloud providers, including Google.
While AWS and Azure provide rich sets of IaaS functionality, complemented by HPC specific technologies, Google has taken a path of PaaS capabilities with Google Genomics and Google Machine Learning. This allows organizations to analyze large, complex data sets without having to deploy, configure and manage IaaS services. Google’s approach is unique and will inevitably continue to be expanded to additional domains.
Microsoft Azure
Azure refers to HPC-centric workloads as Big Compute. This distinguishes processor- and interconnect-intensive HPC workloads from Big Data workloads, which have very different communication patterns. The Big Compute capabilities on Azure cover several specific domains.
- Engineering Design and Simulation
Simulations, including finite element analysis, structural analysis and computational fluid dynamics, which commonly support product design and validation.
- Genomics Research
Workloads that enable researchers to evaluate ever-larger sets of population and genomic data, accelerating time to market for new treatments and diagnostic routines.
- Financial Risk Modeling
Empowers financial organizations to quickly assess risk, make efficient decisions and ensure compliance with industry regulations.
- Rendering
Rapidly scale resources to support the production of special effects for movies, and empower designers to model products in new and interactive ways.
Azure has invested to ensure specific technologies are available for supporting HPC workloads. These technologies are focused on rapid deployment, high performance and scalability.
Some key capabilities include:
- Hybrid HPC Pack
Many HPC users start with hybrid models and later grow to native cloud deployments for all HPC resources. The Hybrid HPC Pack from Azure allows a head node to be configured on-premises and to distribute jobs to Azure compute nodes. This model lets large workloads scale very quickly while minimizing cost, since fewer nodes have to be purchased for on-premises use.
- RDMA and MPI
Remote Direct Memory Access (RDMA) allows very high speed communication between nodes by using lighter-weight protocols. Coupling RDMA with the Message Passing Interface (MPI) included with the Microsoft HPC Pack enables highly efficient, low-latency communication between nodes.
- Azure Resource Manager
Resource Manager allows the creation of templates that span multiple Azure services to automate deployment, configuration and monitoring. Resource Manager is a key component of any HPC deployment on Azure, ensuring consistency in deployed instances and monitoring for failed resources that require re-provisioning.
- Future GPU instances
Graphics Processing Units (GPUs) provide very large core counts for executing certain workloads in parallel, and many workloads can benefit from GPU acceleration. To support these workloads, Microsoft has announced the future availability of GPU instances on Azure.
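The message-passing pattern that MPI formalizes (scatter work to ranks, compute in parallel, gather results) can be sketched with a toy Python example using threads and queues. This is an illustration of the communication pattern only; a real MPI job would use an MPI library over RDMA-capable interconnects, not Python threads:

```python
import queue
import threading

def worker(rank, inbox, outbox):
    # Each "rank" receives one unit of work, computes, and reports back,
    # standing in for a compute node in an MPI job.
    data = inbox.get()
    outbox.put((rank, data * data))

outbox = queue.Queue()
inboxes = [queue.Queue() for _ in range(4)]
threads = [
    threading.Thread(target=worker, args=(rank, inboxes[rank], outbox))
    for rank in range(4)
]
for t in threads:
    t.start()
for rank, inbox in enumerate(inboxes):
    inbox.put(rank + 1)              # "scatter": send each rank its input
for t in threads:
    t.join()
results = sorted(outbox.get() for _ in range(4))  # "gather" the answers
# results == [(0, 1), (1, 4), (2, 9), (3, 16)]
```

In a real cluster, the cost of the send/receive steps dominates at scale, which is why RDMA's ability to move data between nodes without involving the remote CPU matters so much for tightly coupled workloads.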
Microsoft has a large set of industry software vendors that sell commercial applications tested and supported for running on Windows-based hosts. These applications are commonly run in Azure, allowing customers to rapidly scale capacity as user demand changes. Azure provides a scalable platform for executing HPC workloads, with additional capabilities to ensure automated management and high performance.