What is GPU Computing?

GPU Accelerated Computing is Revolutionising High Performance Computing (HPC)

Accelerated computing is revolutionising high performance computing (HPC). It is now widely accepted that systems with accelerators deliver the highest performance and the most energy-efficient computing for HPC today. The recent announcement of the U.S. Department of Energy’s Summit and Sierra supercomputers highlights how critical accelerators are in the industry’s pursuit of exascale computing.

We’d like to strip away the promises and hype and share some facts to consider when evaluating options for accelerated computing. Specifically, two notions are not based on fact: that Intel’s Xeon Phi accelerator can deliver acceptable application performance compared to a GPU by simply recompiling and running code natively on Xeon Phi, and that performance optimisation is easier on a Xeon Phi than on a GPU.

FACT: A GPU is significantly faster than Intel's Xeon Phi on real HPC applications.
Speeding time to result for key science applications by 2x over Xeon Phi.
[Chart: NVIDIA GPU is up to 4x faster than Xeon Phi]

HPC is all about application performance, and GPUs have proven to offer superior performance over CPUs. In applications representing various scientific workloads, shown in the chart above, GPUs deliver a speedup of approximately 2.5x to 7x over CPUs. Although Intel's Xeon Phi can be optimized to outperform a CPU, GPU performance remains on average 2x to 5x faster than the highest-end Knights Corner.

Organization                    Application                                GPU Speed-up over Xeon Phi
Tokyo Institute of Technology   CFD Diffusion                              2.6x
Xcelerit                        Monte-Carlo LIBOR Swap Pricing             2.2x - 4x
Georgia Tech                    Synthetic Aperture Radar                   2.1x
CGGVeritas                      Reverse Time Migration                     2.0x
Paralution                      BLAS & SpMV                                2.0x
Univ. of Wisconsin-Madison      WRF (Weather Forecasting)                  1.8x
University Erlangen-Nuremberg   Medical Imaging: 3D Image Reconstruction   7x
Delft University                Drug Discovery                             3x

Independent results show GPUs outperforming Xeon Phi by roughly 2x or more, ranging from 1.8x to 7x in the cases listed above. (Data from January 2014)
Today, more than 200 applications across a wide range of fields are GPU-accelerated.
FACT: “Recompile & Run” on Xeon Phi actually slows down your application.
The notion that developers can simply “recompile and run” applications on Intel’s Xeon Phi, without any change to their CPU code, is attractive but misleading. The resulting performance is usually much slower than CPU performance, literally the opposite of acceleration.
[Chart: “Recompile and run” on Xeon Phi slows down application performance]
A simple recompile and run on Xeon Phi can work, but codes run much slower than on the CPU.
System and configuration details [2] (Data from August 2013)

While a simple recompile to run natively on Xeon Phi may work on many codes, doing so decelerates the application performance compared to CPU – up to 4x slower on DOE mini-applications as shown above.

“Recompile and run” faces a host of technical challenges as described in the NVIDIA blog post “No Free Lunch for Intel MIC (or GPUs)”, including Amdahl’s Law for serial portions of the code. Because of the poor serial performance of the Xeon Phi cores (based on an old Pentium design) compared to the modern CPU cores, the serial portion of codes run natively on a Xeon Phi can run an order of magnitude slower.
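
The Amdahl's Law argument above can be made concrete with a small model. The sketch below is illustrative only: the function name `relative_speed`, the 90% parallel fraction, the 20x parallel speedup, and the exact 10x serial slowdown are assumptions chosen for the example, not measurements from the benchmarks on this page.

```python
def relative_speed(parallel_fraction, parallel_speedup, serial_slowdown=1.0):
    """Speed relative to a CPU-only run (values above 1.0 mean faster).

    Amdahl-style model, with the CPU run time normalised to 1.0:
      - the serial part takes (1 - p) on the CPU and is stretched by
        `serial_slowdown` when it runs on slower accelerator cores;
      - the parallel part takes p on the CPU and is divided by
        `parallel_speedup` on the accelerator.
    All inputs are hypothetical, for illustration only.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if parallel_speedup <= 0.0 or serial_slowdown <= 0.0:
        raise ValueError("speedup and slowdown factors must be positive")
    serial_time = (1.0 - parallel_fraction) * serial_slowdown
    parallel_time = parallel_fraction / parallel_speedup
    return 1.0 / (serial_time + parallel_time)


# Assumed workload: 90% parallel, accelerator runs that part 20x faster.
# Native run: the serial 10% also runs on the accelerator, 10x slower.
native = relative_speed(0.9, 20.0, serial_slowdown=10.0)
# Offload run: the serial 10% stays on the host CPU at full speed.
offload = relative_speed(0.9, 20.0, serial_slowdown=1.0)
```

Under these assumed numbers, the native run ends up slightly slower than the CPU alone even though its parallel part is 20x faster, because the slowed-down serial 10% dominates the run time. Keeping the serial part on the host CPU restores a net speedup.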

In practice, a developer must first work to get the code to recompile on Xeon Phi, then put in effort to refactor and optimize it, just to get back to performance parity with CPUs.

At the end of the day, it takes some effort to extract parallelism, whether you want to accelerate with Xeon Phi or GPU. At best, “recompile and run” is a mildly convenient first step for developers; at worst, an attractive claim destined to disappoint.
FACT: Programming for a GPU and for Xeon Phi requires similar effort, but the results are significantly better on a GPU.
Same optimization techniques. Same developer effort. 2x faster acceleration on GPU.
Method                      GPU                       Xeon Phi
Libraries                   CUDA Libraries + others   Intel MKL + others
Directives                  OpenACC                   OpenMP + Phi Directives
Native Programming Models   CUDA C                    Vector Intrinsics
Developers use libraries, directives, or native programming models to program accelerators and optimize for performance. (Data from August 2013)

GPUs and Intel’s Xeon Phi differ in some ways, but both are parallel processors. Developers need to put in similar effort and use similar optimization techniques to expose massive amounts of parallelism, whether on Xeon Phi or GPU.

As shown in the table above, a developer uses the same three methods to accelerate their code – libraries, directives, and native programming models like CUDA C for GPU or vector intrinsics on Xeon Phi.

And the programming efforts for Xeon Phi and GPU are more alike than most people realize.

Below, an N-body kernel illustrates that comparable optimization techniques and effort are required for either accelerator. While the code changes are basically the same, performance on the GPU significantly outpaces that of Xeon Phi.
[Chart: Tesla K20 GPU is 11x faster than Xeon Phi]
A simple N-body code comparison shows that similar optimization techniques must be used, but the GPU is significantly faster. System and configuration details [3] (Data from August 2013)
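
For readers unfamiliar with the workload, here is a minimal sketch of what an all-pairs N-body kernel computes. It is not the benchmark code referred to above: the function name `nbody_accelerations`, the softening value, and the choice of G = 1 are assumptions for illustration, and plain Python is used only to show the structure, not to indicate performance.

```python
import math


def nbody_accelerations(positions, masses, softening=1e-3):
    """All-pairs gravitational accelerations, with G = 1 (illustrative units).

    positions: list of (x, y, z) tuples; masses: list of floats.
    `softening` must be positive. It keeps the i == j term finite, so that
    term contributes exactly zero and needs no branch in the inner loop.
    """
    n = len(positions)
    accelerations = []
    # Each iteration of the outer loop is independent of the others. This is
    # the parallelism an accelerator exploits: one body per GPU thread, or
    # one body per SIMD lane on a vector processor.
    for i in range(n):
        xi, yi, zi = positions[i]
        ax = ay = az = 0.0
        for j in range(n):
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            dist_sq = dx * dx + dy * dy + dz * dz + softening * softening
            inv_dist_cubed = 1.0 / (dist_sq * math.sqrt(dist_sq))
            scale = masses[j] * inv_dist_cubed
            ax += dx * scale
            ay += dy * scale
            az += dz * scale
        accelerations.append((ax, ay, az))
    return accelerations


# Two unit masses one unit apart attract each other with equal and
# opposite accelerations.
two_body = nbody_accelerations([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0])
```

The work grows as N squared while the data grows as N, so the kernel is compute-bound and maps naturally onto wide parallel hardware. On either accelerator, the tuning effort typically goes into how the inner loop reads positions and masses from memory.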

You can port easily, but the things you do in CUDA to vectorise your code still have to be done for Phi.

Dr. Karl Schultz
Director of Scientific Applications at Texas Advanced Computing Center (TACC)
Source: HPCWire, May 17, 2013

Our GPU codes are quite similar to the Xeon Phi codes, except for replacing SIMD operations with SIMT operations.


Results gathered on Intel’s Xeon Phi were surprisingly disappointing… It took quite some effort to create solutions with good performance due to vectorization tuning, despite that the Xeon Phi is said to be easily programmable.

While getting a program running on Xeon Phi is easy, I found that it is easier with CUDA and NVIDIA GPUs to achieve high sustained performances for Lattice Boltzmann applications.

Dr. Sebastiano Fabio Schifano
Department of Mathematics and Informatics, University of Ferrara

Once you see the facts, a better understanding of accelerated computing emerges. Today, a GPU provides double the performance for essentially the same developer effort, which makes GPUs the logical choice for accelerating parallel code. This may be part of the reason scientific researchers have published with GPUs more than 10:1 over Xeon Phi this year [4], and why NVIDIA GPUs are favoured more than 20:1 over Xeon Phi in HPC systems today [5].


Footnotes on Benchmark Configurations:
AMBER: SPFP-Cellulose_production_NPT, 1x E5-2697v2 + Xeon Phi 7120P, 1x E5-2697v2 @ 2.70GHz + Tesla K40
MiniMD: KokkosArray, LJ forces, 864k atoms, double precision, 2x Xeon E5-2667 + Xeon Phi 7120, 2x Xeon E5-2667 + Tesla K40
Monte Carlo RNG DP: European option pricing, 2x Intel® Xeon® Processor E5-2697 v3 + Tesla K40 GPU, Intel provided Xeon Phi performance results on their website
tHogbomClean: 2x Xeon E5-2697 v2 + Xeon Phi 7120, 2x Xeon E5-2697 v2 + Tesla K40c
Binomial Options SP: 2x Xeon Processor E5-2697 v3 + Tesla K40 GPU, Intel provided Xeon Phi performance results on their website
NAMD: APOA1, 2x Xeon E5-2697v2 + Xeon Phi 7120, 2x Xeon E5-2697v2 + Tesla K40
STAC-A2: Warm Greek, 2x E5-2699v3 CPUs + Xeon Phi 7120A, 2x Intel Xeon E5-2690v2 + Tesla K80
