Communication and Computation Optimizations of Data and Compute Intensive Applications in GPU Computing

Asst. Professor B. Neelima

Principal Investigator
Dept. of Computer Science and Engineering
N. M. A. M. Institute of Technology, Nitte
(An Autonomous Institute under V. T. U., Belgaum)
Karkala Taluk, Udupi District, Karnataka

Figure 1: Improvement in time consumption using the new sparse matrix format

Current Work

The main aim of this work is to optimize communication and computation in GPU computing. To do this, data and compute intensive numerical applications, namely BLAS Level 2 and Level 3 computations, are chosen, specifically for the case of sparse matrices. Communication, especially from CPU to GPU, is optimized by modifying the way jobs are given to the GPU and by using different formats to send the data from CPU to GPU. Data layout and computation are optimized by modifying the data format of the sparse matrix representation.

Fig. 1 shows the improvement in time when the new sparse matrix format is used. The charts show the improvement when considering only the memory transfer time of the matrices, as well as when considering both memory transfer time and kernel time. The matrices used are from Williams, Oliker et al. [1].

Figure 2: Computation and Communication Optimizations with respect to GPU

With respect to communication and computation on the device, it is proposed to use kernel merge and effective resource utilization through a tool called C2O (Communication and Computation Optimizer). This tool also generates CUDA device-specific code to better utilize the underlying device: given a CUDA program written for one generation of architecture, C2O automatically generates an optimized CUDA program for any target architecture. The summary of the work is shown in Fig. 2.

Figure 3: Performance improvement (time) by use of kernel merge techniques

Fig. 3 shows the kernel time with and without kernel merge performed on a data and computation intensive application.

Both of the above works were selected for poster presentation at the International Conference on High Performance Computing (HiPC-2011).

Future Work

The following work is planned for the coming semester (Jan 2012 to Jun 2012):

  • To prototype a framework for converting older-generation CUDA programs into programs optimized for new-generation cards
  • To transform programs between OpenMP, CUDA and OpenCL
  • To implement a tool for selecting the best data format for sparse matrix representation
  • To implement kernel merge in the compiler to automate the process
  • To implement automatic optimizations for GPU programs through LLVM
  • To implement new parallel program optimizations in LLVM
  • To accelerate moving object detection
  • To develop an effective CPU-GPU hybrid algorithm for cloth simulation