Strata Gems: Use GPUs to speed up calculation

GPUs bring massively parallel computing into reach

We’re publishing a new Strata Gem each day all the way through to December 24. Yesterday’s Gem: The emerging marketplace for social data. Early-bird pricing on Strata closes December 14: don’t forget to register!

The release in November of Amazon Web Services’ Cluster GPU instances highlights the move of Graphics Processing Units (GPUs) into the mainstream of general-purpose computation. Graphical applications require very fast matrix transformations, for which GPUs are optimized. Boards such as the NVIDIA Tesla offer hundreds of processor cores, all able to work in parallel.
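To see the kind of data-parallel arithmetic involved, consider applying a single rotation matrix to a large batch of 2-D points. This is a CPU-side NumPy sketch for illustration only, not GPU code; the point is that each output is independent of the others, which is exactly the shape of work that maps onto hundreds of GPU cores:

```python
import numpy as np

# A 90-degree rotation: the kind of matrix transformation that
# graphics workloads apply to huge batches of vertices.
theta = np.pi / 2
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]], dtype=np.float32)

# 100,000 2-D points, one per row.
points = np.random.rand(100000, 2).astype(np.float32)

# Every point is transformed by the same matrix, and each result is
# independent of the rest, so the work parallelizes naturally.
rotated = points @ rotation.T
```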

While debate continues about the exact performance boost available from GPUs, reports indicate that speedups over CPUs of 2.5x to 15x can be obtained for calculation-heavy applications.

NVIDIA has led the trend for general purpose computing on GPUs with the Compute Unified Device Architecture (CUDA). By using extensions to the C programming language, developers can write code that executes on the GPU, mixed in with code running on the CPU.

NVIDIA’s Tesla M2050 GPU Computing Module

While CUDA is NVIDIA-only, OpenCL (Open Computing Language) is a standard for cross-platform general-purpose parallel programming. Originated by Apple, and heavily influenced by CUDA, it is now developed with cross-industry participation. ATI and NVIDIA are among the vendors offering OpenCL support for their products.

Now with Amazon’s support for GPU clusters, it’s easier than ever to start accessing the power of GPUs for data analysis. OpenCL and CUDA bindings exist for many popular programming languages, including Java, Python and C++, and the R+GPU project gives GPU access for the R statistical package.

To get a quick impression of what GPU code looks like, check out this example from the Python OpenCL bindings (PyOpenCL). The code that executes on the GPU is the kernel defined inside the triple-quoted string.

import pyopencl as cl
import numpy
import numpy.linalg as la

a = numpy.random.rand(50000).astype(numpy.float32)
b = numpy.random.rand(50000).astype(numpy.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, b.nbytes)

# The kernel below is compiled for, and executed on, the GPU
prg = cl.Program(ctx, """
    __kernel void sum(__global const float *a,
    __global const float *b, __global float *c)
    {
      int gid = get_global_id(0);
      c[gid] = a[gid] + b[gid];
    }
    """).build()

prg.sum(queue, a.shape, None, a_buf, b_buf, dest_buf)

a_plus_b = numpy.empty_like(a)
cl.enqueue_read_buffer(queue, dest_buf, a_plus_b).wait()

print(la.norm(a_plus_b - (a + b)))
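For readers without an OpenCL device, the kernel’s execution model can be sketched in plain NumPy: each GPU work-item performs the body of the loop below for one global ID, and all of those iterations can run concurrently on the device (an illustrative sketch, not a performance comparison):

```python
import numpy as np

a = np.random.rand(50000).astype(np.float32)
b = np.random.rand(50000).astype(np.float32)

# What each GPU work-item does: one addition per global ID.
c_loop = np.empty_like(a)
for gid in range(a.shape[0]):
    c_loop[gid] = a[gid] + b[gid]

# On the GPU, all 50,000 of these additions can run concurrently;
# NumPy's vectorized form expresses the same computation in one step.
c_vec = a + b

print(np.linalg.norm(c_loop - c_vec))
```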

Amazon’s Werner Vogels will be among the keynote speakers at Strata.

  • http://www.parstream.com/ Michael Hummel

    We have just won the “One to Watch” award from NVIDIA &amp; Adobe for ParStream, our novel analytical database which runs on CPUs, GPUs, or both together.

    I would like to confirm your estimated range of performance improvements achievable through GPUs.
    Depending on the queries performed, we experience a performance gain through GPUs by a factor of 8 to 10.
    In combination with the performance gain we achieve on standard hardware (at least 10 times faster than any other row or columnar database), ParStream achieves significant performance gains!

    See you at Strata Conf.
    Best Mike