Vectorization with SSE

Using OpenCL, CUDA, or OpenACC to take advantage of the computing power offered by GPGPUs is the most effective way to accelerate computationally expensive code nowadays. However, not all machines have powerful GPGPUs. On the other hand, _all_ modern x86 processors from Intel and AMD support vector instructions in the form of Streaming SIMD Extensions (SSE – SSE4), and most new processors also support Advanced Vector Extensions (AVX). Utilizing these instructions can improve the performance of single precision floating point code by as much as a factor of four with SSE and a factor of eight with AVX (AVX-512 will increase this to a factor of 16). In some cases, compilers can automatically vectorize pieces of code to take advantage of the CPU’s vector units, and OpenMP 4.0 lets you mark loops for vectorization with its simd directive, but writing code with vector operations in mind will generally yield better results. In this post, I will briefly explain how to use SSE to vectorize C and C++ code using GCC 4.8. Most of the material should also apply to LLVM/Clang and Microsoft Visual Studio.

SSE registers are 128 bits (16 bytes) wide. Each register can therefore hold 4 single precision floating point numbers, 2 double precision floating point numbers, or any other combination of types that adds up to 16 bytes. Using SSE, we can perform 4 single precision or 2 double precision floating point operations simultaneously on one CPU core with a single instruction. (A superscalar processor, which contains multiple floating point units, can also perform multiple floating point operations simultaneously; however, a separate instruction is issued for each operation, and the programmer has little, if any, control over which operations are performed together.) SSE also gives us more control over the cache and prefetching. It is even possible to bypass the cache entirely, which is useful in cases where cache pollution would hurt performance. Finally, SSE lets us eliminate some branches from the code, reducing the chance of a branch misprediction and the pipeline flush it necessitates. This can potentially improve performance further.

First-generation SSE instructions can be accessed by including the header xmmintrin.h. SSE2 and SSE3 instructions can be used by including emmintrin.h and pmmintrin.h, respectively. To access all vector extensions, including SSE4 and AVX, include immintrin.h. You also need to consult your compiler’s documentation to learn which flags are required to enable the instructions (for GCC, -msse through -msse4 and -mavx). In the code example below, I demonstrate some basic SSE operations: initializing vectors, arithmetic, comparisons and masking, and shuffles.

The code sample compiles with g++ 4.8. Some minor changes, mentioned in the comments, may be needed for other compilers. A version of the example that compiles with g++ 4.7 and clang++ 3.0 (and probably other compilers) is also available.

/*
 vector.cpp
 compile with g++ vector.cpp -o vector -O2 -msse
*/

#include <iostream>
#include <xmmintrin.h>

#define PRINT(var) print((var), #var)

using namespace std;

// v4sf: vector of four single-precision floats
typedef float v4sf __attribute__ ((vector_size (16)));

// these are used for printing the results
void print(const v4sf& vec, const char* name);
void print(const float* vec, const char* name);

int main()
{
   // one way to initialize a vector
   v4sf vecA = {0, 1, 2, 3};

   // another way: loading the data from a 16-byte aligned array
   // (the aligned attribute guarantees the alignment _mm_load_ps() requires)
   __attribute__ ((aligned (16))) float arrayA[] = {4, 2, 3, 6};

   v4sf vecAA = _mm_load_ps(arrayA);

   /* also refer to _mm_setzero_ps(), _mm_set1_ps(), _mm_load_ss(),
      _mm_load1_ps(), _mm_loadu_ps(), _mm_loadr_ps(), _mm_set_ps(),
      and _mm_setr_ps() in xmmintrin.h for vector initialization. */

   // one way of adding vectors. The other method explicitly uses
   // _mm_add_ps()
   v4sf vecB = vecA + vecA;

   // _mm_store_ps() also requires a 16-byte aligned destination
   __attribute__ ((aligned (16))) float arrayB[4];

   // store vector data in an array
   _mm_store_ps (arrayB, vecB);

   /* NOTE: g++ 4.8 automatically converts scalars to vectors. Other
      compilers (such as g++ 4.7) require you to set up the vectors
      explicitly. An easy way to do this is to use _mm_set1_ps().
    */

   // multiplication by a scalar
   v4sf vecC = 2.0 * vecB; // _mm_set1_ps(2.0) * vecB;

   // more complicated expressions
   v4sf vecD = vecC - 0.5 * vecA;

   v4sf vecE = 1.0 + 5.0 * vecC / (vecA + 3.0);

   /* for logic operations, TRUE = 0xFFFFFFFF, FALSE = 0x00000000
      Thus, the result is a mask which can be bitwise ANDed using
      _mm_and_ps() */

   v4sf mask = _mm_cmpeq_ps(vecA, vecB);  // g++ also accepts (vecA == vecB),
                                          // but that yields an integer vector

   // bitwise AND.
   // This is equivalent to vecF[i] = (vecA[i] == vecB[i]) ? vecE[i] : 0
   v4sf vecF = _mm_and_ps(mask, vecE);

   // bitwise OR and ANDNOT.
   // equivalent to vecG[i] = (vecA[i] == vecB[i]) ? vecE[i] : vecD[i];
   v4sf vecG = _mm_or_ps(_mm_and_ps(mask, vecE),
                         _mm_andnot_ps(mask, vecD));

   // vecH[i] = sqrt(vecE[i]);
   v4sf vecH = _mm_sqrt_ps(vecE);

   // vecI[i] approx 1.0 / sqrt(vecE[i]);
   v4sf vecI = _mm_rsqrt_ps(vecE);

   // vecJ[i] approx 1.0 / vecAA[i];
   v4sf vecJ = _mm_rcp_ps(vecAA);

   v4sf vecK = 1.0 / vecAA;

   // vecL is a permutation of vecA.
   v4sf vecL = _mm_shuffle_ps(vecA, vecA, _MM_SHUFFLE(1,0,3,2) );

   ////////////////////////////// output ///////////////////////////////

   PRINT(vecA);
   PRINT(vecAA);
   PRINT(vecB);
   PRINT(arrayB);
   PRINT(vecC);
   PRINT(vecD);
   PRINT(vecE);
   PRINT(vecF);
   PRINT(vecG);
   PRINT(vecH);
   PRINT(vecI);
   PRINT(vecJ);
   PRINT(vecK);
   PRINT(vecL);

   return 0;
}

void print(const v4sf& vec, const char* name)
{
     union { v4sf vector; float array[4]; } value;

     value.vector = vec;

     print(value.array, name);
}

void print(const float* vec, const char* name)
{
     cout << "\n" << name << ":" << endl;

     for (int i = 0; i < 4; ++i) { cout << "\t" << vec[i] << endl; }
}

Output:

vecA:
    0
    1
    2
    3

vecAA:
    4
    2
    3
    6

vecB:
    0
    2
    4
    6

arrayB:
    0
    2
    4
    6

vecC:
    0
    4
    8
    12

vecD:
    0
    3.5
    7
    10.5

vecE:
    1
    6
    9
    11

vecF:
    1
    0
    0
    0

vecG:
    1
    3.5
    7
    10.5

vecH:
    1
    2.44949
    3
    3.31662

vecI:
    0.999878
    0.408264
    0.333313
    0.301514

vecJ:
    0.249939
    0.499878
    0.333313
    0.166656

vecK:
    0.25
    0.5
    0.333333
    0.166667

vecL:
    2
    3
    0
    1

To understand the shuffle command in part L of the example, refer to the documentation of _mm_shuffle_ps() and the _MM_SHUFFLE macro (for example, in the Intel Intrinsics Guide) for more general usage of the shuffle operation.

Studying the example above should give you a basic sense of how to use some of the SSE instructions. Notice the definition of the v4sf data type: a 16-byte vector holding four single precision floats. Also note that every vector operation used above has the form _mm_operation_ps(), where the _ps suffix stands for “packed single precision”; other suffixes exist for other data types (for example, _pd for packed doubles and _ss for a scalar single). If you read through the xmmintrin.h header, you will see that many more operations are available. Reading through the header and doing a few web searches for specific intrinsic functions is an easy way to learn about the details of SSE.

For an idea of how SSE can be used to solve a real problem, consider the following partial example. Suppose we decide to step some particles forward in space using

$${\mathbf{r}_{i + 1}} = {\mathbf{r}_i} + \mathbf{v}_i\delta t + \frac{1}{2}\mathbf{a}_i\delta t^2. $$

Ignore the fact that this isn’t the best expression to use, in general. We could create particle packets containing four particles, like this:

struct particle4
{
    v4sf x;
    v4sf y;
    v4sf z;
    v4sf vx;
    v4sf vy;
    v4sf vz;
    v4sf ax;
    v4sf ay;
    v4sf az;
};

and then step four particles forward during each iteration of a for loop using

// particles[] is a vector of particle4 objects
// the step size delta_t is already defined

const v4sf dt = _mm_set1_ps(delta_t);

const v4sf dt2 = _mm_set1_ps(0.5 * delta_t * delta_t);

for (int i = 0; i < N; ++i)
{
    particles[i].x = particles[i].x + dt * particles[i].vx + dt2 * particles[i].ax;
    particles[i].y = particles[i].y + dt * particles[i].vy + dt2 * particles[i].ay;
    particles[i].z = particles[i].z + dt * particles[i].vz + dt2 * particles[i].az;
}

// then update vx, vy, vz, ax, ay, az (not shown)

Using OpenMP to spread the work over several threads would further improve performance.

Note: SSE2 and SSE3 merely add more instructions on top of SSE. The newer AVX instructions, on the other hand, use 256-bit registers, so they can operate on 8 single precision or 4 double precision floats simultaneously, and AVX-512 widens the registers to 512 bits. AVX is quite similar to SSE in terms of usage. For more information, see avxintrin.h (pulled in by immintrin.h).