We've used libsimdpp to good effect: https://github.com/p12tic/libsimdpp "libsi...

marmaduke · on Jan 2, 2018

Do you find it’s easier to write code with that than rely on autovec?

exDM69 · on Jan 2, 2018

You can't rely on autovectorization because it's a really brittle optimization that only works at the best of times, and generally only with simple loops.

For anything more complex, you need to write SIMD code explicitly. Getting good performance requires writing code that uses the full width of the registers. If the compiler falls back to scalar arithmetic (i.e. only the first component of the xmm0 register is used), the scalar operations need registers of their own, and that tends to pollute the surrounding code with register spills.

Writing SIMD code is quite a bit of effort if you need to get it working well.
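
To make "explicitly" concrete, here is a minimal sketch in C with SSE intrinsics, assuming an x86 target. The function name `add_arrays_sse` is made up for illustration. It processes four floats per iteration so the whole xmm register is used, and finishes the leftover elements with a scalar loop:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target */

/* Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n). */
void add_arrays_sse(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;

    /* Main loop: four floats per iteration, full register width. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Even for a loop this trivial you have to handle the tail and the alignment question yourself, which is part of the effort.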

imtringued · on Jan 2, 2018

You also can't rely on things like tail-call optimization happening automatically. That is why, in Scala for example, you usually annotate the function with @tailrec. The annotation doesn't do anything by itself; the compiler just shows an error/warning if the function is not optimized into a tail call.

Autovectorised SIMD code would probably need something like an "AUTOVEC" annotation on every single line to be effective, which defeats the purpose of autovectorisation in the first place.

Const-me · on Jan 3, 2018

> Autovectorised SIMD code would probably need something like an "AUTOVEC" annotation

If you only need SIMD for stream processing, autovectorisation is OK.

However, there are multiple autovectorizers for C. The default one is indeed very fragile, but the one in OpenMP 4 is better: http://www.hpctoday.com/hpc-labs/explicit-vector-programming...
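
For reference, a minimal sketch of what such a loop looks like. The `saxpy` kernel is a hypothetical example, not taken from the linked article. With GCC or Clang the pragma is honoured under `-fopenmp` or `-fopenmp-simd`; without those flags it is ignored and the code still compiles as an ordinary scalar loop:

```c
#include <stddef.h>

/* Hypothetical kernel: y[i] = alpha * x[i] + y[i].
 * `restrict` promises the compiler that x and y do not overlap. */
void saxpy(float alpha, const float *restrict x, float *restrict y, size_t n)
{
    /* Ask the compiler to vectorize this loop (OpenMP 4.0 and later). */
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```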

But even the OpenMP 4 one is very limited.

One reason is that many SSE operations don’t map to C: approximate math (rcpps, rsqrtps), composite operations (FMA, AES), and saturated math (there are dozens of instructions for manipulating 8- and 16-bit numbers with saturation, i.e. on over/underflow the numbers don’t wrap around by dropping the highest byte[s] but stay at the min/max 8/16-bit value).
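
A small sketch of the saturation point, assuming an x86 target with SSE2. Both function names are made up for illustration. The first uses paddusb through the `_mm_adds_epu8` intrinsic; the second is the plain C addition, which wraps:

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86 target */

/* Hypothetical helper: add two 16-byte blocks with unsigned saturation.
 * Sums above 255 stay at 255 (paddusb). */
void add_u8_saturating(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
}

/* The same addition in plain C: the result wraps modulo 256. */
void add_u8_wrapping(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    for (int i = 0; i < 16; ++i)
        out[i] = (uint8_t)(a[i] + b[i]);
}
```

With 200 + 100 in every byte, the saturating version gives 255 and the wrapping version gives 44.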

Another reason is that some SSE instructions operate horizontally (phminposuw, pmaddubsw, psadbw, dpps), or are advanced swizzle instructions (shufps, pshufb, pshuflw, pshufhw, pslldq), both of which are very hard to generate automatically from these #pragma omp simd loops.
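
As an illustration of a horizontal operation, here is a sketch that sums the 16 bytes of a block with psadbw, assuming an x86 target with SSE2. The name `sum_u8x16` is hypothetical. Taking the sum of absolute differences against zero yields one partial sum per 8-byte half, and a whole-register byte shift (psrldq, the right-shift sibling of pslldq) brings the upper half down so the two can be added:

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86 target */

/* Hypothetical helper: sum of the 16 unsigned bytes at p. */
unsigned sum_u8x16(const uint8_t *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);

    /* psadbw against zero: each 64-bit lane now holds the sum of its 8 bytes. */
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());

    /* psrldq: shift the register right by 8 bytes to move the upper sum down. */
    __m128i hi = _mm_srli_si128(sad, 8);

    /* Add the two partial sums and extract the low 32 bits. */
    return (unsigned)_mm_cvtsi128_si32(_mm_add_epi32(sad, hi));
}
```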