
We've used libsimdpp to good effect: https://github.com/p12tic/libsimdpp

"libsimdpp is a portable header-only zero-overhead C++ low level SIMD library." Not yet sure how it compares to the linked library.



Do you find it’s easier to write code with that than rely on autovec?


You can't rely on autovectorization because it's a really brittle optimization that only works at the best of times, and generally only with simple loops.

For anything more complex, you need to write SIMD code explicitly. Getting good performance requires writing code that uses the full width of the registers. If the compiler falls back to scalar arithmetic (i.e. only the 1st component of the xmm0 register is used), it tends to pollute the surrounding code with register spills, because the scalar operations need registers of their own.

Writing SIMD code is quite a bit of effort if you need to get it working well.
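To make the difference concrete, here is a minimal sketch of the same element-wise add written as a plain loop and with explicit SSE intrinsics. The function names `add_scalar` and `add_simd` are made up for illustration, and the intrinsic path assumes an x86 target with SSE2; anywhere else the sketch falls back to the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics; also provides the SSE float ops used below
#endif

// Plain loop: simple enough that an autovectorizer will usually handle it.
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Explicit version (hypothetical helper): 4 floats per iteration,
// so each 128-bit xmm register is used at full width.
void add_simd(const float* a, const float* b, float* out, std::size_t n) {
    assert(a && b && out);
    std::size_t i = 0;
#if defined(__SSE2__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar tail for the remaining 0..3 elements (or everything without SSE2).
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

Even in this trivial case the explicit version needs a separate tail loop and a decision about alignment, which is part of the effort mentioned above.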


You also can't rely on things like tail-call optimization happening automatically. That is why, in Scala for example, you usually annotate the function with @tailrec. The annotation doesn't do anything by itself; the compiler just reports an error if the function cannot be optimized into a tail call.

Autovectorised SIMD code would probably need something like an "AUTOVEC" annotation at every single line to be effective which defeats the purpose of autovectorisation in the first place.


> Autovectorised SIMD code would probably need something like an "AUTOVEC" annotation

If you only need SIMD for stream processing, autovectorisation is OK.

There are multiple autovectorizers for C, though. The default one is indeed very fragile. But the one in OpenMP 4 is better: http://www.hpctoday.com/hpc-labs/explicit-vector-programming...

But even OpenMP 4 is very limited.
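For reference, an OpenMP 4 SIMD loop looks roughly like this. It is only a sketch: the function name `dot` is made up, and it assumes a compiler that understands `#pragma omp simd` (e.g. GCC or Clang with `-fopenmp-simd`); without such a flag the pragma is simply ignored and the loop runs as ordinary scalar code.

```cpp
#include <cassert>
#include <cstddef>

// Dot product with an explicit request to vectorize the loop.
// The reduction clause tells the compiler it may reorder the additions into per-lane partial sums.
float dot(const float* a, const float* b, std::size_t n) {
    assert(a && b);
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Because the reduction may be reordered, floating-point results can differ slightly from the strictly sequential sum.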

One reason is that many SSE operations don’t map to C: approximate math (rcpps, rsqrtps), composite operations (FMA, AES), and saturated math (there are dozens of instructions for manipulating 8 and 16 bit numbers with saturation, i.e. on over/underflow the numbers don’t wrap around by stripping the highest byte[s] but stay at the min/max 8/16 bit value).
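A small sketch of the saturation point, assuming an x86 target with SSE2 for the intrinsic path (a scalar fallback is used otherwise). The helper names `add_wrap`, `add_sat` and `add_sat16` are made up for illustration; `_mm_adds_epu8` is the intrinsic for paddusb.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Ordinary unsigned 8-bit arithmetic wraps modulo 256.
inline std::uint8_t add_wrap(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a + b);
}

// Saturating add written out by hand: clamp at 255 instead of wrapping.
inline std::uint8_t add_sat(std::uint8_t a, std::uint8_t b) {
    unsigned s = static_cast<unsigned>(a) + static_cast<unsigned>(b);
    return static_cast<std::uint8_t>(s > 255u ? 255u : s);
}

// 16 unsigned saturating adds at once (paddusb) when SSE2 is available.
void add_sat16(const std::uint8_t* a, const std::uint8_t* b, std::uint8_t* out) {
    assert(a && b && out);
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_adds_epu8(va, vb));
#else
    for (int i = 0; i < 16; ++i) out[i] = add_sat(a[i], b[i]);
#endif
}
```

There is no C operator that means "add and clamp", so the compiler has to recognize the compare-and-select pattern in `add_sat` before it can even consider the single instruction.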

Another reason is that some SSE instructions operate horizontally (phminposuw, pmaddubsw, psadbw, dpps), or are advanced swizzle instructions (shufps, pshufb, pshuflw, pshufhw, pslldq), both of which are very hard to autogenerate from these #pragma omp simd loops.
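As one illustration of a horizontal instruction, psadbw against a zero vector gives the sum of the bytes in each 8-byte half of a register. The sketch below uses it to add up 16 bytes; the function name `sum16` is made up, and the intrinsic path assumes x86 with SSE2, with a scalar fallback elsewhere.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Sum of 16 unsigned bytes.
unsigned sum16(const std::uint8_t* p) {
    assert(p);
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    // psadbw vs. zero: |x - 0| summed over each group of 8 bytes,
    // leaving one partial sum in the low bits of each 64-bit half.
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());
    // Shift the upper half down by 8 bytes (psrldq) and add the two partial sums.
    __m128i hi = _mm_srli_si128(sad, 8);
    return static_cast<unsigned>(_mm_cvtsi128_si32(_mm_add_epi32(sad, hi)));
#else
    unsigned s = 0;
    for (int i = 0; i < 16; ++i) s += p[i];
    return s;
#endif
}
```

The scalar fallback is the loop a compiler would be given; getting from that loop to psadbw requires it to know an idiom rather than just widen the arithmetic.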



