Introduction to GEN Assembly in OpenCL

mattst88 · on Jan 23, 2016

Another Intel employee here (don't know the author). I work on Intel's Open Source OpenGL driver, which is part of Mesa. Most of the work I do is on our GLSL compiler and Gen hardware backend.

I've never used VTune (but it looks very cool) and some of the notation is different from what I'm used to.

If you're using an Intel GPU made in the last 10 years on Linux, the environment variable INTEL_DEBUG will allow you to see the disassembled shaders for a given program. Try "INTEL_DEBUG=fs,vs glxgears" to see the fragment and vertex shaders' assembly and the dumps of some intermediate representations along the way.
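If you'd rather drive this from a script than type it in a shell, here is a minimal sketch. The helper names (`intel_debug_env`, `run_with_shader_dump`) are made up for illustration. The only things taken from the comment above are the `INTEL_DEBUG` variable, the `fs,vs` value, and `glxgears` as the test program.

```python
import os
import subprocess


def intel_debug_env(stages, base_env=None):
    """Return a copy of the environment with INTEL_DEBUG set.

    `stages` is a sequence of stage names such as ("fs", "vs"); they are
    joined with commas, matching the INTEL_DEBUG=fs,vs form quoted above.
    The caller's environment mapping is copied, not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["INTEL_DEBUG"] = ",".join(stages)
    return env


def run_with_shader_dump(command, stages=("fs", "vs")):
    """Run `command` (a list of arguments) with shader dumping enabled."""
    return subprocess.run(command, env=intel_debug_env(stages), check=False)


# Equivalent to typing: INTEL_DEBUG=fs,vs glxgears
# run_with_shader_dump(["glxgears"])
```

The call at the bottom is left commented out because it needs an Intel GPU, the Mesa driver, and `glxgears` installed.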

The Gen (or i965, as we call it) instruction set is really powerful, but does take a while to understand fully.

I've actually been trying to finish up an article about some tricks we do in the i965_dri.so driver (think bit twiddling hacks, but each using some interesting features of the instruction set). I'm curious if there's interest in such a thing. I'd probably be more motivated to finish it. :)

metafex · on Jan 23, 2016

There is definitely interest in this. It is always nice to dive into different computing architectures and see how stuff is done. And it's never wrong to write something more accessible than the Intel manuals ;) (one can read those if really necessary anyhow)

Narishma · on Jan 23, 2016

What's the reason for generating 2 versions of each fragment shader (SIMD8 and SIMD16)?

mattst88 · on Jan 23, 2016

That's a good question.

SIMD8/SIMD16 refers to the number of fragments processed per thread invocation. There is some overhead to spawning a thread, and so processing 16 fragments at a time is typically faster even though the shader itself is doing more work.

The driver provides both versions because it's the GPU that decides which version is used and where, even using both versions to shade the same primitive. For instance, on a triangle boundary maybe only 4 or 8 fragments are "lit", so the hardware spawns a SIMD8 thread and saves itself a little work.
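To make the trade-off concrete, here is a toy model in Python. It is only arithmetic on lane counts. The `pick_width` rule is a hypothetical stand-in and not the hardware's real dispatch heuristic, which isn't described here.

```python
import math


def threads_needed(lit_fragments, width):
    """Threads required to cover `lit_fragments` at a given SIMD width."""
    return math.ceil(lit_fragments / width)


def idle_lanes(lit_fragments, width):
    """Lanes that get spawned but have no lit fragment to shade."""
    return threads_needed(lit_fragments, width) * width - lit_fragments


def pick_width(lit_fragments):
    """Hypothetical rule: use SIMD8 when one SIMD8 thread covers everything."""
    return 8 if lit_fragments <= 8 else 16


# A 64-fragment interior block needs half as many threads at SIMD16.
print(threads_needed(64, 8), threads_needed(64, 16))
# A 4-fragment edge wastes fewer lanes at SIMD8.
print(idle_lanes(4, 8), idle_lanes(4, 16))
```

Fewer threads means less spawn overhead for the large block, while the narrow width leaves fewer lanes idle on the edge case, which is the balance the comment above describes.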

SIMD16 shaders typically use twice the number of registers as SIMD8, and if they require registers to be spilled to memory it's likely the SIMD8 shader is faster even with the additional thread-spawning overhead. Lots of compiler optimization revolves around trying to squeak in under the register limit to get the program compiled as SIMD16 without spilling. :)
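A rough sketch of that register budget check, again in Python. The limit of 128 registers per thread and the exact doubling are assumptions made for illustration. A real compiler's register allocator is far more involved than a single comparison.

```python
# Assumption for illustration: registers available to one thread.
GRF_LIMIT = 128


def widths_to_compile(simd8_regs, limit=GRF_LIMIT):
    """Return the dispatch widths worth shipping for a shader.

    Assumes the SIMD16 variant needs exactly twice the registers of the
    SIMD8 variant ("typically" twice, per the comment above). If SIMD16
    would exceed the limit it would have to spill, so only SIMD8 is kept.
    """
    simd16_regs = 2 * simd8_regs
    if simd16_regs <= limit:
        return (8, 16)
    return (8,)


print(widths_to_compile(40))  # small shader: both widths fit
print(widths_to_compile(70))  # SIMD16 would need 140 registers
```

Under these assumptions, trimming a shader from 70 to 64 registers at SIMD8 is exactly the kind of "squeak in under the limit" win mentioned above.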

iheartmemcache · on Jan 23, 2016

Just a side note (I'm not an Intel employee, but I haven't seen much awareness of Intel's tooling here; they're fairly bad at reaching out to this community, but they have tons of free tools available that'll save your team money if you're doing anything that's not IO bound):

* http://ispc.github.io/ -> An Intel front-end compiler for "SPMD" that targets LLVM, which has proven to be useful.

* PIN -> Dynamic analysis, free. Think DTrace and Valgrind on steroids. (Not open source, but I'll take what I can get.)

* ICC -> Their compiler suite gets little love on HN (though there aren't many engineers who really write computationally intensive stuff here, or if they do I suppose they just throw Amazon instances at it), but it's so cheap for what you get. Such an extensive tooling set out of the box at around the pricing of VS (both of which are well under 5k for 1 seat re: their highest version).

My team has done a few jobs where we've jumped in using Intel tooling and both shifted and tightened the 1st-3rd quadrant computational time by literally 3-5x and 10-20x respectively by basically just using the Intel tooling and being fairly familiar with it. Shameless plug & all ;)