they're not a userspace construct in any meaningful way. libutils/Thread (capital-T Thread) is a thin wrapper around pthread_t, and most Android code uses pthread_t directly. what this hit was probably a bug related to timerslack_ns--there's some code that tweaks timerslack values on a per-thread basis instead of a per-process basis, even though audio should never be coming from processes low enough on the importance list to get a high timerslack value--but L bugs fixed in M are before my time.
GPUs don't have a page fault handler; when there's a page fault, it's an unrecoverable crash. Accordingly, zero-on-allocate (or potentially zero-on-free, but that makes assumptions about startup and teardown that may not be true) is the only way to do it.
X1 is actually the 64-bit ARM CPU configuration (Cortex A53 + Cortex A57), not Denver. K1, the predecessor of X1, comes in two flavors: 32-bit 4xCortex A15 and 64-bit 2xDenver. TK1 is 32-bit, Nexus 9 is 64-bit (and is the only device I know of with Denver).
rent increases are almost always less than prop 13 increases. prop 13 increases are capped at 2%. In the last decade, the rent control increase exceeded 2% only once, and equaled 2% once. The other 8 years it has been less, including 0.1% one year.
I actually read IJ on a Kindle and found it significantly easier for the most part than reading it in print because of links to endnotes, which removed the requirement to keep two sets of bookmarks (if you haven't read it, some endnotes in IJ are a sentence or two, some are 40 pages). There were some occasional issues with going back to the main text (IIRC the back button's stack wasn't saved across sleep), but overall it was much better for me.
But no, I'd never read a textbook on a Kindle. Can't flip around.
GPUs don't support precise exceptions. For example, you can't take a GPU program that contains a segfault, run it as a standard program (as in, not in a debug mode), and be presented with the exact instruction that generated the fault.
look up the target attribute and the ifunc attribute--it's basically a way to compile multiple versions of a function for different targets in a single source file and then use the dynamic linker to determine which one to resolve at runtime. obvious use is for things like optimized memcpy implementations.
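A minimal sketch of what that looks like with GCC on an ELF/glibc system. The variant bodies and names here (my_copy, copy_avx2, etc.) are made up for illustration; a real fast path would use actual vector intrinsics or the target attribute:

```c
#include <stddef.h>

/* Two stand-in variants. In real code the fast path would be built
   with __attribute__((target("avx2"))) and wide-vector code. */
static void *copy_generic(void *dst, const void *src, size_t n) {
    char *d = dst;
    const char *s = src;
    while (n--) *d++ = *s++;
    return dst;
}

static void *copy_avx2(void *dst, const void *src, size_t n) {
    /* pretend this is the vectorized version */
    return copy_generic(dst, src, n);
}

/* The resolver runs once, at dynamic-link time, before main() --
   so constructors haven't run and we must init the CPU model ourselves. */
static void *(*resolve_copy(void))(void *, const void *, size_t) {
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return copy_avx2;
    return copy_generic;
}

/* Callers just call my_copy(); the dynamic linker binds the winner. */
void *my_copy(void *dst, const void *src, size_t n)
    __attribute__((ifunc("resolve_copy")));
```

Note this is GCC/glibc-specific: ifunc needs ELF symbol support, and `__builtin_cpu_supports` is an x86 builtin.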
So, it's otherwise automatic, except I have to write a selector routine that decides at runtime which implementation will perform best for each individual case, across varying hardware support.
You only need to go to all that trouble if you want high performance across a variety of machines. If you are merely after bragging rights or trying to satisfy someone else's requirement, the theory is that you can compile the exact same piece of high level code using different optimization targets, and the compiler will do all the work for you, providing maximum performance for each instruction set practically for free...
Even more practically, Agner has a typically excellent description of the strengths and weaknesses of the dispatch strategies used by different compilers in Section 13 (p. 122) here: http://www.agner.org/optimize/optimizing_cpp.pdf
but that's GLES 2.0, which is significantly less flexible than the kinds of GPUs we're discussing here and is not even in the same ballpark as a CPU (and almost certainly significantly less strict in terms of floating point precision than a GLES 3 device).
https://github.com/raspberrypi/userland/blob/master/host_app... is part of the Raspberry Pi GPU FFT example code. That is not GLES 2.0 or even GL of any kind. That's VideoCore QPU assembly language to compile with qasm. I haven't tried writing anything for it, but it certainly looks like it's "the kinds of GPUs we're discussing here" and "in the same ballpark as a CPU".
disclaimer: I work in this space and have done so for a while, including previously on CUDA and on Titan.
GPUs for general purpose computation were never 100x faster than CPUs like people claimed in 2008 or so. They're just not. That was basically NV marketing mixed with a lot of people publishing some pretty bad early work on GPUs.
Lots of early papers that fanned GPU hype followed the same basic form: "We have this standard algorithm, we tested it on a single CPU core with minimal optimizations and no SIMD (or maybe some terrible MATLAB code with zero optimization), we tested a heavily optimized GPU version, and look the GPU version is faster! By the way, we didn't port any of those optimizations back to the CPU version or measure PCIe transfer time to/from the GPU." It was utterly trivial to get any paper into a conference by porting anything to the GPU and reporting a speedup. Most of the GPU related papers from this time were awful. I remember one in particular that claimed a 1000x speedup by timing just the amount of time it took for the kernel launch to the GPU instead of the actual kernel runtime, and somehow nobody (either the authors or the reviewers) realized that this was utterly impossible.
GPUs have more FLOPs and more memory bandwidth in exchange for requiring PCIe transfers and lots of parallel work. If your algorithm needs those more than anything else (like cache), can minimize PCIe transfer time, and handles the whole massive parallelism thing well, then GPUs are a pretty good bet. If you can't, then they're not going to work particularly well.
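A quick back-of-envelope makes the PCIe tax concrete. All the constants below are rough assumed figures for illustration (~12 GB/s effective PCIe 3.0 x16, and round peak-ish compute numbers), not measurements of any particular part:

```python
# Rough break-even estimate: is offloading worth the PCIe round trip?
# All constants are ballpark assumptions, not measured values.
PCIE_BPS = 12e9       # effective host<->device bandwidth, bytes/s
CPU_FLOPS = 100e9     # optimized multicore SIMD CPU, FLOP/s
GPU_FLOPS = 5000e9    # GPU, FLOP/s

def offload_speedup(n_bytes, flops):
    """Speedup of (PCIe transfer + GPU compute) over CPU compute alone."""
    cpu_time = flops / CPU_FLOPS
    gpu_time = 2 * n_bytes / PCIE_BPS + flops / GPU_FLOPS  # there and back
    return cpu_time / gpu_time

# Low arithmetic intensity (1 FLOP/byte): the transfer dominates
# and the GPU loses outright, despite a 50x raw FLOPs advantage.
low = offload_speedup(1e9, 1e9)

# High arithmetic intensity (1000 FLOPs/byte): the GPU wins big,
# but still well short of the 50x peak-FLOPs ratio.
high = offload_speedup(1e9, 1e12)

print(low, high)
```

The point of the exercise: the raw FLOPs ratio is an upper bound you only approach when the work per transferred byte is high.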
(now, if you need to do 2D interpolation and can use the texture fetch hardware on the GPU to do it instead of a bunch of arbitrary math... yeah, that's a _huge_ performance increase because you get that interpolation for free from special-purpose hardware. but that's incredibly rare in practice)
ah, yes. :) very nice detailed summary of some of the issues in this sect of "academia" (I put that in quotes only because all the research seems to be co-written by corps).
I am into audio DSP & am planning to port a couple of audio algorithms (lots of FFT & linear algebra) to run on GPU but haven't even gotten to it because I considered it a pre-mature optimization to this point. I'm sure it would improve performance, but nowhere near what GPU advocates would claim.
My biggest reason?
"PCIe transfer time to/from GPU", plus it would be unoptimized GPU code. Once you read a few of these papers it becomes painfully obvious that a lot of tuning goes into the GPU algorithms that offer anything more than a low single-digit factor of speedup. It's still very significant (cutting a 3 hour algorithm down to 1 would be huge) but if you're in an early stage of research it may be a toss-up over whether its better to just tune the algorithm itself / run computations overnight rather than going through the trouble of writing a GPU-based POC. Maybe if you have 1 or 2 under your belt its not such a big deal but for most of the researchers I know GPU algorithm rewrites would not be trivial. (I've been doing enterprise Java coding for about 2 years now so the idea isn't so intimidating now, but in a past life of mucking around with Matlab scripts I'm sure it would have been daunting).