The goal of NN-512 is efficient neural net inference on inexpensive, CPU-only cloud compute instances
For example, a Skylake-X cloud compute instance costs $10 per CPU-core per month at Vultr, and the NN-512 generated code does about 18 DenseNet121 inferences per CPU-core per second (in series, not batched)
In contrast, GPU cloud compute is almost unbelievably expensive. Even Linode charges $1000 per month, or $1.50 per hour (look at the GPU plans: https://www.linode.com/pricing/#row--compute)
As AVX-512 becomes better supported by Intel and AMD chips, it becomes more attractive as an alternative to expensive GPU instances for workloads with small amounts of inference mixed with other computation
I'm not disagreeing with you. I acknowledge that there may be a market for CPU-only NN tasks.
I think a thorough benchmark, either by you or by someone else, will only help your case, by giving a clear picture to those who need to make a decision.
Fun fact, GPUs are massively under-utilized during NN training. So it's quite possible NN on a good CPU might be only slightly slower.
GPU underutilization depends on what, exactly, the model you're training is. It's not unreasonable to hit 80% or more of CUDA core usage on non-recurrent models like convnets, given sufficiently fast data pipelines and a reasonable batch size. Transformers and other recurrent functions hit 100% CUDA core utilization for large portions of each epoch, with the low-% usage on the comparatively short weight update at the end. As well, the current rule of thumb is that at the same price point (so a Xeon 4114 and a Nvidia Titan RTX) the GPU completes each epoch in 10% of the time as the CPU given the same compute graph... So it's highly unlikely that training will be anywhere close to as fast on a CPU as it is on a GPU.
For example, a Skylake-X cloud compute instance costs $10 per CPU-core per month at Vultr, and the NN-512 generated code does about 18 DenseNet121 inferences per CPU-core per second (in series, not batched)
In contrast, GPU cloud compute is almost unbelievably expensive. Even Linode charges $1000 per month, or $1.50 per hour (look at the GPU plans: https://www.linode.com/pricing/#row--compute)
As AVX-512 becomes better supported by Intel and AMD chips, it becomes more attractive as an alternative to expensive GPU instances for workloads with small amounts of inference mixed with other computation