In practice the functions just need to be piecewise differentiable. ReLU is the canonical example in deep learning; at its kink (x = 0) a subderivative is used.
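As a concrete sketch, here's ReLU and the usual subgradient convention at the kink (picking 0 at x = 0, which is what most frameworks do; any value in [0, 1] would be a valid subderivative there):

```python
def relu(x):
    # ReLU is piecewise linear: differentiable everywhere except x == 0
    return max(x, 0.0)

def relu_subgradient(x):
    # At the kink (x == 0) any value in [0, 1] is a valid subderivative;
    # frameworks conventionally pick 0 (as here) or occasionally 1
    return 1.0 if x > 0 else 0.0

print([relu(x) for x in (-2.0, 0.0, 3.0)])             # [0.0, 0.0, 3.0]
print([relu_subgradient(x) for x in (-2.0, 0.0, 3.0)]) # [0.0, 0.0, 1.0]
```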
It’s a little trickier than that, but you are generally right. The relaxation can involve many different things though, not just piecewise differentiability.
For example, when processing images, which are themselves intensity functions over a spatial domain and can have cusps, edges, occlusions, etc., you need weaker differentiability conditions, such as those from Sobolev spaces, to guarantee that numerical gradient operators will succeed / converge where needed.
Actually I'm pretty sure the Rome numbers are for double precision, whereas most numbers quoted for GPUs are for single precision or less, making Rome's 3.4 TFLOPS even more impressive.
Yeah, I can't wait for the announcements in the coming weeks. So far the best rumours I've seen have the new TR line topping out at 48c, but none of them have looked particularly authoritative; there just haven't been leaks of anything pointing at 64c yet. I suspect they'll announce with up to 48c and then, a few months down the line, announce the 64c CPU. That would line up well with what looks like the cause of the delays: not being able to get the quantity of chiplets they need. They'd be able to frontload all the lower-core-count demand and then, once they don't need nearly as many of those, start making the larger parts.
I really hope a 32c part would work with my Zenith Extreme X399, and 64c with 8-channel TRX80/WRX80. Then I could upgrade my old TR to a 32c Zen 2 one and buy another 64c system with 4TB of ECC LRDIMM for some machine learning tasks. I am also fine if AMD decides to do a 64c TR with Zen 3 only (4xSMT?). But based on the Blur ad, I guess they are going to release a 64c TR based on Zen 2 as well, just to completely obliterate Intel in HEDT, even if it costs $5000.
Yeah, X399 compatibility will decide when my upgrade happens. The Zen+ TRs weren't enough for me to justify it, but the Zen 2 ones seem like they've finally hit that bar. If I need to do a motherboard and other upgrades, that'll delay me for a while (need to see how PCIe passthrough and other stuff settles out with the new chipsets), but in either case I'm going to end up upgrading to this next gen one way or another.
20-30% usually. You get faster cores but no LRDIMM (i.e. you are effectively constrained to 128GB of ECC UDIMM, at best 256GB if you are lucky enough to get 32GB ECC UDIMM modules). EPYC has a 4TB ECC LRDIMM ceiling, and the new TR on TRX80 might have the same ceiling. I am glad that AMD provides TR; they make way less $ on it than on EPYC, but it's a great marketing tool for them. I am running some TRs for deep learning rigs (PCIe slots are most important) on Linux, and they are great: Titan RTXs and Teslas run without any issue. Zen 2 should also give me much better performance on classical ML with Intel MKL/BLAS in PySpark/scikit-learn, so I can't wait to get some.
Intel makes rather pessimistic assumptions about AMD: their compiler and MKL use the CPU vendor/model identification to pick which code path to use, rather than the CPU feature flags for floating point (SSE/AVX, etc.).
So if you want to compare performance fairly, I'd use gcc (or at least a non-Intel compiler) and one of the MKL-like libraries (ACML, GotoBLAS, OpenBLAS, etc.). AMD has been directly contributing to various projects to optimize for AMD CPUs. They used to have their own compiler (which went SGI -> Cray -> PathScale or similar), but since then I believe they have been contributing to GCC, LLVM, and various libraries.
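If you're benchmarking from Python, a quick sanity check before comparing CPUs is to confirm which BLAS your numpy build is actually linked against, then time a dense matmul. A rough sketch (assumes numpy is installed; the GFLOP/s figure is only a ballpark since it includes dispatch overhead):

```python
import time
import numpy as np

# show_config() prints the BLAS/LAPACK libraries numpy was built against
# (e.g. MKL vs OpenBLAS) -- worth checking before comparing AMD vs Intel
np.show_config()

n = 500
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.perf_counter()
c = a @ b
dt = time.perf_counter() - t0

# A dense n x n matmul costs roughly 2*n^3 floating-point operations
print(f"{2 * n**3 / dt / 1e9:.1f} GFLOP/s")
```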
If shopping, I'd compare the highest-end Ryzen + motherboard against the lowest-end single-socket EPYC chip + motherboard and try to guesstimate the price/performance for your workload.
Generally the Threadrippers seem like a much lower volume product and the motherboards are often quite expensive (for the current generation). Both Ryzen and Epyc enjoy significantly higher volumes.
Keep in mind that Threadripper has twice the memory bandwidth of Ryzen, but half the memory bandwidth of Epyc.
I've got the 9550 and run Ubuntu on it. No issues whatsoever and I do very compute-intensive work on it.
One thing that I've found is very important is to clean out the fans often, otherwise dust builds up and prevents cooling. Just unscrew the plate on the underside of the laptop and blow / brush / pick out the dust that's built up (both the fan intake and the fan outlet). Doing this every few months has been a game changer for my 9550.
Essentially, RNNs and feed-forward networks are very similar: an RNN is just a feed-forward network "unrolled through time", with every timestep sharing the same weights. The activations are slightly different as well, but the core concept is the same; it's not a completely different idea.
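The "unrolled through time" view can be sketched in a few lines (sizes and the tanh activation are illustrative choices, not anything canonical):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inp, steps = 4, 3, 5

# One shared set of weights, reused at every timestep -- this is what
# "unrolled through time with shared weights" means
W_h = rng.normal(size=(hidden, hidden)) * 0.1
W_x = rng.normal(size=(hidden, inp)) * 0.1
b = np.zeros(hidden)

xs = rng.normal(size=(steps, inp))
h = np.zeros(hidden)
for x in xs:
    # Same W_h, W_x, b at each step; only the hidden state h changes.
    # Each step looks exactly like one feed-forward layer.
    h = np.tanh(W_h @ h + W_x @ x + b)

print(h.shape)  # (4,)
```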
I find it hard to believe that SGD would be faster than the closed-form solutions for linear regression (LAPACK's gels, gelsd, etc.). The closed-form solutions also give a lot of other benefits in practical settings, which makes them more likely to be used where possible. SGD and related optimizers pay off with non-convex or non-analytical loss functions, non-linear layers, or more than one layer.
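For reference, the closed-form route is one call in numpy (np.linalg.lstsq dispatches to LAPACK's gelsd); the data here is synthetic, just to show both it and the normal-equations variant recover the true weights:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

# Closed form via LAPACK: np.linalg.lstsq uses the gelsd driver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Equivalent normal-equations route (fine when X is well-conditioned)
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(w_lstsq)  # both are ~ [2.0, -1.0, 0.5]
```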
Then why would anyone use TensorFlow with this loss function in practice? In my school's ML class we used this technique too (in addition to the closed-form solution). Is there any practical reason to use an optimizer to solve a linear problem?
Note that it's not just the loss function. It's the loss function combined with a very specific problem formulation, namely a neural network with only linear activations (equivalent to a network with no hidden layers). Once you go to non-linear layers or a different loss, it's no longer solvable analytically.
I do see a lot of people writing tutorials like OP's. See for example:
The existence of these articles should not be taken as an indication of best practice. They often have the goal of teaching SGD in a simplified setting, not teaching best practice for LLS. I suppose the only nice thing about using TF / SGD for such a simple problem is that you then have a starting point for solving more complex problems (ReLU activations, cross-entropy loss, more layers, etc.).
A few other points as to why you would never use SGD for LLS:
1) it's always way slower than the closed form matrix solutions
2) if you're doing SGD instead of just GD, there's noise in which "rows" are in a given batch - as a result, repeated runs may not converge to exactly the same final weights. This never happens with the analytical solution which always gets exactly the same result.
3) if you're doing this as part of a data science pipeline, which is likely the case in the real world, you'll probably want to do some cross-validation. In the SGD case you have to recompute the entire solution for each fold, whereas in the LLS case you can compute each fold's solution cheaply once you've calculated the initial XtX / Xty. This makes LLS even faster than SGD in practice.
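Point 3 can be sketched like this (synthetic data; the subtract-the-fold trick is one way to reuse the precomputed statistics, assuming held-out rows simply drop out of the sums):

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 3.0])
X = rng.normal(size=(600, 4))
y = X @ true_w + 0.1 * rng.normal(size=600)

# Accumulate the sufficient statistics once over the full dataset
XtX = X.T @ X
Xty = X.T @ y

for idx in np.array_split(np.arange(len(X)), 5):
    Xf, yf = X[idx], y[idx]
    # Training statistics for this fold = totals minus the held-out
    # fold's contribution -- no second pass over the training rows
    w = np.linalg.solve(XtX - Xf.T @ Xf, Xty - Xf.T @ yf)
    print(np.round(w, 2))  # each fold's fit is ~ [1.0, -2.0, 0.5, 3.0]
```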