In practice the functions just need to be piecewise differentiable. ReLU is the canonical example in deep learning; at its kink (x = 0) a subderivative is used.
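As a concrete sketch, here's ReLU and the usual subgradient convention at the kink (picking 0 at x = 0, which is what most frameworks do; any value in [0, 1] would be a valid subderivative there):

```python
def relu(x):
    # ReLU is piecewise linear: differentiable everywhere except x == 0
    return max(x, 0.0)

def relu_subgradient(x):
    # At the kink (x == 0) any value in [0, 1] is a valid subderivative;
    # frameworks conventionally pick 0 (as here) or occasionally 1
    return 1.0 if x > 0 else 0.0

print([relu(x) for x in (-2.0, 0.0, 3.0)])             # [0.0, 0.0, 3.0]
print([relu_subgradient(x) for x in (-2.0, 0.0, 3.0)]) # [0.0, 0.0, 1.0]
```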
It’s a little trickier than that, but you are generally right. The relaxation can involve many different things though, not just piecewise differentiability.
For example, when processing images, which are themselves intensity functions over a spatial domain and can have cusps, edges, occlusions, etc., you need weaker differentiability conditions, such as those from Sobolev spaces, to guarantee that numerical gradient operators will succeed / converge where needed.
Actually I'm pretty sure the Rome numbers are for double precision, whereas most numbers quoted for GPUs are for single precision or less, making Rome's 3.4 TFLOPS even more impressive.
Yeah, I can't wait for the announcements in the coming weeks. So far the best rumours I've seen have the new TR line topping out at 48c, but none of them have looked particularly authoritative; there just haven't been leaks of anything pointing at 64c yet. I suspect they'll announce with up to 48c and then, a few months down the line, announce the 64c CPU. That would line up well with what looks like the cause of the delays: not being able to get the quantity of chiplets they need. They'd be able to frontload all the lower-core-count demand and then, once they don't need nearly as many of those, start making the larger parts.
I really hope a 32c part would work with my Zenith Extreme X399, and 64c with 8-channel TRX80/WRX80. Then I could upgrade my old TR to a 32c Zen 2 one and buy another 64c system with 4TB of ECC LRDIMM for some machine learning tasks. I am also fine if AMD decides to do a 64c TR with Zen 3 only (4xSMT?). But based on the Blur ad, I guess they are going to release a 64c TR based on Zen 2 as well, just to completely obliterate Intel in HEDT, even if it costs $5000.
Yeah, X399 compatibility will decide when my upgrade happens. The Zen+ TRs weren't enough for me to justify it, but the Zen 2 ones seem like they've finally hit that bar. If I need to do a motherboard and other upgrades, that'll delay me for a while (need to see how PCIe passthrough and other stuff settles out with the new chipsets), but in either case I'm going to end up upgrading to this next gen one way or another.
20-30% usually. You get faster cores but no LRDIMM (i.e. you are effectively constrained to 128GB of ECC UDIMM, at best 256GB if you are lucky enough to get 32GB ECC UDIMM modules). EPYC has a 4TB ECC LRDIMM ceiling, and the new TR on TRX80 might have the same ceiling. I am glad that AMD provides TR; they make way less $ on it than on EPYC, but it's a great marketing tool for them. I am running some TRs for deep learning rigs (PCIe slots are most important) on Linux, and they are great: Titan RTXs and Teslas run without any issue. Zen 2 should also give me much better performance on classical ML with Intel MKL/BLAS in PySpark/scikit-learn, so I can't wait to get some.
Intel makes rather pessimistic assumptions about AMD: their compiler and MKL use the CPU vendor/model identification to pick which code path to use, rather than the CPU feature flags for floating point (SSE/AVX, etc.).
So if you want to compare performance fairly, I'd use gcc (or at least a non-Intel compiler) and one of the MKL-like libraries (ACML, GotoBLAS, OpenBLAS, etc.). AMD has been directly contributing to various projects to optimize for AMD CPUs. They used to have their own compiler (which went SGI -> Cray -> PathScale or similar), but since then I believe they have been contributing to GCC, LLVM, and various libraries.
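If you're benchmarking from Python, a quick sanity check before comparing CPUs is to confirm which BLAS your numpy build is actually linked against, then time a dense matmul. A rough sketch (assumes numpy is installed; the GFLOP/s figure is only a ballpark since it includes dispatch overhead):

```python
import time
import numpy as np

# show_config() prints the BLAS/LAPACK libraries numpy was built against
# (e.g. MKL vs OpenBLAS) -- worth checking before comparing AMD vs Intel
np.show_config()

n = 500
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.perf_counter()
c = a @ b
dt = time.perf_counter() - t0

# A dense n x n matmul costs roughly 2*n^3 floating-point operations
print(f"{2 * n**3 / dt / 1e9:.1f} GFLOP/s")
```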
If shopping, I'd compare the highest-end Ryzen + motherboard against the lowest-end single-socket EPYC chip + motherboard and try to guesstimate the price/performance for your workload.
Generally the Threadrippers seem like a much lower volume product and the motherboards are often quite expensive (for the current generation). Both Ryzen and Epyc enjoy significantly higher volumes.
Keep in mind that Threadripper has twice the memory bandwidth of Ryzen, but half the memory bandwidth of Epyc.
I've got the 9550 and run Ubuntu on it. No issues whatsoever and I do very compute-intensive work on it.
One thing that I've found is very important is to clean out the fans often, otherwise dust builds up and prevents cooling. Just unscrew the plate on the underside of the laptop and blow / brush / pick out the dust that's built up (both the fan intake and the fan outlet). Doing this every few months has been a game changer for my 9550.
Essentially, RNNs and feed-forward networks are very similar: an RNN is just a feed-forward network "unrolled through time", with every timestep sharing the same weights. The activations are slightly different as well, but the core concept is the same; it's not a completely different idea.
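The "unrolled through time" view can be sketched in a few lines (sizes and the tanh activation are illustrative choices, not anything canonical):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inp, steps = 4, 3, 5

# One shared set of weights, reused at every timestep -- this is what
# "unrolled through time with shared weights" means
W_h = rng.normal(size=(hidden, hidden)) * 0.1
W_x = rng.normal(size=(hidden, inp)) * 0.1
b = np.zeros(hidden)

xs = rng.normal(size=(steps, inp))
h = np.zeros(hidden)
for x in xs:
    # Same W_h, W_x, b at each step; only the hidden state h changes.
    # Each step looks exactly like one feed-forward layer.
    h = np.tanh(W_h @ h + W_x @ x + b)

print(h.shape)  # (4,)
```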
I find it hard to believe that SGD would be faster than the closed-form solutions for linear regression (LAPACK's gels, gelsd, etc.). The closed-form solutions also give a lot of other benefits in practical settings, which makes them more likely to be used where possible. SGD and related optimizers pay off with non-convex or non-analytical loss functions, non-linear layers, or more than one layer.
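For reference, the closed-form route is one call in numpy (np.linalg.lstsq dispatches to LAPACK's gelsd); the data here is synthetic, just to show both it and the normal-equations variant recover the true weights:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

# Closed form via LAPACK: np.linalg.lstsq uses the gelsd driver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Equivalent normal-equations route (fine when X is well-conditioned)
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(w_lstsq)  # both are ~ [2.0, -1.0, 0.5]
```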
Then why would anyone use TensorFlow with this loss function in practice? In my school's ML class we used this technique too (in addition to the closed-form solution). Is there any practical reason to use an optimizer to solve a linear problem?
Note that it's not just the loss function. It's the loss function combined with a very specific problem formulation, namely a neural network with only linear activations (equivalent to a network with no hidden layers). Once you go to non-linear layers or a different loss, it's no longer solvable analytically.
I do see a lot of people writing tutorials like OP's. See for example:
The existence of these articles should not be taken as an indication of best practice. They often have the goal of teaching SGD in a simplified setting, not teaching best practice for LLS. I suppose the only nice thing about using TF / SGD for such a simple problem is that you then have a starting point for solving more complex problems (ReLU activations, cross-entropy loss, more layers, etc.).
A few other points as to why you would never use SGD for LLS:
1) it's always way slower than the closed form matrix solutions
2) if you're doing SGD instead of just GD, there's noise in which "rows" are in a given batch - as a result, repeated runs may not converge to exactly the same final weights. This never happens with the analytical solution which always gets exactly the same result.
3) if you're doing this as part of a data science pipeline, which is likely the case in the real world, you'll probably want to do some cross-validation. In the SGD case you have to recompute the entire solution for each fold, whereas in the LLS case you can compute each fold's solution cheaply once you've calculated the initial XtX / Xty. This makes LLS even faster than SGD in practice.
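Point 3 can be sketched like this (synthetic data; the subtract-the-fold trick is one way to reuse the precomputed statistics, assuming held-out rows simply drop out of the sums):

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 3.0])
X = rng.normal(size=(600, 4))
y = X @ true_w + 0.1 * rng.normal(size=600)

# Accumulate the sufficient statistics once over the full dataset
XtX = X.T @ X
Xty = X.T @ y

for idx in np.array_split(np.arange(len(X)), 5):
    Xf, yf = X[idx], y[idx]
    # Training statistics for this fold = totals minus the held-out
    # fold's contribution -- no second pass over the training rows
    w = np.linalg.solve(XtX - Xf.T @ Xf, Xty - Xf.T @ yf)
    print(np.round(w, 2))  # each fold's fit is ~ [1.0, -2.0, 0.5, 3.0]
```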