Sample rates in audio hardware aren't like programming constants, where they're the same for everybody. A 0.05% sample rate error accumulates to about 1 s of drift over a 30-minute recording. For reference, USB 2.0 allows a 0.25% frequency tolerance (and many audio devices derive their clock from USB).
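A quick sanity check on that figure, as a sketch (the numbers are just the ones from above, applied to a hypothetical nominal 48 kHz device):

```python
# Back-of-the-envelope drift check. Assumed numbers: a nominal 48 kHz
# device whose crystal runs fast by 0.05% (500 ppm).
nominal_rate_hz = 48_000
error_fraction = 0.0005        # 0.05% = 500 ppm
duration_s = 30 * 60           # 30-minute recording

# Extra samples the fast device produces vs. a perfect clock,
# expressed as seconds of drift at the nominal rate.
extra_samples = duration_s * nominal_rate_hz * error_fraction
drift_s = extra_samples / nominal_rate_hz
print(f"drift after 30 min: {drift_s:.2f} s")  # ~0.90 s
```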
Cheap quartz oscillators, especially the ones in computers and some USB ADCs, are prone to shifting their rate slightly with temperature. So two devices' sample rates can drift relative to each other.
The clock drifts, and something still has to count those seconds. Even when the drift is small, phasing distortions become pretty obvious on lengthy recordings.
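To put numbers on "pretty obvious", here's a minimal sketch of how fast two free-running devices slip apart. The ppm offsets are assumptions, picked as typical for consumer-grade crystals:

```python
# Two devices that both claim 48 kHz but run on independent crystals.
# The ppm offsets are made up for illustration; +/-100 ppm is ordinary
# for a cheap uncompensated crystal oscillator.
nominal_hz = 48_000
dev_a_hz = nominal_hz * (1 + 100e-6)  # runs +100 ppm fast
dev_b_hz = nominal_hz * (1 - 50e-6)   # runs  -50 ppm slow

for minutes in (1, 10, 60):
    t = minutes * 60
    slip = (dev_a_hz - dev_b_hz) * t  # samples of relative slip
    print(f"after {minutes:>2} min: {slip:7.0f} samples "
          f"({slip / nominal_hz * 1000:6.1f} ms)")
```

At a 150 ppm relative error the two tracks pass one full sample of misalignment in about 0.14 s; mixed together, that's already into comb-filter territory, and the offset only keeps growing.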
There's some interesting work going on in the AES to support synchronised audio over wide area networks, either through better recovery of PTP clocks distributed over WANs or by combining PTP with GNSS.
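The arithmetic PTP itself does is easy to sketch; the hard part over a WAN is the symmetric-path assumption baked into it, which is where GNSS discipline helps. A minimal sketch of the standard IEEE 1588 offset/delay computation (the timestamps at the end are fabricated for illustration):

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Offset/delay from one IEEE 1588 Sync / Delay_Req exchange.

    t1: master sends Sync          (master clock)
    t2: slave receives Sync        (slave clock)
    t3: slave sends Delay_Req      (slave clock)
    t4: master receives Delay_Req  (master clock)

    Assumes the forward and return network paths have equal delay --
    a fair bet on a LAN, much shakier across a WAN.
    """
    offset = ((t2 - t1) - (t4 - t3)) / 2  # slave clock minus master clock
    delay = ((t2 - t1) + (t4 - t3)) / 2   # mean one-way path delay
    return offset, delay

# Fabricated timestamps: 5 ms symmetric path delay, slave running 2 ms ahead.
print(ptp_offset_and_delay(100.000, 100.007, 100.020, 100.023))
# -> approximately (0.002, 0.005)
```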
Maybe actual clock differences? Not sure if that's the case, but in audio engineering a separate clock may be used to keep all the devices involved in sync (many pro-level audio devices have a "clock" input for this very reason).
In RF engineering, it's typical to have all of your equipment referencing the same 10 MHz clock (or a 1 pulse-per-second signal, or IRIG-B). If I don't have a GPS receiver or a rubidium source, I'll just pick the newest, most expensive piece of equipment with a built-in reference clock and fan it out to the rest of the equipment on the bench. Some portable spectrum analyzers have built-in GPS receivers, so even out in the field you know you have a good reference.