I want to preface this - I don't have a strong opinion here already, and I'm curious about Ceph. As someone who runs a 6 drive raidz2 at home (w/ ECC RAM), does your Ceph config give you similar data integrity guarantees to ZFS? If so, what are the key points of the config that enable that?
When Ceph migrated from Filestore to Bluestore, that enabled scrubbing and checksumming of data as well (older versions before Bluestore only verified metadata).
Ceph (by default) does metadata scrubs every 24 hours, and data scrubs (deep-scrub) weekly (configurable, and you can manually scrub individual PGs at any time if that's your thing). I believe the default checksum used is "crc32c"; it's configurable, but I've not played with changing it. At work we get scrub errors on average maybe weekly now; at home I've not had a scrub error yet on this cluster in the past year (I did have a drive that failed and still needs to be replaced).
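For anyone curious what crc32c actually is: it's the Castagnoli variant of CRC-32, which BlueStore computes per block of data written. Here's an illustrative bit-by-bit implementation in Python (Ceph itself uses an optimized, often hardware-accelerated version, not this loop):

```python
def crc32c(data: bytes) -> int:
    # CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    # This is the same algorithm BlueStore uses by default for data checksums.
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF

# Standard check value for CRC-32C (from the iSCSI spec, RFC 3720):
print(hex(crc32c(b"123456789")))  # 0xe3069283
```

On a deep-scrub, the stored checksum is recomputed from the on-disk data and compared; a mismatch is what surfaces as a scrub error.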
My RPi setup certainly does not have ECC RAM as far as I'm aware, but neither does my current ZFS setup (also a 6 drive RAIDZ2).
Nothing stops you from running Ceph on boxes with ECC RAM; we certainly do that at my job.
Weekly scrub errors are definitely not normal. There have been a few bug fixes in Ceph & the kernel. I would check how up to date your packages are.
Hardware errors are also possible, but there have been a few software bugs, so it's worth checking.
Here's a really curly one we recently found and solved at work (only in March) that causes both scrub errors and sometimes bluefs aborts; the fix is a kernel patch. It's likely to happen under memory pressure:
That depends on the scale of your cluster. The Backblaze drive report shows they lose 1.3% of their disks per year on average. Larger operations will have one or more people who just replace disks all day.
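To put that 1.3% annualized failure rate in perspective, a quick back-of-the-envelope sketch (linear approximation, ignoring drive age effects and batch failures):

```python
# Expected disk failures from an annualized failure rate (AFR),
# using Backblaze's reported fleet average of ~1.3%.
AFR = 0.013

def expected_failures(num_drives: int, days: int) -> float:
    # Linear approximation: failures ~= drives * AFR * (fraction of a year)
    return num_drives * AFR * days / 365

# A 6-drive home array: ~0.078 failures/year, i.e. one every ~13 years on average.
print(expected_failures(6, 365))
# A 10,000-drive fleet: ~2.5 failures per week.
print(expected_failures(10_000, 7))
```

So at home a failure is a rare event, while at fleet scale replacing disks really is a continuous job.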
@mike_d you are right, I made a very bad assumption about the size of the environment :)
@antongribok having seen your subsequent comments, it sounds like you have it in hand. I just didn't want to leave someone assuming that error rate is normal in the smaller clusters that are more common, though not universal. Evidently that's not the case for you :)