I want to preface this - I don't have a strong opinion here already, and I'm curious about Ceph. As someone who runs a 6 drive raidz2 at home (w/ ECC RAM), does your Ceph config give you similar data integrity guarantees to ZFS? If so, what are the key points of the config that enable that?
When Ceph migrated from Filestore to Bluestore, that enabled scrubbing and checksumming of data as well (older versions before Bluestore only verified metadata).
Ceph (by default) does metadata scrubs every 24 hours, and data scrubs (deep-scrub) weekly (configurable, and you can manually scrub individual PGs at any time if that's your thing). I believe the default checksum used is "crc32c"; it's configurable, but I've not played with changing it. At work we get scrub errors on average maybe weekly now; at home I've not had a scrub error yet on this cluster in the past year (I did have a drive that failed and still needs to be replaced).
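For anyone curious what crc32c actually is: it's the Castagnoli variant of CRC-32, which BlueStore computes per block of data written. Here's an illustrative bit-by-bit implementation in Python (Ceph itself uses an optimized, often hardware-accelerated version, not this loop):

```python
def crc32c(data: bytes) -> int:
    # CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    # This is the same algorithm BlueStore uses by default for data checksums.
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF

# Standard check value for CRC-32C (from the iSCSI spec, RFC 3720):
print(hex(crc32c(b"123456789")))  # 0xe3069283
```

On a deep-scrub, the stored checksum is recomputed from the on-disk data and compared; a mismatch is what surfaces as a scrub error.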
My RPi setup certainly does not have ECC RAM as far as I'm aware, but neither does my current ZFS setup (also a 6 drive RAIDZ2).
Nothing stops you from running Ceph on boxes with ECC RAM; we certainly do that at my job.
Weekly scrub errors are definitely not normal. There have been a few bug fixes in Ceph & the kernel. I would check how up to date your packages are.
Hardware errors are also possible, but there have been a few software bugs, so it's worth checking.
Here's a really curly one we recently found and solved at work (only in March) that causes both scrub errors and sometimes bluefs aborts; the fix is a kernel patch. It's likely to happen under memory pressure:
That depends on the scale of your cluster. The Backblaze drive report shows they lose 1.3% of their disks per year on average. Larger operations will have one or more people who just replace disks all day.
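To put that 1.3% annualized failure rate in perspective, a quick back-of-the-envelope sketch (linear approximation, ignoring drive age effects and batch failures):

```python
# Expected disk failures from an annualized failure rate (AFR),
# using Backblaze's reported fleet average of ~1.3%.
AFR = 0.013

def expected_failures(num_drives: int, days: int) -> float:
    # Linear approximation: failures ~= drives * AFR * (fraction of a year)
    return num_drives * AFR * days / 365

# A 6-drive home array: ~0.078 failures/year, i.e. one every ~13 years on average.
print(expected_failures(6, 365))
# A 10,000-drive fleet: ~2.5 failures per week.
print(expected_failures(10_000, 7))
```

So at home a failure is a rare event, while at fleet scale replacing disks really is a continuous job.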
@mike_d you are right, I made a very bad assumption about the size of the environment :)
@antongribok having seen your subsequent comments, it sounds like you have it in hand. I just didn't want to leave someone assuming that error rate is normal in the smaller clusters that are more common, though not universal. Evidently that's not the case for you :)