In OpenZFS and Btrfs, everyone was just guessing (phoronix.com)
195 points by whereistimbo on Nov 30, 2023 | 165 comments


I don't know about BTRFS, but with ZFS there's a lot of bad information and assumptions floating around. It's one of my favorite topics to ask AI about because it shows the obvious deficiencies in AI as it regurgitates all the bad info. So when you go to the issue tracker, it seems plausible to me that you're going to find issues where the devs can't explain what's happening because the user(s) might be doing something crazy.

With the exception of a couple of people like Jim Salter, I don't take any advice I see about ZFS unless I know the person is a dev or has commits on the project. And most of the time I take the advice I've seen 'ryao' post which is, paraphrased, "use the defaults unless you know what you're tuning". That should go without saying, but there are a lot of armchair "geniuses" posting online about ZFS and they are just guessing.

And how do you get any new developers if the expectation is an immediate, complete understanding of the system? That just isn't realistic and, in most cases, the dev that knows a little bit probably knows exponentially more than an average user. Interacting on the bug tracker and debugging things they don't completely understand is going to improve their understanding. There's value there.

Things have gotten more complicated too. We aren't using spinning disks without caches any more, right? I can understand a disk from 20 years ago. They're relatively simple. I couldn't tell you the first thing about SSDs or NVMe stuff.

There are going to be bugs. Just be glad the devs acknowledge them and fix them. Personally I still rank ZFS as the least likely to lose my data and I'll keep using it. I put roughly zero value in a forum full of people complaining.


On the other hand there are a couple of things in ZFS that can make a big difference. For instance: changing the page size to be the same size as Postgresql uses (8kb); disabling the writing of 'last accessed' timestamps; or whether or not the filesystem attempts deduplication. From what I remember (it's been a while) these can be configured on a per-dataset basis, where a dataset is a lightweight filesystem more akin to a directory, and many of them can be changed on a live filesystem.

Here, have at it: https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops...

So personally I'd encourage messing around (on something non-prod, of course). Measure measure measure! Does it actually make any difference.

It's a shame a bug crept into an otherwise pretty much faultless filesystem. I'm sure there are some truths in there about commercial support or not running the bleeding edge or whatever. And making backups, even for ZFS backed filesystems. Sacrilege, I know.


> whether or not the filesystem attempts deduplication

Unless I'm just parroting truisms here, dedupe is almost never a good idea, while compression is far more often worthwhile and is the more common case-by-case decision.

> So personally I'd encourage messing around (on something non-prod, of course). Measure measure measure! Does it actually make any difference.

100%. So much depends on the particular workload and resources.

> And making backups, even for ZFS backed filesystems. Sacrilege, I know.

Logical backups, specifically. More than once we have seen corruption and data loss in replicated encrypted snapshots due to bugs in ZFS.


Deduplication requires an absolutely enormous amount of memory. Unless you have a dataset with a huge amount of duplication for some reason (maybe a company or school where a lot of people have their own network storage and put the same large files in?), it is almost never the right choice to enable dedup.

It is cheaper to buy more disk than the memory to enable dedup (somebody could run the numbers). ZFS ARC also likes lots of memory and you get good performance gains with more memory allocated to it.

In other words, turn on LZ4 compression everywhere and don’t think about it otherwise.
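
Running the numbers on dedup memory, very roughly: this is a back-of-the-envelope sketch, not anything taken from ZFS itself. It assumes about 320 bytes of in-memory dedup table per unique block (a commonly quoted rule of thumb, not a figure from this thread) and a chosen average block size. Both parameters, and the function name, are assumptions to adjust for your own pool.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def ddt_ram_bytes(pool_bytes, avg_block_bytes=128 * 1024, bytes_per_entry=320):
    """Estimate RAM needed to keep a dedup table for a pool in memory.

    avg_block_bytes and bytes_per_entry are assumed rules of thumb,
    not measured values; real pools vary widely.
    """
    unique_blocks = pool_bytes // avg_block_bytes
    return unique_blocks * bytes_per_entry


if __name__ == "__main__":
    for block_kib in (128, 16, 8):
        ram = ddt_ram_bytes(10 * TIB, avg_block_bytes=block_kib * 1024)
        print(f"10 TiB pool, {block_kib} KiB blocks: ~{ram / GIB:.0f} GiB of dedup table")
```

Under these assumptions a 10 TiB pool of 128 KiB blocks needs about 25 GiB just for the table, and small-block datasets (databases, VM images) are far worse, which is the "cheaper to buy more disk" argument in numbers.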


Deduplication in ZFS can be pushed onto dedicated flash storage now instead of requiring gobs of RAM.


How to do this?


Great side channel for attacks, btw.


Doesn't even need SSDs to open a side channel: https://lore.kernel.org/linux-btrfs/CAKDzk=-HZardsLFH5c9HYre...


Without having physical access to my hardware, where every disk is FDE anyway?


Physical access not required. A dedupe side-channel attack only requires timing measurements.


Oh got some hate there.. pissed off the zfs kids with a dose of reality.

Dedupe can be used to infer secret data. .. more news at 11


This is my understanding as well. These days we default to zstd, though.


Oh good, I have it off. I was pondering that feature years ago when I set it up. Couldn't remember where I landed.


Data loss bugs in ZFS?? Wow.

My personal approach to filesystem evaluation is pretty much "How long has it been since I've seen someone talk about a data loss bug". This is my first time hearing one for ZFS. BTRFS seems to have at least one post a year. Ext4 has almost none, but maybe that's because people trust it and blame hardware when unsure what the issue is.


Comparing zfs/btrfs with ext4 is not entirely fair. They're in different domains when it comes to scope and features. To some extent, silent corruption in ext4 is expected but in zfs it's a bug.


Also, compression. CPUs are more than capable of handling it on the fly, and the reduced dataset size helps IO.

This [0] is a brief post on MySQL, but most of it applies equally to Postgres.

[0]: https://www.percona.com/blog/mysql-zfs-performance-update/


I'd be curious if the addition of zstd compression has any effect on the results. Could be a fun follow-up follow-up :-)


For Postgres just follow the original doc:

https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

compression=lz4 recordsize=32K

In postgres -> full_page_writes=off


But also experiment with recordsizes of 8k and 16k. The larger the recordsize, the more potential compression you get, which may be offset by additional delay caused by read/modify/write of the larger record (i.e. reading/writing 32k instead of 8k or 16k).
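
A rough way to see the read/modify/write side of that tradeoff, as a sketch under a deliberately simple model: changing any part of a record means rewriting the whole record. It ignores the ARC, compression, and metadata, and the function name is made up for illustration.

```python
def rmw_amplification(page_bytes, recordsize_bytes):
    """Bytes rewritten per byte of page data when one database page changes.

    Simplified model: every record the page overlaps is rewritten in full.
    Assumes the page is aligned to record boundaries.
    """
    records_touched = -(-page_bytes // recordsize_bytes)  # ceiling division
    io_bytes = records_touched * recordsize_bytes
    return io_bytes / page_bytes


if __name__ == "__main__":
    page = 8 * 1024  # PostgreSQL's default page size
    for rs_kib in (8, 16, 32, 128):
        amp = rmw_amplification(page, rs_kib * 1024)
        print(f"recordsize={rs_kib}K: {amp:.0f}x bytes rewritten per 8K page update")
```

The compression gain from larger records has to beat that multiplier for your workload, which is why measuring matters more than any formula.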


True, with platter storage I tend more toward 32k, with NVMe more toward 16k. Testing is a good thing, but some settings should just be left untouched unless one is absolutely sure changing them is a good idea, e.g. logbias=throughput.


> disabling the writing of 'last accessed' timestamps

atime is the most useless filesystem feature I’ve heard of. But on most Unix-like systems it’s enabled by default and the main thing it does is turn reads into writes, degrading performance. All the while creation time (birth time, not to be confused with ctime, which is the inode change time) isn’t a standard Unix filesystem feature. And creation time to me always seemed infinitely more useful (and cheaper!) than atime.

</rant>


relatime has been the Linux default since 2009. It cuts that down to roughly one atime write per file per 24 hours, plus one after each modification.
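
The relatime rule, as I understand it from the Linux mount documentation, can be sketched like this. The function name is made up; timestamps are plain seconds.

```python
DAY = 24 * 60 * 60


def relatime_needs_update(atime, mtime, ctime, now):
    """Return True if a read at `now` would update atime under relatime.

    atime is refreshed when it is not newer than mtime or ctime (the file
    changed since it was last marked as read), or when it is at least a
    day old.
    """
    if mtime >= atime or ctime >= atime:
        return True
    return now - atime >= DAY
```

So a file that is read constantly but never modified costs at most one atime write a day.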


while it is unfortunate it requires a write, it is useful in some cases. for example, finding old objects in a cache, or today i used it to find out what firmware my kernel is loading by checking atime of /lib/firmware.


So it's useful as a debugging feature that you might want to briefly turn on once every decade.

Finding old objects in a cache is an issue specific to caches and should be implemented there.


For the latter you could enable it, reboot and get the same information right away, then disable it?


i did not see a clear list of all firmware files read by the kernel in dmesg, so the find/atime trick worked well.
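
For anyone wanting to repeat the find/atime trick without remembering the find flags, a small sketch. The helper name and the one-day cutoff are arbitrary choices, and the result is only as good as the atime mount options allow (with noatime it shows nothing; with relatime a file read twice in one day is only stamped once).

```python
import os
import time


def files_accessed_since(root, since_epoch):
    """Yield (path, atime) for files under root whose atime is >= since_epoch."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # broken symlink or permission problem, skip it
            if st.st_atime >= since_epoch:
                yield path, st.st_atime


if __name__ == "__main__":
    cutoff = time.time() - 24 * 60 * 60  # arbitrary: anything read in the last day
    for path, atime in sorted(files_accessed_since("/lib/firmware", cutoff)):
        print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(atime)), path)
```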


> changing the page size to be the same size as Postgresql uses (8kb)

Mind, you shouldn't really change the record size without knowing your use-case. Make a PostgreSQL file system, set its recordsize to 8k. That's a good idea.

Not so much a good idea doing that to all your filesystems. Stick to the defaults without a good justification to deviate.


A lot of end users run ZFS through TrueNAS/FreeNAS and I indeed found that community to spread a lot of misinfo and un-actionable advice. I liken them to gamers lapping up any snake oil solution to get 0.1 fps more. I would suggest that anyone trying to learn about ZFS avoid it; it was definitely counter-productive for me.

Reddit, random blogs and official documentation (often that by Oracle) turned out to be much better resources.


I think it comes down to the fact that the TrueNAS developers treat it as an appliance (which is a good thing, IMO). However, because so many people's only experience with ZFS is TrueNAS, when iXsystems says "Doing X on TrueNAS is a bad idea, that's not a configuration we support," it gets cargo culted as "doing X on ZFS is bad."


Using it through something like TrueNAS with sensible defaults is probably a better idea for the average person than trying to roll their own deployment, no?


I'm not sure running TrueNAS counts as learning ZFS, which I think is GP's point.

If you want to use ZFS, then TrueNAS with sensible defaults is a great way to go. If you want to learn ZFS enough to roll your own deployment, change the configuration correctly and understand the changes you make, then don't start with TrueNAS, and definitely don't listen to the TrueNAS community.

Or at least, that's how I read the GP comment.


I mean, pretty much if you want to learn anything you're going to have to put a lot of time into it. With ZFS it will be building different servers with different disk layouts and figuring out how to test and benchmark those options. Which just takes an absolutely massive amount of time.

I know, I was doing just that 5+ years ago.

Commercial storage vendors aren't really any different. Mostly the biggest difference here is they offer a very limited subset of hardware and disk configurations and tend to iterate over that very small configuration space. Step outside of that config space and they won't have much insight.

This reminds me of the earlier days of Mac vs PC. Mac tended to have a very small set of supported hardware, which in many ways helped stability. PC supported pretty much anything you threw at it, which commonly had issues with crappy drivers and untested interactions.


> If you want to use ZFS, then TrueNAS with sensible defaults is a great way to go.

I evaluated TrueNAS Scale about 6 months ago and bailed on it very quickly. My install with their "sensible defaults" ended up having swap partitions on all my spinning disks. That alone was enough for me to decide against using it. One bad choice likely means other bad choices in my experience. They also trampled / reset my 'zfs_arc_max' tunable every time a VM got started (I think) and it wasn't obvious what was doing it.

I didn't think there was any value in learning the quirks of TrueNAS over doing a setup from scratch, so I ended up going with Ubuntu LTS and doing all my own config / testing.


Reminds me of a game (MMO) I play on the weekends. Any time someone new joins, they always ask how come frame rates are so low. Then we begin the “set your graphic settings to high” conversation. They immediately say we are full of shit. It’s already set to low and getting 10 fps.

Eventually, someone explains: “just try it, or stop complaining.”

30s later, they exclaim: “holy ** I’m getting 60 fps now!”

Sometimes, what looks like misinformation isn’t misinformation.

TL;DR: this game switches to a software rendering pipeline on “medium” or lower. You have to set it on “high” to switch to the hardware rendering pipeline.


Is that a visible option somewhere or do you only have one setting, the preset?


The preset is the only way to influence it.


Wow! That's a great story


One of my favorite comment chains here was something to the effect of:

Commenter posts a nice summary of how ZFS will try a quick compression algorithm to see if a block is even compressible and then try other more expensive ones to achieve better compression.

Another commenter asks how they know that.

First commenter replies that they wrote the code for it.

Nothing more than a ZFS user (to some extent my company relies on it even) here, so I just try and color inside the lines, but I'm ever impressed by it.
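
The idea in that summary (cheap probe first, expensive pass only if the probe looks promising) can be sketched in a few lines. To be clear, this is not the ZFS code: zlib at levels 1 and 9 stands in for the "quick" and "expensive" algorithms, and the 12.5% minimum saving is an arbitrary threshold picked for the example.

```python
import os
import zlib


def compress_with_early_abort(block, min_saving=0.125):
    """Try a cheap pass first; only pay for the expensive pass if it helps.

    Returns (payload, is_compressed). zlib levels 1 and 9 are stand-ins
    for the real algorithms; min_saving is an arbitrary threshold.
    """
    limit = len(block) * (1 - min_saving)
    probe = zlib.compress(block, 1)
    if len(probe) > limit:
        return block, False  # looks incompressible, store as-is
    best = zlib.compress(block, 9)
    if len(best) > limit:
        return block, False
    return best, True


if __name__ == "__main__":
    for label, data in (("zeros", bytes(4096)), ("random", os.urandom(4096))):
        payload, compressed = compress_with_early_abort(data)
        print(label, len(data), "->", len(payload), "compressed" if compressed else "stored raw")
```

The point of the probe is that already-compressed or encrypted data gets rejected after the cheap pass instead of burning CPU on the slow one.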


You probably mean [1].

[1] - https://hackertimes.com/item?id=36608024


That is it, thank you.


> So when you go to the issue tracker, it seems plausible to me that you're going to find issues where the devs can't explain what's happening because the user(s) might be doing something crazy.

No, that is not plausible. It's a filesystem. Behaviour in all cases for a filesystem should be predictable. There are zero things a user should be able to do that would make a filesystem non-deterministic.


Tbh guessing is part of the way to solve a bug. You need to make a hypothesis and then test whether it is true. If not, try another.

Saying the zfs/btrfs teams have no idea means the person has no idea about the software development process ... Nobody fully knows a system that complex.

If they did, the bug wouldn't have happened.


> Everyone, literally, is just guessing. And then running scripts to evaluate the odds that things are actually working correctly. Just like BTRFS.

When filesystems or databases have serious bugs, they are often heisenbugs. Incredibly hard to pin down. You need to be able to replicate the bug to find what is happening.

In one of the first jobs I had, a large Oracle database started to corrupt data repeatedly, about once a month or so. Oracle had to send an on-site team to monitor the database operations to catch the bug in action. They basically had their own datacenter to record all the logs from the database, networks and operating system.

It took weeks for them to catch the bug. It was related to a certain network driver in the operating system and the bug manifested only with some specific network traffic pattern.


The hardest bug I have ever worked with was an embedded device losing all data on the flash chip.

But the issue was, this was happening even after we removed all instructions to delete data from the flash.

The device had most traces between controller and flash completely hidden as a precaution against hacking/snooping. That made the issue extremely difficult to diagnose, because it vanished completely in a test harness where the flash was connected to the controller outside the board.

The problem also happened exceedingly rarely -- we needed about 100 of these devices constantly running a certain operation, and on average we had to wait about a week until one of them died. But we had millions of them in the wild and we would get hundreds or thousands die on our customers every week.

It took half a year of debugging and tens of millions of dollars in losses.

We found that one of the unrelated components on the board emitted too much noise. And a software change to speed up writes to the flash had removed a special sequence that preceded and followed every one-word write to the flash. This special sequence protected the flash from accepting random noise as a command. Usually, the command would be something harmless or meaningless. But sometimes it would be interpreted as a real command -- and one of these was a command to clear the entire flash chip.


I forgot to mention the best part of it. The fix was a one character code change (to enable a feature that we had disabled before).


Another source of bogus commands can be brownouts. When the voltage is low enough to confuse communication signals, but not low enough to prevent destructive events from occurring.


>> When the voltage is low enough to confuse communication signals, but not low enough to prevent destructive events from occurring.

I heard from a coworker that an older version of the product had bumped from a 16MHz chip to 20MHz to run more code. Chip supplier screened parts for the higher speed, but didn't cover higher speed across the normal voltage range. When supply dipped (within the old spec) switching speeds slowed. The first circuit to fail was something in the branch condition logic. This caused all conditionals to not branch, which led to running all code in the memory space in order. That included a few instructions to.... fire the airbag. It was a rare thing, but a very big deal.


I wonder if anyone else instantly gets a visual after reading your story:

“Business woman on plane: Are there a lot of these kinds of accidents?

Narrator: You wouldn't believe.

Business woman on plane: Which car company do you work for?

Narrator: A major one.”


I heard the story decades ago, and the problem had been fixed some years prior to that. There are undoubtedly hundreds of similar issues across the tech industry, most get quietly fixed before much damage is done. I only tell the story because it's an example of how obscure details come together in unexpected ways to cause failures. That happens everywhere.


And how brittle the borders are that we programmers see as strict, like a branching condition marked "this should never happen". But in hardware, things are different.

How do you even protect against such a thing you just described? Put the airbag on a different microprocessor?


Maybe put some code directly before bag deployment code that disarms the system. So even if the execution reaches it directly through the branches, nothing will happen


Sure but what if the address bus glitches past that block?


Maybe put arming code somewhere completely different?

Of course it's possible to create more and more improbable scenarios. That's not to say those never happen; I've seen really improbable occurrences.


All the stuff that counted on our board had brownout protection. And if it hadn't, that is relatively easy to deal with (as long as you are aware of the problem).

(We may have learned a bit about noise and improved further iterations of the device, but sw change was enough.)


I think something similar happened to my sd card at a hackathon years ago.

We couldn’t change the data on it. Linux would report that data was written to it, but when we unmounted and remounted it, the data was unchanged.

https://devpost.com/software/xcaliber


Oh god, that's insane


epic debugging story


Which MCU are you using? The possibility of noise becoming a flash command is practically 0. Also, too much noise will more importantly corrupt your RAM. Flash can be sensitive to noise and that can cause bit flips.


1. What does choice of MCU have to do with it?

2. Noise can definitely affect other components. Traces are antennas that both broadcast as well as receive signals.

As technology marches forward, the voltages and tolerances for communication become smaller and tighter. At the same time, you integrate more functionality on the board so maybe your sensitive MCU and flash and RAM are all living close by to some radio equipment and PSU and whatever else. Power interruptions to certain components can create momentary flashes of noise that can act like an EMP bomb going off. On other occasions a noisy, badly designed high frequency power component can emit noise that is picked up by some random traces due to their geometric configuration and length.

Your comment suggests you never worked on a large project with thousands of components and dozens of major chips?


from the description the flash was on a separate chip: I would guess most likely a SPI interface. You can absolutely get noise picked up on a SPI interface which looks like a command: and if an erase command is 16 bits long, you are going to see that command just with random bits every so often.


Yes, it was a separate chip (per description). Yes, it was SPI flash. And yes, the word was 16 bits.
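
To put a number on "every so often", a sketch under a strong assumption: that noise shows up as uniformly random 16-bit words, which real electrical noise is not. The function names and the single-dangerous-command default are invented for the example.

```python
def p_hit(word_bits=16, dangerous_commands=1):
    """Chance that one uniformly random word equals a dangerous command."""
    return dangerous_commands / 2 ** word_bits


def p_at_least_one(n_words, word_bits=16, dangerous_commands=1):
    """Chance of at least one dangerous command among n random words."""
    return 1 - (1 - p_hit(word_bits, dangerous_commands)) ** n_words


if __name__ == "__main__":
    for n in (1, 1_000, 65_536, 1_000_000):
        print(f"{n:>9} random words: {p_at_least_one(n):.4%} chance of an erase")
```

One in 65,536 per word sounds tiny until you multiply it by the number of noise events per device and the number of devices, which is what a guard sequence around each write is there to stop.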


>The problem also happened exceedingly rarely -- we needed about 100 of these devices to run constantly a certain operation and it took us to wait for about a week until one of these devices died

That's one failure per 16,800 device-hours, or 700 device-days. That's pretty goddamn rare.


Once you have 700 units in the field, one failure per 700 days is no longer quite so rare.


And that’s running the particular op repeatedly, so in practice it’s an even rarer event. However with “millions of them in the wild” it doesn’t matter. Scaling is tough.
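
The scaling arithmetic in this subthread, written out. The 100 devices and one week come from the story; the fleet size of one million and the duty cycle are illustrative assumptions, not figures the original poster gave.

```python
def device_days_per_failure(devices, days, failures=1):
    """Device-days of operation observed per failure."""
    return devices * days / failures


def expected_failures(fleet_size, days, device_days_between_failures, duty_cycle=1.0):
    """Expected failures in a fleet over `days`.

    duty_cycle is the assumed fraction of time a field device spends doing
    the risky operation, relative to the stress test (1.0 = constantly).
    """
    return fleet_size * days * duty_cycle / device_days_between_failures


if __name__ == "__main__":
    rate = device_days_per_failure(devices=100, days=7)
    print(f"one failure per {rate:.0f} device-days ({rate * 24:.0f} device-hours)")
    for duty in (1.0, 0.1, 0.01):
        weekly = expected_failures(1_000_000, 7, rate, duty_cycle=duty)
        print(f"1M devices, duty cycle {duty}: ~{weekly:.0f} failures per week")
```

With a million units, even a 1% duty cycle gives on the order of a hundred dead devices a week, which lines up with the "hundreds or thousands" reported.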


We had to clean out a small warehouse and repurpose it as a testing rig. We created a piece of hardware to do automated testing and recording, and then we set up a small production line and produced 100 of them.

Fortunately, we already had all the design, development and production resources to do the project on site (yes, we were doing it all, not outsourcing to China).


First, you may enjoy [1].

Second, a story from many years ago.

The local ACM chapter I was a member of bought a whole bunch of e1000 NICs, back in the day when 1Gb at relatively affordable prices was new, and a switch, and threw them in our servers and desktops, which were a motley assortment of Intel, AMD, and Other things, running various flavors of Linux and Solaris.

_Some_ of the Linux systems had a problem where eventually, the NIC would stop responding and dmesg would flood with "Tx Unit Hang" or "Rx Unit Hang", and not work again until a reboot.

If we swapped the NIC from a system where this never happened to one with the same model NIC where this happened, the problem kept happening - that is, it was something about the system+NIC, not just the NIC.

Intel eventually gave us a shipping label to borrow one of our systems when they said they hadn't been able to reproduce it but kept having people report it, and we said "we have a number of them".

They made some really exciting noises after some debugging, and eventually concluded the issue was that our really cheapass Sempron board that we sent them screwed up badly if you tried negotiating 64-bit PCI in their normal PCI slots, and would sometimes miss the messages between the NIC and the host saying that Tx Ring buffer 0 is free/full, etc, and eventually, it would miss so many of those that it would think the NIC had no more buffers, and here we are. They couldn't reproduce it because all their testbeds were well-designed and tested things.

Choice quotes mid-debugging include "I do not see how anyone could make a PCI device work reliably in this system".

They did figure out a workaround, and ship it, and asked if they could keep the system permanently for future testing.

But that was exciting.

[1] - https://github.com/openzfs/zfs/pull/15588


A second fun one.

In a previous life, I was paid to build storage systems for some HPC-ish workloads.

So I was testing a bunch of cheap-ish desktop drives that had really nice (for the time) sequential throughput, supposedly, and threw together a Supermicro system with a Xeon, ECC RAM and some SAS HBAs and several external enclosures full of these disks, made a couple of raidz3s, and started trying to stress it.

I quickly found that rarely but somewhat reliably, I'd get correctable checksum errors, and because it was raidz3, it always had a lot of spare recovery bits, but the numbers kept going up...and went up even after a scrub finished and "corrected" all of them.

Well, SAS has checksums over the wire, so I'd be seeing disk errors if the wires were eating my bits, and it was across all the disk controllers, so either they were all bad or it wasn't the controllers, ECC RAM and no correctable or uncorrectable events fired...

This predated ZoL, so this was originally on early illumos - I then tried FreeBSD, and it did the same thing.

Huh.

So, the drives in question were Samsung HD204UIs, which, it turns out, have a really spicy firmware bug, where if you send them a SMART IDENTIFY request with data in the write cache (i.e. it already told the OS it was stably written out), it just...dropped it.

So the background smartd I had running for collecting data on the disks was causing them to eat some of the writes whenever that lined up.

The funniest part was, though, that Samsung released a firmware update, but it doesn't change the reported firmware revision, so you can only know by testing if the disk is going to eat your data like that.

I believe smartctl still prints a big warning to this day about all this if you ask it about those drive models.


One more for the road.

So, I have a little old SPARC64 niagara 2 box, which is a strange beast for a number of reasons.

The onboard NIC is a strange bespoke Sun-created Ethernet card, over the PCIe bus.

Since it's a SPARC in 202x, it's not really well tested or supported, so updating is always fun.

One day, I lost power, and on next boot, it booted a newer kernel, and the kernel panicked.

I rapidly determined this was from trying and failing to initialize the NIC, and attempting to bisect it turned out to be really exciting - after getting impossible bisect results, I realized, once I booted a "working" kernel, all successive kernels worked until a cold boot.

The problem, you see, was that they had changed some things about the PCIe initialization to enable certain kinds of memory protection features.

The actual memory protection feature works fine on this hardware - but the _check_ of whether the feature is supported causes the machine to panic.

But once the machine is booted, it doesn't try to re-initialize this on warm boot, so it doesn't trip this problem.


Disk drives really are the worst. Someone who works at Google or Amazon or The Internet Archive could (and should!) write a book where every single chapter tells the story of discovering some new batch of hard drives with a new and crazy firmware bug.


> certain network driver in the operating system and the bug manifested only with some specific network traffic pattern.

Found a similar issue in a switch stack relating to multicast stream delivery. Knowing about this class of bug and having isolated it down to the switch, as the only possible device in the path, being responsible, I filed a ticket with the appropriate team to get the stack rebooted.

They were understandably skeptical of my supposed prescience and so I had to spend a considerably larger amount of time writing test cases and running deep packet captures in order to prove that a single bit, in the port field, was getting flipped under circumstances that were repeatable but not entirely explainable without even more work.

Finally, they relented and agreed that, just turning it off, then on again, was indeed the best initial solution. Fortunately with all the test harnesses in place we were able to prove that it cleared the issue on that set of ports.


Network drivers contribute to "bugs" surprisingly often. I've had my share of issues with them.

I'd go as far as to claim this: If you're shipping a product that relies on networking at your customer's premises you basically need to maintain a list of cards and drivers that you support. Especially if you're doing anything weird, where "weird" means anything but bog standard TCP.


How can a network driver corrupt a database?


Most network drivers run in kernel context and can access and write to everything.

Additionally, unless your system has an iommu and configures it to strictly limit your NIC, most NICs can DMA to any address in memory.

Unintentionally providing the wrong address to receive a packet with is a great way to corrupt a database.


By corrupting the data that was received.


I never really accept bug resolutions like that. Oracle corrupted data because a network driver fed it bad information? Did this network driver manage to preserve the legality of the corrupted info? Unlikely, so why did oracle believe it? If the corrupted message was still legal, then it sounds like the protocol isn't robust.

I'm sure there's all kinds of nuance to your tale, but it still sounds like Oracle's bug to me. They just found what it was in your system that introduced the behaviour they hadn't anticipated, and then got you to remove it.


network drivers can usually corrupt arbitrary kernel memory


The bug was finally fixed with the following commit: https://github.com/openzfs/zfs/pull/15579/commits/679738cc40...

It looks like they managed to exactly pin down what was happening.


Sidenote: What a great commit message. The expanded comment explains why both checks are necessary, but the commit message gives so much more context for anyone wondering and `git blame`ing that line.


Totally. Also: I've never seen Sponsored-by:

> Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

> Sponsored-by: Klara, Inc.

> Sponsored-by: Wasabi Technology, Inc.


This is a longstanding tradition from FreeBSD. A list of our commit message trailers: https://docs.freebsd.org/en/articles/committers-guide/#_incl....

"Sponsored by:" search in FreeBSD commit messages: https://freshbsd.org/?q=%22Sponsored+by%3A%22


I like it! Seems like a great way to incentivise companies to contribute.


> I like it! Seems like a great way to incentivise companies to contribute.

It also helps with lawsuits, which was a major thing right at the beginning of history of BSD.

* https://en.wikipedia.org/wiki/UNIX_System_Laboratories,_Inc.....

Linux later had the lawsuit issue with SCO and IBM and tracking where certain things came from (which was not helped by the fact that Linus Torvalds refused to use source code tracking for the longest time (later taking up Bitkeeper, and later developing git)).

* https://en.wikipedia.org/wiki/SCO–Linux_disputes


The beginning of BSD was the 1978 shipment of 1BSD by Bill Joy.

You likely mean the end of the effort to replace all AT&T code, which was more than a decade later.


A more "correct" fix has been posted https://github.com/openzfs/zfs/pull/15615


Neither of these fixes is great because neither of them includes tests. This is a dataloss bug in a filesystem, no existing test covered it, and no new tests were added to demonstrate the efficacy of the supposed fix or prevent regressions.


There was, actually, a test, but it turns out to be rather hard to make a deterministic test to reproduce a rare race condition without sticking your fingers in at runtime and forcing the "wrong" ordering.
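
For readers who haven't done this: "sticking your fingers in" usually means adding a hook inside the race window so a test can force the bad interleaving every time. A toy sketch of the pattern follows. It is a made-up check-then-act bug, not the ZFS one, and every name in it is hypothetical.

```python
class Cell:
    """Toy object with a race window between checking and using its state."""

    def __init__(self):
        self.data = b"old"
        self.dirty = False
        self.after_check_hook = None  # test-only injection point

    def write(self, data):
        self.data = data
        self.dirty = True

    def looks_clean(self):
        clean = not self.dirty
        if self.after_check_hook:
            self.after_check_hook()  # lets a test run code inside the window
        return clean


def racy_snapshot(cell):
    """Buggy reader: trusts a 'clean' answer that may already be stale."""
    cached = cell.data
    if cell.looks_clean():
        return cached  # stale if a write landed after the check
    return cell.data


if __name__ == "__main__":
    cell = Cell()
    cell.after_check_hook = lambda: cell.write(b"new")  # force the bad ordering
    print("reader saw:", racy_snapshot(cell), "actual:", cell.data)
```

Without the hook you would need threads and luck to hit the window; with it the failure reproduces on every run, at the cost of test-only code living in the real code path.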


There's ongoing work on a stress test for this. https://github.com/openzfs/zfs/pull/15608


The same was true of the fix for the completely deterministic cp from unencrypted to encrypted with block cloning. There doesn’t seem to be much incentive to write tests.


OpenZFS is on an exceedingly short list of software in my life that I actually trust to do what it's supposed to. Where most software just up and falls over every so often, ZFS chugs along, day in and day out.

Then ZFS has one bug and everyone starts acting like the sky is falling. And sure, it was a bad bug, but you had to be pretty unlucky to trigger it, and it was present in ZFS for all of a month.

Meanwhile, people's Google Drives are apparently forgetting the last eight months of changes.

OpenZFS is fine.


Block cloning happened to expose this bug easier, but it seems like this bug might actually date back to the very beginning of ZFS with Sun, with the opportunities to trigger it being so rare, that nobody had noticed until now.

I suppose it's still a bad bug, but it goes to show that file systems are complicated beasts, practically impossible to test all the ways they can be put through the wringer, and bugs like this can hide for 17 years before being unearthed.


> it seems like this bug might actually date back to the very beginning of ZFS with Sun

Looks like you might be right about that. The oldest commit referenced in the fix [0] was from 2006[1], which was ~5 months after Sun released ZFS.

Glad this is fixed!

[0] https://github.com/openzfs/zfs/pull/15571

[1] https://github.com/illumos/illumos-gate/commit/c543ec060d


I’m sure for those who encountered it, it’s not easy to dismiss. But it does seem like this might be getting a little extra attention perhaps because it’s relatively uncommon to hear about a flaw in ZFS at all.

Drawbacks and tradeoffs abound, but outright “bugs” doesn’t seem like something ZFS is generally associated with.


While I am firmly in the ZFS camp, my feeling is that there has indeed been a gradual slide in the disciplined development of ZFS.

After some initial bumps in brand new software, the Sun kernel group did a good job of avoiding corruption. Once Sun was absorbed by Oracle, OSS ZFS moved into illumos, who were overall quite good at doing the same (although they had fewer resources to play with). OpenZFS brought ZFS to the rest of the world (good), but I can't help but notice that the illumos devs are increasingly worried about pulling changes back from OpenZFS into illumos.

The increased popularity is for the best, but the change rate has increased, with all that entails.


Note that the bug that is the topic of discussion here predates OpenZFS. Whether or not there has been a slide in disciplined development in OpenZFS, this bug does not support that assertion.


Since the topic is whether or not the current OZFS developers understand what is going on well enough to reliably fix the bug, I think it still applies.


No, the topic of that thread is to prevent such bugs in the first place.

And I think that OP is right, development speed is really a bit too fast at the moment for a filesystem. Maybe a code review à la OpenBSD could do the job?


Too much source, eventually a human cannot follow it. A common software (and not only) problem.

Maybe sometimes once in 2-3 decades it's worth taking the lessons learned and re-doing the thing? Simplifying and streamlining it greatly in the process?

Like, this bug, the file holes - do we even need them now, with very good compression for once?


> Good god people, this is not the way something as complex as a file system should be developed. There need to be lead architects that define the operation of the system, document it clearly and concisely, and create verification systems using both coverage and fuzz testing. And as new changes and/or features are introduced the verification systems must be updated to accommodate them.

ZFS has literally all of these things. I’m not sure if he’s just new to the community or making really poor assumptions based on his experience with btrfs.

Heck just look at GitHub. Anytime a new feature is being developed part of the merge process includes results from the test suites…


It's the Phoronix forum... 90% trolls and wannabe OS engineers.


I’m not going to corroborate the entire article, but my experience from an outage 2 days ago on our new ZFS on Linux file server has left a bad taste despite years of great use with ZFS in the FreeBSD world.

OK, so we hit the deadlock that was fixed here: https://github.com/rohan-puri/zfs/commit/8e4d086c13c16bc461b... but as far as I can see it never got merged to master or a release. Current master: https://github.com/openzfs/zfs/blob/acb33ee1c169bf1c1f687db1...

When I looked up the problem, I could only see the issue being discussed, probably leading to that commit, and the fix was still not in my current release (2.1) several years later. I’m wondering if ZFS still holds that high standard for reliability.


Isn't it there?

https://github.com/openzfs/zfs/commit/fd7265c646f40e364396af...

    git branch --contains fd7265c
    
    * master

Since 0.8.

    git branch -r --contains fd7265c | head
    
      origin/HEAD -> origin/master
      origin/compat-5.15-META
      origin/issue-14573-backport
      origin/master
      origin/revert-14721-remove-quota-zap
      origin/zfs-0.8-release
      origin/zfs-0.8.7-staging
      origin/zfs-2.0-release
      origin/zfs-2.0.3-staging
      origin/zfs-2.0.4-staging


First, there is a FreeBSD Errata Notice for this that offers a nice quick collection of the various bugs and subsequent repairs, with links to summaries, for anyone who is catching up on this issue:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275308

Second, I don't like the editorialization of this title ("In OpenZFS and Btrfs, everyone was just guessing") at all. No, nobody was "just guessing", but as far as I know there is no featureful FS that has undergone formal verification. It's a large codebase started by a now long defunct company that solved critical problems and delivered a lot of value, but there absolutely can be issues lurking over long time periods. It's a testament if anything that work and usage did uncover the issue, it wasn't brushed away at all but instead drilled down on and solved in short order, and now the already extensive test suites are expanded again in an organized way. And speaking of "critical problems":

Third, a lot of the commentary around these sorts of things seems to indulge in noticing the rare misses while ignoring many hits. Amongst other reasons part of the core motivation for me to switch to ZFS 100% in 2010/2011 or so and doggedly stick with it ever since was precisely because I experienced data rot (permanent corruption) with my data under previous filesystems where that WAS NOT A BUG. HFS/UFS/NTFS/XFS/whatever, none of them offer any guarantees of data integrity in the first place! Bit flips somewhere, hardware has issues, copying has noise, whatever? RAID-5 write hole? Those or lots of other things are not bugs at all in old filesystems, because that's just how those primitive things were. I've been carrying forward data since my first Apple IIe, and went back to find some of my old work, early digital photos and drawings I cared about, and somewhere along the line it had gotten mucked up. I know not where or when, because there was no chain of checksumming and trust that would have a chance to alert me. It's impossible for a human to keep up with terabytes of data manually, it has to be fully automated, baked in. At least in ZFS there being any corruption is a drop-everything big deal problem that is incredibly rare and niche and serious people will care very much about. Pretending the old stuff was good or even acceptable is pure bullshit.

It doesn't have to be mathematically perfect to deliver value, and importantly to be far superior to everything that came before. And it's inarguable that it's been battle tested very very hard for a very long time at this point. It's certainly saved my bacon a few times.
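As a rough illustration of the "chain of checksumming" point above, here is a minimal Python sketch of detecting silent corruption by comparing data against a previously recorded checksum. This is a toy manifest held in a dictionary, not ZFS's on-disk design; the file names and helper functions are made up for the example.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: dict) -> dict:
    """Record a checksum for every file while the data is known to be good."""
    return {name: checksum(data) for name, data in files.items()}


def find_corrupted(files: dict, manifest: dict) -> list:
    """Return the names of files whose current contents no longer match."""
    return sorted(
        name for name, data in files.items()
        if manifest.get(name) != checksum(data)
    )


# Hypothetical in-memory "files" standing in for data on disk.
files = {"photo.raw": b"\x00\x01\x02\x03", "notes.txt": b"hello"}
manifest = build_manifest(files)

# Simulate a single flipped bit somewhere along the way.
damaged = bytearray(files["photo.raw"])
damaged[2] ^= 0x01
files["photo.raw"] = bytes(damaged)

corrupted = find_corrupted(files, manifest)
```

A filesystem without checksums has no equivalent of `find_corrupted`: the flipped bit is simply returned to the reader as if it were the original data.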


Yep, this is how I ended up running ZFS. Because with XFS the answer to "oh no a power cut" was "lol, a bunch of files might be zeros. Figure out which ones."

The whole weird implication that there's something else which is perfect is always bizarre: there isn't, everyone knows that, it's extremely non-trivial to do that.


> Because with XFS the answer to "oh no a power cut" was "lol, a bunch of files might be zeros. Figure out which ones."

Got bitten by that one once as well. I recall the XFS FAQ had something in it like: "If you know it hurts, then don't do it."


Wasn't this XFS problem solved by write barriers?

I remember that happening to me once long ago, but never on rhel 7 or above.


I just want to say that you can do checksums on...I mean under xfs/ext etc on linux with dm_integrity, I've talked to many admins who have never heard of this device mapper.


Somewhat related, Red Hat withdrew from the attempt to productionize Btrfs; apparently it was fully removed in RHEL 8:

- https://access.redhat.com/solutions/197643

Also, there's a PDF doc (from a conference presentation), covering many layers of potential alternatives for ZFS features:

- https://hackertimes.com/item?id=38484598

But "just" slapping all these layers on each other (configuring each individually and then hoping for the best) doesn't seem like the best idea for production systems either.

I guess back in the Sun/Solaris days, people still had the luxury of releasing (major) features in major versions, every few years instead of weeks or months... and as far as storage, backup, security solutions go - can't really do this right with the move fast & break things culture...


Suse has been using production btrfs for years now. https://en.opensuse.org/SDB:BTRFS


Supposedly (and I only know this because Kent has talked about it in various public conferences) Kent has a regular conference call with Red Hat engineers and there is interest from Red Hat in helping it get polished up. So perhaps in a few years bcachefs will be a supported option.


One thing to remember about OpenZFS is that it is in many ways a port of a file system more than the file system itself. I could see developers there being less steeped in the file system, and more familiar with the porting layer.

I'm going to spend a moment praising scripts though...

A dozen or more years ago I was running a bunch of backup servers. The backup system was rsync to a zfs, then snapshot the zfs. Then do a rolling delete of snapshots to achieve X daily, Y weekly, and Z monthly copies. This is because the hardlink trick is amazingly inefficient.
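The "X daily, Y weekly, Z monthly" rolling delete can be sketched roughly as follows. This is my guess at one reasonable policy (keep the newest snapshot in each of the most recent weeks and months), not the commenter's actual script; the function name and the sample dates are assumptions.

```python
from datetime import date, timedelta


def snapshots_to_keep(snapshot_dates, daily, weekly, monthly):
    """Pick which snapshot dates survive a daily/weekly/monthly retention policy."""
    newest_first = sorted(set(snapshot_dates), reverse=True)
    keep = set(newest_first[:daily])          # the most recent `daily` snapshots

    def newest_per_bucket(bucket_of, limit):
        # Keep the newest snapshot in each of the `limit` most recent buckets.
        seen = []
        for day in newest_first:
            bucket = bucket_of(day)
            if bucket not in seen:
                if len(seen) == limit:
                    break
                seen.append(bucket)
                keep.add(day)

    newest_per_bucket(lambda d: tuple(d.isocalendar())[:2], weekly)   # (ISO year, week)
    newest_per_bucket(lambda d: (d.year, d.month), monthly)           # (year, month)
    return keep


# Hypothetical history: one snapshot per day for 90 days.
today = date(2023, 11, 30)
all_snapshots = [today - timedelta(days=n) for n in range(90)]

keep = snapshots_to_keep(all_snapshots, daily=7, weekly=4, monthly=3)
to_delete = sorted(set(all_snapshots) - keep)   # candidates for a snapshot destroy
```

Everything in `to_delete` would then be handed to whatever removes snapshots on the backup host.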

I started off using Nevada, but had issues and no real expertise there, so I switched to Linux with ZFS fuse. That worked about as well as Nevada, but about once a month or two zfs-fuse would die.

I built a script that would simulate the access patterns of the backup systems. I then went through a few iterations with the developers of running the script (which logged its actions as shell commands, so the log could be re-executed), reporting the results to the developers, and them producing a fix. After a few back-and-forths with the developers, zfs-fuse became rock solid for my use case. And the script was the key.
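A sketch of the "log the actions as shell commands" trick, under my own assumptions: the dataset name, the source path, the mountpoint derived from the dataset name, and the mix of operations are all hypothetical, and this version only generates the log rather than running anything. Seeding the random generator means a failing run can be regenerated exactly, and the log itself can be handed to a developer and re-executed with a shell.

```python
import random
import shlex


def generate_workload(seed, steps, dataset="tank/backup", source="/srv/data/"):
    """Produce a reproducible list of shell commands mimicking a backup cycle."""
    rng = random.Random(seed)                 # fixed seed -> identical workload
    snapshots = []                            # snapshots created so far, oldest first
    log = []
    for step in range(steps):
        action = rng.choice(["sync", "snapshot", "destroy"])
        if action == "sync":
            # Assumes the dataset is mounted at /<dataset>.
            command = ["rsync", "-a", source, "/" + dataset + "/"]
        elif action == "snapshot" or not snapshots:
            name = f"{dataset}@step{step}"
            snapshots.append(name)
            command = ["zfs", "snapshot", name]
        else:
            command = ["zfs", "destroy", snapshots.pop(0)]   # rolling delete
        log.append(shlex.join(command))       # each entry is a runnable shell line
    return log


log_lines = generate_workload(seed=42, steps=10)
```

Writing `log_lines` to a file gives a script that replays the exact sequence that triggered a failure.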


So are we linking from forums to other forums now just to talk smack about the forumgoers on said other forum?


yeah, why is this not linking to the actual article?

https://www.phoronix.com/news/OpenZFS-Data-Corruption-Battle


I have been more than satisfied with single-drive Btrfs with OpenSUSE so far, but nevertheless I can't wait for bcachefs to prove itself. On paper at least, bcachefs seems like the most attractive of the three.


ZFS will still be preferable for production for a long while until bcachefs has proven itself, and bug reports slow down. That said, I'm also stoked for bcachefs.

ZFS has always appeared to me like a bloated mess. Btrfs is more minimal, but too much so, in that it doesn't do parity (at least very reliably), which sucks because I don't always want mirroring. And it has a reputation for gobbling your data up.

If bcachefs can provide all the most useful features of ZFS, while being minimal and baked into Linux by default, man that will be amazing.


bcachefs's caching (foreground_target, promote_target, background_target) seems more straightforward than ZFS's ZIL/L2ARC stuff since with bcachefs you can use the same device for accelerating writes and reads of frequently accessed files, while (from what I understand) with ZFS you need two partitions, one for each. I'm not a ZFS expert by any means so maybe I've got the wrong ideas, but bcachefs seems a lot more elegant and straightforward in many regards. Still, I'm not jumping to adopt a brand new filesystem.


What does bcachefs need that it doesn't have?


I worded it weirdly. It has what it needs already I think, but I'm saying: if it can reliably and efficiently do all the most useful functions of ZFS, while being built-in to Linux, and not starting off terribly bloated, that will be fantastic.


Going by the most recent Phoronix benchmarks it still needs a fair bit of optimization work. Of course, I'm glad that Kent is focusing on squashing bugs and polishing the architecture first.


Reading the comment re: what the FS dev teams should do by someone who I bet didn't pay a dime for the software just upsets me.


Also going on about how everything was done right "back in my day". Which doesn't sound true to me, every old graybeard story I read was full of complaining about how buggy and unusable the systems by the old tech giants were.


Mr. Behlendorf seems to work at a nonprofit that gets grants from DARPA. So if they are a US citizen, they did in fact pay for this.


So many hate mongers in that thread, rubbing salt in the wound rather than trying to fix a problem that could happen for any filesystem.


People are unreasonably tribal about file systems.


> People are unreasonably tribal about file systems.

I think it's probably worse than usual with ZFS, mostly for two reasons: a lot of hype (justified or not) early in its life, leading to hype aversion and/or backlash, and a "sour grapes" effect because its license ensures it will never (or at least until its copyright expires) be merged into mainline Linux, so those who can't or won't use it (due to a need or desire to run only mainline Linux kernel code) feel left out and/or get annoyed at seeing it pushed as the best thing since sliced bread all the time.

Adding to that, its main competitor (btrfs) is known for having had (real or perceived) reliability issues early in its life, making its use more controversial (and its proponents more defensive at seeing the same reliability issues being raised all the time).

And file systems are important, since people are reasonably concerned about the integrity of their data. They are also not easy to migrate (other than a rarely used tool to migrate in place from the ext2/3/4 family to btrfs, migrating filesystems usually requires dumping and reloading the whole data), making the choice of filesystems an important decision.


Well, it's typical of the phoronix forum, they even start a fight over zstd compression levels. Like a bunch of 12 year olds who "mastered" installing system x, they now have to defend their choice of distribution, file system, desktop/wm, wayland or x11, and at the same time attack everything else because their choice was superior.


Most of these screeds come down to a fundamental error: trusting a single filesystem with your important data.

If you trust a single filesystem with your data, you're wrong. It doesn't matter how fancy it is, how many parity blocks it has, or how many disks it's backed by: a single logical filesystem definitionally cannot be redundant. If you pretend it is, it will blow up in your face one day, and it will be nobody's fault but yours.

My backup is a drawer full of individual hard drives. Unlike any filesystem that ever has been or ever will be invented, that drawer cannot corrupt my data.


So users are at fault for trusting a filesystem that advertises itself as being reliable?

Even if users should use multiple redundant filesystems (lol ok), that doesn't invalidate the criticisms in the linked forum post.


> So users are at fault for trusting a filesystem that advertises itself as being reliable?

Yes, absolutely 100% at fault. There are many failure modes here beyond corruption: human error, electrical surges, flooding, fire. Hell, somebody could break into your house and steal the NAS!

Filesystem bugs aren't a big deal. If they are a big deal to you, it's because you're trusting a system with a single point of failure, and the consequences of that inevitable failure are not acceptable to you. Stop doing that.

If I could choose between introducing a random filesystem bug and a random wifi driver bug into the kernel on my laptop, I'd take the filesystem bug every day of the week and twice on Sunday. Reinstalling my laptop is at worst a minor inconvenience that takes me an hour, and I can always use a different filesystem until the bug is fixed.


Could you imagine if we treated other pieces of technology the same way backup fanatics talk about backups?

It's your fault for not having 2 extra phones on you in case your phone bricks itself at a bad time.

It's your fault for not encrypting your signal messages by hand, encryption bugs aren't a big deal, if they're a big deal to you, it's because you're trusting a system with a single point of failure, and the consequences of that inevitable failure are not acceptable to you. Stop doing that.

Your position sounds insane to me if applied to any other piece of technology.


Equating a filesystem corruption with bricking a device shows you don't get it. Filesystems are ephemeral: all I have to do after a corruption is reinstall the system, and it's as good as new.

You also missed the most important part of what I said:

>> because you're trusting a system with a single point of failure, and the consequences of that inevitable failure are not acceptable to you.

I don't care if my phone fails: it's disposable. Sure, there are times it might be inconvenient, but it's not like I need it to drive my car or get into my house.

If your phone failing is some huge big horrible problem for you, yeah, you absolutely ought to carry two phones around. It's unreasonable to expect consumer grade technology to be that reliable.


Phone failing is a temporary inconvenience solved by getting a new one.

Backups failing mean 15 years of family photo memories forever lost.

Risk mitigations should reasonably scale in proportion to the cost of loss.


What you are describing sounds like the usual backup strategies. Filesystem bugs that silently corrupt your data will also get synced and backed up.


> Filesystem bugs that silently corrupt your data will also get synced and backed up

This is a very easy problem to solve: don't do incremental backups. Or have N backups and rotate, which isn't as good but still gives you more time to notice. Hard drives are cheap.


Still doesn't excuse a filesystem claimed to be designed for reliability from having shoddy development practices.


Lack of architectural direction is a common problem for many open source projects. They may have fancy user-facing documentation and a big test suite, but architecture schemas and rationale for design decisions are nowhere to be found.

Usually you have to dig through the git blame until you discover the commit message is nothing but “Merge fix-something into main”, from there you have to crawl the commits to find a PR number. Inside the PR, the only useful data is “Fixes: #1234”. From there, if you are lucky, you can find some rationale hidden among the 100 comments from stale-bots. If you are not lucky, the info you need is described in another issue of another library that is cross-linking to this issue.

For something sensitive as a file system, hearing they are no better is extra scary.


The crappiest bug I had was at one store - instead of using the recommended kit they wanted a cheaper solution so opted for a different barcode scanner but still reputable from a reputable supplier.

Store started complaining about sometimes scanning muffins - this was where the price was embedded inside the barcode - the prices were ridiculous for muffins but the barcode has a luhn check digit.

"Must be our code," management said, so we looked at our serial driver until we eventually found evidence, through logging all serial I/O traffic, that it was the barcode scanner that was sending back a corrupted barcode.

I would be more worried about what you can't see. SSD controllers: yikes, I don't want to think about the bugs in those things.
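On the check digit mentioned in the muffin story: a short Python sketch of the Luhn algorithm shows why a check digit narrows the odds of accepting a corrupted code but cannot eliminate them. I don't know what encoding that scanner or barcode format actually used, so this only illustrates the general idea; the sample numbers are a well-known Luhn test value and two hand-made variations of it.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn check."""
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9        # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


good = luhn_valid("79927398713")            # intact code
single_error = luhn_valid("79927398710")    # one digit corrupted: rejected
double_error = luhn_valid("79927398614")    # two digits corrupted: still accepted
```

Any single wrong digit is caught, but two errors that cancel each other out sail straight through, which is consistent with a scanner occasionally delivering a "valid" barcode with a ridiculous price.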


I'm a little sad to see this. I've been using ZFS on my PCs for years with minimal issue and significant benefits. [1] I've even contributed [2]. And I remain a huge fan.

I don't believe that "everyone is just guessing". There are some pretty knowledgeable folk that work on this.

[1] I've triggered some corruption in snapshots on an encrypted pool. No permanent problems resulted and no data was lost.

[2] I provided a very minor documentation fix that was encouraged and promptly merged. https://github.com/openzfs/openzfs-docs/pull/472


Welcome to fucking software. I updated Mac OS X one time (or maybe it was just iTunes) and it deleted my home directory (remember that one)? I didn't open the mail promptly and missed the recall notice that my car's antilock brakes had a failure mode that was catastrophic in snow, and then I drove in the snow and then couldn't stop the car. I flew on this new model of Boeing jet a couple years ago, and it had a bug and crashed and I died.

OK, one of those is not true but this is how ALL software of any complexity is. I'm also old and ornery like the OP and I don't like it and if I get drunk I will talk hella shit about it but it's just a fact.

Software sucks. There was indeed a time when we (collectively) did a lot more to try to prove it worked before shipping it. But then:

a.) the money people realized it made money if it mostly worked, and made less money if we went all formal-proofy on it

b.) the software people realized that all the specs and architecture documents and fuzz testing and automated semantic analysis and literal actual voodoo dolls all did totally help, BUT...

(you know, they help ... BUT probably wouldn't have helped in this case, or lots of other cases...)

So this is just the way software is in almost every walk of life. ZFS is amazing, one of the most life-improving software technologies for me; up there with antibiotics, and UTF-8.

Yes, they had a pretty amazing bug, too. So yeah, "... at least for now, there's simply no way to reliably detect bit rot and other data integrity issues and be assured they can be remedied."

Which is just like it ALWAYS HAS BEEN and ALWAYS WILL BE.

But in the fullness of time, so far, ZFS has gotten us closer than anything else to that (impossible, unachievable) ideal.


So par for the course for about every piece of complex enough software on this planet.


Race condition is as race condition does.


We need RustFS :).


We need NVMe-KV-LFS, which would be a key-value addressed large object store based on NVMe key-value namespaces.


It's easy to forget the difference between an open source project and a corporate project. Feel free to contribute with your vast wealth of knowledge.


A terrifying accusation. I've had both OpenZFS and Btrfs eat my data, luckily I had backups but did lose a few files. XFS is probably the best bet on Linux, it was beautifully designed and implemented by SGI back in the day. Hopefully the ChatGPT generation don't start modifying that code too much.


> XFS is probably the best bet on Linux

Yes, agree. However, I also lost files on it (had files overwritten with 0s, as other people already commented here). Also, it's your best bet if you don't need the extra features provided by ZFS. Thanks to its checksum validation, I detected bad SATA cables three times already (on different systems), instead of getting corrupted data.

>it was beautifully designed and implemented by SGI back in the day. Hopefully the ChatGPT generation don't start modifying that code too much.

Except it's not what happened. Please check this talk from Dave Chinner, where he explains how XFS was developed and ported to Linux and then made robust and stable by him and other contributors. The disk format has changed many times, now it's robust and resilient against metadata corruption.

XFS Development talk by Dave Chinner https://m.youtube.com/watch?v=FegjLbCnoBw&t=683s&pp=ygUQeGZz...

History of Linux FS, also by Dave Chinner: https://m.youtube.com/watch?v=DxZzSifuV4Q&pp=ygUQeGZzIGRhdmU...


It sounds like it's in safe hands, good to know!


I am not sure why this is linked into the comment. That comment is not interesting. The parent article is interesting.


I see a lot of comments here using the terms ZFS and Open ZFS interchangeably.

But to my understanding, these are 2 different things.

- ZFS is the very stable, non open source project that only works on BSD.

- Open ZFS is an effort to rewrite ZFS in a totally open and free format that now runs on Linux.

This was the idea I had, but I might be incorrect.


You're completely wrong.

Open source ZFS wound up in FreeBSD (and from there, other BSDs) before Sun stopped releasing Solaris source.

ZoL was a project based on said open Solaris source to run it on Linux, and while the Linux glue layer was not the FreeBSD glue layer, it was still based on the same ZFS source.

OpenZFS switched from being based on the illumos codebase to being based on ZoL a couple years ago, merged in FreeBSD support, and FreeBSD 13+ ships based on OpenZFS and not the original ZFS port to FreeBSD.

None of them are a new implementation.


And to complement the other comments, the ZFS developers quit Oracle when it was made proprietary and moved on to work on OpenZFS, which has continued evolving.

Talk by Bryan Cantrill https://www.youtube.com/watch?v=-zRN7XLCRhc

44:16 "[paraphrasing Bryan C.] I told Oracle management if they changed the license, the core ZFS would quit. And they lost all the ZFS team."


Everything is OpenZFS now except Oracle's fork.


From my recollection, the basic story is Sun open sourced Solaris in the early 2000's (OpenSolaris). Today's OpenZFS codebase is an incarnation of the code released in OpenSolaris. It has since been ported to Linux and continues on as a separate thing from ZFS. Oracle has since "un-open sourced" Solaris, and is why ZFS and OpenZFS are different things today, although they both share a heritage.


In many ways, ZFS is an ideal case. It was started at Sun, where robust and thoughtful engineering was the norm. That gives us an extraordinarily well–built foundation to build on. We lost a lot when Sun was killed off by Oracle.


Bug has been fixed (apparently), btw: https://github.com/openzfs/zfs/releases/tag/zfs-2.2.2


I'd wager money Brian Behlendorf has an idea how ZFS works..


Wouldn't a file system in a slightly higher level language solve some of this pain? (edit: in terms of being able to understand the system)


You're at odds with the desire (necessity, actually) to have precise low-level control, not just of the in-memory layout of the data structures, but also of the performance characteristics of your code (e.g. no unnecessary pointer-chasing). Higher-level languages tend to make it easier to pile up abstractions; we want the orthogonal property of making it harder to shoot yourself in the foot.

You can look at ZFS for inspiration. It is more complex than say ext4, but it's less complex than the full stack of mdadm+LVM+ext4 - the latter piles up the abstractions, where ZFS is able to "reach around" and e.g. directly track free physical blocks across multiple devices. So when you need to resilver the pool, you don't need to copy the unused blocks, and reduce the load on the pool (and thus the chance of double failure).
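A toy model of the resilver point, in Python. This is nothing like the real ZFS code; the block list, the allocation set, and both functions are invented for illustration. It only shows why a rebuild that knows which blocks are allocated touches far less data than one that sits below the filesystem and has to copy every block.

```python
def resilver(source_blocks, allocated):
    """Copy only the blocks the filesystem knows are in use."""
    return {index: source_blocks[index] for index in sorted(allocated)}


def blind_rebuild(source_blocks):
    """Copy every block, as a layer with no knowledge of allocation must."""
    return dict(enumerate(source_blocks))


# Hypothetical device: 1000 blocks, only every tenth one holds data.
device = [b"data" if i % 10 == 0 else b"\x00" for i in range(1000)]
allocated = {i for i in range(1000) if i % 10 == 0}

copied_aware = len(resilver(device, allocated))   # blocks read during a resilver
copied_blind = len(blind_rebuild(device))         # blocks read during a blind rebuild
```

On a mostly empty pool the difference in I/O is large, and less I/O during a rebuild means less time spent with reduced redundancy.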

Are there programming languages that have such properties? At the risk of perpetuating the meme, I'd say Rust (with no_std) fits that description. I can't tell how would it help in this specific instance (I'm very far from an expert in FS implementation), but it does tend to prevent data races in general.


I like Rust, but you are going to be writing a lot of unsafe Rust for a filesystem implementation. Multiple processes are writing to the filesystem simultaneously. "Ownership" is fuzzy and is moving around.

At that point, is Rust buying you anything for how much it's going to get in your way?

I really don't see an advantage to Rust when operating at these kinds of low levels.


We have concrete data on this at this point, and it’s just not true that, even in these sorts of low level programs, everything ends up unsafe.

And beyond that, rust has many features that are useful separate from memory safety.

https://asahilinux.org/2022/11/tales-of-the-m1-gpu/ being just one example.

That said I have no opinion if they should write this driver in Rust or not, I simply do not know about the details. But in general, “it’s too low level and so tons of unsafe and so therefore Rust is useless” is at least arguable, if not just fully incorrect, as a general point.


Have you ever written a file system? Most of the work is conforming to the semantics of the interface to the kernel/programs calling it. Very little of that requires unsafe code, and the bulk of the provably unsafe stuff (physically writing to memory/disk) is very simple.

The complex stuff can and should be written at a higher level than C.


> Have you ever written a file system? Most of the work is conforming to the semantics of the interface to the kernel/programs calling it.

Yes, I have. And I have implemented partial file locking semantics. And I can painfully remember what I went through to validate it.

Quite a few of those pointers are write pointers which are simultaneously active with a lot of read pointers and they have different owners. That is a task which is screaming "Rust is going to make your life miserable."


There are entire kernels written in Rust with 10% or less unsafe code.


"Just rewrite it in Rust" is a bit meme-y, but what I wanted to highlight is how you can look at ZFS itself for inspiration. ZFS has checksums to detect bit rot; maybe you want a language with better support for contracts / static analysis. ZFS has zvols to expose virtual block devices for use with other filesystems; maybe you want a language with safe C interop; etc. Any improvement in safety over plain old C would be desirable, and even unsafe Rust is safer.


For the bug in the commit linked up? I don't see how: the issue is a logic error causing a data race which depends on a whole bunch of factors a higher level language can't model directly.

The next step up for that is formal verification which is staggeringly time consuming (read: expensive) and hard to understand in the first place.


Maybe something like Ada with SPARK, or at least partially transactional memory to make some types "atomic" in behaviour.

But that would make it way harder to add to systems it's already running on.



