> " To people calling, "why do they use Excel?" that's like asking "why must we be subjected to gravity?""
I respectfully disagree with this. Excel is fundamentally not suited to analysing *omics data. It's often the default program affiliated with a .csv filetype on people's computers, but trying to get an entire field of scientific research to rewrite itself based on its glorified bugs is...wrong, in my opinion.
If you see wrong things in the world, do you accept them as they are, or try -- however ineffectually -- to force change for the better? I for one bang the drum into the wind and try to get biochemists off it. I teach people to be very sceptical of excel in my stats courses, for example (aside from some showstopping bugs and downright dangerous defaults, its RNG is particularly crap).
Excel is not fundamentally suited, but it's fundamentally used.
I wish you luck, but I long ago declared defeat against Excel. I can't prevent everyone from using it. I have come to accept that whatever Excel does is what I have to live with, because someone, somewhere out there hasn't heard the anti-Excel gospel yet.
I'm actually having a lot of scenarios where Excel is banned except for some managers, C-level and board type stuff.
If you produce, touch or directly act on data, Excel is no longer allowed. On the other levels the people that still use it (because they think they have the skill - which they often don't) can't do meaningful direct-action on those files anyway so that solves part of the problem of acting on bad data or bad data processing.
We do allow web-based spreadsheets, that has the benefit of limiting what you can do and drives people to either put a request in with the correct team or take internal courses on how to do it right.
Some people hate it, but then again, some people hate Excel. The difference is that the people that don't hate it now have an empowered skillset that they didn't have before. In theory you could do the same with Excel-oriented internal courses, but we didn't want to do Excel because a lot of us think it's stupid (often simply for emotional reasons but the push was based on hard data like process metrics and user input).
To add to this: it's probably because I find myself attracted to data-driven organisations where using Excel didn't help the workforce much anyway, and the positions I'm in allow for a certain degree of influence on the tools that we support and how much explicit allow/deny we apply on them.
The difference is that the people that don't hate it now have an empowered skillset that they didn't have before.
Not knowing how to use the primary tool in your field because you only know how to use a web-based spreadsheet with 1/100th the functionality is not an "empowered skillset." It's a cruel trick to play on employees, who will have difficulty finding further jobs in that field because they don't have a basic understanding of the tool everyone else uses.
I was talking about Apache Spark via Databricks. ~220 of the people that were attached to Excel and PowerBI switched over for their calculations and data processing, 3 remained. One was let go, one didn't actually use Excel to write data, only to see it and search it, and the third wasn't able to learn the replacement so they got to keep Excel.
It's not a cruel trick and even if it was: web-based spreadsheets might not have the same functionality, but the people that were using the desktop-based spreadsheets weren't doing much more than basic computation on fields and search/replace. That works fine on the web.
It's not like we took away their toys and gave them dirt instead. We just made everyone take the startup course to get them acquainted and if that went well we would decide our next steps. It went well, and the next steps became removing Excel from the standard workflows.
Quantitative social science (which maybe a lot of comp bio people might regard as a "softer" science) has really made huge strides in moving this kind of analysis to R. Of course one person can't change it, but those with high status in the field can make a difference by leading the way and setting standards.
But they don't, and that's a problem. It's an education problem: everybody automatically defaults to Excel, even when it's not the proper tool. People need to learn about better tools.
You can't force people to use a different tool that they don't want to use. They [(computational) biologists] know other tools pretty well, they don't want to use them. If you want to help, work on better tools for them, based on their needs. Regulation is about the worst thing you could've thought of. If you said this to my gf's research group, they would yell at you, Excel is a loved piece of software among them, and other spreadsheet applications are not even close. They don't want to have a dozen different tools that they have to switch between fifty times an hour, converting data in the meantime, they want to know one or two very well and have the data all there. Excel is a data swiss army knife, which is exactly what they need, it has some warts but forcing them off Excel really is not the solution.
Where am I talking about forcing them? You're the only one bringing that up, and I strongly disagree with it. Education is not force, it's empowering.
The only reason they want to use a tool that doesn't really fit their use case, is because they're not aware of better tools. I find it very hard to believe that scientists actually want to use a tool that corrupts their data. If that is true, then that is absolutely a problem with their attitude towards science and data.
It's not that hard to imagine a tool that can do exactly what Excel can, but without corrupting your data. It might even exist already. LibreOffice got mentioned a lot; it can do almost(?) everything Excel can, but without corrupting your data. If there are problems with it that make it useless to scientists, there's a good chance the LibreOffice community can fix them.
I think better tools that preserve the integrity of their data are absolutely the solution here.
You're taking a minor, very unusual issue that happens only to a subset of a subset of a subset of users and resolving it with regulation, while forgeting all the cases where Excel works perfectly for them. Libreoffice is good, but it has much more warts than Excel does and the presentantion capabilities (a key functionality) simply are not as good.
Are you seriously opposed to education because you see it as regulation?
Also, I don't think it's a very minor issue if the tool you use to process your data, changes your data. It may only do that in limited cases, but the fact that it does it at all should alarm anyone who cares about data integrity.
> This is the exact sort of thing that indicates market-power being concentrated to a point requiring severe, acute regulatory action.
This is what I react to. Sorry I missed that you're not the original commenter, that's my bad.
I am not sure though what other tool should the biologists learn - they usually know R and Python and they use Excel because they need what Excel offers; there isn't any other software that allows them to easily do whatever they need to do with the data, quickly iterating on the ideas AND then present it nicely.
The solution here should be to fix Excel, but it really is a small issue (that has been worked around, too; and if you know how to use excel correctly, it does not happen to you) compared to making another Excel, which seems like a monstrous task.
Yeah, that wasn't me, and that comment was already countered by pointing out there was plenty of competition in the spreadsheet market. My reaction was that a lot of people may not be aware of all the options and simply default to Excel because that's what they know.
You're claiming that they do know the alternatives but the alternatives fail in worse ways than Excel does.
I don't know what the magic sauce is that makes Excel so much better than the alternatives, or why it would be such a monstrous task to replicate that. My impression is just that to many people, Excel is simply the default tool to enter data in, no matter what the actual problems with Excel are. Excel is like gravity; good or bad, you put up with it because it is what it is. And if that's the case, that'd be pretty bad and in need of change, because Excel is not at all like gravity; it's merely one tool of many. If there's a more appropriate tool, people should use that. If there isn't, one could be developed.
I stand by my point that a standard data collection tool that modifies the data you put into it is a really bad idea for any field where accuracy and data integrity is important, and I'd expect science to be one of those fields. You may be able to work around the limitations of the tool, but the risk is still there.
Ah yes. Without acute regulatory action, Lotus 1-2-3 would have never been defeated, the PalmPilot would have reigned supreme as a smartphone device to this day, and Starbucks would have taken over all coffee production.
Having Excel be the default for .csv files has caused me problems in the past as well with Excel automatically changing data when you open it. The unfortunate thing I have come across multiple times is a csv file that has some values with a lot of significant figures, i.e.
123456789876, 987654321234
After opening in Excel it will convert these to:
1.23E+11, 9.87E+11
Then resaving the file (as a csv) will store the values in scientific notation and lose all of the digits.
This issue is common in the oil and gas industry. Wells are typically identified with a 14-digit number that Excel likes to convert to scientific notation and then truncate digits when saving to CSV.
That's the problem of having Excel open CSV files by default - if you use Data//From Text/CSV you get a chance to change the import process to make those columns text rather than Excel guessing they are numbers.
Same here, but with IMEI numbers. Excel silently truncates them, it’s frustrating. Why can’t the auto format detect and just store them in text format?
I was thinking that perhaps the autoformat options could be explicitly shown in the column headers
number date currency
A B C
and you could click on the format to change it. Italics could show assumed formats, bold could show user set formats.
I'm newly returned to (very minor) Excel use and assumed formats with no feedback seems wrong. My situation is dates _not_ being picked up as dates and it not being obvious that it hasn't; I'm using ISO format (eg 2020-06-17), I guess you have to create a new date format or use USA-ordering before it will recognise dates.
I would be happy if Excel supported some way to specify the column type in the CSV itself. Ie first row headers, second row data type/hint, subsequent rows the actual data.
At the moment users need to remember every time they import the CSV to specify A data type for each incorrectly handled column. Suffice to say they don’t always remember... and with large CSVs with hundreds of columns it’s just a pain.
What might be better is what I'll call “CSV Schema”: support for paired CSV files where one normal CSV has data, and the other is a schema file which contains headers which match the data files headers with a standard extra first header prepends (call it “meta-name”), and the data rows each provided metadata for the corresponding column in the data file. The most common (and often the only) metadata row might be for type information, but the format could support additional metadata.
This would allow existing CSV tools that work with data files to continue to work without modification, while CSV Schema aware tools could make use of the schema.
This is what happens on explicit import, but, formatting is per-cell, not per-column. "General" is the format for "idk, guess" - it stays General after a transform, it's not converted.
Because most people want to import numbers as numbers most of the time. Excel detects numbers and treats them as numbers. You do have the option to import them as text. So, Excel (as usual) does what's the best for most people.
The reasonable default should be not to change existing data. If the user wants to, by all means do it, but not before. If you want to be smart about it, offer a popup asking if a particular column or field should be interpreted as a number or date, but by default, keep everything a string.
Why does excel truncate a very long number silently? That is what happens with IMEI numbers. It cuts off the last 4-6 digits. It’s destroying numbers.
Excel should know any number that is longer than it can internally support should be stored as text. Or they need to introduce a data type that can store arbitrary precision numbers.
It's not that Excel can't support that long of a number in general. When you open up a csv in Excel, it is not being worked on in "Excel format" where it is displaying the number in scientific notation but it is still retaining the full number. If you click on the cell in the text window in the top it shows you the full number.
The asinine thing happens when you save it back as a csv. Excel doesn't think to itself "hey, I should write these full values I have stored in each cell to the csv file". It just writes whatever is currently displayed to the csv file instead.
Yes, and? It should be a simple checklist item for any scientific journal to ask the author whether they used excel or not. If they say yes, reject the paper unless they can show their work hasn't been affected negatively by using excel.
On the practical side, how would people show that their work hasn't been affected negatively by using Excel? How would the journals evaluate that? I suspect this would just become another box to check - yes, we use Excel, yes, we checked the results, like the last 50 times.
Intellectually, it feels snobbish to single out Excel like this. I'm a software engineer in science, and I generally agree that scientists should learn some kind of coding. But you can make mistakes in Python or R as well - not to mention in physical experiments. We should check data and methods in general, not presume incompetence if people use one tool.
By providing a repo with raw data, and the code that runs on it that eventually produces the results that are in the manuscript. Anything else is just a bunch of handwaving.
It's not coincidental that big evil FDA/pharma requires people to sign - on paper - their lab notes (and to have lab notes, and to sign that they calibrated the equipment, and that they checked for common mistakes).
And yes, I know this costs a lot, and that this is sort of a pipe dream. And I'm not saying this is the best solution. Renaming genes might be a lot better, because it just allows the field to move on, and Excel will eventually die out, or at least researchers will move away from it - maybe to Jupyter, if we're lucky.
So, all in all, of course Excel is just a symptom, but society is already pretty stubborn when it comes to financing R&D and progress.
That doesn't solve the analysis problem downstream when a non-expert is getting started, doesn't know about Excel's anti-features, and starts doing analysis using the only tool that the world has ever told them is acceptable for tabular data.
Ideally excel would change, but since we know it won't, and we want to work with lots of people with minimal problems, we must adapt.
Do they mean March 1 when they import data from a text file?
It's one thing to change typed in user text in real time. That's not causing any problems. It's another to randomly mutate cells amongst tens of thousands of rows. I don't think that has ever helped anyone.
How are these non-experts getting started? If they are in academia, academia should start teaching these practical things too. (Yes, I know the problem is that many old school bigwig researchers are doing even worse things.)
There are people of all sorts: amateurs that want to play with the data but will not make study of it their primary field, trainees that will make use of tabular data extensively in their career, and then experts in more rarefied fields (e.g. immunology, clinicians, etc.)
If I can get a trained immunologist looking at my data, I'd much rather have 5 more minutes of their analytical skills than teaching them about common data exchange pitfalls.
yeah but then you'll just make them use SAS and nobody wants that. just try convincing anyone who graduated two decades ago to use something reasonable like R
But we didn't accept them as they are. We looked at two options, change excel or change gene names, and picked the one most likely to solve the problem. Gene names mean nothing anyway. We're going from random letters that mean nothing to humans and computers to still random letters that mean nothing but at least don't get confused for dates.
there was a time when a gene was named to facilitate search
through a bookshelf full of laboratory notes.
you could look at a genes lable and know exactly where to find reference to the gene, the lable would indicate the stack number the shelf, the section the volume the page[s] and paragraph....lables such as Sonichedgehog dont do that anymore.
I've been in the room when deciding what to call a gene that the lab had just characterized/identified as interesting for the publication they were working on, and let's just say that making it easy to catalog and find the gene wasn't really on their mind. They'd actually been using a deliberately obfuscated name in conferences to prevent people from scooping their work on the gene before, which is pretty common.
conference is a different game but that doesnt change what happens in the lab, if you look at the overall picture you can see the potential for duplication of alphanumeric strings. this is why in conference lables must be developed as you have a wider context in conference than in your own lab. You can be scooped for research regardless of what lable is given to a gene, and for that matter even if you make publication first as not everybody has the same ethical framework of honesty or due accreditation.
the time before computers were used reflexively was when we would record data, in cursive written in hardcover bound note books. a putative gene has a name that allows it to be found. later if things work nicely that putative is confirmed to be a gene and if things work ideally the locus can be established
Surely then the blame should go one step further back: .csv and tab-separated are poor file formats for *omics data and excel (and pandas, etc.) have no choice but to guess data types and metadata. Unfortunately I'm just not sure there's any format that's sufficiently better to ever replace them.
If someone types MARCH1, DEC1, OCT4, SEPT1 into a column, why should Excel not treat these as dates? Probably 99.999% of users would prefer this auto-understanding. To make users have to then select cell formatting and select date is a terrible UI decision.
It's odd to call such a useful feature a bug.
>Excel is fundamentally not suited to analysing *omics data
But it's excellent for a massive range of basic analysis and especially for a unified platform to share results, which is why everyone uses it. It's good for publication or proposal steps of creating graphs from data output from other more specialized systems, that usually lack the flexibility and widespread use that Excel has.
If my group uses some specialized, often in house, analysis software, my results cannot be spread to others unless I put them into a more common and accessible format. And nothing in the genomics and wider scientific community has anywhere near the widespread availability as Excel.
You might as well complain they also use word docs instead of Latex or troff. Not everyone wants to spend days fighting software instead of doing work.
> If someone types MARCH1, DEC1, OCT4, SEPT1 into a column, why should Excel not treat these as dates?
It should treat these as dates, unless told otherwise. But Excel goes one step further, and throws away the text you originally typed in. That means, if you type "DEC1" into a spreadsheet, see it looks stupid, and change the data type of that cell to "Text", then it's too late, and Excel will show "37226". It has destroyed the original data and replaced it with the numeric equivalent of the date.
The data type of the spreadsheet cells should be a display format only. The text you typed in (or loaded from a CSV file) should be preserved, and the data type of the cell should only dictate how that is treated and displayed, not how it is stored.
Excel does that, largely: cells have a datatype, a value, and a format (among other properties). Changing the format leaves the value unchanged.
However, when you input something into a cell, Excel has to parse it and decide what value and datatype to assign to it. That is just inevitable, and the same everywhere: in Python, `x = 3` assigns an integer, `x = 3.` a floating point, and `x='3'` a string. Similarly hints (or explicit specifications of type) are possible in Excel.
However, given that, what you typed originally is not recoverable anywhere (Python doesn't record whether you wrote `x=3.` or `x = 3.00`, why should it?).
So, to the extent that it is reasonable, Excel does what you request, and to the extent it doesn't do it, it's not reasonable, I'd say: A cell need not maintain the original pre-parsing input.
> The text you typed in (or loaded from a CSV file) should be preserved
That’s even worse, since a user assumes the cell holds what it shows. Excel, for example, auto completes entries, which is extremely useful for data entry, and for accuracy. If it only held the few characters entered, now it has to repeat exactly the same autocompletes every time it opens, and has to do so even if the data that originally caused the auto complete changes. All this is nearly algrithmically I’m-Seville without changing the document into a history. And now CSV is useless for data transfer.
It’s vastly more sane to change to the data the user meant to enter, show it to the user for confirmation, and save that.
That the 99.99% use case bites the remaining 1 in 10,000 is a perfectly reasonable trade off, done throughout society in uncounted ways.
It is shortsighted to call this behavior a "useful feature". It was and is a grave and serious mistake.
Automatically converting SEPT1 to a date caused far more damage than the benefit of the fleeting convenience.
The damage that this "feature" caused and will keep causing goes far beyond the domain of bioinformatics.
Even in bioinformatics things have not been "solved", for the next decade the same gene will have two names when looking it up in previously published data.
> Automatically converting SEPT1 to a date caused far more damage than the benefit of the fleeting convenience.
Who are you to judge? For the novice, even for most users it'll be what they want. And for the expert (or even intermediate user) it is trivial to override.
How exactly would you trivially correct the massive number of published results in genomics data where SEPT1 was converted to a date?
20% of publications had errors introduced by Excel in them! The real number is probably much higher (Nature had a rate of 35%). What is the rationale in saying that the benefits must still be worth it?
See: Gene name errors are widespread in the scientific literature
The root of the problem is not about how Excel displays information, or how Excel parses the data but that the data it tacitly changed upon saving it.
Open a CSV file that has SEPT1, don't take any action, save it right away. Boom, the data has been changed and you cannot recover the original information.
It's one thing if it were typed in, but this is not typically from typing, it's from opening a CSV file that has a field that's OCT4. (And honestly, if I want a date, I'll hit the date formatting button, Excel.)
It is super hard to figure out a way to import data into excel without triggering auto-conversions. I knew a way a few years ago when I was dealing with lots more biologists, but I have forgotten now. If I had to do it now, I would write XLSX directly then manually check the file in Excel, because Excel is fundamentally untrustworthy.
I respectfully disagree with this. Excel is fundamentally not suited to analysing *omics data. It's often the default program affiliated with a .csv filetype on people's computers, but trying to get an entire field of scientific research to rewrite itself based on its glorified bugs is...wrong, in my opinion.
If you see wrong things in the world, do you accept them as they are, or try -- however ineffectually -- to force change for the better? I for one bang the drum into the wind and try to get biochemists off it. I teach people to be very sceptical of excel in my stats courses, for example (aside from some showstopping bugs and downright dangerous defaults, its RNG is particularly crap).