
Research software dev here. This came as a huge shock to me when I started at my job. I work with very smart, dedicated people performing cancer research. Why would they put up with this affecting their productivity? Humans really are adaptable creatures.

After a few months of working there, my boss handed me 3 or 4 Excel spreadsheets to compare, to make sure a recent change I made hadn't affected our data (we don't have much in the way of automated tests either). As a software developer, this was a deeply troubling request.

One option was to load them into database tables so that I could run SQL queries against the data (Postgres has a COPY command that works with CSVs). That isn't hard and is probably the path most people should take, but I didn't want to write table definitions.
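For anyone curious what the database route can look like without hand-written table definitions, here is a rough sketch. It swaps in SQLite (from Python's standard library) for Postgres, which is my assumption to keep it self-contained, and it derives untyped columns from each CSV header. The helper names `load_csv` and `rows_only_in` and the sample data are made up for illustration.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_text):
    """Create `table` with one untyped column per CSV header field, then load the rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Quote identifiers so header names with spaces or punctuation still work.
    cols = ", ".join('"{}"'.format(name.replace('"', '""')) for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, cols))
    conn.executemany('INSERT INTO "{}" VALUES ({})'.format(table, marks), reader)


def rows_only_in(conn, left, right):
    """Return rows present in `left` but absent from `right`."""
    query = 'SELECT * FROM "{}" EXCEPT SELECT * FROM "{}"'.format(left, right)
    return conn.execute(query).fetchall()


# Hypothetical sample data standing in for two exported spreadsheets.
before = "id,value\n1,10\n2,20\n3,30\n"
after = "id,value\n1,10\n2,25\n3,30\n"

conn = sqlite3.connect(":memory:")
load_csv(conn, "before", before)
load_csv(conn, "after", after)

print("only in before:", rows_only_in(conn, "before", "after"))
print("only in after:", rows_only_in(conn, "after", "before"))
```

Every value is loaded as text, so this only catches rows that differ as written; it makes no attempt to treat `1.0` and `1` as equal.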

I ended up using https://github.com/BurntSushi/xsv (I am not affiliated with the project in any way). It's a command-line tool written in Rust that performs queries/joins/manipulation/basic analysis against CSV/TSV files. While not as analytically powerful as Excel or Postgres, I was able to verify the data was good and pipe out results into another file without writing any custom code, and without opening a single file.



Were the files expected to be identical? If so, diff would have done the job.
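If diff itself isn't at hand, roughly the same check can be sketched with Python's standard `difflib`. The function name `csv_diff` and the sample contents are invented for illustration, and like diff this only helps when the rows are expected to appear in the same order.

```python
import difflib


def csv_diff(old_text, new_text, old_name="old.csv", new_name="new.csv"):
    """Return unified-diff lines between two CSV texts (empty list if identical)."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return list(
        difflib.unified_diff(old_lines, new_lines, fromfile=old_name, tofile=new_name)
    )


# Hypothetical sample contents; in practice these would be read from the exported files.
old = "id,value\n1,10\n2,20\n"
new = "id,value\n1,10\n2,25\n"

changes = csv_diff(old, new)
print("".join(changes) if changes else "files are identical")
```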

Perhaps not directly relevant, but the lesser-known GNU Recutils looks neat. Perhaps some day I'll find an opportunity to try it out.

https://www.gnu.org/software/recutils/manual/recutils.html#I...


Have you checked out the tidyverse suite of packages for R? The tidyverse philosophy is to avoid making assumptions about the data, so you get detailed error messages when a row was probably not parsed correctly.



