I am interested to know the distinction between "production-ready" and "science-...

dmlorenzetti · on Aug 24, 2020

Hard-coded file paths for input data. File paths hard-coded to use somebody's Google Drive so that it only runs if you know their password. Passwords hard-coded to get around the above problem.

In-code selection statements like `if( True ) {...}`, where you have no idea what is being selected or why.

Code that only runs in the particular workspace image that contains some function that was hacked out to make things work during a debugging session 5 years ago.

Distributed projects where one person wrote the preprocessor, another wrote the simulation software, and a third wrote the analysis scripts, and they all share undocumented assumptions worked out between the three researchers over the course of two years.

Depending on implementation-defined behavior (like zeroing out of data structures).

Function and variable names, like `doit()` and `hold`, which make it hard to understand the intention.

Files that contain thousands of lines of imperative instructions with documentation like "Per researcher X" every 100 lines or so.

Code that runs fine for 6 hours, then stops because some command-line input had the wrong value.

I've seen all of these over the years. Even as a domain expert who has spoken directly with authors and project leads, this kind of stuff makes it very hard to tease out what the code actually does, and how the code corresponds to the papers written about the results.
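A minimal sketch of how the "runs fine for 6 hours, then stops because some command-line input had the wrong value" failure can be avoided by validating every argument before the expensive work starts. The option names `--steps` and `--output-format` are hypothetical and chosen only for illustration.

```python
import argparse


def positive_int(text: str) -> int:
    """argparse type callback: accept only integers greater than zero."""
    value = int(text)  # a ValueError here is reported by argparse as a usage error
    if value <= 0:
        raise argparse.ArgumentTypeError(f"expected a positive integer, got {text!r}")
    return value


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical options for an imaginary long-running simulation.
    parser = argparse.ArgumentParser(description="Long-running simulation (illustrative)")
    parser.add_argument("--steps", type=positive_int, required=True)
    parser.add_argument("--output-format", choices=["csv", "json"], default="csv")
    return parser


def parse_args(argv):
    """Parse and validate all inputs up front, before any computation begins."""
    return build_parser().parse_args(argv)
```

With this shape, a bad value such as `--output-format xml` makes the program exit immediately with a usage message instead of failing when the results are finally written.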

mroche · on Aug 24, 2020

You’re giving me flashbacks! I spent a year as an admin on an HPC cluster at my university building tools/software and helping researchers get their projects running, and re-led the implementation of container usage. The amount of scientific code/projects that required libraries/files to be in specific locations, or assumed that everything was being run from a home directory, or sourced shell scripts at run time (that would break in containers) was staggering. A lot of stuff had the clear “this worked on my system so...” vibe about it.

As an admin it was quite frustrating, but I understand it sometimes when you know the person/project isn’t tested in a distributed environment. But when it’s the projects that do know how they’re used and still do those things...

petschge · on Aug 24, 2020

One example: For a long time, my code would crash if you set the thermal speed to something greater than the speed of light. Should the code crash? No. And by now I have found the time to write extra code to catch the error and mildly insult the user. (It says "Faster than light? Please share that trick with me!") Does it matter? No. It didn't run and give plausible-but-wrong results. So that is code that I would call "science-ready" but I wouldn't want it criticized by people outside my domain.
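A check of this kind might look roughly like the sketch below. It is not the commenter's actual code: the function name `check_thermal_speed`, the choice of SI units, and the extra rejection of non-positive or NaN values are all assumptions made for illustration. Only the error message is taken from the comment.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s, exact by SI definition


def check_thermal_speed(v_thermal: float) -> float:
    """Return v_thermal unchanged if it is physically plausible, else raise.

    Hypothetical helper: rejects NaN, non-positive values, and anything
    faster than light before the simulation gets a chance to crash.
    """
    if math.isnan(v_thermal) or v_thermal <= 0.0:
        raise ValueError(f"thermal speed must be positive, got {v_thermal}")
    if v_thermal > SPEED_OF_LIGHT:
        raise ValueError("Faster than light? Please share that trick with me!")
    return v_thermal
```

The point is the same as in the comment: the input is refused loudly rather than producing plausible-but-wrong results.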

jnxx · on Aug 24, 2020

I don't think that would be any problem (why should it?).

Code exhibiting undefined behavior is a different kettle of fish...

petschge · on Aug 24, 2020

Which is why I run valgrind on my code (with a parameter file containing physically valid inputs) to get rid of all undefined behavior. But I gave up on running afl-fuzz, because all it found was crashes following from physically invalid inputs. I fixed the obvious ones to make the code nicer for new users, but once afl started to find only very creative corner cases I stopped.

jnxx · on Aug 24, 2020

Well done!

gowld · on Aug 24, 2020

Then you publish your work and critics publish theirs and the community decides which claims have proven their merit. This is the fundamental structure of the scientific community.

How is "your code has error and I rebuke you" a more painful critique than "you are hiding your methodology and so I rebuke you"?

petschge · on Aug 24, 2020

Nothing limits the field of critics to people who have written their own code and know what they are doing.

lemmsjid · on Aug 24, 2020

There's a ton of overlap, because science code might be a long running, multi-engineer distributed system and production code might be a script that supports a temporary business process. But let's assume production ready is a multi customer application and science ready is computations to reproduce results in a paper.

Here's a quick pass, I'm sure I'm missing stuff, but I've needed to code review a lot of science and production output and below is how I tend to think of it, especially taking efficiency of engineer/scientist time into account.

Production Ready?

* code well factored for extensibility, feature change, and multi-engineer contribution

* robust against hostile user input

* unit and integration tested

Science Ready?

* code well factored for readability and reproducibility (e.g. random numbers seeded, time calcs not set against 'now')

* robust against expected user input

* input data available? testing optional but desired, esp unit tests of algorithmic functions

* input data not available? a schema-correct facsimile of input data available in a unit test context to verify algorithms correct
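The reproducibility points in the list above (random numbers seeded, time calculations not set against 'now') can be sketched as follows. The function names and the Gaussian noise model are hypothetical; the idea is only that the seed and the reference time are explicit inputs rather than hidden global state.

```python
import random
from datetime import datetime, timezone


def simulate_measurements(n: int, seed: int) -> list:
    """Draw n pseudo-random values from a private, explicitly seeded generator."""
    rng = random.Random(seed)  # no reliance on the global random state
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def age_in_days(sample_time: datetime, reference_time: datetime) -> float:
    """Age of a sample relative to a caller-supplied reference time.

    Taking reference_time as an argument, instead of calling datetime.now()
    inside the function, gives the same answer on every rerun.
    """
    return (reference_time - sample_time).total_seconds() / 86400.0


# Usage: record the seed and reference time alongside the published results.
reference = datetime(2020, 8, 24, tzinfo=timezone.utc)
sampled = datetime(2020, 8, 22, tzinfo=timezone.utc)
values = simulate_measurements(5, seed=12345)
```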

Both?

* security needs assessed and met (science code might be dealing with highly secure data, as might production code)

* performance and stability needs met (production code more often requires long term stability, science sometimes needs performance within expected Big O to save compute time if it's a big calculation)

PeterisP · on Aug 24, 2020

Your requirements seem to push 'Science ready' far into what I'd consider "worthless waste of time", coming from the perspective of code that's used for data analysis for a particular paper.

The key aspect of that code is that it's going to be run once or twice, ever, and it's only ever going to be run on a particular known set of input data. It's a tool (though complex) that we used (once) to get from A to B. It does not need to get refactored, because the expectation is that it's only ever going to be used as-is (as it was used once, and will be used only for reproducing results), it's not intended to be built upon or maintained. It's not the basis of the research, it's not the point of research, it's not a deliverable in that research, it's just a scaffold that was temporarily necessary to do some task - one which might have been done manually earlier through great effort, but that's automated now. It's expected that the vast majority of the readers of that paper won't ever need to touch that code, they care only about the results and a few key aspects of the methodology, which are (or should be) all mentioned in the paper.

It should be reproducible to ensure that we (or someone else) can obtain the same B from A in future, but that's it, it does not need to be robust to input that's not in the input datafile - no one in the world has another set of real data that could/should be processed with that code. If after a few years we or someone else will obtain another dataset, then (after those few years, if that dataset happens) there would be a need to ensure that it works on that dataset before writing a paper about that dataset, but it's overwhelmingly likely that you'd want to modify that code anyway both because that new dataset would not be 'compatible' (because the code will be tightly coupled to all the assumptions in the methodology you used to get that data, and because it's likely to be richer in ways you can't predict right now) and you'd want to extend the analysis in some way.

It should have a 'toy example' - what you call 'a schema-correct facsimile of input data' that's used for testing and validation before you run it on the actual dataset, and it should have test scenarios and/or unit tests that are preferably manually verifiable for correctness.
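As a rough illustration of such a toy example, the sketch below pairs a three-row, schema-correct stand-in dataset with an analysis function whose output can be checked by hand. The column names and the grouping analysis are hypothetical and stand in for whatever the real pipeline computes.

```python
import csv
import io
import statistics

# Hypothetical toy dataset with the same columns as the real input file.
TOY_CSV = """sample_id,temperature_k,signal
a,300.0,1.0
b,300.0,3.0
c,310.0,10.0
"""


def mean_signal_by_temperature(csv_file) -> dict:
    """Group rows by temperature and average the signal in each group."""
    groups = {}
    for row in csv.DictReader(csv_file):
        key = float(row["temperature_k"])
        groups.setdefault(key, []).append(float(row["signal"]))
    return {temp: statistics.fmean(vals) for temp, vals in groups.items()}


# Manually verifiable: (1.0 + 3.0) / 2 = 2.0 at 300 K, and 10.0 alone at 310 K.
result = mean_signal_by_temperature(io.StringIO(TOY_CSV))
```

Running the same function on the toy file before the real dataset catches schema and arithmetic mistakes while the expected answer is still small enough to compute on paper.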

But the key thing here is that no matter what you do, that's still in most cases going to be "write once, run once, read never" code, as long as we're talking about the auxiliary code that supports some experimental conclusions, not the "here's a slightly better method for doing the same thing" CS papers. We are striving for reproducible code, but actual reproductions are quite rare, the incentives are just not there. We publish the code as a matter of principle, knowing full well that most likely no one will download and read it. The community needs the possibility for reproduction for the cases where the results are suspect (which is the main scenario where someone is likely to attempt reproducing that code), it's there to ensure that if we later suspect that the code is flawed in a way where the flaws affect the conclusions then we can go back to the code and review it - which is plausible, but not that likely. Also, if someone does not trust our code, they can (and possibly should) simply ignore it and perform a 'from scratch' analysis of the data based on what's said in the paper. With a reimplementation, some nuances in the results might be slightly different, but all the conclusions in the paper should still be valid, if the paper is actually meaningful - if a reimplementation breaks the conclusions, that would be a successful, valuable non-reproduction of the results.

This is a big change from industry practice where you have mantras like "a line of code is written once but read ten times"; in a scientific environment that ratio is the other way around, so the tradeoffs are different - it's not worth investing refactoring time to improve readability, if it's expected that most likely no one will ever read that code; it makes sense to spend that effort only if and when you need it.

lemmsjid · on Aug 24, 2020

Yep! I don't disagree with anything you're saying when I think from a particular context. It's really hard to generalize about the needs of 'science code', and my stab at doing so was certain to be off the mark for a lot of cases.

PeterisP · on Aug 24, 2020

Yes, there are huge differences between the needs of various fields. For example, some fields have a lot of papers where the authors are presenting a superior method for doing something, and if code is a key part of that new "method and apparatus", then it's a key deliverable of that paper and its accessibility and (re-)usability is very important; and if a core claim of their paper is that "we coded A and B, and experimentally demonstrated that A is better than B" then any flaws in that code may invalidate the whole experiment.

But I seem to get the vibe that this original Nature article is mostly about the auxiliary data analysis code for "non-simulated" experiments, while Hacker News seems biased towards fields like computer science, machine learning, etc.

dandelion_lover · on Aug 24, 2020

> the distinction between "production-ready" and "science-ready" code

In the first case, you must take into account all (un)imaginable corner cases and never allow the code to fail or hang up. In the second case it needs to produce a reproducible result at least for the published case. And do not expect it to be user-friendly at all.

arethuza · on Aug 24, 2020

I would regard (from experience) "science ready" code as something that you run just often enough to get the results to create publications.

Any effort to get code working for other people, or documented in any way would probably be seen as wasted effort that could be used to write more papers or create more results to create new papers.

This kind of reasoning was one of the many reasons I left academic research - I personally didn't value publications as deliverables.

chriswarbo · on Aug 24, 2020

My experience has been similar.

Still, there's plenty of room to encourage good(/better) practices which cost essentially nothing, e.g. using $PWD rather than /home/bob/foo
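In Python, the same near-zero-cost habit could look like the sketch below, which resolves the data directory from an environment variable or the current working directory instead of a hard-coded home directory. The variable name `DATA_DIR` and the `data` subdirectory are assumptions made for this example.

```python
import os
from pathlib import Path


def resolve_data_dir(env=None) -> Path:
    """Locate input data without hard-coding a user-specific path.

    Uses the (hypothetical) DATA_DIR environment variable if it is set,
    and otherwise falls back to ./data under the current working directory.
    """
    env = os.environ if env is None else env
    return Path(env.get("DATA_DIR", Path.cwd() / "data"))


# Usage: build file paths relative to the resolved directory.
input_path = resolve_data_dir() / "measurements.csv"
```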

gowld · on Aug 24, 2020

If your experiment is not repeatable, it's an anecdote not data.

Any effort to write a paper readable for other people, or document the experiment in any way would probably be seen as wasted effort that could be used to create more results.

The "don't show your work" argument only makes sense if you are doing PR, not science.

neutronicus · on Aug 24, 2020

If it's repeatable by you then it's a trade secret, not an anecdote

arethuza · on Aug 25, 2020

I specifically got told off by my supervisor for trying to "improve" some of the software we were working on!

qppo · on Aug 24, 2020

Disclaimer, I'm a professional engineer and not a researcher.

The kind of code I'll ship for production will include unit testing designed around edge or degenerate cases that arose from case analysis, usually some kind of end to end integration test, aggressive linting and crashing on warnings, and enforcing of style guidelines with auto formatting tools. The last one is more important than people give it credit for.

For research it would probably be sufficient to test that the code compiles and given a set of known valid input the program terminates successfully.
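A smoke test of that kind (known valid input in, successful termination out) can be sketched with the standard library alone. The helper name `runs_successfully` and the 60-second default timeout are assumptions; the command passed in would be whatever invokes the research code.

```python
import subprocess
import sys


def runs_successfully(command, timeout_s: float = 60.0) -> bool:
    """Run a command and report whether it exited with status 0.

    subprocess.run raises TimeoutExpired if the command exceeds timeout_s,
    so a hang shows up as an error rather than a silent pass.
    """
    completed = subprocess.run(command, capture_output=True, timeout=timeout_s)
    return completed.returncode == 0


# Usage with a trivial stand-in program; a real check would call the
# analysis script with a known valid input file instead.
ok = runs_successfully([sys.executable, "-c", "print('done')"])
```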

searine · on Aug 24, 2020

>I am interested to know the distinction between "production-ready" and "science-ready" code.

In general, scientists don't care how long it takes or how many resources the code uses. It is not a big deal to run a script for an extra hour, or use up a node of a supercomputer. Extravagant solutions or added packages to make the code run smoother or faster only waste time. Speed/elegance only really matters when you know the code is going to be distributed to the community.

Basically, scientists only care whether the result is true: whether the result it outputs is sensible, defensible, reliable, reproducible. It would be considered a dick move to criticize someone's code if the code was proven to produce the correct result.

jnxx · on Aug 24, 2020

> It would be considered a dick move to criticize someone's code if the code was proven to produce the correct result.

Formal proof is much much harder than making code understandable and reviewable. It can be done but it is not easy, and can yield surprising results:

https://en.wikipedia.org/wiki/CompCert

http://envisage-project.eu/proving-android-java-and-python-s...

Jabbles · on Aug 24, 2020

Do you know how you could get to the state that "the code was proven to produce the correct result"?

If not by unit tests, code review or formal logic, then what?

jabirali · on Aug 24, 2020

Not all scientific code is amenable to unit testing. From my own experience from a PhD in condensed matter physics, the main issue was that how important equations and quantities “should” behave by themselves was often unknown or undocumented, so very often each such component could only be tested as part of a system with known properties.

You can then use unit testing for low-level infrastructure (e.g. checking that your ODE solver works as expected), but do the high-level testing via scientific validation. The first line of defense is to check that you don’t break any laws of physics, e.g. that energy and electric charge are conserved in your end results. Even small implementation mistakes can violate these.
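Both kinds of check can be illustrated on a deliberately simple system. The sketch below is an assumption-laden stand-in, not condensed matter code: it integrates a harmonic oscillator with the velocity Verlet scheme, then compares the result against the analytic solution (the unit-test style check) and verifies that the total energy is conserved to within a tolerance (the physics-based check).

```python
import math


def velocity_verlet(x: float, v: float, dt: float, steps: int, omega: float = 1.0):
    """Integrate x'' = -omega^2 * x with the velocity Verlet scheme."""
    a = -omega ** 2 * x
    for _ in range(steps):
        x += v * dt + 0.5 * a * dt * dt
        a_new = -omega ** 2 * x
        v += 0.5 * (a + a_new) * dt
        a = a_new
    return x, v


def energy(x: float, v: float, omega: float = 1.0) -> float:
    """Total energy per unit mass of the oscillator."""
    return 0.5 * v * v + 0.5 * omega ** 2 * x * x


# Start at x = 1, v = 0 and integrate to t = steps * dt = 100.
dt, steps = 0.01, 10_000
x_end, v_end = velocity_verlet(1.0, 0.0, dt, steps)

# Solver check: the analytic solution for these initial conditions is cos(t).
solver_error = abs(x_end - math.cos(steps * dt))

# Conservation check: relative change in energy over the whole run.
e_start = energy(1.0, 0.0)
energy_drift = abs(energy(x_end, v_end) - e_start) / e_start
```

In a real code base the tolerances would be chosen from the known accuracy of the integrator; here `solver_error < 1e-3` and `energy_drift < 1e-4` are illustrative thresholds for this step size.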

Then you search for related existing publications of a theoretical or numerical nature, trying to reproduce their results; the more existing research your code can reproduce, the more certain you can be that it is at least consistent with known science. If this fails, you have something to guide your debugging; or if you’re very lucky, something interesting to write a paper about :).

The final validation step is of course to validate against experiments. This is not suited for debugging though, since you can’t easily say whether a mismatch is due to a software bug, experimental noise, neglected effects in the mathematical model, etc.

searine · on Aug 24, 2020

>If not by unit tests, code review or formal logic, then what?

Cross referencing independent experiments and external datasets.

Science doesn't work like software. The code can be perfect and still not give results that reflect reality. The code can be logical and not reflect reality. Most scientists I know go in with the expectation that "the code is wrong" and its results must be validated by at least one other source.

analog31 · on Aug 24, 2020

I'm a scientist in a group that also includes a software production team. For me, the standard of scientific reproducibility is that a result can be replicated by a reasonably skilled person, who might even need to fill in some minor details themselves.

Part of our process involves cleaning up code to a higher state of refinement as it gets closer to entering the production pipeline.

I've tested 30 year old code, and it still runs, though I had to dig up a copy of Turbo Pascal, and much of it no longer exists in computer readable form but would have to be re-entered by hand. Life was actually simpler back then -- with the exception of the built-ins of Turbo Pascal, it has no dependencies.

My code was in fact adopted by two other research groups with only minor changes needed to suit slightly different experimental conditions. It contained many cross-checks, though we were unaware of modern software testing concepts at the time.

For a result to have broader or lasting impact, replication is not enough. The result has to fit into a broader web of results that reinforce one another and are extended or turned into something useful. That's the point where precise replication of minor supporting results becomes less important. The quality of any specific experiment done in support of modern electromagnetic theory would probably give you the heebie jeebies, but the overall theory is profoundly robust.

The same thing has to happen when going from prototype to production. Also, production requires what I call push-button replication. It has to replicate itself at the click of a mouse, because the production team doesn't have domain experts who can even critique the entirety of their own code, and maintaining their code would be nearly impossible if it didn't adhere to standards that make it maintainable by multiple people at once.

Jabbles · on Aug 24, 2020

This sounds great. In your opinion, do you think your team is unusual in those aspects? Do you have any knowledge of the quality of code in other branches of physics or other sciences?

analog31 · on Aug 24, 2020

Well, I know the quality of my own code before I got some advice. And I've watched colleagues doing this as well.

My own code was quite clean in the 1980s, when the limitations on the machines themselves tended to keep things fairly compact with minimal dependencies. And I learned a decent "structured programming" discipline.

As I moved into more modern languages, my code kind of degenerated into a giant hairball of dependencies and abstractions. "Just because you can do that, doesn't mean you should." I've kind of learned that the commercial programmers limit themselves to a few familiar patterns, and if you try to create a new pattern for every problem, your code will be hard to hand off.

Scientists would benefit from receiving some training in good programming hygiene.