I'm reminded of a guy tasked with testing a 32-bit floating point library. After a number of false starts, he hit on the most effective approach possible.
Brute force.
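For a one-argument function on 32-bit floats, brute force really is tractable: 2^32 inputs is an afternoon of CPU time. A minimal sketch in C, with a hypothetical my_sqrtf standing in for the library function under test and a double-precision computation as the reference:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical function under test -- stand-in for the library's version. */
    static float my_sqrtf(float x) { return sqrtf(x); }

    /* Trusted reference: compute in double, round back to float.
       (For sqrt in particular, this double rounding is known to be safe.) */
    static float reference_sqrtf(float x) { return (float)sqrt((double)x); }

    int main(void) {
        uint64_t mismatches = 0;
        /* All 2^32 bit patterns: every float, including NaNs, infs, denormals. */
        for (uint64_t bits = 0; bits <= UINT32_MAX; bits++) {
            uint32_t b = (uint32_t)bits;
            float x, got, want;
            memcpy(&x, &b, sizeof x);
            got = my_sqrtf(x);
            want = reference_sqrtf(x);
            /* Compare bit patterns so NaN matches NaN and +0.0 differs from -0.0;
               relax this if NaN payloads differ on your platform. */
            uint32_t gb, wb;
            memcpy(&gb, &got, sizeof gb);
            memcpy(&wb, &want, sizeof wb);
            if (gb != wb) mismatches++;
        }
        printf("%llu mismatches out of 2^32 inputs\n", (unsigned long long)mismatches);
        return 0;
    }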
Oh, another story. An OG (original geek) I know once proved the floating point unit on a mainframe was broken via brute force. The bad part meant that the decimal floats, used by (and only by) the accounting department, sometimes produced erroneous results.
Me, I have a pitiful number of unit tests for the main codebase I work on. One module which does have unit tests also had a bug that resulted in us having to replace a couple of thousand units in the field.
Otherwise, most of the bugs that chap my keister aren't about procedural correctness; they're about state.
How do you test it by brute force? How is the algorithm supposed to know what the right thing to return is, unless you rewrite the whole thing, correctly this time? And, if you've done that, just replace the actual function with the test and be done with it.
It's a good idea in principle, though your third and fourth properties are not true of floating point addition.
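The parent's property list isn't quoted here, but assuming associativity is among them, a quick C demonstration of why it fails for floats:

    #include <stdio.h>

    int main(void) {
        float a = 1e8f, b = -1e8f, c = 1.0f;
        printf("%g\n", (a + b) + c); /* prints 1 */
        printf("%g\n", a + (b + c)); /* prints 0: b + c rounds back to -1e8,
                                        so the 1.0f is lost entirely */
        return 0;
    }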
When you have one available, testing the results against a reference implementation is a little more direct. Though, given how easy those property tests are, you might as well do both.
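Those property tests really are cheap to write. A minimal randomized sketch in C, checking commutativity (which does hold for IEEE 754 addition; the generator and iteration count are arbitrary choices for illustration):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Crude generator: random 32-bit patterns, so NaNs, infinities and
       denormals all get exercised. (rand() is weak; fine for a sketch.) */
    static float random_float(void) {
        uint32_t bits = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }

    int main(void) {
        for (long i = 0; i < 1000000; i++) {
            float a = random_float(), b = random_float();
            if (isnan(a) || isnan(b)) continue; /* NaN payload propagation varies */
            float x = a + b, y = b + a;
            /* Property: addition commutes, so this should stay silent. */
            uint32_t xb, yb;
            memcpy(&xb, &x, sizeof xb);
            memcpy(&yb, &y, sizeof yb);
            if (xb != yb) printf("counterexample: %a + %a\n", a, b);
        }
        return 0;
    }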
Your comparison source could be running on a different platform, or be slower, or ... and thus not be a drop-in replacement. (E.g. in the example of a broken floating-point unit, compare the results with a software emulation)
Hmm, true, it may have been a different platform/slower, thanks. It can't have been hardware/software, as the GP said they were testing a library (which implies software).
> How is the algorithm supposed to know what the right thing to return is
Sometimes I have, for example, MATLAB code that I 'know' is right and want to verify that my C/Python/Julia... code does the same thing. So I just call the MATLAB code, call the new code, and check that the results are the same.
Or I have a slow exact method and want to see if my faster approximation always gives an answer 'close enough' to the correct answer.
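As a concrete sketch of that second pattern, here's the classic bit-trick inverse square root checked against 1/sqrtf over a swept input range (the range and the way the error is measured are arbitrary choices for illustration):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Fast approximation: the bit-trick inverse square root
       plus one Newton-Raphson step. */
    static float fast_rsqrt(float x) {
        uint32_t i;
        memcpy(&i, &x, sizeof i);
        i = 0x5f3759df - (i >> 1);
        float y;
        memcpy(&y, &i, sizeof y);
        return y * (1.5f - 0.5f * x * y * y);
    }

    int main(void) {
        float worst = 0.0f;
        /* Sweep the input range and compare against the slow, trusted answer. */
        for (float x = 1e-3f; x < 1e3f; x *= 1.001f) {
            float approx = fast_rsqrt(x);
            float exact = 1.0f / sqrtf(x); /* the "slow exact method" */
            float rel = fabsf(approx - exact) / exact;
            if (rel > worst) worst = rel;
        }
        printf("worst relative error: %g\n", worst); /* about 0.2% for this trick */
        return 0;
    }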
With floating-point math libraries especially, it's frequently the case that it's relatively easy to compute the desired result accurately by using higher precision for internal computations; e.g. you might compare against the same function evaluated in double and then rounded to float.
This sounds pretty sketchy at first, but it turns out that almost all of the difficulty (and hence almost all of the bugs) is in getting the last few bits right, and the last few bits of a much-higher-precision result don't matter when you're testing a lower-precision implementation, so combined with some basic sanity checks this is actually a very reasonable testing strategy (and widely used in industry).
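A minimal sketch of that strategy, taking expf as a hypothetical implementation under test and the C library's double-precision exp as the reference:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        uint64_t off_by_more_than_1ulp = 0;
        for (uint64_t bits = 0; bits <= UINT32_MAX; bits++) {
            uint32_t b = (uint32_t)bits;
            float x;
            memcpy(&x, &b, sizeof x);
            if (!isfinite(x)) continue;
            float got = expf(x);                /* implementation under test */
            float want = (float)exp((double)x); /* double result rounded to float */
            /* expf isn't required to be correctly rounded, so allow 1 ulp of slack. */
            if (got != want &&
                got != nextafterf(want, INFINITY) &&
                got != nextafterf(want, -INFINITY))
                off_by_more_than_1ulp++;
        }
        printf("%llu results more than 1 ulp from the reference\n",
               (unsigned long long)off_by_more_than_1ulp);
        return 0;
    }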
… this reply does nothing to answer the parent's question. It's obvious that you can run it on all possible values, but that doesn't answer the question of the return values. (unless you only want to verify "doesn't crash" or something similarly basic)
I believe the question was edited after I wrote my answer. This, or I just misread it - unfortunately HN doesn't let us know. Anyway, I'm providing another answer to the question as it is now.
In practice, testing _all_ possible inputs is usually not a feasible goal, and often downright impossible (since it would mean solving the halting problem).
While there are some languages and problem domains where formal verification methods are a possibility, for most of software development the best you can do is reduce the solution space.
A sane type system would be a start, for example.
And it's no coincidence that functional programming and practices that emphasize immutability are on the rise; Rust's ownership system is a direct consequence as well.
TDD _is_ important, if only for enabling well-factored code and somewhat guarding against regression bugs.
But - decades after Dijkstra's statement (which someone has already posted in this thread) - the code coverage honeymoon finally seems to be over.
Polyspace [0] allows full state space checking. Usually this is feasible because the kinds of modules you test are heavily constrained - 12-bit (or fewer) fixed-point values (sensor outputs / actuator inputs) - and contain no algebraic loops. With floats and/or large integers, this quickly becomes unwieldy (verification then takes weeks).
One of them would be fuzz testing.
You start with a predefined 'corpus' of inputs, which the fuzzer then mutates to reach code paths that haven't been covered yet. In the process it discovers many bugs and crashes.
The input fuzzing process is rarely purely random. There are advanced techniques that allow the fuzzer to link input data to the conditions of uncovered branches.
It is quite a useful mechanism for checking inputs, formats, and behaviour patterns (e.g. if you have two implementations of one model: a simple one that works 100% of the time but is slowish, and a more complex but very fast one).
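As a concrete example, a minimal differential harness for libFuzzer might look like this (checksum_simple and checksum_fast are hypothetical stand-ins for the 'simple but slow' and 'complex but fast' implementations):

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical "simple but slow" implementation -- the trusted model. */
    static uint32_t checksum_simple(const uint8_t *buf, size_t len) {
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum = sum * 31 + buf[i];
        return sum;
    }

    /* Hypothetical "complex but fast" implementation under test.
       (A stand-in here; in real use this would be the optimized code.) */
    static uint32_t checksum_fast(const uint8_t *buf, size_t len) {
        return checksum_simple(buf, len);
    }

    /* libFuzzer entry point; build with: clang -g -fsanitize=fuzzer,address harness.c
       The fuzzer mutates `data` to reach branches it hasn't covered yet. */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
        assert(checksum_simple(data, size) == checksum_fast(data, size));
        return 0;
    }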
Fuzz testing is useful for testing whole systems in vivo.
Proofs are powerful, but you can still make errors in the proof or the transcription into code. And of course you don't know what other emergent behaviours will surprise you when the proved code interacts with unproved code.
Depending on your means and needs, you want to try both.
The parent was talking about full coverage. While fuzzers are great tools, able to generate test inputs for the easily coverable parts of the code, formally correct software has to be designed for provability.
Many embedded safety and medical applications prove correctness for all inputs and their combinations. Some also verify error behaviour (out of range). Granted, this is a relatively small input space, typically a few sensors.
To prove it works with all possible inputs, there are other tools at your disposal.