Hacker News | comments

I created a code review pipeline at work with a similar tradeoff and we found the cost is worth it. Time is a non-issue.

We could run Claude on our code and call it a day, but we have hundreds of style, safety, and other rules on a very large C++ codebase with intricate behaviour (cooperative multitasking is fun).

So we run dozens of parallel CLI agents that review the code in excruciating detail. This has completely replaced human code review for anything that isn't functional correctness, at a price within the same order of magnitude. Much better than humans, and it beats every commercial tool.

"Scaling time", on the other hand, is useless. You can just divide the problem across subagents until it finishes within a few minutes, and that also increases quality: each agent gets less context and more focus.



Any LLM-based code review tooling I've tried has been lackluster (most comments aren't very helpful). Prose review is usually better.

> So we run dozens of parallel CLI agents that review the code in excruciating detail. This has completely replaced human code review for anything that isn't functional correctness, at a price within the same order of magnitude. Much better than humans, and it beats every commercial tool.

Sure, you could make multiple LLM invocations (different temperature, different prompts, ...). But how does one separate the good comments from the bad comments? Another meta-LLM? [1] Do you know of anyone who has written up the approach?

[1]: I suppose you could shard that out for as much compute you want to spend, with one LLM invocation judging/collating the results of (say) 10 child reviewers.
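A minimal sketch of what that fan-out/judge pattern could look like. Everything here is hypothetical: `call_llm` stands in for whatever model API you actually use, and the prompts are illustrative, not anyone's real setup.

```python
# Fan out diff shards to N child reviewers, then have one judge pass
# collate and filter their comments. `call_llm` is a stand-in for a
# real model API call (prompt in, text out).
from concurrent.futures import ThreadPoolExecutor

def shard_diff(diff_hunks, n_shards):
    """Split a list of diff hunks into roughly equal shards."""
    return [diff_hunks[i::n_shards] for i in range(n_shards)]

def review_shard(call_llm, rule_text, hunks):
    """One child reviewer: small context, one set of rules."""
    prompt = (f"Rules:\n{rule_text}\n\nReview these hunks:\n"
              + "\n".join(hunks))
    return call_llm(prompt)

def judged_review(call_llm, rule_text, diff_hunks, n_shards=10):
    """Run n_shards child reviewers in parallel, then one judge."""
    shards = shard_diff(diff_hunks, n_shards)
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        child_reports = list(pool.map(
            lambda h: review_shard(call_llm, rule_text, h), shards))
    judge_prompt = ("Deduplicate these review comments and drop any "
                    "that are not concrete rule violations:\n\n"
                    + "\n---\n".join(child_reports))
    return call_llm(judge_prompt)
```

The judge step is the answer to "another meta-LLM?": yes, one extra invocation that only sees the children's comments, not the whole diff.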


I have attempted to replicate the "workflow" LLM process, where several LLMs come up with different variations of a solution, a "judge" LLM reviews them, and the results go through verification passes, to see whether it increases the accuracy of the LLM's problem solving. In my experiments it didn't make much difference, but at the time I was using LLMs significantly dumber than current frontier models. However, when I enable "thinking mode" on frontier LLMs like ChatGPT, it does tend to solve problems that the non-thinking mode can't, so perhaps it's just a matter of throwing enough iterations at a sufficiently complex problem.


> But how does one separate the good comments from the bad comments?

One thing that works very well for me (in a different context) is to ask to return two lists:

- Things that I must absolutely fix (bugs, typos, logic mistakes, etc.)

- Lesser fixes and other stylistic improvements

Then I look only at the first list.
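A minimal sketch of that two-list trick. The prompt wording and the MUST-FIX/NICE-TO-HAVE markers are assumptions, not anyone's actual setup; the point is just that splitting severity in the prompt makes the must-fix list easy to extract mechanically.

```python
# Ask for two labelled sections, then keep only the first one.
TWO_LIST_PROMPT = """Review the change below. Return exactly two sections:

MUST-FIX:
- things I must absolutely fix (bugs, typos, logic mistakes)

NICE-TO-HAVE:
- lesser fixes and stylistic improvements
"""

def must_fix_items(response: str) -> list[str]:
    """Keep only the bullet points from the MUST-FIX section."""
    in_section = False
    items = []
    for line in response.splitlines():
        stripped = line.strip()
        if stripped.startswith("MUST-FIX"):
            in_section = True
        elif stripped.startswith("NICE-TO-HAVE"):
            in_section = False
        elif in_section and stripped.startswith("- "):
            items.append(stripped[2:])
    return items
```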


You need human alignment on what constitutes a "good" comment. That means consistent rules.

Otherwise, some people feel review is too harsh, other people feel it is not harsh enough. AI does not fix inconsistent expectations.

> But how does one separate the good comments from the bad comments?

If the AI took a valid interpretation of the coding guidelines, it is a legitimate comment. If the AI is being overly pedantic, it is a documentation bug and we change the rules.


> This has completely replaced human code review for anything that isn't functional correctness

Isn’t functional correctness pretty much the only thing that matters though?


Well no, style is important too for humans when they read a codebase, so the LLMs the parent is running clearly have some value for them.

They're not claiming LLMs solved every problem, just that they made life easier by taking care of busywork that humans would otherwise be doing. Personally, I think this is quite a good use for them: offering suggestions on PRs, say, as long as humans still review them as well.


But isn't style already achievable by running e.g. GNU indent?


Some examples of complex transformations linters can't catch:

* Function names must start with a verb.

* Use standard algorithms instead of for loops.

* Refactor your code to use IIFEs to make variables constexpr.

The verb one is the best example. Since we work adjacent to hardware, people like creating functions on structs representing register state called "REGISTER_XYZ_FIELD_BIT_1()", and you can't tell whether this gets the value of the first field bit or sets something called "field bit" to 1.

If you rename it to `getRegisterXyzFieldBit1()` or `setRegisterXyzFieldBitTo1()` at least it becomes clear what they're doing.
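A toy illustration of why this rule is awkward for plain linters: a mechanical check can flag a name that doesn't start with a known verb, but it can't tell you which verb is right, because that depends on what the function actually does. The verb list here is made up.

```python
# Naive verb-prefix check over camelCase and SNAKE_CASE names.
# It can flag REGISTER_XYZ_FIELD_BIT_1 but cannot choose between
# "get" and "set" for the fix; that needs semantic review.
import re

VERBS = {"get", "set", "read", "write", "clear", "update", "is", "has"}

def first_word(name: str) -> str:
    """First word of a camelCase or SNAKE_CASE identifier."""
    m = re.match(r"[a-z]+", name)
    if m:                                 # camelCase: leading lowercase run
        return m.group()
    return name.lower().split("_")[0]     # SNAKE_CASE

def starts_with_verb(name: str) -> bool:
    return first_word(name) in VERBS
```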



