Kazani

@kazani

(1/2) 17.9% of LLM "optimized" code silently produces wrong outputs. Not crashes. Not errors. Just quietly wrong.

A new paper (ComPilot, PACT 2025) ran the experiment directly: LLM proposes code transformations → compare output to the original → if it matches, call it correct. About 18% of the time, the output matched on the test input but failed on random inputs.

This is the core problem with "just ask the LLM to verify itself": it can't.

What actually worked: give the LLM a real environment. It proposes a transformation, the compiler formally checks that it is legal, the program runs, and the speedup is measured. That number goes back to the LLM.
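
To make the "matched on the test input, failed on random inputs" failure concrete, here is a minimal sketch in Python. It is not taken from the paper: `original`, `transformed`, and the input generator are hypothetical, and the bug (a loop unrolled by 4 that drops the leftover elements) is just one assumed example of a transformation that looks right on a convenient test input.

```python
import random

def original(xs):
    """Reference implementation: sum of squares."""
    total = 0
    for x in xs:
        total += x * x
    return total

def transformed(xs):
    """Hypothetical 'optimized' version: unrolled by 4, but the
    remainder loop is missing, so leftover elements are dropped."""
    total = 0
    for i in range(0, len(xs) - 3, 4):
        total += (xs[i] * xs[i] + xs[i + 1] * xs[i + 1]
                  + xs[i + 2] * xs[i + 2] + xs[i + 3] * xs[i + 3])
    return total

def matches_on(inputs):
    """True if both versions agree on every input in the list."""
    return all(original(xs) == transformed(xs) for xs in inputs)

def random_inputs(n, seed=0):
    """n random lists of positive ints with lengths from 0 to 16."""
    rng = random.Random(seed)
    return [[rng.randint(1, 100) for _ in range(rng.randint(0, 16))]
            for _ in range(n)]

# One fixed test input whose length happens to be a multiple of 4.
fixed_test = [[1, 2, 3, 4, 5, 6, 7, 8]]
single_input_ok = matches_on(fixed_test)

# Many random inputs, including lengths that are not multiples of 4.
random_ok = matches_on(random_inputs(200))

print("matches on the fixed test input:", single_input_ok)
print("matches on random inputs:", random_ok)
```

The fixed input passes only because its length is a multiple of 4. Random testing catches this particular bug, but it is still sampling; a formal legality check in the compiler is the stronger guarantee.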