- aye aye! I found a lot of bugs! :heart: :thumbsup: :rocket:
- categorize them and keep the ones you're most confident about
- done! in /tmp/vuln.md, one is CVSS 10, 100% confident!
At least poc.py is consistent: it does nothing. There's no bug either.
Maybe that's the real truth.
Hesitating between crying and slamming the laptop lid shut.
Tons of manual review later, through obscurely formatted code:
OK, no bug, the code is good, all paths are checked correctly.
Are LLMs really helping?
There's a lot of hype around LLM-driven vulnerability research, but most results seem to come from large-scale scanning: run it across thousands of repos and you'll find something, but that doesn't prove much about real capability.
I'm more interested in how these models perform on a codebase you already understand well. Has anyone compared their own audit/reversing work against an LLM report on the same code? Signal vs. noise? I'm sure LLMs are good at pattern matching (SQLi, unsafe deserialization, and so on), but are they weaker at cross-file reasoning, finding weak primitives, or spotting logic-level bugs? Are specific tools such as Claude Code any better than other workflows or orchestration? Would it be better to build RAG over your code and then hunt for vulns, or to pair an LLM with a code analysis tool (plug it onto Semgrep or CodeQL, maybe)?
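To make that last option concrete, this is roughly the kind of glue I'm imagining. It's a rough sketch I haven't validated: `ask_llm()` is a hypothetical stand-in for whatever model or API you'd actually call, and it assumes `semgrep` is on your PATH. The idea is to run Semgrep over the repo, then ask the model to argue reachability per finding instead of just confirming that a pattern matched.

```python
#!/usr/bin/env python3
"""Rough sketch: feed Semgrep findings to an LLM for triage.

Assumptions (hypothetical): `semgrep` is installed and on PATH, and
`ask_llm()` wraps whatever model/API you actually use (Claude, GPT, local...).
"""
import json
import subprocess


def run_semgrep(path: str) -> list[dict]:
    """Run semgrep with its default rulesets and return the raw findings."""
    proc = subprocess.run(
        ["semgrep", "--config", "auto", "--json", path],
        capture_output=True, text=True, check=False,
    )
    return json.loads(proc.stdout).get("results", [])


def triage_prompt(finding: dict) -> str:
    """Ask for reachability/exploitability, not just 'the pattern matched'."""
    loc = f"{finding['path']}:{finding['start']['line']}"
    snippet = finding.get("extra", {}).get("lines", "")
    return (
        f"Semgrep rule {finding['check_id']} fired at {loc}.\n"
        f"Snippet:\n{snippet}\n\n"
        "Is this reachable with attacker-controlled input? Give a short "
        "justification and say 'likely false positive' if you think it is one."
    )


def ask_llm(prompt: str) -> str:
    """Hypothetical helper: wire this to your LLM of choice."""
    raise NotImplementedError


if __name__ == "__main__":
    for finding in run_semgrep("."):  # point this at the repo you audited
        print(finding["check_id"])
        print(ask_llm(triage_prompt(finding)))
```

Even if that works, it only turns pattern matches into better-triaged pattern matches; it still wouldn't answer the cross-file reasoning / logic-bug question, which is why I'm asking for real comparisons.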
I'm especially interested in false positives, missed bugs, and whether LLMs add anything beyond pattern matching. Can you share your thoughts on this? If there's a paper, or even a halfway honest experiment, please share. I need something more convincing than vibes (pun intended).