Pentest, OSINT, reverse engineering, RCE/LPE addict.
My name is Johnny Clathetic.
Dark thoughts and hard-to-understand jokes. "If you don't understand, assume I'm dumb."
https://heapdump.alwaysdata.net/

At least poc.py is consistent: it does nothing. There's no bug either.

Maybe that's the real truth.

Hesitating between crying and slamming the laptop lid shut.

Tons of manual review later, through obscurely formatted code.
Okay, no bug: the code is good, all paths are checked correctly.

Are LLMs really helping?

- ok, LLM, can you analyze vuln.md and poc.py to figure out what's wrong?
- I don't think there is a bug; the code is unreachable, the structure of the data is bad.
- Try again
- wrote /tmp/poc2.py
- still doesn't work
- Maybe no bug.
$ ./poc.py
(nothing happens)
$
- ok, LLM, analyze all the bugs in /tmp/vuln.md and tell me:
are you confident in the bug?
- mmmh, not sure, but the bugs are there
- write a poc.py
- done!
- ok, LLM find me bugs in this codebase
- aye aye! I found a lot of bugs! :heart: :thumbsup: :rocket:
- categorize them, keep the ones you're the most confident with
- done! in /tmp/vuln.md file, one is CVSS10, 100% confident!
another night.. sleepless..
I'm tired
another night...

There’s a lot of hype around LLM-driven vulnerability research, but most results seem to come from large-scale scanning (running a model across thousands of repos means you'll find something, but it doesn't prove anything about real capability...).

I’m more interested in how these models perform on a codebase you already understand well. Has anyone compared their own audit/reversing work against an LLM report on the same code? What's the signal vs. noise? I'm sure LLMs are good at pattern-matching (SQLi, unsafe deserialization, and so on), but are they weaker at cross-file reasoning, finding weak primitives, or logic-level bugs? And are specific tools such as Claude Code any better than other workflows or orchestration? Would it be better to run RAG over your code and then search for vulns, or to pair the model with a code analysis tool (plug an LLM into Semgrep or CodeQL, maybe)?
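To make the pattern-matching vs. logic-bug distinction concrete, here's a contrived sketch (not from any real codebase, all names invented): the first bug is a one-line textbook SQLi that pattern-matchers and LLMs flag instantly; the second is a privilege-escalation flaw that only appears when you reason across two functions that each look fine in isolation.

```python
import sqlite3

# Pattern-level bug: classic SQLi, visible on a single line.
def find_user(db, name):
    # string-formatted query -> injectable
    cur = db.execute(f"SELECT id FROM users WHERE name = '{name}'")
    return cur.fetchall()

# Logic-level bug: the flaw spans two functions. set_role() never
# validates against an allow-list, and is_admin() blindly trusts
# the stored field, so any caller can escalate.
def set_role(profile, role):
    profile["role"] = role            # no validation here...

def is_admin(profile):
    return profile["role"] == "admin"  # ...and blind trust here

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
db.execute("INSERT INTO users VALUES (1, 'alice')")

# The OR clause makes the WHERE always true despite the bogus name.
rows = find_user(db, "x' OR '1'='1")
print(len(rows))        # 1 row matched

profile = {"role": "user"}
set_role(profile, "admin")  # attacker-controlled input
print(is_admin(profile))    # True
```

The first bug is greppable; the second is the kind of cross-file reasoning I'd want to see measured.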

I’m especially interested in: false positives, missed bugs, and whether LLMs add anything beyond pattern matching. Can you share your thoughts on this? If there’s a paper, or even a halfway honest experiment, please share. I need something more convincing than vibes (pun intended).