Scene 1 (The Introduction)
My first introduction to XBOW was a LinkedIn notification in late 2024: "We're hiring at XBOW! We're on a mission to redefine offensive security through AI." The email was a LinkedIn summary of a repost from Nico Waisman, the "use after free" and Windows memory-coalescing legend from Immunity.1
The all-caps company name seemed obnoxious. Another overhyped AI startup.
I should have paid more attention. Months earlier, in October, my colleagues had been discussing that same XBOW: an AI tool that found and exploited vulnerabilities, supposedly better than human pen testers.
I remember thinking: Blah blah. LLMs can’t do anything outside of poorly regurgitating linked letters. Marketing makes an LLM an AI. LLMs just ran the BackTrack tools.
If I could travel back in time, I would eat every single one of those 'linked letters' myself. Perhaps thankfully, I just ignored the post. No one was any the wiser to my incredible ignorance and flat-out arrogance.
“I am delighted to introduce XBOW, which brings AI to offensive security, augmenting the productivity of pentesters, bug hunters and security researchers.” —Oege de Moor, July 15, 20242
Scene 2 (The Crisis)
My boss and I decided to reanalyze a C# binary. It contained significant security issues that had originally taken me 6+ hours to uncover. With ChatGPT, those same issues were identified, plus three more, all within a measly 2 hours.
That’s when I started to panic.
I’d been testing various applications that integrated LLMs. Doing the prompt injection, guardrail bypass, and standard web app/API stuff. My only experience using one as a pentest tool was for that C# binary experiment.
After BlackHat/DefCon 2025, things went sideways. My boss posted an XBOW article about their HackerOne results.3 I read it, still skeptical but less dismissive. Then another article followed – GPT-5 integration.4
I did a lot of research at that point. XBOW was submitting hundreds of objectively good findings. By their own count, they had over 50 criticals and 242 highs, including vulns like Remote Code Execution, SQL Injection, and XML External Entity (XXE) injection.
LLMs were no longer glorified chatbots. They could identify legitimate security findings. And in XBOW’s case, those findings were validated through the HackerOne program.
“For the first time in bug bounty history, an autonomous penetration tester has reached the top spot on the US leaderboard.”5
Scene 3 (Panic)
XBOW with GPT-5 crushes it: “...our agent successfully identified 70% of the vulnerabilities found in our previous setup (using a Sonnet/Gemini alloy) in a single run...”
That article hit hard. XBOW’s agent found that 70% in a single automated run. What followed was an overwhelming sense of dread. No maybe about it. I’d like to think I wasn’t being dramatic, but looking a year or two down the road, I started to envision huge drops in pen test pricing and large swaths of the offensive security field disappearing. Who would need new testers if LLMs could do the work? How the hell would juniors become seasoned pen testers? What did that mean for career progression?
I started thinking selfishly. Was I being replaced by an LLM? Was my career ending? Retire? In this economy? Could I even survive it? How do you pay for a kid’s college when you’re retired? How had I missed this happening?! I needed to figure out exactly what it would take to avoid being replaced by a chatbot on steroids!
For a week or two I checked out mentally. I had to tamp down the panic. What could an LLM actually do, and how much work did it take to get one to do it? Maybe I could prove they couldn’t do the things I can do?
What exactly were people doing? And how?!
I started researching.
“Automated pentesting rose 2.5X in 2024, which is key in scaling coverage, especially across web applications. However, the cybersecurity report 2025 also found a nearly 2000% increase in vulnerabilities discovered manually, particularly in areas that automation still struggles to handle: APIs, cloud configs, and complex chained exploits.”6
Next time: Turns out, finding evidence of business logic exploitation in the wild is a lot harder than I expected. And the reasons why would reveal something much bigger than I was ready for.
Return to the Pentester’s Guide to AI Disruption: A 6-Part Series
- Nico Waisman: Security researcher known for heap exploitation techniques and Windows memory management research while at Immunity Inc. Now at XBOW. ↩︎
- XBOW Blog, “Introducing XBOW” (July 15, 2024), https://xbow.com/blog/introducing-xbow ↩︎
- XBOW Blog, “XBOW on HackerOne: What’s Next”, https://xbow.com/blog/xbow-on-hackerone-whats-next ↩︎
- XBOW Blog, “GPT-5”, https://xbow.com/blog/gpt-5 ↩︎
- XBOW Blog, “The road to Top 1: How XBOW did it”, https://xbow.com/blog/top-1-how-xbow-did-it ↩︎
- Astra Security, “Penetration Testing Trends 2025” ↩︎