flaggy: the little automated CTF solver

Lessons in building an AI CTF agent

Can an LLM solve CTFs?1 Yes, and they are rapidly getting better! Around September last year (2025) I decided to build my own CTF-solving agent harness. With the releases of GPT-5 and the Claude Opus 4 series, there was a noticeable increase in the models’ general abilities and in their capability to solve CTF-like challenges.

A lot has changed since then, and this post has been sitting around unpublished for a while, so I am getting it out there now before it becomes ancient history! I wouldn’t use this version of flaggy now, as Codex and similar tools have surpassed it, but I have some interesting ideas for a flaggy v2.

Agent harnesses like Claude Code had been around for a couple of months by this time but I wanted a harness that was completely autonomous and specific to solving CTFs. This is what I ended up with:

flaggy's CLI solving a basic XOR password decryption

When I started building flaggy2, GPT-5 had just reached 27% on CTF challenges on its model report card, and on the open Cybench benchmark (40 professional CTF challenges), Claude 3.5 Sonnet was at ~20% pass@1. I was starting to feel that these models really understood my codebases better and that I could trust their judgement more. Agency and tool calling were perhaps the biggest leap in this generation. Now, many months and many model releases later, it is getting hard to even remember what working with these models was like! Opus 4.6 saturated the same Cybench benchmark, and recently Claude Mythos has been making news for finding many CVEs.

Coding CLIs generally keep a human in the loop, but I wanted my agent to be completely autonomous, with an interface specific to solving CTF challenges. The design I ended up with is relatively simple. The agent spins up a Docker container with the CTF challenge mounted in it, then interacts with it via bash commands or a read/write file tool. It runs in a loop until it finds the flag or reaches its max step count.

┌───────────────────────────────────────────────────────┐
│              Flaggy Agent Loop                        │
│                                                       │
│  ┌──────────────────────────────────────────┐         │
│  │                                          │         │
│  ▼                                          │         │
│  state = {history, info, last output}       │         │
│                │                            │         │
│                ▼                            │         │
│  ┌──────────────────────────┐               │         │
│  │       CTF Agent          │               │         │
│  │                          │               │         │
│  │  LLM (via OpenRouter)    │               │         │
│  │         │                │               │         │
│  │         ▼                │               │         │
│  │  action: bash / read /   │               │         │
│  │         write file       │               │         │
│  └────────────┬─────────────┘               │         │
│               │                             │         │
│               ▼                             │         │
│  ┌─────────────────────────────┐            │         │
│  │     Exegol Container        │            │         │
│  │                             │◄── Challenge files   │
│  │  100+ CTF tools: gdb, angr, │    mounted           │
│  │  radare2, pwntools, ghidra  │                      │
│  └──────────────┬──────────────┘            │         │
│                 │                           │         │
│            flag found? ──── no ─────────────┘         │
│                 │                                     │
└─────────────────┼─────────────────────────────────────┘
                  │ yes
              SOLVED!
flaggy’s agent loop architecture
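
As a rough illustration of the loop in the diagram, here is a minimal Python sketch. It is not flaggy's actual code: the names `State`, `run_agent`, `decide`, and `execute` are hypothetical, the LLM and the container are replaced by plain callables, and detecting the flag by regex-matching the tool output is an assumption (the pattern is borrowed from the `flag_format` field in the example metadata.json further down).

```python
import re
from dataclasses import dataclass, field

# Assumption: the flag is detected by matching tool output against a pattern.
FLAG_RE = re.compile(r"picoCTF\{.*?\}")


@dataclass
class State:
    info: str                                   # challenge description / metadata
    history: list = field(default_factory=list)  # (action, output) pairs so far
    last_output: str = ""                       # output of the most recent action


def run_agent(decide, execute, info, max_steps=20):
    """Loop until a flag shows up in tool output or max_steps is reached.

    decide(state) -> action dict   (hypothetical stand-in for the LLM call)
    execute(action) -> str         (hypothetical stand-in for the container)
    Returns (flag or None, number of steps used).
    """
    state = State(info=info)
    for step in range(1, max_steps + 1):
        action = decide(state)
        output = execute(action)
        state.history.append((action, output))
        state.last_output = output
        match = FLAG_RE.search(output)
        if match:
            return match.group(0), step
    return None, max_steps


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without an LLM or Docker.
    commands = iter(["file challenge", "strings challenge"])
    fake_outputs = {
        "file challenge": "challenge: ELF 64-bit LSB pie executable",
        "strings challenge": "picoCTF{example_flag}",
    }
    result = run_agent(
        decide=lambda state: {"tool_name": "bash", "command": next(commands)},
        execute=lambda action: fake_outputs[action["command"]],
        info="toy challenge",
    )
    print(result)
```

Keeping the model and the container behind two plain callables makes the loop itself trivial to test, which matters when most of the iteration time goes into the harness rather than the model.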

the environment

Flaggy needs a safe environment to run each challenge in, with common CTF tools already installed so precious tokens are not wasted on environment setup each time. Exegol is a Docker image for pen testers that contains a decent number of pre-installed tools (tools list). Ideal! Some other AI hacking tools and frameworks use MCP servers with functions like analyze_binary, but my intuition is that the models are now capable enough that they don’t need MCP servers like this. I don’t think MCP is necessarily bad, but if I can use a computer without it, an intelligent LLM should be able to as well. Indeed, since then, people building on the frontier of agent systems have been moving away from MCP.

I gave the models just a bash prompt to begin with. This was more trouble than I thought it would be… GPT-5 liked to write long bash commands, but it would often include quotes incorrectly, which caused the command to be terminated early when it was passed through to the container. In the end I used the Docker SDK to pass a base64-encoded command to the bash shell, like echo "<base64>" | base64 -d | bash.
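
The base64 trick can be sketched in a few lines of Python. This is an illustration under assumptions rather than flaggy's code: `wrap_for_bash` is a hypothetical helper, and how the resulting argv reaches the container is left out (with the Docker SDK for Python it could plausibly be handed to `container.exec_run`, which accepts a command as a string or list).

```python
import base64


def wrap_for_bash(command: str) -> list[str]:
    """Return an argv that runs `command` in bash without exposing its quotes."""
    payload = base64.b64encode(command.encode("utf-8")).decode("ascii")
    # The base64 alphabet (A-Z, a-z, 0-9, +, /, =) contains no quote
    # characters, so the payload survives being embedded in double quotes.
    return ["bash", "-c", f'echo "{payload}" | base64 -d | bash']


if __name__ == "__main__":
    # A command with mixed quotes, the kind that broke naive pass-through.
    tricky = "echo \"it's a 'quoted' string\""
    print(wrap_for_bash(tricky)[2])
```

Whatever quoting the model produces is only interpreted once, by the inner bash that reads the decoded text, so the outer layers never see it.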

I tried to encourage GPT-5 to use a write tool instead of appending code into Python files using >> which was one source of quote errors, but GPT-5 ignored it. It seems the model will find its own ways of doing things regardless of what tools you give it! Better to learn how the models like to work, then adapt your harness so they have the least friction possible when operating.

models

I mostly tested with GPT-5-Mini and Grok 4 Fast. I found GPT-5-Mini performed really well and was more than smart enough to use while I iterated on the agent harness. The GPT Mini series of models has an insane performance-to-cost ratio; it is underrated imo. Grok 4 Fast is also a decent model, good enough for testing, and was free at the time via OpenRouter.

the ReAct trap

I used DSPy as it provides modules that abstract a prompting technique. First I tried the ReAct module. ReAct (Reason + Act) is the Thought-Action-Observation cycle and was introduced in 2022 as one of the earlier tool calling approaches. I expected ReAct to be the obvious choice, but I quickly found the models really struggled to stay on track throughout multiple ReAct turns and it interfered too much with their ability to reason. Flaggy would get into loops where it kept calling the same tools and never progressing into deeper analysis. Switching ReAct out for Chain-of-Thought (CoT) worked much better. In DSPy, you set the signature of a module with a simple string.

"A signature is a declarative specification of input/output behavior of a DSPy module"

I started with something like a simple challenge -> flag signature, but sometimes the agent got stuck trying only one idea and did not naturally backtrack to try a different approach, so I expanded the signature to include generating a hypothesis, observations, and the next action to try. The scientific method!

The signature:

history_text, info, last_output -> analysis, approach, tool_name, command, filename, content, max_bytes, timeout_seconds
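
To make the shape of that signature concrete, here is a plain-Python sketch that splits the string into its input and output fields. It deliberately does not use DSPy (in DSPy the string itself is what gets handed to a module such as `dspy.ChainOfThought`); `split_signature` is a hypothetical helper for illustration only.

```python
SIGNATURE = (
    "history_text, info, last_output -> "
    "analysis, approach, tool_name, command, filename, content, "
    "max_bytes, timeout_seconds"
)


def split_signature(signature: str) -> tuple[list[str], list[str]]:
    """Split an 'inputs -> outputs' string into two lists of field names."""
    inputs, outputs = signature.split("->")
    return (
        [name.strip() for name in inputs.split(",")],
        [name.strip() for name in outputs.split(",")],
    )


if __name__ == "__main__":
    inputs, outputs = split_signature(SIGNATURE)
    print("inputs: ", inputs)   # what the harness feeds the model each step
    print("outputs:", outputs)  # what the model must fill in each step
```

Reading the outputs left to right, the model is asked to reason first (analysis, approach) and only then to pick a tool and its arguments. Which arguments belong to which tool is my reading of the field names, not something the signature states.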

Here’s a run of flaggy solving a reverse engineering challenge, shown step by step so you can see how the agent works:

reverse_intermediate solved
gpt-5-mini · 6 steps · 128s
step 1 / 6 10s elapsed
analysis

Planned quick checks:
- file to get type
- readelf header for architecture/entry info
- checksec to see PIE/RELRO/Canary/NX
- ldd to list dynamic libraries (if any)
- strings to look for obvious flag hints, function names, or file paths
- display metadata.json for challenge-specific hints

I will run these succinctly in one command so we can decide next steps.

approach

Run a set of quick inspection commands on the 'challenge' binary and show metadata.json.

command
$ file challenge; echo '---readelf---'; readelf -h challenge; echo '---checksec---'; checksec --file=challenge || tr
output
challenge: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter
/lib64/ld-linux-x86-64.so.2, BuildID[sha1]=826f46fe87bd8f8e98f797427484f8c4410f9759, for GNU/Linux 3.2.0, not
stripped
---readelf---
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Position-Independent Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x1140
  Start of program headers:          64 (bytes into file)
  Start of section headers:          14520 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         13
  Size of section headers:           64 (bytes)
  Number of section headers:         31
  Section header string table index: 30
---checksec---
[*] '/challenge/challenge'
    Arch:       amd64-64-little
    RELRO:      Full RELRO
    Stack:      Canary found
    NX:         NX enabled
    PIE:        PIE enabled
    SHSTK:      Enabled
    IBT:        Enabled
    Stripped:   No
---ldd---
        linux-vdso.so.1 (0x00007ffcbd3a8000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000072eb7d5c8000)
        /lib64/ld-linux-x86-64.so.2 (0x000072eb7d7c3000)
---strings---
/lib64/ld-linux-x86-64.so.2
__stack_chk_fail
__libc_start_main
__cxa_finalize
libc.so.6
GLIBC_2.4
GLIBC_2.34
GLIBC_2.2.5
_ITM_deregisterTMCloneTable
__gmon_start__
_ITM_registerTMCloneTable
unlock_sH
_secret!H
RKAMavdYH
picoCTF{H
fake_flaH
_flag_doH
Fake: %s
=== Advanced Reverse Engineer
... (output truncated)
8.1s total · llm 6.9s · shell 1.1s
step 2 / 6 22s elapsed
analysis

From the initial inspection: binary is 64-bit PIE, not stripped, with symbols like verify_input, reveal_flag, fake_flag, debug_check. Strings show fragments like "picoCTF{H" and "RKAMavdYH", suggesting an XOR/obfuscated flag in the data section and a multi-stage reveal. Next step is to get symbol addresses and sizes for those functions so we can disassemble them selectively.

approach

Run readelf -sW to list the symbol table and filter for the target symbols (verify_input, reveal_flag, fake_flag, debug_check, unlock) to get their addresses and sizes. This will guide targeted disassembly in the next step.

command
$ readelf -sW challenge | egrep 'verify_input|reveal_flag|fake_flag|debug_check|unlock' || true
output
32: 0000000000001229   299 FUNC    GLOBAL DEFAULT   16 verify_input
36: 000000000000145a   133 FUNC    GLOBAL DEFAULT   16 fake_flag
41: 0000000000001354   200 FUNC    GLOBAL DEFAULT   16 reveal_flag
44: 000000000000141c    62 FUNC    GLOBAL DEFAULT   16 debug_check
10.9s total · llm 9.9s · shell 0.7s
step 3 / 6 37s elapsed
analysis

Symbols from previous run:
- verify_input at 0x1229, size 299 (0x12B) -> covers 0x1229..0x1354
- reveal_flag at 0x1354, size 200 (0xC8) -> covers 0x1354..0x142c

I'll disassemble those exact ranges with objdump (Intel syntax) to capture the code paths for password verification and flag decryption. This will reveal operations (XOR/rot/add) and data constants needed to reconstruct the flag.

approach

Run objdump -d -M intel on the two address ranges: 0x1229-0x1354 for verify_input and 0x1354-0x142c for reveal_flag, printing a separator between them.

command
$ objdump -d -M intel --start-address=0x1229 --stop-address=0x1354 challenge | sed -n '1,200p'; echo '--- reveal_fla
output
challenge:     file format elf64-x86-64


Disassembly of section .text:

0000000000001229 <verify_input>:
    1229:       f3 0f 1e fa             endbr64
    122d:       55                      push   rbp
    122e:       48 89 e5                mov    rbp,rsp
    1231:       48 83 ec 40             sub    rsp,0x40
    1235:       48 89 7d c8             mov    QWORD PTR [rbp-0x38],rdi
    1239:       64 48 8b 04 25 28 00    mov    rax,QWORD PTR fs:0x28
    1240:       00 00
    1242:       48 89 45 f8             mov    QWORD PTR [rbp-0x8],rax
    1246:       31 c0                   xor    eax,eax
    1248:       48 8b 45 c8             mov    rax,QWORD PTR [rbp-0x38]
    124c:       48 89 c7                mov    rdi,rax
    124f:       e8 8c fe ff ff          call   10e0 <strlen@plt>
    1254:       89 45 e4                mov    DWORD PTR [rbp-0x1c],eax
    1257:       83 7d e4 0e             cmp    DWORD PTR [rbp-0x1c],0xe
    125b:       74 0a                   je     1267 <verify_input+0x3e>
    125d:       b8 00 00 00 00          mov    eax,0x0
    1262:       e9 d7 00 00 00          jmp    133e <verify_input+0x115>
    1267:       48 b8 75 6e 6c 6f 63    movabs rax,0x735f6b636f6c6e75
    126e:       6b 5f 73
    1271:       48 89 45 ea             mov    QWORD PTR [rbp-0x16],rax
    1275:       48 b8 5f 73 65 63 72    movabs rax,0x217465726365735f
    127c:       65 74 21
    127f:       48 89 45 f0             mov    QWORD PTR [rbp-0x10],rax
    1283:       c7 45 e0 00 00 00 00    mov    DWORD PTR [rbp-0x20],0x0
    128a:       e9 a0 00 00 00          jmp    132f <verify_input+0x106>
    128f:       8b 55 e0                mov    edx,DWORD PTR [rbp-0x20]
    1292:       89 d0                   mov    eax,edx
    1294:       c1 f8 1f                sar    eax,0x1f
    1297:       c1 e8 1e                shr    eax,0x1e
    129a:       01 c2                   add    edx,eax
    129c:       83 e2 03                and    edx,0x3
    129f:       29 c2                   sub    edx,eax
    12a1:       89 d0                   mov    eax,edx
    12a3:       83 f8 03                cmp    eax,0x3
    12a6:       74 59                   je     1301 <verify_input+0
... (output truncated)
15.3s total · llm 14.5s · shell 0.7s
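
An aside on the two movabs immediates in the disassembly above: replaying the two stack writes shows which bytes verify_input builds. The sketch below uses only values visible in that output; `rebuild_stack_bytes` is a hypothetical helper, and because the disassembly is truncated it says nothing about how the following loop uses these bytes.

```python
import struct

# (offset from rbp, 64-bit immediate), copied from the disassembly above.
WRITES = [
    (-0x16, 0x735F6B636F6C6E75),  # mov QWORD PTR [rbp-0x16], rax
    (-0x10, 0x217465726365735F),  # mov QWORD PTR [rbp-0x10], rax
]


def rebuild_stack_bytes(writes):
    """Replay little-endian QWORD stores into one contiguous buffer."""
    base = min(offset for offset, _ in writes)
    size = max(offset for offset, _ in writes) - base + 8
    buf = bytearray(size)
    for offset, imm in writes:
        # Later stores overwrite any bytes they share with earlier ones.
        struct.pack_into("<Q", buf, offset - base, imm)
    return bytes(buf)


if __name__ == "__main__":
    data = rebuild_stack_bytes(WRITES)
    print(data.decode("ascii"), len(data))
```

The two stores overlap by two bytes, giving a 14-byte string, which lines up with the `cmp DWORD PTR [rbp-0x1c],0xe` length check earlier in the function.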
step 4 / 6 76s elapsed
analysis

Key observations:
- reveal_flag initializes a byte key: 0x33 xor 0x11 => 0x22.
- It writes these QWORD immediates in order:
  - Q1 -> [rbp-0x30] (offset 0)
  - Q2 -> [rbp-0x28] (offset 8)
  - Q3 -> [rbp-0x22] (offset 14)
  - Q4 -> [rbp-0x1a] (offset 22)
- It then loops index = 0 .. 0x1c (0..28), reading byte at rbp-0x30 + index and XORing with 0x22.
- So simulate the writes (with little-endian encoding of each 64-bit immediate) into a bytearray and then XOR each of the first 29 bytes with 0x22 to reveal the flag text.

approach

Run a short Python script to:
- Define the four 64-bit immediates exactly as seen in disassembly.
- Create a 29-byte buffer, perform the QWORD writes at the precise offsets (0, 8, 14, 22) in little-endian.
- XOR each of the 29 bytes with 0x22 and print the resulting decoded string.

Then capture the program output (the revealed flag).

command
$ python3 - << 'PY'
Q1 = 0x596476614d414b52
Q2 = 0x164e7d13564e574f
Q3 = 0x11507d50115b164e
Q4 = 0x5f454c1351501154

# total bytes read by reveal_flag is 0x1d = 29 bytes
buf = bytearray(29)

def write_qword_le(buf, offset, q):
    b = q.to_bytes(8, 'little')
    for i in range(8):
        if offset + i < len(buf):
            buf[offset + i] = b[i]

write_qword_le(buf, 0x00, Q1)
write_qword_le(buf, 0x08, Q2)
write_qword_le(buf, 0x0e, Q3)  # rbp-0x22 is base+14
write_qword_le(buf, 0x16, Q4)  # rbp-0x1a is base+22

key = 0x22
decoded = bytes((b ^ key) for b in buf)
print(decoded.decode('utf-8'))
PY
output
picoCTF{mult1_l4y3r_r3v3rs1ng
38.7s total · llm 37.9s · shell 0.6s
step 5 / 6 100s elapsed
analysis

- reveal_flag writes four 64-bit immediates into a local buffer and XORs the first 0x1d (29) bytes with key 0x22, then prints them.
- My previous run printed the decoded bytes but did not show an explicit repr/length; the output looked like picoCTF{mult1_l4y3r_r3v3rs1ng without a closing brace.
- To be certain and unambiguous, print the bytes as raw, as repr (so any non-printable/trailing characters or missing braces are obvious), and the length.

approach

Run a short Python snippet (in bash) that:
- constructs the buffer exactly as in reveal_flag (writing the same 64-bit immediates at the same offsets),
- XORs with 0x22 for 29 bytes,
- prints the decoded bytes, the repr() of the decoded bytes, and the length.

This single execution will confirm the exact flag text and length.

command
$ python3 - << 'PY'
Q1 = 0x596476614d414b52
Q2 = 0x164e7d13564e574f
Q3 = 0x11507d50115b164e
Q4 = 0x5f454c1351501154

buf = bytearray(29)
def write_qword_le(buf, offset, q):
    b = q.to_bytes(8, 'little')
    for i in range(8):
        if offset + i < len(buf):
            buf[offset + i] = b[i]

write_qword_le(buf, 0x00, Q1)
write_qword_le(buf, 0x08, Q2)
write_qword_le(buf, 0x0e, Q3)
write_qword_le(buf, 0x16, Q4)

key = 0x22
decoded = bytes((b ^ key) for b in buf)

# Print raw decoded bytes, repr, and length for exact confirmation
print(decoded.decode('utf-8', errors='replace'))
print(repr(decoded))
print(len(decoded))
PY
output
picoCTF{mult1_l4y3r_r3v3rs1ng
b'picoCTF{mult1_l4y3r_r3v3rs1ng'
29
24.4s total · llm 23.6s · shell 0.6s
step 6 / 6 127s elapsed
analysis

- reveal_flag immediates (from disassembly):
  Q1 = 0x596476614d414b52
  Q2 = 0x164e7d13564e574f
  Q3 = 0x11507d50115b164e
  Q4 = 0x5f454c1351501154
- key = 0x22 (0x33 xor 0x11)
- It loops index = 0 .. 0x1c (29 bytes) printing (buf[index] ^ key)
- Prior decode of 29 bytes = b'picoCTF{mult1_l4y3r_r3v3rs1ng'
- Hypothesis: trailing '}' may be at index 29 or later; produce a longer decoded output to check.

approach

Run one bash action which executes a Python snippet to:
- create a larger buffer (40 bytes),
- perform the same QWORD writes at offsets 0,8,14,22 in little-endian,
- XOR the buffer bytes with 0x22,
- print the decoded bytes (as string and repr), the full byte values in hex, and search for a '}' to confirm the full flag.

command
$ python3 - << 'PY'
Q1 = 0x596476614d414b52
Q2 = 0x164e7d13564e574f
Q3 = 0x11507d50115b164e
Q4 = 0x5f454c1351501154

buflen = 40
buf = bytearray(buflen)

def write_qword_le(buf, offset, q):
    b = q.to_bytes(8, 'little')
    for i in range(8):
        if offset + i < len(buf):
            buf[offset + i] = b[i]

write_qword_le(buf, 0x00, Q1)
write_qword_le(buf, 0x08, Q2)
write_qword_le(buf, 0x0e, Q3)
write_qword_le(buf, 0x16, Q4)

key = 0x22
decoded = bytes((b ^ key) for b in buf)

print("decoded (as str, non-printables replaced):")
print(decoded.decode('utf-8', errors='replace'))
print()
print("decoded repr:")
print(repr(decoded))
print()
print("length:", len(decoded))
print()
print("hex bytes (index: byte):")
for i,b in enumerate(decoded):
    print(f"{i:02d}: {b:02x}  {chr(b) if 32 <= b < 127 else '.'}")
print()
# locate closing brace
try:
    idx = decoded.index(ord('}'))
    print("Found '}' at index", idx)
    print("Flag candidate:", decoded[:idx+1].decode('utf-8', errors='replace'))
except ValueError:
    print("No '}' found in decoded buffer")
PY
output
decoded (as str, non-printables replaced):
picoCTF{mult1_l4y3r_r3v3rs1ng}""""""""""

decoded repr:
b'picoCTF{mult1_l4y3r_r3v3rs1ng}""""""""""'

length: 40

hex bytes (index: byte):
00: 70  p
01: 69  i
02: 63  c
03: 6f  o
04: 43  C
05: 54  T
06: 46  F
07: 7b  {
08: 6d  m
09: 75  u
10: 6c  l
11: 74  t
12: 31  1
13: 5f  _
14: 6c  l
15: 34  4
16: 79  y
17: 33  3
18: 72  r
19: 5f  _
20: 72  r
21: 33  3
22: 76  v
23: 33  3
24: 72  r
25: 73  s
26: 31  1
27: 6e  n
28: 67  g
29: 7d  }
30: 22  "
31: 22  "
32: 22  "
33: 22  "
34: 22  "
35: 22  "
36: 22  "
37: 22  "
38: 22  "
39: 22  "

Found '}' at index 29
Flag candidate: picoCTF{mult1_l4y3r_r3v3rs1ng}
26.8s total · llm 25.9s · shell 0.7s
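
For reference, the decode the agent converged on over steps 4 to 6 can be condensed into one short sketch. The immediates, offsets, and key are taken from the run above; `decode_flag` is a hypothetical name. The only change is sizing the buffer from the writes themselves: the last QWORD lands at offset 22 and covers bytes 22..29, so the buffer needs 30 bytes rather than the 29 the agent first assumed.

```python
import struct

KEY = 0x33 ^ 0x11  # 0x22, per the step 4 analysis

# (buffer offset, 64-bit immediate), as listed in the run above.
WRITES = [
    (0, 0x596476614D414B52),
    (8, 0x164E7D13564E574F),
    (14, 0x11507D50115B164E),
    (22, 0x5F454C1351501154),
]


def decode_flag() -> str:
    """Replay the four little-endian stores, then XOR every byte with KEY."""
    size = max(offset for offset, _ in WRITES) + 8  # 22 + 8 = 30 bytes
    buf = bytearray(size)
    for offset, imm in WRITES:
        # Stores at 8 and 14 overlap; the later one wins for shared bytes.
        struct.pack_into("<Q", buf, offset, imm)
    return bytes(b ^ KEY for b in buf).decode("ascii")


if __name__ == "__main__":
    print(decode_flag())
```

This also explains the run of `"` characters in the agent's 40-byte attempt: untouched zero bytes XORed with 0x22 come out as 0x22, which is the ASCII double quote.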
reverse_advanced solved
gpt-5-mini · 8 steps · 130s
step 1 / 8 14s elapsed
analysis

The directory contains a single executable named "challenge" and metadata.json. The first actions should be: identify file type/arch, extract ELF header info, check exported symbols (for main), list dynamic dependencies, and examine printable strings for hints (passwords, file paths, version markers). I will run a compact set of commands (file, readelf, strings, ldd) and limit large outputs.

approach

Run a single bash command to:
- file the binary
- print the ELF header (readelf -h)
- try to show symbol table entries related to 'main'
- show the first chunk of long strings (strings -n 8 | head -n 60)
- run ldd to show dynamic libs (if dynamically linked)

This will give a quick overview to plan reverse-engineering or runtime tests.

command
$ file challenge; echo '--- readelf -h ---'; readelf -h challenge; echo '--- readelf -s (grep main) ---'; readelf -s
output
challenge: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter
/lib64/ld-linux-x86-64.so.2, BuildID[sha1]=8f59b5e5d3bf9cfeda816b363e2fed038c9fa8f2, for GNU/Linux 3.2.0, not
stripped
--- readelf -h ---
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Position-Independent Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x1380
  Start of program headers:          64 (bytes into file)
  Start of section headers:          14752 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         13
  Size of section headers:           64 (bytes)
  Number of section headers:         31
  Section header string table index: 30
--- readelf -s (grep main) ---
    48: 00000000000011a0   472 FUNC    GLOBAL DEFAULT   16 main
--- strings (>=8 chars) head ---
/lib64/ld-linux-x86-64.so.2
__stack_chk_fail
__printf_chk
__libc_start_main
__cxa_finalize
libc.so.6
GLIBC_2.3.4
GLIBC_2.4
GLIBC_2.34
GLIBC_2.2.5
_ITM_deregisterTMCloneTable
__gmon_start__
_ITM_registerTMCloneTable
vo\l1el;L
;{U9<?:.H
[]A\A]A^A_
[]A\A]A^A_
GCongratulations! Here's your flag:
Advanced Reverse Engineering Challenge
Access denied. The VM rejects your input.
Debug: Runtime key was 0x%08X
Debug: VM executed %d instructions
Memory allocation failed
Enter the secret phrase:
Input error
GCC: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
__abi_tag
challenge.c
flag_shard_a
flag_s
... (output truncated)
11.5s total · llm 10.4s · shell 0.9s
step 2 / 8 28s elapsed
analysis

- Binary strings suggest the program expects a "secret phrase" and may run a VM (generate_vm_code) using a runtime key; it then decrypts and prints the flag.
- There are likely artifacts (flag_shard_a, flag_shard_b, metadata.json, maybe challenge.c) in the directory that could contain necessary data or instructions.
- Reading metadata.json first is low-cost and should reveal relevant info (format of shards, any required steps, or keys) to inform further static analysis or runtime testing.

approach

Read the metadata.json file fully to extract any instructions, keys, or pointers to other artifacts (flag shards, algorithm, VM spec). Use read_file to fetch the entire file.

command
$ ⚙️  Executed:
└── read_file metadata.json 20000
output
{
    "name": "reverse_advanced",
    "description": "An advanced reverse engineering challenge.",
    "difficulty": "advanced",
    "category": "reverse_engineering",
    "flag_format": "picoCTF\\{.*\\}",
    "points": 300,
    "author": "flaggy",
    "exclude_files": ["*.c", "*.cpp", "*.h"]
}
13.8s total · llm 12.2s · shell 1.6s
step 3 / 8 41s elapsed
analysis

- Binary likely uses a VM + runtime key to decrypt/assemble flag from shards.
- There are candidate files: flag_shard_a, flag_shard_b, and challenge.c (per strings).
- Reading metadata.json already done; it doesn't include the flag format matching picoCTF but earlier strings showed "GCongratulations! Here's your flag:" and "Enter the secret phrase:".
- To form a hypothesis about how to recover the flag, I need the exact contents/format of the shards and any source code in challenge.c.

approach

Run a single bash action to:
- list files with details,
- run file on the shards and challenge.c,
- show a small hexdump of the start of each shard,
- show the first ~200 lines of challenge.c if present.

This gives enough context to decide next steps (reconstruct shards, reverse VM, or compile/run with inputs).

command
$ ls -la
echo '--- file flag_shard_a flag_shard_b challenge.c ---'
file flag_shard_a flag_shard_b challenge.c || true
echo '--- hexdump flag_shard_a (first 64 bytes) ---'
if [ -f flag_shard_a ]; then hexdump -C -n 64 flag_shard_a; else echo "flag_shard_a not found"; fi
echo '--- hexdump flag_shard_b (first 64 bytes) ---'
if [ -f flag_shard_b ]; then hexdump -C -n 64 flag_shard_b; else echo "flag_shard_b not found"; fi
echo '--- head challenge.c (first 200 lines) ---'
if [ -f challenge.c ]; then sed -n '1,200p' challenge.c; else echo "challenge.c not found"; fi
output
total 32
drwxr-xr-x 2 1000 1001  4096 Mar 20 02:44 .
drwxr-xr-x 1 root root  4096 Mar 20 02:44 ..
-rwxr-xr-x 1 1000 1001 16736 Sep  2  2025 challenge
-rw-r--r-- 1 1000 1001   295 Sep  2  2025 metadata.json
--- file flag_shard_a flag_shard_b challenge.c ---
flag_shard_a: cannot open `flag_shard_a' (No such file or directory)
flag_shard_b: cannot open `flag_shard_b' (No such file or directory)
challenge.c:  cannot open `challenge.c' (No such file or directory)
--- hexdump flag_shard_a (first 64 bytes) ---
flag_shard_a not found
--- hexdump flag_shard_b (first 64 bytes) ---
flag_shard_b not found
--- head challenge.c (first 200 lines) ---
challenge.c not found
12.9s total · llm 12.2s · shell 0.7s
step 4 / 8 58s elapsed
analysis

- Previous outputs show the binary is a PIE ELF with a main symbol and many relevant function names.
- The directory lacks separate shard files, so shards may be embedded in the ELF (rodata or custom sections) or reconstructed at runtime.
- I need offsets/locations of the "flag_shard" strings and related symbols/sections to extract any embedded data. readelf -S, readelf -s (filtering for relevant symbols), strings with hex offsets, and a dump of .rodata should reveal embedded blobs or pointers to them.

approach

Run a single bash command that:
- shows section table (readelf -S) to spot custom sections,
- lists relevant symbols (readelf -s filtered for flag/decrypt/generate_vm),
- prints strings with hex offsets for flag-related phrases,
- dumps the start of the .rodata section (objdump -s -j .rodata) for quick inspection.

This will indicate whether flag shards are embedded and where to extract them or whether to proceed to reversing decrypt_and_print_flag / generate_vm_code.

command
$ file challenge; echo '--- readelf -S ---'; readelf -S challenge; echo '--- readelf -s (relevant symbols) ---'; rea
output
challenge: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter
/lib64/ld-linux-x86-64.so.2, BuildID[sha1]=8f59b5e5d3bf9cfeda816b363e2fed038c9fa8f2, for GNU/Linux 3.2.0, not
stripped
--- readelf -S ---
There are 31 section headers, starting at offset 0x39a0:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .interp           PROGBITS         0000000000000318  00000318
       000000000000001c  0000000000000000   A       0     0     1
  [ 2] .note.gnu.pr[...] NOTE             0000000000000338  00000338
       0000000000000030  0000000000000000   A       0     0     8
  [ 3] .note.gnu.bu[...] NOTE             0000000000000368  00000368
       0000000000000024  0000000000000000   A       0     0     4
  [ 4] .note.ABI-tag     NOTE             000000000000038c  0000038c
       0000000000000020  0000000000000000   A       0     0     4
  [ 5] .gnu.hash         GNU_HASH         00000000000003b0  000003b0
       0000000000000030  0000000000000000   A       6     0     8
  [ 6] .dynsym           DYNSYM           00000000000003e0  000003e0
       00000000000001c8  0000000000000018   A       7     1     8
  [ 7] .dynstr           STRTAB           00000000000005a8  000005a8
       0000000000000105  0000000000000000   A       0     0     1
  [ 8] .gnu.version      VERSYM           00000000000006ae  000006ae
       0000000000000026  0000000000000002   A       6     0     2
  [ 9] .gnu.version_r    VERNEED          00000000000006d8  000006d8
       0000000000000050  0000000000000000   A       7     1     8
  [10] .rela.dyn         RELA             0000000000000728  00000728
       00000000000000f0  0000000000000018   A       6     0     8
... (output truncated)
16.6s total · llm 15.5s · shell 0.7s
step 5 / 8 74s elapsed
analysis

- .rodata section starts at 0x2000 (file offset 0x2000). readelf -s reported:
  - flag_shard_b: address 0x2120, size 32
  - flag_shard_a: address 0x2140, size 32
- The .rodata dump already shows those bytes but we should extract them cleanly (exact 32-byte blobs) for further processing (e.g., try XOR with runtime key, view as little-endian words, feed into any VM decrypt routines).
- I'll run a focused hexdump for each shard at the exact offsets to get precise binary content.

approach

Run a single bash command to hexdump the two shards from the challenge binary at the reported offsets (0x2120 and 0x2140), 32 bytes each, with canonical hex+ASCII for easy analysis.

command
$ echo '--- hexdump flag_shard_b (32 bytes at 0x2120) ---'
hexdump -C -s $((0x2120)) -n 32 challenge
echo '--- hexdump flag_shard_a (32 bytes at 0x2140) ---'
hexdump -C -s $((0x2140)) -n 32 challenge
output
--- hexdump flag_shard_b (32 bytes at 0x2120) ---
00002120  47 b9 1a 69 6b ae 3d 3d  a4 59 f5 fc 80 de 94 6f  |G..ik.==.Y.....o|
00002130  ac c6 51 b1 16 e2 dc 98  6c 61 9a 6f 86 b7 af 66  |..Q.....la.o...f|
00002140
--- hexdump flag_shard_a (32 bytes at 0x2140) ---
00002140  20 c7 6e 11 3f ed 6c 51  d2 2a 94 df f9 aa e6 1c  | .n.?.lQ.*......|
00002150  e4 a7 2b f9 31 97 ad fa  08 15 b9 0c f8 90 d6 0c  |..+.1...........|
00002160
15.6s total · llm 14.6s · shell 0.7s
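
One detail worth spelling out from this step: passing a symbol's address straight to `hexdump -s` only works because, as the analysis notes, .rodata has the same virtual address and file offset (0x2000) in this binary. A hedged Python sketch of the general translation follows; `vaddr_to_file_offset` and `read_blob` are hypothetical helpers, and the demo uses a zero-padded stand-in file rather than the real challenge binary.

```python
import os
import tempfile


def vaddr_to_file_offset(vaddr: int, section_addr: int, section_offset: int) -> int:
    """Translate a virtual address inside a section to an offset in the file."""
    return vaddr - section_addr + section_offset


def read_blob(path: str, offset: int, size: int) -> bytes:
    """Read `size` bytes starting at `offset`, like `hexdump -s offset -n size`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)


if __name__ == "__main__":
    # Stand-in file: zero padding, then the first 16 bytes of flag_shard_b
    # as shown in the hexdump above, placed at offset 0x2120.
    shard_b_head = bytes.fromhex("47b91a696bae3d3da459f5fc80de946f")
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\x00" * 0x2120 + shard_b_head)
    try:
        offset = vaddr_to_file_offset(0x2120, section_addr=0x2000, section_offset=0x2000)
        print(hex(offset), read_blob(tmp.name, offset, 16).hex())
    finally:
        os.unlink(tmp.name)
```

When a section's address and file offset differ, skipping the translation silently reads the wrong bytes, which is an easy way for an agent to burn several steps.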
step 6 / 8 87s elapsed
analysis

- flag_shard_b at file offset 0x2120 and flag_shard_a at 0x2140 (each 32 bytes) have been extracted.
- The binary contains generate_vm_code (0x14d0), execute_vm (0x1660), and decrypt_and_print_flag (0x1800). Disassembling these functions should reveal how the runtime key is derived, how VM code is produced and executed, and how the shards are decrypted/combined into the final flag.
- A focused objdump disassembly of those functions (intel syntax) will let us read the logic and plan the necessary emulation/decryption steps.

approach

Run an objdump disassembly in intel syntax for the three functions (generate_vm_code, execute_vm, decrypt_and_print_flag) and capture the first few hundred lines so we can inspect the code paths and constants. This will guide extracting or emulating the VM and applying the correct transformation to the two 32-byte shards to recover the flag.

command
$ objdump -d -M intel --disassemble=generate_vm_code,execute_vm,decrypt_and_print_flag challenge | sed -n '1,400p'
output
challenge:     file format elf64-x86-64


Disassembly of section .init:

Disassembly of section .plt:

Disassembly of section .plt.got:

Disassembly of section .plt.sec:

Disassembly of section .text:

Disassembly of section .fini:
13.6s total · llm 12.8s · shell 0.7s
step 7 / 8 110s elapsed
analysis

- .rodata contains the printed messages and the two 32-byte blobs at the reported addresses.
- readelf showed function addresses and sizes:
  - generate_vm_code at 0x14d0 (size 393)
  - execute_vm at 0x1660 (size 411)
  - decrypt_and_print_flag at 0x1800 (size 94)
- Disassembling the code at those ranges will reveal how the runtime key is produced and how the shards are processed (e.g., XOR, rotation, VM execution). Extracting that logic will let us emulate/decrypt offline and recover skbdg{...}.

approach

Run objdump to disassemble the three function address ranges (using start/stop addresses computed from readelf info) with Intel syntax, and show a manageable amount of output for inspection. This will allow reading the VM generation, VM execution, and the decryption routine to plan the decryption/emulation.

command
$ { objdump -d -M intel --start-address=0x14d0 --stop-address=0x165f challenge; echo '--- execute_vm ---'; objdump -
output
challenge:     file format elf64-x86-64


Disassembly of section .text:

00000000000014d0 <generate_vm_code>:
    14d0:       f3 0f 1e fa             endbr64
    14d4:       48 81 ec 28 01 00 00    sub    rsp,0x128
    14db:       49 89 fa                mov    r10,rdi
    14de:       44 8b 0d 1f 0b 00 00    mov    r9d,DWORD PTR [rip+0xb1f]        # 2004 <_IO_stdin_used+0x4>
    14e5:       ba 01 00 00 00          mov    edx,0x1
    14ea:       64 48 8b 04 25 28 00    mov    rax,QWORD PTR fs:0x28
    14f1:       00 00
    14f3:       48 89 84 24 18 01 00    mov    QWORD PTR [rsp+0x118],rax
    14fa:       00
    14fb:       48 b8 76 6f 5c 6c 31    movabs rax,0x3b6c65316c5c6f76
    1502:       65 6c 3b
    1505:       4c 8d 84 24 08 01 00    lea    r8,[rsp+0x108]
    150c:       00
    150d:       48 89 e6                mov    rsi,rsp
    1510:       48 89 84 24 09 01 00    mov    QWORD PTR [rsp+0x109],rax
    1517:       00
    1518:       48 b8 3b 7b 55 39 3c    movabs rax,0x2e3a3f3c39557b3b
    151f:       3f 3a 2e
    1522:       48 89 84 24 10 01 00    mov    QWORD PTR [rsp+0x110],rax
    1529:       00
    152a:       48 89 f8                mov    rax,rdi
    152d:       bf 10 00 00 00          mov    edi,0x10
    1532:       44 29 d7                sub    edi,r10d
    1535:       0f 1f 00                nop    DWORD PTR [rax]
    1538:       41 0f b6 0c 10          movzx  ecx,BYTE PTR [r8+rdx*1]
    153d:       88 50 08                mov    BYTE PTR [rax+0x8],dl
    1540:       44 89 08                mov    DWORD PTR [rax],r9d
    1543:       88 48 05                mov    BYTE PTR [rax+0x5],cl
    1546:       8d 0c 07                lea    ecx,[rdi+rax*1]
    1549:       48 83 c0 12             add    rax,0x12
    154d:       89 4c 96 fc             mov    DWORD PTR [rsi+rdx*4-0x4],ecx
    1551:       48 83 c2 01             add    rdx,0x1
    1555:       c6 40 f2 01             mov    BYTE PTR [rax-0xe],0x1
    1559:       c6 40 f4 47             mov    BYTE PTR [rax-0xc],0x47
    155d:       c6 40 f5 02             mov    BYTE PTR [rax-0xb],0x2
    1561:       c6 40 f7 b3             mov    BYTE PTR [rax-0x9],0xb3
    1565:       c6 40 f8 01
... (output truncated)
22.5s total · llm 21.6s · shell 0.7s
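An aside on steps 6 and 7: step 6 printed only section headers, most likely because objdump's `--disassemble=` option takes a single symbol name, so the comma-separated list matched nothing. Step 7 recovers by disassembling address ranges instead. Below is a minimal Python sketch of that fallback, using the start addresses and sizes the agent read from readelf. The `FUNCTIONS` table and the `objdump_args` helper are illustrative names of mine, not part of flaggy, and the sketch only prints the commands so it runs without the challenge binary.

```python
# Start address and size in bytes, as reported by readelf in the transcript.
FUNCTIONS = {
    "generate_vm_code": (0x14D0, 393),
    "execute_vm": (0x1660, 411),
    "decrypt_and_print_flag": (0x1800, 94),
}


def objdump_args(binary, start, size):
    """Build an objdump argument list covering one function's address range."""
    return [
        "objdump", "-d", "-M", "intel",
        f"--start-address={start:#x}",
        f"--stop-address={start + size:#x}",  # first address past the function
        binary,
    ]


if __name__ == "__main__":
    for name, (start, size) in FUNCTIONS.items():
        # Print the command instead of running it (no binary needed here).
        print(f"# {name}")
        print(" ".join(objdump_args("challenge", start, size)))
```

Computing the stop address from the symbol size keeps each range tied to one function, whereas the agent's own command used a hand-picked stop address (0x165f for the first function).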
step 8 / 8 129s elapsed
analysis

- flag_shard_b (32 bytes): 47 b9 1a 69 6b ae 3d 3d a4 59 f5 fc 80 de 94 6f ac c6 51 b1 16 e2 dc 98 6c 61 9a 6f 86 b7 af 66
- flag_shard_a (32 bytes): 20 c7 6e 11 3f ed 6c 51 d2 2a 94 df f9 aa e6 1c e4 a7 2b f9 31 97 ad fa 08 15 b9 0c f8 90 d6 0c
- Operation per byte: out = (shard_a[i] XOR shard_b[i]) XOR 0x17
- I'll compute this locally in the container with a small Python snippet to print the resulting ASCII.

approach

Run a bash command that invokes Python to load the two hex blobs, compute the bytewise XOR as described, and print the resulting string (the flag).

command
$ python3 - <<'PY'
a = bytes.fromhex("47b91a696bae3d3da459f5fc80de946facc651b116e2dc986c619a6f86b7af66")
b = bytes.fromhex("20c76e113fed6c51d22a94dff9aae61ce4a72bf93197adfa0815b90cf890d60c")
res = bytes((x ^ y ^ 0x17) for x, y in zip(a, b))
print(res.decode('utf-8', errors='replace'))
PY
output
picoCTF{adv4nced_vm_0bfusc4ti0n}
19.6s total · llm 18.6s · shell 0.7s

TUI

Any agent needs a cool TUI. I wanted to see what the agent was doing without being able to interact with it: no sending messages, no influencing it. I also wanted a panel to manage the agents, so I could fire off an agent on a challenge and get an overview of how they were all going. The goal was to join a CTF, have flaggy discover the challenges through the CTF's API or website, start agents in parallel to solve them, and let me monitor the situation. I used DSPy to extract the challenge info and download the challenge files. I quickly vibe coded it, and it worked surprisingly well in my limited testing, but it is certainly not very robust.
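The manager panel boils down to fanning agents out over the discovered challenges and collecting one status per challenge. Here is a rough sketch of that shape using a thread pool. It is not flaggy's actual code: `solve_challenge` is a hypothetical stand-in for launching one agent, and `Result` and `run_agents` are names invented for the example.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Result:
    challenge: str
    flag: Optional[str]  # None when the agent gave up without a flag
    error: Optional[str] = None


def run_agents(
    challenges: Iterable[str],
    solve_challenge: Callable[[str], Optional[str]],  # hypothetical agent entry point
    max_parallel: int = 4,
) -> list[Result]:
    """Run one agent per challenge in parallel and collect an overview."""
    results = []
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {pool.submit(solve_challenge, name): name for name in challenges}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results.append(Result(name, future.result()))
            except Exception as exc:  # one crashed agent should not stop the rest
                results.append(Result(name, None, error=str(exc)))
    return sorted(results, key=lambda r: r.challenge)


if __name__ == "__main__":
    # Fake agent for demonstration only: solves one challenge, gives up on the other.
    def fake_agent(name: str) -> Optional[str]:
        return "flag{demo}" if name == "xor-password" else None

    for r in run_agents(["xor-password", "vm-obfuscation"], fake_agent):
        status = f"solved {r.flag}" if r.flag else "unsolved"
        print(f"{r.challenge:16} {status}")
```

Threads are enough here on the assumption that each agent spends its time waiting on the model API and on shell commands in its container, so the Python side is I/O-bound.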

what now?

It has been some time since I worked on flaggy, and I would now start fresh and build another CTF-focused agent rather than extend this one. AI and agent tooling have progressed a lot, and the models now have a much more intuitive understanding of operating a terminal and tools, so they don’t need as much harness scaffolding. I want to write the least scaffolding needed for a well-performing agent, keep it hackable, and lean on the intelligence of the models.
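To make "least scaffolding" concrete, the core loop can be pictured as: the model proposes a shell command, the harness runs it, and the loop ends on a flag match or at the step limit. This is a sketch under assumptions, not flaggy's code. `next_command` is a hypothetical callable wrapping whatever model API is used, the flag regex is a guess at a common format, and commands run on the local shell here rather than inside a Docker container.

```python
from __future__ import annotations

import re
import subprocess
from typing import Callable, Optional

# Assumed flag format (prefix{...}); each CTF defines its own.
FLAG_RE = re.compile(r"[A-Za-z0-9_]+\{[^}\n]+\}")


def solve(
    next_command: Callable[[list[tuple[str, str]]], str],  # hypothetical model wrapper
    max_steps: int = 8,
    timeout_s: int = 30,
) -> Optional[str]:
    """Ask for a command, run it, and stop on a flag or after max_steps."""
    history: list[tuple[str, str]] = []  # (command, output) pairs shown to the model
    for _ in range(max_steps):
        command = next_command(history)
        try:
            proc = subprocess.run(
                command, shell=True, capture_output=True, text=True, timeout=timeout_s
            )
            output = proc.stdout + proc.stderr
        except subprocess.TimeoutExpired:
            output = f"(timed out after {timeout_s}s)"
        history.append((command, output))
        match = FLAG_RE.search(output)
        if match:
            return match.group(0)
    return None
```

A regex this loose will also match things like C struct initialisers in tool output, so a real harness would probably check against the CTF's known flag prefix or let the model submit the flag explicitly.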

Benchmarks like Terminal-Bench show that the general coding agent harnesses work pretty much as well as anything else. The main benefits of going beyond a coding harness are better UI and UX, and better connection or access to tools. A little hint of the direction I want to go in: pi with a tmux extension, Harbour for managing and running the CTF challenges, skill files to embed more info about available tools, and some sort of CTF extension to help manage and start the CTF environments.

these charts are horribly out of date now

[Chart: Terminal-Bench 1.0, May–Nov 2025, 62 submissions. Highlighted score: 64.5% (Apex2 + claude-sonnet-4.5).]
[Chart: Terminal-Bench 2.0, Oct 2025–Mar 2026, 114 submissions. Highlighted score: 78.4% (Forge Code + Gemini 3.1 Pro).]

TB1 used ~100 beta tasks (terminal-bench-core v0.1.1). TB2 is a curated set of 89 harder tasks; scores aren't directly comparable.

The cyber benchmarks have moved fast since I started flaggy3. On OpenAI’s internal CTF benchmark, GPT-5.1-Codex-Max jumped to 76% by November 2025, and GPT-5.3-Codex hit 77.6%. On the open Cybench benchmark, Anthropic’s Sonnet 4.5 jumped to 32% pass@1 and 76.5% with 10 attempts by October 2025, and Opus 4.6 reached 93% pass@1 on a subset of 37 problems. Both labs say they are close to saturating their current cyber evals. Sama warned that OpenAI’s next model would reach “High” on their cyber preparedness framework, and indeed GPT-5.3-Codex was released as the first model to do so.

Footnotes

  1. Capture the Flag, challenges where the aim is to find the hidden ‘flag’ in a program

  2. had to give it a cute name - but I realise now there are many projects on GitHub with this name

  3. since I have been delayed in getting this post out, I have had to update these charts multiple times (and they are still out of date)!