flaggy: the little automated CTF solver
Lessons in building an AI CTF agent
Can an LLM solve CTFs?[1] Yes, and they are rapidly getting better! Around September last year (2025) I decided to build my own CTF-solving agent harness: with the releases of GPT-5 and the Claude Opus 4 series, there was a noticeable increase in the models’ general abilities and in their capability to solve CTF-like challenges.
A lot has changed since then, and this post has been sitting around for a while without me publishing it, so I am getting it out there now before it becomes ancient history! I wouldn’t use this version of flaggy now, as Codex etc. have surpassed it, but I have some interesting ideas for a flaggy v2.
Agent harnesses like Claude Code had been around for a couple of months by this time but I wanted a harness that was completely autonomous and specific to solving CTFs. This is what I ended up with:

When I started building flaggy[2], GPT-5 had just reached 27% on CTF challenges on its model report card, and on the open Cybench benchmark (40 professional CTF challenges), Claude 3.5 Sonnet was at ~20% pass@1. I was starting to feel these models really understood my codebases better, and I could trust their judgement more. The agency and tool calling were perhaps the biggest leap in this generation. Now, many months and many model releases later, it is getting hard to even remember what working with these models was like! Opus 4.6 saturated the same Cybench benchmark, and recently Claude Mythos has been making news for finding many CVEs.
Coding CLIs generally always keep a human in the loop, but I wanted my agent to be completely autonomous with an interface specific to solving CTF challenges. The design I ended up on is relatively simple. The agent spins up a docker container that has the CTF challenge mounted in it, then interacts with it via bash commands or a read/write file tool. It runs in a loop until it finds the flag or reaches its max step count.
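A minimal sketch of that loop, not flaggy's actual code: the policy and the container are stubbed out with plain functions, and the flag regex, the `State` fields, and every function name here are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Assumed flag format; a real harness would take this from the challenge metadata.
FLAG_RE = re.compile(r"picoCTF\{[^}]*\}")


@dataclass
class State:
    info: str                                   # challenge description
    history: List[str] = field(default_factory=list)
    last_output: str = ""


def run_agent(
    info: str,
    choose_action: Callable[[State], str],      # the LLM call in a real harness
    execute: Callable[[str], str],              # runs the command in the container
    max_steps: int = 30,
) -> Tuple[Optional[str], int]:
    """Loop until a flag shows up in the output or the step budget runs out."""
    state = State(info=info)
    for step in range(1, max_steps + 1):
        command = choose_action(state)
        output = execute(command)
        state.history.append(f"$ {command}\n{output}")
        state.last_output = output
        match = FLAG_RE.search(output)
        if match:
            return match.group(0), step
    return None, max_steps


# Stubs standing in for the model and the container.
def scripted_policy(state: State) -> str:
    return "file challenge" if not state.history else "strings challenge"


def fake_container(command: str) -> str:
    outputs = {
        "file challenge": "challenge: ELF 64-bit LSB pie executable",
        "strings challenge": "picoCTF{example_flag}",
    }
    return outputs.get(command, "")


flag, steps = run_agent("reverse engineering challenge", scripted_policy, fake_container)
print(flag, steps)
```

Keeping the model and the executor behind two callables makes it easy to swap models or replay a recorded run without touching the loop.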
┌───────────────────────────────────────────────────────┐
│                   Flaggy Agent Loop                   │
│                                                       │
│    ┌──────────────────────────────────────────┐       │
│    │                                          │       │
│    ▼                                          │       │
│  state = {history, info, last output}         │       │
│    │                                          │       │
│    ▼                                          │       │
│  ┌──────────────────────────┐                 │       │
│  │        CTF Agent         │                 │       │
│  │                          │                 │       │
│  │   LLM (via OpenRouter)   │                 │       │
│  │            │             │                 │       │
│  │            ▼             │                 │       │
│  │   action: bash / read /  │                 │       │
│  │           write file     │                 │       │
│  └────────────┬─────────────┘                 │       │
│               │                               │       │
│               ▼                               │       │
│  ┌─────────────────────────────┐              │       │
│  │      Exegol Container       │              │       │
│  │                             │◄── Challenge files   │
│  │  100+ CTF tools: gdb, angr, │    mounted           │
│  │  radare2, pwntools, ghidra  │                      │
│  └──────────────┬──────────────┘              │       │
│                 │                             │       │
│            flag found? ──── no ───────────────┘       │
│                 │                                     │
└─────────────────┼─────────────────────────────────────┘
                  │ yes
               SOLVED!
the environment
Flaggy needs a safe environment to run each challenge in, with common CTF tools already installed so precious tokens are not wasted on environment setup each time. Exegol is a Docker image for pen testers that contains a decent number of pre-installed tools (tools list). Ideal! Some other AI hacking tools and frameworks use MCP servers with functions like analyze_binary, but my intuition is that the models are now capable enough that they don’t need MCP servers like this. I don’t think MCP is necessarily bad, but if I can use a computer without it, an intelligent LLM should be able to as well. Indeed, since then people building on the frontier of agent systems have been moving away from MCP.
I gave the models just a bash prompt to begin with. This was more trouble than I thought it would be… GPT-5 liked to write long bash commands, but often it would incorrectly include quotes which caused the command to be terminated early when being passed through to the container. In the end I used the Docker SDK to pass a base64 encoded command to the bash shell - like echo "<base64>" | base64 -d | bash.
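A small sketch of that base64 trick, assuming the harness builds an argv list and hands it to something like the Docker SDK's `exec_run`. The helper name `wrap_command` is made up for illustration, and the round trip is checked in Python rather than in a real container.

```python
import base64
from typing import List


def wrap_command(command: str) -> List[str]:
    """Encode a shell command so its quotes never reach the outer shell.

    The base64 alphabet contains no quote characters, so the wrapped
    string is safe to embed between double quotes.
    """
    encoded = base64.b64encode(command.encode("utf-8")).decode("ascii")
    return ["bash", "-c", f'echo "{encoded}" | base64 -d | bash']


# A command full of nested quotes, the kind that used to get cut short.
tricky = r'''printf '%s\n' "it's \"quoted\""'''
argv = wrap_command(tricky)
print(argv[2])

# In a real harness this argv would go to the container, for example
# container.exec_run(argv) with the Docker SDK (assumed usage, not run here).
payload = argv[2].split('"')[1]
print(base64.b64decode(payload).decode("utf-8") == tricky)
```

The model can then write whatever quoting it likes; only the inner bash ever parses it.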
I tried to encourage GPT-5 to use a write tool instead of appending code into Python files using >> which was one source of quote errors, but GPT-5 ignored it. It seems the model will find its own ways of doing things regardless of what tools you give it! Better to learn how the models like to work, then adapt your harness so they have the least friction possible when operating.
models
I mostly tested with GPT-5-Mini and Grok 4 Fast. I found GPT-5-Mini performed really well and was more than smart enough to use while I iterated on the agent harness. The GPT Mini series of models has an insane performance-to-cost ratio; it is an underrated model imo. Grok 4 Fast is also a decent model, good enough for testing, and was free at the time via OpenRouter.
the ReAct trap
I used DSPy as it provides modules that abstract a prompting technique. First I tried the ReAct module. ReAct (Reason + Act) is the Thought-Action-Observation cycle and was introduced in 2022 as one of the earlier tool calling approaches. I expected ReAct to be the obvious choice, but I quickly found the models really struggled to stay on track throughout multiple ReAct turns and it interfered too much with their ability to reason. Flaggy would get into loops where it kept calling the same tools and never progressing into deeper analysis. Switching ReAct out for Chain-of-Thought (CoT) worked much better. In DSPy, you set the signature of a module with a simple string.
"A signature is a declarative specification of input/output behavior of a DSPy module"
I started with something like a simple challenge -> flag signature but sometimes the agent got stuck in only trying one idea and not naturally backtracking to try a different approach, so I expanded the signature to include generating a hypothesis, observations, and next action to try. The scientific method!
The signature:
history_text, info, last_output -> analysis, approach, tool_name, command, filename, content, max_bytes, timeout_seconds
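To make the output side of that signature concrete, here is a rough sketch of how a harness might turn those fields into a tool call. It is plain Python rather than DSPy, the tool names (`bash`, `read_file`, `write_file`) and defaults are assumptions, and there is no path sandboxing since the real thing runs inside the container.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Action:
    # Field names mirror the output side of the signature.
    analysis: str
    approach: str
    tool_name: str
    command: str = ""
    filename: str = ""
    content: str = ""
    max_bytes: int = 20000       # assumed default
    timeout_seconds: int = 60    # assumed default


def dispatch(action: Action, workdir: Path, run_bash: Callable[[str, int], str]) -> str:
    """Route one model step to the matching tool and return its output."""
    if action.tool_name == "bash":
        return run_bash(action.command, action.timeout_seconds)
    if action.tool_name == "read_file":
        data = (workdir / action.filename).read_bytes()[: action.max_bytes]
        return data.decode("utf-8", errors="replace")
    if action.tool_name == "write_file":
        (workdir / action.filename).write_text(action.content, encoding="utf-8")
        return f"wrote {len(action.content)} characters to {action.filename}"
    return f"unknown tool: {action.tool_name}"


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    fake_bash = lambda command, timeout: f"(would run: {command})"
    write = Action("need a solver script", "write it to disk", "write_file",
                   filename="solve.py", content="print('hi')\n")
    read = Action("check the script", "read it back", "read_file",
                  filename="solve.py", max_bytes=5)
    print(dispatch(write, workdir, fake_bash))
    print(dispatch(read, workdir, fake_bash))
```

The `analysis` and `approach` fields are not used by the dispatcher at all; they exist to make the model state its hypothesis before it acts.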
Here are two runs of flaggy solving reverse engineering challenges. Each step shows the agent's analysis, its planned action, the command, and the output:
Planned quick checks:
- file to get type
- readelf header for architecture/entry info
- checksec to see PIE/RELRO/Canary/NX
- ldd to list dynamic libraries (if any)
- strings to look for obvious flag hints, function names, or file paths
- display metadata.json for challenge-specific hints
I will run these succinctly in one command so we can decide next steps.
Run a set of quick inspection commands on the 'challenge' binary and show metadata.json.
$ file challenge; echo '---readelf---'; readelf -h challenge; echo '---checksec---'; checksec --file=challenge || tr
challenge: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=826f46fe87bd8f8e98f797427484f8c4410f9759, for GNU/Linux 3.2.0, not stripped
---readelf---
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: DYN (Position-Independent Executable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x1140
Start of program headers: 64 (bytes into file)
Start of section headers: 14520 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 13
Size of section headers: 64 (bytes)
Number of section headers: 31
Section header string table index: 30
---checksec---
[*] '/challenge/challenge'
Arch: amd64-64-little
RELRO: Full RELRO
Stack: Canary found
NX: NX enabled
PIE: PIE enabled
SHSTK: Enabled
IBT: Enabled
Stripped: No
---ldd---
linux-vdso.so.1 (0x00007ffcbd3a8000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000072eb7d5c8000)
/lib64/ld-linux-x86-64.so.2 (0x000072eb7d7c3000)
---strings---
/lib64/ld-linux-x86-64.so.2
__stack_chk_fail
__libc_start_main
__cxa_finalize
libc.so.6
GLIBC_2.4
GLIBC_2.34
GLIBC_2.2.5
_ITM_deregisterTMCloneTable
__gmon_start__
_ITM_registerTMCloneTable
unlock_sH
_secret!H
RKAMavdYH
picoCTF{H
fake_flaH
_flag_doH
Fake: %s
=== Advanced Reverse Engineer
... (output truncated)
From the initial inspection: binary is 64-bit PIE, not stripped, with symbols like verify_input, reveal_flag, fake_flag, debug_check. Strings show fragments like "picoCTF{H" and "RKAMavdYH", suggesting an XOR/obfuscated flag in the data section and a multi-stage reveal. Next step is to get symbol addresses and sizes for those functions so we can disassemble them selectively.
Run readelf -sW to list the symbol table and filter for the target symbols (verify_input, reveal_flag, fake_flag, debug_check, unlock) to get their addresses and sizes. This will guide targeted disassembly in the next step.
$ readelf -sW challenge | egrep 'verify_input|reveal_flag|fake_flag|debug_check|unlock' || true
32: 0000000000001229 299 FUNC GLOBAL DEFAULT 16 verify_input
36: 000000000000145a 133 FUNC GLOBAL DEFAULT 16 fake_flag
41: 0000000000001354 200 FUNC GLOBAL DEFAULT 16 reveal_flag
44: 000000000000141c 62 FUNC GLOBAL DEFAULT 16 debug_check
Symbols from previous run:
- verify_input at 0x1229, size 299 (0x12B) -> covers 0x1229..0x1354
- reveal_flag at 0x1354, size 200 (0xC8) -> covers 0x1354..0x142c
I'll disassemble those exact ranges with objdump (Intel syntax) to capture the code paths for password verification and flag decryption. This will reveal operations (XOR/rot/add) and data constants needed to reconstruct the flag.
Run objdump -d -M intel on the two address ranges: 0x1229-0x1354 for verify_input and 0x1354-0x142c for reveal_flag, printing a separator between them.
$ objdump -d -M intel --start-address=0x1229 --stop-address=0x1354 challenge | sed -n '1,200p'; echo '--- reveal_fla
challenge: file format elf64-x86-64
Disassembly of section .text:
0000000000001229 <verify_input>:
1229: f3 0f 1e fa endbr64
122d: 55 push rbp
122e: 48 89 e5 mov rbp,rsp
1231: 48 83 ec 40 sub rsp,0x40
1235: 48 89 7d c8 mov QWORD PTR [rbp-0x38],rdi
1239: 64 48 8b 04 25 28 00 mov rax,QWORD PTR fs:0x28
1240: 00 00
1242: 48 89 45 f8 mov QWORD PTR [rbp-0x8],rax
1246: 31 c0 xor eax,eax
1248: 48 8b 45 c8 mov rax,QWORD PTR [rbp-0x38]
124c: 48 89 c7 mov rdi,rax
124f: e8 8c fe ff ff call 10e0 <strlen@plt>
1254: 89 45 e4 mov DWORD PTR [rbp-0x1c],eax
1257: 83 7d e4 0e cmp DWORD PTR [rbp-0x1c],0xe
125b: 74 0a je 1267 <verify_input+0x3e>
125d: b8 00 00 00 00 mov eax,0x0
1262: e9 d7 00 00 00 jmp 133e <verify_input+0x115>
1267: 48 b8 75 6e 6c 6f 63 movabs rax,0x735f6b636f6c6e75
126e: 6b 5f 73
1271: 48 89 45 ea mov QWORD PTR [rbp-0x16],rax
1275: 48 b8 5f 73 65 63 72 movabs rax,0x217465726365735f
127c: 65 74 21
127f: 48 89 45 f0 mov QWORD PTR [rbp-0x10],rax
1283: c7 45 e0 00 00 00 00 mov DWORD PTR [rbp-0x20],0x0
128a: e9 a0 00 00 00 jmp 132f <verify_input+0x106>
128f: 8b 55 e0 mov edx,DWORD PTR [rbp-0x20]
1292: 89 d0 mov eax,edx
1294: c1 f8 1f sar eax,0x1f
1297: c1 e8 1e shr eax,0x1e
129a: 01 c2 add edx,eax
129c: 83 e2 03 and edx,0x3
129f: 29 c2 sub edx,eax
12a1: 89 d0 mov eax,edx
12a3: 83 f8 03 cmp eax,0x3
12a6: 74 59 je 1301 <verify_input+0
... (output truncated)
Key observations:
- reveal_flag initializes a byte key: 0x33 xor 0x11 => 0x22.
- It writes these QWORD immediates in order:
  - Q1 -> [rbp-0x30] (offset 0)
  - Q2 -> [rbp-0x28] (offset 8)
  - Q3 -> [rbp-0x22] (offset 14)
  - Q4 -> [rbp-0x1a] (offset 22)
- It then loops index = 0 .. 0x1c (0..28), reading byte at rbp-0x30 + index and XORing with 0x22.
- So simulate the writes (with little-endian encoding of each 64-bit immediate) into a bytearray and then XOR each of the first 29 bytes with 0x22 to reveal the flag text.
Run a short Python script to: - Define the four 64-bit immediates exactly as seen in disassembly. - Create a 29-byte buffer, perform the QWORD writes at the precise offsets (0, 8, 14, 22) in little-endian. - XOR each of the 29 bytes with 0x22 and print the resulting decoded string. Then capture the program output (the revealed flag).
$ python3 - << 'PY'
Q1 = 0x596476614d414b52
Q2 = 0x164e7d13564e574f
Q3 = 0x11507d50115b164e
Q4 = 0x5f454c1351501154
# total bytes read by reveal_flag is 0x1d = 29 bytes
buf = bytearray(29)
def write_qword_le(buf, offset, q):
    b = q.to_bytes(8, 'little')
    for i in range(8):
        if offset + i < len(buf):
            buf[offset + i] = b[i]
write_qword_le(buf, 0x00, Q1)
write_qword_le(buf, 0x08, Q2)
write_qword_le(buf, 0x0e, Q3) # rbp-0x22 is base+14
write_qword_le(buf, 0x16, Q4) # rbp-0x1a is base+22
key = 0x22
decoded = bytes((b ^ key) for b in buf)
print(decoded.decode('utf-8'))
PY
picoCTF{mult1_l4y3r_r3v3rs1ng
- reveal_flag writes four 64-bit immediates into a local buffer and XORs the first 0x1d (29) bytes with key 0x22, then prints them.
- My previous run printed the decoded bytes but did not show an explicit repr/length; the output looked like picoCTF{mult1_l4y3r_r3v3rs1ng without a closing brace.
- To be certain and unambiguous, print the bytes as raw, as repr (so any non-printable/trailing characters or missing braces are obvious), and the length.
Run a short Python snippet (in bash) that: - constructs the buffer exactly as in reveal_flag (writing the same 64-bit immediates at the same offsets), - XORs with 0x22 for 29 bytes, - prints the decoded bytes, the repr() of the decoded bytes, and the length. This single execution will confirm the exact flag text and length.
$ python3 - << 'PY'
Q1 = 0x596476614d414b52
Q2 = 0x164e7d13564e574f
Q3 = 0x11507d50115b164e
Q4 = 0x5f454c1351501154
buf = bytearray(29)
def write_qword_le(buf, offset, q):
    b = q.to_bytes(8, 'little')
    for i in range(8):
        if offset + i < len(buf):
            buf[offset + i] = b[i]
write_qword_le(buf, 0x00, Q1)
write_qword_le(buf, 0x08, Q2)
write_qword_le(buf, 0x0e, Q3)
write_qword_le(buf, 0x16, Q4)
key = 0x22
decoded = bytes((b ^ key) for b in buf)
# Print raw decoded bytes, repr, and length for exact confirmation
print(decoded.decode('utf-8', errors='replace'))
print(repr(decoded))
print(len(decoded))
PY
picoCTF{mult1_l4y3r_r3v3rs1ng
b'picoCTF{mult1_l4y3r_r3v3rs1ng'
29
- reveal_flag immediates (from disassembly): Q1 = 0x596476614d414b52 Q2 = 0x164e7d13564e574f Q3 = 0x11507d50115b164e Q4 = 0x5f454c1351501154
- key = 0x22 (0x33 xor 0x11)
- It loops index = 0 .. 0x1c (29 bytes) printing (buf[index] ^ key)
- Prior decode of 29 bytes = b'picoCTF{mult1_l4y3r_r3v3rs1ng'
- Hypothesis: trailing '}' may be at index 29 or later; produce a longer decoded output to check.
Run one bash action which executes a Python snippet to:
- create a larger buffer (40 bytes),
- perform the same QWORD writes at offsets 0,8,14,22 in little-endian,
- XOR the buffer bytes with 0x22,
- print the decoded bytes (as string and repr), the full byte values in hex, and search for a '}' to confirm the full flag.
$ python3 - << 'PY'
Q1 = 0x596476614d414b52
Q2 = 0x164e7d13564e574f
Q3 = 0x11507d50115b164e
Q4 = 0x5f454c1351501154
buflen = 40
buf = bytearray(buflen)
def write_qword_le(buf, offset, q):
    b = q.to_bytes(8, 'little')
    for i in range(8):
        if offset + i < len(buf):
            buf[offset + i] = b[i]
write_qword_le(buf, 0x00, Q1)
write_qword_le(buf, 0x08, Q2)
write_qword_le(buf, 0x0e, Q3)
write_qword_le(buf, 0x16, Q4)
key = 0x22
decoded = bytes((b ^ key) for b in buf)
print("decoded (as str, non-printables replaced):")
print(decoded.decode('utf-8', errors='replace'))
print()
print("decoded repr:")
print(repr(decoded))
print()
print("length:", len(decoded))
print()
print("hex bytes (index: byte):")
for i,b in enumerate(decoded):
    print(f"{i:02d}: {b:02x} {chr(b) if 32 <= b < 127 else '.'}")
print()
# locate closing brace
try:
    idx = decoded.index(ord('}'))
    print("Found '}' at index", idx)
    print("Flag candidate:", decoded[:idx+1].decode('utf-8', errors='replace'))
except ValueError:
    print("No '}' found in decoded buffer")
PY
decoded (as str, non-printables replaced):
picoCTF{mult1_l4y3r_r3v3rs1ng}""""""""""
decoded repr:
b'picoCTF{mult1_l4y3r_r3v3rs1ng}""""""""""'
length: 40
hex bytes (index: byte):
00: 70 p
01: 69 i
02: 63 c
03: 6f o
04: 43 C
05: 54 T
06: 46 F
07: 7b {
08: 6d m
09: 75 u
10: 6c l
11: 74 t
12: 31 1
13: 5f _
14: 6c l
15: 34 4
16: 79 y
17: 33 3
18: 72 r
19: 5f _
20: 72 r
21: 33 3
22: 76 v
23: 33 3
24: 72 r
25: 73 s
26: 31 1
27: 6e n
28: 67 g
29: 7d }
30: 22 "
31: 22 "
32: 22 "
33: 22 "
34: 22 "
35: 22 "
36: 22 "
37: 22 "
38: 22 "
39: 22 "
Found '}' at index 29
Flag candidate: picoCTF{mult1_l4y3r_r3v3rs1ng}

The directory contains a single executable named "challenge" and metadata.json. The first actions should be: identify file type/arch, extract ELF header info, check exported symbols (for main), list dynamic dependencies, and examine printable strings for hints (passwords, file paths, version markers). I will run a compact set of commands (file, readelf, strings, ldd) and limit large outputs.
Run a single bash command to: - file the binary - print the ELF header (readelf -h) - try to show symbol table entries related to 'main' - show the first chunk of long strings (strings -n 8 | head -n 60) - run ldd to show dynamic libs (if dynamically linked) This will give a quick overview to plan reverse-engineering or runtime tests.
$ file challenge; echo '--- readelf -h ---'; readelf -h challenge; echo '--- readelf -s (grep main) ---'; readelf -s
challenge: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=8f59b5e5d3bf9cfeda816b363e2fed038c9fa8f2, for GNU/Linux 3.2.0, not stripped
--- readelf -h ---
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: DYN (Position-Independent Executable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x1380
Start of program headers: 64 (bytes into file)
Start of section headers: 14752 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 13
Size of section headers: 64 (bytes)
Number of section headers: 31
Section header string table index: 30
--- readelf -s (grep main) ---
48: 00000000000011a0 472 FUNC GLOBAL DEFAULT 16 main
--- strings (>=8 chars) head ---
/lib64/ld-linux-x86-64.so.2
__stack_chk_fail
__printf_chk
__libc_start_main
__cxa_finalize
libc.so.6
GLIBC_2.3.4
GLIBC_2.4
GLIBC_2.34
GLIBC_2.2.5
_ITM_deregisterTMCloneTable
__gmon_start__
_ITM_registerTMCloneTable
vo\l1el;L
;{U9<?:.H
[]A\A]A^A_
[]A\A]A^A_
GCongratulations! Here's your flag:
Advanced Reverse Engineering Challenge
Access denied. The VM rejects your input.
Debug: Runtime key was 0x%08X
Debug: VM executed %d instructions
Memory allocation failed
Enter the secret phrase:
Input error
GCC: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
__abi_tag
challenge.c
flag_shard_a
flag_s
... (output truncated)
- Binary strings suggest the program expects a "secret phrase" and may run a VM (generate_vm_code) using a runtime key; it then decrypts and prints the flag.
- There are likely artifacts (flag_shard_a, flag_shard_b, metadata.json, maybe challenge.c) in the directory that could contain necessary data or instructions.
- Reading metadata.json first is low-cost and should reveal relevant info (format of shards, any required steps, or keys) to inform further static analysis or runtime testing.
Read the metadata.json file fully to extract any instructions, keys, or pointers to other artifacts (flag shards, algorithm, VM spec). Use read_file to fetch the entire file.
$ ⚙️ Executed:
└── read_file metadata.json 20000
{
  "name": "reverse_advanced",
  "description": "An advanced reverse engineering challenge.",
  "difficulty": "advanced",
  "category": "reverse_engineering",
  "flag_format": "picoCTF\\{.*\\}",
  "points": 300,
  "author": "flaggy",
  "exclude_files": ["*.c", "*.cpp", "*.h"]
}
- Binary likely uses a VM + runtime key to decrypt/assemble flag from shards.
- There are candidate files: flag_shard_a, flag_shard_b, and challenge.c (per strings).
- Reading metadata.json already done; it doesn't include the flag format matching picoCTF but earlier strings showed "GCongratulations! Here's your flag:" and "Enter the secret phrase:".
- To form a hypothesis about how to recover the flag, I need the exact contents/format of the shards and any source code in challenge.c.
Run a single bash action to: - list files with details, - run file on the shards and challenge.c, - show a small hexdump of the start of each shard, - show the first ~200 lines of challenge.c if present. This gives enough context to decide next steps (reconstruct shards, reverse VM, or compile/run with inputs).
$ ls -la
echo '--- file flag_shard_a flag_shard_b challenge.c ---'
file flag_shard_a flag_shard_b challenge.c || true
echo '--- hexdump flag_shard_a (first 64 bytes) ---'
if [ -f flag_shard_a ]; then hexdump -C -n 64 flag_shard_a; else echo "flag_shard_a not found"; fi
echo '--- hexdump flag_shard_b (first 64 bytes) ---'
if [ -f flag_shard_b ]; then hexdump -C -n 64 flag_shard_b; else echo "flag_shard_b not found"; fi
echo '--- head challenge.c (first 200 lines) ---'
if [ -f challenge.c ]; then sed -n '1,200p' challenge.c; else echo "challenge.c not found"; fi
total 32
drwxr-xr-x 2 1000 1001 4096 Mar 20 02:44 .
drwxr-xr-x 1 root root 4096 Mar 20 02:44 ..
-rwxr-xr-x 1 1000 1001 16736 Sep 2 2025 challenge
-rw-r--r-- 1 1000 1001 295 Sep 2 2025 metadata.json
--- file flag_shard_a flag_shard_b challenge.c ---
flag_shard_a: cannot open `flag_shard_a' (No such file or directory)
flag_shard_b: cannot open `flag_shard_b' (No such file or directory)
challenge.c: cannot open `challenge.c' (No such file or directory)
--- hexdump flag_shard_a (first 64 bytes) ---
flag_shard_a not found
--- hexdump flag_shard_b (first 64 bytes) ---
flag_shard_b not found
--- head challenge.c (first 200 lines) ---
challenge.c not found
- Previous outputs show the binary is a PIE ELF with a main symbol and many relevant function names.
- The directory lacks separate shard files, so shards may be embedded in the ELF (rodata or custom sections) or reconstructed at runtime.
- I need offsets/locations of the "flag_shard" strings and related symbols/sections to extract any embedded data. readelf -S, readelf -s (filtering for relevant symbols), strings with hex offsets, and a dump of .rodata should reveal embedded blobs or pointers to them.
Run a single bash command that: - shows section table (readelf -S) to spot custom sections, - lists relevant symbols (readelf -s filtered for flag/decrypt/generate_vm), - prints strings with hex offsets for flag-related phrases, - dumps the start of the .rodata section (objdump -s -j .rodata) for quick inspection. This will indicate whether flag shards are embedded and where to extract them or whether to proceed to reversing decrypt_and_print_flag / generate_vm_code.
$ file challenge; echo '--- readelf -S ---'; readelf -S challenge; echo '--- readelf -s (relevant symbols) ---'; rea
challenge: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=8f59b5e5d3bf9cfeda816b363e2fed038c9fa8f2, for GNU/Linux 3.2.0, not stripped
--- readelf -S ---
There are 31 section headers, starting at offset 0x39a0:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .interp PROGBITS 0000000000000318 00000318
000000000000001c 0000000000000000 A 0 0 1
[ 2] .note.gnu.pr[...] NOTE 0000000000000338 00000338
0000000000000030 0000000000000000 A 0 0 8
[ 3] .note.gnu.bu[...] NOTE 0000000000000368 00000368
0000000000000024 0000000000000000 A 0 0 4
[ 4] .note.ABI-tag NOTE 000000000000038c 0000038c
0000000000000020 0000000000000000 A 0 0 4
[ 5] .gnu.hash GNU_HASH 00000000000003b0 000003b0
0000000000000030 0000000000000000 A 6 0 8
[ 6] .dynsym DYNSYM 00000000000003e0 000003e0
00000000000001c8 0000000000000018 A 7 1 8
[ 7] .dynstr STRTAB 00000000000005a8 000005a8
0000000000000105 0000000000000000 A 0 0 1
[ 8] .gnu.version VERSYM 00000000000006ae 000006ae
0000000000000026 0000000000000002 A 6 0 2
[ 9] .gnu.version_r VERNEED 00000000000006d8 000006d8
0000000000000050 0000000000000000 A 7 1 8
[10] .rela.dyn RELA 0000000000000728 00000728
00000000000000f0 0000000000000018 A 6 0 8
... (output truncated)
- .rodata section starts at 0x2000 (file offset 0x2000). readelf -s reported:
  - flag_shard_b: address 0x2120, size 32
  - flag_shard_a: address 0x2140, size 32
- The .rodata dump already shows those bytes but we should extract them cleanly (exact 32-byte blobs) for further processing (e.g., try XOR with runtime key, view as little-endian words, feed into any VM decrypt routines).
- I'll run a focused hexdump for each shard at the exact offsets to get precise binary content.
Run a single bash command to hexdump the two shards from the challenge binary at the reported offsets (0x2120 and 0x2140), 32 bytes each, with canonical hex+ASCII for easy analysis.
$ echo '--- hexdump flag_shard_b (32 bytes at 0x2120) ---'
hexdump -C -s $((0x2120)) -n 32 challenge
echo '--- hexdump flag_shard_a (32 bytes at 0x2140) ---'
hexdump -C -s $((0x2140)) -n 32 challenge
--- hexdump flag_shard_b (32 bytes at 0x2120) ---
00002120 47 b9 1a 69 6b ae 3d 3d a4 59 f5 fc 80 de 94 6f |G..ik.==.Y.....o|
00002130 ac c6 51 b1 16 e2 dc 98 6c 61 9a 6f 86 b7 af 66 |..Q.....la.o...f|
00002140
--- hexdump flag_shard_a (32 bytes at 0x2140) ---
00002140 20 c7 6e 11 3f ed 6c 51 d2 2a 94 df f9 aa e6 1c | .n.?.lQ.*......|
00002150 e4 a7 2b f9 31 97 ad fa 08 15 b9 0c f8 90 d6 0c |..+.1...........|
00002160
- flag_shard_b at file offset 0x2120 and flag_shard_a at 0x2140 (each 32 bytes) have been extracted.
- The binary contains generate_vm_code (0x14d0), execute_vm (0x1660), and decrypt_and_print_flag (0x1800). Disassembling these functions should reveal how the runtime key is derived, how VM code is produced and executed, and how the shards are decrypted/combined into the final flag.
- A focused objdump disassembly of those functions (intel syntax) will let us read the logic and plan the necessary emulation/decryption steps.
Run an objdump disassembly in intel syntax for the three functions (generate_vm_code, execute_vm, decrypt_and_print_flag) and capture the first few hundred lines so we can inspect the code paths and constants. This will guide extracting or emulating the VM and applying the correct transformation to the two 32-byte shards to recover the flag.
$ objdump -d -M intel --disassemble=generate_vm_code,execute_vm,decrypt_and_print_flag challenge | sed -n '1,400p'
challenge: file format elf64-x86-64
Disassembly of section .init:
Disassembly of section .plt:
Disassembly of section .plt.got:
Disassembly of section .plt.sec:
Disassembly of section .text:
Disassembly of section .fini:
- .rodata contains the printed messages and the two 32-byte blobs at the reported addresses.
- readelf showed function addresses and sizes:
  - generate_vm_code at 0x14d0 (size 393)
  - execute_vm at 0x1660 (size 411)
  - decrypt_and_print_flag at 0x1800 (size 94)
- Disassembling the code at those ranges will reveal how the runtime key is produced and how the shards are processed (e.g., XOR, rotation, VM execution). Extracting that logic will let us emulate/decrypt offline and recover skbdg{...}.
Run objdump to disassemble the three function address ranges (using start/stop addresses computed from readelf info) with Intel syntax, and show a manageable amount of output for inspection. This will allow reading the VM generation, VM execution, and the decryption routine to plan the decryption/emulation.
$ { objdump -d -M intel --start-address=0x14d0 --stop-address=0x165f challenge; echo '--- execute_vm ---'; objdump -
challenge: file format elf64-x86-64
Disassembly of section .text:
00000000000014d0 <generate_vm_code>:
14d0: f3 0f 1e fa endbr64
14d4: 48 81 ec 28 01 00 00 sub rsp,0x128
14db: 49 89 fa mov r10,rdi
14de: 44 8b 0d 1f 0b 00 00 mov r9d,DWORD PTR [rip+0xb1f] # 2004 <_IO_stdin_used+0x4>
14e5: ba 01 00 00 00 mov edx,0x1
14ea: 64 48 8b 04 25 28 00 mov rax,QWORD PTR fs:0x28
14f1: 00 00
14f3: 48 89 84 24 18 01 00 mov QWORD PTR [rsp+0x118],rax
14fa: 00
14fb: 48 b8 76 6f 5c 6c 31 movabs rax,0x3b6c65316c5c6f76
1502: 65 6c 3b
1505: 4c 8d 84 24 08 01 00 lea r8,[rsp+0x108]
150c: 00
150d: 48 89 e6 mov rsi,rsp
1510: 48 89 84 24 09 01 00 mov QWORD PTR [rsp+0x109],rax
1517: 00
1518: 48 b8 3b 7b 55 39 3c movabs rax,0x2e3a3f3c39557b3b
151f: 3f 3a 2e
1522: 48 89 84 24 10 01 00 mov QWORD PTR [rsp+0x110],rax
1529: 00
152a: 48 89 f8 mov rax,rdi
152d: bf 10 00 00 00 mov edi,0x10
1532: 44 29 d7 sub edi,r10d
1535: 0f 1f 00 nop DWORD PTR [rax]
1538: 41 0f b6 0c 10 movzx ecx,BYTE PTR [r8+rdx*1]
153d: 88 50 08 mov BYTE PTR [rax+0x8],dl
1540: 44 89 08 mov DWORD PTR [rax],r9d
1543: 88 48 05 mov BYTE PTR [rax+0x5],cl
1546: 8d 0c 07 lea ecx,[rdi+rax*1]
1549: 48 83 c0 12 add rax,0x12
154d: 89 4c 96 fc mov DWORD PTR [rsi+rdx*4-0x4],ecx
1551: 48 83 c2 01 add rdx,0x1
1555: c6 40 f2 01 mov BYTE PTR [rax-0xe],0x1
1559: c6 40 f4 47 mov BYTE PTR [rax-0xc],0x47
155d: c6 40 f5 02 mov BYTE PTR [rax-0xb],0x2
1561: c6 40 f7 b3 mov BYTE PTR [rax-0x9],0xb3
1565: c6 40 f8 01
... (output truncated)
- flag_shard_b (32 bytes): 47 b9 1a 69 6b ae 3d 3d a4 59 f5 fc 80 de 94 6f ac c6 51 b1 16 e2 dc 98 6c 61 9a 6f 86 b7 af 66
- flag_shard_a (32 bytes): 20 c7 6e 11 3f ed 6c 51 d2 2a 94 df f9 aa e6 1c e4 a7 2b f9 31 97 ad fa 08 15 b9 0c f8 90 d6 0c
- Operation per byte: out = (shard_a[i] XOR shard_b[i]) XOR 0x17
- I'll compute this locally in the container with a small Python snippet to print the resulting ASCII.
Run a bash command that invokes Python to load the two hex blobs, compute the bytewise XOR as described, and print the resulting string (the flag).
$ python3 - <<'PY'
a = bytes.fromhex("47b91a696bae3d3da459f5fc80de946facc651b116e2dc986c619a6f86b7af66")
b = bytes.fromhex("20c76e113fed6c51d22a94dff9aae61ce4a72bf93197adfa0815b90cf890d60c")
res = bytes((x ^ y ^ 0x17) for x, y in zip(a, b))
print(res.decode('utf-8', errors='replace'))
PY
picoCTF{adv4nced_vm_0bfusc4ti0n}
TUI
Any agent needs a cool TUI. I wanted to see what the agent was doing, but not be able to interact with it: no message sending or influencing it. I wanted a panel to manage the agents, so I could fire off an agent on a challenge and get an overview of how they were all going. I also wanted to be able to join a CTF, have flaggy discover the challenges through the CTF API or website, start agents in parallel to solve them, and let me monitor the situation. I used DSPy to extract the challenge info and download the challenge files. I quickly vibe coded it, and it worked surprisingly well in my limited testing, but it is certainly not very robust.
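The "start agents in parallel and monitor them" part can be sketched with a thread pool. This is only an illustration under made-up names: `solve` stands in for a whole agent run, the challenge names are invented, and the status dict is the kind of thing a panel would render.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List, Optional


def solve(challenge: str) -> Optional[str]:
    """Stand-in for one full agent run; returns a flag or None."""
    fake_results = {
        "rev-basic": "picoCTF{example_one}",
        "rev-vm": "picoCTF{example_two}",
    }
    return fake_results.get(challenge)


def run_all(challenges: List[str], max_workers: int = 4) -> Dict[str, str]:
    """Run one agent per challenge and collect a status line for each."""
    status = {name: "running" for name in challenges}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(solve, name): name for name in challenges}
        for future in as_completed(futures):
            name = futures[future]
            flag = future.result()
            status[name] = f"solved: {flag}" if flag else "unsolved"
    return status


for name, state in run_all(["rev-basic", "rev-vm", "pwn-hard"]).items():
    print(f"{name:10s} {state}")
```

Threads are enough here because each agent spends its time waiting on the model API and the container, not on local CPU.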
what now?
It has been some time now since I worked on flaggy, and I would start fresh and build another CTF-focused agent instead of extending this one. AI and agent tooling have progressed a lot, and the models now have a much more intuitive understanding of operating a terminal and tools, so they don’t need so much harness scaffolding. I want to write the least amount of scaffolding needed for a well-performing agent, keep it hackable, and lean on the intelligence of the models.
Benchmarks like Terminal-Bench show that the general coding agent harnesses work pretty much just as well as anything. The main benefit of something more than the coding harnesses is better UI, UX, and connection or access to tools. A little hint of the direction I want to go in: pi with a tmux extension, Harbour for managing and running the CTF challenges, skill files to embed more info about available tools, and some sort of CTF extension to help in managing and starting the CTF environments.
these charts are horribly out of date now
TB1 used ~100 beta tasks (terminal-bench-core v0.1.1). TB2 is a curated set of 89 harder tasks; scores aren't directly comparable.
The cyber benchmarks have moved fast since I started flaggy[3]. On OpenAI’s internal CTF benchmark, GPT-5.1-Codex-Max jumped to 76% by November 2025, and GPT-5.3-Codex hit 77.6%. On the open Cybench benchmark, Anthropic’s Sonnet 4.5 jumped to 32% pass@1 and 76.5% with 10 attempts by October 2025, and Opus 4.6 reached 93% pass@1 on a subset of 37 problems. Both labs say they are close to saturating their current cyber evals. Sama warned that their next model would reach “High” on their cyber security framework, and indeed GPT-5.3-Codex was released as the first model to reach “High” on OpenAI’s cyber preparedness framework.
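For anyone unsure what pass@1 versus "with 10 attempts" means, here is one common way to estimate pass@k from n sampled attempts with c successes on a single challenge. Whether the labs quoted above compute their numbers exactly this way is an assumption on my part; the function name is just for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k attempts, drawn from n recorded
    attempts of which c succeeded, solves the challenge."""
    if n - c < k:
        # Fewer than k failures recorded: any k attempts include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# One challenge, 10 recorded attempts, 3 of them found the flag.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

A benchmark score is then the average of this value over all challenges, which is why a model can sit at a modest pass@1 and still look strong once it gets several tries.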
Footnotes
1. Capture the Flag: challenges where the aim is to find the hidden ‘flag’ in a program.
2. I had to give it a cute name, but I realise now there are many projects on GitHub with this name.
3. Since I have been delayed in getting this post out, I have had to update these charts multiple times (and they are still out of date)!