Data Contamination Analysis
Did the model already know the answer? For each challenge we run a
closed-book recall probe on the target model (no tools, no internet) and
ask whether it remembers the challenge from its training data. An LLM
judge with access to the actual challenge files grades each response as
memorized or not memorized, flagging only specific recall that matches
the real challenge, not generic CTF intuition. We also
scan the model's solve trace to check whether the challenge name leaked
into the BoxPwnr prompt or any tool output.
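The trace-leak check above can be sketched as a case-insensitive scan over the solve trace. The JSON trace schema (a list of `{"role", "content"}` entries) and the function name `challenge_name_leaked` are illustrative assumptions, not BoxPwnr's actual format:

```python
import json
import re

def challenge_name_leaked(trace_path: str, challenge_name: str) -> list:
    """Return trace entries whose text mentions the challenge name.

    Assumed trace format: a JSON list of {"role": ..., "content": ...}
    dicts covering the prompt and every tool output. A non-empty result
    means the name leaked somewhere in the run.
    """
    # Escape the name so regex metacharacters in challenge titles
    # (e.g. "pwn[1]") are matched literally.
    pattern = re.compile(re.escape(challenge_name), re.IGNORECASE)
    with open(trace_path) as f:
        trace = json.load(f)
    return [entry for entry in trace
            if pattern.search(entry.get("content", ""))]
```

In practice one would run this over every solve trace and treat any hit as a potential contamination signal to review by hand, since the name can appear innocuously (e.g. echoed back by the model itself).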