Data Contamination Analysis
Did the model already know the answer? For each challenge we run a
closed-book recall probe on the target model (no tools, no internet) and
ask whether it remembers the challenge from its training data. An LLM
judge with access to the actual challenge files grades each response as
memorized or not memorized, flagging only specific recall that matches
the real challenge, not generic CTF intuition. We also
scan the model's solve trace to check whether the challenge name leaked
into the BoxPwnr prompt or any tool output.
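The trace-leak check above can be sketched as a case-insensitive scan over the solve trace. The JSON trace schema (a list of `{"role", "content"}` entries) and the function name `challenge_name_leaked` are illustrative assumptions, not BoxPwnr's actual format:

```python
import json
import re

def challenge_name_leaked(trace_path: str, challenge_name: str) -> list:
    """Return trace entries whose text mentions the challenge name.

    Assumed trace format: a JSON list of {"role": ..., "content": ...}
    dicts covering the prompt and every tool output. A non-empty result
    means the name leaked somewhere in the run.
    """
    # Escape the name so regex metacharacters in challenge titles
    # (e.g. "pwn[1]") are matched literally.
    pattern = re.compile(re.escape(challenge_name), re.IGNORECASE)
    with open(trace_path) as f:
        trace = json.load(f)
    return [entry for entry in trace
            if pattern.search(entry.get("content", ""))]
```

In practice one would run this over every solve trace and treat any hit as a potential contamination signal to review by hand, since the name can appear innocuously (e.g. echoed back by the model itself).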