o3 Pro on ARC-AGI Semi-Private Eval Results

ARC-AGI-1:
* Low: 44%, $1.64/task
* Medium: 57%, $3.18/task
* High: 59%, $4.16/task

ARC-AGI-2:
* All reasoning efforts: <5%, $4-7/task

Takeaways:
* o3-pro performance is in line with o3
* o3's new price sets the ARC-AGI-1 frontier
To note, o3 Pro is *not* the same model we tested in Dec '24 (o3-preview). OpenAI has explicitly confirmed this. See the referenced post below for more information.
ARC Prize, 17 April 2025:
Clarifying o3's ARC-AGI Performance

OpenAI has confirmed:
* The released o3 is a different model from what we tested in December 2024
* All released o3 compute tiers are smaller than the version we tested
* The released o3 was not trained on ARC-AGI data, not even the train set
* The released o3 is tuned for chat/product use, which introduces both strengths and weaknesses on ARC-AGI

What ARC Prize will do:
* We will re-test the released o3 (all compute tiers) and publish updated results. Prior scores will be labeled "preview"
* We will test and release o4-mini results as soon as possible
* We will test o3-pro once available
o3 results have been updated to reflect OpenAI's 80% price reduction.
New to the chart are data points for o3 (high reasoning) and o4-mini (high reasoning). They were previously excluded because of model timeouts; OpenAI's new background mode has enabled us to run these models at high compute settings.
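For context, here is a minimal sketch of how a long-running request can be submitted via background mode and polled for completion rather than held open on a single connection (which is where timeouts bite on high reasoning effort). It assumes the OpenAI Responses API's `background=True` flag and `reasoning` effort parameter; the model name, prompt, and poll interval are illustrative, not our harness.

```python
# Sketch: submit a high-effort reasoning request in background mode,
# then poll for the result instead of keeping a request open.
import time
from openai import OpenAI

client = OpenAI()

# Submit the task; background=True returns immediately with a queued response.
response = client.responses.create(
    model="o3-pro",                          # illustrative model name
    input="Solve this ARC-AGI task: ...",    # placeholder prompt
    reasoning={"effort": "high"},
    background=True,
)

# Poll until the response finishes (poll interval is arbitrary here).
while response.status in ("queued", "in_progress"):
    time.sleep(30)
    response = client.responses.retrieve(response.id)

print(response.status)
print(response.output_text)
```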
See leaderboard: Reproduce results: