o3 isn’t one of the biggest developments in AI for 2+ years because it beats a particular benchmark. It is one because it demonstrates a reusable technique through which almost any benchmark could fall, and at short notice. I’ll cover all the highlights, the benchmarks broken, and what comes next. Plus, the costs OpenAI didn’t want us to know, Genesis, ARC-AGI 2, Gemini-Thinking, and much more.

AI Insiders ($9!): https://www.patreon.com/AIExplained

FrontierMath: https://epoch.ai/frontiermath https://arxiv.org/pdf/2411.04872
Chollet Statement: https://arcprize.org/blog/oai-o3-pub-breakthrough
MLC Paper: https://www.scientificamerican.com/article/new-training-method-helps-ai-generalize-like-people-do/?utm_campaign=socialflow&utm_source=twitter&utm_medium=social
AlphaCode 2: https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf
Human Performance on ARC-AGI: https://arxiv.org/pdf/2409.01374v1
Wei Tweet ‘3 months’: https://x.com/_jasonwei/status/1870184982007644614
Deliberative Alignment Paper: https://openai.com/index/deliberative-alignment/
Brown Safety Tweet: https://x.com/polynoamial/status/1870196476908834893
SWE-bench Verified: https://openai.com/index/introducing-swe-bench-verified/
Amodei Prediction: https://x.com/OfirPress/status/1858567863788769518
David Dohan (16 hours): https://x.com/dmdohan/status/1870171404093796638
OpenAI Personal Writing: https://openai.com/index/learning-to-reason-with-llms/
SimpleBench: https://simple-bench.com/
John Hallman Tweet: https://x.com/johnohallman/status/1870233375681945725

00:00 - Introduction
01:19 - What is o3?
03:18 - FrontierMath
05:15 - o4, o5
06:03 - GPQA
06:24 - Coding, Codeforces + SWE-verified, AlphaCode 2
08:13 - 1st Caveat
09:03 - Compositionality?
10:16 - SimpleBench?
13:11 - ARC-AGI, Chollet
20:25 - Safety Implications

AI Insiders: https://www.patreon.com/AIExplained
Non-hype Newsletter: https://signaltonoise.beehiiv.com/
Podcast: https://aiexplainedopodcast.buzzsprout.com/