AI Reading for Saturday December 21
The big news is OpenAI’s o3 announcement.
I would highlight the coding benchmarks around minute 11. Note the slope of the o1 models vs. the o3 models in the first chart, which shows Codeforces coding ability and a large improvement from o1 to o3.
A very important caveat: they did not ship the models or announce pricing. They expect o3-mini to ship at the end of January.
The o3-mini rating of 2073 would put it in 2993rd place in the Codeforces ranking, a strong developer.
The o3 rating of 2727 would put it in 175th place, an extremely strong result. The caveat: roughly 3x the cost of o1. (The right chart shows an ELO around 2400, so that is confusing.) They didn't report the low- and medium-effort results. A theme in 2024 was that the fast, cheap mini models are not far behind the flagship models; here, that is a big gap.
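To put that rating gap in perspective: Codeforces ratings are Elo-like, so as a rough sketch (assuming the standard 400-point Elo expected-score formula applies), a 2727-rated player would be expected to beat a 2073-rated player nearly every time:

```python
def elo_win_prob(r_a: float, r_b: float) -> float:
    """Expected score (roughly, win probability) of player A vs player B
    under the standard Elo formula with a 400-point scale."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

# Hypothetical head-to-head: o3 (2727) vs o3-mini high effort (2073)
print(round(elo_win_prob(2727, 2073), 2))  # ≈ 0.98
```

In other words, by this back-of-the-envelope math the 654-point gap corresponds to o3 winning about 98% of head-to-head matchups, which is why the mini-vs-flagship gap looks so large this time.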
The mini 'high effort' looks like it will cost less than 50% of o1. They haven't announced pricing, but it would be pulling a fast one if those chart axes didn't line up closely with actual costs. They have a note about cost estimates, which I don't get: if they ran it and got those results, they should know the true compute budget. If these are estimates, there may be some wiggle room on costs.
So a big improvement, but beware of hype and training-data contamination. There is always some slip between cup and lip, or ship.
The other benchmark is performance on the ARC-AGI benchmark. o3 made a breakthrough: it broke the record and hit the bar for matching human performance on the benchmark, a/k/a achieved AGI by that measure. But to win the prize, the algorithm must be open source (spoiler: of course it is not) and meet a cost criterion. o3 required something like $340K in compute to achieve that score, which does not meet the bar. So it's a bit of a bragging-rights exercise, not a practical one. (I hope that's not how they got 2727 on Codeforces.)
There is also a note saying o3 was 'tuned' on 75% of the public ARC-AGI data, which is suggestive of training-set contamination.
Bottom line: they are claiming a big jump in performance, whereas the Gemini 2.0 jump was more incremental. The proof will be in the third-party benchmarks when it ships. I'm inclined to discount the claims a bit until then, but it still looks amazeballs.
No new 4.5 foundation model. The WSJ today has some inside dope on struggles to build Orion despite $500m training runs. The foundation model is pretty important: it drives everything else, including the reasoning model. Also, the o1 multistep process doesn't support all the basic features of the chat API, like temperature and some of the tools.
Between the fairly modest improvement in Gemini 2.0, OpenAI's pivot to reasoning models, and comments from leading lights like Sutskever, the hitting-the-wall / running-out-of-road story for LLM improvement seems more real. Maybe we will need something better than transformers: neuromorphic hardware, better knowledge representation, something that can train more efficiently and learn on the fly. TBH I wouldn't be surprised if this takes some of the air out of parts of the AI bubble. Who needs 100,000-GPU Nvidia clusters if they don't yield big improvements?
In this context, the 12 days had some good stuff, reflecting a greater emphasis on shipping products as opposed to research and magical new models.
Some inside dope on OpenAI 'Orion' struggles. After two $500m training runs, no GPT-5 yet. - WSJ
Google and OpenAI are going head-to-head. - Ars Technica
Gemini hasn't got huge traction despite catching up to OpenAI at the foundation model level. So they are putting chat on the home page. Meanwhile OpenAI is making Search generally available and moving towards an AI-native browser.
OpenAI announces new o3 models - TechCrunch
OpenAI teases new reasoning model—but don’t expect to try it soon. - The Verge
OpenAI Upgrades Its Smartest AI Model With Improved Reasoning Skills - Wired
OpenAI Unveils o3 System That Reasons Through Math, Science Problems - NYT
OpenAI launched its new AI reasoning model in September. It already has challengers, one from China and another from Google. - Business Insider
AI’s assault on our intellectual property must be stopped. - FT
Existing legal frameworks do not clearly establish ownership of your AI-generated code. - ZDNET
Digital twins of human organs, to experiment on models instead of the real thing, for instance to treat heart arrhythmias. - MIT Technology Review
Arizona School’s Curriculum Will Be Taught by AI, No Teachers - Gizmodo
AI traffic cameras could be watching you on the road - NBC News
Is conflict of interest still a thing now? :|
Music Can Thrive in the AI Era - WIRED
Google Is Working on AI-Powered Scam Detection for Chrome - Lifehacker
Comparing GitHub Copilot and Cursor. - Geeky Gadgets
Man woman camera TV.
Chinese researchers face hurdles in attending NeurIPS - South China Morning Post
Generative AI needs and augments human skills to create exceptional experiences. - MIT Technology Review
Half of job applicants rely on AI for résumés, raising authenticity concerns. - ComputerWeekly.com
'Supercycle’ for AI PCs and smartphones is a bust, analyst says - Tom's Hardware
German researchers have figured out how to use AI to identify whiskey aromas - NPR
Do not use AI for evil - Reddit
AI-powered teddy bear concept is an interactive babysitter for kids - Yanko Design
Follow the latest AI headlines via SkynetAndChill.com on Bluesky