AI Reading for Saturday December 21
The big news is OpenAI’s o3 announcement.
I would highlight the coding benchmarks around minute 11. Note the slope of the o1 models vs. the o3 models in the first chart, which shows Codeforces coding ability and a large improvement from o1 to o3.
A very important caveat: they did not ship the models or announce pricing. They expect o3-mini to ship at the end of January.
The o3-mini rating of 2073 would put it in 2993rd place in the Codeforces ranking, a strong developer.
The o3 rating of 2727 would put it in 175th place, an extremely strong result. The caveat: roughly 3x the cost of o1. (The right chart shows an ELO around 2400, so that is confusing.) They didn't report the low- and medium-effort results. A theme in 2024 was that the fast, cheap mini models are not far behind the flagship models; here, that is a big gap.
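To put that rating gap in perspective: Codeforces ratings are Elo-like, so as a rough sketch (assuming the standard 400-point Elo expected-score formula applies), a 2727-rated player would be expected to beat a 2073-rated player nearly every time:

```python
def elo_win_prob(r_a: float, r_b: float) -> float:
    """Expected score (roughly, win probability) of player A vs player B
    under the standard Elo formula with a 400-point scale."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

# Hypothetical head-to-head: o3 (2727) vs o3-mini high effort (2073)
print(round(elo_win_prob(2727, 2073), 2))  # ≈ 0.98
```

In other words, by this back-of-the-envelope math the 654-point gap corresponds to o3 winning about 98% of head-to-head matchups, which is why the mini-vs-flagship gap looks so large this time.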
The mini 'high effort' looks like it will cost less than 50% of o1. They haven't announced pricing, but it would be pulling a fast one if those chart axes didn't line up closely with actual costs. They have a note about cost estimates, which I don't get: if they ran it and got those results, they should know the true compute budget. If these are estimates, there may be some wiggle room on costs.
So a big improvement, but beware of hype and training-data contamination. There is always some slip between cup and lip, or ship.
The other benchmark is performance on the ARC-AGI benchmark. o3 made a breakthrough: it broke the record and hit the bar for matching human performance on the benchmark, a/k/a achieved AGI by that measure. But to win the prize, the algorithm must be open source (spoiler: of course it is not) and meet a cost criterion. o3 required something like $340K in compute to achieve that score, which does not meet the bar. So it's a bit of a bragging-rights exercise, not a practical one. (I hope that's not how they got 2727 on Codeforces.)
There is also a note saying o3 was 'tuned' on 75% of the public ARC-AGI data, which is suggestive of training-set contamination.
Bottom line: they are claiming a big jump in performance, whereas the Gemini 2.0 jump was more incremental. The proof will be in the third-party benchmarks when it ships. I'm inclined to discount the claims a bit until then, but it still looks amazeballs.
No new 4.5 foundation model. The WSJ today has some inside dope on struggles to build Orion despite $500m training runs. The foundation model is pretty important: it drives everything else, including the reasoning model. Also, the o1 multistep process doesn't support all the basic features of the chat API, like temperature and some of the tools.
Between the fairly modest improvement in Gemini 2.0, OpenAI's pivot to reasoning models, and comments from leading lights like Sutskever, the hitting-the-wall / running-out-of-road story for LLM improvement seems more real. Maybe we will need something better than transformers: neuromorphic hardware, better knowledge representation, something that can train more efficiently and learn on the fly. TBH I wouldn't be surprised if this takes some of the air out of parts of the AI bubble. Who needs 100,000-GPU Nvidia clusters if they don't yield big improvements?
In this context, the 12 days had some good stuff, reflecting a greater emphasis on shipping products as opposed to research and magical new models.
Some inside dope on OpenAI 'Orion' struggles. After two $500m training runs, no GPT-5 yet. - WSJ
Google and OpenAI are going head-to-head. - Ars Technica
Gemini hasn't got huge traction despite catching up to OpenAI at the foundation model level. So they are putting chat on the home page. Meanwhile OpenAI is making Search generally available and moving towards an AI-native browser.
OpenAI announces new o3 models - TechCrunch
OpenAI teases new reasoning model—but don’t expect to try it soon. - The Verge
OpenAI Upgrades Its Smartest AI Model With Improved Reasoning Skills - Wired
OpenAI Unveils o3 System That Reasons Through Math, Science Problems - NYT
OpenAI launched its new AI reasoning model in September. It already has challengers, one from China and another from Google. - Business Insider
AI’s assault on our intellectual property must be stopped. - FT
Existing legal frameworks do not clearly establish ownership of your AI-generated code. - ZDNET
Digital twins of human organs, to experiment on models instead of the real thing, for instance to treat heart arrhythmias. - MIT Technology Review
Arizona School’s Curriculum Will Be Taught by AI, No Teachers - Gizmodo
AI traffic cameras could be watching you on the road - NBC News
Is conflict of interest still a thing now? :|
Music Can Thrive in the AI Era - WIRED
Google Is Working on AI-Powered Scam Detection for Chrome - Lifehacker
Comparing GitHub Copilot and Cursor. - Geeky Gadgets
Man woman camera TV.
Chinese researchers face hurdles in attending NeurIPS - South China Morning Post
Generative AI needs and augments human skills to create exceptional experiences. - MIT Technology Review
Half of job applicants rely on AI for résumés, raising authenticity concerns. - ComputerWeekly.com
'Supercycle’ for AI PCs and smartphones is a bust, analyst says - Tom's Hardware
German researchers have figured out how to use AI to identify whiskey aromas - NPR
Do not use AI for evil - Reddit
AI-powered teddy bear concept is an interactive babysitter for kids - Yanko Design
Follow the latest AI headlines via SkynetAndChill.com on Bluesky