TL;DR: We have no official data for any of the largest models. Epoch AI is guesstimating the total training compute and often basing new estimates on older ones.
Rationale
Leading AI companies are not disclosing the total training compute used for their models. Therefore, we have no official data for any of the largest models in the list.
The total training compute for Grok and GPT is estimated. For Claude and Gemini, an estimate appears in the "Training compute notes" section, but it is not reported in the adjacent column that will be used for resolution.
What this means in practice is that we are currently forecasting the likelihood that Epoch AI's estimate for some model will surpass the required threshold.
So, how is Epoch AI estimating these values? There appears to be no single consistent methodology; instead, they use multiple approaches depending on which data is available. Here are a few examples:
- In most cases, it's a formula based on the model's number of parameters (which, in most cases, is not publicly disclosed and is therefore itself estimated)
- GPT-5's training compute rests on nothing solid: it's a chain of hypotheses about what the compute should be relative to GPT-4 (whose own compute notes read "this is a rough estimate based on public information, much less information than most other systems in the database")
- Claude 3.7's total training compute is based on the claim that it took "a few tens of millions of dollars" to train. Epoch AI took the geometric mean of $20-90M and reverse-engineered the total compute from assumptions about the cost of training.
- Most importantly, Grok 4's estimate (our current benchmark model) is based on qualitative assumptions relative to Grok 3's training, which was itself based on an estimated training duration of "approximately 3 months".
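To make the two main heuristics above concrete, here is a back-of-envelope sketch. All of the specific numbers (parameter count, token count, FLOP per dollar) are illustrative assumptions of mine, not Epoch AI's actual inputs; only the $20-90M cost range and the geometric-mean step come from the notes quoted above:

```python
import math

# Heuristic 1: parameter-count formula. A standard rule of thumb is
# training compute ~ 6 * N * D FLOP (N = parameters, D = training tokens).
# Both inputs below are assumed values for illustration only.
n_params = 1.8e12   # assumed parameter count
n_tokens = 13e12    # assumed training tokens
compute_from_params = 6 * n_params * n_tokens  # ~1.4e26 FLOP

# Heuristic 2: cost-based reverse engineering (the Claude 3.7 approach).
# Take the geometric mean of the reported cost range, then divide by an
# assumed cost per FLOP (expressed here as FLOP per dollar).
low_cost, high_cost = 20e6, 90e6             # "a few tens of millions of dollars"
geo_mean_cost = math.sqrt(low_cost * high_cost)  # ~$42.4M
flop_per_dollar = 5e17                        # assumed hardware cost-efficiency
compute_from_cost = geo_mean_cost * flop_per_dollar  # ~2.1e25 FLOP

print(f"param-based estimate: {compute_from_params:.2e} FLOP")
print(f"cost-based estimate:  {compute_from_cost:.2e} FLOP")
```

Note that plausible changes to any single assumption (tokens trained on, dollars per FLOP) shift the result by a factor of several, which is exactly why these estimates are shaky ground for a resolution criterion.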
Overall, it looks like a giant guesstimation game: it yields reasonable and plausible estimates, but it may not be well suited to serve as the resolution criterion for a forecasting question.
Why do you think you're right?
Both Wysa and PathChat could reach the end of the pipeline by the end of March, yet the former has now been under review for three years, suggesting there may be concerns. Similar concerns would likely apply to PathChat as well, so I'm expecting both a) a longer approval process and b) a lower success rate.
This would seem to be confirmed by a recent DHAC meeting (as highlighted by @grainmummy here), where the FDA signalled its intention to create an ad hoc regulatory framework for these new devices. That process would likely take months and result in a new set of requirements that manufacturers would be asked to meet, further extending the timeline.
The creation of a new risk-based framework specifically for GenAI technologies is likely to take longer than the 4 months left on this question, so the likelihood of approval of any LLM-based device is very low.
I'm remaining higher than the crowd due to the existence of these two companies, whose products have already spent enough time in the pipeline.
Why might you be wrong?