Horizon Fund Update: The DeepSeek Whale Makes a Splash
Language models are very inefficient. We know this because our brains achieve general intelligence with roughly 1/10,000th of the training data and run on just 20W of power. Meanwhile, 2GW datacentres are currently under construction, and we have still yet to achieve AGI (though we may be close). 

That isn't to say there hasn't been progress on efficiency. The ChatGPT of November 2022 required a datacentre to run, but today there are dozens of small language models with similar performance that can run on a phone or laptop. Computational performance of AI accelerators doubles every two years, while performance per dollar (FLOP/s per $) improves by around 30% and energy efficiency (FLOP/s per watt) by around 50% each year (per Epoch AI). Changing number formats/precision has yielded even larger gains, with 12x better performance when switching from FP32 to tensor-INT8. Over the last 8 years, NVIDIA has increased AI accelerator compute by 1,000x and reduced the energy required per token by 45,000x. There have been all kinds of breakthroughs on the algorithmic side too, from the pre- and post-training pipeline through to the attention mechanism of the transformer itself (e.g. FlashAttention, multi-query attention and RoPE). Research published last year analysing the 2012-23 period found that the compute needed to reach a given performance level halves every 8 months. Most recently, the whole industry has shifted its focus to scaling 'test-time compute', following research showing that scaling up 'thinking time' at inference by 10x yields roughly the same performance gain as scaling up pre-training by 10x - this makes quite a difference, as a pre-training run might cost $100MM while an inference call might only cost a penny. 
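To give a feel for how these rates compound, here is a rough back-of-the-envelope sketch in Python - only the growth rates come from the figures quoted above; the five-year horizon is an illustrative assumption:

```python
# Back-of-the-envelope compounding of the efficiency trends quoted above.
# Only the growth rates come from the text; the horizon is illustrative.

PRICE_PERF_GROWTH = 1.30   # FLOP/s per $ improves ~30% per year (Epoch AI)
ENERGY_EFF_GROWTH = 1.50   # FLOP/s per watt improves ~50% per year (Epoch AI)
ALGO_HALVING_MONTHS = 8    # compute to reach a fixed capability halves every ~8 months

years = 5
print(f"Over {years} years:")
print(f"  price-performance gain: {PRICE_PERF_GROWTH ** years:.1f}x")      # ~3.7x
print(f"  energy-efficiency gain: {ENERGY_EFF_GROWTH ** years:.1f}x")      # ~7.6x
print(f"  algorithmic gain:       {2 ** (12 * years / ALGO_HALVING_MONTHS):.0f}x")  # ~181x

# Multiplying hardware price-performance by algorithmic progress gives a rough
# combined reduction in the cost of reaching a fixed capability level
# (a simplification - the two effects are not fully independent in practice).
combined = PRICE_PERF_GROWTH ** years * 2 ** (12 * years / ALGO_HALVING_MONTHS)
print(f"  combined cost reduction for a fixed capability: ~{combined:.0f}x")
```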

Despite these trends, training clusters keep getting larger - xAI's 100k-GPU cluster of H100s has 10x the number of GPUs used to train GPT-3 in 2020, and in terms of compute performance it is roughly 70x larger at FP64 or 130x at FP16. NVIDIA has been one of the largest beneficiaries of these scaling trends, and the market is on high alert for some new disruptor. Maybe state space models will displace the transformer architecture, allowing sub-quadratic scaling of context length, or maybe quantum computers are better suited to running non-deterministic AI models, given they are inherently probabilistic themselves. Or maybe some kids in a garage somewhere will come up with something completely new, perhaps something that doesn't require mountains of matrix multiplications, and datacentres can go back to running on nice, cheap CPUs. 
Who are DeepSeek and what did they do?

Enter DeepSeek, who are not a bunch of kids in a garage, but a well-resourced, deeply technical team of ex-quants and PhDs from some of China's top universities, who have been highly respected and closely watched in AI research circles for over a year. With over 200 researchers/engineers, they aren't far off the scale OpenAI was at a couple of years ago (the DeepSeek R1 paper has ~200 contributors). Billionaire founder Liang Wenfeng bought 10,000 NVIDIA A100s in 2021, before export controls came in, and has since added an undisclosed number of H800s. There are rumours they might have as many as 50,000 H100s that have found their way around export controls, though this is unconfirmed. 

Their first language model, DeepSeek LLM, was released in November 2023 and was competitive with Meta's Llama 2 70B. DeepSeek V2 was released in May 2024 and was competitive with Llama 3 70B. But it wasn't until last Tuesday that DeepSeek started to make waves in the market, with the release of DeepSeek V3 and their reasoning model DeepSeek R1. So what's the big deal?

Performance

On the standard suite of benchmarks, DeepSeek V3 is competitive with the best publicly released models from OpenAI, Anthropic and Meta. Perhaps even more shockingly, DeepSeek's R1 reasoning model is competitive with OpenAI's o1 reasoning model, which became generally available just a few months ago. 

This is a reality check for the frontier labs, who generally put China at 12-24 months behind. Just last week, Anthropic CEO and founder Dario Amodei was at Davos stressing the importance of maintaining this lead, as it gives research labs some breathing room to spend time working on AI safety and alignment - a head-to-head race would create a dynamic whereby these considerations might be sidelined to ensure the other side doesn't get to superintelligence first. 

Source: DeepSeek
DeepSeek's reasoning model achieves similar scores to OpenAI's o1 in the hardest competitive math, science and coding benchmarks

Source: DeepSeek
Running Costs

Both DeepSeek V3 and R1 are considerably cheaper in terms of $/token than most US labs' models (Google's are more competitive). Because DeepSeek's models are open source, there is also the option to download the weights and run them locally. 

DeepSeek's R1 model is about 5-10x cheaper than o1 mini and 25-50x cheaper than o1 (we note OpenAI is about to release o3 mini, which is almost as performant as o1 but will likely be priced closer to o1 mini).

Source: DeepSeek
The order-of-magnitude reduction in inference costs for DeepSeek's R1 versus OpenAI's o1 is in line with broader LLM trends, which show inference costs declining at least 10x per year

Source: a16z
DeepSeek has shifted the Pareto frontier for $/performance (top right is best, note how highly Google's new Gemini models score)

Source: @swyx
Engineering
Perhaps the biggest surprise (at least for investors) in the DeepSeek paper was the base model training run cost of $5.6MM. This is a theoretical figure based on the 2.788MM H800 GPU hours used in pre-training on their 2,048-GPU cluster for the final run, at an assumed rental rate of $2 per GPU hour. The paper itself gives the proviso: "Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data". Nevertheless, at 3.3e24 FLOPs this is about 90% less training compute than Meta's Llama 3 405B model.
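The arithmetic behind the headline figure is simple enough to reproduce - a quick sketch, using the $2-per-GPU-hour rental rate DeepSeek assume in their paper and the publicly reported Llama 3 405B training compute:

```python
# Reproducing the headline training-cost figure from the DeepSeek V3 paper.

gpu_hours = 2.788e6    # H800 GPU hours for the final pre-training run (from the paper)
rental_rate = 2.0      # USD per H800 GPU hour (the paper's own assumption)
cost = gpu_hours * rental_rate
print(f"Final-run cost: ${cost / 1e6:.2f}M")   # ~$5.58M

# Training-compute comparison (Llama 3 405B figure as publicly reported by Meta)
deepseek_v3_flops = 3.3e24
llama3_405b_flops = 3.8e25
reduction = 1 - deepseek_v3_flops / llama3_405b_flops
print(f"Compute reduction vs Llama 3 405B: ~{reduction:.0%}")   # ~91%
```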

This was achieved through numerous optimisations at every layer of the stack:
  • Optimisations actually began at chip level, below even CUDA, with DeepSeek engineers reportedly writing low-level PTX (NVIDIA's assembly-like instruction set) to work around the H800's throttled interconnect bandwidth. This in itself is an extraordinary technical feat
  • Numerous other techniques were used, all of which are in the public domain, but DeepSeek seems to have combined them in an elegant way to achieve the large overall efficiency gains in base model pre-training. These include a Mixture-of-Experts architecture (a minimal sketch of the routing idea follows this list), multi-head latent attention, multi-token prediction, and the use of mixed number formats to reduce memory use 
  • The reinforcement learning process for the R1 reasoning model was almost entirely self-taught, with minimal human-labelled data, in the spirit of AlphaGo. This is highly reproducible with low barriers to entry, and we would expect RL-trained iterations of language models to take off at breakneck pace from here
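As a flavour of the Mixture-of-Experts idea referenced above, below is a minimal, illustrative sketch of top-k expert routing in PyTorch. It is not DeepSeek's implementation (theirs uses fine-grained experts, a shared expert and auxiliary-loss-free load balancing); it simply shows why only a fraction of a model's parameters needs to be active for each token:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Minimal top-k Mixture-of-Experts feed-forward layer (illustrative only).

    A router scores each token against n experts and only the top k experts
    run for that token, so active parameters (and FLOPs) per token are a
    fraction of total parameters.
    """
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.router(x)                  # (tokens, n_experts)
        top_w, top_idx = scores.topk(self.k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)         # normalise over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e     # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 64)
print(TinyMoELayer()(tokens).shape)              # torch.Size([16, 64])
```

In DeepSeek V3's case, roughly 37B of 671B total parameters are activated per token, which is where much of the training and inference efficiency comes from.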
The 'Buts'
Despite a refreshing level of transparency on the technical side, there are still a few question marks: 
  • There is a chance the numbers are made up. Maybe the training FLOPs are misstated and DeepSeek has access to sanctioned chips, or the benchmark figures are inflated, or the whole thing is some kind of Chinese psy-op. This is a minority view - the AI research community generally agrees that the training figures and model performance are real
  • Where did DeepSeek get 14.8 trillion high-quality training tokens? Did they train on GPT-4o or Claude 3.5 outputs? This is called model 'distillation'. It's fairly common practice in the open source community, and hard to prove, but it goes against the frontier labs' terms of service. It would explain why they score very similarly to GPT-4o/Claude 3.5 on all the benchmarks, but never significantly above
  • DeepSeek inference costs could be set at breakeven or even at a loss. SemiAnalysis estimates that Anthropic and OpenAI set inference pricing at around 75% gross margins
  • Despite the initial surge of downloads on the app store, it is unlikely that DeepSeek will gain broad adoption outside of China due to data security concerns
  • The base model is competing against distilled versions of GPT-4 generation models - it remains to be seen what the models trained on 100k clusters are like (e.g. Grok 3, Llama 4). Gemini 2 has been soft-launched and beats DeepSeek's models on Elo scores and $/performance
What Does it Mean for AI Stocks?
In theory all of DeepSeek's optimisations are replicable, both the base model pre-training and the highly efficient and novel reinforcement learning approach used to train their reasoning model. This is great news for open source and start-ups, for whom the gigantic capital requirements at the frontier pose an insurmountable barrier to entry, but also for the frontier labs themselves, who just got handed 10x more compute (Claude seems perpetually rate-limited). The $1 billion training run that couldn't be justified on ROI grounds might now be done for $100 million. The so-so intelligence that can run on-device - on your phone, laptop, or new form factors like VR headsets and AR glasses - just got levelled up closer to today's state of the art. The 10GW or 100GW datacentres that scaling laws seemed to demand by 2028 and 2030 could never realistically have been financed, built or powered - perhaps now that's a moot point. 

The huge numbers projected by researchers at the frontier labs were never taken seriously by the markets, nor were they ever priced into AI stocks. NVIDIA is forecast to grow earnings not by 2x or 10x, but by +51% this year and +23% the year after. At the time of writing it trades on an NTM P/E of 26.6x - very close to its ChatGPT-era low of 24.3x in December 2023, and 30% below its average of 38.4x over the last two years. This year's Blackwell chips have 33x the inference performance of last year's Hoppers - far greater than the 10x efficiency gains demonstrated by DeepSeek - yet nobody is saying this is bad news for NVIDIA because people will need far fewer chips. 

It feels like the AGI timeline has been brought forward by at least a year, and the race has got closer. This will spur more investment, not less, as well as speed up AI adoption across all industries. In terms of the near-term outlook for AI stocks, Microsoft earnings on Wednesday evening will be critical. If the message is that efficiency gains are good news because demand outweighs what the hyperscalers can provide, there will be a quick rebound in AI infrastructure-linked stocks. If the message is that efficiency gains allow Microsoft to pare back some capex and do more with less to improve ROI, the picture will be more mixed. 
It is seemingly obligatory for financial journalists to write 'highly-valued' in front of 'AI stocks', but it isn't clear on what basis

Source: Bloomberg; Green Ash Partners.
Mark Zuckerberg posted this three days after the DeepSeek paper was published

Source: Threads
Microsoft's CEO seems to indicate cheaper AI means more AI
"To see the DeepSeek's new model, it's super impressive in terms of both how they have really effectively done an open-source model that does this inference-time compute, and is super-compute efficient. We should take the developments out of China very, very seriously." - Microsoft CEO Satya Nadella at the World Economic Forum

"Jevons paradox strikes again! As AI gets more efficient and accessible, we will see its use skyrocket, turning it into a commodity we just can't get enough of." - Satya Nadella on X
Green Ash Partners LLP
11 Albemarle Street
London
W1S 4HH

Tel: +44 203 170 7421
Email: info@greenash-partners.com
LinkedIn
Twitter
Website
NOTICE TO RECIPIENTS: The information contained in and accompanying this communication is confidential and may also be legally privileged, or otherwise protected from disclosure. It is intended solely for the use of the intended recipient(s). If you are not the intended recipient of this communication, please delete and destroy all copies in your possession, notify the sender that you have received this communication in error, and note that any review or dissemination of, or the taking of any action in reliance on, this communication is expressly prohibited. 
 
This email is for information purposes only and does not constitute an offer or solicitation of an offer for the product and may not be used as an offer or a solicitation. The opinions herein do not take into account individual clients’ circumstances, objectives, or needs. Before entering into any investment, each client is urged to consider the suitability of the product to their particular circumstances and to independently review, with professional advisors as necessary, the specific risks incurred, in particular at the financial, regulatory, and tax levels.
 
All and any examples of financial strategies/investments set out in this email are for illustrative purposes only and do not represent past or future performance. The information and analysis contained herein have been based on sources believed to be reliable. However, Green Ash Partners does not guarantee their timeliness, accuracy, or completeness, nor does it accept any liability for any loss or damage resulting from their use. All information and opinions as well as the prices indicated are subject to change without notice. Past performance is no guarantee of current or future returns and you may consequently get back less than you invested.