Applied AI: The Great Unhobbling
"I think the unhobbling trajectory looks much more like an agent than a chatbot and much more almost like basically a drop-in remote worker" - Leopold Aschenbrenner on the Dwarkesh podcast, June 2024
Last year we wrote about The Age of Agents, in which we claimed, "we seem to be on a fairly clear path to millions, or even billions, of agentic AIs that are better than most humans at most things". Yet ten months on, for general users of ChatGPT, Gemini or Claude, close to 100% of their interactions still take the form of a chatbot conversation.
So what gives? Well, in keeping with the trend since the launch of ChatGPT, software engineering has been the leading adoption market, and it is where the most rapid change is underway. While most people were shutting up shop for the Christmas break and market news slowed to a trickle, AI Twitter was erupting over Claude Code. Claude Code is Anthropic's command-line tool for agentic coding. It lets developers delegate coding tasks to Claude directly from their terminal - things like writing, editing, debugging, and refactoring code across entire codebases. But this isn't completely new - after all, the first iteration of Claude Code was released in preview a year ago. Google released Gemini CLI last June and OpenAI released Codex in August. These coding agents have been around for most of last year, so what changed in December?
First of all, it feels like some kind of threshold has been reached in coding. Perhaps this is best expressed by this post from Andrej Karpathy:
Source: X
Or this one from Boris Cherny, the actual creator and Head of Claude Code at Anthropic:
Source: X
Or this somewhat controversial one from a Principal Engineer at Google:
Source: X
The catalyst for this qualitative vibe-shift across the industry was the release of Anthropic's latest frontier model (Claude Opus 4.5) together with Claude Code v2. Claude Opus 4.5 reclaimed Anthropic's crown as the leader in coding models and, in addition, achieved a 50% success rate on tasks that take humans five hours, according to the METR long-horizon benchmark. When we wrote The Age of Agents ten months ago, the state of the art was less than an hour (the METR long-horizon benchmark became the load-bearing eval for agentic performance in 2025).
Source: METR
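As a back-of-envelope check on that pace, the two figures quoted above (under one hour ten months ago, five hours now) imply a doubling time for the task horizon. The sketch below assumes smooth exponential growth between the two points and uses exactly one hour as the starting value; both are simplifying assumptions made for illustration, not METR's own methodology.

```python
import math

def doubling_time_months(start_hours: float, end_hours: float, elapsed_months: float) -> float:
    """Months per doubling, assuming smooth exponential growth between two points."""
    doublings = math.log2(end_hours / start_hours)
    return elapsed_months / doublings

# Figures from the text: under 1 hour ten months ago, 5 hours now.
# Starting from exactly 1 hour gives an upper bound on the doubling time,
# because a lower starting value would imply more doublings in the same period.
months = doubling_time_months(start_hours=1.0, end_hours=5.0, elapsed_months=10.0)
print(f"Implied doubling time: at most ~{months:.1f} months")
```

On these assumptions the horizon has been doubling roughly every four months or faster.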
Sholto is a Reinforcement Learning Researcher at Anthropic. Source: X
So Claude Code is a big deal for coders. "So what?", you may say. Software engineers comprise only about 3% of the world's billion knowledge workers - isn't all of this confined to a few million techies?
For us, the interesting trend over the holidays was watching 'Claude Code moments' spread beyond AI/tech circles, to financial analysts, scientists, academics, even generalists.
Yes, Claude Code helps people code who don't know how to code, but it is much more than that. It has been characterised by some as akin to having a little spirit or ghost in your computer that can go off and perform tasks for long periods of time, only returning when they are completed. It can create, edit, convert or otherwise manipulate all of the file types typically used by knowledge workers, and connect to all kinds of tools, apps, systems of record and data repositories via APIs or MCP servers. It doesn't even have to be one little ghost - Claude Code's creator, Boris Cherny, shared his typical setup, which involves running 5 Claudes in parallel in his terminal and another 5-10 Claudes via claude.ai/code, which is web hosted and allows him to 'kick off' sessions from his phone or laptop while he's away from his PC.
The v2 update to Claude Code in December combines a harness (an orchestration layer for the underlying language model) with a growing list of features and components:
- Planning – Claude Code can decompose complex analytical projects into structured phases, reasoning through dependencies before execution (e.g. breaking down "analyse this company's competitive position" into financial statement extraction, peer identification, ratio calculation, and synthesis stages; or sequencing a due diligence review across legal, financial, and operational workstreams)
- Context management – The system maintains awareness of project scope, prior findings, and user preferences across extended work sessions (e.g. remembering that you're evaluating targets through an ESG lens and applying that filter consistently; or tracking which sections of a 200-page contract have already been reviewed for specific clause types)
- Subagents – Claude Code can dispatch specialised subsidiary agents to handle discrete research tasks (e.g. spawning a web search subagent to pull the latest quarterly earnings releases while simultaneously extracting data from uploaded financial statements; or researching regulatory precedents while drafting compliance documentation)
- Tool handling – The harness orchestrates calls to external data sources and services, managing retrieval and parsing (e.g. searching for recent SEC filings or analyst commentary, fetching current market data to contextualise historical analysis, or pulling case law summaries relevant to a contract dispute)
- Skills – Pre-built capability modules encode best practices for document production (e.g. the docx skill generates properly formatted investment memos with tracked changes for review cycles; the xlsx skill builds financial models with formula preservation and automatic recalculation; the pptx skill creates client-ready pitch decks with consistent styling; the pdf skill extracts and processes data from scanned contracts or regulatory filings)
- Prompt Presets – Configurable instruction templates establish consistent analytical standards and house style (e.g. "always cite primary sources," "use UK English and formal register," "include risk factors in every investment recommendation," or "flag any GDPR implications in contract reviews")
- File system access (permissions) – Granular controls govern which directories and files the system can read or modify (e.g. granting read access to a client's data room while restricting write permissions to your firm's working folder; or allowing document creation in /outputs while keeping sensitive template files in /templates as read-only)
All of these can be created by the user (or by Claude), and shared across an organisation, presenting a huge possibility space of customisation options.
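To make the file system permissions item concrete, here is a minimal sketch of the /outputs and /templates example from the list: reads are allowed in both folders, writes only in /outputs. The rule lists and the `is_allowed` helper are hypothetical names invented for this illustration; this is not Claude Code's actual configuration format or enforcement logic.

```python
from pathlib import PurePosixPath

# Hypothetical rules mirroring the example in the list above:
# both folders are readable, but only /outputs is writable.
READ_ROOTS = [PurePosixPath("/outputs"), PurePosixPath("/templates")]
WRITE_ROOTS = [PurePosixPath("/outputs")]

def is_allowed(path: str, mode: str) -> bool:
    """Return True if `path` sits under a root that permits `mode` ('read' or 'write')."""
    if mode not in ("read", "write"):
        raise ValueError(f"unknown mode: {mode}")
    roots = READ_ROOTS if mode == "read" else WRITE_ROOTS
    target = PurePosixPath(path)
    # Reject '..' segments: pure paths are never resolved against a real file system,
    # so '/outputs/../templates/x' would otherwise look like it lives under /outputs.
    if ".." in target.parts:
        return False
    return any(target.is_relative_to(root) for root in roots)

print(is_allowed("/outputs/memo.docx", "write"))    # True
print(is_allowed("/templates/memo.docx", "write"))  # False
print(is_allowed("/templates/memo.docx", "read"))   # True
```

The point of the design is that the agent's reach is decided by explicit rules outside the model rather than by the model's own judgement.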
These components also elegantly address some of the fundamental limitations of LLMs, specifically relating to long- and short-term memory: planning externalises reasoning into persistent, auditable steps rather than relying on the model to hold everything in its head; context management acts as a form of working memory that prioritises relevant information within the finite context window; skills encode institutional knowledge that would otherwise need to be re-explained in every session; and prompt presets preserve house style and analytical standards that the model has no innate way to retain between conversations.
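The planning and subagent ideas can be illustrated with a toy sketch. It takes the competitive-position example from the list above, writes the plan down as explicit steps with dependencies, and dispatches independent steps in parallel. Everything here is an assumption made for illustration: the `PLAN` structure, the `run_subagent` stub and the "wave" scheduling are hypothetical and say nothing about how Claude Code is implemented internally.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical plan: each step maps to the set of steps it depends on.
PLAN = {
    "financial_statement_extraction": set(),
    "peer_identification": set(),
    "ratio_calculation": {"financial_statement_extraction", "peer_identification"},
    "synthesis": {"ratio_calculation"},
}

def run_subagent(step: str) -> str:
    """Stand-in for a subagent; a real one would call a model and tools."""
    return f"{step}: done"

def execute_plan(plan: dict[str, set[str]]) -> list[list[str]]:
    """Run the plan in waves; every step in a wave has all its dependencies met."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()
    waves = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            # Steps in the same wave are independent, so they run in parallel.
            for result in pool.map(run_subagent, ready):
                print(result)
            for step in ready:
                sorter.done(step)
            waves.append(ready)
    return waves

for number, wave in enumerate(execute_plan(PLAN), start=1):
    print(f"wave {number}: {wave}")
```

Because the plan lives in a plain data structure rather than in the model's context, it can be inspected, audited and resumed, which is the property the paragraph above describes.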
This offers a glimpse into how knowledge work might change in the coming months and years: individuals marshalling tens of AI agents in parallel across a diverse range of discrete tasks and projects. Some tasks may take minutes, others hours, so the human manager must act as an orchestral conductor, giving little nudges and signals to set the agents off on new jobs as they complete the previous ones. Adapting to this new paradigm will take time, as existing professional workflows are deeply etched grooves in organisational structures and in the minds of the professionals themselves. In the near term, the operation of Claude Code via a command-line interface (CLI) will feel unfriendly, inaccessible and anachronistic to non-coders, harking back to the days of MS-DOS in the 80s.
Unfriendly UI: One of our early attempts to create a research-assistant subagent
Just as Windows arrived a few years later to abstract away the unnatural command prompt with intuitive visualisations, Claude Code will evolve into a friendlier format as it diffuses to finance, accounting, law, marketing and everywhere else. But unlike Windows, it won't take years - the AI labs are actively targeting the high-value non-coding professions, hiring investment bankers, lawyers and accountants in droves to generate signals for reinforcement learning environments (numerous third-party data providers have sprung up to provide this service).
If all of the complexity is to be abstracted away, then why bother diving into this 'MS-DOS' phase? Maybe there's no need. But adopting in these early days might be the only opportunity to understand these systems deeply, before rising underlying complexity makes it impossible. It also allows small, dynamic teams, such as ours at Green Ash, to augment our productivity and distinguish ourselves from slower-moving corporate behemoths.
We should all take Karpathy's advice: "Roll up your sleeves to not fall behind".