On the various "Year In Review 2025" blogs
From https://snscratchpad.com/posts/looking-ahead-2026/
“What matters is not the power of any given [LLM] model, but how people choose to apply it to achieve their goals.”
This matches what I’ve seen in practice. The people who benefit most from these tools are the ones who integrate them into their workflows with intent, not those who chase every new model or capability. The marginal gains from more powerful models don’t meaningfully change behavior for users who never made the initial leap, while the users who already adopted the tools continue to reap the benefits and apply them effectively toward their goals, provided they avoid getting distracted by constant tinkering.
“humans being equipped with these new cognitive amplifier tools as we relate to each other [is] the product design question we need to debate and answer”
There’s been a lot of literature around agent-to-agent communication, MCPs, and protocol design, but comparatively little attention to how humans communicate with each other in a post-LLM world. IRL, we don’t assume baseline LLM usage the way we implicitly assume users will use Web 2.0 products like search or maps. That gap matters. Maybe the answer is a different GUI for LLMs that democratizes access. Maybe it’s just time and gradual normalization, similar to how paper forms, PDFs, and DocuSign still coexist decades into “digital transformation.” Either way, this feels underexplored relative to its importance.
“For AI to have societal permission it must have real world eval impact.”
Coding agents are a good example. Engineers didn’t adopt them because they scored well on SWE-bench; they adopted them because they worked on engineers’ own real-life problems. Evaluation has consistently been the key factor for AI-native applications: evals that mirror real user behavior, combined with interpretation grounded in business outcomes rather than leaderboard scores, are what actually drive trust and adoption beyond the tech community.
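To make that concrete, here’s a minimal sketch of what an eval grounded in real user behavior could look like. Everything named in it (Transcript, resolved_without_followup, the sample prompts) is hypothetical, and the success check is just a stand-in for a real business-outcome metric:

```python
# A minimal sketch of an eval grounded in real user behavior rather than a
# public leaderboard. All names here are hypothetical illustrations, not any
# particular eval framework.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Transcript:
    prompt: str  # a real user request, sampled from production logs

def resolved_without_followup(response: str) -> bool:
    # Hypothetical business-outcome proxy: count a response as successful
    # only if it doesn't punt the task back to the user.
    punts = ("i can't", "please clarify", "unable to")
    return not any(p in response.lower() for p in punts)

def run_eval(model: Callable[[str], str], transcripts: list[Transcript]) -> float:
    # Score the model on sampled real interactions, not synthetic puzzles;
    # the metric approximates the business outcome (task resolution rate).
    hits = sum(resolved_without_followup(model(t.prompt)) for t in transcripts)
    return hits / len(transcripts)

if __name__ == "__main__":
    fake_model = lambda prompt: "Here is the summary you asked for."  # stand-in
    sample = [Transcript("Summarize this incident report."),
              Transcript("Draft a reply to this customer email.")]
    print(f"task resolution rate: {run_eval(fake_model, sample):.0%}")
```

The point is the data source and the metric: prompts sampled from production, scored on whether the user’s task got done, not on a leaderboard number.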
From https://karpathy.bearblog.dev/year-in-review-2025/
“By training LLMs against automatically verifiable rewards across a number of environments (e.g. think math/code puzzles), the LLMs spontaneously develop strategies that look like “reasoning” to humans - they learn to break down problem solving into intermediate calculations and they learn a number of problem solving strategies for going back and forth to figure things out (see DeepSeek R1 paper for examples).”
“[LLMs] are at the same time a genius polymath and a confused and cognitively challenged grade schooler, seconds away from getting tricked by a jailbreak to exfiltrate your data”
“Training on the test set is a new art form.”
Eval becomes both a competitive moat and a failure mode. Where you source your eval data and how closely it reflects reality can make or break an AI-native product in a single interaction. To the user, it either feels like magic or it doesn’t. There’s very little middle ground.
“Will the LLM labs capture all applications or are there green pastures for LLM apps [like Cursor]?”
We’re back to the thick client vs. thin client debate (I still remember running tests to verify a BlackBerry thin client migration!), but as we saw with single-page applications (SPAs), the likely outcome is a hybrid.
“Claude Code (CC) emerged as the first convincing demonstration of what an LLM Agent looks like […] Anthropic got this order of precedence correct and packaged CC into a delightful, minimal CLI form factor that changed what AI looks like - it’s not just a website you go to like Google, it’s a little spirit/ghost that “lives” on your computer. This is a new, distinct paradigm of interaction with an AI.”
The fundamental shift was from tab-to-autocomplete (like the original GitHub Copilot) to agents controlling more of development from the terminal: launching builds, running tests to validate their own work, and even searching blog posts like this one for insight into what we actually want them to do. It reframes the interaction model. Devs, of all people, resent constant interruptions and the context switching that comes with every suggested way to write a function. The UX that worked was for devs to explain the problem and a potential solution in plain English, delegate the execution to the AI agent, and finally review the working result it provides. The agent works autonomously while the human validates.
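As a sketch of that delegate-and-review loop (not any real agent’s internals), with propose_patch standing in for the LLM call and make build / make test as assumed project commands:

```python
# A sketch of the delegate-and-review loop described above: the human states
# the problem once, the agent iterates (edit, build, test) on its own, and the
# human only reviews the finished result. `propose_patch` stands in for an LLM
# call; the make targets are assumed project conventions.
import subprocess

def run(cmd: list[str]) -> bool:
    # Run a shell command; success means exit code 0.
    try:
        return subprocess.run(cmd, capture_output=True).returncode == 0
    except FileNotFoundError:
        return False

def agent_loop(task: str, propose_patch, max_iters: int = 5) -> bool:
    # The agent works autonomously: patch, build, test, repeat.
    for attempt in range(max_iters):
        propose_patch(task, attempt)               # LLM edits files (stubbed)
        if run(["make", "build"]) and run(["make", "test"]):
            return True                            # hand back for human review
    return False                                   # give up and escalate
```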
“not only does vibe coding empower regular people to approach programming, it empowers trained professionals to write a lot more (vibe coded) software that would otherwise never be written […] code is suddenly free, ephemeral, malleable, discardable after single use”
Ephemeral software still feels underutilized. There’s a large opportunity in treating code as a just-in-time (JIT) artifact rather than something that must always harden into a maintained system. This is a good area to expand on in the future.
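To make “discardable after single use” concrete, this is the kind of script in question: vibe coded for one task, run once, then deleted rather than maintained. The file name and the dedupe rule are purely illustrative:

```python
# A single-use, throwaway script: dedupe a contact list by email, once.
# `contacts.csv` and the column layout are made-up assumptions.
import csv

seen: set[str] = set()
with open("contacts.csv", newline="") as src, \
     open("contacts_deduped.csv", "w", newline="") as dst:
    reader, writer = csv.reader(src), csv.writer(dst)
    writer.writerow(next(reader))          # copy the header row
    for row in reader:
        email = row[1].strip().lower()     # assume column 1 holds the email
        if email not in seen:              # keep only the first occurrence
            seen.add(email)
            writer.writerow(row)
```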
“[On Nano Banana being the first hint of an LLM GUI] in terms of the UIUX, “chatting” with LLMs is a bit like issuing commands to a computer console in the 1980s. Text is the raw/favored data representation for computers (and LLMs), but it is not the favored format for people, especially at the input. People actually dislike reading text - it is slow and effortful. Instead, people love to consume information visually and spatially and this is why the GUI has been invented in traditional computing. In the same way, LLMs should speak to us in our favored format - in images, infographics, slides, whiteboards, animations/videos, web apps, etc.”
Three years after ChatGPT, LLM interaction is still dominated by chat, whether from the web, an app, or the CLI, and text remains the primary input and output. Yet when has someone actually read the meeting pre-read document or the competitive analysis you worked on? Model quality has advanced rapidly, but the UI layer has lagged. That gap represents one of the highest-leverage areas for progress: making LLM inputs and outputs easier for people to use without having to dive deep into the tech behind it all.
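One low-tech sketch of that direction: prompt the model for structured output and render it visually instead of dumping prose. The JSON payload below is a made-up example of what a model could be asked to return; nothing here is a real product’s API:

```python
# A sketch of a post-chat UI: instead of showing raw model text, ask for
# structured output and render it visually. The payload is an invented sample
# of what a model could be prompted to produce.
import json

def render_as_html(report: dict) -> str:
    # Turn structured model output into a simple bar-chart-style HTML page.
    bars = "\n".join(
        f'<div style="background:#4a90d9;width:{item["score"] * 100}%">'
        f'{item["label"]}: {item["score"]:.0%}</div>'
        for item in report["items"]
    )
    return f"<html><body><h1>{report['title']}</h1>{bars}</body></html>"

if __name__ == "__main__":
    payload = json.loads("""{
        "title": "Competitive analysis, at a glance",
        "items": [
            {"label": "Feature coverage", "score": 0.72},
            {"label": "Pricing fit", "score": 0.55}
        ]
    }""")
    print(render_as_html(payload))
```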
–
… and finally we have the Pantone Color of the Year 2026 – Cloud Dancer!
context window – the blog