That’s the real “secret” to successful AI transformations.
If you work in a product, engineering, or data science role, you’ve likely noticed a strange contradiction: Building AI prototypes has become absurdly easy, while building _successful_ AI solutions in production can end up being as hard as it always was.
Why is that? A few things stand out:
- **Model creation is now trivial** compared to the past. You can get remarkable performance from foundation models and a natural language prompt, and even more with some modest fine-tuning of those models.
- **Yet nothing has changed about how we should build machine learning powered applications.** We still need rigorous data science workflows to design, evolve, and "productionize" AI systems.
- **Meanwhile, observability got more complicated.** You’re now debugging huge text traces instead of numeric logs of explanatory and response variables.
- **The model capability border hasn’t shifted as much as some people assume**: large language models (LLMs) still lack goal-seeking behavior and self-awareness. They behave much more like child prodigies than wise geniuses. I believe we are still years away from truly autonomous, goal-seeking AI, so understand what is feasible with current AI and what is not.
The ease with which you can build a working prototype today creates a powerful illusion: _“AI (Machine Learning) is easy with LLMs.”_ I postulate that this illusion is the trap responsible for the [ever-increasing (MIT NANDA, Jul 2025: 95%)](https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf) estimates of [failed AI projects (RAND, Aug 2024: 80%)](https://www.rand.org/pubs/research_reports/RRA2680-1.html). We are still learning what is easy today and what remains hard.
Although _models_ are easier to create, **the surrounding data science workflow has barely changed**. In fact, in some ways, it has gotten more complex. Succeeding in the AI race is all about creating a strong data science culture in your company and teaching machine learning principles to everyone.
This post will explain exactly why, and use the example of building an AI-based customer-support agent to show that the need for rigorous data science hasn't changed.
## The illusion of ease
Let’s admit it: we’re spoiled compared to the ML teams of 2017 when it comes to building models and prototypes.
You can spin up a model in minutes that outperforms what entire research groups struggled with not that long ago. You can fine-tune small models, run RAG pipelines, or deploy agents without wrestling with CUDA kernels or feature store engineering marathons.
From the outside, it looks like this:
> - _Modeling now is commoditized._
> - _The system can learn everything via prompts and lookups._
> - _Just plug in an LLM and make [it] work._
And that last line is precisely the problem.
Because the parts of data science that were always difficult—clarifying the goal, defining success, building evaluation harnesses, validating data quality, calibrating confidence, understanding the long tail, collaborating with SMEs—are still absolutely required. In fact, those steps have become **more** important because modern AI systems are way more flexible. Today's LLMs can give the *impression* of correctness while being catastrophically wrong.
With LLMs, the modeling step became easy. But everything else is still needed when [building machine learning powered applications (Ameisen, 2020)](https://howtolearnmachinelearning.com/books/machine-learning-books/building-machine-learning-powered-applications-going-from-idea-to-product/).
## What hasn’t changed
You still need to take the following steps, even if marketing says otherwise.
### 1. You still require a measurable definition of success
If you don’t decide what “good” looks like upfront, and [how to measure it (Managing ML Projects: Measuring success; Google, 2025)](https://developers.google.com/machine-learning/managing-ml-projects/success), an LLM will confidently lead you on a magical tour of doing everything except the thing you needed.
### 2. You still need empirical rigor
A test data harness is not optional. There is still data drift, model change, hallucinations, and emergent behavior in complex systems. If you don’t measure frequently and consistently, you won’t even notice the regressions. In the LLM community this is known as ["evals" (Husain, Nov 2025)](https://hamel.dev/blog/posts/evals-faq/).
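To make that concrete, here is a minimal sketch of such a harness in Python. The `call_llm` function, the keyword-based grading, and the `eval_cases.jsonl` file are illustrative assumptions standing in for your actual model client, grading logic, and test data:

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for your actual model call (hosted API, local model, etc.)."""
    raise NotImplementedError

def passes(expected_keywords: list[str], answer: str) -> bool:
    """Deliberately simple grading: does the answer mention every required fact?"""
    return all(k.lower() in answer.lower() for k in expected_keywords)

def run_evals(path: str = "eval_cases.jsonl") -> float:
    """Run every recorded case and report the pass rate.

    Each line in the file is expected to look like:
    {"prompt": "...", "expected_keywords": ["refund", "14 days"]}
    """
    results = []
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            answer = call_llm(case["prompt"])
            results.append(passes(case["expected_keywords"], answer))
    pass_rate = sum(results) / max(len(results), 1)
    print(f"{sum(results)}/{len(results)} cases passed ({pass_rate:.1%})")
    return pass_rate
```

Run it on every prompt change, model swap, or data refresh, and track the pass rate over time; that is what turns anecdotes into measurements.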
### 3. You still want confidence scoring
LLMs don’t naturally say “I’m unsure.” You must construct that signal yourself; otherwise the system will simply guess, articulate confidently, and hallucinate. Without it, you cannot calibrate your responses or let users pick their own confidence thresholds [(Steyvers et al., "What large language models know and what people think they know", Nature Machine Intelligence, Jan 2025)](https://www.nature.com/articles/s42256-024-00976-7).
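One practical way to construct such a signal is self-consistency sampling: ask the model the same question several times and treat agreement as a rough confidence proxy. A minimal sketch, again with `call_llm` as a placeholder for whatever model client you use:

```python
from collections import Counter

def call_llm(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder for your actual model call."""
    raise NotImplementedError

def answer_with_confidence(prompt: str, n_samples: int = 5) -> tuple[str, float]:
    """Sample the model several times; the agreement fraction acts as a confidence proxy."""
    samples = [call_llm(prompt) for _ in range(n_samples)]
    # Naive normalization so trivially different phrasings still count as agreement.
    normalized = [s.strip().lower() for s in samples]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / n_samples
```

The agreement fraction is a proxy, not a calibrated probability; you would still calibrate it against labeled outcomes before letting it drive decisions.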
### 4. You still need interpretability
Users won’t trust an opaque agent: [you need to explain machine learning models (Lakshmanan; Google Cloud AI & ML, Jun 2021)](https://cloud.google.com/blog/products/ai-machine-learning/why-you-need-to-explain-machine-learning-models). You need to provide structured explanations, escalation reasoning, and transparent behavior for a successful machine learning system.
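One way to make explanations and escalation reasoning first-class citizens is to force the model to return structured output instead of free text, so every answer carries its own rationale. A small sketch; the schema and field names are my own convention, not from any particular framework:

```python
from dataclasses import dataclass
import json

@dataclass
class ExplainedAnswer:
    answer: str
    reasoning: str               # short, user-facing explanation
    sources: list[str]           # documents or policies the answer relied on
    escalate: bool = False       # whether a human should take over
    escalation_reason: str = ""  # why, if escalate is True

def parse_model_output(raw: str) -> ExplainedAnswer:
    """Parse the JSON the model was instructed to emit; fail loudly on anything else."""
    data = json.loads(raw)
    return ExplainedAnswer(
        answer=data["answer"],
        reasoning=data["reasoning"],
        sources=data.get("sources", []),
        escalate=data.get("escalate", False),
        escalation_reason=data.get("escalation_reason", ""),
    )
```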
### 5. Long-tailed problems are still long-tailed
The danger of a project turning into a [diseconomy of scale (a16z, Aug 2020)](https://a16z.com/taming-the-tail-adventures-in-improving-ai-economics/) is as real today as it was before the arrival of LLMs. The bar for which kinds of automation or prediction tip into a diseconomy of scale has risen, but it has not disappeared.
If anything, LLMs _increase_ the surface area of potential failure. You now have more behaviors to worry about, not fewer.
### 6. Your outputs are still limited by your inputs
Bad data does not magically transform into truth simply because the model is large and poetic. Improve your data quality first.
Plus, prompts are very susceptible to the context they see: ask an LLM to produce clean code with a good software engineering book as context and you might get reasonable results. Prompt it with just "create a well-designed blogging site that can serve thousands of visitors per day" and you'll probably get a fancy hallucination that looks (dangerously) impressive.
This is just the old adage: "garbage in, garbage out", aka GIGO.
### 7. SMEs are still indispensable
Subject matter experts know the edge cases, the institutional weirdness, the unwritten rules. AI cannot infer any of this unless you teach it those cases, and that can be painstaking work. Like "garbage in, garbage out", this isn't terribly surprising, nor is it limited to machine learning projects. If you do not engage the SMEs from the early prototype on, you will likely fail to address all their needs.
All of these pillars of the data science workflow remain intact. None have been deprecated. And with a higher bar of what is possible, many have become more complex.
## So what _is_ new?
Modeling has become genuinely accessible to the masses, but in exchange you now have to cope with more challenging observability and calibration.
### 1. Modeling has become trivial
If you accept that prompts are basically models, then modeling has become trivial. For most problems, a manually tuned prompt is all you need.
You no longer *need* to understand the parameters of a Support Vector Machine, how to tune Gradient Descent, or even how to train a Neural Network. For many use-cases, just crafting a robust prompt and adding some guardrails is sufficient to get a reasonable "model" off the ground.
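To illustrate what "a robust prompt plus some guardrails" can look like, here is a hedged sketch; the company name, blocked-topic list, and `call_llm` are illustrative placeholders:

```python
SYSTEM_PROMPT = """You are a support assistant for ACME (a hypothetical company).
Answer only questions about ACME products.
If you are not sure, say "I don't know" instead of guessing."""

BLOCKED_TOPICS = ("legal advice", "medical advice", "refund override")

def call_llm(system: str, user: str) -> str:
    """Placeholder for your actual model call."""
    raise NotImplementedError

def guarded_answer(user_message: str) -> str:
    """A prompt plus two trivial guardrails: an input filter and an output check."""
    if any(topic in user_message.lower() for topic in BLOCKED_TOPICS):
        return "I can't help with that. Let me connect you to a human agent."
    answer = call_llm(SYSTEM_PROMPT, user_message)
    if "i don't know" in answer.lower():
        return "I'm not sure about this one. Let me connect you to a human agent."
    return answer
```

The prompt plus those two checks effectively *is* the "model" here, which is exactly why it is so cheap to get off the ground.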
And you can do more with smaller models than ever before. For example, you can now (low-rank) adapt large models, which means you can fine-tune them faster and on less beefy GPUs (e.g., with [unsloth.ai](https://unsloth.ai/) or with [Hugging Face's PEFT library](https://huggingface.co/docs/peft/index)). A fine-tuned smaller model can even match an untuned, larger model on your task, while giving you much more efficient inference.
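As a rough illustration of how small the barrier has become, a LoRA setup with Hugging Face's PEFT library can fit in a handful of lines. The base model and hyperparameters below are placeholders, not recommendations:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base = "gpt2"  # placeholder; swap in whatever base model you actually use
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Low-rank adapters: only a small number of extra weights gets trained.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the adapter matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection in GPT-2; model-specific
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model
# From here, train with your usual Trainer or training loop on domain data.
```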
That said, it’s also dangerously easy to overfit, distort behavior, or create brittle systems with a narrow world understanding and hallucinations, especially if you do not follow the data science doctrine described above.
### 2. Observability turned into reading novels
In classic machine learning, introspection meant analyzing feature distributions, model coefficients, network weights, and confusion matrices. Now, introspection means combing through pages of generated reasoning, tool calls, draft responses, and self-corrections.
We used to log numbers. Now we log thought trails. Making sense of that to find patterns and signals can be a lot trickier than you imagine at first. And you might need to address privacy and security concerns when logging traces.
LLM tracing frameworks like [LangSmith](https://www.langchain.com/langsmith/observability), [LangFuse](https://langfuse.com/), or [PromptFlow](https://microsoft.github.io/promptflow/) exist precisely to help in that area, but you still need to tease out the learnings and tune these tools to your needs. Automating this process is the frontier of building agentic systems.
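Even without one of those frameworks, the underlying idea is to record every step of an agent run as structured events that you can later filter and aggregate. A minimal, framework-agnostic sketch; the event fields are my own convention:

```python
import json
import time
import uuid

class TraceLogger:
    """Append structured events for one agent run to a JSONL file."""

    def __init__(self, path: str = "traces.jsonl"):
        self.path = path
        self.run_id = str(uuid.uuid4())

    def log(self, step: str, **payload) -> None:
        event = {"run_id": self.run_id, "ts": time.time(), "step": step, **payload}
        with open(self.path, "a") as f:
            f.write(json.dumps(event) + "\n")

# Usage inside an agent loop:
trace = TraceLogger()
trace.log("user_message", text="How do I reset my password?")
trace.log("tool_call", tool="search_docs", query="password reset")
trace.log("model_output", text="...", tokens=512)
# Later: load the JSONL, group by run_id, and mine the traces for recurring failure patterns.
```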
Finally, you can ask LLMs to estimate their confidence in a piece of generated content, but that estimate is itself prone to hallucinations and model biases. Getting robust confidence values is not possible with today's LLMs ([unlike, say, AlphaFold; PNAS, Aug 2024](https://www.pnas.org/doi/10.1073/pnas.2315002121)), which makes it harder to trust the results.
### 3. Model capability borders have vastly expanded
An often-cited analogy nails it: Today's LLMs are like extremely bright children with zero wisdom. They can explain calculus but can’t sense when you’re confused about fractions. They [win gold at the math olympiad (OpenAI, 2025)](https://intuitionlabs.ai/articles/ai-reasoning-math-olympiad-imo), but lack goal-seeking behavior and adaptive self-reflection - the hallmarks of true intelligence. They don’t truly care about what you need; they only know how to mirror the human mind.
This is why LLMs cannot yet be tutors, strategists, planners, or _true_ agents in the human sense. You must design systems that respect that limitation. It is exactly why Andrej Karpathy [recently said (Dwarkesh Podcast, Oct 2025)](https://www.dwarkesh.com/p/andrej-karpathy) that we are not in the year of agents; it is "a lot more accurately described as the decade of agents."
## A realistic example: Building an AI customer support agent
Let’s walk through a typical project that looks straightforward from the outside but quickly exposes why data science rigor matters more than ever in today's "AI world".
The project idea is simple: “Let’s automate support!” The team’s initial excitement is understandable. An AI agent that answers user questions, uses retrieval-augmented generation, can call internal APIs via the model context protocol (MCP), and escalates when needed? The perfect use case for AI!
They plug in an LLM. It works surprisingly well for early demos. Spirits rise. But then reality arrives.
**Demonstrating success becomes the first challenge.**
Is the goal fewer tickets? Faster responses? Higher satisfaction? Safety? The team realizes the model performs well in demos but doesn't necessarily solve the actual business problem that created the need. So the team identifies the success metrics and then reproducibly and continuously measures success to improve the system. Building a test harness becomes the first piece of real data science work.
**Next comes understanding capability limits.**
The model is excellent at troubleshooting. But it is terrible at detecting subtle frustration, negotiating cancellations, or identifying when the user is confused. It lacks human-like judgment. The team needs to filter out requests that are too risky to resolve with an LLM and escalate those to human staff.
**Subject Matter Experts (SMEs) point out countless quirks.**
Special support cases, regional exceptions, never-do-this rules, compliance constraints. None of this lived in the existing support manuals. Resolving all edge cases without regressing in other areas turns out to be another unplanned challenge.
**Debugging becomes a text adventure.**
Each query generates pages of reasoning traces. The agent tries tools, revises itself, spirals into tangents, or confidently invents nonexistent API parameters. Digging through those traces to find common, worthwhile issues to fix becomes harder as the project progresses.
Luckily, we have LLMs to help make sense of the LLM outputs (no pun intended).
**A massive test harness becomes mandatory.**
The team builds hundreds of scenarios covering everyday questions, escalations, edge cases, policy traps, and failure modes. This becomes the only way to keep the system stable.
**Confidence scoring becomes critical.**
Low confidence → escalate. High confidence → answer. Incorrect confidence calibration → disaster looms. In the worst case, the system needs to be very conservative, but that puts a dampener on the initial success metric...
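A tiny sketch of that routing logic; the threshold is an assumption you would tune on labeled, held-out conversations rather than pick by gut feeling:

```python
ESCALATION_THRESHOLD = 0.8  # placeholder; tune against labeled conversations

def route(answer: str, confidence: float) -> str:
    """Answer automatically only when the calibrated confidence clears the bar."""
    if confidence >= ESCALATION_THRESHOLD:
        return answer
    return "I'm handing this over to a human colleague, just to be safe."
```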
**Fine-tuning helps… until it breaks.**
The team finds that a domain-tuned model performs brilliantly, until they feed it new variants and it all collapses. They realize they’ve overfit: the model learned the examples "by heart" but failed to generalize.
In the end, more and better-balanced data fixes it, but of course that was more "unforeseen" work.
By this point, everyone understands the real lesson here:
The _model_ is the easy part; with today's AI, getting a working 80% prototype is really, really fast.
Building a robust _workflow_ is still a challenging **data science** project.
---
The [Transformer-based generative AI revolution (FT.com explaining Transformers, Sep 2023)](https://ig.ft.com/generative-ai/) changed the tooling, greatly expanded what is possible, and lowered barriers to entry. But it did not eliminate any of the other parts of building reliable machine learning systems. If anything, it added to them.
So this is why, in my eyes, companies fail at implementing AI: not because the models aren’t powerful enough, but because we still need to follow the processes, rigor, and collaboration that have always defined successful data science projects. And we are only just learning to identify the border between easy and hard AI problems, while many new teams working on AI projects have little or no machine learning background. Therefore, join me in my quest to enable more people to think and work like true data scientists!