How LLMs Actually Got Here
A walkthrough of how we went from recurrent neural networks to ChatGPT, and where things stand now.

If you started paying attention to AI when ChatGPT blew up in late 2022, you missed about five years of context that makes the current moment less magical and more understandable. I want to fill in that context, because it changes how you think about what these tools can do, what they can't, and where they're probably going.
Going roughly chronologically, but not hitting every milestone. Just the ones that actually changed how things work. If the foundational math interests you, I wrote a separate piece on how neural networks work under the hood, with code you can run.
Before Transformers
Natural language processing existed long before the current wave. The tools were just worse.
For most of the 2010s, the dominant approach was Recurrent Neural Networks, specifically LSTMs (Long Short-Term Memory networks). The idea: feed the network one word at a time, in order, letting it build an internal state as it processes the sequence. After reading a whole sentence, the internal state encodes (in theory) the sentence's meaning.
First problem: sequential processing. You couldn't parallelize because each word depended on the output from the previous word. Training was slow. GPUs sat there with most cores idle, processing one token at a time.
Second, bigger problem: the network forgot things. A 500-word paragraph? By the end, early words had been overwritten by recent ones. The "memory" degraded over distance. LSTMs were designed to fix this with gates controlling what to remember and forget. Helped, but only to a point. Long documents remained a struggle.
Text quality from these models was not great. Grammatically plausible sentences, but coherence fell apart after a few sentences. No sustained logic. No topic consistency. Plausible-sounding word salad.
2017: Attention Is All You Need
A Google research team published a paper with that title and it changed the field. Not gradually; essentially overnight, as AI research timescales go.
The Transformer architecture does something completely different from RNNs: it processes the entire input at once, in parallel. Instead of reading word by word, it examines all words simultaneously and computes relationships between every pair. The mechanism is called "attention": for each word, the model calculates how much to attend to every other word in the sequence.
This solved both problems simultaneously. Parallelization: every word is processed at once, putting thousands of GPU cores to work. Long-range dependencies: the word at position 1 has a direct connection to the word at position 500, with no information needing to survive 499 sequential processing steps.
Practical impact: much bigger models trained much faster, handling much longer inputs with better coherence.
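The attention computation itself is compact enough to sketch. Here's a single attention head in NumPy, minus the learned projection matrices and multi-head machinery a real transformer adds:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability, then normalize.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # For each query, score it against every key, normalize the scores
    # into weights, and take a weighted sum of the values.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)  # one row of weights per position
    return weights @ V, weights

# Toy example: 4 "words", each represented by an 8-dim vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, w = attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)        # (4, 8): every position sees every other position
print(w.sum(axis=-1))   # each row of attention weights sums to 1
```

Every pair of positions gets a score in one matrix multiply, which is exactly why the whole thing parallelizes where an RNN couldn't.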
I remember reading the paper and thinking it was interesting without being certain it would be as consequential as it turned out. The title felt bold, claiming you didn't need recurrence at all. Within a couple of years, every state-of-the-art NLP system was transformer-based. BERT, GPT, T5: all transformers.
GPT and Scaling
OpenAI's contribution wasn't a new architecture. GPT-1 (2018) and GPT-2 (2019) used the same transformer from Google's paper. What OpenAI did: scale it up. More parameters. More data.
GPT-2 had 1.5 billion parameters, trained on WebText (roughly 8 million web pages). Results were surprisingly good for the time. Coherent multi-paragraph text. Maintained topics across several sentences. Basic question answering. OpenAI initially refused to release the full model over misuse concerns, which generated considerable press.
GPT-3, 2020. 175 billion parameters. A much larger slice of the internet for training data. And this is where things got strange. GPT-3 could do things nobody had specifically trained it to do. Give it a few examples of a task ("translate English to French, here are three examples, now do this one") and it figured out the pattern. This was called "in-context learning" and it surprised everyone, including the people who built it.
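In code terms, "in-context learning" is nothing more than prompt construction. A sketch (the Input/Output layout is my own convention for illustration, not anything the paper specifies):

```python
def few_shot_prompt(task, examples, query):
    # The "learning" happens entirely inside the prompt: the model
    # infers the pattern from the examples at inference time.
    # No weights change.
    lines = [task]
    for inp, out in examples:
        lines.append(f"Input: {inp}\nOutput: {out}")
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("cat", "chat"), ("bread", "pain")],
    "house",
)
print(prompt)  # ends with an open "Output:" for the model to complete
```

Send that string to GPT-3 and it continues the pattern. Three examples of anything, and it generalizes, which is the part that surprised everyone.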
The leading theory: these abilities emerged from scale. More parameters, more data, more compute. At some point the model got good enough at statistical pattern matching that it appeared to understand language. Whether it actually understands anything is a philosophical question without a satisfying answer. Practical capabilities were undeniable regardless.
GPT-3's training cost was estimated at several million dollars in compute. That's important context: only a handful of organizations worldwide could afford to train at that scale. Research became concentrated in well-funded labs.
RLHF and ChatGPT
Between GPT-3 and ChatGPT, the raw model capabilities didn't change dramatically. The big shift was in how the model was fine-tuned to interact with humans.
Base GPT-3 is a text completion engine. Predicts the next word. Type "The capital of France is" and it predicts "Paris." Type "How do I bake a cake?" and it might continue with another question, a recipe, or a Wikipedia-style paragraph. It doesn't know it's supposed to be helpful.
ChatGPT added a fine-tuning step called RLHF (Reinforcement Learning from Human Feedback). Human annotators wrote example conversations showing how a helpful assistant should respond. The model was fine-tuned on those examples. Then annotators ranked different outputs from best to worst, and a reward model was trained on those rankings. The language model was then further trained to maximize the reward model's score.
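The reward-model step boils down to a pairwise ranking loss. A simplified scalar sketch of the Bradley-Terry-style objective used in InstructGPT-era pipelines (real reward models score entire token sequences, not single numbers):

```python
import math

def pairwise_ranking_loss(score_chosen, score_rejected):
    # The reward model should score the human-preferred answer higher.
    # Loss shrinks as (chosen - rejected) grows: -log(sigmoid(diff)).
    diff = score_chosen - score_rejected
    return -math.log(1 / (1 + math.exp(-diff)))

# If the reward model already prefers the chosen answer, loss is small:
print(round(pairwise_ranking_loss(2.0, -1.0), 4))  # 0.0486
# If it prefers the rejected answer, loss is large:
print(round(pairwise_ranking_loss(-1.0, 2.0), 4))  # 3.0486
```

Minimizing this over thousands of human rankings gives a model that can score any output, and that score is what the language model is then trained against.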
The result felt like talking to a person. Answering questions. Refusing harmful requests. Maintaining conversational context. Behaving like a polite, knowledgeable assistant. The underlying technology didn't change much โ still next-token prediction. But the user experience transformed from "weird text completion tool" to "thing you can have a conversation with."
That UX shift is what drove viral adoption. ChatGPT reached 100 million users in two months. Most people's first AI interaction was through this interface, which set expectations that were both helpful (it demonstrated what the technology could do) and misleading (the conversational format makes the model seem smarter and more reliable than it is).
The Context Window Race
After ChatGPT, the next competitive axis was context window length. Early models processed about 4,000 tokens at a time (roughly 3,000 words). Document longer than that? The model couldn't see all of it at once.
This mattered for practical applications. Analyzing a contract, summarizing a research paper, asking questions about a codebase: all of these need the model to see the entire document. With 4K context, you had to chop it into pieces and process them separately, losing cross-reference information.
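The chunking workaround looked roughly like this (a naive word-count splitter with overlap; production pipelines split on tokens and semantic boundaries, and the words-to-tokens ratio here is an approximation):

```python
def chunk_words(text, max_words=3000, overlap=200):
    # Split a long document into overlapping chunks so each piece fits
    # a ~4K-token window (assuming roughly 3,000 words per 4K tokens).
    # The overlap limits how much cross-reference context is lost at
    # the seams, but anything spanning distant chunks is still lost.
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

doc = "word " * 7000  # stand-in for a 7,000-word contract
pieces = chunk_words(doc, max_words=3000, overlap=200)
print(len(pieces))  # 3 overlapping chunks
```

Every question then had to be answered per-chunk and the answers merged, which is where the cross-reference information went missing.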
By 2024, context windows had expanded to 128K tokens (Anthropic, OpenAI) and 1 million (Google). This enabled new use cases. In parallel, RAG (Retrieval-Augmented Generation) became the standard pattern for building applications on top of LLMs. Rather than fine-tuning a model on your data (expensive, slow, easy to mess up), you retrieve relevant documents at query time and include them in the prompt. The model generates answers using both training knowledge and provided context.
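The pattern in miniature, with a toy keyword-overlap retriever standing in for the embedding index a real system would use (the documents and wording here are made up for illustration):

```python
def retrieve(query, documents, k=2):
    # Toy retriever: rank documents by word overlap with the query.
    # Real systems use embedding similarity over a vector index.
    q = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def build_rag_prompt(query, documents):
    # Stuff the retrieved passages into the prompt; the model answers
    # from the provided context instead of (only) its training data.
    context = "\n\n".join(retrieve(query, documents))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}\nAnswer:")

docs = [
    "The refund policy allows returns within 30 days of purchase.",
    "Shipping is free on orders over $50.",
    "Support hours are 9am to 5pm on weekdays.",
]
prompt = build_rag_prompt("What is the refund policy for returns?", docs)
print(prompt)
```

The whole system lives or dies on that `retrieve` step, which is why the next paragraph's observation about retrieval quality matters so much.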
I've worked on a few RAG implementations. Results are mixed. When retrieval surfaces the right documents, answers are great. When retrieval misses or returns irrelevant content, the model either hallucinates or gives a vague non-answer. In my experience, retrieval pipeline quality matters more than language model quality, which is a weird inversion of what you'd expect.
Where Things Stand (2026)
Honest assessment: LLMs are very good at some things and unreliable at others. The gap between those two categories is where most engineering work happens.
What works well: transforming text between formats (summarization, translation, reformatting), generating boilerplate code, answering questions when the answer is well-represented in training data, pattern matching against examples, holding natural-sounding conversations.
What doesn't work well: multi-step logical reasoning where each step depends on the previous one, arithmetic beyond simple cases, knowing the limits of their own knowledge, producing factually accurate information consistently. The hallucination problem (generating confident, detailed, completely wrong information) has improved but hasn't been solved.
The way I think about LLMs now: very fast, somewhat unreliable research assistants with a massive but frozen knowledge base and a strong tendency to tell you what you want to hear. Useful? Very. Trustworthy? No. I use them constantly (drafting, code generation, exploring unfamiliar topics) and verify everything. Verification isn't optional. I wrote about the practical side of getting reliable output in my prompt engineering guide.
The "AI will replace developers" narrative seems overblown at current capability levels. AI is making developers faster, not redundant. The bottleneck in software development was never typing speed; it was understanding requirements, designing systems, debugging unexpected behavior, making judgment calls about tradeoffs. LLMs help with some of those (especially debugging) but can't handle the judgment calls independently.
Where things go from here: not sure. There's debate about whether scaling is hitting diminishing returns; whether making models bigger and training on more data keeps producing capability improvements, or whether a plateau is approaching. Some researchers think entirely new approaches are needed. Others think the current trajectory has runway ahead.
The Open Source Counter-Movement
One development worth covering because it changed the competitive dynamics: open-weight models.
When GPT-3 launched, the assumption was that only giant companies could build capable LLMs. The compute costs were prohibitive. But Meta released LLaMA in early 2023, followed by LLaMA 2 later that year, and suddenly researchers and smaller companies had access to models that were competitive with GPT-3.5 for many tasks. Mistral in France released a 7-billion parameter model that punched well above its weight. The open source community fine-tuned these base models for specific tasks and shared the results.
By 2025, you could run a capable LLM on a high-end laptop. Not as good as the latest proprietary models. But good enough for a lot of practical applications: code completion, summarization, data extraction, chatbot interfaces for internal tools. And crucially, running locally means no API costs, no data leaving your network, no dependency on a third party's uptime or pricing decisions.
This changed the calculus for a lot of companies. When the choice was between $0.002 per API call to OpenAI or running your own model on existing hardware, the economics depended on volume. Below a certain usage threshold, the API is cheaper. Above it, self-hosting wins. Some companies now run open-weight models for routine tasks and fall back to proprietary APIs for the hardest queries. A tiered approach that was impossible when only OpenAI and Google had competitive models.
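The break-even arithmetic is simple enough to sketch. Both numbers below are illustrative assumptions, not real pricing:

```python
def breakeven_calls_per_month(api_cost_per_call, monthly_hosting_cost):
    # Below this volume the pay-per-call API is cheaper; above it,
    # self-hosting on fixed-cost hardware wins.
    return monthly_hosting_cost / api_cost_per_call

# Hypothetical: $0.002 per API call vs. $600/month for a GPU server.
print(breakeven_calls_per_month(0.002, 600))  # 300000.0 calls/month
```

At those assumed numbers, a team doing under ~300K calls a month should just use the API; above that, self-hosting starts paying for itself, before accounting for the engineering time it consumes.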
The implication for the industry: the moat around LLM capabilities is narrower than it appeared in 2022. The proprietary models still lead on the hardest benchmarks. But "good enough for production use" is achievable with open-weight models, and the gap keeps closing with each release. Where the real moat lies, if it lies anywhere, is probably in data, fine-tuning expertise, and deployment infrastructure rather than raw model architecture.
The Fine-Tuning vs. Prompting Debate
One practical question that comes up constantly for teams building on LLMs: should you fine-tune a model on your own data, or just write better prompts?
For most teams, prompting wins. Fine-tuning requires collecting and cleaning training data, setting up a training pipeline, managing model versions, and paying significant compute costs. The process takes days to weeks. And if your underlying data changes โ new products, updated policies, different formatting โ you have to retrain.
Prompting with good examples (few-shot) and retrieval-augmented generation gets you 80-90% of the way there for most production use cases. You can update the behavior instantly by changing the prompt or the retrieved documents. No training run. No waiting. The iteration cycle is minutes instead of days.
Where fine-tuning does make sense: when you need the model to adopt a very specific style or handle a niche domain where the base model consistently fails. A medical coding system where precision matters more than flexibility. A customer support bot that needs to match a particular brand voice across thousands of edge cases. A code completion tool trained on your company's internal frameworks. These are cases where the investment in fine-tuning pays for itself because the prompting alternative would require absurdly long context windows stuffed with examples.
The mistake I see teams make: jumping straight to fine-tuning because it feels like the "serious" approach. Try prompting first. Try RAG second. Fine-tune only when you've hit a wall that neither can solve. Most teams never reach that wall.
The Practical Integration Question
For developers, who are most of the audience for this post, the relevant question isn't "how do transformers work" but "how do I use these models in what I'm building?"
The answer has shifted over the past two years. Early on, the standard integration was: call the OpenAI API, get a text response, display it. Simple. Limited.
Now the common patterns are more nuanced. RAG for grounding responses in your own data. Function calling / tool use for letting the model trigger actions in your system. Structured output with JSON mode or Pydantic schemas for reliable data extraction. Agent loops where the model decides what to do next based on observations. Each of these patterns has its own failure modes and best practices that are still being worked out.
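Tool use, stripped to its core, is the model emitting a structured call and the application executing it. A sketch with a hypothetical registry (the JSON shape here is illustrative; every provider's actual tool-call format differs):

```python
import json

# Hypothetical tool registry: the model emits a JSON "call" and the
# application runs the matching function. Names are illustrative.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
    "add": lambda a, b: a + b,
}

def dispatch(model_output):
    # In a real integration, model_output comes back from the API's
    # tool-use response; here it's a plain JSON string.
    call = json.loads(model_output)
    fn = TOOLS.get(call["name"])
    if fn is None:
        # The model can and does invent tool names; handle it.
        return {"error": f"unknown tool {call['name']}"}
    return {"result": fn(**call["arguments"])}

print(dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))
# {'result': 5}
```

An agent loop is just this dispatch inside a while-loop, feeding each result back to the model until it decides it's done.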
The biggest misconception I see among developers new to LLM integration: treating the model like a database. "Ask a question, get the answer." Models don't work that way. They generate plausible continuations of text based on patterns in their training data. Sometimes those continuations are factually correct. Sometimes they're not. The probabilistic nature is inherent, not a bug to be fixed.
Building reliable applications on top of a probabilistic foundation requires defensive engineering: validation of model output, fallback paths for when the model fails, human review for high-stakes decisions, confidence scoring when possible. It's a different discipline from traditional software engineering where function outputs are deterministic. Some developers adapt quickly. Others find it deeply uncomfortable that the same input can produce different outputs on different runs.
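A sketch of that defensive pattern for structured output: validate, retry once, then fall back. `regenerate` is a hypothetical stand-in for a second model call, and the schema keys are made up for the example:

```python
import json

def parse_model_json(raw, schema_keys, retries_left=1, fallback=None):
    # Defensive pattern: validate the model's output, retry on failure,
    # and fall back instead of trusting it blindly.
    try:
        data = json.loads(raw)
        if all(k in data for k in schema_keys):
            return data
    except json.JSONDecodeError:
        pass
    if retries_left > 0:
        return parse_model_json(regenerate(raw), schema_keys,
                                retries_left - 1, fallback)
    return fallback  # last resort: a safe default, or route to a human

def regenerate(previous_raw):
    # Placeholder for "ask the model again"; here it just returns a
    # valid answer so the sketch runs end to end.
    return '{"sentiment": "neutral", "confidence": 0.5}'

good = parse_model_json('{"sentiment": "positive", "confidence": 0.9}',
                        ["sentiment", "confidence"])
bad = parse_model_json("Sure! Here is the JSON you asked for...",
                       ["sentiment", "confidence"])
print(good["sentiment"])  # positive
print(bad["sentiment"])   # neutral (came from the retry)
```

The point isn't this specific helper; it's that every model call in production needs some version of validate, retry, fall back, because the failure case is a matter of when, not if.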
Where Things Go From Here
I try not to predict technology. Been wrong too many times. The honest answer is that nobody, including the people building these models, knows exactly where the ceiling is.
A few things I expect to hold regardless of which specific predictions pan out. The distinction between "AI can do this task" and "AI can do this task reliably enough to deploy without human oversight" will remain significant for years. The costs of running large models will continue to drop, making self-hosting and smaller-model approaches more viable. And the gap between what AI can do in a demo and what it can do in production (with messy data, edge cases, adversarial inputs, and uptime requirements) will remain the main engineering challenge.
The most useful framing I've found: treat LLMs as a new category of tool, not as a replacement for existing ones. A calculator didn't replace mathematicians. A spell checker didn't replace editors. LLMs won't replace developers or writers or analysts; they'll change what those jobs look like, the same way every significant tool does. The people who figure out how to use the tools effectively while understanding their limitations will do well. The people who either reject them entirely or trust them blindly will struggle in different ways.
What seems clear: models available today are useful enough for real workflows, flawed enough that blind trust is dangerous, and improving fast enough that anything written about their limitations might be outdated within a year.
Written by
Anurag Sinha
Full-stack developer specializing in React, Next.js, cloud infrastructure, and AI. Writing about web development, DevOps, and the tools I actually use in production.