How LLMs Actually Got Here
A walkthrough of how we went from recurrent neural networks to ChatGPT, and where things stand now.

If you started paying attention to AI when ChatGPT blew up in late 2022, you missed about five years of context that makes the current moment a lot less magical and a lot more understandable. I want to fill in that context because I think it changes how you think about these tools: what they're good at, what they're not, and where they're likely going.
I'm going to go roughly in chronological order, but I'm not going to hit every milestone. Just the ones that actually changed how things work.
Before Transformers
Natural language processing existed long before the current wave. The tools were just worse.
For most of the 2010s, the dominant approach was Recurrent Neural Networks, specifically a variant called LSTMs (Long Short-Term Memory networks). The basic idea: feed the network one word at a time, in order, and let it build up an internal state as it processes the sequence. After reading a whole sentence, the internal state encodes (in theory) the meaning of the sentence.
The first problem was that processing was sequential. You couldn't parallelize it because each word depended on the output from the previous word. Training was slow: GPUs sat there with most of their cores idle, processing one token at a time.
The second, bigger problem was that the network forgot things. If you fed it a 500-word paragraph, by the time it reached the end, the early words had been overwritten by the more recent ones. The "memory" degraded over distance. LSTMs were specifically designed to fix this with gates that control what to remember and what to forget, and they helped, but only to a point. Long documents were still a struggle.
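To make that concrete, here's a minimal sketch of a vanilla recurrent step in numpy. It's not a full LSTM (the gates are omitted), but it shows both problems: the loop means step t can't start until step t-1 finishes, and information from early tokens only survives by being passed through every later update.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 50, 16

# Toy parameters for a vanilla RNN (an LSTM adds gating on top of this idea).
W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden

def one_hot(token_id):
    v = np.zeros(vocab_size)
    v[token_id] = 1.0
    return v

tokens = [3, 17, 42, 8, 25]   # a pretend five-word sentence
h = np.zeros(hidden_size)     # the "memory" that carries meaning forward

# The loop is the whole problem: each step depends on the previous hidden state,
# so the steps can't run in parallel, and early tokens fade with every update.
for t in tokens:
    h = np.tanh(W_xh @ one_hot(t) + W_hh @ h)

print(h.shape)  # (16,) -- one vector that has to summarize the whole sequence
```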
The quality of text these models produced was not great. If you used any of the text generation tools from 2016 or 2017, you know what I mean. They could produce grammatically plausible sentences but the coherence fell apart after a few sentences. No sustained logic, no consistent topic, just plausible-sounding word salad.
2017: Attention Is All You Need
A team at Google published a paper called "Attention Is All You Need" and it changed everything. Not gradually. Basically overnight, in AI research terms.
The Transformer architecture does something different from RNNs: it processes the entire input at once, in parallel. Instead of reading word by word, it looks at all the words simultaneously and computes relationships between every pair of words in the input. The mechanism for this is called "attention": for each word, the model calculates how much it should "attend to" every other word in the sequence.
This solved both problems at once. Parallelization: since every word is processed simultaneously, you can throw thousands of GPU cores at it and actually use them. And long-range dependencies: a word at position 1 has a direct connection to a word at position 500, so no information has to survive through 499 processing steps.
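Here's what that pairwise computation looks like in the simplest case: scaled dot-product attention in numpy. This is a sketch of a single attention head with made-up toy dimensions, not a full transformer block, but it shows both properties: everything is matrix math with no sequential loop, and every position gets a direct weight on every other position.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V -- every position attends to every position."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (seq, seq): one score per pair of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V             # weighted mix of all positions, computed at once

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                    # toy sizes

X = rng.normal(size=(seq_len, d_model))    # stand-in for word embeddings
W_q = rng.normal(size=(d_model, d_model))  # learned projections (random here)
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)  # (6, 8): same shape as the input, no sequential loop anywhere
```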
The practical impact was that you could now train much bigger models much faster, and the resulting models could handle much longer inputs with better coherence.
I remember reading the paper at the time and thinking it was interesting but not being sure it was going to be as big as it turned out to be. The title was provocative โ claiming you didn't need recurrence at all felt bold. But within a couple of years, basically every state-of-the-art NLP system was transformer-based. BERT, GPT, T5 โ all transformers.
GPT and the Scaling Hypothesis
OpenAI's contribution wasn't a new architecture. GPT-1 (2018) and GPT-2 (2019) used the transformer architecture from the Google paper (specifically, a decoder-only variant of it). What OpenAI did was scale it up and train it on a lot more data.
GPT-2 had 1.5 billion parameters and was trained on a dataset called WebText, about 8 million web pages. The results were surprisingly good for the time. It could generate coherent multi-paragraph text, maintain a topic across several sentences, even do basic question answering. OpenAI initially refused to release the full model because they were worried about misuse, which generated a lot of press.
Then GPT-3 came out in 2020 with 175 billion parameters, trained on a much larger slice of the internet. And this is where things got weird. GPT-3 could do things nobody had specifically trained it to do. You could give it a few examples of a task ("translate English to French, here are three examples, now do this one") and it would figure out the pattern and perform the task. This was called "in-context learning" and it surprised everyone, including the people who built it.
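In practice, in-context learning just means putting the examples in the prompt. Here's a made-up illustration of the kind of few-shot prompt described above; the exact formatting is arbitrary, and the model only ever sees one long string of text.

```python
# A hypothetical few-shot prompt for in-context learning. No weights change;
# the model just continues the pattern it sees in the text.
few_shot_prompt = """Translate English to French.

English: The weather is nice today.
French: Il fait beau aujourd'hui.

English: Where is the train station?
French: Où est la gare ?

English: I would like a coffee, please.
French: Je voudrais un café, s'il vous plaît.

English: The book is on the table.
French:"""

# A base model given this prompt will typically complete it with something like
# "Le livre est sur la table." -- that's the entire trick.
print(few_shot_prompt)
```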
The leading theory was that these abilities emerged from scale. More parameters, more data, more compute. At some point the model just... got good enough at statistical pattern matching that it appeared to understand language. Whether it actually understands anything is a philosophical question I don't have a good answer to. But the practical capabilities were undeniable.
The cost of training GPT-3 was estimated at several million dollars in compute. That's important context because it meant only a few organizations in the world could afford to train models like this. The research became concentrated in a handful of well-funded labs.
RLHF and ChatGPT
Between GPT-3 and ChatGPT, the raw model capabilities didn't change dramatically. The big shift was in how the model was fine-tuned to interact with humans.
Base GPT-3 is a text completion engine. It predicts the next word. If you type "The capital of France is" it predicts "Paris." But if you type "How do I bake a cake?" it might continue with another question, or a recipe, or a Wikipedia-style paragraph. It doesn't know it's supposed to be helpful.
ChatGPT added a fine-tuning step called RLHF (Reinforcement Learning from Human Feedback). The process: human annotators wrote example conversations showing how a helpful assistant should respond. The model was fine-tuned on those examples. Then human annotators ranked different model outputs from best to worst, and a reward model was trained on those rankings. The language model was then further trained to maximize the reward model's score.
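The middle step, the reward model, is the least intuitive part, so here's a tiny numeric sketch of the pairwise ranking loss used in the InstructGPT-style setup: the reward model is trained so that the response the annotators preferred scores higher than the one they rejected. The scalar rewards below are made up; in the real pipeline they come from a neural network reading the full conversation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reward_ranking_loss(r_chosen, r_rejected):
    """Pairwise loss: push the preferred response's reward above the rejected one's.
    loss = -log(sigmoid(r_chosen - r_rejected))"""
    return -np.log(sigmoid(r_chosen - r_rejected))

# Hypothetical reward-model scores for two candidate answers to the same prompt.
print(reward_ranking_loss(r_chosen=2.0, r_rejected=-1.0))   # ~0.05: ranking already correct
print(reward_ranking_loss(r_chosen=-0.5, r_rejected=1.5))   # ~2.13: model gets penalized

# The final stage then fine-tunes the language model with RL (PPO in the
# original InstructGPT recipe) to produce outputs this reward model scores highly.
```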
The result was a model that felt like talking to a person. It would answer questions, refuse harmful requests, maintain conversational context, and generally behave like a polite, knowledgeable assistant. The underlying technology didn't change much (it was still next-token prediction), but the user experience transformed from "weird text completion tool" to "thing I can have a conversation with."
That UX shift is what made it go viral. ChatGPT reached 100 million users in two months. Most people's first interaction with AI was through this interface, which set expectations that I think have been both helpful and misleading. Helpful because it demonstrated what the technology could do. Misleading because the conversational format made the model seem smarter and more reliable than it actually is.
The Context Window Race
After ChatGPT, the next competitive dimension was context window length. Early models could only process about 4,000 tokens at a time (roughly 3,000 words). If your document was longer than that, too bad: the model couldn't see all of it at once.
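Tokens aren't words; they're the subword chunks the model actually reads, and for English the ratio is roughly 4 tokens per 3 words. If you want to know how much of a context window a document will eat, a tokenizer library makes it concrete. A small sketch, assuming the tiktoken package is installed (the exact count depends on which model's tokenizer you pick):

```python
# pip install tiktoken -- OpenAI's open-source tokenizer library.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by the GPT-4-era models

text = "Transformers process the entire input at once, in parallel."
tokens = enc.encode(text)

print(len(text.split()), "words")   # naive word count
print(len(tokens), "tokens")        # what actually counts against the context window
print(tokens[:5])                   # token ids, not words
```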
This mattered for practical applications. If you wanted to analyze a contract, summarize a research paper, or ask questions about your codebase, you needed the model to see the entire document. With a 4K context window, you had to chop the document into pieces and process them separately, which lost a lot of the cross-reference information.
By 2024, context windows had expanded to 128K tokens (OpenAI), 200K (Anthropic), and even 1 million tokens (Google). This opened up new use cases. Instead of building complex retrieval pipelines to find relevant document chunks, you could sometimes just paste the entire document into the prompt. Not always the best approach (it's expensive and slower), but for some applications it was the simplest solution.
Around the same time, RAG (Retrieval-Augmented Generation) became the standard pattern for building applications on top of LLMs, and larger context windows made it more practical because you could fit more retrieved material into a single prompt. Rather than fine-tuning a model on your data (expensive, slow, and easy to mess up), you retrieve relevant documents at query time and include them in the prompt. The model then generates an answer using both its training knowledge and the provided context.
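Stripped of vector databases and embedding models, the core pattern is just: search first, then stuff what you found into the prompt. Here's a minimal sketch with hypothetical retrieve() and call_llm() stand-ins; in a real system, retrieval would be an embedding-similarity or keyword search over your documents, and call_llm would be whichever model API you use.

```python
# Minimal RAG sketch. `retrieve` and `call_llm` are hypothetical stand-ins,
# not any real library's API.

def retrieve(query: str, top_k: int = 3) -> list[str]:
    """Stand-in for the retrieval step (embedding search, BM25, etc.)."""
    corpus = {
        "refund policy": "Refunds are available within 30 days of purchase.",
        "shipping": "Orders ship within 2 business days.",
        "warranty": "Hardware is covered by a 1-year limited warranty.",
    }
    # Toy keyword match; real systems rank documents by vector similarity.
    hits = [text for key, text in corpus.items() if key in query.lower()]
    return hits[:top_k] or list(corpus.values())[:top_k]

def call_llm(prompt: str) -> str:
    """Stand-in for whatever model API you're actually calling."""
    return f"[model answer based on a prompt of {len(prompt)} characters]"

def answer(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    prompt = (
        "Answer the question using only the context below. "
        "If the context doesn't contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return call_llm(prompt)

print(answer("What is the refund policy?"))
```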
I've worked on a few RAG implementations and the results are... mixed. When the retrieval works well and surfaces the right documents, the answers are great. When the retrieval misses or returns irrelevant content, the model either hallucinates an answer or gives a vague non-answer. The quality of the retrieval pipeline matters more than the quality of the language model in my experience, which is a weird inversion of what you'd expect.
Where We Are Now (2026)
The honest assessment: LLMs are very good at some things and unreliable at others, and the gap between those two categories is where most of the engineering work is happening.
What they're good at: transforming text from one format to another (summarization, translation, reformatting), generating boilerplate code, answering questions when the answer is well-represented in the training data, pattern matching against examples, and holding a natural-sounding conversation.
What they struggle with: genuine logical reasoning (multi-step problems where each step depends on the previous one), arithmetic beyond simple cases, knowing the limits of their own knowledge, and producing factually accurate information consistently. The hallucination problem (generating confident, detailed, and completely wrong information) hasn't been solved. It's been improved, but not solved.
The way I think about LLMs now is as very fast, somewhat unreliable research assistants with a massive but frozen knowledge base and a strong tendency to tell you what you want to hear. Useful? Very. Trustworthy? No. I use them constantly in my work (for drafting, for code generation, for exploring unfamiliar topics), but I verify everything they tell me. The verification step isn't optional.
The "AI will replace developers" narrative seems overblown to me, at least at the current capability level. AI is making developers faster, not redundant. The bottleneck in software development was never typing speed โ it was understanding requirements, designing systems, debugging unexpected behavior, and making judgment calls about tradeoffs. LLMs help with some of those (especially the debugging part, honestly) but they can't do the judgment calls on their own.
Where things go from here, I'm honestly not sure. There's a debate about whether we're hitting diminishing returns on the scaling approach: whether making models bigger and training them on more data will keep producing capability improvements, or whether we're approaching a plateau. Some researchers think we need entirely new approaches. Others think the current trajectory has a long runway ahead.
I try not to make predictions about technology. I've been wrong too many times. What I do know is that the models available today are useful enough to integrate into real workflows, flawed enough that you can't trust them blindly, and improving fast enough that anything I write about their limitations might be outdated within a year.
Written by
Anurag Sinha
Developer who writes about the stuff I actually use day-to-day. If I got something wrong, let me know.