Neural Networks Without the Hype
What neural networks actually do under the hood. Code included. No analogies about brains.

Most neural network tutorials lie to you in the first sentence. They tell you a neural network works like a brain. It doesn't. Not in any way that helps you understand what's happening when you actually sit down and write one. Brains are wet, messy, electrochemical. What we're building here? It's a chain of matrix multiplications with knobs you can turn. That's the whole thing: numbers in, math in the middle, numbers out.
I've gone back and forth on how to explain this. Diagrams don't stick. Analogies get in the way. What finally worked for me (and for the handful of people I've taught this to in person) was building a tiny model from scratch, watching it fail, watching it learn, and then poking at the pieces until the confusion went away. So that's what we're doing. We'll build one small network in numpy, train it on fake data, and I'll point out the spots where I personally got confused along the way.
You need basic Python. Some comfort with numpy helps but isn't strictly required. And patience, probably, because the first time this stuff makes sense it's like a switch flipping, but the switch takes a minute.
A Chain of Multiplications
Here's the mental model that actually holds up. You've got input data. Numbers. Could be pixel values, sensor readings, whatever; the network doesn't care what they represent. Those numbers get multiplied by a matrix of weights, then a non-linear function squashes the result, then another weight matrix multiplies that, another squash, and so on until something comes out the other end.
Each multiply-then-squash step? That's a layer.
And the squashing (the non-linear function between layers) isn't optional. Without it, stacking ten layers of matrix multiplication would collapse down to one layer mathematically. Just linear algebra being linear algebra. The non-linearity is what lets the network model curves and corners and all the shapes that real data actually makes. Straight lines aren't enough for most problems.
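You can check that collapse directly in numpy. This snippet isn't part of the network we're building, just a sanity check of the claim:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 8))
W2 = rng.standard_normal((8, 3))
x = rng.standard_normal((1, 4))

# Two stacked linear layers with no activation in between...
two_layers = (x @ W1) @ W2
# ...are exactly one linear layer whose weight matrix is W1 @ W2
one_layer = x @ (W1 @ W2)

print(np.allclose(two_layers, one_layer))  # True
```

No matter how many linear layers you stack, the product of all the weight matrices is just one weight matrix.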
Data goes in one side, passes through these layers, and a prediction comes out. People call this the forward pass. Nothing fancy about the name โ it just means "going forward through the network."
What Squashing Actually Looks Like
Let's start with sigmoid, because it's the one I understood first and it still makes the most intuitive sense for beginners.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
Feed it a big positive number, you get something near 1. A big negative number pushes the output toward 0. Zero itself gives you exactly 0.5. If you plotted it, you'd see an S-curve: everything gets compressed into a range between 0 and 1, no matter how extreme the input.
Why does that matter? Well, if your network's job is answering yes-or-no questions ("is this image a cat?" or "should this email go to spam?"), then an output between 0 and 1 maps nicely onto a probability. Your model says 0.92 and you read that as "92% confident this is spam." Clean.
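Those claims are easy to verify with a few inputs (sigmoid repeated here so the snippet runs on its own):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for z in (-10, -1, 0, 1, 10):
    print(f"sigmoid({z:>3}) = {sigmoid(z):.4f}")
# sigmoid(-10) is nearly 0, sigmoid(0) is exactly 0.5, sigmoid(10) is nearly 1
```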
Now, here's the thing. Modern deep networks have mostly moved on to ReLU (max(0, x)) for their hidden layers. ReLU trains faster, and it sidesteps a nasty problem called vanishing gradients that sigmoid causes in deep architectures. But for what we're building today, a small two-layer network, sigmoid works fine and it's easier to wrap your head around. I think there's value in learning with the simpler tool first, even if it's not what production systems use.
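For comparison, here's what ReLU looks like. We won't use it below, but it's two lines:

```python
import numpy as np

def relu(z):
    # Elementwise max(0, x): negatives become 0, positives pass through unchanged
    return np.maximum(0, z)

print(relu(np.array([-2.0, -0.5, 0.0, 3.0])))  # [0. 0. 0. 3.]
```

No squashing into a fixed range, no expensive exponential, and the gradient for positive inputs is exactly 1, which is why it plays nicer with deep stacks.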
Building the Skeleton
Here's where our model gets born. Not trained, just initialized. A blank slate with random parameters.
class SimpleNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        # Weight matrix for input -> hidden layer
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01
        self.b1 = np.zeros((1, hidden_size))
        # Weight matrix for hidden layer -> output
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01
        self.b2 = np.zeros((1, output_size))
A few decisions got baked in here that aren't obvious unless you've been burned by them.
Random initialization. See that np.random.randn(input_size, hidden_size) * 0.01? The random part matters more than you'd guess. I once set all the weights to zero because it seemed like a clean starting point: "let the network figure it out from nothing," was my thinking. Bad idea. Every neuron in a layer computed the exact same thing. Every gradient was identical. Every update pushed them in the same direction by the same amount. After a thousand steps of training, they were all still clones of each other. The network had effectively collapsed into one neuron per layer. Random values break that symmetry. Each neuron starts different and diverges from there.
And the * 0.01 keeps the initial values small. Large weights cause sigmoid to saturate (outputs jammed near 0 or 1), which makes gradients tiny and grinds training to a halt. I've watched training runs just refuse to converge because the initialization was too large. Small values keep sigmoid in its middle range where the gradients actually flow.
The bias vectors (self.b1, self.b2) are offsets added after multiplication. They give each neuron room to shift when its activation kicks in. Without them, the layer's output would be zero whenever the input was zero, and that limits what the network can represent. Biases start at zero and that's fine; they don't have the symmetry problem that weights do.
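If you want to see the symmetry problem for yourself, here's a quick demonstration, separate from the class above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.random.randn(5, 4)

# Zero init: all 8 hidden neurons compute the identical value for every input
A_zero = sigmoid(X @ np.zeros((4, 8)))
print(np.all(A_zero == A_zero[:, :1]))  # True: every column is a clone

# Random init breaks the symmetry: each neuron starts out different
A_rand = sigmoid(X @ (np.random.randn(4, 8) * 0.01))
print(np.all(A_rand == A_rand[:, :1]))  # False
```

And since the gradients depend on those activations, clones stay clones through every update.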
The Forward Pass in Code
Prediction happens here. No learning yet; just the network looking at data and guessing.
def forward(self, X):
    # Layer 1: multiply inputs by weights, add bias
    self.Z1 = np.dot(X, self.W1) + self.b1
    # Apply activation function
    self.A1 = sigmoid(self.Z1)
    # Layer 2: hidden activations * weights + bias
    self.Z2 = np.dot(self.A1, self.W2) + self.b2
    # Final activation gives us the prediction
    self.A2 = sigmoid(self.Z2)
    return self.A2
Notice that the intermediate values (Z1, A1, Z2, A2) all get stored on the object. Seems wasteful if you only wanted a prediction. But training needs these values later for backpropagation. If you were running this model in production and just making predictions, you probably wouldn't keep them around.
Let's run some numbers through it to see what happens before any training has occurred.
model = SimpleNetwork(input_size=4, hidden_size=8, output_size=1)

# A single example with 4 features
sample_input = np.array([[120, 1, 0.2, 5]])
prediction = model.forward(sample_input)
print(f"Output: {prediction[0][0]:.4f}")
# Output: something close to 0.5
Around 0.5. Every time. Makes sense: weights are random and tiny, so the network hasn't formed any opinions about anything. It's shrugging. Saying "I don't know, maybe?" Training is how we fix that.
Teaching the Network to Be Less Wrong
Forward pass was the easy part. Training is where the actual learning happens, and it's where I spent most of my time confused when I was starting out.
The loop goes like this: show an example, get a prediction, measure how wrong that prediction was (using something called a loss function), then figure out which direction to nudge each weight so the prediction gets a little less wrong next time. Do that a few thousand times and the weights settle into values that produce useful outputs. Probably. Hopefully.
The "figuring out which direction" step is backpropagation: it uses the chain rule from calculus to trace how much each weight contributed to the error. I'm not going to derive it here because textbooks do that better, and honestly the code reads more clearly than the equations anyway:
def train(self, X, y, learning_rate=0.1):
    # Forward pass
    prediction = self.forward(X)
    m = X.shape[0]  # number of examples

    # How wrong were we? (derivative of binary cross-entropy loss)
    dZ2 = prediction - y

    # Gradients for layer 2 weights and biases
    dW2 = (1/m) * np.dot(self.A1.T, dZ2)
    db2 = (1/m) * np.sum(dZ2, axis=0, keepdims=True)

    # Propagate the error back to layer 1
    dA1 = np.dot(dZ2, self.W2.T)
    dZ1 = dA1 * self.A1 * (1 - self.A1)  # sigmoid derivative

    # Gradients for layer 1 weights and biases
    dW1 = (1/m) * np.dot(X.T, dZ1)
    db1 = (1/m) * np.sum(dZ1, axis=0, keepdims=True)

    # Update weights: move them in the opposite direction of the gradient
    self.W2 -= learning_rate * dW2
    self.b2 -= learning_rate * db2
    self.W1 -= learning_rate * dW1
    self.b1 -= learning_rate * db1
learning_rate is how big each step is. Too big and the weights overshoot, bouncing around without settling. Too small and training takes forever. Finding a good value is honestly more feel than formula. Adaptive methods like Adam and RMSProp automate some of that, but for a toy example a fixed rate does the job.
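The overshoot-versus-crawl trade-off is easy to see on a toy problem. This sketch (not part of the network code) runs plain gradient descent on f(w) = w², whose gradient is 2w and whose minimum sits at w = 0:

```python
def descend(learning_rate, steps=50):
    # Gradient descent on f(w) = w**2; the gradient is 2*w, the minimum is w = 0
    w = 1.0
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

for lr in (0.01, 0.1, 1.5):
    print(f"lr={lr}: w after 50 steps = {descend(lr):.6g}")
# 0.01 crawls toward 0, 0.1 converges nicely, 1.5 explodes to around 1e15
```

Same loss surface, same starting point, three completely different outcomes. That's the whole learning-rate story in miniature.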
That prediction - y at the top is your error signal. Network guessed 0.8, true answer was 0, so the error is 0.8. From there, the function works backward through each layer, computing how much every weight contributed to that mistake and adjusting them in the opposite direction. Weights that pushed the prediction the wrong way get pulled back. Ones that helped get reinforced slightly.
Running the Full Loop
Now we connect everything: data generation, model creation, training iterations.
# Generate some toy data
np.random.seed(42)
X = np.random.randn(200, 4)
# Label: 1 if sum of features > 0, else 0
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)

model = SimpleNetwork(4, 8, 1)

# Train for 1000 iterations
for i in range(1000):
    model.train(X, y, learning_rate=0.5)
    if i % 200 == 0:
        predictions = model.forward(X)
        loss = -np.mean(y * np.log(predictions + 1e-8)
                        + (1 - y) * np.log(1 - predictions + 1e-8))
        accuracy = np.mean((predictions > 0.5) == y)
        print(f"Step {i}: loss={loss:.4f}, accuracy={accuracy:.2%}")
After 1000 steps, accuracy should land above 95%. Pretty good for a network that started by shrugging at everything. The task itself is dead simple (predict whether four random numbers add up to something positive), but that's the point. We can see the full loop working: forward pass, loss computation, backward pass, weight update. Repeat. No magic.
Things I Deliberately Left Out
Quite a bit, actually. Regularization, which stops the network from just memorizing its training data instead of spotting patterns. Batch normalization, which stabilizes training by normalizing the activations between layers. Dropout, which randomly switches off neurons during training so the network doesn't lean on any single one too heavily. Different optimizers. Learning rate schedules that reduce the step size over time. Better weight initialization methods.
Each one solves a specific headache that shows up when you move from toy problems to real ones. Overfitting. Vanishing gradients. Training that won't stabilize. Convergence that crawls. The architecture you just saw, layers of matrix multiplications with non-linear activations, is the same foundation under everything from image classifiers to GPT. What changes at scale is the parameter count (billions instead of dozens) and the accumulated bag of tricks that make training actually work when things get that big.
I trace the path from this kind of architecture to transformers and GPT in my post on how LLMs actually got here. They aren't a different species, just different wiring.
When Training Goes Sideways
The hardest thing about neural networks, for me personally, wasn't understanding the math. It was building intuition for when something goes wrong and you're staring at a loss curve that won't budge, trying to figure out which of seventeen possible problems you're actually dealing with.
Loss spikes to NaN in the first few steps? Almost always learning rate. It's too high: gradients explode, weights fly off to infinity, and numpy starts returning garbage. Drop it by a factor of 10. Try again.
Loss decreases for a while, then flatlines well above zero? Network might not have enough capacity. More hidden neurons or an extra layer can fix that. Or it might be a data issue: features that haven't been normalized, labels that are noisy.
Loss looks beautiful on training data but the model falls apart on anything new? That's overfitting. The network memorized examples instead of learning the actual pattern. It'll score 99% on data it's seen and 60% on data it hasn't. Reduce the model size, add dropout, or get more training data. Usually some combination.
One thing that's saved me hours of confusion: plot the loss curve. Just a basic matplotlib line chart, loss on the y-axis, training steps on the x-axis. The shape of that curve tells you more about what's happening inside the network than any amount of staring at printed numbers. A smooth decline means things are working. Wild oscillation means the learning rate is too aggressive. A flat line from step zero means something is broken, possibly a bug in your gradient computation.
These days I don't even look at raw loss values anymore. Just the curve.
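A minimal version of that plot, assuming you collect the loss at each step into a list. The losses list below is a synthetic stand-in for whatever your training loop actually records, and the filename is arbitrary:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line if running interactively
import matplotlib.pyplot as plt

# Stand-in curve: swap in the real per-step losses from your training loop
losses = [1.0 / (1 + 0.05 * step) for step in range(500)]

plt.plot(losses)
plt.xlabel("training step")
plt.ylabel("loss")
plt.savefig("loss_curve.png")
```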
The Confusion That Nobody Warns You About
A handful of concepts tripped me up when I was learning this, and I keep seeing them trip up others. Not advanced stuff โ just things that tutorials either skip or bury in footnotes.
Training versus inference. Training is the expensive part: running millions of examples through the network, computing gradients, updating weights. It runs for hours on GPUs. Sometimes days. Inference means using the already-trained network to make a prediction on new data. That's fast, just a single forward pass with no gradient math. When someone says they're "running a model in production," they mean inference. When they talk about the cost of building a model, they mean training. Back when I first heard these terms thrown around, I didn't realize they described two completely different computational profiles. Now it seems obvious, but it wasn't.
Parameters versus hyperparameters. Parameters are the weights and biases, the things backpropagation adjusts during training. Your network discovers them. Hyperparameters are the settings you choose before training even starts: learning rate, number of layers, neurons per layer, batch size, how long to train. You pick them. Finding good hyperparameters is mostly trial and error, maybe guided by some systematic search (grid search, random search, Bayesian optimization if you're feeling fancy). From what I've seen, experienced practitioners develop a starting intuition for these and then iterate. There's no formula that just gives you the right answer.
Epochs, iterations, and batches. Say you've got 10,000 training examples. You don't jam all 10,000 through at once; you break them into batches of, say, 32 or 64 examples each. Processing one batch counts as one iteration. Processing every batch, meaning every example has been seen once, is one epoch. Training for 100 epochs means the network sees each example 100 times. More epochs generally improve the model up to a point, and past that point you're just memorizing training data. Knowing where that point is? Experience, mostly.
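In code, the bookkeeping looks roughly like this (a sketch only; the commented-out model.train call is where the update from earlier would go):

```python
import numpy as np

X = np.random.randn(10_000, 4)  # 10,000 examples, 4 features
batch_size = 32
iterations = 0

for epoch in range(3):
    # Reshuffle each epoch so the batches differ between passes
    order = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = X[order[start:start + batch_size]]
        # model.train(batch, batch_labels, learning_rate=0.1) would go here
        iterations += 1

print(iterations)  # 313 batches per epoch x 3 epochs = 939 iterations
```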
What Happens When You Stack More Layers
Our network had one hidden layer. Production models might have dozens. Hundreds, even. What does piling on more layers actually buy you?
Roughly: each layer picks up a different level of abstraction. In an image network, early layers detect edges and simple textures, low-level stuff. Middle layers combine those into shapes. Circles, rectangles, curves. Later layers put shapes together into objects a human would recognize: a face, a wheel, a cat's ear. Each layer builds on what the previous one found. So deeper networks can represent increasingly complex hierarchies of features without needing any single layer to do too much heavy lifting.
But depth costs you. The vanishing gradient problem I mentioned earlier gets worse with every layer you add. Gradients flowing backward through many layers can shrink to almost nothing by the time they reach the early ones, which means those layers barely learn at all. Training a 50-layer network wasn't really practical until people figured out residual connections (skip connections that let gradients hop over layers) and batch normalization. Those techniques showed up around 2015, and suddenly very deep networks became trainable. Felt like a dam breaking.
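A residual connection is almost embarrassingly simple in code; the whole trick is the + x. This is an illustration, not something our two-layer model needs:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(1)
x = rng.standard_normal((1, 16))
W = rng.standard_normal((16, 16)) * 0.1

# Plain layer: the output replaces the input entirely
plain = relu(x @ W)

# Residual layer: the layer only has to learn a correction to x, and gradients
# can flow backward straight through the "+ x" path, skipping the layer
residual = x + relu(x @ W)
print(residual.shape)  # (1, 16)
```

Note the input and output have to be the same size for the addition to work, which is why residual blocks usually keep a constant width.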
Overfitting gets worse too. A network with millions of parameters and not enough data to keep it honest will memorize training examples instead of picking up on generalizable patterns. You get 99% accuracy on training data, 60% on anything new. Dropout (randomly killing neurons during training), L2 weight decay (penalizing big weight values), data augmentation (artificially expanding your training set with transformations) โ all of these exist to fight that tendency. Honestly, a lot of the art of training neural networks comes down to managing this tension: making the model powerful enough to learn the signal without letting it learn the noise.
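Of those three, dropout is the easiest to show in a few lines. This is the "inverted dropout" formulation; our toy network doesn't use it, it's just to make the idea concrete:

```python
import numpy as np

def dropout(activations, keep_prob=0.8):
    # Zero out a random ~20% of neurons, then scale the survivors by
    # 1/keep_prob so the expected activation stays the same
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob

A = np.ones((3, 10))
out = dropout(A, keep_prob=0.8)
print(out)  # each entry is either 0.0 (dropped) or 1.25 (kept and rescaled)
```

You only apply this during training; at inference time the network runs with every neuron active.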
From Our Toy Model to GPT
What we built is a fully connected feedforward network. Every neuron in one layer talks to every neuron in the next. Simplest possible architecture.
Convolutional neural networks, or CNNs, swap in a different kind of layer: one that slides a small filter across the input looking for local patterns. Great for images, where pixels near each other tend to be related. You wouldn't use a fully connected layer for a 1024x1024 image; the weight matrix alone would be enormous and most of those connections would be pointless.
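To make "slides a small filter across the input" concrete, here's a naive 2D convolution in numpy (no padding, stride 1; real frameworks do this far more efficiently):

```python
import numpy as np

def conv2d(image, kernel):
    # Each output pixel is the dot product of the kernel with one local patch
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.randn(6, 6)
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])  # responds to vertical edges
print(conv2d(image, edge_filter).shape)  # (5, 5)
```

The same handful of filter weights gets reused at every position, which is why a conv layer needs so many fewer parameters than a fully connected one.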
Recurrent networks, RNNs, add loops. Information from previous time steps feeds back into the current one. That makes them useful for sequential data: text, audio, time series. But they're slow to train and struggle with long sequences, which is why they've mostly been replaced.
Transformers, the architecture behind GPT, BERT, and most of what's making headlines right now, threw out the recurrence entirely. They use an "attention" mechanism that lets every element in the input look at every other element directly. No loops, no sequential processing bottleneck. I trace that whole evolution in my post on how LLMs actually got here.
But here's the part that surprised me when I first grasped it: all of these architectures share the same core loop. Layers of multiplications with non-linear activations. A loss function measuring error. Backpropagation computing gradients. An optimizer nudging weights. Repeat until loss stops going down. Everything else (attention heads, convolutional filters, residual connections, layer normalization) amounts to specialized structures designed to help the network learn specific kinds of patterns more efficiently. They're important, sure. But the engine underneath is what you've already read.
Go Break It
Roughly 40 lines of Python and no libraries beyond numpy. That's a working neural network. Copy the code, run it, and then (and this matters) break it on purpose. Change the hidden layer from 8 neurons to 32 and see what happens to accuracy. Set the learning rate to 5.0 and watch the loss explode in real time. Try 0.001 and watch training crawl. Remove the activation function entirely and see the network fail to learn anything nonlinear. Step through the weight matrices in a debugger.
That kind of poking around builds intuition faster than reading ever will. Including reading this.
Once you've got the fundamentals down, the practical question shifts to how you actually get consistent results out of these models. I wrote about that in my prompt engineering guide, covering what works when you're trying to get repeatable behavior from language models.
I keep thinking about something, though. There's a gap between this toy example and, say, a 70-billion-parameter language model that feels like it shouldn't be as wide as it is. Architecturally, the pieces are the same. Multiplication, activation, loss, gradient, update. Same loop. But the engineering challenges at scale (distributed training across thousands of GPUs, mixed-precision arithmetic, gradient checkpointing to fit anything into memory, data pipelines that can keep the hardware fed) are a different kind of problem entirely. Not a math problem. An infrastructure problem. And I'm not sure any amount of building small networks from scratch really prepares you for that transition, even though every tutorial (this one included) implies it does.
Maybe that's fine. Maybe the point of the toy version isn't to prepare you for scale. Maybe it's just to kill the mystery.
Written by
Anurag Sinha
Full-stack developer specializing in React, Next.js, cloud infrastructure, and AI. Writing about web development, DevOps, and the tools I actually use in production.