Machine Learning in Python: Starting Without a PhD
How I went from confused by terminology to building useful models with scikit-learn, which algorithm to reach for first, and why feature engineering matters more than model choice.

Most ML tutorials teach you the wrong things first. They open with math notation, jump into gradient descent, throw linear algebra at you before you've even loaded a CSV file. I spent about six months reading papers and watching lectures that assumed comfort with eigenvalues and matrix decomposition. None of that got me closer to building anything. What actually did? Installing scikit-learn and writing twelve lines of Python. My first classifier hit 87% accuracy. Twelve lines. Took less time than scaffolding a React project.
Here's what I think the disconnect is. The academic side of ML wants you to understand everything before you touch anything. But building useful prediction systems doesn't require a PhD's worth of background; it requires knowing which tool to grab and when, how to shape your inputs, and where the common traps are. That's what I'm going to go through here, and I'm going to be opinionated about it. Some of what I skip might surprise you.
Start With Scikit-Learn. Don't Look Back.
Deep learning dominates the conversation. TensorFlow, PyTorch, billion-parameter architectures: they're everywhere in the headlines. And for most problems you'll encounter at work (predicting churn, sorting support tickets, flagging anomalies in time series), none of that matters. Classical algorithms inside scikit-learn handle these just fine, and they're far simpler to train, debug, and put into production.
I wrote about neural networks and what they do under the hood, and I genuinely find them impressive for specific use cases. But grabbing a neural network when a random forest would do the job is like spinning up Kubernetes for a personal blog. Overkill. Slower feedback loop. More things to break.
Here's what a complete ML pipeline looks like in scikit-learn:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import pandas as pd
# Load data
df = pd.read_csv('customer_data.csv')
# Features and target
X = df[['age', 'monthly_spend', 'support_tickets', 'months_active']]
y = df['churned'] # 1 = churned, 0 = active
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train a model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Evaluate
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
Load, split, train, evaluate. Done. That train_test_split call isn't optional: you're separating out a portion the algorithm never sees during fitting, so your evaluation reflects how it performs on genuinely new inputs. I skipped that step early on and couldn't figure out why my "95% accurate" classifier fell apart in production. It had just memorized the answers. Setting random_state=42 keeps results reproducible across runs, which matters a lot when you're debugging something and the numbers shift every time you re-execute.
Picking Algorithms (And Why I Don't Agonize Over It)
Choosing an algorithm used to paralyze me. Dozens of options, each with caveats. After a couple years of building things, I've landed on a pretty short list that covers roughly 80% of what I run into.
For classification (predicting a category), I reach for random forest first, every time. Works well out of the box. Handles numerical and categorical features together. Doesn't overfit easily. If interpretability matters (someone needs to understand why the classifier made a specific call), logistic regression is better. Yes, it's called regression. It classifies. The naming is unfortunate. Coefficients from a logistic regression tell you exactly how much each input pushes the prediction one direction or the other, which is something a random forest can't do as cleanly. And when neither of those gets the numbers where I need them, gradient boosting through XGBoost or LightGBM usually closes the gap, though you'll spend more time tuning knobs.
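To make that interpretability point concrete, here's a minimal sketch of reading logistic regression coefficients. The data is synthetic and the feature names (age, monthly_spend, support_tickets) are placeholders; the target is deliberately constructed so the second feature dominates:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic toy data: 200 rows, 3 features; "churn" driven mostly by feature 1
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 1] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Scale first so coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression().fit(X_scaled, y)

# Sign says which direction a feature pushes; magnitude says how hard
for name, coef in zip(['age', 'monthly_spend', 'support_tickets'], clf.coef_[0]):
    print(f"{name}: {coef:+.2f}")
```

On real data you'd scale with the same train-only discipline discussed later in this post, but the reading is the same: the largest absolute coefficient is the strongest signal.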
Regression problems (predicting a continuous number) follow a similar pattern. Linear regression as a baseline. If the relationship between inputs and output isn't roughly linear, random forest regressor. Gradient boosting regressor when I need to squeeze out everything possible and have time for hyperparameter sweeps.
Clustering and anomaly detection come up less often for me, but when they do: K-Means if I have a rough idea of how many groups should exist, DBSCAN if I don't (it also spots outliers on its own, which is nice). Isolation Forest for anomaly detection: point it at your dataset and it'll flag the weird stuff with minimal configuration.
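Here's a rough sketch of all three on synthetic 2D data. The cluster count, eps, and contamination values are arbitrary choices for this toy example, not recommendations:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Two tight synthetic blobs plus two obvious outliers at the end
blobs = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
X = np.vstack([blobs, [[30.0, 30.0], [-30.0, -30.0]]])

# K-Means: you pick the number of clusters up front
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(blobs)

# DBSCAN: no cluster count needed; label -1 means "noise"
db_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)

# Isolation Forest: -1 marks anomalies, 1 marks inliers
iso_labels = IsolationForest(contamination=0.02, random_state=0).fit_predict(X)

print("DBSCAN noise points:", (db_labels == -1).sum())
print("Isolation Forest anomalies:", (iso_labels == -1).sum())
```

Both DBSCAN and Isolation Forest should flag the two far-out points without being told where they are, which is exactly the "minimal configuration" appeal.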
I want to say something that probably sounds dismissive but I mean it sincerely. The algorithm choice matters way less than people think. A random forest with thoughtfully crafted inputs will outperform a neural network fed raw, poorly structured features. Almost every time. I've seen it happen over and over. Spend your thinking on what goes into the model.
Feature Engineering: Where the Actual Work Happens
Raw columns from a database are rarely ready for an algorithm. Transforming them into signals that capture real patterns is where results come from. And it requires knowing the problem domain, not just the Python API.
Take churn prediction. A customer made 12 purchases last year. Useful fact? Maybe. But what if 11 happened in the first two months and the last was eight months ago? A single count hides a collapsing engagement curve. You need features that encode recency and trend, not just totals.
So I'd build something like: purchases in the last 30 days, purchases in the last 90 days, ratio of recent to total, days since the most recent purchase, average gap between purchases, and whether the trend is going up, down, or flat.
import pandas as pd
from datetime import datetime, timedelta

def engineer_features(customer_df, purchases_df):
    now = datetime.now()
    features = {}
    for customer_id in customer_df['id']:
        purchases = purchases_df[purchases_df['customer_id'] == customer_id]
        features[customer_id] = {
            'total_purchases': len(purchases),
            'purchases_last_30d': len(purchases[purchases['date'] > now - timedelta(days=30)]),
            'purchases_last_90d': len(purchases[purchases['date'] > now - timedelta(days=90)]),
            'days_since_last_purchase': (now - purchases['date'].max()).days if len(purchases) > 0 else 999,
            'avg_days_between_purchases': purchases['date'].diff().dt.days.mean() if len(purchases) > 1 else 999,
        }
    return pd.DataFrame.from_dict(features, orient='index')
That days_since_last_purchase column alone is probably more predictive than most other features combined. Someone who hasn't bought anything in 200 days has almost certainly already gone, regardless of their historical purchase count. Recent behavior tells you more than lifetime summaries. Always.
And here's the thing that took me a while to internalize: good feature engineering requires domain knowledge, not algorithmic sophistication. The person who understands the business (who knows why customers leave, what signals precede a cancellation) contributes more to prediction quality than someone who can implement a custom kernel function from scratch. I've seen this play out on multiple teams. Involving someone from the business side in a feature brainstorm session produces better results than a week of hyperparameter tuning.
Dealing With Messy Inputs
Real-world datasets have holes. Somebody didn't fill in their age on a form. A sensor went dark for two hours. A database column was added six months back and older records don't have it. Most algorithms choke on missing values, so you need a plan.
The simplest option, dropping any row with a gap, loses information. If 30% of your rows have at least one blank field, you've just thrown away 30% of your signal. Sometimes that's acceptable. Usually it isn't.
Imputation fills blanks with something reasonable. Median works well for numerical columns (it's not thrown off by extreme outliers the way a mean can be). For categorical columns, the most common value is a standard fill.
from sklearn.impute import SimpleImputer
# Numerical columns: fill with median
num_imputer = SimpleImputer(strategy='median')
X[['age', 'income']] = num_imputer.fit_transform(X[['age', 'income']])
# Categorical columns: fill with most frequent
cat_imputer = SimpleImputer(strategy='most_frequent')
X[['city', 'plan_type']] = cat_imputer.fit_transform(X[['city', 'plan_type']])
But here's a trick I wish I'd learned earlier. Sometimes the absence itself is informative. A customer who left their phone number blank might behave differently from one who filled it in. You can capture that by adding a binary indicator (was the field empty?) before you impute the missing value.
X['age_missing'] = X['age'].isna().astype(int)
X['age'] = X['age'].fillna(X['age'].median())
Now you have both: the filled-in value so the algorithm doesn't error out, and a flag encoding the fact that it was missing. Together they carry more signal than either alone.
Scaling, And a Mistake That Cost Me a Week
Some algorithms are sensitive to how large numbers are relative to each other. If one feature ranges from 0 to 1 and another from 0 to 100,000, anything that relies on distance calculations (k-nearest neighbors, SVMs) gets completely dominated by the bigger numbers. Two approaches fix this. StandardScaler shifts features to have a mean of zero and a standard deviation of one. MinMaxScaler compresses everything into a 0-to-1 range.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use the SAME scaler, don't refit
See that comment on the third line? I ignored it once, back in 2024, and it cost me a week of confusion. I was fitting the scaler on the full dataset before splitting into train and test. Seems harmless. It isn't. Information from the test set bleeds into the preprocessing step, which means your evaluation is quietly optimistic. The scaler should only learn its parameters from training samples, then apply those same parameters to test samples. Minor-sounding distinction. Measurable impact, especially with smaller datasets.
Tree-based methods (random forests, gradient boosting) don't care about feature scale at all. They split on threshold values and the magnitude is irrelevant. Linear algorithms, SVMs, and neural networks do care. My general rule: if I'm not sure whether scaling matters for what I'm using, I scale anyway. It rarely hurts.
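A quick way to see what each scaler actually does, on toy numbers rather than real data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# One feature in dollars (huge range), one ratio already in [0, 1]
X = np.array([[50_000.0, 0.2], [80_000.0, 0.5], [120_000.0, 0.9]])

# StandardScaler: each column ends up with mean ~0 and std ~1
standardized = StandardScaler().fit_transform(X)

# MinMaxScaler: each column is squeezed to span exactly 0 to 1
minmaxed = MinMaxScaler().fit_transform(X)

print(standardized.mean(axis=0))  # both columns now average to ~0
print(minmaxed.min(axis=0), minmaxed.max(axis=0))
```

After either transform, the dollar column and the ratio column sit on comparable scales, so a distance-based algorithm weighs them by signal rather than by magnitude.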
Getting More Reliable Numbers With Cross-Validation
One train/test split gives you one number. And that number depends entirely on which examples happened to land in each bucket. Maybe the split was generous and the test set was easy. Maybe it was harsh. You can't tell from a single run.
Cross-validation removes that uncertainty, or at least reduces it. Split into K pieces. Train K separate instances, each holding out a different piece for evaluation. Average the results.
from sklearn.model_selection import cross_val_score
model = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
# Accuracy: 0.847 (+/- 0.023)
Five folds means five separate evaluations, each on 20% of the pool. The mean tells you roughly what to expect. The standard deviation tells you how consistent it is: a high spread suggests the algorithm is either overfitting or you don't have enough samples to get a stable reading.
I rely on cross-validation for comparing approaches and tweaking configurations. For the final "is this good enough to ship?" evaluation, I still keep a completely separate holdout that wasn't touched during any iteration of development. Belt and suspenders.
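That belt-and-suspenders pattern looks something like this sketch, with scikit-learn's make_classification standing in for real data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Stand-in dataset; swap in your real X and y
X, y = make_classification(n_samples=500, random_state=42)

# Carve off the final holdout FIRST; it stays untouched during development
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Iterate, compare, and tune with cross-validation on the dev portion only
model = RandomForestClassifier(n_estimators=100, random_state=42)
cv_scores = cross_val_score(model, X_dev, y_dev, cv=5)

# One final evaluation on the untouched holdout, right before shipping
model.fit(X_dev, y_dev)
final_score = model.score(X_holdout, y_holdout)
```

The key detail is the order of operations: the holdout is split off before any cross-validation happens, so no tuning decision ever sees it.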
Tuning Knobs (But Not Too Much)
Every algorithm has configuration parameters: how many trees in a forest, how deep each can grow, the minimum number of samples before a split is allowed. These are hyperparameters, and finding good values for them is where grid search comes in.
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1  # Use all CPU cores
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.3f}")
# Use the best model
best_model = grid_search.best_estimator_
That grid has 3 x 4 x 3 x 3 = 108 combinations, each run through 5-fold cross-validation, so 540 fits. n_jobs=-1 throws all your CPU cores at it, which turns "go get coffee and come back tomorrow" into "go get coffee and come back in ten minutes." RandomizedSearchCV is worth knowing about too โ instead of trying every combination, it samples randomly from the parameter space and often finds comparably good settings in a fraction of the wallclock time.
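A sketch of the randomized version; the distributions and n_iter value here are arbitrary choices for illustration, and make_classification stands in for real training data:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Stand-in dataset; swap in your real X_train and y_train
X, y = make_classification(n_samples=300, random_state=42)

# Distributions instead of fixed lists; n_iter caps the combinations tried
param_distributions = {
    'n_estimators': randint(50, 201),
    'max_depth': [5, 10, 20, None],
    'min_samples_split': randint(2, 11),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10,  # 10 sampled combinations x 5 folds = 50 fits, not 540
    cv=5,
    random_state=42,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```

Because parameters are sampled rather than enumerated, adding another hyperparameter to the search doesn't multiply the cost, which is the main reason it scales better than a grid.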
Here's my honest take on tuning, though. It typically buys you 1-3% improvement over default settings. Feature engineering typically buys 5-15%. I've watched teams spend weeks perfecting hyperparameters when they hadn't even looked at whether their input features made sense. Priorities.
Keeping Preprocessing Honest With Pipelines
Scikit-learn's Pipeline object chains preprocessing and modeling into a single unit, which prevents a whole class of subtle bugs:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=100))
])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
Fitting and predicting through a pipeline guarantees that scaling parameters are learned from only the fitting samples and applied consistently to anything you predict on later. Without it, you're relying on yourself to remember the right order of operations every time. In production, I've never deployed anything that wasn't wrapped in a pipeline; the risk of accidentally mismatching preprocessing steps is too high when code gets touched by multiple people over months.
Pipelines also slot directly into cross-validation and grid search. Preprocessing gets correctly re-applied within each fold, which is something most tutorials gloss over but matters in practice.
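Tuning a pipeline works the same way as tuning a bare model; the only wrinkle is that grid keys get prefixed with the step name and a double underscore. A minimal sketch with a stand-in dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset; swap in your real training data
X, y = make_classification(n_samples=300, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

# 'model__C' targets the C parameter of the step named 'model'
param_grid = {'model__C': [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

# The scaler is refit on the training portion of every fold automatically
print(search.best_params_)
```

Searching over the whole pipeline is what makes the per-fold preprocessing honest: every candidate is evaluated with scaling parameters learned only from that fold's training split.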
Measuring What Actually Matters
Accuracy, by itself, lies to you on imbalanced problems. If 95% of customers don't churn, a classifier that always says "not churning" scores 95% while doing absolutely nothing useful. You need to look at precision (of the people you flagged as churners, how many actually were?), recall (of the people who did churn, how many did you catch?), and F1 (a single number balancing both). Precision matters when false alarms are expensive: you don't want to send retention offers to everyone. Recall matters when missing a case is costly: you really don't want to lose someone you could have saved.
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
I think the confusion matrix is the most underappreciated output in all of ML. It shows you, in plain numbers, exactly what your classifier gets right and wrong: how many actual churners it missed (false negatives) and how many loyal customers it wrongly flagged (false positives). The cost of those two mistakes is rarely symmetric. Missing someone who's about to leave (and who might have stayed with a phone call or a discount) probably costs more than sending an unnecessary retention email. That asymmetry should drive which metric you optimize toward. And yet I see teams fixate on accuracy as if all errors are equal. They aren't.
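One practical lever for that asymmetry: instead of accepting the default 0.5 cutoff, use predict_proba and pick the decision threshold yourself. The 0.3 value below is an arbitrary example on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Imbalanced stand-in dataset: roughly 80/20 classes
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Default behavior: flag only when estimated probability exceeds 0.5
default_preds = model.predict(X_test)

# Recall-favoring: flag anyone with even a 30% estimated probability
proba = model.predict_proba(X_test)[:, 1]
eager_preds = (proba >= 0.3).astype(int)

# Lower threshold: more true positives caught, more false alarms raised
print("flagged at 0.5:", default_preds.sum(), "| flagged at 0.3:", eager_preds.sum())
```

When a missed churner costs more than a wasted retention email, trading precision for recall this way is often the right call.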
When Scikit-Learn Isn't Enough
For tabular, structured information (customer records, transactions, sensor readings), scikit-learn is my answer almost every time. Where it falls short: images need convolutional architectures in PyTorch or TensorFlow (scikit-learn simply doesn't have the right building blocks). Text has been transformed by the rise of attention-based architectures; the LLM landscape changed what's possible, and while scikit-learn's TF-IDF approach still makes a decent baseline, transformer-based solutions outperform it for nearly any NLP task now. Very large datasets that won't fit in memory need distributed tools like Dask-ML or Spark MLlib.
And then there's the "99.5% vs. 98%" question. Deep architectures and sophisticated ensembles can close that gap when it matters. For a medical diagnosis system, 1.5 percentage points could mean lives. For a product recommendation widget? Probably not worth the complexity cost.
I'll tell you something that might be hard to believe. I've shipped production systems built on logistic regression with carefully built features that beat neural networks a previous team had spent months constructing. The features did the heavy lifting. Not the algorithm.
Exploration Before Modeling
Nobody talks about this enough, but roughly 80% of an ML project is preparing and understanding your inputs. Loading. Cleaning. Exploring distributions. Checking for oddities. Getting comfortable with pandas and a plotting library (matplotlib or seaborn, pick one) matters more than memorizing fifteen algorithms.
import seaborn as sns
import matplotlib.pyplot as plt
# Explore distributions
df.describe()
# Check for class imbalance
df['churned'].value_counts(normalize=True)
# Correlation between features
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()
# Distribution of a feature by target class
sns.boxplot(x='churned', y='monthly_spend', data=df)
plt.show()
A correlation heatmap takes thirty seconds to generate and saves hours. You can immediately spot which features relate to each other (multicollinearity: redundant signals that bloat your input space without helping) and which relate to the target variable. Features that show zero correlation with what you're predicting are probably dead weight. Dropping them simplifies things. Five minutes of exploration here redirects effort toward the columns that actually carry predictive power.
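A sketch of that pruning step on synthetic data; the 0.1 cutoff and the column names are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    'monthly_spend': rng.normal(50, 10, 1000),
    'noise_column': rng.normal(0, 1, 1000),  # pure noise, no signal
})
# Target depends on monthly_spend only in this toy setup
df['churned'] = (df['monthly_spend'] < 45).astype(int)

# Absolute correlation of every feature with the target, strongest first
corr = df.corr()['churned'].abs().drop('churned').sort_values(ascending=False)
print(corr)

# Keep only features clearing an (arbitrary) 0.1 threshold
keep = corr[corr > 0.1].index.tolist()
```

Correlation only catches linear relationships, so treat this as a first-pass filter, not a final verdict on a feature's usefulness.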
Mistakes I Burned Time On (So Maybe You Won't)
A few that stand out from my first year or so.
I evaluated on the same samples I'd trained on. Got 99% accuracy. Felt great. Deployed it. Watched it score 70% on real inputs. Classic overfitting โ the algorithm had memorized the answers, not learned the patterns. Always split before you fit. Always evaluate on held-out observations.
I ignored class imbalance. Built a fraud detector on a dataset where 99.5% of transactions were legitimate. My classifier learned to say "legitimate" every time and scored 99.5%. Technically correct. Functionally useless. Setting class_weight='balanced' in scikit-learn classifiers is often enough to fix this. SMOTE (synthetic oversampling) and stratified sampling are other options, depending on how severe the imbalance is.
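A sketch of the class_weight fix on a synthetic imbalanced dataset (95/5 split, with logistic regression as the example classifier):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Heavily imbalanced stand-in data: ~95% majority, ~5% minority
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
balanced = LogisticRegression(
    max_iter=1000, class_weight='balanced'
).fit(X_train, y_train)

# Balanced weighting trades some overall accuracy for minority-class recall
print("plain recall:   ", recall_score(y_test, plain.predict(X_test)))
print("balanced recall:", recall_score(y_test, balanced.predict(X_test)))
```

The stratify=y argument keeps the class ratio identical in both splits, which matters when the minority class is this small.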
And the sneakiest one: feature leakage. I accidentally included a column that wouldn't be available at prediction time. Using "days until churn" to predict churn sounds absurd when you spell it out, but subtler versions of this happen all the time. Every column needs to pass one test: "would I actually have this information at the exact moment I need to make this prediction?" If the answer is no, it's leakage. Your evaluation will look great. Your production results won't.
The loop, when I strip away all the specifics, is: pick a simple algorithm, shape your inputs well, evaluate honestly, ship it, and iterate. That's production ML. The more advanced techniques exist for the cases where the simple path genuinely isn't enough, and from what I've seen, you'd be surprised how rarely that is.
I honestly don't know if deep learning will make all of this irrelevant in five years. Maybe attention-based architectures will eventually handle tabular prediction better than random forests, and feature engineering will be something the model does on its own. Some people seem certain that's where things are heading. I'm not. Classical methods keep working, keep being simple, keep being interpretable. But I've been wrong before, and the field moves fast. I guess we'll find out.
Further Resources
- scikit-learn User Guide: Full documentation for every algorithm, preprocessing technique, and evaluation metric covered in this post.
- Google Machine Learning Crash Course: A free, practical introduction to machine learning concepts with hands-on exercises using TensorFlow and real-world datasets.
- Kaggle Learn: Free micro-courses on pandas, feature engineering, and machine learning that let you practice on real datasets in the browser.
Written by
Anurag Sinha
Full-stack developer specializing in React, Next.js, cloud infrastructure, and AI. Writing about web development, DevOps, and the tools I actually use in production.