automation_scripts.py: A Blog Post in 150 Lines of Code
Four Python scripts I actually use. Bulk renamer, downloads folder organizer, duplicate finder, and a website change detector.

Two hundred invoices. That was the size of the folder when my regex rename went sideways.
An accounting firm had hired me for some freelance work back in 2024. They needed their invoice PDFs reorganized: INV_2024_ClientName_Draft.pdf turned into ClientName_INV_2024_Final.pdf. Straightforward. Five minutes of Python, maybe ten if the coffee hadn't kicked in yet.
Opened a REPL, wrote a re.sub call, ran it on the entire folder without testing first. My regex was greedier than expected. Files ended up named things like _INV_2024_Final.pdf, the client portion just... gone. Swallowed by a capture group that matched too broadly. The next hour was spent cross-referencing file sizes and creation dates against a backup, trying to reconstruct which document belonged to which client. Not my finest Friday evening.
That disaster is why my scripts folder exists. Not because of any particular love for clean automation. Because specific moments of frustration, and in this case low-grade panic, pushed me to write something that would stop the same problem from biting me twice. None of these are polished. A proper software engineer would have opinions about the error handling (minimal) and the configuration approach (nonexistent), probably. They work, though. Been working for a couple of years now. A few of them run on schedules I've mostly forgotten about until something breaks.
Four scripts come up often enough to be worth sharing here.
The Bulk File Renamer
Direct descendant of the invoice catastrophe. When I sat down to write the safer version, the first decision, before anything else, was making dry_run=True the default. Not a flag you have to remember. Not a best practice you read about and then forget. The default behavior previews what would happen without changing a single file on disk.
import re
from pathlib import Path

def bulk_regex_rename(directory: str, pattern: str, replacement: str, dry_run: bool = True):
    """
    Renames files in a directory using regex.
    dry_run=True by default because I learned the hard way.
    """
    target = Path(directory)
    renamed_count = 0
    if dry_run:
        print("=== DRY RUN (no files will be changed) ===\n")
    for filepath in sorted(target.iterdir()):
        if filepath.is_file():
            new_name = re.sub(pattern, replacement, filepath.name)
            if new_name != filepath.name:
                renamed_count += 1
                print(f"  {filepath.name}")
                print(f"    -> {new_name}\n")
                if not dry_run:
                    filepath.rename(target / new_name)
    print(f"--- {renamed_count} file(s) {'would be ' if dry_run else ''}renamed ---")
You run it once, scan the output, and if the before/after mapping looks right, flip dry_run to False and run again. Two seconds of extra caution. Would have saved me an hour of forensic file recovery that Friday.
A few details that came from actually using this thing. The sorted() call wrapping target.iterdir() wasn't in version one. Without sorting, files come back in whatever order the filesystem decides: roughly alphabetical on Windows, mostly arbitrary on Linux. When you're squinting at dry run output trying to spot mistakes, predictable ordering makes bad patterns visible faster. A renamed file that doesn't look right jumps out when everything around it is in alphabetical order.
Name collisions aren't handled at all. Two files ending up with the same output name? Second one quietly stomps the first. No warning, no backup, just gone. Adding a collision guard has been on my to-do list for roughly eighteen months. Hasn't bitten me yet because my use cases tend to produce unique names. Probably don't point this at anything irreplaceable without adding that check yourself.
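For what it's worth, a collision guard doesn't take much. One way to do it, sketched here rather than taken from the real script, is to build the full old-to-new mapping first and refuse to touch disk if two inputs map to the same output (plan_renames is a hypothetical helper, not part of the function above):

```python
import re
from pathlib import Path

def plan_renames(directory: str, pattern: str, replacement: str) -> dict:
    """Build the old -> new name mapping, raising before any rename happens
    if two different files would end up with the same output name."""
    planned = {}
    for filepath in sorted(Path(directory).iterdir()):
        if filepath.is_file():
            new_name = re.sub(pattern, replacement, filepath.name)
            if new_name != filepath.name:
                planned[filepath.name] = new_name
    # Invert the mapping; any output name claimed twice is a collision.
    claimed = {}
    for old, new in planned.items():
        if new in claimed:
            raise ValueError(f"collision: {old!r} and {claimed[new]!r} both become {new!r}")
        claimed[new] = old
    return planned
```

Because the check runs before any filesystem call, a too-greedy pattern fails loudly up front instead of quietly stomping files halfway through.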
The original version used os.path.join and string manipulation everywhere. Switching to pathlib.Path objects cleaned up maybe fifteen lines. One of those standard library modules where, once you start using it, the old way feels like writing assembly by comparison.
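A small before/after of the flavor difference, reconstructed for illustration rather than taken from the actual diff (the path is made up):

```python
import os
from pathlib import Path

directory = "/tmp/invoices"  # hypothetical path, purely for illustration

# The os.path way: every operation is string surgery on the full path.
old_style = os.path.join(directory, "report.pdf")
stem, ext = os.path.splitext(os.path.basename(old_style))

# The pathlib way: the Path object knows its own parts.
p = Path(directory) / "report.pdf"
assert (p.stem, p.suffix) == (stem, ext)  # same answer, less plumbing
```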
The Downloads Folder Organizer
Everybody's Downloads folder is a warzone. Screenshots next to tax forms next to random .json files from some API debugging session three weeks ago. Eventually you can't find anything, so you just download the same file again, and the problem compounds.
import shutil
from pathlib import Path

def organize_downloads(directory: str):
    """
    Sorts files into subfolders by type.
    Run this on a schedule if you're as messy as I am.
    """
    target = Path(directory)
    extensions = {
        'Images': ['.jpg', '.jpeg', '.png', '.gif', '.webp', '.svg', '.bmp'],
        'Docs': ['.pdf', '.docx', '.doc', '.csv', '.xlsx', '.pptx', '.txt'],
        'Archives': ['.zip', '.tar', '.gz', '.rar', '.7z'],
        'Code': ['.py', '.js', '.ts', '.json', '.html', '.css', '.md'],
        'Videos': ['.mp4', '.mkv', '.avi', '.mov'],
        'Audio': ['.mp3', '.wav', '.flac', '.aac'],
        'Installers': ['.exe', '.msi', '.dmg', '.deb'],
    }
    # Build a reverse map: extension -> folder name
    ext_map = {}
    for folder, exts in extensions.items():
        for ext in exts:
            ext_map[ext] = folder
    moved = 0
    skipped = 0
    for item in target.iterdir():
        if item.is_file():
            folder_name = ext_map.get(item.suffix.lower())
            if folder_name:
                dest_folder = target / folder_name
                dest_folder.mkdir(exist_ok=True)
                dest_path = dest_folder / item.name
                if dest_path.exists():
                    # Don't overwrite, just skip
                    print(f"  [SKIP] {item.name} (already exists in {folder_name}/)")
                    skipped += 1
                    continue
                shutil.move(str(item), str(dest_path))
                print(f"  [MOVED] {item.name} -> {folder_name}/")
                moved += 1
    print(f"\nDone. Moved {moved} files, skipped {skipped}.")
Match extensions to folder names. Move files. Skip anything that would collide. That skip-on-exists guard was added after running the script twice in a row one night โ second pass tried moving files already sitting in their destination folders, threw errors everywhere. Debugging your own cleanup tool at 11 PM because you executed it on autopilot is a particular flavor of irony.
Windows Task Scheduler runs this every Sunday at midnight. Monday mornings: clean Downloads folder. Small thing. Adds up over time.
Extension list started at four categories. Now seven. Still occasionally misses .heic files from iPhones and .webm videos. Apple's image format decisions are a whole separate grievance that doesn't belong here. Folders in the Downloads directory get ignored on purpose: those are usually project directories placed there deliberately, and auto-sorting them would cause more trouble than it prevents.
No recursion into subdirectories, either. Once files land in Images/ or Docs/, running the script again shouldn't try to re-sort their contents. Narrow scope. Predictable behavior. Those matter more than clever handling of edge cases nobody encounters.
Finding Duplicate Files (and Reclaiming 12 GB)
Laptop started complaining about disk space. Suspected duplicates: photos copied into multiple folders during half-finished organization binges, documents saved in three places "just in case." The usual accumulation of a person who means well but doesn't follow through.
Approach: hash file contents, look for collisions. Same hash means (almost certainly) identical contents, regardless of what the files are named or where they live.
import hashlib
from pathlib import Path

def find_duplicates(directory: str, min_size_kb: int = 10):
    """
    Finds duplicate files by hashing contents.
    Only checks files above min_size_kb to avoid flagging
    tiny config files that are legitimately the same.
    """
    hashes = {}
    duplicates = []
    min_bytes = min_size_kb * 1024
    print(f"Scanning {directory} for duplicates...\n")
    all_files = [f for f in Path(directory).rglob('*')
                 if f.is_file() and f.stat().st_size > min_bytes]
    print(f"Found {len(all_files)} files above {min_size_kb} KB\n")
    for i, filepath in enumerate(all_files):
        # Read and hash only the first 2MB -- good enough for finding dupes,
        # fast enough for large file collections
        try:
            with filepath.open('rb') as f:
                chunk = f.read(2 * 1024 * 1024)
            file_hash = hashlib.sha256(chunk).hexdigest()
        except OSError:
            continue
        if file_hash in hashes:
            original = hashes[file_hash]
            size_mb = filepath.stat().st_size / 1024 / 1024
            duplicates.append({
                'file': filepath,
                'original': original,
                'size_mb': size_mb
            })
            print(f"  DUPE: {filepath.name} ({size_mb:.1f} MB)")
            print(f"    matches: {original.name}")
            print()
        else:
            hashes[file_hash] = filepath
        # Progress indicator for big scans
        if (i + 1) % 500 == 0:
            print(f"  ...checked {i + 1}/{len(all_files)} files")
    total_wasted = sum(d['size_mb'] for d in duplicates)
    print(f"--- Found {len(duplicates)} duplicates ---")
    print(f"--- Wasted space: {total_wasted:.1f} MB ---")
    return duplicates
Design choices, briefly.
Only the first 2 MB of each file gets hashed. Could two files share an identical opening 2 MB and diverge after that? Theoretically. With photos, PDFs, and typical desktop clutter? Never happened in practice. The full-hash version ran against my photo library for over twenty minutes. Partial hashing finished in about forty seconds. That math settles itself.
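If the theoretical collision ever matters for your data, a cheap insurance policy is a second pass: use the partial hash only to find candidate pairs, then confirm each pair with a full streaming comparison. A sketch of that confirmation step (confirm_duplicate is hypothetical, not part of the script above):

```python
import hashlib
from pathlib import Path

def confirm_duplicate(a: Path, b: Path, chunk_size: int = 2 * 1024 * 1024) -> bool:
    """Full-content check, run only on the few pairs the partial hash flags."""
    if a.stat().st_size != b.stat().st_size:
        return False  # different sizes can never be identical files
    h_a, h_b = hashlib.sha256(), hashlib.sha256()
    with a.open('rb') as fa, b.open('rb') as fb:
        while True:
            chunk_a, chunk_b = fa.read(chunk_size), fb.read(chunk_size)
            h_a.update(chunk_a)
            h_b.update(chunk_b)
            if not chunk_a and not chunk_b:
                break
    return h_a.hexdigest() == h_b.hexdigest()
```

Since partial-hash collisions are rare, the expensive full read would only happen a handful of times per scan.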
The min_size_kb parameter filters out tiny files. Without it, the first run flagged hundreds of "duplicates" that were all identical .gitignore and .editorconfig files scattered across different project directories. Those are supposed to be the same. Setting the floor at 10 KB removed that noise without losing real results.
And, importantly, the script only finds duplicates. It does not delete anything. Tried adding auto-delete once. Nearly lost a batch of vacation photos because the script chose to keep the copy buried in some temp folder and remove the version in my organized photo library. Learned from that. Print the results. Decide manually. Delete by hand. Less dramatic. Much less regret.
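One middle ground between scrolling terminal output and deleting by hand: dump the findings to a CSV and review them in a spreadsheet. A small helper along those lines, assuming the list-of-dicts shape find_duplicates returns (write_dupe_report itself is hypothetical):

```python
import csv

def write_dupe_report(duplicates: list, out_path: str = "duplicates.csv"):
    """Write duplicate findings to a CSV for manual review. Deletes nothing."""
    with open(out_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['duplicate', 'original', 'size_mb'])
        for d in duplicates:
            writer.writerow([str(d['file']), str(d['original']), f"{d['size_mb']:.1f}"])
```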
First full scan of my home directory turned up 12 GB of duplicates. Mostly photos I'd copied into various folders during several failed attempts at organizing my photo library. Creating duplicates while trying to organize files. The circularity wasn't lost on me.
Watching Websites for Changes
Newest addition to the collection. Also the simplest. Also the one with the best backstory.
Concert tickets. Venue website said "tickets on sale soon" with no date attached. Found myself refreshing that page four or five times a day like it was going to change faster if I stared at it harder. After a few days of this, realized the behavior was indistinguishable from what a bot does, so why not let an actual bot handle it.
import hashlib
import requests
from pathlib import Path
from datetime import datetime

def check_for_changes(url: str, state_dir: str = ".web_watch"):
    """
    Checks if a webpage has changed since the last check.
    Stores state files in a directory so you can watch multiple URLs.
    """
    state_path = Path(state_dir)
    state_path.mkdir(exist_ok=True)
    # Use URL hash as filename so we can track multiple sites
    url_hash = hashlib.md5(url.encode()).hexdigest()[:12]
    state_file = state_path / f"{url_hash}.txt"
    try:
        response = requests.get(url, timeout=10, headers={
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })
        response.raise_for_status()
        current_hash = hashlib.md5(response.text.encode('utf-8')).hexdigest()
    except requests.exceptions.RequestException as e:
        print(f"[ERROR] Couldn't reach {url}")
        print(f"  {e}")
        return None
    now = datetime.now().strftime("%Y-%m-%d %H:%M")
    if state_file.exists():
        old_hash = state_file.read_text().strip()
        if old_hash != current_hash:
            print(f"[CHANGED] {url}")
            print(f"  Detected at {now}")
            state_file.write_text(current_hash)
            return True
        else:
            print(f"[NO CHANGE] {url} (checked {now})")
            return False
    else:
        print(f"[FIRST RUN] {url}")
        print(f"  Saved initial snapshot at {now}")
        state_file.write_text(current_hash)
        return None
The only part that required actual thought was state management. Version one stored everything in a single flat file, one URL at a time. Current version creates a .web_watch directory and gives each URL its own state file, named by hashing the URL into a short hex string. No weird characters in filenames, no path length issues, scales to as many pages as you want to track.
User-Agent header got added because the concert venue was blocking Python's default request signature. Script kept reporting connection errors. Spent ten minutes convinced the site was down before figuring out they were just rejecting traffic that looked automated. Adding a browser-like agent string fixed it immediately. A little dishonest, maybe. But those were good seats.
Cron job checks a list of URLs every thirty minutes. Changes get written to a log file. Keep meaning to hook up a Slack webhook or a text notification for actual alerts. Keep not doing it. Reading the log has been sufficient. My project backlog has a lot of "one of these days" items that have been sitting there long enough to qualify as permanent residents.
Biggest limitation: the script fires on any change to the page HTML. Modern sites have rotating timestamps, ad placements, session tokens baked into the markup, stuff that shifts on every single load even when the real content hasn't moved. News sites are terrible for this. The "trending now" sidebar cycles constantly and triggers a false positive every time. You can work around it by parsing the DOM and only hashing a specific section, but that means writing per-site extraction logic. Only worth it for pages you plan to monitor long-term.
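The crudest version of that workaround doesn't even need an HTML parser: pull out just the element you care about with a regex and hash that instead of the whole page. A sketch (the id="tickets" selector and the pages are made up, and anything beyond trivially regular markup deserves a real parser):

```python
import hashlib
import re

def hash_section(html: str, section_pattern: str) -> str:
    """Hash only the first match of section_pattern, not the full page."""
    match = re.search(section_pattern, html, re.DOTALL)
    content = match.group(0) if match else html  # fall back to the whole page
    return hashlib.md5(content.encode('utf-8')).hexdigest()

# Two loads of a hypothetical page where only the noisy sidebar differs:
page_a = '<div id="tickets">On sale soon</div><div id="trending">story 1</div>'
page_b = '<div id="tickets">On sale soon</div><div id="trending">story 2</div>'
tickets = r'<div id="tickets">.*?</div>'
# hash_section(page_a, tickets) equals hash_section(page_b, tickets),
# so the rotating sidebar no longer triggers a false positive.
```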
For the concert tickets, though? Worked exactly as intended. Page flipped from "on sale soon" to having an actual purchase button, the script caught it within thirty minutes, and tickets were secured that same afternoon. Dumb approach. Right result.
How These Actually Run
None of these are packaged as standalone CLI programs. They live in ~/scripts/ and get imported into whatever context makes sense at the time: a Jupyter notebook, a one-off script, sometimes just a bare REPL session:
from organize import organize_downloads
organize_downloads("C:/Users/anurag/Downloads")
Scheduled ones, the downloads organizer and the web watcher, go through Windows Task Scheduler. Linux would be cron. If you're spending a lot of time in the terminal running these kinds of things, my post on Linux command line essentials covers the shell fundamentals that make the workflow smoother. Initial setup is annoying. After that you forget the schedule exists, which is the whole point.
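For reference, the cron equivalent of those two scheduled jobs would be a couple of crontab lines along these lines (script names and paths hypothetical):

```
# Sunday at midnight: tidy the Downloads folder
0 0 * * 0    /usr/bin/python3 /home/anurag/scripts/organize_downloads.py
# Every 30 minutes: check watched pages, append results to a log
*/30 * * * * /usr/bin/python3 /home/anurag/scripts/web_watch.py >> /home/anurag/scripts/watch.log 2>&1
```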
Error handling across all four is minimal by design. No retry logic. No structured logging. No YAML configuration. These run on my machine for my own purposes. When something fails, the traceback appears in my terminal and I deal with it. Wrapping forty lines of script in enterprise-grade error infrastructure makes the code harder to read without making it meaningfully more reliable in this context. If I were distributing these to other people, different story. For ~/scripts/? Overhead without payoff.
There is a line, though. The renamer failing on one file? Acceptable. The web watcher crashing mid-check? It'll run again in thirty minutes, no big deal. The downloads organizer overwriting a file I actually needed? That's the line. Data loss doesn't get a pass. That's why the skip-on-collision check exists.
Scripts That Remain Theoretical
Phone has a running list of automation ideas. Growing faster than the completion rate, which sits at maybe 15% if you're generous with the definition of "complete."
A clipboard history manager that dumps everything copied to a searchable text file. Windows clipboard history exists but it doesn't go back far enough and the search is useless for finding something from three days ago. Started building this twice. Both attempts stalled on figuring out how to poll the clipboard without pegging CPU in a tight loop. Probably needs an event hook instead of polling. Haven't investigated far enough to know for sure.
An email attachment saver that talks to Gmail, pulls every attachment from the past month, and sorts them by sender. Documents arrive via email and then sit in my inbox until I forget they exist. Gmail API's OAuth flow is the blocker: authentication setup always takes three times longer than the docs suggest, and by the time I've wrestled with consent screens and token refresh logic, the motivation has evaporated.
A screenshot OCR tool that watches a folder and runs text extraction on new images automatically. Would make screenshots searchable by content instead of just filename. Tesseract exists, the wiring shouldn't be complicated, I just haven't carved out the afternoon to do it.
If you're curious how Python compares to compiled languages for more performance-sensitive work, I compared Rust vs Go for backend services, but for quick-and-dirty automation like these scripts, Python's write-run-iterate cycle is hard to match. Twenty minutes from "this task is annoying" to "this task is automated." Compiled languages can't touch that feedback loop for throwaway tooling.
Those Two Hundred Invoices
Funny epilogue to the accounting firm story: the client never found out. Recovered everything from the backup within the hour, delivered properly renamed files the next morning, invoice paid in full. No damage except to my ego and my resting heart rate for a few hours.
But every time I run bulk_regex_rename with dry run enabled and scan that preview output before committing to anything real, I think about those two hundred PDFs. The forty minutes spent writing a safer version has probably prevented three or four similar mistakes since: not identical situations, different files and different patterns, but the same category of "I was confident this would work and it didn't."
That might be the actual argument for writing your own tools instead of grabbing something off GitHub. Not because yours are better. They're usually worse: fewer features, rougher edges, no documentation. But you wrote them after something went specifically wrong, so they protect against the exact failures you've lived through. The dry run default isn't a design pattern I picked up from a best-practices article. It's scar tissue from a Friday night spent matching file sizes to client names in a backup directory.
Take any of these if they're useful. Add the collision detection I keep not adding. Build the Slack webhook I keep not building. Half the point of small tools is seeing what someone else does with them once the skeleton exists. The other half is never having to manually rename two hundred files again.
Written by
Anurag Sinha
Full-stack developer specializing in React, Next.js, cloud infrastructure, and AI. Writing about web development, DevOps, and the tools I actually use in production.