Prompt Engineering is Not a Real Job (But You Still Need to Learn It)
The 'prompt whisperer' industry is mostly a grift, but there are three techniques that actually help when working with language models.

Most prompt engineering advice is nonsense dressed up as methodology.
Not all of it. But the industry that's sprung up around "prompt whispering" (the $200 courses, the numbered technique lists, the screenshots of a single good ChatGPT response presented as proof that a method works) mostly doesn't hold up once you're building something real with actual users depending on the output.
Here's the experience that changed how I think about this. Last October, I spent four hours trying to get a language model to extract structured data from scanned restaurant menus. The task: take messy OCR text, pull out dish names, prices, and dietary tags, return JSON. The OCR pipeline worked. The schema was defined. The model kept producing plausible-looking garbage: inventing dishes that weren't on the menu, hallucinating reasonable-sounding prices that were wrong, marking things as vegan that obviously contained shrimp.
The maddening part: about one in five attempts, the output was perfect. Exactly right. So the model could do it. It just mostly didn't.
Tried rewriting the prompt a dozen times. Tried politeness. Tried aggression. Tried wrapping everything in XML tags because someone on Twitter said that helped. Tried "take a deep breath." Tried offering a hypothetical tip. None of it made a consistent difference.
Eventually I fixed it. And looking back, the fix wasn't some secret incantation. It was boring. Just... communicating more carefully. Which might be the only honest thing anyone can say about this entire field.
Where the Popular Advice Falls Apart
There's a disconnect between how prompting is discussed online and how it works when you're actually building something. The online version is full of "secret techniques" and confident claims that adding "you are an expert" to a prompt improves output quality by 40%. People screenshot their best ChatGPT result and present it as proof. Nobody shows the four attempts before that where the technique produced nothing useful.
Tested a lot of these over the past year or so. Persona prompting. Temperature tweaking. "Tree of thought." Prompt chaining across multiple calls. Some help sometimes. Most don't help reliably enough to build on.
The persona trick is a decent example. "You are a senior database administrator with 20 years of experience" doesn't give the model expertise. Models don't have experience. What you're doing, as near as I can tell, is nudging the token distribution toward patterns associated with technical database content rather than casual forum posts. A filter, not a teacher. Sometimes that filter helps. When you're asking about PostgreSQL indexing strategies, biasing toward DBA-sounding text probably improves accuracy. When you paste in a specific schema and ask for analysis, whether the model is "pretending" to be a DBA or not makes almost no measurable difference to correctness.
The emotional manipulation stuff is even weaker. "This is very important to my career." "My boss will fire me if this is wrong." People report that these phrases improve quality. Maybe they do trigger some statistical association with careful human writing. But across hundreds of API calls on structured extraction tasks, appending emotional stakes had no statistically significant effect on accuracy. Zero. For creative writing where "quality" is subjective, maybe it matters. For anything you can actually measure? Haven't seen it.
Three Things That Actually Worked
Back to the menu problem. What fixed it wasn't cleverness. Three things, and they're all sort of the same idea wearing different clothes.
Being obnoxiously specific about the output format. Not "return JSON" but spelling out the exact schema field by field. Types and constraints included. Went from two sentences of instruction to a full paragraph of specification: "Return a JSON array where each element has the fields name (string, exactly as written on the menu, do not paraphrase), price (number, in dollars, null if not listed), and dietary_tags (array of strings, only from the set: vegan, vegetarian, gluten-free, contains-nuts, contains-shellfish, or empty array if none indicated)."
Felt like overkill. Worked.
The insight here isn't about system prompts or "roles." The interesting part is constraining the output space before generation starts. Every constraint (this field must come from this set, this value must be a number, no conversational filler) narrows what the model considers a valid continuation. You're not making it smarter. You're making it harder for it to go wrong.
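One cheap way to enforce that constraint from the code side: pair the field-by-field spec with a validator that rejects anything outside the stated schema. The allowed tag set and field rules below come straight from the spec above; the function and constant names are mine.

```python
import json

# Dietary tags exactly as enumerated in the prompt specification.
ALLOWED_TAGS = {"vegan", "vegetarian", "gluten-free",
                "contains-nuts", "contains-shellfish"}

def validate_menu_json(text: str) -> list[dict]:
    """Parse model output and enforce the schema; raise on any violation."""
    items = json.loads(text)
    if not isinstance(items, list):
        raise ValueError("top level must be a JSON array")
    for item in items:
        if not isinstance(item.get("name"), str):
            raise ValueError("name must be a string")
        price = item.get("price")
        if price is not None and not isinstance(price, (int, float)):
            raise ValueError("price must be a number or null")
        tags = item.get("dietary_tags")
        if not isinstance(tags, list) or not set(tags) <= ALLOWED_TAGS:
            raise ValueError("dietary_tags outside the allowed set")
    return items
```

A validator like this doesn't make the model more accurate, but it turns "plausible-looking garbage" into a loud failure instead of a silent one, which is what makes retries and monitoring possible.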
Examples. Showing the model two or three completed input-output pairs before asking it to handle new input. Hate calling this "few-shot prompting" because it sounds like jargon designed to make a simple concept impressive. But pasting three real menus with correct JSON output, then the new menu to process: accuracy jumped from maybe 60% to around 90%. Bigger improvement than any other single change.
Why this works so well? My best guess (and nobody really knows how these models work inside) is that examples do something instructions can't. Instructions describe what you want in natural language, but natural language is ambiguous. "Extract dish names exactly as written": what does "exactly" mean? Does it include the description after the name? The price on the same line? The model has to interpret those instructions, and interpretation varies run to run. Examples show what "exactly as written" looks like for this task, with this data. The model pattern-matches from examples instead of interpreting words. Examples are the spec. Instructions are commentary.
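Assembling that kind of prompt is string concatenation, nothing more. A sketch of a helper that prepends worked pairs before the new input; the menu texts in the usage below are invented placeholders, only the layout matters.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], new_input: str) -> str:
    """Assemble a prompt from worked (menu_text, correct_json) pairs,
    followed by the menu we actually want processed."""
    parts = []
    for menu_text, correct_json in examples:
        parts.append(f"Menu:\n{menu_text}\nOutput:\n{correct_json}")
    # The final block ends at "Output:" so the model's continuation
    # is the answer itself.
    parts.append(f"Menu:\n{new_input}\nOutput:")
    return "\n\n".join(parts)
```

The one design choice that matters: every example uses the identical "Menu:/Output:" framing, so the model sees a repeating pattern rather than three differently formatted demonstrations.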
Breaking the task into stages. Not in the "chain of thought" sense of asking the model to show its reasoning. Literally splitting into separate API calls. First call: "List every distinct dish mentioned, one per line, nothing else." Second call: "For each dish, find the price. If no price, write NULL." Third call: assembly into the final schema with dietary tags. Each call narrow enough that the model almost couldn't get it wrong. Final output better than any single-call approach, even with perfect instructions and examples.
Chain of thought taken to its logical end, maybe. People talk about it as asking the model to "think step by step" in a single prompt, and that does help for reasoning tasks where intermediate calculations matter. But for complex structured extraction, separating the steps into distinct API calls was more effective. Each call has one job. The model can't get confused about what it's supposed to be doing because it's only doing one thing.
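The three-call split can be sketched as a small pipeline. To keep this runnable without a provider SDK, `complete` here is whatever completion callable you inject (in production it would wrap an API client); the stage prompts are the point, and the third stage is stubbed out to keep the sketch short.

```python
from typing import Callable

def extract_menu(ocr_text: str, complete: Callable[[str], str]) -> list[dict]:
    """Run extraction as narrow sequential calls instead of one big one."""
    # Stage 1: just the dish names, one per line, nothing else.
    raw = complete(
        f"List every distinct dish mentioned, one per line, nothing else.\n\n{ocr_text}"
    )
    dishes = [line.strip() for line in raw.splitlines() if line.strip()]

    # Stage 2: one price lookup per dish.
    prices = {}
    for dish in dishes:
        answer = complete(
            f"In the menu below, find the price of '{dish}'. "
            f"If no price is listed, write NULL.\n\n{ocr_text}"
        ).strip()
        prices[dish] = None if answer.upper() == "NULL" else float(answer)

    # Stage 3 (assembly with dietary tags) omitted; same pattern.
    return [{"name": d, "price": prices[d]} for d in dishes]
```

Injecting `complete` also makes the pipeline testable with a stub model, which is how you catch prompt regressions without burning API calls.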
The Uncertain Middle
A whole category of advice that I can't confidently call good or bad.
XML-style tags for prompt sections: wrapping content in <context>, <instructions>, <examples>. Seems to help with longer prompts where different content types are mashed together. But does it help because the model was trained on XML and recognizes structure? Would markdown headers work just as well? Tried both. Can't tell the difference consistently. Instinct says tags help more past a thousand tokens or so, where the model might lose track of which section is instructions and which is data. For shorter prompts, probably noise.
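Both layouts are trivial to generate from the same section dict, which makes A/B testing them cheap. A sketch (helper name and layout choices are mine, not a standard):

```python
def sectioned_prompt(sections: dict[str, str], style: str = "xml") -> str:
    """Lay out named prompt sections as XML-style tags or markdown headers."""
    parts = []
    for name, body in sections.items():
        if style == "xml":
            parts.append(f"<{name}>\n{body}\n</{name}>")
        else:
            parts.append(f"## {name}\n{body}")
    return "\n\n".join(parts)
```

Run the same eval set against both styles and measure; that settles the tags-vs-headers question for your prompts faster than any online argument.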
Temperature. Standard advice: low for factual tasks, higher for creative. Makes sense directionally. But the specific value? People argue with conviction that 0.3 is the sweet spot for code generation while 0.7 is right for marketing copy. Don't buy it. Difference between 0.2 and 0.5 on a factual task is negligible if the prompt is well-constructed. Temperature amplifies variance that's already present. Fix the prompt and temperature becomes less important. Usually just leave it at the default unless there's a specific reason to change it.
Meta-prompting: having one call generate a response and a second call evaluate it. Interesting. Seen it catch certain error types. But it doubles API costs and latency, and the evaluator has the same blind spots as the generator. A model that confidently invents a fake dish name will often confidently approve that same fake dish when asked to review. Diminishing returns hit fast. For anything that actually matters, a human still reviews the output. That hasn't changed.
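The generate-then-evaluate loop, sketched with the same injected-callable approach (function name and prompt wording are mine). Note that the reviewer sees only the task and the draft, generated by the same model, which is exactly why it shares the generator's blind spots.

```python
from typing import Callable

def generate_with_review(task: str, complete: Callable[[str], str]) -> tuple[str, bool]:
    """One call generates, a second call reviews. Doubles cost and latency."""
    draft = complete(task)
    verdict = complete(
        f"Task:\n{task}\n\nDraft answer:\n{draft}\n\n"
        "Does the draft answer the task correctly? Reply YES or NO."
    )
    return draft, verdict.strip().upper().startswith("YES")
```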
The Part Nobody Wants to Hear
The stuff that works in prompt engineering is, stripped down, just clear communication. Be specific about what you want. Show examples of what correct output looks like. Break complex tasks into simple pieces. These aren't techniques. They're what a competent manager does when delegating to a new team member. The entire field, minus the mystique, is project management for a very fast, somewhat unreliable research assistant with a massive but frozen knowledge base and a strong inclination to tell you what you want to hear.
Which raises the shelf-life question. Models keep improving; I traced the evolution from basic neural networks to GPT to ChatGPT in my post on how LLMs actually got here. Tasks that required elaborate prompt gymnastics a year ago now work with a simple direct question. The menu extraction problem? Tried it again recently with a newer model and a bare-bones prompt. Worked almost as well as my carefully engineered multi-stage pipeline. Not quite as well. But close enough that the engineering effort was questionable.
Seen this pattern repeatedly. You spend weeks crafting the perfect prompt pipeline, then six months later a new model comes out and a naive prompt gets 95% of the way. The shelf life of prompt engineering work is short. Like optimizing assembly in the 1990s: valuable at the time, but the compiler was about to get much better.
So Should You Learn This?
Sort of. The underlying skill (being precise about requirements, providing good examples, decomposing problems) is timeless. Those are just engineering skills applied to a new interface. The specific tricks and templates and frameworks? Ephemeral. Tuned to the quirks of current models, and current models are a snapshot of a thing that's changing fast.
I still catch myself spending too long tweaking a prompt when the real question is whether the model is the right tool at all. There's a bias that sets in: every problem starts looking like a prompting problem. Sometimes you just need a regex. A database query. A Python script that does the job in 20 lines. Sometimes you need a human. The best prompt engineers are the ones quickest to say "this isn't a language model problem" and reach for something else.
Where is this headed? The optimistic version: prompting becomes trivially easy as models get better at understanding intent, and the whole discipline dissolves into regular software engineering. The pessimistic version: models plateau and prompt engineering becomes an increasingly arcane specialization, like SEO but for AI, a field of diminishing returns that exists because the underlying system is imperfect. I'd bet on something in between, but with low confidence. The rate of change makes prediction feel pointless.
What I can say right now: if you're working with language models and getting inconsistent results, the answer is probably not a cleverer template from Reddit. It's probably that you need to be more specific, show the model what good output looks like, and break the task into smaller pieces. Unglamorous. Doesn't make for a good thread on social media.
Structured Output Is the Real Win
One technique that's more recent and deserves its own section: forcing the model to return structured data in a specified format. Not just "return JSON" but actually constraining the model's output to valid JSON matching a schema.
OpenAI's function calling, Anthropic's tool use, and various open-source frameworks (Instructor, Outlines) all provide ways to guarantee the model output matches a structure you define. Instead of hoping the model returns valid JSON and parsing it with a try/catch, you define the expected schema and the API ensures compliance.
This changed my menu extraction pipeline more than any prompting technique. Instead of parsing free-text output and hoping for valid JSON, I defined the output schema explicitly and got valid structured data on every call. No parsing errors. No "the model added a chatty preamble before the JSON" problems. No missing closing brackets.
The catch: structured output modes sometimes constrain the model's reasoning. If you force JSON-only output, the model can't "think out loud" in text before producing the structured answer. Some tasks benefit from that reasoning step. For those, a two-call approach still works better โ first call for reasoning, second call with structured output for the final answer.
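These structured-output modes generally take a JSON Schema describing the expected shape. A schema for the menu item from earlier would look roughly like this; the exact request wrapper varies by provider, so only the schema itself is shown, and the constant name is mine.

```python
# JSON Schema for one menu item, passed to the provider's
# structured-output / tool-use parameter so the API enforces it
# instead of a try/except around json.loads.
MENU_ITEM_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": ["number", "null"]},
        "dietary_tags": {
            "type": "array",
            "items": {
                "type": "string",
                "enum": ["vegan", "vegetarian", "gluten-free",
                         "contains-nuts", "contains-shellfish"],
            },
        },
    },
    "required": ["name", "price", "dietary_tags"],
}
```

Note how the schema encodes the same constraints the prose spec did: `price` is number-or-null, and `dietary_tags` is an enum rather than free text, so the model literally cannot emit a tag outside the set.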
When to Walk Away From the Model
This might be the most practical thing I can say about prompt engineering: know when to stop trying.
I wasted an entire afternoon last month trying to get a model to parse dates from free-text event descriptions. "Next Tuesday," "the 3rd of March," "two weeks from now," "last Friday of the month." The model handled maybe 70% of cases correctly. Spent hours tweaking the prompt, adding examples, trying different decomposition strategies. Got it to maybe 80%. Still not production-ready.
Then I spent twenty minutes writing a rule-based parser using a date-parsing library. Handled 95% of cases. Deterministic. No API costs. No latency. Runs locally. The remaining 5% edge cases were things like "the Tuesday after Easter" which even a human would need to think about.
The sunk cost trap with prompting is real. You've already invested time crafting the prompt, building the pipeline, and testing variations. Walking away feels like wasting that effort. But the model is a tool, not a commitment. If a regex, a lookup table, or a simple script solves the problem more reliably, switch. Nobody gives you points for using AI when something simpler works better.
My rough heuristic: if I've spent more than an hour on prompt iteration and accuracy is still below 85%, I step back and ask whether a non-ML approach exists. Often it does. The model is best at tasks where natural language understanding is actually required: summarization, classification of nuanced text, generating human-readable explanations. For structured parsing of predictable formats, traditional code usually wins.
The Disappearing Discipline
Something I've noticed over the past year: the more I work with language models, the more I treat them like tools and the less I treat them like magic. The early days were full of experimentation: "what if I phrase it this way?" "what if I add this role description?" There was a sense of discovery.
Now it's more mechanical. I know the three things that work (be specific, show examples, break it apart). I apply them. When the model fails, I check whether one of those three things is missing. Usually it is. Fix it. Move on.
The danger is overreliance. It's tempting to throw every problem at the model because it's fast and often good enough. But "good enough" accumulates a lot of subtle errors over time. A financial report where 95% of the extracted numbers are correct means 5% are wrong, and you don't know which 5%. A customer-facing chatbot that's right 90% of the time is wrong 10% of the time, and every wrong answer is a support ticket or a lost customer.
The right question isn't "can the model do this task?" It's "can the model do this task reliably enough that I'm comfortable not checking every output?" For most production applications, the answer is still no. The model helps. A human still verifies.
But after a year of doing this daily, in production, with actual users depending on the output, that's what I've got. Maybe different next year. Probably different, actually. I keep writing things down about how these models work and then crossing them out six months later. Take all of it with the appropriate amount of doubt.
Written by
Anurag Sinha
Full-stack developer specializing in React, Next.js, cloud infrastructure, and AI. Writing about web development, DevOps, and the tools I actually use in production.
Related Articles

How LLMs Actually Got Here
A walkthrough of how we went from recurrent neural networks to ChatGPT, and where things stand now.

Machine Learning in Python โ Starting Without a PhD
How I went from confused by terminology to building useful models with scikit-learn, which algorithm to reach for first, and why feature engineering matters more than model choice.

Neural Networks Without the Hype
What neural networks actually do under the hood. Code included. No analogies about brains.