When you integrate an LLM into a toy project, almost any prompt will do. You type a sentence, the model responds with something plausible, and you move on. Production web applications are a different universe entirely. Your prompt is now a contract between your frontend code and a nondeterministic text generator, and every ambiguity in that contract will eventually surface as a bug in front of a real user.
In the apps we build at Ratatat Labs, prompts drive core features: HevyDuty AI generates entire workout programs with specific sets, reps, and weight recommendations, while SimplBiz extracts structured financial data from receipt photographs. A prompt that works 90% of the time is not good enough when the remaining 10% produces malformed JSON that crashes the UI or misreads a decimal on someone's expense report. The gap between a demo-quality prompt and a production-quality prompt is where most of the real engineering effort lives.
The techniques below are drawn from shipping these features to real users. None of them require exotic tooling. They require patience, systematic testing, and a willingness to treat prompts as code that deserves the same rigor as any other module in your codebase.
The single most impactful technique for production prompt engineering is constraining the model's output to a strict JSON schema. When HevyDuty generates a workout, the response must parse cleanly into a TypeScript interface with fields like exerciseName, sets, reps, restSeconds, and weightKg. If the model returns a conversational paragraph instead of JSON, the entire feature is broken.
We achieve this by embedding the exact schema in the system prompt, providing a concrete example of a valid response, and ending the user prompt with an explicit instruction like "Respond ONLY with valid JSON matching the schema above. Do not include markdown fences, explanations, or commentary." Gemini's response_mime_type parameter set to application/json further constrains the output format at the API level, which dramatically reduces formatting failures.
Even with these safeguards, we wrap every JSON.parse call in a try-catch and validate the parsed object against a Zod schema before passing it to the UI. When validation fails, we log the raw response for debugging and show the user a graceful retry option rather than a white screen. This defensive posture has caught dozens of edge cases that would have otherwise been silent failures.
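A minimal sketch of that parse-then-validate step is shown here. To stay dependency-free it swaps the Zod schema for a hand-written type guard; the `ParseResult` type, the `parseWorkoutResponse` name, and the numeric bounds are assumptions for illustration, not HevyDuty's actual code.

```typescript
// Field names come from the article; the bounds below are illustrative assumptions.
interface Exercise {
  exerciseName: string;
  sets: number;
  reps: number;
  restSeconds: number;
  weightKg: number;
}

type ParseResult =
  | { status: "ok"; exercises: Exercise[] }
  | { status: "error"; reason: "invalid_json" | "schema_mismatch"; raw: string };

function isPositiveInteger(n: unknown): boolean {
  return typeof n === "number" && Number.isInteger(n) && n > 0;
}

function isNonNegativeNumber(n: unknown): boolean {
  return typeof n === "number" && Number.isFinite(n) && n >= 0;
}

// Hand-written stand-in for a Zod schema check.
function isExercise(value: unknown): value is Exercise {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.exerciseName === "string" &&
    v.exerciseName.trim().length > 0 &&
    isPositiveInteger(v.sets) &&
    isPositiveInteger(v.reps) &&
    isNonNegativeNumber(v.restSeconds) &&
    isNonNegativeNumber(v.weightKg)
  );
}

function parseWorkoutResponse(raw: string): ParseResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // The model returned prose, fenced markdown, or truncated JSON.
    return { status: "error", reason: "invalid_json", raw };
  }
  if (!Array.isArray(parsed) || !parsed.every(isExercise)) {
    // Valid JSON, but not the shape the UI expects.
    return { status: "error", reason: "schema_mismatch", raw };
  }
  return { status: "ok", exercises: parsed };
}
```

Returning the raw string on the error branch keeps it available for logging, and the caller can branch on `status` to show a retry option instead of rendering partial data.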
Temperature controls randomness, and the right setting depends entirely on what you are asking the model to do. For SimplBiz receipt scanning, we use a temperature near 0.1. Extracting a dollar amount from a receipt image has one correct answer, and we want the model to give us its highest-confidence interpretation every time. Creative variation is the enemy here.
For HevyDuty workout generation, we run at 0.7 to 0.9. Users who generate multiple programs for the same goal should get meaningfully different routines, not the same exercises in the same order. A higher temperature encourages variety in exercise selection while the structured schema keeps the output format consistent. The schema acts as guardrails while temperature controls the creative range within those guardrails.
One non-obvious finding: temperature interacts with prompt specificity. A highly detailed prompt with a strict schema tolerates higher temperatures because the structure constrains the randomness to the dimensions you actually want varied. A vague prompt at high temperature is chaos. A specific prompt at high temperature is controlled creativity.
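One way to keep these choices explicit is a per-task settings table, sketched below. The `TaskKind` names, the `GenerationSettings` shape, and the 0.8 midpoint are assumptions for illustration; the field names would need to be mapped onto whatever your SDK's generation config expects.

```typescript
// Hypothetical task names and settings shape; not tied to any specific SDK.
type TaskKind = "receiptExtraction" | "workoutGeneration";

interface GenerationSettings {
  temperature: number;
  responseMimeType: "application/json";
}

const SETTINGS_BY_TASK: Record<TaskKind, GenerationSettings> = {
  // Extraction has one correct answer, so keep randomness near the floor.
  receiptExtraction: { temperature: 0.1, responseMimeType: "application/json" },
  // Generation should vary between runs; 0.8 is an assumed midpoint of 0.7 to 0.9.
  workoutGeneration: { temperature: 0.8, responseMimeType: "application/json" },
};

function settingsFor(task: TaskKind): GenerationSettings {
  // Return a copy so callers cannot mutate the shared table.
  return { ...SETTINGS_BY_TASK[task] };
}
```

Keeping temperature next to the task name, rather than scattered across call sites, makes it harder to reuse a creative setting for an extraction task by accident.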
Some tasks require the model to reason through multiple steps before producing a final answer. HevyDuty's workout generation is a good example: the model must consider the user's training goal, experience level, available days per week, equipment access, and recent training history before selecting exercises, assigning volumes, and distributing muscle groups across sessions.
We use a chain-of-thought approach by structuring the system prompt with explicit reasoning steps: "First, determine the appropriate weekly volume per muscle group based on the user's experience level. Second, distribute muscle groups across the available training days to ensure adequate recovery. Third, select exercises that match the user's equipment and experience constraints. Fourth, assign sets, reps, and RPE targets for each exercise." This sequencing prevents the model from jumping straight to exercise selection without considering the broader program structure.
The key insight is that chain-of-thought is not just about getting better answers. It is about getting more predictable answers. When the model reasons step by step, its outputs cluster more tightly around sensible programs. Without chain-of-thought, you see occasional wild outputs where the model assigns 30 sets of bicep curls to a 3-day full-body program because nothing in the prompt guided it through volume distribution first.
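Treating the reasoning steps as data makes their order explicit and easy to review in a diff. The sketch below assembles the four steps quoted above into a numbered block; the `buildReasoningSection` name and the introductory sentence it prepends are assumptions, not the actual HevyDuty prompt.

```typescript
// The four steps are taken from the article; everything else is illustrative.
const REASONING_STEPS: string[] = [
  "Determine the appropriate weekly volume per muscle group based on the user's experience level.",
  "Distribute muscle groups across the available training days to ensure adequate recovery.",
  "Select exercises that match the user's equipment and experience constraints.",
  "Assign sets, reps, and RPE targets for each exercise.",
];

function buildReasoningSection(steps: string[]): string {
  // Number the steps so the model sees an unambiguous order.
  const numbered = steps.map((step, index) => `${index + 1}. ${step}`);
  // Assumed header line; the real prompt wording is not given in the article.
  const header = "Work through these steps in order before writing the program:";
  return [header, ...numbered].join("\n");
}
```

Because the order lives in an array, moving volume planning ahead of exercise selection is a one-line change that shows up clearly in code review.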
Users will submit inputs your prompts were never designed for. Someone will paste an image of a handwritten grocery list into SimplBiz's receipt scanner. Someone will request a HevyDuty workout for "losing 50kg in one week." Your prompts need to handle these gracefully rather than hallucinating plausible-looking but dangerous output.
We build rejection logic directly into the system prompt: "If the uploaded image does not appear to be a receipt or invoice, respond with the JSON object {"error": "not_a_receipt", "confidence": 0}." This gives the model an explicit exit path for inputs it cannot process, and our frontend code checks for the error field before attempting to display results. Without this escape hatch, the model will try to extract receipt data from any image, often returning fabricated amounts with high apparent confidence.
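The frontend side of that contract can be sketched as a small function that checks for the error field before anything else. The `ReceiptData` fields, the `ReceiptView` type, and the user-facing messages are hypothetical; the article does not list SimplBiz's real response shape.

```typescript
// Hypothetical success shape, for illustration only.
interface ReceiptData {
  total: number;
  currency: string;
}

interface ReceiptRejection {
  error: string;
  confidence: number;
}

type ReceiptView =
  | { kind: "rejected"; message: string }
  | { kind: "receipt"; data: ReceiptData };

function isRejection(value: unknown): value is ReceiptRejection {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as Record<string, unknown>).error === "string"
  );
}

function isReceiptData(value: unknown): value is ReceiptData {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.total === "number" &&
    Number.isFinite(v.total) &&
    typeof v.currency === "string"
  );
}

function toReceiptView(parsed: unknown): ReceiptView {
  const genericMessage = "We could not read that image. Please try again.";
  // Check the error field first, before touching any receipt fields.
  if (isRejection(parsed)) {
    const message =
      parsed.error === "not_a_receipt"
        ? "That image does not look like a receipt. Try another photo."
        : genericMessage;
    return { kind: "rejected", message };
  }
  if (isReceiptData(parsed)) {
    return { kind: "receipt", data: parsed };
  }
  // Neither a rejection nor a valid receipt: treat as a failure, never guess.
  return { kind: "rejected", message: genericMessage };
}
```

The final fallback matters as much as the explicit rejection path: a response that is neither shape is handled the same way as a rejection rather than being rendered.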
For HevyDuty, we enforce safety constraints in the prompt based on experience level. The system prompt includes rules like "For beginners, never prescribe Olympic lifts, plyometric box jumps above 20 inches, or more than 4 working sets per exercise." These rules act as a safety net that the model respects remarkably well, though we also validate the output programmatically to catch any violations.
- Define explicit rejection paths in your prompt for out-of-domain inputs
- Add safety constraints as numbered rules in the system prompt
- Validate model output programmatically even when prompt constraints exist
- Log rejected and edge-case inputs to improve prompts iteratively
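The programmatic check on beginner programs might look like the sketch below, which mirrors the three rules quoted earlier. The keyword list is illustrative and incomplete, and `boxHeightInches` is a hypothetical field; a real validator would need a vetted exercise taxonomy rather than substring matching.

```typescript
interface PlannedExercise {
  exerciseName: string;
  sets: number;
  boxHeightInches?: number; // hypothetical field, only present for box jumps
}

// Illustrative, incomplete keyword list (an assumption, not a vetted taxonomy).
const OLYMPIC_LIFT_KEYWORDS = ["snatch", "clean and jerk", "power clean", "hang clean"];
const MAX_BEGINNER_SETS = 4;
const MAX_BEGINNER_BOX_INCHES = 20;

function findBeginnerViolations(program: PlannedExercise[]): string[] {
  const violations: string[] = [];
  for (const exercise of program) {
    const name = exercise.exerciseName.toLowerCase();
    if (OLYMPIC_LIFT_KEYWORDS.some((keyword) => name.indexOf(keyword) !== -1)) {
      violations.push(`${exercise.exerciseName}: Olympic lift prescribed to a beginner`);
    }
    if (exercise.sets > MAX_BEGINNER_SETS) {
      violations.push(`${exercise.exerciseName}: ${exercise.sets} sets exceeds ${MAX_BEGINNER_SETS}`);
    }
    if (
      exercise.boxHeightInches !== undefined &&
      exercise.boxHeightInches > MAX_BEGINNER_BOX_INCHES
    ) {
      violations.push(`${exercise.exerciseName}: box height above ${MAX_BEGINNER_BOX_INCHES} inches`);
    }
  }
  return violations;
}
```

An empty result means the program passed these checks; any non-empty result can be logged alongside the raw response and trigger a regeneration instead of reaching the user.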
Prompt changes are code changes, and they deserve a testing process. We maintain a test suite of 20-30 representative inputs for each prompt-driven feature: receipts in different languages, faded thermal paper photos, handwritten notes, workout requests spanning every combination of goal and experience level. After every prompt revision, we run the full suite and compare outputs against expected results.
The testing process is not fully automated because evaluating LLM output quality often requires human judgment. Instead, we use a semi-automated approach: a script sends each test input through the prompt, parses the response, checks schema validity, and flags outputs that differ significantly from the previous run. A developer then reviews the flagged outputs to decide whether the changes are improvements or regressions.
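The flagging step of such a script could be sketched as follows. "Differs significantly" is interpreted here as a Jaccard similarity below a threshold over a per-case fingerprint (for example, the set of exercise names); that metric, the 0.5 default, and the `RunResult` shape are all assumptions rather than the approach the team actually uses.

```typescript
interface RunResult {
  caseId: string;
  schemaValid: boolean;
  // Tokens summarizing the output, e.g. exercise names (assumed representation).
  fingerprint: string[];
}

function uniqueItems(items: string[]): string[] {
  return items.filter((item, index) => items.indexOf(item) === index);
}

// Jaccard similarity: shared items divided by total distinct items.
function jaccard(a: string[], b: string[]): number {
  const ua = uniqueItems(a);
  const ub = uniqueItems(b);
  if (ua.length === 0 && ub.length === 0) return 1;
  const shared = ua.filter((item) => ub.indexOf(item) !== -1).length;
  return shared / (ua.length + ub.length - shared);
}

function flagForReview(
  previous: RunResult[],
  current: RunResult[],
  minSimilarity: number = 0.5
): string[] {
  const flagged: string[] = [];
  for (const run of current) {
    if (!run.schemaValid) {
      flagged.push(`${run.caseId}: schema invalid`);
      continue;
    }
    const baseline = previous.filter((p) => p.caseId === run.caseId)[0];
    if (baseline === undefined) {
      flagged.push(`${run.caseId}: no baseline to compare against`);
      continue;
    }
    const similarity = jaccard(baseline.fingerprint, run.fingerprint);
    if (similarity < minSimilarity) {
      flagged.push(`${run.caseId}: similarity ${similarity.toFixed(2)} below ${minSimilarity}`);
    }
  }
  return flagged;
}
```

For high-temperature features the threshold would have to be loose, since variety between runs is intended; the script only narrows down what a developer needs to read.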
This approach has caught regressions that would have been invisible without systematic testing. A prompt change that improved workout variety for advanced users once silently broke the beginner safety constraints, prescribing barbell squats to a first-time lifter. The test suite caught it because one of the test cases was specifically a beginner requesting their first program.
After shipping prompt-driven features across multiple apps, a few principles have become clear. First, shorter prompts are not better prompts. Production system prompts for HevyDuty run to several hundred words because every sentence prevents a category of failure. Second, the model's behavior is more sensitive to the structure and ordering of your prompt than to individual word choices. Moving a constraint from the end to the beginning of the system prompt can have a larger effect than rewriting it entirely.
Third, invest in observability. Log every prompt, response, and validation result in a structured format. When a user reports a problem, you need to reconstruct exactly what the model saw and said. Without logs, debugging prompt issues is guesswork. Finally, treat prompt engineering as an ongoing practice rather than a one-time setup. As your user base grows, the diversity of inputs grows with it, and your prompts need to evolve in response.
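As a rough illustration of the structured logging described above, each model call could be written as one JSON line. The `PromptLogEntry` fields, including `promptVersion`, are hypothetical; the point is only that the prompt, raw response, and validation outcome travel together in one record.

```typescript
// Hypothetical log record shape; adapt the fields to your own pipeline.
interface PromptLogEntry {
  timestamp: string;
  feature: string;
  promptVersion: string;
  prompt: string;
  rawResponse: string;
  validation: "passed" | "failed";
  validationError?: string;
}

function buildLogLine(
  entry: Omit<PromptLogEntry, "timestamp">,
  now: Date = new Date()
): string {
  // One JSON object per line keeps the log easy to filter and replay.
  const full: PromptLogEntry = { timestamp: now.toISOString(), ...entry };
  return JSON.stringify(full);
}
```

Recording a prompt version next to each response makes it possible to tell whether a reported problem predates or follows a given prompt revision.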