A prompt is a piece of code that changes behavior for every user downstream. Unlike the rest of the code, it often lives as a string literal inside a service with no tests, no version history, and no rollback story. That works until it does not.
Treat prompts like migrations
Each prompt change should be a small, numbered, reviewable unit — the way a database migration is. We store prompts in versioned files alongside the service code, tag them with a semantic identifier, and log which version produced each response. When a user complains, we can replay with the exact prompt they hit.
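The scheme above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the prompt names, the `name.vN` identifier format, and both functions are hypothetical, and a real service would persist prompts as files and records in a database rather than in-memory dicts.

```python
# Prompts are stored as numbered, immutable versions -- append-only,
# the way database migrations are. Names and contents are hypothetical.
PROMPTS = {
    ("summarize", 1): "Summarize the following text in one sentence:",
    ("summarize", 2): "Summarize the following text in two sentences, plainly:",
}

def load_prompt(name: str, version: int) -> str:
    # Editing an existing version in place is forbidden; a change
    # means adding ("summarize", 3), never rewriting ("summarize", 2).
    return PROMPTS[(name, version)]

def log_record(user_id: str, name: str, version: int, response: str) -> dict:
    # The version tag is stored alongside every response, so a user
    # complaint can be replayed against the exact prompt they hit.
    return {
        "user": user_id,
        "prompt_id": f"{name}.v{version}",
        "response": response,
    }
```

The append-only rule is what makes replay possible: as long as no version is ever mutated, the logged `prompt_id` uniquely determines the prompt text.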
Evaluation is cheap if you make it cheap
A rolling set of 30–50 labeled examples, run on every prompt change, catches the majority of regressions. It does not have to be a formal eval suite. A notebook that takes 90 seconds to run is worth more than a perfect suite that takes an afternoon to set up.
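A check like this is little more than a loop. The sketch below is hypothetical: `model` stands in for a real LLM call (here it is a trivial fake that does the arithmetic itself), and the example set and function names are invented for illustration.

```python
# A cheap regression check: run every labeled example through the
# current prompt and collect mismatches. Data and names are hypothetical.
EXAMPLES = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "3 + 5", "expected": "8"},
]

def model(prompt: str, text: str) -> str:
    # Stand-in for a real model call; a trivial fake for this sketch.
    a, _, b = text.partition(" + ")
    return str(int(a) + int(b))

def run_evals(prompt: str) -> list[dict]:
    # Returns the failing examples; an empty list means no regression.
    failures = []
    for ex in EXAMPLES:
        got = model(prompt, ex["input"])
        if got != ex["expected"]:
            failures.append({**ex, "got": got})
    return failures
```

Run on every prompt change, an empty return is the green light; a non-empty one tells you exactly which labeled examples the new version broke.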