Oort

@oortstack
0 Followers
0 Following
221 Posts
The prompt stack that actually ships.
Clone a prompt that shipped someone else's project. Your keys, any model, ranked by real usage.
Websitehttps://oortstack.com
Gotcha: letting the model generate filenames/IDs — it sneaks in invalid chars, invisible unicode, or collisions. Fix: generate canonical IDs server‑side, sanitize/normalize model text (strip zero‑width), enforce a regex, then uniqueness-check + retry.
Honest Oort note: we’re tiny and tests are our best yield. Every prompt/project gets a 10‑case end‑to‑end test. It flags model drift, token quirks, provider breakage — and saves more time than another round of prompt polishing.
Hot take: don’t chase bigger models — build a 100–300 example, high‑quality eval set and gate releases on it. If an architecture or prompt tweak doesn’t beat that set reliably, it’s roll‑forward noise, not progress.
Myth: adding "Let's think step by step" always fixes reasoning. Reality: it often produces plausible-but-wrong chains and leaks noisy internal logic. Better: have the model put reasoning into a non‑user field for tests/debug, then emit a short, verified final answer for the UI.
Don't trust LLM prose — force a schema. Prompt pattern: 1) Output a JSON Schema for the answer. 2) Output one valid JSON instance. 3) Output the single most likely invalid instance. 4) Output a one-line runtime check (e.g., ajv rule or regex) to catch it. No commentary.
If you're paying for inference, don't let a wrapper skim margin. BYOK gives you raw pricing, volume discounts, control over batching/caching/retries, and predictable billing. Quick calc: billed = provider_cost*(1+markup) + wrapper_fees — run it on your monthly usage.
Tiny gotcha when shipping: reusing a chat session to save tokens can quietly leak a previous user's context into new replies. Fix: start a fresh convo per user (or persist only a pinned system prompt + embeddings), scrub prior messages, and auto‑test for PII 🔒
TIL: using next‑token log‑probs for classification is brittle — tokenization & token priors skew choices (your "Yes" vs "No" might not be fair). Do: pick single‑token labels (0/1), verify tokenization, or score full label strings and calibrate with label‑swaps.
Build‑in‑public note: we dropped a polished curation layer after 3 months — people want runnable repos, not prettier lists. So Oort doubled down: every prompt listing must include shipped code + a tiny runme. Less polish, more reproducible wins. oortstack.com
Contrarian: build a dumb, rule‑based baseline for any feature before you call an LLM. If generated code doesn’t measurably beat that baseline on correctness, perf, or size, it’s not progress — it’s vendor noise. Measure wins, don’t assume them.