
My 2025 AI Stack for Production RAG (and why)

July 25, 2025

Why My 2025 AI Stack Is TypeScript-First (Python Only Where It Counts)

AI product development has changed dramatically in the last two years. In 2023, most AI engineers defaulted to Python-heavy stacks — FastAPI for APIs, LangChain (Python) for orchestration, HuggingFace for embeddings and fine-tunes. That made sense when models were new and everything happened in notebooks.

But in real products, 90% of the work isn’t Python. It’s UI/UX, auth, payments, user state, analytics — and the faster you ship those, the faster you learn. That’s why in 2025 I shifted to a TypeScript-first stack for nearly everything, and I only bring Python in when it’s truly needed. Result: faster iteration, cleaner integration, and an architecture that’s observable, swappable, and built for change.


#Core Philosophy

  • TS-first for speed — Next.js + Convex + modern TS tooling ship features in hours, not days.
  • Python only where it’s irreplaceable — fine-tuning, LoRA, specialized CV/NLP pipelines.
  • Everything observable — if I can’t see what the model retrieved or why it hallucinated, it’s a demo, not a product.
  • Swappable components — vector DBs, embeddings, and LLMs can change without rewrites.
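
To make the swappable-components point concrete, here is a minimal sketch of the kind of interfaces I code against. The `Embedder`/`Retriever` names and the OpenAI wrapper are illustrative choices, not a specific library's API:

```ts
// Depend on small contracts, not vendors. Swapping in Voyage or Cohere later
// means writing another ~15-line class, not touching call sites.
import OpenAI from "openai";

export interface Embedder {
  embed(texts: string[]): Promise<number[][]>;
}

export interface Retriever {
  search(
    queryEmbedding: number[],
    topK: number
  ): Promise<{ id: string; text: string; score: number }[]>;
}

// One concrete implementation using the OpenAI Node SDK.
export class OpenAIEmbedder implements Embedder {
  private client = new OpenAI(); // reads OPENAI_API_KEY from the environment

  async embed(texts: string[]): Promise<number[][]> {
    const res = await this.client.embeddings.create({
      model: "text-embedding-3-large",
      input: texts,
    });
    return res.data.map((d) => d.embedding);
  }
}
```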

#1) Frontend & Application Layer (All TS)

  • Framework: Next.js (App Router) — mix server/client components, stream responses, share API contracts.
  • Database & state: Convex — real-time data, serverless functions and storage, and cron jobs with no extra infra.
  • Auth: Clerk — OAuth, magic links, SSO in minutes.
  • UI/Styling: Tailwind CSS + shadcn/ui — fast, consistent, themeable.
  • File handling: UploadThing or Vercel Blob.

💡 Example: Nightly RAG index refresh = a Convex cron job calling the vector DB — no extra servers, no DevOps overhead.
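
A minimal sketch of that cron, assuming a hypothetical `internal.rag.refreshIndex` action that re-embeds changed docs and upserts them into the vector DB:

```ts
// convex/crons.ts — Convex's built-in scheduler, no extra infra.
// internal.rag.refreshIndex is a hypothetical internal action defined in convex/rag.ts.
import { cronJobs } from "convex/server";
import { internal } from "./_generated/api";

const crons = cronJobs();

crons.daily(
  "refresh RAG index",
  { hourUTC: 2, minuteUTC: 0 }, // run at 02:00 UTC, after the day's content updates land
  internal.rag.refreshIndex
);

export default crons;
```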


#2) LLM Orchestration & Retrieval (TS)

  • Orchestration: LangChain.js / LangGraph.js.
  • Retrieval: Weaviate or Qdrant for managed ops; pgvector when I want Postgres-native queries.
  • Embeddings: OpenAI text-embedding-3-large for general use; Voyage AI for multilingual; Cohere when cost matters at scale.
  • LLMs: Mix and match — Groq (Llama 3) for low latency; OpenAI for reasoning-heavy work.
  • UI integration: Vercel AI SDK for streaming chat/completions (see the sketch below).

💡 Why not Pinecone? I prefer Weaviate/Qdrant for more control over index params and cost structure.
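
Putting the retrieval and streaming pieces together, a Next.js route handler looks roughly like this (AI SDK v4-style API; `searchQdrant` in `@/lib/retrieval` and the `gpt-4o` choice are placeholders for your own setup):

```ts
// app/api/chat/route.ts — stream a grounded answer from a Next.js route handler.
// searchQdrant is a hypothetical helper that embeds the question and returns
// the top-k chunks from the vector store; swap in Weaviate/pgvector as needed.
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";
import { searchQdrant } from "@/lib/retrieval";

export async function POST(req: Request) {
  const { messages } = await req.json();
  const question = messages[messages.length - 1].content;

  // Retrieve context first, then let the model answer from it.
  const chunks = await searchQdrant(question, { topK: 5 });
  const context = chunks.map((c) => c.text).join("\n---\n");

  const result = streamText({
    model: openai("gpt-4o"),
    system: `Answer using only this context:\n${context}`,
    messages,
  });

  // Streams tokens to the client; pairs with the AI SDK's useChat() hook.
  return result.toDataStreamResponse();
}
```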


#3) The Python “Island” (Only When Needed)

I don’t run the whole backend in Python anymore, but I keep a microservice for:

  • LoRA / fine-tuning (HuggingFace PEFT & Transformers)
  • Custom CV/Audio/NLP pipelines (OpenFace, librosa, spaCy)
  • Self-hosted inference on Modal, Replicate, or Runpod

It’s a small FastAPI service, deployed separately, called from the TS backend only when necessary.

💡 Benefit: I can scale GPU-heavy workloads independently; the main app stays fast.
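
On the TS side, that boundary is just a thin HTTP client. The `/v1/finetune` endpoint, payload shape, and `PYTHON_SERVICE_URL` variable below are assumptions about what your own FastAPI service exposes:

```ts
// lib/pythonIsland.ts — thin client for the separately deployed FastAPI service.
// Endpoint path, payload shape, and env var name are placeholders.
const BASE_URL = process.env.PYTHON_SERVICE_URL ?? "http://localhost:8000";

export interface FinetuneJob {
  jobId: string;
  status: "queued" | "running" | "done" | "failed";
}

export async function startLoraFinetune(
  datasetUrl: string,
  baseModel: string
): Promise<FinetuneJob> {
  const res = await fetch(`${BASE_URL}/v1/finetune`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ dataset_url: datasetUrl, base_model: baseModel }),
  });
  if (!res.ok) {
    throw new Error(`Python service error: ${res.status} ${await res.text()}`);
  }
  return res.json();
}
```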


#4) Evaluation & Observability

If you can’t debug model behavior, you’re flying blind. I track:

  • Retrieval hits + metadata
  • Latency per pipeline stage
  • Hallucination scores (LLM-graded or heuristic)
  • User feedback loops

Tools: LangSmith for tracing and A/B prompt tests; Sentry for app errors.

💡 Example: Logging retrieval hits with timestamps exposed a timezone bug breaking nightly index refreshes.
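
The logging behind that doesn't need to be fancy; one structured JSON line per request with UTC timestamps is what surfaced the bug. A minimal sketch, with field names that are my own convention rather than any tool's schema:

```ts
// lib/ragLog.ts — structured, per-stage logging for the retrieval pipeline.
// Pipe these lines to stdout, Sentry breadcrumbs, or LangSmith metadata.
interface RetrievalLog {
  requestId: string;
  query: string;
  hits: { id: string; score: number; source: string }[];
  retrievalMs: number;
  generationMs: number;
  loggedAtUtc: string; // always log UTC — a local-time assumption hid the cron bug
}

export function logRetrieval(entry: Omit<RetrievalLog, "loggedAtUtc">): void {
  const record: RetrievalLog = {
    ...entry,
    loggedAtUtc: new Date().toISOString(),
  };
  // One JSON line per request keeps logs grep-able and easy to ship to any sink.
  console.log(JSON.stringify(record));
}
```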


#5) Why This Stack Works

  • Speed: Product features ship faster in TS.
  • Flexibility: Swap vector DBs or LLMs in hours.
  • Scalability: Python workloads don’t slow the rest of the app.
  • Observability: Every stage is logged and traceable.

#When to Break the TS-First Rule

  • Research-heavy prototypes where model iteration speed > product speed.
  • Internal tools where UI/UX isn’t critical.
  • Deep integrations with Python-only libraries.

For production AI products with real users, a TS-first stack keeps me shipping fastest.


#TL;DR

  • Ship fast.
  • Track everything.
  • Keep it modular.

Pivots are inevitable. This stack lets me change the LLM, embeddings, or vector DB without starting over.