Decide in 30 minutes: scope, volume, and constraints

Name the primary job

Pick one target outcome before comparing tools: deflect support tickets, improve self-serve documentation, or help internal agents answer faster. A mixed goal creates conflicting evaluation prompts.

Inventory content and access

List every source the assistant may read: docs site, changelogs, GitHub issues, ticketing history, internal wiki, and chat archives. Mark each source as public, customer-only, or internal.
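The inventory can be kept as data so that access decisions are testable. The sketch below is illustrative only: the three visibility levels come from the paragraph above, but the strict ordering (public, then customer-only, then internal), the example classification of each source, and the names Visibility, Source, and readable_sources are assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Visibility(IntEnum):
    # Assumed ordering: a higher value means a more restricted source.
    PUBLIC = 0
    CUSTOMER_ONLY = 1
    INTERNAL = 2


@dataclass(frozen=True)
class Source:
    name: str
    visibility: Visibility


def readable_sources(sources, audience):
    """Return the sources an audience at the given clearance may read."""
    return [s for s in sources if s.visibility <= audience]


# Example classification only; mark each source according to your own policy.
INVENTORY = [
    Source("docs site", Visibility.PUBLIC),
    Source("changelogs", Visibility.PUBLIC),
    Source("ticketing history", Visibility.CUSTOMER_ONLY),
    Source("internal wiki", Visibility.INTERNAL),
]
```

Calling readable_sources(INVENTORY, Visibility.PUBLIC) keeps only the docs site and changelogs under this example classification.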

Map usage patterns

Write down the expected question types, supported languages, privacy sensitivity, and peak concurrency windows. Example: public API setup questions need different controls than account-specific billing questions.

Choose answer surfaces

Decide where answers must appear: docs widget, in-app help, Slack or Teams bot, search overlay, or API-only. Each surface changes authentication, logging, and UX requirements.

Define human handoff

Set escalation rules for missing sources, low confidence, account-specific requests, or policy-sensitive topics. Attach the user question, retrieved passages, source URLs, session metadata, and attempted answer.
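The escalation rules and handoff payload above can be sketched as a small function. This is a minimal sketch, not a vendor API: the Draft fields, the 0.6 confidence floor, and the topic labels are all hypothetical and should be replaced with values from your own evaluation.

```python
from dataclasses import dataclass, field
from typing import Optional

CONFIDENCE_FLOOR = 0.6  # hypothetical threshold; tune on your own test set
SENSITIVE_TOPICS = {"billing", "legal", "security"}  # hypothetical labels


@dataclass
class Draft:
    question: str
    answer: str
    passages: list
    source_urls: list
    confidence: float
    topic: str
    account_specific: bool
    session: dict = field(default_factory=dict)


def escalation_reason(d: Draft) -> Optional[str]:
    """Return the first escalation rule that matches, or None."""
    if not d.passages:
        return "missing_sources"
    if d.confidence < CONFIDENCE_FLOOR:
        return "low_confidence"
    if d.account_specific:
        return "account_specific"
    if d.topic in SENSITIVE_TOPICS:
        return "policy_sensitive"
    return None


def handoff_payload(d: Draft) -> Optional[dict]:
    """Build the context a human agent receives; None means no handoff."""
    reason = escalation_reason(d)
    if reason is None:
        return None
    return {
        "reason": reason,
        "question": d.question,
        "retrieved_passages": d.passages,
        "source_urls": d.source_urls,
        "session_metadata": d.session,
        "attempted_answer": d.answer,
    }
```

Checking the rules in a fixed order keeps the recorded reason deterministic when several rules match at once.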

Cost check: Kapa vs DIY RAG bill of materials

Read kapa.ai’s pricing page before building any spreadsheet. Capture the active tier, the metering unit (requests or monthly active users), included limits, overage behavior, and SLA language (source: kapa.ai Pricing page (consulted 2026-06)).

A DIY RAG bill of materials has six lines: LLM inference, embeddings, vector database storage and queries, crawler/indexer jobs, application hosting, and observability for traces, errors, and cost attribution.

Estimate inference cost from each provider's unit prices for prompt tokens and response tokens. Compare OpenAI and Anthropic, because long answers and large retrieved context change the bill differently by model family (source: OpenAI Pricing page (consulted 2026-06); source: Anthropic Pricing page (consulted 2026-06)).

Model retrieval with embedding-token costs, vector writes, vector reads, and GB-month storage; Pinecone exposes those units directly for storage and operations planning (source: Pinecone Pricing page (consulted 2026-06)).
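The inference and retrieval lines can be turned into a rough monthly estimate. The sketch below is an assumption-laden starting point: every field name is hypothetical, no real prices are included, and crawler jobs, hosting, and observability are left out. Fill UnitPrices from the pricing pages cited above.

```python
from dataclasses import dataclass


@dataclass
class UnitPrices:
    """Placeholders only; copy current numbers from the providers' pricing pages."""
    prompt_per_1k_tokens: float
    response_per_1k_tokens: float
    embedding_per_1k_tokens: float
    vector_read_per_1k: float
    vector_write_per_1k: float
    storage_per_gb_month: float


@dataclass
class Usage:
    questions: int
    prompt_tokens_per_question: int
    response_tokens_per_question: int
    embedded_tokens: int
    vector_reads: int
    vector_writes: int
    storage_gb: float


def monthly_cost(p: UnitPrices, u: Usage) -> dict:
    """Estimate three of the six bill-of-materials lines for one month."""
    inference = u.questions * (
        u.prompt_tokens_per_question * p.prompt_per_1k_tokens
        + u.response_tokens_per_question * p.response_per_1k_tokens
    ) / 1000
    embeddings = u.embedded_tokens * p.embedding_per_1k_tokens / 1000
    vector_db = (
        u.vector_reads * p.vector_read_per_1k
        + u.vector_writes * p.vector_write_per_1k
    ) / 1000 + u.storage_gb * p.storage_per_gb_month
    return {
        "inference": inference,
        "embeddings": embeddings,
        "vector_db": vector_db,
        "total": inference + embeddings + vector_db,
    }
```

Running the function once per candidate model, with the same Usage, shows how much of the difference comes from prompt tokens versus response tokens.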

Dimension	Hosted Kapa	DIY RAG
Time-to-value	Vendor-managed ingestion, UI, and support workflows	You wire crawler, indexer, API, UI, monitoring, and feedback loops
Cost predictability	Plan limits and metering come from the pricing page	Costs move with tokens, embeddings, vector reads, writes, storage, and hosting
Control	Less control over retrieval internals and roadmap	Full control over chunking, reranking, caching, data retention, and model choice

The main cost levers are context window size, chunking strategy, embedding cadence, cache hit rate, and deduplication. Remove repeated boilerplate before embeddings, cache stable answers, and update embeddings only for changed documents.
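Two of these levers, removing boilerplate and re-embedding only changed documents, can be sketched with a content hash. This is a minimal illustration that assumes documents arrive as a dict of ID to text and that boilerplate is a known list of literal snippets; normalize, content_hash, and docs_to_embed are hypothetical helper names.

```python
import hashlib


def normalize(text, boilerplate=()):
    """Drop known boilerplate snippets and collapse whitespace."""
    for snippet in boilerplate:
        text = text.replace(snippet, "")
    return " ".join(text.split())


def content_hash(text, boilerplate=()):
    """Hash the normalized text so cosmetic edits do not trigger re-embedding."""
    cleaned = normalize(text, boilerplate)
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()


def docs_to_embed(docs, previous_hashes, boilerplate=()):
    """Return (changed_ids, new_hashes) for docs given as {doc_id: text}."""
    new_hashes = {
        doc_id: content_hash(text, boilerplate) for doc_id, text in docs.items()
    }
    changed = [
        doc_id
        for doc_id, digest in new_hashes.items()
        if previous_hashes.get(doc_id) != digest
    ]
    return changed, new_hashes
```

Persist new_hashes after each indexing run and pass it back as previous_hashes on the next one; only the IDs in changed need new embeddings.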

Evaluate answer quality: a 1‑hour A/B test you can run today

Build the question set

Pull representative prompts from closed tickets, product search logs, and PM notes. For every prompt, attach the canonical doc URL, release note, or source snippet that proves the expected answer.

Standardize the prompt harness

Use the same system prompt for every contender: tone, refusal policy, citation style, and escalation rule. Use user prompts with explicit task framing, such as: “Answer from the attached docs only and cite every factual claim.”

Run blind A/B

Hide vendor names behind neutral labels. Keep the corpus and prompts identical, and match retrieval settings and chunking policy wherever a contender exposes them. Record groundedness, exactness, refusals, and latency percentiles for every answer.
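Blinding and latency percentiles are both easy to get subtly wrong by hand, so a small helper can be useful. The sketch below assumes a seeded shuffle for the label assignment and the nearest-rank definition of a percentile; blind_labels and percentile are hypothetical names, and other percentile definitions give slightly different values on small samples.

```python
import math
import random


def blind_labels(vendors, seed):
    """Map neutral labels (System A, System B, ...) to vendors in shuffled order."""
    shuffled = list(vendors)
    random.Random(seed).shuffle(shuffled)
    return {f"System {chr(ord('A') + i)}": v for i, v in enumerate(shuffled)}


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no latency samples recorded")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Keep the label mapping with someone who is not scoring answers, and reveal it only after the weights have been applied.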

Score the matrix

Create a sheet with columns for accuracy, groundedness with citations, latency, maintenance overhead, and analytics depth. Apply pre-agreed weights before reading vendor names.
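The scoring sheet reduces to a weighted average per contender. In the sketch below the column names follow the paragraph above, while the weight values and the 1 to 5 scale are purely hypothetical; agree on your own weights before the test starts.

```python
# Hypothetical weights; fix them before anyone sees the results.
WEIGHTS = {
    "accuracy": 0.35,
    "groundedness": 0.30,
    "latency": 0.15,
    "maintenance": 0.10,
    "analytics": 0.10,
}


def weighted_score(scores, weights=WEIGHTS):
    """Weighted average of per-criterion scores (for example on a 1 to 5 scale)."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight


def rank(matrix, weights=WEIGHTS):
    """Return blind labels ordered from highest to lowest weighted score."""
    return sorted(
        matrix,
        key=lambda label: weighted_score(matrix[label], weights),
        reverse=True,
    )
```

Dividing by the sum of the weights means the weights do not have to add up to exactly 1.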

Pick, then inspect outliers

Select the highest weighted score, then manually review failures that contradict the aggregate result. Outliers to inspect include correct answers without citations, cited hallucinations, and slow but precise responses.

Integration, security, and governance: non‑negotiables

A support bot that can read private docs needs access controls at three levels: sources, tickets, and customer context.

Require SAML or OIDC SSO, then map product roles to existing IdP groups. Index private repos and ticket systems with least-privilege service accounts, not personal access tokens tied to a human owner.

Verify encryption in transit and at rest, retention controls, and redaction for PII and secrets before ingestion. Export APIs must let you move conversations, feedback, and indexed metadata without vendor intervention.
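A pre-ingestion redaction pass might look like the sketch below. The patterns are deliberately narrow illustrations, not a complete PII or secret detector: the email regex is simplified, and the sk_/pk_ key prefix is a hypothetical format. Treat it as a shape for the step, and rely on a dedicated scanner for real coverage.

```python
import re

# Illustrative patterns only; real PII and secret detection needs broader coverage.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("API_KEY", re.compile(r"\b(?:sk|pk)_[A-Za-z0-9]{16,}\b")),  # hypothetical key format
]


def redact(text):
    """Replace matches with placeholders and report how many of each were found."""
    counts = {}
    for label, pattern in PATTERNS:
        text, found = pattern.subn(f"[REDACTED_{label}]", text)
        if found:
            counts[label] = found
    return text, counts
```

Logging the counts per document, without the matched values, gives an audit trail of what was removed before indexing.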

Pin model versions for repeatable evaluations. Route tasks by risk: retrieval-heavy answers, summarization, and escalation triage can use different models. Keep system prompts, safety rules, and memory policies configurable outside application code.

Every answer needs a trace showing the user query, retrieved sources, prompt version, model, and final citations. Capture thumbs-up and thumbs-down ratings, corrections, and unresolved questions, then export analytics to your warehouse for incident review.

Ask for data residency options, a DPA, and a current subprocessor list before production traffic. Check where data is stored and which third parties process it.

Zero‑downtime swap plan (one sprint)

Rebuild the knowledge layer

Create a fresh index from production documentation, not from the old vendor export. Dedupe pages, chunk deterministically, and assign stable IDs from canonical URL plus heading path. Precompute embeddings in a staging project so production traffic never waits on backfill.
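Stable IDs derived from the canonical URL plus heading path can be sketched as follows. The canonicalization rules here (lowercase host, no trailing slash, query string and fragment dropped) and the 16-character truncation are assumptions; adjust them if your docs site uses query parameters to select content.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Lowercase scheme and host, drop query and fragment, trim the trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def chunk_id(url, heading_path):
    """Derive a deterministic ID from the canonical URL and the heading trail."""
    headings = " > ".join(h.strip().lower() for h in heading_path)
    key = canonical_url(url) + "#" + headings
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Because the ID depends only on location and not on content, an edited section keeps its ID and can be updated in place rather than duplicated in the index.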

Split traffic behind a flag

Ship a server-side feature flag that can route the same request shape to kapa.ai or the alternative. Ramp gradually by cohort or workspace, and watch guardrails such as cited-source presence, fallback rate, escalation rate, and user feedback.
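A gradual ramp by workspace is commonly implemented with deterministic hash bucketing, sketched below. The salt, the route names, and the kill_switch parameter are hypothetical; a feature-flag service would normally supply the rollout percentage.

```python
import hashlib


def bucket(workspace_id, salt="assistant-swap"):
    """Map a workspace to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{salt}:{workspace_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(workspace_id, rollout_percent, kill_switch=False):
    """Pick the backend for a request; the kill switch always returns the incumbent."""
    if kill_switch or rollout_percent <= 0:
        return "kapa"
    if bucket(workspace_id) < rollout_percent:
        return "alternative"
    return "kapa"
```

Since each workspace keeps the same bucket, raising the percentage only adds workspaces to the new backend and never moves one back, which keeps cohort analytics clean.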

Preserve analytics continuity

Map conversation events before launch: question submitted, answer shown, source clicked, feedback sent, escalation requested. Keep the same funnel names and cohort keys where possible. Export histories and rehydrate them only when the target supports compatible conversation state.

Prove parity before cutover

Run the existing test harness against both systems using the same prompts, docs, and scoring rubric. Any low-confidence answer or answer without a supporting source should fail closed and route to human support rather than guess.
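One way to make the parity check concrete is to compare per-prompt pass/fail results from the shared rubric. The sketch below assumes each system's results are a dict of prompt ID to a boolean, and that a two-point drop in pass rate is the tolerated regression; both the shape and the threshold are hypothetical.

```python
def pass_rate(results):
    """Share of prompts that passed the rubric."""
    return sum(1 for passed in results.values() if passed) / len(results)


def parity_report(baseline, candidate, max_regression=0.02):
    """Compare two systems scored on the same prompt set."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both systems must be scored on the same prompts")
    base_rate = pass_rate(baseline)
    cand_rate = pass_rate(candidate)
    # Prompts the old system answered correctly and the new one did not.
    regressions = sorted(p for p in baseline if baseline[p] and not candidate[p])
    return {
        "baseline_rate": base_rate,
        "candidate_rate": cand_rate,
        "regressions": regressions,
        "parity": cand_rate >= base_rate - max_regression,
    }
```

The regressions list is worth reading even when parity is true, because equal pass rates can hide a prompt that matters more than the others.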

Cut over with a rollback path

Switch the widget or API endpoint only after parity checks pass. Keep the old endpoint wired but disabled behind a kill switch. Run a short shadow period in which live requests are also mirrored to the system that is not serving users, so answers can be compared without showing shadow output to anyone.

How to Choose a Kapa.ai Alternative: A Practical Playbook