This is the production spec — the contract the course builds toward. The guided course teaches you to reach exactly this runnable result. Skim it if you'd rather build straight from the target.

Catalens — Project Spec

Single source of truth for the Catalens course (src/content/projects/catalens.mdx). The course must teach toward exactly this runnable project. Spotlight: MongoDB Atlas Vector Search. Backends: Go (default) + TypeScript, both implementing the same contract. Free to complete ($0): Atlas M0 + a free Google AI Studio key + local runtime + an emulator.


1. Overview & definition of done

Catalens is a visual product-recognition service. A shopper photographs a product; the backend turns the photo into a typed descriptor (Gemini Vision), embeds that descriptor (Gemini embeddings), and matches it against a live catalog with one MongoDB Atlas $vectorSearch aggregation — vector similarity and a category pre-filter in a single query over heterogeneous documents — then ranks the results by score and cuts them off at a calibrated confidence threshold.

Definition of done (the runnable result a learner ends with):

  1. An Atlas M0 cluster holding a products collection (~15 seeded products across ≥3 categories), each with a stored embedding and embeddingHash, behind a products_vec Atlas Vector Search index.
  2. A backend (Go or TypeScript) exposing POST /recognize that, given a photo, returns { descriptor, matches:[{...product, score}], noMatch } — the spotlight pipeline end to end.
  3. A mobile app (Compose, Flutter, or SwiftUI) that captures/picks a photo, calls /recognize, and shows ranked matches with their scores, the Vision descriptor + which pre-filter fired, a live threshold slider, and a first-class no-match state.
  4. An integration test (per backend) proving the two outcomes against the real Atlas index: a known product photo matches above the threshold T; an out-of-catalog photo returns a clean noMatch.

How the learner SEES it run locally, for $0: start the backend against Atlas M0 with a free Gemini key, run the mobile app on an emulator/simulator, tap a bundled sample photo, and watch a ranked match list with scores appear — or a clean “no confident match” for an unstocked item. No cloud deploy is required; Cloud Run is an optional extra.

The spotlight is load-bearing: remove MongoDB Atlas Vector Search and the project’s core (similarity + metadata pre-filter in one query over a schema-volatile catalog) cannot exist. The backend language is a swappable shell; the match is the database.


2. Architecture (components and how they connect)

            ┌─────────────┐  multipart {image, category?}   ┌──────────────────────────┐
 Mobile app │ Compose /   │ ──────────────────────────────▶ │ Backend  POST /recognize │
 (camera or │ Flutter /   │ ◀────────────────────────────── │ (Go default | TypeScript)│
  gallery)  │ SwiftUI     │   { descriptor, matches[], noMatch }                        │
            └─────────────┘                                  └─────────┬────────────────┘

                              1. Vision (image → typed descriptor)     │
                                ┌────────────────────────────────────▶ │
                                │  Gemini Vision  (generateContent,     │
                                │  responseSchema, category enum)       │
                                │                                       ▼
                                │  2. Embed (descriptor text → vector)  embeddingText(descriptor)
                                │  Gemini embeddings (gemini-embedding-001, outputDimensionality D)
                                │  → L2-normalize when D < 3072 ────────┐
                                │                                       ▼
                                │                            3. $vectorSearch (one aggregation)
                                │                            MongoDB Atlas:  embedding NN
                                └─────────────────────────── + category pre-filter + $meta score


                                                            4. threshold T → ranked matches | noMatch
  • Mobile app never calls Gemini or Mongo directly. It only calls your backend. The Gemini key and the Mongo URI live server-side.
  • Gemini does two jobs: Vision (photo → descriptor) and embeddings (text → vector). Same key, free tier.
  • MongoDB Atlas does the match: nearest-neighbour over embedding plus the metadata pre-filter, in one $vectorSearch aggregation, returning each candidate’s vectorSearchScore.
  • The ingest pass (run once, and again whenever a product’s content changes) embeds every product and stores the vector on its document, behind the same products_vec index the query reads.

Why descriptor-text embeddings (the road not taken)

We embed the text of a Vision descriptor for both catalog and query — not the raw image. The honest reason: it keeps the project on one free embedding model for catalog and query, yields a human-readable descriptor you can debug, and makes the match explainable. The invariant this creates: catalog and query must be embedded the same way (same model, same dimensions, same normalization), because we are comparing embed(text-of-catalog-product) against embed(text-of-Vision-descriptor-of-photo). A learner must seed products they can actually photograph (or generate matching images), because the comparison is descriptor-text vs descriptor-text — not image-pixels vs image-pixels.
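A minimal sketch of that shared builder, under stated assumptions: the spec names embeddingText but does not fix its output format, so the EmbedFields shape, the field order, and the lower-casing below are illustrative choices, not the course's definition. What matters is that ingest and query call the same function.

```typescript
// Hypothetical common shape: the fields a catalog product and a Vision
// descriptor are both reduced to before embedding (an assumption).
interface EmbedFields {
  category: string;
  brand: string;
  name?: string; // catalog products only
  colour?: string;
  form?: string;
  visibleText?: string;
  attributes?: string[]; // flattened attribute words, e.g. ["leather"]
}

// Builds one deterministic line of text. Attributes are sorted and all
// values are trimmed and lower-cased, so the same content always yields
// the same string (and therefore the same embeddingHash).
function embeddingText(f: EmbedFields): string {
  const attrs = (f.attributes ?? [])
    .map((a) => a.trim().toLowerCase())
    .sort();
  const parts: Array<[string, string | undefined]> = [
    ["category", f.category],
    ["brand", f.brand],
    ["name", f.name],
    ["colour", f.colour],
    ["form", f.form],
    ["text", f.visibleText],
    ["attributes", attrs.length > 0 ? attrs.join(", ") : undefined],
  ];
  return parts
    .filter(([, v]) => v !== undefined && v.trim() !== "")
    .map(([k, v]) => `${k}: ${(v as string).trim().toLowerCase()}`)
    .join("; ");
}
```

Empty and missing fields are dropped rather than emitted as blanks, so a descriptor with no visible text and a product with no visibleText field produce comparable strings.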


3. Runnable structure (the repo the learner ends with)

Both backends share the same module layout in spirit; names differ by language. The app entrypoint composes everything: it opens one Mongo client, builds the Gemini client, registers routes + middleware, and shuts down cleanly.

Go (default)

catalens/
  go.mod                       # module github.com/you/catalens
  cmd/api/main.go              # ENTRYPOINT: load config, open Mongo client, build deps,
                               #   register routes, http.Server + graceful shutdown
  internal/config/config.go    # env: MONGODB_URI, MONGODB_DB, GEMINI_API_KEY,
                               #   GEMINI_VISION_MODEL, GEMINI_EMBED_MODEL, EMBED_DIM (D), THRESHOLD (T), PORT
  internal/catalog/store.go    # Store: Search(ctx, q) / Upsert / collection handles
  internal/gemini/gemini.go    # Vision (descriptor) + Embed (vector) + L2-normalize helper
  internal/recognize/service.go# RecognizeService: orchestrates Vision→embed→Search→threshold
  internal/recognize/handler.go# POST /recognize HTTP handler (multipart in, JSON out)
  internal/embedtext/text.go   # embeddingText(doc|descriptor) — the SHARED builder (ingest == query)
  cmd/seed/main.go             # seed ~15 products + create _worker_state doc (idempotent upsert)
  cmd/ingest/main.go           # embed every product, store embedding + embeddingHash
  cmd/worker/main.go           # (feature: dynamic-embeddings) change-stream re-embed worker
  testdata/                    # bundled sample product photos (match seeded products) + a true-negative

TypeScript

catalens/
  package.json
  src/server.ts                # ENTRYPOINT: MongoClient.connect, build deps, Hono routes, serve
  src/config.ts                # same env vars as Go
  src/catalog/store.ts         # CatalogStore: search()/upsert()/collection handles
  src/gemini.ts                # vision()/embed()/l2normalize()
  src/recognize/service.ts     # orchestrates Vision→embed→search→threshold
  src/recognize/handler.ts     # POST /recognize Hono handler
  src/embedText.ts             # embeddingText() — the SHARED builder (ingest == query)
  src/seed.ts                  # seed ~15 products + _worker_state doc (idempotent)
  src/ingest.ts                # embed every product, store embedding + embeddingHash
  src/worker.ts                # (feature) change-stream re-embed worker
  testdata/                    # bundled sample photos + a true-negative
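Both layouts read the same environment variables. As a rough sketch, a loader could validate them once at startup. The loadConfig name and the fallback values for EMBED_DIM, THRESHOLD and PORT are assumptions for illustration; the spec lists the variables but does not prescribe defaults.

```typescript
// Typed view of the env vars listed for internal/config and src/config.ts.
interface Config {
  mongodbUri: string;
  mongodbDb: string;
  geminiApiKey: string;
  visionModel: string;
  embedModel: string;
  embedDim: number; // D
  threshold: number; // T
  port: number;
}

// The environment is passed in (instead of read from a global here) so the
// loader can be exercised with a plain object.
type Env = Record<string, string | undefined>;

function required(env: Env, key: string): string {
  const v = env[key];
  if (v === undefined || v.trim() === "") {
    throw new Error(`missing required env var ${key}`);
  }
  return v;
}

function numberVar(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!isFinite(n)) throw new Error(`${key} must be a number, got "${raw}"`);
  return n;
}

function loadConfig(env: Env): Config {
  const embedDim = numberVar(env, "EMBED_DIM", 768); // assumed default
  const threshold = numberVar(env, "THRESHOLD", 0.75); // assumed default
  const port = numberVar(env, "PORT", 8080); // assumed default
  if (embedDim <= 0 || Math.floor(embedDim) !== embedDim) {
    throw new Error("EMBED_DIM must be a positive integer");
  }
  if (threshold < 0 || threshold > 1) {
    throw new Error("THRESHOLD must be within [0, 1]");
  }
  return {
    mongodbUri: required(env, "MONGODB_URI"),
    mongodbDb: required(env, "MONGODB_DB"),
    geminiApiKey: required(env, "GEMINI_API_KEY"),
    visionModel: required(env, "GEMINI_VISION_MODEL"),
    embedModel: required(env, "GEMINI_EMBED_MODEL"),
    embedDim,
    threshold,
    port,
  };
}
```

In a Node entrypoint this would typically be called as loadConfig(process.env). The model ids are required rather than defaulted, which matches the rule that neither backend hard-codes a model id.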

Key interfaces (named explicitly — same contract, both languages)

Store / CatalogStore — the only thing the recognise service knows about persistence:

  • Search(ctx, query) -> []Match where query = { queryVector: float[D], categoryHint?: string, numCandidates: int, limit: int, exact?: bool, inStockOnly?: bool }. Runs the $vectorSearch aggregation and returns ranked Match{ id, name, brand, category, attributes, price, inStock, score }.

The Store.Search seam is a named contract, not a mandatory file: the course teaches the $vectorSearch pipeline inline in the /recognize handler for clarity (one place to read the spotlight end to end). Extracting it behind a Store.Search method is an idiomatic refactor, not a missing piece.

  • Upsert(ctx, product) -> id (seed + ingest).
  • SetEmbedding(ctx, id, vector, hash) (ingest + worker).
  • LogMiss(ctx, miss) (feature: no-match-analytics; writes to a separate search_misses collection).

Vision + Embed (the gemini package):

  • Vision(ctx, imageBytes, mime) -> Descriptor — generateContent with responseMimeType:"application/json" and a responseSchema whose category is an enum of the catalog’s known categories.
  • Embed(ctx, text) -> float[D] — embeds, then L2-normalizes when D < 3072 (see §4).
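The Vision contract above hinges on a responseSchema whose category is an enum. The sketch below builds such a schema as a plain object in the shape the Gemini REST API accepts (uppercase type names). The descriptor field names come from §5; marking every field as required and the helper names descriptorSchema and visionGenerationConfig are assumptions.

```typescript
// Only the schema features used here are modelled.
interface SchemaNode {
  type: "OBJECT" | "STRING" | "ARRAY";
  enum?: string[];
  items?: SchemaNode;
  properties?: Record<string, SchemaNode>;
  required?: string[];
}

// Builds the descriptor schema with `category` constrained to the
// catalog's known categories, so Vision cannot invent a category the
// pre-filter would never match.
function descriptorSchema(categories: string[]): SchemaNode {
  if (categories.length === 0) {
    throw new Error("at least one category is required");
  }
  return {
    type: "OBJECT",
    properties: {
      category: { type: "STRING", enum: categories.slice() },
      brand: { type: "STRING" },
      colour: { type: "STRING" },
      form: { type: "STRING" },
      visibleText: { type: "STRING" },
      attributes: { type: "ARRAY", items: { type: "STRING" } },
    },
    // Assumption: every field is required so the JSON shape is stable.
    required: ["category", "brand", "colour", "form", "visibleText", "attributes"],
  };
}

// The generation config that would accompany the image part of a
// generateContent request.
function visionGenerationConfig(categories: string[]) {
  return {
    responseMimeType: "application/json",
    responseSchema: descriptorSchema(categories),
  };
}
```

The category list would normally be read from the catalog (for example, the distinct category values of the seeded products) rather than typed by hand, so the enum and the pre-filter cannot drift apart.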

RecognizeService — the spotlight orchestration, language-agnostic in shape: Recognize(ctx, imageBytes, mime, categoryHint?) -> RecognizeResponse. It: (1) Vision → descriptor, (2) embeddingText(descriptor) → Embed → query vector, (3) Store.Search with the category pre-filter (falling back to no filter when the filtered result is empty — see §6 “category gap”), (4) apply threshold T, (5) on no-match optionally LogMiss. Returns the canonical response shape in §5.
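The five steps can be sketched with every dependency injected, which keeps the orchestration independent of Gemini and MongoDB. This is an illustration, not the course's implementation: the RecognizeDeps interface is hypothetical, Match is trimmed to three fields, and treating a score equal to T as a match (>=) is an assumption.

```typescript
interface Descriptor {
  category: string;
  brand: string;
  colour: string;
  form: string;
  visibleText: string;
  attributes: string[];
}

interface Match { id: string; name: string; score: number } // trimmed for brevity

interface RecognizeResponse {
  descriptor: Descriptor;
  filterApplied: string | null;
  matches: Match[];
  noMatch: boolean;
}

// Hypothetical seam: each member stands in for a real component.
interface RecognizeDeps {
  threshold: number; // T
  vision(image: Uint8Array, mime: string): Promise<Descriptor>;
  embeddingText(d: Descriptor): string;
  embed(text: string): Promise<number[]>;
  // category === null means "run the search unfiltered".
  search(queryVector: number[], category: string | null): Promise<Match[]>;
  logMiss?(d: Descriptor, nearMisses: Match[]): Promise<void>;
}

async function recognize(
  deps: RecognizeDeps,
  image: Uint8Array,
  mime: string,
  categoryHint?: string,
): Promise<RecognizeResponse> {
  const descriptor = await deps.vision(image, mime); // (1)
  const queryVector = await deps.embed(deps.embeddingText(descriptor)); // (2)

  // (3) filtered search first, unfiltered only if the filter returned nothing
  let filterApplied: string | null = categoryHint ?? descriptor.category;
  let candidates = await deps.search(queryVector, filterApplied);
  if (candidates.length === 0) {
    filterApplied = null;
    candidates = await deps.search(queryVector, null);
  }

  // (4) threshold cut; candidates are assumed to arrive best-first
  const matches = candidates.filter((m) => m.score >= deps.threshold);
  const noMatch = matches.length === 0;

  // (5) fire-and-forget: a failed miss log must not fail the request
  if (noMatch && deps.logMiss) {
    deps.logMiss(descriptor, candidates).catch(() => undefined);
  }
  return { descriptor, filterApplied, matches, noMatch };
}
```

Because the dependencies are plain functions, the fallback and threshold behaviour can be unit-tested with stubs before the real Atlas index exists.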


4. Data model

Collection products (the catalog — heterogeneous documents)

Common fields every match relies on, plus per-category attributes:

| field         | type     | notes |
|---------------|----------|-------|
| _id           | ObjectId | generated |
| name          | string   | required |
| brand         | string   | required |
| category      | string   | required; must be one of the catalog’s known categories (drives the Vision enum + pre-filter) |
| attributes    | object   | per-category (sneakers: colour/material/sizes; tea: flavour/caffeine/grams; …) |
| price         | int      | cents |
| inStock       | bool     | used by the substitutes feature filter |
| sku / barcode | string   | optional; unique index for the barcode feature fast-path |
| imageRef      | string   | optional; pointer to the product image (used by the change-stream worker) |
| embedding     | float[D] | added at ingest; length === numDimensions of the index; L2-normalized when D < 3072 |
| embeddingHash | string   | sha256 of embeddingText(doc); the idempotency guard for re-embedding |

Atlas Vector Search index products_vec (on products)

{
  "fields": [
    { "type": "vector", "path": "embedding", "numDimensions": 768, "similarity": "cosine" },
    { "type": "filter", "path": "category" },
    { "type": "filter", "path": "brand" },
    { "type": "filter", "path": "inStock" }
  ]
}
  • numDimensions must equal the embedding length D you ingest with. A mismatch breaks the build/query — the single most common setup error.
  • Only fields declared type:"filter" can appear in a $vectorSearch filter. inStock is declared up front so the substitutes feature works without an index change.
  • Embedding-normalization invariant (load-bearing): gemini-embedding-001 returns embeddings that are only pre-normalized at the full 3072 dimensions. At any smaller outputDimensionality (e.g. 768 or 1536) the vectors carry varying magnitude that distorts cosine similarity, so you must L2-normalize every vector — catalog and query — before storing/searching, or use D = 3072. (Confirmed: https://ai.google.dev/gemini-api/docs/embeddings — manual normalization is required for non-3072 dims; gemini-embedding-2 auto-normalizes truncated dims, so pairing the model id with the normalize rule keeps a model swap correct.)
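The normalization rule is small enough to show in full. The following is a sketch of the helper the layout calls l2normalize, plus a hypothetical wrapper (prepareEmbedding, not named in the spec) that applies the "normalize when D < 3072" rule and rejects a length mismatch early.

```typescript
// Scales a vector to unit length (L2 norm of 1).
function l2normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) throw new Error("cannot normalize a zero vector");
  return v.map((x) => x / norm);
}

// Hypothetical wrapper: checks the length against D, then normalizes
// unless the vector is full-size (3072), which the model already
// returns normalized.
function prepareEmbedding(raw: number[], d: number): number[] {
  if (raw.length !== d) {
    throw new Error(`expected ${d} dimensions, got ${raw.length}`);
  }
  return d < 3072 ? l2normalize(raw) : raw;
}
```

Calling the same wrapper from ingest, the query path, and the worker is one way to guarantee that catalog and query vectors are treated identically.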

Collection _worker_state (prerequisite for the change-stream worker)

A single document { _id: "embeddings-worker", resumeToken: <BSON resume token | null> }, seeded by the seed step so the worker has a row to read/update from the start (the FK-row analogue). On each handled event the worker writes the latest resume token here; on startup it reads it back via resumeAfter/SetResumeAfter.

Collection search_misses (feature: no-match-analytics)

{
  "at": "ISODate",
  "descriptor": { "category": "...", "brand": "...", "colour": "...", "form": "...", "visibleText": "...", "attributes": ["..."] },
  "nearMisses": [ { "name": "...", "score": 0.62 } ],
  "threshold": 0.75
}

Separate collection so analytics writes never touch the catalog the recognise path reads. No raw image is stored (descriptor + scores only — privacy-aware default).

Migrations / seed order (prerequisites first)

  1. Create products (lazily on first insert) and seed ~15 products across ≥3 categories — idempotent upsert by (brand, name).
  2. Seed the _worker_state document (so the worker prerequisite exists before the feature).
  3. Run ingest to populate embedding + embeddingHash on every product.
  4. Create the products_vec index (Atlas UI or API) with numDimensions === D.
  5. (features) create the unique index on sku/barcode; search_misses is lazily created on first miss.

There are no foreign keys (document store), but _worker_state is the explicit prerequisite row the worker path needs, and every product must have an embedding of length D before the products_vec index is usable — ingest is a hard prerequisite of the recognise step.
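Step 1 of the seed order asks for an idempotent upsert keyed on (brand, name). One way to express that, sketched here as plain objects in the updateOne shape that the Node driver's bulkWrite accepts; the seedOps name and the SeedProduct type are assumptions.

```typescript
// Content fields only: the seed never carries embedding or embeddingHash.
interface SeedProduct {
  name: string;
  brand: string;
  category: string;
  attributes: Record<string, unknown>;
  price: number; // cents
  inStock: boolean;
}

interface UpsertOp {
  updateOne: {
    filter: { brand: string; name: string };
    update: { $set: SeedProduct };
    upsert: true;
  };
}

// One upsert per product. Re-running the seed updates the same documents
// instead of inserting duplicates, and because $set names only content
// fields, an existing embedding and embeddingHash are left untouched.
function seedOps(products: SeedProduct[]): UpsertOp[] {
  return products.map((p): UpsertOp => ({
    updateOne: {
      filter: { brand: p.brand, name: p.name },
      update: { $set: p },
      upsert: true,
    },
  }));
}
```

With a connected collection handle this would be applied as something like products.bulkWrite(seedOps(seed)); the _worker_state document can be seeded the same way with its _id as the filter.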


5. API & event contract (the one canonical shape)

Every step, client, and test shares exactly these shapes.

POST /recognize

  • Request: multipart/form-data
    • image (file, required) — a product photo (JPEG/PNG).
    • category (string, optional) — a category hint; normally omitted (the descriptor supplies it).
  • Response 200 — match:
    {
      "descriptor": { "brand": "Northpeak", "category": "sneakers", "colour": "red",
                      "form": "low-top", "visibleText": "", "attributes": ["leather"] },
      "filterApplied": "sneakers",
      "matches": [
        { "id": "…", "name": "Trailblazer Low", "brand": "Northpeak", "category": "sneakers",
          "attributes": { "colour": "red", "material": "leather" }, "price": 8900, "inStock": true,
          "score": 0.88 }
      ],
      "noMatch": false
    }
  • Response 200 — no confident match: { "descriptor": {…}, "filterApplied": null, "matches": [], "noMatch": true } (an honest “I don’t know”, not an error status). The no-match branch still carries descriptor and filterApplied — same envelope as a match, only matches is empty — so the client can keep showing what Vision saw and which pre-filter ran (or that it fell back). filterApplied is typically null on a no-match because a true out-of-catalog photo reaches the threshold step only after the unfiltered fallback (§6).
  • Status / error codes:
    • 200 — match or no-match (both are success).
    • 400 — image part missing or unreadable ({ "error": "image required" }).
    • 415 — unsupported media type (not JPEG/PNG), optional.
    • 502 — upstream Gemini call failed after retries ({ "error": "vision unavailable" }).
    • 500 — unexpected server error.
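The status table above maps naturally onto a small tagged union. This sketch is an assumption about how a handler might centralise it: the RecognizeFailure type and helper names are hypothetical, and the 415 and 500 message strings are invented, since the spec only fixes the 400 and 502 bodies.

```typescript
// Hypothetical failure kinds a /recognize handler could produce.
type RecognizeFailure =
  | { kind: "image_missing" }
  | { kind: "unsupported_media" }
  | { kind: "gemini_unavailable" }
  | { kind: "unexpected" };

interface HttpError {
  status: number;
  body: { error: string };
}

function toHttpError(f: RecognizeFailure): HttpError {
  switch (f.kind) {
    case "image_missing":
      return { status: 400, body: { error: "image required" } };
    case "unsupported_media":
      // Message text is an assumption; the spec marks 415 as optional.
      return { status: 415, body: { error: "unsupported media type" } };
    case "gemini_unavailable":
      return { status: 502, body: { error: "vision unavailable" } };
    case "unexpected":
      // Message text is an assumption.
      return { status: 500, body: { error: "internal error" } };
  }
}

// Accepts only the two formats the request contract names.
function isSupportedImage(mime: string): boolean {
  const m = mime.trim().toLowerCase();
  return m === "image/jpeg" || m === "image/png";
}
```

A no-match never passes through this mapping: it is a normal 200 response with noMatch set to true.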

Field contract (Match): id (string), name, brand, category (strings), attributes (object), price (int cents), inStock (bool), score (number in [0,1]). Matches are ordered best-first. score is the Atlas vectorSearchScore. filterApplied echoes which category pre-filter fired (or null if the search ran unfiltered) so the UI can show the spotlight at work.

Score semantics (load-bearing). For cosine similarity Atlas maps the raw cosine [-1,1] into [0,1] as (1 + cosine) / 2. So an unrelated (orthogonal) photo’s nearest stranger still scores ~0.5, not 0 — 0.5 is the “no real similarity” floor, and real matches for a clean photo sit well above it. A naive low threshold like T=0.3 is therefore meaningless; calibrate T above the ~0.5 floor. (Confirmed: MongoDB normalizes cosine as (1+cosine)/2.)
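The mapping is easy to check by hand. The sketch below computes a raw cosine and applies the (1 + cosine) / 2 normalization stated above; the function names are illustrative.

```typescript
// Raw cosine similarity of two equal-length vectors, in [-1, 1].
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must share a non-zero length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) throw new Error("zero vector");
  return dot / Math.sqrt(normA * normB);
}

// The score normalization described above: [-1, 1] -> [0, 1].
function atlasCosineScore(cos: number): number {
  return (1 + cos) / 2;
}

// Inverse mapping: which raw cosine a given threshold T corresponds to.
function rawCosineForScore(score: number): number {
  return 2 * score - 1;
}
```

Orthogonal vectors such as [1, 0] and [0, 1] score 0.5, identical vectors score 1, and opposite vectors score 0. Read in reverse, T = 0.75 demands a raw cosine of 0.5, while T = 0.3 corresponds to a raw cosine of about -0.4, which would accept vectors pointing away from the query.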

Feature endpoints (off by default — see §8)

  • GET /products/{id}/substitutes → { matches:[{...product, score}] } (in-stock only, lower threshold).
  • POST /recognize/shelf → [ { box, matches:[{...product, score}], noMatch } ] (multi-shelf fan-out).
  • GET /analytics/top-misses?since=<ISO> → [ { category, brand, requests, avgNearMiss, lastRequested } ].

Wire/event message — change-stream event (feature: dynamic-embeddings)

The worker consumes MongoDB change-stream events on products opened with fullDocument:"updateLookup" and a $match pipeline that only lets content edits through:

operationType ∈ {insert, update, replace}
AND ( insert | replace  OR  updateDescription.updatedFields has one of:
      name | brand | category | attributes | imageRef )

Per event: text = embeddingText(fullDocument); hash = sha256(text); skip if hash == doc.embeddingHash (idempotent); else Embed(text) → SetEmbedding(id, vector, hash). The write-back touches only embedding/embeddingHash, which the $match excludes — so it never re-triggers the worker. After each handled event, persist resumeToken to _worker_state; on startup read it back as resumeAfter.
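The two guards in that loop (the content-edit filter and the hash check) can be sketched as pure functions. This is an in-process mirror of the $match predicate for illustration, not the server-side pipeline itself. Two assumptions: dotted update keys such as attributes.colour are treated as edits to attributes, and the SHA-256 function is injected so the sketch stays dependency-free (in Node it could be built on the crypto module's createHash).

```typescript
// Just the parts of a change-stream event this sketch inspects.
interface ProductChangeEvent {
  operationType: string;
  updateDescription?: { updatedFields?: Record<string, unknown> };
}

const CONTENT_FIELDS = ["name", "brand", "category", "attributes", "imageRef"];

// True for inserts, replaces, and updates that touch a content field.
// Writes to embedding / embeddingHash alone return false, which is what
// stops the worker from re-triggering itself.
function isContentEdit(e: ProductChangeEvent): boolean {
  if (e.operationType === "insert" || e.operationType === "replace") return true;
  if (e.operationType !== "update") return false;
  const updated = Object.keys(e.updateDescription?.updatedFields ?? {});
  return updated.some((key) =>
    CONTENT_FIELDS.some((f) => key === f || key.indexOf(f + ".") === 0),
  );
}

interface ReembedDecision {
  reembed: boolean;
  hash: string;
}

// Idempotency guard: re-embed only when the text hash differs from the
// hash stored on the document.
function decideReembed(
  text: string,
  storedHash: string | undefined,
  sha256Hex: (s: string) => string,
): ReembedDecision {
  const hash = sha256Hex(text);
  return { reembed: hash !== storedHash, hash };
}
```

Keeping the hash check even though the $match already filters events protects against edits that change a field without changing the embedding text, such as rewriting a value to itself.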


6. Build order (dependency-ordered; each step’s prerequisites already exist)

  1. Prerequisites & local tooling — Go toolchain (or Node), mongosh, curl/base64, a sample image.
  2. Atlas M0 — cluster + MONGODB_URI, MONGODB_DB (Vector Search is Atlas-only).
  3. Gemini key — confirm Vision + embeddings respond (portable base64; name sample.jpg).
  4. Model the catalog + seed — products (~15, ≥3 categories) + the _worker_state doc; idempotent.
  5. Shared embeddingText builder + normalize helper — defined once, reused by ingest, query, worker.
  6. Ingest — embed each product (fetch with find(), iterate by doc._id), L2-normalize when D < 3072, store embedding + embeddingHash.
  7. Create products_vec index — numDimensions === D; category/brand/inStock as filters; cosine.
  8. Design the recognise pipeline + Vision responseSchema — category as an enum of the known categories (closes the Vision-vs-catalog gap), descriptor → embeddable text.
  9. Scaffold the API + /recognize skeleton + a Gemini-from-code worked example (per backend) — full import block / package install; the entrypoint that composes client + routes + shutdown.
  10. ★ Recognise end to end (per backend) — Vision → embed (normalized) → $vectorSearch (category pre-filter, fall back to unfiltered when the filtered result is empty) → ranked matches + scores.
  11. Threshold or no-match — apply T; clean noMatch.
  12. Calibrate T — score known/unknown photos; the ~0.5 cosine floor; no-match UX.
  13. Frontend (per platform) — capture/pick → /recognize → ranked matches + scores + descriptor/filter panel + live threshold slider + nearest-below-threshold no-match UX.
  14. Integration tests (per backend) — known photo matches above T; unknown → noMatch.
  15. Optional deploy — Cloud Run, free-tier eligible.
  16. Feature modules (off by default) — §8.

Each step depends only on earlier ones: the index (7) needs ingest (6); ingest needs the shared builder (5) and the seed (4); recognise (10) needs the index (7), the responseSchema (8), and the scaffold (9); the worker feature needs _worker_state (4) and the shared builder (5).
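Step 12 leaves the calibration method open. One simple approach, offered as an assumption rather than the course's method: score a handful of known photos and a handful of out-of-catalog photos, then place T midway between the weakest known score and the strongest unknown score, never below the 0.5 cosine floor.

```typescript
interface Calibration {
  threshold: number;
  // False when some unknown photo scored at least as high as a known
  // one, meaning no single T separates the two groups cleanly.
  separable: boolean;
}

const COSINE_FLOOR = 0.5; // "no real similarity" score from §5

// knownScores: top-1 scores for photos of seeded products.
// unknownScores: top-1 scores for out-of-catalog photos.
function calibrateThreshold(
  knownScores: number[],
  unknownScores: number[],
): Calibration {
  if (knownScores.length === 0 || unknownScores.length === 0) {
    throw new Error("need at least one known and one unknown score");
  }
  const weakestKnown = Math.min(...knownScores);
  const strongestUnknown = Math.max(...unknownScores);
  const midpoint = (weakestKnown + strongestUnknown) / 2;
  return {
    threshold: Math.max(midpoint, COSINE_FLOOR),
    separable: weakestKnown > strongestUnknown,
  };
}
```

When separable comes back false, the more useful fix is usually upstream (better descriptors or richer embedding text) rather than a different T.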


7. Backends — Go (default) + TypeScript, same contract

Parity points (both must hold):

  • Same response shape (§5) byte-for-byte in field names and types; matches best-first; score in [0,1]; filterApplied echoed; noMatch is a 200.
  • Same $vectorSearch stage: index:"products_vec", path:"embedding", queryVector length D, numCandidates (~20× limit, must be ≥ limit), limit, optional filter { category:{$eq:hint} }, optional exact:true (ENN baseline). $project adds score:{$meta:"vectorSearchScore"}.
  • Same embedding rule: same model + same outputDimensionality D + L2-normalize when D < 3072 for both catalog and query.
  • Same category-gap handling: Vision category constrained to the known enum; on an empty filtered result, retry the search unfiltered before declaring no-match.
  • Same fire-and-forget miss logging (feature) into a separate search_misses collection; no raw image.
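The shared $vectorSearch stage can be written down as a plain pipeline builder, which is also a convenient place to enforce the numCandidates ≥ limit rule. This is a sketch: buildPipeline is a hypothetical name, the query shape follows the Store.Search contract in §3, and omitting numCandidates when exact is true reflects the assumption that the field only applies to approximate (ANN) search.

```typescript
interface SearchQuery {
  queryVector: number[];
  categoryHint?: string;
  numCandidates: number;
  limit: number;
  exact?: boolean;
  inStockOnly?: boolean;
}

type Stage = Record<string, unknown>;

function buildPipeline(q: SearchQuery): Stage[] {
  if (!q.exact && q.numCandidates < q.limit) {
    throw new Error("numCandidates must be >= limit");
  }

  // Only fields declared as type "filter" in products_vec may appear here.
  const clauses: Stage[] = [];
  if (q.categoryHint !== undefined) {
    clauses.push({ category: { $eq: q.categoryHint } });
  }
  if (q.inStockOnly) {
    clauses.push({ inStock: { $eq: true } });
  }

  const search: Stage = {
    index: "products_vec",
    path: "embedding",
    queryVector: q.queryVector,
    limit: q.limit,
  };
  if (q.exact) {
    search["exact"] = true; // ENN baseline
  } else {
    search["numCandidates"] = q.numCandidates; // ANN
  }
  if (clauses.length === 1) {
    search["filter"] = clauses[0];
  } else if (clauses.length > 1) {
    search["filter"] = { $and: clauses };
  }

  return [
    { $vectorSearch: search },
    {
      // Inclusion projection: the Match fields plus the similarity score.
      $project: {
        name: 1, brand: 1, category: 1, attributes: 1, price: 1, inStock: 1,
        score: { $meta: "vectorSearchScore" },
      },
    },
  ];
}
```

The result is what collection.aggregate(...) would receive in TypeScript; the Go backend would build the same stages with bson.D values. The category-gap fallback is then just a second call with categoryHint left undefined.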

Go specifics (verified):

  • Module go.mongodb.org/mongo-driver/v2; import all three sub-packages used: go.mongodb.org/mongo-driver/v2/mongo, .../v2/mongo/options, .../v2/bson. One go get go.mongodb.org/mongo-driver/v2/mongo pulls the whole module; the import lines must list the sub-packages. bson.ObjectID is the v2 type name (was primitive.ObjectID in v1). mongo.Connect(options.Client().ApplyURI(uri)) is the v2 signature (no context arg).
  • Gemini Go SDK google.golang.org/genai: genai.NewClient(ctx, &genai.ClientConfig{APIKey: key, Backend: genai.BackendGeminiAPI}); Vision via client.Models.GenerateContent(ctx, model, contents, &genai.GenerateContentConfig{ResponseMIMEType:"application/json", ResponseSchema: …}) with &genai.Blob{Data: imageBytes, MIMEType:"image/jpeg"} as an inline Part; embeddings via client.Models.EmbedContent(ctx, model, contents, &genai.EmbedContentConfig{OutputDimensionality: &d}).

TypeScript specifics:

  • Official mongodb driver; new MongoClient(uri) + await client.connect(); collection.aggregate(pipeline) .toArray(); collection.watch(pipeline, { fullDocument:"updateLookup", resumeAfter }) (async-iterable).
  • Gemini via the REST API (x-goog-api-key, v1beta, generateContent/embedContent) or @google/genai.

Neither backend hard-codes a model id: read GEMINI_VISION_MODEL / GEMINI_EMBED_MODEL from config and link the official model list (https://ai.google.dev/gemini-api/docs/models). The only current free-tier embedding model needing normalization is gemini-embedding-001, so the normalize rule is the safe default.


8. Optional feature modules (off by default; each extends, never rewrites, the spec)

  • hybrid — fuse $vectorSearch with Atlas Search $search over brand/name/visibleText via RRF (or the Atlas $rankFusion stage where available); one confidence cut on the fused score; no-match path preserved.
  • substitutes — GET /products/{id}/substitutes: the product’s own embedding as the query vector, $vectorSearch with filter:{inStock:true} excluding the product, a lower threshold (recall over precision). Uses the inStock filter field already in the index.
  • multi-shelf — POST /recognize/shelf: Vision returns an array of items (array responseSchema) or detect-then-recognise per crop; reuse the per-item pipeline; return [{box, matches[], noMatch}].
  • barcode — exact sku/barcode lookup (unique index) before the vector fallback; cheap/deterministic path first, vector path only on a miss.
  • dynamic-embeddings — the change-stream re-embed worker (§5 event shape). Prerequisite: the _worker_state doc (§4) and the shared embeddingText builder. Go (mongo.ChangeStream + SetFullDocument(options.UpdateLookup) + SetResumeAfter) and TS (watch(...)) parity.
  • performance — make numCandidates configurable; sweep it against an exact:true ENN baseline (ENN is the ground truth for catalogs under ~10k docs) and pick the smallest value within recall tolerance; numCandidates ≥ limit, ~20× limit as the documented starting point.
  • no-match-analytics — fire-and-forget LogMiss into search_misses on the no-match branch (Go/TS parity), plus a common GET /analytics/top-misses aggregation ranking unmet demand.
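For the performance module, the sweep reduces to a recall comparison against the ENN result. A minimal sketch, assuming the ids of each run's top-k have already been collected; recallAtK, SweepPoint and pickNumCandidates are hypothetical names.

```typescript
// recall@k: the fraction of the exact (ENN) top-k ids that the
// approximate (ANN) run also returned.
function recallAtK(exactIds: string[], approxIds: string[]): number {
  if (exactIds.length === 0) throw new Error("exact result is empty");
  const hits = exactIds.filter((id) => approxIds.indexOf(id) !== -1).length;
  return hits / exactIds.length;
}

interface SweepPoint {
  numCandidates: number;
  approxIds: string[]; // ids returned by the ANN run at this setting
}

// Returns the smallest numCandidates whose recall meets the tolerance,
// or null if no point in the sweep is good enough.
function pickNumCandidates(
  exactIds: string[],
  sweep: SweepPoint[],
  minRecall: number,
): number | null {
  const ascending = sweep
    .slice()
    .sort((a, b) => a.numCandidates - b.numCandidates);
  for (const point of ascending) {
    if (recallAtK(exactIds, point.approxIds) >= minRecall) {
      return point.numCandidates;
    }
  }
  return null;
}
```

In practice the comparison would be averaged over several query photos rather than taken from a single query, since one query can be unrepresentatively easy or hard.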

9. Free-to-complete ($0)

| Need                     | Free option | First-appears note |
|--------------------------|-------------|--------------------|
| Vector DB                | MongoDB Atlas M0 (no card; a real replica set; Vector Search + Atlas Search + change streams all run on it; local mongod cannot build the vector index) | “Costs nothing” on the Atlas step |
| AI (Vision + embeddings) | Google AI Studio free-tier key (one key, both jobs) | “Costs nothing” on the Gemini step |
| Backend runtime          | Local Go toolchain or Node | |
| Mobile                   | Android emulator / iOS Simulator + bundled sample photos (so every scan is free + reproducible) | |
| Deploy (optional)        | Cloud Run free monthly allotment (scales to zero); key in Secret Manager | “Costs nothing” on the deploy step |

Confirm current free-tier limits on the official docs; nothing in the default path requires a paid service.