Catalens — Project Spec
Single source of truth for the Catalens course (src/content/projects/catalens.mdx). The course must teach toward exactly this runnable project. Spotlight: MongoDB Atlas Vector Search. Backends: Go (default) + TypeScript, both implementing the same contract. Free to complete ($0): Atlas M0 + a free Google AI Studio key + local runtime + an emulator.
1. Overview & definition of done
Catalens is a visual product-recognition service. A shopper photographs a product; the backend turns the
photo into a typed descriptor (Gemini Vision), embeds that descriptor (Gemini embeddings), and matches it
against a live catalog with one MongoDB Atlas $vectorSearch aggregation — vector similarity and a
category pre-filter in a single query over heterogeneous documents — then ranks the results by score and cuts
them off at a calibrated confidence threshold.
Definition of done (the runnable result a learner ends with):
- An Atlas M0 cluster holding a products collection (~15 seeded products across ≥3 categories), each with a stored embedding and embeddingHash, behind a products_vec Atlas Vector Search index.
- A backend (Go or TypeScript) exposing POST /recognize that, given a photo, returns { descriptor, matches:[{...product, score}], noMatch }: the spotlight pipeline end to end.
- A mobile app (Compose, Flutter, or SwiftUI) that captures/picks a photo, calls /recognize, and shows ranked matches with their scores, the Vision descriptor + which pre-filter fired, a live threshold slider, and a first-class no-match state.
- An integration test (per backend) proving the two outcomes against the real Atlas index: a known product photo matches above the threshold T; an out-of-catalog photo returns a clean noMatch.
How the learner SEES it run locally, for $0: start the backend against Atlas M0 with a free Gemini key, run the mobile app on an emulator/simulator, tap a bundled sample photo, and watch a ranked match list with scores appear — or a clean “no confident match” for an unstocked item. No cloud deploy is required; Cloud Run is an optional extra.
The spotlight is load-bearing: remove MongoDB Atlas Vector Search and the project’s core (similarity + metadata pre-filter in one query over a schema-volatile catalog) cannot exist. The backend language is a swappable shell; the match is the database.
2. Architecture (components and how they connect)
              ┌─────────────┐  multipart {image, category?}       ┌───────────────────────────┐
 Mobile app   │ Compose /   │ ──────────────────────────────────▶ │ Backend  POST /recognize  │
 (camera or   │ Flutter /   │ ◀────────────────────────────────── │ (Go default | TypeScript) │
  gallery)    │ SwiftUI     │ { descriptor, matches[], noMatch }  │                           │
              └─────────────┘                                     └─────────────┬─────────────┘
                                                                                │
    ┌───────────────────────────────────────────────────────────────────────────┘
    ▼
    1. Vision (image → typed descriptor)
       Gemini Vision (generateContent, responseSchema, category enum)
    │
    ▼
    2. Embed (descriptor text → vector)
       embeddingText(descriptor) → Gemini embeddings (gemini-embedding-001, outputDimensionality D)
       → L2-normalize when D < 3072
    │
    ▼
    3. $vectorSearch (one aggregation)
       MongoDB Atlas: embedding NN + category pre-filter + $meta score
    │
    ▼
    4. threshold T → ranked matches | noMatch
- Mobile app never calls Gemini or Mongo directly. It only calls your backend. The Gemini key and the Mongo URI live server-side.
- Gemini does two jobs: Vision (photo → descriptor) and embeddings (text → vector). Same key, free tier.
- MongoDB Atlas does the match: nearest-neighbour over embedding plus the metadata pre-filter, in one $vectorSearch aggregation, returning each candidate’s vectorSearchScore.
- The ingest pass (run once, and again whenever a product’s content changes) embeds every product and stores the vector on its document, behind the same products_vec index the query reads.
Why descriptor-text embeddings (the road not taken)
We embed the text of a Vision descriptor for both catalog and query — not the raw image. The honest
reason: it keeps the project on one free embedding model for catalog and query, yields a human-readable
descriptor you can debug, and makes the match explainable. The invariant this creates: catalog and query
must be embedded the same way (same model, same dimensions, same normalization), because we are comparing
embed(text-of-catalog-product) against embed(text-of-Vision-descriptor-of-photo). A learner must seed
products they can actually photograph (or generate matching images), because the comparison is descriptor-text
vs descriptor-text — not image-pixels vs image-pixels.
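The "embedded the same way" invariant starts with the text itself: catalog documents and Vision descriptors should pass through one builder. A minimal sketch follows, assuming a hypothetical Embeddable shape and an arbitrary field order and separator; the spec names embeddingText but does not define its output format, so everything below is illustrative.

```typescript
// Hypothetical shape covering both a catalog product and a Vision descriptor.
type Embeddable = {
  category?: string;
  brand?: string;
  name?: string;
  colour?: string;
  form?: string;
  visibleText?: string;
  // Products carry an object, descriptors carry a string array (see §4 and §5).
  attributes?: Record<string, unknown> | string[];
};

// Attribute values in a stable order, so key order in a document never matters.
function attributeValues(attrs: Embeddable["attributes"]): string[] {
  if (attrs === undefined) return [];
  if (Array.isArray(attrs)) return [...attrs].map(String).sort();
  const obj: Record<string, unknown> = attrs;
  return Object.keys(obj)
    .sort()
    .map((key) => String(obj[key]));
}

// One builder for ingest, query, and worker. The field order and the " | "
// separator are assumptions; what matters is that every caller shares them.
function embeddingText(doc: Embeddable): string {
  const parts = [
    doc.category,
    doc.brand,
    doc.name,
    doc.colour,
    doc.form,
    doc.visibleText,
    ...attributeValues(doc.attributes),
  ];
  return parts
    .filter((p): p is string => typeof p === "string" && p.trim() !== "")
    .map((p) => p.trim().toLowerCase())
    .join(" | ");
}
```

Because the output is deterministic, the same string can also feed the embeddingHash idempotency check: two documents with identical content produce identical text regardless of key order.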
3. Runnable structure (the repo the learner ends with)
Both backends share the same module layout in spirit; names differ by language. The app entrypoint composes everything: it opens one Mongo client, builds the Gemini client, registers routes + middleware, and shuts down cleanly.
Go (default)
catalens/
  go.mod                          # module github.com/you/catalens
  cmd/api/main.go                 # ENTRYPOINT: load config, open Mongo client, build deps,
                                  # register routes, http.Server + graceful shutdown
  internal/config/config.go       # env: MONGODB_URI, MONGODB_DB, GEMINI_API_KEY,
                                  # GEMINI_VISION_MODEL, GEMINI_EMBED_MODEL, EMBED_DIM (D), THRESHOLD (T), PORT
  internal/catalog/store.go       # Store: Search(ctx, q) / Upsert / collection handles
  internal/gemini/gemini.go       # Vision (descriptor) + Embed (vector) + L2-normalize helper
  internal/recognize/service.go   # RecognizeService: orchestrates Vision→embed→Search→threshold
  internal/recognize/handler.go   # POST /recognize HTTP handler (multipart in, JSON out)
  internal/embedtext/text.go      # embeddingText(doc|descriptor): the SHARED builder (ingest == query)
  cmd/seed/main.go                # seed ~15 products + create _worker_state doc (idempotent upsert)
  cmd/ingest/main.go              # embed every product, store embedding + embeddingHash
  cmd/worker/main.go              # (feature: dynamic-embeddings) change-stream re-embed worker
  testdata/                       # bundled sample product photos (match seeded products) + a true-negative
TypeScript
catalens/
  package.json
  src/server.ts                   # ENTRYPOINT: MongoClient.connect, build deps, Hono routes, serve
  src/config.ts                   # same env vars as Go
  src/catalog/store.ts            # CatalogStore: search()/upsert()/collection handles
  src/gemini.ts                   # vision()/embed()/l2normalize()
  src/recognize/service.ts        # orchestrates Vision→embed→search→threshold
  src/recognize/handler.ts        # POST /recognize Hono handler
  src/embedText.ts                # embeddingText(): the SHARED builder (ingest == query)
  src/seed.ts                     # seed ~15 products + _worker_state doc (idempotent)
  src/ingest.ts                   # embed every product, store embedding + embeddingHash
  src/worker.ts                   # (feature) change-stream re-embed worker
  testdata/                       # bundled sample photos + a true-negative
Key interfaces (named explicitly — same contract, both languages)
Store / CatalogStore — the only thing the recognise service knows about persistence:
- Search(ctx, query) -> []Match where query = { queryVector: float[D], categoryHint?: string, numCandidates: int, limit: int, exact?: bool, inStockOnly?: bool }. Runs the $vectorSearch aggregation and returns ranked Match{ id, name, brand, category, attributes, price, inStock, score }.
  The Store.Search seam is a named contract, not a mandatory file: the course teaches the $vectorSearch pipeline inline in the /recognize handler for clarity (one place to read the spotlight end to end). Extracting it behind a Store.Search method is an idiomatic refactor, not a missing piece.
- Upsert(ctx, product) -> id (seed + ingest).
- SetEmbedding(ctx, id, vector, hash) (ingest + worker).
- LogMiss(ctx, miss) (feature: no-match-analytics; writes to a separate search_misses collection).
Vision + Embed (the gemini package):
- Vision(ctx, imageBytes, mime) -> Descriptor: generateContent with responseMimeType:"application/json" and a responseSchema whose category is an enum of the catalog’s known categories.
- Embed(ctx, text) -> float[D]: embeds, then L2-normalizes when D < 3072 (see §4).
RecognizeService — the spotlight orchestration, language-agnostic in shape:
Recognize(ctx, imageBytes, mime, categoryHint?) -> RecognizeResponse. It: (1) Vision → descriptor,
(2) embeddingText(descriptor) → Embed → query vector, (3) Store.Search with the category pre-filter
(falling back to no filter when the filtered result is empty — see §6 “category gap”), (4) apply threshold T,
(5) on no-match optionally LogMiss. Returns the canonical response shape in §5.
4. Data model
Collection products (the catalog — heterogeneous documents)
Common fields every match relies on, plus per-category attributes:
| field | type | notes |
|---|---|---|
| _id | ObjectId | generated |
| name | string | required |
| brand | string | required |
| category | string | required; must be one of the catalog’s known categories (drives the Vision enum + pre-filter) |
| attributes | object | per-category (sneakers: colour/material/sizes; tea: flavour/caffeine/grams; …) |
| price | int | cents |
| inStock | bool | used by the substitutes feature filter |
| sku / barcode | string | optional; unique index for the barcode feature fast-path |
| imageRef | string | optional; pointer to the product image (used by the change-stream worker) |
| embedding | float[D] | added at ingest; length === numDimensions of the index; L2-normalized when D < 3072 |
| embeddingHash | string | sha256 of embeddingText(doc); the idempotency guard for re-embedding |
Atlas Vector Search index products_vec (on products)
{
"fields": [
{ "type": "vector", "path": "embedding", "numDimensions": 768, "similarity": "cosine" },
{ "type": "filter", "path": "category" },
{ "type": "filter", "path": "brand" },
{ "type": "filter", "path": "inStock" }
]
}
- numDimensions must equal the embedding length D you ingest with. A mismatch breaks the build/query; it is the single most common setup error.
- Only fields declared type:"filter" can appear in a $vectorSearch filter. inStock is declared up front so the substitutes feature works without an index change.
- Embedding-normalization invariant (load-bearing): gemini-embedding-001 returns embeddings that are only pre-normalized at the full 3072 dimensions. At any smaller outputDimensionality (e.g. 768 or 1536) the vectors carry varying magnitude that distorts cosine similarity, so you must L2-normalize every vector, catalog and query alike, before storing/searching, or use D = 3072. (Confirmed: https://ai.google.dev/gemini-api/docs/embeddings; manual normalization is required for non-3072 dims; gemini-embedding-2 auto-normalizes truncated dims, so pairing the model id with the normalize rule keeps a model swap correct.)
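The two index invariants above (length equals numDimensions, and L2-normalize below 3072) can be enforced in one helper shared by ingest and query. This is a sketch under assumptions: the names l2normalize and prepareEmbedding are illustrative, and the 3072 constant simply restates the rule in this spec.

```typescript
// Per this spec, gemini-embedding-001 is pre-normalized only at full size.
const FULL_DIM = 3072;

// Scale a vector to unit length (L2 norm of 1).
function l2normalize(v: readonly number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) return [...v]; // a zero vector has no direction; leave it alone
  return v.map((x) => x / norm);
}

// Guard the length against the index's numDimensions, then apply the
// normalization rule. Run this on catalog vectors and query vectors alike.
function prepareEmbedding(raw: readonly number[], expectedDim: number): number[] {
  if (raw.length !== expectedDim) {
    throw new Error(
      `embedding length ${raw.length} does not match numDimensions ${expectedDim}`,
    );
  }
  return expectedDim < FULL_DIM ? l2normalize(raw) : [...raw];
}
```

Failing fast on a length mismatch surfaces the most common setup error at ingest time rather than as an opaque index or query failure later.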
Collection _worker_state (prerequisite for the change-stream worker)
A single document { _id: "embeddings-worker", resumeToken: <BSON resume token | null> }, seeded by the seed
step so the worker has a row to read/update from the start (the FK-row analogue). On each handled event the
worker writes the latest resume token here; on startup it reads it back via resumeAfter/SetResumeAfter.
Collection search_misses (feature: no-match-analytics)
{
"at": "ISODate",
"descriptor": { "category": "...", "brand": "...", "colour": "...", "form": "...", "visibleText": "...", "attributes": ["..."] },
"nearMisses": [ { "name": "...", "score": 0.62 } ],
"threshold": 0.75
}
Separate collection so analytics writes never touch the catalog the recognise path reads. No raw image is stored (descriptor + scores only — privacy-aware default).
Migrations / seed order (prerequisites first)
1. Create products (lazily on first insert) and seed ~15 products across ≥3 categories, using an idempotent upsert by (brand, name).
2. Seed the _worker_state document (so the worker prerequisite exists before the feature).
3. Run ingest to populate embedding + embeddingHash on every product.
4. Create the products_vec index (Atlas UI or API) with numDimensions === D.
5. (features) Create the unique index on sku/barcode; search_misses is lazily created on first miss.
There are no foreign keys (document store), but _worker_state is the explicit prerequisite row the worker
path needs, and every product must have an embedding of length D before the products_vec index is usable
— ingest is a hard prerequisite of the recognise step.
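One way to make the seed step idempotent is to express it as upsert operations keyed on (brand, name). The sketch below only builds plain operation objects in the shape the official Node driver's bulkWrite and updateOne accept; SeedProduct and both builder names are hypothetical.

```typescript
// Hypothetical seed record: the common product fields from the data model,
// without embedding or embeddingHash (ingest adds those later).
type SeedProduct = {
  name: string;
  brand: string;
  category: string;
  attributes: Record<string, unknown>;
  price: number; // cents
  inStock: boolean;
};

// One upsert per product, matched on (brand, name). Re-running the seed
// rewrites content fields but never touches a stored embedding, because
// $set only names the fields present in the seed record.
function seedUpsertOps(products: SeedProduct[]) {
  return products.map((p) => ({
    updateOne: {
      filter: { brand: p.brand, name: p.name },
      update: { $set: p },
      upsert: true,
    },
  }));
}

// The worker prerequisite row. $setOnInsert means a re-seed does not reset
// a resume token the worker has already saved.
function workerStateUpsert() {
  return {
    filter: { _id: "embeddings-worker" },
    update: { $setOnInsert: { resumeToken: null } },
    options: { upsert: true },
  };
}
```

With a connected collection, the first result would typically be passed to products.bulkWrite(...), and the second unpacked into updateOne(filter, update, options) on _worker_state.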
5. API & event contract (the one canonical shape)
Every step, client, and test shares exactly these shapes.
POST /recognize
- Request: multipart/form-data
  - image (file, required): a product photo (JPEG/PNG).
  - category (string, optional): a category hint; normally omitted (the descriptor supplies it).
- Response 200, match:
  {
    "descriptor": { "brand": "Northpeak", "category": "sneakers", "colour": "red", "form": "low-top", "visibleText": "", "attributes": ["leather"] },
    "filterApplied": "sneakers",
    "matches": [ { "id": "…", "name": "Trailblazer Low", "brand": "Northpeak", "category": "sneakers", "attributes": { "colour": "red", "material": "leather" }, "price": 8900, "inStock": true, "score": 0.88 } ],
    "noMatch": false
  }
- Response 200, no confident match:
  { "descriptor": {…}, "filterApplied": null, "matches": [], "noMatch": true }
  This is an honest “I don’t know”, not an error status. The no-match branch still carries descriptor and filterApplied (same envelope as a match, only matches is empty) so the client can keep showing what Vision saw and which pre-filter ran (or that it fell back). filterApplied is typically null on a no-match because a true out-of-catalog photo reaches the threshold step only after the unfiltered fallback (§6).
- Status / error codes:
  - 200: match or no-match (both are success).
  - 400: image part missing or unreadable ({ "error": "image required" }).
  - 415: unsupported media type (not JPEG/PNG), optional.
  - 502: upstream Gemini call failed after retries ({ "error": "vision unavailable" }).
  - 500: unexpected server error.
Field contract (Match): id (string), name, brand, category (strings), attributes (object),
price (int cents), inStock (bool), score (number in [0,1]). Matches are ordered best-first.
score is the Atlas vectorSearchScore. filterApplied echoes which category pre-filter fired (or
null if the search ran unfiltered) so the UI can show the spotlight at work.
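The field contract can be written down as types plus a small checker that a test or client could run against any response. This is a sketch: the type names mirror the spec, while contractViolations is a hypothetical helper, and its rule that noMatch is true exactly when matches is empty is this sketch's reading of the two response examples.

```typescript
type Descriptor = {
  brand: string;
  category: string;
  colour: string;
  form: string;
  visibleText: string;
  attributes: string[];
};

type Match = {
  id: string;
  name: string;
  brand: string;
  category: string;
  attributes: Record<string, unknown>;
  price: number; // int cents
  inStock: boolean;
  score: number; // Atlas vectorSearchScore, in [0,1]
};

type RecognizeResponse = {
  descriptor: Descriptor;
  filterApplied: string | null;
  matches: Match[];
  noMatch: boolean;
};

// Returns a list of human-readable problems; an empty list means the
// response satisfies the checks below.
function contractViolations(r: RecognizeResponse): string[] {
  const problems: string[] = [];
  if (r.noMatch !== (r.matches.length === 0)) {
    problems.push("noMatch must be true exactly when matches is empty");
  }
  let previous = Infinity;
  r.matches.forEach((m, i) => {
    if (!(m.score >= 0 && m.score <= 1)) {
      problems.push(`matches[${i}].score is outside [0,1]`);
    }
    if (!Number.isInteger(m.price)) {
      problems.push(`matches[${i}].price is not an integer number of cents`);
    }
    if (m.score > previous) {
      problems.push(`matches[${i}] breaks best-first ordering`);
    }
    previous = m.score;
  });
  return problems;
}
```

Running the same checker in both backends' integration tests is one inexpensive way to keep the Go and TypeScript responses in parity.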
Score semantics (load-bearing). For cosine similarity Atlas maps the raw cosine [-1,1] into [0,1] as (1 + cosine) / 2. So an unrelated (orthogonal) photo’s nearest stranger still scores ~0.5, not 0: 0.5 is the “no real similarity” floor, and real matches for a clean photo sit well above it. A naive low threshold like T=0.3 is therefore meaningless; calibrate T above the ~0.5 floor. (Confirmed: MongoDB normalizes cosine as (1+cosine)/2.)
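The score mapping and the threshold cut are small enough to sketch directly. The function names are illustrative, and treating a score equal to T as a match is an assumption; the spec only says matches sit above T.

```typescript
// Atlas's cosine normalization as described above: [-1,1] -> [0,1].
function atlasCosineScore(cosine: number): number {
  return (1 + cosine) / 2;
}

// Keep candidates at or above the threshold, best-first. An empty result
// is the no-match outcome, not an error.
function applyThreshold<T extends { score: number }>(
  candidates: T[],
  threshold: number,
): { matches: T[]; noMatch: boolean } {
  const matches = candidates
    .filter((c) => c.score >= threshold)
    .sort((a, b) => b.score - a.score);
  return { matches, noMatch: matches.length === 0 };
}

// An orthogonal "nearest stranger" lands exactly on the 0.5 floor.
const strangerScore = atlasCosineScore(0);

// With T = 0.3 the stranger passes, which is why a threshold below the
// floor filters nothing; with T = 0.75 it is rejected.
const naive = applyThreshold([{ name: "stranger", score: strangerScore }], 0.3);
const calibrated = applyThreshold([{ name: "stranger", score: strangerScore }], 0.75);
```

Here naive.noMatch is false and calibrated.noMatch is true, which is the calibration argument in miniature. The value 0.75 is only the example threshold from the search_misses document, not a recommended setting.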
Feature endpoints (off by default — see §8)
- GET /products/{id}/substitutes → { matches:[{...product, score}] } (in-stock only, lower threshold).
- POST /recognize/shelf → [ { box, matches:[{...product, score}], noMatch } ] (multi-shelf fan-out).
- GET /analytics/top-misses?since=<ISO> → [ { category, brand, requests, avgNearMiss, lastRequested } ].
Wire/event message — change-stream event (feature: dynamic-embeddings)
The worker consumes MongoDB change-stream events on products opened with fullDocument:"updateLookup" and a
$match pipeline that only lets content edits through:
operationType ∈ {insert, update, replace}
AND ( insert | replace OR updateDescription.updatedFields has one of:
name | brand | category | attributes | imageRef )
Per event: text = embeddingText(fullDocument); hash = sha256(text); skip if hash == doc.embeddingHash
(idempotent); else Embed(text) → SetEmbedding(id, vector, hash). The write-back touches only
embedding/embeddingHash, which the $match excludes — so it never re-triggers the worker. After each
handled event, persist resumeToken to _worker_state; on startup read it back as resumeAfter.
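The $match rule and the idempotency check above might be expressed as follows. This is a sketch that only builds the pipeline document; the function names are hypothetical. One caveat to verify against your own update patterns: an update written with a dotted path such as attributes.colour may appear in updatedFields under that dotted key, which the $exists checks below would not catch.

```typescript
// Content fields whose edits should trigger a re-embed (from the rule above).
const CONTENT_FIELDS = ["name", "brand", "category", "attributes", "imageRef"] as const;

// Pipeline for collection.watch(...): inserts and replaces always pass,
// updates pass only when they touch a content field.
function contentEditPipeline(fields: readonly string[] = CONTENT_FIELDS) {
  return [
    {
      $match: {
        $or: [
          { operationType: { $in: ["insert", "replace"] } },
          {
            operationType: "update",
            $or: fields.map((f) => ({
              [`updateDescription.updatedFields.${f}`]: { $exists: true },
            })),
          },
        ],
      },
    },
  ];
}

// Idempotency guard: re-embed only when the text hash actually changed.
// Hashing itself (sha256 of embeddingText) is left to the caller.
function needsReembed(currentTextHash: string, storedHash: string | undefined): boolean {
  return currentTextHash !== storedHash;
}
```

Because embedding and embeddingHash are absent from CONTENT_FIELDS, the worker's own write-back never matches the update branch of the filter.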
6. Build order (dependency-ordered; each step’s prerequisites already exist)
1. Prerequisites & local tooling: Go toolchain (or Node), mongosh, curl/base64, a sample image.
2. Atlas M0: cluster + MONGODB_URI, MONGODB_DB (Vector Search is Atlas-only).
3. Gemini key: confirm Vision + embeddings respond (portable base64; name sample.jpg).
4. Model the catalog + seed: products (~15, ≥3 categories) + the _worker_state doc; idempotent.
5. Shared embeddingText builder + normalize helper: defined once, reused by ingest, query, worker.
6. Ingest: embed each product (fetch with find(), iterate by doc._id), L2-normalize when D < 3072, store embedding + embeddingHash.
7. Create products_vec index: numDimensions === D; category/brand/inStock as filters; cosine.
8. Design the recognise pipeline + Vision responseSchema: category as an enum of the known categories (closes the Vision-vs-catalog gap), descriptor → embeddable text.
9. Scaffold the API + /recognize skeleton + a Gemini-from-code worked example (per backend): full import block / package install; the entrypoint that composes client + routes + shutdown.
10. ★ Recognise end to end (per backend): Vision → embed (normalized) → $vectorSearch (category pre-filter, fall back to unfiltered when the filtered result is empty) → ranked matches + scores.
11. Threshold or no-match: apply T; clean noMatch.
12. Calibrate T: score known/unknown photos; the ~0.5 cosine floor; no-match UX.
13. Frontend (per platform): capture/pick → /recognize → ranked matches + scores + descriptor/filter panel + live threshold slider + nearest-below-threshold no-match UX.
14. Integration tests (per backend): known photo matches above T; unknown → noMatch.
15. Optional deploy: Cloud Run, free-tier eligible.
16. Feature modules (off by default): §8.
Each step depends only on earlier ones: the index (7) needs ingest (6); ingest needs the shared builder (5)
and the seed (4); recognise (10) needs the index (7), the responseSchema (8), and the scaffold (9); the
worker feature needs _worker_state (4) and the shared builder (5).
7. Backends — Go (default) + TypeScript, same contract
Parity points (both must hold):
- Same response shape (§5) byte-for-byte in field names and types; matches best-first; score in [0,1]; filterApplied echoed; noMatch is a 200.
- Same $vectorSearch stage: index:"products_vec", path:"embedding", queryVector length D, numCandidates (~20× limit, must be ≥ limit), limit, optional filter { category:{$eq:hint} }, optional exact:true (ENN baseline). $project adds score:{$meta:"vectorSearchScore"}.
- Same embedding rule: same model + same outputDimensionality D + L2-normalize when D < 3072 for both catalog and query.
- Same category-gap handling: Vision category constrained to the known enum; on an empty filtered result, retry the search unfiltered before declaring no-match.
- Same fire-and-forget miss logging (feature) into a separate search_misses collection; no raw image.
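The shared $vectorSearch stage can be sketched as a pure pipeline builder that either backend could mirror. SearchQuery follows the Store.Search contract in §3; the builder name is hypothetical, and omitting numCandidates when exact is true reflects the assumption that ENN queries do not take that parameter.

```typescript
type SearchQuery = {
  queryVector: number[];
  categoryHint?: string;
  numCandidates: number;
  limit: number;
  exact?: boolean;
  inStockOnly?: boolean;
};

type VectorSearchPipeline = [
  { $vectorSearch: Record<string, unknown> },
  { $project: Record<string, unknown> },
];

function buildVectorSearchPipeline(q: SearchQuery): VectorSearchPipeline {
  if (!q.exact && q.numCandidates < q.limit) {
    throw new Error("numCandidates must be >= limit");
  }

  // Only fields declared as type "filter" in products_vec may appear here.
  const clauses: Record<string, unknown>[] = [];
  if (q.categoryHint) clauses.push({ category: { $eq: q.categoryHint } });
  if (q.inStockOnly) clauses.push({ inStock: { $eq: true } });

  const stage: Record<string, unknown> = {
    index: "products_vec",
    path: "embedding",
    queryVector: q.queryVector,
    limit: q.limit,
  };
  if (q.exact) {
    stage.exact = true; // ENN baseline; numCandidates is left out
  } else {
    stage.numCandidates = q.numCandidates;
  }
  if (clauses.length === 1) stage.filter = clauses[0];
  if (clauses.length > 1) stage.filter = { $and: clauses };

  return [
    { $vectorSearch: stage },
    {
      // Inclusion projection: return the Match fields plus the score,
      // and leave the large embedding array out of the response.
      $project: {
        name: 1, brand: 1, category: 1, attributes: 1, price: 1, inStock: 1,
        score: { $meta: "vectorSearchScore" },
      },
    },
  ];
}
```

For the category-gap fallback, a caller would run the pipeline once with categoryHint set and, if the result is empty, build it again without the hint and report filterApplied as null.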
Go specifics (verified):
- Module go.mongodb.org/mongo-driver/v2; import all three sub-packages used: go.mongodb.org/mongo-driver/v2/mongo, .../v2/mongo/options, .../v2/bson. One go get go.mongodb.org/mongo-driver/v2/mongo pulls the whole module; the import lines must list the sub-packages. bson.ObjectID is the v2 type name (was primitive.ObjectID in v1). mongo.Connect(options.Client().ApplyURI(uri)) is the v2 signature (no context arg).
- Gemini Go SDK google.golang.org/genai: genai.NewClient(ctx, &genai.ClientConfig{APIKey: key, Backend: genai.BackendGeminiAPI}); Vision via client.Models.GenerateContent(ctx, model, contents, &genai.GenerateContentConfig{ResponseMIMEType:"application/json", ResponseSchema: …}) with &genai.Blob{Data: imageBytes, MIMEType:"image/jpeg"} as an inline Part; embeddings via client.Models.EmbedContent(ctx, model, contents, &genai.EmbedContentConfig{OutputDimensionality: &d}).
TypeScript specifics:
- Official mongodb driver; new MongoClient(uri) + await client.connect(); collection.aggregate(pipeline).toArray(); collection.watch(pipeline, { fullDocument:"updateLookup", resumeAfter }) (async-iterable).
- Gemini via the REST API (x-goog-api-key, v1beta, generateContent/embedContent) or @google/genai.
Neither backend hard-codes a model id: read GEMINI_VISION_MODEL / GEMINI_EMBED_MODEL from config and link
the official model list (https://ai.google.dev/gemini-api/docs/models). The only current free-tier embedding
model needing normalization is gemini-embedding-001, so the normalize rule is the safe default.
8. Optional feature modules (off by default; each extends, never rewrites, the spec)
- hybrid: fuse $vectorSearch with Atlas Search $search over brand/name/visibleText via RRF (or the Atlas $rankFusion stage where available); one confidence cut on the fused score; no-match path preserved.
- substitutes: GET /products/{id}/substitutes: the product’s own embedding as the query vector, $vectorSearch with filter:{inStock:true} excluding the product, a lower threshold (recall over precision). Uses the inStock filter field already in the index.
- multi-shelf: POST /recognize/shelf: Vision returns an array of items (array responseSchema) or detect-then-recognise per crop; reuse the per-item pipeline; return [{box, matches[], noMatch}].
- barcode: exact sku/barcode lookup (unique index) before the vector fallback; cheap/deterministic path first, vector path only on a miss.
- dynamic-embeddings: the change-stream re-embed worker (§5 event shape). Prerequisite: the _worker_state doc (§4) and the shared embeddingText builder. Go (mongo.ChangeStream + SetFullDocument(options.UpdateLookup) + SetResumeAfter) and TS (watch(...)) parity.
- performance: make numCandidates configurable; sweep it against an exact:true ENN baseline (ENN is the ground truth for catalogs under ~10k docs) and pick the smallest value within recall tolerance; numCandidates ≥ limit, ~20× limit as the documented starting point.
- no-match-analytics: fire-and-forget LogMiss into search_misses on the no-match branch (Go/TS parity), plus a common GET /analytics/top-misses aggregation ranking unmet demand.
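For the hybrid module, client-side reciprocal rank fusion (RRF) is small enough to sketch. Each input list holds product ids ordered best-first, one list per retriever (for example the $vectorSearch results and the $search results). The constant k = 60 is a commonly used default rather than anything this spec prescribes, and the function name is illustrative.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per id,
// with rank starting at 1. Ids found by several retrievers accumulate.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60,
): { id: string; score: number }[] {
  const fused: Record<string, number> = {};
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      fused[id] = (fused[id] ?? 0) + 1 / (k + rank);
    });
  }
  return Object.keys(fused)
    .map((id) => ({ id, score: fused[id] ?? 0 }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical ids: "b" is ranked high by both retrievers, so it wins
// even though "a" is first in the vector list.
const fusedOrder = reciprocalRankFusion([
  ["a", "b", "c"], // e.g. vector search order
  ["b", "c", "a"], // e.g. text search order
]).map((r) => r.id);
```

Fused RRF scores are small sums of reciprocals, not values on the [0,1] cosine scale, so the confidence cut for the hybrid path would need its own calibration rather than reusing T.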
9. Free-to-complete ($0)
| Need | Free option | First-appears note |
|---|---|---|
| Vector DB | MongoDB Atlas M0 (no card; a real replica set; Vector Search + Atlas Search + change streams all run on it; local mongod cannot build the vector index) | “Costs nothing” on the Atlas step |
| AI (Vision + embeddings) | Google AI Studio free-tier key (one key, both jobs) | “Costs nothing” on the Gemini step |
| Backend runtime | Local Go toolchain or Node | — |
| Mobile | Android emulator / iOS Simulator + bundled sample photos (so every scan is free + reproducible) | — |
| Deploy (optional) | Cloud Run free monthly allotment (scales to zero); key in Secret Manager | “Costs nothing” on the deploy step |
Confirm current free-tier limits on the official docs; nothing in the default path requires a paid service.