Gemini is the default AI of this curriculum because one key and one API give you a multimodal model that
reads long documents, returns structured JSON, calls your tools, and embeds text for search. This track
moves from “first generateContent call” to streaming, multimodal input, function calling, JSON schemas,
embeddings for RAG, and safety — tagged by level so you can read only as deep as you need. Always keep the
key server-side; never ship it in a mobile or web client. Model ids change over time (the gemini-2.x
family at time of writing) — check the model list for the
exact current id.
Get an API key and store it as an env var
BeginnerCreate an API key in Google AI Studio, then export it as GEMINI_API_KEY in your shell. Never paste the
key into client code or commit it.
Why the key lives in the environment, not the code
An AI Studio key authenticates every request and is a secret — anyone holding it can spend on your account.
Keep it out of source control and out of any client that ships to users (a mobile binary or browser bundle
can be unpacked). The SDKs read GEMINI_API_KEY from the environment by default, so exporting it is the
least error-prone path. For production on Google Cloud you’d graduate to Vertex AI with IAM instead of a
raw key — see the GCP track.
Set the key and smoke-test with REST
# Create a key at https://aistudio.google.com/apikey, then:
export GEMINI_API_KEY="your-key-here"
# One curl proves the key and the endpoint work (swap in a current model id):
curl -s "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent" \
-H "x-goog-api-key: $GEMINI_API_KEY" \
-H "Content-Type: application/json" \
-d '{"contents":[{"parts":[{"text":"Say hello in five words."}]}]}'Make your first generateContent call from code
BeginnerInstall the official SDK for your language, then send one text prompt and print the reply.
What generateContent is, and the SDKs that wrap it
generateContent is the core endpoint: you send contents (your prompt parts) and get back candidates of
generated text. Google ships official SDKs for Python (google-genai), JavaScript/TypeScript
(@google/genai), and Go (google.golang.org/genai), plus plain REST for anything else. The SDKs read the
key from GEMINI_API_KEY, handle retries, and expose the same surface as REST. This curriculum’s backend
lane is Python and Go, so start there.
Install the SDK and send one prompt
# Python
pip install google-genai
# or JavaScript
npm install @google/genai# main.py — reads GEMINI_API_KEY from the environment
from google import genai
client = genai.Client()
resp = client.models.generate_content(
model="gemini-2.5-flash", # check the docs for the current id
contents="Explain what an API key is, in two sentences.",
)
print(resp.text)Agent prompt — paste into an agent with repo access
Role: Senior backend engineer in this repo (Python 3.11+ service).
Context: GEMINI_API_KEY is set in the environment. The official SDK is google-genai (import: `from google import genai`). Model id is configurable via env GEMINI_MODEL (default "gemini-2.5-flash"); do not hardcode a model that may be retired.
Task: Add a thin module app/llm.py with a function generate(prompt: str) -> str that calls client.models.generate_content and returns resp.text.
Requirements:
- Construct genai.Client() once at module load; read the model id from os.environ.get("GEMINI_MODEL", "gemini-2.5-flash").
- generate(prompt) raises a clear ValueError if prompt is empty; never logs the API key.
- No network call at import time; the client is lazy or constructed but unused until generate() runs.
Tests / acceptance:
- `python -c "import app.llm"` imports without error and without making a network request.
- With a fake/monkeypatched client, generate("hi") returns the stubbed text; generate("") raises ValueError.
- `ruff check app/llm.py` is clean.
Output: a unified diff plus a one-paragraph note on where the model id is configured.Stream the response token by token
BeginnerSwitch to the streaming variant so text appears as it is generated instead of after a pause.
Why streaming, and how it differs from one shot
generateContent returns the whole answer at once; the streaming variant
(generate_content_stream in the SDK, streamGenerateContent over REST) yields chunks as the model
produces them. For anything a human reads live — a chat reply, a long summary — streaming cuts perceived
latency dramatically because the first words show in well under a second. The trade-off: you handle a
sequence of partial chunks and concatenate their .text rather than reading one final string.
Stream chunks as they arrive
from google import genai
client = genai.Client()
stream = client.models.generate_content_stream(
model="gemini-2.5-flash",
contents="Write a three-sentence intro to vector databases.",
)
for chunk in stream:
print(chunk.text, end="", flush=True)
print()Chat prompt — paste into a chat to get the code
Role: Gemini teacher. The reader has no repo access here — return complete, runnable code.
Context: Python 3.11+, the google-genai SDK installed, GEMINI_API_KEY in the environment.
Task: Show a small async FastAPI endpoint GET /stream?q=... that proxies a Gemini streaming response to the client as Server-Sent Events, keeping the key server-side.
Requirements:
- Use client.models.generate_content_stream and yield each chunk.text as an SSE "data:" line.
- The key never appears in the response or in any client-visible header.
- Model id read from env GEMINI_MODEL with a sensible default.
Tests / acceptance (describe, since no repo):
- curl -N "localhost:8000/stream?q=hello" prints incremental data: lines, then closes.
- Inspecting the network response shows no API key.
Output: the complete FastAPI file, no commentary.Steer behaviour with a system instruction
BeginnerPass a system_instruction that sets the model’s role and rules, separate from the user’s prompt.
Why the system instruction is its own channel
A system instruction defines who the model is and how it should behave (“You are a terse SQL tutor; never
write DELETE without a WHERE”). Keeping it separate from the user turn means the persona and guardrails
persist across the conversation and aren’t something the user can casually overwrite in one message. It’s
the cheapest, highest-leverage way to shape tone, format, and refusals before you reach for tools or
schemas. In the SDK it goes in the request config, not in contents.
Set a system instruction
from google import genai
from google.genai import types
client = genai.Client()
resp = client.models.generate_content(
model="gemini-2.5-flash",
contents="How do I list running containers?",
config=types.GenerateContentConfig(
system_instruction=(
"You are a concise DevOps assistant. Answer in one shell command "
"followed by one sentence of explanation. Never invent flags."
),
),
)
print(resp.text)Send an image alongside text (multimodal input)
IntermediateAttach an image to the same prompt and ask the model to describe or reason about it.
Why multimodal is Gemini's signature strength
Gemini is natively multimodal: a single contents array can mix text with images, and (model permitting)
audio and video. You don’t run a separate vision model and stitch results — one call reasons over the
picture and the question together (“What’s the error in this screenshot?”, “Transcribe this receipt as
JSON”). For larger or reused media, upload it with the File API and reference the returned handle instead
of inlining bytes on every request. Keep an eye on the per-request size and token limits in the docs.
Inline an image with the prompt
from google import genai
from google.genai import types
client = genai.Client()
with open("screenshot.png", "rb") as f:
image_bytes = f.read()
resp = client.models.generate_content(
model="gemini-2.5-flash",
contents=[
types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
"What does the error message in this screenshot say, and how do I fix it?",
],
)
print(resp.text)Agent prompt — paste into an agent with repo access
Role: Senior backend engineer in this repo (Python 3.11+).
Context: google-genai SDK, GEMINI_API_KEY set. A multimodal-capable model id is in env GEMINI_MODEL. Sample fixtures exist under tests/fixtures/ (a small PNG receipt.png).
Task: Add describe_image(path: str, question: str) -> str in app/vision.py that sends the image bytes plus the question in one generate_content call and returns resp.text.
Requirements:
- Detect the MIME type from the file extension (.png, .jpg/.jpeg); raise ValueError on anything else.
- Read the file as bytes and pass it via types.Part.from_bytes; the question goes in the same contents list.
- Never load the whole image into a log line.
Tests / acceptance:
- With a monkeypatched client returning a fixed string, describe_image("tests/fixtures/receipt.png", "total?") returns that string.
- describe_image("note.txt", "x") raises ValueError (unsupported type).
- `pytest tests/test_vision.py` passes; `ruff check app/vision.py` is clean.
Output: a unified diff plus a one-line note on the MIME types supported.Force structured JSON with a response schema
IntermediateSet the response MIME type to JSON and supply a schema so the reply is parseable, not prose.
Why a schema beats 'please reply in JSON'
Asking a model to “respond in JSON” works until it doesn’t — a stray sentence or a trailing comma breaks
your parser. Gemini supports constrained decoding: set response_mime_type="application/json" and a
response_schema, and the model is constrained to emit JSON matching that shape. You get a reliable
contract you can json.loads and validate, which is what makes Gemini safe to put behind a typed API. Pair
it with a Pydantic model (Python) or a typed struct (Go) so your code and the schema can’t drift.
Constrain output to a typed schema
from google import genai
from google.genai import types
from pydantic import BaseModel
import json
class Receipt(BaseModel):
merchant: str
total_cents: int
currency: str
client = genai.Client()
resp = client.models.generate_content(
model="gemini-2.5-flash",
contents="Merchant: Cafe Luna. Total: $12.50 USD. Extract the fields.",
config=types.GenerateContentConfig(
response_mime_type="application/json",
response_schema=Receipt,
),
)
data = Receipt.model_validate_json(resp.text) # parses & validates
print(data.total_cents) # 1250Agent prompt — paste into an agent with repo access
Role: Senior backend engineer in this repo (Python 3.11+).
Context: google-genai SDK, GEMINI_API_KEY set, model id in env GEMINI_MODEL. Pydantic v2 is available.
Task: Add extract_receipt(text: str) -> Receipt in app/extract.py using a Pydantic Receipt(merchant: str, total_cents: int, currency: str) as the response_schema with response_mime_type="application/json".
Requirements:
- Money is integer cents (total_cents), never a float; currency is a 3-letter code.
- Parse the reply with Receipt.model_validate_json; on a validation error, raise a domain error ExtractionError, do not return raw text.
- Do not post-process or regex the model output to "fix" JSON — rely on the schema.
Tests / acceptance:
- With a monkeypatched client returning '{"merchant":"Cafe Luna","total_cents":1250,"currency":"USD"}', extract_receipt(...) returns a Receipt with total_cents == 1250.
- A monkeypatched reply of malformed JSON raises ExtractionError.
- `pytest tests/test_extract.py` passes; `ruff check app/extract.py` is clean.
Output: a unified diff plus a one-paragraph note on why constrained decoding beats prompt-only JSON.Let the model call your functions (tool calling)
IntermediateDeclare your functions as tools so the model can ask to call them, then you run them and return the result.
How the function-calling loop actually works
Function (tool) calling lets the model request an action without executing anything itself. You declare
function signatures (name, description, parameters); when the model decides a call is needed it returns a
functionCall with arguments instead of text. Your code runs the real function — a weather API, a DB
query — and sends the result back as a functionResponse, and the model continues with that grounding.
This is the backbone of agents: the model reasons, your code acts, and the loop repeats until there’s a
final answer. You stay in control of side effects because you execute the calls.
Declare a tool and run the call it requests
from google import genai
from google.genai import types
def get_weather(city: str) -> dict:
# your real implementation calls a weather API
return {"city": city, "temp_c": 21, "sky": "clear"}
client = genai.Client()
resp = client.models.generate_content(
model="gemini-2.5-flash",
contents="What's the weather in Istanbul?",
config=types.GenerateContentConfig(tools=[get_weather]),
)
# The SDK can auto-run the Python function and feed the result back;
# resp.text holds the final natural-language answer.
print(resp.text)Agent prompt — paste into an agent with repo access
Role: Senior backend engineer in this repo (Python 3.11+).
Context: google-genai SDK, GEMINI_API_KEY set, model id in env GEMINI_MODEL. We expose one internal tool, lookup_order(order_id: str) -> dict, that reads from a fake in-memory store in tests.
Task: Wire lookup_order as a Gemini tool in app/agent.py with answer(question: str) -> str, so a question like "is order A123 shipped?" triggers the function call and returns a grounded answer.
Requirements:
- Register the function via config=types.GenerateContentConfig(tools=[lookup_order]).
- lookup_order has a typed signature and a docstring the model can read; unknown ids return {"error": "not found"} (no exception).
- Side effects stay in your code: the model only requests the call; your function executes it.
Tests / acceptance:
- With a monkeypatched client that emulates one functionCall for order "A123" then a final text answer, answer("is order A123 shipped?") includes the store's status string.
- lookup_order("ZZZ") returns {"error": "not found"}.
- `pytest tests/test_agent.py` passes; `ruff check app/agent.py` is clean.
Output: a unified diff plus a one-paragraph description of the request/response loop.Reason over a very long document (long context)
AdvancedFeed a whole document into one prompt and ask questions across all of it, instead of chunking by hand.
When long context replaces retrieval — and when it doesn't
Gemini models accept very large context windows (hundreds of thousands to over a million tokens, depending on the model — check the docs for the current limit). That means you can drop an entire contract, codebase, or transcript into one call and ask cross-cutting questions without building a retrieval pipeline first. It’s the simplest path when the corpus fits and is read occasionally. The trade-off: every token in the window is paid for on every call and adds latency, so for a large, frequently queried corpus you switch to retrieval (RAG) — embed once, fetch only the relevant chunks. Long context and RAG are complementary, not rivals: use the window for “read this whole thing now”, use RAG for “search this big thing repeatedly”.
Ask across a whole file in one call
from google import genai
client = genai.Client()
with open("contract.txt", "r", encoding="utf-8") as f:
document = f.read()
resp = client.models.generate_content(
model="gemini-2.5-pro", # use a model whose context limit covers your doc
contents=[
"Answer only from the document below. List every payment deadline and its clause number.",
document,
],
)
print(resp.text)Agent prompt — paste into an agent with repo access
Role: Senior backend engineer in this repo (Python 3.11+).
Context: google-genai SDK, GEMINI_API_KEY set, a long-context model id in env GEMINI_MODEL. Test fixtures hold a multi-thousand-token text file under tests/fixtures/contract.txt.
Task: Add answer_over_document(doc: str, question: str) -> str in app/longctx.py that puts the whole document in one prompt and instructs the model to answer only from it.
Requirements:
- Prepend a grounding instruction: answer only from the supplied document; if the answer is absent, reply exactly "Not found in the document."
- Do NOT chunk or embed here — this is the long-context path; the document goes in verbatim.
- If the document is empty, raise ValueError before calling the API.
Tests / acceptance:
- With a monkeypatched client echoing a canned answer, answer_over_document(open(fixture).read(), "deadline?") returns that answer.
- answer_over_document("", "x") raises ValueError without a network call.
- `pytest tests/test_longctx.py` passes; `ruff check app/longctx.py` is clean.
Output: a unified diff plus a one-paragraph note on when to switch this to RAG.Embed text and build RAG with a vector store
AdvancedTurn text into embedding vectors with the embeddings model, store them, and retrieve the nearest chunks to ground a generation.
Why embeddings need a partner — and how RAG fits together
An embedding is a fixed-length vector that captures meaning; similar text lands at nearby points. Gemini’s
text-embedding model (gemini-embedding-001 at time of writing — confirm the current id and its dimension
in the docs) gives you those vectors, but the API does not store or search them. You pair it with a vector
store: this curriculum uses Postgres + pgvector, so the same database that holds your rows holds the
embeddings and does the nearest-neighbour search. Retrieval-Augmented Generation is the loop: embed your
documents once, embed the user’s question at query time, fetch the top-k closest chunks, and pass them as
context to generateContent. The model answers from your data, with citations, instead of guessing. The
column dimension must exactly match the embedding model’s output dimension — see the
PostgreSQL track for the pgvector half.
Embed text, then ground a generation on the matches
from google import genai
client = genai.Client()
# 1. Embed a query (do the same for your documents at index time).
emb = client.models.embed_content(
model="gemini-embedding-001", # check the docs for the current id + dimension
contents="How do refunds work?",
)
query_vector = emb.embeddings[0].values # store/search these in pgvector
# 2. Your vector store returns the top-k chunks for query_vector (see the Postgres track).
top_chunks = vector_store_search(query_vector, k=4) # your code
# 3. Ground the answer on the retrieved chunks only.
resp = client.models.generate_content(
model="gemini-2.5-flash",
contents=[
"Answer the question using ONLY the context below. Cite the chunk ids you used.",
"Context:\n" + "\n---\n".join(top_chunks),
"Question: How do refunds work?",
],
)
print(resp.text)Agent prompt — paste into an agent with repo access
Role: Senior backend engineer in this repo (Python 3.11+).
Context: google-genai SDK, GEMINI_API_KEY set. Embedding model id in env EMBED_MODEL (default "gemini-embedding-001"); generation model in env GEMINI_MODEL. A VectorStore protocol exists with search(vector: list[float], k: int) -> list[str]; a fake is injected in tests. Do NOT implement the store here — see the Postgres/pgvector track.
Task: Implement answer_with_rag(question: str, store: VectorStore, k: int = 4) -> str in app/rag.py: embed the question, retrieve top-k chunks, and generate a grounded answer that cites the chunks.
Requirements:
- Embed the question with client.models.embed_content; pass the resulting vector to store.search.
- The generation prompt must instruct the model to answer ONLY from the retrieved context and to say so if the answer is absent.
- The embedding dimension is whatever the model returns; do not hardcode 768 vs 1536 — read it from the response and assert the store accepts that length.
Tests / acceptance:
- With a monkeypatched embed_content (returns a fixed vector) and a fake store returning two known chunks, answer_with_rag("refunds?", fake_store) calls store.search exactly once with k=4 and returns the canned grounded answer.
- If the store returns no chunks, the function still returns a "no relevant context found" style answer without crashing.
- `pytest tests/test_rag.py` passes; `ruff check app/rag.py` is clean.
Output: a unified diff plus a one-paragraph note on where embeddings stop and the vector store begins.Configure safety settings deliberately
AdvancedSet the safety thresholds explicitly so harmful-content filtering matches your product, and handle a blocked response.
Why you read the safety metadata, not just resp.text
Gemini applies configurable safety filters across harm categories (harassment, hate speech, sexually
explicit, dangerous content). You can tune the blocking threshold per category, but the important discipline
is handling the outcome: a response can come back without usable text because it was blocked, and a prompt
can be blocked before generation. Read the response’s prompt_feedback and the candidate’s
finish_reason/safety_ratings rather than assuming resp.text is always present — that’s the difference
between a robust service and one that throws NoneType in production. Loosening thresholds is a product
decision you make consciously, with the categories named in code.
Set thresholds and check for a block
from google import genai
from google.genai import types
client = genai.Client()
resp = client.models.generate_content(
model="gemini-2.5-flash",
contents="Summarise the support ticket below.",
config=types.GenerateContentConfig(
safety_settings=[
types.SafetySetting(
category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
threshold=types.HarmBlockThreshold.BLOCK_ONLY_HIGH,
),
],
),
)
# Don't assume text is present — it may have been blocked.
if not resp.candidates or resp.candidates[0].finish_reason.name != "STOP":
print("blocked or incomplete:", resp.prompt_feedback)
else:
print(resp.text)Agent prompt — paste into an agent with repo access
Role: Senior backend engineer in this repo (Python 3.11+).
Context: google-genai SDK, GEMINI_API_KEY set, model id in env GEMINI_MODEL.
Task: Add safe_generate(prompt: str) -> str in app/safety.py that sets explicit safety_settings and never raises AttributeError on a blocked response.
Requirements:
- Configure at least one explicit SafetySetting (category + threshold) via GenerateContentConfig.
- If the response has no candidates, or the first candidate's finish_reason is not "STOP", return a constant SAFE_FALLBACK string and log the prompt_feedback (not the prompt text or the key).
- Otherwise return resp.text. Never assume resp.text is non-None.
Tests / acceptance:
- With a monkeypatched client returning a candidate whose finish_reason.name == "SAFETY", safe_generate(...) returns SAFE_FALLBACK and does not raise.
- With a normal STOP candidate, safe_generate(...) returns the text.
- `pytest tests/test_safety.py` passes; `ruff check app/safety.py` is clean.
Output: a unified diff plus a one-paragraph note on which fields signal a block.Make calls cheap and resilient in production
AdvancedCache and trim what you send, set timeouts and retries, and pick the right model size per call.
Where the cost and the failures actually come from
Two production realities dominate: tokens cost money on every call, and the network fails. Control cost by
choosing the smallest model that passes your evals (a flash model for routine work, a pro model only
where reasoning depth earns it), trimming context, and using context caching for large, reused prefixes so
you don’t resend the same document every turn. Control failures by setting explicit timeouts, retrying
transient errors (HTTP 429/5xx) with backoff and jitter, and respecting rate limits — the SDKs surface
these. Wrap the model behind your own interface so swapping the id or the provider later is a one-line
change, and put an eval suite in front of any prompt change so “cheaper” never quietly means “worse”.
Timeouts, retry on transient errors, model choice
import time
from google import genai
from google.genai import errors
client = genai.Client(http_options={"timeout": 30_000}) # ms
def generate_with_retry(prompt: str, model: str, attempts: int = 3) -> str:
for i in range(attempts):
try:
return client.models.generate_content(model=model, contents=prompt).text
except errors.APIError as e:
if e.code in (429, 500, 503) and i < attempts - 1:
time.sleep(2 ** i) # exponential backoff
continue
raiseAgent prompt — paste into an agent with repo access
Role: Senior backend / reliability engineer in this repo (Python 3.11+).
Context: google-genai SDK, GEMINI_API_KEY set. Calls go through app/llm.py from earlier steps. Model id in env GEMINI_MODEL.
Task: Harden the Gemini client with a configurable timeout and a retry wrapper generate_with_retry(prompt, model, attempts=3) in app/llm.py.
Requirements:
- Construct the client with an explicit request timeout.
- Retry only on transient codes (429, 500, 503) with exponential backoff; re-raise everything else immediately and after the final attempt.
- Do not retry on 400/401/403 (bad request / auth) — those won't fix themselves.
- The wrapper is pure with respect to the SDK: tests inject a fake client.
Tests / acceptance:
- A fake client that raises a 503 twice then succeeds: generate_with_retry returns the success text after 3 calls.
- A fake client raising 400 once: generate_with_retry raises immediately (one call, no retry).
- `pytest tests/test_llm.py` passes; `ruff check app/llm.py` is clean.
Output: a unified diff plus a short table of which status codes retry vs fail fast.Where to take it next
- Build the AI service this track points at in Helix Assistant, where Gemini drives a multimodal, tool-using assistant grounded on your own data via RAG.
- Host these calls behind a real backend in the Python track — Gemini is the model, Python is the service that orchestrates prompts, tools, and retrieval.
- Store the embeddings and run nearest-neighbour search with pgvector in the PostgreSQL track — the vector-store half of RAG.