← All tech

AI · default pick

Gemini API

Google's multimodal LLM API — long context, streaming, tool calling, structured output, embeddings.

  • Multimodal prompts mixing text, images, audio, and video
  • Very long context windows over whole documents
  • Structured JSON output enforced by a response schema
  • Tool / function calling that drives real APIs
  • Text embeddings that power semantic search and RAG
Use it when

Reach for Gemini when you want one API that reasons over text plus images, audio, and video, returns reliable JSON, calls your tools, and reads long documents — all behind a single key you keep server-side.

Reach for something else when

Don't put the key in a mobile or browser client (proxy through your backend), don't use a chat model where a tiny classifier or regex would do, and for hosting-your-own-weights or strict on-prem isolation reach for a self-hosted open model instead.

Official docs ↗


Gemini is the default AI of this curriculum because one key and one API give you a multimodal model that reads long documents, returns structured JSON, calls your tools, and embeds text for search. This track moves from “first generateContent call” to streaming, multimodal input, function calling, JSON schemas, embeddings for RAG, and safety — tagged by level so you can read only as deep as you need. Always keep the key server-side; never ship it in a mobile or web client. Model ids change over time (the gemini-2.x family at time of writing) — check the model list for the exact current id.

Get an API key and store it as an env var

Beginner

Create an API key in Google AI Studio, then export it as GEMINI_API_KEY in your shell. Never paste the key into client code or commit it.

Why the key lives in the environment, not the code

An AI Studio key authenticates every request and is a secret — anyone holding it can spend on your account. Keep it out of source control and out of any client that ships to users (a mobile binary or browser bundle can be unpacked). The SDKs read GEMINI_API_KEY from the environment by default, so exporting it is the least error-prone path. For production on Google Cloud you’d graduate to Vertex AI with IAM instead of a raw key — see the GCP track.

Set the key and smoke-test with REST
Run these in your terminal / editor
# Create a key at https://aistudio.google.com/apikey, then:
export GEMINI_API_KEY="your-key-here"

# One curl proves the key and the endpoint work (swap in a current model id):
curl -s "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"contents":[{"parts":[{"text":"Say hello in five words."}]}]}'

Make your first generateContent call from code

Beginner

Install the official SDK for your language, then send one text prompt and print the reply.

What generateContent is, and the SDKs that wrap it

generateContent is the core endpoint: you send contents (your prompt parts) and get back candidates of generated text. Google ships official SDKs for Python (google-genai), JavaScript/TypeScript (@google/genai), and Go (google.golang.org/genai), plus plain REST for anything else. The SDKs read the key from GEMINI_API_KEY, handle retries, and expose the same surface as REST. This curriculum’s backend lane is Python and Go, so start there.

Install the SDK and send one prompt
Run these in your terminal / editor
# Python
pip install google-genai

# or JavaScript
npm install @google/genai
# main.py — reads GEMINI_API_KEY from the environment
from google import genai

client = genai.Client()
resp = client.models.generate_content(
    model="gemini-2.5-flash",  # check the docs for the current id
    contents="Explain what an API key is, in two sentences.",
)
print(resp.text)
Agent prompt — paste into an agent with repo access
For Claude Code / Cursor / an agent that can read & edit this repo.
Role: Senior backend engineer in this repo (Python 3.11+ service).
Context: GEMINI_API_KEY is set in the environment. The official SDK is google-genai (import: `from google import genai`). Model id is configurable via env GEMINI_MODEL (default "gemini-2.5-flash"); do not hardcode a model that may be retired.
Task: Add a thin module app/llm.py with a function generate(prompt: str) -> str that calls client.models.generate_content and returns resp.text.
Requirements:
- Construct genai.Client() once at module load; read the model id from os.environ.get("GEMINI_MODEL", "gemini-2.5-flash").
- generate(prompt) raises a clear ValueError if prompt is empty; never logs the API key.
- No network call at import time; the client is lazy or constructed but unused until generate() runs.
Tests / acceptance:
- `python -c "import app.llm"` imports without error and without making a network request.
- With a fake/monkeypatched client, generate("hi") returns the stubbed text; generate("") raises ValueError.
- `ruff check app/llm.py` is clean.
Output: a unified diff plus a one-paragraph note on where the model id is configured.

Stream the response token by token

Beginner

Switch to the streaming variant so text appears as it is generated instead of after a pause.

Why streaming, and how it differs from one shot

generateContent returns the whole answer at once; the streaming variant (generate_content_stream in the SDK, streamGenerateContent over REST) yields chunks as the model produces them. For anything a human reads live — a chat reply, a long summary — streaming cuts perceived latency dramatically because the first words show in well under a second. The trade-off: you handle a sequence of partial chunks and concatenate their .text rather than reading one final string.

Stream chunks as they arrive
Run these in your terminal / editor
from google import genai

client = genai.Client()
stream = client.models.generate_content_stream(
    model="gemini-2.5-flash",
    contents="Write a three-sentence intro to vector databases.",
)
for chunk in stream:
    print(chunk.text, end="", flush=True)
print()
Chat prompt — paste into a chat to get the code
For a plain chat. It returns complete code; you paste it in yourself.
Role: Gemini teacher. The reader has no repo access here — return complete, runnable code.
Context: Python 3.11+, the google-genai SDK installed, GEMINI_API_KEY in the environment.
Task: Show a small async FastAPI endpoint GET /stream?q=... that proxies a Gemini streaming response to the client as Server-Sent Events, keeping the key server-side.
Requirements:
- Use client.models.generate_content_stream and yield each chunk.text as an SSE "data:" line.
- The key never appears in the response or in any client-visible header.
- Model id read from env GEMINI_MODEL with a sensible default.
Tests / acceptance (describe, since no repo):
- curl -N "localhost:8000/stream?q=hello" prints incremental data: lines, then closes.
- Inspecting the network response shows no API key.
Output: the complete FastAPI file, no commentary.

Steer behaviour with a system instruction

Beginner

Pass a system_instruction that sets the model’s role and rules, separate from the user’s prompt.

Why the system instruction is its own channel

A system instruction defines who the model is and how it should behave (“You are a terse SQL tutor; never write DELETE without a WHERE”). Keeping it separate from the user turn means the persona and guardrails persist across the conversation and aren’t something the user can casually overwrite in one message. It’s the cheapest, highest-leverage way to shape tone, format, and refusals before you reach for tools or schemas. In the SDK it goes in the request config, not in contents.

Set a system instruction
Run these in your terminal / editor
from google import genai
from google.genai import types

client = genai.Client()
resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="How do I list running containers?",
    config=types.GenerateContentConfig(
        system_instruction=(
            "You are a concise DevOps assistant. Answer in one shell command "
            "followed by one sentence of explanation. Never invent flags."
        ),
    ),
)
print(resp.text)

Send an image alongside text (multimodal input)

Intermediate

Attach an image to the same prompt and ask the model to describe or reason about it.

Why multimodal is Gemini's signature strength

Gemini is natively multimodal: a single contents array can mix text with images, and (model permitting) audio and video. You don’t run a separate vision model and stitch results — one call reasons over the picture and the question together (“What’s the error in this screenshot?”, “Transcribe this receipt as JSON”). For larger or reused media, upload it with the File API and reference the returned handle instead of inlining bytes on every request. Keep an eye on the per-request size and token limits in the docs.

Inline an image with the prompt
Run these in your terminal / editor
from google import genai
from google.genai import types

client = genai.Client()
with open("screenshot.png", "rb") as f:
    image_bytes = f.read()

resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "What does the error message in this screenshot say, and how do I fix it?",
    ],
)
print(resp.text)
Agent prompt — paste into an agent with repo access
For Claude Code / Cursor / an agent that can read & edit this repo.
Role: Senior backend engineer in this repo (Python 3.11+).
Context: google-genai SDK, GEMINI_API_KEY set. A multimodal-capable model id is in env GEMINI_MODEL. Sample fixtures exist under tests/fixtures/ (a small PNG receipt.png).
Task: Add describe_image(path: str, question: str) -> str in app/vision.py that sends the image bytes plus the question in one generate_content call and returns resp.text.
Requirements:
- Detect the MIME type from the file extension (.png, .jpg/.jpeg); raise ValueError on anything else.
- Read the file as bytes and pass it via types.Part.from_bytes; the question goes in the same contents list.
- Never load the whole image into a log line.
Tests / acceptance:
- With a monkeypatched client returning a fixed string, describe_image("tests/fixtures/receipt.png", "total?") returns that string.
- describe_image("note.txt", "x") raises ValueError (unsupported type).
- `pytest tests/test_vision.py` passes; `ruff check app/vision.py` is clean.
Output: a unified diff plus a one-line note on the MIME types supported.

Force structured JSON with a response schema

Intermediate

Set the response MIME type to JSON and supply a schema so the reply is parseable, not prose.

Why a schema beats 'please reply in JSON'

Asking a model to “respond in JSON” works until it doesn’t — a stray sentence or a trailing comma breaks your parser. Gemini supports constrained decoding: set response_mime_type="application/json" and a response_schema, and the model is constrained to emit JSON matching that shape. You get a reliable contract you can json.loads and validate, which is what makes Gemini safe to put behind a typed API. Pair it with a Pydantic model (Python) or a typed struct (Go) so your code and the schema can’t drift.

Constrain output to a typed schema
Run these in your terminal / editor
from google import genai
from google.genai import types
from pydantic import BaseModel
import json

class Receipt(BaseModel):
    merchant: str
    total_cents: int
    currency: str

client = genai.Client()
resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Merchant: Cafe Luna. Total: $12.50 USD. Extract the fields.",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=Receipt,
    ),
)
data = Receipt.model_validate_json(resp.text)  # parses & validates
print(data.total_cents)  # 1250
Agent prompt — paste into an agent with repo access
For Claude Code / Cursor / an agent that can read & edit this repo.
Role: Senior backend engineer in this repo (Python 3.11+).
Context: google-genai SDK, GEMINI_API_KEY set, model id in env GEMINI_MODEL. Pydantic v2 is available.
Task: Add extract_receipt(text: str) -> Receipt in app/extract.py using a Pydantic Receipt(merchant: str, total_cents: int, currency: str) as the response_schema with response_mime_type="application/json".
Requirements:
- Money is integer cents (total_cents), never a float; currency is a 3-letter code.
- Parse the reply with Receipt.model_validate_json; on a validation error, raise a domain error ExtractionError, do not return raw text.
- Do not post-process or regex the model output to "fix" JSON — rely on the schema.
Tests / acceptance:
- With a monkeypatched client returning '{"merchant":"Cafe Luna","total_cents":1250,"currency":"USD"}', extract_receipt(...) returns a Receipt with total_cents == 1250.
- A monkeypatched reply of malformed JSON raises ExtractionError.
- `pytest tests/test_extract.py` passes; `ruff check app/extract.py` is clean.
Output: a unified diff plus a one-paragraph note on why constrained decoding beats prompt-only JSON.

Let the model call your functions (tool calling)

Intermediate

Declare your functions as tools so the model can ask to call them, then you run them and return the result.

How the function-calling loop actually works

Function (tool) calling lets the model request an action without executing anything itself. You declare function signatures (name, description, parameters); when the model decides a call is needed it returns a functionCall with arguments instead of text. Your code runs the real function — a weather API, a DB query — and sends the result back as a functionResponse, and the model continues with that grounding. This is the backbone of agents: the model reasons, your code acts, and the loop repeats until there’s a final answer. You stay in control of side effects because you execute the calls.

Declare a tool and run the call it requests
Run these in your terminal / editor
from google import genai
from google.genai import types

def get_weather(city: str) -> dict:
    # your real implementation calls a weather API
    return {"city": city, "temp_c": 21, "sky": "clear"}

client = genai.Client()
resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What's the weather in Istanbul?",
    config=types.GenerateContentConfig(tools=[get_weather]),
)
# The SDK can auto-run the Python function and feed the result back;
# resp.text holds the final natural-language answer.
print(resp.text)
Agent prompt — paste into an agent with repo access
For Claude Code / Cursor / an agent that can read & edit this repo.
Role: Senior backend engineer in this repo (Python 3.11+).
Context: google-genai SDK, GEMINI_API_KEY set, model id in env GEMINI_MODEL. We expose one internal tool, lookup_order(order_id: str) -> dict, that reads from a fake in-memory store in tests.
Task: Wire lookup_order as a Gemini tool in app/agent.py with answer(question: str) -> str, so a question like "is order A123 shipped?" triggers the function call and returns a grounded answer.
Requirements:
- Register the function via config=types.GenerateContentConfig(tools=[lookup_order]).
- lookup_order has a typed signature and a docstring the model can read; unknown ids return {"error": "not found"} (no exception).
- Side effects stay in your code: the model only requests the call; your function executes it.
Tests / acceptance:
- With a monkeypatched client that emulates one functionCall for order "A123" then a final text answer, answer("is order A123 shipped?") includes the store's status string.
- lookup_order("ZZZ") returns {"error": "not found"}.
- `pytest tests/test_agent.py` passes; `ruff check app/agent.py` is clean.
Output: a unified diff plus a one-paragraph description of the request/response loop.

Reason over a very long document (long context)

Advanced

Feed a whole document into one prompt and ask questions across all of it, instead of chunking by hand.

When long context replaces retrieval — and when it doesn't

Gemini models accept very large context windows (hundreds of thousands to over a million tokens, depending on the model — check the docs for the current limit). That means you can drop an entire contract, codebase, or transcript into one call and ask cross-cutting questions without building a retrieval pipeline first. It’s the simplest path when the corpus fits and is read occasionally. The trade-off: every token in the window is paid for on every call and adds latency, so for a large, frequently queried corpus you switch to retrieval (RAG) — embed once, fetch only the relevant chunks. Long context and RAG are complementary, not rivals: use the window for “read this whole thing now”, use RAG for “search this big thing repeatedly”.

Ask across a whole file in one call
Run these in your terminal / editor
from google import genai

client = genai.Client()
with open("contract.txt", "r", encoding="utf-8") as f:
    document = f.read()

resp = client.models.generate_content(
    model="gemini-2.5-pro",  # use a model whose context limit covers your doc
    contents=[
        "Answer only from the document below. List every payment deadline and its clause number.",
        document,
    ],
)
print(resp.text)
Agent prompt — paste into an agent with repo access
For Claude Code / Cursor / an agent that can read & edit this repo.
Role: Senior backend engineer in this repo (Python 3.11+).
Context: google-genai SDK, GEMINI_API_KEY set, a long-context model id in env GEMINI_MODEL. Test fixtures hold a multi-thousand-token text file under tests/fixtures/contract.txt.
Task: Add answer_over_document(doc: str, question: str) -> str in app/longctx.py that puts the whole document in one prompt and instructs the model to answer only from it.
Requirements:
- Prepend a grounding instruction: answer only from the supplied document; if the answer is absent, reply exactly "Not found in the document."
- Do NOT chunk or embed here — this is the long-context path; the document goes in verbatim.
- If the document is empty, raise ValueError before calling the API.
Tests / acceptance:
- With a monkeypatched client echoing a canned answer, answer_over_document(open(fixture).read(), "deadline?") returns that answer.
- answer_over_document("", "x") raises ValueError without a network call.
- `pytest tests/test_longctx.py` passes; `ruff check app/longctx.py` is clean.
Output: a unified diff plus a one-paragraph note on when to switch this to RAG.

Embed text and build RAG with a vector store

Advanced

Turn text into embedding vectors with the embeddings model, store them, and retrieve the nearest chunks to ground a generation.

Why embeddings need a partner — and how RAG fits together

An embedding is a fixed-length vector that captures meaning; similar text lands at nearby points. Gemini’s text-embedding model (gemini-embedding-001 at time of writing — confirm the current id and its dimension in the docs) gives you those vectors, but the API does not store or search them. You pair it with a vector store: this curriculum uses Postgres + pgvector, so the same database that holds your rows holds the embeddings and does the nearest-neighbour search. Retrieval-Augmented Generation is the loop: embed your documents once, embed the user’s question at query time, fetch the top-k closest chunks, and pass them as context to generateContent. The model answers from your data, with citations, instead of guessing. The column dimension must exactly match the embedding model’s output dimension — see the PostgreSQL track for the pgvector half.

Embed text, then ground a generation on the matches
Run these in your terminal / editor
from google import genai

client = genai.Client()

# 1. Embed a query (do the same for your documents at index time).
emb = client.models.embed_content(
    model="gemini-embedding-001",  # check the docs for the current id + dimension
    contents="How do refunds work?",
)
query_vector = emb.embeddings[0].values  # store/search these in pgvector

# 2. Your vector store returns the top-k chunks for query_vector (see the Postgres track).
top_chunks = vector_store_search(query_vector, k=4)  # your code

# 3. Ground the answer on the retrieved chunks only.
resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Answer the question using ONLY the context below. Cite the chunk ids you used.",
        "Context:\n" + "\n---\n".join(top_chunks),
        "Question: How do refunds work?",
    ],
)
print(resp.text)
Agent prompt — paste into an agent with repo access
For Claude Code / Cursor / an agent that can read & edit this repo.
Role: Senior backend engineer in this repo (Python 3.11+).
Context: google-genai SDK, GEMINI_API_KEY set. Embedding model id in env EMBED_MODEL (default "gemini-embedding-001"); generation model in env GEMINI_MODEL. A VectorStore protocol exists with search(vector: list[float], k: int) -> list[str]; a fake is injected in tests. Do NOT implement the store here — see the Postgres/pgvector track.
Task: Implement answer_with_rag(question: str, store: VectorStore, k: int = 4) -> str in app/rag.py: embed the question, retrieve top-k chunks, and generate a grounded answer that cites the chunks.
Requirements:
- Embed the question with client.models.embed_content; pass the resulting vector to store.search.
- The generation prompt must instruct the model to answer ONLY from the retrieved context and to say so if the answer is absent.
- The embedding dimension is whatever the model returns; do not hardcode 768 vs 1536 — read it from the response and assert the store accepts that length.
Tests / acceptance:
- With a monkeypatched embed_content (returns a fixed vector) and a fake store returning two known chunks, answer_with_rag("refunds?", fake_store) calls store.search exactly once with k=4 and returns the canned grounded answer.
- If the store returns no chunks, the function still returns a "no relevant context found" style answer without crashing.
- `pytest tests/test_rag.py` passes; `ruff check app/rag.py` is clean.
Output: a unified diff plus a one-paragraph note on where embeddings stop and the vector store begins.

Configure safety settings deliberately

Advanced

Set the safety thresholds explicitly so harmful-content filtering matches your product, and handle a blocked response.

Why you read the safety metadata, not just resp.text

Gemini applies configurable safety filters across harm categories (harassment, hate speech, sexually explicit, dangerous content). You can tune the blocking threshold per category, but the important discipline is handling the outcome: a response can come back without usable text because it was blocked, and a prompt can be blocked before generation. Read the response’s prompt_feedback and the candidate’s finish_reason/safety_ratings rather than assuming resp.text is always present — that’s the difference between a robust service and one that throws NoneType in production. Loosening thresholds is a product decision you make consciously, with the categories named in code.

Set thresholds and check for a block
Run these in your terminal / editor
from google import genai
from google.genai import types

client = genai.Client()
resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarise the support ticket below.",
    config=types.GenerateContentConfig(
        safety_settings=[
            types.SafetySetting(
                category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
                threshold=types.HarmBlockThreshold.BLOCK_ONLY_HIGH,
            ),
        ],
    ),
)

# Don't assume text is present — it may have been blocked.
if not resp.candidates or resp.candidates[0].finish_reason.name != "STOP":
    print("blocked or incomplete:", resp.prompt_feedback)
else:
    print(resp.text)
Agent prompt — paste into an agent with repo access
For Claude Code / Cursor / an agent that can read & edit this repo.
Role: Senior backend engineer in this repo (Python 3.11+).
Context: google-genai SDK, GEMINI_API_KEY set, model id in env GEMINI_MODEL.
Task: Add safe_generate(prompt: str) -> str in app/safety.py that sets explicit safety_settings and never raises AttributeError on a blocked response.
Requirements:
- Configure at least one explicit SafetySetting (category + threshold) via GenerateContentConfig.
- If the response has no candidates, or the first candidate's finish_reason is not "STOP", return a constant SAFE_FALLBACK string and log the prompt_feedback (not the prompt text or the key).
- Otherwise return resp.text. Never assume resp.text is non-None.
Tests / acceptance:
- With a monkeypatched client returning a candidate whose finish_reason.name == "SAFETY", safe_generate(...) returns SAFE_FALLBACK and does not raise.
- With a normal STOP candidate, safe_generate(...) returns the text.
- `pytest tests/test_safety.py` passes; `ruff check app/safety.py` is clean.
Output: a unified diff plus a one-paragraph note on which fields signal a block.

Make calls cheap and resilient in production

Advanced

Cache and trim what you send, set timeouts and retries, and pick the right model size per call.

Where the cost and the failures actually come from

Two production realities dominate: tokens cost money on every call, and the network fails. Control cost by choosing the smallest model that passes your evals (a flash model for routine work, a pro model only where reasoning depth earns it), trimming context, and using context caching for large, reused prefixes so you don’t resend the same document every turn. Control failures by setting explicit timeouts, retrying transient errors (HTTP 429/5xx) with backoff and jitter, and respecting rate limits — the SDKs surface these. Wrap the model behind your own interface so swapping the id or the provider later is a one-line change, and put an eval suite in front of any prompt change so “cheaper” never quietly means “worse”.

Timeouts, retry on transient errors, model choice
Run these in your terminal / editor
import time
from google import genai
from google.genai import errors

client = genai.Client(http_options={"timeout": 30_000})  # ms

def generate_with_retry(prompt: str, model: str, attempts: int = 3) -> str:
    for i in range(attempts):
        try:
            return client.models.generate_content(model=model, contents=prompt).text
        except errors.APIError as e:
            if e.code in (429, 500, 503) and i < attempts - 1:
                time.sleep(2 ** i)  # exponential backoff
                continue
            raise
Agent prompt — paste into an agent with repo access
For Claude Code / Cursor / an agent that can read & edit this repo.
Role: Senior backend / reliability engineer in this repo (Python 3.11+).
Context: google-genai SDK, GEMINI_API_KEY set. Calls go through app/llm.py from earlier steps. Model id in env GEMINI_MODEL.
Task: Harden the Gemini client with a configurable timeout and a retry wrapper generate_with_retry(prompt, model, attempts=3) in app/llm.py.
Requirements:
- Construct the client with an explicit request timeout.
- Retry only on transient codes (429, 500, 503) with exponential backoff; re-raise everything else immediately and after the final attempt.
- Do not retry on 400/401/403 (bad request / auth) — those won't fix themselves.
- The wrapper is pure with respect to the SDK: tests inject a fake client.
Tests / acceptance:
- A fake client that raises a 503 twice then succeeds: generate_with_retry returns the success text after 3 calls.
- A fake client raising 400 once: generate_with_retry raises immediately (one call, no retry).
- `pytest tests/test_llm.py` passes; `ruff check app/llm.py` is clean.
Output: a unified diff plus a short table of which status codes retry vs fail fast.

Where to take it next

  • Build the AI service this track points at in Helix Assistant, where Gemini drives a multimodal, tool-using assistant grounded on your own data via RAG.
  • Host these calls behind a real backend in the Python track — Gemini is the model, Python is the service that orchestrates prompts, tools, and retrieval.
  • Store the embeddings and run nearest-neighbour search with pgvector in the PostgreSQL track — the vector-store half of RAG.