MediaPipe Master Guide · 4E Virtual Design

Model Reference

all channels · every model

Pose Landmarker

Tracks 33 body landmarks in 3D from a single video frame. Each landmark returns normalized image coordinates (x, y), depth (z), and a confidence score. A second "world landmarks" output gives real-world metric coordinates in metres centred on the hips — ideal for animation rigs, biomechanics, and VR retargeting.

The MediaPipe pose topology is BlazePose's 33-point skeleton. It extends the standard COCO 17 keypoints with extra hand/foot points and a richer face anchor set, making it well-suited to mocap retargeting.
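
Code recipe · reducing the 33-point skeleton to COCO 17

A minimal sketch of how the BlazePose skeleton relates to the COCO keypoint set mentioned above. The table lists, in COCO order, the BlazePose index that carries the same joint. The names BLAZEPOSE_TO_COCO17 and toCoco17 are illustrative and not part of MediaPipe; the input is assumed to be one person's landmark array such as r.landmarks[0].

```javascript
// BlazePose index for each COCO keypoint, listed in COCO order.
const BLAZEPOSE_TO_COCO17 = [
  0,      // nose
  2, 5,   // left eye, right eye
  7, 8,   // left ear, right ear
  11, 12, // left shoulder, right shoulder
  13, 14, // left elbow, right elbow
  15, 16, // left wrist, right wrist
  23, 24, // left hip, right hip
  25, 26, // left knee, right knee
  27, 28  // left ankle, right ankle
];

// poseLandmarks: array of 33 landmark objects for one person.
// Returns 17 landmarks in COCO order, or null if the input is incomplete.
function toCoco17(poseLandmarks) {
  if (!poseLandmarks || poseLandmarks.length < 33) return null;
  return BLAZEPOSE_TO_COCO17.map((i) => poseLandmarks[i]);
}
```

This is handy when a downstream tool expects COCO-ordered keypoints; the extra hand, foot and face points are simply dropped.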

Stat	Value	Detail
Landmarks	33	per detected person
Channels each	5	x, y, z, visibility, presence
Image-space	165	33 × 5 channels
World coords	99	33 × 3, metric
Segmentation	opt.	person mask H×W
Models	3	lite / full / heavy

x · normalized

Horizontal position 0.0 (left edge) → 1.0 (right edge) of the input frame.

y · normalized

Vertical position 0.0 (top) → 1.0 (bottom). Y axis points down, like screen coords.

z · depth

Depth roughly normalized to torso width. Negative = closer to camera, positive = further away.

visibility

0 → 1 confidence the joint is visible (not occluded) and lies in frame.

presence

0 → 1 confidence the joint actually exists in the image (vs. predicted by the model).
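
Code recipe · gating landmarks on visibility

A minimal sketch of how the visibility channel might be used to mask unstable joints before they reach a rig. The helper name stableLandmarks and the 0.5 cut-off are illustrative assumptions, not part of MediaPipe; the code only assumes each landmark is a plain object with an optional visibility field.

```javascript
// Keep only landmarks whose visibility meets the threshold.
// The original index is preserved so joints can still be identified.
function stableLandmarks(landmarks, minVisibility = 0.5) {
  return landmarks
    .map((lm, index) => ({ index, ...lm }))
    .filter((lm) => (lm.visibility ?? 0) >= minVisibility);
}
```

Landmarks without a visibility value are treated as not visible, which is the conservative choice for retargeting.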

All 33 Pose Landmarks

#	Name	Region	Range	Notes

Lite model

Smallest, fastest. Good for mobile and other GPU-constrained devices. Lower accuracy on extreme poses.

Full model

The default. Solid balance of accuracy and speed for most desktop / laptop scenarios.

Heavy model

Best landmark precision, especially for fast motion and edge poses. Higher latency.

Configuration options

Option	Type	Default	What it does
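runningMode	enum	'IMAGE'	'IMAGE' / 'VIDEO' / 'LIVE_STREAM'. VIDEO uses cross-frame tracking; choose it for webcam input in the browser. LIVE_STREAM is offered by the Python, Android and iOS APIs; the Web API accepts only 'IMAGE' and 'VIDEO'.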
numPoses	number	1	Maximum people to detect. Up to ~5 supported; cost scales linearly.
minPoseDetectionConfidence	number	0.5	Threshold for the detector stage. Raise to suppress false positives in busy scenes.
minPosePresenceConfidence	number	0.5	Threshold for the landmark presence head — how confident the model is the person is there.
minTrackingConfidence	number	0.5	Threshold for tracking continuity between frames. Lower = stickier (fewer re-detections).
outputSegmentationMasks	boolean	false	If true, result includes a per-pixel person mask. See note below.
baseOptions.delegate	enum	'CPU'	'GPU' (WebGL/WebGPU) or 'CPU' (WASM). The Web API falls back to CPU when no delegate is given, so request 'GPU' explicitly; it is typically 3–10× faster on supported hardware.
baseOptions.modelAssetPath	string	—	URL or local path to the .task model file (lite / full / heavy).

Segmentation mask output

Set outputSegmentationMasks: true to additionally receive result.segmentationMasks[0] — a single-channel mask the same resolution as the input. Each pixel is the model's confidence (0.0–1.0) that the pixel belongs to the person. Useful for AR compositing, virtual backgrounds, and driving alpha mattes for Blender/UE compositing without a separate segmenter.
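
Code recipe · turning the confidence mask into an alpha matte

A rough sketch of the compositing step described above. It assumes the mask has already been read into a Float32Array of length width × height (in Tasks Vision, for example, through the mask's getAsFloat32Array() method). The function name maskToAlphaMatte is hypothetical; the output layout matches what an ImageData object expects.

```javascript
// Convert per-pixel person confidence (0.0 to 1.0) into RGBA bytes:
// white colour channels, with alpha carrying the confidence.
function maskToAlphaMatte(confidence, width, height) {
  if (confidence.length !== width * height) {
    throw new RangeError('mask size does not match width * height');
  }
  const rgba = new Uint8ClampedArray(width * height * 4);
  for (let i = 0; i < confidence.length; i++) {
    rgba[i * 4] = 255;
    rgba[i * 4 + 1] = 255;
    rgba[i * 4 + 2] = 255;
    rgba[i * 4 + 3] = Math.round(confidence[i] * 255);
  }
  return rgba;
}
```

In a browser the result could be wrapped with new ImageData(rgba, width, height) and drawn to a canvas as the matte layer.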

Code recipe · minimal pose landmarker init + per-frame call

import { PoseLandmarker, FilesetResolver }
  from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.21/vision_bundle.mjs';

const vision = await FilesetResolver.forVisionTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.21/wasm'
);

const pose = await PoseLandmarker.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/pose_landmarker/pose_landmarker_full/float16/latest/pose_landmarker_full.task',
    delegate: 'GPU'
  },
  runningMode: 'VIDEO',
  numPoses: 1,
  minPoseDetectionConfidence: 0.5,
  minTrackingConfidence: 0.5,
  outputSegmentationMasks: false
});

// per RAF tick:
const r = pose.detectForVideo(videoEl, performance.now());
// r.landmarks[0]      → 33 image-space landmarks  (x, y, z, visibility, presence)
// r.worldLandmarks[0] → 33 metric coords centred on hips
// r.segmentationMasks[0] → per-pixel person mask (if enabled)
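
Code recipe · joint angle from world landmarks

A small sketch of the kind of biomechanics measurement the metric world landmarks make possible: the angle at the left elbow, computed from the shoulder, elbow and wrist positions. The indices 11, 13 and 15 are the BlazePose left shoulder, left elbow and left wrist; the function names are illustrative, and the input is assumed to be one person's r.worldLandmarks[0].

```javascript
const LEFT_SHOULDER = 11;
const LEFT_ELBOW = 13;
const LEFT_WRIST = 15;

// Angle in degrees at point b, formed by the segments b->a and b->c.
function jointAngleDeg(a, b, c) {
  const u = [a.x - b.x, a.y - b.y, a.z - b.z];
  const v = [c.x - b.x, c.y - b.y, c.z - b.z];
  const lenU = Math.hypot(...u);
  const lenV = Math.hypot(...v);
  if (lenU === 0 || lenV === 0) return NaN; // degenerate: coincident points
  const dot = u[0] * v[0] + u[1] * v[1] + u[2] * v[2];
  const cos = Math.min(1, Math.max(-1, dot / (lenU * lenV)));
  return (Math.acos(cos) * 180) / Math.PI;
}

function leftElbowAngleDeg(worldLandmarks) {
  return jointAngleDeg(
    worldLandmarks[LEFT_SHOULDER],
    worldLandmarks[LEFT_ELBOW],
    worldLandmarks[LEFT_WRIST]
  );
}
```

A fully extended arm reads close to 180 degrees and a tight curl approaches 0. World landmarks are used rather than image-space ones so the angle does not change with camera framing.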

Face Landmarker

The most channel-rich model in the suite. Outputs a 478-point face mesh (468 face + 10 iris), plus 52 ARKit-compatible blendshape weights, plus a 4×4 facial transformation matrix in metric space. Together, this is everything you need to drive a MetaHuman, ARKit avatar, or custom rig.

The 52 blendshapes match Apple's ARKit specification — so weights from MediaPipe drop straight into MetaHuman ARKit-mapped facial poses, Live Link Face, or any rig authored against the standard.

Stat	Value	Detail
Mesh points	468	canonical mesh
+ Iris	10	5 per eye
Total mesh	478	x, y, z each
Blendshapes	52	ARKit-compatible
Transform	4×4	head pose matrix
Total channels	~1,486	per detected face

Face mesh (468)

Dense topology covering the entire face surface. Indexed identically to TFLite Face Mesh — community UV maps & rigs are interchangeable.

Iris (10)

Indices 468–472 = subject's right iris (centre + 4 perimeter), 473–477 = subject's left iris. Use the centre point for gaze direction; the perimeter points give the apparent iris diameter, which is useful as a scale reference because real iris size varies little between people.

Blendshapes (52)

Each value 0.0–1.0. Directly weight ARKit/MetaHuman pose targets. Includes neutral, brows, eyes, jaw, mouth, cheeks, nose.

Transformation matrix

4×4 homogeneous matrix mapping the canonical mesh into camera space (metric units). Drive head rotation, position, and scale from this.
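
Code recipe · reading position, scale and yaw from the 4×4 matrix

A hedged sketch of how head position and scale might be pulled out of the flat 16-value matrix. It assumes column-major storage, as noted in the face landmarker recipe in this guide, so element (row, col) sits at index col × 4 + row. The yaw value is only a simple heading estimate taken from the rotated Z axis; its sign and zero point depend on the coordinate frame of your rig, so treat it as a starting point. The name decomposeHeadMatrix is hypothetical.

```javascript
// m: array-like of 16 numbers, column-major 4x4.
function decomposeHeadMatrix(m) {
  if (m.length !== 16) throw new RangeError('expected 16 matrix values');
  const at = (row, col) => m[col * 4 + row];
  return {
    // Last column holds the translation.
    translation: { x: at(0, 3), y: at(1, 3), z: at(2, 3) },
    // Length of the first basis vector; equals the scale when scaling is uniform.
    scale: Math.hypot(at(0, 0), at(1, 0), at(2, 0)),
    // Heading of the rotated Z axis in the XZ plane, in degrees.
    yawDeg: (Math.atan2(at(0, 2), at(2, 2)) * 180) / Math.PI
  };
}
```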

52 Face Blendshapes (the killer feature for facial mocap)

#	Name	Region	Range	Maps to (typical)

Face Mesh Region Index Map

Region	Indices	Count	Use

Configuration options

Option	Type	Default	What it does
runningMode	enum	'IMAGE'	'IMAGE' / 'VIDEO' / 'LIVE_STREAM'.
numFaces	number	1	Maximum faces to track. Each face costs an extra inference pass.
minFaceDetectionConfidence	number	0.5	Detector threshold.
minFacePresenceConfidence	number	0.5	Landmark presence threshold.
minTrackingConfidence	number	0.5	Cross-frame tracking threshold.
outputFaceBlendshapes	boolean	false	Critical: set true to receive the 52 ARKit blendshape weights. Disabled by default to save compute.
outputFacialTransformationMatrixes	boolean	false	Set true to receive the 4×4 head-pose matrix per face.
baseOptions.delegate	enum	'CPU'	'GPU' or 'CPU'. Set 'GPU' explicitly to enable acceleration.

Code recipe · face landmarker with blendshapes + transform matrix

const face = await FaceLandmarker.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/latest/face_landmarker.task',
    delegate: 'GPU'
  },
  runningMode: 'VIDEO',
  numFaces: 1,
  outputFaceBlendshapes: true,           // 52 ARKit weights — drives MetaHuman / Live Link Face
  outputFacialTransformationMatrixes: true // 4×4 head pose
});

const r = face.detectForVideo(videoEl, performance.now());
// r.faceLandmarks[0]                    → 478 mesh points (x, y, z)
// r.faceBlendshapes[0].categories       → 52 weighted shapes [{ categoryName, score }]
// r.facialTransformationMatrixes[0].data → Float32Array(16) — column-major 4×4
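
Code recipe · blendshape categories as a name → weight lookup

Rigs usually address blendshapes by name (jawOpen, browInnerUp and so on), while the result arrives as an array of { categoryName, score } entries. The sketch below flattens that array into a plain object. The helper name blendshapeMap is an assumption for illustration; the input shape follows the comment in the recipe above.

```javascript
// categories: [{ categoryName: 'jawOpen', score: 0.7 }, ...]
// Returns { jawOpen: 0.7, ... }; an empty object when no face was found.
function blendshapeMap(categories) {
  const weights = {};
  for (const { categoryName, score } of categories ?? []) {
    weights[categoryName] = score;
  }
  return weights;
}
```

Typical use per frame would be something like blendshapeMap(r.faceBlendshapes[0]?.categories).jawOpen, falling back to a neutral value when the key is missing.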

Hand Landmarker

Tracks up to two hands independently, each with 21 landmarks. Returns image-space coordinates, world coordinates (metric), and a handedness label (Left / Right) with a confidence score. The skeleton topology is symmetric and great for sign language, AR interaction, and instrument tracking.

Stat	Value	Detail
Hands	2	max simultaneous
Per-hand	21	landmarks
Channels each	3	x, y, z
+ Handedness	2	L/R + score
World coords	63	metric per hand
Total/hand	128	incl. handedness

21 Hand Landmarks (per hand)

#	Name	Region	Bone	Notes

CMC

Carpometacarpal

Base of the thumb where it meets the wrist — the "ball" of the joint.

MCP

Metacarpophalangeal

Knuckle joint where the finger meets the palm.

PIP

Proximal interphalangeal

The middle joint of each finger, the one that bends most when curling.

DIP

Distal interphalangeal

The joint nearest the fingertip.

TIP

Fingertip

The very end of the digit.
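
Code recipe · pinch detection from TIP landmarks

A minimal sketch showing how the joint names above translate into indices for a simple interaction test. It measures the distance between the thumb TIP (index 4) and index finger TIP (index 8), normalised by the distance from the wrist (0) to the middle finger MCP (9) so the result does not depend on how far the hand is from the camera. The 0.25 threshold and the function names are assumptions chosen for illustration.

```javascript
const WRIST = 0;
const THUMB_TIP = 4;
const INDEX_TIP = 8;
const MIDDLE_MCP = 9;

function distance3d(a, b) {
  return Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);
}

// Thumb-to-index distance divided by palm length.
function pinchRatio(hand) {
  const palm = distance3d(hand[WRIST], hand[MIDDLE_MCP]);
  if (palm === 0) return Infinity; // degenerate hand, never a pinch
  return distance3d(hand[THUMB_TIP], hand[INDEX_TIP]) / palm;
}

function isPinching(hand, threshold = 0.25) {
  return pinchRatio(hand) < threshold;
}
```

Either image-space or world landmarks can be passed in, since the ratio is scale-free; world landmarks are less sensitive to perspective.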

⚠

The mirror handedness gotcha

The model labels handedness from the camera's point of view. When you mirror the preview (so it feels like a selfie), your physical right hand appears on the right side of the mirrored image — but the model still labels it "Left" because that's where it sees it on the unmirrored input. Fix: either don't mirror the input you feed the model, or invert handedness.categoryName in your downstream code when mirror is on. The Live Lab toggles a banner to remind you.

Configuration options

Option	Type	Default	What it does
runningMode	enum	'IMAGE'	'IMAGE' / 'VIDEO' / 'LIVE_STREAM'.
numHands	number	1	Maximum hands. Set to `2` for both. Each hand is its own inference pass.
minHandDetectionConfidence	number	0.5	Detector threshold.
minHandPresenceConfidence	number	0.5	Landmark presence threshold.
minTrackingConfidence	number	0.5	Cross-frame tracking continuity. Lower = stickier track.
baseOptions.delegate	enum	'CPU'	'GPU' or 'CPU'. Set 'GPU' explicitly to enable acceleration.

Code recipe · two-handed tracking + handedness

const hands = await HandLandmarker.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/hand_landmarker/hand_landmarker/float16/latest/hand_landmarker.task',
    delegate: 'GPU'
  },
  runningMode: 'VIDEO',
  numHands: 2
});

const r = hands.detectForVideo(videoEl, performance.now());
// r.landmarks      → [hand1Landmarks, hand2Landmarks] each: 21 × {x,y,z}
// r.worldLandmarks → metric coords, origin at the hand's approximate geometric centre
// r.handedness     → [[{ categoryName: 'Left'|'Right', score }], ...]

// Mirror-aware handedness:
function trueHand(label, mirrored) {
  if (!mirrored) return label;
  return label === 'Left' ? 'Right' : 'Left';
}

Gesture Recognizer

Builds on the Hand Landmarker and adds a classifier on top — outputting a categorical gesture label per hand alongside the landmarks. The default model ships with 8 classes (7 gestures + None). The classifier is replaceable: train your own custom gesture set with Model Maker and swap it in.

Stat	Value	Detail
Default classes	8	incl. None
+ Hand outputs	21	landmarks
Score per class	0.0 – 1.0	
Custom?	yes	via Model Maker

Default Gesture Classes

#	Class	Symbol	Description
0	None	—	No recognized gesture / below confidence threshold
1	Closed_Fist	✊	All four fingers and thumb closed into a fist
2	Open_Palm	🖐	Hand fully open, fingers extended and spread
3	Pointing_Up	☝	Index finger extended upward, others closed
4	Thumb_Down	👎	Thumb extended downward, fingers curled
5	Thumb_Up	👍	Thumb extended upward, fingers curled
6	Victory	✌	Index + middle extended (peace / victory sign)
7	ILoveYou	🤟	Thumb + index + pinky extended (ASL "I love you")

⚙

Custom gestures

Use MediaPipe Model Maker to retrain the classification head on your own dataset (15–50 samples per gesture is enough to start). The hand landmarker stays fixed; only the lightweight classifier swaps. Output stays in the same shape: category_name + score.

Configuration options

Option	Type	Default	What it does
runningMode	enum	'IMAGE'	'IMAGE' / 'VIDEO' / 'LIVE_STREAM'.
numHands	number	1	Maximum hands. Each hand gets its own gesture classification.
minHandDetectionConfidence	number	0.5	Detector threshold for finding hands.
minHandPresenceConfidence	number	0.5	Landmark presence threshold.
minTrackingConfidence	number	0.5	Cross-frame tracking threshold.
cannedGesturesClassifierOptions	object	{}	`{ scoreThreshold, categoryAllowlist, categoryDenylist }` for the built-in 8 classes.
customGesturesClassifierOptions	object	{}	Same shape as canned, but applied to your custom Model-Maker-trained classifier.
baseOptions.delegate	enum	'CPU'	'GPU' or 'CPU'. Set 'GPU' explicitly to enable acceleration.

Code recipe · gesture recognition with handedness

const gestures = await GestureRecognizer.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/gesture_recognizer/gesture_recognizer/float16/latest/gesture_recognizer.task',
    delegate: 'GPU'
  },
  runningMode: 'VIDEO',
  numHands: 2
});

const r = gestures.recognizeForVideo(videoEl, performance.now());
// r.gestures     → [[{ categoryName: 'Thumb_Up', score: 0.93 }], ...]  (top-1 per hand)
// r.handedness   → [[{ categoryName: 'Left'|'Right', score }], ...]
// r.landmarks    → 21 image-space landmarks per hand
// r.worldLandmarks → metric coords per hand
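
Code recipe · debouncing the gesture label

Per-frame labels tend to flicker around transitions, so a common pattern is to accept a gesture only after it has been the top result for several consecutive frames. The sketch below is one simple way to do that; createGestureDebouncer, the 5-frame window and the 0.6 score floor are illustrative assumptions, not MediaPipe features.

```javascript
// Returns an update function that takes the top-1 gesture for one hand
// ({ categoryName, score } or undefined) and returns the stable label.
function createGestureDebouncer(minFrames = 5, minScore = 0.6) {
  let candidate = 'None';
  let streak = 0;
  let stable = 'None';
  return function update(topGesture) {
    const name =
      topGesture && topGesture.score >= minScore ? topGesture.categoryName : 'None';
    if (name === candidate) {
      streak += 1;
    } else {
      candidate = name;
      streak = 1;
    }
    if (streak >= minFrames) stable = candidate;
    return stable;
  };
}
```

Per frame, one would call something like update(r.gestures[0]?.[0]) and keep one debouncer per tracked hand.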

Holistic Landmarker

The full-body monster: combines pose + face + both hands in a single inference graph, with shared cropping/tracking between sub-models for efficiency. This is the model that powers most full-body avatar mocap pipelines.

Tasks API status: a unified HolisticLandmarker is now part of MediaPipe Tasks Vision (Web/Python). On platforms where it isn't yet shipped, you can compose three landmarkers in parallel each frame — pose, face and hand — and merge the results. The Live Lab's Holistic mode demonstrates the composed approach, which is identical in channel structure to the legacy Solutions API output.

Stat	Value	Detail
Pose	33	landmarks
Face	468	mesh points
Left hand	21	landmarks
Right hand	21	landmarks
Total points	543	per frame
+ Segmentation	opt.	person mask

→

Sequential cropping

The pose model runs first; its wrist/face anchors are used to crop ROIs that get passed to the face mesh and hand landmarkers — far cheaper than running each in isolation.

⨯

Synced output

All three sub-models return for the same frame, so timestamps line up perfectly. Critical for clean retargeting.

⚠

Trade-offs

Hand fidelity inside Holistic is lower than running the standalone Hand Landmarker on a tight crop. For close-up hand work, prefer Hand Landmarker.

✓

Best for

Full-body avatar mocap (Blender/UE rigs), live performance capture, dance / yoga / fitness apps where you need everything at once.

Configuration options (composed approach)

Option	Type	Default	What it does
poseLandmarker.*	object	—	All Pose Landmarker options apply (numPoses, confidences, segmentation, model variant, delegate).
faceLandmarker.*	object	—	All Face Landmarker options apply. Set `outputFaceBlendshapes` & `outputFacialTransformationMatrixes` to true for full mocap.
handLandmarker.*	object	—	All Hand Landmarker options apply. `numHands: 2` recommended.
timestamp	number	—	Pass the same `performance.now()` to all three for guaranteed alignment.

Code recipe · composed full-body capture (pose + face + hands in parallel)

// Init three landmarkers once
const [pose, face, hands] = await Promise.all([
  PoseLandmarker.createFromOptions(vision, { runningMode: 'VIDEO', numPoses: 1, baseOptions: poseModel }),
  FaceLandmarker.createFromOptions(vision, { runningMode: 'VIDEO', numFaces: 1,
    outputFaceBlendshapes: true, outputFacialTransformationMatrixes: true, baseOptions: faceModel }),
  HandLandmarker.createFromOptions(vision, { runningMode: 'VIDEO', numHands: 2, baseOptions: handModel })
]);

// per RAF tick, run all three with the same timestamp
const t = performance.now();
const [poseR, faceR, handsR] = [
  pose.detectForVideo(videoEl, t),
  face.detectForVideo(videoEl, t),
  hands.detectForVideo(videoEl, t)
];

// Merge into a single holistic payload
const holistic = {
  timestamp: t,
  pose:        poseR.landmarks?.[0]      ?? null,
  poseWorld:   poseR.worldLandmarks?.[0] ?? null,
  face:        faceR.faceLandmarks?.[0]  ?? null,
  blendshapes: faceR.faceBlendshapes?.[0]?.categories ?? null,
  headMatrix:  faceR.facialTransformationMatrixes?.[0]?.data ?? null,
  hands:       handsR.landmarks?.map((lm, i) => ({
    label: handsR.handedness[i][0].categoryName,
    score: handsR.handedness[i][0].score,
    landmarks: lm,
    world: handsR.worldLandmarks[i]
  })) ?? []
};

Code recipe · unified HolisticLandmarker (Tasks Vision)

import { HolisticLandmarker, FilesetResolver }
  from '@mediapipe/tasks-vision';

const holistic = await HolisticLandmarker.createFromOptions(vision, {
  baseOptions: { modelAssetPath: '.../holistic_landmarker.task', delegate: 'GPU' },
  runningMode: 'VIDEO',
  outputFaceBlendshapes: true,
  minFaceDetectionConfidence: 0.5,
  minHandLandmarksConfidence: 0.5,
  minPoseDetectionConfidence: 0.5
});

const r = holistic.detectForVideo(videoEl, performance.now());
// r.poseLandmarks, r.poseWorldLandmarks
// r.faceLandmarks, r.faceBlendshapes
// r.leftHandLandmarks, r.rightHandLandmarks
// r.leftHandWorldLandmarks, r.rightHandWorldLandmarks
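
Code recipe · flattening a unified result into one payload

A hedged sketch that reshapes the unified result into a single object for downstream retargeting. It assumes each result field listed above is an array with one entry per detected person (so index 0 is the first person) and treats missing or empty fields as null. The names holisticPayload and countPoints are hypothetical.

```javascript
// r: a result object with the fields listed in the recipe above.
function holisticPayload(r, timestamp) {
  const first = (arr) => (arr && arr.length ? arr[0] : null);
  return {
    timestamp,
    pose: first(r.poseLandmarks),
    poseWorld: first(r.poseWorldLandmarks),
    face: first(r.faceLandmarks),
    blendshapes: first(r.faceBlendshapes)?.categories ?? null,
    leftHand: first(r.leftHandLandmarks),
    rightHand: first(r.rightHandLandmarks)
  };
}

// Total image-space points present in a payload (pose + face + both hands).
function countPoints(payload) {
  return [payload.pose, payload.face, payload.leftHand, payload.rightHand]
    .reduce((n, part) => n + (part ? part.length : 0), 0);
}
```

Counting points per frame is a cheap sanity check: the total drops whenever a hand or the face leaves the frame.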

Object Detector

Localizes objects in the frame and labels each with a category. Returns a list of detections, each with a bounding box, one or more categories with scores, and an optional keypoint set on supported models. The default models are EfficientDet-Lite variants trained on COCO.

Stat	Value	Detail
COCO classes	80	default model
Bbox channels	4	x, y, w, h
Score	0.0 – 1.0	
Custom	yes	Model Maker

Output channels per detection

Channel	Type	Range	Description
bbox.origin_x	geometry	0 → W	Top-left X in pixels of the input image.
bbox.origin_y	geometry	0 → H	Top-left Y in pixels of the input image.
bbox.width	geometry	px	Width of the bounding box in pixels.
bbox.height	geometry	px	Height of the bounding box in pixels.
categories[].category_name	label	str	e.g. "person", "cup", "laptop". COCO label set by default.
categories[].score	confidence	0 → 1	Per-category confidence, sorted descending.
categories[].index	label	int	Numeric class index in the model's labelmap.
keypoints[]	geometry	opt.	Some specialised models output keypoints with each detection (e.g. face, hand corners).

COCO 80-class label set (used by default models)

Configuration options

Option	Type	Default	What it does
runningMode	enum	'IMAGE'	'IMAGE' / 'VIDEO' / 'LIVE_STREAM'.
maxResults	number	-1	Cap on detections returned. -1 = all that pass threshold.
scoreThreshold	number	0.5	Minimum confidence — anything lower is dropped.
categoryAllowlist	string[]	[]	Whitelist of category names to keep. Empty = all.
categoryDenylist	string[]	[]	Blacklist of category names to drop.
baseOptions.modelAssetPath	string	—	The default models are EfficientDet-Lite0 (faster) and EfficientDet-Lite2 (more accurate).
baseOptions.delegate	enum	'CPU'	'GPU' or 'CPU'. Set 'GPU' explicitly to enable acceleration.

Code recipe · object detector with allowlist

const det = await ObjectDetector.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/object_detector/efficientdet_lite0/float16/latest/efficientdet_lite0.task',
    delegate: 'GPU'
  },
  runningMode: 'VIDEO',
  scoreThreshold: 0.5,
  categoryAllowlist: ['person', 'cup', 'laptop']
});

const r = det.detectForVideo(videoEl, performance.now());
// r.detections → [{ boundingBox: {originX, originY, width, height},
//                  categories: [{ categoryName, score, index }] }, ...]
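
Code recipe · pixel bounding box to normalized coordinates

Bounding boxes arrive in pixels, while landmarks elsewhere in this guide are normalized to 0.0 → 1.0. The sketch below converts a box to the normalized convention and adds its centre point, which is convenient when overlaying boxes on a canvas of a different size. The function name normalizeBox is illustrative; the input shape follows the boundingBox fields in the recipe above.

```javascript
// box: { originX, originY, width, height } in pixels of the input frame.
function normalizeBox(box, frameWidth, frameHeight) {
  return {
    x: box.originX / frameWidth,
    y: box.originY / frameHeight,
    width: box.width / frameWidth,
    height: box.height / frameHeight,
    cx: (box.originX + box.width / 2) / frameWidth,
    cy: (box.originY + box.height / 2) / frameHeight
  };
}
```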

Image Segmenter & Interactive Segmenter

Per-pixel classification of the input image. Output is a segmentation mask (one or more) where each pixel is labelled with a class index or class probability. Use cases: virtual backgrounds, AR clothing, hair styling, portrait lighting, body part isolation.

Selfie segmentation

2 classes: background, person. Lightweight, ideal for video calls / Zoom-style virtual backgrounds.

Multi-class selfie

6 classes: background, hair, body-skin, face-skin, clothes, others.

Hair segmentation

2 classes: background, hair. Higher hair-edge precision than the multi-class model.

DeepLab v3

21 PASCAL VOC classes: people, animals, vehicles, indoor objects.

Interactive Segmenter

Click or tap a point in the image; the model returns a mask of that object. Built on the MagicTouch model.

Multi-class Selfie Segmentation labels

Idx	Class	RGB hint	Use
0	background	0,0,0	Everything not part of the subject. Use for virtual backgrounds.
1	hair	128,0,0	Scalp hair. Drive AR hair colour, virtual styling.
2	body-skin	0,128,0	Skin on neck, arms, hands.
3	face-skin	128,128,0	Skin on the face. Useful for makeup / beautification effects.
4	clothes	0,0,128	Garments. Drive AR try-on or background-replacement edge cases.
5	others	128,0,128	Subject pixels that don't fit the four classes (glasses, hats, etc).

Output channels (Image Segmenter)

Channel	Shape	Type	Description
category_mask	H × W × 1	uint8	Each pixel is the integer class index (0–N).
confidence_masks[k]	H × W × 1	float32	One mask per class — pixel value = probability ∈ [0, 1] of belonging to class k.

Configuration options

Option	Type	Default	What it does
runningMode	enum	'IMAGE'	'IMAGE' / 'VIDEO' / 'LIVE_STREAM'.
outputCategoryMask	boolean	false	If true, returns a single H×W uint8 mask with class indices.
outputConfidenceMasks	boolean	true	If true, returns one float mask per class with probabilities.
displayNamesLocale	string	'en'	Locale for class display names (where available).
baseOptions.modelAssetPath	string	—	Selfie / Multi-class selfie / Hair / DeepLab v3: pick the model for your use case.
baseOptions.delegate	enum	'CPU'	'GPU' or 'CPU'. Set 'GPU' explicitly to enable acceleration.

Code recipe · multi-class selfie segmenter

const seg = await ImageSegmenter.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/image_segmenter/selfie_multiclass_256x256/float32/latest/selfie_multiclass_256x256.tflite',
    delegate: 'GPU'
  },
  runningMode: 'VIDEO',
  outputCategoryMask: true,
  outputConfidenceMasks: false
});

const r = seg.segmentForVideo(videoEl, performance.now());
// r.categoryMask → MPMask  (.getAsUint8Array() → H*W bytes, each a class idx 0–5)
// IMPORTANT: call r.close() when you're done with the masks to free GPU memory
r.close();
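
Code recipe · colouring the category mask

A small sketch that turns the class-index bytes into an RGBA overlay using the RGB hints from the multi-class selfie label table in this guide. It assumes the mask has already been copied out with getAsUint8Array() before the result is closed. The palette constant, the function name and the default alpha of 160 are illustrative choices.

```javascript
// RGB per class index: background, hair, body-skin, face-skin, clothes, others.
const SELFIE_MULTICLASS_PALETTE = [
  [0, 0, 0],
  [128, 0, 0],
  [0, 128, 0],
  [128, 128, 0],
  [0, 0, 128],
  [128, 0, 128]
];

// classIndices: Uint8Array of length H*W. Background stays fully transparent.
function colorizeCategoryMask(classIndices, alpha = 160) {
  const rgba = new Uint8ClampedArray(classIndices.length * 4);
  for (let i = 0; i < classIndices.length; i++) {
    const cls = classIndices[i];
    const [r, g, b] = SELFIE_MULTICLASS_PALETTE[cls] ?? [0, 0, 0];
    rgba[i * 4] = r;
    rgba[i * 4 + 1] = g;
    rgba[i * 4 + 2] = b;
    rgba[i * 4 + 3] = cls === 0 ? 0 : alpha;
  }
  return rgba;
}
```

Unknown class indices fall back to black rather than throwing, so the same helper degrades gracefully if a different model is loaded.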

The rest of the MediaPipe family

vision · audio · text · genai

Face Detector

Lightweight cousin of Face Landmarker (BlazeFace). Returns a bounding box, score, and 6 keypoints: right eye, left eye, nose tip, mouth centre, right ear tragion, left ear tragion. Ideal as a cheap pre-stage before a heavier model.

Face Stylizer

Stylizes the face region into a target style (cartoon, oil painting, sketch). Custom styles trainable via Model Maker.

Image Classifier

Whole-image label classification. EfficientNet-Lite default. Returns top-K categories with scores.

Image Embedder

Returns a 1024-d feature vector per image. Use for similarity search, clustering, k-NN retrieval.

Pose Detector (legacy)

Older single-step pose API from the Solutions package. Same 33-point topology but different runtime.

Audio Classifier

Default model is YamNet — classifies short audio chunks (typically 0.975s windows) into AudioSet's 521-class ontology (speech, music, footsteps, applause, sirens, etc).

Audio Embedder

Returns a feature vector for an audio chunk — use for retrieval / similarity in sound libraries.

Text Classifier

Sentiment / topic / safety classification of arbitrary text. BERT-based.

Text Embedder

Sentence-level embeddings for semantic search and clustering.

Language Detector

Detects the language of a text snippet across 100+ languages with confidence scores.

LLM

LLM Inference

On-device inference for small open LLMs (Gemma, Phi, Falcon variants). Runs locally on web / Android / iOS.

Image Generator

On-device Stable-Diffusion-style generation (Android currently). Text-to-image, optional ControlNet.

Channel summary across the suite

Model	Domain	Output channels	Total numeric / item

How MediaPipe thinks

A non-technical mental model: MediaPipe is a graph of small ML models that pass tensors to each other, all running on-device. Most vision pipelines follow the same recipe — detect, crop, refine, output.

Frame in

A camera frame arrives as a tensor (H × W × 3, RGB, normalized to 0–1). MediaPipe wraps it in an ImageFrame with a timestamp so downstream nodes know which frame each output belongs to.

Detector finds the region of interest

A small fast model (e.g. BlazeFace, BlazePalm, BlazePose Detector) finds where the subject is in the frame and returns a tight crop. This step is cheap, and in VIDEO mode it only needs to run on the first frame and whenever tracking is lost.

Landmark model refines

The crop is fed to the heavier landmark model (Face Mesh, Hand Landmarker, Pose Landmarker). This regresses the precise landmark coordinates inside the crop.

Tracking saves the next frame

Once landmarks are known, MediaPipe predicts the ROI for the next frame from them — skipping the detector entirely while tracking is stable. This is why MediaPipe is so fast: most frames only run the landmark model.

Per-channel post-processing

Some channels are derived: blendshapes are regressed from the face mesh by a small MLP head; world coordinates are computed from image landmarks via a separate sub-network; visibility/presence are sigmoid outputs from auxiliary heads.

Out comes the result object

Each call returns a structured object: arrays of landmarks, blendshape weights, classification categories, segmentation masks. You read the channels you care about and feed them to your rig, your widget, your analytics — whatever happens next.
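
Code recipe · the detect → refine → track loop in miniature

The steps above can be sketched as a tiny control loop. This is a conceptual model only, not MediaPipe's actual graph: detect and refine are stand-in callbacks supplied by the caller, and createRoiTracker and the 0.5 confidence floor are hypothetical. What it illustrates is that the detector only runs when there is no region of interest carried over from the previous frame.

```javascript
// detect(frame) -> roi or null
// refine(frame, roi) -> { landmarks, confidence, nextRoi }
function createRoiTracker(detect, refine, minTrackingConfidence = 0.5) {
  let roi = null;
  return function processFrame(frame) {
    if (!roi) roi = detect(frame); // slow path: no ROI carried over
    if (!roi) return null;         // nothing found in this frame
    const { landmarks, confidence, nextRoi } = refine(frame, roi);
    const tracked = confidence >= minTrackingConfidence;
    roi = tracked ? nextRoi : null; // tracking lost: re-detect next frame
    return tracked ? landmarks : null;
  };
}
```

With stable tracking the detector is called once and every later frame goes straight to the landmark model, which is where most of the speed comes from.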

Reading the values: what's normalized, what isn't

Image-normalized

Most landmark x / y values are 0.0 → 1.0 relative to the image. Multiply by image width / height to get pixels. Convenient because it's resolution-independent.

Pseudo-depth

z values are relative, not metric. Useful for ordering (which finger is closer) but not for measuring real distance.

World coords

The "world" landmark output is in metres, centred on the midpoint of the subject's hips (pose) or on the approximate geometric centre of the hand (hand). Use these for biomechanics and 3D rigs.

Confidence scores

0.0 → 1.0 sigmoid outputs. Visibility is "is this point in frame and not occluded?"; presence is "does this point exist in the image at all?".
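
Code recipe · normalized landmark to canvas pixels

A minimal sketch of the conversion described under "Image-normalized", with an optional horizontal flip for a mirrored preview. The function name toPixels and the mirrored flag are illustrative; the flip assumes the landmarks were computed on the unmirrored frame and are being drawn on a mirrored canvas.

```javascript
// lm: { x, y } normalized to 0.0 .. 1.0; width/height: canvas size in pixels.
function toPixels(lm, width, height, mirrored = false) {
  const nx = mirrored ? 1 - lm.x : lm.x;
  return { x: nx * width, y: lm.y * height };
}
```

The z value is left out on purpose: it is relative pseudo-depth and has no pixel equivalent.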

Minimal Tasks Vision boilerplate

// Load the Tasks Vision bundle (browser, ES modules)
import { FilesetResolver, FaceLandmarker } from "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.21/vision_bundle.mjs";

const vision = await FilesetResolver.forVisionTasks(
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.21/wasm"
);

const faceLandmarker = await FaceLandmarker.createFromOptions(vision, {
  baseOptions: { modelAssetPath: ".../face_landmarker.task", delegate: "GPU" },
  outputFaceBlendshapes: true,
  outputFacialTransformationMatrixes: true,
  numFaces: 1,
  runningMode: "VIDEO"
});

// Each frame:
const result = faceLandmarker.detectForVideo(videoEl, performance.now());
// result.faceLandmarks       → array of landmark arrays (one per face)
// result.faceBlendshapes     → array of {categories: [{categoryName, score}, ...]}
// result.facialTransformationMatrixes → 4×4 matrix per face

Running modes — IMAGE vs VIDEO vs LIVE_STREAM

IMAGE

One still image per call. Each call is independent — no tracking carries over. Use for batch processing, dataset labelling, photo-app filters. Method: detect(imageBitmap).

VIDEO

Sequential frames + monotonic timestamps. The model uses tracking to stabilise output between frames. The Live Lab uses this. Method: detectForVideo(frame, timestampMs) — call it from your render loop.

LIVE_STREAM

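Asynchronous. Push frames as they arrive; results come back through a callback, so the caller never blocks. This mode is offered by the Android, iOS and Python Tasks APIs (for example detectAsync(frame, timestampMs) with a result listener on Android, or detect_async with result_callback in Python). The Web API used in this guide accepts only 'IMAGE' and 'VIDEO', so browser code should use VIDEO.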
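
Code recipe · keeping VIDEO timestamps monotonic

VIDEO mode expects timestamps that keep increasing. A render loop can occasionally hand over the same or an earlier value (for example after a tab regains focus), so a small guard is a common precaution. The sketch below is one way to do it; createMonotonicClock and the 1 ms bump are assumptions for illustration.

```javascript
// Returns a function that maps any clock reading to a strictly increasing value.
function createMonotonicClock() {
  let last = -Infinity;
  return function next(nowMs) {
    last = nowMs > last ? nowMs : last + 1;
    return last;
  };
}
```

In a render loop this would wrap the clock reading, along the lines of pose.detectForVideo(videoEl, next(performance.now())).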

Rough performance guidance

indicative · varies wildly by hardware

Model	Variant	~ms / frame	~FPS	Notes
Pose Landmarker	Lite	3 – 6	150–300	Mobile-friendly. Light on extreme poses.
Pose Landmarker	Full	6 – 12	80–160	Default. Solid all-rounder.
Pose Landmarker	Heavy	12 – 25	40–80	Best precision. Use for fast motion or tricky angles.
Face Landmarker	w/o blendshapes	3 – 6	150–300	Pure mesh, fastest face mode.
Face Landmarker	+ blendshapes + matrix	5 – 10	100–200	Full mocap mode. The two extra heads add cost.
Hand Landmarker	—	3 – 6 / hand	100–250	Cost roughly doubles for two hands.
Gesture Recognizer	—	4 – 8 / hand	90–200	Hand Landmarker + tiny classifier.
Holistic (composed)	—	15 – 35	30–60	Sum of three landmarkers. Drop Heavy pose for higher rates.
Object Detector	EfficientDet-Lite0	8 – 18	50–120	Good baseline.
Object Detector	EfficientDet-Lite2	15 – 35	25–60	More accurate, heavier.
Image Segmenter	Selfie / 256	3 – 8	120–300	Tiny model, 256×256 internally.

Numbers are indicative for a typical desktop with a discrete or modern integrated GPU and the GPU delegate. The CPU delegate is usually 3–10× slower. Expect lower frame rates on phones and in battery-saver mode.
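
Code recipe · estimating a frame-rate budget

A rough sketch of the arithmetic behind the table: frames per second is 1000 divided by milliseconds per frame, and a composed pipeline that runs its models one after another pays the sum of their costs. The function names and the example timings are illustrative picks from within the ranges above, not measurements.

```javascript
function fpsFromMs(msPerFrame) {
  return 1000 / msPerFrame;
}

// stageMs: per-frame cost of each model run in sequence, in milliseconds.
function composedFps(stageMs) {
  const total = stageMs.reduce((sum, ms) => sum + ms, 0);
  return fpsFromMs(total);
}

// Example: pose (8 ms) + face with blendshapes (7 ms) + two hands (2 × 5 ms)
const estimate = composedFps([8, 7, 2 * 5]); // 25 ms per frame → 40 FPS
```

Real throughput is also capped by the display refresh rate and camera frame rate, so treat the result as an upper bound on what the models allow.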

Glossary

terms used throughout this guide

Landmark: A single named point detected on a body / face / hand. Output is a 3D coordinate plus, depending on the model, visibility and presence scores.
Blendshape: A facial expression target weighted 0–1 (e.g. jawOpen: 0.7). Adds linearly to a neutral mesh. MediaPipe outputs 52 ARKit-compatible weights per face.
Image-normalized coords: x and y as 0.0 → 1.0 across the input frame, regardless of resolution. Multiply by frame width / height to convert to pixels.
World coords: Real-world metric (metres) coordinates centred on the midpoint of the subject's hips (pose) or on the approximate geometric centre of the hand (hand). Use for biomechanics, 3D rigs and AR.
Pseudo-depth (z): A relative z value, scaled roughly to torso width (pose) or hand width (hand). Ordering is reliable; absolute distance is not.
Visibility: Confidence (0–1) that the landmark is in frame and not occluded. Useful for masking unstable joints.
Presence: Confidence (0–1) that the landmark actually exists in the image (vs. predicted blindly). Distinct from visibility.
Handedness: The "Left" or "Right" label per detected hand, with a confidence score. Watch the mirror gotcha; see "The mirror handedness gotcha" under Hand Landmarker.
ROI (Region of Interest): The cropped sub-rectangle of the input that the landmark model actually processes. Detector finds it, landmark refines inside it, tracker carries it to the next frame.
BlazePose / BlazeFace: The internal architectures behind the pose / face detectors. Single-shot detectors trained for low latency.
Transformation matrix: A 4×4 matrix mapping the canonical face mesh into camera (metric) space. Drives head rotation, position, and scale on a 3D rig.
ARKit: Apple's spec for facial blendshapes (52 names: jawOpen, browInnerUp, etc). MediaPipe outputs match it directly.
Tasks API vs Solutions API: Tasks (newer) is the unified runtime for all model types — vision, audio, text, generative. Solutions (legacy) was per-domain (Pose, Hands, etc). Most pipelines should target Tasks.
Delegate: The compute backend — 'GPU' (WebGL / WebGPU / GLES) or 'CPU' (WASM). GPU is much faster on supported hardware.
Model Maker: MediaPipe's tool for retraining classification heads (gestures, image classes, object categories) on your own data — without rewriting the underlying backbones.
Live Link Face / MetaHuman: Unreal Engine's facial mocap pipeline. Consumes ARKit-shaped blendshapes — so MediaPipe outputs drop straight in.

Keyboard shortcuts

work anywhere on the page

Key	Action
Space	Start / stop camera
M	Toggle mirror
O	Toggle skeleton overlay
1	Pose mode
2	Face mode
3	Hands mode
4	Gesture mode
5	Holistic mode
R	Start / stop recording
S	Snapshot PNG
B	Toggle WebSocket broadcast

MediaPipe, decoded. Every channel, every model, live.
