4E Virtual Design
Enhancing Reality · Amplifying Results
◆ Master Guide · Tasks Vision API

MediaPipe, decoded.
Every channel, every model, live.

A complete map of every input channel MediaPipe exposes — pose joints, face blendshapes, hand landmarks, gestures, segmentation maps, and more. Turn on your camera to see the values stream in real time, then explore the full reference for every model in the suite.

Live Lab

webcam → mediapipe → channels

Webcam stays on your device. MediaPipe runs locally in the browser — nothing is uploaded.

Mirror is on. The model labels your physical right hand as "Left" and vice-versa, because it sees the mirrored image. Toggle Mirror off for canonical labels (or invert in your downstream code).

Model Reference

all channels · every model

Pose Landmarker

Tracks 33 body landmarks in 3D from a single video frame. Each landmark returns normalized image coordinates (x, y), depth (z), and a confidence score. A second "world landmarks" output gives real-world metric coordinates in metres centred on the hips — ideal for animation rigs, biomechanics, and VR retargeting.

The MediaPipe pose topology is BlazePose's 33-point skeleton. It extends the standard COCO 17 keypoints with extra hand/foot points and a richer face anchor set, making it well-suited to mocap retargeting.

Landmarks: 33 (per detected person)
Channels each: 5 (x, y, z, vis, presence)
Image-space: 165 (33 × 5 channels)
World coords: 99 (33 × 3, metric)
Segmentation: optional (person mask, H×W)
Models: 3 (lite / full / heavy)

x · normalized

Horizontal position 0.0 (left edge) → 1.0 (right edge) of the input frame.

y · normalized

Vertical position 0.0 (top) → 1.0 (bottom). Y axis points down, like screen coords.

z · depth

Depth relative to the midpoint of the hips, on roughly the same scale as x. Negative = closer to camera, positive = further away.

visibility

0 → 1 confidence the joint is visible (not occluded) and lies in frame.

presence

0 → 1 confidence the joint actually exists in the image (vs. predicted by the model).
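
As a rough illustration of how these channels are usually consumed, the sketch below scales normalized x / y to pixels and flags joints whose visibility falls under a cutoff. The helper name `toPixels`, the 0.5 cutoff and the sample values are illustrative assumptions, not part of MediaPipe; a missing `visibility` field is treated as fully visible.

```javascript
// Hypothetical helper: map normalized landmarks to pixel space.
function toPixels(landmarks, width, height, minVisibility = 0.5) {
  return landmarks.map((lm, index) => ({
    index,
    x: lm.x * width,   // 0..1 → 0..width
    y: lm.y * height,  // 0..1 → 0..height (y grows downward)
    z: lm.z,           // relative depth, left unscaled
    usable: (lm.visibility ?? 1) >= minVisibility
  }));
}

// Made-up sample data standing in for r.landmarks[0].
const sampleLandmarks = [
  { x: 0.5, y: 0.25, z: -0.1, visibility: 0.9 },
  { x: 0.1, y: 0.9, z: 0.2, visibility: 0.2 }
];
const pixelLandmarks = toPixels(sampleLandmarks, 1280, 720);
console.log(pixelLandmarks);
```

Keeping the `usable` flag next to each point lets a renderer or rig skip unstable joints without discarding the index order.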

All 33 Pose Landmarks

# | Name | Region | Range | Notes

Lite model

Smallest, fastest. Good for mobile / WebGPU constrained devices. Lower accuracy on extreme poses.

Full model

The default. Solid balance of accuracy and speed for most desktop / laptop scenarios.

Heavy model

Best landmark precision, especially for fast motion and edge poses. Higher latency.

Configuration options

| Option | Type | Default | What it does |
| --- | --- | --- | --- |
| runningMode | enum | 'IMAGE' | 'IMAGE' or 'VIDEO' in the web API ('LIVE_STREAM' exists only on other platforms such as Android and Python). VIDEO uses cross-frame tracking; choose this for webcam. |
| numPoses | number | 1 | Maximum people to detect. Up to ~5 supported; cost scales linearly. |
| minPoseDetectionConfidence | number | 0.5 | Threshold for the detector stage. Raise to suppress false positives in busy scenes. |
| minPosePresenceConfidence | number | 0.5 | Threshold for the landmark presence head: how confident the model is that the person is there. |
| minTrackingConfidence | number | 0.5 | Threshold for tracking continuity between frames. Lower = stickier (fewer re-detections). |
| outputSegmentationMasks | boolean | false | If true, result includes a per-pixel person mask. See note below. |
| baseOptions.delegate | enum | 'CPU' | 'GPU' (WebGL/WebGPU) or 'CPU' (WASM). The recipes in this guide set 'GPU' explicitly; GPU is 3–10× faster on supported hardware. |
| baseOptions.modelAssetPath | string | | URL or local path to the .task model file (lite / full / heavy). |

Segmentation mask output

Set outputSegmentationMasks: true to additionally receive result.segmentationMasks[0] — a single-channel mask the same resolution as the input. Each pixel is the model's confidence (0.0–1.0) that the pixel belongs to the person. Useful for AR compositing, virtual backgrounds, and driving alpha mattes for Blender/UE compositing without a separate segmenter.
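
To make the alpha-matte idea concrete, here is a minimal sketch that turns a confidence mask into RGBA bytes whose alpha channel follows the person confidence. The helper name `maskToRgba`, the cutoff parameter and the four-pixel sample are assumptions for illustration. In the browser the input would typically come from `result.segmentationMasks[0].getAsFloat32Array()`, and the output could be wrapped with `new ImageData(rgba, width, height)`.

```javascript
// Hypothetical helper: confidence values (0..1) → white RGBA pixels with
// alpha proportional to confidence. Values under `cutoff` become transparent.
function maskToRgba(confidence, cutoff = 0) {
  const rgba = new Uint8ClampedArray(confidence.length * 4);
  for (let i = 0; i < confidence.length; i++) {
    const c = confidence[i] >= cutoff ? confidence[i] : 0;
    rgba[i * 4] = 255;                      // R
    rgba[i * 4 + 1] = 255;                  // G
    rgba[i * 4 + 2] = 255;                  // B
    rgba[i * 4 + 3] = Math.round(c * 255);  // A
  }
  return rgba;
}

// Made-up 4-pixel mask standing in for real model output.
const demoMask = new Float32Array([0, 1, 0.5, 0.2]);
const demoRgba = maskToRgba(demoMask, 0.3);
console.log(Array.from(demoRgba));
```

A small cutoff is one simple way to suppress low-confidence haze around the silhouette; feathering or temporal smoothing would be alternatives.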

Code recipe · minimal pose landmarker init + per-frame call
import { PoseLandmarker, FilesetResolver }
  from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.21/vision_bundle.mjs';

const vision = await FilesetResolver.forVisionTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.21/wasm'
);

const pose = await PoseLandmarker.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/pose_landmarker/pose_landmarker_full/float16/latest/pose_landmarker_full.task',
    delegate: 'GPU'
  },
  runningMode: 'VIDEO',
  numPoses: 1,
  minPoseDetectionConfidence: 0.5,
  minTrackingConfidence: 0.5,
  outputSegmentationMasks: false
});

// per RAF tick:
const r = pose.detectForVideo(videoEl, performance.now());
// r.landmarks[0]      → 33 image-space landmarks  (x, y, z, visibility, presence)
// r.worldLandmarks[0] → 33 metric coords centred on hips
// r.segmentationMasks[0] → per-pixel person mask (if enabled)

Face Landmarker

The most channel-rich model in the suite. Outputs a 478-point face mesh (468 face + 10 iris), plus 52 ARKit-compatible blendshape weights, plus a 4×4 facial transformation matrix in metric space. Together, this is everything you need to drive a MetaHuman, ARKit avatar, or custom rig.

The 52 blendshapes match Apple's ARKit specification — so weights from MediaPipe drop straight into MetaHuman ARKit-mapped facial poses, Live Link Face, or any rig authored against the standard.

Mesh points: 468 (canonical mesh)
+ Iris: 10 (5 per eye)
Total mesh: 478 (x, y, z each)
Blendshapes: 52 (ARKit-compatible)
Transform: 4×4 (head pose matrix)
Total channels: ~1,486 per detected face (478 × 3 + 52; the 16 matrix entries come on top)

Face mesh (468)

Dense topology covering the entire face surface. Indexed identically to TFLite Face Mesh, so community UV maps and rigs are interchangeable.

Iris (10)

Indices 468–472 = subject's right iris (centre + 4 perimeter), 473–477 = subject's left iris. Use the centre point for gaze direction. The perimeter points trace the iris boundary, so they give the apparent iris diameter rather than pupil dilation; because the human iris diameter is nearly constant, it is a useful scale reference.

Blendshapes (52)

Each value 0.0–1.0. Directly weight ARKit/MetaHuman pose targets. Includes neutral, brows, eyes, jaw, mouth, cheeks, nose.

Transformation matrix

4×4 homogeneous matrix mapping the canonical mesh into camera space (metric units). Drive head rotation, position, and scale from this.
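
One way to drive a rig from the matrix is to split it into translation, scale and Euler angles. The sketch below assumes a column-major layout of 16 numbers and uniform scale, and it uses the XYZ Euler convention (R = Rx · Ry · Rz); the helper name `decomposeHeadMatrix` and the demo matrix are illustrative. A rig that expects a different rotation order would need a different extraction.

```javascript
// Hypothetical helper: split a column-major 4×4 matrix into translation,
// uniform scale and XYZ Euler angles in radians (R = Rx · Ry · Rz).
function decomposeHeadMatrix(m) {
  const translation = [m[12], m[13], m[14]];
  const scale = Math.hypot(m[0], m[1], m[2]); // length of the first column
  // Rotation entries named r<row><col>, with the scale divided out.
  const r11 = m[0] / scale, r12 = m[4] / scale, r13 = m[8] / scale;
  const r22 = m[5] / scale, r23 = m[9] / scale;
  const r32 = m[6] / scale, r33 = m[10] / scale;
  const y = Math.asin(Math.max(-1, Math.min(1, r13)));
  let x, z;
  if (Math.abs(r13) < 0.9999999) {
    x = Math.atan2(-r23, r33);
    z = Math.atan2(-r12, r11);
  } else {
    // Gimbal lock: only a combination of x and z is defined, so pin z to 0.
    x = Math.atan2(r32, r22);
    z = 0;
  }
  return { translation, scale, rotation: { x, y, z } };
}

// Demo: a 30° turn about the Y axis plus a translation (made-up values).
const cosA = Math.cos(Math.PI / 6);
const sinA = Math.sin(Math.PI / 6);
const demoMatrix = [
  cosA, 0, -sinA, 0,   // column 0
  0, 1, 0, 0,          // column 1
  sinA, 0, cosA, 0,    // column 2
  0, 1, -40, 1         // column 3 (translation)
];
const headPose = decomposeHeadMatrix(demoMatrix);
console.log(headPose);
```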

52 Face Blendshapes (the killer feature for facial mocap)

# | Name | Region | Range | Maps to (typical)

Face Mesh Region Index Map

Region | Indices | Count | Use

Configuration options

| Option | Type | Default | What it does |
| --- | --- | --- | --- |
| runningMode | enum | 'IMAGE' | 'IMAGE' or 'VIDEO' in the web API ('LIVE_STREAM' exists only on other platforms such as Android and Python). |
| numFaces | number | 1 | Maximum faces to track. Each face costs an extra inference pass. |
| minFaceDetectionConfidence | number | 0.5 | Detector threshold. |
| minFacePresenceConfidence | number | 0.5 | Landmark presence threshold. |
| minTrackingConfidence | number | 0.5 | Cross-frame tracking threshold. |
| outputFaceBlendshapes | boolean | false | Critical: set true to receive the 52 ARKit blendshape weights. Disabled by default to save compute. |
| outputFacialTransformationMatrixes | boolean | false | Set true to receive the 4×4 head-pose matrix per face. |
| baseOptions.delegate | enum | 'CPU' | 'GPU' or 'CPU'. The recipes in this guide set 'GPU' explicitly. |
Code recipe · face landmarker with blendshapes + transform matrix
const face = await FaceLandmarker.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/latest/face_landmarker.task',
    delegate: 'GPU'
  },
  runningMode: 'VIDEO',
  numFaces: 1,
  outputFaceBlendshapes: true,           // 52 ARKit weights — drives MetaHuman / Live Link Face
  outputFacialTransformationMatrixes: true // 4×4 head pose
});

const r = face.detectForVideo(videoEl, performance.now());
// r.faceLandmarks[0]                    → 478 mesh points (x, y, z)
// r.faceBlendshapes[0].categories       → 52 weighted shapes [{ categoryName, score }]
// r.facialTransformationMatrixes[0].data → Float32Array(16) — column-major 4×4
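
Rigs usually address blendshapes by name, so a common first step is to turn the `categories` array into a lookup object and damp frame-to-frame jitter. The sketch below shows one way to do that; `blendshapeMap`, `smoothWeights` and the 0.5 smoothing factor are illustrative assumptions, and the two frames are made-up data.

```javascript
// Hypothetical helper: [{ categoryName, score }] → { name: score }.
function blendshapeMap(categories) {
  const weights = {};
  for (const { categoryName, score } of categories) {
    weights[categoryName] = score;
  }
  return weights;
}

// Hypothetical helper: exponential smoothing between two frames.
// alpha = 1 follows the new frame exactly; smaller values lag but jitter less.
function smoothWeights(previous, current, alpha = 0.5) {
  const out = {};
  for (const name of Object.keys(current)) {
    const prev = previous[name] ?? current[name];
    out[name] = prev + alpha * (current[name] - prev);
  }
  return out;
}

// Made-up frames standing in for r.faceBlendshapes[0].categories.
const frameA = blendshapeMap([
  { categoryName: 'jawOpen', score: 0.2 },
  { categoryName: 'eyeBlinkLeft', score: 0 }
]);
const frameB = blendshapeMap([
  { categoryName: 'jawOpen', score: 0.6 },
  { categoryName: 'eyeBlinkLeft', score: 1 }
]);
const smoothed = smoothWeights(frameA, frameB, 0.5);
console.log(smoothed);
```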

Hand Landmarker

Tracks up to two hands independently, each with 21 landmarks. Returns image-space coordinates, world coordinates (metric), and a handedness label (Left / Right) with a confidence score. The skeleton topology is symmetric and great for sign language, AR interaction, and instrument tracking.

Hands: 2 (max simultaneous)
Per-hand: 21 landmarks
Channels each: 3 (x, y, z)
+ Handedness: 1 (L/R + score)
World coords: 63 (metric, per hand)
Total/hand: 128 (incl. handedness)

21 Hand Landmarks (per hand)

# | Name | Region | Bone | Notes
CMC

Carpometacarpal

Base of the thumb where it meets the wrist — the "ball" of the joint.

MCP

Metacarpophalangeal

Knuckle joint where the finger meets the palm.

PIP

Proximal interphalangeal

The middle joint of each finger: the one that bends most when curling.

DIP

Distal interphalangeal

The joint nearest the fingertip.

TIP

Fingertip

The very end of the digit.
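
These joint names map directly onto simple measurements. As a sketch, the code below computes the bend at the index finger's PIP joint from its MCP, PIP and DIP landmarks (indices 5, 6 and 7 in the 21-point topology). The helper names and sample coordinates are illustrative assumptions; world landmarks are the better input, since image-space x and y are scaled differently on non-square frames.

```javascript
// Angle in degrees at joint b, formed by the segments b→a and b→c.
function jointAngle(a, b, c) {
  const v1 = [a.x - b.x, a.y - b.y, a.z - b.z];
  const v2 = [c.x - b.x, c.y - b.y, c.z - b.z];
  const dot = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
  const len = Math.hypot(...v1) * Math.hypot(...v2);
  if (len === 0) return 0; // degenerate input: coincident points
  const cos = Math.max(-1, Math.min(1, dot / len));
  return (Math.acos(cos) * 180) / Math.PI;
}

// Index-finger landmark indices in the 21-point hand topology.
const INDEX_MCP = 5;
const INDEX_PIP = 6;
const INDEX_DIP = 7;

// ~180° means a straight finger; smaller angles mean more curl.
function indexPipAngle(hand) {
  return jointAngle(hand[INDEX_MCP], hand[INDEX_PIP], hand[INDEX_DIP]);
}

// Made-up hand: 21 points at the origin, then three joints set by hand.
const demoHand = Array.from({ length: 21 }, () => ({ x: 0, y: 0, z: 0 }));
demoHand[INDEX_MCP] = { x: 0, y: 0, z: 0 };
demoHand[INDEX_PIP] = { x: 0, y: 0.03, z: 0 };
demoHand[INDEX_DIP] = { x: 0.02, y: 0.03, z: 0 };
const bentAngle = indexPipAngle(demoHand);
console.log(bentAngle);
```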

The mirror handedness gotcha

The model labels handedness from the camera's point of view. When you mirror the preview (so it feels like a selfie), your physical right hand appears on the right side of the mirrored image — but the model still labels it "Left" because that's where it sees it on the unmirrored input. Fix: either don't mirror the input you feed the model, or invert handedness.categoryName in your downstream code when mirror is on. The Live Lab toggles a banner to remind you.

Configuration options

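| Option | Type | Default | What it does |
| --- | --- | --- | --- |
| runningMode | enum | 'IMAGE' | 'IMAGE' or 'VIDEO' in the web API ('LIVE_STREAM' exists only on other platforms such as Android and Python). |
| numHands | number | 1 | Maximum hands. Set to 2 for both. Each hand is its own inference pass. |
| minHandDetectionConfidence | number | 0.5 | Detector threshold. |
| minHandPresenceConfidence | number | 0.5 | Landmark presence threshold. |
| minTrackingConfidence | number | 0.5 | Cross-frame tracking continuity. Lower = stickier track. |
| baseOptions.delegate | enum | 'CPU' | 'GPU' or 'CPU'. The recipes in this guide set 'GPU' explicitly. |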
Code recipe · two-handed tracking + handedness
const hands = await HandLandmarker.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/hand_landmarker/hand_landmarker/float16/latest/hand_landmarker.task',
    delegate: 'GPU'
  },
  runningMode: 'VIDEO',
  numHands: 2
});

const r = hands.detectForVideo(videoEl, performance.now());
// r.landmarks      → [hand1Landmarks, hand2Landmarks] each: 21 × {x,y,z}
// r.worldLandmarks → metric coords, relative to the hand's geometric centre
// r.handedness     → [[{ categoryName: 'Left'|'Right', score }], ...]

// Mirror-aware handedness:
function trueHand(label, mirrored) {
  if (!mirrored) return label;
  return label === 'Left' ? 'Right' : 'Left';
}

Gesture Recognizer

Builds on the Hand Landmarker and adds a classifier on top — outputting a categorical gesture label per hand alongside the landmarks. The default model ships with 8 classes (7 gestures + None). The classifier is replaceable: train your own custom gesture set with Model Maker and swap it in.

Default classes: 8 (incl. None)
+ Hand outputs: 21 landmarks
Score per class: 1 (0.0 – 1.0)
Custom?: yes (via Model Maker)

Default Gesture Classes

| # | Class | Symbol | Description |
| --- | --- | --- | --- |
| 0 | None | | No recognized gesture / below confidence threshold |
| 1 | Closed_Fist | | All four fingers and thumb closed into a fist |
| 2 | Open_Palm | 🖐 | Hand fully open, fingers extended and spread |
| 3 | Pointing_Up | | Index finger extended upward, others closed |
| 4 | Thumb_Down | 👎 | Thumb extended downward, fingers curled |
| 5 | Thumb_Up | 👍 | Thumb extended upward, fingers curled |
| 6 | Victory | | Index + middle extended (peace / victory sign) |
| 7 | ILoveYou | 🤟 | Thumb + index + pinky extended (ASL "I love you") |

Custom gestures

Use MediaPipe Model Maker to retrain the classification head on your own dataset (15–50 samples per gesture is enough to start). The hand landmarker stays fixed; only the lightweight classifier swaps. Output stays in the same shape: categoryName + score.

Configuration options

| Option | Type | Default | What it does |
| --- | --- | --- | --- |
| runningMode | enum | 'IMAGE' | 'IMAGE' or 'VIDEO' in the web API ('LIVE_STREAM' exists only on other platforms such as Android and Python). |
| numHands | number | 1 | Maximum hands. Each hand gets its own gesture classification. |
| minHandDetectionConfidence | number | 0.5 | Detector threshold for finding hands. |
| minHandPresenceConfidence | number | 0.5 | Landmark presence threshold. |
| minTrackingConfidence | number | 0.5 | Cross-frame tracking threshold. |
| cannedGesturesClassifierOptions | object | {} | { scoreThreshold, categoryAllowlist, categoryDenylist } for the built-in 8 classes. |
| customGesturesClassifierOptions | object | {} | Same shape as canned, but applied to your custom Model-Maker-trained classifier. |
| baseOptions.delegate | enum | 'CPU' | 'GPU' or 'CPU'. The recipes in this guide set 'GPU' explicitly. |
Code recipe · gesture recognition with handedness
const gestures = await GestureRecognizer.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/gesture_recognizer/gesture_recognizer/float16/latest/gesture_recognizer.task',
    delegate: 'GPU'
  },
  runningMode: 'VIDEO',
  numHands: 2
});

const r = gestures.recognizeForVideo(videoEl, performance.now());
// r.gestures     → [[{ categoryName: 'Thumb_Up', score: 0.93 }], ...]  (top-1 per hand)
// r.handedness   → [[{ categoryName: 'Left'|'Right', score }], ...]
// r.landmarks    → 21 image-space landmarks per hand
// r.worldLandmarks → metric coords per hand
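
Downstream code usually wants one decision per hand rather than raw candidate lists. The sketch below reduces a result of the shape shown above to `{ hand, gesture, score }` entries, treating `None` and low-scoring candidates as "no gesture". The helper name `summarizeGestures`, the 0.6 threshold and the demo result are illustrative assumptions.

```javascript
// Hypothetical helper: pick the top gesture per hand, or null if the top
// candidate is 'None' or scores below minScore.
function summarizeGestures(result, minScore = 0.6) {
  return result.gestures.map((candidates, i) => {
    const top = candidates[0];
    const hand = result.handedness[i]?.[0]?.categoryName ?? 'Unknown';
    const accepted =
      top !== undefined && top.categoryName !== 'None' && top.score >= minScore;
    return {
      hand,
      gesture: accepted ? top.categoryName : null,
      score: top ? top.score : 0
    };
  });
}

// Made-up result mirroring the shape of recognizeForVideo() output.
const demoResult = {
  gestures: [
    [{ categoryName: 'Thumb_Up', score: 0.93 }],
    [{ categoryName: 'None', score: 0.8 }]
  ],
  handedness: [
    [{ categoryName: 'Left', score: 0.98 }],
    [{ categoryName: 'Right', score: 0.95 }]
  ]
};
const gestureSummary = summarizeGestures(demoResult);
console.log(gestureSummary);
```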

Holistic Landmarker

The full-body monster: combines pose + face + both hands in a single inference graph, with shared cropping/tracking between sub-models for efficiency. This is the model that powers most full-body avatar mocap pipelines.

Tasks API status: a unified HolisticLandmarker is now part of MediaPipe Tasks Vision (Web/Python). On platforms where it isn't yet shipped, you can compose three landmarkers in parallel each frame — pose, face and hand — and merge the results. The Live Lab's Holistic mode demonstrates the composed approach, which is identical in channel structure to the legacy Solutions API output.

Pose: 33 landmarks
Face: 468 mesh points
Left hand: 21 landmarks
Right hand: 21 landmarks
Total points: 543 per frame
+ Segmentation: optional (person mask)

Sequential cropping

The pose model runs first; its wrist/face anchors are used to crop ROIs that get passed to the face mesh and hand landmarkers — far cheaper than running each in isolation.

Synced output

All three sub-models return for the same frame, so timestamps line up perfectly. Critical for clean retargeting.

Trade-offs

Hand fidelity inside Holistic is lower than running the standalone Hand Landmarker on a tight crop. For close-up hand work, prefer Hand Landmarker.

Best for

Full-body avatar mocap (Blender/UE rigs), live performance capture, dance / yoga / fitness apps where you need everything at once.

Configuration options (composed approach)

| Option | Type | Default | What it does |
| --- | --- | --- | --- |
| poseLandmarker.* | object | | All Pose Landmarker options apply (numPoses, confidences, segmentation, model variant, delegate). |
| faceLandmarker.* | object | | All Face Landmarker options apply. Set outputFaceBlendshapes & outputFacialTransformationMatrixes to true for full mocap. |
| handLandmarker.* | object | | All Hand Landmarker options apply. numHands: 2 recommended. |
| timestamp | number | | Pass the same performance.now() value to all three for guaranteed alignment. |
Code recipe · composed full-body capture (pose + face + hands in parallel)
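// Init three landmarkers once.
// poseModel, faceModel and handModel are baseOptions objects
// ({ modelAssetPath, delegate }), as in the individual recipes above.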
const [pose, face, hands] = await Promise.all([
  PoseLandmarker.createFromOptions(vision, { runningMode: 'VIDEO', numPoses: 1, baseOptions: poseModel }),
  FaceLandmarker.createFromOptions(vision, { runningMode: 'VIDEO', numFaces: 1,
    outputFaceBlendshapes: true, outputFacialTransformationMatrixes: true, baseOptions: faceModel }),
  HandLandmarker.createFromOptions(vision, { runningMode: 'VIDEO', numHands: 2, baseOptions: handModel })
]);

// per RAF tick, run all three with the same timestamp
const t = performance.now();
const [poseR, faceR, handsR] = [
  pose.detectForVideo(videoEl, t),
  face.detectForVideo(videoEl, t),
  hands.detectForVideo(videoEl, t)
];

// Merge into a single holistic payload
const holistic = {
  timestamp: t,
  pose:        poseR.landmarks?.[0]      ?? null,
  poseWorld:   poseR.worldLandmarks?.[0] ?? null,
  face:        faceR.faceLandmarks?.[0]  ?? null,
  blendshapes: faceR.faceBlendshapes?.[0]?.categories ?? null,
  headMatrix:  faceR.facialTransformationMatrixes?.[0]?.data ?? null,
  hands:       handsR.landmarks?.map((lm, i) => ({
    label: handsR.handedness[i][0].categoryName,
    score: handsR.handedness[i][0].score,
    landmarks: lm,
    world: handsR.worldLandmarks[i]
  })) ?? []
};
Code recipe · unified HolisticLandmarker (Tasks Vision)
import { HolisticLandmarker, FilesetResolver }
  from '@mediapipe/tasks-vision';

const holistic = await HolisticLandmarker.createFromOptions(vision, {
  baseOptions: { modelAssetPath: '.../holistic_landmarker.task', delegate: 'GPU' },
  runningMode: 'VIDEO',
  outputFaceBlendshapes: true,
  minFaceDetectionConfidence: 0.5,
  minHandLandmarksConfidence: 0.5,
  minPoseDetectionConfidence: 0.5
});

const r = holistic.detectForVideo(videoEl, performance.now());
// r.poseLandmarks, r.poseWorldLandmarks
// r.faceLandmarks, r.faceBlendshapes
// r.leftHandLandmarks, r.rightHandLandmarks
// r.leftHandWorldLandmarks, r.rightHandWorldLandmarks

Object Detector

Localizes objects in the frame and labels each with a category. Returns a list of detections, each with a bounding box, one or more categories with scores, and an optional keypoint set on supported models. The default models are EfficientDet-Lite variants trained on COCO.

COCO classes: 80 (default model)
Bbox channels: 4 (x, y, w, h)
Score: 1 (0.0 – 1.0)
Custom: yes (Model Maker)

Output channels per detection

| Channel | Type | Range | Description |
| --- | --- | --- | --- |
| boundingBox.originX | geometry | 0 → W | Top-left X in pixels of the input image. |
| boundingBox.originY | geometry | 0 → H | Top-left Y in pixels of the input image. |
| boundingBox.width | geometry | px | Width of the bounding box in pixels. |
| boundingBox.height | geometry | px | Height of the bounding box in pixels. |
| categories[].categoryName | label | str | e.g. "person", "cup", "laptop". COCO label set by default. |
| categories[].score | confidence | 0 → 1 | Per-category confidence, sorted descending. |
| categories[].index | label | int | Numeric class index in the model's labelmap. |
| keypoints[] | geometry | opt. | Some specialised models output keypoints with each detection (e.g. face, hand corners). |

Field names follow the JavaScript result object; other language bindings use snake_case (bounding_box.origin_x, category_name).

COCO 80-class label set (used by default models)

Configuration options

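| Option | Type | Default | What it does |
| --- | --- | --- | --- |
| runningMode | enum | 'IMAGE' | 'IMAGE' or 'VIDEO' in the web API ('LIVE_STREAM' exists only on other platforms such as Android and Python). |
| maxResults | number | -1 | Cap on detections returned. -1 = all that pass threshold. |
| scoreThreshold | number | not set | Minimum confidence; anything lower is dropped. When unset, the threshold from the model metadata (if any) applies. |
| categoryAllowlist | string[] | [] | Whitelist of category names to keep. Empty = all. |
| categoryDenylist | string[] | [] | Blacklist of category names to drop. |
| baseOptions.modelAssetPath | string | | Stock models are EfficientDet-Lite0 (faster) and EfficientDet-Lite2 (more accurate). |
| baseOptions.delegate | enum | 'CPU' | 'GPU' or 'CPU'. The recipes in this guide set 'GPU' explicitly. |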
Code recipe · object detector with allowlist
const det = await ObjectDetector.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/object_detector/efficientdet_lite0/float16/latest/efficientdet_lite0.tflite',
    delegate: 'GPU'
  },
  runningMode: 'VIDEO',
  scoreThreshold: 0.5,
  categoryAllowlist: ['person', 'cup', 'laptop']
});

const r = det.detectForVideo(videoEl, performance.now());
// r.detections → [{ boundingBox: {originX, originY, width, height},
//                  categories: [{ categoryName, score, index }] }, ...]
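
Because boxes arrive in input-image pixels, two small helpers cover most follow-up work: normalizing a box centre so it is resolution-independent, and measuring overlap between two boxes (for example to match detections across frames). Both helpers below are illustrative sketches with assumed names; only the `originX`, `originY`, `width` and `height` fields come from the result shape shown above.

```javascript
// Intersection-over-union of two pixel-space boxes (0 = disjoint, 1 = identical).
function iou(a, b) {
  const x1 = Math.max(a.originX, b.originX);
  const y1 = Math.max(a.originY, b.originY);
  const x2 = Math.min(a.originX + a.width, b.originX + b.width);
  const y2 = Math.min(a.originY + a.height, b.originY + b.height);
  const intersection = Math.max(0, x2 - x1) * Math.max(0, y2 - y1);
  const union = a.width * a.height + b.width * b.height - intersection;
  return union > 0 ? intersection / union : 0;
}

// Box centre as 0..1 fractions of the frame size.
function normalizedCenter(box, frameWidth, frameHeight) {
  return {
    x: (box.originX + box.width / 2) / frameWidth,
    y: (box.originY + box.height / 2) / frameHeight
  };
}

// Made-up boxes standing in for r.detections[i].boundingBox.
const boxA = { originX: 0, originY: 0, width: 10, height: 10 };
const boxB = { originX: 5, originY: 0, width: 10, height: 10 };
const overlap = iou(boxA, boxB);
const center = normalizedCenter(
  { originX: 320, originY: 180, width: 640, height: 360 }, 1280, 720
);
console.log(overlap, center);
```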

Image Segmenter & Interactive Segmenter

Per-pixel classification of the input image. Output is a segmentation mask (one or more) where each pixel is labelled with a class index or class probability. Use cases: virtual backgrounds, AR clothing, hair styling, portrait lighting, body part isolation.

Selfie segmentation

2 classes: background, person. Lightweight, ideal for video calls / Zoom-style virtual backgrounds.

Multi-class selfie

6 classes: background, hair, body-skin, face-skin, clothes, others.

Hair segmentation

2 classes: background, hair. Higher hair-edge precision than the multi-class model.

DeepLab v3

21 PASCAL VOC classes: people, animals, vehicles, indoor objects.

Interactive Segmenter

Click or tap a point in the image; the model returns a mask of that object. Built on the MagicTouch model.

Multi-class Selfie Segmentation labels

| Idx | Class | RGB hint | Use |
| --- | --- | --- | --- |
| 0 | background | 0,0,0 | Everything not part of the subject. Use for virtual backgrounds. |
| 1 | hair | 128,0,0 | Scalp hair. Drive AR hair colour, virtual styling. |
| 2 | body-skin | 0,128,0 | Skin on neck, arms, hands. |
| 3 | face-skin | 128,128,0 | Skin on the face. Useful for makeup / beautification effects. |
| 4 | clothes | 0,0,128 | Garments. Drive AR try-on or background-replacement edge cases. |
| 5 | others | 128,0,128 | Subject pixels that don't fit the four classes (glasses, hats, etc). |

Output channels (Image Segmenter)

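| Channel | Shape | Type | Description |
| --- | --- | --- | --- |
| categoryMask | H × W × 1 | uint8 | Each pixel is the integer class index (0–N). |
| confidenceMasks[k] | H × W × 1 | float32 | One mask per class; pixel value = probability ∈ [0, 1] of belonging to class k. |

Field names follow the JavaScript result object; other language bindings use category_mask and confidence_masks.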

Configuration options

| Option | Type | Default | What it does |
| --- | --- | --- | --- |
| runningMode | enum | 'IMAGE' | 'IMAGE' or 'VIDEO' in the web API ('LIVE_STREAM' exists only on other platforms such as Android and Python). |
| outputCategoryMask | boolean | false | If true, returns a single H×W uint8 mask with class indices. |
| outputConfidenceMasks | boolean | true | If true, returns one float mask per class with probabilities. |
| displayNamesLocale | string | 'en' | Locale for class display names (where available). |
| baseOptions.modelAssetPath | string | | Selfie / Multi-class selfie / Hair / DeepLab v3; pick the model for your use case. |
| baseOptions.delegate | enum | 'CPU' | 'GPU' or 'CPU'. The recipes in this guide set 'GPU' explicitly. |
Code recipe · multi-class selfie segmenter
const seg = await ImageSegmenter.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/image_segmenter/selfie_multiclass_256x256/float32/latest/selfie_multiclass_256x256.tflite',
    delegate: 'GPU'
  },
  runningMode: 'VIDEO',
  outputCategoryMask: true,
  outputConfidenceMasks: false
});

const r = seg.segmentForVideo(videoEl, performance.now());
// r.categoryMask → MPMask  (.getAsUint8Array() → H*W bytes, each a class idx 0–5)
// IMPORTANT: call r.close() when you're done with the masks to free GPU memory
r.close();
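
Since the mask memory is released by `close()`, any statistics have to be computed (or the bytes copied) beforehand. As a sketch, the helper below measures what fraction of the frame each class covers; in the browser it would be called as `classCoverage(r.categoryMask.getAsUint8Array(), 6)` before `r.close()`. The helper name and the four-pixel demo mask are illustrative assumptions.

```javascript
// Hypothetical helper: fraction of pixels assigned to each class index.
function classCoverage(categoryMask, numClasses) {
  const counts = new Array(numClasses).fill(0);
  for (const classIndex of categoryMask) {
    if (classIndex < numClasses) counts[classIndex]++;
  }
  const total = categoryMask.length;
  return counts.map((n) => (total > 0 ? n / total : 0));
}

// Made-up mask: two background pixels, one hair pixel, one clothes pixel.
const demoCategoryMask = new Uint8Array([0, 0, 1, 4]);
const coverage = classCoverage(demoCategoryMask, 6);
console.log(coverage); // [0.5, 0.25, 0, 0, 0.25, 0]
```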

The rest of the MediaPipe family

vision · audio · text · genai

Face Detector

Lightweight cousin of Face Landmarker (BlazeFace). Returns a bounding box, score, and 6 keypoints: right eye, left eye, nose tip, mouth centre, right ear tragion, left ear tragion. Ideal as a cheap pre-stage before a heavier model.

Face Stylizer

Stylizes the face region into a target style (cartoon, oil painting, sketch). Custom styles trainable via Model Maker.

Image Classifier

Whole-image label classification. EfficientNet-Lite default. Returns top-K categories with scores.

Image Embedder

Returns a 1024-d feature vector per image. Use for similarity search, clustering, k-NN retrieval.

Pose Detector (legacy)

Older single-step pose API from the Solutions package. Same 33-point topology but different runtime.

Audio Classifier

Default model is YamNet — classifies short audio chunks (typically 0.975s windows) into AudioSet's 521-class ontology (speech, music, footsteps, applause, sirens, etc).

Audio Embedder

Returns a feature vector for an audio chunk — use for retrieval / similarity in sound libraries.

Text Classifier

Sentiment / topic / safety classification of arbitrary text. BERT-based.

Text Embedder

Sentence-level embeddings for semantic search and clustering.

Language Detector

Detects the language of a text snippet across 100+ languages with confidence scores.

LLM Inference

On-device inference for small open LLMs (Gemma, Phi, Falcon variants). Runs locally on web / Android / iOS.

Image Generator

On-device Stable-Diffusion-style generation (Android currently). Text-to-image, optional ControlNet.

Channel summary across the suite

Model | Domain | Output channels | Total numeric / item

How MediaPipe thinks

A non-technical mental model: MediaPipe is a graph of small ML models that pass tensors to each other, all running on-device. Most vision pipelines follow the same recipe — detect, crop, refine, output.

1

Frame in

A camera frame arrives as a tensor (H × W × 3, RGB, normalized to 0–1). MediaPipe wraps it in an ImageFrame with a timestamp so downstream nodes know which frame each output belongs to.

2

Detector finds the region of interest

A small fast model (e.g. BlazeFace, BlazePalm, BlazePose Detector) finds where the subject is in the frame and returns a tight crop. This step is cheap, and it only needs to run on the first frame and again whenever tracking is lost.

3

Landmark model refines

The crop is fed to the heavier landmark model (Face Mesh, Hand Landmarker, Pose Landmarker). This regresses the precise landmark coordinates inside the crop.

4

Tracking saves the next frame

Once landmarks are known, MediaPipe predicts the ROI for the next frame from them — skipping the detector entirely while tracking is stable. This is why MediaPipe is so fast: most frames only run the landmark model.

5

Per-channel post-processing

Some channels are derived: blendshapes are regressed from the face mesh by a small MLP head; world coordinates are computed from image landmarks via a separate sub-network; visibility/presence are sigmoid outputs from auxiliary heads.

6

Out comes the result object

Each call returns a structured object: arrays of landmarks, blendshape weights, classification categories, segmentation masks. You read the channels you care about and feed them to your rig, your widget, your analytics — whatever happens next.
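
The detect, crop, refine, track recipe above can be sketched in a few lines. This is a conceptual model only, not MediaPipe's internal code: `detect` and `refine` are caller-supplied stand-ins for the detector and landmark models, and `roiFromLandmarks`, `makeTracker`, the 25% margin and the 0.5 presence cutoff are all illustrative assumptions.

```javascript
// Bounding box around normalized landmarks, padded by a margin and clamped
// to the frame. This becomes the region of interest for the next frame.
function roiFromLandmarks(landmarks, margin = 0.25) {
  const xs = landmarks.map((p) => p.x);
  const ys = landmarks.map((p) => p.y);
  const minX = Math.min(...xs), maxX = Math.max(...xs);
  const minY = Math.min(...ys), maxY = Math.max(...ys);
  const padX = (maxX - minX) * margin;
  const padY = (maxY - minY) * margin;
  const clamp = (v) => Math.max(0, Math.min(1, v));
  return {
    x0: clamp(minX - padX), y0: clamp(minY - padY),
    x1: clamp(maxX + padX), y1: clamp(maxY + padY)
  };
}

// detect(frame) → roi | null; refine(frame, roi) → { landmarks, presence }.
function makeTracker(detect, refine, minPresence = 0.5) {
  let roi = null;
  return function step(frame) {
    if (!roi) roi = detect(frame);      // detector runs only without a tracked ROI
    if (!roi) return null;              // nothing found in this frame
    const { landmarks, presence } = refine(frame, roi);
    if (presence < minPresence) {       // track lost: re-detect on the next frame
      roi = null;
      return null;
    }
    roi = roiFromLandmarks(landmarks);  // carry the ROI forward
    return landmarks;
  };
}

// Stub models that count how often the detector is invoked.
let detectorCalls = 0;
const stubDetect = () => { detectorCalls++; return { x0: 0, y0: 0, x1: 1, y1: 1 }; };
const stubRefine = () => ({
  landmarks: [{ x: 0.4, y: 0.4 }, { x: 0.6, y: 0.6 }],
  presence: 0.9
});
const step = makeTracker(stubDetect, stubRefine);
for (let frame = 0; frame < 3; frame++) step(frame);
console.log(detectorCalls); // prints 1: the detector ran on the first frame only
```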

Reading the values: what's normalized, what isn't

Image-normalized

Most landmark x / y values are 0.0 → 1.0 relative to the image. Multiply by image width / height to get pixels. Convenient because it's resolution-independent.

Pseudo-depth

z values are relative, not metric. Useful for ordering (which finger is closer) but not for measuring real distance.

World coords

The "world" landmark output is in metres, centred on the midpoint of the subject's hips (pose) or on the hand's approximate geometric centre (hand). Use these for biomechanics and 3D rigs.

Confidence scores

0.0 → 1.0 sigmoid outputs. Visibility is "is this point in frame?", presence is "does this point exist in the image at all?".
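
Because world coordinates are metric, plain Euclidean distance between two landmarks gives a length in metres, which is not true of normalized image coordinates. The sketch below estimates shoulder width from pose landmarks 11 (left shoulder) and 12 (right shoulder); the helper names and sample coordinates are illustrative assumptions.

```javascript
// Euclidean distance between two 3D points.
function distance3d(a, b) {
  return Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);
}

// Shoulder indices in the 33-point pose topology.
const LEFT_SHOULDER = 11;
const RIGHT_SHOULDER = 12;

function shoulderWidthMetres(worldLandmarks) {
  return distance3d(worldLandmarks[LEFT_SHOULDER], worldLandmarks[RIGHT_SHOULDER]);
}

// Made-up data standing in for r.worldLandmarks[0].
const demoWorld = Array.from({ length: 33 }, () => ({ x: 0, y: 0, z: 0 }));
demoWorld[LEFT_SHOULDER] = { x: 0.18, y: -0.45, z: 0.0 };
demoWorld[RIGHT_SHOULDER] = { x: -0.18, y: -0.45, z: 0.04 };
const shoulderWidth = shoulderWidthMetres(demoWorld);
console.log(shoulderWidth.toFixed(3), 'm');
```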

Minimal Tasks Vision boilerplate

// Load the Tasks Vision bundle (browser, ES modules)
import { FilesetResolver, FaceLandmarker }
  from "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.21/vision_bundle.mjs";

const vision = await FilesetResolver.forVisionTasks(
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.21/wasm"
);

const faceLandmarker = await FaceLandmarker.createFromOptions(vision, {
  baseOptions: { modelAssetPath: ".../face_landmarker.task", delegate: "GPU" },
  outputFaceBlendshapes: true,
  outputFacialTransformationMatrixes: true,
  numFaces: 1,
  runningMode: "VIDEO"
});

// Each frame:
const result = faceLandmarker.detectForVideo(videoEl, performance.now());
// result.faceLandmarks → array of landmark arrays (one per face)
// result.faceBlendshapes → array of {categories: [{categoryName, score}, ...]}
// result.facialTransformationMatrixes → 4×4 matrix per face

Running modes — IMAGE vs VIDEO vs LIVE_STREAM

IMAGE

One still image per call. Each call is independent; no tracking carries over. Use for batch processing, dataset labelling, photo-app filters. Method: detect(imageBitmap).

VIDEO

Sequential frames + monotonic timestamps. The model uses tracking to stabilise output between frames. The Live Lab uses this. Method: detectForVideo(frame, timestampMs), called from your render loop.

LIVE_STREAM

Asynchronous. Push frames as fast as you like; results arrive via callback. Best when you can't block the UI thread. This mode belongs to the Android, iOS and Python APIs; the web package exposes only IMAGE and VIDEO. Method: detectAsync(frame, timestampMs) on Android (detect_async in Python); supply a result listener or callback in options.
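
VIDEO mode expects each call's timestamp to be later than the previous one, and clock values can repeat when two calls land close together. A small guard, sketched below, nudges repeated or out-of-order values forward. The helper name `makeMonotonicClock` and the 1 ms step are illustrative assumptions; in a render loop it would wrap `performance.now()`.

```javascript
// Hypothetical helper: returns a function that maps clock readings to
// strictly increasing timestamps.
function makeMonotonicClock() {
  let last = -Infinity;
  return function next(nowMs) {
    // Accept the reading if it moved forward, otherwise step 1 ms past the last one.
    last = nowMs > last ? nowMs : last + 1;
    return last;
  };
}

const nextTimestamp = makeMonotonicClock();
const stamps = [10, 10, 5, 20].map((t) => nextTimestamp(t));
console.log(stamps); // [10, 11, 12, 20]
```

In a browser loop this would look like `landmarker.detectForVideo(videoEl, nextTimestamp(performance.now()))`.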

Rough performance guidance

indicative · varies wildly by hardware
| Model | Variant | ~ms / frame | ~FPS | Notes |
| --- | --- | --- | --- | --- |
| Pose Landmarker | Lite | 3 – 6 | 150–300 | Mobile-friendly. Light on extreme poses. |
| Pose Landmarker | Full | 6 – 12 | 80–160 | Default. Solid all-rounder. |
| Pose Landmarker | Heavy | 12 – 25 | 40–80 | Best precision. Use when fast motion or tricky angles. |
| Face Landmarker | w/o blendshapes | 3 – 6 | 150–300 | Pure mesh, fastest face mode. |
| Face Landmarker | + blendshapes + matrix | 5 – 10 | 100–200 | Full mocap mode. The two extra heads add cost. |
| Hand Landmarker | | 3 – 6 / hand | 100–250 | Cost roughly doubles for two hands. |
| Gesture Recognizer | | 4 – 8 / hand | 90–200 | Hand Landmarker + tiny classifier. |
| Holistic (composed) | | 15 – 35 | 30–60 | Sum of three landmarkers. Drop Heavy pose for higher rates. |
| Object Detector | EfficientDet-Lite0 | 8 – 18 | 50–120 | Good baseline. |
| Object Detector | EfficientDet-Lite2 | 15 – 35 | 25–60 | More accurate, heavier. |
| Image Segmenter | Selfie / 256 | 3 – 8 | 120–300 | Tiny model, 256×256 internally. |
Numbers are indicative on a typical desktop with a discrete or modern integrated GPU and the GPU delegate. CPU delegate is usually 3–10× slower. Expect lower numbers on phones and battery-saver mode.
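
Since these figures vary so much by hardware, it is worth measuring on the target device. The sketch below keeps a rolling window of per-frame inference times and converts the average to FPS with fps = 1000 / ms. The helper name `makeFrameMeter`, the window size and the sample timings are illustrative assumptions.

```javascript
// Hypothetical helper: rolling average of frame times over the last N samples.
function makeFrameMeter(windowSize = 30) {
  const samples = [];
  return {
    add(ms) {
      samples.push(ms);
      if (samples.length > windowSize) samples.shift(); // drop the oldest sample
    },
    averageMs() {
      if (samples.length === 0) return 0;
      return samples.reduce((sum, ms) => sum + ms, 0) / samples.length;
    },
    fps() {
      const avg = this.averageMs();
      return avg > 0 ? 1000 / avg : 0;
    }
  };
}

// Made-up timings; with a window of 3 the first sample falls out.
const meter = makeFrameMeter(3);
[20, 10, 10, 10].forEach((ms) => meter.add(ms));
console.log(meter.averageMs(), meter.fps()); // 10 100
```

In a render loop, one would record `performance.now()` before and after the `detectForVideo` call and pass the difference to `meter.add`.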

Glossary

terms used throughout this guide
Landmark
A single named point detected on a body / face / hand. Output is a 3D coordinate plus, depending on the model, visibility and presence scores.
Blendshape
A facial expression target weighted 0–1 (e.g. jawOpen: 0.7). Adds linearly to a neutral mesh. MediaPipe outputs 52 ARKit-compatible weights per face.
Image-normalized coords
x and y as 0.0 → 1.0 across the input frame, regardless of resolution. Multiply by frame width / height to convert to pixels.
World coords
Real-world metric (metres) coordinates centred on the midpoint of the subject's hips (pose) or on the hand's approximate geometric centre (hand). Use for biomechanics, 3D rigs and AR.
Pseudo-depth (z)
A relative z value on roughly the same scale as x, measured from the hip midpoint (pose) or the wrist (hand). Ordering is reliable; absolute distance is not.
Visibility
Confidence (0–1) that the landmark is in frame and not occluded. Useful for masking unstable joints.
Presence
Confidence (0–1) that the landmark actually exists in the image (vs. predicted blindly). Distinct from visibility.
Handedness
The "Left" or "Right" label per detected hand, with a confidence score. Watch the mirror gotcha; see the Hand Landmarker section.
ROI (Region of Interest)
The cropped sub-rectangle of the input that the landmark model actually processes. Detector finds it, landmark refines inside it, tracker carries it to the next frame.
BlazePose / BlazeFace
The internal architectures behind the pose / face detectors. Single-shot detectors trained for low latency.
Transformation matrix
A 4×4 matrix mapping the canonical face mesh into camera (metric) space. Drives head rotation, position, and scale on a 3D rig.
ARKit
Apple's spec for facial blendshapes (52 names: jawOpen, browInnerUp, etc). MediaPipe outputs match it directly.
Tasks API vs Solutions API
Tasks (newer) is the unified runtime for all model types — vision, audio, text, generative. Solutions (legacy) was per-domain (Pose, Hands, etc). Most pipelines should target Tasks.
Delegate
The compute backend — 'GPU' (WebGL / WebGPU / GLES) or 'CPU' (WASM). GPU is much faster on supported hardware.
Model Maker
MediaPipe's tool for retraining classification heads (gestures, image classes, object categories) on your own data — without rewriting the underlying backbones.
Live Link Face / MetaHuman
Unreal Engine's facial mocap pipeline. Consumes ARKit-shaped blendshapes — so MediaPipe outputs drop straight in.

Keyboard shortcuts

work anywhere on the page
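| Key | Action |
| --- | --- |
| Space | Start / stop camera |
| M | Toggle mirror |
| O | Toggle skeleton overlay |
| 1 | Pose mode |
| 2 | Face mode |
| 3 | Hands mode |
| 4 | Gesture mode |
| 5 | Holistic mode |
| R | Start / stop recording |
| S | Snapshot PNG |
| B | Toggle WebSocket broadcast |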