A complete map of every input channel MediaPipe exposes — pose joints, face blendshapes, hand landmarks, gestures, segmentation maps, and more. Turn on your camera to see the values stream in real time, then explore the full reference for every model in the suite.
Tracks 33 body landmarks in 3D from a single video frame. Each landmark returns normalized image coordinates (x, y), depth (z), and visibility and presence scores. A second "world landmarks" output gives real-world metric coordinates in metres centred on the hips, ideal for animation rigs, biomechanics, and VR retargeting.
x: Horizontal position, 0.0 (left edge) → 1.0 (right edge) of the input frame.
y: Vertical position, 0.0 (top) → 1.0 (bottom). The y axis points down, like screen coordinates. Both x and y can fall slightly outside 0–1 when the model extrapolates a joint beyond the frame.
z: Depth relative to the midpoint of the hips. Smaller (more negative) values are closer to the camera; the magnitude uses roughly the same scale as x.
visibility: 0 → 1 confidence that the joint is visible (not occluded) and lies in frame.
presence: 0 → 1 confidence that the joint is present in the image at all.
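Turning the normalized channels into pixel positions is a multiply by the frame size. A minimal sketch, assuming each landmark is a plain `{ x, y, z, visibility }` object as in the result arrays; `toPixels`, `visiblePixels` and the 0.5 visibility cutoff are illustrative names and values, not MediaPipe API.

```javascript
// Convert one normalized landmark ({ x, y } in [0, 1], y pointing down) to pixels.
function toPixels(lm, width, height) {
  return { px: lm.x * width, py: lm.y * height };
}

// Pixel positions of the landmarks whose visibility clears a threshold.
// A missing visibility value is treated as 0, so such points are dropped.
function visiblePixels(landmarks, width, height, minVisibility = 0.5) {
  return landmarks
    .map((lm, index) => ({ index, ...toPixels(lm, width, height), visibility: lm.visibility ?? 0 }))
    .filter((p) => p.visibility >= minVisibility);
}

const sample = [
  { x: 0.5, y: 0.25, z: -0.1, visibility: 0.98 },
  { x: 0.1, y: 0.9, z: 0.2, visibility: 0.2 },
];
console.log(visiblePixels(sample, 1280, 720)); // one entry: index 0 at px 640, py 180
```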
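For reference, this is the standard 33-point BlazePose index order as a lookup array. The snake_case names mirror MediaPipe's `PoseLandmark` enum in lower case; `landmarkByName` is a hypothetical helper for readability, not part of the library.

```javascript
// The 33 pose landmark names in index order (0 = nose ... 32 = right foot index).
const POSE_LANDMARK_NAMES = [
  'nose',
  'left_eye_inner', 'left_eye', 'left_eye_outer',
  'right_eye_inner', 'right_eye', 'right_eye_outer',
  'left_ear', 'right_ear',
  'mouth_left', 'mouth_right',
  'left_shoulder', 'right_shoulder',
  'left_elbow', 'right_elbow',
  'left_wrist', 'right_wrist',
  'left_pinky', 'right_pinky',
  'left_index', 'right_index',
  'left_thumb', 'right_thumb',
  'left_hip', 'right_hip',
  'left_knee', 'right_knee',
  'left_ankle', 'right_ankle',
  'left_heel', 'right_heel',
  'left_foot_index', 'right_foot_index',
];

// Look up a landmark by name in one pose's 33-element landmark array.
function landmarkByName(landmarks, name) {
  const i = POSE_LANDMARK_NAMES.indexOf(name);
  if (i < 0) throw new Error(`Unknown pose landmark: ${name}`);
  return landmarks[i];
}
```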
Lite: Smallest and fastest. Good for mobile and other compute-constrained devices; lower accuracy on extreme poses.
Full: The default. A solid balance of accuracy and speed for most desktop and laptop scenarios.
Heavy: Best landmark precision, especially for fast motion and edge-case poses, at the cost of higher latency.
| Option | Type | Default | What it does |
|---|---|---|---|
| runningMode | enum | 'IMAGE' | 'IMAGE' / 'VIDEO' / 'LIVE_STREAM'. VIDEO uses cross-frame tracking — choose this for webcam. |
| numPoses | number | 1 | Maximum people to detect. Up to ~5 supported; cost scales linearly. |
| minPoseDetectionConfidence | number | 0.5 | Threshold for the detector stage. Raise to suppress false positives in busy scenes. |
| minPosePresenceConfidence | number | 0.5 | Threshold for the landmark presence head — how confident the model is the person is there. |
| minTrackingConfidence | number | 0.5 | Threshold for tracking continuity between frames. Lower = stickier (fewer re-detections). |
| outputSegmentationMasks | boolean | false | If true, result includes a per-pixel person mask. See note below. |
| baseOptions.delegate | enum | 'CPU' | 'GPU' (WebGL/WebGPU) or 'CPU' (WASM). Set 'GPU' explicitly to use the GPU; on supported hardware it is typically several times faster. |
| baseOptions.modelAssetPath | string | — | URL or local path to the .task model file (lite / full / heavy). |
Set outputSegmentationMasks: true to additionally receive result.segmentationMasks[0] — a single-channel mask the same resolution as the input. Each pixel is the model's confidence (0.0–1.0) that the pixel belongs to the person. Useful for AR compositing, virtual backgrounds, and driving alpha mattes for Blender/UE compositing without a separate segmenter.
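A soft alpha matte can be built from those confidences with a clamped linear ramp. This sketch assumes you have already copied the mask into a `Float32Array` (for example with the mask object's `getAsFloat32Array()`); the 0.3/0.7 ramp endpoints and the function name are arbitrary starting points to tune.

```javascript
// Turn per-pixel person confidences (0.0–1.0) into an 8-bit alpha matte.
// Below `lo` → fully transparent, above `hi` → fully opaque,
// in between → a linear ramp that softens the silhouette edge.
function confidenceToAlpha(confidences, lo = 0.3, hi = 0.7) {
  const alpha = new Uint8ClampedArray(confidences.length);
  const span = hi - lo;
  for (let i = 0; i < confidences.length; i++) {
    const t = (confidences[i] - lo) / span;
    alpha[i] = Math.round(Math.min(1, Math.max(0, t)) * 255);
  }
  return alpha;
}

const matte = confidenceToAlpha(new Float32Array([0.1, 0.5, 0.9]));
console.log(Array.from(matte)); // [0, 128, 255]
```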
```javascript
import { PoseLandmarker, FilesetResolver } from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.21/vision_bundle.mjs';

// Load the WASM runtime shared by all vision tasks.
const vision = await FilesetResolver.forVisionTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.21/wasm'
);

const pose = await PoseLandmarker.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/pose_landmarker/pose_landmarker_full/float16/latest/pose_landmarker_full.task',
    delegate: 'GPU'
  },
  runningMode: 'VIDEO',
  numPoses: 1,
  minPoseDetectionConfidence: 0.5,
  minTrackingConfidence: 0.5,
  outputSegmentationMasks: false
});

// per RAF tick (videoEl is a playing <video> element):
const r = pose.detectForVideo(videoEl, performance.now());
// r.landmarks[0]         → 33 image-space landmarks (x, y, z, visibility, presence)
// r.worldLandmarks[0]    → 33 metric coords centred on the hips
// r.segmentationMasks[0] → per-pixel person mask (if enabled)
```
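For biomechanics, a common first step is a joint angle. The sketch below computes the angle at a middle joint from three points. It assumes `worldLandmarks` as input, because all three axes there share metric units, and uses the standard indices 11 (left shoulder), 13 (left elbow) and 15 (left wrist); the helper names are illustrative.

```javascript
// Angle in degrees at joint b, between segments b→a and b→c (180 = straight).
function jointAngleDeg(a, b, c) {
  const u = [a.x - b.x, a.y - b.y, a.z - b.z];
  const v = [c.x - b.x, c.y - b.y, c.z - b.z];
  const dot = u[0] * v[0] + u[1] * v[1] + u[2] * v[2];
  const lenU = Math.hypot(u[0], u[1], u[2]);
  const lenV = Math.hypot(v[0], v[1], v[2]);
  if (lenU === 0 || lenV === 0) return NaN; // coincident points: angle undefined
  const cos = Math.min(1, Math.max(-1, dot / (lenU * lenV))); // guard rounding
  return (Math.acos(cos) * 180) / Math.PI;
}

// Left elbow angle from one pose's worldLandmarks array (indices 11, 13, 15).
function leftElbowAngle(world) {
  return jointAngleDeg(world[11], world[13], world[15]);
}

const shoulder = { x: 0, y: 0, z: 0 };
const elbow = { x: 0, y: 0.3, z: 0 };
const wrist = { x: 0.25, y: 0.3, z: 0 };
console.log(jointAngleDeg(shoulder, elbow, wrist).toFixed(1)); // 90.0
```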
The most channel-rich model in the suite. Outputs a 478-point face mesh (468 face + 10 iris), plus 52 ARKit-compatible blendshape weights, plus a 4×4 facial transformation matrix in metric space. Together, this is everything you need to drive a MetaHuman, ARKit avatar, or custom rig.
Face mesh (478 points): Dense topology covering the entire face surface, indexed identically to TFLite Face Mesh, so community UV maps and rigs are interchangeable.
Iris (10 points): Indices 468–472 form the iris that MediaPipe's connection sets label RIGHT_IRIS (centre 468 + 4 perimeter points), and 473–477 form LEFT_IRIS (centre 473 + 4 perimeter points). Use the centre point for gaze direction. The perimeter points give the apparent iris diameter, which, compared against a typical real iris diameter of about 11.7 mm, yields an estimate of eye-to-camera distance.
Blendshapes (52): Each value 0.0–1.0. Directly weight ARKit/MetaHuman pose targets. Includes neutral, brows, eyes, jaw, mouth, cheeks, nose.
Transformation matrix (4×4): Homogeneous matrix mapping the canonical mesh into camera space (metric units). Drive head rotation, position, and scale from this.
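The iris-size idea can be sketched with a pinhole-camera model: distance ≈ focal length (px) × real iris diameter / iris diameter (px). Assumptions, not MediaPipe outputs: an average iris diameter of about 11.7 mm, and a focal length in pixels from your own camera calibration. The diameter is estimated as twice the mean centre-to-perimeter distance, so no particular ordering of the four perimeter points is assumed.

```javascript
// Assumed average human iris diameter; not a MediaPipe output.
const IRIS_DIAMETER_MM = 11.7;

// Estimate camera-to-eye distance (mm) for the iris whose centre is at
// centerIndex (468 or 473); its 4 perimeter points follow the centre.
// focalLengthPx must come from your own calibration.
function irisDistanceMm(faceLandmarks, centerIndex, imageWidth, imageHeight, focalLengthPx) {
  const toPx = (p) => [p.x * imageWidth, p.y * imageHeight];
  const [cx, cy] = toPx(faceLandmarks[centerIndex]);
  let radiusSum = 0;
  for (let k = 1; k <= 4; k++) {
    const [px, py] = toPx(faceLandmarks[centerIndex + k]);
    radiusSum += Math.hypot(px - cx, py - cy);
  }
  const diameterPx = (2 * radiusSum) / 4; // twice the mean radius
  return (focalLengthPx * IRIS_DIAMETER_MM) / diameterPx;
}
```

Because the landmarks are normalized, both image dimensions are needed to measure the diameter in true pixels.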
| Option | Type | Default | What it does |
|---|---|---|---|
| runningMode | enum | 'IMAGE' | 'IMAGE' / 'VIDEO' / 'LIVE_STREAM'. |
| numFaces | number | 1 | Maximum faces to track. Each face costs an extra inference pass. |
| minFaceDetectionConfidence | number | 0.5 | Detector threshold. |
| minFacePresenceConfidence | number | 0.5 | Landmark presence threshold. |
| minTrackingConfidence | number | 0.5 | Cross-frame tracking threshold. |
| outputFaceBlendshapes | boolean | false | Critical: set true to receive the 52 ARKit blendshape weights. Disabled by default to save compute. |
| outputFacialTransformationMatrixes | boolean | false | Set true to receive the 4×4 head-pose matrix per face. |
| baseOptions.delegate | enum | 'CPU' | 'GPU' or 'CPU'. Set 'GPU' explicitly for hardware acceleration. |
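With `outputFacialTransformationMatrixes` enabled, each face yields 16 numbers. A sketch of pulling out translation, scale and Euler angles, assuming column-major storage and an upper 3×3 that is a rotation times a uniform scale. The Y-X-Z Euler order is one common choice, not something MediaPipe prescribes; sign conventions depend on the camera frame and your rig, so verify against a deliberate head turn. The decomposition degrades near ±90° pitch (gimbal lock).

```javascript
// Decompose a column-major 4×4 transform (16 numbers) into translation,
// uniform scale and Y-X-Z Euler angles (R = Ry(yaw) · Rx(pitch) · Rz(roll)).
function decomposeHeadMatrix(m) {
  const at = (row, col) => m[col * 4 + row]; // column-major indexing
  const translation = { x: at(0, 3), y: at(1, 3), z: at(2, 3) };
  const scale = Math.hypot(at(0, 0), at(1, 0), at(2, 0)); // length of the first column
  const r = (row, col) => at(row, col) / scale; // rotation-only entries
  const pitch = Math.asin(-Math.min(1, Math.max(-1, r(1, 2))));
  const yaw = Math.atan2(r(0, 2), r(2, 2));
  const roll = Math.atan2(r(1, 0), r(1, 1));
  const toDeg = (rad) => (rad * 180) / Math.PI;
  return { translation, scale, yawDeg: toDeg(yaw), pitchDeg: toDeg(pitch), rollDeg: toDeg(roll) };
}
```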
```javascript
// Assumes FaceLandmarker was imported and `vision` created as in the Pose example.
const face = await FaceLandmarker.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/latest/face_landmarker.task',
    delegate: 'GPU'
  },
  runningMode: 'VIDEO',
  numFaces: 1,
  outputFaceBlendshapes: true,             // 52 ARKit-style weights; drives MetaHuman / Live Link Face
  outputFacialTransformationMatrixes: true // 4×4 head pose
});

const r = face.detectForVideo(videoEl, performance.now());
// r.faceLandmarks[0]                     → 478 mesh points (x, y, z)
// r.faceBlendshapes[0].categories        → 52 weighted shapes [{ categoryName, score }]
// r.facialTransformationMatrixes[0].data → 16 numbers, column-major 4×4
```
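Most rigs want a name → weight lookup rather than an array. This sketch flattens `categories` into an object and, by default, drops MediaPipe's `_neutral` entry. MediaPipe's set has no `tongueOut`, so an ARKit target of that name simply receives no weight. The helper name is illustrative.

```javascript
// Flatten faceBlendshapes[0].categories ([{ categoryName, score }, ...]) into
// { jawOpen: 0.7, ... }, optionally skipping MediaPipe's '_neutral' entry.
function blendshapesToWeights(categories, { dropNeutral = true } = {}) {
  const weights = {};
  for (const { categoryName, score } of categories) {
    if (dropNeutral && categoryName === '_neutral') continue;
    weights[categoryName] = score;
  }
  return weights;
}

const weights = blendshapesToWeights([
  { categoryName: '_neutral', score: 0.01 },
  { categoryName: 'jawOpen', score: 0.7 },
  { categoryName: 'eyeBlinkLeft', score: 0.05 },
]);
console.log(weights); // { jawOpen: 0.7, eyeBlinkLeft: 0.05 }
```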
Tracks up to two hands independently, each with 21 landmarks. Returns image-space coordinates, world coordinates (metric), and a handedness label (Left / Right) with a confidence score. The skeleton topology is symmetric and great for sign language, AR interaction, and instrument tracking.
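CMC (carpometacarpal, thumb only): Base of the thumb where it meets the wrist, the "ball" of the joint.
MCP (metacarpophalangeal): Knuckle joint where the finger meets the palm.
PIP (proximal interphalangeal): The middle joint of each finger, the one that bends most when curling. The thumb has a single IP joint in this position.
DIP (distal interphalangeal): The joint nearest the fingertip.
TIP: The very end of the digit.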
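In MediaPipe's 21-point hand topology, index 0 is the wrist and each digit contributes four points from base to tip (thumb: CMC, MCP, IP, TIP; fingers: MCP, PIP, DIP, TIP). The parent → child bone list below is a simple hierarchy for retargeting. It is an illustrative construction, not MediaPipe's own drawing-connection set, which also links the knuckles across the palm.

```javascript
// Hand landmark indices grouped per digit, base to tip. Index 0 is the wrist.
const WRIST = 0;
const FINGERS = {
  thumb: [1, 2, 3, 4], // CMC, MCP, IP, TIP
  index: [5, 6, 7, 8], // MCP, PIP, DIP, TIP
  middle: [9, 10, 11, 12],
  ring: [13, 14, 15, 16],
  pinky: [17, 18, 19, 20],
};

// Bone segments as [parent, child] index pairs: wrist → digit base, then along the digit.
function handBones() {
  const bones = [];
  for (const chain of Object.values(FINGERS)) {
    bones.push([WRIST, chain[0]]);
    for (let i = 1; i < chain.length; i++) bones.push([chain[i - 1], chain[i]]);
  }
  return bones;
}

console.log(handBones().length); // 20
```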
Handedness labels are easy to get backwards. The legacy MediaPipe Hands documentation states that handedness is determined assuming a mirrored (selfie-style) input image, and mirroring the on-screen preview with CSS does not change the pixels the model receives. Depending on whether the frames you feed the model are flipped, the label can therefore be the opposite of the user's physical hand. Fix: check once with a known hand, then either change whether you flip the input you feed the model, or invert handedness.categoryName in your downstream code for your mirroring setup. The Live Lab toggles a banner to remind you.
| Option | Type | Default | What it does |
|---|---|---|---|
| runningMode | enum | 'IMAGE' | 'IMAGE' / 'VIDEO' / 'LIVE_STREAM'. |
| numHands | number | 1 | Maximum hands. Set to 2 for both. Each hand is its own inference pass. |
| minHandDetectionConfidence | number | 0.5 | Detector threshold. |
| minHandPresenceConfidence | number | 0.5 | Landmark presence threshold. |
| minTrackingConfidence | number | 0.5 | Cross-frame tracking continuity. Lower = stickier track. |
| baseOptions.delegate | enum | 'CPU' | 'GPU' or 'CPU'. Set 'GPU' explicitly for hardware acceleration. |
```javascript
const hands = await HandLandmarker.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/hand_landmarker/hand_landmarker/float16/latest/hand_landmarker.task',
    delegate: 'GPU'
  },
  runningMode: 'VIDEO',
  numHands: 2
});

const r = hands.detectForVideo(videoEl, performance.now());
// r.landmarks      → [hand1Landmarks, hand2Landmarks], each 21 × {x, y, z}
// r.worldLandmarks → metric coords, origin at the hand's approximate geometric centre
// r.handedness     → [[{ categoryName: 'Left'|'Right', score }], ...]

// Mirror-aware handedness: invert the label when your setup calls for it.
function trueHand(label, mirrored) {
  if (!mirrored) return label;
  return label === 'Left' ? 'Right' : 'Left';
}
```
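A typical interaction built on these channels is a pinch. This sketch measures thumb tip (4) to index tip (8) in one hand's `worldLandmarks` (metres) and adds hysteresis so the state does not flicker near the threshold. The 2 cm / 4 cm thresholds and the function names are illustrative assumptions to tune per app.

```javascript
// Distance in metres between thumb tip (4) and index tip (8) of one hand.
function pinchDistanceM(world) {
  const a = world[4];
  const b = world[8];
  return Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);
}

// Stateful detector: engage below onBelowM, release only above offAboveM.
function makePinchDetector(onBelowM = 0.02, offAboveM = 0.04) {
  let pinched = false;
  return (world) => {
    const d = pinchDistanceM(world);
    if (!pinched && d < onBelowM) pinched = true;
    else if (pinched && d > offAboveM) pinched = false;
    return pinched;
  };
}
```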
Builds on the Hand Landmarker and adds a classifier on top — outputting a categorical gesture label per hand alongside the landmarks. The default model ships with 8 classes (7 gestures + None). The classifier is replaceable: train your own custom gesture set with Model Maker and swap it in.
| # | Class | Symbol | Description |
|---|---|---|---|
| 0 | None | — | No recognized gesture / below confidence threshold |
| 1 | Closed_Fist | ✊ | All four fingers and thumb closed into a fist |
| 2 | Open_Palm | 🖐 | Hand fully open, fingers extended and spread |
| 3 | Pointing_Up | ☝ | Index finger extended upward, others closed |
| 4 | Thumb_Down | 👎 | Thumb extended downward, fingers curled |
| 5 | Thumb_Up | 👍 | Thumb extended upward, fingers curled |
| 6 | Victory | ✌ | Index + middle extended (peace / victory sign) |
| 7 | ILoveYou | 🤟 | Thumb + index + pinky extended (ASL "I love you") |
Use MediaPipe Model Maker to retrain the classification head on your own dataset (15–50 samples per gesture is enough to start). The hand landmarker stays fixed; only the lightweight classifier swaps. Output stays in the same shape: category_name + score.
| Option | Type | Default | What it does |
|---|---|---|---|
| runningMode | enum | 'IMAGE' | 'IMAGE' / 'VIDEO' / 'LIVE_STREAM'. |
| numHands | number | 1 | Maximum hands. Each hand gets its own gesture classification. |
| minHandDetectionConfidence | number | 0.5 | Detector threshold for finding hands. |
| minHandPresenceConfidence | number | 0.5 | Landmark presence threshold. |
| minTrackingConfidence | number | 0.5 | Cross-frame tracking threshold. |
| cannedGesturesClassifierOptions | object | {} | { scoreThreshold, categoryAllowlist, categoryDenylist } for the built-in 8 classes. |
| customGesturesClassifierOptions | object | {} | Same shape as canned, but applied to your custom Model-Maker-trained classifier. |
| baseOptions.delegate | enum | 'CPU' | 'GPU' or 'CPU'. Set 'GPU' explicitly for hardware acceleration. |
```javascript
const gestures = await GestureRecognizer.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/gesture_recognizer/gesture_recognizer/float16/latest/gesture_recognizer.task',
    delegate: 'GPU'
  },
  runningMode: 'VIDEO',
  numHands: 2
});

const r = gestures.recognizeForVideo(videoEl, performance.now());
// r.gestures       → [[{ categoryName: 'Thumb_Up', score: 0.93 }], ...] (top-1 per hand)
// r.handedness     → [[{ categoryName: 'Left'|'Right', score }], ...]
// r.landmarks      → 21 image-space landmarks per hand
// r.worldLandmarks → metric coords per hand
```
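Per-frame top-1 labels can flicker between classes. A small debouncer that only reports a gesture after it has held for several consecutive frames is a common fix. The frame count and score floor below are illustrative, and the input is assumed to be one hand's best category, such as `r.gestures[0]?.[0]`.

```javascript
// Report a gesture only after it has been the top-1 label for `minFrames`
// consecutive frames with score >= minScore; otherwise keep the last stable label.
function makeGestureDebouncer(minFrames = 5, minScore = 0.6) {
  let candidate = 'None';
  let count = 0;
  let stable = 'None';
  return (top) => {
    // `top` is { categoryName, score } or undefined when no hand is present.
    const label = top && top.score >= minScore ? top.categoryName : 'None';
    if (label === candidate) {
      count += 1;
    } else {
      candidate = label;
      count = 1;
    }
    if (count >= minFrames) stable = candidate;
    return stable;
  };
}
```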
The full-body monster: combines pose + face + both hands in a single inference graph, with shared cropping/tracking between sub-models for efficiency. This is the model that powers most full-body avatar mocap pipelines.
HolisticLandmarker is now part of MediaPipe Tasks Vision (Web/Python). On platforms where it isn't yet shipped, you can run three landmarkers side by side on each frame (pose, face and hand) and merge the results. The Live Lab's Holistic mode demonstrates this composed approach, whose channels closely mirror the legacy Solutions API output.
In the native Holistic graph, the pose model runs first; its wrist and face anchors are used to crop ROIs that are passed to the face mesh and hand landmarkers, which is far cheaper than running each model on the full frame.
All three sub-models return for the same frame, so timestamps line up perfectly. Critical for clean retargeting.
Hand fidelity inside Holistic is lower than running the standalone Hand Landmarker on a tight crop. For close-up hand work, prefer Hand Landmarker.
Full-body avatar mocap (Blender/UE rigs), live performance capture, dance / yoga / fitness apps where you need everything at once.
| Option | Type | Default | What it does |
|---|---|---|---|
| poseLandmarker.* | object | — | All Pose Landmarker options apply (numPoses, confidences, segmentation, model variant, delegate). |
| faceLandmarker.* | object | — | All Face Landmarker options apply. Set outputFaceBlendshapes & outputFacialTransformationMatrixes to true for full mocap. |
| handLandmarker.* | object | — | All Hand Landmarker options apply. numHands: 2 recommended. |
| timestamp | number | — | Pass the same performance.now() to all three for guaranteed alignment. |
```javascript
// Model files (same URLs as the standalone examples above)
const MODELS = 'https://storage.googleapis.com/mediapipe-models';
const poseModel = { modelAssetPath: `${MODELS}/pose_landmarker/pose_landmarker_full/float16/latest/pose_landmarker_full.task`, delegate: 'GPU' };
const faceModel = { modelAssetPath: `${MODELS}/face_landmarker/face_landmarker/float16/latest/face_landmarker.task`, delegate: 'GPU' };
const handModel = { modelAssetPath: `${MODELS}/hand_landmarker/hand_landmarker/float16/latest/hand_landmarker.task`, delegate: 'GPU' };

// Init three landmarkers once
const [pose, face, hands] = await Promise.all([
  PoseLandmarker.createFromOptions(vision, { runningMode: 'VIDEO', numPoses: 1, baseOptions: poseModel }),
  FaceLandmarker.createFromOptions(vision, {
    runningMode: 'VIDEO', numFaces: 1,
    outputFaceBlendshapes: true, outputFacialTransformationMatrixes: true,
    baseOptions: faceModel
  }),
  HandLandmarker.createFromOptions(vision, { runningMode: 'VIDEO', numHands: 2, baseOptions: handModel })
]);

// per RAF tick, run all three with the same timestamp
const t = performance.now();
const [poseR, faceR, handsR] = [
  pose.detectForVideo(videoEl, t),
  face.detectForVideo(videoEl, t),
  hands.detectForVideo(videoEl, t)
];

// Merge into a single holistic payload
const holistic = {
  timestamp: t,
  pose: poseR.landmarks?.[0] ?? null,
  poseWorld: poseR.worldLandmarks?.[0] ?? null,
  face: faceR.faceLandmarks?.[0] ?? null,
  blendshapes: faceR.faceBlendshapes?.[0]?.categories ?? null,
  headMatrix: faceR.facialTransformationMatrixes?.[0]?.data ?? null,
  hands: handsR.landmarks?.map((lm, i) => ({
    label: handsR.handedness[i][0].categoryName,
    score: handsR.handedness[i][0].score,
    landmarks: lm,
    world: handsR.worldLandmarks[i]
  })) ?? []
};
```
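If downstream code expects the native Holistic layout, the composed `hands` array can be sorted into left/right slots. A sketch, assuming the `{ label, score, landmarks, world }` entries built in the merge step; `splitHands` is a hypothetical helper, the higher-scoring detection wins when two claim the same side, and `mirrored` flips labels for setups where the handedness note applies.

```javascript
// Sort composed-holistic hand entries into leftHand / rightHand slots,
// loosely mirroring the native HolisticLandmarker result layout.
function splitHands(hands, mirrored = false) {
  const out = { leftHand: null, rightHand: null };
  for (const hand of hands) {
    let label = hand.label;
    if (mirrored) label = label === 'Left' ? 'Right' : 'Left';
    const slot = label === 'Left' ? 'leftHand' : 'rightHand';
    // If two detections claim the same side, keep the more confident one.
    if (!out[slot] || hand.score > out[slot].score) out[slot] = hand;
  }
  return out;
}
```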
HolisticLandmarker (Tasks Vision):
```javascript
import { HolisticLandmarker, FilesetResolver } from '@mediapipe/tasks-vision';

// `vision` comes from FilesetResolver.forVisionTasks(...), as in the Pose example.
const holistic = await HolisticLandmarker.createFromOptions(vision, {
  baseOptions: { modelAssetPath: '.../holistic_landmarker.task', delegate: 'GPU' },
  runningMode: 'VIDEO',
  outputFaceBlendshapes: true,
  minFaceDetectionConfidence: 0.5,
  minHandLandmarksConfidence: 0.5,
  minPoseDetectionConfidence: 0.5
});

const r = holistic.detectForVideo(videoEl, performance.now());
// r.poseLandmarks, r.poseWorldLandmarks
// r.faceLandmarks, r.faceBlendshapes
// r.leftHandLandmarks, r.rightHandLandmarks
// r.leftHandWorldLandmarks, r.rightHandWorldLandmarks
```
Localizes objects in the frame and labels each with a category. Returns a list of detections, each with a bounding box, one or more categories with scores, and an optional keypoint set on supported models. The default models are EfficientDet-Lite variants trained on COCO.
| Channel | Type | Range | Description |
|---|---|---|---|
| bbox.origin_x | geometry | 0 → W | Top-left X in pixels of the input image. |
| bbox.origin_y | geometry | 0 → H | Top-left Y in pixels of the input image. |
| bbox.width | geometry | px | Width of the bounding box in pixels. |
| bbox.height | geometry | px | Height of the bounding box in pixels. |
| categories[].category_name | label | str | e.g. "person", "cup", "laptop". COCO label set by default. |
| categories[].score | confidence | 0 → 1 | Per-category confidence, sorted descending. |
| categories[].index | label | int | Numeric class index in the model's labelmap. |
| keypoints[] | geometry | opt. | Some specialised models output keypoints with each detection (e.g. face, hand corners). |
| Option | Type | Default | What it does |
|---|---|---|---|
| runningMode | enum | 'IMAGE' | 'IMAGE' / 'VIDEO' / 'LIVE_STREAM'. |
| maxResults | number | -1 | Cap on detections returned. -1 = all that pass threshold. |
| scoreThreshold | number | 0.5 | Minimum confidence — anything lower is dropped. |
| categoryAllowlist | string[] | [] | Whitelist of category names to keep. Empty = all. |
| categoryDenylist | string[] | [] | Blacklist of category names to drop. |
| baseOptions.modelAssetPath | string | — | The stock models are EfficientDet-Lite0 (faster) and EfficientDet-Lite2 (more accurate). |
| baseOptions.delegate | enum | 'CPU' | 'GPU' or 'CPU'. Set 'GPU' explicitly for hardware acceleration. |
```javascript
const det = await ObjectDetector.createFromOptions(vision, {
  baseOptions: {
    // Object detector models are distributed as .tflite files
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/object_detector/efficientdet_lite0/float16/latest/efficientdet_lite0.tflite',
    delegate: 'GPU'
  },
  runningMode: 'VIDEO',
  scoreThreshold: 0.5,
  categoryAllowlist: ['person', 'cup', 'laptop']
});

const r = det.detectForVideo(videoEl, performance.now());
// r.detections → [{ boundingBox: { originX, originY, width, height },
//                   categories: [{ categoryName, score, index }] }, ...]
```
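The detection shape above carries no persistent object IDs, so matching boxes across frames (or de-duplicating overlaps) usually starts with intersection-over-union. A minimal sketch over the `boundingBox` fields; the function name and the sample boxes are illustrative.

```javascript
// Intersection-over-union of two boxes shaped { originX, originY, width, height }
// (pixel units, origin at the top-left). Returns 0 when they don't overlap.
function iou(a, b) {
  const x1 = Math.max(a.originX, b.originX);
  const y1 = Math.max(a.originY, b.originY);
  const x2 = Math.min(a.originX + a.width, b.originX + b.width);
  const y2 = Math.min(a.originY + a.height, b.originY + b.height);
  const inter = Math.max(0, x2 - x1) * Math.max(0, y2 - y1);
  const union = a.width * a.height + b.width * b.height - inter;
  return union > 0 ? inter / union : 0;
}

const prev = { originX: 100, originY: 50, width: 200, height: 100 };
const curr = { originX: 150, originY: 50, width: 200, height: 100 };
console.log(iou(prev, curr)); // 0.6
```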
Per-pixel classification of the input image. Output is a segmentation mask (one or more) where each pixel is labelled with a class index or class probability. Use cases: virtual backgrounds, AR clothing, hair styling, portrait lighting, body part isolation.
Selfie segmenter: 2 classes (background, person). Lightweight; ideal for video calls and Zoom-style virtual backgrounds.
Multi-class selfie segmenter: 6 classes (background, hair, body-skin, face-skin, clothes, others).
Hair segmenter: 2 classes (background, hair). Higher hair-edge precision than the multi-class model.
DeepLab-v3: 21 PASCAL VOC classes (background plus 20 object classes covering people, animals, vehicles, and indoor objects).
Interactive segmenter: Click or tap a point in the image and the model returns a mask of that object. Built on the MagicTouch model.
| Idx | Class | RGB hint | Use |
|---|---|---|---|
| 0 | background | 0,0,0 | Everything not part of the subject. Use for virtual backgrounds. |
| 1 | hair | 128,0,0 | Scalp hair. Drive AR hair colour, virtual styling. |
| 2 | body-skin | 0,128,0 | Skin on neck, arms, hands. |
| 3 | face-skin | 128,128,0 | Skin on the face. Useful for makeup / beautification effects. |
| 4 | clothes | 0,0,128 | Garments. Drive AR try-on or background-replacement edge cases. |
| 5 | others | 128,0,128 | Subject pixels that don't fit the four classes (glasses, hats, etc). |
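To visualise a category mask with these RGB hints, map each class index to a colour. A sketch, assuming the class indices are already in a `Uint8Array` (for example from `categoryMask.getAsUint8Array()`). The output is RGBA bytes that a browser can wrap with `new ImageData(rgba, width, height)`; leaving background transparent and using an alpha of 160 are arbitrary choices.

```javascript
// RGB hints from the table above, indexed by class id (0 = background ... 5 = others).
const CLASS_COLORS = [
  [0, 0, 0],
  [128, 0, 0],
  [0, 128, 0],
  [128, 128, 0],
  [0, 0, 128],
  [128, 0, 128],
];

// classIds: one class index per pixel. Returns RGBA bytes (4 per pixel).
function colorizeCategoryMask(classIds, alpha = 160) {
  const rgba = new Uint8ClampedArray(classIds.length * 4);
  for (let i = 0; i < classIds.length; i++) {
    const id = classIds[i];
    const [r, g, b] = CLASS_COLORS[id] ?? [255, 255, 255]; // unexpected ids → white
    rgba.set([r, g, b, id === 0 ? 0 : alpha], i * 4); // background stays transparent
  }
  return rgba;
}
```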
| Channel | Shape | Type | Description |
|---|---|---|---|
| category_mask | H × W × 1 | uint8 | Each pixel is the integer class index (0 to number of classes − 1). |
| confidence_masks[k] | H × W × 1 | float32 | One mask per class — pixel value = probability ∈ [0, 1] of belonging to class k. |
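The two outputs are related: picking, for each pixel, the class with the highest confidence turns `confidence_masks` into something equivalent to a category mask. A sketch of that argmax over per-class `Float32Array`s; it illustrates the relationship and is not necessarily how the library computes its own category mask internally.

```javascript
// confidenceMasks: array of Float32Array, one per class, all the same length.
// Returns one class index per pixel (ties go to the lower index).
function argmaxMasks(confidenceMasks) {
  const n = confidenceMasks[0].length;
  const out = new Uint8Array(n);
  for (let i = 0; i < n; i++) {
    let best = 0;
    for (let k = 1; k < confidenceMasks.length; k++) {
      if (confidenceMasks[k][i] > confidenceMasks[best][i]) best = k;
    }
    out[i] = best;
  }
  return out;
}
```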
| Option | Type | Default | What it does |
|---|---|---|---|
| runningMode | enum | 'IMAGE' | 'IMAGE' / 'VIDEO' / 'LIVE_STREAM'. |
| outputCategoryMask | boolean | false | If true, returns a single H×W uint8 mask with class indices. |
| outputConfidenceMasks | boolean | true | If true, returns one float mask per class with probabilities. |
| displayNamesLocale | string | 'en' | Locale for class display names (where available). |
| baseOptions.modelAssetPath | string | — | Selfie / Selfie Multiclass / Hair / DeepLab-v3: pick the model for your use case. The interactive MagicTouch model runs through the separate InteractiveSegmenter task. |
| baseOptions.delegate | enum | 'CPU' | 'GPU' or 'CPU'. Set 'GPU' explicitly for hardware acceleration. |
```javascript
const seg = await ImageSegmenter.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/image_segmenter/selfie_multiclass_256x256/float32/latest/selfie_multiclass_256x256.tflite',
    delegate: 'GPU'
  },
  runningMode: 'VIDEO',
  outputCategoryMask: true,
  outputConfidenceMasks: false
});

const r = seg.segmentForVideo(videoEl, performance.now());
// r.categoryMask → MPMask (.getAsUint8Array() → H*W bytes, each a class idx 0–5)

// IMPORTANT: call r.close() when you're done with the masks to free GPU memory
r.close();
```
Face Detector: Lightweight cousin of Face Landmarker, based on BlazeFace. Returns a bounding box, a score, and 6 keypoints: right eye, left eye, nose tip, mouth centre, right ear tragion, left ear tragion. Ideal as a cheap pre-stage before a heavier model.
Face Stylizer: Stylizes the face region into a target style (cartoon, oil painting, sketch). Custom styles are trainable via Model Maker.
Image Classifier: Whole-image label classification with an EfficientNet-Lite model by default. Returns the top-K categories with scores.
Image Embedder: Returns a 1024-d feature vector per image. Use it for similarity search, clustering, and k-NN retrieval.
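Similarity search over embeddings usually means cosine similarity. A minimal sketch over plain numeric arrays, plus a brute-force k-nearest-neighbour lookup; the `{ id, vector }` library format is a hypothetical choice, and large collections would need a proper vector index instead of a linear scan.

```javascript
// Cosine similarity: 1 = same direction, 0 = orthogonal, -1 = opposite.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('Embedding lengths differ');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // undefined for zero vectors; treat as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Linear scan over an in-memory library of { id, vector } items.
function nearest(query, library, k = 3) {
  return library
    .map((item) => ({ id: item.id, similarity: cosineSimilarity(query, item.vector) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, k);
}
```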
Legacy Pose (Solutions API): Older single-step pose API from the Solutions package. Same 33-point topology, different runtime.
Audio Classifier: The default model is YamNet, which classifies short audio chunks (typically 0.975 s windows) into AudioSet's 521-class ontology (speech, music, footsteps, applause, sirens, etc.).
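YamNet works on 16 kHz mono audio, where a 0.975 s window is 15,600 samples. The classifier handles framing itself, so this sketch only illustrates the arithmetic, for example when preparing clips for your own pipeline. The 50% hop and the `frameAudio` name are assumptions.

```javascript
// Split mono PCM samples into fixed-length windows.
// 0.975 s at 16 kHz = 15,600 samples; hop of 0.4875 s = 50% overlap.
function frameAudio(samples, sampleRate = 16000, windowSeconds = 0.975, hopSeconds = 0.4875) {
  const windowSize = Math.round(windowSeconds * sampleRate);
  const hopSize = Math.round(hopSeconds * sampleRate);
  const frames = [];
  // Trailing samples that don't fill a whole window are dropped.
  for (let start = 0; start + windowSize <= samples.length; start += hopSize) {
    frames.push(samples.subarray(start, start + windowSize)); // views, not copies
  }
  return frames;
}

console.log(frameAudio(new Float32Array(16000)).length); // 1
```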
Audio Embedder: Returns a feature vector for an audio chunk; use it for retrieval and similarity search in sound libraries.
Text Classifier: Sentiment, topic, or safety classification of arbitrary text, using BERT-based models.
Text Embedder: Sentence-level embeddings for semantic search and clustering.
Language Detector: Detects the language of a text snippet across 100+ languages, with confidence scores.
LLM Inference: On-device inference for small open LLMs (Gemma, Phi, Falcon variants), running locally on web, Android, and iOS.
Image Generator: On-device Stable-Diffusion-style generation (currently Android). Text-to-image, with optional ControlNet.
A non-technical mental model: MediaPipe is a graph of small ML models that pass tensors to each other, all running on-device. Most vision pipelines follow the same recipe — detect, crop, refine, output.
1. Input. A camera frame arrives as an RGB image (H × W × 3) and is converted to a normalized tensor for the models. MediaPipe wraps it in an ImageFrame with a timestamp so downstream nodes know which frame each output belongs to.
2. Detect. A small, fast model (e.g. BlazeFace, BlazePalm, the BlazePose detector) finds where the subject is in the frame and returns a tight crop. This is the cheap step, and in VIDEO mode it only needs to run when there is no valid track.
3. Refine. The crop is fed to the heavier landmark model (Face Mesh, Hand Landmarker, Pose Landmarker), which regresses the precise landmark coordinates inside the crop.
4. Track. Once landmarks are known, MediaPipe predicts the ROI for the next frame from them, skipping the detector entirely while tracking is stable. This is why MediaPipe is so fast: most frames only run the landmark model.
5. Derive. Some channels are derived: blendshapes are regressed from the face mesh by a small MLP head; world coordinates are computed from image landmarks via a separate sub-network; visibility and presence are sigmoid outputs from auxiliary heads.
6. Output. Each call returns a structured object: arrays of landmarks, blendshape weights, classification categories, segmentation masks. You read the channels you care about and feed them to your rig, your widget, or your analytics, whatever happens next.
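The detect → refine → track loop can be sketched as a small state machine. `detect` and `refine` here are hypothetical stand-ins for the detector and landmark models, injected so the control flow runs on its own. The square ROI padded to 1.25× the landmark extent is a simplification; MediaPipe's real ROI logic also handles rotation.

```javascript
// Hypothetical tracker (not MediaPipe's internal code).
// detect(frame) → ROI or null; refine(frame, roi) → { landmarks, presence }.
function makeTracker(detect, refine, minPresence = 0.5, roiScale = 1.25) {
  let roi = null; // crop carried over from the previous frame
  return (frame) => {
    if (!roi) roi = detect(frame); // run the detector only when there is no track
    if (!roi) return null; // nothing found in this frame
    const { landmarks, presence } = refine(frame, roi);
    if (presence < minPresence) {
      roi = null; // track lost: re-detect on the next frame
      return null;
    }
    roi = roiFromLandmarks(landmarks, roiScale); // predict the next frame's crop
    return landmarks;
  };
}

// Square, padded box around normalized landmarks, clamped to the frame.
function roiFromLandmarks(landmarks, scale) {
  const xs = landmarks.map((p) => p.x);
  const ys = landmarks.map((p) => p.y);
  const [xMin, xMax] = [Math.min(...xs), Math.max(...xs)];
  const [yMin, yMax] = [Math.min(...ys), Math.max(...ys)];
  const cx = (xMin + xMax) / 2;
  const cy = (yMin + yMax) / 2;
  const half = (Math.max(xMax - xMin, yMax - yMin) * scale) / 2;
  return {
    xMin: Math.max(0, cx - half),
    yMin: Math.max(0, cy - half),
    xMax: Math.min(1, cx + half),
    yMax: Math.min(1, cy + half),
  };
}
```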
Normalized coordinates: Most landmark x / y values are 0.0 → 1.0 relative to the image. Multiply by image width / height to get pixels. Convenient because it's resolution-independent.
Relative z: z values are relative, not metric. Useful for ordering (which finger is closer) but not for measuring real distance.
World coordinates: The "world" landmark output is in metres, with its origin between the subject's hips (pose) or at the hand's approximate geometric centre (hand). Use these for biomechanics and 3D rigs.
Visibility / presence: 0.0 → 1.0 sigmoid outputs. Visibility asks "is this point visible, in frame and not occluded?"; presence asks "is this point present in the image at all?".
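IMAGE: One still image per call. Each call is independent, so no tracking carries over. Use for batch processing, dataset labelling, and photo-app filters. Method: detect(image).
VIDEO: Sequential frames with monotonically increasing timestamps. The model uses tracking to stabilise output between frames; the Live Lab uses this mode. Method: detectForVideo(frame, timestampMs), called from your render loop.
LIVE_STREAM: Asynchronous. Push frames as fast as you like and receive results through a callback, which suits pipelines that must not block the calling thread. This mode is offered by the Python, Android, and iOS APIs (Python: detect_async(image, timestamp_ms) with a result_callback in the options; Android: detectAsync(image, frameTime) with a result listener set on the options builder). The web API exposes only IMAGE and VIDEO, so in the browser use VIDEO.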
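VIDEO mode rejects timestamps that do not increase from one call to the next on the same task instance. If your clock can repeat a value (for example under reduced timer precision), a small wrapper keeps it strictly increasing. A sketch; the 1 ms step is an assumption chosen to stay well above any internal timestamp resolution, and `makeMonotonicClock` is a hypothetical helper.

```javascript
// Wrap a millisecond clock so every reading is strictly greater than the last.
// `now` defaults to performance.now(); stepMs is the nudge for repeated readings.
function makeMonotonicClock(now = () => performance.now(), stepMs = 1) {
  let last = -Infinity;
  return () => {
    let t = now();
    if (t <= last) t = last + stepMs; // repeated or backwards reading: nudge forward
    last = t;
    return t;
  };
}

// Usage with a VIDEO-mode task created as in the examples above:
// const clock = makeMonotonicClock();
// const r = pose.detectForVideo(videoEl, clock());
```

When several tasks share one frame (as in the composed Holistic example), call the clock once per frame and pass the same value to each task.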
| Model | Variant | ~ms / frame | ~FPS | Notes |
|---|---|---|---|---|
| Pose Landmarker | Lite | 3 – 6 | 150–300 | Mobile-friendly. Light on extreme poses. |
| Pose Landmarker | Full | 6 – 12 | 80–160 | Default. Solid all-rounder. |
| Pose Landmarker | Heavy | 12 – 25 | 40–80 | Best precision. Use for fast motion or tricky angles. |
| Face Landmarker | w/o blendshapes | 3 – 6 | 150–300 | Pure mesh, fastest face mode. |
| Face Landmarker | + blendshapes + matrix | 5 – 10 | 100–200 | Full mocap mode. The two extra heads add cost. |
| Hand Landmarker | — | 3 – 6 / hand | 100–250 | Cost roughly doubles for two hands. |
| Gesture Recognizer | — | 4 – 8 / hand | 90–200 | Hand Landmarker + tiny classifier. |
| Holistic (composed) | — | 15 – 35 | 30–60 | Sum of three landmarkers. Drop Heavy pose for higher rates. |
| Object Detector | EfficientDet-Lite0 | 8 – 18 | 50–120 | Good baseline. |
| Object Detector | EfficientDet-Lite2 | 15 – 35 | 25–60 | More accurate, heavier. |
| Image Segmenter | Selfie / 256 | 3 – 8 | 120–300 | Tiny model, 256×256 internally. |
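One way to use this table is a frame-budget check: sum the per-frame costs of the models you plan to run back to back and compare against 1000 / target FPS. A sketch; the inputs should be your own measurements, and summing assumes the models run sequentially, as in the composed Holistic example.

```javascript
// Do the chained per-frame costs fit inside one frame at targetFps?
function fitsFrameBudget(msPerModel, targetFps = 60) {
  const budgetMs = 1000 / targetFps;
  const totalMs = msPerModel.reduce((sum, ms) => sum + ms, 0);
  return { budgetMs, totalMs, fits: totalMs <= budgetMs, maxFps: 1000 / totalMs };
}

// Upper ends of the Pose Full, Face + blendshapes, and two-hand (2 × 6 ms) rows above.
console.log(fitsFrameBudget([12, 10, 12], 30)); // fits: false (34 ms vs ~33.3 ms)
```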
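Blendshape: a named morph-target weight (e.g. jawOpen: 0.7) that adds linearly to a neutral mesh. MediaPipe outputs 52 ARKit-compatible weights per face.
ARKit blendshapes: Apple's set of named facial blendshapes (jawOpen, browInnerUp, etc.). MediaPipe's category names match it directly, except that MediaPipe adds _neutral and has no tongueOut.
Delegate: 'GPU' (WebGL / WebGPU / GLES) or 'CPU' (WASM). GPU is much faster on supported hardware.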
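The linear blend is v = neutral + Σ_k w_k · (target_k − neutral), applied per vertex coordinate. A minimal sketch over flat `[x0, y0, z0, x1, ...]` vertex arrays; the mesh and target data structures are hypothetical, and weights without a sculpted target (such as `_neutral`) are skipped.

```javascript
// Linear blendshapes: out = neutral + Σ_k w_k · (target_k − neutral), per coordinate.
// neutral and every target are Float32Arrays of equal length.
function applyBlendshapes(neutral, targets, weights) {
  const out = Float32Array.from(neutral);
  for (const [name, w] of Object.entries(weights)) {
    const target = targets[name];
    if (!target || w === 0) continue; // no sculpted target (e.g. '_neutral') or zero weight
    for (let i = 0; i < out.length; i++) {
      out[i] += w * (target[i] - neutral[i]);
    }
  }
  return out;
}
```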