hosette.

2026 · instrument · web audio · gesture · music

Jazz Hands

A browser webcam instrument that turns hands and face into jazz-adjacent harmony

Role: Designer, builder, interaction grammarian
Stack: TypeScript · Vite · MediaPipe Tasks Vision · Tone.js · Web Audio · TouchDesigner · Python · MIDI
Live: jazzhands.hosette.net

In plain terms: a webcam-based instrument where your hands play chord-row gestures, your face adds shimmer, and the browser does the harmony. Allow camera, raise both hands, play badly for twenty seconds, then slightly less badly.

Status: live
Demo: open
Input: webcam
Hard part: gesture grammar

why

I wanted an on-screen synth I could play with my body.

Not a virtuoso interface. Not an app pretending the laptop camera is secretly a Stradivarius. A small question: could hands make harmony without becoming a worse piano? Could the webcam become an instrument you practice, instead of a demo you babysit?

The reference in the room was Mi.Mu gloves, but only as a feeling. I wasn’t trying to copy the hardware. I wanted the part where gesture becomes language — where a motion stops being a command and starts being a musical habit.

The constraint was ordinary on purpose. Laptop camera. Browser or TouchDesigner. No special gloves, no sensors. Hands in frame, face in frame, sound out.

what I built

Jazz Hands turns webcam-tracked hands and face into jazz-adjacent harmony.

The first version lived in TouchDesigner. MediaPipe tracked face and hands; a Python bridge turned landmarks into chord state; MIDI fed Ableton or Logic. The TouchDesigner file became the lab bench — camera nodes, data tables, smoothing curves, hand labels, overlay experiments, reset scripts, little rituals for getting the camera back after the laptop slept.

The second version, the public instrument, is browser-native. MediaPipe Tasks Vision for tracking, Tone.js for the electric-piano voice, a small TypeScript state machine for the chord grammar. No TouchDesigner, no Ableton or Logic, no BlackHole, no local MIDI setup. Click begin, allow the camera, the cream title page falls away, and the dark performance HUD comes up underneath.

The webcam stays in the browser. No camera frames are uploaded. That matters. A webcam instrument should not quietly become a webcam service.

[Image: Jazz Hands in play. Left hand showing a song-move gesture, right hand playing, with the HUD reading Dm11 in F.]
Two hands, two jobs. Left chooses the song move; right plays in F. The instrument says back what it heard: Dm11.

the gesture grammar

The current mode is called circle song mode. Two hands, two jobs.

right hand — key, volume, rest:

| gesture | result |
| --- | --- |
| fist | rest |
| open / relaxed | play |
| higher | louder |
| lower | softer |
| inner edge | previous key |
| outer edge | next key |

The right hand walks the circle of fifths:

F C G D A E B Gb Db Ab Eb Bb
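
A minimal sketch of that walk in TypeScript. The names `stepKey` and `KeyStep` are illustrative, and wrapping around at both ends of the list is an assumption, not something taken from the Jazz Hands source:

```typescript
// The twelve keys in the order listed above: ascending fifths starting from F.
const CIRCLE_OF_FIFTHS = [
  "F", "C", "G", "D", "A", "E", "B", "Gb", "Db", "Ab", "Eb", "Bb",
] as const;

type Key = (typeof CIRCLE_OF_FIFTHS)[number];
type KeyStep = "previous" | "next";

// Move one position around the circle. Wrapping from Bb back to F
// (and from F back to Bb) is an assumption of this sketch.
function stepKey(current: Key, step: KeyStep): Key {
  const n = CIRCLE_OF_FIFTHS.length;
  const index = CIRCLE_OF_FIFTHS.indexOf(current);
  const delta = step === "next" ? 1 : -1;
  return CIRCLE_OF_FIFTHS[(index + delta + n) % n];
}
```

Under these assumptions, an outer-edge gesture in F would call `stepKey("F", "next")` and land on C, while an inner-edge gesture would land on Bb.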

left hand — song moves:

| gesture | move |
| --- | --- |
| fist / unclear | I |
| 1 finger | ii |
| 2 fingers | V |
| 3 fingers | vi |
| 4 fingers | IV |
| open hand | next progression step |

Left-hand height changes texture, low to high:

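| move | low | mid | high |
| --- | --- | --- | --- |
| I | maj7 | maj9 | maj13#11 |
| ii | m7 | m9 | m11 |
| V | 7 | 13 | 7alt |
| vi | m7 | m11 | m9add11 |
| IV | maj7 | maj9 | maj13#11 |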

Mouth open adds air: shimmer, brightness, a touch of upper harmony. Slightly ridiculous. Immediately legible. Good combination.

In C, the easy loop is Cmaj9 → Am11 → Dm9 → G13. The circle of fifths becomes a place to move through, not a chart you’re expected to admire.
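
One way to read the texture table as a lookup, sketched in TypeScript. The function and constant names are hypothetical. Roots are found by stepping around the same circle of fifths, so in the sharper keys some roots come out as enharmonic spellings (Gb where a chart would write F#). That is a simplification of this sketch, not a claim about how the instrument names chords:

```typescript
// Same key order as the circle listed above (ascending fifths from F).
const CIRCLE: readonly string[] = [
  "F", "C", "G", "D", "A", "E", "B", "Gb", "Db", "Ab", "Eb", "Bb",
];

type Move = "I" | "ii" | "V" | "vi" | "IV";
type Zone = "low" | "mid" | "high";

// Chord color per move and left-hand height, copied from the texture table.
const QUALITY: Record<Move, Record<Zone, string>> = {
  I: { low: "maj7", mid: "maj9", high: "maj13#11" },
  ii: { low: "m7", mid: "m9", high: "m11" },
  V: { low: "7", mid: "13", high: "7alt" },
  vi: { low: "m7", mid: "m11", high: "m9add11" },
  IV: { low: "maj7", mid: "maj9", high: "maj13#11" },
};

// Steps around the circle from the tonic to each move's root:
// V is one fifth up, ii two, vi three, and IV one fifth down.
const FIFTHS_FROM_TONIC: Record<Move, number> = { I: 0, V: 1, ii: 2, vi: 3, IV: -1 };

function chordName(key: string, move: Move, zone: Zone): string {
  const tonic = CIRCLE.indexOf(key);
  if (tonic < 0) throw new Error(`Unknown key: ${key}`);
  const n = CIRCLE.length;
  const root = CIRCLE[(tonic + FIFTHS_FROM_TONIC[move] + n) % n];
  return root + QUALITY[move][zone];
}

// The easy loop in C, with the left hand held at mid height throughout.
const easyLoop = (["I", "vi", "ii", "V"] as Move[]).map((m) => chordName("C", m, "mid"));
```

Here `easyLoop` holds Cmaj9, Am11, Dm9, G13, the same loop described above, and `chordName("F", "vi", "mid")` gives the Dm11 shown in the HUD screenshot.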

decisions I’d defend

Separate the hands by musical job. Early mappings treated finger counts like menu options: one finger this note, two fingers that chord. It worked, technically, and felt awful. The better split was role-based: right hand handles key, volume, and rest; left hand handles harmonic moves. One hand asks where are we and how loud. The other asks what kind of song move is this. The two-hand split is the discipline.

Make silence explicit. A closed right hand means rest. Not error, not tracking failure, not “oops, no input.” Rest. Camera tracking is uncertain by nature, so silence has to be something you can play on purpose. Same principle as Omunikorudo’s HOLD latching: the instrument has to be sayable, including when it’s saying nothing.

Song moves, not arbitrary chords. The left hand chooses I, ii, V, vi, IV. Enough to make useful progressions without turning the hand into a chord encyclopedia. The grammar is small on purpose: a vocabulary you can learn in a sitting and use for a year.

Lush chords by default. No plain major triad mode for the main path. The default colors are maj9, m9, 13, m11, maj13#11, and altered-dominant brightness when the hand goes high. If the input is imprecise, the sound should still land somewhere forgiving. Tracking jitter is a fact; voicings can absorb it.

Show what the system thinks, then get out of the way. The overlay is a translation layer: current chord, key, mouth air, volume, hand landmarks, tiny guide. Without it, the instrument feels haunted. With too much of it, the interface stands in front of the performance waving a clipboard. The current version is closer, not done — a learning mode and a performance mode want to be different surfaces.

what I learned

Hands are noisy data with opinions. Finger counts flicker. A hand disappears behind your head. A mirrored camera makes physical-left and screen-left disagree. Raised and lowered hands can invert depending on which coordinate system is telling the story. Most of the work was not adding features — it was getting the system to stop overreacting.
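
A common way to stop a discrete reading such as a finger count from flickering is to accept a new value only after it has held for several consecutive frames. The sketch below shows that idea; the class name `StableValue` and the frame threshold are hypothetical, and nothing here is taken from the actual Jazz Hands code:

```typescript
// Commits a new value only after it has been observed on
// `framesRequired` consecutive frames. Compares with ===, so it is
// meant for primitives such as finger counts or gesture labels.
class StableValue<T> {
  private committed: T;
  private candidate: T;
  private streak = 0;
  private readonly framesRequired: number;

  constructor(initial: T, framesRequired: number) {
    this.committed = initial;
    this.candidate = initial;
    this.framesRequired = framesRequired;
  }

  // Feed one raw observation per frame; returns the stable value.
  update(observed: T): T {
    if (observed === this.committed) {
      // Back on the committed value: discard any pending candidate.
      this.candidate = observed;
      this.streak = 0;
    } else if (observed === this.candidate) {
      this.streak += 1;
    } else {
      // A different challenger starts its own streak.
      this.candidate = observed;
      this.streak = 1;
    }
    if (this.streak >= this.framesRequired) {
      this.committed = this.candidate;
      this.streak = 0;
    }
    return this.committed;
  }
}

// A one-frame flicker to 2 is ignored; three frames of 2 in a row are accepted.
const fingers = new StableValue<number>(0, 3);
const stableCounts = [2, 0, 2, 2, 2].map((raw) => fingers.update(raw));
```

With a threshold of three frames, `stableCounts` is 0, 0, 0, 0, 2. A higher threshold trades responsiveness for calm, which is the same tension the smoothing discussion turns on.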

Smoothing is an aesthetic decision. Too little, and the chord jitters. Too much, and the instrument feels sleepy. The interesting range is small: just enough delay for intention to register, not enough delay for the body to feel ignored. Same range that makes the difference between SPF’s two-sprint discipline holding and feeling slow — taste at the edge of the parameter, not the middle.
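
For a continuous value such as hand height, the simplest version of that trade-off is a one-pole exponential filter with a single coefficient. The sketch below assumes the tracker reports normalized image coordinates with y = 0 at the top of the frame, which is also where the raised-versus-lowered inversion tends to come from. The function names and the idea of a single `alpha` are illustrative, not the instrument's actual smoothing:

```typescript
function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Image coordinates put y = 0 at the top of the frame, so a raised
// hand has a small y. Flip it so that higher means a larger number.
function heightFromImageY(normalizedY: number): number {
  return clamp01(1 - normalizedY);
}

// One-pole (exponential) smoothing: move a fraction `alpha` of the way
// from the previous value toward the new sample.
// alpha near 1 follows the raw input closely (jittery);
// alpha near 0 barely moves (sleepy).
function smoothToward(previous: number, sample: number, alpha: number): number {
  return previous + clamp01(alpha) * (sample - previous);
}

// Per-frame use: keep one smoothed height and update it from each raw y.
let smoothedHeight = 0;
for (const rawY of [0.5, 0.5, 0.5]) {
  smoothedHeight = smoothToward(smoothedHeight, heightFromImageY(rawY), 0.5);
}
```

After three frames at `alpha = 0.5` the smoothed height has covered seven eighths of the distance to its target, which gives a feel for how quickly a given coefficient lets intention register.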

Sound design changes interaction design. The first browser synth sounded like a choir pad. Pretty, but the tails blurred chord changes, and old notes hung around as false drones. Moving toward an electric piano changed the grammar itself: faster attack, shorter release, less reverb, more readable harmony. The body needs sound that forgives without smearing.

The overlay has to teach and then get out of the way. Learning mode wants labels. Performance mode wants restraint. They shouldn’t be the same surface. The browser version currently compromises — it explains the grammar while keeping the biggest readout to the current chord. Acceptable for a prototype, probably too much for a performance.

what would prove it

Jazz Hands has no analytics, on purpose. If it did, these are the three things I’d look at:

  • Time from begin to first chord predicts return. The state machine auto-inits audio on first input and the start-card falls away when the camera comes on. If users who hear their first chord within five seconds of clicking begin come back, the lifecycle is doing the work. If they don’t, the bottleneck is the grammar, and the hint overlay needs to teach faster.
  • Open-hand “next” usage separates players from explorers. Users who hold one shape and listen are exploring the sound. Users who advance the progression with the open-hand gesture are playing a song. Retention divergence between the two cohorts would say the song-move vocabulary earned its place; convergence would say the grammar is teaching exploration but not playing.
  • Mouth-air usage tracks comfort. Opening your mouth in front of a webcam feels silly until it stops feeling silly. The progression from “I won’t do that” to “I do it without thinking” is the actual onboarding curve for a body-based instrument. If mouth-air usage rises across sessions for the same user, the instrument is becoming a habit; if it stays flat, the feature is novelty.

Two risks the design has to keep refusing:

  • Plateau after the novelty. Without a recorder, looper, or exporter, the instrument has a ceiling. That ceiling was chosen — recording pulls the product toward DAW-lite, which isn’t what’s being built. The hedge is making the playable surface good enough that users leave glad, not bored. Same cut Omunikorudo made on the looper.
  • Browser camera and audio drift under your feet. Safari, Chrome, and Firefox handle Web Audio and MediaStream differently, and iOS adds gesture-requirement quirks desktop doesn’t. The shared audio engine and the ESC panic-stop mitigate but don’t eliminate. Every browser update is a small risk; the only defense is keeping the lifecycle code legible enough to fix on a Tuesday.

what’s next

Four moves, in order:

  • Learning mode and performance mode become separate surfaces. One teaches the grammar with labels and a hint overlay. The other mostly disappears: chord readout, light circle strip, nothing else competing with the user on the camera frame.
  • A short event log, before audio export. The interesting artifact isn’t the sound file — it’s the sequence of gestures and harmonic decisions. A log of chord changes with timestamps is more useful than a WAV the first time you go to remember what you played.
  • Sound presets, electric piano staying default. Later: soft organ, bass split, glassy pad, choir as an optional return rather than the house style.
  • Web MIDI as advanced mode. Browser version should work without setup, but musicians should eventually be able to point Jazz Hands at Ableton or Logic and use it as a controller. Same wedge as Omunikorudo’s planned MIDI in/out: the demo is the front door, MIDI is what makes it serious.
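
The event log in the second move could start very small. The sketch below is one possible shape, assuming a log entry needs only a timestamp, the key, and the chord name; the `ChordEvent` fields and the `ChordLog` class are hypothetical and not part of the current instrument:

```typescript
interface ChordEvent {
  atMs: number; // milliseconds since the session started
  key: string;
  chord: string;
}

class ChordLog {
  private readonly events: ChordEvent[] = [];

  // Record only actual changes, so a held chord produces one entry
  // instead of one entry per camera frame.
  record(event: ChordEvent): void {
    const last = this.events[this.events.length - 1];
    if (last && last.key === event.key && last.chord === event.chord) return;
    this.events.push(event);
  }

  // Plain-text transcript, one chord change per line.
  toText(): string {
    return this.events
      .map((e) => `${(e.atMs / 1000).toFixed(1)}s  ${e.chord} in ${e.key}`)
      .join("\n");
  }
}

const log = new ChordLog();
log.record({ atMs: 0, key: "C", chord: "Cmaj9" });
log.record({ atMs: 16, key: "C", chord: "Cmaj9" }); // same chord, skipped
log.record({ atMs: 2000, key: "C", chord: "Am11" });
const transcript = log.toText();
```

The transcript here has two lines, one for Cmaj9 and one for Am11. A text log like this is cheap to keep in memory and easy to copy out, which fits the goal of remembering what was played without turning the instrument into a recorder.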

Stays free and playable at jazzhands.hosette.net.

The longer question is whether this can become an instrument someone can actually learn — not a webcam trick, but a small musical language. The cut between those two is what every decision above is trying to defend.
