Why alignment is needed¶
The recording setup¶
Two independent recording systems run in parallel during each participant's session:
- A Yamaha Disklavier writes performances to
.midfiles on its host computer. - A Sony FX30 camera overhead records video to
.MP4files (with a matching.XMLsidecar) at roughly 240 frames per second.
Both systems are time-stamped using the host clock at the moment each file is closed — there is no wired sync, no SMPTE timecode, no clapper, and no embedded event broadcast between the two machines.
The clock-offset problem¶
In practice the two host clocks drift against each other by a constant 1–20 minutes per participant:
- The offset is stable within a participant's recording session (it does not meaningfully drift over one hour of recording).
- The offset varies between participants because the machines reboot and NTP-sync independently.
- The offset cannot be recovered automatically from file headers — the Disklavier's embedded track-name timestamp is unreliable on many takes, and the FX30 XML
CreationDateis likewise not trusted (see the docstrings inalignment_tool/io/midi_adapter.py:1-9andalignment_tool/io/camera_adapter.py:1-8).
Because of this, a MIDI event at t = 123.5 s in trial_001.mid does not correspond to the frame with wall-clock timestamp 123.5 s in C0001.MP4. The real correspondence is midi_unix_time = camera_unix_time + Δ for some participant-specific Δ that has to be measured.
Why it has to be manual¶
Automatic onset detection is possible in principle but was ruled out for this dataset:
- The overhead camera view does not always capture every keypress with a clean visual signature — some keystrokes are occluded by hands, and illumination varies trial-to-trial.
- A mismatched or missed automatic onset would produce a silently-wrong alignment; a researcher then has to verify every result anyway.
- The manual workflow is fast in practice: one M/C marker pair, one click, done.
This tool therefore exists to let a researcher:
- Look at a MIDI file and a video of the same piece side-by-side.
- Identify the same keystroke in both.
- Have the tool compute and save the resulting time offset.
What alignment unlocks¶
Once a participant's global_shift_seconds (and any per-clip anchors) are saved, downstream analyses can:
- Seek to an arbitrary MIDI event in the corresponding camera frame.
- Crop video clips around particular notes or phrases.
- Train per-frame or per-event models that need both modalities time-aligned.
Without this alignment step, the MIDI and video are effectively two disconnected datasets.