
Why alignment is needed

The recording setup

Two independent recording systems run in parallel during each participant's session:

  1. A Yamaha Disklavier writes performances to .mid files on its host computer.
  2. An overhead Sony FX30 camera records video to .MP4 files (each with a matching .XML sidecar) at roughly 240 frames per second.

Each system time-stamps its files with its own host clock at the moment a file is closed; there is no wired sync, no SMPTE timecode, no clapper, and no embedded event broadcast between the two machines.

The clock-offset problem

In practice the two host clocks disagree by a constant offset of 1–20 minutes per participant:

  • The offset is stable within a participant's recording session (it does not meaningfully drift over one hour of recording).
  • The offset varies between participants because the machines reboot and NTP-sync independently.
  • The offset cannot be recovered automatically from file headers — the Disklavier's embedded track-name timestamp is unreliable on many takes, and the FX30 XML CreationDate is likewise not trusted (see the docstrings in alignment_tool/io/midi_adapter.py:1-9 and alignment_tool/io/camera_adapter.py:1-8).

Because of this, a MIDI event at t = 123.5 s in trial_001.mid does not correspond to the frame with wall-clock timestamp 123.5 s in C0001.MP4. The real correspondence is midi_unix_time = camera_unix_time + Δ for some participant-specific Δ that has to be measured.
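
Concretely, once Δ has been measured for a participant, moving between the two timelines is a single addition or subtraction. A minimal sketch of the relation (the variable names and the example offset are illustrative, not part of the tool's API):

    # Participant-specific offset measured with the alignment tool (example value).
    # Defined so that: midi_unix_time = camera_unix_time + delta_seconds
    delta_seconds = 612.4

    def camera_time_for_midi(midi_unix_time: float) -> float:
        """Map a MIDI event's wall-clock time onto the camera's wall clock."""
        return midi_unix_time - delta_seconds

    def midi_time_for_camera(camera_unix_time: float) -> float:
        """Map a camera frame's wall-clock time onto the MIDI host's wall clock."""
        return camera_unix_time + delta_seconds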

Why it has to be manual

Automatic onset detection is possible in principle but was ruled out for this dataset:

  • The overhead camera view does not always capture every keypress with a clean visual signature — some keystrokes are occluded by hands, and illumination varies trial-to-trial.
  • A mismatched or missed automatic onset would produce a silently wrong alignment; a researcher would then have to verify every result anyway.
  • The manual workflow is fast in practice: one M/C marker pair, one click, done.

This tool therefore exists to let a researcher:

  1. Look at a MIDI file and a video of the same piece side-by-side.
  2. Identify the same keystroke in both.
  3. Have the tool compute and save the resulting time offset.
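
The computation in step 3 reduces to one subtraction between the two marked wall-clock times. A sketch of the idea, with hypothetical field names and output path (the tool's actual storage format may differ):

    import json

    # Wall-clock (unix) times of the same keystroke, as marked by the researcher:
    #   M marker -> the note-on event on the MIDI timeline
    #   C marker -> the frame showing the key going down on the camera timeline
    midi_marker_unix = 1_700_000_735.210     # illustrative values
    camera_marker_unix = 1_700_000_122.810

    # Participant-specific offset: midi_unix_time = camera_unix_time + shift
    global_shift_seconds = midi_marker_unix - camera_marker_unix

    # Persist alongside the participant's other alignment metadata
    # (hypothetical file name; the real tool may store this differently).
    with open("participant_042_alignment.json", "w") as f:
        json.dump({"global_shift_seconds": global_shift_seconds}, f, indent=2)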

What alignment unlocks

Once a participant's global_shift_seconds (and any per-clip anchors) are saved, downstream analyses can:

  • Seek to the camera frame corresponding to an arbitrary MIDI event (see the sketch after this list).
  • Crop video clips around particular notes or phrases.
  • Train per-frame or per-event models that need both modalities time-aligned.
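
For example, a downstream script can turn any MIDI event's wall-clock time into a camera frame index once global_shift_seconds is known. A sketch under the assumption that the clip's start time and frame rate are available (e.g. from its .XML sidecar); the helper name and parameters here are hypothetical:

    def frame_index_for_midi_event(
        midi_event_unix: float,       # wall-clock time of the MIDI note-on
        clip_start_unix: float,       # wall-clock time of the clip's first frame
        global_shift_seconds: float,  # midi_unix_time = camera_unix_time + shift
        fps: float = 240.0,           # nominal rate; use the exact value for the clip
    ) -> int:
        """Return the camera frame index that shows the given MIDI event."""
        camera_time = midi_event_unix - global_shift_seconds
        return round((camera_time - clip_start_unix) * fps)

From there, cropping a clip around a note or phrase is just a pair of such frame indices.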

Without this alignment step, the MIDI and video are effectively two disconnected datasets.