Pilotly — Real-Time Video Inference for Film | Punch
PILOTLY · REAL-TIME VIDEO INFERENCE · PILOT.LY

It watches the cut while you make it.

How Punch built Pilotly a real-time video-inference system for editors on major productions — scoring whether each scene works, and whether the scenes work together.

REAL-TIME INFERENCE SCENE EFFICACY MULTIMODAL AI CUSTOM EMBEDDINGS
PILOTLY · EFFICACY MONITOR — CUT 14, "THE HEIST" LIVE INFERENCE · 240 MS / SCENE (SAMPLE)
PRODUCT WALKTHROUGH · SCENE-BY-SCENE EFFICACY, LIVE
ONE FRAME, FULLY READ

It doesn't just watch. It reads the scene.

As the clip rolls in slow motion, the model names the subjects, the environment, and the action live. Pick a question — the overlay re-reads the frame to answer it.

PILOTLY · READING SUBJECTS SLOW-MO · LIVE INFERENCE
SPEED 0.50×
FIG. 04 — SEMANTIC SCENE UNDERSTANDING · SUBJECTS, ENVIRONMENT, ACTION (ORIGINAL FRAME — NO STUDIO IP)
ASK THE FRAME

Our model answers any question about what's on screen — it reads the live frame, nothing is pre-scripted.

>_ Ask anything about this scene
TRY A FEW — OR TYPE YOUR OWN
THE EVIDENCE · LIVE Q&A

Ask the footage anything. Watch the model answer.

Actual screen recordings of the Pilotly UI — an editor asks a question, the model reads the indexed clip and answers. Pick a question to watch the session.

PILOTLY · LIVE SESSION
MODEL RESPONSE · 01/03

What is this video about?

A two-fighter sword sequence on an open snow field — cold, overcast, deliberate. Indexed as 12 scenes · 178 scored moments.

THE CLIP BEING READ · SOURCE FOOTAGE
COVERT OP · INDEXED · 12 SCENES
Every answer above is read live from this footage — the same clip uploaded to the Pilotly UI on the left.
ACTUAL PILOTLY UI · SCREEN-RECORDED MODEL SESSIONS
AND IT TALKS — ASK THE CUT ANYTHING

It reads the footage before it answers.

Every reply is pulled from your real frames and Pilotly's filmmaking knowledge base — and it shows the exact scenes and moments it drew from. Nothing invented.

PILOTLY · ASK YOUR FOOTAGE ● CUT INDEXED · 12 SCENES · 178 MOMENTS
Why is scene 8 scoring low?
The getaway runs 22 seconds past your reference arc and visual energy drops after the second cutaway — tightening the driving inserts would recover the pace.
GROUNDED IN: SCENE 8 · 14 MOMENTS · REF CORPUS
FROM THE BENCH · UNSTAGED

Not a mockup. The engine, mid-build.

A working session on the scene-search service: type "lady in the lake" and the index returns the exact moments — scored 0.978, timestamped to the second — while the model drafts lighting notes alongside. Text, image, or both as the query.

Development screenshot of the Pilotly scene-search interface — a text query returning scored, timestamped moments next to AI lighting analysis
FIG. 05 — THE SCENE-SEARCH SERVICE, RUNNING · TEXT + IMAGE QUERIES OVER EMBEDDED MOMENTS · DEV SCREENSHOT, UNRETOUCHED
HOW IT KEEPS UP WITH AN EDITOR

Re-scored on every recut. The loop is the product.

For the engineers who scrolled this far — the path a cut takes from ingest to the editor's monitor, and the part you can't buy off the shelf.

INGEST CUT → SCENES → MOMENTS
APOLLO VIDEO LLM · RICH DESCRIPTIONS
PUNCH EMBEDDINGS FUSED TEXT+IMAGE · PAST THE 512-D CEILING
EFFICACY SCORER SCENE + ARC · VS REFERENCE CORPUS
EDITOR'S MONITOR ✓ ~240 MS / SCENE (SAMPLE)
SCENE × 12
MOMENT × 178 — DESCRIPTION · DURATION · ENERGY · METADATA
FIG. 11 — THE HIERARCHY · A CUT IS SCENES; SCENES ARE MOMENTS. EVERY MOMENT IS SCORED AND SEARCHABLE.
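
For engineers who want the hierarchy in concrete terms, here is a minimal Python sketch of a cut that holds scenes that hold moments. It is an illustration, not Pilotly's schema: the class names, the field names, the 0-to-1 ranges for energy and score, and the duration-weighted roll-up from moments to a scene are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Moment:
    start_s: float       # offset from the start of the cut, in seconds
    duration_s: float    # length of the moment, in seconds
    description: str     # long-form text describing the moment
    energy: float        # hypothetical visual-energy value in [0, 1]
    score: float         # hypothetical efficacy score in [0, 1]
    metadata: dict = field(default_factory=dict)


@dataclass
class Scene:
    index: int
    moments: list[Moment] = field(default_factory=list)

    @property
    def duration_s(self) -> float:
        return sum(m.duration_s for m in self.moments)

    @property
    def score(self) -> float:
        # Assumed aggregation: duration-weighted mean of moment scores.
        total = self.duration_s
        if total == 0:
            return 0.0
        return sum(m.score * m.duration_s for m in self.moments) / total


@dataclass
class Cut:
    name: str
    scenes: list[Scene] = field(default_factory=list)

    def moment_count(self) -> int:
        return sum(len(s.moments) for s in self.scenes)


# Toy data, invented for the example.
cut = Cut("demo cut", [
    Scene(1, [
        Moment(0.0, 2.0, "wide shot of the field", energy=0.4, score=0.5),
        Moment(2.0, 6.0, "close-up, blades lock", energy=0.7, score=0.9),
    ]),
    Scene(2, [
        Moment(8.0, 4.0, "cutaway to the treeline", energy=0.2, score=0.3),
    ]),
])

print(cut.moment_count())             # 3
print(round(cut.scenes[0].score, 3))  # 0.8
```

Weighting by duration keeps a two-second insert from counting as much as a six-second beat. Any other roll-up would slot into the same `score` property.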
OFF-THE-SHELF MULTIMODAL EMBEDDINGS · 512-D · SHORT TEXT ONLY
PUNCH CUSTOM EMBEDDING MODEL · LONG-FORM TEXT + IMAGES, FUSED

Scene descriptions run long; 512 dimensions can't hold them next to imagery. The in-house model fuses both into one vector per moment — text finds frames, frames find text, cosine-ranked. Query images are sharpened and de-darkened before search.

FIG. 12 — PAST THE CEILING · THE MODEL YOU CAN'T BUY (DIMS PROPRIETARY)
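
To make "cosine-ranked" concrete, here is a minimal Python sketch of ranking moments against a query vector. It assumes the text encoder and the image encoder both write into the same space, so one function serves both directions. The vectors are toy 3-D stand-ins rather than output from the real model, and the function names and moment ids are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as no match
    return dot / (norm_a * norm_b)


def rank(query_vec, index, top_k=3):
    """Return the top_k (moment_id, similarity) pairs, best first.

    `index` is a list of (moment_id, vector) pairs. The query vector may
    come from a typed prompt or a dropped still; the ranking is identical.
    """
    scored = [(moment_id, cosine(query_vec, vec)) for moment_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy index: three moments, invented for the example.
index = [
    ("s08_m03", [0.9, 0.1, 0.0]),
    ("s02_m11", [0.0, 1.0, 0.0]),
    ("s05_m01", [0.7, 0.7, 0.1]),
]

query = [1.0, 0.0, 0.0]  # stand-in for an embedded text or image query
for moment_id, similarity in rank(query, index):
    print(moment_id, round(similarity, 3))
```

A linear scan like this is fine for a few hundred moments per cut. A larger corpus would normally sit behind an approximate nearest-neighbour index, which this sketch does not attempt.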
PAST THE 512-D CEILING · IN DETAIL

One vector per moment — text and image, fused.

The common off-the-shelf multimodal embeddings (CLIP-style base models) output 512 dimensions and were trained on short captions, with text input capped at 77 tokens. They can't hold a paragraph of scene description next to the frame it describes. So we trained our own. Here's what it does that you can't buy.

01 · FUSION
SCENE TEXT · 380 TOKENS
FRAME PIXELS
ONE FUSED VECTOR · 1,024-D [ 0.21, −0.08, 0.44, … ]
Long-form description and the frame collapse into a single dense vector — the text and what it depicts living in the same space.
02 · BOTH DIRECTIONS
TEXT QUERY FINDS FRAMES
IMAGE QUERY FINDS TEXT
cos θ = 0.94 · NEAREST WINS
Because both modalities share one space, a typed prompt retrieves the matching frames and a dropped still retrieves the matching descriptions — cosine-ranked, the same math each way.
03 · QUERY CONDITIONING
RAW · DARK
SHARPENED
Query images get sharpened and de-darkened before they ever hit the index — so an underexposed reference still lands on the right moment instead of the nearest shadow.
FIG. 13 — THE EMBEDDING MODEL · FUSED MULTIMODAL VECTORS, COSINE-RANKED, QUERY-CONDITIONED (ARCHITECTURE PROPRIETARY)
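
Query conditioning can be sketched in a few lines of Python. The page says only that query images are sharpened and de-darkened. The specific techniques below (gamma correction to lift shadows, then an unsharp mask built on a 3×3 box blur), the parameter values, and the function names are assumptions chosen for illustration, not a description of the production code. The image is a grayscale grid of floats in [0, 1], so the example runs without any imaging library.

```python
def de_darken(img, gamma=0.6):
    """Lift shadows with gamma correction; gamma < 1 brightens dark pixels most."""
    return [[v ** gamma for v in row] for row in img]


def box_blur(img):
    """3x3 mean filter; edge pixels reuse their nearest in-bounds neighbour."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            row.append(total / 9.0)
        out.append(row)
    return out


def sharpen(img, amount=1.0):
    """Unsharp mask: add back the difference between the image and its blur."""
    blurred = box_blur(img)
    return [
        [min(1.0, max(0.0, v + amount * (v - b))) for v, b in zip(row, blur_row)]
        for row, blur_row in zip(img, blurred)
    ]


def condition_query(img, gamma=0.6, amount=1.0):
    """De-darken first, then sharpen, so edges are boosted at the lifted levels."""
    return sharpen(de_darken(img, gamma), amount)


# Toy underexposed frame: a dark left half next to a slightly less dark right half.
dark = [[0.04, 0.04, 0.16, 0.16] for _ in range(4)]
conditioned = condition_query(dark, gamma=0.5)
for row in conditioned:
    print([round(v, 3) for v in row])
```

The order matters in this sketch: sharpening a frame that is still dark would amplify differences that sit close to zero, so the lift comes first.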
WHAT THE EDITING BAY COULDN'T MEASURE

"Does this scene land?" was a gut call. On a 200-person production, gut calls are expensive.

Editors on major films screen a cut dozens of times asking the same questions: does the scene hold? Does the act sag? Does the sequence of scenes carry the arc? Pilotly wanted those answers while the editor works — not after the test screening.

Punch built the inference engine underneath: every scene is understood, embedded, and scored in real time — for its own efficacy, and for its place in the whole.

“Every scene scored while the editor works — not after the screening.”
240 ms · PER-SCENE INFERENCE (SAMPLE — CONFIRM)
>512 · DIMENSIONS — THE CEILING, BROKEN IN-HOUSE
2 · SCORES PER CUT — EVERY SCENE + THE WHOLE
WHAT WE BUILT
MODELS
Apollo video-LLM integration
Custom fused multimodal embedding model (1,024-D)
Real-time scene efficacy scorer
PIPELINE
Five-stage real-time inference pipeline
Scene & moment indexing (12 scenes · 178 moments)
Query-image conditioning — sharpen + de-darken
PRODUCT SURFACE
Bi-directional cosine retrieval (text ↔ frames)
Grounded RAG chat over the cut
Editor-side inference API — ~240ms / scene
SERVICES PROVEN
Real-time inference engineering
LLM systems & multimodal RAG
Custom model development
CLIENT
Pilotly — video intelligence for editors
STATUS
In final testing — and we say so.
CASE STUDY
Case study PDF
IN THE NEWS
FUEL CYCLE · NOV 2024 · A deep dive into FCX’s newest partner: Pilotly.
INSIGHT PLATFORMS · 2024 · Pilotly — testing consumer response to films, trailers, and promos.
PILOT.LY · CURRENT · Audible, NBCUniversal, and Ovation TV test creative content on Pilotly.