    

PILOTLY · REAL-TIME VIDEO INFERENCE · PILOT.LY ↗

It watches the cut while you make it.

How Punch built Pilotly a real-time video-inference system for editors on major productions — scoring whether each scene works, and whether the scenes work together.

REAL-TIME INFERENCE · SCENE EFFICACY · MULTIMODAL AI · CUSTOM EMBEDDINGS

PILOTLY · EFFICACY MONITOR — CUT 14, "THE HEIST" · LIVE INFERENCE · 240 MS / SCENE (SAMPLE)

PRODUCT WALKTHROUGH · SCENE-BY-SCENE EFFICACY, LIVE

ONE FRAME, FULLY READ

It doesn't just watch. It reads the scene.

As the clip rolls in slow motion, the model names the subjects, the environment, and the action live. Pick a question — the overlay re-reads the frame to answer it.

PILOTLY · READING SUBJECTS · SLOW-MO · LIVE INFERENCE

SPEED 0.50×

FIG. 04 — SEMANTIC SCENE UNDERSTANDING · SUBJECTS, ENVIRONMENT, ACTION (ORIGINAL FRAME — NO STUDIO IP)

ASK THE FRAME

Our model answers any question about what's on screen — it reads the live frame, nothing is pre-scripted.


THE EVIDENCE · LIVE Q&A

Ask the footage anything. Watch the model answer.

Actual screen recordings of the Pilotly UI — an editor asks a question, the model reads the indexed clip and answers. Pick a question to watch the session.

PILOTLY · LIVE SESSION

MODEL RESPONSE · 01/03

What is this video about?

A two-fighter sword sequence on an open snow field — cold, overcast, deliberate. Indexed as 12 scenes · 178 scored moments.

THE CLIP BEING READ · SOURCE FOOTAGE

COVERT OP · INDEXED · 12 SCENES

Every answer above is read live from this footage — the same clip uploaded to the Pilotly UI on the left.

ACTUAL PILOTLY UI · SCREEN-RECORDED MODEL SESSIONS

AND IT TALKS — ASK THE CUT ANYTHING

It reads the footage before it answers.

Every reply is pulled from your real frames and Pilotly's filmmaking knowledge base — and it shows the exact scenes and moments it drew from. Nothing invented.

PILOTLY · ASK YOUR FOOTAGE ● CUT INDEXED · 12 SCENES · 178 MOMENTS

Why is scene 8 scoring low?

The getaway runs 22 seconds past your reference arc and visual energy drops after the second cutaway — tightening the driving inserts would recover the pace.

GROUNDED IN: SCENE 8 · 14 MOMENTS · REF CORPUS

FROM THE BENCH · UNSTAGED

Not a mockup. The engine, mid-build.

A working session on the scene-search service: type "lady in the lake" and the index returns the exact moments — scored 0.978, timestamped to the second — while the model drafts lighting notes alongside. Text, image, or both as the query.

FIG. 05 — THE SCENE-SEARCH SERVICE, RUNNING · TEXT + IMAGE QUERIES OVER EMBEDDED MOMENTS · DEV SCREENSHOT, UNRETOUCHED

HOW IT KEEPS UP WITH AN EDITOR

Re-scored on every recut. The loop is the product.

For the engineers who scrolled this far — the path a cut takes from ingest to the editor's monitor, and the part you can't buy off the shelf.

INGEST · CUT → SCENES → MOMENTS

→ APOLLO · VIDEO LLM · RICH DESCRIPTIONS

→ PUNCH EMBEDDINGS · FUSED TEXT+IMAGE · PAST THE 512-D CEILING

→ EFFICACY SCORER · SCENE + ARC · VS REFERENCE CORPUS

→ EDITOR'S MONITOR · ✓ ~240 MS / SCENE (SAMPLE)

SCENE × 12

MOMENT × 178 — DESCRIPTION · DURATION · ENERGY · METADATA

FIG. 11 — THE HIERARCHY · A CUT IS SCENES; SCENES ARE MOMENTS. EVERY MOMENT IS SCORED AND SEARCHABLE.
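
For the engineers: the hierarchy in the figure can be sketched as three nested records. This is a minimal illustration, not Pilotly's schema. The class names, field names, the 0..1 energy range, and the substring `find` helper (a stand-in for embedding search) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Moment:
    # Fields mirror the figure: description, duration, energy, metadata.
    description: str
    duration_s: float
    energy: float  # assumed here to be a 0..1 visual-energy value
    metadata: dict = field(default_factory=dict)
    score: Optional[float] = None  # stays None until a scorer fills it in


@dataclass
class Scene:
    index: int
    moments: list = field(default_factory=list)

    @property
    def duration_s(self) -> float:
        # A scene's length is the sum of its moments.
        return sum(m.duration_s for m in self.moments)


@dataclass
class Cut:
    title: str
    scenes: list = field(default_factory=list)

    def moments(self):
        """Yield (scene_index, moment) pairs in cut order."""
        for scene in self.scenes:
            for moment in scene.moments:
                yield scene.index, moment

    def find(self, term: str):
        """Hypothetical stand-in for search: substring match on descriptions."""
        term = term.lower()
        return [(i, m) for i, m in self.moments() if term in m.description.lower()]


# Toy data, invented for the example.
cut = Cut(
    title="demo cut",
    scenes=[
        Scene(index=1, moments=[
            Moment("wide shot of the vault", 4.0, 0.3),
            Moment("hands on the dial", 2.5, 0.7),
        ]),
        Scene(index=2, moments=[
            Moment("getaway car pulls out", 6.0, 0.9),
        ]),
    ],
)
hits = cut.find("getaway")
```

Keeping the score on the moment and the duration derived on the scene means a recut only has to touch the moments that changed.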

OFF-THE-SHELF MULTIMODAL EMBEDDINGS · 512-D · SHORT TEXT ONLY

PUNCH CUSTOM EMBEDDING MODEL · LONG-FORM TEXT + IMAGES, FUSED

Scene descriptions run long; 512 dimensions can't hold them next to imagery. The in-house model fuses both into one vector per moment — text finds frames, frames find text, cosine-ranked. Query images are sharpened and de-darkened before search.

FIG. 12 — PAST THE CEILING · THE MODEL YOU CAN'T BUY (DIMS PROPRIETARY)

PAST THE 512-D CEILING · IN DETAIL

One vector per moment — text and image, fused.

Many off-the-shelf multimodal embeddings output 512-dimensional vectors and were trained on short captions, so they can't hold a paragraph of scene description next to the frame it describes. We trained our own. Here's what it does that you can't buy.

01 · FUSION

SCENE TEXT · 380 TOKENS

FRAME PIXELS

↓

ONE FUSED VECTOR · 1,024-D [ 0.21, −0.08, 0.44, … ]

Long-form description and the frame collapse into a single dense vector — the text and what it depicts living in the same space.

02 · BOTH DIRECTIONS

TEXT QUERY FINDS FRAMES

IMAGE QUERY FINDS TEXT

cos θ = 0.94 · NEAREST WINS

Because both modalities share one space, a typed prompt retrieves the matching frames and a dropped still retrieves the matching descriptions — cosine-ranked, the same math each way.
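
The cosine ranking described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the 4-D vectors stand in for fused moment embeddings, the moment IDs are invented, and the `rank` helper is hypothetical rather than Pilotly's API. The point is that the same function serves a text query or an image query once both are embedded into the shared space.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as no match
    return dot / (norm_a * norm_b)


def rank(query_vec, index, top_k=3):
    """Return (moment_id, similarity) pairs, most similar first."""
    scored = [(moment_id, cosine(query_vec, vec)) for moment_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy index: invented IDs and 4-D vectors in place of real fused embeddings.
index = {
    "scene08/moment03": [0.9, 0.1, 0.0, 0.2],
    "scene02/moment11": [0.1, 0.8, 0.3, 0.0],
    "scene05/moment07": [0.0, 0.2, 0.9, 0.1],
}

# The query vector could come from embedded text or an embedded still;
# the ranking code does not need to know which.
query = [0.8, 0.2, 0.1, 0.1]
results = rank(query, index, top_k=2)
```

At production scale a brute-force loop like this would normally give way to an approximate nearest-neighbour index, but the ranking criterion stays the same.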

03 · QUERY CONDITIONING

RAW · DARK → SHARPENED

Query images get sharpened and de-darkened before they ever hit the index — so an underexposed reference still lands on the right moment instead of the nearest shadow.
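
One way to picture query conditioning is a gamma lift followed by an unsharp mask. The sketch below is an assumption on every specific: the page does not say which filters Pilotly uses, in what order, or with what parameters. It works on a grayscale image stored as nested lists of values in 0..1, and the function names are hypothetical.

```python
def de_darken(img, gamma=0.5):
    """Lift shadows with a gamma curve; gamma < 1 brightens dark pixels most."""
    return [[p ** gamma for p in row] for row in img]


def box_blur(img):
    """3x3 mean blur; pixels past the border are clamped to the edge."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            row.append(total / 9.0)
        out.append(row)
    return out


def sharpen(img, amount=1.0):
    """Unsharp mask: add back the difference between the image and its blur."""
    blurred = box_blur(img)
    return [
        [min(1.0, max(0.0, p + amount * (p - b))) for p, b in zip(row, blur_row)]
        for row, blur_row in zip(img, blurred)
    ]


def condition_query(img, gamma=0.5, amount=1.0):
    # Order is an assumption: brighten first so sharpening acts on visible detail.
    return sharpen(de_darken(img, gamma), amount)


# Toy underexposed frame: a dark region next to a slightly less dark one.
raw = [[0.04, 0.04, 0.16, 0.16] for _ in range(4)]
lifted = de_darken(raw)
conditioned = condition_query(raw)
```

On this toy frame the conditioned image is brighter overall and the step between the two regions is steeper than after the gamma lift alone, which is the property an embedding model needs to see structure in an underexposed still.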

FIG. 13 — THE EMBEDDING MODEL · FUSED MULTIMODAL VECTORS, COSINE-RANKED, QUERY-CONDITIONED (ARCHITECTURE PROPRIETARY)

WHAT THE EDITING BAY COULDN'T MEASURE

"Does this scene land?" was a gut call. On a 200-person production, gut calls are expensive.

Editors on major films screen a cut dozens of times asking the same questions: does the scene hold? Does the act sag? Does the sequence of scenes carry the arc? Pilotly wanted those answers while the editor works — not after the test screening.

Punch built the inference engine underneath: every scene is understood, embedded, and scored in real time — for its own efficacy, and for its place in the whole.

“Every scene scored while the editor works — not after the screening.”

240 ms · PER-SCENE INFERENCE (SAMPLE — CONFIRM)

>512 · DIMENSIONS — THE CEILING, BROKEN IN-HOUSE

2 · SCORES PER CUT — EVERY SCENE + THE WHOLE

WHAT WE BUILT

MODELS
✓ Apollo video-LLM integration
✓ Custom fused multimodal embedding model (1,024-D)
✓ Real-time scene efficacy scorer

PIPELINE
✓ Five-stage real-time inference pipeline
✓ Scene & moment indexing (12 scenes · 178 moments)
✓ Query-image conditioning — sharpen + de-darken

PRODUCT SURFACE
✓ Bi-directional cosine retrieval (text ↔ frames)
✓ Grounded RAG chat over the cut
✓ Editor-side inference API — ~240 ms / scene

SERVICES PROVEN
Real-time inference engineering
LLM systems & multimodal RAG
Custom model development

CLIENT
Pilotly — video intelligence for editors

STATUS
In final testing — and we say so.

IN THE NEWS
FUEL CYCLE · NOV 2024 · A deep dive into FCX’s newest partner: Pilotly.
INSIGHT PLATFORMS · 2024 · Pilotly — testing consumer response to films, trailers, and promos.
PILOT.LY · CURRENT · Audible, NBCUniversal, and Ovation TV test creative content on Pilotly.
