Secure Caption Pipeline Design

The ingest boundary is the highest-risk choke point in any caption workflow, and it fails in two directions at once: a malformed payload can corrupt a decoder, and a well-formed but non-compliant payload can carry a regulatory violation straight into playout. A secure ingest stage closes both gaps with quantified, machine-checkable limits before a single cue reaches the multiplexer: every payload is hashed (SHA-256), every upstream asset is signature-verified against a trusted key, encoding is normalized to UTF-8 with ≥99% byte-confidence before parsing, timecodes are validated against a ±2 frame (±66.7 ms at 29.97 fps) budget, and displayed text is held under a sustained 20 characters-per-second (cps) ceiling. A payload that misses any limit is quarantined with the failing frame range and clause id attached, never silently truncated.

This page is the hardened-ingest layer of the broader Broadcast Captioning Architecture & Compliance framework. It runs before the acceptance gates in the FCC Part 79 compliance checklist and the Ofcom code on subtitling standards — its job is to guarantee that whatever those gates receive is authenticated, well-encoded, and structurally sane, so a downstream validator never has to defend against a hostile input. Every limit below is read from the threshold reference table so the same code re-points at another jurisdiction by swapping a config row.

The gates run in order, cheapest and most security-critical first, inside a sandboxed worker. Every gate must pass before the asset reaches the downstream validators; any single failure drops it onto the fail rail into a write-once quarantine, tagged with the clause and frame range that tripped.

Problem framing

Caption files are executable payloads in disguise. A malformed SCC, SRT, or WebVTT document can trigger XML entity expansion, regex catastrophic backtracking, or buffer overruns in legacy CEA-608/708 decoders and playout automation. Treating the file as trusted text — opening it in the main process, decoding it with a guessed codec, parsing it with a backtracking regex — is the root cause of the majority of ingest-stage incidents. The design principle here is the opposite: the payload is hostile until proven otherwise, parsing happens in an isolated worker with a restricted filesystem, and provenance is established cryptographically before any business logic runs.

The compliance half of the problem is just as concrete. CEA-608’s physical channel sustains roughly 30 bytes of displayed text per second; in practice broadcasters hold a sustained ceiling of 20 cps so legacy decoders keep up and on-screen text stays readable. A burst over that ceiling does not throw an error on its own — it ships, fails QC downstream, and becomes a re-render or a reportable incident. The secure ingest stage measures instantaneous density with a sliding window and quarantines the burst at the boundary, where a fix costs a re-ingest instead of a recall. The character-rate dimension itself is enforced as a first-class gate in enforcing character rate limits in QC; this page treats it as one of several admission checks.

Pipeline stage & prerequisites

Secure ingest is the pre-parse, pre-validate boundary. It sits in front of the canonical cue model that the rest of the pipeline depends on: raw files arrive (SDI extraction, SMPTE ST 2110-40 ancillary, or file drop), and this stage decides whether each one is allowed to become cues at all. Normalization into {start_ms, end_ms, text, region} records is shared with SRT timestamp normalization, parsing SCC with Python libraries and WebVTT cue extraction and validation; the difference is that those run after admission, on input this stage has already proven safe. Continuous, runtime sync measurement is out of scope and belongs to automated sync drift detection.

Required tooling:

Tool / library	Version (tested)	Role at the ingest boundary
`cryptography`	42.0+	Ed25519 signature verification of upstream vendor payloads
`charset_normalizer`	3.3+	Encoding detection + UTF-8 normalization with confidence score
`pysrt`	1.1.2	Parse SRT cue onsets into structured timestamps
`webvtt-py`	0.4.6+	Parse WebVTT cues and placement cue settings
`ffprobe` (FFmpeg)	6.0+	Read stream timebase / first audio PTS for the sync reference
`numpy`	1.26+	Vectorized density and drift arithmetic
`pytest`	8.0+	Threshold regression fixtures for each gate
Python	3.10+	`mmap`, `fractions` for exact NTSC rationals, `concurrent.futures`

Step-by-step implementation

The gates run in order, cheapest and most security-critical first, so a hostile payload is rejected before it ever reaches the parser. Each step is a minimal working unit; in production they compose into one admit() call per asset.
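
The ordering rule above (cheapest and most security-critical first, stop at the first failure) can be sketched as a small gate runner. This is a minimal illustration, not the pipeline's actual admit() implementation: `GateResult`, `Gate`, and `run_gates` are hypothetical names, and each gate is assumed to return None on pass or a reason string on failure.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

@dataclass
class GateResult:
    admitted: bool
    failed_gate: Optional[str] = None      # name of the first gate that tripped
    reason: Optional[str] = None           # reason string returned by that gate
    passed: List[str] = field(default_factory=list)

# Hypothetical gate shape: (name, check); check returns None on pass,
# or a human-readable reason string on failure.
Gate = Tuple[str, Callable[[bytes], Optional[str]]]

def run_gates(payload: bytes, gates: List[Gate]) -> GateResult:
    """Run gates in list order and stop at the first failure."""
    passed: List[str] = []
    for name, check in gates:
        reason = check(payload)
        if reason is not None:
            # Short-circuit: later (more expensive) gates never see the payload
            return GateResult(False, name, reason, passed)
        passed.append(name)
    return GateResult(True, passed=passed)
```

Because the runner returns at the first failing gate, a parser placed later in the list never sees a payload that failed the type or signature check.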

Step 1 — Stage in isolation, hash, and sniff the type

Never open an untrusted caption file in the main application context. Read it as bytes through mmap (zero-copy, bounded), compute a content hash for the audit trail, and confirm the declared type matches the byte signature before anything parses it. The hash is the asset’s identity for the rest of the pipeline.

import hashlib
import mmap
from pathlib import Path

# Magic prefixes we accept at ingest; everything else is quarantined unparsed
CAPTION_SIGNATURES = {
    b"Scenarist_SCC V1.0": "scc",
    b"WEBVTT": "webvtt",
    b"\xef\xbb\xbfWEBVTT": "webvtt",   # UTF-8 BOM + WEBVTT header
}

def stage_and_identify(path: str, max_bytes: int = 256 * 1024 * 1024) -> dict:
    p = Path(path)
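    size = p.stat().st_size
    if size == 0:                              # mmap cannot map an empty file
        return {"sha256": None, "format": None, "admit": False,
                "reason": "empty payload"}
    if size > max_bytes:                       # bound memory before any read
        raise ValueError(f"payload {size} B exceeds ingest ceiling {max_bytes} B")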
    with open(p, "rb") as fh, mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        sha256 = hashlib.sha256(mm).hexdigest()
        head = bytes(mm[:64])
        fmt = next((f for sig, f in CAPTION_SIGNATURES.items()
                    if head.startswith(sig)), None)
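        # SRT has no magic number; fall back to a strict structural sniff,
        # skipping a UTF-8 BOM so BOM-prefixed SRT files are still recognized
        if fmt is None:
            first_line = head.removeprefix(b"\xef\xbb\xbf").lstrip().split(b"\n", 1)[0]
            if first_line.strip().isdigit():
                fmt = "srt"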
    if fmt is None:
        return {"sha256": sha256, "format": None, "admit": False,
                "reason": "unrecognized payload signature"}
    return {"sha256": sha256, "format": fmt, "size": size, "admit": True}

Step 2 — Verify the upstream signature

Captioning vendors and AI transcription services should sign their deliverables. Verify the detached signature against a pinned public key before normalization; a payload that fails verification is treated exactly like a malformed one. Ed25519 is the practical default — small keys, fast verification, no parameter footguns.

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey
from cryptography.exceptions import InvalidSignature

def verify_signature(payload: bytes, signature: bytes, public_key_raw: bytes) -> bool:
    # Zero-trust ingest: provenance is proven before parsing, not after
    key = Ed25519PublicKey.from_public_bytes(public_key_raw)  # 32-byte raw key
    try:
        key.verify(signature, payload)   # raises on any tamper/mismatch
        return True
    except InvalidSignature:
        return False

Step 3 — Normalize encoding with a confidence floor

A guessed codec is how BOM bytes get counted as content and accented glyphs turn into mojibake. Detect the encoding, require a confidence floor, and re-emit clean UTF-8 with the BOM stripped before the parser ever sees the text.

from charset_normalizer import from_bytes

ENCODING_CONFIDENCE_FLOOR = 0.99   # reject low-confidence guesses, don't ship mojibake

def normalize_to_utf8(raw: bytes) -> str:
    result = from_bytes(raw).best()
    if result is None:
        raise ValueError("no decodable encoding found")
    # chaos is 0..1 noise; 1 - chaos approximates detection confidence
    if (1.0 - result.chaos) < ENCODING_CONFIDENCE_FLOOR:
        raise ValueError(f"encoding confidence below {ENCODING_CONFIDENCE_FLOOR}")
    text = str(result)
    return text.lstrip("\ufeff")  # strip UTF-8 BOM so it isn't counted as a char
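
When the delivery manifest declares an encoding, a strict decode with that codec can serve as a fallback for short payloads where detection is unreliable. The sketch below uses only the standard library; `decode_with_hint` and its `declared` parameter are hypothetical names, and it assumes the manifest value is a codec name Python recognizes.

```python
import codecs

def decode_with_hint(raw: bytes, declared: str) -> str:
    """Strictly decode raw bytes with a manifest-declared codec (hypothetical helper)."""
    try:
        codec = codecs.lookup(declared)           # normalizes aliases such as "latin-1"
    except LookupError as exc:
        raise ValueError(f"unknown declared encoding {declared!r}") from exc
    # errors="strict" raises UnicodeDecodeError (a ValueError subclass) on any
    # byte sequence that is invalid in the declared codec
    text = raw.decode(codec.name, errors="strict")
    return text.lstrip("\ufeff")                  # drop a leading BOM if present
```

A strict decode only proves the bytes are valid in the declared codec. Single-byte codecs such as Latin-1 accept every byte value, so for those the hint is trusted rather than verified.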

Step 4 — Validate timecode against the frame budget

Once the text is clean, every cue onset is checked for a valid timecode and an in-budget offset from the audio reference. Derive frame duration from the exact NTSC rational, never a rounded literal, or the tolerance drifts over a long program.

import re
from fractions import Fraction
from dataclasses import dataclass

NTSC_FPS = Fraction(30000, 1001)             # SMPTE ST 12-1 — exact 29.97 rational
FRAME_MS = float(1 / NTSC_FPS * 1000)        # ~33.3667 ms per frame
SYNC_TOLERANCE_MS = 2 * FRAME_MS             # FCC 47 CFR § 79.1(j) — ±2 frame budget
TC_RE = re.compile(r"^\d{2}:\d{2}:\d{2}[:;]\d{2}$")   # ';' marks drop-frame

@dataclass
class TimecodeVerdict:
    drift_ms: float
    compliant: bool

def tc_to_ms(tc: str) -> float:
    if not TC_RE.match(tc):
        raise ValueError("timecode must be HH:MM:SS:FF or HH:MM:SS;FF")
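    h, m, s, f = map(int, re.split(r"[:;]", tc))
    frames = (h * 3600 + m * 60 + s) * 30 + f    # frame labels count at nominal 30 fps
    if ";" in tc:                                # drop-frame skips 2 labels per minute,
        minutes = h * 60 + m                     # except every tenth minute
        frames -= 2 * (minutes - minutes // 10)
    return frames * FRAME_MS                     # elapsed time at the exact 29.97 rate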

def validate_onset(cue_start_tc: str, audio_onset_ms: float) -> TimecodeVerdict:
    drift = tc_to_ms(cue_start_tc) - audio_onset_ms
    return TimecodeVerdict(round(drift, 3), abs(drift) <= SYNC_TOLERANCE_MS)
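
To make the "never a rounded literal" rule concrete, the arithmetic below works out how far two common shortcuts drift over one hour of frame labels (30 × 3600 = 108,000 frames). It is a worked example using exact rationals; `accumulated_error_ms` is a hypothetical helper name.

```python
from fractions import Fraction

NTSC_FPS = Fraction(30000, 1001)
EXACT_FRAME_MS = 1000 / NTSC_FPS                 # Fraction(1001, 30), about 33.3667 ms
FRAMES_PER_LABEL_HOUR = 30 * 3600                # 108,000 frame labels in 01:00:00:00

def accumulated_error_ms(frames: int, assumed_frame_ms: Fraction) -> float:
    """Error built up after `frames` frames when each is assumed to last assumed_frame_ms."""
    return float(frames * (assumed_frame_ms - EXACT_FRAME_MS))

# Shortcut 1: treat 29.97 fps as exactly 30 fps (frame = 1000/30 ms)
as_30fps_ms = accumulated_error_ms(FRAMES_PER_LABEL_HOUR, Fraction(1000, 30))   # -3600.0
# Shortcut 2: hard-code the frame duration as the rounded literal 33.37 ms
rounded_ms = accumulated_error_ms(FRAMES_PER_LABEL_HOUR, Fraction("33.37"))     # 360.0
# The whole sync budget, for comparison
budget_ms = float(2 * EXACT_FRAME_MS)                                           # about 66.73
```

Treating 29.97 as 30 fps puts timestamps 3.6 s early after an hour, and even the rounded 33.37 ms literal accumulates 360 ms, more than five times the ±2 frame budget.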

The audio reference comes from ffprobe reading packet presentation timestamps only — no decode — so the check stays cheap enough to run on every asset:

import subprocess, json

def first_audio_pts_ms(media_path: str) -> float:
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "a:0",
         "-show_entries", "packet=pts_time", "-read_intervals", "%+#1",
         "-of", "json", media_path],
        capture_output=True, text=True, check=True,
    )
    return float(json.loads(out.stdout)["packets"][0]["pts_time"]) * 1000.0

Step 5 — Enforce the character-density ceiling

Measure displayed-character density for each cue and reject any cue over the cps ceiling. Count only visible characters, and floor the cue duration at one frame so a zero-length cue cannot divide by zero. The per-cue check below is the strictest form of the gate; a sustained-rate check averages the same count over a multi-cue sliding window.

import numpy as np

CPS_LIMIT = 20.0          # sustained operational ceiling for legacy 608 decoders
CPS_HARD_MAX = 30.0       # CEA-608 physical channel byte budget — never exceed

def density_cps(text: str, start_ms: float, end_ms: float) -> float:
    duration_s = max(end_ms - start_ms, FRAME_MS) / 1000.0
    visible = sum(1 for ch in text if not ch.isspace())   # exclude spaces, tabs, line breaks
    return visible / duration_s

def density_verdict(cues: list[dict]) -> dict:
    rates = np.array([density_cps(c["text"], c["start_ms"], c["end_ms"]) for c in cues])
    over = np.where(rates > CPS_LIMIT)[0]
    return {
        "peak_cps": round(float(rates.max()), 2) if rates.size else 0.0,
        "over_indices": over.tolist(),
        "compliant": bool(over.size == 0),
    }
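
The sustained rate can also be measured over a sliding window rather than per cue. The sketch below is one possible formulation, not the pipeline's reference implementation: `sustained_cps` and `visible_chars` are hypothetical names, the 5-second default window is an assumption rather than a documented threshold, and windows are anchored at each cue's start time.

```python
def visible_chars(text: str) -> int:
    """Count characters that render on screen (all whitespace excluded)."""
    return sum(1 for ch in text if not ch.isspace())

def sustained_cps(cues: list, window_ms: float = 5000.0) -> list:
    """For each cue, the average visible chars/s of cues starting within window_ms of it.

    window_ms is an assumed tuning parameter and must be > 0.
    """
    cues = sorted(cues, key=lambda c: c["start_ms"])
    counts = [visible_chars(c["text"]) for c in cues]
    rates = []
    hi = 0          # index one past the last cue inside the current window
    running = 0     # visible chars of cues currently inside the window
    for lo, cue in enumerate(cues):
        window_end = cue["start_ms"] + window_ms
        while hi < len(cues) and cues[hi]["start_ms"] < window_end:
            running += counts[hi]
            hi += 1
        rates.append(running / (window_ms / 1000.0))
        running -= counts[lo]   # slide: the anchor cue leaves the next window
    return rates
```

With this formulation a single 21-character line in an otherwise quiet five seconds averages 4.2 cps, while five such lines back to back average 21 cps and exceed a 20 cps ceiling.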

Step 6 — Fold the gates into one admission verdict

The gates collapse into a single machine-readable record the media asset management (MAM) system consumes over REST or gRPC. Any failing gate sets pipeline_action to quarantine and records the source hash for the audit trail.

import json

def build_admission(sha256, fmt, sig_ok, enc_ok, sync, density) -> str:
    checks = {
        "format_recognized": fmt is not None,
        "signature_ok": sig_ok,
        "encoding_ok": enc_ok,
        "sync_drift_ms": sync.drift_ms,
        "sync_compliant": sync.compliant,
        "peak_cps": density["peak_cps"],
        "density_compliant": density["compliant"],
    }
    admitted = all([fmt is not None, sig_ok, enc_ok,
                    sync.compliant, density["compliant"]])
    return json.dumps({
        "source_hash": sha256,
        "format": fmt,
        "admission_status": "ADMITTED" if admitted else "REJECTED",
        "pipeline_action": "ROUTE_TO_VALIDATE" if admitted else "QUARANTINE",
        "checks": checks,
    }, indent=2)

The architectural detail of running these gates inside a sandboxed, memory-bounded worker — mmap-backed I/O, generator-driven batching, OOM protection — is covered in the child page, building a compliant caption ingestion gateway. The format-translation matrix where most density and sync gates actually trip is documented in SCC vs SRT vs WebVTT architecture.

Threshold reference table

Every limit the gates read lives here, not in prose, so a config swap re-points the same code at a different timebase or jurisdiction.

Parameter	Value	Spec / source	Notes
Sustained density ceiling	20 cps	Industry operational practice	Conservative limit for legacy 608 decoders
Hard density maximum	~30 bytes/s	ANSI/CTA-608-E channel budget	Physical 608 channel bandwidth — never exceed
Sync tolerance	±2 frames (±66.7 ms @ 29.97)	47 CFR § 79.1(j)	Tightens to ±1 frame on 59.94/60 fps progressive
Frame duration (29.97)	33.3667 ms	SMPTE ST 12-1	Use `Fraction(30000,1001)`, never a rounded literal
Encoding confidence floor	≥ 99%	Ingest policy	Reject low-confidence detections; strip BOM
Max payload size	256 MB (configurable)	Ingest policy	Bounds memory before any read
Signature algorithm	Ed25519	RFC 8032	32-byte raw public key, detached signature
Content hash	SHA-256	FIPS 180-4	Asset identity + audit-trail key
Drop-frame marker	`;` in timecode	SMPTE ST 12-1	`:` = non-drop, `;` = drop-frame
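
One way to hold these rows in code is a frozen dataclass per profile, with derived values computed from the exact frame rate. This is a sketch under stated assumptions: the `Thresholds` class and the profile keys are hypothetical, and the 59.94p profile reuses the 20 cps ceiling only as a placeholder, since the table specifies just the sync change for that timebase.

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True)
class Thresholds:
    fps: Fraction                      # exact rational frame rate
    sync_frames: int                   # allowed onset offset, in frames
    cps_limit: float                   # sustained density ceiling
    encoding_confidence_floor: float = 0.99
    max_payload_bytes: int = 256 * 1024 * 1024

    @property
    def frame_ms(self) -> float:
        return float(1000 / self.fps)

    @property
    def sync_tolerance_ms(self) -> float:
        # Computed from the rational so no rounded literal enters the budget
        return float(self.sync_frames * 1000 / self.fps)

# Hypothetical profile names; values taken from the table above
PROFILES = {
    "ntsc_29_97": Thresholds(Fraction(30000, 1001), sync_frames=2, cps_limit=20.0),
    "ntsc_59_94p": Thresholds(Fraction(60000, 1001), sync_frames=1, cps_limit=20.0),
}
```

Gates that read `PROFILES[name].sync_tolerance_ms` instead of a module constant pick up the tighter 59.94p budget without code changes.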

Verification & test pattern

Lock each threshold behind a pytest fixture so a config change that loosens a gate fails CI rather than silently admitting a bad asset. Test both sides of every boundary — exactly at the limit must pass, one step beyond must fail.

import pytest
from ingest import (validate_onset, FRAME_MS, density_cps,
                    CPS_LIMIT, verify_signature)
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

@pytest.fixture
def two_frames_ms():
    return 2 * FRAME_MS                      # ±66.7 ms reference boundary

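def test_onset_at_boundary_admits(two_frames_ms):
    # Reference at 0.0 keeps the drift bit-exact; 1000.0 + x - 1000.0 rounds
    # one ulp above the tolerance in float64 and would fail at the boundary
    v = validate_onset_from_ms(two_frames_ms, 0.0)
    assert v.compliant is True              # exactly ±2 frames passes (§ 79.1(j))

def test_onset_one_ms_over_boundary_rejected(two_frames_ms):
    v = validate_onset_from_ms(two_frames_ms + 1.0, 0.0)
    assert v.compliant is False             # beyond ±2 frames must fail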

def test_density_just_over_ceiling_rejected():
    # 21 visible chars in 1.0 s = 21 cps > 20 cps ceiling
    assert density_cps("A" * 21, 0.0, 1000.0) > CPS_LIMIT

def test_signature_roundtrip_then_tamper():
    priv = Ed25519PrivateKey.generate()
    pub = priv.public_key().public_bytes_raw()
    payload = b"WEBVTT\n\n00:00:01.000 --> 00:00:02.000\nhello\n"
    sig = priv.sign(payload)
    assert verify_signature(payload, sig, pub) is True
    assert verify_signature(payload + b"x", sig, pub) is False   # tamper -> reject

# Test helper: validate against a precomputed onset in ms
def validate_onset_from_ms(start_ms, audio_ms):
    from ingest import TimecodeVerdict, SYNC_TOLERANCE_MS
    drift = start_ms - audio_ms
    return TimecodeVerdict(round(drift, 3), abs(drift) <= SYNC_TOLERANCE_MS)

Troubleshooting / failure modes

Signature verification fails on every vendor delivery : Root cause: the public key was loaded from a PEM/DER wrapper while from_public_bytes expects 32 raw bytes, or the vendor signs the normalized text while you verify the raw bytes. Fix: pin the exact raw key format and agree with the vendor on whether the signature covers pre- or post-normalization bytes — verify the same bytes they signed (Step 2, before normalization).

Encoding gate rejects valid Latin-1 SCC sidecars : Root cause: short payloads give charset_normalizer too few bytes for a confident decision, so confidence falls below the 99% floor. Fix: lower the floor for sub-kilobyte files, or supply an encoding hint from the delivery manifest instead of relying on detection alone.

Density gate fires on legitimate fast dialogue : Root cause: counting whitespace or measuring a single short cue in isolation instead of a sliding window over the sustained rate. Fix: exclude whitespace (already done in density_cps) and average over a multi-cue window so a single rapid line does not trip the sustained ceiling.

Drop-frame timecode arithmetic off by ~3.6 s per hour : Root cause: treating 29.97 as 30 fps or ignoring the ; drop-frame marker. Fix: derive frame duration from Fraction(30000, 1001) and branch counting on the separator, as in Step 4.

mmap raises ValueError: cannot mmap an empty file : Root cause: a zero-byte drop from a truncated upload. Fix: check st_size > 0 before mapping and quarantine empties as malformed rather than letting the call raise.

BOM bytes counted as a visible character, tipping a cue over the cps limit : Root cause: normalizing without stripping the UTF-8 BOM. Fix: lstrip("\ufeff") after decode (Step 3); the deeper encoding-repair patterns live in fixing UTF-8 encoding errors in SCC files.

Operational notes

At batch scale the cost of this stage is dominated by the ffprobe subprocess and signature verification, not by the Python gates, so the throughput lever is process concurrency around the subprocess and crypto calls rather than vectorization. Size the worker pool to roughly the number of physical cores; each worker mmaps one payload, holds the cue list plus a single reference frame, and frees the mapping as soon as the verdict is written, keeping per-worker resident size well under 100 MB even for feature-length programs. Run each worker with a restricted filesystem view (a read-only mount of the drop directory plus a write-only quarantine path) so a parser exploit cannot reach the rest of the host. The generator-driven batching that makes this safe is detailed in async batch caption processing.

Keep every gate stateless and idempotent: a given payload and key must always yield the same verdict, which lets the QC layer re-ingest a quarantined asset after a fix without special-casing. Write every admission record — admitted or rejected — to an append-only (WORM) store keyed by the SHA-256 hash; quarantined files are versioned and locked, never deleted, so chain-of-custody survives the asset itself. Wire the same verdict schema into CI/CD gating for caption builds so a non-compliant asset fails the build instead of reaching distribution.
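
The record-keeping half of that paragraph can be sketched with an append-only JSON Lines file keyed by the source hash. This is only an illustration: `append_record` and `records_for` are hypothetical helpers, and opening with `O_APPEND` prevents in-place rewrites through this code path but is not a substitute for WORM enforcement at the storage layer.

```python
import json
import os

def append_record(store_path: str, record: dict) -> None:
    """Append one admission record as a JSON line; existing lines are never rewritten."""
    line = json.dumps(record, sort_keys=True) + "\n"
    fd = os.open(store_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o640)
    try:
        os.write(fd, line.encode("utf-8"))
        os.fsync(fd)                     # flush to disk before reporting success
    finally:
        os.close(fd)

def records_for(store_path: str, sha256: str) -> list:
    """Return every record for one asset, oldest first (re-ingests add new lines)."""
    with open(store_path, encoding="utf-8") as fh:
        return [r for r in map(json.loads, fh) if r.get("source_hash") == sha256]
```

A re-ingest after a fix appends a second record under a new hash, and a re-run on the same bytes appends an identical verdict, so the history of each asset stays readable without any update-in-place.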

Frequently asked questions

Why verify signatures at ingest instead of trusting TLS in transit? TLS protects the hop, not the artifact. A file can arrive over a valid TLS channel from a compromised vendor account, or be rewritten on a shared landing volume after delivery. An Ed25519 signature over the payload bytes proves the artifact itself is untampered, independent of how it travelled.

Is the 20 cps ceiling a hard regulatory number? No. The hard limit is the CEA-608 channel’s ~30 bytes/s physical budget; 20 cps is the conservative operational ceiling broadcasters apply so legacy decoders keep up and text stays readable. Codify it as a config value, not a constant, so premium or accessibility-focused deliveries can tighten it.

Should normalization run before or after signature verification? Verify first, on the exact bytes the vendor signed. Normalizing before verification changes the bytes and invalidates the signature. Once provenance is proven, normalize the verified payload and carry the original hash forward as the audit identity.

Building a compliant caption ingestion gateway — Memory-safe, sandboxed worker that runs these gates at archive scale.
FCC Part 79 compliance checklist — The US acceptance gates that receive whatever this stage admits.
SCC vs SRT vs WebVTT architecture — Format translation matrix where most density and sync gates actually trip.
Ofcom code on subtitling standards — The UK threshold set to swap in for multi-territory ingest.
Enforcing character rate limits in QC — The character-rate dimension treated here as one admission check, enforced as a standalone gate.

Part of: Broadcast Captioning Architecture & Compliance.
