Integrating caption QC into GitHub Actions CI

The exact failure mode this page solves is a caption QC step that passes on a workstation but is killed with SIGKILL on a GitHub-hosted runner, letting an unvalidated SCC, SRT or WebVTT artifact reach mux. GitHub’s ubuntu-latest runners expose roughly 7 GB of RAM, and that budget is shared by the OS, dependency resolution and every step in the job (each matrix job runs on its own VM, so the ceiling applies per job); a parser that loads an entire broadcast package into a Python list can breach that ceiling, and the job is OOM-killed before it can assert a single threshold. The contract a runnable gate must honour is the one defined by the parent CI/CD gating for caption builds step: bounded resident memory (target under 256 MB), frame-accurate timecode under the FCC 47 CFR § 79.1 ±2-frame sync tolerance, and a non-zero exit code whenever any cue breaches the CEA-608 reading-rate ceiling. Anything weaker produces false passes that surface only at an FCC Part 79 compliance audit or a broadcaster ingest rejection.

The gate script: streaming, frame-accurate SCC validation

# caption_gate.py — stdin SCC -> exit 0 (pass) / 1 (compliance fail) / 2 (misuse)
import json
import re
import sys
from collections import deque
from dataclasses import dataclass
from typing import Iterator


@dataclass
class CaptionCue:
    start_frame: int
    text: str
    raw_line: str


def parse_timecode_to_frames(tc: str, fps: float = 29.97, drop_frame: bool = True) -> int:
    """HH:MM:SS;FF -> integer frames. Integer math keeps SMPTE ST 12-1 drop-frame exact."""
    m = re.match(r"(\d{2}):(\d{2}):(\d{2})[;:](\d{2})", tc)
    if not m:
        raise ValueError(f"Invalid timecode: {tc}")
    h, mi, s, f = (int(m.group(i)) for i in range(1, 5))
    nominal = round(fps)                                   # 30 for 29.97 NTSC
    frames = (h * 3600 + mi * 60 + s) * nominal + f
    if drop_frame and abs(fps - 29.97) < 0.01:
        # SMPTE ST 12-1 drop-frame: drop 2 frames/min except every 10th minute
        total_min = h * 60 + mi
        frames -= 2 * (total_min - total_min // 10)
    return frames


def stream_parse_scc(fh, fps: float = 29.97) -> Iterator[CaptionCue]:
    """Generator parser — constant memory regardless of package size."""
    ctrl = re.compile(r"\b(9420|9425|9426|9427|9428|942[b-f]|8080)\b", re.I)  # CEA-608 control codes
    hexp = re.compile(r"\b([0-9a-fA-F]{4})\b")
    tcre = re.compile(r"(\d{2}:\d{2}:\d{2}[;:]\d{2})\s+(.*)")
    for line in fh:                                        # one line at a time, never .read()
        line = line.strip()
        if not line or line.lower().startswith("scenarist_scc"):
            continue
        m = tcre.match(line)
        if not m:
            continue
        chars = []
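        for word in hexp.findall(ctrl.sub("", m.group(2))):
            hi, lo = (int(word, 16) >> 8) & 0x7F, int(word, 16) & 0x7F   # mask off the parity bit
            if 0x10 <= hi <= 0x1F:
                continue                                   # control pair (PAC, mid-row, ...): lo is a parameter, not text;
                                                           # special/extended glyphs in this range are not counted
            if hi >= 0x20:
                chars.append(chr(hi))
            if lo >= 0x20:
                chars.append(chr(lo))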
        text = "".join(chars).strip()
        if text:
            yield CaptionCue(parse_timecode_to_frames(m.group(1), fps), text, line)


def validate(cues: Iterator[CaptionCue], fps: float = 29.97,
             max_cps: float = 20.0, window_s: float = 1.0) -> list[str]:
    """Sliding-window reading-rate check. CEA-608 caps usable throughput near 20 cps."""
    out, win, wf = [], deque(), int(window_s * fps)        # FCC 79.1 readability proxy
    for c in cues:
        win.append((c.start_frame, len(re.sub(r"\s+", "", c.text))))
        while win and (c.start_frame - win[0][0]) >= wf:
            win.popleft()
        rate = sum(n for _, n in win) / window_s
        if rate > max_cps:
            out.append(f"frame {c.start_frame}: {rate:.1f} cps > {max_cps} cps")
    return out


if __name__ == "__main__":
    try:
        violations = validate(stream_parse_scc(sys.stdin))
    except ValueError as exc:                              # malformed input = gate misuse, not a content fail
        print(f"gate error: {exc}", file=sys.stderr)
        sys.exit(2)
    report = {"violations": violations, "count": len(violations)}
    with open("qc_report.json", "w", encoding="utf-8") as r:
        json.dump(report, r, indent=2)
    sys.exit(1 if violations else 0)                       # non-zero blocks the merge/deploy

Code walkthrough

parse_timecode_to_frames is the foundation of every other check. Floating-point seconds (datetime, float(...)) accumulate IEEE 754 rounding that compounds across thousands of cues, so the function converts HH:MM:SS;FF straight to an integer frame count. The drop-frame branch implements SMPTE ST 12-1 exactly — two frames are skipped each minute except every tenth minute — which is what keeps NTSC 29.97 timecode from drifting the ~3.6 seconds per hour that otherwise trips the ±2-frame tolerance enforced by automated sync drift detection.
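
As a worked check of the drop-frame arithmetic described above, the sketch below restates the conversion in a compact, standalone form and uses exact fractions to reproduce the ~3.6 s/hour figure. The helper name df_to_frames is hypothetical, and it assumes well-formed 29.97 fps drop-frame timecode only.

```python
import re
from fractions import Fraction


def df_to_frames(tc: str) -> int:
    """29.97 fps drop-frame HH:MM:SS;FF -> frame count (illustrative helper)."""
    # Assumes a well-formed timecode; fullmatch returns None otherwise.
    h, m, s, f = map(int, re.fullmatch(r"(\d{2}):(\d{2}):(\d{2});(\d{2})", tc).groups())
    total_min = h * 60 + m
    nominal = (h * 3600 + m * 60 + s) * 30 + f
    # Two frame numbers are skipped each minute, except every tenth minute.
    return nominal - 2 * (total_min - total_min // 10)


one_hour = df_to_frames("01:00:00;00")       # frames elapsed at one hour of timecode
uncorrected = 3600 * 30                      # what a naive 30 fps count would give
frame_seconds = Fraction(1001, 30000)        # exact duration of one NTSC frame
drift_seconds = (uncorrected - one_hour) * frame_seconds
print(one_hour, float(drift_seconds))
```

One hour of drop-frame timecode maps to 107892 frames; the 108-frame gap against the naive count is the roughly 3.6 seconds of drift the correction removes.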

stream_parse_scc is a generator, and that is the entire point on a CI runner. It reads the file object line by line and yields one CaptionCue at a time, so resident memory is a function of a single line plus the sliding window, not of package size. The ctrl pattern strips the listed CEA-608 control codes (such as 9420, resume caption loading for pop-on captions, and 942f, end of caption) before the remaining hex pairs are masked to 7 bits and decoded as ASCII, so lines made up only of those codes never count toward text density. Because the function takes a file object rather than a path, the gate reads from sys.stdin and never holds the whole artifact in memory.
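
To make the decode step concrete, here is a minimal standalone sketch of masking SCC hex words to 7 bits. The function name decode_scc_words is hypothetical. It skips any pair whose first byte is in the control range 0x10 to 0x1F, treating the second byte as a parameter rather than text, and it reads the remaining bytes as plain ASCII, as the gate does.

```python
def decode_scc_words(payload: str) -> str:
    """Decode space-separated SCC hex words to text, skipping control pairs."""
    chars = []
    for word in payload.split():
        value = int(word, 16)
        hi, lo = (value >> 8) & 0x7F, value & 0x7F    # drop the parity bit
        if 0x10 <= hi <= 0x1F:
            continue                                  # control pair: not visible text
        chars.extend(chr(b) for b in (hi, lo) if b >= 0x20)
    return "".join(chars)


# 9420 = resume caption loading, 9470 = a preamble address code, 80 = padding
print(decode_scc_words("9420 9470 c8e5 ecec ef80"))   # prints "Hello"
```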

validate implements the reading-rate assertion over a collections.deque bounded by the time window. Each cue’s visible (whitespace-stripped) character count enters the window; entries older than window_s are popped from the left in O(1). When the windowed characters-per-second total exceeds max_cps, a violation string is appended. The 20 cps default is the practical CEA-608 throughput ceiling that FCC 47 CFR § 79.1 readability expectations are measured against; with denser captions, decoders drop characters or stutter.
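
The window arithmetic can be traced on a handful of cues. The sketch below is a simplified variant, not the gate itself: the name windowed_rates is hypothetical, cues are plain (start_frame, text) tuples, and a running total replaces the per-cue sum.

```python
from collections import deque


def windowed_rates(cues, fps=29.97, window_s=1.0):
    """Yield (start_frame, chars-per-second) for each (start_frame, text) cue."""
    window, total = deque(), 0
    span = int(window_s * fps)                  # window length in whole frames
    for start, text in cues:
        n = len("".join(text.split()))          # visible characters only
        window.append((start, n))
        total += n
        while window and start - window[0][0] >= span:
            total -= window.popleft()[1]        # evict cues that left the window
        yield start, total / window_s


cues = [(0, "ABCD EFGH"), (10, "ABCD EFGH"), (20, "ABCD EFGH"), (40, "ABCD EFGH")]
rates = [rate for _, rate in windowed_rates(cues)]
print(rates)   # [8.0, 16.0, 24.0, 16.0]
```

The third cue pushes the window to 24 cps, above the 20 cps default, and the rate falls back once the first two cues age out of the 29-frame span.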

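The __main__ block encodes the three-state exit contract the parent cluster defines: a ValueError is treated as misuse and exits 2 so the failure routes to engineering, a populated violations list is a compliance failure and exits 1 to block the merge, and a clean track exits 0. As written, lines that do not match the timecode pattern are skipped by stream_parse_scc rather than raised, so exit 2 fires mainly on input that cannot be decoded (UnicodeDecodeError is a ValueError subclass). qc_report.json is written on both the pass and the compliance-fail paths, so the workflow can upload it as an immutable audit artifact even when the gate fails; the misuse path exits before the report is written.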
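
If the audit trail should also cover the misuse path, one option is to route all three outcomes through a single helper that writes the report before returning the exit code. This is a sketch under assumptions: the finish function and the "status" and "error" report keys are hypothetical additions, not part of the contract described above.

```python
import json
from pathlib import Path
from typing import Optional

EXIT_CODES = {"pass": 0, "fail": 1, "misuse": 2}


def finish(violations: list, error: Optional[str] = None,
           report_path: str = "qc_report.json") -> int:
    """Write the report for every outcome and return the matching exit code."""
    status = "misuse" if error else ("fail" if violations else "pass")
    report = {
        "status": status,              # hypothetical key, added for illustration
        "violations": violations,
        "count": len(violations),
        "error": error,                # hypothetical key, added for illustration
    }
    Path(report_path).write_text(json.dumps(report, indent=2), encoding="utf-8")
    return EXIT_CODES[status]
```

In the gate, the except branch would then call sys.exit(finish([], error=str(exc))) and the normal path sys.exit(finish(violations)), so the upload step always finds a file.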

Threshold reference table

Parameter	Value	Source / clause
Sync tolerance	±2 frames	FCC 47 CFR § 79.1
Reading-rate ceiling	20 cps (window 1.0 s)	CEA-608 usable throughput
Drop-frame correction	2 frames/min, skip every 10th	SMPTE ST 12-1
Resident-memory target	< 256 MB	GitHub `ubuntu-latest` (~7 GB shared)
Audit artifact retention	90–365 days	Broadcaster audit policy
Exit codes	0 pass / 1 fail / 2 misuse	CI/CD gate contract

Wiring the gate into a GitHub Actions workflow

The job pipes the artifact into caption_gate.py over stdin, lets the exit code gate the merge, and uploads qc_report.json with if: always() so the audit trail survives a failing run.

name: Caption QC Pipeline
on:
  push:
    paths: ['captions/**.scc', 'captions/**.srt', 'captions/**.vtt']
  pull_request:
    paths: ['captions/**.scc', 'captions/**.srt', 'captions/**.vtt']

jobs:
  validate-captions:
    runs-on: ubuntu-latest
    env:
      PYTHONDONTWRITEBYTECODE: "1"
    steps:
      - uses: actions/checkout@v5
      - uses: actions/setup-python@v6
        with:
          python-version: '3.12'
          cache: pip
      - name: Run caption QC gate
        run: python caption_gate.py < captions/program_001.scc
        # exit 1 fails the step; continue-on-error defaults to false, so the merge is blocked
      - name: Upload compliance audit
        uses: actions/upload-artifact@v4
        if: always()                       # keep the report even when the gate fails
        with:
          name: caption-qc-audit
          path: qc_report.json
          retention-days: 90

The job fails as soon as the gate step exits non-zero. Downstream packaging jobs that list validate-captions under needs: are then skipped, and once the job is marked as a required status check, the branch protection rule refuses the merge. For libraries of thousands of assets, replace the single invocation with a strategy.matrix that shards captions/ by hash prefix or program ID across parallel runners (the same deterministic partitioning used in async batch caption processing), so each runner holds only its shard and no two jobs race the same file.
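
A minimal sketch of hash-based sharding, assuming each matrix job is told its own index and the total shard count (how those values are passed in is left open here, and the function names are hypothetical). hashlib is used rather than the built-in hash() because the latter is salted per process for strings and would assign files differently on each runner.

```python
import hashlib


def shard_for(path: str, shard_count: int) -> int:
    """Map a repository-relative path to a stable shard index in [0, shard_count)."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def files_for_shard(paths, shard_index: int, shard_count: int) -> list:
    """Return the sorted subset of paths owned by one shard."""
    return sorted(p for p in paths if shard_for(p, shard_count) == shard_index)


paths = [f"captions/program_{i:03d}.scc" for i in range(50)]
shards = [files_for_shard(paths, i, 4) for i in range(4)]
```

Every path lands in exactly one shard, so the union of the shards is the full set and no file is validated twice.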

Edge cases & known gotchas

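Drop-frame vs non-drop: the gate applies the drop-frame correction only when fps is within 0.01 of 29.97 and drop_frame is true. Non-drop 29.97 timecode (a colon before the frame field instead of a semicolon) must be parsed with drop_frame=False, and packages authored at 25 fps (PAL) or 23.976 fps must pass the matching fps; otherwise the conversion injects phantom drift and produces false reading-rate windows. As written, stream_parse_scc forwards only fps, so drop_frame has to be threaded through to parse_timecode_to_frames.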
BOM and encoding: a UTF-8 BOM or mojibake header on the first SCC line shifts the scenarist_scc sniff; normalise encoding first per fixing UTF-8 encoding errors in SCC files.
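Runner version skew: when a local pass fails on CI for reasons other than memory, a Python or dependency version difference that changes rounding or format detection is a common cause. Pin exact versions in the lockfile; cache: pip on setup-python restores pip's download cache between runs and expects a requirements.txt or pyproject.toml in the repository to key the cache.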
always() is mandatory: without if: always() on the upload step, a failing gate (exit 1) skips the artifact upload and you lose the very report the audit needs.
Empty/control-only tracks: a file of pure control codes yields zero CaptionCues and exits 0 — pair the gate with a presence check if “captions exist at all” is itself a requirement.
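
For the drop-frame and BOM items above, one possible pre-check reads the separator before the frame field: SCC writes a semicolon for drop-frame and a colon for non-drop timecode. The helper name drop_frame_flag is hypothetical; its result could be passed as drop_frame when converting that line's timecode.

```python
import re
from typing import Optional

TIMECODE = re.compile(r"^\d{2}:\d{2}:\d{2}([;:])\d{2}\b")


def drop_frame_flag(line: str) -> Optional[bool]:
    """True for ';' (drop-frame), False for ':' (non-drop), None if not a timecode line."""
    # A leading UTF-8 BOM is not whitespace, so remove it explicitly.
    m = TIMECODE.match(line.lstrip("\ufeff").strip())
    return None if m is None else m.group(1) == ";"


print(drop_frame_flag("00:00:01;00\t9420 9420"))   # True
print(drop_frame_flag("00:00:01:00\t9420 9420"))   # False
print(drop_frame_flag("\ufeffScenarist_SCC V1.0"))  # None
```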

Integration hook

This script is the runner-side implementation of one step in the broader gate. It consumes the normalised cue stream and emits the exact exit-code-plus-JSON contract that CI/CD gating for caption builds specifies, slotting in immediately after caption generation and before mux. The qc_report.json it writes is the unit of record that scheduled QC report generation later aggregates into drift and reading-rate trend dashboards.

Frequently asked questions

Why does the gate pass locally but get killed on the runner? A path-based parser that calls .read() loads the whole package into a list; on a ~7 GB runner that can trigger an OOM SIGKILL before any threshold runs. The streaming generator above keeps resident memory bounded, so input that was OOM-killed under a whole-file parser completes when piped over stdin.
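
The difference can be observed locally with tracemalloc from the standard library. The sketch below compares peak traced allocations for a whole-file consumer and a streaming one over synthetic lines. All names here are hypothetical, and tracemalloc reports Python-level allocations rather than resident memory, so read the numbers as relative, not as the runner's actual footprint.

```python
import tracemalloc


def synthetic_lines(n):
    """Generate n SCC-like lines without holding them in memory."""
    for i in range(n):
        yield f"00:00:{i % 60:02d};00\t9420 c8e5 ecec ef80\n"


def whole_file(lines):
    return sum(len(line) for line in list(lines))   # materialises every line first


def streaming(lines):
    return sum(len(line) for line in lines)         # holds one line at a time


def peak_bytes(consume, n=50_000):
    """Peak bytes allocated while `consume` processes n synthetic lines."""
    tracemalloc.start()
    consume(synthetic_lines(n))
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


list_peak = peak_bytes(whole_file)
stream_peak = peak_bytes(streaming)
print(f"whole-file peak: {list_peak} bytes, streaming peak: {stream_peak} bytes")
```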

What exit code should the step return? 0 to proceed, 1 for a genuine compliance failure that blocks the merge or deploy, and 2 for malformed input or gate misuse so the failure routes to engineering rather than the captioning team.

How do I stop drop-frame timecode causing false drift failures? Confirm the source framerate and pass the correct drop_frame flag; left unhandled, NTSC drop-frame timecode injects ~3.6 s/hour of phantom drift that trips the ±2-frame tolerance.

Related pages

Enforcing FCC character rate limits programmatically: the rolling cps/WPM math behind this gate’s reading-rate threshold.
Detecting sync drift in automated QC pipelines: the PTS-alignment check that complements the frame-accurate timecode here.
CI/CD gating for caption builds: the parent gate, covering the exit-code contract, format matrix and threshold rule set.
