PUBLIC API · v1.0

Score voice annotation quality.

Public API that grades Hindi/English voice-annotation transcripts against gold references using the Smallest ASR Labelling Framework. Deterministic HARD rules plus LLM-judged SOFT rules. Per-clip detail and batch averages.

#Overview

Built for annotation vendors and ML teams shipping Indic ASR datasets. Gives you a defensible per-clip score, the rule that was broken, and a batch-level pass rate — all in a single JSON call.

What it does

  • Compares an annotator's JSON transcript against a gold JSON for the same audio
  • Runs ~32 HARD rules (regex / Unicode-script / whitelist) — no judgment, deterministic
  • Runs ~14 SOFT rules through an LLM-judge with a 0.85 confidence threshold
  • Returns a score in [0, 1], a pass/fail (default threshold 0.97), every violation, and the math behind the score
  • Supports single-pair and batch scoring, with per-clip + aggregate stats

What it's not

  • Not a transcription service — bring your own annotator + gold pair
  • Not language-detection — assumes the framework's Hindi+English (code-switched) target
  • Not authoritative — final acceptance is the customer's call; this is a QC signal

Cost: ~$0.0005 per file (about 5 LLM calls, made only when SOFT rules genuinely disagree). Most files finish in <500 ms.

#Categories — WER, Punctuation, Tags

Every score response groups violations under three top-level buckets. The portal renders them as three cards; the API returns them under breakdown.categories. Each bucket has its own score, pass flag, and violation list.

| Bucket | What it covers | Rules |
| --- | --- | --- |
| wer | Lexical accuracy. Word-level correctness: script enforcement, numbers spelled out, dates, acronyms, currency, percentages, third-language removal, structural sanity. Includes the blended Latin-WER / Indic-CER number. | 1.1, 1.2, 1.3, 1.4, 1.7, 2.x, 4, structural |
| punctuation | Rule 1.6 family. Banned characters (; : " ' @ % $ ₹ # &), full-stop choice (/./?/!), comma usage, question marks. | 1.6 |
| tags | Speaker labels & markup. Every bracketed or angle-bracketed marker: fillers, speaker diarization, stutters, truncations, singing, in-text actions, segment emotion, noise tags, whisper. | 1.5, 3.x, 5.x, 6, 7.x, 8.x |

Per-category scoring

wer:    max(0, min(1, (1 − blended_error) − 0.01·n_hard))
punctuation:    max(0, min(1, 1 − 0.05·n_hard − 0.025·n_soft))
tags:    max(0, min(1, 1 − 0.05·n_hard − 0.025·n_soft))

Each bucket has its own passed flag (threshold 0.97). The top-level score is the original composite — it still considers every violation across all buckets, so the overall pass/fail stays consistent.
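
As a minimal sketch of the per-category formulas above, the Python below computes each bucket score from its inputs. The function names (`clamp01`, `wer_score`, `rule_bucket_score`) are hypothetical and not part of the API; only the constants and the 0.97 threshold come from this section.

```python
PASS_THRESHOLD = 0.97  # documented default for every bucket


def clamp01(x: float) -> float:
    """Clamp a value into [0, 1], as the outer max/min in each formula does."""
    return max(0.0, min(1.0, x))


def wer_score(blended_error: float, n_hard: int) -> float:
    # wer: (1 - blended_error) - 0.01 * n_hard
    return clamp01((1.0 - blended_error) - 0.01 * n_hard)


def rule_bucket_score(n_hard: int, n_soft: int) -> float:
    # punctuation and tags share one formula: 1 - 0.05 * n_hard - 0.025 * n_soft
    return clamp01(1.0 - 0.05 * n_hard - 0.025 * n_soft)


# Illustrative inputs: 2% blended error, one HARD punctuation violation,
# one unrescued SOFT tag disagreement.
scores = {
    "wer": wer_score(blended_error=0.02, n_hard=0),
    "punctuation": rule_bucket_score(n_hard=1, n_soft=0),
    "tags": rule_bucket_score(n_hard=0, n_soft=1),
}
for name, value in scores.items():
    print(name, round(value, 3), value >= PASS_THRESHOLD)
```

Because the punctuation and tags penalties (0.05 / 0.025) are steeper than the composite's (0.01 / 0.005), a single violation can fail a bucket while the top-level score still passes.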

Example response (categories field)

{
  "score": 0.965,
  "passed": false,
  "breakdown": {
    "categories": {
      "wer": {
        "name": "wer",
        "label": "Lexical accuracy (WER)",
        "score": 0.98,
        "passed": true,
        "hard_violation_count": 0,
        "soft_disagreement_count": 0,
        "rescued_count": 0,
        "violations": [],
        "soft_disagreements": [],
        "rescued": [],
        "details": {
          "blended_lexical_error": 0.02,
          "latin_token_count": 12,
          "indic_token_count": 18
        }
      },
      "punctuation": {
        "name": "punctuation",
        "label": "Punctuation",
        "score": 0.95,
        "passed": false,
        "hard_violation_count": 1,
        "soft_disagreement_count": 0,
        "rescued_count": 0,
        "violations": [
          {"rule_id":"1.6","severity":"hard","message":"Banned character ';'", ...}
        ],
        "soft_disagreements": [],
        "rescued": [],
        "details": {}
      },
      "tags": {
        "name": "tags",
        "label": "Tags & speaker labels",
        "score": 0.975,
        "passed": true,
        "hard_violation_count": 0,
        "soft_disagreement_count": 1,
        "rescued_count": 0,
        "violations": [],
        "soft_disagreements": [
          {"rule_id":"7.2","severity":"soft","message":"Emotion primary label disagreement", ...}
        ],
        "rescued": [],
        "details": {}
      }
    },
    "hard_violations": [...],     // flat list, kept for back-compat
    "soft_disagreements": [...],
    "rescued_disagreements": [...],
    "math": { ... }
  }
}

Tip: Most integrations should read breakdown.categories and ignore the flat hard_violations / soft_disagreements lists. They contain the same data in a denormalised form, kept only for back-compat with the v0 shape.
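
One way to consume breakdown.categories is sketched below in Python. `failing_buckets` is a hypothetical helper, not something the API ships, and the `sample` dict is a trimmed, illustrative response that keeps only the fields the helper reads.

```python
def failing_buckets(score_result: dict) -> list:
    """Return (bucket, score, rule_ids) for every category whose passed flag is false."""
    categories = score_result.get("breakdown", {}).get("categories", {})
    failing = []
    for name, cat in categories.items():
        if cat.get("passed", False):
            continue
        # Collect rule ids from HARD violations and unrescued SOFT disagreements.
        entries = cat.get("violations", []) + cat.get("soft_disagreements", [])
        rule_ids = [entry.get("rule_id", "?") for entry in entries]
        failing.append((name, cat.get("score"), rule_ids))
    return failing


# Trimmed, illustrative response: only the fields used above are present.
sample = {
    "breakdown": {
        "categories": {
            "wer": {"score": 0.98, "passed": True,
                    "violations": [], "soft_disagreements": []},
            "punctuation": {"score": 0.95, "passed": False,
                            "violations": [{"rule_id": "1.6", "severity": "hard"}],
                            "soft_disagreements": []},
            "tags": {"score": 0.975, "passed": True, "violations": [],
                     "soft_disagreements": [{"rule_id": "7.2", "severity": "soft"}]},
        }
    }
}
print(failing_buckets(sample))
```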

#Quickstart

Sixty-second copy-paste. Replace the base URL with your deployment.

1. Score one pair

curl -X POST https://your-deployment.vercel.app/api/score \
  -H "Content-Type: application/json" \
  -d '{
    "annotator": [
      {"speaker":"Speaker_1","start":"00:00:00.000","end":"00:00:02.500",
       "text":"मैंने कल Delhi से नया laptop खरीदा। [None]"}
    ],
    "gold": [
      {"speaker":"Speaker_1","start":"00:00:00.000","end":"00:00:02.500",
       "text":"मैंने कल Delhi से नया laptop खरीदा। [None]"}
    ]
  }'

Returns:

{
  "file_id": "annotator",
  "score": 1.0,
  "passed": true,
  "breakdown": {
    "hard_violations": [],
    "soft_disagreements": [],
    "rescued_disagreements": [],
    "lexical": { "blended": 0.0, ... },
    "math": {
      "lex_component": 1.0,
      "hard_penalty": 0.0,
      "soft_penalty": 0.0,
      "final_score": 1.0,
      "pass_threshold": 0.97,
      "formula": "score = max(0, min(1, (1 − blended_lexical_error) − 0.01·n_hard − 0.005·n_soft))"
    }
  }
}
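
For callers who prefer Python over curl, the same request can be sketched with the standard library alone. `build_score_request` and `score_pair` are hypothetical helper names, the base URL is the placeholder from the curl example, and the 15-second client timeout is an assumption rather than a documented value.

```python
import json
import urllib.request

BASE_URL = "https://your-deployment.vercel.app"  # placeholder, replace with your deployment


def build_score_request(annotator, gold, base_url=BASE_URL):
    """Build the POST /api/score request without sending it."""
    # ensure_ascii=False keeps Devanagari text readable in the UTF-8 body.
    body = json.dumps({"annotator": annotator, "gold": gold},
                      ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/score",
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def score_pair(annotator, gold, base_url=BASE_URL, timeout=15.0):
    """Send one pair and return the parsed ScoreResult as a dict."""
    request = build_score_request(annotator, gold, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


utterance = {"speaker": "Speaker_1", "start": "00:00:00.000", "end": "00:00:02.500",
             "text": "मैंने कल Delhi से नया laptop खरीदा। [None]"}
request = build_score_request([utterance], [utterance])
print(request.get_method(), request.full_url)
```

Calling `score_pair([utterance], [utterance])` against a live deployment would perform the actual request; non-2xx responses raise `urllib.error.HTTPError`.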

2. Score a whole batch

curl -X POST https://your-deployment.vercel.app/api/batch \
  -H "Content-Type: application/json" \
  -d '{
    "pairs": [
      {"file_id":"clip_001","annotator":[...],"gold":[...]},
      {"file_id":"clip_002","annotator":[...],"gold":[...]},
      {"file_id":"clip_003","annotator":[...],"gold":[...]}
    ]
  }'

Returns per-clip results and a summary:

{
  "results": [
    { "file_id": "clip_001", "ok": true,  "result": { "score": 0.987, "passed": true,  ... } },
    { "file_id": "clip_002", "ok": true,  "result": { "score": 0.943, "passed": false, ... } },
    { "file_id": "clip_003", "ok": false, "error": "Failed to parse submissions",
                              "detail": "Top-level JSON must be a list of utterance objects." }
  ],
  "summary": {
    "n_total": 3, "n_scored": 2, "n_failed_to_parse": 1,
    "n_passed": 1, "n_failed": 1,
    "pass_rate": 0.5,
    "average_score": 0.965,
    "avg_hard_violations": 1.0,
    "avg_soft_disagreements": 0.5,
    "pass_threshold": 0.97,
    "elapsed_ms": 412
  }
}

#Auth & limits

| Aspect | Value |
| --- | --- |
| Authentication | None. Public API. |
| CORS | Access-Control-Allow-Origin: * |
| Request body cap | Vercel function limit (~4.5 MB body) |
| Batch size | No hard cap; the practical limit is the 10 s function timeout |
| Rate limit | None today. May be added if abused. |
| Content-Type | application/json; charset=utf-8 |

Privacy: Request content is not logged. For SOFT-rule disagreements, the relevant gold/annotator text snippets are sent to Google's Gemini API for judgment. Don't send anything you wouldn't put in a public spreadsheet.
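
Given the body cap above, large jobs may need to be split across several /api/batch calls. The sketch below is one way to do that client-side; `chunk_pairs` is a hypothetical helper, and the 4,000,000-byte budget is an assumed safety margin under the ~4.5 MB limit, not a documented number. It does not account for the 10 s timeout, which depends on clip content.

```python
import json


def chunk_pairs(pairs, max_bytes=4_000_000):
    """Yield lists of pairs whose serialized {"pairs": [...]} body stays under max_bytes.

    A single pair larger than max_bytes is still yielded on its own,
    since it cannot be split further.
    """
    overhead = len(b'{"pairs": []}')
    chunk, size = [], overhead
    for pair in pairs:
        # +2 covers the ", " separator json.dumps places between list items.
        pair_bytes = len(json.dumps(pair, ensure_ascii=False).encode("utf-8")) + 2
        if chunk and size + pair_bytes > max_bytes:
            yield chunk
            chunk, size = [], overhead
        chunk.append(pair)
        size += pair_bytes
    if chunk:
        yield chunk


# Tiny budget so the example visibly splits into several chunks.
demo_pairs = [{"file_id": f"clip_{i}", "annotator": [], "gold": []} for i in range(10)]
demo_chunks = list(chunk_pairs(demo_pairs, max_bytes=200))
print([len(c) for c in demo_chunks])
```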

#POST /api/score

Score one annotator submission against one gold reference.

POST /api/score Single pair → ScoreResult

Request body

| Field | Type | Description |
| --- | --- | --- |
| annotator | Utterance[] | Annotator's transcript for the file. |
| gold | Utterance[] | Gold reference for the same file. Must have the same number of utterances. |

Response (200)

ScoreResult — see schema.

Example

POST /api/score
{
  "annotator": [{"speaker":"Speaker_1","text":"मैंने hello। [None]"}],
  "gold":      [{"speaker":"Speaker_1","text":"मैंने hello। [None]"}]
}
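
Since the two lists must have the same number of utterances, a cheap client-side check can catch obvious mistakes before a request is sent. This is a sketch only: `precheck_pair` is a hypothetical helper, it assumes `speaker` and `text` are the required string fields of each utterance, and the server's own validation may be stricter or differ in detail.

```python
def precheck_pair(annotator, gold):
    """Return a list of human-readable problems; an empty list means the pair looks sendable."""
    problems = []
    for side, utterances in (("annotator", annotator), ("gold", gold)):
        if not isinstance(utterances, list):
            problems.append(f"{side}: must be a list of utterance objects")
            continue
        for i, utt in enumerate(utterances):
            if not isinstance(utt, dict):
                problems.append(f"{side}[{i}]: must be an object")
                continue
            # Assumed required fields: speaker and text, both strings.
            for key in ("speaker", "text"):
                if not isinstance(utt.get(key), str):
                    problems.append(f"{side}[{i}]: missing or non-string '{key}'")
    if isinstance(annotator, list) and isinstance(gold, list) and len(annotator) != len(gold):
        problems.append(
            f"utterance count mismatch: annotator={len(annotator)}, gold={len(gold)}"
        )
    return problems


good = [{"speaker": "Speaker_1", "text": "मैंने hello। [None]"}]
print(precheck_pair(good, good))
print(precheck_pair(good, good + good))
```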

#POST /api/batch

Score many pairs in one request. Bad pairs don't kill the batch.

POST /api/batch Many pairs → per-clip + aggregate

Request body

| Field | Type | Description |
| --- | --- | --- |
| pairs | Pair[] | Array of {file_id?, annotator, gold} objects. file_id is optional but recommended: it appears in every result entry. |
| pass_threshold | number | Optional. Default 0.97. Overrides the pass cutoff for every clip in the batch. |

Response (200)

| Field | Type | Description |
| --- | --- | --- |
| results[] | object | One entry per input pair. Either {ok:true, result: ScoreResult} or {ok:false, error, detail}. |
| summary | object | Aggregate stats over ok:true entries only; see fields below. |

Summary fields

| Field | Type | Description |
| --- | --- | --- |
| n_total | int | Total pairs in the request |
| n_scored | int | Pairs that scored successfully |
| n_failed_to_parse | int | Bad pairs (skipped) |
| n_passed | int | Scored ≥ pass_threshold |
| n_failed | int | Scored < pass_threshold |
| pass_rate | float | n_passed / n_scored |
| average_score | float | Mean over scored pairs |
| avg_hard_violations | float | Mean HARD violations per scored pair |
| avg_soft_disagreements | float | Mean SOFT disagreements per scored pair |
| elapsed_ms | int | Server-side wall time |

Bad pairs don't abort the batch. A parse error, missing keys, or utterance-count mismatch in pair N returns an ok:false entry; pairs 0..N-1 and N+1..end still score normally. Score statistics in the summary (pass rate, averages) aggregate only over ok:true entries.
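
To make the summary definitions concrete, the sketch below recomputes the count and score fields from a results[] array. `summarize` is a hypothetical helper, and returning 0.0 for the rates when nothing scored is an assumption; the server's behaviour in that case is not documented here.

```python
def summarize(results, pass_threshold=0.97):
    """Recompute the count/score summary fields from a batch results[] list."""
    scored = [entry["result"] for entry in results if entry.get("ok")]
    n_scored = len(scored)
    n_passed = sum(1 for result in scored if result["score"] >= pass_threshold)
    return {
        "n_total": len(results),
        "n_scored": n_scored,
        "n_failed_to_parse": len(results) - n_scored,
        "n_passed": n_passed,
        "n_failed": n_scored - n_passed,
        # Assumption: report 0.0 rather than dividing by zero when nothing scored.
        "pass_rate": n_passed / n_scored if n_scored else 0.0,
        "average_score": (sum(r["score"] for r in scored) / n_scored) if n_scored else 0.0,
    }


# Illustrative results: two scored clips and one that failed to parse.
example_results = [
    {"file_id": "clip_001", "ok": True, "result": {"score": 0.987, "passed": True}},
    {"file_id": "clip_002", "ok": True, "result": {"score": 0.943, "passed": False}},
    {"file_id": "clip_003", "ok": False, "error": "Failed to parse submissions"},
]
summary = summarize(example_results)
print(summary)
```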

#GET /api/docs

Machine-readable OpenAPI 3.1 spec. Use it to generate clients or import into Postman.

GET /api/docs OpenAPI 3.1 JSON

Returns the full OpenAPI document for every endpoint and schema on this site.

curl https://your-deployment.vercel.app/api/docs > openapi.json
# Import openapi.json into Postman / Insomnia / openapi-generator

#Schemas

Utterance

One row of an annotator's or gold's transcription.

| Field | Type | Req? | Description |
| --- | --- | --- | --- |
| speaker | string | yes | Speaker_1, Speaker_2, ..., or Speaker_Machine (Rule 3.1). |
| text | string | yes | The transcription. A trailing [Label] tag (e.g. [Happy]) is extracted as the segment-level emotion and stripped before lexical scoring. Two trailing tags are accepted for the rare multi-label case (Rule 7.2). |
| start | string \| number | no | HH:MM:SS.mmm or float seconds. |
| end | string \| number | no | As above. |
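
The two less obvious parts of this schema, the dual timestamp format and the trailing label tags, can be illustrated with a short Python sketch. Both helpers are hypothetical and only approximate what the scorer might do: `split_trailing_labels` assumes any purely alphabetic bracketed tag at the end of the text is an emotion label, whereas the real scorer presumably checks against the framework's label list.

```python
import re

_TIMESTAMP = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d{3})$")
_TRAILING_TAG = re.compile(r"\s*\[([A-Za-z]+)\]\s*$")


def to_seconds(value):
    """Accept HH:MM:SS.mmm strings or plain numbers and return float seconds."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    match = _TIMESTAMP.match(str(value))
    if not match:
        raise ValueError(f"expected HH:MM:SS.mmm or a number, got {value!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def split_trailing_labels(text, max_labels=2):
    """Strip up to max_labels trailing [Label] tags; return (remaining_text, labels)."""
    labels = []
    while len(labels) < max_labels:
        match = _TRAILING_TAG.search(text)
        if not match:
            break
        labels.insert(0, match.group(1))  # keep labels in their original order
        text = text[:match.start()]
    return text.rstrip(), labels


print(to_seconds("00:00:02.500"), to_seconds(2.5))
print(split_trailing_labels("मैंने hello। [None]"))
```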

ScoreResult

The per-file score envelope.

| Field | Type | Description |
| --- | --- | --- |
| file_id | string | Echo of the input file_id (or auto-generated for single-pair calls). |
| score | float ∈ [0,1] | Composite overall score across all categories. |
| passed | bool | score >= pass_threshold |
| breakdown.categories | {wer, punctuation, tags} | Primary shape. Three top-level CategoryScore objects, each with its own score, pass flag, and violation list. See the Categories section. |
| breakdown.math | ScoreMath | Step-by-step arithmetic: what every penalty was, what the final number is. |
| breakdown.hard_violations | Violation[] | Flat list across all categories. Kept for back-compat; prefer categories. |
| breakdown.soft_disagreements | Violation[] | Flat list. Same data appears inside each category's soft_disagreements. |
| breakdown.rescued_disagreements | Violation[] | Flat list. LLM-judge-accepted disagreements; not penalized. |
| breakdown.lexical | LexicalScore | Aggregate lexical numbers. The WER bucket's details field surfaces the relevant subset. |

CategoryScore

One of wer, punctuation, or tags.

| Field | Type | Description |
| --- | --- | --- |
| name | "wer" \| "punctuation" \| "tags" | Bucket key. |
| label | string | Human-readable label for UI rendering (e.g. "Lexical accuracy (WER)"). |
| score | float ∈ [0,1] | Per-category score, independent of the overall composite. |
| passed | bool | score >= pass_threshold for this bucket. |
| pass_threshold | float | Default 0.97. |
| hard_violation_count | int | Count of HARD violations in this bucket. |
| soft_disagreement_count | int | Count of unrescued SOFT disagreements in this bucket. |
| rescued_count | int | Count of LLM-judge-rescued SOFT disagreements (audit-only). |
| violations | Violation[] | HARD violations in this bucket. |
| soft_disagreements | Violation[] | Unrescued SOFT disagreements in this bucket. |
| rescued | Violation[] | Rescued disagreements (for audit). |
| details | object | For wer: {blended_lexical_error, latin_token_count, indic_token_count}. Empty for the other buckets. |

Violation

| Field | Type | Description |
| --- | --- | --- |
| rule_id | string | PDF rule number (e.g. 1.6, 7.2) or structural.*. |
| severity | "hard" \| "soft" | Class of the broken rule. |
| message | string | Human-readable explanation. |
| utterance_index | int | 0-based index into the utterances array. |
| found | string | What the annotator wrote. |
| expected | string | What the rule (or gold) expected. |

#Scoring math

No black box. The exact arithmetic for every result is returned in breakdown.math.

score = max(0, min(1, (1 − blended_error) − 0.01·n_hard − 0.005·n_soft))

| Term | Meaning |
| --- | --- |
| blended_error | Token-weighted blend of WER (Latin script) + CER (Indic scripts). Reflects the client SLA of "WER under 3%". |
| n_hard | Count of HARD rule violations. Each costs 0.01. |
| n_soft | Count of SOFT disagreements the LLM-judge did NOT rescue. Each costs 0.005. |
| pass_threshold | Default 0.97. Override per-request on /api/batch. |

Why these weights?

  • Lexical dominates because the buyer SLA is "WER < 3%". A clean transcription gets close to 1.0 even with a few minor rule slips.
  • HARD violations get 2× the SOFT penalty because they're unambiguous — no judgment, no rescue.
  • The 0.97 threshold mirrors the 3% WER ceiling: penalties only subtract from the lexical component, so any passing score also meets the buyer's lexical bar.
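
A worked instance of the composite formula, as a Python sketch. `score_math` is a hypothetical name; the returned keys loosely follow the breakdown.math fields shown in the Quickstart, with an extra `passed` key added here purely for illustration.

```python
PASS_THRESHOLD = 0.97  # documented default


def score_math(blended_error, n_hard, n_soft, pass_threshold=PASS_THRESHOLD):
    """Apply score = max(0, min(1, (1 - blended_error) - 0.01*n_hard - 0.005*n_soft))."""
    lex_component = 1.0 - blended_error
    hard_penalty = 0.01 * n_hard
    soft_penalty = 0.005 * n_soft
    final_score = max(0.0, min(1.0, lex_component - hard_penalty - soft_penalty))
    return {
        "lex_component": lex_component,
        "hard_penalty": hard_penalty,
        "soft_penalty": soft_penalty,
        "final_score": final_score,
        "pass_threshold": pass_threshold,
        "passed": final_score >= pass_threshold,  # illustrative extra key
    }


# 2% blended error, one HARD violation, one unrescued SOFT disagreement:
# 0.98 - 0.01 - 0.005 = 0.965, just under the 0.97 threshold.
worked = score_math(blended_error=0.02, n_hard=1, n_soft=1)
print(round(worked["final_score"], 3), worked["passed"])
```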

#Rules: HARD vs SOFT

~70% of rules are deterministic (regex / Unicode-script / whitelist). ~30% need genuine judgment and are routed to an LLM-judge.

| Part | Rule | Class | Scorer treatment |
| --- | --- | --- | --- |
| 1.1 | Script enforcement (Latin/Devanagari) | HARD | Unicode-block check per token; cross-script substitution = error. |
| 1.5 | Fillers: whitelist [um] / [uh] | HARD | [mhm], [er], [ah] = error. |
| 1.5 | Filler placement | SOFT | ±2 token tolerance from gold position. |
| 1.6 | Banned characters ; : " ' @ % $ ₹ # & | HARD | Regex; any occurrence = error. |
| 1.6 | Punctuation choice (comma / full-stop) | SOFT | LLM-judge: defensible in this context? |
| 2.1 | Numbers spelled out, never digits | HARD | Regex; [0-9] outside tag content = error. |
| 3.1 | Speaker label format + grouping | HARD | Regex + permutation-invariant cluster match. |
| 5.1 | Stutter format r=report | HARD | Regex. |
| 7.1 | In-text actions whitelist ([laughing], [cough], etc.) | HARD | Whitelist lookup. |
| 7.2 | Emotion label choice | SOFT | LLM-judge with priority hierarchy + 50% dominance rule. |
| 8.b | [NOISE]/[MUSIC] casing | HARD | UPPERCASE required. |

Full classification table: docs/rule-classification.md.
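
To show what "deterministic" means in practice, here is a rough Python sketch of three of the HARD checks from the table (1.5 filler whitelist, 1.6 banned characters, 2.1 digits). It is not the scorer's actual code: the patterns are assumptions built from the table rows, `hard_violations` is a hypothetical name, and the filler check only knows the three non-whitelisted examples listed above.

```python
import re

# Rule 1.6: the banned characters listed in the table.
BANNED_CHARS = re.compile("[" + re.escape(";:\"'@%$₹#&") + "]")
# Bracketed or angle-bracketed markers, removed before the digit check (rule 2.1).
TAG = re.compile(r"\[[^\]]*\]|<[^>]*>")
DIGIT = re.compile(r"[0-9]")
# Rule 1.5: filler tags named in the table as errors (only [um] / [uh] are whitelisted).
BAD_FILLERS = {"mhm", "er", "ah"}
BRACKET_TAG = re.compile(r"\[([^\]]*)\]")


def hard_violations(text):
    """Return (rule_id, message) tuples for the three sketched HARD checks."""
    found = []
    for match in BANNED_CHARS.finditer(text):
        found.append(("1.6", f"Banned character {match.group()!r}"))
    if DIGIT.search(TAG.sub(" ", text)):
        found.append(("2.1", "Digit outside tag content"))
    for match in BRACKET_TAG.finditer(text):
        if match.group(1) in BAD_FILLERS:
            found.append(("1.5", f"Non-whitelisted filler [{match.group(1)}]"))
    return found


print(hard_violations("मैंने कल Delhi से नया laptop खरीदा। [None]"))
print(hard_violations("मैंने 2 laptop खरीदे; [mhm]"))
```

The same input always yields the same violations, which is why these rules need no LLM call and cannot be rescued.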

#Error codes

| Status | When | Body |
| --- | --- | --- |
| 200 | OK | ScoreResult or BatchResponse |
| 400 | Malformed JSON, missing top-level keys | { "error": "...", "detail": "..." } |
| 422 | Submissions can't be parsed (single-pair only) | { "error": "Could not parse submissions", "detail": "..." } |
| 500 | Unexpected internal error | { "error": "...", "detail": "...", "trace": "..." } |

In batch mode, parse errors do NOT return 422 — they appear as ok:false entries inside results[] with HTTP 200.
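
Client code therefore needs two error paths: HTTP status codes for /api/score, and per-entry ok flags for /api/batch. The Python sketch below shows one way to handle both; `describe_error` and `split_batch_results` are hypothetical helpers, and the short status labels are paraphrases of the table above.

```python
STATUS_LABELS = {
    400: "malformed request",
    422: "submissions could not be parsed",
    500: "unexpected internal error",
}


def describe_error(status, body):
    """Turn a non-200 status and its JSON body into a one-line message."""
    label = STATUS_LABELS.get(status, "unexpected status")
    return f"HTTP {status} ({label}): {body.get('error', '')} | {body.get('detail', '')}"


def split_batch_results(batch_response):
    """Separate a 200 batch response into scored entries and per-pair failures."""
    scored, failed = [], []
    for entry in batch_response.get("results", []):
        if entry.get("ok"):
            scored.append((entry.get("file_id"), entry["result"]))
        else:
            # Parse errors arrive here with HTTP 200, not as a 422.
            failed.append((entry.get("file_id"), entry.get("error"), entry.get("detail")))
    return scored, failed


# Illustrative batch response with one good and one bad pair.
batch = {
    "results": [
        {"file_id": "clip_001", "ok": True, "result": {"score": 0.987, "passed": True}},
        {"file_id": "clip_003", "ok": False, "error": "Failed to parse submissions",
         "detail": "Top-level JSON must be a list of utterance objects."},
    ]
}
scored, failed = split_batch_results(batch)
print(len(scored), len(failed))
print(describe_error(422, {"error": "Could not parse submissions", "detail": "..."}))
```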