Score voice annotation quality.
Public API that grades Hindi/English voice-annotation transcripts against gold references using the Smallest ASR Labelling Framework. Deterministic HARD rules plus LLM-judged SOFT rules. Per-clip detail and batch averages.
#Overview
Built for annotation vendors and ML teams shipping Indic ASR datasets. Gives you a defensible per-clip score, the rule that was broken, and a batch-level pass rate — all in a single JSON call.
What it does
- Compares an annotator's JSON transcript against a gold JSON for the same audio
- Runs ~32 HARD rules (regex / Unicode-script / whitelist) — no judgment, deterministic
- Runs ~14 SOFT rules through an LLM-judge with a 0.85 confidence threshold
- Returns a score in [0, 1], a pass/fail (default threshold 0.97), every violation, and the math behind the score
- Supports single-pair and batch scoring, with per-clip + aggregate stats
What it's not
- Not a transcription service — bring your own annotator + gold pair
- Not language-detection — assumes the framework's Hindi+English (code-switched) target
- Not authoritative — final acceptance is the customer's call; this is a QC signal
#Categories — WER, Punctuation, Tags
Every score response groups violations under three top-level buckets.
The portal renders them as three cards; the API returns them under
breakdown.categories. Each bucket has its own score, pass flag,
and violation list.
| Bucket | What it covers | Rules |
|---|---|---|
| wer (Lexical accuracy) | Word-level correctness: script enforcement, numbers spelled out, dates, acronyms, currency, percentages, third-language removal, structural sanity. Includes the blended Latin-WER / Indic-CER number. | 1.1, 1.2, 1.3, 1.4, 1.7, 2.x, 4, structural |
| punctuation | Rule 1.6 family. Banned characters (; : " ' @ % $ ₹ # &), full-stop choice (।/./?/!), comma usage, question marks. | 1.6 |
| tags (Speaker labels & markup) | Every bracketed or angle-bracketed marker: fillers, speaker diarization, stutters, truncations, singing, in-text actions, segment emotion, noise tags, whisper. | 1.5, 3.x, 5.x, 6, 7.x, 8.x |
Per-category scoring
punctuation: max(0, min(1, 1 − 0.05·n_hard − 0.025·n_soft))
tags: max(0, min(1, 1 − 0.05·n_hard − 0.025·n_soft))
Each bucket has its own passed flag (threshold 0.97). The top-level
score is the original composite — it still considers every violation across all
buckets, so the overall pass/fail stays consistent.
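The per-category formula above is easy to check by hand. The following is a minimal Python restatement of it, not the service's own code; the function names `category_score` and `category_passed` are illustrative, while the weights and the 0.97 threshold come from the text.

```python
def category_score(n_hard: int, n_soft: int) -> float:
    """Per-category score: 1 - 0.05*n_hard - 0.025*n_soft, clamped to [0, 1]."""
    raw = 1.0 - 0.05 * n_hard - 0.025 * n_soft
    return max(0.0, min(1.0, raw))


def category_passed(score: float, threshold: float = 0.97) -> bool:
    """A bucket passes when its score reaches the threshold."""
    return score >= threshold


if __name__ == "__main__":
    # (bucket, HARD violations, unrescued SOFT disagreements)
    for name, n_hard, n_soft in [("punctuation", 1, 0), ("tags", 0, 1)]:
        score = category_score(n_hard, n_soft)
        print(name, round(score, 3), category_passed(score))
```

Under these weights a single HARD violation (0.95) fails a bucket on its own, one unrescued SOFT disagreement (0.975) still passes, and two SOFT disagreements (0.95) fail.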
Example response (categories field)
{
"score": 0.965,
"passed": false,
"breakdown": {
"categories": {
"wer": {
"name": "wer",
"label": "Lexical accuracy (WER)",
"score": 0.98,
"passed": true,
"hard_violation_count": 0,
"soft_disagreement_count": 0,
"rescued_count": 0,
"violations": [],
"soft_disagreements": [],
"rescued": [],
"details": {
"blended_lexical_error": 0.02,
"latin_token_count": 12,
"indic_token_count": 18
}
},
"punctuation": {
"name": "punctuation",
"label": "Punctuation",
"score": 0.95,
"passed": false,
"hard_violation_count": 1,
"soft_disagreement_count": 0,
"rescued_count": 0,
"violations": [
{"rule_id":"1.6","severity":"hard","message":"Banned character ';'", ...}
],
"soft_disagreements": [],
"rescued": [],
"details": {}
},
"tags": {
"name": "tags",
"label": "Tags & speaker labels",
"score": 0.975,
"passed": true,
"hard_violation_count": 0,
"soft_disagreement_count": 1,
"rescued_count": 0,
"violations": [],
"soft_disagreements": [
{"rule_id":"7.2","severity":"soft","message":"Emotion primary label disagreement", ...}
],
"rescued": [],
"details": {}
}
},
"hard_violations": [...], // flat list, kept for back-compat
"soft_disagreements": [...],
"rescued_disagreements": [...],
"math": { ... }
}
}
Prefer breakdown.categories and ignore the flat hard_violations / soft_disagreements lists: they contain the same data in a denormalised form, kept only for back-compat with the v0 shape.
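A consumer that reads only breakdown.categories might look like the sketch below. It assumes the response shape shown in the example above; the helper name `failing_buckets` is hypothetical and not part of the API.

```python
from typing import Any


def failing_buckets(result: dict[str, Any]) -> dict[str, list[str]]:
    """Map each failed bucket to 'rule_id: message' strings for its issues."""
    report: dict[str, list[str]] = {}
    categories = result.get("breakdown", {}).get("categories", {})
    for name, bucket in categories.items():
        if bucket.get("passed", True):
            continue  # bucket met its threshold, nothing to report
        issues = bucket.get("violations", []) + bucket.get("soft_disagreements", [])
        report[name] = [f'{v["rule_id"]}: {v["message"]}' for v in issues]
    return report


if __name__ == "__main__":
    sample = {
        "breakdown": {
            "categories": {
                "wer": {"passed": True, "violations": [], "soft_disagreements": []},
                "punctuation": {
                    "passed": False,
                    "violations": [{"rule_id": "1.6", "severity": "hard",
                                    "message": "Banned character ';'"}],
                    "soft_disagreements": [],
                },
            }
        }
    }
    print(failing_buckets(sample))
```

Rescued disagreements are left out on purpose, since the schema describes them as audit-only and not penalized.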
#Quickstart
Sixty-second copy-paste. Replace the base URL with your deployment.
1. Score one pair
curl -X POST https://your-deployment.vercel.app/api/score \
-H "Content-Type: application/json" \
-d '{
"annotator": [
{"speaker":"Speaker_1","start":"00:00:00.000","end":"00:00:02.500",
"text":"मैंने कल Delhi से नया laptop खरीदा। [None]"}
],
"gold": [
{"speaker":"Speaker_1","start":"00:00:00.000","end":"00:00:02.500",
"text":"मैंने कल Delhi से नया laptop खरीदा। [None]"}
]
}'
Returns:
{
"file_id": "annotator",
"score": 1.0,
"passed": true,
"breakdown": {
"hard_violations": [],
"soft_disagreements": [],
"rescued_disagreements": [],
"lexical": { "blended": 0.0, ... },
"math": {
"lex_component": 1.0,
"hard_penalty": 0.0,
"soft_penalty": 0.0,
"final_score": 1.0,
"pass_threshold": 0.97,
"formula": "score = max(0, min(1, (1 − blended_lexical_error) − 0.01·n_hard − 0.005·n_soft))"
}
}
}
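The same call from Python, as a minimal sketch using only the standard library. The base URL is the placeholder from the curl example, and `build_score_request` / `score_pair` are illustrative names, not a published client.

```python
import json
import urllib.request

BASE_URL = "https://your-deployment.vercel.app"  # placeholder, replace with your deployment


def build_score_request(annotator: list[dict], gold: list[dict],
                        base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /api/score request without sending it."""
    body = json.dumps({"annotator": annotator, "gold": gold},
                      ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/score",
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def score_pair(annotator: list[dict], gold: list[dict],
               base_url: str = BASE_URL, timeout: float = 15.0) -> dict:
    """Send the request and return the decoded ScoreResult."""
    req = build_score_request(annotator, gold, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `score_pair(utterances, utterances)` with identical annotator and gold lists should come back with `score` 1.0, matching the response above. Non-2xx statuses surface as `urllib.error.HTTPError`.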
2. Score a whole batch
curl -X POST https://your-deployment.vercel.app/api/batch \
-H "Content-Type: application/json" \
-d '{
"pairs": [
{"file_id":"clip_001","annotator":[...],"gold":[...]},
{"file_id":"clip_002","annotator":[...],"gold":[...]},
{"file_id":"clip_003","annotator":[...],"gold":[...]}
]
}'
Returns per-clip results and a summary:
{
"results": [
{ "file_id": "clip_001", "ok": true, "result": { "score": 0.987, "passed": true, ... } },
{ "file_id": "clip_002", "ok": true, "result": { "score": 0.943, "passed": false, ... } },
{ "file_id": "clip_003", "ok": false, "error": "Failed to parse submissions",
"detail": "Top-level JSON must be a list of utterance objects." }
],
"summary": {
"n_total": 3, "n_scored": 2, "n_failed_to_parse": 1,
"n_passed": 1, "n_failed": 1,
"pass_rate": 0.5,
"average_score": 0.965,
"avg_hard_violations": 1.0,
"avg_soft_disagreements": 0.5,
"pass_threshold": 0.97,
"elapsed_ms": 412
}
}
#Auth & limits
| Aspect | Value |
|---|---|
| Authentication | None. Public API. |
| CORS | Access-Control-Allow-Origin: * |
| Request body cap | Vercel function limit (~4.5 MB body) |
| Batch size | No hard cap — practical limit is the 10 s function timeout |
| Rate limit | None today. May be added if abused. |
| Content-Type | application/json; charset=utf-8 |
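Because the body cap and the function timeout bound how much one batch can carry, large jobs may need to be split client-side. The sketch below is one way to do that, assuming a conservative 4,000,000-byte budget (the table's figure is approximate) and a hypothetical helper name `chunk_pairs`; the size estimate assumes the body is serialised with `json.dumps(..., ensure_ascii=False)`.

```python
import json
from typing import Iterator

ENVELOPE_BYTES = len(b'{"pairs": []}')  # fixed overhead of the batch wrapper


def chunk_pairs(pairs: list[dict], max_bytes: int = 4_000_000) -> Iterator[list[dict]]:
    """Yield lists of pairs whose estimated JSON size stays under max_bytes.

    A single pair larger than the budget is still yielded on its own,
    so the caller sees the server's rejection instead of a silent drop.
    """
    chunk: list[dict] = []
    size = ENVELOPE_BYTES
    for pair in pairs:
        # +2 accounts for the ", " separator json.dumps puts between items
        pair_bytes = len(json.dumps(pair, ensure_ascii=False).encode("utf-8")) + 2
        if chunk and size + pair_bytes > max_bytes:
            yield chunk
            chunk, size = [], ENVELOPE_BYTES
        chunk.append(pair)
        size += pair_bytes
    if chunk:
        yield chunk
```

Each yielded list can be posted as its own `{"pairs": [...]}` body. This only addresses body size; staying inside the 10 s timeout may require a smaller budget, depending on how many SOFT rules reach the LLM-judge.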
#POST /api/score
Score one annotator submission against one gold reference.
Request body
| Field | Type | Description |
|---|---|---|
| annotator | Utterance[] | Annotator's transcript for the file. |
| gold | Utterance[] | Gold reference for the same file. Must have the same number of utterances. |
Response (200)
ScoreResult — see schema.
Example
POST /api/score
{
"annotator": [{"speaker":"Speaker_1","text":"मैंने hello। [None]"}],
"gold": [{"speaker":"Speaker_1","text":"मैंने hello। [None]"}]
}
#POST /api/batch
Score many pairs in one request. Bad pairs don't kill the batch.
Request body
| Field | Type | Description |
|---|---|---|
| pairs | Pair[] | Array of {file_id?, annotator, gold} objects. file_id is optional but recommended: it appears in every result entry. |
| pass_threshold | number | Optional. Default 0.97. Overrides the pass cutoff for every clip in the batch. |
Response (200)
| Field | Type | Description |
|---|---|---|
| results[] | object | One entry per input pair. Either {ok:true, result: ScoreResult} or {ok:false, error, detail}. |
| summary | object | Aggregate stats over ok:true entries only; see fields below. |
Summary fields
| Field | Type | Description |
|---|---|---|
| n_total | int | Total pairs in the request |
| n_scored | int | Pairs that scored successfully |
| n_failed_to_parse | int | Bad pairs (skipped) |
| n_passed | int | Scored ≥ pass_threshold |
| n_failed | int | Scored < pass_threshold |
| pass_rate | float | n_passed / n_scored |
| average_score | float | Mean over scored pairs |
| avg_hard_violations | float | Mean HARD violations per scored pair |
| avg_soft_disagreements | float | Mean SOFT disagreements per scored pair |
| pass_threshold | float | Pass cutoff applied to the batch (default 0.97) |
| elapsed_ms | int | Server-side wall time |
A bad pair at index N returns an ok:false entry; pairs 0..N-1 and N+1..end still score normally. The summary aggregates only over ok:true entries.
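The summary can be reproduced client-side from results[], which is a handy sanity check when merging several batches. A minimal sketch, following the field definitions in the table above; the name `summarize` is illustrative, and returning 0.0 when nothing scored is an assumption, since the page does not say what the API does in that case.

```python
def summarize(results: list[dict], pass_threshold: float = 0.97) -> dict:
    """Recompute the count and average fields of `summary` from `results`."""
    scored = [r["result"] for r in results if r.get("ok")]
    n_scored = len(scored)
    n_passed = sum(1 for r in scored if r["score"] >= pass_threshold)
    return {
        "n_total": len(results),
        "n_scored": n_scored,
        "n_failed_to_parse": len(results) - n_scored,
        "n_passed": n_passed,
        "n_failed": n_scored - n_passed,
        # Assumption: report 0.0 rather than dividing by zero on an all-bad batch.
        "pass_rate": n_passed / n_scored if n_scored else 0.0,
        "average_score": sum(r["score"] for r in scored) / n_scored if n_scored else 0.0,
    }


if __name__ == "__main__":
    results = [
        {"file_id": "clip_001", "ok": True, "result": {"score": 0.987}},
        {"file_id": "clip_002", "ok": True, "result": {"score": 0.943}},
        {"file_id": "clip_003", "ok": False, "error": "Failed to parse submissions"},
    ]
    print(summarize(results))
```

With the three clips from the Quickstart this gives n_scored 2, pass_rate 0.5, and an average_score of about 0.965, in line with the example response.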
#GET /api/docs
Machine-readable OpenAPI 3.1 spec. Use it to generate clients or import into Postman.
Returns the full OpenAPI document for every endpoint and schema on this site.
curl https://your-deployment.vercel.app/api/docs > openapi.json
# Import openapi.json into Postman / Insomnia / openapi-generator
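Once openapi.json is on disk, a few lines of Python are enough to list what the spec declares. This sketch relies only on the standard OpenAPI layout (a top-level `paths` object keyed by path, then by HTTP method); `list_operations` is an illustrative name.

```python
import json

HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def list_operations(spec: dict) -> list[tuple[str, str]]:
    """Return sorted (METHOD, path) tuples declared in an OpenAPI document."""
    operations = []
    for path, item in spec.get("paths", {}).items():
        for key in item:
            # Path items can also hold non-method keys such as "parameters".
            if key.lower() in HTTP_METHODS:
                operations.append((key.upper(), path))
    return sorted(operations)


# Usage, after running the curl command above:
#   with open("openapi.json", encoding="utf-8") as fh:
#       print(list_operations(json.load(fh)))
```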
#Schemas
Utterance
One row of an annotator's or gold's transcription.
| Field | Type | Req? | Description |
|---|---|---|---|
| speaker | string | yes | Speaker_1, Speaker_2, ..., or Speaker_Machine (Rule 3.1). |
| text | string | yes | The transcription. A trailing [Label] tag (e.g. [Happy]) is extracted as the segment-level emotion and stripped before lexical scoring. Two trailing tags are accepted for the rare multi-label case (Rule 7.2). |
| start | string or number | no | HH:MM:SS.mmm or float seconds. |
| end | string or number | no | As above. |
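A client can catch obvious schema mistakes before spending a request. The sketch below is a pre-flight check under stated assumptions: the speaker pattern is one reading of "Speaker_1, Speaker_2, ..., or Speaker_Machine", the "end precedes start" check is an extra sanity test not stated in the schema, and the helper names are hypothetical.

```python
import re

SPEAKER_RE = re.compile(r"Speaker_(?:\d+|Machine)")   # assumed reading of Rule 3.1
TIMESTAMP_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})")


def to_seconds(value) -> float:
    """Accept HH:MM:SS.mmm or a number of seconds; return float seconds."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    match = TIMESTAMP_RE.fullmatch(str(value))
    if not match:
        raise ValueError(f"unrecognised timestamp: {value!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def utterance_problems(utt: dict) -> list[str]:
    """Return human-readable problems; an empty list means the row looks fine."""
    problems = []
    if not SPEAKER_RE.fullmatch(str(utt.get("speaker", ""))):
        problems.append("speaker must be Speaker_<n> or Speaker_Machine")
    if not isinstance(utt.get("text"), str) or not utt["text"].strip():
        problems.append("text is required")
    if "start" in utt and "end" in utt:
        # Assumption: an utterance should not end before it starts.
        if to_seconds(utt["end"]) < to_seconds(utt["start"]):
            problems.append("end precedes start")
    return problems
```

For example, `to_seconds("00:00:02.500")` returns 2.5, so string and float timestamps can be compared on the same scale.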
ScoreResult
The per-file score envelope.
| Field | Type | Description |
|---|---|---|
| file_id | string | Echo of the input file_id (or auto-generated for single-pair calls). |
| score | float ∈ [0,1] | Composite overall score across all categories. |
| passed | bool | score >= pass_threshold |
| breakdown.categories | {wer, punctuation, tags} | Primary shape. Three top-level CategoryScore objects, each with its own score, pass flag, and violation list. See the Categories section. |
| breakdown.math | ScoreMath | Step-by-step arithmetic: what every penalty was, what the final number is. |
| breakdown.hard_violations | Violation[] | Flat list across all categories. Kept for back-compat; prefer categories. |
| breakdown.soft_disagreements | Violation[] | Flat list. Same data appears inside each category's soft_disagreements. |
| breakdown.rescued_disagreements | Violation[] | Flat list. LLM-judge-accepted disagreements; not penalized. |
| breakdown.lexical | LexicalScore | Aggregate lexical numbers. The WER bucket's details surfaces the relevant subset. |
CategoryScore
One of wer, punctuation, or tags.
| Field | Type | Description |
|---|---|---|
| name | one of "wer", "punctuation", "tags" | Bucket key. |
| label | string | Human-readable label for UI rendering (e.g. "Lexical accuracy (WER)"). |
| score | float ∈ [0,1] | Per-category score, independent of the overall composite. |
| passed | bool | score >= pass_threshold for this bucket. |
| pass_threshold | float | Default 0.97. |
| hard_violation_count | int | Count of HARD violations in this bucket. |
| soft_disagreement_count | int | Count of unrescued SOFT disagreements in this bucket. |
| rescued_count | int | Count of LLM-judge-rescued SOFT disagreements (audit-only). |
| violations | Violation[] | HARD violations in this bucket. |
| soft_disagreements | Violation[] | Unrescued SOFT disagreements in this bucket. |
| rescued | Violation[] | Rescued disagreements (for audit). |
| details | object | For wer: {blended_lexical_error, latin_token_count, indic_token_count}. Empty for the other buckets. |
Violation
| Field | Type | Description |
|---|---|---|
| rule_id | string | PDF rule number (e.g. 1.6, 7.2) or structural.*. |
| severity | "hard" or "soft" | Class of the broken rule. |
| message | string | Human-readable explanation. |
| utterance_index | int | 0-based index into the utterances array. |
| found | string | What the annotator wrote. |
| expected | string | What the rule (or gold) expected. |
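For reviewer-facing reports, a Violation can be flattened into one line. This is a sketch only: `format_violation` is a hypothetical helper, and treating utterance_index, found, and expected as possibly absent is an assumption, since the table does not mark which fields are optional.

```python
def format_violation(v: dict) -> str:
    """Render a Violation as '[SEVERITY rule] location: message (found/expected)'."""
    where = f'utterance {v["utterance_index"]}' if "utterance_index" in v else "file"
    line = f'[{v["severity"].upper()} {v["rule_id"]}] {where}: {v["message"]}'
    if "found" in v or "expected" in v:
        line += f' (found {v.get("found")!r}, expected {v.get("expected")!r})'
    return line


if __name__ == "__main__":
    print(format_violation({
        "rule_id": "1.6", "severity": "hard", "message": "Banned character ';'",
        "utterance_index": 0, "found": ";", "expected": "",
    }))
```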
#Scoring math
No black box. The exact arithmetic for every result is returned in breakdown.math.
| Term | Meaning |
|---|---|
| blended_error | Token-weighted blend of WER (Latin script) and CER (Indic scripts); it appears as blended_lexical_error in the formula string and in the wer bucket's details. This is the number measured against the client SLA of "WER under 3%". |
| n_hard | Count of HARD rule violations. Each costs 0.01. |
| n_soft | Count of SOFT disagreements the LLM-judge did NOT rescue. Each costs 0.005. |
| pass_threshold | Default 0.97. Override per-request on /api/batch. |
Why these weights?
- Lexical dominates because the buyer SLA is "WER < 3%". A clean transcription gets close to 1.0 even with a few minor rule slips.
- HARD violations get 2× the SOFT penalty because they're unambiguous — no judgment, no rescue.
- The 0.97 threshold mirrors the 3% WER ceiling: penalties only ever subtract, so a passing score implies a blended lexical error of at most 3%, which meets the buyer's lexical bar.
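The composite can be recomputed from the terms above with the formula string returned in breakdown.math: score = max(0, min(1, (1 − blended_lexical_error) − 0.01·n_hard − 0.005·n_soft)). The sketch below restates it in Python; the function name is illustrative, the returned keys mirror the math object from the Quickstart response, and the service's exact rounding is not documented here.

```python
def composite_score(blended_lexical_error: float, n_hard: int, n_soft: int,
                    pass_threshold: float = 0.97) -> dict:
    """Recompute the overall score and its components from the documented formula."""
    lex_component = 1.0 - blended_lexical_error
    hard_penalty = 0.01 * n_hard     # each HARD violation costs 0.01
    soft_penalty = 0.005 * n_soft    # each unrescued SOFT disagreement costs 0.005
    final_score = max(0.0, min(1.0, lex_component - hard_penalty - soft_penalty))
    return {
        "lex_component": lex_component,
        "hard_penalty": hard_penalty,
        "soft_penalty": soft_penalty,
        "final_score": final_score,
        "passed": final_score >= pass_threshold,
    }


if __name__ == "__main__":
    # 2% blended error, one HARD violation, one unrescued SOFT disagreement
    print(composite_score(0.02, n_hard=1, n_soft=1))
```

With a 2% blended error, one HARD violation, and one SOFT disagreement the composite lands at about 0.965, just under the default threshold. A transcript with zero lexical error could absorb two HARD violations (about 0.98) and still pass.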
#Rules: HARD vs SOFT
~70% of rules are deterministic (regex / Unicode-script / whitelist). ~30% need genuine judgment and are routed to an LLM-judge.
| Part | Rule | Class | Scorer treatment |
|---|---|---|---|
| 1.1 | Script enforcement (Latin/Devanagari) | HARD | Unicode-block check per token; cross-script substitution = error. |
| 1.5 | Fillers: whitelist [um] / [uh] | HARD | [mhm], [er], [ah] = error. |
| 1.5 | Filler placement | SOFT | ±2 token tolerance from gold position. |
| 1.6 | Banned characters ; : " ' @ % $ ₹ # & | HARD | Regex; any occurrence = error. |
| 1.6 | Punctuation choice (comma / full-stop) | SOFT | LLM-judge: defensible in this context? |
| 2.1 | Numbers spelled out, never digits | HARD | Regex; [0-9] outside tag content = error. |
| 3.1 | Speaker label format + grouping | HARD | Regex + permutation-invariant cluster match. |
| 5.1 | Stutter format r=report | HARD | Regex. |
| 7.1 | In-text actions whitelist ([laughing], [cough], etc.) | HARD | Whitelist lookup. |
| 7.2 | Emotion label choice | SOFT | LLM-judge with priority hierarchy + 50% dominance rule. |
| 8.b | [NOISE]/[MUSIC] casing | HARD | UPPERCASE required. |
Full classification table: docs/rule-classification.md.
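To make "deterministic" concrete, here is how two of the HARD checks in the table could be written. This is a sketch of the technique, not the scorer's implementation: the exact patterns, the choice to scan the whole text for banned characters, and the definition of "tag content" as anything inside [...] or <...> are assumptions.

```python
import re

# Rule 1.6: the banned characters listed in the table.
BANNED_CHARS = re.compile("[" + re.escape(";:\"'@%$₹#&") + "]")

# Assumed notion of tag content: bracketed or angle-bracketed markers.
TAG_CONTENT = re.compile(r"\[[^\]]*\]|<[^>]*>")


def banned_characters(text: str) -> list[str]:
    """Return every banned character occurrence, in order of appearance."""
    return BANNED_CHARS.findall(text)


def has_digit_outside_tags(text: str) -> bool:
    """Rule 2.1: ASCII digits are an error unless they sit inside a tag."""
    return re.search(r"[0-9]", TAG_CONTENT.sub("", text)) is not None


if __name__ == "__main__":
    print(banned_characters("ठीक है; चलो"))
    print(has_digit_outside_tags("मेरे पास 2 laptop हैं।"))
```

Both checks give the same answer on every run for the same input, which is what separates them from the SOFT rules that go to the LLM-judge. Note that `[0-9]` covers ASCII digits only, as in the table; Devanagari digits would need their own pattern.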
#Error codes
| Status | When | Body |
|---|---|---|
| 200 | OK | ScoreResult or BatchResponse |
| 400 | Malformed JSON, missing top-level keys | { "error": "...", "detail": "..." } |
| 422 | Submissions can't be parsed (single-pair only) | { "error": "Could not parse submissions", "detail": "..." } |
| 500 | Unexpected internal error | { "error": "...", "detail": "...", "trace": "..." } |
In batch mode, parse errors do NOT return 422 — they appear as ok:false entries inside results[] with HTTP 200.
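Client code therefore has to look in two places for parse failures: the HTTP status for single-pair calls, and results[] for batches. A minimal sketch of that branching, with a hypothetical helper name and wording; it takes an already-decoded JSON body.

```python
def describe_outcome(status: int, body: dict) -> str:
    """Turn an HTTP status plus decoded JSON body into a one-line outcome."""
    if status == 200 and "results" in body:
        # Batch: parse failures hide inside results[] even though the status is 200.
        bad = [r.get("file_id") for r in body["results"] if not r.get("ok")]
        return f"batch scored, {len(bad)} pair(s) failed to parse: {bad}"
    if status == 200:
        return f"scored {body['score']} (passed={body['passed']})"
    if status == 400:
        return f"malformed request, fix and resend: {body.get('detail')}"
    if status == 422:
        return f"submission could not be parsed: {body.get('detail')}"
    if status == 500:
        return f"server error, retry later: {body.get('error')}"
    return f"unexpected status {status}"
```

A 400 or 422 points at the payload, so retrying the same body will not help; whether a 500 is worth retrying depends on the deployment.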