PUBLIC API · v1.0

Score voice annotation quality.

Public API that grades Hindi/English voice-annotation transcripts against gold references using the Smallest ASR Labelling Framework. Deterministic HARD rules plus LLM-judged SOFT rules. Per-clip detail and batch averages.


#Overview

Built for annotation vendors and ML teams shipping Indic ASR datasets. Gives you a defensible per-clip score, the rule that was broken, and a batch-level pass rate — all in a single JSON call.

What it does

Compares an annotator's JSON transcript against a gold JSON for the same audio
Runs ~32 HARD rules (regex / Unicode-script / whitelist) — no judgment, deterministic
Runs ~14 SOFT rules through an LLM-judge with a 0.85 confidence threshold
Returns a score in [0, 1], a pass/fail (default threshold 0.97), every violation, and the math behind the score
Supports single-pair and batch scoring, with per-clip + aggregate stats

What it's not

Not a transcription service — bring your own annotator + gold pair
Not a language detector: it assumes the framework's Hindi+English (code-switched) target
Not authoritative — final acceptance is the customer's call; this is a QC signal

Cost: ~$0.0005 per file. The LLM judge is called only when SOFT rules genuinely disagree (≈5 calls per file in that case). Most files finish in <500 ms.

#Categories — WER, Punctuation, Tags

Every score response groups violations under three top-level buckets. The portal renders them as three cards; the API returns them under breakdown.categories. Each bucket has its own score, pass flag, and violation list.

Bucket	What it covers	Rules
wer (Lexical accuracy)	Word-level correctness: script enforcement, numbers spelled out, dates, acronyms, currency, percentages, third-language removal, structural sanity. Includes the blended Latin-WER / Indic-CER number.	`1.1`, `1.2`, `1.3`, `1.4`, `1.7`, `2.x`, `4`, structural
punctuation	Rule 1.6 family. Banned characters (`; : " ' @ % $ ₹ # &`), full-stop choice (`।`/`.`/`?`/`!`), comma usage, question marks.	`1.6`
tags (Speaker labels & markup)	Every bracketed or angle-bracketed marker: fillers, speaker diarization, stutters, truncations, singing, in-text actions, segment emotion, noise tags, whisper.	`1.5`, `3.x`, `5.x`, `6`, `7.x`, `8.x`

Per-category scoring

wer:    max(0, min(1, (1 − blended_error) − 0.01·n_hard))
punctuation:    max(0, min(1, 1 − 0.05·n_hard − 0.025·n_soft))
tags:    max(0, min(1, 1 − 0.05·n_hard − 0.025·n_soft))

Each bucket has its own passed flag (default threshold 0.97). The top-level score is still the composite described under Scoring math: it counts every violation across all buckets, so the overall pass/fail does not change with the bucketing.
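The three per-category formulas can be sketched in a few lines of Python. This is an illustration of the arithmetic above, not the service's source; the function names are hypothetical.

```python
# Sketch of the per-category formulas. Function names are illustrative only.

def clamp01(x: float) -> float:
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def wer_score(blended_error: float, n_hard: int) -> float:
    """wer bucket: lexical component minus 0.01 per HARD violation."""
    return clamp01((1.0 - blended_error) - 0.01 * n_hard)


def markup_score(n_hard: int, n_soft: int) -> float:
    """Shared formula for the punctuation and tags buckets."""
    return clamp01(1.0 - 0.05 * n_hard - 0.025 * n_soft)


def bucket_passed(score: float, pass_threshold: float = 0.97) -> bool:
    """A bucket passes when its score reaches the threshold."""
    return score >= pass_threshold


print(round(wer_score(0.02, 0), 3))   # 0.98
print(round(markup_score(1, 0), 3))   # 0.95
print(round(markup_score(0, 1), 3))   # 0.975
```

One HARD violation is enough to fail the punctuation or tags bucket at the default threshold, while a single unrescued SOFT disagreement still passes.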

Example response (categories field)

{
  "score": 0.965,
  "passed": false,
  "breakdown": {
    "categories": {
      "wer": {
        "name": "wer",
        "label": "Lexical accuracy (WER)",
        "score": 0.98,
        "passed": true,
        "hard_violation_count": 0,
        "soft_disagreement_count": 0,
        "rescued_count": 0,
        "violations": [],
        "soft_disagreements": [],
        "rescued": [],
        "details": {
          "blended_lexical_error": 0.02,
          "latin_token_count": 12,
          "indic_token_count": 18
        }
      },
      "punctuation": {
        "name": "punctuation",
        "label": "Punctuation",
        "score": 0.95,
        "passed": false,
        "hard_violation_count": 1,
        "soft_disagreement_count": 0,
        "rescued_count": 0,
        "violations": [
          {"rule_id":"1.6","severity":"hard","message":"Banned character ';'", ...}
        ],
        "soft_disagreements": [],
        "rescued": [],
        "details": {}
      },
      "tags": {
        "name": "tags",
        "label": "Tags & speaker labels",
        "score": 0.975,
        "passed": true,
        "hard_violation_count": 0,
        "soft_disagreement_count": 1,
        "rescued_count": 0,
        "violations": [],
        "soft_disagreements": [
          {"rule_id":"7.2","severity":"soft","message":"Emotion primary label disagreement", ...}
        ],
        "rescued": [],
        "details": {}
      }
    },
    "hard_violations": [...],     // flat list, kept for back-compat
    "soft_disagreements": [...],
    "rescued_disagreements": [...],
    "math": { ... }
  }
}

Tip: Most integrations should read breakdown.categories and ignore the flat hard_violations / soft_disagreements lists — they contain the same data in a denormalised form, kept only for back-compat with the v0 shape.
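A minimal sketch of reading breakdown.categories on the client side. It assumes the response has already been parsed into a dict; the helper name `failed_buckets` is hypothetical, and the sample is a trimmed copy of the example response.

```python
def failed_buckets(score_result: dict) -> dict:
    """Map each failing bucket to its HARD and unrescued SOFT messages."""
    categories = score_result.get("breakdown", {}).get("categories", {})
    failures = {}
    for name, bucket in categories.items():
        if bucket.get("passed", False):
            continue
        issues = bucket.get("violations", []) + bucket.get("soft_disagreements", [])
        failures[name] = [f'{v["rule_id"]}: {v["message"]}' for v in issues]
    return failures


# Trimmed copy of the example response; only the fields this helper reads.
sample = {
    "breakdown": {
        "categories": {
            "wer": {"passed": True, "violations": [], "soft_disagreements": []},
            "punctuation": {
                "passed": False,
                "violations": [
                    {"rule_id": "1.6", "severity": "hard", "message": "Banned character ';'"}
                ],
                "soft_disagreements": [],
            },
            "tags": {
                "passed": True,
                "violations": [],
                "soft_disagreements": [
                    {"rule_id": "7.2", "severity": "soft",
                     "message": "Emotion primary label disagreement"}
                ],
            },
        }
    }
}

print(failed_buckets(sample))
# {'punctuation': ["1.6: Banned character ';'"]}
```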

#Quickstart

Sixty-second copy-paste. Replace the base URL with your deployment.

1. Score one pair

curl -X POST https://your-deployment.vercel.app/api/score \
  -H "Content-Type: application/json" \
  -d '{
    "annotator": [
      {"speaker":"Speaker_1","start":"00:00:00.000","end":"00:00:02.500",
       "text":"मैंने कल Delhi से नया laptop खरीदा। [None]"}
    ],
    "gold": [
      {"speaker":"Speaker_1","start":"00:00:00.000","end":"00:00:02.500",
       "text":"मैंने कल Delhi से नया laptop खरीदा। [None]"}
    ]
  }'

Returns:

{
  "file_id": "annotator",
  "score": 1.0,
  "passed": true,
  "breakdown": {
    "hard_violations": [],
    "soft_disagreements": [],
    "rescued_disagreements": [],
    "lexical": { "blended": 0.0, ... },
    "math": {
      "lex_component": 1.0,
      "hard_penalty": 0.0,
      "soft_penalty": 0.0,
      "final_score": 1.0,
      "pass_threshold": 0.97,
      "formula": "score = max(0, min(1, (1 − blended_lexical_error) − 0.01·n_hard − 0.005·n_soft))"
    }
  }
}
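The same call from Python, using only the standard library. This is a sketch: `BASE_URL` is the placeholder from the curl example, and the helper names and the 15-second client timeout are assumptions, not part of the API.

```python
import json
import urllib.request

BASE_URL = "https://your-deployment.vercel.app"  # placeholder; use your deployment


def build_score_request(annotator: list, gold: list,
                        base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /api/score request without sending it."""
    body = json.dumps({"annotator": annotator, "gold": gold},
                      ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/score",
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def score_pair(annotator: list, gold: list, base_url: str = BASE_URL,
               timeout: float = 15.0) -> dict:
    """Send the request and return the parsed ScoreResult.

    urllib raises urllib.error.HTTPError on non-2xx responses.
    """
    request = build_score_request(annotator, gold, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (performs a network call, so it is left commented out):
# utterance = {"speaker": "Speaker_1", "text": "मैंने hello। [None]"}
# result = score_pair([utterance], [utterance])
# print(result["score"], result["passed"])
```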

2. Score a whole batch

curl -X POST https://your-deployment.vercel.app/api/batch \
  -H "Content-Type: application/json" \
  -d '{
    "pairs": [
      {"file_id":"clip_001","annotator":[...],"gold":[...]},
      {"file_id":"clip_002","annotator":[...],"gold":[...]},
      {"file_id":"clip_003","annotator":[...],"gold":[...]}
    ]
  }'

Returns per-clip results and a summary:

{
  "results": [
    { "file_id": "clip_001", "ok": true,  "result": { "score": 0.987, "passed": true,  ... } },
    { "file_id": "clip_002", "ok": true,  "result": { "score": 0.943, "passed": false, ... } },
    { "file_id": "clip_003", "ok": false, "error": "Failed to parse submissions",
                              "detail": "Top-level JSON must be a list of utterance objects." }
  ],
  "summary": {
    "n_total": 3, "n_scored": 2, "n_failed_to_parse": 1,
    "n_passed": 1, "n_failed": 1,
    "pass_rate": 0.5,
    "average_score": 0.965,
    "avg_hard_violations": 1.0,
    "avg_soft_disagreements": 0.5,
    "pass_threshold": 0.97,
    "elapsed_ms": 412
  }
}

#Auth & limits

Aspect	Value
Authentication	None. Public API.
CORS	`Access-Control-Allow-Origin: *`
Request body cap	Vercel function limit (~4.5 MB body)
Batch size	No hard cap — practical limit is the 10 s function timeout
Rate limit	None today. May be added if abused.
Content-Type	`application/json; charset=utf-8`
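With no hard batch cap, the body size limit is the constraint a client can check up front. The sketch below greedily splits a list of pairs so each request body stays under a byte budget. The 4,000,000-byte default is an assumed safety margin below the ~4.5 MB cap, and `chunk_pairs` is a hypothetical helper. It does not account for the 10 s timeout, which depends on clip length and SOFT-rule activity.

```python
import json

MAX_BODY_BYTES = 4_000_000  # assumption: margin under the ~4.5 MB body cap
_ENVELOPE_BYTES = len(b'{"pairs": []}')


def chunk_pairs(pairs: list, max_bytes: int = MAX_BODY_BYTES) -> list:
    """Group pairs so json.dumps({"pairs": chunk}, ensure_ascii=False)
    stays within max_bytes. The per-pair size includes a 2-byte separator,
    so the estimate is a slight upper bound."""
    chunks, current, current_size = [], [], _ENVELOPE_BYTES
    for pair in pairs:
        size = len(json.dumps(pair, ensure_ascii=False).encode("utf-8")) + 2
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], _ENVELOPE_BYTES
        current.append(pair)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

A single pair larger than the budget still ends up in its own chunk; the server will reject it, so treat that case as a client-side error.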

Privacy: Request content is not logged. When SOFT rules disagree, the relevant gold/annotator text snippets are sent to Google's Gemini API for judgment. Don't send anything you wouldn't put in a public spreadsheet.

#POST /api/score

Score one annotator submission against one gold reference.

POST /api/score Single pair → ScoreResult

Request body

Field	Type	Description
`annotator`	`Utterance[]`	Annotator's transcript for the file.
`gold`	`Utterance[]`	Gold reference for the same file. Must have the same number of utterances.

Response (200)

ScoreResult — see schema.

Example

POST /api/score
{
  "annotator": [{"speaker":"Speaker_1","text":"मैंने hello। [None]"}],
  "gold":      [{"speaker":"Speaker_1","text":"मैंने hello। [None]"}]
}
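Since the two arrays must have the same number of utterances, a client-side pre-check can catch most parse failures before the request is sent. This sketch checks only what the docs state (required `speaker` and `text` strings, equal utterance counts). `validate_pair` is a hypothetical helper, and the server may apply further checks.

```python
def validate_pair(annotator, gold) -> list:
    """Return a list of problems; an empty list means the pair looks well-formed."""
    problems = []
    for name, utterances in (("annotator", annotator), ("gold", gold)):
        if not isinstance(utterances, list):
            problems.append(f"{name}: must be a list of utterance objects")
            continue
        for i, utt in enumerate(utterances):
            if not isinstance(utt, dict):
                problems.append(f"{name}[{i}]: must be an object")
                continue
            for field in ("speaker", "text"):
                if not isinstance(utt.get(field), str):
                    problems.append(f"{name}[{i}]: missing or non-string '{field}'")
    if (isinstance(annotator, list) and isinstance(gold, list)
            and len(annotator) != len(gold)):
        problems.append(
            f"utterance count mismatch: annotator={len(annotator)}, gold={len(gold)}"
        )
    return problems
```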

#POST /api/batch

Score many pairs in one request. Bad pairs don't kill the batch.

POST /api/batch Many pairs → per-clip + aggregate

Request body

Field	Type	Description
`pairs`	`Pair[]`	Array of `{file_id?, annotator, gold}` objects. `file_id` is optional but recommended — it appears in every result entry.
`pass_threshold`	`number`	Optional. Default `0.97`. Overrides the pass cutoff for every clip in the batch.

Response (200)

Field	Type	Description
`results[]`	`object`	One entry per input pair. Either `{ok:true, result: ScoreResult}` or `{ok:false, error, detail}`.
`summary`	`object`	Aggregate stats over `ok:true` entries only — see fields below.

Summary fields

Field	Type	Description
`n_total`	int	Total pairs in the request
`n_scored`	int	Pairs that scored successfully
`n_failed_to_parse`	int	Bad pairs (skipped)
`n_passed`	int	Scored ≥ pass_threshold
`n_failed`	int	Scored < pass_threshold
`pass_rate`	float	`n_passed / n_scored`
`average_score`	float	Mean over scored pairs
`avg_hard_violations`	float	Mean HARD violations per scored pair
`avg_soft_disagreements`	float	Mean SOFT disagreements per scored pair
`elapsed_ms`	int	Server-side wall time

Bad pairs don't abort the batch. A parse error, missing keys, or utterance-count mismatch in pair N returns an ok:false entry; pairs 0..N-1 and N+1..end still score normally. The summary aggregates only over ok:true.
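To show how the summary relates to results[], the sketch below recomputes the main summary fields from a list of result entries. It is a client-side illustration, not the server's code. Returning `None` for `pass_rate` and `average_score` when nothing scored is an assumption; the docs do not say what the API returns in that case.

```python
def summarize(results: list, pass_threshold: float = 0.97) -> dict:
    """Recompute summary counts and means over ok:true entries only."""
    scored = [entry["result"] for entry in results if entry.get("ok")]
    n_scored = len(scored)
    n_passed = sum(1 for r in scored if r["score"] >= pass_threshold)
    return {
        "n_total": len(results),
        "n_scored": n_scored,
        "n_failed_to_parse": len(results) - n_scored,
        "n_passed": n_passed,
        "n_failed": n_scored - n_passed,
        # Assumption: undefined ratios are reported as None.
        "pass_rate": n_passed / n_scored if n_scored else None,
        "average_score": (sum(r["score"] for r in scored) / n_scored
                          if n_scored else None),
    }


example = [
    {"file_id": "clip_001", "ok": True, "result": {"score": 0.987, "passed": True}},
    {"file_id": "clip_002", "ok": True, "result": {"score": 0.943, "passed": False}},
    {"file_id": "clip_003", "ok": False, "error": "Failed to parse submissions"},
]
summary = summarize(example)
print(summary["n_scored"], summary["pass_rate"], round(summary["average_score"], 3))
# 2 0.5 0.965
```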

#GET /api/docs

Machine-readable OpenAPI 3.1 spec. Use it to generate clients or import into Postman.

GET /api/docs OpenAPI 3.1 JSON

Returns the full OpenAPI document for every endpoint and schema on this site.

curl https://your-deployment.vercel.app/api/docs > openapi.json
# Import openapi.json into Postman / Insomnia / openapi-generator

#Schemas

Utterance

One row of an annotator or gold transcript.

Field	Type	Req?	Description
`speaker`	string	yes	`Speaker_1`, `Speaker_2`, ..., or `Speaker_Machine` (Rule 3.1).
`text`	string	yes	The transcription. A trailing `[Label]` tag (e.g. `[Happy]`) is extracted as the segment-level emotion and stripped before lexical scoring. Two trailing tags are accepted for the rare multi-label case (Rule 7.2).
`start`	string | number	no	`HH:MM:SS.mmm` or float seconds.
`end`	string\|number	no	As above.
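Two Utterance conventions lend themselves to a short sketch: peeling the trailing `[Label]` tag(s) off `text`, and normalizing `start` / `end` to seconds. Both helpers are hypothetical and simplified. The tag splitter treats any trailing bracketed tag as a label, so it cannot tell a trailing in-text tag such as `[um]` from an emotion label. The real scorer presumably checks labels against the framework's list.

```python
import re

_TRAILING_TAG = re.compile(r"\s*\[([^\[\]]+)\]\s*$")


def split_trailing_labels(text: str, max_labels: int = 2):
    """Strip up to max_labels trailing [Label] tags; return (body, labels)."""
    labels = []
    body = text
    while len(labels) < max_labels:
        match = _TRAILING_TAG.search(body)
        if match is None:
            break
        labels.insert(0, match.group(1))  # keep labels in written order
        body = body[: match.start()]
    return body.rstrip(), labels


def to_seconds(value) -> float:
    """Accept float seconds or an HH:MM:SS.mmm string."""
    if isinstance(value, (int, float)):
        return float(value)
    hours, minutes, seconds = value.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


print(split_trailing_labels("मैंने hello। [None]"))
print(to_seconds("00:00:02.500"))
```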

ScoreResult

The per-file score envelope.

Field	Type	Description
`file_id`	string	Echo of the input `file_id` (or auto-generated for single-pair calls).
`score`	float ∈ [0,1]	Composite overall score across all categories.
`passed`	bool	`score >= pass_threshold`
`breakdown.categories`	{wer, punctuation, tags}	Primary shape. Three top-level CategoryScore objects — each with its own score, pass flag, and violation list. See the Categories section.
`breakdown.math`	ScoreMath	Step-by-step arithmetic — what every penalty was, what the final number is.
`breakdown.hard_violations`	Violation[]	Flat list across all categories. Kept for back-compat — prefer `categories`.
`breakdown.soft_disagreements`	Violation[]	Flat list. Same data appears inside each category's `soft_disagreements`.
`breakdown.rescued_disagreements`	Violation[]	Flat list. LLM-judge-accepted disagreements; not penalized.
`breakdown.lexical`	LexicalScore	Aggregate lexical numbers. WER bucket's `details` surfaces the relevant subset.

CategoryScore

One of wer, punctuation, or tags.

Field	Type	Description
`name`	`"wer"` | `"punctuation"` | `"tags"`	Bucket key.
`label`	string	Human-readable label for UI rendering (e.g. `"Lexical accuracy (WER)"`).
`score`	float ∈ [0,1]	Per-category score, independent of the overall composite.
`passed`	bool	`score >= pass_threshold` for this bucket.
`pass_threshold`	float	Default `0.97`.
`hard_violation_count`	int	Count of HARD violations in this bucket.
`soft_disagreement_count`	int	Count of unrescued SOFT disagreements in this bucket.
`rescued_count`	int	Count of LLM-judge-rescued SOFT disagreements (audit-only).
`violations`	Violation[]	HARD violations in this bucket.
`soft_disagreements`	Violation[]	Unrescued SOFT disagreements in this bucket.
`rescued`	Violation[]	Rescued disagreements (for audit).
`details`	object	For wer: `{blended_lexical_error, latin_token_count, indic_token_count}`. Empty for the other buckets.

Violation

Field	Type	Description
`rule_id`	string	Rule number from the framework PDF (e.g. `1.6`, `7.2`) or `structural.*`.
`severity`	`"hard"` | `"soft"`	Class of the broken rule.
`message`	string	Human-readable explanation.
`utterance_index`	int	0-based index into the utterances array.
`found`	string	What the annotator wrote.
`expected`	string	What the rule (or gold) expected.

#Scoring math

No black box. The exact arithmetic for every result is returned in breakdown.math.

score = max(0, min(1, (1 − blended_error) − 0.01·n_hard − 0.005·n_soft ))

Term	Meaning
`blended_error`	Token-weighted blend of WER (Latin-script tokens) and CER (Indic-script tokens). This is the term that tracks the client SLA of "WER under 3%".
`n_hard`	Count of HARD rule violations. Each costs `0.01`.
`n_soft`	Count of SOFT disagreements the LLM-judge did NOT rescue. Each costs `0.005`.
`pass_threshold`	Default `0.97`. Override per-request on `/api/batch`.
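A worked sketch of the composite formula. `composite_score` follows the formula above term by term. `blended_error` is an assumption: the docs say only "token-weighted blend", and a mean weighted by token count is one plausible reading, not a confirmed implementation.

```python
def blended_error(wer_latin: float, cer_indic: float,
                  latin_tokens: int, indic_tokens: int) -> float:
    """ASSUMED blend: mean of WER and CER weighted by token counts."""
    total = latin_tokens + indic_tokens
    if total == 0:
        return 0.0
    return (latin_tokens * wer_latin + indic_tokens * cer_indic) / total


def composite_score(blended: float, n_hard: int, n_soft: int) -> float:
    """score = max(0, min(1, (1 - blended) - 0.01*n_hard - 0.005*n_soft))"""
    return max(0.0, min(1.0, (1.0 - blended) - 0.01 * n_hard - 0.005 * n_soft))


# Blended error 0.02, one HARD violation, one unrescued SOFT disagreement:
score = composite_score(0.02, n_hard=1, n_soft=1)
print(round(score, 3), score >= 0.97)
# 0.965 False
```

With 2% lexical error, one HARD violation lands on the 0.97 threshold and any further unrescued disagreement pushes the clip below it.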

Why these weights?

Lexical accuracy dominates because the buyer SLA is "WER < 3%". A clean transcription scores close to 1.0 even with a few minor rule slips.
HARD violations cost 2× the SOFT penalty (0.01 vs 0.005) because they're unambiguous: no judgment, no rescue.
The 0.97 threshold mirrors the 3% WER ceiling. Penalties only subtract, so any score ≥ 0.97 implies blended_error ≤ 0.03, which meets the buyer's lexical bar.

#Rules: HARD vs SOFT

~70% of rules are deterministic (regex / Unicode-script / whitelist). ~30% need genuine judgment and are routed to an LLM-judge.

Part	Rule	Class	Scorer treatment
1.1	Script enforcement (Latin/Devanagari)	HARD	Unicode-block check per token; cross-script substitution = error.
1.5	Fillers — whitelist `[um]` / `[uh]`	HARD	`[mhm]`, `[er]`, `[ah]` = error.
1.5	Filler placement	SOFT	±2 token tolerance from gold position.
1.6	Banned characters `; : " ' @ % $ ₹ # &`	HARD	Regex; any occurrence = error.
1.6	Punctuation choice (comma / full-stop)	SOFT	LLM-judge: defensible in this context?
2.1	Numbers spelled out, never digits	HARD	Regex; `[0-9]` outside tag content = error.
3.1	Speaker label format + grouping	HARD	Regex + permutation-invariant cluster match.
5.1	Stutter format `r=report`	HARD	Regex.
7.1	In-text actions whitelist (`[laughing]`, `[cough]`, etc.)	HARD	Whitelist lookup.
7.2	Emotion label choice	SOFT	LLM-judge with priority hierarchy + 50% dominance rule.
8.b	`[NOISE]`/`[MUSIC]` casing	HARD	UPPERCASE required.

Full classification table: docs/rule-classification.md.
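To make the HARD class concrete, here is a sketch of three checks from the table: banned characters (1.6), digits outside tag content (2.1), and the rejected fillers listed for 1.5. The patterns are reconstructed from the table, not taken from the scorer. In particular, applying the banned-character check to the whole text, tags included, is an assumption.

```python
import re

BANNED_CHARS = re.compile("[" + re.escape(";:\"'@%$₹#&") + "]")
TAG_SPANS = re.compile(r"\[[^\]]*\]|<[^>]*>")      # bracketed or angle-bracketed
SQUARE_TAGS = re.compile(r"\[[^\]]*\]")
REJECTED_FILLERS = {"[mhm]", "[er]", "[ah]"}        # examples given for Rule 1.5


def hard_checks(text: str) -> list:
    """Return (rule_id, message) tuples for the three sketched HARD rules."""
    findings = []
    for match in BANNED_CHARS.finditer(text):
        findings.append(("1.6", f"Banned character {match.group()!r}"))
    outside_tags = TAG_SPANS.sub(" ", text)
    if re.search(r"[0-9]", outside_tags):
        findings.append(("2.1", "Digit found outside tag content"))
    for tag in SQUARE_TAGS.findall(text):
        if tag in REJECTED_FILLERS:
            findings.append(("1.5", f"Filler {tag} is not in the whitelist"))
    return findings


print(hard_checks("मैंने कल Delhi से नया laptop खरीदा। [None]"))
# []
```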

#Error codes

Status	When	Body
`200`	OK	ScoreResult or BatchResponse
`400`	Malformed JSON, missing top-level keys	`{ "error": "...", "detail": "..." }`
`422`	Submissions can't be parsed (single-pair only)	`{ "error": "Could not parse submissions", "detail": "..." }`
`500`	Unexpected internal error	`{ "error": "...", "detail": "...", "trace": "..." }`

In batch mode, parse errors do NOT return 422 — they appear as ok:false entries inside results[] with HTTP 200.
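A client therefore needs two failure paths: HTTP-level errors for the whole request, and ok:false entries inside a 200 batch response. The sketch below separates them. `ScoringError` and `unpack_response` are hypothetical names, and the function expects a status code plus an already-parsed JSON body.

```python
class ScoringError(Exception):
    """Raised for any non-200 response (400, 422, 500)."""

    def __init__(self, status: int, error: str, detail: str = ""):
        super().__init__(f"HTTP {status}: {error} ({detail})")
        self.status = status
        self.error = error
        self.detail = detail


def unpack_response(status: int, body: dict, batch: bool = False):
    """Return (scored_results, per_clip_failures) or raise ScoringError."""
    if status != 200:
        raise ScoringError(status, body.get("error", "unknown error"),
                           body.get("detail", ""))
    if not batch:
        return [body], []  # single-pair: the body is the ScoreResult
    scored = [entry["result"] for entry in body["results"] if entry["ok"]]
    failures = [entry for entry in body["results"] if not entry["ok"]]
    return scored, failures
```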
