Advanced Features

Use the features below to improve accuracy, debug latency, and more.

🔍 Context Biasing (Hotwords)

You can boost recognition of important or uncommon phrases by specifying hotwords during the request.

Using Hotwords

Define your hotwords as a JSON array. You can specify a higher "boosting score" to give extra emphasis to longer phrases (recommended!). The default score is 1.5, which should be sufficient for single words.

curl --location 'https://bodhi.navana.ai/api/transcribe' \
--header 'x-customer-id: <customer_id>' \
--header 'x-api-key: <api_key>' \
--form 'transaction_id=<uuid>' \
--form 'audio_file=@"<audio_file_path>"' \
--form 'model="hi-banking-v2-8khz"' \
--form 'hotwords="[{\"phrase\":\"बोधी\"},{\"phrase\":\"स्पीच रिकग्निशन\",\"score\":4.5}]"'
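
For reference, the same request can be assembled in Python, letting `json.dumps` handle the quote escaping that the curl example does by hand. The endpoint and headers are as documented above; the actual `requests` call is left in comments since it needs real credentials and an audio file:

```python
import json
import uuid

# Build the hotwords form field with json.dumps so quoting/escaping
# is handled automatically (the curl example escapes quotes manually).
hotwords = json.dumps([
    {"phrase": "बोधी"},                           # uses the default score (1.5)
    {"phrase": "स्पीच रिकग्निशन", "score": 4.5},   # longer phrase, higher boost
], ensure_ascii=False)

form_data = {
    "transaction_id": str(uuid.uuid4()),
    "model": "hi-banking-v2-8khz",
    "hotwords": hotwords,
}

# To send the request (requires the `requests` package and valid credentials):
# import requests
# with open("<audio_file_path>", "rb") as f:
#     resp = requests.post(
#         "https://bodhi.navana.ai/api/transcribe",
#         headers={"x-customer-id": "<customer_id>", "x-api-key": "<api_key>"},
#         data=form_data,
#         files={"audio_file": f},
#     )

print(form_data["hotwords"])
```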

Best Practices

| Best Practice | Description |
| --- | --- |
| ✅ Use uncommon words | Target domain-specific or rare phrases like "बोधी स्पीच रिकग्निशन" |
| ✅ Use local script | Always write in Devanagari (e.g. बोधी, not bodhi) |
| ✅ Avoid punctuation | Remove quotes, commas, and periods |
| ✅ Use higher scores for longer phrases | e.g. "बोधी स्पीच रिकग्निशन" -> 2.5 vs "बोधी" -> 1.5 |

Avoid copying hotwords from other providers without validation. Bodhi may already support commonly spoken Hindi words natively.

Warnings

  • Avoid very short particles like "का", "की", "ए", etc.

  • Don’t boost every word in a sentence — only uncommon or error-prone segments.

  • Multi-word phrases work better for commonly missed expressions; individual tokens work better for rare single words.

  • Avoid boosting words that are already recognized correctly.
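
The practices above can be folded into a small helper that cleans phrases before submission. This is a heuristic sketch, not an official Bodhi utility: the particle blocklist, punctuation handling, and score-per-word scheme are all illustrative assumptions.

```python
import json
import string

# Particles too short/common to boost safely (an illustrative list,
# not an official Bodhi blocklist).
SHORT_PARTICLES = {"का", "की", "के", "ए", "से", "में"}

def build_hotwords(phrases, base_score=1.5, per_extra_word=1.0):
    """Clean phrases per the best practices above and assign scores.

    Strips ASCII punctuation, drops very short particles, and raises
    the score for multi-word phrases. The scoring scheme here is an
    assumption, not Bodhi's official recommendation.
    """
    entries = []
    for phrase in phrases:
        cleaned = phrase.translate(str.maketrans("", "", string.punctuation)).strip()
        if not cleaned or cleaned in SHORT_PARTICLES:
            continue  # skip empty strings and short particles
        n_words = len(cleaned.split())
        score = base_score + per_extra_word * (n_words - 1)
        entry = {"phrase": cleaned}
        if n_words > 1:
            entry["score"] = score  # only override the default for longer phrases
        entries.append(entry)
    return json.dumps(entries, ensure_ascii=False)

print(build_hotwords(["बोधी", "का", "स्पीच रिकग्निशन,"]))
```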


🔢 Parse Numbers into Numerals

Bodhi supports converting spoken number words into actual digits using the parse_number flag in the form values.

This is useful when transcribing sentences that include monetary values, phone numbers, addresses, or quantities — especially for use cases like banking, insurance, and logistics.

curl --location 'https://bodhi.navana.ai/api/transcribe' \
--header 'x-customer-id: <customer_id>' \
--header 'x-api-key: <api_key>' \
--form 'transaction_id="<uuid>"' \
--form 'audio_file=@"<audio_file_path>"' \
--form 'model="hi-banking-v2-8khz"' \
--form 'parse_number="True"'

🧾 Example

| Mode | Output |
| --- | --- |
| Without parse_number | "घर बनाने के लिए मुझे पच्चीस लाख का लोन चाहिए" |
| With parse_number: True | "घर बनाने के लिए मुझे 2500000 का लोन चाहिए" |
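
The conversion in this example follows the Indian numbering system: पच्चीस ("twenty-five") × लाख (1,00,000) = 25,00,000. A toy sketch of that expansion is below; the two lookup tables cover only this example, while Bodhi's actual parser runs server-side and handles far more vocabulary and patterns:

```python
# Minimal sketch of the number-word expansion shown above. This mirrors
# the *output* of parse_number for this one example only.
UNITS = {"पच्चीस": 25}          # "twenty-five"
MULTIPLIERS = {"लाख": 100_000}  # "lakh" in the Indian numbering system

def expand(words):
    """Expand a sequence of unit/multiplier number words into digits."""
    value = 0
    current = 0
    for w in words:
        if w in UNITS:
            current = UNITS[w]
        elif w in MULTIPLIERS:
            value += current * MULTIPLIERS[w]
            current = 0
    return value + current

print(expand(["पच्चीस", "लाख"]))  # prints 2500000 (25 * 100000)
```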


🌐 Language Support

This feature is currently available for:

  • Hindi (hi)

  • Malayalam (ml)

  • Kannada (kn)

  • Gujarati (gu)

  • Marathi (mr)

Want support for another language? Reach out to support@navanatech.in


📦 Aux Metadata

Set aux: True in your form values to receive server-side diagnostic metadata along with your transcript response.

This is useful for logging, benchmarking, or correlating timestamps across systems.

curl --location 'https://bodhi.navana.ai/api/transcribe' \
--header 'x-customer-id: <customer_id>' \
--header 'x-api-key: <api_key>' \
--form 'transaction_id="<uuid>"' \
--form 'audio_file=@"<audio_file_path>"' \
--form 'model="hi-banking-v2-8khz"' \
--form 'aux="true"'

📘 What You Get

When enabled, each final transcript message will include an aux_info block:

"aux_info": {
        "request_time": 0.273680048,
        "received_request_time": "2025-05-19T09:44:50.975311686Z",
        "segments_meta": [
            {
                "tokens": [
                    " घ",
                    "र",
                    " बना",
                    "ने",
                    " के",
                    " लिए",
                    " मुझे",
                    " प",
                    "च",
                    "्",
                    "च",
                    "ी",
                    "स",
                    " लाख",
                    " का",
                    " ल",
                    "ो",
                    "न",
                    " चाहिए"
                ],
                "timestamps": [
                    1,
                    1.16,
                    1.4399999,
                    1.7199999,
                    1.8399999,
                    2,
                    2.24,
                    2.44,
                    2.48,
                    2.6399999,
                    2.6799998,
                    2.72,
                    2.76,
                    2.9199998,
                    3.12,
                    3.28,
                    3.32,
                    3.4399998,
                    3.72
                ],
                "start_time": 0,
                "end_time": 3.72,
                "text": " घर बनाने के लिए मुझे पच्चीस लाख का लोन चाहिए",
                "confidence": 0.8847437
            }
        ],
        "confidence": 0.8847437 
    }
| Field | Description |
| --- | --- |
| request_time (float) | Total time in seconds that the server spent handling this request (excluding network transfer delays). |
| received_request_time (timestamp) | The UTC timestamp at which the server received the initial WebSocket connection or request. |
| segments_meta (array of objects) | Detailed view of all segment objects (transcripts separated by silences) recognized in the provided audio file. Per-segment fields are listed below. |
| confidence (float) | Confidence score (between 0 and 1) for the model's prediction for the entire audio. This is the average of all segment confidences, and is omitted when the model predicts no text for the audio. |

Each segment object contains:

  • tokens: Array of strings representing the individual text pieces (or "tokens") recognized in the segment. Tokens may be whole words or parts of words.

  • timestamps: Array of numbers indicating when each token was detected in the segment (in seconds). Each timestamp aligns with the tokens array, so the i-th timestamp is the time at which the i-th token was spoken. Useful for measuring latency.

  • start_time: Starting point (in seconds) of the segment in the overall audio timeline.

  • end_time: Ending point (in seconds) of the segment in the overall audio timeline.

  • text: Transcription belonging to the current segment.

  • confidence: Confidence score (float between 0 and 1) for the model's prediction for this segment.
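
A short sketch of consuming these fields, using a trimmed copy of the sample aux_info above. The 0.7 confidence threshold is an arbitrary choice for illustration, not a recommended cutoff:

```python
# Consume an aux_info block: compute each segment's duration and flag
# low-confidence segments for review.
aux_info = {
    "request_time": 0.273680048,
    "segments_meta": [
        {
            "start_time": 0,
            "end_time": 3.72,
            "text": " घर बनाने के लिए मुझे पच्चीस लाख का लोन चाहिए",
            "confidence": 0.8847437,
        }
    ],
    "confidence": 0.8847437,
}

for seg in aux_info["segments_meta"]:
    duration = seg["end_time"] - seg["start_time"]
    flag = "LOW" if seg["confidence"] < 0.7 else "OK"  # 0.7 is an arbitrary example threshold
    print(f"{duration:.2f}s  conf={seg['confidence']:.2f}  [{flag}] {seg['text'].strip()}")
```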

This can help you:

  • Profile server-side performance

  • Track session start times

  • Debug slow or idle sessions

  • Assess how confident the model is in its predictions
