Advanced Features
Refer to the features below to improve accuracy, debug latency, and more.
🔍 Context Biasing (Hotwords)
You can boost recognition of important or uncommon phrases by specifying hotwords during the request.
Using Hotwords
Define your hotwords as a JSON array. You can specify a higher "boosting score" to give extra emphasis to longer phrases (recommended!). Currently, the default score is 1.5, which should be sufficient for single words.
curl --location 'https://bodhi.navana.ai/api/transcribe' \
--header 'x-customer-id: <customer_id>' \
--header 'x-api-key: <api_key>' \
--form 'transaction_id=<uuid>' \
--form 'audio_file=@"<audio_file_path>"' \
--form 'model="hi-banking-v2-8khz"' \
--form 'hotwords="[{\"phrase\":\"बोधी\"},{\"phrase\":\"स्पीच रिकग्निशन\",\"score\":4.5}]"'
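If you are calling the API from code, building the hotwords value with a JSON serializer avoids the quote-escaping shown in the curl command. Below is a minimal Python sketch using the requests library; it mirrors the curl example above, and the credentials, file path, and response handling are placeholders rather than a definitive client.
import json
import requests

# Hotword phrases to boost; longer phrases may carry a higher score (default is 1.5).
hotwords = [
    {"phrase": "बोधी"},
    {"phrase": "स्पीच रिकग्निशन", "score": 4.5},
]

with open("<audio_file_path>", "rb") as audio:
    response = requests.post(
        "https://bodhi.navana.ai/api/transcribe",
        headers={"x-customer-id": "<customer_id>", "x-api-key": "<api_key>"},
        data={
            "transaction_id": "<uuid>",
            "model": "hi-banking-v2-8khz",
            # The hotwords form field is a JSON-encoded string, as in the curl example.
            "hotwords": json.dumps(hotwords, ensure_ascii=False),
        },
        files={"audio_file": audio},
    )

print(response.json())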
Best Practices
✅ Use uncommon words
Target domain-specific or rare phrases like "बोधी स्पीच रिकग्निशन"
✅ Use local script
Always write in Devanagari (e.g. बोधी, not bodhi)
✅ Avoid punctuation
Remove quotes, commas, periods
✅ Use higher scores for longer phrases
e.g. "बोधी स्पीच रिकग्निशन" -> 2.5 vs "बोधी" -> 1.5
Avoid copying hotwords from other providers without validation. Bodhi may already support commonly spoken Hindi words natively.
Warnings
Avoid very short particles like "का", "की", "ए", etc.
Don't boost every word in a sentence; only boost uncommon or error-prone segments.
Phrases work better for commonly missed phrases; individual tokens are better for rare words.
Avoid boosting words that already work as is.
🔢 Parse Numbers into Numerals
Bodhi supports converting spoken number words into actual digits using the parse_number flag in the form values.
This is useful when transcribing sentences that include monetary values, phone numbers, addresses, or quantities — especially for use cases like banking, insurance, and logistics.
curl --location 'https://bodhi.navana.ai/api/transcribe' \
--header 'x-customer-id: <customer_id>' \
--header 'x-api-key: <api_key>' \
--form 'transaction_id="<uuid>"' \
--form 'audio_file=@"<audio_file_path>"' \
--form 'model="hi-banking-v2-8khz"' \
--form 'parse_number="True"'
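In code, enabling this is just one extra form field. A short sketch, assuming the same Python requests pattern as the hotwords example above; values in angle brackets are placeholders:
import requests

with open("<audio_file_path>", "rb") as audio:
    response = requests.post(
        "https://bodhi.navana.ai/api/transcribe",
        headers={"x-customer-id": "<customer_id>", "x-api-key": "<api_key>"},
        data={
            "transaction_id": "<uuid>",
            "model": "hi-banking-v2-8khz",
            "parse_number": "True",  # spoken number words are returned as digits
        },
        files={"audio_file": audio},
    )

print(response.json())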
🧾 Example
Without parse_number
"घर बनाने के लिए मुझे पच्चीस लाख का लोन चाहिए"
With parse_number: True
"घर बनाने के लिए मुझे 2500000 का लोन चाहिए"
🌐 Language Support
This feature is currently available for:
Hindi (hi)
Malayalam (ml)
Kannada (kn)
Gujarati (gu)
Marathi (mr)
Want support for another language? Reach out to support@navanatech.in
📦 Aux Metadata
Set aux: True in your form values to receive server-side diagnostic metadata along with your transcript response.
This is useful for logging, benchmarking, or correlating timestamps across systems.
curl --location 'https://bodhi.navana.ai/api/transcribe' \
--header 'x-customer-id: <customer_id>' \
--header 'x-api-key: <api_key>' \
--form 'transaction_id="<uuid>"' \
--form 'audio_file=@"<audio_file_path>"' \
--form 'model="hi-banking-v2-8khz"' \
--form 'aux="true"'
📘 What You Get
When enabled, each final transcript message will include an aux_info block:
"aux_info": {
"request_time": 0.273680048,
"received_request_time": "2025-05-19T09:44:50.975311686Z",
"segments_meta": [
{
"tokens": [
" घ",
"र",
" बना",
"ने",
" के",
" लिए",
" मुझे",
" प",
"च",
"्",
"च",
"ी",
"स",
" लाख",
" का",
" ल",
"ो",
"न",
" चाहिए"
],
"timestamps": [
1,
1.16,
1.4399999,
1.7199999,
1.8399999,
2,
2.24,
2.44,
2.48,
2.6399999,
2.6799998,
2.72,
2.76,
2.9199998,
3.12,
3.28,
3.32,
3.4399998,
3.72
],
"start_time": 0,
"end_time": 3.72,
"text": " घर बनाने के लिए मुझे पच्चीस लाख का लोन चाहिए",
"confidence": 0.8847437
}
],
"confidence": 0.8847437
}
request_time (float)
Total time in seconds that the server spent handling this request (excluding network transfer delays).
received_request_time (timestamp)
The timestamp (UTC) when the server received the initial WebSocket connection or request.
segments_meta (array of objects)
Detailed view of all segment objects (transcripts separated by silences) recognized for the audio file provided. Each segment object has the following information:
tokens: Array of strings representing individual text pieces (or "tokens") recognized from the segment. Tokens may include words or parts of words.
timestamps: Array of numerical values indicating when each token was detected in the segment (in seconds). Each timestamp aligns with the tokens array, so the i-th timestamp represents the time at which the i-th token was spoken. Useful for measuring latency.
start_time: Starting point (in seconds) of the current segment in the overall audio timeline.
end_time: Ending point (in seconds) of the current segment in the overall audio timeline.
text: Transcription belonging to the current segment.
confidence: Confidence score (float between 0 and 1) for the model’s prediction for this segment.
confidence
Confidence score (float between 0 and 1) for the model’s prediction for the entire audio.
Note: This is an average of all segment confidences. This field will not be present if the model does not predict any text for the audio.
This can help you:
Profile server-side performance
Track session start times
Debug slow or idle sessions
Assess how confident the model is about its prediction
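For example, here is a small Python sketch for reading these fields from a parsed response. It assumes the response body has the shape shown above and that response is the object returned by the requests calls in the earlier examples:
result = response.json()  # response from a request made with aux="true"

aux = result.get("aux_info", {})
print(f"Server handling time: {aux.get('request_time', 0.0):.3f}s")
print(f"Request received at: {aux.get('received_request_time')}")

for segment in aux.get("segments_meta", []):
    duration = segment["end_time"] - segment["start_time"]
    print(f"Segment ({duration:.2f}s, confidence {segment['confidence']:.2f}):{segment['text']}")
    # tokens and timestamps are aligned index by index, so zip pairs each token
    # with the time (in seconds) at which it was detected.
    for token, ts in zip(segment["tokens"], segment["timestamps"]):
        print(f"  {ts:6.2f}s {token!r}")

# The overall confidence is the average of segment confidences and may be
# absent if no text was predicted for the audio.
if "confidence" in aux:
    print(f"Overall confidence: {aux['confidence']:.2f}")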