Calibration-Reasoning Framework for Descriptive Speech Quality Assessment

Sample outputs

Dimension-wise ratings

The model is asked to select the most appropriate score for a given speech sample, with the rating scale provided in the prompt as guidance.

e.g. How would you rate the noise level in the audio? Use the following scale to describe its impact: 5. Not noticeable, 4. Slightly noticeable, 3. Noticeable but not intrusive, 2. Somewhat intrusive, 1. Very intrusive.

Each quality dimension is rated on a scale from 1 to 5, where higher is better.

Each score is shown in the format S (S_true), where S is the generated score and S_true is the ground-truth score.
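For readers who want to post-process the score tables, the S (S_true) notation can be split back into numeric pairs. The snippet below is a minimal sketch, not part of the released code: `parse_scores` and the `CELL` pattern are hypothetical names, and it assumes every cell follows the format exactly as printed on this page.

```python
import re

# One "S (S_true)" cell, e.g. "3.0 (2.0)". Assumes non-negative decimal scores.
CELL = re.compile(r"(\d+(?:\.\d+)?)\s*\(\s*(\d+(?:\.\d+)?)\s*\)")

def parse_scores(cells: str) -> list[tuple[float, float]]:
    """Return (generated, ground_truth) pairs in the order they appear."""
    return [(float(s), float(t)) for s, t in CELL.findall(cells)]

# Score cells of one table row, in column order.
cells = "3.0 (2.0) 4.0 (5.0) 3.0 (2.0) 3.0 (3.0) 4.0 (2.0) 3.0 (2.0)"
pairs = parse_scores(cells)
print(pairs[0])    # (3.0, 2.0)
print(len(pairs))  # 6
```

Any text between cells (spaces or column separators) is ignored, so the same helper works on a whole table row.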

QualiSpeech-FT (Baseline)

| Sample id | Naturalness | Noise | Distortion | Listening effort | Continuity | Overall quality (MOS) |
| --- | --- | --- | --- | --- | --- | --- |
| sysnistf-0000018.wav | 3.0 (2.0) | 4.0 (5.0) | 3.0 (2.0) | 3.0 (3.0) | 4.0 (2.0) | 3.0 (2.0) |
| sysnistf-0000105.wav | 2.0 (2.0) | 5.0 (5.0) | 1.0 (2.0) | 2.0 (2.0) | 2.0 (2.0) | 1.0 (2.0) |
| sys10tts-0000070.wav | 3.0 (3.0) | 3.0 (2.0) | 5.0 (3.0) | 5.0 (2.0) | 5.0 (2.0) | 3.0 (1.0) |

Calibration-Reasoning (Ours)

| Sample id | Naturalness | Noise | Distortion | Listening effort | Continuity | Overall quality (MOS) |
| --- | --- | --- | --- | --- | --- | --- |
| sysnistf-0000018.wav | 3.0 (2.0) | 3.0 (5.0) | 2.0 (2.0) | 3.0 (3.0) | 2.0 (2.0) | 2.0 (2.0) |
| sysnistf-0000105.wav | 2.0 (2.0) | 5.0 (5.0) | 2.0 (2.0) | 3.0 (2.0) | 2.0 (2.0) | 2.0 (2.0) |
| sys10tts-0000070.wav | 3.0 (3.0) | 2.0 (2.0) | 4.0 (3.0) | 2.0 (2.0) | 4.0 (2.0) | 3.0 (1.0) |
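One way to read the two tables side by side is to average the absolute gap between generated and ground-truth scores. The sketch below does this for the three samples shown here only. Mean absolute error is an illustrative choice rather than a metric stated on this page, the variable names are hypothetical, and three samples are far too few to draw conclusions from.

```python
# Scores copied from the two tables above, one row per sample, in column order:
# naturalness, noise, distortion, listening effort, continuity, overall (MOS).
GROUND_TRUTH = [
    [2, 5, 2, 3, 2, 2],  # sysnistf-0000018.wav
    [2, 5, 2, 2, 2, 2],  # sysnistf-0000105.wav
    [3, 2, 3, 2, 2, 1],  # sys10tts-0000070.wav
]
BASELINE = [  # QualiSpeech-FT
    [3, 4, 3, 3, 4, 3],
    [2, 5, 1, 2, 2, 1],
    [3, 3, 5, 5, 5, 3],
]
CALIBRATION_REASONING = [
    [3, 3, 2, 3, 2, 2],
    [2, 5, 2, 3, 2, 2],
    [3, 2, 4, 2, 4, 3],
]

def mean_abs_error(pred, true):
    """Average |generated - ground truth| over all samples and dimensions."""
    diffs = [abs(p - t)
             for pred_row, true_row in zip(pred, true)
             for p, t in zip(pred_row, true_row)]
    return sum(diffs) / len(diffs)

print(round(mean_abs_error(BASELINE, GROUND_TRUTH), 2))               # 1.06
print(round(mean_abs_error(CALIBRATION_REASONING, GROUND_TRUTH), 2))  # 0.5
```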

Specific descriptions

The model further describes the type of noise or distortion and its temporal characteristics, such as when in the clip it occurs.

e.g. Analyze the noise present in this audio clip.

| Sample id | QualiSpeech-FT: Noise | QualiSpeech-FT: Distortion | QualiSpeech-FT: Unnatural pause | Calibration-Reasoning: Noise | Calibration-Reasoning: Distortion | Calibration-Reasoning: Unnatural pause |
| --- | --- | --- | --- | --- | --- | --- |
| sysnistf-0000018.wav | Very low level of background noise. | There is mechanical sound distortion at 0-10s. | Very smooth. | Background noise not noticeable; no intrusive environmental sounds. | Significant pitch distortion and mechanical sound between 0.5 to 9.5 seconds; speech sounds unnatural. | Breaks and stutters cause discontinuity, especially between 3 to 8 seconds. |
| sysnistf-0000105.wav | None detected. | There is pitch distortion with mechanical sound and intermittent electric current quality in the audio from 0 to 7.5 seconds. | None detected. | Background noise is not noticeable; no significant interference. | Significant pitch distortion and mechanical sound between 0.5-7.5 seconds. | Breaks and stutters cause unnatural pauses and discontinuity, especially between 2-6 seconds. |
| sys10tts-0000070.wav | Moderate background noise, particularly from 0 to 8.5 seconds; noticeable but not intrusive. | None detected. | No unnatural pauses detected. | Moderate outdoor noise, birds and chickens audible from 0 to 8.5 seconds. | Slight distortion with pitch alterations and mechanical quality, especially from 0.5 to 8.5 seconds. | Less natural speech between 2.5 and 3.5 seconds. |
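The descriptions above state time ranges in several surface forms ("0-10s", "0.5 to 9.5 seconds", "2.5 and 3.5 seconds"). If the ranges need to be compared programmatically, a small regular expression covers these variants. This is a hypothetical helper written for this page's examples, not part of either model, and `time_spans` may miss phrasings that do not appear here.

```python
import re

# A start and end time joined by "-", "to", or "and", followed by "s" or "second(s)".
SPAN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:-|to|and)\s*(\d+(?:\.\d+)?)\s*(?:seconds?|s\b)"
)

def time_spans(description: str) -> list[tuple[float, float]]:
    """Return every (start, end) range in seconds mentioned in a description."""
    return [(float(a), float(b)) for a, b in SPAN.findall(description)]

print(time_spans("There is mechanical sound distortion at 0-10s."))    # [(0.0, 10.0)]
print(time_spans("Less natural speech between 2.5 and 3.5 seconds."))  # [(2.5, 3.5)]
print(time_spans("None detected."))                                    # []
```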

Detailed comments

The model analyzes each aspect individually before synthesizing all the dimensions to derive an overall score.

e.g. Evaluate the quality of the audio you just heard. Start by analyzing specific aspects such as noise levels, distortion, naturalness, speed, listening effort, continuity, feeling of voice. Then provide an overall assessment based on these factors.

| Sample id | QualiSpeech-FT | Calibration-Reasoning |
| --- | --- | --- |
| sysnistf-0000018.wav | The speech sample presents a very low level of background noise, making it easy to focus on the content. However, there is noticeable distortion characterized by pitch alterations and a mechanical sound, particularly evident in the first 10 seconds, which detracts from the overall clarity. The audio is mostly smooth with only minor discontinuities, allowing for a relatively coherent listening experience. Listeners may find that only moderate effort is required to understand the speech, indicating some accessibility despite the distortion. The naturalness of the voice is somewhat lacking, particularly in the initial segment, which feels less human-like. The speaking speed is moderate, contributing to a balanced delivery. Overall, while the speech has some redeeming qualities, the combination of distortion and unnaturalness results in a fair quality assessment. | The speech sample presents a generally favorable background noise level, being not noticeable, which positively contributes to the listening experience. However, there is a significant issue with distortion, particularly between 0.5 to 9.5 seconds, where pitch distortion and a mechanical sound create an unnatural quality. The audio also exhibits discontinuity, with breaks and stutters that disrupt the flow, making it fairly disjointed. Listeners may find considerable effort is required to understand the speech, compounded by its unnatural tone and rhythm, especially between 3 to 8 seconds. The speaking speed is moderate, which is a neutral aspect. Overall, the voice, described as a young man's voice with a mechanical tone, lacks warmth and naturalness, leading to a poor overall quality assessment. Thus, while the background noise is well-managed, the combination of distortion, discontinuity, and unnaturalness significantly detracts from the overall listening experience. |
| sysnistf-0000105.wav | The speech sample presents a very low level of background noise, making it easy to focus on the content. However, there is noticeable distortion characterized by pitch alterations and a mechanical sound, particularly evident from 0 to 7.5 seconds, which detracts from the overall clarity. The audio is mostly smooth with only minor discontinuities, allowing for a relatively coherent listening experience. Listeners may find that only moderate effort is required to understand the speech, indicating some accessibility despite the distortion. The naturalness of the voice is somewhat lacking, as it does not closely resemble typical human speech patterns. The speaking speed is moderate, contributing to a balanced delivery. Overall, while the speech has some redeeming qualities, the combination of distortion and unnaturalness results in a fair quality assessment. | The speech sample presents a generally favorable background noise level, being not noticeable, which positively contributes to the listening experience. However, there is a significant issue with distortion, particularly between 0.5 to 7.5 seconds, where pitch distortion and a mechanical sound create an unnatural quality. The audio also exhibits discontinuity, with breaks and stutters that disrupt the flow, making it fairly disjointed. Listeners may find moderate effort required to understand the content, compounded by the unnaturalness of the speech, especially between 2 to 6 seconds, where it lacks the fluidity and clarity of natural human speech. The speaking speed is moderate, which is acceptable but does not compensate for the overall poor quality of the audio. The voice, described as a middle-aged woman's soft tone, adds a pleasant touch, yet the overall impression remains poor due to the significant flaws in distortion and discontinuity. |
| sys10tts-0000070.wav | The speech sample presents a moderate level of background noise, particularly noticeable from 0 to 8.5 seconds, which slightly detracts from the overall clarity. However, the speech itself is free from distortion, maintaining a clear and natural sound. The audio playback is very smooth, with no breaks or stutters, allowing for easy comprehension. Listeners can engage with the content effortlessly, requiring no significant mental strain to understand the message. The naturalness of the speech is somewhat lacking, as it does not fully resemble human speech in tone and rhythm, but it remains acceptable. The speaking speed is moderate, contributing to a comfortable listening experience. The voice, described as a young woman's soft and gentle tone, adds a relaxing quality to the overall impression. While the overall quality is fair, the combination of background noise and naturalness issues prevents it from achieving a higher rating. | The speech sample presents a moderate level of background noise, with outdoor sounds like birds and chickens audible from 0 to 8.5 seconds, making it somewhat noticeable but not overly intrusive. There is slight distortion characterized by pitch alterations and a mechanical quality, particularly evident from 0.5 to 8.5 seconds, which affects the clarity of the speech. The audio maintains a mostly smooth flow, with only minor discontinuities from 0.5 to 8.5 seconds. Listeners experience a moderate effort to comprehend the content, indicating that while attention is necessary, understanding is achievable without significant strain. The naturalness of the speech is somewhat lacking, especially between 2.5 and 3.5 seconds, where it feels less human-like. The speaking speed is moderate, contributing to a balanced delivery. Overall, the voice, described as a young woman's soft tone, adds a pleasant quality, but the combination of these factors results in a fair overall quality assessment. |