detecting emotions and maybe something else in voice
Posted: Mon Mar 17, 2014 12:37 am
I'm starting to think about a project that would measure emotional correlates of continuous speech.
I found this article and this article. I guess more articles still need to be found, but a few questions have already come up. These parameters are pointed out:
- fundamental frequency [can be tracked automatically by Martin's pitch tracker],
- speech rate [how to automate it to get decent results? how to approach it?],
- pauses [I guess this is a silence detector],
- voice intensity [peak detection or RMS?],
- voice onset time [I'm confused by this one],
- jitter (pitch perturbations) [as I understand, this is calculated over the pitch detection array],
- shimmer (loudness perturbations) [as I understand, this is calculated over the peak/RMS detection array],
- voice breaks [I'm confused by this one],
- pitch jumps [I'm confused by this one, i.e. what exactly to measure],
- and measures of voice quality:
-- the relative extent of high- versus low-frequency energy in the spectrum [what exactly is measured? two filters and RMS per band?],
-- the frequency and bandwidth of energy peaks in the spectrum due to natural resonances of the vocal tract called formants [and how to approach them?].
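To make the jitter and shimmer items concrete, here is a rough sketch of how I understand the common "local" definition: the mean absolute difference between consecutive values, divided by the mean value. For jitter the values are pitch periods, for shimmer they are per-cycle peak amplitudes. The function names and the idea of feeding it a pitch tracker's frame-wise f0 output are my assumptions; strictly speaking jitter is a cycle-to-cycle measure, so a frame-based pitch array only approximates it.

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


def jitter_from_f0(f0_hz):
    """Approximate local jitter from a pitch track in Hz (hypothetical helper).

    Frames with f0 <= 0 are treated as unvoiced and dropped, which joins
    periods that were not actually adjacent; this is a simplification.
    """
    periods = [1.0 / f for f in f0_hz if f > 0]
    return local_perturbation(periods)


def shimmer_from_peaks(peak_amplitudes):
    """Approximate local shimmer from per-cycle peak amplitudes (hypothetical helper)."""
    return local_perturbation(peak_amplitudes)
```

Both functions return a ratio, so multiply by 100 for a percentage. A perfectly steady pitch or amplitude track gives 0.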
Are there other parameters that should be in the game?
It's a starting point, I guess. The main goal would be to create some sort of oscilloscope views that show the emotional cues that come out of the acoustic measurements/features. But first we need to figure out what is what.
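For the intensity and pauses items, one possible starting point is a frame-wise RMS stream with a simple threshold on top of it. This is only a sketch under my own assumptions: non-overlapping frames, a fixed threshold, and a minimum pause length in frames. The names frame_rms and find_pauses are made up for illustration, and the threshold would have to be tuned per recording.

```python
import math


def frame_rms(samples, frame_len):
    """RMS of each non-overlapping frame; a trailing partial frame is dropped."""
    out = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        out.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return out


def find_pauses(rms_values, threshold, min_frames):
    """Return (start, end) frame index pairs, end exclusive, where RMS stays
    below threshold for at least min_frames consecutive frames."""
    pauses = []
    run_start = None
    for i, value in enumerate(rms_values):
        if value < threshold:
            if run_start is None:
                run_start = i  # a quiet run begins here
        else:
            if run_start is not None and i - run_start >= min_frames:
                pauses.append((run_start, i))
            run_start = None
    # Close a quiet run that lasts until the end of the signal.
    if run_start is not None and len(rms_values) - run_start >= min_frames:
        pauses.append((run_start, len(rms_values)))
    return pauses
```

The RMS list itself could feed an intensity view directly, and the pause list gives pause count and durations once frame indices are converted to seconds (frame index times frame_len divided by the sample rate).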