Vocal interaction plays a central role in interpersonal communication and tends to be more or less coloured by emotions. The emotional development of an individual is influenced both by universal and cultural factors and by intra- and interpersonal aspects. This continuum of emotional development is called the emosphere, and it can be described as a four-dimensional field.
Prosodic variables such as fundamental frequency (F0), sound pressure level (SPL or Leq) and temporal aspects such as the word–pause relation and the duration of a phoneme or a syllable have been studied extensively in relation to emotional expression. The role of voice quality, in contrast, has been studied far less.
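As background for the level measure mentioned above, Leq (equivalent continuous sound level) summarizes intensity over a time window as the logarithm of the mean squared sound pressure relative to the 20 µPa hearing-threshold reference. A minimal sketch of this standard definition (the function name and sample values are illustrative, not taken from the studies):

```python
import math

P_REF = 20e-6  # reference sound pressure in Pa (20 µPa)

def leq(pressure_samples):
    """Equivalent continuous sound level (dB) over the sample window."""
    mean_square = sum(p * p for p in pressure_samples) / len(pressure_samples)
    return 10.0 * math.log10(mean_square / P_REF ** 2)

# A constant pressure of 1 Pa corresponds to roughly 94 dB SPL.
level = leq([1.0] * 1000)
```

In practice the pressure samples would come from a calibrated recording, and the window would cover the extracted vowel segment.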
Voice quality is a combination of two factors: the voice source (the vibrating vocal folds) and vocal tract function (resonances, i.e. formants). Both factors are reflected in how sound energy is distributed along the frequency range of the spectrum. According to earlier findings, recognition of vocal emotional information takes place within the first 100–150 ms of the expression and appears to be based primarily on voice quality. Perception of valence (positive, negative or neutral colouring of the voice) is even faster than cognitive identification of an actual emotion. In order to find out what role, if any, voice quality plays in emotional communication, the effect of pitch variation was eliminated by using short samples (~100–2,500 ms) in every study of this dissertation. This strict delimitation of the research object seemed justified since the technical equipment used in speech and speaker recognition and in other applications (e.g. applications for disabled people) is developing fast, and more detailed knowledge of ever smaller units is needed in order to create a more natural sound quality. The results of the present study may also serve as basic knowledge for emotional voice production in the education of vocologists, speech communication researchers and actors.
In Article I the interest was to see whether voice quality parameters other than the frequently studied prosodic characteristics F0, Leq and duration may affect the perception of emotional valence and psycho-physiological activity level. In Article II the aim was to investigate whether there were differences between human listeners and computer classification of the emotional stimuli, and what kind of differences these might be. The third study, reported in Article III, investigated the role of F3 in the conveyance of emotional valence using semi-synthesized vowels with F3 modifications. The fourth and final investigation, Article IV, focused on the perception of emotional qualities in mono-pitched expressions of different vowels.
The speech data for the first and second studies were collected from professional actors who read a text expressing sadness, joy, anger, tenderness and a neutral emotional state in random order. The stress-carrying vowel [a:] in the Finnish word ‘taakkahan’ was extracted for analysis. In the third study, some of the [a:] samples derived from the first study were used as material for semi-synthesis, in which F3 was raised in frequency, lowered, or removed completely, while the rest of the spectral structure was left intact. For the fourth study, student actors produced three mono-pitched prolonged vowels, [a:], [i:] and [u:], expressing five emotional states in random order. The emotionally expressed vowel samples were presented to 30, 40 or 50 listeners, whose task was to note which emotion or valence (positive, neutral or negative) they heard. The samples were analyzed for their acoustic characteristics, and statistical analyses were made of the relations between the acoustic variables and the valence and psycho-physiological activity level perceived. In the second study, the results of the listening test were compared to the results of the automatic classification test. Confusion matrices were created for the intended and perceived emotions in the human evaluation test and in the automatic emotion classification experiment.
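The confusion-matrix bookkeeping described above can be sketched in a few lines. The emotion labels below match the study's five categories, but the listener responses are invented for illustration only:

```python
from collections import Counter

EMOTIONS = ["neutral", "sadness", "joy", "anger", "tenderness"]

def confusion_matrix(intended, perceived):
    """Count how often each intended emotion was heard as each perceived one."""
    counts = Counter(zip(intended, perceived))
    return [[counts[(i, p)] for p in EMOTIONS] for i in EMOTIONS]

# Hypothetical responses for six stimuli (not the study's data):
intended  = ["joy", "joy", "anger", "anger", "sadness", "sadness"]
perceived = ["joy", "tenderness", "anger", "anger", "sadness", "neutral"]

# Row = intended emotion, column = perceived emotion;
# diagonal cells hold correct identifications.
matrix = confusion_matrix(intended, perceived)
```

The same tabulation applies whether the "perceiver" is a human listener or an automatic classifier, which is what makes the two experiments in the second study directly comparable.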
It was concluded that:
1. It appeared to be possible to identify emotional valence from vowel samples as short as 150 ms in average duration, and the actual emotions from vowel samples of 2,400 ms average duration.
2. Automatic classification of emotional phoneme-length stimuli also proved possible with a good accuracy rate. Human listeners' accuracy in recognizing the emotional content of speech was clearly below that of the computer classification.
3. The voice source did not merely reflect variations of F0 and Leq but appeared to have an independent role in expression, reflecting phonation types.
4. Formant frequencies F1, F2, F3 and F4 were related to the valence perceived in the vowel [a:]. Perception of positive valence tended to be associated with a higher F3 frequency, but no clear pattern could be detected, probably reflecting differences in formant use at different activity levels.
5. The mono-pitched vowels [a:], [i:] and [u:] differed in their capacity to carry emotional information: [a:] conveyed anger, tenderness and neutrality better than the other two vowels; anger was conveyed well by all vowels studied; joy was recognized slightly better in [i:] than in [a:], and distinctly better in [i:] than in [u:]; sadness, however, was signalled well by both [i:] and [u:]. In vowels [i:] and [u:], Leq was the only statistically significant variable in emotional expression. This may be due to different use of voice source and filter characteristics in different vowels, or to the fact that the same phonatory or articulatory characteristics have different acoustic consequences in the vocal tract settings of different vowels.
6. In both genders, psycho-physiological activity level was coded mainly through Leq.
7. Perception of valence tends to be a complex multilevel parameter with wide individual variation (i.e. due to differences in the individual emosphere).
8. The perceptual effects of the interplay between voice source and formant frequencies in different vowels warrant further study with modified synthetic samples that nevertheless preserve a natural sound.
9. Males may show more hesitation than females in making decisions on the quality of the emotional information perceived. Whether the reason for this is simply motivational or due to gender differences in brain processing warrants further study.