Research
Research direction and topics of interest
Main Research Topics
Unit-selection concatenative synthesis is currently the dominant speech synthesis approach, as it still produces the highest synthetic speech quality in terms of naturalness and intelligibility, especially in the context of commercial applications. However, its limitations are now well understood. The demand for a wider set of voices, enhanced with expressivity and posing low storage and computational requirements, is continuously rising. These demands are today among the main drivers in the field of speech synthesis.
Unit-selection speech synthesis
Since the development of the first unit-selection speech synthesizer for the Greek language (e.g. [Raptis et al., 2010]), the Group has been working on improving various aspects of the system and the overall voice-building process, both for Greek and for language-independent modules. Such work includes the grapheme-to-phoneme component ([Chalamandaris et al., 2005]), the speech database creation process ([Chalamandaris et al., 2009a]), the pitch-marking process ([Chalamandaris et al., 2009b]), the speech database size ([Tsiakoulis et al., 2008]), the speech engine footprint and performance ([Karabetsos et al., 2009]), etc.
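As a rough, hypothetical sketch (not the Group's actual engine or published cost functions), the core of unit selection can be viewed as a Viterbi search that picks, for each target unit, the database candidate minimizing the cumulative sum of target and concatenation (join) costs; all names and cost callables below are illustrative placeholders:

    import numpy as np

    def select_units(targets, candidates, target_cost, join_cost):
        """Viterbi search over candidate units: minimize the cumulative sum of
        target costs (how well a unit matches its target specification) and
        join costs (how smoothly consecutive units concatenate)."""
        best = [[target_cost(targets[0], c) for c in candidates[0]]]
        back = [[None] * len(candidates[0])]
        for i in range(1, len(targets)):
            row, ptr = [], []
            for cand in candidates[i]:
                costs = [best[i - 1][k] + join_cost(prev, cand)
                         for k, prev in enumerate(candidates[i - 1])]
                k_min = int(np.argmin(costs))
                row.append(costs[k_min] + target_cost(targets[i], cand))
                ptr.append(k_min)
            best.append(row)
            back.append(ptr)
        # Backtrack the lowest-cost sequence of candidate indices.
        j = int(np.argmin(best[-1]))
        path = [j]
        for i in range(len(targets) - 1, 0, -1):
            j = back[i][j]
            path.append(j)
        return list(reversed(path))

    # Toy usage: "units" are plain numbers, the target cost is the distance to
    # the desired value, and the join cost penalizes jumps between units.
    targets = [1.0, 2.0, 3.0]
    candidates = [[0.8, 1.5], [1.9, 2.6], [2.4, 3.1]]
    print(select_units(targets, candidates,
                       target_cost=lambda t, c: abs(t - c),
                       join_cost=lambda a, b: 0.5 * abs(b - a)))  # -> [1, 0, 1]

Real systems use far richer target features (phonetic context, prosody) and spectral join costs, but the search structure is essentially the one sketched here.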
Expressive/emotional speech synthesis
Concatenative synthesis per se does not lend itself well to expressive speech: typically, only a "neutral" speaking style is possible, or expressive speech of limited expressivity tailored to narrow application domains.
The Group is working on new methods and tools for identifying, analyzing, modeling and generating the prosodic and acoustic effects of expressive speech, for a breadth of applications. Results in these areas can either feed parametric speech synthesis models or be used to enhance unit-selection speech synthesizers. Particular focus is placed on expressive speech for child-directed applications and storytelling, where the traditional "big-six" approaches to emotion do not seem well suited to the perplexities of the narrative style.
Parametric and hybrid approaches
Due to the limitations of concatenative methods, there is growing interest in shifting to new paradigms employing parametric or hybrid approaches in TtS. Such approaches include more recent statistical parametric techniques, for example based on HMMs, thus linking speech synthesis to the well-developed corpus of methods and tools built for speech recognition, but also a resurgence of older parametric approaches in the light of recent developments such as improved vocoding techniques or other parametric (or hybrid) speech models. Among the advantages of these methods is that the speech representation schemes they adopt allow for easier manipulation of speech without severe quality degradation. This way, they offer an efficient framework for voice conversion, prosodic modeling, etc. Furthermore, the computational and storage requirements of the resulting TtS systems are significantly lower, making them more suitable for resource-constrained environments such as cell phones and portable devices. However, the speech quality of such systems still remains, in most cases, lower than that of unit-selection systems. This is partly due to known issues in parametric TtS, which are the subject of intense research at the international level.
The Group's first results in statistical parametric TtS methods have led to the first HMM synthesizer for Greek [Karabetsos et al., 2008]. Work is being carried out aiming to include a fuller acoustic and prosodic modeling of the Greek language. In addition, basic research is carried out on specific known issues of statistical parametric speech synthesis, such as the side-effects of vocoding, the adequacy of the acoustic modeling and the over-smoothing of parameter trajectories as a result of their statistical modeling, in order to improve the quality of parametric speech. In parallel, alternative parametric TtS methods are also investigated (e.g. [Raptis et al., 2001; Raptis & Carayannis, 1997]), cross-fertilized with computational intelligence and machine learning tools.
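To illustrate one of these issues, the following simplified sketch mimics the over-smoothing effect: statistically generated parameter trajectories tend to have reduced variance compared to natural speech, and one common class of remedies rescales them towards the global variance (GV) measured on natural data. The shapes, values and function name below are invented for illustration only and do not reflect the Group's actual models:

    import numpy as np

    def gv_compensate(traj, natural_gv):
        """Rescale a generated parameter trajectory (frames x dims) so that its
        per-dimension variance matches the global variance (GV) observed in
        natural speech, a simple counter-measure to over-smoothing."""
        mean = traj.mean(axis=0)
        scale = np.sqrt(natural_gv / (traj.var(axis=0) + 1e-12))
        return mean + (traj - mean) * scale

    # Example: an artificially over-smoothed 25-dimensional trajectory.
    rng = np.random.default_rng(0)
    generated = rng.normal(0.0, 0.3, size=(200, 25))    # low-variance output
    natural_gv = np.full(25, 1.0)                        # GV measured on natural speech
    restored = gv_compensate(generated, natural_gv)
    print(generated.var(axis=0).mean(), restored.var(axis=0).mean())

Production systems integrate such constraints directly into the parameter generation criterion rather than applying them as a post-processing step, but the underlying idea is the same.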
Multimodal/audiovisual speech synthesis
The parametric TtS framework is particularly suitable for modeling not only prosodic and acoustic speech parameters but also parameters that relate to facial expression and body movement. Thus, it can be naturally extended to the area of multimodal (or audiovisual) speech synthesis, driving not only acoustic models but also graphical models of synthesis.
Speech processing
In the context of speech processing, the analysis, modeling and robust representation of speech are significant research areas that support and feed research efforts in various other fields. In cooperation with other research teams, the Group performs basic research in areas such as spectral estimation (e.g. [Karabetsos et al., 2005; Tsiakoulis et al., 2005]), speech feature extraction (e.g. [Tsiakoulis et al., 2009; Tsiakoulis & Potamianos, 2009]), distance measures ([Karabetsos et al., 2010]) and others.
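As a generic, textbook-level example of such a distance measure (not the specific measures studied in [Karabetsos et al., 2010]), the log-spectral distance between two speech frames can be computed as follows:

    import numpy as np

    def log_spectral_distance(frame_a, frame_b, n_fft=512, eps=1e-10):
        """RMS difference (in dB) between the log power spectra of two frames,
        a simple way to quantify spectral distortion between, e.g., an original
        and a re-synthesized speech signal."""
        spec_a = np.abs(np.fft.rfft(frame_a, n_fft)) ** 2
        spec_b = np.abs(np.fft.rfft(frame_b, n_fft)) ** 2
        diff = 10.0 * np.log10((spec_a + eps) / (spec_b + eps))
        return np.sqrt(np.mean(diff ** 2))

    # Example on synthetic frames: a clean tone versus a slightly noisy copy.
    t = np.arange(400) / 16000.0
    clean = np.sin(2 * np.pi * 200 * t)
    noisy = clean + 0.05 * np.random.default_rng(1).normal(size=t.size)
    print(log_spectral_distance(clean, noisy))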
Music processing
Music processing has traditionally been fertilized by the results of speech processing. The Group has coordinated the successful IMUTUS project (e.g. [Raptis et al., 2005b]) and has initiated its follow-up, the VEMUS project. Issues of particular interest to the Group are music recognition and the extraction of music features.
At the intersection of speech and music processing lies the area of singing speech synthesis, where the current dominant approach is unit-selection synthesis employing appropriately crafted databases of singing speech. Of course, issues such as quality degradation become particularly evident when speech is significantly modified. Thus, all the research questions from the speech synthesis area (parametric approaches, voice conversion, etc.) naturally emerge in the area of singing speech synthesis as well.
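As a small illustration of why naive modification degrades quality (a generic example, not the Group's method), resampling a voiced frame to change its pitch also shifts its formants, which is one reason more elaborate parametric models are needed when speech is heavily modified:

    import numpy as np

    def naive_pitch_shift(frame, factor):
        """Change pitch by linear-interpolation resampling of a frame; this
        shifts the formants along with the pitch, so large modifications
        audibly distort the timbre."""
        idx = np.arange(len(frame)) * factor
        idx = idx[idx < len(frame) - 1]
        lo = idx.astype(int)
        frac = idx - lo
        return (1.0 - frac) * frame[lo] + frac * frame[lo + 1]

    # A crude "vowel" with a 120 Hz fundamental and a 700 Hz formant-like component.
    t = np.arange(800) / 16000.0
    vowel = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 700 * t)
    higher = naive_pitch_shift(vowel, 1.5)   # pitch rises, but so does the "formant"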