
Exploring Emotion and Emotional Variability as Digital Biomarkers in Frontotemporal Dementia Speech




Abstract:

Frontotemporal Dementia (FTD) encompasses a diverse group of progressive neurodegenerative diseases that impact speech production and comprehension, higher-order cognition, behavior, and motor control. Traditional acoustic speech markers have been extensively studied in FTD, as have assessments capturing apathy and impairments in recognizing and expressing emotion. This work leverages machine learning to track changes in emotional content within the speech of individuals with FTD and healthy controls. The aim of the project is to develop tools for assessing and monitoring emotional changes in individuals with FTD, quantifying these subtle aspects of the disease and thus potentially providing insights for assessing future therapeutic interventions. A retrospective analysis was conducted on a dataset comprising standard elicited speech tasks performed by 78 individuals diagnosed with FTD and 55 healthy elderly controls. We employed an ensemble-based convolutional neural network (CNN) classifier trained on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset to extract emotion scores from processed speech samples. The classifier was applied with a sliding window to the FTD and healthy control narratives to facilitate a granular examination of emotional changes throughout longer speech samples. Analysis of variance (ANOVA) was used to test for group differences in average emotion scores as well as emotional variability over the duration of the speech samples. Compared to healthy controls, people with FTD demonstrated reduced emotional change in a monologue task describing a happy experience, as measured by the interquartile range (IQR) (p < 0.005) and slope of “happy” emotion scores vs. time (p < 0.005). During a picture description task, people with FTD displayed a slightly elevated average level of frustration (p < 0.005). Increased frustration levels in individuals with FTD could potentially indicate their difficulties in accomplishing the task. This stu...
The analysis pipeline: Audio data from both Interactive Emotional Dyadic Motion Capture (IEMOCAP) and the UMel dataset, including speech from Frontotemporal Dementia (FTD...
Published in: IEEE Access (Volume: 12)
Page(s): 71419 - 71432
Date of Publication: 20 May 2024
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Frontotemporal Dementia (FTD) encompasses a clinically heterogeneous spectrum of early-onset, progressive neurodegenerative diseases affecting behaviour, speech, language, cognition and motor function, resulting in significant personal, social, and economic burden [1]. The disease has a significant impact on participants’ emotional processing; in particular, apathy, presenting clinically as a loss of motivation and flatness of affect [2], relates to poorer disease prognosis, including increased caregiver burden and progression of neurodegeneration [3]. Disease-modifying therapies targeting the underlying pathobiology of FTD are in development [4], driving the need for quantitative markers of disease presence and progression.

Digital speech biomarkers have shown significant potential in characterizing neurological and respiratory illness [5], [6], as well as in characterizing FTD subtype-specific deficits in acoustic and lexical aspects of speech [7], [8], [9]. The most common subtype, behavioral variant FTD (bvFTD), is clinically characterized by social-behavioral deficits, apathy, and executive dysfunction [2]; in terms of digital speech biomarkers, people with bvFTD demonstrate relative preservation of phonology, yet impaired prosody, frequent pausing, and reduced lexical diversity [10], [11], [12], [13]. Primary progressive aphasia (PPA) subtypes of FTD include the semantic variant (svPPA), characterized by word comprehension and confrontation naming deficits that lead, for example, to greater pronoun use, and the nonfluent variant (nfvPPA), which is characterized by apraxia, agrammatism and effortful, nonfluent speech, leading to increased speech errors and pauses [8].

An important part of the burden of FTD, especially for caregivers, is that it causes deficits in recognizing or expressing emotion [14], [15], [16], [17] that differ among variants [18]. bvFTD is linked to impairments in emotion perception [19], and impaired processing of emotions like embarrassment is linked to behavioral disturbances in early bvFTD [20], [21]. Apathy is a particularly important aspect of emotional response in FTD, especially in bvFTD [3]. Apathy is currently assessed by clinicians through observation or clinician-administered testing [22], with multiple apathy scales in use [23], [24], [25], [26], [27]. Clinician-rated apathy shows less consistency than other measures used in identifying probable bvFTD [28]. It has been proposed that automated tools to objectively measure apathy and other emotional content of speech could be beneficial by allowing more widespread screening and assessment of participants [29], [30], [31].

Here, we sought to address these needs by applying recently developed tools for sentiment and emotion analysis. In many applications, sentiment is judged based on speech transcripts [32], [33]; for example, Friedman and Ballentine [32] analyzed transcripts to quantify effects of psychoactive substances. However, we believe audio has important advantages in our application, as emotion changes may occur at a fast time scale (within individual sentences) not easily captured through transcripts. Thus we focused instead on acoustic-based speech emotion recognition (SER), which detects low-level latent features from acoustic data that can be used to categorize speaker emotions into discrete categories, including anger, boredom, disgust, surprise, fear, joy, happiness, neutral and sadness [34], [35], [36], [37]. Relevant to our work, Linz et al. [29] demonstrated an ability to predict clinical scores of apathy in a population of mild cognitive impairment (MCI) participants based on acoustic cues in speech. Other existing SER models based on audio input [35], [38], [39], [40], [41], [42], [43], [44] are trained to predict the emotion of an overall utterance but are not well suited to capturing temporal fluctuations in emotion (the Appendix shows an example of over-training leading to rapid emotion switches; other issues in these models include discrepancies in audio clip duration and the lack of a “frustration” class, a potentially important emotion in participants with cognitive difficulties). Recent work leverages labels generated by ChatGPT to further explore the intensity of emotions (instead of assigning a single class) [45]. However, the ChatGPT-enhanced labels are still not a continuous output that can be easily used to track emotion changes over time.

We hypothesized that the impact of FTD on emotional expression would alter the perceived emotional content of speech in people with FTD, and that these impacts could be objectively quantified using automated SER analysis of speech tasks that elicit emotional content. Hence, in this study, we leveraged transfer learning techniques by training a convolutional neural network (CNN)-based SER model to recognize emotions on a database of healthy controls, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [46], and then applying this trained model to a separate dataset of individuals with FTD and healthy elderly controls. Features designed to capture variability in expressed emotion were then extracted.

The main contribution of this paper lies in presenting a robust framework for tracking variability in expressed emotions; this framework leverages machine learning algorithms to analyze the temporal dynamics and variability of emotions during narratives via the slope and variability in scored emotions. This work builds on earlier work by members of our group [6], [47]. While we used a particular deep learning SER approach, our framework could be adapted to future SER models developed in this rapidly evolving field. A second contribution is that we compare these metrics in different participants (healthy elderly participants and those with FTD), demonstrating differences between the populations which suggest automated scoring of emotions has potential as a tool for clinical assessment.

The paper’s structure is as follows: the Methods section covers dataset descriptions, audio data preprocessing steps, and the training and evaluation of the emotion recognition model through transfer learning on the widely used IEMOCAP dataset [46]. The Methods section also describes our approach for characterizing temporal variation in emotional content when our SER model (trained on IEMOCAP) is applied to new data. In the Results section, we first present the performance of the emotion recognition model on IEMOCAP, then present findings from characterizing the monologue and picture description tasks in FTD and healthy elderly participants. In the Discussion section, we discuss implications of the study’s findings as well as observed differences in emotional expression between FTD participants and healthy controls. In the Appendix, we explore the limitations of using another pretrained emotion classifier model for our task (wav2vec2-IEMOCAP [35]).

Our results demonstrate that individuals with FTD show similar average levels of emotion as healthy controls during a monologue task but have significantly reduced variability over time in perceived emotion, consistent with a flatter, less emotionally engaged affect. In addition, individuals with FTD exhibit a small but significant increase in frustration scores during a picture description task. It will be important to replicate these results in a dataset that includes clinician-rated scores of emotional processing and affect. Nevertheless, our results suggest that automated emotion scoring may be a useful tool for quantifying the impact of FTD and other disorders on participants’ ability to express emotion in daily life.

SECTION II.

Methods

A. Data Source

a: Emotions Training Data

For SER model training, we used the IEMOCAP dataset to train machine learning classifiers for emotion categorization in audio recordings of speech [46]. The IEMOCAP dataset comprises audio recordings from professional actors who express a range of emotions through speech. It is organized into five sessions, each featuring a different pair of male and female actors engaged in both scripted and improvised dialogue.

To prepare the data, audio recordings from IEMOCAP were truncated into utterances of average length 4.5 seconds, ensuring that each utterance contained speech from a single actor. Trained annotators then classified these recordings into ten emotion categories: “angry,” “happy,” “disgusted,” “fear,” “frustrated,” “excited,” “neutral,” “sad,” “surprised,” and “others.” The “others” category represents emotions outside the predefined set for the dataset. Additionally, the category “xxx” was used to indicate samples where annotators did not reach a consensus.

The original IEMOCAP dataset consists of 10,039 utterances, with the emotional category distribution shown in Fig. 1. However, we filtered the dataset by excluding the “others” and “xxx” categories as these have limited interpretability. Furthermore, we excluded the categories “fear,” “disgusted,” and “surprised” due to their small sample sizes. Last, following the approach of previous studies using IEMOCAP [39], [40], we combined the “excited” and “happy” categories into a single “happy” category. The filtered IEMOCAP dataset used in this study consists of 7,380 utterances and the distribution of emotion categories is illustrated in Fig. 2.
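For concreteness, the label filtering and merging described above can be expressed in a few lines of Python. This is a minimal sketch that assumes the utterance labels have already been parsed from the IEMOCAP annotation files into (path, label) pairs using IEMOCAP's standard short codes; the function and variable names are illustrative, not the authors' code.

```python
# Minimal sketch of the label filtering and merging described above.
# Assumes `utterances` is a list of (wav_path, label) pairs parsed from the
# IEMOCAP annotation files, using IEMOCAP's short label codes.

EXCLUDED = {"oth", "xxx", "fea", "dis", "sur"}   # "others", no-consensus, small classes
MERGE = {"exc": "hap"}                           # fold "excited" into "happy"

def filter_iemocap(utterances):
    """Keep only the five classes used in this study: ang, hap, neu, sad, fru."""
    kept = []
    for wav_path, label in utterances:
        if label in EXCLUDED:
            continue
        kept.append((wav_path, MERGE.get(label, label)))
    return kept

# Example: filtered = filter_iemocap(utterances)  ->  ~7,380 utterances expected
```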

FIGURE 1. Distribution of all labeled emotion categories from the original IEMOCAP dataset. The category “xxx” represents samples where annotators did not reach a consensus; other emotion labels are as shown.

FIGURE 2. Distribution of filtered labeled emotion categories from the IEMOCAP dataset.

b: FTD and Healthy Participant Data

We retrospectively analyzed an existing dataset containing audio recordings collected during elicited speech tasks from people with FTD and healthy elderly controls (denoted below as the “UMel dataset”). Participants provided informed consent and were seated in a sound-attenuated room (ambient noise in audio files \lt 50\text {dB}_{\text {A}} ; M = 36.5\text {dB}_{\text {A}} , SD = 1.27) [49]. Data from healthy elderly controls were collected at the University of Melbourne [50], and data from people with FTD were collected at the University of Melbourne and Monash University, both located in Melbourne, Australia [51]. We used the clinical diagnosis as our starting point, then checked the clinical features of each participant against the published criteria for each subtype [2], [52]. Only participants with a clinical diagnosis of bvFTD or primary progressive aphasia who fulfilled the published criteria for probable (not possible) or definite (for those with a known gene mutation) disease were included in this study. Participants were excluded if they presented with a behavioral disturbance better accounted for by a psychiatric diagnosis, or with biomarkers strongly indicative of Alzheimer’s disease or another neurodegenerative process. Only participants with a clinical diagnosis of primary progressive aphasia (svPPA, lvPPA, nfvPPA) according to consensus diagnostic criteria were included in the study [52]. No participants with a motor disorder (e.g., corticobasal degeneration (CBD), progressive supranuclear palsy (PSP), amyotrophic lateral sclerosis (ALS)) were included in the study. Healthy elderly participants between 50 and 92 years of age were recruited through local networks and press releases in Melbourne. Participants were excluded if they had a history of neurologic disease, traumatic brain injury, intellectual or hearing impairments, or any changes impairing the vocal tract. FTD participants were recruited from those undergoing clinical care at the Eastern Cognitive Disorders Clinic, Box Hill Hospital, Melbourne, Australia. Recordings from people with FTD were collected during routine clinic visits (annually or less frequently), while recordings from healthy elderly controls were collected at a single visit. Samples were recorded using a Marantz PMD671 solid-state recorder coupled with an AKG C520 cardioid head-mounted condenser microphone (frequency range 20 Hz–20 kHz; sensitivity −43 dB) positioned at a 45° angle 8 cm from the mouth. Recordings were sampled at 44.1 kHz and quantized at 8 bits [51]. The elicited speech tasks in this dataset measure acoustic, motoric, and linguistic aspects of speech and include picture description (Cookie Theft), monologue, sustained phonation, syllable repetition, word repetition, and days of the week.

In this study, we focused our analysis on participants with a diagnosis of bvFTD, nfvPPA, and svPPA. We chose to analyze only monologue and picture description tasks which have similar audio recording lengths as the IEMOCAP dataset utterances.

In the monologue task, participants are asked to describe a happy event of their own choice. In the picture description task, participants are shown a picture (Fig. 3) and asked to describe the picture in their own words [53]. All participants completed the monologue task; however, only a subset of people with FTD completed the picture description task.

FIGURE 3. The “Cookie Theft” picture from the Boston Diagnostic Aphasia Examination [48].

The number of recordings per diagnosis for each task are shown in Table 1. In total, we analyzed audio recordings from 93 monologue and 44 picture description tasks from people with FTD, and audio recordings from 55 monologue and 58 picture description tasks from healthy elderly controls.

TABLE 1. Demographic characteristics of individuals with FTD and healthy elderly controls. “MONL” refers to the monologue task, and “PIC” refers to the picture description task.

B. Data Preprocessing

To maintain consistency between the IEMOCAP and UMel datasets, we first manually annotated and removed interviewer speech from audio recordings from the UMel dataset. This resulted in utterances containing only participant speech.

Next, we transformed all audio files from the IEMOCAP and UMel datasets into Mel Frequency Cepstral Coefficient (MFCC) representations using the librosa library [54]. Files were converted to a sampling rate of 22,050 Hz, and MFCCs were computed using default librosa parameters (2048-point Hanning-windowed FFTs, 75% overlap, 20 MFCC coefficients, maximum frequency set to the Nyquist rate). MFCC values were plotted to form images, which were resized to 224\times 224 and normalized with the mean (0.485, 0.456, 0.406) and standard deviation (0.229, 0.224, 0.225) for the three channels, respectively, to meet the expected input size of the machine learning classifiers. MFCC images had a fixed duration of five seconds, a parameter chosen because the average length of an IEMOCAP utterance is 4.5 seconds. Shorter utterances were zero-padded to five seconds, while longer utterances were truncated to five seconds.
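A minimal sketch of this preprocessing is shown below, assuming the standard librosa and torchvision APIs. Because the exact image-rendering code is not given in the text, the conversion of the MFCC matrix into a three-channel 224×224 image is an illustrative approximation rather than the authors' implementation.

```python
# Sketch of the MFCC-image preprocessing described above (illustrative only).
import librosa
import torch
import torch.nn.functional as F
from torchvision import transforms

SR = 22050
CLIP_SECONDS = 5          # fixed window length (IEMOCAP utterances average ~4.5 s)

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

def mfcc_image(path):
    y, _ = librosa.load(path, sr=SR)                          # resample to 22,050 Hz
    y = librosa.util.fix_length(y, size=SR * CLIP_SECONDS)    # zero-pad / truncate to 5 s
    m = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=20)           # default 2048-pt FFT, 75% overlap
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)            # scale to [0, 1], like a rendered image
    img = torch.tensor(m, dtype=torch.float32).unsqueeze(0).repeat(3, 1, 1)   # 3 channels
    img = F.interpolate(img.unsqueeze(0), size=(224, 224),
                        mode="bilinear", align_corners=False).squeeze(0)      # resize to 224x224
    return normalize(img)                                     # ImageNet mean/std, as in the text
```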

C. Emotion Recognition Model

Our analysis pipeline, as depicted in Fig. 4, involved two steps: 1) Training the emotion classifier, and 2) Conducting emotion tracking.

FIGURE 4. Flowchart illustrating the process of tracking emotions in a long speech sample from a monologue task, with the same process applied to speech samples from the PIC task. “MFCC” refers to “Mel Frequency Cepstral Coefficients”; “OEM” refers to the “Optimal Ensemble Model” described in Section III.

First, we employed standard transfer learning techniques to train five CNN source models: AlexNet [55], AlexNet-GAP [56], VGG11 [57], ResNet18, and ResNet50 [58]. Originally developed for image classification, these models were adapted to classify emotions using visual MFCC spectrograms from the IEMOCAP dataset. During the training phase, we designated session 5 of the IEMOCAP dataset as the test set to evaluate model performance. Sessions 1, 2, and 3 were used as the training set. We performed hyper-parameter tuning and model selection using session 4 as a validation set to optimize our models and select the best ensemble. To address class imbalance in the emotion categories, we employed the Synthetic Minority Oversampling Technique (SMOTE) [59] during model training. To prevent overtraining, we limited training to 20 epochs, a significantly smaller number than in similar works [34], [35], [36], [37]. The batch size was set to 32. The learning process was optimized using stochastic gradient descent (SGD) with a learning rate of 0.001 and a momentum factor of 0.9, with updates applied exclusively to the parameters of the newly added fully connected layer. A learning rate scheduler was employed, reducing the learning rate by a factor of 0.1 every 7 epochs. Additionally, we utilized early stopping to determine the optimal number of epochs. The ensemble was formed by averaging the probabilities generated by the five base models.
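The following sketch illustrates this transfer-learning setup for one base model (ResNet18) using the hyper-parameters stated above and a recent torchvision API; it is a simplified illustration rather than the authors' training code, and the SMOTE oversampling step is only indicated in a comment.

```python
# Illustrative transfer-learning setup for one base model of the ensemble.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5  # angry, happy, sad, neutral, frustrated

def build_base_model():
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    for p in model.parameters():            # freeze the pretrained backbone
        p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)   # new trainable head
    return model

model = build_base_model()
criterion = nn.CrossEntropyLoss()
# Only the new fully connected layer is updated, as stated in the text.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

# Training loop (not shown): at most 20 epochs, batches of 32, early stopping on
# session 4. SMOTE oversampling of minority classes would be applied to the
# training features before batching (e.g., imblearn.over_sampling.SMOTE).

def ensemble_probabilities(base_models, batch):
    """OEM-style ensemble: average the softmax outputs of the base models."""
    with torch.no_grad():
        probs = [torch.softmax(m(batch), dim=1) for m in base_models]
    return torch.stack(probs).mean(dim=0)
```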

Next, we applied the trained emotion classifier to each recording in the UMel dataset to track the progression of emotions over time. For this purpose, we extracted emotion scores for five-second long windows and estimated the percentage of each emotion category (happy, neutral, frustrated, angry, and sad). We used sliding windows shifted by 0.1 seconds to smoothly capture emotions vs. time. This approach generated emotion densities that show the change in emotion percentages over the duration of the elicited speech tasks.
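A sketch of this sliding-window scoring is given below. It assumes a helper `window_to_image` (the MFCC-image transform of the previous sketches applied to an in-memory waveform, with zero-padding of short tail windows) and reuses the `ensemble_probabilities` helper; both names are illustrative rather than the authors' implementation.

```python
# Sketch of sliding-window emotion tracking over a long recording.
import numpy as np
import librosa

SR = 22050
WIN = 5 * SR          # 5-second analysis window
HOP = int(0.1 * SR)   # 0.1-second shift between successive windows
EMOTIONS = ["angry", "happy", "sad", "neutral", "frustrated"]

def track_emotions(path, base_models):
    """Return (times, scores); scores[:, i] is the percentage of EMOTIONS[i] over time."""
    y, _ = librosa.load(path, sr=SR)
    times, scores = [], []
    for start in range(0, max(len(y) - WIN, 1), HOP):
        window = y[start:start + WIN]                               # tail windows padded inside helper
        img = window_to_image(window).unsqueeze(0)                  # (1, 3, 224, 224) MFCC image
        probs = ensemble_probabilities(base_models, img).squeeze(0)  # averaged softmax = emotion percentages
        times.append(start / SR)
        scores.append(probs.numpy())
    return np.array(times), np.vstack(scores)
```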

D. Statistical Analysis

We created custom Python code to statistically analyze the emotion percentages for each elicited speech task. We computed the mean, standard deviation, and interquartile range (IQR) of the emotion scores for each utterance. We employed analysis of variance (ANOVA) to detect group differences. We also used the Games-Howell test for pairwise comparisons to determine which specific comparisons exhibited significant differences. To capture changes in emotion scores over time, we performed robust line fits using the statsmodels package with the Huber loss [60]. The Huber loss is defined as:\begin{align*} L_{\delta }(a) = \begin{cases} \dfrac {1}{2}a^{2} & \text {for } |a|\leq \delta \\ \delta \left ({|a| - \dfrac {1}{2}\delta }\right ) & \text {otherwise.} \end{cases}\end{align*}

The Huber loss allows control over the treatment of “inliers” vs. “outliers” via one required user-specified parameter \delta . This parameter is interpreted as the maximum distance between the predicted value and the observed value for which a point is still considered an inlier (and thus incurs the quadratic loss, while the linear loss is used for outliers). In our case, the value of \delta was chosen to be the median absolute deviation about the median. The Huber loss function is known to be relatively robust to outliers [61].

These line fits allowed us to estimate slopes and determine whether there were statistically significant differences between groups. Furthermore, we conducted binomial tests to evaluate whether the slopes of healthy elderly controls were outside the 95% range of slopes observed in people with FTD. Given the limited sample size in the picture description task, we combined all subtypes of FTD into a single FTD participant group for the statistical analyses.
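A sketch of these statistical steps (robust Huber line fit, one-way ANOVA on per-recording features, and the binomial test on slopes) is shown below, using standard scipy and statsmodels calls. The mapping of \delta onto the HuberT threshold and all variable names are our interpretation of the description above, not the authors' code.

```python
# Sketch of the robust line fit and group comparisons described above.
import numpy as np
import statsmodels.api as sm
from scipy import stats

def happy_slope(times, happy_scores):
    """Robust (Huber) linear fit of the 'happy' percentage vs. time; returns the slope."""
    mad = stats.median_abs_deviation(happy_scores)            # delta = MAD about the median
    X = sm.add_constant(times)                                 # intercept + time
    fit = sm.RLM(happy_scores, X, M=sm.robust.norms.HuberT(t=mad)).fit()
    return fit.params[1]

def anova(groups):
    """One-way ANOVA on per-recording summary features (e.g., IQR of 'happy'), by diagnosis."""
    return stats.f_oneway(*groups)

def control_slopes_outside_ftd_range(control_slopes, ftd_slopes):
    """Binomial test: do control slopes fall outside the central 95% range of FTD slopes
    more often than the 5% expected by chance?"""
    lo, hi = np.percentile(ftd_slopes, [2.5, 97.5])
    n_outside = int(np.sum((control_slopes < lo) | (control_slopes > hi)))
    return stats.binomtest(n_outside, n=len(control_slopes), p=0.05, alternative="greater")
```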

SECTION III.

Results

A. Emotion Recognition Model Performance

Model validation identified the best-performing model, referred to as the Optimal Ensemble Model (OEM), which combines AlexNet, ResNet18, and VGG11 architectures. This model classified five emotion categories (“angry”, “happy”, “sad”, “neutral”, and “frustrated”) with 50% accuracy (compared to the 20% expected for random guessing) and achieved a top-two class accuracy of 72%.

We compared our model to the wav2vec2-IEMOCAP model, a speech emotion recognition algorithm built upon the wav2vec2 base and fine-tuned on the IEMOCAP dataset. The wav2vec2-IEMOCAP model achieved 75% accuracy on the slightly simpler task of classifying four emotion categories (“angry”, “happy”, “sad”, and “neutral”) [35]. However, limitations of the wav2vec2-IEMOCAP model in capturing changes in emotions over time are discussed in Appendix.

The confusion matrix for OEM shows that the model achieved the highest accuracies for the “happy” and “sad” emotions (Fig. 5). Receiver operating characteristic (ROC) analysis revealed a micro-averaged area under the curve (AUC) of 0.76 (Fig. 6).
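These evaluation metrics can be computed with scikit-learn as in the sketch below, where `y_true` holds integer emotion labels from the IEMOCAP test session and `y_proba` the OEM probability outputs; the names are illustrative.

```python
# Sketch of the micro-averaged AUC and top-two accuracy computations.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

def micro_auc(y_true, y_proba, n_classes=5):
    y_bin = label_binarize(y_true, classes=np.arange(n_classes))  # one-hot ground truth
    return roc_auc_score(y_bin, y_proba, average="micro")         # pooled one-vs-rest AUC

def top2_accuracy(y_true, y_proba):
    top2 = np.argsort(y_proba, axis=1)[:, -2:]                    # two highest-scoring classes
    return np.mean([t in row for t, row in zip(y_true, top2)])
```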

FIGURE 5. Confusion matrix of OEM, illustrating the model’s performance in accurately predicting emotion categories. The matrix displays the predicted emotion labels along the x-axis and the true emotion labels along the y-axis.

The primary purpose of OEM was to accurately capture the distribution and density of emotion percentages and to track smooth transitions in emotion over time, rather than to assign audio recordings to a single emotion category with high confidence. This use case is reflected in the subsequent application of OEM to analyze emotion over time in longer speech recordings (\geq 30 s).

B. Emotion Recognition in FTD Speech

a: Monologue Task

Across all diagnostic groups, OEM consistently identified “happy” as the emotion with the highest percentage, followed by “neutral” as the emotion with the second-highest percentage (Table 2). In contrast, the remaining emotions (“frustrated,” “angry,” and “sad”) had significantly lower percentages.

TABLE 2. Mean and standard deviation of average emotion percentage over time for the monologue task, obtained using OEM.

Fig. 7 displays the output for a single healthy elderly control participant, while Fig. 8 illustrates an example of the model output for a single bvFTD participant. In both cases, the model predictions exhibit smooth transitions over the time series, with the dominant emotion being “happy.” This observation aligns with the nature of the monologue task, in which participants are instructed to describe a happy experience. In addition, the healthy elderly control participant had more noticeable variability in the happy emotion score over the duration of the audio recording, a point examined below in detail.

FIGURE 6. Receiver operating characteristic (ROC) curves of OEM, showing performance per emotion as well as micro-averaged performance.

FIGURE 7. Emotion percentages over time for a healthy control participant completing the monologue task, obtained using OEM. The black line indicates the robust line fit of the “happy” emotion over time. Color scheme: red (“angry”), orange (“happy”), purple (“frustrated”), green (“neutral”), blue (“sad”).

FIGURE 8. Emotion percentages over time for a bvFTD participant completing the monologue task, obtained using OEM. The black line indicates the robust line fit of the “happy” emotion over time. Color scheme: red (“angry”), orange (“happy”), purple (“frustrated”), green (“neutral”), blue (“sad”).

No statistically significant differences were observed in the mean and standard deviation of the “happy” emotion percentage over time among the different diagnostic groups. However, the healthy elderly controls exhibited a wider IQR for the “happy” emotion compared to bvFTD, svPPA, and nfvPPA, as depicted in Fig. 9, with statistical significance (p\lt 0.005 ) observed between healthy elderly controls and each FTD subtype.

FIGURE 9. IQR of the “happy” emotion from the monologue task obtained using OEM.

To further investigate the relationship between the variability in the “happy” emotion percentage over time and the observed difference in the IQR between FTD participants and healthy elderly controls, we detrended the time series data. After detrending, we observed that only one participant group (svPPA) retained a statistically significant difference in IQR from the healthy elderly controls. This indicates that the between-group differences in IQR are mainly explained by the overall trends in the “happy” emotion percentage over time, rather than fluctuations around the trend.
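This detrending check can be reproduced with two scipy calls, as in the minimal sketch below, where the input is a per-recording time series of “happy” percentages.

```python
# Sketch of the detrending check: remove the linear trend and recompute the IQR.
from scipy import signal, stats

def iqr_raw_and_detrended(happy_scores):
    raw_iqr = stats.iqr(happy_scores)                         # IQR of the original series
    residuals = signal.detrend(happy_scores, type="linear")   # fluctuations around the trend
    return raw_iqr, stats.iqr(residuals)
```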

We next explored changes in the “happy” emotion percentage over time as a novel measure of emotional variability. To quantify this, we calculated the absolute value of the change in the “happy” emotion percentage over time (as depicted in Fig. 11). Notably, the slopes of the “happy” emotion percentage indicated change over time in healthy elderly controls, while in people with FTD the slopes were close to zero, indicating relatively stable “happy” percentages. Robust line fit analysis indicated that the “happy” emotion percentage slopes from healthy elderly controls frequently fell outside the range containing 95% of the “happy” emotion percentage slopes from FTD participants (Fig. 10). This difference was statistically significant (binomial test p-value = 0.0008).

FIGURE 10. Slope (with 95% confidence interval) for the robust line fit of “happy” emotion percentage over time for the monologue task, obtained using OEM. The dotted-dashed line denotes the range that contains 95% of the slopes of FTD participants.

FIGURE 11. Absolute value of the slope of “happy” emotion over time on the monologue task, obtained using OEM.

As a sensitivity analysis, we conducted a re-analysis of the monologue task using the emotions classifier described previously [35]. This classifier is known for its high accuracy in categorical classification of emotion categories, although it is not specifically designed to capture mixtures of emotions as OEM does through emotion percentages. The results presented in the appendix validated our observation of reduced variability in the “happy” emotion for FTD participants, thereby supporting the robustness of our findings.

b: Picture Description Task

In contrast to the monologue task, the picture description task did not exhibit a dominant emotion in any group, as indicated in Table 3. However, when comparing the combined FTD group (bvFTD, svPPA, and nfvPPA) to healthy elderly controls, we observed a statistically significant higher average, standard deviation, and IQR for the “frustrated” emotion percentage, as illustrated in Fig. 12.

TABLE 3. Mean and standard deviation of average emotion over time for the picture description task, obtained using OEM.
FIGURE 12. Average and IQR (per-recording) of “frustrated” emotion over time for the picture description task, obtained using OEM. “**” indicates Bonferroni-corrected p-values <0.005, “*” indicates Bonferroni-corrected p-values <0.05, and “N.S.” indicates not statistically significant.

A robust line fit analysis of the “frustrated” emotion percentage showed that 95% of all participants, including both FTD and healthy elderly controls, had negligible changes over time, with absolute slopes of the “frustrated” emotion percentage below 0.001. This suggests that the observed differences in the “frustrated” emotion between groups are not driven by temporal variation but rather reflect inherent distinctions between the groups. It is possible that the decline in formal language abilities experienced by participants contributes to a sense of loss of control, which can lead to their frustration and anger [62], [63]. These may contribute to the subtle differences in the “frustrated” emotion percentage observed.

SECTION IV.

Discussion

The research community currently lacks objective digital biomarkers for assessing emotional response in neurological disorders such as FTD. Hence, our goal was to capture the dynamic nature of emotions in long narratives from people with FTD, so that we could assess the emotional characteristics that have been observed by caregivers and family of people with FTD. Previous SER algorithms have focused on assigning a single categorical emotion label to each utterance. In contrast, we sought to capture variations in expressed emotions over time. Because of this different goal, we found other pre-trained SER algorithms unsuitable for tracking emotion changes (e.g., discrepancies in audio clip durations, overtraining leading to rapid emotion switches, and the omission of frustration; see Appendix). Therefore, we developed a new model following standard transfer learning techniques, and then applied our ensemble-based SER method with a sliding window to long narratives from people with FTD and healthy elderly controls, allowing us to capture granular temporal variations in expressed emotions. Statistical analysis of model-assigned emotion percentages from FTD and healthy participants showed differing results by task. In the monologue task, where participants were asked to describe a happy experience, both FTD and healthy elderly controls demonstrated high levels of “happy” and “neutral” emotions, but people with FTD exhibited less emotional variability relative to healthy elderly controls. In particular, people with FTD uniformly showed flat trajectories (i.e., near-zero slopes of line fits over time) in the dominant “happy” emotion. In contrast, healthy elderly controls were statistically much more likely to exhibit variability in the “happy” emotion percentage assigned by the model. In the picture description task (Cookie Theft), there was no dominant emotion, and neither FTD nor healthy elderly control participants exhibited noticeable emotional variation over time. However, people with FTD exhibited higher average percentages of “frustration,” potentially because this task is more difficult for individuals with FTD. It is important to note that our available dataset for picture description was much smaller than for monologue, which limits our ability to analyze FTD subtypes.

It is important to note that SER scores can only capture perceived emotion, which is relevant for social interaction but may differ from true emotions experienced by participants. For example, because FTD may affect motor control of speech [51], it is possible that people with FTD may have altered ability to express emotions that they may be experiencing. However, the general trend to more negative emotions like frustration and to less emotional variability is consistent with clinical assessments of FTD. People with bvFTD are known to exhibit increased apathy [3], which may be related to the flatness of happy emotion percentages we observed in the monologue data for bvFTD participants.

Our work has several important limitations related to the dataset used. Our dataset does not include the published batteries for assessing participants’ apathy or emotional processing. Analyzing a dataset which does include these ratings would allow us to better validate our approach against these gold-standard assessments. It would also be interesting to see how well the results correlate, for example, with speaking difficulties or with caregiver-reported behavior from people with FTD.

A second limitation is that our dataset is not large (especially for the picture description task), so it is important that these results be verified in additional datasets and in related neurological conditions, and that the elicited speech task dependence we report above be further explored. A related limitation concerns the variability in the manifestations of FTD within and across subtypes; the analysis of the picture description task, which combined all FTD participants into a single category, is especially affected by this. In addition, we lacked longitudinal data, which limited our ability to look for changes in emotion expression as the disease progresses. Identifying or collecting longitudinal datasets in FTD would allow researchers to understand how the emotional content of FTD speech changes over disease progression, similar to [64].

A final technical limitation is that the emotion classifier was trained on the IEMOCAP dataset, which consists of speech samples from US English speakers, while classification was performed in an Australian cohort (we were not able to identify an emotion training dataset for speakers of Australian English). While there is evidence that basic emotions are communicated across cultures [65], and US and Australian speech appear to be similar enough that Americans and Australians can generally recognize each other’s emotions, recent work suggests the possible existence of “emotional accents” [66]. Thus, this train-test accent mismatch could potentially impact emotion scoring results. However, while average emotion scores could be affected by this mismatch in accent, features related to the variability of emotions within a single recording (important in the monologue task) should be more robust to this accent mismatch.

Future work involves addressing the gaps noted above, most importantly by replicating the work in a larger, ideally longitudinal dataset that includes clinical batteries for rating apathy and emotional processing. A second area for future work involves adapting this emotion tracking framework to the continuously developing field of SER. This adaptation may even encompass the integration of diverse modalities, such as transcriptions and video recordings, enabling a more nuanced and comprehensive understanding of the emotions exhibited by FTD participants. Further development of SER-based emotion recognition tools focused on detecting apathy and its severity may be a valuable tool for future research and clinical trials, as noted in [31]. If such tools prove clinically useful, it will be critical to overcome the known limitations of current speech processing methods in low-resource languages [67] and when data quality is non-ideal [68]. If robust methods are demonstrated in multiple datasets, SER-based biomarkers could provide clinicians with a valuable new tool for evaluating the quality of life and social interactions in people with FTD and related disorders.

ACKNOWLEDGMENT

The authors would like to extend their sincere appreciation to Yuan Gong and Jim Glass for their valuable suggestions and review. Their valuable input and expertise were instrumental in the completion of this work. (Yishu Gong and Fjona Parllaku are co-first authors.)

Appendix

Emotion Tracking Using wav2vec2-IEMOCAP

In this appendix, we present the results of a sensitivity study that employs the previously developed wav2vec2-IEMOCAP classifier [35] on our FTD Monologue task dataset, following the methodology outlined in the Methods section.

As previously mentioned, wav2vec2-IEMOCAP demonstrates superior performance in classifying emotions within the IEMOCAP dataset. However, this sensitivity study reveals the limitations of previously developed emotion classifiers when applied to our dataset, particularly: a) the absence of frustration as a class, and b) overfitting, leading to rapid switching between emotions. These shortcomings underscore the need for the development of the OEM classifier. Nevertheless, the wav2vec2-IEMOCAP results support our overall finding that FTD participants exhibit “flatter” emotional trajectories during monologues, even when assessed using a different classifier.

Typically, as machine learning classifiers undergo more training epochs, they become more confident in their predictions, assigning nearly all of the probability to a single emotion class for each audio clip. Here, we present the predicted emotions over time in Fig. 13 for the same two participants shown earlier in the main text in Fig. 7 and Fig. 8.

FIGURE 13. Emotion tracking over time for a bvFTD participant and a healthy participant completing the monologue task using wav2vec2-IEMOCAP. A moving average filter of size 10 seconds was used to create a smooth transition between emotions. Color scheme: red (“angry”), orange (“happy”), green (“neutral”), blue (“sad”).

In the healthy control example (top of Fig. 13), emotions shift rapidly from “neutral” to “happy” and back to “neutral.” Rapid changes can also be seen in the bvFTD example (bottom of Fig. 13): the participant’s emotion changed from “happy” to “sad,” back to “happy,” and then to “neutral” within a few seconds, even though adjacent 5-second time windows overlap substantially.

Similar to the results in the main body of the paper, we observe “happy” and “neutral” as the dominant emotions throughout the monologue. However, rather than showing a happy-neutral mixture, the wav2vec2-IEMOCAP model shows rapid switching between emotions. The interpretation of these switches becomes challenging since adjacent 5-second time windows differ by only 0.1 s. This contrasts with the smoother transitions observed in Figs. 7 and 8. However, by applying a moving average filter with a window size of 10 seconds, we can obtain more gradual transitions between emotions from the wav2vec2-IEMOCAP output. We explored varying the length of the moving average window from 5 to 15 seconds and found that differences between FTD and control participants were robust to the window length. It is important to note that this approach results in the loss of information from the first 10 seconds of the recordings. With this modified analysis, we can perform analyses similar to those done with the OEM proposed in the main body. However, this time-smoothing approach is an ad-hoc fix for the unrealistically rapid switching predicted by the wav2vec2-IEMOCAP model. Thus, we feel the proposed OEM, which is specifically trained to avoid overly confident predictions, is preferable.
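For reference, the moving-average smoothing used here amounts to a simple box-filter convolution, sketched below under the assumption of a 0.1 s hop between scored windows.

```python
# Sketch of the 10 s moving-average smoothing applied to the wav2vec2-IEMOCAP
# outputs (window sizes of 5-15 s gave similar group differences).
import numpy as np

def moving_average(scores, window_s=10.0, hop_s=0.1):
    """Smooth a per-window emotion score series sampled every `hop_s` seconds."""
    n = int(round(window_s / hop_s))                  # 10 s / 0.1 s = 100 samples
    kernel = np.ones(n) / n
    return np.convolve(scores, kernel, mode="valid")  # drops edge samples not fully covered
```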

Fig. 14 illustrates that a significant number of healthy individuals exhibited either a positive or negative trend in their “happy” percentage as they described their happy experiences. In contrast, most people with bvFTD and its subtypes appeared to have a relatively stable “happy” percentage over time. To quantify this observation, we calculated the absolute value of the slope for each participant, which is presented in Fig. 15.

FIGURE 14. Slope for the robust line fit of “happy” percentage over time for all participants performing the monologue task. “Happy” percentage was obtained using wav2vec2-IEMOCAP and a moving average filter of size 10 seconds.

FIGURE 15. Absolute change in “happy” score over time for all participants performing the monologue task, grouped by type. “Happy” percentage was obtained using wav2vec2-IEMOCAP with a moving average filter of 10 seconds.

Comparing with Fig. 11 in the main body of the paper, we observe that healthy elderly controls demonstrate more variability than people with bvFTD, svPPA, and nfvPPA.
