Introduction
Frontotemporal Dementia (FTD) encompasses a clinically heterogeneous spectrum of early-onset, progressive neurodegenerative diseases affecting behaviour, speech, language, cognition and motor function, resulting in significant personal, social, and economic burden [1]. The disease has a significant impact on participants' emotional processing; in particular, apathy, presenting clinically as amotivation and flatness of affect [2], is associated with poorer disease prognosis, including increased caregiver burden and progression of neurodegeneration [3]. Disease-modifying therapies targeting the underlying pathobiology of FTD are in development [4], driving the need for quantitative markers of disease presence and progression.
Digital speech biomarkers have shown significant potential in characterizing neurological and respiratory illness [5], [6], as well as in characterizing FTD subtype-specific deficits in acoustic and lexical aspects of speech [7], [8], [9]. The most common subtype, behavioral variant FTD (bvFTD), is clinically characterized by social-behavioral deficits, apathy, and executive dysfunction [2]; in terms of digital speech biomarkers, people with bvFTD demonstrate relative preservation of phonology, yet impaired prosody, frequent pausing, and reduced lexical diversity [10], [11], [12], [13]. Primary progressive aphasia (PPA) subtypes of FTD include the semantic variant (svPPA), characterized by word comprehension and confrontation naming deficits that lead, for example, to greater pronoun use, and the nonfluent variant (nfvPPA), characterized by apraxia, agrammatism, and effortful, nonfluent speech leading to increased speech errors and pauses [8].
An important part of the burden of FTD, especially for caregivers, is that it causes deficits in recognizing or expressing emotion [14], [15], [16], [17] that differ among variants [18]. bvFTD is linked to impairments in emotion perception [19], and impaired processing of emotions such as embarrassment is linked to behavioral disturbances in early bvFTD [20], [21]. Apathy is a particularly important aspect of emotional response in FTD, especially in bvFTD [3]. Apathy is currently assessed by clinicians through observation or clinician-administered testing [22], with multiple apathy scales in use [23], [24], [25], [26], [27]. Clinician-rated apathy shows less consistency than other measures used in identifying probable bvFTD [28]. It has been proposed that automated tools to objectively measure apathy and other emotional content of speech could be beneficial by allowing more widespread screening and assessment of participants [29], [30], [31].
Here, we sought to address these needs by applying recently developed tools for sentiment and emotion analysis. In many applications, sentiment is judged based on speech transcripts [32], [33]; for example, Friedman and Ballentine [32] analyzed transcripts to quantify effects of psychoactive substances. However, we believe audio has important advantages in our application, as emotion changes may occur on a fast time scale (within individual sentences) not easily captured through transcripts. Thus we focused instead on acoustic-based speech emotion recognition (SER), which detects low-level latent features from acoustic data that can be used to categorize speaker emotions into discrete categories, including anger, boredom, disgust, surprise, fear, joy, happiness, neutral and sadness [34], [35], [36], [37]. Relevant to our work, Linz et al. [29] demonstrated an ability to predict clinical scores of apathy in a population of mild cognitive impairment (MCI) participants based on acoustic cues in speech. Other existing SER models based on audio input [35], [38], [39], [40], [41], [42], [43], [44] are trained to predict the emotion of an overall utterance but are not well suited to capturing temporal fluctuations in emotion (the Appendix shows an example of over-training leading to rapid emotion switches; other issues in these models include discrepancies in audio clip duration and lack of a “frustration” class, a potentially important emotion in participants with cognitive difficulties). Recent work leverages labels generated by ChatGPT to further explore the intensity of emotions (instead of assigning a single class) [45]. However, the ChatGPT-enhanced labels are still not a continuous output that can easily be used to track emotion changes over time.
We hypothesized that the impact of FTD on emotional expression would alter the perceived emotional content of speech in people with FTD, and that these impacts could be objectively quantified using automated SER analysis of speech tasks that elicit emotional content. Hence, in this study, we leveraged transfer learning techniques by training a convolutional neural network (CNN)-based SER model to recognize emotions on a database of healthy controls, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [46], and then applying this trained model to a separate dataset of individuals with FTD and healthy elderly controls. Features designed to capture variability in expressed emotion were then extracted.
The main contribution of this paper lies in presenting a robust framework for tracking variability in expressed emotions; this framework leverages machine learning algorithms to analyze the temporal dynamics and variability of emotions during narratives via the slope and variability in scored emotions. This work builds on earlier work by members of our group [6], [47]. While we used a particular deep learning SER approach, our framework could be adapted to future SER models developed in this rapidly evolving field. A second contribution is that we compare these metrics in different participants (healthy elderly participants and those with FTD), demonstrating differences between the populations which suggest automated scoring of emotions has potential as a tool for clinical assessment.
The paper's structure is as follows: the Methods section covers dataset descriptions, audio data preprocessing steps, and the training and evaluation of the emotion recognition model through transfer learning on the widely used IEMOCAP dataset [46]. The Methods section also describes our approach for characterizing temporal variation in emotional content when our SER model (trained on IEMOCAP) is applied to new data. In the Results section, we first present the performance of the emotion recognition model on IEMOCAP, then present findings from characterizing the monologue and picture description tasks in FTD and healthy elderly participants. In the Discussion section, we discuss implications of the study's findings as well as observed differences in emotional expression between FTD participants and healthy controls. In the Appendix, we explore the limitations of using another pretrained emotion classifier model for our task (wav2vec2-IEMOCAP [35]).
Our results show that individuals with FTD demonstrate similar average levels of emotion to healthy controls during a monologue task but have significantly reduced variability over time in perceived emotion, consistent with a flatter, less emotionally engaged affect. In addition, individuals with FTD exhibit a small but significant increase in frustration scores during a picture description task. It will be important to replicate these results in a dataset that includes clinician-rated scores of emotional processing and affect. Nevertheless, our results suggest that automated emotion scoring may be a useful tool for quantifying the impact of FTD and other disorders on participants' ability to express emotion in daily life.
Methods
A. Data Source
a: Emotions Training Data
For SER model training, we used the IEMOCAP dataset to train machine learning classifiers for emotion categorization in audio recordings of speech [46]. The IEMOCAP dataset comprises audio recordings from professional actors who express a range of emotions through speech. It is organized into five sessions, each featuring a different pair of male and female actors engaged in both scripted and improvised dialogue.
To prepare the data, audio recordings from IEMOCAP were truncated into utterances of average length 4.5 seconds, ensuring that each utterance contained speech from a single actor. Trained annotators then classified these recordings into ten emotion categories: “angry,” “happy,” “disgusted,” “fear,” “frustrated,” “excited,” “neutral,” “sad,” “surprised,” and “others.” The “others” category represents emotions outside the predefined set for the dataset. Additionally, the category “xxx” was used to indicate samples where annotators did not reach a consensus.
The original IEMOCAP dataset consists of 10,039 utterances, with the emotional category distribution shown in Fig. 1. However, we filtered the dataset by excluding the “others” and “xxx” categories as these have limited interpretability. Furthermore, we excluded the categories “fear,” “disgusted,” and “surprised” due to their small sample sizes. Last, following the approach of previous studies using IEMOCAP [39], [40], we combined the “excited” and “happy” categories into a single “happy” category. The filtered IEMOCAP dataset used in this study consists of 7,380 utterances and the distribution of emotion categories is illustrated in Fig. 2.
Distribution of all labeled emotion categories from the original IEMOCAP dataset. The category “xxx” represents samples where annotators did not reach a consensus; other emotion labels are as shown.
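To make the label filtering described above concrete, the following is a minimal sketch assuming the IEMOCAP utterance labels have been gathered into a pandas DataFrame; the column names and file path are illustrative rather than taken from our codebase.

```python
# Minimal sketch of the IEMOCAP label filtering described above.
# Assumes a DataFrame with columns "path" (audio file) and "label" (emotion).
import pandas as pd

def filter_iemocap_labels(df: pd.DataFrame) -> pd.DataFrame:
    # Drop categories with limited interpretability or too few samples.
    drop = {"others", "xxx", "fear", "disgusted", "surprised"}
    kept = df[~df["label"].isin(drop)].copy()
    # Merge "excited" into "happy", following prior IEMOCAP studies.
    kept["label"] = kept["label"].replace({"excited": "happy"})
    return kept

# Example usage (hypothetical file name):
# labels = filter_iemocap_labels(pd.read_csv("iemocap_utterance_labels.csv"))
```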
b: FTD and Healthy Participant Data
We retrospectively analyzed an existing dataset containing audio recordings collected during elicited speech tasks from people with FTD and healthy elderly controls (denoted below as “UMel dataset”). Participants provided informed consent and were seated in a sound-attenuated room (ambient noise in audio files <
In this study, we focused our analysis on participants with a diagnosis of bvFTD, nfvPPA, and svPPA. We chose to analyze only monologue and picture description tasks which have similar audio recording lengths as the IEMOCAP dataset utterances.
In the monologue task, participants are asked to describe a happy event of their own choice. In the picture description task, participants are shown a picture (Fig. 3) and asked to describe the picture in their own words [53]. All participants completed the monologue task; however, only a subset of people with FTD completed the picture description task.
The number of recordings per diagnosis for each task are shown in Table 1. In total, we analyzed audio recordings from 93 monologue and 44 picture description tasks from people with FTD, and audio recordings from 55 monologue and 58 picture description tasks from healthy elderly controls.
B. Data Preprocessing
To maintain consistency between the IEMOCAP and UMel datasets, we first manually annotated and removed interviewer speech from audio recordings from the UMel dataset. This resulted in utterances containing only participant speech.
Next, we transformed all audio files from the IEMOCAP and UMel datasets to Mel Frequency Cepstral Coefficient (MFCC) representations using the librosa library [54]. Files were converted to a sampling rate of 22,050 Hz and MFCCs were computed using default librosa parameters (2048-point Hanning-windowed FFTs, 75% overlap, 20 MFCC coefficients, maximum frequency set to the Nyquist rate). MFCC values were plotted to form images, which were first resized to
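This preprocessing could be reproduced roughly as in the sketch below, using librosa with the stated parameters (22,050 Hz sampling, 2048-point Hann window, 75% overlap, 20 coefficients); the final image resizing step is omitted and the function name is illustrative.

```python
# Sketch of the MFCC-image preprocessing; parameters follow the text above.
import librosa
import librosa.display
import matplotlib.pyplot as plt

def audio_to_mfcc_image(wav_path: str, png_path: str) -> None:
    y, sr = librosa.load(wav_path, sr=22050)           # resample to 22,050 Hz
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=20,
        n_fft=2048, hop_length=512,                    # 75% overlap of a 2048-point window
        window="hann", fmax=sr / 2,                    # maximum frequency at the Nyquist rate
    )
    fig, ax = plt.subplots()
    librosa.display.specshow(mfcc, sr=sr, hop_length=512, x_axis="time", ax=ax)
    ax.set_axis_off()                                  # keep only the MFCC image itself
    fig.savefig(png_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)
```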
C. Emotion Recognition Model
Our analysis pipeline, as depicted in Fig. 4, involved two steps: 1) Training the emotion classifier, and 2) Conducting emotion tracking.
Flowchart illustrating the process of tracking emotions in a long speech sample from a monologue task, with the same process applied to speech samples from the picture description (PIC) task. “MFCC” refers to “Mel Frequency Cepstral Coefficients” and “OEM” refers to the “Optimal Ensemble Model” described in Section III.
First, we employed standard transfer learning techniques to train five CNN source models: AlexNet [55], AlexNet-GAP [56], VGG11 [57], ResNet18, and ResNet50 [58]. Originally developed for image classification, these models were adapted to classify emotions using visual MFCC spectrograms from the IEMOCAP dataset. During the training phase, we designated session 5 of the IEMOCAP dataset as the test set to evaluate model performance. Sessions 1, 2, and 3 were used as the training set, and we performed hyper-parameter tuning and model selection using session 4 as a validation set to optimize our models and select the best ensemble. To address class imbalance in the emotion categories, we employed the Synthetic Minority Oversampling Technique (SMOTE) [59] during model training. To prevent overtraining, we limited training to 20 epochs, a significantly smaller value compared to similar works [34], [35], [36], [37], and used a batch size of 32. The learning process was optimized using stochastic gradient descent (SGD) with a learning rate of 0.001 and a momentum factor of 0.9, with updates applied exclusively to the parameters of the newly added fully connected layer. A learning rate scheduler reduced the learning rate by a factor of 0.1 every 7 epochs, and we utilized early stopping to determine the optimal epoch number. The ensemble prediction was formed by averaging the class probabilities generated by its constituent base models.
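The sketch below illustrates this transfer-learning setup for a single source model (AlexNet), assuming torchvision's pretrained ImageNet weights; SMOTE oversampling, early stopping, and data loading are omitted for brevity, and the remaining source models would be handled analogously.

```python
# Illustrative fine-tuning of one source model on MFCC images (5 emotion classes).
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5  # angry, happy, sad, neutral, frustrated

model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
for p in model.parameters():                        # freeze the pretrained backbone
    p.requires_grad = False
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)  # newly added fully connected layer

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.classifier[6].parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

def train(model, loader, max_epochs=20):            # capped at 20 epochs to limit overtraining
    for _ in range(max_epochs):
        model.train()
        for images, labels in loader:               # loader yields batches of size 32
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()                            # decay the learning rate every 7 epochs
```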
Next, we applied the trained emotion classifier to each recording in the UMel dataset to track the progression of emotions over time. For this purpose, we extracted emotion scores for five-second long windows and estimated the percentage of each emotion category (happy, neutral, frustrated, angry, and sad). We used sliding windows shifted by 0.1 seconds to smoothly capture emotions vs. time. This approach generated emotion densities that show the change in emotion percentages over the duration of the elicited speech tasks.
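A minimal sketch of this sliding-window tracking is given below; `predict_proba` is a hypothetical wrapper around the MFCC conversion and trained ensemble described above.

```python
# Score 5-second windows, hopped every 0.1 s, to obtain emotion percentages over time.
import numpy as np
import librosa

EMOTIONS = ["angry", "happy", "sad", "neutral", "frustrated"]

def track_emotions(wav_path, predict_proba, win_s=5.0, hop_s=0.1, sr=22050):
    y, _ = librosa.load(wav_path, sr=sr)
    win, hop = int(win_s * sr), int(hop_s * sr)
    times, scores = [], []
    for start in range(0, max(len(y) - win, 0) + 1, hop):
        segment = y[start:start + win]
        probs = predict_proba(segment, sr)   # ensemble-averaged class probabilities
        times.append(start / sr)
        scores.append(probs)
    # scores[t, k] is the percentage assigned to EMOTIONS[k] in the window starting at times[t]
    return np.array(times), np.array(scores)
```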
D. Statistical Analysis
We created custom Python code to statistically analyze the emotion percentages for each elicited speech task. We computed the mean, standard deviation, and interquartile range (IQR) of the emotion scores for each utterance. We employed analysis of variance (ANOVA) to detect group differences and used the Games-Howell test for pairwise comparisons to determine which specific comparisons exhibited significant differences. In order to capture changes in emotion scores over time, we performed robust line fits using the statsmodels package with the Huber loss [60], defined as\begin{align*} L_{\delta}(a) = \begin{cases} \frac{1}{2}a^{2} & \text{for } |a| \leq \delta \\ \delta \left( |a| - \frac{1}{2}\delta \right) & \text{otherwise.} \end{cases}\end{align*}
These line fits allowed us to estimate slopes and determine whether there were statistically significant differences between groups. Furthermore, we conducted binomial tests to evaluate whether the slopes of healthy elderly controls were outside the 95% range of slopes observed in people with FTD. Given the limited sample size in the picture description task, we combined all subtypes of FTD into a single FTD participant group for the statistical analyses.
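The robust line fits and the binomial test could be implemented along the lines of the sketch below, using statsmodels and scipy; the array names are hypothetical and details of our analysis code may differ.

```python
# Huber-loss robust slope for one recording, and a binomial test asking whether
# control slopes fall outside the range containing 95% of FTD slopes.
import numpy as np
import statsmodels.api as sm
from scipy.stats import binomtest

def robust_slope(times: np.ndarray, happy_scores: np.ndarray) -> float:
    X = sm.add_constant(times)
    fit = sm.RLM(happy_scores, X, M=sm.robust.norms.HuberT()).fit()
    return fit.params[1]                                 # slope of the robust line fit

def controls_outside_ftd_range(control_slopes, ftd_slopes):
    lo, hi = np.percentile(ftd_slopes, [2.5, 97.5])      # 95% range of FTD slopes
    k = int(np.sum((control_slopes < lo) | (control_slopes > hi)))
    # Under the null hypothesis, each control slope lands outside this range with p = 0.05.
    return binomtest(k, n=len(control_slopes), p=0.05, alternative="greater")
```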
Results
A. Emotion Recognition Model Performance
Model validation identified the best-performing model, referred to as the Optimal Ensemble Model (OEM), which combines AlexNet, ResNet18, and VGG11 architectures. This model classified five emotion categories (“angry”, “happy”, “sad”, “neutral”, and “frustrated”) with 50% accuracy (compared to the 20% expected for random guessing) and achieved a top-two class accuracy of 72%.
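As an illustration, the ensemble prediction and the top-two accuracy can be computed roughly as below, assuming the softmax outputs of the selected base models are already available; names are illustrative.

```python
# Average base-model probabilities and compute top-k accuracy on the test set.
import numpy as np

def ensemble_probs(probs_list):
    # probs_list: one (n_samples, n_classes) array per base model
    return np.mean(np.stack(probs_list, axis=0), axis=0)

def top_k_accuracy(probs, labels, k=2):
    top_k = np.argsort(probs, axis=1)[:, -k:]            # k most probable classes per sample
    return float(np.mean([labels[i] in top_k[i] for i in range(len(labels))]))
```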
We compared our model to the wav2vec2-IEMOCAP model, a speech emotion recognition algorithm built upon the wav2vec2 base and fine-tuned on the IEMOCAP dataset. The wav2vec2-IEMOCAP model achieved 75% accuracy on the slightly simpler task of classifying four emotion categories (“angry”, “happy”, “sad”, and “neutral”) [35]. However, limitations of the wav2vec2-IEMOCAP model in capturing changes in emotions over time are discussed in Appendix.
The confusion matrix for OEM shows that the model achieved the highest accuracies for the “happy” and “sad” emotions (Fig. 5). Receiver operating characteristic (ROC) analysis revealed a micro-averaged area under the curve (AUC) of 0.76.
Confusion matrix of OEM, illustrating the model's performance in accurately predicting emotion categories. The matrix displays the predicted emotion labels along the x-axis and the true emotion labels along the y-axis.
The primary purpose of OEM was to accurately capture the distribution and density of emotion percentages and to track the smooth transition of emotions over time, rather than to assign audio recordings to a single emotion category with high confidence. This use is highlighted in the subsequent application of OEM to the analysis of emotion over time in speech recordings.
B. Emotion Recognition in FTD Speech
a: Monologue Task
Across all diagnostic groups, OEM consistently identified “happy” as the emotion with the highest percentage, followed by “neutral” as the emotion with the second-highest percentage (Table 2). In contrast, the remaining emotions (“frustrated,” “angry,” and “sad”) had significantly lower percentages.
Fig. 7 displays the output for a single healthy elderly control participant, while Fig. 8 illustrates an example of the model output for a single bvFTD participant. In both cases, the model predictions exhibit smooth transitions over the time series, with the dominant emotion being “happy.” This observation aligns with the nature of the monologue task, in which participants are instructed to describe a happy experience. In addition, the healthy elderly control participant had more noticeable variability in the happy emotion score over the duration of the audio recording, a point examined below in detail.
Receiver operating characteristic (ROC) curves of OEM, showing performance per emotion as well as micro-averaged performance.
Emotion percentages over time for a healthy control participant completing the monologue task, obtained using OEM. The black line indicates the robust line fit of the “happy” emotion over time. Color scheme: red (“angry”), orange (“happy”), purple (“frustrated”), green (“neutral”), blue (“sad”).
Emotion percentages over time for a bvFTD participant completing the monologue task, obtained using OEM. The black line indicates the robust line fit of the “happy” emotion over time. Color scheme: red (“angry”), orange (“happy”), purple (“frustrated”), green (“neutral”), blue (“sad”).
No statistically significant differences were observed in the mean and standard deviation of the “happy” emotion percentage over time among the different diagnostic groups. However, the healthy elderly controls exhibited a statistically significantly wider IQR for the “happy” emotion compared to bvFTD, svPPA, and nfvPPA, as depicted in Fig. 9.
To further investigate the relationship between the variability in the “happy” emotion percentage over time and the observed difference in the IQR between FTD participants and healthy elderly controls, we detrended the time series data. After detrending, we observed that only one participant group (svPPA) retained a statistically significant difference in IQR from the healthy elderly controls. This indicates that the between-group differences in IQR are mainly explained by the overall trends in the “happy” emotion percentage over time, rather than fluctuations around the trend.
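As a rough illustration, the detrending step could be performed with a simple linear detrend before recomputing the IQR, as sketched below; the exact detrending procedure used in our analysis may differ.

```python
# Remove the linear trend from a "happy" percentage time series, then recompute its IQR.
import numpy as np
from scipy.signal import detrend

def iqr_after_detrend(happy_scores: np.ndarray) -> float:
    residuals = detrend(happy_scores, type="linear")   # fluctuations around the trend
    q1, q3 = np.percentile(residuals, [25, 75])
    return q3 - q1
```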
We next explored changes in the “happy” emotion percentage over time as a novel measure of emotional variability. To quantify this, we calculated the absolute value of the slope of the “happy” emotion percentage over time (Fig. 11). Notably, the slopes indicated change over time in healthy elderly controls, whereas in people with FTD the slopes were close to zero. Robust line fit analysis indicated that the “happy” emotion percentage slopes from healthy elderly controls frequently fell outside the range containing 95% of the slopes from FTD participants (Fig. 10). This difference was statistically significant by binomial testing (p = 0.0008).
Slope (with 95% confidence interval) for the robust line fit of “happy” emotion percentage over time for the monologue task, obtained using OEM. The dotted-dashed line denotes the range that contains 95% of the slopes of FTD participants.
Absolute value of the slope of “happy” emotion over time on the monologue task obtained using OEM.
As a sensitivity analysis, we conducted a re-analysis of the monologue task using the emotion classifier described previously [35]. This classifier is known for its high accuracy in assigning categorical emotion labels, although it is not specifically designed to capture mixtures of emotions as OEM does through emotion percentages. The results presented in the Appendix validated our observation of reduced variability in the “happy” emotion for FTD participants, thereby supporting the robustness of our findings.
b: Picture Description Task
In contrast to the monologue task, the picture description task did not exhibit a dominant emotion in any group, as indicated in Table 3. However, when comparing the combined FTD group (bvFTD, svPPA, and nfvPPA) to healthy elderly controls, we observed a statistically significant higher average, standard deviation, and IQR for the “frustrated” emotion percentage, as illustrated in Fig. 12.
Average and IQR (per-recording) of “frustrated” emotion over time for the picture description task, obtained using OEM. “**” indicates Bonferroni-corrected p-values <0.005, “*” indicates Bonferroni-corrected p-values <0.05, and “N.S.” indicates not statistically significant.
A robust line fit analysis of the “frustrated” emotion percentage showed that 95% of all participants, including both FTD and healthy elderly controls, had negligible changes over time, with absolute slopes of the “frustrated” emotion percentage below 0.001. This suggests that the observed differences in the “frustrated” emotion between groups are not driven by temporal variation but rather reflect inherent distinctions between the groups. It is possible that the decline in formal language abilities experienced by participants contributes to a sense of loss of control, which can lead to their frustration and anger [62], [63]. These may contribute to the subtle differences in the “frustrated” emotion percentage observed.
Discussion
The research community currently lacks objective digital biomarkers for assessing emotional response in neurological disorders such as FTD. Hence, our goal was to capture the dynamic nature of emotions in long narratives from people with FTD so that we could assess the emotional characteristics that have been observed by caregivers and family of people with FTD. Previous SER algorithms have focused on assigning a single categorical emotion label to each utterance. In contrast, we sought to capture variations in expressed emotions over time. Because of this different goal, we found existing pre-trained SER algorithms unsuitable for tracking emotion changes (e.g., discrepancies in audio clip durations, overtraining leading to rapid emotion switches, and the lack of a frustration class; see Appendix). Therefore, we developed a new model following standard transfer learning techniques, and then applied our ensemble-based method for SER with a sliding window to long narratives from people with FTD and healthy elderly controls to capture granular temporal variations in expressed emotions.

Statistical analysis of model-assigned emotion percentages from bvFTD and healthy participants showed differing results by task. In the monologue task, where participants were asked to describe a happy experience, both FTD participants and healthy elderly controls demonstrated high levels of “happy” and “neutral” emotions, but people with FTD exhibited less emotional variability relative to healthy elderly controls. In particular, people with FTD uniformly showed flat trajectories (i.e., near-zero slopes of line fits over time) in the dominant “happy” emotion. In contrast, healthy elderly controls were statistically much more likely to exhibit variability in the “happy” emotion percentage assigned by the model. In the picture description task (Cookie Theft), there was no dominant emotion, and neither FTD nor healthy elderly control participants exhibited noticeable emotional variation over time. However, people with FTD exhibited higher average percentages of “frustration”, potentially because this task is more difficult for individuals with FTD. It is important to note that our available dataset for picture description was much smaller than for monologue, which impacts our ability to analyze FTD subtypes.
It is important to note that SER scores can only capture perceived emotion, which is relevant for social interaction but may differ from true emotions experienced by participants. For example, because FTD may affect motor control of speech [51], it is possible that people with FTD may have altered ability to express emotions that they may be experiencing. However, the general trend to more negative emotions like frustration and to less emotional variability is consistent with clinical assessments of FTD. People with bvFTD are known to exhibit increased apathy [3], which may be related to the flatness of happy emotion percentages we observed in the monologue data for bvFTD participants.
Our work has several important limitations related to the dataset used. Our dataset does not include the published batteries for assessing participants' apathy or emotional processing. Analyzing a dataset which does include these ratings would allow us to better validate our approach relative to gold-standard clinical measures. It would also be interesting to see how well the results correlate, for example, with speaking difficulties or with caregiver-reported behavior from people with FTD.
A second limitation is that our dataset is not large (especially for the picture description task), so it is important that these results be verified in additional datasets and in related neurological conditions, and that the elicited-speech-task dependence we report above be further explored. A related limitation concerns the variability in the manifestations of FTD within and across subtypes; the analysis of the picture description task, which combined all FTD participants into a single category, is especially affected by this. In addition, we lacked longitudinal data, which limited our ability to look for changes in emotion expression as the disease progresses. Identifying or collecting longitudinal datasets in FTD would allow researchers to understand how the emotional content of FTD speech changes over disease progression, similar to [64].
A final technical limitation is that the training of the emotion classifier was performed on the IEMOCAP dataset, which consists of speech samples from US English speakers, while classification was performed in an Australian cohort (we were not able to identify an emotions training dataset for speakers of Australian English). While there is evidence that basic emotions are communicated across cultures [65], and US and Australian speech appear to be similar enough that Americans and Australians can generally recognize each other's emotions, recent work suggests the possible existence of “emotional accents” [66]. Thus, this train-test accent mismatch could potentially impact emotion scoring results. However, while average emotion scores could be affected by this mismatch in accent, features related to the variability of emotions within a single recording (important in the monologue task) should be more robust to this accent mismatch.

Future work will involve addressing the gaps noted above, most importantly by replicating the work in a larger, ideally longitudinal dataset which includes clinical batteries for rating apathy and emotional processing. A second area for future work involves adapting this emotion-tracking framework to the continuously developing field of SER. This adaptation may even encompass the integration of diverse modalities, such as transcriptions and video recordings, enabling a more nuanced and comprehensive understanding of the emotions exhibited by FTD participants. Further development of SER-based emotion recognition tools that focus on detecting apathy and levels of apathy may be valuable for future research and clinical trials, as noted in [31]. If such tools prove clinically useful, it will be critical to overcome the known limitations of current speech processing methods in low-resource languages [67] and when data quality is non-ideal [68]. If robust methods are demonstrated in multiple datasets, SER-based biomarkers could provide clinicians with a valuable new tool for evaluating the quality of life and social interactions in people with FTD and related disorders.
ACKNOWLEDGMENT
The authors would like to extend their sincere appreciation to Yuan Gong and Jim Glass for their valuable suggestions and review. Their valuable input and expertise were instrumental in the completion of this work. (Yishu Gong and Fjona Parllaku are co-first authors.)
Appendix
Emotion Tracking Using wav2vec2-IEMOCAP
In this appendix, we present the results of a sensitivity study that employs the previously developed wav2vec2-IEMOCAP classifier [35] on our FTD Monologue task dataset, following the methodology outlined in the Methods section.
As previously mentioned, wav2vec2-IEMOCAP demonstrates superior performance in classifying emotions within the IEMOCAP dataset. However, this sensitivity study reveals the limitations of previously developed emotion classifiers when applied to our dataset, particularly: a) the absence of frustration as a class, and b) overfitting, leading to rapid switching between emotions. These shortcomings underscore the need for the development of the OEM classifier. Nevertheless, the wav2vec2-IEMOCAP results support our overall finding that FTD participants exhibit “flatter” emotional trajectories during monologues, even when assessed using a different classifier.
Typically, as machine learning classifiers undergo more training epochs, they become more confident in their predictions, assigning a probability close to one to a single emotion class for each audio clip. Here we present the predicted emotions over time in Fig. 13 for the same two participants shown earlier in the main text in Figs. 7 and 8.
Emotion tracking over time for a bvFTD participant and a healthy participant completing the monologue task using wav2vec2-IEMOCAP. A moving average filter of size 10 seconds was used to create a smooth transition between emotions. Color scheme: red (“angry”), orange (“happy”), green (“neutral”), blue (“sad”).
In the healthy control example (top of Fig. 13), the predicted emotion shifts rapidly from “neutral” to “happy” and back to “neutral.” Rapid changes can also be seen in the bvFTD example (bottom of Fig. 13), where the participant's predicted emotion changed from “happy” to “sad,” back to “happy,” and then to “neutral” within a few seconds, even though adjacent 5-second time windows have significant overlap.
Similar to the results in the main body of the paper, we observe “happy” and “neutral” as the dominant emotions throughout the monologue. However, rather than showing a happy-neutral mixture, the wav2vec2-IEMOCAP model shows rapid switching between emotions. Interpreting these switches is challenging, since adjacent 5-second time windows differ by only 0.1 s. This contrasts with the smoother transitions observed in Figs. 7 and 8. However, by applying a moving average filter with a window size of 10 seconds, we can obtain more gradual transitions between emotions from the wav2vec2-IEMOCAP output. We explored varying the length of the moving average window from 5 to 15 seconds and found that differences between FTD and control participants were robust to the window length. It is important to note that this approach results in the loss of information from the first 10 seconds of the recordings. With this modified analysis, we can perform similar analyses as with OEM proposed in the main body. However, this time-smoothing approach is an ad hoc fix for the unrealistically rapid switching predicted by the wav2vec2-IEMOCAP model, and thus we feel the proposed OEM, which is specifically trained to avoid overly confident predictions, is preferable.
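The 10-second moving-average smoothing can be sketched as follows, assuming one score every 0.1 s as in the sliding-window analysis; parameter names are illustrative.

```python
# Smooth a per-window emotion score with a 10-second moving average.
import numpy as np

def moving_average(scores: np.ndarray, hop_s: float = 0.1, filt_s: float = 10.0) -> np.ndarray:
    n = int(filt_s / hop_s)                    # 100 windows span 10 seconds
    kernel = np.ones(n) / n
    # "valid" mode discards spans without a full filter, which is why roughly the
    # first 10 seconds of each recording are lost after smoothing.
    return np.convolve(scores, kernel, mode="valid")
```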
Fig. 14 illustrates that a significant number of healthy individuals exhibited either a positive or negative trend in their “happy” percentage as they described their happy experiences. In contrast, most people with bvFTD and its subtypes appeared to have a relatively stable “happy” percentage over time. To quantify this observation, we calculated the absolute value of the slope for each participant, which is presented in Fig. 15.
Slope for the robust line fit of “happy” percentage over time for all participants performing monologue task. Happy percentage was obtained using wav2vec2-IEMOCAP and a moving average filter of size 10 seconds.
Absolute change in “happy” score over time for all participants performing the monologue task, grouped by type. “Happy” percentage was obtained using wav2vec2-IEMOCAP with a moving average filter of 10 seconds.
Consistent with Fig. 11 in the main body of the paper, we observe that healthy elderly controls demonstrate more variability than people with bvFTD, svPPA, and nfvPPA.