
Textual Pre-Trained Models for Age Screening Across Community Question-Answering


Abstract:


Almost every community Question-Answering (cQA) platform has the pressing need of enhancing user experience by presenting dedicated displays, connecting potential answerers with open questions and revitalizing the material in their archives. In doing so, it is crucial to understand the profile of their community members, especially as it relates to their demographics. In this realm, variables such as age and gender have proven particularly promising for managing content. For instance, they make it easier to connect questions posted by one generation with individuals from the previous generation, who are more likely to answer them. This paper advances the current body of knowledge in this area by exploring the performance of nineteen frontier transformer-based models (e.g., BERT and ELECTRA) on age recognition across a large-scale collection of cQA members. In effect, the best encoder (LongFormer) finished with an accuracy of 78.61% (F1-Score of 0.7424) by taking full questions and answers into account. Unlike gender recognition, our outcomes do not show a noticeable difference between cased and uncased models. On the other hand, they confirm that the transition from one age group to the other is smooth, and thus boundary individuals pose a tough challenge to discriminant models built on top of frontier machine learning approaches.
Published in: IEEE Access ( Volume: 12)
Page(s): 30030 - 30038
Date of Publication: 22 February 2024
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Automatically identifying age across community members is pivotal for almost all types of online social networks. As a matter of fact, these services not only need this piece of information to enhance user experience, but they also urgently need it to enforce their terms of service and the corresponding local laws. In effect, age screening is vital for detecting many malicious activities, e.g., identity theft and fake profile creation, as well as for protecting children from harmful situations.

In the case of cQAs, previous studies have demonstrated that determining age is a decisive factor in matching open questions with potential answerers, especially in platforms where there is a transfer of knowledge from one generation to the other [1]. More concretely, it has been discovered that questions prompted by one particular age group are more likely to be answered by another, specific cohort, most distinctively when dealing with topics such as Dining Out, Home & Garden, and Family & Relationships.

However, age screening is not as simple as it might sound, since abusers, criminals, and young people get around it by misrepresenting their age. Nevertheless, recent advances in Artificial Intelligence, namely Natural Language Processing and network analysis, have revealed that it is possible to infer textual, visual and activity patterns that are representative of distinct age clusters [1], [2], [3]. Needless to say, image and graph-based pre-trained Deep Neural Networks have offered great help here; on the other hand, text-based frontier architectures have not been extensively explored yet despite their recent breakthroughs in this field [4], [5], [6], [7], [8], [9].

Hence, the novelty of this paper lies in comparing the performance of assorted state-of-the-art pre-trained models (PTMs) working on textual inputs. With this in mind, our contributions to this body of knowledge are summarized as follows:

  1. We adapted nineteen different frontier transformers, capable of achieving high age classification rates on the basis of textual inputs only.

  2. All our experiments are conducted on a massive study corpus, namely almost 550k community member profiles encompassing their respective full questions, answers and self-descriptions.

  3. We provide empirical evidence that the transition from one age group to the next/previous is smooth, and thus it presents a difficult challenge, even to state-of-the-art machine learning classification techniques.

In a nutshell, the best transformer was LongFormer, which accomplished an accuracy of 78.61%. The roadmap of this work is as follows. First, previous studies are discussed in Section II, and Section III then outlines our research questions. Sections IV and V discuss our methodology and experimental results, respectively. Eventually, Section VI touches on the key findings and sketches some future research directions.

SECTION II.

Related Work

In fact, age demographics across cQA websites remain a largely unexplored research area. By examining the impact of sentiment analysis on these services, [10] superficially explored age patterns, in particular as they relate to attitude and sentimentality. In this regard, the authors presented two key findings: a) people are likely to respond to people of the same age in a more positive manner; and b) sentimentality decreases with increasing age.

In a similar spirit, [11] focused on age trends across programming gurus in StackOverflow. They learned that programmer reputation scores grow in tandem with age well into members' 50s, while members in their 30s tend to focus on fewer areas with respect to younger or older members. Additionally, they did not notice a strong correlation between age and scores in any specific knowledge area.

Lately, [3] framed age recognition as a classification task by trying several ways of clustering community peers in consonance with their birth year. Curiously enough, their outcomes unveiled that it is better to reduce the archetypal five generations proposed by Strauss and Howe [12] to three by grouping the oldest three cohorts into one.

Consistent with this previous study, [1] tested high-dimensional vector spaces constructed from textual and metadata properties on conventional statistical strategies and frontier deep neural networks. In summary, FastText and MaxEnt proved to be effective and, when it comes to features, sentiment analysis stood out. But more importantly, they drew the conclusion that effective models for identifying age cohorts bear striking similarities to models previously designed for question intent [13], namely in relation to the pertinence of sentiment analysis. Later, [14] provided support for the validity of this hypothesis by comparing the classification rate of assorted single-task and multi-task frameworks. More recently, the study of [2] showed that age-based centroid vectors tend to form a trail ordered by age in graph-based embeddings constructed on top of community activity.

SECTION III.

Research Questions

With prior works as the foundation, we advance this area of research by answering the following two main research questions:

  • RQ1: How well do vanilla frontier neural network classifiers perform on age identification?

  • RQ2: What can we learn from the errors that would help in the design of more efficient models in the future?

SECTION IV.

Our Methodology

In this section, we outline the assorted pre-trained encoders employed in our empirical settings for recognizing age (see figure 1). BERT, whose name is an acronym for Bidirectional Encoder Representations from Transformers, is one of the first and most widely used PTMs (cf. [15], [16]). This architecture consists of a multi-layer bidirectional transformer trained on clean plain text for masked word and next sentence prediction [16], [17]. Its underlying idea is the old principle that words are, to a great extent, defined by other terms within the same context [18]. BERT is composed of twelve transformer blocks and twelve self-attention heads with a hidden state of 768. Our experimental configurations accounted for most state-of-the-art BERT-inspired architectures. Fundamentally, we extended the encoders utilized by [19] (e.g., ALBERT [20], DEBERTA [21] and XLNet [22]) as described below (an illustrative checkpoint mapping follows the list):

  • BigBird deals with the inherent quadratic dependency of BERT by implementing a sparse attention mechanism. This new attention approach is linear in the number of tokens [23]. This model capitalizes on the power of extra global tokens for preserving its expressiveness. As a result, this brings about competitive performance on downstream tasks like question answering and long document classification.

  • ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) adapts a discriminator (transformer) that determines whether every token is an original or a replacement, instead of only masking a fraction of tokens within the input [24]. A generator, another neural network, masks and substitutes tokens to generate corrupted samples. In practical terms, this model trains much faster than BERT, requiring significantly less computation, while at the same time, accomplishing a competitive accuracy on several downstream tasks. In this work, we accounted for its base and large versions.

  • MPNet capitalizes on underlying dependencies among predicted tokens via permuted language modeling and takes additional position information as input. This means it can see a full sentence and lessen the position discrepancy accordingly [25]. In other words, it leverages the advantages of BERT and XLNet, while at the same time avoiding their limitations.

  • SqueezeBERT replaces several operations within self-attention layers by grouped convolutions. As a consequence, it is four times faster than BERT while accomplishing a competitive performance. This architecture also runs at lower latency on smartphones than many efficient encoders like MobileBERT [26].
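
The exact checkpoints behind each of these encoders are not listed in the text; purely as an illustration, the mapping below pairs the architectures just described with plausible Hugging Face identifiers in the (model_type, model_name) format expected by the Simple Transformers library used later on. None of these names is confirmed by the paper.

```python
# Illustrative (model_type, checkpoint) pairs for the encoders described above.
# The actual checkpoints used in the paper are not specified; these are plausible
# Hugging Face identifiers, not a confirmed list.
CANDIDATE_ENCODERS = {
    "bigbird":     "google/bigbird-roberta-base",
    "electra":     "google/electra-base-discriminator",   # a large variant also exists
    "mpnet":       "microsoft/mpnet-base",
    "squeezebert": "squeezebert/squeezebert-uncased",
    "longformer":  "allenai/longformer-base-4096",        # best single encoder in our results
}
```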

FIGURE 1. The flowchart of PTMs for age screening across cQA members. First, a pre-trained model is selected and downloaded from Hugging Face. This PTM is then adjusted to our task via the training and evaluation sets. The fine-tuned model is eventually utilized for casting predictions across the test collection.

On the whole, we took advantage of a total of 19 pre-trained models, including DistilBERT [27] and RoBERTa [28] (see the complete list in table 3). For fine-tuning, we profited from the Hugging Face implementations through the Simple Transformers library. By and large, we opted for default parameter settings to level the grounds and also to reduce the experimental workload. Two epochs were used for model adjustment in all cases, so that fine-tuning time was limited to ten days for the largest models. It is worth noting here that going beyond one epoch did not show any significant improvement, but we intentionally gave all encoders enough time to converge. The maximum sequence length was set to 512, and whenever possible, we used sliding windows operating with a 0.95 stride. The reader can refer to table 4 for details about which experimental configurations use sliding windows. As for the batch size, this was set so that GPU memory usage reached its limit, namely eight, dictated by the largest models (XLNet, XLM-RoBERTa, DEBERTA, etc.), thereby always allowing convergence. In our experiments, we used twenty NVIDIA Tesla GPU cards: 1 x P4 (8 GB), 16 x A16 (16 GB) and 3 x P40 (24 GB). Lastly, it is worth mentioning that we applied the half-precision (fp16) format to all models but FNet.
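
As a minimal sketch of this fine-tuning pipeline (figure 1), the snippet below instantiates a Simple Transformers ClassificationModel with the hyper-parameters stated above (two epochs, maximum sequence length of 512, sliding window with a 0.95 stride, batch size eight, fp16). The data-frame column names and the Longformer checkpoint are illustrative assumptions, not details taken from the paper.

```python
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Hyper-parameters as described above; everything else is left at its default value.
args = ClassificationArgs(
    num_train_epochs=2,
    max_seq_length=512,
    sliding_window=True,   # disabled for the encoders marked with a dagger in tables 3-4
    stride=0.95,
    train_batch_size=8,
    eval_batch_size=8,
    fp16=True,             # half precision was applied to all models but FNet
    overwrite_output_dir=True,
)

# train_df / eval_df / test_df: DataFrames with "text" and "labels" columns
# (e.g. 0 = GEN Z, 1 = GEN Y, 2 = Olders), built from one of the T/TB/TBA/TBAD settings.
model = ClassificationModel("longformer", "allenai/longformer-base-4096",
                            num_labels=3, args=args)
model.train_model(train_df, eval_df=eval_df)
result, model_outputs, wrong_preds = model.eval_model(test_df)   # held-out assessment
```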

TABLE 1. Illustrative Excerpts From Three Distinct Profiles Within Our Study Corpus. Each Sample Belongs to a Different Age Cohort. Italics Denote Questions, Whereas Self-Descriptions Are Underlined With a Wavy Line.

TABLE 2. Definitions and Distributions of the Three Age Cohorts Within Our Collection.

TABLE 3. Accuracy Obtained by Each Combination of Transformer and Pre-Defined Setting (Test Set). The † Indicates That Sliding Windows Could Not Be Used.

TABLE 4. F1-Scores Reaped by Each Encoder When Dealing With the Different Pre-Defined Configurations (Test Set). The † Signals That Sliding Windows Could Not Be Employed.

SECTION V.

Experiments

In the first place, we capitalized on the collection used by [19] for their gender study across cQA platforms. This corpus was gathered by [3], and it comprises 548,375 cQA member profiles including demographic information such as gender and age. Each of these records is connected with its corresponding sets of questions, answers, nicknames and self-descriptions (see table 1). Following the suggestion of [3], we clustered community fellows in conformity with the three-group version of the segmentation proposed by Strauss and Howe [12] (see the distribution in table 2). Accordingly, the entropy of this demographic variable is 0.8477.

This set was randomly divided into training (329,025 instances; 60%), evaluation (109,675; 20%), and test (109,675; 20%) sets by means of a stratified random sampling strategy, maintaining similar class distributions across the three splits. In particular, the test set contains 53,842 instances from GEN Z (49.09%), 45,997 from GEN Y (41.94%) and 9,836 from Olders (8.97%). It is worth highlighting here that held-out evaluations were conducted in all our experiments by keeping these splits unchanged. It also has to be clarified that the test material was employed solely for providing an unbiased assessment of the final models fit on the training/evaluation datasets.
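
For reference, the cohort proportions above reproduce a value close to the reported 0.8477 if that figure is read as the base-2 entropy normalized by log2(3), and the split itself can be sketched with scikit-learn. The `profiles` DataFrame and its column names below are placeholders, not artifacts of the original corpus.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Cohort proportions taken from the stratified test split reported above.
p = np.array([0.4909, 0.4194, 0.0897])            # GEN Z, GEN Y, Olders
h = -(p * np.log2(p)).sum()                       # ~1.34 bits
print(round(h / np.log2(3), 4))                   # ~0.847, consistent with the reported 0.8477
                                                  # when read as normalized entropy

# 60/20/20 stratified split; `profiles` is a placeholder DataFrame with a 'cohort' column.
train_df, rest = train_test_split(profiles, test_size=0.4,
                                  stratify=profiles["cohort"], random_state=0)
eval_df, test_df = train_test_split(rest, test_size=0.5,
                                    stratify=rest["cohort"], random_state=0)
```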

In our experiments, we accounted for four empirical variations with the aim of assessing the contribution of each individual kind of information within the user profile to the overall performance of the classifier. These four configurations are denoted by the following abbreviations (a minimal sketch of how they are assembled follows the list):

  • T (question titles only)

  • TB (question titles plus question bodies)

  • TBA (full questions coupled with answers)

  • TBAD (full questions, answers and self-descriptions)
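
A minimal sketch of how these four configurations could be assembled from a profile record is given below; the field names are illustrative placeholders rather than the actual keys of the corpus.

```python
def build_input(profile: dict, config: str = "TBA") -> str:
    """Concatenate the textual signals selected by one of the four configurations.
    `profile` is assumed to hold lists of strings under illustrative keys."""
    parts = list(profile.get("titles", []))                # T: question titles only
    if config in ("TB", "TBA", "TBAD"):
        parts += profile.get("bodies", [])                 # add question bodies
    if config in ("TBA", "TBAD"):
        parts += profile.get("answers", [])                # add answers
    if config == "TBAD":
        parts += profile.get("self_description", [])       # add the short bio, when present
    return " ".join(parts)
```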

Tables 3 and 4 display the outcomes accomplished by the different combinations of encoders and configurations. From these results, it is worth remarking on the following findings:

  1. Regardless of the metric, MobileBERT finished with the best average performance. Note that whenever it did not achieve the best score, it still delivered a very competitive performance.

  2. Contrary to pre-trained models for gender classification [19], self-descriptions have a positive impact on the average classification rate. Basically, they enhanced the performance of many BERT and RoBERTa-based architectures, while diminishing the score of other encoders such as FNet. This mixed behaviour can be attributed to the sparseness of short bios: only 7% of the community fellows describe themselves.

  3. By contrast, it is crystal clear that question bodies and answers gave a major boost to the best model. More concretely, its accuracy grew by 7.02% when adding bodies to titles and by 2.63% when enriching full questions with answers. A similar picture is seen in terms of F-Score: an increase from 0.6066 to 0.6953 when taking question content into account, and from 0.6953 to 0.7424 when also considering answers. All in all, informative cues inferred from answers are useful for building effective models for age recognition, in contrast to gender, where these inputs noticeably lower the classification rate [19].

  4. Closely analogous to gender recognition [19], the worst performance can be, most of the time, attributed to: XLM-RoBERTa, BERT and ELECTRA. More specifically, XLM-RoBERTa reached the lowest scores when trained on full questions (70.25%) and on TBA (71.42%). We deem this as a result of their sensitivity to the distinct input signals.

  5. The gap between the best and the worst systems is the narrowest when combining all signals (approximately 4.5% accuracy and 0.1103 F-Score points). Conversely, building models on top of full questions plus answers produces the widest gap (i.e., ca. 7.19% accuracy and 0.2135 F-Score points).

  6. As for checking statistical significance, we bootstrap sampled the best model for each configuration and each of its respective competitors twenty times and performed a two-tailed t-test afterwards (α < 0.05). All but one pair passed this test: XLNet and RoBERTa (accuracy). A sketch of this procedure is given below.
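
The exact resampling protocol is not spelled out above; the snippet below shows one plausible reading, i.e., twenty bootstrap resamples of the test predictions per system followed by a two-tailed t-test on the resulting accuracies. The prediction vectors are placeholders.

```python
import numpy as np
from scipy import stats

def bootstrap_accuracies(y_true, y_pred, n_rounds=20, seed=0):
    """Accuracy over n_rounds bootstrap resamples of the test set."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accs = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(y_true), len(y_true))     # sample with replacement
        accs.append(float((y_true[idx] == y_pred[idx]).mean()))
    return np.array(accs)

# y_test, preds_best, preds_rival: placeholder label/prediction vectors.
# t_stat, p_value = stats.ttest_ind(bootstrap_accuracies(y_test, preds_best),
#                                   bootstrap_accuracies(y_test, preds_rival, seed=1))
# significant = p_value < 0.05                              # two-tailed by default
```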

In summary, these outcomes indicate that age recognition is feasible on the basis of textual interactions within cQA platforms. Specifically, the best systems accomplished an accuracy of 78.76% (Longformer without self-descriptions) and an F-Score of 0.7676 (squeezebert-uncased with short bios). Results also point towards the fact that fine-tuning effective models for age recognition requires the signals provided by answers, while prior studies proved that these are detrimental for gender [19]. On average, question bodies enhanced the performance by 9.3%. By the same token, adding answers brought about an average increase of 3.02%, and subsequently including profile descriptions a further 0.13%. In effect, our outcomes suggest that short bios might be omitted most of the time without significantly affecting the classification rate. Recall that these are very sparse, since only 7% of the profiles have self-descriptions.

Another empirical difference between age and gender detection lies in the benefits of casing: both cased and uncased models achieve competitive performance in the case of age detection. To be more precise, MobileBERT is uncased, whereas LongFormer and RoBERTa are cased. This result also entails that cost-efficient models, namely MobileBERT, can rival more resource-demanding encoders (i.e., LongFormer and RoBERTa), similarly to gender [19].

Figure 2 displays the confusion matrices for the best encoder under each configuration. In the first place, results indicate that adding question bodies assisted in boosting the correct recognition of instances from all cohorts, while answers helped solely with GEN Z and Olders. Secondly, self-descriptions were profitable exclusively for improving the identification of the minority cluster, namely Olders. Roughly speaking, most informative cues about age can be found across full questions, and by and large, answers and short descriptions are fruitful mainly for tackling the data sparseness caused by the minority cohort.

FIGURE 2. The confusion matrix for the best model under each configuration.

These four confusion matrices also corroborate the findings of previous works [1], [2], that is to say that errors arise from perceiving samples as members of the prior or the next cohort. In other words, few misclassifications occur due to conceiving GEN Z fellows as Olders or the other way around. Interestingly enough, our empirical outcomes support the discovery of [2], who found out that graph (community) activity patterns slowly and systematically change in tandem with age. Lastly, the precision markedly improved across the three clusters as more signals were added, especially for the Olders. To be more exact, on GEN Z it went from 0.742 to 0.845 (compare figures 2a and 2c), while on GEN Y from 0.712 to 0.779 (see matrices 2a and 2b), and on Olders from 0.297 to 0.585 (cf. figs. 2a and 2d).
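
Confusion matrices like those in figure 2, together with the per-class precision values quoted above, can be reproduced along these lines; the label and prediction vectors are placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

labels = ["GEN Z", "GEN Y", "Olders"]
# y_test, preds: placeholder vectors of true and predicted cohort names.
cm = confusion_matrix(y_test, preds, labels=labels)        # rows: true, columns: predicted
ConfusionMatrixDisplay(cm, display_labels=labels).plot()
plt.show()

per_class_precision = cm.diagonal() / cm.sum(axis=0)       # diagonal over column sums
```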

Additionally, figure 3 highlights the ROC (Receiver Operating Characteristic) curves for the best systems. These graphs show a progressive improvement from the titles-only setting to the full questions plus answers setting. In fact, the macro-average AUC (Area Under the Curve) grew by 0.092 points, and GEN Y had the sharpest increase (i.e., 0.135 points). All in all, self-descriptions were always detrimental in terms of AUC.

FIGURE 3. ROC curves for the best model obtained for each configuration.
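
The ROC curves and macro-average AUC of figure 3 follow the usual one-vs-rest treatment for a three-class problem, sketched below; the probability matrix `probs` is assumed to come from the fine-tuned model and is a placeholder here.

```python
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.preprocessing import label_binarize

classes = ["GEN Z", "GEN Y", "Olders"]
# y_test: true cohort names; probs: (n_samples x 3) class-probability matrix (placeholders).
y_bin = label_binarize(y_test, classes=classes)             # one-vs-rest targets
macro_auc = roc_auc_score(y_bin, probs, average="macro")    # macro-average AUC
fpr, tpr, _ = roc_curve(y_bin[:, 1], probs[:, 1])           # e.g. the GEN Y curve
```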

Figure 4 highlights the term attributions assigned by LongFormer TBA to three member profiles. In this image, positive attribution scores (in green) indicate terms that contribute positively towards the predicted cohort, while negative scores (in red) signal words that contribute negatively towards the recognized age group [29], [30]. A more intense shade of colour denotes a greater contribution. These scores reveal that terms such as “academy”, “music” and “dreams” are strong indicators of the youngest generation, while “parents”, “permission”, “mom” and “ask” point to GEN Y. In the case of Olders, we find words including “benign”, “vet” and “animals”. Interestingly enough, in this last group, terms like “marriage”, “woman” and “man” contribute negatively. Since these words are normally found across the category “Family and Relationships”, we conjecture that this negative contribution is due to its stronger relation to younger people, typically aiming to or beginning to settle down with their families. All in all, this analysis shows that the best encoder reaped higher classification rates by correctly inferring word usages within each target class.

FIGURE 4. Term attributions assigned by LongFormer TBA to three member profiles.
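
The attribution tool behind figure 4 is not named in this excerpt; one possible way to produce comparable per-token scores is integrated gradients via the transformers-interpret wrapper around Captum, sketched here with an assumed path to the fine-tuned checkpoint and an illustrative input sentence.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

# Path to the fine-tuned LongFormer TBA checkpoint (assumed location).
ckpt = "output/longformer-tba"
model = AutoModelForSequenceClassification.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

explainer = SequenceClassificationExplainer(model, tokenizer)
word_attributions = explainer("My mom says I need her permission to join the music academy")
explainer.visualize()        # green/red token highlighting similar to figure 4
```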

Figure 5 unveils interesting regularities across errors. As [1] noted, the accuracy systematically increases until around the middle of the Olders interval (i.e., year 1950), and it consistently decreases afterwards towards the boundary with GEN Y (year 1979). In a similar manner, the classification rate rises from 1980 up to around the middle of the GEN Y cohort (year 1987), and falls heavily thereafter until reaching the boundary with GEN Z. Subsequently, we see the steady rise and fall of the accuracy for instances from GEN Z (peaking at 2003). Given these results, we can conclude that misclassifications are more likely to be found around generational boundaries. This conclusion makes perfect sense, since these individuals are “transitional”, meaning it is reasonable to expect that they share considerable inter-generational traits, which are manifested across two consecutive cohorts. However, unlike [1], both inter-generational drops seem to be sharper in the case of LongFormer TBA. We interpret this as a consequence of inferring more informative cues of the right generation when dealing with “transitional” community peers. Another interesting discovery regards the rate of correct guesses: our outcomes indicate that it peaks around the middle of each age interval. We understand these peaks to correspond to the archetypes, or most representative users, of the respective cohort. In effect, this finding is in consonance with that of [2], which revealed that age-based centroid vectors tend to form a trail ordered by age in the Node2Vec embedding space built on top of activity graphs.

FIGURE 5. Accuracy vs. birth year (dotted line). The bars denote the fraction of community members born in the respective year.

Additionally, figure 5 points towards the fact that the transition from GEN Y to GEN Z is smoother (i.e., substantially higher accuracy) than the one between Olders and GEN Y. This leads to the conclusion that the difference between GEN Y and GEN Z is wider than between Olders and GEN Y. Lastly, figure 5 discloses that the inter-generational decrease between GEN Y and GEN Z takes place despite these cohorts holding the largest fractions of samples, signalling that this drop is not due to data sparseness.
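
A plot in the spirit of figure 5 can be obtained by grouping the test predictions by birth year, as sketched below; the DataFrame `df` and its columns are placeholder names.

```python
import matplotlib.pyplot as plt

# df: one row per test profile with placeholder columns 'birth_year', 'y_true', 'y_pred'.
df["correct"] = (df["y_true"] == df["y_pred"]).astype(float)
per_year = df.groupby("birth_year")["correct"].agg(accuracy="mean", count="size")
per_year["fraction"] = per_year["count"] / len(df)

fig, ax_bars = plt.subplots()
ax_bars.bar(per_year.index, per_year["fraction"])            # bars: share of members per year
ax_line = ax_bars.twinx()
ax_line.plot(per_year.index, per_year["accuracy"], "k:")     # dotted line: accuracy per year
plt.show()
```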

A. Summary and Limitations

Aside from the previously mentioned considerations, we can see that short bios suffer severely from data sparseness, since only ca. 7% of community fellows entered a self-description on the Yahoo! Answers platform. Nevertheless, the real impact of these pieces of information must still be assessed. One can reasonably envision here that there is a good chance of extracting demographic nuggets from these texts, in particular on communities whose members might be more inclined to provide short biographies, e.g., Quora, Reddit and Stack Exchange. Here, frontier transfer learning approaches can be employed to mitigate the data sparseness across Yahoo! Answers self-descriptions.

One limitation of our approach is that the transition from one age group to the other is smooth, which means conventional text-based frontier models have, at best, insufficient input to deduce discriminative patterns for “transitional” individuals. Here it would make sense to extend this work to consider multi-modal sources of input signals, such as the corresponding profile images and activity patterns.

Another limitation of this study regards the corpora on which these state-of-the-art encoders were pre-trained. Most of the time, these training corpora comprise web texts, Wikipedia and books. If significant computational power is accessible, one could think about pre-training frontier transformers on massive amounts of user-generated cQA texts. We envisage that smaller architectures would be more cost-efficient, since these archives are not as big as vanilla collections of web texts, for instance. Still, this sort of pre-training presents interesting challenges, for example when it comes to misspellings, jargon, aliases and the cleaning of the corpora. Given that the grammar across question titles is sharply different from what we can find across question bodies and answers, we need to think about special adjustments or separate models.

Additionally, a promising way of dealing with community members who prompted few questions and/or yielded a low number of responses in English is to capitalize on multi-lingual transformers. It seems perfectly logical that this might enhance age recognition when users can express themselves in several languages. A good example of this are short bios written in a language other than English.

SECTION VI.

Conclusion

In the first place, empirical results indicate that it is feasible to build frontier transformer-based classification models for age detection across cQAs. In this regard, LongFormer outclassed all other architectures when fine-tuned on all input signals but self-descriptions.

Unlike gender recognition, our outcomes do not show a noticeable difference between cased and uncased models. To be more precise, we see cased encoders, including RoBERTa and LongFormer, competing head-to-head with uncased MobileBERT.

We envision that our findings might be useful for transferring models to platforms where it is hard to obtain large-scale annotated data. Take, for instance, services such as Stack Exchange, where people seldom reveal cues about their age. We also envisage the implementation of multi-modal architectures as a means of obtaining a sharper detection of boundary samples.

Lastly, age recognition based on textual inputs is positioned to be an instrumental tool to design interventions to promote equal engagement and participation in online communities.
