Introduction
Natural language processing (NLP) is a dynamic interdisciplinary field that integrates linguistics, computer science, and artificial intelligence to facilitate interactions between computers and humans through natural language. NLP aims to develop algorithms and models that can efficiently process and analyze large volumes of natural language data, enabling computers to understand and interpret the content of various documents, including their contextual subtleties [1]. Currently, NLP technology can extract valuable information and insights from documents in multiple languages, with algorithms developed to classify and organize documents, improving their accessibility and usability.
A significant aspect of the evolution of NLP is its application to text analysis and classification using machine learning techniques for clustering. Text analysis allows extracting meaningful information from textual data, including sentiment analysis processes, topic extraction, entity recognition, and document classification [2]. NLP techniques employ machine learning algorithms to process and understand texts on a large scale. Clustering algorithms provide a significant advantage in grouping similar documents or text fragments based on their content, structure, or context.
As the labor market evolves, there is an increasing need to identify the future needs of professional skills and competencies [3]. This presents an opportunity to adapt and enhance current trends in NLP and apply them to the vast world of skills and competency classification through predictive models of future skills. The creation of Knowledge, Skills, and Abilities (KSA) taxonomies responds to the pressing need to align educational and occupational profiles with the changing demands of Industry 4.0 [4]. These taxonomies redefine how we conceptualize and prepare for occupations, incorporating a visual dimension to offer interactive frameworks for exploring future workforce dynamics [5]. Leveraging machine learning tools and diverse data sources such as ESCO, O*NET frameworks, SkillsFuture of the government of Singapore, and the World Economic Forum, dynamic taxonomies forecast and address changing labor market requirements [6]. Through these efforts, the goal is to unlock the future of the workforce, encouraging a proactive approach to skills development and education in line with industry demands. Leveraging the power of NLP text analysis, clustering, and classification, a model could be developed to feed a dynamic taxonomy of competencies that adapts to the changing landscape of occupational capabilities.
The main objective of this analysis is to provide a comprehensive summary of the current state of NLP and predictive models to define future skills. This includes evaluating various research papers, journals, and other studies in this field to better understand the advances and challenges in these areas. This work offers a current perspective on the advances in these two disciplines and facilitates the identification of critical areas of opportunity for future research.
While existing studies highlight the predictive ability of NLP models across various contexts, our analysis points to the potential for developing more refined models specifically tailored to predict the relevance and evolution of skills and competencies required in the industry. As industries evolve, some competencies become obsolete while others gain importance. NLP models capable of predicting these changes from large volumes of textual data would be invaluable to educators, employers, and practitioners, opening up a new frontier in skills acquisition.
The paper is organized as follows: subsection I-A provides a brief definition of key concepts used in this paper; section II outlines the research methodology employed in designing and conducting this analysis, including the Research Questions (RQs); section III presents the results and analysis, followed by a discussion that summarizes the findings, identifies areas of opportunity, and addresses the limitations of this work; finally, section IV presents the conclusions and suggests directions for future work.
A. Definition of Terms
This subsection presents the definitions of the relevant terms related to this work, where we identify the following:
Future skills: competencies that enable individuals to effectively address intricate problems in rapidly evolving contexts while demonstrating self-organizational capabilities. These qualifications are rooted in cognitive, motivational, volitional, and social resources and are imbued with core values. Furthermore, they can be acquired through a deliberate process of learning. In this context, knowledge encompasses a comprehensive array of facts, theories, and principles pertinent to a specific field of work or study; skills denote the practical capabilities essential for executing particular tasks with proficiency; and abilities denote the sensory, physical, psychomotor, or cognitive means necessary for effectively undertaking a given study or task [7]. Notably, these Future Skills are intricately intertwined with the knowledge of emerging technologies [8].
Natural Language Processing (NLP): NLP refers to the branch of artificial intelligence that provides computers with text processing capabilities to understand texts much as humans do [9]. NLP combines computational linguistics and rule-based modeling of human language with statistical, machine learning, and deep learning models [10].
Neural Network Models: This is a computational structure inspired by the human nervous system, designed to process information by simulating interconnected units called neurons. These neurons are organized into layers, specifically an input layer, one or more hidden layers, and an output layer. The strengths of the connections between these neurons, known as weights, are initially set to random values. The model learns by adjusting these weights, through an iterative process, based on the difference between its predictions and known outcomes [11]. As it undergoes training with known data examples, the network refines its weights, improving its predictive accuracy. Once trained, the neural network can make predictions on new, previously unseen data [12].
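As a minimal illustration of this training loop, the following sketch fits a small feed-forward network with scikit-learn; the dataset, layer size, and iteration count are arbitrary choices for demonstration only.

```python
# A minimal sketch of a feed-forward neural network using scikit-learn.
# The dataset, layer size, and iteration count are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# One hidden layer of 16 units; weights start at random values and are
# adjusted iteratively to reduce the gap between predictions and known labels.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))  # predictions for five (here, already seen) samples
```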
POS Tagging: Commonly referred to as grammatical tagging, it is a foundational pre-processing task in NLP. It involves assigning appropriate grammatical categories, such as verbs, nouns, adjectives, adverbs, and determiners, to a text’s words and punctuation based on their inherent definition and the context in which they appear [13]. The significance of POS tagging extends to understanding text’s morphological and syntactical structure, enabling more advanced NLP tasks [13].
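For illustration, a short POS-tagging example using NLTK is sketched below; the library, sentence, and resource names are our assumptions rather than anything prescribed by the cited work, and resource names can vary slightly across NLTK versions.

```python
# A short POS-tagging example with NLTK (illustrative choice of library).
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Future skills evolve rapidly in Industry 4.0")
print(nltk.pos_tag(tokens))
# e.g. [('Future', 'JJ'), ('skills', 'NNS'), ('evolve', 'VBP'), ...]
```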
Pre-trained language model: Pre-trained language models are machine learning models trained on large amounts of textual data. They mainly benefit specific NLP tasks by acquiring general linguistic features such as syntax, grammar, and semantics. Typically, these models are used in several NLP applications, such as named entity recognition, text summarization, and sentiment analysis [14]. However, there remain several areas of opportunity where pre-trained language models can be applied [15].
Syntactic patterns: Refer to the established arrangements of words within sentences and clauses in the English language, ensuring coherence and meaningful expression. These patterns dictate the acceptable word orders, especially when various linguistic elements, like indirect objects or prepositional phrases, are involved [16].
TF-IDF: Short for term frequency-inverse document frequency, this is a statistical measure used to determine the significance of a word in a document relative to a corpus. It adjusts for words that naturally appear more frequently, thus providing a more accurate representation of word importance [17]. TF-IDF scores a word based on its frequency in a specific document, discounted by its general frequency in the entire corpus, leading to more relevant search and recommendation results [18].
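As a brief illustration of how such weights can be computed in practice, the sketch below uses scikit-learn's TfidfVectorizer on a toy corpus; the library and documents are illustrative choices, not part of the cited definition.

```python
# Illustrative TF-IDF weighting with scikit-learn on placeholder documents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "data analysis and machine learning skills",
    "communication skills and teamwork",
    "machine learning for skill prediction",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # rows: documents, columns: terms
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))                 # higher weight = more distinctive term
```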
Transformer models: The Transformer model is a neural network that learns context and meaning by tracking relationships in sequential data, such as the words in a sentence. Transformer models apply an evolving set of mathematical techniques (called attention or self-attention) to detect a variety of subtle ways in which even distant data elements influence and depend on each other [19].
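To make the attention idea concrete, the following is a bare-bones scaled dot-product self-attention step in NumPy with toy dimensions; it is a didactic sketch, not the architecture of any particular model.

```python
# A bare-bones scaled dot-product self-attention step in NumPy (toy dimensions).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # pairwise relevance
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V                                 # context-mixed vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                            # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)             # (4, 8)
```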
Word Embedding-Based Models: these models learn the meaning of words by considering the context of words, and thus, they consider the semantic relations between words. To do so, they transform words into numerical vectors that can be represented in a multidimensional space. Within this space, words with similar meanings are positioned closer, and words with dissimilar meanings are positioned further apart. As a result, distances between word vectors become informative about the meaning of words [20].
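A small illustrative sketch of learning such vectors with gensim's Word2Vec follows; the toy corpus and hyperparameters are assumptions chosen only for demonstration.

```python
# A small word-embedding sketch with gensim's Word2Vec (toy corpus).
from gensim.models import Word2Vec

sentences = [
    ["python", "programming", "skill"],
    ["java", "programming", "skill"],
    ["communication", "teamwork", "skill"],
]
model = Word2Vec(sentences, vector_size=20, window=2, min_count=1, seed=0)

# Words used in similar contexts end up close together in the vector space.
print(model.wv.most_similar("python", topn=2))
```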
Research Methodology
This work was conducted following systematic review guidelines outlined by Page et al. [21], Kitchenham et al. [22], Xiao and Watson [23], and Torres-Carrión et al. [24], aiming to identify work related to predictive models and NLP for skills published in the last four years. By employing this kind of analysis in the research, we ensure a methodologically rigorous, robust, and transparent approach, enhancing the credibility and reproducibility of the findings.
The methodology developed for this work included the definition of RQs to guide the study, the search phase in well-known and reputable databases, and finally, the discussion and findings. This final step involved answering each RQ by contrasting different approaches and works from various authors in the databases. This process aimed to provide a comprehensive understanding of the current state of the art in the areas of interest.
The following subsections detail the objectives of the analysis, the RQs, and the process of searching for relevant information from formal sources to answer these questions. Subsequently, we analyze and discuss the findings, contrasting the achievements of different researchers and examining how their work fits into the research and ongoing investigation activities. Finally, we draw conclusions based on the gathered data and propose future research directions.
A. Objective of the Analysis
The primary objective of this analysis is to recognize and summarize the current state of the predictive models and NLP in relation to skills acquisition. This includes a comprehensive review of various research papers, journals, and other studies in the field. By extensively examining diverse literature sources, we aim to gain a holistic understanding of the advancements and challenges in predictive modeling and NLP, particularly related to the creation of skills taxonomies. This review provides a present perspective on the depth and breadth of current knowledge in the field.
Additionally, this work seeks to identify opportunities to enhance NLP techniques for predicting future skills. Such advancements could benefit workers and employers when developing strategies for aligning employees’ current knowledge and skills with the industry’s emerging needs.
B. Research Questions
To guide this study toward the goals of identifying relevant work related to predictive models and NLP for skills acquisition and determining future research directions, we identified the following three RQs:
RQ1. How are NLP techniques used and adapted to extract meaningful insights from textual content?
RQ2. Which and how are NLP models employed to create taxonomies or classifications of skills?
RQ3. How are NLP models being applied to predicting behaviors or outcomes in specific areas?
C. Information Sources
To ensure access to up-to-date and credible information pertinent to the RQs, we identified several databases renowned for their extensive content and esteemed reputation:
Web Of Science [25]: An all-encompassing research database esteemed for its extensive coverage of scientific literature. Its design boasts a user-friendly interface complemented by advanced tools for tailored search experiences.
Scopus [26]: An expansive abstract and citation database encompassing research literature from diverse disciplines, with features that facilitate refined search capabilities.
These search engines also index other specialized databases, such as Science Direct, IEEE Xplore, and Springer Link. The search keyword combinations were designed to retrieve articles related to the research topic.
D. Search Strategy
Considering the RQs, we meticulously identified key phrases employed in relevant databases. These precise terms were chosen to maximize the retrieval of pertinent data related to the critical terms derived from our RQs.
Table 1 displays the keyword combinations used as queries that returned more than one result in the searches across multiple databases. This table provides the specific combinations of terms and the logical connectors used to filter items.
E. Eligibility Criteria
The selection of literature for this study followed a set of eligibility criteria to ensure that the most relevant studies were included to cover the objectives of this work.
Considered articles were published between 2019 and 2023, focusing on the most recent advances in NLP and predictive models for skills acquisition. Studies were included if they directly addressed the RQs related to the use of NLP techniques to extract information from textual content, the development of skill taxonomies, or the application of predictive models in skills acquisition. Only articles from artificial intelligence, computer science, software engineering, and related disciplines were considered to ensure contextual relevance. The search was limited to works published in peer-reviewed journals and conferences to ensure the credibility of the research, and only works in English were included to maintain consistency. Bibliographic searches were conducted in the Web of Science and Scopus databases, selected for their broad coverage of quality academic articles. Specific keyword combinations, as mentioned in Table 1, were used to capture the most relevant studies for this work.
Articles published before 2019 were excluded to maintain the focus on recent research. Sources not present in Web of Science or Scopus were excluded to ensure the scientific rigor of the findings. We omitted studies unrelated to the core disciplines of artificial intelligence, computer science, or software engineering and articles focused on unrelated topics such as theoretical linguistics. Duplicate publications identified in the databases were removed to avoid redundancy. Articles that did not specifically address the use of NLP in skills acquisition or the development of predictive models for skills-related taxonomies were excluded, along with studies lacking a clear methodology or relevant results. Non-English articles were excluded to ensure consistency and accuracy.
In the first phase of identification, 118 records were screened from Scopus (n = 16) and WoS (n = 102) using the keyword combinations defined in Table 1. During this process, 38 articles published before 2019 were excluded based on the initial screening criteria. Additionally, 26 duplicate entries were identified and removed.
To conduct an assessment of the included studies, we employed eligibility criteria considering the key aspects directly addressing the RQs (see Figure 1). We filtered the initial set of 118 documents described in Table 2, resulting in 81 articles for further analysis.
The 81 articles were randomly split among the researchers, as shown in Table 3. These articles were screened by reading the title and the abstract, looking for information valuable to the RQs. As a result of this screening, 60 articles were excluded, leaving 21 articles. The exclusion criteria were based on factors such as the programming language employed, predictive models, NLP techniques, and the application area. In addition, the related category criteria focused primarily on artificial intelligence, computer science, and software engineering.
Subsequently, this screening phase was centered on aligning with the primary focus of the study, which revolves around applying NLP techniques and predictive models to skills. Articles were omitted if they failed to provide insights into NLP models for information retrieval from documents or if they did not offer methods for predicting skills.
For the eligibility round, as seen in Table 4, each of three investigators was assigned seven articles to analyze, and each article was reviewed in detail to ensure that it answered the RQs, for a total of 21 articles analyzed in the eligibility round. Finally, after careful review, we decided to exclude 2 of them, resulting in a final collection of 19 eligible articles that matched the research criteria.
By applying these detailed eligibility criteria, we aimed to ensure a comprehensive and focused analysis of the most relevant literature in NLP and predictive models for skills acquisition.
F. Quality Assessment
After acquiring the 19 selected articles, three quality assessment parameters were defined, each aligned with one of the RQs. Each parameter comprises three subcategories, delineated by the extent to which an article satisfies the corresponding condition.
QA1. The study proposes using an NLP technique to extract meaningful insights from textual content.
1.1 The study does not identify or propose any specific NLP technique, or it does, but without giving enough details to understand how the method was used to extract the insights.
1.2 The study presents an NLP technique for extracting information from textual corpora. The method is explained, but the insights it yields are not specified.
1.3 The study describes a specific NLP technique and presents how it’s being used to extract information and meaningful insights from text corpora.
QA2. The article discusses which and how NLP models are employed to create taxonomies or classifications of skills.
2.1 The article does not mention using NLP to build taxonomies or to organize or classify skills.
2.2 The study identifies one or more NLP models for building taxonomies or organized data but fails to relate these classifications with job skills.
2.3 The study presents an NLP model that outputs some classification or hierarchization of related data. The data has been extracted from a textual corpus and converted to some word embeddings for further analysis and for finding similarities.
QA3. The study proposes using NLP models to make predictions or outcomes in a specific area.
3.1 The study does not mention using an NLP model to forecast future values based on the training data.
3.2 The study proposes using an NLP model to foresee changes in the semantic or syntactic representation and relations between words in a corpus.
3.3 The study uses an NLP model to analyze different relations of the vocabulary within a corpus, intending to find changes in those representations that could envision some pattern or tendency.
G. Data Extraction
During the data extraction phase, the primary focus was to gauge the depth with which the selected papers addressed each of the three proposed RQs. Each paper was assigned a subjective percentage, determined by the researcher, indicating the strength of its association with each RQ, based on the paper’s approach to the subject matter.
The results of this evaluation are visualized in Figure 2, which presents the percentage of each subgroup in the quality assessment.
In Figure 2, the color red represents papers that relate to a given RQ at a level of 33% or less. Yellow indicates a relationship ranging from 34% to 66%. Lastly, the green bars signify papers with a strong connection, relating at more than 66%.
From the previous chart, we can see that RQ1 was the most discussed in the papers, followed by RQ3 and finally RQ2. Interestingly, the limited attention given to RQ3 might indicate that applying NLP models in predictive analytics for specific areas remains an emergent field, and perhaps the complexities of prediction, especially in dynamic scenarios, challenge the existing NLP methodologies.
This distribution also unveils potential directions for future research endeavors. The predominant red bars for RQ2 in Figure 2 suggest a vast space for novel studies, aiming to delve deeper into applying NLP models for crafting skill taxonomies or classifications. Yellow bars signal an intermediary exploration level, emphasizing a need to investigate further and refine NLP techniques’ adaptation to extract meaningful insights from diverse textual data. On the other hand, the green bars, particularly in the context of RQ3, hint at a burgeoning space. There’s an opportunity for more comprehensive research focusing on the nuanced applications of NLP models in predicting behaviors or outcomes in specific sectors, underlining the ever-evolving nature of NLP in contemporary analytics.
Results and Discussion
The evaluation of the 19 selected articles was conducted meticulously in alignment with the three RQs. We crafted a detailed table to facilitate a comprehensive understanding and comparative analysis (Table 5). This tabular representation is a valuable tool for identifying common themes, detecting patterns, and highlighting potential avenues for future research. It effectively organizes the diverse contributions across different research teams and individual studies.
An overview of the findings, as outlined in Table 5, is provided below, presenting an analytical commentary on their relevance to the RQs. Although several papers addressed themes relevant to all three RQs, certain studies resonated more profoundly with specific queries. The subsequent discussions delve into the relationship of each paper with the research objectives.
The compilation and assessment of these papers were meticulously guided by their alignment with and contributions to the RQs. Within this segment, we unpack the varied strategies and outcomes of the reviewed papers, underscoring their significance and limitations in the broader landscape of NLP research.
Based on the information in Table 5, we can perform a data analysis that allows us to evaluate the types of data in the papers we reviewed. This analytical exercise is important because it lets us emphasize the characteristics of the information collected and identify possible patterns, trends, and aspects that may influence the study context.
In Figure 3, we can identify the percentage relative to the scope of application of the papers analyzed. This visual evaluation exercise effectively identifies and understands the proportional distribution of the different domains or thematic areas addressed in the research.
Figure 4 shows a count of the libraries or frameworks used in the papers analyzed. This analysis allows us to quantitatively visualize the prevalence of technological tools in all analyzed studies. The visual representation of the data facilitates the identification of the most recurrent libraries or frameworks, highlighting those that have been most frequently adopted.
In Figure 5, we can see the various NLP techniques in the analyzed papers, giving an insight into the preferences and frequency of use of the most recurrent techniques within the field of study. This visual approach facilitates the identification of relevant trends and patterns.
Figure 6 shows the visualization of the predictive models most frequently used in the analyzed papers. This visualization allows us to understand the trend of functional predictive models, offering a strategic vision for those interested in the field of study.
Along with the findings of the individual analysis of each data item, we clustered all of these data together (Table 5), following the work of Azofeifa et al. [45], to visualize in a defined space how related or similar the works are from a general point of view of the whole. Recall that the following data are available for each work: programming language/package used, library or framework used, NLP technique, predictive model used, application area, and whether it is a systematic literature review, where each work can use zero, one, or more of the options available in each of these categories. To compare the items, we define a metric that assigns each item a value depending on the options used: a characteristic receives a 1 if the article uses it and a 0 otherwise, so each paper is represented by a binary vector over all available characteristics, and the distance between documents in the resulting n-dimensional space is calculated using the L2 Euclidean norm [46].
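The following sketch illustrates this encoding and distance computation on invented data; the feature names and binary values are placeholders, not the actual Table 5 contents.

```python
# Sketch of the binary-feature encoding and L2 distance described above.
import numpy as np
from scipy.spatial.distance import pdist, squareform

features = ["python", "spacy", "tf-idf", "bert", "education", "slr"]
papers = {
    "paper_A": [1, 0, 1, 0, 1, 0],   # 1 = characteristic used, 0 = not used
    "paper_B": [1, 1, 0, 1, 1, 0],
    "paper_C": [0, 0, 1, 0, 0, 1],
}
X = np.array(list(papers.values()))
D = squareform(pdist(X, metric="euclidean"))   # pairwise L2 distances
print(D.round(2))
```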
Having defined a distance between items, we can perform a cluster analysis in the n-dimensional space. First, we determine the optimal number of groups by applying the elbow method, as shown in Figure 7.
Subsequently, by applying the k-means algorithm, we group the data into a predefined number of clusters. The number of clusters corresponds to the “elbow” of the graph, which reflects a trade-off between the number of items in each cluster (the higher, the better) and the compactness of each cluster (the more compact, the better). Applying this method yielded k = 2 clusters. Figure 7 also shows the results for k = 1 and k = 3, confirming that k = 2 is a good choice, so we proceeded to visualize the subdivision of the groups as a dendrogram in Figure 8.
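A minimal sketch of the elbow heuristic and the k-means grouping, on an invented binary matrix of the kind described above, might look as follows.

```python
# Elbow heuristic and k-means grouping on an invented binary paper-feature matrix.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 0, 1, 0, 1, 0],
              [1, 1, 0, 1, 1, 0],
              [0, 0, 1, 0, 0, 1],
              [0, 1, 1, 0, 0, 1]], dtype=float)

inertias = []
for k in range(1, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)        # within-cluster sum of squares
print(inertias)                          # look for the "elbow" in this curve

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)                            # cluster assignment for each paper
```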
The k-means algorithm groups items according to their distance, but it is not immediately apparent what each group includes and why, since the original space has five dimensions. Using the T-distributed Stochastic Neighbor Embedding (T-SNE) algorithm, the data can be projected into 2D (Figure 9); the algorithm seeks to keep points that are close in the higher-dimensional space also close in the 2D space. The main parameters of the T-SNE algorithm were the following: perplexity = 100, exaggeration = 1, and 6 PCA components. The results are shown in Figure 9, where two groups are identified and each numbered point corresponds to an article and its reference in Table 5.
T-distributed stochastic neighbor embedding (T-SNE) has been used for visualization of the relationship among papers. The image shows two clusters identified by the dendrogram and their correlation distribution.
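For reference, a small projection sketch with scikit-learn's TSNE is shown below; the settings reported above (perplexity = 100, 6 PCA components) presuppose far more papers than this toy matrix, so smaller illustrative values are used here.

```python
# Projection sketch with scikit-learn's TSNE on an invented binary matrix.
import numpy as np
from sklearn.manifold import TSNE

X = np.array([[1, 0, 1, 0, 1, 0],
              [1, 1, 0, 1, 1, 0],
              [0, 0, 1, 0, 0, 1],
              [0, 1, 1, 0, 0, 1]], dtype=float)

embedding = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(X)
print(embedding)   # each row places one paper as a 2-D point for plotting
```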
The preceding charts and figures illustrate the multiple approaches adopted in NLP to overcome the challenge of deciphering unstructured text and extracting valuable information from a spectrum of distinct sources. Each visual representation encapsulates a unique facet of NLP methodologies, reflecting the diversity researchers employ in tackling the complexities inherent in textual data.
Below, we revisit the initial RQs that set the groundwork for this comprehensive exploration. We address these questions by drawing on the insights gleaned from the contributions of the selected researchers. The objective is not only to contrast the diverse methodologies employed but also to analyze the approaches that each researcher brings to the forefront.
By comparing these perspectives, we aim to comprehensively understand the multifaceted landscape of NLP applications in extracting meaning and valuable information from unstructured textual content to construct taxonomies or classifications. Through the analysis, we seek to unveil patterns, trends, and novel insights that collectively contribute to the evolving narrative of NLP research. Now, we will answer and discuss each of the RQs thanks to the findings and analysis carried out in this work.
RQ1.- How are NLP techniques used and adapted to extract meaningful insights from textual content?
The ever-evolving field of NLP has been pivotal in transforming how we extract and interpret data from textual content. This section explores the diverse methodologies and perspectives that have sought to answer this central query, illuminating NLP’s expansive reach and potential in extracting and interpreting information from textual data.
Afshar et al. [27] employ a robust architecture to convert clinical documents into standardized vocabularies. This translation from diverse clinical narratives into a unified lexicon illustrates the power of NLP in the medical realm, showcasing its ability to standardize and unify complex and varied textual data. The study compared CUIs and n-grams as inputs to a regularized logistic regression classifier using a TF-IDF transformation, with hyperparameters (L1 vs. L2 regularization and coefficient C) optimized through grid search for the highest AUC ROC, while evaluating recall, specificity, NPV, and PPV using Python and RStudio.
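A hedged reconstruction of this kind of pipeline is sketched below; it is our illustration of TF-IDF features feeding a regularized logistic regression tuned by grid search on AUC ROC, with placeholder documents, labels, and grid values rather than the authors' actual data or code.

```python
# Illustrative TF-IDF + regularized logistic regression pipeline with grid search.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

docs = ["chest pain and shortness of breath", "routine follow-up visit",
        "acute respiratory distress", "medication refill requested"] * 5
labels = [1, 0, 1, 0] * 5

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(solver="liblinear")),   # supports L1 and L2
])
grid = GridSearchCV(
    pipe,
    {"clf__penalty": ["l1", "l2"], "clf__C": [0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(docs, labels)
print(grid.best_params_, round(grid.best_score_, 3))
```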
Diving into deeper semantic understanding, Blandin et al. [28] employ machine learning classifiers to discern text appropriateness, ensuring age-aligned content. This method highlights the capacity of NLP to evaluate and categorize content based on semantic appropriateness, which is crucial for tailoring information to specific audiences. The study involves three models: a standard regression model with a first layer of 606 dimensions, a multi-task model with age prediction and binary classification (adult vs. children), and a sequence model combining classification and regression for children’s texts, with hyperparameters (hidden layers, activation function, dropout) shared across models. Similarly, Huang et al. [29] explore seed-guided topical taxonomy construction with their CoRel framework. This framework crafts intricate knowledge structures from text, reinforcing that NLP techniques are evolving beyond simple text processing to understanding and creating complex relationships within data. Their method expands user-given seed taxonomies by using a relation transferring module for discovering root topics and subtopics and a concept learning module for generating topical clusters, leveraging a pre-trained BERT model to train a relation classifier on minimal seed data with data augmentation and masked token techniques to infer hierarchical relationships.
Dehbozorgi and Mohandoss’s work [30] integrates the analysis of emotions with educational outcomes, employing NLP techniques to translate speech-based emotions into predictors of academic success. Aspect-based emotion analysis is performed with POS tagging, and the KNN algorithm is applied to the combined aspects and emotions as feature vectors to predict student performance. This application underscores NLP’s potential in interdisciplinary research, particularly in understanding the emotional dimensions of communication and their impact on learning.
Similarly, Cao et al. [38] enhance sentiment analysis by injecting user identity into pre-trained models, emphasizing the personalization dimension of NLP techniques. For the sentiment analysis, they use U-PLMs, incorporating user identity through embedding-based personalization in the embedding module and attention-based personalization in BERT’s self-attention module, with a two-stage training process. User embeddings are added to token representations for personalized document-level bias, and BERT’s transformer encoder processes the token sequence with multi-head self-attention and feed-forward networks across L layers to predict sentiment categories. Adapting pre-existing models for specific contexts exemplifies NLP’s shift towards more tailored and context-aware insights.
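The following rough sketch (our illustration, not the authors' implementation) shows one way embedding-based personalization can be wired up with Hugging Face Transformers: a learned user vector is added to the token embeddings before they enter BERT's encoder. The model name, user count, and user id are assumptions.

```python
# Illustrative embedding-based personalization: add a user vector to BERT's
# token embeddings before encoding (sketch only).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
user_embeddings = torch.nn.Embedding(1000, bert.config.hidden_size)  # 1000 users

enc = tokenizer("Great course, I learned a lot!", return_tensors="pt")
token_embeds = bert.embeddings.word_embeddings(enc["input_ids"])
user_vec = user_embeddings(torch.tensor([42]))          # vector for user 42
personalized = token_embeds + user_vec.unsqueeze(1)     # broadcast over tokens

out = bert(inputs_embeds=personalized, attention_mask=enc["attention_mask"])
print(out.last_hidden_state.shape)   # [1, seq_len, hidden]; feed to a classifier head
```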
Elteto et al. [39] highlight an intersection of Bayesian modeling and NLP. Although primarily targeting human skill acquisition, their hierarchical Bayesian sequence model demonstrates NLP methodologies’ flexibility and interdisciplinary nature, with potential applications in robust text sequence analysis in NLP.
Laverghetta et al. [33] epitomize deep semantic understanding by exploring the intricate psychological layers embedded in linguistic expressions. This research delves into the nuances of language, moving beyond surface-level analysis to uncover deeper meanings and implications.
Lin et al. [2] focus on closing the gap between structured and unstructured data. Their work demonstrates a push to harmonize structured representations, such as taxonomies and graphs, with the intricacies of unstructured textual data. Their approach combines a seed-guided method and relation transferring to adapt existing taxonomies and employs advanced semantic understanding to derive structured graphs from free-form text.
Collectively, these articles address RQ1, showcasing the vast and dynamic reach of NLP. From fundamental sentiment analysis to intricate taxonomy construction, the literature paints a picture of an ever-evolving NLP landscape. Techniques range from pre-trained models to Bayesian analytics, machine learning classifiers, and semantic parsing. These methods not only bolster the extraction of textual insights but also underscore NLP’s adaptability across diverse disciplines and contexts.
RQ2.- Which and how are NLP models employed to create taxonomies or classifications of skills?
In exploring the literature on NLP models for establishing taxonomies or categorizations of competencies, we identified several significant contributions aligned with the research focus. These contributions provide valuable insights into the intersection of NLP and skills categorization, enhancing our understanding of the topic. Below, we examine the most salient contributions and discuss how their methods and findings contribute to our overall understanding of the topic.
Huang et al. [29] present an innovative approach to creating skills taxonomies. Their method focuses on constructing thematic taxonomies guided by conceptual seeds. This approach is notable for its ability to expand the taxonomy structure in breadth and depth. Including conceptual names as seeds enriches the corpus and lays the foundation for building a robust taxonomy. Furthermore, the transfer of relations along multiple paths, driven by a relation classifier, not only reveals root nodes and uncovers new topics and subtopics but also provides a hierarchical structure essential for skill classification. Their innovative techniques and approach offer valuable insights that effectively contribute to the construction of taxonomies.
Jiang et al. [40] provide an original perspective on organizing knowledge into structured forms, highlighting its applicability across various domains. Their work, particularly the TaxoEnrich framework, addresses the important issue of automatically incorporating new concepts into an existing taxonomy without starting from scratch. The framework consists of four modules: taxonomy-contextualized embedding generation, sequential feature encoder, query-aware sibling encoder, and query-position matching. Taxonomy-contextualized embeddings are generated using pseudo sentences derived from taxonomic relationships, with separate handling of ancestor and descendant paths. These embeddings feed into a sequential feature encoder that models structural taxonomy information in a vertical view. The query-aware sibling encoder captures horizontal structural information by selecting relevant siblings based on query-relatedness, and the final query-position matching model computes relatedness using both fine- and coarse-grained relationships between the query node and the candidate parent, child, and siblings for accurate taxonomy completion. TaxoEnrich has demonstrated efficacy by significantly outperforming several existing methods, showcasing its potential applicability in skill categorization. This framework emphasizes the importance of evolving and enhancing existing taxonomies directly relevant to our goals.
Lin et al. [2] significantly contribute to our understanding of applying NLP models in constructing skill taxonomies, specifically within the context of the LinkedIn talent marketplace. Their completion task is a supervised classification problem using BERT-based embeddings and LinkedIn’s proprietary data to predict relationships between skills by leveraging synthetic training sentences and a carefully designed labeling process. Their innovative approach to building skill graphs highlights the practical application of NLP models. Lin et al. underscore the importance of quality text corpus, demonstrating the effectiveness of NLP techniques and providing solutions to common problems, such as irrelevant information in real-world text corpora. This study is crucial for understanding how NLP can be effectively applied in categorizing skills in professional environments, offering practical insights into the construction and refinement of skill taxonomies.
RQ3.- How are NLP models applied to predicting behaviors or outcomes in specific areas?
NLP models are increasingly employed to predict behaviors and outcomes across various domains, demonstrating their versatility and effectiveness.
In the clinical domain, Afshar et al. [27] utilize NLP to process clinical documents by extracting unique concept identifiers for large-scale clinical research. By applying machine learning techniques, this study highlights the effectiveness of NLP in analyzing unstructured clinical data, offering potential applications in healthcare research and data interoperability.
NLP models also play a crucial role in knowledge structuring and taxonomy expansion. Huang et al. [29] developed a seed-guided topical taxonomy construction method, while Jiang et al. [40] introduced the TaxoEnrich framework. These models automate the incorporation of new concepts into existing taxonomies by leveraging semantic and structural information, enhancing knowledge organization with significant implications for domains such as e-commerce and named entity recognition.
NLP’s application in psychometrics and language assessment is underscored by Laverghetta et al. [33]; they employ transformer-based language models to predict linguistic competencies and psychometric properties, highlighting the capabilities and limitations of current models. The potential for NLP to predict human performance on reasoning tasks introduces efficiencies in psychometrics. Similarly, Pansare et al. [34] combine the Myers-Briggs Type Indicator with machine learning algorithms to predict personality traits based on linguistic patterns. This approach aids in selecting candidates with suitable personalities for specific roles, showcasing the applicability of NLP in job profiling.
In educational contexts, Blandin et al. [28] propose using innovative machine-learning approaches to determine the age-appropriateness of textual content, offering scalable and objective solutions for educational platforms and libraries. Additionally, Dehbozorgi and Mohandoss [30] explore Aspect-Based Emotion Analysis using NLP to correlate students’ sentiments with academic performance. This predictive capability enhances collaborative learning environments by providing insights into student emotions and their impact on educational outcomes.
In the field of talent management, Lin et al. [41] develop approaches to constructing skill graphs for platforms like LinkedIn. These approaches leverage NLP models to predict relationships between skills, facilitating the efficient matching of job descriptions with user profiles. This contributes to personalized career advancement recommendations and better-aligning skills with job market demands.
Applying NLP models to predict behaviors and outcomes spans diverse fields such as education, talent management, psychometrics, personality assessment, knowledge structuring, and healthcare. As NLP models evolve and adapt, their interdisciplinary impact on predictive modeling becomes increasingly apparent, providing valuable insights and efficiencies across various domains.
Among the authors who addressed RQ1, there was a clear trend towards the diverse and innovative use of NLP techniques to extract meaningful insights from textual content. Afshar et al. [27] demonstrated NLP’s adaptability in the medical sector by standardizing clinical narratives. Dehbozorgi and Mohandoss [30] highlighted the intersection of NLP with emotion analysis and educational prediction. A prominent trend is the shift from fundamental text analysis to deeper semantic understanding, as seen in Cao et al. [38], who infused sentiment analysis with user identity, and Blandin et al. [28] who focused on discerning age-appropriate content. Huang et al. [29] and Lin et al. [41] emphasized harmonizing structured taxonomies with raw textual data, bridging the gap between structured and unstructured data. Meanwhile, Elteto et al. [39] illustrated the field’s interdisciplinary nature by combining Bayesian modeling with NLP, enhancing analytical capacities. Laverghetta et al. [31], [33] delve into the semantic intricacies and psychological layers within linguistic expressions. Collectively, these authors highlight the expansive scope, adaptability, and evolving nature of NLP across various disciplines and contexts.
Several key insights emerged in exploring how NLP models are employed to create taxonomies or classifications of skills, addressing RQ2. Lin et al. [41] highlighted the significance of a robust text corpus in constructing skill graphs for platforms such as LinkedIn. Their work not only illustrated the complexities of developing such a system but also addressed the challenges of filtering out irrelevant information in real-world datasets. Meanwhile, Huang et al. [29] introduced an innovative technique for crafting seed-guided thematic taxonomies, utilizing conceptual names as foundational seeds to enrich the taxonomy’s content and deploying relation classifiers to generate hierarchical structures, thereby streamlining the taxonomic framework. Additionally, Jiang et al.’s [40] TaxoEnrich emerged as a promising solution for expanding existing taxonomies, automating the incorporation of new concepts, and demonstrating a level of efficiency superior to many contemporary methods.
In addressing RQ3 on the application of NLP models for predicting behaviors or outcomes, the study found that these models are versatile across several domains. Blandin et al. [28] harnessed machine learning to assess the age-appropriateness of textual content in educational platforms, while Dehbozorgi and Mohandoss [30] employed Aspect-Based Emotion Analysis to correlate student sentiments with academic outcomes. In talent management, skill graphs for platforms like LinkedIn predict skill interrelations, streamlining the matching of job profiles with user profiles. Laverghetta et al. [31], [33] utilized transformer-based models to predict linguistic capabilities in psychometrics, and Pansare et al. [34] combined the Myers-Briggs Type Indicator with machine learning to foresee personality traits from language patterns, optimizing job role fitment. Additionally, Huang et al. [29] and Jiang et al. [40] demonstrated the evolution of knowledge taxonomies through NLP. In contrast, Afshar et al. [27] showcased NLP’s prowess in extracting insights from clinical data. These findings underscore NLP’s potential in diverse predictive modeling applications, ranging from education to healthcare.
The research areas identified in the revised studies highlight a distinct opportunity for refining NLP methodologies explicitly tailored for skill extraction and categorization. For instance, the contributions of Lin et al. [41], and Huang et al. [29] provide foundational steps; however, there remains ample room for continued innovation in constructing dynamic skill graphs. This underscores the need for further exploration in automatic skill extraction and categorization from diverse textual sources, as well as how NLP can be leveraged to identify emerging or niche skills that may not yet be categorized in established taxonomies.
While the studies under review emphasize the predictive capabilities of NLP models in various contexts, there remains ample room for exploring more refined models that precisely predict skill relevancy and evolution. As industries undergo continuous transformation, specific skills become obsolete while others rise in importance. NLP models that forecast such shifts based on extensive textual data, such as World Economic Forum reports detailing the skills needed to perform specific jobs, would offer immense value to educators, employers, and professionals.
Multilingual NLP can be crucial in defining the future skills for a global and dynamic workforce [47]. However, we found no relevant multilingual NLP studies related to skills acquisition. We attribute this to two main reasons. First, materials on skills acquisition, including academic research, industry reports, and educational content, are predominantly available in English. Second, forgoing a multilingual approach increases accuracy and performance when dealing with a single language, typically English. By concentrating on the English language, NLP models can be fine-tuned to understand the context more effectively, leading to more accurate extraction and analysis of skills-related information.
The limitations of this study are underscored by the diverse array of approaches evident in the data extraction process, as depicted in Figure 4, which illustrates the number of libraries or frameworks utilized in the analyzed articles. The heterogeneity in the selection of technological tools introduces variability in the results and may impact the comparability between different studies. Furthermore, the visualization presented in Figure 5, detailing the NLP techniques employed, highlights the necessity to address the complexity inherent in the diverse methodological approaches. Another significant limitation relates to the domains covered by the research, as depicted in Figure 3 where application areas are identified; the omission of specific relevant disciplines may leave areas of interest unexplored. Additionally, the limitation in the number of authors examined in the review, which was narrowed to focus specifically on works relevant to this study, may impact the breadth and depth of the analysis conducted.
Another inherent limitation of this study manifests in the restricted number of papers included in the review. While this restriction was necessary to maintain a selective approach, it may hinder the representativeness of the overall findings. Nonetheless, this limitation presents an opportunity for future research endeavors. Expanding the sample to include a more significant number of papers could enhance the richness and validity of the study. Although we recognize the current restriction in terms of the number of documents analyzed, we also recognize it as a clear opportunity for the development and expansion of future research.
Conclusion and Future Work
This analysis presents an encompassing overview of the current landscape of NLP in skill extraction, prediction, and taxonomy construction. The findings collectively underscore the adaptability and depth of NLP across various disciplines, from clinical narratives to talent management and education. Authors have demonstrated a paradigm shift from basic text analytics to profound semantic understanding, bridging the gap between raw data and structured taxonomies. The interdisciplinary nature of NLP, its expansive scope, and its evolving capabilities stand as a testament to the field’s potential to shape the future of knowledge management and information extraction.
Throughout this investigation, the potential of NLP in crafting dynamic taxonomies of skills becomes evident. However, the real novelty lies in the opportunities for further refinement and expansion of these techniques, particularly in predicting skill evolution and relevancy within a rapidly changing job market. NLP’s capability to forecast emerging or niche skills from vast textual datasets hints at its future potential in shaping educational strategies, informing hiring practices, and professional development pathways.
Future research should prioritize refining NLP methodologies specifically designed for skill extraction and categorization to address the evolving needs of a dynamic workforce. For example, one key direction is the development of more robust models for Named Entity Recognition (NER) to accurately identify and extract emerging skills and occupations from diverse textual sources, including unstructured data like job descriptions, academic publications, and industry reports. This would require advancing NER models to capture contextual nuances, particularly for niche skills that may not be captured in existing taxonomies.
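As a small, assumption-laden sketch of the kind of skill and occupation NER discussed here, the example below uses spaCy's EntityRuler with hand-written patterns; the labels and patterns are purely illustrative, and a production system would instead rely on a trained statistical or transformer-based NER component.

```python
# Rule-based skill/occupation NER sketch with spaCy's EntityRuler (illustrative).
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "SKILL", "pattern": "machine learning"},
    {"label": "SKILL", "pattern": "data analysis"},
    {"label": "OCCUPATION", "pattern": "data scientist"},
])

doc = nlp("The data scientist role requires machine learning and data analysis.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('data scientist', 'OCCUPATION'), ('machine learning', 'SKILL'), ('data analysis', 'SKILL')]
```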
Building on this, another avenue of future work is constructing dynamic skill graphs that model the evolution of skills and occupations over time. This can be achieved by integrating NLP with longitudinal datasets to detect patterns in skill development, transitions across roles, and shifts in industry demands. In particular, predictive modeling techniques can be applied to forecast skill evolution, offering insights into which skills will grow in demand or become obsolete. This is crucial for answering the third RQ, which concerns the ability of NLP models to predict and adapt to emerging trends in skills and occupations.
In our initial work, we have begun designing an architecture incorporating both NER and relation extraction models. The NER component focuses on identifying unique entities, such as skills and occupations, in unstructured text, while the relation extraction model infers relationships between these entities, helping to discover connections that might not be explicitly stated. Future work will build on this foundation by exploring methods for automating the discovery of new skill-occupation relationships. For instance, relation extraction models could be enhanced to identify and predict associations between emerging skills and the industries most likely to require them, contributing to the forecasting of industry trends.
Moreover, the predictive potential of these models should be evaluated using real-world datasets, such as the World Economic Forum’s reports on future job skills, to ensure that the models can deliver actionable insights. This line of research would benefit educational institutions and organizations looking to stay ahead of workforce trends by anticipating skills that will be in high demand and adjusting curricula and training programs accordingly. Businesses could also leverage such models to better inform workforce planning, helping them align their recruitment strategies with emerging skill requirements.
The need for this architecture is underscored by the increasing volume of unstructured text in reports, articles, and other media, providing a rich source of information for uncovering new insights about occupations and skills. As industries evolve and new skills emerge, the ability of NLP models to predict these changes will become increasingly important, making this an exciting and necessary area of continued research.
ACKNOWLEDGMENT
The authors would like to thank the Vicerrectoría de Investigación y Posgrado, the Research Group of Advanced Artificial Intelligence, and the Cyber Learning and Data Science Laboratory of Tecnológico de Monterrey.