Introduction
Over the last decade, the surge in user-generated content (UGC) across online social platforms has been unprecedented. Platforms like Flickr, Twitter, and Facebook witness the upload of hundreds of thousands of posts every minute, leading to a wide variance in user engagement metrics such as views, likes, and comments. This variance has sparked significant interest in analyzing and modelling the factors that drive social media content to become popular. Popularity prediction, which aims to forecast the level of user interaction with specific posts, is critical for various applications, including content recommendation systems, targeted advertising, and enhancing information retrieval effectiveness. However, accurately predicting social media popularity is a challenging task due to the multitude of factors at play, including but not limited to the content of the image or post, captions, user profile characteristics, and the timing and location of the post. The different impacts of various data modalities and the intricate relationships between them further compound the complexity.
Numerous studies have focused on this task. Several early works explored traditional machine learning techniques, such as support vector regression [1], and co-factorization machines [2], for predicting social media popularity. These methods often relied on manually engineered features and lacked the ability to automatically learn complex representations from raw data. A significant number of studies have focused on manual feature engineering, extracting relevant features from various sources, such as image content [3], sentiment [4], user traits, and social context. These handcrafted features were then used as input to machine learning models for popularity prediction. While manual feature engineering can capture domain-specific knowledge, it is time-consuming and may not comprehensively represent all relevant information. With the recent advancements in deep learning, numerous works have explored the use of deep neural networks for social media popularity prediction. These include convolutional neural networks (CNNs) [5], [6], recurrent neural networks (RNNs) [6], and transformers [7]. Deep learning techniques have the ability to automatically learn representations from raw data, capturing complex patterns and relationships that are difficult to encode manually. Given the multimodal nature of social media data (e.g., images, text, metadata), several studies have investigated techniques for fusing different modalities. These include late fusion strategies [8], [9], attention mechanisms [6], and variational information bottleneck frameworks [10]. Multimodal fusion aims to leverage complementary information from various data sources to improve prediction performance. To capitalize on the strengths of both manual feature engineering and deep learning, some works have proposed hybrid models that combine manually engineered features with deep learned representations [11], [12]. 
These hybrid approaches demonstrate the potential of integrating domain knowledge with the powerful representation learning capabilities of deep neural networks.
However, traditional machine learning methods and even some deep learning approaches rely on manually engineered features, which may not capture the full complexity and nuances of social media data. Furthermore, a pervasive oversight in many popularity prediction methodologies is the failure to consider the relationship between field names (labels or identifiers) and their corresponding field values (actual data) within social media data structures. As an illustrative example, consider the social media metadata “Uid”: “25893@N22”, “Pid”: “565381”, “Geoaccuracy”: “16”, “Title”: “Beautiful tree.”. Previous methods have typically leveraged only the field values (i.e., “25893@N22”, “565381”, “16”, and “Beautiful tree.”) while disregarding the associated field names (i.e., “Uid”, “Pid”, “Geoaccuracy”, and “Title”) during feature extraction and modeling. Consequently, these approaches merely learn the feature value (e.g., “16”) without contextualizing its semantic significance as an attribute (e.g., “Geoaccuracy”) of the post. Neglecting this relationship between identifiers and their associated data gives rise to several limitations in predicting the popularity of social media posts: (1) Loss of context: Field names provide essential context for understanding the meaning and significance of the associated field values, without which the data may become ambiguous or open to misinterpretation. (2) Incomplete feature representation: Field names often reveal important aspects or dimensions of the data relevant to predicting social media popularity. Disregarding them can lead to overlooking or underrepresenting key features in the modelling process. (3) Difficulty in feature engineering: Field names can guide the process of selecting, transforming, and combining relevant features; without them, it is more challenging to identify and extract meaningful features. (4) Lack of domain knowledge integration: Field names often incorporate domain knowledge or conventions specific to social media platforms or data sources. Ignoring this knowledge can result in models that fail to capture important nuances or patterns inherent in the data.
To address these limitations, this paper proposes to transform social media metadata into semantically enriched and contextualized text and to predict social media popularity with a Large Language Model (LLM). Our approach has two steps. First, we expand the field names and field values and transform them into a text representation with rich semantic content, tailored to the context of social media posts. For example, consider the sample above. Through the proposed reconstruction process, this metadata is transformed into a textual representation that conveys the implicit meaning and context, such as: “A social media user (25893@N22) has uploaded a post (565381) with a geo-accuracy of 16, titled ‘Beautiful tree.’” The expanded field names provide semantically rich context and domain knowledge for the field values, effectively linking different field values together. Second, we capitalize on the strengths of Low-Rank Adaptation (LoRA), applied to pre-trained LLMs. LoRA trains only a small number of parameters, thus maintaining the general capabilities of the LLM while equipping it to interpret and integrate the newly contextualized social media text. The integration of LoRA into LLMs facilitates a deeper and more nuanced understanding of the semantically enriched and contextualized text, surpassing the limitations of conventional metadata-based approaches. We summarize our contributions as follows:
We introduce an innovative method for predicting social media popularity using Large Language Models (LLMs) enhanced with Low-Rank Adaptation (LoRA), capable of interpreting and learning from semantic-enriched and contextualized social media text.
We propose a novel approach for transforming social media metadata into text with enriched semantic and contextual content, providing a more narrative and information-rich representation of the social media data.
We validate the proposed method on the SMPD dataset, achieving state-of-the-art results, and demonstrating both theoretical innovation and practical effectiveness.
Related Works
A. Traditional Machine Learning Models
In recent years, a significant amount of research has been devoted to understanding the predictors of popularity on various social media platforms, including text-based [11], [13], video-based [1], [5], and photo-based [3], [14], [15] media. The common approach across these studies is a two-step process: first extracting pertinent features from the content (tweets, photos, or videos), and then applying a regression model to estimate popularity scores. This paper focuses on the popularity of social media photos and reviews the related research, including the application of deep learning models to popularity forecasting. Notably, many popularity prediction studies have focused on a single type of content, such as text, neglecting the potential insights from integrating multiple content types, which could enhance prediction accuracy. We start with a concise review of the literature on social media popularity prediction. Mathioudakis and Koudas [16] explored popularity trends by clustering bursty keywords on Twitter. Rsheed and Khan [17] developed a model that predicts the popularity of news articles on Twitter by categorizing article features as internal or external and then applying a decision tree. Other researchers, such as Wu et al. [14], have focused on predicting photo popularity through regression analysis on image content and contextual data surrounding the user, with further studies [4], [18] underscoring the significance of a photo’s aesthetic appeal in predicting its popularity. We next examine the use of convolutional neural networks (CNNs) for prediction tasks.
B. Deep Learning Models
CNNs have achieved remarkable success in complex tasks such as classification, object detection, and segmentation. Advances in deep learning and transfer learning have simplified the adaptation of sophisticated pre-trained models to specific problems: the weights of a pre-trained model are reused, and only a few layers are adjusted to meet the needs of the target problem. Deep learning models such as VGGNet (including VGG19), ResNet, and Inception have demonstrated excellent performance in image classification tasks [19], [20], [21]. By modifying these models, many works [6], [12], [22], [23], [24], [25], [26] apply them to the prediction of social media photo popularity. Lastly, we discuss the role of multimodal information in prediction tasks.
Predicting social media popularity has been extensively studied, with researchers exploring multimodal data and deep learning techniques. Reference [6] proposed a multimodal framework using CNNs for visual features, RNNs for textual features, and an attention mechanism to capture important regions. Reference [12] combined visual features from CNNs and textual features using word embeddings, feeding them into an XGBoost model. Reference [8] fused visual (CNNs), textual (word embeddings, topic modeling), and metadata features using a late fusion strategy and a neural network. Reference [9] also fused multiple features, investigating different fusion strategies like concatenation and attention. Reference [25] combined visual, textual, and metadata features through late fusion and transfer learning. Recently, [7] proposed a contrastive vision-and-language transformer model, aligning image and text representations through contrastive learning for popularity prediction.
Several models leverage advanced machine learning and deep learning techniques to address this challenge. ViralGCN uses a temporal-spatial convolutional approach to predict viral content spread by analyzing cascade graphs. It integrates both structural and temporal data, offering insights into how content goes viral. Another framework uses social media data, particularly tweets, to analyze urban mobility. NLP techniques classify and analyze travel-related messages in real time, aiding in traffic management and urban planning.
In fake news detection, an AI-assisted deep NLP model classifies news as real or fake based on content, user behavior, and publisher credibility. This model outperforms traditional methods, focusing on social features for better accuracy.
Lastly, machine learning classifiers have been used to predict individual popularity on social media by analyzing comments with NLP tools, showing the effectiveness of combining data mining and classification for social media insights.
These works demonstrate the importance of multimodal data and advanced deep learning techniques, employing CNNs, RNNs, transformers, attention mechanisms, and fusion strategies to capture relevant information for accurate popularity prediction.
C. Large Language Models in Text Analysis
The emergence of Large Language Models (LLMs) like GPT (Generative Pre-trained Transformer) has marked a paradigm shift in text analysis. These models, built upon deep learning architectures, have demonstrated remarkable capabilities in understanding and generating human-like text. Radford et al. [27] illustrated the potential of GPT-2 in generating coherent and contextually relevant text, underscoring the possibilities these models offer for complex text analysis tasks.
LLMs [28] are trained on extensive datasets, enabling them to capture a wide array of linguistic patterns and styles. This training allows them to contextualize and analyze text in a way that mimics human understanding, which is particularly valuable in deciphering the multifaceted and often ambiguous nature of social media content.
D. LoRA Technology and Its Applications
Low-Rank Adaptation (LoRA) [29] represents a significant advancement in the field of machine learning, particularly in fine-tuning large models like LLMs. LoRA enables the efficient adaptation of pre-trained models to specific tasks without the need for extensive retraining. This is particularly advantageous in applying LLMs to niche domains, where the availability of large, domain-specific training datasets might be limited.
In social media analysis, LoRA can fine-tune LLMs to better understand the unique linguistic and contextual nuances of different platforms and content types. This targeted adaptation enhances the model’s ability to extract meaningful insights from social media data, making it more effective for tasks like popularity prediction.
Methodology
This study introduces a novel approach for predicting the popularity of social media content, utilizing transformed semantic-enriched data alongside Low-Rank Adaptation (LoRA) technology for the fine-tuning of large pre-trained language models. Figure 1 shows that the proposed method comprises two key steps: (1) Transformation of metadata. (2) LoRA-based finetuning.
Figure 1. Overview of the proposed method, which includes the transformation of social media metadata and LoRA-based fine-tuning.
A. Transformation of Metadata
The initial phase of our methodology involves preprocessing the raw social media data to transform them into a format rich in semantic information and suitable for LLM input. In this paper, we use the Social Media Prediction Dataset (SMPD) to validate the proposed method.
The metadata generated on social media is generally stored in structured formats for easy reading. As shown in the upper part of Table 1, these social media metadata are saved with field names and field values. In social media metadata, field names and field values play crucial roles in organizing, categorizing, and making sense of the vast amounts of data generated by users and their interactions. Understanding the role of each is key to predicting the popularity of social media effectively. However, the roles of field name and field value are different in modelling. The field values feed the predictive model with concrete data that it can learn from. By analyzing these values, the model can identify patterns and relationships between various characteristics and the popularity outcome, allowing it to predict how likely new or unseen content will achieve a certain level of popularity. In contrast, as shown in Table 3, each field name has a special meaning. Field names guide the selection and engineering of features that are used in predictive models. They help in structuring the dataset and ensuring that the relevant aspects of the social media metadata are utilized in analyzing the popularity of the content.
The existing works [5], [24], [30] predominantly rely on field values alone to predict the popularity of social media content. In contrast, our novel approach involves unfolding the field names and merging them with the corresponding field values, thereby transforming social media metadata into semantically enriched and contextual text representations. This process addresses the critical challenge of semantic sparsity in raw social media data, enhancing the model’s ability to understand and predict content popularity. Our semantic enrichment process follows these key principles: (1) Use Descriptive Field Names: We expand abbreviated field names to their full descriptive forms using a consistent naming convention. This provides more context and clarity to the Large Language Model (LLM) during fine-tuning, enabling it to better grasp the significance of each field. (2) Preserve Data Types and Formats: We maintain the original data types and formats of the field values (e.g., strings, numbers, timestamps, booleans) when constructing the text. This preservation helps the LLM better understand the nature of the information being represented and its relevance to popularity prediction. (3) Design Additive Prompts: We provide specific instructions or questions for the LLM to address, incorporating descriptive field names in context-specific prompts. These prompts are distinctly separated from the input text, guiding the LLM’s interpretation and expansion of the metadata.
The expanded text representation consists of two main components: 1. Instruction: This component provides a clear directive to the LLM, guiding it on how to interpret and process the given input. The instruction is designed to elicit accurate popularity predictions based on the semantically enriched text. 2. Input: This is the core of our semantic enrichment process. It consists of a coherent text where field names have been expanded and contextually integrated with their corresponding values. This text presents a comprehensive, semantically connected description of the social media post, incorporating all relevant metadata in a natural language format. As illustrated in the latter section of Table 1, this study presents two distinct templates for data transformation. The first template adheres to a systematic approach for expanding abbreviated field names into their descriptive counterparts, as defined in Table 3. This process involves employing a uniform naming convention that aligns with the detailed descriptions of field names provided. Subsequently, the expanded field names are integrated with their corresponding field values, forming a coherent key-value structure. In addition, a special prompt is developed to facilitate clear and concise task descriptions. Contrastingly, the second template adopts a more liberal approach, crafting descriptive forms that seamlessly blend the field names with their respective values. This method aims to provide a more nuanced and free-flowing representation of the metadata, enhancing the interpretability of the task descriptions.
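The two templates described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the expanded names in `EXPANDED_NAMES`, the wording of `INSTRUCTION`, and all function names are hypothetical, since Table 3 and the exact prompt are not reproduced in this text. The narrative wording follows the example sentence given in the Introduction.

```python
# Hypothetical expansions of abbreviated field names (stand-in for Table 3).
EXPANDED_NAMES = {
    "Uid": "user id",
    "Pid": "post id",
    "Geoaccuracy": "geographic accuracy",
    "Title": "title",
}

# Hypothetical instruction text; the prompt used in the paper may differ.
INSTRUCTION = "Predict the popularity score of the following social media post."


def template_key_value(meta):
    """Template 1 (sketch): expanded field names paired with their values."""
    return ", ".join(f"{EXPANDED_NAMES.get(k, k)}: {v}" for k, v in meta.items())


def template_narrative(meta):
    """Template 2 (sketch): field names and values blended into one sentence."""
    return (
        f"A social media user ({meta['Uid']}) has uploaded a post ({meta['Pid']}) "
        f"with a geo-accuracy of {meta['Geoaccuracy']}, titled '{meta['Title']}'"
    )


def build_example(meta, template):
    """Combine the instruction and the transformed input into one training record."""
    return {"instruction": INSTRUCTION, "input": template(meta)}


sample = {
    "Uid": "25893@N22",
    "Pid": "565381",
    "Geoaccuracy": "16",
    "Title": "Beautiful tree.",
}
record_kv = build_example(sample, template_key_value)
record_narrative = build_example(sample, template_narrative)
```

Keeping the instruction separate from the input mirrors the two-component structure (Instruction and Input) described above, so either template can be swapped in without changing the prompt.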
B. Low-Rank Adaptation (LoRA) Fine-Tuning
After preparing the semantically enriched dataset, we proceed to fine-tune a pre-trained language model using Low-Rank Adaptation (LoRA). LoRA introduces minimal modifications to the model’s architecture, allowing for efficient adaptation to the task at hand while preserving the model’s generalizability. The cost-effectiveness of this methodology is evident when compared to the alternative of developing a new model specifically tailored for popularity prediction. Building a new model would require not only extensive data collection and training but also significant computational resources for training processes that might span multiple GPUs over weeks or even months. In contrast, adapting an existing pre-trained model using LoRA can be accomplished with a fraction of these resources, making it a more efficient and sustainable option.
The fine-tuning process is outlined as follows:
1) Model Selection
For this study, we select the Qwen Large Language Model [28], a large pre-trained language model renowned for its strong performance across various natural language tasks.
2) LoRA Integration
We integrate LoRA by introducing low-rank matrices $\mathbf{L}$ and $\mathbf{R}$ into the attention computation:
\begin{equation*} \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{(\mathbf{Q} + \mathbf{LQ})(\mathbf{K} + \mathbf{RK})^{T}}{\sqrt{d_{k}}}\right)(\mathbf{V} + \mathbf{RV}). \tag{1}\end{equation*}
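To make the low-rank idea concrete, the sketch below follows the general LoRA formulation of [29], in which a frozen weight matrix W is augmented by a scaled product of two small matrices, W + (alpha / r) · B · A. This is an illustrative assumption about the mechanics and may not match the notation of Eq. (1) term for term. The function names and the hidden size of 4096 are hypothetical; the rank of 8 is the value reported in the Implementation Details.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merged_weight(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A); W itself stays frozen during training."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_param_counts(d_out, d_in, rank):
    """Parameter count of a full weight matrix versus its two LoRA factors."""
    return d_out * d_in, rank * (d_in + d_out)


# Toy 2x2 example with rank 1: B is (d_out x r), A is (r x d_in).
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[0.5], [0.0]]
a = [[1.0, 2.0]]
merged = lora_merged_weight(w, b, a, alpha=2, rank=1)

# With B initialised to zero, the adapted weight equals the pre-trained one.
unchanged = lora_merged_weight(w, [[0.0], [0.0]], a, alpha=2, rank=1)

# Hypothetical 4096 x 4096 projection adapted with rank 8.
full_params, lora_params = lora_param_counts(4096, 4096, 8)
```

Under these assumptions, the rank-8 factors hold 65,536 trainable parameters against 16,777,216 in the full projection, which illustrates why the adaptation is inexpensive.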
3) Training
The semantically enriched dataset, consisting of the transformed social media metadata along with the corresponding popularity metrics (e.g., engagement rates), is fed into the Qwen model modified with LoRA. The training objective is to minimize the discrepancy between the model’s predicted popularity scores and the actual popularity metrics associated with each social media post. This is typically achieved by optimizing a loss function such as cross-entropy loss:
\begin{equation*} \mathcal{L}_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_{i} \log \hat{y}_{i} + (1 - y_{i}) \log (1 - \hat{y}_{i}) \right], \tag{2}\end{equation*}
where $N$ is the number of training samples, $y_{i}$ is the ground-truth value, and $\hat{y}_{i}$ is the predicted value for the $i$-th sample.
4) Hyperparameter Tuning
To optimize the model’s performance on the popularity prediction task, we systematically adjust relevant hyperparameters such as the learning rate
By leveraging the semantically enriched dataset and the efficient LoRA fine-tuning approach, we aim to adapt the Qwen LLM to effectively capture the nuances and patterns present in social media metadata, enabling accurate predictions of content popularity. The LoRA integration allows for task-specific adaptation while preserving the model’s general knowledge and capabilities, potentially leading to improved performance and generalization on the popularity prediction task.
C. Implementation Details
In this research, instruction-based supervised fine-tuning was adopted as the core training methodology, complemented by the implementation of Low-Rank Adaptation (LoRA) as the fine-tuning mechanism. The LoRA framework was specifically applied to modulate the parameters associated with the query and value projections within the model’s architecture. A batch size of 16 was selected for training, with gradient accumulation steps fixed at one to maintain a balance between computational efficiency and training stability. Furthermore, the learning rate followed a cosine scheduler, initiating at 5e-5, to gradually adjust the learning rate in a manner that promotes convergence. The entire training regimen spanned eight epochs, a duration deemed sufficient for the model to adapt to the nuances of the popularity prediction task.
To augment the specificity of the LoRA technique, additional hyperparameters were meticulously calibrated. The LoRA alpha parameter was set to 16, indicating the degree of modification applied to the attention mechanism, while the dropout rate was maintained at 0.0, signifying the absence of dropout regularization during the LoRA adaptation process. Crucially, the rank of the LoRA matrices was established at 8, reflecting a strategic balance between model complexity and the granularity of adaptation.
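The reported settings can be collected into a small configuration, together with a sketch of a cosine learning-rate schedule. The hyperparameter values come from the text above; the rest is assumed for illustration: the schedule is taken to have no warm-up and to decay to zero, the last partial batch of each epoch is kept, the training-set size of 275,051 is the figure given in the Dataset subsection, and the names `CONFIG`, `total_steps`, and `cosine_lr` are hypothetical.

```python
import math

# Values reported in the Implementation Details.
CONFIG = {
    "batch_size": 16,
    "grad_accum_steps": 1,
    "initial_lr": 5e-5,
    "epochs": 8,
    "lora_rank": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.0,
}


def total_steps(num_samples, config):
    """Optimizer steps over all epochs, keeping the last partial batch (assumption)."""
    effective_batch = config["batch_size"] * config["grad_accum_steps"]
    return math.ceil(num_samples / effective_batch) * config["epochs"]


def cosine_lr(step, total, initial_lr):
    """Cosine decay from initial_lr to zero, with no warm-up (assumption)."""
    progress = min(max(step / total, 0.0), 1.0)
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * progress))


steps = total_steps(275_051, CONFIG)
lr_start = cosine_lr(0, steps, CONFIG["initial_lr"])
lr_middle = cosine_lr(steps // 2, steps, CONFIG["initial_lr"])
lr_end = cosine_lr(steps, steps, CONFIG["initial_lr"])
```

Under these assumptions the schedule starts at 5e-5, passes through half that value at the midpoint of training, and reaches zero at the final step.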
Experiment
A. Dataset
To validate the efficacy of our proposed approach in real-world scenarios, we leverage the Social Media Prediction Dataset (SMPD), a comprehensive corpus encompassing 486,000 social multimedia posts from 70,000 users. The SMPD aggregates multifarious social media data, including anonymized photo-sharing records, user profiles, web images, textual content, temporal metadata, geolocation information, and categorical attributes. This multifaceted, large-scale, and temporally-aware dataset was curated from Flickr, one of the largest photo-sharing platforms, facilitating a representative characterization of real-world social multimedia dynamics. For the time-series forecasting task, the dataset was partitioned into distinct training and testing subsets, adhering to the conventional chronological separation paradigm, typically delineated by date and time attributes. Table 2 summarizes the statistical characteristics of the SMPD. From the available dataset (305,613 samples), we randomly selected 90% of the samples (275,051) as the training dataset and 10% of the samples (30,562) as the test set. Figure 2 shows the word cloud of the SMPD.
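A random 90/10 partition of the kind described above can be sketched as follows. The seed and the function name `random_split` are assumptions for illustration; the sampling procedure actually used is not specified beyond being random.

```python
import random


def random_split(num_samples, train_fraction=0.9, seed=0):
    """Shuffle sample indices and split them into training and test index lists."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # fixed seed (assumption) for repeatability
    n_train = int(num_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = random_split(305_613)
```

With 305,613 available samples, truncating the 90% share to an integer gives subset sizes that match the counts reported above.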
B. Evaluation Metric
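For our experimental evaluation, we adopt Spearman’s rank correlation coefficient (SPR), Mean Absolute Error (MAE), and Mean Squared Error (MSE) as evaluation metrics.
Spearman’s rank correlation coefficient ($\rho_{s}$) is computed as: \begin{equation*} \rho_{s} = 1 - \frac{6\sum_{i=1}^{n}d_{i}^{2}}{n(n^{2}-1)}, \tag{3}\end{equation*} where $d_{i}$ is the difference between the ranks of the $i$-th predicted and actual values, and $n$ is the number of samples.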
MAE quantifies the average magnitude of errors in a set of predictions, irrespective of their direction. It is computed as the arithmetic mean of the absolute differences between predicted and actual values, with all individual differences weighted equally. Using the SciPy library, MAE is formulated as:\begin{equation*} \text { MAE} = \frac {1}{n}\sum _{i=1}^{n}|y_{i} - \hat {y}_{i}|. \tag {4}\end{equation*}
MSE is a quadratic scoring function that also measures the average magnitude of errors, albeit by squaring the differences between predicted and actual values. We utilize the Scikit-learn library to compute this regression loss metric, with MSE defined as:\begin{equation*} \text { MSE} = \frac {1}{n}\sum _{i=1}^{n}(y_{i} - \hat {y}_{i})^{2}. \tag {5}\end{equation*}
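Equations (3) to (5) can be written out directly in plain Python, as in the sketch below. It is an illustration of the formulas, not the library routines used in the experiments; the rank-difference form of Eq. (3) is exact only when there are no tied values, which the sketch assumes. The function names and the toy data are hypothetical.

```python
def ranks(values):
    """Assign ranks 1..n by ascending value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(y_true, y_pred):
    """Eq. (3): 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(y_true)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(y_true), ranks(y_pred)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def mae(y_true, y_pred):
    """Eq. (4): mean of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Eq. (5): mean of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.0, 3.5, 4.5]
rho = spearman(y_true, y_pred)
mean_abs = mae(y_true, y_pred)
mean_sq = mse(y_true, y_pred)
```

In the toy data the first two predictions swap ranks, so the correlation drops below 1 even though the absolute errors are small, which shows why a rank-based metric complements MAE and MSE.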
C. Main Results
The main results presented in Table 4 compare the performance of the proposed SMP-LLM method with several state-of-the-art approaches for social media popularity prediction. The evaluation metrics used are Spearman’s rank correlation coefficient (SPR), Mean Absolute Error (MAE), and Mean Squared Error (MSE).
Among the existing methods, Multi-F [8] achieves the highest SPR of 0.7773, indicating a strong positive correlation between the predicted and actual popularity values. It also has a competitive MAE of 1.1354 and an MSE of 2.4351. VTFX [12] has the next highest SPR of 0.7630, followed by the Multi-model Approach [25] and TTC-VLT [7], both with an SPR of 0.7500.
The existing methods, such as Multi-F [8], VTFX [12], the Multi-model Approach [25], and TTC-VLT [7], rely on both image and social media metadata as input data. These methods employ multiple modules to extract features from the different modalities, which can be a time-consuming and challenging process due to the extensive feature engineering involved. In contrast, the proposed SMP-LLM method takes a different approach by solely utilizing the social media metadata as input. Instead of relying on explicit feature extraction, the SMP-LLM method transforms the metadata into semantic-rich and contextual text representations, which are then fed into large language models (LLMs) for prediction.
Despite using only a single modality of input data (social media metadata), the SMP-LLM method consistently outperforms the existing multimodal approaches across all evaluation metrics. The Qwen-1.8B model achieves an SPR of 0.8631, an MAE of 0.7336, and an MSE of 1.7780, surpassing the best-performing baseline, Multi-F [8].
The performance of the SMP-LLM method further improves with larger model sizes. The Qwen-7B model demonstrates the best overall performance, with the highest SPR of 0.8866, the lowest MAE of 0.6432, and the lowest MSE of 1.4223 among all methods evaluated.
These results highlight the effectiveness of the SMP-LLM method in leveraging the power of large language models for social media popularity prediction, without the need for explicit feature engineering or multimodal data fusion. By transforming the social media metadata into contextual text representations, the method allows the SMP-LLM to effectively capture the semantic and contextual information relevant to popularity prediction.
The superior performance of the SMP-LLM method, even when using a single modality of input data, showcases its ability to efficiently utilize the rich knowledge encapsulated in large language models, surpassing the performance of multimodal approaches that rely on complex feature engineering and data fusion techniques.
Overall, the proposed SMP-LLM method offers a streamlined and effective approach for social media popularity prediction, demonstrating the potential of leveraging large language models for this task without the need for extensive multimodal feature engineering.
D. Ablation Study
1) The Ablation Study on the Transformed Text
As shown in Table 5, the ablation study investigates the impact of different transformed inputs on the performance of SMP-LLM(Qwen-1.8B) in predicting the popularity score of a social media post. The table presents the results across three evaluation metrics: Spearman’s rank correlation coefficient (SPR), Mean Absolute Error (MAE), and Mean Squared Error (MSE).
Field values: When only the raw field values are provided as input to the SMP-LLM, the performance is the lowest, with an SPR of 0.6432, an MAE of 1.3825, and an MSE of 3.3612. This baseline scenario demonstrates the SMP-LLM’s struggle to effectively understand and process the numerical and categorical data without any additional context or information. Field names and Field values: By including both the field names and their corresponding values as input, the performance improves slightly, with an SPR of 0.7054, an MAE of 1.3261, and an MSE of 2.8820. While the field names provide some context, the SMP-LLM still lacks a comprehensive understanding of the relationships between the data points and their relevance to the popularity score. Expanded field names and Field values: When the field names are expanded to provide more descriptive and contextual information, and these expanded field names are paired with the field values as input, the performance further improves, with an SPR of 0.7642, an MAE of 1.1843, and an MSE of 2.351. This suggests that providing more detailed and explanatory field names aids the SMP-LLM in better comprehending the data and its connection to the target variable. Transformed template 1: As shown in the lower part of Table 1, by transforming the field names and field values into a semantic-rich and contextual text format (Transformed template 1), the SMP-LLM achieves significantly better performance across all three metrics, with an SPR of 0.8524, an MAE of 0.7577, and an MSE of 1.9015. This transformation likely helps the SMP-LLM to better understand the meaning and relationships within the data by presenting it in a more natural language format. Transformed template 2: The best performance is obtained when using the Transformed template 2, which is shown in the lower part of Table 1, a highly contextualized and semantically rich representation of the input data. 
This format presents the data in a coherent narrative form, explicitly stating the relationships between different fields and their relevance to predicting the popularity score. The SMP-LLM achieves an SPR of 0.8631, an MAE of 0.7336, and an MSE of 1.7780, outperforming all other input representations.
The ablation study demonstrates the significant impact of our semantic enrichment process on the SMP-LLM’s performance in predicting social media post popularity. By progressively enhancing the input representation, we observe substantial improvements across all metrics. The baseline scenario using only raw field values shows the lowest performance, highlighting the challenge in interpreting sparse metadata. Our semantic enrichment process, which expands field names and integrates them contextually with field values, shows marked improvement in model performance. This underscores the importance of providing clear, descriptive context for each piece of metadata. The most significant performance boost is observed with our transformed templates, particularly Transformed template 2. This template represents the culmination of our semantic enrichment process, where we integrate expanded field names into a coherent, contextually rich narrative. The superior performance of this approach demonstrates the SMP-LLM’s ability to leverage its natural language understanding capabilities when presented with semantically rich, contextually integrated input. Our method’s effectiveness lies in its ability to transform sparse metadata into a format that closely resembles natural language, allowing the SMP-LLM to better capture intricate relationships within the data. This approach not only improves prediction accuracy but also enhances the model’s ability to generalize across different types of social media posts.
In conclusion, our semantic enrichment process significantly enhances the SMP-LLM’s ability to predict social media post popularity by addressing the challenge of semantic sparsity in raw metadata, resulting in improved performance, better generalization, and enhanced interpretability.
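For reference, the three metrics reported in this ablation can be computed as in the sketch below. It assumes that SPR denotes Spearman's rank correlation between predicted and ground-truth popularity scores, and the toy scores are made up for illustration; they are not drawn from the SMPD.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(y_true, y_pred):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    rt, rp = average_ranks(y_true), average_ranks(y_pred)
    mt, mp = sum(rt) / len(rt), sum(rp) / len(rp)
    cov = sum((a - mt) * (b - mp) for a, b in zip(rt, rp))
    var_t = sum((a - mt) ** 2 for a in rt)
    var_p = sum((b - mp) ** 2 for b in rp)
    return cov / math.sqrt(var_t * var_p)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy scores (illustrative only).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.0, 3.5, 4.0]
print(spearman(y_true, y_pred), mae(y_true, y_pred), mse(y_true, y_pred))
```

A higher SPR indicates better agreement in ranking, whereas lower MAE and MSE indicate smaller errors in the predicted scores themselves.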
2) Zero-Shot Prediction with State-of-the-Art LLMs
In this ablation study, we aim to demonstrate the limitations of utilizing state-of-the-art large language models (LLMs), such as GPT-4 and Claude-3, for directly predicting social media popularity scores. Despite the remarkable language understanding and generation capabilities of these models, their ability to make accurate numerical predictions for complex phenomena like social media popularity remains an open challenge.
To evaluate the performance of LLMs in this task, we apply the transformed template (shown in Table 1) to three distinct instances of social media metadata. Subsequently, we append prompts to these transformed representations, instructing the LLMs to predict the corresponding popularity scores.
As shown in Table 6, in the first sample, GPT-4 predicts a range of 5.76 to 7.29, while Claude-3 predicts 4.83. However, the ground truth label for this sample is 1.0, indicating that both LLMs significantly overestimate the popularity value. Similarly, for the second sample, GPT-4’s prediction range of 8.42 to 10.34 and Claude-3’s prediction of 8.64 deviate substantially from the ground truth label of 6.89, with both models overestimating the popularity. In the third sample, GPT-4’s prediction range of 10.86 to 12.77 and Claude-3’s prediction of 9.17 underestimate the ground truth label of 15.15, further highlighting the inaccuracies in the predictions.
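The size and direction of these deviations can be checked with a short calculation over the three samples quoted above. Because GPT-4 returns a range, the sketch summarizes it by its midpoint; that choice is an assumption made here for illustration and is not part of the evaluation protocol described in the text.

```python
# (ground truth, GPT-4 predicted range, Claude-3 prediction), as quoted above.
samples = [
    (1.0, (5.76, 7.29), 4.83),
    (6.89, (8.42, 10.34), 8.64),
    (15.15, (10.86, 12.77), 9.17),
]


def midpoint(bounds):
    """Summarize a predicted range by its midpoint (an assumption)."""
    low, high = bounds
    return (low + high) / 2


# Signed error = prediction - ground truth; positive means overestimation.
gpt4_errors = [midpoint(rng) - truth for truth, rng, _ in samples]
claude_errors = [pred - truth for truth, _, pred in samples]

gpt4_mae = sum(abs(e) for e in gpt4_errors) / len(gpt4_errors)
claude_mae = sum(abs(e) for e in claude_errors) / len(claude_errors)

for i, (g, c) in enumerate(zip(gpt4_errors, claude_errors), start=1):
    print(f"sample {i}: GPT-4 error {g:+.3f}, Claude-3 error {c:+.3f}")
print(f"mean absolute error: GPT-4 {gpt4_mae:.3f}, Claude-3 {claude_mae:.3f}")
```

With only three samples these figures are illustrative rather than statistically meaningful, but both mean absolute errors (about 3.8) are several times larger than the MAE of 0.7336 reported for the SMP-LLM with Transformed template 2.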
The motivation behind this ablation study is twofold. First, it serves as a baseline evaluation of how well current state-of-the-art LLMs can directly predict social media popularity scores without any task-specific training or fine-tuning. Second, it highlights the inherent limitations of these models when faced with complex numerical prediction tasks that require capturing intricate relationships and dynamics that may not be fully represented in the data used for language-model training. Based on this analysis, our proposed LoRA-based SMP-LLM is a reasonable approach for social media popularity prediction.
Conclusion
This paper proposed a novel approach (SMP-LLM) to predict social media popularity by leveraging Large Language Models (LLMs) adapted with Low-Rank Adaptation (LoRA) to interpret semantic-enriched and contextualized social media text representations. The key innovations are twofold: Firstly, we introduced a method to transform sparse social media metadata into narrative text with enriched semantics and context. Secondly, we adapted pre-trained LLMs using LoRA to enable deep understanding and learning from the transformed semantic-rich social media text. The proposed SMP-LLM approach, trained on transformed semantic text, achieved state-of-the-art performance on the SMPD dataset for social media popularity prediction. This validates the method’s theoretical novelty and demonstrates its practical effectiveness over conventional metadata-based techniques.
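As a reminder of what the LoRA adaptation amounts to, the sketch below shows the standard low-rank update of a single frozen weight matrix, W0 + (alpha / r) * B A. The dimensions, rank, and scaling value are illustrative assumptions and are not the configuration used for the SMP-LLM.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w0, lora_a, lora_b, alpha):
    """Return W0 + (alpha / r) * B @ A, where r is the LoRA rank."""
    r = len(lora_a)                  # A has shape (r, d_in)
    delta = matmul(lora_b, lora_a)   # (d_out, r) @ (r, d_in) -> (d_out, d_in)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


# Illustrative sizes only (assumptions).
d_out, d_in, r = 8, 8, 2
rng = random.Random(0)

w0 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
lora_a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
lora_b = [[0.0] * r for _ in range(d_out)]  # zero init: no change at start

w_eff = lora_effective_weight(w0, lora_a, lora_b, alpha=4.0)

trainable = r * (d_in + d_out)  # parameters in A and B
full = d_in * d_out             # parameters in W0
print(f"trainable LoRA parameters: {trainable} of {full}")
```

Only A and B are updated during fine-tuning, so the pre-trained weights stay intact while the number of trainable parameters grows linearly, rather than quadratically, with the layer width.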
Future Work
In future work, we plan to enhance our model’s capabilities in two key areas. First, we will deepen our exploration of zero-shot prediction with LoRA-adapted large language models (LLMs), aiming to improve how these models generalize from seen to unseen data in dynamic environments like social media. Second, we will broaden the diversity of our data. Despite its extensive scope, the SMPD has certain limitations, primarily its concentration on a single platform and its potential bias towards the demographics predominant among Flickr users. These limitations could affect the generalizability of our findings to other social media platforms with different user bases or content types. We will therefore incorporate multiple datasets from various social media platforms to better understand and adapt to the factors influencing content popularity across different demographic and cultural contexts. We expect these advancements to improve our model’s predictive accuracy and versatility in real-world applications.
ACKNOWLEDGMENT
(Tianjian Chen, Jiang Huang, and Xuetong Wu contributed equally to this work.)