Sentiment analysis based on BiLSTM with attention mechanism on Chinese comment with stickers

As the Internet grows larger and more intricate, users of social media increasingly post comments to express their opinions on those platforms. Analyzing the emotions contained in user comments holds great business value: it helps platforms accurately perceive user consumption habits and improve service levels. However, the use of emoticons and stickers in comments has increased dramatically in recent years, which brings new challenges to text sentiment analysis based on natural language processing. In this paper, to alleviate these problems, we propose a method for analyzing the sentiment of Chinese comments based on BiLSTM with an attention mechanism. Specifically, we partition the original dataset, collected from the Weibo platform, according to the number and type of emoticons in the comments. By analyzing the actual data, we identify the specific features of emoticons that affect the performance of sentiment analysis and give corresponding explanations. In addition, we propose a hypothesis that quantifies the impact of emoticons on model effectiveness. The results demonstrate the effectiveness of the proposed method.


Introduction
Sentiment analysis (SA) is an important field of NLP, and SA technology can classify the emotions expressed in various types of text. Serving as platforms for open expression, social networks can be vital sources of information for analyzing public sentiment. This is particularly relevant for significant public events, as comments can provide crucial insights to inform decision-making [1]. In recent years, with the advancement of software technology and front-end development, people have engaged with social networks increasingly often. Moreover, within the comments they post, the use of emoticons such as stickers and emojis has become exceedingly prevalent. Related studies have found that emoticons often enhance the emotional intensity of textual expression; on occasion, these symbols can combine with text to convey deeper emotional nuances. Emoticons have been empirically shown to enrich the spectrum of emotional expression, making them an equally vital feature of textual communication [2] [3].
Broadly speaking, emoticons refer to symbols visualized as cartoon-like expressions. Fundamentally, however, emoticons fall into several categories: emojis, stickers, and memes. Emojis are determined by unique character codes, much like Chinese characters; the set of emojis available to users can differ significantly by system and version, though their rendered forms remain relatively consistent. Stickers are essentially images: websites or apps render them as cartoon expressions by recognizing special markers embedded in the text. Marker conventions differ by platform; for example, Weibo, Xiaohongshu, and Baidu Tieba wrap the sticker name in square brackets ('[]'), while WeChat and QQ use their own delimiters. Across operating systems and versions, stickers from the same company's products display identically, but the differences between different companies' products are significant, to the extent that even the naming conventions differ. Memes, also image-based, lack textual representation and display identically regardless of dissemination platform; however, they are more closely tied to image processing and carry more intricate information.
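The square-bracket marker convention used by Weibo-style platforms can be parsed with a short regular expression. The sketch below is our own illustration (the function names and pattern are not from the paper); it extracts sticker tokens from a comment and strips them to recover the plain text:

```python
import re

# Weibo-style stickers are embedded in text as "[name]".
STICKER_PATTERN = re.compile(r"\[([^\[\]]+)\]")

def extract_stickers(comment: str) -> list:
    """Return the sticker names appearing in a comment, in order."""
    return STICKER_PATTERN.findall(comment)

def strip_stickers(comment: str) -> str:
    """Remove sticker markers, leaving only the plain text."""
    return STICKER_PATTERN.sub("", comment)
```

Counting `len(extract_stickers(c))` and `len(set(extract_stickers(c)))` gives the sticker quantity and type counts used to partition the dataset later in the paper.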
In past studies, researchers typically used traditional machine learning algorithms or deep learning models to analyze texts with emoticons, especially emojis. Traditional machine learning models address this problem with a carefully labeled sentiment lexicon, whereas deep learning approaches focus on model changes to capture more information, or on training word embeddings tailored to the specific application context [4][5] [6]. Due to the rapid iteration of internet language, manually constructing sentiment lexicons has become prohibitively costly and time-consuming. As a result, semi-supervised and unsupervised approaches for sentiment lexicon construction have emerged [7], which reduce the time cost while maintaining good accuracy. The AEC-BiLSTM model proposed by Hang et al. in 2021 achieves an accuracy of 0.96 on a binary classification dataset of IMDB hotel reviews [8]. Beyond this, there are two common ways to boost model performance. The first is optimizing the word embedding so that models can learn higher-quality information; the CEmo-LSTM model proposed by Liu et al. achieves an accuracy of 0.95 when processing Chinese text containing emoticons [4]. The second is devising novel models, combining models, or adding new mechanisms to an existing model. Researchers have also added an emoji attention mechanism to BiLSTM to analyze sentiment more deeply, reaching an accuracy of 0.87 on a dataset labeled with three types of emotion [9].
However, stickers and emojis overlap significantly in functionality. Many companies have started developing their own sticker sets so that customers can use their products on any operating system; the naming and appearance of many stickers even aim to resemble emojis. The trend of stickers replacing emojis as the internet grows is evident in Asia and is beginning to appear in the West as well [10]. Hence, investigating the impact of stickers on text sentiment analysis is of paramount importance. Nevertheless, prior research has predominantly focused on the influence of emojis, or on devising novel models to better handle text containing emojis, while paying little attention to stickers. In this paper, we build a BiLSTM model with an attention mechanism, divide the dataset according to different features of the stickers, and analyze the effect of these stickers on the model's performance.

Structure of BiLSTM
The Long Short-Term Memory (LSTM) network possesses robust capabilities for addressing long-range dependencies in text, while also mitigating the vanishing-gradient problem [11]. It achieves this by selectively retaining and transmitting information from the text to subsequent layers. LSTM models are frequently employed for prediction and natural language processing (NLP) tasks, such as stock price forecasting [12][13], text sentiment classification, and information extraction [14][15]. As shown in Figure 1, each computational unit of an LSTM employs forget gates, memory gates, and output gates to ensure that features from distant positions in the text can be propagated to later positions, enabling the model to capture and utilize long-range dependencies effectively. Similar to an RNN, the basic recurrence can be formulated as:

$$h_t = f(U x_t + W h_{t-1}), \quad y_t = g(h_t)$$

where x represents the input, y the output, and h the output of the hidden layer; U and W denote the weights for the input x and the previous hidden layer's output, respectively; and f and g are the sigmoid and softmax activation functions. The LSTM computational unit introduces additional components (the cell state, forget gate, input gate, and output gate) to retain and forget features from the text, enhancing its ability to capture and use contextual information:

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$

where i, f, c, and o respectively represent the input gate, forget gate, cell state, and output gate; each produces a vector whose dimension matches the hidden state. W and b stand for weight parameters, and $h_t$ represents the hidden layer output at each time step.
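A single LSTM time step can be sketched directly from the gate equations. The following NumPy sketch is our own minimal illustration (it stacks the four gate pre-activations into one weight matrix `W`, a common implementation trick, not a detail from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps the concatenated [h_prev; x_t] to the
    four stacked gate pre-activations (input, forget, candidate, output)."""
    d = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i = sigmoid(z[0:d])            # input gate
    f = sigmoid(z[d:2 * d])        # forget gate
    g = np.tanh(z[2 * d:3 * d])    # candidate cell state
    o = sigmoid(z[3 * d:4 * d])    # output gate
    c_t = f * c_prev + i * g       # new cell state
    h_t = o * np.tanh(c_t)         # new hidden state
    return h_t, c_t
```

Because the output gate is in (0, 1) and tanh is in (-1, 1), every component of the hidden state stays strictly inside (-1, 1).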
In a bidirectional LSTM, each output of the model is determined by combining the outputs from both directions:

$$O_t = V o_t^f + C o_t^b$$

where $O_t$ represents the final output of the model at each time step, V and C are weight parameters, and $o_t^f$ and $o_t^b$ represent the outputs of the forward and backward passes, respectively.

Attention mechanism
To let the model notice useful information in the text, after features are extracted by the BiLSTM layer the model decides, via the attention mechanism, which time steps are more important, and then integrates these features into a sentiment feature vector for the whole sentence. Let s denote the overall sentiment feature of the entire sentence:

$$s = \sum_{t=1}^{T} a_t$$

where the integer T represents the maximum number of time steps in the sequence, $t \in [1, T]$, and $a_t$ represents the weighted feature at each time step:

$$a_t = \mathrm{softmax}(\mathrm{score}(h_t)) \, h_t$$

The softmax function, as defined, allocates weights to all elements based on the values $z_i \in Z$:

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

The function $\mathrm{score}(h_t)$ maps the hidden layer features at each time step to the [0, 1] interval and multiplies them with the feature extraction matrix w, yielding a sentiment score associated with the textual information at that time step:

$$\mathrm{score}(h_t) = w \, \sigma(h_t)$$

where w is a learnable parameter that, during training, becomes better suited to extracting the emotional features of the text.
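The attention pooling step can be sketched as follows. This is a minimal NumPy illustration of the mechanism described above, using a plain dot-product score `H @ w` as a stand-in for the paper's score function; names and shapes are our own assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def attention_pool(H, w):
    """H: (T, d) matrix of BiLSTM hidden states for one sentence,
    w: (d,) learnable scoring vector.
    Returns the sentence-level feature s and the attention weights."""
    scores = H @ w           # one scalar score per time step
    alpha = softmax(scores)  # attention weights, sum to 1
    s = alpha @ H            # weighted sum of hidden states over time
    return s, alpha
```

The sentence vector `s` then feeds the fully connected classification layer.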

Model training
The loss function for training is the cross-entropy loss, defined as:

$$L = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} p(x_k) \log Q(x_k)$$

where K represents the total number of emotion categories, $p(x_k)$ represents the probability vector of the labeled emotion being $x_k$, $Q(x_k)$ represents the probability vector of the predicted emotion being $x_k$, and m denotes the number of texts included in a single training batch.

Structure of BiLSTM-attention model
The network structure of the BiLSTM-Attention model is shown in Figure 2. Here, $C_{Lt}$ and $h_{Lt}$ represent the cell state and hidden layer output during the forward pass of the model, while $C_{rt}$ and $h_{rt}$ represent the cell state and hidden layer output during the backward pass. The combined hidden layer output $h_t$ is the concatenation of the forward and backward hidden layer outputs. These outputs pass through the attention mechanism to derive the overall sentiment feature of the entire text, then enter a fully connected layer for feature analysis, ultimately producing the classification result.

Data sources and processing
In the whole experiment, the dataset consists of n real Weibo comments with stickers. The longest sample contains 230 characters, the shortest 3 characters. We counted the number and type of stickers in the comments; their distribution is shown in Figure 3. The samples contain at most 65 stickers and at least 1 sticker. The training data comprise 55,000 examples sampled from the overall dataset, balanced 50/50 between positive and negative sentiment; of these, 50,000 form the training set and 5,000 the validation set. The remaining 64,988 instances are allocated to the test set, with 51% positive and 49% negative sentiment. Test set samples contain at most 47 stickers and at least 1 sticker. All samples come from real Weibo comments. The test set is divided according to the quantity and types of emoticons present, and is used to evaluate the performance of the model across its 40 training generations.

Partition of the test dataset
Since texts containing different numbers of stickers occur with very different frequencies in the 64,988-text test dataset, subsets with many stickers but very small sample sizes were merged in order to avoid chance effects caused by too little data. The test data were thus divided into 12 subsets, containing 1, 2, 3, 4, 5, 6, 7, 8-12, 13-16, 17-20, 21-31, and 32-47 stickers respectively. A second division of the same test dataset was constructed by the number of sticker types, yielding subsets containing 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11-15, and 16-47 distinct sticker types.
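The merged binning by sticker count can be sketched as below. The bin edges come from the paper's 12 subsets; the helper names and the bracket-marker regex are our own assumptions:

```python
import re
from collections import defaultdict

# Merged bins for the count-based test split, as listed in the paper.
COUNT_BINS = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7),
              (8, 12), (13, 16), (17, 20), (21, 31), (32, 47)]
STICKER = re.compile(r"\[([^\[\]]+)\]")

def partition_by_count(comments):
    """Group comments into the merged sticker-count bins."""
    buckets = defaultdict(list)
    for c in comments:
        n = len(STICKER.findall(c))
        for lo, hi in COUNT_BINS:
            if lo <= n <= hi:
                buckets[(lo, hi)].append(c)
                break  # each comment lands in exactly one bin
    return buckets
```

The type-based split works the same way, binning on `len(set(STICKER.findall(c)))` instead of the raw count.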

Result analysis

Analysis for different numbers of stickers
Different generations of the attention-enhanced BiLSTM model were applied to each test dataset, yielding the results shown in Figure 4. Comparing across generations and averaging each generation's performance for the same sticker quantity, the model's accuracy exhibits a pattern of initial increase followed by decline. This shows that the quantity of stickers has a discernible adverse impact on the accuracy of the model's judgments, whether one considers individual generations or overall aggregate performance. Comparing models of different generations on the same sticker test dataset, it becomes apparent that for test sets with fewer stickers, the number of training epochs the model needs to reach its best performance first increases and then decreases. This reflects the fact that, for training, a greater abundance of stickers usually signifies more intricate information, so the model requires more iterations to optimize its parameters for these added complexities.
It is worth noting, however, that in both horizontal and vertical comparisons, the model's performance on the test sets with higher sticker counts (17-20, 21-31, and 32-47 stickers) tends to converge. For instance, the performance of models on the 21-31 and 32-47 sticker test sets, and even on the set with just 1 sticker, is remarkably similar: each ultimately achieves nearly perfect accuracy. Notably, the convergence pattern on the 32-47 sticker test set closely resembles that of the 1-sticker scenario. This offers intriguing insight into how the model handles varying levels of sticker complexity within the dataset, which is relevant to practical applications.

Analysis for different types of stickers
Through an analysis of the dataset, we found that within natural Weibo data, samples with a notably high number of stickers often employ the same sticker repeatedly. This indicates that while the quantity of stickers can serve as an indicator of information complexity, it is not the sole influencing factor: the specific variety of stickers present in a sentence may also affect model effectiveness. Partitioning the test data by the number of distinct sticker types per sentence, the model yields the results shown in Figure 5. When the data contain only a single type of sticker (which may appear 2, 3, or more times within a sentence), the model's final accuracy is also remarkably close to 1, though still marginally below the accuracy on test sets featuring only one sticker. This further supports the conjecture that the quantity of stickers influences the model's efficacy to a certain extent.
In the horizontal comparison, the model's accuracy again exhibits an initial decrease, followed by a period of stability, and ultimately an ascent. Nonetheless, the overall accuracy declines noticeably compared to the division based on sticker quantity. While accuracy remains consistently above 0.9 under the division by sticker quantity, the lowest accuracy under the division by sticker type is only 0.853. This underscores that the variety of stickers is a pivotal factor influencing accuracy, with a more pronounced impact on model performance.
Although the model's accuracy does show some recovery on the test sets with a greater variety of sticker types, there exists an inherent margin of error.This is particularly evident in the 11-15 types test set with only 26 instances, and the 16-47 types test set with a mere 9 instances.Test sets with such a limited amount of data can yield results with a significant level of randomness.Nonetheless, even in light of this, the model's accuracy on these two test sets remains notably lower than the final model accuracy achieved through division based on sticker quantity.

Discussion
Both the number of stickers in the text and the type of sticker affect the training and final results of the model.The higher the number of identical stickers, the higher the accuracy of the model.The higher the number of different kinds of stickers, the lower the accuracy of the model.
Based on the above experiments and data, we can assume that a sticker is equivalent to a special information-carrying character. When there are more types of stickers and the number of occurrences of each is distributed more uniformly, the complexity of the sticker features is higher and the model's performance is worse; when there are fewer types and the counts are polarized toward a few stickers, the complexity is lower and the model performs better. We can use information entropy to represent the complexity of emoticon or sticker features:

$$H(X) = -\sum_{i=1}^{k} p(x_i) \log p(x_i)$$

where X is a random variable over all emoticons in a text, H(X) represents the complexity of the emoticon information in that text, k represents the number of distinct emoticons, and $p(x_i)$ represents the probability of emoticon $x_i$ appearing in the text, which reflects how often that emoticon occurs. This formula intuitively captures the impact of the number and type of emoticons on sentiment classification: the higher the complexity, the worse the result, and the lower the complexity, the better.
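The entropy measure above is straightforward to compute from a comment's sticker list. This sketch is our own illustration, using base-2 logarithms (the formula does not fix a base):

```python
import math
from collections import Counter

def sticker_entropy(stickers):
    """Shannon entropy (bits) of the sticker distribution in one comment.
    Higher entropy means a more uniform mix of sticker types."""
    n = len(stickers)
    counts = Counter(stickers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A comment repeating one sticker many times has entropy 0 (the easy case observed in the experiments), while a comment with many distinct, equally frequent stickers has maximal entropy (the hard case).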

Conclusion
With the development of the Internet, stickers have become ever more widely used on social media platforms. As an important symbol for expressing emotion, a sticker also carries very complex emotional information. In sentiment analysis of real texts, both the number of stickers and the variety of sticker types affect the results, with the latter having the more significant impact. Overall, this effect can be attributed to the probability of each sticker appearing relative to all stickers in the text. In future work, this impact can be measured on larger datasets in order to find more precise and appropriate ways to characterize, and even quantify, it.

Figure 2. The network structure of BiLSTM-attention model.

Figure 3. The distribution of sticker numbers and types.

Figure 4. Model performance for comments with different numbers of stickers.

Figure 5. Model performance of various generation models on different types of stickers.