Applying self-attention model to learn both Empirical Risk Minimization and Invariant Risk Minimization for multimedia recommendation

Multimedia recommendation systems have many applications in our daily life. However, accurately capturing a customer's preference remains difficult. Invariant Risk Minimization (IRM) and Empirical Risk Minimization (ERM) are two ways to learn a customer's preference. Still, both frameworks show limitations: although ERM performs excellently in a single environment, it fails to generalize well when faced with multiple and new domains. On the other hand, IRM learns invariant features across heterogeneous environments, but it lacks theoretical guarantees and performs less effectively where the invariants are unclear. This paper proposes an ERM and IRM Optimized Rating Framework (EIOR) as our final recommender model with direct rating scores. EIOR enhances the accuracy and functionality of multimedia recommendation systems by utilizing self-attention mechanisms to combine IRM and ERM with adjusted attention weights. Specifically, IRM learns the invariant parts across different environments, while ERM learns the variant parts. With self-attention, we can adaptively allocate attention weights to the two parts and seek the optimal pair of attention weights based on the loss function. We demonstrate EIOR on a cutting-edge recommender model, UltraGCN, and use the open multimedia dataset of TikTok for all experiments. The results validate the effectiveness of EIOR in comparison with using invariant representations alone under the IRM framework.


Introduction
The usual assumption of traditional machine learning methods is that the data for model training and testing are independent and identically distributed (IID). In this case, the training and testing data are said to be In-Distribution (ID). In practical applications, however, the data a model receives after deployment is often not fully controlled. That is, the model may receive Out-of-Distribution (OOD) samples, also called abnormal samples or outliers. For distribution shifts involving confounders and/or anti-causal variables, and for polynomial generative models, IRM can achieve the desired OOD solution, while ERM can be asymptotically biased [1].
Using IRM in modern recommendation systems can effectively reduce the bias that non-IID data introduces into model training. The essential idea is to separate out the invariant representations and learn them separately [2]. It is evident that we should use IRM for OOD samples, but for IID samples ERM is more effective; how to balance the trade-off between ERM and IRM is the main focus of our work.
As described above, IRM is powerful at recognizing invariant features. However, IRM focuses excessively on the invariant part and discards all the variant parts, some of which may contain helpful information. In the context of recommendation, an individual not only pays attention to internal or invariant factors such as preferences and habits but can also be affected by external or variant factors such as comments and product appearance. In this case, we still focus on improving the accuracy of preference estimation rather than just identifying cause-effect relations [3]. Moreover, retaining only the invariant parts produced by IRM undermines the prediction results in the traditional IID scenario. Introducing ERM may improve the recommendation model's performance; however, IRM targets OOD generalization while ERM belongs to in-distribution generalization, and these properties make the two incompatible. Our motivation is therefore to trade off the proportions of IRM and ERM self-adaptively in the recommendation system to improve its accuracy and functionality. To realize self-adaptation, we introduce the attention mechanism, which can automatically adjust the weights of IRM and ERM according to the characteristics of different individuals. We expect that incorporating both IRM and ERM under the control of the attention model can more accurately extract an individual's true preference, combined with the influence of external factors, and recommend the desired results accurately and efficiently.
The contributions of our paper are as follows. (1) We compare and analyze the strengths and weaknesses of ERM and IRM when applied under different environmental conditions. (2) We propose a new multimedia recommender system named EIOR, which considers both variant and invariant parts to directly compute rating scores reflecting the user's preference towards the item. (3) We experiment with the proposed balancing mechanism and report the improvements in prediction performance.

Invariant Learning
Invariant learning with environment partition identifies the features that do not change across heterogeneous environments. By focusing on features with consistent predictive capability across domains, multimedia recommendations can be more generalizable and adaptive to variations, leading to higher accuracy and efficiency [4,5].

ERM & IRM
ERM is a machine learning framework that aims to minimize the risk between the model's predicted and actual output. Due to its vulnerability to changes in the input distribution, ERM is sensitive to noisy data and has difficulty handling the trade-offs between objectives. On the other hand, the IRM framework learns features that are invariant to changes across heterogeneous environments, improving the generalization and robustness of the models even with limited training data.
As noted in the introduction, the IID assumption underlying ERM does not always hold in real-world scenarios. IRM, in contrast, learns invariant features from the heterogeneity perspective, leading to stable performance under distributional shifts.
The work in [6] evaluates popular IRM methods on deep models with synthetic datasets. The results show that InvRat performs more effectively than the others. Therefore, we adopt the InvRat method from that paper to build our IRM and ERM models in the methodology section.

Attention Mechanism
The attention mechanism can be utilized as a resource allocation schema to concentrate on distinctive parts when dealing with overloaded information [7].Most of the attention mechanisms are focused attention which has been applied to various fields, such as image-based analysis [8,9], text classification [10,11], video classification [12], image captioning [13], and recommendation [14,15,16].
Specifically, self-attention mechanisms adaptively learn attention weights, enabling the model to learn relationships between input elements that would be difficult to capture with fixed attention weights.
Xu et al. suggest combining the self-attention model with graph neural networks for session-based recommendation [17]. In the following section, we propose an approach that combines UltraGCN with the self-attention model to adaptively learn the variant and invariant parts regarding the environment for multimedia recommendation. UltraGCN offers the following advantages:
1. Adaptive neighbor sampling: UltraGCN [18] can flexibly sample neighbors based on the neighbor status of different nodes, reducing computational and storage costs and improving the scalability and efficiency of the algorithm.
2. Scalability: UltraGCN [18] has good scalability and can easily handle large-scale, dense graph data with significant runtime efficiency and accuracy advantages. This paper uses UltraGCN [18] to implement IRM and ERM.
3. Attention mechanism: UltraGCN [18] uses attention mechanisms to weigh different nodes and features, better exploring the relationships and importance between nodes and improving algorithm accuracy and robustness. In this paper, we apply the attention mechanism to combine IRM with ERM to form a better representation model.
Based on these advantages, UltraGCN [18] has been widely applied in graph neural networks, achieving good performance on various tasks and datasets.

NDCG
NDCG (Normalized Discounted Cumulative Gain) is a metric used to evaluate the quality of search engines or recommendation systems. It is widely used in the field of information retrieval.
NDCG is based on DCG (Discounted Cumulative Gain). DCG assigns higher weights to results that rank higher while penalizing the appearance of irrelevant results. Specifically, for a search query or user, DCG over the top $K$ results is calculated as follows:

$$\mathrm{DCG}@K = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i + 1)}$$

Here, $rel_i$ is the relevance score of the $i$-th search result or recommendation result, which is usually a non-negative value. The term $\log_2(i + 1)$ is a discount factor that gives higher-ranking results more weight and gradually reduces the contribution of later results.
NDCG eliminates the influence of data size and sorting position by normalizing DCG using the Ideal DCG (IDCG). IDCG is the DCG value of the ideal ranking, in which the results are sorted by decreasing relevance:

$$\mathrm{IDCG}@K = \sum_{i=1}^{K} \frac{rel_i^{\mathrm{ideal}}}{\log_2(i + 1)}$$

NDCG is then calculated as follows:

$$\mathrm{NDCG}@K = \frac{\mathrm{DCG}@K}{\mathrm{IDCG}@K}$$

The value of NDCG ranges between 0 and 1, with 1 indicating that the ranking matches the ideal ranking and 0 indicating that none of the returned results are relevant.
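The metric can be illustrated with a minimal sketch in plain Python. The function names `dcg_at_k` and `ndcg_at_k` are ours, and the ideal ranking is assumed to be obtained by sorting the supplied relevance scores in decreasing order.

```python
import math

def dcg_at_k(relevances, k):
    """DCG@k for relevance scores given in ranked order (rank 1 first)."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(relevances, reverse=True)  # assumed ideal ordering
    idcg = dcg_at_k(ideal, k)
    # If no result is relevant, the ideal DCG is 0 and NDCG is defined as 0 here.
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0
```

For example, `ndcg_at_k([1, 0, 1], 3)` scores a ranking whose second position is irrelevant; the value is below 1 because the ideal ranking would place both relevant items first.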

The process of the method
Here we present a method to improve the accuracy of the representation model by combining IRM and ERM. The workflow of our process is shown in Figure 1. We divide this method into eight essential parts (M1-M8). M1 is a pre-trained representation model used to extract the contents from multimedia data, including words, sounds, and pictures. Based on IRM, we create M2 to find the variant part in the content representation model (M1). Based on that, we construct M3 to divide the original environment into several subsets. Each subset forms an independent interaction environment, and we can obtain one feature from every subset after a deep learning process. Then, in M4, we learn an invariant mask to prepare for an invariant representation model. Combining the content representation model (M1) and the resulting invariant mask (M4), we obtain the invariant representation model (M5), which is also the result of IRM. We use the data in the variant part (M2) to construct the ERM representation model (M6) by ERM training. After that, we employ an attention mechanism to assign different weights to each feature obtained from the ERM representation model (M6) and the invariant representation model (M5). The result is the attention mechanism representation (M7). After some optimization, we obtain the final model. We illustrate M2 and M5 in 3.
We use the following notation: the content representation $c_i$, the dimension of the content representation $D$, the invariant representation $\Phi_i$, and the variant representation $\Psi_i$. We also write $\Phi = \{\Phi_i\}$ and $\Psi = \{\Psi_i\}$ for the sets of the two representations. With the invariant mask $m \in [0, 1]^D$, we define the invariant representation as

$$\Phi_i = m \odot c_i$$

When we remove all the data of the invariant representation, the remainder is the variant part, which is defined as

$$\Psi_i = (1 - m) \odot c_i$$

The most important steps in obtaining the invariant representation are the generation of the invariant mask $m$ and the environment partition. The detailed procedure is in modules M2, M3, and M4; we discuss these modules in Sections 3.1.2 and 3.1.3.
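A minimal sketch of this split, assuming that the invariant part is obtained by element-wise masking of the content representation with the mask m and that the variant part uses the complementary mask (1 - m). The function name `split_representation` and the list-based vectors are ours, for illustration only.

```python
def split_representation(content, mask):
    """Split a content vector into invariant and variant parts.

    content: list of D floats (content representation of one item)
    mask:    list of D floats in [0, 1] (invariant mask, assumed given)
    """
    if len(content) != len(mask):
        raise ValueError("content and mask must have the same dimension")
    invariant = [m * c for m, c in zip(mask, content)]
    variant = [(1.0 - m) * c for m, c in zip(mask, content)]
    return invariant, variant
```

Under this assumption the two parts always sum back to the original content representation, so no information is lost by the split itself.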

Environment partition.
Following IRM, to obtain the environment partition, we create a module (M3) that takes in the different user-item interactions and outputs features of these data to form an environment set $\mathcal{E}$. Each environment $e \in \mathcal{E}$ reflects a kind of correlation between users and items; some are spurious correlations [2], and some are real correlations. The detailed process is as follows. We try to partition the whole environment: some interactions can only form one feature, so we put them together as a small environment $e$. To describe that environment $e$, we learn a predictive model on the variant part of the data:

$$\Theta_e = \arg\min_{\Theta} \mathbb{E}_{(u,i,y) \in e}\left[\mathcal{L}\left(\Gamma^{(e)}(u, i, \Psi_i \mid \Theta), y\right)\right]$$

where $\Gamma^{(e)}$ is the predictive model, $\Theta_e$ indicates the model parameters, and $\Psi_i$ denotes the variant representation of item $i$. We now have the environment set $\mathcal{E}$, which contains spurious correlations [2]. To improve the accuracy, we look for the interactions that recognize a feature with a higher probability. To reassign the interactions among the environments, we use this formula:

$$e(u, i, y) = \arg\min_{e \in \mathcal{E}} \mathcal{L}\left(\Gamma^{(e)}(u, i, \Psi_i \mid \Theta_e), y\right)$$

Finally, we alternate between these two formulas in a loop until they converge, which yields the environment partition. In the next step, we use this result to find the invariant mask.
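The alternating loop described above can be sketched on a toy problem. This is only an illustration under strong assumptions: each interaction is reduced to a scalar feature `x` and a label `y`, the per-environment predictive model is a one-parameter least-squares line through the origin, and the loss is the squared error. None of these choices come from the paper, and the names `fit_slope` and `partition_environments` are hypothetical.

```python
import random

def fit_slope(points):
    """Least-squares slope through the origin for (x, y) pairs."""
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    return sxy / sxx if sxx > 0 else 0.0

def partition_environments(points, num_envs, max_iters=50, seed=0):
    """Alternate between fitting one model per environment and
    reassigning every interaction to its best-fitting environment."""
    rng = random.Random(seed)
    assign = [rng.randrange(num_envs) for _ in points]  # random initial partition
    slopes = [0.0] * num_envs
    for _ in range(max_iters):
        # Step 1: learn one predictive model per (non-empty) environment.
        for e in range(num_envs):
            members = [p for p, a in zip(points, assign) if a == e]
            if members:
                slopes[e] = fit_slope(members)
        # Step 2: move each interaction to the environment with the lowest loss.
        new_assign = [
            min(range(num_envs), key=lambda e: (slopes[e] * x - y) ** 2)
            for x, y in points
        ]
        if new_assign == assign:  # partition stopped changing
            break
        assign = new_assign
    return assign, slopes
```

In the paper's setting the toy line would be replaced by the recommendation model applied to the variant representations; the control flow of the loop is the part this sketch is meant to show.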

Invariant mask.
For the invariant representation, we argue that spurious correlations are unstable across heterogeneous environments, such as cattle on grass and cattle on a beach, where the grass and the beach have little direct connection to the cattle themselves. From Section 3.1.2 we obtain the environment partition $\mathcal{E}$, which is built from the variant part. The variable to be learned here is the invariant mask $m$, the vector used to generate the invariant representation. We would like to find a mask $m$ that performs well in both the single-environment and cross-environment predictive models. According to IRM and Heterogeneous Risk Minimization (HRM) [19], we parameterize the mask as

$$m = \max\{0, \min\{1, \omega + \epsilon\}\}, \quad \text{where } \epsilon \sim \mathcal{N}(0, \sigma^2) \quad (12)$$

We then optimize the mask with the predictive objective of HRM [19]. The first part of this objective is the typical recommendation loss, the second part is a constraint across environments, and the third part is a regularization term; the objective also uses the average loss over the environments. Our purpose is to minimize this objective, so as the loop continues, we use

$$m \leftarrow \max\{0, \min\{1, m\}\} \quad (15)$$

to clip the mask $m$. When the objective converges, we obtain the invariant representation.
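The noisy, clipped mask can be sketched as follows. We assume the mask is parameterized by a real-valued vector (called `omega` here) perturbed by Gaussian noise and then clipped to [0, 1]; the symbol names and the helper names `clip01` and `sample_mask` are our assumptions, chosen to match the max/min structure of Eq. 12 and the clipping step of Eq. 15.

```python
import random

def clip01(x):
    """Clip a scalar to the interval [0, 1]: max{0, min{1, x}}."""
    return max(0.0, min(1.0, x))

def sample_mask(omega, sigma, rng):
    """Draw one mask: add N(0, sigma^2) noise to each entry, then clip to [0, 1]."""
    return [clip01(w + rng.gauss(0.0, sigma)) for w in omega]

def clip_mask(mask):
    """Clipping step applied after each update so the mask stays in [0, 1]."""
    return [clip01(m) for m in mask]
```

A typical call would be `sample_mask(omega, 0.1, random.Random(0))`; entries of `omega` far below 0 or far above 1 then give mask values that stay at 0 or 1 despite the noise.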

Attention Mechanism
A major factor affecting the prediction accuracy of our model is the ability to filter out invariant representations accurately. Still, the positive impact of the variant representations on the correct prediction of the model cannot be denied entirely. For example, when a user buys a dress online, there is a high probability that the user likes the dress itself, which is the invariant representation; however, the corresponding variant factors, such as models, lighting, and scenes, can also facilitate the user's purchase.
Therefore, we adopt self-attention mechanisms to adaptively learn the variant and invariant parts with respect to the environment partition, balancing their weights dynamically. More effective model fusion is achieved by combining the attention learning process with the UltraGCN prediction process. Based on the self-attention mechanism, the model can effectively allocate weights among different environments.
Up to now, we have obtained stable invariant and variant representations by learning. Subsequently, we construct attention mechanisms to learn the weights of the invariant and variant representations, that is, to determine how much each contributes to the final prediction results. We combine the variant and invariant representations using two attention weights, $a_1$ and $a_2$, for the variant representation and the invariant representation. In other words, the weights indicate how strongly each representation's prediction influences the estimated user preference.
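One way to realize such a combination is a weighted sum whose two weights come from a softmax over two learnable scores. The source equation is not available, so the weighted-sum form, the softmax normalization, and the names `softmax2` and `fuse` are all assumptions made for this sketch.

```python
import math

def softmax2(score_a, score_b):
    """Turn two real-valued scores into two positive weights that sum to 1."""
    m = max(score_a, score_b)  # subtract the max for numerical stability
    ea, eb = math.exp(score_a - m), math.exp(score_b - m)
    return ea / (ea + eb), eb / (ea + eb)

def fuse(variant, invariant, score_var, score_inv):
    """Combine the variant and invariant vectors with attention weights."""
    w_var, w_inv = softmax2(score_var, score_inv)
    fused = [w_var * v + w_inv * p for v, p in zip(variant, invariant)]
    return fused, (w_var, w_inv)
```

With equal scores both representations contribute equally; raising `score_inv` shifts the fused vector towards the invariant representation.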

Collaborative filtering
Concerning the collaborative filtering term, the user-item pairs $(u, i)$ can be written as a sparse user-commodity feature matrix, in which the users $u_1$, $u_2$ are row vectors and the commodities $i$ are column vectors. The similarity of the preferences of user 1 and user 2 can be measured by the cosine similarity:

$$\mathrm{sim}(u_1, u_2) = \frac{u_1 \cdot u_2}{\lVert u_1 \rVert \, \lVert u_2 \rVert}$$

The user's preference for an item $i$ can then be calculated with a rating formula based on these similarities.
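The following sketch shows cosine similarity between two user rows and one common way to turn similarities into a rating, namely a similarity-weighted average over the other users who rated the item. The rating formula of the paper is not recoverable from the text, so `predict_rating` is a standard stand-in rather than the authors' formula; a value of 0 is assumed to mean "not rated".

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    ratings: list of user rows; ratings[u][i] == 0 means user u has not rated item i.
    """
    num = den = 0.0
    for other, row in enumerate(ratings):
        if other == user or row[item] == 0:
            continue  # skip the target user and users without a rating
        sim = cosine_similarity(ratings[user], row)
        num += sim * row[item]
        den += abs(sim)
    return num / den if den > 0 else 0.0
```

For a small matrix such as `[[5, 3, 0], [5, 3, 4], [1, 0, 2]]`, the prediction for user 0 on item 2 is pulled towards the rating of user 1, whose row is most similar to that of user 0.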

Final Prediction Model
Final prediction model. The invariant mask becomes stable by running the stream M2-M3-M4 repeatedly, T times, until convergence. We then learn the attention weights $a_1$ and $a_2$ and the final prediction model based on the invariant and variant representations generated in M5 and M6, respectively.
To find the specific values of $a_1$ and $a_2$, we obtain the variant mask by taking the inverse of the invariant mask from Section 3.1.3, and we use these two masks to refine the variant representation and the invariant representation, respectively. We apply an empirical risk minimization model to the variant representation to measure the gap between the predicted and observed results. We compute the loss separately for user-related and user-irrelevant items: the former term in Eq. 19 denotes the loss for predicting the user and user-related items, and the latter term denotes the loss for predicting the user and user-irrelevant items. The square root of the two results is taken to normalize the loss values.
After obtaining the two sets of environmental losses, we use the attention weights to combine the loss of the variant representation and the loss of the invariant representation into the loss of the overall feature, which is also the loss of our final prediction model. We initialize the attention weights to be equal and keep their sum equal to 1 at all times. Thus, we have:

$$\mathcal{L} = a_1 \mathcal{L}_{\mathrm{inv}} + a_2 \mathcal{L}_{\mathrm{var}}, \quad a_1 + a_2 = 1$$

where $a_1$ is the attention weight of the invariant representation loss $\mathcal{L}_{\mathrm{inv}}$, and $a_2$ is the attention weight of the variant representation loss $\mathcal{L}_{\mathrm{var}}$.
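A sketch of how two loss values can be combined with weights that start equal and always sum to 1. We assume the weights are produced by a softmax over two learnable logits and updated by gradient descent on the combined loss; this parameterization and the function names are our assumptions, and in real training the two component losses would also change as the model parameters are updated.

```python
import math

def attention_weights(logits):
    """Softmax over the logits: positive weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(s - m) for s in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combined_loss(logits, loss_inv, loss_var):
    """Weighted sum of the invariant and variant losses."""
    w_inv, w_var = attention_weights(logits)
    return w_inv * loss_inv + w_var * loss_var

def update_logits(logits, loss_inv, loss_var, lr=0.1):
    """One gradient-descent step on the logits for the combined loss.

    For softmax weights w, d(combined)/d(logit_j) = w_j * (loss_j - combined).
    """
    weights = attention_weights(logits)
    losses = [loss_inv, loss_var]
    total = sum(w * l for w, l in zip(weights, losses))
    grads = [w * (l - total) for w, l in zip(weights, losses)]
    return [s - lr * g for s, g in zip(logits, grads)]
```

Starting from `logits = [0.0, 0.0]` gives equal weights of 0.5; each update then shifts weight towards the branch with the smaller loss.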
The UltraGCN automatically weighs the learning ratios of ERM and IRM using a loss function to achieve optimal results.

Backbone: UltraGCN[1, 2]
UltraGCN pushes the representations to encode the user-item graph through the graph-based loss function

$$\mathcal{L} = \mathcal{L}_O + \lambda \mathcal{L}_C + \gamma \mathcal{L}_I \quad (22)$$

where $\lambda$ and $\gamma$ are hyper-parameters that balance the importance weights of these loss terms.

$\mathcal{L}_O$ indicates the objective loss. Its first and second terms calculate the relevance between multimedia items and targets for positive and negative samples, respectively. The relevance is mapped to a probability value using the logistic function $\sigma$. Then, the logarithm of this probability value is taken and negated to represent the matching loss. The objective of the first term is to maximize the relevance of positive samples, allowing the recommendation model to better match users with target items. The objective of the second term is to minimize the relevance of negative samples, enabling the recommendation model to better distinguish users from irrelevant items. The objective loss is

$$\mathcal{L}_O = -\sum_{(u,i) \in N^+} \log \sigma(e_u^\top e_i) - \sum_{(u,j) \in N^-} \log \sigma(-e_u^\top e_j) \quad (23)$$

where $e_u$, $e_i$, and $e_j$ are the representations of user $u$ and items $i$ and $j$, and $N^+$ and $N^-$ are the sets of positive and negative pairs.

$\mathcal{L}_C$ indicates the user-item constraint loss, which enhances the robustness and generalization capability of the invariant representation learning model. In this loss, the first and second terms calculate the relevance between $u$ and the target items $i$ and $j$, respectively, based on the weight terms $\beta_{u,i}$ and $\beta_{u,j}$. The relevance values are mapped to probability values using the logistic function $\sigma$. Subsequently, the logarithm of these probability values is taken and multiplied by the corresponding weight terms. Consequently, the objective of the first term is to maximize the relevance of positive samples, while the aim of the second term is to minimize the relevance of negative samples. The user-item constraint loss is

$$\mathcal{L}_C = -\sum_{(u,i) \in N^+} \beta_{u,i} \log \sigma(e_u^\top e_i) - \sum_{(u,j) \in N^-} \beta_{u,j} \log \sigma(-e_u^\top e_j) \quad (24)$$

where the fixed weight coefficients $\beta_{u,i}$ and $\beta_{u,j}$ are derived from the user-item interactive graph $A$ by

$$\beta_{u,i} = \frac{1}{d_u} \sqrt{\frac{d_u + 1}{d_i + 1}} \quad (25)$$

where $d_u$ and $d_i$ denote the degrees of the corresponding nodes.

Another constraint relies on an item-item correlation graph $G = A^\top A$, where $A$ indicates the user-item interactive graph. $\mathcal{L}_I$ indicates the item-item constraint loss, a regularization term used to encourage the relevance between a multimedia item $i$ and its associated items $j$ in the same temporal sequence within the recommendation model. The inner summation measures the relevance between the multimedia item and its associated items by applying the logistic function and taking the logarithm. The outer summation aggregates the relevance values of associated items within the same temporal sequence. By minimizing $\mathcal{L}_I$, the recommendation model can learn the relevance between the multimedia object and its associated items in the same temporal sequence, thereby better considering the temporal dependencies. The item-item constraint loss is

$$\mathcal{L}_I = -\sum_{(u,i) \in N^+} \sum_{j \in S(i)} \omega_{i,j} \log \sigma(e_u^\top e_j) \quad (26)$$

where $S(i)$ indicates the adjacent item set of the item $i$. The weight coefficient $\omega_{i,j}$ is computed by

$$\omega_{i,j} = \frac{G_{i,j}}{g_i - G_{i,i}} \sqrt{\frac{g_i}{g_j}} \quad (27)$$

where $g_i$ and $g_j$ denote the degrees of item $i$ and item $j$ in $G$.
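To make the user-item constraint term concrete, the sketch below computes the degree-based weight and the weighted log-sigmoid loss on toy embeddings. It follows the weight formula as given in the UltraGCN paper [18] to the best of our understanding; the plain-list data layout and the function names are ours, and the degrees are passed in directly rather than derived from a graph.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def beta(deg_u, deg_i):
    """Constraint weight (1 / d_u) * sqrt((d_u + 1) / (d_i + 1))."""
    return (1.0 / deg_u) * math.sqrt((deg_u + 1.0) / (deg_i + 1.0))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def constraint_loss(user_emb, item_emb, positives, negatives, user_deg, item_deg):
    """Weighted loss over positive (u, i) and negative (u, j) index pairs."""
    loss = 0.0
    for u, i in positives:  # pull interacted items towards the user
        score = dot(user_emb[u], item_emb[i])
        loss -= beta(user_deg[u], item_deg[i]) * log_sigmoid(score)
    for u, j in negatives:  # push sampled negatives away from the user
        score = dot(user_emb[u], item_emb[j])
        loss -= beta(user_deg[u], item_deg[j]) * log_sigmoid(-score)
    return loss
```

Setting every weight to 1 instead of `beta(...)` turns the same function into the unweighted objective loss, which is why the two terms share their structure.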
We learn the predictive model by

$$\arg\min_{\Theta} \mathbb{E}\left[\mathcal{L}\left(\Gamma(u, i, c_i \mid \Theta), y_{u,i}\right)\right] \quad (28)$$

Here, $y_{u,i}$ represents the user's true preference for an item (expressed through ratings). The loss function combines the ERM and IRM terms with the weights $\lambda$ and $(1 - \lambda)$, which can be interpreted as the percentages of the ERM learning weight and the IRM learning weight. UltraGCN automatically weighs the learning ratios of ERM and IRM using a loss function to achieve optimal results.

Evaluation protocols.
Building upon prior studies [20,21], our approach scores the user-item interactions with the trained models and then ranks them. Specifically, for each user, we take the top-K items and compute Precision@K (P@K), Recall@K (R@K), and Normalized Discounted Cumulative Gain (N@K) based on the observed interactions within the testing dataset. To evaluate the efficacy of the trained model, we calculate the average scores across all users.
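For a single user, Precision@K and Recall@K can be computed as in the following sketch. The function name and the representation of the test interactions as a set of item ids are our assumptions; the reported scores would be the averages of these per-user values.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Precision@K and Recall@K for one user.

    ranked_items:   item ids ordered by predicted score, best first
    relevant_items: set of item ids the user interacted with in the test set
    """
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall
```

Precision divides the number of hits by K, while recall divides it by the number of test interactions of the user, so the two can differ widely for users with very few or very many interactions.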
4.1.3. Baseline. To assess the effectiveness of our model, we adopt the comparative approach outlined in the InvRL article and compare it against state-of-the-art multimedia recommendation methods. Specifically, we consider baselines from three categories.
The performance evaluations of the baselines above are sourced from previous works [20,21], following the established conventions.

Parameter settings.
Adam adapts well to sparse gradients by using second-moment estimates of the gradients (the mean of the squared gradients), which makes it perform better on sparse data in large matrix tasks. Therefore, we used Adam [29] as the optimizer. The experimental settings used to evaluate our proposed approach are as follows:
1. Batch size: We set the batch size to 512, which determines the number of samples processed in each training iteration.
2. Embedding dimension: The dimension of the embeddings was fixed at 64, ensuring consistent representation across the model.
3. Hyperparameter tuning: We individually tuned the learning threshold and the regularization factor for specific embeddings and other parameters, adjusting the values to optimize the model's performance.
4. Regularization factors: To control overfitting, we used regularization factors with weights of $10^{-4}$ for specific ID parameters. We experimented with values of 1, 0.1, 0.01, 0.001, and 0 for other parameters.
5. Learning rate: We set the learning rate to $10^{-3}$ for all parameters to regulate the speed of model convergence during training.
6. Parameters of Equation 10: The two parameters in Equation 10 were each chosen from the set {1, 0.5, 0.1} to optimize the trade-off between accuracy and regularization.
7. Learning rate of mask generation: The learning rate of the mask generation module ($m$) was searched within {0.01, 0.001, 0.0001}.
8. Further parameters: Two further parameters were adjusted in the set {2, 1, 0.1, 0.01, 0} to examine their impact on the model's performance.
9. Iteration parameter $T$: The iteration parameter $T$ was initially set to 5, determining the number of iterations of the proposed approach.
10. Training epochs: The environment partition model was trained for 20 epochs, the mask generation model for 40 epochs, and the final prediction model for 500 epochs, ensuring convergence and capturing essential patterns in the data.
11. Model selection: Models were selected based on validation scores, and the corresponding test scores are reported.
By adopting these experimental settings, we aimed to thoroughly investigate the performance of our proposed approach and ensure reliable and meaningful results.

Result and Discussion
We present the overall performance comparison of different methods in Table 1. The following observations can be made.
Neural collaborative filtering (NCF) approaches generally outperform collaborative filtering (CF) because NCF explicitly considers the interactions between embedding dimensions. This enables a more comprehensive representation of pairwise correlations, enhancing fine-grained information modeling. CNNs are applied to the matrix generated by the outer products, allowing the extraction of higher-order correlations and complex patterns within the embedding space [30]. Moreover, the relatively poorer performance of DUIF highlights the impact of collaborative support. Additionally, M-NCF approaches consistently outperform G-NCF approaches, demonstrating that in application scenarios such as multimedia recommender systems it is essential to correctly analyze multiple data forms and to model the interactions between different modalities [31,32]. Notably, by adding a graph regularization term to the standard CNN structure and applying a graph convolution operation to aggregate the information of neighboring nodes, GRCN achieves the best performance among the NCF-based methods, which emphasizes the need to leverage both user behaviors and item contents in an effective multimedia recommendation model [33].
Our backbone model, UltraGCN, is a generic graph-based CF method. Despite its simple incorporation of multimedia content, UltraGCN significantly outperforms the other multimedia recommendation baselines. This performance indicates that UltraGCN can effectively capture collaborative information through constraint losses. The InvRL model uses the same prediction function and training target as UltraGCN; the only difference is that the content representation passes through the learned invariant mask (as described in Section 3.1.3). As a result, InvRL consistently achieves the best performance on the TikTok dataset, surpassing UltraGCN by 8.71% [2], an improvement that can be attributed to the constraints imposed by the mask. However, the invariant representation obtained solely through masking is limited because it completely abstracts away the subject's interaction with the environment, and our model suggests that there is also some connection between the variant and invariant representations.
Compared with InvRL, using the attention mechanism to connect the learning of variant representations with the learning of invariant representations brings more features and thus a larger learnable space. As shown in Table 1, the model using the attention mechanism slightly improves on learning invariant representations with InvRL alone on the TikTok dataset. This supports the view that the variant representations are not useless but can uncover information that is useful to us. In Table 1, bold scores indicate the best performance achieved, while underlined scores represent the second-best performance. The abbreviations M-CF, G-NCF, and M-NCF correspond to multimedia CF, generic NCF, and multimedia NCF, respectively.
Through multiple iterations of learning attention mechanism coefficients, it is evident that despite our previous argument that we cannot wholly disregard the feature learning of varying representations, the attention coefficients corresponding to variable expressions are often significantly smaller than those of invariant representations.In other words, their impact is limited.
Furthermore, evidence suggests that a large portion of the data within a set of features represents variations, with only the core regions of the parts being invariant representations.This poses a challenge, as simply distinguishing between varying and invariant representations is insufficient.Taking the example of an image depicting a camel in the desert and a cow in a meadow, the main subjects of the image are the camel and the cow, which occupy only a tiny portion of the picture.However, regarding varying representations, the desert, meadow, and sky hold a significant advantage in terms of feature quantity.This may result in poor performance of our model in learning features related to varying representations in complex background images.
Therefore, addressing the insufficiency of differentiating between varying and invariant representations becomes necessary when partitioning varying manifestations.
Besides, to enhance our model's generalization capability, we consider further utilizing multi-head attention [34] to partition the environmental aspects of the varying representations. As mentioned, the varying representations contain both features that contribute significantly to the prediction model and "irrelevant" features. Therefore, we propose assigning higher attention weights to the essential parts of the varying representations while assigning lower weights to the less significant ones. Our future work will focus on learning these weights dynamically.
To achieve this, we will leverage the multi-head attention mechanism, which has been proven effective in capturing diverse patterns and dependencies within the input data.By incorporating multiple attention heads, each attending to a different aspect of the varying representations, we can better capture the complex relationships and variations in the environment.
Furthermore, we will explore techniques to dynamically adapt the attention weights based on the significance of the varying representations. This can be achieved through adaptive mechanisms such as reinforcement learning or adaptive gating, which can iteratively adjust the attention weights during the training process.
By incorporating these enhancements, we expect to improve the model's ability to distinguish between essential and irrelevant features within the varying representations.This, in turn, will lead to enhanced generalization performance and accuracy in handling complex visual data.
In our future work, we will conduct extensive experiments to evaluate the effectiveness of the proposed approach.We will compare the performance of our model with and without the multi-head attention mechanism on various datasets and complex background images.Additionally, we will investigate the impact of different strategies for dynamically learning the attention weights for the varying representations.

Conclusion
This paper introduces EIOR for multimedia recommendation with explicit rating scores. The model incorporates the learning of the variant part across environments based on ERM. According to the experimental results, applying self-attention mechanisms with adjusted attention weights for both IRM and ERM yields higher scores than the IRM framework alone, which indicates that the variant part is not useless in the multimedia recommendation scenario. Moreover, the better performance of EIOR compared with other models shows that combining and balancing the variant and invariant parts is more capable of predicting a customer's preference. In the future, we will put more effort into adopting the multi-head attention mechanism to further improve the model.

4.1.1. Dataset. The TikTok platform tracks the viewing data of micro-videos, providing textual, visual, and acoustic representations. To represent the textual content, the initial sentence-based textual representations, encoded as one-hot word vectors, are transformed by summing the word embeddings.

Table 1. Evaluation of performance.