Emotion-Based Music Movie and TV Series Recommendation System Using Deep Learning Algorithm

It can be challenging to decide which music or movies to choose from a huge number of options. The main purpose of our music and movie recommendation system is to provide users with selections that fit their tastes. An assessment of a user's facial expression may provide insight into their current emotional or mental state. More than 60% of users anticipate that their music collection will eventually grow so large that they will be unable to find the song they want to play. A recommendation system can help a user pick which music to listen to or which movie to watch. The user's face is detected using the webcam, and a snapshot is taken to capture their mood. The system recognizes six facial expressions: angry, sad, fearful, joyful, surprised, and neutral. Based on the classified expression, users receive recommendations in three categories: movies, music, or series. The facial expressions are classified using a Convolutional Neural Network (CNN) model. Haar Cascade, an object detection algorithm, is used to recognize faces in images and real-time video.


Introduction
Music, movies, and TV series can reduce stress and help people relax. Many people enjoy them because they can change one's mood and provide a sense of relief in everyday life. Every type of music or movie evokes different feelings, since users readily associate them with their surroundings. They can also connect with and affect people in ways that other forms of communication do not. Many people turn to music and movies to connect with others, express themselves, or find common ground with peers. However, the user still has to actively browse through songs and movies and make a selection based on their current mood and behavior, for example by scrolling carefully through a playlist to pick tracks that match their mood. This is a time-consuming procedure, and people frequently struggle to come up with a suitable choice of songs or movies. Deep learning-based facial expression recognition is one technique for detecting human emotional states such as anger, happiness, fear, neutral, sadness, and surprise. This technology seeks to recognize facial expressions automatically and identify emotional states accurately. In this method, a CNN is trained on annotated facial photographs from a facial expression dataset. The proposed CNN model then determines which facial expression is shown, and music, movies, and TV series are suggested through YouTube based on the detected expression.

Related Work
A two-stage emotion recognition approach using frame-level and video-level information contrasts a seven-class classifier with a two-step classification for categorical emotion recognition [1]. The authors used the FG2020 Multimodal Emotion Recognition (MER) dataset, which includes video and skeleton data obtained with a Microsoft Kinect. They compared the performance of numerous unimodal features as well as various multimodal feature combinations, and also compared features at the frame and video levels. Another work considered changes in the curvatures of the face and the brightness of the corresponding pixels [2]. The author used Artificial Neural Networks (ANN) to classify emotions and also suggested several playlist approaches.
Another paper presents a system that recommends songs when a specific song is processed, using libraries such as NumPy and Pandas [3]. Music service providers need a useful system for categorizing recordings and helping their customers find music through good suggestions [4]. A database of 714 facial emotion images was published and evaluated using the Support Vector Machine (SVM). It was created by taking two digital photos of each of seven facial expressions for 51 different people. Using 476 images for training and 238 for testing to identify the seven emotions, the performance of four SVM kernels for facial emotion identification was evaluated. A feature extraction method based on a deep convolutional neural network eliminates manual feature selection. It reconstructs the conventional local binary pattern (LBP) feature operator for facial expression images and fuses it with the abstract expression features in the fully connected layer [5], [6]. Another paper describes a facial expression recognition method in which the expression in a test image is recognized and classified using a Convolutional Neural Network (CNN) and a softmax classifier.
The use of deep learning for facial expression identification is now common [7]. The dataset consists of 35,887 grayscale face images, and the batch sizes considered are 8, 40, 50, 55, 88, 100, and 128. The batch size finally proposed for the training model is 50, with which the proposed deep convolutional neural network is reported to be most effective. The diagonal structure is also apparent in the confusion matrix, indicating that the method successfully classifies expressions. Work on multimodal emotion identification from voice and facial expression has also been published [8]. The dataset used has around 12 hours of audiovisual content, including video, audio, speech text, and facial expressions from 10 actors in either scripted or impromptu scenes. To extract facial expression components from this data, several small-scale kernel convolution blocks were created [9]. In another study, the dimensionality of the feature vector was reduced using a distance metric learning approach, at the cost of some performance loss. The authors used 5- and 10-dimensional feature vectors to classify genres without losing speed. Most music recommendation algorithms use collaborative filtering (CF) and content-based filtering (CBF) to find common patterns. One of the characteristics most frequently used for this filtering is genre [10]. Another paper focused on a neural network-based personalized music recommendation system, built on research into personalized music recommendation systems that use only tag information [11], [12].

Methodology
The proposed system suggests movies and TV series in addition to songs. This project focuses on the user's emotions as sensed by the webcam, and music, movies, and series are offered based on facial expressions. Once the facial expression has been recognized, a suitable movie, song, or series is selected based on the user's mood. The main goal of this system is to create a sophisticated recommendation system that can improve users' moods and rejuvenate them. Fig. 1 shows the workflow of the recommendation system. Deep learning-based facial expression identification is one strategy for recognizing human emotional states such as anger, happiness, fear, neutral, sadness, and surprise. This technique aims to determine emotional states reliably by recognizing facial expressions automatically. A CNN is trained by feeding it annotated facial images from a facial expression dataset. The proposed CNN model then identifies which facial expression is shown, and music, movies, and TV shows are offered via YouTube based on the detected emotion [13].
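The final step of this workflow, turning a detected emotion into a YouTube suggestion, might be sketched as follows. This is a minimal illustration only: the search phrases in `QUERY_BY_EMOTION` and the helper `recommendation_url` are hypothetical names and values chosen for the example, not taken from the proposed system, and the sketch assumes YouTube's public search-results URL format.

```python
from urllib.parse import urlencode

# Hypothetical mapping from a detected emotion to a search phrase.
# The phrases are illustrative assumptions, not part of the paper.
QUERY_BY_EMOTION = {
    "angry": "calming",
    "sad": "uplifting",
    "fear": "soothing",
    "happy": "upbeat",
    "surprise": "feel good",
    "neutral": "popular",
}

CATEGORIES = ("music", "movies", "series")


def recommendation_url(emotion: str, category: str) -> str:
    """Build a YouTube search URL for an emotion and a content category."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    phrase = QUERY_BY_EMOTION[emotion.lower()]
    query = urlencode({"search_query": f"{phrase} {category}"})
    return f"https://www.youtube.com/results?{query}"


url = recommendation_url("Sad", "music")
```

The resulting URL could then be opened in a browser, for example with Python's standard `webbrowser.open(url)`.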

Convolutional Neural Network
Convolutional neural networks are a particular kind of deep neural network used in deep learning to analyze visual imagery. They rely on a mathematical operation called convolution, which combines two functions to produce a third that expresses how the shape of one is modified by the other. Convolutional neural networks are composed of many layers of neurons [14]. Like their biological counterparts, artificial neurons are mathematical functions that compute the weighted sum of several inputs and produce an activation value. When an image is given as input, each layer of a convolutional neural network generates several activation maps that are passed on to the following layer [15]. The first layer extracts simple features such as horizontal and diagonal edges. This information is passed to the subsequent layer, which identifies more intricate properties such as corners and combinations of edges. Deeper layers recognize even more complex things such as objects and faces [16]. The pooling layer, like the convolutional layer, is responsible for shrinking the spatial size of the convolved feature. Reducing this size decreases the computing power needed to process the data [17]. Max pooling returns the largest value from the region of the image covered by the kernel, which also discards noisy activations.
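To make the edge-extraction idea concrete, the sketch below slides a small kernel over an image in plain Python. It is an illustration under assumptions, not the model's actual layer: the kernel values and the toy image are made up, and, as in most deep learning libraries, the operation is implemented as cross-correlation without padding.

```python
def convolve2d(image, kernel):
    """Valid (no padding) 2D cross-correlation of an image with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy 4x4 image: dark left half, bright right half (a vertical edge).
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# Illustrative kernel that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, vertical_edge)
```

The feature map is non-zero only in the column where the kernel straddles the edge, which is the kind of low-level response the first layer passes on to later layers.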

ReLU Layer
After each convolution operation, a nonlinear layer is added. Its nonlinearity comes from its activation function: the function $f(x) = \max(0, x)$ is computed [19]. In other words, the activation is simply set to zero whenever the input is negative.

4.2 Pooling Layer
The pooling layer reduces the size of the image by downsampling it. After the convolutional, nonlinear, and pooling layers have been applied, a fully connected layer must be added.
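The two operations described above, the ReLU nonlinearity and downsampling by pooling, can be sketched in a few lines. The 2x2 non-overlapping window and the input values are assumptions for illustration; the paper does not state the pooling size used.

```python
def relu(feature_map):
    """Apply f(x) = max(0, x) to every element."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [
        [
            max(
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            )
            for c in cols
        ]
        for r in rows
    ]


# Made-up 4x4 feature map containing negative responses.
activations = relu([
    [-3, 5, 2, -1],
    [4, -2, 0, 7],
    [1, 1, -6, -8],
    [-4, 3, -2, 6],
])
pooled = max_pool(activations)
```

Here the 4x4 map shrinks to 2x2, so each later layer processes a quarter as many values.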

4.3 Fully-Connected Layer
The fully connected layer receives the output data from the convolutional layers. When a fully connected layer is attached to the end of the network, it produces an N-dimensional vector, where N is the number of classes from which the model chooses the required class [21].
$$g(Wx + b) \tag{5}$$
The dimensions of the output tensor can be determined from the input tensor using the formula above, where $g$ is the activation function, $x$ is the input vector, $W$ is the weight matrix, and $b$ is the bias vector [22].
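As a small worked instance of this layer, the sketch below applies an activation function g to the weighted input plus bias. The weights, bias, and input are made-up numbers, and the identity function stands in for g so the arithmetic is easy to follow. Note that the output has one entry per row of W, which corresponds to the number of classes N.

```python
def dense(x, W, b, g):
    """Fully connected layer: apply g to (W x + b)."""
    z = [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]
    return g(z)


def identity(z):
    """Placeholder activation used only to keep the example transparent."""
    return z


x = [1.0, 2.0, 3.0]            # input vector (3 features)
W = [[1.0, 0.0, -1.0],         # weight matrix: 2 outputs x 3 inputs
     [0.5, 0.5, 0.5]]
b = [0.5, -1.0]                # bias vector (one entry per output)

y = dense(x, W, b, identity)
```

In a classifier, `identity` would be replaced by an activation such as softmax so that the outputs can be read as class scores.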

4.4 Dropout Layer
When every feature is connected to the fully connected layer, the model is susceptible to overfitting the training dataset. Overfitting occurs when a model fits the training data so closely that its performance on new data suffers [23]. To tackle this issue, a dropout layer is used, which removes a few neurons from the neural network during training and thereby effectively produces a smaller model. With a dropout of 0.3, thirty percent of the nodes in the neural network are dropped at random.
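A dropout of 0.3 can be illustrated with the sketch below. One detail is an assumption beyond the text: surviving activations are rescaled by 1 / (1 - rate), the "inverted dropout" convention common in deep learning libraries, so that the expected activation is unchanged. The paper itself only states the rate.

```python
import random


def dropout(activations, rate=0.3, rng=None, training=True):
    """Zero each activation with probability `rate` during training.

    Survivors are scaled by 1 / (1 - rate) (inverted dropout, assumed here).
    At inference time the input is returned unchanged.
    """
    if not training or rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
out = dropout([1.0] * 10000, rate=0.3, rng=rng)
dropped_fraction = out.count(0.0) / len(out)
```

With many activations, `dropped_fraction` comes out close to 0.3, matching the thirty percent of nodes dropped at random.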

4.5 Activation Function
The activation function is one of the most important components of the CNN model. Activation functions are used to learn and approximate any kind of continuous and complex relationship between network variables, and they add nonlinearity to the network. ReLU, softmax, tanh, and sigmoid are activation functions used in deep learning models, each with its own specific use. Sigmoid or softmax functions are preferred for binary classification in a CNN model, while softmax is commonly used for multi-class classification [24].
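The difference between the two output activations can be sketched as follows. The logits and the order of the emotion labels are made up for illustration. Softmax turns a vector of scores into probabilities over all classes, while sigmoid squashes a single score into the range (0, 1).

```python
import math


def sigmoid(z):
    """Squash one score into (0, 1), as used for binary outputs."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical label order and made-up network outputs.
EMOTIONS = ["angry", "fear", "happy", "neutral", "sad", "surprise"]
logits = [0.2, -1.0, 3.1, 0.5, -0.3, 1.4]

probs = softmax(logits)
predicted = EMOTIONS[probs.index(max(probs))]
```

The predicted class is simply the label with the highest softmax probability.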

HAAR Cascade
Haar Cascade detection is a well-established face detection approach that has been around for quite some time. Haar features have been used to recognize faces, eyes, and lips. Like other machine learning models, it requires many positive images of faces and negative images of non-faces to train the classifier. The algorithm consists of four stages: calculating Haar features, creating integral images, AdaBoost training, and implementing cascade classifiers [25].

5.1 Calculating Haar Features
The initial step is to gather the Haar features. A Haar feature is calculated over adjacent rectangular regions at a given location in a detection window: the pixel intensities in each region are summed, and the difference between these sums is taken. Below are a few examples of Haar features [26].
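The calculation can be sketched for the simplest case, a two-rectangle (edge-type) feature. The window values are made up, and the sign convention (right half minus left half) is an assumption, since conventions differ between implementations.

```python
def region_sum(image, top, left, height, width):
    """Sum pixel intensities inside one rectangle of the detection window."""
    return sum(
        image[r][c]
        for r in range(top, top + height)
        for c in range(left, left + width)
    )


def two_rect_haar(image, top, left, height, width):
    """Edge-type Haar-like feature: right-half sum minus left-half sum."""
    half = width // 2
    left_sum = region_sum(image, top, left, height, half)
    right_sum = region_sum(image, top, left + half, height, half)
    return right_sum - left_sum


# Made-up window: dark on the left, bright on the right.
window = [
    [1, 1, 8, 8],
    [1, 1, 8, 8],
]
edge_response = two_rect_haar(window, 0, 0, 2, 4)

# A uniform window produces no response.
flat_response = two_rect_haar([[5, 5, 5, 5], [5, 5, 5, 5]], 0, 0, 2, 4)
```

A large magnitude indicates a strong intensity difference between the two rectangles, such as the boundary between a darker and a brighter facial region.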

5.2 Creating Integral Images
Instead of computing at every pixel, the integral image approach creates sub-rectangles and array references for each of them. The Haar features are then computed using these references. Note that, for object detection, nearly all of the Haar features are irrelevant, because the only features that matter are those of the object. AdaBoost is useful for choosing the best features out of hundreds of thousands of Haar features to represent an object [27].
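The saving that integral images provide can be sketched as follows: after one pass over the image, the sum of any rectangle needs only four array references. The extra zero row and column in the table is a common implementation convenience assumed here, and the image values are made up.

```python
def integral_image(image):
    """Return an (h+1) x (w+1) summed-area table with a zero top row and left column."""
    h, w = len(image), len(image[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += image[r][c]
            # Sum of everything above-left = table entry above + current row prefix.
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of a rectangle using four lookups in the integral image."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
ii = integral_image(image)
whole = rect_sum(ii, 0, 0, 3, 3)          # all nine pixels
lower_right = rect_sum(ii, 1, 1, 2, 2)    # 5 + 6 + 8 + 9
```

Because each rectangle sum costs the same four lookups regardless of its size, evaluating many Haar features per window stays cheap.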

5.3 AdaBoost Training
AdaBoost selects the most useful features and trains the classifiers that use them. It combines the weak classifiers used by the algorithm to detect objects into a strong classifier. Weak learners are produced by moving a window over the input image and computing Haar features for each region of the image. The resulting difference is compared to a learned threshold that distinguishes objects from non-objects. Because these are weak classifiers, combining them into a strong classifier is essential [28].
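How weak threshold classifiers are selected and combined can be sketched with a minimal discrete AdaBoost over decision stumps. This is a simplified illustration, not the detector's training code: the exhaustive stump search, the +1/-1 label encoding, and the toy feature values are all assumptions.

```python
import math


def train_adaboost(features, labels, rounds=3):
    """Train weighted decision stumps; labels are +1 (object) or -1 (non-object).

    Returns a list of (alpha, feature_index, threshold, polarity) tuples.
    """
    n = len(labels)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best = None
        # Search every feature, threshold, and polarity for the lowest weighted error.
        for j in range(len(features[0])):
            for thr in sorted({f[j] for f in features}):
                for pol in (1, -1):
                    preds = [pol if f[j] >= thr else -pol for f in features]
                    err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, preds)
        err, j, thr, pol, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, thr, pol))
        # Increase the weight of misclassified samples, then renormalize.
        weights = [w * math.exp(-alpha * y * p) for w, p, y in zip(weights, preds, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, f):
    """Strong classifier: sign of the alpha-weighted vote of the stumps."""
    score = sum(a * (pol if f[j] >= thr else -pol) for a, j, thr, pol in ensemble)
    return 1 if score >= 0 else -1


# Toy Haar-feature values per window (made up); feature 0 separates the classes.
features = [[2.0, 9.0], [3.0, 1.0], [8.0, 4.0], [9.0, 7.0]]
labels = [-1, -1, 1, 1]
ensemble = train_adaboost(features, labels)
```

Each round picks the single feature and threshold with the lowest weighted error, which is how the most useful Haar features are chosen from a very large pool.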

Implementing Cascade Classifiers
Each stage of the cascade classifier consists of a collection of weak learners. The weak learners are trained by boosting, which produces a highly accurate classifier from the average prediction of all the weak learners. Based on this prediction, the classifier decides either to mark an object as found or to move on to the next region. Because most windows do not contain anything of interest, the stages are designed to eliminate negative samples as quickly as possible [29].
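The early-rejection behavior of the cascade can be sketched as below. The stage layout, thresholds, and feature values are hypothetical, and each stage is simplified to a weighted vote of threshold tests compared against a stage threshold.

```python
def run_cascade(stages, window_features):
    """Evaluate a window stage by stage and reject as early as possible.

    stages: list of (weak_learners, stage_threshold), where each weak learner
    is (alpha, feature_index, threshold). Returns (accepted, stages_evaluated).
    """
    evaluated = 0
    for weak_learners, stage_threshold in stages:
        evaluated += 1
        score = sum(
            alpha
            for alpha, j, thr in weak_learners
            if window_features[j] >= thr
        )
        if score < stage_threshold:
            return False, evaluated  # negative window: stop immediately
    return True, evaluated


# Hypothetical two-stage cascade: a cheap first stage, a stricter second one.
stages = [
    ([(1.0, 0, 5.0)], 1.0),
    ([(0.6, 1, 2.0), (0.4, 2, 3.0)], 0.6),
]

face_like = run_cascade(stages, [7.0, 4.0, 1.0])
background = run_cascade(stages, [2.0, 9.0, 9.0])
```

The background window is discarded after a single stage, while only the promising window pays for the full cascade.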

Hierarchical Data Format 5
HDF5 is gaining popularity because of its portability. HDF5 files can be handled from programming languages including Python, MATLAB, Fortran, and C. It is widely used in the scientific world to store huge datasets. It is a file format for storing structured data rather than a model. Keras saves models in this format because the weights and the model configuration can easily be stored in a single file.

Data Set
The dataset used for this experiment covers six different facial emotions (Happy, Sad, Angry, Surprise, Fear, and Neutral). These are normally collected images with a size of 1920 × 2560 pixels. The training set contains 28,710 images, divided into the following categories: Happy with 5,300 samples, Sad with 5,110 samples, Angry with 4,500 samples, Surprise with 4,600 samples, Fear with 4,200 samples, and Neutral with 5,000 samples.

Performance Metrics
Accuracy can be used to assess the performance of the model. Accuracy refers to the percentage of correct predictions made by the model. It is calculated as
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP stands for "true positives," FN for "false negatives," TN for "true negatives," and FP for "false positives."
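A minimal sketch of this metric, computed from the four counts using the standard definition of accuracy (correct predictions divided by all predictions). The counts in the example are made up and are not results from the paper.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions: (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Made-up counts for illustration only.
example = accuracy(tp=45, tn=53, fp=1, fn=1)
```

Multiplying the result by 100 gives the accuracy as a percentage.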

Result
The proposed system suggests music, movies, and series based on the user's emotions as captured by the webcam. To detect the face in real-time video, the CNN algorithm is used. Depending on the facial expression and mood, the user is redirected to YouTube, which suggests songs, movies, or TV series.
Comparing the training results of MobileNet and the H5 model, the H5 model gives slightly higher accuracy than MobileNet. Even though convolutional neural network models such as MobileNet and ResNet recognize emotions and characters, there are a few contradictions in detection. For suggesting music, movies, and series, the pre-trained modified architecture gives a loss of 4% and an accuracy of 98%, as shown in Fig. 2.

Conclusion
An emotion-based music, movie, and series recommendation system is demonstrated in this project. Using the image collection, the model distinguishes six different facial expressions. It will be interesting to see how the system responds to the introduction of new emotions. In the future, it may be possible to divide users into categories such as adults and children and then recommend music, movies, and television shows based on those categories. Such an application might be valuable in helping people relax and lower their stress in today's technological world.