HiFormer: Hierarchical Transformer for Grounded Situation Recognition

Abstract. The prevalence of surveillance video is critical to public safety, but the sheer volume of footage overwhelms camera operators, and existing Object Detection and Action Recognition models are unable to identify the relevant events. In light of this, Grounded Situation Recognition (GSR) provides a practical way to recognize the events in a surveillance video: GSR identifies the noun entities (e.g., humans) and their actions (e.g., driving), and provides grounding frames for the involved entities. Compared with Action Recognition and Object Detection, GSR is more in line with human cognitive habits, allowing law enforcement agencies to better understand its predictions. However, a crucial issue with most existing frameworks is the neglect of verb ambiguity, that is, verbs that look superficially similar but have distinct meanings (e.g., buying vs. giving). Many existing works adopt a two-stage model that first blindly predicts the verb and then uses this verb to predict the semantic roles. Such frameworks ignore noun information during verb prediction, making them susceptible to misidentification. To address this problem and better discern ambiguous verbs, we propose HiFormer, a novel hierarchical transformer framework that directly and comprehensively considers the similar verbs of each image to more accurately identify the salient verb, the semantic roles, and the grounding frames. Compared with the state-of-the-art models in Grounded Situation Recognition (SituFormer and CoFormer), HiFormer shows an advantage of over 35% and 20% on the Top-1 and Top-5 verb accuracy, respectively, as well as 13% on the Top-1 noun accuracy.


Introduction
The rapidly developing surveillance technology is significantly improving our lives. At the end of 2019, the number of surveillance cameras in the world exceeded 770 million [1], and the market is projected to reach 69.1 billion dollars by 2026 [2]. Despite the high coverage of surveillance cameras, criminal activities still run rampant in places with an insufficient police force, even directly under high-resolution cameras. As the fields of Artificial Intelligence and Computer Vision develop, new solutions such as Biometric Identification [3,4,5], Object Detection/Tracking [6,7,8], Crowd Density Analysis [9,10,11,12,13], and Action Recognition [14,15,16,17,18,19] have come to light, which in theory could automatically detect objects or actions. However, this seemingly useful technology remains confined to the laboratory as a consequence of two deficiencies: 1) the narrow scope of the detection process (e.g., action-only or object-only) cripples model accuracy, since determining the nature of a situation requires a comprehensive consideration of multiple factors; 2) although these models can determine what action or object is in an image, they cannot recognize the details or causes of the entire event. As shown in Figure 1, an Object Detection model monotonously estimates the probability of each noun entity but makes no effort to recognize the action taking place.

To tackle these problems, Grounded Situation Recognition (GSR) [20], first proposed by Pratt et al., aims to recognize images following the human cognitive pattern. Unlike the Object Detection process shown in Figure 1, which mechanically predicts the likelihood of each noun entity, the task of GSR is to understand the given image from an event-based perspective: to identify the involved noun entities, their mutual interactions, and the relative and absolute locations of the noun entities. In the case of Figure 1, GSR recognizes not only the two men and the certificate, but also the apparent action of giving, as well as the classroom where the action takes place. This comprehensive image analysis makes GSR models more accurate, significantly reducing misidentifications. Furthermore, beyond traditional Situation Recognition, GSR also predicts a grounding frame for every relevant noun entity based on the semantic information of the salient action, the noun entities, and their mutual relations. This provides the model with the locations of noun entities within the image, thereby benefiting many downstream tasks such as multimedia understanding [21,22,23] and information retrieval [24,25,26]. In summary, Grounded Situation Recognition aims to predict the salient verb, the noun entities, and the grounding frames of an image by utilizing their crisscrossing semantic relations.

However, almost all existing works in this field neglect the existence of verb ambiguity (similar-looking verbs with different meanings). For instance, "providing," "buying," and "giving" have drastically different meanings despite looking quite similar (as shown in Figure 2). Most works in this field use identical two-stage frameworks, which first predict the verb and then use this information to predict the nouns. By completely ignoring noun information during verb prediction, these models cannot discern between many similar verbs whose only difference lies in their semantic roles.
Recent works (such as SituFormer [27] and GSRFormer [28]) try to tackle both of the above problems by proposing a three-stage transformer framework. On the one hand, they alternately update verbs and semantic roles by 1) predicting the verb in stage one; 2) predicting the semantic roles in stage two based on the verb; and 3) refining the verb features in stage three with the newly acquired semantic roles. On the other hand, when processing an image whose salient verb has been found, SituFormer also considers the other highly similar verbs. However, the approaches to both problems are still largely ineffective: 1) the interconnecting semantic relations between verbs and nouns remain underexploited, because the refinement process is not repeated; 2) the problem of verb ambiguity remains unsolved, because many of the computed similar verbs can still have drastically different meanings due to their different semantic roles. Therefore, without considering these semantic roles, the actually similar verbs remain neglected. Furthermore, these works overcomplicate the task by adding redundant and inherently ineffective modules.
To address this problem, we propose HiFormer, a novel transformer-based GSR model. HiFormer takes advantage of its hierarchical internal structure and directly considers all similarities during the decision-making process, significantly reducing the chance of misidentification and leading to more reliable and accurate predictions for GSR. We contribute to the task of Grounded Situation Recognition in the following three aspects: 1. We unravel the problems of previous works, pointing out their neglect of verb ambiguity. The lack of macroscopic analysis of the similarity between verbs keeps their accuracy low: despite repeated attempts at improvement, their performance remains unsatisfactory, for they oversimplify the retrieval of similar verbs and ignore the significance of semantic roles.
2. We propose a novel hierarchical transformer framework, taking advantage of its internal hierarchy to explicitly consider all similarities for each image.This approach effectively reduces the misidentifications caused by verb ambiguity, directly confronting the most crucial issue of the GSR task.
3. We achieve state-of-the-art accuracy on the challenging SWiG benchmark, surpassing all previous works by over 35% and 21% on the Top-1 and Top-5 verb accuracy, respectively, and by over 13% on the Top-1 noun accuracy.
Our proposed HiFormer can be applied to surveillance camera networks to alert the local authorities and medical centers in the event of criminal activity or risky behavior. Due to the rapid improvement of the economy and the development of technology in both developed and developing countries, surveillance technology will become increasingly widespread. HiFormer could therefore deter and drastically reduce criminal activity and, in the end, improve the quality of life of the general public as a whole.

Related Work
This section will briefly introduce some previous essential works on Transformers, Action Recognition, and Grounded Situation Recognition.

Progression of Action Recognition
According to an analysis conducted by Gella et al. [29], Action Recognition is the task of recognizing the activity taking place in pictures and videos. Most existing works can be classified into four categories: 1) Action Classification; 2) Human-Object Interaction (HOI) detection; 3) Visual Verb Sense Disambiguation; 4) Visual Semantic Role Labeling. In the beginning, Action Recognition took the form of Action Classification on small-scale datasets [30,31,32,33], which laid the foundation for numerous later works. However, this kind of classification is problematic in two respects. First, the methods could not be extended to large-scale datasets. Second, these works all assume a single verb label per image, ignoring the fact that multiple activities can take place in an image simultaneously. In response to these problems, Human-Object Interaction detection was proposed [34], which solved both of them. However, further deficiencies of HOI detection have come to light. First, it neglects that the same verb can have multiple meanings; for example, the verb "take" can mean accepting, acquiring, or carrying. Second, it ignores the importance of noun information in the verb prediction process, which is often the only difference between similar actions such as "riding a bicycle" and "riding a horse". These observations led to many arguments about how actions should be analyzed at the level of verb senses. Later, Gella et al. [36] proposed the new task of Visual Verb Sense Disambiguation, where each image is annotated with verb sense labels. However, although this task handles the ambiguity of verbs, it neither identifies nor localizes the noun entities within the images. Some recent works [37,38] address this deficiency with Visual Semantic Role Labeling, which not only predicts and identifies the semantic roles but also provides grounding frames to localize them.

Transformer in Grounded Situation Recognition
The Transformer framework [39] was first proposed by Vaswani et al. to tackle problems in Natural Language Processing. Its built-in attention mechanism allows it to easily model long-range dependencies between words and phrases without laboriously stacking up multiple layers, making it significantly more efficient than conventional Convolutional Neural Networks. Furthermore, due to the inherent structure of the attention block, transformers are much more parallelizable than Recurrent Neural Networks. The model has since been widely used and modified in Natural Language Processing, with variants including lightweight transformers [40,41,42], recurrent transformers [43,44], and hierarchical transformers [45,46,47]. According to a survey by Khan et al. [48], many works have tried to bring this successful model to Computer Vision. Its minimal need for inductive biases and its ability to handle long-range dependencies make it well suited to the task, more so than conventional convolutional neural networks. Furthermore, the robust design of the transformer model makes it competent in various sub-areas, such as video, image, and audio, without laborious modifications. The transformer model is therefore becoming increasingly popular in Computer Vision. In the specific field of Grounded Situation Recognition, the transformer was first introduced by Cho et al. in GSRTR [49], which replaces the object-centric queries of DETR [50] with semantic role queries. More recently, Wei et al. proposed SituFormer [27], a two-stage model that predicts the verb and the nouns separately with two transformer-based detectors, to improve the performance.

Hierarchical Transformer Framework
This section will further elaborate on the Hierarchical Transformer Framework (HiFormer).

Overview of HiFormer
To address the above problems, we reshape the transformer framework in Grounded Situation Recognition by proposing a renewed learning framework named HiFormer. As shown in Figure 3, HiFormer is a two-stage transformer-based model that directly tackles the problem of verb ambiguity. It computes and considers the similar images of each training-set image and thoroughly exploits the semantic verb-noun relations. In the first stage, the Leaf Transformer ($\mathrm{TRM}_{\text{leaf}}$) learns a preliminary representation for each image. Then, the support image set of each image is computed during the KNN retrieval process, which serves as a transition to the Root Transformer. Finally, the Root Transformer utilizes the support image sets to refine the representations of each image.
Formally, HiFormer can be represented by Eqs. 1-3:

$$\{v^{(0)}, r^{(0)}\} = \mathrm{TRM}_{\text{leaf}}(X) \quad (1)$$
$$\{v^{(0)}, r^{(0)}\}_{\text{sim}} = \mathrm{KNN}(v^{(0)}, r^{(0)}) \quad (2)$$
$$\{v, r\} = \mathrm{TRM}_{\text{root}}(v^{(0)}, r^{(0)}, \{v^{(0)}, r^{(0)}\}_{\text{sim}}) \quad (3)$$

where the three equations respectively denote the working procedures of the Leaf Transformer, the KNN retrieval process, and the Root Transformer. First, the Leaf Transformer learns the role representations $r^{(0)}$ and the preliminary verb representations $v^{(0)}$ of each image from the image features $X$ extracted by the CNN backbone. Afterwards, we compute a support image set $\{v^{(0)}, r^{(0)}\}_{\text{sim}}$ of size $k$ for each image in the retrieval process. Finally, the Root Transformer refines the preliminary representations of the original images with their support image sets.

Leaf Transformer
The Leaf Transformer is trained to independently learn the preliminary representations of the salient verb and its corresponding semantic roles. As shown in Figure 4, it consists of an Encoder and a Decoder. The Leaf Transformer Encoder learns the representation of the salient verb and predicts the preliminary verb category, while the Leaf Transformer Decoder learns the corresponding semantic role representations based on the salient verb.

Representation of Salient Verb
As shown in Figure 4, the CNN backbone first extracts a feature map $F \in \mathbb{R}^{c \times h \times w}$ from the image, which is transformed into a sequence of image features $[x_1, x_2, \ldots, x_{hw}]$ by a $1 \times 1$ convolutional layer and a flattening operator, where each element $x_i \in \mathbb{R}^d$ represents the features of a single pixel. Inspired by ViT [51], we initialize a learnable verb token $t_v \in \mathbb{R}^d$ to represent the salient verb. The verb token and the image features are then combined with a positional embedding $E_{\text{pos}}$ to take positional information into account. Finally, the encoded sequence is fed to the Leaf Transformer Encoder, which is equipped with a multi-head self-attention module.
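A minimal PyTorch sketch of this encoder front-end is given below. The backbone channel count, hidden size, and layer counts are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of the Leaf Transformer Encoder front-end: 1x1 projection,
# flattening, a learnable verb token, positional embedding, self-attention.
import torch
import torch.nn as nn

class LeafEncoder(nn.Module):
    def __init__(self, c_backbone=2048, d=512, nhead=8, num_layers=6):
        super().__init__()
        self.proj = nn.Conv2d(c_backbone, d, kernel_size=1)   # 1x1 conv
        self.verb_token = nn.Parameter(torch.randn(1, 1, d))  # learnable t_v
        layer = nn.TransformerEncoderLayer(d, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, fmap, pos):
        # fmap: (B, c, h, w) CNN feature map; pos: (B, h*w, d) positional emb.
        x = self.proj(fmap).flatten(2).transpose(1, 2)   # (B, h*w, d)
        x = x + pos                                      # add E_pos
        tok = self.verb_token.expand(x.size(0), -1, -1)  # one token per image
        out = self.encoder(torch.cat([tok, x], dim=1))   # self-attention
        verb_emb, img_emb = out[:, 0], out[:, 1:]        # split the outputs
        return verb_emb, img_emb

enc = LeafEncoder()
v, x = enc(torch.randn(2, 2048, 14, 14), torch.randn(2, 196, 512))
```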

Representation of Semantic Roles
Before entering the decoder, a verb classifier determines the preliminary verb category $\hat{v}$ based on the verb embedding produced by the encoder. From this category we fetch the corresponding semantic roles and initialize them as a role embedding vector $[e_1, \ldots, e_m]$, where $m$ is the number of roles and each element $e_i \in \mathbb{R}^d$ represents the embedding of a single role.
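The following sketch illustrates this classifier-and-lookup step. The class counts follow the SWiG statistics quoted in the Dataset section, while `verb_to_roles` is a dummy stand-in for the real verb-frame lookup.

```python
# Sketch of the step between encoder and decoder: a linear verb classifier
# picks the preliminary verb, whose role ids index a role-embedding table.
import torch
import torch.nn as nn

d, num_verbs, num_roles = 512, 504, 190
verb_classifier = nn.Linear(d, num_verbs)
role_table = nn.Embedding(num_roles, d)
# Dummy frame lookup: every verb mapped to the same roles for illustration.
verb_to_roles = {v: [3, 7, 12] for v in range(num_verbs)}

verb_emb = torch.randn(1, d)                  # from the Leaf encoder
v_hat = verb_classifier(verb_emb).argmax(-1).item()
roles = torch.tensor(verb_to_roles[v_hat])    # fetch this verb's roles
role_emb = role_table(roles)                  # (m, d) -> decoder input
```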
After acquiring the verb, semantic role, and image embeddings, we feed them to a transformer decoder module to further learn the preliminary representations of the verb $v^{(0)}$ and the semantic roles $[r_1^{(0)}, \ldots, r_m^{(0)}]$, where $v^{(0)}, r_1^{(0)}, \ldots, r_m^{(0)}$ are all real-valued vectors of size $1 \times d$.
Note that the Leaf Transformer Decoder is equipped with a multi-head cross-attention block, in which the image embedding with positional encoding serves as the query of the attention mechanism, and the concatenated vector of verb features and role features serves as both the key and the value.
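A sketch of this cross-attention step, under the assumption of a single attention layer and illustrative shapes:

```python
# Cross-attention as described above: the position-encoded image embedding
# acts as the query; the concatenated verb/role features act as key and value.
import torch
import torch.nn as nn

d, nhead = 512, 8
cross_attn = nn.MultiheadAttention(d, nhead, batch_first=True)

B, hw, m = 2, 196, 3                 # batch, pixels, number of roles
img_emb = torch.randn(B, hw, d)      # encoder output
pos = torch.randn(B, hw, d)          # positional encoding
verb_feat = torch.randn(B, 1, d)     # verb token after the encoder
role_emb = torch.randn(B, m, d)      # [e_1, ..., e_m]

kv = torch.cat([verb_feat, role_emb], dim=1)        # key = value
out, _ = cross_attn(query=img_emb + pos, key=kv, value=kv)
print(out.shape)                                    # (B, hw, d)
```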

Retrieval Process
In this stage, we compute the support image set of each image (as shown in Figure 5), which consists of the $k$ images with the highest cosine similarity to the said image.

Computation of the Support Image Set
Before the actual computation begins, we split the Training/Validation/Test datasets into segments of around 10,000 images each to keep the pairwise similarity calculation tractable. Next, for each segment, we calculate the pairwise cosine similarity of all the images within it using a brute-force method with $O(n^2 \cdot d)$ time complexity, where $n$ is the segment size and $d$ is the feature dimension. Finally, for each image, we find the $k$ images in its segment with the highest cosine similarity to it and save this information in a hash table, as sketched below.
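A sketch of this precomputation under the stated assumptions (feature size and $k$ are illustrative):

```python
# Segment-wise support-set precomputation: brute-force pairwise cosine
# similarity within a segment, then each image's top-k neighbours are
# stored in a dictionary ("hash table").
import torch

def build_support_table(feats, ids, k=4):
    # feats: (n, d) image representations of one segment; ids: image ids.
    f = torch.nn.functional.normalize(feats, dim=1)
    sim = f @ f.t()                        # (n, n) cosine similarities
    sim.fill_diagonal_(float('-inf'))      # exclude the image itself
    topk = sim.topk(k, dim=1).indices      # k nearest neighbours per image
    return {ids[i]: [ids[j] for j in row.tolist()]
            for i, row in enumerate(topk)}

table = build_support_table(torch.randn(1000, 512),
                            ids=list(range(1000)), k=4)
```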

Retrieval of the Support Image Set
Using the verb and role representations computed by the Leaf Transformer, we retrieve a support set $\{I\}_{\text{sim}}$ of the $k$ images most similar to the current image, in order to consider all the possibilities and prevent misidentification among similar verbs. The similarity is measured as

$$S(I, I_j) = \frac{f_I \cdot f_{I_j}}{\lVert f_I \rVert \, \lVert f_{I_j} \rVert}, \quad I_j \in G,$$

where $f_I$ denotes the representation of image $I$ and $G$ is the current image segment. Note that in the actual training and evaluation processes, the support image set of each image has already been computed, so the retrieval reduces to an $O(1)$ lookup in the preprocessed hash table.
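With the table built above, the retrieval itself reduces to a plain dictionary lookup; the hypothetical helper below illustrates the $O(1)$ access.

```python
# O(1) retrieval from the precomputed table (hypothetical helper).
def retrieve_support(image_id, table):
    return table[image_id]   # the k most similar images, precomputed
```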

Root Transformer
In this stage, the Root Transformer is trained to refine the verb and role features in an iterative and alternating way. Before the main procedure begins, we use the pre-trained CNN backbone, the Leaf Transformer, and the hash table to extract the preliminary representations from the raw images. Note that the Leaf Transformer is already fully trained in the preliminary stage and does not participate in loss calculation or backpropagation here. The four main steps are: 1) retrieval of the support image set; 2) computation of neural messages; 3) refinement of the semantic role features using the previously acquired verb information; 4) refinement of the verb features using the previously acquired semantic role information. These steps are repeated five times, where the superscript $(t)$ denotes the messages and representations of the $t$-th iteration, as sketched below.
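A schematic sketch of the loop, with the three refinement modules passed in as placeholders for the components described in the following subsections:

```python
# Alternating refinement: messages are computed and verb/role features are
# updated in turn for T=5 iterations, starting from the Leaf outputs.
def root_refine(v0, r0, support, compute_messages, refine_roles,
                refine_verb, T=5):
    v, r = v0, r0
    for t in range(T):
        msgs = compute_messages(v, r, support)   # neural message passing
        r = refine_roles(r, msgs)                # roles first, using verbs
        v = refine_verb(v, msgs)                 # then verbs, using roles
    return v, r
```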

Computation of Neural Messages
To allow the decoder to simultaneously consider all the images in the support image set, we use the Root Transformer Encoder to compute a compacted semantic message, following the neural message passing mechanism.

Refinement of Semantic Role Messages
For each role representation vector $r_i^{(t)}$ of the original image, we utilize the semantic relations between it and the $k+1$ verb representation vectors to compute an update message $m_{\text{all}}^{(t)} \in \mathbb{R}^{1 \times d}$.
More specifically, we aggregate the messages $m_{v}^{(t)}, m_{v_1}^{(t)}, m_{v_2}^{(t)}, \ldots, m_{v_k}^{(t)}$ along their second dimension using a Fully Connected Network, which corresponds to the Agg module in Figure 6. The update message and the role embedding vector are then fed to a transformer sub-layer consisting of two Layer Normalization modules with a Feedforward Network in between, which updates the role feature $r_i^{(t)}$ to $r_i^{(t+1)}$.
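A PyTorch sketch of one such role-refinement sub-layer; interpreting the Agg module as a linear layer over the stacked messages, as well as all sizes, are assumptions:

```python
# One role-refinement step: the k+1 verb messages are aggregated by a
# fully connected layer (Agg), then a LayerNorm -> FFN -> LayerNorm
# sub-layer updates the role feature r^(t) to r^(t+1).
import torch
import torch.nn as nn

class RoleRefine(nn.Module):
    def __init__(self, d=512, k=4):
        super().__init__()
        self.agg = nn.Linear(k + 1, 1)            # aggregate the messages
        self.norm1 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(),
                                 nn.Linear(4 * d, d))
        self.norm2 = nn.LayerNorm(d)

    def forward(self, role_feat, verb_msgs):
        # role_feat: (B, d); verb_msgs: (B, k+1, d) from the support set.
        m_all = self.agg(verb_msgs.transpose(1, 2)).squeeze(-1)  # (B, d)
        h = self.norm1(role_feat + m_all)          # inject the message
        return self.norm2(h + self.ffn(h))         # r^(t) -> r^(t+1)
```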

Refinement of Verb Messages
Similarly, the verb representations of the original image are refined by the semantic role representations. We first aggregate the role-to-verb messages $m_{r_1}^{(t)}, \ldots, m_{r_m}^{(t)}$ as in the previous section to compute the update message. Then, we feed it to another transformer sub-layer, along with the to-be-updated verb features, to conduct the refinement; the structure mirrors the role-refinement sub-layer sketched above.

Preliminary Stage
In the preliminary stage, we calculate loss functions for the three main outputs: 1) the preliminary verb category predicted by the verb classifier after the encoder module; 2) the preliminary verb representations produced by the decoder; 3) the preliminary noun representations produced by the decoder, where the first output represents the encoder and the second and third outputs represent the decoder. Although the preliminary verb category does not participate in the later stages, we still impose a loss on it, since it plays a crucial role in the input of the Leaf Transformer Decoder. The loss functions take the standard cross-entropy form:

$$\mathcal{L}_{v_1} = \mathrm{CE}(\hat{v}_1, v^{\mathrm{gt}}), \quad \mathcal{L}_{v_2} = \mathrm{CE}(\hat{v}_2, v^{\mathrm{gt}}), \quad \mathcal{L}_{\mathrm{noun}} = \sum_{i=1}^{m} \mathrm{CE}(\hat{n}_i, n_i^{\mathrm{gt}}),$$

where $\hat{v}_1$ is the verb category predicted by the verb classifier between the encoder and decoder in the preliminary stage, and $\hat{v}_2$, $[\hat{n}_{1 \ldots m}]$, and $[\hat{b}_{1 \ldots m}]$ are the verb, noun, and bounding box predictions, respectively, based on the preliminary representations acquired at the end of the preliminary stage. The purposes of the three loss functions are: 1) $\mathcal{L}_{v_1}$ optimizes the preliminary verb category $\hat{v}_1$; 2) $\mathcal{L}_{v_2}$ assists the optimization of the verb prediction $\hat{v}_2$ based on the preliminary representation of the salient verb; 3) $\mathcal{L}_{\mathrm{noun}}$ quantifies the loss of the noun predictions $[\hat{n}_{1 \ldots m}]$ based on the preliminary representations of the semantic roles.
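A sketch of this loss computation under the cross-entropy formulation above; the equal weighting of the three terms is an assumption:

```python
# Preliminary-stage losses: encoder-side verb, decoder-side verb, and the
# per-role noun classification, all as cross-entropy.
import torch
import torch.nn.functional as F

def preliminary_loss(v1_logits, v2_logits, noun_logits, v_gt, noun_gt):
    # v1_logits/v2_logits: (B, num_verbs); noun_logits: (B, m, num_nouns);
    # v_gt: (B,) verb labels; noun_gt: (B, m) noun labels per role.
    l_v1 = F.cross_entropy(v1_logits, v_gt)
    l_v2 = F.cross_entropy(v2_logits, v_gt)
    l_noun = F.cross_entropy(noun_logits.flatten(0, 1),
                             noun_gt.flatten(0, 1))
    return l_v1 + l_v2 + l_noun   # equal weighting assumed
```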

Refinement Stage
As in the Leaf Transformer Decoder, we optimize the verb, the semantic roles, and the bounding boxes via cross-entropy loss functions. Analogously to the preliminary stage, the refinement losses are

$$\mathcal{L}_{\mathrm{verb}} = \mathrm{CE}(\hat{v}, v^{\mathrm{gt}}), \quad \mathcal{L}_{\mathrm{noun}} = \sum_{i=1}^{m} \mathrm{CE}(\hat{n}_i, n_i^{\mathrm{gt}}),$$

where $\hat{v}$, $[\hat{n}_{1 \ldots m}]$, and $[\hat{b}_{1 \ldots m}]$ are the verb, noun, and bounding box predictions based on their corresponding refined representations. The respective purposes of these loss functions are identical to those above.

Process of Evaluation and Inference
During non-training processes such as evaluation, our framework produces the result straightforwardly.
The evaluation process consists of five simple steps: 1) the CNN backbone extracts the image features from the raw images; 2) the Leaf Transformer produces the preliminary verb and semantic role representations of the images; 3) we retrieve the support image sets of the images from the precomputed hash table (note that the support sets of the validation/test images are also precomputed); 4) the Root Transformer refines the verb and semantic role representations; 5) the verb category, noun category, and bounding box predictors use the refined representations to produce the final outputs of Grounded Situation Recognition.
In addition, the inference process for custom images is highly similar to the evaluation process above. However, since we cannot precompute the support image set of an inference image during training, the inference process differs from the evaluation process in the third step: instead of using the precomputed hash table, we compute the image's support set on the fly by calculating its cosine similarity with every image in the training set, as sketched below.
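A sketch of this on-the-fly fallback, assuming the training-set representations are cached as a feature matrix:

```python
# Inference-time support retrieval: score the query against every cached
# training-set representation and keep the k most similar images.
import torch

def infer_support(query_feat, train_feats, k=4):
    # query_feat: (d,); train_feats: (N, d) Leaf Transformer outputs.
    q = torch.nn.functional.normalize(query_feat, dim=0)
    t = torch.nn.functional.normalize(train_feats, dim=1)
    sims = t @ q                     # cosine similarity to every image
    return sims.topk(k).indices      # indices of the support images
```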

Dataset
We use the most dominant dataset in Grounded Situation Recognition, the SWiG benchmark, to train and evaluate HiFormer. SWiG builds upon the imSitu dataset, retaining the original images and frame annotations while providing additional grounding frames for each image's visible semantic roles. It contains 126,102 images, 504 verb classes, and 190 semantic role classes, where each verb is associated with 1 to 6 corresponding semantic roles. Each image carries three sets of annotations made by different annotators. We split the data into Training/Validation/Test sets of 75K/25K/25K images, respectively, following the official dataset split.

Evaluation Metrics
For HiFormer, we use the evaluation metrics for Grounded Situation Recognition proposed by Pratt et al. [20], which stand as follows: 1) verb: the accuracy of the verb prediction; 2) value: the accuracy of predicting a single semantic role; 3) value-all: the accuracy of correctly predicting all the semantic roles of an image simultaneously; 4) grnd: the accuracy of single bounding box predictions; 5) grnd-all: the accuracy of correctly predicting all the bounding boxes at once. We deem a bounding box prediction correct if its IoU with the ground truth bounding box is above 0.5 (see the sketch after this paragraph). We further apply the above metrics under three different settings: 1) Top-1-Verb: only the top-1 verb, its corresponding semantic roles, and its grounding frames are scored; 2) Top-5-Verb: the top-5 verbs, their semantic roles, and their grounding frames are scored; 3) Ground-Truth-Verb: the ground truth verb is known before the prediction, so only the accuracy of the roles and frames is calculated. Note that in the first two settings, the role and bounding box predictions are automatically considered incorrect if the top-1 or top-5 verbs do not include the ground truth verb.
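A sketch of the IoU test used by the grounding metrics, assuming corner-format boxes $(x_1, y_1, x_2, y_2)$:

```python
# A predicted box counts as correct when its IoU with the ground truth
# exceeds 0.5, as in the grnd/grnd-all metrics above.
def box_correct(pred, gt, thresh=0.5):
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_p + area_g - inter) > thresh
```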

Performance Comparison with State-of-the-Art Models
As shown in Tables 1 and 2, HiFormer achieves state-of-the-art verb and noun accuracy under the Top-1 and Top-5 verb settings. Compared to the current best-performing GSR model, CoFormer [58], the improvement in verb prediction accuracy ranges from 21% under the Top-5 Verb setting to 35% under the Top-1 Verb setting. Furthermore, HiFormer improves the noun and bounding box accuracy by 14% and 6%, respectively, under the Top-1 Verb setting. However, our model shows some deficiency under the Ground-Truth-Verb metric, as well as in the value-all and grnd-all accuracy (in which a prediction only counts as correct if all nouns or all grounding boxes are predicted correctly). Despite this deficiency, the astounding improvement in verb prediction accuracy demonstrates the effectiveness of our framework in resolving verb ambiguity.

Conclusion
We propose a novel two-stage hierarchical transformer framework that simultaneously considers all similarities for each image instance. With this improved framework, HiFormer outperforms all state-of-the-art models in verb and noun accuracy. Compared to the two current best-performing models, CoFormer [58] and SituFormer [27], HiFormer prevails by over 35% on the Top-1 verb accuracy, 13% on the Top-1 noun accuracy, and 21% on the Top-5 verb accuracy. Nevertheless, some limitations of HiFormer remain in the bounding box prediction and the accuracy under the Ground-Truth-Verb setting, which we intend to explore further in the future. Our hierarchical framework provides a foundation for future Grounded Situation Recognition works in solving the bottleneck of many downstream applications such as E-commerce [59,60,61,62,63], Intelligent Transportation [64,65,66], etc. We believe our work will help move Grounded Situation Recognition out of the laboratory and into surveillance networks to improve people's lives.