Summary of deep neural network pruning algorithms

As deep learning has rapidly progressed in the 21st century, artificial neural networks have been continuously enhanced with deeper structures and larger parameter sets to tackle increasingly complex problems. However, this development also brings high computational and storage costs, which limit the application of neural networks in some practical scenarios. As a result, in recent years more researchers have proposed and implemented network pruning techniques to decrease the computational and storage expenses of neural networks while retaining comparable accuracy. This paper reviews the research progress of network pruning techniques and categorizes them into unstructured and structured pruning. Finally, the shortcomings of current pruning techniques and possible future development directions are pointed out.

original accuracy of the network and reducing its computational complexity, thus achieving effective compression. Of course, besides pruning, there are many other methods for compressing networks, such as knowledge distillation [5] and quantization [6]. This article mainly reviews network pruning methods and points out the shortcomings of current pruning methods and possible future development directions at the end, as shown in Figure 1.
Methods for pruning artificial neural networks can be broadly categorized into two types: unstructured pruning and structured pruning. Unstructured pruning removes individual weights, thereby increasing the sparsity of the model. Structured pruning, on the other hand, has a larger granularity and prunes whole kernels or channels; because it changes the original structure of the network, it is called "structured pruning."

Unstructured pruning
Unstructured pruning appeared early. As early as 1990, LeCun et al. [4] proposed the optimal brain damage (OBD) method, an early pruning method based on connection importance. They built a local model of the error function via a Taylor expansion and deleted the parameters predicted to have little effect, reducing network size and improving generalization. However, although OBD assumes the Hessian matrix is diagonal and ignores the off-diagonal terms, it still needs to compute part of the Hessian matrix, which greatly increases the computational cost of optimization.
Han et al. [7] proposed to prune the network by learning which connections are essential. Iteratively removing connections whose weights fall below a certain threshold can significantly reduce the number of parameters in the final network. Han et al. additionally integrated quantization and Huffman coding to compress the model further, shrinking the parameter count of the ImageNet-trained VGG16 model from 138M to 11.3M. However, these methods require considerable training time and special hardware support.
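Han et al.'s full pipeline is tied to their training code, but the core thresholding step can be sketched as follows (a minimal NumPy illustration; the function name and interface are our own, and the returned mask would be applied after every fine-tuning update to keep pruned positions at zero):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.

    Returns the pruned weights and a binary mask; the mask is kept so that
    pruned positions stay at zero during subsequent fine-tuning.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy(), np.ones_like(weights)
    # Threshold = k-th smallest magnitude; everything at or below it is removed.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask
```

In an iterative scheme, pruning and fine-tuning alternate: prune a little, retrain to recover accuracy, then prune again.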
Generally, regularization-based pruning methods constrain the network parameters during training with norms. The l0 norm is one of the most basic: for a vector x, ||x||_0 = |{ i : x_i ≠ 0 }|, that is, the number of non-zero elements in x. However, the l0 norm is not differentiable, making it unusable in ordinary gradient-based training. Typically, it is approximated by other norms, most commonly the l1 and l2 norms. The l1 norm of a vector is the sum of the absolute values of all its elements: ||x||_1 = Σ_i |x_i|. Louizos et al. [8] proposed using variational optimization theory to make a loss function containing the l0 norm differentiable, thereby solving the problem that the l0 norm cannot be differentiated.
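The three norms discussed above can be made concrete with a short NumPy example (illustrative only; in practice the l1 term would be added to the training loss as a penalty):

```python
import numpy as np

# A sparse weight vector: two of four entries are already zero.
w = np.array([0.0, -1.5, 0.0, 2.0])

l0 = np.count_nonzero(w)      # number of non-zero entries (not differentiable)
l1 = np.abs(w).sum()          # sum of absolute values, a convex surrogate for l0
l2 = np.sqrt((w ** 2).sum())  # Euclidean norm, shrinks weights but rarely to exactly zero
```

The l1 norm is differentiable almost everywhere, which is why it is the usual stand-in for the l0 norm during training.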
In 2018, Zhang et al. [9] suggested a technique that transformed the weight-pruning problem of neural networks into a constrained non-convex optimization problem and employed the Alternating Direction Method of Multipliers (ADMM) algorithm for network pruning.
Although unstructured pruning increases the sparsity of the network by directly removing parameters while maintaining the network's original accuracy, its effectiveness is often limited in practical scenarios. Firstly, storing sparse matrices in the network incurs additional memory costs. Secondly, there is currently a scarcity of software and hardware platforms that can efficiently perform sparse matrix operations, which makes it challenging to replicate the experimental findings reported in the literature. Finally, optimizing the network demands a considerable amount of time and resources.

Structured pruning
In recent years, convolutional neural networks (CNNs) have become ubiquitous in deep learning. While unstructured pruning can effectively reduce the size of neural networks to some extent, it faces particular challenges when applied to convolutional neural networks, including the following issues:
• Unstructured pruning usually considers and prunes each weight separately. This may lead to an irregular distribution of the remaining weights, destroying the local connectivity and weight sharing of convolutional neural networks.
• Unstructured pruning produces irregular sparsity that convolution implementations cannot exploit, resulting in reduced efficiency of convolution operations.
• Convolutional neural networks typically concentrate a large share of their parameters in the fully connected layers. These densely connected parameters are difficult for unstructured pruning to handle effectively.
Therefore, structured pruning is usually used for pruning convolutional neural networks. This method can preserve the local connectivity and weight-sharing characteristics of convolutional neural networks and can competently deal with the densely connected parameters of the fully connected layers. This paper divides structured pruning into Filter-wise, Channel-wise, Group-wise, and Stripe-wise. Next, these four pruning methods will be introduced and compared.

Filter-wise pruning
Convolution kernel pruning is an essential structured pruning method that reduces the computational complexity and the number of network parameters by deleting some convolution kernels with small contributions in the network and maintaining the network performance as much as possible. The following will introduce four typical convolution kernel pruning algorithms.
The rank of a matrix often reflects the amount of information it carries. Lin et al. [10] found that the average rank of the feature maps generated by a single convolution kernel is almost the same regardless of the input batch. Experiments further verified that the larger the rank of a feature map, the more information it contains. They therefore proposed the HRank method: the importance of different convolution kernels in the model is obtained by analyzing the ranks of their feature maps, and the kernels that generate low-rank feature maps are pruned to compress the model.
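A minimal sketch of the HRank criterion, assuming the feature maps produced by one filter over a batch are available as a NumPy array (the function name is illustrative, not from the original paper):

```python
import numpy as np

def average_feature_map_rank(feature_maps):
    """Score a filter by the average matrix rank of the feature maps it
    produces over a batch (a sketch of the HRank criterion).

    feature_maps: array of shape (batch, H, W) from one filter.
    """
    return float(np.mean([np.linalg.matrix_rank(fm) for fm in feature_maps]))
```

Filters whose score falls below a chosen threshold would be pruned; a rank-1 feature map (e.g. an outer product of two vectors) scores far below a full-rank one.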
ThiNet [11] is a filter-level pruning method for accelerating convolutional neural networks proposed by Luo et al. Its central idea is to use statistics computed from the next layer, rather than the current layer, to decide which filters to prune. Specifically, ThiNet samples data from the training set and greedily selects the subset of a layer's input channels whose removal changes that layer's output the least; the filters in the previous layer that produce the discarded channels are then pruned, and the remaining weights are fine-tuned. By removing filters whose feature maps contribute little to the next layer, the computation and storage of the network can be reduced.
In 2018, He et al. [12] observed that much previous work was Hard Filter Pruning, that is, convolution kernels are deleted directly and unrecoverably, which may reduce the network's learning ability. They therefore proposed Soft Filter Pruning (SFP). Precisely, the l2 norm of each filter is calculated during the k-th training epoch, the filters with smaller norms are set to zero, and the (k+1)-th epoch of training is then performed. After the filter weights are updated by error backpropagation, the norm calculation and zeroing operations are repeated until training ends; finally, the filters with smaller norms are pruned. This method dynamically adjusts which filters are pruned during training, so that the accuracy after pruning can even exceed that of the original network.
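One SFP-style zeroing step can be sketched as follows (a simplified NumPy illustration, not the authors' code; the zeroed filters stay in the network so later training epochs can revive them):

```python
import numpy as np

def soft_filter_prune_step(filters, prune_ratio):
    """Zero the filters with the smallest l2 norms, keeping them in place.

    filters: array of shape (num_filters, C, K, K).
    """
    norms = np.sqrt((filters ** 2).reshape(len(filters), -1).sum(axis=1))
    num_prune = int(prune_ratio * len(filters))
    pruned = filters.copy()
    if num_prune > 0:
        idx = np.argsort(norms)[:num_prune]  # indices of smallest-norm filters
        pruned[idx] = 0.0                    # zeroed, not removed
    return pruned
```

After the final epoch the zeroed filters would be physically removed, yielding the compressed network.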
He et al. proposed the FPGM [13] algorithm in 2019, which compresses the model by pruning redundant filters rather than filters with relatively small norms. The primary concept of FPGM is to determine each filter's significance by its distance to the geometric median of the filters in the layer. During training, the FPGM algorithm sorts the filters by this distance and deletes a certain proportion of those closest to the geometric median, since they are the most replaceable. Because FPGM does not rely solely on norm-based pruning criteria, it can more accurately evaluate the importance of each filter and has a good pruning effect (Figure 2).

Channel-wise pruning
Channel pruning compresses the network by removing channels that contribute little and carry unimportant information. The following introduces several typical channel pruning methods (Table 1). Liu et al. [14] proposed that each channel in the network can be assessed by introducing a scaling factor: each channel's output is multiplied by a factor, and these factors are trained jointly with the network under a sparsity regularizer. Following this idea, the training loss is defined as

L = Σ_(x,y) l(f(x, W), y) + λ Σ_γ g(γ)

The first term is the ordinary prediction loss, where x is the network input and y is the training label. The second term is the sparsity regularization of the channel scaling factors, where γ is a channel scaling factor and g(·) is a sparsity-inducing norm; in the authors' paper the l1 norm is used, g(γ) = |γ|, and λ is a hyperparameter balancing the two terms. After training, the channel scaling factors γ will have changed with the iterations of the network. A small γ indicates that the corresponding channel has low importance, so channels with small factors can be deleted to reduce network size with little loss of accuracy.
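The selection rule and the sparsity penalty above can be sketched in a few lines (a NumPy illustration with hypothetical function names, not the authors' implementation):

```python
import numpy as np

def slim_channels(gammas, prune_fraction):
    """Network-slimming-style selection (sketch): mark for removal the
    channels whose scaling factor gamma has the smallest magnitude.

    Returns a boolean keep-mask over the channels.
    """
    k = int(prune_fraction * len(gammas))
    order = np.argsort(np.abs(gammas))
    keep = np.ones(len(gammas), dtype=bool)
    keep[order[:k]] = False
    return keep

def l1_penalty(gammas, lam):
    """The sparsity term added to the training loss: lambda * sum |gamma|."""
    return lam * np.abs(gammas).sum()
```

During training the l1 penalty drives many factors toward zero, making the subsequent thresholding step well-separated.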
The original paper's channel scale factor comes from the scale factor of the BN layer [15]. The BN layer computes

z_out = γ (z_in − μ_B) / sqrt(σ_B^2 + ε) + β

where z_in is the BN layer input, μ_B is the mini-batch mean, σ_B^2 is the mini-batch variance, z_out is the BN layer output, and γ and β are trainable parameters. γ is the scale factor, which is directly used as the channel scale factor to judge the importance of the channel. However, since some networks may not have BN layers, Huang and Wang [16] improved the method by introducing additional scale factors to make it more general.
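The BN transform above can be written directly in NumPy (a per-channel sketch over a batch; γ is the scale factor that network slimming reuses as the channel importance score):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over a batch, one gamma/beta per channel.

    x: array of shape (batch, channels).
    """
    mu = x.mean(axis=0)           # per-channel mini-batch mean
    var = x.var(axis=0)           # per-channel mini-batch variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```

If γ for a channel is driven to (near) zero by the sparsity penalty, that channel's output collapses to the constant β, which is why it can be removed with little effect.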
Although the methods above are effective, they must train the entire network from scratch while pruning, which may incur much extra time overhead. Therefore, there are also pruning algorithms for already-trained models, which only need to fine-tune the network after pruning.
Polyak and Wolf [17] argued that an input channel contributes differently to the outputs of different convolution kernels, so they proposed a method based on eliminating low-activity channels: the variance of each channel's output values is computed and used as the channel activity index. Specifically, for a convolutional layer with input x, weights W, and C input channels, the output is

y = Σ_{c=1}^{C} x_c * W_c

so the activity contributed by channel c comes from x_c * W_c, and its variance over the input data can be calculated. The larger the variance, the richer the features learned by the channel and the greater its contribution to the result. By setting a threshold and deleting the less active channels, the network's computation and storage can be effectively reduced.
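The activity index can be sketched as follows (a simplified NumPy illustration; the function name is ours, and the variance is taken over the batch and spatial positions of each channel's output):

```python
import numpy as np

def channel_activity(outputs):
    """Variance-based channel activity: for each channel, the variance of
    its output values over a batch. Low-variance channels are pruning
    candidates, since they respond almost identically to every input.

    outputs: array of shape (batch, channels, H, W).
    """
    b, c = outputs.shape[:2]
    return outputs.reshape(b, c, -1).var(axis=(0, 2))
```

A channel that emits the same value for every input has zero activity and would fall below any positive threshold.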
There is also a class of channel pruning algorithms that use machine learning and search algorithms to find good compressed network structures. Hu et al. [18] proposed that channel pruning can be regarded as a combinatorial optimization problem with an exponential solution space, which a genetic algorithm can solve; to improve the search efficiency of the genetic algorithm, they designed a two-step approximate fitness function. Chang et al. [19] divided the optimization process into four parts. First, clustering by the resemblance of feature maps provides an initial pruning. Second, a population initialization technique converts the network structure into a set of candidate populations. These populations are then searched with particle swarm optimization to identify the optimal compressed structure. Finally, the pruned network undergoes a fine-tuning phase (Figure 3). In Figure 3, f1 is a set of convolution kernels, k1 is the first flattened convolution kernel in f1, s1 is the first input channel, and l1 is a flattened patch in the input channel; the left side shows the network before pruning, and the right side shows the network after pruning.

Group-wise pruning
Most convolution operations are not implemented as directly as sliding the convolution kernel over the input channels, because that formulation is hard to accelerate; it is easier to optimize by converting the convolution into a matrix multiplication. On the one hand, after the transformation into a matrix, the operand data is stored contiguously in memory, which is convenient for hardware acceleration such as caching. On the other hand, many libraries implement efficient matrix multiplication, such as BLAS [20], which can significantly accelerate the operation.
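The transformation in question is the classic im2col trick, sketched here for a single-channel input with stride 1 and no padding (illustrative NumPy code, not a library implementation):

```python
import numpy as np

def im2col(x, k):
    """Unroll the k x k patches of a 2-D input into rows, so that a
    convolution becomes one matrix multiplication (valid padding, stride 1)."""
    h, w = x.shape
    rows = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows.append(x[i:i + k, j:j + k].ravel())
    return np.array(rows)

# The convolution is then: im2col(x, k) @ kernel.ravel(), one output per patch.
```

Group-wise pruning exploits exactly this layout: deleting the same position across all kernels deletes whole columns of the kernel matrix, shrinking the multiplication.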
Group-wise pruning prunes the same position across a set of convolution kernels. This method can fully exploit the matrix-multiplication formulation above: the structured sparsity of the kernels shrinks the kernel matrix and achieves acceleration. Because of its small pruning granularity, it can also achieve a good compression rate.
In 2016, Lebedev and Lempitsky [21] proposed a Group-wise pruning method. They believed that the matrix-multiplication formulation of convolution described above could be used for accelerated computation, and they considered two optimization processes. The first sparsifies the model during training: a group-sparse regularizer based on the l2,1 norm is added to the training loss so that the network develops a group-sparse structure. The second sparsifies an already-trained model: using Gradual Group-wise Sparsification, they set a threshold and prune the groups whose regularization terms are small, step by step, while monitoring a hold-out set to limit the damage that sparsification does to critical parts of the network. After a set of convolution kernels sharing the same position is deleted, the matrix participating in the convolution's matrix multiplication becomes smaller, thus accelerating the network.
In 'Learning Structured Sparsity in Deep Neural Networks' [22], Wen et al. summarized the optimization goal of Group-wise pruning as follows:

E(W) = E_D(W) + λ_g Σ_{l=1}^{L} R_g(W^(l))

where E_D(W) denotes the prediction loss, R_g(W^(l)) = Σ_g ||W_g^(l)||_2 is the group lasso regularization over the weight groups of layer l, and λ_g is the regularization factor whose purpose is to balance the two terms.
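The group lasso term R_g can be computed in a few lines (a NumPy sketch; the grouping of weights is an assumption for illustration, since the layers' real groups follow the kernel layout):

```python
import numpy as np

def group_lasso(weights, groups):
    """Group lasso penalty: the sum over groups of the l2 norm of each
    group's weights. Unlike a plain l1 penalty, it drives entire groups
    to zero together, producing structured sparsity.

    groups: list of index arrays partitioning `weights`.
    """
    return sum(np.sqrt((weights[g] ** 2).sum()) for g in groups)
```

A group that is already all-zero contributes nothing to the penalty, so the optimizer has no incentive to revive it, which is what keeps the learned sparsity structured.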
Later, in 2019, Wang et al. [23] argued that the fixed regularization factor used in previous methods ignores the sensitivity of the CNN and may damage the network. They therefore improved the optimization objective and proposed a dynamic regularization method: unstructured regularization terms and group lasso regularization terms are both introduced into the objective and weighted with dynamic regularization factors. Among the methods of its time, this approach achieved a very high acceleration rate with minimal accuracy decline.

Stripe-wise pruning
The pruning methods mentioned above mostly consider the importance of different parameters in the neural network, while Stripe-wise pruning instead considers the shape of the filter.
Meng et al. [24] proposed this method in their paper 'Pruning Filter in Filter', published in 2020. They implicitly learn a better shape for each filter by introducing a Filter Skeleton into the convolution layer during training. Specifically, the loss function is

L = Σ_(x,y) l(f(x, W ⊙ I), y)

where ⊙ denotes pointwise multiplication, W is the weight of the convolution layer, and I is the Filter Skeleton, whose entries learn the importance of the different stripes through gradient descent. However, this alone does not filter out the low-contribution stripes well, so the authors add a sparsity regularization term on the Filter Skeleton to the loss, which achieves a better filtering result.
After the better filter shapes are learned through the Filter Skeleton, some stripes are deleted, but this would break the structure of the convolution. The authors therefore reorganize the filters: they transform the original N filters of size C × K × K into N × K × K filters of size C × 1 × 1, as shown in Figure 4. After this operation, the shapes learned by the Filter Skeleton can be used to delete some of these small filters, compressing the model. Since this method prunes at the stripe level, it can preserve high performance without requiring fine-tuning; and thanks to its small pruning granularity, it can achieve a high pruning rate.
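The reorganization step can be sketched as a pure reshape (an illustrative NumPy version; the function name is ours):

```python
import numpy as np

def break_into_stripes(filters):
    """Stripe-wise reorganization (sketch): split each C x K x K filter
    into its K*K stripes, each a C x 1 x 1 filter, so that individual
    spatial positions can be kept or pruned independently.

    filters: (N, C, K, K) -> (N * K * K, C, 1, 1)
    """
    n, c, k, _ = filters.shape
    # Move the two spatial axes ahead of the channel axis, then flatten them.
    return filters.transpose(0, 2, 3, 1).reshape(n * k * k, c, 1, 1)
```

Each resulting 1 × 1 filter corresponds to one stripe; the ones whose Filter Skeleton entries were driven to zero are simply dropped from this list.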

Conclusion and discussion
The primary purpose of pruning is to decrease the parameters of a network and expedite network computations. In this paper, the pruning algorithms of neural networks in recent years are classified and summarized. The unstructured pruning algorithm and the structured pruning algorithm are introduced. According to pruning granularity and starting point, the structured pruning algorithms are divided into four categories: Filter-wise pruning, Channel-wise pruning, Group-wise pruning, and Stripe-wise pruning. Unstructured pruning can delete the parameters with negligible contribution from the neural network to the greatest extent, but it needs the support and optimization of hardware and libraries to exert its advantages.
Compared with unstructured pruning, structured pruning has a larger granularity, can optimize the network's structure, and does not need additional hardware or library support because it directly reduces parameters. Its main disadvantage is that changing the input and output dimensions of the network's middle layers may cause significant accuracy loss or deviations, and a middle layer may even disappear entirely.
The practical and easy-to-implement structured pruning algorithms mainly target convolutional neural networks. However, with the popularity of language models such as ChatGPT in recent years, it can be predicted that natural language processing tasks will receive more attention. As the backbone of most language models, the Transformer [25] has excellent context-processing capabilities but also heavy parameter and computation costs. Therefore, algorithms that can effectively compress the Transformer backbone network will be a promising direction.
In addition, graph neural networks are theoretically closer to the structure of the human brain and may develop significantly in the future. However, there are currently few pruning algorithms for graph neural networks, so pruning algorithms for such networks could be developed in the future.
Several automated network structure search algorithms have also emerged, such as SPOS [26] and ENAS [27]. They are committed to finding the optimal sub-model of a network through search algorithms, simplifying the network structure.