
Convolutional neural network and vision transformer for image classification

Jiaqi Lu 1, *
1 Viterbi School of Engineering, University of Southern California, 3650 McClintock Ave, Los Angeles, CA 90089, United States of America.

* Author to whom correspondence should be addressed.

Applied and Computational Engineering, Vol. 5, 104-108
Published 31 May 2023. © 2023 The Author(s). Published by EWA Publishing
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Citation: Jiaqi Lu. Convolutional neural network and vision transformer for image classification. ACE (2023) Vol. 5: 104-108. DOI: 10.54254/2755-2721/5/20230542.

Abstract

The Vision Transformer (ViT) has been a popular research topic since it first emerged in the field. In image recognition, the amount of information a ViT can retrieve from the source image allows it, in some cases, to rival the traditionally dominant Convolutional Neural Network (CNN). A number of ViT-based models have since been proposed, each built with a specific application or a shortcoming of the original ViT in mind. In this paper, these models are evaluated on the same dataset, along with a standard CNN, to compare their performance, and the best-performing ViT model is then modified to explore possible improvements.
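
The paper's models and dataset are not reproduced on this page. As a rough illustration of the kind of side-by-side comparison the abstract describes, the sketch below pairs a bare-bones Vision Transformer with a small CNN baseline and evaluates both under identical conditions in PyTorch. All model sizes, hyperparameters, and the placeholder data are assumptions made for illustration only; they are not the authors' actual experimental setup.

# Minimal sketch (not the paper's code): a tiny Vision Transformer and a small CNN
# evaluated under identical conditions. Dataset, sizes, and hyperparameters are
# placeholders (CIFAR-10-shaped random tensors), not the paper's actual setup.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Bare-bones ViT: patch embedding -> transformer encoder -> class-token head."""
    def __init__(self, image_size=32, patch_size=4, dim=128, depth=4, heads=4, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)                      # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)             # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                    # classify from the class token

class SmallCNN(nn.Module):
    """Standard CNN baseline of roughly comparable size."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(256, num_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def evaluate(model, images, labels):
    """Top-1 accuracy on one batch; a real study would loop over a full test set."""
    model.eval()
    with torch.no_grad():
        preds = model(images).argmax(dim=1)
    return (preds == labels).float().mean().item()

if __name__ == "__main__":
    # Placeholder data standing in for a real image-classification dataset.
    images = torch.randn(64, 3, 32, 32)
    labels = torch.randint(0, 10, (64,))
    for name, model in [("TinyViT", TinyViT()), ("SmallCNN", SmallCNN())]:
        print(name, "untrained accuracy:", evaluate(model, images, labels))

Because both models share the evaluation routine and data, any accuracy difference after training would reflect the architectures themselves, which is the essence of the comparison described in the abstract.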

Keywords

Vision Transformer, Convolutional Neural Network, Image Classification, Machine Learning, Computer Vision.


Data Availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Authors who publish in this series agree to the following terms:

1. Authors retain copyright and grant the series right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.

2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.

3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See Open Access Instruction).

Volume Title: Proceedings of the 3rd International Conference on Signal Processing and Machine Learning
ISBN (Print): 978-1-915371-57-7
ISBN (Online): 978-1-915371-58-4
Published Date: 31 May 2023
Series: Applied and Computational Engineering
ISSN (Print): 2755-2721
ISSN (Online): 2755-273X
DOI: 10.54254/2755-2721/5/20230542
Copyright: © 2023 The Author(s)
Open Access: This article is distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
