Comparison of DenseNet201 and ResNet50 for Lip Reading of Decimal Digits
Lip reading is an assistive technology. It is a process that interprets the movement of the lips
to understand speech by means of visual interpretation. For people who have difficulty understanding
speech, especially the hearing impaired or those in noisy environments such as
airports or factories, lip reading is an alternative way of understanding what is being said.
In the proposed method, the work starts by inserting the video into the Viola-Jones algorithm and taking
sequential frames of the face image, followed by face detection, mouth detection, and ROI cropping. The
mouth frames are then fed into a convolutional neural network (DenseNet201 or ResNet50), where
features are extracted and the test frames are classified. In this research, a database consisting of
35 videos of seven speakers (five males and two females) pronouncing the decimal digits (0, 1, 2, ...,
9) was used. The test results indicate an accuracy of 90% for the DenseNet201 network and 86% for the
ResNet50 network.
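The preprocessing pipeline described above (sequential frames, face detection, mouth ROI cropping, resizing to the CNN input size) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `detect_face` stub stands in for a real Viola-Jones detector (e.g. OpenCV's `CascadeClassifier` with a Haar cascade), the lower-third-of-face mouth heuristic is an assumption, and the nearest-neighbour resize keeps the sketch dependency-free where a real pipeline would use `cv2.resize`.

```python
import numpy as np

def detect_face(frame):
    # Placeholder for Viola-Jones face detection (in practice, OpenCV's
    # CascadeClassifier with a Haar cascade). Returns a bounding box
    # (x, y, w, h); here we simply assume the face occupies the
    # central region of the frame.
    h, w = frame.shape[:2]
    return w // 4, h // 4, w // 2, h // 2

def mouth_roi(frame):
    # Crop the mouth region as the lower third of the detected face box,
    # a common heuristic when no dedicated mouth detector is used.
    x, y, w, h = detect_face(frame)
    return frame[y + 2 * h // 3 : y + h, x : x + w]

def video_to_mouth_frames(video, size=(224, 224)):
    # For each sequential frame: detect the face, crop the mouth ROI,
    # and resize to the CNN input size (DenseNet201 and ResNet50 both
    # take 224x224 RGB inputs by default). Nearest-neighbour resize
    # via index sampling avoids external dependencies in this sketch.
    out = []
    for frame in video:
        roi = mouth_roi(frame)
        ys = np.linspace(0, roi.shape[0] - 1, size[0]).astype(int)
        xs = np.linspace(0, roi.shape[1] - 1, size[1]).astype(int)
        out.append(roi[np.ix_(ys, xs)])
    return np.stack(out)

# Example: 10 synthetic 480x640 RGB frames -> 10 mouth ROIs of 224x224.
video = np.random.randint(0, 256, (10, 480, 640, 3), dtype=np.uint8)
mouths = video_to_mouth_frames(video)
print(mouths.shape)  # (10, 224, 224, 3)
```

The resulting stack of mouth frames is what would then be passed to the pretrained DenseNet201 or ResNet50 network for feature extraction and classification.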
Copyright (c) 2023 Journal of Education for Pure Science- University of Thi-Qar