Comparison of DenseNet201 and ResNet50 for Lip Reading of Decimal Digits

  • 1Computer Sciences Department, College of Education for pure Sciences, University of Thi-Qar
Keywords: Lip reading, Recognition, CNN, DenseNet201, ResNet50


Lip reading is a technology supportive of humanity. It is a process that interprets the movement of the lips
to understand speech visually. Understanding speech is difficult for some groups of people, especially
the hearing impaired and people in noisy environments such as airports or factories; for them, lip reading
is an alternative source for understanding what people are saying.
In the proposed method, the video is passed to the Viola-Jones algorithm and split into sequential frames
of the face image; face detection, mouth detection, and region-of-interest (ROI) cropping follow. The mouth
frames are then fed to a convolutional neural network, either DenseNet201 or ResNet50, where features are
extracted and the test frames are classified. This research used a database of 35 videos of seven people
(5 males and 2 females) pronouncing the decimal digits (0, 1, 2, ..., 9). The test results indicate an
accuracy of 90% for the DenseNet201 network and 86% for the ResNet50 network.
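
The ROI cropping step described above can be illustrated with a minimal sketch. It assumes a face bounding box has already been produced by a Viola-Jones style detector and estimates the mouth region as the lower third of that box, inset horizontally by one fifth on each side. That geometric heuristic, the function names `mouth_roi_from_face` and `crop`, and the list-of-rows frame representation are illustrative assumptions, not the authors' implementation (the abstract mentions a separate mouth detection step).

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def mouth_roi_from_face(face: Box, frame_w: int, frame_h: int) -> Box:
    """Estimate a mouth box from a detected face box.

    Assumption (not from the paper): the mouth lies in the bottom third
    of the face box, inset by one fifth of the face width on each side.
    """
    x, y, w, h = face
    inset = w // 5              # horizontal margin on each side
    drop = (2 * h) // 3         # skip the upper two thirds of the face
    mx, my = x + inset, y + drop
    mw, mh = w - 2 * inset, h - drop
    # Clamp the box so it stays inside the frame.
    mx = max(0, min(mx, frame_w - 1))
    my = max(0, min(my, frame_h - 1))
    mw = max(1, min(mw, frame_w - mx))
    mh = max(1, min(mh, frame_h - my))
    return (mx, my, mw, mh)


def crop(frame: Sequence[Sequence[int]], box: Box) -> List[List[int]]:
    """Crop a frame stored as a list of pixel rows to the given box."""
    x, y, w, h = box
    return [list(row[x:x + w]) for row in frame[y:y + h]]


# Hypothetical face box on a 320x240 frame.
roi = mouth_roi_from_face((100, 50, 90, 120), frame_w=320, frame_h=240)
```

In a full pipeline, each cropped mouth frame would then be resized to the input size expected by the chosen network before feature extraction and classification.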

