I. Introduction
Human Action Recognition (HAR) is an essential task in computer vision that involves recognizing and predicting human actions. HAR has numerous applications, such as abnormal action detection, video retrieval, healthcare [1], human-computer/robot interaction, and gaming [2]. HAR goes beyond representing the motion patterns of distinct body parts; it also conveys a person’s intentions, emotions, and thoughts, making it an essential ingredient in recognizing and predicting human behavior. In recent years, considerable work has been done in computer vision [3]–[5], [71]–[73] on tasks such as classification, segmentation, and resolution. Traditionally, handcrafted feature-based approaches were used for HAR in videos. Visual features describing local regions were extracted and then combined into a fixed-size video-level description. A HAR system analyzes the sequence of video frames to learn the features of human actions in the training phase and uses these learned features to classify the same kinds of actions in the testing phase [6]. Traditional approaches are limited to handcrafted features [36], [41] and are computationally expensive. Deep learning-based approaches have improved HAR performance. These approaches aim to analyze, recognize, and accurately predict the human behaviors depicted in videos. Convolutional Neural Networks (CNNs) are the primary feature extraction method for HAR in videos. However, modeling temporal information is a complex task in HAR, as it involves understanding the dynamics of actions over time: unlike images, videos capture the motion and evolution of a scene. Deep learning-based approaches also need a substantial amount of labeled data. Large-scale action video datasets such as UCF101, HMDB51, and Kinetics have aided the development of more accurate and efficient HAR models.
This paper focuses on the evolution of HAR in video analysis. Our discussion begins with two-stream networks, which were critical in developing more efficient models for HAR. We also analyze the systematic advancements made in HAR over the years. The paper is structured as follows: Section III outlines the popular datasets for HAR. Section IV covers the various advancements in HAR and their limitations. Section V compares the discussed approaches. Section VI concludes the paper.