I. Introduction
Falls are a significant health problem among the elderly: about 36 million falls occur among older adults each year, resulting in over 32,000 deaths and approximately 3 million emergency-department visits annually for fall-related injuries [1]. According to the Centers for Disease Control and Prevention [2], falls cause over 95% of hip fractures; they can also cause other injuries such as broken bones and head trauma. Fall detection is therefore an important area of research. Current work in fall detection uses wearable and IoT-based sensors [3] [4] [5], radar [6], and multimodal or purely vision-based approaches [7] [8] [9] [10]. A limitation of wearable sensors is the need for frequent charging [11], which is particularly problematic for older adults who may find it difficult to monitor battery levels consistently and recharge devices regularly. Moreover, wearable sensors can be uncomfortable and may have various side effects. Considering this discomfort, the side effects, and the inconvenience of frequent charging, these traditional options may not be the most suitable for elderly individuals [12]. Multimodal approaches to fall detection also have drawbacks, as noted in [13]. First, handling multiple sources of data from subjects and environments requires efficient methods for extracting and fusing heterogeneous features, making the fall detection system computationally demanding and challenging. Second, placing multiple sensors on the body and in the environment raises costs, causes discomfort (especially for the elderly), and complicates deployment in real-world settings. In addition, IoT-based solutions are computationally expensive and require dedicated hardware.
In light of this, non-intrusive vision-based sensors emerge as an appealing alternative for monitoring. Purely vision-based solutions have used machine learning techniques such as Support Vector Machines [14], Random Forests [15], Hidden Markov Models [16], and Gaussian Mixture Models [17]. However, most state-of-the-art work on fall detection uses deep learning: Convolutional Neural Networks (CNNs), both 2D and 3D, recurrent architectures such as Long Short-Term Memory (LSTM) networks, and autoencoders [18]. The existing body of work in this domain has employed images or videos as data inputs for fall detection. In a vision-based video classification system, the initial step is to preprocess the video and extract individual frames. Feature extraction is then performed on each frame, followed by inference with a classifier to determine whether the sequence corresponds to a fall. By analyzing frames, capturing spatial and temporal information, and leveraging classification models, such a pipeline categorizes videos based on their content, enabling applications including video surveillance, action recognition, and video recommendation systems. However, video-based classification methods, although effective, are computationally expensive, which is inconvenient for monitoring daily-life activities. Hence, there is a need for inexpensive and efficient solutions for detecting falls.
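To make the pipeline above concrete, the following is a minimal, hypothetical sketch of its three stages (frame extraction, per-frame feature extraction, classification). It is not any cited system's implementation: synthetic frames stand in for a decoded video, a crude inter-frame motion-energy value stands in for learned CNN features, and a fixed threshold stands in for a trained classifier such as an SVM or LSTM.

```python
import numpy as np

def extract_features(frames):
    """Toy per-frame feature: mean absolute inter-frame difference,
    a crude stand-in for learned spatial/temporal features."""
    diffs = np.abs(np.diff(frames.astype(float), axis=0))
    return diffs.mean(axis=(1, 2))  # one motion-energy value per frame transition

def classify(motion_energy, threshold=30.0):
    """Toy classifier: flag a 'fall' if any transition shows a sudden
    large motion burst. A real system would use a trained model."""
    return bool((motion_energy > threshold).any())

# Synthetic "video": ten near-static 32x32 frames, then one abrupt change
rng = np.random.default_rng(0)
still = rng.integers(0, 50, (10, 32, 32))
burst = np.full((1, 32, 32), 255)
video = np.concatenate([still, burst])

print(classify(extract_features(video)))  # abrupt burst -> True
print(classify(extract_features(still)))  # static scene -> False
```

The point of the sketch is the data flow, not the model: each stage (frames, features, decision) can be swapped independently, which is also why the computational cost concentrates in the feature-extraction stage when deep networks replace the toy feature here.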