I. Introduction
Recently, stereo vision has been used in surveillance systems to improve the performance of object recognition and tracking [1], [2]. Stereo vision can provide depth as well as color information for advanced image processing. Intrinsically different from RGB information, the pixels in a depth map represent distances in 3-D space; hence, depth video captures the variation of spatial information over time. Advances in computer vision now make it possible to obtain stereo vision from either a dual camera system [3]–[10] or an RGB-D camera [11]–[14]. A dual camera system uses two video cameras to capture images and then computes the distance between an object and the cameras through stereo matching. Calibration of a dual camera system is necessary before the disparity map can be calculated, and disparity matching itself is computationally expensive; still, a major advantage is that the depth range is adjustable. On the other hand, an RGB-D system is composed of an RGB camera and a depth sensor, such as a Kinect or Asus Xtion Pro [15]. The biggest advantage of this system is that it acquires depth information from a real-time video stream more conveniently than a dual camera system does. However, the depth information is noisy due to the hardware limitations of the time-of-flight infrared-based depth sensor, and it cannot be used in outdoor environments.
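As a concrete illustration of the dual camera pipeline described above (not part of the system proposed in this work), the following minimal Python sketch uses OpenCV block matching to compute a disparity map from an already rectified stereo pair and converts it to metric depth via Z = f·B/d; the focal length, baseline, and file names are placeholder values that would normally come from calibration.

import cv2
import numpy as np

# Load a rectified stereo pair (calibration and rectification are assumed
# to have been performed beforehand, as noted in the text).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block-matching stereo correspondence; numDisparities bounds the search
# range and hence the measurable depth range.
matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # fixed-point -> pixels

# Convert disparity to metric depth: Z = f * B / d
# (f: focal length in pixels, B: baseline in meters; placeholder values).
f, B = 700.0, 0.12
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = f * B / disparity[valid]

The numDisparities parameter illustrates why the depth range of a dual camera system is adjustable, while the per-pixel search it implies is the main source of the computational cost mentioned above.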