Microsoft Kinect Sensor: Future Trends and Latest Research Challenges

Index:
Introduction
Literature Survey
Research Proposal
Design Methodology
Software: Dot Net Technology
Practical Experiments
Results and Discussion
Conclusion

With the invention of the low-cost Microsoft Kinect sensor, high-resolution visual (RGB) and depth sensing have become available for widespread use. In recent years, Kinect has gained popularity as a portable, low-cost, markerless device for human motion capture with easy software development. As a result of these advantages and its advanced skeletal tracking capabilities, it has become an important tool for clinical evaluation, physical therapy and rehabilitation. This article contains an overview of the evolution of the different versions of Kinect and highlights the differences in their main features.

Keywords: computer vision, depth imaging, information fusion, Kinect sensor.

Introduction: Capturing three-dimensional information about the geometry of objects or scenes is increasingly part of the conventional workflow for the documentation and analysis of cultural heritage and archaeological objects or sites. In this particular field of study, needs can be cited in terms of restoration, conservation, digital documentation, reconstruction or museum set-up [1,2]. The digitization process is greatly simplified nowadays thanks to several available techniques that provide 3D data [3]. For large spaces or objects, terrestrial laser scanners (TLS) are preferred because this technology allows a large amount of accurate data to be collected very quickly. To reduce costs, and when working on smaller pieces, digital cameras are commonly used instead; they have the advantage of being rather easy to use through image-based 3D reconstruction techniques [4]. Furthermore, both methodologies can be combined to overcome their respective limitations and provide more complete models [5,6].

The Microsoft Kinect is a device originally designed to detect human motion, developed as a controller for the Xbox game console and sold since 2010. It did not take long for researchers to notice that its applicability extends beyond video games: it can be used as a depth sensor that facilitates interaction through gestures and body movement. In 2013, a new Kinect device was introduced with the new gaming console, called Kinect v2 or Kinect for Xbox One. The new Kinect replaced the previous technologies and brought numerous improvements to the quality and performance of the system. The old Kinect came to be called Kinect v1, or Kinect for Xbox 360, after the arrival of the new device. Although it is billed as a depth camera, the Kinect sensor is much more than that: its advanced sensing hardware contains a color camera, a depth sensor and a four-microphone array. These sensors provide several opportunities in the areas of 3D motion capture and facial and voice recognition [5]. While Kinect for Xbox 360 uses a structured-light model to obtain a depth map of a scene, Kinect for Xbox One uses a faster and more precise time-of-flight (ToF) sensor. Kinect's skeleton tracking capabilities are used to analyze human body motion for applications related to human-computer interaction, motion capture, recognition of human activity and other areas. Furthermore, it is of great use for studies in physical therapy and rehabilitation in particular.
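To make the rehabilitation use case concrete, the sketch below shows how a joint angle (here, elbow flexion) can be computed from three skeleton joint positions of the kind Kinect's skeleton tracking reports. This is a minimal sketch: the joint coordinates are hypothetical placeholders rather than output from a real sensor, and the helper name joint_angle is our own.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by segments b->a and b->c.

    a, b, c are 3D joint positions (e.g., shoulder, elbow, wrist)
    in the camera coordinate system, in meters.
    """
    v1 = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v2 = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# Hypothetical joint positions for a single frame (meters).
shoulder = (0.10, 0.45, 2.00)
elbow    = (0.15, 0.20, 2.05)
wrist    = (0.35, 0.30, 2.00)

print(f"Elbow angle: {joint_angle(shoulder, elbow, wrist):.1f} degrees")
```

A physical-therapy application would track such angles frame by frame and compare the measured range of motion against a prescribed exercise.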
The Kinect offers an economical time-of-flight (ToF) technology with potential application in verifying patient positioning in radiotherapy. In radiotherapy, the patient is initially positioned during a computed tomography (CT) simulation, which is then used to create a treatment plan. The treatment plan is designed to deliver a tumoricidal dose to a planning target volume (PTV), which includes the gross disease with an additional margin to account for setup uncertainties. Once the treatment plan is approved, patients return for multiple treatment fractions over days or weeks. Replicating precise patient positioning between fractions is critical to ensuring accurate and effective delivery of the approved treatment plan. The motivation of this survey is to provide a comprehensive and systematic description of popular RGB-D datasets for the convenience of other researchers in this field.

Literature Survey: Motion capture and depth sensing are two emerging research areas of recent years. With the launch of Kinect in 2010, Microsoft opened the door for researchers to develop, test and optimize algorithms for these two areas. Leyvand T. [2] discussed Kinect technology; his work sheds light on how a person's identity is tracked by the Kinect sensor for Xbox 360 and presents some insights into how the technology is changing over time. With the launch of Kinect, an epochal change in identification and tracking techniques is expected, and the authors discussed possible challenges in the coming years in the field of gaming and Kinect-based identification and tracking. Kinect identification occurs in two ways: biometric login and session tracking. They considered the fact that players do not change their clothes or rearrange their hairstyle during a session, but do change their facial expressions, take different poses and so on. They believe the biggest challenge to Kinect's success is the accuracy factor, both in terms of measurement and regression. The main perspective of the method is that it considers a single depth image and uses an object recognition approach: from a single input depth image, a distribution of body parts is inferred per pixel. Depth imaging refers to calculating the depth of each pixel along with the RGB image data. The Kinect sensor provides real-time depth data in isochronous mode [18]; therefore, to correctly track movement, each depth stream must be processed. The depth camera offers many advantages over the traditional camera: it can work in low-light conditions and is color invariant [1]. Depth sensing can be performed via laser time-of-flight sensing or structured-light patterns combined with stereo sensing [9]. The proposed system uses the stereo detection technique provided by PrimeSense [21]. Kinect depth sensing works in real time with greater precision than comparable low-cost depth-sensing cameras. The Kinect depth-sensing camera uses an IR laser pattern to estimate the distance between the object and the sensor. The technology behind this system requires the CMOS image sensor to be connected directly to the system-on-chip [21]. Additionally, a sophisticated decoding algorithm (not released by PrimeSense) is used to decode the input depth data.
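As a minimal illustration of processing the depth stream frame by frame, the sketch below grabs one depth frame through the Python wrapper of libfreenect (the open-source OpenKinect driver for Kinect v1) and isolates foreground objects by depth thresholding, a simple form of depth-based background subtraction. The threshold value is an arbitrary assumption, and the freenect module must be installed with a Kinect attached for this to run.

```python
import freenect  # Python wrapper for the OpenKinect/libfreenect driver
import numpy as np

# Grab one depth frame synchronously; for Kinect v1 this is an array
# of 11-bit raw depth values with shape (480, 640).
depth, _timestamp = freenect.sync_get_depth()

# Assumed threshold: treat everything closer than this raw value as
# foreground. Raw units are device-specific, not millimeters.
NEAR_THRESHOLD = 600

foreground = depth < NEAR_THRESHOLD
print(f"Foreground pixels: {foreground.sum()} of {foreground.size}")

# A tracking pipeline would repeat this per frame and feed the
# foreground mask into a segmentation or skeletonization stage.
```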
Research Proposal: Due to their attractiveness and imaging capabilities, many works have been devoted to RGB-D cameras during the last decade. The aim of this section is to outline the state of the art relating to this technology, considering aspects such as fields of application, calibration methods and metrological approaches.

Application fields of RGB-D cameras: a wide range of applications can be explored with RGB-D cameras. Their main advantages are their cost, which for most of them is low compared to laser scanners, and their high portability, which allows their use on board mobile platforms. Regarding 3D modeling of objects with an RGB-D camera, the creation of 3D models represents a common and interesting solution for the documentation and visualization of archaeological heritage and materials; due to its remarkable results and its convenience, the technique probably most used by the archaeological community remains photogrammetry.

Sources of error and calibration methods: the main problem when working with ToF cameras is that the measurements are distorted by several phenomena. To guarantee the reliability of the acquired point clouds, especially for accurate 3D modeling, these distortions must be removed beforehand, and a good understanding of the multiple sources of error that influence the measurements is helpful for doing so.

Prospects for the future: Analyzing the above papers, we believe there is certainly much future work for this research community. Here, we discuss potential ideas for each of the major vision-related topics separately. Object tracking and recognition via background subtraction on depth images can easily solve practical problems that have hindered object tracking and recognition for a long time. It will not be surprising if small devices equipped with Kinect-like RGB and depth cameras appear in ordinary office environments in the near future. However, the depth camera's limited range may not allow it to be used for standard indoor surveillance applications. To solve this problem, combining multiple Kinects could be a potential solution; this will obviously require communication between the Kinects and re-identifying objects across different views. In human activity analysis, reliable algorithms that can estimate complex human poses (such as gymnastic or acrobatic poses) and the poses of people interacting closely will definitely be active topics in the future. For activity recognition, further investigation of low-latency systems may become the trend in this field, as more and more practical applications require online recognition. From the analysis of hand gestures, it can be seen that many approaches avoid the problem of detecting hands in realistic situations by assuming that the hands are the objects closest to the camera; these methods are experimental and their use is limited to laboratory environments. In the future, methods that can handle arbitrary, high-degree-of-freedom hand movements in realistic situations may attract more attention. Furthermore, there is a dilemma between shape-based and 3D-model-based methods: the former allow high-speed operation at a loss of generality, while the latter provide generality at a higher computational cost. Therefore, the balance and compromise between them will become an active topic. According to evaluations of the most current approaches, indoor 3D mapping fails when incorrect edges are created during mapping; therefore, methods that can detect bad edges and repair them independently will be very useful in the future. In sparse feature-based approaches, it may be necessary to optimize the keypoint matching scheme by adding a feature lookup table or eliminating mismatched features, as sketched below. In dense point matching approaches, it is worth trying to reconstruct larger scenes, such as the interior of an entire building; more memory-efficient representations will be needed here.
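To illustrate one common way of eliminating mismatched features in a sparse feature-based pipeline, the sketch below matches ORB keypoints between two RGB frames with OpenCV and filters them with Lowe's ratio test. This is a generic technique rather than the method of any specific paper surveyed here, and the image file names are placeholders.

```python
import cv2

# Placeholder file names for two consecutive RGB frames.
img1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Detect ORB keypoints and compute binary descriptors.
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force Hamming matcher; keep the two best candidates per feature.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
candidates = matcher.knnMatch(des1, des2, k=2)

# Lowe's ratio test: accept a match only if it is clearly better than
# the second-best candidate, discarding ambiguous correspondences.
good = [m for m, n in candidates if m.distance < 0.75 * n.distance]
print(f"Kept {len(good)} of {len(candidates)} matches")
```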
Design Methodology: Our system implements augmented reality using the processing capabilities of Kinect. The system consists of four main components: a tracking device, a processing device, an input device and a display device. We use Kinect as the tracking device; it contains three sensors for processing depth images, RGB images and voice. Kinect's depth camera and multi-array microphone are used to capture the real-time image stream and audio data, respectively, and the depth sensor is used to obtain the distance between the sensor and the object to be tracked. The input device of our setup is a high-definition camera, which captures the input image stream and serves as the background for all augmented reality components; on top of this background stream, we overlay event-specific 3D models to provide the virtual experience. The processing device, composed of a data processing unit, an audio unit and associated software, decides which model to superimpose and at what time, and transmits the input video stream and the 3D model to the display device.

The Kinect system plays an important role in the operation of the overall system, working as the tracking unit for the augmented reality system. It uses some of Kinect's most notable features: skeletal tracking, joint estimation and speech recognition. Skeletal tracking is useful for determining the user's position relative to the Kinect when the user is in the frame, which is used to guide them through the assembly procedure; it also helps with gesture recognition. The system guides the user through the complete assembly of a product using voice and gesture recognition. Product assembly consists of joining the individual constituent parts and assembling them into a product. There are two assembly modes: full assembly and partial assembly. In full assembly mode, Kinect guides the technician through assembling the entire product in sequence; this mode is useful when the whole product needs to be assembled. In partial assembly mode, the technician selects a part to assemble and Kinect guides him or her through assembling the selected part; once the assembly of that part is complete, the technician can select another part or exit. This mode is useful when only one or a few parts need to be assembled.

The system has been developed to work in two modes, voice mode and gesture mode; the choice of mode is left to users based on their familiarity with the system and their comfort in using it. If the user opts for voice mode, he or she uses voice commands to interact with the system, and the system responds with voice guidance. If the user opts for gesture mode, he or she uses gestures to interact with the system, and the system still guides them through voice commands. The 'START' command is used in both modes to start the system. After the system boots, the user selects voice mode or gesture mode and continues working in that mode, as sketched below.
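A minimal sketch of the interaction flow just described, under stated assumptions: only the 'START' command comes from the design above; the part names, command vocabulary and recognition stubs are hypothetical placeholders standing in for Kinect's voice and gesture recognizers.

```python
# Sketch of the two-mode assembly-guidance flow. The recognizer stub
# below stands in for Kinect's speech/gesture recognition output.

def next_command(mode):
    """Placeholder: return the next recognized voice or gesture command."""
    return input(f"[{mode} mode] command> ").strip().upper()

def guide(step):
    """Placeholder for voice guidance plus the 3D overlay for one step."""
    print(f"Guidance: {step}")

FULL_SEQUENCE = ["base", "frame", "cover"]  # hypothetical part order

def run():
    if next_command("any") != "START":        # 'START' begins both modes
        return
    mode = next_command("any")                # expected: VOICE or GESTURE
    assembly = next_command(mode)             # expected: FULL or PART
    if assembly == "FULL":
        for part in FULL_SEQUENCE:            # guide through every part in order
            guide(f"assemble {part}")
    else:
        while (part := next_command(mode)) != "EXIT":
            guide(f"assemble {part.lower()}")  # guide only the selected part

if __name__ == "__main__":
    run()
```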
Software: Dot Net technology. Hardware: Kinect. The Kinect sensor, the first low-cost depth camera, was introduced by Microsoft in November 2010. Initially it was marketed as a motion-controlled gaming device; a version for Windows was released later. In this section we discuss the evolution of Kinect from v1 to the recent version, v2.

Kinect v1: Microsoft Kinect v1 for Windows was released in February 2012 and began competing with the many other motion controllers available on the market. The Kinect hardware consists of a sensor bar that includes a 3D depth sensor, an RGB camera, a multi-array microphone and a motorized tilt mechanism. The sensor provides full-body 3D motion capture, facial recognition and voice recognition. The depth sensor consists of an IR projector and an IR camera, which is a monochrome CMOS (complementary metal-oxide-semiconductor) sensor. The IR projector emits an IR laser beam, which passes through a diffraction grating and becomes a pattern of IR dots. The dots projected onto the 3D scene are invisible to the color camera but visible to the IR camera, and the relative left-right translation (disparity) of the dot pattern gives the depth of each point.

Kinect v2: Microsoft Kinect v1 was updated to v2 in November 2013. The second-generation Kinect v2 is completely different, being based on ToF technology. Its basic principle is that an array of emitters sends out a modulated signal that travels to the measured point, is reflected, and is received by the sensor's CCD. The sensor captures a 512 x 424 depth map and a 1920 x 1080 RGB image at 15 to 30 frames per second.

Kinect software: OpenKinect is a free, open-source library maintained by an open community of Kinect enthusiasts. Most users work with one of two other libraries, OpenNI or the Microsoft SDK; the Microsoft SDK is available only for Windows, while OpenNI is a cross-platform, open-source tool. Microsoft also provides a freely downloadable Kinect development library.

Practical experiments: Kinect, in this document, refers both to the advanced RGB/depth sensing hardware and to the software-based technology that interprets the RGB/depth signals. The hardware contains a regular RGB camera, a depth sensor and a four-microphone array, capable of simultaneously providing depth signals, RGB images and audio signals. On the software side, several tools are available that allow users to develop products for various applications. These tools provide functionality to synchronize image signals, capture 3D human motion, identify human faces, recognize human voices, and more. Human voice recognition is achieved using a remote speech recognition technique, thanks to recent advances in surround-sound echo cancellation and microphone-array processing; more details on Kinect audio processing can be found in [5] and [6]. In this article we focus on the techniques relevant to computer vision, therefore leaving out the discussion of the audio component. The RGB camera provides the three basic color components of the video. The camera operates at 30 Hz and can offer images at 640 x 480 pixels with 8 bits per channel. Kinect also has the ability to produce higher-resolution images, running at 10 frames per second.
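To summarize the two depth-sensing principles described above, the sketch below shows the standard geometric relations: triangulation depth from dot-pattern disparity (Kinect v1's structured light) and distance from the phase shift of a modulated signal (Kinect v2's ToF). The focal length, baseline, disparity and modulation values are illustrative assumptions, not Kinect's actual calibration parameters.

```python
import math

C = 299_792_458.0  # speed of light, m/s

def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Structured light / stereo triangulation: Z = f * b / d."""
    return focal_px * baseline_m / disparity_px

def depth_from_phase(phase_rad, mod_freq_hz):
    """ToF with a modulated signal: d = c * phi / (4 * pi * f_mod).

    The light travels out and back, hence the extra factor of 2
    folded into the 4*pi denominator.
    """
    return C * phase_rad / (4.0 * math.pi * mod_freq_hz)

# Illustrative numbers only (not real Kinect calibration values).
print(depth_from_disparity(focal_px=580.0, baseline_m=0.075, disparity_px=20.0))  # ~2.2 m
print(depth_from_phase(phase_rad=math.pi / 2, mod_freq_hz=16e6))                  # ~2.3 m
```

Note that phase-based ToF distance wraps around beyond c / (2 * f_mod), about 9.4 m at the assumed 16 MHz, which is why ToF cameras typically combine several modulation frequencies to disambiguate range.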