Visual Depth and Velocity Mapping

This project provides a sight substitution system for the visually impaired. Information about the surrounding environment is gathered by stereo vision cameras and used to generate disparity maps. These maps are then used to calculate the depth and velocity of surrounding objects. The depth and velocity values are expressed as vectors, which are then translated into audio signals of varying pitch and synchronization.

=Problem Definition=

Although there are extensive resources available for most disabilities, there are very few aids for the visually impaired. This project aims to write software to map visual stimuli to an auditory field using Real Time Disparity Mapping (RTDM) with a dual camera system and simple processor. The unit should be wearable and functional in the sense that somebody could walk around campus with it on their head. By November 27th we expect to have our first prototype running and functional.

Background
It has been demonstrated that people who are visually impaired can learn to navigate via echolocation. This project is therefore based on the intuition that a user can learn to translate variations in an audio signal into meaningful information about the spatial relationship of objects in the surrounding environment.

Deliverables
The deliverables for this project include the following:

1. Functioning hardware (ZED stereo camera integrated with Nvidia GPU processing board)

2. Proprietary modifications to SIFT and SNAP disparity mapping algorithms

3. Portfolio containing documentation

Specifications


User Interface Requirements:



A set of headphones and a camera shall be attached to a helmet or hat.

The device should be secured to the body in a configuration suitable for walking and sitting.

The device should not impede the user's ability to interact with objects in front of or to either side of them.

The device should not expose the user to any severe electrical or mechanical hazards (electric shock or burning).

What the product should do:

This product will relay 3D position and velocity information in an audio format to the user. The resolution and accuracy of the data should be sufficient for traversing a room without the assistance of eyesight.

Physical Requirements



Maximum weight: 10 lbs

Power (board): 12.6 W

Power (camera): 3.8 W

Frame rate: 15, 30, 60 FPS

Depth of field (max): 15, 20, 30 m
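The two power figures above can be combined into a rough battery budget. The sketch below is only an illustration of the arithmetic: the battery capacity and the conversion efficiency are hypothetical values, not figures from the project.

```python
# Power figures are taken from the physical requirements above.
BOARD_W = 12.6
CAMERA_W = 3.8


def runtime_hours(battery_wh: float, efficiency: float = 0.85) -> float:
    """Estimate runtime in hours for a battery of `battery_wh` watt-hours.

    `efficiency` is an assumed conversion efficiency, not a measured value.
    """
    total_w = BOARD_W + CAMERA_W
    return battery_wh * efficiency / total_w


print(f"Total draw: {BOARD_W + CAMERA_W:.1f} W")
# 100 Wh is a hypothetical pack size used only for illustration.
print(f"Estimated runtime on a 100 Wh pack: {runtime_hours(100.0):.1f} h")
```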

=Design Considerations=

In order to create a Real Time Disparity Mapping system that could be used in a mobile setup, it was necessary to find hardware that was small enough to be worn on the user's body and yet powerful enough to process images. The first prototype that we proposed was a Tara stereo vision camera that would be connected to a Raspberry Pi 3.0 B+, which would be powered by a battery pack.

However, after discussing this option with our client, we determined that the Raspberry Pi was not powerful enough to do the image processing that we needed and that the viewing angle of the Tara camera was not large enough. After more research, we found another camera, the ZED produced by Stereolabs, which had the desired viewing angle and depth perception. We also found another processing board, the Nvidia Jetson TX2, which had sufficient computational capability to perform the tasks that we needed.

Another problem that our team deliberated over was how to place the hardware on the user's body. We initially thought of placing the camera on the user's chest, but then decided to fix it to the user's head, with the Jetson board stored in a backpack along with a battery pack.

Figure 1: Visualization of the final system

Another discovery that our team made was that we needed to develop the code for the project on a custom development kit produced by Nvidia instead of OpenCV. Links to each of these development environments are given below:



OpenCV:

Nvidia JetPack:

OpenAL:

=Project Learning=

Prior Considerations
One of the possible software solutions we considered was to use a Convolutional Neural Network to identify objects, scale them, and estimate their depths with respect to the user. Because this was computationally expensive, we decided to use the disparity maps that are generated by the ZED SDK.

Current Learning
In order to produce an RTDM, there are six fundamental milestones that need to be achieved:

1. Single-camera calibration

2. Multiple-camera calibration

3. Rectification

4. Feature extraction

5. Spatial feature matching

6. Temporal feature matching

OpenCV provides tools to complete these tasks.

Single-Camera Calibration

Single-camera calibration consists of solving for the intrinsic parameters of a camera, namely the camera matrix and the set of distortion coefficients. The camera matrix describes how 3-dimensional objects are projected onto a 2-dimensional image plane. It contains the focal length and principal point of the camera, which together with the image size determine the field of view. The distortion coefficients describe how the camera lens distorts the light as it is collected. One example is radial distortion, which creates a “barrel effect” on the resulting image. An example of barrel distortion is shown below in Figure 2.

Figure 2. The left image is transformed to the right image using a distortion-correction matrix. http://www.ntu.edu.sg/home/assklam/images/distortion-correction.jpg
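As a minimal sketch of what the intrinsic parameters do, the function below projects a 3-dimensional point through a pinhole camera matrix and applies a two-coefficient radial distortion model. The parameter names (`fx`, `fy`, `cx`, `cy`, `k1`, `k2`) and the numeric values are illustrative assumptions, not calibration results from this project.

```python
from typing import Sequence, Tuple


def project(point_xyz: Sequence[float], fx: float, fy: float,
            cx: float, cy: float,
            k1: float = 0.0, k2: float = 0.0) -> Tuple[float, float]:
    """Project a 3-D point (camera coordinates) to pixel coordinates."""
    X, Y, Z = point_xyz
    if Z <= 0:
        raise ValueError("point must be in front of the camera")
    # Perspective division gives normalized image coordinates.
    x, y = X / Z, Y / Z
    # Radial distortion scales each point by a polynomial in r^2.
    r2 = x * x + y * y
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    xd, yd = x * scale, y * scale
    # The camera matrix maps normalized coordinates to pixels.
    return fx * xd + cx, fy * yd + cy


# Hypothetical intrinsics for a 1280x720 image.
ideal = project((0.5, 0.25, 2.0), 700.0, 700.0, 640.0, 360.0)
barrel = project((0.5, 0.25, 2.0), 700.0, 700.0, 640.0, 360.0, k1=-0.2)
print("ideal pixel:    ", ideal)
print("with distortion:", barrel)
```

With a negative `k1`, the projected point is pulled toward the image center, which is the barrel effect described above. Undistortion inverts this mapping.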

The ultimate use of the camera matrix is to constrain each feature point to a line in 3-dimensional space (Figure 3). If two images are captured from adjacent points and feature points can be matched between them, the depth of each pair of feature points can be solved for by finding the intersection of the two lines in 3-dimensional space.

Figure 3. A single feature point, Pc, is constrained along the vector OP. https://prateekvjoshi.files.wordpress.com/2014/05/3-pinhole-camera-geometry.png

In order to find the true line that each feature point can be constrained to, it is necessary to undistort the image first. Figure 2 demonstrates the undistortion of an image.
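The line constraint can be sketched by inverting the camera matrix for a single undistorted pixel. The function and parameter names below are illustrative, and the intrinsics are hypothetical values.

```python
import math
from typing import Tuple


def pixel_to_ray(u: float, v: float, fx: float, fy: float,
                 cx: float, cy: float) -> Tuple[float, float, float]:
    """Unit direction of the ray from the camera center through pixel (u, v).

    Assumes the pixel has already been undistorted.
    """
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)


# Every 3-D point of the form s * ray (s > 0) projects to the same pixel,
# so one camera alone cannot recover the depth s.
ray = pixel_to_ray(815.0, 447.5, 700.0, 700.0, 640.0, 360.0)
print("ray direction:", ray)
```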

Multiple-Camera Calibration

Multiple-camera calibration is necessary to relate two lines that a single feature may fall upon, so the intersection point can be extracted, and, consequently, the depth can be calculated. The relationship between two camera positions can be described with an essential matrix.

The essential matrix encodes the rotation and translation between the two cameras, which fixes their relative position and orientation. Figure 4 demonstrates the concept of solving along two intersecting lines.

Figure 4. After a feature is matched, the two lines are created and solved for using the essential matrix which describes the relative positions and orientations of the two cameras, Ol and Or. https://docs.opencv.org/3.0-beta/_images/essential_matrix.jpg
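The relationship can be checked numerically. The sketch below builds an essential matrix as E = [t]x R and verifies that a correctly matched pair of normalized image points satisfies the epipolar constraint. It assumes the convention P_right = R * P_left + t, and the rotation, baseline, and test point are made-up values for illustration.

```python
import math


def skew(t):
    """Cross-product matrix [t]x, so that skew(t) applied to v gives t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def essential_matrix(R, t):
    """E = [t]x R for the convention P_right = R * P_left + t."""
    return matmul(skew(t), R)


def epipolar_residual(E, p_left, p_right):
    """p_right^T E p_left; zero for a geometrically consistent match."""
    Ep = matvec(E, p_left)
    return sum(p_right[i] * Ep[i] for i in range(3))


# Hypothetical rig: 2 degree rotation about the vertical axis, 0.12 m offset.
angle = math.radians(2.0)
R = [[math.cos(angle), 0.0, math.sin(angle)],
     [0.0, 1.0, 0.0],
     [-math.sin(angle), 0.0, math.cos(angle)]]
t = [-0.12, 0.0, 0.0]

P_left = [0.3, -0.1, 2.0]                       # point in the left camera frame
P_right = [a + b for a, b in zip(matvec(R, P_left), t)]
p_left = [c / P_left[2] for c in P_left]        # normalized image coordinates
p_right = [c / P_right[2] for c in P_right]

E = essential_matrix(R, t)
residual = epipolar_residual(E, p_left, p_right)
print(f"epipolar residual: {residual:.2e}")
```

In practice the two matrices would come from stereo calibration rather than being written by hand.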

Rectification

After feature points have been extracted, they need to be matched in order to find depth. For each feature point in one image, a search for a match must be conducted in the corresponding image. Using epipolar geometry, the search space can be reduced to a single line, as shown in Figure 5. Rectification warps both images so that these epipolar lines become horizontal and fall on the same row in each image.

Figure 5. Epipolar lines
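Once the images are rectified, the search for a match only needs to run along one row. The sketch below uses a sum-of-absolute-differences window on a single scanline and then converts the resulting disparity to depth with Z = f * B / d. It is a simplified stand-in for illustration, not the matcher used by the ZED SDK, and the focal length and baseline are hypothetical values.

```python
def match_along_row(left_row, right_row, x_left,
                    half_window=2, max_disparity=16):
    """Return the disparity (in pixels) that minimizes the SAD cost.

    The candidate match for column x_left in the left row is column
    x_left - d in the right row of a rectified pair.
    """
    if x_left - half_window < 0 or x_left + half_window >= len(left_row):
        raise ValueError("window does not fit inside the row")
    best_d, best_cost = 0, float("inf")
    for d in range(max_disparity + 1):
        x_right = x_left - d
        if x_right - half_window < 0:
            break  # window would leave the image
        cost = sum(abs(left_row[x_left + k] - right_row[x_right + k])
                   for k in range(-half_window, half_window + 1))
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters for a rectified stereo pair."""
    if disparity_px <= 0:
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Synthetic scanline: the same intensity bump appears 5 px further right
# in the left image than in the right image.
right_row = [0] * 10 + [10, 50, 90, 50, 10] + [0] * 25
left_row = [0] * 5 + right_row[:-5]

d = match_along_row(left_row, right_row, x_left=17)
z = depth_from_disparity(d, focal_px=700.0, baseline_m=0.12)
print(f"disparity: {d} px, depth: {z:.2f} m")
```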

Feature Extraction

Feature extraction consists of recognizing potentially interesting points in an image. There is a suite of algorithms that take different approaches to this task. The algorithm mentioned, but not required, by Dan Schneider is the Scale-Invariant Feature Transform (SIFT). SIFT locates keypoints at local extrema of a difference-of-Gaussians pyramid, which makes the detected points stable across changes in scale, and it discards unstable responses that lie along edges.

Audio Data Flow

To limit the number of points that are rendered as audio, we decided to filter the raw data again after the edge filter. The disparity map is used to generate two products: a Max Pool and an Edge Map. The Edge Map will be played with high gain so that its sounds stand out to the user. The Max Pool will be split into a grid, and a single averaged point will be taken for each grid cell. These points will be played with low gain and will serve as background sound (see Figure 6).

Figure 6: Audio Data Flow
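One possible reading of this data flow is sketched below on a tiny disparity map stored as nested lists. The edge test (a jump in disparity between neighboring pixels), the threshold, and the use of a per-cell maximum are all assumptions made for illustration; the project's actual filters may differ.

```python
def edge_map(disparity, threshold):
    """Mark pixels whose disparity differs sharply from the pixel to the
    right or the pixel below (a simple depth-discontinuity test)."""
    h, w = len(disparity), len(disparity[0])
    edges = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            right = abs(disparity[r][c] - disparity[r][c + 1]) if c + 1 < w else 0
            down = abs(disparity[r][c] - disparity[r + 1][c]) if r + 1 < h else 0
            edges[r][c] = max(right, down) > threshold
    return edges


def grid_pool(disparity, cell):
    """Largest disparity (nearest surface) in each cell x cell block."""
    h, w = len(disparity), len(disparity[0])
    pooled = []
    for r0 in range(0, h, cell):
        row = []
        for c0 in range(0, w, cell):
            block = [disparity[r][c]
                     for r in range(r0, min(r0 + cell, h))
                     for c in range(c0, min(c0 + cell, w))]
            row.append(max(block))
        pooled.append(row)
    return pooled


disparity = [[1, 1, 2, 2],
             [1, 1, 2, 8],
             [3, 3, 4, 4],
             [3, 3, 4, 4]]

edges = edge_map(disparity, threshold=3)    # candidates for high-gain sounds
background = grid_pool(disparity, cell=2)   # candidates for low-gain sounds
print("edge pixels:", [(r, c) for r in range(4) for c in range(4) if edges[r][c]])
print("pooled grid:", background)
```

Pooling reduces a full disparity map to a handful of values per frame, which keeps the number of simultaneous sounds manageable.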

Spatial and Temporal Matching

Matching is the process of finding corresponding points of interest in a pair of images. This process involves searching along a pair of epipolar lines for a match. There are many approaches to this search problem; some are more thorough than others but also more computationally expensive.

Matching must occur between two images that occur at a single time and between two sets of points at different times in order to calculate both 3-dimensional position of points and the velocity of each point.
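Combining the two kinds of matching, a point's 3-dimensional position follows from its pixel location and disparity in one stereo pair, and its velocity follows from the change in that position between two frames. The sketch below assumes a rectified pair, and the intrinsics, baseline, and frame interval are hypothetical values.

```python
def triangulate(u, v, disparity_px, fx, fy, cx, cy, baseline_m):
    """3-D position (meters, left-camera frame) of a spatially matched point."""
    z = fx * baseline_m / disparity_px
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)


def velocity(p_prev, p_curr, dt):
    """Finite-difference velocity (m/s) of a temporally matched point."""
    return tuple((c - p) / dt for p, c in zip(p_prev, p_curr))


FX = FY = 700.0          # assumed focal lengths in pixels
CX, CY = 640.0, 360.0    # assumed principal point
BASELINE = 0.12          # assumed camera separation in meters
DT = 1.0 / 30.0          # frame interval at 30 FPS

# The same feature observed in two consecutive frames.
p0 = triangulate(990.0, 360.0, 40.0, FX, FY, CX, CY, BASELINE)
p1 = triangulate(1007.5, 360.0, 42.0, FX, FY, CX, CY, BASELINE)
v = velocity(p0, p1, DT)
print("position t0:", p0)
print("position t1:", p1)
print("velocity (m/s):", v)
```

Because the velocity is a difference of two noisy depth estimates divided by a short interval, some smoothing over several frames would likely be needed in practice.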

Uploading the software to the hardware

Nvidia JetPack 3.3 was not initially working on Ubuntu 18.04. The processor on the startup machine needed to be updated before we could launch the installer. JetPack 3.3 is the SDK that will run the program on our device. The Jetson will be connected to the same router as the startup machine. The next step was to install Ubuntu 16.04 on a virtual machine, because the newer version of Ubuntu was not compatible with JetPack 3.3.

=Final Design=

The end goal for this system is to have the user wear a backpack that contains the Jetson, powered by a battery pack and connected to a ZED camera mounted on the user's head. The Jetson will use the ZED SDK to capture stereo images and render their disparity maps in real time. These disparity maps will then be used along with feature extraction and edge filtering to calculate the depth, angle, and height of relevant objects in the user's surroundings. This data will then be converted into a specialized coordinate system that uses interaural time delay, frequency modulation, and distortion to convey each relevant object's x, y, and z coordinates to the user. We have not made any further design specifications for the final system beyond this initial concept.
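A rough sketch of such a mapping is given below. The interaural time delay uses Woodworth's spherical-head approximation; everything else (which coordinate drives which cue, the axis convention of x right, y up, z forward, the head radius, and the frequency and range limits) is an assumption made for illustration and not a design decision from this project.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s in air at roughly 20 degrees C
HEAD_RADIUS = 0.0875     # m, an assumed average head radius


def interaural_time_delay(azimuth_rad):
    """Woodworth's approximation, intended for |azimuth| <= pi/2.

    Positive values mean the sound reaches the right ear first.
    """
    return HEAD_RADIUS / SPEED_OF_SOUND * (azimuth_rad + math.sin(azimuth_rad))


def clamp01(value):
    return min(max(value, 0.0), 1.0)


def to_audio_cue(x, y, z, f_min=200.0, f_max=2000.0, y_range=1.0, z_max=20.0):
    """Map a point (meters) to (time delay in s, frequency in Hz, distortion)."""
    azimuth = math.atan2(x, z)                   # horizontal angle to the object
    itd = interaural_time_delay(azimuth)
    # Height mapped linearly to pitch: higher objects sound higher.
    height = clamp01((y + y_range) / (2.0 * y_range))
    frequency = f_min + height * (f_max - f_min)
    # Depth mapped to a 0..1 distortion amount: nearer objects distort more.
    distortion = 1.0 - clamp01(z / z_max)
    return itd, frequency, distortion


print(to_audio_cue(1.0, 0.5, 3.0))
```

How the distortion amount is applied to the waveform is left open here, since the source only names the cue.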

=Validation=

Because we have not yet developed a prototype, we do not yet have a concrete validation plan. If the final version of this system is similar to what we imagine, then validation will entail checking that the framerate, resolution, and field of vision of the stereo vision system comply with the client's specifications. A more subjective validation that will also need to be performed is ensuring that the feature extraction accurately identifies the edges of objects. It is also necessary to ensure that the user is not overwhelmed with extraneous auditory data.
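As one example of an objective check, the framerate could be validated from the capture timestamps of a recorded run. The sketch below is a hypothetical helper, and the 5% tolerance is an assumed value rather than a client requirement.

```python
def measured_fps(timestamps_s):
    """Average frames per second over a list of capture times in seconds."""
    if len(timestamps_s) < 2:
        raise ValueError("need at least two timestamps")
    elapsed = timestamps_s[-1] - timestamps_s[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return (len(timestamps_s) - 1) / elapsed


def meets_framerate(timestamps_s, required_fps, tolerance=0.05):
    """True if the measured rate is within `tolerance` below the requirement."""
    return measured_fps(timestamps_s) >= required_fps * (1.0 - tolerance)


# Synthetic timestamps for one second of capture at 30 FPS.
stamps = [i / 30 for i in range(31)]
print(f"measured: {measured_fps(stamps):.1f} FPS")
print("meets 30 FPS:", meets_framerate(stamps, 30))
print("meets 60 FPS:", meets_framerate(stamps, 60))
```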

We will be given more specific instructions for validation by our client when we have a functional or semi-functional prototype. Shown below are some validation plans that were created at an early stage of project development. These may morph into other plans as we learn more about the project.

=Team Members=

=Additional Documentation=

Project Schedule

[[Media:Spectralink schedule.pdf]]

Meeting Minutes

[[Media:First semester minutes.pdf]]

Presentations

[[Media:Visual Depth and Velocity Mapping ppt.pdf]] [[Media:Spectralink Intro.pdf]] [[Media:Spectralink- Preliminary Design Review.pdf]] [[Media:Snapshot 2 Poster.pdf]] [[Media:08 Schneider Real-Time Disparity Mapping.pdf]]