Visual Depth and Velocity Mapping

This project provides a sight substitution system for the visually impaired. Information about the surrounding environment is gathered by stereo vision cameras and used to generate disparity maps. These maps are then used to calculate the depth and velocity of surrounding objects. The depth and velocity values are expressed as vectors, which are then translated into audio signals of varying pitch and synchronization.

=Problem Definition=

Although there are extensive resources available for most disabilities, there are very few aids for the visually impaired. This project aims to write software to map visual stimuli to an auditory field using Real Time Disparity Mapping (RTDM) with a dual camera system and simple processor. The unit should be wearable and functional in the sense that somebody could navigate around campus with it on their head. By November 27th we expect to have our first prototype running and functional.

=Background=

It has been demonstrated that people who are visually impaired can learn to navigate via echolocation. This project is therefore based on the intuition that a user can learn to translate variations in an audio signal into meaningful information about the spatial relationship of objects in the surrounding environment.

=Deliverables=

The deliverables for this project include the following:

1. Functioning hardware (ZED stereo camera integrated with Nvidia GPU processing board)

2. Design Report detailing how to work the hardware and provide updates to the software when needed

3. Portfolio containing documentation

=Specifications=

User Interface Requirements


A set of headphones and a camera shall be used by the user.

Eventually the device should be secured to the body in a configuration suitable for walking and sitting.

The device should not impede the user's ability to interact with objects in front of or to either side of him or her.

The device should not expose the user to any severe electrical or mechanical hazards (electric shock or burning).



What the Product Should Do


This product will relay 3D position and velocity information in an audio format to the user. The resolution and accuracy of the data should be sufficient for traversing a room without the assistance of eyesight.



Physical Requirements


<ul>

<li>Maximum weight: 10 lbs</li>

<li>Power: board 12.6 W, camera 3.8 W</li>

<li>Framerate: 15, 30, 60 FPS</li>

<li>Depth of field (max): 15, 20, 30 m</li>

</ul>

=Design Considerations=

In order to create a Real Time Disparity Mapping system that could be used in a mobile set-up, it was necessary to find hardware that was small enough to be worn on the user's body and yet powerful enough to process images. The first prototype that we proposed was a Tara Stereo vision camera that would be connected to a Raspberry Pi 3.0 B+, which would be powered by a battery pack.

Initial Considerations
Initially we looked at the Raspberry Pi 3.0 B+ and the Tara.

Secondary Considerations
After discussing the first option with our client, we determined that the Raspberry Pi was not powerful enough to do the image processing that we needed and that the viewing angle of the Tara camera was not large enough. After more research, we found another camera called ZED produced by StereoLabs which had the desired viewing angle and depth perception. Also, we found another processing board called the Nvidia Jetson TX2 which had sufficient computational capabilities to perform the tasks that we needed.

System Design Considerations
Another problem that our team deliberated over was how to place the hardware on the user's body. We initially thought of placing the camera on the user's chest, but then decided to fix it to the user's head, with the Jetson board stored in a backpack along with a battery pack. In this way, the product would more accurately replicate the function of eyes with respect to object orientation.

Prospective Final Design
Figure 1: Visualization of the final system

Software Design Choices
We also discovered that the code for the project needs to be developed with a custom development kit produced by Nvidia rather than with OpenCV alone. Links to each of these development environments are given below:



<ul>

<li>OpenCV: </li>

<li>Nvidia JetPack: </li>

<li>OpenAL: </li>

</ul>

=Project Learning=

Prior Considerations
One of the possible software solutions we considered was to use a Convolutional Neural Network to identify objects, scale them, and estimate their depths with respect to the user. Because this was computationally expensive, we decided to use the disparity maps that are generated by the ZED SDK instead.

Current Learning
In order to produce an RTDM, there are six fundamental milestones that need to be achieved:

1.	Single-camera calibration

2.	Multiple-camera calibration

3.	Rectification

4.	Feature extraction

5.	Spatial feature matching

6.	Temporal feature matching

OpenCV provides tools to complete these tasks.

OpenCV Installation
In order to install OpenCV on the Jetson, we downloaded a Bash script from:

https://github.com/jetsonhacks/buildOpenCVTX2/blob/master/buildOpenCV.sh

We ran this script from the Linux (Ubuntu) terminal to perform the installation.

ZED API Functions
ZED SDK comes with grab and capture functions as well as API calls that create disparity maps. This was extremely useful when we were implementing the vision side as we were able to use the ZED SDK to make our disparity maps.

The grab and capture functions are designed to work with the ZED camera. They make it easy to take images with the camera and to receive those images in the calling process, where they can be handled however the user intends. In our case, we made disparity maps from the images.

Single-Camera Calibration
Single-camera calibration consists of solving for the intrinsic parameters of a camera, namely the camera matrix and the set of distortion coefficients. The camera matrix describes how 3-dimensional objects are projected onto a 2-dimensional image plane. It contains the focal lengths and the principal point, which together with the image size determine the field of view of the camera. The distortion coefficients describe how the camera lens distorts the light as it is collected. An example of distortion is radial distortion, which creates a “barrel effect” on the resulting image. An example of barrel distortion is shown below in Figure 2.

Figure 2. The left image is transformed to the right image using a distortion-correction matrix.

http://www.ntu.edu.sg/home/assklam/images/distortion-correction.jpg

The ultimate use of the camera matrix is to constrain each feature point to a line in 3-dimensional space (Figure 3). If two images are captured from adjacent points, and feature points can be matched between them, the depth of each pair of feature points can be solved for by finding the intersection of the two lines in 3-dimensional space.

Figure 3. A single feature point, Pc, is constrained along the vector OP.

https://prateekvjoshi.files.wordpress.com/2014/05/3-pinhole-camera-geometry.png

In order to find the true line that each feature point can be constrained to, it is necessary to undistort the image first. Figure 2 demonstrates the undistortion of an image.
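The relationship between a 3-dimensional point, its pixel, and the line it is constrained to can be sketched in a few lines of Python. This is only an illustration of the pinhole model on an already undistorted image; the intrinsic values (`FX`, `FY`, `CX`, `CY`) are made up for the example and are not the calibration of our camera.

```python
def project(point, fx, fy, cx, cy):
    """Project a camera-frame point (X, Y, Z) onto the image plane as (u, v)."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    return (fx * X / Z + cx, fy * Y / Z + cy)


def back_project(pixel, fx, fy, cx, cy):
    """Return the direction of the ray through a pixel, scaled so that Z = 1.

    Every 3-D point that lands on this pixel lies somewhere along this ray,
    which is why a single camera cannot recover depth on its own.
    """
    u, v = pixel
    return ((u - cx) / fx, (v - cy) / fy, 1.0)


# Hypothetical intrinsics for a 1280x720 image (assumed values).
FX, FY, CX, CY = 700.0, 700.0, 640.0, 360.0

pixel = project((0.5, -0.25, 2.0), FX, FY, CX, CY)
ray = back_project(pixel, FX, FY, CX, CY)
```

Scaling `ray` by the true depth (2.0 in this example) recovers the original point, which is the piece of information the second camera supplies.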

Multiple-Camera Calibration
Multiple-camera calibration is necessary to relate two lines that a single feature may fall upon, so the intersection point can be extracted, and, consequently, the depth can be calculated. The relationship between two camera positions can be described with an essential matrix.

The essential matrix encodes the rotation and translation between the two cameras, and so constrains their relative position and orientation. Figure 4 demonstrates the concept of solving along two intersecting lines.

Figure 4. After a feature is matched, the two lines are created and solved for using the essential matrix which describes the relative positions and orientations of the two cameras, Ol and Or.

https://docs.opencv.org/3.0-beta/_images/essential_matrix.jpg
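As a rough sketch of how the essential matrix ties the two views together, the code below builds E from a rotation R and translation t using the standard relation E = [t]x R, and checks the epipolar constraint x_right^T E x_left = 0 for a correctly matched point. The rig geometry (a 5 degree rotation and a 0.12 m offset) and all function names are assumptions chosen for the example, not measured values from our hardware.

```python
import math


def skew(t):
    """Cross-product matrix [t]x, so that skew(t) applied to v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def essential_matrix(R, t):
    """E = [t]x R for the convention P_right = R * P_left + t."""
    return matmul(skew(t), R)


def epipolar_residual(E, x_left, x_right):
    """x_right^T E x_left; zero (up to rounding) for a true correspondence."""
    Ex = matvec(E, x_left)
    return sum(x_right[i] * Ex[i] for i in range(3))


# Hypothetical rig: small rotation about the vertical axis plus a sideways offset.
angle = math.radians(5.0)
R = [[math.cos(angle), 0.0, math.sin(angle)],
     [0.0, 1.0, 0.0],
     [-math.sin(angle), 0.0, math.cos(angle)]]
t = [-0.12, 0.0, 0.0]
E = essential_matrix(R, t)

# One 3-D point seen by both cameras, in normalized image coordinates.
P_left = [0.3, -0.2, 2.5]
P_right = [a + b for a, b in zip(matvec(R, P_left), t)]
x_left = [c / P_left[2] for c in P_left]
x_right = [c / P_right[2] for c in P_right]

residual = epipolar_residual(E, x_left, x_right)

# A wrong match (shifted vertically) violates the constraint.
x_wrong = [x_right[0], x_right[1] + 0.1, 1.0]
wrong_residual = epipolar_residual(E, x_left, x_wrong)
```

In practice the residual is used the other way round: candidate matches whose residual is far from zero can be rejected as inconsistent with the calibrated geometry.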

Rectification
After feature points have been extracted, they need to be matched in order to find depth. For each feature point in one image, a search for a match in the corresponding image must be conducted. By analyzing the epipolar geometry, shown in Figure 5, the search space can be reduced to a single line.

Figure 5. Epipolar lines
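Once the images are rectified, a match lies on the same row in both images, and the horizontal shift between the two positions (the disparity) gives depth through the standard relation Z = f * B / d. The sketch below illustrates that relation only; the focal length and baseline are assumed example values, and the function name is ours rather than part of any SDK.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters of a point in a rectified stereo pair.

    disparity_px: horizontal shift between the left and right pixel, in pixels
    focal_px:     focal length expressed in pixels
    baseline_m:   distance between the two camera centers, in meters
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a valid match")
    return focal_px * baseline_m / disparity_px


# Assumed example values, not our calibration.
FOCAL_PX = 700.0
BASELINE_M = 0.12

near = depth_from_disparity(42.0, FOCAL_PX, BASELINE_M)  # about 2 m
far = depth_from_disparity(4.2, FOCAL_PX, BASELINE_M)    # about 20 m
```

Because depth is inversely proportional to disparity, a one-pixel matching error matters far more for distant objects than for nearby ones.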

Feature Extraction
Feature extraction consists of the recognition of potentially interesting points in an image. There is a suite of algorithms that take different approaches to completing this task. The algorithm mentioned, but not required, by Dan Schneider is called the Scale-Invariant Feature Transform (SIFT). SIFT works by detecting keypoints that remain stable across image scales, such as corners and blobs, and describing each one by the image gradients around it.

Edge Detection
For edge detection, we used Canny edge detection, which OpenCV exposes as cv2.Canny. The disparity map is first read with a call to imread, and the edge detector then outputs an edge map, a filtered version of the disparity map that shows only the edges of objects. We planned to apply our own filtration system to this edge map as well, extracting a set of points from it to play with high gain, while points extracted from the disparity map itself would be played with low gain. This would allow the user to identify important edges easily.

We did not end up using the edge detection system in our final design. However, it is fully set up for future Capstone students to use if they need or want it.
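For readers without OpenCV at hand, the idea of turning a disparity map into an edge map can be illustrated in plain Python. Note that this is not Canny and not what cv2.Canny does internally in full: it is only a Sobel gradient followed by a single threshold, which corresponds to the gradient stage that Canny builds on, without the smoothing, non-maximum suppression, or hysteresis steps. The function name, test image, and threshold are hypothetical.

```python
import math


def sobel_edges(image, threshold):
    """Mark pixels whose Sobel gradient magnitude reaches the threshold.

    image is a list of rows of numbers (for example disparity values).
    Returns a same-sized map with 255 at edge pixels and 0 elsewhere.
    Border pixels are left at 0 because the 3x3 window does not fit there.
    """
    rows, cols = len(image), len(image[0])
    edges = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = (image[r - 1][c + 1] + 2 * image[r][c + 1] + image[r + 1][c + 1]
                  - image[r - 1][c - 1] - 2 * image[r][c - 1] - image[r + 1][c - 1])
            gy = (image[r + 1][c - 1] + 2 * image[r + 1][c] + image[r + 1][c + 1]
                  - image[r - 1][c - 1] - 2 * image[r - 1][c] - image[r - 1][c + 1])
            if math.hypot(gx, gy) >= threshold:
                edges[r][c] = 255
    return edges


# Toy disparity map: a far surface (10) on the left, a near object (50) on the right.
toy_disparity = [[10, 10, 10, 50, 50, 50] for _ in range(5)]
edge_map = sobel_edges(toy_disparity, threshold=100)
```

Only the two columns on either side of the depth jump are marked, which is the kind of object boundary the high-gain audio points were meant to emphasize.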

Spatial and Temporal Matching
Matching is the process of finding corresponding points of interest in a pair of images. This process involves searching along a pair of epipolar lines for a match. There are many approaches to this search problem; some are much more thorough than others, but also more computationally expensive.

Matching must occur between two images that occur at a single time and between two sets of points at different times in order to calculate both 3-dimensional position of points and the velocity of each point.
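Once the same point has been matched across two frames, its velocity follows from a finite difference of its two 3-dimensional positions. The sketch below assumes the positions are already expressed in meters in the camera frame and uses a 0.3 s interval purely as an example; the function names are ours.

```python
import math


def velocity(p_prev, p_curr, dt):
    """Finite-difference velocity (m/s) of a point matched across two frames."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    return tuple((c - p) / dt for p, c in zip(p_prev, p_curr))


def speed(v):
    """Magnitude of a velocity vector."""
    return math.sqrt(sum(component ** 2 for component in v))


# A point 4.0 m ahead moves to 3.7 m ahead between two frames 0.3 s apart.
v = velocity((1.0, 0.0, 4.0), (1.0, 0.0, 3.7), dt=0.3)
approach_speed = speed(v)  # about 1 m/s toward the camera
```

This estimate is only as good as the temporal match: if the two positions belong to different physical points, the resulting velocity is meaningless.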

Disparity Filtration System
In order to limit the number of points that we need to play as audio, we decided to filter the data after the disparity mapping. We use the disparity map to generate a max pool: the map is split into a grid, and the maximum point from each partition is extracted. From these points' x, y, and z values (obtained in the disparity mapping), we calculate the coordinates used to map our system to audio. (See Figures 6 and 7.)

Figure 6: Disparity Map Example

Figure 7: Disparity Map Partitioning
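The partitioning step can be sketched as follows. This is a minimal plain-Python illustration of grid max pooling rather than our production code; the function name and the toy 4x4 map are hypothetical, and a real disparity map would be far larger and split into more cells.

```python
def max_pool_grid(disparity, grid_rows, grid_cols):
    """Pick the pixel with the largest disparity in each grid cell.

    disparity is a list of rows; the map is split into grid_rows x grid_cols
    cells. Returns one (row, col, value) tuple per cell, in row-major order.
    Assumes the map has at least grid_rows rows and grid_cols columns.
    """
    height, width = len(disparity), len(disparity[0])
    points = []
    for gr in range(grid_rows):
        r0, r1 = gr * height // grid_rows, (gr + 1) * height // grid_rows
        for gc in range(grid_cols):
            c0, c1 = gc * width // grid_cols, (gc + 1) * width // grid_cols
            cells = [(r, c) for r in range(r0, r1) for c in range(c0, c1)]
            r, c = max(cells, key=lambda rc: disparity[rc[0]][rc[1]])
            points.append((r, c, disparity[r][c]))
    return points


toy_map = [
    [1, 2, 0, 0],
    [3, 9, 0, 7],
    [4, 0, 5, 5],
    [0, 0, 6, 1],
]
pooled = max_pool_grid(toy_map, 2, 2)
```

Since a larger disparity means a closer object, taking the maximum in each cell keeps the nearest point in that part of the view, which is usually the one the user most needs to hear.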

Uploading software to the hardware
NVIDIA JetPack 3.3 did not initially work on Ubuntu 18.04. The processor on the startup machine needed to be updated before we could launch the installer. JetPack 3.3 is the SDK that will run the program on our device. The JetPack device will be connected to the same router as the startup machine. The next step was to install Ubuntu 16.04 on a virtual machine, because the newer version of Ubuntu was not compatible with JetPack 3.3.

=Final Design=

The end goal for this system is to have the user wear a backpack containing the Jetson, powered by a battery pack and connected to a ZED camera mounted on the user's head. The Jetson will use the ZED SDK to capture stereo images and render their disparity maps in real time. These disparity maps will then be used along with feature extraction and edge filtering to calculate the depth, angle, and height of relevant objects in the user's surroundings. This data will then be converted into a specialized coordinate system that uses interaural time delay, frequency modulation, and distortion to convey each relevant object's x, y, and z coordinates to the user. We have not made any further design specifications for the final system beyond that initial concept. Unfortunately, we were unable to reach this end goal given the time constraint of our project.
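To make the audio coordinate idea more concrete, the sketch below shows one possible way to turn a point's position into an interaural time delay and a pitch. It is an illustration under stated assumptions, not the mapping we implemented: the delay uses Woodworth's spherical-head approximation with an assumed head radius, and the assignment of height to pitch, the height range, and the frequency range are all hypothetical choices.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly room temperature
HEAD_RADIUS = 0.0875    # m, an assumed average head radius


def interaural_time_delay(x, z):
    """Approximate delay in seconds between the ears for a source at (x, z).

    x is the lateral offset (positive to the right) and z the forward
    distance. Uses Woodworth's formula (a / c) * (theta + sin(theta)), with
    the azimuth clamped to the frontal half-plane.
    """
    azimuth = math.atan2(x, z)
    azimuth = max(-math.pi / 2, min(math.pi / 2, azimuth))
    return HEAD_RADIUS / SPEED_OF_SOUND * (azimuth + math.sin(azimuth))


def height_to_frequency(y, y_min=-1.0, y_max=1.0, f_min=220.0, f_max=880.0):
    """Hypothetical mapping: higher points get a higher pitch (log scale)."""
    y = max(y_min, min(y_max, y))
    fraction = (y - y_min) / (y_max - y_min)
    return f_min * (f_max / f_min) ** fraction


straight_ahead = interaural_time_delay(0.0, 2.0)
hard_right = interaural_time_delay(1.0, 0.0)
eye_level_pitch = height_to_frequency(0.0)
```

With these assumed constants, a source directly to one side produces a delay of roughly two thirds of a millisecond, which shows how small the timing differences are that the audio output has to reproduce.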

Final Hardware and Software
The final design consisted of:

<ol>

<li>Hardware:

<ul>

<li>NVIDIA Jetson</li>

<li>Stereo Labs ZED 2</li>

<li>Tenergy Battery</li>

</ul>

</li>

<li>Software:

<ul>

<li>OpenCV</li>

<li>OpenAL</li>

<li>ZED SDK</li>

<li>Ubuntu</li>

</ul>

</li>

</ol>

Final Specifications
In conclusion, we were able to implement a functional system meeting most of the specifications.

The specifications we met were:

<ul>

<li>ALL hardware requirements</li>

<li>Sounds that play</li>

<li>Semi-functioning system (proof of concept)</li>

</ul>

The specifications we were unable to meet were:

<ul>

<li>The exact number of points captured and played</li>

<li>The frequency of sounds playing (we play a sound every 300 ms instead of every 1-10 ms)</li>

<li>The total number of variations of sounds (frequency or distortion)</li>

</ul>

Final Design Images
Figure 8: Final Design

Figure 9: Final Design

=Validation=

Because we were unable to complete our prototype, some of our validation is not concrete. We were able to check that the framerate, resolution, and field of vision of the stereo vision system comply with the client's specifications. A more subjective validation that still needs to be performed is ensuring that the feature extraction accurately identifies the edges of objects. We were unable to do this, but future groups should be able to continue with our project. It is also necessary to ensure that the user is not overwhelmed with extraneous auditory data. This must be tested on human subjects, following the guidelines for testing with human subjects.

Validation Results
<ol>

<li>The field of vision was found to be:

<ul>

<li>110° horizontal</li>

<li>90° vertical</li>

</ul>

</li>

<li>Our camera has a depth range of 1-30 m</li>

<li>The frame period was found to be 300 ms (about 3.3 frames per second)</li>

<li>The max resolution we were able to produce was 36 points using a 6x6 partitioning algorithm on the Disparity Map.</li>

</ol>

Validation Conclusions
Playing sounds fast enough was the limiting factor for both framerate and resolution. Using OpenAL, we were unable to play more than 36 sounds without more than 200 ms of lag. The resolution could be increased, but only by lowering the framerate; likewise, the framerate could be increased, but only by lowering the resolution.

Validating the Filtration System
In order to validate our filtration system, we displayed each partition's maximum pixel as an enlarged block. For each partition, we found the max pixel and inflated it into a 400-pixel (20x20) area. We could then watch the max pixel for each partition update, which confirmed that our filtration system was producing output and updating in real time.
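The inflation step used for this visual check can be sketched as below. It is a minimal plain-Python illustration that takes one value per partition and repeats it over a square block; the function name and the 2x2 example are hypothetical, and displaying the result is left to whatever image viewer is available.

```python
def inflate(grid_values, block=20):
    """Expand each grid value into a block x block square of pixels.

    grid_values is a list of rows with one value per partition. A 6x6 grid
    with block=20 therefore yields a 120x120 image.
    """
    image = []
    for row in grid_values:
        inflated_row = [value for value in row for _ in range(block)]
        image.extend(list(inflated_row) for _ in range(block))
    return image


preview = inflate([[1, 2], [3, 4]], block=20)
```

Each call produces a fresh image, so redrawing it once per frame shows the per-partition maxima changing as the scene changes.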

Validating that the Coordinate System Maps to Sound
To validate that we were playing different sound files at each frame, we printed the names of the sound files as they were played. This check was successful: all of the sound frequencies and distortions were used, as were a majority of the sound files.

Comparison to the Initial Validation Requirements
Shown below (Figure 10) is the validation plan that was created at an early stage of project development. We were unable to meet many of these requirements, as discussed above; however, we did produce a working proof of concept.

Figure 10 Design Validation Plan

=Team Members=

=Additional Documentation=

Project Schedule

[[Media:Spectralink schedule.pdf]]

Meeting Minutes

[[Media:First semester minutes.pdf]]

Presentations

[[Media:Visual Depth and Velocity Mapping ppt.pdf]] [[Media:Spectralink Intro.pdf]] [[Media:Spectralink- Preliminary Design Review.pdf]] [[Media:Snapshot 2 Poster.pdf]] [[Media:08 Schneider Real-Time Disparity Mapping.pdf]]