The new machine learning system can generate a 3D scene from an image around 15,000 times faster than other methods.
Humans are pretty good at looking at a single two-dimensional image and understanding the full three-dimensional scene that it captures. Artificial intelligence agents are not.
However, a machine that has to interact with objects in the world, such as a robot that is supposed to harvest crops or help with surgery, has to be able to infer properties of a 3D scene from observations of the 2D images on which it has been trained.
While scientists have had success using neural networks to infer representations of 3D scenes from images, these machine learning methods are not fast enough to be feasible for many real-world applications.
A new technique demonstrated by researchers at MIT and elsewhere is able to render 3D scenes from images about 15,000 times faster than some existing models.
The method represents a scene as a 360-degree light field, a function that describes all rays of light in a 3D space that flow through every point and in every direction. The light field is encoded in a neural network that enables faster rendering of the underlying 3D scene from an image.
The light field networks (LFNs) developed by the researchers can reconstruct a light field after just looking at an image and render 3D scenes at real-time frame rates.
“Ultimately, the great promise of these neural scene representations is that they can be used in visual tasks. I give you a picture, and from that picture you create a representation of the scene, and then whatever you want to reason about, you do in the space of this 3D scene,” says Vincent Sitzmann, a postdoc at the Computer Science and Artificial Intelligence Laboratory (CSAIL) and co-lead author of the paper.
Sitzmann wrote the paper with co-lead author Semon Rezchikov, a postdoc at Harvard University; William T. Freeman, the Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science and a member of CSAIL; Joshua B. Tenenbaum, a professor of computational cognitive science in the Department of Brain and Cognitive Sciences and a member of CSAIL; and senior author Frédo Durand, a professor of electrical engineering and computer science and a member of CSAIL. The research will be presented this month at the Conference on Neural Information Processing Systems.
In computer vision and computer graphics, rendering a 3D scene from an image involves mapping thousands, or possibly millions, of camera rays. Think of camera rays like laser beams shooting out of a camera lens and striking each pixel in an image, one ray per pixel. These computer models must determine the color of the pixel struck by each camera ray.
Many current methods accomplish this by taking hundreds of samples along the length of each camera ray as it moves through space, which is a computationally intensive process that can result in slow rendering.
Instead, an LFN learns to represent the light field of a 3D scene and then maps each camera ray in the light field directly to the color that is observed by that ray. An LFN takes advantage of the unique properties of light fields that allow a ray to be rendered after just a single evaluation, so that the LFN does not have to stop along a ray to perform calculations.
“With other methods, this rendering requires you to follow the ray until you find the surface. You have to take thousands of samples, because that is what it means to find a surface. And you’re not even finished then, because there can be complex things like transparency or reflections. With a light field, once you have reconstructed the light field, which is a complicated problem, rendering a single ray takes just a single sample of the representation, because the representation maps a ray directly to its color,” says Sitzmann.
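The difference in cost that Sitzmann describes can be illustrated with a toy scene. The following is a minimal sketch, not the researchers' network: the scene is a single unit sphere, `density` stands in for a network that is queried at 3D points, and `render_by_light_field` stands in for a learned light field by using a closed-form test that is only valid for this toy scene. All names and parameter values are hypothetical.

```python
import numpy as np

RADIUS = 1.0  # toy scene: one sphere of this radius centered at the origin


def density(point):
    """Stand-in for a network queried at a 3D point: 1 inside the sphere, else 0."""
    return 1.0 if np.linalg.norm(point) <= RADIUS else 0.0


def render_by_sampling(origin, direction, near=0.0, far=6.0, n_samples=200):
    """March along the ray and stop at the first sample inside the sphere.

    Returns (hit, number_of_scene_queries).
    """
    queries = 0
    for t in np.linspace(near, far, n_samples):
        queries += 1
        if density(origin + t * direction) > 0.5:
            return True, queries
    return False, queries


def render_by_light_field(origin, direction):
    """Stand-in for a light field: one evaluation maps the ray to its result.

    For a unit-length direction, the norm of origin x direction is the distance
    of the ray's line from the scene origin, so the line meets the sphere when
    that distance is at most RADIUS. Assumes the sphere is in front of the camera.
    """
    moment = np.cross(origin, direction)
    return bool(np.linalg.norm(moment) <= RADIUS), 1


camera = np.array([0.0, 0.0, -3.0])
toward_sphere = np.array([0.0, 0.0, 1.0])
print(render_by_sampling(camera, toward_sphere))     # many queries before the hit
print(render_by_light_field(camera, toward_sphere))  # a single query
```

In this sketch the sampling renderer needs dozens of queries for a ray that hits the sphere and all 200 for a ray that misses, while the light field stand-in always needs exactly one.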
The LFN classifies each camera ray using its “Plücker coordinates”, which represent a line in 3D space based on its direction and how far it is from the point of origin. The system calculates the Plücker coordinates of each camera ray at the point where it meets a pixel in order to render an image.
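As a rough illustration, the standard textbook form of Plücker coordinates pairs a line's unit direction with its moment, the cross product of any point on the line with that direction. The sketch below assumes this standard definition; the function name and the example values are hypothetical and are not taken from the researchers' code.

```python
import numpy as np


def plucker_coordinates(origin, direction):
    """Return the 6D Plücker coordinates (d, m) of the line through `origin`.

    d is the unit direction and m = origin x d is the moment. For a unit d,
    the norm of m equals the distance of the line from the coordinate origin.
    """
    d = direction / np.linalg.norm(direction)
    m = np.cross(origin, d)
    return np.concatenate([d, m])


# Two different points on the same line give the same coordinates,
# so the representation does not depend on where along the ray the camera sits.
a = plucker_coordinates(np.array([1.0, 2.0, 3.0]), np.array([0.0, 0.0, 2.0]))
b = plucker_coordinates(np.array([1.0, 2.0, 10.0]), np.array([0.0, 0.0, 2.0]))
print(a)
print(np.allclose(a, b))
```

Because the coordinates identify the line itself rather than a particular starting point, every camera position along the same ray maps to the same input.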
By mapping each ray with Plücker coordinates, the LFN can also calculate the geometry of the scene due to the parallax effect. Parallax is the difference in the apparent position of an object when viewed from two different angles. For example, when you move your head, objects farther away seem to move less than closer objects. The LFN can determine the depth of objects in a scene based on parallax and uses this information to encode the geometry of a scene and its appearance.
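The parallax relationship itself can be written down for an idealized pinhole camera that slides sideways: a point at depth Z shifts in the image by f · b / Z, where f is the focal length in pixels and b is the distance the camera moved. The sketch below uses this textbook relationship only to illustrate why parallax carries depth information; it is not the procedure the LFN uses, and all names and numbers are hypothetical.

```python
def image_shift(focal_length_px, baseline, depth):
    """Apparent shift (in pixels) of a point at `depth` when the camera moves
    sideways by `baseline` (same units as depth), for an ideal pinhole camera."""
    return focal_length_px * baseline / depth


def depth_from_shift(focal_length_px, baseline, shift_px):
    """Invert the relationship above: recover depth from the observed shift."""
    return focal_length_px * baseline / shift_px


# A near object shifts more than a far one for the same camera motion.
near_shift = image_shift(focal_length_px=500.0, baseline=0.1, depth=2.0)
far_shift = image_shift(focal_length_px=500.0, baseline=0.1, depth=10.0)
print(near_shift, far_shift)
print(depth_from_shift(500.0, 0.1, near_shift))
```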
In order to reconstruct light fields, however, the neural network first has to learn about the structure of light fields. Therefore, the researchers trained their model with many pictures of simple scenes of cars and chairs.
“There is an intrinsic geometry of light fields that our model tries to learn. You might worry that light fields of cars and chairs are so different that you can’t find any similarities between them. But it turns out that as you add more kinds of objects, as long as there is some homogeneity, you get a better and better sense of what light fields of general objects look like, so that you can generalize across classes,” says Rezchikov.
Once the model has learned the structure of a light field, it can render a 3D scene from just one image as input.
The researchers tested their model by reconstructing 360-degree light fields from several simple scenes. They found that LFNs could render scenes at over 500 frames per second, about three orders of magnitude faster than other methods. In addition, the 3D objects rendered by LFNs were often sharper than those produced by other models.
An LFN is also less memory-intensive, requiring only about 1.6 megabytes of storage, as opposed to 146 megabytes for a popular baseline method.
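For a sense of scale, the reported figures can be turned into a per-frame time budget and a storage ratio. This is simple arithmetic on the numbers quoted above; the variable names are hypothetical.

```python
lfn_frames_per_second = 500   # "over 500 frames per second"
lfn_storage_mb = 1.6          # reported LFN storage
baseline_storage_mb = 146     # reported storage of the baseline method

# Time available per frame at 500 frames per second, in milliseconds.
frame_time_ms = 1000 / lfn_frames_per_second

# How many times smaller the LFN is than the baseline.
storage_ratio = baseline_storage_mb / lfn_storage_mb

print(frame_time_ms)   # 2.0
print(storage_ratio)   # roughly 91
```

In other words, each frame takes at most about 2 milliseconds, and the LFN needs roughly one-ninetieth of the baseline's storage.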
“Light fields were proposed before, but back then they were intractable. With the techniques that we used in this work, you can now, for the first time, both represent these light fields and work with them. It’s an interesting convergence of the mathematical models and the neural network models that we have developed, which come together in this application to represent scenes so that machines can reason about them,” says Sitzmann.
In the future, the researchers want to make their model more robust so that it can be used effectively for complex, real-life scenes. One way to drive LFNs forward is to focus only on reconstructing certain areas of the light field, which could make the model run faster and work better in real-world environments, Sitzmann says.
“Recently, neural rendering has enabled photo-realistic rendering and manipulation of images from just a few input views. Unfortunately, all of the existing techniques are computationally very expensive, which prevents applications that require real-time processing, such as video conferencing. This project takes a big step towards a new generation of computationally efficient and mathematically elegant neural rendering algorithms,” says Gordon Wetzstein, associate professor of electrical engineering at Stanford University, who was not involved in this research. “I anticipate it will have widespread uses in computer graphics, computer vision, and beyond.”
Reference: “Light Field Networks: Neural Scene Representations with Single-Evaluation Rendering” by Vincent Sitzmann, Semon Rezchikov, William T. Freeman, Joshua B. Tenenbaum and Fredo Durand, June 4, 2021, Computer Science > Computer Vision and Pattern Recognition.
This work is supported by the National Science Foundation, the Office of Naval Research, Mitsubishi, the Defense Advanced Research Projects Agency, and the Singapore Defense Science and Technology Agency.