By learning to see, recognize, understand, and synthesize objects, AI can also manipulate them, creating images and videos indistinguishable from the real thing. Is there a future in which people will no longer be able to rely on the naked eye to distinguish authentic video from subtle editing? How are deepfakes created? Will fake videos flood our social networks?
We’re no longer surprised that AI outpaces humans in processing large amounts of data. But what about abilities we think of as unique to humans and other living beings, such as perception? Vision is the most important of the human senses, and computer (machine) vision is the branch of AI that trains computers to see. “Seeing” here means not just digitizing a video or image, but making sense of what the computer perceives in the process.
Read how machine vision works and how deepfakes are created in this piece based on the book AI-2041.
Convolutional Neural Networks (CNNs)
Making computer vision work on top of a standard neural network proved to be a daunting task: an image consists of millions of pixels, and teaching a deep learning system to find subtle cues and features across a huge number of such images is an enormous computational problem.
Researchers turned to the human brain for inspiration on how to improve this technology. The visual cortex engages neurons corresponding to a set of restricted areas (known as receptive fields) within which our eyes capture an image at any given moment. Receptive fields identify basic features of visible objects: shapes, lines, colors, or angles. These detectors are connected to the neocortex (literally “new cortex”), the upper layer of the cerebral cortex. The neocortex stores information hierarchically and processes the outputs of the receptive fields, converting them into a more complex interpretation of the scene.
Observations of how people “see” inspired the invention of so-called convolutional neural networks (CNNs).
The lowest layer of a CNN consists of a large number of filters, each of which is applied repeatedly across an image. Like receptive fields, each filter can only see small adjacent portions of the image. Deep learning, by optimizing parameters over many images, decides what each filter “notices.” Each filter outputs a confidence that it has seen the particular feature it represents (e.g., a black line).
The higher layers of the CNN are organized hierarchically, like the neocortex. They take the confidence outputs from lower layers and detect more complex features. For example, if a zebra image is fed into the CNN, the lowest-layer filters may look only for black and white lines in each small area of the image, while the higher layers will see stripes, ears, and legs in larger areas. The next layers can see multiple stripes, two ears, and four legs. At the highest layer, parts of the CNN may try to distinguish a zebra from a horse or a tiger.
Note: all of these examples illustrate what a CNN can do, but in practice the network itself decides which features (stripes, ears, or something beyond human interpretation) to use to maximize its objective function.
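The idea of a low-level filter “noticing” a feature can be sketched in a few lines. The following toy example is a hypothetical illustration (it is not from the book): a single hand-crafted 3×3 vertical-edge filter slides over a tiny 6×6 image, the way a CNN filter scans its receptive fields. In a real CNN the filter weights are not designed by hand like this; they are learned by gradient descent.

```python
import numpy as np

# Toy grayscale "image": a white field with one black vertical stripe.
image = np.ones((6, 6))
image[:, 2] = 0.0  # black stripe in column 2

# A hand-crafted 3x3 vertical-edge filter (in a real CNN, learned weights).
vertical_edge = np.array([[1.0, 0.0, -1.0],
                          [1.0, 0.0, -1.0],
                          [1.0, 0.0, -1.0]])

def convolve2d(img, kernel):
    """Slide the kernel over every 3x3 patch (the filter's 'receptive field')."""
    kh, kw = kernel.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

feature_map = convolve2d(image, vertical_edge)
print(feature_map[0])  # → [ 3.  0. -3.  0.]
```

The filter responds strongly (±3) only at the two edges of the stripe and stays silent elsewhere: each output value is a local “confidence” that the feature is present in that patch, exactly the signal the higher layers then combine into stripes, ears, and legs.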
The CNN is a specialized deep learning architecture designed for computer vision, with variants for images and video. The idea for such networks originated in the 1980s, but scientists at the time lacked the data and computing power to demonstrate what they could do. It was not until 2012 that it became clear that this technology was superior to all previous approaches to computer vision.
By happy coincidence, around the same time users were taking huge numbers of photos and videos with their much cheaper, and therefore widespread, smartphones and posting them on social networks. It was also around then that fast computers and large-scale data storage became more affordable. All of these factors combined to drive a leap in the development and spread of this technology.
Generative Adversarial Networks (GANs)
Deepfakes are based on a technology called generative adversarial networks. As the name implies, GANs are a pair of “adversarial” deep learning neural networks.
The first network, the generator, tries to create something that looks realistic (say, a synthesized image of a dog) based on millions of dog images.
The second network, the discriminator (the detector network), compares the synthesized dog image from the first network to genuine dog images and determines whether the generator output is genuine or fake.
Based on the feedback from the discriminator, the generator retrains itself so that it can fool the discriminator the next time. It corrects itself by minimizing the “loss function” – a measure of how far the generated images are from passing for real ones.
Then the discriminator is retrained as well, getting better at spotting fakes – from its side of the game, the same objective is maximized. These two processes repeat millions of times; both networks keep improving until they reach a stable equilibrium.
From the book AI-2041.
The post How neural networks create deepfakes appeared first on Business.