AI learns to decipher images based on spoken words—almost like a toddler

Timothy B. Lee

from Ars Technica - All content on 2018-09-23 13:00 (#3ZC4C)

Given this picture and audio of the word "airliner," a neural network identifies the portions of the image where there's an airplane (indicated by the red lines). The software learned to do this entirely by looking at 400,000 pictures, each paired with a brief, free-form spoken description of the scene. (credit: David Harwath et al.)

Babies learn words by matching images to sounds. A mother says "dog" and points to a dog. She says "tree" and points to a tree. After repeating this process thousands of times, babies learn to recognize both common objects and the words associated with them.

Researchers at MIT have developed software with a similar ability: it learns to recognize objects in the world using nothing but raw images and spoken audio. The software examined about 400,000 images, each paired with a brief audio clip describing the scene. By studying these image-audio pairs, the software learned to correctly identify which portions of a picture contain each object mentioned in the audio description.

For example, this image comes with the caption "a white and blue jet airliner near trees at the base of a low mountain."

Source	RSS or Atom Feed
Feed Location	http://feeds.arstechnica.com/arstechnica/index
Feed Title	Ars Technica - All content
Feed Link	https://arstechnica.com/