Microsoft unveils AI model that understands image content, solves visual puzzles

by Benj Edwards, Ars Technica

An AI-generated image of an electronic brain with an eyeball. (credit: Ars Technica)

On Monday, researchers from Microsoft introduced Kosmos-1, a multimodal model that can reportedly analyze images for content, solve visual puzzles, perform visual text recognition, pass visual IQ tests, and understand natural language instructions. The researchers believe multimodal AI, which integrates different modes of input such as text, audio, images, and video, is a key step toward building artificial general intelligence (AGI) that can perform general tasks at the level of a human.

"Being a basic part of intelligence, multimodal perception is a necessity to achieve artificial general intelligence, in terms of knowledge acquisition and grounding to the real world," the researchers write in their academic paper, "Language Is Not All You Need: Aligning Perception with Language Models."

Visual examples from the Kosmos-1 paper show the model analyzing images and answering questions about them, reading text from an image, writing captions for images, and taking a visual IQ test with 22-26 percent accuracy.

