Microsoft Unveils AI Model That Understands Image Content, Solves Visual Puzzles

msmash

from Slashdot on 2023-03-03 16:40 (#69E5M)

Researchers from Microsoft have introduced Kosmos-1, a multimodal model that can reportedly analyze images for content, solve visual puzzles, perform visual text recognition, pass visual IQ tests, and understand natural language instructions. From a report: The researchers believe multimodal AI -- which integrates different modes of input such as text, audio, images, and video -- is a key step to building artificial general intelligence (AGI) that can perform general tasks at the level of a human. "Being a basic part of intelligence, multimodal perception is a necessity to achieve artificial general intelligence, in terms of knowledge acquisition and grounding to the real world," the researchers write in their academic paper, Language Is Not All You Need: Aligning Perception with Language Models. Visual examples from the Kosmos-1 paper show the model analyzing images and answering questions about them, reading text from an image, writing captions for images, and taking a visual IQ test with 22 to 26 percent accuracy. [...] In this case, Kosmos-1 appears to be purely a Microsoft project, without OpenAI's involvement. The researchers call their creation a "multimodal large language model" (MLLM) because its roots lie in natural language processing, like a text-only LLM, such as ChatGPT. And it shows: For Kosmos-1 to accept image input, the researchers must first translate the image into a special series of tokens (basically text) that the LLM can understand.
