Google Unveils 'Imagen' Text-To-Image Diffusion Model, Claims It's Better Than DALL-E 2
An anonymous reader quotes a report from TechCrunch: The AI world is still figuring out how to deal with the amazing show of prowess that is DALL-E 2's ability to draw/paint/imagine just about anything, but OpenAI isn't the only one working on something like that. Google Research has rushed to publicize a similar model it has been working on -- which it claims is even better. [...] Imagen starts by generating a small (64x64 pixel) image and then performs two "super resolution" passes on it to bring it up to 1024x1024. This isn't like normal upscaling, though: AI super-resolution synthesizes new details in harmony with the smaller image, using the original as a basis. Google's researchers claim several advances with Imagen. They say that existing text models can be used for the text-encoding portion, and that the quality of that encoding matters more than simply increasing visual fidelity. That makes sense intuitively, since a detailed picture of nonsense is definitely worse than a slightly less detailed picture of exactly what you asked for. For instance, in the paper describing Imagen (PDF), they compare results for it and DALL-E 2 on the prompt "a panda making latte art." In all of the latter's images, it's latte art of a panda; in most of Imagen's, it's a panda making the art. (Neither was able to render a horse riding an astronaut, showing the opposite in all attempts. It's a work in progress.) In Google's human-evaluation tests, Imagen came out ahead on both accuracy and fidelity. This is obviously quite subjective, but to even match the perceived quality of DALL-E 2, which until today was considered a huge leap ahead of everything else, is pretty impressive. I'll only add that while it's pretty good, none of these images (from any generator) will withstand more than cursory scrutiny before people notice they're generated or develop serious suspicions. OpenAI is a step or two ahead of Google in a couple of ways, though.
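The cascade described above (a small base image refined by successive super-resolution stages) can be sketched roughly as follows. This is a minimal illustration, not Imagen's actual implementation: the function names are hypothetical, the base model is a random placeholder, and nearest-neighbour upsampling stands in for the text-conditioned super-resolution diffusion models, which synthesize genuinely new detail rather than copying pixels.

```python
import numpy as np

def base_model(prompt, size=64):
    # Hypothetical stand-in for the text-conditioned base diffusion model.
    # Here it just returns random pixels; the real model denoises from
    # noise while conditioned on the text embedding of `prompt`.
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random((size, size, 3))

def super_resolution(image, factor):
    # Hypothetical stand-in for one super-resolution diffusion stage.
    # Nearest-neighbour repetition only changes the resolution; the real
    # stage generates new detail consistent with the low-res input.
    return image.repeat(factor, axis=0).repeat(factor, axis=1)

def generate(prompt):
    img = base_model(prompt)        # 64x64 base sample
    img = super_resolution(img, 4)  # first pass: 64 -> 256
    img = super_resolution(img, 4)  # second pass: 256 -> 1024
    return img

image = generate("a panda making latte art")
print(image.shape)  # (1024, 1024, 3)
```

The point of the cascade is that each stage works at a tractable resolution: the base model only has to get the semantics right at 64x64, and the super-resolution stages handle fidelity.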
DALL-E 2 is more than a research paper; it's a private beta with people actually using it, just as they used its predecessor, GPT-2, and GPT-3. Ironically, the company with "open" in its name has focused on productizing its text-to-image research, while the fabulously profitable internet giant has yet to attempt it.