I tested out a buzzy new text-to-video AI model from China

Zeyi Yang

from MIT Technology Review on 2024-06-19 09:00 (#6NMGE)

This story first appeared in China Report, MIT Technology Review's newsletter about technology in China. Sign up to receive it in your inbox every Tuesday.

You may not be familiar with Kuaishou, but this Chinese company just hit a major milestone: It's released the first text-to-video generative AI model that's freely available for the public to test.

The short-video platform, which has over 600 million active users, announced the new tool on June 6. It's called Kling. Like OpenAI's Sora model, Kling is able to generate videos "up to two minutes long with a frame rate of 30fps and video resolution up to 1080p," the company says on its website.

But unlike Sora, which still remains inaccessible to the public four months after OpenAI trialed it, Kling soon started letting people try the model themselves.

I was one of them. I got access to it after downloading Kuaishou's video-editing tool, signing up with a Chinese number, getting on a waitlist, and filling out an additional form through Kuaishou's user feedback groups. The model can't process prompts written entirely in English, but you can get around that by either translating the phrase you want to use into Chinese or including one or two Chinese words.

So, first things first. Here are a few results I generated with Kling to show you what it's like. Remember Sora's impressive demo video of Tokyo's street scenes or the cat darting through a garden? Here are Kling's takes:

Prompt: Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.

ZEYI YANG/MIT TECHNOLOGY REVIEW | KLING

Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

ZEYI YANG/MIT TECHNOLOGY REVIEW | KLING

Prompt: A white and orange tabby cat is seen happily darting through a dense garden, as if chasing something. Its eyes are wide and happy as it jogs forward, scanning the branches, flowers, and leaves as it walks. The path is narrow as it makes its way between all the plants. The scene is captured from a ground-level angle, following the cat closely, giving a low and intimate perspective. The image is cinematic with warm tones and a grainy texture. The scattered daylight between the leaves and plants above creates a warm contrast, accentuating the cat's orange fur. The shot is clear and sharp, with a shallow depth of field.

ZEYI YANG/MIT TECHNOLOGY REVIEW | KLING

Remember the image of Dall-E's horse-riding astronaut? I asked Kling to generate a video version too.

Prompt: An astronaut riding a horse in space.

ZEYI YANG/MIT TECHNOLOGY REVIEW | KLING

There are a few things worth applauding here. None of these videos deviates from the prompt much, and the physics seem right: the panning of the camera, the ruffling leaves, and the way the horse and astronaut turn, showing Earth behind them. The generation process took around three minutes for each of them. Not the fastest, but totally acceptable.

But there are obvious shortcomings, too. The videos, while 720p in format, seem blurry and grainy; sometimes Kling ignores a major request in the prompt; and most important, all videos generated now are capped at five seconds long, which makes them far less dynamic or complex.

However, it's not really fair to compare these results with things like Sora's demos, which are hand-picked by OpenAI to release to the public and probably represent better-than-average results. These Kling videos are from the first attempts I had with each prompt, and I rarely included prompt-engineering keywords like "8k, photorealism" to fine-tune the results.

If you want to see more Kling-generated videos, check out this handy collection put together by an open-source AI community in China, which includes both impressive results and all kinds of failures.

Kling's general capabilities are good enough, says Guizang, an AI artist in Beijing who has been testing out the model since its release and has compiled a series of direct comparisons between Sora and Kling. Kling's disadvantage lies in the aesthetics of the results, he says, like the composition or the color grading. "But that's not a big issue. That can be fixed quickly," Guizang, who wished to be identified only by his online alias, tells MIT Technology Review.

"The core capability of a model is in how it simulates physics and real natural environments," and he says Kling does well in that regard.

Kling works in a similar way to Sora: it combines the diffusion models traditionally used in video-generation AIs with a transformer architecture, which helps it understand larger video data files and generate results more efficiently.

But Kling may have a key advantage over Sora: Kuaishou, the most prominent rival to Douyin in China, has a massive video platform with hundreds of millions of users who have collectively uploaded an incredibly big trove of video data that could be used to train it. Kuaishou told MIT Technology Review in a statement that "Kling uses publicly available data from the global internet for model training, in accordance with industry standards." However, the company didn't elaborate on the specifics of the training data (neither did OpenAI about Sora, which has led to concerns about intellectual-property protections).

After testing the model, I feel the biggest limitation to Kling's usefulness is that it only generates five-second-long videos.

"The longer a video is, the more likely it will hallucinate or generate inconsistent results," says Shen Yang, a professor studying AI and media at Tsinghua University in Beijing. That limitation means the technology will leave a larger impact on the short-video industry than it does on the movie industry, he says.

Short, vertical videos (those designed for viewing on phones) usually grab the attention of viewers in a few seconds. Shen says Chinese TikTok-like platforms often assess whether a video is successful by how many people would watch through the first three or five seconds before they scroll away, so an AI-generated high-quality video clip that's just five seconds long could be a game-changer for short-video creators.

Guizang agrees that AI could disrupt the content-creating scene for short-form videos. It will benefit creators in the short term as a productivity tool; but in the long run, he worries that platforms like Kuaishou and Douyin could take over the production of videos and directly generate content customized for users, reducing the platforms' reliance on star creators.

It might still take quite some time for the technology to advance to that level, but the field of text-to-video tools is getting much more buzzy now. One week after Kling's release, a California-based startup called Luma AI also released a similar model for public usage. Runway, a celebrity startup in video generation, has teased a significant update that will make its model much more powerful. ByteDance, Kuaishou's biggest rival, is also reportedly working on the release of its generative video tool soon. "By the end of this year, we will have a lot of options available to us," Guizang says.

I asked Kling to generate what society looks like when "anyone can quickly generate a video clip based on their own needs." And here's what it gave me. Impressive hands, but you didn't answer the question. Sorry.

Prompt: With the release of Kuaishou's Kling model, the barrier to entry for creating short videos has been lowered, resulting in significant impacts on the short-video industry. Anyone can quickly generate a video clip based on their own needs. Please show what the society will look like at that time.

ZEYI YANG/MIT TECHNOLOGY REVIEW | KLING

Do you have a prompt you want to see generated with Kling? Send it to zeyi@technologyreview.com and I'll send you back the result. The prompt has to be less than 200 characters long, and preferably written in Chinese.

Now read the rest of China Report

Catch up with China

1. A new investigation revealed that the US military secretly ran a campaign to post anti-vaccine propaganda on social media in 2020 and 2021, aiming to sow distrust in the Chinese-made covid vaccines in Southeast Asian countries. (Reuters $)

2. A Chinese court sentenced Huang Xueqin, the journalist who helped launch the #MeToo movement in China, to five years in prison for "inciting subversion of state power." (Washington Post $)

3. A Shein executive said the company's corporate values basically make it an American company, but the company is now trying to hide that remark to avoid upsetting Beijing. (Financial Times $)

4. China is getting close to building the world's largest particle collider, potentially starting in 2027. (Nature)

5. To retaliate for the European Union's raising tariffs on electric vehicles, the Chinese government has opened an investigation into allegedly unfair subsidies for Europe's pork exports. (New York Times $)

On a related note about food: China's exploding demand for durian fruit in recent years has created a $6 billion business in Southeast Asia, leading some farmers to cut down jungles and coffee plants to make way for durian plantations. (New York Times $)

Lost in translation

In 2012, Jiumei, a Chinese woman in her 20s, began selling a service where she sends "good night" text messages to people online at the price of 1 RMB per text (that's about $0.14).

Twelve years, three mobile phones, four different numbers, and over 50,000 messages later, she's still doing it, according to the Chinese online publication Personage. Some of her clients are buying the service for themselves, hoping to talk to someone regularly at their most lonely or desperate times. Others are buying it to send anonymous messages to a friend going through a hard time or to an ex-lover who has cut off communications.

The business isn't very profitable. Jiumei earns around 3,000 RMB ($410) annually from it on top of her day job, and even less in recent years. But she's persisted because the act of sending these messages has become a nightly ritual, not just for her customers but also for Jiumei herself, offering her solace in her own times of loneliness and hardship.

One more thing

Globally, Kuaishou has been much less successful than its nemesis ByteDance, except in one country: Brazil. Kwai, the overseas version of Kuaishou, has been so popular in Brazil that even the Marubo people, a tribal group in the remote Amazonian rainforests and one of the last communities to be connected online, have begun using the app, according to the New York Times.
