[GamePea Exclusive – Reproduction Prohibited!] GamePea reports: Over the past few years, one public statement by Cai Haoyu has been quoted again and again: miHoYo's goal is to create, by 2030, a virtual world that one billion people around the globe would be willing to live in. When the statement was first made, most people dismissed it as a game company's grand ambition, something that sounded even more epic than the metaverse concept but was just as unmoored from reality. miHoYo was at its peak then; people remembered its revenue and its global footprint, not Cai Haoyu's talk of a 'billion people.'
But if you think seriously about it, what is the first problem that must be solved to build a virtual world capable of housing a billion people? It's not how beautiful the scenery is, not how vast the map is, and not even how complex the story is—it's whether the people inside it are alive. A city can be digitally modeled, a forest can be algorithmically generated, but if a player entering this world is greeted only by a stiff model reciting the same fixed lines from RPGs of decades past, with lip movements and emotions completely disconnected—then that 'world' is essentially no different from a beautiful wallpaper. Characters must be alive; this is the fundamental issue of life and death for a virtual world.
On April 9, 2026, members of the team at Anuttacon, the AI company founded by Cai Haoyu, published a paper under their personal names on the preprint platform arXiv and simultaneously launched a project homepage. The paper is titled LPM 1.0: Video-based Character Performance Model, where LPM stands for Large Performance Model, and it lists 24 authors and contributors. Judging from the paper's demos, the technology has already been applied in Anuttacon's previously released game Whispers from the Star. Paper link: https://arxiv.org/abs/2604.07823. Project page: https://large-performance-model.github.io/.
This is not an ordinary video generation model. Its goal is to make a static character image speak, listen attentively, furrow its brow slightly, curl its lips upward—and keep doing so without breaking, stiffening, or stopping until you no longer want to talk. If Cai Haoyu's 'billion-person virtual world' is the grounded destination in an era of 'AI chaos,' then in GamePea's view, LPM 1.0 solves the checkpoint called 'whether the character is alive.' And this checkpoint is far harder than it looks. The model is currently closed-source.
The Performance Trilemma: A Knot AI Video Hasn't Untied for Years
To understand what LPM 1.0 actually does, we first need to understand a long-standing tension in AI video generation, which the research team names the 'Performance Trilemma.' Simply put, all current video generation models make painful trade-offs among three core capabilities: expressiveness, real-time performance, and long-term stability. To gain expressiveness, you often sacrifice speed; to gain speed, characters become increasingly stiff; to gain stability, characters lose their liveliness. LPM 1.0 attempts to solve all three at once.
The paper devotes considerable space to describing the data construction process. They collected a large-scale corpus of human video covering multiple scenarios, which underwent four stages of filtering: single-shot extraction, quality filtering and cropping, dialogue state recognition and segmentation, and caption and embedding generation. After this pipeline, the retention rate of raw video was less than 10%. One of the most interesting aspects is the data for 'listening.' Most similar models only learn 'speaking' because speaking videos are abundant on the internet. But the subtle reactions of a person attentively listening to another—nodding, slight eye movement, lip twitch, brief frown—are extremely scarce, accounting for less than 10% of natural video. The team specifically built a three-class (speaking/listening/silent) annotation system, supplemented by the Qwen3-Omni model for semantic verification, achieving an F1 score that surpassed results using Gemini 2.5 Pro directly.
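To make the three-class labeling concrete, here is a minimal sketch of how per-frame dialogue states could be assigned and merged into clips. It is purely illustrative rather than the paper's actual pipeline; the voice-activity signals and the 24-frame minimum clip length are assumptions.

```python
from enum import Enum

class DialogueState(Enum):
    SPEAKING = "speaking"
    LISTENING = "listening"
    SILENT = "silent"

def label_frame(on_screen_voice: bool, off_screen_voice: bool) -> DialogueState:
    """Assign a coarse dialogue state to a single frame.
    on_screen_voice: the visible character's own speech is detected.
    off_screen_voice: another party's speech is detected."""
    if on_screen_voice:
        return DialogueState.SPEAKING    # the character is talking
    if off_screen_voice:
        return DialogueState.LISTENING   # someone else talks; the character reacts
    return DialogueState.SILENT          # nobody is talking

def segment_by_state(frame_labels: list[DialogueState], min_len: int = 24):
    """Merge consecutive frames sharing a state into clips, dropping runs shorter than min_len."""
    segments, start = [], 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            if i - start >= min_len:
                segments.append((start, i, frame_labels[start]))
            start = i
    return segments
```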
Another highlight is the 'multi-granularity identity reference image' system. Previous models only accepted a single frontal face photo as character reference, causing the model to 'guess' what the side profile looks like when the character turns or lowers its head. LPM 1.0's solution provides three types of references simultaneously: an overall appearance reference image, multi-angle body perspective reference images (up to four directions), and a set of facial reference images covering different expressions (up to eight). This essentially tells the model: 'This is what the person looks like from the front, side, smiling, and frowning.'
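As a rough illustration of what such a reference bundle might look like on the data side (the field names and array types below are our assumptions, not the paper's interface):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class IdentityReference:
    """Bundle of identity references supplied alongside the audio and text conditions."""
    appearance: np.ndarray                                            # one overall appearance image
    body_views: list[np.ndarray] = field(default_factory=list)       # up to four viewing directions
    face_expressions: list[np.ndarray] = field(default_factory=list) # up to eight expressions

    def __post_init__(self):
        if len(self.body_views) > 4:
            raise ValueError("at most four body-view references")
        if len(self.face_expressions) > 8:
            raise ValueError("at most eight facial-expression references")
```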
The model architecture is built on DiT (Diffusion Transformer), using Wan2.1-I2V (14B) as the pre-training base with roughly 3B additional parameters, for about 17 billion parameters in total. The core architectural innovation is the 'interleaved dual audio injection' design. Traditional talking-head video generation processes only one audio stream: the character's own voice. LPM 1.0 injects 'speaking audio' into the even layers of the Transformer and 'listening audio' into the odd layers, allowing the two audio streams to influence video generation at different levels without doubling the parameter count. Speaking audio drives high-frequency mouth movements and rhythmic body motions, while listening audio is associated with lower-frequency expression shifts and body posture changes.
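A minimal PyTorch sketch of the interleaved routing, assuming a generic transformer block with one audio cross-attention per layer. The dimensions, head count, and depth are made up, and this is an illustration of the idea rather than the actual Wan2.1-based architecture:

```python
import torch
import torch.nn as nn

class AudioCrossAttnBlock(nn.Module):
    """One transformer block with cross-attention to a single audio stream."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, audio_tokens):
        x = video_tokens + self.self_attn(video_tokens, video_tokens, video_tokens)[0]
        x = x + self.audio_attn(x, audio_tokens, audio_tokens)[0]  # inject this layer's audio stream
        return x + self.mlp(x)

class InterleavedDualAudioDiT(nn.Module):
    """Even-indexed layers attend to the speaking stream, odd-indexed layers to the listening stream."""
    def __init__(self, dim: int = 512, depth: int = 12):
        super().__init__()
        self.blocks = nn.ModuleList([AudioCrossAttnBlock(dim) for _ in range(depth)])

    def forward(self, video_tokens, speak_audio_tokens, listen_audio_tokens):
        for i, block in enumerate(self.blocks):
            audio = speak_audio_tokens if i % 2 == 0 else listen_audio_tokens
            video_tokens = block(video_tokens, audio)
        return video_tokens
```

Because each layer carries only one audio cross-attention, routing two streams across alternating layers adds no extra parameters compared with a single-stream model of the same depth, which is the point the paper emphasizes.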
For temporal alignment, speaking audio uses a local time window: each video frame only 'sees' the audio segment corresponding to its own moment in time, ensuring precise lip-sync. Listening audio uses a wider window, because a listener's reaction often reflects semantic understanding of an entire utterance rather than an immediate response to a single frame's sound wave. For identity stability, the multiple reference images are encoded as tokens and concatenated directly into the video token sequence for self-attention. By assigning different offsets in the 3D rotary position embedding (RoPE) to different types of reference images, the model implicitly distinguishes 'this is a frontal expression reference' from 'this is a side-body action reference.' The approach introduces no additional learnable parameters, yet it continuously anchors the character's appearance to the reference images during infinite-length generation.
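To illustrate the asymmetric time windows, here is a small helper that computes which audio tokens a given video frame may attend to. The real model would apply this as an attention mask inside cross-attention, and the token ratio and window widths below are invented for the example:

```python
def audio_window_for_frame(frame_idx: int, n_audio_tokens: int,
                           tokens_per_frame: int, half_width: int):
    """Return the [start, end) range of audio tokens one video frame may attend to.

    half_width is small for the speaking stream (tight lip-sync) and large for the
    listening stream (reactions depend on a longer stretch of the other speaker's audio)."""
    center = frame_idx * tokens_per_frame
    start = max(0, center - half_width)
    end = min(n_audio_tokens, center + tokens_per_frame + half_width)
    return start, end

# Hypothetical numbers: 2 audio tokens per video frame.
speak_window  = audio_window_for_frame(frame_idx=10, n_audio_tokens=200,
                                        tokens_per_frame=2, half_width=2)   # tight window
listen_window = audio_window_for_frame(frame_idx=10, n_audio_tokens=200,
                                        tokens_per_frame=2, half_width=48)  # wide window
print(speak_window, listen_window)  # (18, 24) (0, 70)
```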
During training, the team also used Direct Preference Optimization (DPO) for post-training alignment—generating a batch of candidate videos, labeling which are better, then fine-tuning the model with preference data to specifically correct visual quality defects like hand deformation and limb distortion.
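For reference, the standard DPO objective on a preferred/rejected pair has the shape sketched below. The diffusion-model variant used for video replaces exact log-likelihoods with denoising-loss surrogates, so treat this as the general form rather than the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """Standard DPO objective for a preferred (w) / rejected (l) pair.

    logp_*     : log-likelihoods of each sample under the model being tuned
    ref_logp_* : the same quantities under the frozen reference model"""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage with scalar log-probs for two preference pairs
loss = dpo_loss(torch.tensor([-3.0, -2.5]), torch.tensor([-4.0, -3.8]),
                torch.tensor([-3.2, -2.9]), torch.tensor([-3.9, -3.5]))
```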
Base LPM vs. Online LPM: From Offline to Real-Time
Base LPM solves generation quality but is essentially an offline model. True real-time conversation requires the system to start generating visual responses while the user is still speaking—this is what Online LPM addresses. Online LPM uses knowledge distillation to transfer Base LPM's capabilities to a causal streaming architecture, achieving block-by-block autoregressive generation of 1-second video chunks. The architecture is split into two parts: a Backbone responsible for maintaining stable temporal trajectories (2 denoising steps) and a Refiner that recovers high-frequency details on top of that (1 denoising step). Training proceeds in four stages: ODE supervision warm-up → off-policy distribution matching distillation → on-policy DMD (letting the model learn error correction from its own generated videos) → refinement network distillation. The core goal of this curriculum training is to keep the trajectory stable even as Online LPM accumulates errors during autoregressive generation.
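A schematic of the block-wise loop described above; every interface here (backbone.denoise, refiner.denoise, and so on) is a hypothetical placeholder, but the step counts follow the paper: two backbone denoising steps plus one refiner step per 1-second block.

```python
def generate_stream(backbone, refiner, vae_decode, condition_stream, n_blocks: int):
    """Sketch of block-wise autoregressive generation: each 1-second latent block is
    denoised twice by the backbone and once by the refiner, conditioned on the KV cache
    of everything generated so far."""
    kv_cache = backbone.new_cache()
    for _ in range(n_blocks):
        cond = next(condition_stream)              # audio / reference conditions for this second
        latent = backbone.init_noise()             # fresh noise for the new block
        for _ in range(2):                         # 2 backbone steps keep the trajectory stable
            latent = backbone.denoise(latent, cond, kv_cache)
        latent = refiner.denoise(latent, cond, kv_cache)  # 1 step restores high-frequency detail
        kv_cache = backbone.update_cache(kv_cache, latent)
        yield vae_decode(latent)                   # roughly 24 frames of video per block
```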
In terms of inference efficiency, the system generates each 1-second video block in approximately 700ms on a single GPU, plus roughly 180ms for VAE decoding (about 880ms of compute per second of video, comfortably inside the real-time budget), sustaining output at 24fps. A sliding-window KV cache and an 'attention sink' design let the system support theoretically infinite-length video generation within limited memory, with no increase in per-step computation as the conversation extends.
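The sliding-window-plus-sink idea can be sketched as a simple cache eviction rule. This is a generic illustration of the technique with made-up sizes, not Anuttacon's implementation:

```python
import torch

def evict_kv(keys: torch.Tensor, values: torch.Tensor, n_sink: int, window: int):
    """Keep the first n_sink 'attention sink' tokens plus the most recent `window` tokens,
    so per-step attention cost stays bounded no matter how long generation runs."""
    total = keys.shape[1]                          # shape: (batch, seq_len, dim)
    if total <= n_sink + window:
        return keys, values
    keep = list(range(n_sink)) + list(range(total - window, total))
    return keys[:, keep], values[:, keep]

# Toy usage: after 1,000 cached tokens, retain 4 sink tokens plus the latest 512.
k, v = torch.randn(1, 1000, 64), torch.randn(1, 1000, 64)
k, v = evict_kv(k, v, n_sink=4, window=512)        # new cached length: 516
```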
LPM-Bench: A New Benchmark for Character Performance
Existing video generation evaluation benchmarks mostly focus on video quality, with almost no dedicated evaluation dimensions for 'character performance in dialogue.' The Anuttacon team built LPM-Bench, containing 1,000 test samples covering five scenarios: speaking (approximately 400), listening (approximately 200), dialogue (approximately 200), diverse actions (approximately 100), and character generalization (approximately 100). The test covers 78 emotions and over 5,000 action descriptors. In comparisons with state-of-the-art models, Base LPM at 720P resolution achieved a 64.3% win rate in overall human preference against Kling-Avatar-2, and 42.5% against OmniHuman-1.5. Online LPM at 480P resolution achieved an 82.5% win rate against LiveAvatar and 64.1% against SoulX.
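For context, the 'win rate' in such pairwise human evaluations is typically computed as below; whether ties count as half a win is a convention that varies from paper to paper, so this is only illustrative:

```python
def win_rate(preferences: list[str]) -> float:
    """Fraction of pairwise comparisons where our model's clip was preferred,
    counting ties as half a win (one common convention)."""
    score = sum(1.0 if p == "ours" else 0.5 if p == "tie" else 0.0 for p in preferences)
    return score / len(preferences)

print(win_rate(["ours", "ours", "tie", "baseline"]))  # 0.625
```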
Strong Competitors and Unavoidable Questions
Anuttacon's release has put a company with almost no prior public technical output under the spotlight, and comparisons with other players in this track quickly followed. The competition is formidable. The video character generation track has absorbed massive funding and talent over the past two years. Not long ago, Bilibili launched a new AI video generation product called updream. Meanwhile, ByteDance's Seedance has advanced to film-grade short video production, capable of multi-character, multi-shot generation from text to finished video, with commercial applications emerging in live-action short drama production; it is reported that AI-generated short dramas now account for over 30% of total short drama views in China. Kuaishou's Kling has moved from general video generation to its Avatar series of digital-human applications, with Kling itself serving as a platform entry point that captures user demand.
On the academic and open-source side, Alibaba's ATH recently released an open-source character performance model called HappyHorse-1.0, which quickly topped the relevant benchmarks after launch and drew significant community attention: a signal that the technical barrier in this field is dropping fast while open-source momentum gathers. Overseas, although OpenAI has paused related services for Sora, it has accumulated foundational video generation capabilities, and Google's investment in the Veo series should not be underestimated.
Compared to these players, Anuttacon is a newcomer, but this release also marks its emergence as a proper technology company rather than just an AI department inside a game company. It has a genuine research team, a complete technical paper, and a systematic benchmark evaluation. In this niche track, it now stands at the same starting line as the top commercial products. However, technical merits aside, Anuttacon faces plenty of challenges ahead.
One structural issue is unavoidable: ByteDance has Douyin, Kuaishou has Kuaishou—these two platforms hold China's largest short video content consumption entry points. Virtual anchors, AI digital human live streaming, and AI-generated short dramas—these scenarios most likely for large-scale commercialization—will naturally be absorbed into their own ecosystems first. Why would a creator or live-streaming team choose a third-party character generation tool over built-in platform features? This question has no easy answer, unless the third party can achieve a decisive quality gap without being too expensive.
A deeper dilemma is that companies like ByteDance can pour almost unlimited money into model training. Training a single version of a mainstream large model costs hundreds of millions of dollars and requires '10,000-GPU clusters' to run, and iteration is only getting faster. Anuttacon is currently a small team of fewer than 50 people. In GamePea's view, its size makes it unlikely to pursue a 'burn cash recklessly' path.
Possible Paths for LPM 1.0
So what is the future of LPM 1.0? In GamePea's view, there are several possible paths. The first, and currently the clearest signal: technical accumulation to serve its own projects. Anuttacon's first game, Whispers from the Star, is available on Steam, and the AI chat product AnuNeko has been launched. What these products need is precisely a visual engine that makes characters 'come alive.' LPM 1.0 is less a product for external sale and more a report card of the company's technical capabilities. The paper itself explicitly states that the model weights are not open-sourced, no API is provided, and it is not for external commercial use.
The second path is to build vertical tool applications on top of this underlying capability. Game NPCs, virtual companionship, AI instructors: these demands are real, and serving them is indispensable infrastructure for that 'billion-person virtual world.' But the road from technical demonstration to productization is long: finding highly aligned users, co-developing the product, building sales and operations systems, and closing the commercial loop. That distance is far greater than the numbers in the paper suggest.
The third path is to bring in external capital. Once financing begins, Anuttacon would have more fuel in this capital-intensive AI track, but it would also be setting out on a one-way road toward capitalization: bound by capital logic and required to grow fast enough to support its valuation. Currently, Anuttacon has no public record of external financing, and whether Cai Haoyu wants to take this step remains unclear. Notably, Anuttacon's status as a non-public company is a substantive handicap in attracting top AI researchers. In this circle, high salaries and large option packages are the norm, and the uncertain liquidity of a non-public company's options makes top talent hesitate. If Anuttacon wants to keep expanding its technical reserves, it will eventually need to address this, whether through an IPO, financing, or other means.
Of course, when discussing AI, especially content tools, an unavoidable issue is the consumer market's acceptance of AI-generated content. Survey after survey has shown that acceptance remains limited: AI digital humans may look very real, but are users willing to engage with them for long stretches? Willingness to pay is generally low. Take a longer view, though, and in Cai Haoyu's 'billion-person virtual world,' AI characters are not optional but necessary. That world cannot meet effectively unlimited demand for character interaction with finite human labor; when 100,000 people talk to the same NPC at the same time, no human can handle it. AI is the only solution, and LPM 1.0 solves the most core problem in this scenario: character performance. Consumer acceptance of AI content will rise eventually, but only once the technology is ready. From this perspective, what Anuttacon is doing now is stockpiling ammunition for a market that has not yet matured.
Community Reaction: 'These Guys Really Know Their Stuff'
After the release of LPM 1.0, community reactions can be roughly divided into two waves. The first wave was surprise at the model's effectiveness. In the demo video, a static image of a black male actor was driven into an intense emotional performance—furrowed brows, trembling lips, fingers pointing forcefully at the audience, emotions shifting from anger to grievance—paired with original dialogue, the naturalness exceeded many people's expectations. The second wave was a calm analysis from the tech community. On Zhihu, multiple users with AI research backgrounds interpreted the paper, with relatively restrained reactions. A consensus emerged: this team is genuine, not just name-dropping. 'Infinite duration + real-time interaction is insane. The ceiling of this thing might be higher than Seedance (because Seedance doesn't know how to commercialize yet, but real-time interaction can be directly implemented—don't forget miHoYo makes games). And considering the market cap difference between miHoYo and ByteDance might be two orders of magnitude.'
The reaction from the gaming community was more emotional. In some miHoYo-related forums and discussion groups, this event was interpreted as 'Boss Cai is cooking up something big,' implying that the world of Genshin Impact might one day truly 'come alive.' If you only look at LPM 1.0 itself, it is a technical demonstration, an academic paper, a business card staking a position in the AI video generation field. But if you place it in the context of Cai Haoyu's '2030, billion-person virtual world,' it is a checkpoint solved on the road to that goal. Not the destination, but a passing point. Where does the vitality of a virtual world come from? From living characters. What LPM 1.0 does is make that face speak, listen, and truly come alive. The road is still long—eighty-one trials, and this is only the first. But at least for this trial, there is now a decent answer.
Tags: AI, Anuttacon, miHoYo, Cai Haoyu