Introducing AKA-1: Akapulu's Flagship Real-Time Avatar Model

Today we're introducing AKA-1, Akapulu's flagship real-time avatar model.

AKA-1 converts text into a synchronized, photorealistic talking avatar fast enough to sustain live, two-way conversation.

Getting there is harder than it looks. Real-time audio-driven avatar synthesis is one of the more punishing problems in applied generative AI, and most systems that look good in a demo reel quietly fail the moment they have to run live, uninterrupted, for the length of a real conversation.

Below is a tour of where these systems break, and what AKA-1 does differently.

Diffusion: high fidelity, and way too slow

End-to-end video diffusion models produce the highest visual fidelity available today, but they generate pixels through hundreds of iterative denoising steps, each one a full forward pass over a high-dimensional spatiotemporal latent.

The cost scales with the number of sampling steps times the size of that latent grid, and with transformer backbones, attention scales quadratically in the number of spatial-temporal tokens.
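As a rough back-of-the-envelope illustration of that scaling, the sketch below models cost as sampling steps times a quadratic attention term. The function name, the step count, and the token counts are all hypothetical values chosen to show the shape of the curve; they are not measurements of any real model.

```python
def relative_attention_cost(steps: int, tokens: int) -> int:
    """Unitless cost proxy: sampling steps times quadratic self-attention over tokens."""
    return steps * tokens * tokens


# Hypothetical workloads, chosen only to illustrate the scaling.
baseline = relative_attention_cost(steps=100, tokens=16_000)
half_tokens = relative_attention_cost(steps=100, tokens=8_000)
half_steps = relative_attention_cost(steps=50, tokens=16_000)

print(baseline / half_tokens)  # 4.0: halving the tokens cuts the cost 4x
print(baseline / half_steps)   # 2.0: halving the steps cuts the cost only 2x
```

Under this simple model, shrinking what is being diffused pays off faster than trimming steps.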

Producing a few seconds of video can take tens of seconds of compute, which is a non-starter for conversational AI, where responses typically need to arrive within 1 second.

The naive fixes, smaller models or fewer sampling steps, trade away exactly the realism that made diffusion worth using.

AKA-1 sidesteps the cost curve by pushing diffusion to its extreme: we do not diffuse over pixels, or even over spatial latent patches, but over a compact vector of facial control parameters.

We first factor identity apart from dynamics, locking the speaker's appearance into a fixed reference. We then run the denoising process, conditioned on the incoming audio, directly on the low-dimensional kinematic coefficients that govern the face: the expression, head pose, and articulation values that describe how it should move and speak.

Because appearance is held constant, the generator only has to solve the audio-to-dynamics problem rather than re-synthesize the entire face at every step.

A dedicated synthesis network then expands those coefficients back into full-resolution pixels in a single feed-forward pass, deforming and rendering the reference appearance according to the predicted dynamics.

The diffusion target is a few hundred parameters per frame instead of the millions of values in a spatial latent, which collapses the per-step cost by orders of magnitude and lets us run very few sampling steps with no visible quality loss.

Pixel formation is amortized into one render pass rather than folded into the iterative loop.

That separation, cheap iterative diffusion on dynamics plus a single-shot render, is the core reason AKA-1 holds a real-time frame budget while keeping fidelity high, with a time to first frame of 150 ms from audio in to video out.
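The control flow of that split can be sketched in a few lines. This is a toy under stated assumptions, not AKA-1's implementation: the denoiser is a hand-written stand-in for a learned network, and the names (toy_denoise_step, sample_coefficients, render), the 4-value coefficient vector, and the step count are all hypothetical. The only point is that the loop touches a small vector while the renderer runs once.

```python
import random
from typing import List, Sequence

COEFF_DIM = 4  # toy size; the article describes a few hundred values per frame
NUM_STEPS = 4  # hypothetical stand-in for "very few sampling steps"


def toy_denoise_step(x: Sequence[float], audio_feature: float, step: int) -> List[float]:
    """Stand-in for a learned denoiser: pull x toward an audio-dependent target."""
    target = [audio_feature * (i + 1) for i in range(len(x))]
    blend = 1.0 / (NUM_STEPS - step)  # reaches 1.0 on the final step
    return [xi + blend * (ti - xi) for xi, ti in zip(x, target)]


def sample_coefficients(audio_feature: float, rng: random.Random) -> List[float]:
    """Iterative loop: runs only over the small coefficient vector."""
    x = [rng.gauss(0.0, 1.0) for _ in range(COEFF_DIM)]
    for step in range(NUM_STEPS):
        x = toy_denoise_step(x, audio_feature, step)
    return x


def render(reference: Sequence[float], coeffs: Sequence[float]) -> List[float]:
    """Single feed-forward pass: deform the fixed reference by the coefficients."""
    offset = sum(coeffs) / len(coeffs)
    return [pixel + offset for pixel in reference]


reference = [0.0, 1.0, 2.0]  # fixed appearance, never modified
coeffs = sample_coefficients(0.5, random.Random(0))
frame = render(reference, coeffs)  # called once per frame, outside the loop
```

Note that render is never called inside the sampling loop, which is the property the paragraph above describes.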

Small errors = big drift

To run live, a model has to generate in a streaming, chunk-by-chunk fashion, emitting short windows of motion as audio arrives and conditioning each new segment on its own past output.

This creates a closed feedback loop: the model's predictions become the context for its next predictions.

Any autoregressive system of this form is subject to exposure bias, the mismatch between the clean ground-truth context it saw during training and the imperfect, self-generated context it sees at inference.

Small approximation errors in early frames are fed back in, and because the mapping is recursive, those errors compound rather than wash out.
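A minimal numerical sketch shows why recursion matters. It assumes a simple linear error model in which each step inherits a multiple (gain) of the previous error and adds a fresh per-step error. The function name, the gain values, and the 0.01 error are illustrative assumptions; the 1,500 steps correspond to one minute at the 25 FPS output rate.

```python
def recursive_error(per_step_error: float, gain: float, steps: int) -> float:
    """Accumulated error when each step conditions on its own previous output."""
    error = 0.0
    for _ in range(steps):
        error = gain * error + per_step_error
    return error


FRAMES_PER_MINUTE = 25 * 60  # one minute of output at 25 FPS

# gain < 1: past errors fade and the total stays bounded.
damped = recursive_error(0.01, gain=0.5, steps=FRAMES_PER_MINUTE)
# gain == 1: every error is carried forward in full, so the total grows linearly.
carried = recursive_error(0.01, gain=1.0, steps=FRAMES_PER_MINUTE)
# gain > 1: errors amplify each other and the total grows exponentially.
amplified = recursive_error(0.01, gain=1.01, steps=FRAMES_PER_MINUTE)

print(damped < carried < amplified)  # True
```

Anchoring each frame to a fixed reference corresponds to removing the recursive term altogether, so the error at any frame is just the per-step error.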

Over a long session this manifests in two ways.

The first is identity drift, where the reconstructed face gradually diverges from the original speaker as residual error in the appearance pathway accumulates across windows.

The second is temporal jitter, where motion loses smoothness at the seams between chunks, producing abrupt, unnatural transitions that the eye reads instantly as fake.

AKA-1 attacks this on two fronts.

First, generation is anchored to a fixed identity reference rather than to the previous frame.

Appearance is pinned to an immutable source representation, so every frame is reconstructed against the true speaker instead of against a degrading copy of itself; there is no recursive appearance path for drift to travel down.

Second, the temporal context that conditions each window is explicitly bounded and overlapped, with boundary-aware blending across chunk seams.

Historical state informs continuity but cannot snowball, because the model never attends arbitrarily far back into its own error history, and the overlapping windows are stitched so motion stays continuous across boundaries.

The result is an avatar that looks the same in minute thirty as it did in second one, with smooth, seam-free motion across unbounded multi-turn dialogue.
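One common way to stitch overlapping windows is a linear cross-fade across the shared samples. The sketch below assumes that technique purely for illustration; the article does not specify which blending AKA-1 uses, and stitch_chunks and its overlap parameter are hypothetical names.

```python
from typing import List


def stitch_chunks(chunks: List[List[float]], overlap: int) -> List[float]:
    """Concatenate motion chunks, linearly cross-fading `overlap` samples at each seam."""
    if not chunks:
        return []
    out = list(chunks[0])
    for chunk in chunks[1:]:
        if overlap > len(chunk) or overlap > len(out):
            raise ValueError("overlap is longer than a chunk")
        start = len(out) - overlap
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # weight of the incoming chunk ramps up
            out[start + i] = (1.0 - w) * out[start + i] + w * chunk[i]
        out.extend(chunk[overlap:])
    return out


# Two chunks that disagree sharply: a hard cut would jump from 0.0 to 1.0.
motion = stitch_chunks([[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]], overlap=2)
# motion ramps 0.0, 0.0, ~0.33, ~0.67, 1.0, 1.0 instead of jumping at the seam
```

Each seam only ever looks at the last few samples of the previous window, which keeps the conditioning context bounded.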

Who you are vs. how you move

Many pipelines drive each new frame from features pulled out of previous frames, which entangles two fundamentally different signals: "what this person looks like" (identity and appearance) and "how they are currently moving" (expression, articulation, pose).

Because both are carried in the same shared representation, the network has no clean way to vary one without perturbing the other.

When those signals bleed together you get flickering between frames, smearing around the mouth as the lips move, and a face that subtly warps as it talks because appearance is being re-estimated from motion-contaminated features every step.

AKA-1 keeps appearance and motion in separate, decoupled representations.

We extract a fixed identity code that captures the speaker's appearance once and holds it constant for the entire session, while a separate generator produces the motion signal independently, conditioned only on audio.

The two are combined at the very end: motion is applied on top of the frozen identity through the renderer, never folded back into it.

Because the appearance representation is immutable, expression and speech can never deform the underlying face, and the motion stream carries no appearance information to leak.

This explicit disentanglement is what produces clean, stable, phoneme-accurate lip-sync and full-face animation (coordinated motion across the jaw, cheeks, eyes, and brow) instead of a moving mouth pasted onto a drifting head.
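At the data-structure level, the separation can be pictured as two types that only meet inside the renderer. The sketch below is an assumption-laden illustration: IdentityCode, MotionFrame, render_frame, and their fields are hypothetical names, and the "renderer" just pairs the two inputs. What it does show is that an immutable identity cannot be altered by anything on the motion side.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Tuple


@dataclass(frozen=True)
class IdentityCode:
    """Appearance, extracted once per session and never written to again."""
    appearance: Tuple[float, ...]


@dataclass(frozen=True)
class MotionFrame:
    """Per-frame dynamics; carries no appearance information."""
    jaw_open: float
    brow_raise: float
    head_yaw: float


def render_frame(identity: IdentityCode, motion: MotionFrame) -> tuple:
    """Combine the two only at the end; the identity is read, never updated."""
    return (identity.appearance, (motion.jaw_open, motion.brow_raise, motion.head_yaw))


identity = IdentityCode(appearance=(0.2, 0.7, 0.1))
frames = [
    render_frame(identity, MotionFrame(jaw_open=j, brow_raise=0.0, head_yaw=0.0))
    for j in (0.0, 0.5, 1.0)
]

try:
    identity.appearance = (9.9,)  # any attempt to fold motion back in fails
    mutated = True
except FrozenInstanceError:
    mutated = False
```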

Voice + video in one pipeline

A real-time conversational avatar breaks down into four sub-stages, chained back to back:

STT (speech-to-text) — transcribing what the user says into text

LLM (reasoning) — deciding what to say back

TTS (text-to-speech) — turning that response into a spoken audio waveform

Avatar rendering — animating a face that speaks that audio in sync

For the first two, we partner with best-in-class providers.

For transcription we partner with Deepgram, which gives us fast, accurate, low-latency speech recognition.

For reasoning, we integrate with OpenAI so you can bring your favorite LLM and configure the avatar's behavior however you need.

The last two stages, speech synthesis and avatar rendering, are where the hard real-time problem lives, and they are powered entirely by AKA-1, our proprietary in-house model.

AKA-1 takes the LLM's text output and produces both the lifelike voice and the synchronized, photorealistic talking face. It does this in two stages: a text-to-speech stage that turns LLM tokens into an audio waveform, and an audio-to-motion-to-video stage that turns that waveform into a moving face.

Run naively, these stages are sequential: synthesize the full utterance, then render it, and the latencies compound.

AKA-1 instead runs them as a single streaming pipeline, emitting audio in small chunks and handing each to the renderer as it is produced, so the face starts animating before the sentence finishes.

Because the stages run concurrently and we optimize for time-to-first-frame, their costs hide behind each other instead of summing.
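The chunk-by-chunk handoff can be sketched with Python generators. This is a single-threaded toy in which lazy iteration stands in for true concurrency; synthesize_audio, render_video, run_pipeline, and the two-word chunk size are hypothetical, and strings stand in for audio and frames. The event log makes the ordering visible: rendering of the first chunk happens before the second chunk is synthesized.

```python
from typing import Iterator, List, Tuple


def synthesize_audio(text: str, events: List[str], chunk_words: int = 2) -> Iterator[str]:
    """Hypothetical streaming TTS: yields one audio chunk per few words."""
    words = text.split()
    for i in range(0, len(words), chunk_words):
        chunk = " ".join(words[i:i + chunk_words])
        events.append(f"tts:{chunk}")
        yield chunk


def render_video(audio_chunks: Iterator[str], events: List[str]) -> Iterator[str]:
    """Hypothetical renderer: consumes each audio chunk as soon as it exists."""
    for chunk in audio_chunks:
        events.append(f"render:{chunk}")
        yield f"frames[{chunk}]"


def run_pipeline(text: str) -> Tuple[List[str], List[str]]:
    events: List[str] = []
    frames = list(render_video(synthesize_audio(text, events), events))
    return frames, events


frames, events = run_pipeline("hello there how are you")
print(events[:3])  # ['tts:hello there', 'render:hello there', 'tts:how are']
```

A sequential version would log every tts event before the first render event, which is where the compounded latency comes from.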

The harder problem is the voice itself. Our speech model is a non-autoregressive, style-conditioned synthesizer: a single style vector drives both timbre (who the speaker is) and prosody (how they speak), feeding a neural vocoder that produces the waveform in one pass.

What makes it fast is that it is non-autoregressive. Unlike token-by-token speech models that must generate each slice of audio conditioned on the one before it, our synthesizer predicts the full prosody and duration of an utterance up front and then decodes the entire waveform in a single parallel pass, with no sequential sampling loop to bottleneck on.

That feed-forward design, paired with a lightweight vocoder, is what lets us hit an end-to-end voice latency of 140 ms from text in to audio out.

Combined, that comes to 140 ms (voice) + 150 ms (video) = 290 ms of latency from text received to avatar video and voice ready.
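The arithmetic is simple enough to check directly. The sketch below uses only the figures quoted in this post; reading the "within 1 second" target as the budget for the whole conversational turn is an assumption, and remaining_budget_ms is a hypothetical helper.

```python
# Figures quoted in this post.
VOICE_MS = 140             # text in to audio out
VIDEO_MS = 150             # audio in to first video frame
RESPONSE_BUDGET_MS = 1000  # assumed whole-turn budget ("within 1 second")


def remaining_budget_ms(budget_ms: int, *stage_ms: int) -> int:
    """Time left in the budget after subtracting the listed stages."""
    return budget_ms - sum(stage_ms)


aka1_ms = VOICE_MS + VIDEO_MS
print(aka1_ms)  # 290
print(remaining_budget_ms(RESPONSE_BUDGET_MS, VOICE_MS, VIDEO_MS))  # 710
```

Under that assumption, roughly 710 ms would remain for transcription and reasoning.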

The entire back end, speech synthesis and avatar rendering, runs inside a single container on one GPU.

This is possible because of how AKA-1 is built: the voice model is non-autoregressive and the motion model diffuses over a handful of facial coefficients rather than pixels, so both stages fit in the same memory budget and execute back to back on the same device, passing tensors directly from speech to render with no network hop in between.

Collapsing both into one co-located container removes the overhead of running a multi-container setup: no cross-service latency, no idle GPU waiting on the other stage, and no stacked per-vendor markup.

A single container serves a full conversational avatar end to end, which is what lets AKA-1 run at a fraction of the cost of a multi-vendor pipeline and translate directly into per-minute pricing.

Every Akapulu avatar is built from a real human actor. The actor sends us a short reference recording, and from that single sample we fine-tune a dedicated voice model that captures their timbre, cadence, and intonation.

Doing this well is delicate. Fine-tuning the whole network to chase raw speaker similarity overshoots fast, pushing the voice out of distribution into gargly, artifact-ridden audio.

So instead we route the speaker's identity through a tightly constrained timbre adapter and keep the rest of the model frozen.

The result is a voice that lands close to the actor while staying stable, clear, and natural, fit from a single short recording and consistent across thousands of conversations.

The streaming design also makes the pipeline interruptible: because work flows in small chunks, an incoming user turn can barge in immediately, so the avatar stops and re-engages like a person instead of talking over you.
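A minimal sketch of that barge-in behavior, assuming a shared flag that is checked before each chunk is played. The names speak and chunks_with_barge_in are hypothetical, and the user interruption is simulated in place of a real voice-activity signal.

```python
import threading
from typing import Iterable, Iterator, List


def speak(chunks: Iterable[str], interrupted: threading.Event) -> List[str]:
    """Play chunks in order, stopping before the next one once the user barges in."""
    played: List[str] = []
    for chunk in chunks:
        if interrupted.is_set():
            break
        played.append(chunk)
    return played


def chunks_with_barge_in(
    chunks: List[str], interrupted: threading.Event, barge_in_at: int
) -> Iterator[str]:
    """Simulate a user starting to talk just as chunk `barge_in_at` is produced."""
    for i, chunk in enumerate(chunks):
        if i == barge_in_at:
            interrupted.set()
        yield chunk


interrupted = threading.Event()
utterance = ["Sure,", "let me", "explain", "that in", "detail."]
played = speak(chunks_with_barge_in(utterance, interrupted, barge_in_at=2), interrupted)
print(played)  # ['Sure,', 'let me']
```

Because the flag is checked once per chunk, the worst-case delay before the avatar stops is one chunk, which is why small chunks matter.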

Output is sustained at a continuous 25 FPS with seamless idle-to-speaking transitions.

Generic models produce generic faces

Most systems animate a single shared model across every identity, conditioning it on a reference image at inference time.

That keeps deployment simple, but it caps how specific any one avatar can be: the network's weights encode a population average of faces and motion, so subtle, person-specific detail gets regressed toward the mean.

AKA-1 is fine-tuned per avatar. From an actor's reference material (a short, one-minute clip), we fit dedicated weights for that individual across both halves of the model: a voice profile for their timbre and prosody, and a motion-and-rendering profile that learns their idiosyncratic dynamics (how their mouth shapes phonemes, how their brow and eyes move, their resting pose and micro-expressions) along with a high-fidelity appearance model of their face.

Because the kinematic mapping and the synthesis network are specialized to one person rather than shared across millions, the avatar captures detail a single global model structurally cannot.

Those per-avatar weights are then compiled into an optimized inference engine for the target hardware, so the personalization adds fidelity without adding latency.

The identity stays consistent and recognizable across thousands of conversations, because it is baked into the model, not approximated from a reference at run time.

Text, then voice, now face

The first wave of conversational AI was text. Now, we’re starting to see voice-to-voice applications flourish.

We believe that face-to-face interactions are the next step.

That last step is the hardest one to get right. A response can be fast and still not feel present. A face can look good in a demo and still break once a conversation runs long, gets interrupted, or needs to hold a specific identity across thousands of turns.

AKA-1 exists to solve that layer: real-time face-to-face presence, not just animated output.

It is the model behind every live avatar on Akapulu today, and it is the foundation for the next generation of live, face-to-face AI experiences.