Lip Sync Animation


How I built a lip-synced character in Rive

I wanted to see how far you can push character animation in Rive, specifically lip sync. Apps like Duolingo use talking characters to make learning feel alive. I decided to explore what it takes to build something similar from scratch.

The character

I started in Midjourney, iterating on the look until I had a reference worth building from. From there I rebuilt everything as a clean vector in Rive, keeping the original feel but making it fully animatable. The most important part was the mouth: I rigged it with bones and tested a few approaches before settling on the one that gave the most control. On top of the lip sync, the character has a subtle fake-3D parallax on the head, a small touch that makes her feel alive even when she's not speaking.
Rigged mouth animation in Rive

Ten mouth shapes

Spoken English can be broken down into roughly ten distinct mouth positions, called visemes. I prepared ten animations in Rive, one per viseme. Each viseme covers a group of phonetically similar sounds:
Value  Rive state   Characters
0      Neutral      pause, space
1      AEI          A E I
2      BMP          B M P
3      FV           F V
4      CDGKNRSTXYZ  C D G K N R S T X Y Z
5      U            U
6      ChJSh        Ch Sh J H
7      L            L
8      O            O
9      QW           Q W
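In JavaScript, the table boils down to a lookup. This is a minimal sketch; the GROUPS array and the charToViseme name are mine, not part of Rive or the tool's published source:

```javascript
// Index in GROUPS = viseme value from the table above.
// Anything not listed (spaces, punctuation) falls back to 0, Neutral.
// Ch and Sh are digraphs handled before this per-character lookup.
const GROUPS = ['', 'AEI', 'BMP', 'FV', 'CDGKNRSTXYZ', 'U', 'HJ', 'L', 'O', 'QW'];

function charToViseme(ch) {
  const i = GROUPS.findIndex((g) => g.includes(ch.toUpperCase()));
  return i > 0 ? i : 0;
}
```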

The state machine

Driving mouth movement with keyframes per phrase would be impractical. Instead, I built a state machine with a single number input called phoneme. Each of the ten states is triggered when phoneme equals its corresponding value. This way, the mouth can be controlled entirely from JavaScript at runtime. No timeline scrubbing, no manual keyframes per phrase.
Rive's state machine for lip sync animation

The problem

When I started working on the lip sync character, the animation side was clear: ten viseme states in Rive, one state machine, one number input. What wasn't clear was how to get the timing data. Speech synthesis APIs return audio, and that's it. To sync mouth shapes to that audio, you need to know which sound is spoken at which exact millisecond. Doing this manually (listening, measuring, writing JSON) is tedious and doesn't scale past one or two phrases. I needed a tool.

Designing it first

Before writing any code, I opened Figma and designed the interface. I wanted it to be simple: paste in your credentials and text, hit generate, get usable output immediately. The output had two parts: a visual timeline of phoneme chips (so you can see what was detected), and a ready-to-copy JavaScript snippet for Rive.
Lip sync tool timing generator interface in Figma.

How it works

The tool uses ElevenLabs' with-timestamps endpoint, a lesser-known API call that returns not just the audio but also the start time of every character in the synthesized speech. From there, each character is mapped to one of the ten viseme values. Digraphs like Ch and Sh are handled as pairs, and consecutive identical visemes are deduplicated to keep the output clean.
The result is a millisecond-accurate array like this:
[
  { "chars": "H", "viseme": 6, "timeMs": 0 },
  { "chars": "e", "viseme": 1, "timeMs": 58 },
  { "chars": "ll", "viseme": 7, "timeMs": 112 }
]
Plus a JavaScript snippet ready to paste into your Rive project.
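The digraph and deduplication steps can be sketched like this. buildTimeline is an illustrative helper of my own, not the tool's actual source:

```javascript
// Minimal character→viseme lookup (see the table above); unlisted chars → 0.
const GROUPS = ['', 'AEI', 'BMP', 'FV', 'CDGKNRSTXYZ', 'U', 'HJ', 'L', 'O', 'QW'];
const toViseme = (c) => Math.max(GROUPS.findIndex((g) => g.includes(c.toUpperCase())), 0);

// Turn per-character timestamps into a deduplicated viseme timeline.
function buildTimeline(chars, startTimesMs) {
  const events = [];
  for (let i = 0; i < chars.length; i++) {
    const startMs = startTimesMs[i];
    let group = chars[i];
    // Digraphs: "Ch"/"Sh" (any case) collapse into one viseme-6 event.
    if (/[cs]/i.test(chars[i]) && /^h$/i.test(chars[i + 1] || '')) {
      group += chars[++i];
    }
    const viseme = group.length === 2 ? 6 : toViseme(group);
    const last = events[events.length - 1];
    if (last && last.viseme === viseme) {
      last.chars += group; // consecutive identical visemes merge into one event
    } else {
      events.push({ chars: group, viseme, timeMs: startMs });
    }
  }
  return events;
}
```

Running it on "Hello" reproduces the shape of the array above: H and e fire separately, the double l collapses into one viseme-7 event, and o closes the word.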

What you need to use it

1. An ElevenLabs account - the free tier works, but has limited characters per month.
2. An ElevenLabs API key - generated in your account settings.
3. A Voice ID - found in the ElevenLabs voice library.
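With those three in hand, the underlying request looks roughly like this. This is my own sketch against what I understand to be ElevenLabs' text-to-speech with-timestamps endpoint; field names like audio_base64 and character_start_times_seconds should be checked against the current API docs before relying on them:

```javascript
// Fetch synthesized audio plus per-character timing from ElevenLabs.
// voiceId and apiKey are the Voice ID and API key from your account.
async function synthesizeWithTimings(text, voiceId, apiKey) {
  const res = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/with-timestamps`,
    {
      method: 'POST',
      headers: { 'xi-api-key': apiKey, 'Content-Type': 'application/json' },
      body: JSON.stringify({ text }),
    }
  );
  const data = await res.json();
  // data.audio_base64 holds the MP3; data.alignment holds character timings.
  return {
    audioBase64: data.audio_base64,
    timings: alignmentToTimings(data.alignment),
  };
}

// Convert the alignment object (seconds) to per-character milliseconds.
function alignmentToTimings(alignment) {
  return alignment.characters.map((ch, i) => ({
    char: ch,
    timeMs: Math.round(alignment.character_start_times_seconds[i] * 1000),
  }));
}
```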

What it outputs

1. MP3 audio playback - you can preview the generated voice directly in the tool.
2. Viseme chips - a color-coded visual of every phoneme and its timestamp.
3. JavaScript snippet - paste directly into your Rive project.
4. Raw JSON - if you want to use the timing data in your own way.
The tool is free and available at rive.expert/tools/lip-sync-timing-generator (it also lives at lipsynctiming.netlify.app).

Connecting speech to animation

The tool handles the hard part: it takes synthesized speech, extracts character-level timestamps, maps them to viseme values, and outputs a millisecond-accurate JSON array. The JavaScript side is then straightforward: iterate over the timing array and fire a setTimeout for each viseme event.
// "phoneme" is the number input on the Rive state machine
visemeTimeline.forEach(({ viseme, timeMs }) => {
  setTimeout(() => { phoneme.value = viseme; }, timeMs);
});
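One caveat with raw setTimeout: if playback is interrupted mid-phrase, stray timers keep flipping the mouth. A small wrapper that holds on to the timer handles (my own addition, not part of the tool's output) makes a phrase cancelable:

```javascript
// Schedule a viseme timeline and return a handle for canceling it,
// e.g. when the audio is paused or a new phrase starts.
function scheduleVisemeTimeline(timeline, setPhoneme) {
  const timers = timeline.map(({ viseme, timeMs }) =>
    setTimeout(() => setPhoneme(viseme), timeMs)
  );
  return {
    timers,
    cancel() {
      timers.forEach(clearTimeout); // stop any visemes still pending
      setPhoneme(0); // snap the mouth back to the Neutral state
    },
  };
}
```

In my setup, setPhoneme is just a setter over the Rive input, e.g. (v) => { phoneme.value = v; }, so the same wrapper works for every phrase.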

The result

The character responds to three different dialogue lines, each with accurate mouth sync and word-by-word subtitles.
You can try it at the top of this article, or at lipsync-animation.netlify.app.
The timing tool is free to use at rive.expert/tools/lip-sync-timing-generator.

Let's work together! Reach out or book a call to get started.