Build a Three.js 3D Avatar with Real-Time AI (Vision, Voice, Lip-Sync) in Next.js
A reactive 3D avatar that can see, hear, speak, and lip-sync in real time is now practical to ship to production. In this guide, you'll wire a VRM character into any Next.js or React project using Three.js and connect it to real-time AI with the Gabber SDK, including live viseme streams for frame-perfect mouth shapes.
Whether you're building AI companions, interactive tutorials, virtual assistants, or product demos, this is the complete implementation path.
What You'll Build
- A Three.js scene that loads and animates a VRM model
- Live microphone input streamed to real-time AI
- AI speech output with Text-to-Speech playback in the browser
- Live viseme stream mapped to VRM mouth expressions (Aa/Ee/Ih/Oh/Ou)
- Smooth animation blending between talking states
- Production-ready with proper cleanup and error handling
Prerequisites
- Next.js 14+ with App Router and React 18
- Three.js and @pixiv/three-vrm for VRM model support
- A Gabber account for the low-latency real-time AI backend
Install the essentials:
```
npm i three @pixiv/three-vrm @gabber/client-react
```
Architecture Overview
We split the implementation into two core components:
```
app/avatar-3d/
├── components/
│   ├── VRMAvatarDemo.tsx   // Gabber connection, audio I/O, viseme subscription
│   └── VRMScene.tsx        // Three.js rendering, VRM loading, expression mapping
└── page.tsx                // Entry point
```
VRMAvatarDemo handles all Gabber SDK interactions: connecting to the AI backend, publishing microphone audio, subscribing to AI audio output, and subscribing to the viseme stream.
VRMScene is a pure Three.js component that receives two props (isSpeaking and currentViseme) and renders the animated VRM model with real-time mouth shapes.
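For reference, the contract between the two components is small. A minimal sketch of the props shape (the names come from this guide; the interface itself is illustrative):

```
// Illustrative props contract for VRMScene
interface VRMSceneProps {
  isSpeaking: boolean;   // true while the AI is producing audio
  currentViseme: string; // e.g. 'aa', 'e', 'i', 'o', 'u', or 'sil' for silence
}
```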
Part 1: The Three.js VRM Scene
Loading the VRM Model
The VRM format is an open standard for 3D humanoid avatars. We load it using GLTFLoader with the VRMLoaderPlugin:
```
// VRMScene.tsx
import * as THREE from 'three';
import { VRM, VRMLoaderPlugin, VRMExpressionPresetName } from '@pixiv/three-vrm';
import { GLTFLoader } from 'three/examples/jsm/loaders/GLTFLoader.js';

const loadVRM = async () => {
  const loader = new GLTFLoader();
  loader.register((parser) => new VRMLoaderPlugin(parser));

  loader.load('/models/girl.vrm', async (gltf) => {
    const vrm = gltf.userData.vrm;

    // Position and scale the model
    vrm.scene.position.set(0, -0.3, 0);
    vrm.scene.rotation.set(0, Math.PI, 0);
    vrm.scene.scale.set(0.9, 0.9, 0.9);

    // Add to scene
    scene.add(vrm.scene);

    // Store reference for animation updates
    vrmRef.current = vrm;
  });
};
```
Setting Up Animations with Mixamo
We load FBX animations from Mixamo and retarget them to the VRM rig. This requires bone mapping between Mixamo's skeleton and VRM's humanoid structure:
```
const MIXAMO_VRM_RIG_MAP = {
  mixamorigHips: 'hips',
  mixamorigSpine: 'spine',
  mixamorigSpine1: 'chest',
  mixamorigSpine2: 'upperChest',
  mixamorigNeck: 'neck',
  mixamorigHead: 'head',
  // ... more bones
};

const loadMixamoAnimation = async (url, vrm) => {
  const { FBXLoader } = await import('three/examples/jsm/loaders/FBXLoader.js');
  const loader = new FBXLoader();

  return new Promise((resolve) => {
    loader.load(url, (asset) => {
      const clip = THREE.AnimationClip.findByName(asset.animations, 'mixamo.com');

      // Retarget animation tracks to VRM bones
      const tracks = [];
      clip.tracks.forEach((track) => {
        // This simplified version only retargets rotation tracks
        if (!(track instanceof THREE.QuaternionKeyframeTrack)) return;

        const mixamoRigName = track.name.split('.')[0];
        const vrmBoneName = MIXAMO_VRM_RIG_MAP[mixamoRigName];
        if (vrmBoneName) {
          const vrmNodeName = vrm.humanoid?.getNormalizedBoneNode(vrmBoneName)?.name;
          if (vrmNodeName) {
            // Create new track with VRM bone name
            tracks.push(
              new THREE.QuaternionKeyframeTrack(
                vrmNodeName + '.' + track.name.split('.')[1],
                track.times,
                track.values
              )
            );
          }
        }
      });

      const vrmAnimation = new THREE.AnimationClip('vrmAnimation', clip.duration, tracks);
      resolve(vrmAnimation);
    });
  });
};
```
Playing Animations
Create an AnimationMixer and play the talking animation in a loop:
```
// Create mixer for the VRM model
mixerRef.current = new THREE.AnimationMixer(vrm.scene);

// Load talking animation (we use this continuously)
const talkingClip = await loadMixamoAnimation('/animations/Sad.fbx', vrm);
if (talkingClip) {
  const action = mixerRef.current.clipAction(talkingClip);
  action.setLoop(THREE.LoopRepeat, Infinity);
  action.clampWhenFinished = false;
  action.play();
  currentActionRef.current = action;
}
```
Mapping Visemes to VRM Expressions
VRM defines standard facial expressions. We map incoming viseme strings to these expressions in real-time:
```
// Inside the animation loop
const updateMouthExpressions = () => {
  const vrm = vrmRef.current;
  if (!vrm?.expressionManager) return;
  const expressionManager = vrm.expressionManager;

  // Reset all mouth expressions first
  expressionManager.setValue(VRMExpressionPresetName.Aa, 0);
  expressionManager.setValue(VRMExpressionPresetName.Oh, 0);
  expressionManager.setValue(VRMExpressionPresetName.Ee, 0);
  expressionManager.setValue(VRMExpressionPresetName.Ih, 0);
  expressionManager.setValue(VRMExpressionPresetName.Ou, 0);

  // Apply current viseme
  if (currentViseme && currentViseme !== 'sil') {
    const viseme = currentViseme.toLowerCase();
    if (viseme === 'aa') {
      expressionManager.setValue(VRMExpressionPresetName.Aa, 0.8);
    } else if (viseme === 'e') {
      expressionManager.setValue(VRMExpressionPresetName.Ee, 0.8);
    } else if (viseme === 'i' || viseme === 'ih') {
      expressionManager.setValue(VRMExpressionPresetName.Ih, 0.8);
    } else if (viseme === 'o' || viseme === 'oh') {
      expressionManager.setValue(VRMExpressionPresetName.Oh, 0.7);
    } else if (viseme === 'u' || viseme === 'ou') {
      expressionManager.setValue(VRMExpressionPresetName.Ou, 0.7);
    }
  }

  // Add blinking for realism
  const shouldBlink = Math.sin(clock.elapsedTime * 0.3) > 0.98;
  expressionManager.setValue(
    VRMExpressionPresetName.Blink,
    shouldBlink ? 1.0 : 0.0
  );

  // IMPORTANT: Update expressions every frame
  expressionManager.update();
};
```
The Animation Loop
Keep everything running with requestAnimationFrame:
```
const animate = () => {
  requestAnimationFrame(animate);
  const delta = clock.getDelta();

  // Update animation mixer
  if (mixerRef.current) {
    mixerRef.current.update(delta);
  }

  // Drive mouth shapes and blinking (defined above)
  updateMouthExpressions();

  // Update VRM (required for expressions)
  if (vrmRef.current) {
    vrmRef.current.update(delta);
  }

  // Render the scene
  renderer.render(scene, camera);
};
```
Part 2: Connect to Real-Time AI with Gabber
Understanding the Gabber Graph
Gabber uses a graph-based architecture where nodes represent AI services. For this avatar, we need:
- Publish node: Receives your microphone input
- STT node: Converts speech to text
- LLM node: Generates AI responses
- TTS node: Converts text back to speech
- LocalViseme node: Generates phoneme-aligned visemes
- Output node: Streams audio back to your browser
The graph is defined in JSON (viseme_avatar_homepage.json) and orchestrates the entire flow automatically.
Initializing the Gabber Engine
```
// VRMAvatarDemo.tsx
import { useState } from 'react';
import { useEngine, useEngineInternal } from '@gabber/client-react';

export default function VRMAvatarDemo({ apiUrl, runId }) {
  const { connect, disconnect, publishToNode, getLocalTrack, subscribeToNode } = useEngine();
  const { engineRef } = useEngineInternal();

  const [isConnected, setIsConnected] = useState(false);
  const [isSpeaking, setIsSpeaking] = useState(false);
  const [currentViseme, setCurrentViseme] = useState('sil');

  // ... handlers and JSX are added in the steps below
}
```
Step 1: Connect to Gabber
Start the graph and connect the client:
```
const handleConnect = async () => {
  try {
    // Get authentication token
    const tokenResponse = await GenerateUserToken();
    const userToken = tokenResponse.token;

    // Start the graph run
    const response = await fetch(apiUrl + '/app/run_from_graph', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer ' + userToken,
      },
      body: JSON.stringify({ run_id: runId, graph: visemeAvatarGraph }),
    });

    const data = await response.json();
    const connectionDetails = data.connection_details;

    // Connect the WebRTC client
    await connect(connectionDetails);
    setIsConnected(true);

    // Now subscribe to outputs
    await subscribeToOutput();
  } catch (err) {
    console.error('Failed to connect:', err);
  }
};
```
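GenerateUserToken here stands in for your own helper that fetches a short-lived user token from your backend, which holds your Gabber API key. A minimal client-side sketch, assuming a hypothetical /api/gabber-token route that you implement server-side (the route name and response shape are illustrative):

```
// Hypothetical helper — assumes your backend exposes /api/gabber-token and mints
// a short-lived user token with your Gabber API key (never ship the key to the client).
async function GenerateUserToken(): Promise<{ token: string }> {
  const res = await fetch('/api/gabber-token', { method: 'POST' });
  if (!res.ok) {
    throw new Error('Failed to fetch Gabber user token: ' + res.status);
  }
  return res.json(); // expected shape: { token: string }
}
```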
Step 2: Subscribe to AI Audio Output
Listen for the AI's voice and play it in the browser:
```
const subscribeToOutput = async () => {
  const outputNodeId = getOutputNodeId(); // Find Output node in graph

  const audioSubscription = await subscribeToNode({
    outputOrPublishNodeId: outputNodeId,
  });

  // Wait for audio track and attach to audio element
  audioSubscription.waitForAudioTrack().then((track) => {
    if (audioOutputRef.current) {
      track.attachToElement(audioOutputRef.current);
      audioOutputRef.current.play();
    }
  });
};
```
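getOutputNodeId (and getPublishNodeId in the next step) are small helpers that look up node IDs in the imported graph JSON. A sketch, assuming the graph file exposes a nodes array with id and type fields; adjust the lookup to match the actual shape of your graph JSON:

```
// Sketch only — assumes viseme_avatar_homepage.json contains a `nodes` array
// with `id` and `type` fields; check your graph JSON for the real structure.
import visemeAvatarGraph from './viseme_avatar_homepage.json';

const findNodeIdByType = (type: string): string => {
  const node = (visemeAvatarGraph.nodes as Array<{ id: string; type: string }>)
    .find((n) => n.type === type);
  if (!node) throw new Error('Node of type "' + type + '" not found in graph');
  return node.id;
};

const getOutputNodeId = () => findNodeIdByType('output');
const getPublishNodeId = () => findNodeIdByType('publish');
```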
Step 3: Publish Microphone Audio
Send user voice to the AI:
```
const toggleMicrophone = async () => {
  if (audioPublication) {
    // Turn off microphone
    audioPublication.unpublish();
    setAudioPublication(null);
  } else {
    // Turn on microphone
    const audioTrack = await getLocalTrack({ type: 'microphone' });
    const publishNodeId = getPublishNodeId(); // Find Publish node in graph

    const pub = await publishToNode({ localTrack: audioTrack, publishNodeId });
    setAudioPublication(pub);
  }
};
```
Step 4: Subscribe to Live Visemes (The Key Part!)
This is where the magic happens. We access the engine's internal pad system to get real-time viseme data:
```
const subscribeToVisemes = () => {
  // engineRef gives us direct access to the Gabber engine
  if (!engineRef.current) {
    console.warn('Engine not ready yet');
    return;
  }

  try {
    // Get the viseme source pad from the LocalViseme node
    const visemeNodeId = 'localviseme_8bc97c52'; // From your graph JSON
    const visemePad = engineRef.current.getSourcePad(visemeNodeId, 'viseme');

    if (!visemePad) {
      console.warn('Viseme pad not found');
      return;
    }

    // Listen for viseme values
    visemePad.on('value', (data) => {
      const viseme = data;
      const visemeShape = viseme?.value;

      if (visemeShape) {
        setCurrentViseme(visemeShape);
        setIsSpeaking(true);

        // Auto-reset to silence after 300ms
        if (visemeTimeoutRef.current) {
          clearTimeout(visemeTimeoutRef.current);
        }
        visemeTimeoutRef.current = setTimeout(() => {
          setCurrentViseme('sil');
          setIsSpeaking(false);
        }, 300);
      }
    });

    console.log('Successfully subscribed to viseme stream');
  } catch (err) {
    console.error('Failed to subscribe to visemes:', err);
  }
};
```
Timing the Viseme Subscription
The engine needs to be fully initialized before we can access pads. Use a useEffect that waits for both connection and engine availability:
```
useEffect(() => {
  if (isConnected && engineRef.current) {
    subscribeToVisemes();
  }
}, [isConnected, engineRef]);
```
Cleanup on Disconnect
Always clean up resources properly:
```
const handleDisconnect = async () => {
  try {
    // Unpublish audio
    if (audioPublication) {
      audioPublication.unpublish();
      setAudioPublication(null);
    }

    // Clear viseme timeout
    if (visemeTimeoutRef.current) {
      clearTimeout(visemeTimeoutRef.current);
    }

    // Disconnect engine
    await disconnect();
    setIsConnected(false);
    setIsSpeaking(false);
    setCurrentViseme('sil');
  } catch (err) {
    console.error('Disconnect error:', err);
  }
};
```
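If the component can unmount while still connected, you may also want to run the same teardown on unmount. A sketch, assuming disconnect from useEngine is stable across renders:

```
// Optional sketch: tidy up if the component unmounts while still connected.
// A ref mirrors the connection state so the unmount callback isn't stale.
const isConnectedRef = useRef(false);
useEffect(() => {
  isConnectedRef.current = isConnected;
}, [isConnected]);

useEffect(() => {
  return () => {
    if (visemeTimeoutRef.current) clearTimeout(visemeTimeoutRef.current);
    if (isConnectedRef.current) {
      disconnect().catch((err) => console.error('Cleanup disconnect failed:', err));
    }
  };
  // Intentionally runs only on unmount
  // eslint-disable-next-line react-hooks/exhaustive-deps
}, []);
```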
Part 3: Putting It All Together
The Complete VRMAvatarDemo Component
```
// VRMAvatarDemo.tsx
import { useState, useRef } from 'react';
import { useEngine, useEngineInternal } from '@gabber/client-react';
import VRMScene from './VRMScene';

export default function VRMAvatarDemo({ apiUrl, runId }) {
  const { connect, disconnect, publishToNode, getLocalTrack, subscribeToNode } = useEngine();
  const { engineRef } = useEngineInternal();

  const [isConnected, setIsConnected] = useState(false);
  const [isSpeaking, setIsSpeaking] = useState(false);
  const [currentViseme, setCurrentViseme] = useState('sil');
  const [audioPublication, setAudioPublication] = useState(null);

  const audioOutputRef = useRef(null);
  const visemeTimeoutRef = useRef(null);

  // handleConnect, handleDisconnect, toggleMicrophone, subscribeToOutput,
  // and subscribeToVisemes come from the steps above

  return (
    <div className="min-h-screen">
      {/* 3D Avatar Viewport */}
      <div style={{ height: '500px' }}>
        <VRMScene isSpeaking={isSpeaking} currentViseme={currentViseme} />
      </div>

      {/* Controls */}
      <button onClick={isConnected ? handleDisconnect : handleConnect}>
        {isConnected ? 'Disconnect' : 'Connect'}
      </button>
      <button onClick={toggleMicrophone} disabled={!isConnected}>
        {audioPublication ? 'Mic On' : 'Mic Off'}
      </button>

      {/* Hidden audio element for AI voice playback */}
      <audio ref={audioOutputRef} />
    </div>
  );
}
```
Drop-In Usage in Any Page
```
// app/avatar-3d/page.tsx
import dynamic from 'next/dynamic';

const VRMAvatarDemo = dynamic(
  () => import('./components/VRMAvatarDemo'),
  { ssr: false }
);

export default function Page() {
  return (
    <VRMAvatarDemo
      apiUrl="https://api.gabber.dev/v1"
      runId="your_unique_run_id"
    />
  );
}
```
Production Tips
Keep the Animation Running
For conversation-heavy apps, keep the talking animation playing continuously. This prevents T-poses during brief silences:
```
// Start with talking animation
const action = mixerRef.current.clipAction(talkingClip);
action.setLoop(THREE.LoopRepeat, Infinity);
action.clampWhenFinished = false;
action.play();
```
Responsive Canvas
Use ResizeObserver to handle window resizing:
```
const handleResize = () => {
  if (!canvasRef.current || !renderer || !camera) return;
  const rect = canvasRef.current.getBoundingClientRect();
  renderer.setSize(rect.width, rect.height);
  camera.aspect = rect.width / rect.height;
  camera.updateProjectionMatrix();
};

const resizeObserver = new ResizeObserver(handleResize);
resizeObserver.observe(canvasRef.current);
```
Mobile Performance
Limit pixel ratio to reduce thermal throttling on mobile devices:
```
renderer.setPixelRatio(Math.min(window.devicePixelRatio, 2));
```
Smooth Viseme Transitions
For even smoother mouth movements, apply lerp-based smoothing to expression values instead of hard-setting them.
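A minimal sketch of that idea; the 0.3 smoothing factor and the targetWeights map are illustrative, not taken from the repo:

```
// Illustrative sketch: ease each expression toward its target instead of snapping.
const SMOOTHING = 0.3; // 0..1, higher = snappier

const applySmoothedVisemes = (
  expressionManager: {
    getValue: (name: string) => number | null;
    setValue: (name: string, value: number) => void;
    update: () => void;
  },
  targetWeights: Record<string, number> // e.g. { aa: 0.8, ee: 0, ih: 0, oh: 0, ou: 0 }
) => {
  for (const [name, target] of Object.entries(targetWeights)) {
    const current = expressionManager.getValue(name) ?? 0;
    // Linear interpolation toward the target weight, called once per frame
    expressionManager.setValue(name, current + (target - current) * SMOOTHING);
  }
  expressionManager.update();
};
```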
Troubleshooting
Problem: Engine not available yet
Solution: Make sure you're waiting for both isConnected and engineRef.current before subscribing to visemes.
Problem: Visemes not updating
Solution: Check that your graph JSON has a LocalViseme node and verify the node ID matches what you're passing to getSourcePad.
Problem: T-pose appears when avatar stops talking
Solution: Keep a looping animation playing at all times rather than trying to switch between idle and talking states.
Problem: Audio latency
Solution: Gabber's WebRTC infrastructure already provides sub-200ms latency. If you're experiencing delays, check your network connection and ensure you're not running heavy computations on the main thread.
Visual Guide
The complete interface showing the VRM avatar, control panel, and live viseme indicators.
The Gabber graph orchestrating STT, LLM, TTS, and LocalViseme nodes.
Why Gabber for Real-Time AI?
Traditional approaches require you to manually orchestrate:
- WebRTC connections
- STT/TTS API calls
- LLM request/response handling
- Audio stream management
- Viseme generation and timing
Gabber gives you all of this in a single graph with:
- Millisecond-level streaming for STT, TTS, and multi-modal AI
- Orchestrated graphs that handle the entire pipeline
- React SDK with clean hooks and direct pad access for advanced features
- Built-in viseme generation perfectly synced to audio output
If you're building real-time voice or video AI apps, Gabber eliminates weeks of infrastructure work.
Next Steps
- Replace placeholder images with your custom renders
- Swap the VRM model to match your brand or character
- Customize the Mixamo animations for different moods
- Add camera input and connect a VLM node to let the avatar see users
- Deploy to Vercel and share your interactive demo
👉 Ready to build? Start with Gabber
Full Working Example: Check the app/avatar-3d directory in this repo for the complete implementation with Three.js scene, Gabber connection, and live viseme lip-sync.
You can clone it, swap your credentials, and have a working 3D AI avatar in under 10 minutes.