Build a Three.js 3D Avatar with Real-Time AI (Vision, Voice, Lip-Sync) in Next.js
A reactive 3D avatar that can see, hear, speak, and lip-sync in real time is now practical to ship to production. In this guide, you'll wire a VRM character into any Next.js or React project using Three.js and connect it to real-time AI with the Gabber SDK, including live viseme streams for frame-perfect mouth shapes.
Whether you're building AI companions, interactive tutorials, virtual assistants, or product demos, this is the complete implementation path.
What You'll Build
- A Three.js scene that loads and animates a VRM model
- Live microphone input streamed to real-time AI
- AI speech output with Text-to-Speech playback in the browser
- Live viseme stream mapped to VRM mouth expressions (Aa/Ee/Ih/Oh/Ou)
- Smooth animation blending between talking states
- Production-ready with proper cleanup and error handling
Prerequisites
- Next.js 14+ with App Router and React 18
- Three.js and @pixiv/three-vrm for VRM model support
- A Gabber account for the low-latency real-time AI backend
Install the essentials:
```
npm i three @pixiv/three-vrm @gabber/client-react
```
Architecture Overview
We split the implementation into two core components:
```
app/avatar-3d/
├── components/
│   ├── VRMAvatarDemo.tsx   // Gabber connection, audio I/O, viseme subscription
│   └── VRMScene.tsx        // Three.js rendering, VRM loading, expression mapping
└── page.tsx                // Entry point
```
VRMAvatarDemo handles all Gabber SDK interactions: connecting to the AI backend, publishing microphone audio, subscribing to AI audio output, and subscribing to the viseme stream.
VRMScene is a pure Three.js component that receives two props (isSpeaking and currentViseme) and renders the animated VRM model with real-time mouth shapes.
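For reference, the contract between the two components is small. A minimal sketch of the props shape (the names come from this guide; the interface itself is illustrative):

```
// Illustrative props contract for VRMScene
interface VRMSceneProps {
  isSpeaking: boolean;   // true while the AI is producing audio
  currentViseme: string; // e.g. 'aa', 'e', 'i', 'o', 'u', or 'sil' for silence
}
```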
Part 1: The Three.js VRM Scene
Loading the VRM Model
The VRM format is an open standard for 3D humanoid avatars. We load it using GLTFLoader with the VRMLoaderPlugin:
```
// VRMScene.tsx
import * as THREE from 'three';
import { VRM, VRMLoaderPlugin, VRMExpressionPresetName } from '@pixiv/three-vrm';
import { GLTFLoader } from 'three/examples/jsm/loaders/GLTFLoader.js';

const loadVRM = async () => {
  const loader = new GLTFLoader();
  loader.register((parser) => new VRMLoaderPlugin(parser));

  loader.load('/models/girl.vrm', async (gltf) => {
    const vrm = gltf.userData.vrm;

    // Position and scale the model
    vrm.scene.position.set(0, -0.3, 0);
    vrm.scene.rotation.set(0, Math.PI, 0);
    vrm.scene.scale.set(0.9, 0.9, 0.9);

    // Add to scene
    scene.add(vrm.scene);

    // Store reference for animation updates
    vrmRef.current = vrm;
  });
};
```
Setting Up Animations with Mixamo
We load FBX animations from Mixamo and retarget them to the VRM rig. This requires bone mapping between Mixamo's skeleton and VRM's humanoid structure:
```
const MIXAMO_VRM_RIG_MAP = {
  mixamorigHips: 'hips',
  mixamorigSpine: 'spine',
  mixamorigSpine1: 'chest',
  mixamorigSpine2: 'upperChest',
  mixamorigNeck: 'neck',
  mixamorigHead: 'head',
  // ... more bones
};

const loadMixamoAnimation = async (url, vrm) => {
  const { FBXLoader } = await import('three/examples/jsm/loaders/FBXLoader.js');
  const loader = new FBXLoader();

  return new Promise((resolve) => {
    loader.load(url, (asset) => {
      const clip = THREE.AnimationClip.findByName(asset.animations, 'mixamo.com');

      // Retarget animation tracks to VRM bones
      const tracks = [];
      clip.tracks.forEach((track) => {
        // This simplified version only retargets rotation tracks
        if (!(track instanceof THREE.QuaternionKeyframeTrack)) return;

        const mixamoRigName = track.name.split('.')[0];
        const vrmBoneName = MIXAMO_VRM_RIG_MAP[mixamoRigName];
        if (vrmBoneName) {
          const vrmNodeName = vrm.humanoid?.getNormalizedBoneNode(vrmBoneName)?.name;
          if (vrmNodeName) {
            // Create new track with VRM bone name
            tracks.push(
              new THREE.QuaternionKeyframeTrack(
                vrmNodeName + '.' + track.name.split('.')[1],
                track.times,
                track.values
              )
            );
          }
        }
      });

      const vrmAnimation = new THREE.AnimationClip('vrmAnimation', clip.duration, tracks);
      resolve(vrmAnimation);
    });
  });
};
```
Playing Animations
Create an AnimationMixer and play the talking animation in a loop:
```
// Create mixer for the VRM model
mixerRef.current = new THREE.AnimationMixer(vrm.scene);

// Load talking animation (we use this continuously)
const talkingClip = await loadMixamoAnimation('/animations/Sad.fbx', vrm);
if (talkingClip) {
  const action = mixerRef.current.clipAction(talkingClip);
  action.setLoop(THREE.LoopRepeat, Infinity);
  action.clampWhenFinished = false;
  action.play();
  currentActionRef.current = action;
}
```
Mapping Visemes to VRM Expressions
VRM defines standard facial expressions. We map incoming viseme strings to these expressions in real-time:
```
// Inside the animation loop
const updateMouthExpressions = () => {
  const vrm = vrmRef.current;
  if (!vrm?.expressionManager) return;
  const expressionManager = vrm.expressionManager;

  // Reset all mouth expressions first
  expressionManager.setValue(VRMExpressionPresetName.Aa, 0);
  expressionManager.setValue(VRMExpressionPresetName.Oh, 0);
  expressionManager.setValue(VRMExpressionPresetName.Ee, 0);
  expressionManager.setValue(VRMExpressionPresetName.Ih, 0);
  expressionManager.setValue(VRMExpressionPresetName.Ou, 0);

  // Apply current viseme
  if (currentViseme && currentViseme !== 'sil') {
    const viseme = currentViseme.toLowerCase();
    if (viseme === 'aa') {
      expressionManager.setValue(VRMExpressionPresetName.Aa, 0.8);
    } else if (viseme === 'e') {
      expressionManager.setValue(VRMExpressionPresetName.Ee, 0.8);
    } else if (viseme === 'i' || viseme === 'ih') {
      expressionManager.setValue(VRMExpressionPresetName.Ih, 0.8);
    } else if (viseme === 'o' || viseme === 'oh') {
      expressionManager.setValue(VRMExpressionPresetName.Oh, 0.7);
    } else if (viseme === 'u' || viseme === 'ou') {
      expressionManager.setValue(VRMExpressionPresetName.Ou, 0.7);
    }
  }

  // Add blinking for realism
  const shouldBlink = Math.sin(clock.elapsedTime * 0.3) > 0.98;
  expressionManager.setValue(
    VRMExpressionPresetName.Blink,
    shouldBlink ? 1.0 : 0.0
  );

  // IMPORTANT: Update expressions every frame
  expressionManager.update();
};
```
The Animation Loop
Keep everything running with requestAnimationFrame:
```
const animate = () => {
  requestAnimationFrame(animate);
  const delta = clock.getDelta();

  // Update animation mixer
  if (mixerRef.current) {
    mixerRef.current.update(delta);
  }

  // Drive mouth shapes and blinking (defined above)
  updateMouthExpressions();

  // Update VRM (required for expressions)
  if (vrmRef.current) {
    vrmRef.current.update(delta);
  }

  // Render the scene
  renderer.render(scene, camera);
};
```
Part 2: Connect to Real-Time AI with Gabber
Understanding the Gabber Graph
Gabber uses a graph-based architecture where nodes represent AI services. For this avatar, we need:
- Publish node: Receives your microphone input
- STT node: Converts speech to text
- LLM node: Generates AI responses
- TTS node: Converts text back to speech
- LocalViseme node: Generates phoneme-aligned visemes
- Output node: Streams audio back to your browser
The graph is defined in JSON (viseme_avatar_homepage.json) and orchestrates the entire flow automatically.
Initializing the Gabber Engine
```
// VRMAvatarDemo.tsx
import { useState } from 'react';
import { useEngine, useEngineInternal } from '@gabber/client-react';

export default function VRMAvatarDemo({ apiUrl, runId }) {
  const { connect, disconnect, publishToNode, getLocalTrack, subscribeToNode } = useEngine();
  const { engineRef } = useEngineInternal();

  const [isConnected, setIsConnected] = useState(false);
  const [isSpeaking, setIsSpeaking] = useState(false);
  const [currentViseme, setCurrentViseme] = useState('sil');

  // ... handlers and JSX are added in the steps below
}
```
Step 1: Connect to Gabber
Start the graph and connect the client:
```
const handleConnect = async () => {
  try {
    // Get authentication token
    const tokenResponse = await GenerateUserToken();
    const userToken = tokenResponse.token;

    // Start the graph run
    const response = await fetch(apiUrl + '/app/run_from_graph', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer ' + userToken,
      },
      body: JSON.stringify({ run_id: runId, graph: visemeAvatarGraph }),
    });

    const data = await response.json();
    const connectionDetails = data.connection_details;

    // Connect the WebRTC client
    await connect(connectionDetails);
    setIsConnected(true);

    // Now subscribe to outputs
    await subscribeToOutput();
  } catch (err) {
    console.error('Failed to connect:', err);
  }
};
```
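GenerateUserToken here stands in for your own helper that fetches a short-lived user token from your backend, which holds your Gabber API key. A minimal client-side sketch, assuming a hypothetical /api/gabber-token route that you implement server-side (the route name and response shape are illustrative):

```
// Hypothetical helper — assumes your backend exposes /api/gabber-token and mints
// a short-lived user token with your Gabber API key (never ship the key to the client).
async function GenerateUserToken(): Promise<{ token: string }> {
  const res = await fetch('/api/gabber-token', { method: 'POST' });
  if (!res.ok) {
    throw new Error('Failed to fetch Gabber user token: ' + res.status);
  }
  return res.json(); // expected shape: { token: string }
}
```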
Step 2: Subscribe to AI Audio Output
Listen for the AI's voice and play it in the browser:
```
const subscribeToOutput = async () => {
  const outputNodeId = getOutputNodeId(); // Find Output node in graph

  const audioSubscription = await subscribeToNode({
    outputOrPublishNodeId: outputNodeId,
  });

  // Wait for audio track and attach to audio element
  audioSubscription.waitForAudioTrack().then((track) => {
    if (audioOutputRef.current) {
      track.attachToElement(audioOutputRef.current);
      audioOutputRef.current.play();
    }
  });
};
```
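getOutputNodeId (and getPublishNodeId in the next step) are small helpers that look up node IDs in the imported graph JSON. A sketch, assuming the graph file exposes a nodes array with id and type fields; adjust the lookup to match the actual shape of your graph JSON:

```
// Sketch only — assumes viseme_avatar_homepage.json contains a `nodes` array
// with `id` and `type` fields; check your graph JSON for the real structure.
import visemeAvatarGraph from './viseme_avatar_homepage.json';

const findNodeIdByType = (type: string): string => {
  const node = (visemeAvatarGraph.nodes as Array<{ id: string; type: string }>)
    .find((n) => n.type === type);
  if (!node) throw new Error('Node of type "' + type + '" not found in graph');
  return node.id;
};

const getOutputNodeId = () => findNodeIdByType('output');
const getPublishNodeId = () => findNodeIdByType('publish');
```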
Step 3: Publish Microphone Audio
Send user voice to the AI:
```
const toggleMicrophone = async () => {
  if (audioPublication) {
    // Turn off microphone
    audioPublication.unpublish();
    setAudioPublication(null);
  } else {
    // Turn on microphone
    const audioTrack = await getLocalTrack({ type: 'microphone' });
    const publishNodeId = getPublishNodeId(); // Find Publish node in graph

    const pub = await publishToNode({ localTrack: audioTrack, publishNodeId });
    setAudioPublication(pub);
  }
};
```
Step 4: Subscribe to Live Visemes (The Key Part!)
This is where the magic happens. We access the engine's internal pad system to get real-time viseme data:
```
const subscribeToVisemes = () => {
  // engineRef gives us direct access to the Gabber engine
  if (!engineRef.current) {
    console.warn('Engine not ready yet');
    return;
  }

  try {
    // Get the viseme source pad from the LocalViseme node
    const visemeNodeId = 'localviseme_8bc97c52'; // From your graph JSON
    const visemePad = engineRef.current.getSourcePad(visemeNodeId, 'viseme');

    if (!visemePad) {
      console.warn('Viseme pad not found');
      return;
    }

    // Listen for viseme values
    visemePad.on('value', (data) => {
      const viseme = data;
      const visemeShape = viseme?.value;

      if (visemeShape) {
        setCurrentViseme(visemeShape);
        setIsSpeaking(true);

        // Auto-reset to silence after 300ms
        if (visemeTimeoutRef.current) {
          clearTimeout(visemeTimeoutRef.current);
        }
        visemeTimeoutRef.current = setTimeout(() => {
          setCurrentViseme('sil');
          setIsSpeaking(false);
        }, 300);
      }
    });

    console.log('Successfully subscribed to viseme stream');
  } catch (err) {
    console.error('Failed to subscribe to visemes:', err);
  }
};
```
Timing the Viseme Subscription
The engine needs to be fully initialized before we can access pads. Use a useEffect that waits for both connection and engine availability:
```
useEffect(() => {
  if (isConnected && engineRef.current) {
    subscribeToVisemes();
  }
}, [isConnected, engineRef]);
```
Cleanup on Disconnect
Always clean up resources properly:
```
const handleDisconnect = async () => {
  try {
    // Unpublish audio
    if (audioPublication) {
      audioPublication.unpublish();
      setAudioPublication(null);
    }

    // Clear viseme timeout
    if (visemeTimeoutRef.current) {
      clearTimeout(visemeTimeoutRef.current);
    }

    // Disconnect engine
    await disconnect();
    setIsConnected(false);
    setIsSpeaking(false);
    setCurrentViseme('sil');
  } catch (err) {
    console.error('Disconnect error:', err);
  }
};
```
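If the component can unmount while still connected, you may also want to run the same teardown on unmount. A sketch, assuming disconnect from useEngine is stable across renders:

```
// Optional sketch: tidy up if the component unmounts while still connected.
// A ref mirrors the connection state so the unmount callback isn't stale.
const isConnectedRef = useRef(false);
useEffect(() => {
  isConnectedRef.current = isConnected;
}, [isConnected]);

useEffect(() => {
  return () => {
    if (visemeTimeoutRef.current) clearTimeout(visemeTimeoutRef.current);
    if (isConnectedRef.current) {
      disconnect().catch((err) => console.error('Cleanup disconnect failed:', err));
    }
  };
  // Intentionally runs only on unmount
  // eslint-disable-next-line react-hooks/exhaustive-deps
}, []);
```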
Part 3: Putting It All Together
The Complete VRMAvatarDemo Component
```
// VRMAvatarDemo.tsx
import { useState, useRef } from 'react';
import { useEngine, useEngineInternal } from '@gabber/client-react';
import VRMScene from './VRMScene';

export default function VRMAvatarDemo({ apiUrl, runId }) {
  const { connect, disconnect, publishToNode, getLocalTrack, subscribeToNode } = useEngine();
  const { engineRef } = useEngineInternal();

  const [isConnected, setIsConnected] = useState(false);
  const [isSpeaking, setIsSpeaking] = useState(false);
  const [currentViseme, setCurrentViseme] = useState('sil');
  const [audioPublication, setAudioPublication] = useState(null);

  const audioOutputRef = useRef(null);
  const visemeTimeoutRef = useRef(null);

  // handleConnect, handleDisconnect, toggleMicrophone, subscribeToOutput,
  // and subscribeToVisemes come from the steps above

  return (
    <div className="min-h-screen">
      {/* 3D Avatar Viewport */}
      <div style={{ height: '500px' }}>
        <VRMScene isSpeaking={isSpeaking} currentViseme={currentViseme} />
      </div>

      {/* Controls */}
      <button onClick={isConnected ? handleDisconnect : handleConnect}>
        {isConnected ? 'Disconnect' : 'Connect'}
      </button>
      <button onClick={toggleMicrophone} disabled={!isConnected}>
        {audioPublication ? 'Mic On' : 'Mic Off'}
      </button>

      {/* Hidden audio element for AI voice playback */}
      <audio ref={audioOutputRef} />
    </div>
  );
}
```
Drop-In Usage in Any Page
```
// app/avatar-3d/page.tsx
import dynamic from 'next/dynamic';

const VRMAvatarDemo = dynamic(
  () => import('./components/VRMAvatarDemo'),
  { ssr: false }
);

export default function Page() {
  return (
    <VRMAvatarDemo
      apiUrl="https://api.gabber.dev/v1"
      runId="your_unique_run_id"
    />
  );
}
```
Production Tips
Keep the Animation Running
For conversation-heavy apps, keep the talking animation playing continuously. This prevents T-poses during brief silences:
```
// Start with talking animation
const action = mixerRef.current.clipAction(talkingClip);
action.setLoop(THREE.LoopRepeat, Infinity);
action.clampWhenFinished = false;
action.play();
```
Responsive Canvas
Use ResizeObserver to handle window resizing:
```
const handleResize = () => {
  if (!canvasRef.current || !renderer || !camera) return;
  const rect = canvasRef.current.getBoundingClientRect();
  renderer.setSize(rect.width, rect.height);
  camera.aspect = rect.width / rect.height;
  camera.updateProjectionMatrix();
};

const resizeObserver = new ResizeObserver(handleResize);
resizeObserver.observe(canvasRef.current);
```
Mobile Performance
Limit pixel ratio to reduce thermal throttling on mobile devices:
```
renderer.setPixelRatio(Math.min(window.devicePixelRatio, 2));
```
Smooth Viseme Transitions
For even smoother mouth movements, apply lerp-based smoothing to expression values instead of hard-setting them.
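A minimal sketch of that idea; the 0.3 smoothing factor and the targetWeights map are illustrative, not taken from the repo:

```
// Illustrative sketch: ease each expression toward its target instead of snapping.
const SMOOTHING = 0.3; // 0..1, higher = snappier

const applySmoothedVisemes = (
  expressionManager: {
    getValue: (name: string) => number | null;
    setValue: (name: string, value: number) => void;
    update: () => void;
  },
  targetWeights: Record<string, number> // e.g. { aa: 0.8, ee: 0, ih: 0, oh: 0, ou: 0 }
) => {
  for (const [name, target] of Object.entries(targetWeights)) {
    const current = expressionManager.getValue(name) ?? 0;
    // Linear interpolation toward the target weight, called once per frame
    expressionManager.setValue(name, current + (target - current) * SMOOTHING);
  }
  expressionManager.update();
};
```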
Troubleshooting
Problem: Engine not available yet
Solution: Make sure you're waiting for both isConnected and engineRef.current before subscribing to visemes.
Problem: Visemes not updating
Solution: Check that your graph JSON has a LocalViseme node and verify the node ID matches what you're passing to getSourcePad.
Problem: T-pose appears when avatar stops talking
Solution: Keep a looping animation playing at all times rather than trying to switch between idle and talking states.
Problem: Audio latency
Solution: Gabber's WebRTC infrastructure already provides sub-200ms latency. If you're experiencing delays, check your network connection and ensure you're not running heavy computations on the main thread.
Visual Guide
The complete interface showing the VRM avatar, control panel, and live viseme indicators.
The Gabber graph orchestrating STT, LLM, TTS, and LocalViseme nodes.
Why Gabber for Real-Time AI?
Traditional approaches require you to manually orchestrate:
- WebRTC connections
- STT/TTS API calls
- LLM request/response handling
- Audio stream management
- Viseme generation and timing
Gabber gives you all of this in a single graph with:
- Millisecond-level streaming for STT, TTS, and multi-modal AI
- Orchestrated graphs that handle the entire pipeline
- React SDK with clean hooks and direct pad access for advanced features
- Built-in viseme generation perfectly synced to audio output
If you're building real-time voice or video AI apps, Gabber eliminates weeks of infrastructure work.
Next Steps
- Replace placeholder images with your custom renders
- Swap the VRM model to match your brand or character
- Customize the Mixamo animations for different moods
- Add camera input and connect a VLM node to let the avatar see users
- Deploy to Vercel and share your interactive demo
👉 Ready to build? Start with Gabber
Full Working Example: Check the app/avatar-3d directory in this repo for the complete implementation with Three.js scene, Gabber connection, and live viseme lip-sync.
You can clone it, swap your credentials, and have a working 3D AI avatar in under 10 minutes.