LTX 2.3 Technical Deep Dive: How Lightricks' 22B Video Model Works Under the Hood
LTX 2.3 is Lightricks' latest open-source video generation model: 22 billion parameters, native 4K output, up to 50fps, and built-in audio synchronization. This article walks through its architecture, the available checkpoints, and what it takes to run them.
Architecture: DiT-Based Video Foundation Model
LTX 2.3 is a Diffusion Transformer (DiT) model, not a UNet. This matters because DiT scales better with parameter count and handles temporal consistency across frames more naturally than convolutional architectures.
Key specs:
22B parameters (up from ~8B in LTX-Video 1.x)
Native resolution: up to 2560×1440
Frame rate: up to 50fps
Aspect ratios: 16:9, 9:16 (portrait-native), 1:1
The New VAE
LTX 2.3 ships with a completely rebuilt VAE (taeltx2_3.safetensors). The previous VAE produced visible texture artifacts at high motion. The new one:
Encodes at a higher spatial compression ratio
Preserves fine detail (hair, fabric, text) significantly better
Is required for all setups: the VAE from earlier LTX 2.x releases does not work with 2.3 checkpoints
Text Encoder: Gemma 3 12B
LTX 2.3 replaces the T5 encoder with Gemma 3 12B, a 12-billion-parameter language model. The larger encoder is the main reason prompt understanding improved over earlier releases. Gemma 3 handles:
Long, complex prompts (200+ tokens)
Compositional instructions ("camera pans left while subject walks right")
Text rendering within video frames
The tradeoff: Gemma 3 adds ~8GB VRAM overhead. On 16GB GPUs, you need the FP8 quantized checkpoint to fit everything in memory.
Quantization: FP8 vs Full Precision
| Variant | VRAM | File | Use case |
|---|---|---|---|
| FP8 Distilled (Kijai) | 16GB | ltx-2.3-22b-distilled_transformer_only_fp8_input_scaled_v3.safetensors | RTX 4080/4090, fastest |
| FP8 Dev (Kijai) | 16GB | ltx-2.3-22b-dev_transformer_only_fp8_input_scaled.safetensors | LoRA training/inference |
| Full Precision | 32GB | Official HuggingFace checkpoint | Research, highest quality |
The FP8 checkpoints use input-scaled quantization, which preserves activation ranges better than naive FP8 casting and produces output close to BF16 at roughly half the weight memory.
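The "half the memory" figure follows from bytes per parameter: BF16 stores each weight in 2 bytes and FP8 in 1. The sketch below is a back-of-the-envelope estimate only. The helper name `weight_gib` is made up for illustration, and it counts transformer weights alone, ignoring activations, the VAE, and the text encoder.

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
# Counts weights only; activations, the VAE, and the text encoder are extra.

BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}

TRANSFORMER_PARAMS = 22e9  # parameter count quoted in this article


def weight_gib(num_params: float, dtype: str) -> float:
    """Return the size of the weights alone, in GiB (hypothetical helper)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in ("bf16", "fp8"):
    print(f"{dtype}: {weight_gib(TRANSFORMER_PARAMS, dtype):.1f} GiB of weights")
```

These are weights-only numbers, so they are not directly comparable to the VRAM column in the table, which also depends on how the runtime loads and offloads the model.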
Two-Stage Pipeline in ComfyUI
LTX 2.3 supports a two-stage generation pipeline:
Stage 1: Generate at base resolution (e.g., 768×432)
Stage 2: Pass through the spatial upscaler (ltx-2.3-spatial-upscaler-x2-1.0.safetensors) to 2x resolution
This is more efficient than generating at full resolution in one pass — the upscaler is a lightweight model that adds detail without re-running the full 22B transformer.
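To make the saving concrete, here is a small arithmetic sketch of the two stages. The function name `two_stage_plan` is hypothetical, and the attention estimate rests on two assumptions: that token count grows in proportion to pixel count, and that self-attention cost grows roughly with the square of the token count.

```python
def two_stage_plan(base_w: int, base_h: int, scale: int = 2) -> dict:
    """Describe a base-resolution pass followed by a spatial upscale."""
    out_w, out_h = base_w * scale, base_h * scale
    pixel_ratio = (out_w * out_h) / (base_w * base_h)
    return {
        "stage1": (base_w, base_h),
        "stage2": (out_w, out_h),
        # How many more pixels a single full-resolution pass would process.
        "pixel_ratio": pixel_ratio,
        # Assumption: tokens scale with pixels, attention scales with tokens^2.
        "attention_cost_ratio": pixel_ratio ** 2,
    }


plan = two_stage_plan(768, 432)
print(plan)
```

With the 768×432 example above, stage 2 lands at 1536×864, which is four times the pixels of the base pass. Under the stated assumptions, running the transformer directly at that size would cost far more in attention than running it once at base resolution and handing off to the upscaler.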
Audio-Video Synchronization
LTX 2.3 is a native audio-video model. The audio conditioning is baked into the architecture — not a post-processing step. You can provide:
Speech audio → lip sync
Music → motion rhythm matching
Ambient sound → environmental motion
In ComfyUI, this is handled via the LTXVAudioCondition node from the official ComfyUI-LTXVideo extension.
Running It Efficiently
For 16GB VRAM (e.g., RTX 4080):
Use FP8 Distilled checkpoint
Launch ComfyUI with --fp8_e4m3fn-unet (and --fp8_e4m3fn-text-enc for the text encoder) to keep weights in FP8
Use tiled VAE decoding for resolutions above 1080p
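Tiled decoding keeps peak memory down by decoding the frame in overlapping patches instead of all at once. The sketch below only shows how a frame can be split into such patches. It is a generic illustration, not the implementation used by any ComfyUI node, and the tile size and overlap are arbitrary assumed values. A real decoder would also work on latents and blend the overlapping regions.

```python
def tile_boxes(width: int, height: int, tile: int = 512, overlap: int = 64):
    """Yield (x0, y0, x1, y1) boxes covering the frame with overlapping tiles."""
    step = tile - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the tile size")

    def starts(size: int) -> list:
        # A frame no larger than one tile needs a single tile at the origin.
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, step))
        positions.append(size - tile)  # last tile sits flush with the edge
        return positions

    for y0 in starts(height):
        for x0 in starts(width):
            yield (x0, y0, min(x0 + tile, width), min(y0 + tile, height))


boxes = list(tile_boxes(2560, 1440))
print(len(boxes), "tiles; first:", boxes[0], "last:", boxes[-1])
```

Only one tile needs to be resident at a time, so peak memory depends on the tile size rather than on the full output resolution.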
For 24GB VRAM:
Use the Dev FP8 checkpoint if applying LoRA
Full precision VAE decoding works at 1080p
For 32GB+ VRAM:
Full precision checkpoint, no quantization needed
Two-stage pipeline at 4K is feasible
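The tiers above can be written down as a small lookup. This is a sketch of the article's recommendations, not an official tool: the `recommend` function and `Recommendation` type are hypothetical, and the choice of the distilled FP8 file for 24GB cards without LoRA is an assumption, since the list only covers the LoRA case for that tier.

```python
from dataclasses import dataclass

# File names as listed in the quantization table.
FP8_DISTILLED = "ltx-2.3-22b-distilled_transformer_only_fp8_input_scaled_v3.safetensors"
FP8_DEV = "ltx-2.3-22b-dev_transformer_only_fp8_input_scaled.safetensors"
FULL_PRECISION = "Official HuggingFace checkpoint"


@dataclass(frozen=True)
class Recommendation:
    checkpoint: str
    note: str


def recommend(vram_gb: float, use_lora: bool = False) -> Recommendation:
    """Map available VRAM to the checkpoint tier described in this article."""
    if vram_gb < 16:
        raise ValueError("the article gives no guidance below 16GB of VRAM")
    if vram_gb >= 32:
        return Recommendation(FULL_PRECISION, "two-stage pipeline at 4K is feasible")
    # Assumption: without LoRA, 24GB cards also use the distilled FP8 file.
    checkpoint = FP8_DEV if use_lora else FP8_DISTILLED
    if vram_gb >= 24:
        return Recommendation(checkpoint, "full precision VAE decoding works at 1080p")
    return Recommendation(checkpoint, "use tiled VAE decoding above 1080p")


print(recommend(16))
print(recommend(24, use_lora=True))
```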
Where to Get Everything
All model files, VRAM recommendations, and ready-to-use ComfyUI workflow JSON are available at ltxworkflow.com — free, no sign-up required.
