LTX 2.3 Technical Deep Dive: How Lightricks' 22B Video Model Works Under the Hood

LTX 2.3 is Lightricks' latest open-source video generation model — 22 billion parameters, native 4K output, 50fps, and built-in audio synchronization. Here's what's actually happening under the hood.

Architecture: DiT-Based Video Foundation Model

LTX 2.3 is a Diffusion Transformer (DiT) model, not a UNet. This matters because DiT scales better with parameter count and handles temporal consistency across frames more naturally than convolutional architectures.

Key specs:

  • 22B parameters (up from ~8B in LTX-Video 1.x)

  • Native resolution: up to 2560×1440

  • Frame rate: up to 50fps

  • Aspect ratios: 16:9, 9:16 (portrait-native), 1:1

The New VAE

LTX 2.3 ships with a completely rebuilt VAE (taeltx2_3.safetensors). The previous VAE produced visible texture artifacts at high motion. The new one:

  • Encodes at a higher spatial compression ratio

  • Preserves fine detail (hair, fabric, text) significantly better

  • Is required for all setups: you cannot pair a VAE from an earlier LTX 2.x release with 2.3 checkpoints

Text Encoder: Gemma 3 12B

LTX 2.3 replaces the T5 encoder with Gemma 3 12B — a 12-billion-parameter language model. This is why prompt understanding improved so dramatically. Gemma 3 handles:

  • Long, complex prompts (200+ tokens)

  • Compositional instructions ("camera pans left while subject walks right")

  • Text rendering within video frames

The tradeoff: Gemma 3 adds ~8GB VRAM overhead. On 16GB GPUs, you need the FP8 quantized checkpoint to fit everything in memory.

Quantization: FP8 vs Full Precision

| Variant | VRAM | File | Use case |
|---|---|---|---|
| FP8 Distilled (Kijai) | 16GB | ltx-2.3-22b-distilled_transformer_only_fp8_input_scaled_v3.safetensors | RTX 4080/4090, fastest |
| FP8 Dev (Kijai) | 16GB | ltx-2.3-22b-dev_transformer_only_fp8_input_scaled.safetensors | LoRA training/inference |
| Full Precision | 32GB | Official HuggingFace checkpoint | Research, highest quality |

The FP8 checkpoints use input-scaled quantization, which preserves activation ranges better than naive FP8 casting and produces near-identical output to BF16 at roughly half the weight memory (one byte per parameter instead of two).
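
To make the "half the memory" claim concrete, here is a minimal back-of-the-envelope sketch in Python. It counts transformer weights only, using the 22B parameter figure from the specs above. The function name and the byte sizes table are illustrative, and the result will not match the VRAM column in the table, which also reflects activations, the text encoder, and whatever offloading a given setup uses.

```python
# Weights-only estimate: parameter count x bytes per parameter.
# Ignores activations, the text encoder, the VAE, and any offloading.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_gib(num_params: float, dtype: str) -> float:
    """Return the weight footprint in GiB for a given storage dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


params = 22e9  # 22B parameters, as stated in the key specs
bf16 = weight_gib(params, "bf16")
fp8 = weight_gib(params, "fp8")
print(f"bf16: {bf16:.1f} GiB, fp8: {fp8:.1f} GiB, ratio: {fp8 / bf16:.2f}")
```

The ratio is exactly 0.5 by construction; the absolute numbers are only a lower bound on what a real run needs.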

Two-Stage Pipeline in ComfyUI

LTX 2.3 supports a two-stage generation pipeline:

  1. Stage 1: Generate at base resolution (e.g., 768×432)

  2. Stage 2: Pass through the spatial upscaler (ltx-2.3-spatial-upscaler-x2-1.0.safetensors) to 2x resolution

This is more efficient than generating at full resolution in one pass — the upscaler is a lightweight model that adds detail without re-running the full 22B transformer.
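
A quick way to see where the saving comes from is to compare pixel counts. The sketch below uses the 768×432 example and the 2x upscaler factor from the steps above; the helper name is made up for illustration, and the real token count seen by the transformer also depends on the VAE compression ratio, which is not modeled here.

```python
def upscaled(width: int, height: int, factor: int = 2) -> tuple[int, int]:
    """Resolution after the spatial upscaler (factor 2 for the x2 upscaler)."""
    return width * factor, height * factor


base = (768, 432)          # stage 1 example resolution from the article
final = upscaled(*base)    # stage 2 output resolution

# The 22B transformer only runs at the base resolution, so it processes
# base_pixels per frame instead of final_pixels.
base_pixels = base[0] * base[1]
final_pixels = final[0] * final[1]
pixel_ratio = final_pixels / base_pixels

print(final, pixel_ratio)
```

With a 2x spatial factor the transformer works on a quarter of the final pixel count per frame, and the lightweight upscaler handles the rest.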

Audio-Video Synchronization

LTX 2.3 is a native audio-video model. The audio conditioning is baked into the architecture — not a post-processing step. You can provide:

  • Speech audio → lip sync

  • Music → motion rhythm matching

  • Ambient sound → environmental motion

In ComfyUI, this is handled via the LTXVAudioCondition node from the official ComfyUI-LTXVideo extension.

Running It Efficiently

For 16GB VRAM (e.g., RTX 4080):

  • Use FP8 Distilled checkpoint

  • Enable FP8 precision through ComfyUI's launch flags (e.g., --fp8_e4m3fn-unet for the diffusion model weights)

  • Use tiled VAE decoding for resolutions above 1080p
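
Tiled decoding keeps peak memory down by decoding the frame in overlapping patches instead of all at once. The sketch below only shows the tile layout, not the decode or the blending, and it is not ComfyUI's implementation: the function names, the 512-pixel tile size, and the 64-pixel overlap are illustrative assumptions.

```python
def tile_starts(length: int, tile: int, overlap: int) -> list[int]:
    """Start offsets so tiles of size `tile` cover [0, length) with overlap."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")
    if tile >= length:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the far edge
    return starts


def tile_boxes(width: int, height: int, tile: int = 512, overlap: int = 64):
    """Return (x0, y0, x1, y1) boxes covering a width x height frame."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_starts(height, tile, overlap)
        for x in tile_starts(width, tile, overlap)
    ]


boxes = tile_boxes(2560, 1440)  # hypothetical tile/overlap values
print(len(boxes), boxes[0], boxes[-1])
```

Each box would be decoded separately and the overlapping margins blended to hide seams; smaller tiles lower peak memory at the cost of more decode calls.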

For 24GB VRAM:

  • Use the Dev FP8 checkpoint if applying LoRA

  • Full precision VAE decoding works at 1080p

For 32GB+ VRAM:

  • Full precision checkpoint, no quantization needed

  • Two-stage pipeline at 4K is feasible
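
The three tiers above can be condensed into a small lookup. This is a hypothetical helper, not part of ComfyUI or any LTX tooling; it simply encodes the recommendations from this section and the quantization table. One assumption is flagged in the code: the article does not say which checkpoint to use on 24GB without LoRA, so the sketch falls back to FP8 Distilled.

```python
def recommend_setup(vram_gb: float, use_lora: bool = False) -> dict:
    """Map available VRAM to the checkpoint and notes listed in this section."""
    if vram_gb < 16:
        raise ValueError("the tiers here start at 16GB; smaller GPUs are not covered")
    if vram_gb >= 32:
        return {
            "checkpoint": "Full Precision",
            "notes": ["no quantization needed", "two-stage pipeline at 4K is feasible"],
        }
    # The table lists FP8 Dev for LoRA work; otherwise FP8 Distilled.
    # Assumption: the same rule applies to the 24GB tier when no LoRA is used.
    checkpoint = "FP8 Dev" if use_lora else "FP8 Distilled"
    if vram_gb >= 24:
        notes = ["full precision VAE decoding works at 1080p"]
    else:
        notes = ["use tiled VAE decoding above 1080p"]
    return {"checkpoint": checkpoint, "notes": notes}


for vram in (16, 24, 32):
    print(vram, recommend_setup(vram, use_lora=(vram == 24)))
```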

Where to Get Everything

All model files, VRAM recommendations, and ready-to-use ComfyUI workflow JSON are available at ltxworkflow.com — free, no sign-up required.