Abstract

We present a scalable pipeline for synthetic video data generation that combines the Ramblr Data Engine with NVIDIA Cosmos Transfer 2.5, a controllable world-model-based video generation system. By applying controllable generation to real egocentric footage, we expand small, domain-specific datasets while preserving object identity, spatial structure, and action semantics. We demonstrate the approach on a robotics manipulation use case and show that augmenting a scarce coffee-making dataset with synthetic variants yields consistent gains in Visual Question Answering (VQA) performance for process understanding, supporting the practical use of the pipeline for instruction tuning of Vision–Language Models (VLMs).

Introduction

Collecting and annotating real-world video at scale is expensive and often impractical in industrial and robotics settings, where environments, lighting conditions, and object appearances vary widely. At the same time, models trained on narrow distributions may fail when deployed in even slightly different contexts.

Synthetic data generation offers a path forward. The Ramblr Data Engine (RDE) automates annotation and scene parsing of egocentric videos, while NVIDIA Cosmos Transfer 2.5 [1] is a multi-control diffusion-based video generation model capable of re-rendering scenes with varied backgrounds, lighting, and material appearances while respecting structural control signals. Combining the two yields an end-to-end video augmentation pipeline, from raw video to annotation-rich training data, that is both scalable and suitable for practical deployment.

Methods

The Ramblr Data Engine

The Ramblr Data Engine is a modular video processing and annotation platform that runs semi-automatic and fully automatic detection, segmentation, and scene-graph extraction pipelines to produce a rich set of structural signals: object bounding boxes and segmentation maps, object relations and attributes, and activities. These signals can further serve as input for high-level information extraction in the form of natural-language instructions, as we have shown previously in Instruction Generation.
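To make the shape of these structural signals concrete, the sketch below models a per-frame annotation record. The field names and types are illustrative assumptions for this post, not the actual RDE output format.

```python
from dataclasses import dataclass, field

@dataclass
class FrameAnnotation:
    """Illustrative per-frame schema for the structural signals described above."""
    frame_index: int
    # object id -> axis-aligned bounding box (x1, y1, x2, y2) in pixels
    boxes: dict = field(default_factory=dict)
    # object id -> list of segmentation polygons (flattened x, y coordinates)
    segmentation: dict = field(default_factory=dict)
    # scene-graph edges as (subject, predicate, object) triples
    relations: list = field(default_factory=list)
    # detected activity label for this frame, if any (e.g. "pick", "place")
    activity: str = None

ann = FrameAnnotation(
    frame_index=0,
    boxes={"cup": (120.0, 80.0, 200.0, 160.0)},
    segmentation={"cup": [[120.0, 80.0, 200.0, 80.0, 200.0, 160.0]]},
    relations=[("hand", "grasping", "cup")],
    activity="pick",
)
```

A record like this carries everything downstream stages need: the boxes and masks become spatial control signals, while the relations and activity labels feed scene-graph extraction and instruction generation.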

NVIDIA Cosmos Transfer 2.5

Cosmos Transfer 2.5 is a controllable world-model-based video generation system that combines multiple spatial control signals with a text prompt to produce photorealistic video outputs. Key properties that make it suited for this application are:

  • Multi-control conditioning on depth, segmentation, and edge maps provides fine-grained structural fidelity.
  • Scene-level appearance transfer allows variations in backgrounds, lighting, and material properties while preserving key objects and spatial structure.
  • Scalable inference supports batch generation from a single reference clip across multiple prompts.

Combining Cosmos Transfer 2.5 with the structural signals produced by the RDE allows precise control over what changes, such as surface appearance and environment, and what remains fixed, including object identity, motion semantics, and action intent.

Figure 1 illustrates the end-to-end pipeline. We use the RDE to compute video segmentation maps and depth maps and to detect activities, which are then processed through scene-graph extraction to capture the underlying intent of the human or robot action in the video. This scene description is provided to an LLM, which generates multiple prompts that preserve action semantics while varying background conditions and appearance attributes.
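The prompt-generation step can be illustrated with a minimal sketch: a fixed scene/action description is combined with varied appearance attributes so that every prompt describes the same action under different conditions. In the actual pipeline an LLM proposes the variations; here a simple template expansion with hypothetical example strings stands in.

```python
def build_prompts(scene_description, variations):
    """Combine a fixed scene/action description with varied appearance
    attributes, preserving action semantics across all prompts."""
    return [f"{scene_description} Scene rendered with {v}." for v in variations]

prompts = build_prompts(
    "A robot arm picks up a cup and places it on a tray.",
    [
        "warm indoor lighting and a wooden tabletop",
        "cool fluorescent lighting and a metal workbench",
    ],
)
```

Because the action clause is held constant while only appearance attributes vary, the downstream generations differ visually but describe the same manipulation sequence.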

Figure 1: End-to-end synthetic data generation pipeline combining Ramblr Data Engine and Cosmos Transfer 2.5.

Robotics Application

To illustrate the approach in a practical setting, we apply the pipeline to egocentric footage of a manipulation task: a robot arm performing a pick-and-place operation on a tabletop surface.

Figure 2: Visual examples of synthetic data generation and multi-view video data.

Starting from a small set of real recordings, the pipeline generates visually diverse variants of the same manipulation sequence while preserving the temporal structure of the task. In the resulting videos, scene appearance, illumination, and surface materials can be altered without changing the core interaction sequence, enabling the creation of broader training distributions from limited real-world demonstrations. We include a qualitative video example to illustrate the fidelity of the generated variants and the consistency of the underlying action semantics across renderings.

Instruction Tuning Results

We evaluate the impact of synthetic data augmentation on VQA performance for process understanding, using Ramblr’s Coffee-Making 101 dataset. The dataset consists of 47 annotated egocentric clips of a capsule-based coffee-making workflow spanning 11 activity classes, simulating a typical industrial data-scarcity scenario in which collecting additional real footage is constrained.

More specifically, the VQA benchmark consists of multiple-choice questions generated end-to-end via the instruction-set generation feature of the RDE. As shown in Figure 3, synthetic augmentation improves macro-F1 across all low-data regimes. With only one real training video, macro-F1 increases from 43.5 ± 13.7% to 64.5 ± 9.2%, a gain of 21.0 percentage points. With three real videos, performance rises from 52.5 ± 10.3% to 69.0 ± 6.0% (+16.5 pp), and with five real videos, from 61.8 ± 11.1% to 71.9 ± 10.1% (+10.1 pp). The gains are largest in the most data-scarce setting, but remain consistently positive as the number of real training examples increases.
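For readers unfamiliar with the metric, macro-F1 is the per-class F1 score averaged uniformly over classes, so rare activity classes weigh as much as frequent ones. A minimal stdlib implementation, with hypothetical activity labels as example data:

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Per-class F1 averaged uniformly over all observed classes."""
    classes = set(y_true) | set(y_pred)
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1       # correct prediction for class t
        else:
            fp[p] += 1       # predicted p, but true class was t
            fn[t] += 1       # missed an instance of class t
    f1s = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(f1s) / len(f1s)

# Toy example with made-up activity labels:
score = macro_f1(["pick", "place", "pick", "pour"],
                 ["pick", "place", "place", "pour"])  # 7/9 ≈ 0.778
```

Averaging over classes rather than samples is the appropriate choice here, since the 11 activity classes are unlikely to be balanced across the 47 clips.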

These results suggest that synthetic variants are particularly valuable when only a handful of real examples are available, while still providing measurable benefits as the real dataset grows.

Figure 3: VLM performance (Gemini 2.5 Flash Lite) on the coffee-making MCQ task.

Conclusion

We have presented a practical, scalable pipeline for synthetic video data generation by combining the Ramblr Data Engine with NVIDIA Cosmos Transfer 2.5. The pipeline requires only a small seed collection of annotated real footage, automates the derivation of spatial control signals, and leverages scene-graph-driven prompt diversity to generate semantically consistent yet visually varied training data.

Results on the coffee-making VQA benchmark show that augmenting scarce real data with synthetic variants yields substantial and consistent improvements in process-understanding performance, supporting the use of this approach for instruction tuning of VLMs in industrial and robotics contexts. As controllable video generation models continue to improve in fidelity, consistency, and controllability, the value of such pipelines is likely to increase further, especially in data-scarce application domains.

References

[1] NVIDIA et al. World Simulation with Video Foundation Models for Physical AI. arXiv preprint arXiv:2511.00062, 2025.