Research Projects

Show2Instruct

Show2Instruct envisions AI systems that respond to spoken, context-specific queries – such as “Do all the windows in this room meet the BIM specifications and accessibility standards?” – with real-time, visually grounded answers from an AI-powered device.

In the context of a construction site, this could mean smart glasses providing real-time answers to complex compliance questions. In collaboration with industry and academic partners, we are developing multimodal AI agents that integrate large language models with computer vision to enable real-time, context-aware reasoning – demonstrating how generative AI can transform decision-making and workflows in the construction industry.

Our focus is on developing a new generation of human-machine interfaces by architecting a scalable platform for context-aware, multimodal instruction-following systems that perform reliably in high-noise, high-variability environments like construction sites.

Key Contributions and Technical Innovations:

Multimodal Fusion

Integration of LLMs with visual scene understanding models to enable grounded, contextual responses (see the sketch after this list).

Real-Time Interaction

Development of low-latency inference pipelines suitable for edge deployment (e.g., AR glasses).

Domain-Specific Reasoning

Fine-tuning models on construction-specific tasks (e.g., BIM compliance checks, safety standard verification).

Human-Centered Interfaces

Design of natural language-driven workflows to reduce cognitive load and streamline decision-making.

Scalable Architecture

Modular system design enabling deployment across diverse sites and tasks.
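
To make the fusion idea concrete, here is a minimal sketch of how detections from a vision stage could be serialized into an LLM prompt so that answers stay grounded in the observed scene. All names, types, and the prompt format are illustrative assumptions, not Show2Instruct’s actual interfaces.

```python
from dataclasses import dataclass

# Placeholder interfaces: Show2Instruct's actual models and schemas are not
# public, so the types and functions below are illustrative stand-ins.

@dataclass
class Detection:
    label: str        # e.g. "window"
    bbox: tuple       # (x, y, w, h) in image coordinates
    attributes: dict  # measured properties, e.g. {"sill_height_mm": 850}

def detect_scene(frame: bytes) -> list[Detection]:
    """Stub vision stage: a real system would run object detection here."""
    return [Detection("window", (120, 40, 300, 420), {"sill_height_mm": 850})]

def build_grounded_prompt(question: str, detections: list[Detection]) -> str:
    """Serialize the visual scene into the LLM context so the answer can
    reference concrete, localized objects instead of guessing."""
    scene = "\n".join(
        f"- {d.label} at {d.bbox}, attributes {d.attributes}" for d in detections
    )
    return (
        "You are an on-site compliance assistant.\n"
        f"Visible objects:\n{scene}\n\n"
        f"Question: {question}\n"
        "Answer only from the listed objects and attributes."
    )

if __name__ == "__main__":
    prompt = build_grounded_prompt(
        "Do all windows in this room meet the accessibility standards?",
        detect_scene(b"<camera frame>"),
    )
    print(prompt)  # this string would be sent to the LLM backend
```

In a real pipeline, the stub detector would be replaced by an on-device vision model; the key idea is that the language model answers only from explicitly listed, localized evidence.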

MuvAko

MuvAko is a forward-looking research initiative based in Saxony, Germany, focused on leveraging generative AI and spatial computing to develop multimodal, context-aware assistance systems for virtual and mixed reality.

At its core, MuvAko explores how these cutting-edge technologies enhance the user experience and add measurable value across digital sales processes. In collaboration with our partners, we are addressing key research challenges such as:

  • How can spatially aware AI systems capture and interpret multimodal context in dynamic, real-time e-commerce scenarios?

  • How can recommendation engines be personalized using real-time user data?

  • How can context-aware AI assistants provide meaningful support within immersive shopping environments?

Beyond technical exploration, MuvAko aims to push the boundaries of usability, personalization, and accessibility in AI-driven product presentation and delivery – laying the foundation for the next generation of immersive, intelligent e-commerce platforms.

Key Contributions and Technical Innovations:

Real-Time Multimodal Context Sensing

Fusion of spatial, visual, and behavioral data to guide AI decision-making.

Adaptive Personalization Engines

Dynamic tailoring of recommendations based on user interactions and spatial cues (a minimal scoring sketch follows this list).

Intelligent Virtual Assistants

Context-aware agents that assist users throughout immersive shopping journeys.

Enhanced Usability in XR Interfaces

Streamlined interaction models for intuitive engagement in mixed reality.

Scalable Architecture

Modular system design for integration into diverse digital commerce platforms.
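
As a rough illustration of adaptive personalization, the sketch below blends behavioral signals (gaze dwell, interactions) with a spatial cue (distance) into one relevance score for ranking products in a virtual showroom. The signal names, weights, and decay functions are assumptions made up for this example, not MuvAko’s actual engine.

```python
import math
from dataclasses import dataclass

# Hypothetical signal model: these names and weights do not come from MuvAko;
# they only illustrate fusing spatial and behavioral cues into one score.

@dataclass
class ProductSignal:
    product_id: str
    gaze_seconds: float  # behavioral: accumulated gaze dwell time
    distance_m: float    # spatial: user's current distance to the product
    interactions: int    # behavioral: grabs, rotations, detail views

def relevance(s: ProductSignal,
              w_gaze: float = 0.5, w_dist: float = 0.2, w_int: float = 0.3) -> float:
    """Blend the cues into one score; closer, more-engaged items rank higher."""
    gaze = 1 - math.exp(-s.gaze_seconds / 5.0)  # saturating dwell signal
    prox = 1 / (1 + s.distance_m)               # decays with distance
    inter = min(s.interactions / 3.0, 1.0)      # capped interaction count
    return w_gaze * gaze + w_dist * prox + w_int * inter

signals = [
    ProductSignal("sofa-01", gaze_seconds=8.0, distance_m=1.2, interactions=2),
    ProductSignal("lamp-07", gaze_seconds=1.5, distance_m=4.0, interactions=0),
]
for s in sorted(signals, key=relevance, reverse=True):
    print(s.product_id, round(relevance(s), 3))
```

The saturating and capped terms keep any single signal from dominating the ranking; a production engine would learn such weights from interaction data rather than hand-tuning them.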

REACT

What if machines could truly understand their surroundings – and respond in real time?

Large Language Models (LLMs) have revolutionized human-computer interaction by making complex knowledge accessible through natural language, bridging the gap between humans and machines. Yet current multimodal large language models (MLLMs) still fall short when it comes to real-world awareness: they often lack an understanding of when and where something is happening – elements essential for interacting reliably with the real world. That’s why we’re building REACT: a fine-tuned MLLM designed to understand spatial and temporal context by analyzing video and sensor data. Our goal is to enable machines to understand and follow complex human instructions in dynamic environments.

We’re putting REACT to the test with a real-world scenario: a mobile robot that takes a coffee order directly from a person, operates the coffee machine, and serves the drink – even as conditions around it change. By combining cutting-edge AI with real-time contextual understanding, REACT takes human-machine interaction to the next level. 
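
To show what this looks like in miniature, the sketch below grounds a spoken order in the robot’s current world state before committing to a plan, inserting recovery steps when the environment does not match expectations. The step vocabulary and world model are invented for illustration and are not REACT’s actual planner.

```python
from dataclasses import dataclass

# Toy planner for the coffee scenario: the step vocabulary and world state
# below are invented for illustration and are not REACT's real interface.

@dataclass
class WorldState:
    objects: set         # objects the robot currently perceives
    machine_ready: bool  # whether the coffee machine reports ready

PLAN_TEMPLATE = ["navigate_to(machine)", "insert_cup", "press_brew", "deliver(cup)"]

def plan(order: str, world: WorldState) -> list[str]:
    """Ground a spoken order in the perceived world state before acting."""
    if "coffee" not in order.lower():
        raise ValueError(f"no plan template for order: {order!r}")
    steps = list(PLAN_TEMPLATE)
    if not world.machine_ready:
        steps.insert(0, "wait(machine_ready)")  # environment not ready yet
    if "cup" not in world.objects:
        steps.insert(0, "fetch(cup)")  # recover: the expected cup is missing
    return steps

world = WorldState(objects={"machine"}, machine_ready=True)  # no cup in view
print(plan("One coffee, please", world))
# -> ['fetch(cup)', 'navigate_to(machine)', 'insert_cup', 'press_brew', 'deliver(cup)']
```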

Key Contributions and Technical Innovations:

Spatiotemporal Contextualization

Fine-tuning MLLMs to interpret spatiotemporal sequences from multimodal inputs.

Real-Time Reasoning

Enabling fast, context-aware decision-making in dynamic environments.

Grounded Instruction Following

Translating natural language into executable steps tied to the environment.

Sensor + Video Fusion

Integrating first- and third-person video with real-world sensor data for richer scene understanding (see the time-alignment sketch after this list).

Autonomous Task Execution

Demonstrating end-to-end autonomy in a physical human-interaction scenario.
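
As a small illustration of the fusion step, the sketch below pairs each video frame with the sensor reading nearest in time, the kind of timestamp alignment such a pipeline needs before any model can reason over both streams. Field names and structures are assumptions for this example, not REACT’s data format.

```python
import bisect
from dataclasses import dataclass

# Minimal time-alignment sketch: frames and sensor readings must share a
# common timeline before any model reasons over both streams. The field
# names here are illustrative, not REACT's actual data format.

@dataclass
class Frame:
    t: float     # capture time in seconds
    source: str  # "egocentric" or "third_person"

@dataclass
class SensorReading:
    t: float
    values: dict  # e.g. {"door_open": True}

def nearest_reading(readings: list[SensorReading], t: float) -> SensorReading:
    """Pair a frame timestamp with the sensor reading closest in time."""
    times = [r.t for r in readings]  # readings are assumed sorted by time
    i = bisect.bisect_left(times, t)
    candidates = readings[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda r: abs(r.t - t))

frames = [Frame(0.00, "egocentric"), Frame(0.04, "third_person")]
readings = [SensorReading(0.01, {"door_open": False}),
            SensorReading(0.05, {"door_open": True})]

for f in frames:
    print(f.source, f.t, "->", nearest_reading(readings, f.t).values)
```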
