Research Projects

Show2Instruct

Show2Instruct envisions AI systems that respond to spoken, context-specific queries – such as “Do all the windows in this room meet the BIM specifications and accessibility standards?” – with real-time, visually grounded answers from an AI-powered device.

In the context of a construction site, this could mean smart glasses providing real-time answers to complex compliance questions. In collaboration with industry and academic partners, we are developing multimodal AI agents that integrate large language models with computer vision to enable real-time, context-aware reasoning – demonstrating how generative AI can transform decision-making and workflows in the construction industry.

Our focus is on developing a new generation of human-machine interfaces by architecting a scalable platform for context-aware, multimodal instruction-following systems that perform reliably in high-noise, high-variability environments like construction sites.

Key Contributions and Technical Innovations:

Multimodal Fusion

Integration of LLMs with visual scene understanding models to enable grounded, contextual responses (see the sketch after this list).

Real-Time Interaction

Development of low-latency inference pipelines suitable for edge deployment (e.g., AR glasses).

Domain-Specific Reasoning

Fine-tuning models on construction-specific tasks (e.g., BIM compliance checks, safety standard verification).

Human-Centered Interfaces

Design of natural language-driven workflows to reduce cognitive load and streamline decision-making.

Scalable Architecture

Modular system design enabling deployment across diverse sites and tasks.
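
To make the fusion idea concrete, here is a minimal sketch of how detections from a vision stage could be serialized into an LLM prompt so that answers stay grounded in the observed scene. All names, types, and the prompt format are illustrative assumptions, not Show2Instruct’s actual interfaces.

```python
from dataclasses import dataclass

# Placeholder interfaces: Show2Instruct's actual models and schemas are not
# public, so the types and functions below are illustrative stand-ins.

@dataclass
class Detection:
    label: str        # e.g. "window"
    bbox: tuple       # (x, y, w, h) in image coordinates
    attributes: dict  # measured properties, e.g. {"sill_height_mm": 850}

def detect_scene(frame: bytes) -> list[Detection]:
    """Stub vision stage: a real system would run object detection here."""
    return [Detection("window", (120, 40, 300, 420), {"sill_height_mm": 850})]

def build_grounded_prompt(question: str, detections: list[Detection]) -> str:
    """Serialize the visual scene into the LLM context so the answer can
    reference concrete, localized objects instead of guessing."""
    scene = "\n".join(
        f"- {d.label} at {d.bbox}, attributes {d.attributes}" for d in detections
    )
    return (
        "You are an on-site compliance assistant.\n"
        f"Visible objects:\n{scene}\n\n"
        f"Question: {question}\n"
        "Answer only from the listed objects and attributes."
    )

if __name__ == "__main__":
    prompt = build_grounded_prompt(
        "Do all windows in this room meet the accessibility standards?",
        detect_scene(b"<camera frame>"),
    )
    print(prompt)  # this string would be sent to the LLM backend
```

In a real pipeline, the stub detector would be replaced by an on-device vision model; the key idea is that the language model answers only from explicitly listed, localized evidence.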

MuvAko

MuvAko is a forward-looking research initiative based in Saxony, Germany, focused on leveraging generative AI and spatial computing to develop multimodal, context-aware assistance systems for virtual and mixed reality.

At its core, MuvAko explores how these cutting-edge technologies enhance the user experience and add measurable value across digital sales processes. In collaboration with our partners, we are addressing key research challenges such as:

  • How can spatially aware AI systems capture and interpret multimodal context in dynamic, real-time e-commerce scenarios?

  • How can recommendation engines be personalized using real-time user data?

  • How can context-aware AI assistants provide meaningful support within immersive shopping environments?

Beyond technical exploration, MuvAko aims to push the boundaries of usability, personalization, and accessibility in AI-driven product presentation and delivery – laying the foundation for the next generation of immersive, intelligent e-commerce platforms.

Key Contributions and Technical Innovations:

Real-Time Multimodal Context Sensing

Fusion of spatial, visual, and behavioral data to guide AI decision-making.

Adaptive Personalization Engines

Dynamic tailoring of recommendations based on user interactions and spatial cues (a minimal scoring sketch follows this list).

Intelligent Virtual Assistants

Context-aware agents that assist users throughout immersive shopping journeys.

Enhanced Usability in XR Interfaces

Streamlined interaction models for intuitive engagement in mixed reality.

Scalable Architecture

Modular system design for integration into diverse digital commerce platforms.
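
As a rough illustration of adaptive personalization, the sketch below blends behavioral signals (gaze dwell, interactions) with a spatial cue (distance) into one relevance score for ranking products in a virtual showroom. The signal names, weights, and decay functions are assumptions made up for this example, not MuvAko’s actual engine.

```python
import math
from dataclasses import dataclass

# Hypothetical signal model: these names and weights do not come from MuvAko;
# they only illustrate fusing spatial and behavioral cues into one score.

@dataclass
class ProductSignal:
    product_id: str
    gaze_seconds: float  # behavioral: accumulated gaze dwell time
    distance_m: float    # spatial: user's current distance to the product
    interactions: int    # behavioral: grabs, rotations, detail views

def relevance(s: ProductSignal,
              w_gaze: float = 0.5, w_dist: float = 0.2, w_int: float = 0.3) -> float:
    """Blend the cues into one score; closer, more-engaged items rank higher."""
    gaze = 1 - math.exp(-s.gaze_seconds / 5.0)  # saturating dwell signal
    prox = 1 / (1 + s.distance_m)               # decays with distance
    inter = min(s.interactions / 3.0, 1.0)      # capped interaction count
    return w_gaze * gaze + w_dist * prox + w_int * inter

signals = [
    ProductSignal("sofa-01", gaze_seconds=8.0, distance_m=1.2, interactions=2),
    ProductSignal("lamp-07", gaze_seconds=1.5, distance_m=4.0, interactions=0),
]
for s in sorted(signals, key=relevance, reverse=True):
    print(s.product_id, round(relevance(s), 3))
```

The saturating and capped terms keep any single signal from dominating the ranking; a production engine would learn such weights from interaction data rather than hand-tuning them.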

REACT

What if machines could truly understand their surroundings – and respond in real time?

Large Language Models (LLMs) have revolutionized human-computer interaction by making complex knowledge accessible through natural language, bridging the gap between humans and machines. Yet current multimodal large language models (MLLMs) still fall short when it comes to real-world awareness: they often lack an understanding of when and where something is happening – elements essential for interacting reliably with the real world. That’s why we’re building REACT: a fine-tuned MLLM designed to understand spatial and temporal context by analyzing video and sensor data. Our goal is to enable machines to understand and follow complex human instructions in dynamic environments.

We’re putting REACT to the test with a real-world scenario: a mobile robot that takes a coffee order directly from a person, operates the coffee machine, and serves the drink – even as conditions around it change. By combining cutting-edge AI with real-time contextual understanding, REACT takes human-machine interaction to the next level. 
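
To show what this looks like in miniature, the sketch below grounds a spoken order in the robot’s current world state before committing to a plan, inserting recovery steps when the environment does not match expectations. The step vocabulary and world model are invented for illustration and are not REACT’s actual planner.

```python
from dataclasses import dataclass

# Toy planner for the coffee scenario: the step vocabulary and world state
# below are invented for illustration and are not REACT's real interface.

@dataclass
class WorldState:
    objects: set         # objects the robot currently perceives
    machine_ready: bool  # whether the coffee machine reports ready

PLAN_TEMPLATE = ["navigate_to(machine)", "insert_cup", "press_brew", "deliver(cup)"]

def plan(order: str, world: WorldState) -> list[str]:
    """Ground a spoken order in the perceived world state before acting."""
    if "coffee" not in order.lower():
        raise ValueError(f"no plan template for order: {order!r}")
    steps = list(PLAN_TEMPLATE)
    if not world.machine_ready:
        steps.insert(0, "wait(machine_ready)")  # environment not ready yet
    if "cup" not in world.objects:
        steps.insert(0, "fetch(cup)")  # recover: the expected cup is missing
    return steps

world = WorldState(objects={"machine"}, machine_ready=True)  # no cup in view
print(plan("One coffee, please", world))
# -> ['fetch(cup)', 'navigate_to(machine)', 'insert_cup', 'press_brew', 'deliver(cup)']
```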

Key Contributions and Technical Innovations:

Spatiotemporal Contextualization

Fine-tuning MLLMs to interpret spatiotemporal sequences from multimodal inputs.

Real-Time Reasoning

Enabling fast, context-aware decision-making in dynamic environments.

Grounded Instruction Following

Translating natural language into executable steps tied to the environment.

Sensor + Video Fusion

Integrating first- and third-person video with real-world sensor data for richer scene understanding (see the time-alignment sketch after this list).

Autonomous Task Execution

Demonstrating end-to-end autonomy in a physical human-interaction scenario.
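
As a small illustration of the fusion step, the sketch below pairs each video frame with the sensor reading nearest in time, the kind of timestamp alignment such a pipeline needs before any model can reason over both streams. Field names and structures are assumptions for this example, not REACT’s data format.

```python
import bisect
from dataclasses import dataclass

# Minimal time-alignment sketch: frames and sensor readings must share a
# common timeline before any model reasons over both streams. The field
# names here are illustrative, not REACT's actual data format.

@dataclass
class Frame:
    t: float     # capture time in seconds
    source: str  # "egocentric" or "third_person"

@dataclass
class SensorReading:
    t: float
    values: dict  # e.g. {"door_open": True}

def nearest_reading(readings: list[SensorReading], t: float) -> SensorReading:
    """Pair a frame timestamp with the sensor reading closest in time."""
    times = [r.t for r in readings]  # readings are assumed sorted by time
    i = bisect.bisect_left(times, t)
    candidates = readings[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda r: abs(r.t - t))

frames = [Frame(0.00, "egocentric"), Frame(0.04, "third_person")]
readings = [SensorReading(0.01, {"door_open": False}),
            SensorReading(0.05, {"door_open": True})]

for f in frames:
    print(f.source, f.t, "->", nearest_reading(readings, f.t).values)
```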
