• Author(s): Hao Dong, Eleni Chatzi, Olga Fink

“Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-Supervision” introduces a framework for improving how models generalize and adapt to new, unseen domains in a multimodal setting. The research addresses open-set domain generalization (OSDG), in which a model must both classify known categories and detect novel classes in unseen domains, a task that becomes particularly complex when multiple data modalities are involved.

The proposed framework leverages self-supervised learning techniques to improve the robustness and adaptability of models. Two innovative multimodal self-supervised pretext tasks are introduced: Masked Cross-modal Translation and Multimodal Jigsaw Puzzles. These tasks are designed to help the model learn rich and transferable features by predicting missing parts of the data across different modalities and solving puzzles that require understanding the relationships between different parts of the data.

Masked Cross-modal Translation masks parts of the input in one modality and trains the model to reconstruct the missing parts using information from another modality. This task forces the model to learn how the modalities relate to one another, enhancing its ability to generalize across domains. Multimodal Jigsaw Puzzles, in turn, shuffle segments of the data and require the model to recover their correct order, further promoting the learning of robust, transferable features.
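The two pretext tasks can be illustrated in a few lines. The sketch below is a minimal NumPy illustration, not the paper's implementation: the masking strategy, the MSE-on-masked-positions loss, and the three-segment jigsaw setup are all simplifying assumptions made for clarity.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)

def mask_features(feats, mask_ratio=0.5):
    """Zero out a random fraction of time steps in one modality's features."""
    T = feats.shape[0]
    idx = rng.choice(T, size=int(T * mask_ratio), replace=False)
    mask = np.zeros(T, dtype=bool)
    mask[idx] = True
    masked = feats.copy()
    masked[mask] = 0.0
    return masked, mask

def translation_loss(pred, target, mask):
    """Reconstruction error on the masked positions only: the model must
    predict them from the other modality, which is left unmasked."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

# Multimodal jigsaw: shuffle temporal segments with one permutation shared
# across modalities; the model classifies which permutation was applied.
PERMS = list(permutations(range(3)))

def make_jigsaw_sample(video_segs, audio_segs):
    label = int(rng.integers(len(PERMS)))
    order = PERMS[label]
    return ([video_segs[i] for i in order],
            [audio_segs[i] for i in order],
            label)
```

In training, a translation network would produce `pred` for the masked modality from the unmasked one, and a classification head would predict `label` from the shuffled segments; such auxiliary heads are typically discarded after pretraining.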

Additionally, the framework incorporates an entropy weighting mechanism to balance the contributions of the different modalities, so that the model focuses on the most informative signals. By dynamically adjusting each modality's importance based on the uncertainty of its predictions, this mechanism helps the model adapt more effectively to new domains.

The paper also extends the approach to Multimodal Open-Set Domain Adaptation (MM-OSDA), where a model trained on a source domain must perform well on a target domain that contains novel classes. This extension demonstrates the versatility and effectiveness of the framework across scenarios.
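The entropy-weighting idea can be sketched as follows. This is one plausible scheme, not necessarily the paper's exact formulation: each modality's prediction entropy is mapped to a score via `exp(-H)` and the scores are normalized, so that lower-entropy (more confident) modalities contribute more.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Shannon entropy of a class-probability vector (higher = more uncertain)."""
    p = np.clip(probs, eps, 1.0)
    return float(-np.sum(p * np.log(p)))

def entropy_weights(modality_probs):
    """Map each modality's entropy to a normalized weight: confident
    (low-entropy) modalities receive larger weights."""
    scores = np.exp(-np.array([entropy(p) for p in modality_probs]))
    return scores / scores.sum()
```

A fused prediction could then be a weighted sum of the per-modality probability vectors using these weights.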

Extensive experiments on the EPIC-Kitchens and HAC datasets demonstrate the effectiveness of the proposed method. The results show that the framework significantly outperforms existing state-of-the-art approaches on both OSDG and open-set domain adaptation tasks, highlighting its potential for real-world applications in which models must handle diverse and evolving data environments.

In summary, “Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-Supervision” presents a significant advance in domain generalization and adaptation. By combining novel self-supervised pretext tasks with an entropy weighting mechanism, the authors provide a robust and flexible framework for improving the performance of multimodal models in unseen domains. The work has implications for video understanding, robotics, and any field that requires robust multimodal learning.