LanDA: Language-Guided Multi-Source Domain Adaptation (2401.14148v1)

Published 25 Jan 2024 in cs.CV

Abstract: Multi-Source Domain Adaptation (MSDA) aims to mitigate changes in data distribution when transferring knowledge from multiple labeled source domains to an unlabeled target domain. However, existing MSDA techniques assume that target-domain images are available, yet overlook the rich semantic cues available beyond images. An open question is therefore whether MSDA can be guided solely by textual cues in the absence of target-domain images. By employing a multimodal model with a joint image and language embedding space, we propose LanDA, a novel language-guided MSDA approach based on optimal transport theory. LanDA transfers multiple source domains to a new target domain using only a textual description of that domain, without requiring even a single target-domain image, while retaining task-relevant information. We present extensive experiments across different transfer scenarios on a suite of relevant benchmarks, demonstrating that LanDA outperforms standard fine-tuning and ensemble approaches in both target and source domains.

Authors (4)
  1. Zhenbin Wang
  2. Lei Zhang
  3. Lituan Wang
  4. Minjuan Zhu
Citations (7)

Summary

  • The paper introduces LanDA, a method that uses textual descriptions and optimal transport theory to align multiple source domains with an unseen target domain, without needing any target images.
  • It leverages vision-language foundation models to transform image embeddings and project them into a joint embedding space while preserving class-specific attributes.
  • Experiments across diverse benchmarks demonstrate superior accuracy over standard fine-tuning and ensemble approaches, highlighting the potential of text-guided domain adaptation.

Overview of LanDA: Language-Guided Multi-Source Domain Adaptation

Multi-Source Domain Adaptation (MSDA) remains a challenging area within AI, largely because existing methods rely on target-domain images to guide the adaptation process. The paper introduces LanDA (Language-Guided Multi-Source Domain Adaptation), which departs from conventional approaches by relying exclusively on textual descriptions of the target domain. LanDA combines optimal transport (OT) theory with vision-language foundation models (VLFMs) to adapt multiple source domains to a target domain without requiring any target-domain images.

Challenges and Novel Approach

Traditional MSDA methods require target-domain images for successful adaptation, which is problematic when such images are hard to obtain. LanDA circumvents this by exploiting language descriptions of the target domain, removing the need for actual target-domain imagery. This is enabled by a VLFM whose joint image and language embedding space allows domains to be aligned from language cues alone; the sketch below illustrates the underlying idea.
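To make the joint embedding space concrete, the following minimal Python sketch shows how a purely textual description of an unseen target domain defines a direction in CLIP's shared image-text space. This is an illustration of the general idea, not LanDA's actual objective; the prompt wording and the ViT-B/32 backbone are assumptions.

```python
# An illustrative sketch, not LanDA's objective: CLIP embeds images and text
# in one space, so a textual description of the target domain defines a
# direction that image features can be moved along.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical textual descriptions of a known source domain and the
# unseen target domain.
source_prompt = clip.tokenize(["a photo"]).to(device)
target_prompt = clip.tokenize(["a pencil sketch"]).to(device)

with torch.no_grad():
    e_src = model.encode_text(source_prompt).float()
    e_tgt = model.encode_text(target_prompt).float()
    e_src = e_src / e_src.norm(dim=-1, keepdim=True)
    e_tgt = e_tgt / e_tgt.norm(dim=-1, keepdim=True)

# Direction in the joint space from the source domain toward the target
# domain, expressed in language alone; no target image is needed.
domain_shift = e_tgt - e_src
```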

Mechanisms and Contributions

LanDA freezes the parameters of a pretrained VLFM such as CLIP and inserts lightweight augmenters that transform image embeddings from the multiple source domains into extended domains. These extended domains are then projected into a Wasserstein space, accounting for both image and text information, so that they align with the unseen target domain while preserving class-specific attributes; a hedged sketch of these two mechanisms follows. To evaluate the effectiveness of LanDA, extensive experiments were carried out across varying transfer scenarios and benchmarks. The results demonstrate that LanDA achieves superior accuracy over standard fine-tuning and ensemble approaches in both target and source domains.
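Below is a minimal sketch of these two mechanisms under simplifying assumptions: the augmenter is a small residual MLP (the paper's exact architecture may differ), the target side is represented by stand-in text-anchor embeddings, and the alignment term is an entropic optimal-transport (Sinkhorn) distance in the spirit of the Wasserstein-space projection. It is not the paper's exact loss.

```python
import torch
import torch.nn as nn

class Augmenter(nn.Module):
    """Lightweight residual MLP applied on top of frozen CLIP image embeddings."""
    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        out = z + self.net(z)                        # residual keeps task-relevant info
        return out / out.norm(dim=-1, keepdim=True)  # stay on the unit sphere, like CLIP features

def sinkhorn_distance(x, y, eps=0.1, iters=100):
    """Entropic optimal transport between two uniform point clouds (Sinkhorn iterations)."""
    cost = torch.cdist(x, y) ** 2                    # pairwise squared Euclidean costs
    K = torch.exp(-cost / eps)                       # Gibbs kernel
    a = torch.full((x.size(0),), 1.0 / x.size(0), device=x.device)
    b = torch.full((y.size(0),), 1.0 / y.size(0), device=y.device)
    u = torch.ones_like(a)
    for _ in range(iters):                           # alternating scaling updates
        v = b / (K.t() @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]               # resulting transport plan
    return (plan * cost).sum()                       # transport cost to minimize

# Usage sketch: align augmented source-image embeddings with target-side anchors.
augmenter = Augmenter()
src_embeddings = torch.randn(32, 512)                # stand-in for frozen CLIP image features
tgt_anchors = torch.randn(8, 512)                    # stand-in for target-domain text embeddings
tgt_anchors = tgt_anchors / tgt_anchors.norm(dim=-1, keepdim=True)
loss = sinkhorn_distance(augmenter(src_embeddings), tgt_anchors)
loss.backward()                                      # gradients flow only into the augmenter
```

Only the augmenter is trainable here; the frozen VLFM supplies the embeddings, which is what keeps the adaptation lightweight.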

Performance Evaluation and Future Outlook

LanDA exhibits notable accuracy improvements, substantiating the viability of text-guided adaptation. Because collecting extensive target-domain image datasets is often difficult, the approach could shape future methodologies that refine domain adaptation. Its adaptability and methodological design offer a promising avenue for using language as a guiding signal in domain adaptation, and they invite further work on methods that reduce reliance on image datasets by exploiting the synergy between language and vision modalities.