
The AI Community Building the Future? A Quantitative Analysis of Development Activity on Hugging Face Hub

(2405.13058)
Published May 20, 2024 in cs.SE , cs.AI , cs.CY , and cs.LG

Abstract

Open source developers have emerged as key actors in the political economy of AI, with open model development being recognised as an alternative to closed-source AI development. However, we still have a limited understanding of collaborative practices in open source AI. This paper responds to this gap with a three-part quantitative analysis of development activity on the Hugging Face (HF) Hub, a popular platform for building, sharing, and demonstrating models. First, we find that various types of activity across 348,181 model, 65,761 dataset, and 156,642 space repositories exhibit right-skewed distributions. Activity is extremely imbalanced between repositories; for example, over 70% of models have 0 downloads, while 1% account for 99% of downloads. Second, we analyse a snapshot of the social network structure of collaboration on models, finding that the community has a core-periphery structure, with a core of prolific developers and a majority of isolate developers (89%). Upon removing isolates, collaboration is characterised by high reciprocity regardless of developers' network positions. Third, we examine model adoption through the lens of model usage in spaces, finding that a minority of models, developed by a handful of companies, are widely used on the HF Hub. Overall, we find that various types of activity on the HF Hub are characterised by Pareto distributions, congruent with prior observations about OSS development patterns on platforms like GitHub. We conclude with a discussion of the implications of the findings and recommendations for (open source) AI researchers, developers, and policymakers.

Figure: Development activity distributions in HF Hub repositories.

Overview

  • The paper analyzes how development activity and collaboration are distributed across the Hugging Face Hub, finding a significant imbalance in which a small percentage of models and developers account for most activity and downloads.

  • Most model repositories receive few or no collaborative contributions, while reciprocity among the developers who do collaborate is high, indicating mutual relationships.

  • The study finds that repositories with permissive licenses show higher activity and engagement, and identifies major tech companies as dominant players in model adoption.

Open Source AI: A Deep Dive into the Hugging Face Hub

Development Activity in the HF Hub

The study examines patterns of development activity on the Hugging Face (HF) Hub in the context of open source AI. It analyzes several types of activity across 348,181 model repositories, 65,761 dataset repositories, and 156,642 space repositories.

  1. Right-Skewed Distributions: The HF Hub shows a very imbalanced activity distribution. For example, over 70% of models have zero downloads, while a mere 1% account for 99% of all downloads. Similarly right-skewed distributions are observed in likes, discussions, and commits, reflective of the Pareto principle.
  2. Community Engagement: Community size correlates with activity levels. However, most repositories have limited collaborative engagement, with around 87% of model repositories having only one contributor. High reciprocity values indicate prevalent mutual relationships among active developers.
  3. Licenses: Many repositories do not declare a license: 65% of model repositories and 72% of dataset repositories have none. Repositories with permissive licenses, however, exhibit higher activity levels and engagement.
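The two concentration figures quoted above (the share of repositories with zero downloads, and the share of all downloads captured by the top 1%) can be computed from a plain list of per-repository counts. The following is a minimal sketch, not the authors' code: the function names `zero_share` and `top_share` are hypothetical, and the `downloads` list is made-up illustrative data rather than HF Hub data.

```python
import math


def zero_share(counts):
    """Fraction of repositories whose count is exactly zero."""
    return sum(1 for c in counts if c == 0) / len(counts)


def top_share(counts, top_fraction=0.01):
    """Fraction of the total held by the top `top_fraction` of repositories."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Always include at least one repository in the "top" group.
    k = max(1, math.ceil(len(counts) * top_fraction))
    top = sorted(counts, reverse=True)[:k]
    return sum(top) / total


# Made-up counts for 100 hypothetical repositories (NOT real HF Hub data):
# 70 with no downloads, a long tail of small counts, and one very popular model.
downloads = [0] * 70 + [1] * 20 + [5] * 9 + [10000]

share_zero = zero_share(downloads)
share_top1 = top_share(downloads, top_fraction=0.01)
print(f"zero downloads: {share_zero:.0%}, top 1% share: {share_top1:.1%}")
```

The same two functions apply unchanged to likes, discussions, or commits, since each is just a list of non-negative counts per repository.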

Collaboration Patterns

Analysis of the social network structure of collaboration in model repositories reveals a core-periphery structure with a few prolific developers at the core:

  • Isolate Developers: About 89% of developers work in isolation without collaboration.
  • Core-Periphery Structure: Collaboration is characterized by a dense core of prolific developers. High reciprocity values suggest that collaborations are mutual and not dependent on one-sided efforts.
  • High Modularity: The developer network is made up of distinct communities aligned with AI sub-fields such as NLP and computer vision. Modularity is high at lower k-core thresholds, which indicates loosely connected groups.
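Two of the network measures named above, the share of isolate developers and reciprocity, are simple to compute from a directed edge list. The sketch below is illustrative only and makes assumptions the summary does not confirm: an edge `(a, b)` is taken to mean that developer `a` contributed to a repository owned by developer `b`, reciprocity is taken as the fraction of directed edges whose reverse edge also exists, and the developer names and edges are invented.

```python
def isolate_share(nodes, edges):
    """Fraction of nodes that take part in no edge (self-loops are ignored)."""
    connected = set()
    for src, dst in edges:
        if src != dst:
            connected.add(src)
            connected.add(dst)
    return sum(1 for n in nodes if n not in connected) / len(nodes)


def reciprocity(edges):
    """Fraction of directed edges (a, b) for which (b, a) is also present."""
    edge_set = {(s, d) for s, d in edges if s != d}
    if not edge_set:
        return 0.0
    mutual = sum(1 for s, d in edge_set if (d, s) in edge_set)
    return mutual / len(edge_set)


# Invented example: ten developers, only three of whom collaborate.
developers = list("abcdefghij")
collaborations = [("a", "b"), ("b", "a"), ("a", "c")]

iso = isolate_share(developers, collaborations)
rec = reciprocity(collaborations)
print(f"isolates: {iso:.0%}, reciprocity: {rec:.2f}")
```

In this toy network 7 of 10 developers are isolates, and two of the three directed edges are reciprocated. Measures such as k-core membership and modularity need a graph library or a longer implementation and are left out of the sketch.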

Model Adoption in Spaces

Model adoption in spaces, an indicator of practical use and impact, also reveals a high concentration among a few key players:

  • Right-Skewed Distribution: Similar to activity distributions, model adoption displays a right-skewed distribution where a minority of models see significant usage. For instance, runwayml/stable-diffusion-v1-5 is used in over 1,747 spaces.
  • Dominance by Big Tech: Major players like Meta, Google, and Stability AI are behind the most adopted models, indicating a high concentration of influence. Models from industry leaders are not only the most used but also highly co-used with other models.
  • Correlation with Likes: A strong positive correlation exists between the number of likes a model repository receives and its usage in spaces, highlighting likes as a good indicator of model adoption.
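The relationship between likes and usage in spaces can be checked with a rank correlation, which suits heavily right-skewed counts. The summary does not say which correlation coefficient the study used, so Spearman's rho is an assumption here, and the `likes` and `spaces_using` values are invented for illustration.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented per-model counts (NOT real HF Hub data).
likes = [0, 2, 5, 40, 900]
spaces_using = [0, 1, 1, 12, 300]

rho = spearman(likes, spaces_using)
print(f"Spearman rho: {rho:.3f}")
```

Working on ranks keeps a single extremely popular model from dominating the coefficient, which a plain Pearson correlation on raw counts would allow.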

Implications and Future Research Directions

The findings underscore significant concentration and imbalance in development activities and model adoption within the HF Hub community. The main implications are:

  1. Licensing: Given the substantial number of unlicensed repositories, developers and platform providers should emphasise the importance of proper licensing to facilitate collaboration and avoid legal complications.
  2. Community Engagement: Encouraging more collaboration, especially among isolated developers, could lead to richer, more diverse contributions in open source AI.
  3. Policy and Governance: Policymakers should take note of these concentrated influences, especially by industry leaders, to inform discussions on the benefits, risks, and governance of open models.
  4. Methodological Considerations: Future research should expand beyond the HF Hub to include other platforms and compare development and collaboration patterns. Additionally, temporal analyses could provide more nuanced insights into the dynamics of open source AI development.
  5. Understanding Developer Incentives: Investigating what drives individual developers and companies to contribute to open source AI could illuminate paths to more balanced and democratic participation.

This study not only extends the literature on open source AI but also provides practical insights for researchers, developers, policymakers, and platform providers. As open models continue to proliferate, a deeper understanding of these collaborative practices will be essential for fostering a sustainable and equitable AI ecosystem.
