The AI Community Building the Future? A Quantitative Analysis of Development Activity on Hugging Face Hub (2405.13058v2)

Published 20 May 2024 in cs.SE, cs.AI, cs.CY, and cs.LG

Abstract: Open model developers have emerged as key actors in the political economy of AI, but we still have a limited understanding of collaborative practices in the open AI ecosystem. This paper responds to this gap with a three-part quantitative analysis of development activity on the Hugging Face (HF) Hub, a popular platform for building, sharing, and demonstrating models. First, various types of activity across 348,181 model, 65,761 dataset, and 156,642 space repositories exhibit right-skewed distributions. Activity is extremely imbalanced between repositories; for example, over 70% of models have 0 downloads, while 1% account for 99% of downloads. Furthermore, licenses matter: there are statistically significant differences in collaboration patterns in model repositories with permissive, restrictive, and no licenses. Second, we analyse a snapshot of the social network structure of collaboration in model repositories, finding that the community has a core-periphery structure, with a core of prolific developers and a majority of isolate developers (89%). Upon removing the isolate developers from the network, collaboration is characterised by high reciprocity regardless of developers' network positions. Third, we examine model adoption through the lens of model usage in spaces, finding that a minority of models, developed by a handful of companies, are widely used on the HF Hub. Overall, activity on the HF Hub is characterised by Pareto distributions, congruent with OSS development patterns on platforms like GitHub. We conclude with recommendations for researchers, companies, and policymakers to advance our understanding of open AI development.

Summary

The paper demonstrates that model activity on the Hugging Face Hub is highly skewed, with over 70% of models having zero downloads and only 1% accounting for 99% of downloads.
The paper reveals that 87% of model repositories have a single contributor and 89% of developers work in isolation, highlighting limited collaborative engagement.
The paper finds that repositories with permissive licenses enjoy higher activity and engagement, emphasizing important implications for open source policy and governance.

Open Source AI: A Deep Dive into the Hugging Face Hub

Development Activity in the HF Hub

The paper analyzes patterns of development activity on the Hugging Face (HF) Hub in the context of open source AI. It covers 348,181 model repositories, 65,761 dataset repositories, and 156,642 space repositories.

Right-Skewed Distributions: The HF Hub shows a very imbalanced activity distribution. For example, over 70% of models have zero downloads, while a mere 1% account for 99% of all downloads. Similarly right-skewed distributions are observed in likes, discussions, and commits, reflective of the Pareto principle.
Community Engagement: Community size correlates with activity levels. However, most repositories have limited collaborative engagement, with around 87% of model repositories having only one contributor. High reciprocity values indicate prevalent mutual relationships among active developers.
Licenses: Many repositories do not specify a license: 65% of model repositories and 72% of dataset repositories. Repositories with permissive licenses, however, exhibit higher activity levels and engagement.
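
The concentration figures above (the share of repositories with zero downloads, and the share of all downloads captured by the top 1%) can be illustrated with a minimal sketch. The function names and the download counts below are hypothetical and synthetic; they are not the paper's code or data, only one plausible way to compute such statistics.

```python
def zero_share(downloads):
    """Fraction of repositories with zero downloads."""
    return sum(1 for d in downloads if d == 0) / len(downloads)


def top_share(downloads, top_fraction=0.01):
    """Share of total downloads held by the top `top_fraction` of repositories."""
    ranked = sorted(downloads, reverse=True)
    # Always include at least one repository in the "top" group.
    k = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Synthetic, illustrative counts for 100 repositories (not from the paper):
# 70 with no downloads, 29 with one download, and a single very popular one.
downloads = [0] * 70 + [1] * 29 + [9971]

print(f"zero-download share: {zero_share(downloads):.2%}")
print(f"top 1% share of downloads: {top_share(downloads):.2%}")
```

On real data, the same two functions could be applied to a list of per-repository download counts, however those counts are collected.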

Collaboration Patterns

Analysis of the social network structure of collaboration in model repositories reveals a core-periphery structure with a few prolific developers at the core:

Isolate Developers: About 89% of developers work in isolation without collaboration.
Core-Periphery Structure: Collaboration is characterized by a dense core of prolific developers. High reciprocity values suggest that collaborations are mutual and not dependent on one-sided efforts.
High Modularity: The developer network consists of distinct communities in AI sub-fields such as NLP and computer vision. Modularity is high at lower k-core thresholds, which indicates that these groups are only loosely connected to one another.
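
To make two of the network measures above concrete, the following minimal sketch computes the share of isolate developers and the reciprocity of a directed collaboration network. It assumes, for illustration only, that an edge (u, v) means developer u contributed to a repository owned by developer v; the paper's exact network construction may differ. Reciprocity is taken here as the fraction of directed edges whose reverse edge also exists, and the toy data is synthetic.

```python
def isolate_share(developers, edges):
    """Fraction of developers with no edge to or from another developer."""
    # Self-loops (u == v) do not count as collaboration with someone else.
    connected = {node for u, v in edges if u != v for node in (u, v)}
    return sum(1 for d in developers if d not in connected) / len(developers)


def reciprocity(edges):
    """Fraction of directed edges (u, v) whose reverse (v, u) is also present."""
    edge_set = {(u, v) for u, v in edges if u != v}
    if not edge_set:
        return 0.0
    mutual = sum(1 for u, v in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)


# Synthetic toy network (not from the paper): ten developers, three edges.
developers = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
edges = [("a", "b"), ("b", "a"), ("a", "c")]

print(f"isolate share: {isolate_share(developers, edges):.0%}")
print(f"reciprocity: {reciprocity(edges):.2f}")
```

In this toy network, a and b collaborate in both directions, while the edge from a to c is one-sided, so two of the three edges are reciprocated.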

Model Adoption in Spaces

Model adoption in spaces, an indicator of practical use and impact, also reveals a high concentration among a few key players:

Right-Skewed Distribution: Like the activity distributions, model adoption is right-skewed: a minority of models see significant usage. For instance, runwayml/stable-diffusion-v1-5 is used in more than 1,700 spaces.
Dominance by Big Tech: Major players like Meta, Google, and Stability AI are behind the most adopted models, indicating a high concentration of influence. Models from industry leaders are not only the most used but also highly co-used with other models.
Correlation with Likes: A strong positive correlation exists between the number of likes a model repository receives and its usage in spaces, highlighting likes as a good indicator of model adoption.
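
The relationship between likes and usage in spaces can be checked with a rank correlation, which suits heavily skewed counts. The sketch below implements Spearman's coefficient by hand as one reasonable choice; the summary does not state which correlation measure the paper uses, so this is an assumption, and the data is synthetic.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Synthetic, illustrative values for five models (not from the paper).
likes = [0, 2, 5, 40, 900]
spaces_using_model = [0, 1, 1, 12, 300]

rho = spearman(likes, spaces_using_model)
print(f"Spearman rho: {rho:.3f}")
```

A value close to 1 would indicate that models with more likes tend to be used in more spaces, which is the pattern the summary describes.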

Implications and Future Research Directions

The findings underscore significant concentration and imbalance in development activity and model adoption within the HF Hub community. The main implications are:

Licensing: Given the substantial number of unlicensed repositories, developers and platform providers should emphasize the importance of proper licensing to facilitate collaboration and avoid legal complications.
Community Engagement: Encouraging more collaboration, especially among isolated developers, could lead to richer, more diverse contributions in open source AI.
Policy and Governance: Policymakers should take note of these concentrated influences, especially by industry leaders, to inform discussions on the benefits, risks, and governance of open models.
Methodological Considerations: Future research should expand beyond the HF Hub to include other platforms and compare development and collaboration patterns. Additionally, temporal analyses could provide more nuanced insights into the dynamics of open source AI development.
Understanding Developer Incentives: Investigating what drives individual developers and companies to contribute to open source AI could illuminate paths to more balanced and democratic participation.

This paper not only extends the literature on open source AI but also provides practical insights for researchers, developers, policymakers, and platform providers. As open models continue to proliferate, a deeper understanding of these collaborative practices will be essential for fostering a sustainable and equitable AI ecosystem.