StellarTop: An Integrated Multi-topic Dataset on GitHub Repositories

Zhiwei Zhu, Wenrui Huang, Wei Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

GitHub has become one of the most popular platforms for open-source version control and collaboration. In 2017, GitHub introduced the “Topics” feature, which allows repository owners to add descriptive topics that better characterize their repositories. However, incorrect topic assignment can reduce a repository’s visibility to potential contributors. Previous datasets either delete low-frequency topics or map them to frequent featured topics; our dataset instead retains all valuable topics and so preserves their full diversity. We collected the 50,000 most-starred repositories on GitHub at the time of collection, along with their textual information, such as descriptions and README files. The final dataset covers 28,386 repositories with a total of 162,038 topic assignments spanning 22,710 distinct topics. This extensive dataset supports a range of research applications, such as topic recommendation and trend analysis in open-source projects. The dataset is freely available at https://github.com/Zzzzzhuzhiwei/StellarTop.
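The three figures in the abstract (repositories, total topic assignments, distinct topics) measure different things. The following is a minimal sketch of how they could be computed from per-repository topic lists. The record layout, the field names `full_name` and `topics`, and the sample data are hypothetical and are not taken from the published dataset; skipping repositories that have no topics is likewise an assumption, not something the abstract states.

```python
from collections import Counter

# Hypothetical records: one dict per repository with its list of topics.
# The field names are illustrative, not the dataset's actual schema.
repos = [
    {"full_name": "example/alpha", "topics": ["python", "machine-learning"]},
    {"full_name": "example/beta", "topics": ["python", "cli"]},
    {"full_name": "example/gamma", "topics": []},
]


def topic_stats(records):
    """Return (repos with topics, total assignments, distinct topics, counts)."""
    counts = Counter()
    repos_with_topics = 0
    for record in records:
        topics = record.get("topics") or []
        if topics:
            # Assumption: only repositories with at least one topic are counted.
            repos_with_topics += 1
        counts.update(topics)
    total_assignments = sum(counts.values())
    distinct_topics = len(counts)
    return repos_with_topics, total_assignments, distinct_topics, counts


n_repos, n_assignments, n_distinct, counts = topic_stats(repos)
print(n_repos, n_assignments, n_distinct)
```

Because low-frequency topics are kept rather than merged, the `counts` object also exposes the long tail directly, for example via `counts.most_common()`.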

Original language: English
Title of host publication: Benchmarking, Measuring, and Optimizing - 16th BenchCouncil International Symposium, Bench 2024, Revised Selected Papers
Editors: Weiwei Lin, Zhen Jia, Sascha Hunold, Guoxin Kang
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 113-126
Number of pages: 14
ISBN (Print): 9789819650316
DOIs
State: Published - 2025
Event: 16th BenchCouncil International Symposium on Benchmarking, Measuring, and Optimizing, Bench 2024 - Guangzhou, China
Duration: 4 Dec 2024 to 6 Dec 2024

Publication series

Name: Lecture Notes in Computer Science
Volume: 15519 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 16th BenchCouncil International Symposium on Benchmarking, Measuring, and Optimizing, Bench 2024
Country/Territory: China
City: Guangzhou
Period: 4/12/24 to 6/12/24

Keywords

  • Dataset
  • GitHub topics
  • Open Source Projects
  • Software Engineering
