
· 5 min read
Mia Zhou
Wink Yao
Caitin Chen

The OSS Insight website displays changes in GitHub event data in real time. GitHub events are activities triggered by user actions on GitHub, for example, commenting on an issue or forking a repository. In nearly seven weeks, GitHub events increased by about 150 million, from 4.7 billion to 4.85 billion. GitHub events are booming!

This post dives into GitHub event trends, why GitHub events are surging, and whether GitHub's architecture can handle the increasing load.

Historical data analysis

The OSS Insight database includes all the GitHub events since 2011. When we plot the number of events by year, we can see that since 2018 they have been increasing rapidly.


GitHub event trending

The figure below shows how long it took GitHub to accumulate each billion events.


The time to reach a billion GitHub events

It's taking less and less time for GitHub to generate 1 billion events. It took more than six years to reach the first billion events and only 13 months for the last billion!
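If you want to reproduce this figure from the raw data, a single aggregation with a window function is enough. Here is a minimal sketch; it assumes a github_events table with a created_at timestamp column, which is how OSS Insight stores events:

-- A minimal sketch: yearly and cumulative event counts.
-- Assumes a github_events table with a created_at timestamp column.
SELECT
    YEAR(created_at) AS year,
    COUNT(*) AS yearly_events,
    SUM(COUNT(*)) OVER (ORDER BY YEAR(created_at)) AS cumulative_events
FROM github_events
GROUP BY YEAR(created_at)
ORDER BY year;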

The secret behind the exponential growth of GitHub events

GitHub Actions was released in October 2018. Since August 2019, it has supported continuous integration and continuous delivery (CI/CD), and it has been free for open source projects. Therefore, projects hosted on GitHub can automate their development workflows, and a large number of automation-related bot applications have appeared on GitHub Marketplace. Could the growth of GitHub events be related to these?

To find the answer, we divided the events into those generated by humans and those generated by bots and plotted them in the following histogram. The blue columns represent human data, and the yellow columns represent bot data.


Bot events vs. human events
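The split itself can be sketched in one query. This is a simplification: like the collection query shown later in this series, it treats any actor_login containing "bot" as a bot account, while the real classification is more careful:

-- A simplified sketch of the bot/human split per year.
-- Treating every actor_login that contains "bot" as a bot account
-- is an approximation; the real classification is more careful.
SELECT
    YEAR(created_at) AS year,
    SUM(actor_login LIKE '%bot%') AS bot_events,
    SUM(actor_login NOT LIKE '%bot%') AS human_events
FROM github_events
GROUP BY YEAR(created_at)
ORDER BY year;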

As you can see, the proportion of GitHub bot events has increased each year. In 2015, they were only 1.23% of all events. By early July of this year, they had reached 13.2%. To show the changes in bot events more clearly, we made the following line chart.


Bot event trending

This figure shows that since 2019, bot events have grown faster than before. As Mini256, a TiDB community contributor, said in Love, Code, and Robot — Explore robots in the world of code:

For now, rough statistics find that there are more than 95,620 bots on GitHub. The number doesn't seem like much, but wait...

These 95 thousand bot accounts generated 603 million events. These events account for 12.82% of all public events on GitHub, and these GitHub robots have served over 18 million open source repositories.

Bots are playing an increasingly important role on GitHub. Many projects are handing over automated work to bots. We expect that GitHub events will grow faster in the future.

When will GitHub reach 10 billion events?

How many GitHub events will there be by the end of 2022? We fitted prediction curves to GitHub's historical data.


Human event fit (left) vs. bot event fit (right)

It's estimated that by the end of 2022, GitHub events will reach 5.36 billion.


GitHub event prediction

According to this prediction, GitHub events will exceed 10 billion in February 2025.


GitHub events will exceed 10 billion in 2025

Can MySQL sharding support such a huge amount of data?

GitHub uses MySQL as the main storage for all its non-Git data. The rapid growth of data volume poses a great challenge to GitHub's high availability. In March 2022, GitHub had three service disruptions, each lasting two to five hours. The official investigation reports show that the MySQL database caused the outages. During peak load periods, the load on GitHub's mysql1 database (its main database cluster) increased, and database access reached the maximum number of connections. This affected the performance of many GitHub services and features.

In fact, over the past few years GitHub has optimized its databases. For example, it added clusters to support platform growth and partitioned the main database. But these improvements did not fundamentally solve the problem. In the near future, GitHub events will exceed 5 billion, or even 10 billion. Can MySQL sharding support such a data surge?

Data sources

All the analysis data in this article comes from OSS Insight, a tool based on TiDB to analyze and gain insights into GitHub events data.

You can use it to easily get insights about developers and repositories based on billions of GitHub events. You can also get the latest and historical rankings and trends in technical fields.


The OSS Insight website

· 10 min read
Wink Yao
Fendy Feng

In early January 2022, Max, our CEO and a big fan of open source, asked if my team could build a small tool to help us understand all the open source projects on GitHub, and said that if everything worked well, we should open the API to help open source developers build better insights. In fact, GitHub continuously publishes the public events of its open source world through its open API. (Thank you and well done, GitHub!) We can certainly learn a lot from the data!

I was excited about this project until Max said: “You’ve only got one week.” Well, the boss is the boss! Although time was tight and we faced multiple headaches, I decided to take up the challenge.

Headache 1: we need both historical and real-time data.

After some quick research, we found GH Archive, an open source project that collects and archives all GitHub data since 2011 and updates it hourly. By the way, many open source analytical tools, such as CNCF's Devstats, rely on GH Archive, too.

Thanks to GH Archive, we found the data source.

But there's another problem: hourly data is good, but not good enough. We wanted our data to be updated in real time—or at least near real time. We decided to directly use the GitHub event API, which collects all events that have occurred within the past hour.

By combining the data from GH Archive and the GitHub event API, we can get streaming, near real-time event updates.


GitHub event updates

Headache 2: the data is huge!

After we decompressed all the data from GH Archive, we found there were more than 4.6 billion rows of GitHub events. That’s a lot of data! We also noticed that about 300,000 rows were generated and updated each hour.


The data volume of GitHub events since 2011

The database solution would be tricky here. Our goal was to build an application that provides real-time data insights based on a continuously growing dataset, so scalability was a must. NoSQL databases provide good scalability, but then the problem is handling complex analytical queries. Unfortunately, NoSQL databases are not good at that.


Scalability vs SQL

Another option is to use an OLAP database such as ClickHouse. ClickHouse can handle the analytical workload very well, but it is not designed for serving online traffic. If we chose it, we would need another database for the online traffic.


OLAP vs Online Serving

What about sharding the database and then building an extract, transform, load (ETL) pipeline to synchronize the new events to a data warehouse? This sounds workable.


How an RDBMS handles the GitHub data

According to our product manager's (PM's) plan, we needed to do some repo-specific or user-specific analysis. Although the total data volume was huge, the number of events for a single project or user was not too large. This meant that using secondary indexes in an RDBMS would be a good idea. But if we adopted the above architecture, we had to be careful in selecting the sharding key. For example, if we used user_id as the sharding key, queries based on repo_id would be very tricky.
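To make this concrete, here is a hedged sketch: a secondary index on repo_id turns a per-repository query into an index range scan instead of a full scan of billions of rows. The index name is hypothetical; repo_id 11730342 is vuejs/vue, taken from the collection sample later in this post:

-- A hypothetical secondary index on repo_id.
CREATE INDEX idx_github_events_repo_id ON github_events (repo_id);

-- With the index, per-repository analysis touches only that
-- repository's events instead of all 4.6 billion rows.
SELECT type, COUNT(*) AS num
FROM github_events
WHERE repo_id = 11730342   -- vuejs/vue
GROUP BY type;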

Another requirement from the PM was that our insight tool should provide OpenAPI, which meant we would have unpredictable concurrent traffic from the outside world.

Since we're not experts on Kafka and data warehouses, mastering and building such an infrastructure in just one week was a very difficult task for us.

The choice is obvious now, and don't forget that PingCAP is a database company! TiDB seems a perfect fit, and it's a good chance to eat our own dog food. So, why not use TiDB? :)

If we use TiDB, can we get:

  • SQL support, including complex & flexible queries? ☑️
  • Scalability? ☑️
  • Secondary index support for fast lookup? ☑️
  • Capability for online serving? ☑️

Wow! It seems we've got a winner!


By using a secondary index, TiDB scanned only 29,639 rows of GitHub events (instead of 4.6 billion) in 4.9 ms

For a database to support an application like OSS Insight, we think TiDB is a great choice. Plus, its simplified technology stack means a faster go-to-market and faster delivery of my boss's assignment.

After we used TiDB, we got a simplified architecture as shown below.


Simplified architecture after we use TiDB

Headache 3: We have a "pushy" PM!

Just as the subtitle indicates, we have a very “pushy” PM, which is not always a bad thing. :) His demands kept growing, from single-project analysis at the very beginning, to comparing and ranking multiple repositories, to other multidimensional analyses such as the geographical distribution of stargazers and contributors. What was more pressing: the deadlines stayed unchanged!!!

We had to keep a balance between the growing demands and the tight deadlines.

To save time, we built our website with Docusaurus, a scalable open source static site generator built with React, rather than building a site from scratch. We also used Apache ECharts, a powerful charting library, to turn analytical results into good-looking, easy-to-understand charts.

We chose TiDB to support our website because it perfectly supports SQL. This way, our back-end engineers could write SQL commands to handle complex, flexible analytical queries with ease and efficiency, and our front-end engineers just needed to display the SQL execution results as good-looking charts.

Finally, we made it. We prototyped our tool in just one week, and named it OSS Insight, short for open source software insights. We continued to fine-tune it, and it was officially released on May 3.

How we handle analytical queries with SQL

Let's use one example to show you how we deal with complex analytical queries.

Analyze a GitHub collection: JavaScript frameworks

OSS Insight can analyze popular GitHub collections by many metrics including the number of stars, issues, and contributors. Let’s identify which JavaScript framework has the most issue creators. This is an analytical query that includes aggregation and ranking. To get the result, we only need to execute one SQL statement:

SELECT
    ci.repo_name AS repo_name,
    COUNT(DISTINCT actor_login) AS num
FROM
    github_events ge
    JOIN collection_items ci ON ge.repo_id = ci.repo_id
    JOIN collections c ON ci.collection_id = c.id
WHERE
    type = 'IssuesEvent'
    AND action = 'opened'
    AND c.id = 10005
    -- Exclude bots
    AND actor_login NOT LIKE '%bot%'
    AND actor_login NOT IN (SELECT login FROM blacklist_users)
GROUP BY 1
ORDER BY 2 DESC;

In the statement above, the collections and collection_items tables store the data of all GitHub repository collections in various areas. Each table has 30 rows. To rank issue creators, we need to associate the repository IDs in the collection_items table with the real, 4.6-billion-row github_events table, as shown below.


mysql> SELECT * FROM collection_items WHERE collection_id = 10005;
+-----+---------------+--------------------+-----------+
| id  | collection_id | repo_name          | repo_id   |
+-----+---------------+--------------------+-----------+
| 127 |         10005 | marko-js/marko     |  15720445 |
| 129 |         10005 | angular/angular    |  24195339 |
| 131 |         10005 | emberjs/ember.js   |   1801829 |
| 135 |         10005 | vuejs/vue          |  11730342 |
| 136 |         10005 | vuejs/core         | 137078487 |
| 138 |         10005 | facebook/react     |  10270250 |
| 142 |         10005 | jashkenas/backbone |    952189 |
| 143 |         10005 | dojo/dojo          |  10160528 |
...
30 rows in set (0.05 sec)

Next, let's look at the execution plan. TiDB is compatible with MySQL syntax, so its execution plan looks very similar to that of MySQL.

In the figure below, notice the parts in red boxes. The data in the collection_items table is read through distributed[row], which means it is processed by TiKV, TiDB's row storage engine. The data in the github_events table is read through distributed[column], which means it is processed by TiFlash, TiDB's columnar storage engine. TiDB uses both storage engines to execute the same SQL statement. This is convenient for OSS Insight because it doesn't have to split the query into two statements.


TiDB execution plan
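You can inspect the plan yourself by prefixing the statement with EXPLAIN, which both TiDB and MySQL support. A minimal sketch (the exact operator names vary by TiDB version and by whether github_events has TiFlash replicas):

-- A sketch: prefix the same statement with EXPLAIN to see the plan.
-- In the output, each operator indicates which storage engine reads
-- the table, as highlighted in the figure above.
EXPLAIN
SELECT ci.repo_name AS repo_name,
       COUNT(DISTINCT actor_login) AS num
FROM github_events ge
     JOIN collection_items ci ON ge.repo_id = ci.repo_id
     JOIN collections c ON ci.collection_id = c.id
WHERE type = 'IssuesEvent'
  AND action = 'opened'
  AND c.id = 10005
GROUP BY 1
ORDER BY 2 DESC;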

TiDB returns the following result:

+--------------------+-------+
| repo_name          | num   |
+--------------------+-------+
| angular/angular    | 11597 |
| facebook/react     |  7653 |
| vuejs/vue          |  6033 |
| angular/angular.js |  5624 |
| emberjs/ember.js   |  2489 |
| sveltejs/svelte    |  1978 |
| vuejs/core         |  1792 |
| Polymer/polymer    |  1785 |
| jquery/jquery      |  1587 |
| jashkenas/backbone |  1463 |
| ionic-team/stencil |  1101 |
...
30 rows in set
Time: 7.809s

Then, we just needed to render the result with Apache ECharts into a more visual chart, as shown below.


JavaScript frameworks with the most issue creators

Note: You can click REQUEST INFO in the upper right corner of each chart to get the SQL command behind each result.

Feedback: People love it!

Since we released OSS Insight on May 3, we have received loud applause on social media, via emails and private messages, from many developers, engineers, researchers, and people who are passionate about the open source community across various companies and industries.

I am more than excited and grateful that so many people find OSS Insight interesting, helpful, and valuable. I am also proud that my team made such a wonderful product in such a short time.


Applause given by developers and organizations on Twitter

Lessons learned

Looking back at the process of building this website, we learned many refreshing lessons.

First, quick doesn’t mean dirty, as long as we make the right choices. Building an insight tool in just one week was tricky, but thanks to wonderful, ready-made open source projects such as TiDB, Docusaurus, and ECharts, we made it happen efficiently and without compromising quality.

Second, it’s crucial to select the right database—especially one that supports SQL. TiDB is a distributed SQL database with great scalability that can handle both transactional and real-time analytical workloads. With its help, we can process billions of rows of data with ease, and use SQL commands to execute complicated real-time queries. Further, using TiDB means we can leverage its resources to go to market faster and get feedback promptly.

If you like our project or are interested in joining us, you’re welcome to submit your PRs to our GitHub repository. You can also follow us on Twitter for the latest information.

note

📌 Join our workshop

If you want to get your own insights, you can join our workshop and try using TiDB to support your own datasets.

· 7 min read
Mini256

When it comes to GitHub, we often see "fake" GitHub users who are always enthusiastic and active, giving timely feedback to project maintainers and contributors, and helping developers with tasks that can be automated. Yes, the next thing I want to discuss is GitHub bots.

Overview

In the OSSInsight project, we have developed a number of metrics to provide insight into open source projects. When developing some open source project metrics, we always consider excluding bot-generated actions or events from the metric calculation.

However, we can't ignore the contribution of bots to the open source domain, and it's important to shift our thinking and look at what bots are doing on GitHub.

GitHub bots help developers do a lot of work:

  • Issue triage and management (for example, stale[bot] and todo[bot]).
  • Code review, security auditing, and quality inspection (for example, snyk-bot).
  • Format checking, such as ensuring license agreements are signed or commit messages are semantic (for example, CLAassistant).
  • Integration with third-party systems, including Jira, Slack, Jenkins, and so on.
  • Acting as an agent to help contributors perform operations that require permissions on the repository (for example, k8s-ci-bot and ti-chi-bot).

Looking at the historical data, we can see that the number of GitHub bots has grown significantly faster since 2019 (on average, 20,000 new bots are created each year).

· 3 min read
Jagger

In this chapter, we will share with you some of the top JavaScript framework repos (JSF repos) on GitHub in 2021, measured by different metrics including the number of stars, PRs, contributors, countries, regions, and so on.

Note:

  1. You can move your cursor onto any of the repository bars/lines on the chart and get the exact number.
  2. The SQL commands above each chart are what we use on our TiDB Cloud to get the analytical results. Try those SQL commands by yourselves on TiDB Cloud with this 10-minute tutorial.

Star history of top JavaScript Framework repos since 2011

The number of stars is often considered a measure of a GitHub repository's popularity. We sorted all JavaScript framework repositories on GitHub by their total number of stars since 2011. To visualize the results more intuitively, we display the top 10 repositories in an interactive line chart.
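The ranking query behind this chart can be sketched as follows. GitHub records a star as a WatchEvent, and collection id 10005 is the JavaScript framework collection used elsewhere on this site; the production query may differ in its details:

-- A sketch of the star ranking. GitHub records a star as a WatchEvent;
-- collection 10005 is the JavaScript framework collection. The
-- production query may differ in details.
SELECT
    ci.repo_name,
    COUNT(*) AS stars
FROM github_events ge
     JOIN collection_items ci ON ge.repo_id = ci.repo_id
WHERE ge.type = 'WatchEvent'
  AND ci.collection_id = 10005
GROUP BY ci.repo_name
ORDER BY stars DESC
LIMIT 10;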

· 2 min read
Jagger

In this chapter, we will share with you some of the top low-code development tools repos (LCDT repos) on GitHub in 2021 measured by different metrics including the number of stars, PRs, contributors, countries, regions and so on.

· 6 min read
Jagger

On this page, we will share with you many deep insights into open source databases, such as database popularity, contributors, coding vitality, community feedback, and so on.

We’ll also share the SQL commands that generate all these analytical results above each chart, so you can use them on your own on TiDB Cloud following this 10-minute tutorial.

· 2 min read
Jagger

In this chapter, we will share with you some of the top programming language repos (PL repos) on GitHub in 2021 measured by different metrics including the number of stars, PRs, contributors, countries, regions and so on.

· 3 min read
Jagger

In this chapter, we will share with you some of the top Web Framework repos (WF repos) on GitHub in 2021 measured by different metrics including the number of stars, PRs, contributors, countries, regions and so on.

· 10 min read
Fendy Feng

4.6 billion is literally an astronomical figure. The richest star map of our galaxy, produced by the Gaia space observatory, includes just under 2 billion stars. What does a view of 4.6 billion GitHub events really look like? What secrets and value can be discovered in such an enormous amount of data?

Here you go: OSSInsight.io can help you find the answer. It’s a useful insight tool that can give you the most up-to-date open source intelligence, and help you deeply understand any single GitHub project or quickly compare any two projects by digging deep into 4.6 billion GitHub events in real time. Here are some ways you can play with it.

Compare any two GitHub projects

Do you wonder how different projects have performed and developed over time? Which project is worthy of more attention? OSSInsight.io can answer your questions via the Compare Projects page.

Let’s take the Kubernetes repository (K8s) and Docker’s Moby repository as examples and compare them in terms of popularity and coding vitality.

Popularity

To compare the popularity of two repositories, we use multiple metrics including the number of stars, the growth trend of stars over time, and stargazers’ geographic and employment distribution.

Number of stars

The line chart below shows the accumulated number of stars of K8s and Moby each year. According to the chart, Moby was ahead of K8s until late 2019. Moby's star growth slowed after 2017, while K8s has kept a steady growth pace.

The star history of K8s and Moby
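A star-history query for the two repositories can be sketched like this (repository names as stored by GH Archive; the production query may differ):

-- A sketch of the cumulative star history per year for K8s and Moby.
SELECT
    repo_name,
    YEAR(created_at) AS year,
    SUM(COUNT(*)) OVER (PARTITION BY repo_name
                        ORDER BY YEAR(created_at)) AS cumulative_stars
FROM github_events
WHERE type = 'WatchEvent'
  AND repo_name IN ('kubernetes/kubernetes', 'moby/moby')
GROUP BY repo_name, YEAR(created_at)
ORDER BY repo_name, year;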

Geographical distribution of stargazers

The map below shows the geographical distribution of Moby and K8s stargazers. As you can see, their stargazers are scattered around the world, with the majority coming from the US, Europe, and China.

The geographical distribution of K8s and Moby stargazers

Employment distribution of stargazers

The chart below shows the stargazers' employers for K8s (red) and Moby (dark blue). Both repositories' stargazers work in a wide range of industries, and most come from leading tech companies such as Google, Tencent, and Microsoft. The difference is that the top two employers of K8s' stargazers are Google and Microsoft from the US, while Moby's top two are Tencent and Alibaba from China.

The employment distribution of K8s and Moby stargazers
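As a hedged sketch, a query like the following could produce this chart. It assumes a users table with login and company columns filled from GitHub profiles; the real table name and data-cleaning logic may differ:

-- A hypothetical sketch of the employment distribution query.
-- Assumes a users table with login and company columns; the real
-- table name and data-cleaning logic may differ.
SELECT
    u.company,
    COUNT(DISTINCT ge.actor_login) AS stargazers
FROM github_events ge
     JOIN users u ON ge.actor_login = u.login
WHERE ge.type = 'WatchEvent'
  AND ge.repo_name = 'kubernetes/kubernetes'
GROUP BY u.company
ORDER BY stargazers DESC
LIMIT 10;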

· 3 min read
ilovesoup

Providing insights on a large volume of email data might not be as easy as we think. While data comes in, in real time, indices and metadata must be built consistently. To make things worse, the data volume is beyond the reach of traditional single-node databases.

Background

Storing large volumes of real-time user data like email and providing insights on it is not easy. Suppose your application is layered on top of Gmail to automatically extract and organize the useful information buried inside our inboxes.

As your customer base rapidly increases, it becomes clear that you need a better system for organizing terabytes of email metadata to power collaboration. You need to organize email data by first applying a unique identifier to the emails and then proactively indexing the email metadata. The unique identifier is what connects the same email headers across messages. For each email inserted in real time, the system extracts meta information from it and builds indices for highly concurrent access. When the data volume is small, it's all good: traditional databases provide all you need. However, when the data size grows beyond a single node's capacity, everything becomes very hard.

Potential Database Solutions

Regarding databases, there are some options you might consider:

  1. NoSQL databases. While fairly scalable, they do not provide indexing and comprehensive query capabilities. You might end up implementing these in your application code.
  2. A sharded cluster of databases. Designing the sharding key and working around the limitations between shards is painful. It might be fine for applications with simple schema designs, but it is too complicated for a CRM. Moreover, it's very hard to maintain.
  3. Analytical databases. They are fine for dashboards and reporting, but not for highly concurrent updates and index-based serving.

How to get real-time insights

TiDB is a distributed database with the user experience of a traditional database. It looks like a super large MySQL, without the limitations of NoSQL and sharded cluster solutions. With TiDB, the base information, indices, and metadata can be updated concurrently with the help of its cross-node transaction capability.

To build such a system, you just need the following steps:

  1. Create schemas according to your access patterns, with indices on user name, organization, job title, and so on (see the sketch after this list).
  2. Use a streaming system to gather and extract meta information from your base data.
  3. Insert the data into TiDB via an ordinary MySQL client driver such as JDBC. You might want to gather data in small batches of hundreds of rows to speed up ingestion. Within a single transaction, updates to base data, indices, and meta information are guaranteed to be consistent.
  4. Optionally, deploy a couple of TiFlash nodes to speed up large-scale reporting queries.
  5. Access the data just like in MySQL, and you are all done. SQL features for analytics such as aggregations, multi-joins, and window functions are all supported with great performance.
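As a minimal sketch of steps 1 and 3 (all table and column names here are hypothetical):

-- Step 1: schema with secondary indexes on the fields you query by.
-- All names are hypothetical.
CREATE TABLE email_metadata (
    email_uid    VARCHAR(64)  NOT NULL,  -- links the same email across inboxes
    user_name    VARCHAR(128),
    organization VARCHAR(128),
    job_title    VARCHAR(128),
    received_at  DATETIME,
    PRIMARY KEY (email_uid),
    KEY idx_user (user_name),
    KEY idx_org (organization),
    KEY idx_title (job_title)
);

-- Step 3: insert in small batches; within one transaction, the base
-- rows and all secondary indexes are updated consistently.
INSERT INTO email_metadata
    (email_uid, user_name, organization, job_title, received_at)
VALUES
    ('uid-001', 'alice', 'Acme Corp', 'CTO', NOW()),
    ('uid-002', 'bob', 'Globex', 'Engineer', NOW());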

For more cases, please see here.

info

🌟 Details in how OSS Insight works

Go to read Use TiDB Cloud to Analyze GitHub Events in 10 Minutes and use the Developer Tier free for 1 year.

You can find out how we deal with massive GitHub data in Data Preparation for Analytics as well!