The OSS Insight website displays the data changes of GitHub events in real time. GitHub events are activities triggered by user actions on GitHub, for example, commenting and forking a repository. In nearly seven weeks, GitHub events increased by about 150 million, from 4.7 billion to 4.85 billion. GitHub events are booming!
This post dives deeply into GitHub event trending, why GitHub events are surging, and whether GitHub's architecture can handle the increasing load.
The OSS Insight database includes all the GitHub events since 2011. When we plot the number of events by year, we can see that since 2018 they have been increasing rapidly.
GitHub event trending
The figure below shows how long it takes to grow each billion events in GitHub.
The time to reach a billion GitHub events
It's taking less and less for GitHub to generate 1 billion events. It took more than 6 years for the first billion events and only 13 months for the last billion!
The secret behind the exponential growth of GitHub events
GitHub Actions was released in October 2018. Since August 2019, it has supported continuous integration and continuous delivery (CI/CD), and it has been free for open source projects. Therefore, projects hosted on GitHub can automate their own development workflows, and a large number of automation-related bot applications have appeared on GitHub Marketplace. Could GitHub events' data growth be related to these?
To find the answer, we divided the events into data from humans and data from bots and plotted them with the following histogram. The blue columns represent the human data, and the yellow columns represent the bot data.
Bot events vs. human events
As you can see, the proportion of GitHub bot events has increased each year. In 2015, they were only 1.23% of all events. In early July of this year, they reached 13.2%. To show the data changes of bot events more clearly, we made the following line chart.
For now, rough statistics find that there are more than 95,620 bots on GitHub. The number doesn't seem like so much, but wait...
These 95 thousand bot accounts generated 603 million events. These events account for 12.82% of all public events on GitHub, and these GitHub robots have served over 18 million open source repositories.
Bots are playing an increasingly important role on GitHub. Many projects are handing over automated work to bots. We expect that GitHub events will grow faster in the future.
How many GitHub events will there be by the end of 2022? We fit predictions to GitHub historical data.
Human event fit (left) vs. bot event fit (right)
It's estimated that by the end of 2022, GitHub events will reach 5.36 billion.
GitHub event prediction
According to this prediction, GitHub events will exceed 10 billion in February 2025.
GitHub events will exceed 10 billion in 2025
Can MySQL sharding support such a huge amount of data?
GitHub uses MySQL as the main storage for all non-git warehouse data. The rapid growth of data volume poses a great challenge to GitHub's high availability. In March 2022, GitHub had 3 service disruptions, each lasting 2-5 hours. The official investigation report shows the MySQL database caused the outages. During peak load periods, the GitHub mysql1 database (the main database cluster in GitHub) load increased. Therefore, database access reached the maximum number of connections. This affected the performance of many GitHub services and features.
In fact, over the past few years GitHub has optimized its databases. For example, it added clusters to support platform growth and partitioned the main database. But these improvements did not fundamentally solve the problem. In the near future, GitHub events will exceed 5 billion, or even 10 billion. Can MySQL sharding support such data surge?
In early January 2022, Max, our CEO, a big fan of open-source, asked if my team could build a small tool to help us understand all the open-source projects on GitHub; and, that if everything worked well, we should open the API to help open source developers to build better insights. In fact, GitHub continuously publishes the public events in its open-source world through the open API. (Thank you and well done! Github). We can certainly learn a lot from the data!
I was excited about this project until Max said: “You’ve only got one week.” Well, the boss is the boss! Although time was tight and we were faced with multiple head-aching problems, I decided to take up this challenge.
Headache 1: we need both historical and real-time data.
After some quick research, we found GHArchive, an open-source project that collects and archives all GitHub data from 2011 and updates it hourly. By the way, a lot of open-source analytical tools such as CNCF's Devstats rely on GH Archive, too.
Thanks to GH Archive, we found the data source.
But there's another problem: hourly data is good, but not good enough. We wanted our data to be updated in real time—or at least near real time. We decided to directly use the GitHub event API, which collects all events that have occurred within the past hour.
By combining the data from the GH Archive and the GitHub event API, we can gain streaming, real-time event updates.
After we decompressed all the data from GH Archive, we found there were more than 4.6 billion rows of GitHub events. That’s a lot of data! We also noticed that about 300,000 rows were generated and updated each hour.
The data volume of GitHub events occurred after 2011
The database solution would be tricky here. Our goal is to build an application that provides real-time data insights based on a continuously growing dataset. So, scalability is a must. NoSQL databases can provide good scalability, but what follows is how to handle complex analytical queries. Unfortunately, NoSQL databases are not good at that.
Another option is to use an OLAP database such as ClickHouse. ClickHouse can handle the analytical workload very well, but it is not designed for serving online traffic. If we chose it, we would need another database for the online traffic.
What about sharding the database and then building an extract, transform, load (ETL) pipeline to synchronize the new events to a data warehouse? This sounds workable.
How a RDBMS handles the GitHub data
According to our product manager's (PM’s) plan, we needed to do some repo-specific or user-specific analysis. Although the total data volume was huge, the number of events was not too large for a single project or user. This meant using the secondary indexes in RDBMS would be a good idea. But, if we decided to use the above architecture, we had to be careful in selecting the database sharding key. For example, if we use user_id as the sharding key, then queries based on repo_id will be very tricky.
Another requirement from the PM was that our insight tool should provide OpenAPI, which meant we would have unpredictable concurrent traffic from the outside world.
Since we're not experts on Kafka and data warehouses, mastering and building such an infrastructure in just one week was a very difficult task for us.
The choice is obvious now, and don't forget PingCAP is a database company! TiDB seems a perfect fit for this, and it's a good chance to eat our own dog food. So, why not using TiDB! :)
If we use TiDB, can we get:
SQL support, including complex & flexible queries? ☑️
Secondary index support for fast lookup? ☑️
Capability for online serving? ☑️
Wow! It seems we got a winner!
By using the secondary index, TiDB scanned 29,639 rows (instead of 4.6 billion rows) GitHub events in 4.9 ms
To choose a database to support an application like OSS Insight, we think TiDB is a great choice. Plus, its simplified technology stack means a faster go-to-market and faster delivery of my boss' assignment.
After we used TiDB, we got a simplified architecture as shown below.
Just as the subtitle indicates, we have a very “pushy” PM, which is not always a bad thing. :) His demands kept extending, from the single project analysis at the very beginning to the comparison and ranking of multiple repositories, and to other multidimensional analysis such as the geographical distribution of stargazers and contributors. What’s more pressing was that the deadlines stayed unchanged!!!
We had to keep a balance between the growing demands and the tight deadlines.
To save time, we built our website using Docusaurus, an open source static site generator in React with scalability, rather than building a site from scratch. We also used Apache Echarts, a powerful charting library, to turn analytical results into good-looking and easy-to-understand charts.
We chose TiDB as the database to support our website, and it perfectly supports SQL. This way, our back-end engineers could write SQL commands to handle complex and flexible analytical queries with ease and efficiency. Then, our front-end engineers would just need to display those SQL execution results in the form of good-looking charts.
Finally, we made it. We prototyped our tool in just one week, and named it OSS Insight, short for open source software insights. We continued to fine-tune it, and it was officially released on May 3.
Let's use one example to show you how we deal with complex analytical queries.
This is an analytical query that includes aggregation and ranking. To get the result, we only need to execute one SQL statement:
SELECT ci.repo_name AS repo_name, COUNT(distinct actor_login)AS num FROM github_events ge JOIN collection_items ci ON ge.repo_id = ci.repo_id JOIN collections c ON ci.collection_id = c.id WHERE type='IssuesEvent' ANDaction='opened' AND c.id =10005 -- Exclude Bots and actor_login notlike'%bot%' and actor_login notin(select login from blacklist_users) GROUPBY1 ORDERBY2DESC ;
In the statement above, the collections and collection_items tables store the data of all GitHub repository collections in various areas. Each table has 30 rows. To get the order of issue creators, we need to associate the repository ID in the collection_items table with the real, 4.6-billion-row github_events table as shown below.
Next, let's look at the execution plan. TiDB is compatible with MySQL syntax, so its execution plan looks very similar to that of MySQL.
In the figure below, notice the parts in red boxes. The data in the table collection_items is read through distributed[row], which means this data is processed by TiDB’s row storage engine, TiKV. The data in the table github_events is read through distributed[column], which means this data is processed by TiDB’s columnar storage engine, TiFlash. TiDB uses both row and columnar storage engines to execute the same SQL statement. This is so convenient for OSS Insight because it doesn’t have to split the query into two statements.
After we released OSS Insight on May 3, we have received loud applause on social media, via emails and private messages, from many developers, engineers, researchers, and people who are passionate about the open source community in various companies and industries.
I am more than excited and grateful that so many people find OSS Insight interesting, helpful, and valuable. I am also proud that my team made such a wonderful product in such a short time.
Applause given by developers and organizations on Twitter
Looking back at the process we used to build this website, we have learned many mind-refreshing lessons.
First, quick doesn’t mean dirty, as long as we make the right choices. Building an insight tool in just one week is tricky, but thanks to those wonderful, ready-made, and open source projects such as TiDB, Docusaurus, and Echarts, we made it happen with efficiency and without compromising the quality.
Second, it’s crucial to select the right database—especially one that supports SQL. TiDB is a distributed SQL database with great scalability that can handle both transactional and real-time analytical workloads. With its help, we can process billions of rows of data with ease, and use SQL commands to execute complicated real-time queries. Further, using TiDB means we can leverage its resources to go to market faster and get feedback promptly.
If you like our project or are interested in joining us, you’re welcome to submit your PRs to our GitHub repository. You can also follow us on Twitter for the latest information.
When it comes to GitHub, we often see fake GitHub users who are always enthusiastic and active, giving timely feedback to project maintainers and contributors, and helping developers with tasks that can be automated. Yes, the next thing I want to discuss is something about GitHub bots.
In the OSSInsight project, we have developed a number of metrics to provide insight into open source projects. When developing some open source project metrics, we always consider excluding bot-generated actions or events from the metric calculation.
However, We can't ignore the contribution of robots in the domain of open source, and it's important to shift our thinking to look at what bots are doing on GitHub.
GitHub bots help developers do a lot of work:
Issue triage and management. (For example: stale[bot]、todo[bot])
Code review, security audit and quality inspection (For example, snyk-bot).
Format checking like ensuring license agreement signing, or make sure commit messages semantic. (For example: CLAassistant)
Integration with third-party systems, including Jira, Slack, Jenkins and so on.
As an agent to help contributor perform some operations needed permission on the repository. (For example: k8s-ci-bot、ti-chi-bot)
You can move your cursor onto any of the repository bars/lines on the chart and get the exact number.
The SQL commands above each chart are what we use on our TiDB Cloud to get the analytical results. Try those SQL commands by yourselves on TiDB Cloud with this 10-minute tutorial.
In this chapter, we will share with you some of the top low-code development tools repos (LCDT repos) on GitHub in 2021 measured by different metrics including the number of stars, PRs, contributors, countries, regions and so on.
In this chapter, we will share with you some of the top programming language repos (PL repos) on GitHub in 2021 measured by different metrics including the number of stars, PRs, contributors, countries, regions and so on.
In this chapter, we will share with you some of the top Web Framework repos (WF repos) on GitHub in 2021 measured by different metrics including the number of stars, PRs, contributors, countries, regions and so on.
4.6 billion is literally an astronomical figure. The richest star map of our galaxy, brought by Gaia space observatory, includes just under 2 billion stars. What does a view of 4.6 billion GitHub events really look like? What secrets and values can be discovered in such an enormous amount of data?
Here you go: OSSInsight.io can help you find the answer. It’s a useful insight tool that can give you the most updated open source intelligence, and help you deeply understand any single GitHub project or quickly compare any two projects by digging deep into 4.6 billion GitHub events in real time. Here are some ways you can play with it.
The line chart below shows the accumulated number of stars of K8s and Moby each year. According to the chart, Moby was ahead of K8s until late 2019. The star growth of Moby slowed after 2017 while K8s has kept a steady growth pace.
The chart below shows the stargazers’ employment of K8s (red) and Moby (dark blue). Both of their stargazers work in a wide range of industries, and most come from leading dotcom companies such as Google, Tencent, and Microsoft. The difference is that the top two companies of K8s’ stargazers are Google and Microsoft from the US, while Moby’s top two followers are Tencent and Alibaba from China.
The employment distribution of K8s and Moby stargazers
Providing insights on large volume of email data might not be as easy as we thought. While data coming in real-time, indices and metadata are to be built consistently. To make things worse, the data volume is beyond traditional single node databases' reach.
To store large volumes of real-time user data like email and provide insights is not easy. If your application is layered on top of Gmail to automatically extract and organize the useful information buried inside of our inboxes.
It became clear that they were going to need a better system for organizing terabytes of email metadata to power collaboration as their customer base rapidly increased, it is not easy to provide insights. You need to organize email data by first applying a unique identifier to the emails and then proactively indexing the email metadata. The unique identifier is what connects the same email headers across. For each email inserted in real-time, the system extracts meta information from it and builds indices for high concurrent access. When data volume is small, it's all good: traditional databases provide all you need. However, when data size grows beyond a single node's capacity, everything becomes very hard.
Regarding databases, there are some options you might consider:
NoSQL database. While fairly scalable, it does not provide you indexing and comprehensive query abilities. You might end up implementing them in your application code.
Sharing cluster of databases. Designing sharding key and paying attention to the limitations between shards are painful. It might be fine for applications with simple schema designs, but it will be too complicated for CRM. Moreover, it's very hard to maintain.
Analytical databases. They are fine for dashboard and reporting. But not fine for high concurrent updates and index based serving.
TiDB is a distributed database with user experience of traditional databases. It looks like a super large MySQL without the limitations of NoSQL and sharding cluster solutions. With TiDB, you can simply have the base information, indices and metadata being updated in a concurrent manner with the help of cross-node transaction ability.
To build such a system, you just need following steps:
Create schemas according to your access pattern with indices on user name, organization, job title etc.
Use streaming system to gather and extract meta information from your base data
Insert into TiDB via ordinary MySQL client driver like JDBC. You might want to gather data in small batches of hundreds of rows to speed up ingestion. In a single transaction, updates on base data, indices and meta information are guaranteed to be consistent.
Optionally, deploy a couple of TiFlash nodes to speed up large scale reporting queries.
Access the data just like in MySQL and you are all done. SQL features for analytics like aggregations, multi-joins or window functions are all supported with great performance.