Skip to main content

Step 3: Get Insights with SQL

Now, the data is ready, let's try to analyze these data with SQL!

1. Get Insights with Raw SQL!

a. Can't Wait!

i. Example 1

Let's learn the data!

SELECT
*
FROM
github_events
LIMIT
1

Result:

***************************[ 1. row ]***************************
id | 0
type | PushEvent
created_at | 2014-04-24 20:45:30
repo_id | 17139664
repo_name | recserv.uiowa.edu
actor_id | 0
actor_login | jwfuller
language |
additions | 0
deletions | 0
action |
number | 0
commit_id |
comment_id | 0
org_login | ITS-UofIowa
org_id | 0
state |
closed_at | 1970-01-01 00:00:00
comments | 0
pr_merged_at | 1970-01-01 00:00:00
pr_merged | 0
pr_changed_files | 0
pr_review_comments | 0
pr_or_issue_id | 0
event_day | 2014-04-24
event_month | 2014-04-01
event_year | 2014
push_size | 14
push_distinct_size | 0
creator_user_login |
creator_user_id | 0
pr_or_issue_created_at | 1970-01-01 00:00:00

ii. Example 2

In addition to using aggregation functions for simple statistics, sometimes we need to conduct more complex data analysis, such as calculating percentage points, moving average, cumulative total, etc.

Question

If you need to analyze the star growth trend of the specified repository (Fox example: pingcap/tidb) according to the existing data, can you use one SQL to complete it.

At this time, if we only use aggregation functions, our data analysis SQL will become complex, or we need to complete it through additional program processing.

TiDB provides common table expression and window function features to help us greatly simplify SQL queries.

SELECT
event_month, stars
FROM (
SELECT
date_format(created_at, '%Y-%m-01') as event_month,
COUNT(1) OVER (ORDER BY date_format(created_at, '%Y-%m-01')) AS stars,
ROW_NUMBER() OVER (PARTITION BY date_format(created_at, '%Y-%m-01')) AS row_num
FROM
github_events
WHERE
type = 'WatchEvent'
AND repo_id = 41986369
ORDER BY 1
) sub
WHERE row_num = 1
;

Reading more:

iii. Example 3

The below SQL is picked from blog - Love, Code, and Robot, it explains how robots grow on GitHub:

WITH bots_with_first_seen AS (
SELECT
actor_login, MIN(YEAR(created_at)) AS first_seen_at
FROM github_events ge
WHERE
actor_login REGEXP '^(bot-.+|.+bot|.+\\[bot\\]|.+-bot-.+|robot-.+|.+-ci-.+|.+-ci|.+-testing|.+clabot.+|.+-gerrit|k8s-.+|.+-machine|.+-automation|github-.+|.+-github|.+-service|.+-builds|codecov-.+|.+teamcity.+|jenkins-.+|.+-jira-.+)$'
GROUP BY actor_login
ORDER BY first_seen_at
), acc AS (
SELECT
COUNT(actor_login) OVER (ORDER BY first_seen_at) AS cnt,
first_seen_at AS event_year
FROM
bots_with_first_seen AS bwfs
ORDER BY event_year
)
SELECT ANY_VALUE(cnt) AS bots_total, event_year
FROM acc
GROUP BY event_year
ORDER BY event_year;

Result:

(Remember there are only 240k rows stored in your Serverless Tier database.)

+------------+------------+
| bots_total | event_year |
+------------+------------+
| 3 | 2015 |
| 10 | 2016 |
| 14 | 2017 |
| 21 | 2018 |
| 40 | 2019 |
| 54 | 2020 |
| 73 | 2021 |
| 82 | 2022 |
+------------+------------+
8 rows in set (1.99 sec)

mysql>


"It works, but could it be faster?", "Of course!"


b. !!! Set Column-Oriented Storage Replica

This step is important enough that it adds column-oriented-storage ability to TiDB - We call it TiFlash. The tiup playground installed 1 TiFlash node by default, what we need to do is just make data STORED in these replica node too.

i. It's easy to set TiFlash replica, different from other software, TiDB uses SQL to take such changes into effect:

ALTER TABLE gharchive_dev.github_events SET TIFLASH REPLICA 1;

ii. Setting a TiFlash replica will take you some time, so you can use the following SQL statements to check if the procedure is done or not.

SELECT * FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'gharchive_dev' and TABLE_NAME = 'github_events';

If the result you get is same as follow, that means the procedure is done.

mysql> SELECT * FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'gharchive_dev' and TABLE_NAME = 'github_events';
+---------------+---------------+----------+---------------+-----------------+-----------+----------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_ID | REPLICA_COUNT | LOCATION_LABELS | AVAILABLE | PROGRESS |
+---------------+---------------+----------+---------------+-----------------+-----------+----------+
| gharchive_dev | github_events | 68 | 1 | | 1 | 1 |
+---------------+---------------+----------+---------------+-----------------+-----------+----------+
1 row in set (0.27 sec)

mysql>

Now try to execute the former robot sql(start with: WITH bots_with_first_seen) again, result:

WITH bots_with_first_seen AS (
SELECT
actor_login, MIN(event_year) AS first_seen_at
FROM github_events ge
WHERE
actor_login REGEXP '^(bot-.+|.+bot|.+\\[bot\\]|.+-bot-.+|robot-.+|.+-ci-.+|.+-ci|.+-testing|.+clabot.+|.+-gerrit|k8s-.+|.+-machine|.+-automation|github-.+|.+-github|.+-service|.+-builds|codecov-.+|.+teamcity.+|jenkins-.+|.+-jira-.+)$'
GROUP BY actor_login
ORDER BY first_seen_at
), acc AS (
SELECT
COUNT(actor_login) OVER (ORDER BY first_seen_at) AS cnt,
first_seen_at AS event_year
FROM
bots_with_first_seen AS bwfs
ORDER BY event_year
)
SELECT ANY_VALUE(cnt) AS bots_total, event_year
FROM acc
GROUP BY event_year
ORDER BY event_year;

It is SUPER faster!


We won't talk too much about how to get USEFUL insights from such a big data as we are also students in open source software field. All SQLs in this project can be found on page, just click the SHOW SQL button on top-right of each chart to get the corresponding SQL.

2. Well done!

Visit http://localhost:30000 to check the mini OSS Insight site.

If you have any question, please contact ossinsight@pingcap.com.

Video Thumbnail