Typical Pitfalls of Analytics Using Git Data as an Example

Author: Alexey Milovidov, 2020-11-13.


Episode 1

Taking data from Git to analyze development processes.

Using this task as an example, I will talk about:

— trivial facts
— and truisms that everyone already knows.

What is Git

What is the difference between Git and GitHub?

Git — a distributed version control system.

GitHub — a centralized hosting service for Git repositories
+ a social network for developers.

What We Want to Get

— files with the maximum and minimum number of authors;
— the oldest lines of code in the repository;
— files with the longest history;
— favorite files for each author;
— large files with a small number of authors;
— on which days of the week is it better to write code,
  so that it has less chance of being deleted;
— files sorted by code age;
— who wrote the most code;
— who wrote the most code that wasn't deleted;
— whose code is most often rewritten;
— find pairs of authors where one writes code and another deletes it;
— median time for deleting lines of code;
— which code was rewritten the most;
— who likes writing comments or tests more;


https://xkcd.com/1597/

Git Data Model

What does a Git repository consist of?

A repository consists of blobs, trees, and commits.

Everything has a hash that uniquely identifies the content.

Blob — file content.

Tree — a snapshot of directory contents, references trees and blobs.

Commit — a reference to a tree and parent commits.

Commits reference each other and form a graph.

If a commit has more than one parent — it's a merge.

Git Data Model

Important notes:

Git doesn't store diffs between commits.
It stores snapshots of file contents, and diffs are computed on the fly.

Git doesn't store line authorship. It's computed on the fly by viewing the history in sequence until the needed diff is found.

Git doesn't have linear history. There's no way to represent a repository as a sequence of patches while preserving authorship.

git log is just one way to traverse the commit graph.

What to Do With This?

Put everything into a relational DB* and run SELECT queries.

But Git's data model is not relational
— we'll have to make compromises.

* of course in ClickHouse.

Method 1

Take the current state of the repository.

Calculate blame for each file (git blame)
— authorship and commit time of each line.

Put each line of each file into a table.

The data volume is simply the LOC count.

ClickHouse:      732,048 lines (without contrib)
LLVM:          9,809,148 lines
Chromium:     15,063,229 lines (without third_party)
Linux Kernel: 21,645,046 lines

Yandex internal repository: ??? lines.

Method 1

— files with the maximum and minimum number of authors;
— the oldest lines of code in the repository;
— files with the longest history;
— favorite files for each author;
— large files with a small number of authors;
— on which days of the week is it better to write code,
  so that it has less chance of being deleted;
— files sorted by code age;
— who wrote the most code;
— who wrote the most code that wasn't deleted;
— whose code is most often rewritten;
— find pairs of authors where one writes code and another deletes it;
— median time for deleting lines of code;
— which code was rewritten the most;
— who likes writing comments or tests more;

Blame only sees the lines that still exist, so the questions about deleted
and rewritten code cannot be answered from this data.

Method 2

Take all commits.
For each commit, take the repository state.

Data volume, upper bound estimate:
number of lines now * number of commits / 2

ClickHouse — 732,048 lines * 51,055 commits / 2 = 18.6 billion.
Linux Kernel — 21,645,046 lines * 950,019 commits / 2 = 10 trillion.

Everything can be calculated, but there's too much data :(

Method 3

Take all commits.

For each commit, get the diff (git show).

We get added and deleted lines of code.

Write everything to the database!

Data volume: several times the LOC
... depending on how many times the code was rewritten on average.

Method 3

Disadvantages:

For merge commits, the diff is non-trivial:
it shows changes relative to several parents.

Solution: just skip merge commits.

Changes made to resolve conflicts won't be included in the statistics.

Statistics will be inaccurate... oh well.

But we have change history and can calculate everything!

Defining the Data Model

Three tables:

commits
  hash String,
  author LowCardinality(String),
  time DateTime,
  message String,
  ...
file_changes
  change_type Enum('Add' = 1, 'Delete' = 2, 'Modify' = 3, 'Rename' = 4...),
  path LowCardinality(String),
  old_path LowCardinality(String),
  file_extension LowCardinality(String),
  ...
line_changes
  sign Int8,
  line_number_old UInt32,
  line_number_new UInt32,
  line LowCardinality(String),
  indent UInt8,
  line_type Enum('Empty' = 0, 'Comment' = 1, 'Punct' = 2, 'Code' = 3),
  ...

Defining the Data Model

Principles:

1. Denormalization.
In each file_changes row, duplicate all commit properties.
In each line_changes row, duplicate all file_changes properties plus two commits: the current one and the previous one (from blame).

2. Enrichment with statistics.
In each commits row, also add:
- number of added and deleted files;
- number of added and deleted lines;

3. Pre-computation of individual details.
- calculate the indentation level of the line and put it in a separate column;
- calculate whether the line is empty, a comment, code...;
- save the file extension in a separate column too.

Defining the Data Model

Principles:

1. Denormalization.

2. Enrichment with statistics.

3. Pre-computation of individual details.

These principles are contraindicated for transactional DBs
  and very good for analytical DBs.
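
For example, here is a minimal sketch (mine, not a query from the talk) of the
"who likes writing comments more" question. It assumes the line_changes table
sketched above, with author and file_extension denormalized into it and line_type
precomputed, so no JOINs are needed:

-- Share of comment lines among each author's added C++ lines.
-- countIf and alias reuse are ClickHouse features; the HAVING threshold is arbitrary.
SELECT
    author,
    countIf(line_type = 'Comment') AS comments,
    countIf(line_type = 'Code') AS code,
    round(100 * comments / (comments + code), 1) AS comment_share
FROM line_changes
WHERE sign = 1 AND file_extension IN ('h', 'cpp')
GROUP BY author
HAVING code > 10000
ORDER BY comment_share DESC
LIMIT 20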

Defining the Data Model

CREATE TABLE git.line_changes
(
    sign Int8,
    line_number_old UInt32,
    line_number_new UInt32,
    hunk_num UInt32,
    hunk_start_line_number_old UInt32,
    hunk_start_line_number_new UInt32,
    hunk_lines_added UInt32,
    hunk_lines_deleted UInt32,
    hunk_context LowCardinality(String),
    line LowCardinality(String),
    indent UInt8,
    line_type Enum('Empty' = 0, 'Comment' = 1, 'Punct' = 2, 'Code' = 3),

    prev_commit_hash String,
    prev_author LowCardinality(String),
    prev_time DateTime,

    file_change_type Enum('Add' = 1, 'Delete' = 2, 'Modify' = 3, 'Rename' = 4, 'Copy' = 5, 'Type' = 6),
    path LowCardinality(String),
    old_path LowCardinality(String),
    file_extension LowCardinality(String),
    file_lines_added UInt32,
    file_lines_deleted UInt32,
    file_hunks_added UInt32,
    file_hunks_removed UInt32,
    file_hunks_changed UInt32,

    commit_hash String,
    author LowCardinality(String),
    time DateTime,
    commit_message String,
    commit_files_added UInt32,
    commit_files_deleted UInt32,
    commit_files_renamed UInt32,
    commit_files_modified UInt32,
    commit_lines_added UInt32,
    commit_lines_deleted UInt32,
    commit_hunks_added UInt32,
    commit_hunks_removed UInt32,
    commit_hunks_changed UInt32
) ENGINE = MergeTree ORDER BY time;
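
With this table in place, most questions from the list become a single query each.
For example, a sketch (mine, not from the talk) of the "which day of the week" question:
deletions are rows with sign = -1, and prev_time is when the deleted line was
originally written, so we can compare lines added per weekday with how many of them
were later deleted:

SELECT
    day,
    added,
    later_deleted,
    round(100 * later_deleted / added) AS removed_percent
FROM
(
    -- lines of code, grouped by the weekday on which they were added
    SELECT toDayOfWeek(time) AS day, count() AS added
    FROM git.line_changes
    WHERE sign = 1 AND line_type = 'Code'
    GROUP BY day
) AS written
INNER JOIN
(
    -- deleted lines, attributed to the weekday on which they were originally written
    SELECT toDayOfWeek(prev_time) AS day, count() AS later_deleted
    FROM git.line_changes
    WHERE sign = -1 AND line_type = 'Code'
    GROUP BY day
) AS removed USING (day)
ORDER BY day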

Writing a Script

I tried to write a script in bash but failed.

Wrote a script in C++ :) (could have been in Python).

This is not a recommendation.
You are analysts. Write in whatever language you want.

https://github.com/ClickHouse/ClickHouse/pull/14471

Running the Script

clickhouse-git-import --help

Repository processing time and table sizes in ClickHouse:

- ClickHouse:  31 sec;    6M rows;  122 MB;
- LLVM:         8 min;   62M rows;  1.2 GB;
- Linux:       12 min;   85M rows;  1.7 GB;
- Chromium:    67 min;  343M rows;  6.8 GB.

Tiny data volume! Can be analyzed on a laptop.

Analyzing Data

Authors who wrote the most lines of code.

LOC — the dumbest metric of all :(

Options:
— everything written or only lines that weren't deleted?
— only added lines of code or is deleting code also important?
— how to account for code, tests, documentation?

SELECT author AS k, count() AS c
FROM line_changes
WHERE file_extension IN ('h', 'cpp')
GROUP BY k
ORDER BY c DESC
LIMIT 20

┌─k────────────────────┬───────c─┐
│ Alexey Milovidov     │ 1061697 │
│ proller              │  200704 │ <-- commits of third-party code
│ Nikolai Kochetov     │  183370 │
│ Alexey Arno          │  107018 │
│ Vitaly Baranov       │   93296 │
│ Andrey Mironov       │   92973 │
│ Guillaume Tassery    │   89530 │ <-- commit and deletion
│ alesapin             │   86999 │     of auto-generated files
│ Vitaliy Lyudvichenko │   85609 │
│ Michael Kolupaev     │   69178 │
│ CurtizJ              │   62607 │
│ chertus              │   53425 │
│ Alexey Zatelepin     │   49331 │
│ zhang2014            │   48352 │
│ Alexander Tokmakov   │   46371 │
│ alexey-milovidov     │   41518 │ <-- duplicate
│ peshkurov            │   36525 │ <-- corrupted git history,
│ Nikita Mikhaylov     │   30226 │     snapshot committed without parent.
│ Nikita Vasilev       │   28182 │
│ Artem Zuikov         │   25176 │
└──────────────────────┴─────────┘

Warnings

The query executed in 0.009 sec, but understanding why the result is garbage will take hours of investigation.

You cannot measure developer productivity
by physical properties of the repository.

Lines of code is a very poor metric.

If you aggregate data and compute a top N,
you will see, first of all, the outliers in the data.

You'll have to investigate the outliers, and the data will need to be cleaned.

A single aggregated report is not the final result.

Something More Interesting

Authors and the share of their code deleted by other authors:

SELECT
    k,
    written_code.c,
    removed_code.c,
    round(removed_code.c * 100 / written_code.c) AS remove_ratio
FROM
(
    SELECT author AS k, count() AS c
    FROM line_changes
    WHERE sign = 1
        AND file_extension IN ('h', 'cpp')
        AND line_type NOT IN ('Punct', 'Empty')
    GROUP BY k
) AS written_code
INNER JOIN
(
    SELECT prev_author AS k, count() AS c
    FROM line_changes
    WHERE sign = -1
        AND file_extension IN ('h', 'cpp')
        AND line_type NOT IN ('Punct', 'Empty')
        AND author != prev_author
    GROUP BY k
) AS removed_code USING (k)
WHERE written_code.c > 1000
ORDER BY c DESC
LIMIT 500

Something More Interesting

Authors and the share of their code deleted by other authors:

┌─k────────────────────┬──────c─┬─removed_code.c─┬─remove_ratio─┐
│ Alexey Milovidov     │ 426656 │         113306 │           27 │
│ Nikolai Kochetov     │  83479 │          16999 │           20 │
│ Andrey Mironov       │  55097 │          43989 │           80 │
│ Alexey Arno          │  49474 │          22664 │           46 │
│ Vitaliy Lyudvichenko │  47603 │          31890 │           67 │
│ Vitaly Baranov       │  44032 │           3010 │            7 │
│ alesapin             │  40617 │           8505 │           21 │
│ proller              │  36638 │          19471 │           53 │
│ Michael Kolupaev     │  31523 │          21182 │           67 │
│ peshkurov            │  25442 │          24556 │           97 │
│ chertus              │  24013 │           6961 │           29 │
│ CurtizJ              │  23684 │           7604 │           32 │
│ Alexey Zatelepin     │  22892 │          11105 │           49 │
│ Alexander Tokmakov   │  21147 │           2524 │           12 │
│ zhang2014            │  20476 │           3900 │           19 │
│ alexey-milovidov     │  16513 │          10415 │           63 │
│ Nikita Vasilev       │  14099 │           2633 │           19 │
│ Guillaume Tassery    │  12431 │           1199 │           10 │
│ Nikita Mikhaylov     │  11673 │           1447 │           12 │

Conclusions

The resulting tool is a good toy.

Metrics can be calculated from this data...
but they should not be relied on blindly.

The tool can be given to a person who will
doubt, think, and double-check the data.

Episode 2

GitHub — a centralized hosting service for Git repositories;
+ a social network for developers;
+ collaboration tools.

What's not in Git repository data:
— issues (task tracker);
— pull requests;
— code review, stars, likes and comments.

This is implemented by GitHub itself.

I want to extract and analyze all the data!

Importing GitHub Data

— GitHub API;

— GH Archive: https://www.gharchive.org/.

Data about events that occurred in repositories.

GH Archive — a continuously updated archive of events from all GitHub repositories!

Importing GitHub Data

GitHub API event types:

— CommitCommentEvent;
— CreateEvent;
— DeleteEvent;
— ForkEvent;
— GollumEvent;
— IssueCommentEvent;
— IssuesEvent;
— MemberEvent;
— PublicEvent;
— PullRequestEvent;
— PullRequestReviewCommentEvent;
— PushEvent;
— ReleaseEvent;
— SponsorshipEvent;
— WatchEvent;

Importing GitHub Data

wget --continue \
https://data.gharchive.org/{2015..2020}-{01..12}-{01..31}-{0..23}.json.gz

All data is open and you can download it!

Just wait a few days while downloading
84,264 files totaling 1.2 TB.

This can be accelerated by splitting the work across servers, but why bother...
when you can just leave it running over the weekend and forget about it.

Analyzing GitHub Data

How to analyze 1.2 TB of .json.gz files?

Method 1. As is: don't load the data anywhere.
Analyze it using clickhouse-local:

clickhouse-local --query "
    SELECT count()
    FROM file('*.json.gz', TSV, 'data String')
    WHERE JSONExtractString(data, 'actor', 'login') = 'alexey-milovidov'"

The query hits the disk and takes ~30 minutes.
The bottleneck is reading all of the data.

This is not optimal!

Analyzing GitHub Data

How to analyze 1.2 TB of .json.gz files?

Method 2. Load the data into a table in ClickHouse.

But the data has a very complex structure.
— 15 different events, each with many attributes.

Solution: simplify the structure!

CREATE TABLE github_events
(
    event_type Enum('CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3, 'ForkEvent' = 4,
                    'GollumEvent' = 5, 'IssueCommentEvent' = 6, 'IssuesEvent' = 7, 'MemberEvent' = 8,
                    'PublicEvent' = 9, 'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11,
                    'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14, 'WatchEvent' = 15,
                    'GistEvent' = 16, 'FollowEvent' = 17, 'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19),
    actor_login LowCardinality(String),
    repo_name LowCardinality(String),
    created_at DateTime,
    updated_at DateTime,
    action Enum('none' = 0, 'created' = 1, 'added' = 2, 'edited' = 3, 'deleted' = 4, 'opened' = 5, 'closed' = 6,
        'reopened' = 7, 'assigned' = 8, 'unassigned' = 9, 'labeled' = 10, 'unlabeled' = 11, 'review_requested' = 12,
        'review_request_removed' = 13, 'synchronize' = 14, 'started' = 15, 'published' = 16),
    comment_id UInt64,
    body String,
    path String,
    position UInt32,
    line UInt32,
    ref LowCardinality(String),
    ref_type Enum('none' = 0, 'branch' = 1, 'tag' = 2, 'repository' = 3),
    creator_user_login LowCardinality(String),
    number UInt32,
    title String,
    labels Array(LowCardinality(String)),
    state Enum('none' = 0, 'open' = 1, 'closed' = 2),
    locked UInt8,
    assignee LowCardinality(String),
    assignees Array(LowCardinality(String)),
    comments UInt32,
    author_association Enum('NONE' = 0, 'CONTRIBUTOR' = 1, 'OWNER' = 2, 'COLLABORATOR' = 3, 'MEMBER' = 4),
    closed_at DateTime,
    merged_at DateTime,
    merge_commit_sha String,
    requested_reviewers Array(LowCardinality(String)),
    requested_teams Array(LowCardinality(String)),
    head_ref LowCardinality(String),
    head_sha String,
    base_ref LowCardinality(String),
    base_sha String,
    merged UInt8,
    mergeable UInt8,
    rebaseable UInt8,
    mergeable_state Enum('unknown' = 0, 'dirty' = 1, 'clean' = 2, 'unstable' = 3, 'draft' = 4),
    merged_by LowCardinality(String),
    review_comments UInt32,
    maintainer_can_modify UInt8,
    ...
) ENGINE = MergeTree ORDER BY (event_type, repo_name, created_at);

find . -name '*.json.gz' | xargs -P16 -I{} bash -c "
gzip -cd {} | jq -c '
[
    .type,
    .actor.login,
    .repo.name,
    .created_at,
    .payload.updated_at // .payload.comment.updated_at // .payload.issue.updated_at // .payload.pull_request.updated_at,
    .payload.action,
    .payload.comment.id,
    .payload.review.body // .payload.comment.body // .payload.issue.body // .payload.pull_request.body // .payload.release.body,
    .payload.comment.path,
    .payload.comment.position,
    .payload.comment.line,
    .payload.ref,
    .payload.ref_type,
    .payload.comment.user.login // .payload.issue.user.login // .payload.pull_request.user.login,
    .payload.issue.number // .payload.pull_request.number,
    .payload.issue.title // .payload.pull_request.title,
    [.payload.issue.labels[]?.name // .payload.pull_request.labels[]?.name],
    .payload.issue.state // .payload.pull_request.state,
    .payload.issue.locked // .payload.pull_request.locked,
    .payload.issue.assignee.login // .payload.pull_request.assignee.login,
    [.payload.issue.assignees[]?.login // .payload.pull_request.assignees[]?.login],
    .payload.issue.comments // .payload.pull_request.comments,
    .payload.review.author_association // .payload.issue.author_association // .payload.pull_request.author_association,
    .payload.issue.closed_at // .payload.pull_request.closed_at,
    .payload.pull_request.merged_at,
    .payload.pull_request.merge_commit_sha,
    [.payload.pull_request.requested_reviewers[]?.login],
    [.payload.pull_request.requested_teams[]?.name],
    ...
    .payload.pull_request.additions,
    .payload.pull_request.deletions,
    .payload.pull_request.changed_files,
    .payload.comment.diff_hunk,
    .payload.comment.original_position,
    .payload.comment.commit_id,
    .payload.comment.original_commit_id,
    .payload.size,
    .payload.distinct_size,
    .payload.member.login,
    .payload.release.tag_name,
    .payload.release.name,
    .payload.review.state
]' | clickhouse-client --input_format_null_as_default 1 --date_time_input_format best_effort --query '
    INSERT INTO github_events FORMAT JSONCompactEachRow' || echo 'File {} has issues'
"

Loading GitHub Data into ClickHouse

What we got:

One flat table with all event types.

Many columns: the different properties of the different event types.

Easy to analyze any event type without JOINs.

SELECT
    repo_name,
    sum(event_type = 'WatchEvent') AS stars,
    sum(event_type = 'PullRequestEvent'
        AND action = 'opened') AS prs
FROM github_events
GROUP BY repo_name

Loading GitHub Data into ClickHouse

What we got:

— 2.5 billion rows;
— 150 GB compressed;
— 140 million repositories;
— queries run in seconds.

Conclusion:

All events in all repositories throughout GitHub history:
— are not "Big Data";
— are "Fits on Your Laptop" data.

GitHub Data in ClickHouse

You can quickly answer almost any question
about the open-source ecosystem*:

— top repositories by stars, all time or this year (see the sketch after this list);
— the fastest growing over the quarter, or stagnating, among the large repositories;
— graphs of star count, issues, contributors for any repository;
— most prolific code authors;
— who does the most code review;
— repositories with the most comment activity;
— all repositories related to ClickHouse;
— all my comments... and others' too.

* about the part available on GitHub.
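
For instance, the first question is a one-liner (a sketch; stars are approximated
by WatchEvent, as in the earlier query):

-- top repositories by stars over the whole GH Archive history
SELECT repo_name, count() AS stars
FROM github_events
WHERE event_type = 'WatchEvent'
GROUP BY repo_name
ORDER BY stars DESC
LIMIT 10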

Conclusions

To analyze data, you need to:

— study the source model and nature of the data;
— choose the right representation for analysis;
— perform data preparation and cleaning;
— cut corners where possible;
— be able to write a simple script in any language;
— never be afraid of big data;
— investigate outliers and data internals;
— be ready to throw everything away and start over;
— never stop at one report;

... and you'll also need courage.

.