Stop Guessing, Start Measuring Your Git Repository

Turn Your Git Repository into an SQL Database with gitstatdb


It started with a simple request: "Can you get me a report on our top committers?"

I searched for tools to extract statistical information from a git repository, but found that most were designed for version control, not analysis. I eventually ran into gitstats, which does a great job at extracting metrics but generates a static snapshot. I showed the HTML report to my client, and the inevitable happened: he started asking, "Can I filter this by month?", "Can I restrict that to the backend team?", "Can I see only the refactors?"

It was the normal human process: the moment you start seeing data, you want to ask it questions. But static reports can't answer new questions.

I realized that hacking filters into existing tools wasn't enough, but urgency reigns, so I added some filtering options to gitstats and gave it to my client to cover the momentary needs. I was going to leave it there, but the ETL/BI engineer inside me was hooked. If we wanted real answers—like "Who was the most productive author of 2024?" or "Which files have the highest churn rate?"—we didn't need a report generator; we needed a data warehouse.

Thus was born gitstatdb.

Today, I am releasing gitstatdb, an open-source ETL (Extract, Transform, Load) tool that turns your Git history into a structured MySQL database ready for dynamic reporting.

💡
If you have ever tried to extract meaningful management metrics from git log, git diff --stat, the git ls-* family, and similar commands, you know the pain. While Git is fantastic for version control, it wasn't designed as an analytics engine.

The Concept: Your Code History as Data

The philosophy behind gitstatdb is simple: standard SQL is more powerful for reporting than Git commands. By extracting repository metadata and loading it into a relational database, we unlock the full power of BI tools like Metabase, Superset, or Tableau.

Unlike simple commit counters, gitstatdb captures the deep context of your project:

  • Complete History: Commits, normalized authors, and committers.

  • Branch Evolution: Tracks all branches, including those that have been deleted, ensuring historical accuracy.

  • File Forensics: Tracks every file change (insertions, deletions, modifications) and file renames.

  • Merge Relationships: Automatically detects source and target branches for merges, making it easier to visualize workflow efficiency.
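
To illustrate the merge-relationship idea, here is a minimal Python sketch of one common heuristic: reading the source and target branch from Git's default merge subject line. This is an assumption made for illustration, not a description of how gitstatdb detects merges, and the name parse_merge_subject is hypothetical.

```python
import re

# Matches Git's default merge subjects, for example:
#   Merge branch 'feature/login' into develop
#   Merge branch 'hotfix'
# The "into <target>" part is optional because Git may omit it.
MERGE_SUBJECT = re.compile(
    r"^Merge branch '(?P<source>[^']+)'(?: of \S+)?(?: into (?P<target>\S+))?$"
)


def parse_merge_subject(subject):
    """Return (source, target) for a default merge subject, else None.

    target is None when the subject does not name the receiving branch.
    """
    match = MERGE_SUBJECT.match(subject.strip())
    if match is None:
        return None
    return match.group("source"), match.group("target")


with_target = parse_merge_subject("Merge branch 'feature/login' into develop")
without_target = parse_merge_subject("Merge branch 'hotfix'")
not_a_merge = parse_merge_subject("Fix typo in README")
```

A heuristic like this misses custom merge messages and squash merges, so a production tool would likely combine it with the commit graph (a merge commit has more than one parent).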

Under the Hood

The database schema is designed for performance and detailed analysis. It features normalized tables for author, repository, and branch, linked via a central commit table.
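
To make the schema idea concrete, here is a minimal sketch of a "top authors of the year" query. It uses an in-memory SQLite database as a stand-in for MySQL so it runs anywhere. Only the table names author, repository, branch, and commit come from the description above; every column name and all sample data are assumptions, and the real schema will differ.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; column names are illustrative guesses.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE repository (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE branch (
    id INTEGER PRIMARY KEY,
    repository_id INTEGER REFERENCES repository(id),
    name TEXT
);
-- "commit" is quoted because COMMIT is an SQL keyword.
CREATE TABLE "commit" (
    hash TEXT PRIMARY KEY,
    author_id INTEGER REFERENCES author(id),
    branch_id INTEGER REFERENCES branch(id),
    committed_at TEXT,
    insertions INTEGER,
    deletions INTEGER
);
""")
conn.executemany("INSERT INTO author VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
conn.execute("INSERT INTO repository VALUES (1, 'demo')")
conn.execute("INSERT INTO branch VALUES (1, 1, 'main')")
conn.executemany('INSERT INTO "commit" VALUES (?, ?, ?, ?, ?, ?)', [
    ("a1", 1, 1, "2024-01-10", 120, 30),
    ("a2", 1, 1, "2024-03-02", 40, 10),
    ("b1", 2, 1, "2024-02-15", 300, 5),
    ("b2", 2, 1, "2023-12-30", 900, 0),  # outside 2024, must be ignored
])


def top_authors(conn, year):
    """Rank authors by commit count, then by lines touched, for one year."""
    return conn.execute('''
        SELECT a.name,
               COUNT(*) AS commits,
               SUM(c.insertions + c.deletions) AS lines_touched
        FROM "commit" c
        JOIN author a ON a.id = c.author_id
        WHERE strftime('%Y', c.committed_at) = ?
        GROUP BY a.id
        ORDER BY commits DESC, lines_touched DESC
    ''', (str(year),)).fetchall()


ranking = top_authors(conn, 2024)
```

The point of the sketch is that a new question ("what about 2023?", "only the backend team?") becomes a changed WHERE clause rather than a new report generator.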

We also include pre-calculated statistics tables (repository_statistics, branch_statistics, author_repository_statistics). This means that when you connect a dashboard tool, it doesn't have to crunch millions of rows in real time; the heavy lifting is already done.
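
The idea behind a pre-calculated table can be sketched as follows: aggregate once at load time, then let dashboards read the small summary table. The table name author_repository_statistics comes from the list above, but its columns, the refresh strategy, and the function name refresh_author_stats are assumptions for illustration, with in-memory SQLite again standing in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "commit" (
    hash TEXT PRIMARY KEY,
    author_id INTEGER,
    repository_id INTEGER,
    insertions INTEGER,
    deletions INTEGER
);
-- Hypothetical column layout for the summary table.
CREATE TABLE author_repository_statistics (
    author_id INTEGER,
    repository_id INTEGER,
    commit_count INTEGER,
    insertions INTEGER,
    deletions INTEGER,
    PRIMARY KEY (author_id, repository_id)
);
""")
conn.executemany('INSERT INTO "commit" VALUES (?, ?, ?, ?, ?)', [
    ("c1", 1, 1, 10, 2),
    ("c2", 1, 1, 5, 5),
    ("c3", 2, 1, 7, 0),
])


def refresh_author_stats(conn):
    """Rebuild the summary table from the raw commit rows in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM author_repository_statistics")
        conn.execute('''
            INSERT INTO author_repository_statistics
            SELECT author_id, repository_id,
                   COUNT(*), SUM(insertions), SUM(deletions)
            FROM "commit"
            GROUP BY author_id, repository_id
        ''')


refresh_author_stats(conn)
stats = conn.execute(
    "SELECT * FROM author_repository_statistics ORDER BY author_id"
).fetchall()
```

A full rebuild is the simplest strategy to show; a real loader could just as well update only the rows affected by newly imported commits.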

Incremental Updates

One of the biggest challenges with Git analytics is performance, so gitstatdb supports incremental updates. After the initial import, you can run the tool on a schedule (a daily run is recommended); it detects new commits and processes only what has changed. It also detects when local branches have been deleted and marks them accordingly in the database.
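
The core of an incremental run can be sketched as a set comparison between what the database already holds and what the repository currently reports. This is a simplified illustration under my own assumptions, not the tool's actual code; plan_incremental_update and its parameters are hypothetical names.

```python
def plan_incremental_update(stored_commits, stored_branches,
                            repo_commits, repo_branches):
    """Return (commits still to import, branches to mark as deleted)."""
    known = set(stored_commits)
    # Keep the repository's order so new commits are processed as Git listed them.
    new_commits = [h for h in repo_commits if h not in known]
    # A branch that is in the database but no longer in the repository
    # has been deleted locally and should be flagged, not dropped.
    deleted_branches = sorted(set(stored_branches) - set(repo_branches))
    return new_commits, deleted_branches


new_commits, deleted_branches = plan_incremental_update(
    stored_commits=["a1", "a2"],
    stored_branches=["main", "feature/login"],
    repo_commits=["a1", "a2", "a3", "a4"],
    repo_branches=["main", "feature/search"],
)
```

In a real run, the repository-side inputs could come from commands such as git rev-list --all and git for-each-ref refs/heads, while the stored side would be read from the commit and branch tables.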

Visualizing the Data

Once your data is in MySQL, the magic happens. You can connect tools like Metabase to visualize your repository's heartbeat.

I have created a set of advanced dashboards that track:

  • Authors of the Month: Ranked by impact and consistency (not just commit counts).

  • Code Churn: Identifying "hotspots" in your codebase that are frequently rewritten.

  • Project Velocity: Visualizing merge rates and active days.
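
As a rough idea of what a churn "hotspot" query can look like, here is a minimal sketch over a hypothetical file_change table. The table name, its columns, and the sample rows are assumptions for illustration; the dashboards mentioned above use their own, more elaborate logic. In-memory SQLite stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical per-file change table: one row per file touched by a commit.
conn.execute("""
    CREATE TABLE file_change (
        commit_hash TEXT,
        path TEXT,
        insertions INTEGER,
        deletions INTEGER
    )
""")
conn.executemany("INSERT INTO file_change VALUES (?, ?, ?, ?)", [
    ("c1", "src/api.py", 50, 0),
    ("c2", "src/api.py", 20, 15),
    ("c3", "src/api.py", 5, 30),
    ("c1", "README.md", 10, 0),
    ("c3", "src/db.py", 40, 2),
])

# Churn here is simply lines added plus lines removed, summed per file.
hotspots = conn.execute("""
    SELECT path,
           COUNT(DISTINCT commit_hash) AS changes,
           SUM(insertions + deletions) AS churn
    FROM file_change
    GROUP BY path
    ORDER BY churn DESC, path
    LIMIT 2
""").fetchall()
```

Files that rank high on both changes and churn are the ones being rewritten repeatedly, which is usually where a closer look pays off.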

Watch this video to see how we use Metabase to explore a repository's history:

Installation & Usage

Getting started is straightforward. From the root of a local clone of the repository, you can install it as a Python package:

# 1. Install
pip install -e .

# 2. Configure your database in a .env file
echo "DB_NAME=gitstatdb" > .env
# ... add user/pass ...

# 3. Run the ETL
gitstatdb /path/to/your/repository

When you need it, you can also force a recalculation of the statistics for individual branches or for the whole repository from the command line.

Commercial Reporting

While gitstatdb is open source (MIT License) and free to use, building the right SQL queries for advanced dashboards can be tricky.

The project includes a reporting directory with setup instructions for Metabase. However, the advanced template packs, complex SQL reports (like the "Authors of the Year" logic), and specific Metabase/Superset configurations are available as On-Demand Services.

If you want to skip the setup and jump straight to insights, you can contact me for the premium dashboard pack, which includes:

  • Support for setting up the tool and the necessary cron jobs.

  • Pre-configured Metabase dashboards.

  • Complex SQL views for Churn and Author ranking.

  • Support for setting up visualizations in Superset or other BI tools.

Get the Code

The project is hosted on GitHub. Give it a star and start treating your code history like the valuable dataset it is.

👉 GitHub - joebordes/gitstatdb