GitHub Archive

Open-source developers all over the world are working on millions of projects: writing code & documentation, fixing & submitting bugs, and so forth. GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.

GitHub provides 20+ event types, which range from new commits and fork events, to opening new tickets, commenting, and adding members to a project. These events are aggregated into hourly archives, which you can access with any HTTP client:

  • Activity for 1/1/2015 @ 3PM UTC: wget https://data.gharchive.org/2015-01-01-15.json.gz
  • Activity for 1/1/2015: wget https://data.gharchive.org/2015-01-01-{0..23}.json.gz
  • Activity for all of January 2015: wget https://data.gharchive.org/2015-01-{01..31}-{0..23}.json.gz

Each archive contains JSON encoded events as reported by the GitHub API. You can download the raw data and apply your own processing to it - e.g. write a custom aggregation script, import it into a database, and so on! An example Ruby script to download and iterate over a single archive:
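
One way to sketch it, using only the Ruby standard library (the archive URL in the usage comment is an assumed example following the hourly naming scheme above):

```ruby
require 'open-uri'
require 'zlib'
require 'json'

# Each archive is a gzipped stream of newline-delimited JSON events.
# Yield every decoded event in the stream to the caller's block.
def each_event(io)
  Zlib::GzipReader.new(io).each_line do |line|
    yield JSON.parse(line)
  end
end

# Example usage: download a single hourly archive and print event types.
# URI.open('https://data.gharchive.org/2015-01-01-15.json.gz') do |gz|
#   each_event(gz) { |event| puts event['type'] }
# end
```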

  • Activity archives are available starting 2/12/2011.
  • Activity archives for 2/12/2011 through 12/31/2014 were recorded from the (now deprecated) Timeline API.
  • Activity archives from 1/1/2015 onward are recorded from the Events API.

For the curious, check out The Changelog episode #144 for an in-depth interview about the history of GitHub Archive, integration with BigQuery, where the project is heading, and more.

Analyzing event data with BigQuery

The entire GitHub Archive is also available as a public dataset on Google BigQuery: the dataset is automatically updated every hour and enables you to run arbitrary SQL-like queries over the entire dataset in seconds - i.e. no need to download or process any data on your own. To get started:

  1. If you don't already have a Google project...
  2. Open public dataset:
  3. Execute your first query...
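
As a starting point for step 3, a first query might count one day's events by type. This is a sketch in BigQuery standard SQL; the `githubarchive.day.20150101` table name follows the day-table naming scheme described below and is an assumed example:

```sql
-- Count a single day's GitHub events by type.
SELECT type, COUNT(*) AS events
FROM `githubarchive.day.20150101`
GROUP BY type
ORDER BY events DESC;
```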

For convenience, note that there are multiple datasets and tables that you can use for your analysis:

  1. year dataset: 2011, 2012, 2013, and 2014 tables contain all activities for each respective year.
    • The schema is in a "flattened" format where each field is mapped into a distinct column.
    • The schema is the same between all years.
  2. month dataset: contains activity for each month from 2011 to today - e.g. 201501.
    • The schema for the 201101 through 201412 tables is the same as the 2011-2014 year tables: flattened, with each field in a distinct column.
    • The schema for 201501+ tables contains nested records plus a JSON encoded payload - see below.
  3. day dataset: contains activity for each day starting on January 1, 2015 - e.g. 20150101.
    • The schema contains distinct columns for common activity fields (matching the Events API response format), plus a payload string field which contains the JSON encoded activity description. The format of the payload differs for each event type and may be updated by GitHub at any point, hence it is kept as a string value in BigQuery. However, you can extract particular fields from the payload using the provided JSON functions - e.g. JSON_EXTRACT().
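
For instance, JSON_EXTRACT() can pull a single field out of the payload string. This is a sketch in BigQuery standard SQL; the day table name is an assumed example:

```sql
-- Extract the `action` field ("opened", "closed", ...) from issue events.
SELECT repo.name, JSON_EXTRACT(payload, '$.action') AS action
FROM `githubarchive.day.20150101`
WHERE type = 'IssuesEvent'
LIMIT 10;
```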

Note that the first 1 TB of data processed per month is free of charge. To make the best use of it, restrict your queries to relevant time ranges to minimize the amount of scanned data. To scan multiple tables at once, you can use table wildcards:
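
For example, all of the January 2015 day tables can be scanned in one query with a wildcard plus a `_TABLE_SUFFIX` filter. This is a sketch in BigQuery standard SQL, assuming the public `githubarchive.day` dataset:

```sql
-- Top-starred repos for January 2015; the _TABLE_SUFFIX filter
-- limits the scan to the 20150101..20150131 day tables.
SELECT repo.name, COUNT(*) AS stars
FROM `githubarchive.day.2015*`
WHERE type = 'WatchEvent'
  AND _TABLE_SUFFIX BETWEEN '0101' AND '0131'
GROUP BY repo.name
ORDER BY stars DESC
LIMIT 10;
```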

Daily reports

Changelog Nightly is the new and improved version of the daily email reports, powered by the GitHub Archive data. These reports ship each day at 10pm CT and unearth the hottest new repos on GitHub. Alternatively, if you want something curated and less frequent, subscribe to Changelog Weekly.

Research, visualizations, talks...

Have a cool project that should be on this list? Send a pull request!