Open-source developers all over the world are working on millions of projects: writing code & documentation, reporting & fixing bugs, and so forth. GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.
GitHub provides 20+ event types, ranging from new commits and fork events to opening new tickets, commenting, and adding members to a project. These events are aggregated into hourly archives, which you can access with any HTTP client:
- Activity for 1/1/2015 @ 3PM UTC
- Activity for 1/1/2015
- Activity for all of January 2015
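As a sketch of what those requests look like, assuming the commonly documented `data.gharchive.org` host and `YYYY-MM-DD-H.json.gz` hourly naming scheme, shell brace expansion covers each of the three ranges:

```shell
url_base=https://data.gharchive.org

# Activity for 1/1/2015 @ 3PM UTC -- a single hourly archive
echo "$url_base/2015-01-01-15.json.gz"

# Activity for 1/1/2015 -- all 24 hourly archives for the day
echo "$url_base"/2015-01-01-{0..23}.json.gz

# Activity for all of January 2015 -- every hour of every day
echo "$url_base"/2015-01-{01..31}-{0..23}.json.gz
```

Replace `echo` with `wget` (or any HTTP client) to actually download the archives.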
Each archive contains JSON encoded events as reported by the GitHub API. You can download the raw data and apply your own processing to it - e.g. write a custom aggregation script, import it into a database, and so on! An example Ruby script to download and iterate over a single archive:
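A minimal sketch of such a script, using only Ruby's standard library; the `data.gharchive.org` URL in the usage example below is an assumption based on the hourly naming scheme:

```ruby
require 'open-uri'
require 'zlib'
require 'json'

# Each archive is a gzip-compressed stream of newline-delimited JSON
# events; decode the stream and yield each event hash to the caller.
def each_event(io)
  Zlib::GzipReader.new(io).each_line do |line|
    yield JSON.parse(line)
  end
end

# Download a single hourly archive and iterate over its events.
def process_archive(url, &block)
  URI.open(url) { |io| each_event(io, &block) }
end
```

For example, `process_archive('https://data.gharchive.org/2015-01-01-15.json.gz') { |event| puts event['type'] }` would print the type of every event recorded in that hour.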
The entire GitHub Archive is also available as a public dataset on Google BigQuery. It is automatically updated every hour and enables you to run arbitrary SQL-like queries over the full dataset in seconds - no need to download or process any data on your own. To get started:
For convenience, there are multiple tables that you can use for your analysis:
Note that you get 1 TB of data processed per month free of charge. To make the best use of it, restrict your queries to relevant time ranges and thereby minimize the amount of scanned data. To scan multiple tables at once, you can use table wildcards:
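As an illustration, a wildcard query in BigQuery standard SQL that counts January 2015 events by type while scanning only that month's daily tables (the `day.YYYYMMDD` table naming is an assumption based on the dataset's commonly documented layout):

```sql
SELECT type, COUNT(*) AS events
FROM `githubarchive.day.201501*`
GROUP BY type
ORDER BY events DESC
```

The `201501*` prefix matches every daily table for January 2015, so BigQuery bills only the bytes in those tables rather than the whole dataset.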
Have a cool project that should be on this list? Send a pull request!