CryptOSS

This repository contains tooling for collecting and viewing Cryptocurrency Open Source Software (OSS) development.

Papers and citations related to this project:
@inproceedings{trockman-striking-gold-2019, 
  title = {{Striking Gold in Software Repositories? An Econometric Study of Cryptocurrencies on GitHub}},
  booktitle = "International Conference on Mining Software Repositories",
  author = "Trockman, Asher and {van~Tonder}, Rijnard and Vasilescu, Bogdan",
  series = {MSR '19},
  year = 2019
}

Paper Link

@inproceedings{van-tonder-crypto-oss-2019, 
  title = {{A Panel Data Set of Cryptocurrency Development Activity on GitHub}},
  booktitle = "International Conference on Mining Software Repositories",
  author = "{van~Tonder}, Rijnard and Trockman, Asher and {Le~Goues}, Claire",
  series = {MSR '19},
  year = 2019
} 

Paper Link

CSV and raw data

DOI

View the GitHub data online. This does not include the full data available in the CSV above, which includes cryptocurrency prices, market cap, and trading volumes.

Building the tooling

  • Install opam. Typically:
sh <(curl -sL https://raw.githubusercontent.com/ocaml/opam/master/shell/install.sh)
  • Then run:
opam init
opam switch create 4.05.0 4.05.0
  • Next, install the dependencies:
opam install core opium yojson hmap tyxml tls
  • Then pin the github package:
opam pin add github https://github.com/rvantonder/ocaml-github.git

Then type make in this repository. The scripts and command-line utilities should now work. Let's step through the possible uses.

Collecting your own data

The cronjob folder contains the crontab for actively polling and collecting GitHub data. It's a good place to look if you want to understand how to collect data.

  • cronjob/crontab: The crontab pulls data by invoking cronjob/save.sh and cronjob/ranks.sh at certain intervals (these can be customized).

  • cronjob/save.sh: Essentially runs the crunch.exe save command with a user-supplied GitHub token. This command takes a comma-separated list of names registered in the db.ml file. You can see the invocation of the save.sh script in the crontab file.

  • cronjob/ranks.sh: Pulls cryptocurrency data from CoinMarketCap.

  • batches: The crontab uses batches of cryptocurrencies, listed in files. Each batch is a list of cryptocurrencies whose requests fit within GitHub's 5000-request rate limit, so that batched requests can be spaced out over 24 hours. The interval and batch size can be changed as needed (see cronjob/batches/generate.sh).
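
As a rough illustration of the batching idea, the sketch below splits a list of names into fixed-size, comma-separated batches and computes the spacing needed to spread those batches over 24 hours. It is not the repository's cronjob/batches/generate.sh; the function names, the one-name-per-line input, and the batch size are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; names are read one per line from stdin.

# make_batches SIZE: print one comma-separated batch per line.
make_batches() {
  local size="$1" count=0 line="" name
  while IFS= read -r name || [ -n "$name" ]; do
    if [ -z "$name" ]; then continue; fi   # ignore blank lines
    if [ -z "$line" ]; then line="$name"; else line="$line,$name"; fi
    count=$((count + 1))
    if [ "$count" -eq "$size" ]; then
      printf '%s\n' "$line"
      line=""
      count=0
    fi
  done
  # Emit the final, possibly shorter, batch.
  if [ -n "$line" ]; then printf '%s\n' "$line"; fi
  return 0
}

# minutes_between_batches N: spacing that spreads N batches over 24 hours.
minutes_between_batches() {
  echo $(( 24 * 60 / $1 ))
}
```

For example, `make_batches 50 < names.txt` (with a hypothetical names.txt) prints one batch per line, each in the comma-separated form that crunch.exe save accepts.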

Besides the cronjob, you can save data manually by running, say, crunch.exe save Bitcoin -token <your-github-token>. This produces a .dat file, which is the format that ./pipeline.sh processes.

The list of supported cryptocurrencies is in the database file (db.ml). Modify it to include your own, and run make again to update the tooling. You can then run crunch.exe save <My Crypto> -token ....
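
To avoid typing the token on every invocation, a small wrapper can read it from an environment variable. This is only a sketch: the GITHUB_TOKEN variable name, the save_crypto function, the CRUNCH override, and the ./crunch.exe default path are assumptions. The only crunch.exe syntax it relies on is the save <names> -token <token> form shown above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the `crunch.exe save` invocation shown above.
# CRUNCH can be overridden (for example with `echo`) for a dry run.
CRUNCH="${CRUNCH:-./crunch.exe}"

save_crypto() {
  local names="$1"
  if [ -z "$names" ]; then
    echo "usage: save_crypto <comma-separated names>" >&2
    return 2
  fi
  if [ -z "${GITHUB_TOKEN:-}" ]; then
    echo "GITHUB_TOKEN is not set" >&2
    return 1
  fi
  # Same argument order as the manual command: save <names> -token <token>.
  "$CRUNCH" save "$names" -token "$GITHUB_TOKEN"
}
```

With GITHUB_TOKEN set in the environment, `save_crypto Bitcoin` runs the same command as the manual example.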

Processing data

If you want more control over data processing than ./pipeline.sh provides, you can use crunch.exe load. You can generate a CSV file from a .dat file with a command like:

crunch.exe load -no-forks -csv -with-ranks <ranks.json file from CoinMarketCap> -with-date <DD-MM-YYYY> <some-dat-file>.dat

A similar command is used in the csv-of-dat.sh script to generate the MSR data set.
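
The datastore example in this README names its directory as YYYY-MM-DD (datastore/2018-10-10), while -with-date takes DD-MM-YYYY. The sketch below converts between the two and prints one load command per .dat file in a directory. It is not taken from csv-of-dat.sh; the helper names, the ./crunch.exe path, and the assumption that the directory name is the date to pass to -with-date are all illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not taken from csv-of-dat.sh.

# to_dmy YYYY-MM-DD: print the date as DD-MM-YYYY.
to_dmy() {
  local y m d rest
  y="${1%%-*}"       # text before the first dash
  rest="${1#*-}"     # text after the first dash
  m="${rest%%-*}"
  d="${rest#*-}"
  printf '%s-%s-%s\n' "$d" "$m" "$y"
}

# print_load_commands DIR RANKS: print one load command per .dat file in DIR.
# Commands are printed rather than executed so they can be reviewed first.
print_load_commands() {
  local dir="${1%/}" ranks="$2" date dat
  date="$(to_dmy "$(basename "$dir")")"
  for dat in "$dir"/*.dat; do
    if [ ! -e "$dat" ]; then continue; fi   # glob matched nothing
    printf './crunch.exe load -no-forks -csv -with-ranks %s -with-date %s %s\n' \
      "$ranks" "$date" "$dat"
  done
  return 0
}
```

Printing the commands first makes it easy to check the date and file names before piping the output to a shell.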

You can generate aggregate values by running ./crunch.exe aggregate on some directory containing .dat files. This will create .agg files. .agg files can be used to generate the web view.

Generating the web view

The ./deploy.sh script builds a static site. To create the web view for a particular date, say Oct 10, 2018, run ./deploy.sh datastore/2018-10-10, where that directory contains the .dat files for that date. This will generate a web view in docs.

Recreating the MSR dataset from the raw data

Create a directory called datastore. Download the raw data file and untar it in this directory. At the top level of this repository, run ./pipeline.sh <N>, where N is the number of parallel jobs (more jobs speed up processing). You can ignore any warnings or errors. Once it finishes, you'll have generated .csv files in the top-level directory.
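
The steps above can be sketched as a single helper, to be run from the top level of the repository. The function name, the archive argument (the path to wherever you saved the raw data download), and the PIPELINE override are assumptions for illustration; the archive's internal layout is not described here, so check the contents of datastore after extraction.

```shell
#!/usr/bin/env bash
# Hypothetical helper: recreate_dataset ARCHIVE [JOBS]
# ARCHIVE is the downloaded raw data tarball; JOBS defaults to 1.
recreate_dataset() {
  local archive="$1" jobs="${2:-1}"
  if [ ! -f "$archive" ]; then
    echo "archive not found: $archive" >&2
    return 1
  fi
  mkdir -p datastore
  # Extract the raw data into the datastore directory.
  tar -xf "$archive" -C datastore
  # PIPELINE can be overridden (for example with `echo`) for a dry run.
  "${PIPELINE:-./pipeline.sh}" "$jobs"
}
```

For instance, `recreate_dataset /path/to/raw-data.tar 4` would extract the archive and run the pipeline with four parallel jobs.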

Feel free to add your own data in the datastore (for some date), and rerun ./pipeline.sh.

