Section 3 Data

The main source of data used in this report is the download logs from the RStudio CRAN mirror (https://cran.rstudio.com/). A log entry is created for every download of an R package via the RStudio CRAN mirror. These logs are processed daily into CSV files that contain the following variables, with the column header given in parentheses:

  • Date (date),
  • Time in UTC time zone (time),
  • Size of the file in bytes (size),
  • Version of R used to download the package (r_version),
  • Architecture type for R (i386 = 32 bit, x86_64 = 64 bit) (r_arch),
  • Operating system (darwin9.8.0 = Mac, mingw32 = Windows) (r_os),
  • Package (package),
  • Country in two letter ISO country code (country), and
  • Anonymised daily unique id (ip_id).
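
To make the layout concrete, the following is a minimal Python sketch of how one daily package log could be read and tallied by package. The sample rows are invented for illustration and are not real download records; the use of NA for missing values and the helper name count_downloads are assumptions, not part of the source data description.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the column layout listed above. The rows are
# invented for illustration only and are not real download records.
SAMPLE_LOG = """\
"date","time","size","r_version","r_arch","r_os","package","country","ip_id"
"2020-01-01","00:00:05",1024,"3.6.2","x86_64","mingw32","ggplot2","AU",1
"2020-01-01","00:00:09",2048,"3.6.2","x86_64","darwin9.8.0","dplyr","US",2
"2020-01-01","00:01:30",1024,NA,NA,NA,"ggplot2","US",2
"""


def count_downloads(csv_text):
    """Count rows (one row per download) for each package in a daily log."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["package"] for row in reader)


counts = count_downloads(SAMPLE_LOG)
for package, n in counts.most_common():
    print(package, n)
```

Because each row corresponds to a single download, counting rows per package gives the daily download count for that package.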

A similar log is created for every download of R itself from the RStudio CRAN mirror. The processed CSV file contains the same variables except r_arch and package, and the r_version and r_os variables are instead named version and os. All of these CSV files are hosted at http://cran-logs.rstudio.com/ and are updated daily, with data available from 1st October 2012.
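
As a rough sketch of how the daily files might be located, the Python below builds one URL per day for a date range. The URL pattern (a year directory, then the ISO date with a .csv.gz extension, and a -r suffix for the R download logs) is an assumption based on how the hosting site lays out its files and should be checked against the site before use; daily_log_urls is a hypothetical helper name.

```python
from datetime import date, timedelta

BASE_URL = "http://cran-logs.rstudio.com"


def daily_log_urls(start, end, r_downloads=False):
    """Build one log URL per day from start to end (inclusive).

    Assumed pattern: <base>/<year>/<YYYY-MM-DD>.csv.gz for package logs,
    and <base>/<year>/<YYYY-MM-DD>-r.csv.gz for downloads of R itself.
    """
    suffix = "-r" if r_downloads else ""
    urls = []
    day = start
    while day <= end:
        urls.append(f"{BASE_URL}/{day.year}/{day.isoformat()}{suffix}.csv.gz")
        day += timedelta(days=1)
    return urls


# First three days for which data are available.
for url in daily_log_urls(date(2012, 10, 1), date(2012, 10, 3)):
    print(url)
```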

The logs for a given day are processed and compressed into a single CSV file of about 40 megabytes (files from earlier years are much smaller because there were fewer downloads). As there are over 700,000 CSV files, a simple estimate of the total size of the data is 28 terabytes, far exceeding the 1-4 terabyte capacity of typical portable hard drives.
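
The size estimate is a single multiplication, sketched below in Python using the figures from the text as given. Decimal units (1 terabyte = 1,000,000 megabytes) are an assumption, and because files from earlier years are smaller than 40 megabytes the result is best read as an upper-end figure.

```python
FILE_COUNT = 700_000     # number of CSV files, as stated in the text
MB_PER_FILE = 40         # approximate size of one file, as stated in the text
MB_PER_TB = 1_000_000    # decimal units assumed: 1 TB = 1,000,000 MB


def estimated_size_tb(n_files, mb_per_file):
    """Total size in terabytes if every file had the same size."""
    return n_files * mb_per_file / MB_PER_TB


total_tb = estimated_size_tb(FILE_COUNT, MB_PER_FILE)
print(f"Estimated total size: {total_tb:.0f} TB")
```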

A summarised version of the data, giving the total daily download count for each package, is accessible through the cranlogs R package. The cranlogs package retrieves these summaries through a web application programming interface (API) maintained by r-hub (R-Hub 2020).
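
For readers working outside R, the same summaries could in principle be requested from the web API directly. The Python sketch below is illustrative only: the endpoint root, the start:end date format in the path, and the shape of the JSON response (a list with one object per package, each holding a downloads list of day and downloads pairs) are assumptions about the r-hub service and should be verified against its documentation. All function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumed endpoint root for daily download counts.
API_ROOT = "https://cranlogs.r-pkg.org/downloads/daily"


def daily_downloads_url(package, start, end):
    """Build the request URL for one package; dates are YYYY-MM-DD strings."""
    return f"{API_ROOT}/{start}:{end}/{package}"


def parse_daily_downloads(payload):
    """Map each day to its download count, given the assumed JSON shape."""
    records = json.loads(payload)
    return {d["day"]: d["downloads"] for d in records[0]["downloads"]}


def fetch_daily_downloads(package, start, end):
    """Request and parse the daily counts (needs network access)."""
    with urlopen(daily_downloads_url(package, start, end)) as resp:
        return parse_daily_downloads(resp.read().decode("utf-8"))


print(daily_downloads_url("ggplot2", "2020-01-01", "2020-01-07"))
```

Keeping URL construction and response parsing separate from the network call means the first two functions can be checked without contacting the service.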

References