Section 2 Introduction
Data have penetrated into every industry and business field, has become an increasingly important factor of production and consumption in our lives. However, data must be processed to uncover the information within. For this reason many statistical tools have been produced, such as SPSS, Python, R and so on. In this project, we focus on the R language (R Core Team 2021).
R is a programming language and environment for statistical computing and graphics, which can be extended easily via packages and provide an Open Source route to participation in statistical methodology. It is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License (“The GNU Operating System and the Free Software Movement” 1991). R is one of the most popular statistical languages and has been ranked competitively, even among the general purpose programming languages (peaking 8th in August 2020 in “Latest News” 2021). The TIOBE Programming Community index aggregates several search engines to derive a metric of the popularity of a programming language. It is important to note, however, that programming languages ranked higher than R are mostly general purpose languages and therefore naturally have a larger user base.
Today, R is greatly enhanced by over 17,713 R-packages contributed by hundreds of developers all over the world. However, when R originally appeared in August of 1993 with its first official release in June of 1995 (Ihaka 1998), the contributions were managed by only a small group of core developers. In April of 1997, the Comprehensive R Archive Network (CRAN) was established as the official R-packages repository, with 3 mirror sites. Now, the source repositories to install R-packages have expanded to Bioconductor, Gitlab, GitHub, R-Forge and 106 CRAN mirrors in 49 regions. Of all the CRAN mirrors, the daily download counts for each package is only readily available from the RStudio CRAN mirror. In this report, we will be focusing only on one CRAN mirror of RStudio, albeit it being the most popular one, for exploratory data analysis. Henceforth, all download counts of packages refer to the download from this RStudio CRAN mirror.
CRAN is a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R (“The Comprehensive r Archive Network” 2021).
The most intuitive measurement of popularity is the total number of downloads, although it does not directly reflect the number of users, because it includes repeated downloads by the same user, updates and test downloads caused by the server. Briefly, some works concerning the reporting and interpretation of these download counts include: the R-package adjustedcranlogs
(Morgan-Wall 2017), which attempts to correct for download inflation using a heuristic approach; the package packageRank
(Peter Li 2021), which provides a way to compute the rank percentile for packages and filters invalid downloads, including small and medium size logs. There are also some packages such as pkgsearch
(Csárdi and Salmon 2020) and Visualize.CRAN.Downloads
(Ponce 2021) that provide some visualization methods to explore the package download trends or for convenient searching. More details are provided in Section 5.
Based on those initial works, researchers have started to study the popularity of R-packages from CRAN. We still assume in this project that, the download amount is a relatively reliable and simple measure of a R-package’s popularity. The previous studies, the R-package adjustedcranlogs
(Morgan-Wall 2017) finds that there are spikes in downloads due to automatic re-downloads and package updates, and it also provides a method to remove these download logs. Some of the methods are function extensions based on previous packages, and some of their proposed methods are new functions, but generally speaking, they are more tool like, to help users explore the characteristics of package download logs from as many directions as possible.
As for us, the purpose of this project is to explore the download logs of R-packages on CRAN and to analyze the relationships between some probably influential factors, to help users and developers better understand the download patterns, and also figure out what could determine a R-package’s popularity to some extent.
A CRAN mirror is a website containing differently located servers, which aims to facilitate to quick and smooth access for people from different regions and countries. Each server is called a mirror (“CRAN Mirrors,” n.d.).
CRAN task view aims to provide some guidance regarding which packages on CRAN are relevant for tasks related to a certain topic (“CRAN Task Views,” n.d.).