Section 6 Summary
In this project, we collected summary daily download logs of R-packages through web Application Programming Interface (API) maintained by r-hub(R-Hub 2020) and also used daily download data in CRAN for a time period from 2012-10-01 to 2021-06-12, to explore the daily download pattern for all of R-packages each year. In that case, we found that the cumulative numbers of downloads increase over time, with an also increasing variance, which indicates that some R-packages with larger downloads grow rapidly. In addition, there is also a strong weekly seasonality in the daily download pattern. The download counts tend to peak through weekdays and drop on weekend. What’s more, through Lorenz curve, we also found that most of the cumulative downloads came from the top 10% downloaded R-packages, which means the distribution of downloads is quite unequal. Part of the reason is that those top 10% downloaded R-packages contain quite a lot popular and frequently used R-packages, such as tidyverse
and rlang
that would probably obtain high downloads. In addition, there are other R-packages that often get high download volume, which can be divided into the following four categories:
- R-packages maintained by R studio
- R-packages created by authors from R core group
- R-packages created by authors from R secondary group
- R-packages created by R related authors
- R-packages created by top 20 prolific maintainers (This is resourced at “The Most Prolific Package Maintainers on CRAN” (2018))
However, the existence of those R-packages may make it difficult to reflect the popularity of other R-packages, so we excluded them, for the analysis of user preferences. And we found that the topics of newly added R-packages on 1st Oct of each year are from quite different application areas, while the R-packages remaining most stably popular during 2017 and 2019 is related to JAVA dependency. Definitely, JAVA always rank top three among programming languages according to TIOBE Index(“Latest News” 2021).
As for R itself, its download pattern is quite similar to that of total R-packages on CRAN. The most used OS for R users is windows OS. Also, the most popular version of R is 3.2.1.
After exploration of the characteristics of download pattern for R-package and R, we extracted the release dates for all of CRAN R-packages and task view R-packages, to compare the total download counts (past year for task view R-packages and the most recent 6 month for all) among R-packages with different release dates or with different numbers of updates. And we found that for R-packages from the same topic, earlier release date can usually related to more download counts, while R-packages with more update times would not always have higher downloads. R-packages released earlier and kept being updated are more likely to have higher downloads.
In the next section, we initially tried to scrape the numbers of commits in Github repositories, for all of available CRAN R-packages, through Github API by R, to check whether more commits can result in more downloads. But there came a tricky problem on the rate limit of Github API. As documented in “GitHub REST API” (2021), unauthenticated users could only be permitted to send 60 requests per hour. And only after get authentication, could the rate limit be expanded up to 5000 per hour. However, after trying several methods to get authentication, the rate limit was failed to be promoted. In that case, we switched to make this done with Python by setting random user agent. Meanwhile, in order to display our initial research idea, we still had a look at the last 1% downloaded R-packages on this question with the original R method. Therefore, we could expect that, generally, if an R-package has more commits on master (main) branch in Github repository, it can probably obtain more download counts.
The last two parts are about analysis for R-package names. We compared the average downloads among R-packages with different name lengths and different alphabetical orders. It is believed that over half of the R-packages tend to have shorter names , probably for the sake of being easily remembered by users. Alphabetical order played little roles in promoting the download volume.