5.1 Daily downloads of R-packages
Finding 1: There was unusual download activities in one day of 2014 and 2018.
In this first section, we studied the daily downloads of CRAN R-packages from 2012-10-01 to 2021-06-12. The data was obtained from the cranlogs
package(Csárdi 2019), which includes a summary of the download logs via the RStudio CRAN mirror. The daily download data for CRAN R-packages are available from 1st October 2012. Examination of this data showed two unusual observations in 2014 and 2018 as shown in Figure 5.1. The one happening in 2014 was on 2014-11-17, Monday, while the other one happening in 2018 was on 2018-10-21, Sunday.
When having a closer look into those two spikes, we firstly focused on the one on 2014-11-17. From Table 5.1, we could see that the downloads of top downloaded R-packages on this day differs little, indicating this spike is not due to a certain package.
package | n |
---|---|
BayHaz | 767035 |
clhs | 660298 |
GPseq | 394840 |
OPI | 382518 |
YaleToolkit | 370513 |
survsim | 224994 |
BAT | 40592 |
Rcpp | 3509 |
ggplot2 | 3167 |
plyr | 3150 |
Table 5.2 shows the downloads from different countries on 2014-11-17. It is obvious that Indonesia
obtains much more downloads than any others.
country | n |
---|---|
ID | 2863576 |
US | 96336 |
CN | 32729 |
DE | 14548 |
FR | 11860 |
GB | 10491 |
IN | 8635 |
HK | 8090 |
BE | 7720 |
KR | 6794 |
Furthermore, we also checked IP addresses, displayed in Table 5.3. Downloads from ip3758
is much higher than others. So, it seems that most of the downloads are owing to one certain IP for the unusual spike in 2014.
ip_id | n |
---|---|
3758 | 2863432 |
11536 | 6244 |
11725 | 5992 |
16385 | 5991 |
534 | 5986 |
3784 | 5983 |
18519 | 4511 |
80 | 2124 |
27 | 1892 |
464 | 1375 |
Next, we turned to the unusual spike in 2018. Table 5.4 shows the downloads from tidyverse
is much higher than others, with nearly three orders of magnitude.
package | n |
---|---|
tidyverse | 11692582 |
Rcpp | 16263 |
stringi | 13981 |
rlang | 13796 |
ggplot2 | 13306 |
dplyr | 13081 |
glue | 12593 |
digest | 12302 |
stringr | 11505 |
fansi | 11275 |
As for country, from Table 5.5, we could know that US
occupies the most part of downloads on that day.
country | n |
---|---|
US | 12140853 |
NA | 179847 |
GB | 76624 |
IN | 51502 |
CN | 46095 |
TR | 36590 |
AU | 35078 |
DE | 32837 |
CA | 31125 |
KR | 30469 |
Finally, the most interesting finding is on IP address, displayed in Table 5.6. Several consecutive IPs have highly distinguished downloads. It seems that they are probably from the same individual, or caused by a server test issue, in such a short period of time.
ip_id | n |
---|---|
266 | 3034720 |
263 | 2457383 |
655 | 2099321 |
264 | 1557640 |
267 | 1406876 |
265 | 1032535 |
2 | 179711 |
268 | 99932 |
112 | 34397 |
3296 | 17223 |
To sum up, we found that these two unusual spikes have one thing in common, that is, most of the downloads came from a specific country. The difference is that in 2014, a large number of downloads came from several different R-packages, while in 2018, they came from only one package tidyverse
. In addition, in 2014, a large quantities of downloads came from one IP, while in 2018, they came from several consecutive IPs, At this point, it is guessed that they should come from the same individual, and it is very likely due to sever test issue, for it may be not necessary or reasonable for an individual to generate such a large amount downloads in one day.
Finding 2: There are increasing numbers of downloads over time, which can attests the growing number of R users.
Figure 5.2 shows the download trend of all R-packages on CRAN over a period pf time from 2012-10-01 to 2021-06-12, after fixing the unusual spikes mentioned above. There is an upward trend, with an increasing variance in download counts.
Finding 3: Weekends have a lower downloads than weekdays.
To have a closer look at the weekly pattern, figure 5.3 shows the daily downloads of all CRAN R-packages via the RStudio mirror, with the grey areas highlighting the weekend.
To be more specific, except for 2012 and 2013, the patterns of other years are very similar, with a strong weekly seasonality. To be more detailed, in 2012, the download logs showed an overall upward trend, which also reflected more and more users there after release of CRAN. In the following years, there is no obvious trend in download volume, but a strong seasonality, which indicates that in a week, the total downloads always increases first then decreases, and reaches the lowest on weekends. Although the pattern of 2013 is more volatile, it still conforms to that. We suppose that is because CRAN was only open for a short period of time in 2013, so the amount of download data is not adequate to show the weekly pattern very clearly. After 2016, the pattern of each year is quite consistent, for the total downloads have been increasing year by year. Back to weekly seasonality, people are more likely to download packages during weekdays, and rest on weekends. So, the trough of download curve always occurs on weekends. In addition, the lowest downloads across the year are always happening at the end of December or the beginning of January, probably due to the Christmas and New Year’s holidays. Meanwhile, the downloads are on the rise from August to October, and from February to April, which covers the beginning of semesters for many universities around the world, a time when related students tend to download CRAN R-packages very often.
As there are many fluctuations in daily download pattern, which is due to calendar effect and test server issue of CRAN mirror, an STL decomposition model explained in Hyndman and Athanasopoulos (2021a), was applied, to smooth the curve for all of the R-packages in Figure 5.4.
Figure 5.5 shows the distributions and the median of the downloads between weekday and weekends, which differ from each other a lot. The violin plots of weekends are wider and shorter, while those of weekdays are thinner and higher, on the contrary. That is because the total downloads on weekends are less than those in weekdays. In 2012, the median and interquartile range of download logs are not very distinguished between weekdays and weekends, for the data volume was not adequate at this time as mentioned before. But after 2013, the gap between the two has been becoming more and more obvious. The median downloads of working days are significantly higher than those of weekends, and the overall download volume is also significantly larger than that of weekends as well. Interestingly, the lower adjacent sometimes occurs on weekends, such as in year 2014, 2015, 2018, 2019 and 2021, while sometimes also in weekdays, such as in year 2012, 2013, 2016, 2017 and 2020.
Finding 4: Top 10% downloaded R-packages share nearly 90% cumulative download counts of the whole.
From the previous analysis, we could see that the cumulative download counts of R-packages show an increasing trend. It would be perfect equality if every R-package had the same download count : the last 20% downloaded R-packages would gain 20% of the total download count or the top 60% downloaded R-packages would get 60% of the total download count. But knowing from experience, we know that is hardly possible. So, here, we introduced Lorenz curve(Pettinger 2021) to show the respective numbers of R-packages within different download levels (groups defined by quantiles of download counts). In this way, we could figure out how many download counts contributed by different downloaded R-packages.
Figure 5.6 shows cumulative download counts against each downloaded group. It can be seen that most of the download counts come from the top 10% downloaded R-packages. At the same time, we could also observe that the Gini value is close to 1, which indicates that the download volumes across groups are quite unbalanced. In fact, the download volume of the top 10% group is extremely distinguished from that of the following groups. It’s not hard to understand that this group should contain some R-packages with high popularity and large quantities of users.
For example, if we extracted the first 10 packages of this group in Table 5.7, we could find that there are many quite famous and frequently-used R-packages, such as rlang
and dplyr
.
package | total |
---|---|
rlang | 15572507 |
vctrs | 13544857 |
dplyr | 12739206 |
ggplot2 | 12670952 |
jsonlite | 12627542 |
lifecycle | 11124212 |
tibble | 10935860 |
magrittr | 10312021 |
pillar | 9566463 |
glue | 9534999 |