5.1 Daily downloads of R-packages

Finding 1: There was unusual download activity on one day in 2014 and on one day in 2018.

In this first section, we studied the daily downloads of CRAN R-packages from 2012-10-01 to 2021-06-12. The data were obtained with the cranlogs package (Csárdi 2019), which provides a summary of the download logs of the RStudio CRAN mirror; daily download data are available from 1 October 2012 onwards. Examination of the data revealed two unusual observations, one in 2014 and one in 2018, as shown in Figure 5.1. The first occurred on Monday 2014-11-17 and the second on Sunday 2018-10-21.

Figure 5.1: Unusual download spikes in 2014 and 2018.
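
One way to flag days like these is to compare each day's total with the median of the surrounding days. The following Python sketch uses only the standard library. The function name, the window size, the threshold factor and the toy series are illustrative assumptions, not the procedure or the data used in this study.

```python
from statistics import median


def flag_spikes(counts, window=7, factor=5.0):
    """Return indices whose value exceeds `factor` times the median of
    the other values within `window` positions on either side."""
    spikes = []
    for i, value in enumerate(counts):
        lo = max(0, i - window)
        hi = min(len(counts), i + window + 1)
        # Neighbouring days, excluding the day being tested.
        neighbours = counts[lo:i] + counts[i + 1:hi]
        if neighbours and value > factor * median(neighbours):
            spikes.append(i)
    return spikes


# Toy daily totals (hypothetical numbers) with one injected spike at index 10.
daily = [100, 110, 105, 60, 55, 102, 108, 111, 104, 107, 3000, 58, 101, 109]
print(flag_spikes(daily))
```

Using the median rather than the mean keeps the reference level stable even when the window contains the spike itself or a run of low weekend values.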

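Taking a closer look at the two spikes, we first focused on the one on 2014-11-17. Table 5.1 shows that the most-downloaded packages on this day were not the usual popular ones: BayHaz, clhs, GPseq, OPI, YaleToolkit and survsim each recorded between roughly 225,000 and 767,000 downloads, far more than Rcpp, ggplot2 and plyr with about 3,000 each. The spike is therefore spread over several packages rather than being due to a single one.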

Table 5.1: The total downloads of each R-package on 2014-11-17
package n
BayHaz 767035
clhs 660298
GPseq 394840
OPI 382518
YaleToolkit 370513
survsim 224994
BAT 40592
Rcpp 3509
ggplot2 3167
plyr 3150

Table 5.2 shows the downloads by country on 2014-11-17. Indonesia (ID) accounts for far more downloads than any other country: about 2.86 million, compared with fewer than 100,000 for the US in second place.

Table 5.2: The countries downloading from CRAN on 2014-11-17
country n
ID 2863576
US 96336
CN 32729
DE 14548
FR 11860
GB 10491
IN 8635
HK 8090
BE 7720
KR 6794

Furthermore, we checked the IP addresses, displayed in Table 5.3. The downloads from ip 3758 (2,863,432) are much higher than those from any other IP and almost equal the Indonesian total in Table 5.2. It therefore seems that most of the downloads behind the unusual spike in 2014 came from this single IP.

Table 5.3: The IP addresses downloading from CRAN on 2014-11-17
ip_id n
3758 2863432
11536 6244
11725 5992
16385 5991
534 5986
3784 5983
18519 4511
80 2124
27 1892
464 1375
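
The arithmetic behind this comparison can be checked directly from the figures in Tables 5.1 to 5.3. The sketch below is in Python, and the variable names are hypothetical. The tables do not break package downloads down by IP, so the closeness of the sums is suggestive rather than conclusive.

```python
# Figures copied from Tables 5.1 to 5.3 (2014-11-17).
indonesia_total = 2_863_576   # Table 5.2, country ID
top_ip_total = 2_863_432      # Table 5.3, ip_id 3758
unusual_packages = {          # Table 5.1, the seven packages far above Rcpp
    "BayHaz": 767_035, "clhs": 660_298, "GPseq": 394_840, "OPI": 382_518,
    "YaleToolkit": 370_513, "survsim": 224_994, "BAT": 40_592,
}

# Share of the Indonesian downloads that ip 3758 could account for.
ip_share_of_country = top_ip_total / indonesia_total

# Combined downloads of the seven unusual packages, relative to ip 3758.
package_sum = sum(unusual_packages.values())
package_share_of_ip = package_sum / top_ip_total

print(f"ip 3758 / Indonesia: {ip_share_of_country:.5f}")
print(f"seven packages / ip 3758: {package_share_of_ip:.4f}")
```

Both ratios are close to 1, which is consistent with a single source in Indonesia downloading a handful of packages repeatedly.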

Next, we turned to the unusual spike in 2018. Table 5.4 shows that tidyverse was downloaded far more often than any other package, by nearly three orders of magnitude.

Table 5.4: The total downloads of each R-package on 2018-10-21
package n
tidyverse 11692582
Rcpp 16263
stringi 13981
rlang 13796
ggplot2 13306
dplyr 13081
glue 12593
digest 12302
stringr 11505
fansi 11275

As for countries, Table 5.5 shows that the US accounts for the vast majority of downloads on that day.

Table 5.5: The countries downloading from CRAN on 2018-10-21
country n
US 12140853
NA 179847
GB 76624
IN 51502
CN 46095
TR 36590
AU 35078
DE 32837
CA 31125
KR 30469

Finally, the most interesting finding concerns the IP addresses, displayed in Table 5.6. Several IPs with consecutive ids (263 to 267), together with ip 655, each generated more than one million downloads. Since all of this happened within a single day, these downloads probably came from the same individual or were caused by a server test issue.

Table 5.6: The IP addresses downloading from CRAN on 2018-10-21
ip_id n
266 3034720
263 2457383
655 2099321
264 1557640
267 1406876
265 1032535
2 179711
268 99932
112 34397
3296 17223
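
The grouping of consecutive IP ids can be made explicit with a small helper. The sketch below is in Python and uses only the figures from Tables 5.4 and 5.6. The helper name `consecutive_runs` is hypothetical. The tables do not link packages to IPs, so the comparison with the tidyverse total is indicative only.

```python
# Figures copied from Table 5.6 (2018-10-21): ip_id -> downloads.
ip_downloads = {266: 3_034_720, 263: 2_457_383, 655: 2_099_321,
                264: 1_557_640, 267: 1_406_876, 265: 1_032_535,
                2: 179_711, 268: 99_932, 112: 34_397, 3296: 17_223}
tidyverse_total = 11_692_582  # Table 5.4


def consecutive_runs(ids):
    """Group integer ids into runs of consecutive values."""
    runs = []
    for i in sorted(ids):
        if runs and i == runs[-1][-1] + 1:
            runs[-1].append(i)
        else:
            runs.append([i])
    return runs


longest = max(consecutive_runs(ip_downloads), key=len)
run_total = sum(ip_downloads[i] for i in longest)
heavy_total = run_total + ip_downloads[655]

print(longest, run_total)
print(f"run plus ip 655 / tidyverse: {heavy_total / tidyverse_total:.4f}")
```

The longest run covers ids 263 to 268. Together with ip 655, its downloads come very close to the tidyverse total for the day.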

To sum up, the two unusual spikes have one thing in common: in each case, most of the downloads came from a single country. They differ in that the 2014 downloads were spread over several R-packages, whereas the 2018 downloads were concentrated on a single package, tidyverse. In addition, in 2014 a large number of downloads came from one IP, whereas in 2018 they came from several consecutive IPs. At this point, we conjecture that in each case the downloads came from the same individual and were very likely due to a server test issue, since it is neither necessary nor reasonable for an individual to generate such a large number of downloads in one day.

Finding 2: The number of downloads increases over time, which may reflect a growing number of R users.

Figure 5.2 shows the download trend of all R-packages on CRAN from 2012-10-01 to 2021-06-12, after fixing the unusual spikes mentioned above. There is an upward trend, with increasing variance in the download counts.

Figure 5.2: The download trend of all R-packages on CRAN from 2012-10-01 to 2021-06-12.

Finding 3: Weekends have fewer downloads than weekdays.

To take a closer look at the weekly pattern, Figure 5.3 shows the daily downloads of all CRAN R-packages via the RStudio mirror, with the grey areas highlighting the weekends.

Except for 2012 and 2013, the yearly patterns are very similar and show a strong weekly seasonality. In 2012, downloads followed an overall upward trend, which likely reflects a growing number of users shortly after the download logs became available. In the following years, there is no obvious trend within a year but a strong weekly cycle: total downloads first rise, then fall, and reach their lowest level on the weekend. The 2013 pattern is more volatile but still conforms to this cycle. We suppose this is because the logs had only been collected for a short time, so the download volume was not yet large enough to show the weekly pattern clearly. After 2016, the pattern is quite consistent from year to year, while total downloads increase year by year.

The weekly seasonality suggests that people are more likely to download packages on weekdays and to rest on weekends, so the trough of the download curve always falls on a weekend. In addition, the lowest downloads of each year occur at the end of December or the beginning of January, probably because of the Christmas and New Year holidays. Downloads rise from August to October and from February to April, periods that cover the start of semesters at many universities around the world, when students tend to download CRAN R-packages frequently.

Figure 5.3: The total downloads for all of R-packages on CRAN would decrease on weekends and increase during weekdays.
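
The weekday and weekend split used here can be sketched in a few lines of Python with the standard library. The function names and the one-week toy series are hypothetical. The two dates checked at the end are the spike dates from Finding 1.

```python
from datetime import date
from statistics import median


def is_weekend(day):
    """True for Saturday (5) and Sunday (6) in date.weekday() numbering."""
    return day.weekday() >= 5


def median_by_day_type(daily):
    """daily: dict mapping date -> download count."""
    groups = {"weekday": [], "weekend": []}
    for day, n in daily.items():
        groups["weekend" if is_weekend(day) else "weekday"].append(n)
    return {kind: median(values) for kind, values in groups.items() if values}


# Hypothetical counts for the week starting Monday 2021-06-07.
week = {date(2021, 6, 7 + i): n
        for i, n in enumerate([900, 950, 940, 920, 880, 400, 380])}
print(median_by_day_type(week))

# Day-of-week check for the two spike dates (0 = Monday, 6 = Sunday).
print(date(2014, 11, 17).weekday(), date(2018, 10, 21).weekday())
```

Comparing medians rather than means keeps the comparison robust to isolated extreme days such as the two spikes.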

Because the daily download pattern fluctuates considerably, owing to calendar effects and server test issues on the CRAN mirror, an STL decomposition (Hyndman and Athanasopoulos 2021a) was applied to smooth the curve for all R-packages, as shown in Figure 5.4.

Figure 5.4: The total downloads of all R-packages on CRAN after smoothing.
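
STL itself is too involved to show briefly. As a much simpler stand-in (this is not STL, and not the method used for Figure 5.4), a centred 7-day moving average illustrates why averaging over one full week removes the weekly pattern. The function name and the toy series are hypothetical.

```python
def centred_moving_average(values, window=7):
    """Centred moving average with an odd window. Positions without a
    full window on both sides are returned as None."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        out[i] = sum(values[i - half:i + half + 1]) / window
    return out


# Hypothetical series: the same weekly pattern repeated three times.
weekly = [900, 950, 940, 920, 880, 400, 380]
series = weekly * 3
smoothed = centred_moving_average(series)
print(smoothed[3], smoothed[10])
```

Every 7-day window contains each day of the week exactly once, so the smoothed value is constant for a purely periodic series. Unlike STL, this simple average is not robust to outliers and does not return a separate seasonal component.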

Figure 5.5 shows the distributions and medians of downloads on weekdays and weekends, which differ considerably. The weekend violins are wider and shorter, whereas the weekday violins are thinner and taller, because total downloads on weekends are lower than on weekdays. In 2012, the median and interquartile range differ little between weekdays and weekends, since little data had been collected at that time, as mentioned before. After 2013, the gap between the two becomes increasingly clear: the median weekday downloads are substantially higher than the weekend median, and the overall weekday volume is larger as well. Interestingly, the lower adjacent value sometimes falls on weekends (2014, 2015, 2018, 2019 and 2021) and sometimes on weekdays (2012, 2013, 2016, 2017 and 2020).

Figure 5.5: The violin plot for downloads of all of R-packages on CRAN, between weekday and weekends.

Finding 4: The top 10% most-downloaded R-packages account for nearly 90% of the cumulative download count.

From the previous analysis, we could see that the cumulative download counts of R-packages show an increasing trend. Downloads would be perfectly equal if every R-package had the same download count: the bottom 20% of R-packages would then account for 20% of the total download count, and the top 60% for 60%. Experience suggests that this is hardly the case. We therefore use the Lorenz curve (Pettinger 2021) to show how downloads are distributed across R-packages at different download levels (groups defined by quantiles of download counts), and thus how much each group contributes to the total.

Figure 5.6 plots the cumulative download counts against each download group. Most of the downloads come from the top 10% most-downloaded R-packages. The Gini coefficient is close to 1, which indicates that downloads are very unevenly distributed across groups. Indeed, the download volume of the top 10% group far exceeds that of the remaining groups. This is unsurprising, since this group contains highly popular R-packages with large numbers of users.

Figure 5.6: Percentiles of the download counts against cumulative download counts for R-packages at or below that percentile.
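
To make the construction concrete, the following Python sketch computes Lorenz curve points and the Gini coefficient from a list of per-package download totals. The function names and the ten toy values are hypothetical and are not the data behind Figure 5.6. The sketch assumes a non-empty list with a positive total.

```python
def lorenz_points(counts):
    """Cumulative share of downloads against cumulative share of packages,
    with packages sorted from least to most downloaded."""
    xs = sorted(counts)
    total = sum(xs)
    points, running = [(0.0, 0.0)], 0
    for i, x in enumerate(xs, start=1):
        running += x
        points.append((i / len(xs), running / total))
    return points


def gini(counts):
    """Gini coefficient as 1 minus twice the area under the Lorenz curve,
    with the area computed by the trapezoidal rule."""
    pts = lorenz_points(counts)
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1 - 2 * area


# Hypothetical download totals for ten packages, one of which dominates.
toy = [10, 10, 10, 10, 10, 10, 10, 10, 10, 910]
top_10_percent_share = 1 - lorenz_points(toy)[-2][1]
print(round(top_10_percent_share, 2), round(gini(toy), 2))
```

A Gini coefficient of 0 corresponds to all packages having the same download count, and values approaching 1 correspond to downloads concentrated in very few packages.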

For example, the first 10 packages of this group, listed in Table 5.7, include many well-known and frequently used R-packages, such as rlang and dplyr.

Table 5.7: First 10 R-packages of the top 10% downloaded group
package total
rlang 15572507
vctrs 13544857
dplyr 12739206
ggplot2 12670952
jsonlite 12627542
lifecycle 11124212
tibble 10935860
magrittr 10312021
pillar 9566463
glue 9534999

References

Csárdi, Gábor. 2019. Cranlogs: Download Logs from the ’RStudio’ ’CRAN’ Mirror. https://CRAN.R-project.org/package=cranlogs.
Hyndman, Rob J, and George Athanasopoulos. 2021a. “Forecasting: Principles and Practice (3rd Ed).” 3.6 STL Decomposition. https://otexts.com/fpp3/stl.html.
Pettinger, Tejvan. 2021. “Lorenz Curve.” Economics Help. https://www.economicshelp.org/blog/glossary/lorenz-curve/.