5.1 Daily downloads of R-packages

Finding 1: There was unusual download activities in one day of 2014 and 2018.

In this first section, we studied the daily downloads of CRAN R-packages from 2012-10-01 to 2021-06-12. The data was obtained from the cranlogs package(Csárdi 2019), which includes a summary of the download logs via the RStudio CRAN mirror. The daily download data for CRAN R-packages are available from 1st October 2012. Examination of this data showed two unusual observations in 2014 and 2018 as shown in Figure 5.1. The one happening in 2014 was on 2014-11-17, Monday, while the other one happening in 2018 was on 2018-10-21, Sunday.

Figure 5.1: Unusual download spikes in 2014 and 2018.

When having a closer look into those two spikes, we firstly focused on the one on 2014-11-17. From Table 5.1, we could see that the downloads of top downloaded R-packages on this day differs little, indicating this spike is not due to a certain package.

Table 5.1: The total downloads of each R-package on 2014-11-17
package	n
BayHaz	767035
clhs	660298
GPseq	394840
OPI	382518
YaleToolkit	370513
survsim	224994
BAT	40592
Rcpp	3509
ggplot2	3167
plyr	3150

Table 5.2 shows the downloads from different countries on 2014-11-17. It is obvious that Indonesia obtains much more downloads than any others.

Table 5.2: The countries downloading from CRAN on 2014-11-17
country	n
ID	2863576
US	96336
CN	32729
DE	14548
FR	11860
GB	10491
IN	8635
HK	8090
BE	7720
KR	6794

Furthermore, we also checked IP addresses, displayed in Table 5.3. Downloads from ip3758 is much higher than others. So, it seems that most of the downloads are owing to one certain IP for the unusual spike in 2014.

Table 5.3: The IP addresses downloading from CRAN on 2014-11-17
ip_id	n
3758	2863432
11536	6244
11725	5992
16385	5991
534	5986
3784	5983
18519	4511
80	2124
27	1892
464	1375

Next, we turned to the unusual spike in 2018. Table 5.4 shows the downloads from tidyverse is much higher than others, with nearly three orders of magnitude.

Table 5.4: The total downloads of each R-package on 2018-10-21
package	n
tidyverse	11692582
Rcpp	16263
stringi	13981
rlang	13796
ggplot2	13306
dplyr	13081
glue	12593
digest	12302
stringr	11505
fansi	11275

As for country, from Table 5.5, we could know that US occupies the most part of downloads on that day.

Table 5.5: The countries downloading from CRAN on 2018-10-21
country	n
US	12140853
NA	179847
GB	76624
IN	51502
CN	46095
TR	36590
AU	35078
DE	32837
CA	31125
KR	30469

Finally, the most interesting finding is on IP address, displayed in Table 5.6. Several consecutive IPs have highly distinguished downloads. It seems that they are probably from the same individual, or caused by a server test issue, in such a short period of time.

Table 5.6: The IP addresses downloading from CRAN on 2018-10-21
ip_id	n
266	3034720
263	2457383
655	2099321
264	1557640
267	1406876
265	1032535
2	179711
268	99932
112	34397
3296	17223

To sum up, we found that these two unusual spikes have one thing in common, that is, most of the downloads came from a specific country. The difference is that in 2014, a large number of downloads came from several different R-packages, while in 2018, they came from only one package tidyverse. In addition, in 2014, a large quantities of downloads came from one IP, while in 2018, they came from several consecutive IPs, At this point, it is guessed that they should come from the same individual, and it is very likely due to sever test issue, for it may be not necessary or reasonable for an individual to generate such a large amount downloads in one day.

Finding 2: There are increasing numbers of downloads over time, which can attests the growing number of R users.

Figure 5.2 shows the download trend of all R-packages on CRAN over a period pf time from 2012-10-01 to 2021-06-12, after fixing the unusual spikes mentioned above. There is an upward trend, with an increasing variance in download counts.

Figure 5.2: The download trend of all R-packages on CRAN from 2012-10-01 to 2021-06-12.

Finding 3: Weekends have a lower downloads than weekdays.

To have a closer look at the weekly pattern, figure 5.3 shows the daily downloads of all CRAN R-packages via the RStudio mirror, with the grey areas highlighting the weekend.

To be more specific, except for 2012 and 2013, the patterns of other years are very similar, with a strong weekly seasonality. To be more detailed, in 2012, the download logs showed an overall upward trend, which also reflected more and more users there after release of CRAN. In the following years, there is no obvious trend in download volume, but a strong seasonality, which indicates that in a week, the total downloads always increases first then decreases, and reaches the lowest on weekends. Although the pattern of 2013 is more volatile, it still conforms to that. We suppose that is because CRAN was only open for a short period of time in 2013, so the amount of download data is not adequate to show the weekly pattern very clearly. After 2016, the pattern of each year is quite consistent, for the total downloads have been increasing year by year. Back to weekly seasonality, people are more likely to download packages during weekdays, and rest on weekends. So, the trough of download curve always occurs on weekends. In addition, the lowest downloads across the year are always happening at the end of December or the beginning of January, probably due to the Christmas and New Year’s holidays. Meanwhile, the downloads are on the rise from August to October, and from February to April, which covers the beginning of semesters for many universities around the world, a time when related students tend to download CRAN R-packages very often.

Figure 5.3: The total downloads for all of R-packages on CRAN would decrease on weekends and increase during weekdays.

As there are many fluctuations in daily download pattern, which is due to calendar effect and test server issue of CRAN mirror, an STL decomposition model explained in Hyndman and Athanasopoulos (2021a), was applied, to smooth the curve for all of the R-packages in Figure 5.4.

Figure 5.4: The total downloads of all R-packages on CRAN after smoothing.

Figure 5.5 shows the distributions and the median of the downloads between weekday and weekends, which differ from each other a lot. The violin plots of weekends are wider and shorter, while those of weekdays are thinner and higher, on the contrary. That is because the total downloads on weekends are less than those in weekdays. In 2012, the median and interquartile range of download logs are not very distinguished between weekdays and weekends, for the data volume was not adequate at this time as mentioned before. But after 2013, the gap between the two has been becoming more and more obvious. The median downloads of working days are significantly higher than those of weekends, and the overall download volume is also significantly larger than that of weekends as well. Interestingly, the lower adjacent sometimes occurs on weekends, such as in year 2014, 2015, 2018, 2019 and 2021, while sometimes also in weekdays, such as in year 2012, 2013, 2016, 2017 and 2020.

Figure 5.5: The violin plot for downloads of all of R-packages on CRAN, between weekday and weekends.

Finding 4: Top 10% downloaded R-packages share nearly 90% cumulative download counts of the whole.

From the previous analysis, we could see that the cumulative download counts of R-packages show an increasing trend. It would be perfect equality if every R-package had the same download count : the last 20% downloaded R-packages would gain 20% of the total download count or the top 60% downloaded R-packages would get 60% of the total download count. But knowing from experience, we know that is hardly possible. So, here, we introduced Lorenz curve(Pettinger 2021) to show the respective numbers of R-packages within different download levels (groups defined by quantiles of download counts). In this way, we could figure out how many download counts contributed by different downloaded R-packages.

Figure 5.6 shows cumulative download counts against each downloaded group. It can be seen that most of the download counts come from the top 10% downloaded R-packages. At the same time, we could also observe that the Gini value is close to 1, which indicates that the download volumes across groups are quite unbalanced. In fact, the download volume of the top 10% group is extremely distinguished from that of the following groups. It’s not hard to understand that this group should contain some R-packages with high popularity and large quantities of users.

Figure 5.6: Percentiles of the download counts against cumulative download counts for R-packages at or below that percentile.

For example, if we extracted the first 10 packages of this group in Table 5.7, we could find that there are many quite famous and frequently-used R-packages, such as rlang and dplyr.

Table 5.7: First 10 R-packages of top 10% downloaded group
package	total
rlang	15572507
vctrs	13544857
dplyr	12739206
ggplot2	12670952
jsonlite	12627542
lifecycle	11124212
tibble	10935860
magrittr	10312021
pillar	9566463
glue	9534999

References

Csárdi, Gábor. 2019. Cranlogs: Download Logs from the ’RStudio’ ’CRAN’ Mirror. https://CRAN.R-project.org/package=cranlogs.

Hyndman, Rob J, and George Athanasopoulos. 2021a. “Forecasting: Principles and Practice (3rd Ed).” 3.6 STL Decomposition. https://otexts.com/fpp3/stl.html.

Pettinger, Tejvan. 2021. “Lorenz Curve.” Economics Help. https://www.economicshelp.org/blog/glossary/lorenz-curve/.