5.4 Compare last year’s downloads with the initial release date

Finding: R-packages that were first released on CRAN earlier tend to have higher download counts over the past year. One possible explanation is that fewer R-packages existed in the same category at the time, so users had ‘no choice’ but to use them. Those R-packages thereby accumulated a user base, which in turn makes it easier for them to attract new users.

Intuitively, the earlier an R-package is released, the more people get to know it, and thus the more downloads it accumulates. However, R-packages on different topics cannot be compared directly, because download counts in one topic can be systematically higher than in another. Therefore, to test this conjecture as cleanly as possible, we selected R-packages from three domains through the CRAN Task Views (“CRAN Task Views,” n.d.), calculated each package’s downloads over the previous year, and extracted its earliest release date for comparison. The three topics are:

  • R-packages for Time Series Analysis

The first topic is Time Series Analysis, a statistical technique that deals with time series data, or trend analysis. Time series data are observations recorded over a series of particular time periods or intervals (“Time Series Analysis” 2020).

  • Bayesian R-packages for general model fitting

The second topic is Bayesian Inference. Bayesian statistics is a mathematical procedure that applies probabilities to statistical problems. It gives people the tools to update their beliefs in light of new data (perpetual 2019).

  • Econometrics R-packages

The last topic is econometrics. Econometrics is the application of statistical methods to quantitative data in order to develop theories or test existing hypotheses in economics or finance. It relies on techniques such as regression models and null hypothesis testing (Hayes 2020).
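The data preparation described above (pick the R-packages listed in each task view, total their downloads over the previous year, and find each package's earliest release date) can be sketched as follows. This is only an illustrative outline, not the code used for this analysis: the function names, the tuple layouts, the package names `pkg_a`, `pkg_b`, `pkg_c`, and all numbers are hypothetical placeholders, and the 365-day window is an assumption about what "the previous year" means.

```python
from collections import defaultdict
from datetime import date, timedelta


def downloads_in_past_year(daily_logs, end_date):
    """Sum daily download counts per package over the 365 days ending at end_date.

    daily_logs is an iterable of (package, day, count) tuples (assumed layout).
    """
    start = end_date - timedelta(days=365)
    totals = defaultdict(int)
    for package, day, count in daily_logs:
        if start < day <= end_date:
            totals[package] += count
    return dict(totals)


def earliest_release_dates(releases):
    """Return the earliest release date per package.

    releases is an iterable of (package, release_date) tuples, one per version.
    """
    earliest = {}
    for package, released in releases:
        if package not in earliest or released < earliest[package]:
            earliest[package] = released
    return earliest


def build_comparison_table(task_views, daily_logs, releases, end_date):
    """Build (topic, package, earliest_release, downloads) rows for plotting."""
    totals = downloads_in_past_year(daily_logs, end_date)
    first = earliest_release_dates(releases)
    rows = []
    for topic, packages in task_views.items():
        for package in packages:
            if package in first:  # skip packages with no known release date
                rows.append((topic, package, first[package], totals.get(package, 0)))
    return rows


# Hypothetical inputs, for illustration only.
task_views = {"TimeSeries": ["pkg_a", "pkg_b"], "Bayesian": ["pkg_c"]}
daily_logs = [
    ("pkg_a", date(2020, 6, 1), 120),
    ("pkg_a", date(2019, 1, 1), 999),  # outside the one-year window, ignored
    ("pkg_b", date(2020, 5, 1), 30),
    ("pkg_c", date(2019, 12, 24), 75),
    ("pkg_c", date(2020, 1, 2), 25),
]
releases = [
    ("pkg_a", date(2010, 3, 1)),
    ("pkg_a", date(2015, 7, 9)),  # a later version, not the earliest release
    ("pkg_b", date(2018, 11, 20)),
    ("pkg_c", date(2008, 2, 14)),
]

rows = build_comparison_table(task_views, daily_logs, releases, date(2020, 6, 30))
for row in rows:
    print(row)
```

Keeping the topic in every row makes it possible to compare R-packages only within their own task view, which is the reason for selecting by topic in the first place.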

Figure 5.10 displays a scatterplot of the past year’s download counts against the earliest release dates for Time Series Analysis, Econometrics and Bayesian R-packages. In general, the later the earliest release date, the lower the download count. Most Time Series Analysis R-packages were released between 2012 and 2019, most Bayesian R-packages between 2007 and 2012, and most Econometrics R-packages between 2013 and 2016.

Figure 5.10: The download counts decrease with the initial release dates.
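One way to put a number on the downward trend seen in the scatterplot is a rank correlation between earliest release date and download count within a single topic. The sketch below computes Spearman's rho using only the standard library. It is a suggestion for checking the visual impression, not a statistic reported in this analysis, and the dates and counts are made-up placeholders.

```python
from datetime import date


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical packages from one topic: earliest release date and yearly downloads.
release_dates = [date(2008, 2, 14), date(2011, 5, 3), date(2015, 7, 9), date(2018, 11, 20)]
downloads = [90000, 40000, 12000, 3000]

# Dates are converted to ordinals so they can be ranked like numbers.
rho = spearman_rho([d.toordinal() for d in release_dates], downloads)
print(rho)
```

A value of rho close to -1 would indicate that later release dates go together with lower download counts. A rank-based measure is used here because download counts are typically very skewed, so a few extremely popular R-packages would otherwise dominate the result.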

In conclusion, it is not surprising that the earlier an R-package is released, the more downloads it tends to have, a pattern visible in all three topics above. This is probably because R-packages released earlier are better known. When they were released, there may have been relatively few R-packages on the same topic, so competition was weak. As a result, R-packages that arrive later can easily be overshadowed, since people generally tend to keep using well-known, mature and familiar packages.

In other words, earlier R-packages have had more time to cultivate user habits, since habits strengthen with length of use. For example, a teacher who is a long-time user of an R-package may recommend it to their students, especially if their own experience with it has been satisfying.

References

“CRAN Task Views.” n.d. CRAN Task Views. https://cran.r-project.org/web/views/.
Hayes, Adam. 2020. “Econometrics: What It Means, and How It’s Used.” Investopedia. https://www.investopedia.com/terms/e/econometrics.asp.
perpetual, NSSI am a. 2019. “Bayesian Statistics Explained in Simple English for Beginners.” Analytics Vidhya. https://www.analyticsvidhya.com/blog/2016/06/bayesian-statistics-beginners-simple-english/.
“Time Series Analysis.” 2020. Statistics Solutions. https://www.statisticssolutions.com/time-series-analysis/.