A small set of observations with a few extreme values, plus a subjective split of the data set into two subsets fitted separately with a linear regression model, resulted in very clear-cut conclusions and striking figures. However, none of this is solid evidence, or evidence at all, supporting the paper’s conclusions. This series of articles not only discusses the problems in the paper but, more importantly, traces the review process that allowed it to be published in Nature.
A new analysis of the data appears in an article at “Ask a Swiss”, but it is still based on model fitting. It detects a significant change in slope, but we still do not have confidence bands available.
This is a very clear-cut example of not being aware of the limitations of the data. The same problem affects the interpretation of data from any survey or experiment and any statistical method, from machine learning to linear regression or t-tests. The design of the survey or experiment limits both the range of validity of possible conclusions and the nature of such conclusions. No sophisticated analysis can correct for biased data unless the bias is measured. We cannot extract information that is not already in the data. We need to make assumptions, but this imposes limitations on what can be concluded. Less dangerous but equally wrong interpretations of data analyses are a lot more frequent in the scientific literature than we usually realize. Look for parallels between this example and your own research or what you read. What are the data representative of, and what are they meaningful for?
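The point that more data cannot repair a biased design can be illustrated with a small simulation. The sketch below is only an illustration under invented assumptions: the population, its mean of 50, the seed and the rule that only above-average units can be sampled are all hypothetical and unrelated to the paper discussed above.

```python
import random
import statistics

rng = random.Random(2024)  # fixed seed so the illustration is reproducible

# Hypothetical population of 100 000 values with a mean close to 50.
population = [rng.gauss(50, 10) for _ in range(100_000)]
true_mean = statistics.fmean(population)

# Hypothetical design flaw: only units above the population mean can be reached.
reachable = [x for x in population if x > true_mean]


def biased_sample(n):
    """Draw n values without replacement from the reachable units only."""
    return rng.sample(reachable, n)


def unbiased_sample(n):
    """Draw n values without replacement from the whole population."""
    return rng.sample(population, n)


# A larger sample makes the biased estimate more precise, not more correct.
for n in (100, 10_000):
    biased = statistics.fmean(biased_sample(n))
    unbiased = statistics.fmean(unbiased_sample(n))
    print(f"n={n}: biased={biased:.1f}, unbiased={unbiased:.1f}, true={true_mean:.1f}")
```

In this made-up setting the estimate from the biased sample stays well above the population mean however large the sample is, and nothing in the biased sample itself reveals by how much.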
In most situations pseudo-random numbers produced by computer software (“random” number generators) are good enough, as long as we are careful when choosing the seed for the generator. Sometimes it can even be an advantage to be able to reproduce sequences of pseudo-random numbers by setting the seed value. Frequently, the seed is obtained from the clock of the computer, e.g. using the seconds or milliseconds digits of the current time. This is still not truly random, as random numbers cannot be generated by any deterministic process. True random numbers can only be generated by a random physical process.
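The reproducibility that comes from setting the seed can be shown in a few lines. This is a minimal sketch in Python, using its standard `random` module; the helper name `draw` and the seed values 42 and 43 are arbitrary choices made for the illustration.

```python
import random


def draw(seed, n=5):
    """Return n pseudo-random floats from a generator initialised with seed."""
    rng = random.Random(seed)  # separate generator, leaves the global state untouched
    return [rng.random() for _ in range(n)]


first = draw(42)
second = draw(42)
other = draw(43)

print(first == second)  # same seed: the sequence is reproduced exactly
print(first == other)   # different seed: a different sequence
```

Because the whole sequence is determined by the seed, storing the seed with the analysis is enough to regenerate the same "random" values later.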
The site random.org is a service that provides true random numbers for free, at least as long as usage stays below a quota. The R package random provides an interface to this service.