A series of four seminars on Statistics (week 47)

Series title: Statistics: Changes since I was an undergrad

Abstract. I took my first course in Statistics 37 years ago. How we do statistics has changed dramatically since then. The amount of data we produce and analyse has also increased enormously. However, different research communities are making use of these new possibilities to very different extents. Even Biology curricula at different universities differ substantially in their emphasis on methods and mathematical literacy. All branches of Biology are becoming more quantitative. By reflecting on how advances in computers and computing science (and in the methods these advances have made possible) have opened a whole new way of approaching data analysis, I hope to make you rethink your approach to data analysis.

If you are planning to participate next year in my course 526052 Using R for reproducible data analysis, I recommend that you attend this series of talks as a gentle introduction to the subject. If you attend you can get credits, either through the DDPS or through the regular seminar series in Plant Biology.

Place: Biocenter 3, Room 5405 (5th floor, the room in front of the stairs)

Two hours are reserved for each talk plus discussion.

Please, let me know by e-mail if you intend to participate.

Part 1: Increased ease of computation

Monday 17 November, 10:15-12:00

This first talk focuses on describing the advances in computing hardware and software and why they are relevant to data analysis. I will also briefly mention the now fashionable “Data Science” and “Big Data” concepts and the currently fuzzy boundary between statistics and programming.

Part 2: Advances in theory and methods

Tuesday 18 November, 13:15-15:00

If your statistical knowledge is limited to the “traditional” methods, I hope to introduce you to the new possibilities brought about by lifting the constraints that used to prevent us from using computation-intensive methods and from analysing big data sets. In contrast, if you are a young researcher, well versed in modern methods, you will hopefully still find my talk interesting from a historical perspective: a glimpse of the limitations we had to deal with in the recent past, and of how they influenced, and still influence, the traditional ways of treating biological data. This talk focuses mostly on statistical theory and methods; specialised methods like those used in molecular biology or vegetation analysis will not be described.

Part 3: Examples of modern methods using R

Wednesday 19 November, 13:15-15:00

In this talk I will present some examples of the types of analyses that have become available to any biologist thanks to the increase in computing capacity and the development of new theory and methods that make use of these new possibilities. The aim is not to teach you how to apply these methods, but instead to give an idea of the broad array of methods currently available to anyone with access to a run-of-the-mill personal computer or, failing that, a cheap cloud server.

Part 4: Reproducible research and data analysis

Thursday 20 November, 13:15-15:00

This talk introduces the currently hot topic of research accountability and repeatability: why this openness is needed, how it can be achieved in practice, and how modern software and modern combinations of old software make it possible to achieve this goal rather painlessly, even for complex data analyses. I will also reflect on the origins of these ideas in computer programming, around the concept of literate programming proposed by Donald Knuth in the early 1980s.

However good data looks at first sight, check it!


Today I will tell two true stories, one old and one very recent. The point I want to make is that one should never blindly trust the results of measurements. This applies in general, but both examples I will present have to do with measurements made with instruments, more specifically with measuring UV-B radiation in experiments using lamps.

A case from nearly 20 years ago

Researcher A received a very good new spectroradiometer from the manufacturer and used it to set the UV-B output of the lamps.

Researcher B had access to an old spectroradiometer that could measure only a part of the UV-B spectrum. He knew this, so he measured the part of the spectrum that his instrument could measure and extrapolated the missing part from published data. He also searched the literature and compared his estimates to how the same lamps had been used earlier.

Researcher A was unlucky: because of a mistake at the factory, the calibration of the new instrument was wrong by about a factor of 10. She did not notice until after the experiment was well under way, but before publication. The harm was that the results were less relevant than intended, but no erroneous information was published.

Researcher B was able to properly measure the UV-B irradiance after the experiment was well under way, and he found that the treatment was within a small margin of what he had aimed for.

A case I discovered just a few days ago

A recently published paper concluded that a low and ecologically relevant dose of UV-B applied on a single day was able to elicit a large response in the plants. From the description of the lamps used, the distance to the plants, and the time the lamps were kept switched on, it is easy to estimate that in fact the authors had applied a dose at least 15 or 20 times what they had measured and reported in the paper. Coupled with a low level of visible light, this explains why they observed a large response from the plants! Neither the authors, the reviewers, nor the editor had noticed the error. [added on 8 October] I have read a few other papers on similar subjects from the same research group, and the same problem seems to affect them too. I will try to find out the origin of the discrepancy, and report here what I discover.

[added on 26 October]
I have contacted three of the authors, and they have confirmed the problem. The cause seems to have been that the researchers did not notice that the calibration they used had been expressed by the manufacturer in unusual units. The authors are concerned and are checking how large the error was, but the first comparative measurements suggest that the reported values were underestimated by a factor of at least 20.
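To illustrate how a unit mix-up in a calibration can silently scale every reported value by a large constant factor, here is a minimal sketch. The actual units involved in this case are not public, so the conversion below (mW cm⁻² versus W m⁻²) is an assumption chosen only because it is a common source of order-of-magnitude errors in irradiance work.

```python
# Hypothetical illustration: applying a calibration expressed in
# mW cm^-2 as if it were in W m^-2 underestimates irradiance 10-fold,
# because 1 mW cm^-2 equals 10 W m^-2.

MW_CM2_TO_W_M2 = 10.0  # conversion factor: 1 mW cm^-2 = 10 W m^-2

def to_w_m2(value_mw_cm2: float) -> float:
    """Convert an irradiance reading from mW cm^-2 to W m^-2."""
    return value_mw_cm2 * MW_CM2_TO_W_M2

# A reading of 0.15 mW cm^-2 is really 1.5 W m^-2; reporting the raw
# number as if it were already in W m^-2 is a 10x underestimate.
reading = 0.15
print(to_w_m2(reading))
```

The point of the sketch is that the error is invisible in the numbers themselves: every value is plausible-looking, just uniformly wrong, which is exactly why only an independent comparative measurement or a literature check can catch it.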

About this case I do not yet know the whole story, but evidently it yielded a much worse result: the publication of several articles with wrong data and wrong conclusions.

Take home message

Whenever and whatever you measure, or when you use or assess non-validated data from any source, unless you know very well from experience what to expect, check the literature for ballpark numbers. In either case, if your data differ markedly from expectations, try to find an explanation for the difference before you accept the data as good. You will either find an error or discover something new.
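This advice can be turned into a mechanical habit. As a minimal sketch, one can run every batch of measurements through a check against a ballpark range taken from the literature; the function name and the numeric range below are invented for illustration only:

```python
def check_ballpark(name, values, low, high):
    """Flag values outside a literature-based plausible range.

    Returns the list of offending values; an empty list means the
    data passed this (very coarse) sanity check.
    """
    suspects = [v for v in values if not (low <= v <= high)]
    if suspects:
        print(f"{name}: {len(suspects)} value(s) outside [{low}, {high}] "
              "-- find an explanation before accepting the data!")
    return suspects

# Hypothetical ballpark: ambient UV-B irradiance is on the order of
# 1 W m^-2, so a reading of 30 W m^-2 deserves immediate scrutiny.
readings = [0.8, 1.2, 30.0]
print(check_ballpark("UV-B irradiance (W m^-2)", readings, 0.0, 3.0))
```

A check this crude would have flagged both cases described above, since a constant calibration error of a factor of 10 or 20 pushes typical readings well outside any literature-based range.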

“Reproducible research” is a hot question

I have long been interested in the question of reproducible research, and as a manuscript author, reviewer, and more recently editor, I have attempted to make sure that no key information was missing and that methods were described in full detail and were, of course, valid.

Although the problem has always existed, I think that in recent years papers and reports with badly described methods have become more frequent. I think there are many reasons for this: 1) the pressure to publish quickly and frequently as a condition for career advancement, 2) the overload on reviewers and the pressure from journals to get manuscript reviews submitted within a few days’ time, 3) journals’ stricter and stricter limits on the number of “free” pages, and 4) the practice by some journals of publishing methods at the end of papers or in a smaller typeface, implying that methods are unimportant for most readers and irrelevant for understanding the results described (which is a false premise).


From ResearchGate Q&A

Do we actually need (or understand) more than basic statistics?

This is another topic worth looking at, and especially thinking about. I copy my answer here; it is to some extent off-topic (you will need to follow the link above to read the original post and the other answers):

The students I have supervised frequently seem to think that statistical tests come first, rather than being a source of guidance on how far we can stretch the inferences we make by “looking at the data” and derived summaries. They just describe effects as statistically significant or not. This results in very boring “results” sections lacking the information that the reader wants to know. When I read a paper I want to know the direction and size of an effect and what patterns are present in the data; statistical tests should help us decide how much caution we need to use until additional evidence becomes available. Many students and experienced researchers who “worship” p-values and the use of strict risk levels ignore how powerful and important the careful design of experiments is, and how the frequently seen use of “approximate” randomization procedures, or the approach of repeating an experiment until the results become significant, invalidates the p-values they report.
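To make the point concrete, here is a hedged sketch, with entirely made-up data, of reporting the direction and size of an effect together with a bootstrap confidence interval, instead of reducing the result to “significant or not”:

```python
import random

def bootstrap_ci_diff(a, b, n_boot=5000, alpha=0.05, seed=1):
    """Percentile-bootstrap CI for the difference in means (b - a)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    diffs = []
    for _ in range(n_boot):
        ra = [rng.choice(a) for _ in a]  # resample each group
        rb = [rng.choice(b) for _ in b]  # with replacement
        diffs.append(sum(rb) / len(rb) - sum(ra) / len(ra))
    diffs.sort()
    lo = diffs[int(n_boot * alpha / 2)]
    hi = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Invented example data: some response in control vs. treated plants.
control = [10.1, 9.8, 10.4, 10.0, 9.9, 10.2]
treated = [11.0, 11.3, 10.8, 11.1, 11.4, 10.9]
lo, hi = bootstrap_ci_diff(control, treated)
# Report direction and magnitude, not just a verdict:
print(f"treated - control: 95% CI [{lo:.2f}, {hi:.2f}]")
```

The interval tells the reader the direction and plausible size of the effect at a glance, which is exactly the information a bare “p < 0.05” omits.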

[edited 5 min later] Reading again what I wrote, it feels off-topic, but what I am trying to say is that not only the proliferation of p-values, and especially the use of fixed risk levels, but also the way results are often presented, reflects a much bigger problem: statistics being taught as a mechanical and exact science based on clear and fixed rules. Oversimplifying the subtleties and the degree of subjectivity involved in any data analysis, especially in relation to which assumptions are reasonable and how an experimental protocol determines which assumptions are tenable, is simply not teaching what would be the most useful training for anybody doing experimental research. So, in my opinion, yes, we need to understand much more than basic statistics in terms of principles, but this does not mean that we need to know advanced statistical procedures unless we use them or assess work that uses them.

