In an earlier post I suggested that many people who regard themselves as scientists, or who are accepted as experts about Science, do not really understand what Science is. Medically qualified people, in particular, who either claim or are credited with scientific expertise, often do not deserve such credit.
There is another major area, or field, and class of scientific tools, that has for many years been involved in a high proportion of scientific research projects, but is very often not properly understood, even by people who are presented as, or present themselves as, scientific experts. I am referring to the area of statistical inference. And again, medical “experts” frequently show themselves to be relatively weak in this area. In some cases they are aware of, and admit to, their comprehensional weakness in relation to statistics, though they may not appear to recognize how much this weakness undermines any pretensions they may have to understanding the research that depends on statistical techniques.
The difficulty in inducing and developing understanding of statistical inference
I suggested, earlier, that scientific thinking does not come easily to most people, and, after having tried hard, for over forty years, to find ways to persuade undergraduate students of Psychology to understand, at an intuitive level, how inferential statistics work, it has become evident that statistical inference is even harder for most people to understand than is scientific thinking. I have, indeed, to admit to indifferent success in inculcating an intuitive understanding of statistical inference. After assessing the research theses of hundreds of Honours students whom I had taught throughout their undergraduate years I have been forced to accept that the majority of them still did not have a clear understanding of what could and could not be concluded on the basis of statistical reasoning.
And it is not only students who have failed fully to comprehend these techniques of reasoning. Even the major textbooks in this area contain errors which suggest that their authors have not fully grasped the concepts that they are trying to impart. I suspect that, in the fullness of time, the widespread uncomprehending use of these techniques will become one of the major scandals of scientific history.
A simplified introduction to significance testing
Having said all this, I am disconcerted to discover, and my readers may be dismayed to realize, that in order to make the main points of this posting, regarding the non-expertise of some “experts,” it is necessary for me to try to impart a basic understanding of one of the most widely used techniques of inferential statistics: significance testing. I do apologise for this necessity.
Do not be surprised if you find it necessary to re-read this section a number of times, and perhaps refer back to it later on.
All significance tests involve having a body of data that has been collected, calculating some statistic/s that describes those data, and making an inference or inferences based on the value/s of that/those “descriptive statistics”. My primary example of a significance test will be what has probably been the most widely used of all of them. It is called the “t-test for independent samples.” I shall call this one just “t-test,” for short. This test looks at the difference between the “means” (averages) of two sets of scores (the “descriptive statistic” referred to above), and asks, effectively, how likely it is that we would have obtained a mean difference this big if the scores had been “sampled randomly” (I’ll leave this term vague for the moment) from two indefinitely large “populations” (sets) which have the same mean. The number the test ultimately provides is the likelihood (“probability”) that, if the means of the two populations were the same, we would have found a mean difference as big as we did find.
Researchers do not routinely report this probability. The convention is that instead, they decide in advance what probability is small enough to justify concluding that the two populations do not have the same mean. The critical probabilities most often applied (called “significance levels”) are p (“probability”) = 0.05 and p = 0.01, corresponding to one chance in twenty, and one chance in one hundred, respectively. The test is described as being “significant at the p=.05 level,” or “significant at the p=.01 level,” accordingly.
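For readers who find code easier to follow than prose, here is a minimal sketch of the calculation just described. It assumes the usual pooled-variance form of the t-test and a two-sided probability, and it obtains that probability by numerically integrating the t distribution rather than by calling a statistics package. The function names and the scores are invented for illustration.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Probability of a t at least this far from zero, by Simpson's rule."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3  # area between -|t| and +|t|
    return max(0.0, 1 - central_area)


def independent_t_test(a, b):
    """Pooled-variance t-test for two independent samples."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(a) - mean(b)) / se
    return t, df, two_sided_p(t, df)


# Invented scores, purely for illustration.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
alpha = 0.05  # significance level, decided in advance

t, df, p = independent_t_test(group_a, group_b)
print(f"t = {t:.3f}, df = {df}, p = {p:.4f}, significant: {p < alpha}")
```

With these invented scores the means differ by one point, and the resulting probability is well above .05, so the test would not be described as significant at that level.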
This use of the word “significant” is clearly different from its everyday uses. It does not mean, for instance, “important,” but something more related to “signifying,” in this case signifying a difference (other than zero) between the population means. It is difficult, however, for people using the term to rid their minds of the more usual implications of “significant”.
A simple research project
The imaginary piece of research we are considering involved randomly selecting 20 twelve-year-old boys and 25 twelve-year-old girls, and measuring their heights. We shall assume that the girls turn out to be, on average, 2.2 inches taller than the boys. Note that, in a real study, we would need to specify more clearly the population/s from which we were drawing our two samples; for instance, that they were English boys and girls, at public schools. Our t-test would now ask, and answer, the question of whether the probability of randomly drawing samples like these, differing in mean by as much as 2.2 inches, if the true mean difference between the populations was 0.0 inches, was (using the p=.05 significance level, for example) as little as .05. If it was (and it almost certainly would be, since twelve-year-old girls do, in fact, tend to be taller than twelve-year-old boys), we would reject the proposal (“hypothesis”) that the population mean difference was 0.0 inches.
Incidentally, this hypothesis of no difference is called, by statisticians, “the null hypothesis”. We would say “The difference in means was significant at the p=.05 level.” Note that, although the difference we found between the mean heights of the girls and the boys was 2.2 inches, the test only entitles us to say, if we believe its result, that the difference between the mean heights of twelve-year-old girls and twelve-year-old boys is unlikely to be 0.0 inches.
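The question the test asks about the height study can also be approached by brute force. The sketch below is a simulation, not the t-test itself: it repeatedly draws 20 "boys" and 25 "girls" from one and the same population, so that the null hypothesis is true by construction, and counts how often the sample means differ by 2.2 inches or more in either direction. The population mean of 59 inches, the standard deviation of 3 inches, and the normal shape are my assumptions for the sake of the example; none of them comes from the imaginary study.

```python
import random

# Assumptions not in the text: one normal population for both sexes,
# mean 59 inches, standard deviation 3 inches.
POP_MEAN, POP_SD = 59.0, 3.0
N_BOYS, N_GIRLS = 20, 25
OBSERVED_DIFF = 2.2  # inches, from the imaginary study


def null_probability(reps=10_000, seed=1):
    """Share of null-hypothesis sample pairs whose means differ by >= 2.2 inches."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        boys = [rng.gauss(POP_MEAN, POP_SD) for _ in range(N_BOYS)]
        girls = [rng.gauss(POP_MEAN, POP_SD) for _ in range(N_GIRLS)]
        diff = sum(girls) / N_GIRLS - sum(boys) / N_BOYS
        if abs(diff) >= OBSERVED_DIFF:
            hits += 1
    return hits / reps


print(f"estimated probability under the null hypothesis: {null_probability():.4f}")
```

Under these assumed figures the estimated probability comes out below .05, so the difference would be called significant at that level. A larger assumed standard deviation would give a larger probability, which is one reason the spread of the scores matters as much as the size of the mean difference.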
Most statistics textbooks give an account of the, primarily mathematical, reasoning behind the development of this significance test that is based on a number of assumptions about, for instance, how the heights of twelve-year-olds are distributed across the range from very short to very tall. These assumptions are very detailed and precise, and the logic of the test theoretically, we are told, depends on them being satisfied. But it turns out that the accuracy of the test is astonishingly unaffected by even quite extreme violation of these assumptions.
People who like using the test tend to be delighted by this. After all, it means that they do not need to be too worried if their data do not accord with these assumptions. Statisticians describe tests that still give accurate results when their assumptions are violated as “robust”. What a nice word!—it has connotations of both strength and reliability.
But there is another possible interpretation of the performance of a test being unaffected by the falsity of assumptions that are integral to the reasoning on which the test is supposed to be based. That interpretation is that the test doesn’t really work in the way that the designated experts think it does: that they really don’t understand it very well.
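The robustness described above can be probed by simulation. The sketch below draws both samples from the same population, so every "significant" result is a false alarm, and compares the false-alarm rate when scores are normally distributed with the rate when they are strongly skewed (exponential). The sample sizes of 20 and 25 are borrowed from the height example; the critical value of 2.017 for 43 degrees of freedom is an approximate figure from standard t tables, and the choice of an exponential distribution as the violation is mine. One simulation of one kind of violation does not settle the general question either way.

```python
import math
import random

# Approximate two-sided .05 critical value of t for 43 degrees of freedom,
# taken from standard t tables (an assumption of this sketch).
CRITICAL_T = 2.017


def pooled_t(a, b):
    """t statistic for two independent samples, pooled variance."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    ss1 = sum((x - m1) ** 2 for x in a)
    ss2 = sum((x - m2) ** 2 for x in b)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def false_alarm_rate(draw, reps=5000, seed=2):
    """Share of tests that reject when both samples share one population."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        a = [draw(rng) for _ in range(20)]
        b = [draw(rng) for _ in range(25)]
        if abs(pooled_t(a, b)) > CRITICAL_T:
            rejections += 1
    return rejections / reps


normal_rate = false_alarm_rate(lambda rng: rng.gauss(0.0, 1.0))
skewed_rate = false_alarm_rate(lambda rng: rng.expovariate(1.0))
print(f"false-alarm rate, normal scores: {normal_rate:.3f}")
print(f"false-alarm rate, skewed scores: {skewed_rate:.3f}")
```

If the test is robust in the sense described, both rates should sit close to the nominal .05.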
A measure of relationship
I will give one more example in which this issue arises. It involves another very widely used descriptive statistic called “Pearson’s product-moment correlation coefficient,” symbolized by “r”. What r measures is, given the scores of each individual case on two variables, how well knowing the score on one measure enables one to estimate the score on the other. For instance, in our example, above, r would measure how well knowing the sex of a twelve-year-old would enable us to estimate their height. Associated with this measure is a significance test that answers the question of how likely it is that, if the sex of a twelve-year-old does not enable one to estimate their height, or if knowing their height does not improve one’s guess at their sex, an r-value as high as we obtained would have occurred. It is worth noting that this is, in a sense, logically equivalent to the issue tackled by the t-test.
The mathematical development of this significance test follows a similar path to that of the t-test, and requires a similar set of assumptions. Because the sex measure can take only one of two values, one of these assumptions is always so ridiculously far from being satisfied that the probability obtained should be meaningless. Nevertheless, this test always gives the same outcome probability as is obtained from the t-test.
I have never seen any discussion of this oddity. My own suspicion is that it again suggests that the significance test does not work for the reasons believed by the designated experts.
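The agreement between the two tests can be checked numerically. In the sketch below, sex is coded 0 for boys and 1 for girls, r is computed between that code and height, and r is then converted to a t value using the standard formula for testing a correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2). That value is compared with the t obtained from the pooled-variance t-test on the same scores. The heights are invented purely for illustration, and the pooled-variance form of the t-test is assumed.

```python
import math

# Invented heights in inches, purely for illustration.
boys = [56.0, 57.5, 58.0, 59.0, 60.5, 61.0]
girls = [58.5, 59.0, 60.0, 61.5, 62.0, 63.0, 64.5]


def pooled_t(a, b):
    """t statistic for two independent samples, pooled variance."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    ss1 = sum((x - m1) ** 2 for x in a)
    ss2 = sum((x - m2) ** 2 for x in b)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


sex = [0] * len(boys) + [1] * len(girls)  # 0 = boy, 1 = girl
height = boys + girls
n = len(height)

r = pearson_r(sex, height)
t_from_r = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
t_from_means = pooled_t(girls, boys)

print(f"r = {r:.4f}")
print(f"t from r     = {t_from_r:.6f}")
print(f"t from means = {t_from_means:.6f}")
```

Both t values are referred to the same t distribution, with n - 2 degrees of freedom, so identical t values mean identical probabilities.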
Enough for now
I think that I have given you enough to think about for one posting, so I shall continue this later, when I hope to give you more reasons to suspect that the designated experts do not fully understand inferential statistics.