In the last decade student evaluations have become the “Key Performance Indicator” (KPI) for university lecturers. Whatever merits student evaluations may have, their current use as a KPI is statistical nonsense.

In statistics courses one learns that questionnaires are riddled with problems, chiefly the following. First, there is the problem of non-response. If many people, in this case students, do not respond, the outcomes are imprecise estimates of the “true” evaluation. Any estimate has to come with standard errors, and any conclusion takes the form that this or that hypothesis is either rejected or cannot be rejected. Standard errors typically grow as non-response grows, up to the point that pretty much no hypothesis can be rejected: the sample is simply not informative. The problem of non-response is thus one of imprecision. It is a serious problem, but statistics has a solid, albeit not infallible, framework to address it.
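To see the precision problem concretely, here is a toy simulation (all numbers invented) of how the standard error of a course's mean rating balloons as the number of responses dwindles:

```python
import random
import statistics

def mean_and_se(sample):
    """Sample mean and its standard error (sample sd / sqrt(n))."""
    return statistics.mean(sample), statistics.stdev(sample) / len(sample) ** 0.5

random.seed(1)
# Hypothetical cohort of 200 students whose "true" ratings on a 1-5
# scale scatter around 3.5; we only ever see the subset that responds.
population = [random.gauss(3.5, 1.0) for _ in range(200)]

for n in (200, 50, 10):
    m, se = mean_and_se(random.sample(population, n))
    print(f"responses={n:3d}  mean={m:.2f}  se={se:.2f}")
```

With only ten responses the standard error is roughly four and a half times the full-cohort figure, so even a mediocre-looking mean is compatible with a wide range of “true” evaluations.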

This is not the case for the second problem: selective non-response. Standard statistics assumes that the people who respond are – perhaps with some tweaking here and there, dubbed “corrections” – representative of those who do not. It is safe to say that this assumption is problematic here. Students who do not respond make up a heterogeneous group: they (i) feel the course is OK-ish, (ii) drop out, mentally and/or physically, (iii) feel all these evaluations are a waste of time and/or useless anyway (I am not sure I totally disagree), (iv) are on holiday or otherwise engaged, or (v) simply forget. The students who do respond, on the other hand, are either (a) people with a sense of civic duty (they’ll fill in any form) or (b) people with strong feelings about the course, positive or negative. Either way, responding students are not representative of non-responding students – and therefore not of the “population” of students. (Conversely, what are we to make of students who do not attend class but fill in the questionnaire anyway?)
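The bias this produces is easy to demonstrate with another toy simulation (the response probabilities are invented): if, say, unhappy students respond far more often than the indifferent middle, the observed mean drifts well away from the true one, no matter how many students there are:

```python
import random
import statistics

def observed_mean(ratings, response_prob):
    """Mean over the subset of students who actually respond; each
    student's chance of responding depends on his or her own rating."""
    responders = [r for r in ratings if random.random() < response_prob(r)]
    return statistics.mean(responders)

random.seed(42)
# Invented cohort: 1000 true ratings on a 1-5 scale, true mean near 3.
ratings = [random.choice([1, 2, 3, 4, 5]) for _ in range(1000)]

# Crude model of selective non-response: unhappy students (rating 1-2)
# respond four times as often as everyone else.
selective = lambda r: 0.8 if r <= 2 else 0.2

print(f"true mean     = {statistics.mean(ratings):.2f}")
print(f"observed mean = {observed_mean(ratings, selective):.2f}")
```

Unlike imprecision, this gap does not shrink as the cohort grows; more students just give a more precise estimate of the wrong number.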

Third, there are so-called covariates. That is, the “true” evaluation of a course is influenced by factors (the covariates) at the student, teacher, course, cohort, faculty and even university level. These factors have to be taken into account (i) simply to get a good understanding of the outcomes and (ii) to compare them across courses and teachers (which is the whole idea of a KPI). First-year students differ from MA students – in what they evaluate and in how they evaluate it. My personal experience is that first-years go berserk if anything about the course organization is unclear but don’t really know how to evaluate the level of the course; the opposite goes for MA students. Students at university colleges are more critical, reflecting their motivation as well as a sense of entitlement.

Of course, different types of students are part of the challenge and even fun of teaching, but in our context it means that the exact same course can lead to different evaluation outcomes. (One could counter that a good lecturer takes into account differences between students beforehand. Well, that is exactly my point. The question then is how – and to what extent – to do that, as I’m sure nobody wants lecturers to just maximize their KPI.)
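A last toy simulation (all effect sizes invented) shows how the exact same course, taught identically, ends up with different KPIs once cohort-level covariates enter:

```python
import random
import statistics

def simulate_ratings(true_quality, cohort_effect, n_students, noise=0.7):
    """Observed rating = course quality + cohort-level shift + noise."""
    return [true_quality + cohort_effect + random.gauss(0, noise)
            for _ in range(n_students)]

random.seed(7)
quality = 3.8  # the very same course, taught identically each time

# Invented cohort effects: first-years punish unclear organization,
# university-college students are structurally more critical, etc.
for cohort, effect in [("first-years", -0.4),
                       ("MA students", 0.0),
                       ("university college", -0.6)]:
    mean = statistics.mean(simulate_ratings(quality, effect, 80))
    print(f"{cohort:18s} mean rating = {mean:.2f}")
```

Comparing these means as if they measured teaching quality alone would attribute purely cohort-driven differences to the lecturer.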

Besides these three traditional statistical problems, there are yet more – often not treated in statistics courses but well known to anthropologists and psychologists. The outcomes are also influenced by (i) the framing and (ii) the order of the questions. One could also debate whether a 5-point answer scale is ideal. (Yet another discussion is whether students should be the (main) party assessing quality.)

Taking all these problems together, my conclusion is that student evaluations as they actually exist are close to useless. Their results should come with standard errors (indeed, they sometimes do), selection bias is never addressed, there is no control for student, faculty or university characteristics, and the questions are never carefully framed and ordered. At the very least, the results should be interpreted with great care.

The striking (and upsetting) thing is that all these statistical problems are taught in first-year courses (ironically, those courses are evaluated according to principles that are shown to be flawed in the very same course; one couldn’t make this up). Why, then, are student evaluations used as a KPI? For a statistician this is a mystery. For sociologists it probably isn’t. Constant monitoring of staff (by students) makes student-evaluations-as-KPI a manager’s paradise. Managers have a constant stream of quasi-information on their “dashboard”, can feel “in control” and can always bully half the staff (by construction, half of the personnel is below the median and should thus “improve performance”). Quite conveniently, managers themselves are never evaluated by lecturers, so there is no fear of retaliation (of course, we teachers would never contemplate using evaluations for payback).

Some people counter that evaluations are a good way to (a) get feedback from students, thereby improving the course, and (b) give students the opportunity to air their feelings and feel heard. I fully agree with those aims, but the current means do not further them. I am not against evaluations per se; I am against ill-designed evaluations that are (mis)used as KPIs. In fact, in a good course feedback is given frequently, organically and sometimes substantively.

I get feedback all the time – verbally, non-verbally, by mail – and often adapt my lectures accordingly (for the better, I hope). I do not need (the threat of) evaluations to change a course. If anything, evaluations incentivise me to “play it safe”, sticking to the format I know will yield a sufficient KPI. Conversely, if a university needs questionnaires to give students the feeling that they are being heard, and if students seriously feel heard because of them, then it isn’t a real university.