Saturday, 13 May 2017

The Evaluation of Adaptive Comparative Judgement as an Information Theoretical Problem

Adaptive Comparative Judgement (ACJ) is an assessment technique which has fascinated me for a long time. Only recently, however, have I had the opportunity to try it properly, and its application is not in education but in medicine (higher education, for some reason, has been remarkably conservative about experimenting with its traditional methods of assessment!).

ACJ is a technique of pair-wise assessment where individuals are asked to compare two examples of work, or (in my case) two medical scans. They are asked a simple question: Which is better? Which is more pathological? etc. The combination of many judges and many judgements produces a ranking from which a grading can be produced. ACJ inverts the traditional educational model of grading work to produce a ranking; it ranks work to produce a grading.
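
As a rough illustration of how many pairwise judgements can be turned into a ranking, here is a minimal sketch. It assumes a Bradley-Terry model, which is a common choice for paired comparisons but is not something this post specifies; the function name, the pseudo-count and the example data are all hypothetical.

```python
from collections import defaultdict


def bradley_terry(judgements, iterations=200):
    """Estimate a strength score per item from (winner, loser) pairs.

    Uses the standard minorisation-maximisation update for the
    Bradley-Terry model. The model itself is an assumption made for
    illustration, not a description of any particular ACJ system.
    """
    items = sorted({x for pair in judgements for x in pair})
    wins = defaultdict(int)  # total wins per item
    met = defaultdict(int)   # number of comparisons per unordered pair
    for winner, loser in judgements:
        wins[winner] += 1
        met[frozenset((winner, loser))] += 1

    strength = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = sum(
                met[frozenset((i, j))] / (strength[i] + strength[j])
                for j in items
                if j != i
            )
            # Small pseudo-count (an arbitrary choice) keeps an item that
            # never wins from collapsing to exactly zero.
            updated[i] = (wins[i] + 0.01) / denom if denom else strength[i]
        # Rescale so the strengths always sum to the number of items.
        total = sum(updated.values())
        strength = {i: s * len(items) / total for i, s in updated.items()}
    return strength


# Hypothetical judgements: each tuple is (preferred item, other item).
judgements = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 2
)
scores = bradley_terry(judgements)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With these made-up judgements, A wins most of its comparisons and C wins fewest, so the ranking places A above B above C. A grading could then be read off the scores rather than assigned directly.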

In medicine, ACJ has fascinated the doctors I am working with, but it also creates some confusion because it is so different from traditional pharmacological assessment. In the traditional assessment of the efficacy of drugs (for example), data is examined to see whether the administration of the drug is an independent variable in the patient getting better (the dependent variable). The efficacy of the drug is assessed against its administration to a wide variety of patients (whose individual differences are usually averaged out in the statistical evaluation). In other words, traditional clinical evaluation assumes a linear relationship of the form
P(patient) + X(drug) = O(outcome)
where outcome and drug are shown to be correlated across a variety of patients (or cases).

ACJ is not linear, but circular. The hoped-for outcome of ACJ is a reliable ranking: that is, a ranking which accords with the judgements of the best experts. But it is not ACJ which does this - it is not an independent variable. It is a technique for coordinating the judgements of many individuals. Technically, there is no need for more than one expert judge to produce a perfect ranking. But the burden of producing consistent expert rankings will be too great for any single judge (however good they are), and consistency will suffer. ACJ works by enlisting many experts in making many judgements, to reduce the burden on a single expert and to coordinate differences between experts in a kind of automatic arbitration.

Simply because it cannot be seen to be an independent variable does not mean that its efficacy cannot be evaluated. There are no independent variables in education - but we have a pretty good idea of what does and doesn't work.

What is happening in the ACJ process is that a ranking is communicated through the presentation of pairs of images to the collective judgements of those using the system. The process of communication occurs within a number of constraints:

  1. The ability of individual judges to make effective judgements
  2. The ease with which an individual judgement might be made (i.e. the degree of difference between the pairs)
  3. The quality of presentation of each case (if they are images, for example, the quality is important)

An individual's inability to make the right judgement amounts to the introduction of "noise" into the ranking process. With too much "noise" the ranking will be inaccurate.

The ease of making a judgement depends on the degree of difference, which in turn can be measured as the relative entropy between two examples. If they are identical, then the relative entropy will be zero. Equally, if images are the same, the mutual information between them will be high, calculated as:
I(a; b) = H(a) + H(b) - H(a, b)
where H(a, b) is the joint entropy of the two images.
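If the features of each item to be compared can be identified, and each of those features belongs to a set i, then the entropy of each case can be measured simply as a value for H, across all the values of x in the set i. In the standard Shannon form this is:
H = -Σ p(x) log2 p(x), summed over all x in i
where p(x) is the relative frequency of feature x in the case.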
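
To make the entropy and mutual information measures concrete, here is a small sketch. It assumes each image has already been reduced to a sequence of discrete feature labels, one per region, which is a strong simplification; the labels and the three example "scans" are invented for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(a, b):
    """I(a; b) = H(a) + H(b) - H(a, b) for two aligned feature sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    # The joint entropy is the entropy of the paired-up feature values.
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


# Hypothetical feature labels for four regions of three scans.
scan_a = ["dark", "dark", "light", "light"]
scan_b = ["dark", "dark", "light", "light"]  # identical to scan_a
scan_c = ["dark", "light", "dark", "light"]  # unrelated to scan_a

print(mutual_information(scan_a, scan_b))  # 1.0
print(mutual_information(scan_a, scan_c))  # 0.0
```

The identical pair shares all of its one bit of information, while the unrelated pair shares none, even though all three scans have the same entropy on their own.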

The ability to make distinctions between the different features will depend partly on the quality of images. This may introduce uncertainty in the identification of values of x in i.

ACJ deals with issues 1 and 2. Issue 3 is more complex because it introduces uncertainty as to how features might be distinguished. ACJ deals with 1 and 2 in the same way that information theory deals with problems of transmission: it introduces redundancy.
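
The effect of redundancy on noise can be sketched with a deliberately simplified model: suppose the same pair is judged several times, each judgement is wrong independently with the same probability, and the majority decision is taken. Real ACJ does not work exactly like this (it spreads comparisons across different pairs), so the function below is only an illustration of the principle, and the 20% error rate is an arbitrary assumption.

```python
from math import comb


def majority_error(p_wrong, n_judgements):
    """Probability that a majority of n independent judgements is wrong.

    Assumes each judgement is wrong independently with probability
    p_wrong, and that n_judgements is odd so there are no ties.
    """
    if n_judgements % 2 == 0:
        raise ValueError("use an odd number of judgements to avoid ties")
    needed = n_judgements // 2 + 1  # smallest number of errors that wins
    return sum(
        comb(n_judgements, k)
        * p_wrong**k
        * (1 - p_wrong) ** (n_judgements - k)
        for k in range(needed, n_judgements + 1)
    )


# Error of the majority decision as redundancy increases.
for n in (1, 3, 5, 9):
    print(n, round(majority_error(0.2, n), 4))
```

Under these assumptions a single judgement is wrong 20% of the time, while the majority of three is wrong about 10.4% of the time, and the error keeps falling as more judgements are added.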

That means that the number of comparisons needed to be made by each judge is dependent on the quality and consistency of the ranking which is produced. This can be measured by determining the distance between the ranking produced by the system and the ranking determined by experts. Ranking comparisons can be made for the system as a whole, or for each judge. Through this process, individual judges may be removed or others added. Equally, new images may be introduced whose ranking is known relative to the existing ranking.
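
One way to put a number on the distance between two rankings is the normalised Kendall tau distance: the fraction of item pairs that the two rankings order differently. The post does not name a particular distance measure, so this is just one reasonable candidate, and the scan names are hypothetical.

```python
from itertools import combinations


def kendall_tau_distance(ranking_a, ranking_b):
    """Fraction of item pairs ordered differently by two rankings (0 to 1)."""
    if set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same items")
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    pairs = list(combinations(ranking_a, 2))
    if not pairs:
        return 0.0
    # A pair is discordant when the two rankings disagree on its order.
    discordant = sum(
        1
        for x, y in pairs
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return discordant / len(pairs)


expert = ["scan1", "scan2", "scan3", "scan4"]
system = ["scan1", "scan3", "scan2", "scan4"]
# One swapped pair out of six possible pairs.
print(kendall_tau_distance(expert, system))
```

A distance of 0 means the system reproduces the expert ranking exactly and 1 means it reverses it. The same function could be applied to the ranking implied by a single judge's decisions.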

The evaluation of ACJ is a control problem, not a problem of identifying it as an independent variable. Fundamentally, if ACJ doesn't work, it will not be capable of producing a stable and consistent ranking - and this will be seen empirically. Such a failure would mean that the complexity of the judges performing the ranking is not as great as the complexity of the input to be ranked. The complexity of the input will depend on the number of features in each image, and the distance between each pair of images.

In training, we can reduce this complexity by having clear delineations of complexity between different images. This is the pedagogical approach. As the reliability of the trainee's judgements increases, so the complexity of the images can be increased.

In the clinical evaluation of ACJ, it is possible to produce a stabilised ranking by:

  1. removing noise by removing unreliable judges
  2. increasing redundancy by increasing the number of comparisons
  3. introducing new (more reliable) judges
  4. focusing judgements on particular areas of the ranking (so particular examples) where inconsistencies remain

As a control problem, what matters are the levers of control within the system.
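
The first of these levers, removing unreliable judges, could be sketched as follows. The idea is to measure how often each judge's decisions agree with a reference ranking (the expert ranking, or the current consensus) and to flag judges who fall below a cut-off. The data layout, the function names and the 0.7 threshold are all assumptions made for illustration.

```python
def judge_agreement(judgements, ranking):
    """Share of each judge's decisions that agree with a reference ranking.

    judgements: iterable of (judge, winner, loser) tuples.
    ranking: list of items, highest ranked first.
    """
    position = {item: i for i, item in enumerate(ranking)}
    agree, total = {}, {}
    for judge, winner, loser in judgements:
        total[judge] = total.get(judge, 0) + 1
        # The judgement agrees if the winner sits higher in the ranking.
        if position[winner] < position[loser]:
            agree[judge] = agree.get(judge, 0) + 1
    return {judge: agree.get(judge, 0) / total[judge] for judge in total}


def flag_unreliable(judgements, ranking, threshold=0.7):
    """Judges whose agreement rate falls below a hypothetical threshold."""
    rates = judge_agreement(judgements, ranking)
    return sorted(j for j, rate in rates.items() if rate < threshold)


reference = ["a", "b", "c"]
judgements = [
    ("judge1", "a", "b"), ("judge1", "b", "c"), ("judge1", "a", "c"),
    ("judge2", "b", "a"), ("judge2", "c", "b"), ("judge2", "a", "c"),
]
print(flag_unreliable(judgements, reference))  # ['judge2']
```

Here judge1 agrees with the reference on all three decisions and judge2 on only one, so only judge2 is flagged. In practice the threshold would itself be one of the levers to tune, and a flagged judge might be retrained rather than removed.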

It's worth thinking about what this would mean in the broader educational context. What if ACJ were a standard method of assessment? What if the judgement by peers was itself open to judgement? In what ways might a system like this assess the stability and reliability of the rankings that arise? In what ways might it seek to identify "semantic noise"? In what ways might such a system adjust itself, manipulating its control levers to produce reliability and to gradually improve the performance of those whose judgements might not be so good?

The really interesting thing is that everything in ACJ is a short transaction. But it is a transaction which is entirely flexible and not constrained by the absurd forces of timetables and cohorts of students.
