©1997

Arthur Warmoth

Psychology Department

Sonoma State University

Every five years, the CSU requirement for a Program Review Self Study provokes a discussion of the Psychology Department’s grading practices within the faculty. While we were not dissatisfied with the outcomes of this year’s discussion, we were reminded that letter grades, and particularly the GPA, are deeply flawed statistics. Furthermore, the assumptions behind these methodological flaws are so deeply ingrained in academic culture that otherwise quite sophisticated social scientists fail to see them.

Individually and together, we have from time to time, in various contexts, addressed these flaws. We believe that they lead to consequences which do a disservice to the department and to the individual users of grades as statistical information, that is, to students (and their families), graduate schools, and employers. The disservice to the department derives from invidious comparisons of grading practices that are based on untenable and oversimplified interpretations of the measurement and evaluation issues involved. The disservice to individual users can include a very poor quality of information relative to the purpose for which evaluative information is needed. Furthermore, a spurious appearance of quantitative precision can be given to decisions which are essentially random, or even arbitrary and capricious.

It is not our intent in this paper to present a fully developed alternative approach. However, we find ourselves in an era where technology permits highly sophisticated techniques for processing complex information. This in turn requires the average educated citizen, not to mention the average academic, to become a more sophisticated information manager. By focusing on the statistical flaws in our customary grading practices, and by inviting participation in the development of a more methodologically sound approach to evaluation, we hope to engage both our colleagues and our students to become more knowledgeable interpreters and managers of basic social science data.

There are three basic methodological problems with the use of letter grades as the universal statistic for measuring academic achievement:

1. They encourage the comparison in quantitative terms of variables that are comparable only at the basic level of qualitative measurements.

2. They are therefore subject to inappropriate methodological manipulation, thus creating a spurious appearance of quantitative accuracy.

3. They are therefore also subject to untenable assumptions about the appropriate distribution of grades, or grading curve, that should be expected from the grading process.

We will address each of these issues in turn. We will then briefly explore a philosophy of "criterion-referenced grading" which we believe offers the possibility of grading practices that are more accurate, fair, and rich in informational content.

One of the first things that a student in a statistics or research methodology class learns is that not all measurements are alike. With some measuring systems the numbers merely stand for different classifications, e.g., codes for medical diagnostic categories. With such "nominal" data, the only appropriate average is the mode or most common score.

Some measuring systems arrange objects or people into ranks or categories that have a relative order. The spacing between categories is not necessarily uniform, but the order of the categories is fixed. Examples include military ranks and rating scales. With such "ordinal" data, one can use the median or middle score as an average in addition to the mode.

Some measuring systems arrange objects or people into equally spaced categories. Here a given distance between two points at the bottom of the scale represents the same difference as an equal distance between two points at the top of the scale. Examples include relative humidity, specific gravity and temperature on a centigrade thermometer. With such "interval" data, one may appropriately calculate an arithmetic mean as well as a median and a mode.

The most sophisticated and reliable measuring systems have all the properties of an interval scale (above) as well as a true zero point. Examples include length, weight and temperature on a Kelvin thermometer. With such "ratio" data, one may appropriately calculate a harmonic mean as well as an arithmetic mean, a median and a mode.
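The hierarchy of scales above can be illustrated with a short sketch using Python's standard `statistics` module. The data values here are hypothetical; the point is only which average is legitimate at each level of measurement.

```python
from statistics import mode, median, mean, harmonic_mean

# Hypothetical data illustrating the four scales of measurement.
diagnoses = ["flu", "flu", "cold", "asthma"]   # nominal: labels only
ranks = [1, 2, 2, 3, 5]                        # ordinal: order fixed, gaps unequal
temps_c = [18.0, 20.0, 22.0, 30.0]             # interval: equal spacing, arbitrary zero
weights_kg = [55.0, 60.0, 62.0, 80.0]          # ratio: true zero point

print(mode(diagnoses))             # nominal -> the mode is the only legitimate average
print(median(ranks))               # ordinal -> the median (and mode) become legitimate
print(mean(temps_c))               # interval -> the arithmetic mean becomes meaningful
print(harmonic_mean(weights_kg))   # ratio -> the harmonic mean also becomes meaningful
```

Each step up the hierarchy licenses one additional kind of average; applying a mean to the `diagnoses` or `ranks` lists would be exactly the category error this section describes.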

The type of measuring system that describes a particular property is the result of both the way that the property is measured and the intrinsic character of the property itself. For example, no matter how careful and sophisticated the measuring technique, medical diagnostic categories will always be nominal data.

This limitation clearly applies even to the comparison of students within a department. For example, within the psychology curriculum, we have identified three separate areas of knowledge and skills that are relevant to the major:

1. Academic knowledge and skills: mastery of the theory and research base of those aspects of psychology appropriate to the student's goals. Ability to integrate and use this knowledge. Ability to communicate the student's knowledge both orally and in writing.

2. Process knowledge and skills: These skills include (but are not limited to): communication & listening skills, group process skills (participation & facilitation), biofeedback training, collaborative learning skills, project management skills, and creative expression & facilitation.

3. Self-knowledge and personal awareness: value and goal clarification and motivational development. Tools for self-awareness, including journal keeping, dream work, somatic awareness & yoga, and creative expression. (Psychology Department Five-Year Self-Study Report, April 1997, pp. 25-26.)

While each of these areas is important to success within the major, they are very difficult to compare or equate in any statistically meaningful way.

When we look at grading practices across disciplines, meaningful comparison becomes even more difficult. Looking only at the basic information bases of the disciplines, there is such great variety that their mastery is very difficult to compare. And if we consider the actual variety of performance skills that go into success in the various disciplines, meaningful comparison becomes more difficult still.

Historically, the assumption of comparability was buttressed by the assumption of some common underlying general intelligence ("g") factor that underwrites success in any academic field. However, that assumption has been discredited by psychometric research, and it is even less credible in the light of Howard Gardner's "multiple intelligence" model. But even if there were a "g" factor, an equal distribution of performance across disciplines would not be very likely, due to differences in such variables as pedagogical strategies and motivational factors.

__Grades as an example of Ordinal Measurement.__
Clearly, grades are ordinal measurements. They are supposed to represent how much a student has learned or the quality of his/her academic product. However, our standards are imprecise, and we can express only relative amounts of knowledge or performance. Thus grades are much more similar to rating scales than to measurements with a centigrade thermometer. Moreover, it is difficult to maintain that the distance from an "F" to a "D" is the same as the distance from a "B" to an "A." Thus, at least as currently measured, grades are only ordinal data.

Going back to our discussion of Scales of Measurement (above), one would conclude that it only would be appropriate to calculate medians or modes as an average of grades. An arithmetic mean or "GPA" would tend to distort the data and give too much influence to extreme scores. Another way to say this is that grade point average (GPA) assumes that the intervals between grades are equal and uses this information in establishing the exact geometric center of the distribution. Without equal intervals between points on the scale, ordinal distributions have no geometric center.
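The distortion is easy to demonstrate with an illustrative (hypothetical) grade list: a single extreme score drags the mean a full category away from what the median, the appropriate average for ordinal data, reports.

```python
from statistics import mean, median

# Hypothetical grade points for one student's courses (A=4, B=3, C=2, D=1, F=0).
grades = [4, 4, 4, 4, 0]   # four A's and one F

print(mean(grades))    # 3.2 -- the single F pulls the "GPA" down a full category
print(median(grades))  # 4   -- the middle score, appropriate for ordinal data
```

The mean of 3.2 implies a "B+ student," while the median correctly reports that the typical grade earned was an A; only the mean gives the one extreme score this much leverage.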

__Average of averages__. Every student of statistics and research methodology also learns that it is improper to average a group of averages unless each of the individual averages is based on the same number of scores. The proper way to obtain an overall average is to multiply each individual average by the number of scores included in it, sum these products, and then divide by the total number of scores. This weights each individual average by the number of scores it represents. An unweighted overall average treats the average of three scores the same as an average of three hundred scores.

In obtaining students' overall GPA for their transcripts, we are averaging averages for each class and, arbitrarily, weighting by the number of units in each class. This is clearly an error in mathematical procedure. If we were to give credence to the idea of an arithmetic average of grades at all, i.e., if we ignore their ordinal nature, we should weight each grade by the number of measurements it represents. That is, weight each grade by how many distinct tests, papers or other evaluative assignments it represents.
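A small sketch makes the difference concrete. The two hypothetical courses below echo the example in the next paragraph: a one-paper special study and a lecture course with six graded assessments. Weighting by credit units and weighting by number of measurements give different overall averages.

```python
# Hypothetical courses: (mean grade points, number of graded assessments, credit units).
courses = [
    (4.0, 1, 1),   # special study: a single graded paper, 1 unit
    (2.0, 6, 3),   # lecture course: four tests + two papers, 3 units
]

# Conventional GPA: weight each course mean by credit units.
gpa = sum(g * u for g, _, u in courses) / sum(u for _, _, u in courses)

# The statistically proper weighting: by the number of measurements
# each course average actually represents.
weighted = sum(g * n for g, n, _ in courses) / sum(n for _, n, _ in courses)

print(round(gpa, 2))       # 2.5  -- unit-weighted GPA
print(round(weighted, 2))  # 2.29 -- measurement-weighted average
```

The single paper in the special study carries one unit against the lecture course's three, but only one measurement against its six; the conventional GPA quietly overweights the thinner evidence.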

Calculating GPAs of teachers, departments and schools for comparison purposes is a similarly flawed mathematical process. A special studies class with one student and one graded paper is given the same weight as an upper division lecture class with thirty students, four multiple choice tests and two papers. This is clearly a nonsensical procedure.

How precise are grades? That depends on how you look at them. If grades are thought of as simple ranks or ratings, then they are precise to a single digit, i.e., we should round off GPA to the nearest integer. If we accept the plus/minus convention approved by the University (see the list below), then they are precise to two significant digits and we should round off GPA to the nearest tenth of a grade point.

Another view is that grades should be rounded to the nearest category in the list below.

A  = 4.0    A- = 3.7    B+ = 3.5
B  = 3.0    B- = 2.7    C+ = 2.5
C  = 2.0    C- = 1.7    D+ = 1.5
D  = 1.0    D- = 0.7    F  <= 0.6*

The GPA is rounded to two decimal digits on students' transcripts. No matter which of the above views on precision you adopt, it is evident that we are reporting and using grades as if they were at least ten times more precise than they actually are. Even worse, we take action based on this spurious precision. For example, assuming a student has 45 graded units at Sonoma, if s/he has an overall GPA of 3.50 or above s/he will graduate with honors. If s/he has an overall GPA of 3.49, s/he will not graduate with honors. Similarly, lack of a hundredth of a grade point may keep a student from being admitted to SSU.

* Note the non-linearity of the grade to grade point scale. An increase of one grading category may be translated as .3, .2 or .5 grade points, depending on where you start.
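The honors example above can be stated as a two-line rule, which makes the spurious precision visible: the hypothetical function below encodes the 3.50 cutoff described in the text, and the rounding comparison shows that the two GPAs it separates are indistinguishable even at the tenths precision the plus/minus convention supports.

```python
def graduates_with_honors(gpa: float) -> bool:
    # The cutoff rule described in the text: honors at an overall GPA of 3.50 or above.
    # (Illustrative sketch; the real policy has additional conditions, e.g. 45 graded units.)
    return gpa >= 3.50

print(graduates_with_honors(3.50))  # True
print(graduates_with_honors(3.49))  # False: one hundredth of a grade point decides

# Yet at the precision the plus/minus convention actually supports (tenths),
# the two students' GPAs are identical:
print(round(3.50, 1), round(3.49, 1))  # 3.5 3.5
```

A decision rule that flips on a digit beyond the measurement's real precision is, in effect, deciding by noise.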

The assumption that grades should be normally distributed rests on the finding, during the latter half of the nineteenth century, that errors of measurement, polygenetic biological traits and most human abilities are approximately normally distributed. While this is true, there are at least three reasons why this finding cannot be applied uncritically to grading.

The first reason is that a sample of small to moderate size often is not normally distributed, though the population from which it was obtained may have been normally distributed. According to the Central Limit Theorem, the means of such samples will be normally distributed, but the samples themselves, and the sampling distributions of other statistics that pertain to them, such as the range and standard deviation, are distinctly nonnormal.
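A quick simulation illustrates the point. Drawing many hypothetical "classes" of five students from a normal population, the sample means do cluster symmetrically around the population mean, but the sample standard deviations are skewed and systematically low for so small an n.

```python
import random
import statistics

random.seed(1)
mu, sigma = 3.0, 0.5   # hypothetical normal population of "true achievement"

sample_means, sample_sds = [], []
for _ in range(5000):
    # Each "class" is a small sample (n = 5) from the normal population.
    sample = [random.gauss(mu, sigma) for _ in range(5)]
    sample_means.append(statistics.mean(sample))
    sample_sds.append(statistics.stdev(sample))

# The sample means cluster symmetrically around mu (Central Limit Theorem)...
print(round(statistics.mean(sample_means), 2))
# ...but the sampling distribution of the standard deviation is skewed,
# and for n = 5 its average falls noticeably below sigma.
print(round(statistics.mean(sample_sds), 2))
```

A thirty-student class is a single small sample, not a distribution of sample means, so nothing in the Central Limit Theorem obliges its grades to look bell-shaped.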

The second reason is that to maintain normality, the population must be unselected on the property being measured. We cannot expect a sample of students selected on the basis of their high grades to have a normal distribution of grades. The fact that the CSU system accepts only the top 33% of high school students (based on their high school grades) means that our students will probably have an extremely skewed distribution of grades in our freshman classes. If our freshman grades are not highly negatively skewed, then something is wrong: (1) high school grades are not predictive of college success and should be dropped as an entry requirement, (2) there is a bias in our grading system, or (3) there is an element or elements in our college grading system that was not present in high school grading. If the latter reason is primarily responsible, we should look for an additional basis for choosing students. This should also tend to reduce the infamously high freshman drop-out rate that plagues most universities.
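The effect of selection on the shape of a distribution is easy to simulate. Starting from a hypothetical, normally distributed pool of high-school achievement scores and admitting only the top third, the admitted group is no longer symmetric: its mean sits well above its median, one simple signature of skew.

```python
import random
import statistics

random.seed(0)

# Hypothetical high-school achievement scores, normally distributed.
scores = sorted(random.gauss(2.5, 0.7) for _ in range(30000))

# Admit only the top third, mirroring the eligibility rule described above.
admitted = scores[20000:]

# In a symmetric distribution the mean equals the median; in the truncated
# (admitted) group the mean sits well above the median, so the distribution
# is skewed and no normal grading curve should be expected of it.
skew_gap = statistics.mean(admitted) - statistics.median(admitted)
print(round(skew_gap, 3))
```

Any cutoff on the measured trait truncates one tail and destroys normality; expecting a bell curve from a group defined by such a cutoff is expecting the selection to have had no effect.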

What is true of freshman grades is even more true of students at higher levels. The degree of selectivity grows ever greater as more students drop out and as students begin to cluster into special interest groups such as majors. Thus, any attempt to maintain a symmetrical grade distribution becomes more and more artificial and wrongheaded.

Finally, there is the effect of teaching. Even if a pretest were to show that a group of students entered a class with a symmetrical distribution of knowledge and skills on the subject matter of the class, there is no reason to suppose that they would have the same distribution at the end of the class. Good teaching negatively skews the distribution. That is, by the end of the semester, most students will have learned a substantial amount about the subject matter of the course and test near the top of the class standards.

By arbitrarily manipulating test parameters or invoking a quota system, it is possible to obtain a politically correct grade distribution, one that matches some presumed correct distribution, e.g., approximately normal. However, this may grossly distort the actual distribution of knowledge and skills of the students. Individual professors and even departments are frequently asked to change their "grading standards" to come into line with university norms. This is not a legitimate request and should be ignored. The statistical benchmarks that make up the university norms are, as we have shown here, mathematically and pedagogically indefensible.

A criterion-referenced grading standard specifies a set of knowledge and skills that a student should master for each grade. While this is more difficult and time consuming to set up, it has the advantage that it can be clearly expressed to students and colleagues. One can discuss the appropriateness of each set of knowledge and skills and come to agreement or at least enlightened disagreement. Furthermore, the standards for each class may be refined from semester to semester as experience dictates.
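As a minimal sketch of what such a standard might look like in practice, the structure below maps each grade to an explicit set of criteria and awards the highest grade whose full set the student has met. The criterion names are hypothetical illustrations, not the department's actual standards.

```python
# A minimal sketch of a criterion-referenced standard for one course.
# Criterion names are hypothetical illustrations.
criteria = {
    "A": {"research_paper", "oral_presentation", "exam_mastery", "group_facilitation"},
    "B": {"research_paper", "oral_presentation", "exam_mastery"},
    "C": {"research_paper", "exam_mastery"},
    "D": {"exam_mastery"},
}

def assign_grade(completed: set) -> str:
    """Award the highest grade whose full criterion set the student has met."""
    for grade in ("A", "B", "C", "D"):
        if criteria[grade] <= completed:   # subset test: every criterion satisfied
            return grade
    return "F"

print(assign_grade({"research_paper", "exam_mastery"}))  # "C"
```

Because the standard is a public, inspectable object rather than a private curve, colleagues and students can debate the criteria themselves, which is exactly the kind of discussion the paragraph above calls for, and the standard can be revised from semester to semester.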

__The advantages of criterion-referenced grading.__
From a pedagogical perspective there is a particular advantage in a criterion-referenced grading system. Once the standard is enunciated and clarified, the teacher can step away from it and concentrate on helping students meet it. The students are no longer in the position of having to please the teacher. Instead, students and teacher can work together to help the students learn and achieve.

Criterion-referenced grading systems work for any type of class. In some classes particular facts or cognitive skills may dominate the criteria. In other classes, process oriented knowledge and affective skills may dominate the criteria. The choice is up to the teacher and will be shaped by his/her professional judgment of what is appropriate for the class and the field. However, because such a grading system puts strong emphasis on public declaration of specific knowledge and skills for each grade, every professor is open to challenge of the rigor and appropriateness of his/her grading system. This will encourage lively and productive debate: debate that is specific enough to form a realistic basis for change and growth.

One alternative to letter grading and the use of the GPA is the narrative transcript. This approach has been practiced at innovative schools such as the University of California at Santa Cruz and The Evergreen State College in Washington State. The problem with this approach is the large quantity of often redundant information that it generates, as well as the additional amount of faculty time required to generate the transcripts. However, the second author's experience with this system at Evergreen suggests that, in practice, the narrative statements often become code phrases for the evaluation of specific criteria that are relatively standardized, as well as within the competence of the faculty member to evaluate efficiently. This includes competencies such as the mastery of specific subject matters, effective written and oral communication skills, and critical thinking abilities. The advent of information processing technology gives us the possibility of managing this type of information with a relatively high degree of ease and speed. By including information about the relevant criteria in a given class, and comparing performance on related criteria in different classes, we could generate transcripts with a much higher level of useful information content than we are currently producing. The main thing we would need to give up is the illusion of the spurious accuracy of student ranking that we now create by using the GPA.

If most university classes had criterion-referenced grading systems, we would no longer need to speak of potential grade inflation in terms of vague, mathematically imprecise and unreliable statistics--statistics that are only indirectly related to grading standards. Instead, we could critique the actual standards on which the grades were based.