Valid and Fair Assessments
Large-scale standardized tests are permeating our education system at a fast pace. It is therefore very important that the testing instruments undergo technical analyses, so that the examinations and assessments are truly fair to students and useful for the education system at large. Punjab, through the Punjab Examination Commission (PEC), has been administering examinations to all children attending public schools at the end of class 5 and class 8. Sindh has done the same by outsourcing the development and administration of examinations at the same grade levels to a private firm.
In addition to these large-scale tests, the provinces have also been experimenting with small-scale, sample-based educational evaluations. Most of these tests are supposed to be standardized. But what is standardization all about? We must be able to understand the nature of standardized tests, see how they differ from traditional examinations, and see why we do not have them yet. Large-scale tests are sometimes incorrectly called standardized even when they have not gone through the technical process of standardization.
To extend this discussion, let us first look at our current perspectives on testing. Our understanding of examinations and assessments is tainted by our personal encounters with the traditional examination system. As a student, the only examinations I, and most other children, cared about were the dreaded, yet eagerly anticipated, annual examinations. These examinations carried very high stakes for us, as they still do for our children. Passing them meant promotion to a new grade and, more importantly, a bag full of new books, new stationery, and several other goodies. Failing them meant being left behind in intolerable ways.
Also, all I cared about was the raw score, that is, the marks I obtained on a test. The marks were always out of a total of 100, except in some subjects that carried a little less or more weight in the scheme of studies. Marks were the only thing that distinguished the good students from the not-so-good students.
Yet, as we now know, the raw score has never been a good way of discerning between students taking different tests at the same level. For example, consider candidates A and B, who appear in the secondary examinations organized by two different boards of secondary and intermediate education in Punjab. A gets 80 out of 100 in a certain subject and B gets 70 out of 100. Can we infer that A is more able than B? Traditionally, that would be our conclusion. But this conclusion is unwarranted, since it ignores the possible difference in difficulty between the two papers, set by two different paper setters in two different places. Raw score was a crude criterion for comparison, and it remains so for most students, parents, and teachers.
Our higher education institutions do not offer relevant courses in quantitative methods of assessment. As a result, we remain perennially and indefensibly dependent on foreign consultants to meet the demand for developing tests at the primary level.
As a schoolteacher, I also let the expectations about terminal examinations determine what I taught my students. The topics were divided into ‘very important’, ‘important’, and ‘not important’. My teaching was seldom designed to cover the curriculum; instead, it aimed to ensure that my students mastered all ‘very important’ and ‘important’ topics. My goal as a teacher was to do everything I could to raise their chances of getting more and more marks in the terminal exams. I was not alone in teaching to the test. Nearly every schoolteacher I knew then had similar concerns and objectives. In my early years as an education researcher, I learnt the phrase WYTIWYG (What You Test Is What You Get). The nature of tests and their stakes drive the teaching and learning in classrooms.
The onset of standardization of tests, or standardized tests, does not change the basic truth about teaching and learning embodied in WYTIWYG. It does, however, change the way tests are constructed, administered, and interpreted. Raw scores lose much of their traditional value. The absolute marks a student obtains do not by themselves determine his or her position relative to other students, who may be taking a slightly more difficult or slightly easier version of the test at the same level. Different versions of the same test may vary in the difficulty of the items they contain.
Relying on raw scores is extremely unfair, because a student of lower ability taking an easier version of a test may obtain a higher score than a student of higher ability taking a more difficult version of the same test. Scaled scores address this problem through test equating, that is, by adjusting for the relative difficulty of different editions of the same test.
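Test equating can be done in several ways, and this article does not say which method any board or commission uses. As one illustration, here is a minimal Python sketch of linear equating, assuming a random-groups design: two forms (called Form X and Form Y here) were given to randomly equivalent groups of students, so any gap between the two score distributions is treated as a difference in form difficulty rather than in ability. A Form X score x is placed on the Form Y scale as mu_Y + (sigma_Y / sigma_X)(x - mu_X), which keeps its standardized position (z-score) the same. The function name `linear_equate` and the score lists are hypothetical.

```python
from statistics import mean, pstdev


def linear_equate(x, form_x_scores, form_y_scores):
    """Place raw score x from Form X on the Form Y scale (linear equating).

    Assumption (random-groups design): both forms were taken by randomly
    equivalent groups, so any gap between the two score distributions is
    treated as a difference in form difficulty, not in student ability.
    """
    mu_x, sd_x = mean(form_x_scores), pstdev(form_x_scores)
    mu_y, sd_y = mean(form_y_scores), pstdev(form_y_scores)
    if sd_x == 0:
        raise ValueError("Form X scores have no spread; cannot equate")
    # Keep the same z-score: (x - mu_x) / sd_x == (y - mu_y) / sd_y
    return mu_y + (sd_y / sd_x) * (x - mu_x)


# Hypothetical score samples: Form X turned out easier than Form Y.
form_x = [60, 70, 80, 90, 100]
form_y = [50, 60, 70, 80, 90]

print(linear_equate(80, form_x, form_y))  # 70.0
```

With these made-up numbers, a raw 80 on the easier Form X corresponds to 70 on the Form Y scale, illustrating how a higher raw score on an easier paper need not mean higher ability. Operational equating depends on a carefully planned data-collection design (for example, randomly equivalent groups or common anchor items) and frequently uses equipercentile or IRT-based methods instead of a straight line.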
But how do we find out whether a particular test question (called a test item in technical parlance) is more or less difficult? No, we cannot make that decision through our own subjective judgment. We do not know the relative difficulty of particular items until we put them through the empirical test of giving them to a sample of students and analyzing the results. If most students answer an item correctly, we may conclude that it is less difficult than another item that elicits fewer correct responses.
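In classical test theory, this empirical check is usually summarised as an item difficulty index, often called the item's p-value: the proportion of students in the sample who answer the item correctly. A lower p-value means a harder item. The minimal Python sketch below assumes responses have already been scored 0 (wrong) or 1 (right); the function names `item_difficulty` and `hardest_first` and the tiny response matrix are hypothetical. Because the index depends on who sits the pilot test, the sample should resemble the students who will take the real test.

```python
def item_difficulty(responses):
    """Return the proportion of students answering each item correctly.

    responses: one row per student, one 0/1 entry per item (1 = correct).
    """
    if not responses:
        raise ValueError("need responses from at least one student")
    n_items = len(responses[0])
    if any(len(row) != n_items for row in responses):
        raise ValueError("every student must have a score for every item")
    n_students = len(responses)
    # zip(*responses) turns student rows into item columns.
    return [sum(column) / n_students for column in zip(*responses)]


def hardest_first(p_values):
    """Item indices ordered from hardest (lowest p) to easiest."""
    return sorted(range(len(p_values)), key=lambda i: p_values[i])


# Hypothetical pilot: 4 students x 3 items.
pilot = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]

p = item_difficulty(pilot)
print(p)                 # [1.0, 0.5, 0.25]
print(hardest_first(p))  # [2, 1, 0]
```

Here item 2 is the hardest and item 0, answered correctly by everyone, is the easiest; an item that every student gets right also tells us little about differences among students.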
Students in intermediate, secondary, middle, and primary institutions must get a fair deal. They deserve to be judged and certified impartially. The advantages of a good examination system are not confined to students. Teacher educators, policymakers, and education researchers can also learn from it and do their bit to improve it further.