The Assessment Capable Teacher, Part 2

Validity, Reliability, and Fairness

As discussed in a previous post, assessment literate teachers – or assessment capable teachers as the PYP calls them – understand standards of assessment quality. The most commonly referenced standards are validity, reliability, and fairness. During my research I found that the complexity of these issues differed, depending upon who was conducting the inquiry – teachers, school leaders, or researchers.

James Popham (2009) recommends dividing assessment into two categories: classroom-based assessments and accountability assessments. He argues that classroom-based assessment, that is, assessment whose data is used by students and teachers to determine next steps, is the first priority for teachers to understand, as it has the greatest impact on student learning. He defines classroom-based assessments as “those formal and informal procedures that teachers employ in an effort to make accurate inferences about what their students know and can do” (6). These assessments can be diagnostic, formative or summative, as long as the data is used to improve learning.

I am going to trust Popham’s advice here and focus on how the key measurement concepts of validity, reliability and fairness apply to classroom-based assessments. I am also going to warn you that we are moving into an area where my own knowledge is in its infancy.


Popham (2009) and Hickey (2014) both consider validity to be THE MOST IMPORTANT construct in assessment. Validity, defined simply, is the degree to which the inference or argument you make about student learning is supported by the data collected.

Another point that is made consistently in literature about validity is that the assessment tool itself does not hold validity. Validity lives in our interpretations, in our judgement. For example, my sons are sitting an end-of-year math exam that only covers content taught in the last three months. Any argument or judgement made about my sons’ understanding of the year’s content would have low validity.

There are three types of evidence used to judge validity: content-related, criterion-related, and construct-related. Popham (2018) and Hickey (2014) both argue that content-related validity is most important for classroom teachers. Content-related validity is the “extent to which the content of the test matches the instructional objectives” (Florida Center for Instructional Technology). This is an idea we’ve seen repeatedly when discussing assessment – as teachers we must ensure that the assessment we use or design is aligned with the content we’ve taught.
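To make the idea of alignment concrete, here is a minimal sketch of how a teacher might check test items against the objectives taught. The objectives, the item labels and the `content_alignment` helper are all hypothetical and invented for illustration; this is not a procedure taken from Popham or Hickey.

```python
# Hypothetical data: objectives taught in the unit, and the objective
# each test item is meant to assess.
taught_objectives = {"fractions", "decimals", "area", "perimeter"}
item_objectives = {
    "Q1": "fractions",
    "Q2": "fractions",
    "Q3": "area",
    "Q4": "probability",  # not taught in this unit
}


def content_alignment(taught, items):
    """Summarise how well the test items line up with the taught objectives."""
    assessed = set(items.values())
    covered = taught & assessed
    return {
        # Share of taught objectives that at least one item assesses.
        "coverage": len(covered) / len(taught),
        # Taught objectives that no item assesses.
        "untested_objectives": sorted(taught - assessed),
        # Items that assess something that was not taught.
        "off_target_items": sorted(
            item for item, objective in items.items() if objective not in taught
        ),
    }


report = content_alignment(taught_objectives, item_objectives)
print(report)
```

A low coverage figure or a list of off-target items does not settle anything on its own, but it points to where a judgement about the whole unit would rest on thin evidence.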


Unlike validity, reliability does live within the assessment tool. Reliability is the degree to which the assessment tool produces consistent results. A classic example is that of your kitchen scale. If you weigh a bag of potatoes in the morning and the scale registers 5 kg, and then you weigh the exact same bag again the next day, a reliable scale would give you the same result. Consistency alone is not enough, though: how the results are used still matters. For example, if you administer a reading assessment and then use the data to judge a student’s spelling, the scores may be perfectly consistent, but the judgement is not sound.

Reliability is measured by correlating the results of one version or running of the assessment with a second version or running. Correlations above 0.8 are said to be very reliable. A correlation below 0.5 is considered unreliable.
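As a rough illustration of that correlation, the sketch below computes a Pearson correlation between two runs of the same assessment and applies the thresholds mentioned above. The scores are made up, the function names are my own, and the "in between" label for values from 0.5 to 0.8 is an assumption, since the text only names the two ends of the range.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 scores each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return covariance / sqrt(spread_x * spread_y)


def describe_reliability(r):
    """Apply the thresholds from the text; the middle label is an assumption."""
    if r > 0.8:
        return "very reliable"
    if r < 0.5:
        return "unreliable"
    return "in between"


# Hypothetical scores for the same six students on two runs of one assessment.
first_run = [12, 15, 9, 18, 14, 7]
second_run = [13, 14, 10, 17, 15, 8]

r = pearson(first_run, second_run)
print(round(r, 2), describe_reliability(r))
```

With only six students the figure is very unstable, so treat a calculation like this as a prompt for discussion rather than a verdict on the assessment.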

For an assessment to be valid, it must be reliable. Yet you can have a reliable assessment and still interpret the results invalidly.

The ideas of validity and reliability raise interesting challenges. Designing a reliable assessment to measure factual knowledge is rather straightforward. However, the PYP insists that “assessment practices are formed around conceptual learning” (International Baccalaureate, 2018, 8). How do we design reliable assessments that measure conceptual learning? Additionally, learning through inquiry means that different students can end up with different amounts and types of knowledge. Do we assess to ensure they have learned the basic knowledge and skills of the unit, or do we want to know about everything that they have learned? And how do we make sure our inferences are valid when the potential outcomes can be varied?


Fairness is ensuring there is no bias in the assessment that will advantage or disadvantage one student over another based on religion, language, gender, or ethnicity. It is considered as important as validity and reliability, even though it is a late addition to the league tables!

There are two types of evidence that can be gathered to prove that an assessment is fair:

Judgmental evidence: this is examining the assessment’s internal qualities, such as the instructions, images, questions, resources available or test items. A group of people who are likely to be able to spot internal bias, such as your EAL teacher, SEN teacher, minority culture teachers and members of both genders, can review the assessment, identify any problematic questions and then rework them to remove the bias. Popham (2018) recommends using the question, “Might this item offend or unfairly penalize any group of students because of personal characteristics such as gender, ethnicity, religion, or race?” (59). The reviewers respond to each item with yes, no or unsure. Items that receive a “no” are safe; items that receive a “yes” or an “unsure” are reviewed and then eliminated or rewritten to remove the bias.

For example, the reviewers identify that the word problems on a math assessment use language that is above the reading age of the students. Clearly these problems need to be rewritten as they will unfairly penalise the students.

Popham (2018) contends that going through this procedure, even with simple class-based tests, helps teachers become more aware of unintended bias in their teaching and assessing (60). We recently developed a math assessment and one of the teachers used British pounds as the monetary system for a word problem. We decided we needed to change the quantities and symbol to Mauritian rupees, as our curriculum deals with rupees and the question would be unfair to any student who had no experience with British pounds.
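The yes / no / unsure review described above can be tallied very simply. In this sketch the items and the reviewers' answers are hypothetical, and the rule that a single "yes" or "unsure" from any reviewer is enough to flag an item is my own assumption about how to combine several reviewers' responses.

```python
ALLOWED_ANSWERS = {"yes", "no", "unsure"}

# Hypothetical answers from three reviewers to the bias question,
# recorded for each item on the assessment.
reviews = {
    "Q1": ["no", "no", "no"],
    "Q2": ["no", "unsure", "no"],
    "Q3": ["yes", "no", "no"],
}


def items_needing_review(reviews):
    """Return the items where at least one reviewer did not answer 'no'."""
    flagged = []
    for item, answers in reviews.items():
        unexpected = set(answers) - ALLOWED_ANSWERS
        if unexpected:
            raise ValueError(f"unexpected answers for {item}: {sorted(unexpected)}")
        if any(answer != "no" for answer in answers):
            flagged.append(item)
    return sorted(flagged)


flagged_items = items_needing_review(reviews)
print(flagged_items)  # ['Q2', 'Q3']
```

The flagged items are the ones the group would go back to and either rewrite or drop.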

Empirical evidence: This approach requires teachers to compare results of assessments with demographic data, such as gender or language. If 80% of native speakers were able to answer question 4 on the Year 4 math assessment correctly, but only 14% of EAL (English as an Additional Language) students were able to answer question 4 correctly, we would need to look closely at question 4. Perhaps the description of the task or the language of the question is too complex for EAL students to access.

However, it could also be that the poor performance of the EAL students has to do with how this content was taught to them. Perhaps the language used by the teacher was too complex throughout the unit and the EAL students were unable to grasp the nuances needed. This is why it is important to use both types of evidence to determine the fairness of an assessment, where possible.

One of the challenges in using empirical evidence, though, is that for the results to be indicative you need a large sample size of at least 100 students. While this is possible in some international schools, for those of us in smaller situations, Popham (2018) recommends sticking to judgmental evidence (63).
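The comparison described under empirical evidence can be sketched as a small calculation. The counts below are hypothetical (chosen to reproduce the 80% and 14% figures in the example), the 0.30 gap threshold is an arbitrary assumption of mine, and the minimum of 100 students follows the sample size mentioned above.

```python
GAP_THRESHOLD = 0.30  # assumed cut-off for "look more closely", not from Popham
MIN_STUDENTS = 100    # sample size mentioned in the text

# Hypothetical results: (number correct, number of students) per group and item.
results = {
    "Q4": {"native": (40, 50), "EAL": (7, 50)},
    "Q5": {"native": (35, 50), "EAL": (33, 50)},
}


def flag_items(results, gap_threshold=GAP_THRESHOLD, min_students=MIN_STUDENTS):
    """Flag items where the two groups' success rates differ by a wide margin."""
    flagged = []
    for item, groups in results.items():
        correct_native, total_native = groups["native"]
        correct_eal, total_eal = groups["EAL"]
        if total_native + total_eal < min_students:
            continue  # too few students for the comparison to be indicative
        gap = abs(correct_native / total_native - correct_eal / total_eal)
        if gap >= gap_threshold:
            flagged.append(item)
    return flagged


flagged_questions = flag_items(results)
print(flagged_questions)  # ['Q4']
```

A flag here only says that question 4 deserves a closer look. As noted above, the gap could come from the wording of the question or from how the content was taught, which is why judgmental evidence is still needed alongside it.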

What this means for me:

The message is clear: as teachers we must develop assessment literacy. Without it, we are unable to help our students achieve the best educational outcomes possible and if there is one thing I know about teachers, it is that they truly do care about doing the best for their students.

That said, assessment literacy is complex and layered. As Brown (2016) stated, you can be literate in the sense that you can read a book, and you can be literate in the sense that you can analyse a great work of literature. Assessment literacy is the same, and I have tried in the past couple of posts to include a dummy’s ‘reading-level’ review of those concepts that are most pressing for class teachers to know and understand to be able to engage in effective assessment for learning.

There is a whole other level of assessment literacy which is needed for school leaders to make sense of what Popham (2009) calls accountability assessment. I think this level of literacy would also be needed for any group that is going to engage in programme evaluation at their school or has to process standardised assessment data. As such, this means I have much further to go. At some point, I will figure out how to make sense of our International Schools Assessment (ISA) reports!

Much of the literature I looked at about these concepts was focused on tests – standardised and class-based. While we do use tests in PYP schools, they are only one of a number of assessment tools. As a school we certainly have more work to do in terms of learning how to develop and interpret more reliable and valid assessments, particularly assessments that measure conceptual understanding and skills.

I think it is important for us to also consider how external forces can impact on validity and reliability. For example, a student with low self-efficacy might be convinced s/he will fail the assessment. This belief impacts her/his ability to perform well, rendering the results unreliable as they do not accurately measure what the student really knows. How then do we find out about what this student understands?

If your school does a lot of testing, I highly recommend you read James Popham’s book Assessment Literacy for Educators in a Hurry. The language is engaging and he goes into much more detail about how to determine the validity, reliability and fairness of large-scale assessments.

My next post: Teachers’ Beliefs and Assessment Capabilities


Hickey, D. [BigOpen OnlineClasses]. (2014, June 15). Validity in Classroom Assessment [Video File]. Retrieved from

International Baccalaureate Organisation (2018). Programme Standards and Practices. Retrieved from 

Popham, W.J. (2018). Assessment Literacy for Educators in a Hurry. Alexandria, VA: ASCD.

Popham, W.J. (2009). Assessment Literacy for Teachers: Faddish or Fundamental? Theory Into Practice, 48, 4-11.

Florida Center for Instructional Technology (n.d.). Classroom Assessment. [Website]. Retrieved from
