An evaluation instrument or assessment measure, such as a test or quiz, is said to have evidence of validity if its results support inferences about a specific group of people or for a specific purpose. This means that if a particular computational thinking test is designed, tested, and validated with 7th grade students, it is not automatically valid for high school students. Without additional validity testing with the high school student population, we cannot make that jump. The specific sub-group of the population used to validate the instrument may also be important—for instance, an instrument measuring motivation to learn CS among 7th grade students who are primarily White may not be valid for use with 7th grade students who are primarily African American.
The same applies to specific purposes. If an instrument is designed, tested, and validated with 7th grade students to measure self-efficacy with respect to learning programming, the evidence of validity may differ if the same students were given the instrument to measure self-efficacy with respect to learning science. There may be correlations, but without further investigation of the instrument, its validity in the other subject (science) cannot be assumed.
Face validity and criterion validity are the most commonly used forms of validity testing for evaluation instruments in education, but there are many options to consider. Validity is more subjective than reliability, and there is no single method of “proving” validity; we can only gather evidence of validity. You can obtain deeper levels of evidence by evaluating the instrument through different measures (described below), but you can only do so much with limited resources. Researchers are not typically expected to go to extraordinary lengths to demonstrate evidence of validity; however, those with more extensive time and resources may.
Face validity (or Content-related Evidence)
Face validity is simply that: on its surface (or face), as reviewed by content specialists, are the items and questions on the instrument appropriate? An instrument “passes” face validity if, in the experts' judgment, the items taken as a whole are:
- Representative of the sample of the content being assessed,
- In a format appropriate for using the instrument, and
- Designed appropriately for the target population.
For example, if an instrument measures the cognitive load of 4th grade students learning computational thinking through a tool such as Scratch, its content should include a variety of questions designed to measure cognitive load across the targeted goals (such as variable assignments and mathematical expressions). It may also include images of Scratch blocks to contextualize the questions. The wording of the questions and the appropriateness of the answer choices (perhaps multiple choice) should all be designed at the 4th grade level.
Measuring face validity can be done by the researchers creating the survey. However, given the inherent biases of experts when evaluating their own work, this carries more meaning if the instrument is evaluated by other experts whenever possible.
Another method of assessing the face validity of an instrument involves conducting interviews with individuals who are similar to those whom you are researching. By asking these individuals to review the items on your survey or interview protocol and answer them as if they were participating, they can give an indication of which items make sense to them and which are worded poorly or may be measuring something other than what the investigator believed.
Criterion-Related (or Concrete) Validity
Criterion-related validity evaluates the extent to which the instrument, or the constructs in the instrument, predicts a variable designated as a criterion (the outcome). Concurrent validity and predictive validity are forms of criterion validity. This form of validity is related to external validity, discussed in the next section.
- Predictive Validity – This measures how well the instrument measures a variable that can be used to predict a future, related variable. For example, an instrument measuring 11th grade students' interest in computer science careers might be used to predict whether those students will go on to major in computer science in college.
- Concurrent Validity – This concrete validity measure defines how well the instrument measures against existing instruments/tests. In this case, the results of using the instrument may be compared against similar instruments in related fields, in the same field, or against similar instruments targeting other population groups. For example, if your instrument measures student attitudes toward computers, you may want to concurrently (at the same time) administer a second instrument that has already been shown to measure student attitudes toward computers. If your instrument's results correlate strongly with the established instrument's results, then your instrument has evidence of concurrent validity.
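As a sketch of how a concurrent validity check might look in practice, the snippet below computes the Pearson correlation between scores from a new instrument and an established one administered to the same students. All scores here are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for the same eight students on two instruments:
# the new attitude instrument and a previously validated one.
new_scores = [72, 85, 64, 90, 78, 69, 88, 75]
established_scores = [70, 82, 60, 93, 80, 65, 85, 74]

# A strong positive correlation is evidence of concurrent validity.
r = pearson(new_scores, established_scores)
print(f"Pearson r = {r:.2f}")
```

In a real study you would also report a significance test for the correlation and use samples far larger than eight students.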
Internal and External Validity
Internal validity is the extent to which results of the instrument can be generalized in the same population group and context. External validity, on the other hand, is the extent to which the results of the instrument can be generalized to other population groups, other fields, other situations, and other interventions. That is, can the results from participants in a controlled experiment be generalized to the population outside of the controlled experiment?
- Population Validity – Determines to what extent the sample population can be extrapolated to the general population. It provides evidence about how findings from a study can be generalized to a broader population, which has much to do with scope and sampling. To establish this type of validity, you must think carefully about the broader population to which you would like to generalize the findings.
- Ecological Validity – Determines the extent to which the tests conducted in a controlled setting can be generalized to settings outside of these—that is, to the behavior one might experience in the world.
As an example of Population Validity, suppose you are interested in understanding how participating in the Exploring Computer Science (ECS) curriculum affects the likelihood that 9th grade students will participate in future CS courses in high school. In order to be able to make the case that one is obtaining an accurate picture of 9th grade student behavior in general at a particular school, it would be preferable to randomly assign students to participate in the course, rather than allowing them to sign up voluntarily. In order to be able to generalize findings to the broader population of 9th grade students in a city or state, the same experiment would need to be carried out (using random assignment) at multiple schools, ideally also chosen at random.
An example of an Ecological Validity issue would be the development of a test for assessing high school graduates’ skill in computational thinking. If you designed such a test, you might like to know how students who took it subsequently fare in other computing-related ventures (either in a career or college setting). Establishing ecological validity would require a long-term study of a set of high school graduates who had taken the test, examining their college and/or career performance in CS-related tasks and coursework to determine if there is a strong relationship between their performance on the exam and their “real world” performance in CS.
Construct Validity
Construct validity is the extent to which an evaluation instrument measures what it claims to measure (Brown, 1996; Cronbach and Meehl, 1955; Polit and Beck, 2012). Correlation and cause and effect are both taken into consideration, and the study design is evaluated for how well it models and reflects real-world situations. In this measure of validity, the quality of the instrument and/or experimental design as a whole is assessed.
- Convergent Validity – measures the extent to which constructs that are expected to correlate actually do correlate.
- Discriminant Validity – measures the extent to which constructs that are not expected to correlate actually do not correlate.
Convergent and discriminant validity combined are used to measure construct validity. The Multitrait-Multimethod Matrix (MTMM) is a formal method of measuring these two validity concepts together to determine the extent to which the instrument meets convergent and discriminant validity measures. It uses multiple methods (e.g., observation/fieldwork, questionnaires, journals, quizzes/tests) to measure and triangulate the results of the same set of variables through the demonstration of correlations within a matrix.
The notions of convergent and discriminant validity are important to consider when trying to establish construct validity. For example, to help establish the convergent validity of a question battery focused on motivation to participate in studying CS, one might expect a reasonably high correlation with a battery of questions focused on intention to persist in CS and take other courses. Conversely, when thinking about discriminant validity, one might attempt to correlate this motivation to participate in CS with a question battery that measures classroom disengagement (if the motivation battery is truly measuring what is intended, it would likely have an inverse relationship with disengagement).
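One way to see convergent and discriminant validity at work is to correlate battery scores directly. The sketch below uses invented per-student scores for three hypothetical batteries; motivation should correlate positively with persistence (convergent) and negatively with disengagement (discriminant).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                           * sum((y - mean_y) ** 2 for y in ys))

# Invented mean battery scores (1-5 scale) for six students.
motivation    = [4.2, 3.1, 4.8, 2.5, 3.9, 4.5]
persistence   = [4.0, 3.3, 4.6, 2.8, 3.7, 4.4]
disengagement = [1.8, 3.0, 1.5, 3.6, 2.2, 1.7]

r_convergent = pearson(motivation, persistence)      # expect: strongly positive
r_discriminant = pearson(motivation, disengagement)  # expect: strongly negative
print(f"convergent r = {r_convergent:.2f}, "
      f"discriminant r = {r_discriminant:.2f}")
```

A full MTMM analysis would extend this to a matrix of every trait measured by every method, but the basic building block is the same pairwise correlation.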
Formative and Summative Validity
Formative validity refers to the effectiveness of the instrument in providing information that can be used to improve variables that influence the results of the study. That is, if the instrument is designed to measure weaknesses in learning conditional statements in a way that identifies specific changes that can be made in a curriculum, then its formative validity can be said to be adequate.
Summative validity (Lee & Hubona, 2009) measures the extent to which the instrument lends itself to theory-building, or the end result of the process of creating the theory through the results of the instrument. This can be tested empirically via “modus tollens” logic.
Sampling Validity
Not to be confused with population validity, sampling validity measures the extent to which the most important items are chosen according to a defined sampling method. The sampling should align with the goals of the study to ensure that the variables that need to be measured are, in fact, posed as questions in some way in the instrument. For example, when measuring teaching style among a population of pre-service computer science teachers, evaluating only the knowledge of the pre-service teachers would not be sufficient if the intent of your study is to measure how successful those teachers will be in teaching computer science concepts to middle school students. You may also need to include questions related to pedagogy, assessment, and self-efficacy.
Conclusion Validity
Conclusion validity is the “degree to which conclusions we reach about relationships in our data are reasonable” (Trochim, 2006). Suppose you want to measure the relationship between teachers' self-efficacy in programming and their students' ability to learn programming (academic achievement). The relationship could be positive, where high self-efficacy correlates with high student achievement; it could be negative, where low self-efficacy correlates with high student achievement; or it could be neutral, where self-efficacy has no relationship to student achievement. The degree to which conclusions about the relationship can be stated is referred to as conclusion validity.
There are a number of reasons why it may be difficult to effectively reach a conclusion. This may be because your measures have low levels of reliability, because the relationship is weak in general, or because you did not collect enough data (or the right kind of data). Thus, an important consideration for conclusion validity relates to good statistical power, good reliability, and good implementation. Statistical power should be at 0.80 or greater and care should be given in advance to determine the appropriate sample size, effect size, and alpha levels. Reliability can be improved by carefully constructing your instrument, by increasing the number of questions in it, and by reducing distractions within the measurement.
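As a rough sketch of the sample-size planning mentioned above, the normal-approximation formula for a two-group comparison gives the number of participants needed per group for a chosen effect size, alpha level, and power of 0.80. The effect size d = 0.5 used here is an illustrative assumption, not a recommendation.

```python
from math import ceil
from statistics import NormalDist

alpha = 0.05   # significance level
power = 0.80   # desired statistical power
d = 0.5        # assumed standardized effect size (Cohen's d), for illustration

# Normal approximation: n per group = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2.
# (An exact t-test calculation gives a slightly larger n.)
z = NormalDist().inv_cdf
n_per_group = ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)
print(n_per_group)  # 63 participants per group under these assumptions
```

Running this before data collection makes the power consideration concrete: detecting a smaller effect (say d = 0.2) at the same power would require several times as many students per group.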
Cite this page
To cite this page, please use:
McGill, Monica M. and Xavier, Jeffrey. 2019. Measuring Reliability and Validity. Retrieved from https://csedresearch.org
Brown, J. D. (1996). Testing in language programs. Upper Saddle River, NJ: Prentice Hall Regents.
Creswell, J. (2008). Educational Research: Planning, Conducting, and Evaluating Quantitative and Qualitative Research. Upper Saddle River, New Jersey, USA: Pearson Education, Inc.
Cronbach, L. J.; Meehl, P.E. (1955). “Construct Validity in Psychological Tests”. Psychological Bulletin. 52 (4): 281–302. doi:10.1037/h0040957. PMID 13245896.
Lee, A.S., Hubona, G.S. (2009). A scientific basis for rigor in Information Systems research. MIS Quarterly, 33(2), 237-262.
Polit, D.F., Beck, C.T. (2012). Nursing Research: Generating and Assessing Evidence for Nursing Practice, 9th ed. Philadelphia, USA: Wolters Klower Health, Lippincott Williams and Wilkins.
Schmuckler, M. A. (2001). What is ecological validity? A dimensional analysis. Infancy, 2(4), 419-436.
Trochim, W. (2006). Web Center for Social Research Methods. Available online at https://www.socialresearchmethods.net