An evaluation instrument or assessment measure, such as a test or quiz, is said to be valid if its results support inferences about a specific group of people or for specific purposes. This means that if a particular computational thinking test is designed, tested, and validated with 7th grade students, it is not automatically valid for high school students. Without additional validity testing with the high school student population, we cannot make that jump. The specific sub-group of the population used to validate the instrument may also be important. For instance, an instrument measuring motivation to learn CS among 7th grade students who are primarily White may not be valid for use with 7th grade students who are primarily African American.

The same applies to the specific purposes for which the results are used. If an instrument is designed, tested, and validated with 7th grade students in order to measure self-efficacy with respect to learning programming, it may not be valid for measuring self-efficacy with respect to learning science. There may be correlations, but without further testing of the measure, its validity in the other subject (science) cannot be assumed.

Face validity and criterion validity are the most commonly used forms of validity testing for evaluation instruments in education, but there are many options to consider. Validity is somewhat more subjective than reliability, and there is no single method of “proving” it. You can obtain stronger evidence by evaluating validity through several of the measures described below, but you can only do so much with limited resources. Researchers are not typically expected to go to extraordinary lengths to demonstrate validity; however, those with more time and resources may do so.

Face validity (or Content-related Evidence)

Face validity asks a simple question: on its surface (or face), as reviewed by content specialists, are the items and questions on the instrument appropriate? An instrument “passes” face validity if, in the judgment of the experts, its items taken as a whole are:

  • A representative sample of the content being assessed,
  • In a format appropriate for using the instrument, and
  • Designed appropriately for the target population.

For example, if the instrument measures the cognitive load of 4th grade students learning computational thinking through a tool such as Scratch, its content should include a variety of questions designed to measure cognitive load across the targeted goals (such as variable assignments and mathematical expressions). It may also include images of Scratch blocks to contextualize the questions. The wording of the questions and the format of the answers (perhaps multiple choice) should all be designed at the 4th grade level.

Face validity can be assessed by the researchers who create the survey. However, given the inherent biases of even experts when evaluating their own work, the assessment carries more weight if the instrument is evaluated by other experts whenever possible.

Another method of assessing the face validity of an instrument involves conducting interviews with individuals who are similar to those whom you are researching. When asked to review the items on your survey or interview protocol and answer them as if they were participating, these individuals can indicate which items make sense to them and which are worded poorly or may be measuring something other than what the investigator intended.

Criterion-Related (or Concrete) Validity

Criterion-related validity evaluates the extent to which the instrument, or the constructs in the instrument, predict a variable that is designated as a criterion (or outcome). Concurrent validity and predictive validity are forms of criterion validity. This form of validity is related to external validity, discussed in the next section.

  • Predictive Validity – This measures how well the results of the instrument predict a related variable measured in the future. For example, a measure of the interest of 11th grade students in computer science careers may be used to predict whether those students will pursue computer science as a major in college.
  • Concurrent Validity – This form of criterion validity describes how well the results of the instrument agree with those of existing instruments or tests. The results may be compared against similar instruments in the same field, in related fields, or targeting other population groups. For example, if your instrument measures the attitudes of students toward computers, then you may want to administer, concurrently (at the same time), a second instrument that has already been shown to measure student attitudes toward computers. If your instrument’s results are statistically comparable to or better than the other instrument’s results, then your instrument has demonstrated concurrent validity.
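
One common way to quantify the comparison described for concurrent validity is a correlation coefficient between the two sets of scores. The sketch below assumes a Pearson correlation is an appropriate statistic for the data, which the text above does not specify. The function name `pearson_r` and the scores are hypothetical and made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / sqrt(var_x * var_y)


# Made-up scores for the same eight students on both instruments.
new_instrument = [12, 15, 9, 20, 17, 11, 14, 18]
established_instrument = [30, 36, 25, 45, 40, 28, 33, 41]

r = pearson_r(new_instrument, established_instrument)
print(f"Correlation between the two instruments: r = {r:.2f}")
```

A coefficient close to 1 would suggest that the two instruments rank students similarly. The same calculation could be applied to predictive validity by pairing the instrument's scores with an outcome measured later.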

Internal and External Validity

Internal validity is the extent to which results of the instrument can be generalized in the same population group and context. External validity, on the other hand, is the extent to which the results of the instrument can be generalized to other population groups, other fields, other situations, and other interventions. That is, can the results from participants in a controlled experiment be generalized to the population outside of the controlled experiment?

  • Population Validity – Determines the extent to which results from the sample can be extrapolated to the general population.
  • Ecological Validity – Determines the extent to which the results of tests conducted in a controlled setting can be generalized to settings outside of it, that is, to the behavior one might observe in the real world.

Construct Validity

Construct validity is the extent to which an evaluation instrument measures what it claims to measure (Brown, 1996; Cronbach and Meehl, 1955; Polit and Beck, 2012). Correlation and cause and effect are both taken into consideration, and the study design is evaluated for how well it models and reflects real-world situations. In this measure of validity, the quality of the instrument and/or the experimental design as a whole is measured.

  • Convergent Validity – measures the extent to which constructs that are expected to correlate actually do correlate.
  • Discriminant Validity – measures the extent to which constructs that are not expected to correlate actually do not correlate.

Convergent and discriminant validity are used together to measure construct validity. The Multitrait-Multimethod Matrix (MTMM) is a formal method of assessing these two validity concepts together to determine the extent to which the instrument meets both. It uses multiple methods (e.g., observation/fieldwork, questionnaires, journals, quizzes/tests) to measure the same set of variables and triangulates the results by examining the correlations within a matrix.
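
The idea behind convergent and discriminant validity can be sketched with a small set of pairwise correlations. This is a simplified illustration, not a full MTMM analysis, which would cross several traits with several methods. The measure names and scores are hypothetical, and the use of a Pearson correlation is an assumption.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up scores for six students, for illustration only.
measures = {
    "self_efficacy_survey": [1, 2, 3, 4, 5, 6],
    "self_efficacy_interview": [2, 1, 4, 3, 6, 5],  # same construct, different method
    "unrelated_construct": [3, 4, 6, 1, 2, 5],      # not expected to correlate
}

# Correlate every pair of measures.
correlations = {
    (a, b): pearson_r(measures[a], measures[b])
    for a, b in combinations(measures, 2)
}

for (a, b), r in correlations.items():
    print(f"{a} vs {b}: r = {r:+.2f}")
```

In this made-up data, the two self-efficacy measures correlate strongly (evidence of convergent validity), while each correlates only weakly with the unrelated construct (evidence of discriminant validity).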

Formative and Summative Validity

Formative validity refers to how effectively the instrument provides information that can be used to improve the variables that influence the results of the study. That is, if the instrument is designed to measure weaknesses in learning conditional statements in a way that identifies specific changes that can be made to a curriculum, then its formative validity can be said to be adequate.

Summative validity (Lee & Hubona, 2009) measures the extent to which the instrument lends itself to theory-building, that is, to the end result of the process of creating a theory from the results of the instrument. This can be tested empirically using “modus tollens” logic.

Sampling Validity

Not to be confused with population validity, sampling validity measures the extent to which the most important items are chosen using a defined sampling method. The method should align with the goals of the study to ensure that the variables that need to be measured are in fact posed as questions in some way in the instrument. For example, when studying a population of pre-service computer science teachers, evaluating only their content knowledge would not be sufficient if the intent of your study was to measure how successful the group will be in teaching computer science concepts to middle school students. You may also need to include questions related to pedagogy, assessment, and self-efficacy.

Conclusion Validity

Conclusion validity is the “degree to which conclusions we reach about relationships in our data are reasonable” (Trochim, 2006). Suppose you want to measure the relationship between teacher self-efficacy in programming and the ability of their students to learn programming (academic achievement). The relationship could be positive, where high self-efficacy and high student achievement are correlated; it could be negative, where low self-efficacy and high student achievement are correlated; or it could be neutral, where self-efficacy has no relationship to student achievement. The degree to which the conclusions about the relationship can be stated is referred to as conclusion validity.

Conclusion validity is measured through good statistical power, good reliability, and good implementation. Statistical power should be at 0.80 or greater and care should be given in advance to determine the appropriate sample size, effect size, and alpha levels. Reliability can be improved by carefully constructing your instrument, by increasing the number of questions in it, and by reducing distractions within the measurement.
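
The planning steps mentioned above can be sketched in a few lines. The first function estimates the sample size needed to detect a correlation at a given power using the Fisher z-transformation approximation. The second applies the Spearman-Brown formula to predict how reliability changes when the number of comparable items is increased. Both techniques are assumptions chosen for illustration and are not named in the text above, and the function names and input values are hypothetical.

```python
from math import atanh, ceil
from statistics import NormalDist


def sample_size_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate n needed to detect correlation r with a two-sided test.

    Uses the Fisher z-transformation approximation.
    """
    if not 0 < abs(r) < 1:
        raise ValueError("r must be nonzero and strictly between -1 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    return ceil(((z_alpha + z_power) / atanh(abs(r))) ** 2 + 3)


def spearman_brown(reliability, length_factor):
    """Predicted reliability after multiplying the item count by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Expected effect size r = 0.30, alpha = 0.05, power = 0.80.
print(sample_size_for_correlation(0.30))

# Doubling the length of an instrument whose current reliability is 0.60.
print(round(spearman_brown(0.60, 2), 2))
```

Under this approximation, smaller expected effect sizes require substantially larger samples, which is why sample size, effect size, and alpha should be considered before data collection. The Spearman-Brown prediction assumes the added questions are comparable in quality to the existing ones.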

Cite this page

To cite this page, please use:

McGill, Monica M. and Xavier, Jeffrey. 2019. Measuring Reliability and Validity. Retrieved from


Brown, J. D. (1996). Testing in language programs. Upper Saddle River, NJ: Prentice Hall Regents.

Creswell, J. (2008). Educational Research: Planning, Conducting, and Evaluating Quantitative and Qualitative Research. Upper Saddle River, New Jersey, USA: Pearson Education, Inc.

Cronbach, L. J., Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281-302. doi:10.1037/h0040957. PMID 13245896.

Lee, A.S., Hubona, G.S. (2009). A scientific basis for rigor in Information Systems research. MIS Quarterly, 33(2), 237-262.

Polit, D.F., Beck, C.T. (2012). Nursing Research: Generating and Assessing Evidence for Nursing Practice, 9th ed. Philadelphia, USA: Wolters Kluwer Health, Lippincott Williams and Wilkins.

Schmuckler, M. A. (2001). What is ecological validity? A dimensional analysis. Infancy, 2(4), 419-436.

Trochim, W. (2006). Web Center for Social Research Methods. Available online at