
Validity in Educational Research: A Deeper Examination

Presented by Joe Tise, PhD, Senior Education Researcher at IACE

The concept of validity, including validity in educational research, has evolved over millennia. Some of the earliest examples of how validity influenced society at scale come from the ancient Chinese Civil Service exam programs (Suen & Yu, 2006). Back in 195 BCE, Emperor Liu Bang decreed that men of virtue and ability ought to serve the nation in various capacities. With that decree, Liu Bang established the “target” for assessment against which the effectiveness of various assessments would be judged: assessments that could more readily select men of virtue and ability would be considered more useful than those that could not. Over the next 2,000 years, the Chinese Civil Service exam took many forms: expert observational ratings, writing poetry, drafting philosophical essays, and portfolio submissions, among other variations.

The concept of validity has also been deliberated in Western cultures since at least the early 1900s. Early conceptions viewed validity merely as a statistical coefficient, such that even a simple correlation coefficient could establish validity (Bingham, 1937; Hull, 1928). You need only examine the numerous spurious correlations demonstrated by Tyler Vigen to see that a simple correlation cannot sufficiently capture the true nature of validity. Fortunately, this reductive view of validity did not persist long; it was soon overtaken by a precursor of present-day conceptions. By the 1950s, psychometricians had detailed multiple “types” of validity, such as predictive, concurrent, construct, and content validity (American Psychological Association, 1954).

Still, this conception of validity overlooked the unitary nature of validity. Philosophical deliberations largely concluded in 1985, when the Joint Committee for the Revision of the Standards for Educational and Psychological Testing (the Joint Committee, composed of the APA, AERA, and NCME) established validity as one unitary construct that does not have multiple “types” but rather multiple sources of validity evidence (American Educational Research Association et al., 1985). In 2014, the Joint Committee further clarified that validity refers to “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” (American Educational Research Association et al., 2014). Thus, “validity” applies to our interpretations and uses of tests, not to the test itself.

The contemporary view of validity details five primary sources of validity evidence, presented in the table below (in no particular order).

Source of Validity Evidence | Description | Example of How to Assess
Test content | Themes, wording, and format of the items, tasks, or questions on a test | Expert judgment; Delphi method; content specifications
Response processes | The extent to which test takers actually engage in the cognitive/behavioral processes the test intends to assess | Analyze individual test responses via interviews; examine early drafts of a response (e.g., a written response or math problem); eye tracking
Test structure | The degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based | Exploratory factor analysis; confirmatory factor analysis; differential item functioning analysis
Relations to other variables | The degree to which scores on the test relate in the expected way to other variables and/or other tests of the same construct | Correlate test scores with other variables; regress test scores on other variables (see the sketch below the table)
Test consequences | The acceptability of intended and unintended consequences that follow from interpretations and uses of the test scores | Does the test incentivize cheating? Are certain subgroups systematically harmed by the test?
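
As a concrete illustration of the “relations to other variables” source, the short sketch below computes convergent and discriminant correlations for a new test. It is only a sketch: the file name and column names (new_test, established_test, unrelated_variable) are hypothetical, and real validation work would involve far more than two correlations.

```python
# Sketch: "relations to other variables" validity evidence for a new test.
# The file name and column names are hypothetical.
import pandas as pd
from scipy import stats

df = pd.read_csv("scores.csv")  # one row per student (hypothetical data)

# Convergent evidence: the new test should correlate strongly with an
# established test of the same construct.
r_conv, p_conv = stats.pearsonr(df["new_test"], df["established_test"])

# Discriminant evidence: it should correlate weakly with a variable the
# construct is not expected to relate to.
r_disc, p_disc = stats.pearsonr(df["new_test"], df["unrelated_variable"])

print(f"Convergent evidence:   r = {r_conv:.2f} (p = {p_conv:.3f})")
print(f"Discriminant evidence: r = {r_disc:.2f} (p = {p_disc:.3f})")
```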

I will note here that although there are multiple sources of validity evidence, a researcher does not necessarily have to provide evidence from all sources for a given test or measure. Some sources of validity evidence will not be relevant to a test.

For example, evidence of test structure is largely nonsensical for a simple one- or two-item “test” of a person’s experience with, say, teaching computer science (e.g., asking someone how many years of experience they have teaching computer science). Similarly, a measure of someone’s interest in computer science likely does not require evidence based on response processes, because reporting interest in a topic does not really invoke any cognitive or behavioral processes (other than reflection/recall, which is not the target of the test).

References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Author.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. American Educational Research Association.

American Psychological Association. (1954). Technical recommendations for psychological tests and diagnostic techniques. Psychological Bulletin, 51(2), 1–38. https://doi.org/10.1037/h0053479

Bingham, W. V. (1937). Aptitudes and aptitude testing. Harpers.

Hull, C. L. (1928). Aptitude testing (L. M. Terman, Ed.). World Book Company.

Suen, H. K., & Yu, L. (2006). Chronic consequences of high-stakes testing? Lessons from the Chinese Civil Service Exam. Comparative Education Review, 50(1), 46–65. https://doi.org/10.1086/498328

Demystifying Reliability and Validity in Educational Research

Post prepared and written by Joe Tise, PhD, Senior Education Researcher

In the past, reliability and validity may have been explained to you by way of an analogy: validity refers to how close to the “bullseye” you can get on a dart board, while reliability is how consistently you throw your darts in the same spot (see figure below).

Figure: Four images of arrows and targets representing the combinations of high and low reliability and validity.

Such an analogy is largely useful but somewhat reductive. In this four-part blog series, I will dig a bit deeper into validity and reliability to show the different types of each, the different conceptualizations of each, and the relations between them. The structure and content of this blog post come largely from the Standards for Educational and Psychological Testing (2014), so I highly recommend you get a copy of that book to learn more.

Validity

The Standards (American Educational Research Association et al., 2014) define validity as “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests.” This definition leads me to make an important distinction at the outset: a test is never valid or invalid; rather, it is the interpretations and uses of the test, and the decisions made on the basis of it, that are valid or invalid.

To illustrate this point, consider the following scenario. I want to measure students’ reading fluency. I dig into a big pile of data I collected from thousands of K12 students and see that taller students can read longer and more complex books than shorter students. I say to myself:
“Great! To assess new students’ reading fluency, all I need to do is measure how tall they are. Taller students are better readers, after all. Thus, a measure of students’ height must be a valid test of reading fluency.”

Of course, you likely see a problem with my logic. Height may well be correlated with reading fluency (because older children tend to be taller and better readers than younger children), but clearly it is not a test of reading fluency. Nobody would argue that my measuring tape is invalid; they would argue that my use of it to measure reading fluency is invalid. This distinction, obvious as it may seem, is the crux of contemporary conceptions of validity (American Educational Research Association et al., 2014; Kane, 2013). Thus, researchers should never say a test is valid or invalid, but rather that their interpretations or uses of a test are valid or invalid. We may, however, say that an instrument has evidence of validity and reliability, while bearing in mind that the relevance of such evidence may differ across populations, settings, points in time, and other factors.
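
To make the confound concrete, the sketch below simulates the scenario with invented numbers: age drives both height and reading fluency, so the two are correlated, but the relationship largely disappears once age is controlled for.

```python
# Sketch: simulated data illustrating why height appears to "predict" reading
# fluency. All numbers are invented; age drives both variables.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

age = rng.uniform(5, 12, n)                       # ages 5-12
height = 80 + 6 * age + rng.normal(0, 5, n)       # cm, grows with age
fluency = 20 + 10 * age + rng.normal(0, 15, n)    # words per minute, grows with age

print("r(height, fluency):", round(np.corrcoef(height, fluency)[0, 1], 2))

# Partial correlation controlling for age: correlate the residuals after
# regressing each variable on age. The apparent relationship largely vanishes.
resid_h = height - np.polyval(np.polyfit(age, height, 1), age)
resid_f = fluency - np.polyval(np.polyfit(age, fluency, 1), age)
print("r(height, fluency | age):", round(np.corrcoef(resid_h, resid_f)[0, 1], 2))
```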

Reliability

A similar distinction must be made about reliability: reliability refers to the data a test produces, rather than to the test itself. A test that produces reliable data will produce the same result for the same participants across multiple administrations, assuming no change in the construct has occurred (e.g., assuming one did not learn more about math between two administrations of the same math test). Thus, the Standards define reliability as “the more general notion of consistency of the scores across instances of the testing procedure.”

But how can you quantify such consistency in the data across testing events? Statisticians have several ways to do this, each differing slightly depending on the theoretical approach to assessment. Each approach uses some form of a reliability coefficient, or “the correlation between scores on two equivalent forms of the test, presuming that taking one form has no effect on performance on the second form.” There are many theories of assessment, but three of the most common are Classical Test Theory (Gulliksen, 1950; Guttman, 1945; Kuder & Richardson, 1937), Generalizability Theory (Cronbach et al., 1972; Suen & Lei, 2007; Vispoel et al., 2018), and Item Response Theory (IRT; Baker, 2001; Hambleton et al., 1991). This blog post is too broad in scope to detail each of these theories, but know that they differ in the assumptions they make about assessment and the terminology they use, and each has different implications for how one quantifies the reliability of data.
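
To ground the idea of a reliability coefficient, here is a minimal sketch under classical test theory assumptions. The item scores and the simulated retest are invented for illustration; in practice you would compute these quantities from real response data.

```python
# Sketch: two common reliability estimates under classical test theory.
# Item scores below are invented; in practice they come from real responses.
import numpy as np

# Rows = test takers, columns = items on a single administration.
items = np.array([
    [4, 5, 4, 3],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
], dtype=float)

# Cronbach's alpha: internal consistency across items.
k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)
total_var = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print("Cronbach's alpha:", round(alpha, 2))

# Test-retest (or parallel-forms) reliability: correlation between total
# scores from two administrations, assuming the construct did not change.
form_a = items.sum(axis=1)
form_b = form_a + np.random.default_rng(1).normal(0, 1, form_a.size)  # hypothetical retest
print("Test-retest r:", round(np.corrcoef(form_a, form_b)[0, 1], 2))
```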

What’s Next?

This post only introduces these two terms. The next three posts discuss validity and reliability in more depth for both quantitative and qualitative approaches (to be published over the next few weeks).

References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. American Educational Research Association.

Baker, F. B. (2001). The Basics of Item Response Theory (2nd ed.). ERIC Clearinghouse on Assessment and Evaluation. https://eric.ed.gov/?id=ED458219

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. John Wiley and Sons.

Gulliksen, H. (1950). Theory of Mental Tests. Wiley.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4), 255–282. https://doi.org/10.1007/BF02288892

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of Item Response Theory. Sage Publications, Inc.

Kane, M. (2013). The argument-based approach to validation. School Psychology Review, 42(4), 448–457. https://doi.org/10.1080/02796015.2013.12087465

Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2(3), 151–160. https://doi.org/10.1007/BF02288391

Suen, H. K., & Lei, P.-W. (2007). Classical versus generalizability theory of measurement. Educational Measurement, 4, 1–13.

Vispoel, W. P., Morris, C. A., & Kilinc, M. (2018). Applications of generalizability theory and their relations to classical test theory and structural equation modeling. Psychological Methods, 23(1), 1–26. https://doi.org/10.1037/met0000107