Reliability in evaluation instruments is no different from reliability in other instruments used in science, including computer science. For example, consider an instrument many of us rely on when deciding what to wear each day: the thermometer. When a thermometer is designed and produced, we as consumers want it to measure the temperature accurately each day. In computing, when we enter a URL in our browser, we want the browser to return the correct site for that URL. In both cases, we want our instrument (or tool) to be reliable.
For evaluation instruments, reliability is a measure of the consistency and stability of the participants’ results. It answers the question “Does this instrument yield repeatable results?” If you are a teacher, you know that creating a reliable test for assessing students’ knowledge of while loops or conditional statements can be time-consuming. You want to ensure that the test or quiz measures the same knowledge for every student who takes it.
We can measure reliability in several ways.
1. Internal consistency reliability (Reliability across items)
Internal consistency reliability reflects the consistency of results across the items of an instrument or test. It helps ensure that the various items designed to measure the same construct deliver consistent scores. For example, suppose you are creating a multiple-choice test, and three of the questions are designed to measure one construct: while loops. If your test had 100% internal consistency reliability and a student understood while loops, then all three questions would be answered correctly. If the student did not understand while loops, then all three questions would be answered incorrectly.
Similarly, suppose you have a set of six Likert-scale items designed to measure the self-efficacy of students who are studying programming for the very first time, and all items are stated in the positive (“I am confident I can learn programming” instead of “I am not confident I can learn programming”). If the items are internally consistent, you would not expect a student to answer “Strongly Agree” on three items and “Strongly Disagree” on the other three. If many students responded that way, a test for internal consistency, such as Cronbach’s α (the Greek letter alpha), would yield a low value, signaling that the three divergent items may be measuring something else and should be revised or removed from the instrument.
There are several ways to test for internal consistency. Cronbach’s α is the most popular. A value of 0.80 or higher is generally considered to indicate that the construct being measured has good internal reliability, though it should be noted that, as the number of items in a scale increases, it may be possible to obtain a relatively high Cronbach’s α even with relatively low levels of correlation between items.
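As a rough illustration of how Cronbach’s α is computed, the sketch below uses the standard formula α = (k / (k − 1)) × (1 − Σ item variances / variance of total scores), where k is the number of items. The function name `cronbach_alpha` and the response data are hypothetical and chosen only for this example; a real analysis would normally rely on a statistics package.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row per participant,
    one column per item). Uses sample variances throughout."""
    k = len(scores[0])                      # number of items
    items = list(zip(*scores))              # transpose: one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical data: 5 participants answering 3 Likert items (1 to 5)
responses = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [3, 3, 2],
    [1, 2, 1],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

Because the three hypothetical items rise and fall together across participants, α comes out well above the 0.80 guideline. Items that disagree with the rest would pull the value down.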
2. Inter-rater reliability (Reliability across researchers)
Inter-rater reliability assesses the degree to which different raters give consistent ratings of the same data or phenomenon. In terms of tests or quizzes in the classroom, if you have two graders helping you grade a batch of quizzes with open-ended answers, you would want to ensure that each grader gives approximately the same score for the same answers. As another example, for this site, two of us reviewed and rated every abstract across the initial ten targeted journals for the years 2012-2017 to determine which would qualify for our article summary dataset. We rated each abstract to decide whether or not it met our pre-defined criteria. We then compared our ratings and found that we had rated the abstracts the same way 94.8% of the time, so our inter-rater reliability score (expressed as percent agreement) was 94.8%. A score of 100% would mean we had rated every abstract identically.
Evaluation instruments are no different. Whenever two or more researchers score the participants’ answers to open-ended questions, whether in a qualitative study or as part of a quantitative study, the researchers need to be trained in how to interpret (or code) the responses. This can be done in various ways, including keeping track of the rater of each item, having two or more researchers always rate the same items and then averaging the scores, and calibrating the rating methods by having all researchers score the same five participant results and then discussing how and why their scores differed.
While this form of reliability will frequently show up in research studies, it is less likely to show up in the description of an instrument, as high inter-rater reliability is more a measure of how well a study has been carried out than how well a rubric or set of open-ended questions has been designed (though there is some relationship).
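The percent-agreement figure described in this section can be computed in a few lines. The sketch below also includes Cohen’s kappa, a chance-corrected statistic that is not used in the text above and is added here only as a common companion measure. The rater labels and function names are hypothetical.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters gave the same rating."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for the agreement expected by chance.
    Undefined (division by zero) if expected agreement equals 1."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions
    p_expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical include/exclude decisions by two raters on 10 abstracts
rater_a = ["in", "in", "out", "in", "out", "out", "in", "out", "in", "in"]
rater_b = ["in", "in", "out", "in", "out", "in", "in", "out", "in", "in"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"Percent agreement: {agreement:.1%}")
print(f"Cohen's kappa: {kappa:.2f}")
```

In this made-up data the raters disagree on one abstract out of ten, so percent agreement is 90%. Kappa is lower because some of that agreement would be expected by chance alone.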
3. Test-retest reliability (Reliability across time)
Test-retest reliability is the most basic and easiest-to-grasp form of reliability testing. It means that, given the same questionnaire at different points in time, perhaps two weeks apart, participants will answer the items the same way. The results of the first and second administrations of the survey can then be correlated to determine the instrument’s stability over time. For example, if you were measuring the attitudes of middle school students towards computing, you might administer the survey on November 1st, then give the same students the same survey on November 14th (in the same environment under the same conditions). Test-retest reliability would be measured using a correlation coefficient, such as Pearson’s r. If Pearson’s r is 0.70 or above and the p-value (test for significance) is less than 0.05, the instrument is said to show evidence of test-retest reliability.
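A minimal sketch of the correlation step follows, using made-up scores for six students on two administrations of the same survey. It computes Pearson’s r by hand and leaves out the p-value. In practice a statistics library would typically be used for both (for example, SciPy’s `scipy.stats.pearsonr` returns the coefficient together with a p-value).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical total survey scores for the same six students
first_administration = [12, 15, 9, 20, 17, 11]
second_administration = [13, 14, 10, 19, 18, 10]

r = pearson_r(first_administration, second_administration)
print(f"Pearson's r: {r:.2f}")
print("Meets the 0.70 guideline" if r >= 0.70 else "Below the 0.70 guideline")
```

With only six hypothetical students, this is meant to show the mechanics rather than a realistic sample size.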
4. Parallel forms reliability and Split-half reliability
When you develop an instrument or a test, which questions are the best ones to ask? The parallel forms measure of reliability can help discern the best questions when different questions designed to measure the same concept are placed into two different instruments or tests. That is, you administer two equivalent forms of a test at the same time, with each form using different questions drawn from the same battery that is meant to measure the same construct. You then evaluate the differences between the results to determine which form provides higher reliability. One way to decide is to see which form yields the higher Cronbach’s α for the sets of questions designed to measure the same construct.
Split-half reliability is very similar. In this case, a single test, perhaps one with a larger set of questions, is divided into two halves. Each participant receives a score on each half, and the scores on the two halves are then compared with each other. This is measured through a correlation (Pearson’s r or Spearman’s rho) between the two halves of the instrument. Because each half is only half as long as the full test, the resulting correlation is then adjusted using the Spearman-Brown formula to produce the split-half reliability coefficient, an aggregate estimate of the reliability of the full-length instrument.
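The split-half calculation can be sketched as follows, assuming every participant answered all items so that the two half scores can be paired. The items are split by alternating position, which is one common choice among several. The Spearman-Brown step uses the standard formula 2r / (1 + r). The data and function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def spearman_brown(r_half):
    """Estimate full-length reliability from the half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(scores):
    """scores: one row per participant, one column per item."""
    half_a = [sum(row[0::2]) for row in scores]  # 1st, 3rd, ... items
    half_b = [sum(row[1::2]) for row in scores]  # 2nd, 4th, ... items
    return spearman_brown(pearson_r(half_a, half_b))


# Hypothetical data: 5 participants, 4 items scored 1 (correct) or 0
quiz_scores = [
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]

reliability = split_half_reliability(quiz_scores)
print(f"Split-half reliability: {reliability:.2f}")
```

The corrected coefficient is higher than the raw correlation between the halves, reflecting the expectation that a longer test is more reliable than either half alone.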
Cite this page
To cite this page, please use:
McGill, Monica M. and Xavier, Jeffrey. 2019. Measuring Reliability and Validity. Retrieved from https://csedresearch.org