
Tuesday, January 11, 2011

Reliability and Validity - Developing a Good Instrument.

There are a lot of factors that go into developing a good instrument, such as the choice of items and measurement scales, the use of an appropriate sample for pilot testing, and the selection of statistical techniques that fit the characteristics of your data. Testing your instrument for reliability and validity can help you assess whether your instrument is "good" or "bad" and, ultimately, whether your interpretation of the data is accurate or misleading.

Testing for reliability and/or validity is not a simple process; it can take years of administering an instrument to different samples. Below, I will list the different types of reliability and validity and how each is assessed.

It's important to remember that reliability is a necessary but NOT sufficient condition of validity. A necessary condition of a statement must be satisfied for the statement to be true, and a sufficient condition is one that, if satisfied, assures the statement's truth. In other words, you can have an instrument that is reliable but not valid. However, if your instrument is valid, then it HAS to be reliable. It reminds me of the old adage: "all poodles are dogs, but not all dogs are poodles." It's the same thing in this case: all valid instruments are reliable, but not all reliable instruments are valid.


Definitions:

Reliability - the degree to which an instrument consistently measures whatever it intends to measure. In other words, it's the statistical measure of the reproducibility or stability of the data gathered by your survey.

Validity - the degree to which an instrument measures what it is supposed to measure. If your instrument is valid, then you can feel confident in the interpretation of the data. Going back to the adage above, a valid instrument must also be reliable, but what does that mean? Here is another way to interpret it:

Precision (reliability) + Accuracy = Validity

If you have spent any time studying the sciences, you are sure to have come across precision and accuracy. Although the difference between these terms is clear to me now, it seemed very ambiguous when I was in Chemistry 101, so I'll use the bulls-eye analogy to explain. Precision, AKA reliability, is when all your "hits" are clustered in the same area (the degree to which repeated measures under unchanged conditions show the same result). Accuracy is when all your "hits" are close to the bulls-eye (how close the measurements are to the actual value). With both of these properties together, you have validity (your "hits" are clustered together around the bulls-eye).
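To make the bulls-eye analogy a bit more concrete, here is a small Python sketch (my own illustration, not from the references below) that simulates two hypothetical instruments measuring a true value of 100: one precise but biased, the other unbiased but imprecise. Only an instrument that keeps both the bias and the spread small would be precise and accurate, and therefore valid.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 100.0  # the "bulls-eye"

# Instrument A: precise (tight cluster) but inaccurate (systematic bias of +10)
scores_a = rng.normal(loc=110.0, scale=1.0, size=50)

# Instrument B: accurate on average (centered on the truth) but imprecise (wide spread)
scores_b = rng.normal(loc=100.0, scale=10.0, size=50)

for name, scores in [("A: precise, biased", scores_a), ("B: accurate, imprecise", scores_b)]:
    bias = scores.mean() - true_value   # accuracy: how far the cluster sits from the bulls-eye
    spread = scores.std(ddof=1)         # precision: how tightly the "hits" cluster together
    print(f"{name}  bias = {bias:+.1f}  spread = {spread:.1f}")
```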
 




Types of Reliability:
 
Test-Retest (stability) - the degree to which scores on the same test are consistent over time; in other words, the questions are worded in such a way that respondents answer them consistently. To test this, you administer the survey to the same sample twice and then calculate the correlation coefficient (r) between the two sets of responses (a small computational sketch follows this list). Correlation coefficients are considered good if they are 0.70 or above, indicating that the responses are reasonably consistent from one point in time to the other. The trick is determining how long to wait between administrations. Two weeks is suggested as a good interval because it is long enough for respondents to forget their answers, but short enough that they do not gain knowledge or change behaviors before the second survey.
  • Intraobserver (intrajudge) - measures the stability of responses from the same person. This is a type of test-retest reliability because it looks at a single individual's scores over a period of time. It is also measured with a correlation coefficient.
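Here is a minimal Python sketch of the test-retest calculation, using made-up total scores for eight respondents surveyed two weeks apart; a real analysis would of course use your actual survey data.

```python
import numpy as np

# Hypothetical total scores for the same eight respondents, two weeks apart
time1 = np.array([12, 15, 9, 20, 17, 11, 14, 18])
time2 = np.array([13, 14, 10, 19, 18, 10, 15, 17])

# Test-retest reliability: the Pearson correlation between the two administrations
r = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest r = {r:.2f}")  # 0.70 or above is generally considered acceptable
```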


Alternate Form (equivalent-form) - the degree to which two similar forms of a test produce similar scores from a single sample. The two instruments have the same structure, number of items, reading level, difficulty level, etc.; however, the individual items are not identical. Items differ in wording but still measure the same idea. You can assess this by administering the two forms to two separate samples of the same population, or by administering them twice to the same sample (as a pre- and post-test). Correlation coefficients (r) are then compared, and high values indicate good alternate-form reliability.

Internal Consistency - indicates how well different items measure the same issue. It is applied to a group of items that are thought to measure different aspects of the same concept. This is important when measuring the reliability of latent constructs, because a single item cannot assess concepts such as knowledge, behavior, or attitude. Below are two commonly used methods for assessing internal consistency (a computational sketch of both follows this list).
  • Split-Half - measures internal consistency by comparing two parts of a single instrument. Divide the instrument (or the items of a construct) into two halves, compute each respondent's score on each half, and correlate the two sets of scores. High correlation coefficients indicate high internal consistency.
  • Cronbach's Alpha - indicates how well a group of items hang together as measures of a single construct. It can be used for dichotomous items or for longer measurement scales like the Likert scale. High alpha values indicate high internal consistency. If your instrument only involves dichotomous responses (e.g., yes or no), the Kuder-Richardson 20 (KR-20) is another option for this statistic.
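The sketch below shows both calculations on a tiny, invented matrix of Likert responses. The Spearman-Brown adjustment applied to the split-half correlation is a standard extra step not described above; it is included only to show how the half-length correlation is usually projected back to the full instrument.

```python
import numpy as np

# Hypothetical responses: rows = six respondents, columns = four Likert items (1-5)
# intended to measure the same construct
items = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
    [1, 2, 1, 2],
])

# Split-half: correlate each respondent's score on the odd items with their score
# on the even items, then adjust back to full test length (Spearman-Brown)
half1 = items[:, ::2].sum(axis=1)
half2 = items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(half1, half2)[0, 1]
r_full = 2 * r_half / (1 + r_half)
print(f"split-half r = {r_half:.2f} (Spearman-Brown adjusted: {r_full:.2f})")

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total scores)
k = items.shape[1]
alpha = (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                         / items.sum(axis=1).var(ddof=1))
print(f"Cronbach's alpha = {alpha:.2f}")
```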


Interobserver (interjudge/interrater) - measures how well two or more evaluators agree in their assessment of a variable. It refers to the consistency of two or more independent observers and is usually reported as a correlation coefficient. This type of reliability is used in qualitative studies like interviews, focus groups, or open-ended surveys.
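A minimal sketch of that calculation, assuming two coders have rated the same ten responses on a numeric rubric (the ratings are invented); for categorical codes, chance-corrected statistics such as Cohen's kappa are also commonly reported.

```python
import numpy as np

# Hypothetical ratings: two independent coders scoring the same ten open-ended
# responses on a 1-5 rubric
rater_a = np.array([3, 4, 2, 5, 4, 3, 1, 4, 5, 2])
rater_b = np.array([3, 4, 3, 5, 4, 2, 1, 4, 4, 2])

# Interobserver reliability reported as the correlation between the two raters,
# with simple percent exact agreement as a second check
r = np.corrcoef(rater_a, rater_b)[0, 1]
agreement = (rater_a == rater_b).mean()
print(f"interrater r = {r:.2f}, exact agreement = {agreement:.0%}")
```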


Types of Validity:

Face - involves feedback from untrained reviewers. If you were to categorize validity testing into stages, this would be the first one. You're looking for spelling and grammar errors, ambiguous items, confusing layouts, etc. Untrained reviewers will focus on the overall aesthetics of the survey rather than the content.


Content - the measure of how appropriate the items or scales seem to a set of trained reviewers. As the term suggests, you're looking at the content of the survey. The more people you can have look over the survey, the better, because each person will point out something different. This should be conducted after you check for face validity and can therefore be considered the second stage.

Criterion - the measure of how well one instrument compares to another. This is determined by relating the performance of your instrument to another instrument (the criterion against which the validity of your instrument is judged).
  • Concurrent - the comparison of your instrument against another that is considered to be the gold standard for the variable in question. It is calculated using correlation coefficients, and high values indicate good concurrent validity.
  • Predictive - the degree to which a test can predict how well an individual will do in a future situation. For example, the GRE is supposed to be a good predictor of how well we will do in graduate school. However, I think that many of us will disagree with the GRE's predictive potential, but that's another subject for another time. :) Correlation coefficients are used to compare the initial score with the secondary outcome.
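As a rough illustration of how criterion validity is quantified, here is a small Python sketch with invented admission-test scores and later outcomes; concurrent validity would be computed the same way, except the criterion would be a gold-standard measure collected at the same time.

```python
import numpy as np

# Hypothetical data: an admission-test score at entry and the later outcome it is
# supposed to predict (e.g., first-year graduate GPA) for the same eight students
test_score = np.array([150, 160, 145, 170, 155, 165, 140, 158])
later_gpa = np.array([3.1, 3.4, 2.9, 3.8, 3.0, 3.6, 2.7, 3.3])

# Predictive validity: the correlation between the earlier score and the later outcome
r = np.corrcoef(test_score, later_gpa)[0, 1]
print(f"predictive validity r = {r:.2f}")
```
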
Construct - the degree to which an instrument measures a construct. This is the most important form of validity because it answers the question: is this instrument measuring what it was intended to measure? However, it is also the most difficult form of validity to understand, to measure, and to report.
  • Convergent - implies that several different methods for obtaining the same information will provide similar results. Assessing convergent validity is similar to alternate form reliability but is more theoretical. This requires a great amount of work over a long period of time to determine.
  • Divergent - measures the degree to which an instrument does not correlate with measures of concepts it should be unrelated to (also called discriminant validity). This is also very theoretical and requires a lot of time and work to determine.



References:

Litwin, M. S. (2003). How to Assess and Interpret Survey Psychometrics (The Survey Kit 2). Thousand Oaks, CA: Sage Publications.

Gay, L. R., Mills, G. E., & Airasian, P. (2006). Educational Research: Competencies for Analysis and Applications (8th ed.). Columbus, OH: Pearson Merrill Prentice Hall.