
Instrument Reliability

What are we trying to accomplish? | Reliability test | How is this determined?



Stability: Test-retest

The same test is administered to the same individuals on two separate occasions. The trick is to administer the two tests close enough together that you are not really detecting a change over time (maturation or history), but not so close together that people remember which response they chose the first time. An interval of one to two weeks is recommended.
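As a rough illustration of comparing the two administrations, here is a minimal Python sketch that correlates first-occasion and second-occasion scores. The scores and the names `pearson_r`, `time_1`, and `time_2` are made up for the example and are not from any real study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cross / sqrt(ss_x * ss_y)

# Hypothetical total scores for the same five participants,
# tested about two weeks apart.
time_1 = [12, 15, 9, 20, 17]
time_2 = [13, 14, 10, 19, 18]

test_retest_r = pearson_r(time_1, time_2)
print(round(test_retest_r, 2))
```

A value close to 1 suggests participants kept roughly the same rank order across the two occasions.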


Internal consistency: Cronbach's alpha or Kuder-Richardson


Correlations are run among items determined to be similar (through construct validity). The resulting coefficient should be high, such as .80 for an established instrument and .70 for a new instrument. (A coefficient of .80 is a value on a scale that tops out at 1.0, not "80%.") In real words: How well do the items "hang together"?
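For readers who want to see how the items "hang together" numerically, here is a minimal sketch of Cronbach's alpha using its standard formula: k/(k-1) times (1 minus the sum of the item variances divided by the variance of the total scores). The response data and the function name `cronbach_alpha` are hypothetical. For items scored 0/1, the same formula gives the Kuder-Richardson (KR-20) value.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row of item scores per participant)."""
    if len(responses) < 2:
        raise ValueError("need at least 2 participants")
    k = len(responses[0])  # number of items
    if k < 2:
        raise ValueError("need at least 2 items")
    # Transpose so each tuple holds every participant's score on one item.
    items = list(zip(*responses))
    sum_item_variances = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores must vary")
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

# Hypothetical answers from five participants on three similar items (1-5 scale).
answers = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 1],
    [3, 3, 3],
]

alpha = cronbach_alpha(answers)
print(round(alpha, 2))
```

With real data, you would compare the result against the .70 or .80 guideline mentioned above.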
Equivalence of instruments

Alternate forms or split-half




The instrument is given in different forms, such as written and oral or in-person and over the phone. This also includes instruments in different languages.

The two tests are given to participants at the same time, and the scores should be similar. This is usually done when researchers are testing whether a tried-and-true instrument can be used in a different order (like when you took the GRE) or in a shorter version (such as the 24-item acculturation instrument that is now down to four items!).

The different forms of instrument administration need to be tested to assure equivalency.

Equivalence of data collectors

Interrater reliability

Two or more data collectors are tested to see whether they administer a written or oral questionnaire the same way and get consistent results. They usually receive training so they follow the same script for obtaining participant consent and for giving the instructions before and during the data collection. They usually practice until participant scores have a correlation of r = .80.
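The text above describes interrater reliability as a correlation between scores. When data collectors instead assign categories (for example, yes/no), a commonly used alternative is Cohen's kappa, which adjusts raw agreement for the agreement expected by chance. The sketch below swaps in that technique purely as an illustration; the ratings and the name `cohen_kappa` are hypothetical.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who coded the same participants."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same, non-empty set of participants")
    n = len(rater_a)
    # Proportion of participants on whom the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category proportions.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (observed - expected) / (1 - expected)

# Hypothetical codes from two data collectors for the same eight participants.
collector_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
collector_2 = ["yes", "yes", "no", "yes", "yes", "no", "yes", "no"]

kappa = cohen_kappa(collector_1, collector_2)
print(round(kappa, 2))
```

Here the collectors agree on 7 of 8 participants, but kappa is lower than .875 because half of that agreement would be expected by chance.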

Decreasing Measurement Error

Situational contamination



Transitory personal factors

Response-set bias













Different types of data collection methods




Inconsistencies in the data collection instruments:

For example, on a Likert scale: are the possible responses equally spaced, both visually and conceptually?


On a visual analog scale: the markings should be an equal distance from each other, and each visual analog line should be the same length.

Data collection conditions should be kept consistent wherever they could plausibly affect responses. Examples include researcher characteristics, the physical setting, the weather, or the time of day.

Temporary conditions such as pain, anxiety, or mood need to be assessed.

The participant may respond in a certain way because of social desirability, boredom, or a decreased attention span. For example, when a questionnaire is administered to teens, they may react to the age of the data collector, or, if it takes longer than 20 minutes to complete, they may start circling any answer just to get done.

Another technique is to put fewer questions on each page and have more pages. At the expense of trees, it keeps participants engaged longer because it "feels" like they are making faster progress.

Another example would be always having the "good" answer in the same spot on a questionnaire. For example, on the questionnaire I used for pregnant teens, mixed in with the good things about school (hang out with friends, get a better education, be able to provide for my baby) there would be a bad thing about school (such as having lots of homework). This technique discourages participants from just circling all the "goods" without really reading the questions. It also let me know when it was more likely that a participant had just circled all the "good" answers.
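The mixed-in "bad thing" item makes it possible to screen for this pattern after data collection. The sketch below flags participants who gave the identical answer to every item, including the mixed-in one. The participant IDs, answers, and the name `flag_identical_answers` are made up for illustration, and a flag is only a prompt to look closer, not proof of careless responding.

```python
def flag_identical_answers(responses):
    """Return IDs of participants who gave the same answer to every item."""
    return [
        participant_id
        for participant_id, answers in responses.items()
        if len(answers) > 1 and len(set(answers)) == 1
    ]

# Hypothetical data: items 1-3 are good things about school;
# item 4 is the mixed-in "bad thing" (having lots of homework).
responses = {
    "P1": ["good", "good", "good", "bad"],
    "P2": ["good", "good", "good", "good"],  # circled "good" even for homework
    "P3": ["good", "bad", "good", "bad"],
}

flagged = flag_identical_answers(responses)
print(flagged)
```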

Data are collected in different ways, such as self-report, an oral questionnaire, a written questionnaire, on the phone, on the computer, read by the data collector, read by the participant, open-ended questions, and fixed-response questions. Each one will have reliability threats and strengths depending on the characteristics of the participants.



Measure the distance between possible answers to assure they are equal. Make sure the phrase meanings are also consistent. For example, a Likert scale would read: Very unlikely, Kind of unlikely, Unlikely, Likely, Kind of likely, and Very likely.


Measure the markings and the lines on a visual analog scale!

Avoiding costly mistakes! Big mistakes that aren't discovered until the study is under way! Pilot testing is a great idea! You don't want to have to figure out damage control in the middle of a major study!




Jeanette Koshar RN, NP, PhD
Office: (707) 664-2649 | Office Hours: Wed 10-12, email and by appointment | Email: jeanette.koshar@sonoma.edu
Deb Kindy RN, PhD
Email: klaas@sonoma.edu | Office Hours: Tues 1-3