Measurement, Reliability, and Validity: Part Four in the continuing series, Getting Quality Right

By Cliff Hurst

In conducting a statistical analysis, it is important to understand the level of measurement being used, how reliable it is, and whether it is valid. This article addresses each of these issues.

Measurement can occur at four levels: nominal, ordinal, interval, and ratio. For our purposes, we’ll treat interval and ratio data the same.

Nominal Data assigns numbers to names since software can only crunch numbers, not names. Suppose a call center operates two shifts a day. For analytic purposes, you may want to differentiate between shifts as you monitor the scores. To enter data into a statistical software package, you will assign a number to “day” and a number to “evening.” This is called coding the data.

You may define the day shift as “1” and the evening shift as “2.” This doesn’t mean that “2” is higher than “1” in either a judgmental or an arithmetic sense. In nominal data, numbers simply become placeholders for names.
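As a rough illustration of coding nominal data, here is a minimal Python sketch. The code book and the helper names (SHIFT_CODES, code_shifts, count_by_shift) are hypothetical, invented for this example rather than taken from any particular statistical package.

```python
from collections import Counter

# Hypothetical code book: the numbers are labels, not quantities.
SHIFT_CODES = {"day": 1, "evening": 2}
SHIFT_NAMES = {code: name for name, code in SHIFT_CODES.items()}


def code_shifts(shifts):
    """Translate shift names into their numeric codes."""
    return [SHIFT_CODES[name] for name in shifts]


def count_by_shift(codes):
    """Count how many monitored calls fall under each shift."""
    counts = Counter(codes)
    return {SHIFT_NAMES[code]: n for code, n in counts.items()}


monitored_calls = ["day", "evening", "day", "day"]  # made-up sample
codes = code_shifts(monitored_calls)
print(codes)
print(count_by_shift(codes))
```

Counting calls per code is meaningful here. Averaging the codes would not be, because “2” is only a label for the evening shift.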

Ordinal Data signifies good, bad, and variations thereof. Ordinal means that there is an order to the numeric rating. It is good practice to code better ratings with higher numbers and lower ratings with lower numbers.

Let’s say you want to monitor whether agents verify a caller’s identity before disclosing confidential information. If the caller provides his or her last name and account number and these match your records, then proper verification has been made. Call monitoring will evoke either a “yes” or a “no” determination. To follow common practice, assign the number “1” to “yes” and “0” to “no.”

In another situation, you might want to evaluate professional courtesy using a Likert scale of 1 to 5, where 1 means not acceptable, 2 means below average, 3 means average, 4 means above average, and 5 means excellent. You will be making an extrinsic judgment because you are looking for “shades of gray.” Scores for this will also be measured at the ordinal level of measurement because “excellent” is better than “average,” and so forth. There is an order to the rankings.

This is where confusion between ordinal and interval levels of measurement can creep in. It appears that the intervals are built upon a five-point scale. However, this is not an interval level of measurement – it is ordinal because there are no standard increments of measurement within the scale that was used. The difference between “average” and “above average” can only be qualitatively determined.

Interval/Ratio Data: If you want to get more granular in your analysis, you can develop an interval or ratio scale as a subset of the category of professional courtesy. The following is a far-fetched example, not a recommendation; I am only illustrating the statistical principle.

Suppose you decide that the more the agent says “thank you,” the more professional courtesy is displayed. Counting the number of times the agent says “thank you” gives an interval/ratio level of measurement, because each additional “thank you” is an increment of the same size and zero means none at all. You can code this part of the form with a “0” if the agent does not say thank you at all, “1” if the agent says it once, “2” if the agent says it twice, “3” if the agent says it three times, “4” if the agent says it four times, and so on.

The type of statistical analysis that your data requires is determined by whether it is nominal, ordinal, or interval/ratio.
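To make this concrete, the sketch below pairs each level of measurement with a summary statistic that is commonly considered appropriate for it: the mode for nominal data, the median for ordinal data, and the mean for interval/ratio data. This pairing is a widely used rule of thumb, not a prescription from this article, and the function name summarize and the sample scores are assumptions made for illustration.

```python
from statistics import mean, median, multimode


def summarize(values, level):
    """Return a summary statistic suited to the level of measurement."""
    if level == "nominal":
        return multimode(values)  # most frequent code(s); order is meaningless
    if level == "ordinal":
        return median(values)     # the middle rating; relies on order only
    if level == "interval/ratio":
        return mean(values)       # arithmetic average; relies on equal increments
    raise ValueError(f"unknown level of measurement: {level}")


# Made-up sample data for five monitored calls.
shift_codes = [1, 2, 1, 1, 2]   # nominal: 1 = day, 2 = evening
courtesy = [3, 4, 5, 2, 4]      # ordinal: 1-to-5 courtesy rating
thank_yous = [0, 2, 1, 4, 3]    # interval/ratio: count of "thank you"

print(summarize(shift_codes, "nominal"))
print(summarize(courtesy, "ordinal"))
print(summarize(thank_yous, "interval/ratio"))
```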

Calibration involves using a set of proven statistical and analytical tools to measure how reliable and how valid your quality monitoring process is. Although reliability and validity are often lumped into one category, they are actually two distinct components:

1. Reliability addresses consistency. Does your quality monitoring form allow your QA team to measure things consistently? Would different evaluators likely assign the same score to the same call? Does the team score similar calls similarly over time, or do they tend to drift apart in their scoring practices? These are the questions you must answer when establishing the reliability of your scoring forms. There are four different kinds of reliability:

Inter-rater reliability measures how similarly different evaluators rate the same call when they score it.

Test-retest reliability tracks whether the same evaluators would rate a call consistently if they were to score that same call again.

Parallel forms reliability ascertains whether two versions of a form yield consistent scores when applied to the same calls, so that one version can stand in for the other.

Internal consistency reliability makes sure you are not “double-dipping” among what you think are distinct categories on your form.
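One common way to put a number on inter-rater reliability for a yes/no item, such as caller verification, is Cohen’s kappa, which compares the agreement two evaluators actually reach with the agreement expected by chance. The article does not name a specific statistic, so treat the following as a minimal sketch under that assumption. The function name cohens_kappa and the sample scores are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same calls with category codes."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of calls")
    n = len(rater_a)
    # Observed agreement: share of calls where the two raters gave the same code.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: based on how often each rater used each code.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical code for every call
    return (observed - expected) / (1 - expected)


# Made-up scores for eight calls: 1 = verified, 0 = not verified.
evaluator_1 = [1, 1, 0, 1, 0, 1, 1, 0]
evaluator_2 = [1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(evaluator_1, evaluator_2)
print(round(kappa, 3))
```

A kappa of 1 means perfect agreement, and a value near 0 means the evaluators agree no more often than chance would predict. Because kappa treats the codes as unordered categories, a 1-to-5 rating item would usually call for a weighted variant or a different statistic.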

2. Validity assesses whether your measurements are appropriate, meaningful, and useful. Validity is more difficult to quantify than reliability. There are three types of validity: content, criterion, and construct.

Content validity determines whether the things you measure are really an accurate reflection of what you intend to measure. We tend to do this pretty well in terms of the greeting, the closing, and accuracy of data entry. It is harder to measure “soft” areas such as courtesy, professionalism, and tone of voice.

Criterion validity determines how well the criteria we use in our monitoring forms correlate with other measures of customer satisfaction, such as post-call IVR surveys, written or phone surveys, measures of first call resolution, escalations, accuracy of data entry, customer retention, and even financial measures like goodwill, credits, average collection period, and returns. It’s important not to create and use monitoring forms in a vacuum, removed from these other performance measures.

Construct validity can be difficult to get right. One example that misses its mark is something that I see quite often. Many monitoring forms ask, “Did the CSR use the customer’s name three times during this call?” It seems like a good measure of customer-focus, but it really isn’t. Callers do not generally count the number of times their name is used during the call.

As an industry, we ought to get better at construct validity. For example, I propose that the best indicator of courteousness and professionalism is whether CSRs acknowledge the reason for the call or the emotional state of that caller before asking for verifying information. This truly makes the caller feel heard and valued. Yet that acknowledgement is seldom included on monitoring forms.

A thorough comparison of customer survey results, correlated with assorted monitoring criteria, can help us determine authoritatively which elements really contribute to customer-focus, professionalism, and courtesy, rather than relying on conjecture. This sort of thorough analysis, within the overall context of quality assurance, will lead to the next improvement in call center management: “getting quality right.”
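The comparison described above, monitoring scores set against customer survey results, could be sketched as a rank correlation. Spearman’s rank correlation is one reasonable choice for ordinal ratings because it relies only on the order of the scores. It is an assumption of this example, not a method the article specifies, and the function names and sample data are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up data: courtesy scores from monitoring and survey scores for the same calls.
monitoring_courtesy = [3, 4, 5, 2, 4, 1]
survey_satisfaction = [3, 5, 4, 2, 4, 2]
print(round(spearman(monitoring_courtesy, survey_satisfaction), 3))
```

A coefficient near +1 would suggest that the monitoring item moves in step with what customers report, while a value near 0 would suggest the item may not be measuring something customers notice.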


Cliff Hurst is president of Career Impact, Inc., which he started in 1988. Contact Cliff at 207-499-0141, 800-813-8105, or cliff@careerimpact.net.

[From Connection Magazine – Jul/Aug 2008]