Reliability refers to the consistency of a measure, that is, the extent to which a metric yields consistent results under similar conditions. Entrepreneurs or managers who adopt the scientific approach wish to test hypotheses using measures that are not only valid but also consistent, that is, as free as possible from measurement error. Measurement error derives from the measures themselves and can be reduced by improving a measure's consistency. Naturally, consistency tends to be higher when measures and measurement methods concern the properties or performance of artifacts, for example a caliper used to check a dimensional tolerance, but in many innovation decisions hypotheses concern human behavior, an aspect for which measurement is more complex. In these cases, to reduce measurement error, entrepreneurs and managers can rely on three types of consistency: over time (test-retest reliability), across items (internal consistency), and across different raters (inter-rater reliability).

When entrepreneurs and managers need to measure a construct that they assume to be stable over time, they need to make sure that the outcomes of measurement are also consistent across time, other things being equal. Test-retest reliability is the extent to which this is actually the case. For example, if nothing changes in a given product or service, customers' evaluation of it should be consistent across time. This means that any good measure of such customers' preferences should produce roughly the same outcomes for the same individuals at any point in time. In this example, a measure proves reliable over time if, once used on a group of customers at one time and then reused on the same group at a later time, it yields similar outcomes, or outcomes that are strongly and significantly positively correlated.

A second kind of reliability is internal consistency, which is the consistency of respondents' responses (for example, customers') across the items of a multiple-item measure, for example the items of a survey. In general, assuming that the measure is valid, that is, that all the items of the measure reflect the same underlying construct, the outcomes of the measurement on those items should be correlated with one another. As with test-retest reliability, if an entrepreneur or a manager wishes to improve internal consistency, s/he must do so by collecting and analyzing data. There are several approaches to establishing internal consistency. The first is to look at a split-half correlation. This involves splitting the items into two sets, such as the first and second halves of the items, or the even- and odd-numbered items; the outcomes are computed for each set of items, and the relationship between the two sets of outcomes is examined. A split-half correlation of 0.80 or greater is generally considered to indicate good internal consistency. Perhaps the most common approach to measuring internal consistency for multiple-item metrics is a statistic called Cronbach's Alpha. Conceptually, Alpha is the mean of all possible split-half correlations for a set of items. For example, there are 252 ways to split a set of 10 items into two sets of five, and Cronbach's Alpha would be the mean of those 252 split-half correlations. In practice, it is calculated differently, using the formula presented further below.
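Before turning to that formula, the split-half correlation and the conceptual definition of Alpha can be made concrete with a short sketch. This is only an illustration under assumptions not in the original material: the hypothetical survey responses, the use of Python with NumPy, and the helper function split_half_correlation are made up for the example.

```python
import numpy as np
from itertools import combinations

# Hypothetical data (an assumption for illustration): 12 customers answering a
# 10-item survey on a 1-5 scale, all items meant to reflect the same construct.
rng = np.random.default_rng(0)
base = rng.integers(1, 6, size=(12, 1))  # each customer's "true" level
responses = np.clip(base + rng.integers(-1, 2, size=(12, 10)), 1, 5)

n_respondents, n_items = responses.shape

def split_half_correlation(data, half_a, half_b):
    """Correlate respondents' total scores on two halves of the items."""
    score_a = data[:, list(half_a)].sum(axis=1)
    score_b = data[:, list(half_b)].sum(axis=1)
    return np.corrcoef(score_a, score_b)[0, 1]

# A single split: odd- versus even-numbered items.
odd_items, even_items = range(0, n_items, 2), range(1, n_items, 2)
print("Odd/even split-half correlation:",
      split_half_correlation(responses, odd_items, even_items))

# Conceptual Cronbach's Alpha: the mean over the 252 ways of choosing a
# five-item half, each half paired with its complementary five items.
splits = [(half, tuple(i for i in range(n_items) if i not in half))
          for half in combinations(range(n_items), n_items // 2)]
print("Mean of all split-half correlations:",
      np.mean([split_half_correlation(responses, a, b) for a, b in splits]))
```

Note that the 252 halves count each unordered split twice, which leaves the mean unchanged; the point is only that the conceptual definition is computable, even though in practice Alpha is obtained from the formula that follows.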
Cronbach's Alpha is equal to the ratio of the number of items times the average inter-item covariance, to the sum of the average item variance and the average inter-item covariance times the number of items minus one; in symbols, Alpha = (N × c̄) / (v̄ + (N − 1) × c̄), where N is the number of items, c̄ the average inter-item covariance, and v̄ the average item variance. A value of 0.80 or greater is generally taken to indicate good internal consistency, but entrepreneurs and managers might accept values as low as 0.60.

In innovation decisions, many behavioral measures involve significant judgment on the part of an observer, or rater. Inter-rater reliability is the extent to which different observers are consistent in their judgments. If they are not, the inference of the entrepreneur or manager might be biased by measurement error. Consider, for example, an entrepreneur or manager who conducts ethnographic customer interviews, or directly observes customers' behaviors, to test a hypothesis, for example to validate a customer problem. If the interviews are audio-recorded, or if the actual customer behaviors are video-recorded, these audio or video files need to be analyzed and coded, for example for the frequency and/or intensity of occurrence of certain themes indicating the presence of the hypothesized customer problem. To the extent that each customer in the sample does in fact experience some level of the hypothesized problem that any attentive observer could detect, different observers' ratings should be highly correlated with each other. It is therefore useful to have multiple independent raters and to make sure that their ratings converge.

Inter-rater reliability can again be assessed using Cronbach's Alpha when the judgments are quantitative, or an analogous statistic, Cohen's Kappa, when they are categorical. Cohen's Kappa is equal to the ratio of the difference between the relative observed agreement among raters, p_o, and the hypothetical probability of chance agreement, p_e, to one minus the hypothetical probability of chance agreement: Kappa = (p_o − p_e) / (1 − p_e). More simply, entrepreneurs or managers can evaluate how far the raters' evaluations are from one another, for example by calculating the percentage of agreement among raters (a computational sketch of these statistics follows at the end of this section).

Making sure that innovation metrics are reliable can be critical, and it is certainly costly. Again, entrepreneurs and managers should make conscious decisions about the degree of reliability they wish their measures to have, estimating the probability of incurring measurement error and its potential cost. Reliability is important, but it matters only if the metrics are valid: a valid metric might be more or less reliable, but a reliable metric is not necessarily valid.
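As a closing illustration, here is a minimal sketch of the two statistics discussed above. The function names (cronbach_alpha, cohens_kappa), the NumPy-based implementation, and all of the data are assumptions made for the example, not part of the original material.

```python
import numpy as np

# Cronbach's Alpha from the formula above: alpha = (N * c_bar) / (v_bar + (N - 1) * c_bar),
# where N is the number of items, c_bar the average inter-item covariance,
# and v_bar the average item variance.
def cronbach_alpha(item_scores):
    """item_scores: a respondents-by-items array of quantitative ratings."""
    cov = np.cov(item_scores, rowvar=False)         # item-by-item covariance matrix
    n = cov.shape[0]
    v_bar = np.diag(cov).mean()                     # average item variance
    c_bar = cov[~np.eye(n, dtype=bool)].mean()      # average inter-item covariance
    return (n * c_bar) / (v_bar + (n - 1) * c_bar)

# Example: five customers rating three items of a hypothetical survey.
survey = np.array([[4, 5, 4],
                   [2, 2, 3],
                   [5, 5, 5],
                   [3, 4, 3],
                   [1, 2, 2]])
print("Cronbach's Alpha:", cronbach_alpha(survey))

# Cohen's Kappa for two raters' categorical codes: kappa = (p_o - p_e) / (1 - p_e).
def cohens_kappa(rater_a, rater_b):
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    p_o = np.mean(a == b)                           # relative observed agreement
    p_e = sum(np.mean(a == c) * np.mean(b == c)     # chance agreement from the
              for c in np.union1d(a, b))            # raters' marginal proportions
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes by two independent raters for ten interview excerpts:
# 1 = hypothesized customer problem present, 0 = absent.
rater_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
rater_2 = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1]
print("Percent agreement:", np.mean(np.array(rater_1) == np.array(rater_2)))
print("Cohen's Kappa:", cohens_kappa(rater_1, rater_2))
```

With these hypothetical codes, the two raters agree on 8 of the 10 excerpts (p_o = 0.80), while chance agreement is p_e = 0.52, so Kappa is about 0.58, a less flattering figure than the raw percent agreement suggests.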