The Tyranny of Metrics

Jerry Z. Muller

There are things that can be measured. There are things that are worth measuring. But what can be measured is not always what is worth measuring; what gets measured may have no relationship to what we really want to know. The costs of measuring may be greater than the benefits. The things that get measured may draw effort away from the things we really care about. And measurement may provide us with distorted knowledge—knowledge that seems solid but is actually deceptive.

When their scores are used as a basis of reward and punishment, surgeons, like others under such scrutiny, engage in “creaming”: they avoid the riskier cases. When hospitals are penalized based on the percentage of patients who fail to survive for thirty days beyond surgery, patients are sometimes kept alive for thirty-one days, so that their mortality is not reflected in the hospital’s metrics. In England, in an attempt to reduce wait times in emergency wards, the Department of Health adopted a policy that penalized hospitals with wait times longer than four hours. The program succeeded, at least on the surface. In fact, some hospitals responded by keeping incoming patients in queues of ambulances, beyond the doors of the hospital, until the staff was confident that the patient could be seen within the allotted four hours of being admitted.

The attempt to measure performance—while pocked with pitfalls, as we will see—is intrinsically desirable. If what is actually measured is a reasonable proxy for what is intended to be measured, and if it is combined with judgment, then measurement can help practitioners to assess their own performance, both for individuals and for organizations. But problems arise when such measures become the criteria used to reward and punish—when metrics become the basis of pay-for-performance or ratings.

Not everything that is important is measurable, and much that is measurable is unimportant. (Or, in the words of a familiar dictum, “Not everything that can be counted counts, and not everything that counts can be counted.”) Most organizations have multiple purposes, and that which is measured and rewarded tends to become the focus of attention, at the expense of other essential goals. Similarly, many jobs have multiple facets, and measuring only a few aspects creates incentives to neglect the rest.

Whenever reward is tied to measured performance, metric fixation invites gaming.

What has come to be called “Campbell’s Law,” named for the American social psychologist Donald T. Campbell, holds that “[t]he more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.” In a variation named for the British economist who formulated it, we have Goodhart’s Law, which states, “Any measure used for control is unreliable.” To put it another way, anything that can be measured and rewarded will be gamed.

Because belief in its efficacy seems to outlast evidence that it frequently doesn’t work, metric fixation has elements of a cult. Studies that demonstrate its lack of effectiveness are either ignored, or met with the assertion that what is needed is more data and better measurement. Metric fixation, which aspires to imitate science, too often resembles faith.

Taylorism was based on trying to replace the implicit knowledge of the workmen with mass-production methods developed, planned, monitored, and controlled by managers. “Under scientific management,” Taylor wrote, “the managers assume … the burden of gathering together all of the traditional knowledge which in the past has been possessed by the workmen and then of classifying, tabulating, and reducing this knowledge to rules, laws, formulae…. Thus all of the planning which under the old system was done by the workmen, must of necessity under the new system be done by management in accordance with the law of science.” According to Taylor, “It is only through enforced standardization of methods, enforced adoption of the best implements and working conditions, and enforced cooperation that this faster work can be assured. And the duty of enforcing the adoption of standards and enforcing this cooperation rests with management alone.”

The quest for numerical metrics of accountability is particularly attractive in cultures marked by low social trust. And mistrust of authority has been a leitmotif of American culture since the 1960s. Thus in politics, administration, and many other fields, numbers are valued precisely because they replace reliance on the subjective, experience-based judgments of those in power. The quest for metrics of accountability exerts its spell over those on both the political left and right. There is a close affinity between it and the populist, egalitarian suspicion of authority based on class, expertise, and background.

Philip K. Howard has argued that the decline of trust leads to a new mindset in which “[a]voiding human choice in public decisions is not just a theory … but a kind of theology…. Human choice is considered too dangerous.” As a consequence, “Officials no longer are allowed to act on their best judgment” or to exercise discretion, which is judgment about what the particular situation requires. The result is overregulation: an ever tighter web of rules, including the proliferation of rules within organizations. Often enough, metrics provide the tools for tightening that web. Over-measurement is a form of overregulation, just as mismeasurement is a form of misregulation.

When institutions are particularly large, complex, and made up of dissimilar parts, that comprehension is simply impossible. Those at the top face to a greater degree than most of us a cognitive constraint that confronts all of us: making decisions despite having limited time and ability to deal with information overload. Metrics are a tempting means of dealing with this “bounded rationality,” and engaging with matters beyond one’s comprehension.

Principal-agent theory conceives of organizations as networks of relationships between those with a given interest (the principals) and those hired to carry out that interest (the agents). The perspective is that of the principals, and the premise is that the interests of the agents may diverge from those of the principals.

Robert Gibbons, a professor of organizational economics at MIT, pointed out that in fact the principal (the owner of the firm, for example) profits from a variety of outputs from the agent (the employee), and that many of these outputs are not highly visible or measurable in any numerical sense. Organizations depend on employees engaging in mentoring and in teamwork, for example, which are often at odds with what the employees would do if their only interest were to maximize their measured performance for purposes of compensation. Thus, there is a gap between the measurable contribution and the actual, total contribution of the agent. As a result, rewarding measured performance (such as an increase in the division’s profits or a rise in the company’s stock price) may actually lead to the organization getting less of what it really needs from its employees.

The fixation on quantifiable goals so central to metric fixation, though often implemented by politicians and policymakers who proclaim their devotion to capitalism, replicates many of the intrinsic faults of the Soviet system. Just as Soviet bloc planners set output targets for each factory to produce, so do bureaucrats set measurable performance targets for schools, hospitals, police forces, and corporations. And just as Soviet managers responded by producing shoddy goods that met the numerical targets set by their overlords, so do schools, police forces, and businesses find ways of fulfilling quotas with shoddy goods of their own: by graduating pupils with minimal skills, or downgrading grand theft to misdemeanor-level petty larceny, or opening dummy accounts for bank clients.

For potential employers, degrees act as signals: they serve as a shorthand that allows employers to rank initial applicants for a job. Having completed high school signals a certain, modest level of intellectual competence as well as personality traits such as persistence. Finishing college is a signal of a somewhat higher level of each of these. In a society where a small minority successfully completes college, having a B.A. signals a certain measure of superiority. But the higher the percentage of people with a B.A., the lower its value as a sorting device. What happens instead is that jobs that once required only a high school diploma now require a B.A. That is not because the jobs have become more cognitively demanding or require a higher level of skill, but because employers can afford to choose from among the many applicants who hold a B.A., while excluding the rest. The result is both to depress the wages of those who lack a college degree, and to place many college graduates in jobs that don’t actually make use of the substance of their college education.

A convincing example of the potential virtues of medical metrics, also touted by Michael Porter, comes from the Geisinger Health System, a physician-led, not-for-profit, integrated system that serves some 2.6 million people in Pennsylvania, many of them rural and poor. Geisinger is a showcase for progressive healthcare in the United States.

The results are discussed with the larger staff, with an eye to learning from mistakes. This is an instance of diagnostic metrics. It provides data that can be used by a practitioner (physician), or internally within an institution (hospital), or shared among practitioners and institutions to discover what is working and what is not, and to use that information to improve performance.

Most studies of pay-for-performance, it noted, examined process and intermediate outcomes rather than final outcomes, that is, whether the patient recovered. “Overall,” it reports, “studies with stronger methodological designs were less likely to identify significant improvements associated with pay-for-performance programs. And identified effects were relatively small.” Nor was this finding new. Social scientists who studied pay-for-performance schemes in the public sector in the 1990s concluded that they were ineffective. Yet such schemes keep getting introduced: a triumph of hope over experience, or of consultants peddling the same old nostrums.

When metrics used for public rankings or pay-for-performance do affect outcomes, it is often in ways that are unintended and counterproductive. And whether productive or unproductive, they typically involve huge costs, costs that are rarely considered by the advocates of pay-for-performance or transparency metrics. Among the intrinsic problems of pay-for-performance (P4P) and public rankings is goal diversion. As a report from Britain notes, P4P programs “can reward only what can be measured and attributed, a limitation that can lead to less holistic care and inappropriate concentration of the doctor’s gaze on what can be measured rather than what is important.” The British P4P program led to lower quality of care for those medical conditions that were not part of the program. In short, it leads to “treating to the

Physician report cards create as many problems as they solve. Take the phenomenon of risk-aversion. Numerous studies have shown that cardiac surgeons became less willing to operate on severely ill patients in need of surgery after the introduction of publicly available metrics.

The phenomenon of risk-aversion means that some patients whose lives might be saved by a risky operation are simply never operated upon. But there is also the reverse problem, that of overly aggressive care to meet metric targets. Patients whose operations are not successful may be kept alive for the requisite thirty days to improve their hospital’s mortality data, a prolongation that is both costly and inhumane.
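The effect of the thirty-day cutoff can be made concrete with a little arithmetic. The Python sketch below is purely illustrative: the function name `thirty_day_mortality_rate` and the ten-patient cohorts are hypothetical and are not drawn from any hospital's records. It shows that moving a single death from day 29 to day 31 halves the reported rate while leaving the patients' actual outcomes unchanged.

```python
from typing import Optional, Sequence


def thirty_day_mortality_rate(
    days_to_death: Sequence[Optional[int]], window: int = 30
) -> float:
    """Share of patients who died within `window` days of surgery.

    Each entry is the number of days a patient survived after surgery,
    or None if the patient was still alive at follow-up.
    (Hypothetical data model, assumed for illustration only.)
    """
    if not days_to_death:
        raise ValueError("need at least one patient")
    deaths_in_window = sum(
        1 for d in days_to_death if d is not None and d <= window
    )
    return deaths_in_window / len(days_to_death)


# Ten hypothetical patients: eight survive, two die within thirty days.
unmanaged = [None] * 8 + [12, 29]
# Same eventual outcomes, but the second death falls just past the cutoff.
prolonged = [None] * 8 + [12, 31]

print(thirty_day_mortality_rate(unmanaged))  # 0.2
print(thirty_day_mortality_rate(prolonged))  # 0.1
```

Any metric defined by a hard threshold behaves this way: cases just beyond the line drop out of the count entirely, which is why the threshold itself becomes a target.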

As of 2015, about three-quarters of the reporting hospitals were penalized by Medicare. Tellingly, major teaching hospitals, which tend to see more difficult patients, were disproportionately affected. So were hospitals in poverty-stricken areas, where patients were less likely to be well taken care of (or to take care of themselves) after their initial discharge from the hospital. Attaining the goal of reduced readmissions depends not only on the steps that the hospital takes to educate the patient and provide necessary medications, but also on many factors over which the hospital has little control: the patient’s underlying physical and mental health, social support system, and behavior. Such factors point to another recurrent issue with medical metrics: hospitals serve very different patient populations, some of whom are more prone to illness and less able to take care of themselves once discharged. Pay-for-performance schemes try to compensate for this by what is known as “risk adjustment.” But calculations of the degree of risk are at least as prone to mismeasurement and manipulation as other metrics. In the end, hospitals that serve the most challenging patient population are most likely to be penalized. As in the case of schools punished for the poor performance of their students on standardized tests, by penalizing the least successful hospitals, performance metrics may end up exacerbating inequalities in the distribution of resources.

Metrics tend to be most successful for those interventions and outcomes that are almost entirely controlled by and within the organization’s medical system, as in the case of checklists of procedures to minimize central line–induced infections. When the outcomes are dependent upon more wide-ranging factors (such as patient behavior outside the doctor’s office and the hospital), they become more difficult to attribute to the efforts or failures of the medical system.

Kilcullen also warns against the use of all “input metrics,” that is, metrics that count what the army and its allies are doing, for these may be quite distinct from the outcomes of those actions: “Input metrics are indicators based on our own level of effort, as distinct from the effects of our efforts. For example, input metrics include numbers of enemy killed, numbers of friendly forces trained, numbers of schools or clinics built, miles of road completed, and so on. These indicators tell us what we are doing but not the effect we are having. To understand that effect, we need to look at output metrics (how many friendly forces are still serving three months after training, for example, or how many schools or clinics are still standing and in use after a year) or, better still, at outcome metrics. Outcome metrics track the actual and perceived effect of our actions on the population’s safety, security, and well-being.”

Coming up with useful metrics often requires an immersion in local conditions. Take, for example, the market price of exotic (i.e., nonlocal) vegetables, which few outsiders look to as a useful indicator of a population’s perceived peace and well-being. Kilcullen, however, explains why they might be helpful: “Afghanistan is an agricultural economy, and crop diversity varies markedly across the country. Given the free-market economics of agricultural production in Afghanistan, risk and cost factors—the opportunity cost of growing a crop, the risk of transporting it across insecure roads, the risk of selling it at market and of transporting money home again—tend to be automatically priced in to the cost of fruits and vegetables. Thus, fluctuations in overall market prices may be a surrogate metric for general popular confidence and perceived security. In particular, exotic vegetables—those grown outside a particular district that have to be transported further at greater risk in order to be sold in that district—can be a useful telltale marker.”

There are indeed circumstances when pay for measured performance fulfills that promise: when the work to be done is repetitive, uncreative, and involves the production or sale of standardized commodities or services; when there is little possibility of exercising choice over what one does; when there is little intrinsic satisfaction in it; when performance is based almost exclusively on individual effort, rather than that of a team; and when aiding, encouraging, and mentoring others is not an important part of the job.

Andrew Natsios, a distinguished public servant with long experience in international development, notes that the employees of government agencies in this field have “become infected with a very bad case of Obsessive Measurement Disorder, an intellectual dysfunction rooted in the notion that counting everything in government programs will produce better policy choices and improved management.” The emphasis on quantification leads to a neglect of programs with the longest-run potential benefits: those that improve the skills, knowledge, and norms of the civil service and judicial systems in underdeveloped nations. Those who suffer from Obsessive Measurement Disorder, Natsios writes, ignore “a central principle of development theory—that those development programs that are most precisely and easily measured are the least transformational, and those programs that are the most transformational are the least measureable.”

As Tom Daschle, the Democratic former majority leader of the Senate, has recently observed, the “idea that Washington would work better if there were TV cameras monitoring every conversation gets it exactly wrong…. The lack of opportunities for honest dialogue and creative give-and-take lies at the root of today’s dysfunction.” That is also why effective politicians must to some degree be two-faced, pursuing more flexibility in closed negotiations than in their public advocacy. Only when multiple compromises have been made and a deal has been reached can it be subjected to public scrutiny, that is, made transparent.

The ability to negotiate between couples or states often involves coming up with formulas that allow each side to save face or retain self-esteem, and that requires compromising principles, or ambiguity. The fact that allies spy on one another to a certain degree to determine intentions, capacities, and vulnerabilities is well known to practitioners of government. But it cannot be publicly acknowledged, since it represents a threat to the amour propre of other nations. Moreover, in domestic politics and in international relations as in interpersonal ones, there is a role for a certain amount of hypocrisy for practices that are tolerable and useful but that can’t be fully justified by international law and explicit norms. In short, to quote Moshe Halbertal once again: “A degree of legitimate concealment is necessary to maintain the state and its democratic institutions. Military secrets, techniques for fighting crime, intelligence gathering, and even diplomatic negotiations that will fall apart if they become exposed—all these domains have to stay shrouded in secrecy in order to allow the functioning of ordinary transparency in the other institutions of the state. Our transparent open conversation rests upon a rather extensive dark and hidden domain that insures its flourishing.”

We live in a world in which privacy is being eroded both through technology (the Internet) and a culture that proclaims the virtue of candor while dismissing the need for shame. In such a post-privacy society, people are inclined to overlook the value of secrecy. Thus, the power of “transparency” as a magic formula is such that its counterproductive effects are often ignored. “Sunlight is the best disinfectant” has become the credo of the new faith of Wikileakism: the belief that making public the internal deliberations of all organizations and governments will make the world a better place. But more often, the result is paralysis. Politicians forced to reveal their every action are unable to arrive at compromises that make legislation possible. Officials who need to fear that their internal deliberations will be made public are less positioned to make effective public policy. Intelligence agencies that require secrecy to gather information on the nation’s enemies are thwarted. In each case, transparency becomes the enemy of performance.

Workers who are rewarded for the accomplishment of measurable tasks reduce the effort devoted to other tasks. The result is that the metric means come to replace the organizational ends that those means ought to serve.

Rewarding individuals for measured performance diminishes the sense of common purpose as well as the social relationships that provide the unmeasurable motivation for cooperation and institutional effectiveness. Reward based on measured performance tends to promote not cooperation but competition. If the individuals or units respond to the incentives created, rather than aiding, assisting, and advising one another, they strive to maximize their own metrics, ignoring, or even sabotaging, their fellows.

When the objects to be measured are influenced by the process of measurement, measurement becomes less reliable. Measurement becomes much less reliable the more its object is human activity, since the objects—people—are self-conscious, and are capable of reacting to the process of being measured. And if rewards and punishments are involved, they are more likely to react in a way that skews the measurement’s validity. By contrast, the more they agree with the goals of those rewards, the more likely they are to react in a way that enhances the measurement’s validity.

In a school setting, for example, the degree to which parents request a particular teacher for their children is probably a useful indicator that the teacher is doing something right, whether or not the results show up on standardized tests.

Collecting data, processing it, analyzing it—all of these take time, and their expense is in the opportunity costs of the time put into them. To put it another way, every moment you or your colleagues or employees are devoting to the production of metrics is time not devoted to the activities being measured.

Accountability metrics are less likely to be effective when they are imposed from above, using standardized formulas developed by those far from active engagement with the activity being measured. Measurements are more likely to be meaningful when they are developed from the bottom up, with input from teachers, nurses, and the cop on the beat. That means asking those with the tacit knowledge that comes from direct experience to provide suggestions about how to develop appropriate performance standards. Try to involve a representative group of those who will have a stake in the outcomes. In the best of cases, they should continue to be part of the process of evaluating the measured data.

Insofar as individuals are agents out to maximize their own interests, there are inevitable drawbacks to all schemes of measured reward. If, as is currently still the case, doctors are remunerated based on the procedures they perform, that creates an incentive for them to perform too many procedures that have high costs but produce low benefits. But pay doctors based on the number of patients they see, and they have an incentive to see as many patients as possible, and to skimp on procedures that are time-consuming but potentially useful. Compensate them based on successful patient outcomes, and they are more likely to cream, avoiding the most problematic patients.

In the end, there is no silver bullet, no substitute for actually knowing one’s subject and one’s organization, which is partly a matter of experience and partly a matter of unquantifiable skill. Many matters of importance are too subject to judgment and interpretation to be solved by standardized metrics. Ultimately, the issue is not one of metrics versus judgment, but metrics as informing judgment, which includes knowing how much weight to give to metrics, recognizing their characteristic distortions, and appreciating what can’t be measured. In recent decades, too many politicians, business leaders, policymakers, and academic officials have lost sight of that.

