Kill It with Fire

Manage Aging Computer Systems (and Future Proof Modern Ones)

Marianne Bellotti

We build our computer systems the way we build our cities: over time, without a plan, on top of ruins.—Ellen Ullman

Legacy modernizations are hard not because they are technically hard—the problems and the solutions are usually well understood—it’s the people side of the modernization effort that is hard. Getting the time and resources to actually implement the change, building an appetite for change to happen and keeping that momentum, managing the intra-organizational communication necessary to move a system that any number of other systems connect to or rely upon—those things are hard.

382 ↱

progress in technology is not linear. It’s cyclical. We advance, but we advance slowly, while moving tangentially. We abandon patterns only to reinvent them later and sell them as completely new. Technology advances not by building on what came before, but by pivoting from it. We take core concepts from what exists already and modify them to address a gap in the market; then we optimize around filling in that gap until that optimization has aggregated all the people and use cases not covered by the new tech into its own distinct market that another “advancement” will capture. In other words, the arms race around data centers left smaller organizations behind and created a demand for the commercial cloud. Optimizing the cloud for customization and control created the market for managed platforms and eventually serverless computing. The serverless model will feed its consumers more and more development along its most appealing features until the edge cases where serverless approaches don’t quite fit start to find common ground among each other. Then a new product will come out that will address those needs.

457 ↱

Unix was (and is) proprietary software, and the GNU Project’s philosophy said that we should not use proprietary software. But, applying the same reasoning that leads to the conclusion that violence in self defense is justified, I concluded that it was legitimate to use a proprietary package when that was crucial for developing a free replacement that would help others stop using the proprietary package.

838 ↱

Overall, interfaces and ideas spread through networks of people, not based on merit or success. Exposure to a given configuration creates the perception that it’s easier and more intuitive, causing it to be passed down to more generations of technology. The lesson to learn here is the systems that feel familiar to people always provide more value than the systems that have structural elegances but run contrary to expectations.

934 ↱

Almost all production software is in such bad shape that it would be nearly useless as a guide to re-implementing itself. Now take this already bad picture, and extract only those products that are big, complex, and fragile enough to need a major rewrite, and the odds of success with this approach are significantly worse.

960 ↱

We know that past the upper bound of mere exposure, once people find a characteristic they do not like, they tend to judge every characteristic discovered after that more negatively. So programmers prefer full rewrites over iterating legacy systems because rewrites maintain an attractive level of ambiguity while the existing systems are well known and, therefore, boring. It’s no accident that proposals for full rewrites tend to include introducing some language, design pattern, or technology that is new to the engineering team. Very few rewrite plans take the form of redesigning the system using the same language or merely fixing a well-defined structural issue. The goal of full rewrites is to restore ambiguity and, therefore, enthusiasm. They fail because the assumption that the old system can be used as a spec and be trusted to have diagnosed accurately and eliminated every risk is wrong.

986 ↱

Another useful exercise to run when dealing with technical debt is to compare the technology available when the system was originally built to the technology we would use for those same requirements today. I employ this technique a lot when dealing with systems written in COBOL. For all that people talk about COBOL dying off, it is good at certain tasks. The problem with most old COBOL systems is that they were designed at a time when COBOL was the only option. If the goal is to get rid of COBOL, I start by sorting which parts of the system are in COBOL because COBOL is good at performing that task, and which parts are in COBOL because there were no other tools available. Once we have that mapping, we start by pulling the latter off into separate services that are written and designed using the technology we would choose for that task today.

1108 ↱

Tightly coupled and complex systems are prone to failure because the coupling produces cascading effects, and the complexity makes the direction and course of those cascades impossible to predict. If your goal is to reduce failures or minimize security risks, your best bet is to start by evaluating your system on those two characteristics: Where are things tightly coupled, and where are things complex? Your goal should not be to eliminate all complexity and all coupling; there will be trade-offs in each specific instance.
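One rough way to start answering "where are things tightly coupled?" is to count how far a failure in each component could cascade. The sketch below is only an illustration: the component names and the `DEPENDS_ON` map are hypothetical, and a real system would need its dependency data gathered from configuration, logs, or interviews.

```python
from collections import defaultdict, deque

# Hypothetical dependency map: each component lists what it calls directly.
DEPENDS_ON = {
    "web": ["auth", "billing"],
    "billing": ["ledger"],
    "auth": ["ledger"],
    "reports": ["billing"],
    "ledger": [],
}


def blast_radius(depends_on):
    """Map each component to every component that could be affected if it fails."""
    # Invert the map: for each component, who calls it directly?
    dependents = defaultdict(set)
    for component, targets in depends_on.items():
        for target in targets:
            dependents[target].add(component)

    radius = {}
    for component in depends_on:
        seen = set()
        queue = deque(dependents[component])
        # Breadth-first walk up the chain of callers.
        while queue:
            current = queue.popleft()
            if current in seen:
                continue
            seen.add(current)
            queue.extend(dependents[current])
        radius[component] = seen
    return radius


RADIUS = blast_radius(DEPENDS_ON)
for name, affected in sorted(RADIUS.items(), key=lambda item: -len(item[1])):
    print(f"{name}: {len(affected)} component(s) downstream of a failure")
```

A component with a large radius is a candidate for closer study, not automatically a candidate for change; the trade-offs still have to be weighed case by case.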

1226 ↱

Once you have identified the parts of the system where there is tight coupling and where there is complexity, study the role those areas have played in past problems. Will changing the ratio of complexity to coupling make those problems better or worse? A helpful way to think about this is to classify the types of failures you’ve seen so far. Problems that are caused by human beings failing to read something, understand something, or check something are usually improved by minimizing complexity. Problems that are caused by failures in monitoring or testing are usually improved by loosening the coupling (and thereby creating places for automated testing). Remember also that an incident can include both elements, so be thoughtful in your analysis. A human operator may have made a mistake to trigger an incident, but if that mistake was impossible to discover because the logs weren’t granular enough, minimizing complexity will not pay off as much as changing the coupling.

1246 ↱

When both observability and testing are lacking on your legacy system, observability comes first. Tests tell you only what won’t fail; monitoring tells you what is failing. Our new engineer had the freedom to alter huge swaths of the system because the work the team had done rolling out better monitoring meant when her changes were deployed, we could spot problems quickly.

1311 ↱

the blue-green technique involves running two components in parallel and slowly draining traffic off from one and over to the other. The big benefit to doing this is that it’s easy to undo if something goes wrong. Often with technology, increasing load reveals problems that were not otherwise found in testing. Legacy systems have both the blessing and the curse of an existing pool of users and activity. The system that replaces them has a narrow grace period with which to fix those mistakes discovered under high load. Blue-green deployments allow the new system to ease into the full load of the old system gradually, and you can fix problems before the load exacerbates them.
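The gradual drain and the easy undo can be sketched as a small loop. This is a simplified illustration, not a deployment tool: `shift_traffic`, the step percentages, the error threshold, and the `error_rate_at` callback (a stand-in for real monitoring) are all assumptions made for the example.

```python
def shift_traffic(steps, error_rate_at, max_error_rate=0.01):
    """Move traffic to the new (green) component one step at a time.

    steps: increasing percentages of traffic to send to green.
    error_rate_at: callable returning the observed green error rate
                   at a given percentage (stand-in for monitoring).
    Returns (final_green_percent, history).
    """
    history = []
    current = 0
    for percent in steps:
        observed = error_rate_at(percent)
        history.append((percent, observed))
        if observed > max_error_rate:
            # Undo: send all traffic back to the old (blue) component.
            return 0, history
        current = percent
    return current, history


# A green component that stays healthy at every load level.
healthy = shift_traffic([5, 25, 50, 100], lambda percent: 0.001)

# A green component whose problems only appear at half the load or more.
fragile = shift_traffic(
    [5, 25, 50, 100],
    lambda percent: 0.001 if percent < 50 else 0.05,
)

print(healthy[0])  # percent of traffic left on green
print(fragile[0])  # 0 means the shift was rolled back
```

The second run shows the point of the technique: the problem surfaces at partial load, and the rollback happens before the new component ever carries the full load of the old one.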

1376 ↱

Your plan will define what it means to modernize your legacy system, what the goals are, and what value will be delivered and when. Specifically, your plan should focus on answering the following questions: What problem are we trying to solve by modernizing? What small pragmatic changes will help us learn more about the system? What can we iterate on? How will we spot problems after we deploy changes?

1393 ↱

I tell my engineers that the biggest problems we have to solve are not technical problems, but people problems. Modernization projects take months, if not years of work. Keeping a team of engineers focused, inspired, and motivated from beginning to end is difficult. Keeping their senior leadership prepared to invest over and over on what is, in effect, something they already have is a huge challenge. Creating momentum and sustaining it are where most modernization projects fail.

1408 ↱

We struggle to modernize legacy systems because we fail to pay the proper attention and respect to the real challenge of legacy systems: the context has been lost. We have forgotten the web of compromises that created the final design and are blind to the years of modifications that increased its complexity. We don’t realize that at least a few design choices were bad choices and that it was only through good luck the system performed well for so long. We oversimplify and ultimately commit to new challenges before we discover our mistakes.

1435 ↱

Most web development projects, for example, run on Linux machines. Therefore, it is not uncommon for web applications to include shell scripts as part of their code base, particularly as part of the setup/installation routine. Imagine what migrating those applications would feel like 20 years in the future if Linux were supplanted by a different operating system. We would potentially have to rewrite all the shell scripts as well as migrate the actual application.

1501 ↱

but the growth of the platform as a service (PaaS) market for commercial cloud is increasing the options to program for specific platform features. For example, the more you build things with Amazon’s managed services, the more the application will conform to fit Amazon-specific characteristics, and the more overgrowth there will be to contend with if the organization later wants to migrate away.

1555 ↱

Assuming you fully understand the requirements because an existing system is operational is a critical mistake. One of the advantages of building a new system is that the team is more aware of the unknowns. Existing systems can be a distraction. The software team treats the full-featured implementation of it as the MVP, no matter how large or how complex that existing system actually is. It’s simply too much information to manage. People become overwhelmed, and they get discouraged and demoralized. The project stalls and reinforces the notion that the modernization work is impossible.

1654 ↱

Legacy modernization projects go better when the individuals contributing to them feel comfortable being autonomous and when they can adapt to challenges and surprises as they present themselves because they understand what the priorities are. The more decisions need to go up to a senior group—be that VPs, enterprise architects, or a CEO—the more delays and bottlenecks appear. The more momentum is lost, and people stop believing success is possible. When people stop believing success is possible, they stop bringing their best to work. Measurable problems empower team members to make decisions. Everyone has agreed that metric X needs to be better; any actions taken to improve metric X need not be run up the chain of command. Measurable problems create clearly articulated goals. Having a goal means you can define what kind of value you expect the project to add and whom that value will benefit most.

1673 ↱

A great meeting is not a meeting where no one ever mentions anything out of scope; it’s one where out-of-scope comments are quickly identified as such by the team and dispatched before they have derailed the conversation.

1763 ↱

A quick trick when two capable engineers cannot seem to agree on a decision is to ask yourself what each one is optimizing for with their suggested approach. Remember, technology has a number of trade-offs where optimizing for one characteristic diminishes another important characteristic. Examples include security versus usability, coupling versus complexity, fault tolerance versus consistency, and so on, and so forth. If two engineers really can’t agree on a decision, it’s usually because they have different beliefs about where the ideal optimization between two such poles is. Looking for absolute truths in situations that are ambiguous and value-based is painful. Sometimes it helps just to highlight the fact that the disagreement is really over what to optimize for, rather than pure technical correctness.

1766 ↱

When there’s a responsibility gap, the organization has a blind spot. Debt collects, vulnerabilities go unpatched, and institutional knowledge is gradually lost.

1998 ↱

Neal Ford, director and software architect at ThoughtWorks, had a saying I’m fond of repeating to engineers on my teams: “Metawork is more interesting than work.” Left to their own devices, software engineers will almost invariably over-engineer things to tackle bigger, more complex, long-view problems instead of the problems directly in front of them. For example, engineering teams might take a break from working on an application to write a scaffolding tool for future applications. Rather than writing SQL queries, teams might write their own object relational mapping (ORM). Rather than building a frontend, teams might build a design system with every form component they might ever need perfectly styled.

2096 ↱

Decisions motivated by wanting to avoid rewriting code later are usually bad decisions. In general, any decision made to please or impress imagined spectators with the superficial elegance of your approach is a bad one. If you’re coming into a project where team members are fixing something that isn’t broken, you can be sure they are doing so because they are afraid of the way their product looks to other people. They are ashamed of their working, successful technology, and you have to figure out how to convince them not to be ashamed so that they can focus on fixing things that are actually broken.

2101 ↱

Breaking up the monolith into services that roughly correspond to what each team owns means that each team can control its own deploys. Development speeds up. Add a layer of complexity in the form of formal, testable API specs, and the system can facilitate communication between those teams by policing how they are allowed to change downstream interactions.

2139 ↱

The only thing worse than fixing the wrong thing is leaving an attempt to fix the wrong thing unfinished. Half-finished initiatives create confusing, poorly documented, and harder to maintain systems.

2154 ↱

A former colleague of mine and an experienced engineer from Google used to like to say, “Anything over four nines is basically a lie.” The more nines you are trying to guarantee, the more risk-averse engineering teams will become, and the more they will avoid necessary improvements. Remember, to get five nines or more, they have only seconds to respond to incidents. That’s a lot of pressure. SLAs/SLOs are valuable because they give people a budget for failure. When organizations stop aiming for perfection and accept that all systems will occasionally fail, they stop letting their technology rot for fear of change and invest in responding faster to failure. That’s the idea anyway. Some organizations can’t be talked out of wanting five or even six nines of availability. In those cases, mean time to recovery (MTTR) is a more useful statistic to push than reliability. MTTR tracks how long it takes the organization to recover from failure.
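The "budget for failure" behind each additional nine is simple arithmetic, and working it out makes the pressure concrete. The sketch below assumes a 365-day year and a 30-day month; the function names and the sample recovery times are made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # ignores leap years
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # assumes a 30-day month


def downtime_budget_seconds(nines, period_seconds=SECONDS_PER_YEAR):
    """Allowed downtime for an availability target of `nines` nines.

    Three nines means 99.9% availability, so 0.1% of the period
    may be spent down.
    """
    unavailability = 10 ** -nines
    return period_seconds * unavailability


def mean_time_to_recovery(recovery_durations):
    """Average time, in the same unit as the inputs, to recover from an incident."""
    return sum(recovery_durations) / len(recovery_durations)


for nines in (3, 4, 5, 6):
    per_year = downtime_budget_seconds(nines)
    per_month = downtime_budget_seconds(nines, SECONDS_PER_MONTH)
    print(f"{nines} nines: {per_year:,.1f} s/year, {per_month:,.1f} s/month")

# Hypothetical incident recovery times, in seconds.
print(mean_time_to_recovery([120, 300, 180]))
```

Under these assumptions, five nines leaves roughly 315 seconds of downtime for an entire year, or about 26 seconds in a month, which is less than most teams need just to notice an alert.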

2227 ↱

The way a murder board works is you put together a panel of experts who will ask questions, challenge assumptions, and attempt to poke holes in a plan or proposal put in front of them by the person or group the murder board exercise is intended to benefit. It’s called a murder board because it’s supposed to be combative. The experts aren’t just trying to point out flaws in the proposal; they are trying to outright murder the ideas. Murder boards are one of those techniques that are really appropriate only in specific circumstances. To be a productive and beneficial exercise, it is essential that the murder board precedes an extremely stressful event. Murder boards have two goals. The first is to prepare candidates for a stressful event by making sure they have an answer for every question, a response to every concern, and mitigation strategy for every foreseeable problem. The second goal of a murder board is to build candidates’ confidence. If they go into the stressful event knowing that they survived the murder board process, they will know that every aspect of their plan or testimony has been battle-tested.

2434 ↱

The US Army/Marine Corps Counterinsurgency Field Manual put it best when it advised soldiers: “Planning is problem solving, while design is problem setting.” Problem-solving versus problem-setting is the difference between being reactive and being responsive. Reactive teams jump around aimlessly. Setbacks whittle away their confidence and their ability to coordinate. Momentum is hard to maintain. Responsive teams, on the other hand, are calmer and more thoughtful. They’re able to sort new information as it becomes available into different scopes and contexts. They’re able to change approaches without affecting their confidence, because design thinking gives them insight into why the change happened in the first place.

2484 ↱

This exercise asks team members to map out how much they can do on their own to move the project toward achieving its goals. What are they empowered to do? What blockers do they foresee, and when do they think they become relevant? How far can they go without approval, and who needs to grant that approval when the time comes?

2616 ↱

as Conway put it: “The greatest single common factor behind many poorly designed systems now in existence has been the availability of a design organization in need of work.”

2681 ↱

Organizations end up with patchwork solutions because the tech community rewards explorers. Being among the first with tales of documenting, experimenting, or destroying a piece of technology builds an individual’s prestige. Pushing the boundaries of performance by adopting something new and innovative builds it even more so. Software engineers are incentivized to forgo tried and true approaches in favor of new frontiers. Left to their own devices, software engineers will proliferate tools, ignoring feature overlaps for the sake of that one thing tool X does better than tool Y that is relevant only in that specific situation.

2704 ↱

One of the benefits of microservices, for example, is that it allows many teams to contribute to the same system independently from one another. Whereas a monolith would require coordination in the form of code reviews—a personal, direct interaction between colleagues—service-oriented architecture scales the same guarantees with process. Engineers document contracts and protocols; automation is applied to ensure that those contracts are not violated, and it prescribes a course of action if they are. For that reason, engineers who want to “jump ahead” and build something with microservices from the beginning often struggle. The level of complexity and abstraction is out of sync with the communication patterns of the organization.
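The idea of documenting a contract and letting automation police it can be shown with a very small check. This is a minimal sketch under stated assumptions: the `CONTRACT` fields and the `contract_violations` helper are hypothetical, and real teams would more likely use an API specification format and its tooling than a hand-written dictionary.

```python
# Hypothetical contract: the fields a downstream team relies on, with their types.
CONTRACT = {"order_id": str, "total_cents": int, "currency": str}


def contract_violations(contract, payload):
    """Return a list of human-readable problems; an empty list means the payload conforms."""
    problems = []
    for field, expected_type in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            actual = type(payload[field]).__name__
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {actual}"
            )
    return problems


good = {"order_id": "A1", "total_cents": 1250, "currency": "USD"}
bad = {"order_id": "A1", "total_cents": "12.50"}

print(contract_violations(CONTRACT, good))
print(contract_violations(CONTRACT, bad))
```

Run in a build pipeline, a check like this replaces the conversation that a code review would have forced, which is why it only helps once the organization already communicates in terms of such contracts.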

2779 ↱

The reorg is the matching misused tool of the full rewrite. As the software engineer gravitates toward throwing everything out and starting over to project confidence and certainty, so too does the software engineers’ manager gravitate toward the reorg to fix all manner of institutional ills. And like a full rewrite, sometimes this is the appropriate strategy, but it is not nearly the right strategy as often as it is used. Reorgs are incredibly disruptive. They are demoralizing. They send the message to rank and file engineers that something is wrong—they built the wrong thing or the product they built doesn’t work or the company is struggling. It increases workplace anxiety and decreases productivity. The fact that reorgs almost always end up with a few odd people out who are subsequently let go exacerbates the issue.

2830 ↱

Conway’s law is ultimately about communication and incentives. The incentive side can be covered by giving people a pathway to prestige and career advancement that complements the modernization effort. The only way to design communication pathways is actually to give people something to communicate about. In each case, we allow the vision for the new organization to reveal itself by designing structures that encourage new communication pathways to form in response to our modernization challenges. As the work continues, those communication pathways begin to solidify, and we can begin documentation and formalizing new teams or roles. In this way, we sidestep the anxiety of reorganizing. The workers determine where they belong based on how they adapt to problems; workers typically left out are given time and space to learn new skills or prove themselves in different roles, and by the time the new organization structure is ratified by leadership, everyone already has been working that way for a couple months.

2919 ↱

To have air cover was to have confidence that the organization would help your team survive such inevitable breakages. It was to have someone who trusted and understood the value of change and could protect the team. As a team lead, my job was to secure that air cover. When I moved back to the private sector, I applied the same principles as a manager—networking, relationship building, recruiting, doing favors—so I could give my team members the safety and security necessary to do the hard jobs for which I had hired them.

3016 ↱

What colleagues pay attention to are the real values of an organization. No matter how passionate or consistent the messaging, attention from colleagues will win out over the speeches.

3040 ↱

the occasional outage and problem with a system—particularly if it is resolved quickly and cleanly—can actually boost the user’s trust and confidence. The technical term for this effect is the service recovery paradox.

3155 ↱

What kind of scenarios justify breaking things on purpose? The most common one when dealing with legacy systems is loss of institutional memory. On any old system, one or two components exist that no one seems to know exactly what they do. If you are seeking to minimize the system’s complexity and restore context, such knowledge gaps can’t just be ignored. Mind you, the situations when you can’t figure out what a component is doing from studying logs or digging up old documentation tend to be rare, but they do happen. Provided the system doesn’t control nuclear weapons, turning the component off and seeing what breaks is a tool that should be available when all other avenues are exhausted.

3177 ↱

You can justify any failure test the same way. Is it better to wait for something to fail and hope you have the right resources and expertise at the ready? Or is it better to trigger failure at a time when you can plan resources, expertise, and impact in advance? You don’t know that something doesn’t work the way you intended it to until you try it.

3193 ↱

Two types of impact are relevant to failure tests. The first is technical impact: the likelihood of cascading failures, data corruption, or dramatic changes in security or stability. The second is user impact: How many people are negatively affected and to what degree?

3197 ↱

You should have some idea of how different parts of the system are coupled and where the complexity is in your system from exercises in previous chapters; now you’ll want to put that information into a model of potential failures. A good way to start is with a 4 + 1 architectural view model. Developed by Philippe Kruchten, a 4 + 1 architectural view model breaks an architectural model into separate diagrams that reflect the concerns of one specific viewpoint.

3200 ↱

Bullet journaling is effective for me because each page is a snapshot of everything that is on my mind at the time. I record work projects, personal projects, events and social activities, holidays, and illnesses. Anything that I expect to take up a large part of the day, I write down. Looking back on a project’s progress with that information gives it a sense of context that I hadn’t considered before. Once I consider it, I realize that I have not been standing in one place mindlessly banging my head against a wall. Bit by bit, piece by piece, I have made things better.

3461 ↱

postmortems are not specific to failure. If your modernization plan includes modeling your approach after another successful project, consider doing a postmortem on that project’s success instead. Remember that we tend to think of failure as bad luck and success as skill. We do postmortems on failure because we’re likely to see them as complex scenarios with a variety of contributing factors. We assume that success happens for simple, straightforward reasons. In reality, success is no more or less complex than failure. You should use the same methodology to learn from success that you use to learn from failure.

3470 ↱

Postmortems establish a record about what really happened and how specific actions affected the outcome. They do not document failure; they provide context. Postmortems on success should serve a similar purpose. Why was a specific approach or technique successful? Did the final strategy look like what the team had planned at the start? Your timeline in a postmortem for success should be built around these questions: How did the organization execute on the original strategy, how did the strategy change, when did those changes happen, and what triggered them? Even the biggest successes have challenges that could have gone better and places where good fortune saved the day. Documenting those helps people evaluate the suitability of your approach for their own problems and ultimately reproduce your success.

3504 ↱

postmortem’s key questions. What went well? What could have gone better? Where did you get lucky?

3513 ↱

As the helpful bits of GPS were not yet market-ready, consumers were more sensitive to the privacy concerns of the technology. In 1997, employees for United Parcel Service (UPS) famously went on strike after UPS tried to install GPS receivers in all of their trucks.

3716 ↱

one of the many forces changing the speed of the earth’s rotation is climate change. Ice weighs down the land masses on Earth, and when it melts, those land masses start to drift up toward the poles. This makes the earth spin faster and days fractions of a second shorter.

3728 ↱

automation is beneficial when it’s clear who is responsible for the automation working in the first place and when failure states give users enough information to understand how they should triage the issue. Automation that encourages people to forget about it creates responsibility gaps. Automation that fails either silently or with unclear error messages at best wastes a lot of valuable engineering time and at worst triggers unpredictable and dangerous side effects.
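One way to picture automation that neither fails silently nor hides its owner is a job runner that attaches both to every failure. The sketch below is illustrative only: `AutomationError`, `run_job`, the job name, the owner, and the suggested next action are all invented for the example.

```python
class AutomationError(Exception):
    """A failure that says who owns the job, what broke, and what to do next."""

    def __init__(self, job, owner, step, cause, next_action):
        self.job = job
        self.owner = owner
        self.step = step
        self.next_action = next_action
        super().__init__(
            f"job '{job}' failed at step '{step}': {cause}. "
            f"Owner: {owner}. Suggested next action: {next_action}"
        )


def run_job(job, owner, steps):
    """Run (name, action, next_action) steps in order; never swallow a failure."""
    for name, action, next_action in steps:
        try:
            action()
        except Exception as cause:
            # Re-raise with the context a responder needs to triage.
            raise AutomationError(job, owner, name, cause, next_action) from cause
    return f"{job}: {len(steps)} step(s) completed"


def rotate_logs():
    pass  # placeholder for real work


def upload_archive():
    raise ConnectionError("storage endpoint unreachable")


steps = [
    ("rotate", rotate_logs, "check disk space on the host"),
    ("upload", upload_archive, "verify storage credentials, then rerun"),
]

try:
    run_job("nightly-archive", "platform-team", steps)
except AutomationError as error:
    failure_message = str(error)
    print(failure_message)
```

The design choice worth noting is that the owner and the next action are required arguments, so a job cannot be registered without someone deciding who is responsible for it.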

3783 ↱

In summary, systems age in two different ways. Their usage patterns change, which requires them to be scaled up and down, or the resources that back them deteriorate up to the point where they fail. Legacy modernizations themselves are anti-patterns. A healthy organization running a healthy system should be able to evolve it over time without rerouting resources to a formal modernization effort.

3899 ↱

Organizations choose to keep the bus moving as fast as possible because they can’t see all the feedback loops. Shipping new code gets attention, while technical debt accrues silently and without fanfare. It’s not the age of a system that causes it to fail, but the pressure of what the organization has forgotten about it slowly building toward an explosion.

3912 ↱

The hard part about legacy modernization is the system around the system. The organization, its communication structures, its politics, and its incentives are all intertwined with the technical product in such a way that to move the product, you must do it by turning the gears of this other, complex, undocumented system.

3929 ↱

The best way to handle legacy modernization projects is not to need them in the first place. If the appropriate time and resources are budgeted for it, systems can be maintained in such a way that they evolve with the industry. The organizations that accomplish this ultimately understand that the organization’s scale is the upper bound of system complexity. Systems that are more complex than the team responsible for them can maintain are neglected and eventually fall apart.

3998 ↱
