Accelerate

The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations

Forsgren PhD, Jez Humble, Gene Kim

At the end of an iteration, the total number of points signed off by the customer is recorded—this is the team’s velocity. Velocity is designed to be used as a capacity planning tool; for example, it can be used to extrapolate how long it will take the team to complete all the work that has been planned and estimated.

471 ↱

Organizational culture can exist at three levels in organizations: basic assumptions, values, and artifacts (Schein 1985). At the first level, basic assumptions are formed over time as members of a group or organization make sense of relationships, events, and activities. These interpretations are the least “visible” of the levels—and are the things that we just “know,” and may find difficult to articulate, after we have been long enough in a team. The second level of organizational culture are values, which are more “visible” to group members as these collective values and norms can be discussed and even debated by those who are aware of them. Values provide a lens through which group members view and interpret the relationships, events, and activities around them. Values also influence group interactions and activities by establishing social norms, which shape the actions of group members and provide contextual rules (Bansal 2003). These are quite often the “culture” we think of when we talk about the culture of a team and an organization. The third level of organizational culture is the most visible and can be observed in artifacts. These artifacts can include written mission statements or creeds, technology, formal procedures, or even heroes and rituals (Pettigrew 1979).

670 ↱

In 1988, he developed a typology of organizational cultures (Westrum 2014): Pathological (power-oriented) organizations are characterized by large amounts of fear and threat. People often hoard information or withhold it for political reasons, or distort it to make themselves look better. Bureaucratic (rule-oriented) organizations protect departments. Those in the department want to maintain their “turf,” insist on their own rules, and generally do things by the book—their book. Generative (performance-oriented) organizations focus on the mission. How do we accomplish our goal? Everything is subordinated to good performance, to doing what we are supposed to do. Westrum’s further insight was that the organizational culture predicts the way information flows through an organization. Westrum provides three characteristics of good information: It provides answers to the questions that the receiver needs answered. It is timely. It is presented in such a way that it can be effectively used by the receiver.

683 ↱

accident investigations that stop at “human error” are not just bad but dangerous. Human error should, instead, be the start of the investigation. Our goal should be to discover how we could improve information flow so that people have better or more timely information, or to find better tools to help prevent catastrophic failures following apparently mundane operations.

791 ↱

If you want to improve your culture, implementing CD practices will help. By giving developers the tools to detect problems when they occur, the time and resources to invest in their development, and the authority to fix problems straight away, we create an environment where developers accept responsibility for global outcomes such as quality and stability. This has a positive influence on the group interactions and activities of team members’ organizational environment and culture.

885 ↱

In The Visible Ops Handbook, unplanned work is described as the difference between “paying attention to the low fuel warning light on an automobile versus running out of gas on the highway” (Behr et al. 2004). In the first case, the organization can fix the problem in a planned manner, without much urgency or disruption to other scheduled work. In the second case, they must fix the problem in a highly urgent manner, often requiring all hands on deck—

931 ↱

High-performing teams were more likely to incorporate information security into the delivery process. Their infosec personnel provided feedback at every step of the software delivery lifecycle, from design through demos to helping with test automation. However, they did so in a way that did not slow down the development process, integrating security concerns into the daily work of teams. In fact, integrating these security practices contributed to software delivery performance.

986 ↱

Although in most cases the type of system you are building is not important in terms of achieving high performance, two architectural characteristics are. Those who agreed with the following statements were more likely to be in the high-performing group: We can do most of our testing without requiring an integrated environment. 1 We can and do deploy or release our application independently of other applications/ services it depends on.

1044 ↱

This connection between communication bandwidth and systems architecture was first discussed by Melvin Conway, who said, “organizations which design systems . . . are constrained to produce designs which are copies of the communication structures of these organizations” (Conway 1968). Our research lends support to what is sometimes called the “inverse Conway Maneuver,” 2 which states that organizations should evolve their team and organizational structure to achieve the desired architecture. The goal is for your architecture to support the ability of teams to get their work done—from design through to deployment—without requiring high-bandwidth communication between teams. Architectural approaches that enable this strategy include the use of bounded contexts and APIs as a way to decouple large domains into smaller, more loosely coupled units, and the use of test doubles and virtualization as a way to test services or components in isolation.

1062 ↱

We found that when teams “shift left” on information security—that is, when they build it into the software delivery process instead of making it a separate phase that happens downstream of the development process—this positively impacts their ability to practice continuous delivery. This, in turn, positively impacts delivery performance. What does “shifting left” entail? First, security reviews are conducted for all major features, and this review process is performed in such a way that it doesn’t slow down the development process. How can we ensure that paying attention to security doesn’t reduce development throughput? This is the focus of the second aspect of this capability: information security should be integrated into the entire software delivery lifecycle from development through operations. This means infosec experts should contribute to the process of designing applications, attend and provide feedback on demonstrations of the software, and ensure that security features are tested as part of the automated test suite. Finally, we want to make it easy for developers to do the right thing when it comes to infosec. This can be achieved by ensuring that there are easy-to-consume, preapproved libraries, packages, toolchains, and processes available for developers and IT operations.

1143 ↱

In order to reduce the time and cost taken to deliver federal information systems, a small team of civil servants at 18F created a platform as a service called cloud.gov based on an open-source version of Pivotal’s Cloud Foundry, hosted on Amazon Web Services. Most of the controls in systems hosted on cloud.gov—269 of the 325 required for a moderate-impact information system—are taken care of at the platform level. Systems hosted on cloud.gov can go from dev complete to live in weeks, not months. This significantly reduces the amount of work—and thus cost—needed to implement the requirements of the Risk Management Framework. Read more at https:// 18f.gsa.gov/ 2017/ 02/ 02/ cloud-gov-is-now-fedramp-authorized/.

1165 ↱

We investigated further the case of approval by an external body to see if this practice correlated with stability. We found that external approvals were negatively correlated with lead time, deployment frequency, and restore time, and had no correlation with change fail rate. In short, approval by an external body (such as a manager or CAB) simply doesn’t work to increase the stability of production systems, measured by the time to restore service and change fail rate. However, it certainly slows things down. It is, in fact, worse than having no change approval process at all. Our recommendation based on these results is to use a lightweight change approval process based on peer review, such as pair programming or intrateam code review, combined with a deployment pipeline to detect and reject bad changes. This process can be used for all kinds of changes, including code, infrastructure, and database changes.

1238 ↱

In regulated industries, segregation of duties is often required either explicitly in the wording of the regulation (for instance, in the case of PCI DSS) or by auditors. However, implementing this control does not require the use of a CAB or separate operations team. There are two mechanisms which can be effectively used to satisfy both the letter and the spirit of this control. First, when any kind of change is committed, somebody who wasn’t involved in authoring the change should review it either before or immediately following commit to version control. This can be somebody on the same team. This person should approve the change by recording their approval in a system of record such as GitHub (by approving the pull request) or a deployment pipeline tool (by approving a manual stage immediately following commit). Second, changes should only be applied to production using a fully automated process that forms part of a deployment pipeline. 1 That is, no changes should be able to be made to production unless they have been committed to version control, validated by the standard build and test process, and then deployed through an automated process triggered through a deployment pipeline. As a result of implementing a deployment pipeline, auditors will have a complete record of which changes have been applied to which environments, where they come from in version control, what tests and validations have been run against them, and who approved them and when. A deployment pipeline is, thus, particularly valuable in the context of safety-critical or highly regulated industries.

1246 ↱

If you want to know how your team is doing, just ask your team how painful deployments are and what specific things are causing that pain. In particular, be aware that if deployments have to be performed outside of normal business hours, that’s a sign of architectural problems that should be addressed. It’s entirely possible—given sufficient investment—to build complex, large-scale distributed systems which allow for fully automated deployments with zero downtime.

1375 ↱

In order to reduce deployment pain, we should: Build systems that are designed to be deployed easily into multiple environments, can detect and tolerate failures in their environments, and can have various components of the system updated independently Ensure that the state of production systems can be reproduced (with the exception of production data) in an automated fashion from information in version control

1390 ↱

Burnout is physical, mental, or emotional exhaustion caused by overwork or stress—but it is more than just being overworked or stressed. Burnout can make the things we once loved about our work and life seem insignificant and dull. It often manifests itself as a feeling of helplessness, and is correlated with pathological cultures and unproductive, wasteful work. The consequences of burnout are huge—for individuals and for their teams and organizations. Research shows that stressful jobs can be as bad for physical health as secondhand smoke (Goh et al. 2015) and obesity (Chandola et al. 2006). Symptoms of burnout include feeling exhausted, cynical, or ineffective; little or no sense of accomplishment in your work; and feelings about your work negatively affecting other aspects of your life.

1399 ↱

Christina Maslach, a professor of psychology at the University of California at Berkeley and a pioneering researcher on job burnout, found six organizational risk factors that predict burnout (Leiter and Maslach 2008): 3 Work overload: job demands exceed human limits. Lack of control: inability to influence decisions that affect your job. Insufficient rewards: insufficient financial, institutional, or social rewards. Breakdown of community: unsupportive workplace environment. Absence of fairness: lack of fairness in decision-making processes. Value conflicts: mismatch in organizational values and the individual’s values. Maslach found that most organizations try to fix the person and ignore the work environment, even though her research shows that fixing the environment has a higher likelihood of success.

1420 ↱

Being a leader doesn’t mean you have people reporting to you on an organizational chart—leadership is about inspiring and motivating those around you. A good leader affects a team’s ability to deliver code, architect good systems, and apply Lean principles to how the team manages its work and develops products.

1659 ↱

Loading highlights…