The Engineering Design Sequence

What are the different kinds of design meetings you might encounter? The Brainstorming meeting is one of the most important, but being prepared for all of the different design meetings and processes in the product lifecycle will help you anticipate the twists and turns of professional life.

1. Requirements Gathering and Inception: In this meeting you will define the problem statement: Who is the customer? What does the customer need? What is the timeline for delivery? The answers to these questions will evolve over time, but they must start from somewhere. This is the prompt for the design ideation meeting.

2. Design and Ideation Meeting: This is the subject of the previous letter. Every member of the team prepares for this meeting based on the initial profile of the customer discussed in the Requirements Gathering meeting. This meeting is within the team, and should be at most two hours long to start.

3. Design Preview: The Ideation meeting produces a few candidate designs that are worthy of further discussion. The “Design Preview” is a rapid written summary of these designs that is sent out to the broader engineering community within the venture. This gives a wide range of people outside of the immediate team an opportunity to offer high-level feedback on the various ideas. Here, other engineers can suggest technologies that may have been missed or point out unforeseen pitfalls. The Preview should be published on the day of the Ideation meeting or the day after, and the commenting period may be about a week long. The purpose of this process is to prevent the team from investing too much in the development of specific designs before many people have had the chance to mention problems that may have been overlooked. Distributing these ideas and soliciting feedback also builds a more closely integrated engineering community.

4. High Level Design Review: A synchronous review of a design document that covers all of the major aspects at a summary level. The sections will likely include:

  1. Problem Statement, Customer Needs
  2. Major design constraints
  3. A component diagram of the most important modules of the system
    1. This shows how the different modules interact on the happy path.
  4. What are the APIs?
    1. What is the cost per million operations for each of the APIs? What are the load projections for the first year? In five years?
  5. What cloud services will be used, both public offerings (such as AWS) and internal dependencies?
  6. How do the servers scale? Could the service use a cellular architecture?
  7. Monitoring, logging, and archival strategy
  8. Security strategy
    1. What is the most sensitive data that this can handle? Will any safety-critical processes be impacted if this system is compromised?

The Design Preview was a sketch of the product, and this is a full picture. It should be possible to read this document in less than 30 minutes so that there is another 30-60 minutes in the meeting for discussion and review of comments. The reviewers for this High Level Design (HLD) should be a smaller set of engineers who have built similar services but are on different teams.

The HLD will not be a complete specification of the system. It serves as a starting point for understanding how all of the pieces fit together. The designers must begin to consider each aspect of the system to write the HLD, and the reviewers have a chance to offer guiding feedback on these choices.
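The cost and load questions in the API section of the HLD come down to simple arithmetic. As a rough sketch, with every number and function name here being a hypothetical placeholder rather than a recommendation, the first-year and five-year projections might be estimated like this:

```python
def monthly_cost(requests_per_month: float, cost_per_million: float) -> float:
    """Cost for one API at a given monthly request volume."""
    return requests_per_month / 1_000_000 * cost_per_million


def projected_load(initial_per_month: float, annual_growth: float, years: int) -> float:
    """Monthly request volume after compounding annual growth for `years` years."""
    return initial_per_month * (1 + annual_growth) ** years


# Hypothetical inputs: 50M requests/month at launch, 0.40 per million
# operations, and load doubling every year (annual_growth = 1.0).
for year in (1, 5):
    load = projected_load(50_000_000, annual_growth=1.0, years=year)
    cost = monthly_cost(load, cost_per_million=0.40)
    print(f"year {year}: {load:,.0f} requests/month, {cost:,.2f} per month")
```

The growth model is the assumption most worth challenging in review: a constant annual rate is convenient, but real adoption is rarely that smooth.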

5. Low-Level Design Review(s): This is an asynchronous review of the most important algorithms, database schemata, call & retry patterns, state machines, and timing diagrams of the service. The Low-Level Design (LLD) includes the unhappy paths in the system, such as how various dependency failures will be handled. The algorithms should be included as pseudocode. These design documents should allow line-by-line commenting.
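To make the level of detail concrete, here is one call & retry pattern of the kind such a document might specify. This is a minimal sketch under assumptions, not a prescribed design: `DependencyError`, `call_with_retry`, and the default limits are all hypothetical names and values chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class DependencyError(Exception):
    """Hypothetical transient failure raised by a downstream dependency."""


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying transient failures with capped, jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except DependencyError:
            if attempt == max_attempts:
                raise  # unhappy path: retries exhausted, surface the failure
            # Exponential backoff capped at max_delay, with full jitter so
            # that many callers do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```

Reviewers can then comment line by line on the choices that matter: which errors count as transient, how many attempts are allowed, and what the caller does once the retries are exhausted.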

Engineers will be able to work from this LLD to create individual stories (discrete units of implementation). Because of the level of detail and effort required to make this design, it is not necessary for every component to be complete before starting the asynchronous review. For example, there might be a separate review of the database schemata, which could even have its own meeting.

6. Invalidated Assumption Response Meeting: “No plan survives first contact with the enemy.” In the practice of engineering, the laws of nature do not act adversarially as a military foe does, so plans survive longer, but they are never perfect. All designs make assumptions about how the world works, whether about customer behavior, the structure of internal service dependencies, or the average runtime performance of an algorithm on some distribution of data. Inevitably, something goes wrong. One should be psychologically prepared not only to adapt to invalidated assumptions, but also to actively seek evidence that confirms or refutes each assumption, with priority given to those that are the least certain.

In this meeting, the engineering team brainstorms options for dealing with these invalidated assumptions. This might require rewriting large portions of the service again, depending on the severity of the problem. Engineers must start from a reconceptualization of the product: if we knew at the beginning what we know now, how would we have designed it? Because of the demands of delivering the product, it might not be possible to completely rewrite everything, or start from scratch. But if you start from the ideal state and work towards what is feasible within the project timeline, you will come to a better solution than if you attempt to make the quickest patch to the design without an idea of how to do it the right way. The new conceptualization will also serve as a starting point for a re-architecture when the service encounters scaling challenges.

As the engineers on a team become more familiar with a design space, for example how to build customer-facing cloud services using AWS, their implementation speed will increase. They will then be more willing to discard outmoded components, because the sunk cost fallacy has less pull when a replacement is quick to build. Keeping a level head in the face of major design changes that become necessary midway through implementation is an important part of being an effective engineer.

7. Scaling Challenges Response Meeting: Congratulations! Your product delivered clear value for the customer and you dealt with changing requirements effectively. It is so successful in the limited roll-out that management wants to deploy it everywhere, ASAP. You are ready for that, right?

While you may have designed your service to scale horizontally rather than vertically (such that increased load just requires more servers and not more powerful servers), you will probably find that adding new customers exposes new invalid assumptions, or reveals processes that require expert human input, such as customer-specific configuration. These processes need to be either automated, pushed to the customer (in the case of configuration or fine-tuning), or delegated to a deployment team. Accepting them as a recurring cost to the engineering team will not work as a long-term strategy. If you delegate them to deployment engineers, you will have to create training materials and prepare to spin up such a team if you do not already have one. If the customers can be taught how to set their own configuration, then you may seemingly get that “for free” regardless of scale, but you risk compromising the customer experience and opening the engineering team to high-urgency requests from them. Automating such tasks is ideal in the long term, but may take too long to design for the current needs. New load on your service may also reveal hidden super-linear scaling behaviors, such as synchronization required among all hosts or within some internal database or cache.

Preventing these issues during the design phase beats trying to fix them in-flight, but despite your best efforts they will likely still arise. In this meeting, you will revise the growth projections that you made in your initial HLD. You will now have concrete data on the performance of your service in production, which you can use to make more confident estimates of resource needs. Your team will have a list of non-automated tasks and super-linear scaling behaviors that arose during the initial deployment of your service, and you will record all of them in a common register. The resolution of each issue will have a cost in some combination of automation (non-recurring engineering), deployment management (per-customer one-time cost), continuous support (recurring support engineering cost), and server costs (operational costs).
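One way to keep such a register comparable across issues is to record each candidate resolution with the four cost categories named above. The sketch below is only an illustration: the `ScalingIssue` class, its field names, and the figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScalingIssue:
    """One candidate resolution in the register; field names are illustrative."""
    description: str
    automation_cost: float = 0.0       # non-recurring engineering, paid once
    per_customer_cost: float = 0.0     # deployment management, once per customer
    monthly_support_cost: float = 0.0  # recurring support engineering
    monthly_server_cost: float = 0.0   # operational cost

    def total_cost(self, new_customers: int, months: int) -> float:
        """Total cost of this resolution over a planning horizon."""
        return (
            self.automation_cost
            + self.per_customer_cost * new_customers
            + (self.monthly_support_cost + self.monthly_server_cost) * months
        )


# Two hypothetical resolutions of the same issue, compared over
# 40 new customers and a 24-month horizon.
manual = ScalingIssue("customer configuration by hand", per_customer_cost=3_000)
automated = ScalingIssue(
    "customer configuration automated",
    automation_cost=60_000,
    monthly_server_cost=200,
)
for option in (manual, automated):
    print(option.description, option.total_cost(new_customers=40, months=24))
```

Which option wins depends on the horizon and the customer count, which is why the revised growth projections belong in the same meeting.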

8. Support Plan Meeting

Every customer support ticket, service outage, scaling failure, manual maintenance action, and explanation of the service behavior to management imposes a potentially recurring cost on your team to sustain the product. The engineering team must work to automate such actions where possible and otherwise standardize the team’s response through the creation of “runbooks” or protocols. These runbooks are guides to diagnosing and addressing common issues, performing maintenance, or answering frequent questions. They will often include case studies of actual support tickets. When updating these runbooks, the team should have a live review so that each member can ask questions or share additional information about related issues that may have arisen during an on-call rotation. These discussions naturally create statements of requirements for redesigns that will prevent or automatically remediate such issues.

9. The Quick Fix: Acknowledged Tech Debt

In an ideal world, every discovered design flaw would lead to a rewrite of the service from first principles. In practice, that cannot happen. All products, and especially software services, have a limited lifespan, and it is sometimes appropriate to make a quick fix that isn’t pretty, either in the expectation that it will be resolved at a later time or that it will persist until the service is deprecated. The purpose of treating this as a design meeting is to reach common acknowledgement of the suitability of the fix, review the documentation of its shortcomings, and sketch what would have been the right design if the team were starting from scratch. These sketches also seed the design of future generations of the service.

10. Transmission and Inheritance

Changing ownership of a service would ideally look like a baton pass, but in practice it is more like moving into someone’s fully furnished house and slowly discovering all of its infrastructural problems.

Just as no one’s career should be tied to a single product, no product or service should be indefinitely owned by a single person or engineering team. The most natural progression is for engineers to move between teams every 1-5 years.

This rotation serves to:

  1. Bring fresh ideas, perspectives, and a variety of technical expertise to specific problems,
  2. Reduce burnout by having people work on a variety of projects,
  3. Increase support redundancy and reduce siloization of efforts,
  4. Force regular maintenance of documentation and test its comprehensibility while onboarding new members to a problem.

This comes at the cost of:

  1. The effort of teaching new members and their effort in learning,
  2. The time to build new working relationships between engineers, product managers, customers, and other engineering teams.

If rotation is done too soon, engineers may not have had time to develop the deep knowledge of the problem space necessary to comprehensively evaluate designs. Working in one domain on a variety of different projects can enable engineers to see how problems relate and to find ways of solving many of them at once. In practice, however, premature rotation is rare because the cost of knowledge transfer is usually rated highly, and organizational inertia limits how frequently rotation happens.

The engineer rotating off of a project is responsible for finalizing documentation and summarizing the lessons learned during his tenure. He should write an overview of ongoing work, planned work, and research questions, and a retrospective of recognized shortcomings of the team’s designs and products. These should be reviewed together with the team.

A new engineer should start with an asynchronous review of the team’s documentation. A first reading is to become familiar with the shape of the domain, and a second is to make in-line comments with specific questions. One tenured engineer on the team then takes on the task of responding to these comments and filling the holes in the documentation. They then hold a live meeting to review the questions, and the new engineer receives an introductory implementation task.

11. Deprecation

The memories of praises won are long gone now; only a pile of technical debt remains. Customer needs, service dependencies, and even design paradigms have changed. Starting over is far more attractive than continuing to repatch the holes in the ship. It’s time to deprecate the service.

Deprecation, as a process, often requires its own design or execution plan. You want to ensure that your customers have a seamless transition to the next generation. In software, this typically means that both services are running simultaneously and the new one starts with a limited roll-out. If the boundary between the service and the rest of the world is clean, it may be possible to simply reroute request traffic to the new one. However, if there are any unfortunate couplings, such as database access, it might not be that easy. As each customer switches, you may even discover load-bearing bugs or undocumented behaviors in the old system that were not replicated in the new design.
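When the boundary is clean enough to reroute traffic, one common way to run a limited roll-out is to assign each customer to a stable bucket and raise a percentage over time. The following is a sketch of that general technique, not something prescribed above; `routes_to_new_service` and the customer ID format are hypothetical.

```python
import hashlib


def routes_to_new_service(customer_id: str, rollout_percent: int) -> bool:
    """Deterministically place a customer in one of 100 buckets and route the
    first `rollout_percent` buckets to the new service."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent


# Raising the percentage only ever adds customers to the new service;
# a customer who has already switched is never sent back.
for percent in (0, 10, 50, 100):
    moved = sum(routes_to_new_service(f"customer-{i}", percent) for i in range(1000))
    print(f"{percent}% roll-out: {moved} of 1000 customers on the new service")
```

Because the assignment depends only on the customer ID, each customer sees a consistent service across requests, which makes it easier to attribute a newly discovered load-bearing bug to the migration.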

The deprecation execution plan should be reviewed both within the team and with other relevant engineering and customer stakeholders. The plan will likely include:

  1. Documentation of any changes in customer-facing behavior with the new generation.
  2. A plan for validating the migration of each customer.
  3. A schedule
    1. for communication with customers
    2. for each operation within the deprecation, such as an initial migration of customer records to a new database, not allowing creation of new entities, then not allowing updates of existing entities, final migration of existing records, validation of the migration, blocking reads of the existing database, and final deletion / archival of records in the old database.
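The ordering of operations in that schedule can be made explicit so that the old service enforces it rather than relying on memory. A minimal sketch, assuming the phases are strictly sequential; the `Phase` names, the `ACTIVE` starting phase, and `is_allowed` are hypothetical and added only for illustration:

```python
from enum import IntEnum


class Phase(IntEnum):
    """Deprecation steps for the old database, in schedule order."""
    ACTIVE = 0             # assumed starting state, before deprecation begins
    INITIAL_MIGRATION = 1  # customer records copied to the new database
    CREATES_BLOCKED = 2    # no creation of new entities
    UPDATES_BLOCKED = 3    # no updates of existing entities
    FINAL_MIGRATION = 4    # remaining records migrated
    VALIDATION = 5         # migration validated
    READS_BLOCKED = 6      # reads of the old database blocked
    ARCHIVED = 7           # old records deleted or archived


# The first phase in which each operation is refused.
BLOCKED_FROM = {
    "create": Phase.CREATES_BLOCKED,
    "update": Phase.UPDATES_BLOCKED,
    "read": Phase.READS_BLOCKED,
}


def is_allowed(operation: str, phase: Phase) -> bool:
    """Return whether the old service still accepts `operation` in `phase`."""
    return phase < BLOCKED_FROM[operation]
```

For example, `is_allowed("update", Phase.CREATES_BLOCKED)` is true while `is_allowed("create", Phase.CREATES_BLOCKED)` is false, matching the order in which writes are shut off before the final migration.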
