In software development, infrastructure refers to everything an application needs to get up and running. Unlike the application's functionality, the infrastructure is usually not the responsibility of the development teams themselves. Analogous to the infrastructure of a city, such as public transport, electricity and water, it includes networks, storage and computing capacity, but also API gateways and messaging middleware. The latter can be distinguished by the term application infrastructure.

More and more companies are recognizing the benefits of autonomous development teams that take care of the entire software life cycle, including responsibility for operations. However, the majority of companies are also coming up against the limits of their development teams' capacity and skills. In order to shed light on these problems and suggest suitable measures, we have come up with a small case study.

Baseline

MakeItSure is a leading provider of digital solutions for the insurance industry. Its flagship product, CoverFlex, is an insurance configurator that enables insurance providers to create and customize insurance policies in real time. CoverFlex provides an easy-to-use interface that allows individuals and small businesses to configure individual insurance products to meet their specific needs.

The insurance configurator is managed by seven specialized development teams working on a total of 16 specialized modules or services. These services cover all functionalities from policy management and risk assessment to the integration of partner services.

Diagram showing the OpsTeam supporting DevTeams 1, 2, and 3, focusing on 'CoverFlex' and infrastructure roles, with services spanning on-premises and public cloud.
Fig. 1: Development and operation of the CoverFlex product, schematic representation after modernization. © embarc // INNOQ

Two years ago, MakeItSure GmbH decided to undertake a comprehensive modernization of its IT system landscape in response to the increasing complexity caused by internationalization in Europe and the shortage of skilled workers in IT operations. While all systems were previously run on-premises, some were moved to the cloud. By using public cloud services, the company can keep the workload of the operations team stable even with this new hybrid infrastructure architecture.

This modernization included not only the technical infrastructure and service architecture, but also adapting the development and operations processes to the new cloud architecture. Software is deployed using containers, and the necessary infrastructure is automatically provisioned and configured using infrastructure-as-code tools. Test and build pipelines are also container-based, and configuration is handled by development teams. In addition, they are now independently responsible for deploying and monitoring their applications. The operations team supports them by providing CI/CD infrastructure, creating templates for the pipelines, and providing advice. Operations also provides the underlying monitoring infrastructure and cross-cutting services like service discovery, routing and messaging. Figure 1 illustrates the situation schematically.

Development teams now enjoy a great deal of autonomy and flexibility, especially in their architectural decisions. They can use cloud services to choose the best tool for their specific needs.

Challenges and problems

This modernization ensured the company’s future viability in the market, but after some time, problems emerged that were visible even to management. The delivery of new features was delayed, and incidents seemed to be occurring more frequently, leading to uncomfortable customer inquiries. Management initiated a joint painstorming workshop between operations and development to get to the bottom of the problem. Figure 2 shows an interim result of this event, in which the teams first collected pain points without jumping too quickly to the solution level.

Visual of a brainstorming session titled 'What are your pains in the development and operation of CoverFlex?' showing comments on sticky notes grouped by ops team (pink) and dev teams (yellow). Challenges include outdated docs, deployment errors, monitoring issues, and lack of confidence in using cloud services.
Fig. 2: Some examples of pain points from the teams involved. © embarc // INNOQ

The goal of painstorming is to uncover current problems and obstacles in processes and collaboration. At the end of the workshop, the teams were able to identify four major problem clusters that cause many small and large pains in development and operations:

  1. Deployments as 'hot potatoes': In some development teams, it has become common practice to collect new changes rather than deploy them immediately. Inexperienced members are often reluctant to deploy on their own for fear of not having a solution if a rollout fails. As a result, deployment responsibility is often shifted to those who already have infrastructure and deployment experience, or who are under pressure to roll out new changes quickly. However, the accumulation of changes over time makes it increasingly difficult to determine the root cause of a failed deployment.
  2. Fragile deployments: Other development teams continue to handle deployments as they did before modernization. Instead of continuous delivery, they still do deployments every few months. These deployments are often large events that absorb the entire development team for a day and require spot support from the operations team.
  3. Conservative architecture decisions: Many development teams are reluctant to make new infrastructure and architecture decisions, and tend to tackle any challenge with their existing technology stack. For example, they run regular data processing inside a web application when in many cases it would be better to run it as short-lived, scheduled jobs on the operating platform. Another example is storing images as blobs in the database, where an object store might be a more appropriate solution. Conservative architectural decisions may seem efficient and low-risk in the short term, but in the long term they result in higher operating costs and fragile applications that are not optimally adapted to the challenges they face. These applications often include workarounds and homegrown developments, even though the cloud provider already offers suitable solutions.
  4. Documentation deficits: A major problem is that documentation is highly fragmented, outdated, and difficult to find. There are countless distributed README files and Markdown documents in code repositories and file servers, as well as wiki entries. However, the latter are typically tailored to other audiences, such as product development. While direct support from the operations team is appreciated, they often lack the time to provide comprehensive advice on key infrastructure and architecture decisions. As a result, development teams are left with the impression that they have full responsibility for running their applications, while at the same time relying on implicit knowledge and the isolated knowledge of individuals. This often leaves them alone with their problems and prevents them from focusing on implementing functionality.

Management was relieved by the results because the teams seemed to be open about their pain and problems. This was an important first step. Next, after discussion with the stakeholders, management will launch the first action to address the issues identified: the creation of a virtual platform team.

Measure: Establish a virtual platform team

The results of the painstorming workshop showed that development teams were being given more and more responsibility without sufficient support. To make their work easier, the company now sees its development teams as internal customers and is therefore setting up a platform team, a concept that has gained a lot of attention through the book Team Topologies [TT].

This new team provides targeted support services, such as templates and instructions that guide development teams through recurring issues and requirements. It also operates application infrastructure such as monitoring, routing and messaging. However, one of its first tasks is to drive a documentation initiative.

Management has decided to create a virtual platform team. This means that no members need to be recruited or leave their existing teams. It can be compared to a community that meets regularly. Each member takes on tasks and represents the interests of their original team. Most members of the virtual platform team come from the operations team. In addition, two interested developers have volunteered to join.

Measure: Consultation hours and reviews

Weekly office hours are set up to provide direct support to development teams on key infrastructure and architecture decisions. These office hours will primarily be for urgent questions that need to be answered within a few days or weeks.

In addition, the platform team will participate in the joint review meeting, which was originally intended to be a presentation of development progress. At these meetings, the platform team will now also present its own results, e.g. new solution concepts, runbooks and templates that it has worked on in the last few weeks.

These two steps are designed to increase knowledge sharing and transparency. In this way, development teams no longer need to feel left alone.

Flowchart showing Development Teams, Platform Team, and Operations Team with mutual dependencies. Development Teams handle applications, Platform Team provides services that close knowledge gaps, and Operations Team manages infrastructure. Arrows indicate relationships and responsibilities, such as considering other teams as customers and collaborating based on concepts, proposals, and constraints.
Fig. 3: Interaction of the teams after the platform team has been established. © embarc // INNOQ

Measure: Architecture documentation in one place

As part of the documentation initiative, the platform team analyzed how to improve the documentation of the operating platforms and infrastructure options. They again drew on the results of the painstorming workshop. The most frequently cited problem was that the existing documentation was fragmented, outdated and difficult to find. However, these characteristics do not only apply to the infrastructure documentation provided by the operations team.

New architectural decisions need to be made based on many factors. These include product requirements, guidelines and regulatory requirements, as well as information about the infrastructure and service offerings of the operating platforms.

Table 1: Documentation relevant to the case study teams for conceptual and operational solution design (baseline)

| Document | Responsibility | Original storage location |
| --- | --- | --- |
| Product requirements (e.g. including quality scenarios) | Product Development | Wiki |
| Technology guidelines | Enterprise Architecture, CTO | PDFs in file sharing |
| Legal requirements | Legal, Data Protection Officer | PDFs in file sharing |
| Infrastructure and service offerings of the operating platforms | Operations Team | PDFs in file sharing |
| Infrastructure-as-code templates | Operations Team | Infrastructure code repository (modules folder) |
| Application infrastructure setup guides (e.g. auth, monitoring, messaging) | Operations Team | Infrastructure component code repositories (Readme) |
| Troubleshooting guides and runbooks | Operations Team | Wiki; infrastructure code repository; infrastructure component code repositories (Readme) |
| Application and service documentation | Development Team | Code repositories (Readme); wiki |

It makes sense not to consider infrastructure and operational documentation in isolation. Therefore, in the future, all documents relevant to application development should be stored or at least linked in the wiki, following the example of product development. This does not mean that everything moves: the infrastructure-as-code templates from the operations team must remain in the infrastructure repository, as they are “code” by nature. However, concise usage instructions and corresponding use cases should be available in the wiki. In this way, the corporate wiki becomes the central source of knowledge, even for technical departments, which benefit from improved searchability and clarity.

The decision to use a wiki as the central documentation location was not taken lightly and is more of a compromise. The operations and development teams in particular often use Markdown documents stored in code repositories. However, this makes it difficult to find documents due to the large number of different locations. It was therefore agreed that links to external documentation sites would also be allowed in the wiki.
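The agreement to link rather than move repository documentation can be supported with a small script. The following is only a sketch under assumptions: the function names, the README naming convention and the base URL of the code hosting are hypothetical and not part of the case study. It collects README files below a checkout directory and renders one link line per document, which can then be pasted into a wiki page.

```python
from pathlib import Path
from typing import Iterable, List


def find_readmes(root: Path) -> List[Path]:
    """Return all README files below root, relative to root, in sorted order."""
    return sorted(
        p.relative_to(root)
        for p in root.rglob("*")
        if p.is_file() and p.name.lower().startswith("readme")
    )


def render_index(paths: Iterable[Path], base_url: str) -> str:
    """Render one plain-text line per document: '<folder>: <link>'.

    base_url is an assumed prefix under which the repository files
    are reachable in the browser.
    """
    base = base_url.rstrip("/")
    lines = [f"{p.parent.as_posix()}: {base}/{p.as_posix()}" for p in paths]
    return "\n".join(lines)
```

A call such as `render_index(find_readmes(Path("checkout")), "https://git.example.invalid/coverflex")` would produce the link list for a local checkout. Whether such a list is generated or maintained by hand is a team decision; the point is that the wiki stays the entry point while the documents remain where they are.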

However, it quickly becomes apparent that everyone involved is struggling to find the right level of detail for their documentation. For example, should runbooks and troubleshooting instructions also be stored in the wiki? Should the team’s internal documentation be written in such a way that other teams can benefit from it? These uncertainties slow down the documentation initiative.

A clear model or structure is needed to enable efficient and targeted documentation.

Measure: arc42 as a guiding framework for content structure

arc42 [ARC42] is a mature and proven approach for structuring architecture descriptions. It will now be used to give the documentation initiative at MakeItSure a clear structure.

arc42 proposes a total of 12 sections for describing a software architecture (see Figure 4). These include architecture-relevant requirements, modularization, technology choices and, to some extent, the evaluation of the software with respect to risks and technical debt.

What many people don’t know is that arc42 also covers infrastructure aspects. Figure 4 shows the structure and location of important infrastructure topics in the architecture documentation.

Diagram illustrating the arc42 architecture documentation template, with 12 sections, key explanations, and markers for requirements, solutions, and evaluation.
Fig. 4: Infrastructure aspects in the arc42 structure. © embarc // INNOQ

In particular, the quality goals in arc42 section 1.2 are relevant to the process of weighing infrastructure decisions. For example, the use of a central monitoring system could contribute to several quality goals such as reliability, availability and security. The quality requirements (section 10) then specify the quality goals in the form of scenarios.

For many projects, much of the infrastructure is predetermined. These non-modifiable specifications are defined in Section 2 - Constraints. For example, the use of a specific database may be a constraint for licensing reasons. However, development teams can be given some leeway in many areas. This section can also define technologies and cloud services that teams can choose from.

Section 9 of arc42 is for recording architectural decisions. This is where the reasons and consequences of a decision are documented, such as the use of a central monitoring system despite the distribution of services across multiple operating platforms.

arc42 is primarily intended to describe the architecture of a single application. In contrast, MakeItSure’s CoverFlex is a highly distributed application consisting of many services. A comprehensive arc42 document would be too large and complex, with too many teams contributing relevant content.

So how does arc42 work when different teams are working on different services at the same time? Does each «construction site» need its own arc42 documentation?

Measure: arc42 on several levels

In order to manage the required volume of documentation, the teams are breaking down the documentation into several layers. Instead of creating one large arc42 document for the entire application, they modularize the documentation. The application documentation forms the macro-architectural basis and contains, among other things, all key decisions such as the specification of technologies and protocols for inter-service communication. This application documentation is largely maintained by the architecture community, which consists of the lead developers of the development teams. In contrast, each development team is jointly responsible for its own service or module documentation (see Figure 5).

This service documentation can also be structured according to arc42, but it should be kept lean and focused on the essential aspects of the service. Even a short description would suffice. Although arc42 is a template that provides a comprehensive structure for all aspects of a software project, not all content needs to be filled in. The primary goal is not completeness, but usefulness. When assessing usefulness, it can help to ask yourself and others, «Who would be interested in this?» or «Would writing down this aspect have answered questions in the past?»

In addition, there should be platform documentation that describes cross-cutting concepts such as monitoring, central authentication systems, or routing, which in turn are based on the macro-architecture decisions stored in the application documentation. The concrete solution concepts of the platform are therefore simply linked through the application documentation. In this way, the platform team avoids redundancies that would otherwise lead to inconsistencies in the solution concepts. This is especially important for MakeItSure as it builds other products on top of the platform.

Diagram showing the relationship between platform, application, and module documentation, highlighting responsibilities of the platform team and development teams, and collaboration through architectural descriptions and decisions.
Fig. 5: arc42 on several levels. © embarc // INNOQ

Development teams are often unsure of what solutions are available in the infrastructure and when and how best to use them. As a result, they prefer to copy approaches and configurations from existing solution components without really understanding them.

To better support development teams in their infrastructure decisions, reference architectures and best practices complement the platform documentation. For each infrastructure component, there should be a how-to guide stored in the wiki and maintained by the platform team. This allows teams to rely on proven standards while maintaining flexibility in their architectural choices.

Measure: Precise and targeted concepts

So how do you write a good concept and a good guide? In addition to the basic structure and target audience, the character of the concept is crucial. What kind of document is it? The taxonomy of the Diátaxis framework is helpful here.

Diátaxis was developed by Daniele Procida [DIA] and represents a systematic approach to technical documentation. Unlike arc42, it is not limited to software architecture, or even to software.

Quadrant chart categorizing tutorials, how-to guides, explanations, and reference materials by cognition vs. action, and acquisition vs. application, with brief descriptions of each type.
Fig. 6: Taxonomy of technical documentation in the Diátaxis framework according to [DIA]. © embarc // INNOQ

Diátaxis identifies the needs of different audiences, suggests appropriate forms of documentation, and classifies them into quadrants. Figure 6 shows this categorization in a kind of compass. The idea: when writing a concept, consider which category it falls into. Do not mix categories within one document. Instead, refer to documents of the other categories and, if necessary, create another document with a different character. An example: a tutorial teaches the basics, while the how-to guides assume that the reader has already mastered them.

The platform team provides the development teams with the information they need to do their work in the form of how-to guides. They include examples and decision trees, such as how to implement health checks for different application platforms and application types, from long-lived Web applications to batch jobs that process large amounts of data.
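What an example in such a how-to guide might look like for a long-lived web application is sketched below. This is an illustration under assumptions, not MakeItSure's actual guide: the path `/healthz`, the port, the check names and all function names are hypothetical, and only the Python standard library is used. The sketch aggregates several checks into one report and answers with HTTP status 200 if all checks pass and 503 otherwise.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical check registry: name -> callable returning True when healthy.
Checks = Dict[str, Callable[[], bool]]


def evaluate(checks: Checks) -> Tuple[int, dict]:
    """Run all checks; return an HTTP status code and a JSON-serializable report."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check counts as unhealthy
            results[name] = f"error: {exc}"
    healthy = all(state == "ok" for state in results.values())
    report = {"status": "up" if healthy else "down", "checks": results}
    return (200 if healthy else 503), report


def make_handler(checks: Checks):
    """Build a request handler that serves the report at the assumed path /healthz."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            status, report = evaluate(checks)
            body = json.dumps(report).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return HealthHandler


def serve(checks: Checks, port: int = 8080) -> None:
    """Blocking helper for a long-lived web application (port is an assumption)."""
    HTTPServer(("", port), make_handler(checks)).serve_forever()
```

For a batch job, which has no request loop, the same `evaluate` function could be called once at the end of the run and its result reported through the exit code or the monitoring system. Which variant applies to which application type is exactly the kind of question a decision tree in the guide would answer.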

Don’t write tutorials and explanations for off-the-shelf solutions and cloud services. Instead, link to sources that can be highly recommended, such as lecture videos. Reference material for off-the-shelf solutions should also be linked rather than copied. Of course, this does not apply to information that is under the control of the platform team, for example a list of the various environments.

The platform team also creates runbooks, which are how-to guides for recurring tasks that the platform team performs itself.

It quickly becomes apparent that the teams find it difficult to answer questions regarding why a particular solution was chosen over others. While their decisions are well thought out, traceability still seems to suffer. This raises the question of whether there is a suitable documentation structure for this as well.

While the concepts (section 8 in arc42) are more about the application of solution ideas, the documentation of architectural decisions is primarily concerned with comprehensibility (“Why did we do it this way?”).

Measure: Comprehensible decisions with ADRs

When it comes to documenting decisions, our teams quickly come across the buzzword ADR, which stands for Architecture Decision Records. The term was coined by Michael Nygard in a blog post [ADR]. An ADR documents a single significant decision in a project as a snapshot, in the same way every time. Nygard’s blog post also includes a proposal for such a structure, see Table 2. There are alternatives to this, which differ in detail, but the core idea is the same everywhere.

Table 2: Sections in an ADR based on M. Nygard [ADR]

| Section | Content | Simplified example |
| --- | --- | --- |
| Title | Short description of the decision, reflecting the content of the ADR. | Centralized monitoring |
| Context | Description of the conditions that led to the decision. | We run our application in a distributed fashion across multiple environments (on-premises and public cloud). Monitoring with multiple solutions increases the workload on the operations team and makes it difficult to establish consistent standards. |
| Decision | Description of the decision that was made in the context described above. | We set up centralized monitoring on-premises. This also makes it easier to implement data security requirements and leads to a better UX for development teams. We use appropriate tags to separate the metrics collected for different tenants. |
| Status | Position in the life cycle of the decision, such as “proposed”, “accepted”, “discarded”, “deprecated” or “superseded”. | accepted |
| Consequences | List of all positive and negative consequences. | We use an appropriate exporter to retrieve relevant metrics from services running in the public cloud. |
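To make sure every record has the same shape, the sections from Table 2 can be captured in a small data structure. The following is a minimal sketch, assuming a plain-text rendering with one heading per section. The class name `ADR`, the `render` method and the output layout are illustrative choices, not part of Nygard's proposal; only the section names and status values are taken from Table 2.

```python
from dataclasses import dataclass, field
from typing import List

# Status values as listed in Table 2.
ALLOWED_STATUSES = {"proposed", "accepted", "discarded", "deprecated", "superseded"}


@dataclass
class ADR:
    """One architecture decision record with the sections from Table 2."""

    title: str
    context: str
    decision: str
    status: str = "proposed"
    consequences: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject typos early so that all records use the same vocabulary.
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")

    def render(self) -> str:
        """Render the record as plain text, sections in the order of Table 2."""
        consequences = "\n".join(f"- {c}" for c in self.consequences) or "- (none recorded)"
        sections = [
            ("Title", self.title),
            ("Context", self.context),
            ("Decision", self.decision),
            ("Status", self.status),
            ("Consequences", consequences),
        ]
        return "\n\n".join(f"{name}\n{text}" for name, text in sections)
```

Filled with the simplified example from Table 2, `ADR(title="Centralized monitoring", ..., status="accepted").render()` yields a text block that can be stored in the wiki or next to the code. The validation of the status field is a design choice of this sketch: it keeps the life cycle vocabulary consistent across teams.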

The teams agree to document all their significant decisions in the form of an ADR. This is for themselves, but also for interested outsiders, e.g. from other teams. Significance is defined according to the following criteria:

  • The decision does not just affect our own team, or even many teams.
  • It would be difficult and/or expensive to reverse the decision later in the project.
  • The decision has an impact on the success of our project, e.g. the achievement of important quality goals.
  • The decision deviates from our usual standards.

The more a particular decision meets these criteria, the more important is the creation of an ADR for that decision. Sometimes it is not clear until later, in which case the teams document the decision after the fact.
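The four criteria can be turned into a simple checklist. The sketch below is only an illustration: the field names and in particular the threshold of two criteria are assumptions made for this example, since the teams in the case study weigh the criteria rather than count them.

```python
from dataclasses import dataclass


@dataclass
class DecisionTraits:
    """Answers to the four significance criteria for one decision."""

    affects_other_teams: bool
    hard_to_reverse: bool
    impacts_project_success: bool
    deviates_from_standards: bool


def significance(traits: DecisionTraits) -> int:
    """Count how many of the four criteria the decision meets (0 to 4)."""
    return sum(
        [
            traits.affects_other_teams,
            traits.hard_to_reverse,
            traits.impacts_project_success,
            traits.deviates_from_standards,
        ]
    )


def needs_adr(traits: DecisionTraits, threshold: int = 2) -> bool:
    """Suggest an ADR once the assumed threshold of criteria is reached."""
    return significance(traits) >= threshold
```

Such a check cannot replace judgment, but it can serve as a quick prompt in a review: a decision that affects several teams and is hard to reverse scores 2 and would be flagged for an ADR under the assumed threshold.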

Just as arc42 is used at different levels of our teams, so are the ADRs, which can be found in the ‘Architectural Decisions’ section of arc42. In all teams, however, the mindset is much more critical than the structure: we document important, far-reaching, risky (in short: significant) decisions in a comprehensible way. This is for ourselves in the sense of a future self, but also for outsiders who ask themselves: «Why did you do it that way?» This could include new team members.

Measure: Concise architecture overviews for platform and product

Thanks to the documentation initiative, the teams at MakeItSure produce small, targeted information snippets. The ADRs make decisions comprehensible, while the concepts, how-to guides and runbooks communicate empirical knowledge and thus provide confidence in action.

New team members give positive feedback on this, but at the same time report a certain amount of information overload that they are confronted with. There is a lack of overview, especially as the development teams only need a small part of the platform documentation for their daily work. They are mainly interested in the how-to guides. As a first step, the platform team and the development teams create entry pages in the «Onboarding» wiki and link the most important content for getting started there. Nevertheless, the teams realize during onboarding that it would be helpful to have something concise at hand that shows the most important things at a glance. A kind of «big picture».

As part of a consultation, the operations team and the development teams therefore decide to create concise overviews for both the platform and the CoverFlex product. They use the existing content and condense it. They choose a presentation format and limit the scope to 12 content slides. They define the structure of the sections in Table 3.

Table 3: Structure of a slide deck for an architecture overview

| Section | Contents | Answered questions |
| --- | --- | --- |
| Task | Mission statement and context definition (2–3 slides) | What does the solution achieve? And what does it not? |
| Forces | Top quality goals and important constraints (2–3 slides) | What guides us in our design? What do we have to adhere to? |
| Solution strategy | Key solution approaches, assigned to the quality goals (table) and an informal overview («big picture») (3–4 slides) | What does the solution look like in principle? What are the cornerstones for addressing the goals? |
| Further information | Reflection and outlook for next steps, links to further information (2–4 slides) | What happens next? |

Creating such overviews as a slide deck is a trade-off. Individual pieces of information are now redundantly available next to the wiki and are in danger of becoming outdated or scattered. The teams therefore agree on the following principles:

  • In the overviews, concentrate on the essentials
  • The slides show the current situation
  • Everyone in the teams understands the content of both overviews and can present the slides to outsiders
  • We use the slides for each onboarding and update them afterwards (if necessary)

A good example of an informal overview based on the scheme in Table 3 can be found as a slide presentation and flyer (PDF, DIN A3) at the end of this article [Threema].

Table 4: MakeItSure measures taken at a glance

| Name of the measure | Short description | Alternative approaches or additions |
| --- | --- | --- |
| Establish a virtual platform team | Establishment of a supporting team structure for development, with the aim of relieving the burden of operational tasks. | The tasks of the platform team, in particular the support of the following measures, can also be assumed by a temporary initiative or the operations team. |
| Architecture documentation in one place | Storing all information relevant to the solution design in a way that makes it easy to find. A shared wiki serves as an entry point. | A developer portal, e.g. realized with Backstage [BS], can also take on the role of the entry point. |
| arc42 as a guiding framework for content structure | Use arc42 for naming and structuring our architecture documentation. | An alternative is the work of Simon Brown, see [SB]. |
| arc42 on several levels | Break down information into scopes and apply arc42 at different levels, instead of creating one large, all-encompassing document. | A single mega-arc42 is not an alternative. However, a valid approach is using different structures at different levels, e.g. very compact profiles for services. |
| Precise and targeted concepts | Communicating experiential knowledge through practical instructions. | The 4-quadrant model (4MAT) can be used as a rough outline of a single concept, see e.g. [BM]. |
| Comprehensible decisions with ADRs | Record important decisions in the same structure. All teams use the same template. | In addition to [ADR], other suggestions for structuring ADRs can be found as inspiration, e.g. in [AEX] and [DCAR]. |
| Concise architecture overviews for platform and product | Create compact introductions to the platform and main product as a slide deck and refer to the corresponding documentation for details. | As an alternative to slide decks, short summaries may also be sufficient as part of the documentation itself. In the same medium, it is easier to ensure that the information is up to date. |

Documentation is not an end in itself. When done well, it supports communication and provides orientation and certainty as to what needs to be recorded and in what depth. Knowledge based on experience, such as arc42 or Diátaxis, is particularly helpful when it comes to infrastructure issues. It makes little sense to reinvent the wheel for this purpose.

Sources

[ARC42] Peter Hruschka and Gernot Starke: «arc42-Template for architecture documentation», https://www.arc42.org

[ADR] Michael Nygard: «Documenting Architecture Decisions», 2011, blog post; https://www.cognitect.com/blog/2011/11/15/documenting-architecture-decisions

[AEX] Gernot Starke et al.: arc42 by Example, Software Architecture Documentation in Practice, Leanpub (E-Book), https://leanpub.com/arc42byexample

[BM] Bernice McCarthy et al.: «The 4MAT Research Guide», About Learning Inc., 2002

[BS] Backstage, Open-Source-Framework for creating developer portals, https://backstage.io/

[DCAR] Uwe van Heesch et al.: «Decision-Centric Architecture Reviews», IEEE Software Volume 31, Number 1, 2014

[DIA] Daniele Procida: «Diátaxis. A systematic approach to technical documentation authoring», https://diataxis.fr

[SB] Simon Brown: «The software guidebook. A guide to documenting your software architecture.», Leanpub (E-Book), https://leanpub.com/documenting-software-architecture

[TT] Matthew Skelton and Manuel Pais: «Team Topologies. Organizing Business and Technology Teams for Fast Flow», O’Reilly 2023

[Threema] Stefan Zörner: «Architektur-Porträt: Der mobile Instant-Messenger Threema», https://www.embarc.de/architektur-portraet-threema/

Conclusion

With the measures described above, the teams at MakeItSure were able to address the problems identified during painstorming and achieve step-by-step improvements in the development process. Table 4 summarizes the measures once again. Admittedly, they do not solve all the problems, and there are valid alternatives to them. We have therefore also included alternative approaches in the table, as well as some suitable additions.