This article is also available in German.

The use of artificial intelligence (AI) carries responsibilities with it. Transparency, explainability, and fairness are essential principles that must be ensured alongside the effectiveness of the AI system. To satisfy these requirements, it is advisable to take orientation from fields with a tradition of verifiable processes. Although these processes are not always error-free, safety standards would be impossible to achieve without them. This is most evident in highly safety-critical and regulated industries such as medicine, but also in aviation, aerospace, and finance.

In the same way that these fields require processes to meet the relevant requirements, a company utilizing AI systems requires regulated processes that control access to machine learning (ML) models, implement guidelines and statutory provisions, monitor the interactions with the models and their results, and record on what basis a model is generated. These processes are given the umbrella term “model governance.” Model governance processes need to be implemented from the outset in every phase of the ML life cycle (design, development, and operations). The author has discussed the concrete technical integration of model governance into the ML life cycle in more detail elsewhere.

Model Governance: Essential for Fulfilling Rules and Requirements

Model governance is not optional (see the “Model Governance Checklist” box). There are already rules that companies in specific industries have to meet. The importance of model governance can be illustrated with the example of the finance sector. Lending systems, interest-rate-risk models, and pricing models for derivatives are high risk and demand a high level of monitoring and transparency. According to an Algorithmia study on the most important trends in the application of AI in 2021, the majority of companies are required to meet regulatory requirements – 67 percent of the surveyed companies have to comply with multiple regulations. Only 8 percent stated that they are not subject to any statutory provisions.

Moreover, the scope of the regulations is likely to increase in the future. For example, in April 2021 the EU published the first legal framework for AI, which is intended to supplement the existing regulations. The proposal separates AI systems into four different risk categories (“unacceptable,” “high,” “limited,” and “minimal”). The risk categories define the type and scope of the requirements that are to be placed on the respective AI systems. AI software falling in the high-risk category must meet the strictest requirements.

The use of machine learning carries with it responsibilities and obligations. To comply with these requirements, a company requires processes through which

  • access to ML models is controlled,
  • guidelines and statutory provisions are implemented,
  • interactions with the ML models and their results are recorded,
  • and it is determined on what basis a model is generated.

Model governance describes these processes in their entirety.


  • Full model documentation or reports. This includes the reporting of the metrics by means of appropriate visualization techniques and dashboards
  • Versioning of all models for the purpose of external transparency (explainability and reproducibility)
  • Full data documentation to ensure a high level of data quality and to comply with data protection regulations
  • Management of ML metadata
  • Validation of ML models (audits)
  • Ongoing monitoring and recording of model metrics

For high-risk AI systems, the requirements encompass the following aspects: robustness, safety, accuracy, documentation and recording, as well as appropriate risk assessment and risk reduction. Further requirements are a high quality of training data, non-discrimination, reproducibility, transparency, human oversight, as well as conformity testing and proof of conformity with the AI provisions by means of CE marking (see “Plan it Legal” box). Examples of ML systems in this category are public and private services (such as credit scoring) or systems used in school education or professional training to decide on access to education or on a person’s career development (for example, the automated marking of tests).

The conformity of high-risk AI systems with the proposed AI regulations will be a precondition for their commercialization in the EU. It can be demonstrated by means of a CE mark. The EU will furthermore adopt standards whose fulfillment establishes a presumption of conformity with the regulations.

For the comprehensive tests that the AI regulations will entail, the responsible bodies are to develop “sandboxing schemes,” i.e., standards for reliable test environments. The compliance testing for AI takes an ex-ante perspective, but at the same time has similarities with the data protection impact assessments of the GDPR. More information on this can be found in the blog post by Dr. Bernhard Freund on planit.legal: “The EU’s AI Law – Draft and State of Discussion.”

Achieving Compliance with European AI Requirements

As the regulations are to apply not only to companies and individuals located in the EU but also to any company offering AI services within the EU, the law would have a similar scope to the GDPR. The proposals still need to be approved by the EU Parliament and pass through the legislative procedures of the individual member states, so the law will come into force in 2024 at the earliest. Once it enters into force, high-risk AI systems must undergo a conformity assessment against the AI requirements during their development. Only then can the AI system be registered in the EU database. As a final step, a declaration of conformity is required so that AI systems can obtain the necessary CE mark and providers are able to market them.

It is important to note that regulations are not the only decisive aspect for model governance processes. Models used in less strictly regulated contexts are also subject to model governance. Alongside the fulfilling of statutory provisions, companies must also avoid economic and reputational damage as well as legal difficulties. ML models providing a marketing department with information on the target group could lose their precision during operation, providing incorrect information on which important decisions are then made. They thus represent a financial risk. Model governance is therefore needed not only for the fulfillment of legal provisions but also for quality assurance of ML systems and the reduction of business risks.

Model Governance as a Challenge

The looming EU regulations and the existing provisions and business risks make it necessary to implement model governance processes from the start. The importance of model governance, however, often only becomes clear to companies when ML models go into production and need to comply with statutory regulations. On top of this, the abstract character of many legal provisions makes practical implementation a challenge for many companies. According to the abovementioned Algorithmia study, 56 percent of respondents cited the implementation of model governance as one of the biggest challenges for successfully bringing ML applications into production. The figures of the “State of AI in 2021” study with regard to the risks of artificial intelligence are comparable: 50 percent of the surveyed companies view compliance with legal provisions as a risk factor, while others highlight shortcomings in explainability (44 percent), reputation (37 percent), and justice and fairness (30 percent) as relevant risk factors.

Audits as Standardized Testing Processes in the Model Governance Framework

Audits are an important component of model governance as a tool to determine whether AI systems correspond to the corporate policies, industry standards, or regulations. Audits can be either internal or external. The Gender Shades study, mentioned in the first part of this series of articles, is an example of an external audit process. It examined the facial recognition systems of large providers and found inconsistent levels of precision depending on gender and ethnicity.

This external view is, however, limited, as external inspection processes only have access to the model results and not to the underlying training data or model versions. These are important sources that companies must include in their internal audit processes. These processes should permit critical reflection on the potential impacts of a system. First, however, the basics of AI systems are explained here.

Specific Features of AI Systems

In order to be able to examine AI software, it is important to understand how machine learning works. Machine learning comprises a range of methods that use computers to make or improve predictions of outcomes or behavior on the basis of data. To construct such predictive models, an ML algorithm must find a function that generates an output (label) for a given input. For this purpose the model requires training data that contain the correct output for each input. This form of learning is called “supervised learning.” In the training process, the model uses mathematical optimization procedures to search for a function that approximates the unknown relationship between input and output as well as possible.

An example of a classification is a sentiment analysis intended to determine whether tweets contain positive or negative sentiments. In this case an input would be a single tweet, and the corresponding label the coded sentiment determined for this tweet (-1 for a negative sentiment, 1 for a positive). With this annotated training data, the algorithm learns during the training process how input data correlate with the label. After the training, the algorithm can then autonomously assign new tweets to a class.
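The tweet example can be sketched in a few lines of Python. The word-score "model" and the four labeled tweets below are invented for illustration; a real project would use an established ML library, but the sketch shows how labeled input/output pairs drive supervised learning:

```python
# Minimal supervised sentiment classifier (illustrative hand-rolled sketch,
# not a production approach): learn a per-word sentiment score from labels.

def train(samples):
    """Learn a sentiment score per word from labeled tweets (-1 or 1)."""
    scores = {}
    for text, label in samples:
        for word in text.lower().split():
            scores[word] = scores.get(word, 0) + label
    return scores

def predict(scores, text):
    """Classify a new tweet: 1 (positive) or -1 (negative)."""
    total = sum(scores.get(word, 0) for word in text.lower().split())
    return 1 if total >= 0 else -1

# Hypothetical annotated training data: (tweet, label)
training_data = [
    ("great product love it", 1),
    ("awful service never again", -1),
    ("love the new update", 1),
    ("terrible awful experience", -1),
]

model = train(training_data)
print(predict(model, "love this great update"))  # 1 (positive)
print(predict(model, "terrible service"))        # -1 (negative)
```

After training, the learned word scores take the place of hand-written rules: new tweets are classified purely on the basis of what the labeled examples contained.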

More Complex Components in the Field of Machine Learning

An ML model therefore learns the decision logic in a training process, instead of the logic being explicitly defined in code as a series of if–then rules, as is typical in software development. This fundamental difference between traditional and AI software means that methods used for testing traditional software cannot be directly transferred to AI systems. Testing is further complicated by the fact that, in addition to the code, the data and the model are also relevant, and all three components are mutually dependent in line with the change-anything-changes-everything (CACE) principle (for more on this, see “Hidden Technical Debt in Machine Learning Systems”).

If, for example, the data in a productive system differ from the data with which a model was trained (distribution shift), the model's performance drops (model decay). In this scenario the model must quickly be retrained with fresh training data and redeployed. Matters are further complicated by the fact that the testing of AI software is still an open field of research, without consensus and without established best practices.
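A drift check along these lines can be sketched as follows. The data, the threshold, and the mean-based heuristic are illustrative assumptions; production systems typically run proper statistical tests on full feature distributions:

```python
# Sketch: flag a possible distribution shift by comparing the live mean of a
# feature against its training baseline (threshold chosen for illustration).

def mean(values):
    return sum(values) / len(values)

def detect_shift(train_values, live_values, threshold=0.25):
    """True if the live mean deviates from the training mean by more
    than `threshold` times the training standard deviation."""
    m = mean(train_values)
    std = mean([(x - m) ** 2 for x in train_values]) ** 0.5
    if std == 0:
        std = 1.0
    return abs(mean(live_values) - m) > threshold * std

train_feature = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]  # hypothetical training data
live_ok = [1.0, 1.02, 0.98]                       # matches the baseline
live_drift = [1.6, 1.7, 1.8]                      # distribution has shifted

print(detect_shift(train_feature, live_ok))     # False: keep serving
print(detect_shift(train_feature, live_drift))  # True: retrain and redeploy
```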

Ethical Principles as Non-Functional Characteristics

The relevant test aspects of AI software can be separated into functional and non-functional characteristics. Correctness as a functional characteristic can be captured directly and mathematically through metrics such as accuracy and precision/recall. They indicate how closely the forecasts of the trained model agree with the actual labels (gold standard). On top of this there are established validation procedures such as cross-validation, which hold out part of the data as test data to determine how well the trained model predicts the correct outputs (labels) for new, unseen data.
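Cross-validation can be sketched without any library. The threshold "model" and the six data points are invented; the essential idea is rotating the held-out test fold:

```python
# k-fold cross-validation sketch with a trivial threshold classifier
# (illustrative data; real projects would use an ML library's utilities).

def train_threshold(data):
    """'Train' by taking the midpoint between the two class means."""
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, data):
    correct = sum(1 for x, y in data if (1 if x > threshold else 0) == y)
    return correct / len(data)

def cross_validate(data, k=3):
    """Hold out each fold once as test data; return the mean accuracy."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        scores.append(accuracy(train_threshold(train), test))
    return sum(scores) / k

samples = [(0.1, 0), (0.2, 0), (0.3, 0), (0.8, 1), (0.9, 1), (1.0, 1)]
print(cross_validate(samples))  # 1.0 on this cleanly separable toy data
```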

Non-functional characteristics correspond to ethical principles such as fairness, data protection, interpretability, robustness, and security. In contrast to the functional characteristics, they cannot draw on a broad stock of standardized metrics and practices from the field of machine learning. Here too the challenge is that the testing of non-functional characteristics of AI software is not (yet) standardized. Trade-offs between different characteristics complicate the situation further: improving fairness can reduce accuracy, and vice versa.

Metaphorically, AI software can be described as a power plant. Functionally smooth, faultless operation does not mean that no damage is being done to the environment. The smooth operation corresponds to the functional characteristics, the protection of the environment to the non-functional criteria. The metaphor highlights that different testing processes are required for functional and non-functional characteristics. For the former, best practices from the field of ML are applicable; for the latter, research is still required. This article now turns to the aspect of fairness, describes concepts and test strategies, and highlights the role of model governance, which supports the translation of ethical principles such as fairness into the practical development of AI software.

How Does “Unfairness” Occur?

First we consider the interesting question of how unfairness occurs. The rule here is simple: models learn what is manifested in the training data. In supervised learning, training data sets consist of the input data and the corresponding labels. If the labels contain biases, the model will inherit and learn them from the start. It is therefore important to check the labels sufficiently carefully. Bias can, however, also stem from the input data themselves and not only from the labels: if the training data already contain biases, the algorithm will adopt them as well. This problem can occur, for example, with large language models trained on large data volumes from the internet. It has been shown that the performance of a model correlates with the strength of a stereotypical bias: as precision increases, the bias increases as well.

A smaller sample size for minority groups can also homogenize the model's learning process to the advantage of the majority groups, for example through more photos of male faces than female ones in the training data. IBM's Diversity in Faces dataset is an attempt to counteract this by increasing the diversity of the photos in the training data. In addition to the data, the features used in the training process also play a role: if the model cannot draw on a sufficient number of features, it becomes harder for the algorithm to learn the correlation between input and output. Finally, features can act as proxies for excluded sensitive attributes: even if protected attributes cannot explicitly be used for decision-making, they can be implicitly involved when other features correlate with them.

The aim of ensuring fairness is the protection of sensitive attributes such as sex, religious affiliation, or sexual orientation from unfair algorithmic decision-making. The right to freedom from discrimination is explicitly confirmed in the EU draft regulations for AI systems in the high-risk category.

While the unfairness in the Gender Shades study could be intuitively understood, the challenge now is to define the more abstract term “fairness” in an objective, metric-based, and ideally scalable way.

Definitions of Fairness and the Deriving of Test Strategies

Which audits and which metrics are suitable for testing fairness? The lack of consensus already noted also applies to the definition of fairness. The diversity of the sources of unfairness shows that fairness cannot simply be established with a single metric or test strategy: fairness audits must be part of the model governance processes that ensure the quality of the training data and the model. A further complicating factor is that the use cases of AI are too diverse for an easily generalizable one-size-fits-all solution. The question of how fairness can be measured and verified can therefore not be answered with a single metric. Nonetheless, concrete possibilities for quantitatively measuring fairness are required before such audits can be embedded into the model governance framework.

Statistical methods offer the most easily measurable definitions of fairness, at the same time forming the basis for other approaches. Fairness can be quantified by means of statistical metrics. The measured values are used to derive definitions that focus on the outputs of models. Fairness can be defined by means of similar rates of errors in the outputs for different sensitive demographic groups. An algorithm is thus fair when groups selected on the basis of sensitive attributes have the same probability of favorable decision outcomes (“group fairness”).

Consistency of the Overall Accuracy

It can also be determined whether the accuracy of the model is the same for different subgroups (consistency of the overall accuracy). Using the example of credit scoring, this definition of fairness is fulfilled when the probability of applicants with a good credit score being rated as creditworthy, and of those with a bad credit score being denied credit, is the same for men and women alike, irrespective of gender.
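Both quantities can be computed with a few lines of code. The credit-scoring records below are invented; the sketch computes the approval rate and the accuracy per demographic group:

```python
# Group fairness sketch on hypothetical credit decisions.
# Each record: (group, true_label, predicted_label); 1 = creditworthy.

records = [
    ("male",   1, 1), ("male",   0, 0), ("male",   1, 1), ("male",   0, 1),
    ("female", 1, 1), ("female", 0, 0), ("female", 1, 0), ("female", 0, 0),
]

def group_metrics(records, group):
    subset = [(y, p) for g, y, p in records if g == group]
    approval_rate = sum(p for _, p in subset) / len(subset)
    accuracy = sum(1 for y, p in subset if y == p) / len(subset)
    return approval_rate, accuracy

male_rate, male_acc = group_metrics(records, "male")
female_rate, female_acc = group_metrics(records, "female")
print(male_rate, female_rate)  # 0.75 vs. 0.25: approval rates differ (group fairness violated)
print(male_acc, female_acc)    # 0.75 vs. 0.75: accuracy consistency holds here
```

The toy data also show that the two definitions are independent: both groups have the same accuracy, yet very different approval rates.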

The first statistical approaches for such testing are already available. Fairness Indicators from TensorFlow is a library that offers an evaluation of frequently identified fairness metrics, with improved scalability for large datasets and models. Fairness Indicators also supports the analysis of the distribution of datasets and model performance across different user groups as well as the calculation of statistically significant differences on the basis of confidence intervals.

Statistical Approaches and Counterfactual Fairness

Although statistical approaches are easily quantifiable, they can fall short. Fairness cannot be explained by similar misclassification rates alone, especially when all attributes other than the sensitive attribute are ignored. For example, an AI system used for credit scoring may give the same proportion of male and female applicants a positive evaluation – statistical approaches would then judge the model to be fair. But if the male applicants were chosen at random while the female applicants were those with the most savings, fairness would not exist.

Similarity-based measures put the emphasis not on the model results and misclassification rates but rather on the process of decision-making and the use of characteristics in the training process. From this, the idea of “fairness through unawareness” can be derived as a concept of fairness: algorithms are considered fair when protected attributes are excluded from the training data. In our example this means that gender-specific characteristics are not used for the training of the model, so that decisions cannot be based on these characteristics. But this approach also has limitations. Excluding protected attributes is insufficient, as other, unprotected attributes can contain information that correlates with the excluded protected attributes. In this case the excluded attribute is implicitly encoded in other attributes and can indirectly influence the decision process (see the “Testing Counterfactually” box).
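The proxy problem can be made measurable by correlating each remaining feature with the excluded protected attribute. The feature values and the 0.7 threshold below are illustrative assumptions:

```python
# Proxy check sketch: a feature that correlates strongly with the excluded
# protected attribute can reintroduce it implicitly.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

protected = [0, 0, 0, 1, 1, 1]  # e.g. gender, excluded from training

features = {
    "income":    [30, 40, 35, 32, 41, 36],  # weakly related to the attribute
    "part_time": [0, 0, 0, 1, 1, 1],        # perfectly tracks it: a proxy
}

for name, values in features.items():
    r = abs(pearson(values, protected))
    print(name, round(r, 2), "proxy risk" if r > 0.7 else "ok")
```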

Causal reasoning approaches are based on causal inference tools. The definition of counterfactual fairness is based on the intuition that a decision concerning a person is fair when it is the same in the actual world as well as in a counterfactual world in which the person belongs to a different demographic group.

Counterfactual fairness is thus given when a prediction does not change even though the protected attribute is changed to the counterfactual opposite. For example, the decision for or against a financial loan to an individual must be the same when the attribute “male” is changed to “female.”
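A naive version of this check can be scripted: flip the protected attribute and compare the decisions. The two scoring functions are hypothetical stand-ins for trained models; a full counterfactual-fairness analysis would also propagate the flip through causally dependent attributes, which this direct flip ignores:

```python
# Counterfactual check sketch: does the decision change when only the
# protected attribute is flipped?

def fair_model(applicant):
    """Hypothetical credit decision based on savings only."""
    return "approve" if applicant["savings"] >= 10_000 else "reject"

def biased_model(applicant):
    """Deliberately unfair model that uses the protected attribute."""
    return "approve" if applicant["gender"] == "male" else "reject"

def counterfactually_fair(model, applicant, attr="gender"):
    """Compare the decision in the actual and the counterfactual world."""
    counterfactual = dict(applicant)
    counterfactual[attr] = "female" if applicant[attr] == "male" else "male"
    return model(applicant) == model(counterfactual)

applicant = {"gender": "male", "savings": 12_000}
print(counterfactually_fair(fair_model, applicant))    # True
print(counterfactually_fair(biased_model, applicant))  # False
```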

Exposing Unfairness with Manipulated Data

Adversarial testing is a common strategy for exposing weak spots by simulating a malicious attack on a system. The model is given input data whose characteristics have deliberately been slightly manipulated. In this way it can be tested whether the model makes undesirable forecasts with specifically tailored input data. The manipulation of the input data is domain-specific and can be inspired by analyses of algorithmic unfairness. The idea of testing the reaction of a model to input data and quantifying this type of bias can now also be found in frameworks.
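A minimal perturbation test can look like the following sketch. The weighted-sum model, the feature values, and epsilon are invented; the pattern is simply to nudge one input slightly and check whether the decision flips:

```python
# Perturbation test sketch: slightly manipulated inputs should not flip
# the model's decision (model and tolerance are illustrative).

def model(features):
    """Hypothetical scoring model: weighted sum with a fixed threshold."""
    score = 0.6 * features["income"] + 0.4 * features["tenure"]
    return 1 if score >= 0.5 else 0

def perturbation_test(model, features, key, epsilon=0.01):
    """True if small manipulations of one feature leave the prediction
    unchanged (robust), False if the decision flips."""
    original = model(features)
    for delta in (-epsilon, epsilon):
        nudged = dict(features)
        nudged[key] += delta
        if model(nudged) != original:
            return False
    return True

stable = {"income": 0.9, "tenure": 0.8}   # far from the decision boundary
fragile = {"income": 0.5, "tenure": 0.5}  # sits exactly on the boundary

print(perturbation_test(model, stable, "income"))   # True: robust
print(perturbation_test(model, fragile, "income"))  # False: flips
```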

For example, the benchmarking dataset StereoSet measures stereotype bias in language models with regard to gender, race, religion, and profession. Developers can submit trained language models in order to be scored on discriminatory decision-making and at the same time to consider the performance of the language modeling. StereoSet considers the overall performance of a model to be good when it is able to reconcile the conflicting goals of accuracy and fairness and thus ensure an exact understanding of natural language while minimizing bias. Frameworks like this one can serve as a good guideline, but do not replace the individual testing that must be firmly embedded in the review process.

Why Audits Alone Are Not Enough

The different testing processes are important – but on their own they are not enough to guarantee fairness. Audits are usually one of the first options considered for identifying problems. They need to be part of the model governance framework and supplement it. In isolation, however, they have limited explanatory power – only an integrated approach can consider all aspects relevant to fairness. Alongside the validation of functional and non-functional requirements, which the audit processes described here for the testing of fairness can cover, comprehensive documentation is a further important component of the model governance framework.

Documentation should begin as early as the first phase of the ML life cycle – the development phase (for more information see “Practitioners' Guide to MLOps”). The primary aim of the development phase is the construction of a robust and reproducible training procedure consisting of data-processing and model-building steps. This construction process is experimental and iterative, and important information about the data (selection and definition of features, allocation of the training, validation, and test data, schemas, and statistics), the models (different tested models), and the parameters must be retained (experiment tracking).
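At its core, experiment tracking is structured record-keeping per training run. The following in-memory sketch is a stand-in for dedicated tools such as MLflow; all field names and values are illustrative:

```python
# Experiment tracking sketch: retain data, model, parameter, and metric
# information for every training run (in-memory stand-in for real tools).

import time

def log_experiment(store, run_id, data_info, model_name, params, metrics):
    """Append one experiment record to the tracking store."""
    store.append({
        "run_id": run_id,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "data": data_info,      # features, splits, schema, statistics
        "model": model_name,    # which model was tried
        "params": params,       # hyperparameters of this run
        "metrics": metrics,     # evaluation results
    })

store = []
log_experiment(
    store,
    run_id="exp-001",
    data_info={"features": ["income", "tenure"], "train_split": 0.8},
    model_name="logistic_regression",
    params={"C": 1.0},
    metrics={"accuracy": 0.91},
)
print(store[0]["model"], store[0]["metrics"])
```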

After the Training: Evaluation of AI Models

After the construction of the training procedure, the developed models need to be evaluated with regard to functional and non-functional properties (at this point the described audit strategies for the testing of fairness are relevant). The results of the evaluation and all information about the construction process of the training procedure are recorded in the documentation, which should also contain an explanation of the use-case context, a high-level explanation of the algorithm, model parameters, instructions for the reproduction of the model, examples of the training of the algorithms, and examples of the making of predictions through the algorithm.

The documentation can be practically supported by means of toolkits such as model cards and data sheets. Data sheets record which mechanisms or processes were used for the data acquisition or whether ethical review processes (audits) took place. [Model cards](https://arxiv.org/pdf/1810.03993.pdf) complement data sheets and provide information on how the models were created, what assumptions were made during development, and what the expectations are regarding model behavior with different cultural, demographic, or phenotypic groups.

Corporate Policies as the Key to Ethical AI

Full documentation ensures reproducibility and external transparency. After deployment this observability must be present in the productive system. The versioning of models and datasets plays an important role here. The versioning serves to maintain the principle of immutability of the models, so that all models can be reproduced without loss of data or changes. This also ensures that a model prediction can be assigned to the model version that produced it.
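Immutability can be enforced by deriving version identifiers from artifact content: identical bytes always yield the identical version, and every prediction can carry the version that produced it. The serialized artifacts below are placeholders:

```python
# Content-based versioning sketch: hash model and data artifacts so that
# predictions can be traced back to an immutable model version.

import hashlib

def content_version(data: bytes) -> str:
    """Derive a version identifier from the artifact's content."""
    return hashlib.sha256(data).hexdigest()[:12]

model_bytes = b"serialized model weights v1"  # placeholder artifact
dataset_bytes = b"training data snapshot"     # placeholder artifact

prediction_log = {
    "prediction": 1,
    "model_version": content_version(model_bytes),
    "data_version": content_version(dataset_bytes),
}
print(prediction_log["model_version"] == content_version(model_bytes))  # True
```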

At the same time, a monitoring system must continuously monitor the performance of the productive model and summarize and visualize the relevant metrics in a report. The values from the model logging should be processed into metrics and visualized in dashboards for the purposes of logging, analysis, and communication. If the monitoring detects a drop in the performance of a model (model decay), the model must be retrained with new training data and then redeployed.
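The decay check that triggers retraining reduces to comparing live accuracy against the recorded baseline. The tolerance and the outcome windows are illustrative choices:

```python
# Model decay check sketch: flag retraining when live accuracy drops
# more than `tolerance` below the baseline measured at deployment.

def decayed(baseline_accuracy, recent_outcomes, tolerance=0.1):
    """recent_outcomes: booleans (was the prediction correct?)."""
    live_accuracy = sum(recent_outcomes) / len(recent_outcomes)
    return live_accuracy < baseline_accuracy - tolerance

baseline = 0.90                              # accuracy at deployment
healthy_window = [True] * 9 + [False] * 1    # live accuracy 0.9
degraded_window = [True] * 6 + [False] * 4   # live accuracy 0.6

print(decayed(baseline, healthy_window))   # False: keep serving
print(decayed(baseline, degraded_window))  # True: retrain and redeploy
```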

Recommendation: Audits Before Every New Deployment

Before each new deployment, audits should once again take place in order to check for ethical, legal, or societal risks. Ethical principles such as fairness need to be considered at every level of software development, including during data acquisition. Fairness cannot be generated with a single metric or test strategy; it requires a correspondingly aligned corporate policy that acknowledges this. Without model governance, AI systems are unpredictable with regard both to compliance with statutory provisions and to the minimization of business risk.