Training models with unequal economic error costs using Amazon SageMaker

Many companies are turning to machine learning (ML) to improve customer and business outcomes. They use the power of ML models built over "big data" to find patterns and discover correlations. Then they can identify appropriate approaches or predict likely outcomes based on data about new instances. However, as ML models are approximations of the real world, some of these predictions will likely be in error.

In some applications all types of prediction errors are truly equal in impact. In other applications, one kind of error can be much more costly or consequential than another – measured in absolute or relative terms, in dollars, time, or something else. For example, predicting someone does not have breast cancer when they do (a false negative error) will, according to medical estimates, likely have much greater cost or consequences than the reverse error. We may even be willing to tolerate more false positive errors if we sufficiently reduce the false negatives to compensate.

In this blog post, we address applications with unequal error costs, with the goal of reducing undesirable errors while providing greater transparency to the trade-offs being made. We show you how to train a model in Amazon SageMaker for a binary classification problem in which the costs of different kinds of misclassification are very different. To explore this tradeoff, we show you how to write a custom loss function – the metric that evaluates how well a model makes predictions – that incorporates asymmetric misclassification costs. We then show you how to train an Amazon SageMaker Bring Your Own Model using that loss function. Further, we show how to evaluate the errors made by the model and how to compare models trained with different relative costs, so that you can identify the model with the best economic outcome overall.

The advantage of this approach is that it makes an explicit link between an ML model's outcomes and errors and the business's framework for decision-making. This approach requires the business to explicitly state its cost matrix, based on the specific actions to be taken on the predictions. The business can then evaluate the economic consequences of the model predictions on their overall processes, the actions taken based on the predictions, and their associated costs. This evaluation process moves well beyond simply assessing the classification results of the model. This approach can drive challenging discussions in the business, and force differing implicit decisions and valuations onto the table for open discussion and agreement.

Background and solution overview

Although model training always aims to minimize errors, most models are trained to assume that all types of errors are equal. However, what if we know that the costs of different types of errors are not equal? For example, let's take a sample model trained on UCI's breast cancer diagnostic data set.1 Clearly, a false positive prediction (predicting this person has breast cancer, when they do not) has very different consequences than a false negative prediction (predicting this person does not have breast cancer, when they do). In the first case, the result is an extra round of screening. In the second case, the cancer might be at a more advanced stage before it's discovered. To quantify these consequences they are often discussed in terms of their relative cost, which then allows trade-offs to be made. While we can argue about what the exact costs of a false negative or a false positive prediction should be, we believe we'd all agree that they're not the same – although ML models are generally trained as if they are.

We can use a custom cost function to evaluate a model and see the economic impact of the errors the model is making (utility analysis). Elkan2 showed that applying a cost function to the results of a model can compensate for imbalanced samples when used with standard Bayesian and decision tree learning methods (for example: few loan defaults, versus a large sample of repaid loans). The custom function can also be used to perform this same compensation.

We can also have the model "shift" its predictions in a way that reflects the difference in cost, by providing the costs of different types of errors to the model during training using a custom loss function. So, for example, in the breast cancer case we'd like the model to make fewer false negative errors and are willing to accept more false positives to reach that goal. We may even be willing to give up some "correct" predictions in order to have fewer false negatives. At least, we'd like to understand the trade-offs we can make here. In our example, we'll use costs from the healthcare industry.3,4

In addition, we'd like to understand in how many cases the model's predictions are "almost" predicted as something else. For example, binary models use a cutoff (say, 0.5) to classify a score as "True" or "False." How many of our cases were in fact very close to the cutoff? Are our false negatives classified that way because their score was 0.499999? These details can't be seen in the usual representations of confusion matrices or AUC measures. To help address these questions, we have developed a novel, graphical representation of the model predictions that allows us to examine these details, without depending on a specific threshold.

In fact, there are likely cases where a model trained to avoid specific types of errors would begin to specialize in differentiating those errors. Imagine a neural network that's been trained to believe that all misrecognitions of street signs are equal.

Now, imagine a neural network that's been trained that misrecognizing a stop sign as a sign for speed limit 45 mph is a far worse error than confusing two speed limit signs. It's reasonable to expect that the neural network would begin to recognize different features. We believe this is a promising research direction.

We use Amazon SageMaker to build and host our model. Amazon SageMaker is a fully managed platform that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. We author and analyze the model in a Jupyter notebook hosted on an Amazon SageMaker notebook instance, then build and deploy an endpoint for online predictions, using its "Bring Your Own Model" capability.

Note that while the terms "cost function" and "loss function" are frequently used interchangeably, we differentiate between them in this post, and provide examples of each:

  • We use a "loss function" to train the model. Here, we specify the different weights of different kinds of errors. The relative weight of the errors is of most importance here.
  • We use a "cost function" to evaluate the economic impact of the model. For the cost function, we can specify the cost (or value) of correct predictions, as well as the cost of errors. Here, dollar costs are most appropriately used.

This distinction allows us to further refine the model's behavior or to reflect differing influences from different constituencies. Although in this model we'll use the same set of costs (quality adjusted life years, QALY) for both functions, you could, for instance, use relative QALYs for the loss function, and costs of providing care for the cost function.

We'll break this problem into three parts:

  1. In "Defining a custom loss function," we show how to build a custom loss function that weights different errors unequally. The relative costs of the prediction errors are provided as hyperparameters at runtime, allowing the effects of different weightings on the model to be explored and compared. We build and demonstrate the use of a custom cost function to evaluate our "vanilla" model, which is trained to assume that all errors are equal.
  2. In "Training the model," we demonstrate how to train a model by using the custom loss function. We emulate and extend a sample notebook, which uses the UCI breast cancer diagnostic data set.
  3. In "Analyzing the results," we show how we can compare the models to better understand the distribution of predictions as compared to our threshold. We'll see that by training the model to avoid certain kinds of errors, we'll affect the distributions so that the model differentiates more effectively between its positive and negative predictions.

We are building our own model and not using one of the Amazon SageMaker built-in algorithms. This means that we can make use of the Amazon SageMaker ability to train any custom model as long as it's packaged in a Docker container with the image of that container available in Amazon Elastic Container Registry (Amazon ECR). For details on how to train a custom model on Amazon SageMaker, see this post or the various sample notebooks available.


To set up the environment necessary to run this example in your own AWS account, first follow Steps 0 and 1 in this previously published blog post to set up an Amazon SageMaker instance. Then, as in Step 2, open a terminal to clone our GitHub repo into your Amazon SageMaker notebook instance.

The repo contains a directory named "container" that has all the components necessary to build and use a Docker image of the algorithm we run in this blog post. You can find more information on the individual components in this Amazon SageMaker sample notebook. For our purposes, there are two files that are most relevant and contain all the information to run our workload.

  1. custom_loss/Dockerfile. This file describes how to build your Docker container image. Here you can define the dependencies of your code (for instance, which language you are using, such as Python), what packages your code needs (for example, TensorFlow), and so on. More details can be found here.
  2. custom_loss/train. This file is executed when Amazon SageMaker runs the container for training. It contains the Python code that defines the binary classifier model, the custom loss function used to train the model, and the Keras training job. We describe this code in more detail later.

The notebook then imports libraries, creates some helper functions, imports the breast cancer data set, standardizes it, and exports training and test sets to Amazon S3 for later use by Amazon SageMaker training.
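That preparation step can be sketched as follows (the function name and the export mechanism are illustrative, not necessarily the notebook's own):

```python
import numpy as np

def standardize(features):
    """Scale each feature column to zero mean and unit variance, as is
    typical before training a logistic-regression-style classifier."""
    return (features - features.mean(axis=0)) / features.std(axis=0)
```

The standardized train/test splits are then written out (for example with numpy.savetxt) and uploaded to S3 for the SageMaker training job to consume.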

Defining a custom loss function

We now construct a loss function that weighs false positive errors differently from false negative ones. To do this, we build a binary classifier in Keras, to use Keras' ability to accommodate user-defined loss functions.

To build a loss function in Keras, we define a Python function that takes model predictions and ground truth as arguments and returns a scalar value. In the custom function, we input the cost associated with a false negative error (fn_cost) and with a false positive error (fp_cost). Note that internally the loss function must use Keras backend functions to perform any calculations.

The following function defines the loss of a single prediction as the difference between the prediction's ground-truth class and the predicted value, weighted by the cost associated with misclassifying an observation from that ground-truth class. The total loss is the unweighted average of all of these losses. This is a relatively simple loss function, but building upon this foundation, more complex, situation-specific benefit and cost structures can be constructed and used to train models.
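A sketch of such a loss function follows; the wrapper name is our own, and the executable version lives in custom_loss/train:

```python
import tensorflow as tf
from tensorflow.keras import backend as K

def custom_loss_wrapper(fn_cost=1.0, fp_cost=1.0):
    """Return a Keras loss that weights each observation's error by the
    cost of misclassifying its ground-truth class."""
    def custom_loss(y_true, y_pred):
        # Observations whose ground truth is positive can only produce
        # false negatives; negatives can only produce false positives.
        weights = y_true * fn_cost + (1.0 - y_true) * fp_cost
        # Weighted per-observation loss, then an unweighted average.
        return K.mean(weights * K.abs(y_true - y_pred))
    return custom_loss
```

With fn_cost = fp_cost = 1.0 this reduces to an ordinary mean absolute error, so the asymmetry comes entirely from the two weights.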

Training the model

Since we are using Amazon SageMaker to train a custom model, all of the code related to building and training the model is located in a Docker container image stored in Amazon ECR. The code shown here is an example of the code contained in the Docker container image.

The files containing the actual model code (and custom loss function, mirroring the copy shown earlier) as well as all the files necessary to create the Docker container image and push it to Amazon ECR are located in the repository associated with this blog post.

We construct and train three models so that we can compare the predictions of models trained using Keras' built-in loss function as well as our custom loss function. We use a binary classification model that predicts the probability that a tumor is malignant.

The three models are:

  1. A binary classification model that uses Keras' built-in binary cross-entropy loss with equal weights for false negative and false positive errors.
  2. A binary classification model that uses the custom loss function defined previously, with false negatives weighted 5 times as heavily as false positives.
  3. A binary classification model that uses the custom loss function defined previously, with false negatives weighted 200 times as heavily as false positives.

The costs used in the final model's loss function are based upon the medical literature.3,4 The costs of screening are measured in QALYs. One QALY is defined as 1 year of life in full health (1 year x 1.0 health). For example, if an individual is at half health, that is, 0.5 of full health, then 1 year of life for that individual is equal to 0.5 QALYs (1 year x 0.5 health). Two years of life for that individual is worth 1 QALY (2 years x 0.5 health).


Outcome          Cost (QALYs)
True Negative    0 (baseline)
False Positive   -0.01288
True Positive    -0.3528
False Negative   -2.52

Here, a true negative outcome is measured as the baseline of costs, that is, all other costs in the table are measured relative to a patient without breast cancer who tests negative for breast cancer. A woman with breast cancer who tests negative loses 2.52 QALYs relative to the baseline, and a woman without breast cancer who tests positive loses 0.0128767 QALYs (about 4.7 days) relative to the baseline. A QALY has an estimated economic value of $100,000 USD, so these values can also be translated into dollar costs by multiplying the cost in QALYs by $100,000 USD. Given these values, a false negative error is about 200 times more costly than a false positive one. See the medical literature referenced in the introduction for more detail surrounding these costs.

The middle model's value of 5 was chosen for demonstration purposes.

With these costs in hand, we can now estimate the model. Estimating the parameters of a model in Keras is a three-step process:

  1. Defining the model.
  2. Compiling the model.
  3. Training the model.

Defining the model architecture

First, we define the structure of the model. In this case, the model consists of a single node in a single layer. That is, for each model that follows, we add a single Dense layer with a single unit that takes a linear combination of the features and passes that linear combination to a sigmoid function that outputs a value between 0 and 1. Again, the actual executable version of the code is in the Docker container, but it is shown here for illustrative purposes. We'll provide the relative weights in a later step.
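Such a one-unit model (effectively logistic regression) can be sketched as follows; the feature count of 30 matches the UCI breast cancer diagnostic data set:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

n_features = 30  # the UCI breast cancer diagnostic data set has 30 features

# A single Dense unit computes a linear combination of the features and
# passes it through a sigmoid, yielding a probability between 0 and 1.
model = Sequential()
model.add(Dense(1, input_dim=n_features, activation='sigmoid'))
```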

Compiling the model

Next, let's compile the models. Compiling a model refers to configuring the learning process. We need to specify the optimization algorithm and the loss function that we will use to train the model.

This is the step in which we incorporate our custom loss function and relative weights into the model training process.
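A sketch of that compile step, with a compact stand-in for the cost-weighted loss described in the previous section (the names and the choice of the Adam optimizer are illustrative):

```python
from tensorflow.keras import backend as K
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def weighted_loss(fn_cost, fp_cost):
    # Stand-in for the custom loss: weight each error by the cost of
    # misclassifying its ground-truth class, then average.
    def loss(y_true, y_pred):
        weights = y_true * fn_cost + (1.0 - y_true) * fp_cost
        return K.mean(weights * K.abs(y_true - y_pred))
    return loss

model = Sequential([Dense(1, input_dim=30, activation='sigmoid')])

# Compiling configures the learning process: optimizer plus loss. Swapping
# in the custom loss is the only change versus loss='binary_crossentropy'.
model.compile(optimizer='adam',
              loss=weighted_loss(fn_cost=5.0, fp_cost=1.0),
              metrics=['accuracy'])
```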

Training the model

Now we're ready to train the models. To do this, we call the fit method and provide the training data, number of epochs, and batch size. Whether you use a built-in or a custom loss function, the code is the same in this step.
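The fit call itself looks like this; the synthetic data, epoch count, and batch size below are placeholders standing in for the notebook's actual training set:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Placeholder data standing in for the standardized training set.
rng = np.random.default_rng(0)
x_train = rng.random((64, 30)).astype('float32')
y_train = (x_train[:, 0] > 0.5).astype('float32')

model = Sequential([Dense(1, input_dim=30, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

# The fit call is identical whether the compiled loss was built-in or custom.
history = model.fit(x_train, y_train, epochs=2, batch_size=16, verbose=0)
```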


Building the Docker image

We then execute a shell script that builds the Docker image containing the custom loss function and model code, and pushes the image to Amazon Elastic Container Registry (Amazon ECR). The "image_name" defined at the top of this notebook is the name that will be assigned to the repository in ECR that contains this image.

As mentioned previously, we perform the actual training of the binary classifier by packaging the model definition and training code in a Docker container and using the Amazon SageMaker bring-your-own-model training functionality to estimate the model's parameters.

The following code blocks train three versions of the classifier:

  1. One with Keras' built-in loss function.
  2. One with a custom loss function that weighs false negatives 5 times more heavily than false positives.
  3. One with a custom loss function that weighs false negatives 200 times more heavily than false positives.

We first create and execute an Amazon SageMaker training job for the built-in loss function, that is, Keras's binary cross-entropy loss function. By passing the loss-type hyperparameter set to builtin, Amazon SageMaker knows to use Keras's binary cross-entropy loss function. Then, we create and execute an Amazon SageMaker training job for the custom 5:1 loss function, that is, the custom loss with false negatives being 5 times more costly than false positives. By passing the loss-type hyperparameter set to custom and the false negative and false positive cost hyperparameters set to 5 and 1, respectively, Amazon SageMaker knows to use the custom loss function with the specified misclassification costs.

Finally, we create and execute the medical model the same way, but with the false negative and false positive costs set to 200 and 1, respectively.
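Inside the container, the train script reads those hyperparameters (SageMaker places them, as strings, at /opt/ml/input/config/hyperparameters.json) and selects the loss accordingly. A sketch, with illustrative hyperparameter names rather than the repo's actual ones:

```python
def select_loss_params(hyperparameters):
    """Map SageMaker hyperparameters (all strings) to loss settings.
    The hyperparameter names here are illustrative."""
    loss_type = hyperparameters.get('loss_function_type', 'builtin')
    if loss_type == 'builtin':
        # Keras's built-in binary cross-entropy loss.
        return {'loss': 'binary_crossentropy'}
    # Custom loss with the specified misclassification costs.
    return {'loss': 'custom',
            'fn_cost': float(hyperparameters.get('fn_cost', 1)),
            'fp_cost': float(hyperparameters.get('fp_cost', 1))}
```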

After training the model, Amazon SageMaker uploads the trained model artifact to the Amazon S3 bucket we specified in the output_path parameter of the training jobs. We now load the model artifacts from S3 and make predictions with all three model variants, then compare results.

Analyzing the results

What characteristics are we generally looking for in a well-performing model?

  1. There should be a small number of false negatives, that is, a small number of malignant tumors classified as benign.
  2. Predictions should cluster closely around ground truth values, that is, predictions should cluster closely around 0 and 1.

Keep in mind as you rerun this notebook that the data set used is small (569 instances), and therefore the test set is even smaller (143 instances). Because of this, the exact distribution of predictions and prediction errors of the model may vary from run to run due to sampling error. Despite this, the following general results hold across model runs.

Accuracy and the ROC Curve

First, we'll show traditional measures of the model.

By these traditional measures, the two custom loss function models do not perform as well as the built-in loss function (by a small margin).

However: accuracy is less relevant in judging the quality of these models. In fact, accuracy may be lowest in the "best" model, because we are willing to have more false positives as long as we decrease the number of false negatives sufficiently.

Looking at the ROC curve and AUC score for these three models, all models appear very similar according to these measures. However, neither of these metrics shows us how the predicted scores are distributed over the [0, 1] interval, so we are not able to determine where those predictions are clustered.

Classification report

Keep in mind that the cost of a false negative increases as we move through these three models. That implies that the number of false negatives is likely to decrease in each successive model.

What does this imply for the values in these classification reports? It implies that the negative class (benign) should have higher precision and that the positive class (malignant) should have higher recall. (Remember that precision = tp / (tp + fp); recall = tp / (tp + fn).)

Recall that for our classification problem we are classifying tumors as benign or malignant. According to the costs reported in the medical literature cited previously, a false negative is much more costly than a false positive. Because of that, we want to classify all malignant tumors as such, and are not bothered by that resulting in more false positive predictions (to a point). Therefore, for the negative class (benign), we care more about having a high precision, and for the positive class (malignant), we care more about having a high recall.
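For reference, the two formulas above in code form (a trivial helper of our own, not from the repo):

```python
def precision_recall(tp, fp, fn):
    """precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    return tp / (tp + fp), tp / (tp + fn)
```

For example, 8 true positives, 2 false positives, and 4 false negatives give a precision of 0.8 and a recall of about 0.67.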

Built-in Loss Function

              precision    recall  f1-score   support
0                  0.95      0.99      0.97        88
1                  0.98      0.91      0.94        55
avg/total          0.96      0.96      0.96       143

5:1 Custom Loss Function

              precision    recall  f1-score   support
0                  0.96      0.93      0.95        88
1                  0.90      0.95      0.92        55
avg/total          0.94      0.94      0.94       143

Medical Custom Loss Function

              precision    recall  f1-score   support
0                  0.99      0.77      0.87        88
1                  0.73      0.98      0.84        55
avg/total          0.89      0.85      0.86       143

These classification reports show that we've achieved our goal: the medical model has the highest precision for benign, and the highest recall for malignant.

What this implies is that when using the medical model, we are least likely to falsely classify a malignant tumor as benign, and we are most likely to identify all malignant tumors as malignant.

Looking at the detail of these reports allows us to see that the medical model is the "best" of these three models, despite having the lowest F1-score, and the lowest average precision and recall.

Confusion Matrix

To better understand the errors, a better tool is the confusion matrix.

Since our goal is to reduce the number of false negatives, the model with the fewest false negatives is “best,” provided that the increase in false positives is not excessive.

As we move through these three confusion matrices, the cost of a false negative relative to a false positive increases. As such, we expect the number of false negatives to decrease and the number of false positives to increase. However, the number of true positives and true negatives might also shift, because we're training the model to weight differently than before.

Built-in Loss Function

                    Predicted Negatives    Predicted Positives
Actual Negatives            –                      –
Actual Positives            –                      –

5:1 Custom Loss Function

                    Predicted Negatives    Predicted Positives
Actual Negatives            –                      –
Actual Positives            –                      –

Medical Custom Loss Function

                    Predicted Negatives    Predicted Positives
Actual Negatives            –                      –
Actual Positives            –                      –
We can see from the results that modifying the loss function values provided when training the model allows us to shift the balance between the categories of error. Using different weightings for the relative cost has a significant impact on the errors, and moves some of the other predictions as well. An interesting direction for future research is to explore the changing features that are identified by the model in support of these prediction differences.

This gives us a powerful lever to influence the model based on the moral, ethical, or economic impacts of the decisions we make about the relative weights of the different errors.

Custom confusion matrix

A specific observation is classified as positive or negative by comparing its score to a threshold. Intuitively, the farther away the score is from the chosen threshold, the higher is the assumed probability that the prediction is correct (assuming that the threshold value is well-chosen).

When comparing the model's prediction and the threshold used for dividing classes, it's possible that the values are very close to the threshold. In the extreme, the difference in values between a "true" and a "false" could be less than the error between two different readings of an input sensor or measurement, or even less than the rounding error of a floating point library. In the worst case, the majority of the scores for our observations could be clustered quite close to the threshold. These "close confusions" are not visible in the confusion matrix, or in the previous F1-scores or ROC curves.


Intuitively, it's desirable to have the bulk of the scores farther away from the threshold, or, conversely, to place the threshold based on gaps in the distribution of scores. (In cartography, for example, the Jenks natural breaks method is frequently used to address the same problem.) The following graphs give us a tool to explore the relationship of the scores to the threshold.

Each of the following sets of distribution plots shows the actual scores for each sample in the confusion matrix. In each set, the top histogram plots the distribution of predicted scores for all actual negatives, that is, predicted scores for benign tumors (the top row of the confusion matrix). The bottom histogram plots predicted scores for actual positives (the bottom row).

The correctly classified observations on each plot are colored blue, and the incorrectly classified observations are colored orange. The threshold value of 0.5, used in other functions in this notebook, is used for coloring the plots. However, this threshold choice does NOT affect the actual scores or the shape or level of the plots, only the coloring. Another threshold could be chosen, and the results in this section would still hold.

In the charts below, a "good" distribution is one in which the predictions are largely grouped around the 0 and 1 points. More specifically, a "good" set of histograms would have the actual positives largely clustered around 1 with few false negatives, that is, few orange points. We would like to see the actual negatives clustered around 0, but for this use case we are willing to accept a prediction spread over the support with false positives, as long as this gets us a small number of false negatives, with predictions clustered around 1 for the actual positives.
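These plots can be produced with a few lines of matplotlib; this sketch (the function name and styling are our own, not the notebook's) colors each histogram by correctness at the chosen threshold:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt

def plot_score_distributions(y_true, y_score, threshold=0.5):
    """Histogram predicted scores separately for actual negatives (top)
    and actual positives (bottom), coloring correct predictions blue and
    errors orange. The threshold affects only the coloring."""
    fig, axes = plt.subplots(2, 1, sharex=True)
    bins = np.linspace(0.0, 1.0, 51)
    for ax, cls, label in zip(axes, [0, 1],
                              ['actual negatives', 'actual positives']):
        scores = y_score[y_true == cls]
        correct = scores < threshold if cls == 0 else scores >= threshold
        ax.hist(scores[correct], bins=bins, color='tab:blue')
        ax.hist(scores[~correct], bins=bins, color='tab:orange')
        ax.set_ylabel(label)
    axes[1].set_xlabel('predicted score')
    return fig
```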

We can see from these plots that as we increase the ratio, the distributions shift. As the ratio between the error types increases, the model pushes a larger number of samples to the extremes, essentially becoming much more discriminatory.

We can also see that few of our actual positives have scores close to the cutoff. The model is demonstrating increased "certainty" in its classifications of the positives.

Expected value

We now calculate the expected value (economic value) of each of the three classification models. The expected value captures the probability-weighted loss, expressed in US dollars, that an individual patient is expected to suffer if given a specific diagnostic test. The diagnostic test with the highest expected value is considered the "best" under this metric.

For an explanation of QALY and the dollar values associated with the testing outcomes defined in the following cell, see the discussion of screening costs earlier in this blog post.

Note that this section reflects the value of all four possible test outcomes: true and false negatives, as well as true and false positives.
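A sketch of that calculation, using the QALY costs from the table above valued at $100,000 per QALY (the helper name and the 0.5 threshold default are our own):

```python
import numpy as np

QALY_VALUE_USD = 100_000
# QALY cost of each outcome, from the table earlier in the post.
outcome_costs = {'tn': 0.0, 'fp': -0.01288, 'tp': -0.3528, 'fn': -2.52}

def expected_value_usd(y_true, y_score, threshold=0.5):
    """Probability-weighted dollar value per patient of a classifier,
    given the four outcome costs."""
    labels = (y_score >= threshold).astype(int)
    n = len(y_true)
    counts = {
        'tn': np.sum((y_true == 0) & (labels == 0)),
        'fp': np.sum((y_true == 0) & (labels == 1)),
        'fn': np.sum((y_true == 1) & (labels == 0)),
        'tp': np.sum((y_true == 1) & (labels == 1)),
    }
    return sum(counts[k] / n * outcome_costs[k] * QALY_VALUE_USD
               for k in counts)
```

For instance, a test set with one correctly classified negative and one correctly classified positive has an expected value of 0.5 x 0 + 0.5 x (-0.3528 x $100,000) = -$17,640.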

Model      Expected Value
Builtin    -$19,658.34
5:1        -$18,160.83
Medical    -$15,282.86

The binary classifier trained with the custom loss function based upon the costs reported in the medical literature is the least costly of the three.

Note that while we used QALY values to train the model and also to evaluate the economic cost, it is not necessary to use the same values for both. This provides a second powerful lever to use in influencing, understanding, and evaluating the model.

Now that we have demonstrated how to train the classifier with a custom loss function and inspected the results, feel free to experiment with different relative values for FN and FP, and with different costs, and explore the impact.


In the example worked in this blog post, we've shown how to use a custom loss function to change the balance of FN and FP errors. We've shown that we can impact that balance separately from the costs of the different kinds of treatment plans applied to each set of predictions.

There are several ways in which this work can be extended. Promising avenues include:

  • Using the relative loss and the economic costs as hyperparameters and exploring the hyperparameter space to find optimal trade-offs.
  • Exploring different or more complex cost functions, including making the costs dependent on specific features within an observation.
  • Further exploration and understanding of how the model changes with different relative costs, and the features most relevant to those changes.

We've shown the approach applied to a binary classification problem; however, it is generalizable to multiclass classification problems. Designing an input cost matrix that accurately (or adequately) reflects the costs of different kinds of errors or misclassifications is more challenging. For example, identifying a stop sign as a 45 mph sign will probably not have the same cost or consequences as the opposite error. Nevertheless, a model trained with this understanding could provide a better overall economic value than one trained to simply maximize precision, recall, F1-score, or AUC.

In this blog post, we've shown the power of using a custom loss function to represent the true impacts of different kinds of errors. The custom loss function allows us to choose the relative balance of the types of errors made by the model, and to evaluate the economic impact of changing that balance. Visualizing the resulting score distributions lets us evaluate the discriminatory power of the model. We can also evaluate the costs and tradeoffs of different approaches for the different predictions and their errors. This combination gives the business a powerful new tool to link machine learning to business results, providing greater transparency to the trade-offs being made.


  1. Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository []. Irvine, CA: University of California, School of Information and Computer Science.
  2. Elkan, Charles. "The Foundations of Cost-Sensitive Learning." In International Joint Conference on Artificial Intelligence, 17:973–978. Lawrence Erlbaum Associates Ltd, 2001.
  3. Wu, Yirong, Craig K. Abbey, Xianqiao Chen, Jie Liu, David C. Page, Oguzhan Alagoz, Peggy Peissig, Adedayo A. Onitilo, and Elizabeth S. Burnside. "Developing a Utility Decision Framework to Evaluate Predictive Models in Breast Cancer Risk Estimation." Journal of Medical Imaging 2, no. 4 (October 2015).
  4. Abbey, Craig K., Yirong Wu, Elizabeth S. Burnside, Adam Wunderlich, Frank W. Samuelson, and John M. Boone. "A Utility/Cost Analysis of Breast Cancer Risk Prediction Algorithms." Proceedings of SPIE–the International Society for Optical Engineering 9787 (February 27, 2016).
  5. Eykholt, Kevin, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. "Robust Physical-World Attacks on Deep Learning Models." ArXiv:1707.08945 [Cs], July 27, 2017.

About the Authors

Veronika Megler, PhD, is a senior consultant for AWS Professional Services. She enjoys adapting innovative big data, AI, and ML technologies to help customers solve new problems, and to solve old problems more efficiently and effectively.

Scott Gregoire is a Data Scientist with AWS Professional Services. He holds a PhD in Economics from the University of Texas at Austin and has advised clients in sectors ranging from international finance to retail. Currently, he is working with customers to develop innovative machine learning solutions on AWS.