Say you’re a baseball talent scout. One prospect has a .300 batting average, while the other bats a .275. You might think that the first player would be more of a team asset. But that might not be the case if the .300 player makes appreciably more fielding errors than the .275 player. In other words, while his accuracy as a hitter is important, it’s not the only metric worth consideration. The same logic applies to machine learning models.
Given that you’ll be making critical business decisions based on their output, the models need to make an acceptable number of correct predictions. But a model that produces a solid percentage of correct predictions or classifications can also produce an unacceptable number of false positives — and as a result, it might not be deemed reliable enough for use.
In other words, accuracy isn’t the be-all and end-all of model metrics. Rather, it is one of several metrics that together determine a model’s reliability and utility. Below, we’ll explain exactly what model reliability is, the problems that can arise from using an unreliable model, some primary causes of unreliability, and, most importantly, how to detect and fix those issues.
What is model reliability?
A reliable machine learning model is one that, when applied to different data sets over time, consistently generates results that are accurate, precise, and trustworthy. Data scientists use multiple criteria and metrics to determine model reliability:
- Accuracy is, of course, a major criterion. If a model is designed to sort fraudulent vs legitimate credit-card transactions, and in a data set of 10,000 transactions, it sorted 9,000 correctly, its accuracy rate would be 90%.
- Precision, also known as the positive predictive value, is the percentage of the model’s positive predictions that are actually correct, so it captures how many false positives the model produces. The more false positives, such as customers being flagged as at risk of churn when they’re not, the lower the precision.
- Recall, also known as the true positive rate, takes into account false negatives as well as true positives. It’s especially important in situations where false negatives have serious ramifications, such as medical tests.
- Sometimes, high precision comes at the cost of low recall, and vice versa. An F1 score indicates the balance (the harmonic mean, to be exact) of precision and recall.
- Another measure of the balance between false positives and false negatives, AUPRC (the area under the precision-recall curve) summarizes how precision and recall trade off as the model’s classification threshold changes; a quick sketch of how these metrics can be computed follows this list.
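As a rough illustration, here’s how those metrics can be computed with scikit-learn on a small, made-up set of fraud predictions. The labels and scores below are hypothetical, and `average_precision_score` stands in as a standard summary of the precision-recall curve:

```python
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    average_precision_score,
)

# Hypothetical ground truth and model output for ten transactions
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]   # 1 = fraud, 0 = legitimate
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]   # hard labels from the model
y_score = [0.9, 0.2, 0.1, 0.4, 0.3, 0.8, 0.6, 0.2, 0.7, 0.1]  # predicted probabilities

print("Accuracy: ", accuracy_score(y_true, y_pred))    # share of all predictions that were correct
print("Precision:", precision_score(y_true, y_pred))   # share of predicted positives that were truly positive
print("Recall:   ", recall_score(y_true, y_pred))      # share of actual positives the model caught
print("F1 score: ", f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print("AUPRC:    ", average_precision_score(y_true, y_score))  # summary of the precision-recall curve
```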
Once a model attains an acceptable percentage and balance of accuracy, precision, and recall, it needs to repeatedly generate acceptable results with both the same and new data under multiple scenarios. A model that performs well when the number of positives is close to the number of negatives, for instance, might not yield acceptable results when the data set consists largely of positives or largely of negatives.
The consequences of an unreliable model
The most immediate and obvious issues resulting from an unreliable ML model are the flawed decisions made based on incorrect outputs. Say a marketing department built a model to identify customers most likely to make repeat purchases without the incentive of a discount. Although the model performed well in tests, when fed new data sets, it mistakenly identified many loyal repeat customers as needing a discount to convert. Using those results, the marketers sent discount codes to customers who would have bought regardless. Not only did response rates fail to improve as the marketers had hoped, but the company also gave up margin on loyal customers who would have paid full price.
Beyond costing companies money, in industries such as healthcare and energy management, erroneous outputs can result in misdiagnoses, power outages, and other life-threatening safety issues. For example, consider a model designed to predict breast cancer. If two out of every thousand patients actually have cancer, a model that labels every patient cancer-free is still 99.8% accurate, yet it misses every case it was built to catch, with life-threatening implications. Reputational damage, regulatory compliance failures, and discriminatory outcomes from biased models are other potential risks of using unreliable models.
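To make that concrete, here’s a small hypothetical sketch of the same failure mode: a “model” that labels every patient cancer-free scores 99.8% on accuracy while its recall, the metric that actually matters here, drops to zero.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical screening results: 2 positive cases in 1,000 patients
y_true = [1] * 2 + [0] * 998
y_pred = [0] * 1000  # a "model" that labels every patient cancer-free

print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.998 -- looks excellent
print("Recall:  ", recall_score(y_true, y_pred))    # 0.0   -- misses every real case
```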
Factors that threaten model reliability
We know that a model is only as good as its data — aka garbage in, garbage out. Training data with missing, inconsistent, or incorrect values, as well as data sets that are too small, will result in unreliable models.
Sometimes, though, the problem isn’t in the quality of the data but in the design or deployment of the ML model itself. Below are a few common issues impacting model reliability and how to remedy them:
Overfitting
An overfitted model is similar to a medical student who knows the names of all 206 bones in the human body but not their location or function. The model memorized the training data and the correct outputs, but it never learned the underlying relationship between inputs and outputs well enough to apply it to new data sets, nor did it learn how to identify data outliers. Overfitted models tend to be overly complex, incorporating noise (irrelevant data and attributes) in their predictions. This often happens when a model is trained on too-small data sets or data with too many attributes.
To detect overfitting before a model is deployed, Pecan’s Predictive GenAI platform compares performance across training, validation, and testing sets; it then flags the model if its metrics differ by more than 10% between those sets. Users can then retrain the model on larger data sets or remove irrelevant attributes from the data.
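The snippet below is a simplified, hypothetical version of that kind of check, not Pecan’s actual implementation: it compares evaluation scores across the three splits (placeholder numbers here) and flags the model when any split trails training by more than 10 percentage points.

```python
# Hypothetical evaluation scores for the same model on three splits
scores = {"training": 0.94, "validation": 0.82, "testing": 0.80}

# Flag the model if any split trails the training score by more than 10 points
threshold = 0.10
gaps = {split: scores["training"] - value for split, value in scores.items()}
overfit = any(gap > threshold for gap in gaps.values())

print(f"Score gaps vs. training: {gaps}")
if overfit:
    print("Possible overfitting: retrain on more data or drop noisy attributes.")
```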
Underfitting
Interestingly enough, training a model on data sets that are too small can result in underfitting as well as overfitting. (At Pecan we recommend data sets of at least 1,000 entries and more than 10 attributes.) Whereas an overfitted model is too complex and noisy, an underfitted one is too simple. It hasn’t been introduced to enough data to be able to repeatedly detect predictive patterns among data points, nor has it learned to incorporate enough relevant attributes to create reliable outputs.
If a model’s outputs are not significantly better than the results of random guessing, there’s a good chance it’s underfitted. Adding more records to the training data sets, as well as more attributes, should rectify the problem.
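One simple way to run that sanity check is to compare the model against a naive baseline, such as scikit-learn’s `DummyClassifier`, which predicts the most frequent class. The data and model below are placeholders for your own:

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Placeholder data standing in for your training set
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("Baseline accuracy:", baseline.score(X_test, y_test))
print("Model accuracy:   ", model.score(X_test, y_test))
# If the model barely beats the baseline, suspect underfitting:
# add more records and more informative attributes, or try a richer model.
```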
Data leakage
Imagine you’re taking a multiple-choice test, and the answers are printed on the reverse side of the page. The paper is thin enough that you can read the answers even without flipping the page. That’s a sort of data leakage: information that shouldn’t be available making its way to someone or to some ML model that shouldn’t have access to it. Another example: You’ve built a model to predict whether customers will respond to certain types of emails, and in your training data set, you failed to remove the attribute indicating which customers responded. The model is certainly going to have a high F1 score on this data — but not when it needs to make predictions in a real-world situation without that attribute.
If all examples of data leakage were that obvious, data leakage wouldn’t be such a common problem. Often, it’s a much sneakier attribute causing the leakage, including data points that have been updated after the desired moment of prediction. When trying to predict the effectiveness of a promotion sent on a specific date, for example, make sure none of the data points were updated after that date; in other words, be sure to use data that is true at the moment of prediction and not after.
Models with suspiciously high accuracy, precision, or recall should be checked for data leakage, especially if any single feature or attribute accounts for 50% or more of the model’s feature importance. Ensuring that every field in your data set has a documented creation and update date that jibes with the desired moment of prediction can help prevent leakage.
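As a quick, hypothetical illustration of that 50% rule of thumb, you can inspect a trained model’s feature importances and flag any single feature that dominates (the data and model here are placeholders):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Placeholder data; in practice this would be your training set
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Flag any feature carrying 50% or more of the total importance
for index, importance in enumerate(model.feature_importances_):
    if importance >= 0.5:
        print(f"Feature {index} holds {importance:.0%} of the importance;"
              " check it for leakage (e.g., values updated after the prediction date).")
```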
Data drift
An easy way to understand data drift is to imagine you are a travel business in 2020. Your input data regarding consumers’ likelihood to fly overseas will change radically between January and April due to the pandemic lockdowns. Because the newer data is so different from the data it’s been trained on, the business’s ML model will no longer make reliable predictions.
In short, data drift is a change in the statistical properties of data and in the relationships among data points over time. Less dramatic examples come from seasonal businesses: a sales prediction model for a retailer of Christmas decor that was reliable in November might not be so reliable in April.
Regularly monitoring a model for performance degradation is key to identifying drift. Models within volatile industries and scenarios, such as those related to the stock market, seasonality, and consumer trends, typically require more frequent updates, as do models that rely heavily on time-based data.
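One common way to monitor for drift, and not necessarily how Pecan does it under the hood, is to compare the distribution of a feature in recent production data against the training data, for example with a Kolmogorov-Smirnov test. The data below is simulated for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=7)

# Simulated values of one feature at training time vs. in recent production data
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
production_feature = rng.normal(loc=0.5, scale=1.2, size=5000)  # distribution has shifted

statistic, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.01:
    print(f"Drift detected (KS statistic={statistic:.3f}); consider retraining the model.")
else:
    print("No significant drift detected for this feature.")
```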
Bias
In models and people alike, bias results from assumptions. Types of machine learning bias include:
- Selection bias happens when the training data does not statistically represent the population the model will encounter in real-world scenarios; for instance, college students make up only 5% of the training data but 40% of the population the model will actually serve.
- Measurement bias results from the data itself not being accurate, even if it is representative of the population.
- Algorithmic bias happens when preferences or assumptions are built into the data and the training process. A facial recognition tool that was never trained to distinguish among dark-skinned individuals is an example.
Regularly evaluating data, collaborating with AI ethics specialists, and using fairness assessment tools can help combat bias.
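As a simple, hypothetical check for selection bias like the college-student example above, you can compare each group’s share of the training data against its expected share of the population the model will serve:

```python
# Hypothetical group proportions: share in training data vs. expected share in production
training_share = {"college_students": 0.05, "other_customers": 0.95}
expected_share = {"college_students": 0.40, "other_customers": 0.60}

for group, expected in expected_share.items():
    actual = training_share.get(group, 0.0)
    if abs(actual - expected) > 0.10:  # arbitrary 10-point tolerance for illustration
        print(f"{group}: {actual:.0%} of training data vs. {expected:.0%} expected;"
              " consider resampling or collecting more representative data.")
```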
Outliers
A business that made $100,000 from 20 sales might assume its average order value is $5,000 — and it would be correct. But suppose one of those orders was $50,000; the remaining 19 orders were in the $2,000-$3,000 range. Taking that outlier $50,000 order into account would make a big difference when predicting subsequent performance and determining marketing spend. A reliable model knows to identify outliers as noise rather than as elements of an underlying pattern.
Numerous types of graphs, cluster charts, and mathematical equations help detect and minimize the effect of outliers in ML models. At Pecan, we use the Root Mean Squared Logarithmic Error (RMSLE) metric in our regression models to compensate for outliers and ensure greater reliability.
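For reference, RMSLE penalizes relative rather than absolute errors, so a single huge order skews it far less than it skews plain RMSE. Here’s a minimal sketch of the metric itself (not Pecan’s implementation) using made-up order values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Hypothetical actual vs. predicted order values, including one $50,000 outlier
y_true = np.array([2500, 2800, 2200, 3000, 50000])
y_pred = np.array([2600, 2700, 2300, 2900, 10000])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))

print(f"RMSE:  {rmse:,.0f}")   # dominated by the single large miss
print(f"RMSLE: {rmsle:.3f}")   # measures relative error, so the outlier matters far less
```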
How Pecan safeguards model reliability
Myriad factors influence model reliability and effectiveness — too many for most users to keep track of. That’s why Pecan’s low-code Predictive GenAI platform has numerous tools for detecting outliers, data leakage, drift, and other causes of unreliability.
Pecan is an end-to-end predictive modeling tool that helps you clean your data, choose the right business question, model, and features, and then construct a SQL-based model in minutes. You can use Pecan’s GenAI assistant to ask real-time questions and sharpen and refine your machine-learning models. Pecan also displays essential metrics on your dashboard, including precision, recall, and AUPRC, so you can compare the reliability of models not only during training and testing but also over time once they’ve been implemented.
To learn more about how Pecan AI can help you create reliable models, request a demo today.