Designing Machine Learning Experiment and Interpreting Experimental Results

Kateryna
25 min read · May 20, 2021


Introduction

Machine Learning is a part of Data Science that focuses on algorithms and statistics. Each time we choose between different learning approaches, tune hyperparameters, or select the best feature set for a model, we compare two or more learning algorithms or models, i.e., we conduct an experiment: "a scientific procedure to make a discovery, test a hypothesis, or demonstrate a known fact".

In practice, we choose the best model based on a score: for example, maximum accuracy in classification or minimum error in regression.

On any random test set, one model can be better than another, yet on the whole population both models may have identical scores. Identical models trained on different training sets can show different performance. This kind of instability is a big problem for decision tree algorithms and small datasets.

A random starting state in an algorithm (think of backpropagation in deep learning) is a source of internal randomness and may produce different scores even when a model is trained on the same training set, just from a different starting point.

And, finally, a subset of test samples can be randomly mislabeled, which prevents even the best learning algorithm from reaching a desirable error rate.

What makes the procedure scientific? The model scores should be statistically evaluated to conclude that the difference between the models is statistically significant and not due to chance or noise in the data.

Statistical tests require specific conditions to hold, and to satisfy these conditions the learning procedure (splitting data into training/validation and test sets) should be carefully designed.

The learning procedures and statistical tests are analyzed in terms of Type 1 and Type 2 errors. A Type 1 error occurs when a test finds a significant difference where there is none, and a Type 2 error occurs when there is a difference but the test does not find it.

Even if a model is developed only to learn something useful from the available data, and we are not going to predict anything in the future or compare it with other models, we still need to be sure the model generalizes well from the available observations.

In this post I review learning procedures and the corresponding statistical tests used to evaluate the results of machine learning experiments. The goal is to emphasize when and what should be used rather than the particular details of how to perform each operation in Python. Instead, I add references where more detailed information can be found. Time series and streaming datasets are not covered.

Learning procedure to evaluate model performance for available dataset size

The learning procedure, i.e., splitting data into training/validation and test sets, is in most cases the well-known k-fold cross-validation, or some other type of data splitting used for different dataset sizes. I use the following definitions of dataset types:

  • Training set is used to train a model.
  • Validation set is used to evaluate a model during training.
  • Test set is used to assess a model's performance.

Table 1 summarizes recommended approaches to set up the learning procedure and estimate the output with statistical tests, based on the available dataset size.

Large Dataset and Fast-to-Train Models

Evaluation Procedure: Divide the available dataset into a test set and several disjoint, equal-size training sets. Each model is trained on each training set. Disjoint validation sets can be used if validation is needed. Then the models are tested on the test set.

Generalized Error: The average score on the test set estimates the model's performance.

Estimate a model per se: Sample Confidence Interval

Comparing 2 models: Two-sided paired t-test can be performed to analyze the variance based on test model scores.

Medium-Size Dataset and Fast-to-Train Models

Evaluation Procedure: If the dataset is large enough to separate an independent test set but too small to create several disjoint training and validation sets, divide the dataset into test and training parts. K-fold cross-validation is then applied to the training part to train the models. The models are tested on the test set.

Generalized Error: The average score on the test set estimates the final model's performance; the average model scores on the validation sets are used to optimize hyper-parameters or select the best feature set.

Estimate a model per se: Sample Confidence Interval

Comparing 2 models:

  • Two-sided paired t-test can be performed to analyze the variance based on test model scores.
  • If k-fold cross-validation is applied and the models are estimated based on the training and validation sets, the corrected t-test must be applied (18).

Large Dataset and Models Whose Training Can Take Days or Weeks

Evaluation Procedure: Divide the available dataset into a test set and train each model once on a disjoint training set, validating, if needed, on another disjoint validation set. Test each model on the test set.

Generalized Error: The models' scores (accuracy) on the test set

Estimate a model per se: Binomial confidence interval

Comparing 2 models: McNemar's test based on the test set (20), or the Stuart-Maxwell test for multi-class models

Small Dataset

Evaluation Procedure: If the dataset is limited, it is better to train the models on all available data. K-fold cross-validation provides good estimates. See also 5x2 cross-validation, which is a special case of repeated k-fold cross-validation.

Generalized Error: Average model scores on the validation sets for k-fold cross-validation, or the average over both folds in 5x2cv (20)

Estimate a model per se: Sample Confidence Interval

Comparing 2 models:

  • Corrected two-sided paired t-test can be performed to analyze the variance based on validation model scores for k-fold cross-validation and repeated k-fold cross-validation (18)
  • 5x2cv paired t-test (20)

Tiny Dataset (less than 300 observations)

Evaluation Procedure:

  • Leave-P-out or Leave-One-Out cross validation
  • Bootstrapping

Generalized Error: Average model scores on the test set, which consists of the instances left out in LOO or LPO cross-validation, or on the full dataset in bootstrapping when the bootstrap samples are used for training (12)

Estimate a model per se:

  • Calculating a proper confidence interval is somewhere between difficult and impossible
  • Bootstrap Confidence Interval (12)

Comparing 2 models:

  • Sign-test
  • Wilcoxon signed-rank test

There is no exact definition of what size is large and what size is small based strictly on the number of rows in a dataset. The authors of some famous statistical publications in the 1990s evaluated their approaches on small datasets of fewer than 1000 instances. But for modern machine learning methods, even a dataset of 10 thousand samples can be small. The other problem is that the algorithms popular nowadays can be too complicated for approaches suggested 25 years ago and slow to run even with the computing power available today.

The only firm rule is based on a minimum of 30 instances per fold in cross-validation to apply a t-test. It means that if there are fewer than 300 instances in the dataset, 10-fold cross-validation is not applicable.

If you have enough data for your algorithm to separate a test set as well as several disjoint training and validation sets, you have a large dataset.

A separate test set is the best thing you can have to estimate the performance of a model. But what if, after separating a test set, there is not enough data left for disjoint training and validation sets? You may apply the method suggested for small datasets, k-fold cross-validation, to train, and still evaluate the model on the test set. This is probably the most common situation in industry applications. Just be sure to base your conclusions on the test set scores, not on the training or validation sets.

A dataset is small when you cannot separate a representative test set and still have enough data for training. But even in this case you can apply k-fold cross-validation and a corrected t-test on the validation sets to get a meaningful conclusion.

In general, the number of training sets is not fixed. Experiments in different projects showed that 10-fold cross-validation gives good results. Nadeau and Bengio (18) also recommend sizes for the validation and training sets: 10% and 90% of the available data. These sizes and 15 folds provide good power with reasonable computational effort; 10 folds may be OK with a larger validation set and a smaller training set. The type of cross-validation they experimented with is not exactly k-fold cross-validation, but it is very close.

The generalized error from k-fold cross-validation based on a small dataset can be noisy: another run with a different training/validation split can result in a different value of the generalized error.

Repeated k-fold cross-validation is the solution for simple linear models.

The same dataset is split 3, 5, or 10 times in different ways. The final generalized error is the average over all validation folds and runs. Bouckaert and Frank (19) recommend 10-times 10-fold cross-validation with Nadeau and Bengio's corrected t-test (18) to achieve an appropriate Type 1 error, low Type 2 error, and high replicability.
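As an illustration, here is a minimal sketch of repeated k-fold cross-validation with scikit-learn; the dataset, the estimator, and the number of repeats are illustrative choices, not a recommendation.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=10000)

# 10-times repeated 10-fold cross-validation: 100 validation scores in total.
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)

# The generalized error is the average over all folds and repeats.
print('Mean accuracy: %.3f (std %.3f)' % (np.mean(scores), np.std(scores)))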

The generalized error from the validation sets of repeated cross-validation can be used to select the best model. But nested cross-validation is often used when the best model must be selected and the generalized error assessed at the same time, because the generalized error may be biased if the same procedure is used multiple times with the same algorithm and dataset.

In the outer loop, the original full dataset is split into training and validation sets. In the inner loop, each training set is individually split into k folds to optimize hyper-parameters or select an optimal feature set. The generalized error is estimated by averaging the validation set scores from the outer loop.
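A minimal nested cross-validation sketch with scikit-learn, assuming an SVC classifier and a small hyper-parameter grid purely for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # hyper-parameter tuning
outer_cv = KFold(n_splits=10, shuffle=True, random_state=1)  # generalized error estimate

param_grid = {'C': [0.1, 1, 10]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=inner_cv, scoring='accuracy')

# The outer loop scores are averaged to estimate the generalized error.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring='accuracy')
print('Nested CV accuracy: %.3f' % outer_scores.mean())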

A tiny dataset (fewer than 300 instances) is the worst case. It is difficult or almost impossible to evaluate model performance. Leave-one-out cross-validation is less reliable than k-fold cross-validation, but it is the only choice besides bootstrapping in this case. Leave-P-out (LPO) is a more general form of leave-one-out cross-validation (LOO), where P, the number of samples held out from the full dataset, is a tunable parameter.

The last two cases are common in social studies, biology, and medicine.

Training of modern deep learning models can be so time consuming that cross-validation cannot be applied. There is still a way to compare classification models, but variance cannot be calculated.

Random subsampling is a more general form of k-fold cross-validation, but in contrast to cross-validation it does not ensure that the validation sets do not overlap. It is not recommended at all due to its exceedingly large Type 1 error.

Few more notes regarding the learning procedure:

  • Stratified cross-validation for classification: each set should contain approximately the same percentage of the target class as the full dataset, which is especially important for an imbalanced dataset.
  • Test, training, and validation sets must represent the full dataset to avoid bias. All variables, not just the target, must have the same distribution. The topic of dataset shift (also called concept shift or concept drift, changing classifications, changing environments, contrast mining in classification learning, fracture points and fractures between data) is quite wide and deserves its own post. Just a few points:

    ◦ Medical or biology data can be collected from different projects or laboratories, processed by different people or on different days.
    ◦ Computer vision, as well as speech recognition and natural language processing, needs a lot of data. Web scraping can easily add more images, but in different resolutions.
    ◦ Dataset shift can heavily bias cross-validation.
    ◦ Weighted cross-validation and other techniques are used to solve the issue.

  • The entire data processing procedure (scaling, encoding, feature selection, etc.) in every fold of cross-validation must be fit using only the training data of that fold and then used to transform the validation and test sets; see the pipeline sketch after this list.
  • If under-sampling or oversampling is applied to an imbalanced dataset along with cross-validation, it should be applied only to the training sets, not the validation or test sets. As a result, the data is distributed differently across the sets. There are analytical methods to correct the bias and produce well-calibrated results.
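A minimal sketch of the stratification and leakage points above, assuming a scikit-learn Pipeline so that scaling is fit on each training fold only and then applied to the corresponding validation fold; the dataset and estimator are illustrative choices.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),               # fit on the training fold only
    ('model', LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the class proportions of the full dataset in each fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')
print('Mean accuracy: %.3f' % scores.mean())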

Statistics to evaluate model performance

Statistics is used to extract information from experimental results, e.g., to estimate a generalized error. In fact, the generalized error is the mean of the distribution of a model's scores, or the mean of the differences of scores when we are comparing the results from a set of models.

The most common statistical tests are frequentist. Both Parametric and Non-Parametric tests follow the same approach.

  • A null hypothesis states that there is no difference between the means of the model scores.
  • The alternative hypothesis proposes that there is a difference.
  • Hypothesis testing provides a method to reject a null hypothesis within a certain significance level.

When the null hypothesis is rejected, the effect is said to be statistically significant.

Parametric tests require the data (generalized errors) to be normally distributed. In real life there is usually a difference between the real and a fitted normal distribution; if there are more small differences between scores, the distribution will have a long left tail. Therefore, non-parametric tests are an effective alternative. Parametric tests usually have more statistical power than non-parametric tests and are more likely to find a statistically significant difference if it exists.

Many researchers provide confidence intervals for the generalized error. The confidence interval or standard error enables interpretation of the generalized error in a more complete way.

In the Bayesian approach, the information is extracted by estimating the distribution of the differences between the models' scores. The procedure involves three overlapping concepts:

  • A prior, i.e., information from a previous experiment. At the beginning of the experiment, a "non-informative" prior is used.
  • Evidence, i.e., the data of the current experiment.
  • A posterior, i.e., the information updated from the prior and the evidence. This is what the Bayesian analysis produces.

Due to their great popularity, it is easier to find guides and tutorials for frequentist parametric tests than for nonparametric and Bayesian ones.

In most cases deriving information from such tests involves a lot of assumptions and approximations. Nevertheless, they provide useful guidance for designing and interpreting experimental results in machine learning.

A few key statistical tests are mentioned in Table 1 and reviewed in more detail below. A more complete overview can be found in Recent Trends in the Use of Statistical Tests for Comparing Swarm and Evolutionary Computing Algorithms: Practical Guidelines and a Critical Review (17).

Statistical tests

Two-tailed Paired t-test (classic Student’s t-test)

The "paired" t-test is used to compare the means of two related samples: paired scores from two different models (experiments) trained on the same dataset.

  • A two-tailed test asks: are the two models (scores) different?
  • A one-tailed test asks: is the score of one model better (lower/higher) than the other's?

A priori, we do not know which model will be better (if there is a difference at all), and we want to allow for either possibility. That is why we use a two-tailed test.

First, we need to set an expectation (null hypothesis) to test: if we are analyzing a model's evaluation scores with two different learning rates, the null hypothesis would be that there is no difference in scores between the two models. The t-test compares the difference between two means in relation to the variation in the data (the standard deviation of the differences) and tells us whether the data are consistent with our null hypothesis or significantly different from it. Since there are always some differences between scores, it is sensible to state the null hypothesis as "no difference".

Second, the data you test should meet a few requirements:

  • Independent samples.
  • The data (the paired differences) should be normally distributed.
  • No outliers in the differences between the two related models' scores.

Independence of observations is usually not testable, though sometimes it can be reasonably assumed. But the subsets used in different folds of cross-validation are NOT independent: they are taken from the same dataset and overlap across iterations.

Normality: to check normality, you can use the Shapiro-Wilk test.

Outliers can be easily identified using boxplot charts or Z-score.

When testing the requirements related to normality and outliers, you must use the differences between the paired values (model scores), not the original dataset.

When one or more of the requirements for the paired t-test are not met, you may want to run the nonparametric Wilcoxon test instead.
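A short sketch of these checks, using the same illustrative fold scores as the paired t-test example below: test the differences for normality with Shapiro-Wilk, screen them for outliers with z-scores, and fall back to the Wilcoxon signed-rank test if the requirements look violated.

import numpy as np
import scipy.stats as stats

BaseModelScores = [0.709202, 0.675973, 0.690961, 0.692875, 0.678119, 0.699425, 0.679891, 0.691891, 0.705739, 0.702819]
OtherModelScores = [0.693766, 0.668319, 0.678609, 0.680208, 0.663592, 0.682784, 0.670627, 0.683872, 0.68519, 0.692516]

# Test the paired differences, not the original scores.
diff = np.array(BaseModelScores) - np.array(OtherModelScores)

# Shapiro-Wilk normality test: a small p-value suggests the differences are not normal.
shapiro_stat, shapiro_p = stats.shapiro(diff)

# A simple outlier screen on the differences using z-scores.
z = np.abs(stats.zscore(diff))
print('Possible outlier positions:', np.where(z > 3)[0])

if shapiro_p < 0.05:
    # Requirements not met: fall back to the nonparametric Wilcoxon signed-rank test.
    print(stats.wilcoxon(BaseModelScores, OtherModelScores))
else:
    print(stats.ttest_rel(BaseModelScores, OtherModelScores))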

The formula of the paired t-test is

t = \frac{m}{\sqrt{s^2 / n}}

where t is the t-value, m and s^2 are the mean and the variance of the differences between all pairs of scores (d), respectively, and n is the size of d.

The size of d can be equal to k in k-fold cross validation (n = k) or n = k*r in r-times repeated k-fold cross validation.

The mean of the differences d is compared to 0. If the difference is significant, then it is far from 0. Each individual difference is

d_i = x_{1i} - x_{2i}

where x_{1i} is Model 1's score and x_{2i} is Model 2's score.

After the t-value is calculated, a t-test table should be read to get the critical value of Student's distribution corresponding to the significance level alpha you selected (1% or 5%). The degrees of freedom (df) used in the table are df = n - 1.

If the absolute t-value is greater than the critical value obtained from Student’s distribution, then the difference is significant.

The greater the t-value, the greater the evidence that there is a significant difference between the means. The closer the t-value is to 0, the more likely it is that the difference between the models is not significant. Usually, an absolute t-value greater than 2 is acceptable.

The level of significance, or p-value, corresponds to the risk indicated by the t-test table for the calculated absolute t-value. A p-value less than the selected significance level alpha rejects the null hypothesis, which means there is a difference between the models.

It is quite simple to conduct Paired t-test in Python using scipy.stats package.

import scipy.stats as stats

# Assuming a 10-fold cross-validation was run for two different sets of parameters
# or features, and we obtained these scores for 2 different models:
alpha = 0.05
BaseModelScores = [0.709202, 0.675973, 0.690961, 0.692875, 0.678119, 0.699425, 0.679891, 0.691891, 0.705739, 0.702819]
OtherModelScores = [0.693766, 0.668319, 0.678609, 0.680208, 0.663592, 0.682784, 0.670627, 0.683872, 0.68519, 0.692516]

# Paired t-test
t = stats.ttest_rel(BaseModelScores, OtherModelScores)
print(t)
if t.pvalue >= alpha:
    print('No difference between the models with %s significance level' % alpha)
else:
    print('There is a difference between models with %s significance level' % alpha)

To compare 3 or more models, an ANOVA test should be applied. Both the t-test and ANOVA analyze differences in means by looking at variance across groups, but the way statistical significance is calculated is different.
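For illustration, a quick sketch of comparing three sets of fold scores; the third model's scores are made up. Note that scipy's f_oneway is a one-way ANOVA that treats the groups as independent, while the Friedman test is a nonparametric alternative suited to scores coming from the same folds.

import scipy.stats as stats

BaseModelScores = [0.709202, 0.675973, 0.690961, 0.692875, 0.678119, 0.699425, 0.679891, 0.691891, 0.705739, 0.702819]
OtherModelScores = [0.693766, 0.668319, 0.678609, 0.680208, 0.663592, 0.682784, 0.670627, 0.683872, 0.68519, 0.692516]
# Hypothetical third model's fold scores, for illustration only.
ThirdModelScores = [0.701, 0.671, 0.684, 0.688, 0.670, 0.690, 0.675, 0.687, 0.695, 0.698]

# One-way ANOVA across the three groups of scores.
print(stats.f_oneway(BaseModelScores, OtherModelScores, ThirdModelScores))
# Friedman test: a paired, nonparametric alternative for scores from the same folds.
print(stats.friedmanchisquare(BaseModelScores, OtherModelScores, ThirdModelScores))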

Corrected paired t-test

As mentioned above, the subsets used in cross-validation estimation are not independent. The result is a high Type 1 error, where the null hypothesis is wrongly rejected. Nadeau and Bengio (18) proved that this is due to underestimation of the variance. They propose to correct the variance, and the correction reflects the correlation between the folds' scores, estimated as n2/n1, where n2 is the size of the validation set and n1 is the size of the training set.

Nadeau and Bengio suspect this is not true for all types of algorithms. First, it is applicable only to models of relatively small complexity, where the size of the parameter set is small compared to the size of the dataset. Second, the model should be robust to changes in the training set. And while the first condition is easy to satisfy with modern, high-volume datasets, the second condition affects the most widely used approaches. It seems tree-based models as well as K-nearest neighbors are not in this category, because a modification of the data may create a different decision function. I guess, if the decision trees are approximately the same in each cross-validation fold, it is OK to apply the correction. (Well, there is always the question of how close they should be and how to measure this.)

Anyway, the formula for the corrected paired t-test according to Nadeau and Bengio is

t = \frac{m}{\sqrt{s^2 \left( \frac{1}{n} + \frac{n_2}{n_1} \right)}}

where n2 is the size of the validation set and n1 is the size of the training set. n can be equal to k in k-fold cross-validation (n = k) or n = k*r in r-times repeated k-fold cross-validation.

Implementation in Python:

# Nadeau and Bengio corrected paired t-test
# https://link.springer.com/content/pdf/10.1023/A:1024068626366.pdf
# https://www.cs.waikato.ac.nz/~eibe/pubs/bouckaert_and_frank.pdf
import math
import numpy as np
import scipy.stats as stats

def corrected_paired_ttest(data1, data2, n1, n2, alpha):
    # n1 is the size of the training set, n2 is the size of the validation set.
    diff = [y - x for y, x in zip(data1, data2)]
    n = len(diff)
    m = np.mean(diff)
    # It is important to provide ddof=1 (delta degrees of freedom) in numpy var
    # to calculate the variance with n - 1 degrees of freedom.
    v = np.var(diff, ddof=1)
    # Corrected t-statistic: the variance term is inflated by n2/n1.
    t = m / math.sqrt(v * (1 / n + n2 / n1))

    # Degrees of freedom
    df = n - 1

    # Critical value for a two-tailed test from the t distribution table:
    critical_value = stats.t.ppf(q=1 - alpha / 2, df=df)

    # p-value: probability of getting a more extreme value, for a two-sided test
    pvalue = 2 * (1 - stats.t.cdf(abs(t), df))

    return t, critical_value, pvalue

alpha = 0.05
BaseModelScores = [0.709202, 0.675973, 0.690961, 0.692875, 0.678119, 0.699425, 0.679891, 0.691891, 0.705739, 0.702819]
OtherModelScores = [0.693766, 0.668319, 0.678609, 0.680208, 0.663592, 0.682784, 0.670627, 0.683872, 0.68519, 0.692516]
n2 = 89559   # validation set size
n1 = 806039  # training set size
(c_t, critical_value, pvalue) = corrected_paired_ttest(BaseModelScores, OtherModelScores, n1, n2, alpha)
print('Corrected t-test value is %s, critical value is %s, p-value is %s' % (c_t, critical_value, pvalue))
if pvalue >= alpha:
    print('No difference between the models with %s significance level' % alpha)
else:
    print('There is a difference between models with %s significance level' % alpha)

5x2cv paired t-test

There are two problems in machine learning experiments with a small amount of data. When the number of runs decreases, noise in the model scores becomes the problem. When the number of runs increases, independence becomes the issue.

Dietterich (20) shows that 5 runs of 2-fold cross-validation is a good trade-off between these two issues.

In each run, the dataset is randomly split into 2 equal parts. Each model is trained on each part and validated on the other, so each run produces two scores per model.

The individual difference in score for fold i of run j is

p_i^{(j)} = a_i^{(j)} - b_i^{(j)}

where a_i^{(j)} and b_i^{(j)} are the two models' scores. The mean of the differences for a single run of 2-fold cross-validation is

\bar{p}^{(j)} = \frac{p_1^{(j)} + p_2^{(j)}}{2}

Both folds are considered, since each serves once as the training set and once as the validation set.

The variance for run j is

s_j^2 = \left(p_1^{(j)} - \bar{p}^{(j)}\right)^2 + \left(p_2^{(j)} - \bar{p}^{(j)}\right)^2

Then the 5x2cv paired t-statistic with 5 degrees of freedom is calculated as

t = \frac{p_1^{(1)}}{\sqrt{\frac{1}{5}\sum_{j=1}^{5} s_j^2}}

The 5x2cv paired t-test has a Type 1 error at or below the significance level but an inflated Type 2 error. So if you have a small dataset and it is more important NOT to find a difference when there is indeed no significant difference (i.e., not to incorrectly reject the null hypothesis) than to find a possibly existing difference, the 5x2cv paired t-test is a good choice.
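A minimal sketch of the 5x2cv paired t-test following the formulas above; the dataset and the two estimators are illustrative choices.

import numpy as np
import scipy.stats as stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model_a = GaussianNB()
model_b = DecisionTreeClassifier(random_state=0)

variances = []
p_1_1 = None  # difference from the first fold of the first run
for run in range(5):
    cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=run)
    p = []
    for train_idx, val_idx in cv.split(X, y):
        score_a = model_a.fit(X[train_idx], y[train_idx]).score(X[val_idx], y[val_idx])
        score_b = model_b.fit(X[train_idx], y[train_idx]).score(X[val_idx], y[val_idx])
        p.append(score_a - score_b)
    if p_1_1 is None:
        p_1_1 = p[0]
    p_bar = (p[0] + p[1]) / 2
    variances.append((p[0] - p_bar) ** 2 + (p[1] - p_bar) ** 2)

# t-statistic with 5 degrees of freedom; np.mean over 5 runs equals (1/5) * sum.
t = p_1_1 / np.sqrt(np.mean(variances))
pvalue = 2 * (1 - stats.t.cdf(abs(t), df=5))
print('5x2cv t = %.3f, p-value = %.3f' % (t, pvalue))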

McNemar’s test

The test is applicable to classification problems and is good for models and datasets where training takes a lot of time. It requires only one train/test split and may be used for comparing deep learning models.

The dataset is divided into training and test sets. Both models are trained on the training set and tested on the test set. Then a contingency table is built from the results on the test set:

                    Model 2 correct    Model 2 wrong
Model 1 correct     a                  b
Model 1 wrong       c                  d

The total number of observations in the test set is n = a + b + c + d.

McNemar's test is based on a Chi-Square goodness-of-fit test with 1 degree of freedom and compares the distribution of counts expected under the null hypothesis (the two models have the same error rate) to the observed counts.

The statistic is calculated as

\chi^2 = \frac{(|b - c| - 1)^2}{b + c}

It includes a "continuity correction" (the -1 term) because the statistic is discrete while the Chi-Square distribution is continuous. It is calculated from only the two off-diagonal elements of the contingency table, not from a model score or error rate.

The p-value is the probability of obtaining a statistic at least as large as the calculated one under the Chi-Square distribution. If the p-value is greater than the significance level (0.05 or 0.01), we fail to reject the null hypothesis that the two models are identical.

McNemar's test has acceptable Type 1 and Type 2 errors, but it does not consider the variance because it is based on only a single train/test split, and the statistic is calculated from the results on the test set only, which should be large enough to adequately represent the whole dataset.

See statsmodels.stats.contingency_tables.mcnemar to calculate McNemar’s test in python. There is also a good explanation of the test in this blog.
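A minimal sketch with statsmodels; the contingency table counts are made up for illustration and follow the layout shown above (rows: model 1 correct/wrong, columns: model 2 correct/wrong).

from statsmodels.stats.contingency_tables import mcnemar

table = [[580, 32],   # [both correct, only model 1 correct]
         [17, 71]]    # [only model 2 correct, both wrong]

# exact=False uses the chi-square statistic with the continuity correction,
# as described above; exact=True would use the binomial distribution instead.
result = mcnemar(table, exact=False, correction=True)
print('statistic=%.3f, p-value=%.3f' % (result.statistic, result.pvalue))

alpha = 0.05
if result.pvalue >= alpha:
    print('No difference between the models (fail to reject H0)')
else:
    print('There is a difference between the models (reject H0)')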

Statistical intervals

From the point of view of evaluating model performance, three types of intervals are of interest:

  • Confidence Interval of the generalized error.
  • Prediction Interval for a single future observation.
  • Tolerance Interval of a model's predictions.

For some projects, where the point of interest is a model's parameters, the confidence interval of the parameters is important.

Confidence Interval

A confidence interval (CI) estimates the level of uncertainty associated with the mean of the generalized error. Instead of a single generalized error value calculated on a test set using models trained on different training sets, a CI is a range of values that most likely includes the population generalized error value with a certain degree of confidence.

Different random test sets selected from the full dataset result in slightly different scores and intervals for the same model. If we repeat the procedure many times, a specific percentage of the CIs (the confidence level, usually 95%) will contain the population score.

CI can be used to compare the precision of generalized error of different models. A narrow interval suggests a more precise generalized error.

There are two types of model outputs where confidence intervals are calculated in different ways:

1. Binary Classification models where the output probability scores are converted into discrete values 0 or 1. (Binomial Confidence Interval)

2. All other models, where the output is a continuous value. (Sample Confidence Interval)

Sample Confidence Interval

The samples in this case are model scores calculated on a test set from models trained on different training sets. Some conditions must be met before a Sample Confidence Interval can be calculated: the scores must be independent and normally distributed.

If the conditions are met, we do not know the population standard deviation, n < 30, and we can calculate the standard deviation of the scores:

CI = m \pm T_{crit} \cdot \frac{s}{\sqrt{n}}

where m is the mean of the scores, T_crit is the critical value for the respective confidence level from the t-distribution with df = n - 1 degrees of freedom, s is the standard deviation, and n can be equal to k in k-fold cross-validation (n = k) or n = k*r in r-times repeated k-fold cross-validation.

The Shapiro-Wilk test can be used to check normality. A nonparametric confidence interval can be calculated if the scores are not normally distributed. A bootstrap confidence interval does not require a normal distribution and may be more accurate than the standard intervals, but it is computationally expensive.
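A minimal percentile-bootstrap sketch for the mean score, reusing the illustrative fold scores from the t-test example; the number of resamples is arbitrary.

import numpy as np

rng = np.random.default_rng(1)
scores = np.array([0.709202, 0.675973, 0.690961, 0.692875, 0.678119, 0.699425, 0.679891, 0.691891, 0.705739, 0.702819])

# Resample the scores with replacement many times and keep each resample's mean.
boot_means = [rng.choice(scores, size=len(scores), replace=True).mean()
              for _ in range(10000)]

# 95% percentile interval: the 2.5th and 97.5th percentiles of the bootstrap means.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print('95%% bootstrap CI: [%.4f, %.4f]' % (lower, upper))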

In the case of a small dataset, when we do not have enough data for an independent test set and validation sets are used instead to estimate model performance, I guess the same correction as in the corrected two-sided paired t-test can be applied (18):

CI = m \pm T_{crit} \cdot \sqrt{s^2 \left( \frac{1}{n} + \frac{n_2}{n_1} \right)}

where s^2 is the variance, n2 is the size of the validation set, and n1 is the size of the training set.

One may think that if the confidence intervals do not overlap, there is a difference between the models. Do not use sample confidence intervals to compare models. To get a result consistent with a significance test, the confidence interval of the differences between model scores must be analyzed.

Confidence intervals of the differences between model scores

The formulas are the same as above, but m is the mean of the differences and s is the standard deviation of the differences between the two models' scores.

The difference between the means of the model scores for the entire population lies within this confidence interval. If there is no difference, the interval contains zero (0). If zero is NOT in the range of values, the difference is statistically significant. The CI and a hypothesis test with an equivalent significance level should always match in this case:

Confidence level = 1 - Significance level (alpha)

e.g., a 95% confidence level is the same as a 0.05 (5%) significance level (alpha).

Practical significance can be judged from the width of the CI. A narrow interval means a more precise estimate, and in the case of a regression model score (such as MAE), which is in natural data units, it is easy to understand whether the difference is practically significant. Unfortunately, there is no exact rule for how narrow or wide the interval must be to be practically significant in a particular project.

import numpy as np
import scipy.stats as st
import statsmodels.stats.api as sms

diff = [y - x for y, x in zip(BaseModelScores, OtherModelScores)]

# Confidence interval of the differences using scipy...
CI = st.t.interval(1 - alpha, len(diff) - 1, loc=np.mean(diff), scale=st.sem(diff))
# ...or the equivalent statsmodels helper.
CI = sms.DescrStatsW(diff).tconfint_mean()

Corrected Confidence Interval:

import math
import numpy as np
import scipy.stats as stats

def corrected_confidence_interval(data1, data2, n1, n2, confidence=0.95):
    # n1 is the size of the training set, n2 is the size of the validation set.
    diff = [y - x for y, x in zip(data1, data2)]
    n = len(diff)
    m = np.mean(diff)
    v = np.var(diff, ddof=1)
    df = n - 1
    t = stats.t.ppf((1 + confidence) / 2, df)
    lower = m - t * math.sqrt(v * (1 / n + n2 / n1))
    upper = m + t * math.sqrt(v * (1 / n + n2 / n1))
    return lower, upper

Corrected_CI = corrected_confidence_interval(BaseModelScores, OtherModelScores, n1, n2, 1 - alpha)
print(Corrected_CI)
Confidence Interval and Corrected Confidence Interval according to Nadeau and Bengio

Binomial Confidence Interval for Binary Classification (predictions as discrete values)

In binary classification, each prediction is a discrete value (true/false, positive/negative, or 1/0). The most widely used metric to estimate binary classification models is accuracy, or its inverse, the classification error.

error = \frac{r}{n}

where n is the number of observations in the test set and r is the number of errors, i.e., the number of wrong classification predictions.

Technically, the error is a proportion, the outcome of so-called Bernoulli trials, which obeys the Binomial distribution. With a number of samples n greater than 30, or the more accurate rule of thumb n*error*(1 - error) >= 5, the Binomial distribution is approximated by the Normal distribution, and the expression for the true error can be generalized to an N% confidence interval:

error \pm Z_N \sqrt{\frac{error \, (1 - error)}{n}}

where Z_N is selected from a table for the two-sided N% confidence interval and defines the width of the smallest interval under the Normal distribution. For n less than 30, a table with exact values for the Binomial distribution should be used.

It does not require k-fold cross-validation or bootstrapping and can be used for classification models that take days or weeks to train.
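A minimal sketch of the normal-approximation binomial interval; the test-set size and error count are made up, and the statsmodels helper is shown as an equivalent alternative.

import math
from statsmodels.stats.proportion import proportion_confint

n = 700     # number of observations in the test set (illustrative)
r = 91      # number of misclassified observations (illustrative)
error = r / n
z = 1.96    # two-sided 95% confidence level

# Manual normal approximation...
half_width = z * math.sqrt(error * (1 - error) / n)
print('error = %.3f, 95%% CI = [%.3f, %.3f]' % (error, error - half_width, error + half_width))

# ...or the equivalent statsmodels helper.
print(proportion_confint(count=r, nobs=n, alpha=0.05, method='normal'))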

Prediction Interval

A prediction interval (PI) is used if you are more interested in the range of a future observation's prediction than in the range of a model's generalized error. A PI quantifies the uncertainty in a single specific outcome and can be obtained after a model is fit. It is built around a prediction of the dependent variable from the independent variables.

When predictions are done from the same model and based on the same population of independent variables, a prediction interval contains a future prediction N% (usually 95%) of the time.

A prediction interval is always wider than a confidence interval because it is based on the variance that comes from the model parameters and the variance of the individual independent variables, as well as the variance of the dependent variable's mean. Prediction intervals are also more sensitive to normality than confidence intervals.

It is relatively easy to calculate a prediction interval for linear regression models with normally distributed residuals. The prediction interval for a single prediction \hat{y}, evaluated at a specific x, uses the relation

PI = \hat{y} \pm T_{crit} \cdot s_e \sqrt{1 + \frac{1}{n} + \frac{(x - m_x)^2}{\sum_i (x_i - m_x)^2}}

where n is the number of observations, x_i is an independent variable value in the set, m_x is the mean of the independent variable, s_e is the standard error of the residuals, and T_crit is the Student's t critical value equivalent to a 95% confidence level, i.e., alpha = 0.05 divided by 2 (0.025) for a two-sided interval, with n - 2 degrees of freedom.
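For linear regression, statsmodels can return both the confidence interval of the mean prediction and the prediction interval directly. A minimal sketch on synthetic data (the data and prediction points are illustrative):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# Predict at new points and request both intervals at alpha = 0.05:
# mean_ci_* is the confidence interval of the mean prediction,
# obs_ci_* is the (wider) prediction interval for a single new observation.
new_X = sm.add_constant(np.array([4.0, 8.0]), has_constant='add')
frame = results.get_prediction(new_X).summary_frame(alpha=0.05)
print(frame[['mean', 'mean_ci_lower', 'mean_ci_upper', 'obs_ci_lower', 'obs_ci_upper']])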

There is not a lot of information about prediction intervals for nonlinear regression models. A few methods are mentioned in "A Comprehensive Review of Neural Network-based Prediction Intervals and New Advances" (34).

Tolerance Interval

A tolerance interval (TI) represents the spread of values around the mean. It contains a specific percentage of a population. A TI requires the data to be normally distributed, but there are also nonparametric techniques.

Both a confidence level and the percentage are needed to create a TI. Then we are N% confident that the interval contains M% of the population.

How can a tolerance interval be used? If a model's tolerance interval is wider than the requirements, the model produces too many bad predictions. It also helps to detect anomalies or outliers.
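A minimal sketch of a two-sided normal tolerance interval using the approximate k-factor described in reference (36); the data are synthetic and assumed normally distributed.

import numpy as np
from scipy.stats import chi2, norm

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=5, size=100)  # synthetic, approximately normal data

n = len(data)
dof = n - 1
coverage = 0.95     # proportion of the population to cover (M%)
confidence = 0.99   # confidence that the interval covers that proportion (N%)

# Approximate two-sided tolerance factor k.
z = norm.ppf(1 - (1 - coverage) / 2)
chi_crit = chi2.ppf(1 - confidence, dof)
k = np.sqrt(dof * (1 + 1 / n) * z ** 2 / chi_crit)

m, s = np.mean(data), np.std(data, ddof=1)
print('We are 99%% confident that [%.2f, %.2f] contains 95%% of the population'
      % (m - k * s, m + k * s))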

Conclusion

It is important to understand that proper model validation and evaluation do not improve the model itself. A good model can be built without cross-validation and t-tests. But without them you cannot know whether your model is good.

In practice, I do not see statistical tests applied much in hyper-parameter tuning. Even if a test shows there is no difference, the process is complete, the time and resources are spent, and there is nothing to change. Just do not fool yourself that a small change you achieved in a model score is something significant.

It may be different if a new feature is added to a model and collecting that feature for future observations requires significant money or time. Maybe you need to pay a 3rd-party company for some additional data, or build and run a sophisticated framework to collect web data. In this case, solid proof of the new feature's usefulness can help support the proposal. On the other hand, even if the effect of adding the feature is statistically significant, you need to estimate the practical significance of collecting and adding it. Is the change in model accuracy large enough to care about and spend more money on collecting the feature?

Science is supposed to be a self-correcting community of experts who constantly check each other's work. With all the approximations and assumptions, there is always a scientist who can prove a process or test is incorrect or not applicable in your case. The more time and effort you devote to validating and testing your model, the better.

The full code for the post

References

Cross-Validation

1. https://avehtari.github.io/modelselection/CV-FAQ.html

2. https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/

3. https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/

4. https://stats.stackexchange.com/questions/283303/nested-cross-validation-vs-repeated-k-fold

5. https://towardsdatascience.com/nested-cross-validation-hyperparameter-optimization-and-model-selection-5885d84acda

6. https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html

7. https://machinelearningmastery.com/k-fold-cross-validation/

8. https://stats.stackexchange.com/questions/154830/10-fold-cross-validation-vs-leave-one-out-cross-validation

9. https://vanderlaan-lab.org/2017/11/22/leave-p-out-cross-validation/

10. https://stats.stackexchange.com/questions/484508/how-to-compute-confidence-interval-for-leave-one-out-cross-validation-loocv

11. Dataset Shift in Classification: Approaches and Problems

Bootstrap

12. An Introduction to the Bootstrap (Chapman & Hall/CRC Monographs on Statistics and Applied Probability)

13. https://discourse.datamethods.org/t/bootstrap-vs-cross-validation-for-model-performance/2779/4

14. http://www.stat.cmu.edu/~brian/724/week11/lec27-bootstrap.pdf

15. https://datascience.stackexchange.com/questions/32264/what-is-the-difference-between-bootstrapping-and-cross-validation

16. https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/

Significance tests

17. Recent Trends in the Use of Statistical Tests for Comparing Swarm and Evolutionary Computing Algorithms: Practical Guidelines and a Critical Review

18. Inference for the Generalization Error

19. Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms

20. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms

21. https://machinelearningmastery.com/statistical-significance-tests-for-comparing-machine-learning-algorithms/

22. https://stats.stackexchange.com/questions/217466/for-model-selection-comparison-what-kind-of-test-should-i-use

23. https://towardsdatascience.com/statistical-tests-for-comparing-machine-learning-and-baseline-performance-4dfc9402e46f

24. https://infotrust.com/articles/bayesian-vs-frequentist-methodologies-explained-in-five-minutes/

Confidence Interval

25. Machine Learning (McGraw-Hill International Editions Computer Science Series)

26. Exact McNemar’s Test and Matching Confidence Intervals

27. https://machinelearningmastery.com/confidence-intervals-for-machine-learning/

28. https://www.machinelearningplus.com/statistics/confidence-interval/

29. https://statisticsbyjim.com/hypothesis-testing/confidence-intervals-compare-means/

30. http://www.jtrive.com/the-empirical-bootstrap-for-confidence-intervals-in-python.html

Prediction Interval

31. https://machinelearningmastery.com/prediction-intervals-for-machine-learning/

32. https://medium.com/@qucit/a-simple-technique-to-estimate-prediction-intervals-for-any-regression-model-2dd73f630bcb

33. https://www.propharmagroup.com/blog/understanding-statistical-intervals-part-2-prediction-intervals/

34. Comprehensive Review of Neural Network-Based Prediction Intervals

Tolerance Interval

35. https://accendoreliability.com/tolerance-intervals-for-normal-distribution-based-set-of-data/

36. https://machinelearningmastery.com/statistical-tolerance-intervals-in-machine-learning/

Bootstrap Interval

37. https://machinelearningmastery.com/calculate-bootstrap-confidence-intervals-machine-learning-results-python/

38. http://users.stat.umn.edu/~helwig/notes/bootci-Notes.pdf

39. http://www.jtrive.com/the-empirical-bootstrap-for-confidence-intervals-in-python.html
