M. H. Ali & I. Abustan


A novel index for evaluating model performance

Abstract

A vast array of scientific literature is concerned with simulation models. The aim of a model is to predict an unknown situation as closely as possible to the real one. To do this, models are validated and examined for their performance under known conditions. In this paper, commonly used model performance evaluation indices are reviewed and examined under different situations. Difference-based measures, efficiency-based indices (the Nash and Sutcliffe coefficient, the model efficiency of Loague and Green, and Legates and McCabe's index), and composite indices (such as the index of agreement, d, and dr) were found ambiguous, inconsistent, and illogical in many cases. A new index, the Percent Mean Absolute Relative Error (PMARE), is proposed, which is found unambiguous, logical, straightforward, and interpretable, and thus can be used to evaluate model performance. Model performance ratings based on PMARE are also suggested.

 

Introduction

A range of simulation models and decision support systems have been developed and used for several decades in different fields. Simulation models have been used successfully to simulate crop growth and development (Geerts et al. 2009, Stockle et al. 2003), hydrologic variables (Suleiman 2008), water and solute transport (Crescimanno and Garofalo 2005, Dust et al. 2000), solar radiation (Rivington et al. 2005), environmental impacts (Stockle et al. 1992), and many other processes. One important aspect of the model development process is model evaluation: model outputs are compared with observed (or known) data gathered under the respective conditions, by both quantitative and graphical methods.

Various statistical and efficiency-based indices/indicators and test statistics have been suggested and used by different model developers and users to judge model performance. Among these, the recommendations of Nash and Sutcliffe (1970), Fox (1981), Willmott (1982, 1985), and Loague and Green (1991) are prominent.

Among the statistical indices, some quantify the departure of the model output from observed or experimental measurements, while others focus on the correlation between model predictions and measurements. In essence, Fox (1981) recommended that the following four types of difference measures be calculated and reported: mean error, mean absolute error, variance of the distribution of differences, and root mean square error (or its square, the mean square error). These difference-based statistics quantify the departure of the model outputs from the measurements.
Indicators for specific fields have also been suggested. Bellocchi et al. (2002) proposed a fuzzy expert system to calculate a composite indicator for the performance evaluation of solar radiation models; they aggregated the correlation coefficient (r), the relative root mean square error (RRMSE), the model efficiency (EF), and the Student's t probability. Confalonieri et al. (2010) proposed a fuzzy-based indicator for evaluating soil water content simulations. Jacovides and Kontoyiannis (1995) proposed the mean bias error (MBE) and root mean square error (RMSE), in combination with the t-statistic, as statistical indicators for the evaluation and comparison of evapotranspiration models.
Among the difference-based and/or statistical measures, the mean error (ME), root mean square error (RMSE), relative error (RE), and correlation coefficient (r) are widely used in different fields: crop growth and yield (Geerts et al. 2009), irrigation scheduling (Liu et al. 1998), hydrology (Shen et al. 2009), environment (Wagener and Kollat 2007), solar radiation (Rivington et al. 2005), pollution simulation models (Yang et al. 2007), etc. Model efficiency (EF) is used in almost every field of simulation. These indices are used both for single-model evaluation and for comparison of multiple models (Prasher et al. 1996). Martorana and Bellocchi (1999) identified the mean squared error of prediction as the fundamental statistical index on which other widely used squared-difference measures are based. Willmott and Matsuura (2006), however, noted that the RMSE is an inappropriate measure of average error because it is a function of three characteristics of a set of errors, rather than of one (the average error).

Yang et al. (2000) evaluated different statistical methods for evaluating the crop-nitrogen simulation model N_ABLE. They suggested that two sets of statistics can be used: (a) mean error (ME), root mean square error (RMSE), forecasting efficiency, and the paired t-statistic; (b) ME, mean absolute error, forecasting coefficient, and the F-ratio of lack of fit over experimental error. They noted that either set can give the same conclusions, which could not be detected quantitatively by graphical methods. The use of test statistics (e.g., F, t-test) to judge the error variance between observed and simulated outputs carries the possibility of producing type-I or type-II errors.

Willmott (1981) demonstrated that the correlation coefficient r (Pearson's product-moment correlation coefficient) can be a misleading measure of accuracy: r between a very dissimilar model-predicted variable and the observed one can easily approach 1.0. Willmott (1982) discussed other drawbacks of r and R2, and proposed an "index of agreement" (d). He noted that the index d is intended as a descriptive measure, and that it is both a relative and a bounded measure that can be widely applied for cross-comparisons between models. Willmott et al. (2011) suggested a refined index (dr) to address the problems of d.

Among the efficiency-based indices (EF) suggested for model performance evaluation, the widely used ones are the Nash and Sutcliffe coefficient (Nash and Sutcliffe 1970) and the model efficiency of Loague and Green (1991). Many researchers (Addiscott and Whitmore 1987, Martorana and Bellocchi 1999, Rivington et al. 2005, Moriasi et al. 2007) have noted that a model may be judged suitable according to one statistic but deficient according to another. Alexandrov et al. (2011) emphasized the need for a standardized evaluation tool.

The purpose of this paper is to examine all of the above indices, and to suggest a logical, stable, unambiguous, and straightforward index for model performance evaluation.

 

Materials and methods

Definition of commonly used statistical measures and indices for model performance evaluation

Before analyzing the indices, it is useful to define them along with their perspectives, as done below. For consistency across all the indices, the observed or measured value is designated Oi and the predicted or simulated value Pi, although the originally proposed symbols may differ in some cases.

Difference-based statistical indicators

(i) Mean bias or mean error (ME) (Fox 1981):

\[
\mathrm{ME} = \frac{1}{N}\sum_{i=1}^{N}\left(P_i - O_i\right) \tag{1}
\]

where N is the number of observations.

(ii) Mean absolute error (MAE) (Fox 1981):

\[
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|P_i - O_i\right| \tag{2}
\]

(iii) Root mean square error (RMSE):

\[
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(P_i - O_i\right)^2} \tag{3}
\]

The RMSE quantifies the dispersion between simulated and measured data. Ideally, the values of ME, MAE, and RMSE should be zero.

(iv) Relative error (RE) or relative root mean square error (RRMSE) (Loague and Green 1991, Bellocchi et al. 2002):

\[
\mathrm{RE} = \frac{\mathrm{RMSE}}{\bar{O}} \tag{4}
\]

where Ō is the mean of the observed values. The RE may vary from 0 to positive infinity; the smaller the RE, the better the model performance. It is sometimes expressed in percentage form.

(v) Scaled root mean square error (SRMSE) (Dust et al. 2000):

\[
\mathrm{SRMSE} = \frac{\mathrm{RMSE}}{\bar{O}} \tag{5}
\]

In essence, the RE and SRMSE are the same.
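These difference-based measures are straightforward to script. The following minimal sketch (Python with NumPy is assumed here; the indices in this study were actually computed in a spreadsheet, as noted later) implements ME, MAE, RMSE, and RE/SRMSE for arrays of observed and predicted values:

```python
# A minimal sketch of the difference-based measures defined above.
# O and P are equal-length NumPy arrays of observed and predicted values.
import numpy as np

def mean_error(O, P):
    """ME: mean bias; the ideal value is zero."""
    return np.mean(P - O)

def mean_absolute_error(O, P):
    """MAE: average magnitude of the individual errors."""
    return np.mean(np.abs(P - O))

def rmse(O, P):
    """RMSE: dispersion between simulated and measured data."""
    return np.sqrt(np.mean((P - O) ** 2))

def relative_error(O, P):
    """RE (equivalently RRMSE or SRMSE): RMSE scaled by the observed mean."""
    return rmse(O, P) / np.mean(O)
```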

Efficiency-based indicators

(i) Nash and Sutcliffe coefficient of efficiency (ENS) (Nash and Sutcliffe 1970):

\[
E_{NS} = 1 - \frac{\sum_{i=1}^{N}\left(O_i - P_i\right)^2}{\sum_{i=1}^{N}\left(O_i - \bar{O}\right)^2} \tag{6}
\]

The Nash-Sutcliffe coefficient of efficiency (ENS) varies between -∞ and 1.0, with ENS = 1 as the optimum value. ENS ≤ 0 indicates unsatisfactory performance, and 0 < ENS < 1 is considered the acceptable range.

(ii) Model efficiency of Loague and Green (ELG) (Loague and Green 1991):

\[
E_{LG} = \frac{\sum_{i=1}^{N}\left(O_i - \bar{O}\right)^2 - \sum_{i=1}^{N}\left(P_i - O_i\right)^2}{\sum_{i=1}^{N}\left(O_i - \bar{O}\right)^2} \tag{7}
\]

The ideal value of ELG is unity; its upper limit is 1, and its lower limit is negative infinity.

The Nash-Sutcliffe coefficient of efficiency (ENS) and the model efficiency of Loague and Green (ELG) are algebraically identical, so only one of them is discussed in the later sections.

(iii) Legates and McCabe’s index (ELM) (Legates and McCabe 1999)

Legates and McCabe’s index (ELM) is written as:

\[
E_{LM} = 1 - \frac{\sum_{i=1}^{N}\left|O_i - P_i\right|}{\sum_{i=1}^{N}\left|O_i - \bar{O}\right|} \tag{8}
\]
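A corresponding sketch for the efficiency-based indicators, under the same assumptions and conventions as the previous block (ELG is omitted since it is identical to ENS, as noted above; like ENS, ELM has an optimum of 1 and is unbounded below):

```python
# A sketch of the efficiency-based indicators.
import numpy as np

def e_ns(O, P):
    """Nash-Sutcliffe coefficient of efficiency; optimum 1, unbounded below."""
    return 1.0 - np.sum((O - P) ** 2) / np.sum((O - np.mean(O)) ** 2)

def e_lm(O, P):
    """Legates and McCabe's index: as E_NS, but with absolute errors."""
    return 1.0 - np.sum(np.abs(O - P)) / np.sum(np.abs(O - np.mean(O)))
```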

 

Other composite indicators

(i) Index of agreement (d) (Willmott 1982):

\[
d = 1 - \frac{\sum_{i=1}^{N}\left(P_i - O_i\right)^2}{\sum_{i=1}^{N}\left(\left|P'_i\right| + \left|O'_i\right|\right)^2} \tag{9}
\]

where O'i = Oi - Ō and P'i = Pi - Ō; Oi is the observed value, Pi the simulated value, and Ō the observed mean.
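A sketch of the index of agreement in the same vein:

```python
# A sketch of Willmott's index of agreement, d, as defined above.
import numpy as np

def willmott_d(O, P):
    """Index of agreement; bounded, with 1.0 indicating perfect agreement."""
    O_bar = np.mean(O)
    potential = np.sum((np.abs(P - O_bar) + np.abs(O - O_bar)) ** 2)
    return 1.0 - np.sum((P - O) ** 2) / potential
```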

(ii) Refined index of Willmott et al. (2011)

The refined index of Willmott et al. (2011), dr, can be written as:

\[
d_r =
\begin{cases}
1 - \dfrac{\sum_{i=1}^{N}\left|P_i - O_i\right|}{c\sum_{i=1}^{N}\left|O_i - \bar{O}\right|}, & \text{when } \sum_{i=1}^{N}\left|P_i - O_i\right| \le c\sum_{i=1}^{N}\left|O_i - \bar{O}\right| \\[2ex]
\dfrac{c\sum_{i=1}^{N}\left|O_i - \bar{O}\right|}{\sum_{i=1}^{N}\left|P_i - O_i\right|} - 1, & \text{otherwise}
\end{cases}
\tag{10}
\]

where c = 2, and Ō is the mean of the observed values.
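A sketch of the refined index, with the two branches of the definition made explicit:

```python
# A sketch of the refined index d_r (c = 2 as in Willmott et al. 2011).
import numpy as np

def willmott_dr(O, P, c=2.0):
    """Refined index of agreement; ranges from -1 to 1, optimum 1."""
    num = np.sum(np.abs(P - O))
    den = c * np.sum(np.abs(O - np.mean(O)))
    if num <= den:
        return 1.0 - num / den
    return den / num - 1.0
```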

 

Proposed new index

 Percent mean absolute relative error (PMARE)

It is the 'mean absolute relative error', expressed as a percentage:

\[
\mathrm{PMARE} = \frac{100}{N}\sum_{i=1}^{N}\frac{\left|O_i - P_i\right|}{O_i} \tag{11}
\]

where |Oi - Pi| denotes the absolute value of the difference between the observed and simulated values. Theoretically, the value of PMARE ranges from 0% to +∞ (positive infinity). The interpretation and characterization of the index are discussed later.
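The PMARE itself reduces to a one-line computation (a sketch under the same conventions as above; observed values are assumed non-zero, since a zero observation would make the relative error undefined):

```python
# A sketch of the proposed PMARE index. Observed values are assumed
# non-zero, since a zero observation makes the relative error undefined.
import numpy as np

def pmare(O, P):
    """Percent mean absolute relative error; 0% indicates a perfect simulation."""
    return 100.0 * np.mean(np.abs(O - P) / O)
```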

Data for comparison of indices

To test the statistics and indices, both field-observed data and randomly generated data were used.

Simulation comparison with field-observed data

The field data originate from a wheat experiment in which diverse irrigation treatments were applied, representing different deficit-irrigation strategies. Simulation was performed using the FAO AquaCrop model (Steduto et al. 2009). Before simulation, the model was calibrated using one year of data. AquaCrop produces inferior simulations under extremely dry conditions (herein referred to as 'odd simulations', sometimes called 'outliers' in the literature), which is a common problem in many models. The observed and simulated outputs used to explore the behavior of the indices are summarized in Table 1.

[Table 1. Observed and simulated outputs of the wheat experiment]

Simulation comparison with random data

To show the behavior of the indices under different patterns of data series, values of O and P were generated using a random data generator. Specifically, three sets of random data of size n = 20 were generated separately for O and P using a random number generator (RANDOM.ORG 2012), whose randomness derives from atmospheric noise (Table 2, Fig. 1).
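The generation step can be reproduced in outline as follows (a sketch only: the study drew true-random integers from RANDOM.ORG, whereas this sketch uses NumPy's pseudo-random generator, and the integer range 1-100 is a hypothetical choice, not the range used in the study):

```python
# A sketch of the data-generation step; the seed and integer range are
# hypothetical stand-ins for the RANDOM.ORG draws used in the study.
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed for repeatability
datasets = [
    {"O": rng.integers(1, 101, size=20).astype(float),   # observed set
     "P": rng.integers(1, 101, size=20).astype(float)}   # predicted set
    for _ in range(3)                                    # three sets, n = 20
]
```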

Calculation of the indices

The indices were calculated in a Microsoft Excel spreadsheet following the equations given above.
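Equivalently, the calculation can be scripted; the sketch below evaluates every index on the random data sets, assuming the functions and `datasets` defined in the earlier sketches:

```python
# Scripted equivalent of the spreadsheet step: evaluate every index on
# the random data sets, using the functions and `datasets` sketched above.
for i, ds in enumerate(datasets, start=1):
    O, P = ds["O"], ds["P"]
    print(f"Set {i}: ME={mean_error(O, P):6.2f}  MAE={mean_absolute_error(O, P):6.2f}  "
          f"RMSE={rmse(O, P):6.2f}  ENS={e_ns(O, P):6.3f}  ELM={e_lm(O, P):6.3f}  "
          f"d={willmott_d(O, P):5.3f}  dr={willmott_dr(O, P):6.3f}  "
          f"PMARE={pmare(O, P):5.1f}%")
```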


[Table 2. Randomly generated data sets for O and P]

[Fig. 1. Plots of the random data sets]

 

Results

Grain yield of wheat

The statistical parameters and indices under the different conditions ('with' and 'without' odd simulated values) are presented in Table 3. The data points (with odd values) are illustrated graphically in Fig. 2 along with the 1:1 line.

[Fig. 2. Observed vs. simulated grain yield of wheat with the 1:1 line]

For simulation year 1, when the 'odd simulated' values are omitted from the calculation, the difference-based statistical indicators (mean error, ME; mean absolute error, MAE; root mean square error, RMSE; and relative error, RE) decreased compared with those calculated with the 'odd simulated values', which is logical. The efficiency-based indicators (ENS or ELG, ELM, the index of agreement d, and the new index of agreement dr) also decreased, although they should have increased. Similar behavior is also observed for year 2.

For the combined data, the statistical indicators followed the logical behavior. Here, ENS and ELM followed the logical trend of higher values 'without odd data' (i.e., with good simulated data), but d showed the reverse behavior, decreasing with good simulated values. The PMARE always followed the logical behavior, with no ambiguous results.

From the different data sets, it is revealed that the difference-based statistical indicators gave consistent and logical measures. The behavior of ENS and ELM is inconsistent, and reversed in two cases. The behavior of d is reversed in all the studied cases; similar behavior is also shown by r.

The behavior of ENS and ELM may be due to their inherent formulation/structure. From the equations of ENS and ELM, it is evident that they depend more on the observation range (Oi and Ō) than on the difference between the observed and predicted values. Thus, ENS and ELM are more sensitive to the observed range/fluctuation, and their output is neither consistent nor reliable. Similar behavior is also shown by d; in the studied cases, its outputs are consistently opposite to the logical direction. For dr, the behavior is inconsistent in two cases: the first- and second-year data.

Case of total biomass yield

The observed and simulated total biomasses are illustrated in Fig. 3 along with the 1:1 line. The statistical parameters and efficiency indices are presented in Table 4. The r value shows reverse behavior (opposite to the logical direction, and to MAE and PMARE) for the second-year and combined data, with higher values 'with odd simulation'.

 

[Table 3. Statistical parameters and indices for grain yield of wheat]

[Fig. 3. Observed vs. simulated total biomass of wheat with the 1:1 line]

Behavior with random data

The values of the statistical and efficiency-based indices for the three random data sets, calculated with 'all data' (herein referred to as 'with odd simulation') and 'without extreme values' (two extremes; herein 'without odd simulation'), are summarized in Table 5. Here, ENS and dr do not follow the logical behavior of the difference-based measures MAE and RMSE (and also PMARE) for the first and third data sets. The ELM shows a reverse trend for the third data set. The results indicate that r, ENS, ELM, and dr give ambiguous performance ratings under different conditions.

[Table 4. Statistical parameters and indices for total biomass of wheat]

[Table 5. Statistical and efficiency-based indices for the random data sets]

 

Discussion

Nowadays, modeling and the use of models are becoming a major thrust in all branches of science. It is increasingly important that the discussion of model evaluation procedures be expanded so that logical, consistent, and generally accepted indicators can be identified. The indicators should appropriately quantify the objective of model evaluation; that is, they should point toward an answer on the usability of the model. It is a logical demand that an 'ideal indicator' for model performance evaluation should:

(i) Have a straightforward physical meaning and interpretation

(ii) Indicate the strength (accuracy) or pitfalls (weaknesses) of the prediction capability, so that a decision can be made regarding the usefulness of the model

(iii) Have a value/trend consistent with the logical direction, and give no ambiguous performance rating

Graphical methods give the overall and real picture, while the different indices give quantitative measures. The diagnosis made from the graph must be supported by the quantitative measures, and the indices should be consistent in their results. Otherwise, the particular quantitative index is not suitable for model comparison and should be abandoned as a model performance measure.

Legates and McCabe (1999) suggested that correlation and correlation-based measures (e.g., the coefficient of determination, R2) should not be used to assess the goodness-of-fit of hydrologic or hydro-climatic models, as these measures were found to be over-sensitive to extreme values (outliers) and insensitive to additive and proportional differences between model predictions and observations. Willmott (1981) found ambiguous behavior of the correlation coefficient r. The present study also showed ambiguous behavior of r.

Within the domain of efficiency-based indicators, McCuen et al. (2006) showed that outliers can significantly influence sample values of the Nash-Sutcliffe efficiency index (ENS). In the present study, ENS and ELG also showed ambiguous results depending on the presence or absence of extreme values.

Willmott et al. (2011) proposed the new index dr and compared it with the mean absolute error (MAE) of the data sets, with which it varies logically. But the comparison should instead be made with the mean absolute relative error, because the MAE can vary with the data pattern/set while the mean absolute relative error may remain the same (i.e., no change in the relative pattern). In the present study, the dr index does not follow the logical trend within a particular data set, as in Table 3 (combined analysis), and is also ambiguous among different sets (first-year and combined data) when compared with the PMARE value. Similar inconsistencies are observed for the random data sets (Table 5, first and third data sets, compared with PMARE).

As the behavior of EF, d, dr, and r is not consistent and logical in all cases (ambiguous, conflicting performance ratings), they should be avoided as model performance measures.

The mean absolute error (MAE) and mean bias error (ME) have been suggested by Willmott and Matsuura (2006). But the MAE or ME does not convey the level or degree of error, and the ME can neutralize the amount of error when errors occur in both the positive and negative directions. The mean absolute relative error, expressed as a percentage, that is, the percent mean absolute relative error (PMARE, eqn. 11), overcomes these deficiencies. It has the merit over the mean absolute error that it directly indicates the strength or weakness of the simulation, and thus helps in deciding whether to accept or reject the model. Theoretically, the value of PMARE can range from 0% to +∞ (positive infinity). As it is a measure of error (relative to the observed values, which is more logical than any other reference), the optimum value is 0.0, indicating no error (a perfect simulation). Low magnitudes indicate less error (better simulation) and higher values indicate greater error (poorer simulation). The range 0 < PMARE < 100 can be considered the practical/acceptable range.
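As a hypothetical worked example (the values below are illustrative, not from the study's data sets), for observed values O = (2, 4, 5) and simulated values P = (2.2, 3.8, 5.5):

\[
\mathrm{PMARE} = \frac{100}{3}\left(\frac{0.2}{2} + \frac{0.2}{4} + \frac{0.5}{5}\right) = \frac{100}{3}\left(0.10 + 0.05 + 0.10\right) \approx 8.3\%
\]

that is, the simulation deviates from the observations by about 8.3% on average.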
Performance ratings based on any indicator may depend on the model type, the field of application (i.e., the sensitivity of the work/project in which the model output will be used), the availability of real-world data, etc. In general, the following ratings may be used as a guide for the PMARE value (Table 6):

[Table 6. Suggested model performance ratings based on the PMARE value]

 

The above thresholds/maximum limits for rating a model were determined after examining various data sets, by sequentially omitting the data with the highest percent difference between observed and simulated values and recalculating the PMARE.

Depending on the required precision, the user can choose a lower PMARE threshold. On the other hand, where no other means or data are available, the user may use a model with a higher PMARE value (say, 25%) to obtain a forecast.

The PMARE has distinct advantages over the other indicators:

(i) It is simple to calculate
(ii) It has a direct physical meaning
(iii) It directly indicates the accuracy or pitfalls of the simulation, and thus helps in deciding on the acceptability (or usefulness) of the model
(iv) It gives no ambiguous results
(v) It follows the logical direction
(vi) It is a relative measure, and thus applicable to any field of observation, regardless of units (scales of measurement) and range of values

 

 Summary and Conclusion 

Previous studies have produced comparable information on model evaluation indices (for selected models or in general), but no comprehensive standardization (or concrete suggestion) is available that covers the recently developed indices. The purpose of this investigation was to review and evaluate the available indices for model performance evaluation, and to identify a logical, interpretable, and unambiguous index for general use in model evaluation. The r, R2, and RMSE have been regarded as illogical, ambiguous, and open to misinterpretation by previous studies (which suggested abandoning them from the array of performance-testing indicators), and the present study supports that view.

The present investigation demonstrates that the index of agreement (d) between a very dissimilar model-predicted variable and the observed data can approach one (1.0), yet can take a lower value for nearly similar data sets. Ambiguous and inconsistent behavior of dr is also observed, so it cannot be regarded as a reliable indicator. The investigation also demonstrates that the efficiency-based indicators, such as ENS and ELG, are not consistent with the logical trend (showing a reverse trend in some cases), nor with the widely accepted difference-based measures (e.g., MAE, RMSE).

The PMARE (which is based on the same principle as the MAE, but relative to the observed data) shows consistent, robust, descriptive (clearly interpretable), and logical behavior, and thus can be used as an ideal indicator for model evaluation under diverse output conditions. A performance rating scheme based on PMARE is also suggested. From the investigation of various data sets (diverse in nature), it can be concluded that the index measures error with both accuracy and precision.

 

References

Addiscott T.M., Whitmore A.P., 1987. Computer simulation of changes in soil mineral nitrogen and crop nitrogen during autumn, winter and spring. J. Agric. Sci., Cambridge, 109, 141-157.

Alexandrov G.A., Ames D., Bellocchi G., Michael B., Crout N., Erechtchoukova M., Hildebrandt A., Hoffman F., Jackisch C., Khaiter P., Mannina G., Matsunaga T., Purucker S.T., Rivington M., Samaniego L., 2011. Technical assessment and evaluation of environmental models and software: Letter to the Editor. Environ. Modell. Softw. 26(3), 328-336.

Bellocchi G., Acutis M., Fila G., Donatelli M., 2002. An indicator of solar radiation model performance based on a fuzzy expert system. Agron. J. 94, 1222-1233.

Confalonieri R., Bregaglio S., Bocchi S., Acutis M., 2010. An integrated procedure to evaluate hydrological models. Hydrol. Process. 24(19), 2762-2770.

Crescimanno G., Garofalo P., 2005. Application and evaluation of the SWAP model for simulating water and solute transport in a cracking clay soil. Soil Sci. Soc. Am. J. 69, 1943-1954.

Dust M., Baran N., Errera G., Hutson J.L., Mouvet C., Schäfer H., Vereecken H., Walker A., 2000. Simulation of water and solute transport in field soils with the LEACHP model. Agric. Water Manage. 44, 225-245.

Fox D.G., 1981. Judging air quality model performance. Bull. Am. Meteorol. Soc. 62, 599-609.

Geerts S., Raes D., Garcia M., Miranda R., Cusicanqui J. A., Taboada C., Mendoza J., Huanca R., Mamani A., Condori O., Mamani J., Morales B., Osco V., Steduto P., 2009. Simulating Yield Response of Quinoa to Water Availability with AquaCrop. Agron. J. 101(3), 499-508.

Jacovides C.P., Kontoyiannis H., 1995. Statistical procedures for the evaluation of evapotranspiration models. Agric. Water Manage. 27, 365-371.

Legates D.R., McCabe G.J. Jr., 1999. Evaluating the use of "goodness-of-fit" measures in hydrologic and hydroclimatic model validation. Water Resour. Res. 35(1), 233-241.

Liu Y., Teixeira J.L., Zhang H.J., Pereira L.S., 1998. Model validation and crop coefficients for irrigation scheduling in the North China plain. Agric. Water Manage. 36, 233-246.

Loague K., Green R. E., 1991. Statistical and graphical methods for evaluating solute transport models: Overview and application. J. Contam. Hydrol. 7, 51 – 73.

Martorana F., Bellocchi G., 1999. A review of methodologies to evaluate agroecosystem simulation models. Ital. J. Agron. 3(1), 19-39.

McCuen R.H., Knight Z., Cutter A.G., 2006. Evaluation of the Nash-Sutcliffe efficiency index. J. Hydrol. Eng. 11, 597-602.

Moriasi D.N., Arnold J.G., Van Liew M.W., Bingner R.L., Harmel R.D., Veith T.L., 2007. Model evaluation guidelines for systematic quantification of accuracy in watershed simulations. Trans. ASABE. 50(3), 885-900.

Nash J. E., Sutcliffe J. V., 1970. River flow forecasting through conceptual models. 1. A discussion of principles. J. Hydrol. 10, 282- 290.

Prasher S.O., Madani A., Clemente R.S., Geng G.Q., Bhardwaj A., 1996. Evaluation of two water table management models for Atlantic Canada. Agric. Water Manage. 32, 49-69.

RANDOM.ORG, 2012. Random Integer Generator. www.random.org/integers, accessed 20 Feb. 2012.

Rivington M., Bellocchi G., Matthews K.B., Buchan K., 2005. Evaluation of three model estimations of solar radiation at 24 UK stations. Agric. For. Meteorol. 132, 228-243.

Shen Z.Y., Gong Y.W., Li Y.H., Hong Q., Xu L., Liu R.M., 2009. A comparison of WEPP and SWAT for modeling soil erosion of the Zhangjiachong watershed in the three Gorges reservoir area. Agric. Water Manage. 96, 1435-1442.

Steduto P., Hsiao T.C., Raes D., Fereres E., 2009. AquaCrop - The FAO crop model to simulate yield response to water: I. Concepts and underlying principles. Agron. J. 101(3), 426-437.

Stockle C., Martin S., Campbell G., 1992. A model to assess environmental impact of cropping systems. Am. Soc. Agric. Eng. Paper No. 92-2041.

Stockle C.O., Donatelli M., Nelson R., 2003. CropSyst - a cropping systems simulation model. Eur. J. Agron. 18, 289-307.

Suleiman A.A., 2008. Modeling daily soil water dynamics during vertical drainage using the incoming flow concept. Catena 73, 312-320.

Wagener T., Kollat J., 2007. Numerical and visual evaluation of hydrological and environmental models using the Monte Carlo analysis toolbox. Environ. Modell. Softw. 22, 1021-1033.

Willmott C. J., 1981. On the validation of models. Phys. Geogr. 2, 184-194.

Willmott C. J., 1982. Some comments on the evaluation of model performance. Bull. Am. Meteorol. Soc. 63, 1309 – 1313.

Willmott C.J., Ackleson S.G., Davis R.E., Feddema J.J., Klink K.M., Legates D.R., O'Donnell J., Rowe C.M., 1985. Statistics for the evaluation and comparison of models. J. Geophys. Res. 90(C5), 8995-9005.

Willmott C.J., Matsuura K., 2006. On the use of dimensioned measures of error to evaluate the performance of spatial interpolators. Int. J. Geogr. Inf. Sci. 20(1), 89-102.

Willmott C.J., Robeson S.M., Matsuura K., 2011. A refined index of model performance. Int. J. Climatol. DOI: 10.1002/joc.2419.

Yang C-C., Prasher S.O., Wang S., Kim S.H., Tan C.S., Drury C., Patel R.M., 2007. Simulation of nitrate-N movement in southern Ontario, Canada with DRAINMOD-N. Agric. Water Manage. 87, 299-306.

Yang J., Greenwood D.J., Rowell D.L., Wadsworth G.A., Burns I.G., 2000. Statistical methods for evaluating a crop nitrogen simulation model, N_ABLE. Agric. Syst. 64, 37-53.