LRID: A new metric of multi-class imbalance degree based on likelihood-ratio test

In this paper, we introduce a new likelihood ratio imbalance degree (LRID) to measure the class-imbalance extent of multi-class data. The imbalance ratio (IR) is usually used to measure class-imbalance extent in imbalanced learning problems. However, IR cannot capture the detailed information in the class distribution of multi-class data, because it only utilises the information of the largest majority class and the smallest minority class. The imbalance degree (ID) has been proposed to solve this problem of IR for multi-class data. However, we note that improper use of the distance metric in ID can have a harmful effect on the results. In addition, ID assumes that data with more minority classes are more imbalanced than data with fewer minority classes, which is not always true in practice; thus ID cannot provide reliable measurement when this assumption is violated. In this paper, we propose a new metric based on the likelihood-ratio test, LRID, to provide a more reliable measurement of class-imbalance extent for multi-class data. Experiments on both simulated and real data show that LRID is competitive with IR and ID, and can strengthen the negative correlation with F1 scores by up to 0.55.


Introduction
Imbalanced learning is an important research topic in the machine learning community (He and Garcia, 2009; Wang and Yao, 2012; Xue and Titterington, 2008; Xue and Hall, 2015). Imbalanced data are data with unequal class distributions: majority classes have many more samples than minority classes. Minority classes in imbalanced data can easily be misclassified by standard learning algorithms, which can lead to heavy costs in practice.
A lot of imbalanced learning algorithms have been developed over the past decade. To design algorithms that can deal with the class-imbalance problem, several approaches are widely adopted, such as the resampling approach (Nekooeimehr and Lai-Yuen, 2016; Ha and Lee, 2016; Zhu et al., 2017; Castellanos et al., 2018), the cost-sensitive approach (Cheng et al., 2016; Castro and Braga, 2013) and the ensemble approach (Sun et al., 2015; Lusa et al., 2016; Tang and He, 2017; Yuan et al., 2018). Most imbalanced learning algorithms are designed to solve binary classification problems, while multi-class imbalanced learning still needs further development (Wang and Yao, 2012).
In imbalanced learning, the class-imbalance extent is an important measurement to describe how imbalanced the data are (Ortigosa-Hernández et al., 2017). Usually, the more imbalanced the data, the larger the harmful effect on the classification results. An algorithm can be identified as better than others if it performs better on data that are more imbalanced. Moreover, the class-imbalance extent can be included in the design of a learning algorithm to improve the learning performance.
The imbalance ratio (IR) is the most commonly adopted metric of class-imbalance extent (He and Garcia, 2009). It is calculated as the ratio of the number of samples in the largest majority class to that in the smallest minority class. Although it is a good imbalance metric for binary-class data, IR cannot provide a high-resolution description of the imbalance extent of multi-class data, because it only considers the information of the largest and the smallest class and ignores the classes in between.
Ortigosa-Hernández et al. (2017) first propose a new metric, the imbalance degree (ID), to provide a high-resolution imbalance-extent measurement for multi-class data. ID is a sum of two components: 1) the normalised distance between the class distribution of the given data and that of the exactly balanced data, which takes values in [0, 1], and 2) m − 1, where m is the number of minority classes. By measuring the difference between class distributions in the first component, ID makes use of the information in all classes and can provide a higher-resolution measurement than IR. The second component ensures that the class-imbalance extent of data with more minority classes is always higher than that of data with fewer minority classes, because ID takes values in [m − 1, m].
In this paper, we note two problems of ID as a class-imbalance measurement for multi-class data, one in each of the two components. First, although the first component can capture the information from all classes, the distance metric adopted can have a large effect on the result. Several distance metrics are tested in Ortigosa-Hernández et al. (2017); however, which distance metric is suitable for the problem at hand is unknown. Second, the argument that the class-imbalance extent is higher for data with more minority classes seems reasonable at first glance; however, it is not always true. For example, consider two datasets with three classes, one with class frequencies (1, 1000, 1000) and the other with class frequencies (1000, 1000, 1003). Clearly, the second dataset is roughly balanced while the first is imbalanced. However, the ID of the second dataset is larger than that of the first, because the second has two minority classes. Thus it is not reliable to use the number of minority classes in ID without considering how minor the classes are.
To solve the above two problems, we propose a new class-imbalance extent metric for multi-class data, the likelihood ratio imbalance degree (LRID). We employ a natural and effective statistic, the log-likelihood ratio (Rice, 2006), to measure the difference between the class distribution of the imbalanced data and that of the exactly balanced data. Thus, LRID does not suffer from the problem of choosing a proper distance metric in practice. The number of minority classes is also not needed in LRID, so the second problem of ID is solved as well. Experiments on both simulated and real data demonstrate the effectiveness of LRID in measuring the imbalance extent of multi-class data.
The rest of the paper is organised as follows. In Section 2, we first formulate the imbalance problem and discuss the problems of IR and ID. We then propose LRID as a more effective and reliable metric that can be easily applied in practice. In Section 3, we compare IR, ID and LRID using both simulated data and real data. Lastly, in Section 4, we present some concluding remarks.

Methodology
In this section, we first formulate the imbalance problem based on the multinomial distribution, following Ortigosa-Hernández et al. (2017). We then introduce two measurements from the literature, the imbalance ratio (IR) and the imbalance degree (ID), and discuss their advantages and disadvantages for multi-class data. Lastly, to solve the problems of IR and ID, we propose a new measurement, the likelihood ratio imbalance degree (LRID), that can effectively measure the imbalance extent of multi-class data.
Formulating the imbalance problem using the multinomial distribution

Given a data vector x ∈ R^{p×1} and its label y, a generative classification model learns the joint distribution p(x, y) = p(y)p(x|y), where p(y) is the prior probability of label y. Suppose there are C possible outcomes for y: y ∈ {y_1, y_2, ..., y_C}. Each outcome y_c is associated with a probability p_c, with ∑_{c=1}^{C} p_c = 1. Thus the frequencies of the possible labels, n = [n_1, n_2, ..., n_C], can be modelled by a multinomial distribution, Multinomial(N, p), with parameters N and p = [p_1, p_2, ..., p_C].
Given a dataset, {x_i^c | i = 1, ..., n_c, c = 1, ..., C}, we take N as a known parameter: the total number of observations, N = ∑_{c=1}^{C} n_c. The parameter p_c is usually estimated as the fraction of observations in the cth class: p̂_c = n_c/N. We denote the estimate of p as p̂ = [p̂_1, p̂_2, ..., p̂_C].
For exactly balanced data, p_c = 1/C (c = 1, 2, ..., C). We use b = [1/C, 1/C, ..., 1/C] to denote the class-distribution vector of exactly balanced data. For imbalanced data, a class with p̂_c ≥ 1/C is defined as a majority class, while a class with p̂_c < 1/C is defined as a minority class. Therefore, a metric of class-imbalance extent can be a single value that summarises the difference between p̂ and b.

Imbalance ratio
The imbalance ratio (IR) measures the class-imbalance extent using the extreme values in p̂:

IR = p̂_max / p̂_min,   (1)

where p̂_max and p̂_min are the maximum and minimum values in p̂, respectively. Clearly, for multi-class data, the p̂_c's between p̂_max and p̂_min are ignored by IR: class distributions with the same p̂_max and p̂_min but different p̂_c's in between have the same IR. Thus IR is considered a low-resolution metric of class-imbalance extent for multi-class data (Ortigosa-Hernández et al., 2017).
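As a minimal sketch (the function name and the label-list interface are ours, not from the paper), IR can be computed from raw class labels:

```python
from collections import Counter

def imbalance_ratio(labels):
    """IR (Eq. 1): ratio of the largest class frequency to the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Two 3-class datasets with the same largest and smallest classes but
# different middle classes receive the same IR -- the low-resolution problem.
print(imbalance_ratio([0] * 100 + [1] * 10 + [2] * 50))  # 10.0
print(imbalance_ratio([0] * 100 + [1] * 10 + [2] * 95))  # 10.0
```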

Imbalance degree
To solve the problem of IR, Ortigosa-Hernández et al. (2017) propose the following high-resolution metric to summarise the difference between p̂ and b:

ID(p̂) = d(p̂, b) / d(p_m, b) + (m − 1),   (2)

where m is the number of minority classes, p_m describes the situation where there are exactly m minority classes in a dataset, d(·, ·) is a distance between two distributions, and d(p_m, b) is the maximum distance between b and all possible p_m. Ortigosa-Hernández et al. (2017) show that d(p_m, b) is the distance between b and a class-distribution vector with m zeros, (C − m − 1) entries equal to 1/C, and one entry equal to 1 − (C − m − 1)/C:

d(p_m, b) = d([0, ..., 0, 1/C, ..., 1/C, 1 − (C − m − 1)/C], b).   (3)

The first term in (2) is the normalised distance between p̂ and b, taking values in [0, 1], which utilises the information of all classes. Thus ID considers the detailed information in p̂ and is a high-resolution metric.
However, the distance metric used in the first term can have a large effect on the results, and there is no rule for choosing a proper distance metric in practice. In this paper, we aim to provide a simple and effective metric that can be easily applied in practice, without testing different distance metrics or parameters.
If we only used the first term as ID, different class distributions p̂ could yield the same value. The second term is therefore added to make ID an injective function that takes different values for different numbers of minority/majority classes (Ortigosa-Hernández et al., 2017).
There are two problems with the second term. First, it is not necessary for ID to be injective, because it is reasonable for different class distributions to have the same class-imbalance extent; we support this argument empirically in Section 3.1.2. In addition, the claim that ID is injective holds only for data with the same number of classes: if two datasets have different numbers of classes but the same number of minority classes, their IDs can still coincide.
Second, introducing the second term can cause further problems in measuring class-imbalance extent. The ID of a dataset with m minority classes takes values in [m − 1, m], since d(p̂, b)/d(p_m, b) ∈ [0, 1]. Thus the ID of a dataset with a large m is always larger than that of a dataset with a small m. However, it is not always true that more minority classes imply a higher imbalance extent. Suppose we have the following two datasets with C = 3: 1) p̂_1 = [1/100000, 1/2 − 1/200000, 1/2 − 1/200000] and 2) p̂_2 = [1/3.1, 1/3.1, 1 − 2/3.1]. Then ID(p̂_1) = c_1 + 0 ∈ [0, 1] and ID(p̂_2) = c_2 + 1 ∈ [1, 2], where c_1 and c_2 are the values of the first terms. The second dataset is thus considered more imbalanced than the first because its ID is larger. However, although it has two minority classes, the second dataset is roughly balanced, while the first is extremely imbalanced, with one class probability close to zero. Therefore ID fails to provide a reliable class-imbalance measurement in this case.
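The failure case above can be checked numerically. The sketch below is our own implementation of ID, using the total variation distance as d(·, ·) (one of the distances tested by Ortigosa-Hernández et al.); the function names are ours:

```python
def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def imbalance_degree(p_hat, dist=tv):
    """ID: normalised distance to the balanced distribution plus (m - 1)."""
    C = len(p_hat)
    b = [1.0 / C] * C
    m = sum(1 for p in p_hat if p < 1.0 / C)  # number of minority classes
    if m == 0:  # no minority class: the data are exactly balanced
        return 0.0
    # worst case with exactly m minority classes:
    # m zeros, (C - m - 1) entries of 1/C, one entry 1 - (C - m - 1)/C
    iota_m = [0.0] * m + [1.0 / C] * (C - m - 1) + [1.0 - (C - m - 1) / C]
    return dist(p_hat, b) / dist(iota_m, b) + (m - 1)

p1 = [1 / 100000, 0.5 - 1 / 200000, 0.5 - 1 / 200000]  # extremely imbalanced
p2 = [1 / 3.1, 1 / 3.1, 1 - 2 / 3.1]                   # roughly balanced
print(imbalance_degree(p1))  # close to 1.0 (m = 1)
print(imbalance_degree(p2))  # close to 1.03 (m = 2), larger despite near balance
```

Although p̂_1 is extremely imbalanced and p̂_2 nearly balanced, ID ranks p̂_2 as more imbalanced purely because it has two minority classes.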

Likelihood ratio imbalance degree
To solve the problems in ID, we propose a new metric of class-imbalance extent for multi-class data, the likelihood ratio imbalance degree (LRID).
First, since an improper distance metric may have a harmful effect on ID, we propose not to use a distance between two distributions in the imbalance-extent measurement. Instead, we employ a natural and powerful statistical-inference technique, the likelihood-ratio (LR) test (Rice, 2006), to provide a single value that summarises the difference between p̂ and b.
Given a dataset with C classes and counts n = [n_1, n_2, ..., n_C], the LR test for the multinomial distribution Multinomial(N, p) tests the null hypothesis that the parameter vector p equals specific values. Here we test whether p is well fitted by b, the balanced class distribution; that is, we test H_0: p = b against H_1: p = p̂. The LR test statistic is

−2 ln [L(b|n) / L(p̂|n)],   (4)

where L(·) is the likelihood function. For balanced data, L(b|n) = L(p̂|n) and the statistic is 0; for imbalanced data, L(b|n) < L(p̂|n) and the statistic is larger than 0. The larger the difference between the estimated class distribution p̂ and the balanced class distribution b, the larger the statistic. Therefore the value of the test statistic can be used to measure the difference between p̂ and b, i.e. the class-imbalance extent. Moreover, like the first term of ID, the LR test statistic uses the information of all classes and is thus a high-resolution measurement.

Second, as discussed in the previous section, the second term of ID, (m − 1), is unnecessary and brings problems to the metric. In our new metric, we therefore eliminate this term and simply use the LR test statistic in (4). We term this metric the likelihood-ratio imbalance degree (LRID). With p̂_c = n_c/N, LRID can be written as

LRID = 2 ∑_{c=1}^{C} n_c ln(C n_c / N).   (5)
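A minimal implementation of (5) from raw labels might look as follows (the function name and label-list interface are ours):

```python
import math
from collections import Counter

def lrid(labels):
    """LRID (Eq. 5): likelihood-ratio statistic -2 ln[L(b|n) / L(p_hat|n)]
    = 2 * sum_c n_c * ln(C * n_c / N) for observed class counts n_c."""
    counts = Counter(labels)
    C, N = len(counts), len(labels)
    return 2.0 * sum(n_c * math.log(C * n_c / N) for n_c in counts.values())

print(lrid([0] * 50 + [1] * 50))            # 0.0 for exactly balanced data
print(lrid([0] + [1] * 1000 + [2] * 1000))  # large for extremely imbalanced data
```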

Experiments
In the following experiments, we compare three imbalance-degree metrics, IR, ID and LRID, on both simulated and real datasets. For ID, as suggested by the experimental results in Ortigosa-Hernández et al. (2017), the total variation distance and the Hellinger distance perform best, and we test both in our experiments. The IDs using the total variation distance and the Hellinger distance are denoted ID_TV and ID_HE, respectively.
The performances of the three metrics are assessed by the two criteria proposed in Ortigosa-Hernández et al. (2017): 1) the resolution of the metric and 2) the correlation between the metric and the classification performance. A better metric is expected to have higher resolution and a more negative correlation with classification performance (the more imbalanced the data, the worse the classification performance). In this paper, the classification performance is measured by the F1 score, a widely used metric in imbalanced learning (He and Garcia, 2009). Linear discriminant analysis (LDA) is adopted as the classification algorithm.
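For reference, both correlation criteria can be sketched in a few lines of pure Python (the helper names are ours; as a simplification, the rank computation does not average tied ranks, unlike standard SRCC implementations):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient (PCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC): PCC of the ranks.
    Simplification: ties are not averaged."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

# A good imbalance metric should correlate negatively with the F1 score.
metric = [0.1, 0.5, 1.2, 2.0, 3.5]
f1 = [0.95, 0.90, 0.70, 0.55, 0.40]
print(spearman(metric, f1))  # perfectly monotone decreasing
```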
Simulated data

Experiment settings for simulated data

Here we design experiments to compare the performance of the class-imbalance metrics on data with different class separation. Prati et al. (2004) show that the classification performance on imbalanced data can also be affected by intrinsic properties of the data. Since the correlation between classification performance and a class-imbalance metric is one of the criteria used to assess the metrics, we test whether other properties of the data, such as the separation of the classes, affect the performances of the metrics. We simulate three sets of data with different degrees of class separation: well separated, overlapped and extremely overlapped. We measure the separation of the data by an index called the 'separability index' (SI) (Greene, 2001; Thornton, 2002; Mthembu and Marwala, 2008). SI measures the proportion of observations whose nearest neighbour belongs to the same class, taking values between 0 and 100%. The details of the three sets of data are described as follows.
For each dataset, we simulate N = 10000 observations with C = 10 classes; that is, for exactly balanced data, each class contains 1000 observations and b = [1/10, 1/10, ..., 1/10]. For imbalanced data, the number of minority classes m is set from 1 to 9. For each m, the probability vector of the multinomial distribution is set to p = [p_min, ..., p_min, p_maj, ..., p_maj], where the m minority classes have equal probabilities p_min = r/10 and the C − m majority classes have equal probabilities p_maj = (1 − mr/10)/(C − m). To control the imbalance degree, r is set to 0.01, 0.05, 0.1, 0.5 and 0.9. Thus, for each number of minority classes m, we use five different numbers of observations for the minority classes, where r = 0.01 corresponds to the most imbalanced situation and r = 0.9 to the roughly balanced situation.
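The design above can be sketched as follows (`simulate_p` is our naming; we write p_min = r/C, which reduces to the paper's r/10 when C = 10):

```python
def simulate_p(C=10, m=3, r=0.1):
    """Class-probability vector of the simulation design: m minority classes
    with p_min = r / C, and C - m majority classes sharing the remaining mass
    equally, p_maj = (1 - m * r / C) / (C - m)."""
    p_min = r / C
    p_maj = (1.0 - m * p_min) / (C - m)
    return [p_min] * m + [p_maj] * (C - m)

# r = 0.01 is the most imbalanced setting; r = 0.9 is roughly balanced.
for r in (0.01, 0.05, 0.1, 0.5, 0.9):
    p = simulate_p(C=10, m=3, r=r)
    print(r, p[0], round(sum(p), 12))
```

The N = 10000 observations are then drawn from Multinomial(N, p) for each setting.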
We apply LDA to each dataset and use the F1 score as the metric to assess classification performance. We perform 20 random training/test splits on each dataset, with 70% training data and 30% test data.
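The separability index used to group the simulated datasets can be sketched as follows (`separability_index` is our naming; a brute-force nearest-neighbour computation under Euclidean distance):

```python
import math

def separability_index(X, y):
    """SI: percentage of points whose Euclidean nearest neighbour shares
    their label. Brute-force O(n^2) sketch; ties broken by first minimum."""
    n = len(X)
    same = 0
    for i in range(n):
        best_j, best_d = -1, float("inf")
        for j in range(n):
            if i == j:
                continue
            d = math.dist(X[i], X[j])
            if d < best_d:
                best_d, best_j = d, j
        if y[best_j] == y[i]:
            same += 1
    return 100.0 * same / n

# Two well-separated clusters: every nearest neighbour shares its label.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
y = [0, 0, 0, 1, 1, 1]
print(separability_index(X, y))  # 100.0
```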

Results of simulated data
i) The resolution of the measurements: Since the class frequencies n are the same for the three sets of data (which differ only in their class means), the values of each class-imbalance metric are identical across the three sets. The values of the three metrics for different numbers of minority classes m and different imbalance extents r are shown in Fig. 2.
For each plot in Fig. 2, the horizontal axis shows values of r and the vertical axis shows the values of the metrics. Each line in the plot corresponds to a specific value of m.
We observe different patterns for ID, IR and LRID against m and r. IR has the lowest resolution of the three: its lines are close when r ≥ 0.1 and overlap when r ≥ 0.5, indicating that IR cannot distinguish well between data with different m. In contrast, ID has the highest resolution: its lines are equally separated, indicating that ID distinguishes well between data with different m. In addition, each line trends downward, i.e. the value of ID decreases as r increases. LRID has a resolution between those of IR and ID: the distances between its lines decrease as r increases, and when r = 0.9 the LRIDs for different numbers of minority classes are similar.
However, resolution is not the only criterion for assessing the quality of an imbalance-degree metric. Although ID has the highest resolution, this is not always desirable: when r = 0.9, the datasets with different m are all roughly balanced and should have similar imbalance extents. IR and LRID, which take similar values for different m in this case, are more reasonable. We will discuss how this problem affects the correlation between ID and classification performance in the next section.
ii) The correlation with classification performance: In Ortigosa-Hernández et al. (2017), the correlations between ID and classification performance are calculated after eliminating the second term (m − 1). We denote ID without (m − 1) as ID*. In this paper, we report the correlations for both ID and ID*. The Spearman rank correlation coefficient (SRCC) and the Pearson correlation coefficient (PCC) between the metrics and the F1 scores are shown in Table 1. We can make the following observations from Table 1. First, the performances of the metrics differ across separation degrees of the data. When the data are well separated, IR has the best SRCC and ID*_HE has the best PCC. However, as the data become more overlapped, LRID becomes the best metric in terms of both SRCC and PCC.
Second, ID_TV and ID_HE perform worse than ID*_TV and ID*_HE, which suggests that including the second term (m − 1) can be harmful when evaluating the imbalance degree of our simulated data. This observation supports our argument in Section 2.3.
To investigate the above observations further, we plot the F1 scores against the values of r and m in Fig. 3. For data that are well separated, the number of minority classes m does not have much effect on the F1 scores when r is large, as shown in Fig. 3a: the lines for m = 4 to m = 9 are almost overlapped for all values of r, and the lines for all m are overlapped for r ≥ 0.5. Hence, when data are well separated, it is reasonable for data with the same p̂_max and p̂_min but different p̂_c's in between to have the same imbalance extent in terms of the correlation with classification performance. Therefore a high-resolution metric is not needed in this situation, and it makes sense that IR performs best in this case.
However, things are different when data are overlapped: the number of minority classes m has a larger effect on the F1 scores, and the lines are more separated in Fig. 3b and Fig. 3c than in Fig. 3a. When data are overlapped, the larger the number of minority classes, the lower the F1 score. Therefore IR does not perform well, while high-resolution metrics such as ID* and LRID do.
To make the above analysis more concrete, we compare the plots in Fig. 3 and Fig. 2. The plot of IR, Fig. 2a, has the most similar shape (but opposite trend) to Fig. 3a, which explains the good performance of IR for well-separated data in terms of correlations. In addition, the plot of ID_HE in Fig. 2b reveals the reason for its poor performance: when r = 0.9, datasets with different numbers of minority classes have very similar F1 scores, yet ID_HE assigns them very different imbalance degrees. The plot of LRID in Fig. 2c has the most similar shape to Fig. 3b and Fig. 3c, which explains its best performance in both cases.
To sum up, the simulated experiments lead to the following conclusions. First, the ranking of the three metrics by resolution is IR < LRID < ID, as shown in Fig. 2. Second, the separation of the data affects the performance of the metrics in terms of the correlation with classification performance: data with different numbers of minority classes m can have the same imbalance extent in view of their classification performance, as shown in Fig. 3. IR and ID*_HE are best for well-separated data, while LRID is best for overlapped and extremely overlapped data. Therefore LRID is competitive with the other two metrics on both criteria: it has reasonably high resolution and competitive correlations with classification performance. In practice, if the data are known to be well separated, IR is sufficient to measure the imbalance extent of multi-class data; if the data are overlapped, or the separation level is unknown, LRID is a good candidate.

Real data
Ten UCI datasets are used in the experiments (Dheeru and Karra Taniskidou, 2017): yeast, ecoli, wine, abalone, auto mpg, glass, Hayes-Roth, pageblocks, penbased and shuttle. Descriptions of the ten datasets are given in Table 2. As for the simulated data, LDA is applied to all datasets. We perform 20 random training/test splits on each dataset, with 70% training data and 30% test data. The mean F1 score is recorded for each dataset, and the PCC and SRCC between the imbalance-extent metrics and the F1 scores are calculated.

Results of real data
The correlations for the real datasets are shown in Table 3. ID*_TV and LRID achieve the best SRCC and PCC, while IR does not correlate well with classification performance on the real datasets. This result is supported by the SIs of the datasets in Table 2: except for the last three datasets, the other seven show different degrees of overlap based on their SI values, and the abalone dataset has a very low SI of 20%. Thus LRID shows a better correlation with the F1 scores on these datasets.
As with the simulated data, the results on real data suggest that the distance metric can have a great effect on the performance of ID*. In addition, adding (m − 1) can harm the performance of ID. In contrast, the new LRID provides performance competitive with the best ID* while avoiding the difficulty of choosing a suitable distance metric.
The results on real data also demonstrate that LRID is a simple and effective measurement of class-imbalance extent of multi-class data.

Conclusion
In this paper, we propose a new metric based on the likelihood-ratio test, the likelihood-ratio imbalance degree (LRID), to measure the class-imbalance extent of multi-class data. LRID provides an effective measurement of class-imbalance extent and can be easily applied in practice. In the experiments, LRID demonstrates superior performance over IR and ID on both simulated and real data.