From Stances' Imbalance to Their Hierarchical Representation and Detection

Stance detection has gained increasing interest from the research community due to its importance for fake news detection. The goal of stance detection is to categorize the overall position of a subject towards an object into one of four classes: agree, disagree, discuss, and unrelated. One of the major problems faced by current machine learning models used for stance detection is the severe imbalance among these classes. Hence, most models fail to correctly classify instances that fall into minority classes. In this paper, we address this problem by proposing a hierarchical representation of these classes, which combines the agree, disagree, and discuss classes under a new related class. Further, we propose a two-layer neural network that learns from this hierarchical representation and controls the error propagation between the two layers using the Maximum Mean Discrepancy regularizer. Compared with conventional four-way classifiers, this model has two advantages: (1) the hierarchical architecture mitigates the class imbalance problem; (2) the regularization helps the model to better discern between the related and unrelated stances. Extensive experiments demonstrate state-of-the-art accuracy of the proposed model for stance detection.


INTRODUCTION
Online news is usually less substantiated than news from traditional services such as magazines or newspapers [1,45,47]. A large volume of fake news is being produced for political or economic purposes [8,22,41]. Fake news consists of news articles that purport to be factual but contain misstatements of fact intended to arouse passions, attract viewership, or deceive [25,37,44]. Verifying news content requires retrieving evidences and determining their stance with respect to the news claims, which poses new challenges for the conventional stance detection task [31,36]. We define evidence as text, e.g., web pages and documents, that can be used to establish whether news content is true. Moreover, automatic stance detection has broad applications in information retrieval and text entailment [34,42].
The task of stance detection is to identify the stance of an evidence towards a given news claim [12,13]. Stances can be categorized into four classes: agree, disagree, discuss and unrelated [17]. Two characteristics make the stance detection task peculiar. On the one hand, news claims and evidences are often unrelated, which generates a severe class imbalance problem. On the other hand, since the three classes other than unrelated are by definition related, identifying an evidence as related or unrelated to a news claim is intuitively a semantically different task from assigning it to one of the other three classes. These two characteristics suggest the natural presence of a hierarchical structure among stance classes.
Stance detection has been studied in the areas of information extraction and natural language processing [11,40]. However, previous methods tackle the task as a multiclass classification problem, neglecting the hierarchical structure in stance classes. Also, the commonly used four-way classifiers are easily influenced by the class imbalance problem. In this paper, we address this issue by modeling the stance detection task as a two-layer neural network. The first layer aims at identifying the relatedness of the evidence, while the second layer aims at classifying the evidences identified as related into the other three classes: agree, disagree and discuss. Moreover, we study three levels of dependence between the two layers: (1) independent, when there is no error propagation between the two layers; (2) dependent, when the error propagation is left free; and (3) learned, when the error propagation is controlled by Maximum Mean Discrepancy (MMD). We show that, when the dependence is learned, the neural network (a) better separates the distributions of related and unrelated stances and (b) outperforms the state-of-the-art accuracy for the stance detection task. The remainder of the paper is organized as follows: § 2 summarizes the related work; § 3 defines the stance detection task; § 4 details the proposed hierarchical classification model and the regularization term; § 5 describes the datasets used and the experimental setup; § 6 is devoted to experimental results; and § 7 concludes the paper.

RELATED WORK
Machine learning techniques are widely researched to tackle the stance detection task. Previous works focus on political or congressional floor debates [11,40,46] and online forums [2,19,27,38,39,42]. Most of these works rely on content-based features, such as sentiment analysis and topic-specific features learned from labeled datasets for a closed set of topics.
Two methods only consider the agree, disagree and discuss classes: Bar-Haim et al. [7] split the stance detection task into three sub-tasks and propose a Contrast Classification Algorithm to distinguish the agree and disagree classes; Augenstein et al. [4] build a neural network architecture based on bidirectional conditional encoding on a Twitter dataset. A long short-term memory (LSTM) network encodes the claim and another LSTM encodes the text with the encoded claim as initial states. These methods fail to consider the unrelated class.
Two other methods consider all the classes, but use two different models: Bourgonje et al. [10] use the lemmatized n-gram matching and a rule-based procedure to decide the evidence relatedness, and a three-way logistic regression classifier to distinguish among the relevant classes; Wang et al. [43] firstly develop a gradient boosted decision tree (GBDT) model [28] to determine the evidence relatedness, then another GBDT model is used to distinguish stances of the text towards the claim. These methods involve feature engineering in separate models and cannot be jointly optimized to achieve the best performance.
Other methods that also consider all the classes have been developed during the Fake News Challenge stage 1 (FNC-1) [18]. The winning team uses a 50%/50% weighted average between a GBDT model and a convolutional neural network (CNN) [5]. The second best performance is achieved by an ensemble of five multi-layer perceptrons (MLPs), where the input features include bag-of-words and semantic analysis in addition to the baseline features developed by the challenge organizers [16]. Compared to the above two solutions, the third best team does not use ensemble methods. They use TF-IDF features and an MLP as a four-way classifier [33]. Zhang et al. [48] propose a ranking method to tackle the task and achieve empirical performance improvements. However, these methods all neglect the hierarchical structure among the four types of stances and suffer from class imbalance.
Deep learning-based methods have also been applied in the FNC-1. Bajaj [6] utilizes LSTM, CNN and their variants to detect stances, and finds that an attention-augmented CNN obtains the best performance. Rakholia and Bhargava [32] analyze the effectiveness of different ways of text encoding, such as independent encoding, bidirectional conditional encoding and attentive readers, and conclude that the attentive reader model is the most suitable for the task. Ma et al. [23] propose a multi-task learning algorithm that jointly detects rumours and stances. However, all these methods fail to achieve high accuracy for the agree and disagree classes.
There are three major defects in all the aforementioned methods: (a) they neglect the hierarchical relationships among the four stances; (b) they suffer from the class imbalance problem, and; (c) they fail to achieve acceptable detection performance for the agree and disagree classes.

STANCE DETECTION TASK
The stance detection task consists of classifying the stance of an evidence towards a claim as one of four classes: agree, disagree, discuss and unrelated. The formal definitions of these four stances are: agree: the evidence supports the claim; disagree: the evidence denies the claim; discuss: the evidence does not take a position on the claim; unrelated: the evidence is not about the claim.

HIERARCHICAL CLASSIFICATION
In this section, we detail our proposed two-layer neural network for stance detection. § 4.1 outlines the model. In order to better differentiate between the related and unrelated classes, we design an MMD regularization term in § 4.2. This is then integrated into the two-layer neural network loss function in § 4.3. In Figure 1, we show the architecture of our model.

Two-Layer Neural Network
Let the input space be formed by m-dimensional real vectors in a neural network, denoted as $v \in \mathbb{R}^m$. The four-class label can be transformed into a one-hot vector $y$. The $i$-th dimension of $y$ ($y_i$) is 1 when the stance is the $i$-th element in the label set {agree, disagree, discuss, unrelated} and 0 otherwise. The hidden layer with parameters $\theta_u$ learns to map $v$ to a $k$-dimensional hidden representation $u \in \mathbb{R}^k$:

$$u = f(v; \theta_u) \tag{1}$$

For the two-layer classification, the first layer decides whether the evidence is related to a claim. Hence, the first classification layer is called the relatedness layer. This layer is parameterized by $\theta_r$ and learns to produce a 2-dimensional normalized vector $\hat{r}$ as follows:

$$\hat{r} = g(u; \theta_r) \tag{2}$$

Note that the Softmax function is included in $g$ to normalize the 2-dimensional vector, so each component of the vector $\hat{r}$ denotes the probability that the neural network assigns $v$ to the related and unrelated classes, i.e., $p(related)$ and $p(unrelated)$. The second layer classifies the evidence into the related classes, i.e., the agree, disagree, or discuss stances. Hence, the second classification layer is called the stance layer. The stance layer is parameterized by $\theta_s$ and learns to produce a 3-dimensional normalized vector $\hat{s}$:

$$\hat{s} = h(u, \hat{r} \cdot (1, 0); \theta_s) \tag{3}$$

where the vector multiplication $\hat{r} \cdot (1, 0)$ extracts the first element of $\hat{r}$. Note that the Softmax function is also included in $h$ to normalize the 3-dimensional vector, so that each component of the vector $\hat{s}$ denotes the conditional probability that the neural network assigns $v$ to agree, disagree and discuss given that $v$ is related, i.e., $p(agree \mid related)$, $p(disagree \mid related)$, and $p(discuss \mid related)$.
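As an illustration, the forward pass described above can be sketched in plain Python. This is a minimal sketch and not the implementation used in the experiments: the single ReLU layer standing in for the hidden layer, the linear-plus-Softmax form of the two classification layers, the random initialization, and in particular the concatenation of p(related) to u before the stance layer are all assumptions made for the example.

```python
import math
import random

def softmax(z):
    # Numerically stable Softmax over a list of floats.
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]

def affine(W, b, x):
    # Computes W x + b for a list-of-rows matrix W.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def init(rows, cols, rng):
    # Small random weights and zero biases (hypothetical initialization).
    W = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return W, [0.0] * rows

def forward(v, params):
    (Wu, bu), (Wr, br), (Ws, bs) = params
    u = [max(0.0, a) for a in affine(Wu, bu, v)]   # hidden representation u
    r_hat = softmax(affine(Wr, br, u))             # [p(related), p(unrelated)]
    p_related = r_hat[0]                           # r_hat . (1, 0)
    # ASSUMPTION: the stance layer sees u concatenated with p(related).
    s_hat = softmax(affine(Ws, bs, u + [p_related]))
    return u, r_hat, s_hat

rng = random.Random(0)
m, k = 6, 4  # toy input and hidden dimensions
params = (init(k, m, rng), init(2, k, rng), init(3, k + 1, rng))
u, r_hat, s_hat = forward([0.5, 0.0, 1.0, 0.2, 0.0, 0.3], params)
```

Both outputs are normalized, so `r_hat` and `s_hat` can be read directly as the relatedness probabilities and the conditional stance probabilities.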

Relatedness Layer
We define the classification loss by the Kullback-Leibler (KL) divergence [21], which measures the difference between the network outputs and labels:

$$l_r = \mathrm{KL}(r \,\|\, \hat{r}) = \sum_{i=1}^{2} r_i \log \frac{r_i}{\hat{r}_i} \tag{4}$$

where $r$ is the ground-truth relatedness of the input data. $r$ is computed from a label $y$ as follows:

$$r = \big(\mathbb{1}(y \neq e_4),\ \mathbb{1}(y = e_4)\big) \tag{5}$$

where $\mathbb{1}$ is the indicator function and $e_4$ is a 4-dimensional one-hot vector with the fourth element equal to 1. When $y = e_4$ holds, the label belongs to the unrelated class. Similarly, the stance classification loss can be defined as:

$$l_s = \mathrm{KL}(s \,\|\, \hat{s}) = \sum_{i=1}^{3} s_i \log \frac{s_i}{\hat{s}_i} \tag{6}$$

where $s$ is the ground-truth stance of the input data. $s$ is computed from a label $y$ as follows:

$$s = \big(\mathbb{1}(y = e_1),\ \mathbb{1}(y = e_2),\ \mathbb{1}(y = e_3)\big) \tag{7}$$

where $e_1$, $e_2$, $e_3$ are 4-dimensional one-hot vectors with the first, second, and third elements equal to 1. When $y = e_1$ holds, the label belongs to the agree class; when $y = e_2$ holds, it belongs to the disagree class; and when $y = e_3$ holds, it belongs to the discuss class. Finally, we define the loss function for the two-layer neural network as a linear combination of the loss function of the relatedness layer ($l_r$) and the loss function of the stance layer ($l_s$) (Eq. (8)), in which the coefficient $\alpha$ weights the relative importance of the two classification layers.
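The construction of the two targets and the KL-based losses can be illustrated with a short sketch. The function names below are hypothetical, and the small `eps` guard against a zero predicted probability is an addition for numerical safety rather than part of the definition.

```python
import math

LABELS = ["agree", "disagree", "discuss", "unrelated"]

def one_hot(label):
    # Four-class one-hot vector y in the label order used in the text.
    return [1.0 if name == label else 0.0 for name in LABELS]

def relatedness_target(y):
    # r = (1[y != e4], 1[y == e4]) = (related, unrelated).
    unrelated = y[3]
    return [1.0 - unrelated, unrelated]

def stance_target(y):
    # s = (1[y == e1], 1[y == e2], 1[y == e3]); all zeros for unrelated.
    return y[:3]

def kl_divergence(target, predicted, eps=1e-12):
    # KL(target || predicted); terms with a zero target contribute nothing.
    return sum(t * math.log(t / max(p, eps))
               for t, p in zip(target, predicted) if t > 0.0)

y = one_hot("disagree")
l_r = kl_divergence(relatedness_target(y), [0.8, 0.2])
l_s = kl_divergence(stance_target(y), [0.3, 0.6, 0.1])
```

With one-hot targets each loss reduces to the negative log-probability of the true class, and an unrelated instance yields an all-zero stance target and therefore a zero stance loss.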

Maximum Mean Discrepancy
The classification of related/unrelated stances is a different task from that of agree/disagree/discuss stances. Therefore, data representations from the relatedness layer and the stance layer can be seen as samples drawn from two different distributions. In order to measure the distribution discrepancy between these two layers, we employ the Maximum Mean Discrepancy (MMD) [9] as a regularization term. The MMD does not involve density estimation and thus is a non-parametric way of measuring the difference between distributions. MMD has achieved success in face recognition and image annotation [15]. MMD is defined as follows:

Definition 4.1. Maximum Mean Discrepancy [9]: "Let $p$ and $q$ be two Borel probability distributions over a space $\mathcal{X}$ and let $X$ and $Z$ be sets with independent identically distributed samples drawn from $p$ and $q$. The MMD is defined by a class $\Psi$ of map functions $\psi : \mathcal{X} \rightarrow \mathcal{H}$ as:

$$\mathrm{MMD}(p, q, \Psi) = \sup_{\psi \in \Psi} \left\| \mathbb{E}_{x \sim p}[\psi(x)] - \mathbb{E}_{z \sim q}[\psi(z)] \right\|_{\mathcal{H}} \tag{9}$$

Here, $x$ and $z$ are samples from $X$ and $Z$."

In other words, the MMD equation defines the largest possible distance between two expectations over the set of functions $\Psi$. Moreover, "when $\mathcal{H}$ is the reproducing kernel Hilbert space (RKHS) [3], this means that for all $x \in \mathcal{X}$, the linear point evaluation function mapping $\psi \rightarrow \psi(x)$ exists and is continuous. When $\Psi$ is the unit ball in a universal RKHS, it is guaranteed that $\mathrm{MMD}(p, q, \Psi)$ will detect any discrepancy between $p$ and $q$ [9,35]."

Let $p$ denote the distribution for the first layer samples (unrelated hidden representations) in our model, with sample set $U^1 = \{u^1_1, \ldots, u^1_{n_1}\}$ and, according to Eq. (1), their generating set $V^1 = \{v^1_1, \ldots, v^1_{n_1}\}$. Let $q$ denote the distribution for the second layer samples (agree, disagree and discuss hidden representations), with sample set $U^2 = \{u^2_1, \ldots, u^2_{n_2}\}$ and, according to Eq. (1), their generating set $V^2 = \{v^2_1, \ldots, v^2_{n_2}\}$. $n_1$ and $n_2$ are the numbers of samples in $U^1$ and $U^2$. Thus we have $\mathcal{X} = \mathbb{R}^k$ and $\mathcal{H} = \mathbb{R}^j$ with $\psi(x) = \theta_d x$, where $\theta_d$ is a $j \times k$ matrix in the projection layer, and $k$ and $j$ are the space dimensions. According to Eq. (1), the hidden representation $u$ is parameterized by $\theta_u$, thus the empirical expression of MMD is parameterized by $\theta_u$ and $\theta_d$:

$$\mathrm{MMD}(U^1, U^2; \theta_u, \theta_d) = \left\| \frac{1}{n_1} \sum_{i=1}^{n_1} \theta_d\, u^1_i - \frac{1}{n_2} \sum_{i=1}^{n_2} \theta_d\, u^2_i \right\|_{\mathcal{H}} \tag{10}$$

By constantly changing the projection layer parameterized by $\theta_d$, we find the maximum expectation difference between the representations of the two classification layers.
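For a fixed projection matrix, the empirical discrepancy between the two sets of hidden representations can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the linear map ψ(x) = θ_d x from the text and takes the Euclidean norm of the difference between the projected sample means. Whether the original formulation uses the norm or its square is not recoverable here, and the toy values are invented for the example.

```python
import math

def mean_vector(samples):
    # Component-wise mean of a list of equal-length vectors.
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]

def project(theta_d, x):
    # Linear projection psi(x) = theta_d x, with theta_d a j-by-k matrix.
    return [sum(w * xi for w, xi in zip(row, x)) for row in theta_d]

def empirical_mmd(U1, U2, theta_d):
    # Distance between the projected means of the two sample sets.
    mean1 = mean_vector([project(theta_d, u) for u in U1])
    mean2 = mean_vector([project(theta_d, u) for u in U2])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mean1, mean2)))

# Toy hidden representations with k = 3, projected to j = 2 dimensions.
U1 = [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]]       # unrelated samples
U2 = [[2.0, 0.0, 0.0], [4.0, 0.0, 0.0]]       # related samples
theta_d = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
mmd = empirical_mmd(U1, U2, theta_d)
```

In training, θ_d would be updated to increase this quantity rather than held fixed as in the sketch.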

Optimization
The more different two distributions are, the larger the MMD is. Hence, in order to make the distributions easier to distinguish, a larger MMD regularization term is preferred, and we treat the regularization term as an extra goal besides classification. We integrate the two-layer classification loss (see Eq. (8)) and the MMD regularization term (see Eq. (10)) into a single objective function ($L$). Specifically, we add these two sub-goals with a hyperparameter $\beta$ as follows:

$$L = l - \beta\, \mathrm{MMD} \tag{11}$$

where $l$ is the two-layer classification loss of Eq. (8), $\mathrm{MMD}$ is the regularization term of Eq. (10), and $\beta$ weights the importance of the regularization. The larger the MMD regularization term is, the easier it is for the classifier to distinguish between the related and unrelated stances. Thus, the sign of the regularization term is negative. The optimization involves the minimization of the objective $L$ with respect to $\theta_u$, $\theta_r$, $\theta_s$, and $\theta_d$ as follows:

$$(\hat{\theta}_u, \hat{\theta}_r, \hat{\theta}_s, \hat{\theta}_d) = \arg\min_{\theta_u, \theta_r, \theta_s, \theta_d} L \tag{12}$$

Optimizing the model consists of two sub-goals. On the one hand, we want to maximize the distribution discrepancy between the two classification layers. On the other hand, we want to minimize the classification loss of both layers. Both sub-goals involve updating the feature layer parameter $\theta_u$, but in opposite update directions. The optimization process does not stop until a saddle point is reached, i.e., a point at which the feature layer parameters serve both sub-goals well. Algorithm 1 shows the parameter update process, which is based on the mini-batch gradient descent algorithm.

Prediction
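Given as input a feature vector $v$, the classifier outputs the following probabilities: $p(unrelated)$, $p(agree \mid related)$, $p(disagree \mid related)$, and $p(discuss \mid related)$. However, the last three probabilities are not comparable with the first one. To make them comparable we derive $p(agree)$, $p(disagree)$ and $p(discuss)$. By observing that the class agree is assumed as related, thus $p(agree, related) = p(agree)$, we derive that:

$$p(agree) = p(agree, related) = p(agree \mid related)\, p(related) \tag{13}$$

Similarly, for the other two classes we derive that:

$$p(disagree) = p(disagree \mid related)\, p(related), \qquad p(discuss) = p(discuss \mid related)\, p(related) \tag{14}$$

Thereby, the actual output $\hat{y}$ of the model is:

$$\hat{y} = \big(p(agree),\ p(disagree),\ p(discuss),\ p(unrelated)\big) \tag{15}$$

where the class with the highest probability corresponds to the predicted stance.

Algorithm 1: Parameter update process based on the mini-batch gradient descent algorithm.
input: Sample mini-batch $\{v_i, r_i, s_i\}_{i=1}^{n}$, mini-batch size $n$, hyperparameters $\alpha$, $\beta$, and $\mu$
output: $\theta_u$, $\theta_r$, $\theta_s$, $\theta_d$
begin
    Initialize $\theta_u$, $\theta_r$, $\theta_s$, $\theta_d$;
    repeat
        Update $\theta_u$, $\theta_r$, $\theta_s$, $\theta_d$ by mini-batch gradient steps on $L$;
    until $\theta_u$, $\theta_r$, $\theta_s$, $\theta_d$ converge;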
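The prediction rule of this section, which multiplies each conditional stance probability by p(related) and then takes the most probable of the four classes, can be sketched as follows. The function names and the toy probabilities are hypothetical.

```python
LABELS = ["agree", "disagree", "discuss", "unrelated"]

def combine(r_hat, s_hat):
    # r_hat = [p(related), p(unrelated)]
    # s_hat = [p(agree|related), p(disagree|related), p(discuss|related)]
    p_related, p_unrelated = r_hat
    return [p_related * p for p in s_hat] + [p_unrelated]

def predict(r_hat, s_hat):
    # The class with the highest joint probability is the predicted stance.
    y_hat = combine(r_hat, s_hat)
    best = max(range(len(y_hat)), key=y_hat.__getitem__)
    return LABELS[best]

stance_a = predict([0.6, 0.4], [0.5, 0.3, 0.2])
stance_b = predict([0.9, 0.1], [0.2, 0.7, 0.1])
```

In the first call the instance is more likely related than unrelated, yet no single related stance exceeds p(unrelated), which shows why the conditional probabilities have to be rescaled before comparison.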

EXPERIMENTAL SETUP
We start this section by presenting the datasets and evaluation measures relevant to the stance detection task. Then, we describe the features used by our model and the model parameterization.
Finally, we present the baselines. The software used to run the experiments of this paper is available on the website of the first author.

Datasets
Experiments are conducted on two publicly available datasets: the Emergent dataset 1 [14] and the FNC-1 dataset 2 . In these two datasets, a claim is the headline of a news article and an evidence is the content of a news article. These datasets are split into train and test subsets; see Table 1 for statistics about the splits. The FNC-1 dataset consists of 75,385 instances. Each instance in the dataset is a claim-evidence pair labeled as one of the four stances: agree, disagree, discuss and unrelated. The ratio of training data to testing data in the FNC-1 dataset is ∼2:1. Every class accounts for a similar percentage in the train and test subsets. The unrelated stances are the majority (over 70%) in both subsets, while the disagree stances are less than 3%. The agree and discuss stances are less than 20% and 10%.
The Emergent dataset is similar to the FNC-1 dataset; however, it contains only agree, disagree and discuss stances. Hence, it needs to be augmented with unrelated stances. Similarly to how the unrelated stances of the FNC-1 dataset have been labeled, we manually labeled unrelated stances by pairing a claim with an unrelated evidence, i.e., an evidence originally paired with another claim. Moreover, to make the class distributions less imbalanced, we set the ratio of related stances to unrelated ones to ∼1:1. The augmented Emergent dataset contains 4,071 training labels and 1,024 testing labels, a ratio of ∼4:1. Class distributions between the train and test subsets are similar.
Compared to the FNC-1 dataset, the class distribution of the augmented Emergent dataset is more balanced. The percentage of unrelated stances is about 50%, whereas the percentages of agree and disagree stances are about 24% and 8%. Both datasets have similar percentages of discuss stances.
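The idea behind the augmentation, pairing a claim with an evidence that belongs to a different claim, can be illustrated with a small sketch. This is only an automated approximation for illustration: the unrelated pairs in the augmented Emergent dataset were labeled manually, and the function name, the tuple layout and the random sampling are assumptions.

```python
import random

def add_unrelated_pairs(pairs, n_unrelated, seed=0):
    """pairs: list of (claim, evidence, stance) with related stances only."""
    if len({claim for claim, _, _ in pairs}) < 2:
        raise ValueError("need at least two distinct claims")
    rng = random.Random(seed)
    augmented = list(pairs)
    added = 0
    while added < n_unrelated:
        claim, _, _ = rng.choice(pairs)
        other_claim, evidence, _ = rng.choice(pairs)
        # Keep the pair only if the evidence comes from a different claim.
        if other_claim != claim:
            augmented.append((claim, evidence, "unrelated"))
            added += 1
    return augmented

related = [
    ("claim A", "evidence about A", "agree"),
    ("claim B", "evidence about B", "disagree"),
    ("claim C", "evidence about C", "discuss"),
]
# Adding as many unrelated pairs as related ones gives a 1:1 ratio.
augmented = add_unrelated_pairs(related, n_unrelated=len(related))
```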

Evaluation Measure
In line with the FNC-1 challenge, the evaluation is based on a weighted two-level scoring system built on the accuracy measure. This evaluation measure, called relative score, evaluates a model by splitting the stance detection task into two sub-tasks: the related/unrelated classification and the agree/disagree/discuss classification. The former sub-task is given a 25% weight because it is considered easier than the latter, which is given a 75% weight.
We report the evaluation measures: relative score, accuracy, and accuracy on a per class basis.
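The relative score can be sketched as follows, based on our reading of the FNC-1 scoring rules: 0.25 points for a correct related/unrelated decision, a further 0.75 points when a related instance also receives the correct stance, and normalization by the score of a perfect prediction. The function names are hypothetical.

```python
RELATED = {"agree", "disagree", "discuss"}

def fnc_score(gold, predicted):
    # Weighted two-level score over paired gold and predicted stances.
    score = 0.0
    for g, p in zip(gold, predicted):
        if g == "unrelated":
            if p == "unrelated":
                score += 0.25          # correct unrelated decision
        elif p in RELATED:
            score += 0.25              # correctly identified as related
            if g == p:
                score += 0.75          # exact stance is also correct
    return score

def relative_score(gold, predicted):
    # Score normalized by the best achievable score on the same gold labels.
    return fnc_score(gold, predicted) / fnc_score(gold, gold)

gold = ["unrelated", "agree", "disagree", "discuss"]
pred = ["unrelated", "agree", "discuss", "unrelated"]
score = relative_score(gold, pred)
```

Because related instances are worth four times as much as unrelated ones, a classifier that always predicts the majority unrelated class obtains a low relative score despite a high plain accuracy.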

Feature Extraction
To represent claims and evidences we use a bag-of-words approach. For each claim and evidence we generate a TF-IDF vector, and for each claim-evidence pair we compute their cosine similarity. We also include the FNC-1 official features in the input feature vector.
1 https://github.com/willferreira/mscproject
2 https://github.com/FakeNewsChallenge/fnc-1
The final set of features includes:
• TF-IDF vectors of claims;
• TF-IDF vectors of evidences;
• Cosine similarity (CosSim) between the claim vector and the evidence vector;
• Ratio of word overlap (WordLap) between the claim and the evidence;
• An indicator of whether a claim has refuting words (RefWord);
• The polarity (Pol) of the claim and the evidence;
• The number of overlapping n-grams (NGrams) for n ∈ {2, 3, 4, 5, 6} between the claim and the evidence.
For the TF-IDF vectors, we only use the 2,000 most frequent terms, excluding stop words. All of these features are concatenated to form the input feature vector v.
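Two of the features above, CosSim and WordLap, can be illustrated with a dependency-free sketch. The whitespace tokenizer, the smoothed IDF formula and the Jaccard form of the word overlap are assumptions chosen for brevity, and the restriction to the 2,000 most frequent non-stop-word terms is omitted.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase whitespace tokenizer (assumption for the example).
    return text.lower().split()

def tfidf_vectors(documents):
    # Sparse TF-IDF vectors as dicts, using a smoothed IDF variant.
    tokenized = [tokenize(d) for d in documents]
    n = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()}
            for doc in tokenized]

def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def word_overlap(claim, evidence):
    # Shared tokens divided by all tokens of the pair (Jaccard ratio).
    c, e = set(tokenize(claim)), set(tokenize(evidence))
    return len(c & e) / len(c | e) if c | e else 0.0

claim = "Mayor resigns after scandal"
evidence = "The mayor said he resigns after the scandal broke"
claim_vec, evidence_vec = tfidf_vectors([claim, evidence])
cos_sim = cosine(claim_vec, evidence_vec)
overlap = word_overlap(claim, evidence)
```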

Experimental Setting
The following hyperparameters have been set via five-fold cross-validation on the train subsets:
• The dimension k of hidden representations is set to 100;
• The dimension j of the MMD is set to 10;
• The activation function used in the hidden layers is set to ReLU;
• The parameter α is set to 1.5 and 1.3 for the Emergent and FNC-1 datasets, respectively;
• The parameter β is set to 0.001.
We include an L2 regularization term [29] for the MLP weight parameters in the final loss function to mitigate overfitting. Dropout is also used to mitigate overfitting, with the rate set to 0.6. We train in mini-batches of size 64 over the entire train subset. Note that the gradient steps in Algorithm 1 can easily be replaced by a more powerful optimizer such as the Adam optimizer [20]. Early stopping is applied when the classification loss on the validation subset does not decrease for three consecutive iterations. The whole model is implemented with TensorFlow.
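The early stopping criterion can be sketched as a small helper. The function name is hypothetical, and reading "does not decrease" as "does not improve on the best validation loss seen before the last three iterations" is an assumption of this sketch.

```python
def should_stop(validation_losses, patience=3):
    """Return True when the last `patience` losses bring no improvement."""
    if len(validation_losses) <= patience:
        return False
    # Best loss observed before the most recent `patience` iterations.
    best_before = min(validation_losses[:-patience])
    recent = validation_losses[-patience:]
    return all(loss >= best_before for loss in recent)

history = [1.00, 0.80, 0.90, 0.85, 0.80]
stop_now = should_stop(history)
```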

Baselines
We compare our model against the methods mentioned in Section 2. These methods are detailed in the following. Among them we distinguish between methods that use the same features as ours and methods that learn their representations. We start with the latter type, which we call representation learning-based baselines:
Bidirectional LSTM (BiLSTM). Augenstein et al. [4] build a neural network architecture based on bidirectional LSTM on a Twitter dataset. An LSTM encodes the claim, and another LSTM encodes the evidence with the encoded claim set as initial states. The 100-d GloVe word embedding is used as input [30];
Attentive CNN (AtCNN). Bajaj [6] builds an attention-augmented CNN. The claim and the evidence are input to a convolutional neural network to obtain hidden representations, and the attention mechanism is employed to locate the words or phrases that most influence the final results;
Memory Network (MN). Mohtarami et al. [26] develop an end-to-end memory network for stance detection. The network operates at the paragraph level and integrates convolutional and recurrent neural networks, as well as a similarity matrix as part of the overall architecture;
Ranking Model (RM). Zhang et al. [48] build a ranking method to tackle stance detection and achieve empirical performance improvements. A ranking loss function is proposed to replace Softmax and maximize the representation difference between the four classes of stance.
We now review the second type of baselines, those methods that use the same features as our method, which we call feature engineering-based baselines:
Official Baseline (OB). This is the FNC-1 official baseline, which uses one gradient boosting decision trees model for four-way classification;
Logistic Regression (LR). Bourgonje et al. [10] use n-gram matching and a rule-based procedure to decide relatedness, and three-way logistic regression to distinguish among the related classes;
Gradient Boosted Decision Trees (GBDT). Wang et al. [43] develop two GBDT models, one to determine the relatedness of an evidence to a claim, and another to distinguish among the related classes;
Multi-Layer Perceptron (MLP). This model [33] achieved the third best performance in FNC-1. It extracts TF-IDF and cosine similarity between claims and evidences as input features, and uses an MLP as the four-class classifier.

RESULTS AND DISCUSSION
In this section, we start by analyzing the dependency assumption. Then, we compare and contrast our model against the baselines. Next, we provide a sensitivity analysis of the hyperparameters. We conclude with an impact analysis of the features used by the model.

Dependency Assumption
In Figure 2 we show the effect of the three dependency assumptions by visualizing the learned representations using a t-SNE projection [24]. We observe that when the classifiers are assumed independent, i.e., the classification is performed in cascade and no error is propagated from the second layer to the first during training, the learned representation separates the unrelated class well from the related ones. When the classifiers are assumed dependent, i.e., the two classifiers are trained together and the error is left free to propagate from the second layer to the first, the learned representation is not very well separated. However, when the dependence assumption of the two classifiers is learned via the MMD regularization, i.e., the two classifiers are trained together with the error propagation controlled by the regularizer, the learned representation is again well separated, as in the first case. Well-separated representations suggest a greater discriminative power of the model: the unrelated and related classes are almost linearly separable.
The last three rows of Tables 2 and 3 show the performance of our model on the two test subsets for each of the three assumptions: independent, dependent, and learned. Looking at the accuracy of the unrelated class, we observe that the accuracy is greater when the learned representations are well separated, as in the independent and learned cases. Furthermore, looking at all the other scores, we observe that the learned assumption outperforms both the independent and dependent assumptions in all other cases, demonstrating that jointly learning the relatedness and the stance of evidences towards claims is beneficial to the stance detection task.

Overall Performance
In Tables 2 and 3 we compare our model against the state-of-the-art models. Our model achieves the best stance detection performance for the relative score on both datasets. The model achieves 89.30% on the augmented Emergent test subset and 88.15% on the FNC-1 test subset.
By comparing with four-way classification baselines (OB, MLP, BiLSTM, AtCNN, MN and RM) we demonstrate the advantage of separating the relatedness detection from the stance detection. We observe that these classifiers perform poorly on the disagree class, which is caused by the large percentage difference between the minority disagree class and the majority unrelated class. Further, the more imbalanced the evaluation dataset becomes, the worse performance the four-way classifiers achieve on the minority disagree class.
By comparing with baselines that separate the relatedness detection from the stance detection (LR and GBDT) we demonstrate the superiority of a single end-to-end model. LR and GBDT are better on the disagree class, although their overall performance is worse than our model. In Figure 3 we show the confusion matrix of our model. Here we observe the detection performance on a per class basis. For the related/unrelated classification, we correctly classify 97.00% and 99.53% of the unrelated instances on the augmented Emergent and the FNC-1 test subsets. We can see that there is some misclassification between the agree and unrelated classes, and between the discuss and unrelated classes. The misclassification of the disagree class accounts for the largest error of the unrelated instances.
Our model achieves an accuracy of 69.05% and 72.35% for the disagree class on the Emergent and the FNC-1 test subsets. The classification accuracy is largely improved compared to the state-of-the-art. Some misclassification error exists between agree and disagree. However, our model can distinguish between the discuss and the disagree classes with few errors. Although the number of discuss instances is the largest and the number of disagree instances is the smallest, our model does not mistake disagree instances for discuss ones, i.e., the model has learned the core representation difference between these two classes. Due to ambiguous expressions, misclassification between agree and discuss is the cause of most errors between these classes, which leads to a slightly worse accuracy for the discuss class on the Emergent (84.30%) and FNC-1 (77.49%) test subsets.
Two reasons account for the improved empirical performance observed with our model. The first is the mitigation of the class imbalance problem. Contrary to the four-way classifiers that directly compare the disagree and unrelated instances, the hierarchical model avoids the direct comparison of the minority disagree class (which is less than 2% in the FNC-1 dataset) with the majority unrelated one (which is more than 70% in the FNC-1 dataset). The second is the MMD term, which maximizes the discrepancy between the unrelated class and the aggregated related classes. Since agree, disagree and discuss belong to the same class, the related class, the MMD regularization promotes the emergence of features that are useful to separate the class pairs: agree from unrelated, disagree from unrelated, and discuss from unrelated.

Hyperparameters Sensitivity
In this subsection we discuss the sensitivity of our model to its hyperparameters. The most influential hyperparameters for the proposed model are α and β. The former controls the relative importance of the classification layers. The latter weights the regularization. In Figures 4(a) and 4(b) we show how the performance of the model changes when varying α and β for the augmented Emergent and FNC-1 test subsets. α is searched between 0.1 and 3.0 with steps of 0.1, and β is searched in {0, 0.1, 0.01, 0.001, 0.0001, 0.00001}. For α, we observe that the performance of the model improves quickly as α increases and peaks at 1.5 and 1.3 for the FNC-1 and augmented Emergent datasets; the performance then decreases slightly as α is increased further. We hypothesize that the optimal α is related to the class balance between the unrelated class and the related ones. The more unbalanced the dataset is towards the unrelated class, the larger the optimal α is. For β, we observe that the performance is the highest when β is set to 0.001. This happens for both the augmented Emergent and FNC-1 test subsets. These optimal values of α and β observed on the test subsets are equal to the ones found when training the model.

Feature Analysis
In this subsection we evaluate and discuss the importance of each feature towards the final prediction. To examine the influence of each feature on the final performance, we follow a leave-one-feature-set-out approach and record the classification accuracy on the stance detection task. The following analysis is only based on the FNC-1 dataset. Similar results are observed on the augmented Emergent dataset.
Table 4: Performance of our model with different feature sets on the FNC-1 dataset. "/" denotes no feature set is removed. Columns: Removed Feature Set; Accuracy (%) for agree, disagree, discuss, unrelated.
In Table 4 we show the results of this analysis. We observe that removing the CosSim feature leads to a large decrease in accuracy for the unrelated class. Similarly, the use of WordLap has a positive effect on the agree class, and it also contributes to the unrelated class. The RefWord and Pol features help for the agree and disagree classes, while removing the NGrams feature leads to an increase in accuracy on the discuss class, i.e., the NGrams feature causes confusion between the discuss class and the other classes.
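The leave-one-feature-set-out procedure can be sketched as a simple loop. The list of feature set names and the `evaluate` callback, which stands for training and scoring the model on a given selection of feature sets, are hypothetical placeholders; the toy callback below only counts the feature sets it receives.

```python
FEATURE_SETS = ["CosSim", "WordLap", "RefWord", "Pol", "NGrams"]

def leave_one_out(feature_sets, evaluate):
    # "/" is the reference run in which no feature set is removed.
    results = {"/": evaluate(list(feature_sets))}
    for removed in feature_sets:
        kept = [name for name in feature_sets if name != removed]
        results[removed] = evaluate(kept)
    return results

# Toy stand-in for a real train-and-evaluate routine.
results = leave_one_out(FEATURE_SETS, evaluate=len)
```

Comparing each entry with the "/" reference run shows how much a given feature set contributes to the accuracy of each class.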

CONCLUSION
In this paper, we studied the problem of stance detection: the classification of the stance of an evidence towards a claim into one of the four classes: agree, disagree, discuss and unrelated.
We proposed a hierarchical representation of the stance classes, where the classes agree, disagree and discuss are combined into a class referred to as the related class. The main idea here is to divide a concept into sub-concepts that are organized in a hierarchical structure, and to design constraints between sub-concepts in order to make the optimization of the model parameters more sensible. The primary advantage of this hierarchical representation is that it helps to overcome the class imbalance problem.
This hierarchical representation has inspired the proposed two-layer neural network to tackle the stance detection task. The first layer performs a related-unrelated classification, while the second layer performs a more fine-grained classification among the related classes. Furthermore, we have empirically demonstrated that (1) it is advantageous to learn these two classification tasks together, and (2) the dependency between these two layers can be learned through an MMD regularization term, which measures the representation discrepancy between the two layers. Experiments on two publicly available datasets have shown that our model is able to outperform the state-of-the-art stance detection methods.
As future work, we plan to enrich the proposed model in two ways. First, by integrating a credibility evaluation of information sources as features. Second, by improving the explainability of the model, showing via attention mechanisms which words or phrases are the most influential in predicting the stance.