Reply-Aided Detection of Misinformation via Bayesian Deep Learning

Social media platforms host a plethora of misinformation, and its potential negative influence on the public is a growing concern. This concern has drawn the attention of the research community to developing mechanisms for detecting misinformation. The task of misinformation detection consists of classifying whether a claim is True or False. Most research concentrates on developing machine learning models, such as neural networks, that output a single value to predict the veracity of a claim. One of the major problems faced by these models is their inability to represent the uncertainty of the prediction, which is due to the incomplete or finite information available about the claim being examined. We address this problem by proposing a Bayesian deep learning model. The Bayesian model outputs a distribution used to represent both the prediction and its uncertainty. In addition to the claim content, we also encode auxiliary information given by people's replies to the claim. First, the model encodes the claim to be verified and generates a prior belief distribution from which we sample a latent variable. Second, the model encodes all the people's replies to the claim in temporal order through a Long Short-Term Memory network in order to summarize their content. This summary is then used to update the prior belief, generating the posterior belief. Moreover, in order to train this model, we develop a Stochastic Gradient Variational Bayes algorithm to approximate the analytically intractable posterior distribution. Experiments conducted on two public datasets demonstrate that our model outperforms the state-of-the-art detection models.


INTRODUCTION
Although digital news consumption has increased in the last decade, the growing amount of misinformation and fake news has cast doubt on its quality. Unlike traditional media, where news is published by reputable organizations, online news on social media platforms such as Facebook and Twitter is shared by individuals and/or organizations without careful checking or with malicious intent. In Figure 1 we show a false claim posted on Twitter about an alleged shooting in Ottawa. While some users showed surprise and asked for further clarification in their replies, other users believed the claim and re-tweeted it as if it were true. Such misinformation, when spread on a large scale, can influence the public by depicting a false picture of reality. Hence, detecting misinformation effectively has become one of the biggest challenges faced by social media platforms [17,27].
A valuable attempt at rectifying this epidemic of false claims has been made by some news websites, such as Snopes, PolitiFact, and Emergent, which employ professional journalists to manually check and verify every potentially false news item. However, such a manual approach is very expensive and far too slow to check all the claims generated daily on the web. Thus, automatic tools are greatly needed to speed up this verification process.
In this paper, we tackle the automatic misinformation detection task, which consists in classifying whether a claim is True or False. Most existing models employ feature engineering or deep learning to extract features from claims' content and auxiliary information such as people's replies. However, these models generate deterministic mappings to capture the difference between true or false claims. A major limitation of these models is their inability to represent uncertainty caused by incomplete or finite available data about the claim being examined. We address this problem by proposing a Bayesian deep learning model, which incorporates stochastic factors to capture complex relationships between the latent distribution and the observed variables. The proposed model makes use of the claim content and replies content. First, to represent the claim content we employ a neural model to extract textual features from claims. To deal with the ambiguity of the language used in claims and obtain salient credibility information, the model generates a latent distribution based on the extracted linguistic features. Since no auxiliary information has been used so far, we interpret this latent distribution as a prior belief of the claim being true. Second, to extract auxiliary information from people's replies content, we rank all the replies of the claim in temporal order, and summarize them using a Long Short Term Memory neural network (LSTM). Finally, after updating the prior belief with the aid of the LSTM output, the model computes the veracity prediction and its uncertainty. This updated prior belief distribution is interpreted as the posterior belief.
In order to train the proposed Bayesian deep learning model, due to the analytical intractability of the posterior distribution, we develop a Stochastic Gradient Variational Bayes (SGVB) algorithm. A tractable Evidence Lower BOund (ELBO) objective function of our model is derived to approximate the intractable distribution. The model is optimized along the direction of maximizing the ELBO objective function.
Our model offers two advantages: first, the model incorporates a latent distribution, which makes it possible to represent uncertainty and promotes robustness; second, the Bayesian model formulates all of its prior knowledge about the claim being examined in the form of a prior, which can be updated as more auxiliary information is added, yielding more accurate detection results. To sum up, the proposed model advances state-of-the-art methods in four aspects: (1) an effective representation of uncertainty due to incomplete/finite available data; (2) a temporal order-based approach to extract auxiliary information from people's replies; (3) an SGVB algorithm to infer latent distributions; (4) a systematic experimentation of our model on two real-world datasets.
The remainder of the paper is organized as follows: § 2 summarizes the related work; § 3 defines the misinformation detection task; § 4 details the proposed Bayesian deep learning model; § 5 derives the Stochastic Gradient Variational Bayes optimization algorithm; § 6 describes the datasets used and the experimental setup; § 7 presents the experimental results; and § 8 concludes the paper.

RELATED WORK
Misinformation has existed for centuries in different forms of media, such as printed newspapers and television. Recently, online social media platforms have been suffering from the same issue. Recent work on misinformation detection has tried to understand the differences between true and false claims in various aspects: claim content, information source, multimedia such as affiliated images and videos, and other users' engagement.

Textual Content
The text of a claim can provide linguistic features to help predict its veracity. Since misinformation and false claims are created for financial or political purposes rather than to report an objective event, they often contain opinionated or inflammatory language [6]. In order to reveal linguistic differences between true and false claims, lexical and syntactic features at the character, word, sentence and document level have been exploited [1,11,33,36]. Wawer et al. [43] compute psycholinguistic features using a bag-of-words paradigm. Rashkin et al. [34] compare the language of true claims with that of satire, hoaxes, and propaganda to find linguistic characteristics of untrustworthy text. Kakol et al. [21] construct a content credibility corpus and examine a list of language factors that might affect web content credibility, based on which a predictive model is developed. Bountouridis et al. [3] compare heterogeneous articles of the same story and reveal that pieces of information that are cross-referenced are more likely to be credible. Derczynski et al. [9] extract features from claim tweets including bag-of-words, presence of URLs, and presence of hashtags. A Support Vector Machine (SVM) is then used to distinguish between true and false claims. Guacho et al. [14] leverage a tensor decomposition to derive concise claim embeddings that capture contextual information from each claim, and use these embeddings to create a claim-by-claim graph on which the labels propagate. Textual content has been empirically shown to be a strong indicator of claim veracity, and thus can be used as a prior probability.

Source Credibility Analysis
The credibility of a claim's sources is an important piece of auxiliary information. As misinformation is usually published by untrustworthy individuals or automatic bots, credibility plays a crucial role in message communication [18,32]. Accurate and timely identification of such accounts inhibits the proliferation of misinformation at an early stage. Tseng and Fogg [40] identify two components of source credibility, namely trustworthiness and expertise. Trustworthiness is generally taken to mean truthful, unbiased and well intentioned. Expertise instead is understood as knowledgeable, experienced and competent. Thus, features that can reveal the trustworthiness and expertise of information sources are strong indicators of source credibility. With the aid of information sources, Thomson et al. [39] examine the credibility of tweets related to the Fukushima Daiichi nuclear disaster in Japan. They found that tweets from highly credible institutions and individuals are mostly correct. Useful account features can be derived from the account demographics, such as the integrity of personal information and the number of followers and followees [5]. In addition, aggregated account features, such as the percentage of verified user accounts [28] and the average number of followers [26], are indicative, since spreaders of true and false claims might come from different communities [44]. However, account demographics can easily be altered to blur the difference between credible and non-credible sources.

Multimedia Features
Multimedia content has been shown to be an important means of manipulation in propaganda based on misinformation [4]. Online misinformation exploits the individual vulnerabilities of people and thus often relies on sensational or even fake images to provoke anger or other emotional responses in consumers. Visual-based features are extracted from images and videos to capture the different characteristics of misinformation. Fake images are identified based on various user-level and tweet-level hand-crafted features [15]. Recently, various visual and statistical features have been extracted for news verification [20]. Visual features include clarity score, coherence score, diversity score, and clustering score. Statistical features include count, image ratio, multi-image ratio, hot image ratio, long image ratio, etc. Yang et al. [45] develop a convolutional neural network to extract text and visual features simultaneously. These approaches suffer from the problem that some misinformation on social media does not contain multimedia content.

Social Engagement
The news spreading process over time on social media involves user-driven engagement. Auxiliary information can also be derived from such engagement to improve claim veracity detection. Ma et al. [29] propose to learn discriminative features by following the non-sequential propagation structure of tweets. A top-down and a bottom-up recursive neural network are proposed to predict claim veracity. Glenski et al. [12] seek to better understand how users react to trusted and deceptive news sources across two popular, and very different, social media platforms. Significant differences have been observed in the speed and the type of reactions between trusted and deceptive news sources on Twitter, but far smaller differences on Reddit. People react to a claim by expressing their stances or emotions in social media posts. Stances can be categorized as supportive, opposing, and neutral, and can be used to infer claim veracity [19,46,47]. Kochkina et al. [25] propose a neural multi-task model that leverages the relationship between veracity detection and stance detection in a joint learning setup. Another common post feature is the topic distribution, which indicates the central point of relevant affairs and is derived by topic models [2]. Post features are expanded in two ways: via aggregation with relevant posts for a specific affair, and via temporal evolution of post features. The first way relies on the "wisdom of crowds" to locate potential misinformation [5], while the second way captures the periodic fluctuations of shock cycles [26] or temporal patterns of user activities, such as the number of engaged users and the time intervals between engagements [37]. Yet, semantic coherence and temporal changes between users' replies are not fully explored by existing methods.

PROBLEM STATEMENT
The task of misinformation detection is to predict the veracity of claims, given their content and people's replies to them.
We use $y_i$ to denote the binary veracity label of the claim $c_i$, which is either $y_i = 1$ for true or $y_i = 0$ for false, and $D_i$ to denote the people's replies to $c_i$. The tuple of a claim and its replies, i.e., $\{c_i, D_i\}$, forms a data instance used to predict the claim veracity $y_i$. For the sake of clarity, in the following we omit the subscript $i$ when describing a single instance: $\{c, D, y\}$.

BAYESIAN DEEP LEARNING
In this section, we present our proposed Bayesian deep learning model that effectively integrates claim and people's replies. We will first introduce how to encode claim content with deep learning and generate a latent distribution that is interpreted as a prior belief of claim veracity. We then describe the temporal-ordered approach to encode people's replies, which captures semantic variation along the time line. Finally, we correct the prior belief with the aid of people's replies, the result of which process is interpreted as the posterior belief of claim veracity. Figure 2 describes the proposed model.

Encoding a Claim
As content is a strong indicator of claim veracity [42], we apply deep learning to extract linguistic features from the claim $c$. To deal with the ambiguity of claims and obtain salient credibility information, we generate a latent distribution based on the extracted linguistic features. The output of this claim encoder is the prior belief of the veracity of the claim.
Let each claim $c$ be a sequence of discrete words or tokens. A Bidirectional LSTM (BiLSTM) takes as input $c$, converts the sequence of word embeddings into a dense representation, and outputs the concatenation of two hidden states capturing past and future information:
$$h_c = [\overrightarrow{h}_c; \overleftarrow{h}_c] = \mathrm{BiLSTM}(c), \tag{1}$$
where $h_c$ denotes the concatenated hidden states.
To deal with the ambiguity of claims, instead of a deterministic non-linear transformation, we generate a latent distribution, from which we sample a latent stochastic variable $z$. To embed linguistic information into the latent variable, we set the latent variable to be conditional on $h_c$:
$$z \sim p_\theta(z \mid h_c), \tag{2}$$
where $p$ is a latent distribution and $\theta$ denotes the non-linear transformation of $h_c$ used to generate the parameters of $p$. This non-linear transformation is essential to capture higher level representations of $h_c$; we implement it via a Multi-Layer Perceptron (MLP). We assume that the latent variable $z$ is continuous and follows a multivariate Gaussian distribution. The variable $z$ is parameterized as follows:
$$p_\theta(z \mid h_c) = \mathcal{N}(\mu_\theta, \mathrm{diag}(\sigma_\theta^2)), \tag{3}$$
where $\mu_\theta$ and $\mathrm{diag}(\sigma_\theta^2)$ are the mean and the covariance matrix of the multivariate Gaussian distribution. Since the variable $z$ is conditional on the claim hidden states $h_c$, we derive these two parameters of the Gaussian distribution from $h_c$ through a deep neural network:
$$\mu_\theta = l_1(f_\theta(h_c)), \tag{4}$$
$$\log \sigma_\theta = l_2(f_\theta(h_c)), \tag{5}$$
where $f_\theta$ denotes an MLP, and $l_1$ and $l_2$ denote two Linear Transformations (LT). Since an LT can generate negative values, to produce $\sigma_\theta$ we exponentiate the result of $l_2$.
In order to make $\mu_\theta$ and $\sigma_\theta$ differentiable and backpropagate the loss through the latent distribution $p$, the following reparameterization trick is used:
$$z = \mu_\theta + \sigma_\theta \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \tag{6}$$
where $0$ is a vector of zeros and $I$ is the identity matrix. By making use of the latent variable $z$, our model is able to capture complex noisy patterns in the data.
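The reparameterization step described above can be illustrated with a minimal, standard-library Python sketch. The function name `sample_latent` and the example parameter values are assumptions made for illustration; in the model, the mean and log standard deviation would come from the MLP and linear transformations applied to the claim hidden states.

```python
import math
import random


def sample_latent(mu, log_sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    if len(mu) != len(log_sigma):
        raise ValueError("mu and log_sigma must have the same length")
    z = []
    for m, ls in zip(mu, log_sigma):
        sigma = math.exp(ls)       # exponentiating keeps sigma positive
        eps = rng.gauss(0.0, 1.0)  # the noise does not depend on the parameters
        z.append(m + sigma * eps)
    return z


rng = random.Random(0)
# Hypothetical 3-dimensional prior parameters for a single claim.
mu = [0.5, -1.0, 0.0]
log_sigma = [-0.5, 0.0, -2.0]
z = sample_latent(mu, log_sigma, rng)
```

Because the randomness enters only through `eps`, the sample is a deterministic function of the mean and log standard deviation, which is what lets gradients flow back to the network that produced them.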

Encoding People's Replies
We now present the people's replies encoder used to obtain auxiliary information. This auxiliary information is claim-specific and is used to generate the posterior belief by correcting the prior belief of the claim veracity. Replies on social media platforms are listed along the time line as shown in Figure 1, where the earliest reply appears at the top of the list. The truth about an event can gradually become manifest as more evidence emerges, thus we assume that the latest replies tend to be more reliable and more important than the earlier replies. Based on this assumption, we design a two-layer recurrent neural network to encode replies: the first layer applies a BiLSTM to summarize the semantic information of each reply and the second layer applies an LSTM to capture the temporal semantic variation of the replies.
Given a claim $c$ commented on by a list of replies $D = \{d_1, \ldots, d_m, \ldots, d_M\}$, these replies are ranked based on their temporal order. The content of a reply $d_m$ consists of a sequence of words $d_m = [w_1, \ldots, w_k]$. To project the claim and replies into the same semantic space, we use the same pre-trained word embeddings for both claims and replies. Hence, $w_k \in \mathbb{R}^d$ is a $d$-dimensional vector, like the word embedding vectors used to encode the claim. For the sake of semantic coherence, we also employ the same BiLSTM to encode both the claim and its replies. Taking the reply $d_m \in D$ as an example, the concatenation of hidden states from the forward and backward directions is denoted as:
$$h_{d_m} = [\overrightarrow{h}_{d_m}; \overleftarrow{h}_{d_m}] = \mathrm{BiLSTM}(d_m),$$
where $h_{d_m}$ is the summary of the reply $d_m$.
In order to capture the semantic information of all replies, we sequentially input the concatenated hidden states of each reply into an LSTM. We use an LSTM rather than a BiLSTM because the former gives high weights to recent input, which matches our assumption on the relative importance of the latest replies. Specifically, the LSTM takes the hidden states of each reply as input in a sequential way:
$$h_D = \mathrm{LSTM}(h_{d_1}, \ldots, h_{d_M}),$$
where $h_D$ denotes the hidden state of the replies.
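As a rough sketch of the second layer, the following standard-library Python code sorts reply summaries by timestamp and folds them through a single LSTM cell, returning the last hidden state. The weight layout, the helper names (`lstm_step`, `encode_replies`, `random_weights`) and the random, untrained weights are assumptions for illustration only; the per-reply vectors stand in for the BiLSTM summaries.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, W):
    """One LSTM step; each row of W[gate] spans [x; h] plus a trailing bias weight."""
    xh = x + h + [1.0]  # the constant 1.0 makes the last column act as a bias

    def gate(name, act):
        return [act(sum(w * v for w, v in zip(row, xh))) for row in W[name]]

    i = gate("input", sigmoid)
    f = gate("forget", sigmoid)
    o = gate("output", sigmoid)
    g = gate("cell", math.tanh)
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def encode_replies(replies, W, hidden_size):
    """Process (timestamp, summary) pairs earliest first; return the final hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for _, summary in sorted(replies, key=lambda r: r[0]):
        h, c = lstm_step(summary, h, c, W)
    return h


def random_weights(input_size, hidden_size, rng):
    """Untrained weights, used here only so that the sketch runs end to end."""
    cols = input_size + hidden_size + 1
    return {name: [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                   for _ in range(hidden_size)]
            for name in ("input", "forget", "output", "cell")}


rng = random.Random(0)
W = random_weights(input_size=2, hidden_size=3, rng=rng)
# Hypothetical (timestamp, reply summary) pairs, deliberately out of order.
replies = [(30, [0.1, 0.9]), (10, [0.7, -0.2]), (20, [-0.4, 0.3])]
h_D = encode_replies(replies, W, hidden_size=3)
```

Since the most recent reply is processed last, its content passes through the fewest forget-gate multiplications, which is the recency effect the paragraph above relies on.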

Veracity Modeling
In § 4.1, we developed a prior belief of the claim veracity. In this section we show how to correct this prior belief by including its replies.
The posterior belief is generated by combining the claim and reply information via an MLP. The strong non-linearity of MLPs makes them suitable for finding complex relationships between the claim and its replies. Specifically, the MLP input is the latent claim variable $z$ concatenated with the hidden state of the replies $h_D$:
$$p_\theta(y \mid z, h_D) = \mathrm{MLP}([z; h_D]).$$
This is the final prediction of our Bayesian deep learning model for misinformation detection.
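A minimal sketch of this output step, in standard-library Python, is given below. It assumes a single hidden layer with ReLU activation and a two-way softmax output; the layer sizes, the parameter dictionary and the function names are illustrative assumptions rather than the configuration used in the experiments.

```python
import math


def relu(v):
    return [max(0.0, x) for x in v]


def affine(W, b, v):
    """Compute W v + b for a weight matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_veracity(z, h_D, params):
    """Return [p(false), p(true)] computed from the concatenation [z; h_D]."""
    hidden = relu(affine(params["W1"], params["b1"], z + h_D))
    return softmax(affine(params["W2"], params["b2"], hidden))


# Hypothetical parameters: 2-dim z, 2-dim h_D, 3 hidden units, 2 classes.
params = {
    "W1": [[0.2, -0.1, 0.4, 0.3], [-0.5, 0.2, 0.1, 0.0], [0.3, 0.3, -0.2, 0.1]],
    "b1": [0.0, 0.1, -0.1],
    "W2": [[0.5, -0.3, 0.2], [-0.4, 0.6, 0.1]],
    "b2": [0.0, 0.0],
}
probs = predict_veracity([0.3, -0.7], [0.1, 0.8], params)
```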

OPTIMIZATION
The stochastic variables of our model are non-linear and non-conjugate [41]. Hence, the posterior distribution cannot be derived analytically. To approximate the posterior distribution, we construct an inference model parameterized by $\phi$ to approximate the intractable true posterior $p_\theta(z \mid h_c)$; then we derive an objective function to measure how well $p_\theta(z \mid h_c)$ is approximated; finally we exploit the Stochastic Gradient Variational Bayes (SGVB) method [24,35] to learn the inference model parameters $\phi$ together with the generative model parameters $\theta$. Figure 3 shows the graphical representation of the generative model and the inference model.

Inference Model
Following the neural variational inference approach [24], we construct an inference model (as in Fig. 3) parameterized by $\phi$ to compute an approximated posterior distribution, called the variational distribution. Given the observed variables, we define a variational distribution $q_\phi(z \mid y, h_c, h_D)$ to approximate the true posterior distribution $p_\theta(z \mid h_c)$. As in the Variational Auto-Encoder (VAE) [24], and similarly to Eq. 3 for $p_\theta(z \mid h_c)$, the variational distribution is chosen to be a multivariate Gaussian distribution:
$$q_\phi(z \mid y, h_c, h_D) = \mathcal{N}(\mu_\phi, \mathrm{diag}(\sigma_\phi^2)),$$
where $\mu_\phi$ and $\mathrm{diag}(\sigma_\phi^2)$ are the mean and the covariance matrix of the multivariate Gaussian distribution. We use a deep neural network to derive these two parameters from the observed variables:
$$\mu_\phi = l_3(f_\phi(y, h_c, h_D)), \qquad \log \sigma_\phi = l_4(f_\phi(y, h_c, h_D)),$$
where $f_\phi$ denotes an MLP, and $l_3$ and $l_4$ denote two LTs. Note that in the inference model we use $y$, $h_c$ and $h_D$ to compute $\mu_\phi$ and $\log \sigma_\phi$, and not only $h_c$ as in the generative model.

Objective Function
In the following we derive the objective function of our Bayesian deep learning model following the variational principle. Since the log-likelihood $\ln p(y \mid h_c, h_D)$ cannot be maximized directly, we derive an Evidence Lower BOund (ELBO) objective function that is tractable to optimize. To simplify the notation of the derivation we make the following substitutions: $p_\theta(y) = p_\theta(y \mid z, h_D)$, $p_\theta(z) = p_\theta(z \mid h_c)$, and $q_\phi(z) = q_\phi(z \mid y, h_c, h_D)$. The objective function is derived as follows:
$$\begin{aligned}
\ln p(y \mid h_c, h_D) &= \ln \int q_\phi(z)\, \frac{p_\theta(y)\, p_\theta(z)}{q_\phi(z)}\, dz \\
&\geq \int q_\phi(z) \ln \frac{p_\theta(y)\, p_\theta(z)}{q_\phi(z)}\, dz \\
&= \mathbb{E}_{q_\phi(z)}\left[\ln p_\theta(y)\right] - D_{\mathrm{KL}}\left(q_\phi(z) \,\|\, p_\theta(z)\right) \\
&= \mathcal{L}(\theta, \phi \mid y, h_c, h_D),
\end{aligned}$$
where $\mathcal{L}(\theta, \phi \mid y, h_c, h_D)$ is the ELBO objective function. The second line of the derivation follows from Jensen's inequality [16]. Since the ELBO objective function is a lower bound of the log-likelihood $\ln p(y \mid h_c, h_D)$, maximizing it raises a guaranteed lower bound on the log-likelihood.
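When both the variational distribution and the prior are diagonal Gaussians, the Kullback-Leibler term of an ELBO of this kind has a closed form, and only the expected log-likelihood needs sampling. The sketch below, in standard-library Python, shows one way to compute it; the function names are assumptions, and the log-likelihood values are taken as given inputs rather than produced by a trained model.

```python
import math


def kl_diag_gaussians(mu_q, log_sigma_q, mu_p, log_sigma_p):
    """KL(q || p) between two Gaussians with diagonal covariance matrices."""
    kl = 0.0
    for mq, lsq, mp, lsp in zip(mu_q, log_sigma_q, mu_p, log_sigma_p):
        var_q = math.exp(2.0 * lsq)
        var_p = math.exp(2.0 * lsp)
        kl += lsp - lsq + (var_q + (mq - mp) ** 2) / (2.0 * var_p) - 0.5
    return kl


def elbo(log_likelihoods, mu_q, log_sigma_q, mu_p, log_sigma_p):
    """Monte Carlo ELBO: mean of ln p(y | z_s, h_D) over samples, minus KL(q || p)."""
    expected_ll = sum(log_likelihoods) / len(log_likelihoods)
    return expected_ll - kl_diag_gaussians(mu_q, log_sigma_q, mu_p, log_sigma_p)
```

For example, with identical distributions the KL term vanishes and the ELBO reduces to the average log-likelihood of the sampled latent variables.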

Gradient Estimation
Large-scale inference needs minibatch optimization. Thus, we derive a minibatch-based SGVB estimator to differentiate and optimize the ELBO objective function $\mathcal{L}(\theta, \phi \mid y, h_c, h_D)$ with respect to both the inference parameters $\phi$ and the generative parameters $\theta$. We compute the expectation part of the ELBO objective function through Monte Carlo estimation. Let the minibatch size be $B$ and, for each claim $c_i$ with $i \in [1, B]$, let $S$ samples be drawn from the variational posterior distribution, $z \sim q_\phi$. Given a subset of claims, we can construct an estimator of the ELBO objective function for the full dataset based on minibatches as follows:
$$\mathcal{L}(\theta, \phi \mid y, h_c, h_D) \simeq \frac{N}{B} \sum_{i=1}^{B} \tilde{\mathcal{L}}(\theta, \phi \mid y_i, h_{c_i}, h_{D_i}),$$
where $\tilde{\mathcal{L}}(\theta, \phi \mid y_i, h_{c_i}, h_{D_i})$ denotes the estimate based on the $i$-th claim and $N$ is the total number of claims. Algorithm 1 shows the minibatch gradient descent optimization process for both the generative ($\theta$) and inference ($\phi$) parameters. Note that the gradient steps in Algorithm 1 can easily be replaced by a more powerful optimizer such as the Adam algorithm [23]. Both $q_\phi(z \mid y, h_c, h_D)$ and $p_\theta(z \mid h_c)$ are modeled as parameterized Gaussian distributions, but they play different roles. The former is an approximation of the latter that only functions during learning. The latter, instead, is the learned distribution from which samples are generated in order to classify claim veracity.
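The rescaling from a minibatch to the full dataset can be sketched in a few lines of standard-library Python. Here `per_claim_elbo` is a hypothetical callable returning the Monte Carlo ELBO estimate of one claim; the sketch illustrates only the scaling by the ratio of dataset size to minibatch size, not the gradient computation.

```python
import random


def minibatch_elbo(per_claim_elbo, num_claims, batch_size, rng):
    """Estimate the full-dataset ELBO as (N / B) times the sum over a random minibatch."""
    if not 0 < batch_size <= num_claims:
        raise ValueError("batch_size must be between 1 and num_claims")
    batch = rng.sample(range(num_claims), batch_size)  # draw B distinct claim indices
    return (num_claims / batch_size) * sum(per_claim_elbo(i) for i in batch)


# Hypothetical per-claim estimates: claim i contributes -(i + 1) / 10.
estimate = minibatch_elbo(lambda i: -(i + 1) / 10.0, num_claims=100,
                          batch_size=32, rng=random.Random(0))
```

When the minibatch covers the whole dataset, the scaling factor is one and the estimate equals the exact sum of the per-claim values.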

Prediction
After training, we compute the posterior distribution $p_\theta(z \mid h_c)$ through the generative network. The actual prediction of a claim's veracity is given by taking the expectation over $S$ samples:
$$p(y \mid h_c, h_D) \approx \frac{1}{S} \sum_{s=1}^{S} p_\theta(y \mid z_s, h_D),$$
where $z_s$ denotes the samples drawn from the posterior distribution $p_\theta(z \mid h_c)$.
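The averaging over samples can be sketched as follows in standard-library Python. The callable `classify`, which maps a sampled latent variable to the probability that the claim is true, is a hypothetical stand-in for the trained output MLP with the reply summary fixed. Reporting the standard deviation of the sampled probabilities as an uncertainty measure is an assumption of this sketch, not something stated in the text.

```python
import math
import random
import statistics


def predict_with_uncertainty(mu, log_sigma, classify, num_samples, rng):
    """Average classify(z) over samples z ~ N(mu, diag(sigma^2)); also return the spread."""
    probs = []
    for _ in range(num_samples):
        z = [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_sigma)]
        probs.append(classify(z))
    return statistics.fmean(probs), statistics.pstdev(probs)


def toy_classifier(z):
    """Hypothetical classifier: a logistic function of the first latent dimension."""
    return 1.0 / (1.0 + math.exp(-z[0]))


mean_prob, spread = predict_with_uncertainty(
    mu=[0.8, 0.0], log_sigma=[-1.0, -1.0],
    classify=toy_classifier, num_samples=20, rng=random.Random(0))
```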

EXPERIMENT SETUP
In this section we start by introducing six research questions. We then present the methodology used to answer them. The software used to run the experiments in this paper is available on the website of the first author.

Research Questions
We seek to answer the following six research questions, which will guide the remainder of the paper:
RQ1 Does our model outperform the state-of-the-art misinformation detection baselines?
RQ2 Does the incorporation of the latent distribution outperform a deterministic counterpart?
RQ3 Does the auxiliary information from people's replies produce a more accurate posterior belief of claim veracity?
RQ4 Is the temporal order better than a random order when encoding replies?
RQ5 Is it beneficial to incorporate a latent variable to encode replies?
RQ6 How does the dimension of the latent variable $z$ affect the model's performance?

Datasets
In order to compare the performance of our proposed model against the baselines, we experimented with two real-world benchmark datasets, the RumourEval [9] and the Pheme [48] datasets. Both datasets contain Twitter conversation threads about news (like the example shown in Figure 1). A conversation thread consists of a tweet making a true or false claim, and branches of people's replies expressing their opinion about it. A summary of the datasets statistics is available in Table 1.
The RumourEval dataset was developed for the SemEval-2017 Task 8 competition. This dataset consists of 325 source tweets. The Pheme dataset was constructed to help understand how users treat online rumours before and after the news is determined to be true or false. As with the RumourEval dataset, we divide the Pheme dataset into a training subset, a validation subset and a testing subset. Specifically, 70% of the claims are randomly selected as training instances, 10% as validation instances and the rest as testing instances. Users' replies are divided according to the claims.
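A random 70/10/20 split at the claim level, as described above, could be reproduced along the following lines. The function name, the fixed seed and the use of integer claim identifiers are assumptions made for this sketch; replies would then follow the subset of the claim they belong to.

```python
import random


def split_claims(claim_ids, rng, train_frac=0.7, val_frac=0.1):
    """Shuffle claim ids and cut them into training, validation and testing subsets."""
    ids = list(claim_ids)
    rng.shuffle(ids)
    n_train = int(round(train_frac * len(ids)))
    n_val = int(round(val_frac * len(ids)))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # the remaining claims form the testing subset
    return train, val, test


train_ids, val_ids, test_ids = split_claims(range(100), random.Random(42))
```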

Evaluation Measures
The misinformation detection task is a binary classification task. Such tasks are commonly evaluated by the following evaluation measures: Accuracy, F 1 , Precision, and Recall.
Accuracy is a common evaluation measure for classification tasks. However, it is less reliable when datasets suffer from class imbalance. The evaluation measures Precision, Recall and F 1 complement Accuracy because they are less affected by this problem.
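For reference, the four measures can be computed from the confusion counts as in the short Python sketch below. It treats the label 1 (true claim) as the positive class, which is an assumption of the sketch, and returns 0.0 when a denominator is zero.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels and predictions for five claims.
scores = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```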

Hyperparameters Setting
The activation function of the three LSTMs is tanh. The activation function of the MLPs is ReLU.
The hyperparameters tuned on the validation subset are:
• the dimension of the hidden layer of all three LSTMs is 30;
• the dimension of the latent variables is 10;
• the minibatch size is 32;
• the number of samples used in Monte Carlo estimates is 20.
State-of-the-art techniques have been employed to optimize the objective function: Dropout [38] is applied to improve neural networks training, L2-norm regularization is imposed on the weights of the neural networks, Adam optimizer [23] is exploited for fast convergence, and stepwise exponential learning rate decay is adopted to anneal the variations of convergence.
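One common reading of stepwise exponential learning rate decay is a staircase schedule in which the rate is multiplied by a constant factor every fixed number of updates. The sketch below follows that reading; the function name and all numeric values are hypothetical, since the text does not report the initial rate, the decay factor or the step interval.

```python
def stepwise_exponential_lr(initial_lr, decay_rate, decay_steps, step):
    """Learning rate after `step` updates, multiplied by `decay_rate` every `decay_steps`."""
    if decay_steps <= 0:
        raise ValueError("decay_steps must be positive")
    return initial_lr * decay_rate ** (step // decay_steps)


# Hypothetical schedule: start at 0.01 and halve the rate every 100 updates.
schedule = [stepwise_exponential_lr(0.01, 0.5, 100, s) for s in (0, 99, 100, 250)]
```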

Baselines
We test our Bayesian deep learning model against six state-of-the-art models. In order to have a fair comparison, only those models using the claim content and users' replies have been selected.
Support Vector Machine (SVM). This model evaluates the performance of manually extracted features. The extracted features include: bag-of-words representation, presence of URLs, presence of hashtags, and proportion of supporting and denying responses [9]. These features are then input to a linear Support Vector Machine classifier. This classifier achieves the highest misinformation detection performance in the SemEval-2017 Task 8.
Convolutional Neural Networks (CNN). This model evaluates the performance of CNNs on the veracity detection task. Besides sequential approaches such as the BiLSTM, the convolutional model is another powerful neural architecture for natural language understanding [7,8,10,22,45]. The CNN takes as input pre-trained word embeddings generated with Word2Vec [30] trained on the Google News dataset. To capture features similar to n-grams, we apply different convolutional window sizes. A max pooling layer is applied to compress the output information of the convolutional layers [7].
Tensor Embeddings (TE). This model leverages tensor decomposition to derive concise claim embeddings, which are used to create a claim-by-claim graph on which labels propagate [14].
Evidence-Aware Deep Learning (DeClarE). This model retrieves evidence from replies using claims as queries [31]. Then both claims and retrieved replies are input into a deep neural network with an attention mechanism. Claim veracity is then computed by aggregating over the predictions generated by every claim-retrieved reply pair.
Multitask Learning (Multitask). This model leverages the relationship between two tasks of the veracity detection pipeline [25], namely stance detection and veracity prediction. The model is trained on both jointly. We apply the hard parameter sharing mechanism, where different tasks share the same hidden LSTM layers. Task-specific layers take the shared hidden information and generate per-task predictions.
Tree-structured RNN (TRNN). This model learns discriminative features from reply content by following its non-sequential propagation structure. Of the two proposed structures, we select the top-down structure for tweet representation learning because it is marginally better than the bottom-up structure [29].

RESULTS AND DISCUSSION
This section answers the research questions proposed in § 6.

Ablation of the Latent Distribution (RQ2)
In this subsection, we evaluate the impact of using a latent distribution in the claim encoder on the misinformation detection task. To evaluate the impact of the latent distribution $p$, we ablate $p$ in our model and compare its classification performance against the full model. Specifically, the ablation is done by taking the output of the BiLSTM hidden states, i.e., $h_c$, and giving it directly as input to the output MLP. The rest of the model remains unchanged. Since no latent distribution is involved, the ablated model is optimized in accordance with the conventional Softmax loss minimization.
In Figure 4(a) and 4(b) we show the classification performance of the ablated model against the full model on the RumourEval and Pheme test subsets. We observe that the full model outperforms the ablated one by at least 7.77% on every evaluation measure. This demonstrates the better representation quality achieved by the use of the latent distribution.

Ablation of People's Replies (RQ3)
We now evaluate the contribution of people's replies to the misinformation detection task. In order to examine this contribution we compare our full model with and without replies. Specifically, we ablate the input coming from the replies to the final MLP, which is now used only to perform a non-linear transformation of the latent variable $z$.
In Figure 5(a) and 5(b) we show the classification performance of the ablated model against the full model on the RumourEval and the Pheme test subsets. Here, we observe that the auxiliary information extracted from people's replies has a large impact on the final performance of our model. In fact, every evaluation measure is increased by at least 10.11%.

Random vs. Temporal Ordered Replies (RQ4)
The proposed model ranks people's replies based on their temporal order. In this subsection, we analyze the contribution of ranking the replies according to their temporal order. We compare this against a random order. Specifically, we shuffle the order of the reply representations $h_{d_1}, \ldots, h_{d_M}$ before they are input to the LSTM.
In Figure 6(a) and 6(b) we show the performance comparison of these two orders. We observe that the temporally ordered replies achieve better performance than the randomly ordered ones. Moreover, the randomly ordered model is still worse than TRNN yet better than Multitask. This is probably because TRNN incorporates the temporal structure of replies into the model, while Multitask fails to involve temporal information.

A Latent Distribution for Replies (RQ5)
Considering the improved performance brought by the latent distribution for claims, in this subsection we answer whether it would be beneficial to incorporate a latent distribution also for replies.
In order to answer this research question, we expand our model by adding a new latent distribution in the reply encoder. Similarly to what was done for the claim encoder, the new latent distribution is designed as a multidimensional Gaussian distribution with mean and covariance matrix derived from the LSTM output $h_D$ (as in Eq. 3, 4 and 5). A new latent variable is sampled similarly to Eq. 6 and input to the MLP to predict the veracity of the claim being examined.
In Figure 7(a) and 7(b) we show the model performance comparison. We observe that the new latent distribution does not have an effect on the performance of the model for any of the evaluation measures and dataset test subsets. Based on this analysis, we conclude that the incorporation of the additional latent distribution for replies does not provide any additional improvement in performance.

Sensitivity Analysis (RQ6)
In this subsection we evaluate the effect of the dimension of the latent variable $z$. To do this, after setting a dimension for $z$, we optimize the rest of the hyperparameters on the validation subset. In Figure 8(a) and 8(b) we show the effect of the dimension of $z$ on performance for both datasets. We observe that the results are similar for both evaluation measures, accuracy and F 1 . Varying the dimension from 1 to 5 brings a larger performance improvement than varying it from 5 to 25. When the dimension is 15 the model obtains the highest accuracy, 81.22%, on the RumourEval test subset, while when the dimension is 10 the model obtains the highest F 1 , 78.78%, on the RumourEval test subset and the highest accuracy, 80.33%, and F 1 , 78.78%, on the Pheme test subset. These results also show that an increase in model capacity does not necessarily lead to an improvement in performance. A possible reason is the limited size of the datasets, which might cause overfitting when the model is too complex.

CONCLUSIONS
In this paper, we study the problem of misinformation detection on social media platforms. One major problem faced by existing machine learning methods is the inability to represent uncertainty due to incomplete or finite available information. We address the problem by proposing a Bayesian deep learning model. When encoding claim content, we incorporate a latent distribution accounting for uncertainty and randomness caused by noisy patterns in the finite dataset. This latent distribution provides a prior belief of claim veracity. We also encode auxiliary information from people's replies in temporal order through an LSTM. Such auxiliary information is then used to update the prior belief, generating a posterior belief. In order to optimize the Bayesian model, we derive a minibatch-based gradient estimation algorithm. Systematic experimentation has demonstrated the superiority of our approach over the state-of-the-art approaches in the misinformation detection task.
Despite encouraging experimental results, online misinformation detection is still a challenging problem with many open questions. In this paper, auxiliary information comes from people's replies alone; we argue that the proposed model can be enriched by utilizing other auxiliary information, such as source credibility. Also, reply stances are a strong veracity indicator for a claim, since false claims are usually controversial and accompanied by opposing stances. We leave the combination of features extracted from credibility analysis and reply stances for future work.