Slot Self-Attentive Dialogue State Tracking

An indispensable component in task-oriented dialogue systems is the dialogue state tracker, which keeps track of users' intentions in the course of conversation. The typical approach towards this goal is to fill in multiple pre-defined slots that are essential to complete the task. Although various dialogue state tracking methods have been proposed in recent years, most of them predict the value of each slot separately and fail to consider the correlations among slots. In this paper, we propose a slot self-attention mechanism that can learn the slot correlations automatically. Specifically, a slot-token attention is first utilized to obtain slot-specific features from the dialogue context. Then a stacked slot self-attention is applied on these features to learn the correlations among slots. We conduct comprehensive experiments on two multi-domain task-oriented dialogue datasets, including MultiWOZ 2.0 and MultiWOZ 2.1. The experimental results demonstrate that our approach achieves state-of-the-art performance on both datasets, verifying the necessity and effectiveness of taking slot correlations into consideration.

Table 1: An example dialogue with two domains. The value of slot "taxi-arriveby" should be inferred from the value of slot "restaurant-book time". The value of slot "taxi-destination" is the same as that of slot "restaurant-name".

arXiv:2101.09374v1 [cs.CL] 22 Jan 2021

INTRODUCTION

A task-oriented dialogue system consists of four key components, i.e., natural language understanding (NLU), dialogue state tracking (DST), dialogue policy learning (DPL) and natural language generation (NLG) [5,12]. Among them, DST aims at keeping track of users' intentions at each turn of the dialogue. Since DPL and NLG depend on the results of DST to select the next system action and generate the next system response, an accurate prediction of the dialogue state is crucial to the overall performance of the dialogue system [24,27]. A typical dialogue state comprises a set of predefined slots and their corresponding values [33] (refer to Table 1 for an example). Therefore, the goal of DST is to predict the values of all slots at each turn based on the dialogue context. DST has by far attracted much attention from both industry and academia, and numerous DST approaches have been proposed [9,18,19,25,43,51]. Although state-of-the-art DST methods have achieved good performance, most of them predict the value of each slot separately, failing to consider the correlations among slots [6,22]. This can be problematic, as slots in a practical dialogue are unlikely to be entirely independent. Typically, some slots are highly correlated with each other, as demonstrated by coreference and value sharing. Take the dialogue shown in Table 1 as an example. The value of slot "taxi-arriveby" is indicated by slot "restaurant-book time", so the two slots share the same value. Likewise, the value of slot "taxi-destination" should be taken from slot "restaurant-name". Furthermore, slot values can have a high co-occurrence probability.
For example, the name of a restaurant should be highly relevant to the food type it serves.
In the literature, we notice that several DST approaches [6,22,60] have tried to model the correlations among slots to a certain degree. However, these methods rely on huge human efforts and prior knowledge to determine whether two slots are related or not. As a consequence, they are severely deficient in scalability. Besides, they all leverage only the semantics of slot names to measure the relevance among slots and ignore the co-occurrences of slot values. Utilizing only the slot names is insufficient to capture the slot correlations completely and precisely. On one hand, the correlations among some slots may be overestimated, as slot values in a particular dialogue depend highly on the dialogue context. On the other hand, the correlations among some slots may be underestimated because their names have no apparent connections, even though their values have a high co-occurrence probability.
In this paper, we propose a new DST approach, named Slot self-aTtentive dialogue stAte tRacking (STAR), which takes both slot names and their corresponding values into account to model the slot correlations more precisely. Specifically, STAR first employs a slot-token attention module to extract slot-specific information for each slot from the dialogue context. It then utilizes a stacked slot self-attention module to learn the correlations among slots in a fully data-driven way. Hence, it does not ask for any human efforts or prior knowledge. The slot self-attention module also provides mutual guidance among slots and enhances the model's ability to deduce appropriate slot values from related slots. We conduct extensive experiments on both MultiWOZ 2.0 [3] and MultiWOZ 2.1 [11] and show that STAR achieves better performance than existing methods that have taken slot correlations into consideration. STAR also outperforms other state-of-the-art DST methods.

RELATED WORK
DST is crucial to the success of a task-oriented dialogue system. Traditional statistical DST approaches rely on either the semantics extracted by the NLU module [45,49,52,53] or some hand-crafted features and complex domain-specific lexicons [20,32,38,50,61] to predict the dialogue state. These methods usually suffer from poor scalability and sub-optimal performance. They are also vulnerable to lexical and morphological variations [27,43].
Owing to the rise of deep learning, a neural DST model called neural belief tracking (NBT) has been proposed [33]. NBT employs convolutional filters over word embeddings in lieu of hand-crafted features to predict slot values. The performance of NBT is much better than previous DST methods. Inspired by this seminal work, a lot of neural DST approaches based on long short-term memory (LSTM) networks [34,40-42,59] and bidirectional gated recurrent unit (BiGRU) networks [22,31,35,39,55,57] have been proposed for further improvements. These methods define DST as either a classification problem or a generation problem. Motivated by the advances in reading comprehension [4], DST has been further formulated as a machine reading comprehension problem [13,14,30,31]. Other techniques such as pointer networks [56] and reinforcement learning [7,8,23] have also been applied to DST.
Recently, the pre-training of language models has gained much attention from both industry and academia, and a great variety of pre-trained language models such as BERT [10] and GPT-2 [36] have been released. Since these models are pre-trained on large corpora, they demonstrate a strong ability to produce good results when transferred to downstream tasks. In view of this, DST research has shifted to building new models on top of these powerful pre-trained language models [15,21,25,27,29,43,48,58]. For example, SUMBT [27] employs BERT to learn the relationships between slots and dialogue utterances through a slot-word attention mechanism. CHAN [43] is built upon SUMBT by taking into account both slot-word attention and slot-turn attention. To better model dialogue behaviors during pre-training, TOD-BERT [54] further pre-trains the original BERT model on several task-oriented dialogue datasets. SOM-DST [25] considers the dialogue state as an explicit fixed-sized memory and selectively overwrites this memory to avoid predicting the dialogue state from scratch at each turn. TripPy [18] uses three copy mechanisms to extract slot values. MinTL [29] exploits T5 [37] and BART [28] as the dialogue utterance encoder and jointly learns dialogue states and system responses. It also introduces Levenshtein belief spans to track dialogue states efficiently. NP-DST [16] and SimpleTOD [21] adopt GPT-2 as the dialogue context encoder and formulate DST as a language generation task.
All the methods mentioned above predict the value of each slot separately and ignore the correlations among slots. We notice that several approaches [6,22,60] have tried to model the relevance among slots to a certain degree. Specifically, CSFN-DST [60] and SST [6] construct a schema graph to capture the dependencies of different slots. However, the manually constructed schema graph is unlikely to reflect the correlations among slots completely. Besides, lots of prior knowledge is involved during the construction process. Therefore, CSFN-DST and SST are not scalable. SAS [22] calculates a slot similarity matrix to facilitate information flow among similar slots. The similarity matrix is computed based on either the cosine similarity or the K-means clustering results of slot names. However, when computing the similarity matrix, SAS involves several hyperparameters, which are hard to set. SAS also fixes the similarity coefficient at 1 if two slots are considered to be relevant. This is obviously impractical. Except for the model-specific drawbacks of CSFN-DST, SST and SAS, they also share a common limitation: they all measure the slot correlations using only the slot names. This may overlook or overrate the dependencies of some slots. Our method utilizes both slot names and their corresponding values to model slot correlations more precisely.

PRELIMINARIES
In this section, we first provide the formal definition of DST and then conduct a simple data analysis to show the high correlations among slots in practical dialogues.

Problem Statement
The goal of DST is to extract a set of slot-value pairs from the system response and user utterance at each turn of the conversation. The combination of these slot-value pairs forms a dialogue state, which keeps track of the complete intentions or requirements that the user has conveyed to the system.
Based on the dialogue context X and the ontology O, the task of DST is defined as learning a dialogue state tracker F : X × O → B that can efficiently capture the user's intentions in the course of conversation. According to this definition, we can see that DST is a relatively challenging problem, as the tracker must predict the values of multiple slots at each turn. Besides, the value spaces of some slots may be large; that is, there may be many candidate values for some slots. This makes the prediction of dialogue states even more challenging.
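As a minimal illustration of this formulation, the dialogue state B can be viewed as a mapping from domain-slot names to values that is re-predicted at every turn. The following sketch uses made-up slot names and values rather than the actual MultiWOZ ontology:

```python
# A minimal sketch of the DST formalism: the dialogue state B is a mapping
# from domain-slot names to values, updated at every turn. The ontology
# entries below are illustrative, not taken from the MultiWOZ ontology.

ONTOLOGY = {
    "restaurant-pricerange": ["cheap", "moderate", "expensive", "none"],
    "restaurant-book time": ["18:15", "19:00", "none"],
    "taxi-arriveby": ["18:15", "19:00", "none"],
}

def init_state(ontology):
    """All slots start as 'none' (not yet mentioned)."""
    return {slot: "none" for slot in ontology}

def update_state(state, turn_predictions):
    """A tracker F maps the dialogue context to a value for every slot;
    here we simply overwrite the state with the per-turn predictions."""
    new_state = dict(state)
    new_state.update(turn_predictions)
    return new_state

state = init_state(ONTOLOGY)
state = update_state(state, {"restaurant-book time": "18:15"})
# A correlated slot can later take the same value (value sharing):
state = update_state(state, {"taxi-arriveby": state["restaurant-book time"]})
```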
It is worth mentioning that in this paper we use the term "slot" to refer to the concatenation of the domain name and the slot name so as to include both domain and slot information. For example, we use "restaurant-pricerange" rather than "pricerange" to represent the "pricerange" slot in the "restaurant" domain. This format is useful, especially when a conversation involves multiple domains. It has also been widely adopted by previous works [22,25,27,43].

Data Analysis
To intuitively verify the strong correlations among slots in practical dialogues, we conduct a simple data analysis on MultiWOZ 2.1 [11], which is a multi-domain task-oriented dialogue dataset. Specifically, for every slot pair, we treat each slot as inducing a partition of the dataset, in which the corresponding slot values are regarded as the cluster labels. Then we calculate the normalized mutual information (NMI) score between the two partitions. Note that we adopt NMI as the measurement of slot correlations, as mutual information can describe more general dependency relationships beyond linear dependence. We illustrate the top-5 most relevant slots of slot "restaurant-area" and slot "taxi-destination" in Figure 1. Other slots show similar patterns. From Figure 1, we observe that slot "restaurant-area" and slot "taxi-destination" are indeed highly correlated with some other slots. The relevant slots are not only within the same domain but also across different domains. For example, slot "taxi-destination" correlates highly with slot "restaurant-food", even though their names have no apparent connections. This observation consolidates our motivation that it is necessary to take into account both slot names and their values.
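The NMI computation described above can be sketched from scratch. The geometric-mean normalization used below is an assumption (several normalizations of mutual information are in common use), and the slot-value sequences are toy data, not drawn from MultiWOZ:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(xs, ys):
    """Normalized mutual information between two label sequences, using
    the geometric-mean normalization (one of several common choices)."""
    assert len(xs) == len(ys)
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
             for (x, y), c in pxy.items())
    hx, hy = entropy(xs), entropy(ys)
    if hx == 0 or hy == 0:
        return 0.0
    return mi / math.sqrt(hx * hy)

# Toy illustration: perfectly co-varying slot values give NMI = 1,
# while unrelated values give a lower score.
time_vals = ["18:15", "19:00", "18:15", "19:00"]
arrive_vals = ["18:15", "19:00", "18:15", "19:00"]
area_vals = ["north", "north", "south", "south"]
print(round(nmi(time_vals, arrive_vals), 3))  # 1.0
```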

STAR: SLOT SELF-ATTENTIVE DST
In this section, we describe our proposed slot self-attentive DST model STAR in detail. The overall architecture of STAR is illustrated in Figure 2, which is composed of a BERT-based context encoder module, a slot-token attention module, a stacked slot self-attention module and a slot value matching module.

Context Encoder
Recently, many pre-trained language models such as BERT [10] and GPT-2 [36] have shown strong abilities to produce good results when transferred to downstream tasks. In view of this, we employ BERT as the context encoder to obtain semantic vector representations of dialogue contexts, slots and values. BERT is a deep bidirectional language representation learning model rooted in Transformer encoders [46]. It can generate token-specific vector representations for each token in the input sentence as well as an aggregated vector representation of the whole sentence. Therefore, we exploit BERT to generate token-specific vector representations for dialogue contexts and aggregated vector representations for both slots and values.

Dialogue Context Encoder.
The dialogue utterances at turn t are represented as D_t = R_t ⊕ U_t, where R_t is the system response, U_t is the user utterance, and ⊕ is the operation of sequence concatenation. The dialogue history of turn t is denoted as M_t = D_1 ⊕ D_2 ⊕ · · · ⊕ D_{t−1}. Then, the entire dialogue context of turn t is defined as:

X_t = [CLS] ⊕ M_t ⊕ [SEP] ⊕ D_t ⊕ [SEP],     (1)

where [CLS] and [SEP] are two special tokens introduced by BERT. The [CLS] token is leveraged to aggregate all token-specific representations and the [SEP] token is utilized to mark the end of a sentence. Since the maximum input length of BERT is restricted to 512 [10], we must truncate X_t if it is too long. The straightforward way is to cut off the early dialogue history and reserve the most recent one in X_t. However, this operation may throw away some key information. To reduce information loss, we use the previous dialogue state as input as well, which is expected to keep all the slot-related history information. The dialogue state of the previous turn is represented by B_{t−1}. Note that in B_{t−1} we only include the slots that have been mentioned before (i.e., only non-none slots are considered). By treating B_{t−1} as part of the dialogue history, the entire dialogue context of turn t is finally denoted as:

X_t = [CLS] ⊕ B_{t−1} ⊕ M_t ⊕ [SEP] ⊕ D_t ⊕ [SEP].     (2)

Let |X_t| be the number of tokens in X_t. Our first goal is to generate a contextual d-dimensional vector representation for each token in X_t. Let h_i ∈ R^d denote the vector representation of the i-th token and H_t = [h_1, h_2, . . . , h_{|X_t|}] ∈ R^{d×|X_t|} the matrix form of all tokens' representations. We simply feed X_t to BERT to obtain H_t:

H_t = BERT(X_t).     (3)

Note that BERT in Eq. (3) will be fine-tuned during training.
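The input assembly described above can be sketched as follows. The exact ordering of the previous dialogue state, history, and current turn, the whitespace tokenization, and the truncation policy (dropping the oldest history tokens first while keeping the state and the current turn intact) are simplifying assumptions:

```python
# A rough sketch of how the BERT input for turn t might be assembled.
# Special tokens, whitespace "tokenization", the position of the previous
# state, and the truncation policy are all illustrative assumptions.

def build_input(history_turns, current_turn, prev_state, max_len=512):
    # Flatten the previous dialogue state, keeping non-none slots only.
    state_tokens = " ; ".join(f"{s} = {v}" for s, v in prev_state.items()
                              if v != "none").split()
    hist_tokens = " ".join(history_turns).split()
    cur_tokens = current_turn.split()
    # [CLS] + state + (truncated) history + [SEP] + current turn + [SEP]
    fixed = 1 + len(state_tokens) + 1 + len(cur_tokens) + 1
    budget = max_len - fixed
    if budget < len(hist_tokens):
        # Drop the oldest history tokens first.
        hist_tokens = hist_tokens[len(hist_tokens) - max(budget, 0):]
    return (["[CLS]"] + state_tokens + hist_tokens
            + ["[SEP]"] + cur_tokens + ["[SEP]"])

tokens = build_input(
    history_turns=["i want a cheap restaurant"],
    current_turn="book a taxi to the restaurant",
    prev_state={"restaurant-name": "nandos", "hotel-area": "none"},
)
```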

Slot and Value Encoder. Following previous works [27,43], we use another BERT to encode slots and their candidate values. Unlike dialogue contexts, we need to generate aggregated vector representations for slots and values. To achieve this goal, we use the vector representation corresponding to the special token [CLS] to represent the aggregated representation of the whole input sequence. Thus, for any slot s_j ∈ S (1 ≤ j ≤ J, with J the total number of slots) and any value v_j ∈ V_j, we have:

h^{s_j} = BERT_fixed([CLS] ⊕ s_j ⊕ [SEP]),
h^{v_j} = BERT_fixed([CLS] ⊕ v_j ⊕ [SEP]),     (4)

where BERT_fixed means that the pre-trained BERT without fine-tuning is adopted. Fixing the weights of BERT when encoding slots and values is beneficial. Firstly, the slot and value representations can be computed off-line, which reduces the model size of our approach. Secondly, since our model relies on the value representations to score each candidate value of a given slot, fixing the representations of values can reduce the difficulty of choosing the best candidate value.
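Because the slot and value encoder is frozen, its outputs can be precomputed once and cached. The sketch below illustrates only this caching pattern; the hash-seeded `frozen_encode` stub is a stand-in for the real fixed BERT encoder and produces no meaningful semantics:

```python
import hashlib
import random

DIM = 8  # toy stand-in for BERT's 768-dimensional output

def frozen_encode(text, dim=DIM):
    """Deterministic stand-in for the frozen [CLS] vector of
    '[CLS] text [SEP]'. A hash-seeded random vector carries no semantics;
    it only illustrates that frozen encodings are reproducible and can
    therefore be computed once, off-line."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

# Precompute all candidate-value vectors once; at prediction time the
# model only compares against this cache.
value_space = {"restaurant-pricerange": ["cheap", "moderate", "expensive"]}
value_cache = {slot: {v: frozen_encode(v) for v in values}
               for slot, values in value_space.items()}
```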

Slot-Token Attention
Since there are multiple slots to be predicted at each turn from the same dialogue context X_t, it is necessary to extract slot-specific information for each slot s_j (1 ≤ j ≤ J). Our model employs a multi-head attention mechanism [46] to retrieve the relevant information corresponding to each slot s_j.

Multi-Head Slot-Token Attention.
Our model adopts the multi-head attention mechanism to calculate a d-dimensional vector for each slot s_j as the slot-specific information. More concretely, the slot representation h^{s_j} is treated as the query vector, and the dialogue context representation H_t is regarded as both the key matrix and the value matrix. Consequently, the token-level relevance between slot s_j and dialogue context X_t is summarized as:

c_{t,j} = MultiHead(h^{s_j}, H_t, H_t),     (5)

where c_{t,j} ∈ R^d. Considering that c_{t,j} only contains the value information of slot s_j, we concatenate c_{t,j} and h^{s_j} to retain its name information. This merged vector is further transformed by a feed-forward neural network as below:

r_{t,j} = W_2 · ReLU(W_1 · [c_{t,j} ; h^{s_j}] + b_1) + b_2,     (6)

where W_1 ∈ R^{d×2d}, W_2 ∈ R^{d×d} and b_1, b_2, r_{t,j} ∈ R^d.
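A single-head simplification of the slot-token attention and the subsequent feed-forward fusion might look as follows (the model uses 4 heads; one head keeps the sketch short, and a toy hidden size d = 8 replaces BERT's 768):

```python
import numpy as np

d = 8  # toy hidden size (768 in the actual model)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_token_attention(h_slot, H, W1, b1, W2, b2):
    """Single-head sketch: the slot vector h_slot queries the token
    matrix H (d x |X|); the attended context is concatenated with the
    slot vector and passed through a two-layer ReLU feed-forward net."""
    scores = (h_slot @ H) / np.sqrt(d)          # (|X|,) token relevance
    weights = softmax(scores)                   # attention distribution
    c = H @ weights                             # attended context, (d,)
    merged = np.concatenate([c, h_slot])        # (2d,) value + name info
    return W2 @ np.maximum(W1 @ merged + b1, 0.0) + b2  # (d,)

rng = np.random.default_rng(0)
H = rng.normal(size=(d, 10))                    # 10 context tokens
h_slot = rng.normal(size=d)
W1, b1 = rng.normal(size=(d, 2 * d)), np.zeros(d)  # W1 in R^{d x 2d}
W2, b2 = rng.normal(size=(d, d)), np.zeros(d)      # W2 in R^{d x d}
r = slot_token_attention(h_slot, H, W1, b1, W2, b2)
```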

Slot Self-Attention
Although the slot-token attention is expected to retrieve slot-specific information for all slots, it may fail to capture the valid information of some slots due to the various forms of expression in natural conversations (e.g., coreference, synonymy and rephrasing). In addition, the slot-specific vector of each slot is computed separately, so the correlations among slots are ignored. As a result, once the vector r_{t,j} fails to capture the relevant information of slot s_j, the model has no chance to deduce the right value for that slot. To alleviate this problem, we propose exploiting the slot self-attention mechanism to rectify each slot-specific vector based on the vectors corresponding to all slots. This mechanism is rational because of the high correlations among slots. Therefore, our model is expected to provide mutual guidance among slots and learn the slot correlations automatically.
The slot self-attention is also a multi-head attention. Specifically, this module is composed of L identical layers, and each layer has two sub-layers. The first sub-layer is the slot self-attention layer. The second sub-layer is a feed-forward network (FFN) with two fully connected layers and a ReLU activation in between. Each sub-layer precedes its main functionality with layer normalization [1] and follows it with a residual connection [17].
Let R_t = [r_{t,1}, r_{t,2}, . . . , r_{t,J}] ∈ R^{J×d} denote the matrix representation of all slot-specific vectors and let F^1_t = R_t. Then, for the l-th slot self-attention sub-layer (1 ≤ l ≤ L), we have:

F̃^l_t = LN(F^l_t),
Ĝ^l_t = F^l_t + MultiHead(F̃^l_t, F̃^l_t, F̃^l_t),     (7)

where LN(·) denotes layer normalization. In the slot self-attention sub-layer, F̃^l_t serves as the key matrix, the value matrix, and also the query matrix. For the l-th feed-forward sub-layer, we have:

G̃^l_t = LN(Ĝ^l_t),
F^{l+1}_t = Ĝ^l_t + FFN(G̃^l_t),     (8)

where the function FFN(·) is parameterized by W_1, W_2 ∈ R^{d×d} and b_1, b_2 ∈ R^d, i.e., FFN(x) = W_2 · ReLU(W_1 · x + b_1) + b_2. The final slot-specific vectors are contained in the output of the last layer, i.e., F^{L+1}_t. Let F^{L+1}_t = [g_{t,1}, g_{t,2}, . . . , g_{t,J}]; g_{t,j} is taken as the final slot-specific vector of slot s_j, which is expected to be close to the semantic vector representation of the groundtruth value of slot s_j at turn t. Since the output of BERT is normalized by layer normalization, we also feed g_{t,j} to a layer normalization layer, which is preceded by a linear transformation layer as follows:

y_{t,j} = LN(W · g_{t,j}),     (9)

where W ∈ R^{d×d}.
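One pre-LN slot self-attention layer can be sketched as below. For brevity the attention is single-head and the stacked layers share one set of FFN weights, both simplifications relative to the model described above:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def slot_self_attention_layer(F, W1, b1, W2, b2):
    """One pre-LN layer, single-head for brevity: self-attention over the
    J slot vectors with a residual connection, then a two-layer ReLU FFN,
    also with a residual connection."""
    X = layer_norm(F)                                 # pre-LN
    A = softmax((X @ X.T) / np.sqrt(F.shape[-1]))     # (J, J) slot-slot weights
    F = F + A @ X                                     # attention + residual
    X = layer_norm(F)
    F = F + np.maximum(X @ W1 + b1, 0.0) @ W2 + b2    # FFN + residual
    return F

rng = np.random.default_rng(1)
J, d = 30, 8                      # 30 domain-slot pairs; toy hidden size
F = rng.normal(size=(J, d))
W1, b1 = rng.normal(size=(d, d)), np.zeros(d)
W2, b2 = rng.normal(size=(d, d)), np.zeros(d)
for _ in range(6):                # 6 stacked layers (weights shared here)
    F = slot_self_attention_layer(F, W1, b1, W2, b2)
```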

Slot Value Matching
To predict the value of each slot s_j (1 ≤ j ≤ J), we compute the distance between y_{t,j} and the semantic vector representation of each candidate value v′ ∈ V_j, where V_j denotes the value space of slot s_j. Then the value with the smallest distance is chosen as the prediction of slot s_j. We adopt the ℓ2 norm as the distance metric.
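A toy version of this distance-based matching, together with the training-time softmax over negative ℓ2 distances, might look like the following (the 2-d value embeddings are hand-made for illustration only):

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_and_loss(y, value_vecs, gold):
    """Distance-based value matching: pick the candidate whose (frozen)
    embedding is nearest to the slot output y, and compute the negative
    log-likelihood under a softmax over negative l2 distances."""
    dists = {v: l2(y, vec) for v, vec in value_vecs.items()}
    pred = min(dists, key=dists.get)                 # nearest candidate
    z = sum(math.exp(-dd) for dd in dists.values())  # softmax partition
    p_gold = math.exp(-dists[gold]) / z
    return pred, -math.log(p_gold)

# Toy 2-d example with hand-made embeddings (illustrative only).
value_vecs = {"cheap": [0.0, 1.0], "expensive": [1.0, 0.0]}
pred, nll = predict_and_loss([0.1, 0.9], value_vecs, gold="cheap")
print(pred)  # cheap
```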
During the training phase, we calculate the probability of the groundtruth value v*_{t,j} of slot s_j at turn t as:

p(v*_{t,j} | X_t, s_j) = exp(−‖y_{t,j} − h^{v*_{t,j}}‖_2) / Σ_{v′∈V_j} exp(−‖y_{t,j} − h^{v′}‖_2).     (10)

Our model is trained to maximize the joint probability of all slots, i.e., Π_{j=1}^{J} p(v*_{t,j} | X_t, s_j). For this purpose, the loss function at each turn is defined as the sum of the negative log-likelihood:

ℒ_t = Σ_{j=1}^{J} −log p(v*_{t,j} | X_t, s_j).     (11)

EXPERIMENTAL SETUP

Datasets

We evaluate our approach on MultiWOZ 2.0 [3] and MultiWOZ 2.1 [11], which are two of the largest publicly available task-oriented dialogue datasets. MultiWOZ 2.0 consists of 10,348 multi-turn dialogues spanning 7 domains {attraction, hotel, restaurant, taxi, train, hospital, police}. Each domain has multiple predefined slots and there are 35 domain-slot pairs in total. MultiWOZ 2.1 is a refined version of MultiWOZ 2.0. According to [11], about 32% of the state annotations have been corrected in MultiWOZ 2.1. Since the hospital and police domains are not included in the validation set and test set, following previous works [25,27,43,55,60], we use only the remaining 5 domains in the experiments. The resulting datasets contain 17 distinct slots and 30 domain-slot pairs. The detailed statistics are summarized in Table 2.

We follow similar data preprocessing procedures as [55] to preprocess both MultiWOZ 2.0 and MultiWOZ 2.1, and we create the ontology by incorporating all the slot values that appear in the datasets. We notice that several works [27,43] exploit the original ontology provided by MultiWOZ 2.0 and MultiWOZ 2.1 to preprocess the datasets in their experiments. However, the original ontology is incomplete. If a slot value is out of the ontology, this value is ignored directly in [27,43], which is impractical and leads to unreasonably high performance.

Comparison Methods
We compare our model STAR with the following state-of-the-art DST approaches:

• SST: SST [6] first constructs a schema graph and then utilizes a graph attention network (GAT) [47] to fuse information from dialogue utterances and the schema graph.
• SAS: SAS [22] calculates a binary slot similarity matrix to control information flow among similar slots. The similarity matrix is computed via either a fixed combination method or a K-means sharing method.
• CREDIT-RL: CREDIT-RL [7] employs a structured representation of dialogue states and casts DST as a sequence generation problem. It also uses a reinforcement loss to fine-tune the model.
• STARC: STARC [13] reformulates DST as a machine reading comprehension problem and adopts several reading comprehension datasets as auxiliary information to train the model.
• CSFN-DST: Similar to SST, CSFN-DST [60] also constructs a schema graph to model the dependencies among slots. However, CSFN-DST utilizes BERT [10] rather than GAT to encode dialogue utterances.
• SOM-DST: SOM-DST [25] regards the dialogue state as an explicit fixed-sized memory and proposes selectively overwriting this memory at each turn.
• CHAN: CHAN [43] proposes a slot-turn attention mechanism to make full use of the dialogue history. It also designs an adaptive objective to alleviate the data imbalance issue.
• TripPy: TripPy [18] leverages three copy mechanisms to extract slot values from user utterances, system inform memory and previous dialogue states.
• NP-DST: NP-DST [16] transforms DST into a language generation problem and adopts GPT-2 [36] as both the dialogue context encoder and the sequence generator.
• SimpleTOD: SimpleTOD [21] is also based on GPT-2. Its model architecture is similar to that of NP-DST.
• MinTL: MinTL [29] is an effective transfer learning framework for task-oriented dialogue systems. It introduces Levenshtein belief spans to track dialogue states. MinTL uses both T5 [37] and BART [28] as pre-trained backbones; we name the two variants MinTL-T5 and MinTL-BART for distinction.

Evaluation Metric
We adopt joint goal accuracy [34] as the evaluation metric. Joint goal accuracy is defined as the ratio of dialogue turns for which the value of every slot is correctly predicted. If a slot has not been mentioned yet, its groundtruth value is set to none, and such none slots also need to be predicted. Joint goal accuracy is a relatively strict evaluation metric: if even a single slot at a turn is mispredicted, the joint goal accuracy of that turn is 0. Thus, the joint goal accuracy of a turn is either 1 or 0.
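A direct implementation of this metric over per-turn state dictionaries (with none slots included) is straightforward:

```python
def joint_goal_accuracy(predictions, golds):
    """Fraction of turns whose full state is exactly right. 'none'
    (not-yet-mentioned) slots count too, so every slot must match."""
    assert len(predictions) == len(golds)
    correct = sum(1 for p, g in zip(predictions, golds) if p == g)
    return correct / len(predictions)

golds = [
    {"hotel-area": "north", "taxi-arriveby": "none"},
    {"hotel-area": "north", "taxi-arriveby": "18:15"},
]
preds = [
    {"hotel-area": "north", "taxi-arriveby": "none"},  # turn scores 1
    {"hotel-area": "north", "taxi-arriveby": "none"},  # one slot wrong: 0
]
print(joint_goal_accuracy(preds, golds))  # 0.5
```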

Training Details
We employ the pre-trained BERT-base-uncased model as the dialogue context encoder. This model has 12 layers with 768 hidden units and 12 self-attention heads. We also utilize another BERT-base-uncased model as the slot and value encoder, whose pre-trained weights are frozen during training. For the slot-token attention and slot self-attention, we set the number of attention heads to 4. The number of slot self-attention layers (i.e., L) is fixed at 6. We treat the context encoder part of our model as an encoder and the remaining part as a decoder. The hidden size of the decoder (i.e., d) is set to 768, which is the same as the dimensionality of BERT outputs. BertAdam [26] is adopted as the optimizer and the warmup proportion is fixed at 0.1. Considering that the encoder is a pre-trained BERT model while the decoder needs to be trained from scratch, we use different learning rates for the two parts. Specifically, the peak learning rate is set to 1e-4 for the decoder and 4e-5 for the encoder. We use a training batch size of 16 and set the dropout [44] probability to 0.1. We also exploit the word dropout technique [2] to partially mask the dialogue utterances by replacing some tokens with the special token [UNK]. The word dropout rate is set to 0.1. Note that we do not apply word dropout to the previous dialogue state, even though it is part of the input. The maximum input sequence length is set to 512. The best model is chosen according to the performance on the validation set. We run the model with different random seeds and report the average results. For MultiWOZ 2.0 and MultiWOZ 2.1, we apply the same hyperparameter settings.
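The word dropout step can be sketched as below. The [UNK] replacement token and the exact way the previous-state tokens are exempted (here, a protected prefix of the input) are assumptions about implementation details not fully specified above:

```python
import random

def word_dropout(tokens, rate=0.1, protected=("[CLS]", "[SEP]"),
                 skip_prefix_len=0, seed=None):
    """Randomly replace input tokens with [UNK] at the given rate.
    Tokens belonging to the previous dialogue state (modelled here as
    the first skip_prefix_len tokens) and BERT's special tokens are
    left untouched. [UNK] as the replacement token is an assumption
    borrowed from the usual word-dropout recipe."""
    rng = random.Random(seed)
    out = []
    for i, tok in enumerate(tokens):
        if i < skip_prefix_len or tok in protected:
            out.append(tok)           # never mask state/special tokens
        elif rng.random() < rate:
            out.append("[UNK]")       # masked
        else:
            out.append(tok)
    return out

tokens = ["[CLS]", "i", "need", "a", "cheap", "restaurant", "[SEP]"]
masked = word_dropout(tokens, rate=0.5, seed=0)
```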

Baseline Comparison
The joint goal accuracy of our model and various baselines on the test sets of MultiWOZ 2.0 and MultiWOZ 2.1 is shown in Table 3, in which we also summarize several key differences of these models. As can be seen, our approach consistently outperforms all baselines on both MultiWOZ 2.0 and MultiWOZ 2.1. Compared with the three methods that have taken slot correlations into consideration (i.e., SST, SAS and CSFN-DST), our approach achieves 2.96% and 1.13% absolute performance promotion on MultiWOZ 2.0 and MultiWOZ 2.1, respectively. Our approach also outperforms the other baselines by 1.47% and 1.07% on the two datasets, respectively.

Table 3: Joint goal accuracy of various methods on the test sets of MultiWOZ 2.0 and MultiWOZ 2.1 (± denotes the standard deviation). † indicates results reported in the original papers. ★ marks results reproduced by us using the released source code. ‡ indicates a statistically significant improvement over the best baseline at the 0.01 level under a paired two-sided t-test.

From Table 3, we observe that SST and TripPy are the best-performing baselines. Both methods reach higher than 55% joint goal accuracy on MultiWOZ 2.1. However, SST needs to construct a schema graph manually by involving some prior knowledge. The schema graph is exploited to capture the correlations among slots. Even though SST leverages this prior knowledge, it is still inferior to our approach. This is because the schema graph only considers the relationships among slot names and thus cannot describe the slot correlations completely. It is worth mentioning that SST also shows a deficiency in utilizing dialogue history: it achieves its best performance when only the previous turn of dialogue history is considered [6]. TripPy shows the best performance among the BERT-based baselines. However, it employs both system actions and a label map as extra supervision. The label map is created according to the labels in the training portion of the dataset. During the testing phase, the label map is leveraged to correct the predictions (e.g., mapping "downtown" to "centre"). The label map is useful, but it may oversmooth some predictions. On the contrary, our model does not rely on any extra information and is a fully data-driven approach. Hence, our model is more general and more scalable. Since our model also achieves better performance than SST and TripPy, we can conclude that it is beneficial to take the slot correlations into consideration, and that the proposed slot self-attention mechanism is able to capture the relevance among slots in a better way.

We conduct a further comparison between TripPy and our model STAR with and without the label map on MultiWOZ 2.1. We denote TripPy as TripPy− when the label map is removed and represent STAR as STAR+ when the label map is involved. The results are reported in Table 4, from which we observe that the performance of TripPy degrades dramatically if the label map is ignored. However, the label map does not have a significant impact on the performance of our approach; with the label map, our approach shows only slightly better performance.

Single-Domain and Multi-Domain Joint Goal Accuracy
Considering that a practical dialogue may involve multiple domains or just a single domain, it is useful to explore how our approach performs in each scenario. To this end, we report the joint goal accuracy of single-domain dialogues and multi-domain dialogues on the test set of MultiWOZ 2.1, respectively. The results are shown in Figure 3, from which we observe that our approach achieves better performance in both scenarios. The results indicate that our approach can capture the correlations among slots both within a single domain and across different domains. From Figure 3, we also observe that all methods demonstrate higher performance in the single-domain scenario; our approach even reaches about 67% joint goal accuracy. The performance of all methods in the multi-domain scenario drops slightly compared to the overall joint goal accuracy shown in Table 3. Nonetheless, our approach still achieves about 55% joint goal accuracy. These results further confirm the effectiveness of our approach.

Figure 4: Slot-specific accuracy on MultiWOZ 2.1. The domain "restaurant" is abbreviated as "rest.".

Domain-Specific Joint Goal Accuracy and Slot-Specific Accuracy
In this part, we first investigate the performance of our model in each domain. The domain-specific joint goal accuracy on MultiWOZ 2.1 is reported in Table 5, where we compare our approach with CSFN-DST, SOM-DST and TripPy. The domain-specific accuracy is calculated on the subset of the predicted dialogue state that consists of all the slots specific to a domain. In addition, only the domain-active dialogues are considered for each domain. As shown in Table 5, our approach consistently outperforms CSFN-DST and SOM-DST in all domains. Our approach also outperforms TripPy in three domains. Although TripPy demonstrates better performance in the "attraction" and "restaurant" domains, it shows the worst performance in the "taxi" domain. As analyzed in [25], the "taxi" domain is the most challenging one. This domain also has the least number of training dialogues (refer to Table 2). Owing to its strong capability of modeling slot correlations, our approach achieves much better performance in this challenging domain. We then illustrate the slot-specific accuracy of our approach and TripPy in Figure 4. The corresponding exact numbers are shown in Table 6 in Appendix A. The slot-specific accuracy measures the accuracy of each individual slot. Note that the slot-specific accuracy is calculated using only the dialogues that involve the domain the slot belongs to. From Figure 4, we observe that both methods demonstrate high performance for most slots. However, TripPy shows relatively poor performance for slot "taxi-departure" and slot "taxi-destination". These results are consistent with the domain-specific accuracy and explain why TripPy fails in the "taxi" domain. From Figure 4, we also observe that our approach is inferior to TripPy on the name-related slots (i.e., "attraction-name", "hotel-name" and "restaurant-name") and leaveat-related slots (i.e., "taxi-leaveat" and "train-leaveat"). The values of these slots are usually informed by the users explicitly. Since TripPy leverages copy mechanisms to extract values directly, it seems to be more appropriate for these slots. This observation suggests that it would be beneficial to extend our model by incorporating the copy mechanism, which we leave as future work.

Per-Turn Joint Goal Accuracy
Given that practical dialogues have different numbers of turns and longer dialogues tend to be more challenging, we further analyze the relationship between the depth of the conversation and the accuracy of our model. The per-turn accuracy on MultiWOZ 2.1 is shown in Figure 5. For comparison, we also include the results of TripPy and STAR-GT, where STAR-GT means the groundtruth previous dialogue state is used as the input at each turn. Figure 5 shows that the accuracy of both TripPy and STAR decreases as the number of dialogue turns increases. In contrast, the performance of STAR-GT is relatively stable. This is because errors made in early turns accumulate in later turns in practice, whereas there is no error accumulation when the groundtruth previous dialogue state is used.

Effects of Number of Slot Self-Attention Layers
We vary the number of slot self-attention layers in the range of {0, 3, 6, 9, 12} to study its impact on the performance of our model. The results on MultiWOZ 2.1 are illustrated in Figure 6, from which we observe that our model achieves the best performance with 6 layers. The performance degrades as the number of layers grows larger, which may be caused by overfitting. Figure 6 also shows that when there is no slot self-attention (i.e., zero layers), the joint goal accuracy drops to around 54%. Note that in this case, our model no longer learns the slot correlations. Hence, we conclude that it is essential to take the dependencies among slots into consideration.
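To make the role of this hyperparameter concrete, the following is a minimal, illustrative sketch of a stacked slot self-attention: single head, random weights, and no layer normalization or feed-forward sublayers, so it only demonstrates how varying the layer count changes how information flows between slot-specific features. All dimensions and weights here are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

# Minimal stacked slot self-attention sketch: each layer lets every slot
# attend to every other slot, mixing their features (residual connection).
def slot_self_attention(slot_features, num_layers, rng):
    d = slot_features.shape[-1]
    h = slot_features  # shape: (num_slots, d)
    for _ in range(num_layers):
        Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        q, k, v = h @ Wq, h @ Wk, h @ Wv
        scores = q @ k.T / np.sqrt(d)               # slot-to-slot scores
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over slots
        h = h + weights @ v                         # residual update
    return h
```

With zero layers the features pass through unchanged, which corresponds to the no-slot-correlation setting whose accuracy drops to around 54% in Figure 6.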

Effects of Number of Previous Dialogue Turns
To evaluate the effects of the number of previous dialogue turns, we vary it over 0, 1, 2, 3, 4 and the full dialogue history. The results on MultiWOZ 2.1 are shown in Figure 7. As can be seen, our model demonstrates the best performance when the full dialogue history is leveraged. When no dialogue history is employed, our model still reaches a joint goal accuracy higher than 55.5%. However, when the number of previous turns is set to 1, 2 or 3, the performance degrades slightly. This is probably because an incomplete history introduces confusing information and makes it more challenging to extract the appropriate slot values.
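Truncating the dialogue history as described above can be sketched as follows. The turn format and the "[SEP]" separator token are assumptions for illustration (separator conventions vary by encoder), not the paper's exact input construction.

```python
# Sketch of building the model input from the last k dialogue turns;
# k=None keeps the full history, k=0 keeps only the current utterance.
def build_context(turns, current_user_utt, k=None):
    history = turns if k is None else (turns[-k:] if k > 0 else [])
    parts = []
    for sys_utt, user_utt in history:
        parts += [sys_utt, user_utt]
    parts.append(current_user_utt)
    return " [SEP] ".join(parts)
```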

CONCLUSION
In this paper, we have presented STAR, a novel DST approach to modeling the correlations among slots. STAR first employs a slot-token attention to retrieve slot-specific information for each slot from the dialogue context. It then leverages a stacked slot self-attention to learn the dependencies among slots. STAR is a fully data-driven approach: it does not require any human effort or prior knowledge to measure the slot correlations. In addition, STAR considers both slot names and their corresponding values to model the slot correlations more precisely. To evaluate the performance of STAR, we have conducted a comprehensive set of experiments on two large multi-domain task-oriented dialogue datasets, MultiWOZ 2.0 and MultiWOZ 2.1. The results show that STAR achieves state-of-the-art performance on both datasets. For future work, we intend to incorporate the copy mechanism into STAR to enhance its performance further.
is always none. Due to this, even if a model predicts the values of a slot as none for all dialogues, it can still achieve a relatively high slot-specific accuracy. To overcome this limitation, we propose calculating the slot-specific accuracy using only the dialogues that involve the domain the slot belongs to. The detailed results are shown in Table 6 (in gray). For comparison, we also include the results computed based on all dialogues. As can be seen, no matter which method is adopted to calculate the slot-specific accuracy, our model achieves better performance for most slots. Table 6 also shows that if the traditional method is adopted, all three models demonstrate higher than 90% slot-specific accuracy for each slot, and there are only subtle differences among the three models. In contrast, the slot-specific accuracy computed using our proposed method is more discriminative.

Table 7 shows the predicted dialogue states of our model and TripPy on a dialogue from the test set of MultiWOZ 2.1. As can be seen, our model correctly predicts all slot values at each turn. However, TripPy fails to predict the value of slot "attraction-type" at turn 2 and delays the prediction to turn 3. At the last turn, TripPy predicts the value of slot "taxi-leaveat" as "01" rather than "01:15", even though this information is explicitly contained in the user utterance. For slot "taxi-departure" and slot "taxi-destination", since the user provides the corresponding information indirectly, it is challenging to deduce their valid values. TripPy falsely predicts the destination as the departure and vice versa.
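The inflation effect described above is easy to reproduce with a toy example. The numbers below are invented for illustration: a degenerate model that always predicts "none" scores highly when evaluated over all dialogues, but scores zero once scoring is restricted to the domain-active cases.

```python
# Toy illustration: slot-specific accuracy over all turns vs. only
# domain-active turns, for a model that always predicts "none".
def slot_accuracy(preds, golds, active_mask=None):
    pairs = list(zip(preds, golds)) if active_mask is None else [
        (p, g) for p, g, a in zip(preds, golds, active_mask) if a]
    return sum(p == g for p, g in pairs) / len(pairs)

golds = ["none"] * 90 + ["cambridge"] * 10  # slot active in only 10% of turns
preds = ["none"] * 100                      # degenerate always-"none" model
active = [g != "none" for g in golds]
print(slot_accuracy(preds, golds))          # 0.9 over all turns
print(slot_accuracy(preds, golds, active))  # 0.0 over domain-active turns
```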