An Adversarial Imitation Click Model for Information Retrieval

Modern information retrieval systems, including web search, ads placement, and recommender systems, typically rely on learning from user feedback. Click models, which study how users interact with a ranked list of items, provide a useful understanding of user feedback for learning ranking models. Constructing "right" dependencies is the key to any successful click model. However, probabilistic graphical models (PGMs) have to rely on manually assigned dependencies, and oversimplify user behaviors. Existing neural network based methods improve on PGMs by enhancing the expressive ability and allowing flexible dependencies, but still suffer from exposure bias and inferior estimation. In this paper, we propose a novel framework, Adversarial Imitation Click Model (AICM), based on imitation learning. Firstly, we explicitly learn the reward function that recovers users' intrinsic utility and underlying intentions. Secondly, we model user interactions with a ranked list as a dynamic system instead of one-step click prediction, alleviating the exposure bias problem. Finally, we minimize the JS divergence through adversarial training and learn a stable distribution of click sequences, which makes AICM generalize well across different distributions of ranked lists. A theoretical analysis indicates that AICM reduces the exposure bias from $O(T^2)$ to $O(T)$. Our studies on a public web search dataset show that AICM not only outperforms state-of-the-art models in traditional click metrics but also achieves superior performance in addressing the exposure bias and recovering the underlying patterns of click sequences.

various click models have been developed [6,10,15,31]. Click models characterize how users interact with a list of items. Given click logs (including a set of queries, a ranked list of items, and the click data for each query), click models are trained to predict a sequence of user clicks, and return a set of model parameters that reflect users' underlying behaviors [13]. Click models provide useful evidence for ranking functions in both training and testing processes. In training, click models generate users' feedback on items with specific positions and contexts that have not been seen in the click logs, which help alleviate the inherent biases in users' behaviors (e.g., position bias, presentation bias) [23,35]. In testing, click models can be applied to evaluating the performance of ranking functions in cases where real users are not available or negative impacts on user experience have to be avoided.
Earlier click models are based on probabilistic graphical models (PGMs). They represent user behaviors as a sequence of observable and hidden states, e.g., clicks, skips, attractiveness, and examinations [6]. Each state is defined as a binary event, e.g., whether a user examines a document, or whether a user is attracted by a document. Yet the PGM framework requires manually setting the dependencies between the events, and thereby may be over-simplified and overlook some key aspects of user behaviors. Moreover, the expressive ability of the PGM framework is usually limited [6,10].
To improve the expressive ability and allow flexible dependencies, Borisov et al. [6] proposed the neural click model (NCM). Rather than using binary random variables, NCM represents user behaviors as vector sequences with the distributed vector representation approach. The click sequence model (CSM) [7] and the context-aware click model (CACM) [10] utilize complex model structures to incorporate more information (e.g., session context information) and thus further enhance the expressive ability. However, such methods still suffer from several limitations.
First, the ultimate goal of click models is to understand user behaviors, and most importantly, the intrinsic utility behind the behaviors, e.g., maximizing the information needs or minimizing the effort of information seeking. This is the underlying intention that the user performs certain actions like clicks and skips. If this utility is modeled explicitly, then it will not only help click modeling, but also provide us with insights and quantitative guidance for the optimization and evaluation of a ranking function. Simply treating click model as a click prediction task with a black-box neural network might ignore this important aspect of click model.
arXiv:2104.06077v2 [cs.IR] 19 Apr 2021
Second, existing neural network (NN)-based models generally overlook the problem of exposure bias [4], which refers to the model input discrepancy between training and testing. Specifically, the neural network based methods mentioned above predict the next click based on previous clicks of the ground truth sequences in the training procedure. During testing, however, these models have to predict successive clicks based on their own previous predictions, which have not been seen during training. This discrepancy comes from the conflict between the dynamic nature of user behaviors and the static modeling of these models. User behaviors naturally depend on the ones that happened previously. However, previous works are supervised to predict one click at each time step by assuming all previous clicks are correct. Such a greedy method may yield sub-optimal results, since small errors accumulated at each time step lead to a great deviation from the optimal sequence [28].
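The input discrepancy described above can be made concrete with a minimal sketch. The function `next_click_prob` is a hypothetical stand-in for any trained next-click predictor and is not a model from this paper; the point is only where the conditioning prefix comes from in each regime.

```python
import random


def next_click_prob(prev_clicks):
    """Hypothetical stand-in for a trained next-click predictor:
    the click probability decays with the number of earlier clicks."""
    return 0.6 / (1 + sum(prev_clicks))


def teacher_forced_inputs(true_clicks):
    """Training regime: the prefix fed at each rank is taken from the logged clicks."""
    return [true_clicks[:t] for t in range(len(true_clicks))]


def free_running_rollout(length, rng):
    """Testing regime: the prefix fed at each rank is the model's own sampled clicks."""
    clicks = []
    prefixes = []
    for _ in range(length):
        prefixes.append(list(clicks))
        p = next_click_prob(clicks)
        clicks.append(1 if rng.random() < p else 0)
    return prefixes, clicks
```

In the free-running case an early sampling error changes every later prefix, which is the situation the model never encountered under teacher forcing.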
Moreover, existing NN-based models may not generalize well when the test data deviates from the training data, especially when the data is rather sparse w.r.t. the whole space of click sequences. User behaviors are complex in nature and may contain multiple patterns. If the data is sparse, using maximum likelihood estimation (MLE) as in prior works [6,10] tends to average over all the patterns and thus fails to fit the complex user behaviors, resulting in inferior estimation. The MLE objective function only minimizes the KL divergence between the target distribution and the learned distribution, i.e., the forward KL divergence. However, the KL divergence between the learned distribution and the target distribution, i.e., the reverse KL divergence, which concentrates on the major pattern [5], has the potential to help the click model achieve better performance under its own generated distribution.
To tackle the aforementioned limitations, in this work, we propose a novel learning paradigm for click models based on the imitation learning framework, namely, Adversarial Imitation Click Model (AICM). Imitation learning is a learning paradigm that aims at reconstructing sequential decision-making policies from sampled experts' trajectories [20]. Firstly, we regard user behaviors as expert demonstrations and thus assume that users' intrinsic utility is maximized. With this assumption, we build a reward function explicitly from users' click logs. Then we use this reward function to guide the learning of a click policy that reproduces user behaviors. This reward function provides important insights and quantitative guidance for the optimization and evaluation of a ranking function. Secondly, we formulate the click model as a dynamic system. To be specific, we base users' current state on previous predictions and optimize the click model for a long-term objective rather than a short-sighted loss over individual clicks, which alleviates the exposure bias problem. Finally, we solve the dynamic system via an imitation learning algorithm, more specifically the generative adversarial imitation learning (GAIL) algorithm, to minimize the Jensen-Shannon (JS) divergence between the distributions of target sequences and generated sequences. Instead of solely considering the forward KL divergence as in MLE, minimizing the JS divergence helps to learn a more stable distribution of click sequences, which makes the click model generalize better on different distributions of ranked lists.
Our theoretical analysis shows that AICM reduces the exposure bias from $O(T^2)$ to $O(T)$. Extensive empirical studies are conducted to show the state-of-the-art performance of AICM in traditional click prediction and relevance estimation tasks, and superior performance in addressing the exposure bias and recovering the underlying patterns of click sequences. The results also demonstrate that AICM generalizes well on different distributions of ranked lists and achieves stable performance even in bad cases, which allows safe exploration of the ranking functions.
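The contrast between forward and reverse KL divergence can be checked on a toy example. The three-outcome distributions below are made-up numbers chosen for illustration only: `target` has two equally likely patterns, `q_avg` spreads its mass over everything, and `q_mode` commits to one pattern.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Illustrative (made-up) distributions over three outcomes.
target = [0.495, 0.01, 0.495]   # two major patterns
q_avg = [1 / 3, 1 / 3, 1 / 3]   # averages over all patterns
q_mode = [0.98, 0.01, 0.01]     # concentrates on a single pattern

forward_avg, forward_mode = kl(target, q_avg), kl(target, q_mode)
reverse_avg, reverse_mode = kl(q_avg, target), kl(q_mode, target)
```

On these numbers the forward KL (the MLE objective) is smaller for `q_avg`, while the reverse KL is smaller for `q_mode`, matching the averaging versus mode-seeking behavior described above.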

PRELIMINARY: IMITATION LEARNING
The goal of imitation learning is to learn a behavior policy $\pi(a|s)$ that reproduces expert behaviors, given a set of expert demonstrations, where each demonstration is a sequence of states and actions, i.e., $\tau_E = [s_0, a_0, s_1, a_1, \ldots]$ [26]. Models of imitation learning are generally divided into three classes: behavior cloning (BC), inverse reinforcement learning (IRL) and generative adversarial imitation learning (GAIL).

Behavior Cloning
Behavior cloning (BC) [3] learns a policy that directly maps states to actions without recovering the reward function. BC maximizes the likelihood of experts' trajectories, which is equivalent to minimizing the KL divergence $D_{KL}(\pi_E, \pi)$ for each state visited by the expert policy $\pi_E$.
If we regard the learning of click models as an imitation learning problem, the traditional supervised click models can be categorized into BC. However, BC suffers from compounding error since it only fits a single-step decision instead of focusing on a long-horizon planning [28].

Inverse Reinforcement Learning
Inverse Reinforcement Learning (IRL) recovers the reward function from expert demonstrations under the assumption that such demonstrations are optimal. Then a policy can be trained according to the learned reward function. The IRL problem is ill-posed because a policy can be optimal for multiple reward functions. To obtain a unique solution, various additional objectives such as maximum margin [1,25] and maximum entropy [38,39] have been proposed. Taking the maximum causal entropy IRL [38,39] as an example, it looks for a cost function $c \in \mathcal{C}$ (where the cost is equivalent to negative reward) that assigns low cost to the expert policy and high cost to the other policies:
$$\max_{c \in \mathcal{C}} \Big( \min_{\pi} -H(\pi) + \mathbb{E}_{\pi}[c(s,a)] \Big) - \mathbb{E}_{\pi_E}[c(s,a)], \quad (2)$$
where $H(\pi) = \mathbb{E}_{\pi}[-\log \pi(a|s)]$ is the causal entropy of the learned policy $\pi$. We use an expectation w.r.t. a policy $\pi$ to denote an expectation w.r.t. the trajectory it generates, e.g., $\mathbb{E}_{\pi}[c(s,a)] = \mathbb{E}\big[\sum_{t=0}^{T} \gamma^t c(s_t, a_t)\big]$, where $\gamma$ is the discount factor. IRL methods often require a costly iterative learning process, which has to solve an RL-type problem in every update step of the reward function.

Generative Adversarial Imitation Learning
Inspired by the connection between GANs [16] and IRL, Ho and Ermon [20] proposed generative adversarial imitation learning (GAIL). GAIL trains a policy $\pi(a|s)$ with the reward provided by a discriminator $D(s,a): \mathcal{S} \times \mathcal{A} \to (0, 1)$, which distinguishes between state-action pairs of $\pi$ and $\pi_E$. The objective function of GAIL is:
$$\min_{\pi} \max_{D} \; \mathbb{E}_{\pi}[\log D(s,a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi), \quad (3)$$
where $\lambda \geq 0$ weights the entropy regularizer. Ho and Ermon [20] showed that IRL is a dual of the occupancy measure matching under the maximum entropy principle. GAIL essentially solves an occupancy measure matching problem by minimizing the JS divergence between the occupancy measures under the learned behavior policy and the expert policy, with the causal entropy $H(\pi)$ as the policy regularizer:
$$\min_{\pi} \; D_{JS}(\rho_{\pi}, \rho_{\pi_E}) - \lambda H(\pi), \quad (4)$$
where $\rho_{\pi}(s,a) = \frac{1-\gamma}{1-\gamma^{T+1}} \sum_{t=0}^{T} \gamma^t P(s_t = s, a_t = a \mid \pi)$. Here the normalized occupancy measures $\rho_{\pi}$ and $\rho_{\pi_E}$ denote the distributions of state-action pairs under the learned behavior policy and the expert policy, respectively. To be specific, $\frac{1-\gamma}{1-\gamma^{T+1}}$ is a normalization factor that ensures the probabilities sum to one. As can be seen, the JS divergence considers both the forward and the reverse KL divergence between the learned behavior policy and the target policy, making the learned behavior policy precise and stable.
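Two ingredients of this formulation are easy to verify numerically: the JS divergence itself, and the claim that the factor $(1-\gamma)/(1-\gamma^{T+1})$ normalizes the discounted time weights $\gamma^t$ for $t = 0, \ldots, T$. The sketch below is illustrative only; the function names are our own and the distributions are plain Python lists rather than occupancy measures estimated from trajectories.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def occupancy_weights(gamma, T):
    """Normalized discount weights for time steps t = 0..T (gamma < 1)."""
    norm = (1 - gamma) / (1 - gamma ** (T + 1))
    return [norm * gamma ** t for t in range(T + 1)]
```

Unlike either KL direction, `js` is symmetric and bounded by `log 2`, which it reaches when the two distributions have disjoint support.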

METHODOLOGY
We first formulate the click model as an imitation learning problem and then present the overview of our proposed method AICM. After that, we introduce each component of AICM in detail.

Problem Formulation
Users' interaction with a ranked list can be naturally interpreted as a sequential decision-making process. As illustrated in Figure 1, a user starts a search session by issuing a query $q$, and the ranking system delivers a ranked list with corresponding documents $D = \{d_1, d_2, \ldots, d_T\}$. The user examines the presented list, possibly clicks one or more documents, and then abandons the list to end the interaction. During such a process, a click sequence $\{c_1, \ldots, c_T\}$ is generated. The goal of click models [6] is to simulate the process from issuing a query until abandoning the search. Note that in this paper we consider a general click model setting, where the information of the user's previous search queries in the same session (as in CACM [10]) is not considered.
As a sequential decision-making process, a Markov decision process (MDP) can be used to model user behaviors, where the key components are defined as follows.
• State. The initial user state $s_0$ is initialized with the query $q$.
Similar to many existing click models [6,10], we do not model the abandonment explicitly; rather, our model is trained to predict low click probabilities for documents that are unlikely to be examined by the user. The state at the end of the ranked list is simply set as the terminal state.

Overview of AICM
In imitation learning, we aim to learn a behavior policy $\pi(a|s)$ from the state-action sequences provided by experts, i.e., expert demonstrations. In this work, we propose the Adversarial Imitation Click Model (AICM), which adopts the GAIL framework to imitate the expert policy. AICM consists of three parts: 1) an embedding layer for the query, document and interaction representations; 2) a generator $\pi_\theta(a|s)$, i.e., the behavior policy, that generates user clicks; and 3) a discriminator $D_w(s,a)$, parameterized by $w$, that measures the difference between the generated user clicks and the ground truth clicks. To alleviate the exposure bias, the generator and discriminator are learned in an adversarial training paradigm. The overall framework of AICM is shown in Figure 2.
We describe how the query $q$, document $d$ and interaction $c$ are represented in the embedding layer. For each document $d$, we incorporate its vertical type $v$, which is an important feature that reflects the presentation style of each document. Common vertical types for a commercial search engine include the organic result, the encyclopedia vertical, the illustrated vertical, etc. We first transform each original ID feature into a high-dimensional sparse feature via one-hot encoding. Then the embedding layer maps the one-hot vectors to low-dimensional, dense real-valued embedding vectors:
$$v_q = \mathrm{Emb}_q(q), \quad v_d = \mathrm{Emb}_d(d), \quad v_v = \mathrm{Emb}_v(v), \quad v_c = \mathrm{Emb}_c(c),$$
where $\mathrm{Emb}_q \in \mathbb{R}^{N_q \times l_q}$, $\mathrm{Emb}_d \in \mathbb{R}^{N_d \times l_d}$, $\mathrm{Emb}_v \in \mathbb{R}^{N_v \times l_v}$, $\mathrm{Emb}_c \in \mathbb{R}^{N_c \times l_c}$, and $N_*$ and $l_*$ denote the input size and the embedding size of each feature, respectively.
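The one-hot-then-embed step described above amounts to selecting a row of the embedding table. The following minimal sketch shows this with plain Python lists; the table values and sizes are invented for illustration, and a real implementation would use a learned embedding lookup rather than an explicit matrix product.

```python
def one_hot(index, size):
    """High-dimensional sparse encoding of an ID feature."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed(one_hot_vec, table):
    """Multiply a one-hot row vector (length N) by an N x l embedding table."""
    dim = len(table[0])
    return [sum(x * row[j] for x, row in zip(one_hot_vec, table)) for j in range(dim)]


# Illustrative table with input size 3 and embedding size 2 (made-up values).
vertical_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
v_v = embed(one_hot(1, 3), vertical_table)  # same as vertical_table[1]
```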

Generator
The generator $\pi_\theta(a|s)$ generates users' feedback based on the state $s$. The state carries the information of users' historical interactions with the documents presented before. We mainly follow the network configuration of NCM [6] and adopt the gated recurrent unit (GRU) [12] as the building block, which performs similarly to LSTM [21] but is computationally cheaper. The process of the generator $\pi_\theta(a|s)$ can be divided into the following steps:
(1) A user starts the session by issuing query $q$, and the hidden state $h_0$ is initialized with $v_q$, where the document embedding $v_d$, vertical type embedding $v_v$ and previous interaction embedding $v_c$ are initialized with zero vectors of the corresponding sizes.
(2) At rank 1, the user examines the first document. The current hidden state $h_1$ encodes the last state $h_0$, the embedding $v_d$ of the current document, its corresponding vertical type embedding $v_v$, and the previous interaction $v_c$ via a GRU unit. The current action $a_1$ with document $d_1$ is generated according to the action probability from the policy $\pi_\theta(a_1|s_1) = \mathrm{Softmax}(\mathrm{Linear}(h_1))$.
(3) The previous interaction $v_c$ is updated with the embedding of the current action $a_1$, i.e., click or no click.
(4) For rank $t > 1$, steps (2) and (3) are repeated to select the current action $a_t$ and update the previous interaction $v_c$.
[Figure note: The blank node is zero padding, which is mapped to the zero vector of the corresponding shape after the embedding layer, as described in Section 3.3 and Section 3.4.]
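The structure of the generator $\pi_\theta(a|s)$ is described as:
$$h_0 = \mathrm{GRU}(\mathbf{0}, \; v_q \oplus \mathbf{0}_{l_d} \oplus \mathbf{0}_{l_v} \oplus \mathbf{0}_{l_c}),$$
$$h_t = \mathrm{GRU}(h_{t-1}, \; v_q \oplus v_{d_t} \oplus v_{v_t} \oplus v_{c_{t-1}}), \quad (7)$$
$$\pi_\theta(a_t|s_t) = \mathrm{Softmax}(\mathrm{Linear}(h_t)),$$
where $\oplus$ is the vector concatenation, $\mathbf{0}_{l_*}$ denotes a zero vector with size $l_*$ for the corresponding feature, and $h_t$ is the hidden representation of $s_t$. Note that in NCM the query embedding $v_q$ is only used at step 0, while in AICM we encode this information at each step to ensure that it is not forgotten during RNN propagation [17].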
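The rank-by-rank generation loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_cell` is a hypothetical bounded recurrent update standing in for a real GRU cell, `click_prob` stands in for the Softmax over a linear layer, and all embeddings are passed in as plain lists.

```python
import math
import random


def concat(*vectors):
    """Vector concatenation (the ⊕ operator)."""
    out = []
    for v in vectors:
        out.extend(v)
    return out


def toy_cell(h_prev, x):
    """Hypothetical stand-in for a GRU cell (NOT a real GRU): a bounded
    recurrent update so that the rollout loop can run end to end."""
    drive = sum(x) / len(x)
    return [math.tanh(0.5 * h + drive) for h in h_prev]


def click_prob(h):
    """Stand-in for Softmax(Linear(h)) over {no click, click}; returns P(click)."""
    logit = sum(h) / len(h)
    return 1.0 / (1.0 + math.exp(-logit))


def rollout(v_q, doc_embs, vert_embs, click_embs, hidden_size, rng):
    """Generate clicks rank by rank, feeding each sampled click back in.
    click_embs[a] is the embedding of action a (0 = no click, 1 = click)."""
    zeros_d = [0.0] * len(doc_embs[0])
    zeros_v = [0.0] * len(vert_embs[0])
    zeros_c = [0.0] * len(click_embs[0])
    # Rank 0: only the query is known; the other inputs are zero padding.
    h = toy_cell([0.0] * hidden_size, concat(v_q, zeros_d, zeros_v, zeros_c))
    prev_c = zeros_c
    clicks = []
    for v_d, v_v in zip(doc_embs, vert_embs):
        # The query embedding is concatenated at every rank, as in the text.
        h = toy_cell(h, concat(v_q, v_d, v_v, prev_c))
        a = 1 if rng.random() < click_prob(h) else 0
        clicks.append(a)
        prev_c = click_embs[a]  # becomes the previous interaction at the next rank
    return clicks
```

The important design point is that `prev_c` comes from the sampled action rather than from a logged click, so the state at each rank is built on the model's own earlier predictions.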

Discriminator
The discriminator $D_w(s,a)$ distinguishes the state-action pairs $(s,a)$ generated by the behavior policy $\pi_\theta(a|s)$ from those of the expert policy. We also use GRU as the building block for $D_w(s,a)$. To be consistent with the generator, the initial state $h'_0$ of the discriminator is initialized with the query embedding $v_q$ at rank 0. At rank $t \geq 1$, the GRU unit takes as input the query embedding $v_q$, document embedding $v_d$, vertical embedding $v_v$, and the current interaction $v_c$ with $d_t$ (recall that in Eq. (7), $v_c$ is the user's previous interaction with $d_{t-1}$), and outputs a hidden vector $h'_t$, which contains the information of both $s_t$ and $a_t$. The recurrent structure of the discriminator $D_w(s,a)$ is described as:
$$h'_t = \mathrm{GRU}(h'_{t-1}, \; v_q \oplus v_{d_t} \oplus v_{v_t} \oplus v_{c_t}).$$
Note that the hidden state $h'_t$ encodes the information of both the state $s_t$ and the action $a_t$, while $h_t$ in Eq. (7) only encodes the information of the state $s_t$.

Adversarial Training
The behavior policy $\pi_\theta(a|s)$, i.e., the generator, generates click sequences, while the discriminator $D_w(s,a)$ measures the difference between the generated click sequences and the ground-truth sequences. The generator and discriminator are updated according to the following procedure until convergence.
Firstly, we sample trajectories from the behavior policy $\pi_\theta(a|s)$ and from expert demonstrations, and then update the discriminator parameters $w$ with the gradient given in Eq. (9). Each time we obtain an updated discriminator $D_w(s,a)$, we take a gradient step using Proximal Policy Optimization (PPO) [29] to update the generator according to Eq. (10). The state-action value function defined in Eq. (11), which controls the direction and scale of the policy gradient, is built upon the reward provided by the discriminator $D_w(s,a)$. It works in two ways. On one hand, it gives a low reward when the next click of the generated click sequence differs from the training data, which is similar to most state-of-the-art methods. On the other hand, it also gives a low reward to a generated sequence whose prefix, i.e., the previously generated clicks, is significantly different from the training data, which explicitly constrains the propagation of error. During training, as the discriminator becomes better at distinguishing the generated sequences from the ground truth sequences, the generator is pushed to produce more realistic prefixes, and the exposure bias is thereby alleviated.
Overall, by alternately updating the discriminator and the generator according to Eq. (9) and Eq. (10), we solve the optimization problem in Eq. (3), which essentially minimizes the JS divergence between the occupancy measures under the behavior policy and the expert policy. Recall that most existing methods minimize the KL divergence, which tends to average over all the patterns (assuming that user behaviors are complex and contain multiple pattern modes) and thus fails to fit the complex user behaviors. Compared to the KL divergence, minimizing the JS divergence, which concentrates more on the major pattern [5], encourages each click generated by the behavior policy to be "real" according to expert demonstrations instead of trying to correctly cover each pattern from the demonstrations. Such a tendency ensures the good quality of generated clicks, especially in the case of sparse data and a large search space.
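How a discriminator score turns into a long-horizon training signal can be sketched as follows. The sign convention is an assumption on our part: we follow the GAIL convention in which $D(s,a)$ is the estimated probability that a pair was generated by the policy, so that the per-step reward is $-\log D(s,a)$. The function names are hypothetical, and PPO itself is not shown.

```python
import math


def rewards_from_discriminator(d_scores):
    """Per-step rewards under the ASSUMED convention that d is the probability
    that the pair is generated: expert-like pairs (small d) get high reward."""
    return [-math.log(d) for d in d_scores]


def discounted_returns(rewards, gamma):
    """Discounted reward-to-go at every rank, computed backwards in one pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

Because the discriminator reads the whole prefix through its recurrent state, an unrealistic early click lowers the rewards at later ranks as well, and the reward-to-go at the early rank aggregates all of them.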

THEORETICAL ANALYSIS
In this section, we theoretically analyze how the exposure bias is reduced in AICM. Definition 4.1 defines the user's intrinsic reward function and the policy-level expected utility based on it. In click models, future states are influenced by current decisions. The discount factor $\gamma$ describes how much we should consider future states to make an optimal decision at each step. We assume that under the expert policy, the user's utility is maximized, so the ultimate goal for a click model is to minimize the utility gap between the expert policy and the learned click policy. Firstly, we derive the utility gap for a click policy $\pi(a|s)$ based on behavior cloning. From Eq. (1) we can derive the following theorem.
The proof is in Appendix A.1. According to Theorem 4.1, the exposure bias problem exists for a BC-based click policy. To be specific, in the training stage the policy $\pi(a|s)$ is learned assuming each previous click behavior is real, while in the testing stage the next click is generated based on previous predictions, which might not have been seen during training. Under such a condition the induced utility gap is quadratic w.r.t. the list length $T$.
After that, we derive the utility gap for a click policy $\pi(a|s)$ based on GAIL. From Eq. (4) we can derive the following theorem.
The proof is in Appendix A.2. According to Theorem 4.2, the utility discrepancy induced by AICM is linear in the list length $T$. AICM generates each click based on previous predictions and evaluates the quality of the whole generated sequence instead of a one-step click. With such dynamic training, we alleviate the exposure bias present in BC-based methods and reduce the utility gap significantly from $O(T^2)$ to $O(T)$.
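The practical difference between quadratic and linear growth can be illustrated with a small calculation. This is a generic compounding-error illustration and not the proof from the appendix: it assumes, hypothetically, that the policy errs with probability `eps` at each rank and that, in the compounding case, an error at rank `t` can cost up to one unit at every remaining rank.

```python
def compounding_bound(eps, T):
    """Worst-case expected cost when an error at rank t affects ranks t..T."""
    return sum(eps * (T - t + 1) for t in range(1, T + 1))


def non_compounding_bound(eps, T):
    """Expected cost when each error costs at most one unit in total."""
    return eps * T


# Doubling the list length roughly quadruples the first bound and doubles the second.
quad_ratio = compounding_bound(0.01, 200) / compounding_bound(0.01, 100)
lin_ratio = non_compounding_bound(0.01, 200) / non_compounding_bound(0.01, 100)
```

The first bound equals `eps * T * (T + 1) / 2`, so for long lists even a small per-rank error rate dominates the utility gap unless error propagation is controlled.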

EXPERIMENT
In this section, we conduct extensive experiments to answer the following questions:
RQ1 How does AICM perform in click prediction and relevance estimation compared with the existing click models?
RQ2 Does AICM perform better than the existing click models in recovering the distribution of real data?
[Table 1.]
[18], DCM [19], DBN [8], SDBN [13], PBM [14] and UBM [15] are considered as representative PGM-based click models, of which open-source implementations are available. For NN-based click models, we consider NCM [6] and CACM [10] for experimental comparison.

Evaluation Metrics.
We use three traditional metrics for click prediction and relevance estimation tasks. In addition, we propose two metrics (Reverse PPL and Forward PPL) to evaluate the generalization and data distributional coverage of click models. More details are described in Section 5.3. For the click prediction task, we report the log-likelihood (LL) and perplexity (PPL) [15] of each model. The definitions of the log-likelihood and the click perplexity at rank $r$ are as follows:
$$LL = \frac{1}{Nn} \sum_{i=1}^{N} \sum_{r=1}^{n} \big[ C_{i,r} \log P_{i,r} + (1 - C_{i,r}) \log (1 - P_{i,r}) \big],$$
$$PPL@r = 2^{-\frac{1}{N} \sum_{i=1}^{N} \left[ C_{i,r} \log_2 P_{i,r} + (1 - C_{i,r}) \log_2 (1 - P_{i,r}) \right]},$$
where the subscript $r$ is the rank position in a result list, $N$ is the total number of queries, and $n$ is the number of results in a query. $C_{i,r}$ and $P_{i,r}$ denote the real click signal and the predicted click probability of the $r$-th result in the $i$-th query. The total perplexity is calculated by averaging the perplexities over all positions. Lower values of perplexity and higher values of log-likelihood correspond to better click prediction performance.
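The two click prediction metrics can be computed as in the sketch below. It assumes that clicks and predicted probabilities are given as `N` lists of `n` ranks, that every probability lies strictly between 0 and 1, and that the log-likelihood is averaged over all (query, rank) pairs; the averaging choice and the function names are our assumptions. Ranks are 0-indexed in the code.

```python
import math


def log_likelihood(clicks, probs):
    """Log-likelihood averaged over all (query, rank) pairs (natural log)."""
    total, count = 0.0, 0
    for c_row, p_row in zip(clicks, probs):
        for c, p in zip(c_row, p_row):
            total += math.log(p) if c == 1 else math.log(1.0 - p)
            count += 1
    return total / count


def perplexity_at_rank(clicks, probs, r):
    """Click perplexity at (0-indexed) rank r, using base-2 logarithms."""
    s = 0.0
    for c_row, p_row in zip(clicks, probs):
        p = p_row[r]
        s += math.log2(p) if c_row[r] == 1 else math.log2(1.0 - p)
    return 2.0 ** (-s / len(clicks))


def perplexity(clicks, probs):
    """Total perplexity: the average of the per-rank perplexities."""
    n = len(clicks[0])
    return sum(perplexity_at_rank(clicks, probs, r) for r in range(n)) / n
```

A model that always predicts 0.5 has perplexity exactly 2, and a perfect model approaches the lower bound of 1, which gives a convenient sanity check for an implementation.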
For relevance estimation task, we use click models to rank the document list and calculate the mean Normalized Discounted Cumulative Gain (NDCG) [22] according to the human labels. We report NDCG scores at truncation level 1, 3, 5 and 10.
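A minimal NDCG@k sketch is given below. The text does not state which gain function is used, so the exponential gain `2^rel - 1` with a `log2` position discount is an assumption, and `ndcg_for_scores` is a hypothetical helper that ranks documents by the relevance scores a click model assigns.

```python
import math


def dcg_at_k(relevances, k):
    """DCG with exponential gain (an assumed variant) and log2 discount."""
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """NDCG@k of human labels listed in ranked order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def ndcg_for_scores(scores, labels, k):
    """Rank documents by descending model score, then evaluate against labels."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ndcg_at_k([labels[i] for i in order], k)
```

For example, `ndcg_for_scores([0.2, 0.9, 0.5], [0, 3, 1], 3)` evaluates the ordering that places the second document first.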

Implementation Details.
We train AICM with a mini-batch size of 128 by using the Adam optimizer. The embedding size and hidden size of the GRU are both 64. The initial learning rates for the generator and discriminator are $5 \times 10^{-4}$ and $1 \times 10^{-3}$, respectively, with a decay rate of $5 \times 10^{-1}$. To avoid overfitting, we set the L2 regularization coefficient and the dropout rate to $1 \times 10^{-5}$ and $5 \times 10^{-1}$, respectively. At the beginning of training, we use maximum likelihood estimation (MLE) to pre-train the generator and discriminator on the training set with an initial learning rate of $1 \times 10^{-3}$. Finally, we adopt the model at the iteration with the lowest validation PPL for evaluation on the test set. To ensure a fair comparison, we also fine-tune all the baseline models to achieve their best performance.

Performance on Traditional Metrics (RQ1)
The results for the click prediction task and the relevance estimation task are presented in Table 2, from which we can obtain the following observations. (1) All NN-based models significantly outperform PGM-based models in the click prediction and the relevance estimation tasks. NN-based models learn the distributed representations of queries and documents, therefore they can better capture the user behavior patterns.  Adversarial training enables AICM to better capture the user behavior patterns, instead of only fitting the click logs via MLE. (4) For relevance estimation, AICM performs better than the best baseline model (i.e., CACM) at all truncation levels in terms of NDCG. The baseline CACM models examination prediction and relevance estimation separately, and uses the extra session-level information. On the contrary, without complex model structure and session-level side information, our AICM can still achieve comparatively better performance with CACM in relevance estimation task, which demonstrates the effectiveness of our proposed GAIL framework for click models.

Metrics for Distributional Coverage.
Viewed as generative models of click signals, click models should approximate the true data distribution that underlies the click log data and generate click samples of high fidelity. Traditional tasks and the corresponding metrics (e.g., the click prediction task with LL and PPL) are not very suitable, because they view the click model as a predictive model and only deal with one-step click probabilities $P_{i,r}$ conditioned on true previous clicks. We need a task that views the click model as a generative model and measures the quality of generated samples, i.e., whole click sequences based on the model's own predictions. Therefore, we propose a novel task for click models, called distributional coverage, in which we aim to measure the similarity between the true data distribution and the data distribution learned by the click model. Greater similarity between the true data distribution and the learned data distribution implies higher fidelity and better distributional coverage of the click model. Since it is not possible to obtain the true data distribution, we cannot measure the similarity directly. A common quantitative measure to test the fidelity and distributional coverage of generative models is to evaluate the generated samples via a strong surrogate
[Figure 3.]
After training target click models based on the training and validation sets, we use the target click models to generate click signals based on queries and corresponding document lists in the test set, while document lists are kept in the original order (i.e., no permutation). Click signals are generated by sampling from a Bernoulli distribution that takes 1 with probability $P_{i,r}$ and 0 with probability $1 - P_{i,r}$.
To generate a synthetic dataset of a similar size to the training set, click signals are independently sampled 7 times for each query in the test set, resulting in 289,835 queries. For each click model (e.g., NCM, CACM, AICM), a synthetic dataset is generated following the above process. To evaluate the fidelity and distributional coverage of different click models, we compute the Reverse PPL and Forward PPL of the individual synthetic datasets. Lower values of Reverse/Forward PPL indicate better performance in the distributional coverage task.
While the traditional PPL metric in the click prediction task only considers the click model as a predictive model, Reverse/Forward PPL in the distributional coverage task view the click model as a generative model and directly measure the data distribution similarity by taking generated click sequences into account. Therefore, Reverse/Forward PPL are more suitable in real-world application scenarios where click models aim to build a simulation environment and provide simulated click signals.
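The sampling procedure can be sketched as follows. For brevity the sketch takes a fixed list of click probabilities per session; for a sequential click model the probability at each rank would instead be recomputed from the clicks sampled so far. The function names and the `(query_id, click_probs)` session format are assumptions made for illustration.

```python
import random


def sample_clicks(click_probs, rng):
    """Draw one Bernoulli click per rank: 1 with probability p, else 0."""
    return [1 if rng.random() < p else 0 for p in click_probs]


def build_synthetic_dataset(sessions, repeats, seed=0):
    """sessions: list of (query_id, click_probs). Each session is sampled
    `repeats` times independently, with the document order left unchanged."""
    rng = random.Random(seed)
    dataset = []
    for query_id, click_probs in sessions:
        for _ in range(repeats):
            dataset.append((query_id, sample_clicks(click_probs, rng)))
    return dataset
```

With `repeats=7`, a synthetic dataset of 289,835 queries corresponds to 41,405 test queries, since 289,835 / 7 = 41,405.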

Performance for Distributional Coverage.
In our experiments, we measure the Reverse/Forward PPL of AICM, CACM, NCM and UBM. To conduct an adequate experiment, we use UBM as the PGM-based surrogate model and NCM as the NN-based surrogate model. For a fair comparison, the surrogate models used in both Reverse PPL and Forward PPL for different click models are all of the same model size and are trained for the same number of epochs. In addition, we also test the Reverse/Forward PPL of the real data, i.e., the PPL of the surrogate model trained on held-out real data and evaluated on the same held-out real data, which is supposed to provide the best values for these two metrics. Results are presented in Table 3, from which we can obtain the following observations.
(1) As a PGM-based model, UBM achieves worse performance compared to NN-based methods in terms of Reverse PPL and Forward PPL for both surrogate models, though its traditional PPL metric is very close to that of the best baseline model CACM in Table 2. This observation suggests that distributed vector representations are better than the traditional binary random variable representation at recovering the underlying distribution of click log data. (2) CACM fails to beat NCM on these two metrics, though it shows a significant improvement in the traditional PPL metric. Observations (1) and (2) show that Reverse/Forward PPL for the distributional coverage task have different tendencies from traditional metrics (i.e., LL and PPL) for the click prediction task. These two tasks evaluate different aspects of click models. The click prediction task considers the one-step conditional click probability, while distributional coverage measures the distributional discrepancy after the whole sequence is generated. (3) AICM outperforms all the baselines by a statistically significant margin ($p$-value < 0.001) in terms of both Reverse PPL and Forward PPL, with different surrogate models. This indicates that AICM can better recover the real data distribution of the click logs, which is to say, AICM is able to better capture the pattern of user behaviors in the real data. (4) An interesting observation in Table 3 is that the Forward PPL of AICM even outperforms that of the real data. On one hand, this observation indicates that AICM learns a relatively simpler data pattern compared to the real data pattern, which can be regarded as a denoising process (i.e., outliers are removed). On the other hand, the "proper" performance of AICM for Reverse PPL (i.e., better than all the baselines and worse than the real data) shows that the data pattern learned by AICM is not so simple as to fall into mode collapse.

Visualization for Distributional Coverage.
Furthermore, in Figure 4, we visualize the t-SNE projections of the document embeddings and GRU hidden states learned by the surrogate NCM from synthetic datasets generated by different click models. The results on the UBM synthetic dataset are not visualized because its Reverse/Forward PPL are significantly worse than the others. Note that we do not distinguish hidden states at different ranks with different colors. We can observe that the projections of both document embeddings and GRU hidden states based on the AICM synthetic dataset are closer to the real data compared to NCM and CACM. The projections of NCM and CACM look similar. These observations are consistent with the results of Reverse PPL in Table 3, which again validates the ability of AICM to capture the underlying distribution of user behaviors and generate click samples of high fidelity.

Performance in Bad Cases (RQ3)
Click models are trained and tested on real-world click logs, the document lists of which come from a well-trained ranking policy. However, when the click model is used as a simulation environment for a ranking policy, we cannot assume that the ranking policy is always well-trained. Therefore, we examine whether the click model can provide stable performance in such bad cases, where the document lists are not reasonably ranked. Specifically, we shuffle the original, well-ranked document lists and generate clicks based on the new lists. The traditional metrics (e.g., LL and PPL) are not suitable for evaluating the newly generated clicks, because we do not have ground-truth click signals on shuffled document lists. In contrast, the Reverse/Forward PPL metrics proposed in Section 5.3 remain applicable.
Similar to Section 5.3, we generate different synthetic datasets using different target click models, where the input document lists are permuted from the original test set. We permute the original document lists in two different ways: half permutation and full permutation. In half permutation, we separately shuffle the first half (i.e., rank 1 to 5) and the second half of the list, ensuring that the position of a document does not change dramatically (e.g., from rank 10 to rank 1). In full permutation, we shuffle the whole list, so that a striking position change is allowed. After generating synthetic datasets, we train surrogate models to measure Reverse/Forward PPL. The results are displayed in Figure 5, from which we make the following observations: (1) Compared to NCM and CACM, AICM achieves the best and the most stable performance, regardless of whether the inputs are not permuted, half permuted, or fully permuted. This indicates that, whether or not the input lists are permuted, the data distribution of the synthetic dataset generated by AICM is consistently the closest to the real data distribution. This demonstrates that AICM is able to capture and simulate user behaviors even when it faces bad cases where input lists are not well ranked. (2) We sometimes observe a performance improvement (i.e., a decrease of Reverse/Forward PPL) when input lists are permuted. This phenomenon contradicts our initial intuition that the performance of a click model should decrease if input lists are not well ranked, i.e., are permuted. The reason for this phenomenon differs between AICM and the baselines.
- NN-based baseline models use MLE methods to cover the average pattern underlying the training set, which may sacrifice generalization.
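The half and full permutation schemes used to build the bad-case inputs can be sketched as follows. This is a minimal illustration, assuming each document list is a plain Python list; the function names `half_permute` and `full_permute` and the seeded generator are hypothetical.

```python
import random

def half_permute(docs, rng):
    """Shuffle the first and second halves separately, so that no document
    crosses the midpoint of the list."""
    mid = len(docs) // 2
    first, second = docs[:mid], docs[mid:]  # slices are copies
    rng.shuffle(first)
    rng.shuffle(second)
    return first + second

def full_permute(docs, rng):
    """Shuffle the whole list, allowing arbitrary position changes."""
    shuffled = list(docs)
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(0)            # seeded only for reproducibility
original = list(range(1, 11))     # document ids at ranks 1..10
half = half_permute(original, rng)
full = full_permute(original, rng)
```

Under half permutation, a document originally in ranks 1 to 5 always stays in ranks 1 to 5, whereas full permutation can move it anywhere in the list.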

Ablation Study (RQ4)
5.5.1 Ablation Study on Pre-training Strategy. Sufficient pre-training is necessary to apply adversarial training to sequence generative models [34]. In our experiments above, we also adopt the pre-training strategy to stabilize the adversarial training process. In this section, we conduct experiments to investigate the performance of AICM when the supervised pre-training is insufficient. The results are shown in Figure 6. Only training curves of negative LL are displayed since all metrics (i.e., LL, PPL and NDCG) show a similar trend. We observe that the pre-training strategy does not influence the final convergence of AICM, but only affects the range of performance fluctuation during training.
The discriminator provides reward guidance when training the generator. If no pre-training is applied or AICM is insufficiently pre-trained, the generator acts almost randomly at the beginning of training, and the discriminator can identify the generated click sequences as fake with high confidence. This leads to low rewards for almost every action the generator takes, which does not guide the generator in a useful direction and results in inferior performance at the beginning. However, as training goes on, the generator and the discriminator gradually learn from each other and finally converge, which shows the stability of AICM.

Ablation Study on Training Strategy.
In our experiments, we find that the stability of AICM depends heavily on the training strategy. More specifically, two hyper-parameters, the number of generator updates and the number of discriminator updates per epoch, have a large effect on the performance of AICM. Figure 7 shows the effect of these two hyper-parameters. In each epoch, we first train the generator for a given number of steps and then use the trained generator to generate a given number of synthetic trajectories. For each trajectory, the discriminator is trained for a given number of steps, so the total number of discriminator updates per epoch is the number of trajectories multiplied by the number of updates per trajectory. From Figure 7, we make the following observations: (1) Strategy 1, which is adopted in our experiments above, achieves the best performance. As the generator performs the best, the loss of the discriminator is higher than that of the other strategies. In addition, the fluctuation at the beginning is caused by adversarial updates between the generator and the discriminator, as explained in Section 5.5.1. (2) In strategy 2, the number of generator updates is much larger than the number of discriminator updates, so the generator is trained many times before the discriminator is updated once. This strategy results in fast convergence of the generator. However, in this case, the generator improves so quickly that the discriminator cannot be fully trained and gradually provides a misleading signal. That is why strategy 2 leads to worse performance than strategy 1. (3) [...] fake sequence generated by an insufficiently trained generator. Thus almost every synthetic sequence receives a low reward, which does not provide good guidance for the generator. (4) Compared to strategy 3, the total number of updates for the discriminator in strategy 4 is still 50. But in each epoch, we use the generator to generate 5 synthetic trajectories and update the discriminator 10 times for each trajectory. This alleviates the overfitting of the discriminator and provides a meaningful signal to the generator. Thus, the negative LL of the generator in strategy 4 is much better than in strategy 3. However, strategy 4 performs worse than strategy 1 since the discriminator is still overtrained.
From the analysis above, we conclude that AICM benefits from a proper ratio between generator updates and discriminator updates, which is in line with the theorem in [16]. It is important to balance the training of the generator and the discriminator. Only when the discriminator can consistently differentiate real data from generated data, and the generated data are not too easy to distinguish, can the supervision signal from the discriminator be meaningful and the whole adversarial training process be stable and effective.
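The per-epoch update schedule discussed above can be sketched as a simple sequence of steps. This is only an illustration of the counting: the function name `epoch_schedule`, the step labels, and the generator step count of 1 are hypothetical, while the 5 trajectories with 10 discriminator updates each correspond to the description of strategy 4.

```python
def epoch_schedule(g_steps, n_trajectories, d_steps_per_trajectory):
    """Return the ordered list of steps performed in one adversarial epoch:
    generator updates first, then, for each freshly sampled trajectory,
    a block of discriminator updates."""
    steps = ["generator_update"] * g_steps
    for _ in range(n_trajectories):
        steps.append("sample_trajectory")
        steps.extend(["discriminator_update"] * d_steps_per_trajectory)
    return steps

# Strategy-4-like setting: 5 trajectories x 10 updates = 50 discriminator
# updates per epoch. g_steps=1 is a placeholder, not a reported value.
schedule = epoch_schedule(g_steps=1, n_trajectories=5, d_steps_per_trajectory=10)
n_discriminator_updates = schedule.count("discriminator_update")
```

Spreading the same discriminator budget over several fresh trajectories, rather than spending it all on one, is what the comparison between strategies 3 and 4 attributes the reduced overfitting to.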

Ablation Study on Discount Factor $\gamma$. The discount factor $\gamma$ controls how far into the future we look when making the current decision. Typically, $\gamma$ is viewed as part of the problem. However, in practice, we need to tune this parameter to obtain the value that best suits a given task. In Figure 8, we show the PPL and negative LL performance on the test set w.r.t. different $\gamma$ values. The best performance is obtained at $\gamma$ = 0.1. This is a small value, corresponding to a large discount on future rewards. This is reasonable due to the existence of position bias: a state in the distant future often corresponds to a lower position and becomes less important.
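To make the effect of a small discount factor concrete, the discounted return can be sketched as below. This is a minimal illustration, assuming the usual definition of the return as the sum of rewards weighted by powers of the discount factor (written `gamma` here); the function name and the unit rewards are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards[k] * gamma**k, computed backwards for stability."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 0.1, a reward k ranks ahead is weighted by 0.1 ** k,
# so the contribution of lower positions vanishes quickly.
weights = [0.1 ** k for k in range(4)]
total = discounted_return([1.0] * 10, gamma=0.1)
```

With unit rewards at all ten ranks, the return is about 1.11, so roughly 90% of it comes from the current position alone, which matches the intuition that lower-ranked positions matter less.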

RELATED WORK
We first describe prior work on click models, and then discuss the connections and distinctions between AICM and previous GAN/GAIL based user simulation models.
Click Models. Traditional click models [13], which are based on the PGM framework, treat user behaviors as a sequence of observable and hidden events. They usually incorporate different assumptions on user behaviors to specify how documents and clicks at different positions affect each other. Richardson et al. [27] proposed the examination hypothesis, under which the probability of a click is decomposed into the examination probability and the document relevance. Different click models treat the examination probability differently. The simplest click model that follows the examination hypothesis is the position-based model (PBM) [14], which assumes that the examination probability depends only on the displayed position. Craswell et al. [14] proposed the cascade model (CM) by assuming that users sequentially scan each document in the list until the first click. CM can only handle query sessions with exactly one click. On the basis of CM, the user browsing model (UBM), dynamic Bayesian network (DBN), dependent click model (DCM), and click chain model (CCM) have been proposed to overcome this limitation.
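The examination hypothesis and the two simplest models built on it can be illustrated with a short sketch. It assumes the textbook forms of PBM (click probability equals rank-wise examination probability times relevance) and CM (the user scans top-down and stops at the first click); the function names and the numeric values are hypothetical.

```python
def pbm_click_probs(exam_by_rank, relevance):
    """PBM: P(click at rank r) = P(examine rank r) * P(doc at r is relevant)."""
    return [e * a for e, a in zip(exam_by_rank, relevance)]

def cascade_click_probs(relevance):
    """CM: rank r is clicked only if every earlier document was skipped,
    so P(click at r) = relevance[r] * prod_{j<r} (1 - relevance[j])."""
    probs = []
    p_reach = 1.0  # probability that the user reaches the current rank
    for a in relevance:
        probs.append(p_reach * a)
        p_reach *= 1.0 - a
    return probs

pbm = pbm_click_probs([1.0, 0.5], [0.4, 0.4])
cm = cascade_click_probs([0.5, 0.5, 0.5])
```

Because the cascade probabilities describe mutually exclusive first clicks, they sum to at most 1, which reflects why CM can only handle sessions with a single click.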
To obtain better expressive power and more flexible dependencies, NN-based approaches have been proposed. The neural click model (NCM) [6] is the first attempt to apply neural networks to click models. NCM represents user behaviors as a sequence of hidden states instead of binary events. Subsequent neural network based approaches also adopt this distributed representation framework. The click sequence model (CSM) [7] incorporates an encoder-decoder architecture, where the encoder computes contextual embeddings of the documents and the decoder predicts the position sequence of the clicked documents. The context-aware click model (CACM) [10] takes session-level information into consideration and separates the modeling of relevance and examination. Such methods suffer from exposure bias and inferior estimation, which AICM alleviates through dynamic modeling and adversarial training.
GAN/GAIL based User Simulation. The GAN/GAIL framework has been successfully adopted for user simulation in many previous works [2,11,30]. These works are mostly built as simulators to enhance RL-based recommendation agents, and they model cross-page interactions, which differs from AICM in problem definition. VirtualTaobao [30] models state transitions as turning to the next page or switching to another user, and users' actions are simply defined as their interactions with the whole page, i.e., buying, leaving, or turning the page. GAN-CDQN [11] models state transitions as turning to the next page, and defines users' actions as picking an item (or not) from the item set regardless of the order of the item list. Similarly, IRecGAN [2] also models state transitions as turning to the next page, and users' actions are defined as a click or not on a ranked item of the list.
Such modeling is restrictive and many important details of user behaviors might be lost, e.g., the rank of clicked items, the context of clicked items, and even which item is clicked (in VirtualTaobao and IRecGAN). Also, it cannot deal with multiple clicks, which are common in real-world applications. Moreover, none of the mentioned models can be used to evaluate a ranking function since they simply ignore the order of the ranked list. In AICM, we focus on users' interaction with a ranked list and model fine-grained user behavior within the ranked list, which provides useful information for both the training and evaluation of a ranking function.

CONCLUSION
In this work, we propose a novel learning paradigm for click models based on the imitation learning framework. We model users' interaction with a ranked list as a sequential decision-making process instead of one-step prediction, and learn a multi-step click policy from users' click logs as expert demonstrations. We base the users' current state on previous predictions and optimize for a long-term objective rather than a short-sighted one-step loss. With adversarial training, we learn a stable distribution which generalizes well across different ranked list distributions. Also, we explicitly build a reward function, which recovers users' intrinsic utility and underlying intentions. Theoretical analysis shows that our solution is capable of reducing the exposure bias from $O(T^2)$ to $O(T)$. Empirical studies on a real-world web search dataset demonstrate the effectiveness of our solution from different aspects. In future work, we will utilize AICM in the offline evaluation and optimization of ranking functions.