Reinforcement Recommendation with User Multi-aspect Preference

Formulating recommender systems within reinforcement learning (RL) frameworks has attracted increasing attention from both academia and industry. While many promising results have been achieved, existing models mostly simulate the environment reward with a unified value, which may hinder the understanding of users' complex preferences and limit the model performance. In this paper, we consider how to model user multi-aspect preferences in the context of RL-based recommender systems. More specifically, we base our model on the framework of deterministic policy gradient (DPG), which is effective in dealing with large action spaces. A major challenge for modeling user multi-aspect preferences lies in the fact that they may contradict each other. To solve this problem, we introduce Pareto optimization into the DPG framework. We assign each aspect a tailored critic, and all the critics share the same actor. The Pareto optimization is realized by a gradient-based method, which can be easily integrated into the actor and critic learning process. Based on the designed model, we theoretically analyze its gradient bias in the optimization process, and we design a weight-reuse mechanism to lower the upper bound of this bias, which is shown to be effective for improving the model performance. We conduct extensive experiments on three real-world datasets to demonstrate our model's superiority.


INTRODUCTION
Recommender systems, as an effective remedy for information overload, have been widely applied in a number of real-world applications, ranging from e-commerce [15] and social networks [13] to music radio [9] and health care [16]. Traditional models usually solve the recommendation task within supervised learning frameworks, which fail to consider its basically interactive nature and users' long-term engagement [2, 7, 20-22]. To alleviate these problems, recent years have witnessed an emerging trend of formulating the recommendation task as a reinforcement learning (RL) problem. Typically, the recommender system and the user are regarded as the agent and the environment, respectively [20, 21]. In each interaction step (see Figure 1(a)), the system takes an action (recommends an item), and the user responds to the action with a reward (e.g., a rating or a click). The final objective is to maximize the total reward over the whole interaction sequence.
In the research on RL-based recommender systems, existing models mainly focus on developing effective agents or simulating accurate environments to enhance model performance. However, little attention has been paid to designing rewards that reflect users' complex real-world preferences, which is crucial for aligning the learned agent with real user profiles. Previous methods usually approximate the reward with a single value, such as an integer rating reflecting the user's overall preference on the item, or a 0-1 value indicating whether the user has clicked/purchased the item. We argue that such a simple reward can be problematic from several perspectives. To begin with, a unified value can hardly distinguish the user's fine-grained preferences. Users with different preferences may give the same overall rating to the same item. For example, in Figure 1(b), users A and B both give item X a full rating, but A is more interested in aspect d, while B cares more about aspect a. Based only on the fact that both A and B like item X, we cannot distinguish these users and accurately match them with the candidate items Y and Z, which differ considerably on different aspects. Moreover, an ideal agent should recommend items that maximize the user's long-term engagement on all the aspects. However, a unified reward cannot provide enough signal to optimize the agent for specific aspects: a high (or low) overall reward does not necessarily mean the user likes (or dislikes) all the item aspects.
To alleviate the above problems, in this paper, we propose to model user multi-aspect preferences in the context of RL-based recommender systems. The singleton reward in RL models is extended to a reward vector, with each dimension corresponding to the user's preference on one item aspect. While modeling user multi-aspect preferences allows us to understand users more comprehensively, user personalities can be quite complex and diverse, and different aspect preferences may not align (or may even contradict) with each other. For example, in the hotel recommendation scenario, a user may enjoy the room size and environment, but these nice properties make the room cost more, which may lower the user's satisfaction with the price. Different aspect rewards may drive the model toward different optimal solutions, and it is hard to learn a unified agent which can simultaneously maximize all the rewards. To reasonably define and learn an optimal model in such a scenario, we introduce Pareto optimization into the framework, where we aim to learn an agent such that no other agent can concurrently increase all the aspect-level cumulative rewards. More specifically, we build our model on the deterministic policy gradient (DPG). We assign each item aspect a tailored critic, and different critics share the same actor. To better align the actor output with the users' real preferences, we introduce a supervised regularizer into the optimization process. To benefit model scalability, unlike previous heuristic strategies [8, 14], the Pareto optimization is integrated into our framework in a fully differentiable manner. We show that the introduction of Pareto optimization may bias the gradients, and we present the upper bound of this bias, which turns out to be related to the training batch size. Based on this theoretical result, a weight-reuse mechanism is further proposed to correct the gradient bias, and its effectiveness is verified in the experiments.
In summary, in this paper, we propose to model the users' potentially inconsistent multi-aspect preferences in RL-based recommender systems. To achieve this goal, we extend traditional DPG with multi-objective rewards based on Pareto optimization. We theoretically analyze the upper bound of our model's gradient bias, and propose a weight-reuse method to correct this bias. Extensive experiments are conducted on three real-world datasets to demonstrate our model's superiority.

BACKGROUND
For a clearer and more self-contained presentation, in this section, we briefly introduce the necessary background of this work.

Recommendation as an RL Problem
RL-based recommender models hold the promise of optimizing users' long-term utilities. In a typical RL formulation [7, 20], the action is the recommended item, the state is represented by the user's previously interacted items, and the reward is the rating given by the user to the item. At each step t, the agent (recommender) takes an action a_t based on the current state s_t, and the user (environment) responds to the action with a reward r_t. The state is transformed into s_{t+1} by incorporating a_t with s_t in a deterministic manner, that is, s_{t+1} = (s_t, a_t). After many agent-environment interactions, we obtain a set of trajectories (s_1, a_1, r_1, ..., s_T, a_T, r_T), and the goal is to maximize the sum of the rewards in these sequences.

Figure 1: (a) Recommendation as a reinforcement learning problem. (b) A toy example of recommendation with user aspect-level preferences. Both A and B like item X (i.e., scoring it with 5 stars), but their specific preferences are quite different: A is more interested in aspect d, while B casts more attention on aspect a. Based only on the overall ratings, the system cannot well match these users with the candidate items (e.g., Y and Z), whose qualities vary much on different aspects.

A promising RL model for solving the recommendation task is the deterministic policy gradient (DPG) [12]. It can well handle extremely large item sets by learning the actor within a continuous action space [7, 16, 19, 21]. Basically, DPG is an actor-critic framework. The critic is implemented as a deep Q-network Q(s, a|ϕ), which is learned by minimizing

L(ϕ) = E_i [ (y_i − Q(s_i, a_i|ϕ))^2 ],   y_i = r_i + γ Q(s_{i+1}, a_{i+1}|ϕ′),   (1)

where {(s_i, a_i, r_i, s_{i+1})} are the training samples and γ is a discount factor used to balance the short- and long-term rewards. The parameter ϕ′ is updated from ϕ, but at a slower pace for stabilized training. After optimizing the critic, the actor µ(s|θ) is learned by maximizing the Q function, that is,

max_θ  E_i [ Q(s_i, µ(s_i|θ)|ϕ) ].   (2)

In the whole training process, the critic and actor are alternately optimized until convergence.
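To make the alternating updates above concrete, the following is a minimal PyTorch-style sketch of one DPG training step. The network and optimizer objects (`actor`, `critic`, `target_actor`, `target_critic`, and the optimizers) are illustrative placeholders rather than the paper's exact implementation.

```python
import torch

def dpg_update(batch, actor, critic, target_actor, target_critic,
               actor_opt, critic_opt, gamma=0.9):
    """One DPG step: fit the critic to the TD target, then ascend Q w.r.t. the actor."""
    s, a, r, s_next = batch

    # Critic: minimize (y - Q(s, a))^2 with y = r + gamma * Q'(s', mu'(s'))
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = ((y - critic(s, a)) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: maximize Q(s, mu(s)), i.e. minimize its negation
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```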

Pareto Optimization
Pareto optimization stems from economics, and has recently been leveraged to solve multi-objective optimization problems (MOOP) [8, 11] in the machine learning community. In MOOP, the models are required to optimize a set of loss functions L(θ) = {L_1(θ), L_2(θ), ..., L_M(θ)}. Usually, it is hard to find a unified parameter θ which can simultaneously minimize all the L_m's. In such a scenario, Pareto optimization provides a reasonable way to define the optimal solutions. To begin with, different parameters are compared based on the following concept:

Definition 1. Pareto dominance. For two parameters θ̃ and θ, if L_m(θ̃) ≤ L_m(θ) for all m ∈ {1, ..., M}, and L_j(θ̃) < L_j(θ) for at least one j, then we say θ̃ dominates θ, denoted as θ̃ ≻ θ.

Intuitively, Pareto optimization aims to find a parameter θ*, such that no other parameter can concurrently decrease all the loss functions. We call such a parameter a Pareto efficient solution, which is formally defined as:

Definition 2. Pareto efficiency. For a parameter θ*, if there is no other θ̃ such that θ̃ ≻ θ*, then we say θ* is a Pareto efficient solution.
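As a small illustration of Definitions 1 and 2, the snippet below checks dominance and filters Pareto efficient candidates, assuming each candidate parameter is summarized by its vector of loss values (the candidate names and values are purely illustrative).

```python
from typing import Sequence

def dominates(losses_a: Sequence[float], losses_b: Sequence[float]) -> bool:
    """a dominates b: a is no worse on every loss and strictly better on at least one."""
    return (all(x <= y for x, y in zip(losses_a, losses_b))
            and any(x < y for x, y in zip(losses_a, losses_b)))

def pareto_efficient(candidates: dict) -> list:
    """Return the keys of candidates (name -> loss vector) not dominated by any other."""
    return [name for name, losses in candidates.items()
            if not any(dominates(other, losses)
                       for other_name, other in candidates.items()
                       if other_name != name)]

# Example: theta2 dominates theta1; theta2 and theta3 are both Pareto efficient.
print(pareto_efficient({"theta1": [0.5, 0.4], "theta2": [0.3, 0.4], "theta3": [0.6, 0.1]}))
```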
In this paper, we leverage Pareto optimization to extend DPG for optimizing multiple inconsistent rewards, and further apply the designed model to the task of recommender system.

PARETO DETERMINISTIC POLICY GRADIENT
In this section, we first define the problem studied in this paper, and then revise traditional DPG to make it compatible with multi-aspect rewards based on Pareto optimization (we call our model PDPG).
At last, we theoretically analyze the designed model by presenting the upper bound of its gradient bias, and propose a weight-reuse mechanism to lower this upper bound.

Problem Definition
Suppose we are given a user set U = {u_1, u_2, ..., u_{|U|}} and an item set I = {i_1, i_2, ..., i_{|I|}}. The interactions between the users and items are collected in the set O = {(u, i) | u has interacted with i, u ∈ U, i ∈ I}. For each element (u, i) ∈ O, the user can score the item from multiple aspects; for example, in hotel recommendation, a user can rate a room on its environment, size, price, etc. We define the rating set R as {r_{ui} | (u, i) ∈ O}, where each r_{ui} = {r_{ui,m}}_{m=1}^{M} represents a user's ratings on an item's different aspects, and M is the number of aspects. Given {U, I, O, R}, our task is to build an RL-based recommender model which maximizes the users' long-term engagement on all the aspects.

Multi-aspect Critic
Different from previous RL-based recommender models, where a unified reward is used in the optimization process, we have multiple rewards, which can be inconsistent with each other due to the users' diverse preferences on different item aspects. To handle these rewards well, we assign each aspect a tailored critic.
Formally, suppose we have M item aspects; then the critics are defined as {Q_1(s, a|ϕ_1), Q_2(s, a|ϕ_2), ..., Q_M(s, a|ϕ_M)}, and we optimize them by minimizing

L(ϕ_m) = E_i [ (y_{i,m} − Q_m(s_i, a_i|ϕ_m))^2 ],   (3)

where y_{i,m} = r_{i,m} + γ Q_m(s_{i+1}, a_{i+1}|ϕ′_m) is the target value for the mth critic, and r_i = {r_{i,m}}_{m=1}^{M} is the reward vector, with each r_{i,m} corresponding to the user's preference on the mth aspect of item a_i.
Remark. i) If we have some prior knowledge about the relationship between different item aspects, the corresponding Q_m's can partially (or fully) share their parameters, which makes our critic optimization similar to a multi-task learning problem [11]. ii) Remember that the Q values represent the user's long-term engagement on different aspects: a larger Q value means the user may prefer the corresponding aspect more. Thus, we can explain a recommendation by highlighting the aspect (e.g., x) with the largest Q value. A possible explanation template is: "we recommend this item to you because it can satisfy your long-term engagement on aspect [x]". Such an explanation focuses on the users' long-term preferences, which differs from previous short-term recommendation explanations.
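The per-aspect critic losses and the explanation-by-largest-Q idea can be sketched as follows. The containers (`critics`, `target_critics`, `target_actor`) and the batch layout are assumptions made for illustration, not the paper's exact code.

```python
import torch

def multi_aspect_critic_losses(batch, critics, target_critics, target_actor, gamma=0.9):
    """Per-aspect TD losses: each aspect m has its own critic Q_m, all sharing the actor.

    `critics`/`target_critics` are lists of M networks; batch reward r has shape (batch, M).
    """
    s, a, r, s_next = batch                      # r[:, m] is the rating on aspect m
    with torch.no_grad():
        a_next = target_actor(s_next)
    losses = []
    for m, (q_m, q_m_target) in enumerate(zip(critics, target_critics)):
        y_m = r[:, m:m + 1] + gamma * q_m_target(s_next, a_next)
        losses.append(((y_m - q_m(s, a)) ** 2).mean())
    return losses

def explain(s, a, critics, aspect_names):
    """Pick the aspect with the largest long-term value as the explanation (single sample)."""
    with torch.no_grad():
        q_values = [q_m(s, a).item() for q_m in critics]
    best = aspect_names[max(range(len(q_values)), key=q_values.__getitem__)]
    return f"recommended because it best satisfies your long-term preference on [{best}]"
```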

Pareto-efficient Actor
As mentioned in Section 2.1, the actor in DPG is learned by maximizing the Q function (i.e., equation (2)). However, in our framework there are multiple Q_m's, and it is difficult to find a unified θ which can maximize all the Q functions. Straightforwardly, we could average the different Q_m's with some predefined weights, and use single-reward models to learn the parameters. However, such a method is limited in two respects. On the one hand, different objectives may vary in scale and importance; to find appropriate weights, one has to grid search the value of each weight. For M aspects with d search points, the model has to be optimized d^{M−1} times, which is quite time-consuming and labor-intensive, especially when the number of aspects becomes larger. On the other hand, such a method only guarantees that the weighted sum of the Q functions is maximized, but there is no mechanism to make sure that each individual Q_m keeps increasing during the optimization process.
To alleviate these problems, we introduce Pareto optimization into the actor learning process. More specifically, we still average the different Q functions with weights w = {w_1, w_2, ..., w_M}, which induces the objective

l(θ) = E_i [ Σ_{m=1}^{M} w_m Q_m(s_i, µ(s_i|θ)|ϕ_m) ].   (4)

However, different from the previous methods, we dynamically adjust the weights to guarantee that the different Q functions can be simultaneously increased, finally reaching a Pareto efficient solution.
Formally, the weights are determined by solving the following quadratic programming (QP) problem:

min_w || Σ_{m=1}^{M} w_m ∇_θ Q_m(s, µ(s|θ)) ||_2^2   s.t.   w^T 1 = 1,  w_m ≥ 0 ∀m,  e_k^T w ≥ b_k ∀k,   (5)

where 1 is an all-one vector, and {(e_k, b_k)} are optional preference vectors and values. Remark. i) With the preference vectors and values, we implicitly incorporate prior knowledge on different aspects into the training process. For example, if e_k is a one-hot vector, then the corresponding constraint in equation (5) sets an importance level b_k for the associated Q function. More general constraints can also be added according to the specific application. ii) w can be computed as long as we know ∇_θ Q_m(s, µ(s|θ)), which is just the gradient with respect to θ originally needed for optimizing the actor. This means that the gradient information is reusable, and our adopted Pareto optimization method can be smoothly infused into the DPG framework.
To see why the weights derived from equation (5) lead to a Pareto efficient solution, we have the following theorem:

Theorem 1. If w is determined by solving the quadratic programming (QP) problem (5), then one of the following holds: i) the optimal value of problem (5) is 0, and the current θ is a local Pareto efficient solution; ii) the combined direction d = Σ_{m=1}^{M} w_m ∇_θ Q_m(s, µ(s|θ)) is a common ascent direction, along which none of the Q functions decreases.
For i), if the optimal value of problem (5) is 0, then no direction can simultaneously increase all the Q functions, i.e., θ cannot be improved to increase all the Q functions; thus a local Pareto efficient solution is achieved [6, 11].
For ii), we write the Lagrangian of problem (5) as

Λ(w, λ, ν) = || Σ_{m=1}^{M} w_m ∇_θ Q_m ||_2^2 + λ (1 − w^T 1) − ν^T w,   (6)

where, for brevity, we omit the preference constraints. Denoting d = Σ_{m=1}^{M} w_m ∇_θ Q_m, the KKT conditions for this Lagrangian yield 2 (∇_θ Q_m)^T d = λ + ν_m with ν_m ≥ 0 and ν_m w_m = 0 for each m. Multiplying by w_m and summing over m gives λ = 2 ||d||_2^2, and therefore

(∇_θ Q_m)^T d ≥ ||d||_2^2 ≥ 0,  ∀m ∈ {1, ..., M}.   (7)

Recall that ∇_θ Q_m(s, µ(s|θ)) is exactly the gradient used to update the actor for the mth aspect, so d is a common direction along which none of the Q functions decreases (to the first order). Once we have determined w, the actor parameter can be updated by θ ← θ + α_θ d, where α_θ is the learning rate.
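A possible way to solve problem (5) in practice is a small constrained quadratic program over the simplex. The sketch below uses SciPy's SLSQP solver as one convenient (but not necessarily the paper's) choice, with optional preference constraints e_k^T w ≥ b_k; all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def pareto_weights(grads, pref_vectors=None, pref_values=None):
    """Solve problem (5): min_w || sum_m w_m * g_m ||^2 over the simplex,
    optionally with preference constraints e_k^T w >= b_k.

    `grads` is an (M, d) array whose rows are the actor gradients of each Q_m.
    """
    G = np.asarray(grads)                       # (M, d)
    M = G.shape[0]
    gram = G @ G.T                              # (M, M) Gram matrix of the gradients

    objective = lambda w: w @ gram @ w          # ||sum_m w_m g_m||^2
    constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]
    if pref_vectors is not None:
        for e_k, b_k in zip(pref_vectors, pref_values):
            constraints.append({"type": "ineq",
                                "fun": lambda w, e=np.asarray(e_k), b=b_k: e @ w - b})
    res = minimize(objective, np.full(M, 1.0 / M), bounds=[(0.0, 1.0)] * M,
                   constraints=constraints, method="SLSQP")
    return res.x

def actor_direction(grads, w):
    """Common ascent direction d = sum_m w_m * grad_theta Q_m, used in theta <- theta + alpha * d."""
    return np.asarray(w) @ np.asarray(grads)
```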
Supervised regularization. Above, the policy µ(·) is learned only based on the Q-values. To further constrain the optimal actions within a reasonable and safe space [16], we align the output of µ(·) with the user's real preference on different items. Specifically, we regularize µ(·) in a supervised manner: the predicted action is forced to be close to the items with positive feedback, and simultaneously stay away from the negative ones. The objective to be maximized is

L(θ) = Σ_i Σ_{o ∈ {a_i} ∪ N_i} [ y_{i,o} log σ(q_o^T µ(s_i|θ)) + (1 − y_{i,o}) log(1 − σ(q_o^T µ(s_i|θ))) ],   (8)

where q_o is the embedding of item o, N_i is a set of negatively sampled items, σ(·) is the sigmoid function, and y_{i,o} is 1 if o is exactly the user's purchased item a_i, and 0 for the negatively sampled items. In this objective, the knowledge learned from the supervision signal is expected to influence the RL model, such that the predicted action does not drift far from the users' real preferences.
When optimizing L(θ), we could straightforwardly merge l(θ) (i.e., equation (4)) and L(θ) with a hyper-parameter (e.g., β), or learn them alternately. However, both of these methods are suboptimal. For the former, even if we determine w by solving problem (5), the gradient direction ∇_θ(l(θ) + βL(θ)) does not necessarily satisfy equation (7), which is crucial for achieving the Pareto efficient solution. For the latter, after optimizing l(θ), the parameter θ will be further changed by L(θ), which cannot guarantee that all the Q-values keep increasing.
To overcome these drawbacks, one may notice that the objective (8) can be rewritten as L(θ) = Q̄((o, y_i), µ(s_i|θ)), which resembles a Q function if we regard (o, y_i) as a pseudo-state. Thus we treat L(θ) as a special Q function, and incorporate it into equations (4) and (5) to learn the Pareto efficient solution jointly, where we assign Q̄ an additional Pareto weight w̄. With this method, all the Q functions and L(θ) can be simultaneously optimized along a non-decreasing direction.
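As an illustration of the pseudo-Q view, the snippet below sketches one plausible instantiation of objective (8) as a scalar function of the predicted action and the positive/negative items. The binary cross-entropy over inner products is an assumption made for illustration; its gradient w.r.t. θ would simply be appended to the M critic gradients before solving problem (5), with its own Pareto weight w̄.

```python
import torch
import torch.nn.functional as F

def supervised_pseudo_q(actor, s, pos_items, neg_items, item_emb):
    """Pseudo Q-bar((o, y), mu(s)): push mu(s) toward purchased items (y = 1)
    and away from negatively sampled ones (y = 0). Returned value is to be maximized."""
    action = actor(s)                                   # (batch, emb_dim)
    pos_score = (action * item_emb(pos_items)).sum(-1)  # purchased items
    neg_score = (action * item_emb(neg_items)).sum(-1)  # negative samples
    return -(F.binary_cross_entropy_with_logits(pos_score, torch.ones_like(pos_score))
             + F.binary_cross_entropy_with_logits(neg_score, torch.zeros_like(neg_score)))
```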
Learning algorithm. We present the whole training procedure of our framework in Algorithm 1. To begin with, a number of transitions are generated according to the current policy and pushed into the replay buffer B (lines 5-11). Then the critics are optimized based on the multi-aspect ratings (lines 12-19). Next, we derive the gradients of Q_m (and Q̄) w.r.t. θ, which are used for computing w and for actor learning (lines 21-23). The weight w is computed by solving problem (5) (line 24). Based on the learned w, the actor is optimized by stochastic gradient ascent (lines 25-27). At last, the target parameters are updated in a soft manner (lines 28-29).
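As a small concrete piece of the procedure, the soft target update at the end of each round can be sketched as follows; the function name and the default τ are illustrative, not the paper's exact values.

```python
import torch

def soft_update(target_net, online_net, tau=0.01):
    """Soft target update (lines 28-29): phi' <- tau * phi + (1 - tau) * phi'."""
    with torch.no_grad():
        for p_target, p in zip(target_net.parameters(), online_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)
```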

Implementation of the Critic and Actor
Before describing the architectures of the critic and actor, we first introduce how to derive the environment state s. In an RL-based recommender system, the state summarizes the user's current status. In our model, it is computed from the embeddings of a user and her previously interacted items: for a user u whose interacted items are {i_1, i_2, ..., i_{l_u}}, the state s is obtained by aggregating p_u and q_{i_1}, ..., q_{i_{l_u}}, where p_u and q_{i_m} are the user and item embeddings, respectively.

Algorithm 1: The training procedure of PDPG. At each step, an action a_t = µ(s_t|θ) + N_t is selected with exploration noise N_t and executed to obtain the new state s_{t+1} and the reward vector r_t = {r_{t,1}, r_{t,2}, ..., r_{t,M}}; the transition {s_t, a_t, r_t, s_{t+1}} is pushed into the replay buffer B, and the critics, Pareto weights, actor, and target networks are then updated as described above.

In our critic, a state-action pair is transformed into a Q-value by a two-layer neural network, that is, Q(s, a|ϕ) = W_2 ReLU(W_1 (s ⊕ a) + b_1) + b_2, where ⊕ denotes the concatenation operation, and a is the output from the actor (when updating the actor and computing the target Q) or the real item embedding (when updating the critic). In the actor, we project a state into an action in a deterministic manner through another two-layer network µ(s|θ), whose parameters are {W_4, W_3, b_4, b_3, W_A}. Once our model is learned, the final recommendations are generated by selecting the items whose embeddings are closest to the output of µ(s) [19].
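For concreteness, a PyTorch sketch of the two networks is given below; the hidden size, activations, and layer layout are illustrative assumptions consistent with the two-layer structure described above, not the exact architecture. In this sketch the critic scores a single aspect; PDPG instantiates one such network per aspect (plus the pseudo-Q), all sharing the actor.

```python
import torch
import torch.nn as nn

class AspectCritic(nn.Module):
    """Two-layer network mapping a concatenated (state, action) pair to one Q-value."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

class Actor(nn.Module):
    """Two-layer network mapping a state deterministically to an action in item-embedding space."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())

    def forward(self, s):
        return self.net(s)
```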

Analysis of the Gradient Bias
In the sections above, we have described the implementation of our framework. Here, we provide some theoretical insights into the designed model. In the field of neural network optimization, gradient-based methods are common and effective. In this section, we are interested in whether the gradients used in our model are biased relative to the true gradients, and if so, how large this bias is.
Suppose we have a series of training batches {B_1, B_2, ...}, each composed of Z samples. For the ith training step, w is derived from B_i via problem (5). We re-denote w by w(B_i) to highlight the relation between w and B_i, and the objective for training batch B_i is

l̄(θ) = (1/Z) Σ_{s ∈ B_i} Σ_{m=1}^{M+1} w_m(B_i) Q_m(s, µ(s|θ)),

where Q_{M+1} denotes the pseudo Q function Q̄ introduced above. For ease of derivation, we denote ∇_θ Q_m(s, µ(s|θ)) by f_m(s; θ) ∈ R^d, where θ is assumed to be a d-dimensional vector, and write f̄_m(B_i; θ) = (1/Z) Σ_{s ∈ B_i} f_m(s; θ) and f_m(θ) = E_s[f_m(s; θ)]. Let the mini-batch stochastic gradient be ∇_θ l̄(θ) = Σ_{m=1}^{M+1} w_m(B_i) f̄_m(B_i; θ), and let the true gradient ∇_θ l(θ) = Σ_{m=1}^{M+1} w_m(B_i) f_m(θ) be its full-data counterpart. Their discrepancy is represented as

G = || E_{B_i}[ ∇_θ l̄(θ) − ∇_θ l(θ) ] ||_2^2.   (14)

For G, we have the following theorem:

Assumption 1. For each objective m, the batched gradient of the action-value function is an unbiased estimate of the true gradient and follows a normal distribution, that is, f̄_m(B_i; θ) ∼ N( f_m(θ), (σ^2/Z) I ), where I ∈ R^{d×d} is an identity matrix and σ is a scalar.

Theorem 2. Under Assumption 1, we have

G ≤ (M + 1) σ^2 d / Z.   (15)

Here, we present a sketch of the proof; the complete version can be found in the Appendix.
Proof. Write ḡ_m = f̄_m(B_i; θ) and g_m = f_m(θ). Since w_m(B_i) ≥ 0 and Σ_{m=1}^{M+1} w_m(B_i) = 1, by Jensen's inequality we have

G ≤ E_{B_i}[ || Σ_m w_m(B_i)(ḡ_m − g_m) ||_2^2 ] ≤ E_{B_i}[ Σ_m w_m(B_i) || ḡ_m − g_m ||_2^2 ] ≤ Σ_{m=1}^{M+1} E_{B_i}[ || ḡ_m − g_m ||_2^2 ].

According to Assumption 1, each dimension j satisfies ḡ_{m,j} − g_{m,j} ∼ N(0, σ^2/Z), so v_j = √Z (ḡ_{m,j} − g_{m,j}) / σ ∼ N(0, 1) and Σ_{j=1}^{d} v_j^2 follows the Chi-square distribution χ^2_d. Thus E_{B_i}[ || ḡ_m − g_m ||_2^2 ] = (σ^2/Z) E[χ^2_d] = σ^2 d / Z, which yields the bound in equation (15). □

From this theorem, we can see that the gradients used in our model are biased, and the upper bound of the bias is in inverse proportion to the batch size. This suggests that a larger batch size lowers the upper bound of G, and we may potentially learn more accurate parameters. However, a larger batch size also consumes more computational resources (e.g., GPU memory). Looking deeper into this theorem, one may find that if w_m were not related to B_i, then w_m(B_i) could be moved out of the expectation in equation (14). In that case, G becomes 0, given the unbiasedness in Assumption 1. Inspired by this phenomenon, we design the following "weight-reuse" mechanism.
Weight-reuse mechanism. In this method, we introduce a container W ∈ R^{L×(M+1)} for storing previously derived Pareto weights. For each training batch B_i, w ∈ R^{M+1} is not always computed by solving problem (5). We first check the weights in the container: (1) If there is a candidate w* ∈ W whose associated gradient information is still sufficiently close to that of the current step (so that reusing it keeps all the Q functions non-decreasing), then we set w = w* (if there are multiple such w*'s, we sample one of them). Since the weights in W are not derived from B_i, the bias G becomes 0 in this case.
(2) If there is no such weight in W, we solve problem (5) to derive w, which is then pushed into the container for future reuse. In this case, G is not 0, but it is bounded by equation (15).
From the above analysis, we can see that under the weight-reuse mechanism, the bias G has a chance of becoming 0, and thus the overall upper bound is lowered. In practice, the container size is fixed at L, and the earliest weights are moved out when the container is full. The instances used for deriving the weights in W are temporarily frozen to avoid being sampled into B_i and introducing dependency (here, we assume that different samples are independent).
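A minimal sketch of the weight-reuse container is given below. The acceptance test `is_reusable` is left abstract because the exact reuse criterion depends on the formulation above; all names are illustrative placeholders.

```python
from collections import deque

class WeightContainer:
    """FIFO container of the L most recent Pareto weights for the reuse mechanism."""
    def __init__(self, capacity):
        self.weights = deque(maxlen=capacity)   # oldest weights drop out automatically

    def find_reusable(self, grads, is_reusable):
        """Return a stored w* that still qualifies for the current gradients, or None."""
        candidates = [w for w in self.weights if is_reusable(w, grads)]
        return candidates[0] if candidates else None   # pick/sample one if several qualify

    def push(self, w):
        self.weights.append(w)

def pareto_weights_with_reuse(grads, container, solve_qp, is_reusable):
    """Reuse a stored weight when possible (bias G = 0), otherwise solve problem (5)."""
    w = container.find_reusable(grads, is_reusable)
    if w is None:
        w = solve_qp(grads)      # e.g. the pareto_weights sketch above
        container.push(w)
    return w
```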

RELATED WORK

RL-based Recommender Models
Users' long-term engagement in recommender systems has recently attracted increasing attention [23-25]. To capture such information, reinforcement learning, as a powerful tool for balancing short- and long-term rewards, has become an appealing framework for building recommender models. Previously, many models focused on designing effective agents to generate accurate recommendations. For example, [20] proposes a GRU-based model to capture user historical behaviors, which is then incorporated into the DQN framework. [22] also bases itself on DQN, but further involves more contextual features and continuous time information. [7, 19, 21] build their models based on DPG. Since the item set (action space) can be very large in real-world recommender systems, these models can be more efficient when learning the agent. Meanwhile, many models have been proposed to build better user environments for providing more reliable rewards. For example, [2] explicitly builds a model to simulate the user decision process, and the user model and recommender agent are jointly learned. [1] leverages model-based RL to formulate the recommendation task, where the user environment is explicitly learned to accommodate the recommender agent. Existing models mainly focus on how to design effective agents or environments, while little effort has been devoted to studying the rewards, which are important for understanding the users and learning more accurate recommendation policies. In our work, we take a step towards more comprehensive user reward shaping, where we explicitly model the users' diverse preferences on different item aspects.

Pareto Optimization
In many real-world problems, machine learning models usually need to simultaneously optimize multiple objectives. Different objectives may not always be consistent with each other, and the optimal parameter for one objective may not perform well on the others. In such a scenario, Pareto optimization provides a reasonable way to trade off different objectives. Specifically, under the supervised learning framework, [11] leverages multi-objective optimization techniques to solve the multi-task learning problem. [6] extends [11] by adding preference vectors for generating a more evenly distributed Pareto frontier. Many efforts have also been devoted to applying Pareto optimization to enhance the reinforcement learning framework. Typically, models such as [8, 14] study how to design multi-objective DQNs based on heuristic Pareto optimization strategies. These methods have achieved promising results for problems with small and discrete action sets. However, little attention has been paid to extremely large or continuous action spaces, which are nonetheless important for real-world applications. In this paper, we fill this gap by extending DPG with multi-objective rewards, and, more importantly, we theoretically analyze the designed model by presenting and lowering the upper bound of its gradient bias.

EXPERIMENTS
In this section, we conduct extensive experiments to demonstrate the effectiveness of our model, focusing on the following research questions: RQ1: Can our model outperform the state-of-the-art methods? RQ2: How do the different components of our model contribute to the final results? RQ3: How do different hyper-parameters influence our model's performance?
We begin with the experiment setup, and then present and analyze the results to answer the above questions.

Datasets. We evaluate our model on three real-world datasets: RateBeer, BeerAdvocate, and TripAdvisor. RateBeer and BeerAdvocate contain users' reviews on different beers, while TripAdvisor is a travel dataset including the hotel ratings from customers. In all these datasets, in addition to an overall rating, we also have users' ratings on different item aspects. For each beer in RateBeer and BeerAdvocate, people are allowed to rate its appearance, aroma, palate, and taste. For the hotels in TripAdvisor, we have user ratings on the service, cleanliness, value, sleep quality, rooms, and location. The ratings in these datasets are scaled into the range [1, 10]. The statistics of the datasets are summarized in Table 1. We can see that these datasets cover different characteristics; e.g., TripAdvisor is small and sparse, while RateBeer is much larger and denser. Based on these diverse datasets, our model can be evaluated under different settings in a comprehensive manner.

Baselines. We select the following representative methods as our baselines:
• BPR [10]: This is a well-known recommender method for modeling user implicit feedback. We use matrix factorization as its predictive function in the experiments.
• NCF [3]: This is a state-of-the-art deep recommender model, where the user and item representations are fed into multiple nonlinear layers to predict the final results.
• EFM [17]: This is a well-known explainable recommender model, where the user preference on different item aspects is incorporated into the matrix factorization method.
• MATF [5]: This is a multi-aspect recommendation model based on tensor factorization, where we optimize it based on the pair-wise BPR loss for fair comparison.
• GRU4Rec [4]: This is a sequential recommender model, where the interacted items are modeled by a recurrent neural network.
• DRR [7]: This is a recently proposed RL-based recommender model, where the overall rating of each user-item pair is regarded as the reward.
Environment simulation. Ideally, a model should be trained and evaluated in an online recommender system to obtain real user rewards. However, unlike classical RL problems (e.g., playing Atari games), where we can interact with the environment and obtain the reward with little effort, it is costly and unsafe to directly deploy an immature RL model onto real-world systems [2, 7]. Thus, we follow previous works [2, 7, 21] and build simulators to approximate the user reward generation process. In general, the simulator should balance simplicity (efficiency) and performance (effectiveness), such that our RL model can be trained with acceptable speed and accuracy. Our simulator is designed as a two-layer fully connected neural network with ReLU as the activation function. The input is a state-action pair, and the outputs are the estimated ratings for the different item aspects. The user simulator is learned based on the training and validation sets, and its average rating prediction performance is satisfactory in terms of RMSE, which is about 0.96 for RateBeer and BeerAdvocate, and 0.91 for TripAdvisor, respectively.

Implementation details. Following the common practice [18], we first chronologically organize each user's interacted items as a sequence. We then split the sequence when the time interval between successive interactions is larger than some threshold, which results in many shorter but more coherent sessions. We leverage each user's last 30% of sessions as the testing set, while the others are left for training. The commonly used metrics F1@5, NDCG@5 and Cumulated Reward (Cum-Reward) are adopted to evaluate our models. Among these metrics, F1 measures the overlap between the recommended items and the ground truth. NDCG is a ranking-based metric, where a hit with a higher-ranked prediction contributes more to the final results. Cumulated reward is utilized to evaluate the users' long-term satisfaction, and we report the sum of the rewards for each aspect in the testing episodes. In our model, we leverage stochastic gradient descent (SGD) to optimize the parameters, and the learning rates for the actor and critic are tuned to their optimal values.

Overall Comparison (RQ1). The comparison results are presented in Table 2, from which we can see:
• On different datasets, NCF, EFM and MATF perform better than BPR in most cases, which agrees with the previous work [3, 17]. The reason can be that NCF leverages neural networks to model non-linear user-item relationships, while EFM and MATF are able to incorporate user multi-aspect preferences into their modeling process. As a result, they all exhibit better performance than BPR.
• It is interesting to see that the sequential model GRU4Rec does not achieve superior performance over the non-sequential ones. We speculate that the sequential patterns in our datasets are not significant: the users may comment on the beers or hotels in a quite random manner. Leveraging recurrent architectures, such as GRU, to model our data may impose too strong assumptions, which can lead to unsatisfactory performance.
• DRR can usually obtain larger cumulated rewards than the other baselines, which verifies the capability of RL for modeling users' long-term engagement on the recommendation task. In the other cases, DRR does not perform very well, which may imply that maximizing the expected Q values does not always align with the accuracy-based metrics, such as F1 and NDCG.
• Encouragingly, by incorporating multi-objective rewards into the DPG framework, our model achieves the best performance on all the metrics across all the datasets. This observation demonstrates the effectiveness of our model, and positively answers the first research question. Compared with DRR, modeling multi-aspect preferences enables us to profile the users more comprehensively, and the incorporated supervised regularizer compensates the Q-function optimization by constraining the generated actions into a safe space. Both of these designs help to better understand the users and improve the final recommendation performance. Compared with the other baselines, which only optimize the users' immediate preference, our model can appropriately trade off the short- and long-term user engagements, which leads to superior results on the different metrics.

Table 2: Performance comparison between the baselines and our model. For each metric on different datasets, we use bold fonts and * to label the best performance and the best baseline performance, respectively. Impr. is short for improvement, and the last column shows the relative improvement of our results over the best baseline. BeerAd and TripAd are short for the BeerAdvocate and TripAdvisor datasets. The aspects are abbreviated by the capitalized first two letters, e.g., AP is short for appearance.

Ablation Study (RQ2).
In the above section, we have evaluated our model as a whole. In order to verify whether the different model components are useful for the final result, we conduct ablation studies in this section. In the experiments, the model parameters are fixed at their optimal values, and the performance is evaluated based on F1@5 and NDCG@5, respectively. We are interested in the following questions: (1) Is Pareto optimization necessary? (2) Does the weight-reuse mechanism benefit the performance? (3) Can the supervised regularizer improve the evaluation results? (4) Does the Q-function lead to better actor optimization? To answer these questions, we compare our model with the following five variants: (i) PDPG (random pooling): in this method, the different Q-functions are merged with a set of random weights. (ii) PDPG (average pooling): in this method, we directly average the different Q-functions. In both PDPG (random pooling) and PDPG (average pooling), the weights for the different Q functions are fixed during the optimization process. (iii) PDPG (−reuse): in this method, we drop the weight-reuse mechanism. (iv) PDPG (−super): in this method, we do not use the supervised regularizer (i.e., equation (8)), and the actor is solely optimized based on the Q-functions. (v) PDPG (−Q): in this method, we drop the Q-functions, and only use equation (8) to learn the actor in a supervised manner. We present the comparison results in Table 3.
• We can see that the winner and the performance gap between PDPG (random pooling) and PDPG (average pooling) vary across datasets; e.g., on RateBeer, PDPG (random pooling) shows slightly inferior performance to PDPG (average pooling), while on TripAdvisor, PDPG (random pooling) outperforms PDPG (average pooling) by a considerable margin. This result indicates that the weights for merging the different Q functions can be data dependent, and we may need to search a large space to determine their optimal values. An encouraging observation is that our final model can consistently achieve better performance than both of these variants. This observation verifies the effectiveness of introducing Pareto optimization to coordinate the different learning objectives. Based on the weights derived from problem (5), all the targets are continually optimized along a non-decreasing direction, which is shown to be effective in promoting the final recommendation performance.
• Compared with the final model, if we drop the weight-reuse mechanism, the performance decreases on all the datasets. This result is as expected, and positively answers question (2). An interesting observation is that, in some cases (e.g., on RateBeer), despite leveraging Pareto optimization, PDPG (−reuse) performs worse than the fixed-weight models, which shows that the simple Pareto optimization method does not necessarily bring improved performance. As mentioned in Theorem 2, the parameter gradients are biased after introducing the Pareto optimization, which may negatively impact the actor learning process. As a remedy, we design the weight-reuse mechanism to lower the upper bound of the bias, such that the actor optimization is better aligned with the true gradients, which is shown to be effective by the superior performance of the final model.

Table 3: Comparison between our model and its variants. We use bold fonts to label the best performance. PDPG (ran) and PDPG (ave) are short for PDPG (random pooling) and PDPG (average pooling), respectively. To save space, we omit the recommendation number "@5".
• For questions (3) and (4), we can see that neither PDPG (−super) nor PDPG (−Q) can outperform the complete model, and the results are consistent across all the datasets. These observations imply that both the Q functions and the supervised regularizer are necessary for the final performance, which verifies our claims in the previous sections. The balance between these components is also studied by tuning the hyper-parameters in the following experiments.

Hyper-parameter Study (RQ3). In this section, we first study the importance of the supervised regularizer, and then we investigate the influence of the batch size and the discount factor on the final results, respectively. When studying one parameter, we fix the other ones at their optimal values.
Study on the importance of the supervised regularizer. With the supervised objective (8), we aim to regularize the action into a safe space which is not far from the users' real preferences. In this section, we study the importance of this supervised regularizer for the final performance. More specifically, we constrain the weight w̄ of the regularizer by imposing different preference vectors and values. For example, if we want to set the importance level as 0.1, then the preference vector and value are set as [0, 0, 0, 0, 0, 1] and 0.1, respectively. We tune the importance level in the range [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], and the results are presented in Figure 2. We can see that the optimal importance level is small for RateBeer and BeerAdvocate, while larger for TripAdvisor. Considering that TripAdvisor is a much sparser dataset, the optimization of the Q functions can be highly insufficient. In such a scenario, we speculate that the supervision signal is more needed to compensate the Q-function learning, which leads to better performance when the importance level is higher.
Study on the batch size. Batch size is an important hyper-parameter for training neural models. In this section, we tune the batch size in the range [32, 64, 128, 256, 512, 1024], and the performance is evaluated based on F1@5 and NDCG@5, respectively. From the results shown in Figure 3, we can see that the best results are usually achieved when the batch size is relatively large. This observation agrees with Theorem 2: a larger batch size lowers the upper bound of the gradient bias, which may potentially correct the gradient error and improve the model performance.
Study on the discount factor. In the context of RL-based recommender systems, γ is used to balance the short- and long-term rewards. A smaller γ pays more attention to the users' immediate preferences, while a larger γ puts more focus on future engagement. In this experiment, we study the influence of γ by tuning it in the range [0.1, 0.3, 0.5, 0.7, 0.9]. The results are presented in Figure 4. We find that the optimal γ varies across datasets. For example, on TripAdvisor and RateBeer, a smaller γ leads to better results, while on BeerAdvocate, a moderate γ is preferred. This observation suggests that γ is sensitive to the dataset, and should be carefully tuned in practice to achieve the best performance.

CONCLUSION
In this paper, we propose to capture user multi-aspect preferences in the context of RL-based recommender systems. To this end, we extend the traditional deterministic policy gradient with multi-objective rewards, and seamlessly infuse Pareto optimization into the modeling process. We provide a theoretical analysis of the designed framework, and also propose a mechanism to correct the gradient bias. To demonstrate our model's effectiveness, extensive experiments are conducted on three real-world datasets.
Different from previous RL-based recommender models, which mostly focus on the design of the agent or the environment, this paper opens the door to modeling complex or even conflicting user rewards. We believe there is still much room for improving this work. To begin with, we can make a more thorough study of the preference vectors and values, based on which we plan to propose the concept of "aspect-level" fairness, that is, different aspects should be optimized equally in the training process. In addition, we may also design more advanced weight-reuse mechanisms, such that the optimal Pareto weights can be found more efficiently.