Rectifying Unfairness in Recommendation Feedback Loop

The issue of fairness in recommendation systems has recently become a matter of growing concern for both the academic and industrial sectors due to the potential for bias in machine learning models. One such bias arises from feedback loops, where the collection of data from an unfair online system hinders the accurate evaluation of the relevance scores between users and items. Given that recommendation systems often recommend popular content and vendors, the underlying relevance scores between users and items may not be accurately represented in the training data. This creates a feedback loop in which items are no longer recommended to users based on their true relevance scores but instead based on biased training data. To address this problem, we propose a two-stage representation learning framework, B-FAIR, aimed at rectifying the unfairness caused by biased historical data in recommendation systems. The framework disentangles the context data into sensitive and non-sensitive components using a variational autoencoder and then applies a novel Balanced Fairness Objective (BFO) to remove bias in the observational data when training a recommendation model. The efficacy of B-FAIR is demonstrated through experiments on both synthetic and real-world benchmarks, showing improved performance over state-of-the-art algorithms.


INTRODUCTION
Popular online recommendation platforms such as Amazon, Yelp, and TikTok aim to help customers browse items online in an efficient manner by recommending personalized items that users might be interested in. By providing a recommendation policy, these platforms can connect customers and item producers by training their models on a huge corpus of self-collected logged data. However, the question of social responsibility in recommendation systems has recently attracted a lot of attention. In particular, users are wondering how recommendation systems work and whether there are any biases in these systems [5,30].
Consider the advertisement industry in the cosmetics sector as an example. The logged training data in this scenario is likely to contain biases that result in the overwhelming recommendation of makeup products to women. However, some women may not have an interest in cosmetics. This phenomenon is due to unfair feedback loops, where the correlation between women and makeup is amplified at each recommendation iteration. While gender is considered sensitive information, and may not be explicitly used as an input to the recommendation system, it can be inferred from other inputs that are correlated with gender, such as purchasing history or social media following. Additionally, there are other biases, such as item-specific biases, where popular items tend to receive more exposure than less popular ones, limiting the audience for smaller vendors [36,45].
The unfairness problem arising from training recommendation models on biased logged data is well-known [24,30]. Building a model on data that was generated by one's own recommendation system inevitably creates a feedback loop which amplifies biases [9,17] and in which the true relevance score is harder to determine.
Traditionally, most existing methods design fairness constraints based on sensitive information to guarantee that groups or individuals are treated fairly under the online policy [20-22,25]. These model-based approaches mainly focus on fair predictions, i.e., exposure being independent of the sensitive attributes, which can degrade the performance of the original recommendation model [24,39].
Contrary to previous methods, we tackle the fairness problem from a data debiasing perspective. Specifically, we focus on learning debiased representations of the data, which can then be used as inputs for any downstream recommendation model. This can be seen as a fair feature extraction step before applying the actual recommendation model. The main goal in our case is to determine the unbiased true relevance score of each user-item pair and thus achieve fairness by unbiasedly recommending items to the user.
Hence, this data debiasing approach is orthogonal to standard fairness constraint-based methods, and the two can be combined [11,12]. We leave this combination for future work as it is out of the scope of this paper.
More precisely, in this work we introduce a new concept of fairness, called balanced fairness, from a data debiasing perspective. We argue that a model is balanced fair and does not suffer from feedback loops if it has been trained on a dataset in which recommendations were selected uniformly conditioned on sensitive attributes (further discussed in Section 3). In this scenario, we show empirically (Section 5) that there is no unfair feedback loop that reinforces biases, as the model is trained unbiasedly at each time step. This enables us to calculate the true relevance score between user and item. However, in reality such training data is not readily available due to practical constraints, e.g., users leaving the platform because recommendations are made uniformly at random and are not personalized [35].
Hence, to achieve balanced fairness we develop a two-stage representation learning framework as follows: Firstly, given context features (i.e., input to the model which can come from both users and items), we extract the sensitive-correlated information into representations using an identifiable VAE [16]. Secondly, we learn a second-level representation of these sensitive representations to remove biases across sensitive groups by proposing a new adversarial learning strategy. Lastly, these representations are used as input to any recommendation model.
We show that by adopting our debiasing training framework, we can train a recommendation model as if the data came from a balanced/unbiased dataset, while only having access to biased data. In addition, given that we extract the sensitive-correlated information into representations in the first stage, we can also check how much of the sensitive information has been removed (see the ablation study in Section 5.4).
The main contributions of this paper are summarized below:
• We propose a new type of fairness from a data debiasing perspective, which we term balanced fairness, and develop an objective called the Balanced Fairness Objective (BFO).
• Next, we present a two-stage end-to-end algorithm (B-FAIR), which, given biased and unfair logged training data, allows us to train representations for any recommendation model as if the data came from an unbiased and fair dataset.
• Lastly, we show the effectiveness of our method B-FAIR over existing methods in synthetic as well as real-world experiments.

RELATED WORKS
Fairness has recently attracted a lot of attention in the recommendation systems community [37,44]. The key objective in fairness is that groups or individuals should be treated independently of sensitive attributes such as gender. For clarity, we review the two relevant lines of work separately in Sections 2.1 and 2.2.

Debiasing recommendation
Given that unfairness can be caused by data bias [12], the debiasing objective can be reduced to solving the popularity bias and exposure bias induced by a missing-not-at-random problem in the dataset [7,8,34,46]. In particular, for recommendation systems, the lack of interaction between a user and an item does not necessarily signify that the user was not interested. This missing interaction could be due to the recommendation model not exposing the user to the item, which makes computing the true user-item relevance score increasingly hard. Recommendation models trained on such data can be heavily biased and thus reinforce unfairness. There are three types of methods for data debiasing: (1) Imputation-based methods: these rectify unfairness by training directly on the entire sample space, where every user-item interaction is treated as "observed" [19]. (2) Optimization-level objectives: these methods are based on distribution adjustments, where importance sampling and re-weighting of the data are applied to the optimization objectives [4,19,29,30,32,33]. (3) Representation learning: lastly, representation learning methods [40] aim to learn mappings of the data such that the representation is not affected by the biased historical dataset. Our method B-FAIR is most closely related to the representation learning perspective and, contrary to existing debiasing methods, focuses on eliminating the unfairness in training data stemming from sensitive information.

Fairness recommendation method
For the algorithmic-level objective, previous methods mostly focus on how to achieve fair policies by exposing similar items across sensitive groups in an online recommendation system. The key ingredient in most of these methods is to develop fairness constraints, which allows them to obtain fair item exposure, i.e., exposure to items independent of sensitive attributes such as gender [1,6,13,26].
In addition, some works consider a re-ranking/post-processing step on the outputs of recommendation systems to increase fairness [3,20,38,43]. In summary, previous fairness approaches to recommendation systems are mostly concerned with learning a model with fairness constraints, i.e., predictions being independent of the sensitive attributes. However, in this paper, we focus on tackling the unfairness created by the feedback loop, i.e., biased logged data, and hence we concentrate on the second fairness objective proposed in [11], which is that of data debiasing fairness. To reiterate, both of the objectives mentioned in [11] are orthogonal to each other and hence could be used in conjunction.

PROBLEM DEFINITION AND SETUP

Notation and setup
We denote c ∈ R^{d_c} as a context variable collected from an online recommendation system, such as user and item context features, e.g., user profiles and item attributes. We write g_u and g_i for the user and item sensitive group indices, o for the item exposure indicator, and y for the user feedback. The logged training dataset D = {(c_t, o_t, y_t)}_{t=1}^T, t ∈ [1, T], has been collected in a biased and unfair fashion.

Fair exposure in feedback loops
Before diving deeper into our formulation of balanced fairness, we would like to clarify the kind of unfairness that we aim to tackle and remove. To this end, we illustrate the feedback loop in figure 1 to shed more light on the problem. In particular, figure 1 depicts the relationship among the components in the recommendation pipeline.
• Firstly, the users are exposed to a selection of items recommended by a recommendation policy.
• These interactions are then recorded and generate the data that is stored as historical data.
• Finally, this stored data, which might include biases based on sensitive attributes, is used to train the recommendation policy used in the first step.
The key problem is that the recommendation policy might not extract the true relevance score for a user-item pair but rather a pattern from historical data, which is reinforced with every iteration [2,29,30]. Going back to our leading example, if the logged data has abundant examples of female users interacting with cosmetic products, the policy might wrongly recommend these items to female users who are not interested in them.
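This amplification can be illustrated with a toy simulation; the `boost` factor below is a hypothetical stand-in for the over-representation of a popular item in logged clicks, not part of our method:

```python
import numpy as np

def simulate_feedback_loop(initial_policy, n_rounds=5, boost=0.5):
    # Each round, the policy recommends in proportion to past interactions.
    # Logged clicks over-represent item 0 (the "popular" item), and retraining
    # on that log feeds the skew back into the next round's policy.
    policy = np.asarray(initial_policy, dtype=float)
    shares = []
    for _ in range(n_rounds):
        policy = policy / policy.sum()
        shares.append(float(policy[0]))  # exposure share of item 0
        logged = policy.copy()
        logged[0] *= (1.0 + boost)       # biased logging of item 0
        policy = logged                  # retrain on the biased log
    return shares
```

Each round strictly increases item 0's exposure share, mirroring the cosmetics example: a correlation present in the log is reinforced rather than corrected.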
Hence, finding the true personalized relevance score between user and item is crucial to ensure balanced and fair recommendations, and it is the core question that we aim to tackle in this paper. One way to better estimate the true user-item relevance is to collect the training data from the online system with a uniform item exposure probability conditioned on the sensitive group. This uniform/fair exposure in historical data is formally defined below.

Definition 1. (Fair Exposure) Assuming that there is uniform exposure conditioned on the user-sensitive features z_u and item-sensitive features z_i, the following two equations hold:

p(o | z_u, z_i) = p_uni(o | z_u, z_i),    p(o, z_u, z_i) = p_uni(o | z_u, z_i) p(z_u, z_i),

where p_uni(o | z_u, z_i) denotes uniform exposure for items conditioned on the sensitive information, i.e., every item is equally likely to be exposed within each sensitive group.
Intuitively, the idea of "Fair Exposure" is that being exposed to an item should not depend on a person's sensitive attributes, like gender or race. The ideal scenario is when the exposure to items is uniform at random and unbiased, allowing us to accurately capture the user's feedback, or relevance score.
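As a hypothetical sanity check (not part of B-FAIR itself), one could test whether a logged dataset satisfies this condition by comparing each group's empirical exposure distribution to the uniform one:

```python
import numpy as np

def exposure_by_group(groups, items, n_items):
    # Empirical P(item | sensitive group) over the logged exposures.
    dists = {}
    for g in np.unique(groups):
        counts = np.bincount(items[groups == g], minlength=n_items)
        dists[g] = counts / counts.sum()
    return dists

def is_fair_exposure(groups, items, n_items, tol=0.05):
    # "Fair Exposure" holds when every group's exposure distribution
    # is (approximately) uniform over the items.
    target = 1.0 / n_items
    dists = exposure_by_group(groups, items, n_items)
    return all(np.abs(d - target).max() <= tol for d in dists.values())
```

A log where one sensitive group is systematically steered toward a single item fails this check, while uniformly sampled exposures pass it.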
However, collecting data in this manner is not practical, as users may leave the platform before providing accurate feedback due to the non-personalized recommendations. To address this, we propose the B-FAIR algorithm in Section 4, which works without access to unbiased data. Note that the two equations in Definition 1 express the same independence statement between item exposure and the sensitive features.

Balanced fairness objective
In this section, we introduce our solution to the problem of unfair feedback loops in recommendation systems by proposing a novel objective, called the Balanced Fairness Objective (BFO). This objective differs from the traditional fairness metrics of demographic parity, equalized odds, and equal opportunity as follows:

Definition 2. (Balanced Fairness Objective) For any loss function ℓ, the balanced fairness objective on any downstream recommendation model f is defined as

L_BFO(f) = E_{p(c)} E_{p_uni(o | z_u, z_i)} [ℓ(f(c, o), y)].   (1)

There are three key takeaways from this objective:
• Training a recommendation model under an unfair and biased empirical distribution of the exposures p(o | z_u, z_i) instead of p_uni(o | z_u, z_i) would inevitably result in biased relevance scores due to the feedback loop [17,30], i.e., unfair up- or down-weighting of specific items. Hence, by optimizing the BFO, we are in fact able to estimate the true relevance score for a given user-item pair, because each item was exposed to the user uniformly at random given the sensitive attributes.
• The BFO can easily be estimated when "Fair Exposure" data is provided, i.e., when the item exposure is uniformly sampled conditioned on the sensitive attributes.
• However, as mentioned above, in the real world we rarely have access to this type of fair exposure data, and one way to still use this objective would be importance sampling [10,32,33]. These estimators, however, usually come with high variance, and hence we propose a representation learning-based method in the next section, which avoids the drawbacks of importance sampling. The key idea is to remove the causal relationship between sensitive attributes and item exposure in the historical data.
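The importance-sampling route mentioned in the last point can be sketched as follows, under the strong (and usually unrealistic) assumption that the logging propensities are known; `ips_bfo_estimate` is an illustrative name, not part of our method:

```python
import numpy as np

def ips_bfo_estimate(losses, logging_probs, n_items):
    # Re-weight each logged sample by p_uni / p_logged, where the uniform
    # conditional exposure probability is 1 / n_items. Unbiased in
    # expectation, but high variance when logging_probs are small.
    losses = np.asarray(losses, dtype=float)
    logging_probs = np.asarray(logging_probs, dtype=float)
    weights = (1.0 / n_items) / logging_probs
    return float(np.mean(weights * losses))
```

When some items are rarely exposed, their propensities are tiny and the corresponding weights blow up, which is exactly the high-variance drawback cited above.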
Recall that this is different from standard fairness definitions [11,21,23,31], as we are motivated from a data debiasing perspective rather than a model prediction perspective. In particular, we define balanced fairness through being able to train a recommendation model on fairly exposed data and thus recover the true relevance scores between user and item. Given that we do this at each iteration of the loop, we argue that we construct a recommendation system that does not suffer from the unfairness of feedback loops. Note that traditional fairness constraints could be added to our training framework in future work. To summarize, we defined the concept of "Fair Exposure" and the Balanced Fairness Objective (BFO). The BFO aims to emulate the scenario where there is no unfair feedback loop present and the true relevance score between user and item can be obtained. However, computing this objective is difficult when data with fair exposure is not available. To overcome this challenge, we propose a two-stage approach based on representation learning in the next section.

METHOD
In this section, we describe how our proposed framework, Balanced and FAIr Representations (B-FAIR), can optimize the objective given in Eq. 1 when we do not have access to unbiased data. B-FAIR is divided into two stages: First, we use an injective feature inference function φ : R^{d_c} → R^{d_z} to disentangle the sensitive and non-sensitive information from the context c into a representation z ∈ R^{d_z}. The disentangled representation z is defined as z = [z_u, z_i, z_n], where z_u ∈ R^{d_1} and z_i ∈ R^{d_2} are the user and item sensitive representations, respectively, and z_n ∈ R^{d_3} captures the remaining information from the context which is orthogonal to the sensitive attributes (see figure 2(a)).
Next, these disentangled representations are fed into a balanced representation function w : R^{d_1 + d_2} → R^{d_w}, which we properly define in Section 4.2, to generate a fair and balanced representation with respect to the sensitive attributes. These newly learned representations, together with the non-sensitive representation z_n, can be used to predict the user feedback y using any downstream recommendation model f:

y = f(w(z_u, z_i), z_n).
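The two-stage pipeline just described can be sketched with stub components; `disentangle`, `balance` and `recommend` are illustrative placeholders for the feature inference function, the balancing function and the downstream model, not our actual implementation:

```python
import numpy as np

def predict_feedback(c, disentangle, balance, recommend):
    # Stage 1: split the context into user-sensitive, item-sensitive and
    # non-sensitive representations. Stage 2: balance the sensitive parts,
    # then feed [balanced, z_n] into any downstream recommendation model.
    z_u, z_i, z_n = disentangle(c)
    balanced = balance(z_u, z_i)
    return recommend(np.concatenate([balanced, z_n]))
```

Because the downstream model only sees the balanced sensitive representation and z_n, any embedding-based recommender can be plugged in unchanged.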

Disentangle sensitive features from context
It is common in the fairness literature to assume access to sensitive attributes g_u, g_i in the dataset [21]. If we were to simply remove the sensitive attributes from the training data, we would inevitably ignore the sensitive-correlated information (e.g., occupation is correlated with age) in the context data. Instead of using the group index directly, we determine the boundary between sensitive and non-sensitive information before we learn the fairness representation. To achieve this goal, we assume that the context c is generated by three factors z = [z_u, z_i, z_n], i.e., user-sensitive, item-sensitive and non-sensitive information, respectively (see figure 2(a)). For example, if the sensitive attribute g_u were gender, then z_u would contain all the gender-correlated information from the context c.
To accurately identify sensitive-correlated information, we build upon the work in [16] and use a conditional variational inference approach, where we infer the hidden z_u, z_i, z_n in an identifiable manner. Assuming that c is conditionally independent of g_u, g_i given z, we can write out the following generative model:

p_{θ_h, λ_1, λ_2}(c, z | g_u, g_i) = p_{θ_h}(c | z) p_{λ_1, λ_2}(z | g_u, g_i),

where θ_h are the parameters of the injective generator h : R^{d_z} → R^{d_c} of the VAE, and the conditional prior parameters are λ_1(g) = λ_g · 1 and λ_2 = 1, where 1 is a vector of ones.
The generative model thus decomposes into the context generation model p_{θ_h}(c | z) and the conditional sensitive-feature prior p_{λ_1, λ_2}(z | g_u, g_i). We can further decompose p_{λ_1, λ_2}(z | g_u, g_i) as follows:

p_{λ_1, λ_2}(z | g_u, g_i) = p(z_u | g_u) p(z_i | g_i) p(z_n).

Intuitively, this corresponds to the independence between the z components shown in figure 2(a). Inspired by the identifiable VAE framework [16], we design a conditional multivariate Gaussian distribution for each of the components:

p(z_u | g_u) = N(λ_1(g_u), λ_2 I_{d_1}),   p(z_i | g_i) = N(λ_1(g_i), λ_2 I_{d_2}),   p(z_n) = N(0, I_{d_3}),

where I_{d_3} is the identity matrix of dimension d_3.
According to [16,42], by training a VAE using the above conditional priors, we can learn an identifiable latent representation of the context c in terms of z = [z_u, z_i, z_n], where all the sensitive information of g_u is captured in z_u, and similarly for g_i and z_i. As is standard for VAEs, the learning objective is to maximize the data likelihood E[log p_{θ_h, λ_1, λ_2}(c | g_u, g_i)], which in general is not analytically tractable. Hence, we instead maximize a lower bound of the data likelihood, also known as the Evidence Lower Bound (ELBO). Denoting D_KL(·∥·) as the KL divergence and C = {(c_t, g_{u_t}, g_{i_t})}_{t=1}^T as the observed context dataset without the user feedback y and item exposure o, the learning objective can be written as

ELBO(φ, θ_h) = E_{q_φ(z | c)} [log p_{θ_h}(c | z)] − D_KL(q_φ(z | c) ∥ p_{λ_1, λ_2}(z | g_u, g_i)),

where φ is the feature inference function formally defined in Section 4. Due to the factorization of the conditional prior described previously, the D_KL term can be decomposed further as

D_KL(q_φ(z | c) ∥ p(z | g_u, g_i)) = D_KL(q_φ(z_u | c) ∥ p(z_u | g_u)) + D_KL(q_φ(z_i | c) ∥ p(z_i | g_i)) + D_KL(q_φ(z_n | c) ∥ p(z_n)).

The details of the derivation of this ELBO are given in Section 6.2. With this form, we can implement a loss function to train the disentanglement part of our method. Since the prior in the final term is a standard multivariate (spherical) Gaussian distribution [16], we cannot guarantee the identifiability of z_n. Therefore, we add a constraint to the final objective to make sure z_n retains the information from the context while removing all the sensitive-correlated information. The final disentangling objective can thus be written as

L_dis(φ, θ_h) = −ELBO(φ, θ_h) + β · R(z_n),

where R(z_n) denotes this constraint and β is a hyperparameter.
By optimizing L_dis(φ, θ_h), we are able to learn a disentangled representation z = [z_u, z_i, z_n] of the context c.
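Because both the approximate posterior and the conditional priors are diagonal Gaussians, each KL term in the decomposition has a closed form; a minimal numpy sketch of this computation (the block names and dict layout are illustrative, not our implementation):

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    # Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians.
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def factored_kl(q, p):
    # The KL term decomposes over the independent blocks z_u, z_i, z_n:
    # q, p map each block name to a (mean, variance) pair.
    return sum(gaussian_kl(*q[k], *p[k]) for k in ("u", "i", "n"))
```

The sum-over-blocks structure is exactly why the conditional priors make the objective easy to implement: each block's KL is penalized against its own group-conditional prior.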
In summary, we use an iVAE [16] with a specific prior on the latent space z to obtain a disentangled representation z = [z_u, z_i, z_n] of the context c in terms of sensitive and non-sensitive attributes. With these disentangled representations, we move on to the second stage, which takes the user and item sensitive feature representations z_u, z_i as input to learn a fair and balanced recommendation policy, i.e., a model which breaks the feedback loop by training under the Balanced Fairness Objective (BFO) in Eq. 1.

Learning fairness data representation
Now that we have an identifiable and disentangled representation of the context c, we describe how to optimize the balanced fairness objective (BFO) in Eq. 1 efficiently when we only have access to biased, unfair logged data. To this end, we design an adversarial learning strategy which comprises two components: (1) a discriminator that is tasked with determining whether the current representation of the disentangled context features satisfies "Fair Exposure", and (2) a representation learning function w which aims at learning balanced sensitive features that remove the unfair factors in the historical data. The new balanced sensitive representation w(z_u, z_i) and the non-sensitive features z_n are then used as inputs to a function f to predict the user feedback y. Note that f can be any recommendation model that uses feature embeddings as inputs.
To understand how we arrive at our final objective function, we start by describing the discriminator in more detail. The proposed discriminator d is a classifier with a 2-dimensional softmax output layer, tasked with identifying whether the item represented in w(z_u, z_i) was exposed to the user, i.e., dimension 1 indicates the probability that the item was exposed to the user and dimension 0 indicates the probability that it was not.
Intuitively, if the discriminator d is not able to determine whether item i was exposed to user u based on w(z_u, z_i), we can conclude that the items were in fact exposed uniformly at random, i.e., "Fair Exposure" holds. In parallel to this classification task, we also train the representation function w and a recommendation model f to fit the user feedback. This avoids trivial solutions learned through the classification task (adversarial process). Hence we arrive at the following objective (c is implicitly included in z).
L(w, f, d) = E[ℓ(f(w(z_u, z_i), z_n), y)] + E[o log d_1(w(z_u, z_i)) + (1 − o) log d_0(w(z_u, z_i))],   (8)

where the first term is the user feedback prediction (BFO) loss L_BFO, the second term is the discriminator loss L_D, and d_k denotes the k-th element of the discriminator output. We optimize the above function in an alternating min-max game fashion, min_{w,f} max_d L(w, f, d). In order to understand why the representation w(z_u, z_i) is unbiased and how optimizing L achieves the BFO based on w(z_u, z_i), we present the following theorem.
Theorem 1. The discriminator loss L_D in Eq. 8 attains its optimum when p(w(z_u, z_i) | o = 1) = p(w(z_u, z_i) | o = 0). In other words, the above theorem states that, under the assumption that we are able to optimize the min-max game on L_D, we are in fact in the setting of "Fair Exposure" with respect to w, i.e., p(w(z_u, z_i) | o = 1) = p(w(z_u, z_i) | o = 0). Note that only optimizing L_BFO in Eq. 8 without the discriminator loss L_D corresponds to learning the feedback from the context in the standard recommendation setting using biased, unfair training data. By adding the discriminator loss L_D we enforce the learning of a representation w(z_u, z_i) such that the "Fair Exposure" condition is upheld and thus the BFO is optimized. Once we have defined these representations of the sensitive attributes w(z_u, z_i), we learn the function

f* = argmin_f E[ℓ(f(w(z_u, z_i), z_n), y)].

Note that the function f does not take c as input, as it is implicitly included in w(z_u, z_i) and z_n.
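The intuition behind this optimality condition can be checked empirically: if the representations conditioned on o = 1 and o = 0 share the same distribution, no discriminator can beat chance. A small logistic-regression probe (an illustrative stand-in for the softmax discriminator, not our architecture) makes this concrete:

```python
import numpy as np

def best_discriminator_accuracy(reps, exposed, n_steps=300, lr=0.2):
    # Train a logistic-regression probe to tell exposed (o=1) from
    # non-exposed (o=0) representations. If the two conditional
    # distributions match, accuracy cannot rise much above chance (0.5).
    w = np.zeros(reps.shape[1])
    b = 0.0
    y = exposed.astype(float)
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(reps @ w + b)))
        w -= lr * reps.T @ (p - y) / len(y)   # gradient descent on log-loss
        b -= lr * float(np.mean(p - y))
    p = 1.0 / (1.0 + np.exp(-(reps @ w + b)))
    return float(np.mean((p > 0.5) == exposed))
```

On representations whose exposed/non-exposed distributions coincide, the probe stays near 0.5 accuracy; on representations that leak exposure information, it climbs toward 1.0, which is exactly the signal the adversarial term penalizes.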

Overall optimization objective
Now that we have described the two main stages of our proposed training process, we show how to train our model B-FAIR in an end-to-end manner. To this end, we propose the following overall objective, summarized as a min-max game combining the disentangling loss and the adversarial loss:

min_{θ_w, θ_f, θ_φ, θ_h} max_{θ_d}  L_dis(φ, θ_h) + L(w, f, d).

As is common in adversarial/min-max games, we alternate between maximizing with respect to θ_d and minimizing with respect to θ_w, θ_f, θ_φ and θ_h. Recall that our method aims at rectifying the unfairness problem from the data perspective using representation learning; in particular, we aim for balanced fairness. We could easily add additional fairness constraints or change the downstream policy model f to obtain stricter constraints on fair prediction. However, we leave this for future work as it is not the focus of this paper.
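The alternating schedule can be sketched abstractly; `d_step` and `f_step` are hypothetical closures performing one ascent step on the discriminator parameters and one descent step on the remaining parameters, respectively:

```python
def alternating_minmax(d_step, f_step, n_rounds=3, d_inner=2):
    # Schematic B-FAIR-style schedule: take a few ascent steps on the
    # discriminator, then one descent step on the remaining parameters.
    history = []
    for _ in range(n_rounds):
        for _ in range(d_inner):
            d_step()              # maximize w.r.t. the discriminator
        history.append(f_step())  # minimize w.r.t. the other parameters
    return history
```

Giving the discriminator a few inner steps per round is a common heuristic in adversarial training to keep it close to its current optimum before the representation is updated against it.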
In summary, in this section we described how to optimize the BFO in a two-stage representation learning process. First, in Section 4.1, we described a VAE-based disentanglement model which allows us to extract the sensitive features z_u, z_i as well as the non-sensitive features z_n from our context data c. By using the architecture proposed in [16], we can guarantee the identifiability of the extracted representations. In Section 4.2, we then described how an adversarial training scheme lets us emulate training on fairly exposed data, even though only biased logged data is available.

EXPERIMENTS
In this section, we conduct extensive experiments to demonstrate the effectiveness of our proposed method B-FAIR. In particular, we illustrate how B-FAIR outperforms current state-of-the-art debiasing methods on both synthetic and real-world data. Furthermore, to get a better understanding, we also perform several ablation studies to investigate under which conditions our end-to-end algorithm can disentangle the context.

Experiment setup
5.1.1 Synthetic data. Similar to previous work [41,47], we simulate 10,000 users in 9 sensitive groups and 32 items in 3 sensitive groups, where each item has its own attributes. Each user and item consists of 32-dimensional sensitive features and 32-dimensional non-sensitive features, generated from a uniform distribution U(−1, 1). Let z^s_u, z^s_i denote the sensitive features of the user and item, and z^n_u, z^n_i the non-sensitive features of the user and item, respectively. The context c_t is generated as c_t = m(z^s_u, z^s_i, z^n_u, z^n_i), where we investigate several options for the function m: (1) a concatenation of z^s_u, z^s_i, z^n_u and z^n_i, (2) a linear function, and (3) a non-linear function. The details of the exact functions are given later in Section 5.4.
We report overall results using the concatenation function in Table 2 and further analyze the disentanglement of all three functions in Section 5.4. After defining the context generation process, as in previous works [41,47], we move on to how the scores are computed based on the context. In particular, as in [41,47], for each item i we define a score s_{u,i}; the top-N exposure list is based on this score, and we select 5 items from the 32-item set for each user.
Based on the exposure list, we generate the feedback of a user u on an item i (in the exposure list) with probability proportional to its score, up to a normalization term. Note that the above recommendation policy is unfair, since the exposure list is influenced by both sensitive and non-sensitive context information, i.e., s = m(z^s_u, z^s_i, z^n_u, z^n_i). Hence this represents the setting where we collect biased, unfair logged data on which we would then train our new recommendation policy. The resulting dataset is our training/validation data. In order to evaluate how well B-FAIR performs, we also generate a fair dataset for testing, using a fair policy whose score function does not depend on the sensitive information z^s_u, z^s_i; this dataset can therefore serve as our fair ground truth to assess our method.
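A minimal sketch of this generation process (concatenation variant; function names and the group-assignment mechanism are illustrative, only the counts follow the setup above):

```python
import numpy as np

def make_synthetic(n_users=10_000, n_items=32, dim=32, seed=0):
    # Features are drawn from U(-1, 1); users fall into 9 sensitive groups
    # and items into 3, matching the counts in the setup above.
    rng = np.random.default_rng(seed)
    z_s_u = rng.uniform(-1, 1, (n_users, dim))  # user sensitive features
    z_n_u = rng.uniform(-1, 1, (n_users, dim))  # user non-sensitive features
    z_s_i = rng.uniform(-1, 1, (n_items, dim))  # item sensitive features
    z_n_i = rng.uniform(-1, 1, (n_items, dim))  # item non-sensitive features
    user_group = rng.integers(0, 9, n_users)
    item_group = rng.integers(0, 3, n_items)
    return z_s_u, z_n_u, z_s_i, z_n_i, user_group, item_group

def concat_context(z_s_u, z_n_u, z_s_i, z_n_i, u, i):
    # The concatenation variant of the context function m for a (user, item) pair.
    return np.concatenate([z_s_u[u], z_s_i[i], z_n_u[u], z_n_i[i]])
```

With the concatenation variant, the context for each user-item pair is a 128-dimensional vector in which the sensitive and non-sensitive blocks are known by construction, which is what makes the disentanglement quality measurable in the ablations.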

5.1.2 Real-world datasets. MovieLens-1M: The dataset contains user-item interactions and user profile information for movie recommendation. We use gender (binary classes) and movie tags (18 classes) as user- and item-sensitive features, respectively. Insurance: The dataset is collected from an insurance product recommendation system. We use gender (binary classes) as user-sensitive information and item id (21 classes) as item-sensitive features. Here, using item id as the sensitive information means that we consider individual fairness on the item side.

5.1.3 Metrics. Since the prediction task is binary classification, we consider the following standard evaluation metrics: AUC [27], ACC [14], Precision@N and Normalized Discounted Cumulative Gain (NDCG@N). For the MovieLens and synthetic datasets, we report Precision@5 and NDCG@5, and for the Insurance dataset, we report Precision@3 and NDCG@3. For the synthetic data, since we can generate the fair test dataset, the scores given in Table 2 reflect whether we were able to learn an unbiased recommendation system. For the real-world datasets, similar to [18], we use the metrics ACC-F, AUC-F, NDCG@N-F and Precision@N-F, where we evaluate the scores (ACC, AUC, Precision@N, NDCG@N) for each sensitive group and calculate the discrepancy between the highest and lowest score (e.g., AUC in the male group versus the female group). By achieving fairness in the training process, we also inherently reduce this performance gap.
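The "-F" discrepancy metrics can be computed generically; the sketch below plugs in the per-group mean of any per-sample score (for AUC-F one would instead compute an AUC within each group):

```python
import numpy as np

def fairness_gap(scores, groups):
    # Evaluate the mean score within each sensitive group and report the
    # discrepancy between the best and worst group (the "-F" metric pattern).
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    per_group = [scores[groups == g].mean() for g in np.unique(groups)]
    return float(max(per_group) - min(per_group))
```

For ACC-F, `scores` would be the per-sample 0/1 correctness indicators; a gap of 0 means the model performs identically across sensitive groups.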

5.1.4 Baselines. As mentioned in the related works (Section 2), there are two parts to the fairness pipeline proposed by [11]: (1) removing the data bias and (2) then improving fairness in the online system through constraints. Since the objective of our framework is data debiasing, we mainly focus on the debiasing literature for our baseline experiments.
These baselines include the "Base model" (the model without any debiasing) and state-of-the-art debiasing methods such as Inverse Propensity Score (IPS) [32], Self-normalized IPS (SNIPS) [33], Doubly Robust (DR) [10], the Direct method, ATT [28] and CVIB [35]. IPS and SNIPS are IPS-based methods, where a pretrained propensity-weight evaluation model is required. Direct and ATT are both direct-learning-based methods, where an imputation model is required to generate counterfactual data samples. The DR method is a combination of the IPS and Direct methods, and CVIB is a representation-learning-based method. These debiasing methods optimize an objective similar to our BFO; the only difference is that they aim to achieve uniform exposure in all contexts, whereas we aim to achieve fair exposure conditioned on the sensitive group. To achieve a fair comparison, we add the user and item sensitive group index as part of the context feature for each baseline. For the IPS-based methods, we use the fairness representation learned from the disentanglement method to calculate the propensity score. We apply our framework to different base models, including MLP and GMF [15]. All experiments are conducted on a server with a 16-core CPU, 128 GB of memory and an RTX 5000 GPU. We specify the different parts of objective (11) as follows: ℓ is implemented by the binary cross-entropy loss to model user feedback. Categorical user features are encoded by an embedding matrix, and continuous features are multiplied directly by weighting matrices to derive the representation. Denote W_1, W_2 and W_3 as weighting parameters, σ as a softmax function with 2-dimensional output, and ELU as the activation function. The disentanglement function φ consists of three separate functions φ_u(c), φ_i(c), φ_n(c) that infer z_u, z_i, z_n, respectively; each neural network is designed as W_1 ELU(W_2 ELU(W_3 [c])). The architecture of the generator is W_1 ELU(W_2 ELU(W_3 [z_i, z_u, z_n])). The balanced representation function w is implemented using a linear layer, and the discriminator is defined as d(w(z_u, z_i)) = σ(W_1 ELU(W_2 w(z_u, z_i))).

Overall comparison
In this section, we present our results in Table 2. In most cases, the performance of the direct-based methods (Direct, ATT) is worse than that of IPS or SNIPS. This is because the imputation model is biased by the dataset, which is disturbed by a complex environment, and thus the performance of the imputation model cannot be guaranteed. CVIB and DR generally achieve better performance than IPS, SNIPS, Direct and ATT, which is consistent with previous work [10]. In most cases, the state-of-the-art representation-learning-based method CVIB performs better than the other baselines, because the model considers a more general setting of the real system to avoid noise interference.
Synthetic Data Results: Taking a closer look at each of the experiments separately: firstly, on the synthetic data, our proposed method B-FAIR obtains significant performance improvements over the base model across all four metrics AUC, ACC, NDCG@5 and Precision@5 (see Table 2). Note that we generated fair and unbiased synthetic test data according to Eq. 13, so the performance on standard metrics allows us to determine whether we achieve less biased predictions. To further investigate each component of our two-stage method, we also performed experiments without the disentanglement step (B-FAIR(-d)). As can be seen in Table 2, B-FAIR(-d) does not perform on par with B-FAIR, and hence we conclude that the disentanglement step is crucial to the success of B-FAIR. These experiments in controlled environments validate our hypothesis that B-FAIR allows us to train a recommendation policy as if the logged data were balanced fair.
Finally, we examined the scenario in which the recommendations at time t + 1 are contingent on the recommendations at time t, i.e. sequential recommendation with feedback loops. To simulate this scenario, we created a synthetic data example with a feedback loop: we updated the recommendation list using the current policy after every 50 epochs of training. The exposure list at time t + 1 was constructed based on the scores calculated by the recommendation model at time t, and the recommendation list was then re-labeled using the rule outlined in Eq. 12. The results in Figure 4 demonstrate that B-FAIR outperforms the other methods over multiple feedback loops. In particular, B-FAIR achieved the best fairness performance, with an AUC of 0.927 after three loops, while the base method failed to maintain fairness, its performance declining with each loop.
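The simulated feedback loop above can be sketched as follows. This is a minimal NumPy illustration, assuming a hypothetical top-k exposure rule and using ground-truth relevance as a stand-in for the retrained model's scores (the actual simulator retrains the model for 50 epochs per loop and re-labels via Eq. 12):

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, k = 50, 100, 5
true_rel = rng.random((n_users, n_items))   # hypothetical ground-truth relevance
scores = rng.random((n_users, n_items))     # initial (biased) model scores

def top_k_exposure(scores, k):
    """Exposure at time t+1: each user is shown the k items the time-t model ranks highest."""
    expo = np.zeros_like(scores, dtype=bool)
    top = np.argsort(-scores, axis=1)[:, :k]
    np.put_along_axis(expo, top, True, axis=1)
    return expo

for loop in range(3):
    exposure = top_k_exposure(scores, k)
    # Feedback is only observed (re-labeled) on exposed items.
    labels = np.where(exposure, (true_rel > 0.5).astype(float), np.nan)
    # ... here the model would be trained on (exposure, labels) for 50 epochs ...
    scores = np.where(exposure, true_rel, scores)  # stand-in for the retrained scores
```

The key point the sketch captures is that each loop's training data is restricted to what the previous policy chose to expose, which is exactly how exposure bias compounds over time.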
Real-World Data Results: For the real-world datasets MovieLens and Insurance, we report the fairness scores AUC-F, ACC-F, NDCG@5-F and Precision@5-F, computed over the user-sensitive group gender and the item-sensitive group movie tags for MovieLens, and over the gender-sensitive group for Insurance. Given that we do not have access to a balanced dataset as in our synthetic experiments, we use these metrics as proxies to illustrate the fairness of B-FAIR: the lower the fairness scores, the smaller the discrepancy between the sensitive groups and therefore the fairer the method.
Compared with the base method, on the MovieLens dataset we obtain a fairness improvement of {AUC-F: 11.6%, ACC-F: 29.6%, NDCG@5-F: 26.3%, Precision@5-F: 26.3%} on the user-sensitive group (gender) and {AUC-F: 39.4%, ACC-F: 31.7%, NDCG@5-F: 20.4%, Precision@5-F: 3.9%} on the item-sensitive group (movie tags), while the performance on the general scores remains close to the current optimum. Although the general scores cannot be used to measure fairness, these results show that our method enhances fairness performance without damaging user-feedback prediction.
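A sketch of how a "-F" fairness score of this kind can be computed, under the simplifying assumption (ours, not stated explicitly in the text) that each "-F" score is the absolute gap of the underlying metric between the two sensitive groups; `group_gap` and `accuracy` are hypothetical helper names:

```python
import numpy as np

def accuracy(y_true, y_score):
    """Plain accuracy with a 0.5 threshold, standing in for any base metric (AUC, NDCG@5, ...)."""
    return ((y_score > 0.5) == (y_true > 0.5)).mean()

def group_gap(metric, y_true, y_score, groups):
    """Hypothetical '-F' score: absolute metric gap between two sensitive groups (0 and 1)."""
    g = np.asarray(groups)
    vals = [metric(y_true[g == v], y_score[g == v]) for v in (0, 1)]
    return abs(vals[0] - vals[1])

# Tiny worked example: group 0 is predicted perfectly, group 1 is predicted wrongly.
y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_score = np.array([0.9, 0.1, 0.2, 0.8])
groups = np.array([0, 0, 1, 1])
gap = group_gap(accuracy, y_true, y_score, groups)
```

Under this reading, a perfectly group-balanced model has a "-F" score of 0, and the percentage improvements above are relative reductions of this gap.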

Ablation study
Ablation Study Setup: For our ablation study on the identifiability of the disentanglement stage, we consider the distance correlation between the features ẑ learned by the disentanglement method and the ground-truth features z, defined as dcorr(ẑ, z) = dcov(ẑ, z) / sqrt(dcov(z, z) · dcov(ẑ, ẑ)), where dcov denotes the distance covariance between two variables. The score measures the statistical dependence between the two distributions; a higher distance correlation means better disentanglement.
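The distance correlation above can be estimated from samples via double-centered pairwise distance matrices. A small NumPy sketch (the O(n²) formulation shown here is the standard sample estimator, not the paper's code):

```python
import numpy as np

def dist_cov(x, y):
    """Sample distance covariance between two (n, d) samples x and y."""
    def centered(a):
        # Pairwise Euclidean distance matrix, then double-center it.
        D = np.sqrt(((a[:, None, :] - a[None, :, :]) ** 2).sum(-1))
        return D - D.mean(axis=0) - D.mean(axis=1)[:, None] + D.mean()
    A, B = centered(x), centered(y)
    # Clamp at 0: the sample estimate of dcov^2 can be slightly negative numerically.
    return np.sqrt(max((A * B).mean(), 0.0))

def dist_corr(z_hat, z):
    """dcorr(z_hat, z) = dcov(z_hat, z) / sqrt(dcov(z, z) * dcov(z_hat, z_hat))."""
    denom = np.sqrt(dist_cov(z, z) * dist_cov(z_hat, z_hat))
    return dist_cov(z_hat, z) / denom if denom > 0 else 0.0
```

Unlike Pearson correlation, dcorr is zero only under independence, which is why it is a sensible proxy for how much sensitive information the learned representation captures.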
In other words, if the distance correlation between the learned sensitive features ẑ_u^s, ẑ_i^s and the ground-truth sensitive features z_u^s, z_i^s is high, it means that our learned representation captures most of the sensitive information. We equivalently show the score between the learned non-sensitive features [ẑ_u^n, ẑ_i^n] and the ground truth [z_u^n, z_i^n]. We study the quality of the disentanglement stage under different context generative functions. We base

Figure 1: The figure illustrates the unfairness caused by feedback loops in recommendation systems.
Definition 3 (Balanced Fairness). A recommendation model is called balanced fair if the model minimizes the BFO.

Figure 2: (a) The context generation process, with the feature inference function of the disentanglement method. (b) The main objective is to remove the red lines, i.e. removing the bias stemming from the sensitive features and item exposure using a learned balanced representation.

Figure 3: The distance correlation between the learned representation and the ground-truth representation. A distance correlation well above 0.5 is evidence that our latent representations are indeed identifiable.

Figure 4: Performance over 3 feedback loops.

Table 1: Summary of the datasets.
under the BFO objective, thus yielding balanced and fair representations which are not affected by the biased logged data.Lastly, we also prove theoretically that by optimizing our proposed objective, we achieve balanced fairness.

Table 2: Comparisons on Synthetic and MovieLens. Our proposed method B-FAIR outperforms most state-of-the-art baselines in mean performance (± denotes the standard error of the results).
5.2 Implementation Details
For the MovieLens and Insurance datasets, we split the original data into 80% training, 10% validation and 10% test samples. For the simulation dataset, the ratio between the training and testing (including validation) sets is fixed at 3:1. We re-sample the data so that every sensitive group has the same number of samples in the test set, and report statistical information in Table 1. The two weighting parameters are tuned in the range [0.001, 0.5]. The user/item embedding dimension is empirically set to 32.
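The split-and-balance procedure for the test set can be sketched as follows, assuming simple random splitting and per-group downsampling of the test fold (the function name, group encoding and sizes are illustrative):

```python
import numpy as np

def split_and_balance(n, groups, rng, ratios=(0.8, 0.1, 0.1)):
    """80/10/10 split, then downsample the test fold so every sensitive
    group contributes the same number of samples."""
    idx = rng.permutation(n)
    n_tr, n_va = int(ratios[0] * n), int(ratios[1] * n)
    train, val, test = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
    g = np.asarray(groups)[test]
    m = min(int(np.sum(g == v)) for v in np.unique(g))
    test = np.concatenate([test[g == v][:m] for v in np.unique(g)])
    return train, val, test

rng = np.random.default_rng(0)
groups = rng.integers(0, 2, size=1000)
train, val, test = split_and_balance(1000, groups, rng)
```

Balancing only the test fold keeps the training distribution (and its bias) intact while making the evaluation comparable across sensitive groups.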