Can people detect the trustworthiness of strangers based on their facial appearance? Evolution and Human Behavior

Although cooperation can lead to mutually beneficial outcomes, cooperative actions only pay off for the individual if others can be trusted to cooperate as well. Identifying trustworthy interaction partners is therefore a central challenge in human social life. How do people navigate this challenge? Prior work suggests that people rely on facial appearance to judge the trustworthiness of strangers. However, the question of whether these judgments are actually accurate remains debated. The present research examines accuracy in trustworthiness detection from faces and three moderators proposed by previous research. We investigate whether people show above-chance accuracy (a) when they make trust decisions and when they provide explicit trustworthiness ratings, (b) when judging male and female counterparts, and (c) when rating cropped images (with non-facial features removed) and uncropped images. Two studies showed that incentivized trust decisions (Study 1, n = 131 university students) and incentivized trustworthiness predictions (Study 2, n = 266 university students) were unrelated to the actual trustworthiness of counterparts. Accuracy was not moderated by stimulus type (cropped vs. uncropped faces) or counterparts' gender. Overall, these findings suggest that people are unable to detect the trustworthiness of strangers based on their facial appearance, when this is the only information available to them.


Introduction
Cooperation is a defining feature of human social life. In many situations, people act unselfishly and engage in costly behaviors in order to obtain a mutually beneficial outcome (Dawes, 1980). However, such behaviors only pay off if others can be trusted to cooperate as well. Therefore, identifying trustworthy interaction partners is a central challenge in social interactions (Cosmides & Tooby, 1992). Previous studies point to one cognitive mechanism that could address this challenge: People readily form impressions of others' trustworthiness based on their facial appearance (Krumhuber et al., 2007; Todorov, Olivola, Dotsch, & Mende-Siedlecki, 2015). But can people actually detect the trustworthiness of others based on their facial features? Addressing this question is important for two reasons. First, a person's appearance is a readily available cue, and in many situations the only one. If trustworthiness impressions are accurate (at least to some extent), then reliance on these judgments would represent one way in which people can establish cooperative relationships with strangers. Accurate inferences would allow people to make adaptive trust decisions even when little is known about counterparts or when such information would be costly and effortful to obtain. Second, perceptions of trustworthiness influence many important outcomes, including economic transactions, romantic partner choice, and legal sentencing decisions (Olivola, Funk, & Todorov, 2014). If trustworthiness judgments are not accurate, then this would imply that many consequential decisions are biased by irrelevant facial cues.
Previous studies have examined the accuracy of trustworthiness impressions in the context of social dilemma games such as the trust game (Berg, Dickhaut, & McCabe, 1995). In this dyadic interaction, a participant (i.e., the trustor) decides whether to send a monetary endowment to another participant (i.e., the trustee). If the endowment is transferred, the money is multiplied and the trustee decides how much to return to the trustor. Trust and reciprocity lead to higher payoffs for both, but trust is risky as trustees face the temptation to keep the transferred money. Bonnefon, Hopfensitz, and De Neys (2013) presented facial photographs of trustees who had either reciprocated or betrayed trust, showing that participants were more likely to transfer money to counterparts who were actually trustworthy. Other studies yielded similar results, leading various authors to conclude that people are able to detect the trustworthiness of counterparts at levels slightly above chance (ca. 55%; Tognetti, Berticat, Raymond, & Faurie, 2013; Verplaetse, Vanneste, & Braeckman, 2007).
Yet, evidence for accurate trustworthiness detection is mixed. Some researchers did not find empirical support for accuracy when examining trust behavior in social dilemma games (Efferson & Vogt, 2013; Yamagishi, Tanida, Mashima, Shimoma, & Kanazawa, 2003) or when obtaining explicit ratings of counterparts varying in trustworthiness (Rule, Krendl, Ivcevic, & Ambady, 2013; Zylbersztejn, Babutsidze, & Hanaki, 2020). Moreover, accuracy often depended on extraneous factors, which did not replicate across studies. For example, Tognetti et al. (2013) found above-chance accuracy for male but not female counterparts, when using images that were uncropped and included non-facial features (e.g., hair style). Bonnefon et al. (2013), on the other hand, found higher levels of accuracy for female counterparts, but only with cropped images that occluded all non-facial features.
Various scholars have also criticized the accuracy claim by arguing that the reliability of any facial feature as an indicator of trustworthiness might be easily undermined if individuals exhibit the feature but act selfishly (Efferson & Vogt, 2013; McCullough & Reed, 2016). This could lead to the emergence of imitators who appear trustworthy and garner the benefits of trust without paying the costs of reciprocating it. Furthermore, trustworthiness impressions of the same individual vary substantially across different perceivers (Hehman, Sutherland, Flake, & Slepian, 2017) and contexts (Brambilla, Biella, & Freeman, 2018), questioning whether they could be a reliable indicator of any disposition. In sum, the existing evidence for accurate trustworthiness detection from faces is mixed and the topic remains subject to vigorous debate.

Aims of the present research
We present the results of two studies on the accuracy of trustworthiness impressions that address critical limitations of prior work. First, many prior studies relied on the same set of facial photographs (De Neys et al., 2013) and explicitly selected photographs of trustees that were judged with the highest levels of accuracy in prior investigations (De Neys et al., 2015). These results do not provide independent accuracy estimates, and it is unclear whether the findings generalize to other stimulus sets. Here, we provide a stronger test of the generalizability of prior results by examining accuracy using novel samples of participants and stimuli. Second, past research uncovered several moderators (e.g., above-chance accuracy for female, but not male counterparts), but these moderators did not emerge consistently across studies (Tognetti et al., 2013). We examine the robustness of the proposed moderators by testing whether participants show above-chance accuracy (a) when they make trust decisions vs. provide explicit trustworthiness ratings, (b) when they rate cropped images (with non-facial features removed) vs. uncropped images, and (c) when the trustee is male vs. female.
Third, several scholars have posited that facial appearance is not indicative of actual trustworthiness (Efferson & Vogt, 2013). Yet, existing studies have exclusively focused on statistical methods that cannot provide evidence for such a null hypothesis. The present research addresses this issue by reporting the results of Bayesian analyses (alongside frequentist statistics), which can quantify evidence in favor of the null hypothesis (Wagenmakers, 2007).
Fourth, unlike many previous studies (De Neys et al., 2013; Rule et al., 2013), we draw upon experimental methods from economics and use only fully incentivized experiments that avoid any deception (Camerer & Mobbs, 2017; Hertwig & Ortmann, 2001; Ortmann & Hertwig, 2002). This approach directly couples participants' responses to their financial payoffs in both the trust game and the prediction task, thereby motivating participants to correctly identify trustworthiness in their counterparts because this increases their own payouts. Decisions during payout-relevant trials were paid out not only to the trustors but also to the interaction partners whose faces were shown, to maintain the interactive nature of the trust game (e.g., Engelmann, Meyer, Ruff, & Fehr, 2019). Participants were explicitly made aware of and tested on these payout contingencies for themselves and others before the experiment started. The experiment was conducted in a laboratory that had established a reputation of never deceiving participants, which reduces second-guessing of the experimental instructions and thereby reduces decision noise and experimenter demand effects (Hertwig & Ortmann, 2001).
In short, our studies constitute a stronger test of the hypothesis that people can detect the trustworthiness of others based on their facial appearance. In Study 1 (n = 131), we test whether participants are more likely to entrust money to counterparts who are in fact trustworthy. We also compare participants' earnings to those expected under simple decision strategies that ignore facial appearance altogether (i.e., trust at random, always trust, never trust). This allows us to test whether knowing the facial appearance of counterparts gives participants a strategic advantage in social dilemmas. In Study 2 (n = 266), we examine accuracy using an alternative experimental design. We employ an incentivized prediction task and test whether participants can accurately predict the trustworthiness of counterparts based on facial photographs.
All data and analysis scripts are available at the Open Science Framework (https://osf.io/8wejn/). We report how our sample sizes were determined and all data exclusions and measures for each study.

Study 1
Study 1 consisted of two phases. In the first phase (n = 31), we obtained facial photographs and behavioral data from participants who acted as trustees in the trust game. In the second phase, a separate sample of participants (n = 131) made trust game decisions in the role of trustors while being matched with (and seeing photos of) the trustees of the first phase. All decisions were incentivized and both trustors and trustees received additional payments to control for social preferences (Engelmann et al., 2019). We first examined whether participants relied on their counterparts' facial appearance when making trust decisions. We tested in two ways whether participants exhibit more trust towards counterparts who are perceived as more trustworthy: by identifying the effects of variations in the perceived trustworthiness of faces, and by identifying the causal effect of perceived trustworthiness through manipulating counterparts' perceived trustworthiness with face morphing. The main goal of this study was to examine whether participants could accurately detect the trustworthiness of counterparts. We therefore tested (a) whether participants were more likely to transfer money to trustworthy counterparts and (b) whether knowledge of their counterparts' facial appearance allowed them to accumulate higher earnings in the trust game than simple decision strategies that ignore facial appearance (i.e., trust at random, always trust, always distrust).

Methods
The study procedures were approved by the University of Zurich Ethics Committee and participants gave informed consent.

Stimuli (trustees)
We first collected facial photographs and behavioral strategies for a sample of trustees. Participants (n = 84) were recruited from the University of Zurich participant pool and received a fixed payment of 20 CHF (ca. $22) and an additional payment that depended on their behavior in the study. At the end of the study, one round of the trust game was selected at random and participants received their earnings from that round. All decisions were therefore fully incentivized, which is important for the main studies because it means that the decisions reflect the true preferences of the trustees.
Participants received a written description of the "decision situation" (i.e., the trust game) and were informed that they would play five rounds with different counterparts in the role of the trustor or the trustee. In each round, both participants received an endowment of 12 CHF and the trustor could decide whether to send 10 CHF to the trustee. If the money was sent, it was tripled and transferred to the trustee. The trustee could then decide how much to send back to the trustor (between 0 and 30 CHF). We recorded trustees' behavior with the strategy method. Trustees indicated how much they wanted to send back in case the trustor decided to send 10 CHF. That is, they indicated their decision without knowing whether the trustor had in fact sent anything. Participants played five rounds with anonymous counterparts and did not receive feedback on their counterparts' behavior, except when they found out about their earnings after the payout-relevant trial was selected at the end of the experiment. This approach precludes learning and history effects from influencing decisions. The average amount of money that trustees returned to trustors (across the five rounds) constituted our measure of trustworthiness.
After completing the trust games, participants filled out a series of unrelated questionnaires and we took photographs of their faces. All photographs were taken from the same distance against a uniform background and participants were instructed to display a neutral facial expression. Similar to previous studies, we cropped the photographs to remove all non-facial features, such as hairstyle and earrings (see Fig. 1 for an example). Sixty-three participants consented to having their photographs and behavioral data used in future studies. In the current study, we focused on the photographs and behavioral data of trustees. One trustee was removed from analysis for being considerably older (> 3 SD above the mean) than the rest, leaving a final sample of 31 trustees (14 female).
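The payoff structure described above can be illustrated with a minimal Python sketch. The function and class names are hypothetical, and the sketch assumes that the trustee keeps their own 12 CHF endowment in addition to whatever remains of the tripled transfer; it is an illustration of the stated rules, not the software used in the experiment.

```python
from dataclasses import dataclass

ENDOWMENT = 12   # CHF given to each player in every round
STAKE = 10       # CHF the trustor may send
MULTIPLIER = 3   # the transferred amount is tripled


@dataclass
class RoundPayoff:
    trustor: float
    trustee: float


def trust_game_round(trust: bool, returned: float) -> RoundPayoff:
    """Payoffs for one round.

    `returned` is the trustee's strategy-method choice: the amount sent
    back if the trustor sends the 10 CHF (between 0 and 30 CHF).
    """
    if not 0 <= returned <= STAKE * MULTIPLIER:
        raise ValueError("returned amount must lie between 0 and 30 CHF")
    if not trust:
        # Nothing is transferred; both players keep their endowment.
        return RoundPayoff(ENDOWMENT, ENDOWMENT)
    # Assumption: the trustee keeps the endowment plus the unreturned
    # part of the tripled transfer.
    return RoundPayoff(
        trustor=ENDOWMENT - STAKE + returned,
        trustee=ENDOWMENT + STAKE * MULTIPLIER - returned,
    )
```

Under these assumptions, a trustor who sends the money breaks even only if the trustee returns at least 10 CHF, for example `trust_game_round(True, 10).trustor` equals the 12 CHF obtained by not trusting.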

Participants (trustors)
We recruited a separate sample of 273 participants from the University of Zurich participant pool. In the current study, we focus on 131 participants (M age = 22.85, SD age = 4.45; 45.80% female, 54.20% male) who were assigned the role of the trustor in the trust game. Participants received a fixed payment of 10 CHF (ca. $11) and were informed that they would receive an additional payment that depended on their behavior in the study. At the end of the study, one round of the trust game was selected at random and participants, both the trustor and the trustee, received their earnings from that round.

Procedure
Participants received the same instructions explaining the trust game as in the first phase of the study. They were informed that they would play 31 rounds in the role of the trustor with different counterparts. In each round, participants saw a photo of the trustee and decided whether to transfer nothing or 10 CHF of their 12 CHF endowment (see Fig. S1 in the Supplemental Materials). Participants also indicated what they expected the trustee to do (i.e., how much the trustee would send back in case they transferred the money) by designating amounts between 0 and 30 CHF. They indicated their confidence in the estimate on an eleven-point Likert scale ranging from "not at all certain" to "very certain". Participants did not receive feedback on their counterparts' behavior. After completing the 31 rounds of the trust game, participants saw the photographs of the trustees again and rated them on various characteristics, including trustworthiness, on a seven-point scale (see Table S1 in the Supplemental Materials for a description of all measures).

Treatment groups
Participants were randomly assigned to one of two conditions. In the "unmodified" condition (n = 56), participants saw the original facial photographs of the trustees.
In the "modified" condition (n = 75), participants saw photographs of the same 31 trustees, but we used face morphing software to manipulate the perceived trustworthiness of trustees. Specifically, we used computer-generated face prototypes that reflect the typical appearance of trustworthy-looking or untrustworthy-looking faces (see Fig. 1; Oosterhof & Todorov, 2008). For each trustee, we created a trustworthy-looking and an untrustworthy-looking version by morphing their face, using the software Psychomorph (Tiddeman, Burt, & Perrett, 2001), with a trustworthy-looking or untrustworthy-looking face prototype. We transformed each trustee's face shape towards the face shape of the computer-generated prototype by 30%. This procedure created subtle differences in facial appearance (without compromising the realistic nature of the face stimuli) that affected the perceived trustworthiness of trustees (see Fig. 1). On approximately half of the 31 rounds, participants in the modified condition saw the untrustworthy-looking (vs. trustworthy-looking) version of the trustee. They only played once with each trustee, that is, they only saw one face version for each trustee.

Analysis strategy
Analyses were based on 1736 observations in the unmodified condition (56 participants interacting with 31 trustees) and 2325 observations in the modified condition (75 participants interacting with 31 trustees), which were analyzed separately. All analyses were conducted in R (R Core Team, 2021). We used the lme4 package (Bates, Mächler, Bolker, & Walker, 2015) and the lmerTest package (Kuznetsova, Brockhoff, & Christensen, 2016) to estimate multilevel regression models with random intercepts and slopes (see Footnote 2). All continuous predictors were z-standardized prior to analysis (full model results are reported in the Supplemental Materials).
We followed the approach proposed by Wagenmakers (2007) to compute Bayes factors. Specifically, we estimated regression models with and without the variable of interest and computed the Bayesian information criterion (BIC), an indicator of model fit, for both models. By comparing the BICs of both models, we can estimate the extent to which the variable of interest increases model fit. We converted this measure to an approximation of the Bayes factor using the following formula: $BF_{10} \approx \exp\left(\frac{BIC(H_0) - BIC(H_1)}{2}\right)$, where BF10 represents the Bayes factor in favor of the alternative hypothesis and BIC(H1) and BIC(H0) denote the fit of the models with and without the variable of interest (Wagenmakers, 2007). We used the BayesFactor package with default priors (i.e., a Cauchy distribution with a width of $r = \sqrt{2}/2$; Morey & Rouder, 2018) to calculate Bayes factors for t-tests. We always display Bayes factors so that they reflect support for the favored hypothesis (i.e., BF10 when evidence favors the alternative hypothesis and BF01 when evidence favors the null hypothesis). To aid the interpretation of Bayes factors, we classify the evidence as anecdotal, moderate, strong, very strong, or decisive (see Jeffreys, 1961).
Footnote 2. Some models only converged when we implemented simpler random effects structures. Models with maximal and simplified random effects structure yielded very similar effect size estimates and significance levels. We therefore report the results of models with maximal random effects structure throughout the paper.
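The BIC approximation of Wagenmakers (2007) sets BF01 ≈ exp((BIC(H1) − BIC(H0)) / 2), so that BF10 is its reciprocal. The following Python sketch shows the conversion; the function names are hypothetical and the BIC values in the usage note are made-up illustrations, not values from our models (the actual analyses were run in R).

```python
import math


def bf10_from_bic(bic_h0: float, bic_h1: float) -> float:
    """Approximate Bayes factor in favor of H1 from two BIC values.

    bic_h0: BIC of the model without the variable of interest.
    bic_h1: BIC of the model with the variable of interest.
    A lower BIC indicates better fit, so BF10 > 1 when bic_h1 < bic_h0.
    """
    return math.exp((bic_h0 - bic_h1) / 2.0)


def bf01_from_bic(bic_h0: float, bic_h1: float) -> float:
    """Approximate Bayes factor in favor of H0 (reciprocal of BF10)."""
    return 1.0 / bf10_from_bic(bic_h0, bic_h1)
```

For instance, with hypothetical values BIC(H0) = 2000 and BIC(H1) = 2006, `bf01_from_bic(2000, 2006)` returns exp(3), that is, the simpler model is favored by a factor of about 20.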

Sensitivity analysis
We conducted sensitivity analyses for our main effect of interest (the relationship between participants' trust decisions and trustees' actual trustworthiness). We used the simr package (Green & Macleod, 2016) in R (R Core Team, 2021) to determine the smallest effect size we were able to detect with 80% power (and α = 5%). The package provides power estimates for fixed effects in multilevel regression models. We varied the effect of interest in our model and calculated power at each level. This showed that we had 80% power to detect an odds ratio of 1.29. For a one standard deviation increase in trustworthiness, we could detect a change in the probability of trust from, for example, 50.00% to 54.29%. Thus, our design had sufficient power to detect even low levels of accuracy.

Descriptive statistics
Across the five rounds, trustees returned an average of 6.78 CHF (SD = 6.87 CHF) of the transferred money. Nine trustees never returned anything, one trustee always returned half of the transferred money, and no trustee always returned everything. Thus, in the current sample, trust did not pay off on average, as trustees would have to return at least 10 CHF for trustors to break even. In the unmodified condition, trustors sent their 10 CHF endowment 54.09% of the time. Eleven participants (19.64%) never trusted whereas nine participants (16.07%) always trusted. In the modified condition, trustors sent their 10 CHF endowment 52.77% of the time. Fourteen participants (18.67%) never trusted whereas thirteen participants (17.33%) always trusted.

Manipulation check
To test whether our morphing manipulation affected the perceived trustworthiness of trustees in the modified condition, we compared participants' trustworthiness ratings of the two face versions. The morphed trustworthy faces (M = 3.25, SD = 0.40) were rated as significantly more trustworthy (with decisive evidence in favor of the alternative hypothesis) than the morphed untrustworthy faces (M = 2.32, SD = 0.42), t(30) = 16.48, p < .001, d = 2.96, BF10 = 3.92 × 10^13. Moreover, the morphed trustworthy faces were rated as significantly more trustworthy (with decisive evidence in favor of the alternative hypothesis) than the original faces (M = 2.89, SD = 0.44), t(30) = 7.96, p < .001, d = 1.43, BF10 = 1.84 × 10^6, while the morphed untrustworthy faces were rated as significantly less trustworthy (with decisive evidence in favor of the alternative hypothesis) than the original faces, t(30) = 10.92, p < .001, d = 1.96, BF10 = 1.53 × 10^9. Thus, our morphing procedure successfully manipulated the perceived trustworthiness of trustees.
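The effect sizes reported above are numerically consistent with the common conversion d = t / √n for paired comparisons, where n is the number of paired observations (here, the 31 trustees). The conversion used is not stated in the text, so the following Python sketch should be read as an assumption that reproduces the reported values, not as a description of the original computation; the function name is hypothetical.

```python
import math


def cohens_d_from_t(t: float, n: int) -> float:
    """Cohen's d for a paired or one-sample t-test, assuming d = t / sqrt(n).

    t: the t statistic.
    n: number of paired observations (degrees of freedom + 1).
    """
    return t / math.sqrt(n)


# Reported t statistics from the manipulation check, each with df = 30.
for t_value in (16.48, 7.96, 10.92):
    print(round(cohens_d_from_t(t_value, 31), 2))
```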

Reliance on facial appearance
First, we examined whether participants who saw the unmodified photographs relied on the facial appearance of trustees when deciding whom to trust. We estimated a multilevel regression model with random intercepts and slopes per participant in which we regressed participants' trust behavior (0 = did not transfer endowment, 1 = transferred endowment) on their trustworthiness ratings. This yielded a positive effect with very strong evidence in favor of the alternative hypothesis, β = 0.864, SE = 0.173, OR = 2.37, 95% CI [1.65, 3.58], p < .001, BF10 = 41.23 (see Fig. 2A, Table S2). Participants were more likely to trust when they perceived their counterparts as trustworthy.
The positive relationship between perceived facial trustworthiness and trust behavior may also reflect a consistency effect. Rather than relying on the facial appearance of counterparts when making trust decisions, participants may have rated counterparts as more trustworthy because they trusted them. We addressed this alternative explanation in two ways. First, we computed average trustworthiness ratings of counterparts across all participants. Using this average trustworthiness rating instead of individual ratings, perceived trustworthiness was again positively related to the probability of trust (with decisive evidence in favor of the alternative hypothesis), β = 0.556, SE = 0.093, OR = 1.74, 95% CI [1.44, 2.16], p < .001, BF10 = 601.0 (see Fig. 2B, Table S3). Second, to estimate the causal effect of facial appearance on trust decisions more directly, we analyzed the effect of our morphing manipulation on participants' behavior in the modified condition. Increasing (vs. decreasing) the facial trustworthiness of trustees had a positive influence on the probability of trust (with decisive evidence in favor of the alternative hypothesis), β = 0.994, SE = 0.149, OR = 2.70, 95% CI [1.98, 3.69], p < .001, BF10 = 2314 (see Fig. 2C, Table S4). Together, these results show that participants relied on the facial appearance of counterparts when making trust decisions.
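Because trust behavior is binary, the coefficients above are on the log-odds scale and the reported odds ratios are their exponentials. The short Python sketch below shows this conversion and, as an illustration only, the implied change in the probability of trust for a hypothetical participant with a 50% baseline; the function names are hypothetical, and the probability calculation ignores the random effects of the multilevel models.

```python
import math


def odds_ratio(beta: float) -> float:
    """Convert a logistic regression coefficient (log-odds) to an odds ratio."""
    return math.exp(beta)


def shifted_probability(baseline_p: float, beta: float) -> float:
    """Probability after a one-unit predictor increase from `baseline_p`.

    Illustrative only: treats the coefficient as applying to a single
    participant with the given baseline probability.
    """
    odds = baseline_p / (1.0 - baseline_p) * math.exp(beta)
    return odds / (1.0 + odds)


# Reported coefficients: individual ratings, average ratings, morphing.
for beta in (0.864, 0.556, 0.994):
    print(round(odds_ratio(beta), 2))
```

The three printed odds ratios match the values reported in the text (2.37, 1.74, and 2.70).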

Trustworthiness detection
Fig. 1. Exemplary stimuli. The image in the middle shows the original photograph that was displayed to participants in the unmodified condition. This image was morphed with the computer-generated trustworthy-looking and untrustworthy-looking face prototypes on the left and right, respectively, to create realistic faces with decreased or increased perceived facial trustworthiness. These morphed faces were displayed to participants in the modified condition. Participants viewed all images in color.
The above results show that participants relied on facial trustworthiness judgments when making trust decisions. But is this wise given that decisions had real financial consequences? We therefore asked whether participants were able to detect the true trustworthiness of counterparts based on the facial photographs. To address our main research question, we tested whether participants who saw the unmodified photos were more likely to trust counterparts who were actually more trustworthy. We regressed trust behavior on the behavioral (not the perceived) trustworthiness of counterparts (i.e., the average amount of money that trustees had returned to trustors), which yielded no significant effect and very strong evidence in favor of the null hypothesis, β = 0.048, SE = 0.075, OR = 1.05, 95% CI [0.89, 1.23], p = .52, BF01 = 34.50 (see Table S4, Model 1). Thus, results showed that participants were not able to detect the true trustworthiness of counterparts based on trustworthiness inferences from photographs.
This last result suggests that reliance on the facial appearance of counterparts did not pay off, which we tested directly by comparing our participants' average performance to that of other decision strategies. If knowledge about the facial appearance of trustees actually gives trustors a strategic advantage, then participants' earnings across the 31 rounds should be higher than the earnings of a person who trusts at random. Participants earned an average of 257.1 CHF across the 31 rounds (SD = 40.69 CHF). Crucially, participants' earnings were not significantly higher (with substantial evidence in favor of the null hypothesis) than the earnings of a trustor choosing at random (M = 260.1 CHF), t(55) = 0.55, p = .58, d = 0.07, BF01 = 5.93 (see Fig. 3).
Another simple but potentially viable strategy for making trust decisions would be to (a) estimate whether trust will pay off on average (in the current context, whether trustees will on average return more than 10 CHF) and (b) always trust if it does or always distrust if it does not. Participants' earnings were higher than those of an always-trust strategy (M = 210.0 CHF) with decisive evidence in favor of the alternative hypothesis, t(55) = 8.66, p < .001, d = 1.16, BF10 = 1.22 × 10^9, but lower than those of an always-distrust strategy (M = 310.0 CHF) with decisive evidence in favor of the alternative hypothesis, t(55) = 9.73, p < .001, d = 1.30, BF10 = 5.25 × 10^10 (see Fig. 3). Together, these results suggest that having access to the facial appearance of trustees did not give participants a strategic advantage. In fact, knowledge about the base rate of trustworthiness in the current sample of trustees (i.e., the fact that trust did not pay off on average) and a resulting strategy of consistent distrust would have resulted in higher earnings.
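The benchmark strategies can be sketched in a few lines of Python. The sketch assumes that earnings are counted on the 10 CHF stake only (10 CHF per round when not trusting, the returned amount when trusting), which is what the reported benchmarks of 210.0 and 310.0 CHF over 31 rounds suggest; the function names are hypothetical, and the flat list of 6.78 CHF returns is a stand-in for the trustee-level data, so totals differ slightly from the reported values because of rounding.

```python
STAKE = 10.0  # CHF at risk in each round


def strategy_earnings(returns, p_trust):
    """Expected total earnings when trusting each trustee with probability p_trust.

    returns: amount (CHF) each trustee sends back if trusted.
    p_trust: 1.0 = always trust, 0.0 = always distrust, 0.5 = trust at random.
    """
    return sum(p_trust * r + (1.0 - p_trust) * STAKE for r in returns)


def benchmarks(returns):
    """Expected earnings of the three strategies that ignore facial appearance."""
    return {
        "always_trust": strategy_earnings(returns, 1.0),
        "always_distrust": strategy_earnings(returns, 0.0),
        "random": strategy_earnings(returns, 0.5),
    }


# Stand-in data: 31 trustees who each return the sample mean of 6.78 CHF.
print(benchmarks([6.78] * 31))
```

With this stand-in data, always trusting yields about 210 CHF, trusting at random about 260 CHF, and always distrusting 310 CHF, which mirrors the ordering of the benchmarks reported above.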

Moderators of accuracy
Next, we tested the hypothesis that accuracy in trustworthiness detection only emerges under specific conditions. Contrary to Tognetti et al. (2013), who found evidence for increased accuracy when participants were judging male but not female trustees, accuracy did not vary as a function of trustees' gender (with strong evidence in favor of the null hypothesis), β = −0.036, SE = 0.163, OR = 0.96, 95% CI [0.69, 1.34], p = .83, BF01 = 40.74 (Table S4, Model 2). We also explored whether accuracy varied as a function of trustors' gender or trustors' confidence in the accuracy of their expectations of reciprocity (Table S4, Models 3 and 4), but found no significant results and very strong to decisive evidence in favor of the null hypothesis.

Additional analyses
Two additional variables were recorded that provide additional insights about participants' knowledge of the trustworthiness of the trustees: participants' expectancy of reciprocity and their explicit trustworthiness ratings. We first analyzed whether participants' reciprocity expectation was associated with their counterparts' actual trustworthiness. The relationship between how much participants expected trustees to return and how much they actually returned was not significant with decisive evidence in favor of the null hypothesis, β = 0.050, SE = 0.104, 95% CI [−0.151, 0.249], p = .63, BF01 = 142.1 (Table S5). There was a significant positive association between explicit trustworthiness ratings and trustees' actual trustworthiness, β = 0.078, SE = 0.027, 95% CI [0.028, 0.136], p = .004, BF01 = 9.91 (Table S6). It should be noted that this relationship was very small and Bayesian analyses indicated substantial evidence in favor of the null hypothesis.
Fig. 3. Participants' cumulative earnings across the 31 rounds compared to the expected earnings of three simple decision strategies: trust at random, always trust, and always distrust. Participants' earnings are not distinguishable from a random investment strategy, and they would have earned significantly more by never trusting at all.

Study 2
Results of Study 1 suggest that participants were not able to detect the trustworthiness of counterparts based on facial photographs. However, decisions in the trust game may be motivated by considerations other than the expected trustworthiness of counterparts. For instance, people may transfer money not because they think that their counterpart will reciprocate trust, allowing them to maximize their earnings, but because there is an injunctive norm to trust and to not question a counterpart's character (Dunning, Anderson, Schlösser, Ehlebracht, & Fetchenhauer, 2014). In Study 2, we therefore examined trustworthiness detection accuracy with an incentivized prediction task, in which participants' earnings were tied to the accuracy of their predictions. Participants viewed the same cropped images used in the unmodified condition of Study 1 and predicted the trustworthiness of trustees. We also examined detection accuracy for uncropped images in a separate condition.

Methods
The study procedures were approved by the University of Zurich Ethics Committee and participants gave informed consent.

Participants
We recruited a sample of 266 participants from the University of Zurich participant pool (M age = 22.31, SD age = 3.38; 49.25% female, 50.75% male). Participants received a fixed payment of 20 CHF (ca. $22) and a variable payment that depended on the accuracy of their guesses (see the Supplemental Materials for the exact payoff formula).

Procedure and treatment groups
Participants received written instructions that explained the trust game played by the stimulus group. They were asked to view photographs of these players and to guess the behavior of the players as accurately as possible. The instructions were followed by a comprehension test that checked whether participants had understood the game and the manner in which their own payments related to their guessing accuracy. Participants could not begin the study until all comprehension questions had been answered correctly. When viewing images of trustors, participants were asked to guess in what percentage of rounds the person sent 10 CHF, on a scale that ranged from 0% to 100%. When viewing images of trustees, participants were asked to guess the average amount that the trustee sent back, on a scale that ranged from 0 CHF to 30 CHF. For each guess, participants also indicated their confidence in the estimate on an 11-point Likert scale ranging from "not at all certain" to "very certain". Here, we analyze participants' predictions of trustees' behavior.
Of the 266 participants who took part in Study 2, 174 were randomly assigned to the "cropped" condition and 92 to the "uncropped" condition. In the cropped condition, participants viewed the same set of 31 facial photographs as participants in the unmodified condition of Study 1. That is, faces were cropped to remove all non-facial features, such as hairstyle and earrings (see Fig. 1). In the uncropped condition, participants viewed the original images without the oval cropping.

Analysis strategy
We followed the same analysis strategy as in Study 1. For all tests, we report the results of frequentist and Bayesian analyses. We estimated cross-classified multilevel regression models with random intercepts and slopes per participant and trustee (full model results are reported in the Supplemental Materials).

Sensitivity analysis
We again conducted sensitivity analyses for our main effect of interest (the relationship between predicted and actual trustworthiness in the cropped and uncropped conditions). For participants in the cropped condition, we had 80% power to detect an effect of 0.10. In other words, for a one-point increase in actual trustworthiness, we could detect a 0.10-point increase in predicted trustworthiness. For participants in the uncropped condition, we had 80% power to detect an effect of 0.14. In other words, for a one-point increase in actual trustworthiness, we could detect a 0.14-point increase in predicted trustworthiness. Thus, our design had sufficient power to detect even low levels of accuracy.

Trustworthiness detection
Were participants able to detect the trustworthiness of counterparts? We estimated a multilevel regression model with random intercepts and slopes per participant, in which we regressed the predicted reciprocation rate of trustees on their actual reciprocation rate. For participants who viewed the same cropped images as participants in Study 1, there was no significant relationship between how much participants expected trustees to return and how much they had actually returned (with very strong evidence in favor of the null hypothesis), β = 0.  (Table S8, Model 1). Just as in the trust game, participants who were incentivized to explicitly predict the trustworthiness of trustees were not able to do so.

Moderators of accuracy
We again tested the hypothesis that accuracy in trustworthiness detection only emerges under some conditions. Accuracy did not vary as a function of trustees' gender (with substantial to very strong evidence in favor of the null hypothesis) in the cropped condition, β = −0.283, SE = 0.496, OR = 0.75, 95% CI [0.27, 2.10], p = .57, BF01 = 30.47 (Model 2, Tables S7 and S8). We also explored whether accuracy varied as a function of participants' gender (Model 3, Tables S7 and S8) or participants' confidence in the accuracy of their expectations of reciprocity (Model 4, Tables S7 and S8), but found no significant results and very strong to decisive evidence in favor of the null hypothesis.

Additional analyses
To replicate results from Study 1 outside the context of a trust game, we again analyzed the accuracy of explicit trustworthiness ratings of the facial photographs. We did not find a significant association between trustworthiness ratings and the actual reciprocation rate of trustees (with decisive evidence in favor of the null hypothesis) for participants who viewed the cropped images, β = 0.066, SE = 0.102, 95% CI [−0.155, 0.294], p = .53, BF01 = 136.6 (Table S9), or for participants who viewed the uncropped images, β = 0.006, SE = 0.114, 95% CI [−0.237, 0.237], p = .96, BF01 = 109.0 (Table S10). Together, these results suggest that participants were not able to predict the trustworthiness of counterparts based on facial photographs.

General discussion
People spontaneously rely on the facial appearance of strangers when deciding whether they can be trusted to cooperate in social interactions. But can people actually detect the trustworthiness of strangers based on their facial appearance? Prior studies have yielded mixed results and the question remains the subject of vigorous debate (Todorov, Funk, & Olivola, 2015). Yet the empirical evidence on the topic is limited. Many studies were based on the same set of stimuli, which limits the generalizability of findings (De Neys et al., 2015). Conversely, studies providing evidence against accuracy relied on statistical techniques that cannot quantify evidence in favor of a null hypothesis, which complicates the interpretation of results (Efferson & Vogt, 2013; Rule et al., 2013).
We conducted two studies to address these limitations. Confirming results from previous studies (Jaeger, Evans, Stel, & van Beest, 2019), we found that participants relied on the perceived trustworthiness of counterparts when making trust decisions. However, on average, participants failed to entrust money to counterparts who were actually more trustworthy. Bayesian analyses yielded very strong support for the null hypothesis, indicating that our participants were not able to accurately detect the trustworthiness of their interaction partners. We also found that participants' earnings were not higher than the expected earnings of a decision strategy that trusts at random. This suggests that knowledge of their counterparts' facial appearance did not give participants a strategic advantage. In fact, participants would have earned more by consistently distrusting all counterparts, as trust did not pay off in the current sample.
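
The comparison between these decision strategies can be made concrete with a short sketch. It assumes a binary trust game in which the trustor either keeps 10 CHF or sends it, and in which a trusted trustee returns some amount between 0 and 30 CHF. The amount kept when distrusting, the function name, and the return amounts below are assumptions invented for illustration and are not the study's data. Under these assumptions, trusting at random earns a weighted average of the two pure strategies, so whenever the average return falls below 10 CHF, consistently distrusting earns the most.

```python
import statistics

ENDOWMENT = 10.0  # CHF the trustor keeps when not trusting (assumed)


def expected_earnings(returns, trust_probability):
    """Expected trustor earnings when every counterpart is trusted
    with the same fixed probability, independent of their face.

    returns: amounts (CHF) each trustee would send back if trusted
        (hypothetical values).
    trust_probability: 0.0 = always distrust, 1.0 = always trust,
        0.5 = trust at random with equal odds.
    """
    if not 0.0 <= trust_probability <= 1.0:
        raise ValueError("trust_probability must lie in [0, 1]")
    mean_return = statistics.fmean(returns)
    return trust_probability * mean_return + (1.0 - trust_probability) * ENDOWMENT


# Invented return amounts whose mean lies below the 10 CHF endowment.
hypothetical_returns = [0.0, 5.0, 10.0, 12.0, 13.0]

always_distrust = expected_earnings(hypothetical_returns, 0.0)
trust_at_random = expected_earnings(hypothetical_returns, 0.5)
always_trust = expected_earnings(hypothetical_returns, 1.0)
```

A participant who could detect trustworthiness would beat all three baselines by trusting only those counterparts who return more than the endowment; earnings no higher than the random baseline are what one would expect without such an ability.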
Previous studies found evidence in favor of detection accuracy only under specific conditions, and these conditions varied across studies (Tognetti et al., 2013; Verplaetse et al., 2007). Here, we tested these proposed moderators, but found no evidence for better-than-chance trustworthiness detection (a) for male or female counterparts, (b) when making trust decisions or when providing explicit trustworthiness ratings, and (c) when viewing cropped images (in which all non-facial features were removed) or uncropped images. In sum, our results provide consistent evidence against accuracy in trustworthiness detection from faces across various conditions.
Previous investigations have shown that trustworthiness impressions guide decision-making in many domains, including legal sentencing, personnel selection, and financial decision-making (Olivola et al., 2014). People even rely on trustworthiness impressions from faces when more diagnostic cues are available (Jaeger et al., 2019) and when decisions are highly consequential (Wilson & Rule, 2015). Future studies should explore whether some people are more prone to the biasing influence of first impressions and, importantly, how biases could be mitigated (for first attempts, see Chua & Freeman, 2021; Jaeger, Todorov, Evans, & van Beest, 2020; Shen & Ferguson, 2021). An important future task in this line of research will be to delineate how difficult it is to override these biases, particularly when other, more reliable information sources are available that may require more cognitive effort to process.

Limitations and future directions
Several limitations and constraints on the generalizability of the current results should be mentioned. Our results were based on samples of relatively young decision-makers from the University of Zurich. Additional studies are needed to examine the generalizability of our findings with larger and more diverse samples of both targets and raters.
Future studies should also examine the accuracy of trustworthiness impressions using varying types of stimuli. Cropped images, in which all non-facial aspects are removed, ensure that impressions are based on the facial features of counterparts. However, they represent only a relatively specific facet of the kinds of stimuli that people encounter in everyday interpersonal interactions. Accuracy may be better than chance when people have access to additional cues. For instance, previous findings suggest that people may be able to identify cooperative interaction partners with greater-than-chance accuracy after brief interactions (Brosig, 2002; DeSteno et al., 2012; Frank, Gilovich, & Regan, 1993; Reed, Zeglen, & Schmidt, 2012; but see Manson, Gervais, & Kline, 2013; McCullough & Reed, 2016). Ultimately, we believe that studies using a wide range of different stimuli are needed to map the accuracy of trustworthiness decisions under varying conditions.
We investigate one such condition here, namely the accuracy of trustworthiness judgments when judgments are solely based on facial features. This approach is informative for two reasons. First, even though people often have access to other cues, which may allow them to make more accurate judgments, there are also many situations in which a person's facial appearance is either one of the only cues or a particularly salient cue. People often engage with strangers and, in the first moments of the interaction, tend to judge them solely based on their appearance. Moreover, facial photographs are a common feature of many decision-making environments, including social media platforms (e.g., Twitter), professional networking sites (e.g., LinkedIn), and the sharing economy (e.g., Airbnb). Second, ample evidence suggests that people rely on facial appearance, even when they have access to other cues (Jaeger et al., 2019; Olivola et al., 2014). To determine whether reliance on facial appearance helps or hinders people in making accurate predictions, the accuracy of judgments that are solely based on facial appearance needs to be isolated. This requires a highly controlled and standardized study design, such as the one used in the current experiments, to ensure that judgments are based on the cue in question (Cox et al., 2015).

Declarations of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.