No evidence for a common blood microbiome based on a population study of 9,770 healthy humans

Human blood is conventionally considered sterile. Recent studies have challenged this, suggesting the presence of a blood microbiome in healthy humans. We present the largest investigation to date of microbes in blood, based on shotgun sequencing libraries from 9,770 healthy subjects. Leveraging the availability of data from multiple cohorts, we stringently filtered for laboratory contaminants to identify 117 microbial species detected in the blood of sampled individuals, some of which had signatures of DNA replication. These primarily comprise commensals associated with human body sites such as the gut (n=40), mouth (n=32), and genitourinary tract (n=18), and are distinct from the common pathogens detected in clinical blood cultures, based on more than a decade of records from a tertiary hospital. Contrary to the expectations of a shared blood microbiome, no species were detected in 84% of individuals, while only a median of one microbial species per individual was detected in the remaining 16%. Furthermore, microbes of the same species were detected in <5% of individuals, no co-occurrence patterns similar to microbiomes in other body sites were observed, and no associations between host phenotypes (e.g. demographics and blood parameters) and microbial species could be established. Overall, these results do not support the hypothesis of a consistent core microbiome endogenous to human blood. Rather, our findings support the transient and sporadic translocation of commensal microbes from other body sites into the bloodstream.

bioRxiv preprint doi: https://doi.org/10.1101/2022.07.29.502098; this version posted July 30, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

In recent years, there has been considerable interest in the existence of a microbiome in the blood of healthy individuals, and its links to health and disease. Human blood is traditionally considered a sterile environment (i.e., devoid of viable microbes), where the occasional entry and proliferation of pathogens in blood can trigger a dysregulated host response, resulting in severe clinical sequelae such as sepsis, septic shock or death 1. Asymptomatic transient bacteraemia (i.e., bacterial presence in blood) in blood donors is also known to be a major cause of transfusion-related sepsis 2. Recent studies have suggested the presence of a blood microbiome, providing evidence for microbes circulating in the blood of healthy individuals 3-7 (reviewed in Castillo et al. 8). However, most of these studies were either done in relatively small cohorts or lacked rigorous checks to distinguish true biological measurements from different sources of contamination 8. In this work, we analysed blood DNA sequencing data from a population study of healthy individuals, comprising multiple cohorts processed by different laboratories with varied sequencing kits. By leveraging the large dataset (n=9,770), complete with batch information, in our systematic differential analyses for potential contaminants, we aimed to determine whether a blood microbiome truly exists in the general population.

For meaningful discourse, it is useful to formalise what the presence of a hypothetical 'blood microbiome' entails. Berg et al. 9 concluded that the term microbiome should refer to a community of microbes that interact with each other and with the environment in their ecological niche, which in our context is human blood.
Therefore, in a blood microbiome, the microbial cells present in blood from healthy individuals should exhibit community structures indicated by co-occurrence or mutual exclusion of species 10, as seen in the microbiomes of other sites such as the gut 11 or mouth 12. Furthermore, we may expect the presence of core microbial species, which can be defined as species that are frequently observed and shared across individuals 13,14, such as Staphylococcus epidermidis on human skin 15. More precisely, taxa that are found in a substantial fraction of samples from distinct individuals (i.e. with high prevalence) may be considered 'core'. Notably, the prevalence threshold for defining core taxa is

proportion of detected species that are classified as contaminants decreased from 21% to 10% (Figure 1b). Next, the microbial species were compared against human blood culture records spanning more than a decade (2011-2021) from a tertiary hospital (Figure 1c). These blood cultures were typically ordered if clinical indications of bacteraemia were present, and therefore represent the range of microbial species that are known to cause symptomatic infection as detected in a clinical setting.
The proportion of species that have been cultured from blood increased from 12% to 27% after decontamination, suggesting that our filtering procedures enriched for microbial species that are capable of invading the bloodstream. Finally, we compared the proportion of human-associated microbes before and after decontamination using a host-pathogen association database describing the host range of pathogens 31 (Figure 1d). For species that were not found in this database, a systematic PubMed search (Methods) was performed to determine if there was at least one past report of human infection. The proportion of human-associated species increased from 40% to 78% after decontamination, indicating that the retained species are more likely to be biologically relevant. These results collectively suggest that by using a set of contaminant-identification heuristics, our filters effectively retain a higher proportion of biologically relevant taxa while removing likely contaminants.

Blood microbial signatures from healthy individuals reflect sporadic translocation of commensals
We next determined the fraction of distinct, healthy individuals in which microbes could be detected (i.e., prevalence). Notably, the most prevalent microbial species, C. acnes, was observed in 4.7% of individuals (Figure 2a), suggesting that none of the 117 microbes can be considered 'core' species that are consistently detected across most healthy individuals. Additionally, we did not detect any microbial species in most (82%) of the samples after decontamination (Figure 2b), whereas the remaining 18% of samples had a median of only one microbial species per sample. This low number of species detected per sample was not due to insufficient sequencing depth, since there was a weak negative correlation between the number of confidently detected species per sample and the microbial read depth (Spearman's ρ = -0.232, p<0.001). Furthermore, some samples containing no microbial species had a microbial read count of up to ~2.1 million (median=6,187 reads; distribution shown in Supplementary Figure 2). That is, even though a considerable number of reads were classified as microbial, they were all assigned to contaminant species. These results suggest that the presence of microbes in the blood of healthy and apparently asymptomatic individuals, as estimated by our detection methods, is infrequent and sporadic.

Given past reports of bacterial translocation from the mouth 32 or gut 33 into blood, we asked if the microbes we detected could have originated from various body sites.
To do so, we assigned potential body site origins to the 117 microbial species detected in blood based on microbe-to-body-site mappings extracted from the Disbiome database 34. We found that many (n=59; 50%) of these confidently detected species are indeed human commensals that are present at various human body sites (Figure 2c). This, together with their low prevalence, suggests that the microbial DNA of these species may have transiently translocated from other locations in the body rather than being endogenous to blood. We further categorised the microbial species based on their growth environments (Figure 2d). A substantial portion (n=42; 36%) of the species were obligate anaerobes or obligate intracellular microbes, atypical of the skin-associated microbes that may be introduced during phlebotomy 2, indicating that they are not likely to be sampling artefacts. All in all, the diverse origins of the microbes detected in blood, together with their low prevalence across a healthy population, are consistent with sporadic translocation of commensals into the bloodstream.

Microbial presence in blood (i.e., bacteraemia) is typically associated with a range of clinical sequelae from mild fevers to sepsis. As such, we asked if the common microbes identified in patients with disease-associated bacteraemia are different from those detected in our cohorts of healthy individuals. To do so, we compared the prevalence of microbes detected in the sequenced blood samples against observations from 11 years of hospital blood culture records. The prevalence of microbial genera detected in the hospital blood culture records clearly differed from that in our sequenced blood samples, despite the overlap in detected taxa (Figure 2e).
For example, while Staphylococcus, Escherichia and Klebsiella were the predominant genera identified in blood cultures, they were rarely detected in our blood sequencing libraries. These findings may be explained by the potentially higher virulence of pathogens detected in the clinic, which are more likely to cause clinical symptoms in individuals that would result in exclusion during our recruitment process. Conversely, our findings suggest that the microbes detected in the blood of healthy individuals are potentially better tolerated by the immune system (e.g. Bifidobacterium spp. 35 and Faecalibacterium prausnitzii 36, gut commensals with immunomodulatory properties; Figure 2a).

Evidence for replicating microbial cells but without community structure or host associations
To better characterise the microbial DNA signatures detected in blood, we asked if they reflect the presence of viable microbial cells as opposed to circulating cell-free DNA. This is because the former would allow for complex microbe-microbe or microbe-host interactions that would be of greater and more direct clinical relevance. In contrast to previous approaches that used microbial cultures 3,37, we looked for more broad-based evidence of live bacterial growth by applying replication rate analyses 21,22 to our sequenced blood samples.
This approach is based on the principle that DNA sequencing of replicating bacteria would yield an increased read coverage (i.e., peak) nearer to the origin of replication (Ori) and decreased coverage (i.e., trough) nearer to the terminus (Ter) 22. A coverage peak-to-trough ratio (PTR) greater than one is indicative of bacterial replication. Through this analysis, we found evidence for

genomes, suggesting that the replication rate analyses are reliable. Additionally, all but one of these replicating species are present in hospital blood culture records and in previous reports of bacteraemia 39-48 (Figure 3a), indicating their ability to replicate in human blood. Overall, beyond the detection of microbial DNA, we present the first culture-independent evidence for replicating bacterial cells in blood.

iners, and Gardnerella vaginalis (Supplementary Figure 3b). Similarly, we found enrichment of gut-associated bacteria such as Bifidobacterium spp. in GUSTO (Supplementary Figure 3c). These findings suggest that bacterial translocation may be more frequent in infants relative to adults, though differences in sample collection (umbilical cord versus venipuncture) could also partially explain them.
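As a rough illustration of the peak-to-trough principle described earlier in this section, the sketch below computes a PTR from binned read coverage along a circular genome. It is not the replication rate tooling cited in the text (21,22): the function name, the moving-average smoothing, and the toy coverage profiles are all assumptions made for illustration only.

```python
def peak_to_trough_ratio(coverage, window=3):
    """Return max/min of circularly smoothed per-bin coverage.

    Hypothetical helper: `coverage` is a list of read depths for equal-sized
    bins ordered along the genome; `window` is an odd smoothing width.
    """
    n = len(coverage)
    if n == 0:
        raise ValueError("coverage must not be empty")
    half = window // 2
    smoothed = []
    for i in range(n):
        # Circular moving average, since the genome is treated as circular here.
        vals = [coverage[(i + j) % n] for j in range(-half, half + 1)]
        smoothed.append(sum(vals) / len(vals))
    trough = min(smoothed)
    if trough <= 0:
        raise ValueError("trough coverage must be positive to form a ratio")
    return max(smoothed) / trough


# Toy profiles with made-up numbers: coverage highest near bin 0 (Ori-like)
# and lowest near bin 5 (Ter-like), versus a completely flat profile.
replicating = [20, 18, 16, 14, 12, 10, 12, 14, 16, 18]
flat = [15] * 10

print(peak_to_trough_ratio(replicating))  # greater than 1
print(peak_to_trough_ratio(flat))         # 1.0
```

A flat profile gives a ratio of exactly one, whereas coverage that declines from an Ori-like peak to a Ter-like trough gives a ratio above one, which is the signature the text interprets as replication.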
Next, we systematically tested for pairwise associations between eight host phenotypes that were documented on the day of blood collection and the presence of each of the 117 microbial species detected in blood. These host phenotypes were: sex, ancestry, age, body mass index (BMI), blood total cholesterol (TC), blood triglycerides (TG), and systolic and diastolic blood pressure (SBP and DBP). Given the multiple large independent cohorts, we could perform statistical tests on each cohort separately, which allowed us to assess the consistency of identified association patterns across the different cohorts. Since these cohorts were sampled from a homogeneous population, true association patterns are expected to be detected repeatedly regardless of cohort. Using this statistical testing approach, we found only five significant microbe-phenotype associations (p<0.05; Supplementary Table 3) after adjusting for multiple comparisons. Notably, all but one of the significant associations were present in only one cohort. The exception was C. acnes, which was significantly associated with ancestry in two cohorts. However, while C. acnes was more prevalent in individuals of Malay ancestry within the SEED cohort, it was more prevalent in Chinese individuals within the MEC cohort (Supplementary Figure 4). These cohort-specific differences could be due to other demographic variables that were not recorded in this study, or perhaps to C. acnes subspecies differences. To ensure that we did not miss any associations due to the possible non-linearity of host-phenotype and microbial relationships, we also derived categorical phenotypes based on the recorded phenotypic information.
These include being elderly (age>=65), and other measures of 'poorer health', such as being obese (BMI>30), having high blood triglycerides (TG>2.3 mmol/L), high total cholesterol (TC>=6.3 mmol/L), or high blood pressure (SBP>=130 and DBP>=80). We then tested for pairwise associations between these derived phenotypes and the presence of any bacteria but found no significant associations (p>0.05; Supplementary Table 4). Collectively, these results suggest no consistent associations between the presence of microbes in blood and the host phenotypes tested within a healthy population of
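The derived categorical phenotypes listed above can be written down as a small function. This is a minimal sketch that only encodes the cut-offs quoted in the text; the function name, argument names and dictionary keys are hypothetical, and the statistical test subsequently applied to these categories is not shown because it is not specified here.

```python
def derive_phenotypes(age, bmi, tg, tc, sbp, dbp):
    """Map recorded measurements to the binary phenotypes quoted in the text.

    tg and tc are in mmol/L. Names are illustrative, not the authors' own.
    """
    return {
        "elderly": age >= 65,
        "obese": bmi > 30,
        "high_tg": tg > 2.3,
        "high_tc": tc >= 6.3,
        # Both conditions must hold, as written in the text (SBP>=130 and DBP>=80).
        "high_bp": sbp >= 130 and dbp >= 80,
    }


# Made-up individual for illustration.
example = derive_phenotypes(age=70, bmi=24.5, tg=1.1, tc=6.3, sbp=135, dbp=78)
print(example)
```

Note that the inequalities are not uniform (strict for BMI and TG, inclusive for age, TC and blood pressure), so boundary values such as TC = 6.3 mmol/L fall on different sides depending on the phenotype.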

We present the largest-scale analysis to date of microbial signatures in human blood, with rigorous accounting for computational and contamination artefacts, and found no evidence for a common blood microbiome in a healthy population. Instead, we observed mostly sporadic instances of blood harbouring DNA from single microbial species of diverse bodily origins, some of which might be actively replicating. Our findings hint at the possibility that the bloodstream represents a route for movement of microbes between different body sites in healthy individuals. However, the low prevalence of the detected species suggests that this movement is likely to be infrequent and transient. Unresolved questions remain about how interconnected the microbiomes at various body sites are, and whether these processes are altered during disease or throughout a person's lifetime. Can perturbations to the microbial community at one body site affect that at another site, and how does the host immune system asymptomatically regulate microbial presence in blood? Our study lays the groundwork for future investigations into these questions, which may pave the way for a systemic understanding of the human microbiome across body sites in relation to human health and disease.

We found no core species in human blood on the basis of low prevalence across individuals in our population-level dataset. The prevalence estimates provided in this study are contingent on the sensitivity of detecting microbes through sequencing. Previous studies have shown that untargeted shotgun sequencing is highly sensitive for the detection of microbes in blood at a total sequencing depth of 20-30 million reads per sample 53-55, perhaps even more so than culture-based methods 56,57. In contrast, a median of 373 million reads was generated per sample for our sequencing libraries, suggesting that our methods do not lack sensitivity.
Our prevalence estimates are also affected by the abundance thresholds used to determine whether a species is present in a single sample (i.e., abundance filter; Figure 1a). We defined these thresholds in terms of both absolute read count and relative abundance, which were determined based on simulation experiments (see Methods). Overly stringent abundance thresholds would lead to the erroneous masking of genuine signals, and hence to an underestimation of microbial prevalence. However, even when relaxing the threshold to just a relative abundance of 0.001, none of the species, whether flagged as a contaminant or not, had more than 52% prevalence (Supplementary Table 5). Furthermore, the 20 most prevalent species at this threshold are all environmental microbes, mostly comprising Sphingomonas and Bradyrhizobium species, which are known to be common sequencing-associated contaminants 19. This suggests that independent of our decontamination filters, none of the species detected qualify as core members.

In addition to not being able to detect any core species, we could not detect any strong co-occurrence or mutual exclusion associations between species regardless of whether our decontamination filters were applied. These associations generally reflect cooperation or competition between species, respectively 58. Indeed, within a microbial community, metabolic dependencies of species and the ability of different species to complement these dependencies have been shown to be a key driver of microbial co-occurrence 59.
On the other hand, competitive behaviours, such as sequestering nutrients to deprive potential competitors or producing adhesins to bind and occupy favourable sites in an environment 60, can lead to mutual exclusion between species. The fact that we could not detect any strong associations therefore points to the absence of an interacting microbial community in the blood of healthy humans. Of note, since our dataset was derived from circulating venous blood, we are, in principle, not able to detect microbial interactions that may be occurring at other sites of the bloodstream such as the inner endothelial lining of blood vessels. Experiments investigating the adherence of bacteria to blood vessel linings may provide further insight into this.

The availability of 11 years of blood culture records from the same country of origin as our blood samples enabled a reliable comparison of the prevalence of microbes in the healthy population and in the clinic. This is because the frequency of infections caused by different microbial species is known to differ from country to country 61. Despite this, we expect that some of the variation in prevalence estimates may be due to the differences in detection methods. That said, previous studies have shown a strong concordance between culture and sequencing-based detection 53,54,56,57, suggesting that the distinction between the prevalence of microbes found in healthy individuals and in the clinic is not due to the differences in detection methods.
Our results support the conclusion that microbial presence in blood (i.e., bacteraemia) does not always lead to disease. These results are consistent with our other observation that microbes detected in our cohorts of asymptomatic individuals tend to be commensals, which may inherently be less virulent and better tolerated by the host compared to disease-causing pathogens. Indeed, the long-standing co-evolution of humans and colonizing microbes,

vis common blood culture pathogens may be the key to design therapeutics to manage or prevent the dysregulated host response that defines sepsis 1.

We found no convincing associations between either measured (e.g. TC, SBP) or derived (e.g. obesity) host phenotypes and microbial presence. This suggests that the risk of transient microbial translocation, at least across our cohorts of healthy adults, is fairly consistent. In contrast, this risk may increase in individuals with more severe disease. In fact, variable microbial DNA profiles in blood have been used to delineate health and disease states. This has most prominently been shown for sepsis 53-57,65, where the presence of viable microbes is expected, but also for cancer 30, periodontal disease 51, and chronic kidney disease 66, which are unrelated to bloodstream infections. These studies highlight the promise of metagenomic sequencing of blood for developing diagnostic, prognostic, or therapeutic tools. Our characterisation of the species breadth in healthy individuals forms a crucial baseline for comparison with that in diseased individuals.
Indeed, our findings open new doors to understanding why and how blood microbial profiles correlate with health status. One possible hypothesis is that mucosal integrity is compromised in a disease state, leading to higher translocation rates of microbes into the bloodstream. This is consistent with findings of increased intestinal permeability (i.e., 'leaky gut') in disease or even during physiological stress 67. Future studies testing this hypothesis may consider a focus on the gut-associated bacteria that were detected in our study (e.g. Bifidobacterium adolescentis, Faecalibacterium

Table 6.

Data pre-processing and quality control
The bioinformatic processing steps applied to the sequencing libraries are summarised in Figure 1a. Alignment of sequencing reads to the GRCh38 human reference genome was already performed as part of a separate study 68 using BWA-MEM v0.7.17 74. We retrieved read pairs in which neither member of the pair mapped to the human genome. Following this, we performed quality control of the sequencing reads. We trimmed low-quality bases (quality <Q10) at the ends of reads (base quality trimming) and discarded reads with an average read quality less than Q10 (read quality filter). We also discarded low-complexity sequences with an average entropy less than 0.6, using a sliding window of 50 and a k-mer length of five (low complexity read filter). All basic quality control steps were performed using bbduk from the BBTools suite v37.62 (sourceforge.net/projects/bbmap/).
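To give a sense of what the low complexity read filter measures, the sketch below scores a read by the Shannon entropy of its k-mers in sliding windows and averages the scores. It is an approximation written for illustration, not bbduk's implementation: the function names and, in particular, the normalisation to the range 0 to 1 are assumptions, so scores need not match those produced by BBTools.

```python
import math
from collections import Counter


def window_entropy(seq, k=5):
    """Shannon entropy of the k-mers in seq, scaled to [0, 1] (assumed scaling)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if len(kmers) < 2:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Divide by the largest entropy attainable with this many k-mers.
    return h / math.log2(min(total, 4 ** k))


def average_entropy(read, window=50, k=5):
    """Mean window entropy over all sliding windows of the read."""
    if len(read) <= window:
        return window_entropy(read, k)
    scores = [window_entropy(read[i:i + window], k)
              for i in range(len(read) - window + 1)]
    return sum(scores) / len(scores)


def passes_low_complexity_filter(read, min_entropy=0.6, window=50, k=5):
    """Keep reads whose average entropy is at least min_entropy."""
    return average_entropy(read, window, k) >= min_entropy


# Made-up reads: a homopolymer, a dinucleotide repeat, and a mixed sequence.
print(passes_low_complexity_filter("A" * 100))
print(passes_low_complexity_filter("AC" * 50))
print(passes_low_complexity_filter(
    "ACGTTGCAAGCTTAGGCATCGATCCGTAAGTCTGACCATGGTACGCTAGTTCAGGATCCAT"))
```

Homopolymers and short tandem repeats contain very few distinct k-mers per window and therefore score far below the 0.6 cut-off, while a read with mostly distinct k-mers scores close to one.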

Taxonomic classification of blood sequencing libraries
Taxonomic classification of non-human reads was done using Kraken2 v2.1.

To minimise noise in the taxonomic assignments, we defined a set of abundance thresholds whereby species with abundance values less than or equal to these thresholds (i.e., relative abundance≤0.05, read pairs assigned≤10) were counted as absent (set to zero read counts). We performed simulations to systematically determine a relative abundance threshold that minimizes false positive species assignments. Sequencing reads were simulated using InSilicoSeq v1.5.4 75 with error models trained on the SG10K_Health sequencing libraries and processed using the same bioinformatic steps as per the SG10K_Health dataset to obtain microbial taxonomic profiles. We simulated 373 million reads, equivalent to the median library read count of all samples, comprising reads from the GRCh38 human reference and ten microbial genomes

proportions. Due to read misclassification, some of the simulated reads were erroneously assigned to another species and produced false positives. A final relative abundance threshold of 0.005 that delineated these false positive assignments from true positives was selected (Supplementary Figure 5). Relative abundances were calculated by dividing the read count of a species in a sample by the total number of microbial reads assigned to that sample.
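The presence/absence rule described above can be sketched for a single sample as follows. Two points are assumptions rather than facts from the text: a species is treated as absent if it fails either threshold, and the default relative abundance cut-off is set to 0.005 (the value reported as final after simulation) although 0.05 is also quoted, so it is exposed as a parameter. The function name and the species labels are hypothetical.

```python
def apply_abundance_filter(read_pairs, max_absent_rel_abundance=0.005,
                           max_absent_read_pairs=10):
    """Set species at or below either threshold to zero in one sample.

    read_pairs: dict mapping species name -> read pairs assigned in the sample.
    Relative abundance is the species count divided by all microbial read
    pairs in the sample, as defined in the text.
    """
    total = sum(read_pairs.values())
    filtered = {}
    for species, n in read_pairs.items():
        rel = n / total if total else 0.0
        # Assumption: failing EITHER threshold marks the species as absent.
        present = rel > max_absent_rel_abundance and n > max_absent_read_pairs
        filtered[species] = n if present else 0
    return filtered


# Made-up counts for one sample (10,000 microbial read pairs in total).
sample = {"species_A": 9000, "species_B": 960, "species_C": 30, "species_D": 10}
print(apply_abundance_filter(sample))
```

In this toy sample, species_C is removed by the relative abundance threshold alone (30 of 10,000 read pairs), which shows why the choice between the two quoted cut-offs matters for low-abundance species.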

Decontamination filters
After application of the presence/absence filter, we identified and removed putative contaminants using established decontamination heuristics 26 that have been validated in previous studies 27,28, prior to our downstream analyses. These rules were applied using eight types of batch information: source cohort, DNA extraction kit type, library preparation kit type, and lot numbers for the sequencing-by-synthesis kit (box 1, box 2), paired-end cluster kit (box 1, box 2) and sequencing flow cell used. Other batch information, such as the pipettes and consumables used or the storage location and duration, was not recorded and could potentially contribute to some level of batch-specific contamination. However, these batches are expected to be correlated with the other types of batch information available, and so the resultant contaminants could in theory be accounted for using our filters. We describe the four decontamination filters used, as shown in Figure 1a, in sequential order:

(1) Prevalence filter. A microbial species is considered a contaminant specific to a batch if it is present at greater than 25% prevalence in that batch and has greater than a two-fold higher prevalence than that for any other batch. Batches with fewer than 100 samples were excluded from this analysis. This filter is based on the principle that species which are highly prevalent in some batches but lowly prevalent or absent in others are likely contaminants 26. We illustrate this for an

Table 6) are considered contaminants. This filter is based on the principle that species that can be repeatedly observed across different reagent batches are more likely to reflect genuine non-contaminant signals 26. Library preparation kit type was excluded from this analysis since only three kit types were used, with 86% of samples processed using one of the kits.

(4) Read count filter. A microbial species is considered a sequencing or analysis artefact if it is not assigned at least 100 reads in at least one sample. This filter is based on the principle that species that are always assigned a low number of read pairs, never exceeding the background noise within sequencing libraries, are more likely to be artefactual rather than genuine signals. An example of an artefactual species is Candidatus Nitrosocosmicus franklandus, which was assigned at most 22 read pairs by Kraken2 across 21 sequenced samples.
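Two of the filters above, the prevalence filter (1) and the read count filter (4), are stated precisely enough to sketch for a single species. The code below is an illustration under assumptions, not the authors' pipeline: the function names and batch labels are hypothetical, and "any other batch" is read as "at least one other batch", which is one possible interpretation of the rule.

```python
def batch_prevalence(presence_by_batch):
    """Fraction of samples per batch in which the species was detected.

    presence_by_batch: dict batch label -> list of booleans, one per sample.
    """
    return {b: sum(flags) / len(flags) for b, flags in presence_by_batch.items()}


def contaminated_batches(presence_by_batch, min_prevalence=0.25, fold=2.0,
                         min_batch_size=100):
    """Batches in which the species looks like a batch-specific contaminant."""
    # Batches with fewer than min_batch_size samples are excluded, as in the text.
    eligible = {b: f for b, f in presence_by_batch.items()
                if len(f) >= min_batch_size}
    prev = batch_prevalence(eligible)
    flagged = []
    for batch, p in prev.items():
        others = [q for b, q in prev.items() if b != batch]
        # Assumed reading: more than two-fold higher than at least one other batch.
        if p > min_prevalence and any(p > fold * q for q in others):
            flagged.append(batch)
    return flagged


def fails_read_count_filter(reads_per_sample, min_reads=100):
    """True if the species never reaches min_reads in any sample."""
    return max(reads_per_sample, default=0) < min_reads


# Made-up detection data for one species across three hypothetical kit lots.
presence = {
    "lot_A": [True] * 40 + [False] * 60,   # 40% prevalence, 100 samples
    "lot_B": [True] * 5 + [False] * 95,    # 5% prevalence, 100 samples
    "lot_C": [True] * 30 + [False] * 20,   # only 50 samples, so excluded
}
print(contaminated_batches(presence))
print(fails_read_count_filter([22, 5, 11]))
```

The last call mirrors the example given in the text, where a species whose largest per-sample count is 22 read pairs is treated as an artefact.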

Characterisation of microbial species
We classified microbial species as human-associated or not based on a published host-pathogen association database 78. In this database, host-pathogen associations are defined by the presence of at least one documented infection of the host by the pathogen 31. For species that were not found in this database, we performed a systematic PubMed search using the search terms: (microbial species name) AND (human) AND ((infection) OR (commensal)). Similarly, species that had at least one published report of human colonisation/infection were considered human-associated. Additionally, we classified the potential body site origins for each microbial species using the Disbiome database, which collects data and metadata of published microbiome studies in a standardised way 34. We extracted the information for all microbiome experiments in the database using the URL: 'https://disbiome.ugent.be:8080/experiment' (accessed 26th April 2022). We first