Plate tectonics drive deep biosphere microbial community structure

Katherine M. Fullerton, Matthew O. Schrenk, Mustafa Yücel, Elena Manini, Marco Basili, Timothy J. Rogers, Daniele Fattorini, Marta Di Carlo, Giuseppe d’Errico, Francesco Regoli, Mayuko Nakagawa, Costantino Vetriani, Francesco Smedile, Carlos Ramírez, Heather Miller, Shaunna M. Morrison, Joy Buongiorno, Gerdhard L. Jessen, María Martínez, J. Maarten de Moor, Peter H. Barry, Donato Giovannelli, and Karen G. Lloyd


Main text:
The deep subsurface biosphere drives a wide range of key biogeochemical transformations 1,2 , so the composition of microbial communities in subsurface fluids of terrestrial and marine hydrothermal systems has been linked to geochemical parameters such as temperature, pH, redox state, and energy availability 2,4-7 . However, in terrestrial hot springs, this work is often conducted on microbial biofilms and planktonic communities that developed in the presence of sunlight after the fluids emerged at the surface, resulting in a mix of photosynthetic and chemosynthetic processes as recorded in the isotopic composition of organic carbon 8 . While the role of subsurface communities in mediating biogeochemical transformations has been widely investigated 1 , no study has previously addressed how the subsurface biosphere is connected to large-scale geologic settings. For example, variations in microbial community compositions of New Zealand's convergent margin hot springs have been linked to dispersal and local surficial geochemistry 9 , but to our knowledge, no study has investigated potential relationships between microbiology and geochemistry linked to tectonic processes.
Convergent margins connect the vast repository of carbon in the deep Earth with the planetary surface. As denser oceanic plates subduct beneath continental crust, carbon compounds and other volatiles are transferred from Earth's surface to its interior 10 . These compounds are also recycled back to the surface through arc volcanoes and secondary geothermal processes along subduction boundaries 10 . In this dynamic geologic setting, fluid release, magmatism, and deformation provide diverse habitats that may be colonized by microbial assemblages with different preferences for temperature, pH, redox, elemental compositions, pressure, and salinity 2 . While subducting slabs can penetrate 20-150 km, the subsurface biosphere is thought to be limited to the upper few kilometers, where temperatures are below ~150°C 11 . Even though they are separated by a large distance, the upward mobility of deeply-sourced fluids may connect subsurface microorganisms to the deep tectonic processes below.
Here we compare bacterial community composition, aqueous and solid phase geochemistry, as well as volatile emissions across the Costa Rican convergent margin (Fig. 1). Here, the Cocos oceanic plate subducts beneath the Caribbean plate at a rate of 80-90 mm/yr 12 . The shallow subduction geometry promotes slab dehydration prior to reaching the magma generation zone, allowing for the release of large fluxes of carbon and reduced chemical species into the overlying plate in the outer forearc, forearc, and arc 13 . Remarkably, the continental extension of the oceanic plate boundary between the East Pacific Rise (EPR) and Cocos Nazca Spreading center (CNS) is identifiable by a shift in the carbon isotopic composition of arc and forearc fluid and gas emissions 3 , suggesting tight coupling between deep tectonic structure and near-surface fluids.
Within the Central American Volcanic Arc (CAVA), only a few hot spring systems have been characterized microbiologically 14,15 . In February 2017, we sampled 21 hot springs across a >100km section of Northern and Central Costa Rica. Fresh fluids venting from the subsurface were collected prior to reaching the surface in order to minimize input from surface microbes. Sediments that accumulated near the outflow source in surface pools of the same springs were also collected. The sites cover a range of subduction provinces from the outer forearc (20-40°C, pH 8-10, and 20-40 km slab depth), to the forearc (40-60°C, pH 4-7, and 40-100 km slab depth), and arc (>60°C, pH 0-3, and 100-120 km slab depth) 3 . This includes volcanoes from the Northern Guanacaste Geothermal Province, as well as Arenal, Poás and Irazú volcanoes in the Central Cordillera (Figs. 1, S1, and S2, Table S1).
Cell abundances range from 1.5×10 3 to 3.3×10 6 cells/mL in the fluids (Table S2), typical of other hydrothermal systems 1,5 . Ten sites produced high-quality bacterial 16S rRNA gene libraries in both fluids and sediments, six were successful only in sediments, and two more only in fluids, comprising 1,933,379 reads after quality control, 33,188 total amplicon sequence variants (ASVs), and 59 phyla (Figs. S3 and S4). Each site contains enriched helium isotope ( 3 He/ 4 He) ratios relative to air, suggesting a mix of mantle mantle and slab derived fluids with little input from surface fluids (attributed to high 4 He/ 20 Ne values), and isotopically enriched dissolved inorganic carbon (DIC), derived from a mixture of the oceanic slab and mantle, rather than air 3 . The δ 13 C values of dissolved organic carbon (DOC) correlate positively (Pearson moment correlation r 2 =0.53, p<0.01) with those of slab/mantle-derived DIC (Fig. 1c), rather than with the degree of mixing with surface-derived organic matter 3 . Photosynthetic biomarkers are low in abundance, with <5 µg/g total photosynthetic pigments in the surface sediments and <1% and <4% chloroplast-related 16S rRNA genes in fluids and surface sediments, respectively 3 . Altogether, these data suggest that the fluids are deeply-sourced, and primarily contain subsurface microbial communities flushed from a subsurface habitat, similar to cold springs 16 and deep-sea hydrothermal vents 5 .
Subsurface bacterial community composition varies significantly with geologic province across the subduction zone (outer forearc, forearc, and arc), between the upper plate overlying the EPR and CNS subducted crusts, as well as with the dominant bedrock types (Fig. 1, Table  S3, Fig. S5; ADONIS, weighted Jaccard, p < 0.01). Phyla with thermophilic isolates, such as Thermotogae, Aquificae, and candidate phyla Atribacteria and Hydrothermae, increase in relative 16S rRNA gene abundances with increasing temperature and acidity. The opposite trend is observed for Proteobacteria, a few uncultured phyla, Candidatus Cloacimonetes, and others (Figs. 2a, S5, S6, Jaccard similarity Table S4, S5). These results suggest that the microbial community composition is related to subsurface geochemical parameters.
Most phyla, however, do not vary systematically with temperature and pH (Fig. S6), suggesting that the distribution of bacterial communities across this convergent margin is not a simple function of these two parameters (Fig. 3). In order to investigate to what extent the basement rocks and the resulting hydrothermal fluids influence bacterial distribution, we used concentrations of major aqueous anions and cations to categorize the geothermal fluids. A ternary distribution of aqueous anions (Cl -, SO 4 2-, and DIC, Fig. 2c) 17 distinguishes between: a) acidic (pH 0 to 3) chloride-sulfate waters associated with direct absorption of magmatic gases at arc sites, b) sulfate-poor peripheral geothermal fluids, intermediate in composition between deeply-derived chloride-rich waters and soda springs characteristic of flank volcanic sites and forearc locations, and c) alkaline outer forearc sites, relatively poor in both sulfate and chloride. Aqueous major cations ( Fig. 2d; Mg 2+ , Ca 2+ , and Na + +K + ) distinguish between: a) Ca 2+ -rich acidic arc sites (e.g. 18 ), b) volcanic flank geothermal sites and forearc springs often associated with travertine deposition 3 , and c) mature deep fluids in equilibrium with feldspars and clays in the outer forearc 17 . Both anion and cation fluid compositions correlate well with total bacterial community compositions ( Fig. 2c and 2e; Tables S4 and S5). This suggests that besides variation due to temperature and pH, deep tectonically-controlled geochemical differences correlate with microbial diversity.
The extent to which geochemical differences influence bacterial distribution and composition across the Costa Rica convergent margin is demonstrated by the stark differences in the distribution of likely sulfur-and iron-oxidizing bacteria. Genera of known iron-oxidizing isolates (like Gallionella, Geobacter and Ferritrophicum) (Fig. S7), are only found at central sites. An iron-oxidizing metabolism for these clades is independently supported by the presence of their mineral product, twisted stalks of iron hydroxide 19 (Figs. S7, S8). In contrast, genera of known sulfide-oxidizing isolates (like Sulfurihydrogenibium) are found exclusively at northern forearc/arc sites, with the exception of Poás Lake in the Central Cordillera (PL; Fig. S7). We hypothesize that this difference is driven by structural differences between the volcanology of the two areas and it is ultimately connected to the parameters of the subduction of the two plates for the following reasons. The northern sites are mostly located within the the Guanacaste Geothermal Province, where the formation of large calderas at Rincón de la Vieja and Miravalles volcanoes has created sufficient permeability and long-lived heat sources to form extensive high enthalpy hydrothermal systems (Fig. 1a). In Northern Costa Rica the subducting slab angle is significantly steeper than to the south (66° vs. 49°) 12 , resulting in an extensional stress regime and favorable conditions for caldera formation 20 , more permeability in the upper crust, and extensive gas-fluid-rock interactions leading to abundant sulfide generation. In the central region, a shallower subduction angle results in compressional tectonics favoring the formation of large stratovolcanoes with limited permeability in the Central Cordillera 21 . Here, the hydrothermal systems are likely shallower and less extensive than in Northern Costa Rica and sulfur is mostly present in more oxidized forms (SO 2 , sulfuric acid, native sulfur). Large quantities of sulfur are lost through open vent degassing 22 , and S is more concentrated toward the vents of active volcanoes (like in PL, the single site in the central region with evidence of sulfuroxidizing bacteria). The hot springs surrounding the Central Cordillera stratovolcanoes are enriched in iron leached from acidic water-rock interactions, making iron (II) available for microbial oxidation. In the northern sites, the abundant sulfide may limit iron availability for microbial respiration, since sulfide reacts with iron (II) to form iron-sulfide and pyrite, as suggested by the presence of pyrite framboids (Figs. S7, S8). Altogether, these data suggest that the along-arc crustal structural differences between the two regions, ultimately driven by the dip angle and extensional versus compressional local stress regime 21 , dictate the distribution of sulfur-and iron-oxidizing microbes.
To detect emergent patterns in the distribution of the whole bacterial community relative to deep subsurface geochemistry, we used a co-occurrence network of amplicon sequence variants (ASVs) coupled to Random Forest variable ranking and correlations with Spearman and Pearson statistics (Fig. 3, Tables S6-S11) 23 . This approach allows for the identification of microbial groups, hereafter called cliques, that cohesively respond to variation in geochemical parameters across the dataset. In this way, cliques of functionally redundant or closely interacting microbes can be identified and correlated to subsurface conditions as a group rather than as individuals. We identify ten cliques of co-occurring ASVs (Figs. 3, S9, Tables S6, S7), most of which contain multiple taxa (Fig. S10), comprising >99% of the reads in the network analysis (Table S7).
The Random Forest approach identifies correlation with subsurface geochemistry for all cliques with the exception of 6 and 10 ( Fig. 3, Tables S8-S11), suggesting a strong link with the underlying geological settings. Clique 10 is not present in enough sites to make cross-site correlations significant, and clique 6 is dominated by Aquabacterium sp. (46% of the reads) and Alishewanella sp. (22%), which are common heat-tolerant bacteria from soils 24 and freshwater 25 . Clique 6 therefore acts as an internal negative control, demonstrating that surface-associated bacteria that were washed into the system can be easily identified by their lack of correlation with deep subsurface parameters. Clique 6 comprises < 1% of the total reads in our dataset.
Carbon availability is the most important factor correlating with the greatest number of cliques. Cliques 9 and 2 are dominated by Thiothrix sp. (65% of clique 9) and Hydrogenophilaceae (12% of clique 2), and correlate best with decreasing DIC and increasing pH, respectively (Fig. 3b). Isolates from these clades include chemolithoautotrophic sulfur oxidizers 26 , suggesting their distribution across the arc may be due to affinity for highly limiting DIC concentrations and high pH in the outer forearc. Clique 4 correlates best with increasing DOC concentrations (Fig. 4b) and is dominated by Sulfurihydrogenibium sp. (55%) which, along with the other unclassified Hydrogenothermacea sequences in this clique, are related to facultative autotrophic sulfur-and hydrogen-oxidizers that fix carbon using the reverse tricarboxylic acid (rTCA) pathway 27 . Clique 1 correlates more loosely with DOC, which increases in concentration toward the arc, and includes diverse heterotrophic genera able to utilize a variety of anaerobic terminal electron acceptors. Together these cliques contain the majority (57.3%) of reads in the network, suggesting that across-arc changes in DIC availability are key drivers for microbial community composition.
To investigate whether the correlations between putative chemolithoautotrophs and carbon geochemistry are related to the ability of the microbial community to fix carbon, we conducted a targeted search for metagenomic DNA sequences related to all types of central carbon metabolism across all sites. Co-occurrence network analysis of carbon genes shows that enzymes for three main carbon fixation pathways (rTCA, Calvin-Benson, and Wood-Ljungdahl 28 ) cluster in statistically robust cliques around their key genes (Fig 4a, S12, Table  S12). Clique B, containing almost all the rTCA genes including the key genes encoding for ATPcitrate lyase (as well as its alternative citryl-CoA lyase 29 ) and 2-oxoglutarate synthase (Fig. 4c), is the only clique that correlates consistently with subsurface geochemical parameters (Fig. 4b, S11-S12, Tables S13-S15). Across all statistical tests (Tables S13-S15), clique B correlates best with DIC, suggesting a strong relationship between DIC and rTCA-based chemolithoautotrophy.
Our microbial community analyses show that 1) cliques containing known rTCAutilizing chemolithoautotrophic genera correlate best with DOC concentrations, consistent with these organisms contributing significantly to the DOC pool; 2) members of known heterotrophic genera also correlate with DOC, suggesting that chemosynthetically-derived organic matter stimulates secondary consumers; 3) most genes in the rTCA pathway, including those encoding key enzymes, form a statistically-significant clique, consistent with them being used together in the rTCA pathway; and 4) the rTCA gene clique correlates best with DIC concentrations, consistent with higher DIC concentrations stimulating more rTCA-based chemolithoautotrophy. Our geochemical analyses show that 1) the entire DIC pool was initially derived from a deep subsurface slab/mantle source, which was subsequently modified by varying amounts of CO 2 loss from deep calcite deposition 3 ; 2) the entire DOC pool has a Δ 13 C DIC−DOC value (-8.7 ± 1.3‰) in the range of the measured fractionation factors for rTCA (-2 to -12 ‰) 28 (Fig. 1); and 3) concentrations of DOC correlate with concentrations of DIC, suggesting that more DOC is produced when more DIC is available. Together, these findings suggest that the ecosystem is driven primarily by chemolithoautotrophic biomass production through the rTCA cycle, dependent on the supply of DIC from deeply-sourced fluids rising from the slab/mantle after subsequently undergoing calcite precipitation. Given reasonable estimates of cell turnover time and carbon content (Methods), we estimate that 500 to 2,000 years would be required to produce the entire DOC pool from chemolithoautotrophy. Fluid residence times in hot springs (100 to 100,000 years 30 ) are commonly sufficiently long for this to occur. Precedent for chemolithoautotrophically-based ecosystems comes from metagenomic, metatranscriptomic, and metaproteomic data in oligotrophic (DOC < 30 µM) deep subsurface aquifers 6,7 . Our work shows that such chemolithoautotrophically-based ecosystems can fuel much greater DOC production (>1 mM) across an active convergent margin.
We conclude that the subsurface microbial community of a ~400 km subduction segment in Costa Rica cohesively responds to geochemical signals that can ultimately be traced to deep tectonic processes (Fig. 2). This subsurface biosphere landscape varies along the convergent margin, through changes in supply of redox-active substrates, as well as across the convergent margin, through changes in the supply of DIC to a chemosynthetically-based ecosystem. Collectively, our work shows that volatiles and elements mobilized from the descending slab and mantle can be significantly altered by interaction with the deep subsurface biosphere on their trek back to the surface, resulting in a previously unrecognized coupling between geological and biological feedbacks in a convergent margin, with significant implications for the understanding of carbon reservoir changes in deep time.

Data and analysis availability
A complete R script containing all the steps to reproduce our analysis is available at https://github.com/dgiovannelli/SubductCR_16S-diversity.git and released as a permanent version using Zenodo under the DOI: https://doi.org/10.5281/zenodo.3483104. This Targeted Locus Study project has been deposited at DDBJ/EMBL/GenBank under the accession KEBJ00000000, with project ID PRJNA579365. The version described in this paper is the first version, KEBJ01000000. Metagenomic data are in the NCBI SRA with project ID PRJNA627197. All other data is available in the main text or the supplementary materials.

Location and Sample Collection
At each sampling site, 0.5 to 1.5 liters of hydrothermal fluids were filtered through Sterivex 0.22 µm filter cartridges (MilliporeSigma) and 15ml falcon tubes were filled with sediments. Both filters and sediment-filled tubes were immediately frozen onsite at liquid nitrogen temperature in a cryogenic dry shipper (ThermoFisher Scientific, Arctic Express 20) for transport back to the home laboratory. A description of the sites and their GPS coordinates were described previously 1,2 . Field sampling for trace metals in the fluids was carried out by fixing a filtered (0.22 µm) sub-sample in 5% HNO 3 ; filtered (0.22 µm) sub-samples were also taken for major ion measurements. Samples for the determination of trace elements in the solid fraction were sampled in 15 ml falcon tubes and stored frozen. Sediments were also sampled for mineralogical analyses, while samples for scanning electron microscopy and cell counts were fixed in 3 % formaldehyde and kept at +4°C.

DNA extraction
DNA extractions from Sterivex filters were performed using a modified phenol-chloroform extraction optimized for low biomass samples based on previously published methods 31 , with additional modifications for use with Sterivex filters as described previously 32 . Briefly, extractions were performed via chemical lysis with lysozyme, proteinase K, and SDS treatment then purified with phenol-chloroform extractions and precipitation with sodium acetate and isopropyl alcohol. Initial extractions from sediment samples were performed using the Qiagen DNeasy PowerSoil HTP 96 Kit, with additional extractions performed using the modified phenol-chloroform extraction described above, followed by concentration using the Zymo Genomic DNA Clean & Concentrator Kit. Extracted DNA was quantified using a NanoDrop 2000c (ThermoFischer Scientific) with additional PCR screening performed using universal bacterial primers 33,34 .

Sedimentary Organic Matter
Total protein, carbohydrate, lipid, chlorophyll-a, and phaeopigments were determined as previously described 35 . Concentrations were calculated using standard curves, and normalized to sediment dry weight after desiccation (60°C, 24 h). Protein, carbohydrate, and lipid concentrations were converted into carbon (C) equivalents using the conversion factors of 0.49, 0.40, and 0.75 bgC/bg dry weight, respectively. Chloroplastic pigment equivalents are defined here as the sum of the chlorophyll-a and phaeopigment concentrations.

Geochemistry
Data for the carbon isotope analysis of dissolved inorganic carbon (DIC) and dissolved organic carbon (DOC), the concentration of sedimentary aliphatic hydrocarbons and polycyclic aromatic hydrocarbons were previously reported 3 along with the methods used to quantify them. Concentrations of anions were determined using a Dionex AS4A-SC separation column, sodium hydroxide eluent and ASRS-I suppressor. For cations a Dionex CS12-SC separation column was used, with methane sulfonic acid eluent and CSRS-I suppressor. Trace metal concentrations were determined in aqueous and acid-digested solid samples with a NexIon 350X ICP-MS instrument. Total acid digestion included microwave-assisted digestion of dry sediments with nitric acid (16N HNO 3 ) and suprapure hydrofluoric acid (HF) and boric acid. The calibration standards were prepared by using Perkin Elmer multi-element calibration standard solution of metals (including Fe, Al, As, Mn, Mg, K, V, Cr, Co, Ni, Ca, Mg, Se, Sr, Ga, Ba, Be, Pb, Cs) in 5% HNO 3 with concentration of 10 bg/ml each element. Internal yttrium standard was added in each sample before analysis to correct the intensity deviations during measurement with ICP-MS. The molar concentration of each element was calculated by a standard calibration curve of each element with multiplying by volume, dilution and dividing by molar mass.

Flow Cytometry
Fluids (1 ml) obtained directly from the source were placed into a 2 ml plastic tube with a rubber o-ring screwcap (to prevent evaporation) containing 500 µl 3% paraformaldehyde solution in phosphate-buffered-saline (PBS). Cell count samples were kept at room temperature during return to the University of Tennessee and were weighed upon returning to the lab. Cell counts were determined on a Guava Easy Cyte 6HT-2L (Millipore) flow cytometer. Triplicate aliquots of each sample (200bl) were stained with 5× SybrGreen prior to analysis. Gating strategy was optimized using stained, unstained, and filtered controls.

Scanning electron microscopy
Scanning electron microscopy (SEM) micrographs of the hydrothermal sediments were obtained on a Phenom ProX scanning electron microscope at 10 and 15 kV and using a charge reduction sample holder at Rutgers University. Samples were previously dried at 40°C for 24 hours before imaging, and mounted using conductive carbon tape on a sample pin. The same instrument was used at 15 kV to perform Energy Dispersive X-Ray Spectroscopy (EDX) for elemental analysis of particles in the samples.

X-ray diffraction and Raman spectroscopy
Each sediment sample was dried at 50°C for 24 hours. A representative portion of each sample was ground to <10 um grain size with an alumina mortar and pestle. The ground sample material was analyzed with Bruker D8 powder X-ray diffractometer, with a Cu source (1.5406 nm) and 2-theta range of 5° to 70° at 0.01° increments. Bruker Eva software was used to identify mineral phases with pattern search-match performed on the RRUFF database (http://www.rruff.info/) and AMCSD (http://rruff.geo.arizona.edu/) pattern libraries. Raman spectroscopy was performed on selected samples. The crystals were randomly oriented and the Thermo Almega microRaman system was set at 100% power, using a 532 nm solid-state laser and a thermoelectrically cooled CCD detector. The laser was partially polarized with 4 cm-1 resolution and a spot size of 1 bm. Phase identification was performed using the search-match routines available in the Thermo Almega Omnic and CrystalSleuth software against the RRUFF database Raman spectra library. Trimming and background removal was performed with CrystalSleuth software.

Sequence processing and statistical analysis
Extracted DNA was sequenced for the analysis of the bacterial diversity after amplifying the bacteria-specific V4-V5 region of the 16S rRNA gene using primers 518F (AATTGGANTCAACGCCGG) and B1048R (CGTCTGCCATGYACCWC). The same extracted DNA was used for shotgun metagenomic sequencing without amplification. Sequencing was performed as part of the Census of Deep Life initiative within the Deep Carbon Observatory and performed at the Marine Biological Laboratory sequencing facility (https://www.mbl.edu/) on an Illumina MiSeq platform for amplicons and an Illumina NextSeq platform for metagenomes. Amplicon sequences were screened for quality, including chimerachecking with UCHIME, by the MBL as previously described and high-quality merged sequences were published on the Visualization and Analysis of Microbial Population Structures (VAMPS) website 36,37 . Obtained reads were processed using mothur 38 , following the Miseq standard operating procedure using amplicon sequence variants (ASVs) rather than operational taxonomic units. Taxonomy was assigned using the RDP naive bayesian classifier against the Silva v132 release 39 . Metagenomic reads were annotated using Mifaser 42 using the GS+ database available here (https://bromberglab.org/project/mifaser/) that includes gene sequences from biogeochemically relevant pathways.
All statistical analyses, data processing and plotting were carried out in the R statistical software 3.6 41 , using the phyloseq 42 , vegan 43 , ggtern 44 , missForest 45 , VSURF 46 and ggplot2 47 packages. A complete R scrip containing all the step to reproduce our analysis is available at https://github.com/dgiovannelli/SubductCR_16S-diversity.git and released as a permanent version using Zenodo under the DOI: https://doi.org/10.5281/zenodo.3483104. Briefly, the obtained count table, taxonomy assignment and phylogenetic tree were combined together with the environmental variables into a phyloseq object. Low prevalence ASVs, mitochondria and chloroplast-related sequences and potential contaminants were removed (described in more detail below, Fig. S3). In both fluids and sediments, common laboratory contaminants from DNA processing, feces, and skin 48 were largely absent (< 0.04% in the entire dataset and less than 0.01% in any individual library), and no ASV was shared by all samples. Acinetobacter sp., a group containing hospital-acquired pathogens as well as environmental clades 49 , was in high abundance (between 20 and 60% of the reads) in three samples of hydrothermal water collected in spas/resorts, and was removed from further analysis. The remaining ASVs represented ~81% of the original reads. In total 1,933,379 reads and 33,188 ASVs were retained after the preprocessing steps.
Obtained results were used for diversity plots and for multivariate analysis. The basic approach involved non-metric multidimensional scaling (nMDS) using Jaccard and Unifrac distances (the latter both weighted and unweighted) to identify similarity in bacterial diversity community composition across the sampled stations as previously described 35,50 . nMDS ordinations were used to identify potential environmental explanatory variables using linear correlations of environmental vectors with the envfit function in vegan. The roles of different sampling factors in influencing the observed community patterns were tested using a permutation distance-based approach using the adonis function of the vegan package. Tested factors included the sample matrix type (type: fluids, sediments), the subducting plate (plate: EPR or CNS), the location of the sampling site along the volcanic arc (province: outer forearc, forearc, arc), the geological province based on the map reproduced in Fig. 1 (geol_prov: 1, 2, 3 and 4), dominant basement rock type obtained from the USGS Mineral Resources GIS maps (49) (rocks), and the volcanic area the sampling sites is located in (volcano: forearc and the name of the major Costa Rica volcanoes). Additionally, two factors (anions and cations) were obtained based on the ternary plot of the aqueous geochemical composition of the fluids at each site as presented in Fig. 2b and 2d. The sampled sites were classified using their position in these plots, based on their geochemical composition, their interaction with different underlying basement rocks and the degree of equilibration 17,51 . Dominant ASVs were obtained by adding a further step of prevalence filtering, removing all the ASVs with a global abundance of less than 20 reads and present in less than 3 samples. This step reduced the number of ASVs to about 12% of the original variants, while retaining ~73% of the total reads. The diversity plots were inspected to ensure that no major changes in the dominant phylotypes and taxonomic groups were introduced (Fig. S4). At this step PL was removed from the dataset because it was the only sample representing a hyperacidic volcanic crater lake, and was therefore an outlier in the nMDS analysis (Fig. 2). A co-occurrence network was constructed based on pairwise Spearman rank correlations among the ASVs across the entire dataset. Only positive correlations with a Spearman's correlation coefficient (ρ) > 0.6 ) > 0.65 were retained, as they provide information on microbial phylotypes that may respond similarly to environmental conditions. The network we recovered included 3,935 nodes with 339,803 edges. The topology of the obtained network was investigated and a modularity analysis using a number of clustering algorithms built in the R package igraph 52 was performed (random walks, label propagation and Louvain clustering algorithms). While the total number of clusters changed, the main clusters identified by the tested algorithms converged, and the 10 clusters identified by the Louvain clustering algorithms were retained for downstream analysis. Identified clusters represented ecological cliques of phylotypes showing a cohesive distribution across the sampled hot springs. The relationship between each clique's cumulative abundance and environmental predictors was investigated using a Random Forests (RF) regression analysis. The analysis was carried out using the VSURF package. Clique abundances were re-scaled using z-scores before the RF analysis and missing environmental observations were imputed using the missForest package. To further test the validity of the identified environmental predictors the cumulative abundance of each clique was correlated using both Pearson moment correlation and Spearman rank correlation against all the environmental predictors. A conservative p level of p<0.01 was selected for all statistical tests performed 55 . Correlations were also manually inspected using scatterplots, to identify possible non-linear relationships and confirm correlations identified with statistical testing. The variables identified by each different approach (RF analysis, scatterplots inspection, Pearson moment correlation and Spearman rank correlation) overlapped significantly. The most informative environmental variable associated with the distribution of each clique was selected for plotting in Fig. 3b.
After trimming with Trimmomatic 53 , metagenomic short reads were processed with Mifaser 40 . The relative abundance for each enzyme was calculated by dividing the raw abundance value by the total number of enzyme counts for each sample. Enzyme abundance was then normalized to library size by multiplying the relative abundance by the median library size (65,709,491 bp). Cliques were obtained from pairwise spearman correlations and investigated for their relationship with environmental parameters using the same approach described for the ASV cliques.
Estimates of total residence time required for the total cellular biomass to be converted to the observed DOC were made with the following: where t 2 is the turnover time of the population, [DOC] is the concentration of dissolved organic carbon, [cells] is the concentration of cells, f autotroph is the fraction of cells that are autotrophs, and C cells is the carbon content of microbial cells. One year was used for t 2 based on the slowest estimates for the upper meter of marine sediments 54 . In marine sediments, populations are buried at measurable rates, so unlike hot spring fluids, they are amenable to measuring microbial biomass turnover times. Microbes in hot springs have faster turnover times than in marine sediments, due to advective fluxes of nutrients in hot springs, so the 1 year estimate is conservative. Values for [DOC] and [cells] (0.67 mmol/L and 6.2 x 10 5 cells/ml, respectively) were the average for all eight sites where both measurements were obtained, excluding SL, which had extremely high [DOC] values (6.29 mmol/L) and little evidence for chemolithoautotrophy since DOC was more 13 C-enriched than DIC ( 13 C values of -0.65 for DOC and -5.28 for DIC). Thirty percent was used for f autotroph , as shown from transcriptomic and proteomic data from the deep terrestrial subsurface 6 . Two values were used for C cells , 23 fgC/cell, based on extremely energy-starved marine sediment populations 4 , and 88 fgC/cell 5 . These two values account for the range of residence times reported in the main text.    The key genes for the major carbon fixation pathways were recovered in discrete cliques: Wood-Ljungdahl (WL) key genes in clique A, reverse tricarboxylic acid (rTCA) genes in clique B, and Calvin Benson Bassham (CBB) in clique C. The key genes for each pathway are highlighted with thicker outlines and have the EC number plotted. b) Sum of clique B genes plotted against DIC concentrations (in mmol C L -1 ) with blue color saturation corresponding to the distance from the trench (in km) for each site (Spearman correlation of ρ=0.66 p<0.001). c) diagram of the rTCA cycle with genes recovered in clique B highlighted in dark gray. These include the key genes ATP citrate (pro-S)-lyase and the citryl-CoA lyase necessary for the functioning of the two alternative versions of the rTCA cycle (Giovannelli et al. 2017). The two light gray rTCA genes were present in the samples but did not correlate as tightly with the other genes to be included in clique B, suggesting they were also used in other pathways.

Supplementary Discussion
The network analysis examined only the most prevalent amplicon sequencing variants (ASVs), which accounted for 73% of the total reads, but only 12% of the total ASVs, so additional environmental interactions may occur among the rare organisms. ASVs with similar relative abundance patterns were divided into cliques based on their degree of connectivity in the network. The dominant environmental variable correlating with each clique was determined by combining Random Forest variable ranking results, with the top variable identified by correlative approaches (Pearson moment correlation and Spearman rank correlation coefficients, Tables S5-S9).
Cliques 1 and 4 correlated positively with DOC concentrations (Fig. 3b). Clique 4 was dominated by Sulfurihydrogenibium (55%), with <10% contributions each from other members of the Hydrogenothermaceae, Hydrogenophilus, Meiothermus, and Rhodothermus. Cultured members of Sulfurihydrogenibium and Hydrogenothermaceae are facultative autotrophic sulfur and hydrogen oxidizers 1 and members of the deep-branching phylum Aquificae 2 . Therefore, clique 4 may contain the main chemolithoautotrophs driving DOC production across the convergent margin transect. Clique 1 membership was much more evenly distributed, with <10% contributions each from Chloroflexi, Firmicutes, Sulfurihydrogenibium, uncultured groups, and other organisms. Given the high diversity and prevalence of heterotrophic genera, members of clique 1 may be secondary consumers surviving on the organic carbon produced chemosynthetically by clique 4. Members of clique 3 often co-occurred with members of cliques 1 and 4, suggesting that they too may contribute to the dominant chemosynthetic primary or secondary production, yet they correlated most closely with iron. The most abundant ASV in clique 3 was uncultured at the phylum level, so phylogeny-based functional inference was impossible, but clique 3 had <10% contributions each from some of the same likely autotrophic and heterotrophic groups as cliques 1 and 4, and environmental variable analysis suggest that iron may be an important cofactor for them. Clique 2 correlated with pH and clique 9 correlated inversely with DIC (Fig. 3b). Clique 2 contained Hydrogenophilaceae and other facultatively autotrophic clades and clique 9 was 65% Thiothrix with lesser proportions of Hydrogenophilaceae and Sulfuritalea, whose cultured members are chemolithoautotrophic sulfur oxidizers. The low DIC and high pH sites where these organisms were in highest relative abundance were the outer forearc sites ( Figure S9), where the greatest extent of calcite sequestration was observed 3 . Therefore, these clades might have adaptations, such as high affinity for DIC, that allow them to sequester carbon even though the high pH, low DIC, and low DOC suggest that they are DIC-limited 3 .
Clique 7 correlated with methane concentrations (Fig. 3b), however, only one genus, Methylocystis, contains cultured methanotrophs. Other members of clique 7 were likely heterotrophs (Chloroflexi, Prevotella, and Acidovorax), iron-cycling bacteria, or uncultured groups. Members of Clique 8 lacked tight co-occurrence patterns with each other (Fig. 4a), and contained a wide variety of cyanobacteria, nitrogen-cycling bacteria, heterotrophs, and uncultured clades. Clique 8 correlated positively with copper (Fig. 3b), and less significantly with cobalt, chromium, and cadmium, suggesting that these loose affiliations may be due to common metal demands for the members of clique 8, rather than direct interactions between members or dependence on the same energy or carbon source. Cliques 6 and 10 did not correlate with any environmental variables. Clique 10 was 94% Sulfurihydrogenibium sp., and was found overly represented in the high temperature sites VC and TC. Clique 6 was 46% Aquabacterium spp. and 22% Alishewanella spp., which are common heat-tolerant bacteria from soils 4 and freshwater 5 . Clique 6, therefore, served as an internal negative control, showing that when surface-associated bacteria were washed into the system, they did not correlate with deep subsurface geological parameters. Clique 5 may also be a surface-associated group, since it is 96% Tepidimonas sp., an obligate aerobe 6 (54), and is correlated with dissolved oxygen.

Fig. S1.
Plot showing the relationship between temperature, pH, specific conductivity and distance from trench for the sampled hot springs. The Pearson correlation between temperature and pH measurements carried out in the field is -0.602. Poás acid Lake (PL) sample falls significantly below the trend line. The measured temperature of PL waters was influenced by the low activity of the magmatic-hydrothermal system underlying the crater lake during the sampling days in late February 2017. The volcanic activity increased in early April 2017 culminating with renewal of phreatomagmatic eruptions completely evaporating the acid crater lake.    S4. Figure_prevalence_filtering. Comparison of the broad phylum level classification of each site diversity before (A) and after (B) prevalence filtering preceding the co-occurrence network analysis. ASV with less than 5 reads and not appearing at least in 3 samples were removed. Poás Lake (PL) was removed from the analysis before prevalence filtering since it is not a hot spring. The diversity patterns remain the same. Before filtering: 33,187 ASVs in    S7. Bacterial sulfur oxidation occurs throughout the forearc/arc, whereas iron oxidation occurs only in the Central Cordillera. Relative cumulative abundance of putative iron (a) and sulfur (b) oxidizing genera across the sites corresponding to the two subducting plates. Sites are grouped according to the arc province (arc, forearc, outer forearc) and the two arc regions: the Guanacaste Geological Province (GGP) associated with the East Pacific Rise in north Costa Rica and the Central Cordillera (CC) associated with the Nazca-Cocos Spreading center in central Costa Rica. Note the difference in abundance scales (%) between the two plots. Lines drawn within a single box color represent contributions from different samples. Scanning electron micrographs show pyrite framboids at SL (d) and twisted stalks of iron hydroxides deposited by iron-oxidizing bacteria at QN (c).