Teaching genomics to life science undergraduates using cloud computing platforms with open datasets

Abstract The final year of a biochemistry degree is usually a time to experience research. However, laboratory‐based research projects were not possible during COVID‐19. Instead, we used open datasets to provide computational research projects in metagenomics to biochemistry undergraduates (80 students with limited computing experience). We aimed to give the students a chance to explore any dataset, rather than use a small number of artificial datasets (~60 published datasets were used). To achieve this, we utilized Google Colaboratory (Colab), a virtual computing environment. Colab was used as a framework to retrieve raw sequencing data (analyzed with QIIME2) and generate visualizations. Setting up the environment requires no prior experience; all students have the same drive structure and notebooks can be shared (for synchronous sessions). The platform was also used to combine multiple datasets and perform meta‐analyses, and it allowed the students to analyze large datasets with thousands of subjects and factors. Projects that required increased computational resources were integrated with Google Cloud Compute. In future, all research projects can include some aspects of reanalyzing public data, providing students with data science experience. Colab is also an excellent environment in which to develop data skills in multiple languages (e.g., Perl, Python, Julia).

The research project is an essential component of a life science degree, 1 giving students experience of the entire research process, the chance to acquire new skills, and the opportunity to test a hypothesis. 2 The students also gain subject matter to present and discuss at interviews for jobs and postgraduate studies. COVID-19 heavily disrupted life science research, closing laboratories and causing global shortages of key laboratory materials. 3 Undergraduate students who had previously performed an integrated experimental/computational research project 4 were now unable to perform the experimental component.
To overcome this challenge, we conducted computational research projects remotely with a cohort of 80 students by reanalyzing publicly available 16S rRNA amplicon microbiome data obtained from published studies. 16S rRNA amplicon sequencing datasets are of manageable size and excellent analysis pipelines exist, including QIIME2 (Quantitative Insights Into Microbial Ecology) 5-7 (Supplemental files 1-3, and on https://github.com/). The students were encouraged to search for microbiome data in an area of interest to them (e.g., microbiome and cancer), and microbiome data from over 60 studies covering a wide range of topics were identified.
This article reports a session from the virtual international 2021 IUBMB/ASBMB workshop, "Teaching Science on Big Data".
To retrieve the data, we utilized Google Colaboratory (Colab), a virtual computing environment that can be used with multiple programming languages and is linked to Google Drive (all available in Supplemental files 1-3 and on https://github.com/; see Data Availability). This data science platform gave each student a readily accessible virtual machine that needed very little configuration. Having all students use the same drive structure was useful for live teaching sessions. We accessed public databases using Colab and saved the files to shared Google Drive folders (Supplemental file 1). Our workflow is described in Figure 1, with examples of the Colab QIIME2 pipeline given in Supplemental files 1-3.
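As an illustration of the shared drive structure, the following is a minimal Python sketch and not the code from the supplemental notebooks. The folder names, the `study_id` argument, and the rule for deriving sample IDs from file names are all assumptions made for this example. The manifest header follows the two-column layout of QIIME2's single-end manifest (V2) format.

```python
from pathlib import Path

# Hypothetical folder names; the original notebooks may use a different layout.
SUBFOLDERS = ("raw_fastq", "qiime2_artifacts", "figures")


def make_project_dirs(base_dir, study_id):
    """Create the same folder layout for one study and return the paths."""
    root = Path(base_dir) / study_id
    paths = {}
    for name in SUBFOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        paths[name] = folder
    return paths


def write_manifest(fastq_dir, manifest_path):
    """List single-end FASTQ files in a tab-separated manifest.

    Assumption: the sample ID is the file name up to the first dot.
    Returns the number of FASTQ files listed.
    """
    fastq_files = sorted(Path(fastq_dir).glob("*.fastq.gz"))
    lines = ["sample-id\tabsolute-filepath"]
    for fq in fastq_files:
        sample_id = fq.name.split(".")[0]
        lines.append(f"{sample_id}\t{fq.resolve()}")
    Path(manifest_path).write_text("\n".join(lines) + "\n")
    return len(fastq_files)
```

In Colab, `base_dir` would typically point to a folder under `/content/drive/MyDrive` after Google Drive has been mounted with `google.colab.drive.mount("/content/drive")`, so that every student sees identical relative paths.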
The computational pipelines were virtually complete when shared, to ensure that students with less familiarity with computational biology were not disadvantaged. As students were required to understand the experimental design and sequencing method (e.g., 16S rRNA primers), assess sequence data quality, perform statistical decision making, and produce a final presentation, many essential components of experimental student research projects were maintained, including problem-solving and critical thinking skills. 13 We used Colab for both data analysis exercises and tutorials (see Supplemental file 3), closely aligned with published QIIME2 tutorials. 14 To visualize the output file, a standardized R Markdown document (see Figure 2) was provided. RStudio Cloud was used to give students with limited computing resources access to the files.
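To make the link between quality assessment and pipeline parameters concrete, the sketch below assembles two QIIME2 command lines in Python, in the style of the published QIIME2 tutorials. It is an illustration under stated assumptions rather than the pipeline from the supplemental files: the use of DADA2 for denoising, single-end reads, the file names, and the truncation length are all assumed here. The commands are only built, not executed, so the sketch runs without QIIME2 installed.

```python
import shlex


def import_cmd(manifest, demux_qza):
    """Command to import single-end FASTQ files listed in a manifest."""
    return [
        "qiime", "tools", "import",
        "--type", "SampleData[SequencesWithQuality]",
        "--input-path", manifest,
        "--input-format", "SingleEndFastqManifestPhred33V2",
        "--output-path", demux_qza,
    ]


def dada2_cmd(demux_qza, out_prefix, trunc_len, trim_left=0):
    """Command to denoise reads with DADA2 (assumed denoiser).

    trunc_len is the value a student would choose after inspecting
    the per-base quality plots for the dataset.
    """
    return [
        "qiime", "dada2", "denoise-single",
        "--i-demultiplexed-seqs", demux_qza,
        "--p-trim-left", str(trim_left),
        "--p-trunc-len", str(trunc_len),
        "--o-table", f"{out_prefix}-table.qza",
        "--o-representative-sequences", f"{out_prefix}-rep-seqs.qza",
        "--o-denoising-stats", f"{out_prefix}-stats.qza",
    ]


def render(cmd):
    """Return a shell-safe string, e.g. for pasting into a notebook cell."""
    return shlex.join(cmd)


if __name__ == "__main__":
    # Hypothetical file names and truncation length, for illustration only.
    print(render(import_cmd("manifest.tsv", "demux.qza")))
    print(render(dada2_cmd("demux.qza", "study_001", trunc_len=120)))
```

Keeping the truncation length as an explicit argument leaves the one decision that depends on the data visible to the student, while the rest of the command stays fixed.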
The use of the Colab platform as a framework for students to reanalyze public data made it possible to complete research projects remotely and gave students the chance to write their research paper as if the data were their own. Using this approach, they explored the scientific method and relationships between data and knowledge. 13 Moreover, some of the students combined multiple datasets, performed meta-analyses, and used datasets with thousands of samples. 25 We did not seek to check the validity of the published data and the students were encouraged to make their own decisions from the analysis. Using this framework, all students were able to complete a research project using published data, a useful data science skill that should be incorporated in future projects.
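Combining datasets from several studies can be sketched in a few lines of Python. This is a simplified illustration, not the method used in the student projects: it assumes that each study has already been summarized as per-sample counts keyed by a shared taxon name (for example, genus level), and the study IDs, sample IDs, and counts shown are hypothetical.

```python
def merge_count_tables(tables):
    """Merge per-study count tables into one table.

    tables: {study_id: {sample_id: {taxon: count}}}
    Sample IDs are prefixed with the study ID so that identical IDs from
    different studies do not collide; taxa missing from a sample get 0.
    """
    merged = {}
    for study_id, samples in tables.items():
        for sample_id, counts in samples.items():
            merged[f"{study_id}:{sample_id}"] = dict(counts)
    taxa = sorted({t for counts in merged.values() for t in counts})
    return {s: {t: c.get(t, 0) for t in taxa} for s, c in merged.items()}


def relative_abundance(counts):
    """Convert raw counts for one sample to proportions."""
    total = sum(counts.values())
    if total == 0:
        return {t: 0.0 for t in counts}
    return {t: c / total for t, c in counts.items()}


if __name__ == "__main__":
    # Hypothetical counts from two studies that both use the sample ID "S1".
    tables = {
        "studyA": {"S1": {"Bacteroides": 3, "Prevotella": 1}},
        "studyB": {"S1": {"Bacteroides": 2, "Roseburia": 2}},
    }
    merged = merge_count_tables(tables)
    for sample, counts in merged.items():
        print(sample, relative_abundance(counts))
```

Converting to relative abundance before comparison is one common way to reduce the effect of different sequencing depths between studies; it does not remove other between-study differences such as primer choice.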

ACKNOWLEDGMENTS
The authors thank the QIIME2 developers for providing a platform to analyze microbiome data and also providing many highly detailed examples in the QIIME2