This (first) blog post will be centered on addressing some concerns/questions/irritations that have been raised about our first attempt to bin metagenome-assembled genomes (MAGs) from the Mediterranean Sea. Since their release, these 290 MAGs have caused far more consternation than their more refined siblings that has culminated in the following tweet:
Hi there #metagenomics people. the fact that someting is binning with something DOES NOT mean it is coming from the same creature!!! it does not..........— Oded Beja (@BejaLab) October 19, 2018
This tweet led to a long series of conversations, as ‘metagenomists’ (not a real word at all) detailed various ways that, in fact, current methodologies could be applied to tease apart closely related organisms. While new techniques can, preliminary attempts may be sh*t. So, it seems like the perfect time to take stock and attempt to fix issues that have persisted in our TMED MAGs.
The Dangers of MAGs
First, let’s emphasize something that everyone needs to be reminded of from time-to-time – MAGs have a lot of deficiencies. Many people, including myself, have noted these deficiencies. But to summarize: (1) THE MOST OBVIOUS - in most cases, MAGs do not recover a complete, circular genome; (2) REGULARLY REPEATED – MAGs are a representation of a population of closely-related organisms; and, (3) NOT REPEATED ENOUGH - No amount of automated or manual assessment can guarantee that all DNA present in a MAG is accurate. This last point obviously has degrees of certainty. A MAG created using an automated process based on GC% content from a complex system is much more likely to have mistakes than a carefully curated MAG that has undergone intense scrutiny down to the gene-level with visualized recruitment plots, etc. But even these artisan MAGs may have incorrectly binned DNA, though at much lower rates and as a smaller percentage of the overall MAG.
Because of Point #3, many MAGs are presented with an accompanying percent contamination value – believe it! And if that contamination is too high for you, do not use the MAG or refine the MAG (however you see fit). And do not forget that metrics for determining contamination have their limitations, so these values are not absolute.
290 metagenome-assembled genomes from the Mediterranean Sea – A retrospective
In mid-2016, we wanted to know how well our under development binning tool worked on real samples. We began an in-house proof of concept that involved binning the Mediterranean Sea samples from the Tara Oceans metagenomic dataset. These bins were reconstructed based exclusively on variable coverage values and generated, what we thought was impressive, 290 MAGs with >50% estimated completion. We learned some lessons from the process, revamped the binning tool, incorporating DNA composition metrics to help aid the clustering process, and started the process of binning the entire Tara Oceans metagenomic dataset, repeating the Mediterranean. As it became apparent that this would take a while (1 year+), we had the idea of formally releasing those first 290 MAGs through PeerJ and GenBank. We lovingly refer to those MAGs as TMED (Tara-Mediterranean), while the subsequent MAGs are known in the lab as TOBG (Tara Oceans Binned Genomes).
The TMED MAGs went public in July 2017, just before the first calls for definitions regarding what constitutes ‘high-quality’ (Konstantinidis, ISMEJ 2017 & Bowers NatBiotech 2017). As TMED was one of the first large marine MAG datasets, we wanted to be as inclusive with the data as we could, which led us to including MAGs submitted to GenBank with high estimated contamination values. In hindsight, this may have been a mistake. But one we were very forthcoming about in the manuscript:
“It should be noted, researchers using this dataset should be aware that all of the genomes generated from these samples should be used as a resource with some skepticism towards the results being an absolute. Like all results for metagenome-assembled genomes, these genomes represent a best-guess approximation of a taxon from the environment (Sharon & Banfield, 2013). Researchers are encouraged to confirm all claims through various genomic analyses and accuracy may require the removal of conflicting sequences.”
We knew that more data would be available from the TOBG dataset and let the TMED MAGs into the world. Unfortunately, GenBank can be treated like it represents absolute truths and many use the resources provided without tracing back the source/publication/authors/etc.
In a world where GenBank has 90,000+ genomes, how important can 290 MAGs be on the grand scheme of things? Well, if your data of interest match one of those MAGs it could be the most important genome.
In revisiting the TMED MAGs, we wanted to concentrate on those that we had let through that would definitely not be included in a dataset in 2018. We identified 59 MAGs with >7.5% estimated contamination and performed a manual refinement using Anvi’o (see methods below). Additionally, we identified 43 TMED MAGs which had counterparts in the TOBG dataset with ‘better metrics’. These MAGs are available through the TOBG figshare, but were never uploaded to GenBank. In most cases ‘better metrics’ means a decrease in percent contamination. In a few exceptions, if a MAGs had a large increase in percent completion (and any contamination increase was offset by strain heterogeneity) these were used to replace TMED MAGs. You can find all changes in an Excel spreadsheet on the TMED figshare.
Will this fix everything? No, probably not. Researchers that have reached out to us have had very specific instances of misplacement in MAGs. One example. A researcher contacted us about a 0.9MB MAG from an Alphaproteobacteria (67% complete, 1.85% contamination, 0% strain heterogeneity). A single contig from this MAG was clearly misplaced – 60%G+C vs 30%G+C, annotations as a cyanobacteria, etc. This contig was <20kb in size (~2% of the MAG). The automated binning had missed it. And our screening via contamination would have missed it too. But, as I stated to this researcher, I was not overly distressed because the system was working. An independent researcher, performing a manual assessment for what (I can assume) was a perplexing result, identified the problematic contig in multiple ways. Should you trust every contig in a MAG (any MAG)? No. But the power is in numbers. Can you start to trust 10 MAGs of varying completion from a single clade that have overlapping predicted functions? Probably.
All paired-end reads from the Tara Ocean metagenomes originating in the Mediterranean Sea (TARA007, 009, 018, 023, 025, & 030) were recruited against the TMED MAGs with an estimated contamination >7.5% based on CheckM (v1.0.11) using Bowtie2 (v.2.2.5; –no-unal). For each sample for each MAG, an Anvi’o profile was created (v.5; anvi-profile). Profiles were merged (anvi-merge) and visualized (anvi-interactive) for manual refinement. Contigs were removed and/or MAGs split based on aberrant coverage profiles and %G+C. New MAGs were assessed by CheckM and the best representative chosen if necessary.
Identical TMED and TOBG MAGs were identified based on a 98.5% average nucleotide identity (ANI) cutoff using FastANI (–fragLen 1500). Completion and contamination determined by CheckM were compared and the best representative was selected.