As I am going through my Ph.D., I cannot stress enough the importance of having a critical mind and using it often.
While analyzing a dataset of 15 million 16S sequences, I reached the point where I had picked OTUs (Operational Taxonomic Units) and got an impressively high number (45580), of which 28113 were singletons (present once) or doubletons (present twice). More than half of the OTUs in my bacterial community, drawn from 200 samples, rested on the presence of just one or two sequences. It made me think about these “rare” sequences, their significance and reliability.
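To make that proportion concrete, here is a minimal sketch of how the share of singleton and doubleton OTUs can be counted. It assumes the OTU table has already been collapsed to a total read count per OTU. The function name `rare_fraction`, the `max_count` parameter, and the toy table are all made up for illustration and do not come from any particular pipeline.

```python
def rare_fraction(otu_totals, max_count=2):
    """Return (n_rare, n_otus, fraction) for OTUs with at most max_count reads.

    otu_totals maps an OTU identifier to its total read count across samples.
    OTUs with zero reads are ignored.
    """
    totals = [t for t in otu_totals.values() if t > 0]
    if not totals:
        return 0, 0, 0.0
    n_rare = sum(1 for t in totals if t <= max_count)
    return n_rare, len(totals), n_rare / len(totals)


# Hypothetical toy table: two abundant OTUs, one doubleton, two singletons.
toy = {"OTU_1": 950, "OTU_2": 40, "OTU_3": 2, "OTU_4": 1, "OTU_5": 1}
print(rare_fraction(toy))

# The same ratio with the numbers reported above.
print(f"{28113 / 45580:.1%}")  # 61.7%
```

With the figures from my dataset, singletons and doubletons account for roughly 62% of all OTUs.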
SOURCES OF ERRORS
Although they are getting sexier and cheaper every day, next-generation sequencing techniques are not free of errors (see Schloss et al. 2011 for strategies to reduce them). Indeed, many sequences might be the result of PCR bias or sequencing errors and therefore be misleadingly assigned to a unique OTU afterwards. Our lack of knowledge of environmental bacterial communities further reduces our ability to distinguish between real “rare” sequences and errors or artifacts. The presence of these artifacts adds another challenge to the already technical task of analyzing high-throughput sequencing datasets.
Agreeing with Zhan et al. (2014), I found that filtering had more influence on low-abundance OTUs. Indeed, quality and chimera filtering removed most of the unique sequences from my dataset (excluding 66% of singletons, doubletons, and tripletons). Another thing to think about is the 16S copy number per bacterial cell. As shown by Lee et al. (2009) and Rastogi et al. (2009), it is very common for one cell to hold multiple copies of the 16S gene (see Kembel et al. 2012 for techniques to control for it). Therefore, finding only one copy of a sequence even though a cell holds multiple copies of that amplicon makes me doubt the value of such a “rare” sequence. And knowing about the flaws of OTU picking, I am eager to see new techniques that minimize clustering errors or noise.
Should one consider excluding rare OTUs (singletons, doubletons, tripletons, etc.) from a community analysis? And where should the cutoff be?
THUS WHAT THE HELL SHOULD I DO?
Although being more stringent with your dataset prevents you from studying the “rare” members of the bacterial communities, I personally prefer to be overly strict (and miss some OTUs) than overly lenient (and draw false conclusions). Additionally, one must consider the study’s objectives when deciding what to do with the “rare” sequences. You should be extremely careful when using statistical analyses or estimators that are sensitive to the presence of “rare” sequences. For example, diversity estimates can be artificially inflated by spurious “rare” OTUs (Kunin et al. 2010; Bokulich et al. 2013). One strategy is to try several thresholds for the minimum number of sequences required to accept an OTU as real, and then repeat your analyses to see whether the results change. In conclusion, although it is tempting to try to study the “rare” microbiome, it remains a hard challenge to distinguish between sequencing artifacts and real “rare” sequences.
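The multiple-threshold idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the OTU table is a plain dictionary mapping each OTU to its read counts per sample, an OTU is kept when its total read count across all samples reaches the threshold, and richness and the Shannon index stand in for whatever analysis you actually run. The function names, the thresholds, and the toy table are hypothetical.

```python
import math


def filter_otus(otu_table, min_reads):
    """Keep OTUs whose total read count across samples is >= min_reads."""
    return {otu: counts for otu, counts in otu_table.items()
            if sum(counts) >= min_reads}


def shannon(counts):
    """Shannon index (natural log) computed from a list of read counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def threshold_sweep(otu_table, thresholds=(1, 2, 3, 5, 10)):
    """Recompute richness and Shannon diversity at each abundance threshold."""
    results = []
    for min_reads in thresholds:
        kept = filter_otus(otu_table, min_reads)
        totals = [sum(counts) for counts in kept.values()]
        results.append({
            "min_reads": min_reads,
            "richness": len(kept),
            "shannon": shannon(totals),
        })
    return results


# Hypothetical toy table: OTU -> read counts in two samples.
toy_table = {
    "A": [500, 450],
    "B": [20, 20],
    "C": [1, 1],   # doubleton
    "D": [1, 0],   # singleton
    "E": [0, 1],   # singleton
}

for row in threshold_sweep(toy_table, thresholds=(1, 2, 3)):
    print(row)
```

If the conclusions you care about stay the same across thresholds, the rare OTUs are probably not driving them. If they shift a lot, that sensitivity is worth reporting.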
Bokulich NA, Subramanian S, Faith JJ, Gevers D, Gordon JI, et al. (2013) Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nat Meth 10: 57–59. doi: 10.1038/nmeth.2276
Kembel SW, Wu M, Eisen JA, Green JL (2012) Incorporating 16S gene copy number information improves estimates of microbial diversity and abundance. PLoS Comput Biol 8(10): e1002743.
Kunin V, Engelbrektson A, Ochman H, Hugenholtz P (2010) Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environ Microbiol 12: 118–123. doi: 10.1111/j.1462-2920.2009.02051.x
Lee ZM-P, Bussema C, Schmidt TM (2009) rrnDB: documenting the number of rRNA and tRNA genes in bacteria and archaea. Nucleic Acids Res 37: D489–93. doi: 10.1093/nar/gkn689
Rastogi R, Wu M, DasGupta I, Fox GE (2009) Visualization of ribosomal RNA operon copy number distribution. BMC Microbiol 9: 208. doi: 10.1186/1471-2180-9-208
Schloss PD, Gevers D, Westcott SL (2011) Reducing the effects of PCR amplification and sequencing artifacts on 16S rRNA-based studies. PLoS ONE 6(12): e27310.
Zhan A, Xiong W, He S, MacIsaac HJ (2014) Influence of artifact removal on rare species recovery in natural complex communities using high-throughput sequencing. PLoS ONE 9(5): e96928.