As I am going through my Ph.D., I cannot stress enough the importance of having a critical mind and using it often.


When analyzing a database of 15 million 16S sequences, I reached the point where I had picked OTUs (Operational Taxonomic Units) and got an impressively high number (45580), of which 28113 were singletons (present once) or doubletons (present twice). More than half of the OTUs in my bacterial community from 200 samples relied on the presence of one or two sequences. It made me think about these “rare” sequences, their significance and reliability.
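For readers who want to run the same sanity check on their own data, here is a minimal Python sketch of the count. It assumes you already have the total number of sequences per OTU summed across all samples; the function name `rare_otu_summary` and the toy OTU names and counts are made up for illustration.

```python
def rare_otu_summary(otu_totals, max_count=2):
    """Return (rare OTUs, all OTUs, fraction rare).

    otu_totals maps an OTU name to its total sequence count across samples.
    An OTU is "rare" here if it has between 1 and max_count sequences,
    so max_count=2 means singletons plus doubletons.
    """
    observed = [n for n in otu_totals.values() if n > 0]
    rare = sum(1 for n in observed if n <= max_count)
    total = len(observed)
    fraction = rare / total if total else 0.0
    return rare, total, fraction


# Toy example (hypothetical OTU names and counts)
otu_totals = {"OTU_1": 5210, "OTU_2": 1, "OTU_3": 2, "OTU_4": 37, "OTU_5": 1}
rare, total, fraction = rare_otu_summary(otu_totals)
print(f"{rare} of {total} OTUs are singletons or doubletons ({fraction:.1%})")

# The numbers quoted above: 28113 rare OTUs out of 45580
print(f"{28113 / 45580:.1%} of OTUs were singletons or doubletons")
```

With the numbers from my dataset, this comes out at roughly 62% of OTUs.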



Although they are getting sexier and cheaper every day, next-gen sequencing techniques are not free of errors (see Schloss et al. 2011 for strategies to reduce them). Indeed, many sequences might be the result of PCR bias or sequencing errors and therefore be misleadingly assigned to a unique OTU afterwards. Our lack of knowledge of environmental bacterial communities further reduces our ability to distinguish between real “rare” sequences and errors/artifacts. The presence of these artifacts adds another challenge to the already technical task of analyzing high-throughput sequencing datasets.


In agreement with Zhan et al. (2014), I found that filtering processes had more influence on low-abundance OTUs. Indeed, quality and chimera filtering removed most of the unique sequences from my database (excluding 66% of singletons, doubletons, and tripletons). Another thing to think about is the 16S copy number in one bacterial cell. As shown by Lee et al. (2009) and Rastogi et al. (2009), it is very common for one cell to hold multiple copies of the 16S sequence (see Kembel et al. 2012 for techniques to control for it). Therefore, finding only one copy of a sequence even though a cell holds multiple versions of that amplicon makes me doubt the value of such a “rare” sequence. And knowing about the flaws of OTU picking, I am eager to see some new techniques that minimize clustering errors or noise.
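To illustrate why copy number matters, here is a deliberately simple Python sketch: it divides each OTU's read count by an assumed 16S copy number and rescales to relative abundances. This is not the method of Kembel et al. (2012), who estimate copy numbers from phylogeny; the function name, the OTU names, and the copy numbers below are all hypothetical.

```python
def copy_number_corrected(read_counts, copy_numbers, default_copies=1.0):
    """Divide read counts by 16S copy number, then rescale to sum to 1.

    read_counts:  OTU name -> number of reads
    copy_numbers: OTU name -> assumed 16S copies per cell
    OTUs missing from copy_numbers fall back to default_copies.
    """
    adjusted = {
        otu: n / copy_numbers.get(otu, default_copies)
        for otu, n in read_counts.items()
    }
    total = sum(adjusted.values())
    if total == 0:
        return {otu: 0.0 for otu in adjusted}
    return {otu: value / total for otu, value in adjusted.items()}


# Toy example: OTU_A yields 7x more reads but also carries 7 copies per cell
reads = {"OTU_A": 700, "OTU_B": 100}
copies = {"OTU_A": 7, "OTU_B": 1}
print(copy_number_corrected(reads, copies))
```

In this toy case the two OTUs end up equally abundant once copy number is taken into account, even though the raw read counts differ sevenfold.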

Should one consider excluding rare OTUs (singletons, doubletons, tripletons, etc.) from their community analysis? And where should the cutoff be?


Although being more stringent with your database prevents you from studying the “rare” members of the bacterial communities, I personally prefer to be overly strict (and miss some OTUs) than overly negligent (and draw false conclusions). Additionally, you must consider your study’s objectives when deciding what to do with the “rare” sequences. You should be extremely careful when using statistical analyses or estimators that are sensitive to the presence of “rare” sequences. For example, diversity estimates can be inflated by the “rare” OTUs (Kunin et al. 2010; Bokulich et al. 2013). One strategy is to use multiple thresholds for the number of sequences required to accept an OTU as real and then repeat your analyses to see if the results change. In conclusion, although it is tempting to try to study the “rare” microbiome, it remains a hard challenge to distinguish between sequencing artifacts and real “rare” sequences.
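The multiple-threshold strategy can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the threshold is applied to each OTU's total count across the whole dataset, and richness and the Shannon index stand in for "your analyses". The function names and the toy counts are made up.

```python
import math


def shannon(counts):
    """Shannon diversity index (natural log) for a list of positive counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def diversity_by_threshold(otu_totals, thresholds=(1, 2, 3, 5, 10)):
    """Recompute richness and Shannon diversity, keeping only OTUs with
    at least t sequences, for each threshold t."""
    results = {}
    for t in thresholds:
        kept = [n for n in otu_totals.values() if n >= t]
        results[t] = {
            "richness": len(kept),
            "shannon": shannon(kept) if kept else 0.0,
        }
    return results


# Toy dataset (hypothetical): two abundant OTUs and a tail of rare ones
toy_totals = {"a": 100, "b": 50, "c": 3, "d": 2, "e": 1, "f": 1}
for t, stats in diversity_by_threshold(toy_totals).items():
    print(f"min count {t:>2}: richness={stats['richness']}, "
          f"shannon={stats['shannon']:.3f}")
```

If richness drops sharply between thresholds while your main conclusions stay the same, the rare tail is probably not driving your results; if the conclusions flip, that is worth knowing before you publish.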



Bokulich NA, Subramanian S, Faith JJ, Gevers D, Gordon JI, et al. (2013) Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nat Meth 10: 57–59. doi: 10.1038/nmeth.2276

Kembel, S. W., Wu, M., Eisen, J. A., & Green, J. L. (2012). Incorporating 16S gene copy number information improves estimates of microbial diversity and abundance. PLoS Computational Biology 8(10), e1002743.

Kunin V, Engelbrektson A, Ochman H, Hugenholtz P (2010) Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environ Microbiol 12: 118–123. doi: 10.1111/j.1462-2920.2009.02051.x

Lee ZM-P, Bussema C, Schmidt TM (2009) rrnDB: documenting the number of rRNA and tRNA genes in bacteria and archaea. Nucleic Acids Res 37: D489–93 doi:10.1093/nar/gkn689.

Rastogi R, Wu M, DasGupta I, Fox GE (2009) Visualization of ribosomal RNA operon copy number distribution. BMC Microbiol 9: 208 doi:10.1186/1471-2180-9-208.

Schloss, P. D., Gevers, D., & Westcott, S. L. (2011). Reducing the effects of PCR amplification and sequencing artifacts on 16S rRNA-based studies. PLoS ONE 6(12), e27310.

Zhan, A., Xiong, W., He, S., & MacIsaac, H. J. (2014). Influence of artifact removal on rare species recovery in natural complex communities using high-throughput sequencing. PLoS ONE 9(5), e96928.

Going Beyond OTUs

As a graduate student using next-generation sequencing techniques, I ask myself 100 technical questions a day about which methods to use; the number of decisions to make to get robust results is humongous. Usually, when you read recent papers and they all (or almost all) talk about OTUs at 97% similarity, you do the same without thinking. However, I hear more and more concern about the robustness of OTUs as proxies for ecological similarity in bacterial communities. On top of that, I noticed that 9 OTUs represent 32.6% of one of our datasets. This raised a red flag for me and needed to be investigated… What are these OTUs and how do they behave across our samples? Damn it, more questions…


Here for the blog:

Here for the paper:

Oligotyping is a “supervised computational method”, based on canonical techniques, that enables researchers to go beyond OTUs and investigate the sub-structure of their sequences in environmental 16S rRNA gene data sets. An oligotype is defined by the nucleotides found at the most information-rich positions (those with the highest Shannon entropy) in the reads. It therefore allows us to split an OTU into different groups of sequences differing by one or more nucleotides. With these oligotypes, one can test whether their behavior changes across samples (species/time/location) and better understand the dynamics of the bacterial communities. Indeed, 97% similarity OTUs could be masking a huge part of bacterial ecology and dynamics across samples and weaken a study's conclusions. This is Figure 3a from the Oligotyping paper:

[Image: screenshot of Figure 3a from the Oligotyping paper]
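To make the idea of "information-rich positions" more concrete, here is a minimal Python sketch. It is a one-pass simplification, not the actual oligotyping pipeline (which is supervised and iterative): it assumes the reads are already aligned and of equal length, computes the Shannon entropy of each column, keeps the `k` highest-entropy positions, and labels each read by the nucleotides it carries there. All function names and the toy reads are made up for illustration.

```python
import math
from collections import Counter


def column_entropy(reads, pos):
    """Shannon entropy (in bits) of the nucleotides at one alignment column."""
    counts = Counter(read[pos] for read in reads)
    n = len(reads)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def top_entropy_positions(reads, k):
    """Return the k columns with the highest entropy, in positional order."""
    scored = [(column_entropy(reads, p), p) for p in range(len(reads[0]))]
    # Highest entropy first; ties are broken by the leftmost position
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:k]
    return sorted(p for _, p in best)


def oligotypes(reads, positions):
    """Count reads per oligotype, i.e. per combination of nucleotides
    found at the chosen positions."""
    return Counter("".join(read[p] for p in positions) for read in reads)


# Toy "OTU": six aligned reads that differ at two columns only
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ATGTAC", "ATGTAC", "ACGTGC"]
positions = top_entropy_positions(reads, k=2)
print("high-entropy positions:", positions)
print("oligotype counts:", dict(oligotypes(reads, positions)))
```

In a real analysis you would count oligotypes per sample rather than over the pooled reads, so that you can follow each oligotype across species, time, or location.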


Here for the paper:

In comparison, there is also this paper from Tikhonov et al. (2015), in which they present a clustering-free approach that lets researchers resolve sub-OTU structure into what they call “subpopulations”, independently of the similarity of the 16S tag sequences. They use time series to demonstrate that it is possible to resolve sub-OTU groups by combining error-model-based denoising with systematic cross-sample comparisons. The biggest difference from Oligotyping is that the method is unsupervised, needing no input from the researcher at each step. The method compares the dynamics of pairs of sequences through time using the Pearson correlation of their measured abundance traces (normalized by the maximum possible correlation). As shown by their results, two sequences sharing 100% similarity can behave differently through time (thus one could infer that they belong to separate ecological populations), whereas two sequences at 81% similarity can behave identically. These results suggest that we should not rely only on OTUs to understand bacterial community dynamics. This is part of Figure 2 from Tikhonov et al. (2015):

[Image: screenshot of part of Figure 2 from Tikhonov et al. (2015)]
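The core comparison, correlating the abundance traces of pairs of sequences through time, can be sketched in Python as follows. This is only the plain Pearson correlation: it leaves out the error-model-based denoising and the normalization by maximum possible correlation that Tikhonov et al. describe, so treat it as an illustration of the idea rather than their method. The function names and the toy traces are hypothetical.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation between two equal-length abundance traces."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("traces must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # a flat trace has no defined correlation
    return cov / (sd_x * sd_y)


def pairwise_trace_correlations(traces):
    """Correlate every pair of sequences; traces maps name -> counts over time."""
    names = sorted(traces)
    return {
        (a, b): pearson(traces[a], traces[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Toy time series (hypothetical counts at four time points)
traces = {
    "seq_1": [10, 20, 30, 40],
    "seq_2": [5, 10, 15, 20],   # rises together with seq_1
    "seq_3": [40, 30, 20, 10],  # declines while seq_1 rises
}
for pair, r in pairwise_trace_correlations(traces).items():
    print(pair, round(r, 3))
```

Two sequences that would fall into the same 97% OTU but show a strongly negative correlation here are exactly the kind of case where lumping them together hides the dynamics.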


I don’t mean to say that there is no value in looking at OTUs but rather that it appears beneficial to compare the trends seen at the OTU level with those at the sub-OTU level. For my fellow graduate students trying to find their way in analyzing 16S sequences without going crazy, I definitely suggest you read these two papers and consider going beyond OTUs to understand the ongoing dynamics in your samples. Good luck and I hope this was useful to you! Cheers!