ONE MORE QUESTION: WHAT ABOUT THE “RARE” OTUs?

As I am going through my Ph.D., I cannot stress enough the importance of having a critical mind and using it often.


When analyzing a database of 15 million 16S sequences, I arrived at the point where I had picked OTUs (Operational Taxonomic Units) and obtained an impressively high number of them (45,580), of which 28,113 were singletons (present once) or doubletons (present twice). More than half of the bacterial community detected across my 200 samples relied on the presence of one or two sequences. It made me think about these “rare” sequences, their significance and their reliability.


SOURCES OF ERRORS

Although they are getting sexier and cheaper every day, next-generation sequencing techniques are not free of errors (see Schloss et al. 2011 for strategies to reduce them). Many sequences may be the result of PCR bias or sequencing errors and may therefore be misleadingly assigned to a unique OTU afterwards. Our limited knowledge of environmental bacterial communities further reduces our ability to distinguish between real “rare” sequences and errors/artifacts. The presence of these artifacts adds another challenge to the already technical task of analyzing high-throughput sequencing datasets.


In agreement with Zhan et al. (2014), I found that filtering had the most influence on low-abundance OTUs: quality and chimera filtering removed most of the unique sequences from my database (excluding 66% of singletons, doubletons, and tripletons). Another thing to consider is the 16S copy number per bacterial cell. As shown by Lee et al. (2009) and Rastogi et al. (2009), it is very common for one cell to hold multiple copies of the 16S gene (see Kembel et al. 2012 for techniques to control for this). Finding only one copy of a sequence, even though a single cell holds multiple copies of that amplicon, makes me doubt the value of such a “rare” sequence. And knowing the flaws of OTU picking, I am eager to see new techniques that minimize clustering errors and noise.
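
To make this concrete, here is a minimal sketch in R of how one could count and then drop singletons and doubletons from an OTU table. The otu matrix below is a made-up samples-by-OTUs count table, not my actual dataset, and the cut-off is purely illustrative.

# otu: hypothetical samples-by-OTUs count matrix (in a real analysis this would
# come from your OTU-picking output)
otu <- matrix(rpois(200 * 50, lambda = 0.5), nrow = 200,
              dimnames = list(paste0("sample", 1:200), paste0("OTU", 1:50)))

total.counts <- colSums(otu)             # total reads per OTU across all samples
sum(total.counts == 1)                   # number of singletons
sum(total.counts == 2)                   # number of doubletons

otu.filtered <- otu[, total.counts > 2]  # drop singletons and doubletons
ncol(otu) - ncol(otu.filtered)           # how many OTUs were removed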

Should one consider excluding rare OTUs (singletons, doubletons, tripletons, etc.) from a community analysis? And where should the cut-off be?

THUS WHAT THE HELL SHOULD I DO?

Although being more stringent with your database prevents you from studying the “rare” members of bacterial communities, I personally prefer to be overly strict (and miss some OTUs) than overly negligent (and draw false conclusions). Additionally, one must consider the study’s objectives when deciding what to do with the “rare” sequences. You should be extremely careful when using statistical analyses or estimators that are sensitive to the presence of “rare” sequences; diversity, for example, will definitely be inflated by “rare” OTUs (Kunin et al. 2010; Bokulich et al. 2013). One strategy is to accept an OTU as real only above a minimum number of sequences, repeat your analyses at several thresholds, and check whether the results change (see the sketch below). In conclusion, although it is tempting to study the “rare” microbiome, distinguishing sequencing artifacts from real “rare” sequences remains a hard challenge.
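
Here is a minimal sketch of that multi-threshold strategy, assuming the hypothetical otu matrix from the sketch above and the vegan package (this is my own illustration, not the procedure used in the papers cited):

library(vegan)   # provides diversity(); install.packages("vegan") if needed

thresholds <- c(1, 2, 3, 5, 10)              # minimum total count for an OTU to be kept
sapply(thresholds, function(min.count) {
  kept <- otu[, colSums(otu) >= min.count, drop = FALSE]
  mean(diversity(kept, index = "shannon"))   # mean Shannon diversity across samples
})
# If the ecological conclusions hold across these thresholds, they are probably
# not driven by "rare" (and possibly artifactual) OTUs.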

 

REFERENCES

Bokulich NA, Subramanian S, Faith JJ, Gevers D, Gordon JI, et al. (2013) Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nature Methods 10: 57–59. doi:10.1038/nmeth.2276

Kembel SW, Wu M, Eisen JA, Green JL (2012) Incorporating 16S gene copy number information improves estimates of microbial diversity and abundance. PLoS Computational Biology 8(10): e1002743.

Kunin V, Engelbrektson A, Ochman H, Hugenholtz P (2010) Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environmental Microbiology 12: 118–123. doi:10.1111/j.1462-2920.2009.02051.x

Lee ZM-P, Bussema C, Schmidt TM (2009) rrnDB: documenting the number of rRNA and tRNA genes in bacteria and archaea. Nucleic Acids Research 37: D489–D493. doi:10.1093/nar/gkn689

Rastogi R, Wu M, DasGupta I, Fox GE (2009) Visualization of ribosomal RNA operon copy number distribution. BMC Microbiology 9: 208. doi:10.1186/1471-2180-9-208

Schloss PD, Gevers D, Westcott SL (2011) Reducing the effects of PCR amplification and sequencing artifacts on 16S rRNA-based studies. PLoS ONE 6(12): e27310.

Zhan A, Xiong W, He S, MacIsaac HJ (2014) Influence of artifact removal on rare species recovery in natural complex communities using high-throughput sequencing. PLoS ONE 9(5): e96928.


The Eleven Commandments of Conference and Stress Management


As the conference season is approaching or has already started for Ph.D. students, stress management and slide preparation invade our schedules. So here is my best advice for lowering your stress level and delivering the best possible version of your presentation.

  1. PRACTICE: Practice beforehand and, if possible, practice in front of an audience that cares so they can give you feedback. You will be more confident after your test run, and the quality of your presentation will be improved by their comments.
  2. RHYTHM: (This one is especially important if you have a dense presentation.) Insert a rhythm breaker into your presentation to refocus and recapture the attention of people who may have gotten lost along the way.
  3. NINJA SLIDES: Imagine the questions someone could ask you and prepare “ninja slides” at the end of your slideshow to support your answers. This will impress your audience and make you look prepared and professional.
  4. AUDIENCE: Know your audience and tailor your presentation to their needs. You will increase your success by hitting the buzzwords they want to hear and by not spending a lot of time explaining things they already know well.
  5. GRAPHS: If you show a graph in your talk, take the time to explain its meaning, the axes, and the statistics used. If you don’t plan to explain it, don’t show it.
  6. TEXT: Please, you are giving a presentation, not making your audience read your thesis on a big screen. Avoid long sentences, use keywords, and say the theory out loud rather than loading your slides with long blocks of text. Nobody reads them.
  7. WATER: Have a bottle of water with you so you can stop and take a sip. It will give your audience a break and help you pace yourself.
  8. BREATHE: Before the presentation, steady your breath, inhale slowly and keep calm. The first words are always the hardest; after that your voice will steady, and you just need to make sure you keep breathing!
  9. AAAAAaaaa: Don’t insert an “aaaaaaa” sound at each pause between your sentences. It is the most annoying thing in a presentation and shows a lack of control. Just take the time to shape a full sentence and think before talking.
  10. EYE CONTACT: Make eye contact with your audience; you will be able to see whether they are following you or whether they are lost and you need to spend more time explaining something. You also engage more with the audience, and they are drawn to your presentation by your energy.
  11. ACCENT: For those of us who don’t speak English as a first language, remember that your accent can get in the way of your presentation and that you need to ARTICULATE and talk SLOWER. It’s already hard to stay focused during a long day of conferences; if your audience can’t understand what you are saying, it’s over.

I hope this is useful to you. What is your best advice for giving a great presentation?


Going Beyond OTUs

As a graduate student using next-generation sequencing techniques, I ask myself a hundred technical questions a day about the methods to use; the number of decisions required to get robust results is humongous. Usually, when you read recent papers and they all (or almost all) talk about OTUs at 97% similarity, you do the same without thinking. However, I hear more and more concern about the robustness of OTUs as proxies for ecological similarity in bacterial communities. Even more striking, I noticed that 9 OTUs represent 32.6% of one of our datasets. This raised a red flag for me and needed to be investigated… What are these OTUs and how do they behave across our samples? Damn it, more questions…
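
Checking how much of a dataset a handful of dominant OTUs captures is quick once you have an OTU table. Here is a minimal sketch in R, run on a made-up samples-by-OTUs count matrix rather than our actual data:

# otu: hypothetical samples-by-OTUs count matrix
otu <- matrix(rpois(200 * 50, lambda = 2), nrow = 200)
otu.totals <- sort(colSums(otu), decreasing = TRUE)   # total reads per OTU, largest first
sum(otu.totals[1:9]) / sum(otu.totals)                # fraction of all reads in the top 9 OTUs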

OLIGOTYPING

The blog is here: http://meren.github.io/

The paper is here: http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12114/epdf

Oligotyping is a “supervised computational method” that enables researchers to go beyond OTUs and investigate the sub-structure of their sequences in environmental 16S rRNA gene datasets. An oligotype is defined by the nucleotides found at information-rich (highest Shannon entropy) positions in the reads. It therefore allows us to split an OTU into groups of sequences differing by a single nucleotide or by multiple nucleotides. With these oligotypes, one can test whether their behavior changes across samples (species/time/location) and better understand the dynamics of bacterial communities. Indeed, 97%-similarity OTUs could be masking a large part of bacterial ecology and dynamics across samples and thereby weaken a study’s conclusions (see Figure 3a of the oligotyping paper).

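To make the idea more concrete, here is a minimal sketch of the entropy calculation at the heart of oligotyping: compute the Shannon entropy at each alignment position and keep the most information-rich ones. The toy reads are made up, and this is only the core idea, not the oligotyping pipeline itself (use the software linked above for real analyses).

# Toy aligned reads (all the same length); real input would be your aligned 16S reads
reads <- c("ACGTACGT",
           "ACGTACGA",
           "ACCTACGT",
           "ACCTACGA")

# Shannon entropy of the nucleotide frequencies at one alignment position
position.entropy <- function(column) {
  p <- table(column) / length(column)
  -sum(p * log2(p))
}

bases <- do.call(rbind, strsplit(reads, ""))   # reads-by-positions character matrix
entropy <- apply(bases, 2, position.entropy)   # entropy at each position
round(entropy, 2)
order(entropy, decreasing = TRUE)[1:2]         # the two most information-rich positions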

SUB-OTU RESOLUTION

The paper is here: http://www.nature.com/ismej/journal/v9/n1/pdf/ismej2014117a.pdf

In comparison, there is also this paper from Tikhonov et al. (2015), which presents a clustering-free approach allowing researchers to resolve sub-OTU structure into what they call “subpopulations”, independently of the similarity of the 16S tag sequences. They use time series to demonstrate that sub-OTU groups can be recovered by combining error-model-based denoising with systematic cross-sample comparisons. The biggest difference from oligotyping is that the method is unsupervised, needing no input from the researcher at each step. It compares the dynamics of pairs of sequences over time through the Pearson correlation of their measured abundance traces (normalized by the maximum possible correlation). As their results show, two sequences sharing 100% similarity can behave differently through time (so one could infer that they belong to separate ecological populations), whereas two sequences at 81% similarity can behave identically. These results suggest that we should not rely only on OTUs to understand bacterial community dynamics (see Figure 2 of Tikhonov et al. 2015).

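And here is a minimal sketch of the kind of comparison driving that approach: correlate the abundance traces of two sequences across time points. The traces are made up, and this shows only the core idea, not the published algorithm (which adds error-model-based denoising and normalization by the maximum possible correlation).

# Hypothetical read counts of three sequences over 10 time points
seq.a <- c(12, 15, 30, 42, 38, 20, 11, 9, 14, 25)
seq.b <- c(10, 14, 28, 45, 35, 22, 10, 8, 15, 23)
seq.c <- c(40, 35, 20, 10, 12, 25, 38, 42, 30, 18)

cor(seq.a, seq.b, method = "pearson")   # high correlation: plausibly the same ecological population
cor(seq.a, seq.c, method = "pearson")   # low/negative correlation: plausibly distinct populations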

CONCLUSIONS

I don’t mean to say that there is no value in looking at OTUs but rather that it appears beneficial to compare the trends seen at the OTU level with those at the sub-OTU level. For my fellow graduate students trying to find their way in analyzing 16S sequences without going crazy, I definitely suggest you read these two papers and consider going beyond OTUs to understand the ongoing dynamics in your samples. Good luck and I hope this was useful to you! Cheers!


Making the Best of a Three-Month Internship on the West Coast

2015 started with a professional challenge for me: a three-month international internship at UC Davis, in Jonathan Eisen’s lab (https://phylogenomics.wordpress.com/). I found the Eisen lab extremely interesting because of its presence on social media and the variety of its projects. Most of the lab members are tweeting, writing blogs or creating awesome scientific board games (http://microbe.net/gutcheck/), and the lab runs many outreach and citizen science projects. As I remember leaving Montreal, I was excited but also stressed about the coming change of routine and environment. BUT with challenge comes improvement.

As I aim to be part of the science world for a long time, I am well aware that great science comes from collaboration, not only from single researchers doing their own thing. So I planned to take advantage of the experience of senior researchers and contacted many different professors in fields related to mine. On my list were:


OSU, Oregon:

Thomas Sharpton (http://lab.sharpton.org/)

U. Of Oregon, Oregon:

Jessica Green (http://pages.uoregon.edu/green/)

Brendan Bohannan (http://pages.uoregon.edu/bohannanlab/)

UC Berkeley, California:

Steve Lindow (http://icelab.berkeley.edu/lindow-lab-1)

Paul Fine (https://ib.berkeley.edu/labs/fine/Site/home.html)

David Ackerly (http://www.ackerlylab.org/)

Ellen Simms (https://ib.berkeley.edu/people/faculty/simmse)

UC Davis, California

Johan Leveau (http://plantpathology.ucdavis.edu/faculty/Leveau_Johan_HJ/)

Jonathan Eisen (https://phylogenomics.wordpress.com/)

UBC Okanagan Campus, British Columbia

John Klironomos (http://johnklironomos.com/)

This also meant I would have to cover a lot of ground in a short time. However, the encounters with the PIs and their labs more than compensated for all the travel time. Meeting great and welcoming human beings all along the west coast of the USA and Canada who also happen to do research greatly motivated and inspired me to keep doing my best in my field. There are some crazy inspiring projects out there! It was also reassuring to see that the challenges are mostly the same in every lab, especially when working with next-generation sequencing.

Another of my goals was to help teach two workshops: one given by Titus Brown (http://ged.msu.edu/) at UC Davis on mRNAseq (http://dib-training.readthedocs.org/en/pub/2015-03-04-mRNAseq-semimodel.html), and a Software Carpentry (http://software-carpentry.org/index.html) workshop at the U. of Arkansas. Both were great in the sense that teaching boosted my energy and we got great feedback from our attendees. From the mRNAseq workshop I learned a lot and widened my horizons, while meeting the great Titus Brown and showcasing my skills (http://ivory.idyll.org/blog/2015-a-first-workshop.html). The people of Arkansas were absolutely amazing, welcoming and, most importantly, truly interested.

So this concludes three months of international internship, visits, workshops, meetings, but more importantly three months of growing my network and creating connections with great people all along the west coast.

I would advise any Ph.D. student to take advantage of the scholarships provided by their University (if there are some) to mix traveling and Science.


Software Carpentry Plot This Challenge

Greg Wilson from Software Carpentry posted a plot challenge yesterday evening: http://software-carpentry.org/blog/2015/02/plot-this.html#comment-1845097980

The challenge was to redo the per-capita figure from that post to maximize the visual information.



Here is what I would do for a static plot:

library(ggplot2)
per.capita <- read.csv("~/Desktop/per.capita.csv")

# New variables: attendees and instructors per million inhabitants
per.capita$xvar <- per.capita$Attendees / (per.capita$Population / 1000000)
per.capita$yvar <- per.capita$Instructors / (per.capita$Population / 1000000)

# Manual label offsets so the country names don't overlap
per.capita$vjust <- c(0,-0.5,0,1.5,0,0,0,0,0.5,1.5,0,0,0,0.5,1.5,-1,0,0,0,0,2,0,0,0,0,0,0,0,0)
per.capita$hjust <- c(-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,0.9,-0.2,-0.2,-0.2,-0.2,1.2,-0.1,-0.2,-0.2,-0.2,-0.2,-0.1,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2)

# Plot
ggplot(per.capita, aes(x = log(xvar + 1), y = log(yvar + 1), color = Country)) +
  theme(legend.position = "none") +
  geom_point(shape = 10, size = 5) +
  scale_x_continuous(limits = c(0, 4)) +
  geom_text(aes(label = Country, vjust = vjust, hjust = hjust)) +
  xlab("Attendees per Million Inhabitants (log)") +
  ylab("Instructors per Million Inhabitants (log)") +
  ggtitle("Software Carpentry Instructors and Attendees Statistics per Country")


I also noticed some great IPython interactive maps; I am definitely going to try something like this: http://nbviewer.ipython.org/gist/jiffyclub/3f3cc34745da55f36fcf


The Eternal Generalist/Specialist Dilemma

Often I feel torn between getting even better at what I am already good at (which is so much fun!) and making progress on all the other (scary) skills. It seems to be a long-lasting struggle for workers across all spheres:

when developing skills in our domain, is it better to diversify our spectrum of knowledge or should we become eminent experts in one area?

THE SPECIALIST


The advantages of being highly competent at a single skill can be tremendous in the short term. However, this strategy is also extremely risky, as specialists thrive only when conditions are perfect. Specialists must choose wisely the ability in which they become experts, because if it becomes unnecessary tomorrow, all that hard work may be lost. If the choice is right, however, the specialist can become a crucial asset for any employer, which greatly increases their employability.

THE GENERALIST

On the other hand, diversifying our skill set is expensive in time, energy and memory space (if only we could have more than one brain…). Generalists may always feel that they need to improve at every level and that they are not THE best at anything. However, generalists become highly independent, needing only occasional input from experts, and they are never “out of the game” when conditions change. In the long term, being a generalist is safer, since there is always a base to build on.

THE OPTIMUM

Being the go-to person for at least one skill or domain in your field is necessary. This way you become essential to that area and contribute significantly to the work being done. On a CV, outstanding specific skills are the most appealing part for employers. However, these specialist skills must not come at the expense of a well-rounded background. The autonomy and resourcefulness of a generalist are essential for long-term success, especially in science. If you aim for the best long-term/short-term compromise, you should try to get the best of both worlds. I never said it would be easy, though.


My strategy therefore follows two guidelines:

  • Pick a few tactical skills that should be in high demand in the coming years, and become as much of an expert in them as I can.
  • Make sure I am knowledgeable about, or at least introduced to, all the other domains/skills/areas that could be of interest in my work.

Here are some of my specialist vs. generalist skills:

SPECIALIST                      GENERALIST

Statistics                      Coding
R                               Python
Vegetal Ecology                 Vegetal Physiology
Communication                   Predictive Models
Plant-microbe interactions      Forestry

I’ll tell you in 30 years if it worked!


Why Learn R?

Learning a new language is always a tedious and challenging experience. It is long. It is frustrating. It is tiring. Then, after a looooong while, it is rewarding. So why should you bother to learn a new statistical language? Why learn the R language? Why should you spend hours, days, weeks, drinking countless coffees, trying to get a grip on R?


So you’ve been using Excel, SAS, SPSS, Minitab, or others, for years. You’ve gotten comfortable and effective. And now SOMEBODY (what an A**!) tells you that you should switch to R because it’s much more effective/productive #trendy. And you are scared/pissed off/just not interested. But wait a minute; do you know what R is? R is a programming language designed to facilitate exploratory data analysis, classical statistical tests and high-level graphics. R is a flexible, cutting-edge, powerful statistical tool.

Now that you know what R is, what does it do that Excel, SAS, SPSS and the like don’t? Why is it worth your time and energy? R offers many advantages over other statistical tools. It is strong enough to handle big, messy data. It lets you write a reproducible script for cross-scientific validation (which is non-negotiable for scientists). It offers more than 2,000 libraries (packages developed to perform particular functions and tasks) in a huge variety of fields, made by those who use them and for those who use them. R reinforces scientists’ good habits in statistics and data management. It integrates with many other programming languages such as Java, C++ and Python, as well as with other statistical tools such as IBM SPSS. Many big industry players such as Facebook, the NY Times, Google, and Bank of America use R. And R works on all platforms: Windows, Mac, Linux. R is at the leading edge of development in statistics, data mining and data analysis. But most of all, R is free, open source, transparent, and open to community critique and review, and therefore fast to improve.

R is definitely trending, yes, because R is moving now, and it’s moving fast. Listing R as one of your assets is a crucial advantage for recruitment. Yes, you will probably swear over a forgotten comma or be slow at debugging R errors. But R has a ton of blogs and peer-supported communities to help you learn its components and integrate them into your day-to-day work. You have no excuse not to give learning R a shot.
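
If you have never touched R, here is a tiny, self-contained taste of the workflow described above (load data, run a classical test, draw a plot), using the iris dataset that ships with R rather than your own files:

data(iris)                                   # example dataset bundled with R
summary(iris$Sepal.Length)                   # quick exploratory summary

setosa     <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]
t.test(setosa, versicolor)                   # classical statistical test

library(ggplot2)                             # one of the contributed packages mentioned above
ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()                             # high-level graphics in two lines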

http://www.r-project.org

http://www.inside-r.org

http://www.revolutionanalytics.com/r-community

http://www.r-statistics.com/tag/r-community/
