Databases are now an indispensable part of biology and, of course, of bioinformatics. The number of online bioinformatics databases has boomed in the last decade, and it is impossible for one person to know all of them. Still, there are some important databases every bioinformatician should know. Dr. Bob Lessick, Associate Director, Center for Biotechnology Education, Johns Hopkins University, covered some of them in an online course, Bioinformatics: Life Sciences on Your Computer, at Coursera. This post summarizes the databases taught in the course.
1. PubMed (http://pubmed.gov)
PubMed is a free database of scientific publications (references and abstracts) on life sciences and biomedical topics. It is hosted by the US National Library of Medicine (NLM) at the National Institutes of Health (NIH). It contains more than 23 million citations from the biomedical literature. You can do a free-text search as well as an advanced search. The following are some example queries.
- Tan AC [author] (Tan AC [au]) : finds papers authored by persons whose last name is Tan and the initials are AC.
- Tan AC [au] AND plos one [journal] : finds AC Tan's papers published in the journal PLoS ONE.
- RNAi [title] AND mello [au] : finds papers authored by Mello, with RNAi in the title.
- immunoglo* : finds papers with words starting with immunoglo. Note: * can be put only at the end of a term.
- "last 10 days" [edat] AND nature [journal]: finds Nature papers entered into PubMed in the last 10 days.
- 2014/02 [pdat] AND nature[journal]: finds Nature papers published in Feb 2014.
- 2014/02:2014/03 [pdat] AND nature[journal]: finds Nature papers published between Feb 2014 and March 2014.
- (DNA[title] OR RNA[title] ) AND 2014/02:2014/04[pdat] AND science[journal] : finds science papers published between Feb 2014 and April 2014 which have either DNA or RNA in their titles.
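The queries above all follow the same pattern: a term, a field tag in square brackets, and AND/OR to combine them. As a minimal sketch, the small helpers below assemble such query strings in Python. The function names (`tagged`, `date_range`, `all_of`, `any_of`) are made up for this illustration and are not part of PubMed or any library; the code only builds text that you could paste into the PubMed search box.

```python
def tagged(term, field):
    """Attach a PubMed field tag to a term, e.g. 'Tan AC' + 'au' -> 'Tan AC[au]'."""
    return f"{term}[{field}]"


def date_range(start, end, field="pdat"):
    """Build a date range such as '2014/02:2014/04[pdat]'."""
    return f"{start}:{end}[{field}]"


def all_of(*parts):
    """Join sub-queries with AND."""
    return " AND ".join(parts)


def any_of(*parts):
    """Join sub-queries with OR and wrap them in parentheses."""
    return "(" + " OR ".join(parts) + ")"


# Rebuild the last example query from the list above.
query = all_of(
    any_of(tagged("DNA", "title"), tagged("RNA", "title")),
    date_range("2014/02", "2014/04"),
    tagged("science", "journal"),
)
print(query)
# (DNA[title] OR RNA[title]) AND 2014/02:2014/04[pdat] AND science[journal]
```

Building queries this way is mostly useful when you want to generate many similar searches, for example the same title terms across several journals or months.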
2. MeSH (http://www.ncbi.nlm.nih.gov/mesh/)
MeSH stands for Medical Subject Headings. It is a controlled vocabulary for biomedical fields. Different words may refer to the same thing; for example, P53 and TP53 both refer to the same gene. MeSH assigns a single controlled term to such a concept, and if you search PubMed with that term, you'll get all the papers related to it, no matter how it is spelled in the manuscript. If you search the MeSH database for p53, you'd get its controlled term, "Genes, p53". Now you can build PubMed queries like the ones below.
- "Genes, p53" [MeSH]
- "Genes, p53" [MeSH] AND nature [journal] AND 2013 [pdat]
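The same queries can also be run from a script instead of the web page. The sketch below uses NCBI's E-utilities `esearch` endpoint with Python's standard library only. The helper names `esearch_url` and `search_pubmed` and the default `retmax` value are choices made for this example; the sketch also assumes the JSON response keeps its PubMed IDs under `esearchresult` / `idlist`. Only the URL is built when the code runs; the network call is left inside a function you can call yourself.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, retmax=20):
    """Build an esearch URL for a PubMed query string."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


def search_pubmed(term, retmax=20):
    """Return the list of PubMed IDs matching the query (needs network access)."""
    with urlopen(esearch_url(term, retmax)) as resp:
        data = json.load(resp)
    return data["esearchresult"]["idlist"]


# Build (but do not send) the request for the second MeSH query above.
url = esearch_url('"Genes, p53"[MeSH] AND nature[journal] AND 2013[pdat]')
print(url)
```

Calling `search_pubmed(...)` with the same string would return the matching PubMed IDs. If you plan to send many requests, check NCBI's usage guidelines first.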
3. Nucleotide (http://www.ncbi.nlm.nih.gov/nuccore/)
Nucleotide is a database of sequences collected from different sources. You can find various kinds of information about genomes, genes, transcripts, etc.
The following figure shows the top portion of a BRCA1 transcript.
- The LOCUS line contains several pieces of information: accession number, sequence length, molecule type (mRNA means it is a spliced RNA), topology (linear or circular), type of organism (PRI = primates), and last modification date.
- Every time the sequence is changed, its version number is incremented (and it gets a new GI id).
- Here, the sequence is from the organism Homo sapiens.
- Publication about this sequence is listed in the reference section.
- You can get the sequence in FASTA format by clicking on the "FASTA" link placed below the title. Note: the sequence is also shown in GenBank format at the end of this record page.
- You can also see the version history and compare those from the "Display Settings" menu.
- Each feature is followed by a location in the sequence. If you click on a feature title in the left column, the corresponding sequence will be highlighted.
- Here, one exon is located at position 1 to 213.
- CDS, the coding region including the start codon and stop codon, is an important feature. The translated amino acid sequence is also available here.
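To make the LOCUS fields described above concrete, here is a minimal sketch that splits a LOCUS line into named parts. The example line is invented for illustration (the name `XX_000001`, the length and the date are not a real record), and the parser assumes the simple eight-field layout shown here; real records can deviate from it, so treat this as a reading aid rather than a full GenBank parser.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str       # locus name, often the accession number
    length: int     # sequence length in base pairs
    molecule: str   # e.g. mRNA, DNA
    topology: str   # linear or circular
    division: str   # e.g. PRI for primates
    date: str       # last modification date


def parse_locus(line):
    """Parse a LOCUS line with the layout: LOCUS name length bp type topology division date."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("unexpected LOCUS layout: " + line)
    _, name, length, _, molecule, topology, division, date = fields
    return Locus(name, int(length), molecule, topology, division, date)


# Hypothetical LOCUS line, made up for this example.
example = "LOCUS       XX_000001               1200 bp    mRNA    linear   PRI 01-JAN-2014"
locus = parse_locus(example)
print(locus.name, locus.length, locus.molecule, locus.topology, locus.division, locus.date)
# XX_000001 1200 mRNA linear PRI 01-JAN-2014
```

For real work, an established parser such as the one in Biopython is a safer choice than hand-written splitting.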