Bioinformatics Databases
-
Huge amounts and variety of collected data
-
Types (public)
-
Nucleotide Sequences
-
Protein Sequences
-
3D Structures
-
Enzymes and Compounds: e.g., LIGAND (chemical compounds and reactions)
-
Sequence Motifs: e.g., PROSITE, Pfam (Protein families database of alignments and hidden Markov models)
-
Pathways: e.g., PATHWAY (metabolic and regulatory pathway maps)
-
Molecular Disease: e.g., OMIM (Online Mendelian Inheritance in Man)
-
Protein Mutations: e.g., PMD (Protein Mutant Database)
-
Gene Expressions: e.g., GEO (Gene Expression Omnibus)
-
Gene Catalog: e.g., GENES (KEGG Genes Database)
-
Enabled by embedded computing, control, robotics, automation
-
High throughput sequencing
-
Robot arms/manipulators
-
Automated sequencing
-
Faster computers generate more processed data, simulated data, visualized data
-
Enabled by cheaper and higher capacity storage
-
Price/capacity ratio falling faster than for CPUs (i.e., faster than Moore's Law)
-
Faster computers to access and search that data
-
Technology
-
Data types
-
Data formats
-
Flat file delimited
-
HTML
-
XML
-
FASTA
-
PHYLIP
-
PAUP
-
MAML
-
NEXUS
-
FASTA+GAP
-
mmCIF
-
Definitions and examples
-
http://workshop.molecularevolution.org/resources/fileformats/
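As an illustration of one of the formats listed above, here is a minimal sketch of reading FASTA records in Python. It assumes the common convention that a header line starts with ">" and that the first whitespace-delimited token of the header is the record ID. The function name `parse_fasta` and the sample records are hypothetical, made up for this example.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header line: take the first token after ">" as the ID.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence data may wrap over several lines; join them.
            records[current_id] += line
    return records


# Hypothetical sample data, not taken from any real database.
example = """>seq1 hypothetical example record
ACGTACGT
TTGA
>seq2
GGCC
"""
records = parse_fasta(example.splitlines())
print(records)
```

A dictionary keeps the sketch short; for large files a generator that yields one record at a time would avoid holding everything in memory.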
-
Data structures
-
Disk servers
-
File servers
-
File systems
-
Sequential access
-
Random access
-
Indexes
-
Hashing
-
Indexed Sequential Access (ISAM, VSAM)
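The difference between sequential and random access can be sketched with a simple hash index over a flat file. The example below is an assumption-laden illustration, not how ISAM or VSAM are implemented: it scans a FASTA-style file once, stores the byte offset of each record in a Python dict (a hash table), and then seeks directly to a record. The function names `build_offset_index` and `fetch_record` are hypothetical.

```python
import io
from typing import BinaryIO, Dict


def build_offset_index(handle: BinaryIO) -> Dict[str, int]:
    """One sequential pass: map each record ID to its header's byte offset."""
    index: Dict[str, int] = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(b">"):
            parts = line[1:].split()
            if parts:
                index[parts[0].decode("ascii")] = offset
    return index


def fetch_record(handle: BinaryIO, index: Dict[str, int], record_id: str) -> str:
    """Random access: seek to the stored offset and read one record."""
    handle.seek(index[record_id])
    handle.readline()  # skip the header line
    chunks = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # next record begins
        chunks.append(line.strip().decode("ascii"))
    return "".join(chunks)


# In-memory stand-in for a flat file on disk (hypothetical data).
data = io.BytesIO(b">seq1\nACGT\nTT\n>seq2\nGGCC\n")
index = build_offset_index(data)
print(index)
print(fetch_record(data, index, "seq2"))
```

The index costs one full scan to build, after which each lookup is a single seek rather than a scan of everything that precedes the record.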
-
Problems with above
-
Concurrent access
-
Integrity of data
-
Redundancy of data, lack of reuse
-
Consistency of the data
-
Reinventing access and search software
-
Nonstandard formats and query methods
-
Solution: database management systems (DBMS)
-
Transaction processing
-
Decision support/data mining
-
Data repositories
-
Data marts
-
Data warehouses
-
Data dictionaries
-
Data models
-
Flat
-
Hierarchical
-
Semi-structured (hybrid flat and hierarchical, e.g., XML)
-
Network
-
Relational
-
Object-oriented
-
Object-relational
-
Deductive (Prolog-like logic rules combined with database storage)
-
Database products
-
Open source
-
MySQL
-
PostgreSQL
-
Cloudscape
-
Ingres
-
Commercial
-
Oracle
-
IBM
-
DB2
-
Sybase
-
Informix
-
Microsoft
-
SQL Server
-
Access
-
Apple - FileMaker Pro
-
Hundreds more
-
Interfaces
-
Command line utilities
-
Oracle sqlplus
-
PostgreSQL psql
-
ODBC, JDBC, etc.
-
CGI/DBI
-
PHP
-
ColdFusion
-
Many more
-
Relational model
-
Based on relational algebra
-
Tables (relations)
-
Rows and columns
-
Records and fields
-
Tuples and attributes
-
Constraints on values
-
Not null
-
Data type
-
User defined complex data types (ORDBMS)
-
Foreign keys
-
"Key" columns
-
Foreign keys: referential integrity
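The constraints listed above can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch under stated assumptions: the two-table schema (`gene`, `protein_mutation`), the column names, and the inserted values are hypothetical and are not taken from any of the databases named in this outline. SQLite is used only because it needs no server; other relational products differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite checks foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    """
    CREATE TABLE gene (
        gene_id INTEGER PRIMARY KEY,   -- "key" column
        symbol  TEXT NOT NULL UNIQUE   -- declared type plus not-null constraint
    )
    """
)
conn.execute(
    """
    CREATE TABLE protein_mutation (
        mutation_id INTEGER PRIMARY KEY,
        gene_id     INTEGER NOT NULL REFERENCES gene (gene_id),  -- foreign key
        aa_change   TEXT NOT NULL
    )
    """
)


def try_execute(sql: str) -> str:
    """Run one statement and report whether a constraint rejected it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


ok = try_execute("INSERT INTO gene (gene_id, symbol) VALUES (1, 'PTEN')")
linked = try_execute(
    "INSERT INTO protein_mutation (gene_id, aa_change) VALUES (1, 'example_change')"
)
# gene_id 99 does not exist in gene, so referential integrity is violated.
orphan = try_execute(
    "INSERT INTO protein_mutation (gene_id, aa_change) VALUES (99, 'example_change')"
)
# symbol is declared NOT NULL, so a missing value is refused.
missing = try_execute("INSERT INTO gene (gene_id, symbol) VALUES (2, NULL)")

# Join the two relations through the key / foreign-key pair.
rows = conn.execute(
    "SELECT g.symbol, m.aa_change FROM protein_mutation AS m "
    "JOIN gene AS g ON g.gene_id = m.gene_id"
).fetchall()
print(ok, linked, orphan, missing, rows)
```

The point of the sketch is that the DBMS, not the application code, refuses the orphan row and the null value.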
-
Schema
-
Entity relationship diagrams
-
Data definitions
-
Chado schema
-
http://www.gmod.org/schema/doc/sequence.html
-
Ensembl schema
-
http://www.csd.abdn.ac.uk/~gjlk/ensemblfdm/daplex_schema.shtml
-
http://apr2005.archive.ensembl.org/Docs/schema_description.html
-
XML (example standards)
-
AGAVE (Architecture for Genomic Annotation, Visualization and Exchange)
-
http://www.agavexml.org/
-
BioML (BIOpolymer Markup Language)
-
http://xml.coverpages.org/bioml.html
-
BSML (Bioinformatic Sequence Markup Language)
-
http://www.bsml.org/
-
CML (Chemical Markup Language)
-
http://www.xml-cml.org/
-
MAGE-ML (Microarray Gene Expression - Markup Language)
-
http://www.mged.org/Workgroups/MAGE/mage-ml.html
-
Sample
-
http://www.omg.org/cgi-bin/apps/doc?dtc/02-09-04.zip
-
SBML (Systems Biology Markup Language)
-
http://sbml.org/index.psp
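To give a feel for how XML-encoded biological data is consumed, here is a small sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`sequences`, `sequence`, `residues`, `feature`) are invented for illustration and do not follow AGAVE, BioML, BSML, CML, MAGE-ML, or SBML; each of those standards defines its own vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a made-up vocabulary, not one of the listed standards.
doc = """<sequences>
  <sequence id="seq1" molecule="dna">
    <residues>ACGTACGT</residues>
    <feature type="exon" start="1" end="4"/>
  </sequence>
  <sequence id="seq2" molecule="protein">
    <residues>MKV</residues>
  </sequence>
</sequences>"""

root = ET.fromstring(doc)

# Walk the hierarchy: attributes give flat fields, children give nesting.
summary = {
    seq.get("id"): {
        "molecule": seq.get("molecule"),
        "length": len(seq.findtext("residues", default="")),
        "features": [f.get("type") for f in seq.findall("feature")],
    }
    for seq in root.findall("sequence")
}
print(summary)
```

The second record has no `feature` child, which shows the semi-structured nature of XML: records of the same kind need not carry the same fields.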
-
Controlled vocabulary
-
Gene Ontology
-
http://www.geneontology.org/index.shtml
-
http://www.geneontology.org/ontology/gene_ontology.obo
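The OBO file linked above is a plain-text list of stanzas made of `tag: value` lines. The sketch below reads `[Term]` stanzas into a dictionary. It assumes only the basic layout (a `[Term]` header, then `id`, `name`, and optional `is_a` lines, with "!" starting a trailing comment) and ignores escaping and the rest of the format. The function name `parse_obo_terms` and the `EX:` identifiers are hypothetical, not real Gene Ontology terms.

```python
from typing import Dict, List


def parse_obo_terms(text: str) -> Dict[str, Dict[str, List[str]]]:
    """Collect [Term] stanzas as {term_id: {tag: [values, ...]}}."""
    terms: Dict[str, Dict[str, List[str]]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are kept.
            current = {} if line == "[Term]" else None
            continue
        if not line or current is None or ":" not in line:
            continue  # blank line, file header, or non-Term stanza
        tag, _, value = line.partition(":")
        tag = tag.strip()
        # Simplification: treat everything after "!" as a comment.
        value = value.split("!")[0].strip()
        current.setdefault(tag, []).append(value)
        if tag == "id":
            terms[value] = current
    return terms


# Hypothetical excerpt written for this example, not copied from the GO file.
sample = """format-version: 1.2

[Term]
id: EX:0000001
name: root process

[Term]
id: EX:0000002
name: child process
is_a: EX:0000001 ! root process

[Typedef]
id: part_of
name: part of
"""
terms = parse_obo_terms(sample)
print(terms["EX:0000002"]["is_a"])
```

Following the `is_a` values from term to term is what turns the controlled vocabulary into a navigable hierarchy.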
-
Visualization
-
Cn3D from NCBI
-
About
-
http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml
-
Tutorial
-
http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3dtut.shtml
-
Download
-
ftp://ftp.ncbi.nih.gov/cn3d/
-
Entrez Search Engine
-
http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi
-
Select Structure
-
Search Structure for PTEN or hemoglobin
-
Select hit [1D5R]
-
Sequence browser
-
View 3D Structure of Best Model with Cn3D Display