Web Services
  1. Bioinformatics research
    1. Algorithms/Statistics
      1. Dynamic programming
      2. Viterbi
      3. Hidden Markov models
      4. Context-free grammars
      5. Graph theoretic approaches
        1. Protein-protein interactions
        2. Metabolic pathways
      6. Issues
        1. Quality of solutions
        2. Time to compute
    2. Simulation/Modeling
      1. Molecular dynamics
      2. Computational biology
      3. Computational chemistry
      4. Protein folding/structure
      5. Phylogenetic trees
      6. Cell regulation
    3. e-Technologies/Data storage
      1. Problems
        1. Huge databases
          1. DOE article
          2. Biological research is becoming increasingly data driven. Bioinformatic data is reported to be doubling every 9-12 months. The data is also becoming more complex as it extends vertically across the genome, proteome, transcriptome, and physiome, and horizontally as more species databases are created (e.g., FlyBase, AnoBase, WormBase, Ensembl, NCBI, DDBJ, VectorBase, etc.).
            1. Primary data
            2. Secondary data
          3. The global publicity around the human genome project, with draft releases and joint transatlantic presidential/prime minister press conferences, and the 50th anniversary of the discovery of the double helix, has catapulted bioinformatics into the forefront of biological research. Advanced hardware, algorithms, and software derived from the human genome project, in a classical positive feedback loop, are now being applied to other species, creating surprising new applications in public health, pharmaceutical research, homeland security/biodefense, etc.
          4. Molecular and non-molecular data are being collected at an ever-increasing rate and in ever-increasing variety. Numerous “omic” data types (genomic, transcriptomic, etc.), an alphabet soup of data types (EST, RNAi, cDNA, BAC, etc.), gene expression/micro-array data, images, ontologies, nomenclatures, cytogenetics, population phenomena (population genomics, genetics, ecology, epidemiology, etc.), annotations, and publications are accumulating in hundreds of online bioinformatics databases around the globe. Many of these databases focus on the research needs and expertise of the groups that curate them, with different data models and schemas and diverse interfaces, and often cover a single model organism. So many such databases are being built that the bioinformatics community has a project at SourceForge, GMOD (Generic Model Organism Database), to develop and support software tools for creating new community biological databases. Again, in a classical positive feedback loop, GMOD makes it easier to build such databases, resulting in more of them. Subsets of data from these sites are aggregated at large national/regional data warehouse sites such as GenBank, EMBL, and DDBJ for nucleotide data.
        2. Different vocabularies
          1. Inconsistency in the naming of biological objects makes multi-database queries difficult. Lincoln Stein presents the following example:

            “[There is a] DNA-damage checkpoint-pathway gene that is named Rad24 in Saccharomyces cerevisiae (budding yeast). Schizosaccharomyces pombe (fission yeast) also has a gene named rad24 that is involved in the checkpoint pathway, but it is not the orthologue of the S. cerevisiae Rad24. Instead, the correct S. pombe orthologue is rad17, which is not to be confused with the similarly named Rad17 gene in S. cerevisiae. Meanwhile, the human checkpoint-pathway genes are sometimes named after the S. cerevisiae orthologues, sometimes after the S. pombe orthologues, and sometimes have independently derived names. In C. elegans, there are a series of rad genes, none of which is orthologous to S. cerevisiae Rad17. The closest C. elegans match to Rad17 is, in fact, a DNA-repair gene named mrt-2.”
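The naming problem in the quotation can be made concrete with a small lookup table. This is only an illustrative sketch: the group labels ("OG1", "OG2", "OG3") are hypothetical identifiers invented here, and the table records only the yeast relationships stated in the quotation.

```python
# Map (species, gene name) to an ortholog group.
# The group labels are hypothetical identifiers made up for this sketch.
ORTHOLOG_GROUP = {
    ("S. cerevisiae", "Rad24"): "OG1",
    ("S. pombe", "rad17"): "OG1",      # the correct orthologue of S. cerevisiae Rad24
    ("S. cerevisiae", "Rad17"): "OG2",
    ("S. pombe", "rad24"): "OG3",      # same name as Rad24, but not its orthologue
}


def same_name(gene_a, gene_b):
    """Naive cross-database join: compare gene names, ignoring case."""
    return gene_a[1].lower() == gene_b[1].lower()


def orthologous(gene_a, gene_b):
    """Join on the curated ortholog group instead of the gene name."""
    group_a = ORTHOLOG_GROUP.get(gene_a)
    group_b = ORTHOLOG_GROUP.get(gene_b)
    return group_a is not None and group_a == group_b


budding_rad24 = ("S. cerevisiae", "Rad24")
fission_rad24 = ("S. pombe", "rad24")
fission_rad17 = ("S. pombe", "rad17")

print(same_name(budding_rad24, fission_rad24), orthologous(budding_rad24, fission_rad24))
print(same_name(budding_rad24, fission_rad17), orthologous(budding_rad24, fission_rad17))
```

A query that joins two databases on gene name alone would pair the wrong genes and miss the right ones, which is why shared identifiers and ontologies matter for multi-database queries.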
          2. Efforts to address these problems include data format, naming, and concept standardization initiatives: for example, those of 1) the Inter-Union Bioinformatics Group, 2) the International Union for Pure and Applied Biophysics (IUPAB), 3) the International Union of Biochemistry and Molecular Biology (IUBMB), 4) the International Union of Crystallography (IUCr), 5) the International Union of Pure and Applied Chemistry (IUPAC), 6) the HUGO Gene Nomenclature Committee, 7) the Interoperable Informatics Infrastructure Consortium's (I3C) proposal for the Life Sciences Identifier (LSID) extension to the WWW URN/URL naming system, and 8) the Gene Ontology.
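As an illustration of the LSID idea mentioned above, the sketch below splits an identifier of the form urn:lsid:authority:namespace:object[:revision] into its parts. The example identifier and the function and class names are hypothetical, and the parser checks only the overall shape, not every rule of the LSID specification.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split 'urn:lsid:authority:namespace:object[:revision]' into fields."""
    parts = text.split(":")
    # Expect five parts, or six when a revision is present.
    if len(parts) not in (5, 6):
        raise ValueError("not an LSID: " + text)
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID: " + text)
    if any(part == "" for part in parts[2:5]):
        raise ValueError("empty LSID component: " + text)
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Hypothetical identifier, used only to show the structure.
example = parse_lsid("urn:lsid:example.org:genes:rad24:1")
print(example)
```

Because the authority and namespace are part of the name, two databases can both hold an object called rad24 without the identifiers colliding.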
          3. Integration efforts have focused on database architectures and middleware: data warehousing, database federation, database replication, FTP, HTML “screen scraping”, ad hoc data extraction scripts, CORBA, etc. GenBank, at the National Center for Biotechnology Information (NCBI), is an example of a data warehouse, with over 38 million sequence entries totaling more than 43 billion base pairs and downloadable as 147 gigabytes in flat-file format. Recent integration and federation efforts, all free/open-source software initiatives, include 1) the Distributed Annotation System (DAS) for exchanging genomic annotations, 2) the BioMoby project, designed to use the semantic web, ontologies, and web services for discovery and distribution of biological data, and 3) the even more ambitious MyGrid project, whose collection of tools includes prototype semantic-web and workflow composition software.
        3. The data is only part of the problem.
          Hundreds of analysis tools (e.g., varieties of BLAST programs, EMBOSS programs, etc.) search and process stored bioinformatic data, either on a local computer or at public service compute clusters. The public service clusters are often slow, and hence researchers download data using FTP, HTTP, scripts from web pages, CORBA, and server-side scripts provided by the database maintainers (e.g., the Entrez Programming Utilities (E-Utilities) at NCBI).
          1. This state of affairs requires the research scientist to know the sources of the data (possibly multiple databases, some with multiple replicated copies), the location of the analysis programs, the input requirements of the analysis programs, and how to format and reformat intermediate results for further analysis programs. Often simple data requests result in complex queries, data manipulation, and analysis:
          2. “Give me all human sequences submitted to GenBank/EMBL last week.”
          3. “Give me the sequences & chromosomal locations of all human genes that have a zinc-finger domain and have a good ortholog in Drosophila.”
          4. Simple data requests often require complex workflows!
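Even the first request above has to be translated into database-specific parameters. The sketch below builds, without sending, a query URL for the ESearch call of NCBI's E-Utilities. The choice of the "nuccore" database, the "pdat" date type as a stand-in for "submitted", and the function name are assumptions made for illustration.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base address of the E-Utilities ESearch call at NCBI.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def recent_human_sequences_url(days=7, max_records=20):
    """Build (but do not send) an ESearch query for recent human sequences."""
    params = {
        "db": "nuccore",                     # assumed target database
        "term": '"Homo sapiens"[Organism]',  # restrict to human entries
        "reldate": days,                     # only the last `days` days
        "datetype": "pdat",                  # assumed date field for "last week"
        "retmax": max_records,               # cap on returned identifiers
    }
    return ESEARCH_URL + "?" + urlencode(params)


url = recent_human_sequences_url()
print(url)
```

The call returns only a list of identifiers; fetching the sequences, and covering EMBL as well, would need further steps, which is the sense in which a simple request turns into a workflow.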
        4. Interoperability
          1. Complex queries against several of these databases may provide valuable new insights, but interoperability problems make this difficult. Biological researchers are presented with different interfaces at many different database sites. The researcher must often manually “cut, paste & click” data from one database resource to another. Solutions that automate these tasks using scripts and web-page “scraping” frequently break over time as the database curators change their schema or modify their HTML. Each database may contain only a subset of the biological data needed to answer complex questions, forcing the researcher to spend valuable time “database surfing” rather than doing research.
          2. Ad hoc, custom scripts may implement these complex queries, but they are fragile and result in duplicated efforts by many researchers.
            1. My fetch script doesn't work with your parse program!
            2. The webmaster at a web site "tweaks" the interface and everybody's fetch script breaks
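The fragility described above can be shown in a few lines. In this sketch the HTML fragments, the accession "X00001", and the XML record layout are all hypothetical; the point is only that a pattern tied to page markup stops matching after a cosmetic change, while a structured record is unaffected by presentation.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page fragments before and after a cosmetic redesign.
OLD_PAGE = '<tr><td class="acc">X00001</td><td>gene A</td></tr>'
NEW_PAGE = '<tr><td class="accession"><b>X00001</b></td><td>gene A</td></tr>'

# A typical screen-scraping pattern, tied to the exact markup of the old page.
SCRAPE_PATTERN = re.compile(r'<td class="acc">([^<]+)</td>')


def scrape_accession(html):
    """Return the accession found by the scraper, or None if the page changed."""
    match = SCRAPE_PATTERN.search(html)
    return match.group(1) if match else None


# The same data published as a structured (hypothetical) XML record.
RECORD_XML = "<record><accession>X00001</accession><name>gene A</name></record>"


def read_accession(xml_text):
    """Read the accession by element name rather than by page layout."""
    return ET.fromstring(xml_text).findtext("accession")


print(scrape_accession(OLD_PAGE))
print(scrape_accession(NEW_PAGE))
print(read_accession(RECORD_XML))
```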
          3. What is needed for bioinformatics is the equivalent of the WWW page: text and graphics retrieved with one click, possibly from multiple servers, anywhere in the world, and transparently composed for the viewer without any knowledge of the formats or source locations of the data files. This is possible because of open standards in wide use providing the economies-of-scale to share in the development costs and a broad base of users to justify solutions to “sticky” problems.
          4. Lincoln Stein Keynote
          5. Semantic web-services are such an open standards-based technology.
            1. Web-services: currently widely used by "e-Commerce" sites
              1. Amazon
                1. http://www.amazon.com/gp/browse.html/002-5071057-5574466?%5Fencoding=UTF8&node=3435361
              2. eBay
                1. http://developer.ebay.com/index_html
            2. Because business and government are embracing web-services, tools, standards, and technology are being advanced outside of the bioinformatics community, permitting bioinformatics to leverage solutions for its own needs that might not otherwise be possible. Tools extending the public web-services technology to unique bioinformatics needs are emerging from several projects, including MyGrid and BioMoby. Both are developing web-services and semantic web tools. Equally important, the maintainers of several bioinformatic databases are exposing their data and analysis programs as web services, a trend that is expected to continue.
            3. EMBL-EBI, XEMBL. 2004, http://www.ebi.ac.uk/xembl/
              EMBL-EBI, Web Services - Home. 2004, http://www.ebi.ac.uk/Tools/webservices/index.html
              KEGG, SOAP/WSDL interface for the KEGG system. 2004, http://www.genome.jp/kegg/soap/
              NCBI, Entrez Utilities Web Service. 2004, http://www.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html
        5. Technology
          1. Service oriented architecture (SOA): Central Server Architecture => Client-Server Architecture => SOA => ?
          2. Web services stack
          3. XML
            1. SGML
            2. HTML
            3. Syntax
              1. Tags - well formedness
                1. sequences.xml
              2. Content - valid data
                1. DTDs & XML Schema
                2. bioml.dtd
              3. XML Parsers
                1. SAX
                2. DOM
                3. Xerces-j
              4. Namespaces
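The syntax items above (well-formedness, DOM, SAX) can be tried out with Python's standard library parsers. This is a minimal sketch: the two small documents are hypothetical stand-ins, not the contents of sequences.xml, and validation against a DTD or XML Schema is not shown because the standard library parsers used here do not validate.

```python
import xml.dom.minidom
import xml.sax
from xml.parsers.expat import ExpatError

# Hypothetical documents: the second one closes the wrong tag.
GOOD = '<sequences><sequence id="s1">ACGT</sequence><sequence id="s2">GGCC</sequence></sequences>'
BAD = '<sequences><sequence id="s1">ACGT</sequences>'


def is_well_formed(text):
    """A document is well formed if the parser accepts its tag structure."""
    try:
        xml.dom.minidom.parseString(text)
        return True
    except ExpatError:
        return False


def ids_with_dom(text):
    """DOM: build the whole tree in memory, then query it."""
    doc = xml.dom.minidom.parseString(text)
    return [el.getAttribute("id") for el in doc.getElementsByTagName("sequence")]


class IdCollector(xml.sax.ContentHandler):
    """SAX: react to parse events as they stream past; no tree is built."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def startElement(self, name, attrs):
        if name == "sequence":
            self.ids.append(attrs.get("id"))


def ids_with_sax(text):
    handler = IdCollector()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.ids


print(is_well_formed(GOOD), is_well_formed(BAD))
print(ids_with_dom(GOOD), ids_with_sax(GOOD))
```

DOM is convenient for small documents that need random access; SAX suits very large files, such as database dumps, because it never holds the whole document in memory.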
            4. Improvement / extension to HTML
              1. XSL - Extensible Stylesheet Language
              2. XSLT - XSL Transformations
            5. Database to database conversion (interoperability)
          4. SOAP - used to be "Simple Object Access Protocol"
            1. Alternative: REST - Representational State Transfer
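To contrast the two styles, the sketch below expresses the same request first as a SOAP 1.1 message and then as a REST-style URL. The service namespace, the operation name getSequence, the example.org address, and the accession are hypothetical; only the SOAP 1.1 envelope namespace is a real, fixed value.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/sequence-service"      # hypothetical service namespace


def soap_request(accession):
    """SOAP: the operation and its arguments travel inside an XML envelope."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element("{%s}Envelope" % SOAP_ENV)
    body = ET.SubElement(envelope, "{%s}Body" % SOAP_ENV)
    call = ET.SubElement(body, "{%s}getSequence" % SERVICE_NS)  # hypothetical operation
    ET.SubElement(call, "{%s}accession" % SERVICE_NS).text = accession
    return ET.tostring(envelope, encoding="unicode")


def rest_request(accession):
    """REST: the resource is named by the URL and fetched with a plain HTTP GET."""
    return "http://example.org/sequences?" + urlencode({"accession": accession})


print(soap_request("X00001"))
print(rest_request("X00001"))
```

The SOAP message would normally be sent with an HTTP POST to a single service endpoint described by a WSDL file, whereas the REST request needs nothing beyond the URL itself.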
          5. Standards
            1. http://www.w3.org/
            2. Architecture Usage Scenarios
              1. http://www.w3.org/TR/ws-arch-scenarios/
            3. WS Usage Scenarios
              1. http://www.w3.org/TR/ws-desc-usecases/
            4. Semantic Web
              1. http://www.w3.org/2001/sw/
          6. Open source tools
            1. http://ws.apache.org/
          7. Services composition
            1. Preprogrammed
            2. Ad hoc
            3. End user specified
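The difference between preprogrammed and user-specified composition can be sketched with ordinary functions standing in for remote services. Everything here is hypothetical: the three "services" are local stand-ins, the accession and sequence are made up, and a real system would call web services and check that each output type matches the next input.

```python
# Local stand-ins for three remote services (all hypothetical).
def fetch_sequence(accession):
    return {"X00001": "ATGGCC"}[accession]


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def gc_content(seq):
    return sum(base in "GC" for base in seq) / len(seq)


def compose(*services):
    """Chain services so that each output feeds the next input."""
    def pipeline(value):
        for service in services:
            value = service(value)
        return value
    return pipeline


# Preprogrammed: the workflow is fixed by the developer in code.
PREPROGRAMMED = compose(fetch_sequence, reverse_complement, gc_content)

# End user specified: the user picks service names from a registry at run time.
REGISTRY = {"fetch": fetch_sequence, "revcomp": reverse_complement, "gc": gc_content}


def compose_by_name(names):
    return compose(*(REGISTRY[name] for name in names))


print(PREPROGRAMMED("X00001"))
print(compose_by_name(["fetch", "revcomp"])("X00001"))
```

Ad hoc composition sits between the two: a programmer assembles the chain for one analysis, as in the last line, without packaging it for reuse.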
          8. Semantic web
          9. Goals
            1. One click solutions!
            2. Transparent discovery
            3. Interoperability
            4. "Semantic" Intelligence