File formats design overview
Updated: December 2005

The Workbench design defines a set of core data structures for describing pathways, chemical compounds, interactions, and so forth. Data may be imported into, and exported from, these data structures in a variety of file formats.

File format plugins

File format support will be provided by Workbench plugins, each of which imports and exports data for a specific file format, such as SBML, BioPax, or KGML. A file format plugin translates between the Workbench's generic data structures and the format's own view of the data.
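
The plugin arrangement described above can be sketched roughly as follows. This is only an illustration, not the Workbench's actual API: the names Pathway, FileFormatPlugin, import_data, export_data, PLUGINS, and register are all hypothetical, and the "namelist" format (one compound name per line) is made up purely to keep the example short.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Pathway:
    """Hypothetical stand-in for the Workbench's generic data structures."""
    name: str = ""
    compounds: list = field(default_factory=list)  # compound names only


class FileFormatPlugin(ABC):
    """Hypothetical plugin contract: one subclass per file format."""
    format_name = ""

    @abstractmethod
    def import_data(self, text):
        """Translate the format's text into a Pathway."""

    @abstractmethod
    def export_data(self, pathway):
        """Translate a Pathway into the format's text."""


class NameListPlugin(FileFormatPlugin):
    """Toy format invented for this sketch: one compound name per line."""
    format_name = "namelist"

    def import_data(self, text):
        names = [line.strip() for line in text.splitlines() if line.strip()]
        return Pathway(compounds=names)

    def export_data(self, pathway):
        return "".join(name + "\n" for name in pathway.compounds)


# Hypothetical registry mapping a format name to its plugin instance.
PLUGINS = {}


def register(plugin):
    PLUGINS[plugin.format_name] = plugin


register(NameListPlugin())
```

With a registry of this kind, the rest of the application would look up a plugin by format name and never touch format-specific syntax directly.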

Pathway data

To support a wide variety of file formats and pathway applications, the Workbench's data structures contain a richer set of data than that supported by most existing file formats. For instance, the Workbench will support simulation parameters not supported by the BioPax format, and ontology data not supported by SBML files. Export to these formats must omit this data, and import from them will produce sparsely filled data structures in the Workbench.

To support the full range of data in the Workbench's data structures, the Workbench defines a new PATH file format. The format describes a pathway, ontologies and other typing schemes, simulation parameters, presentation and layout values, and database-specific IDs.

The following table characterizes the range of data storable in each of the file formats intended for the Workbench:

                     Workbench
                     PATH       BioPax  CellML  GML    KGML   PSI-MI  SBML  SIF
Network data
  Structure          yes        yes     yes     basic  yes    yes     yes   basic
  Geometry           yes        -       -       basic  -      -       -     -
  Simulation         yes        -       yes     -      -      -       yes   -
  Presentation       yes        -       -       basic  basic  -       -     -
Domain knowledge
  Cell types         yes        yes     -       -      -      yes     -     -
  Compartment types  yes        yes     -       -      -      yes     -     -
  Network types      yes        -       -       -      -      -       -     -
  Organism types     yes        yes     -       -      -      yes     -     -
  Rate law types     yes        -       -       -      -      -       -     -
  Reaction types     yes        -       -       -      -      -       -     -
  Compound types     yes        yes     basic   -      -      yes     -     -
  Units of measure   yes        -       yes     -      -      -       yes   -
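
As a rough illustration of how this table could drive export warnings, the sketch below encodes the network-data rows and a few domain-knowledge rows as a lookup and reports which categories a target format cannot store at all. The support levels are as read from the table above; the names SUPPORT, lost_on_export, and formats_storing are hypothetical and not part of the Workbench design.

```python
# Support levels as read from the table above. A format that is absent
# from a row stores none of that category of data.
SUPPORT = {
    "Structure": {"PATH": "yes", "BioPax": "yes", "CellML": "yes",
                  "GML": "basic", "KGML": "yes", "PSI-MI": "yes",
                  "SBML": "yes", "SIF": "basic"},
    "Geometry": {"PATH": "yes", "GML": "basic"},
    "Simulation": {"PATH": "yes", "CellML": "yes", "SBML": "yes"},
    "Presentation": {"PATH": "yes", "GML": "basic", "KGML": "basic"},
    "Cell types": {"PATH": "yes", "BioPax": "yes", "PSI-MI": "yes"},
    "Units of measure": {"PATH": "yes", "CellML": "yes", "SBML": "yes"},
}


def lost_on_export(categories_in_use, fmt):
    """Return the categories in use that fmt cannot store at all."""
    return [c for c in categories_in_use if fmt not in SUPPORT[c]]


def formats_storing(category):
    """Return the formats with any level of support for a category."""
    return sorted(SUPPORT[category])


# A pathway with structure, rate laws, and layout loses its layout in SBML.
print(lost_on_export(["Structure", "Simulation", "Presentation"], "SBML"))
```

A check like this would let an export plugin warn the user before data is dropped, instead of omitting it silently.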

 

Comments:

  • BioPax and PSI-MI focus upon defining semantically rich pathway descriptions annotated with cell biology ontologies. While the syntax for BioPax and PSI-MI differs, the formats are essentially the same. Neither format includes features for detailed validation (such as charge balance), simulation (such as rate laws and quantities), or presentation (such as colors and icons). Annotation in the formats characterizes the cell compartment for a chemical compound (using the Gene Ontology component ontology), but does not describe that compartment's geometry or location.
  • CellML focuses upon defining the mathematics for simulating a pathway. The format includes equations for each interaction, variables, and syntax for connecting interactions, and their variables, to form a graph structure. A role naming convention labels variables that represent chemical compounds, but no other semantic information is present. The format's inherent modularity lends itself to modular pathways. Since the format is primarily used to describe simulation data, its structural information is implicit and it contains no information on geometry or presentation style.
  • GML is a generic format for describing node-and-edge diagrams. It has no specific support for biology. Nodes and edges may be labeled, colored, and positioned, and nodes may have icon images. While the structure of a pathway can be represented, the semantics are lost.
  • KGML is KEGG's format for describing pathways containing chemical compounds and interactions. It is primarily the backend format used to draw pathway diagrams. Nodes and links are annotated with KEGG database IDs that may be used to look up attributes on-line, rather than the attributes being embedded in the format. While this provides a connection to KEGG's rich database, it leaves the format unable to support non-KEGG attributes (such as database IDs for other databases) or general annotation. The format includes basic presentation information (colors and positions).
  • SBML focuses upon defining generic chemical reaction networks and their associated simulation parameters. It does not support ontologies to characterize chemical compounds, reactions, and so forth. It does support standard and user-defined units of measure. The current version does not support modularity or presentation information, but extensions have been proposed for these features.
  • SIF is a basic format for describing nodes and edges. Nodes may be named, and edges characterized as "protein-protein" or "protein-DNA". It does not support ontologies, simulation parameters, compartments, or presentation information. Its use is primarily as a rudimentary exchange format that can be easily created and edited in a text editor.
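
To show how little a SIF import has to work with, here is a minimal parsing sketch. It assumes the common SIF line layout "source relation target [target ...]", with "pp" and "pd" as the codes for protein-protein and protein-DNA edges; the function name parse_sif and the returned (nodes, edges) shape are hypothetical.

```python
# Assumed relation codes for the two edge kinds mentioned above.
EDGE_LABELS = {"pp": "protein-protein", "pd": "protein-DNA"}


def parse_sif(text):
    """Parse SIF text into (nodes, edges).

    nodes is a list of names in first-seen order; edges is a list of
    (source, relation, target) tuples.
    """
    nodes, edges = [], []

    def add_node(name):
        if name not in nodes:
            nodes.append(name)

    for line in text.splitlines():
        # Use tabs as the delimiter when present, else any whitespace.
        parts = line.split("\t") if "\t" in line else line.split()
        parts = [p.strip() for p in parts if p.strip()]
        if not parts:
            continue
        source = parts[0]
        add_node(source)
        if len(parts) >= 3:
            relation = parts[1]
            for target in parts[2:]:
                add_node(target)
                edges.append((source, relation, target))
    return nodes, edges


nodes, edges = parse_sif("A pp B C\nB pd D\nE\n")
for source, relation, target in edges:
    print(source, EDGE_LABELS.get(relation, relation), target)
```

Everything beyond names and edge kinds (compartments, parameters, layout) would be left empty in the Workbench's data structures after such an import.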

Ontology data

The Workbench will recognize several standard ontologies for chemical compounds, reactions, organisms, compartments, and so forth. The Gene Ontology project, for instance, provides classifications for cellular components, molecular functions, and biological processes, while the NCBI Taxonomy database classifies organisms. These ontologies provide type names and features for various components in a reaction network and enable the Workbench to provide better network validation and visualization.

The Workbench design supports the following specific ontologies and their file formats:

Type                     Ontology                                        Source
Cell types               Cell type                                       Open Biomedical Ontologies
Compartment types        Cellular component ontology                     Gene Ontology Project
Pathway types            Biological process ontology                     Gene Ontology Project
Organism types           Organism taxonomy                               National Center for Biotechnology Information
Rate law types           -                                               -
Interaction types        Molecular function ontology                     Gene Ontology Project
Chemical compound types  Protein types, small molecule types             Alliance for Cell Signaling, Molecule Pages
Units of measure         International System of Units                   Bureau International des Poids et Mesures
                         Reference on Constants, Units, and Uncertainty  National Institute of Standards and Technology (NIST)
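
One way to picture how pathway components might refer to these ontologies is a small typed reference that names the kind of type and a term identifier within the backing ontology. The sketch below is an assumption for illustration only: ONTOLOGY_FOR, TypeRef, and unrecognized are hypothetical names, and only some of the rows listed above are included.

```python
from dataclasses import dataclass

# Hypothetical mapping from a kind of type to the ontology that backs it,
# following the list above.
ONTOLOGY_FOR = {
    "cell": "Open Biomedical Ontologies (cell type)",
    "compartment": "Gene Ontology (cellular component)",
    "pathway": "Gene Ontology (biological process)",
    "interaction": "Gene Ontology (molecular function)",
    "organism": "NCBI organism taxonomy",
}


@dataclass(frozen=True)
class TypeRef:
    """Hypothetical reference from a pathway component to an ontology term."""
    kind: str       # key into ONTOLOGY_FOR
    term_id: str    # identifier within the backing ontology
    label: str = ""


def unrecognized(refs):
    """Return the references whose kind has no known backing ontology."""
    return [ref for ref in refs if ref.kind not in ONTOLOGY_FOR]


refs = [
    TypeRef("compartment", "GO:0005737", "cytoplasm"),
    TypeRef("organism", "9606", "Homo sapiens"),
    TypeRef("tissue", "unknown"),  # no backing ontology in this sketch
]
print(unrecognized(refs))
```

A validation pass could use a check of this kind to flag annotations that no recognized ontology covers.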