OpenCB was started in 2012 by Ignacio Medina while he was in Joaquin Dopazo's group. He now heads the Computational Biology Lab in the UIS at Cambridge, where he continues this work. OpenCB is now used by many projects and research institutes, and it is currently developed by more than 10 active developers and researchers.
OpenCB consists of different projects that solve different problems in current genomics. Each of these projects constitutes a standalone solution that can be easily imported into existing projects. The projects have been designed to provide scalable and high-performance solutions for storing, processing, analysing, sharing and visualizing big data in genomics and clinics in a secure and efficient manner. To achieve this, OpenCB uses advanced computing technologies from HPC (such as SSE4/AVX2, GPUs, OpenMP) and Big Data (Hadoop, Spark) for data processing and analysis, NoSQL databases (MongoDB, HBase) for data indexing, and HTML5 (SVG, IndexedDB) for interactive data visualization. Development is being carried out at the University of Cambridge, EMBL-EBI and Genomics England, among other institutions. More information can be found at http://www.opencb.org.
OpenCB is open source and freely available. An overview of all the projects is available on GitHub at https://github.com/opencb.
OpenCB consists of a number of projects; the most relevant of these are listed below:
CellBase (https://github.com/opencb/cellbase) constitutes the knowledge-base database for all OpenCB projects. CellBase is a NoSQL database that integrates the most relevant biological information about genomic features and proteins, gene expression, regulation, functional annotation, genomic variation and systems biology information. Its knowledge base relies on well-established repositories such as Ensembl, UniProt, ClinVar, COSMIC and IntAct, among others. CellBase also has a built-in variant annotation component that provides an Ensembl VEP-compatible annotation. All data is available through either a command-line interface or RESTful web services.
OpenCGA (https://github.com/opencb/opencga) provides a scalable and high-performance solution for big data analysis and visualization in a shared environment. OpenCGA integrates some of the OpenCB projects and, in addition, implements other components: i) a storage engine framework to store and index alignments and genomic variants in different NoSQL databases such as MongoDB or Hadoop HBase - the current implementation can efficiently store thousands of gVCF files while remaining responsive when querying data; ii) a Catalog which keeps track of users, projects, files, samples, annotations, etc., and also provides authentication and authorization capabilities; iii) an analysis engine to execute genomic analyses on a traditional HPC cluster or in Hadoop. OpenCGA provides a command-line interface and RESTful web services to manage and query all the data.
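As a rough illustration of the RESTful access mentioned above for CellBase, the sketch below assembles a query URL from path segments. The host (cellbase.example.org), the version string, the segment order and the helper name build_rest_url are all assumptions made for illustration; they are not taken from the CellBase documentation, which should be consulted for the real endpoints.

```python
from urllib.parse import quote


def build_rest_url(host, version, species, category, subcategory, ids, resource):
    """Join path segments into a REST-style URL.

    The segment layout used here is a hypothetical example,
    not a confirmed CellBase endpoint scheme.
    """
    # Percent-encode each identifier, then join several IDs with commas
    id_list = ",".join(quote(i, safe="") for i in ids)
    segments = [version, species, category, subcategory, id_list, resource]
    return host.rstrip("/") + "/" + "/".join(segments)


# Placeholder host and parameters, chosen only for the example
url = build_rest_url(
    "https://cellbase.example.org/rest",
    "v4", "hsapiens", "feature", "gene",
    ["BRCA2", "TP53"], "info",
)
print(url)
```

The function performs no network access; it only shows how a client might compose a request before handing it to an HTTP library.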
Visualization with Genome Maps and CellMaps
Finally, OpenCB includes a genome browser called Genome Maps (https://github.com/opencb/genome-maps) and a systems biology tool called CellMaps (https://github.com/opencb/cell-maps). Genome Maps is a high-performance HTML5 SVG-based genome browser that interactively displays CellBase data and OpenCGA-indexed data such as BAM and VCF files. Users can also easily extend Genome Maps to display their own data and formats. OpenCB projects are compliant with the new GA4GH data models and formats.
High-Performance Genomics (HPG)
HPG projects make use of standard HPC and big data technologies to provide a scalable and efficient solution for several genomic analyses. The main HPG projects are:
HPG Aligner (https://github.com/opencb/hpg-aligner) is an ultra-fast and sensitive HPC Next-Generation Sequencing (NGS) read aligner. It combines advanced data structures and novel algorithms implemented with multi-threading and AVX2. Work is currently under way at Cambridge to explore Intel Xeon Phi.
HPG Variant (https://github.com/opencb/hpg-variant) is HPC software to process and analyze genomic variant data. Several algorithms have been developed and implemented.
HPG BigData (https://github.com/opencb/hpg-bigdata) is a Hadoop MapReduce and Spark implementation of several genomics analyses for working with big data.
Who is using it
Many projects within research institutes around the world are using OpenCB technologies, demonstrating the success of this initiative. For instance, ICGC, EMBL-EBI and Genomics England are using and contributing to some of these projects. Source code is open and freely available on GitHub at https://github.com/opencb.