Databases

Huge amounts of mass spectrometry data are produced every day. Storing them in flat files is often suboptimal, because parsing those files on the fly to extract specific information is time-consuming. Keeping them in ad hoc in-memory data structures is also difficult, because the structures have to be written to and read back from disk with every change. Databases are the key technology for organizing huge amounts of data and retrieving them quickly. However, not all data is meant to be stored and used in the same way. Relational (SQL) databases are well suited to storing studies together with their data, results, intermediate analysis steps, metadata and anything else that describes a study. When the underlying data form a graph, as in identifier mapping, a relational database is less efficient than a graph database. A graph database also ships with graph-theory algorithms for finding shortest paths, patterns, minimum spanning trees and dense subgraphs. Distributing the data of a study across the appropriate database management systems (DBMS) can therefore speed up analyses that would be impractical with a single DBMS.
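To make the identifier-mapping case concrete, the following is a minimal sketch in Python of a shortest-path query over a small mapping graph, using a plain breadth-first search. The identifiers, the edge list and the function names (build_graph, shortest_path) are hypothetical and chosen only for illustration; they do not reflect the schema or API of ProteomicsDB or of any particular graph database.

```python
from collections import deque

# Hypothetical identifier mappings (illustrative only, not real data).
# Each pair states that two identifiers refer to related entities.
MAPPINGS = [
    ("GENE:G1", "TRANSCRIPT:T1"),
    ("TRANSCRIPT:T1", "PROTEIN:P1"),
    ("PROTEIN:P1", "PEPTIDE:PEP1"),
    ("GENE:G1", "PROTEIN:P2"),
    ("PROTEIN:P2", "PEPTIDE:PEP1"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (a, b) identifier pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, source, target):
    """Return one shortest path from source to target, or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the predecessor links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        # Sorted iteration keeps the result deterministic.
        for neighbour in sorted(graph[node]):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


graph = build_graph(MAPPINGS)
path = shortest_path(graph, "GENE:G1", "PEPTIDE:PEP1")
print(" -> ".join(path))
```

In a relational schema, the same question typically requires a recursive query or repeated self-joins over a mapping table, whereas a graph database exposes this kind of traversal as a built-in operation.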

Our flagship project is ProteomicsDB (https://www.proteomicsdb.org/), an in-memory (SAP HANA), multi-omics, multi-organism interactive resource for life science research. Development of ProteomicsDB started in 2012 and was fueled by a close collaboration between TUM and SAP. Today, ProteomicsDB and its development environment are hosted by the SAP UCC in Garching and run on an IBM POWER9 machine with 6 TB of main memory, in conjunction with a 48-core Intel server with two P100 GPUs for deep learning. Over the years, more than 20 developers have been involved in this project. Today, the resource is accessed from about 600 unique IP addresses per day on average.

Available Projects:

  1. Spectra clustering at repository scale - MA
  2. PTM integration - BA, MA, Internship
  3. Analytics for ProteomicsDB - BA, MA, Internship