Spectra clustering on repository scale - MA - Kuster Lab

Title: Clustering fragment spectra on repository scale

Type: MSc

Category: ML, DB, DS

Programming language: [ SQL and Python / C++ ]

Language: [ English ]

Prior experience: [ experience with SQL and a programming language (Python or C++) is required; no biological background is needed ]

Complexity/Risk: high

Contact person: Matthew The

Brief background description: Shotgun proteomics relies on the identification of fragment spectra, which can be regarded as fingerprints of peptides (and thus of proteins). We have collected ~10^9 fragment spectra in our platform ProteomicsDB. For a large fraction of these spectra, however, we do not know which peptide or protein they belong to. One promising approach to tackling these unknowns is to apply unsupervised machine learning to find clusters of frequently occurring spectra that lack identifications.


  • Griss, Johannes, et al. "Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets." Nature methods 13.8 (2016): 651.
  • Skinner, Owen S., and Neil L. Kelleher. "Illuminating the dark matter of shotgun proteomics." Nature biotechnology 33.7 (2015): 717.
  • Samaras, Patroklos, et al. "ProteomicsDB: a multi-omics and multi-organism resource for life science research." Nucleic acids research 48.D1 (2020): D1153-D1163.
  • The, Matthew, and Lukas Käll. "MaRaCluster: A fragment rarity metric for clustering fragment spectra in shotgun proteomics." Journal of proteome research 15.3 (2016): 713-720.

Brief description of the project: In this project, you will explore several use cases of fragment spectrum clustering in ProteomicsDB. The first part consists of adapting an existing fragment spectrum clustering algorithm so that it can take all spectra in ProteomicsDB as input. In the second part, you will evaluate whether cluster membership can be used to transfer information between fragment spectra belonging to the same cluster. Additionally, different approaches to "online" clustering could be explored, in which new datasets are added to the existing clustering without re-clustering the entire repository.
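
To make the second part and the "online" idea more concrete, here is a minimal Python sketch under stated assumptions; it is not the planned implementation. It assumes that the "information" to transfer is a peptide identification, that spectra are represented as sparse binned vectors (bin index to intensity), and that cosine similarity stands in for the distance measure of whichever clustering algorithm is eventually chosen. All function names, parameters, and thresholds (transfer_identifications, assign_online, min_agreement, threshold) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def transfer_identifications(cluster_of, peptide_of, min_agreement=0.8):
    """Propose peptides for unidentified spectra from the cluster consensus.

    cluster_of: dict mapping spectrum id -> cluster id (all spectra)
    peptide_of: dict mapping spectrum id -> peptide (identified spectra only)
    min_agreement: hypothetical threshold on the share of identified
        cluster members that must agree on the same peptide
    """
    # Count the peptide labels observed in each cluster.
    labels_per_cluster = defaultdict(Counter)
    for spectrum_id, cluster_id in cluster_of.items():
        if spectrum_id in peptide_of:
            labels_per_cluster[cluster_id][peptide_of[spectrum_id]] += 1

    proposals = {}
    for spectrum_id, cluster_id in cluster_of.items():
        if spectrum_id in peptide_of:
            continue  # already identified
        counts = labels_per_cluster.get(cluster_id)
        if not counts:
            continue  # cluster without any identified member
        peptide, votes = counts.most_common(1)[0]
        if votes / sum(counts.values()) >= min_agreement:
            proposals[spectrum_id] = peptide
    return proposals


def cosine(a, b):
    """Cosine similarity of two sparse spectra (dict: bin -> intensity)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_online(spectrum, representatives, threshold=0.7):
    """Assign a new spectrum to the most similar cluster, or open a new one.

    representatives: dict mapping integer cluster id -> representative
        spectrum; modified in place when a new cluster is opened.
    """
    best_id, best_sim = None, threshold
    for cluster_id, representative in representatives.items():
        sim = cosine(spectrum, representative)
        if sim >= best_sim:
            best_id, best_sim = cluster_id, sim
    if best_id is None:
        # No existing cluster is similar enough: start a new cluster.
        best_id = max(representatives, default=-1) + 1
        representatives[best_id] = dict(spectrum)
    return best_id
```

In this sketch, a cluster whose identified members disagree, or which has no identified member at all, yields no proposal; such clusters are the consistently unidentified ones that the project aims to analyse further. The online step compares each new spectrum against one representative per cluster rather than against every stored spectrum, and it leaves existing representatives unchanged. Whether and how to update them is one of the design questions the project could explore.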

Expected result: A clustering mechanism for ProteomicsDB that serves as a starting point for a more detailed analysis of the unidentified spectra in our database.