What's new


Christopher Schröder & Sven Rahmann

Dupre estimates the duplicate rate of a sequencing library at a given sequencing depth N when the occupancy vector of a (small) subsample is known. This helps when deciding which sequencing depth to aim for, weighing the potential for new discoveries against cost.
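To make the input concrete, here is a minimal Python sketch of an occupancy vector and the duplicate rate observed in a subsample. It assumes that entry k of the occupancy vector counts the distinct fragments seen exactly k times, and that the duplicate rate is the fraction of reads that are redundant copies. Both definitions, the function names and the numbers are illustrative assumptions. The sketch does not reproduce Dupre's extrapolation to a larger depth N.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Sequence


def occupancy_vector(fragment_ids: Iterable[Hashable]) -> List[int]:
    """Return [o_1, o_2, ...], where o_k is the number of distinct
    fragments observed exactly k times (assumed definition)."""
    copies = Counter(fragment_ids)        # copies per distinct fragment
    histogram = Counter(copies.values())  # number of fragments with k copies
    max_k = max(histogram, default=0)
    return [histogram.get(k, 0) for k in range(1, max_k + 1)]


def observed_duplicate_rate(occupancy: Sequence[int]) -> float:
    """Fraction of reads that are redundant copies of an already seen fragment."""
    distinct = sum(occupancy)
    total = sum(k * count for k, count in enumerate(occupancy, start=1))
    if total == 0:
        raise ValueError("occupancy vector contains no reads")
    return 1.0 - distinct / total


# Made-up subsample: 700 fragments seen once, 100 twice, 20 three times, 10 four times.
subsample = [700, 100, 20, 10]
print(round(observed_duplicate_rate(subsample), 4))  # 0.17
```

In this toy subsample, 1000 reads cover 830 distinct fragments, so 17% of the reads are duplicates. The observed rate only describes the subsample itself. Predicting the rate at a deeper sequencing depth is the estimation problem that Dupre addresses.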

28.07.2016 | Poster and Workshop at ECCB

A second poster from our working group will be presented at ECCB 2016. Daniela Beisser will present a poster about Taxonomic assignment of protist metatranscriptome sequences. She will also present the topic during the ECCB workshop “W11 – Recent Computational Advances in Metagenomics (RCAM’16)” on 4th September. See the workshop website for more information.

Taxonomic assignment of protist metatranscriptome sequences
Daniela Beisser, Nadine Graupner, Lars Grossmann, Jens Boenigk and Sven Rahmann

Next generation sequencing (NGS) technologies are increasingly applied to analyse complex microbial ecosystems by mRNA sequencing of whole communities, also known as metatranscriptome sequencing. In principle, each sequenced mRNA makes it possible both to identify the species of origin and to assign a function to the transcribed gene. While the functional information is sufficiently covered by databases such as UniProt, NCBI, KEGG and many others, species identification is currently limited by incomplete reference databases. Inferring the community composition from metatranscriptomic samples is thus still a difficult problem. At the moment, most analyses are restricted to prokaryotic communities, which enjoy better database coverage, or to communities of few known species with sequenced genomes, or to a combination of rRNA and mRNA sequencing. However, the latter approach does not allow taxonomic and functional information to be linked directly.

Our approach focuses on an accurate assignment of taxonomic groups to metatranscriptomic reads. We constructed a custom database that comprises all major eukaryotic groups, developed a stand-alone tool to assign reads with a low false discovery rate and created a workflow for complete metatranscriptome analysis. The workflow covers all bioinformatic steps: preprocessing of the raw data, taxonomic and functional assignment, and visualisation of the results.

28.07.2016 | Poster about EAGLE at ECCB

A poster about the Exome Analysis GraphicaL Environment (EAGLE) was accepted for ECCB 2016 in The Hague. Felix Mölder will present the poster there.

EAGLE: an easy-to-use web-based exome analysis environment
Christopher Schröder, Felix Mölder, Christoph Stahl and Sven Rahmann

High-throughput exome sequencing is a widely used technology for deciphering mutations in the coding regions of a genome at relatively low cost. While bioinformatics analyses of exome sequencing data mostly agree on best practices regarding the analysis steps, the called genomic variants depend on the chosen parameters and the filtering applied. We present EAGLE, a software tool that combines a best-practices variant calling workflow with a web frontend. By storing the called variant information in HDF5 files (instead of SQL databases), EAGLE allows filtering and parameter tuning in almost real time. This lets medical PIs iteratively tune thresholds or select different samples for filtering via the web interface. The web interface presents metadata, annotations, quality control data and statistics to facilitate a comprehensive data analysis on different levels.

July 2016 | mundo reports on the project “Data Driven Materials Design”

The current special issue of the research magazine mundo on the topic “Materials Chain” includes a report on the project Data Driven Materials Design. The project, funded by the Mercator Research Center Ruhr (MERCUR) from 01.10.2012 to 30.09.2014, was dedicated to the systematic design of new materials through interdisciplinary collaboration between materials science and computer science. It was a cooperation between the Faculties of Physics and Astronomy (Prof. Drautz) and of Mechanical Engineering (Prof. Ludwig) of Ruhr-Universität Bochum and two computer science chairs, at TU Dortmund (Prof. Morik) and at the University of Duisburg-Essen (Prof. Rahmann), working on data mining and high-throughput analysis, respectively.


Bianca Stöcker, Johannes Köster & Sven Rahmann

SimLoRD is a read simulator for third-generation sequencing reads. It currently focuses on the Pacific Biosciences SMRT error model.

Reads are simulated from both strands of a provided or randomly generated reference sequence.


  • The reference can be read from a FASTA file or randomly generated with a given GC content. It can consist of several chromosomes, whose structure is respected when drawing reads. (Simulation of genome rearrangements may be incorporated at a later stage.)
  • The read lengths can be determined in four ways: drawing from a log-normal distribution (typical for genomic DNA), sampling from an existing FASTQ file (typical for RNA), sampling from a text file with integers (RNA), or using a fixed length.
  • Quality values and number of passes depend on fragment length.
  • Provided subread error probabilities are modified according to the number of passes.
  • SimLoRD outputs reads in FASTQ format and alignments in SAM format.
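The first two points above can be illustrated with a short, standard-library Python sketch: a random reference with a given GC content, log-normally distributed read lengths, and error-free reads drawn from both strands. This is not SimLoRD's code or API. All function names and parameter values (mu, sigma, min_length, the seed) are hypothetical, and the error model, quality values and SAM output are deliberately left out.

```python
import random
from typing import List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def random_reference(length: int, gc_content: float, rng: random.Random) -> str:
    """Random DNA sequence whose expected GC fraction is gc_content."""
    at, gc = (1.0 - gc_content) / 2.0, gc_content / 2.0
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


def draw_read_lengths(n: int, mu: float, sigma: float,
                      min_length: int, rng: random.Random) -> List[int]:
    """Draw n read lengths from a log-normal distribution, rejecting short ones."""
    lengths: List[int] = []
    while len(lengths) < n:
        length = int(rng.lognormvariate(mu, sigma))
        if length >= min_length:
            lengths.append(length)
    return lengths


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def sample_reads(reference: str, lengths: List[int], rng: random.Random) -> List[str]:
    """Cut error-free fragments at random positions, from either strand."""
    reads = []
    for length in lengths:
        length = min(length, len(reference))  # a read cannot exceed the reference
        start = rng.randrange(len(reference) - length + 1)
        fragment = reference[start:start + length]
        if rng.random() < 0.5:                # reverse strand with probability 1/2
            fragment = reverse_complement(fragment)
        reads.append(fragment)
    return reads


rng = random.Random(42)                       # fixed seed for reproducibility
reference = random_reference(20_000, gc_content=0.5, rng=rng)
lengths = draw_read_lengths(10, mu=8.0, sigma=0.4, min_length=50, rng=rng)
reads = sample_reads(reference, lengths, rng)
print(len(reads), min(lengths), max(lengths))
```

Passing a seeded `random.Random` instance keeps the sketch reproducible without touching the global random state. A simulator such as SimLoRD would additionally introduce sequencing errors and assign quality values, which according to the list above depend on fragment length and number of passes.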

SimLoRD can be obtained via Bioconda and PyPI.