The article “Analysis of min-hashing for variant tolerant DNA read mapping” by Jens Quedenfeld (now at TU Munich) and Sven Rahmann has received the Best Paper Award at the Workshop on Algorithms in Bioinformatics (WABI) 2017, held in Cambridge, MA, USA, August 20-23, 2017.
The authors consider an important question, as DNA read mapping has become a ubiquitous task in bioinformatics. New technologies provide ever longer DNA reads (several thousand basepairs), although at comparatively high error rates (up to 15%), and the reference genome is increasingly not considered as a simple string over ACGT anymore, but as a complex object containing known genetic variants in the population. Conventional indexes based on exact seed matches, in particular the suffix array based FM index, struggle with these changing conditions, so other methods are being considered, and one such alternative is locality sensitive hashing. Here we examine the question whether including single nucleotide polymorphisms (SNPs) in a min-hashing index is beneficial. The answer depends on the population frequency of the SNP, and we analyze several models (from simple to complex) that provide precise answers to this question under various assumptions. Our results also provide sensitivity and specificity values for min-hashing based read mappers and may be used to understand dependencies between the parameters of such methods. This article may provide a theoretical foundation for a new generation of read mappers.
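To make the idea of min-hashing concrete, here is a minimal Python sketch. It is not the index analyzed in the article: the k-mer length, the number of hash functions, the use of salted BLAKE2b as a hash family, and all function names are assumptions chosen for illustration. It shows how the fraction of agreeing min-hash values estimates the Jaccard similarity of two k-mer sets, which is the quantity a SNP in a read perturbs.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(seq, k=8, num_hashes=64):
    """Compute a min-hash signature: one minimum per (assumed) hash function.

    Each hash function is BLAKE2b with a different salt; this is only an
    illustrative choice of hash family.
    """
    kmer_set = kmers(seq, k)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(km.encode(), digest_size=8, salt=salt).digest()
            for km in kmer_set
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which both signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


reference = "ACGTTGCATGCCGATAGCTAGGATCCGATTACGGCTAAGT"
read_with_snp = reference[:20] + "T" + reference[21:]  # one substituted base

sig_ref = minhash_signature(reference)
sig_read = minhash_signature(read_with_snp)
print(estimate_jaccard(sig_ref, sig_ref))   # identical sequences agree everywhere
print(estimate_jaccard(sig_ref, sig_read))  # the SNP lowers the estimate
```

A single substitution changes up to k of the k-mers, so the estimated similarity between a read carrying the variant and a reference without it drops; whether adding the variant to the index pays off is the question the article analyzes.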
The article can be freely accessed in the WABI conference proceedings (Proceedings of the 17th International Workshop on Algorithms in Bioinformatics (WABI 2017), Russell Schwartz and Knut Reinert (Eds.), LIPIcs Vol. 88).
This work is part of subproject C1 of the collaborative research center SFB 876.
The beta distribution is a continuous probability distribution that takes values in the unit interval [0,1]. It has been used in several bioinformatics applications to model data that naturally takes values between 0 and 1, such as relative frequencies, probabilities, absolute correlation coefficients, or DNA methylation levels of CpG dinucleotides or longer genomic regions. One of the most prominent applications is the estimation of false discovery rates (FDRs) from p-value distributions after multiple tests by fitting a beta-uniform mixture. By linear scaling, beta distributions can be used to model any quantity that takes values in a finite interval [L,U] ⊂ R. We show that maximum likelihood estimation (MLE) has significant disadvantages for beta distributions. The main problem is that the likelihood function is not finite (for almost all parameter values) if any of the observed data points are x_i = 0 or x_i = 1.
For mixture distributions, MLE frequently results in a non-concave problem with many local maxima, and one uses heuristics that return a local optimum from given starting parameters. Because MLE is already problematic for a single beta distribution, the expectation maximization (EM) algorithm does not work for beta mixtures unless ad-hoc corrections are made. We therefore propose a new algorithm for parameter estimation in beta mixtures that we call the iterated method of moments.
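The contrast between the two estimators can be illustrated for a single beta distribution. The following Python sketch is not the iterated algorithm from the article; it only shows the standard method-of-moments estimator for one component and why the log-likelihood breaks down when an observation equals 0. All function names are chosen here for illustration.

```python
import math
import statistics


def beta_from_moments(mean, var):
    """Method-of-moments parameters (alpha, beta) for a given mean and variance."""
    if not (0.0 < mean < 1.0) or not (0.0 < var < mean * (1.0 - mean)):
        raise ValueError("moments are not attainable by a beta distribution")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common


def beta_method_of_moments(data):
    """Estimate (alpha, beta) from the sample mean and population variance."""
    return beta_from_moments(statistics.fmean(data), statistics.pvariance(data))


def _weighted_log(coefficient, x):
    """Return coefficient * log(x), with explicit handling of x == 0."""
    if coefficient == 0.0:
        return 0.0
    if x == 0.0:
        return -math.inf if coefficient > 0 else math.inf
    return coefficient * math.log(x)


def beta_log_likelihood(data, alpha, beta):
    """Log-likelihood of data under Beta(alpha, beta); may be infinite."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return sum(
        log_norm
        + _weighted_log(alpha - 1.0, x)
        + _weighted_log(beta - 1.0, 1.0 - x)
        for x in data
    )


observations = [0.0, 0.2, 0.4, 0.6]              # contains an exact zero
print(beta_method_of_moments(observations))      # finite, positive estimates
print(beta_log_likelihood(observations, 2.0, 2.0))  # not finite because of x = 0
```

The moment estimates only need the sample mean and variance, so exact zeros and ones cause no difficulty, whereas the log-likelihood is infinite for every alpha other than 1 as soon as one observation is 0.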
Christopher Schröder*, Elsa Leitão*, Stefan Wallner, Gerd Schmitz, Ludger Klein-Hitpass, Anupam Sinha, Karl-Heinz Jöckel, Stefanie Heilmann-Heimbach, Per Hoffmann, Markus M. Nöthen, Michael Steffens, Peter Ebert, Sven Rahmann and Bernhard Horsthemke
* Contributed equally
Epigenetics & Chromatin 2017, doi:10.1186/s13072-017-0144-2
There is increasing evidence for inter-individual methylation differences at CpG dinucleotides in the human genome, but the regional extent and function of these differences have not yet been studied in detail. For identifying regions of common methylation differences, we used whole genome bisulfite sequencing data of monocytes from five donors and a novel bioinformatic strategy.
We identified 157 differentially methylated regions (DMRs) with four or more CpGs, almost none of which has been described before. The DMRs fall into different chromatin states, where methylation is inversely correlated with active, but not repressive histone marks. However, methylation is not correlated with the expression of associated genes. High-resolution single nucleotide polymorphism (SNP) genotyping of the five donors revealed evidence for a role of cis-acting genetic variation in establishing methylation patterns. To validate this finding in a larger cohort, we performed genome-wide association studies (GWAS) using SNP genotypes and 450k array methylation data from blood samples of 1128 individuals. Only 30/157 (19%) DMRs include at least one 450k CpG, which shows that these arrays miss a large proportion of DNA methylation variation. In most cases, the GWAS peak overlapped the CpG position, and these regions are enriched for CREB group, NF-1, Sp100 and CTCF binding motifs. In two cases, there was tentative evidence for a trans-effect by KRAB zinc finger proteins.
Allele-specific DNA methylation occurs in discrete chromosomal regions and is driven by genetic variation in cis and trans, but in general has little effect on gene expression
Starting July 1 of this year, the Genome Informatics group is participating in a new project to elucidate the molecular causes of complications in osteoporosis therapy.
The overarching goal of the project, which is planned to run for four years, is to establish a personalized therapy.
The consortium of scientists and companies is coordinated by Prof. Nina Babel (Transplantation Immunology, Marienhospital Herne, Klinikum der Ruhr-Universität Bochum).
Owing to its extensive experience in data analysis, the Genome Informatics group will analyze bioinformatics data at the genetic and epigenetic level. For this work, it receives funding of 235,188 €.
The project is funded by the European Regional Development Fund and the Leitmarkt Agentur NRW.
SimLoRD is a read simulator for third generation sequencing reads and is currently focused on the Pacific Biosciences SMRT error model.
Third generation sequencing methods provide longer reads than second generation methods and have distinct error characteristics.
In a SMRT library the sequenced DNA fragments are circular with adapter sequences between forward and backward strand, and a fragment may be sequenced multiple times in a single run. For a single pass through the sequence (subread), the error rate is high, but it is possible to calculate a consensus after multiple passes (circular consensus sequence read, CCS). Thus the error rate of CCSs decreases with the number of passes.
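Why the consensus error rate falls with the number of passes can be seen in a deliberately simplified toy model. This is not SimLoRD's error model: it assumes that each pass is wrong at a given position independently with the same probability, that all erroneous passes agree with each other (a worst case), and that the consensus is a plain majority vote over an odd number of passes. The function name is hypothetical.

```python
import math


def majority_error(per_pass_error, passes):
    """Probability that a majority vote over `passes` independent passes is wrong.

    Toy assumption: a position is miscalled in the consensus exactly when
    more than half of the passes are erroneous at that position.
    """
    if passes % 2 == 0:
        raise ValueError("use an odd number of passes to avoid ties")
    needed = passes // 2 + 1
    return sum(
        math.comb(passes, k)
        * per_pass_error ** k
        * (1.0 - per_pass_error) ** (passes - k)
        for k in range(needed, passes + 1)
    )


for n in (1, 3, 5, 7):
    print(n, majority_error(0.15, n))
```

Under these assumptions, a per-pass error probability of 0.15 drops to about 0.061 after three passes and keeps decreasing with further passes.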
We analyzed public data from Pacific Biosciences (PacBio) SMRT sequencing, developed an error model and implemented it in a new read simulator called SimLoRD. Reads are simulated from both strands of a provided or randomly generated reference sequence. SimLoRD offers options to choose the read length distribution and to model error probabilities depending on the number of passes through the sequencer. The new error model makes SimLoRD the most realistic SMRT read simulator available.
Christopher Schröder, Christoph Stahl, Felix Mölder, André Janowicz, Jasmin Beygo, Marcel Martin, Sven Rahmann
The Exome Analysis GraphicaL Environment (EAGLE) combines a best-practices variant calling workflow with a web frontend. By storing the called information in specifically structured HDF5 files, EAGLE allows filtering and parameter tuning in almost real time. This enables non-computer scientists to iteratively tune thresholds or select different samples for filtering via the web interface.
Toll‐like receptor (TLR) 13 and TLR2 are the major sensors of Gram‐positive bacteria in mice. TLR13 recognizes Sa19, a specific 23S ribosomal (r) RNA‐derived fragment and bacterial modification of Sa19 ablates binding to TLR13, and to antibiotics such as erythromycin. Similarly, RNase A‐treated Staphylococcus aureus activate human peripheral blood mononuclear cells (PBMCs) only via TLR2, implying single‐stranded (ss) RNA as major stimulant.
Bianca Stöcker has been working with us in Essen since November 2015. Bianca will initially continue working on her simulator for long reads, which she developed during her master's thesis. She is also working with Johannes Köster (Dana-Farber Cancer Institute, Boston) and Eli Zamir (MPI, Dortmund) on protein hypernetworks.
Welcome, Bianca!
Bioinformatics Analysis of Heterogenous Data Reveals Characteristic Mutational Landscapes of Neuroblastoma Relapses, GCB 2015 in Dortmund
Marc Schulte, Johannes Köster, Daniela Beisser, Corinna Ernst, Christopher Schröder, Alexander Schramm and Sven Rahmann
Neuroblastoma is a malignancy of the developing sympathetic nervous system that causes 15% of childhood cancer-related mortality. However, in the vast majority of cases, death results not from the initial disease manifestation but rather from metastasis or recurrence.
Systematic search for genomic alterations in primary neuroblastomas has shown low genetic complexity, with significant mutations in only a very few genes. This study explored the genomic landscape of relapsing neuroblastoma in order to evaluate ‘driver’ mutations to be exploited as therapeutic targets.
Henning Timm and Till Hartmann
Dinopy (Dna INput and Output in PYthon) is a Python package that aims to simplify the development of bioinformatics applications by providing efficient facilities for DNA input and output.
At the time of writing, no available library for reading and writing DNA-specific file formats makes full use of the potential of Cython. Dinopy exports Cython-level API bindings that other Cython applications can use for additional speedup.
A. Schramm, J. Köster, Y. Assenov, K. Althoff, M. Peifer, E. Mahlow, A. Odersky, D. Beisser, C. Ernst, A. G. Henssen, H. Stephan, C. Schröder, L. Heukamp, A. Engesser, Y. Kahlert, J. Theissen, B. Hero, F. Roels, J. Altmüller, P. Nürnberg, K. Astrahantseff, C. Gloeckner, K. De Preter, C. Plass, S. Lee, H. N. Lode, K. Henrich, M. Gartlgruber, F. Speleman, P. Schmezer, F. Westermann, S. Rahmann, M. Fischer, A. Eggert, J. H Schulte
Neuroblastoma is a malignancy of the developing sympathetic nervous system that is often lethal when relapse occurs. We here used whole-exome sequencing, mRNA expression profiling, array CGH and DNA methylation analysis to characterize 16 paired samples at diagnosis and relapse from individuals with neuroblastoma.
The German Conference on Bioinformatics (GCB) is an annual, international conference devoted to all areas of bioinformatics. Recent meetings attracted a multinational audience with 250 – 300 participants each year. The meeting is open to all fields of bioinformatics.
Imprinting of the human RB1 gene is due to the presence of a differentially methylated CpG island (CGI) in intron 2, which is part of a retrocopy derived from the PPP1R26 gene on chromosome 9. The murine Rb1 gene does not have this retrocopy and is not imprinted. We have investigated whether the RB1/Rb1 locus is unique with respect to these differences.
Exomate: an easy to use exome sequencing analysis pipeline
Christopher Schröder, Johannes Köster, Christoph Stahl, Sebastian Venier, Sven Rahmann, Marcel Martin
Exomate is an exome-sequencing pipeline with a web frontend. It automates most steps needed to go from FASTQ files to variant calls, puts the calls and metadata about patients, samples, etc. into a database and then allows interactive analysis via a web frontend. It is primarily designed for easy use and has already been used in various studies [1,2,3].
[1] Martin, M. et al., 2013. Exome sequencing identifies recurrent somatic mutations in EIF1AX and SF3B1 in uveal melanoma with disomy 3. Nat. Genet. 45, 933–936.
[2] Czeschik, J.C. et al., 2013. Clinical and mutation data in 12 patients with the clinical diagnosis of Nager syndrome. Hum. Genet. 132, 885–898.
[3] Voigt, C. et al., 2013. Oto-facial syndrome and esophageal atresia, intellectual disability and zygomatic anomalies – expanding the phenotypes associated with EFTUD2 mutations. Orphanet J. Rare Dis. 8, 110.
In metabolic engineering by gene knockouts, one searches for genes controlling metabolic reactions that should be removed from a metabolic network in order to optimize the yield of a desired metabolite.
Conventionally, this is done by undirected mutagenesis followed by selection of the population with the best efficiency.
Unrean et al. developed a simple algorithm that directly predicts reaction targets and thereby avoids the high cost of this uncontrolled process. It is based on elementary modes, which are non-decomposable sequences of metabolite transformation flows in the network.
We substantially improved the algorithm and applied it to a network of Escherichia coli to show the improved results.
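The basic idea of choosing knockout targets from elementary modes can be sketched in a few lines of Python. This is neither the algorithm of Unrean et al. nor our improved version: the data structure, the toy modes with their yields, the single-knockout restriction, and the criterion (maximize the worst yield among the surviving modes) are all assumptions made for illustration.

```python
def surviving_modes(modes, knocked_out):
    """Keep only the modes that use none of the knocked-out reactions."""
    return [m for m in modes if not (m["reactions"] & knocked_out)]


def best_single_knockout(modes):
    """Return (reaction, worst_yield) for the best single knockout.

    A knockout is better if the lowest product yield among the surviving
    elementary modes is higher. Returns (None, baseline) if nothing helps.
    """
    reactions = set().union(*(m["reactions"] for m in modes))
    best = (None, min(m["yield"] for m in modes))
    for reaction in sorted(reactions):
        rest = surviving_modes(modes, {reaction})
        if not rest:
            continue  # the network would have no feasible mode left
        worst = min(m["yield"] for m in rest)
        if worst > best[1]:
            best = (reaction, worst)
    return best


# Hypothetical elementary modes: reactions used and product yield per mode.
TOY_MODES = [
    {"reactions": frozenset({"R1", "R2"}), "yield": 0.0},
    {"reactions": frozenset({"R1", "R3"}), "yield": 0.8},
    {"reactions": frozenset({"R2", "R4"}), "yield": 0.1},
    {"reactions": frozenset({"R3", "R4"}), "yield": 0.6},
]

print(best_single_knockout(TOY_MODES))
```

In this toy network, removing R2 eliminates both low-yield modes, so every remaining mode produces the desired metabolite with a yield of at least 0.6.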