Repulsive parallel MCMC algorithm for discovering diverse motifs from large sequence sets

H Ikebata, R Yoshida - Bioinformatics, 2015 - academic.oup.com
Motivation
The motif discovery problem consists of finding recurring patterns of short strings in a set of nucleotide sequences. This classical problem is receiving renewed attention because most early motif discovery methods cannot handle the large datasets produced by recent genome-wide ChIP studies. New ChIP-tailored methods focus on reducing computation time and pay little attention to the accuracy of motif detection. Unlike such methods, our method focuses on increasing detection accuracy while keeping computational efficiency at an acceptable level. The major advantage of our method is that it can mine multiple diverse motifs that are undetectable by current methods.
Results
The repulsive parallel Markov chain Monte Carlo (RPMCMC) algorithm that we propose is a parallel version of the widely used Gibbs motif sampler. RPMCMC runs several interacting motif samplers in parallel. A repulsive force is generated when motifs produced by different samplers come close to each other, so different samplers explore different motifs. In this way, RPMCMC can detect a far more diverse set of motifs than conventional methods can. Through application to 228 transcription factor ChIP-seq datasets of the ENCODE project, we show that the RPMCMC algorithm can find many reliable cofactor interacting motifs that existing methods are unable to discover.
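The repulsion idea can be illustrated with a small Python sketch. This is not the authors' C++ implementation, and the abstract does not specify the form of the repulsive force: the distance measure, the exponential penalty, and the names `pwm_distance`, `repulsion_log_penalty`, `strength` and `scale` are all assumptions made for illustration. The sketch only shows how a candidate motif that sits close to another sampler's current motif could be penalized, while a distant one is left almost untouched.

```python
import math

def pwm_distance(pwm_a, pwm_b):
    """Mean squared Euclidean distance between two equal-width PWMs.

    Each PWM is a list of columns; each column is [pA, pC, pG, pT].
    (Assumed distance measure, chosen only for illustration.)
    """
    if len(pwm_a) != len(pwm_b):
        raise ValueError("PWMs must have the same width")
    total = 0.0
    for col_a, col_b in zip(pwm_a, pwm_b):
        total += sum((a - b) ** 2 for a, b in zip(col_a, col_b))
    return total / len(pwm_a)

def repulsion_log_penalty(candidate, other_pwms, strength=5.0, scale=0.1):
    """Hypothetical log-scale penalty added to a sampler's motif score.

    The penalty is strongly negative when the candidate is close to a motif
    held by another sampler and decays towards zero as the distance grows.
    """
    closeness = sum(
        math.exp(-pwm_distance(candidate, other) / scale)
        for other in other_pwms
    )
    return -strength * closeness

# Two toy width-2 motifs: roughly "AC" and "TG".
motif_ac = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]]
motif_tg = [[0.01, 0.01, 0.01, 0.97], [0.01, 0.01, 0.97, 0.01]]

# Another sampler currently holds motif_ac.
same = repulsion_log_penalty(motif_ac, [motif_ac])      # strongly penalized
distinct = repulsion_log_penalty(motif_tg, [motif_ac])  # almost no penalty
print(same < distinct)
```

In a full sampler, a term of this kind would be combined with each sampler's own motif likelihood, so that a sampler drifting towards a motif already occupied by another sampler is pushed away from it. How RPMCMC actually couples the samplers is described in the paper, not in this abstract.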
Availability and implementation
A C++ implementation of RPMCMC and discovered cofactor motifs for the 228 ENCODE ChIP-seq datasets are available from http://daweb.ism.ac.jp/yoshidalab/motif.
Supplementary information
Supplementary data are available from Bioinformatics online.
Oxford University Press