Topic Mining over Asynchronous Text Sequences
Time stamped texts, or text sequences, are ubiquitous in real-world applications. Multiple text sequences are often related to each other by sharing common topics. The correlation among these sequences provides more meaningful and comprehensive clues for topic mining than those from each individual sequence. However, it is nontrivial to explore the correlation with the existence of asynchronism among multiple sequences, i.e., documents from different sequences about the same topic may have different time stamps. In this paper, we formally address this problem and put forward a novel algorithm based on the generative topic model. Our algorithm consists of two alternate steps: the first step extracts common topics from multiple sequences based on the adjusted time stamps provided by the second step; the second step adjusts the time stamps of the documents according to the time distribution of the topics discovered by the first step. We perform these two steps alternately and after iterations a monotonic convergence of our objective function can be guaranteed. The effectiveness and advantage of our approach were justified through extensive empirical studies on two real data sets consisting of six research paper repositories and two news article feeds, respectively.