A Probabilistic Topic Model based on an Arbitrary-Length Co-occurrence Window

  • Marziea Rahimi, School of Computer and IT Engineering, Shahrood University of Technology, Shahrood, Iran
  • Morteza Zahedi, School of Computer and IT Engineering, Shahrood University of Technology, Shahrood, Iran
  • Hoda Mashayekhi, School of Computer and IT Engineering, Shahrood University of Technology, Shahrood, Iran
Keywords: probabilistic topic modeling, co-occurrence, context window, Gibbs sampling, generative models

Abstract

Probabilistic topic models have been widely used for automatic text analysis since their introduction. These models rely on word co-occurrence, but they are not very flexible with respect to the context in which co-occurrence is considered: many probabilistic topic models cannot take local or positional information into account. In this paper, we introduce a probabilistic topic model that exploits an arbitrary-length co-occurrence window and encodes local word dependencies to extract topics. We place a multinomial distribution with a Dirichlet prior over the window positions, so that words at every position have a chance to influence topic assignments. Because topics in the proposed model are represented by word pairs, they have a more interpretable presentation. The model is applied to a dataset of 2000 documents, where it produces meaningful topics and reduces the problem of sparseness.


Author Biographies

Marziea Rahimi, School of Computer and IT Engineering, Shahrood University of Technology, Shahrood, Iran

Marziea Rahimi is a Ph.D. student at the Department of Computer Engineering of Shahrood University of Technology. Her research interests revolve around statistical machine learning, text mining and topic modeling.

Morteza Zahedi, School of Computer and IT Engineering, Shahrood University of Technology, Shahrood, Iran

Morteza Zahedi is an assistant professor at the Department of Computer Engineering of Shahrood University of Technology. He received his Ph.D. from RWTH Aachen University. His research interests focus on statistical pattern recognition, data mining and machine vision.

Hoda Mashayekhi, School of Computer and IT Engineering, Shahrood University of Technology, Shahrood, Iran

Hoda Mashayekhi is an assistant professor at the Department of Computer Engineering of Shahrood University of Technology. She received her Ph.D. from Sharif University of Technology in 2013. Her research interests include parallel and distributed computing, data mining, decision making, peer-to-peer (P2P) networks and semantic structures.

Published
2017-06-30
How to Cite
Rahimi, M., Zahedi, M., & Mashayekhi, H. (2017, June 30). A Probabilistic Topic Model based on an Arbitrary-Length Co-occurrence Window. International Journal of Information & Communication Technology Research, 9(2), 19-25. Retrieved from http://journal.itrc.ac.ir/index.php/ijictr/article/view/6
Section
Information Technology