
HTM: A topic model for hypertexts

  • Congkai Sun*
  • Bin Gao
  • Zhenfu Cao
  • Hang Li
  • *Corresponding author for this work
  • Shanghai Jiao Tong University
  • Microsoft USA

Research output: Contribution to conference, Paper, peer-reviewed

Abstract

Previously topic models such as PLSI (Probabilistic Latent Semantic Indexing) and LDA (Latent Dirichlet Allocation) were developed for modeling the contents of plain texts. Recently, topic models for processing hypertexts such as web pages were also proposed. The proposed hypertext models are generative models giving rise to both words and hyperlinks. This paper points out that to better represent the contents of hypertexts it is more essential to assume that the hyperlinks are fixed and to define the topic model as that of generating words only. The paper then proposes a new topic model for hypertext processing, referred to as Hypertext Topic Model (HTM). HTM defines the distribution of words in a document (i.e., the content of the document) as a mixture over latent topics in the document itself and latent topics in the documents which the document cites. The topics are further characterized as distributions of words, as in the conventional topic models. This paper further proposes a method for learning the HTM model. Experimental results show that HTM outperforms the baselines on topic discovery and document classification in three datasets.
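The mixture described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact parameterization, so everything here is an assumption made for illustration: the function name `htm_word_distribution`, the single `own_weight` parameter balancing a document's own topics against those of the documents it cites, and the uniform averaging over cited documents are hypothetical and are not taken from the paper.

```python
import numpy as np

def htm_word_distribution(theta_own, thetas_cited, phi, own_weight=0.5):
    """Word distribution of one document under a simplified hypertext mixture.

    theta_own:    (K,) topic proportions of the document itself
    thetas_cited: list of (K,) topic proportions, one per cited document
    phi:          (K, V) per-topic word distributions (each row sums to 1)
    own_weight:   hypothetical weight on the document's own topics
    """
    theta_own = np.asarray(theta_own, dtype=float)
    phi = np.asarray(phi, dtype=float)
    if len(thetas_cited) == 0:
        # No outgoing links: fall back to the document's own topics only.
        mixed_theta = theta_own
    else:
        # Assumption: every cited document contributes equally.
        cited_mean = np.asarray(thetas_cited, dtype=float).mean(axis=0)
        mixed_theta = own_weight * theta_own + (1.0 - own_weight) * cited_mean
    # Marginalize over topics to obtain a distribution over the vocabulary.
    return mixed_theta @ phi

# Toy example: 2 topics, 3 vocabulary words, one cited document.
phi = [[0.5, 0.5, 0.0],
       [0.0, 0.5, 0.5]]
p_w = htm_word_distribution([1.0, 0.0], [[0.0, 1.0]], phi, own_weight=0.5)
```

In this toy setting the hyperlinks are treated as fixed inputs and only words receive probability, which matches the abstract's framing; the learning procedure for the topic proportions and word distributions is not sketched here.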

Original language: English
Pages: 514-522
Number of pages: 9
Publication status: Published - 2008
Externally published: Yes
Event: 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, Co-located with AMTA 2008 and the International Workshop on Spoken Language Translation - Honolulu, HI, United States
Duration: 25 Oct 2008 to 27 Oct 2008

Conference

Conference: 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, Co-located with AMTA 2008 and the International Workshop on Spoken Language Translation
Country/Territory: United States
City: Honolulu, HI
Period: 25/10/08 to 27/10/08
