Unsupervised Markdown Feature-Aware Keywords Extraction Towards Technology Blogs

Yangyang Wang, Liping Hua, Hui Zhao, Lingfeng Yang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

A vast number of blogs are generated by online technology communities every day, most of them in Markdown format. The growth of Markdown documents has brought opportunities and challenges to many natural language processing tasks. Extracting keywords from technology blogs is of great value for discovering, retrieving, and sharing knowledge about technical blogs. Mainstream keyword extraction algorithms still rely on the statistical characteristics of words to determine the keywords of a document, seldom considering the structural characteristics of the document that potentially express semantic information. We argue that both the Markdown markup features and the textual content of a document are relevant to keyword extraction. In this paper, we propose a novel unsupervised, Markdown markup feature-aware keyword extraction algorithm for technology blogs. The algorithm integrates Markdown markup syntax information with a blog text representation. In experiments against the TF-IDF, TextRank, and PositionRank algorithms on a real Markdown document dataset, our algorithm achieves higher performance, with a substantial improvement when the number of extracted keywords is greater than 3.
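The abstract does not specify how markup information is combined with the text representation. The following is only a minimal illustrative sketch of the general idea, not the authors' algorithm: it assumes a simple weighted term-frequency score in which terms inside headings, bold spans, and inline code count more than plain text. The weight values, the stopword list, and the function names `weighted_spans` and `extract_keywords` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical markup weights; the abstract does not give the real ones.
MARKUP_WEIGHTS = {"heading": 3.0, "bold": 2.0, "code": 1.5, "plain": 1.0}

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is",
             "for", "with", "on"}

TOKEN = re.compile(r"[a-z][a-z0-9+#-]*")
HEADING = re.compile(r"^#{1,6}\s+(.*)$")
BOLD = re.compile(r"\*\*(.+?)\*\*")
CODE = re.compile(r"`([^`]+)`")


def weighted_spans(markdown):
    """Yield (text, weight) pairs for the spans found on each line."""
    for line in markdown.splitlines():
        heading = HEADING.match(line)
        if heading:
            yield heading.group(1), MARKUP_WEIGHTS["heading"]
            continue
        # Emit bold spans, then blank them out so they are not counted twice.
        for match in BOLD.finditer(line):
            yield match.group(1), MARKUP_WEIGHTS["bold"]
        line = BOLD.sub(" ", line)
        # Same treatment for inline code spans.
        for match in CODE.finditer(line):
            yield match.group(1), MARKUP_WEIGHTS["code"]
        line = CODE.sub(" ", line)
        # Whatever remains is plain text.
        yield line, MARKUP_WEIGHTS["plain"]


def extract_keywords(markdown, k=5):
    """Return the k terms with the highest markup-weighted frequency."""
    scores = Counter()
    for text, weight in weighted_spans(markdown):
        for token in TOKEN.findall(text.lower()):
            if token not in STOPWORDS and len(token) > 1:
                scores[token] += weight
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


doc = (
    "# Docker networking\n"
    "\n"
    "Use **docker** with `compose` to run the app.\n"
    "The app talks to the database.\n"
)
print(extract_keywords(doc, k=3))
```

In this toy document, "docker" ranks first because it appears in both a heading and a bold span, while "app" appears twice in plain text only. A purely statistical baseline such as TF-IDF would not see that distinction.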

Original language: English
Title of host publication: Proceedings - 2022 IEEE 46th Annual Computers, Software, and Applications Conference, COMPSAC 2022
Editors: Hong Va Leong, Sahra Sedigh Sarvestani, Yuuichi Teranishi, Alfredo Cuzzocrea, Hiroki Kashiwazaki, Dave Towey, Ji-Jiang Yang, Hossain Shahriar
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 223-228
Number of pages: 6
ISBN (Electronic): 9781665488105
State: Published - 2022
Event: 46th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2022 - Virtual, Online, United States
Duration: 27 Jun 2022 – 1 Jul 2022

Publication series

Name: Proceedings - 2022 IEEE 46th Annual Computers, Software, and Applications Conference, COMPSAC 2022

Conference

Conference: 46th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2022
Country/Territory: United States
City: Virtual, Online
Period: 27/06/22 – 1/07/22

Keywords

  • Extraction
  • Markdown feature
  • TF-IDF
  • TextRank
  • Unsupervised Machine Learning
