TY - GEN
T1 - Using surrounding text of formula towards more accurate mathematical information retrieval
AU - Chen, Cheng
AU - Dai, Yifan
AU - Shen, Yuqi
AU - Cai, Jinfang
AU - Chen, Liangyu
N1 - Publisher Copyright: © 2021 Knowledge Systems Institute Graduate School. All rights reserved.
PY - 2021
Y1 - 2021
N2 - Formula retrieval is an important research topic in Mathematical Information Retrieval (MIR). Most studies have focused on comparing formulae to determine the similarity between mathematical documents. However, two similar formulae may appear in completely different knowledge domains and carry different meanings. Building on the N-ary Tree-based Formula Embedding Model (NTFEM), we introduce a new hybrid retrieval model that combines each formula with its surrounding text for more accurate retrieval. Using keyword extraction, we extract keywords from the text around a formula to supplement the formula's semantic information. We then obtain representation vectors of the keywords with the FastText N-gram embedding model and representation vectors of the formulae with NTFEM. Finally, documents are first ranked by keyword similarity, and the ranking is then refined by formula similarity. Experimental results show that the accuracy of the top-10 results is at least 20% higher than that of NTFEM and can reach 50% on some specific topics.
KW - Extraction
UR - https://www.scopus.com/pages/publications/85114280458
U2 - 10.18293/SEKE2021-143
DO - 10.18293/SEKE2021-143
M3 - Conference contribution
AN - SCOPUS:85114280458
T3 - Proceedings of the International Conference on Software Engineering and Knowledge Engineering, SEKE
SP - 622
EP - 627
BT - Proceedings - SEKE 2021
PB - Knowledge Systems Institute Graduate School
T2 - 33rd International Conference on Software Engineering and Knowledge Engineering, SEKE 2021
Y2 - 1 July 2021 through 10 July 2021
ER -