TY - JOUR
T1 - A Hybrid Model Combining Formulae with Keywords for Mathematical Information Retrieval
AU - Shen, Yuqi
AU - Chen, Cheng
AU - Dai, Yifan
AU - Cai, Jinfang
AU - Chen, Liangyu
N1 - Publisher Copyright: © 2021 World Scientific Publishing Company.
PY - 2021/12/1
Y1 - 2021/12/1
N2 - Formula retrieval is an important research topic in Mathematical Information Retrieval (MIR). Most studies have focused on formula comparison to determine the similarity between mathematical documents. However, two similar formulae may appear in entirely different knowledge domains and carry different meanings. Building on the N-ary Tree-based Formula Embedding Model (NTFEM), our previous work [Y. Dai, L. Chen, and Z. Zhang, An N-ary tree-based model for similarity evaluation on mathematical formulae, in Proc. 2020 IEEE Int. Conf. Systems, Man, and Cybernetics, 2020, pp. 2578-2584], we introduce a new hybrid retrieval model, NTFEM-K, which combines formulae with their surrounding keywords for more accurate retrieval. Using keyword extraction, we extract keywords from the context of each formula to supplement its semantic information. We then obtain vector representations of the keywords with the FastText N-gram embedding model and vector representations of the formulae with NTFEM. Finally, documents are ranked by keyword similarity, and the ranking is then refined by formula similarity. For performance evaluation, NTFEM-K is compared not only with NTFEM but also with hybrid retrieval models that combine formulae with long text and with hybrid retrieval models that combine formulae with keywords obtained by other keyword extraction algorithms. Experimental results show that the top-10 accuracy of NTFEM-K is at least 20% higher than that of NTFEM and can be 50% higher on some specific topics.
KW - Formula embedding
KW - Formula similarity
KW - Keywords extraction
KW - Mathematical information retrieval
KW - Word embedding
UR - https://www.scopus.com/pages/publications/85124046761
U2 - 10.1142/S0218194021400131
DO - 10.1142/S0218194021400131
M3 - Article
AN - SCOPUS:85124046761
SN - 0218-1940
VL - 31
SP - 1583
EP - 1602
JO - International Journal of Software Engineering and Knowledge Engineering
JF - International Journal of Software Engineering and Knowledge Engineering
IS - 11-12
ER -