TY - GEN
T1 - From categorical to numerical
T2 - SIAM International Conference on Data Mining 2015, SDM 2015
AU - Zhang, Kai
AU - Wang, Qiaojun
AU - Chen, Zhengzhang
AU - Marsic, Ivan
AU - Kumar, Vipin
AU - Jiang, Guofei
AU - Zhang, Jie
N1 - Publisher Copyright:
Copyright © SIAM.
PY - 2015
Y1 - 2015
N2 - Categorical data are ubiquitous in real-world databases. However, due to the lack of an intrinsic proximity measure, many powerful algorithms for numerical data analysis may not work well on their categorical counterparts, making it a bottleneck in practical applications. In this paper, we propose a novel method to transform categorical data to numerical representations, so that abundant numerical learning methods can be exploited in categorical data mining. Our key idea is to learn a pairwise dissimilarity among categorical symbol-s, henceforth a continuous embedding, which can then be used for subsequent numerical treatment. There are two important criteria for learning the dissimilarities. First, it should capture the important "transitivity" which has shown to be particularly useful in measuring the proximity relation in categorical data. Second, the pairwise sample geometry arising from the learned symbol distances should be maximally consistent with prior knowledge (e.g., class labels) to obtain a good generalization performance. We achieve them through multiple transitive distance learning and embedding. Encouraging results are observed on a number of benchmark classification tasks against state-of-the-art.
AB - Categorical data are ubiquitous in real-world databases. However, due to the lack of an intrinsic proximity measure, many powerful algorithms for numerical data analysis may not work well on their categorical counterparts, making it a bottleneck in practical applications. In this paper, we propose a novel method to transform categorical data to numerical representations, so that abundant numerical learning methods can be exploited in categorical data mining. Our key idea is to learn a pairwise dissimilarity among categorical symbol-s, henceforth a continuous embedding, which can then be used for subsequent numerical treatment. There are two important criteria for learning the dissimilarities. First, it should capture the important "transitivity" which has shown to be particularly useful in measuring the proximity relation in categorical data. Second, the pairwise sample geometry arising from the learned symbol distances should be maximally consistent with prior knowledge (e.g., class labels) to obtain a good generalization performance. We achieve them through multiple transitive distance learning and embedding. Encouraging results are observed on a number of benchmark classification tasks against state-of-the-art.
UR - https://www.scopus.com/pages/publications/84961894401
U2 - 10.1137/1.9781611974010.6
DO - 10.1137/1.9781611974010.6
M3 - 会议稿件
AN - SCOPUS:84961894401
T3 - SIAM International Conference on Data Mining 2015, SDM 2015
SP - 46
EP - 54
BT - SIAM International Conference on Data Mining 2015, SDM 2015
A2 - Venkatasubramanian, Suresh
A2 - Ye, Jieping
PB - Society for Industrial and Applied Mathematics Publications
Y2 - 30 April 2015 through 2 May 2015
ER -