TY - GEN
T1 - Distenc
T2 - 34th IEEE International Conference on Data Engineering, ICDE 2018
AU - Ge, Hancheng
AU - Zhang, Kai
AU - Alfifi, Majid
AU - Hu, Xia
AU - Caverlee, James
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/10/24
Y1 - 2018/10/24
N2 - How can we efficiently recover missing values for very large-scale real-world datasets that are multi-dimensional even when the auxiliary information is regularized at certain mode? Tensor completion is a useful tool to recover a low-rank tensor that best approximates partially observed data and further predicts the unobserved data by this low-rank tensor, which has been successfully used for many applications such as location-based recommender systems, link prediction, targeted advertising, social media search, and event detection. Due to the curse of dimensionality, existing algorithms for tensor completion that integrate auxiliary information do not scale for tensors with billions of elements. In this paper, we propose DisTenC, a new distributed large-scale tensor completion algorithm that can be distributed on Spark. Our key insights are to (i) efficiently handle trace-based regularization terms; (ii) update factor matrices with caching; and (iii) optimize the update of the new tensor via residuals. In this way, we can tackle the high computational costs of traditional approaches and minimize intermediate data, leading to order-of-magnitude improvements in tensor completion. Experimental results demonstrate that DisTenC is capable of handling up to 10~1000X larger tensors than existing methods with much faster convergence rate, shows better linearity on machine scalability, and achieves up to an average improvement of 23.5% in accuracy in applications.
AB - How can we efficiently recover missing values for very large-scale real-world datasets that are multi-dimensional even when the auxiliary information is regularized at certain mode? Tensor completion is a useful tool to recover a low-rank tensor that best approximates partially observed data and further predicts the unobserved data by this low-rank tensor, which has been successfully used for many applications such as location-based recommender systems, link prediction, targeted advertising, social media search, and event detection. Due to the curse of dimensionality, existing algorithms for tensor completion that integrate auxiliary information do not scale for tensors with billions of elements. In this paper, we propose DisTenC, a new distributed large-scale tensor completion algorithm that can be distributed on Spark. Our key insights are to (i) efficiently handle trace-based regularization terms; (ii) update factor matrices with caching; and (iii) optimize the update of the new tensor via residuals. In this way, we can tackle the high computational costs of traditional approaches and minimize intermediate data, leading to order-of-magnitude improvements in tensor completion. Experimental results demonstrate that DisTenC is capable of handling up to 10~1000X larger tensors than existing methods with much faster convergence rate, shows better linearity on machine scalability, and achieves up to an average improvement of 23.5% in accuracy in applications.
KW - Auxiliary Information
KW - Distributed Algorithm
KW - Spark
KW - Tensor Completion
UR - https://www.scopus.com/pages/publications/85057074498
U2 - 10.1109/ICDE.2018.00022
DO - 10.1109/ICDE.2018.00022
M3 - 会议稿件
AN - SCOPUS:85057074498
T3 - Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018
SP - 137
EP - 148
BT - Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 16 April 2018 through 19 April 2018
ER -