25th International Conference on Database Systems for Advanced Applications

Sep. 24-27, 2020, Jeju, South Korea

Visit the DASFAA 2020 online event site at the following URL:

http://dasfaa2020.sigongji.com

Paper details

Title: GDS: General Distributed Strategy for Functional Dependency Discovery Algorithms

Authors: Peizhong Wu, Wei Yang, Haichuan Wang and Liusheng Huang

Abstract: Functional dependencies (FDs) are important metadata that describe relationships among the columns of a dataset and can be used in a number of tasks, such as schema normalization and data cleansing. In modern big data environments, data are partitioned, so single-node FD discovery algorithms are inefficient without parallelization. However, existing parallel and distributed algorithms incur huge communication costs and thus do not perform well enough. To solve this problem, we propose a general parallel discovery strategy, called GDS, to improve the parallel performance of single-node algorithms. GDS consists of two essential building blocks: the FD-Combine algorithm and the affine plane block design algorithm. The former infers the final FDs from part-FD sets, where a part-FD set is an FD set that holds over part of the original dataset. The latter generates data blocks such that the part-FD sets of these blocks satisfy the FD-Combine induction condition. With our strategy, any single-node FD discovery algorithm can be parallelized directly in distributed environments without modification. In the evaluation, with p threads, the speedups of the FD discovery algorithm FastFDs exceed √p in most cases and even exceed p/2 in some cases. In distributed environments, the best multi-threaded algorithm, HYFD, also improves significantly with our strategy when the number of threads is large.
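
The notion of a part-FD set can be made concrete with a small sketch. The Python snippet below is not the FD-Combine algorithm or the affine plane block design from the paper; it is only an illustrative check, with a hypothetical helper `fd_holds` and made-up sample rows, of whether an FD holds over a set of rows. It shows why FDs found on part of a dataset need not hold on the whole dataset, which is the gap a combining step has to close.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds over rows (a list of dicts).

    Hypothetical helper for illustration; not taken from the paper.
    """
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = row[rhs]
        # The FD is violated if the same lhs values map to two rhs values.
        if seen.setdefault(key, val) != val:
            return False
    return True


# Made-up sample data.
rows = [
    {"zip": "10001", "city": "New York", "name": "Ann"},
    {"zip": "10001", "city": "New York", "name": "Bob"},
    {"zip": "94105", "city": "San Francisco", "name": "Ann"},
]

whole_zip_city = fd_holds(rows, ["zip"], "city")     # holds on all rows
whole_name_zip = fd_holds(rows, ["name"], "zip")     # violated by the two "Ann" rows
part_name_zip = fd_holds(rows[:2], ["name"], "zip")  # holds on the first two rows only
```

Here `name -> zip` belongs to the FD set of the first two rows but not to the FD set of the full data, whereas any FD that holds on the full data also holds on every subset of it.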
