Sherlock: A Deep Learning Approach to Semantic Data Type Detection

Madelon Hulsebos, Kevin Hu, Michiel Bakker, Emanuel Zgraggen, Arvind Satyanarayan, Tim Kraska, Çağatay Demiralp, César Hidalgo

Data processing and analysis flow, starting from (1) a corpus of real-world datasets, proceeding to (2) feature extraction, (3) mapping extracted features to ground truth semantic types, and (4) model training and prediction.

Abstract

Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches often are not robust to dirty data and only detect a limited number of types. We introduce Sherlock, a multi-input deep neural network for detecting semantic types. We train Sherlock on 686,765 data columns retrieved from the VizNet corpus by matching 78 semantic types from DBpedia to column headers. We characterize each matched column with 1,588 features describing the statistical properties, character distributions, word embeddings, and paragraph vectors of column values. Sherlock achieves a support-weighted F1 score of 0.89, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations.

Citation

Sherlock: A Deep Learning Approach to Semantic Data Type Detection

Madelon Hulsebos, Kevin Hu, Michiel Bakker, Emanuel Zgraggen, Arvind Satyanarayan, Tim Kraska, Çağatay Demiralp, César Hidalgo

ACM Knowledge Discovery and Data Mining (KDD), 2019.

Bibtex

@inproceedings{2019-sherlock,
 title = {{Sherlock: A Deep Learning Approach to Semantic Data Type Detection}},
 author = {Madelon Hulsebos AND Kevin Hu AND Michiel Bakker AND Emanuel Zgraggen AND Arvind Satyanarayan AND Tim Kraska AND \c{C}a\u{g}atay Demiralp AND C\'{e}sar Hidalgo},
 booktitle = {ACM Knowledge Discovery and Data Mining (KDD)},
 year = {2019},
 url = {http://vis.csail.mit.edu/pubs/sherlock}
}

Materials

Videos

Video Preview