Capturing Semantics for Imputation with Pre-trained Language Models

Abstract:

Existing imputation methods generally generate several possible fillings as candidates and then choose the value to impute from among them. However, these methods ignore semantics. Pre-trained language models have recently achieved good performance on various language understanding tasks. Motivated by this, we propose IPM, which captures semantics for Imputation with Pre-trained language Models. A straightforward idea is to model the imputation task as a multiclass classification task, named IPM-Multi. IPM-Multi predicts the missing values by fine-tuning the pre-trained model. Due to the low redundancy of databases and large domain sizes, IPM-Multi may suffer from over-fitting. For this case, we develop another approach named IPM-Binary. IPM-Binary first generates a set of uncertain candidates and fine-tunes a pre-trained language model to select among them. Specifically, IPM-Binary models the candidate selection task as a binary classification problem. Unlike IPM-Multi, IPM-Binary computes the probability for each candidate filling separately, by accepting both the complete attributes and a candidate filling as input. The attention mechanism enhances the ability of IPM-Binary to capture semantic information. Moreover, negative sampling from neighbors rather than from domains is employed to accelerate the training process and make the training more targeted and effective. As a result, IPM-Binary requires less data to converge. We compare our proposal IPM to state-of-the-art baselines on multiple datasets. Extensive experimental results show that IPM outperforms existing solutions. The evaluation of IPM validates our intuitions and demonstrates the effectiveness of the proposed optimizations.
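To make the IPM-Binary formulation more concrete, the following is a minimal sketch of how labeled training pairs might be assembled: each input joins the complete attributes of a tuple with one candidate filling, and negatives are drawn from the values held by similar tuples rather than from the whole domain. It is not the authors' implementation. The `[SEP]`-style text serialization, the attribute-overlap similarity used to pick neighbors, and every function name and parameter (`serialize`, `overlap`, `neighbor_candidates`, `training_pairs`, `k`, `n_neg`) are assumptions made purely for illustration.

```python
import random
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]


def serialize(row: Row, target: str, candidate: str) -> str:
    """Join the complete attributes and one candidate filling into a text input.

    The "attr: value" layout and the [SEP] marker are assumed, not taken from the paper.
    """
    context = " ; ".join(
        f"{attr}: {val}"
        for attr, val in row.items()
        if attr != target and val is not None
    )
    return f"{context} [SEP] {target}: {candidate}"


def overlap(a: Row, b: Row, target: str) -> int:
    """Count non-target attributes on which two rows agree (assumed similarity)."""
    return sum(
        1
        for attr, val in a.items()
        if attr != target and val is not None and b.get(attr) == val
    )


def neighbor_candidates(row: Row, table: List[Row], target: str, k: int) -> List[str]:
    """Distinct target values held by the k most similar other rows."""
    others = [r for r in table if r is not row and r.get(target) is not None]
    others.sort(key=lambda r: overlap(row, r, target), reverse=True)
    values: List[str] = []
    for r in others[:k]:
        if r[target] not in values:
            values.append(r[target])
    return values


def training_pairs(
    table: List[Row], target: str, k: int = 3, n_neg: int = 2, seed: int = 0
) -> List[Tuple[str, int]]:
    """Build (text, label) pairs: label 1 for the true value, 0 for neighbor negatives."""
    rng = random.Random(seed)
    pairs: List[Tuple[str, int]] = []
    for row in table:
        truth = row.get(target)
        if truth is None:
            continue  # rows missing the target cannot supervise training
        pairs.append((serialize(row, target, truth), 1))
        negatives = [c for c in neighbor_candidates(row, table, target, k) if c != truth]
        for cand in rng.sample(negatives, min(n_neg, len(negatives))):
            pairs.append((serialize(row, target, cand), 0))
    return pairs
```

Pairs of this shape could then be used to fine-tune any binary sequence classifier. At imputation time, each candidate for a missing cell would be serialized the same way and scored separately, with the highest-probability candidate chosen as the filling.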

Published in: 2021 IEEE 37th International Conference on Data Engineering (ICDE)

Date of Conference: 19-22 April 2021

Date Added to IEEE Xplore: 22 June 2021

DOI: 10.1109/ICDE51399.2021.00013

Conference Location: Chania, Greece

I. Introduction

In practice, missing data are prevalent, owing to optional inputs in information collection systems, mismatches when integrating heterogeneous data sources, and similar causes. These missing values significantly reduce the quality of the data and make the data hard to use.