Capturing Semantics for Imputation with Pre-trained Language Models | IEEE Conference Publication | IEEE Xplore

Capturing Semantics for Imputation with Pre-trained Language Models


Abstract:

Existing imputation methods generally generate several possible fillings as candidates and then select the imputed value from among them. However, these methods ignore semantics. Recently, pre-trained language models have achieved strong performance on various language understanding tasks. Motivated by this, we propose IPM, which captures semantics for Imputation with Pre-trained language Models. A straightforward idea is to model the imputation task as a multiclass classification task, named IPM-Multi. IPM-Multi predicts the missing values by fine-tuning the pre-trained model. Because databases have low redundancy and large domain sizes, IPM-Multi may suffer from over-fitting. To address this, we develop another approach named IPM-Binary. IPM-Binary first generates a set of uncertain candidates and then fine-tunes a pre-trained language model to select among them. Specifically, IPM-Binary models the candidate selection task as a binary classification problem. Unlike IPM-Multi, IPM-Binary computes a probability for each candidate filling separately, accepting both the complete attributes and one candidate filling as input. The attention mechanism enhances the ability of IPM-Binary to capture semantic information. Moreover, sampling negatives from neighbors rather than from whole domains accelerates training and makes it more targeted and effective. As a result, IPM-Binary requires less data to converge. We compare IPM with state-of-the-art baselines on multiple datasets, and extensive experimental results show that IPM outperforms existing solutions. The evaluation of IPM validates our intuitions and demonstrates the effectiveness of the proposed optimizations.
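The binary formulation and the neighbor-based negative sampling described in the abstract can be illustrated with a small data-preparation sketch. It is not the authors' implementation: the text serialization with a `[SEP]` marker, the attribute-overlap notion of a "neighbor", and the names `serialize`, `overlap`, and `build_pairs` are all assumptions made for illustration, and the fine-tuning of the pre-trained language model itself is omitted.

```python
import random


def serialize(row, target, candidate):
    # Hypothetical input format: the complete attributes, then one candidate filling.
    context = " ; ".join(f"{k}: {v}" for k, v in row.items() if k != target)
    return f"{context} [SEP] {target}: {candidate}"


def overlap(a, b, target):
    # Assumed similarity: number of non-target attributes on which two rows agree.
    return sum(1 for k in a if k != target and a[k] == b.get(k))


def build_pairs(rows, target, k=3, n_neg=1, seed=0):
    """Turn complete rows into (text, label) pairs for a binary classifier."""
    rng = random.Random(seed)
    pairs = []
    for i, row in enumerate(rows):
        truth = row[target]
        # Positive example: the row paired with its true value.
        pairs.append((serialize(row, target, truth), 1))
        # Neighbors: the k other rows sharing the most attribute values.
        others = [r for j, r in enumerate(rows) if j != i]
        neighbors = sorted(
            others, key=lambda r: overlap(row, r, target), reverse=True
        )[:k]
        # Negatives are drawn from neighbor values, not from the whole domain.
        wrong = sorted({r[target] for r in neighbors if r[target] != truth})
        for cand in rng.sample(wrong, min(n_neg, len(wrong))):
            pairs.append((serialize(row, target, cand), 0))
    return pairs
```

Under these assumptions, each complete row yields one positive pair and up to `n_neg` negative pairs whose candidate values come from similar rows, so the classifier would see plausible but incorrect fillings rather than arbitrary domain values.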
Date of Conference: 19-22 April 2021
Date Added to IEEE Xplore: 22 June 2021
Conference Location: Chania, Greece

I. Introduction

In practice, missing data are prevalent, owing to optional inputs in information collection systems, mismatches when integrating heterogeneous data sources, and so on. These missing values significantly reduce the quality of the data and make the data hard to use.

