Big Data Quality Assessment Model for Unstructured Data | IEEE Conference Publication | IEEE Xplore

Big Data Quality Assessment Model for Unstructured Data


Abstract:

Big Data has gained enormous momentum over the past few years because of the tremendous volume of data generated and processed across diverse application domains. It is now estimated that 80% of all generated data is unstructured. Evaluating the quality of Big Data has been identified as essential to guaranteeing data quality dimensions such as completeness and accuracy. Current initiatives for evaluating unstructured data quality are still under investigation. In this paper, we propose a quality evaluation model to handle the quality of Unstructured Big Data (UBD). The model first captures the key properties and characteristics of unstructured big data, then provides mechanisms to sample and profile the UBD dataset and to extract features and characteristics from heterogeneous data types in different formats. A Data Quality repository manages the relationships between data quality dimensions, quality metrics, feature extraction methods, mining methodologies, data types, and data domains. An analysis of the samples provides a data profile of the UBD. This profile is extended to a quality profile that contains the quality mapping with the features selected for quality assessment. We developed a UBD quality assessment model that handles all the processes from UBD profiling exploration to the quality report. The model provides an initial blueprint for quality estimation of unstructured Big Data. It also states a set of quality characteristics and indicators that can be used to outline an initial data quality schema for UBD.
Date of Conference: 18-19 November 2018
Date Added to IEEE Xplore: 10 January 2019
ISBN Information:
Print on Demand(PoD) ISSN: 2325-5498
Conference Location: Al Ain, United Arab Emirates

I. Introduction

Big data is commonly defined as the way we gather, store, manipulate, analyze, and gain insight from fast-growing heterogeneous data. Most newly generated data is unstructured, owing to the growth of mobile and human-generated data from social media, which combine text, pictures, audio, and video in an unstructured way. Industry analysts say that unstructured data is growing faster than all other types of data. It will increase by as much as 800 percent during the next five years according to a survey conducted by [1]. This creates an urgent need to automatically characterize and categorize such data. These classifications are strongly coupled with the semantic meaning of what the data represents. In many cases, the data arrives in a format and quality state that make it impossible to process as is; even when it can be processed, the results cannot guarantee valuable analysis or insights.
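To make the sample, profile, and assess flow outlined in the abstract more concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the `Item` record, the `REPOSITORY` mapping, the `completeness` metric (share of non-empty payloads), and every function name are assumptions introduced purely for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Item:
    """Hypothetical record: one unstructured item and its declared data type."""
    data_type: str  # e.g. "text", "image"
    content: str    # raw payload; empty or whitespace-only means missing


def sample(items: List[Item], k: int, seed: int = 0) -> List[Item]:
    """Draw a reproducible random sample of at most k items."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))


def profile(items: List[Item]) -> Dict[str, int]:
    """Build a very small data profile: item count per data type."""
    counts: Dict[str, int] = {}
    for item in items:
        counts[item.data_type] = counts.get(item.data_type, 0) + 1
    return counts


def completeness(items: List[Item]) -> float:
    """Assumed metric: fraction of items whose payload is not blank."""
    if not items:
        return 0.0
    return sum(1 for item in items if item.content.strip()) / len(items)


# Hypothetical stand-in for the Data Quality repository:
# data type -> quality dimension -> metric function.
REPOSITORY: Dict[str, Dict[str, Callable[[List[Item]], float]]] = {
    "text": {"completeness": completeness},
}


def quality_report(items: List[Item], k: int = 100, seed: int = 0) -> dict:
    """Sample, profile, then score each data type with its mapped metrics."""
    drawn = sample(items, k, seed)
    data_profile = profile(drawn)
    quality: Dict[str, Dict[str, float]] = {}
    for data_type in data_profile:
        subset = [item for item in drawn if item.data_type == data_type]
        metrics = REPOSITORY.get(data_type, {})
        quality[data_type] = {dim: fn(subset) for dim, fn in metrics.items()}
    return {"profile": data_profile, "quality": quality}
```

Data types with no entry in the assumed repository simply receive an empty set of scores, which mirrors the idea that the quality profile only covers features for which a dimension-to-metric mapping has been selected.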

References

1. R. Arsenault, "The Benefits of Utilizing Unstructured Data", Aberdeen.
2. J. Manyika et al., "Big data: The next frontier for innovation competition and productivity", McKinsey Glob. Inst, pp. 1-137, 2011.
3. M. Chen, S. Mao and Y. Liu, "Big Data: A Survey", Mob. Netw. Appl, vol. 19, no. 2, pp. 171-209, 2014.
4. C. P. Chen and C.-Y. Zhang, "Data-intensive applications challenges techniques and technologies: A survey on Big Data", Inf. Sci, vol. 275, pp. 314-347, 2014.
5. J. Wielki, "The Opportunities and Challenges Connected with Implementation of the Big Data Concept", Advances in ICT for Business Industry and Public Sector, pp. 171-189, 2015.
6. I. A. T. Hashem, I. Yaqoob, N. B. Anuar, S. Mokhtar, A. Gani and S. Ullah Khan, "The rise of ‘big data’ on cloud computing: Review and open research issues", Inf. Syst, vol. 47, pp. 98-115, 2015.
7. H. Hu, Y. Wen, T.-S. Chua and X. Li, "Toward Scalable Systems for Big Data Analytics: A Technology Tutorial", IEEE Access, vol. 2, pp. 652-687, 2014.
8. P. Kluegl, M. Toepfer, P.-D. Beck, G. Fette and F. Puppe, "UIMA Ruta: Rapid development of rule-based information extraction applications", Nat. Lang. Eng, vol. 22, no. 1, pp. 1-40, 2016.
9. M. W. Berry and J. Kog, "Text Mining: Applications and Theory", pp. 223.
10. C. Rangu, S. Chatterjee and S. R. Valluru, "Text Mining Approach for Product Quality Enhancement: (Improving Product Quality through Machine Learning)", 2017 IEEE 7th International Advance Computing Conference (IACC), pp. 456-460, 2017.
11. F. S. Gharehchopogh and Z. A. Khalifelu, "Analysis and evaluation of unstructured data: text mining versus natural language processing", 2011 5th International Conference on Application of Information and Communication Technologies (AICT), pp. 1-4, 2011.
12. B. Plale, "Big Data Opportunities and Challenges for IR Text Mining and NLP", Proceedings of the 2013 International Workshop on Mining Unstructured Big Data Using Natural Language Processing, pp. 1-2, 2013.
13. D. G. Chakraborty, "Analysis of Unstructured Data: Applications of Text Analytics and Sentiment Mining", pp. 14.
14. M. Salehan and D. J. Kim, "Predicting the performance of online consumer reviews: A sentiment mining approach to big data analytics", Decis. Support Syst, vol. 81, pp. 30-40, 2016.
15. A. Kaushik, A. Kaushik and S. Naithani, "A Study on Sentiment Analysis: Methods and Tools".
16. N. Tsirakis, V. Poulopoulos, P. Tsantilas and I. Varlamis, "A platform for real-time opinion mining from social media and news streams".
17. L. Dey and M. Haque, "Opinion mining from noisy text data", presented at the Proceedings of the second workshop on Analytics for noisy unstructured text data, pp. 83-90, 2008.
18. M. Kang, J. Ahn and K. Lee, "Opinion mining using ensemble text hidden Markov models for text classification", Expert Syst. Appl, vol. 94, pp. 218-227, May 2018.
19. P. Oliveira, F. Rodrigues and P. R. Henriques, "A Formal Definition of Data Quality Problems", IQ, 2005.
20. M. Maier, A. Serebrenik and I. T. P. Vanderfeesten, Towards a Big Data Reference Architecture, 2013.
21. M. Chen, M. Song, J. Han and E. Haihong, "Survey on data quality", 2012 World Congress on Information and Communication Technologies (WICT), pp. 1009-1013, 2012.
22. F. Sidi, P. H. Shariat Panahy, L. S. Affendey, M. A. Jabar, H. Ibrahim and A. Mustapha, "Data quality: A survey of data quality dimensions", 2012 International Conference on Information Retrieval Knowledge Management (CAMP), pp. 300-304, 2012.
23. P. Glowalla, P. Balazy, D. Basten and A. Sunyaev, "Process-Driven Data Quality Management–An Application of the Combined Conceptual Life Cycle Model", 2014 47th Hawaii International Conference on System Sciences (HICSS), pp. 4700-4709, 2014.
24. C. Batini, C. Cappiello, C. Francalanci and A. Maurino, "Methodologies for data quality assessment and improvement", ACM Comput. Surv, vol. 41, no. 3, pp. 1-52, Jul. 2009.
25. D. Firmani, M. Mecella, M. Scannapieco and C. Batini, "On the Meaningfulness of ‘Big Data Quality’ (Invited Paper)", Data Science and Engineering, pp. 1-15, 2015.
26. A. McCallum, "Information extraction: Distilling structured data from unstructured text", Queue, vol. 3, no. 9, pp. 48-57, 2005.
27. B. Carlo, B. Daniele, C. Federico and G. Simone, "A Data Quality Methodology for Heterogeneous Data", Int. J. Database Manag. Syst, vol. 3, no. 1, pp. 60-79, Feb. 2011.
28. S. Malmasi, N. Hosomura, L.-S. Chang, C. J. Brown, S. Skentzos and A. Turchin, "Extracting Healthcare Quality Information from Unstructured Data", AMIA. Annu. Symp. Proc, vol. 2017, pp. 1243-1252, Apr. 2018.
29. C. Kiefer, "Assessing the Quality of Unstructured Data: An Initial Overview".
30. L. Cai and Y. Zhu, "The Challenges of Data Quality and Data Quality Assessment in the Big Data Era", Data Sci. J, vol. 14, May 2015.