Visualizing the Scripts of Data Wrangling With Somnus


Abstract:

Data workers use various scripting languages for data transformation, such as SAS, R, and Python. However, understanding intricate code pieces requires advanced programming skills, which keeps data workers from easily grasping the idea of a data transformation. Program visualization is beneficial for debugging and education and has the potential to illustrate transformations intuitively and interactively. In this article, we explore visualization design for demonstrating the semantics of code pieces in the context of data transformation. First, to depict individual data transformations, we structure a design space along two primary dimensions, i.e., the key parameters to encode and the possible visual channels to map them to. Then, we derive a collection of 23 glyphs that visualize the semantics of transformations. Next, we design a pipeline, named Somnus, that provides an overview of the creation and evolution of data tables using a provenance graph. At the same time, it allows detailed investigation of individual transformations. User feedback on Somnus is positive. Our study participants achieved higher accuracy in less time using Somnus, and preferred it over carefully crafted textual descriptions. Further, we provide two example applications to demonstrate the utility and versatility of Somnus.
Published in: IEEE Transactions on Visualization and Computer Graphics (Volume: 29, Issue: 6, 01 June 2023)
Page(s): 2950 - 2964
Date of Publication: 25 January 2022

PubMed ID: 35077364

1 Introduction

Scripting languages including SAS, R, and Python are widely used by data workers for data transformation. Data workers often need to understand the semantics of scripts in various scenarios. For example, validation (also called double-checking in some companies and laboratories) is important for data scientists: a data scientist may need to understand code written by others, then locate and correct possible mistakes. Understanding the semantics of an intricate script, however, requires advanced programming skills, and the process can be tedious and error-prone [48], [62], [71].
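To make the idea of a provenance graph over data tables concrete, the following is a minimal sketch in plain Python. It is not the Somnus implementation: the `ProvenanceGraph` class, its method names, the sample `sales` table, and the transformation labels are all hypothetical, chosen only to show how each transformation can be recorded as a labeled edge between a source table and a derived table. The sketch assumes every step has exactly one input table, so it does not cover joins or other multi-table transformations.

```python
from dataclasses import dataclass, field

Table = list[dict]  # a table modeled as a list of row dictionaries


@dataclass
class ProvenanceGraph:
    """Hypothetical structure: nodes are named tables, edges are transformations."""
    tables: dict[str, Table] = field(default_factory=dict)
    # Each edge is (source table, target table, human-readable label).
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def add_source(self, name: str, rows: Table) -> None:
        """Register an input table that no transformation produced."""
        self.tables[name] = rows

    def apply(self, source: str, target: str, label: str, fn) -> Table:
        """Run fn on the source table, store the result, and record the edge."""
        result = fn(self.tables[source])
        self.tables[target] = result
        self.edges.append((source, target, label))
        return result

    def lineage(self, name: str) -> list[str]:
        """Walk edges backwards and list the steps that produced a table."""
        steps = []
        current = name
        while True:
            incoming = [e for e in self.edges if e[1] == current]
            if not incoming:
                break
            src, _, label = incoming[0]  # single-input assumption
            steps.append(f"{src} --{label}--> {current}")
            current = src
        return list(reversed(steps))


graph = ProvenanceGraph()
graph.add_source("sales", [
    {"region": "east", "units": 3},
    {"region": "west", "units": 0},
    {"region": "east", "units": 5},
])
graph.apply("sales", "nonzero", "filter rows: units > 0",
            lambda rows: [r for r in rows if r["units"] > 0])
graph.apply("nonzero", "with_flag", "create column: bulk",
            lambda rows: [{**r, "bulk": r["units"] >= 5} for r in rows])

for step in graph.lineage("with_flag"):
    print(step)
```

Printing the lineage of `with_flag` lists the filter step followed by the column-creation step. This is the kind of overview a provenance graph provides, while the labels on individual edges stand in for the detailed per-transformation view.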

References

[1] 2015, [online] Available: https://stackoverflow.com/questions/30374143/recursive-error-in-dplyr-mutate.
[2] Z. Abedjan, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti and M. Stonebraker, "DataXFormer: A robust transformation discovery system", Proc. IEEE 32nd Int. Conf. Data Eng., pp. 1134-1145, 2016.
[3] E. E. Aftandilian, S. Kelley, C. Gramazio, N. Ricci, S. L. Su and S. Z. Guyer, "Heapviz: Interactive heap visualization for program understanding and debugging", Proc. 5th Int. Symp. Softw. Vis., pp. 53-62, 2010.
[4] L. Bartram, M. Correll and M. Tory, "Untidy data: The unreasonable effectiveness of tables", IEEE Trans. Vis. Comput. Graphics, vol. 28, pp. 686-696, 2022.
[5] P. Baudisch, N. Good, V. Bellotti and P. Schraedley, "Keeping things in context: A comparative evaluation of focus plus context screens overviews and zooming", Proc. SIGCHI Conf. Hum. Factors Comput. Syst., pp. 259-266, 2002.
[6] P. Baudisch, X. Xie, C. Wang and W.-Y. Ma, "Collapse-to-zoom: Viewing web pages on small screen devices by interactively removing irrelevant content", Proc. 17th Annu. ACM Symp. User Interface Softw. Technol., pp. 91-94, 2004.
[7] F. Beck, F. Hollerich, S. Diehl and D. Weiskopf, "Visual monitoring of numeric variables embedded in source code", Proc. 1st IEEE Work. Conf. Softw. Vis., pp. 1-4, 2013.
[8] F. Beck, O. Moseler, S. Diehl and G. D. Rey, "In situ understanding of performance bottlenecks through visually augmented code", Proc. 21st Int. Conf. Prog. Comprehension, pp. 63-72, 2013.
[9] "Power of irma", 2018, [online] Available: https://github.com/beecycles/Power_of_Irma.
[10] A. Bigelow, C. Nobre, M. Meyer and A. Lex, "Origraph: Interactive network wrangling", Proc. IEEE Conf. Vis. Anal. Sci. Technol., pp. 81-92, 2019.
[11] C. Bors, T. Gschwandtner and S. Miksch, "Capturing and visualizing provenance from data wrangling", IEEE Comput. Graph. Appl., vol. 39, no. 6, pp. 61-75, Nov./Dec. 2019.
[12] M. Bostock, "Visualizing algorithms", 2014, [online] Available: https://bost.ocks.org/mike/algorithms/.
[13] M. Bostock, V. Ogievetsky and J. Heer, "D³ data-driven documents", IEEE Trans. Vis. Comput. Graphics, vol. 17, no. 12, pp. 2301-2309, Dec. 2011.
[14] B. Burg, R. Bailey, A. J. Ko and M. D. Ernst, "Interactive record/replay for web application debugging", Proc. 26th Annu. ACM Symp. User Interface Softw. Technol., pp. 473-484, 2013.
[15] J. Cheon, D. Kang and G. Woo, "VizMe: An annotation-based program visualization system generating a compact visualization", Proc. Int. Conf. Data Eng., pp. 433-441, 2019.
[16] N. Chotisarn et al., "A systematic literature review of modern software visualization", J. Vis., vol. 23, no. 4, pp. 539-558, 2020.
[17] "Online interactive code to flowchart converter", [online] Available: https://app.code2flow.com/.
[18] C. Demetrescu, I. Finocchi and J. T. Stasko, "Specifying algorithm visualizations: Interesting events or state mapping?", Proc. Softw. Vis., pp. 16-30, 2002.
[19] Z. Deng et al., "Compass: Towards better causal analysis of urban time series", IEEE Trans. Vis. Comput. Graphics, vol. 28, no. 1, pp. 1051-1061, Jan. 2022.
[20] B. S. D. Desk, "Baltimore police overtime in fiscal years 2018 and 2019", 2020, [online] Available: https://github.com/baltimore-sun-data/baltimore-police-overtime.
[21] "R package: Dplyr v0.7.8", [online] Available: https://www.rdocumentation.org/packages/dplyr/versions/0.7.8.
[22] I. Drosos, T. Barik, P. J. Guo, R. DeLine and S. Gulwani, "Wrex: A unified programming-by-example interaction for synthesizing readable code for data scientists", Proc. CHI Conf. Hum. Factors Comput. Syst., pp. 1-12, 2020.
[23] "Eclipse layout kernel (ELK)", [online] Available: https://www.eclipse.org/elk/.
[24] R. Faust, K. Isaacs, W. Z. Bernstein, M. Sharp and C. Scheidegger, "Anteater: Interactive visualization for program understanding", 2020.
[25] Y. Feng, R. Martins, J. Van Geffen, I. Dillig and S. Chaudhuri, "Component-based synthesis of table consolidation and transformation tasks from examples", ACM SIGPLAN Notices, vol. 52, no. 6, pp. 422-436, 2017.
[26] B. Figures, "Apple iphone unit sales and revenue", 2021, [online] Available: https://barefigur.es/companies/apple/iphone/.
[27] D. Fisher, "Animation for visualization: Opportunities and drawbacks", Beautiful Visualization, 2010.
[28] S. Grissom, M. F. McNally and T. Naps, "Algorithm visualization in CS education: Comparing levels of student engagement", Proc. ACM Symp. Softw. Vis., pp. 87-94, 2003.
[29] P. J. Guo, "Online python tutor: Embeddable web-based program visualization for CS education", Proc. 44th ACM Tech. Symp. Comput. Sci. Educ., pp. 579-584, 2013.
[30] P. J. Guo, S. Kandel, J. M. Hellerstein and J. Heer, "Proactive wrangling: Mixed-initiative end-user programming of data transformation scripts", Proc. 24th Annu. ACM Symp. User Interface Softw. Technol., pp. 65-74, 2011.
