Introduction
While the emergence of computer-assisted learning has led to the development of new methods in mathematics education, such as Computer Algebra Systems (CASs) and web-based learning [1], [2], [3], general mathematics assessments in high schools still mostly call for solving essay or short-answer problems with pencil and paper [4]. Such assessments reflect not only students' mathematics knowledge and logical thinking ability, but also the correctness of their calculations. That is, when a high school student who has enough knowledge and skill to solve a quadratic equation makes a simple addition mistake at a given step of problem-solving, the student may lose full or partial credit for the problem, as calculation correctness is one of the assessment factors in mathematics learning. As students progress beyond high school, such mistakes tend to be treated as technical errors and dismissed as trivial or minor issues, because assessment objectives at those levels focus on concept attainment and the reasoning ability needed to solve complex problems.
Is this treatment of technical errors as a minor issue in mathematics assessment reasonable? Research [4], [5] has indicated that incorrect answers caused by technical errors, such as arithmetic errors or misreadings of problems, make up around 13 to 18 percent of all errors on middle and high school mathematics tests. More specifically, the statistics [4] on a high school equations test for 14-year-olds indicate that 13.3 percent of incorrect answers come from technical errors, while others stem from misunderstanding the problem (7.9 percent), improperly constructing the equation (36.9 percent), or reasoning incorrectly (32.8 percent). The same study found that only 9.1 percent of errors were due to a lack of knowledge; that is, errors caused by a lack of knowledge are even rarer than technical errors. This implies that test scores may indicate equation problem-solving ability that is lower than the students' actual ability. Moreover, such technical errors still have no proper remedy, so students who often make technical errors rarely improve from assessment to assessment. As the present study was conducted in the Republic of Korea (South Korea), where the use of a handheld calculator is prohibited nationwide during mathematics assessments until the undergraduate level, technical errors have been a particular hurdle for students. Most other East Asian countries, such as Japan, Singapore, and China, also do not allow the use of a calculator in high school math classes, so students in these countries may experience similar issues.
This work aims to identify feasible causes of technical errors (or mistakes) and proposes a scheme to reduce them by having students view the differences between their own work containing mistakes and clean, correct examples of work by their peers. The study is based on the assumption that technical errors emerge from students' problem-solving behaviors rather than from gaps in their knowledge or logical thinking. Few studies have sought to identify which problem-solving behaviors promote mistakes. However, based on our interviews, schoolteachers empirically acknowledge differences in problem-solving behaviors between high-scoring students and others. We interviewed seven mathematics teachers, each with more than 10 years of teaching experience, in order to find relationships between students' mathematics problem-solving behaviors and their scores. The list below summarizes four conceptions shared by the teachers on the basis of the interviews; these informed how we designed our system and experiments.
High-scoring students write less than average students do, as they try to develop problem-solving steps without writing. This means they do not need to engage in as many trial-and-error exercises with pencil and paper, and they provide cleaner results.
High-scoring students write sequentially from top to bottom, while other students tend to write equations in no particular order. This allows the former group of students to trace their problem-solving results more easily in order to reevaluate their solutions.
High-scoring students only reluctantly give up trying to answer problems. Overall, they try once or twice more than average students.
High-scoring students generally write characters cleanly and in a well-controlled size; thus, they rarely misread their own writing. Misread examples are easily found at the high school level, such as “b” being inadvertently changed to “6,” “B” becoming “13,” or “i” becoming “1,” for instance. Students may already understand the utility of clear writing on some level, but there is no obvious or effective training method for fostering this understanding or ensuring clear writing in mathematics education at present.
The schoolteachers' interviews support the value of investigating problem-solving behaviors in mathematics education, as such behaviors can be dominant factors in reducing the occurrence of students' problem-solving errors. To investigate problem-solving behaviors, we captured students' handwritten problem-solving processes using a tablet computer and converted the characteristics of their behavior into measurable, comparable values for contrast with students producing exemplary answers. This paper will show that students trained under the proposed scheme can reduce the number of technical errors they make and improve their problem-solving skills, attaining better achievement through a training process that is measurable by standardized assessments.
The rest of this paper is organized as follows. In the next section, we provide a short review of related work on problem-solving errors in mathematics and on educational uses of tablet computers. In Section 3, we describe the design and implementation of the system used in the current research and discuss the technological and pedagogical approach adopted. Section 4 presents the findings of our pilot study. Section 5 describes the main experiments conducted using the training system. In Section 6, we discuss the findings and the limitations of this study. Finally, Section 7 provides concluding remarks.
Related Work
This work combines two fields of research on education: research on mathematics problem-solving errors and research on the use of tablet computers in education. Research on mathematics problem-solving errors has a long history, with a focus on categorizing errors and proposing remedies in order to improve mathematics knowledge and achievement. In terms of error classification, Movshovitz-Hadar and Zaslavsky [5] classified errors into six categories (misused data, misinterpreted language, logically invalid inference, distorted theorem or definition, unverified solution, and technical error), and Orton [10] into three (structural error, arbitrary error, and executive error, with subcategories). Borasi [6] suggests remedies for three error cases (performing a math task, technical content, and the nature of mathematics) in order to use errors to help students understand mathematics more deeply, and many other studies provide similar remedies for math problem-solving errors. However, little research has been conducted on reducing mistakes in problem-solving, which are treated as trivial or minor issues in the field of problem-solving error and error-remedy research. This is because mistakes, unlike errors, occur unpredictably, making it hard to find commonalities among them, and thus avoiding mistakes is not a main component of assessment. Nevertheless, problem-solving mistakes deserve attention because mathematics assessment is generally conducted on the basis of comprehensive scores on particular skills at a given testing level, and research on learning effectiveness is normally performed by comparing handwritten pre- and post-test problem-solving results. Errors and mistakes can be distinguished: Borasi [7] uses the term "errors" to stand for inevitable consequences of any limitations in the mathematical systems one is employing, while "mistakes" occur in a specific procedure conducted by a specific person. According to another paper of Borasi's [8], errors can be exploited as reusable knowledge resources in order to teach and to give feedback to learners, whereas mistakes provide tedious and unusable outcomes. Examples of mistakes are calculation errors, forgetting to write a minus sign, or inadvertently writing the wrong numbers; Movshovitz-Hadar and Zaslavsky [5] classify such errors as technical errors and position them at the lowest priority level of assessment. In this research, we consider technical errors to be a subtype of mistakes in order to discriminate cases of mistake from cases of (non-technical) error, following the understanding of these concepts in recent work, and we focus our attention on mistakes only.
In order to record and analyze students' handwritten problem-solving activity, we utilize a tablet computer equipped with a stylus (digital pen). Deploying a tablet computer for educational purposes is not a new approach [9]; in the specific context of mathematics education, Siozos et al. [10] developed a mathematics assessment system using a tablet computer, and Anthony et al. [11] combined a tablet computer with an intelligent tutoring system in order to identify students' mathematics problem-solving errors. The proposed system takes up the same challenge, but focuses on a behavioral factor analysis of students' writing and navigation activity. Like those systems, the present study's approach uses a keyboardless design and stylus input; however, because mathematics handwriting recognition is less accurate than general language recognition [12], [13], we decided not to adopt handwriting recognition. One common alternative is the use of video-recording methods [14], but these require manual video tagging, which leads to imprecise measurement of behaviors. To address this problem, the proposed system utilizes a tablet computer that allows users to write with an electromagnetic resonance stylus on the computer screen, approximating the usual pencil-on-paper test experience while recording the student's problem-solving activity (as represented by stylus movements) without requiring any additional learning on the student's part to adapt to a novel experimental environment.
Attempts to classify steps in mathematics problem-solving rest on the principle that each step in solving a math problem has its own rationale and that together these steps take the problem-solver incrementally toward the goal. Thus, it is possible to identify whether a right or wrong path was followed toward the answer to a given problem. This approach is commonly followed in computer-assisted learning research. Conceptually, if a given problem is closed-ended in form (has a definite answer) and each path toward the answer is defined, a system can analyze students' knowledge acquisition, mistakes, and weaknesses precisely. For example, Brinkmann's work [15] displays graphs of steps for solving mathematics problems and accompanying theorems based on students' efforts. The drawback of this approach is that the user's input has to be in a computer-readable format, meaning the user has to type the equations on a keyboard. Another approach to classifying steps in mathematics problem-solving is the cognitive approach, which supposes that mathematics problem-solving is a streamlined process of logical thinking rather than a matter of knowledge acquisition and application [16]. Such cognitive steps have been investigated very little, and this line of work mainly proceeds by observation and discussion (in order to more reliably demonstrate the existence of these steps or to identify others). The approach of the present work is in this vein: we observe students' problem-solving behaviors and define formulas for measuring them in order to compare problem-solving behaviors between students and thus evaluate our proposed scheme.
Design
3.1 Concepts
This study is motivated by the aforementioned statistic [4] indicating that errors due to a lack of knowledge are no more common than technical errors. We assume that the knowledge of average students is not very different from that of good performers, or high-scoring students, in terms of the learned information itself. Moreover, we assume that technical errors can be reduced by training students in appropriate problem-solving behaviors. These assumptions are based on the results of our pretest, presented below.
Before designing the experiments, we needed to clarify the scope of the relevant concepts and related considerations. First, in the problem-solving domain, various actions can be encompassed by the term "behavior," ranging from writing equations or drawing graphs to scratching one's head or shaking one's legs. In our work, however, the term "behavior" denotes only problem-solving activities that involve pencil on paper, such as writing, erasing, or navigating from one problem to another; we decided not to consider extraneous motions such as scratching one's head, shaking one's legs, or looking at the clock. Second, we assumed that students would solve problems only with a pencil and paper and not with a calculator, as our work is primarily targeted for application in the East Asia region, where mathematics education at the K-12 level usually does not involve the use of calculators, as mentioned above [17].
Fig. 1 illustrates, by way of example, two students' problem-solving results. (The problem in Fig. 1 can be translated as follows: "The value of the quadratic equation …".)
An example of two students' problem-solving processes. The circled numbers are marked in order to explicate the respective writing sequences.
The proposed scheme discriminates between high-scoring students, who generally commit few mistakes and are placed in a “high-scorer” group, and other students, grouped as “average students.” We gathered information on the problem-solving processes of these two groups and distinguished the behavioral characteristics of the groups. Later, we trained the average group of students in the behaviors of the high-scoring group. In this work, high-scoring students have mathematics achievement scores above the 96th percentile in the national secondary school math achievement test of the Republic of Korea, while average students have achievement scores from the 77th percentile to the 95th percentile on the same test; these divisions are based on the national grading system, which calls them “high-level” and “mid-level,” respectively. Generally, the 77th percentile is considered to represent students who have enough knowledge to solve the assessment problems, and the 96th percentile, those who fulfill the required learning objectives.
The experiment was organized as follows. First, two groups of students performed a pretest, and their errors were classified into two domains: mistakes and non-mistakes. We considered mistakes to be technical errors, while the non-mistakes included, for example, invalid inferences, distorted theorems or definitions, or unverified solutions. Based on the results of this pretest and on the interviews with teachers, we established behavior measures and values in order to compare students' problem-solving behaviors. On the basis of the established problem-solving behavior values, we organized students into three groups with close to equal average scores: a control group, who performed the test only; an experiment group, who performed the test and were allowed to use the problem-solving playback function; and a second experiment group, who had the same experience as the first experiment group but were also allowed to view their problem-solving behavior values. Finally, we statistically analyzed differences among the groups. In the next subsection, we will describe the implementation of our system in further detail.
Students in the two experiment groups could see the recorded solutions presented side by side on a tablet screen; each student could view their own solution and anonymously navigate peers’ solutions at the same time. The problem-solving playback function could be used to replay the recorded solutions from the first stroke of writing to the last, including any erasing.
Proper training methods are needed to help mistake-prone students. However, as technical errors are hard to trace to their causes and to reproduce, there is no obvious method to guide or teach students not to make the same mistake again. For this reason, we present average students’ solutions side by side with their peers’ good problem-solving examples to help students identify the advantageous behaviors themselves. According to Hsiao et al. [18], this method proves effective when learning requires self-diagnosis rather than mere compliance with instructions and when participation is highly encouraged.
3.2 System Implementation
In order to record and analyze students’ writings, we designed a system resembling a pencil-on-paper test to record students’ problem-solving activities, such as writing, erasing, and flipping across problems, with precise time data. Four requirements were adopted for the design of the system, as follows:
The tablet computer that we adopt should provide a realistic pencil-on-paper test experience to learners.
The test software has to be intuitive, without the need for an additional adaptation period.
The training software must be able to play recorded behaviors back and forward so as to allow side-by-side comparison with the peer data.
The training software has to be used immediately after the test.
In this section, we will describe how these requirements are realized.
Unlike a web-based learning system, the proposed scheme relies on dedicated hardware, in this case a tablet computer, as this hardware accepts handwriting input and can approximate a traditional pencil-on-paper test experience [16]. The stylus used was equipped with an "eraser" at its end, allowing students to erase strokes by rubbing the end of the stylus over them, as though using a common pencil with an eraser. Otherwise, the system would have required a "pencil/eraser toggle" button onscreen, which seemed less convenient. Computing power and screen size were also critical factors, as writing had to be captured in real time and a sufficient writing area had to be provided. We chose a 12-inch tablet computer running the Windows 7 operating system, which provides both quick response for handwriting capture and a sufficient screen size.
Two software packages were implemented: the test program and software known as the Visualizer. The test program collects students’ handwriting data in real time, whereas the Visualizer software displays the test results and allows students to review them.
Fig. 2 shows the test software, which consists of three major parts: a header panel, the content area, and a navigation panel at the bottom. At the top of the screen, the header panel shows the title in the center, the elapsed time at left, and two buttons at right: the "Erase Screen" button erases all strokes in the content area (this clearing action does not affect the data for other problems), and the "Finish Exam" button closes the test. The content area displays the problems and supports handwriting input. We designed the test such that only one problem is displayed in the content area at a time, as this helps us trace which question a student is focusing on while also giving the student more space onscreen to solve it. The bottom-right corner of the content area contains five circles with which students answer a five-point difficulty survey. At the bottom of the screen is the navigation bar, which allows users to go back and forth through the questions. All activities are stored in the computer memory sequentially and then saved to local and network repositories after each test session ends. Three types of data are recorded for each stroke: (1) handwriting information, including the width, color, shape, pressure, and position of strokes; (2) timestamps for writing-on and writing-off events; and (3) timestamps for erased strokes.
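To make the recorded stroke data concrete, the following minimal sketch shows one way such records could be represented; the type and field names (StrokeRecord, pen_down_ms, and so on) are illustrative assumptions for exposition, not the structures used in our implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class StrokeRecord:
    """Illustrative record for one stylus stroke (field names are hypothetical)."""
    question_id: int                   # problem the stroke belongs to
    points: List[Tuple[float, float]]  # sampled (x, y) positions of the stroke
    pressure: List[float]              # pen pressure per sampled point
    width: float                       # stroke width
    color: str                         # stroke color
    pen_down_ms: int                   # timestamp of the writing-on event
    pen_up_ms: int                     # timestamp of the writing-off event
    erased_ms: Optional[int] = None    # timestamp when erased, if ever

@dataclass
class TestSession:
    """All strokes and navigation events captured during one test."""
    strokes: List[StrokeRecord] = field(default_factory=list)
    # (timestamp_ms, question_id) pairs recorded whenever the viewed page changes
    navigation_events: List[Tuple[int, int]] = field(default_factory=list)
```

A structure of this kind would be sufficient to derive the three behavioral measures introduced in Section 4.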
After the students took the test and it was graded by the teachers, students could execute the Visualizer in order to review their problem-solving behaviors. As all questions required short answers representing each problem-solving step, scoring was performed manually for each test. Manual scoring took approximately one minute per question.
The Visualizer has the following features for students to use freely while replaying and comparing their data.
Students can check the correctness of the given questions.
Students can play back the writing and erasing activities for a selected problem.
Students can see the navigation sequence and complete timespan of the test.
Students can replay peers’ problem-solving behaviors on the right side of the panel.
Fig. 3 shows a screenshot of the Visualizer software. It consists of three panels arranged side by side: the leftmost panel shows a list of tests on top and a question list pertaining to the selected test on the bottom, while the center panel shows the student's problem-solving content area. The software displays the handwriting sequence as a solid trace line (colored yellow here), as shown in Fig. 3. The rightmost panel includes multiple child panels, each of which serves as a function panel providing the correct answers, a timeline graph of navigation, or peers' problem-solving records. The list of peers' records is given under the student's currently selected problem and displays peers' solutions anonymously. A selected peer's problem-solving record can be replayed using the "play" button. With this software, a student can browse and replay his/her problem-solving behaviors and those of his/her peers, simultaneously and side by side.
Findings from the Pilot Study
To meaningfully investigate data regarding the proposed system and to advance the construction of a concrete training program, we conducted a pilot study based on a pretest completed by 45 high school students (15 high-scorers and 30 average students). Ten questions on polynomial equations were administered in a 30-minute test. The students solved the questions with the software described in Section 3, running on a tablet computer; the software recorded the students’ problem-solving activities. Many features can be examined and extracted from these data, but in the present work, we selected three measurable behaviors: Navigation Count (NC), Erasing Ratio (ER), and Vertical Movement (VM). Based on the schoolteacher interviews described in Section 3, we considered these to be likely key factors in the occurrence of mistakes during problem-solving tasks. In this section, we introduce our approach to the measurement of these factors and the findings of our early (pilot) experiments.
4.1 Navigation Count
Our teacher interviewees indicated that high-scoring students solve problems in sequential order. To reflect this, we developed the Navigation Count value, which is based on the number of page-flipping actions from one problem to another. As the proposed system displays only one problem at a time, a page corresponds to viewing a problem. The NC value increases by one whenever a student navigates from one page to another and is then divided by the total number of problems. The system allows students to go back and forth freely within the limited test time; hence, NC reflects how frequently a student flips between problems. Students may change their focus question for many reasons; for example, they may quickly scan questions at the start of the test in order to get a sense of the knowledge they may need; finish solving the current question and navigate to the next one; give up on the current question and try to solve another instead; or review or retry an already solved or abandoned question before finishing the test. Overall, students with higher grades had fewer unknown or unsolved questions; therefore, we can expect them to have lower NC values than average students.
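As a rough sketch of this measurement, assuming navigation events are recorded as (timestamp, question) pairs like the hypothetical structure above, NC can be computed by counting changes of the viewed question and normalizing by the number of problems:

```python
def navigation_count(navigation_events, num_questions):
    """Navigation Count: number of page flips divided by the number of problems.

    navigation_events: chronologically ordered (timestamp_ms, question_id) pairs.
    A flip is counted whenever the viewed question changes.
    """
    flips = 0
    previous_question = None
    for _, question_id in navigation_events:
        if previous_question is not None and question_id != previous_question:
            flips += 1
        previous_question = question_id
    return flips / num_questions

# Example: on a ten-question test, a student who flips pages 30 times has NC = 3.0.
```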
Fig. 4 shows an example of a total time consumption graph, which represents a student's navigation during the test. The numbers at the left side of each row are question numbers (ten questions in this example), and the squares in each row show the timespans (labeled in minute:second:millisecond format) during which the student viewed that question. For example, the student at the top read the first question for 2 minutes, moved to the second question for 4 minutes, and then went back to the first question; this student also navigated across the questions swiftly at the end of the test. As shown in Fig. 4, the student at the bottom navigated across the questions to a lesser extent than the student at the top.
Example of a total time consumption graph (top: average student, bottom: high-scoring student).
Obvious differences between the two groups were found: the NC of high-scoring students (mean 2.508, SD 1.829) was 44 percent lower than that of average students (mean 4.559, SD 3.271). We found that high-scoring students did not navigate as much as average students, even when they faced difficult questions. From a problem-solving point of view, a low NC is correlated with longer problem-solving time for a single question, which may indicate longer reasoning time. The p-value (two-tailed, 95 percent confidence) of an independent-samples t-test between the two groups is 0.03 when equal variance is assumed and 0.01 when it is not; this indicates that NC is a valid feature upon which to discriminate these groups.
4.2 Erasing Ratio
Another finding pertained to ER, the ratio of erasing behavior to overall writing while solving a question. Students used the eraser when undertaking trial-and-error processes or when correcting particular characters or symbols. We consider that erasing more than one line of an equation reflects trial-and-error, that is, a backward step from the current position to a previous one in order to retry an earlier step. We found that students erased characters within a single line very frequently during the pretest, likely for many reasons (writing the wrong character, performing a short calculation, or small-scale trial-and-error), so we decided to exclude such minor within-line erasing. Based on our observations and on the teacher interviews, average students tend to use erasers more than high-scoring students, who write problem-solving steps and expressions only when they can anticipate the next step toward the goal; average students tend to simplify the current expression first and then work out the next step, leading them to erase more. Thus, the occurrence of erasing should decrease as students become better able to anticipate the next step toward their goal, and the expected result of training is a decrease in both the mean and the standard deviation of ER.
Using Formula (1), we measured how frequently erasing actions occurred within a student's entire body of writing produced during the test:
\begin{equation}
ER = \frac{{\sum\limits_{q = 1}^N {\frac{{n(removed\_stroke_q)}}{{n(written\_stroke_q)}}}}}{N}.
\end{equation}
The above formula gives the erasing ratio: for each question, the number of erased strokes is divided by the number of written strokes, and these per-question ratios are summed and divided by the total number of questions N. With this formula, we obtain an average ratio of erasing behavior to writing (0.0 = no erasing; 1.0 = the writing was completely erased).
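A minimal sketch of this computation is given below, assuming per-question counts of written and erased strokes are available; the function and argument names are illustrative, and the handling of a question with no written strokes is our own assumption.

```python
def erasing_ratio(written_strokes_per_q, removed_strokes_per_q):
    """Erasing Ratio (Formula (1)): the per-question ratio of erased strokes
    to written strokes, averaged over all questions."""
    assert len(written_strokes_per_q) == len(removed_strokes_per_q)
    n_questions = len(written_strokes_per_q)
    per_question = [
        removed / written if written > 0 else 0.0  # assumed convention for empty pages
        for written, removed in zip(written_strokes_per_q, removed_strokes_per_q)
    ]
    return sum(per_question) / n_questions

# Example: erasing 2 of 10 strokes on every one of 5 questions gives ER = 0.2.
```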
In the pretest, the ER of high-scoring students (mean 0.055, SD 0.025) was 41 percent lower than that of average students (mean 0.093, SD 0.067). Moreover, the variance in the average group was 37 percent larger than that in the high-scoring group. The p-value (two-tailed) was <.039 when equal variance was assumed and <.009 when it was not; therefore, at the 95 percent confidence level, we can say that ER differed between the two groups.
4.3 Flow Direction during a Question
It is important for students to be able to read their own output while engaging in problem solving, in order to review or reevaluate their answers. Especially for equation questions, which have clear steps that must be traversed to reach the answer, unordered writing is more difficult to trace, potentially leading to mistakes. Thus, the third factor, termed here Vertical Movement, denotes the direction in which writing moves from one expression to the next on the page. It is calculated by counting the vertical position changes of the written algebraic expressions, which are grouped based on the position and sequence of the written strokes. If a student writes an expression below the previously written one, the count is increased by one; otherwise, it is decreased by one. The count is then divided by the total number of position changes. The result depicts how orderly a student's equation writing is: a value of 1.0 indicates writing that flows from top to bottom, orderly and sequentially, while -1.0 would stand for writing from bottom to top (writing flow in the reverse direction).
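The sketch below illustrates this measure under simplifying assumptions: the written expressions have already been grouped, and their vertical positions are supplied in writing order. The function name and the treatment of expressions at the same height are hypothetical.

```python
def vertical_movement(expression_y_positions):
    """Vertical Movement: +1 for each expression written below the previous one,
    -1 for each written above it, divided by the number of position changes.

    expression_y_positions: vertical positions (screen y, increasing downward)
    of the grouped expressions, in the order they were written.
    """
    changes = 0
    score = 0
    for prev_y, next_y in zip(expression_y_positions, expression_y_positions[1:]):
        if next_y == prev_y:
            continue            # same height: not counted as a vertical change
        changes += 1
        score += 1 if next_y > prev_y else -1
    return score / changes if changes else 0.0

# Example: strictly top-to-bottom writing yields 1.0; strictly bottom-to-top yields -1.0.
```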
The high-scoring group shows a higher VM than the average students, indicating more ordered writing from top to bottom, as we expected. The mean VM of the high-scoring group is 0.664 with standard deviation 0.290, while that of the average group is 0.479 with standard deviation 0.372. An independent-samples test gave a p-value of 0.011 when equal variance was assumed and 0.003 otherwise. Based on these statistics, VM was the most distinctive difference between the two groups on the pretest.
For reference, NC has no fixed upper bound, while ER ranges from 0 to 1 and VM from -1 to 1. The value of NC is zero, at minimum, if a student never navigates between pages during the test; its maximum is bounded only by the number of navigation actions that can physically be performed within the test period. In practice, we observed that students navigate between two and five times the number of questions, on average.
Table 1 shows the summarized pretest values for the two groups under the proposed scheme, for Navigation Count, Erasing Ratio, and Vertical Movement; as can be seen, the proposed values meaningfully discriminate between the two groups.
To investigate the correlation between the mistake ratio and the proposed features, two schoolteachers checked all errors and classified each as a mistake or a non-mistake. We then calculated Pearson correlation coefficients between the mistake ratio and each of the proposed features, for each group.
Table 2 describes the correlation between the mistake ratio and each proposed feature across three categories: all students, the high-scoring group, and the average group. The mean test score of the high-scoring group was 81.333 out of 100 points, and that of the average group was 51.667. The mistake ratio of the high-scoring group was 6 out of 28 errors (21.429 percent, N = 15), and that of the average group was 48 out of 145 errors (33.103 percent, N = 30). Overall, mistake occurrence has a positive correlation with NC and ER and a negative correlation with VM; that is, mistakes occur more often the more students navigate across pages, the more they erase, and the less orderly they write. Separating these results into the high-scoring group and the average group, we found that the high-scoring group showed a correlation between mistakes and VM only, while the average group showed meaningful correlations with both ER and VM. However, the high-scoring group made few mistakes and exhibited similar problem-solving behaviors regardless of correctness at normal question difficulty. The statistical interpretation of the pilot study therefore remains tentative, as the number of samples was not sufficient to generalize; despite this, there were meaningful findings supporting further experiments.
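A minimal sketch of this correlation analysis is shown below, assuming each student's feature values and mistake ratio are collected in a simple record; the dictionary keys, function name, and sample numbers are illustrative, not the study data.

```python
from scipy.stats import pearsonr

def feature_mistake_correlations(students):
    """Pearson correlation between each feature and the per-student mistake ratio."""
    mistake = [s["mistake_ratio"] for s in students]
    results = {}
    for feature in ("NC", "ER", "VM"):
        values = [s[feature] for s in students]
        r, p = pearsonr(values, mistake)
        results[feature] = (r, p)
    return results

# Hypothetical usage with made-up per-student values:
sample = [
    {"NC": 2.1, "ER": 0.04, "VM": 0.8, "mistake_ratio": 0.10},
    {"NC": 4.5, "ER": 0.09, "VM": 0.5, "mistake_ratio": 0.35},
    {"NC": 3.2, "ER": 0.07, "VM": 0.6, "mistake_ratio": 0.25},
    {"NC": 5.0, "ER": 0.12, "VM": 0.3, "mistake_ratio": 0.40},
]
print(feature_mistake_correlations(sample))
```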
In summary, the high-scoring group engaged in less page browsing across problems on a given test, which gave them more continuous time in which to solve each problem; they also engaged in less erasing after writing and wrote their equations in a more top-down, orderly manner. In addition, mistake occurrence correlates positively with NC and ER and negatively with VM. These three behavioral characteristics statistically support the interviewed teachers' opinions on students' problem-solving behaviors. The teachers' shared conceptions are not directly coupled with the three proposed features, but each feature partly relates to them: ER reflects the number of instances of trial-and-error on the paper, VM reflects sequential writing order, and NC reflects whether students navigate to other questions only after completing the current one. With regard to character size and misreading, we did examine uniformity of written character size but dropped it as a measurable factor for now, as fractions, superscripts (square notation), and complex expressions such as the quadratic formula lead character sizes to vary. Although we do not investigate character size in this work, we did observe that unordered writing (that is, writing with a low VM value) tends to have irregular character size. Given these pilot findings, we designed the three-session training program presented in the next section, using the three features NC, ER, and VM.
Experiments
5.1 Design
In order to collect sufficient data to investigate the students' problem-solving behaviors, we first considered the target curriculum and the characteristics of our participants. The questions took an average of three minutes to solve and required a minimum of five lines of development. Each test had 10 questions and was performed under a 30-minute time limit; students who finished early were not allowed to leave in the middle of the test. In practice, all students (except one or two) finished (released the pen) before the end of the time provided, in all sessions. The subject matter covered was mainly algebra, including quadratic equations, polynomials, rational/irrational equations, and inequalities. We selected first-year Korean high school students (age 16) who were studying these topics as part of the Korean standard education curriculum. The participants were 45 high school students (15 high-scoring and 30 average), divided into three groups with close to equal average scores (five high-scoring students and 10 average students in each group) to make a one-way analysis of variance (ANOVA) possible for assessing the effectiveness of the software and the proposed values. Group A was set as a control group, performing the test and assessment only; Group B as a treatment group using the Visualizer after the test but without information on the measured values; and Group C as a treatment group provided with the Visualizer, including NC, ER, and VM values in the bottom-right corner of the screen. The intention in displaying these values was to give students information upon which to base adjustments of their problem-solving behaviors. If a student's NC was higher than his/her peers', this indicated that he/she navigated from page to page many times and conveyed that this behavior should be lessened; likewise, if the student's ER was higher than his/her peers', it was recommended that the student lessen his/her erasing behavior, and so on.
Three tests were performed, one per week; we considered the first and third tests the pretest and the posttest, respectively. Students in all groups performed the test using the test software and then reviewed their results after five minutes (the time it took for the teacher to grade the results manually); only the treatment groups used the Visualizer to explore their results after grading. In the treatment groups, instructors guided students in manipulating the Visualizer only; no teaching of mathematical content was performed during the entire experimental period. The review time with the Visualizer was set at 30 minutes based on the pilot study observations, as this seemed to be enough to discuss the questions and their results. In short, in this experimental design, the control and treatment groups had the same test experience; however, visualizing, replaying, and comparing problem-solving behavior were done only by the treatment groups.
5.2 Results
The treatment group students (Groups B and C) were given information about the training objectives and how to use the software and, after the pretest, were given certain additional information about the experiments. Owing to the side-by-side visualizing and replaying capabilities of the software, the treatment group students were able to identify differences between processes and distinguish good problem-solving examples. In particular, we gave additional information to Group C about the meanings of the NC, ER, and VM values and the objectives behind measuring them. At the end of each experiment session, most students in the treatment groups spent the 30 minutes provided discussing differences in problem-solving behaviors with their peers. In contrast, control group students did not actively discuss their results: as scores (the number of correct answers) were the only results provided after the test, they focused on which questions they had answered correctly and on comparing scores with their peers, and some students in Group A did nothing after about 10 minutes of reviewing. Treatment group students, by contrast, actively discussed high-scoring students' results and used playback many times in the provided 30 minutes. We found that many Group B students, who used the Visualizer without the three proposed values displayed, could recognize good problem-solving behaviors along with good answers, although we did not explain anything about problem-solving behaviors. During the training period, we also found that students in the treatment groups concentrated on problem solving in an effort to intentionally emulate high-scoring students' behavior, since they had access to their peers' problem-solving behaviors, such as writing and erasing strokes.
Table 3 gives descriptive statistics for the three test results; the first test was the pretest, and the training was performed twice. Group A, who were not provided any feedback or training during the experiments, showed results similar to those we expected, encountering similar difficulties and problem areas during all sessions, with no experimental change. The average score in Group A nonetheless improved by 4.267 points from Session 1 to Session 3, and a paired-samples t-test shows the difference to be significant, t(14) = 4.267, p < .003 (95 percent confidence). The likely reason for this improvement is that the students studied on their own after each assessment and became familiar with the problem areas and patterns.
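A minimal sketch of this pre/post comparison is shown below, using SciPy's paired-samples t-test; the function name and the score lists are illustrative, not the study data.

```python
from scipy.stats import ttest_rel

def pre_post_change(pre_scores, post_scores):
    """Mean gain and paired-samples t-test between pretest and posttest scores."""
    gains = [post - pre for pre, post in zip(pre_scores, post_scores)]
    mean_gain = sum(gains) / len(gains)
    t_stat, p_value = ttest_rel(post_scores, pre_scores)  # two-tailed by default
    return mean_gain, t_stat, p_value

# Hypothetical usage with made-up scores for one group:
pre = [48, 52, 55, 60, 47, 51]
post = [53, 55, 61, 63, 50, 57]
print(pre_post_change(pre, post))
```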
Group B improved by 11.067 points on average during the experiment, with a paired-samples t-test result of t(14) = 3.674, p < .003. This difference between pretest and posttest is notable, 2.59 times larger than that in Group A, and statistically significant. In other words, Group B improved more than Group A, though less than Group C. Based on the Group B result, the use of the Visualizer does appear to have helped improve students' problem-solving skills. During the experiments, students in Group B used the Visualizer to navigate across questions and gather additional knowledge from peers' problem-solving processes. However, as the time period set for the experiment was limited, it is possible that there was not enough time for students to realize the full benefits of the treatment and that a larger change would have been seen given a longer timeframe.
Group C improved by 17.267 points on average, t(14) = 5.022, p < .001; this improvement is significant and larger than that of Group A or Group B. This shows that the combination of the Visualizer and the feature values (NC, ER, and VM) yields the greatest improvement in student achievement.
The ANOVA results for the three groups are F(2, 42) = .045, p < .956 in Session 1; F(2, 42) = 1.073, p < .351 in Session 2; and F(2, 42) = 3.115, p < .055 in Session 3. Based on these statistics, the groups were almost identical in the first session, and the between-group differences grew as the sessions proceeded; the difference in the last session falls just short of significance at p < .055, but the groups become increasingly distinct over the sessions. Additionally, Group A has r = 0.101, meaning that only 1.020 percent of the variance in Session 1 scores is shared with Session 3 scores. Both Group B, with r = 0.279, and Group C, with r = 0.402, suggest moderate practical significance; in other words, 7.784 and 16.160 percent of the variance in Session 1 scores is shared with Session 3 scores in these groups, respectively.
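The sketch below illustrates these two calculations, a one-way ANOVA across the three groups for a single session and the shared-variance reading of r, using made-up score lists; none of the numbers are the study data.

```python
from scipy.stats import f_oneway, pearsonr

# Hypothetical score lists for one session; in the study each group had N = 15.
group_a = [52, 48, 55, 61, 47]
group_b = [50, 58, 63, 66, 54]
group_c = [49, 60, 70, 72, 66]

# One-way ANOVA across the three groups for this session's scores.
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")

# Shared variance between Session 1 and Session 3 scores within one group:
# r squared gives the proportion of variance the two sessions share
# (for instance, r = 0.402 corresponds to roughly 16.2 percent shared variance).
session1 = [40, 55, 60, 45, 50]
session3 = [58, 70, 75, 62, 66]
r, _ = pearsonr(session1, session3)
print(f"r = {r:.3f}, shared variance = {r ** 2 * 100:.1f} percent")
```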
Table 4 describes the differences between Groups A and C based on Cohen's d and its effect size. In Session 1, Cohen's effect size (d = .101) suggests low practical significance; in Session 3, however, there is a notably large difference between Groups A and C, with Cohen's effect size (d = .891) suggesting high practical significance. Based on the statistical analysis of scores in Tables 3 and 4, we can conclude that Group C becomes highly distinct from Group A.
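For reference, a minimal sketch of Cohen's d for two independent groups, using the pooled standard deviation, is given below; the function name is illustrative and the example lists are made-up scores, not the study data.

```python
from statistics import mean, stdev

def cohens_d(sample_a, sample_b):
    """Cohen's d between two independent samples using the pooled standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = ((n_a - 1) * stdev(sample_a) ** 2 +
                  (n_b - 1) * stdev(sample_b) ** 2) / (n_a + n_b - 2)
    return (mean(sample_a) - mean(sample_b)) / pooled_var ** 0.5

# Example with hypothetical Session 3 scores for Groups C and A:
print(cohens_d([70, 66, 72, 68, 75], [58, 61, 64, 55, 60]))
```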
Fig. 5 shows average score change for each session, and Fig. 6 shows the range of scores for each group and session in a boxplot. As we can see in Fig. 5, overall the score for Group C becomes higher, while scores for the other groups do not change as much as that of Group C. We also find that the variance of scores in Group C lessens over sessions.
Boxplot of scores of the three training sessions for the three groups. The lines in the boxes indicate median values.
How did the students of Group C come to attain higher scores than those in Group B? The effort to lessen NC led students to intentionally lengthen the time they spent on each question, and the effort to reduce ER changed students' problem-solving behavior from "writing first" to "thinking first." During the experiments, students in Group C readily recognized the relationship between each proposed value and the replayed results, and intentionally tried to adjust their behaviors in the next session in order to obtain better results. Thinking activity is hard to measure numerically; however, the improvement in Group C seems to indicate that, after reviewing their results, these students were more interested than those in Group B in adopting new behaviors to improve those results. As most of the questions were taken from a grade-level workbook, the average students were considered to have enough knowledge to understand their mistakes and address them.
The increased scores in the treatment groups are not, by themselves, sufficient to establish the effectiveness of the proposed scheme, since the score differences could stem from fewer mistakes or from other causes. To investigate the effectiveness of the proposed scheme in reducing mistakes, we had two schoolteachers check all errors and classify each as a mistake or a non-mistake. As our work did not focus on students' reasoning processes while solving problems, we created only these two categories; technical errors such as misread questions or calculation errors were marked as mistakes. If the proposed scheme works, the number of mistakes should be smaller after the treatment, and in fact, we did find such a decrease.
Fig. 7 visualizes the means of the mistake-error ratios, based on Table 5. The mean for Session 1 is .258 (N = 45), SD = .125; the ANOVA result is F(2, 42) = .850, p < .850, so the differences are not significant. In Session 2, the ANOVA shows F(2, 42) = 3.882, p < .028, and in Session 3, F(2, 42) = 5.473, p < .008, so both Sessions 2 and 3 show significant differences among the groups, with the differences increasing from the second to the third session. Comparing pretest and posttest mistake ratios, Group A shows a mean difference of .034 with SD = .096; the paired-samples t-test gives t(14) = 1.380, p < .189, effect size r = 0.147, a non-significant difference. In Group B, the mean difference between Sessions 1 and 3 is -.015 with SD = .137; the paired-samples t-test gives t(14) = -.429, p < .674, effect size r = -.047, so the difference between pretest and posttest is not significant. In Group C, the mean difference between Sessions 1 and 3 is -.121 with SD = .122; the paired-samples t-test gives t(14) = -3.815, p < .002, effect size r = -.422, which indicates a moderate effect. On the basis of these statistics, we can conclude that the treatment in Group B, using the Visualizer only, is not significantly effective in lessening mistakes, while using both the Visualizer and the feature values is effective.
Mistakes as a proportion of errors across the three sessions. Mistake Ratio 1 denotes the mistake ratio in Session 1, Mistake Ratio 2 that in Session 2, and Mistake Ratio 3 that in Session 3.
Table 6 describes the differences between Groups A and C based on Cohen's effect size; negative values denote a decrease in the mistake ratio. All comparisons in Session 1 show effect sizes suggesting low significance, from which one would expect only small between-group differences in later sessions as well. However, Session 2 already shows moderate to high significance for all comparisons, as the Cohen's d values are all above the 0.3 threshold, and in particular the comparison between Groups A and C in Session 3 shows the high significance of the effect of the proposed scheme.
Finally, we examine changes in feature values: NC, ER, and VM. We consider only Group C, as it was the only group showing meaningful changes in mistake levels. Based on our observations during the test, many students in Group C made efforts to adjust their problem-solving behaviors closer to those of peers who had good results.
Table 7 describes the changes in the measured values from pretest to posttest for Group C, and Fig. 8 plots these results. As can be seen, NC (positioned at the top of the figure) and ER (positioned at the bottom) decreased gradually, while VM increased. These numerical changes reflect behavioral changes in this group, namely less page navigation, less erasing, and increasingly ordered writing. The paired-samples t-test shows that the changes in NC (p < .091) and ER (p < .107) are not significant, but the change in VM (p < .002) is. The Pearson product-moment correlation between the change in each feature and the change in mistake occurrence was .146 for NC (p < .060), .309 for ER (p < .026), and -.254 for VM (p < .036); based on these values, we can conclude that mistake occurrence is meaningfully related to erasing (ER) and writing sequence (VM) behaviors during problem solving rather than to page navigation (NC).
Fig. 9 shows an example of a single student's improvement due to the training, specifically in VM. The visualized result (the yellow line and circled numbers) indicates the sequence of writing; the VM value gradually increases as bottom-to-top writing gradually lessens. For reference, the respective VM values of the problems shown in Fig. 9 are 0.426, 0.710, and 1.0, from left to right.
Example of problem-solving improvement for a single student in Group C due to training (left: Session 1, center: Session 2, right: Session 3). The circled numbers positioned at the right side of the written areas denote the order of writing during problem solving. The circled numbers are not displayed in the Visualizer software.
During the test, students typically acknowledged their errors readily and engaged in problem solving once again. In addition, we found that the training sessions had other advantages, as follows: (1) Owing to the side-by-side review capability of the Visualizer software, high-scoring students’ correct solutions gained more attention than they otherwise would have; (2) Students tried quite hard to change their behavior after the first training session because their activities were being shared (anonymously) with others through the Visualizer; and (3) Because the software stored all of the students’ actions and scores, students easily noted their own improvement.
Discussion and Limitations
Our work intended to address a gap in learning and assessment effectiveness in mathematics education by providing training that leads to a reduction in the number of student mistakes, and we did achieve a positive result in this regard. However, several points may require further investigation and should be considered in future experiments.
Firstly, there is little research supporting the existence or illuminating the nature of a relationship between behavior and logical thinking (problem-solving) processes; thus, the actual learning effectiveness of the proposed scheme remains somewhat unclear. It was assumed that the scheme induces better problem-solving behaviors and thus allows students to better solve the questions. This matter could be illuminated by reference to research and theory in the fields of cognitive science and behavioral psychology, approaches that are not explored in the present study. Researchers who study mathematics education categorize problem-solving activities into stages, for example, in Schoenfeld's work [14], stages are characterized as reading, analyzing, exploring, planning, implementing, and verifying. While schemata of stages may help both teachers and students improve training and learning, the proposed system cannot discriminate such stages. In the categorization step of the proposed system, only two stages are available per problem: thinking and implementing; we may also adopt “writing” and “doing nothing” stages based on the input data, but these behaviors can also be considered as simply outward behaviors of students.
Secondly, the proposed scheme provides no helpful training information when a student writes nothing and gives up on a problem, which may be a limitation of our work. For this reason, the test difficulty levels need to be considered carefully and set according to the target learners' knowledge and problem-solving skill; if we prepare questions with various levels of difficulty, the range of target learners can be expanded. With regard to students' difficulties, we observed that high-scoring students do not necessarily exhibit exemplary problem-solving behaviors on the most difficult problems, because they also struggle with these problems and engage in much trial and error. Currently, we provide all problem-solving results from high-scoring students and average students, but only selected results for the latter group, in order to find a relationship between training effectiveness and quality of solutions.
Third, several areas of mathematics do not fit our approach well. In particular, function problems, which tend to be solved by drawing a graph, and geometry problems cannot easily be analyzed by our current VM measurement method, which only takes into account written equations. We will further develop our writing analysis method so that it can be used with other expressions of mathematical learning, including graphs.
Conclusion
In this paper, we presented the design and implementation of a mathematics problem-solving training system intended to reduce the occurrence of mistakes. We adopted a tablet computer as our delivery platform, since it has been shown to enhance the learning experience, minimizes the technology adaptation period required, and offers insight into next-generation learning systems. Our system provides a learner-friendly interface that supports naturalistic handwriting input and gathers data in real time. We showed that the number of navigation actions, the number of erasing actions, and the orderliness of writing during problem solving are meaningful variables with respect to the occurrence of mistakes, and that behavioral training targeting these problem-solving skills improves them.
In an experiment, we evaluated the effectiveness of the proposed scheme among three groups of students. We found that the use of the Visualizer, together with displayed feature values, lessened mistakes in problem solving. We also found that erasing behavior and flow of writing were significantly related to mistake occurrence, while navigation behavior was not.
Using the proposed system, tutors can coach students in a more effective way, compared to traditional do-it-again feedback. With further research on the relationship between problem complexity and the learner knowledge model (which is not included in our work currently) we should be able to determine the concrete causes of mistakes. Moreover, the tablet computer platform also has a camera and microphone, and our work can be improved by gathering voice, eye gaze, and/or facial expression data in order to investigate various problem-solving behaviors. As tablet computers become more widely available, it will be possible to exploit them to implement new tutoring and learning systems for mathematics education. Further, our system provides student-centered learning and can help students practice mathematical problems while receiving beneficial behavioral training.
Currently, we cannot use computers to semantically interpret the content of students’ mathematics-related writing, due to inadequate character recognition success rates. We are looking forward to seeing mathematics handwriting character recognition improve to a high level, which will allow us to not only investigate problem-solving behaviors but also recognize equations and enhance problem-solving error analysis, in order to provide richer learning feedback to students.