A review on process-oriented approaches for analyzing novice solutions to programming problems

High attrition and dropout rates are common in introductory programming courses. One reason students drop out is loss of motivation due to the lack of feedback and proper assessment of their progress. Hence, a process-oriented approach is needed in assessing programming progress, which entails examining and measuring students' compilation behavior and source code. This paper reviews the elements of a process-oriented approach, including previous studies that have used it. The specific metrics covered are Jadud's Error Quotient, the Watwin Score, Probabilistic Distance to Solution, the Normalized Programming State Model, and Repeated Error Density.

Introduction

Novice programmers often come to a complete halt when they encounter errors or problems that they are not familiar with. Students who portray this characteristic are called stoppers (Perkins et al. 1986).
This scenario involving novice programmers is common not only during actual programming exercises in school but also at home when students attempt to work on solutions to their programming assignments. It is very frustrating to students and may cause them to disengage from what they are trying to solve. Their loss of interest in carrying on with their solutions can persist, leaving them less inspired to learn more about programming. This is a likely threat to students who are under pressure and to those who have become less motivated, because studies have shown a link between student motivation and success in learning to program (Carbone et al. 2009).
Programming teachers, on the other hand, have little or no way of knowing the real hardships students face and the progress they make along the way, since teachers only see the final outputs submitted to them. Teachers may not even realize the extent to which their students are capable of writing correct programs (Jadud 2005). This is the problem with traditional assessment. Giving students performance-based conventional assessments, such as take-home programming assignments, examinations (e.g., requiring students to write code fragments), and charettes (short programming assignments carried out in fixed-time laboratory sessions) (McCracken et al. 2001), and grading only their final outputs fails to take into account the students' intentions (Johnson 1984) and the process that led to their final submissions (Lane and VanLehn 2005). Programming should be regarded not merely as code generation but also as a means to reflect how students think, decompose, and solve problems (Marion et al. 2007).
For this reason, it is important that students' intermediate results be considered alongside the completed output. Such an approach could address one of the challenges of novice instruction in Computer Science and related disciplines: how to keep students interested in programming. The key to keeping them motivated is believed to be helping them succeed in their early programming courses. Students must possess not only the skills but also the motivation to succeed in learning (Helme and Clarke 2001). Hence, if teachers can sustain their students' interest in their early programming courses, there is a greater chance that attrition can be reduced. One way to do this is to use a process-oriented style of gauging their work (Worsley and Blikstein 2013).
The goal of this paper is to discuss (1) the process-oriented approach in programming and its potential benefits, (2) the elements of a process-oriented approach, (3) studies conducted using this approach in the context of novice programming, with a focus on data collection, and (4) some metrics used in this approach.

A process-oriented approach in programming
Programming in itself is a process; hence, the process that students undertake in developing a solution to a computing problem must also be evaluated. A completed source code weakly represents the student's knowledge and the skills needed to create it (Lane and VanLehn 2005): when a student's final work is graded on its own, it is not possible to ascertain the student's level of ability, since that single grade combines several factors (Winters and Payne 2005). A final grade alone is neither a valid nor a reliable metric for drawing conclusions about a student's ability to write code (Cardell-Oliver 2011). As such, more complex metrics are proposed as better suited to assessing students' work in the process of arriving at the final solution. These metrics, which are based on students' progress, code, or behavior, are called data-driven metrics. Examples of these metrics are the Intention-Based Score (IBS) (Lane and VanLehn 2005), also known as degree of closeness (Helme and Clarke 2001), the Error Quotient (EQ) (Jadud 2005), the Watwin Score (WS) (Watson et al. 2013), Probabilistic Distance to Solution (PDS) (Sudol et al. 2012), and the more recent Normalized Programming State Model (NPSM) (Carter et al. 2015) and Repeated Error Density (RED) (Becker 2016).
A process-oriented method of assessing open-ended problems (Johnson et al. 2013), such as programming problems, is about capturing students' intermediate work and using a suitable metric to account for students' learning and behavior. In this approach, the process of producing the final output is examined in an attempt to see students' progression while solving the problem. The data captured in the course of program development can be turned into meaningful information that helps teachers see whether students understand the programming concepts used (Pettit et al. 2012). With this chronological data, it is possible to recreate the problem-solving steps a student takes (Vee et al. 2006a), giving a glimpse of how each student advances toward a solution.
Patterns can also be observed in this kind of data, and analyzing such patterns can give teachers and researchers an indication of students' behavior and learning process, which is fertile ground for research. Examples of patterns that can be extracted are programming behavioral patterns such as debugging (Ahmadzadeh et al. 2005), compilation, and coding (Vee et al. 2006a; Spacco et al. 2006); patterns depicting practices common to successful students but not to struggling students (Norris et al. 2008); problem-solving strategies (Kiesmueller et al. 2010; Hosseini et al. 2014) and anomalies (Helminen et al. 2012); submission patterns (Falkner and Falkner 2012; Allevato et al. 2008); and code evolutions (Piech et al. 2012; Blikstein 2011). The possibility of cheating and the effect of starting late on projects can also be traced from these patterns (Fenwick et al. 2009), as can the distinction between experienced and poorly performing students (Annamaa et al. 2015; Leinonen et al. 2016). The data collected can also be used to predict student performance and identify at-risk students in programming (Tabanao et al. 2008; Falkner and Falkner 2012; Watson et al. 2013; Koprinska et al. 2015), with the aim of providing them tailored remedial instruction. Capturing and analyzing these patterns through a process-oriented approach is important because it can help teachers vary their strategies for the benefit of their students.
The use of a process-oriented approach in programming is a win-win situation for both teachers and students. Teachers can gain meaningful insights into how students move toward a solution, judge their programming proficiency, and provide the necessary remedial actions. Students, on the other hand, would appreciate it if their teachers informed them how their work would be assessed, provided intermediate feedback to encourage them (Hattie and Timperley 2007), and awarded a grade that truly measures their progress. If no instructional interventions are provided, students are left clueless as to where they have gone wrong and, when they get discouraged, will probably choose to drop out.

Elements of a process-oriented approach and studies conducted
Using a process-oriented approach to assessing students' work entails deciding how data will be collected, how much data will be captured, and how frequently the data will be stored. For this approach to work effectively, data must be acquired automatically from students during their actual programming sessions using tools or a programming environment (Jadud 2005). All source code produced, together with other pertinent information, is gathered; for instance, students' behavior during programming sessions is captured every time they interact with the compiler or engage with the programming environment. The analysis of compilation behavior therefore depends on the automatic collection of source code written by students while programming. In Jadud's (2005) work, the observations do not depend on a researcher observing the subjects; the computer does the entire job of automatically capturing and storing the students' interactions with it. The term online protocol refers to this sort of data. Specifically, an online protocol is a collection of files submitted to a compiler that gives a chain of snapshots (Lane and VanLehn 2005) representing the programmer's in-between thought processes; it can thus capture not only the final submission but also the entire history of program development.
In developing an online protocol, the granularity of data must be decided. Koprinska et al. (2015) identified different data granularity levels: (1) submission-level data, which stores submitted source code; (2) snapshot-level data, which stores file saves, compilations, and executions; and (3) keystroke-level data, which stores data every time a student changes the source code. Ihantola et al. (2015) ordered these levels from finest to coarsest as follows: keystrokes, line-level edits, file saves, compilations, executions, and submissions.
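To make these granularity levels concrete, the sketch below shows how records at the snapshot and submission levels might be structured; the field names and types are illustrative assumptions, not a schema from any of the cited tools.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SnapshotEvent:
    """Snapshot-level data: one record per file save, compilation, or execution."""
    student_id: str
    timestamp: float       # seconds since the session started
    event_type: str        # "save", "compile", or "execute"
    source: str            # full source code at this moment
    error: Optional[str]   # compiler/runtime error message, if any

@dataclass
class SubmissionEvent:
    """Submission-level data: one record per graded submission."""
    student_id: str
    timestamp: float
    source: str
    tests_passed: int
```

Keystroke-level data would add a record per individual edit, which is why its storage demands are the highest of the three levels.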
It is important to select the level of granularity that provides an adequate and reasonable amount of data for the analysis at hand. If data is collected at the smallest level of detail (e.g., mouse events or keystrokes), then there must be a means to store and handle large volumes of data, and some of these details might not even turn out to be relevant. On the other hand, if data is collected at a coarser granularity (e.g., assignment submissions), then extracting the necessary information might not be possible. Comparisons of these levels are available in the literature.
Tools that can unobtrusively automate data collection must also be considered. The tools commonly used by researchers to collect data systematically are automated grading systems, IDE instrumentation, version control systems (VCS), and key logging (Ihantola et al. 2015). The granularity of the data depends on the tool. For instance, automated grading systems like Web-CAT produce data at the granularity of submissions; Web-CAT is capable of repeatedly collecting file submissions of the same task. IDE instrumentation tools, which collect data as snapshots, are called third-generation approaches (Norris et al. 2008); examples include Hackystat (file modifications) (Johnson et al. 2003), Jadud's tool (compilation events) (Jadud 2005), Marmoset (file saves) (Spacco et al. 2006), and ClockIt (compilation events) (Norris et al. 2008). VCS analyze source code snapshots at commit points only, whereas key logging offers data collection down to the smallest level of detail, such as individual keystroke events.
There is plenty of literature using a process-oriented approach. Of the 42 studies listed in Table 1, the majority collected data in Java (26), followed by Python (6) and C++ (2). Five of the studies did not mention the programming language used, while two were conducted using multiple programming languages (C/Python and C/Python/Java). Since Java was the preferred programming language, this is presumably why BlueJ was the most preferred tool for capturing data. Compilations (11) took the lead in data granularity, followed by combinations of varying levels of granularity, with submissions (9) coming in third.

Metrics
In a process-oriented approach, using a proper metric to analyze an online protocol is important if the aim is to derive meaningful information from the data collected. The studies listed in Table 1 utilized combinations of different metrics, depending on the information needs of the researchers. To show how these metrics were derived and computed, five of them are discussed here: Jadud's Error Quotient (EQ) (Jadud 2006a), the Watwin Score (WS) (Watson et al. 2013), Probabilistic Distance to Solution (PDS) (Sudol et al. 2012), the Normalized Programming State Model (NPSM) (Carter et al. 2015), and Repeated Error Density (RED) (Becker 2016). These metrics were selected primarily because all of them have been used to analyze novice programming behavior in order to better understand why students drop out early in the course and what interventions could be provided. It all started with Jadud's work, which the succeeding metrics improved upon by addressing the deficiencies of the Error Quotient.

Error Quotient
EQ is a metric that measures the syntax errors a student encounters in a single programming session (i.e., a sequence of compilations from one class period). The measurement is based on several criteria: (1) the number of syntax errors made by a student, (2) the frequency with which those syntax errors are repeated, and (3) the location of the error in subsequent compilations. A high EQ is given to students who commit many syntax errors and fail to fix them between compilations. On the other hand, a low EQ is awarded to students who encounter a small number of syntax errors or who manage to fix them quickly. The use of EQ permits student sessions to be compared easily. Given a session of compilation events e_1 through e_n, the EQ is calculated as follows (Jadud 2006a):
• Collate: Create consecutive pairs from the events in the session, e.g., (e_1, e_2), (e_2, e_3), (e_3, e_4) up to (e_{n-1}, e_n).
• Calculate: Score each pair according to the algorithm presented in Fig. 1.
• Normalize: Divide the score assigned to each pair by 9 (the maximum value possible for each pair).
• Average: Sum the scores and divide by the number of pairs. This average (in the range 0-1) is taken as the EQ for the session.
To show how the scoring is done, consider the tabular visualization in Fig. 2. The table in this figure is an example of a novice's compilation session with four events. Each row in the table signifies the change between a pair of successive compilation events. The meanings of the table columns are as follows:
• Err type: Denoted with a "*" if a compilation pair was error-free; otherwise, it shows a number representing a known syntax error. This number provides a unique index into the table of recognized syntax errors (e.g., error type no. 8 refers to "Class or interface expected").
• T: Represents the amount of time (in seconds) that passed between compilations (reduced to five bins: 0-10 s, 20-30 s, 30-60 s, 60-120 s, and more than 2 min). This provides a summary of which compilations were quick and reflexive, and which were done with more thought.
• Ch: Tells how many characters were added or removed from one compilation to the next. This gives a sense of the magnitude of the change (e.g., a single-character edit versus a large rewrite).

Scoring begins by taking the pairs of events one by one. Referring to Fig. 2, these event pairs are (1, 2), (2, 3), and (3, 4). Taking the first pair and tracing through the flowchart in Fig. 1, the first check is whether both events end in a syntax error. Since the second event is error-free, this pair is given a score of zero. The same holds for the second pair, which also receives zero since its first event incurred no errors. As for the third pair, both events end in a syntax error (add 2), both have the same error type (add 3), and both have the same error location (add 3), yielding a total of 8. Hence, the three pairs have scores 0, 0, and 8, respectively. The scores are then normalized to 0, 0, and 0.88, resulting in an average penalty of 0.29 (0.88/3) over the pairs. This average over all the pairs is the EQ of the programming session. In Jadud's experiment, EQ appeared to be inversely correlated with grades, meaning students with low EQ scores are likely to have higher assignment and exam grades in programming than students with high EQ scores. EQ also shows promise as a predictor of students' academic performance.
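The EQ computation is small enough to sketch in code. The following Python sketch, a minimal illustration rather than Jadud's implementation, follows the collate/calculate/normalize/average steps and the Fig. 1 scoring rules described above; the event field names are assumptions.

```python
def pair_score(e1, e2):
    """Score one pair of consecutive compilation events (the Fig. 1 logic)."""
    score = 0
    if not e1["error_free"] and not e2["error_free"]:
        score += 2                                    # both events end in a syntax error
        if e1["error_type"] == e2["error_type"]:
            score += 3                                # same error type repeated
        if e1["error_location"] == e2["error_location"]:
            score += 3                                # error at the same location
    return score

def error_quotient(events):
    """EQ for a session: pair scores normalized by 9, then averaged."""
    scores = [pair_score(a, b) / 9.0 for a, b in zip(events, events[1:])]
    return sum(scores) / len(scores) if scores else 0.0

# The four-event session of Fig. 2: pair scores 0, 0, and 8.
session = [
    {"error_free": False, "error_type": 8,    "error_location": 12},
    {"error_free": True,  "error_type": None, "error_location": None},
    {"error_free": False, "error_type": 8,    "error_location": 12},
    {"error_free": False, "error_type": 8,    "error_location": 12},
]
print(f"{error_quotient(session):.2f}")  # 0.30 (the text reports 0.29 after truncating 8/9 to 0.88)
```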

Watwin Score
The Watwin algorithm attempts to address a flaw of EQ: EQ assumes that students either work on a single source file or work on several files linearly, whereas Watson et al. (2013) found that students often work on multiple files concurrently. The algorithm also offers a considerable improvement in accuracy when predicting student performance by considering how a student tackles different types of errors relative to peers, introducing resolve time as a predictor. Students receive a penalty based on the amount of time they take to fix an error relative to their peers, with resolve times assumed to be normally distributed. The algorithm proceeds as follows:

Input: A set of student programming logs (compilation and invocation events) for all files compiled during a session.
1. Prepare a set of successive compilation event pairings using the process shown in Fig. 3.
2. Quantify programming behavior: score all compilation pairings using the scoring algorithm in Fig. 4.
Output: The average of all pairing scores (in the range 0-1) is taken as the student's WS for the session.

A score of zero denotes that the student's session was error-free, whereas a score of 1 means that all compilations ended in errors; a score approaching zero indicates a more experienced programmer.

Step 1 in Fig. 3 builds pairs of compilation events {e_1, e_2}, {e_2, e_3}, ..., {e_{n-1}, e_n} ordered by timestamp on a per-file basis, which accommodates students working concurrently on multiple files. In step 2, all pairings {e_f, e_t} whose code snapshots are identical are removed, as these would bloat the total number of compilation pairings; pairings whose first event e_f compiled successfully are likewise removed. Step 3 detects fixes achieved by deleting code or commenting it out, as these do not reflect a student's ability to fix errors. Deletion fixes are spotted by computing the difference ratio between the snapshots of e_f and e_t: the pair {e_f, e_t} is removed if the number of insertions and alterations is zero while the number of deletions is greater than zero. Commented-out fixes are spotted and removed using a regular expression. In step 4, the error messages generated in each compilation event pair are generalized to build a profile for different classes of errors. Finally, step 5 estimates the amount of time spent on each compilation pairing {e_f, e_t}. This is accomplished by creating a combined sequence of invocation and compilation events {h_1, h_2, ..., h_{k-1}, h_k} for all files in a session, sorted by timestamp. For every {e_f, e_t}, if there occurs an event h_i whose timestamp falls between those of e_f and e_t, the estimated time spent on {e_f, e_t} is computed as the difference between the timestamps of e_t and h_i.
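As a rough illustration of steps 1 and 2 of this pairing process, the sketch below builds per-file consecutive compilation pairs and filters out unusable ones. The event keys (file, time, source, ok) are assumed names, and the deletion-fix detection, error generalization, and timing steps (3-5) are omitted.

```python
from collections import defaultdict

def build_pairings(events):
    """events: dicts with assumed keys file, time, source, ok (compile succeeded)."""
    by_file = defaultdict(list)
    for e in sorted(events, key=lambda e: e["time"]):   # step 1: order by timestamp...
        by_file[e["file"]].append(e)                    # ...on a per-file basis
    pairs = []
    for seq in by_file.values():
        for ef, et in zip(seq, seq[1:]):
            if ef["source"] == et["source"]:
                continue                                # step 2: drop identical snapshots
            if ef["ok"]:
                continue                                # step 2: drop pairs whose first event succeeded
            pairs.append((ef, et))
    return pairs
```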

In this approach, the resolve time to fix errors is used as a predictor in the scoring model for two reasons: (1) it had not been explored in studies prior to this research, and (2) the researchers found a strong, significant correlation between students' mean resolve time and performance. As illustrated in the scoring algorithm in Fig. 4, students are penalized according to whether their resolve time falls more than one standard deviation above or below the mean. A low penalty means that a student fixed an error more quickly than peers, and a high penalty signifies that a student fixed an error much more slowly than peers. After scoring all pairings, the scores are normalized, and the computed average becomes the WS. The components included and the penalties assigned in the algorithm were not products of random guesswork but were based on previous studies using regression models and cross-validation, making the algorithm applicable to independent datasets. Using a student's WS as the independent variable and the student's overall coursework mark as the dependent variable, a linear regression can be performed to test the predictive power of the algorithm on students' programming performance. WS can also be used as a classifier of student performance.
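Below is a minimal sketch of the resolve-time component of this scoring, assuming a simple banded penalty; the concrete penalty values are illustrative assumptions, since Watson et al. (2013) derived their weights empirically.

```python
from statistics import mean, stdev

def resolve_time_penalty(resolve_time, peer_times):
    """Penalty for one pairing based on resolve time vs. the peer distribution."""
    mu, sigma = mean(peer_times), stdev(peer_times)
    if resolve_time > mu + sigma:
        return 1.0       # fixed the error much more slowly than peers: high penalty
    if resolve_time < mu - sigma:
        return 0.25      # fixed the error more quickly than peers: low penalty
    return 0.5           # within one standard deviation of the mean

peer_times = [30.0, 45.0, 60.0, 50.0, 40.0]     # peers' resolve times in seconds
print(resolve_time_penalty(120.0, peer_times))  # 1.0
```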

Probabilistic Distance to Solution
Traditional metrics for measuring student performance, such as those that count submissions or measure the time spent to finish a programming problem correctly, do not actually reflect a student's understanding of, or misconceptions about, the problem being solved. In fact, they are too coarse-grained to distinguish conceptual misunderstanding from syntactic or parsing mistakes. The PDS metric gives a glimpse of the entire history of problem-solving steps a student takes, from the first attempt at solving the problem up to the completed solution, while providing additional insight into misconceptions and problem-solving paths by focusing on the individual algorithmic components within the larger algorithm students are constructing. These algorithmic components, which are used to evaluate student progress, provide a way to determine whether a student is making a productive edit (e.g., moving closer to a solution) or an unproductive edit (e.g., one that does not fix an incorrect algorithmic component of the code), and whether the student is engaging in shotgun debugging or pursuing a misconception.
The idea is that, for each submission, the student's code is converted to a vector of binary features called an alignment vector (AV), which encompasses the algorithmic components (e.g., looping structure, decision structure, use of the loop control variable in an array access, and inclusion of a return statement), correctness outcomes, and compilation success. The model of required algorithmic elements was based on the work of Sudol-DeLyser (2014). A value of 1 or 0 in the AV represents the presence or absence of each algorithmic component, respectively. The generated AVs are represented as program states in a network. This model is useful because it reveals which individual concepts students struggle with, and it also supports tracking and evaluating students' progress across multiple submissions.

Figure 5 shows a snapshot of a network constructed for a particular problem. The network nodes S_1 ... S_{n-1} are possible program states with an end state node F, and the edges represent the observed transitions between states. In the table, columns Fx are the binary program features that correspond to the required algorithmic components for the problem, and each row Sx indicates the presence (green box) or absence (white box) of those features. S0 is the starting state (empty starter code), and S56 represents the correct final state. In the transition graph, the thickness of an edge indicates how many times that transition was observed across all students' submissions, whereas the edge lengths are arbitrary and carry no meaning in the model. All edges are unidirectional, indicating a move from the thin end of the line to the thick end. Consider, for example, the path S0, S55, S60, S56. It can be seen from the AV (table in Fig. 5) that from the initial state, students submitted code with a missing component (F8), and then code with a correct return statement that did not compile, before transitioning to the final state.

For each node, a Maximum Likelihood Estimate (MLE) of the transition probability to every other node is computed from the observed transitions. Given the number of observed transitions from state x to state y, T_{x,y}, the probability of being in state y at time t is estimated with the MLE

$$p(S_y(t)) = P(S_y(t) \mid S_x(t-1)) = \frac{T_{x,y}}{\sum_i T_{x,i}}$$

This is equivalent to a Markov chain estimate with a one-state history. A classic problem in Computer Science, the random walk on a network, is applied to the calculation of PDS, which refers to the number of likely edits it would take a student to move from any given node (AV state) to the final state (complete solution). The mean distance from each state to the final state is calculated using a set of linear equations, modeling each edge as a transition probability and a single unit of distance between states.

The PDS metric, along with the transition graph (TG) shown in Fig. 5, can provide substantial information about the paths students follow in their attempts to arrive at a correct solution. By looking at the PDS and the TG, it is possible to identify whether a student made productive edits or was just guessing to fix compiler errors. This is because the AV is generated based on instructional goals or learning objectives; hence, its representation and visualization support generalizations about how program states parallel student performance and other latent factors such as learning. For example, in the table in Fig. 5, one student's first submission is observed as S55 with a PDS of 2.99, and the next state it transitioned to is S57 with a PDS of 4.43. This means the edit was less productive, since it resulted in a shift to a state with a higher probabilistic number of submissions needed to arrive at a correct solution.
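As a concrete illustration of this random-walk computation, the sketch below derives MLE transition probabilities from counts and solves the standard first-step linear equations, d(F) = 0 and d(x) = 1 + Σ_y p(x, y) d(y), for the expected number of edits to reach the final state. The tiny chain and its counts are invented, not taken from Fig. 5.

```python
import numpy as np

# Transient states and observed transition counts T[x][y]; column order
# is [S0, S55, F], with F the absorbing correct-solution state.
transient = ["S0", "S55"]
T = np.array([[0.0, 4.0, 1.0],     # from S0:  4 edits to S55, 1 straight to F
              [1.0, 0.0, 3.0]])    # from S55: 1 back to S0, 3 to F

P = T / T.sum(axis=1, keepdims=True)            # MLE transition probabilities
Q = P[:, :2]                                    # transitions among transient states
d = np.linalg.solve(np.eye(2) - Q, np.ones(2))  # expected edits to reach F
print(dict(zip(transient, d.round(2))))         # {'S0': 2.25, 'S55': 1.56}
```

Here the PDS of S0 is 2.25: starting from empty code, a student is expected to need about two and a quarter edits to reach the correct solution under this (toy) transition model.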

Normalized Programming State Model
NPSM differs from EQ and WS in predicting student performance because it considers not only the syntactic correctness but also the semantic correctness of a programming solution at any point in time; hence, it offers a more holistic predictive model of student performance than EQ and WS. NPSM operates on the programming data stream and requires that a student's current programming solution be mapped into one of the four states in the 2 × 2 space shown in Fig. 6. The syntactic correctness of a program is determined by whether the last compilation attempt results in an error. Semantic correctness, on the other hand, is not immediately obvious and cannot be determined right away; the only workable proxy is the presence or absence of runtime exceptions in the last execution attempt. A program is semantically incorrect if it produces a runtime exception in the last execution attempt, and semantically unknown if the last execution attempt does not produce one. A state-transition diagram found in Carter et al. (2015) is used to map this model onto the stream of programming log data provided by the IDE used in the study. The 11 states in the transition diagram, along with their codes for quick viewing, are shown in Table 2.
The editing states in the model, namely YN, YU, NU, and NN, serve as rough proxies for both the syntactic and semantic status of the program being edited, as shown in Fig. 6. It is inferred that students who spend more time in syntactically correct and semantically unknown states will likely perform better than students who spend more time in syntactically and semantically incorrect states. In the RN and DN states, students would seem to be asking, "Why doesn't my program work?", and in the RU and DU states, they would be inclined to ask, "Does my program work?" Among these states, DN and DU are in a more powerful position for asking these questions, since the student is using the debugger. Lastly, time spent in the R/ state may imply that a student is struggling. Overall, the relationship between these five execution states and student performance is not clear.
A student's programming activity is mapped to one of the 11 states in this model. To correlate the use of this method with student performance, the amount of time a student spends in each state is recorded, and each recorded time is normalized relative to the total time the student spent programming. In a preliminary analysis of the data gathered using this model, the Idle state was observed to dominate, meaning students are likely to spend most of their day not doing any programming activity. Since the focus is on the time students spend programming, the Idle state was eliminated from the normalization process. One variable, time-on-task, was added to the model, leaving a total of 11 data points per student; these 11 data points form the NPSM. A predictive measure was derived using NPSM to predict performance. Based on previous studies, and given that NPSM has 11 predictors, the ideal sample size was estimated at 220 students to achieve full statistical power when deriving the predictive measure. Four states were considered in the development of the predictive model, namely UU, NU, RU, and RN, and multivariate regression was performed on these states. RN and RU were found to be positive contributors to student success, while UU and NU were negative contributors. This suggests that dealing with a program's runtime behavior, notwithstanding its semantic correctness, is a successful programming approach. On the other hand, writing large chunks of code without compiling (UU) does not contribute to programming success, and when such students do compile, they tend to be trapped in the NU state, which is also negatively correlated with performance. YN and NN were not significant predictors, so they were dropped from the predictive model.
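A minimal sketch of assembling the NPSM feature vector as described above: Idle time is excluded, the remaining state times are normalized by total programming time, and time-on-task is appended. The exact set of state codes follows Table 2 and is an assumption here (four editing states, five execution states, plus UU).

```python
# The ten non-Idle state codes (assumed from the discussion of Table 2).
STATES = ["YU", "YN", "NU", "NN", "UU", "RU", "RN", "DU", "DN", "R/"]

def npsm_features(state_durations):
    """state_durations: dict mapping state code -> seconds spent in that state."""
    active = {s: t for s, t in state_durations.items() if s != "Idle"}
    time_on_task = sum(active.values())
    shares = [active.get(s, 0.0) / time_on_task for s in STATES]
    return shares + [time_on_task]      # 10 normalized state times + time-on-task

log = {"YU": 1200.0, "NU": 600.0, "RU": 300.0, "Idle": 5400.0}
print(npsm_features(log))               # 11 data points for one student
```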

Repeated Error Density
RED is a metric for quantifying repeated errors. It is considered more advantageous than Jadud's EQ since it is less context dependent and is suitable for short sessions. RED has properties that make it capable of answering Jadud's questions: (1) "If one student fails to correct an 'illegal start of expression' error over the course of three compilations, and another over 10, is one student [about] three times worse than the other?" and (2) "What if the other student deals with a non-stop string of errors, and the repetition of this particular error is just one of many?" Dissecting these questions shows that when EQ is used as the metric, both students in these cases would have an EQ of 1.
These cases can be demonstrated in the following scenarios. Take students A, B, and C; an error type x; and a repeated error r. Suppose student A logs two consecutive errors of type x, resulting in one repeated error r. Is it fair to say that student B, who logs three consecutive errors of type x resulting in two repeated errors (2r), is struggling with that error twice as much? What about student C, who also logs 2r but not all in one string (say, across two strings: two consecutive errors of type x, followed by some successful compiles, followed by two more consecutive errors of type x)? Considering all these cases, and assuming that no other errors are made and all other variables among the three students are equal, the EQ will be 1 in all cases. Table 3 summarizes these three scenarios and shows that EQ is of no help here. Hence, RED came into the picture to quantify repeated errors by looking both at the number of repeated error strings a student encounters and at the length of those strings. The metric sums a submetric calculated for each repeated error string encountered in a compilation event sequence. This submetric is $r_i^2 / |s_i|$, where $|s_i|$ is the length of the string $s_i$ containing $r_i$ repeated errors. It can be expressed entirely in terms of $r_i$ as $r_i^2 / (r_i + 1)$, since the length of a string containing $r_i$ consecutive repeated errors is always $r_i + 1$.
The value of RED for a given sequence S of n repeated error strings s_i is given by the equation below:

$$\mathrm{RED}(S) = \sum_{i=1}^{n} \frac{r_i^2}{r_i + 1}$$

where r_i is the number of repeated errors in string s_i, a repeated error being a pair of consecutive events that result in the same error. Table 4 shows the RED values for all sequence combinations in which a total of 0-4 repeated errors are encountered.
In Table 4, sequences 1 and 2 show that a single occurrence of error x, or two occurrences separated by other activity, does not comprise a repeated error and hence has a RED value of 0. Sequences 4 and 5 illustrate two ways that error x can be repeated twice, but sequence 4 is considered the "natural" choice for showing that a student struggles twice as much as the student in sequence 3, as evidenced by the RED values of sequence 4 (1) and sequence 3 (0.5), giving a ratio of 2:1. The same can be observed for sequences 11 and 5. With this, Jadud's first question is addressed. Jadud's second question is addressed by inspecting sequences 5 and 4. Sequence 5 contains one long string of two repeated errors with RED = 1.33 (4/3), while sequence 4 also has two repeated errors but spread over two shorter strings and has RED = 1. In effect, an extra penalty of 1/3 is incurred for encountering two repeated errors in one long string. Therefore, the student in sequence 5 (B) is struggling more with error x than the student in sequence 4 (C), as reflected by the former's higher RED score.
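The RED computation can be sketched briefly in code. The sketch below scans a compilation sequence for maximal strings of consecutive identical errors and sums r_i^2/(r_i + 1) over them; the mapping of the three example calls to sequences 3, 4, and 5 of Table 4 is an assumption based on the discussion above.

```python
def red(events):
    """events: error identifiers per compilation, or None for a clean compile."""
    def string_value(length):
        r = length - 1                 # r_i repeated errors in a string of length r_i + 1
        return r * r / (r + 1) if r > 0 else 0.0
    total, run = 0.0, 1
    for prev, cur in zip(events, events[1:]):
        if cur is not None and cur == prev:
            run += 1                   # the same error repeated consecutively
        else:
            total += string_value(run) # close off the current error string
            run = 1
    return total + string_value(run)

# Plausibly sequences 3, 4, and 5 of Table 4 (x = one error type,
# None = a successful compile):
print(red(["x", "x"]))                  # 0.5
print(red(["x", "x", None, "x", "x"]))  # 1.0
print(red(["x", "x", "x"]))             # 1.333...
```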
It can be observed that Jadud's EQ became the basis for the creation of the other metrics: newer metrics have been developed to address the shortcomings of older ones, and as data evolve and new research questions arise, more suitable and better metrics will continue to emerge. Table 5 shows a comparison matrix of the five metrics along with their advantages and disadvantages; its rows for PDS, NPSM, and RED are summarized below.

PDS
• Description: measures the probabilistic distance between an observed student solution and a correct solution using a Markov model
• Advantages: can be used to determine whether an edit or student path is (a) typical of students who have mastered the content and (b) productive in progressing toward a solution; can be adapted to focus on model state transitions that indicate misconceptions or other model-based goals of the data miner
• Disadvantages: a model of required algorithmic components must be identified first; student evaluation is constrained by the model used

NPSM
• Description: characterizes students' programming activity in terms of the dynamically changing syntactic and semantic correctness of their programs
• Advantages: can be used to predict student performance by considering the programming process more holistically; a substantially better predictor than EQ and WS
• Disadvantages: provides only a rough proxy for semantic correctness (i.e., the presence or absence of runtime exceptions in the last execution attempt)

RED
• Description: quantifies repeated errors by looking at the number of repeated error strings a student encounters and the length of those strings
• Advantages: less context dependent; useful for short sequences
• Disadvantages: can be significantly reduced by an editor that has previously been shown to result in significantly fewer compiler errors; its bounded nature raises questions when a large number of data points is involved; prone to outliers; has not been validated as correlating with student performance

Conclusion
Programming is perceived to be difficult, particularly among novices, as evidenced by the increasing dropout and attrition rates in introductory programming courses. Computing educators and researchers have endeavored to pinpoint the underlying problem by analyzing different learning behaviors exhibited while writing programs. It is possible that educators do not have a good grasp of the types of errors students normally encounter, or have little idea of students' misconceptions in programming, because traditionally what instructors require to be submitted and assessed are just the final outputs. Hence, there is really no way of knowing how much effort students put into their work or how well they understand their programming assignments.
This paper discusses what a process-oriented approach to analyzing students' programming solutions is. This approach not only takes into account the students' final submissions but also captures the progress students make while writing programs. In using a process-oriented approach, online protocols (e.g., sequences of automatically collected source code written by students while solving a programming problem) must first be collected, after deciding on the level of data granularity to use. The approach could benefit computing educators, students, and researchers. Capturing and analyzing students' online programming protocols has the potential to provide computing educators with meaningful insights into why students behave a certain way during programming sessions and the processes by which they learn to program. This information is useful to educators and course designers for creating tailored interventions and adjusting course materials accordingly. Educators may also be particularly interested in identifying at-risk students and the specific programming areas they struggle with. Programming students could receive appropriate feedback rather than relying solely on compiler messages, which are not encouraging at all to novice programmers. This is an ideal pedagogical situation that could increase retention rates in programming courses.
The application and limitations of the data-driven metrics covered in this paper have been examined in more recent studies. For instance, since EQ's drawbacks include being context dependent and containing free parameters, Petersen et al. (2015) investigated the application of EQ in a multi-context setting to determine its sensitivity to different contexts. Significant differences in the predictive power of EQ across contexts were shown. The researchers were unable to find a model that effectively predicted performance, and even when a model existed, its effectiveness was highly dependent on the parameters chosen. Ahadi et al. (2015) then proposed an improvement to EQ that measures coding progression using source code snapshot metadata. This newer metric is no longer context-sensitive or language-dependent and does not suffer from ad hoc parameters. Watson et al. (2012) and Becker et al. (2018) suggested that enhanced syntax error messages could improve students' performance; however, Denny et al. (2014) argued that enhanced error messages do not work all the time. It is possible that the differences in their results are due to differences in contexts. It is imperative, therefore, that data-driven metrics used for assessing student behaviors be examined across a wide variety of teaching contexts and situations. More effective strategies for automatically tuning the metrics to make them suitable for different contexts must also be developed. A more recent study found that tracing Programming State Model (PSM) programming sequences can detect a student's programming ability and can serve as a platform for providing customized learning interventions to students learning how to program. There is also empirical evidence that students tackle programming assignments differently depending on their level of achievement. However, it was found that automatically generated log data alone cannot connect a student's intention to his or her programming processes. Hence, the data collected must be augmented with alternative sources, such as lab observations and/or reflective dialogues, to capture the learner's intent correctly.
A related study likewise confirms that EQ and WS as predictive models differ considerably across settings, whereas NPSM outperforms these metrics as a predictor because it is not affected by the programming environments and languages used in the studies. Whereas prior studies attempted to predict student achievement from programming data alone, this recent study incorporated non-programming data, specifically online social behavior (e.g., participation levels in completing programming assignments), to investigate whether it would affect the predictive power of EQ, WS, and NPSM on students' academic success. Results showed that combining a programming-based metric with a participation-level measure could indeed add explanatory power to the predictive models. However, to extend the generalizability of these predictive measures, more replication studies need to be conducted with different student populations. This is especially true for NPSM, which, according to previous studies, has a limitation in terms of its explanatory power; a replication study using the same settings should be conducted to verify whether this limitation still exists.
Abbreviations EQ: Error Quotient; WS: Watwin Score; PDS: Probabilistic Distance to Solution; NPSM: Normalized Programming State Model; RED: Repeated Error Density