Regression Test Case Prioritization by Code Combinations Coverage

Regression test case prioritization (RTCP) aims to improve the rate of fault detection by executing more important test cases as early as possible. Various RTCP techniques have been proposed based on different coverage criteria. Among them, a majority of techniques leverage code coverage information to guide the prioritization process, with code units considered individually and in isolation. In this paper, we propose a new coverage criterion, code combinations coverage, that combines the concepts of code coverage and combination coverage. We apply this coverage criterion to RTCP, as a new prioritization technique, code combinations coverage based prioritization (CCCP). We report on empirical studies conducted to compare the testing effectiveness and efficiency of CCCP with four popular RTCP techniques: total, additional, adaptive random, and search-based test prioritization. The experimental results show that even when the lowest combination strength is assigned, overall, the CCCP fault detection rates are greater than those of the other four prioritization techniques. The CCCP prioritization costs are also found to be comparable to those of the additional test prioritization technique. Moreover, our results show that when the combination strength is increased, CCCP provides higher fault detection rates than the state-of-the-art, regardless of the level of code coverage.


Introduction
Modern software systems continuously evolve due to the fixing of detected bugs, the addition of new functionality, and the refactoring of the system architecture. Regression testing is conducted to ensure that the changed source code does not introduce new defects. However, it can become expensive to run an entire regression test suite because its size naturally increases during software maintenance and evolution: in an industrial case reported by Rothermel et al. [1], for example, the execution time for running the entire test suite could be several weeks.
Regression test case prioritization (RTCP) has become one of the most effective approaches to reduce the overheads of regression testing [2,3,4,5,6]. RTCP techniques reorder the execution sequence of regression test cases, aiming to execute those test cases more likely to detect faults (according to some reward function) as early as possible [7,8,9].
Traditional RTCP techniques [1,10,11] usually use code coverage criteria to guide the prioritization process. Intuitively speaking, a code coverage criterion indicates the percentage of some code units (e.g., statements) covered by a test case. The expectation is that test cases with higher code coverage values have a greater chance of detecting faults [12]. Because of this, a goal of maximizing code coverage has been incorporated into various RTCP techniques, including greedy strategies [1]. Given a coverage criterion (e.g., method, branch, or statement coverage), the total strategy selects as the next test case the one with the greatest absolute coverage, whereas the additional strategy selects the one with the greatest coverage of code units not already covered by the prioritized test cases. Furthermore, Li et al. [2] proposed two search-based RTCP techniques (a hill-climbing strategy and a genetic strategy) to explore the search space (the set of all permutations of the test cases) to find a sequence with a better fault detection rate. Jiang et al. [3] investigated adaptive random techniques [13] to prioritize test cases using code coverage criteria. In an attempt to bridge the gap between the two greedy strategies, Zhang et al. [10] proposed a unified approach based on the fault detection probability of each test case (referred to as a p value).
In this paper, we propose a new coverage criterion, code combinations coverage, that combines the concepts of code coverage [12] and combination coverage [14]: Given a set of regression test cases T, each test case is first transformed into an equally-sized tuple, in which each position holds a binary value representing whether the corresponding code unit (such as a branch, statement, or method) is covered by this test case. In other words, T is represented by a set T′ of abstract test cases with binary values. The code combinations coverage of T is then measured by the traditional combination coverage of T′. We apply this new coverage criterion to RTCP, proposing a new prioritization technique: code combinations coverage based prioritization (CCCP).
We conducted empirical studies on 14 versions of four Java programs, and 30 versions of five real-world Unix utility programs. Our goal was to investigate the testing effectiveness and efficiency of CCCP compared with four widely-used RTCP techniques: total, additional, adaptive random, and search-based test prioritization. The results show that when the lowest combination strength is assigned, overall, our approach achieves better fault detection rates than the other four test prioritization techniques. It not only achieves testing efficiency comparable to the additional technique, but also requires much less prioritization time than the adaptive random and search-based techniques. In addition, while the code coverage granularity does not impact the testing effectiveness of CCCP, the test case granularity does have a significant impact. Furthermore, when the combination strength is increased, CCCP provides better fault detection rates than all other RTCP techniques, regardless of the level of code coverage.
The main contributions of this paper are: • We propose a new coverage criterion called code combinations coverage that combines the concepts of code coverage and combination coverage.
• We apply code combinations coverage to RTCP, leading to a new prioritization technique called code combinations coverage based prioritization (CCCP).
• We report on empirical studies conducted to investigate the testing effectiveness and efficiency of CCCP compared to four widely-used prioritization techniques, and also analyze the impact of code coverage granularity and test case granularity on the effectiveness of CCCP.
• We provide some guidelines for how to choose the combination strength and code-coverage level for CCCP, under different testing scenarios.
The rest of this paper is organized as follows: Section 2 presents some background information. Section 3 introduces the proposed approach. Section 4 presents the research questions, and explains details of the empirical study. Section 5 provides the detailed results of the study and answers the research questions. Section 6 discusses some related work, and Section 7 concludes this paper, including highlighting some potential future work.

Background
In this section, we provide some background information about abstract test cases and test case prioritization.

Abstract Test Cases
For the system under test (SUT), there are parameters p_1, p_2, ..., p_k that may influence its performance, such as configuration options, components, and user inputs. Each parameter p_i can take discrete values from a finite set V_i. Selecting one value for each parameter yields an abstract test case [15].

Definition 1. Abstract Test Case: An abstract test case is a discrete test case that can be represented by a k-tuple (v_1, v_2, ..., v_k), where v_i (1 ≤ i ≤ k) is a value of parameter p_i from a finite set V_i (i.e., v_i ∈ V_i).
For ease of description, we define a function CombSet(tc, λ) that returns the set of all λ-wise parameter-value combinations covered by an abstract test case tc = (v_1, v_2, ..., v_k), i.e.:

CombSet(tc, λ) = { {(p_{i_1}, v_{i_1}), (p_{i_2}, v_{i_2}), ..., (p_{i_λ}, v_{i_λ})} | 1 ≤ i_1 < i_2 < ... < i_λ ≤ k }.

Obviously, the size of CombSet(tc, λ) is equal to C(k, λ) (the number of λ-combinations from k elements). To calculate the λ-wise parameter-value combinations covered by a set T of abstract test cases, the function CombSet(T, λ) is defined as the union over its members:

CombSet(T, λ) = ∪_{tc ∈ T} CombSet(tc, λ).
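As an illustration (a minimal Python sketch, not the authors' implementation), CombSet can be computed by enumerating λ-element subsets of the (parameter index, value) pairs of an abstract test case:

```python
from itertools import combinations

def comb_set(tc, lam):
    """All lambda-wise parameter-value combinations covered by the
    abstract test case tc; each combination is a frozenset of
    (parameter index, value) pairs."""
    pairs = list(enumerate(tc, start=1))
    return {frozenset(c) for c in combinations(pairs, lam)}

def comb_set_suite(tests, lam):
    """CombSet(T, lam): the union over all abstract test cases in T."""
    covered = set()
    for tc in tests:
        covered |= comb_set(tc, lam)
    return covered
```

As the definition states, `len(comb_set(tc, lam))` equals C(k, λ) for a k-tuple tc.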

Test Case Prioritization
Regression test case prioritization (RTCP) [1] aims to reorder the test cases to realize a certain goal, such as exposing faults earlier. RTCP is formally defined as:

Definition 2. Regression Test Case Prioritization:
Given a regression test suite T, let PT be the set of all possible permutations of T, and f an objective function from PT to the real numbers. The problem of RTCP [1] is to find P′ ∈ PT such that, ∀P″ ∈ PT (P″ ≠ P′), f(P′) ≥ f(P″).
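Definition 2 can be made concrete with a brute-force sketch that enumerates all permutations and keeps the one maximizing f. This is exponential in the suite size and shown only for illustration; practical RTCP techniques use heuristics instead:

```python
from itertools import permutations

def best_order(tests, f):
    """Return a permutation P' of tests maximizing the objective f,
    by exhaustive search over all permutations (illustration only)."""
    return max(permutations(tests), key=f)
```

Here f would typically be a (surrogate for a) fault detection rate metric such as APFD.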
RTCP is an effective means to reduce the cost of regression testing, and has been widely investigated [1,2,3,9], with a large number of studies focusing on the coverage criterion and prioritization algorithms. Intuitively, code coverage criteria can be regarded as characteristics of the test cases, and many prioritization algorithms have used coverage criteria to guide the prioritization process (such as the greedy strategies [1], searchbased strategies [2], and adaptive random strategies [3]).

Approach
In this section, we introduce the details of test case prioritization by code combinations coverage.

Greedy Techniques
There are two widely investigated greedy RTCP strategies: the total strategy and the additional strategy. The total strategy selects test cases in descending order of the number of code units covered by each test case. The additional strategy also selects test cases in descending order, but according to the number of code units not already covered by previously selected test cases. According to previous studies [17,18,19], although seemingly simple, the greedy strategies (especially additional) perform better than most other RTCP techniques in terms of the fault detection rate. Therefore, in our study, we used a simple greedy algorithm to instantiate the CCCP prioritization function for the statement, branch, and method coverage criteria. Because we aim to evaluate the performance of code combinations coverage against traditional code coverage (e.g., statement coverage), and the additional strategy has been widely accepted as the most effective prioritization strategy, we implemented greedy strategies based on the work of Rothermel et al. [1] as the control techniques for evaluating our proposed approach.
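The two greedy strategies can be sketched as follows. This is a simplified Python illustration (the study's control techniques follow Rothermel et al. [1]); `coverage` is a hypothetical map from test id to the set of code units that test covers:

```python
def total_prioritize(coverage):
    """Total greedy: order tests by descending absolute number of
    covered code units (stable on ties)."""
    return sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)

def additional_prioritize(coverage):
    """Additional greedy: repeatedly pick the test covering the most
    units not yet covered; reset the uncovered set once everything
    has been covered."""
    remaining = dict(coverage)
    uncovered = set().union(*coverage.values())
    order = []
    while remaining:
        if not uncovered:                      # all units covered: reset
            uncovered = set().union(*coverage.values())
        best = max(remaining, key=lambda t: len(coverage[t] & uncovered))
        order.append(best)
        uncovered -= coverage[best]
        del remaining[best]
    return order
```

For example, with t1 covering units {1,2,3,4}, t2 covering {1,2,3}, and t3 covering {5,6}, the total strategy yields t1, t2, t3, while the additional strategy yields t1, t3, t2, since t3 contributes new coverage that t2 does not.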

Code Combinations Coverage
Various RTCP approaches, based on different prioritization strategies, have been proposed to reduce regression testing overheads. Many of these approaches use the individual code-unit coverage of a test case to guide the prioritization process. For example, greedy strategies only take the number of covered code units into account, with the code units considered individually and in isolation. This may lose coverage information: regression testing has traditionally used historical testing information to guide future testing, so the degree to which that information is exploited matters. If we consider the combinations between code units, it may be possible to devise strategies that take further advantage of the code coverage information. Our hypothesis is that code units are related, not isolated, and that faults may be triggered by combinations among them. Based on this, we can use richer testing information than traditional RTCP approaches to guide the prioritization process.
In our work, a code unit is a general term for one structural code element: a statement, branch, or method. Consider a program P with m code units that form the code unit set U = {u_1, u_2, ..., u_m}, and a regression test suite T with n test cases (T = {tc_1, tc_2, ..., tc_n}). We define a function isCovered(tc, u) to indicate whether or not a test case tc covers the code unit u: isCovered(tc, u) = 1 if tc covers u, and 0 otherwise. As a result, each test case tc can be represented by an m-wise binary array through the convertTest function: convertTest(tc) = (isCovered(tc, u_1), isCovered(tc, u_2), ..., isCovered(tc, u_m)). To make the positions distinguishable, we then replace each binary entry with an incremental value: the i-th entry (1 ≤ i ≤ m) becomes 2i − 1 if isCovered(tc, u_i) = 1, and 2i otherwise. In other words, an odd number represents the situation where the code unit in question is covered by a given test case, and an even number means that it is not covered. For example, the binary values (1, 1, 1, 0, 0) for a test case tc mean that the first three code units are covered by tc, but that the last two are not; this is represented as tc = (1, 3, 5, 8, 10). In effect, each code unit can be considered a parameter with binary parameter values (an odd number and an even number): the first code unit takes the value 1 or 2; the second code unit takes 3 or 4; and so on. In other words, each test case becomes an abstract test case (as defined in Section 2.1). The λ-wise code combinations coverage (CCC) value of tc against the test set T is defined as the number of λ-wise code-unit-value combinations covered by tc that are not covered by T:

CCC(tc, T, λ) = |CombSet(convertTest(tc), λ) \ CombSet(T′, λ)|,

where T′ = ∪_{tc ∈ T} {convertTest(tc)}.
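The encoding and the CCC value can be sketched in Python (an illustrative re-implementation, not the paper's tooling). Because the values 2i − 1 and 2i are unique to position i, plain value combinations already carry the code-unit identity:

```python
from itertools import combinations

def convert_test(row):
    """Map a binary coverage row to an abstract test case: code unit i
    becomes 2i-1 if covered, 2i if not (odd = covered, even = not)."""
    return tuple(2 * i - 1 if bit else 2 * i
                 for i, bit in enumerate(row, start=1))

def ccc(row, selected_rows, lam):
    """CCC value: number of lambda-wise code-unit-value combinations
    covered by `row` but by none of the already-selected rows."""
    def combs(r):
        return set(combinations(convert_test(r), lam))
    seen = set()
    for r in selected_rows:
        seen |= combs(r)
    return len(combs(row) - seen)
```

For instance, the row (1, 1, 1, 0, 0) from the text maps to (1, 3, 5, 8, 10).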

Code Combinations Coverage based Prioritization
In our model, we view CCCP as a general strategy that can be applied to different prioritization algorithms using different coverage criteria. As greedy strategies are among the most widely-adopted prioritization strategies [1,10], and the additional greedy strategy is considered to be one of the most effective RTCP approaches [3,17,18,20], in terms of fault detection rate, we adopted a simple greedy strategy to instantiate the function for the proposed code combinations coverage.
Generally speaking, the approach chooses, from the candidates, the next test case covering the largest number of λ-wise code-unit-value combinations not already covered by previously selected test cases. In Algorithm 2, one set stores the λ-wise code-unit-value combinations covered by each candidate tc_i; another set, UncoverCombinations, stores all uncovered λ-wise code-unit-value combinations; and the variable flag indicates whether or not all λ-wise code-unit-value combinations have been covered.
In Algorithm 2, Lines 1-24 perform initialization and choose the first test case from the candidates, while Lines 25-49 prioritize the remaining test cases. More specifically, since each candidate test case covers the same number of uncovered λ-wise code-unit-value combinations before prioritization, our approach follows the total and additional test prioritization techniques in choosing the first test case: the one covering the largest number of code units (Lines 9-16). Then, for each remaining test case, the number of λ-wise code-unit-value combinations not covered by the previously selected test cases (S) is calculated, and a candidate with the maximum value is selected as the next test case and appended to S (Lines 31-39). Before choosing the next test case, our approach examines whether any λ-wise code-unit-value combinations remain uncovered by the test cases in S: if none remain, the remaining candidate test cases are prioritized by restarting the previous process (Lines 26-28). Once an element is selected as the next test case, our approach updates the set of uncovered λ-wise code-unit-value combinations (Lines 23 and 46). This process is repeated until all elements of T have been added to S. Similar to additional test prioritization, when facing a tie in which more than one test case covers the largest number of uncovered code-unit-value combinations, our approach randomly selects one. To further explain the details of the proposed approach, Figure 1 illustrates an example of the CCCP process with λ = 1. Similar to the total and additional test prioritization techniques, CCCP chooses the first test case that covers the largest number of code units (the maximum number of odd values). Since there are two candidates with the maximum number of code units, tc_1 and tc_2 (both covering three code units), CCCP randomly chooses one of them (in this case, tc_1) and adds it to S.
CCCP then updates the set UncoverCombinations, and calculates the CCC value for each candidate: CCC(tc_2, S, 1) = 2 and CCC(tc_3, S, 1) = 3. Since tc_3 has the greater CCC value, it is selected as the next test case and added to S. In contrast, the total prioritization technique would choose tc_2 as the second test case, because tc_2 covers more code units than tc_3; and the additional technique would randomly select one of tc_2 and tc_3 as the second test case, because both candidates cover the same number of uncovered code units. Finally, in our approach, the last candidate tc_2 is added to S, resulting in S = (tc_1, tc_3, tc_2).
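A compact sketch of the whole CCCP loop for λ = 1 reproduces this order. The coverage matrix below is hypothetical but consistent with the walkthrough (tc1 covers units 1-3, tc2 covers units 1, 2, and 4, tc3 covers units 3 and 4); unlike Algorithm 2, ties are broken deterministically (first candidate) rather than randomly:

```python
from itertools import combinations

def abstract(row):
    # odd value 2i-1 = unit i covered, even value 2i = not covered
    return tuple(2 * i - 1 if b else 2 * i for i, b in enumerate(row, 1))

def cccp(rows, lam=1):
    """Greedy CCCP sketch: first test = most covered units; then
    repeatedly pick the candidate covering the most lambda-wise
    code-unit-value combinations not yet covered, resetting the
    covered set once every combination has been seen."""
    def combs(r):
        return set(combinations(abstract(r), lam))
    ids = list(range(len(rows)))
    order = [max(ids, key=lambda t: sum(rows[t]))]   # most covered units
    covered = combs(rows[order[0]])
    all_combs = set().union(*(combs(r) for r in rows))
    while len(order) < len(rows):
        if covered >= all_combs:                     # everything covered
            covered = set()
        rest = [t for t in ids if t not in order]
        nxt = max(rest, key=lambda t: len(combs(rows[t]) - covered))
        covered |= combs(rows[nxt])
        order.append(nxt)
    return order
```

On this matrix the sketch selects tc1 first (tie on covered units, first candidate wins), then tc3 (CCC value 3 versus 2), then tc2, matching the example order S = (tc_1, tc_3, tc_2).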

Empirical Study
In this section, we present our empirical study, including the research questions underlying the study. We also discuss some independent and dependent variables, and explain the subject programs, test suites, and experimental setup in detail.

Research Questions
Due to space limitations and practical performance constraints (higher λ values may result in substantially longer running times), we present the evaluation of our proposed approach's performance for λ = 1 and λ = 2. Unless explicitly stated, λ = 1 is used as the default value for CCCP. The empirical study was conducted to answer the following six research questions:
RQ1 How does the testing effectiveness of CCCP, measured by APFD, compare with that of the other RTCP techniques?
RQ2 How does the testing effectiveness of CCCP, measured by APFD_c, compare with that of the other RTCP techniques?
RQ3 How does the code coverage granularity impact the testing effectiveness of CCCP?
RQ4 How does the test case granularity impact the testing effectiveness of CCCP?
RQ5 How does the testing efficiency of CCCP compare with that of the other RTCP techniques?
RQ6 How does the use of code combinations coverage with λ = 2 impact the testing effectiveness of CCCP?

Independent Variables
In this study, we consider the following three independent variables.

Prioritization Techniques
Since our proposed CCCP technique takes the dynamic coverage information and the test cases as inputs, for a fair comparison it is necessary to choose other RTCP techniques that use similar information to guide test case prioritization. In this study, we selected four such RTCP techniques: total test prioritization [1], additional test prioritization [1], adaptive random test prioritization [3], and search-based test prioritization [2]. The total test prioritization technique prioritizes test cases in descending order of the number of code units covered by each test. The additional technique, in contrast, greedily chooses each element from the candidates such that it covers the largest number of code units not yet covered by the previously selected tests. The adaptive random technique greedily selects each element from random candidates such that it has the greatest distance from the already selected tests. Finally, the search-based technique considers all permutations as candidate solutions, and uses a meta-heuristic search algorithm to guide the search for a better test execution order; a genetic algorithm was adopted in this study, due to its effectiveness [2]. These four test prioritization techniques, whose details are presented in Table 1, have been widely used in previous RTCP studies [17,18,20].
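For concreteness, the adaptive random technique can be sketched as below. This is a simplified illustration assuming a maximin selection over Jaccard distances between covered-unit sets, and a hypothetical candidate-set size of 10; the distance measure and candidate policy actually used in the study follow Jiang et al. [3]:

```python
import random

def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| over covered-unit sets."""
    union = a | b
    return 1.0 if not union else 1.0 - len(a & b) / len(union)

def art_prioritize(coverage, candidate_size=10, seed=None):
    """Adaptive random prioritization sketch: from a random candidate
    set, pick the test whose minimum distance to the already selected
    tests is largest (maximin)."""
    rng = random.Random(seed)
    ids = list(coverage)
    order = [rng.choice(ids)]                      # random first test
    while len(order) < len(ids):
        rest = [t for t in ids if t not in order]
        cands = rng.sample(rest, min(candidate_size, len(rest)))
        nxt = max(cands, key=lambda c: min(
            jaccard_distance(coverage[c], coverage[s]) for s in order))
        order.append(nxt)
    return order
```

The intuition is that tests far (in coverage terms) from those already run are more likely to exercise new behavior.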

Code Coverage Granularity
When using code coverage information to support RTCP, the coverage granularity can be considered a constituent part of the prioritization techniques. To enable sufficient evaluations, we used structural coverage criteria at the statement-, branch-, and method-coverage levels.

Test Case Granularity
For the subject programs written in Java, we considered an additional factor in the prioritization techniques, the test-case granularity. Test-case granularity is at either the test-class level, or the test-method level. For the test-class level, each JUnit test-case class was a test case; for the test-method level, each test method in a JUnit test-case class was a test case. In other words, a test case at the test-class level generally involves a number of test cases at the test-method level. For C subject programs, however, the actual program inputs were the test cases.

Dependent Variables
Because we were examining fault detection capability, we adopted the widely-used APFD (average percentage of faults detected) as the evaluation metric for fault detection rates [1]. Given a test suite T with n test cases and m detected faults, let T′ be a permutation of T, and TF_i the position in T′ of the first test case that reveals fault i. The APFD value for T′ is given by:

APFD = 1 − (TF_1 + TF_2 + · · · + TF_m) / (n × m) + 1 / (2n).

Although APFD has often been used to evaluate RTCP techniques, it assumes that each test case incurs the same time cost, an assumption that may not hold in practice. Elbaum et al. [21], therefore, introduced an APFD variant, APFD_c, a cost-cognizant version of APFD that considers both the fault severity and the test case execution cost. Similar to APFD, APFD_c has also been applied to the evaluation of RTCP techniques, resulting in a more comprehensive evaluation [22]. APFD_c is defined as:

APFD_c = ( Σ_{i=1}^{m} α_i × ( Σ_{j=TF_i}^{n} β_j − (1/2) β_{TF_i} ) ) / ( Σ_{j=1}^{n} β_j × Σ_{i=1}^{m} α_i ),

where α_1, α_2, ..., α_m are the severities of the m detected faults, β_1, β_2, ..., β_n are the execution costs of the n test cases, and TF_i has the same meaning as in APFD. Because of the difficulty involved in estimating fault severity, following previous studies [22], we assumed that all faults had the same severity level for this study. Accordingly, the definition of APFD_c reduces to:

APFD_c = ( Σ_{i=1}^{m} ( Σ_{j=TF_i}^{n} β_j − (1/2) β_{TF_i} ) ) / ( Σ_{j=1}^{n} β_j × m ).

Intuitively speaking, prioritized test suites that both find faults faster and require less execution time will have higher APFD_c values.
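Both metrics are straightforward to compute. The following sketch assumes equal fault severities, as in the study; note that with unit execution costs APFD_c reduces exactly to APFD, which makes a convenient sanity check:

```python
def apfd(tf, n):
    """APFD: tf[i] is the 1-based position of the first test revealing
    fault i; n is the number of tests in the prioritized suite."""
    m = len(tf)
    return 1 - sum(tf) / (n * m) + 1 / (2 * n)

def apfd_c(tf, costs):
    """Cost-cognizant APFDc with equal fault severities; costs[j] is
    the execution cost of the (j+1)-th test in the prioritized order."""
    n, m = len(costs), len(tf)
    total = sum(costs)
    # for each fault: cost of the tests from its revealing test onward,
    # minus half the revealing test's own cost
    num = sum(sum(costs[t - 1:]) - 0.5 * costs[t - 1] for t in tf)
    return num / (total * m)
```

For example, with n = 5 unit-cost tests and faults first revealed at positions 1, 1, and 2, APFD = 1 − 4/15 + 1/10 ≈ 0.833, and APFD_c gives the same value.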

Subject Programs, Test Suites and Faults
We conducted our study on 14 versions of four Java programs (three versions of ant; five versions of jmeter; three versions of jtopas; and three versions of xmlsec) downloaded from the Software-artifact Infrastructure Repository (SIR) [23,24], and 30 versions of five real-life Unix utility programs, from the GNU FTP server [25]. Both the Java and C programs have been widely used as benchmarks to evaluate RTCP problems [3,10,19,26]. Table 2 summarizes the subject program details, with Columns 3 to 7 presenting the version, size, number of branches, number of methods, and number of classes (including interfaces), respectively.
Each version of the Java programs has a JUnit test suite that was developed during the program's evolution. These test suites have two levels of test-case granularity: the test-class level and the test-method level. The numbers of JUnit test cases (at both the test-class and test-method levels) are shown in the #Test Case column, as #T Class and #T Method: the data is presented as x (y), where x is the total number of test cases, and y is the number of test cases that can be successfully executed. The test suites for the C programs are available from the SIR [23,24]: the number of test cases in each suite is also shown in the #Test Case column of Table 2.
Because the faults seeded into the SIR programs were easily detected and few in number, for both the C and Java programs we used mutation faults to evaluate the performance of the different techniques. Mutation faults have previously been identified as suitable for simulating real program faults [27,28,29], and have been widely applied in regression test prioritization evaluations [1,10,17,18,19,20,30]. Eleven mutation operators from the "NEW DEFAULTS" group of the PIT mutation tool [31] were used to generate mutants for the Java programs. These operators, whose detailed descriptions can be found on the PIT website [32], were: conditionals boundary, increments, invert negatives, math, negate conditionals, void method calls, empty returns, false returns, null returns, primitive returns, and true returns. For the C programs, we downloaded the mutants from previous RTCP studies [19], which had been generated using seven mutation operators from Andrews et al. [33]: statement deletion, unary insertion, constant replacement, arithmetic operator replacement, logical operator replacement, bitwise logical operator replacement, and relational operator replacement.
Equivalent mutants [34,35] are functionally equivalent versions of the original program, and thus cannot be killed: no test case applied to both the mutant and the original program can produce different behavior. In our study, therefore, all equivalent mutants were removed, leaving only those mutants that could be detected by at least one test case. In Table 2, the #Mutant column gives the total number of mutants (#All) and the number of detected mutants (#Detected). Although all detected mutants were considered in our study, some mutants, called duplicated mutants [35], were equivalent to other mutants (but not to the original program). Similarly, some mutants, called subsumed mutants [36,37], were subsumed by others: if a subsuming mutant [38] is killed, then its subsumed mutants are also killed. We used the Subsuming Mutants Identification (SMI) algorithm [38] to remove the duplicated mutants and identify the subsuming mutants in our fault set. SMI first removed duplicated mutants, and then greedily identified the most subsuming mutants: those mutants which, when killed, result in the highest number of other mutants being "collaterally" killed. The #Subsuming Mutant column gives the number of subsuming mutants used in our study (for the Java programs, the subsuming faults are classified as either test-class level (#SM Class) or test-method level (#SM Method)).

Experimental Setup
The experiments were conducted on a Linux 4.4.0-170-generic server with two Intel(R) Xeon(R) Platinum 8163 CPUs (2.50 GHz, two cores) and 16 GB of RAM.
The Java programs were compiled using Java 1.8 [39]. The coverage information for each program version was obtained using the FaultTracer tool [40,41], which, based on the ASM bytecode manipulation and analysis framework [42], uses on-the-fly bytecode instrumentation without any modification of the target program.
There were six versions of each C program: P V0 , P V1 , P V2 , P V3 , P V4 , and P V5 . Version P V0 was compiled using gcc 5.4.0 [43], and then the coverage information was obtained using the gcov tool [44], which is one of the gcc utilities.
After collecting the code coverage information, we implemented all RTCP techniques in Java, and applied them to each program version under study, for all coverage criteria. Because the approaches contain randomness, each execution was repeated 1000 times. This resulted in, for each testing scenario, 1000 APFD or APFD_c values (1000 orderings) for each RTCP approach. To test whether there was a statistically significant difference between CCCP and the other RTCP approaches, we performed the unpaired two-tailed Wilcoxon-Mann-Whitney test, at a significance level of 5%, following previously reported guidelines for inferential statistical analyses involving randomized algorithms [45,46]. To identify which approach was better, we also calculated the effect size, measured by the non-parametric Vargha and Delaney effect size measure [47], Â12, where Â12(X, Y) gives the probability that technique X is better than technique Y. The statistical analyses were performed using R [48].
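The Â12 statistic itself is simple to compute; a minimal sketch is shown below (the study performed its statistical analyses in R [48]):

```python
def a12(x, y):
    """Vargha-Delaney effect size: the probability that a random value
    drawn from x exceeds one drawn from y, counting ties as one half."""
    greater = sum(1 for xi in x for yi in y if xi > yi)
    ties = sum(1 for xi in x for yi in y if xi == yi)
    return (greater + 0.5 * ties) / (len(x) * len(y))
```

A value of 0.5 indicates no difference between the two samples; values above 0.5 favor the first technique, and values below 0.5 favor the second.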

Results and Analysis
This section presents the experimental results to answer the research questions.
To answer RQ1 to RQ4, Figures 2 to 8 present box plots of the distributions of the APFD and APFD_c values (over 1000 iterations). Each box plot shows the mean (square in the box), median (line in the box), and upper and lower quartiles (25th and 75th percentiles) of the APFD or APFD_c values for the RTCP techniques. Statistical analyses are also provided in Tables 3 to 11 for each pairwise APFD or APFD_c comparison between CCCP and the other RTCP techniques. For a comparison between two methods TCP_ccc and M, where M ∈ {TCP_tot, TCP_add, TCP_art, TCP_search}, a win symbol means that TCP_ccc is better (the p-value is less than 0.05, and the effect size Â12(TCP_ccc, M) is greater than 0.50); a loss symbol means that M is better (the p-value is less than 0.05, and Â12(TCP_ccc, M) is less than 0.50); and the symbol H means that there is no statistically significant difference between them (the p-value is greater than 0.05).
To answer RQ5, Table 12 provides comparisons of the execution times of the different RTCP techniques. To answer RQ6, Figure 9 shows the APFD and APFD_c results for CCCP with λ = 2, and Table 13 presents the corresponding statistical comparisons.

RQ1: Effectiveness of CCCP Measured by APFD

APFD Results: Java Programs (Test-Class Level)

Table 3 presents the statistical effectiveness comparisons of APFD for the Java programs at the test-class level. Based on Figure 2 and Table 3, we have the following observations: (1) Compared with the total test prioritization technique, CCCP achieves better performance for the program xmlsec, irrespective of code coverage granularity, with differences between the mean and median APFD values reaching about 3%. For the other programs (ant, jmeter, and jtopas), however, the APFD results are very similar.
(2) CCCP performs much better than adaptive random test prioritization, regardless of subject program and code coverage granularity, with the maximum mean and median APFD differences reaching about 12%.
(3) CCCP has very similar performance to the additional and search-based test prioritization techniques, with the mean and median APFD differences approximately equal to 1%.
(4) There is a statistically significant difference between TCP_ccc and TCP_art, which supports the above observations. However, none of the other three techniques (TCP_tot, TCP_add, or TCP_search) is always better or always worse than TCP_ccc: TCP_ccc sometimes performs better for some programs, and sometimes worse.
(5) Considering all Java programs: overall, because all p-values are less than 0.05, and the relevant effect size Â12 ranges from 0.58 to 0.98, TCP_ccc performs better than TCP_tot and TCP_art. However, CCCP has very similar (or slightly worse) performance to TCP_add and TCP_search, with Â12 values of either 0.49 or 0.50.

APFD Results: Java Programs (Test-Method Level)

Based on Figure 3 and Table 4, we have the following observations: (1) Our proposed method achieves much higher mean and median APFD values than TCP_tot for all programs and all code coverage granularities, with the maximum differences reaching approximately 30%.
(2) CCCP has very similar performance to TCP add , with their mean and median APFD differences at around 1%.
(3) Other than for some versions of jtopas, CCCP has much better performance than TCP art .

APFD Results: C Subject Programs
Based on Figure 4 and Table 5, we have the following observations: (1) Our proposed CCCP approach has much better performance than TCP tot and TCP add , for all programs and code coverage granularities, except for gzip with method coverage (for which TCP ccc has very similar, or better performance). The maximum difference in mean and median APFD values between TCP ccc and TCP tot is more than 40%; while between TCP ccc and TCP add , it is about 10%.
(2) TCP_ccc has similar or better APFD performance than TCP_art and TCP_search for some subject programs (such as flex and gzip, with all code coverage granularities), but slightly worse performance for some others (such as grep with method coverage, and sed with statement coverage). However, the difference in mean and median APFD values between TCP_ccc and TCP_art or TCP_search is less than 5%.

(3) Overall, the statistical results support the box plot observations. All p-values for the comparisons between TCP_ccc and TCP_tot or TCP_add are less than 0.05, indicating that their APFD scores are significantly different. The Â12 values are also much greater than 0.50, ranging from 0.56 to 1.00 (except for the programs gzip_v3, gzip_v4, and gzip_v5, with method coverage). However, although all p-values for the comparisons between TCP_ccc and TCP_art or TCP_search are also less than 0.05, their Â12 values are much greater than 0.50 in some cases, but much less than 0.50 in others. Nevertheless, considering all the C programs, not only does TCP_ccc have significantly different APFD values to the other four RTCP techniques, but it also has better performance overall (except against TCP_art with method coverage).

Table 6 summarizes the statistical results, presenting the total number of times TCP_ccc is better, worse, or not statistically different (H), compared to the other RTCP techniques. Based on this table, we can answer RQ1 as follows:

1. When prioritizing Java test suites at the test-class level, TCP_ccc performs much better than TCP_art; similarly to TCP_tot and TCP_search; and slightly worse than TCP_add.

2. When prioritizing Java test suites at the test-method level, TCP_ccc performs much better than TCP_tot and (other than for some versions of jtopas) TCP_art, and similarly to TCP_add.

Table 7: Statistical effectiveness comparisons of APFD_c for Java programs at the test-class level.
For a comparison between two methods TCP ccc and M, where M ∈ {TCP tot, TCP add, TCP art, TCP search}: TCP ccc is considered better when the p-value is less than 0.05 and the effect size Â 12 (TCP ccc, M) is greater than 0.50; M is considered better when the p-value is less than 0.05 and Â 12 (TCP ccc, M) is less than 0.50; and there is no statistically significant difference between them when the p-value is greater than 0.05. In conclusion, overall, the proposed CCCP approach is more effective than the other four RTCP techniques, in terms of testing effectiveness, as measured by APFD.
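The Â 12 effect size used in these comparisons is the Vargha-Delaney A measure: the probability that a score drawn from one technique exceeds a score drawn from the other, with ties counted as half. A minimal sketch follows; the function name and the sample APFD values are illustrative only, not taken from the study's data.

```python
from itertools import product

def a12(scores_x, scores_y):
    """Vargha-Delaney A measure: probability that a randomly chosen
    score from scores_x beats one from scores_y (ties count as 0.5)."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x, y in product(scores_x, scores_y))
    return wins / (len(scores_x) * len(scores_y))

# Hypothetical APFD samples from repeated runs of two techniques:
ccc = [0.91, 0.88, 0.93, 0.90]
tot = [0.72, 0.75, 0.70, 0.74]
print(a12(ccc, tot))  # 1.0: the first technique wins every pairwise comparison
```

A value above 0.50 favors the first technique, below 0.50 the second, and exactly 0.50 indicates no stochastic difference; the paper pairs this with a significance test at the 0.05 level.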

RQ2: Effectiveness of CCCP Measured by APFD c
Next, we provide the APFD c results for CCCP for different code coverage and test case granularities. Figures 5 to 7 show the APFD c results for the Java programs at the test-class level, the Java programs at the test-method level, and the C programs, respectively. Each sub-figure in these figures has the program versions across the x-axis, and the APFD c values for the five RTCP techniques on the y-axis. Tables 7 to 9 present the corresponding statistical comparisons.

APFD c Results: Java Programs (Test-Class Level)
Based on Figure 5 and Table 7, we have the following observations: (1) Compared with TCP tot, TCP ccc has much better APFD c results for the programs ant and jtopas, irrespective of program version and code coverage granularity, with the maximum difference between the mean and median values being up to about 30%. For the programs jmeter and xmlsec, however, TCP ccc performs worse than TCP tot, with a maximum APFD c difference of about 15%.
(2) Although TCP ccc performs better than TCP art in many cases (for example, with jmeter, for all code coverage granularities), it also sometimes performs worse (including with ant v2 and ant v3 for the branch and method coverage levels). Nevertheless, the TCP ccc APFD c values have much lower variation than TCP art .
(3) TCP ccc has very similar performance to TCP add and TCP search , sometimes performing slightly better (for example, with jtopas, using statement coverage) or worse (such as with xmlsec v3, for method coverage). The differences among the mean and median APFD c values for the three RTCP techniques are very small, at most, about 5%.
(4) Overall, the statistical analyses support the box plot observations. Looking at all Java programs together, TCP ccc is better than TCP tot and TCP art for all code coverage granularities: The p-values are much less than 0.05, and the Â 12 values are greater than 0.50.

APFD c Results: Java Programs (Test-Method Level)
Based on Figure 6 and Table 8, we have the following observations: (1) Apart from some cases with the program ant (for example, ant v1), TCP ccc has much better APFD c performance than TCP tot, with the maximum mean and median differences reaching about 50%.
(2) TCP ccc performs much better than TCP art for the programs ant and jmeter, with the maximum mean and median APFD c differences being about 30%. In contrast, TCP art performs much better than TCP ccc for the program jtopas. For the program xmlsec, however, neither TCP art nor TCP ccc is always better: At branch coverage level, for example, TCP ccc is better for version v2; TCP art is better for version v3; and they have similar performance for version v1.
(3) The TCP ccc and TCP add APFD c distributions are very similar in most cases, which means that, on the whole, both techniques have very similar effectiveness (according to APFD c ).
(4) Although there are some large performance differences between TCP ccc and TCP search (such as for ant v2 and xmlsec, with branch coverage), overall, they have similar mean and median APFD c values. In most cases, TCP ccc has lower variation in APFD c values than TCP search.
Table 8: Statistical effectiveness comparisons of APFD c for Java programs at the test-method level. For a comparison between two methods TCP ccc and M, where M ∈ {TCP tot, TCP add, TCP art, TCP search}: TCP ccc is considered better when the p-value is less than 0.05 and the effect size Â 12 (TCP ccc, M) is greater than 0.50; M is considered better when the p-value is less than 0.05 and Â 12 (TCP ccc, M) is less than 0.50; and there is no statistically significant difference between them when the p-value is greater than 0.05.
Table 9: Statistical effectiveness comparisons of APFD c for C programs. For a comparison between two methods TCP ccc and M, where M ∈ {TCP tot, TCP add, TCP art, TCP search}: TCP ccc is considered better when the p-value is less than 0.05 and the effect size Â 12 (TCP ccc, M) is greater than 0.50; M is considered better when the p-value is less than 0.05 and Â 12 (TCP ccc, M) is less than 0.50; and there is no statistically significant difference between them when the p-value is greater than 0.05.
(5) Overall, the statistical analyses support the box plot observations. Looking at all Java programs together, TCP ccc is better than TCP tot and TCP art, with p-values much less than 0.05, and Â 12 values greater than 0.50.

APFD c Results: C Programs
Based on Figure 7 and Table 9, we have the following observations: (1) Except for a very few cases (such as gzip v3, gzip v4, and gzip v5, with method coverage), TCP ccc has much higher APFD c values than TCP tot and TCP add, with the maximum mean and median APFD c differences between TCP ccc and TCP tot reaching more than 35%, and between TCP ccc and TCP add being about 15%.
(2) TCP ccc performs differently compared with TCP art and TCP search for different programs and different code coverage granularities: With the program flex, for example, for all versions and code coverage granularities, TCP ccc is more effective; however, with make, for both statement and branch coverage, TCP art and TCP search are more effective.
(3) Overall, the statistical results support the box plot observations. Considering all C programs, the p-values for all comparisons between TCP ccc and TCP tot, TCP add, TCP art, and TCP search are less than 0.05, indicating that the APFD c scores are all significantly different. According to the effect size Â 12 values, TCP ccc outperforms TCP tot and TCP add; and performs slightly better than TCP search and TCP art (except at the method coverage level). In conclusion, overall, the proposed CCCP approach is more effective than the four other RTCP techniques, in terms of testing effectiveness, as measured by APFD c.

RQ3: Impact of Code Coverage Granularity
In this study, we examined three types of code coverage (statement, branch, and method). According to the APFD results (Figures 2 to 4) and APFD c results (Figures 5 to 7), in spite of some cases where the three types of code coverage provide very different APFD or APFD c results for CCCP, they do, overall, deliver comparable performance. This means that the choice of code coverage granularity may have little overall impact on the effectiveness of CCCP. Figure 8 presents the APFD and APFD c results for the three types of code coverage, according to the subject programs' language or test suites (the language or test case granularity is shown on the x-axis; the APFD scores on the left y-axis; and the APFD c scores on the right y-axis). It can be observed that for C programs, statement and branch coverage are very comparable, and both are more effective than method coverage. For Java programs, however, there is no single best choice among them, because they have similar APFD and APFD c values. Table 11 presents a comparison of the mean and median APFD and APFD c values, and also shows the p-values/effect size Â 12 results for the different code coverage granularity comparisons. It can be seen from the table that the APFD and APFD c values are similar, with the maximum mean and median value differences being less than 3% and less than 8%, respectively. According to the statistical comparisons, there is no single best code coverage type for CCCP, with each type sometimes achieving the best results. Nevertheless, branch coverage appears slightly more effective than statement and method coverage for CCCP.
In conclusion, the code coverage granularity may have only a small impact on CCCP testing effectiveness, with branch coverage possibly performing slightly better than statement and method coverage.
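The APFD metric used throughout these comparisons rewards orderings that expose faults early: for n tests and m faults, APFD = 1 - (ΣTF_i)/(n·m) + 1/(2n), where TF_i is the position of the first test revealing fault i. A minimal sketch follows (APFD c additionally weights by test cost and fault severity, which is omitted here); the test and fault names are hypothetical, and every fault is assumed to be detected by at least one test.

```python
def apfd(order, fault_matrix):
    """APFD for one test ordering.
    order: test ids in execution order.
    fault_matrix: test id -> set of faults that test detects."""
    faults = set().union(*fault_matrix.values())
    n, m = len(order), len(faults)
    first_pos = {}
    for pos, t in enumerate(order, start=1):
        for f in fault_matrix.get(t, set()):
            first_pos.setdefault(f, pos)  # record first detecting position
    return 1.0 - sum(first_pos[f] for f in faults) / (n * m) + 1.0 / (2 * n)

# Toy example: 4 tests, 3 faults (hypothetical detection data)
detects = {"t1": {"f1"}, "t2": {"f2", "f3"}, "t3": set(), "t4": {"f1"}}
print(apfd(["t2", "t1", "t3", "t4"], detects))  # ~0.79: faults found early
```

Running the fault-revealing tests first (t2, then t1) yields a higher score than the reversed ordering, which is exactly the property the prioritization techniques compete on.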

RQ4: Impact of Test Case Granularity
To answer RQ4, we focus on the Java programs, each of which had two levels of test cases (the test-class and test-method levels). As can be seen in the comparisons between Figures 2 and 3, and between Figures 5 and 6, CCCP usually has significantly lower average APFD and APFD c values when prioritizing test cases at the test-class level than at the test-method level.
Considering all the Java programs, as can be seen in Table 11, the mean and median APFD and APFD c values at the test-method level are much higher than at the test-class level, regardless of code coverage granularity. One possible explanation for this is: Because a test case at the test-class level consists of a number of test cases at the test-method level, prioritization at the test-method level may be more flexible, giving better fault detection effectiveness [10].
In conclusion, CCCP has better fault detection effectiveness when prioritizing test cases at the test-method level than at the test-class level.

Table 12 presents the time overheads, in milliseconds, for the five RTCP techniques. The "Comp." column presents the compilation times of the subject programs, and the "Instr." column presents the instrumentation time (to collect the statement, branch, and method coverage information). Apart from the first four columns, each cell in the table shows the prioritization time of each RTCP technique, for each program, presented as µ/σ (where µ is the mean time and σ is the standard deviation over the 1000 independent runs).

RQ5: CCCP Efficiency
The Java programs had each version individually instrumented to collect the code coverage information, with different versions using different test cases. Because of this, the execution time was collected for each Java program version. In contrast, the base version (V0) of each C program was compiled and instrumented to collect the code coverage information for each test case, and all program versions used the same test cases. Because of this, each C program version has the same compilation and instrumentation time. Furthermore, because all the studied RTCP techniques prioritized test cases after the coverage information was collected, they were all deemed to have the same compilation and instrumentation time for each version of each program.
Based on the time overheads, we have the following observations: (1) As expected, the time overheads for all RTCP techniques (including CCCP) were lowest with method coverage, followed by branch, and then statement coverage, irrespective of test case type. The reason for this, as shown in Table 2, is that the number of methods is much lower than the number of branches, which in turn is lower than the number of statements; the related converted test cases are thus shorter, requiring less time to prioritize.
(2) It was also expected that (for the Java programs) prioritization at the test-method level would take longer than at the test-class level, regardless of code coverage granularity. The reason for this, again, relates to the number of test cases to be prioritized at the test-method level being more than at the test-class level.
(3) TCP ccc requires much less time to prioritize test cases than TCP art and TCP search , and very similar time to TCP add , irrespective of subject program, and code coverage and test case granularities. Also, because TCP tot does not use feedback information during the prioritization process, it has much faster prioritization speeds than TCP ccc .
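The feedback mechanism mentioned above is the defining trait of the "additional" greedy strategy, which CCCP also builds on (replacing individual code units with their λ-wise combinations). A minimal sketch of the additional strategy follows; the toy coverage data is hypothetical.

```python
def additional_greedy(coverage):
    """'Additional' greedy prioritization: repeatedly pick the test that
    covers the most code units not yet covered by earlier picks; when no
    remaining test adds new coverage, reset the feedback and continue.
    coverage: test id -> set of covered code units."""
    all_units = set().union(*coverage.values())
    remaining = dict(coverage)
    uncovered = set(all_units)
    order = []
    while remaining:
        best = max(remaining, key=lambda t: len(remaining[t] & uncovered))
        if not remaining[best] & uncovered and uncovered != all_units:
            uncovered = set(all_units)  # feedback exhausted: reset and retry
            continue
        order.append(best)
        uncovered -= remaining.pop(best)
    return order

cov = {"t1": {1, 2}, "t2": {2, 3, 4}, "t3": {5}}
print(additional_greedy(cov))  # ['t2', 't1', 't3']
```

The total strategy, by contrast, ranks tests once by their full coverage counts and never updates `uncovered`, which is why it prioritizes faster but detects faults later.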
In conclusion, TCP ccc prioritizes test cases faster than TCP art and TCP search; has similar speed to TCP add; but is slower than TCP tot.

RQ6: CCCP Effectiveness with λ = 2

To answer RQ6, this section briefly discusses the effectiveness of CCCP when λ = 2. Figure 9 shows the detailed APFD and APFD c results for the C programs, with the code coverage granularity on the x-axis, and the y-axis giving the APFD or APFD c scores. For ease of presentation, TCP ccc1 and TCP ccc2 denote CCCP with λ = 1 and λ = 2, respectively. Table 13 presents the statistical comparisons of the TCP ccc2 APFD and APFD c scores with those of the other five RTCP techniques: Each data cell shows the p-value/effect size Â 12 value. Based on the experimental data, we have the following observations: (1) TCP ccc2 has higher mean and median APFD and APFD c values than TCP tot and TCP add, and better or similar values to TCP art, TCP search, and TCP ccc1, regardless of code coverage granularity.
(2) The statistical results confirm the box plot observations. All p-values are much less than 0.05, indicating a statistically significant difference between TCP ccc2 and each of the other five RTCP techniques, for both APFD and APFD c. The Â 12 results also show TCP ccc2 to outperform TCP tot, TCP add, TCP art, TCP search, and TCP ccc1, with probabilities ranging from 74% to 97%, 62% to 77%, 52% to 60%, 54% to 62%, and 53% to 56%, respectively.
These observations partly confirm our hypothesis about the performance of CCCP: As the combination strength λ increases, more testing information is available to guide prioritization, which may result in improved performance.
Finally, regarding the prioritization time: TCP ccc2 requires about 351073.3, 159881.4, and 501.8 milliseconds for the C programs when using statement, branch, and method coverage, respectively. This low prioritization time of 501.8 milliseconds for TCP ccc2 with method coverage is less than the prioritization time for TCP add using statement or branch coverage (2192.0 and 973.6 milliseconds, respectively). As shown in Figure 9, TCP ccc2 with method coverage has comparable fault detection effectiveness to TCP add with statement or branch coverage. Because method coverage is usually much less expensive to collect than statement or branch coverage, TCP ccc2 with method coverage should be a better choice than TCP add with statement or branch coverage. Furthermore, method-level coverage, which is the most natural for projects that are large in scale and high in complexity, has greater potential practical application than the statement and branch criteria, making TCP ccc2 with method coverage more feasible (2-wise code combinations coverage may incur much more complex calculations for statement and branch coverage than for method coverage).
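The cost gap between granularities at λ = 2 follows from simple combinatorics. As a sketch only, under our reading of the criterion, each combination of λ code units is paired with a test's covered/uncovered pattern over those units; the function name, coverage vector, and unit counts below are all illustrative.

```python
from itertools import combinations
from math import comb

def code_combinations(cov_vector, lam):
    """lam-wise code combinations exercised by one test: each choice of
    lam code units, paired with this test's covered/uncovered pattern
    over those units (illustrative reading of the criterion)."""
    idx = range(len(cov_vector))
    return {(units, tuple(cov_vector[u] for u in units))
            for units in combinations(idx, lam)}

vec = [1, 0, 1, 0]  # hypothetical binary coverage over 4 code units
print(len(code_combinations(vec, 1)))  # 4 combinations at strength 1
print(len(code_combinations(vec, 2)))  # 6 combinations at strength 2

# The combination space grows quadratically at lam = 2, so statement- or
# branch-level vectors cost far more to process than method-level ones:
print(comb(300, 2), comb(10000, 2))  # e.g. 300 methods vs 10000 statements
```

With λ = 1 this degenerates to plain code coverage, matching the observation that CCCP at the lowest strength has prioritization cost comparable to the additional technique.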

Practical Guidelines
Here, we present some practical guidelines for how to choose the combination strength and code coverage level for CCCP, under different testing scenarios: (1) When testing resources are limited, it is recommended that the lowest combination strength (λ = 1) be chosen for CCCP. This not only achieves better testing effectiveness than the other prioritization techniques, but also has comparable testing speed to the additional test prioritization technique.
(2) When there are sufficient testing resources available, λ = 2 is recommended for CCCP, because of the higher fault detection rates it can deliver.
(3) If the system under test is large in scale and high in complexity, method coverage is recommended to be used for CCCP.

Threats to Validity
To facilitate the investigation of potential threats and to support the replication of experiments, we have made the relevant materials (including source code, subject programs, test suites, and mutants) available on our project website: https://github.com/huangrubing/CCCP/. Despite this, our study still faces some threats to validity, as follows.

Internal Validity
The main threat to internal validity lies in the implementation of our experiment. First, the randomized computations may affect the performance of CCCP: To address this, we repeated the prioritization process 1000 times and used statistical tests to assess the strategies. Second, the data structures used in the prioritization algorithms, and the faults in the source code, may introduce noise when evaluating the effectiveness and efficiency: To minimize these threats, we used data structures that were as similar as possible, and carefully reviewed all source code before conducting the experiment. Third, although we used the APFD and APFD c metrics, which have been extensively adopted to assess the performance of RTCP techniques, APFD only reflects the rate at which faults are detected, ignoring the time and space costs, and APFD c assumes that all faults have the same fault severity. To address this threat, our future work will involve additional metrics that can measure other practical performance aspects of prioritization strategies.

External Validity
All the programs used in the experiment were medium-sized, and written in C or Java, which means that the results may not be generalizable to programs written in other languages (such as C++ and C#) or of different sizes. To reduce this threat, other relevant programs will be adopted to evaluate CCCP performance. Mutation testing has been argued to be an appropriate approach for assessing fault detection performance [27,28,29], and has also been used in recent RTCP research studies [17,18,19,20]. However, Luo et al. [49] have highlighted the differences between real faults and mutants, explaining that the relative performances of RTCP techniques on mutants may not translate to similar relative performances with real faults. To address this threat, additional studies will be conducted in the future to investigate the performance of RTCP on programs with real regression faults.

Related Work
A considerable amount of research has been conducted into regression testing techniques with the goal of improving testing performance. This includes test case prioritization [1,50], reduction [51,52], and selection [53,54]. This Related Work section focuses on test case prioritization, which aims to detect faults as early as possible through the reordering of regression test cases [55,56].
Prioritization Strategies. The most widely investigated prioritization strategies are the total and additional techniques [1]. Because existing greedy strategies may produce suboptimal results, Li et al. [2] translated the RTCP problem into a search problem and proposed several search-based algorithms, including hill-climbing and genetic algorithms. Motivated by random tie-breaking, Jiang et al. [3] applied adaptive random testing to RTCP and proposed a family of adaptive random test case prioritization techniques that aim to select the test case with the greatest distance from those already selected.
More recently, as the total strategy and the additional strategy can be seen as two extreme instances, Zhang et al. [10] proposed a basic and an extended model to unify the two strategies. Saha et al. [5] proposed an RTCP approach based on information retrieval, without dynamic profiling or static analysis. Many existing RTCP approaches use code coverage to schedule the test cases, but do not consider the likely distribution of faults. To address this limitation, instead of traditional code coverage, Wang et al. [11] used quality-aware code coverage, calculated by code inspection techniques, to guide the prioritization process.
Coverage criteria. In terms of coverage criteria, structural coverage has been widely adopted in test case prioritization. In addition to statement [1], branch [3], method [10,11], block [2] and modified condition/decision coverage [57], Elbaum et al. [30] proposed a fault-exposing-potential (FEP) criterion based on the probability of the test case detecting a fault. Recently, Chi et al. [58] used function call sequences, arguing that basic structural coverage may not be optimal for dynamic prioritization.
Empirical studies. A large number of empirical studies have been performed aiming to offer practical guidelines for using RTCP techniques.
In addition to studies on traditional dynamic test prioritization [1,30,59,60], recently, Lu et al. [20] were the first to investigate how real-world software evolution impacts on the performance of prioritization strategies: They reported that source code changes have a low impact on the effectiveness of traditional dynamic techniques, but that the opposite was true when considering new tests in the process of evolution.
Citing a lack of comprehensive studies comparing static and dynamic test prioritization techniques, Luo et al. [17,18] compared static RTCP techniques with dynamic ones. Henard et al. [19] compared white-box and black-box RTCP techniques.

Conclusions and Future work
In this paper, we have introduced a new coverage criterion that combines the concepts of code and combination coverage. Based on this, we proposed a new prioritization technique, code combinations coverage based prioritization (CCCP). Results from our empirical studies have demonstrated that CCCP with the lowest combination strength (λ = 1) can achieve better fault detection rates than four well-known, popular prioritization techniques (total, additional, adaptive random, and search-based test prioritization). CCCP was also found to have comparable testing efficiency to the additional test prioritization technique, while requiring much less time to prioritize test cases than the adaptive random and search-based techniques. The results also show that CCCP with a higher combination strength (λ = 2) can be more effective than all other prioritization techniques, in terms of both APFD and APFD c .
Our future work will include examining more real-life programs to further investigate the performance of CCCP, including the impact of combination strengths. In this paper, we have only applied the concept of code combinations coverage to the traditional greedy prioritization strategy. It will be interesting to examine new prioritization techniques based on code combinations coverage that adopt other prioritization strategies, such as search-based strategies.