Abstract Test Case Prioritization Using Repeated Small-Strength Level-Combination Coverage

Abstract test cases (ATCs) have been widely used in practice, including in combinatorial testing and in software product line testing. When constructing a set of ATCs, due to limited testing resources in practice (e.g., in regression testing), test case prioritization (TCP) has been proposed to improve the testing quality, aiming at ordering test cases to increase the speed with which faults are detected. One intuitive and extensively studied TCP technique for ATCs is <italic><inline-formula><tex-math notation="LaTeX">$\lambda$</tex-math></inline-formula>-wise Level-combination Coverage based Prioritization</italic> (<inline-formula><tex-math notation="LaTeX">$\lambda$</tex-math></inline-formula>LCP), a static, black-box prioritization technique that only uses the ATC information to guide the prioritization process. A challenge facing <inline-formula><tex-math notation="LaTeX">$\lambda$</tex-math></inline-formula>LCP, however, is that an appropriate fixed prioritization strength <inline-formula><tex-math notation="LaTeX">$\lambda$</tex-math></inline-formula> must be chosen before testing begins. Choosing higher <inline-formula><tex-math notation="LaTeX">$\lambda$</tex-math></inline-formula> values may improve the testing effectiveness of <inline-formula><tex-math notation="LaTeX">$\lambda$</tex-math></inline-formula>LCP (e.g., by finding faults faster), but may reduce the testing efficiency (by incurring additional prioritization costs). Conversely, choosing lower <inline-formula><tex-math notation="LaTeX">$\lambda$</tex-math></inline-formula> values may improve the efficiency, but may also reduce the effectiveness. 
In this paper, we propose a new family of <inline-formula><tex-math notation="LaTeX">$\lambda$</tex-math></inline-formula>LCP techniques, <italic>Repeated Small-strength Level-combination Coverage-based Prioritization</italic> (RSLCP), that repeatedly achieves full level-combination coverage at small strengths. RSLCP maintains <inline-formula><tex-math notation="LaTeX">$\lambda$</tex-math></inline-formula>LCP's advantages of being static and black box, but avoids the challenge of prioritization strength selection. We have performed an empirical study involving five different versions of each of five C programs. Compared with <inline-formula><tex-math notation="LaTeX">$\lambda$</tex-math></inline-formula>LCP and <italic>Incremental-strength LCP</italic> (ILCP), our results show that RSLCP provides a good tradeoff between testing effectiveness and efficiency. Our results also show that RSLCP is more effective and efficient than two popular <italic>Similarity-based Prioritization</italic> (SP) techniques. In addition, our results show that RSLCP remains robust over multiple system releases.

Index Terms—Abstract test case, level-combination coverage, regression testing, software testing, test case prioritization.

I. INTRODUCTION
In practice, software systems are usually influenced by different parameters or factors (such as configuration options and user inputs), with each parameter having a finite set of possible levels or values. An abstract test case (ATC) represents a combination of levels of different parameters, and has been used in different testing situations, including combinatorial testing [1], software product line testing [2], and highly configurable system testing [3].
When an ATC set has been constructed, it is desirable to execute all the test cases, in which case execution order does not matter. However, because testing resources are often limited, it may only be possible to run some of the ATCs in the set. In such situations, the ATC execution order may become critical: a well-prioritized test case execution sequence may identify failures more quickly, and thus enable earlier fault characterization, diagnosis, and correction [1]. Generally speaking, the process of scheduling the order of test cases is called Test Case Prioritization (TCP) [4], and the prioritization of ATCs is called Abstract Test Case Prioritization (ATCP) [5].
Many strategies have been proposed to guide ATCP according to different criteria, for example, random TCP [6], [7], and Similarity-based Prioritization (SP) [8]–[10]. The most widely used ATCP technique is λ-wise Level-combination Coverage-based Prioritization (λLCP) [11], which adopts a fixed strength λ (called the prioritization strength) to choose each ATC in a greedy manner: when selecting each next ATC from the candidates, λLCP calculates the number of parameter-level combinations at the fixed prioritization strength λ covered by each candidate that have not yet been covered by executed test cases, and then chooses the candidate with the maximum number of uncovered λ-wise parameter-level combinations. λLCP has many advantages, including that it is simple and intuitive [11]. Furthermore, because it only uses the level-combination coverage information derived from the test cases, rather than information from the source code or program execution, λLCP is a static, black-box technique [12].
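To make the greedy selection concrete, the core of λLCP can be sketched as follows. This is a minimal Python sketch under our own naming; in particular, the deterministic first-seen tie-breaking here is a simplification, since λLCP breaks ties randomly.

```python
from itertools import combinations

def schemas(tc, strength):
    """All strength-wise level combinations (schemas) covered by one ATC,
    each encoded as a tuple of (parameter index, level) pairs."""
    return {tuple((i, tc[i]) for i in idx)
            for idx in combinations(range(len(tc)), strength)}

def lcp(candidates, strength):
    """Greedy lambda-wise LCP sketch: repeatedly pick the candidate covering
    the most not-yet-covered schemas (ties broken by candidate order here)."""
    remaining, covered, sequence = list(candidates), set(), []
    while remaining:
        best = max(remaining, key=lambda tc: len(schemas(tc, strength) - covered))
        covered |= schemas(best, strength)
        sequence.append(best)
        remaining.remove(best)
    return sequence

# Three ATCs over k = 3 binary parameters, prioritized at strength lambda = 2.
order = lcp([(0, 0, 0), (0, 0, 1), (1, 1, 1)], strength=2)
```

After the first pick, the candidate sharing no levels with it covers three new 2-wise schemas while the near-duplicate covers only two, so the dissimilar candidate is scheduled second.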
Although previous studies have shown that λLCP is an effective prioritization technique in terms of fault detection [6], [11], [13], [14], it has the constraint that the fixed strength λ must be set before prioritization begins. Different λ values may lead to different performance, with investigations [13], [14] finding that larger λ values may improve the testing effectiveness of λLCP (e.g., by finding faults faster), but may reduce the testing efficiency (by incurring additional prioritization costs); conversely, lower λ values may improve the efficiency, but may also reduce the effectiveness.
Although it is intuitive that a small prioritization strength for λLCP may be efficient (in terms of overheads), it has also been shown that over 50% of faults can be triggered by one parameter (1-wise combination coverage), and more than 70% can be triggered by two (2-wise combination coverage) [15], [16]. This indicates that choosing a small prioritization strength may be effective for λLCP, especially when the number of ATCs is small. However, when the ATC candidate set is large, λLCP with small prioritization strengths may become ineffective [17]. This is because, once the small-strength (1-wise or 2-wise) level-combination coverage is fully achieved by the selected or executed ATCs, the remaining candidates are effectively randomly ordered.
In this paper we propose a new family of λLCP techniques, Repeated Small-strength Level-combination Coverage-based Prioritization (RSLCP). RSLCP attempts to overcome the limitations of current LCP versions by better balancing the tradeoff between testing effectiveness and efficiency. In particular, RSLCP begins with a small prioritization strength λ (λ = 1, 2) to run the λLCP algorithm, which means that RSLCP is initially identical to λLCP. Once the λ-wise level-combination coverage of the selected or executed ATCs is fully achieved, i.e., the number of λ-wise level combinations covered by the selected ATCs equals that covered by all candidates, RSLCP restarts with the same prioritization strength λ, achieving full λ-wise level-combination coverage again in the next round. This process is repeated until all candidates have been chosen. RSLCP has the following three advantages: 1) it is very simple, adopting a mechanism similar to λLCP; 2) like λLCP, it is a static, black-box prioritization method, using only the level-combination coverage to guide the ATCP (so it is not necessary to obtain source code information, nor to execute the program); and 3) unlike λLCP, it is not necessary to set the prioritization strength before prioritizing ATCs.
To evaluate the proposed technique, we conducted empirical studies on five C programs, each of which had five different versions. In summary, the main contributions of this paper are as follows: 1) We propose a new strategy to guide ATC prioritization, Repeated Small-strength Level-combination Coverage-based Prioritization (RSLCP), and describe a framework to support it. 2) Based on the proposed framework, we provide two categories (using two different strategies) involving six algorithms to implement RSLCP.
3) We report on empirical studies investigating each RSLCP technique, comparing with λLCP and SP, from the perspectives of: testing effectiveness (the speed of interaction coverage and fault detection); testing efficiency (the prioritization cost); and robustness [how well the overall fault detection potential is maintained across different versions of the software under test (SUT)].
The rest of this paper is organized as follows. Section II describes the background information. Section III introduces the RSLCP method, including the framework, algorithm, complexity analysis, and mechanism. Section IV presents the research questions and experimental setup. Section V reports on the empirical studies conducted to answer the research questions. Section VI reviews related work, and, finally, Section VII concludes the paper.

II. BACKGROUND
In this section, we describe some background information about ATCs and TCP.

A. Abstract Test Case
Consider an SUT with k parameters constituting a parameter set P = {p_1, p_2, ..., p_k}, with a corresponding level set L = {L_1, L_2, ..., L_k}, where each parameter p_i takes valid levels from the finite set L_i (i = 1, 2, ..., k). In practice, parameters may represent anything that influences the behavior of the SUT, such as components, configuration options, and user inputs. Let Q be the set of constraints on level combinations.
Definition 2.1 (Input Parameter Model): An input parameter model (or input model) for the SUT, denoted Model(P, L, Q), is a model of the SUT that includes the set P of parameters that may influence the SUT, the set L of level sets for each parameter, and the set Q of constraints on level combinations.
For example, Fig. 1 shows a screenshot of the font settings for Microsoft Word 2013. As shown in the red box, we only consider the Effects aspect of the font settings, which offers seven choices. Table I gives an input parameter model for the font effects of Microsoft Word 2013. The effects have seven parameters, each of which can take two levels. It is not possible for both parameters in any of the pairs (Strikethrough, Double Strikethrough), (Superscript, Subscript), or (Small caps, All caps) to be "Yes" at the same time; therefore, there are three level-combination constraints. To simplify the representation of this problem, each parameter is denoted p_i (i = 1, 2, ..., 7), and each level is labelled by an integer (starting at 0), as shown in Table I.
Definition 2.3 (η-wise Level Combination): An η-wise level combination is a k-tuple (l_1, l_2, ..., l_k) involving η parameters with fixed levels (called fixed parameters) and (k − η) parameters with arbitrary allowable levels (called free parameters, denoted "−"), where 0 ≤ η ≤ k. An η-wise level combination is also called an η-wise schema [1]. Without loss of generality, to describe the problem more clearly, free parameters can be ignored; in other words, an η-wise level combination can be considered an η-tuple.
For ease of description, we define a function ψ(η, tc) for an ATC tc that returns the set of all η-wise level combinations covered by tc. Similarly, a function ψ(η, T) for a set T of test cases returns the set of all η-wise level combinations covered by the test cases in T, i.e., ψ(η, T) = ⋃_{tc ∈ T} ψ(η, tc). Note that each ATC covers exactly C(k, η) η-wise level combinations, i.e., |ψ(η, tc)| = C(k, η).
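A minimal Python sketch of these two functions follows; encoding each level combination as a tuple of (parameter index, level) pairs is our own representational choice, not prescribed by the paper.

```python
from itertools import combinations
from math import comb

def psi_tc(eta, tc):
    """psi(eta, tc): the set of all eta-wise level combinations covered by
    one ATC, encoded as tuples of (parameter index, level) pairs."""
    return {tuple((i, tc[i]) for i in idx)
            for idx in combinations(range(len(tc)), eta)}

def psi_suite(eta, T):
    """psi(eta, T): the union of psi(eta, tc) over all test cases tc in T."""
    return set().union(*(psi_tc(eta, tc) for tc in T))

tc = (0, 1, 0, 1)                        # one ATC over k = 4 parameters
assert len(psi_tc(2, tc)) == comb(4, 2)  # each ATC covers C(k, eta) schemas
```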

B. Test Case Prioritization
TCP seeks to schedule test cases such that those with higher priority, according to some criterion, are executed earlier than those with lower priority. When testing resources are limited or insufficient for executing all test cases in a test suite, a well-designed test case execution order can be crucial. The problem of TCP is defined as follows [4]:
Definition 2.5 (Test Case Prioritization): Given a tuple (T, Ω, g), where T is a test suite, Ω is the set of all possible permutations of T, and g is a fitness function from Ω to the real numbers, the goal of TCP is to find a prioritized test suite (also called a test sequence) S ∈ Ω such that g(S) ≥ g(S′) for all S′ ∈ Ω.
According to Rothermel et al. [4], prioritization can be performed according to many possible criteria, including, for example, code coverage [18]. To date, many TCP strategies have been proposed, based on various concepts, including: fault severity [19]; source code coverage [4], [20], [21]; search-based techniques [18]; integer linear programming [22]; risk exposure [23]; historical records from recent regression tests [24]; and information retrieval [25], [26]. Most strategies can be classified as either meta-heuristic search methods or greedy methods [27]. When TCP is applied to ATCs, it is called abstract TCP (ATCP) [5].

III. REPEATED SMALL-STRENGTH LEVEL-COMBINATION COVERAGE-BASED PRIORITIZATION
In this section, we present a new family of λLCP techniques that work by repeatedly achieving small-strength level-combination coverage. We call these techniques RSLCP. We introduce two RSLCP versions, and present an analysis of the space and time complexity of each.

A. Framework
Unlike λLCP, RSLCP limits the prioritization strength to 1 or 2, so no value needs to be assigned to λ before prioritizing ATCs.
As shown in Fig. 2, RSLCP prioritizes an unordered set of ATCs (denoted T) into a prioritized set S that is divided into α (α ≥ 1) disjoint and ordered parts S_1, S_2, ..., S_α, where each S_i (i = 1, 2, ..., α) has been prioritized using a prioritization strength λ_i. Formally, the following five conditions must be satisfied: Condition 1 means each subset S_i is both non-empty and ordered. Condition 2 means that all test cases are divided amongst the α subsets. Condition 3 means that S is ordered by sequencing S_1, S_2, ..., S_α successively, i.e., S_{j+1} follows S_j (1 ≤ j < α). According to Condition 4, no test case belongs to more than one test sequence; finally, Condition 5 means that S_i is constructed using the prioritization strength λ_i.
Although the value of α is fully determined by the given T, it does not affect the framework or the following algorithms. We created two versions of the framework: an independent version and a partially independent version.

1) RSLCP Independent Version:
The RSLCP independent version (RSLCP-IV) guarantees that the construction of S_{i+1} is independent of the construction of S_i (1 ≤ i < α). Formally, the following two conditions must be satisfied: Condition 1 means that each subset S_i adopts a small strength (1 or 2) to guide the λLCP process. Condition 2 means that each subset S_i covers all λ_i-wise level combinations that could be covered by the candidates remaining before S_i is constructed.
Although the constructions of the test sequences S_i and S_{i+1} (1 ≤ i < α) are algorithmically independent, the construction of S_i may still affect that of S_{i+1}, because S_{i+1} is constructed using only those test cases remaining after S_i's construction. The algorithms used to prioritize each test sequence S_i are presented in Section III-B.
2) RSLCP Partially Independent Version: The RSLCP partially independent version (RSLCP-PV) is similar to the independent version, but the construction of some subsets S_{i+1} is based on that of S_i. The following three conditions must be satisfied (assuming S_0 = ∅, with x a positive integer): Condition 1 differs from that of RSLCP-IV by assigning a prioritization strength of 1 to each S_i when i is odd, and a strength of 2 when i is even. Conditions 2 and 3 mean that, when i is odd, the corresponding test sequence S_i is constructed independently to achieve the highest 1-wise level-combination coverage; when i is even, S_i is constructed so as to guarantee that S_i and S_{i−1} together cover the same 2-wise level combinations as those covered by the remaining candidates. In effect, RSLCP-PV first uses a prioritization strength of 1 to construct the subset S_{2x−1}, and then treats S_{2x−1} as the already selected ATCs for the construction of the test sequence S_{2x}. This process is then repeatedly applied to the remaining candidates.

B. Algorithm
Algorithm 1 describes the basic RSLCP procedure, which iteratively constructs each S_i (i = 1, 2, ..., α) (Line 4). Once an S_i is completely constructed, it is appended to the end of S (Line 5), and its elements are removed from the candidate set T (Line 6). The test sequence S_{i+1} can then be constructed, with such constructions continuing until all candidates have been chosen. Although the construction of the test sequence S_i uses no information from the construction of S_j (1 ≤ i ≠ j ≤ α), the construction of S_i can still impact that of S_j for i < j, because S_j is constructed from the candidates remaining after S_i's construction.
We propose an algorithm to complete the construction process for each S_i. The algorithm draws on the well-known Additional Greedy Approach [18], which iteratively selects the element with the maximum weight (for the problem at hand) from those not yet selected or executed. For the construction of S_i, the goal is to cover the maximum number of λ_i-wise level combinations not yet covered by test cases that have already been selected or executed. Algorithm 2 describes the Additional Greedy algorithm used to construct S_i for RSLCP-IV, and Algorithm 3 describes it for RSLCP-PV.
1) RSLCP-IV Algorithm: As shown in Algorithm 2, the RSLCP-IV algorithm chooses one of the candidates as the next ATC in S_i such that it covers the maximum number of λ_i-wise level combinations not yet covered by the already selected or executed ATCs in S_i (Line 5). If more than one candidate has the highest λ_i-wise level-combination coverage, then a random tie-breaking mechanism [28] is used to select one best candidate. This process is repeated until either of the following two conditions is satisfied (Line 3): 1) all candidates have been selected (i.e., T = ∅); or 2) S_i achieves full λ_i-wise level-combination coverage against T [i.e., ψ(λ_i, S_i) = TempSet, where TempSet is the set of λ_i-wise level combinations covered by the ATCs remaining after S_{i−1} is completely constructed].
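Algorithms 1 and 2 together can be sketched as follows. This is a simplified Python sketch under our own naming; deterministic first-seen tie-breaking replaces the random tie-breaking of Algorithm 2.

```python
from itertools import combinations

def schemas(tc, strength):
    """All strength-wise level combinations covered by one ATC."""
    return {tuple((i, tc[i]) for i in idx)
            for idx in combinations(range(len(tc)), strength)}

def construct_si(T, strength):
    """Algorithm 2 sketch: greedily extend S_i until it covers every
    strength-wise schema the remaining candidates (TempSet) can cover."""
    temp_set = set().union(*(schemas(tc, strength) for tc in T))
    covered, s_i, rest = set(), [], list(T)
    while rest and covered != temp_set:
        best = max(rest, key=lambda tc: len(schemas(tc, strength) - covered))
        covered |= schemas(best, strength)
        s_i.append(best)
        rest.remove(best)
    return s_i, rest

def rslcp_iv(T, strength_of_round):
    """Algorithm 1 sketch: repeatedly run Algorithm 2 on the remaining
    candidates; strength_of_round(i) supplies lambda_i for round i."""
    rest, s, i = list(T), [], 0
    while rest:
        i += 1
        s_i, rest = construct_si(rest, strength_of_round(i))
        s += s_i
    return s

# Pure 1-wise RSLCP-IV on four ATCs over two binary parameters.
order = rslcp_iv([(0, 0), (1, 1), (0, 1), (1, 0)], lambda i: 1)
```

Here the first round stops after two picks, because two complementary ATCs already cover all 1-wise level combinations; the remaining candidates are then prioritized in a fresh round rather than at random.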
For each prioritization strength λ_i (1 ≤ i ≤ α) used in constructing S_i, we use the following five assignment categories:
1) Pure 1-wise RSLCP-IV: Each prioritization strength λ_i is assigned a value of 1.
2) Pure 2-wise RSLCP-IV: Similar to Pure 1-wise RSLCP-IV, but each prioritization strength λ_i is assigned a value of 2.
3) (1 + 2)-wise RSLCP-IV: This category alternates between 1 and 2 for the prioritization strengths: for S_i where i is odd, λ_i is assigned a value of 1; when i is even, λ_i is assigned a value of 2.
4) (2 + 1)-wise RSLCP-IV: This category is the reverse of the (1 + 2)-wise RSLCP-IV category: for S_i with even i, λ_i is assigned a value of 1; for S_i with odd i, λ_i is assigned a value of 2.

5) Random Assignment RSLCP-IV: In this category, each prioritization strength λ_i is randomly assigned either 1 or 2: λ_i = rand(1, 2), where rand(x, y) is a function returning a random integer in the range [x, y].
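The five assignment categories amount to simple mappings from the round index i (1-based) to λ_i; the helper names below are our own, not from the paper.

```python
import random

pure_1wise   = lambda i: 1                     # 1) always strength 1
pure_2wise   = lambda i: 2                     # 2) always strength 2
one_plus_two = lambda i: 1 if i % 2 else 2     # 3) odd rounds 1, even rounds 2
two_plus_one = lambda i: 2 if i % 2 else 1     # 4) odd rounds 2, even rounds 1
random_mix   = lambda i: random.randint(1, 2)  # 5) lambda_i drawn from {1, 2}

assert [one_plus_two(i) for i in range(1, 6)] == [1, 2, 1, 2, 1]
assert [two_plus_one(i) for i in range(1, 6)] == [2, 1, 2, 1, 2]
```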

2) RSLCP-PV Algorithm:
The RSLCP-PV algorithm (Algorithm 3) is similar to the (1 + 2)-wise RSLCP-IV algorithm. The construction of S_i for odd values of i (S_{2x−1}, 1 ≤ x ≤ ⌈α/2⌉) uses the same mechanism as the RSLCP-IV algorithm (a prioritization strength of 1), so this part is independent of previous constructions (Line 3). However, when constructing S_i for even values of i (S_{2x}), although the same prioritization strength of 2 is used, this part is partially dependent rather than completely independent: information about the 2-wise level combinations covered by the ATCs in S_{2x−1} is used (Line 5). Random tie-breaking [28] is again used when more than one candidate covers the same maximum number of level combinations.
The RSLCP-PV algorithm first uses a prioritization strength of 1 to prioritize the ATCs. When 1-wise level-combination coverage has been fully achieved by S_{2x−1}, a strength of 2 is then used. In effect, the RSLCP-PV algorithm uses incremental prioritization strengths (from 1 to 2) to construct each pair of subsets.
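A simplified Python sketch of RSLCP-PV follows (our own naming and deterministic first-seen tie-breaking; Algorithm 3 itself breaks ties randomly):

```python
from itertools import combinations

def schemas(tc, strength):
    """All strength-wise level combinations covered by one ATC."""
    return {tuple((i, tc[i]) for i in idx)
            for idx in combinations(range(len(tc)), strength)}

def greedy_phase(rest, strength, covered):
    """Greedily pick from rest (in place) until `covered` reaches every
    strength-wise schema that rest could still contribute."""
    target = covered | set().union(*(schemas(tc, strength) for tc in rest))
    picked = []
    while rest and covered != target:
        best = max(rest, key=lambda tc: len(schemas(tc, strength) - covered))
        covered |= schemas(best, strength)
        picked.append(best)
        rest.remove(best)
    return picked

def rslcp_pv(T):
    """Each round: full 1-wise coverage independently (S_{2x-1}), then full
    2-wise coverage treating S_{2x-1} as already selected (S_{2x})."""
    rest, sequence = list(T), []
    while rest:
        s_odd = greedy_phase(rest, 1, set())
        covered2 = set().union(*(schemas(tc, 2) for tc in s_odd))
        sequence += s_odd + greedy_phase(rest, 2, covered2)
    return sequence

order = rslcp_pv([(0, 0), (1, 1), (0, 1), (1, 0)])
```

The seeding of the strength-2 phase with `covered2` is what makes the even-indexed subsets partially dependent on their odd-indexed predecessors.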

C. Complexity Analysis
In this section, we provide a brief analysis of both the space and time complexity of RSLCP. We first introduce the data structure used to store the λ_i-wise level combinations. A two-layer hierarchical data structure, denoted H_all, is used to store all λ_i-wise level combinations derived from the input parameter model. The first layer of H_all is an array of C(k, λ_i) elements, each of which is a parameter combination of size λ_i; in other words, this array contains all possible λ_i-wise parameter combinations. Each parameter combination in the first layer is actually a pointer to the next layer. Each structure in the second layer is a bitmap for all λ_i-wise level combinations derived from the corresponding λ_i-wise parameter combination. Each bitmap uses a single bit per λ_i-wise level combination, with a value of 1 indicating that the relevant level combination has already been covered by previously selected ATCs, and a value of 0 meaning that it has not yet been covered.
For each candidate tc ∈ T, we use an array H_each of size C(k, λ_i), each element of which stores the index of tc's λ_i-wise level combination for the corresponding parameter combination in the second layer of H_all. To check whether a λ_i-wise level combination is covered, its index is used to locate the relevant position in the bitmap.
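A Python sketch of this bookkeeping follows; dictionaries and integer bitmasks stand in for the arrays and bitmaps, and the mixed-radix index encoding is our own assumption about how H_each indices would be computed.

```python
from itertools import combinations

def build_h_all(levels, strength):
    """First layer: one entry per strength-wise parameter combination.
    Second layer: an integer bitmask, one bit per level combination
    (bit = 1 once that combination is covered). `levels` holds |L_i|."""
    return {params: 0
            for params in combinations(range(len(levels)), strength)}

def index_of(tc, params, levels):
    """H_each entry: bit index of tc's level combination for `params`
    (mixed-radix encoding of the chosen levels)."""
    idx = 0
    for p in params:
        idx = idx * levels[p] + tc[p]
    return idx

def mark_covered(h_all, tc, levels):
    """Set the bit for each schema tc covers; return how many were new.
    Each membership check is O(1), as described above."""
    new = 0
    for params in h_all:
        bit = 1 << index_of(tc, params, levels)
        if not h_all[params] & bit:
            h_all[params] |= bit
            new += 1
    return new

levels = [2, 2, 2]               # three binary parameters
h_all = build_h_all(levels, 2)   # C(3, 2) = 3 parameter combinations
```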
1) Space Complexity: We next present an analysis of the space complexity of RSLCP, which is determined by two parameters: 1) the number of candidates, n; and 2) the number of η-wise (η ∈ {1, 2}) level combinations derived from the input parameter model.
Because each candidate covers C(k, η) η-wise level combinations, the space complexity for parameter (1) is O(n × C(k, η)). The space complexity for parameter (2) is determined by the input parameter model. As described above, the data structure used to store the possible η-wise level combinations, H_all, has two layers. The first layer contains all η-wise parameter combinations, giving a space complexity of O_1 = O(C(k, η)). The space complexity of the second layer, O_2, depends on the number of η-wise level combinations derived from each parameter combination. Because η is limited to a value of either 1 or 2, the best space complexity is obtained when η = 1, and the worst when η = 2. Of the different versions of RSLCP, only Pure 1-wise RSLCP-IV achieves the best space complexity.
2) Time Complexity: We next present an analysis of the time complexity of RSLCP, which is also determined by two parameters: 1) the number of candidates involved, n; and 2) the cost of calculating the uncovered η-wise level combinations for each candidate.
Regarding parameter (1), when selecting the ith test case from the candidates, RSLCP needs to check each of the (n − i + 1) remaining candidates. For parameter (2), there is a need to check whether or not the η-wise level combinations covered by each candidate tc are covered by previously selected ATCs. Since H_each stores the index of each η-wise level combination, this check takes O(1) time per level combination, and hence O(C(k, η)) time per candidate. The RSLCP time complexity is therefore O(n² × C(k, η)). Similar to the space complexity analysis, RSLCP has the best time complexity, O(n² × k), when η = 1, and the worst, O(n² × k²), when η = 2. Again, of the different RSLCP versions, only Pure 1-wise RSLCP-IV has the best time complexity.
Previous investigations [27], [29] have shown that the order of time complexity of λLCP is O(n² × C(k, λ)). This means that when 1 ≤ λ ≤ ⌊k/2⌋, the prioritization time of λLCP generally increases as λ increases; when ⌊k/2⌋ < λ ≤ k, the prioritization time generally decreases as λ increases. As discussed by Petke et al. [13], [14], λ is generally assigned a value between 1 and 6, which means that λ is generally less than ⌊k/2⌋, especially when k is large. Since RSLCP's order of time complexity is O(n² × C(k, η)) (where η is equal to 1 or 2), it is expected that RSLCP would have similar testing efficiency to λLCP when λ is 1 or 2. However, RSLCP should be more efficient than λLCP when λ is 3, 4, 5, or 6.
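The comparison is driven by the growth of C(k, λ). For instance, with k = 10 parameters (the value k = 10 is purely illustrative):

```python
from math import comb

# Per-candidate bookkeeping cost of lambda-LCP grows with C(k, lambda),
# peaking at lambda = k // 2; RSLCP is capped at the lambda <= 2 cases.
k = 10
costs = {lam: comb(k, lam) for lam in range(1, 7)}
print(costs)  # {1: 10, 2: 45, 3: 120, 4: 210, 5: 252, 6: 210}
```

Even at this modest k, strength 6 costs over four times as much per candidate as strength 2, which is why capping the strength at 2 keeps RSLCP's prioritization overhead low.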

D. Discussion
This section briefly explains why RSLCP should achieve improvements over λLCP. RSLCP attempts to provide a tradeoff between testing effectiveness and efficiency when prioritizing ATCs. The time complexity analysis showed that the testing efficiency of RSLCP should be similar to or better than that of λLCP, which means that RSLCP is an efficient ATCP technique. The rest of this analysis, therefore, addresses how RSLCP should provide comparable testing effectiveness, comparing RSLCP with λLCP for different λ values.
1) When 1 ≤ λ ≤ 2: As discussed earlier (Section III-A), RSLCP uses a strength of either 1 or 2 when applying λLCP to the prioritization of ATCs, which means that it covers 1-wise or 2-wise level combinations as quickly as λLCP. RSLCP should therefore have testing effectiveness at least similar to λLCP. Furthermore, when the 1-wise or 2-wise level combinations have been fully covered by the already selected ATCs, λLCP randomly prioritizes the remaining ATCs, whereas RSLCP repeats the λLCP process on them, which should outperform random prioritization (e.g., in terms of the speed of covering higher-strength level combinations).
2) When 3 ≤ λ ≤ 6: RSLCP should be faster at covering 1-wise or 2-wise level combinations than λLCP, because this is the basic principle of RSLCP. Compared with λLCP, RSLCP may, however, be slower at covering high-strength level combinations. Nevertheless, because RSLCP repeatedly achieves full 1-wise or 2-wise interaction coverage, it may still cover high-strength level combinations reasonably quickly: for example, if a candidate ATC tc covers a set of 1-wise level combinations that have not been covered by previously selected ATCs, tc may also cover a set of new λ-wise level combinations. In other words, RSLCP may sometimes provide testing effectiveness comparable to λLCP even when λ is high.

IV. EXPERIMENTAL SETUP
In this section, we present the research questions related to the testing effectiveness and efficiency of our proposed techniques, and describe the experiments we conducted to answer them.

A. Research Questions
In the field of TCP, two important issues are: 1) prioritization effectiveness; and 2) prioritization efficiency. Generally speaking, prioritization effectiveness is measured by the rate of fault detection. However, due to the characteristics of ATCs, prioritization effectiveness can also be measured by the rate of interaction coverage. In this paper, therefore, we focus on the rates of interaction coverage and fault detection with respect to effectiveness. Furthermore, when a new version of the SUT is released, the original prioritized test suite may become less effective: the initial test ordering might no longer be optimal. It would be helpful, therefore, for testers to know how well a test suite prioritization technique maintains its fault detection potential (its robustness) over multiple releases of the system. The following four research questions were designed to examine the testing effectiveness, prioritization costs, and robustness of RSLCP.
RQ1: How well do the six RSLCP versions perform?
RQ1.1: How well do the five RSLCP-IV algorithms perform?
RQ1.2: How well does the RSLCP-PV algorithm compare with the RSLCP-IV algorithms?
Answering RQ1 will help testers know which RSLCP technique is the most effective or efficient. The two sub-questions are designed to further investigate the best RSLCP-IV algorithms and the differences between the RSLCP-IV and RSLCP-PV algorithms.
RQ2: How well does RSLCP compare with λLCP? As discussed, RSLCP attempts to balance the tradeoff between testing effectiveness and efficiency in λLCP. Answering RQ2 should make it clear whether or not RSLCP can achieve testing effectiveness comparable to current λLCP techniques, which would help clarify whether it should be considered a cost-effective alternative.
RQ3: How does RSLCP compare with other widely used prioritization techniques such as Incremental-strength LCP (ILCP) and Similarity-based Prioritization (SP)?
ILCP is another ATCP technique that avoids the prioritization strength selection required by λLCP, while SP is considered an efficient prioritization technique. Answering RQ3 would therefore enable a better understanding of the testing effectiveness and efficiency of RSLCP (compared with those of ILCP and SP), helping to decide whether or not it is more cost-effective.
RQ4: How robust is RSLCP across multiple releases of the SUT?
Answering RQ4 will help identify the robustness of RSLCP, and whether or not it degrades over multiple releases of the system.

B. Subject Programs
In our empirical study, we considered five versions of each of five programs (giving a total of 25 different programs) written in the C programming language. The five programs, obtained from the GNU FTP server, were: a tool for lexical analysis (flex); two widely used command-line tools for searching and processing text matching regular expressions (grep and sed); a widely used compression utility (gzip); and a popular utility used to control the compile and build processes of programs (make).
These subject programs have been widely used in TCP research [4], [6], [13], [14], [29], [32]–[36]. Table II gives the program details, including the input parameter model, the number of ATCs obtained from the Software-artifact Infrastructure Repository (SIR) [37], the program size excluding comments in lines of code (measured by cloc), the program version number, and the number of faults in each version. The ATC set for each program was constructed using the test specification language [38]. Apart from the program make (for which some ATCs were removed due to unsuccessful execution), the ATCs used cover all valid level combinations at each strength.

C. Compared Techniques

For each compared technique, we give its description, prioritization objective, and corresponding reference in the literature. Because RSLCP is a new version of LCP, we also considered another λLCP version, denoted λW, which we investigated for six λ values (λ = 1, 2, 3, 4, 5, 6), following previous studies [6], [11], [14], [29]–[31]. In addition, we compared our methods with two other widely used ATCP techniques: ILCP [29] and SP [8]. ILCP makes use of incremental strengths, beginning with λ = 1, to run λLCP. We examined two versions of SP [8], global SP (GSP) and local SP (LSP): GSP initially selects the two elements with the minimum similarity as the first two test cases, and then iteratively chooses as the next test case the element with the minimum Jaccard similarity against previously selected test cases; LSP, in contrast, iteratively chooses a pair of test cases with the minimum Jaccard similarity until all candidates have been chosen [8].
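For concreteness, GSP can be sketched as follows in Python. Treating each ATC as a set of (parameter, level) pairs, and aggregating the similarity against previously selected test cases by taking the maximum, are our own assumptions; reference [8] should be consulted for the exact formulation.

```python
def jaccard(a, b):
    """Jaccard similarity of two ATCs viewed as sets of (parameter, level) pairs."""
    sa, sb = set(enumerate(a)), set(enumerate(b))
    return len(sa & sb) / len(sa | sb)

def gsp(T):
    """Global SP sketch (assumes len(T) >= 2): start with the least similar
    pair, then repeatedly pick the candidate whose maximum Jaccard
    similarity to the already selected test cases is smallest."""
    rest = list(T)
    first = min(((a, b) for i, a in enumerate(rest) for b in rest[i + 1:]),
                key=lambda pair: jaccard(*pair))
    sequence = list(first)
    for tc in first:
        rest.remove(tc)
    while rest:
        best = min(rest, key=lambda tc: max(jaccard(tc, s) for s in sequence))
        sequence.append(best)
        rest.remove(best)
    return sequence
```

On three ATCs over two binary parameters, the two test cases sharing no levels form the least similar pair and are scheduled first, with the remaining candidate appended last.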

D. Fault Seeding
For each of the subject programs, the original version contains no seeded faults. Although a number of hand-seeded faults are available from the SIR [37], many of these faults are easily detected (on average, more than 60% of test cases can reveal them). In this paper, therefore, we used mutation analysis [39] to seed faults (see Table II). As discussed in previous studies [40], [41], mutation analysis can provide more realistic faults than hand-seeding, and may be more appropriate for studying TCP. Compared with real faults, however, the correlation between mutant-killing ability and real fault detection may become weak when the test suite size is kept constant [42]. Nevertheless, the detection of real faults should improve significantly when test suites attain the highest levels of mutant kills [42]. In this paper, each mutant could be killed by the given test suite.
For the five subject programs, we used the same mutation faults as used by Henard et al. [35]. More specifically, for each version Vi (1 ≤ i ≤ 5) of each subject program, the same mutation operators used by Andrews et al. [40] were adopted to produce the faulty versions (mutants) for our paper. The operators used were: constant replacement; statement deletion; unary insertion; arithmetic operator replacement; relational operator replacement; logical operator replacement; and bitwise logical operator replacement. As discussed by Henard et al. [35], equivalent and duplicated mutants were eliminated using the Trivial Compiler Equivalence (TCE) tool [44], resulting in about one third of the mutants being removed. This was done to reduce interference in the fault detection evaluation of each prioritization technique. Furthermore, as suggested by Papadakis et al. [45], subsumed mutants [46] (also called disjoint mutants [47]) were also identified and discarded [35] to avoid biasing the experimental results [43]. The subsumed mutants were identified by executing all ATCs against each mutant and recording the failure-causing ATCs.
The performance of several ATCP techniques may depend on the Failure-Triggering Fault Interaction (FTFI) number of each mutant in each subject program, i.e., the number of parameters required to trigger a failure [15], [48]. Table IV shows the FTFI number distribution for each program.

E. Evaluation Metrics
In this paper, we focused on the testing effectiveness and efficiency of RSLCP, from the perspectives of interaction coverage, fault detection, and prioritization cost.

1) Interaction Coverage Metric:
The rate of interaction coverage was used to evaluate how quickly the prioritized test suite covers level combinations. The Average Percentage of τ-wise Covering-array Coverage (APCC) [14], also called the Average Percentage of Combinatorial Coverage [27], was used to measure the rate of interaction coverage of strength τ achieved by prioritized ATCs. Its definition is as follows:

Definition 4.1 (Average Percentage of τ-wise Covering-array Coverage): Suppose S = ⟨t1, t2, ..., tn⟩ is a prioritized set of ATCs with size n. The APCC of S at strength τ is

APCC_τ(S) = (|C1| + |C2| + · · · + |Cn|) / (n × |C|),

where Ci is the set of τ-wise level combinations covered by the first i test cases of S, and C is the set of all τ-wise level combinations covered by S.

The APCC metric values range from 0.0 to 1.0, with higher values indicating better rates of interaction coverage at a specific strength τ. In this paper, following previous studies [14], we considered APCC with τ = 1, 2, 3, 4, 5, and 6.
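A short Python sketch of this prefix-averaged coverage rate may help make the metric concrete. Encoding ATCs as tuples of levels, and normalizing by the set of τ-wise combinations covered by the whole suite, are our reading of the definition rather than the paper's implementation.

```python
from itertools import combinations

def combos(tc, strength):
    """Strength-wise level combinations covered by one abstract test case."""
    return {(pos, tuple(tc[p] for p in pos))
            for pos in combinations(range(len(tc)), strength)}

def apcc(prioritized, strength):
    """APCC sketch: the mean, over all prefixes of the ordered ATC set, of
    the fraction of strength-wise level combinations already covered."""
    total = set().union(*(combos(tc, strength) for tc in prioritized))
    covered, acc = set(), 0.0
    for tc in prioritized:
        covered |= combos(tc, strength)
        acc += len(covered) / len(total)
    return acc / len(prioritized)
```

For instance, over two binary parameters, the order (0,0), (1,1), (0,1), (1,0) covers half of the 1-wise combinations after one test and all of them after two, giving an APCC of (0.5 + 1 + 1 + 1)/4 = 0.875 at τ = 1.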
2) Fault Detection Metric: We used the fault detection rate of each prioritization technique as the fault detection metric. A well-known measure is the Average Percentage of Faults Detected (APFD) [4], which measures the fault detection rate of a given prioritized test suite. Higher APFD values indicate better prioritized test sequences. The APFD is defined as follows:

Definition 4.2 (Average Percentage of Faults Detected): Suppose T is a test suite containing n test cases, and F is a set of m faults revealed by T. Let SFi be the number of test cases in the prioritized test suite S of T that are executed before detecting fault fi. The APFD of S is calculated using the following equation (from Rothermel et al. [4]):

APFD(S) = 1 − (SF1 + SF2 + · · · + SFm) / (n × m) + 1/(2n).

3) Efficiency Metric: The prioritization cost measures how quickly each prioritized test suite is constructed, and was used to represent the efficiency of each technique. Clearly, a lower prioritization cost means better efficiency.
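The APFD computation can be illustrated with a minimal sketch that applies the standard Rothermel et al. formula to the 1-based positions of the first test case detecting each fault (the list-of-positions encoding is an assumption of this sketch):

```python
def apfd(first_detect_positions, n):
    """Average Percentage of Faults Detected (Rothermel et al.):
    APFD = 1 - (sum of first-detection positions) / (n * m) + 1 / (2n),
    where each position is the 1-based index of the first test case in the
    prioritized suite that reveals the corresponding fault, n is the suite
    size, and m is the number of faults."""
    m = len(first_detect_positions)
    return 1.0 - sum(first_detect_positions) / (n * m) + 1.0 / (2 * n)
```

For example, a suite of five test cases whose first and second tests each reveal one of two faults scores 1 − 3/10 + 1/10 = 0.8; if the only fault is found by the last test, the score drops to 0.1.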

F. Inferential Statistical Analysis
Because some prioritization strategies involve randomization (due to the random tie-breaking technique [28]), we ran each experiment 1000 times, as suggested in previous studies [49].
As part of the investigation, we wanted to determine the statistical significance of any differences between the APCC or APFD values used to evaluate each prioritization technique. Many statistical tests exist for this purpose, such as the t-test and the Wilcoxon-Mann-Whitney test [49]. Because there was no relationship among the 1000 iterations, we used an unpaired test [35]. Furthermore, because no assumptions were made about which prioritization technique would be better, a two-tailed test was used [35]. Following previous guidelines on inferential statistical approaches for randomized algorithms [49], [50], we used the unpaired two-tailed Wilcoxon-Mann-Whitney test to check statistical significance (at a significance level of 5%).
Since we compared multiple prioritization techniques, we report the p-values, which indicate whether or not the differences between two techniques are statistically significant. When the p-value for two techniques M1 and M2 is less than 5%, the difference between M1 and M2 is considered significant; otherwise, it is not. As Henard et al. [35] explained, however, as the number of executions increases, p-values can become arbitrarily small, indicating only that some difference between the two algorithms exists; when the p-value is very small, it may be difficult to identify which algorithm is actually the better one. We therefore also used a different statistical measure, the effect size, which is generally measured by the non-parametric Vargha and Delaney effect size measure [51], Â12.
The Vargha and Delaney effect size measure provides more useful information when comparing two different algorithms. Â12(M1, M2) = 0.50, for example, would indicate that, in the sample, there is no difference between algorithms M1 and M2; Â12(M1, M2) > 0.50 would mean that M1 is superior to M2; and Â12(M1, M2) < 0.50 would mean that M2 is superior to M1. The further the Â12 value is from 0.50, the larger the effect size. Based on previous work [51], we classify the effect size into four categories: no-difference (|Â12(M1, M2) − 0.50| = 0); small (0 < |Â12(M1, M2) − 0.50| < 0.14); medium (0.14 ≤ |Â12(M1, M2) − 0.50| < 0.21); and large (|Â12(M1, M2) − 0.50| ≥ 0.21).
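The Â12 statistic itself is straightforward to compute from two samples of APCC or APFD values: it is the probability that a value drawn from the first sample exceeds one drawn from the second, with ties counted as half. A minimal sketch:

```python
def a12(sample1, sample2):
    """Vargha-Delaney A12 effect size: probability that a value from
    sample1 exceeds a value from sample2, counting ties as half a win."""
    wins = sum((x > y) + 0.5 * (x == y)
               for x in sample1 for y in sample2)
    return wins / (len(sample1) * len(sample2))
```

Identical samples yield 0.5 (no difference); a sample that dominates the other entirely yields 1.0.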

V. RESULTS
This section presents the results of the experiments conducted, and answers the research questions. In the displayed results, each box plot shows the distribution of the APCC or APFD values over the 1000 iterations, listed horizontally across the figure. Each box plot shows the mean (square in the box), median (line in the box), upper and lower quartiles, and minimum and maximum values for the prioritization technique. In addition, a statistical analysis is given for each pairwise APCC or APFD comparison of prioritization techniques. For a comparison between two methods M1 and M2, we use × to denote that there is no statistical difference between them (i.e., the p-value is greater than 0.05); ✔ to denote that M1 is significantly better (the p-value is less than 0.05, and the effect size Â12(M1, M2) is greater than 0.50); and ✗ to denote that M2 is significantly better (the p-value is less than 0.05, and Â12(M1, M2) is less than 0.50).

A. Interaction Coverage Results
In this section, we answer RQ1, RQ2, and RQ3 from the perspective of the interaction coverage rates. Figs. 3-7 present the APCC results for the programs flex, grep, gzip, make, and sed, respectively. Each figure shows the APCC results for the different strength values, i.e., τ = 1, 2, 3, 4, 5, and 6. Tables V and VI show the detailed Wilcoxon test APCC results, at the 0.05 significance level, for each comparison.
1) RQ1: RSLCP Techniques: Here, we answer the two sub-questions of RQ1 (RQ1.1 and RQ1.2) according to APCC, and then briefly analyze each observation.

a) RQ1.1: RSLCP-IV techniques: Based on the experimental results, we can observe the following. i) When τ = 1, all RSLCP-IV techniques have very similar APCCs for all programs, because their 1-wise APCCs have very similar distributions. According to the statistical analysis (Table V), however, IV1 and IV3 generally have the best performances, followed by IV5, regardless of subject program (apart from the programs gzip and make). ii) When 2 ≤ τ ≤ 6, IV2 overall has the best performance for all programs, followed by IV4; IV1 is worst, followed by IV5 and IV3. The statistical analysis (Table V) also confirms these observations.

The main reason for the first observation is that both IV1 and IV3 initially use λ = 1 for prioritizing ATCs, which is the same mechanism as 1W: 1W chooses as the next test case an element covering the largest number of 1-wise level combinations not yet covered by the previously selected ATCs. When all 1-wise level combinations have been covered by the selected ATCs, the order of the remaining ATCs does not change the 1-wise APCC value. Therefore, IV1 and IV3 perform very similarly, and have better 1-wise APCCs than the other RSLCP-IV techniques. Regarding the difference for the programs gzip and make, a possible reason may be that an element covering the largest number of uncovered 2-wise level combinations (selected as the next test case by IV2, IV4, and IV5) could cover a comparable number of uncovered 1-wise level combinations to IV1 and IV3, due to the characteristics of the input parameter model (i.e., each parameter contains a similar number of levels).
The second observation can be explained as follows: Similar to the case of IV1 and IV3, IV2 and IV4 have the same APCC values at τ = 2. However, IV2 repeatedly covers the entire set of 2-wise level combinations, which may provide a faster speed of covering level combinations at τ > 2 than the other RSLCP-IV techniques. Similarly, IV1 only repeatedly covers the entire set of 1-wise level combinations, which may not provide higher rates of interaction coverage at higher strengths. From the perspective of the interaction coverage rate, the answer to RQ1.1 is: Overall, IV2 has the best performance among all RSLCP-IV techniques, followed by IV4; and IV1 is generally the worst.

b) RQ1.2: RSLCP-IV versus RSLCP-PV: Based on the experimental data, we have the following observations: i) When τ = 1, PV has very similar APCC values to the RSLCP-IV techniques for all programs. Apart from the programs gzip and make, the statistical analysis shows that there is no significant difference between PV and either IV1 or IV3, while the differences between PV and IV2, IV4, and IV5 are significant. In other words, the statistical analysis shows that PV performs similarly to IV1 and IV3, but better than the other RSLCP-IV techniques, when τ = 1.
ii) When 2 ≤ τ ≤ 6, PV is worse than IV2 for all programs, but performs better than IV1, IV3, and IV5; the statistical analysis confirms these box plot observations. PV has APCC values comparable to IV4; however, the statistical analysis shows that IV4 has significantly better 2-wise APCCs than PV, while for higher τ values the opposite is true: PV has significantly better τ-wise APCCs than IV4.

A plausible reason for the first observation is that, as was the case for IV1 and IV3, PV uses 1W at the start of the prioritization process, which guarantees that it covers 1-wise level combinations at similar speeds to IV1 and IV3, and at higher speeds than the other RSLCP-IV techniques. However, IV2 uses 2W to prioritize ATCs, while PV uses both 1W and 2W to guide the prioritization; therefore, IV2 may provide faster speeds than PV for covering level combinations at high τ values. Similarly, IV4 initially uses 2W for prioritizing ATCs, and then independently uses 1W to prioritize the remaining ATCs once the 2-wise level combinations have been fully covered; this may provide better 2-wise APCCs than PV. However, once all 1-wise level combinations have been covered by the selected ATCs, PV does not independently use 2W for prioritizing the remaining ATCs: it considers the 2-wise level combinations not yet covered by the previously selected ATCs (obtained by 1W). This may explain why PV has higher APCCs than IV4 at higher strengths.
With respect to the interaction coverage rate, the answer to RQ1.2 is: Overall, PV achieves a better performance than the RSLCP-IV techniques.

2) RQ2: RSLCP versus λLCP: Based on the experimental data comparing RSLCP and λLCP, we can observe the following: a) When τ = λ, λLCP generally achieves the highest τ-wise APCC values, irrespective of subject program. b) RSLCP can nevertheless achieve comparable coverage of level combinations at high τ values. c) When 1 ≤ τ < 3, RSLCP has comparable, or even better, performance. d) The statistical analysis overall validates the above three observations.

The first observation can be easily explained: λLCP chooses the next test case with the largest number of uncovered λ-wise level combinations, leading to the highest rate of interaction coverage at strength τ = λ. The second observation can be explained by the fact that RSLCP repeatedly achieves full 1-wise or 2-wise level-combination coverage, which may provide comparable level-combination coverage at high τ values; however, once the 1-wise (or 2-wise) level combinations have been fully covered by 1W (or 2W), the remaining ATCs are randomly prioritized. The third observation can be explained as follows: The RSLCP techniques focus on a prioritization strength equal to either 1 or 2, which guarantees better APCCs at τ = 1 or τ = 2.

Although RSLCP can cover level combinations at high τ values, its speed may be lower than that of λLCP at high λ values. Additionally, λLCP can also achieve a fast speed of covering level combinations at low τ values, which may explain why λLCP could sometimes achieve better APCCs at low τ values (e.g., 3W, 4W, and 5W have higher 2-wise APCCs than IV1, IV3, IV5, and PV for the programs gzip and make).

From the perspective of the interaction coverage rate, the answer to RQ2 is: RSLCP generally has better τ-wise APCC performances than λLCP at low λ values when τ = λ; and it has comparable (and even better) τ-wise APCCs for 1 ≤ τ < 3, compared with 3W, 4W, 5W, and 6W.
3) RQ3: RSLCP versus ILCP and SP: As illustrated in Figs. 3-7 and Table VI, it can be observed that: a) ILCP, IV1, IV3, and IV5 have comparable APCCs when τ = 1, while IV2 and IV4 achieve better APCCs when τ = 2. Similarly, PV achieves 1-wise and 2-wise APCCs comparable to ILCP. For all other τ values (i.e., 3 ≤ τ ≤ 6), however, ILCP performs significantly better than all RSLCP techniques. b) Compared with GSP, different programs yield different observations. More specifically, RSLCP and GSP have very similar 1-wise APCCs for the programs flex and grep; GSP has higher 1-wise APCCs than RSLCP for the programs gzip and make. For the program sed, however, the 1-wise APCC performances of RSLCP are better than those of GSP. In addition, the statistical analysis shows that, overall, RSLCP performs significantly better than GSP for the programs flex, grep, and sed, while the opposite is true for gzip and make. Nevertheless, when τ is greater than 1 (i.e., 2 ≤ τ ≤ 6), all RSLCP techniques apart from IV1 have significantly better performances than GSP, for all programs. IV1 performs better than GSP for the programs flex, grep, and sed, but has worse performance for gzip.
Regarding the program make, IV1 outperforms GSP for τ equal to 2, 3, and 4, but GSP is better than IV1 for τ equal to 5 and 6. c) Compared with LSP, all RSLCP techniques have higher APCCs, irrespective of subject program and τ value. The statistical results confirm that the differences between the APCCs of RSLCP and LSP are highly significant.

The first observation can be explained as follows: On the one hand, ILCP first adopts the same procedure as PV (it initially uses 1W for prioritizing ATCs until all 1-wise level combinations have been covered by the selected ATCs), and then uses 2W for prioritizing the remaining ATCs, covering uncovered 2-wise level combinations as quickly as possible. Therefore, ILCP has 1-wise APCCs comparable to IV1, IV3, IV5, and PV, and also has 2-wise APCCs similar to PV. However, since IV2 and IV4 initially use 2W to guide the prioritization, it is understandable that their 2-wise APCCs are better than ILCP's. On the other hand, when all 2-wise level combinations have been covered by previously selected ATCs, ILCP makes use of 3W to prioritize the remaining ATCs, attempting to cover uncovered 3-wise level combinations as quickly as possible. ILCP iteratively repeats this process, incrementing the strength by 1 each time, until all ATCs have been chosen, which may guarantee that ILCP has higher APCCs at high τ values than RSLCP.
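Our reading of the ILCP procedure described above can be sketched in Python as follows, with ATCs encoded as tuples of levels and ties broken deterministically (both are assumptions of this sketch; the published algorithm [29] breaks ties randomly):

```python
from itertools import combinations

def combos(tc, strength):
    """Strength-wise level combinations covered by one abstract test case."""
    return {(pos, tuple(tc[p] for p in pos))
            for pos in combinations(range(len(tc)), strength)}

def ilcp_prioritize(atcs):
    """Incremental-strength LCP sketch: greedily cover all 1-wise level
    combinations first, then switch to strength 2 for the remaining ATCs,
    and so on, until every candidate has been ordered."""
    remaining, ordered, strength = list(atcs), [], 1
    while remaining and strength <= len(atcs[0]):
        covered = {c for tc in ordered for c in combos(tc, strength)}
        total = covered | {c for tc in remaining for c in combos(tc, strength)}
        while remaining and covered != total:
            best = max(remaining,
                       key=lambda tc: len(combos(tc, strength) - covered))
            remaining.remove(best)
            covered |= combos(best, strength)
            ordered.append(best)
        strength += 1  # all strength-wise combinations covered: go one higher
    return ordered + remaining  # any duplicates left over are appended
```

With four ATCs over two binary parameters, the sketch covers all 1-wise combinations with (0, 0) and (1, 1), and only then orders the rest by uncovered 2-wise combinations.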
The second observation (that RSLCP generally has better APCCs than GSP) can be explained as follows: RSLCP uses interaction coverage information to guide the prioritization of ATCs, whereas GSP uses the similarity between each candidate and the previously selected ATCs. Therefore, RSLCP generally covers level combinations faster than GSP. Regarding why GSP has higher 1-wise APCCs than RSLCP for the programs make and gzip: GSP initially selects the two candidate elements with the minimum Jaccard similarity as the first two ATCs, meaning that these two ATCs cover the largest number of 1-wise level combinations. Although GSP may choose a different pair of elements as the first two ATCs due to tie-breaking [28], each such pair of ATCs covers the same maximum number of 1-wise level combinations. RSLCP, however, randomly chooses an element as the first ATC tc1, and then chooses the second ATC tc2 to cover the largest number of 1-wise level combinations left uncovered by tc1. In other words, it is not guaranteed that tc1 and tc2 cover the maximum number of 1-wise level combinations among all pairs of ATCs. As shown in Table II, the input parameter models for both make and gzip have nearly all parameters with the same number of levels, which means that it is difficult for RSLCP to choose two ATCs with the minimum Jaccard similarity. Therefore, after choosing two ATCs, RSLCP may cover fewer 1-wise level combinations than GSP, resulting in lower 1-wise APCCs.
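The Jaccard-based seed-pair selection of GSP described above can be sketched as follows, where each ATC is compared via its set of (parameter index, level) assignments. This encoding, and the deterministic tie-breaking, are assumptions of the sketch; the similarity computation in [8] may differ in detail.

```python
from itertools import combinations

def jaccard(tc1, tc2):
    """Jaccard similarity between two ATCs, each viewed as its set of
    (parameter index, level) assignments."""
    s1, s2 = set(enumerate(tc1)), set(enumerate(tc2))
    return len(s1 & s2) / len(s1 | s2)

def gsp_seed_pair(atcs):
    """GSP seed step sketch: the pair of candidates with minimum Jaccard
    similarity (the first such pair under deterministic tie-breaking)."""
    return min(combinations(atcs, 2), key=lambda pair: jaccard(*pair))
```

For example, among (0, 0), (0, 1), and (1, 1), the pair (0, 0) and (1, 1) shares no assignments (similarity 0) and is chosen as the seed, which is also the pair covering the most 1-wise level combinations.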
For the final observation, the main reason can be described as follows: LSP iteratively chooses the pair of elements with the minimum Jaccard similarity as the next two ATCs, until all candidates have been selected, which means that each pair of ATCs is constructed independently. In other words, it is possible that two successive pairs of ATCs cover very similar level combinations, leading to low interaction coverage rates.
With respect to the rate of interaction coverage, the answer to RQ3 is: RSLCP generally achieves lower APCCs than ILCP in most cases, but always has better performance than LSP, and often also better performance than GSP.

B. Fault Detection Results
In this section, we answer RQ1, RQ2, and RQ3 in terms of the fault detection rates. Figs. 8-12 present the APFD results for the programs flex, grep, gzip, make, and sed, respectively. Each figure shows the APFD results for the different program versions. Tables VII and VIII show the detailed Wilcoxon test APFD results, at the 0.05 significance level, for each comparison.
1) RQ1: RSLCP Techniques: Here, we answer the two sub-questions of RQ1 (RQ1.1 and RQ1.2) according to the APFD results.

a) RQ1.1: RSLCP-IV techniques: Based on the experimental data, we have the following observations: i) Among the five RSLCP-IV techniques, IV1 generally performs worst, regardless of subject program and version. However, the APFD differences between IV1 and the other RSLCP-IV techniques (IV2, IV3, IV4, and IV5) appear small, in terms of both median and mean values: the differences between the mean values of IV1 and each of the other four RSLCP-IV techniques are less than 3%, and the differences between the median values range from 0% to 3% (the program gzip with versions v3, v4, and v5, for example, appears to have no difference in the median values for each comparison). As shown in Table VII, apart from the comparison of IV1 and IV3 for the program gzip with the last three versions (v3, v4, and v5), the differences when comparing any IVx (2 ≤ x ≤ 5) technique against IV1 are significant: the corresponding p-values are less than 0.05. Additionally, the effect size values Â12(IV1, IVx) are less than 0.50, which means that IVx outperforms IV1 more than 50% of the time for each version of each program. ii) Regarding the comparisons among IV2, IV3, IV4, and IV5, the greatest difference is less than 1% among all programs, for both mean and median APFD values. In other words, other than IV1, the RSLCP-IV techniques all have very similar performance in terms of fault detection rates. When comparing IVx and IVy (2 ≤ x ≠ y ≤ 5), most p-values are greater than 0.05, indicating that any differences between them are not significant. However, the effect size Â12(IVx, IVy) results show that IV2 is the best, followed by IV4 and IV5. Overall, the statistical analysis confirms the box plot observations. From the perspective of the fault detection rate, the answer to RQ1.1 is: IV1 has
the worst fault detection rates, while the other RSLCP-IV techniques are similarly effective.

b) RQ1.2: RSLCP-IV versus RSLCP-PV: Based on the box plot results, PV achieves higher APFDs than IV1, and has comparable performance to the other RSLCP-IV techniques, regardless of subject program. According to the statistical analysis, we have the following observations. i) Regarding the p-values, all of the IV1 versus PV comparisons are significant, except for the programs gzip-v3, gzip-v4, and gzip-v5. This means that the APFDs are very different when comparing IV1 and PV. The effect size values Â12(IV1, PV) have different ranges for different programs: for example, there are large differences for the programs flex and sed, regardless of version, because the Â12(IV1, PV) values range from 0.24 to 0.33 (except for the program sed-v1); however, for the programs grep, gzip, and make, the Â12(IV1, PV) values range from 0.37 to 0.49, which means that their differences are relatively medium or small.

TABLE VII. STATISTICAL ANALYSIS FOR PAIRWISE APFD COMPARISONS OF RSLCP TECHNIQUES (Â12 VALUES INCLUDED IN BRACKETS)

ii) When comparing IVx (2 ≤ x ≤ 5) with PV, different IVx techniques yield different observations. More specifically, apart from the programs sed-v4 and sed-v5, there are no significant differences between IV2 and PV. Except for grep with the first three versions, gzip with the last three versions, and make-v5, the differences between IV3 and PV are significant. In addition, the APFD differences between IV4 and PV are significant for the program flex (except for flex-v3) and the program make, but not for the programs grep, gzip, and sed, for all versions. Similarly, the differences between IV5 and PV are significant for all versions of the programs flex and sed, and for some versions of the other programs. Nevertheless, the effect size Â12(IVx, PV) only ranges from 0.42 to 0.53, which indicates that the differences between them are small. As a consequence, we can conclude that, overall, the statistical analysis confirms the box plot results. With respect to the rate of fault detection, the answer to RQ1.2 is: RSLCP-PV performs significantly better than IV1, but has very similar performance to the other RSLCP-IV techniques.
2) RQ2: RSLCP versus λLCP: Based on the experimental data comparing RSLCP and λLCP, we have the following observations: a) The box plot distributions show that the differences between RSLCP and λLCP are relatively small, regardless of subject program: in terms of both mean and median values, the maximum difference between RSLCP and λLCP is only approximately 3.5% to 5.0%. b) Different programs may yield different observations. For example, for all five versions of the program flex, all RSLCP techniques perform better than all λLCP techniques; however, for the program sed, all λLCP techniques (except 1W and 2W in some cases) have higher mean and median APFD values than the RSLCP techniques. Nevertheless, the RSLCP techniques generally have comparable or better APFD performances than 1W and 2W, and comparable performances to 3W and 4W. Compared with 5W and 6W, RSLCP overall has worse rates of fault detection, although it can sometimes perform significantly better. c) In general, the statistical analysis supports the above observations. It should be noted that for the program flex, the difference between RSLCP and λLCP has a large effect size in most cases (except for the comparisons of IV1 with 4W, 5W, and 6W), with the effect size values ranging from 0.67 to 0.94. For the program sed, however, the case is generally the opposite: apart from 1W and 2W, all other λLCP techniques generally have significantly better performances than RSLCP, with either a large or a medium effect size.

From the perspective of the fault detection rate, the answer to RQ2 is: Although RSLCP and λLCP have similar APFD observations overall, RSLCP is comparable to or more effective than λLCP with low λ values (1 or 2), and has comparable (and sometimes superior) APFDs to λLCP with medium λ values such as 3 and 4. RSLCP is worse than λLCP with high λ values such as 5 and 6, but sometimes has much better fault detection rates.
3) RQ3: RSLCP versus ILCP and SP: Based on the experimental data comparing RSLCP, ILCP, and SP, it can be seen that: a) Compared with ILCP, although RSLCP has relatively similar mean and median APFD values, it overall has better rates of fault detection for the program flex but worse performances for the other four programs, irrespective of version. The statistical analysis supports these box plot observations. b) Compared with GSP, except for the program make and some versions of the program gzip, RSLCP generally has higher APFDs, in terms of both mean and median values.
The statistical analysis shows that the differences between GSP and RSLCP are significant, because their p-values are generally less than 0.05, and the effect size Â12 values are of a high or medium degree (except for the program gzip). c) Compared with LSP, except for the program make, RSLCP generally has better performance in most cases, for both mean and median APFD values. This observation is supported by the statistical analysis: nearly all of the p-values are less than 0.05, and most effect size Â12 values are of a high degree.

From the perspective of the fault detection rate, the answer to RQ3 is: RSLCP generally performs worse than ILCP, but can sometimes achieve a better performance. Compared with SP (both GSP and LSP), RSLCP is generally more effective, although it may be less effective in some cases.
As discussed in Section V-A, RSLCP is faster at covering 1-wise or 2-wise level combinations than λLCP. In addition, as shown in Table IV, the proportion of faults with an FTFI number of 1 or 2 is about 27% to 30% for the program flex; approximately 30% to 40% for grep; 75% to 100% for gzip; and 45% to 60% for sed. In other words, RSLCP might be expected to have better, or at least similar, fault detection rates to λLCP for these four programs. However, as discussed in the APFD results, RSLCP is more effective than λLCP for the program flex, but overall less effective for the programs grep, gzip, and sed. This phenomenon may be explained as follows: 1) λLCP also achieves reasonable rates of covering 1-wise and 2-wise level combinations, which may allow it to detect faults with FTFI numbers of 1 and 2 as quickly as RSLCP. 2) Even though two faults may have the same FTFI number, they may have very different properties. For example, consider two faults f1 and f2 with FTFI numbers equal to 2, meaning that each is caused by two parameters pi and pj from the input parameter model Model(P, L, Q), i.e., pi, pj ∈ P, with corresponding level sets Li ∈ L and Lj ∈ L. The fault f1 may be triggered by any of a set C1 of level combinations of pi and pj, while f2 may be triggered only by a smaller set C2 of level combinations (|C1| > |C2|). The probability that a given ATC detects f1 is then |C1|/(|Li| × |Lj|), while the probability that it detects f2 is |C2|/(|Li| × |Lj|). In other words, the number of ATCs required to detect f1 may be much smaller than the number required to detect f2, especially when the sizes of Li and Lj are large. As a consequence, a higher rate of covering 1-wise or 2-wise level combinations does not always imply faster detection of 1-wise or 2-wise faults.
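Under this reading of the example, where f1 is triggered by more (pi, pj) level combinations than f2, the probabilities can be checked by enumeration; the level sets and triggering combinations below are purely illustrative:

```python
from itertools import product

def detection_probability(levels_i, levels_j, triggering):
    """Probability that a uniformly random choice of levels for the two
    relevant parameters hits one of the triggering level combinations."""
    space = list(product(levels_i, levels_j))
    return sum(combo in triggering for combo in space) / len(space)
```

With two 3-level parameters, a fault triggered by three combinations is hit with probability 1/3 per random assignment, versus 1/9 for a fault triggered by a single combination, illustrating why equal FTFI numbers need not imply equal detection difficulty.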
A possible explanation for why RSLCP has significantly worse APFD performance than SP for the program make is as follows: As discussed in Section V-A3, SP first chooses as the first two ATCs the best pair of elements, i.e., those covering the largest number of 1-wise level combinations. When there is only one best pair (i.e., no tie-breaking situation), the first two ATCs are deterministic. If these two ATCs detect many faults, then SP will achieve higher rates of fault detection than RSLCP. Manual inspection of the ATC set and mutants for the program make confirmed this to be the case.

C. Prioritization Cost Results
In this section, we answer RQ1, RQ2, and RQ3 from the perspective of the prioritization cost. Table IX shows the prioritization time, in milliseconds, for the RSLCP, λLCP, ILCP, and SP techniques. The table presents the mean prioritization time (μ) and the standard deviation (σ) over the 1000 independent runs performed per technique.
1) RQ1: RSLCP Techniques: From the table, it can be clearly observed that, among the six RSLCP techniques, IV1 has the fastest prioritization, with the other RSLCP techniques all showing similar prioritization costs. Therefore, the answer to RQ1.1 is: IV1 is the most efficient of the RSLCP-IV techniques, while the others are comparable to each other. Similarly, the answer to RQ1.2 is: IV1 is more efficient than PV, while the other RSLCP-IV techniques are similar to RSLCP-PV.
2) RQ2: RSLCP versus λLCP: Compared with λLCP, IV1 requires a very similar time to 1W, with the other RSLCP techniques (IV2, IV3, IV4, IV5, and PV) taking a very similar amount of time to 2W. In addition, all the RSLCP techniques require considerably less time than 3W, 4W, 5W, and 6W. From the perspective of prioritization cost, therefore, the answer to RQ2 is: RSLCP has similar efficiency to λLCP when λ is small (such as 1 and 2), and requires much less time when λ is larger (such as 3, 4, 5, and 6).
3) RQ3: RSLCP versus ILCP and SP: Compared with ILCP, each RSLCP technique requires significantly less prioritization time: the mean (μ) prioritization time of ILCP ranges from 589 to 15 540 ms, whereas the mean time of RSLCP ranges from only 6 to 111 ms. Similarly, RSLCP also needs much less time than SP (both GSP and LSP) to prioritize ATCs. Therefore, the answer to RQ3 is: The RSLCP techniques are much more efficient than ILCP and SP.

D. Summary
By combining and summarizing the observations from the experimental results for interaction coverage rates, fault detection rates, and prioritization time, we draw the following conclusions: 1) Compared with λLCP with low λ values (such as 1 and 2), RSLCP achieves superior, or at least comparable, testing effectiveness (measured with APCC and APFD), while maintaining testing efficiency. 2) Compared with λLCP with medium λ values (such as 3 and 4), RSLCP achieves comparable fault detection rates but worse interaction coverage rates; however, RSLCP also requires less prioritization time.
3) Compared with λLCP with high λ values (such as 5 and 6), RSLCP usually achieves worse (though sometimes comparable or better) rates of interaction coverage and fault detection, while maintaining much better testing efficiency. 4) Compared with ILCP, as with λLCP at high λ values, the testing effectiveness of RSLCP is generally worse (although RSLCP is sometimes better than or comparable to ILCP), but RSLCP is much more efficient. 5) Compared with SP, RSLCP not only provides better testing effectiveness in most cases, but is also more efficient. As expected, based on the discussion in Section III-D, RSLCP overall provides a good tradeoff between testing effectiveness and efficiency.

E. Robustness Across Versions
This section attempts to answer RQ4 about the robustness of each RSLCP technique across different versions.
As shown in Figs. 3-7, the APFD distributions for each RSLCP technique are similar over the five versions of each subject program. In addition, Table X presents the mean and median APFDs of RSLCP over the five versions, from which we can make the following observations: 1) In terms of the mean APFD, even in the worst case, it varies only slightly from v1 to v5: for example, by less than 4.30% for IV1 and IV3 for all subject programs, and by less than 4.00% for the other RSLCP techniques (IV2, IV4, IV5, and PV). 2) With respect to the median APFD, the situation is very similar to that of the mean APFD: in the worst case, each RSLCP technique's median value varies by less than 4.00% over the five versions of each program (except for a few cases, such as for flex, where the median values vary by approximately 4.27%, 4.01%, and 4.01% for IV1, IV3, and IV5, respectively). Overall, it can be observed that the APFD is quite robust over the five versions for all RSLCP techniques, irrespective of the subject programs. The answer to RQ4, therefore, is: Each RSLCP technique can remain robust over multiple releases of the SUT.

F. Threats to Validity
In this section, we examine some potential threats to the validity of our paper.
1) Construct Validity: In this paper, we have focused on testing effectiveness and efficiency, measured using the rate of fault detection and the prioritization time, respectively. The APFD metric has been commonly adopted in the study of TCP [4], [6], while the APCC metric has been widely used in the field of combinatorial interaction testing [13], [14]. Nevertheless, we acknowledge that there may be other metrics that are also pertinent to this paper, for example, Average Percentage of Statement Coverage [18], Average Percentage of Decision Coverage [18], and Average Percentage of Method Coverage [52].
2) Internal Validity: The main potential threat to internal validity concerns the implementation of our algorithms.
We used C++ to implement the algorithms, and carefully tested the implementation to minimize this threat as much as possible.
3) External Validity: With respect to external validity, the main threat is the generalization of our results. We have used only five subject programs, all written in the C language, and all of relatively medium size. However, we believe that the 25 versions studied here (five versions of each of the five programs) are sufficient to support the conclusions. Nevertheless, larger programs, or programs written in different languages, will be required to further validate our techniques.
A second potential threat to external validity relates to the input parameter models for the subject programs, which we adopted from previous studies [13], [14]. Based on these models, we also made use of the widely used sets of ATCs provided by SIR [37] in our experiments. However, different ATC sets may lead to different findings, and therefore different models (for constrained or unconstrained environments) and other sets of ATCs are still required to help generalize our findings.

4) Conclusion Validity:
The main potential threat to conclusion validity relates to the randomized computation in our algorithms. To minimize this threat, all algorithms were repeated 1000 times, and inferential statistics were applied to the comparisons of the results.

VI. RELATED WORK
In this section, we introduce related work on ATCP, which has been widely researched in several different fields, including combinatorial testing [1], software product line testing [2], and highly configurable systems testing [3].
According to Qu et al. [6], ATCP can be divided into two categories: (1) the Regeneration Prioritization Strategy (RPS), which considers TCP during ATC generation; and (2) the Pure Prioritization Strategy (PPS), which reorders a given set of ATCs. Our proposed RSLCP method belongs to the second category.

A. Regeneration Prioritization Strategy
RPS has generally been applied to combinatorial testing, and attempts to generate prioritized Covering Arrays (CAs), which are special sets of ATCs satisfying certain criteria [53]. Two CA construction strategies exist: greedy and meta-heuristic search.
1) Greedy Strategy: Bryce and Colbourn [54], [55] first proposed an RPS to generate prioritized CAs by assigning test case weights based on interaction coverage. Their strategy defines a weight for each parameter, calculates the weight for each parameter interaction (called the generation strength) at strength 2, and then uses a greedy algorithm to construct pairwise prioritized CAs. Qu et al. [6], [56] proposed different weighting assignments for parameters and levels, including three weighting methods based on code coverage and one based on specification, and applied them to Bryce and Colbourn's method [54], [55], later extending this to configurable systems [57], and then also using the installation and generation cost of new configurations to improve the method [58].
2) Meta-Heuristic Search Strategy: Chen et al. [59] used ant colony optimization to build 2-wise prioritized CAs, using the same weighting calculations as Bryce and Colbourn [54], [55]. Similarly, Lopez-Herrejon et al. [60] used a genetic algorithm to generate prioritized pairwise CAs, applying them to software product lines.

B. Pure Prioritization Strategy
Most prioritization studies have focused on interaction coverage information or test case dissimilarity to guide the prioritization of test cases. To clearly describe these strategies, we classify them into two categories of prioritization methods: level-combination coverage-based prioritization and similarity-based prioritization.
1) Level-Combination Coverage-Based Prioritization: Level-Combination Coverage-Based Prioritization (LCP) greedily chooses an element from the candidates as the next test case such that it covers the largest number of level combinations not already covered by previously selected test cases. According to the prioritization strength(s) adopted, LCP can be categorized as Fixed-strength LCP (FLCP, i.e., λLCP), Incremental-strength LCP (ILCP), or Mixed-strength LCP (MLCP). FLCP uses a fixed prioritization strength λ throughout the entire prioritization process, whereas ILCP and MLCP can use different strength values. When choosing each element from the candidates as the next ATC in the test sequence, both FLCP and ILCP use a single prioritization strength (even though ILCP may change the prioritization strength in later steps). MLCP, on the other hand, uses more than one prioritization strength.
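To make the greedy mechanism concrete, the FLCP selection loop can be sketched as follows (a minimal Python illustration, not the authors' C++ implementation; the ATC suite and function names are hypothetical):

```python
from itertools import combinations

def level_combinations(tc, strength):
    """All strength-wise combinations of (parameter index, level) pairs covered by one ATC."""
    return {frozenset(c) for c in combinations(enumerate(tc), strength)}

def flcp(test_suite, strength):
    """Fixed-strength LCP: repeatedly pick the candidate ATC that covers the
    largest number of level combinations not yet covered by the selected prefix."""
    candidates = list(test_suite)
    covered, sequence = set(), []
    while candidates:
        best = max(candidates,
                   key=lambda tc: len(level_combinations(tc, strength) - covered))
        covered |= level_combinations(best, strength)
        sequence.append(best)
        candidates.remove(best)
    return sequence

# A hypothetical suite of 3-parameter ATCs (each entry is a level index).
suite = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
order = flcp(suite, 2)  # a 2-wise (pairwise) prioritized permutation of the suite
```

Ties are broken by position here (Python's `max` returns the first maximal candidate); greedy prioritization algorithms typically break such ties randomly, which is one reason randomized runs are repeated in experiments.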
Regarding FLCP, Bryce and Memon [11] first proposed λLCP for prioritizing existing test suites for GUI-based programs using λ = 2, 3. Sampath et al. [61] applied FLCP with λ = 1, 2 to reorder user-session test cases for web applications. Bryce et al. [62] proposed a single model to define generic prioritization criteria (including FLCP with λ = 1, 2) that are applicable to event-driven software, such as GUI and web applications. Qu et al. [6], [56] used λLCP with λ = 2, 3 to prioritize CAs by assigning a weighting to each parameter, and then obtaining the weighting of each test case. Bryce et al. [63] extended their λLCP (using λ = 2) technique by defining the test case length as its prioritization cost, suggesting that longer test cases would require more execution time. Wang et al. [27] combined test case weight and cost with Bryce and Memon's λLCP technique [11], and proposed two heuristic prioritization methods to reorder CAs by considering total and additional techniques. They also proposed a series of evaluation metrics based on test case weight and cost, and used them to assess prioritized CAs. Huang et al. [7] extended the strategy of adaptive random prioritization [32] to CAs, proposing a method that, by replacing Bryce and Memon's [11] λLCP technique (with λ = 2, 3, 4), attempts to reduce time costs while maintaining testing effectiveness. Huang et al. [17] proposed a new λLCP technique using repeated base-choice coverage, which belongs to the same category as our proposed RSLCP method IV1. Recently, Petke et al. [13], [14] conducted an extensive study to investigate the testing effectiveness of λLCP with λ = 2, 3, 4, 5, 6, showing that λLCP with small λ values could achieve performance comparable to that with high λ values.
For ILCP, Huang et al. [31] applied Bryce and Memon's [11] λLCP technique to the prioritization of Variable-strength CAs (VCAs) [64], and proposed two new VCA prioritization algorithms. The technique first uses the main strength to prioritize VCAs using λLCP; when all level combinations at the main strength are covered by the selected test cases, it then uses the sub-strength to order the remaining candidate test cases. Huang et al. [29] used incremental interaction coverage to guide CA prioritization, by applying λLCP [11] with incremental prioritization strength values. This approach starts with strength λ = 1, and then increments the λ value once all possible λ-wise parameter-level combinations have been covered by the selected test cases.
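The incremental-strength scheme can be sketched as follows (an illustrative Python sketch of the general ILCP idea described above, not the implementation from [29]; all names are our own):

```python
from itertools import combinations

def combos(tc, strength):
    """Strength-wise combinations of (parameter index, level) pairs covered by one ATC."""
    return {frozenset(c) for c in combinations(enumerate(tc), strength)}

def ilcp(test_suite, max_strength):
    """Incremental-strength LCP: start at strength 1 and raise the strength once
    every combination reachable at the current strength has been covered,
    keeping the already-selected prefix."""
    candidates, sequence = list(test_suite), []
    covered = {s: set() for s in range(1, max_strength + 1)}
    strength = 1
    while candidates:
        # All combinations at this strength that the whole suite can cover.
        reachable = set().union(*(combos(tc, strength) for tc in test_suite))
        if covered[strength] >= reachable and strength < max_strength:
            strength += 1            # full coverage reached: increment the strength
            continue
        best = max(candidates,
                   key=lambda tc: len(combos(tc, strength) - covered[strength]))
        for s in covered:            # a chosen ATC covers combinations at every strength
            covered[s] |= combos(best, s)
        sequence.append(best)
        candidates.remove(best)
    return sequence

suite = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
order = ilcp(suite, 2)  # strength 1 first, then strength 2 for the remainder
```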
Regarding MLCP, Huang et al. [33] proposed a new dissimilarity measure based on aggregate-strength interaction coverage, and presented a new greedy PPS algorithm to prioritize CAs. This method combines the prioritization strength values 1, 2, . . ., τ (where τ is the generation strength of the given CA) to guide the selection of each test case from the candidates.
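As a sketch of how such a mixed-strength measure might combine strengths 1, 2, . . ., τ when scoring a candidate (an unweighted illustration of the general idea only; the actual measure in [33] may aggregate the per-strength coverage differently):

```python
from itertools import combinations

def combos(tc, s):
    """s-wise combinations of (parameter index, level) pairs covered by one ATC."""
    return {frozenset(c) for c in combinations(enumerate(tc), s)}

def mixed_strength_score(tc, covered_by_strength, tau):
    """Score a candidate by summing its uncovered combinations over
    every strength from 1 to tau (unweighted aggregation, for illustration)."""
    return sum(len(combos(tc, s) - covered_by_strength[s])
               for s in range(1, tau + 1))

# With nothing covered yet, a 3-parameter ATC contributes C(3,1) + C(3,2) = 6.
empty = {1: set(), 2: set()}
score = mixed_strength_score((0, 0, 0), empty, 2)
```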
According to the classifications above, different versions of RSLCP belong to different categories. IV1 and IV2 belong to the category of FLCP, and IV3 and PV belong to the category of ILCP. However, IV4 and IV5 do not fully belong to the categories above, because IV4 adopts decreasing strengths, and IV5 may adopt either incremental or decreasing strengths. Nevertheless, when choosing an element from the candidates as the next ATC each time, all RSLCP techniques have the same mechanism as FLCP. Moreover, RSLCP has the same aim as ILCP: overcoming the biased selection of prioritization strength for λLCP. In order to cover all 1-wise and 2-wise level combinations as quickly as possible, PV uses the same process as ILCP. After that, PV repeats the process, ignoring the previously selected ATCs. In contrast, ILCP increments λ to 3 to prioritize the remaining ATCs while considering the already selected ATCs: it checks the λ-wise level combinations that have not been covered by the previously selected ATCs.
2) Similarity-Based Prioritization: Wu et al. [65] used Srikanth et al.'s [58] switching cost between two test cases as the similarity measure to prioritize CAs, proposing two greedy algorithms and a graph-based algorithm. Their extended work [66] proposed single-objective algorithms to reduce the switching cost, as well as hybrid and multiobjective algorithms to balance the tradeoff between high level-combination coverage and low switching cost. Henard et al. [8] introduced another similarity measure, and proposed two greedy PPSs for software product lines: local maximum distance prioritization and global maximum distance prioritization. Recently, Huang et al. [36] investigated 14 similarity measures for the two similarity-based prioritization algorithms proposed by Henard et al. [8], attempting to identify the best similarity measure for each algorithm. Al-Hajjaji et al. [9] also proposed a new similarity measure, and applied it to the greedy prioritization algorithm for software product lines.
Huang et al. [5] have also recently reported on an empirical examination of 16 popular ATCP techniques, from the perspective of fault detection effectiveness.

VII. CONCLUSION
ATCP has been widely used in different testing situations, including combinatorial testing [1]. λLCP is perhaps the most well-known ATCP technique [11]. λLCP requires a fixed prioritization strength λ to be set before testing, and there is a tradeoff between testing effectiveness and efficiency: when λ is higher, λLCP has higher testing effectiveness but lower testing efficiency; when λ is lower, it has lower testing effectiveness but higher testing efficiency. In this paper, we proposed a new method, RSLCP, that attempts to balance this tradeoff for λLCP. RSLCP has the same advantages as λLCP: it is simple, and it is a static black-box technique (which means that it neither requires information about the source code, nor requires test execution). However, RSLCP also has some additional advantages compared with λLCP, including that it does not require assigning a value to λ in advance of testing. We conducted empirical studies to compare RSLCP with λLCP, ILCP, and SP, in terms of interaction coverage rate, fault detection rate, and prioritization cost. Based on these empirical studies, we have the following findings: 1) Among the six RSLCP techniques, IV1 generally has the worst testing effectiveness, while the other five techniques all have very similar performance. 2) Compared with λLCP and ILCP, RSLCP could provide a good tradeoff between testing effectiveness and efficiency. 3) Compared with SP, RSLCP not only provides better rates of interaction coverage and fault detection in most cases, but also requires less prioritization time. 4) All six RSLCP techniques are robust over multiple releases of the software systems under test. Because of RSLCP's potential for prioritizing ATCs, in the future we would like to develop and examine more cost-effective algorithms to achieve a better tradeoff between testing effectiveness and efficiency. Multiobjective algorithms, for example, will be considered in ATCP. We will also conduct more empirical studies to further compare with other ATCP techniques, and to evaluate our techniques in different applications (such as GUI and web applications), especially in larger and more complex systems.

Fig. 1. Screenshot of the font settings from Microsoft Word 2013.
The size of ψ(η, tc), i.e., |ψ(η, tc)|, is equal to C(k, η) (the number of η-combinations from k elements). We next present the definition of η-wise level-combination coverage for an ATC, or for a subset of the given test set. Definition 2.4 (η-wise Level-Combination Coverage): Given a valid test suite T, a valid ATC tc, and a subset T′ of T (tc ∈ T′ and T′ ⊆ T), the η-wise level-combination coverage of tc against T′ can be defined as the ratio of the number of η-wise level combinations covered by tc to those covered by T′: |ψ(η, tc)| / |ψ(η, T′)|. The η-wise level-combination coverage of test set T′ against T can be written as |ψ(η, T′)| / |ψ(η, T)|.
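The quantities in Definition 2.4 can be sketched in code as follows (a small Python illustration with a hypothetical 3-parameter suite; ψ is represented as a set of frozensets of (parameter index, level) pairs):

```python
from itertools import combinations
from math import comb

def psi(eta, tc):
    """psi(eta, tc): the eta-wise level combinations covered by ATC tc,
    each recorded as a frozenset of (parameter index, level) pairs."""
    return {frozenset(c) for c in combinations(enumerate(tc), eta)}

def psi_set(eta, tests):
    """psi(eta, T): the eta-wise level combinations covered by a set of ATCs."""
    return set().union(*(psi(eta, tc) for tc in tests)) if tests else set()

# A hypothetical suite with k = 3 parameters.
T = [(0, 0, 0), (0, 1, 1), (1, 0, 1)]
tc = T[0]
assert len(psi(2, tc)) == comb(3, 2)              # |psi(eta, tc)| = C(k, eta)
T_sub = [T[0], T[1]]                              # a subset T' with tc in T'
ratio = len(psi(2, tc)) / len(psi_set(2, T_sub))  # coverage of tc against T'
print(ratio)  # 3 / 6 = 0.5
```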

Fig. 8. APFD results for each prioritization technique for the program flex.

Fig. 9. APFD results for each prioritization technique for the program grep.

Fig. 10. APFD results for each prioritization technique for the program gzip.

Fig. 11. APFD results for each prioritization technique for the program make.

TABLE I. INPUT PARAMETER MODEL FOR MICROSOFT WORD 2013 FONT EFFECTS

Algorithm 1: RSLCP Procedure. Input: T = {tc1, tc2, . . ., tcn}.
…-wise RSLCP-IV: Unlike the previous two assignment categories, this category uses a combination of … Table III gives an overview of the 15 prioritization techniques investigated, listing each technique's category, mnemonic, …

TABLE III. RSLCP, λLCP, ILCP, AND SBP TECHNIQUES CONSIDERED IN THE EXPERIMENTS

TABLE V. STATISTICAL ANALYSIS FOR PAIRWISE APCC COMPARISONS OF RSLCP TECHNIQUES (Â12 VALUES INCLUDED IN BRACKETS)
TABLE VI. STATISTICAL ANALYSIS FOR PAIRWISE APCC COMPARISONS OF RSLCP AGAINST OTHER ATCP TECHNIQUES (Â12 VALUES INCLUDED IN BRACKETS)

TABLE VIII. STATISTICAL ANALYSIS FOR PAIRWISE APFD COMPARISONS OF RSLCP AGAINST OTHER ATCP TECHNIQUES (Â12 VALUES INCLUDED IN BRACKETS)

TABLE X. MEAN AND MEDIAN APFD VALUES OVER VERSIONS FOR EACH RSLCP TECHNIQUE