A Survey on Adaptive Random Testing

Random testing (RT) is a well-studied testing method that has been widely applied to the testing of many applications, including embedded software systems, SQL database systems, and Android applications. Adaptive random testing (ART) aims to enhance RT's failure-detection ability by more evenly spreading the test cases over the input domain. Since its introduction in 2001, there have been many contributions to the development of ART, including various approaches, implementations, assessment and evaluation methods, and applications. This paper provides a comprehensive survey on ART, classifying techniques, summarizing application areas, and analyzing experimental evaluations. This paper also addresses some misconceptions about ART, and identifies open research challenges for future work.


INTRODUCTION
SOFTWARE testing is a popular technique used to assess and assure the quality of the (software) system under test (SUT). One fundamental testing approach involves simply constructing test cases in a random manner from the input domain (the set of all possible program inputs): this approach is called random testing (RT) [1]. RT may be the only testing approach used not only for operational testing, where the software reliability is estimated, but also for debug testing, where software failures are targeted with the purpose of removing the underlying bugs¹ [3]. Although conceptually very simple, RT has been used in the testing of many different environments and systems, including: Windows NT applications [4]; embedded software systems [5]; SQL database systems [6]; and Android applications [7].
RT has generated a lot of discussion and controversy, notably in the context of its effectiveness as a debug testing method [8]. Many approaches have been proposed to enhance RT's testing effectiveness, especially for failure detection. Adaptive random testing (ART) [9] is one such proposed improvement over RT. ART was motivated by observations reported independently by many researchers from multiple different areas regarding the behavior and patterns of software failures: program inputs that trigger failures (failure-causing inputs) tend to cluster into contiguous regions (failure regions) [10]-[14]. Furthermore, if the failure regions are contiguous, then it follows that non-failure regions should also be adjacent throughout the input domain. Specifically: if a test case tc is a failure-causing input, then its neighbors have a high probability of also being failure-causing; similarly, if tc is not failure-causing, then its neighbors have a high probability of also not being failure-causing. In other words, a program input that is far away from non-failure-causing inputs may have a higher probability of causing failure than the neighboring test inputs. Based on this, ART aims to achieve an even spread of (random) test cases over the input domain. ART generally involves two processes: one for the random generation of test inputs; and another to ensure an even spreading of the inputs throughout the input domain [15].
1. According to the IEEE [2], the relationship amongst mistake, fault, bug, defect, failure and error can be briefly explained as follows: A software developer makes a mistake, which may introduce a fault (defect or bug) in the software. When a fault is encountered, a failure may be produced, i.e., the software behaves unexpectedly. "An error is the difference between a computed, observed, or measured value or condition and the true, specified, or theoretically correct value or condition." [2, p.128].
ART's invention and appearance in the literature can be traced back to a journal paper by Chen et al., published in 2001 [9]. Some papers present overviews of ART, but are either preliminary, or do not make ART the main focus [15]- [20]. For example, to draw attention to the fundamental role of diversity in test case selection strategies, Chen et al. [15] presented a synthesis of some of the most important ART research results before 2010. Similarly, Anand et al. [18] presented an orchestrated survey of the most popular techniques for automatic test case generation that included a brief report on the then state-of-the-art of ART. Roslina et al. [19] also conducted a study of ART techniques based on 61 papers. Consequently, there is currently no up-to-date, exhaustive survey analyzing both the state-of-the-art and new potential research directions of ART. This paper fills this gap in the literature.
In this paper, we present a comprehensive survey on ART covering 140 papers published between 2001 and 2017. The paper includes the following: (1) a summary and analysis of the selected 140 papers; (2) a description, classification, and summary of the techniques used to derive the main ART strategies; (3) a summary of the application and testing domains in which ART has been applied; (4) an analysis of the empirical studies conducted into ART; (5) a discussion of some misconceptions surrounding ART; and (6) a summary of some open research challenges that should be addressed in future ART work. To the best of our knowledge, this is the first large-scale and comprehensive survey on ART.
The rest of this paper is organized as follows. Section 2 briefly introduces the preliminaries and gives an overview of ART. Section 3 discusses this paper's literature review methodology. Section 4 examines the evolution and distribution of ART studies. Section 5 analyzes the state of the art of ART techniques. Section 6 presents the situations and problems to which ART has been applied. Section 7 gives a detailed analysis of the various empirical evaluations of ART. Section 8 discusses some misconceptions. Section 9 provides some potential challenges to be addressed in future work. Finally, Section 10 concludes the paper.

BACKGROUND
This section presents some preliminary concepts, and provides an introduction to ART.

Preliminaries
For a given SUT, many software testing methods have been implemented according to the following four steps: (1) define the testing objectives; (2) choose inputs for the SUT (test cases); (3) run the SUT with these test cases; and (4) analyze the results. Each test case is selected from the set of all possible inputs that form the input domain. When the SUT's output or behavior when executing a test case tc is not as expected (as determined by the test oracle [21]), the test is considered to fail; otherwise, it passes. When a test fails, tc is called a failure-causing input.
Given some faulty software, two fundamental features can be used to describe the properties of the fault(s): the failure rate (the number of failure-causing inputs as a proportion of all possible inputs); and the failure pattern (the distributions of failure-causing inputs across the input domain, including their geometric shapes and locations). Before testing, these two features are fixed, but unknown.
Chan et al. [22] identified three broad categories of failure patterns: strip, block and point. Fig. 1 illustrates these three failure patterns in a two-dimensional input domain (the bounding box represents the input domain boundaries; and the black strip, block, or dots represent the failure-causing inputs). Previous studies have indicated that strip and block patterns are more commonly encountered than point patterns [10]-[14].
Generally speaking, failure regions are identified or constructed in empirical studies (experiments or simulations). Experiments involve real faults or mutants (seeded using mutation testing [23]) in real-life subject programs. Simulations, in contrast, create artificial failure regions using predefined values for the dimensionality d and failure rate θ, and a predetermined failure pattern type: a d-dimensional unit hypercube is often used to simulate the input domain D ($D = \{(x_1, x_2, \cdots, x_d) \mid 0 \le x_1, x_2, \cdots, x_d < 1.0\}$), with the failure regions randomly placed inside D, and their sizes and shapes determined by θ and the selected failure pattern, respectively. During testing, when a generated test case is inside a failure region, a failure is said to be detected.

Adaptive Random Testing (ART)
ART is a family of testing methods, with many different implementations based on various intuitions and criteria. In this section, we present an ART implementation to illustrate the fundamental principles.
The first implementation of ART was the Fixed-Size-Candidate-Set (FSCS) [9] version, which makes use of the concept of distance between test inputs. FSCS uses two sets of test cases: the candidate set, C; and the executed set, E. C is a set of k tests randomly generated from the input domain (according to the specific distribution); and E is the set of those inputs that have already been executed, but without causing any failure. E is initially empty. The first test input is generated randomly. In each iteration of FSCS ART, an element from C is selected as the next test case such that it is farthest away from all previously executed tests (those in E). Formally, the element $c'$ from C is chosen as the next test case such that it satisfies the following constraint:
$$\forall c \in C,\ \min_{e \in E} \mathit{dist}(c', e) \ge \min_{e \in E} \mathit{dist}(c, e), \tag{2.1}$$
where dist(x, y) is a function to measure the distance between two test inputs x and y. The Euclidean distance is typically used in dist(x, y) for numerical input domains.
Fig. 2 illustrates the FSCS process in a two-dimensional input domain: Suppose that there are three previously executed test cases, $t_1$, $t_2$, and $t_3$ (i.e., $E = \{t_1, t_2, t_3\}$), and two randomly generated test candidates, $c_1$ and $c_2$ (i.e., $C = \{c_1, c_2\}$) (Fig. 2(a)). To select the next test case from C, the distance between each candidate and each previously executed test case in E is calculated, and the minimum value for each candidate is recorded as the fitness value (Fig. 2(b)). Finally, the candidate with the maximum fitness value is selected to be the next test case (Fig. 2(c)): in this example, $c_2$ is used for testing ($t_4 = c_2$).
As Chen et al. have explained [15], ART aims to spread randomly generated test cases more evenly across the input domain than RT does. In other words, ART attempts to generate more diverse test cases than RT.
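The FSCS selection rule can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes a numeric unit-hypercube input domain, uniform candidate generation, and Euclidean distance, and the names `select_farthest` and `fscs_art` (as well as the `max_tests` budget) are hypothetical.

```python
import math
import random

def select_farthest(candidates, executed):
    """Return the candidate whose nearest executed test case is farthest away."""
    return max(candidates, key=lambda c: min(math.dist(c, e) for e in executed))

def fscs_art(is_failure, d, k=10, max_tests=10_000, seed=None):
    """Run FSCS ART on [0, 1)^d until a failure is found or the budget is spent.

    Returns (failure_causing_input_or_None, number_of_tests_executed).
    """
    rng = random.Random(seed)
    executed = []                              # E: executed, non-failing inputs
    tc = [rng.random() for _ in range(d)]      # first test case is purely random
    for _ in range(max_tests):
        if is_failure(tc):
            return tc, len(executed) + 1
        executed.append(tc)
        # C: k fresh random candidates, regenerated in every iteration
        candidates = [[rng.random() for _ in range(d)] for _ in range(k)]
        tc = select_farthest(candidates, executed)
    return None, len(executed)

if __name__ == "__main__":
    # Hypothetical coordinates mirroring the situation of Fig. 2.
    E = [(0.2, 0.2), (0.8, 0.2), (0.5, 0.8)]
    C = [(0.25, 0.25), (0.9, 0.9)]
    print(select_farthest(C, E))
```

In the usage example, the first candidate sits almost on top of an executed test case, so its minimum distance is small and the second candidate is selected.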

METHODOLOGY
Guided by Kitchenham and Charters [24] and Petersen et al. [25], in this paper, we followed a structured and systematic method to perform the ART survey. We also referred to recent survey papers on other software engineering topics, including: mutation analysis [23]; metamorphic testing [26], [27]; constrained interaction testing [28]; and test case prioritization for regression testing [29]. The detailed methodology used is described in this section.

Research Questions
The goal of this survey paper is to structure and categorize the available ART details and evidence. To achieve this, we used the following research questions (RQs):
The answer to RQ1 will provide an overview of the published ART papers. RQ2 will identify the state of the art in ART strategies and techniques, giving a description, summary, and classification. RQ3 will identify where and how ART has been applied. RQ4 will explore how the various ART studies involving simulations and experiments with real programs were conducted and evaluated. RQ5 will examine common ART misconceptions, and, finally, RQ6 will identify some remaining challenges and potential research opportunities.

Literature Search and Selection
Following previous survey studies [26], [28], [29], we also selected the following five online literature repositories belonging to publishers of technical research:

• ACM Digital Library
• Elsevier Science Direct
• IEEE Xplore
• Springer Online Library
• Wiley Online Library
The choice of these repositories was influenced by the fact that a number of important journal articles about ART are available through Elsevier Science Direct, Springer Online Library, and Wiley Online Library. Also, ACM Digital Library and IEEE Xplore not only offer articles from conferences, symposia, and workshops, but also provide access to some important relevant journals.
After determining the literature repositories, each repository was searched using the exact phrase "adaptive random testing" and for titles or keywords containing "random test". To avoid missing papers that were not included in these five repositories, we augmented the set of papers using a search in the Google Scholar database with the phrase "adaptive random testing" as the search string². Removal of duplicates and application of the exclusion criteria reduced the initial 1,276 candidate studies to 138 published papers. Finally, a snowballing process [25] was conducted by checking the references of the selected 138 papers, resulting in the addition of two more papers. In total, 140 publications (primary studies) were selected for inclusion in the survey.
2. The search was performed on May 1st, 2018.
We acknowledge the apparent infeasibility of finding all ART papers through our search. However, we are confident that we have included the majority of relevant published papers, and that our survey provides the overall trends and the state of the art of ART.

Data Extraction and Collection
All 140 primary studies were carefully read and inspected, with data extracted according to our research questions.
As summarized in Table 2, we identified the following information from each study: motivation, contribution, empirical evaluation details, misconceptions, and remaining challenges. To avoid missing information and reduce error as much as possible, this process was performed by two different co-authors, and subsequently verified by the other co-authors at least twice.

PUBLISHED STUDIES?
In this section, we address RQ1 by summarizing the primary studies according to publication trends, authors, venues, and types of contributions to ART.
Fig. 3 presents the ART publication trends, with Fig. 3(a) showing the number of publications per year and Fig. 3(b) showing the cumulative number. It can be observed that, after the first three years, there are at least six publications per year, with the number reaching a peak in 2006. Furthermore, since 2009, the number of publications each year has remained relatively stable, ranging from seven to ten. An analysis of the cumulative publications (Fig. 3(b)) shows that a linear function with a high coefficient of determination ($R^2 = 0.9924$) can be fitted. This indicates that the topic of ART has been experiencing strong linear growth, attracting continued interest and showing healthy development. Following this trend, it is anticipated that there will be about 180 ART papers by 2021, two decades after its appearance in Chen et al. [9].

Researchers and Organizations
Based on the 140 primary studies, 167 ART authors were identified, representing 82 different affiliations. Table 3 lists the top 10 ART authors and their most recent affiliation (with country or region). It is clear that T. Y. Chen, from Swinburne University of Technology in Australia, is the most prolific ART author, with 62 papers.

Geographical Distribution of Publications
We examined the geographical distribution of the 140 publications according to the affiliation country of the first author, as shown in Table 4. We found that all primary studies could be associated with a total of 18 different countries or regions, with Australia ranking first, followed by the People's Republic of China (PRC). Overall, about 41% of ART papers came from Asia; 32% from Oceania; 19% from Europe; and about 8% from America. This distribution of papers suggests that the ART community may only be represented by a modest number of countries spread throughout the world.

Distribution of Publication Venues
The 140 primary studies under consideration were published in 72 different venues (41 conferences or symposia, 20 journals, and 11 workshops). Fig. 4 illustrates the distribution of publication venues, with Fig. 4(a) showing the overall venue distribution, and Fig. 4(b) giving the venue distribution per year. As Fig. 4(a) shows, most papers have been published in conferences or symposia proceedings (57%), followed by journals (30%), and then workshops (13%). Fig. 4(b) shows that, between 2002 and 2012, most ART publications each year were conference and symposium papers, followed by journals and workshops. Fig. 4(b) also shows that, since 2012, this trend has changed, with the number of journal papers per year increasing.
As Fig. 5(a) shows, the main contribution of 43% of the studies was to present new ART techniques or methodologies, 25% were case studies, and 24% were assessments and empirical studies. 4% of studies were surveys or overviews of ART, and the main contribution of five primary studies (4%) was to present a tool. Fig. 5(b) shows that the primary research topic of 10% of the studies was about basic ART approaches. The effectiveness and efficiency enhancement of ART approaches were the focus of 29% and 7%, respectively; and application and assessment of ART were 27% and 23%, respectively. Finally, 4% of the papers report on techniques, achievements, and research directions.
3. If a paper has multiple types of contributions or research topics, then only the main contribution or topic is identified.

Summary of answers to RQ1:

In this section, we present the state of the art of ART approaches, including a description, classification, and summary of their strengths and weaknesses.
ART attempts to spread test cases evenly over the entire input domain [15], which should result in a better failure detection ability [31]. There are two basic rationales to achieving this even spread of test cases:
Rationale 1: New test cases should be far away from previously executed, non-failure-causing test cases.
As discussed in the literature [10]-[14], failure regions tend to be contiguous, which means that new test cases farther away from those already executed (but without causing failure) should have a higher probability of being failure-causing.

Rationale 2: New test cases should contribute to a good distribution of all test cases over the entire input domain.
Previous studies [31] have empirically determined that better test case distributions result in better failure detection ability.
Discrepancy is a metric commonly used to measure the equidistribution of sample inputs [31], with lower values indicating better distributions. The discrepancy of a test set T is calculated as:
$$\mathit{Discrepancy}(T) = \max_{1 \le i \le m} \left| \frac{|T_i|}{|T|} - \frac{|D_i|}{|D|} \right|, \tag{5.1}$$
where $D_1, D_2, \cdots, D_m$ are m randomly defined subdomains of the input domain D; and $T_1, T_2, \cdots, T_m$ are the corresponding subsets of T, such that each $T_i$ ($i = 1, 2, \cdots, m$) is in $D_i$. Discrepancy checks whether or not the number of test cases in a subdomain is proportionate to the relative size of the subdomain area: larger subdomains should have more test cases, and smaller ones should have relatively fewer.
Both Rationale 1 and Rationale 2 achieve a degree of diversity of test cases over the input domain [15]. Based on these rationales, many strategies have been proposed: Select-Test-From-Candidates Strategy (STFCS), Partitioning-Based Strategy (PBS), Test-Profile-Based Strategy (TPBS), Quasi-Random Strategy (QRS), Search-Based Strategy (SBS), and Hybrid Strategies (HSs).
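As an illustration of the discrepancy metric, the following Python sketch computes it for a test set in the unit hypercube. It assumes the usual definition, namely the maximum over the m subdomains of the absolute difference between the fraction of test cases falling in a subdomain and that subdomain's relative size, and it assumes axis-aligned boxes as subdomains. The helper names (`random_box`, `box_volume`, `in_box`, `discrepancy`) are hypothetical.

```python
import random

def random_box(d, rng):
    """A random axis-aligned subdomain of [0, 1)^d, as a list of (low, high) pairs."""
    return [tuple(sorted((rng.random(), rng.random()))) for _ in range(d)]

def box_volume(box):
    """Relative size |D_i| / |D| of a subdomain (|D| = 1 for the unit hypercube)."""
    volume = 1.0
    for low, high in box:
        volume *= high - low
    return volume

def in_box(tc, box):
    return all(low <= x < high for x, (low, high) in zip(tc, box))

def discrepancy(tests, boxes):
    """Largest gap between the share of tests in a box and the box's relative size."""
    n = len(tests)
    return max(
        abs(sum(in_box(t, box) for t in tests) / n - box_volume(box))
        for box in boxes
    )

if __name__ == "__main__":
    rng = random.Random(7)
    boxes = [random_box(2, rng) for _ in range(1000)]
    spread = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
    clustered = [(0.1, 0.1), (0.12, 0.1), (0.1, 0.12), (0.12, 0.12)]
    print("spread:   ", discrepancy(spread, boxes))
    print("clustered:", discrepancy(clustered, boxes))
```

A clustered test set typically yields a much larger value than an evenly spread one, which matches the reading that lower discrepancy indicates a better distribution.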

Select-Test-From-Candidates Strategy
The Select-Test-From-Candidates Strategy (STFCS) chooses the next test case from a set of candidates based on some criteria or evaluation involving the previously executed test cases. Fig. 6 presents a pseudocode framework for STFCS, showing two main components: the random-candidate-set-construction, and test-case-selection. The STFCS framework maintains two sets of test cases: the candidate set (C) of randomly generated candidate test cases; and the executed set (E) of those test cases already executed (without causing failure). The first test case is selected randomly from the input domain D according to a uniform distribution: all inputs in D have equal probability of selection. The test case is then applied to the SUT, and the output and behavior are examined to confirm whether or not a failure has been caused. Until a stopping condition is satisfied (e.g., a failure has been caused), the framework repeatedly uses the random-candidate-set-construction component to prepare the candidates, and then uses the test-case-selection component to choose one of these candidates as the next test case to be applied to the SUT.

Fig. 6. Pseudocode framework for STFCS:
    Set C ← {}, and E ← {}
    Randomly generate a test case tc from D, according to uniform distribution
    Add tc into E, i.e., E ← E ∪ {tc}
    while the stopping condition is not satisfied do
        Randomly choose a specific number of elements from D to form C according to the specific criterion    ▷ Random-candidate-set-construction component
        Find a tc ∈ C as the next test case satisfying the specific criterion    ▷ Test-case-selection component
        E ← E ∪ {tc}
    end while
    Report the result and exit
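The framework can be read as a loop with two pluggable components. The sketch below is one possible Python rendering, offered only as an illustration: the function `stfcs`, its parameter names, and the test budget `max_tests` are assumptions introduced here, and the stopping condition is taken to be "a failure is found or the budget is exhausted".

```python
import random

def stfcs(first_test, build_candidates, select, is_failure, max_tests=1000):
    """Generic STFCS loop with two pluggable components.

    build_candidates(executed) plays the random-candidate-set-construction role;
    select(candidates, executed) plays the test-case-selection role.
    Returns (failure_causing_input_or_None, executed_non_failing_inputs).
    """
    executed = []                       # E
    tc = first_test()                   # first test case: random from D
    for _ in range(max_tests):
        if is_failure(tc):
            return tc, executed
        executed.append(tc)
        candidates = build_candidates(executed)   # C
        tc = select(candidates, executed)
    return None, executed

if __name__ == "__main__":
    # Example instantiation on the 1-D domain [0, 1): ten uniform candidates,
    # with the candidate farthest from its nearest executed test selected.
    rng = random.Random(42)
    found, passed = stfcs(
        first_test=rng.random,
        build_candidates=lambda E: [rng.random() for _ in range(10)],
        select=lambda C, E: max(C, key=lambda c: min(abs(c - e) for e in E)),
        is_failure=lambda x: 0.90 <= x < 0.92,
    )
    print(found, len(passed))
```

Keeping the two components as separate callables makes it straightforward to swap in other candidate distributions or selection criteria without touching the loop.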

Framework
Two basic (and popular) approaches to implementing the STFCS framework are Fixed-Size-Candidate-Set (FSCS) ART [9], [32], and Restricted Random Testing (RRT) [33]-[36]⁴. Clearly, there are different ways to realize the random-candidate-set-construction and test-case-selection components, leading to different STFCS implementations. A number of enhanced versions of both FSCS and RRT have also been developed.

Random-Candidate-Set-Construction Component
Several different implementations of the random-candidate-set-construction component have been developed.
1) Uniform distribution [9]: This involves construction of the candidate set by randomly selecting test cases according to a uniform distribution.
2) Non-uniform distribution [37]: When not using a uniform distribution to generate the candidates, the non-uniform distribution is usually dynamically updated throughout the test case generation process. Chen et al. [37], for example, used a dynamic, non-uniform distribution to have candidates be more likely to come from the center of D than from the boundary region.
3) Filtering by eligibility [38], [39]: Using an eligibility criterion (specified using a tester-defined parameter), this filtering ensures that candidates (and therefore the eventually generated test cases) are drawn only from the eligible regions of D. The criterion used is that the selected candidate's coordinates are as different as possible to those of all previously executed test cases. Given a test case in a d-dimensional input domain, $(x_1, x_2, \cdots, x_d)$, filtering by eligibility selects candidates such that each i-th coordinate, ...
4) Construction using data pools [40], [41]: "Data pools" are first constructed by identifying and adding both specific special values for the particular data type (such as -1, 0, 1, etc. for integers), and the boundary values (such as the minimum and maximum possible values). This method then selects candidates randomly from either just the data pools, with a probability of p, or from the entire input domain, with a probability of 1 − p. Selected candidates are removed from the data pool. Once the data pool is exhausted, or falls below a threshold size, it is then updated by adding new elements.
5) Achieving a specific degree of coverage [42]: This involves selecting candidates randomly (with uniform distribution) from the input domain until some specific coverage criteria (such as branch, statement, or function coverage) are met.
4. Previous ART studies have generally considered RRT to represent an ART by exclusion category [18], [19]. However, both RRT and FSCS belong to the STFCS category.

Candidate Set Size
The size of the candidate set, k, may either be a fixed number (e.g., determined in advance, perhaps by the testers), or a flexible one. Although, intuitively, increasing the size of k should improve the testing effectiveness, as reported by Chen et al. [32], the improvement in FSCS ART performance is not significant when k > 10: In most studies, therefore, k has been assigned a value of 10. However, when the value of k is flexible, there are different methods to design and determine its value, based on the execution conditions or environment.

Test-Case-Identification Component
The test-case-identification component chooses one of the candidates as the next test case, according to the specific criterion. There are generally two different implementations: Implementation 1: After measuring all candidates, identifying the best one (as implemented in FSCS); and Implementation 2: Checking candidates until the first suitable (or valid) one is identified (as implemented in RRT). The goal of the test-case-identification component is to achieve the even spreading of test cases over the input domain, which it does based on the fitness value. In other words, the fitness function measures each candidate c from the candidate set C against the executed set E. We next list the seven different fitness functions, fitness(c, E), used in STFCS, with the first six following Implementation 1; and the last one following Implementation 2.
1) Minimum-Distance [9]: This involves calculating the distance between c and each element e from E (e ∈ E), and then choosing the minimum distance as the fitness value for c. In other words, the fitness function of c against E is the distance between c and its nearest neighbor in E:
$$\mathit{fitness}(c, E) = \min_{e \in E} \mathit{dist}(c, e). \tag{5.2}$$
2) Average-Distance [40]: Similar to Minimum-Distance, this also computes the distance between c and each element e in E, but instead of the minimum, the average of these distances is used as the fitness value for c:
$$\mathit{fitness}(c, E) = \frac{1}{|E|} \sum_{e \in E} \mathit{dist}(c, e). \tag{5.3}$$
3) Maximum-Distance [42]: This assigns the maximum distance as the fitness value for c. In other words, this fitness function chooses the distance between c and its neighbor in E that is farthest away:
$$\mathit{fitness}(c, E) = \max_{e \in E} \mathit{dist}(c, e). \tag{5.4}$$
4) Centroid-Distance [43], [44]: This uses the distance between c and the centroid (center of gravity) of E as the fitness value for c:
$$\mathit{fitness}(c, E) = \mathit{dist}\Big(c, \frac{1}{|E|} \sum_{e \in E} e\Big), \tag{5.5}$$
where $\frac{1}{|E|} \sum_{e \in E} e$ returns the centroid of E.

5) Discrepancy [45], [46]: This involves choosing the next test case such that it achieves the lowest discrepancy when added to E. Therefore, this fitness function of c can be defined as:
$$\mathit{fitness}(c, E) = 1 - \mathit{discrepancy}(E \cup \{c\}). \tag{5.6}$$
6) Membership-Grade [47]: Fuzzy Set Theory [48] can be used to define some fuzzy features to construct a membership grade function, allowing a candidate with the highest (or threshold) score to be selected as the next test case. Chan et al. [47] defined some fuzzy features based on distance, combining them to calculate the membership grade function for candidate test cases. Two of the features they used are: the Dynamic Minimum Separating Distance (DMSD), which is a minimum distance between executed test cases, decreasing in magnitude as the number of executed test cases increases; and the Absolute Minimum Separating Distance (AMSD), which is an absolute minimum distance between test cases, regardless of how many test cases have been executed. During the evaluation of a candidate c against E, if ∀e ∈ E, dist(c, e) ≥ DMSD, then selection of c will be strongly favored; however, if ∃e ∈ E such that dist(c, e) ≤ AMSD, then c will be strongly disfavored. The candidate most highly favored (with the highest membership grade) is then selected as the next test case.
7) Restriction [33], [47], [49]: This involves checking whether or not a candidate c violates the pre-defined restriction criteria related to E, denoted restriction(c, E). The fitness function of c against E can be defined as:
$$\mathit{fitness}(c, E) = \begin{cases} 0, & \text{if } \mathit{restriction}(c, E) \text{ is true}, \\ 1, & \text{otherwise}. \end{cases} \tag{5.7}$$
The random-candidate-set-construction component successively selects candidates from the input domain (according to uniform or non-uniform distribution) until one that is not restricted is identified. Three approaches to using Restriction in ART are:
• Previous studies [33]-[35], [50] have implemented restriction by checking whether or not c is located outside of all (equally-sized) exclusion regions defined around all test cases in E. In a 2-dimensional numeric input domain D, for example, Chan et al. [33] used circles around each already selected test case as exclusion regions, thereby defining restriction(c, E) as:
$$\mathit{restriction}(c, E) = \begin{cases} \text{true}, & \text{if } \exists e \in E,\ \mathit{dist}(c, e) < \sqrt{\dfrac{R \cdot |D|}{\pi \cdot |E|}}, \\ \text{false}, & \text{otherwise}, \end{cases} \tag{5.8}$$
where R is the target exclusion ratio (set by the tester [51]), and dist(c, e) is the Euclidean distance between c and e. In fact, Eq. (5.8) relates to the DMSD fuzzy feature [47].
• Zhou et al. [49], [52], [53] designed an acceptance probability $P_\beta$ based on Markov Chain Monte Carlo (MCMC) methods [54] to control the identification of random candidates. Given a candidate c, the method generates a new candidate $c'$ according to the applied distribution (uniform or non-uniform), resulting in restriction(c, E) being defined in terms of U, X(c), and P(X(c) = 1|E), where U is a uniform random number in the interval [0, 1.0), X(c) is the execution output of c (X(c) = 1 means that c is a failure-causing input, and X(c) = 0 means that it is not), and P(X(c) = 1|E) represents the probability that c is failure-causing, given the set E of already executed test cases. According to Bayes' rule, we have:
$$P(X(c) = 1 \mid E) = \frac{P(E \mid X(c) = 1)\, P(X(c) = 1)}{Z}, \tag{5.10}$$
where Z is a normalizing constant. Assuming all elements in E are conditionally independent for a test output of c, we then have:
$$P(E \mid X(c) = 1) = \prod_{e \in E} P(X(e) \mid X(c) = 1). \tag{5.11}$$
As illustrated by Zhou et al. [49], [52], [53], P(X(e)|X(c) = 1) is defined as:
$$P(X(e) = 1 \mid X(c) = 1) = \exp(-\mathit{dist}(e, c)/\beta_1), \tag{5.12}$$
and
$$P(X(e) = 0 \mid X(c) = 1) = 1 - \exp(-\mathit{dist}(e, c)/\beta_1), \tag{5.13}$$
where $\beta_1$ is a constant. If one candidate is a greater distance from the non-failure-causing test cases than another candidate, then it has a higher probability of being selected as the next test case.
• Using Fuzzy Set Theory [48], Chan et al. [47] applied a dynamic threshold λ to determine whether or not a candidate was acceptable, accepting the candidate if its membership grade was greater than λ. If a predetermined number of candidates were rejected for being below λ, the threshold was adjusted according to the specified principles.
It should be noted that any Implementation 1 approach to choosing candidates can be transformed into Implementation 2 by applying a threshold mechanism.
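Several of the fitness functions listed above are simple enough to express directly. The following Python sketch covers the four distance-based functions, an exclusion-region restriction check, and the likelihood term of the MCMC-based approach. It is illustrative only and rests on assumptions: numeric inputs with Euclidean distance; circular exclusion regions in a 2-dimensional domain whose total area equals R times the domain area; and an executed set containing only non-failure-causing inputs, so that only the term $1 - \exp(-\mathit{dist}(e, c)/\beta_1)$ is used. All function names are hypothetical.

```python
import math

def min_distance(c, E):
    """Distance from c to its nearest neighbor in E."""
    return min(math.dist(c, e) for e in E)

def average_distance(c, E):
    """Mean distance from c to all elements of E."""
    return sum(math.dist(c, e) for e in E) / len(E)

def max_distance(c, E):
    """Distance from c to its farthest neighbor in E."""
    return max(math.dist(c, e) for e in E)

def centroid_distance(c, E):
    """Distance from c to the centroid (center of gravity) of E."""
    centroid = [sum(e[i] for e in E) / len(E) for i in range(len(c))]
    return math.dist(c, centroid)

def is_restricted(c, E, R, domain_area=1.0):
    """True when c falls inside a circular exclusion region around some e in E.

    Assumption: |E| equal circles whose areas sum to R * domain_area (2-D only).
    """
    radius = math.sqrt(R * domain_area / (math.pi * len(E)))
    return any(math.dist(c, e) < radius for e in E)

def passed_likelihood(c, E, beta1):
    """P(E | X(c) = 1) when every e in E passed and the e are independent."""
    return math.prod(1.0 - math.exp(-math.dist(e, c) / beta1) for e in E)

if __name__ == "__main__":
    E = [(0.0, 0.0), (0.6, 0.0)]
    c = (0.0, 0.8)
    print(min_distance(c, E), average_distance(c, E), max_distance(c, E))
    print(centroid_distance(c, E), is_restricted(c, E, R=0.5))
    print(passed_likelihood(c, E, beta1=0.1))
```

Under Implementation 1, a candidate set would be ranked by one of the first four functions; under Implementation 2, candidates would be drawn one at a time until `is_restricted` returns False.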

Partitioning-Based Strategy
The Partitioning-Based Strategy (PBS) divides the input domain into a number of subdomains, choosing one as the location within which to generate the next test case. Core elements of PBS, therefore, are to partition the input domain and to select the subdomain. Fig. 7 presents a pseudocode framework for PBS, showing two main components: the partitioning-schema, and subdomain-selection. The partitioning-schema component defines how to partition the input domain into subdomains, and the subdomain-selection component defines how to choose the target subdomain where the next test case will be generated.

Fig. 7. Pseudocode framework for PBS:
    Set E ← {}
    Randomly generate a test case tc from D, according to uniform distribution
    Add tc into E, i.e., E ← E ∪ {tc}
    while the stopping condition is not satisfied do
        if the partitioning condition is triggered then
            Partition the input domain D into m disjoint subdomains D_1, D_2, ..., D_m, according to the specific criterion    ▷ Partitioning-schema component
        end if
        Choose a subdomain D_i according to the specific criterion    ▷ Subdomain-selection component
        Randomly generate the next test case tc from D_i, according to uniform distribution
        E ← E ∪ {tc}
    end while
    Report the result and exit

Framework
After partitioning, the input domain D will be divided into m disjoint subdomains $D_1, D_2, \cdots, D_m$ (m > 1), according to the partitioning-schema criteria:
$$D = D_1 \cup D_2 \cup \cdots \cup D_m, \quad \text{with } D_i \cap D_j = \emptyset \text{ for } i \ne j. \tag{5.14}$$
Next, based on the subdomain-selection criteria, PBS chooses a suitable subdomain within which to generate the next test case.
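A minimal Python sketch of the PBS loop is given below. It is not a specific published algorithm: for illustration it assumes a static, equally-sized grid over the unit hypercube as the partitioning schema, and it assumes a subdomain-selection rule that picks, at random, one of the subdomains currently holding the fewest executed test cases. The names `pbs_static`, `cell_of`, and `sample_in_cell` are hypothetical.

```python
import random
from itertools import product

def cell_of(tc, parts):
    """Grid cell (tuple of indices) containing tc, with `parts` cells per dimension."""
    return tuple(min(int(x * parts), parts - 1) for x in tc)

def sample_in_cell(cell, parts, rng):
    """Uniformly random point inside the given grid cell."""
    return [(i + rng.random()) / parts for i in cell]

def pbs_static(is_failure, d, parts, max_tests, seed=None):
    """PBS with a static grid; returns (failure_input_or_None, executed_inputs)."""
    rng = random.Random(seed)
    # Partitioning-schema component (assumed): fixed grid of parts**d subdomains.
    counts = {cell: 0 for cell in product(range(parts), repeat=d)}
    tc = [rng.random() for _ in range(d)]
    cell = cell_of(tc, parts)
    executed = []
    for _ in range(max_tests):
        if is_failure(tc):
            return tc, executed
        executed.append(tc)
        counts[cell] += 1
        # Subdomain-selection component (assumed): least-populated subdomain.
        fewest = min(counts.values())
        cell = rng.choice([c for c, n in counts.items() if n == fewest])
        tc = sample_in_cell(cell, parts, rng)
    return None, executed

if __name__ == "__main__":
    found, passed = pbs_static(lambda t: t[0] > 0.95 and t[1] > 0.95, 2, 4, 5000, seed=1)
    print(found, len(passed))
```

Because the selection rule always targets a least-populated subdomain, the number of executed test cases per subdomain never differs by more than one, which is one simple way of keeping the spread even.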

Partitioning-Schema Component
Many different criteria can be used to partition the input domain, which can be achieved using either static or dynamic partitioning. Static partitioning [55]- [58] means that the input domain is divided before test case generation, with no further partitioning required once testing begins. Dynamic partitioning involves dividing the input domain dynamically, often at the same time that each new test case is generated. There have been many dynamic partitioning schemas proposed, including random partitioning [59], bisection partitioning [59], [60], and iterative partitioning [61], [62].
1) Static partitioning [58]: Static partitioning divides the input domain into a fixed number of equally-sized subdomains, with these subdomains then remaining unchanged throughout the entire testing process. This is simple, but influenced by the tester: testers need to divide the input domain before testing, and different numbers of subdomains may result in different ART performance.
2) Random partitioning [59]: Random partitioning uses the generated test case tc as the breakpoint to divide the input (sub-)domain D i into smaller subdomains. This partitioning usually results in the input domain D being divided into subdomains of unequal size.
3) Bisection partitioning [59], [60]: Similar to static partitioning, bisection partitioning divides the input domain into equally-sized subdomains. However, unlike static partitioning, bisection partitioning dynamically bisects the input domain whenever the partitioning condition is triggered. There are a number of bisection partitioning implementations. Chen et al. [59], for example, successively bisected dimensions of the input domain; whenever the $i$-th bisection of a dimension resulted in $2^i$ parts, the $(i + 1)$-th bisection was then applied to another dimension. Chow et al. [60] bisected all dimensions at the same time, with the input domain $D$ being divided into $2^{i \cdot d}$ subdomains (where $d$ is the dimensionality of $D$) after the $i$-th bisection. Bisection partitioning does not change existing partitions during bisection, only bisecting the subdomains in the next round.
4) Iterative partitioning [61], [62]: In contrast to bisection partitioning, iterative partitioning modifies existing partitions, resulting in the input domain being divided into equally-sized subdomains. Each round of iterative partitioning divides the entire input domain $D$ using a new schema. After the $i$-th round of partitioning, for example, Chen et al. [61] divided the input domain into $i^d$ subdomains, with each dimension divided into $i$ equally-sized parts. Mayer et al. [62], however, divided only the largest dimension into equally-sized parts, leaving other dimensions unchanged, resulting in a dimension with $j$ parts being divided into $j + 1$ parts.
Although random partitioning may divide the input domain into subdomains with different sizes, the other three partitioning approaches result in equally-sized subdomains.
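The growth in the number of subdomains under the dynamic schemas can be illustrated with a small sketch. The function names are hypothetical, the input domain is assumed to be the unit hypercube $[0, 1)^d$, and the cell-index helper is an illustrative addition rather than part of any cited implementation.

```python
from typing import Sequence, Tuple


def bisection_subdomain_count(i: int, d: int) -> int:
    """Subdomains after the i-th bisection when all d dimensions are bisected at once."""
    return 2 ** (i * d)


def iterative_subdomain_count(i: int, d: int) -> int:
    """Subdomains after the i-th round when every dimension is cut into i equal parts."""
    return i ** d


def iterative_cell_index(point: Sequence[float], i: int) -> Tuple[int, ...]:
    """Grid cell containing `point` (coordinates assumed in [0, 1)) in round i."""
    return tuple(min(int(x * i), i - 1) for x in point)
```

For a two-dimensional domain, two simultaneous bisections give 16 subdomains, whereas the third iterative round gives 9, which shows how much more gradually the iterative schema refines the grid.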

Subdomain-Selection Component
After partitioning the input domain $D$ into $m$ subdomains $D_1, D_2, \cdots, D_m$, the next step is to choose the subdomain where the next test case will be generated. The following criteria can be used to support this subdomain selection process:
1) Maximum size [59]: Given the set $T$ of previously generated test cases, among those subdomains $D_i$ ($1 \le i \le m$) without any test cases, the largest one is selected for generation of the next test case: $D_i \cap T = \emptyset$ and $|D_i| \ge |D_j|$, $\forall j \in \{1, 2, \cdots, m\}$ satisfying $D_j \cap T = \emptyset$.
2) Fewest previously generated test cases [59], [60]: Given the set $T$ of previously generated test cases, this criterion chooses a subdomain $D_i$ containing the fewest test cases: $|D_i \cap T| \le |D_j \cap T|$, $\forall j \in \{1, 2, \cdots, m\}$.
3) No test cases in target or neighbor subdomains [58], [61], [62]: This ensures that the selected subdomain not only contains no test cases, but also does not neighbor other subdomains containing test cases.
4) Proportional selection [63]: Proportional selection uses two dynamic probability values, $p_1$ and $p_2$, to represent the likelihood that some (or all) elements of the failure region are located in the edge or center regions, respectively. Kuo et al. [63], for example, used two equally-sized subdomains in their proportional selection implementation, with each test case selected from either the edge or center region based on the value of $p_1 / p_2$.
Three criteria (fewest previously generated test cases, no test cases in target or neighbor subdomains, and proportional selection) require that all subdomains be the same size; only one (maximum size) has no such requirement. Furthermore, two criteria (maximum size and no test cases in target or neighbor subdomains) generally select one test case per subdomain; the other two (fewest previously generated test cases and proportional selection) may select multiple test cases from each subdomain.
Intuitively speaking, three criteria (maximum size, fewest previously generated test cases, and proportional selection) follow Rationale 2. The maximum size criterion chooses the largest subdomain without any test cases as the target location for the next test case: test selection from a larger subdomain may have a better chance of achieving a good distribution of test cases. Similarly, the fewest previously generated test cases and proportional selection criteria ensure that subdomains with fewer test cases have a higher probability of being selected. The third criterion (no test cases in target or neighbor subdomains) follows both Rationale 1 and Rationale 2, choosing a target subdomain without (and away from) any test cases, thereby achieving a good test case distribution. Furthermore, because this criterion also avoids subdomains neighboring those containing test cases, the subsequently generated test cases generally have a minimum distance from all others.
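To make the interplay of the two PBS components concrete, the following is a minimal one-dimensional sketch that pairs random partitioning with the maximum size criterion. The function name, the interval representation, and the fixed seed are assumptions made for illustration; it is not presented as the implementation of [59].

```python
import random
from typing import List, Tuple


def art_random_partitioning(n: int, lo: float = 0.0, hi: float = 1.0,
                            seed: int = 0) -> List[float]:
    """Generate n test cases in [lo, hi] using random partitioning + maximum size."""
    rng = random.Random(seed)
    # Every stored interval is free of test cases in its interior by construction.
    subdomains: List[Tuple[float, float]] = [(lo, hi)]
    tests: List[float] = []
    for _ in range(n):
        # Subdomain selection: pick the largest subdomain.
        a, b = max(subdomains, key=lambda s: s[1] - s[0])
        tc = rng.uniform(a, b)
        tests.append(tc)
        # Partitioning schema: the new test case becomes the breakpoint.
        subdomains.remove((a, b))
        subdomains.extend([(a, tc), (tc, b)])
    return tests
```

Because each new test case splits the subdomain it was drawn from, no distance computations are needed, which is why this family of approaches is cheap compared with candidate-based selection.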

Test-Profile-Based Strategy
The Test-Profile-Based Strategy (TPBS) [64] generates test cases based on a well-designed test profile (different from the uniform distribution), dynamically updating the profile after each test case selection. When a test case is executed without causing failure, its selection probability is then assigned a value of 0.

Pseudocode framework for TPBS.
Set E ← {}
Randomly generate a test case tc from D, according to uniform distribution
Add tc into E, i.e., E ← E ∪ {tc}
while The stopping condition is not satisfied do
    Adjust the test profile based on already selected test cases from E  ▷ Test-profile-adjustment component
    Randomly generate the next test case tc based on adjusted test profile
    E ← E ∪ {tc}
end while
Report the result and exit

Test-Profile-Adjustment Component
Based on the intuitions underlying ART [9], a test profile should be adjusted to satisfy the following [64]:
• The closer a test input is to the previously executed test cases, the lower the selection probability that it is assigned should be.
• The farther away a test input is from previously executed test cases, the higher the selection probability that it is assigned should be.
• The probability distribution should be dynamically adjusted to maintain these two features.
A number of test profiles exist to describe the probability distribution of test cases, including the triangle profile [65], cosine profile [65], semicircle profile [65], and power-law profile [66]. Furthermore, the probabilistic ART implementation [50] uses a similar mechanism to TPBS.
Because the test profiles use the location of non-failure-causing test cases when assigning the selection probability of each test input from the input domain, TPBS obviously follows Rationale 1: If a test input is farther away from non-failure-causing test cases than other candidates, it has a higher probability of being chosen as the next test case.
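The profile-adjustment idea can be sketched on a discretized one-dimensional domain. The profile used here (weight equal to the distance to the nearest executed test case) is a deliberately simple stand-in chosen for illustration; it is not one of the published triangle, cosine, semicircle, or power-law profiles, and the function names and grid resolution are assumptions.

```python
import random
from typing import List


def profile_weights(grid: List[float], executed: List[float]) -> List[float]:
    """Selection weight of each grid point: distance to the nearest executed test case."""
    if not executed:
        return [1.0] * len(grid)  # uniform profile before any test case is executed
    return [min(abs(x - e) for e in executed) for x in grid]


def tpbs(n: int, resolution: int = 1001, seed: int = 0) -> List[float]:
    """Draw n test cases from [0, 1], re-adjusting the profile after each one."""
    rng = random.Random(seed)
    grid = [k / (resolution - 1) for k in range(resolution)]
    executed: List[float] = []
    for _ in range(n):
        weights = profile_weights(grid, executed)  # test-profile adjustment
        tc = rng.choices(grid, weights=weights, k=1)[0]
        executed.append(tc)
    return executed
```

An executed grid point has distance 0 to itself, so its weight becomes 0, mirroring the rule that a non-failure-causing test case is assigned a selection probability of 0.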

Quasi-Random Strategy
The Quasi-Random Strategy (QRS) [67], [68] applies quasi-random sequences to the implementation of ART. Quasi-random sequences are point sequences with low discrepancy and low dispersion: As discussed by Chen et al. [31], a set of points with lower discrepancy and dispersion generally has a more even distribution. Furthermore, the computational overhead incurred when generating $n$ quasi-random test cases is only $O(n)$, which is similar to that of pure RT. In other words, QRS can achieve an even spread of test cases with a low computational cost. The motivation for involving randomization in the process is that quasi-random sequences are usually generated by deterministic algorithms, which means that the sequences violate a core principle of ART: the randomness of the test cases.

Pseudocode framework for QRS.
Set E ← {}
while The stopping condition is not satisfied do
    Generate the next element ts from a given quasi-random sequence  ▷ Quasi-random-sequence-selection component
    Randomize ts as the test case tc according to the specific criterion  ▷ Randomization component
    E ← E ∪ {tc}
end while
Report the result and exit

Quasi-Random-Sequence-Selection Component
A number of quasi-random sequences have been examined, including Halton [69], Sobol [70], and Niederreiter [71]. In this section, we only describe some representative sequences for quasi-random testing.
1) Halton sequence [69]: The Halton sequence can be considered the $d$-dimensional extension of the Van der Corput sequence, a one-dimensional quasi-random sequence [72] defined as:
$$\phi_b(i) = \sum_{j=0}^{\omega} i_j \, b^{-j-1},$$
where $b$ is a prime number, $\phi_b(i)$ denotes the $i$-th element of the Van der Corput sequence, $i_j$ is the $j$-th digit of $i$ (in base $b$), and $\omega$ denotes the lowest integer for which $\forall j > \omega$, $i_j = 0$ is true. For a $d$-dimensional input domain, therefore, the $i$-th element of the Halton sequence can be defined as
$$Halton(i) = (\phi_{b_1}(i), \phi_{b_2}(i), \cdots, \phi_{b_d}(i)),$$
where the bases $b_1, b_2, \cdots, b_d$ are pairwise coprime. Previous studies [72], [73] have used the Halton sequence to generate test cases.
2) Sobol sequence [70]: The Sobol sequence can be considered a permutation of the binary Van der Corput sequence, $\phi_2(i)$, in each dimension [72]. Its definition involves the following quantities: $Sobol(i)$ represents the $i$-th element of the Sobol sequence, $i_j$ is the $j$-th digit of $i$ in binary, $\omega$ denotes the number of digits of $i$ in binary, and $\beta_1, \beta_2, \cdots, \beta_r$ come from the coefficients of a degree $r$ primitive polynomial over the finite field. Previous studies [72]- [74] have used this sequence for test case generation.
3) Niederreiter sequence [71]: The Niederreiter sequence may be considered to provide a good reference for other quasi-random sequences, because all the other approaches can be described in terms of what Niederreiter calls $(t, d)$-sequences [71]. As discussed by Chen and Merkel [67], the Niederreiter sequence has lower discrepancy than other sequences. Previous investigations [67], [73] have used Niederreiter sequences to conduct software testing.
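As an illustration of how cheaply such sequences can be produced, the following sketch computes Van der Corput and Halton points using the standard radical-inverse construction (the digits of the index are mirrored about the radix point). The function names and the default bases (2, 3) are choices made for this example.

```python
from typing import Sequence, Tuple


def van_der_corput(i: int, base: int = 2) -> float:
    """i-th Van der Corput element: mirror the base-`base` digits of i about the radix point."""
    value, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)  # peel off the least significant digit
        denom *= base
        value += digit / denom
    return value


def halton(i: int, bases: Sequence[int] = (2, 3)) -> Tuple[float, ...]:
    """i-th Halton point: one Van der Corput coordinate per (pairwise coprime) base."""
    return tuple(van_der_corput(i, b) for b in bases)
```

In base 2, the indices 1, 2, 3 map to 0.5, 0.25, 0.75, so each new point falls into the largest gap left by its predecessors. Each element costs time proportional to the number of digits of the index.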

Randomization Component
The randomization step involves randomizing the quasi-random sequences into actual test cases. The following three representative methods illustrate this.
1) Cranley-Patterson Rotation [75], [76]: This generates a random $d$-dimensional vector, which is then used to shift every point of the quasi-random sequence, modulo 1 in each coordinate.
2) Owen Scrambling [77]: Owen Scrambling applies the randomization process to the Niederreiter sequence (a $(t, d)$-sequence in base $b$). The $i$-th point in the sequence can be written as $ts_i = (ts_{i1}, ts_{i2}, \cdots, ts_{id})$, where $ts_{ij} = \sum_{k=1}^{\infty} a_{ijk} \, b^{-k}$. The permutation process is applied to the parameter $a_{ijk}$ for each point according to some criteria. Compared with Cranley-Patterson rotation, Owen Scrambling more precisely maintains the low discrepancy and low dispersion of quasi-random sequences [68].
3) Random Shaking and Rotation [74], [78]: This first uses a non-uniform distribution (such as the cosine distribution) to shake the coordinates of each item in the quasi-random sequence into a random number within a specific value range. Then, a random vector based on the non-uniform distribution is used to permute the coordinates of all points in the sequence.
QRS generates a list of test cases (a quasi-random sequence) with a good distribution (including discrepancy [31]), indicating that it follows Rationale 2.
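The simplest of the randomization methods, Cranley-Patterson rotation, can be sketched as follows, using its common definition: a single random vector is added to every point of the sequence, with wrap-around at 1. The function name and the seed parameter are assumptions for illustration, and points are assumed to lie in the unit hypercube.

```python
import random
from typing import List, Sequence, Tuple


def cranley_patterson(points: Sequence[Sequence[float]],
                      seed: int = 0) -> List[Tuple[float, ...]]:
    """Shift all points by one random vector, modulo 1 in every coordinate."""
    rng = random.Random(seed)
    d = len(points[0])
    shift = [rng.random() for _ in range(d)]  # one vector for the whole sequence
    return [tuple((x + s) % 1.0 for x, s in zip(p, shift)) for p in points]
```

Because every point receives the same shift, the relative positions of the points (modulo 1) are unchanged, while the absolute positions differ from run to run.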

Search-Based Strategy
The Search-Based Strategy (SBS), which comes from Search Based Software Testing (SBST) [79], [80], uses search-based algorithms to achieve the even-spreading of test cases over the input domain. In contrast to other ART strategies, SBS aims to address the question: Given a test set $E$ of size $N$ ($|E| = N$), due to limited testing resources, how can $E$ achieve an even spread of test cases over the input domain, thereby enhancing its fault detection ability? SBS needs to assign a parameter (the number of test cases $N$) before test case generation begins. Fig. 10 shows a pseudocode framework for SBS. Because ART requires that test cases have some randomness, SBS generates an initial test set population $PT$ (of size $ps$) where each test set (of size $N$) is randomly generated. A search-based algorithm is then used to iteratively evolve $PT$ into its next generation. Once a stopping condition is satisfied, the best solution from $PT$ is selected as the final test set. Two core elements of SBS, therefore, are the choice of search-based algorithm for evolving $PT$, and the evaluation (fitness) function for each solution. Because the fitness function is also involved in the evolution process (to evaluate the $PT$ updates), we do not consider it a separate SBS component.

Fig. 10. Pseudocode framework for SBS.
Set the number of test cases N
E ← {}
Generate an initial population of test sets PT = {T_1, T_2, ..., T_ps}, each of which is randomly generated with size N according to uniform distribution, where ps is the population size
while The stopping condition is not satisfied do
    Evolve PT to construct a new population of test sets PT′ by using a given search-based algorithm  ▷ Evolution component
    PT ← PT′
end while
E ← the best solution of PT
Report the result and exit

Evolution Component
A number of search-based algorithms have been used to evolve ART test sets, including the following:
1) Hill Climbing (HC) [81]: HC makes use of a single initial test set $T$, rather than a population of test sets $PT$ (i.e., $ps = 1$). The basic idea behind HC is to calculate the fitness of $T$, and to shake it for as long as the fitness value increases. One HC fitness function is the minimum distance between any two test cases in $T$, where the distance is a specific Euclidean distance [82]:
$$fitness(T) = \min_{tc_i, tc_j \in T,\ i \ne j} dist(tc_i, tc_j).$$
2) Simulated Annealing (SA) [83]: Similar to HC, SA also only uses a single test set $T$ ($ps = 1$). During each iteration, SA constructs a new test set $T'$ by randomly selecting input variables from $T$ with a mutation probability and modifying their values. The fitness values of both $T$ and $T'$ are then calculated. If the fitness of $T'$ is greater than that of $T$, then $T'$ is accepted as the current solution for the next iteration. If the fitness of $T'$ is not greater than that of $T$, then the acceptance of $T'$ is determined by a controlled probability function using random numbers and the temperature parameter adopted in SA. Bueno et al. [83] defined the fitness function of $T$ as the sum of distances between each test case and its nearest neighbor:
$$fitness(T) = \sum_{tc \in T} \min_{tc' \in T \setminus \{tc\}} dist(tc, tc'). \tag{5.19}$$
3) Genetic Algorithm (GA) [83]: GA uses a population of test sets rather than just a single one, and three main operations: selection, crossover, and mutation. GA first chooses the test sets for the next generation by assigning a selection probability: Bueno et al. [83] used a selection probability proportional to the fitness of $T$ (calculated with Eq. (5.19)). The crossover operation then generates offspring by exchanging partial values of test cases between pairs of test sets, and the mutation operation randomly changes some partial values.
4) Simulated Repulsion (SR) [84]: Similar to GA, SR makes use of a population of test sets, $PT$, with each solution $T_i \in PT$ ($1 \le i \le ps$) evolving independently from, and concurrently to, the other test sets. In each SR iteration, each test case from each solution $T_i$ updates its value based on Newtonian mechanics with electrostatic force from Coulomb's Law. A test case $tc \in T_i$ is moved according to its resultant force $\overrightarrow{RF}(tc)$ and a constant $m$ (the mass of all test cases), where the resultant force depends on $Q$, the current value of electric charge for the test cases.

5) Local Spreading (LS) [85]: Similar to HC and SA, LS also only uses a single initial test set $T$. LS successively moves each point $tc \in T$ that is allowed to move according to the following: $tc$'s first and second nearest neighbors in $T$, $tc_f$ and $tc_s$, are identified, and the corresponding distances, $d_f$ and $d_s$, are calculated. A direction of movement is identified related to $tc_f$. Then, $tc$ is moved a small distance (related to $d_s - d_f$) in the identified direction, slightly increasing the minimum distance from $tc$ to its nearest neighbor (the distance between $tc$ and $tc_f$). These steps are repeated until there are no points remaining that can still move. LS effectively attempts to increase the minimum distance among all test cases in $T$, thereby producing a more evenly-spread test set.
6) Random Border Centroidal Voronoi Tessellations (RBCVT) [73]: RBCVT uses an initial test set $T$ of size $N$, and makes use of Centroidal Voronoi Tessellations (CVT) [86] to achieve an even spread of the $N$ test cases over the input domain, $D$. It constructs $N$ disjoint cells around the initial $N$ test cases using a Voronoi tessellation with a random border point set. RBCVT then calculates the centroid of each Voronoi region to obtain $N$ new points for the next generation and evolution.
Since SBS achieves an even spread of the N test cases over the input domain, many studies have used test suites generated by other ART approaches to replace the random test suites, to speed up the evolution process. Shahbazi et al. [73], for example, used RBCVT to improve the quality of test suites obtained from STFCS and QRS. Huang et al. [85] have also argued that it would be better to use adaptive random test suites than random test suites as the input for LS.
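The two distance-based fitness functions described for HC and for SA/GA, together with a very small hill climber, can be sketched as follows. The perturbation scheme (Gaussian noise on one test case, clamped to the unit hypercube) and all names are assumptions made for illustration; this is a simplification and not the procedure of [81] or [83].

```python
import math
import random
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, ...]


def min_pairwise_distance(tests: List[Point]) -> float:
    """HC-style fitness: smallest Euclidean distance between any two test cases."""
    return min(math.dist(a, b) for a, b in combinations(tests, 2))


def sum_nearest_neighbor_distance(tests: List[Point]) -> float:
    """SA/GA-style fitness: sum of each test case's distance to its nearest neighbor."""
    return sum(
        min(math.dist(a, b) for j, b in enumerate(tests) if j != i)
        for i, a in enumerate(tests)
    )


def hill_climb(tests: List[Point], steps: int = 200, sigma: float = 0.05,
               seed: int = 0) -> List[Point]:
    """Perturb one test case at a time; keep the move only if the fitness improves."""
    rng = random.Random(seed)
    best, best_fit = list(tests), min_pairwise_distance(tests)
    for _ in range(steps):
        k = rng.randrange(len(best))
        moved = tuple(min(1.0, max(0.0, x + rng.gauss(0.0, sigma))) for x in best[k])
        candidate = best[:k] + [moved] + best[k + 1:]
        fit = min_pairwise_distance(candidate)
        if fit > best_fit:
            best, best_fit = candidate, fit
    return best
```

Each fitness evaluation here is quadratic in the test set size, which hints at why seeding the search with an already well-spread test set can shorten the evolution.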

Hybrid Strategies
Hybrid Strategies (HSs) aim at improving the testing effectiveness (such as fault detection capability) or efficiency (such as test generation cost) by combining multiple ART approaches.

STFCS + PBS
The STFCS + PBS hybrids aim to enhance the effectiveness of either STFCS or PBS.
From the perspective of STFCS enhancement, Chen et al. [87], [88], when generating the $m$-th test case, divided the input domain $D$ into $m$ disjoint, equally-sized subdomains, $D_1, D_2, \cdots, D_m$, from the edge to the center of $D$, such that: $D = D_1 \cup D_2 \cup \cdots \cup D_m$, $D_i \cap D_j = \emptyset$ ($i \ne j$), and $|D_1| = |D_2| = \cdots = |D_m|$. Next, for the STFCS random-candidate-construction component, they generated random test cases in those subdomains not already containing test cases. Mayer [89] used bisection partitioning to control the STFCS test-case-identification component, only checking the distance from a candidate $c$ to points in its neighboring regions, instead of to all points. These methods could significantly reduce the STFCS computational overheads, for both FSCS and RRT. Mao et al. [90] proposed a similar method, distance-aware forgetting, to reduce the FSCS computational overheads, but they used static, rather than bisection, partitioning. Chow et al. [60] proposed a new efficient and effective method called ART with divide-and-conquer that independently applies STFCS to each subdomain (using bisection partitioning). Previous studies [55]- [57] have combined STFCS with static partitioning, using the concept of mirroring to reduce the computational costs. Enhancements to mirroring have included a revised distance metric [91], and dynamic partitioning with new mirroring functions [92]. Chan et al. [47] applied bisection partitioning to each dimension of the input domain, and then checked the number of executed test cases in each subdomain: candidates in subdomains with fewer executed test cases were then more likely to be selected.
Regarding the enhancement of PBS, Chen and Huang [93] applied the test-case-identification component to improving the effectiveness of PBS with random partitioning, selecting test cases based on the principle of Minimum-Distance and Restriction. Mayer [94] used a similar mechanism to improve the effectiveness of PBS with bisection partitioning. Mao and Zhan [95] also used this mechanism to enhance PBS with bisection partitioning, but instead of Euclidean distance, they used the coordinate distance to boundaries (boundary distance). Similarly, Mayer [96], [97], for PBS with random and bisection partitioning, used exclusion regions in a possible subdomain to generate a new test case. Mao [98], to overcome the drawbacks of random partitioning, proposed a new partitioning schema, two-point partitioning, based on the STFCS test-case-identification component: When generating a new test case $tc$, it randomly chooses two candidates from the subdomain that needs to be partitioned, and then uses the midpoint of $tc$ and the farthest candidate as the break point to partition the subdomain.
In addition to the hybrid methods listed above, other, new ART techniques have been proposed based on other combinations of different concepts. Chen et al. [99], for example, introduced a new test-case-identification criterion (identifying the test case that is closer to the subdomain centroid), and combined it with PBS with bisection partitioning to form a new technique: ART by balancing. Mayer [100] proposed a new approach, lattice-based ART, that uses bisection partitioning to divide the input domain for lattice generation. It then generates test cases by permuting the lattices within a restricted region. Chen et al. [101] enhanced Mayer's lattice-based ART [100] by refining the restricted regions for each lattice. Sabor and Mohsenzadeh [102], [103] proposed an enhanced version of Chen and Huang's method [93] by including an enlarged input domain [104].

STFCS + SBS
The STFCS + SBS hybrids either enhance STFCS, or represent new methods. Tappenden and Miller [105] proposed Evolutionary ART, a new method that aims to construct an evolved test set for the STFCS test-case-identification. The method initially generates a fixed-size random test set, according to a uniform distribution. Until a stopping condition is satisfied (for example, 100 generations have been completed [105]), each iteration uses an evolutionary algorithm, a Genetic Algorithm (GA), to evolve the test set. The fitness function used during the evolution stage is the same as Eq. (5.2), and the candidate with the highest fitness value is then selected as the next test case.
Iqbal et al. [106] combined STFCS with SBS to produce a hybrid that initially uses GA to generate test cases, but if no fitter test cases are found after running a number of iterations, then the algorithm switches to FSCS to generate the following test cases.

TPBS + PBS or STFCS
The TPBS + PBS or STFCS hybrids aim to augment the TPBS test profiles using PBS or STFCS principles. Liu et al. [107] proposed three methods to design test profiles, based on STFCS restriction (exclusion), and on PBS subdomain-selection criteria (maximum size [59] and least number of previously generated test cases [59], [60]). Using exclusion, all points inside the exclusion regions should have no chance of being selected as test cases: their probability of selection is 0. Using the maximum size [59], all points inside the largest subdomain have a chance of being chosen as test cases, and all other points (those in other subdomains) have no chance. Similarly, when using the least number of previously generated test cases [59], [60], all points within the subdomains with the least number of previously generated test cases have a chance of being selected, and all others have no chance.

Strengths and Weaknesses
Previous studies [18] have confirmed that ART is more effective than RT in general, according to several different evaluations. As discussed by Chen et al. [108], however, both favorable and unfavorable conditions exist for ART. In this section, therefore, we summarize the strengths and weaknesses of ART.
2) Fault detection capability: It is natural that ART should have better fault detection ability than RT when the failure region is clustered: ART was specifically designed to make use of this clustering information. Chen et al. [108] investigated the factors impacting on ART fault detection ability, identifying a number of favorable conditions for ART, including: (a) when the failure rate is small; (b) when the failure region is compact; (c) when the number of failure regions is low; and (d) when a predominant region exists among all the failure regions. When any of these conditions are satisfied, ART generally has better fault detection performance than RT. Even when none of the conditions are satisfied, ART can achieve comparable fault detection to RT.
3) Code coverage: Studies have shown that ART achieves greater structure-based code coverage than RT for the same number of test cases [83], [84], [109], [110]. Bueno et al. [83], [84] have observed that SBS outperforms RT for data-flow coverage [111] (including all-uses coverage and all-du-paths coverage). Chen et al. [109], [110] have reported that FSCS is more effective than RT for both control-flow coverage [112] (including block coverage and decision coverage), and data-flow coverage (c-uses coverage and p-uses coverage [111]).
4) Reliability estimation and assessment: For the same number of test cases, ART has greater code coverage than RT [83], [84], [109], [110]. It has also been observed that coverage can be used to improve the effectiveness of software reliability estimation [113]. Compared with RT, therefore, ART should enable better software reliability estimation, and higher confidence in the reliability of the SUT, even when no failure is detected. Unfortunately, this strength of ART is derived from theoretical results rather than empirical studies: no ART studies have yet investigated reliability estimation and assessment.
5) Cost-effectiveness: The cost-effectiveness of testing considers both effectiveness (e.g., fault detection) and efficiency (including test case generation and execution time). ART cost-effectiveness has often been examined using the F-time [61], which is the amount of computer execution time required to detect the first failure (including the time for both generation and execution of test cases). Because ART involves additional computation to evenly spread the test cases over the input domain [17], [18], ART should naturally take more time than RT to generate the same number of test cases, suggesting that it may have worse cost-effectiveness than RT. However, studies [18], [92], [114]- [116] have shown that compared with RT, ART typically requires less time to identify the first failure (F-time); therefore, ART can be more cost-effective than RT. In general, three main conditions can result in ART achieving a better cost-effectiveness than RT: (a) ART using fewer test cases than RT to detect the first failure (F-measure); (b) the computational overhead of the ART approach being acceptable (comparable or slightly higher than that of RT); or (c) the combined program execution and test setup time being more than the time required by ART to generate a test case.

Weaknesses
There are three main challenges associated with some ART approaches: the boundary effect, computational overheads, and the high dimension problem.
1) Boundary effect [117]: Some ART approaches tend to generate more test cases near the boundary than near the input domain center, a situation known as the boundary effect. One reason for the boundary effect, as explained in the context of RRT [33]- [36], is that test cases cannot be generated outside the boundary, which reduces the number of sources of restriction for regions close to the boundary. Both FSCS and RRT have been shown to suffer from the boundary effect, especially when the failure rate and dimensionality are high [117].
A number of attempts have been made to address the boundary effect. Some studies increased the selection probability of candidates from the input domain center over those from the boundary [37]- [39], [63], [87], [88]. Other studies have removed the boundaries themselves, either by joining boundaries [82], [118], [119], or by extending the input domain beyond the original boundaries [104], [117], [120]. Chen et al. [99] preferred to choose test cases close to the input domain centroid. Mayer [121] initially selected test cases within a small region around the center of the input domain, extending the region if no failures were identified.
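One way to picture the idea of joining boundaries is to measure distances on a torus, so that a point near one edge is treated as close to points near the opposite edge. The sketch below assumes a unit hypercube input domain; the function name is hypothetical, and whether the cited studies use exactly this metric is an assumption.

```python
import math
from typing import Sequence


def torus_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance in the unit hypercube with opposite faces joined."""
    total = 0.0
    for x, y in zip(a, b):
        delta = abs(x - y)
        delta = min(delta, 1.0 - delta)  # wrap around the joined boundary if shorter
        total += delta * delta
    return math.sqrt(total)
```

With this metric, the points 0.05 and 0.95 in a one-dimensional domain are about 0.1 apart instead of 0.9, so a distance-based selection criterion no longer favors the edges merely because nothing lies beyond them.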
2) Computational overheads [17], [18]: ART approaches typically incur heavier computational costs than RT to generate test cases. This is particularly the case for some ART approaches, such as FSCS and RRT. When testing, the actual test execution time can be an important factor that may mitigate the computational overheads: When the test execution time is very long, for example, the ART computational costs may be more acceptable. However, because the test execution time depends mainly on characteristics of the SUT, and not on the test generation, we do not discuss it here.
The time complexities of FSCS and RRT are of the order of $O(n^2)$ and $O(n^2 \log n)$, respectively (where $n$ is the number of generated test cases) [32], [122]. The PBS and QRS techniques of ART are significantly more efficient than others (such as STFCS and SBS). Many hybrid approaches have been developed to reduce the overheads incurred by FSCS and RRT. Other techniques for ART overhead reduction have also been explored. Chan et al. [36], [123], for example, used a square exclusion region version of RRT to reduce the distance calculation overheads (though not the algorithm's complexity, $O(n^2)$). Chen and Merkel [124] applied Voronoi diagrams [125] to reduce the FSCS distance calculations, lowering the complexity from $O(n^2)$ to $O(n^{4/3})$.
Mirroring [55]- [57], [92] has been used to directly generate mirror test cases of a source test case based on a principle of symmetry of subdomains. Although the time complexity for FSCS mirroring can still be $O(n^2 / m^2)$ [55]- [57] (where $m$ is the number of subdomains), an enhanced mirroring technique can reduce this to $O(n)$ [92]. Shahbazi et al. [73] have also proposed a linear-order ($O(n)$) ART approach: a fast search algorithm for RBCVT (RBCVT-Fast).
While the overhead-reduction approaches listed above can only be applied to numeric input domains, forgetting [126], which reduces overheads by omitting some previous test cases from calculations, has no such limitation. Based on the Category-Partition Method [127], Barus et al. [128] have also recently introduced a linear-order FSCS algorithm using the test-case-identification with Average-Distance for non-numeric inputs.
3) High dimension problem [31], [129]: It has been observed that the effectiveness of some ART approaches may decrease when the number of dimensions of the input domain increases, due to the curse of dimensionality [130]. Because the center of a high dimensional input domain has a higher probability of being a failure region than the boundary [99], [121], it has been suggested that the boundary effect may impact (or even cause) the high dimension problem. Approaches for addressing the boundary effect [37]- [39], [63], [82], [87], [88], [99], [104], [117]- [121], [131], therefore, may also help to alleviate the high dimension problem. However, because their effectiveness is not constant across dimensions, current approaches may alleviate, but do not solve, the dimensionality problem [81]. Furthermore, finding a solution with consistent effectiveness across all dimensions seems unlikely [81], so some decrease in effectiveness in higher dimensions may need to be tolerated. Nevertheless, Schneckenburger and Schweiggert [81] have combined Hill Climbing with continuous distance [122] to produce a search-based ART approach that (slightly) reduces the dependency on dimensionality.

1) Based on different concepts for the even-spreading of test cases, the various ART approaches can be classified into the following six categories: Select-Test-From-Candidates Strategy (STFCS); Partitioning-Based Strategy (PBS); Test-Profile-Based Strategy (TPBS); Quasi-Random Strategy (QRS); Search-Based Strategy (SBS); and Hybrid Strategies (HSs).
2) For each of the first five categories, a framework has been presented showing the basic steps involved.
3) Compared with RT, ART generally performs better when certain conditions hold (identified as "favorable conditions" for ART), according to test case distribution, fault detection capability, code coverage, reliability estimation and assessment, and cost-effectiveness.
4) ART suffers from three main weaknesses: boundary effect, computational overheads, and the high dimension problem.

ANSWER TO RQ3: IN WHAT DOMAINS AND APPLICATIONS HAS ART BEEN APPLIED?
Of the 140 papers in our survey, 80 involved application of ART to specific testing problems. Fig. 11 presents the distribution of the ART applications, showing that 76% of studies focused on various testing environments (or programs), and 24% involved application of ART to other testing techniques.

Application Domains
Based on the classification shown in Fig. 11, numeric programs (47%) have been the most popular domain for ART application, followed by object-oriented programs (7%), embedded systems (6%), web services and applications (4%), configurable systems (4%), and simulations and modelling (3%). A further 5% of the papers addressed other application domains, including mobile applications and aspect-oriented programs.

Numeric Programs
Chen et al. [32] applied ART to the testing of twelve open-source numeric analysis programs from Numerical Recipes [132] and ACM Collected Algorithms [133], written in C or C++. These programs have also been widely used in other ART studies, and, in addition to C and C++, have also been implemented in Java. Zhou et al. [52] applied ART to three other numeric programs from the Numerical Recipes [132]. Arcuri and Briand [134] conducted experiments on a further nine numeric programs for basic algorithms [135] and for mathematical routines from the Numerical Recipes [132]. Chen et al. [109] used ART with two numeric programs from the GNU Scientific Library [136], and one from the Software-artifact Infrastructure Repository (SIR) [137]. Walkinshaw and Fraser [138] investigated six units within the Apache Commons Math framework [139] and two units within JodaTime [140].

Object-Oriented Programs
Ciupa et al. [40], [141] defined a new similarity metric for object-oriented (OO) inputs, and integrated it into ART (ARTOO). They compared ARTOO with RT using eight real-life, faulty OO programs from the EiffelBase Library [142]. Lin et al. [114] used their ART approach on six OO programs containing manually-seeded faults: five of these programs were from the Apache Common library [143], and one was a wide-area event notification system, Siena. Chen et al. [144] also proposed a new similarity metric for OO inputs, using it to apply ART to 17 OO programs (five C++ libraries and 12 C# programs) from the following open-source repositories: Codeforge [145], Sourceforge [146], Codeplex [147], Codeproject [148], and Github [149]. Putra and Mursanto [44] compared two ART techniques applied to eight OO programs, written in Java, from the Apache Common library [143]. Jaygarl et al. [150] evaluated different RT and ART techniques applied to four open-source OO programs (written in Java).

Configurable Systems
Chen et al. [110] compared the code coverage achieved by ART and RT using ten UNIX utility programs that can be considered configurable systems: they are influenced by different configurations or factors, obtained using the Category-Partition Method (CPM) [127]. Huang et al. [151] applied ART to combinatorial input domains (configurable input domains), testing five small C programs [152], and four versions of another configurable system, Flex (a fast lexical analyzer), from the SIR [137]. They also identified the configurable input domains using CPM. Barus et al. [128] proposed an efficient ART approach, and applied it to 14 configurable systems and programs: seven Siemens programs from the SIR [137]; six UNIX utilities; and one regular expression processor from the GNU Scientific Library [136].

Web Services and Applications
Tappenden and Miller [153] applied an evolutionary ART algorithm to cookie collection testing, applying it to six open-source web applications written in C# and PHP. Selay et al. [115] used ART in image comparisons when testing a set of real-world, industrial web applications. Chen et al. [154] developed a system to test web service vulnerabilities that generates ART test cases based on Simple Object Access Protocol (SOAP) messages: twenty web services (both open-source and specifically written services) were examined in their study.

Embedded Systems
Hemmati et al. [155]-[157] applied ART to two industrial embedded systems: a core subsystem of a video-conference system, implemented in C; and a safety monitoring component of a safety-critical control system written in C++. Arcuri et al. [106], [158] compared RT, ART, and search-based testing using a real-life real-time embedded system. This very large and complex seismic acquisition system, implemented in Java, interacts with many sensors and actuators.

Simulations and Models
Matinnejad et al. [159] applied an ART approach to test three Stateflow models of mixed discrete-continuous controllers: two industrial models from Delphi [160], Supercharger Clutch Controller (SCC) and Auto Start-Stop Control (ASS); and one public domain model, Guidance Control System (GCS), from Mathworks [161]. Sun et al. [162] proposed an enhanced ART approach for testing Architecture Analysis and Design Language (AADL) models [163], reporting on a case study applying it to an Unmanned Aerial Vehicle (UAV) cruise control system, which includes three sensor devices (radar, GPS, and speed devices), and two subsystems (read data calculation, and flight control systems).

Other Domains
Parizi and Ghani [164] conducted a preliminary study of ART for aspect-oriented programs, identifying some potential research directions. Shahbazi and Miller [165] investigated the application of ART to programs with string inputs, comparing the performance of different ART approaches on 19 open-source programs. Liu et al. [116] used ART to test mobile applications, providing a new distance metric for event sequences. Their study examined six real-life mobile applications implemented in Java. Koo and Park [166] investigated ART for SDN (Software-Defined Networking) OpenFlow switches, aiming to generate test packages for switches.

Summary of Main Results for Application Domains
Table 6 summarizes the main results for some of the original studies involving application of ART to different domains ("NR" indicates that some details were not reported in the original paper). Some of the studies did not explicitly provide results of the comparison between RT and ART, and so have been omitted from the table. Information about the programs tested in each application domain is available in Table A.1, in the appendix. In Table 6, the columns "%Effectiveness Improvement" and "%Efficiency Improvement" show the percentage improvements of testing effectiveness and efficiency of ART over RT, respectively. Given an evaluation metric M (for example, to measure testing effectiveness or efficiency), if $M_a$ represents the value for ART, and $M_r$ the value for RT, then the percentage improvement (of ART over RT) is calculated from $M_a$ and $M_r$.
Because of the additional computations involved in ART, it is intuitive that it should take longer than RT to generate the same number of test cases. The "Efficiency" data here, therefore, refers to the time taken to achieve the stopping criterion: for example, the time taken to detect the first failure (the F-time [61]), which includes the combined generation and execution time of all test cases executed before causing the first failure. The "Efficiency" can be considered the cost-effectiveness metric.
Based on Table 6, we have the following observations:
• Across all application domains, ART usually achieves better testing effectiveness than RT, especially for numeric and object-oriented programs.
• For a particular application domain, due to the different characteristics of the various programs, ART may perform differently with different programs.
• Many of the original studies surveyed (including the numeric programs and configurable systems) only provided the information for "%Effectiveness Improvement", not for "%Efficiency Improvement".
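As a rough illustration of the percentage improvement of ART over RT reported in Table 6, the following sketch assumes a relative-difference form normalized by the RT value. The function name `percent_improvement`, the `lower_is_better` flag, and the sample values are hypothetical and introduced here only for illustration; the direction of the difference depends on whether smaller values of the metric M are better (as for the F-measure or F-time) or larger values are better (as for the fault detection ratio).

```python
def percent_improvement(m_art: float, m_rt: float, lower_is_better: bool = True) -> float:
    """Percentage improvement of ART over RT for one metric.

    Assumed form (not taken from the surveyed papers): the difference between
    the RT and ART values, divided by the RT value, times 100.
    """
    if m_rt == 0:
        raise ValueError("the RT value must be non-zero to normalize the difference")
    # For metrics such as the F-measure, a smaller value is an improvement.
    diff = (m_rt - m_art) if lower_is_better else (m_art - m_rt)
    return 100.0 * diff / m_rt


# Hypothetical values: RT needs 1000 test cases on average, ART needs 700.
print(percent_improvement(700.0, 1000.0))
# Hypothetical fault detection ratios, where larger is better.
print(percent_improvement(0.9, 0.6, lower_is_better=False))
```

A negative result under this form would indicate that ART performed worse than RT on that metric.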

Regression Testing
Adaptive random sequences [15] have been extensively applied to regression testing, including for both prioritization and selection [167]. Jiang et al. [42] first used adaptive random sequences (obtained with FSCS [9]) to prioritize regression test suites, proposing a family of approaches that use code coverage to measure the distance between test cases. Other studies [173]-[175] have also used code coverage to guide test case prioritization, but with different distance measures or different ART approaches. Jiang and Chan [176], [177] proposed a family of novel input-based randomized local beam search techniques to prioritize test cases. Chen et al. [178] proposed a method using two clustering algorithms to construct adaptive random sequences for object-oriented software. Zhang et al. [179], based on work on string test case prioritization [180], introduced a method to construct adaptive random sequences using string distance metrics. They later used a different distance metric, based on CPM [127], to propose another method for prioritizing test cases [181]. Huang et al. [182] applied adaptive random prioritization to interaction test suites for combinatorial testing [168]. Zhou [183] used the same code coverage distance information as Zhou et al. [52] to support regression test case selection. Chen et al. [184] extracted tokens reflecting fault-relevant characteristics of the SUT (such as statement characteristics, type and modifier characteristics, and operator characteristics), and used these tokens to represent test cases as vectors for prioritizing test programs for C compilers. Previous investigations have shown that although adaptive random sequences generally incur higher computational costs than random sequences, they are usually significantly more effective in terms of fault detection. Furthermore, adaptive random sequences are also sometimes comparable to test sequences obtained by traditional regression testing techniques [185], in terms of both testing effectiveness and efficiency.
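The idea of prioritizing a regression test suite with an adaptive random sequence can be sketched as follows. This is a simplified illustration, not the procedure of any particular study: the Jaccard distance over coverage sets, the fixed candidate set size `k`, and all names (`jaccard_distance`, `adaptive_random_prioritize`, the sample coverage data) are assumptions made here. Each step samples some not-yet-ordered test cases and appends the one whose nearest already-ordered test case is farthest away.

```python
import random


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Distance between two coverage sets (assumed measure for this sketch)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def adaptive_random_prioritize(coverage: dict, k: int = 10, seed: int = 0) -> list:
    """Order test case names so that consecutive picks are spread out in coverage."""
    rng = random.Random(seed)
    remaining = sorted(coverage)      # deterministic starting order
    first = rng.choice(remaining)     # the first test case is chosen at random
    ordered = [first]
    remaining.remove(first)
    while remaining:
        # Sample up to k candidates from the test cases not yet ordered.
        candidates = rng.sample(remaining, min(k, len(remaining)))
        # Keep the candidate with the largest distance to its nearest ordered test.
        best = max(
            candidates,
            key=lambda c: min(jaccard_distance(coverage[c], coverage[t]) for t in ordered),
        )
        ordered.append(best)
        remaining.remove(best)
    return ordered


# Hypothetical statement coverage for four regression test cases.
suite = {
    "t1": frozenset({1, 2}),
    "t2": frozenset({1, 2, 3}),
    "t3": frozenset({7, 8}),
    "t4": frozenset({2, 3}),
}
print(adaptive_random_prioritize(suite, k=10, seed=1))
```

The repeated distance computations against all previously ordered test cases are the source of the higher computational cost noted above.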
investigated covering arrays constructed by RT, ART, and combinatorial testing, in terms of their ability to identify interaction-triggered failures.
Overall, ART requires far fewer combinatorial test cases to construct covering arrays than RT, and can also detect more interaction-triggered failures for the same number of test cases [186]. ART also performs comparably to traditional combinatorial testing, especially for identifying interaction failures caused by a large number of influencing parameters [188].

Reliability Testing
Liu and Zhu [189] used mutation analysis [23] to evaluate the reliability of ART's fault-detection ability by analyzing the variation in fault detection, concluding that it is more reliable than RT. Cotroneo et al. [190] evaluated the reliability improvement of two ART techniques, FSCS [9] and evolutionary ART [105]: Based on the same operational profile, for the same test budget, they found that ART had comparable delivered reliability [3] to traditional operational testing. For a given reliability level, however, for the same operational profile, ART typically requires significantly fewer test inputs, compared to traditional operational testing techniques.

Active, Fuzzing, and Integration Testing
Based on FSCS-ART [9], Yue et al. [191] proposed two input-driven active testing approaches for multi-threaded programs, with experimental evaluations indicating that the proposed methods are more cost-effective than traditional active testing. Similarly, Sim et al. [192] applied FSCS-ART [9] to fuzzing the Out-Of-Memory (OOM) Killer on an embedded Linux distribution, with results showing that their ART approach for fuzzing requires significantly fewer test cases than RT to identify an OOM Killer failure. Shin et al. [193] proposed an algorithm based on normalized ART for integration and regression tests of units integrated with a front-end software module. The related simulation studies showed that the proposed ART method could be useful for the integration tests.

Distribution of Empirical Evaluations
Of the 140 primary studies examined, 131 (94%) involved empirical evaluations. Fig. 12 shows the distribution of the empirical evaluations in these 131 studies, with 58 papers (44%) having only simulations, 53 (41%) only experiments with real programs, and 20 (15%) containing both simulations and experiments. It can be observed that the number of studies containing only simulations is comparable to the number with only experiments.

Simulations
Simulations that attempt to construct failures in the numeric input domain to resemble real testing situations have frequently been used to evaluate ART techniques. As discussed by Chen et al. [108], three factors (footnote 5) are typically considered in the design of simulations: the dimensionality (d) of the input domain; the failure rate (θ); and the failure pattern. In spite of their popularity, simulations have limitations as evaluation tools, including that: (1) they are mostly only used to simulate numeric input domains; (2) the assumed failure patterns may not be realistic; and (3) simulations that do account for test execution time often have very similar execution times, which means that the simulations are effectively comparing generation, rather than execution, time, which may decrease their applicability in practice.
5. Simulations examining modeling or distance-calculation errors [194] have also appeared, but are very rare.
In general, primary studies involving simulations assume the input domain D to be $[0, 1.0)^d$, a unit hypercube (each dimension of D ranging from 0.0 to 1.0). Excluding those few studies using simulations for a special testing environment (such as for combinatorial testing [168]), 73 of the 78 papers involving simulations used numeric input domains. In this section, we review these 73 studies according to the three main simulation design factors (d, θ, and the failure pattern).
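A simulation of the kind described above can be sketched as follows, under several assumptions made here for illustration: a single hypercube-shaped failure region of volume θ placed at random inside the unit hypercube, RT with uniform sampling, and FSCS in its commonly described form with a fixed candidate set of size k = 10 and Euclidean distance. All function names are hypothetical, and the sketch is not the experimental setup of any surveyed study.

```python
import math
import random


def make_block_failure_region(d, theta, rng):
    """Return a predicate for one hypercube failure region of volume theta in [0, 1)^d."""
    side = theta ** (1.0 / d)
    corner = [rng.uniform(0.0, 1.0 - side) for _ in range(d)]

    def is_failure(x):
        return all(c <= xi < c + side for c, xi in zip(corner, x))

    return is_failure


def rt_f_count(is_failure, d, rng):
    """Number of uniformly random test cases executed up to the first failure."""
    n = 0
    while True:
        n += 1
        if is_failure([rng.random() for _ in range(d)]):
            return n


def fscs_f_count(is_failure, d, rng, k=10):
    """F-count for FSCS: pick the candidate farthest from its nearest executed test."""
    executed = []
    while True:
        if not executed:
            tc = [rng.random() for _ in range(d)]
        else:
            candidates = [[rng.random() for _ in range(d)] for _ in range(k)]
            tc = max(candidates, key=lambda c: min(math.dist(c, e) for e in executed))
        executed.append(tc)
        if is_failure(tc):
            return len(executed)


def mean_f_count(f_count_fn, d, theta, runs, seed):
    """Average F-count over independent runs, each with a fresh failure region."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        total += f_count_fn(make_block_failure_region(d, theta, rng), d, rng)
    return total / runs


if __name__ == "__main__":
    d, theta, runs = 2, 0.02, 100
    print("RT   mean F-count:", mean_f_count(rt_f_count, d, theta, runs, seed=1))
    print("FSCS mean F-count:", mean_f_count(fscs_f_count, d, theta, runs, seed=1))
```

With only 100 runs the two averages are noisy estimates; the number of runs needed for a stable comparison is a separate design decision.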

Failure Rate Distribution
Of the 73 primary studies involving simulations, 70 simulated failures in the input domain. Among these 70, the maximum failure rate ($\theta_H$) used was 1.0, and the minimum ($\theta_L$) was $1.0 \cdot 10^{-5}$. Fig. 14 shows the distribution of lowest simulation failure rates ($\theta_L$) across these 70 papers: As shown in Fig. 14, more than half of the simulations involved a minimum failure rate of either $1.0 \cdot 10^{-3}$ (24 papers: 34%) or $5.0 \cdot 10^{-4}$ (15 papers: 21%). The next two most commonly used $\theta_L$ values are $5.0 \cdot 10^{-5}$ (17%), and $1.0 \cdot 10^{-4}$ (7%). In total, only one paper had $\theta_L = 1.0 \cdot 10^{-5}$, indicating that the failure rates used in simulations have not been very low. This lack of simulation data for very low failure rates contrasts with Anand et al.'s report that "lower failure rates are actually favorable scenarios for ART with respect to F-measures" [18]. Thus, it would be interesting and worthwhile to conduct more simulations involving lower failure rates to better evaluate ART performance.
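One practical reason why very low failure rates are expensive to simulate can be seen from RT itself. Assuming uniform random sampling with replacement, the number of test cases up to the first failure follows a geometric distribution, so the expected value is 1/θ. The short sketch below (the function name is hypothetical) tabulates this for the θ values mentioned above.

```python
def rt_expected_f_measure(theta: float) -> float:
    """Expected number of tests to the first failure for RT with replacement."""
    if not 0.0 < theta <= 1.0:
        raise ValueError("theta must be in (0, 1]")
    return 1.0 / theta


for theta in (1.0e-3, 5.0e-4, 1.0e-4, 5.0e-5, 1.0e-5):
    expected = rt_expected_f_measure(theta)
    print(f"theta = {theta:g}: about {expected:,.0f} test cases per run on average")
```

Because each simulation is repeated over many independent runs, the total number of generated test cases grows accordingly as θ decreases, and ART's distance computations add to this cost.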
As shown in Fig. 15, there are many types of failure patterns. Fig. 16 presents how many failure pattern types were used in the 69 studies involving failure patterns. Half of the papers examined three types, followed by one (34%), four (10%), and two types (4%). Only one paper (2%) looked at five types of failure pattern in the simulations [108]. A conclusion from this analysis is that many studies have not designed the simulations comprehensively enough to accurately evaluate ART.

Experiments with Real Programs
In this section, we summarize some details about the ART experiments involving real programs.

Subject Programs
We collected the details of each subject program used in the ART experiments, including its name, implementation language, size, description, and references to the primary studies that reported results for that program. This information is summarized in the appendix, in Table A.1 ("NR" again indicates that some details were not reported in the original paper). For ease of illustration, we ordered the programs according to the number of references (the last column), listing the most studied ART subject programs at the top of the table.
6. For ease of description, the failure patterns in Fig. 15 focus on two dimensions, but the categories are similar for higher d: For example, when d = 3, a square becomes a cube, and a circle becomes a sphere.
7. The rectangle type is a special case of the strip type, with the main difference being that each side of the rectangle type is parallel to the corresponding dimension of D, but strips are not necessarily so, and may not be parallelograms [22].
8. Previous studies [108] designed a predominant region by assigning q% of the failure region to one square (e.g., q may be equal to 30, 50, or 80), with the other squares sharing the remaining percentage of the failure region in a random manner.
9. The unknown shapes refer to situations where the shape details were not provided.
In total, as can be seen from Table A.1, 211 subject programs were found, ranging in size from 8 to 4,727,209 lines of code. Fig. 17 shows the distribution of programming languages used to implement the programs. It can be observed that most programs were written in C/C++ (36%) and Java (33%), followed by C# (9%).

Types of Faults
The testing effectiveness of ART is generally evaluated according to its ability to identify failures caused by faults in the subject programs. Because actual defects are not always available in real programs, artificially faulty programs, in the form of mutants, are often used. The mutants can be created manually, or with an automatic mutation tool [23]. Similar to previous surveys [26], we investigated the relationship between artificial and real faults in the empirical evaluations of ART by calculating the cumulative number of primary studies using each, as shown in Fig. 18. It can be seen that the first study involving artificial faults was reported in 2002, while the first with real faults was in 2008. Furthermore, although the numbers of studies with artificial faults and with real faults are both increasing, the rate of increase for artificial faults is much higher than that for real faults. By the end of 2017, more than 55 primary studies had used artificial faults, compared with only 12 identifying real faults, indicating that relatively few studies have used ART to detect real bugs. Nevertheless, the number of ART studies detecting real faults has been increasing.

Additional Information Used to Support ART
ART uses information from previously executed tests to guide generation of subsequent test cases to achieve an even-spreading over the input domain. Each test case contains some intrinsic information: A test case (0.5, 0.5), for example, is a point in a 2-dimensional input domain, with intrinsic information represented by its location; another test case, "xyz", is a list of characters whose intrinsic information is represented by a string. In addition to a test case's intrinsic information, some further information may also be available that can be extracted and applied to support ART execution. In this section, we review some additional information obtained from the ART studies.
Most studies made some use of white-box information (including branch coverage, statement coverage, and mutation score) to guide test case generation. Several studies [42], [174], [183] have used branch coverage information, but have adopted different representations. For a given SUT with a list of β branches, denoted $BR = \{br_1, br_2, \cdots, br_\beta\}$, a test case tc could cover a set of these branches, denoted $BR(tc)$, where $BR(tc) \subseteq BR$. Some previous studies [174], [183] have used a binary vector $(x_1, x_2, \cdots, x_\beta)$ where each element $x_i$ ($1 \le i \le \beta$) represents whether or not the branch $br_i$ is covered by tc: if $br_i$ is covered, then $x_i = 1$, otherwise $x_i = 0$. This information can also be represented by a set of branches [42], i.e., $BR(tc)$. Zhou et al. [173] also used a test vector based on branch information $(y_1, y_2, \cdots, y_\beta)$; however, each element $y_i$ ($1 \le i \le \beta$) in their vector represents the number of times that $br_i$ is covered by tc. Jiang et al. [42] used sets of statements or methods for each test case. Both Hou et al. [41] and Sinaga et al. [175] used the program path to represent each test case by constructing a Control Flow Graph (CFG) [197] for the program.
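The three branch coverage representations described above (a set of covered branches, a binary vector, and a vector of coverage counts) can be illustrated with a small sketch. The branch names, the hit counts, and the choice of Hamming and Euclidean distances are assumptions made here for illustration; the surveyed studies used their own distance measures.

```python
import math


def binary_vector(covered, branches):
    """x_i = 1 if branch br_i is covered by the test case, otherwise 0."""
    return [1 if br in covered else 0 for br in branches]


def count_vector(hit_counts, branches):
    """y_i = number of times branch br_i is covered by the test case."""
    return [hit_counts.get(br, 0) for br in branches]


def hamming_distance(u, v):
    """Number of positions at which two vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


def euclidean_distance(u, v):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical SUT with four branches and two executed test cases.
branches = ["br1", "br2", "br3", "br4"]
tc1_hits = {"br1": 3, "br3": 1}   # branch -> number of times covered
tc2_hits = {"br1": 1, "br2": 2}

tc1_set, tc2_set = set(tc1_hits), set(tc2_hits)          # BR(tc) as a set
x1, x2 = binary_vector(tc1_set, branches), binary_vector(tc2_set, branches)
y1, y2 = count_vector(tc1_hits, branches), count_vector(tc2_hits, branches)

# The set and binary-vector views carry the same information: the Hamming
# distance equals the size of the symmetric difference of the two sets.
print(hamming_distance(x1, x2), len(tc1_set ^ tc2_set))
# The count vector also distinguishes how often each branch was exercised.
print(euclidean_distance(y1, y2))
```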
Tappenden and Miller [153] also used a binary vector for individual test cases to represent the existence (or lack) of certain cookies within a global cookie collection. Patrick and Jia [198], [199] used mutation scores to construct a probability distribution for test case selections. Some previous studies [155]- [158] have described test cases using UML state machine test paths, considering each test path as either a set or sequence. Iqbal et al. [106], using the same UML state machine as in other studies [155]- [158], used a test data matrix to represent test cases. Matinnejad et al. [159] represented test cases using a sequence of signals that could be described as a function over time; and Liu et al. [116] represented test cases with an event sequence. Indhumathi and Sarala [200] used .NET Solution Manifest files to generate test case scenarios, each one producing at least one test case. Nikravan et al. [201] applied the path constraints of input parameters to support ART. Nie et al. [91] enhanced ART testing effectiveness through the use of I/O relations. When testing C compilers, where each test case was a C program, Chen et al. [184] counted the occurrence of certain tokens in each program, constructing a numeric vector to represent each test case. Hui and Huang [202] applied metamorphic relations to support ART test case generation, and Yuan et al. [203] have incorporated program invariant information into ART.

Evaluation Metrics
Various metrics have been used to evaluate the testing effectiveness and efficiency of ART approaches. In this section, we review those metrics used in the primary studies.

Effectiveness Metrics
The effectiveness metrics, which are used to evaluate the effectiveness of ART techniques, can be classified into three categories: fault-detection; test-case-distribution; and structure-coverage.
1) Fault-detection metrics: These metrics assess the fault-detection ability of ART, and include the F-measure [204], E-measure [205], and P-measure [205]. The F-measure is the expected F-count [206] (the number of test cases required to detect a failure in a specific test run); the E-measure refers to the expected number of failures to be identified by a set of test cases; and the P-measure is the probability of a test set identifying at least one program failure. Liu et al. [207] proposed a variant of the F-measure, the $F_m$-measure, which they defined as the expected number of test cases required to identify the first m failures. These metrics may have different application environments: the E-measure and P-measure, for example, are appropriate for the evaluation of automated testing systems [73]; while the F-measure is more realistic for situations where testing stops once a failure is detected.
In addition to these three metrics, another widely-used one is the fault detection ratio, which is defined as the ratio of faults detected by a test set to the total number of faults present [185]. It should be noted that in the context of artificial faults (mutants), the fault detection ratio can be interpreted as the mutation score [23].
2) Test-case-distribution metrics: These metrics are used to evaluate the distribution of a test set, i.e., how evenly spread the test cases are. For ease of description in the following, assume a test set $T = \{tc_1, tc_2, \cdots, tc_n\}$, of size n, from input domain D.
• Dispersion [31]: The dispersion of T is calculated as the maximum distance among all pairs of nearest neighbor distances. Its definition is: $\max_{1 \le i \le n} \min_{j \ne i} dist(tc_i, tc_j)$, where $dist(tc_i, tc_j)$ is the distance between $tc_i$ and $tc_j$.
• Diversity [83], [84]: The diversity is similar to the dispersion, but uses the sum (not maximum) of all nearest neighbor distances. Its definition is: $\sum_{i=1}^{n} \min_{j \ne i} dist(tc_i, tc_j)$.
• Divergence [114]: This is defined similarly to diversity.
• Spatial distribution [31], [82], [87], [104], [117]: This refers to the position of each test case over the entire input domain D, and can only be used for numeric input domains. The most intuitive version of spatial distribution depicts the locations of test cases: Mayer and Schneckenburger [82], for example, recorded the locations of the i-th test case in a 2-dimensional input domain using 10 million test sets, generating a picture of pixels. However, each picture only shows up to the i-th test case, and their method cannot depict spatial distributions for input domains with more than three dimensions. Some methods have tried to project the test case positions onto a single dimension: Chen et al. [117], for instance, projected test cases from T onto one dimension (the x-axis), dividing it into 100 equally-sized bins. The number of test cases within each bin was then counted, and the spatial distribution of T was thus described with a histogram. Other methods have described the spatial distribution of T by dividing D into a number of equally-sized, disjoint subdomains, from D's edge to its centre: Chen et al. [31], for example, partitioned D into two subdomains, the edge and center regions, defining a new measure of spatial distribution in terms of $T_{Edge}$ and $T_{Center}$, the sets of test cases from T located in the edge and center regions, respectively. Chen et al. [87] also partitioned D into 128 subdomains, and analyzed the frequency distribution of test cases in each subdomain. Similarly, Mayer and Schneckenburger [104] divided D into 100 subdomains, and formalized the relative distance of a test case $tc \in T$ to the center of D, in terms of the center c of D and its dimensionality d.
3) Structure-coverage metrics [208]: These metrics, which make use of structural elements in the SUT, have been widely used in the evaluation of many testing strategies. Among them, two popular categories are control-flow coverage [112] and data-flow coverage [111]. Control-flow coverage focuses on some control constructs of the SUT, such as block, branch, or decision [112]. Data-flow coverage, in contrast, checks patterns of data manipulation (footnote 10) within the SUT, such as p-uses, c-uses, and all-du-paths [111]. These metrics have also been used to evaluate (fixed-size) test sets generated by ART.
10. Patterns of data manipulation refer to the definition of some data (def, where values are assigned to the data), and its usage (use, where the values are used by an operation). Additionally, use can be categorized into c-use (where data are used as an output or in a computational expression), and p-use (where data appears in a predicate within the program) [111].
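The test-case-distribution metrics above can be sketched for numeric test sets in the unit hypercube. Dispersion and diversity follow the verbal definitions given above (maximum and sum of nearest neighbor distances), with Euclidean distance assumed. The edge-to-center ratio is only one plausible reading of the edge and center measure: it assumes a concentric center region occupying half the volume of D and reports the ratio of the two counts. All function names are hypothetical.

```python
import math


def nearest_neighbor_distances(tests):
    """Distance from each test case to its nearest other test case in the set."""
    return [
        min(math.dist(t, u) for j, u in enumerate(tests) if j != i)
        for i, t in enumerate(tests)
    ]


def dispersion(tests):
    """Maximum of the nearest neighbor distances."""
    return max(nearest_neighbor_distances(tests))


def diversity(tests):
    """Sum of the nearest neighbor distances."""
    return sum(nearest_neighbor_distances(tests))


def edge_center_ratio(tests):
    """Assumed measure: |T_Edge| / |T_Center| for equally-sized edge and center regions."""
    d = len(tests[0])
    # Center region: hypercube centered at (0.5, ..., 0.5) with volume 0.5.
    half_side = 0.5 * (0.5 ** (1.0 / d))
    center = sum(1 for t in tests if all(abs(x - 0.5) <= half_side for x in t))
    edge = len(tests) - center
    return math.inf if center == 0 else edge / center


# Hypothetical test set in a 2-dimensional unit input domain.
T = [(0.1, 0.1), (0.5, 0.5), (0.9, 0.9), (0.5, 0.6)]
print(dispersion(T), diversity(T), edge_center_ratio(T))
```

Under this reading, a ratio close to 1 suggests no bias towards either region, while a larger ratio suggests that test cases are concentrated near the boundary.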

Efficiency Metrics
There have generally been two metrics used to evaluate the testing efficiency of ART: the generation time, and the execution time. The generation time reflects the computational cost of generating n test cases; while the execution time refers to the time taken to execute the SUT with n test cases. On the one hand, because RT has fewer computations involved in the test case generation, it is intuitive that it should have a much lower generation time than ART: Given the same amount of time, RT typically generates more test cases than ART. On the other hand, the variation in execution time depends mainly on the SUT.

Cost-Effectiveness Metric
The F-time [61] is defined as the running time taken to find the first failure. Suppose the testing process requires n test cases to identify the first failure (i.e., the F-count is equal to n), then the F-time comprises the generation time for these n test cases, and the execution time for running them on the SUT. The F-time, therefore, not only shows the testing efficiency of ART, but also reflects its effectiveness: it is a cost-effectiveness metric [18].

Application of Evaluation Metrics
Among the effectiveness metrics, the fault-detection metrics were the most used, followed by test-case-distribution metrics. Very few studies used the structure-coverage metrics. The majority of papers (73%) used the F-measure to evaluate ART fault-detection effectiveness, followed by the fault detection ratio (15%), and the P-measure (11%). Only one paper (1%) used the E-measure, which reflects one of its main criticisms: that higher E-measures do not necessarily imply more distinct failures or faults [32].
Regarding the efficiency metrics, 21% of the 131 papers used the generation time, whereas only 1% used the execution time. Finally, about 8% of the studies adopted the F-time as the cost-effectiveness metric.
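Since the F-time combines the generation time and the execution time of all test cases up to the first failure, it can be measured with a simple harness such as the following sketch. The function `measure_f_time`, its parameters, and the toy generator and SUT are hypothetical; `generate` stands for any RT or ART test case generator, and `run_sut` for a function that returns True when a failure is observed.

```python
import random
import time


def measure_f_time(generate, run_sut, max_tests=100_000):
    """Return (F-count, generation time, execution time) up to the first failure."""
    f_count = 0
    generation_time = 0.0
    execution_time = 0.0
    while f_count < max_tests:
        t0 = time.perf_counter()
        tc = generate()               # cost of producing the next test case
        t1 = time.perf_counter()
        failed = run_sut(tc)          # cost of executing it on the SUT
        t2 = time.perf_counter()
        generation_time += t1 - t0
        execution_time += t2 - t1
        f_count += 1
        if failed:
            return f_count, generation_time, execution_time
    raise RuntimeError("no failure found within max_tests")


# Toy example: random inputs in [0, 1), failure when the input is below 0.01.
rng = random.Random(42)
f_count, gen_t, exec_t = measure_f_time(rng.random, lambda x: x < 0.01)
print("F-count:", f_count, "F-time (s):", gen_t + exec_t)
```

Keeping the two time components separate makes it possible to see whether generation or execution dominates, which matters when deciding between RT and ART for a given SUT.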

Number of Algorithm Runs
To accommodate the randomness in test cases generated by both RT and ART, empirical evaluations require that the techniques be run in an independent manner, a certain number of times (called the number of algorithm runs, S) [209].
In this section, we analyze the number of algorithm runs used in each study (footnote 11). Because some studies involving experiments may have had practical constraints (such as limited testing resources), to present the results in an unbiased way, we investigated the numbers of algorithm runs for both simulations and for experiments with real programs. Fig. 20 presents the paper classification based on the number of algorithm runs reported, with Fig. 20(a) showing the distribution of the 78 studies with simulations, and Fig. 20(b) showing the 74 studies involving experiments. According to the central limit theorem [210], to estimate the mean of a set of evaluation values (such as F-measures), with an accuracy range of ±r% and a confidence level of (1 − α) × 100%, the size of S should be at least: $S = \left( \frac{100 \cdot z \cdot \sigma}{r \cdot \mu} \right)^2$, where z is the normal variate of the desired confidence level, µ is the population mean, and σ is the population standard deviation.
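As a worked example of choosing the number of algorithm runs, the sketch below assumes the commonly used form $S = (100 \cdot z \cdot \sigma / (r \cdot \mu))^2$, with r expressed as a percentage and z taken as the two-sided normal quantile for the chosen confidence level. The function name and the default values (±5% accuracy, 95% confidence) are assumptions made here for illustration.

```python
import math
from statistics import NormalDist


def required_runs(mean: float, std: float, r_percent: float = 5.0,
                  confidence: float = 0.95) -> int:
    """Smallest integer S satisfying the assumed central limit theorem bound."""
    # Two-sided normal variate, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return math.ceil((100.0 * z * std / (r_percent * mean)) ** 2)


# RT with replacement and failure rate theta: the F-count is geometric,
# with mean 1/theta and standard deviation sqrt(1 - theta)/theta.
theta = 0.01
print(required_runs(mean=1.0 / theta, std=math.sqrt(1.0 - theta) / theta))
```

In practice µ and σ are unknown, so they are usually replaced by the sample mean and sample standard deviation, and runs are added until the number of completed runs reaches the computed S.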
As shown in Fig. 20(a), other than 5% of papers with S ≤ 100, and 9% "Not reported", all other simulation papers used either a value of S determined by the central limit theorem (28%), or had at least 1,000 algorithm runs (58%). The largest group of studies determined S based on the central limit theorem, followed by S = 5,000 (17%), 10,000 (14%), and 50,000 (11%). On the other hand, as shown in Fig. 20(b), about 46% of papers involving experiments had 100 or fewer algorithm runs (S ≤ 100), followed by 19% with 1,000 or more (S ≥ 1,000). Only 13% of experiment papers used a value of S calculated using the central limit theorem. According to Arcuri and Briand's practical guidelines [209], algorithms involving randomness should be run at least one thousand times (S = 1,000) for each artifact (exceptions being for heavy time-consuming SUTs, such as embedded systems [106], [155]-[158]). Therefore, while overall the number of algorithm runs for ART simulations was sufficient, it appears that the number of runs in some studies involving experiments was not.
11. If a paper used different numbers of algorithm runs in different empirical studies, the minimum number was selected for S.

Statistical Significance
One of the initial motivations behind developing ART was to enhance the testing effectiveness of RT. It is natural, therefore, to compare each new ART technique with RT, in terms of testing effectiveness. As already discussed, because test cases generated by both RT and ART contain randomness, it is necessary to determine the statistical significance of any comparison [209]. Statistical tests can, amongst other things, determine whether or not there is sufficient empirical evidence to support, with a high level of confidence, that there is a difference between the performance of two algorithms A and B. Furthermore, when A does outperform B, it is also important to quantify the magnitude of the improvement. In this section, we report on the application of statistical tests in the empirical evaluations of the ART studies.
Of the 78 primary studies involving simulations, only four papers (5%) used statistical tests. However, of the 73 papers with experiments, 26 (36%) examined the statistical significance when comparing two techniques, with the most used statistical tests being: the t-test; the Mann-Whitney U-test; ANalysis Of VAriance (ANOVA) test; and Z-test [209]. The effect size was the statistic most often used to measure the magnitude of improvements [209], with two papers using it for simulations [73], [105], and six including it for experimental data [73], [158], [159], [165], [198], [199]. The two main approaches used to calculate the effect sizes are from the work of Cohen [211], and Vargha and Delaney [212].
In summary, it appears that relatively few ART empirical studies have used sufficient and appropriate statistical testing.
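As an illustration of the two effect-size families mentioned above, the following sketch computes the Vargha and Delaney A statistic and Cohen's d for two samples of measurements (for example, F-measures of ART and RT over repeated runs). The function names are hypothetical, and the pooled-standard-deviation form of Cohen's d is one common variant that is assumed here rather than taken from the surveyed papers.

```python
import math
import statistics


def vargha_delaney_a12(xs, ys):
    """Probability that a value drawn from xs is larger than one drawn
    from ys, counting ties as half. 0.5 means no difference."""
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))


def cohens_d(xs, ys):
    """Standardized mean difference using the pooled sample variance
    (assumed variant; other definitions of d exist)."""
    nx, ny = len(xs), len(ys)
    pooled_var = (
        (nx - 1) * statistics.variance(xs) + (ny - 1) * statistics.variance(ys)
    ) / (nx + ny - 2)
    return (statistics.mean(xs) - statistics.mean(ys)) / math.sqrt(pooled_var)
```

The A statistic is non-parametric and only uses the ordering of the values, whereas Cohen's d assumes that means and standard deviations summarize the data well, which may not hold for skewed measures.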

ANSWER TO RQ5: WHAT MISCONCEPTIONS SURROUNDING ART EXIST?
During the development of ART, a number of misconceptions and misunderstandings have arisen, leading to confusion or incorrect conclusions. Some misconceptions have been discussed previously [18], indicating that they are held by many potential ART users, especially those just beginning to apply it. Two main misconceptions are discussed in this section.

Misconception 1: ART is Equivalent to FSCS
Anand et al. [18] noted that, because FSCS was the first published ART algorithm [9], many studies have presented FSCS as being ART itself, or have treated the two as equivalent or interchangeable. As discussed in Section 5, FSCS is an ART implementation belonging to the STFCS category, and there are many other STFCS implementations. There are also other ART implementation categories. ART refers to a family of testing approaches in which randomly-generated test cases are evenly spread over the input domain [15]. FSCS is only one of many ART algorithms, and hence ART and FSCS are not equivalent.

Misconception 2: ART Should Always Replace RT
While RT requires very little information when testing, ART makes use of additional information (such as the locations of previously executed test cases) to guide the test case generation. It may, therefore, seem reasonable that ART should always be better than RT, and thus always replace it. From the perspective of testing effectiveness, however, Chen et al. [108] found that ART's effectiveness is influenced by many factors, including the failure rate, failure pattern, and dimensionality of the input domain. They identified several favorable conditions for ART, including a small failure rate, a low dimensionality, a compact failure region, and a small number of failure regions. Furthermore, because different approaches to achieve an even spread of test cases have resulted in different ART implementations, each implementation also has its own relative advantages and disadvantages (resulting in favorable and unfavorable conditions for its application). In other words, there are situations where ART can have similar, or even worse, testing effectiveness compared to RT. In terms of testing efficiency, compared with RT, in spite of several overhead-reduction algorithms [56], [73], [123], [126], ART still incurs more computational overheads. Consequently, even though ART may have better testing effectiveness than RT, there needs to be a balance between effectiveness and efficiency when choosing either RT or ART: If the ART test case generation time is considerably less than the test setup and execution time, then it would be appropriate to replace RT with ART; otherwise, RT may be more appropriate [18]. Nevertheless, it should be feasible to use ART rather than RT as a baseline when evaluating the state-of-the-art techniques for test case generation, especially from the perspective of testing effectiveness.

Summary of answers to RQ5:
• Two main misconceptions exist in much of the literature: that ART is equivalent to FSCS; and that RT should always be replaced by ART.

ANSWER TO RQ6: WHAT ARE THE REMAINING CHALLENGES AND OTHER FUTURE ART WORK?
A number of open ART research challenges remain, requiring further investigation and additional (future) work.

Challenge 1: Guidelines for Simulation Design
Although simulations may have limitations compared with real-life programs (because they may not easily and accurately simulate real-world environments, especially complex ones), they are indispensable in the field of ART research. For any given SUT, the fault details (including the size, number, location, and shape of failure regions) are fixed (but unknown) before testing begins. Intuitively, therefore, it is reasonable that studies attempt to simulate faults by controlling and adjusting the factors that create different failure patterns (resulting in different faults).
Although some such simulated faults may seldom occur in real-world programs, they may nonetheless be representative of potential real-world situations, especially for numeric programs. Furthermore, for a number of reasons, it can be challenging to obtain real-world faulty programs: their existence or availability, for example, may be limited. Simulations, therefore, can be used to compensate for this lack of appropriate real-world programs. As discussed in Section 7, many studies (58 papers) have used simulations to evaluate ART, and different papers may have different simulation designs. However, these studies only simulated numeric and configurable programs. Furthermore, there is a lack of reliable guidelines regarding simulation design, especially from the perspective of those factors that influence ART effectiveness (such as the failure rate and failure pattern details). The existence of such guidelines could help testers when choosing simulations for experimental evaluations.

Challenge 2: Extensive Investigations Comparing Different ART Approaches
As discussed in Section 7, although many ART studies are based on simulations and experiments with real programs, all studies have used simulations with failure rates greater than 10^-6, and very few [122], [134] have used experiments with failure rates less than 10^-6 [18]. Similarly, few studies have investigated the favorable and unfavorable conditions for each ART approach [108]. Furthermore, only a very limited number of studies have used statistical testing with a sufficient number of algorithm runs to evaluate ART. It is therefore necessary to more fully and extensively investigate and compare the different ART approaches. This investigation and comparison needs to address not only the strengths and weaknesses of each ART approach, but should also seek to confirm those theoretical results not yet empirically supported (including, as discussed in Section 5.7.1, the potential for ART to support software reliability estimation).

Challenge 3: ART Applications
As discussed in Section 6, ART has been used to test many different applications. Although ART could theoretically be applied to test any software applications where RT can be used, there remain some applications that have only been tested by RT, such as SQL database systems [6]. It will therefore be interesting and significant to apply ART to these domains. Furthermore, to date, only those ART approaches using the concept of similarity, such as STFCS and SBS, have been used in different applications; other approaches, such as PBS and QRS, have mainly been confined to numeric input domains. It will therefore also be important to apply a wider range of ART approaches to different applications.
A goal of ART is to achieve an even-spreading of test cases over different input domains (including nonnumeric input domains). Unlike numeric input domains, nonnumeric domains cannot be considered Cartesian spaces, making visualization of test input locations and failure pattern shapes infeasible. A key requirement for the application of ART in nonnumeric input domains, therefore, is the availability of a suitable dissimilarity or distance metric to enable comparison of the nonnumeric inputs. Consider, for example, a program that checks whether or not an input string of the form "YYYY-MM-DD" is a valid date: Given three potential string input tests (tc1 = "2019-01-31", tc2 = "2019-01-3X", and tc3 = "1998-12-24"), some string dissimilarity metrics (e.g., Hamming distance, Levenshtein distance, and Manhattan distance) may indicate that tc3 is farther away from tc1 than tc2 is [165]. However, while both tc1 and tc3 are valid inputs, tc2 is invalid, and is thus likely to trigger different behavior and output. If different test inputs trigger different functionalities and computations, they are also likely to have different failure behavior (including detecting or not detecting failures), which means that they are dissimilar to each other. This suggests that it would be desirable to incorporate the semantics of nonnumeric inputs into their dissimilarity metrics. If a dissimilarity metric exists that accurately captures the semantic differences between test cases (based on functionality and computation), then ART should be considered.
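The date-string example above can be checked with a short sketch. The Hamming and Levenshtein functions are standard; `validity_aware_distance` and its penalty value are purely hypothetical, added only to illustrate what incorporating input semantics into a metric might look like, and are not a metric proposed in the surveyed literature.

```python
from datetime import datetime


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn a into b (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def is_valid_date(s):
    """True if s parses as a calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(s, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def validity_aware_distance(a, b, penalty=10):
    """Hypothetical semantic metric: syntactic distance plus a fixed
    penalty when one input is a valid date and the other is not."""
    return hamming(a, b) + (penalty if is_valid_date(a) != is_valid_date(b) else 0)


tc1, tc2, tc3 = "2019-01-31", "2019-01-3X", "1998-12-24"
print(hamming(tc1, tc2), hamming(tc1, tc3))  # prints: 1 8
print(validity_aware_distance(tc1, tc2), validity_aware_distance(tc1, tc3))  # prints: 11 8
```

Under the purely syntactic metrics, the invalid input appears to be a near neighbor of tc1; under the hypothetical validity-aware variant the ordering is reversed, which is the kind of behavior a semantics-aware metric would need to provide.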

Challenge 4: Cost-effective ART Approaches
ART cost-effectiveness is critical for real-life applications, and a number of approaches to reduce the computational overheads while attempting to maintain testing effectiveness have been proposed [55]-[57], [73], [92], [123], [126], [128]. However, some approaches are only applicable to numeric inputs [55]-[57], [73], [92], using the location information of disjoint subdomains to enable their division. The main obstacles to applying these cost-reduction techniques to nonnumeric domains include: (1) how to partition a nonnumeric input domain into disjoint subdomains; and (2) how to represent the "locations" of these subdomains.
ART based on the concept of mirroring (MART) [55]-[57], [92] first partitions the numeric input domain into equally-sized, disjoint subdomains, designating one as the source subdomain and the others as mirror subdomains. A mapping relation is used to translate test cases between the source subdomain and each mirror subdomain. For example, consider a two-dimensional input domain D, divided into four equally-sized subdomains, D1, D2, D3, and D4. Without loss of generality, assuming that D1 is the source subdomain, and the others are mirror subdomains, then once a new test case is generated in D1 using ART (e.g., FSCS or RRT), a mapping relation between D1 and Di (i = 2, 3, 4) maps the test case to three other test cases in D2, D3, and D4. Although, intuitively speaking, subdomain locations can only be visualized/identified in a numeric input domain, not in a nonnumeric one, if partitioning and subdomain location assignment can be applied to nonnumeric input domains, then MART can be used.
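The mirroring idea described above can be sketched as follows, under several assumptions that are not fixed by the text: the input domain is the unit square, the source subdomain is its lower-left quarter, the mapping relation is a simple translation, and the source test cases are produced by a basic FSCS step (k random candidates, keeping the one farthest from its nearest previously selected source test case). All function names are hypothetical.

```python
import math
import random

# Assumed layout: D = [0,1] x [0,1]; the source subdomain is [0,0.5] x [0,0.5].
HALF = 0.5
# Translation offsets from the source subdomain to the three mirror subdomains.
MIRROR_OFFSETS = ((HALF, 0.0), (0.0, HALF), (HALF, HALF))


def fscs_next(executed, rng, k=10):
    """One FSCS step inside the source subdomain: draw k random candidates
    and keep the one whose nearest executed test case is farthest away."""
    candidates = [(rng.uniform(0.0, HALF), rng.uniform(0.0, HALF))
                  for _ in range(k)]
    if not executed:
        return candidates[0]  # first test case is simply random
    return max(candidates,
               key=lambda c: min(math.dist(c, e) for e in executed))


def mirror(tc):
    """Map a source test case to its images in the mirror subdomains
    (translation is assumed; other mapping relations are possible)."""
    return [(tc[0] + dx, tc[1] + dy) for dx, dy in MIRROR_OFFSETS]


def mart(n_source, seed=0):
    """Generate 4 * n_source test cases: each source test case is
    followed by its three mirrored images."""
    rng = random.Random(seed)
    source_tests, all_tests = [], []
    for _ in range(n_source):
        tc = fscs_next(source_tests, rng)
        source_tests.append(tc)
        all_tests.append(tc)
        all_tests.extend(mirror(tc))
    return all_tests
```

In this sketch the distance computations involve only the test cases in the source subdomain, so producing 4n test cases requires the FSCS work of n. The sketch also shows why the approach presupposes numeric inputs: both the partitioning and the translation rely on coordinates.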
While other overhead-reduction approaches [126], [128] may be applied to both numeric and nonnumeric input domains, they may also involve discarding some information, which may decrease their testing effectiveness. It is therefore necessary to investigate more cost-effective ART approaches for different applications.

Challenge 5: Framework for Selection of ART Approaches
A framework for the selection of an ART approach could help guide testers to apply ART in practice, especially when facing a choice among multiple approaches. Anand et al. [18] discussed two simple application frameworks, but only at a very high level, and a lot of technical details remain to be determined. The framework design will also need to address the favorable and unfavorable conditions for each ART approach, as identified in the various studies.

Challenge 6: ART Tools
Although many approaches have been proposed for ART, there are very few tools [40], [73], [114], [144], [150], some of which are: AutoTest, which supports ART for object-oriented (ARTOO) programs written in Eiffel [40]; ART-Gen, which supports divergence-oriented ART for Java programs [114]; Practical Extensions of Random Testing (PERT), which supports testing for various input types [150]; and OMISS-ART, which supports FSCS for C++ and C# programs. However, these tools are not publicly available. The only publicly available tool [213] was developed to support FSCS, RRT, evolutionary ART, RBCVT, and RBCVT-Fast [73], but this can only be used for purely numeric input domains. Currently, testers wanting to use ART have to implement the corresponding algorithm themselves. There is, therefore, a desire and need to develop and make available more ART tools to support both research and actual testing.

Summary of answers to RQ6:
• Six current challenges have been identified for ART that will require further investigation. These are the current lack of: (i) guidelines for the design of ART simulations; (ii) extensive investigations comparing different ART approaches; (iii) ART applications; (iv) cost-effective ART approaches; (v) a framework for the selection of ART approaches; and (vi) ART tools.

CONCLUSION
In this article, we have presented a survey covering 140 ART papers published between 2001 and 2017. In addition to tracing the evolution and distribution of ART topics, we have classified the various ART approaches into different categories, analyzing their strengths and weaknesses. We also investigated the ART application domains, noting that it has been applied in multiple domains, and has been integrated with various other testing techniques. Furthermore, we have identified that different types of failure patterns have been used in the various reported simulations, and that there has been an increasing number of real faults detected and reported. Finally, we discussed some misconceptions about ART, and listed some current and future ART challenges requiring further investigation. We believe that this article represents a comprehensive reference for ART, and may also guide its future development.