Metamorphic Testing: Testing the Untestable

What if we could know that a program is buggy, even if we could not tell whether or not its observed output is correct? Metamorphic testing provides this ability. This article explains the basics of the technique.

FEATURE: AUTOMATED SOFTWARE TESTING SUPPOSE YOU ARE helping your son with his homework. You ask him how many exercises he has to do in total, and he answers, "Two." You are not sure if that is correct, so a few minutes later you ask how many math exercises the teacher has asked him to do, and this time he answers, "Four." This is wrong-the total number of exercises cannot be fewer than the number of math exercisesso the boy seems forgetful. There was no need to know if the answers were correct to detect the problem, and, more importantly, the boy revealed the problem himself! Like the child in this example, some programs are extremely difficult to test because of the lack of an oracle, which is a mechanism that can decide whether or not the program's output is correct in a reasonable amount of time. 1 Consider, for example, testing programs such as compilers, search engines, machine-learning systems, or simulators: determining the correctness of the output for a given input may be nontrivial and error prone. This is known as the oracle problem, and the programs suffering from it are often referred to as nontestable (or untestable). [1][2][3] Metamorphic testing (MT) is an effective technique for alleviating the oracle problem and, thus, for testing untestable programs where, as in the previous example, failures are not revealed by checking individual outputs but by checking the expected relations among multiple executions of the program under test. Since its introduction by Chen et al. in 1998, 4 the literature on MT has grown impressively, and many successful applications of the technique have come to light. Some of these successes include the detection of bugs in real-world systems, such as the search engines Google and Bing, 5

the GNU Compiler Collection (GCC) and
Low-Level Virtual Machine (LLVM) compilers, 6 commercial code obfuscators, 7 NASA systems, 8 and the web application programming interfaces (APIs) of Spotify and YouTube. 9 Recently, GraphicsFuzz, a spin-off company from Imperial College London acquired by Google, has commercialized this technique. 10 In this article, we present an intuitive introduction to metamorphic testing, giving practical examples, describing success stories, and discussing its limitations.

MT in a Nutshell
MT approaches the software testing problem from a perspective not used by most other testing strategies: rather than focusing on each individual output, MT looks at multiple executions of the program. It checks whether the inputs and outputs of these multiple executions satisfy certain metamorphic relations, which are necessary properties of the intended program's functionality. A metamorphic relation transforms existing (source) test cases into new (follow-up) ones. If the program's behavior across these Metamorphic Testing: Testing the Untestable source and follow-up test cases violates the metamorphic relation, the program must be faulty.
As an example, consider the program merge(L 1 , L 2 ) that merges two lists into a single ordered list without duplicated elements. Deciding whether the output of the program is correct for any two nontrivial input lists is difficult, and, thus, this is an instance of the oracle problem. However, the order of the parameters should not influence the result, which can be expressed as the following metamorphic relation: Many other metamorphic relations could be identified, for example, merge(L 1 , L 2 ) = merge(L 3 , L 4 ), where L 3 = L 1 + L 1 , L 4 = L 2 + L 2 , and + is the list concatenation operator.
MT is also regarded as an effective test-data generation technique. This is because a metamorphic relation implicitly defines how a given source test case can be transformed into one or more follow-up test cases such that the relation can be checked. In the previous example, for instance, MT could be used together with a random list generator to automatically construct source test cases-for example, ( MT was originally proposed two decades ago as a technique for reusing successful test cases (those that pass and, thus, reveal no failures). Since then, it has thrived, becoming a well-established testing technique with numerous applications in both academia and industry. For a thorough introduction to the technique, we refer the reader to two recent surveys on the topic, 2,3 and a recent webinar. 11

A Hands-On Example
Suppose that you are part of the testing team for the popular website Booking.com, which allows users to find potential lodgings according to their preferences. You run an exploratory test by performing a search for accommodations in Rome, which returns 7,378 result items. Is this output correct? Is there anything missing in the result set? Is there any result not meeting the search criteria included in the list? Answering these questions would be extremely time-consuming. This is a clear case of the oracle problem. To alleviate this problem, and to automate the generation of test cases, MT could be employed by applying the following basic steps.

Step 1: Identification of Metamorphic Relations
Metamorphic relations are generally identified based on our knowledge of the problem domain, the program specification, and/or the user manual. To identify a metamorphic relation, we may think about how certain changes in the program's inputs may be expected to produce certain changes in the program's outputs. 12 For example, Figure 1 shows the interface of Booking.com, displaying the results of the search for accommodation in Rome. A closer look at this may lead us to several metamorphic relations: • MR 1 : Perform a search. Then, repeat the search adding a budget filter, such as "US$100-US$150 per night." The result set of the second search (followup test case) should be a subset of the result set of the first search (source test case), for which no filter was applied. • MR 2 : Perform a search for hotels with the rating filter "1 star" (source test case). Then, repeat the search four times, changing the rating filter to "2 stars," "3 stars," "4 stars," and "5 stars" (follow-up test cases). The result sets of the five searches should not contain any common result item, because the same hotel cannot have two different star ratings at the same time. • MR 3 : Perform a search (source test case). Then, repeat the search changing the ordering criterion from default ("our top picks") to "review score" (follow-up test case). Both searches should return the same items, regardless of their ordering.
These are just a few of the potentially huge number of metamorphic relations that could be identified for Booking.com and similar websites and APIs. 5,9 The reader may like to try identifying some other relations by looking at the "Popular filters" shown on the left side of Figure 1.
Step 2: Implementation Once the relations are identified, it is time to implement and run the actual metamorphic tests by instantiating each relation with specific test data. This can be done with any standard unit-testing framework. As an example, Figure 2 shows a metamorphic test for the relation MR 1 written in the JUnit framework. The key difference compared with standard unit test cases is that the method under test (Booking.search) is executed twice (lines 17 and 24), instead of just once, and that the assertion (line 27) refers to the output of both calls, instead of to a single output.
Several points are worth emphasizing in this example. First, every metamorphic test starts from an existing test case, called the source test case. Source test cases can be designed from scratch with any standard test-design technique, or they can be reused from an existing test suite. Each metamorphic relation can be typically instantiated into many metamorphic tests, by using different test data. Thus, we could easily implement many other metamorphic tests from MR 1 by simply using different search queries and budget filters.
Step 3: Automated Test-Case Generation Finally, the real potential of MT is fully realized when combining it with automated test-data generation techniques. In the example of Figure 2, for instance, both the query search and the filter data could be randomly generated, enabling the construction of a potentially limitless number of test cases. This process would include not only the generation of inputs, but also the generation of the corresponding output assertions, truly achieving full test automation.

Approaches for the Identification of Metamorphic Relations
The effectiveness of MT is strongly influenced by the metamorphic relations used; therefore, identifying effective metamorphic relations is a critical step. The identification of metamorphic relations is a task that requires creativity and domain knowledge, but there are clues that can help in the process. 12 We next describe two common approaches for the identification of metamorphic relations.

Input-Driven Approach
The input-driven approach involves thinking of changes to the program's inputs that should produce expected changes in the outputs. 12 The possible changes to the input parameters depend on their data types. For example, possible operations in an input list might include adding an element to the list, removing an element from the list, splitting the list, reordering the list, and so on. Analogously, possible changes to a search query might involve adding or removing filtering, sorting, or paginationrelated parameters. 5,9 These possible changes provide clues about how source test cases could be changed to generate new follow-up test cases and, consequently, to identify metamorphic relations. This approach is frequently used in numerical and graph-theory programs.

Output-Driven Approach
In contrast to the previous method, the output-driven approach proposes starting from possible relations among the outputs typically found in the target domain and then thinking about what kind of changes in the program's inputs would lead to satisfaction of the expected relation among outputs. For example, typical relations between outputs in search operations include having a result set that is a subset of another result set, having two result sets containing the same items, or having two disjoint result sets (sets with no common elements). 5,9 Suppose we are looking at the subset relation between outputs in the Booking.com example. We should think of changes to the search query that filters some items out of the result set. For example, we could perform a search for hotels and then perform the same search for hotels, but this time with "pets allowed" (or any other filter). The result set of the latter search should be a subset of the former (where no filters were used). This approach is frequently used in programs accessing data repositories such as information systems, search engines, and web APIs.
Regardless of the approach followed, previous studies suggest that metamorphic relations should be diverse, meaning that they should involve different input parameters and input constraints to exercise the program under test as thoroughly as possible. 13 Figure 3 summarizes the domains where MT has been applied, based on a survey of 84 case studies published between January 1998 and May 2018. (The total number of publications on MT, considering all types of articles, is much higher.) The most popular domain is web services and applications (14%), followed by computer graphics (11%). We also found a variety of applications to other fields (24%), such as financial software, optimization programs, cybersecurity, and data analytics as well as industrial applications in organizations, such as NASA and Adobe. Only 5% of the articles reported results in numerical programs, even though this is a frequently used domain to illustrate how MT works in the literature. The arrows in Figure 3 represent whether there has been an increasing or a decreasing trend in the applications in a particular domain in the last three years. The increasing number of applications of MT to bioinformatics and artificial intelligence (AI), including machine learning and autonomous vehicles, is worth noting. We next highlight some successful applications of MT in the domains of compilers and AI.

Compilers
Several researchers have proposed using MT to find failures in compilers based on the following idea: removing dead source code from a program should not alter the functionality of the compiler-generated code. 6 The approach works in three steps:

Run a program (source test case)
with some inputs using a profiler to identify the executed code. 2. Create a new version of the program (follow-up test case) by  This approach was reported to have detected 147 bugs (110 fixed) in the GCC and Low-Level Virtual Machine C compilers when first published, 6 but hundreds more bugs have been found (and fixed) since then. 14 A similar strategy has been used in industry by GraphicsFuzz for testing graphics drivers. 10 Their approach works as follows 15 : First, an image is rendered using a shader program [through the graphics driver under test, which includes a shader compiler that translates the shader program into low-level machine code for the target graphics processing unit (GPU)]. The output image is called the original image. Second, a transformation is applied to the shader program that should have no significant impact on how the image is rendered (for example, adding "+0.0" to an arithmetic expression). Third, an image is rendered by using the modified shader program (still through the graphics driver under test), obtaining a variant image. Finally, the original and variant images are compared. If the differences are significant, the graphics driver is faulty. At the time of writing, the GraphicsFuzz toolset had revealed more than 83 issues in several graphics compilers from popular GPU designers.
Among others, it has been publicly credited for detecting bugs in Apple (iOS Webkit), NVIDIA, and Chrome (on Samsung S6), receiving a Google Chrome bug bounty of US$2,000. GraphicsFuzz was acquired by Google in August 2018.

AI
MT has been used for testing AI tools such as supervised and unsupervised machine-learning programs, 16 which "learn" from a set of data samples composed of attributes and labels. Metamorphic relations in this domain define changes in the samples used for learning that produce a predictable effect in the knowledge extracted from them: changing the order of the attributes in the samples should have no impact in the outcome, for example. MT has been used to detect real bugs in the machine-learning tools Weka and RapidMiner, among others.
MT has also proven to be effective for testing AI-driven systems. Researchers at the Fraunhofer Center for Experimental Engineering developed a framework for the automated testing of simulated autonomous drones. 17 Their approach starts by defining a model of a flying scenario and observing the drone behavior in this scenario. Then, they programmatically generate multiple variations of the scenario in which the outcome of the drone flight should be equivalent. For example, the drone should behave consistently whether it is flying north or south, if the distances and relative positions of obstacles are the same. This approach enabled the researchers to detect a number of issues that led to unexpected behavior of the drone, including fatal crashes. For example, they found that the drone had problems landing in some situations when rotating objects were  in the scene. They determined that the problem was related to the direction of sunlight not being rotated and that this caused a shadow to fall on the landing pad in some orientations, causing the vision system to fail to recognize the landing spot. A similar idea was used by researchers at the University of Virginia and Columbia University to implement DeepTest, a testing tool for self-driving cars driven by deep neural networks (DNNs). 18 DeepTest automatically generates synthetic test cases with different driving conditions such as rain or fog, where the car should behave similarly. Among other results, DeepTest found thousands of erroneous actions under different realistic driving conditions, some of which led to fatal crashes in three top-performing DNNs in the Udacity self-driving car challenge. Figure 4 shows one of the test cases generated by DeepTest that revealed a failure. 19 The blue arrow in Figure 4(a) shows the original trajectory of the car. The red arrow in Figure 4(b) shows the erroneous trajectory calculated by the DNN-driven system when fog was added to the scene.

Limitations
MT also has some limitations that may narrow its applicability in certain domains. First, although MT can be used to alleviate the oracle problem, it cannot solve it completely. This is because metamorphic relations cannot be used to tell whether or not the output of a program is the expected one. For instance, the metamorphic test of Figure 2 could pass even if the outputs of the source test case ("hotels in Rome") and the follow-up test case ("hotels in Rome with a budget of US$120-US$180 per night") are wrong (if both searches return a particular hotel in Florence, for instance). Thus, MT (on its own) may not be suitable for testing critical systems that require the correctness checking of each individual output.
Another limitation of MT is the need to identify metamorphic relations, which is typically a manual process requiring effort and creativity. Although some approaches for the automated discovery of metamorphic relations exist, so far, they have mostly focused on numerical programs. 20,21 Reported experiences of teaching MT, however, suggest that students can easily identify effective metamorphic relations after only a few hours of training. 13 Finally, the number of metamorphic relations in most nontrivial programs is potentially huge. This leads to a common problem when applying MT: how to select the most effective metamorphic relations. Although some heuristics for guiding the process do exist, as previously explained, more systematic approaches for such optimal selection are yet to be proposed.

A thriving testing technique,
MT has demonstrated its ability to address the oracle problem and to enable test-case generation. Future contributions are expected in many areas, including new application domains, automated inference of metamorphic relations, tools, and integration with other testing techniques.