Quality Estimation and Optimization of Adaptive Stereo Matching Algorithms for Smart Vehicles

Stereo matching is a promising approach for smart vehicles to find the depth of nearby objects. Transforming a traditional stereo matching algorithm to its adaptive version has potential advantages to achieve the maximum quality (depth accuracy) in a best-effort manner. However, it is very challenging to support this adaptive feature, since (1) the internal mechanism of adaptive stereo matching (ASM) has to be accurately modeled, and (2) scheduling ASM tasks on multiprocessors to generate the maximum quality is difficult under strict real-time constraints of smart vehicles. In this article, we propose a framework for constructing an ASM application and optimizing its output quality on smart vehicles. First, we empirically convert stereo matching into ASM by exploiting its inherent characteristics of disparity–cycle correspondence and introduce an exponential quality model that accurately represents the quality–cycle relationship. Second, with the explicit quality model, we propose an efficient quadratic programming-based dynamic voltage/frequency scaling (DVFS) algorithm to decide the optimal operating strategy, which maximizes the output quality under timing, energy, and temperature constraints. Third, we propose two novel methods to efficiently estimate the parameters of the quality model, namely location similarity-based feature point thresholding and street scenario-confined CNN prediction. Results show that our DVFS algorithm achieves at least 1.61 times quality improvement compared to the state-of-the-art techniques, and average parameter estimation for the quality model achieves 96.35% accuracy on the straight road.


1 INTRODUCTION

1.1 Background
Stereo matching algorithms are key enablers for the perceptibility of smart vehicles, empowering various imperative functionalities such as traffic objects detection and tracking [6,8,30,37], visual odometry [13], and three-dimensional (3D) reconstruction [36]. Compared to active sensory systems such as those employing LIDAR [38], stereo matching-based solutions are typically cheaper and more versatile.
Adaptive stereo matching (ASM) algorithms, unlike conventional ones, are able to provide scalable execution quality in reaction to the execution environment. The more execution cycles are assigned to an ASM algorithm, the higher the output quality it can achieve, at the price of increased system timing/energy resources [40]. For example, when more cycles are assigned, a larger disparity range is allowed and more accurate matching (quality) results can be obtained.
Accurately modeling the relationship between quality and execution cycles is challenging for ASM algorithms. Traditionally, quality modeling is implemented by extensive profiling and curve fitting [41]. Past researchers have used linear [25,40,45] or exponential [41] quality models, and deciding which model best suits an algorithm requires substantial work. In addition, each model has its own parameters, whose values are usually input-dependent. For example, even after choosing an exponential model, we still need to decide the values of its several parameters. As a result, it is essential to determine not only a suitable quality model for ASM algorithms but also its best parameter values.
With simplified quality models, some dynamic voltage/frequency scaling-(DVFS) based optimization approaches have been developed to maximize the total cycles under timing and energy constraints [25,40,45]. A common assumption taken by those works is that maximizing quality can be implicitly achieved by maximizing the processor execution cycles of tasks. While the assumption makes the problem formulations and solutions easier, ignoring the non-linear quality-cycle relationship could produce suboptimal quality maximization results. Instead of being cycle-centric, we would like to develop a quality-centric DVFS formulation and solution in this work.
In summary, several important issues have to be addressed before this promising ASM can be optimally and practically employed on smart vehicular systems, including how to (1) accurately model the output quality adaptivity versus execution cycles; (2) optimally determine the system execution parameters, such as processor operating voltage/frequency and execution cycles, that achieve quality maximization; and (3) accurately estimate the parameters of the quality-cycle model. Since the parameters are input image-dependent, even extensive profiling of the stereo matching may still lead to inaccuracy.

1.2 Illustrative Example
Being adaptive does not automatically guarantee optimized output quality for ASM tasks. In this part, we demonstrate the effectiveness of judiciously configuring system V/F settings (as well as pre-determining ASM cycle values) for ASM tasks, being aware of their quality models, to maximize the total execution quality. Assume two stereo matching tasks, T1 from the Kitti dataset [9] and T2 from the Carla dataset [7], whose execution cycles/frequencies/voltages are to be determined before actually running. The tasks feature attenuatively increasing quality over cycles, and we adopt the corresponding exponential function Q_i(o_i) = a_i(1 − e^(−o_i/b_i)) to model them, as described in Table 1. We also assume that a task's energy consumption is linearly related to its execution cycles under a given supply voltage v_dd; it is expressed as E = K·v_dd^2·o, where K is a chip-specific constant and o is the number of execution cycles. The initial conditions of each task can be found in Table 1 and Figure 1(a). The execution cycle count o can be calculated as the product of execution time and frequency. The initial total quality of executing T1 and T2 is 7.3(1 − e^(−0.9×10^8/(0.04×10^9))) + 6.7(1 − e^(−1.8×10^8/(0.03×10^9))) = 13.21. The matching errors Err of T1 and T2, where Err is to be defined later and quality is measured as its inverse, are Err1 = 15.3% and Err2 = 14.96%, respectively. Figure 1(a) shows the effects of different cycle allocation methodologies on quality maximization. Assume that instead of executing for 600 ms, T2 finishes at 400 ms, and the spared processor cycles, equal to 300 MHz × 200 ms = 6×10^7 cycles and depicted as area A in Figure 1, are allocated to T1 running at 300 MHz. The new total quality then becomes 7.3(1 − e^(−1.5×10^8/(0.04×10^9))) + 6.7(1 − e^(−1.2×10^8/(0.03×10^9))) = 13.71. The matching errors of T1 and T2 become Err1 = 14% and Err2 = 15.2%, respectively.
The quality improvement is subject to awareness of quality function attenuation: comparing the two quality functions, T2 executing from 400 ms to 600 ms does not gain as much quality as T1 does from 300 ms to 500 ms. This shows that the total execution quality can be improved if system resources are allocated with awareness of the quality function characteristics, especially quality attenuation. Also, fast and accurate quality function estimation is an imperative prerequisite for quality maximization. Figure 1(b) illustrates how judiciously applying DVFS could further improve the execution quality. Here T2's voltage/frequency is reduced from {0.9 V, 300 MHz} to {0.8 V, 200 MHz}. We accordingly extend the execution time of T2 to 600 ms to keep the cycles and quality of T2 unchanged. The reduction of T2's voltage/frequency saves energy of ΔE = E_2,300MHz − E_2,200MHz = K·(0.81 − 0.64)·1.2×10^8 units. If ΔE is claimed by T1, then additional cycles and quality can be generated, as depicted in area B of Figure 1(b). Assume that T1 still runs at {0.9 V, 300 MHz}; its execution cycle count becomes 1.75×10^8 after claiming ΔE. The total quality then becomes 7.3(1 − e^(−1.75×10^8/(0.04×10^9))) + 6.7(1 − e^(−1.2×10^8/(0.03×10^9))) = 13.79 > 13.71 > 13.21, indicating that DVFS further improves the execution quality of adaptive tasks. The corresponding matching errors of T1 and T2 are Err1 = 13.87% and Err2 = 15.2%, respectively. Compared to the initial condition, T1's error is reduced by 9.35% at the cost of increasing T2's error by 1.6%.

Figure 2. Illustration of the ASM matching error, Err (indicated in the red region), after cycle re-allocation and DVFS optimization. The top row shows the results of executing task T1, whose input images are obtained from Kitti; the bottom row shows task T2, whose input is from Carla.
(a) shows the initial matching error; (b) shows the results after judicious cycle re-allocation; (c) shows the results after applying DVFS for further optimization.
The visual illustration of the effects of applying the abovementioned optimizations is shown in Figure 2. With judicious cycle allocation and DVFS strategies, it is possible to substantially improve the overall stereo matching quality under identical resource constraints. In this work, we propose a framework that efficiently estimates the quality function for the stereo matching application, takes both quality attenuation and DVFS into the problem formulation, and provides an efficient solution to find the optimal execution settings of cycle, voltage, and frequency.
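The arithmetic of this illustrative example can be checked with a few lines of Python; the exponential quality model and the task parameters (a1 = 7.3, b1 = 0.04×10^9; a2 = 6.7, b2 = 0.03×10^9) are taken directly from the numbers used in the example above.

```python
from math import exp

def quality(o, a, b):
    """Exponential quality model Q(o) = a * (1 - e^(-o/b))."""
    return a * (1.0 - exp(-o / b))

# Task parameters (a_i, b_i) as used in the formulas of the example.
a1, b1 = 7.3, 0.04e9
a2, b2 = 6.7, 0.03e9

# Initial schedule: T1 runs 300 ms, T2 runs 600 ms, both at 300 MHz.
q_init = quality(0.9e8, a1, b1) + quality(1.8e8, a2, b2)

# Cycle re-allocation: T2 stops at 400 ms; its 6e7 spare cycles go to T1.
q_realloc = quality(1.5e8, a1, b1) + quality(1.2e8, a2, b2)

# DVFS: T2 drops to {0.8 V, 200 MHz}; the saved energy buys T1 roughly
# (0.81 - 0.64) * 1.2e8 / 0.81 ≈ 2.5e7 extra cycles at 0.9 V.
q_dvfs = quality(1.75e8, a1, b1) + quality(1.2e8, a2, b2)

print(round(q_init, 2), round(q_realloc, 2), round(q_dvfs, 2))
# → 13.21 13.71 13.79
```

Rounding the three totals reproduces the 13.21 → 13.71 → 13.79 quality progression of the example.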

1.3 Related Work
Adaptive applications have been studied extensively, given their flexibility in adjusting system resources in response to changes in the environment. In Reference [23], the authors summarize current methods and applications of adaptive systems. There have been pioneering works modeling quality-scalable applications, including Imprecise-Computation tasks [5] and multi-version tasks [33]. Aydin et al. provide an optimal static solution for the imprecise computation task scheduling problem using convex programming [2]. An approximate computing framework named ApproxIT is proposed to manage quality-scalable applications dynamically and guarantee the output quality by unfolding the execution of iterative methods [43]. Chippa et al. propose the concept of dynamic effort scaling, which guides feedback control to scale the system execution quality dynamically at different levels of abstraction [4].
There exist research works demonstrating that stereo matching methods are inherently suitable to be converted into adaptive versions [18,21,22,39]. Adaptive window methods [21,22] try to find an optimal support window to determine the size of matching and improve the matching accuracy. An adaptive guided image filter [39] is proposed to feature an adaptive rectangular support window instead of the traditionally fixed window. The adaptive support weight method [18] attempts to adjust the support-weights of the pixels in a given window to reduce the image ambiguity. Unfortunately, none of the works models the workload variations that the adaptivity could bring to processors.
However, several representative stereo matching algorithms are potentially suitable to be converted into the adaptive version with respect to processor cycles. The Efficient LArge-scale Stereo (ELAS) [10] is based on a generative probabilistic model, and the authors propose a Bayesian-based approach to compute accurate disparity maps. It builds a prior distribution over the disparity space by forming a triangulation on robust support points, decreasing stereo matching ambiguities without the need for global optimization. The bilateral filter [28] can decompose an image into different scales without causing haloes after modification while respecting strong edges. The guided filter can generate the filtering output by considering the content of a guidance image, derived from a local linear model [12]. It has the edge-preserving property and is runtime independent of the filter size. An undeveloped feature of these algorithms is that they can change their execution time by adjusting the disparity parameter to convert them into adaptive versions. In this article, we investigate the internal mechanisms of these stereo matching algorithms to exploit the possibility of turning one into its adaptive version.
DVFS allows processors to change their voltage and corresponding frequency dynamically, altering workload execution behaviors such as energy consumption, makespan, and heat generation. Traditionally, DVFS-based methodologies optimize performance metrics such as energy [27] and throughput [11,14,44]. The problems are usually converted into optimization formulations with one or more of those metrics as objectives and the remaining ones as constraints that trade off against the objectives [2,26,40]. Generally, an important assumption of the existing works is that the task execution cycles are non-adaptive, making them less applicable to adaptive workloads.
Recently, DVFS-based optimization approaches have been applied to quality-adaptive applications to maximize total quality under timing, energy, and thermal constraints [25,40,41,42,45]. Mo et al. maximize the optional cycles of imprecise computation tasks through mixed-integer linear programming formulations and solutions [25]. Zhou et al. presented adaptive task mapping and scheduling heuristics targeting quality maximization under renewable energy supplies [45]. A scheduling strategy targeting adaptive task dependency and communication is proposed in Reference [40]. While the above works strive to achieve the optimal execution cycles, they ignore the non-linear quality-cycle mappings, which may lead to sub-optimal decisions. In this work, we consider the application-specific quality-cycle properties and directly optimize the quality. The authors of Reference [41] propose an efficient iterative pseudo quadratic programming heuristic to decide the optimal cycles using convex quality functions. However, that approach considers only frequency scaling under a given voltage, while the quality improvement space can be more significant when comprehensively applying DVFS. While all the above-mentioned algorithms attempt to distribute the energy budget optimally to individual parallel tasks, naive heuristics (such as Evenly Distributing (ED) the energy budget to parallel tasks) exhibit low algorithmic complexity at the expense of degraded output quality. An efficient formulation and solution of the resource budget distribution is essential to obtain optimal quality at comparable complexity.

1.4 Scope of the Article
In this work, we make the following contributions toward optimizing the total execution quality of adaptive stereo matching applications for smart vehicle systems:
• We exploit the mechanism of converting a representative class of stereo matching applications (namely, binocular stereo matching) into the adaptive version (namely, ASM) that can flexibly adjust the output quality.
• We develop a DVFS-based approach to maximize the output quality of ASM tasks under the system timing, energy, and temperature constraints. Compared to state-of-the-art works, our approach achieves significant quality improvement by:
  - Directly optimizing the system quality metric instead of implicitly maximizing CPU cycles,
  - Judiciously scaling the supply voltage to exploit a larger optimization space compared to frequency scaling-based approaches, and
  - Proposing an efficient quadratic programming formulation to achieve quality optimization at low algorithmic overhead.
• We propose two efficient and accurate solutions, namely location similarity-based feature point thresholding (L-FPT) and street scenario-confined CNN (S-CNN), to infer the quality function parameters. We verify the proposed methods using the dataset captured from the Carla simulator [7].
Our work leads to a framework for modeling, executing, and optimizing ASM on smart vehicle platforms, as summarized in Figure 3. In the following sections, we discuss the components of the framework in detail. Section 2 introduces the modeling of ASM and the timing/energy/temperature constraints. Section 3 presents the DVFS algorithm that maximizes output quality. Section 4 describes how to estimate the quality function parameters using L-FPT and S-CNN. Section 5 presents the experimental results and analysis. Section 6 concludes the article.

2 MODELING

2.1 Adaptive Binocular Stereo Matching
Given binocular imagery of the same scene, stereo matching methods attempt to match pixels in one image with the corresponding ones in the other, to enable calculating the object distance. Figure 4 illustrates a simplified binocular stereo matching system, which captures the scene of an object as a pair of images taken by the paired cameras. The term disparity denotes the distance between two corresponding pixels in the left and right images. The scene depth, namely the detectable distance between the object and the cameras, is inversely proportional to the disparity value. Many stereo matching algorithms constrain the disparity search space to a range [d_min, d_max] to alleviate the computational cost [9,10,35]. However, lowering d_max has the side effect of an increased matching error rate, since reducing the disparity range leaves more regions undetectable where the paired pixels lie beyond d_max, typically the regions near the cameras. Figure 5 illustrates the matching results of three different images after applying ELAS, where the red region indicates the parts that are not successfully detected and matched.

Definition 1. The matching error of a stereo matching algorithm, denoted as Err, is defined as the percentage of pixels whose disparity differs from the ground truth (GT) by more than a certain threshold δ, according to Reference [34]:

Err = |{p : |d(p) − d_GT(p)| > δ}| / N × 100%,   (1)

where N is the number of pixels with valid ground truth disparity.

On the other hand, reducing the disparity range does reduce the computation cost of stereo matching algorithms, since the search space for matching shrinks with a smaller disparity range. Figure 6 shows the profiling results of extensively running several representative stereo matching algorithms, all of which exhibit the inherent characteristic that the disparity range has a linear relationship with the application execution cycles.
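As a concrete illustration of Definition 1, the sketch below computes Err from a disparity map and its ground truth; the 3-pixel default threshold is an assumption here (borrowed from the common Kitti evaluation convention), not a value stated in the text.

```python
import numpy as np

def matching_error(disp, gt, thresh=3.0):
    """Err of Definition 1: percentage of valid pixels whose disparity
    deviates from the ground truth by more than `thresh` pixels."""
    valid = gt > 0                        # pixels with known GT disparity
    bad = np.abs(disp - gt) > thresh
    return 100.0 * np.count_nonzero(bad & valid) / np.count_nonzero(valid)

disp = np.array([[10.0, 20.0], [30.0, 40.0]])
gt   = np.array([[10.0, 25.0], [30.0, 40.0]])
print(matching_error(disp, gt))  # → 25.0 (1 of 4 pixels is off by > 3)
```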
Given the linear relationship between execution cycles and disparity range, and the inversely proportional relationship between disparity range and error, it follows that by carefully manipulating d_max (assuming d_min is fixed), the application execution cycles and the matching error exhibit a reverse relationship. Simply put, more execution cycles lead to less error (i.e., more quality), and fewer execution cycles give more error (i.e., less quality). To capture the quality-cycle relationship, we examine three models, namely linear, power, and exponential models. Figure 7 gives an intuitive comparison of the three models. The green dots represent the ASM quality, measured as the inverse of the matching error. The modeling error is evaluated using the Mean Square Error (MSE) between the original quality values and the fitted values. The MSE values of <linear, power, exponential> modeling for the three sample images are <0.3014, 0.0104, 0.0037>, <0.4598, 0.0185, 0.0042>, and <0.2289, 0.1027, 0.0485>, respectively. This shows that the exponential model gives better modeling accuracy than the linear and power models.
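The model comparison can be reproduced in miniature: the sketch below fits linear, power, and exponential models to synthetic quality samples drawn from a saturating curve (made-up parameters standing in for the profiled data) and compares their MSEs.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
o = np.linspace(0.2e9, 2.0e9, 20)            # execution cycles
q = 0.4 * (1 - np.exp(-o / 0.8e9))           # "profiled" quality samples
q = q + rng.normal(0.0, 0.002, o.size)       # small measurement noise

def lin(o, a, b):  return a * o + b
def pwr(o, c, k):  return c * np.power(o / 1e9, k)
def expm(o, a, b): return a * (1 - np.exp(-o / b))

mse = {}
for name, f, p0 in [("linear", lin, (1e-10, 0.0)),
                    ("power", pwr, (0.3, 0.5)),
                    ("exponential", expm, (0.5, 1e9))]:
    p, _ = curve_fit(f, o, q, p0=p0, maxfev=10000)
    mse[name] = float(np.mean((f(o, *p) - q) ** 2))

print(min(mse, key=mse.get))  # → exponential
```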
As quantitatively evaluated in Section 5, exponential functions model the stereo matching quality-cycle characteristics most accurately among the three. In this work, we thus focus on the following concave quality function:

Q(o) = a (1 − m · e^(−o/b)),   (2)

where a, b, m are stereo matching-specific function parameters and o represents the execution cycles. Ideally, the parameters a, b, m would be input-invariant and obtainable through extensive profiling and curve fitting [41]. However, for stereo matching applications, the parameters are largely input-image dependent. In Section 4, the L-FPT and S-CNN solutions are proposed to obtain the parameters efficiently and accurately in real time.

2.2 System and Energy Models
We assume that the ASM tasks run on an MPSoC embedded in the vehicular system, comprising a set of heterogeneous processors p ∈ P with voltage/frequency scaling capabilities. The power consumption contains both dynamic and static components. The dynamic power consumption of task i is expressed as:

P_i,dyn = SW · f_i · v_i^2,   (3)

where SW is the average switching capacitance, f_i is the processor frequency, and v_i is the processor supply voltage. To emphasize the role that voltage plays, we rewrite the expression as P_i,dyn = C_f_i · v_i^2, with C_f_i = SW · f_i. Leakage power can be approximated as a consecutive piecewise-linear model w.r.t. the temperature of the thermal block and the voltage of a core [11]. The corresponding expression is:

P_i,lkg = P_i0,lkg + K_α · T_i + K_β · v_i,   (4)

where P_i0,lkg is the initial power of the given linear section and T_i is the chip temperature. K_α and K_β are the slopes of leakage power w.r.t. temperature and voltage in the given linear section, obtainable by extensive profiling [11]. The total power consumption is then:

P_i = P_i,dyn + P_i,lkg = C_f_i · v_i^2 + P_i0,lkg + K_α · T_i + K_β · v_i.   (5)

The total energy consumption to execute o_i cycles of task i, which takes time t_i = o_i / f_i, is:

E_i = P_i · o_i / f_i.   (6)

In this work, we assume that for a given voltage v_i, the chosen f_i belongs to a corresponding discrete frequency set.
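Putting the dynamic and leakage terms together, the total-energy computation above can be sketched as follows; all constants are illustrative placeholders, not profiled values.

```python
def energy(o, v, f, SW=1.0e-9, P0_lkg=0.05, K_alpha=0.001, K_beta=0.1, T=70.0):
    """Total energy to run o cycles at voltage v (V) and frequency f (Hz):
    E = (C_f * v^2 + P0_lkg + K_alpha * T + K_beta * v) * o / f,
    with C_f = SW * f. All constants are illustrative placeholders."""
    p_dyn = (SW * f) * v * v                     # dynamic power C_f * v^2
    p_lkg = P0_lkg + K_alpha * T + K_beta * v    # piecewise-linear leakage
    return (p_dyn + p_lkg) * (o / f)

# Same cycles and frequency, higher voltage -> strictly more energy.
print(energy(1.2e8, 0.9, 300e6) > energy(1.2e8, 0.8, 300e6))  # → True
```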

2.3 Thermal-Constrained Voltage
The heat dissipation of on-chip thermal blocks can be expressed analogously to a thermal RC model [15]:

C · dT_i(t)/dt = P_i − G_p,env · (T_i − T_env) − Σ_{m∈M} G_p,m · (T_i − T_m),   (7)

where M is the set of all adjacent processors, G_p,m is the thermal conductance between p and neighboring processor m, T_env is the surrounding air temperature, and G_p,env is the thermal conductance to the air surroundings, including both the chip cover and bottom surfaces. By substituting Equation (5) into Equation (7) and considering only the steady-state temperature, we obtain the relationship between temperature and voltage in the following matrix form:

T_l ≤ Ψ_f1 · v^2 + Ψ_f2 · v + Φ_f,T ≤ T_u,   (8)

where v is the vector of processor voltages, T_l and T_u are the vectored thermal lower and upper limits, and Ψ_f1, Ψ_f2, and Φ_f,T are coefficient matrices determined by frequency, power, and thermal conductance. The detailed derivation of Equation (8) can be found in Reference [41], where we replace the frequency variable with the voltage, under the assumptions that G_p,m = 0 if processor p is not directly adjacent to block m, and C·dT_i(t)/dt → 0, namely that the temperature quickly converges to steady state.
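With dT/dt → 0, the RC model yields the steady-state temperature of a single block in closed form; the sketch below solves it for one block with made-up conductance values.

```python
def steady_state_temp(P, T_neighbors, G_pm, G_env, T_env=25.0):
    """Steady-state block temperature: with C * dT/dt = 0, the balance
    P = G_env * (T - T_env) + sum_m G_pm * (T - T_m) solves to a
    conductance-weighted mean plus the power term."""
    num = P + G_env * T_env + sum(g * t for g, t in zip(G_pm, T_neighbors))
    den = G_env + sum(G_pm)
    return num / den

# A core dissipating 2 W next to two 50 °C neighbors (made-up conductances).
T = steady_state_temp(P=2.0, T_neighbors=[50.0, 50.0],
                      G_pm=[0.1, 0.1], G_env=0.3)
print(T)  # → 39.0
```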

3 DVFS FOR QUALITY MAXIMIZATION
Given the above-mentioned models, in this section, we propose the DVFS formulation and solution to obtain the maximum output quality of ASM tasks under timing, energy, and thermal constraints. The QP formulation is described in Section 3.1. The voltage scaling heuristic is presented in Section 3.2. Section 3.3 presents the complete DVFS algorithm that iteratively solves the QP problem, employing voltage scaling between the QP iterations.

3.1 QP Formulation for Quality Optimization
We focus on maximizing the total quality improvement of all the ASM tasks that run in parallel after being dispatched to the ready queue of a multiprocessor system. The total quality improvement can be expressed as:

ΔQ = Σ_i a_i (e^(−o_i/b_i) − e^(−(o_i+Δo_i)/b_i)),   (9)

where a_i and b_i are the task-specific exponential parameters, o_i is the initial cycle count, and Δo_i is the number of cycles to be added. To formulate the problem in a tractable, namely QP, form, we apply Taylor expansion to make objective (9) quadratic. Because b_i is a constant obtained when modeling the exponential function, we can assume Δo_i ≤ b_i, so terms of order 3 and higher can be ignored with tolerable error for our stereo matching adaptability. Objective (9) can thus be approximated as Equation (10) in the following QP formulation:

ΔQ ≈ Σ_i a_i · e^(−o_i/b_i) · (Δo_i/b_i − Δo_i^2/(2b_i^2)).   (10)

Subject to
The constraints are described as follows:
• Energy Constraint (11) is obtained from Equation (6), stating that the total energy consumption of ASM should not exceed the energy budget ε_s after changing the execution cycles from o_i to o_i + Δo_i. The T_i here is chosen as the upper temperature limit to avoid thermal violations.
• Timing Constraint (12) states that the timing budget τ_s limits executing o_i + Δo_i cycles for each task i.
• Thermal Constraint (13) is obtained from Equation (8). It restricts the temperature to the linear section of the piecewise leakage power model.
In this formulation, the variables of interest are the sets of Δo_i and v_i for all ASM tasks i. The frequency f_i can only be determined after v_i is known and is temporarily set to the lowest value for the time being. Note that this is not yet a QP formulation, due to the non-linear constraints (11) and (13). To linearize the constraints, we make two transformations: (1) decouple the product form of Equation (11), and (2) treat v_i^2 as an ensemble variable Θ_i. The above formulation is then transformed into the following QP formulation with the variables of interest being Δo_i and Θ_i:

Subject to
The problem (14)-(17) is a QP formulation and can be efficiently solved by the interior point method in polynomial time.
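For intuition, the sketch below solves a stripped-down version of this optimization: it keeps the quadratic Taylor objective but collapses the energy/timing/thermal constraints into a single aggregate cycle budget and drops the voltage variables (parameters follow the running example; SciPy's SLSQP stands in for an interior point solver).

```python
import numpy as np
from scipy.optimize import minimize

# Quality parameters (a_i, b_i) and current cycles o_i, in units of 1e8
# cycles for numerical conditioning (values follow the running example).
a = np.array([7.3, 6.7]); b = np.array([0.4, 0.3])
o = np.array([0.9, 1.2])
budget = 0.6                      # aggregate extra-cycle budget

def neg_dq(do):
    """Negated quadratic Taylor objective:
    -sum_i a_i * e^(-o_i/b_i) * (do_i/b_i - do_i^2 / (2 b_i^2))."""
    w = a * np.exp(-o / b)
    return -np.sum(w * (do / b - do ** 2 / (2 * b ** 2)))

res = minimize(neg_dq, x0=np.zeros(2), method="SLSQP",
               bounds=[(0.0, None)] * 2,
               constraints=[{"type": "ineq",
                             "fun": lambda do: budget - do.sum()}])
do_opt = res.x
even = np.full(2, budget / 2)     # naive even split, for comparison
print(-neg_dq(do_opt) > -neg_dq(even))  # → True
```

The optimizer shifts more of the budget to the task with the steeper (less attenuated) quality function, beating the even split.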

3.2 Searching v_i* for Voltage Scaling
It would be ideal if the calculated v_i coincided with v_i* in Equation (15). Otherwise, an iterative v_i* adjustment process is needed to reduce the gap between v_i and v_i* to zero. We consider solving the QP problem iteratively, during which v_i* is reduced to the QP-solved v_i value whenever v_i < v_i*. We show that the corresponding cycle count Δo_i keeps increasing in this process.
To prove this, assume two consecutive rounds of QP optimization, r1 and r2. For task i, we set an initial voltage v_i* for round r1; the QP then optimally produces voltage v_i^r1 and cycles Δo_i^r1. Theorem 1 states that Δo_i^r2 ≥ Δo_i^r1 whenever v_i^r2 ≤ v_i^r1. In other words, as the voltage decreases between round r1 and round r2, Δo_i increases.
Proof. We prove the claim by contradiction. Given that <v_i^r1, Δo_i^r1> and <v_i^r2, Δo_i^r2> are the optimal solutions of rounds r1 and r2, respectively, both satisfy the energy and timing constraints. Assume Δo_i^r2 < Δo_i^r1. Since v_i^r2 ≤ v_i^r1 and both C_f_i and K_β are positive, executing Δo_i^r2 cycles at v_i^r2 consumes strictly less energy and time than executing Δo_i^r1 cycles at v_i^r1, which already fits within the budgets ε_s and τ_s converted from Equations (15) and (16). This shows that Δo_i^r2 still has slack to grow: it can be increased to a value such as Δo_i^r1 while still satisfying the energy and timing constraints. The fact that Δo_i^r2 can still be improved contradicts the assumption that it is the optimal solution of round r2 of the QP formulation. Hence the proof.

3.3 The Overall DVFS Algorithm
Algorithm 1 depicts the overall voltage/frequency scaling process. Initially, all v_i* are set to the maximal value to allow further scaling down. After each round of QP_Optimization, as described in Section 3.1, the task i with the maximal v_i or the maximal Δo_i improvement potential is selected as the voltage scaling-down candidate, and a new QP round is executed after setting v_i* = v_i. This process continues until all v_i ≥ v_i*, at which point we set all v_i = v_i* as the final voltage scaling results. The calculated Δo_i values are treated as the optimal cycles, since they keep increasing according to Theorem 1.
The frequency f_i can be determined after the v_i values are known, since v_i sets the upper limit for frequency scaling [20]. With frequency upscaling, there are still opportunities to further improve Δo_i, as shown in Equation (16). However, the energy and thermal constraints, namely Equations (15) and (17), might be violated. We adopt a trial-and-error approach to heuristically upscale the frequencies and further maximize Δo_i: a random f_i is selected and upscaled to the next discrete value; if Equations (15) and (17) are not violated, the upscaling is accepted. Otherwise, the next task is chosen and the trial repeated. The process ends when all f_i have been tried.
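The trial-and-error upscaling loop can be sketched as follows; the feasibility check here is a toy stand-in for the energy and thermal constraints of Equations (15) and (17).

```python
import random

def upscale_frequencies(f, f_levels, feasible):
    """Trial-and-error upscaling: visit tasks in random order, try the
    next discrete frequency level, and keep it only if `feasible` (the
    energy/thermal check) still holds; each task is tried once."""
    order = list(range(len(f)))
    random.shuffle(order)
    f = list(f)
    for i in order:
        idx = f_levels.index(f[i])
        if idx + 1 < len(f_levels):
            trial = list(f)
            trial[i] = f_levels[idx + 1]
            if feasible(trial):
                f = trial
    return f

levels = [200e6, 300e6, 400e6]
ok = lambda f: sum(f) <= 900e6     # toy stand-in for the constraints
print(upscale_frequencies([300e6, 300e6], levels, ok))
# → [400000000.0, 400000000.0] (both upgrades fit the toy budget)
```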

4 PARAMETER ESTIMATION FOR QUALITY FUNCTION
The key to DVFS optimization lies in the accurate parameter estimation of the quality function.
Traditionally, the quality function parameters, such as a, b in Equation (2), are obtained through extensive profiling and curve fitting [41]. However, this method is impractical at runtime, since the input patterns of the on-vehicle cameras are highly dynamic. In this work, our basic assumption is that a vehicle encounters streets following a certain probability distribution, with some traveled more frequently than others. For example, a household car may confine itself to a limited number of routes in daily life scenarios. The input images taken into the system are thus semantically confined. Based on this simplification, we propose two estimation methods for the parameters of the quality model:
• Location similarity-triggered Feature Point Thresholding (L-FPT), which estimates the quality function based on historical estimations at geo-locations that exhibit high input similarity.

• Street scenario-confined CNN (S-CNN), a neural network that trains and inferences the quality function used only for a specific street.

4.1 The L-FPT Approach
The vehicle may repeatedly come across the same geo-location, where the input image I_s could exhibit high similarity to a historical image I_ŝ taken at the same or a nearby location. In such cases, the characteristics of the processing loads are likely similar, and similar quality functions can thus be attributed to spatially recurring images. To evaluate the similarity of two images, namely the reference image and the compared image, we resort to the feature point matching rate R_FP of the compared images:

R_FP ≜ N_comp / N_ref × 100%,

where N_ref is the number of ORB feature points of the reference image and N_comp is the number of feature points found in both the compared and reference images. The motivation is that obtaining feature points is already an essential preprocessing step for state-of-the-art stereo matching algorithms, notably ELAS, where Sobel filtering is applied to generate the support points used in subsequent operations [10].
To obtain the feature points, off-the-shelf techniques such as Oriented FAST and Rotated BRIEF (ORB) [32] can be readily used without incurring significant overhead. An empirical threshold on the feature point matching rate, R_FP^thresh, is introduced to identify and admit spatially recurring images. Note that two images with R_FP ≥ R_FP^thresh may still differ slightly in the horizontal, vertical, or longitudinal direction; e.g., I_s could be taken at a slightly shifted vehicle position compared to I_ŝ, or at a location in front of or behind where I_ŝ was taken. This considerably improves the practicality of L-FPT.
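In practice the feature points would come from an ORB detector (e.g., in OpenCV); the sketch below isolates just the R_FP computation on toy 256-bit binary descriptors, with the Hamming-distance threshold being an assumed matching criterion.

```python
import numpy as np

def matching_rate(desc_ref, desc_comp, max_hamming=40):
    """R_FP = N_comp / N_ref * 100%: a reference feature counts as found
    if some compared descriptor lies within `max_hamming` bits of it.
    Rows are ORB-style 256-bit binary descriptors, one per feature point."""
    matched = 0
    for d in desc_ref:
        dists = np.count_nonzero(desc_comp != d, axis=1)  # Hamming distance
        if dists.min() <= max_hamming:
            matched += 1
    return 100.0 * matched / len(desc_ref)

rng = np.random.default_rng(1)
ref = rng.integers(0, 2, size=(50, 256))   # toy descriptors
print(matching_rate(ref, ref))             # → 100.0 (identical image)
```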

4.2 The CNN-based Approach
If the I_s under consideration has R_FP < R_FP^thresh, then the L-FPT results could be inaccurate. In this case, we propose a CNN-based solution to estimate the parameters of the quality function.

4.2.1 S-CNN vs. Universal CNN
A straightforward design option is a universal CNN, a one-for-all network used to predict any quality function given a stereo image taken at any street/road section. However, this may cause issues such as slow weight updates that lag behind the dynamically changing traffic context, leading to prediction inaccuracy as well as training inefficiency.
We propose a street scenario-confined CNN (S-CNN), which is trained on per-street data and used on that specific street as located by the vehicle. It improves efficiency and availability over the one-for-all CNN in the following aspects:
• Training speed and network size. Compared to the universal CNN, S-CNN requires limited training data, and the training can adapt to changes faster, since the ground knowledge, namely the passive objects on the street, remains unchanged with high probability. Although overfitting should be avoided in general, it could potentially contribute to the training speed here: with confined input training data, slightly enforcing overfitting helps "memorizing" rather than identifying the underlying distribution each time. The S-CNN can also lead to a more concise network by removing superfluous variables and asking only relevant questions, which favors embedded processing.
• Re-usability and self-containment. Rather than training a universal CNN for each vehicle, S-CNN is street-oriented, such that different vehicles can share the same S-CNN whenever located on that street. Assuming training in the cloud, the various vehicles could collaboratively contribute training input images. Moreover, given the independence of the streets, a significant change in one street does not affect the S-CNNs of the others, thus ensuring the overall quality function estimation performance during the vehicle's cruise activities.

Network Architecture.
We transform the problem of predicting the quality function parameters into a classification problem, where the value range of the parameters is divided evenly and at a fine granularity. The output of the network is thus a prediction of the "slot" that the parameters most likely fall in. We employ the LeNet network [19] for this purpose. The network has two layers of convolution and pooling operations, in addition to two fully connected layers. The output layer implements the cross-entropy loss with a softmax function to evaluate the output inconsistency [17]. Figure 8 illustrates the architecture of the proposed network.
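The slot encoding above can be sketched as a simple discretization. The parameter bounds below are the empirical ranges cited in Section 5.1, the 50-slot count follows the experimental setup, and the midpoint decoding is an illustrative assumption:

```python
import numpy as np

def param_to_slot(value, lo, hi, n_slots=50):
    """Map a continuous parameter value to a class label ("slot")."""
    # Clamp into the modeled range, then index the enclosing interval.
    idx = int((np.clip(value, lo, hi) - lo) / (hi - lo) * n_slots)
    return min(idx, n_slots - 1)  # the top edge falls into the last slot

def slot_to_param(slot, lo, hi, n_slots=50):
    """Recover a representative parameter value: the midpoint of the slot."""
    width = (hi - lo) / n_slots
    return lo + (slot + 0.5) * width
```

The decoding error is bounded by half a slot width, which is what makes the fine-grained division important for downstream quality estimation.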

EVALUATION AND ANALYSIS 5.1 Experimental Setup
We develop a simulated platform that consists of three computationally intensive modules, namely the DVFS module, the virtual processors, and the S-CNN inference module.
• DVFS module implements Algorithm 1, as well as the algorithms for comparison. It takes as input the quality functions and all energy, timing, and thermal constraints, and outputs the ASM execution cycles and voltage/frequency configurations. In our experiment, the maximum temperature is set at 70°C, the initial frequencies of all the processors are set at 300 MHz, and the initial energy budget is 0.04 J. Each ASM task has an exponential quality-cycle function, whose parameters are fed in from the S-CNN inference network or the L-FPT estimator. Empirically, typical values for the parameters are a_i ∈ [0.2, 0.5] and b_i ∈ [0.5, 1.2]. The Matlab-based QP solver is employed to obtain the optimal solution.
• Virtual processors are abstract mathematical modules that simulate running the ASM tasks. The processors are assumed to be tiled 2 × 2 with voltage/frequency scaling capability. Each core runs a simulated ASM task independently without preemption. The power and thermal characteristics of the processors are obtained from PTScalar [20]. The voltage of the processors ranges from 0.5 V to 0.8 V.
• S-CNN inference module is used to infer the parameters of the exponential quality functions. In the training phase, we train the S-CNN inference module on large datasets captured with the Carla simulator [7]. We resize the inputs to 48×48 images due to the vast datasets, and set the output layer to 50 labels to divide the coefficient interval in detail. To obtain optimal accuracy, we set the number of training iterations to 4,000. The S-CNN is implemented using TensorFlow [1].
In our work, we test our algorithm on three datasets, namely Kitti [9], New Tsukuba [24], and Carla [7]. Kitti and New Tsukuba are two widely used datasets for evaluating stereo matching algorithms. The Carla dataset is obtained from the Carla simulator to support the development, training, and validation of smart vehicular systems [7]. To verify the performance of our algorithm, we compare it with three state-of-the-art algorithms: (1) TS, a scheduling heuristic that maximizes the ASM execution cycles by selecting the task with the highest energy metrics [45]; (2) DFS, a scheduling method that optimizes the quality with the exponential model through dynamic frequency scaling [41]; and (3) ED, a scheduling method identical to our proposed approach, except that it evenly distributes the energy budget to all the processors.

Accurate Quality Function Modeling.
In this part, we evaluate the accuracy of using exponential functions to model quality-cycle relationships. We compare exponential models with linear and power models. Sample images are randomly selected from the three datasets [7,9,24], and curve fitting is used to examine the modeling errors. The modeling error is evaluated using the Mean Square Error (MSE) between the original quality values and the fitted values. Figure 9 illustrates the modeling results for 20 sample images randomly selected from each of the three datasets. The MSEs are on the order of 10^-1, 10^-2, and 10^-3 for the linear, power, and exponential models, respectively. This indicates a generally smaller modeling error using the standard exponential models, as defined in Equation (2).
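This comparison can be reproduced in miniature with SciPy curve fitting on a synthetic quality-cycle profile. The three candidate forms below are assumed shapes for illustration (the paper's exact exponential model is the one defined in Equation (2)), and the synthetic data are generated from the saturating exponential plus mild noise:

```python
import numpy as np
from scipy.optimize import curve_fit

# Candidate quality-cycle models. The exponential form is an assumed
# saturating shape standing in for the paper's Equation (2).
def linear(c, a, b):      return a * c + b
def power(c, a, b):       return a * np.power(c, b)
def exponential(c, a, b): return 1.0 - a * np.exp(-b * c)

def fit_mse(model, cycles, q, p0):
    """Fit the model to (cycles, q) samples and return the residual MSE."""
    params, _ = curve_fit(model, cycles, q, p0=p0, maxfev=10000)
    return float(np.mean((q - model(cycles, *params)) ** 2))

# Synthetic profile: quality saturates with execution cycles, plus noise.
rng = np.random.default_rng(0)
cycles = np.linspace(0.1, 5.0, 40)
q = 1.0 - 0.35 * np.exp(-0.9 * cycles) + rng.normal(0, 0.005, cycles.size)

mse = {m.__name__: fit_mse(m, cycles, q, p0)
       for m, p0 in [(linear, (0.1, 0.5)), (power, (0.8, 0.1)),
                     (exponential, (0.3, 1.0))]}
```

On such saturating profiles the exponential fit attains the smallest residual, mirroring the ordering reported in Figure 9.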

Performance of DVFS.
We evaluate the efficacy of our proposed DVFS algorithm on the 60 randomly selected images from the previous experiment. Table 2 lists the averaged results of applying our approach, including the improvement of disparity/quality and execution cycles, as well as the optimized system voltage and energy consumption. Please refer back to Figure 5 to visualize the matching quality improvement on the three datasets.
Execution results are compared with the initial disparity (d_max) configurations, given as 62, 72, and 38, respectively. Initial voltages are all set to 0.8 V. The third row of each group of tests shows the optimal results after applying our approach, where the disparity ranges increase to 197, 203, and 59, respectively, for images from each dataset. The last column of Table 2 shows that our approach achieves 2.21, 1.93, and 1.60 times overall quality improvement on the three datasets, respectively. We also pick sample tests where the execution cycles are roughly in the middle between the initial and optimized disparity settings (50% in the range), as shown in the second row of each group of tests. It is interesting to observe the following: (1) The disparity value is also in the middle of the range, obeying the linear relationship between disparity and execution cycles, as shown in Figure 6. (2) The output quality may not increase linearly with execution cycles. Rather, it improves less quickly as the cycle count increases. Table 2 shows that the quality improvement ratios before and after the middle cycle point (namely the second row) are <82.1%, 17.9%>, <76.9%, 23.1%>, and <74.3%, 25.7%>, respectively, for the three datasets. This validates our strategy of directly optimizing the ASM quality rather than the ASM execution cycles, since quality does not increase uniformly with execution cycles for ASM tasks. Figure 10 shows the results of comparing our DVFS approach with the TS, DFS, and ED approaches. The results are the average values of executing the algorithms on 1,000 straight road section images and 1,000 road junction images captured from Carla, under various energy constraints. Results show that on the images of straight road sections, our approach achieves on average 1.61× more quality over ED, 2.78× over TS, and 4× over DFS.
On the images of road junctions, our approach achieves on average 1.79× more quality over ED, 3.13× over TS, and 3.75× over DFS. In both cases, the advantage over ED is due to optimally distributing the energy resource among the processors through Equation (15), rather than rigidly dividing the energy budget evenly. TS distributes the execution cycles prioritized by the energy coefficient, rather than directly optimizing the quality. DFS only considers frequency scaling under the voltage of 0.8 V. Although this voltage gives the maximal frequency scaling range, considering frequency scaling alone is less effective for quality improvement, because it ignores the significant role that voltage plays in dynamic and leakage energy consumption. Figure 11 numerically shows the inaccuracy of the quality metric introduced by using the Taylor expansion. The value of each column indicates the difference between the quality obtained from the optimization computation, which adopts the Taylor expansion, and the quality calculated from the cycle input according to the exponential model. Over the 20 samples examined, the maximum quality error due to the Taylor expansion is at a scale of 10^-3.
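The magnitude of the linearization error can be checked with a minimal numerical sketch. The saturating exponential below is an assumed stand-in for the paper's Equation (2), and the expansion point and step size are illustrative:

```python
import math

def q(c, a=0.3, b=0.8):
    """Assumed exponential quality-cycle model (stand-in for Equation (2))."""
    return 1.0 - a * math.exp(-b * c)

def q_taylor(c, c0, a=0.3, b=0.8):
    """First-order Taylor expansion of q around the operating point c0."""
    return q(c0, a, b) + a * b * math.exp(-b * c0) * (c - c0)

# Error of the linearization for a small step away from the expansion point.
c0, dc = 2.0, 0.1
err = abs(q(c0 + dc) - q_taylor(c0 + dc, c0))
```

For small steps the error is dominated by the second-order term (1/2)·a·b²·e^(-b·c0)·dc², which stays well below the 10^-3 scale observed in Figure 11 for these parameter values.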

Effectiveness of L-FPT.
Recall that L-FPT assumes that at a certain location (possibly with a slight position shift), the processor executing ASM tasks with such inputs exhibits similar quality-cycle relationships; thus the parameters can be guessed from historical records at the same location. To validate this assumption, we randomly select a fixed location in the Carla simulator, take a reference image as shown in Figure 12(a), and manually create 300 images with increased density of vehicles and pedestrians (see samples in Figure 12(b)-(d)). The purpose is to alter the feature point matching rate R_FP such that within a certain threshold R_FP^thresh, the function parameters have small differences in value. Figure 13(a) shows that with R_FP ≥ 85%, which is empirically obtained for the considered location, the differences of parameters |a_i − a_j| and |b_i − b_j| are within 3.38% and 3.19%, respectively, compared with the worst-case differences in Figure 13(a). Correspondingly, Figure 13(b) shows that with R_FP ≥ 85%, the ASM quality difference is confined within 7.15%. Figure 14 reflects a scenario where the vehicle is turning around a road junction and illustrates the changes in the function parameters (a_i, b_i) and the calculated disparity range during this process. The process is captured by a continuous video clip of 28 s with a frame rate of 30 fps. The parameters (a_i, b_i) of each frame are profiled as in Figure 7. The corresponding disparity range is calculated using our proposed DVFS algorithm, where the settings follow Section 5.2.2. Two scenarios are studied for the applicability of L-FPT. Points (a), (b), and (c) in Figure 14 correspond to frames 160, 180, and 200 and reflect a certain location with a slight front/back shift. At this location, the values of a_i and b_i fluctuate in a small range, while the calculated disparity remains constant in the meantime.
This implies that, to obtain the same calculated disparity range (and hence the output quality), the a_i and b_i values can tolerate certain fluctuations and inaccuracy. Thus, it is possible to avoid calculating a_i and b_i for each frame during this process and instead use predefined historical values that satisfy the L-FPT requirement, which avoids a considerable computation workload. However, points (d), (e), and (f) in Figure 14 correspond to frames 580, 610, and 650 and reflect the process of turning at the road junction. The L-FPT approach fails in this process, since the images taken before and after the junction turn can be quite different. Figure 15 shows the corresponding images, as indicated in Figure 14.
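The L-FPT reuse decision can be sketched as a small per-location cache lookup. The cache layout, the location key, and the function names are illustrative assumptions; the 85% threshold follows the experiment above:

```python
def lfpt_lookup(cache, location, r_fp, r_thresh=0.85):
    """Return cached (a, b) quality-model parameters for this location if the
    feature-point matching rate r_fp clears the threshold; otherwise signal
    that a fresh estimate (e.g., via S-CNN) is needed."""
    if r_fp >= r_thresh and location in cache:
        return cache[location]     # reuse historical parameters
    return None                    # fall back to S-CNN estimation

# Hypothetical profiled record for one location on one street.
cache = {("street_7", 42): (0.31, 0.82)}
hit  = lfpt_lookup(cache, ("street_7", 42), r_fp=0.91)   # reuse succeeds
miss = lfpt_lookup(cache, ("street_7", 42), r_fp=0.60)   # below threshold
```

This captures the two scenarios discussed above: the slight front/back shift (high R_FP, cache hit) and the junction turn (low R_FP, forcing re-estimation).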

Effectiveness of S-CNN.
To verify the proposed S-CNN approach, in the Carla simulator we collect images captured while traversing straight sections of 10 different streets, as well as 10 street junctions, as illustrated in Figure 16. For each street, we randomly collect 14.4K images with different densities of pedestrians and vehicles. We divide the 14.4K images into a training set of 12K images and a test set of 2.4K images. The output layer used to predict the model coefficients is divided into 50 intervals, and the number of training iterations is 4,000. To train and test the universal CNN, we capture 500K images with different densities of pedestrians and vehicles over the whole simulated town. In this dataset, the portions of straight sections and junctions are even. We use 492.5K images to train and 7.5K images to test the performance of the universal CNN approach. Tables 3 and 4 show the MSE and accuracy results of employing the S-CNN and universal CNN methods for estimating the parameters of the exponential quality models. Several findings are detailed here: (1) The accuracy of predicting the junction images (T_x) using S-CNN is lower compared to predicting the straight section images (S_x). Specifically, for the average-10 (T-avg10 and S-avg10) results, the prediction accuracy of T-avg10 is 17.2% and 14.1% less than that of S-avg10 for a_i and b_i, respectively. The MSE data agree with this conclusion, indicating that S-CNN works well on straight sections but is less effective when the vehicle approaches the junctions. This is expected, since the junction image sets contain nearly half of the images taken from the street after turning around the junction. That street is not the target of the S-CNN trained specifically for the street before the junction.
Table 3. Comparison of MSE Using S-CNN, Regression, and Universal CNN Approaches
(2) After adopting the hybrid dataset that consists of straight sections and junction images, the average accuracy of the universal CNN is 11.4% less than the S-CNN on straight section-only scenarios, but 4.26% higher than the S-CNN on junction-only scenarios.
Regression-based methods have also been well recognized for parameter prediction. Thus, we design experiments to evaluate the prediction accuracy of the regression-based methods compared to S-CNN. We define two types of predictors related to the stereo matching workload: (1) the number of matched feature points and (2) the Bhattacharyya Coefficient (B.C.) [29], which measures the similarity of two images represented by histograms of the number of pixels falling into each greyscale range. Four algorithms are employed to obtain the number of matched feature points, namely SURF, SIFT, ORB, and AKAZE, all available from the OpenCV library [3]. We use 10K images for regression model training. As shown in Table 5, Fourier regression with the B.C. predictor exhibits the best MSE. We supply the Fourier regression results into Tables 3 and 4 for comparison with the CNN-based approaches. Results show that the accuracy of the regression-based method is on average 69.6% lower than that of the S-CNN-based one.
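For illustration, the Bhattacharyya Coefficient on greyscale histograms can be computed as below; the 32-bin discretization is an assumed choice, as the paper does not specify the bin count:

```python
import numpy as np

def bhattacharyya_coefficient(img_a, img_b, bins=32):
    """Similarity of two greyscale images via normalized intensity histograms:
    BC = sum_i sqrt(p_i * q_i), which is 1 for identical distributions and
    approaches 0 for disjoint ones."""
    p, _ = np.histogram(img_a, bins=bins, range=(0, 256))
    q, _ = np.histogram(img_b, bins=bins, range=(0, 256))
    p = p / p.sum()   # normalize counts to probability distributions
    q = q / q.sum()
    return float(np.sum(np.sqrt(p * q)))
```

Because the coefficient depends only on intensity distributions, it is cheap to evaluate per frame, which is what makes it attractive as a regression predictor.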

Timing Analysis.
The proposed framework consists of several important components, including the ASM (ELAS) algorithm, the CNN for parameter inference, and the DVFS algorithm. To assess its feasibility on realistic on-vehicle systems, we deploy the three components on an FPGA for fast validation, using a Xilinx UltraScale+ ZCU102 with the SDSoC framework. We randomly select 10 sample images of size 1275×375 captured from Carla. In addition, we use the same window sizes for the ELAS algorithms, i.e., 9×9 and 5×5, respectively, and set the number of support points to 32 [31]. To facilitate comparison, the clock frequency of all the accelerated parts is set to 300 MHz. The QP core is the central functionality of the DVFS algorithm. We adopt the reference implementation of the QP solver from Reference [16], while the external control logic is implemented using SDSoC. The reference FPGA design of the S-CNN is adopted from Reference [46]. In Table 6, the first row shows the execution time of S-CNN inference for the corresponding parameters. The second row shows the execution time of the 10 randomly selected ASM tasks. The remaining rows show the execution time of the DVFS and ED approaches to decide the optimal cycles, as well as their corresponding quality improvement (e.g., Quality× = 1.68 indicates 1.68 times quality improvement). The time of DVFS and S-CNN inference is negligible compared to the ELAS algorithm. Compared to the ED approach, where the energy constraint is simplified, the average quality improvement of our DVFS method is about 2.03×, incurring only 2.9% more time consumption than ED.

CONCLUSION
In this article, we convert a stereo matching algorithm into an ASM that can be used on smart vehicle platforms, derive an efficient DVFS algorithm to maximize the ASM's output quality under timing, energy, and thermal constraints, and devise the L-FPT and S-CNN methods to accurately estimate the parameters of the ASM quality function. This work is the first of its kind to study application-level adaptability for smart vehicular systems. Results show that our approach achieves at least 1.61 times quality improvement compared to contemporary techniques, and the average parameter estimation achieves 96.35% accuracy on the straight road. In the future, we will integrate the FPGA-based system onto our driverless car prototype and evaluate the system under realistic conditions.