Intelligent Handover Triggering Mechanism in 5G Ultra-Dense Networks Via Clustering-Based Reinforcement Learning

Ultra-dense networks (UDNs) are considered a key 5G technology. They provide mobile users with a high transmission rate and efficient radio resource management. However, UDNs rely on the dense deployment of small base stations (BSs), which can cause stronger interference and subsequently increase the complexity of handover management. At present, the conventional handover triggering mechanism of user equipment (UE) is designed only for macro mobility and can therefore cause frequent handovers, ping-pong handovers, and handover failures when applied to UE in UDNs. These effects degrade the overall network performance. In addition, a massive number of BSs significantly increases the workload of the network maintenance system. To address these issues, this paper proposes an intelligent handover triggering mechanism for UE based on a Q-learning framework and subtractive clustering. The input metrics are first converted to state vectors by subtractive clustering, which improves the efficiency and effectiveness of the training process. Afterward, the Q-learning framework learns the optimal handover triggering policy from the environment. The trained Q table is deployed to the UE to trigger the handover process. The simulation results demonstrate that the proposed method improves the mobility robustness of UE by 60%–90% compared with the conventional approach in terms of the number of handovers, ping-pong handover rate, and handover failure rate, while maintaining the other key performance indicators (KPIs), that is, throughput and network latency, at a relatively high level. In addition, through integration with subtractive clustering, the proposed mechanism is further improved by an average of 20% in terms of all the evaluated KPIs.


Introduction
To meet the increasing demand for mobile data traffic and efficient data delivery, ultra-dense networks (UDNs) have been introduced in the fifth-generation mobile communications system (5G). UDNs involve the close deployment of small base stations (BSs) at traffic hotspots. With this approach, data traffic is mainly delivered by small BSs, which can significantly increase the system capacity, spectrum efficiency, throughput, and coverage and provide ubiquitous access for user equipment (UE) [1]. When a UE moves across the coverage areas of small BSs, the handover process needs to be performed to maintain the UE's data delivery. As defined by the Third-Generation Partnership Project (3GPP) [2], the handover process is triggered by the A3 event measured by the UE. The A3 event occurs when the reference signal received power (RSRP) of a neighbouring cell exceeds that of the UE's serving cell by more than a pre-determined margin, the handover hysteresis margin (HHM). When the entering condition of the A3 event is met, the UE waits for a pre-defined period, that is, the time to trigger (TTT). Subsequently, if the A3 event entering condition remains satisfied, the UE reports the A3 event to its serving base station (BS), and the handover process is then executed over the Xn interface of the BS. The UE connection subsequently switches to the neighbouring cell with the strongest RSRP.
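The A3 triggering rule described above can be illustrated with a minimal sketch. It assumes that RSRP is sampled at a fixed period and that the TTT is expressed as a number of samples; the function name `a3_trigger_index` and its parameters are hypothetical and are not taken from the 3GPP specification.

```python
from typing import Optional, Sequence


def a3_trigger_index(serving_rsrp: Sequence[float],
                     neighbour_rsrp: Sequence[float],
                     hhm_db: float,
                     ttt_samples: int) -> Optional[int]:
    """Return the sample index at which the A3 event is reported, or None.

    The entering condition is "neighbour RSRP exceeds serving RSRP by more
    than the HHM". The event is reported only if the condition keeps holding
    for `ttt_samples` further samples after it was first met.
    """
    held = 0  # consecutive samples for which the entering condition has held
    for k, (serving, neighbour) in enumerate(zip(serving_rsrp, neighbour_rsrp)):
        if neighbour - serving > hhm_db:
            held += 1
            if held > ttt_samples:
                return k
        else:
            held = 0  # condition broken: the TTT timer restarts
    return None


# A sustained advantage triggers a report; a brief fluctuation does not.
serving = [-90.0] * 6
print(a3_trigger_index(serving, [-95, -86, -85, -84, -95, -80], 3.0, 2))
print(a3_trigger_index(serving, [-95, -86, -95, -86, -95, -86], 3.0, 2))
```

With a short TTT or a small HHM the rule reacts to brief RSRP fluctuations, which is the source of the frequent and ping-pong handovers discussed next.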
On the other hand, UDNs also increase the complexity of cellular networks and introduce new challenges to handover management [3]. Traditionally, the A3 event was designed as the handover triggering mechanism for UE in macro BS systems. Therefore, the A3 event may face the following three challenges within 5G-UDNs. First, because the coverage area of a small BS is much smaller than that of a macro BS, UE reaches the cell edge more frequently and can have many more neighbouring cells as potential handover targets. In this situation, the entering condition of the A3 event is easily satisfied, and the handover process is frequently triggered even by short physical movements [4]. Because the handover process can interrupt the UE's serving link before its connection is transferred to the target cell, frequent handovers can also overload core network signalling, diminish system capacity, and degrade overall system performance [5]. Second, the decision-making process is easily affected by interference, which causes handovers to occur repeatedly between the serving and target cells (known as the ping-pong effect). As small BSs are deployed more densely and closer to each other, inter-cell interference becomes much stronger and causes much larger fluctuations in the RSRP, which further worsens this problem. Third, the A3 event requires the handover parameters, that is, HHM and TTT, to be adjusted to avoid frequent handovers, ping-pong effects, and handover failures. To achieve this target, the network operator needs to frequently conduct extensive measurement activities and data analysis to determine suitable handover parameters [6]. With the increasing deployment of BSs, the network maintenance workload and complexity also increase significantly. Therefore, with A3 events, it is unrealistic to keep HHM and TTT at optimal levels and thereby maintain a high level of network performance at all times.
Based on the analysis above, simply implementing the current A3 event in 5G-UDNs can lead to system performance degradation. To overcome the limitations of the A3 event and increase the mobility robustness of UE in 5G-UDNs, a novel handover triggering mechanism that can trigger the handover process precisely at a low maintenance cost needs to be investigated. In this study, we combine the advantages of Q-learning and subtractive clustering to develop an intelligent handover triggering mechanism for UE in 5G-UDNs. The main contributions of this study are summarised as follows:
First, we develop an instant handover triggering mechanism that can trigger the handover process precisely based on multiple decision criteria. The proposed mechanism aims to enhance user mobility robustness while maintaining the other key performance indicators (KPIs) at a high level.
Second, we propose a Q-learning framework to achieve an optimal handover triggering policy by considering multiple network metrics, that is, the RSRP, signal to interference and noise ratio (SINR), and transmission distance. The trained Q table is utilised as the triggering mechanism of the UE to intelligently decide the optimal triggering time without additional handover conditions.
Third, this study utilises the subtractive clustering technique to generate state vectors from historical data. Using this method, the input metrics are systematically categorised into corresponding states with respect to the actual data distribution, which can improve the accuracy and effectiveness of the trained Q table. This categorisation method can effectively process multiple fluctuating network metrics and minimise the impact of noise and interference on handover triggering decisions.
To the best of our knowledge, this is the first study to directly apply Q-learning to make handover triggering decisions in 5G-UDNs rather than to optimise handover parameters. This is also the first study to utilise subtractive clustering to optimise the Q-learning framework for handover decision-making.
The rest of this article is organised as follows: Section 2 reviews some existing studies that are related to this paper. Section 3 introduces the channel model and performance metrics used in this study. The detailed proposed method is described in Section 4. The proposed triggering mechanism is compared with the other existing handover triggering mechanisms to evaluate its performance. The simulation designs and results are shown in Section 5. The study's main conclusion is summarised in Section 6.

Threshold comparison based handover triggering optimisation methods
To address the negative handover effects caused by A3 events, different handover management algorithms have been proposed in the literature. One way to optimise the handover triggering mechanism is to adjust the handover-related parameters, that is, HHM and TTT, adaptively using different algorithms. References [7][8][9] reported handover optimisation methods based on threshold comparisons with specific metrics. Reference [7] proposed a handover parameter optimisation method to enhance mobility robustness across small cells. The proposed method adopts a threshold to classify categories of handover failures and then updates the handover parameters according to the dominant failures. To avoid handover failures due to radio link failures, Reference [8] developed a novel distributed auto-tuning algorithm based on metaheuristic algorithms that can automatically update HHM and TTT on the basis of user speed, RSRP, and SINR. In Reference [9], the authors integrated fuzzy logic into the conventional handover decision to dynamically adjust HHM and TTT. The signal levels from both serving and target cells were used as the input of a fuzzy inference engine to generate the adjusted margin as output. The simulation results in References [7][8][9] showed that, compared with the traditional method, the proposed techniques significantly reduce the number of handovers, ping-pong effect, handover failure rate, and radio link failure rate. However, some essential parameters of these algorithms, that is, the thresholds, fuzzy rules, and fuzzy membership functions, rely heavily on human experience to define, which makes them difficult to apply in practical situations.

Reinforcement based handover triggering optimisation methods
Reinforcement learning and deep learning have demonstrated powerful learning, decision-making, and inference capabilities in many applications, such as [10][11][12][13][14]. As such, these techniques can be considered effective ways to enable intelligent handover management. References [15][16][17] described reinforcement learning-based handover parameter optimisation methods. In Reference [15], the authors proposed a handover parameter tuning method that effectively detects handover events and minimises false handover triggers. To achieve optimal handover performance, the proposed method can also self-tune the handover decision parameters by defining two state variables for the Markov decision process, that is, handover decision parameters and radio state. In Reference [16], a Q-learning-based framework that can adjust HHM and TTT according to the UE speed was proposed. A multiple attribute decision-making method is then applied to choose the most suitable cell as the handover target. Reference [17] also proposed a Q-learning-based mobility robustness optimisation method to learn the most suitable HHM and TTT for different UE speeds. The simulation results in References [15][16][17] showed that the three proposed methods can reduce the call drop rate, handover failure rate, and ping-pong handover ratio for high-speed UE compared with the traditional approach. However, the reinforcement learning frameworks in References [15][16][17] cannot define a large-scale state and action space; otherwise, the training process becomes inefficient. To scale down the size of the state-action space, References [15][16][17] categorised the input metrics, for example, the speed and RSRP, into ranges of equal length to form the state vectors. This categorisation method lacks a systematic methodology to reflect the actual data distribution in the state vectors and thus could potentially affect the effectiveness and accuracy of the generated Q table.
These studies indicated that adjusting the HHM and TTT can effectively improve handover performance. However, the presence of HHM and TTT causes the triggering process to become not instantaneous, as the handover process is only triggered after these two pre-determined conditions. The coverage area of small BSs in UDNs is much smaller than in macro BSs. Small BSs only leave a very short time for triggering mechanisms to react and then execute the subsequent process. In this condition, an instant handover triggering mechanism can ensure the reliability and seamlessness of communications. In addition to optimising handover parameters for triggering mechanisms, some studies developed instant handover triggering mechanisms to directly trigger the handover process without any additional handover parameters and conditions.

Fuzzy logic based instant handover triggering mechanisms
In References [18,28], a triggering threshold called the handover factor, which is generated by fuzzy logic, was used to minimise the number of handovers. The RSRP, SINR, and user speed are input into the fuzzy inference system. The input metrics are processed by a group of pre-defined fuzzy membership functions and fuzzy rules to generate the output handover factor. The handover factor ranges between 0 and 1, with 1 denoting that the probability of handover occurrence is very high and 0 denoting that it is the lowest. The simulation results in References [18,28] showed that the handover factor can minimise unnecessary handovers and ping-pong effects. However, these studies did not further discuss how to define an optimal membership function for each decision metric. Therefore, the reliability of these methods cannot be ensured when the application scenario changes.

Intelligent handover triggering mechanisms
In Reference [19], the authors proposed an adaptive fuzzy logic-based handover triggering method. The fuzzy membership functions and rules were first generated by subtraction from historical data and then tuned to the optimal level for different application scenarios by neural networks. Compared with conventional fuzzy logic-based handover triggering mechanisms and other traditional approaches, the algorithm proposed in Reference [19] provided a significant improvement in handover performance in terms of mobility robustness and mobility load balancing. However, the approach in Reference [19] was unable to process many metrics as input parameters; otherwise, the whole system becomes complicated and the training process is time-consuming. In Reference [20], the authors adopted model-free asynchronous advantage actor-critic (A3C) reinforcement learning to learn an optimal handover method. Each network user acts as a local agent that interacts with the environment and learns a local handover policy. The local handover policy in each UE is then uploaded and integrated into the global handover policy at the global controller. Each UE regularly copies the up-to-date handover policy from the global controller to trigger the handover process and supervise the subsequent learning process when controller updates are required. The simulation results in Reference [20] demonstrated that the proposed method can achieve better performance than existing online techniques in terms of handover rates. However, other KPIs, that is, the ping-pong handover rate, handover failure rate, and throughput, were not evaluated in that paper. In addition, because of the limited computation power of UE, it is impractical to utilise UE as a training agent in the A3C framework.
According to the aforementioned analyses, Q-learning demonstrates powerful capabilities in handover triggering solutions. However, all of the Q-learning-based approaches focus on optimising handover parameters rather than on instant triggering mechanisms, and all of them lack a systematic methodology to convert radio conditions, that is, the RSRP and SINR, into state vectors. As shown in Reference [21], subtractive clustering can categorise data into corresponding groups based on their distribution. This may provide a solution to define proper state vectors for Q-learning frameworks in handover decision-making.

System model
In this study, we adopt two-tiered UDNs that consist of LTE-Advanced and 5G networks. This two-tiered structure has been widely used in previous studies such as [20,26]. Figure 1 presents an example of the network deployment proposed in this study. The LTE-Advanced network, consisting of $N_m$ macro BSs, operates in a 5 GHz frequency band. The 5G network, consisting of $N_s$ small BSs, operates in millimetre-wave bands. There are $N_{ue}$ UEs moving randomly within the deployed area with a constant velocity $V_{ue}$. Each UE is associated with one macro or small BS to exchange signalling. During its movement, the UE periodically collects handover-related metrics from neighbouring BSs, that is, the RSRP, RSRQ, and SINR, and reports them to its serving base station. The proposed triggering mechanism is deployed at the UE to determine the optimal handover timing, which is then reported to the serving BS for handover execution.

Channel model
According to Reference [22], large-scale channel models for macro BSs (Eq. (1a)) and small BSs (Eq. (1b)) in urban areas are adopted to define the path loss $PL_{ij}$ between UE $i$ and BS $j$. In Eqs. (1a) and (1b), $d_{i,j}$ represents the transmission distance between UE $i$ and BS $j$ calculated by Eq. (1c), where $(x_i, y_i)$ and $(x_j, y_j)$ are the coordinates of UE $i$ and BS $j$; $f$ denotes the carrier frequency for small and macro BSs; and $\chi$ is the interference and noise modelled by Gaussian and Rayleigh random variables. The RSRP of UE $i$ is then calculated by subtracting $PL_{ij}$ from the cell reference signal power of BS $j$.
According to Reference [23], the SINR from BS $j$ to UE $i$ is formulated in Eq. (2), where $P_j$ and $P_o$ represent the transmission power of the UE's serving BS and neighbouring BSs, respectively, and $d_{ij}$ and $d_{io}$ represent the distances from the UE to its serving and neighbouring BSs. $P_n$ is the power spectral density of the background noise, and $n_m$ represents the number of BSs around the UE.

System measurements
Fig. 1 Two-tiered system model
Several KPIs are used in this study to quantify the system performance under different handover triggering mechanisms. The first KPI is the average number of handovers per UE ($\overline{NOH}$), which is the essential parameter to quantify the handover frequency over the entire simulation. The average number of handovers per UE is formulated as $\overline{NOH} = \frac{1}{N_{ue}} \sum_{i=1}^{N_{ue}} NOH_i$, where $NOH_i$ is the number of handovers of UE $i$ and $N_{ue}$ is the total number of UEs in the environment. The second KPI is the probability of ping-pong handovers (PPHOs), which is used to quantify unnecessary handovers between two BSs. A PPHO is counted when a UE is handed over back and forth between the target cell and the presently serving cell within a certain interval $T_p$. Thus, the average PPHO probability is calculated as the ratio $N_{PPHO} / N_{HO}$, where $N_{PPHO}$ is the number of PPHOs and $N_{HO}$ is the number of handovers that occur during the entire simulation. The third KPI is the probability of handover failures (HOFs). According to the analysis in Reference [24], the handover process may fail if it is triggered too early or too late. Under these two conditions, the UE may move out of the coverage area of the target cell or the serving cell, which leads to radio link failure before the handover is completely established. Moreover, HOFs may also occur when the UE is handed over to the wrong cell, in which case the target cell does not have sufficient resources to maintain the UE connection. Therefore, the probability of HOF, formulated in Eq. (5), is a key parameter to evaluate the reliability of the proposed handover triggering mechanism. The fourth KPI is the average UE throughput over the entire simulation, which reflects the quality of the network service. According to Reference [25], the system throughput is calculated using Shannon's capacity theorem.
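The three mobility KPIs defined above (NOH, PPHO, and HOF) reduce to counting events in a handover log, as the following minimal sketch illustrates. The log format (time, source BS, target BS) and the function names are hypothetical, and treating the HOF probability as the ratio of failed handovers to all handovers is an assumption made by analogy with the PPHO ratio.

```python
from typing import Sequence, Tuple

HandoverRecord = Tuple[float, int, int]  # (time in s, source BS id, target BS id)


def average_noh(noh_per_ue: Sequence[int]) -> float:
    """Average number of handovers per UE."""
    return sum(noh_per_ue) / len(noh_per_ue)


def count_ping_pongs(log: Sequence[HandoverRecord], t_p: float) -> int:
    """Count handovers that return a UE to its previous BS within t_p seconds.

    `log` holds the handovers of a single UE in chronological order.
    """
    count = 0
    for (t0, src0, dst0), (t1, src1, dst1) in zip(log, log[1:]):
        bounced_back = src1 == dst0 and dst1 == src0
        if bounced_back and (t1 - t0) <= t_p:
            count += 1
    return count


def event_ratio(n_events: int, n_handovers: int) -> float:
    """Ratio of an event count (PPHO or, by assumption, HOF) to all handovers."""
    return n_events / n_handovers if n_handovers else 0.0


# One UE: BS 1 -> 2 -> 1 within 1 s is a ping-pong; the later bounce is too slow.
log = [(0.0, 1, 2), (1.0, 2, 1), (10.0, 1, 3), (20.0, 3, 1)]
n_ppho = count_ping_pongs(log, t_p=2.0)
print(n_ppho, event_ratio(n_ppho, len(log)))
```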
A correction factor is adopted in the Shannon capacity formula to account for the inherent implementation losses, that is, the reference symbol loss ($L_{ReferenceSymbol}$) and cyclic prefix loss ($L_{CyclicPrefix}$). The corrected capacity is formulated in Eq. (6), where $\Gamma_{total}$ represents the total throughput gained by the UE in bps; $\gamma_{j,i}$ represents the SINR between UE $i$ and BS $j$ obtained from Eq. (2); $\xi$ is the correction factor; $B$ is the system bandwidth assigned to the UE in Hz; $T_{frame}$ is the duration of one orthogonal frequency division multiple access (OFDMA) frame and equals 10 ms; $T_{CP}$ is the total cyclic prefix time of all OFDMA symbols in a frame, calculated as (5.2 μs + 6 × 4.69 μs) × 20 = 666.8 μs; $N_{SC}$ is the number of subcarriers in a physical resource block (PRB), which is 12 for both macro and small BSs; $N_S$ is the number of OFDMA symbols within a subframe, which is 14 for macro BSs and 28 for small BSs; $N_{rb}$ is the number of PRBs assigned to the UE, which is 100 for macro BSs and 275 for small BSs; and $T_{sub}$ is the duration of an OFDMA subframe, which equals 1 ms for both macro and small BSs. A PRB is the smallest unit of bandwidth that can be assigned, and each PRB can be allocated to only one UE.
The last KPI is the network latency, which directly affects network performance. According to the analysis in References [5,27], this study considers the network latency from BS $j$ to UE $i$ during time $t$, which is denoted as $\hat{\Delta}^{t}_{i,j}$ and formulated as
$$\hat{\Delta}^{t}_{i,j} = \ell_{trans} + \ell_{propa} + \ell_{ho} + \ell_{deal} + \ell_{queue} \qquad (7)$$
where $\ell_{trans}$ is the transmission latency, $\ell_{propa}$ is the propagation latency, $\ell_{ho}$ is the handover latency, $\ell_{deal}$ is the packet handling latency, and $\ell_{queue}$ is the queuing latency. Because $\ell_{deal}$ and $\ell_{queue}$ are much shorter than $\ell_{trans}$ and $\ell_{propa}$, the last two items in Eq. (7) can be omitted, and Eq. (7) is rewritten as Eq. (8). The first item in Eq. (8) calculates $\ell_{trans}$, where $\Theta$ represents the transmitted packet size, which is 100 kbit in this study, and $r_i$ is the throughput of UE $i$.
The second item in Eq. (8) obtains $\ell_{propa}$, where $\ell^{max}_{i,j}$ is the maximum propagation latency from BS $j$ to UE $i$, which is assumed to be 20 ms for macro BSs and 10 ms for small BSs; $d_{i,j}$ is the distance between UE $i$ and its serving BS $j$; and $d_y$ represents the maximum transmission distance from BS $j$ to UE $i$. $\ell_{ho}$ is assumed to be 20 ms based on our measurements from real environments.
Figure 2 demonstrates the proposed framework based on Q-learning and subtractive clustering. During its movement, the UE frequently collects handover-related metrics, that is, the RSRP, SINR, and transmission distance ($d$), from its serving and neighbouring BSs. These data are stored in a database as historical data. During the training stage, the historical data are used to build the Q-learning framework, which is explained in detail in Section 4.1. To increase training efficiency, subtractive clustering is adopted to locate the clusters of each input metric and to categorise the input metrics into state vectors (Section 4.2). The trained Q table of the framework is used as the triggering mechanism that enables the UE to select the optimal handover timing and report it to the BS for handover execution. The detailed handover process under the proposed triggering mechanism is described in Section 4.3.

Q-learning framework for handover in 5G-UDNs
Q-learning is a model-free, off-policy reinforcement learning algorithm that learns the optimal policy for a Markov decision process. The Q-learning framework consists of an agent and a triple <S, A, R>, where $S$ and $A$ represent the sets of all possible states and actions, respectively, and $R$ is the reward function. When the environment is in state $s_t \in S$ at time step $t$, an action $a_t \in A$ is executed by the agent. The environment subsequently transitions from $s_t$ to $s_{t+1} \in S$, and an immediate reward $r_t \in R$ is received by the agent. The main target of the agent is to learn from the environment the optimal action for each state (the policy) that maximises the accumulated reward.
In this study, the environment refers to the UDNs in a specific area $l$, and the triple <S, A, R> in this framework is defined as follows.
Action A: At time step $t$ and area $l$, the action $a_{i,l,t} \in A$ for UE $i$ is either to execute the handover process or to maintain the UE connection. If the agent decides to execute the handover process, the UE link switches to a new BS at time $t + 1$. Otherwise, the UE maintains its link to the previous serving BS.
State S: The input metrics, that is, the RSRP, SINR, and transmission distance ($d$), are first normalised between 0 and 1. The normalised value of each metric is then mapped into the corresponding cluster to find its cluster index. The states are represented by the combination of the cluster indices. At time step $t$ and area $l$, the state $s_{i,l,t}$ for UE $i$ is
$$s_{i,l,t} = \left( \tilde{c}^{\,i,l,t}_{k,RSRP},\; \tilde{c}^{\,i,l,t}_{k,SINR},\; \tilde{c}^{\,i,l,t}_{k,d} \right) \qquad (9)$$
where $s_{i,l,t} \in S$ and $\tilde{c}^{\,i,l,t}_{k,RSRP}$, $\tilde{c}^{\,i,l,t}_{k,SINR}$, and $\tilde{c}^{\,i,l,t}_{k,d}$ indicate that the input data at time $t$ and area $l$ belong to the $k$-th cluster of the RSRP, SINR, and $d$, respectively. If the handover process is executed by the agent at time $t$, the state at $t + 1$ is updated based on the RSRP, SINR, and $d$ of the new serving BS. Otherwise, the state at $t + 1$ is updated based on the input metrics of the current BS.
Reward R: The sum of the centre values of the corresponding clusters is utilised as the reward value at each time step $t$. At time step $t$ and area $l$, after the agent executes an action $a_{i,l,t}$, the reward $r_{i,l,t} \in R$ for UE $i$ is the sum of the centre values of the clusters to which the current RSRP, SINR, and $d$ belong. If the handover process is executed by the agent at time $t$, the reward signal is obtained from the new serving BS. Otherwise, the reward signal is obtained from the current serving BS.
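The state and reward definitions above can be sketched as follows. This is only an illustration under stated assumptions: a normalised value is assigned to the cluster with the nearest centre (the text does not specify the mapping rule), and the metric ranges and cluster centres in the usage example are made-up values rather than those obtained in this study.

```python
import numpy as np

METRICS = ("RSRP", "SINR", "d")


def normalise(value, lo, hi):
    """Min-max normalise a raw metric into [0, 1] (range limits are assumed)."""
    return float(np.clip((value - lo) / (hi - lo), 0.0, 1.0))


def nearest_cluster(value, centres):
    """Index of the cluster whose centre is closest to the normalised value."""
    centres = np.asarray(centres, dtype=float)
    return int(np.argmin(np.abs(centres - value)))


def state_and_reward(metrics, ranges, centres):
    """Map raw metrics to a state (tuple of cluster indices) and a reward
    (sum of the centre values of the selected clusters)."""
    state, reward = [], 0.0
    for name in METRICS:
        x = normalise(metrics[name], *ranges[name])
        k = nearest_cluster(x, centres[name])
        state.append(k)
        reward += centres[name][k]
    return tuple(state), reward


# Hypothetical ranges and cluster centres, for illustration only.
ranges = {"RSRP": (-140.0, -80.0), "SINR": (-10.0, 30.0), "d": (0.0, 500.0)}
centres = {"RSRP": [0.2, 0.5, 0.8], "SINR": [0.25, 0.75], "d": [0.1, 0.5, 0.9]}
print(state_and_reward({"RSRP": -95.0, "SINR": 20.0, "d": 60.0}, ranges, centres))
```

Because the state is a tuple of small integers, it can index a Q table directly, which keeps the table size equal to the product of the cluster counts times the number of actions.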
After the framework is established, the Q-learning agent updates its value function, also known as the Q table, over several epochs. Assuming that the agent chooses actions based on policy $\pi$, the Q table stores, for every state-action pair, the expected total discounted reward received when starting with action $a$ in state $s$ and following $\pi$. For the optimal policy $\pi^*$, $Q^{\pi^*}$ is formulated as
$$Q^{\pi^*}(s_t, a_t) = \mathbb{E}\left[ r_t + \gamma \max_{a} Q^{\pi^*}(s_{t+1}, a) \right]$$
where $\gamma \in (0, 1)$ is adopted as a discount factor to balance immediate and future rewards. During the learning stage, the agent estimates the Q value from the received rewards using the temporal difference (TD) error, that is, the difference between the actual Q value $Q(s_t, a_t)$ and its currently estimated value $\hat{Q}(s_t, a_t)$. Therefore, the Q value at time $t + 1$, $\hat{Q}_{t+1}(s_t, a_t)$, is updated by adding a discounted TD error to the current estimate $\hat{Q}_t(s_t, a_t)$ as
$$\hat{Q}_{t+1}(s_t, a_t) = \hat{Q}_t(s_t, a_t) + \eta \left[ r_t + \gamma \max_{a} \hat{Q}_t(s_{t+1}, a) - \hat{Q}_t(s_t, a_t) \right]$$
where $\eta \in (0, 1)$ is the learning rate that balances new and old information. In the limiting cases, when $\eta = 0$, all of the new information is abandoned and the Q value is no longer updated; when $\eta = 1$, all of the old information is discarded and the Q value is updated entirely from the latest information.
Fig. 2 Subtractive clustering-based Q-learning framework for handover triggering
In this study, each epoch comprises 10,000 simulation time steps and is equivalent to 2.7 h of actual network time. For each epoch, the accumulated reward ($R_e$) is calculated, and the training stage is terminated when the accumulated reward has converged. To achieve optimum Q values, the ϵ-greedy strategy is adopted to facilitate a trade-off between exploration and exploitation of the state-action pairs. With ϵ-greedy, at each time step $t$, the agent performs the action with the maximum Q value, that is, $a_i = \arg\max_{a} Q^{*}(s_i, a)$, with probability 1 − ϵ; otherwise, it takes a random action. In the initial training phase, ϵ is set close to 1, and it gradually decreases as the number of epochs increases. The Q-learning-related parameters in this study are γ = 0.9 and η = 0.1, with ϵ decreasing from 0.9 to 0.1.
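The update rule and the ϵ-greedy selection described in this section can be sketched in a few lines. The values γ = 0.9, η = 0.1, and the ϵ range 0.9 to 0.1 are taken from the text; the linear shape of the ϵ schedule, the Q table layout (one axis per metric cluster index plus a final action axis), and all function names are assumptions made for this illustration.

```python
import numpy as np

GAMMA, ETA = 0.9, 0.1          # discount factor and learning rate from the text
EPS_START, EPS_END = 0.9, 0.1  # exploration range from the text


def epsilon_for_epoch(epoch, n_epochs):
    """Linear decay from EPS_START to EPS_END (the schedule shape is assumed)."""
    if n_epochs <= 1:
        return EPS_END
    frac = epoch / (n_epochs - 1)
    return EPS_START + frac * (EPS_END - EPS_START)


def choose_action(q_table, state, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[-1]))
    return int(np.argmax(q_table[state]))


def q_update(q_table, state, action, reward, next_state):
    """Apply one tabular Q-learning update in place and return the TD error."""
    td_target = reward + GAMMA * np.max(q_table[next_state])
    td_error = td_target - q_table[state + (action,)]
    q_table[state + (action,)] += ETA * td_error
    return td_error


# Q table indexed by (RSRP cluster, SINR cluster, d cluster, action).
q = np.zeros((2, 2, 2, 2))
rng = np.random.default_rng(0)
a = choose_action(q, (0, 1, 0), epsilon_for_epoch(0, 100), rng)
q_update(q, (0, 1, 0), a, reward=1.0, next_state=(1, 1, 1))
```

Since the states are tuples of cluster indices, `q_table[state]` returns the vector of action values for that state, so no separate state-encoding step is needed.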

Subtractive clustering
To improve the training efficiency and obtain a small Q table, it is necessary to reduce the scale of the state-action pairs. The traditional method is to categorise the related metrics into several equal-length states. However, this categorisation method cannot reflect the actual characteristics of the input metrics. For example, the RSRP is divided into five equal length states −20 to −50 dB, −50 to −80 dB, −80 to −110 dB, −110 to −130 dB, and −130 to −160 dB based on the traditional categorisation method, whereas the actual data distribution of the RSRP is concentrated between −80 and −140 dB. Therefore, the states need to concentrate at −80 to −140 dB, rather than at the other intervals, to ensure the accuracy and effectiveness of the training results. In this study, we introduce a more systematic subtractive clustering technique to categorise the handover metrics into corresponding states based on the data distribution. Categorising the input metrics into clusters effectively processes uncertain and imprecise data and minimises the effect of interference and noise on decision-making.
Consider $m$ input metrics, each with $n$ data points. Three input metrics are used for subtractive clustering in this study, and each metric has 5000 data points; hence, $n = 5000$ and $m = 3$. A data point has a high potential value if it has many neighbouring points. The potential value of each data point $\vec{x}_i$ is evaluated as
$$P_i = \sum_{j=1}^{n} \exp\left( -\alpha \left\| \vec{x}_i - \vec{x}_j \right\|^2 \right) \qquad (15a)$$
where $\alpha$ is determined in Eq. (15b) by $r_a$, which defines the effective radius of a neighbourhood. The data outside $r_a$ have only a limited influence on $P_i$. After $P_i$ is calculated for each data point, the point with the highest $P_i$ is located as the first cluster centre. $P_i$ for the rest of the data points is then revised based on the potential $P_1^{*}$ of the first cluster centre $\vec{x}_1^{*}$ as
$$P_i \leftarrow P_i - P_1^{*} \exp\left( -\beta \left\| \vec{x}_i - \vec{x}_1^{*} \right\|^2 \right)$$
That is, $P_i$ of the rest of the data is discounted by a term that depends on the distance between each data point and the first cluster centre $\vec{x}_1^{*}$. According to this function, the data points near the first cluster centre are unlikely to be selected as a new cluster centre because their $P_i$ is significantly discounted. Next, the point with the highest revised $P_i$ is located as the new cluster centre. $P_i$ of the rest of the points continues to decrease as new centres are found. When the $k$th cluster centre is located, $P_i$ of the rest of the data points is updated as
$$P_i \leftarrow P_i - P_k^{*} \exp\left( -\beta \left\| \vec{x}_i - \vec{x}_k^{*} \right\|^2 \right)$$
where $\vec{x}_k^{*}$ is the $k$th cluster centre with potential value $P_k^{*}$. New cluster centres continue to be found until $P_k^{*} < \varepsilon P_1^{*}$, where $\varepsilon$ is the rejection ratio. The distance between the cluster centres is controlled by $\beta$.
If $k$ clusters are located for input metric $m$, the set of clusters is represented by $C_m = (c_{1m}, c_{2m}, \ldots, c_{km})$. Similarly, the set of cluster centres for metric $m$ is denoted as $\vec{x}_m^{*} = (x_{1m}^{*}, x_{2m}^{*}, \ldots, x_{km}^{*})$. The parameters related to subtractive clustering in this study are set as {α = 16, β = 12, and ε = 0.005}.
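The procedure described in this section can be sketched as below. The sketch follows the standard subtractive clustering scheme (an exponential potential that is reduced around each selected centre) with the stopping rule $P_k^{*} < \varepsilon P_1^{*}$ and the parameter values α = 16, β = 12, and ε = 0.005 stated in the text; the function name, the `max_centres` safety limit, and the synthetic data in the usage example are assumptions for illustration only.

```python
import numpy as np


def subtractive_clustering(x, alpha=16.0, beta=12.0, eps=0.005, max_centres=50):
    """Return cluster centres of normalised data found by subtractive clustering.

    x: array of shape (n,) or (n, dims) with values normalised to [0, 1].
    Note: the pairwise distance matrix needs O(n^2) memory.
    """
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    potential = np.exp(-alpha * sq_dist).sum(axis=1)  # potential of every point
    first_peak = potential.max()
    centres = []
    while len(centres) < max_centres:
        k = int(np.argmax(potential))
        peak = potential[k]
        if peak < eps * first_peak:  # rejection ratio reached: stop searching
            break
        centres.append(x[k].copy())
        # Discount the potential of points close to the newly found centre.
        potential = potential - peak * np.exp(-beta * sq_dist[:, k])
    return np.array(centres)


# Synthetic normalised metric with two dense regions around 0.2 and 0.8.
rng = np.random.default_rng(0)
data = np.concatenate([0.2 + 0.01 * rng.standard_normal(50),
                       0.8 + 0.01 * rng.standard_normal(50)])
centres = subtractive_clustering(data)
print(np.sort(centres[:2].ravel()))  # the two strongest centres
```

The first centres returned are the densest regions of the data, so states built from them concentrate where the metric values actually occur, in contrast to equal-length intervals.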
The subtractive clustering-based Q-learning algorithm is described in Algorithm 1.

Handover triggering using the trained Q table
After the proposed framework learns the optimal handover policy for a specific application scenario $l$, the trained Q table is utilised by the UE as the handover triggering mechanism. During the movement of UE $i$, the measured data, that is, the RSRP, SINR, and $d$ at time $t$, are first converted into the state vector $s_{i,l,t}$ using Eq. (9). The UE then searches the Q table for the action $a_{i,l,t}$ with the maximum accumulated reward for $s_{i,l,t}$. If the optimal action $a_{i,l,t}$ for state $s_{i,l,t}$ is to execute the handover process, then the handover process is triggered by the UE. As shown in Fig. 3, once the handover process is triggered, the UE reports the handover event to its serving BS. Subsequently, the serving BS selects the neighbouring BS with the highest SINR and sends a handover request to it. The radio resource control (RRC) connection is reconfigured after the target BS acknowledges the handover request. In this phase, the connection between the UE and its serving BS is transferred to the target BS, which completes the RRC reconfiguration. If the optimal action $a_{i,l,t}$ for state $s_{i,l,t}$ is to maintain the UE connection, then the UE maintains its connection with the current serving BS.
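The triggering step above amounts to a table lookup followed by a target selection, as in the minimal sketch below. The action encoding (0 for maintain, 1 for handover), the Q table layout, and the BS identifiers are assumptions made for this illustration.

```python
import numpy as np

MAINTAIN, HANDOVER = 0, 1  # action encoding assumed for this sketch


def should_trigger_handover(q_table, state):
    """Look up the greedy action for the current state in the trained Q table."""
    return int(np.argmax(q_table[state])) == HANDOVER


def select_target_bs(neighbour_sinr_db):
    """Serving-BS side: pick the neighbouring BS with the highest SINR."""
    return max(neighbour_sinr_db, key=neighbour_sinr_db.get)


# Hypothetical trained table indexed by (RSRP, SINR, d cluster indices, action).
q = np.zeros((2, 2, 2, 2))
q[1, 0, 1] = [0.2, 0.7]  # in this state, handover has the higher Q value
if should_trigger_handover(q, (1, 0, 1)):
    print(select_target_bs({"bs3": 4.1, "bs7": 9.6, "bs9": 7.0}))
```

No HHM or TTT appears in this decision path, which is what makes the trigger instantaneous once the state has been computed.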

Analysis design
A simulation environment was built using MATLAB to test the mobility robustness of the UE under the proposed triggering mechanism. The environment design is summarised in Table 1. There are 16 small BSs and 2 macro BSs deployed in a 1000 m × 1000 m scenario, and the small BSs are approximately 350 m apart. The macro BSs are deployed on the diagonal of the simulated environment. The 40 UEs move randomly at a speed of 30 km/h in the proposed environment.
The RSRP, SINR, and transmission distance (d) modelled by Eqs. (1) and (2) are implemented as decision criteria for the proposed triggering mechanism. The average number of handovers (NOH) per UE, the probability of PPHO, the handover failure rate, throughput, and network latency calculated by Eqs. (3)–(8) are adopted as KPIs to test the effectiveness of the proposed algorithm, as discussed in Section 2.
A total of 160,000 sets of data for each input metric are collected from the simulation environment to obtain a well-trained Q table. These sets cover all of the physical locations and are collected from all of the BSs deployed in this environment. The final trained Q table is utilised as the proposed handover triggering mechanism to be evaluated. Two comparative approaches are adopted, that is, the traditional A3 event RSRP-based [2] and fuzzy logic-based triggering mechanisms [17,18]. As previously mentioned, the A3 event is based only on a single metric, that is, the RSRP that triggers the handover process. The fuzzy logic-based approach in this study also considers the RSRP, SINR, and d as input metrics. The approach based on Q-learning without clustering techniques (with states of equal lengths) is also adopted as a comparison group to show the advantage of the clustering technique. These four approaches work in the same test environment.
Eq. (18) is utilised to quantify the improvement of each KPI (ΔKPI) under the proposed approach, where KPI_1 is the value of the evaluated KPI under method 1 and KPI_2 is its value under method 2. Each KPI is tested for at least 100 rounds in this study to ensure reliable evaluation results.
Figure 3 shows the training stage of Q-learning and the accumulated rewards in each epoch with and without subtractive clustering. In the same training environment, the Q-learning with clustering approach (black solid line) converges at the 70th epoch, and the Q-learning only approach (green dashed line) converges at the 20th epoch. After convergence, the accumulated reward of Q-learning optimised by subtractive clustering is approximately 3900, and that of the Q-learning only approach is lower. Based on Eq. (18), the approach based on Q-learning with clustering receives approximately 15% more rewards than the Q-learning only approach. The trained Q tables from both methods are utilised as the handover triggering mechanisms of the UE to be evaluated. The performance of these two mechanisms is shown in Figs. 4–9.
The KPIs in Figs. 5–7 are used to evaluate the mobility robustness of the UE based on different triggering mechanisms. According to the results in Figs. 5 and 6, the A3 event RSRP-based triggering mechanism (blue line with circles) has the highest NOH (4250) and PPHO ratio (0.24%), as it depends on only a single metric, the RSRP, to trigger the handover process. The RSRP fluctuates due to noise and interference, which can significantly reduce the accuracy of the triggering decision. The fuzzy logic approach (orange line with triangles) considers multiple metrics as decision criteria and thus has a lower NOH (2500) and PPHO ratio (0.21%) than the RSRP-based approach. Since the Q-learning-based approach has a powerful learning capability and also considers multiple metrics in decision-making, the two Q-learning-based approaches have the best performance. Based on the results in Figs. 5 and 6 and Eq. (18), the Q-learning only approach (green dashed line) reduces the number of handovers by 70%–90% and the PPHO ratio by approximately 60% compared with the RSRP and fuzzy logic-based approaches. Moreover, the adoption of clustering (black solid line) reduces the NOH by a further 22% and the PPHO ratio by a further 20% compared with the Q-learning only approach.
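Eq. (18) itself is not reproduced in this section. Assuming it takes the usual relative-change form ΔKPI = (KPI_1 − KPI_2) / KPI_1 × 100%, the calculation can be sketched in Python as follows. The function name is hypothetical, and the example uses only the two NOH values quoted above for the A3 event RSRP-based and fuzzy logic-based approaches.

```python
def relative_improvement(kpi_1: float, kpi_2: float) -> float:
    """Percentage reduction of kpi_2 relative to kpi_1 (assumed form of Eq. (18))."""
    if kpi_1 == 0:
        raise ValueError("kpi_1 must be non-zero")
    return (kpi_1 - kpi_2) / kpi_1 * 100.0

# NOH of the A3 event RSRP-based (method 1) and fuzzy logic-based (method 2) approaches.
noh_a3, noh_fuzzy = 4250, 2500
print(f"{relative_improvement(noh_a3, noh_fuzzy):.1f}%")  # prints "41.2%"
```

For KPIs where a larger value is better, such as throughput or accumulated reward, a negative result under this convention indicates an improvement of method 2 over method 1.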

Results and analysis of comparison experiments
As indicated in Fig. 7, the A3 event RSRP-based triggering mechanism (blue line with circles) has the second-lowest handover failure rate at 0.5%. This is because the RSRP is a key factor in determining the handover failure rate, and the A3 event RSRP-based approach always switches the UE connection to the neighbouring BS with the highest RSRP. As such, the A3 event RSRP-based approach can ensure a low handover failure rate. The fuzzy logic (orange line with triangles) and Q-learning only (green dashed line) approaches have relatively high handover failure rates of 5% and 1%, respectively. These two approaches consider multiple metrics to trigger the handover and weaken the weight of the RSRP in decision-making. The fuzzy membership functions used in fuzzy logic are not well designed for each input metric, which can cause an input to be converted to the wrong level. The input metrics are also not well categorised into state vectors in the Q-learning only approach, which can degrade the effectiveness of the trained Q table. Therefore, these two factors increase the handover failure rate. However, Q-learning with clustering (black solid line) outperforms the three other approaches, with a nearly zero handover failure rate of 0.1%. Compared with the Q-learning only approach (green dashed line), the adoption of subtractive clustering can reduce the handover failure rate by approximately 75% based on Eq. (18).
The KPIs in Figs. 8 and 9 are used to evaluate the quality of service (QoS). Some existing approaches focus only on the improvement in mobility robustness but result in a degradation of other aspects, such as load balancing and QoS. Thus, the objective of this study is to increase the mobility robustness while maintaining the other KPIs at relatively high levels. Figure 8 shows the network latency under the different handover triggering mechanisms. The A3 event RSRP-based and fuzzy logic-based approaches have relatively high network latencies of 16.7 ms and 15.2 ms, respectively. These two methods lead to many unnecessary handovers, which cause handover latency to accumulate. The Q-learning only triggering mechanism also has a relatively high latency of 14.3 ms. Although the Q-learning only approach has fewer handovers, they primarily occur at the edge of coverage, which can result in a high propagation latency. Q-learning with clustering again outperforms the other three approaches and has the lowest average network latency of 10.1 ms. Compared with the Q-learning only approach, subtractive clustering can further reduce handover latency by approximately 27.3%.
As shown in Fig. 9, the RSRP-based approach has the highest sum throughput because the RSRP is also one of the key factors determining the system throughput. Conversely, fuzzy logic has the lowest throughput, as it considers other metrics during decision-making. The throughput of the two Q-learning-based approaches is slightly lower than that of the traditional methods but remains at a relatively high level. Compared with the Q-learning only approach, subtractive clustering can increase throughput by approximately 9.7%.
Based on the simulation results, the proposed Q-learning with clustering-based approach outperforms the other three approaches in terms of the NOH, PPHO ratio, handover failure rate, and network latency while maintaining a relatively high level of system throughput. This good performance is due to the following advantages. First, the Q-learning framework has a powerful ability to learn the optimal policy from different environments. After obtaining the optimal policy, the trained Q table executes the action with the maximum reward based on the states it faces. Second, because subtractive clustering has a strong ability to process uncertain and imprecise information, the adoption of clustering minimises the effect of noise and interference during decision-making. Moreover, subtractive clustering can locate clusters for each input metric from the historical data. This approach ensures that input metrics can systematically be categorised as state vectors with respect to their actual data distribution. Therefore, the subtractive clustering technique ensures the accuracy and effectiveness of Q-learning to achieve training targets.
Fig. 9 Total throughput of the different triggering mechanisms

Conclusion
In this study, we proposed an intelligent handover triggering mechanism based on the Q-learning and subtractive clustering techniques to address the challenges of handovers in 5G-UDNs. In the proposed framework, Q-learning can learn the optimal triggering policy from different application scenarios. The proposed framework's trained Q table can be used in UE to precisely trigger the handover process based on the RSRP, SINR, and transmission distance. To further enhance the proposed approach's performance, we adopted subtractive clustering to ensure the accuracy and effectiveness of the training process. According to the simulation results, compared with the A3 event RSRP-based and fuzzy logic-based approaches, the proposed solution can effectively reduce the NOH, PPHO ratio, and handover failure rate while maintaining low network latency and a high level of system throughput. The evaluation also indicated that the adoption of subtractive clustering techniques can further enhance the proposed approach's performance by approximately 20% in terms of all of the evaluated KPIs. Moreover, the proposed solution has a low maintenance cost, as it can intelligently trigger the handover process without any additional handover parameters or conditions such as HHM and TTT.
In future work, we will develop an energy-efficient handover mechanism based on machine learning techniques, which should reduce power consumption while retaining the mobility robustness of UE.