Forecasting Tourism Demand with an Improved Mixed Data Sampling Model

Search query data reflect users’ intentions, preferences and interests. The interest in using such data to forecast tourism demand has increased in recent years. The mixed data sampling (MIDAS) method is often used in such forecasting, but is not effective when moving average (MA) dynamics are involved. To investigate the relevance of the MA components in MIDAS models to tourism demand forecasting, an improved MIDAS model that integrates MIDAS and the seasonal autoregressive integrated moving average process is proposed. Its performance is tested by forecasting monthly tourist arrivals in Hong Kong from mainland China with daily composite indices constructed from a large number of search queries using the generalized dynamic factor model. The forecasting results suggest that this new model significantly outperforms the benchmark model. In addition, comparing the forecasts and nowcasts shows that the latter generally outperforms the former.


Introduction
The perishable nature of the tourism industry makes accurately forecasting tourism demand an important task for tourism- and hotel-related decisionmakers. It is impossible to store unfilled airline seats and unsold hotel rooms. Therefore, accurate demand forecasts can help tourism practitioners make business decisions, such as those concerning scheduling, staffing and pricing.
In addition, policymakers in tourist destinations need accurate forecasts to formulate tourism development policies, such as tourism infrastructure investments.
Traditional tourism demand forecasting studies have often used historical tourism demand and macroeconomic data. However, macroeconomic data, such as GDP and CPI, are usually delayed and may take several weeks or months to be published. The rapid development of information technology and the Internet has given rise to massive-scale and readily available data (Kambatla et al. 2014). Such data often reflect users' intentions and can serve as early indicators of various activities. For example, search queries have been used for various forecasting purposes, such as unemployment claims (Choi and Varian 2012), influenza epidemics (Ginsberg et al. 2009) and housing prices and sales (Wu and Brynjolfsson 2015).
Search query data have also gained popularity in forecasting tourism demand. Tourists use search engines to look for travel information on weather, transportation, hotels, attractions, travel guides and other tourists' opinions (Fesenmaier et al. 2011). These web search behaviours are recorded and reflect users' intentions, preferences and interests. Therefore, they can be valuable predictors of tourism demand. Although the use of search query data in tourism demand forecasting is relatively new, interest in this area has increased rapidly in recent years. Search query data have often been aggregated and converted into the same frequency as tourism demand variables in previous studies because they are often sampled at a higher frequency (Choi and Varian 2012;Li et al. 2017;Pan et al. 2012;Rivera 2016;Yang et al. 2015). This can lead to information loss and poor forecasting performance because high frequency information is not used (Ghysels et al. 2007). Bangwayo-Skeete and Skeete (2015) were the first to introduce mixed data sampling (MIDAS) for tourism demand forecasting. They found that MIDAS performed better in forecasting monthly tourist arrivals using weekly Google Trends data in most forecasting exercises, whereas its performance was poor in other exercises.
Compared with common benchmark models, such as the seasonal autoregressive integrated moving average (SARIMA) model, traditional MIDAS models often involve autoregressive (AR) components and are unable to incorporate moving average (MA) dynamics. Indeed, they are not effective when the underlying data include MA dynamics. In fact, in a recent study, Foroni et al. (2019) showed that MA components emerged in a MIDAS model in which the low frequency variable was the result of temporal aggregation. They investigated the effect of neglecting MA components in the forecasts and found that including MA components improved the forecasting performance of their Monte Carlo simulations and application to US macroeconomic variables. In this study, the same idea is introduced to tourism demand forecasting and the relevance of MA components is investigated in this context. In addition, Foroni et al. (2019) focused on forecasting macroeconomic variables and did not consider seasonal ARMA components. As tourism demand often exhibits strong seasonality, it is important to account for seasonality in the modelling process. Moreover, Foroni et al. (2019) arbitrarily determined the orders of AR and MA components in MIDAS models. Doing so may yield a higher probability of model misspecification. To overcome these problems, a new model that integrates the MIDAS and SARIMA processes is proposed. This new model is an extension of traditional MIDAS models and is able to accommodate seasonal and non-seasonal ARMA components. The features of the MIDAS and SARIMA models are especially relevant in the tourism demand forecasting context. The mixed frequency aspect of the new model provides a more efficient way to utilise high frequency search query data. Furthermore, its seasonal and non-seasonal ARMA components capture important characteristics of tourism demand. 
In this study, the effectiveness of the model is investigated by forecasting monthly tourist arrivals in Hong Kong from mainland China, with daily composite indices constructed from a large number of search queries using the generalised dynamic factor model (GDFM). Previous studies have focused on forecasting or nowcasting tourism demand. In contrast, this study is the first to conduct a comparison analysis of forecasts and nowcasts. Such a comparison may be particularly useful for decisionmakers who need frequent updates to make more accurate forecasts. When forecasting tourism demand, traditional macroeconomic data, such as income level in the origin country and relative price level in the destination country, are often incomplete and subject to revision for the current and most recent periods. However, search query data are readily available on a daily and even hourly basis. They are especially useful in a nowcasting framework, which can enable more timely tourism demand forecast updates when new information becomes available. For example, timely and improved updates of nowcasts of demand are very valuable in hotel revenue management, which involves dynamic pricing.
The remainder of this paper is organised as follows. Section 2 reviews the relevant literature.
Section 3 presents the data and the construction of the search query index. Section 4 discusses the details of the models and their estimation results. Section 5 presents the forecasting and nowcasting results. Finally, Section 6 concludes.

Tourism demand forecasting
Tourism demand forecasting is a well-established research area. The three main types of modelling techniques include non-causal time series, econometric and artificial intelligence (AI)-based methods.
Traditional time series models include Naïve 1 models (no change), Naïve 2 models (constant growth rate), exponential smoothing models and simple AR models (Song and Li 2008;Wu et al. 2017). They are often used as benchmarks in tourism forecasting studies. Autoregressive integrated moving average (ARIMA) models and SARIMA models are the most commonly used models, depending on the frequency of the time series. Various extensions of the ARIMA model have also been used in the literature. For example, Chu (2009) introduced an autoregressive ARMA (ARARMA) model and a fractionally integrated ARMA (ARFIMA) model to forecast tourist arrivals in nine destinations in the Asia-Pacific region and found that the ARFIMA model outperformed the SARIMA and ARARMA models. Similarly, Assaf et al. (2011) used several models based on fractional integration to forecast tourist arrivals in Australia, confirming that they outperformed the standard ARIMA and SARIMA models.
Structural time series (Turner and Witt 2001) and generalised autoregressive conditional heteroskedastic (Divino and McAleer 2010) models have also been widely used in the tourism literature. In recent years, more advanced time series models have been used to generate better forecasting performance than traditional time series models, such as innovations state space models for exponential smoothing (ETS; Athanasopoulos et al. 2011), singular spectrum analysis (SSA) models (Hassani et al. 2017) and time-varying parameter structural time series models. Decomposition methods, such as SSA, empirical mode decomposition (Yahya et al. 2017) and ensemble empirical mode decomposition (Zhang et al. 2017), have gained much popularity in recent years and have demonstrated good forecasting performance. These techniques have been used in univariate time series forecasting settings (Hassani et al. 2017; Hassani et al. 2015; Silva et al. 2019) and causal time series forecasting settings (Li and Law 2020).
Unlike non-causal time series models, econometric models can analyse the relationship between tourism demand and its key determinants, and the information can be used to provide policy recommendations. Several important factors affecting tourism demand have been identified in the literature, such as tourist income, tourism prices in a destination relative to those of the country of origin, tourism prices in competing destinations and real exchange rates (Song and Li 2008;Wu et al. 2017).
Spurious regression is often present in traditional regression analysis. Several modern econometric models have been introduced in tourism modelling and forecasting, such as the autoregressive distributed lag model, the error correction model (Goh 2012), the vector autoregressive (VAR) model (Wong et al. 2006), the time-varying parameter model (Page et al. 2012), the almost ideal demand system model (Li et al. 2006) and the Bayesian VAR model (Gunter and Önder 2015; Wong et al. 2006). Numerous studies have concluded that econometric models perform better (Song et al. 2003), but some have confirmed that time series models outperform econometric models in predicting tourism demand.
In addition to time series and econometric methods, a variety of AI-based methods have been introduced in the tourism forecasting literature. The dominant model is the artificial neural network (ANN) model. It consists of several layers, each of which can contain multiple neurons.
The ANN model is a nonparametric and data-driven method that can be used to model non-linear relationships. It is also the most frequently used AI-based method in tourism demand forecasting studies (Claveria et al. 2015;Law et al. 2019;Sun et al. 2019). Other AI-based methods used to forecast tourism demand include the support vector machine model (Chen et al. 2015;Hong et al. 2011), the fuzzy system model (Aladag et al. 2014), the rough set model (Goh et al. 2008) and grey theory (Sun et al. 2016).
Although various methods have been introduced and applied in the literature, there is a consensus that no model can outperform other models consistently under all conditions (Song and Li 2008). Using a meta-analysis, Peng et al. (2014) showed that their data characteristics and study features, such as demand measure, data frequency and origin/destination pairs, affected the forecasting accuracy of tourism demand.

Forecasting with search query data
People often search for information online and their search behaviour reflects their consumption preferences and decision-making processes (Du et al. 2014; Ghose et al. 2014).
Search query data can serve as a powerful predictor to improve forecasting accuracy. Thus, forecasting using search query data has gained popularity in a number of research areas, such as house prices and sales (Wu and Brynjolfsson 2015).
In recent years, forecasting tourism demand using search engine data has also attracted attention. For example, Choi and Varian (2012) used Google Trends data for the first 2 weeks of each month to predict the number of visits to Hong Kong in a given month. Pan et al. (2012) chose five related Google search queries to forecast demand for hotel rooms in Charleston, US, improving forecasting performance by including search query data. Pan and Yang (2017) used Google search engine queries and website traffic data to forecast hotel demand in Charleston and found that their forecasts were more accurate when they included both data sources. Rivera (2016) pointed out that Google Trends data differ each week because the data are constructed as a relative volume and come from a periodic sample of queries. Therefore, he proposed using a dynamic linear model and treated Google Trends data as a representation of an unobservable process. In addition, the association between hotel demand and Google Trends data can be better understood when the data are downloaded on multiple occasions. Yang et al. (2015) used Google Trends and the Baidu Index, which represents the absolute volume of the chosen search queries, to forecast the number of visitors to a province in China.
They found that although the data from both search engines improved forecasting accuracy, the Baidu Index performed better. Li et al. (2017) used the GDFM to construct a composite index from Baidu search queries. Law et al. (2019) applied a deep learning approach to forecast tourist arrivals in Macau using search query data. They showed that the deep learning approach significantly outperformed the support vector regression model and the traditional ANN model.

MIDAS regressions
Time series data are often collected at different frequencies, but most models require variables to be converted to the same low frequency. During this process, the potentially valuable information contained in high frequency variables is smoothed and lost. To tackle this problem, Ghysels et al. (2004) used MIDAS regressions to directly estimate equations with variables sampled at different frequencies.
The use of MIDAS regressions has proliferated in the macroeconomic literature. For example, Clements and Galvão (2008) used MIDAS to forecast quarterly output growth using monthly predictors and found significant improvement. Andreou et al. (2013) extracted a small set of daily financial data from a large panel of daily financial assets to predict quarterly real GDP growth using MIDAS and elucidated the value of daily financial information. MIDAS has also been used to forecast inflation and oil prices. Monteforte and Moretti (2012) showed a reduction in inflation forecast errors in the euro area by including daily financial variables using MIDAS. Baumeister et al. (2015) investigated the predictive power of daily and weekly financial market data in forecasting monthly oil prices. They demonstrated that the preferred MIDAS model improved forecasting accuracy compared with no-change forecasts.
MIDAS regressions have also been widely used in the financial literature. Ghysels et al. (2009) compared several models generating multi-period ahead forecasts of stock return volatilities and found that MIDAS performed best for longer horizon forecasts. Gurgul et al. (2018) used MIDAS-based models for systemic risk assessment in the banking sector and found that the information contained in the macroeconomic variables helped predict short- and long-term risk components.
Bangwayo-Skeete and Skeete (2015) were the first to apply MIDAS with AR components in the tourism literature. Using weekly Google data to forecast monthly tourist arrivals in five Caribbean countries, they found that the MIDAS models generated better predictions than the baseline time series models for most of their experiments. However, MIDAS models can only accommodate AR dynamics, and their forecasting performance may deteriorate when the underlying data include MA dynamics. Foroni et al. (2019) showed that MA components in general emerged in a MIDAS model and improved the forecasting accuracy of US macroeconomic variables by including MA components. In this study, the relevance of MA components in tourism demand forecasting is investigated using an improved MIDAS model that incorporates seasonal ARMA components and automatically selects appropriate structures. This novel model combines the advantages of MIDAS and SARIMA and can offer desirable features for modelling tourism demand using search queries. In addition to accommodating the mixed frequency variables provided by MIDAS, it can also automatically choose appropriate seasonal and non-seasonal ARMA components, which are often present in tourism demand data.
The growth in visitors from mainland China has increased Hong Kong's tourism revenue and generated many job opportunities. However, it has also led to higher prices and a shortage of goods, causing tension between mainland visitors and Hong Kong residents. Thus, businesses and policymakers require accurate forecasts of tourist arrivals from mainland China to make informed decisions.

Data and composite index
In this study, we used data on monthly tourist arrivals from mainland China to Hong Kong between January 2011 and February 2018. The data were collected from the Hong Kong Tourism Board's B2B website, PartnerNet (https://partnernet.hktb.com). The data were sampled from 2011 because the Baidu Index data are only available from 2011. Following previous studies, the log transformation was applied before starting the modelling process.
Although Google dominates the global market, it left the mainland China market in 2010 following a dispute with the Chinese government. Baidu has since become the most popular search engine in China, holding the largest market share (Yang et al. 2015). Given this study's interest in tourist arrivals in Hong Kong from mainland China, the Baidu Index was used.
To apply the search query data to tourism forecasting, keyword selection was conducted first.
The most common method for selecting search query data is based on the researcher's intuition and prior knowledge (Brynjolfsson et al. 2014). This practice is common in the tourism field. For instance, Pan et al. (2012) chose five Google search queries to forecast hotel demand.
Bangwayo-Skeete and Skeete (2015) also adopted this method and used 'hotels' and 'flights' as keywords to forecast tourist arrivals in the Caribbean. Although this method is easy to apply, it can omit important information by excluding relevant search queries. To mitigate this problem, the set of initial keywords can be extended by adding pertinent keywords using the functions of the search engine (Yang et al. 2015). In this study, the initial set of keywords was thus extended and the keywords in the Baidu Index were selected according to the following steps:
1. Six aspects of tourism planning were specified: dining, shopping, transportation, tours, attractions and lodging. Several initial keywords were determined for each aspect.
2. Keywords highly correlated to the initial keywords were added using a demand map interface provided by Baidu. This step was iterated until convergence.
3. As Baidu does not provide the search query volume below a certain threshold, the availability of each search query was manually checked using the keywords.
Ultimately, 101 Baidu search queries were collected (the names of the translated search queries can be found in Appendix A).
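The iterative keyword selection described in the steps above can be illustrated with a minimal sketch. The callables `related` (standing in for a lookup in Baidu's demand map) and `is_available` (standing in for the manual availability check) are hypothetical placeholders, not part of any Baidu interface, and the toy keyword graph is invented purely for illustration.

```python
from typing import Callable, Iterable, Set


def expand_keywords(seeds: Iterable[str],
                    related: Callable[[str], Iterable[str]],
                    is_available: Callable[[str], bool],
                    max_rounds: int = 20) -> Set[str]:
    """Grow a keyword set until no new related keywords appear (convergence)."""
    selected = set(seeds)
    frontier = set(seeds)
    for _ in range(max_rounds):
        # Collect keywords related to those added in the previous round.
        candidates = set()
        for keyword in frontier:
            candidates.update(related(keyword))
        new_keywords = candidates - selected
        if not new_keywords:  # converged: nothing new was found
            break
        selected |= new_keywords
        frontier = new_keywords
    # Keep only queries whose search volume is actually published.
    return {kw for kw in selected if is_available(kw)}


# Toy example (hypothetical keyword graph).
toy_graph = {
    "hong kong hotel": ["hong kong hostel"],
    "hong kong hostel": ["hong kong hotel", "hong kong guesthouse"],
    "hong kong guesthouse": [],
}
result = expand_keywords(
    ["hong kong hotel"],
    related=lambda kw: toy_graph.get(kw, []),
    is_available=lambda kw: kw != "hong kong guesthouse",
)
```

In practice the expansion and the availability check were performed by hand in the study; the sketch only makes the stopping rule explicit.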
With this large number of search queries, some AI models, such as the deep learning models used in Law et al. (2019), can directly incorporate these search queries and identify the most relevant ones. However, most econometric models, including the MIDAS models used in this study, cannot perform this task. As a result, the dimensionality of the search queries must be reduced before the modelling process. This can be done by extracting common components using various factor models, such as static and dynamic factor models. Static factor models express common components as a linear combination of a small number of unobserved static factors that are loaded simultaneously (Stock and Watson 2002). The GDFM proposed by Forni et al. (2000) encompasses the static factor model and its common components, $\chi_{it}$, are driven by unobservable common factors, $u_{jt}$, $j = 1, \dots, q$. For the observed variables $\{X_{it}, i = 1, \dots, n, t = 1, \dots, T\}$, the model can be formulated as
$$X_{it} = \chi_{it} + \varepsilon_{it} = \sum_{j=1}^{q} b_{ij}(L)\, u_{jt} + \varepsilon_{it},$$
where $b_{ij}(L)$ is the factor loading, $L$ is the lag operator and $\varepsilon_{it}$ is the idiosyncratic component. The GDFM has two important characteristics: it is dynamic and allows for cross-correlation among idiosyncratic components. Unlike static factor models, in which lagged factors are added as additional static factors, the common components of the GDFM can accommodate AR and MA responses. Distinguishing between leading and coincident variables a priori is not needed. The common components of the GDFM depend on cross-correlations at all leads and lags, so they can incorporate different lead and lag information from the variables (Forni et al. 2000). This is advantageous for constructing a composite index from a large number of search queries.
The GDFM has been adopted by several economic and financial institutions to analyse and predict economic activities. The Banca d'Italia published a real-time monthly coincident indicator of the euro area business cycle (Eurocoin) based on the GDFM (Altissimo et al. 2010).
The Federal Reserve Bank of New York developed a similar index for estimating underlying inflation using these methods (Amstad and Potter 2009). In the tourism context, Li et al. (2017) were the first to use the GDFM to construct the composite index from Baidu search queries.
They found that the GDFM-based index had better forecasting performance than PCA.
Therefore, the GDFM was adopted in this study to construct the index.
Before estimating the GDFM, the number of common factors, $q$, must be determined. To this end, Forni et al. (2000) used the variance contribution rate, where $q$ is the number of factors whose variance contribution rates converge. However, this is a heuristic eye inspection rule. Hallin and Liška (2007) instead proposed a formal test, using the log criterion of their study with the penalty function $p_1$ and lag window $\sqrt{T}$ to choose the number of factors, the maximum number of factors being set to 50. Here, $c$ is the coefficient associated with the penalty function, $S_c$ is defined as the variability of the estimated $q$ when the size of the subsample increases and $q^{*T}_{c;n}$ is the estimated $q$ when the whole sample is used. The selection of $q$ can be based on the plot of $S_c$ and $q^{*T}_{c;n}$ against $c$, where $q$ is based on the second stability interval (more details are provided by Hallin and Liška 2007). This method was applied to the search queries of this study. The plot is shown in Fig. 1.
(Insert Fig. 1 about here)
The second stability interval appeared between $c = 0.31$ and $c = 0.34$ (with $S_c$ equal to 0) and the estimated $q$ was equal to 4. Therefore, the number of common factors was set to 4 for the search queries.
The common components were then calculated using standardised search query data with a mean of 0 and a standard deviation of 1. The coincident index at time t was constructed using the common components, $x^{D}_t = \sum_{i=1}^{n} \chi_{it}$. As the search queries were collected daily, this index also had a daily frequency. The relationship between the daily index and the log transformation of monthly tourist arrivals is plotted in Fig. 2.
(Insert Fig. 2 about here)
The close relationship between the daily index and monthly tourist arrivals is clearly illustrated.
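The two mechanical steps in this construction, standardising each query series and summing the common components across queries for each day, can be sketched as follows. Estimating the common components themselves requires the GDFM and is not shown; the function names and the use of the population standard deviation are assumptions made for this illustration.

```python
from statistics import mean, pstdev


def standardise(series):
    """Rescale one search query series to mean 0 and standard deviation 1.

    The population standard deviation is used here; this is an assumption.
    """
    mu = mean(series)
    sd = pstdev(series)
    return [(value - mu) / sd for value in series]


def composite_index(common_components):
    """Sum the common components over queries for each day.

    common_components[i][t] is the common component of query i on day t
    (assumed to have been estimated beforehand, e.g. with the GDFM).
    """
    return [sum(day_values) for day_values in zip(*common_components)]


z = standardise([1.0, 2.0, 3.0])
index = composite_index([[1.0, 2.0], [3.0, 4.0]])
```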

Research methods
In this section, the specifications and estimation procedure of the following competing models are presented: the SARIMA model, the SARIMA model with an exogenous variable (SARIMAX) and the traditional and improved MIDAS models. Data up to February 2017 were used for the estimation procedure and the remaining data were used to evaluate the forecasting performance.

SARIMA and SARIMAX
The SARIMA model can account for seasonality, which is a common feature of tourism demand.
It is the most commonly used time series model in the tourism demand forecasting literature and is often used as a benchmark (Song and Li 2008; Wu et al. 2017). A SARIMA$(p, d, q)(P, D, Q)$ model with seasonal frequency $m$ can be specified as follows:
$$\Phi(B^m)\, \phi(B)\, (1 - B^m)^D (1 - B)^d\, y_t = \Theta(B^m)\, \theta(B)\, \varepsilon_t,$$
where $y_t$ is the log of tourist arrivals, $B$ is the backshift operator, $\Phi(B^m)$ and $\Theta(B^m)$ represent the seasonal AR and MA components (which are polynomials of order $P$ and $Q$), respectively, $\phi(B)$ and $\theta(B)$ represent the non-seasonal AR and MA components (which are polynomials of order $p$ and $q$), respectively, and $\varepsilon_t$ is a white noise process. The forecast package in the R program (R Core Team 2016) was used to automatically select the orders and estimate the coefficients (Hyndman and Khandakar 2008) as follows:
1. The order of seasonal differencing was chosen using a test suggested by Wang et al. (2006), which is based on a measure of seasonal strength.
2. The order of non-seasonal differencing was chosen using the KPSS unit-root test (Kwiatkowski et al. 1992).
3. A stepwise procedure was used to traverse the model space and the orders p, q, P and Q were chosen based on the corrected Akaike information criterion (AIC).
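The third step can be illustrated with a simplified sketch of a stepwise search over $(p, q, P, Q)$ guided by the corrected AIC. This is not the forecast package's implementation: the neighbourhood used here (changing one order by one at a time), the starting point and the `score` callable, which is assumed to fit a model with the given orders and return its corrected AIC, are all simplifying assumptions.

```python
def aicc(log_likelihood, k, n):
    """Corrected AIC for a model with k estimated parameters and n observations."""
    aic = -2.0 * log_likelihood + 2.0 * k
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)


def stepwise_orders(score, start=(2, 2, 1, 1), max_order=(5, 5, 2, 2)):
    """Greedy search over (p, q, P, Q): move to a neighbour whenever it scores lower.

    `score` is a hypothetical callable mapping an order tuple to its corrected AIC.
    """
    best = start
    best_score = score(best)
    improved = True
    while improved:
        improved = False
        for position in range(4):
            for step in (-1, 1):
                candidate = list(best)
                candidate[position] += step
                if not 0 <= candidate[position] <= max_order[position]:
                    continue
                candidate = tuple(candidate)
                candidate_score = score(candidate)
                if candidate_score < best_score:
                    best, best_score = candidate, candidate_score
                    improved = True
    return best, best_score


# Toy score with a known minimum at (0, 1, 0, 1), standing in for a fitted-model AICc.
toy_score = lambda orders: sum((a - b) ** 2 for a, b in zip(orders, (0, 1, 0, 1)))
best_orders, best_value = stepwise_orders(toy_score)
```

A greedy search of this kind visits far fewer models than an exhaustive grid, at the cost of possibly stopping at a local minimum.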
The SARIMAX model simply adds an exogenous variable to SARIMA so that it becomes a regression model with SARIMA errors. Therefore, the estimation procedure of the SARIMAX model is almost identical to that of the SARIMA model, except that the regression is conducted first. It can be formulated as follows:
$$y_t = \beta x_t + n_t,$$
$$\Phi(B^m)\, \phi(B)\, (1 - B^m)^D (1 - B)^d\, n_t = \Theta(B^m)\, \theta(B)\, \varepsilon_t,$$
where $x_t$ is the exogenous variable (which may include lagged variables) and $n_t$ is the error from the regression model. It is equivalent to substituting the differencing terms in the following regression equation:
$$y'_t = \beta x'_t + n'_t,$$
where $n'_t$ is $(1 - B)^d (1 - B^m)^D n_t$, with $y'_t$ and $x'_t$ defined analogously. Furthermore, this is equivalent to differencing $y_t$ and $x_t$ before fitting the model with ARMA errors. As non-stationary errors suggest the existence of spurious regression, it is necessary to difference the variables first.
The SARIMAX model can be rewritten as follows:
$$\Phi(B^m)\, \phi(B)\, (1 - B^m)^D (1 - B)^d\, (y_t - \beta x_t) = \Theta(B^m)\, \theta(B)\, \varepsilon_t.$$
It can be seen that the same AR terms are applied to $y_t$ and $x_t$.
The exogenous variable used in this study was the composite index constructed from the search queries using the GDFM. As it was available daily, temporal aggregation was conducted by averaging the daily index for each month. However, the number of days varies in different months. To enable a direct comparison between the SARIMAX and MIDAS models, the 30 days preceding the first day of each month were considered to be a full last month. The monthly composite index at time t is denoted as $x_t$.
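The aggregation rule can be made concrete with a short sketch. It assumes the daily index is stored in a dictionary keyed by calendar date; the function name and the data layout are illustrative choices, not taken from the study.

```python
from datetime import date, timedelta


def previous_month_index(daily_index, year, month, window=30):
    """Average the daily index over the `window` days before the first day of (year, month).

    The result plays the role of the index for the "full last month".
    """
    first_day = date(year, month, 1)
    days = [first_day - timedelta(days=k) for k in range(1, window + 1)]
    return sum(daily_index[day] for day in days) / window


# Toy daily series: the value on each day equals the number of days since 1 January 2017.
start = date(2017, 1, 1)
toy_daily = {start + timedelta(days=i): float(i) for i in range(90)}

# Index used as the lagged regressor for March 2017 (30 days ending on 28 February 2017).
feb_value = previous_month_index(toy_daily, 2017, 3)
```

Using a fixed 30-day window gives every month the same number of daily observations, which matches the fixed number of high frequency lags required by the MIDAS models.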
The monthly index with at least one lag was added to the SARIMAX model and the lag length was determined based on the AIC and the Bayesian information criterion (BIC). A monthly index with one lag was found to generate the smallest AIC and BIC.
The details of the estimation results for the fitted SARIMA and SARIMAX models are summarised in Table 1.
(Insert Table 1 about here)
The coefficient of $x_{t-1}$ was positive and significant at the 1% level, indicating that an increase in search queries leads to an increase in tourist arrivals the following month. The smaller AIC and BIC values of the SARIMAX model suggest that including the search query index improved the model fit. A Ljung-Box test was conducted to check the residuals of the fitted models and the p values were reported. The large p values indicate that the residuals were independently distributed and that the models were properly specified.

MIDAS models
Search query data are available at a higher frequency than tourist arrival data. They contain potentially valuable information, and temporal aggregation can lead to information loss (Ghysels et al. 2007). Most time series regressions involve data sampled at the same frequency, so high frequency information cannot be used directly. As an alternative to the common solution of converting all data to the same low frequency, MIDAS can directly accommodate variables sampled at different frequencies. MIDAS models can be applied in cases where high frequency variables are used to forecast a low frequency variable. In addition, they may have more salient advantages when the frequencies of the variables are significantly different, as using traditional methods can lead to greater information loss during temporal aggregation.
Therefore, MIDAS models are well suited to this study using monthly tourist arrivals and daily search queries.
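The basic MIDAS model for a single explanatory variable can be written as
$$y_t = \beta_0 + \beta_1 B(L_{HF}; \theta)\, x^{D}_{t-1} + \varepsilon_t, \qquad B(L_{HF}; \theta) = \sum_{i=1}^{K} b(i; \theta)\, L_{HF}^{\,i-1},$$
where $y_t$ is the log of tourist arrivals, $x^{D}_{t-1}$ is the daily index observed up to the end of month $t-1$, $L_{HF}$ is the high frequency lag operator, $B(L_{HF}; \theta)$ is a polynomial that assigns the weights to the high frequency variable at lag i, $K$ is the maximum lag on the high frequency variable and $\varepsilon_t$ is a white noise process.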
Different weighting schemes can be used as functional constraints. A weighting scheme defined by the vector of parameters $\theta = (\theta_1, \theta_2, \dots, \theta_r)$ can be written as
$$b(i; \theta) = \frac{\psi(i, \theta)}{\sum_{k=1}^{K} \psi(k, \theta)}.$$
The most popular specifications for $\psi(i, \theta)$ include the exponential Almon function and the beta function (Ghysels et al. 2007). Ghysels et al. (2007) argued that the beta function was flexible enough to accommodate different weighting shapes with only two parameters. For comparison purposes, the exponential Almon specification used in this study also used two parameters. In addition, a Gompertz function was added as an additional comparison. The specifications of these three functions with two parameters $(\theta_1, \theta_2)$ are expressed below:
Exponential Almon: $\psi(i, \theta) = \exp(\theta_1 i + \theta_2 i^2)$.
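As a minimal sketch of how such a weighting scheme operates, the code below computes normalised exponential Almon weights and applies them to a block of daily values. Only the exponential Almon form is shown. Indexing lags from 1 to K, ordering the daily values from most recent to oldest, and the parameter values in the example are assumptions made for illustration.

```python
import math


def exp_almon_weights(theta1, theta2, K):
    """Normalised exponential Almon weights for lags i = 1, ..., K (they sum to 1)."""
    raw = [math.exp(theta1 * i + theta2 * i * i) for i in range(1, K + 1)]
    total = sum(raw)
    return [value / total for value in raw]


def weighted_daily_index(daily_values, weights):
    """Weighted sum of daily values; daily_values[0] is assumed to be the most recent day."""
    return sum(w * x for w, x in zip(weights, daily_values))


flat = exp_almon_weights(0.0, 0.0, 30)         # equal weights, i.e. a simple average
declining = exp_almon_weights(0.1, -0.05, 30)  # recent days weighted more heavily
aggregate = weighted_daily_index([1.0] * 30, declining)
```

With both parameters equal to zero the scheme reduces to the simple monthly average used for the SARIMAX model, which is why the estimated weights indicate how much information plain averaging discards.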
MIDAS models can be expanded to include AR dynamics. However, this process is not straightforward, as noted by Ghysels et al. (2007). Consider a MIDAS-AR model with one lag of $y_t$:
$$y_t = \beta_0 + \lambda y_{t-1} + \beta_1 B(L_{HF}; \theta)\, x^{D}_{t-1} + \varepsilon_t.$$
It can be rewritten as
$$y_t = \beta_0 (1 - \lambda)^{-1} + \beta_1 (1 - \lambda L)^{-1} B(L_{HF}; \theta)\, x^{D}_{t-1} + \tilde{\varepsilon}_t,$$
where $L$ is the low frequency lag operator and $\tilde{\varepsilon}_t = (1 - \lambda L)^{-1} \varepsilon_t$. The polynomial on $x^{D}_{t-1}$ is a combination of $(1 - \lambda L)^{-1}$ and $B(L_{HF}; \theta)$. This generates a seasonal response of $y$ to $x^{D}$, whether or not $x^{D}$ demonstrates seasonal patterns. This strategy is generally considered inappropriate. Therefore, Clements and Galvão (2008) suggested introducing AR dynamics in $\varepsilon_t$ as a common factor, where the same AR dynamics are applied to $y_t$ and $x^{D}_{t-1}$ so that the response of $y$ to $x^{D}$ is non-seasonal. This model was adopted in this study as MIDAS-AR. Before estimating MIDAS-AR, the amount of seasonal and non-seasonal differencing for tourist arrivals was determined using the same tests as the SARIMA model (Kwiatkowski et al. 1992; Wang et al. 2006).
The difference between the new MIDAS-SARIMA model and a standard MIDAS model is that $\varepsilon_t$ is a SARIMA process.
Thus, the MIDAS-SARIMA model can accommodate seasonal and non-seasonal ARMA dynamics.
This model is distinguished from standard MIDAS models by its seasonal structure. In addition, it integrates the automatic order selection procedure of SARIMA models to determine the best structure of the MIDAS-SARIMA model. As a result, this model combines the advantages of the MIDAS and SARIMA models and offers considerable potential to improve forecasting accuracy.
The MIDAS-SARIMA model applies the same AR dynamics to $y_t$ and $x^{D}_{t-1}$. The estimation results are presented in Table 2.
(Insert Table 2 about here)
Seasonal and non-seasonal differencing were performed for all MIDAS models. The MIDAS-AR models gave the same structures, with two lags of the AR dynamics, and the estimated coefficients of the two AR components were close for different weighting schemes. The same was observed for the MIDAS-SARIMA models, which had the same MA(1) and SMA(1) structures and similar estimated coefficients. This suggests that the different weighting schemes made little difference in the estimation of the MIDAS models. This result is consistent with the results of Bangwayo-Skeete and Skeete (2015). The AIC and BIC values suggest that the MIDAS-SARIMA models had a better fit and were more appropriate than the MIDAS-AR models.
The weights of the daily indices can be visualised. For example, the weights of the 30 daily indices for the MIDAS-SARIMA models are plotted in Fig. 3.
(Insert Fig. 3 about here) All three weighting schemes weighted the most recent indices more heavily. Most weights were put on the last 15 days, whereas the earlier days had almost 0 weight. Furthermore, the beta weighting scheme put the highest weight on Day 2, whereas Day 1 was given very little weight.
The exponential Almon and Gompertz weighting schemes generated similar patterns to that of the beta weighting scheme. However, their weighting curves were much smoother than that of the beta weighting scheme. The close estimates of $\beta_1$ shown in Table 2 suggest that the total weights of the daily indices were similar.
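As an illustration of how such weighting curves arise, the sketch below computes normalised exponential Almon and beta weights over 30 daily lags. It uses the textbook two-parameter forms of these schemes, which may differ in detail from the parameterisation estimated in this study; the Gompertz scheme is omitted because its exact form is not given here. The function names, the parameter values and the convention that lag 1 is the most recent day are assumptions made for the example.

```python
import numpy as np


def exp_almon_weights(theta1, theta2, n_lags):
    """Normalised exponential Almon weights for lags 1..n_lags.

    Assumed convention: lag 1 is the most recent daily index.
    """
    k = np.arange(1, n_lags + 1, dtype=float)
    raw = np.exp(theta1 * k + theta2 * k ** 2)
    return raw / raw.sum()


def beta_weights(a, b, n_lags, eps=1e-6):
    """Normalised beta weights for lags 1..n_lags.

    Lags are mapped onto the open interval (0, 1); eps keeps the end
    points away from 0 and 1 so that the powers stay finite.
    """
    u = np.linspace(eps, 1.0 - eps, n_lags)
    raw = u ** (a - 1.0) * (1.0 - u) ** (b - 1.0)
    return raw / raw.sum()


# Illustrative parameter values only (not estimates from the study).
almon = exp_almon_weights(0.0, -0.02, 30)
beta = beta_weights(1.0, 5.0, 30)
```

With these illustrative parameters, the exponential Almon weights decline smoothly and more than 95% of the total weight falls on the 15 most recent lags, which is the kind of pattern described above. Because the weights sum to one, the scale of the overall effect is carried by the slope coefficient rather than by the weights themselves.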

Forecasting
In this subsection, the forecasting performance of the models is evaluated using data from March 2017 to February 2018. Search query data with one lag were used in the modelling process, with the tourist arrivals and search query data available up to time t. Thus, the ARIMAX and MIDAS models first had to be estimated using tourist arrivals data up to time t and search query data up to time t-1. The estimated models were then used to generate the forecasts for time t+1 with the search query data at time t. Therefore, only one-step-ahead forecasts could be generated in this study. Longer-horizon forecasts, which would require a different estimation procedure using search query data with lags longer than one, were not produced here and are left for future research. An expanding window approach was used to generate the one-step-ahead dynamic forecasts. For example, the data on tourist arrivals up to February 2017 and the search query data up to January 2017 were first used to estimate the models, then the search query data for February 2017 were used to forecast tourist arrivals in March 2017. The estimation period was then extended by 1 month and the models were re-estimated using the same procedure. Forecasts were generated in each round until all 12 one-step-ahead forecasts had been calculated for the period from March 2017 to February 2018.
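The timing of the expanding window procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the estimation code used in the study: the search query information is reduced to a single monthly regressor, and a simple OLS regression (the hypothetical `ols_fit_predict`) stands in for the ARIMAX and MIDAS models. What the sketch is meant to show is only which observations enter each estimation round and which regressor value produces each forecast.

```python
import numpy as np


def ols_fit_predict(y_train, x_lag_train, x_new):
    """Stand-in model (hypothetical): OLS of y_s on x_{s-1}."""
    X = np.column_stack([np.ones_like(x_lag_train), x_lag_train])
    coef, *_ = np.linalg.lstsq(X, y_train, rcond=None)
    return coef[0] + coef[1] * x_new


def expanding_window_forecasts(y, x, n_test, fit_predict=ols_fit_predict):
    """One-step-ahead forecasts for the last n_test periods of y.

    For each target period t+1, the model is estimated on arrivals up
    to t paired with the regressor up to t-1, and the forecast is then
    produced from the regressor observed at t.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(y)
    forecasts = []
    for target in range(n - n_test, n):
        t = target - 1                 # last period with observed arrivals
        y_train = y[1:t + 1]           # y_1, ..., y_t
        x_lag_train = x[0:t]           # x_0, ..., x_{t-1}
        forecasts.append(fit_predict(y_train, x_lag_train, x[t]))
    return np.array(forecasts)
```

Calling `expanding_window_forecasts(y, x, 12)` returns 12 forecasts, one per round, with the estimation sample growing by one month each time.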
Forecast accuracy was evaluated using five commonly used forecast error measures: the mean absolute deviation (MAD), the mean squared error (MSE), the mean absolute percentage error (MAPE), the root mean square percentage error (RMSPE) and Theil's U statistic (Goh and Law 2002; Law et al. 2019). The MAD and MSE are absolute error measures, whereas the MAPE and RMSPE are relative error measures. Theil's U was constructed as the ratio of the error of the underlying model to that of the seasonal naïve model, which predicts that tourist arrivals in a given month will equal those in the same month of the previous year. A value less than 1 indicates that the model outperforms the seasonal naïve model. In the specifications of these measures, A is the actual value, F is the forecast value at time t and n is the length of the forecast period (n = 12 in this study). Two extra benchmark models were added to the forecasting exercise: the ETS and the seasonal naïve model. Athanasopoulos et al. (2011) found that the ETS performed particularly well for monthly data in their tourism forecasting competition. The seasonal naïve model is also widely used as a benchmark model in forecasting seasonal tourism demand. Table 3 presents the results of the models' forecasting performance.
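The five measures can be computed as in the sketch below. The MAD, MSE, MAPE and RMSPE follow their usual textbook definitions, with the percentage measures reported as fractions rather than multiplied by 100. The form of Theil's U, taken here as the ratio of the model's root mean squared error to that of the seasonal naïve forecasts, is an assumption based on the description above and may differ from the exact specification used in the study. The function name and the example numbers are illustrative.

```python
import numpy as np


def forecast_error_measures(actual, forecast, naive_forecast):
    """Return MAD, MSE, MAPE, RMSPE and Theil's U for one model.

    naive_forecast holds the seasonal naive forecasts (the value for
    the same month one year earlier) for the same periods.
    """
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    nv = np.asarray(naive_forecast, dtype=float)
    e = a - f
    mse = np.mean(e ** 2)
    return {
        "MAD": np.mean(np.abs(e)),
        "MSE": mse,
        "MAPE": np.mean(np.abs(e / a)),
        "RMSPE": np.sqrt(np.mean((e / a) ** 2)),
        # Assumed form: RMSE of the model over RMSE of the seasonal naive.
        "TheilU": np.sqrt(mse) / np.sqrt(np.mean((a - nv) ** 2)),
    }


# Made-up numbers, purely to show the call.
measures = forecast_error_measures([100, 200], [110, 180], [80, 240])
```

For these made-up numbers the model's errors are half the size of the seasonal naïve errors, so Theil's U is 0.5, which would indicate an improvement over the naïve benchmark.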
(Insert Table 3 about here)
The results also indicate that the most recent search query data, which were assigned the most weight in the MIDAS-SARIMA models, were the most valuable in predicting tourist arrivals. To further test the significance of the differences in forecasting accuracy between the two better-performing benchmark models (SARIMA and ETS) and the models using search query data, a Diebold-Mariano (DM) test was conducted (Diebold and Mariano 1995). The test was based on the differences in four loss measures, namely the absolute error (AE), the squared error (SE), the absolute percentage error (APE) and the squared percentage error (SPE), which underlie the MAD, MSE, MAPE and RMSPE, respectively. As Theil's U has the same denominator for all models, derived from the seasonal naïve model, and its numerator is calculated from the MSE, the corresponding DM test would largely depend on the MSE and was therefore omitted. Tables 4 and 5 present the results of the DM tests against SARIMA and the ETS, respectively. The null hypothesis of the DM test is that the accuracy of the forecasts generated by the benchmark and alternative models does not differ.
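A basic version of the DM test for one-step-ahead forecasts can be sketched as follows. It takes the per-period losses (for example, the AE or SE) of the benchmark and alternative models, forms the loss differential and compares its standardised mean with the standard normal distribution. This is a simplified illustration: it assumes serially uncorrelated loss differentials, uses the asymptotic normal approximation and applies no small-sample correction, so it may not match the exact variant used in this study. The function name is hypothetical.

```python
import math

import numpy as np


def dm_test_one_step(loss_benchmark, loss_alternative):
    """Diebold-Mariano statistic and two-sided p-value for h = 1.

    A positive statistic means the benchmark has the larger average
    loss, i.e. the alternative model is more accurate on average.
    """
    d = (np.asarray(loss_benchmark, dtype=float)
         - np.asarray(loss_alternative, dtype=float))
    n = d.size
    d_bar = d.mean()
    # For h = 1 only the lag-0 autocovariance of d enters the variance.
    var_d = d.var(ddof=0)
    stat = d_bar / math.sqrt(var_d / n)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value
```

With only 12 forecasts per model, as in this study, the normal approximation is rough, so results from a sketch like this should be read with caution.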
(Insert Table 4 about here)
(Insert Table 5 about here)
MIDAS-SARIMA-Gom significantly outperformed the ETS in terms of the AE and APE.

Nowcasting
The traditional models used in this study, such as the benchmark models and ARIMAX, are unable to update the forecasts until a full month of search query data are available, as they cannot incorporate high frequency search query data that offer a new daily index after each day. However, daily nowcasts can be generated using MIDAS models as new search query data become available every day. For example, when the search query data for the first day of the month are available, they can be added to the MIDAS models and used to predict tourist arrivals for that month (nowcasting). This can be repeated every day until the end of the month.
Again, due to the variable number of days in each month, nowcasts with search query data for 30 days starting from the first day of the same month were produced. As the MIDAS-SARIMA models had the best forecasting performance, their nowcasting performance was further investigated. Nowcasting is conducted in a similar way to forecasting. Nowcasting models must be refitted when new daily search query data become available, using the monthly tourist arrivals data until the end of the last month and the daily search query data until the end of that day. Then the nowcasts of the monthly tourist arrivals in the current month can be generated. This process can be repeated over an entire month to update the nowcasts of tourist arrivals in that month. The accuracy of these nowcasts can be investigated to determine whether updated nowcasts with more daily search query data perform better.
Nowcasting performance is plotted against the number of days of search query data added in Figs. 4-8. The nowcasts showed a certain level of fluctuation for all models. Overall, the exponential Almon weighting scheme gave the best results. In addition, most of the points were below the dotted forecasting line of the corresponding colour, which became more apparent as the number of days increased. This suggests that nowcasting generally outperforms forecasting using the MIDAS-SARIMA models, especially when more data become available. The percentages of nowcasts that outperformed the forecasts were calculated for each model, as shown in Table 6. The exponential Almon and beta weighting schemes had more nowcasts that outperformed the forecasts based on all measures. However, the Gompertz weighting scheme had better nowcasts only with respect to the MAPE. Nevertheless, a downward trend was still visible for the Gompertz weighting scheme (as shown in Figs. 4-8), indicating that the nowcasts generally improved as more search query data became available.
(Insert Table 6 about here)
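The percentages of this kind can be obtained with a calculation along the following lines. The sketch assumes that, for a given model and error measure, one error value is available for each of the daily nowcasts and a single value for the corresponding forecast, and that a nowcast counts as better only when its error is strictly smaller. The function name and the numbers are illustrative and are not taken from the study.

```python
import numpy as np


def share_of_nowcasts_better(nowcast_errors, forecast_error):
    """Percentage of daily nowcasts whose error is below the forecast's."""
    e = np.asarray(nowcast_errors, dtype=float)
    return 100.0 * np.mean(e < forecast_error)


# Made-up error values for four nowcast days and one forecast.
share = share_of_nowcasts_better([0.05, 0.04, 0.06, 0.03], 0.05)
```

In this made-up case two of the four nowcasts have a smaller error than the forecast, giving a share of 50%.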

Conclusion
Search query data are increasingly used to improve the accuracy of tourism demand forecasting.
The aim of this study was to investigate the performance of an improved MIDAS model (MIDAS-SARIMA). A comparison of forecasts and nowcasts was also conducted. As new search query data became available, their information could be incorporated into the MIDAS models using the mixed frequency structure and daily nowcasts could be generated. The nowcasts outperformed the forecasts most of the time for the exponential Almon and beta weighting schemes. Although the forecasts of the Gompertz weighting scheme outperformed most of its nowcasts, the error measures for this scheme still showed an overall downward trend. Thus, the nowcasts generally became more accurate as more search query data became available.
The results of this study have important implications for research in this area. Search query data have received considerable attention in forecasting tourism demand in recent years.
However, using the valuable information contained in these data is problematic and requires appropriate methods. Bangwayo-Skeete and Skeete (2015) were the first to use MIDAS models, which were found to have better forecasting performance than benchmark time series models.
However, they did not compare these MIDAS models with models that could also include search query information, such as the SARIMAX model. Therefore, whether the benefits of MIDAS can outweigh the cost of the limitations of its structure is unclear. Indeed, some studies have found no evidence supporting the use of mixed frequency methods (Rivera 2016