Generating prototypical residential building geometry models using a new hybrid approach

Building prototyping has regularly been used in building performance analyses with statistically feasible models. The novelty of this research lies in a new hybrid approach that combines stratified sampling and k-means clustering to establish building geometry prototypes. The research focuses on residential buildings in Ningbo, China. Seventeen small residential districts (SRDs) containing 367 residential buildings were systematically selected for survey and data collection. The stratified sampling used building construction year as the main parameter to define the strata. Floor numbers, shape coefficients, floor areas, and window-to-wall ratios were used as the four observations for k-means clustering. Based on this new approach, nine building geometry prototypes were identified and modelled. These statistically representative prototypes provide building geometrical information and characteristic-based evaluations for subsequent building performance analysis.


Introduction
Generally, building typology deals with a classification of parameters commonly found in buildings, such as building sizes, building styles, and construction materials (Loga and Diefenbach 2011). Although the typology is predetermined, it has a dialectical relationship with technology, history, polity, social proceedings, building individuality, and commonality (Theodoridou et al. 2011a). Lang et al. (2018) suggest that building typology is a prerequisite for defining a thorough building classification scheme. By definition, a prototype is a copy or imitation following the object sample (Martí Arís 1993). Each geographical region can contain one or more typical prototypical buildings representing the limited building types within that region (Ye et al. 2019).
Typology is the study of types or common elements within a particular field of study or discipline. For example, in architecture, it can be used to organise and classify either the form or the function of buildings. Indeed, in past decades, building typology has been widely used to create a harmonised structure of different building stock scenarios (Dascalaki et al. 2011; Kragh and Wittchen 2014). The development of building typology can be briefly summarised into three stages. The first stage mainly involves the classification of individual buildings by the design concept associated with architectural forms, internal layouts, and functions. For example, Durand (1799) established architectural types by classifying and integrating building layout, elevation, and basic components into geometric figures. He proposed 72 categories of building types in the book Compilation and Comparison of Various Ancient and Modern Materials (Recueil Et Parallèle Des Édifices de Tout Genre Anciens Et Modernes), which was considered a pioneering work in the field. In the second stage, the classification scope expanded from individual buildings to cities and towns. Rossi et al. (1982) initiated new building typology research in The Architecture of the City, expanding the research objects from individual buildings to the forms of urban elements, the organisation of urban spaces, and dwellers' living habits. In the third stage, a further expansion enabled research to focus on building design and performance evaluation, such as thermal comfort, acoustics, and optics, rather than only the architectural forms. For example, in the book The Architecture of the Well-tempered Environment, Banham (1984) analysed building typologies from the perspective of users, including daylighting, heating, ventilation, and air-conditioning systems.
Unlike typology, a prototype is an original model of something from which other forms can be developed or copied. Building prototyping is a significant development from building typology, representing a standard or a type of building. Typology can be considered more as a process, while the prototype is more of an outcome. Retrospectively, the first prototypical building models were proposed by Synergic Resource Corporation in the 1980s from the study of ten office buildings that were subgrouped as new or existing buildings (Shahrestani et al. 2014). Prototypical building models can be divided into three categories: econometric and technological models, physical measurement models, and statistical models. The econometric and technological model was mainly developed during the energy crisis of the late 1970s and was used to help formulate national energy plans (de Sa 1999). This model is not geometric, as it attempts to explain the changing relationship between energy use and the adjustment of energy supply and energy prices. The model is usually designed to examine the relationship between large-scale, long-term energy consumption or carbon dioxide emissions and urban economic and social parameters, such as GDP development (Natarajan and Levermore 2007). The physical model and the statistical model can be used to analyse the correlation between building characteristics and associated building performance, such as thermal comfort, energy consumption, and life cycle costs (Hong et al. 2020). In urban building energy consumption analysis, modelling techniques based on physical characteristics are widely used. Most physical measurement models calculate each device's terminal energy consumption based on the type, quantity, and power usage characteristics of the building equipment to estimate the building's energy consumption (Enkvist et al. 2007). For example, Granade et al. (2009) used physical characteristics such as construction equipment components, indoor temperature settings, building shape, and innovative design technologies to investigate the energy consumption impact of renovation and transformation technologies. The statistical model relies on various input parameters to obtain impact parameters by mathematical methods (Zhang 2004). These data are mainly derived from building energy bill data and field survey results (Swan et al. 2009; Keçebaş and Yabanova 2012) to provide flexible and reasonable predictions using macro data simulation (Wilson and Swisher 1993; Booth and Choudhary 2012). However, each type of building prototype model has limitations, which are summarised in Table 1.
In China, there is a lack of building prototyping research for its residential building stock. Earlier studies have mainly focused on either urban spatial layouts (e.g., Li and Yang 2019; Zhang et al. 2019) or building energy assessments and indoor environmental quality (e.g., Ma 2017; Hong et al. 2020). For example, Zhou (2008) investigated the energy consumption of different building types in Chongqing. Dong (2013) employed building construction years, building types, and space layout as parameters to classify and summarise residential buildings in Guangzhou. Additionally, Hong et al. (2020) used a hierarchical method to investigate low-rise office buildings in Shanghai based on building dimensions and floor areas.
Most of the earlier studies adopted a small-scale survey of building samples, which was not statistically feasible (SRC 1985; Akbari et al. 1989; Hernandez et al. 2008; Hong et al. 2020). This would make the building prototypes less representative of the entire building stock within the study area and create difficulties in developing a building benchmark. Significantly, investigating thousands of buildings in a Chinese city involves an enormous input of time and resources, which is impossible in many cases. This limitation has promoted the need for an efficient approach to generate specific prototypical building information through a wider range of building samples. To fill this gap, this research proposes a hybrid classification approach to establish prototypical building geometry models. The proposed approach involves four steps: site selection; building sampling; classification parameter selection; and k-means clustering, which is the core of the research (see Figure 1). In this research, 367 residential buildings from 17 selected small residential districts (SRDs) in Yinzhou District, Ningbo, China, were investigated as examples to expound the approach.

Site selection
Ningbo is an important port city along the southeast coast of China and the economic centre south of the Yangtze River Delta (see Figure 2). The city covers an area of 9,816 km² with a population of 8.542 million (Ningbo Statistics Bureau 2020). Climatically, Ningbo is located in the hot summer and cold winter zone in China. It is a transition zone between the cold and hot climates, characterised by high humidity, heating demands in winter, cooling demands in summer, and inadequate solar flux. The municipal area of Ningbo comprises six districts, two prefectures, and two county-level cities. In this research, the small residential district (SRD, also known as "xiaoqu" in Chinese) was used as the basic spatial unit for the investigation. Designed by professional planners and architects, an SRD is a planned neighbourhood where residential buildings are integrated with communal facilities such as kindergartens, convenience shops, and communication infrastructure, all operated by a professional property management company. Currently, a significant proportion of urban residents in China live in an SRD (Bray 2006). Generally, the geographic boundaries and internal layouts of an SRD can be identified from Google Maps. In addition, data related to the SRDs in Ningbo, including the locations, the number of floors, and the number of residential buildings within an SRD, were obtained from Anjuke's online database (Anjuke is a distinguished real estate information service group with over 69 million independent users monthly). Table 2 shows the demographic and SRD information of each urban district in Ningbo. In total there are 4,775 SRDs in the city. Yinzhou, one of the six urban districts in Ningbo, has the largest number of SRDs, approximately four times as many as Jiangbei District or Zhenhai District. In addition, Yinzhou has a long history of development, with the earliest SRD dating back to 1990.
With many new SRDs developed in the last ten years, Yinzhou provides sufficient and diverse residential building types for sampling and analysis. Therefore, it is reasonable to focus the research on the Yinzhou District for further investigation.

Sampling
The number of SRDs in Yinzhou District (1,225) is too large to investigate in full; therefore, a stratified random sampling method was used to select the SRDs and their buildings. Stratified sampling yields a sample that is a microcosm of the population, with each element having the same selection probability. The age of the buildings has a significant impact on their performance. Buildings constructed in the same time frame usually have similar thermal characteristics (such as the heat transfer coefficient of the building envelope, solar heat gain coefficient of windows, and airtightness) since they had to meet the building regulations issued by the government at the time (Theodoridou et al. 2011b; Hong et al. 2020). Generally, building regulations are updated every five to ten years in China. Hence, this research adopted the major revisions of the building regulation Design Standards for Energy Conservation of Residential Buildings in Hot Summer and Cold Winter Climatic Zones (JGJ134-2010 2010) as the preliminary basis for its stratified random sampling of all SRDs in Yinzhou District. Since its enactment, JGJ134 has stipulated boundary design standards for residential buildings in hot summer and cold winter (HSCW) zones; it was updated in 1993, 2005, and 2015 (MOHURD 2020). Therefore, the SRDs were divided into four time frames: before 1993; from 1994 to 2005; from 2006 to 2015; and after 2016.
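The stratification by construction year can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function name `construction_time_frame` is hypothetical, and because the four time frames as stated leave the years 1993 and 2016 unassigned, the sketch assumes that 1993 falls in the first frame and 2016 in the last.

```python
def construction_time_frame(year):
    """Map a construction year to one of the four strata used for sampling.

    Assumption (not stated in the text): 1993 is grouped with the first
    frame and 2016 with the last one.
    """
    if year <= 1993:
        return "before 1993"
    if year <= 2005:
        return "1994-2005"
    if year <= 2015:
        return "2006-2015"
    return "after 2016"


# Hypothetical construction years, for illustration only.
for y in (1990, 2001, 2012, 2018):
    print(y, "->", construction_time_frame(y))
```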
The 1,225 SRDs in Yinzhou District were then assembled for further statistical analysis. The construction year of the SRDs and their proportions were used as coordinates to derive the SRD distribution in Ningbo. The results (see Figure 3) show that the building distribution conforms with the normal distribution. Therefore, each SRD can be considered a statistically independent normal random variable.
In statistics, the confidence level refers to the degree of reliability of an individual parameter with respect to a given proposition. In Eq. (1), z is the corresponding z-value; σ is the standard deviation; e is the allowable sampling error; N is the total sample size (all SRDs in Ningbo city); and n is the selected sample size (the number of SRDs in Yinzhou District).

$$ n_s = f n \qquad (2) $$

where n_s is the number of SRDs of each stratification that needs to be selected; and f is the ratio of the number of SRDs of each stratification from Yinzhou District to all SRDs in Ningbo.
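The proportional allocation expressed by Eq. (2) can be illustrated with a short Python sketch. It is a hedged example rather than the authors' procedure: the function name `proportional_allocation`, the stratum counts, and the sample size of 15 are all hypothetical, f is taken here as each stratum's share of the sampling frame, and the largest-remainder rounding is an assumption added so that the integer allocations sum exactly to n.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate n sample units across strata in proportion to stratum size.

    Implements n_s = f * n with f = stratum size / total size, then rounds
    with the largest-remainder rule (an assumption, not from the paper) so
    that the allocations add up to n. Integer arithmetic avoids float error.
    """
    total = sum(stratum_sizes.values())
    # Integer part of each quota f * n.
    allocation = {name: (size * n) // total for name, size in stratum_sizes.items()}
    # Remainders decide which strata receive the leftover units.
    remainders = {name: (size * n) % total for name, size in stratum_sizes.items()}
    leftover = n - sum(allocation.values())
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical stratum counts (NOT the surveyed data), for illustration only.
example_strata = {
    "before 1993": 100,
    "1994-2005": 400,
    "2006-2015": 500,
    "after 2016": 225,
}
example_allocation = proportional_allocation(example_strata, 15)
print(example_allocation)
```

Any supplementary units, such as the additions described in the next paragraph, would be applied after this allocation step.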
Given that the number of buildings constructed after 2015 slightly exceeds what the normal distribution predicts, two more buildings constructed after 2015 were added as a supplement in the analysis. The locations of the 17 selected SRDs are illustrated in Figure 4.
The buildings' geometrical information within the selected SRDs, such as orientation, building size, and building shape, was collected from the blueprints provided by the Yinzhou Building and Urban-Rural Development Bureau. In addition, to ensure a practical and accurate classification, three strategies were considered for simplifying the building prototypes:
• Insignificant minor details (e.g., a decorative feature on the surface, attached features, and balconies) were ignored.
• The irregular buildings with complicated forms were disassembled into smaller separate simple geometries.
• All the residential buildings were represented by a combination of simple geometries using their dimensions of height (h), length (l) and depth (d).
The data collection was conducted between 25 September 2019 and 31 May 2020. After an initial screening, 367 residential buildings in these 17 SRDs were selected. The statistical analysis of these buildings is provided in Table 3.

Selection of the classification parameters
In the first stage of building prototyping development, a single parameter was used to classify the building stock. A literature search used keywords including "residential building" to identify relevant research projects. These studies included peer-reviewed articles from journal databases and emerging research databases (e.g., Passive House Institute). Additionally, other sources consulted included academic books, dissertations, government documents, and conference proceedings.
In total, fifty-eight publications were identified, and an analysis of the building classification parameters adopted in these studies and their frequencies of use can be seen in Figure 5. These parameters can be grouped into six main areas (see the inner circle in Figure 5): building construction year; building physical information; user information; surrounding environments; internal conditions; and life cycle assessment. Among them, building physical information was most commonly used for building classification. More specifically, building physical information includes the general building features (e.g., window-to-wall ratio and shape coefficient), geometric data, and thermal properties of the envelope.

Note: l is the building length, d is the building width, h is the building height, and Cf is the building shape coefficient. According to GB50352-2015 (2015), a low-rise building has one to three storeys, a multi-storey building has four to six storeys, a mid-rise building has seven to nine storeys, and a high-rise building has over ten storeys.

Fig. 5 Proportion of residential building classification parameters based on an extensive literature review
The Typology Approach for Building Stock Energy Assessment (TABULA) is an Intelligent Energy Europe Programme (IEE) project to create a harmonised structure for European residential buildings. The TABULA project used construction year, building size, type and age of the supply system, and regional location as classification parameters. The construction year was already selected for SRD stratified sampling in this research. Given that the aim of this research was to develop building geometry models, the classification parameters proposed were the number of floors, shape coefficient, average floor area, and window-to-wall ratio.
• Floor numbers: A building can be assigned to different types according to its number of floors, which is affected by the height limits for each floor and for the whole building. Residential buildings can be categorised by the number of floors based on the Uniform Standards for Residential Building Design (GB 50352-2005 2005). According to this document, there are four types: low-rise buildings (1-3 storeys); multi-storey buildings (4-6 storeys); mid-rise buildings (7-9 storeys); and high-rise buildings (over 10 storeys).
• Shape coefficient (Cf): The building surface area and volume determine the building shape coefficient. Buildings can also be classified into point-style (l/d < 2) and stripe-style (l/d > 2) buildings by the ratio of length (l) to depth (d), with different Cf limitations.
• Average floor area: The average floor area of a building is selected as the primary form of floor area, an essential building characteristic widely used for building typology (Wan and Yik 2004; Peri et al. 2013). Because the layout of each floor is similar in a residential building, the average floor area is mainly determined by the building's length and depth.
• Window-to-wall ratio (WWR): WWR is the ratio of glazed area to the building elevation area. This parameter has a significant effect on building energy consumption performance. Its classification is based on the Design Standards for Energy Conservation of Residential Buildings in Hot Summer and Cold Winter Climatic Zones (JGJ134-2010 2010), which gives the boundary values for different building facades depending on the orientation.
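The four classification parameters can be computed for a simplified box-shaped building as in the following Python sketch. It is illustrative only and rests on several assumptions that the text does not state: the shape coefficient is taken as the exposed surface area (four walls plus roof, ground contact excluded) divided by the enclosed volume; the boundary cases l/d = 2 and exactly ten storeys, which the text leaves open, are assigned to the stripe-style and high-rise classes respectively; and all function names and input values are hypothetical.

```python
def shape_coefficient(l, d, h):
    """Exposed surface area divided by enclosed volume for an l x d x h box.

    Assumption: the exposed surface is the four walls plus the roof; the
    ground-contact face is excluded.
    """
    surface = 2 * (l + d) * h + l * d
    volume = l * d * h
    return surface / volume


def plan_style(l, d):
    """Point-style if l/d < 2, otherwise stripe-style (l/d = 2 is an assumed case)."""
    return "point-style" if l / d < 2 else "stripe-style"


def storey_class(floors):
    """Storey categories as listed in the text; 10 storeys is assumed high-rise."""
    if floors <= 3:
        return "low-rise"
    if floors <= 6:
        return "multi-storey"
    if floors <= 9:
        return "mid-rise"
    return "high-rise"


def window_to_wall_ratio(glazed_area, elevation_area):
    """Ratio of glazed area to the building elevation area."""
    return glazed_area / elevation_area


# Hypothetical building: 40 m long, 12 m deep, 6 storeys of 3 m each.
l, d, floors, storey_height = 40.0, 12.0, 6, 3.0
h = floors * storey_height
print("Cf:", round(shape_coefficient(l, d, h), 3))
print("style:", plan_style(l, d))
print("class:", storey_class(floors))
print("average floor area (m2):", l * d)
print("WWR (south, hypothetical glazing):", window_to_wall_ratio(216.0, l * h))
```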

k-means clustering
As a basic machine-learning algorithm, the k-means method can effectively cluster a population described by multiple defined variables (Shi et al. 2021). This research used the k-means algorithm to perform clustering based on the classification parameters of the selected building samples. The elbow method was then used to identify the optimal cluster number from the distortion. The k-means method aims to minimise the squared error between each sample and its centroid. The distortion is the sum of the squared distances between each sample point and the centroid of the cluster to which it belongs. A smaller distortion means a higher degree of aggregation and vice versa. When the selected number of clusters (k) is insufficient, the distortion decreases sharply each time k increases by one. After a certain critical point, the distortion improves only marginally and decreases slowly. This critical point can be determined by the elbow method to select the number of clusters. At that point, the information of the centroids generated can be directly used for prototype modelling. The distortion for a given k value can be calculated using Eq. (3):

$$ J(k) = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 \qquad (3) $$

where x is a clustering parameter that can be considered as one variable, C_i is the i-th cluster, and μ_i is the centroid of C_i.
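The elbow criterion described above can be made less subjective by inspecting the relative decrease in distortion as k grows. The sketch below is one possible reading, not the authors' procedure: the function names, the 1% threshold, and the distortion values are all hypothetical, and the relative decrease is assumed to be measured against the distortion at the smaller k.

```python
def relative_drops(distortions):
    """Fractional decrease in distortion when going from k to k + 1.

    distortions[i] is the distortion obtained with k = i + 1 clusters.
    Returns a dict mapping k to (J(k) - J(k + 1)) / J(k).
    """
    drops = {}
    for k in range(1, len(distortions)):
        previous, current = distortions[k - 1], distortions[k]
        drops[k] = 0.0 if previous == 0 else (previous - current) / previous
    return drops


def pick_elbow(distortions, threshold=0.01):
    """Smallest k whose next step improves distortion by no more than threshold.

    Returns None if every step still improves by more than the threshold.
    """
    for k, drop in sorted(relative_drops(distortions).items()):
        if drop <= threshold:
            return k
    return None


# Hypothetical distortion curve for k = 1..5 (NOT the values behind Figure 6).
example_curve = [100.0, 50.0, 30.0, 29.8, 29.7]
print(pick_elbow(example_curve))
```

A threshold-based rule of this kind is only a convenience; on a smooth curve, the chosen k remains sensitive to the threshold value.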
In this research, four parameters were proposed for classifying and developing building geometry models. These parameters were treated as multidimensional vectors, which were also called observations in the k-means clustering.
Multidimensional k-means clustering for prototyping follows these steps for determining the centroids:
1) Select k initial centroids randomly; each centroid defines a cluster.
2) Assign each observation to its nearest centroid to form new clusters.
3) Recalculate the centroid of each cluster. The new centroid is the average vector of all observations in the cluster.
4) Repeat steps two and three until the centroids no longer change or the maximum number of iterations is reached.
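The four steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the research: the function names are hypothetical, the initial centroids are drawn from the observations with a fixed seed, and no feature scaling is applied because the text does not say whether the parameters were normalised. The sample observations are invented for demonstration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise average of a non-empty list of vectors."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def kmeans(observations, k, max_iter=100, seed=0):
    """Basic k-means; returns (centroids, labels, distortion)."""
    rng = random.Random(seed)
    # Step 1: pick k initial centroids at random from the observations.
    centroids = rng.sample(observations, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every observation to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in observations
        ]
        # Step 4: unchanged assignments imply unchanged centroids, so stop.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its cluster members.
        for j in range(k):
            members = [p for p, lab in zip(observations, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_vector(members)
    # Distortion: sum of squared distances to the assigned centroids.
    distortion = sum(
        squared_distance(p, centroids[lab]) for p, lab in zip(observations, labels)
    )
    return centroids, labels, distortion


# Invented observations: (floors, shape coefficient, floor area in m2, WWR).
sample = [
    (3, 0.55, 180.0, 0.25), (2, 0.60, 160.0, 0.22),
    (6, 0.35, 520.0, 0.30), (6, 0.33, 540.0, 0.28),
    (18, 0.25, 610.0, 0.35), (20, 0.24, 650.0, 0.33),
]
centroids, labels, distortion = kmeans(sample, k=3)
print(labels, round(distortion, 2))
```

Without scaling, the floor area dominates the distance because of its larger numeric range; whether and how the observations should be normalised is a design choice left open here.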

Clustering and prototyping results
By calculating the distortion against the number of clusters using Python software, the distortion value converges after k reaches nine (see Figure 6). This means that at least nine clustering groups need to form for this sampling population. The centroid value of each observation can represent the characteristics of the nine clusters. A small number of outliers appeared during the clustering process. These outliers were insufficient to form a new cluster and had a weak impact on the cluster results. Table 4 presents the nine building models generated by using the IES-VE software and the associated building characteristics. These models comprised three low-rise prototypes, two multi-storey building prototypes, one mid-rise prototype, and three high-rise building prototypes.

Fig. 6 Results of the elbow method analysis using Python software
The most significant factors affecting prototyping are the average floor area for low-rise and multi-storey buildings, and the building height and number of floors for high-rise buildings. Additionally, the shape coefficient of the low-rise prototypes is generally larger than that of the other prototypes. Table 5 shows a comparison between the shape coefficients and WWRs of the building models and the specifications required in JGJ134-2010 (2010). The nine prototypes all met those requirements, which means that the clustering results were reasonable, and the prototypical building models were statistically representative of the collection of buildings being investigated. They are therefore useful as reference buildings to further evaluate the building performance and effectiveness of building policies.

Discussion
Residential buildings that differ in construction period or shape, or that are equipped with different appliances and building systems, may perform differently with the same design technologies (Filogamo et al. 2014). Significantly, the modern built environment has become more complex in terms of building typologies and environmental systems. This complexity causes difficulties in selecting appropriate technologies to optimise building performance for different building typologies. The geometric characteristics of residential buildings in different regions can also be influenced by local contextual conditions (e.g., climate, culture, building regulations). However, they tend to have similar geometric characteristics if constructed in the same time frame and the same city. As a case study for building prototyping, the Yinzhou District comprises the largest number of SRDs in Ningbo and covers a wide range of residential buildings built in different time frames. The nine building models generated from the research are based on a statistical analysis of the SRDs and buildings that were investigated. Importantly, this research proposes a new hybrid approach for generating prototypical building models. It is a bottom-up approach that involves building geometrical parameters for clustering and prototyping. Compared with other prototyping approaches, the hybrid approach is more convenient and effective, and can be widely used. It does not need long-term statistics and can generate statistically reliable building models. Regarding accuracy, the plotted curve in Figure 6 is smooth, which makes determining the optimal cluster number at the elbow point somewhat subjective. The change rate of the distortion as k increases from nine to ten is 1%, which means that the nine models developed are statistically acceptable.

Note: The data listed above are the clustering results of each centroid; the data for the number of storeys were rounded to an integer when generating building models.
Based on a new hybrid approach, this research has developed nine prototypical building geometry models. The nine models can simultaneously express their corresponding building populations' characteristics in terms of geometric characteristics, which are related to building performance. Additionally, the approach can be expanded to other cities to generate prototypical local residential building models. In short, this research:
• Provides an effective way to develop statistically sound building models for evaluating building performance using simulation software. For example, retrofitting strategies can be customised and optimised for a particular building prototype.
• Generates a database for the building geometric characteristics of building stock.
• Provides a reference for undertaking life cycle assessments for new city block subdivisions and city regeneration, which includes changes in the proportion of regional buildings and the adoption and optimisation of regional functional systems.

Conclusion
This research proposes a hybrid approach for developing building geometry models that combines statistical, mathematical and physical measurement methods. In this research, nine residential building models were identified within the case study city of Ningbo, China. With these building models, further performance evaluation can be performed with simulation software by adding additional information such as solar information, building system information, and occupancy profiles. This study is part of a research project to provide an integrated framework for developing specific retrofitting strategies for residential buildings. The establishment of building models will facilitate further building performance evaluation and optimisation. Future research will enable the existing building prototyping models to be further improved or subdivided through more data support. A sensitivity analysis of the correlation between various building technologies and associated energy performance can be performed using these building models, which can help provide planning assistance for future sustainable city development. The challenges of selecting optimised design technologies for different building types again highlight the importance of creating building prototypes.

where x is the building construction year and n is the number of buildings constructed.
The raw data of building information (an Excel file) is in the Electronic Supplementary Material of the online version of this paper as additional supporting data.

Author contribution statement
Wu Deng conceived and designed the research analysis and revised the paper. Yuanli Ma contributed to the research analysis design and partook in data collation and processing and wrote the paper. Jing Xie, Professor Tim Heath and Yuanda Hong contributed to the design of the research analysis and paper revision. Yeyu Xiang contributed to the data support.