A tree-based measure for hierarchical data in mixed databases

Hassan, Diman (2016) A tree-based measure for hierarchical data in mixed databases. PhD thesis, University of Nottingham.


Abstract

The structure of the data in a mixed database can be a barrier to clustering that database into meaningful groups. A hierarchically structured database requires efficient distance measures and clustering algorithms to locate similarities between data objects. The existing literature therefore proposes hierarchical distance measures to quantify the similarity between records in hierarchical databases.

The main contribution of this research is a new distance measure for large hierarchical databases containing mixed data types and attributes, created and tested on the basis of an existing tree-based (hierarchical) distance metric, the pq-gram distance metric. Several aims and objectives were pursued to fill gaps in the current body of knowledge. One goal was to verify the validity of the pq-gram distance metric on different data sets, and to compare and combine it with a number of other distance measures to demonstrate its usefulness across large mixed databases. To achieve this, further work explored how to use the existing method as a measure for hierarchical attributes in mixed data sets, and whether the new method would produce better results on large mixed databases. For evaluation, the pq-gram metric was first applied to The Health Improvement Network (THIN) database to determine whether it could identify similarities between the records in the database. It was then examined on mixed data alongside other distance measures, both non-hierarchical and hierarchical, and these measures were combined to create a Combined Distance Function (CDF).
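To make the pq-gram idea concrete, the following is a minimal sketch of a pq-gram distance between two small labelled, ordered trees. It follows the commonly cited definition (a bag of stem-plus-base label tuples per tree, and a distance of one minus twice the bag intersection over the bag union); the abstract does not state which variant or parameters the thesis uses, so the tree encoding as (label, children) tuples, the defaults p=2 and q=3, the function names, and the example labels are all assumptions made for illustration.

```python
from collections import Counter

NULL = "*"  # placeholder label for missing ancestors or children


def pq_gram_profile(tree, p=2, q=3):
    """Return the bag (Counter) of pq-grams of a tree given as (label, [children])."""
    profile = Counter()

    def visit(node, stem):
        label, children = node
        stem = stem[1:] + (label,)      # shift the node's label into the stem
        base = (NULL,) * q
        if not children:
            profile[stem + base] += 1   # a leaf contributes one all-null base
            return
        for child in children:
            base = base[1:] + (child[0],)
            profile[stem + base] += 1
            visit(child, stem)
        for _ in range(q - 1):          # slide the window past the last child
            base = base[1:] + (NULL,)
            profile[stem + base] += 1

    visit(tree, (NULL,) * p)
    return profile


def pq_gram_distance(t1, t2, p=2, q=3):
    """Normalised pq-gram distance: 0 for identical profiles, 1 for disjoint ones."""
    p1, p2 = pq_gram_profile(t1, p, q), pq_gram_profile(t2, p, q)
    shared = sum((p1 & p2).values())               # bag intersection size
    total = sum(p1.values()) + sum(p2.values())    # bag union size
    return 1.0 - 2.0 * shared / total


# Hypothetical toy trees that differ in one leaf label.
tree_a = ("a", [("b", []), ("c", [])])
tree_b = ("a", [("b", []), ("d", [])])
print(pq_gram_distance(tree_a, tree_a))
print(pq_gram_distance(tree_a, tree_b))
```

Because the profile is a bag of fixed-size tuples, two records can be compared without computing a full tree edit distance, which is the property that makes this family of measures attractive for large databases.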

The CDF improved the results when applied to different data sets, such as the hierarchical United States National Bureau of Economic Research (NBER US) patent data set and the mixed THIN data set. The CDF was then modified to create a New-CDF, which used only the hierarchical pq-gram metric to measure the hierarchical attributes in the mixed data set. Applied to the THIN data set, the New-CDF worked well: it found the most similar data records, which were grouped into one cluster by the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) clustering algorithm. Cluster quality was assessed using two internal validation indices, Silhouette and C-Index, whose values indicated compact, good-quality clusters for the new method.
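The two validation indices named above can be computed directly from a pairwise distance matrix and a cluster labelling. The sketch below uses the standard textbook definitions of the mean Silhouette width and the C-Index; it does not reproduce the thesis's BIRCH step or its data, and the function names, the list-of-lists matrix layout, and the four-point toy example are assumptions made for illustration. A higher Silhouette and a lower C-Index indicate more compact, better separated clusters.

```python
from itertools import combinations


def c_index(dist, labels):
    """C-Index: (S - S_min) / (S_max - S_min); values near 0 are better."""
    pairs = list(combinations(range(len(labels)), 2))
    within = [dist[i][j] for i, j in pairs if labels[i] == labels[j]]
    all_sorted = sorted(dist[i][j] for i, j in pairs)
    n_w = len(within)
    s = sum(within)
    s_min = sum(all_sorted[:n_w])                      # n_w smallest distances
    s_max = sum(all_sorted[len(all_sorted) - n_w:])    # n_w largest distances
    if s_max == s_min:
        return 0.0  # degenerate case: the index is undefined, report 0
    return (s - s_min) / (s_max - s_min)


def silhouette(dist, labels):
    """Mean silhouette width over all points; values near 1 are better."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: four points on a line forming two tight groups.
points = [0, 1, 10, 11]
dist = [[abs(x - y) for y in points] for x in points]
labels = [0, 0, 1, 1]
print(c_index(dist, labels), silhouette(dist, labels))
```

Since both functions take only a distance matrix, any record-level measure, including a pq-gram based one, could in principle be plugged in without changing the validation code.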

Item Type: Thesis (University of Nottingham only) (PhD)
Supervisors: Aickelin, Uwe; Wagner, Christian
Keywords: databases, database, tree-based, data
Subjects: Q Science > QA Mathematics > QA 75 Electronic computers. Computer science
Faculties/Schools: UK Campuses > Faculty of Science > School of Computer Science
Item ID: 34652
Depositing User: HASSAN, DIMAN
Date Deposited: 19 Jan 2017 12:52
Last Modified: 19 Dec 2017 04:01
URI: https://eprints.nottingham.ac.uk/id/eprint/34652
