Ibrahim, Osman Ali Sadek
(2017)
Evolutionary algorithms and machine learning techniques for information retrieval.
PhD thesis, University of Nottingham.
Abstract
In Artificial Intelligence research, Evolutionary Algorithms and Machine Learning (EML) techniques play a fundamental role in optimising Information Retrieval (IR). However, numerous studies have not considered the limitations of using EML when an IR system is first being established, while others have compared EML techniques by presenting only overall final results, without analysing important experimental settings such as the training or evolving run-times against the IR effectiveness obtained. Furthermore, most papers describing research on EML techniques in the IR domain have not considered the memory requirements of applying such techniques. This thesis seeks to address some of the research gaps in applying EML techniques to IR systems. It also proposes applying a (1+1)-Evolutionary Strategy ((1+1)-ES), with and without a gradient step-size, to achieve improvements in IR systems. The thesis starts by identifying the limitation of applying EML techniques when an IR system is first being established: all IR test collections are only partially judged, and only for some user queries. This means that the majority of documents in the IR test collections have no relevance labels for any of the user queries. These relevance labels are needed to check the quality of the evolved solution in each evolving iteration of the EML techniques. Thus, this thesis introduces a mathematical approach to use instead of an EML technique in the early stage of establishing the IR system. It also shows the impact of the pre-processing procedure on this mathematical approach. The heuristic limitations of IR processes, such as the pre-processing procedure, motivate the use of EML techniques to optimise IR systems once the relevance labels have been gathered. This thesis proposes a (1+1)-Evolutionary Gradient Strategy ((1+1)-EGS) to evolve Global Term Weights (GTW) in IR documents. The GTW is a value assigned to each index term to indicate the topic of the documents.
It reflects the power of the term to discriminate between documents in the same collection. The (1+1)-EGS technique is applied in two methods: a fully evolved procedure and a partially evolved procedure. Of the two, the partially evolved method outperformed the mathematical model (Term Frequency-Average Term Occurrence (TF-ATO)), the probabilistic model (Okapi-BM25) and the fully evolved method. The evaluation metrics for these experiments were the Mean Average Precision (MAP), the Average Precision (AP) and the Normalised Discounted Cumulative Gain (NDCG).
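The evaluation metrics named above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names are hypothetical, AP is normalised here by the number of relevant documents that appear in the ranked list (an assumption, since some definitions divide by all relevant documents in the collection), and NDCG uses the common (2^gain - 1) / log2(rank + 1) formulation, which may differ from the variant used in the experiments.

```python
import math

def average_precision(relevance):
    """AP for one ranked list; relevance holds 1 (relevant) or 0 per rank."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant rank
    # Assumption: normalise by the relevant documents found in the list.
    return total / hits if hits else 0.0

def mean_average_precision(ranked_lists):
    """MAP: the mean of AP over all queries."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

def ndcg(gains, k=None):
    """NDCG over the first k ranks; gains are graded relevance labels."""
    k = len(gains) if k is None else k

    def dcg(values):
        return sum((2 ** g - 1) / math.log2(i + 2)
                   for i, g in enumerate(values[:k]))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

For example, `average_precision([1, 0, 1])` averages the precision at ranks 1 and 3, and `ndcg` returns 1.0 for any list that is already sorted by decreasing relevance.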
Another important process in IR is supervised Learning to Rank (LTR) on fully judged datasets, after the relevance labels have been gathered from user interaction. The relevance labels indicate the degree to which every document is relevant or irrelevant to a user query. LTR is a current problem in IR that attracts considerable attention from researchers. The LTR problem is mainly about ranking the retrieved documents in search engines, question answering and product recommendation systems. There are a number of LTR approaches from the areas of EML. Most of them are limited by being too slow, not being very effective, or presenting too large a problem size. This thesis investigates a new application of a (1+1)-Evolutionary Strategy with three initialisation techniques, resulting in three algorithm variations (ES-Rank, IESR-Rank and IESVM-Rank), to tackle the LTR problem. Experimental results comparing the proposed method with fourteen EML techniques from the literature show that IESR-Rank achieves the best overall performance. The experiments used ten datasets (the MSLR-WEB10K dataset, the LETOR 4 datasets and the LETOR 3 datasets) and five performance metrics: Mean Average Precision (MAP), Root Mean Square Error (RMSE), Precision (P@10), Reciprocal Rank (RR@10) and Normalised Discounted Cumulative Gain (NDCG@10) at the top-10 query-document pairs retrieved. Finally, this thesis presents the benefits of using ES-Rank to optimise an online click model that simulates user click interactions. Overall, the contribution of this thesis is an effective and efficient EML method for tackling various processes within IR. The thesis advances the understanding of how EML techniques can be applied to improve IR systems.
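The general idea of a (1+1)-Evolutionary Strategy applied to ranking can be sketched as follows. This is a generic textbook-style illustration, not the thesis's ES-Rank, IESR-Rank or IESVM-Rank: the all-zero starting vector, the fixed Gaussian mutation step `sigma`, the precision-at-1 fitness and the toy data are all assumptions made for the example, and the function names are hypothetical.

```python
import random

def precision_at_1(weights, queries):
    """Fraction of queries whose top-scored document is relevant.

    queries: list of queries, each a list of (feature_vector, label) pairs.
    """
    correct = 0
    for docs in queries:
        top = max(docs, key=lambda d: sum(w * x for w, x in zip(weights, d[0])))
        correct += top[1]
    return correct / len(queries)

def one_plus_one_es(fitness, dim, iterations=500, sigma=0.1, seed=0):
    """(1+1)-ES: one parent, one mutated child per iteration."""
    rng = random.Random(seed)
    parent = [0.0] * dim  # assumption: all-zero initial weight vector
    parent_fit = fitness(parent)
    for _ in range(iterations):
        # Mutate every weight with Gaussian noise (assumed mutation scheme).
        child = [w + rng.gauss(0.0, sigma) for w in parent]
        child_fit = fitness(child)
        # Keep the child only if it is at least as good as the parent.
        if child_fit >= parent_fit:
            parent, parent_fit = child, child_fit
    return parent, parent_fit

# Toy data (hypothetical): feature 0 signals relevance, feature 1 does not.
toy_queries = [
    [([0.0, 1.0], 0), ([1.0, 0.0], 1)],
    [([0.2, 0.9], 0), ([0.9, 0.1], 1)],
]
best_weights, best_fit = one_plus_one_es(
    lambda w: precision_at_1(w, toy_queries), dim=2)
```

Because a child replaces the parent only when its fitness is not worse, the fitness of the retained solution never decreases across iterations, and only one weight vector needs to be held in memory at a time.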