Salama, Ayman
(2025)
An investigation of neural information retrieval based temporal embeddings and knowledge extraction for scientific insight generation.
PhD thesis, University of Nottingham.
Abstract
Humans today generate information at an unprecedented rate, leading to a vast accumulation of knowledge. This immense amount of data poses challenges in extracting both explicit information and implicit knowledge, predicting scientific events, and investigating scientific trends and trajectories. For decision-makers, it is particularly challenging to keep pace with this abundance and make informed decisions grounded in quantitative, up-to-date data. Current state-of-the-art methods, such as those employing Word2Vec models for semantic analysis, face several technical challenges when applied to vast and diverse datasets, including scalability issues, contextual ambiguity, difficulty in capturing implicit and explicit knowledge, and challenges with temporal embeddings. These models are also susceptible to bias and noise, and the impact of hyperparameter tuning on knowledge representation can lead to inconsistent results, further complicating their reliability for informed decision-making. While some existing research addresses the problem of knowledge evolution, it often falls short in handling large datasets, leading to scalability issues. Many studies focus primarily on changes in word meaning rather than the evolving relationships among entities, neglecting the broader context of temporal knowledge evolution. Additionally, the use of default hyperparameters in these models often overlooks their sensitivity to implicit and explicit knowledge extraction, resulting in inconsistent outcomes. Moreover, some techniques focus too narrowly on event detection through specific text classifications, limiting their broader applicability in understanding complex knowledge dynamics over time. To date, there is no coherent and efficient framework to represent, extract, and discover large-scale temporal knowledge.
Therefore, this research aims to explore efficient, large-scale methodologies for extracting both explicit and implicit knowledge from extensive scientific literature and to build a big data, cloud-based knowledge evolution framework to identify scientific discoveries and their trajectories. Firstly, to build a large-scale knowledge evolution framework, it is important to study methods for efficient and accurate knowledge representation. To this end, as a case study, we investigate various embedding techniques for identifying relationships between underutilized crops and their attributes, such as global interest, vernacular and scientific names, and soil requirements. Word2Vec embeddings were analysed on extensive Wikipedia datasets in multiple languages, and we found that our approach effectively identified relationships between underutilized crops and their attributes. These findings were then compared with an international database of crop characteristics, which showed a 76.11% accuracy in predicting soil classifications using scientific crop names, surpassing semantic relationship extraction methods in the literature. Following this case study, we identified a suitable embedding method for accurate knowledge representation, addressing the scalability issue by leveraging a cloud-based framework capable of processing vast datasets efficiently and maintaining high predictive accuracy. To create a coherent temporal representation of knowledge and classify it effectively, it is vital to generate high-quality embeddings for these temporal vector spaces. To this end, we evaluated various hyperparameter combinations of Word2Vec (the embedding method identified for achieving the first objective) against diverse Deep Neural Network (DNN) architectures to gauge their impact on classification tasks using the Amazon Customer Review dataset.
The results from this dataset, which demonstrated a generic use case for understanding the implications of hyperparameters on downstream tasks such as classification, suggest that the hyperparameter tuning method and values could be effectively applied to the process of creating the temporal vector spaces, ensuring reliable knowledge representation across varying contexts and timeframes. Finally, based on the findings from investigations on Word2Vec-based knowledge representation and its hyperparameter tuning for optimized representation for downstream tasks, we propose our Continuous Knowledge Evolution Construction method, where knowledge is represented with its temporal element. The temporal vector space is a dynamic framework that captures the evolution of knowledge over time by constructing word embeddings that reflect changes in language and concept significance across different periods. In this method, Word2Vec is used to generate embeddings for each time slice of the dataset, such as monthly intervals. These embeddings are then aligned sequentially to form a continuum, allowing us to observe and analyse how the meanings and relationships of words and concepts shift over time. This approach enables the tracking of scientific trends, the emergence of new research areas, and the fading of older topics, providing a comprehensive view of the temporal evolution of knowledge within a given field. We evaluated the proposed framework on 2.3 million publications from the arXiv dataset, spanning science, mathematics, and physics. The framework generated hundreds of millions of predictions and captured relationships between scientific entities over time. We examined thousands of scientific entities, tracking them throughout our temporal knowledge evolution framework. Given the sheer volume of our findings, we relied on large-scale, cloud-based data analytics tools for data processing, storage, and visualization. 
This approach allowed us to establish 25 million temporal relationships between scientific entities, pinpoint specific scientific events, comprehend the trajectories of certain scientific trends, and discern overarching patterns in the evolution of science. The results highlighted that 76.2% of the scientific events and trends studied showed low variance, indicating a high level of predictive accuracy. To conclude, our research proposes a cloud-based temporal knowledge evolution framework to analyse a large-scale corpus of scientific literature. Through our investigation of text mining, embedding techniques, and DNNs, we elucidated implicit and explicit knowledge, uncovered millions of temporal relationships, and identified key scientific events and trends. These results highlight the critical role of neural information retrieval of large-scale data in shaping our understanding of scientific knowledge evolution and aiding informed decision-making.
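The abstract describes building one embedding space per time slice and then following how the relationship between two scientific entities changes from slice to slice. The Python sketch below illustrates that idea only; it is not the thesis implementation. It assumes each slice has already been reduced to a dictionary of entity vectors (in practice these would come from a Word2Vec model trained on that slice), and the entity names, slice labels, toy vectors, and function names are all hypothetical. Similarity is computed within each slice, so this particular measurement does not depend on how the slices are aligned to one another.

```python
import math
from statistics import pvariance


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def relationship_trajectory(slices, entity_a, entity_b):
    """Return [(slice_label, similarity)] for every slice containing both entities.

    `slices` is an ordered list of (label, {entity: vector}) pairs,
    one pair per time slice (for example, one per month).
    """
    trajectory = []
    for label, vectors in slices:
        if entity_a in vectors and entity_b in vectors:
            sim = cosine(vectors[entity_a], vectors[entity_b])
            trajectory.append((label, sim))
    return trajectory


def trajectory_variance(trajectory):
    """Population variance of the similarities; 0.0 with fewer than two points."""
    sims = [sim for _, sim in trajectory]
    return pvariance(sims) if len(sims) > 1 else 0.0


# Hypothetical toy data: two-dimensional vectors standing in for
# per-month Word2Vec embeddings of two entities.
toy_slices = [
    ("2020-01", {"graphene": [1.0, 0.0], "superconductivity": [0.0, 1.0]}),
    ("2020-02", {"graphene": [1.0, 0.0], "superconductivity": [1.0, 1.0]}),
    ("2020-03", {"graphene": [1.0, 0.0], "superconductivity": [1.0, 0.0]}),
]

trajectory = relationship_trajectory(toy_slices, "graphene", "superconductivity")
for label, sim in trajectory:
    print(f"{label}: similarity = {sim:.3f}")
print(f"variance = {trajectory_variance(trajectory):.3f}")
```

In this toy data the similarity rises steadily across the three slices, which is the kind of pattern one might read as an emerging relationship. A trajectory whose similarity barely changes would have a variance near zero. How the thesis defines and thresholds "low variance" is not stated in the abstract, so no threshold is assumed here.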