Liu, Zeyang
(2024)
On the evaluation of conversational search.
PhD thesis, University of Nottingham.
Abstract
The rapid growth of speech technology has significantly influenced the way users interact with search systems to satisfy their information needs. Unlike textual interaction, Voice User Interfaces (VUIs) often encourage multiple rounds of interaction for reasons such as clarifying user input, error handling, progressive information disclosure, personalization, and completing complex tasks. At the same time, VUIs promote natural interaction by ensuring accurate speech recognition, maintaining context throughout conversations, offering clear prompts and guidance, and providing feedback and confirmation. Meanwhile, with the widespread application of large language models (LLMs), information retrieval systems such as Bing Chat and Google Bard increasingly engage in dialogue with users. In comparison to the traditional query-document paradigm, conversational search systems encourage users to express their search tasks in natural language and to interact over multiple rounds. By adopting this dialogue-like form, these systems can improve accuracy and better understand user intent. However, a conversational search system is, at its core, still an information retrieval (IR) system, whose aim is to provide users with the information that satisfies their information needs. This interaction paradigm is more natural for humans and can help users better articulate their information needs. In turn, however, it increases the complexity of understanding users' intent and underlying information needs across multi-turn interactions. Further, this added complexity means that the evaluation of conversational search systems is also not straightforward. One challenge is that the number of possible user utterances and system responses is effectively unbounded, so it is difficult, if not impossible, to evaluate effectiveness with a static, finite set of relevance labels. The evaluation of a conversational search system therefore remains an open, non-trivial challenge. This thesis aims to address some aspects of this challenge and to contribute towards building a comprehensive and replicable evaluation framework for conversational search.
In previous work, a typical evaluation framework for IR research (e.g., the Cranfield paradigm) comprises both the construction of a test collection and the design of evaluation metrics. In the context of conversational search, metrics should not only measure the similarity between candidate responses and a provided reference, but also consider the information carried by the dialogue context. Unlike traditional query-document IR collections (e.g., TREC), test collections for conversational search should contain dialogue contexts, queries, reliable references for those queries, and quality assessments.
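For illustration only, the sketch below shows what a single entry in such a conversational search test collection might look like; the field names and label scheme are assumptions made for this example, not the thesis's actual schema.

```python
# Illustrative sketch of one conversational search test-collection entry.
# Field names and the labelling scheme are assumptions for this example only.
example_entry = {
    "dialogue_context": [
        {"speaker": "user", "utterance": "I'm planning a walking trip in Scotland."},
        {"speaker": "system", "utterance": "Are you interested in the cities or the Highlands?"},
    ],
    "query": "What's the best time of year to visit the Highlands?",
    "references": [
        "Late spring to early autumn is generally the best time to visit the Highlands.",
    ],
    "quality_assessments": {
        # human labels for candidate system responses, keyed by response id
        "response_001": {"relevance": 2, "naturalness": 3},
    },
}
```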
This thesis consists of three parts: meta-evaluating existing metrics, developing a set of metrics, and building test collections. First, we review several representative metrics and meta-evaluate them in conversational search scenarios, aiming to better understand the evaluation process for conversational search and to expose the limitations of existing metrics.
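As a concrete, hypothetical illustration of one common meta-evaluation step, the sketch below measures rank correlation between an automatic metric's scores and human ratings; the numbers are invented, and correlation analysis is only one of the analyses a meta-evaluation might use.

```python
# Sketch of a meta-evaluation step: how strongly do an automatic metric's
# scores correlate with human quality ratings? The scores below are invented.
from scipy.stats import kendalltau, spearmanr

human_ratings = [3, 1, 4, 2, 5, 2, 4]                        # per-response human labels
metric_scores = [0.61, 0.22, 0.70, 0.35, 0.88, 0.40, 0.65]   # automatic metric outputs

rho, _ = spearmanr(human_ratings, metric_scores)
tau, _ = kendalltau(human_ratings, metric_scores)
print(f"Spearman rho: {rho:.3f}, Kendall tau: {tau:.3f}")
```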
Second, informed by this meta-evaluation, we investigate the underlying factors that influence quality assessments, with the aim of designing robust and reliable metrics that align more closely with human annotations. We first explore the impact of syntactic structure (e.g., part of speech) in the reference and find that part-of-speech (POS) information is effective in distinguishing response quality. We then propose POS-based metrics (POSSCORE) that capture such syntactic matching and achieve significantly better alignment with human preferences than baseline metrics. Moving forward, we consider the ongoing context of the conversation and develop a context-aware evaluation framework that captures the novelty and coherence of responses. By thoroughly analyzing the relationship between the ongoing utterance context and response quality, we improve the evaluation of conversational search by extending existing metrics into context-aware variants and by proposing novel context-aware metrics (COSS).
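To convey the general idea of POS-aware matching (the exact POSSCORE formulation, weights, and tagger are defined in the thesis; the spaCy pipeline and the weight values below are assumptions for this sketch):

```python
# Minimal sketch of POS-weighted matching between a candidate response and a
# reference. Weights and scoring scheme are illustrative, not the POSSCORE
# definition from the thesis.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed POS tagger

# Hypothetical weights: content-bearing tags count more than function words.
POS_WEIGHTS = {"NOUN": 1.0, "PROPN": 1.0, "VERB": 0.8, "ADJ": 0.6, "ADV": 0.4}

def pos_weighted_recall(candidate: str, reference: str) -> float:
    """Weighted recall of the reference's (lemma, POS) pairs in the candidate."""
    cand_pairs = {(t.lemma_.lower(), t.pos_) for t in nlp(candidate) if not t.is_punct}
    total = matched = 0.0
    for tok in nlp(reference):
        if tok.is_punct:
            continue
        weight = POS_WEIGHTS.get(tok.pos_, 0.1)  # small weight for other tags
        total += weight
        if (tok.lemma_.lower(), tok.pos_) in cand_pairs:
            matched += weight
    return matched / total if total else 0.0

print(pos_weighted_recall(
    "You can renew a UK passport online through the government website.",
    "Passports can be renewed online via the official government service."))
```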
Next, given that references remain an important part of the evaluation process, we propose a general framework to improve the effectiveness of reference-based metrics. Using chain-of-thought prompting, we demonstrate that chain-of-thought reasoning improves the performance of reference-based metrics in conversational search.
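The sketch below illustrates how chain-of-thought prompting might be wrapped around a reference-based judgement; the prompt wording, the 1-5 scoring scale, the model name, and the use of the OpenAI client are assumptions for this example, not the framework proposed in the thesis.

```python
# Sketch of a chain-of-thought prompt for a reference-based, LLM-assisted
# evaluator. Prompt text, scoring scale, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

COT_TEMPLATE = """You are evaluating a conversational search system.

Dialogue context:
{context}

Reference response:
{reference}

Candidate response:
{candidate}

Reason step by step: (1) identify the information need in the context,
(2) list the key facts in the reference, (3) check which of these the
candidate covers and whether it introduces errors. Then output a final
line of the form "Score: X", where X is an integer from 1 (poor) to 5
(excellent)."""

def cot_reference_score(context: str, reference: str, candidate: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user", "content": COT_TEMPLATE.format(
            context=context, reference=reference, candidate=candidate)}],
        temperature=0,
    )
    return response.choices[0].message.content
```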
Apart from the design of evaluation metrics, building test collections is another essential component of the evaluation framework. Since human annotation remains the most reliable method of labelling, constructing conversational search test collections still requires laborious and time-consuming effort, which makes the collections difficult to build and significantly limits their scale. To bridge this gap, we start by formalizing user behaviour patterns, with the goal of reusing existing collections and extending pseudo-relevance labels via rule-based methods. We then explore the possibility of employing large language models (LLMs) to reuse existing datasets. By leveraging LLMs and relevant documents, we design efficient approaches to obtain reliable pseudo references and quality pseudo-labels for candidate responses. After constructing the dataset, we meta-evaluate the performance of existing metrics on LLM-generated responses.
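As a hypothetical illustration of the pseudo-reference step, the sketch below conditions an LLM on documents already judged relevant in an existing collection; the prompt wording and model name are assumed, and the thesis's actual pipeline is more elaborate.

```python
# Sketch: generate a pseudo reference for a conversational query turn from
# documents judged relevant in an existing collection. Prompt wording and
# model name are illustrative assumptions, not the thesis's pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def pseudo_reference(context: str, query: str, relevant_docs: list[str]) -> str:
    docs = "\n\n".join(f"[Doc {i + 1}] {d}" for i, d in enumerate(relevant_docs))
    prompt = (
        "Using only the documents below, write a concise answer to the user's "
        "current query, taking the dialogue context into account.\n\n"
        f"Dialogue context:\n{context}\n\n"
        f"Current query: {query}\n\n"
        f"Relevant documents:\n{docs}\n\nAnswer:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```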
In conclusion, this thesis demonstrates the feasibility of developing automatic evaluation approaches for conversational search that achieve reproducible and reliable evaluation and align robustly with human annotations. One of the main contributions is a context-aware evaluation framework for conversational search, which addresses the challenge of evaluating systems that must generate relevant, informative, and engaging responses grounded in the ongoing context of the conversation.