Forum for Information Retrieval Evaluation

FIRE

Keynote Address

On Soft Permutation, Set Alignment, and Graph Matching
by Soumen Chakrabarti

SoftMax converts unconstrained logits into a multinomial distribution. Analogously, Sinkhorn-Knopp iterations convert an unconstrained square matrix into a doubly stochastic approximation to a permutation matrix. A differentiable network that proposes a soft permutation in response to input features is a powerful tool for optimal transport, applicable to diverse text and graph matching tasks. It enables us to featurize complex data artifacts into sets of embeddings, rather than single, fixed-length vectors, and learn to compare a `query' set against a `document' set, suitable for scoring tasks in retrieval and textual entailment. Warming up from set-of-vector retrieval as a modified form of ColBERT, we discuss vector subset retrieval and efficient indexing, then proceed to matching problems in graphs with rich node and edge features, even if Sinkhorn-Knopp iterations may not directly apply to all of them. Knowledge graphs (KGs) have entities (Einstein, relativity) for nodes and relations (discovered) for edges. Nodes and edges have canonical IDs, but are also known by numerous textual aliases in various languages. Aligning KGs from various languages, i.e., inferring that two entities or relations are one and the same, can help maintain a "super-KG" like WikiData. We will describe a formulation where KG alignment is in synergy with inference of missing edges, using vector set similarity at its heart. While hallucination by large language models (LLMs) is generally regarded as a nuisance, we have found that allowing an LLM to hallucinate a relational schema (tables, columns, foreign keys) from a natural language question, then aligning the hallucinated schema graph to the actual schema graph of the target database can improve schema retrieval, and thereby, text to SQL generation. Moving on to classical combinatorial graph problems, such as subgraph isomorphism, graph edit distance, and maximal clique, we build new network gadgets around graph neural networks (GNNs) and Sinkhorn-Knopps networks, leading to a series of related solutions to these problems. GNNs can help bypass intractable quadratic assignment problems and replace them with transportation as approximations, induced by contextual node or edge embeddings. We conclude with a few remarks on the ongoing evolution of architectures where text and graph encoders and decoders must interact to solve retrieval and generation tasks.
Relevant Paper

Joint Completion and Alignment of Multilingual Knowledge Graphs. With Harkanwar Singh, Shubham Lohiya, Prachi Jain and Mausam. EMNLP 2022.
Graph Edit Distance with General Costs Using Neural Set Divergence. With Eeshaan Jain, Indradyumna Roy, Saswat Meher and Abir De. NeurIPS 2024.
CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL. With Mayank Kothyari, Dhruva Dhingra, and Sunita Sarawagi. EMNLP 2023.
Locality Sensitive Hashing in Fourier Frequency Domain For Soft Set Containment Search. With Indradyumna Roy, Rishi Agarwal, Anirban Dasgupta, and Abir De. NeurIPS 2023.
Maximum Common Subgraph Guided Graph Retrieval: Late and Early Interaction Networks. With Indradyumna Roy and Abir De. NeurIPS 2022.
Interpretable Neural Subgraph Matching for Graph Retrieval. With Indradyumna Roy, Venkata Sai Velugoti and Abir De. AAAI 2022.

Evaluating Search for Systematic Review Creation
by Allan Hanbury

Systematic reviews play an important role in the medical domain by surveying the literature of randomised controlled trials to find evidence for the effectiveness of medications and medical interventions. Hence the creation of systematic reviews is a high impact application of Information Retrieval techniques, as the search results lead eventually to the establishment of guidelines for medical treatment of patients. The methodology for doing systematic reviews has been established for decades, but additional challenges in the area arise because of the increasing number of publications as well as rapid advancement in medical science and treatment. These challenges have led to work to partly automate the creation of systematic reviews. This talk introduces systematic reviews and also examines two aspects of the evaluation of search in systematic review creation: the first aspect deals with using pertinent evaluation metrics, while the second aspect concerns an evaluation based on the outcomes of randomised controlled trial publications.

Evaluating Large Language Models in Legal Domain
by Yiqun Liu

Large language models (LLMs) have made significant progress in natural language processing tasks and demonstrate considerable potential in the legal domain. However, legal applications demand high standards of accuracy, reliability, and fairness. Applying existing LLMs to legal systems without careful evaluation of their potential and limitations could pose significant risks in legal practice. In this talk, I will introduce our recent efforts in developing a novel Chinese legal benchmark LexEval. This benchmark is notable in the following three aspects: (1) Ability Modeling: We propose a new taxonomy of legal cognitive abilities to organize different tasks. (2) Scale: To our knowledge, LexEval is one of the largest legal evaluation datasets, comprising 23 tasks and 14,150 questions. (3) Data: we utilize formatted existing datasets, exam datasets and newly annotated datasets by legal experts to comprehensively evaluate the various capabilities of LLMs. LexEval not only focuses on the ability of LLMs to apply fundamental legal knowledge but also dedicates efforts to examining the ethical issues involved in their application. Based on the dataset, we organize a legal LLM benchmark workshop named CAIL (China AI and Law Challenge). We hope these efforts will offer more valuable insights into the challenges and potential solutions for developing legal AI systems and LLM evaluation pipelines.

Hate Speech: Platform Needs, Detection Technologies and Regulation
by Thomas Mandl

Hate speech and other offensive and objectionable online content pose a huge challenge to societies. Offensive content undermines objective discussions and can lead to the radicalization of debates. Content Moderation is necessary to suppress offensive content and due to the massive amount of posts, AI needs to support the identification of problematic content. This decision introduces AI as an actor into everyday life of millions of social media users. Content moderation is based on text classification but the core of the technology needs to be embedded in a social context during development and evaluation as well as deployment. Challenges for the evaluation and typical results need to be considered when discussing current research on content moderation. The regula- tion of Hate speech and content moderation also needs to be ad- dressed. How can AI strike the right balance between censoring and overblocking on the one hand and freedom of speech on the other? The Digital Service Act within the EU poses an influential model which regulates the factors for removing content and which aims at an transparent implementation.

LLMs and IR Evaluation: Towards Building Reusable Conversational Test Collections
by Mohammad Allan Nejadi

The IR community is moving fast towards leveraging LLMs in the evaluation pipeline: from LLM-assisted human annotation to fully LLM-based evaluation. Although LLMs provide countless opportunities in this area, there are various reasonable concerns about such a fast-paced transition of the field, mainly concerning their reliability as well as transferring their learned bias in the evaluation phase. In this talk, I will review the recent advances in LLM-based evaluation and discuss the potential risks and opportunities in this area. I will describe the observations we have had as part of the TREC Interactive Knowledge Assistance Track (iKAT), where we observe that LLMs can be used to fill missing judgments in the human-assessed relevance pool. Furthermore, I will give an overview of our findings in the LLMJudge challenge, where we organized a shared task on LLM-based relevance assessment and welcomed a diverse set of work by the community. Finally, I will discuss approaches to nugget-based generated text evaluation aimed at facilitating reusable evaluation of generated content.

Flexible Representations with Learned Sparse Retrieval
by Andrew Yates

A nice property of large language models is that they use natural language for both their input and their output, allowing them to easily be applied to different tasks and combined together in arbitrary pipelines. In a similar spirit, learned sparse retrieval (LSR) provides a way to use natural language to align embedding spaces and modalities for search. In this talk, I will expand on this vision and describe some of my recent work using LSR across text and image modalities. Given that a bag-of-terms representation is not without downsides, I will also describe LSR's challenges and my recent work addressing them by improving the expressiveness of learned sparse representations for retrieval.

FIRE 2024