FIRE 2025

Forum for Information Retrieval Evaluation

Indian Institute of Technology (BHU) Varanasi

17th - 20th December

Large language models (LLMs) are transforming how people search for, summarize, and interact with information. Their ability to generate fluent text and synthesize complex materials is reshaping traditional information-seeking behaviors. However, their growing influence raises concerns: LLMs can produce hallucinations (plausible but false statements) and reinforce confirmation bias by mirroring users’ assumptions. Because their outputs sound convincing, users may struggle to judge reliability, risking the spread of misinformation and erosion of trust in digital knowledge sources. The talk will outline these limitations, explore how user behavior contributes to them, and present strategies to mitigate such risks.
Scientific communication is experiencing unprecedented growth, with publication volumes increasing at a scale that overwhelms researchers’ capacity to process and evaluate information. This information overload is not only a byproduct of legitimate scholarly activity but is increasingly driven by low-quality and even fraudulent content. Alongside rigorous, well-designed studies, the scholarly record is also populated by weak methodologies, poorly vetted results, and intentional manipulation. The rise of AI-accelerated publishing, paper mills, tortured phrases, and other forms of “fake science” intensifies this problem, creating massive noise and undermining the reliability of academic information systems. For the IR community, this poses both critical challenges and unique opportunities. On the one hand, information overload and quality degradation pose a challenge that needs to be addressed more directly by retrieval models, which have traditionally focused mainly on topical relevance. On the other hand, advances in AI, NLP, and bibliometric-enhanced IR offer promising directions for filtering, ranking, and contextualising scholarly information. In this keynote, I will examine the evolving problem of fake science and its role in driving information overload. I will outline recent developments in scholarly information access, highlight open research problems — from detecting low-quality and fraudulent content to designing veracity-aware retrieval and recommendation models — and discuss how IR research can contribute to ensuring that high-quality knowledge remains discoverable, trustworthy, and actionable in an era of overwhelming information abundance.
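To make the idea of veracity-aware ranking concrete, the sketch below combines a topical-relevance score with an estimated veracity prior through a simple weighted sum. It is a minimal illustration under assumed names and weights (ScoredDocument, veracity_prior, alpha), not a model presented in the talk.

```python
from dataclasses import dataclass

@dataclass
class ScoredDocument:
    doc_id: str
    topical_score: float   # e.g. BM25 or dense-retrieval similarity
    veracity_prior: float  # estimated quality/trustworthiness in [0, 1]

def veracity_aware_rank(docs, alpha=0.7):
    """Re-rank documents by a weighted combination of topical relevance
    and an estimated veracity prior; alpha trades off the two signals."""
    return sorted(
        docs,
        key=lambda d: alpha * d.topical_score + (1 - alpha) * d.veracity_prior,
        reverse=True,
    )

# Toy example: a highly topical but dubious paper drops below a slightly
# less topical but trustworthy one.
candidates = [
    ScoredDocument("paper-A", topical_score=0.92, veracity_prior=0.30),
    ScoredDocument("paper-B", topical_score=0.85, veracity_prior=0.95),
]
print([d.doc_id for d in veracity_aware_rank(candidates)])  # ['paper-B', 'paper-A']
```

In practice, such a veracity prior might be estimated from signals like retraction notices, tortured-phrase detection, or other indicators of low-quality and fraudulent content.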
Social networks have become a central part of how people interact today — sharing ideas, expressing opinions, and engaging with communities. But this constant stream of online interaction also brings conflict. While research has made progress in detecting openly hostile behaviours like bullying, harassment, and threats, subtle forms of conflict—such as teasing, criticism, and sarcasm—have received much less attention. Yet, these forms can be just as harmful, both socially and psychologically.
In this talk, I’ll introduce our work on detecting both overt and subtle types of online conflict using a multi-class, multi-objective model. We developed a new conflict dataset that captures this full range of behaviours and designed a novel classification approach based on class-specific reward functions. These rewards help the model learn more effectively by penalising certain kinds of misclassifications—an important step in complex, multi-class problems. Our architecture leverages the Decision Transformer, allowing us to treat classification as a reinforcement learning task and better manage ambiguity between classes.
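As a minimal sketch of what class-specific rewards look like, the toy example below uses an assumed four-class label set and made-up reward values; the actual classes, rewards, and Decision Transformer training procedure in our work are not reproduced here.

```python
import numpy as np

# Illustrative label set (assumed for this sketch).
CLASSES = ["neutral", "teasing", "criticism", "hostility"]

# Class-specific reward matrix R[true, predicted]: correct predictions earn +1,
# while different misclassifications are penalised unequally, e.g. collapsing
# hostility into "neutral" costs more than confusing two adjacent conflict classes.
REWARD = np.array([
    # pred:  neutral teasing criticism hostility
    [ 1.0,  -0.5,   -0.5,    -1.0],  # true: neutral
    [-1.0,   1.0,   -0.3,    -0.5],  # true: teasing
    [-1.0,  -0.3,    1.0,    -0.3],  # true: criticism
    [-1.5,  -0.5,   -0.3,     1.0],  # true: hostility
])

def episode_return(y_true, y_pred):
    """Sum of class-specific rewards over a sequence of classification
    decisions, treated as a single reinforcement-learning episode."""
    return float(sum(REWARD[t, p] for t, p in zip(y_true, y_pred)))

# Mislabelling hostility as neutral is punished more than other confusions.
print(episode_return(y_true=[3, 1, 0], y_pred=[0, 1, 0]))  # -1.5 + 1.0 + 1.0 = 0.5
```

Conditioning a return-conditioned model such as the Decision Transformer on a high target return then steers its predictions away from the costliest confusions.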
Across three benchmark datasets, our approach achieved significant improvements in recall, precision, F1-score, and overall accuracy compared to state-of-the-art deep learning models. Finally, I’ll share insights from our thematic analysis of model misclassifications, highlighting what they reveal about the blurry boundaries between teasing, criticism, and hostility in online communication.
Becoming scientifically literate is more important than ever. Objective scientific information helps people navigate a world where misinformation, disinformation, and other unverified claims are just one click or chat away. Generative AI models that simplify scientific text could give everyone direct access to the latest, objective scientific information in the academic literature. However, these models can also overgenerate, presenting users with the truth and more than the truth… This talk will cover recent efforts to speed up scientific text simplification—especially at the CLEF SimpleText Track—and look at the near future. Can we develop new models that not only answer questions but also question the answers?
Access to information is critical to collective sense-making of our place and relationships in this world. Information (and access to information) has therefore always been saliently political. Throughout history, authoritarian forces have tried to control what information is disseminated and how; and information access media have been sites of conflict between liberation and oppression. While there is currently much excitement in the IR community around the application of generative AI for Information Access, we must critically consider the systemic risks that these technologies pose with respect to concentrating control over our online information ecosystems in the hands of a few privileged individuals and institutions, as well as building effective tools for mass manipulation and persuasion. This talk is a provocation for the IR community to recognize the role of computer-mediated information access in our emancipatory struggles and acknowledge our own responsibilities and role in realizing more equitable, emancipatory, and sustainable futures. We are calling on the community to develop a new emancipatory IR research agenda that embraces humanistic values, commits to universal emancipation and social justice, challenges systems of oppression, grounds itself in practices of organizing and movement building, and works in solidarity with scholars and experts from other disciplines as well as with legal and policy experts, civil rights activists, movement organizers, and artists, among others. Collectively, we must reimagine both post-oppressive futures and the role of IR in leading us there.
Evaluation has long been an important part of information retrieval research. Over decades of research, well-established methodologies have been created and refined that, for years, have provided reliable, relatively low-cost benchmarks for assessing the effectiveness of retrieval systems. With the rise of generative AI and the explosion of interest in Retrieval Augmented Generation (RAG), evaluation is having to be rethought. In this talk, I will speculate on possible solutions for evaluating RAG systems and highlight some of the opportunities that are opening up. As important as it is to evaluate the new generative retrieval systems, it is also important to recognize that traditional information retrieval has not (yet) gone away. However, the way that these systems are being evaluated is undergoing a revolution. I will detail the transformation that is currently taking place in evaluation research. Here I will highlight some of the work that we've been doing at RMIT University as part of the exciting, though controversial, new research directions that generative AI is enabling.
Generative AI and NLP are now widely used in quantitative finance for information extraction and signal discovery from unstructured textual data such as news and company filings. We show how extracting sentiment and topics using ML methods can lead to profitable quantitative trading strategies.
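As a hedged illustration of the general pattern, and not of any specific strategy covered in the talk, the snippet below maps a daily aggregate news-sentiment score per ticker to long/short/flat positions; the thresholds and toy data are assumptions.

```python
import pandas as pd

def sentiment_signal(scores: pd.Series, long_th: float = 0.2, short_th: float = -0.2) -> pd.Series:
    """Map an aggregate news-sentiment score per ticker to a position:
    +1 (long) above long_th, -1 (short) below short_th, otherwise 0 (flat)."""
    return scores.apply(lambda s: 1 if s > long_th else (-1 if s < short_th else 0))

# Toy aggregate sentiment for one trading day (illustrative values only).
daily_sentiment = pd.Series({"AAA": 0.45, "BBB": -0.31, "CCC": 0.05})
print(sentiment_signal(daily_sentiment))  # AAA: +1 (long), BBB: -1 (short), CCC: 0 (flat)
```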
We will summarize BloombergGPT, a 50-billion-parameter LLM trained on a wide range of financial data. We constructed a 700-billion-token dataset based on Bloomberg’s extensive data sources and some external sources, and validated BloombergGPT on standard LLM benchmarks as well as on open financial benchmarks.
Finally, we will share our analysis of potential geopolitical biases embedded in LLMs, especially with regard to financial sentiment analysis of news stories. We analyzed various large language models, including GPT-4o, Llama, and Claude, to find biases in sentiment with respect to specific countries, regions, or industries. There is also evidence of bias related to the language of the news stories; specifically, English-language versions tend to be scored more positively than the original news stories in Chinese or Japanese.
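A minimal sketch of the kind of cross-language comparison involved, assuming sentiment scores for each story have already been produced by some model for both the original language and the English translation; the records and numbers below are purely illustrative.

```python
from statistics import mean

# Illustrative records: the same stories scored in the original language and
# in English translation by a sentiment model (all values are made up).
records = [
    {"story": "s1", "lang": "zh", "orig_score": -0.10, "en_score": 0.15},
    {"story": "s2", "lang": "ja", "orig_score":  0.05, "en_score": 0.25},
    {"story": "s3", "lang": "zh", "orig_score":  0.20, "en_score": 0.30},
]

def language_gap(records):
    """Mean difference between English-translation and original-language
    sentiment; a positive gap means the English versions score more positively."""
    return mean(r["en_score"] - r["orig_score"] for r in records)

print(f"mean English-vs-original sentiment gap: {language_gap(records):+.3f}")
```

The same kind of per-group mean can be computed per country, region, or model to surface systematic differences in sentiment.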


To be announced soon.