Stephen Robertson, City University, London
Most modern information retrieval systems rely heavily on the initial separation of words in natural language text. In English, and in many other languages using alphabetic scripts, this is a simple, almost trivial, task — it is (almost) universal in English text to separate words with spaces (possibly with other signs such as punctuation). A few alphabetic languages (German, Finnish) are a little trickier, and some non-alphabetic languages (Japanese, Chinese) require more serious deconstruction for the extraction of words. But in the roughly 3000-year history of alphabetic writing, the habit of inserting spaces between words was only initiated about 1500 years ago, and only became universal some centuries after that. Further, the idea of having an explicit character code for space does not appear until the second half of the twentieth century, well over a century after Morse's coding scheme for alphabetic and numerical characters. In this talk, I will give a sketch of this history, and try to give an idea of how our sense of word spacing has changed over time.
Nicola Ferro, University of Padua, Italy
IR is a heavily experimental discipline with a long-standing tradition of evaluation and system performance measurement. Nevertheless, there are two major shortcomings: we need more insightful and richer explanations of IR system performance, and we should devise techniques for predicting IR system performance and for the ahead-of-implementation design of IR systems, meant as the capability of designing a system to meet an expected level of performance before actually implementing it to discover how it performs. The talk will discuss the state of the art in performance modelling and will outline how to turn these results into a performance prediction framework, drawing on the results of a recent Dagstuhl Perspectives Workshop on predicting performance across IR/NLP/RecSys.
Allan Hanbury, TU Wien, Austria
Word Embedding approaches, such as word2vec, are increasingly being used as the basis for a wide variety of text analysis and information retrieval applications. In this talk, I present some of the recent contributions to this area from my research group. The first part of the talk analyses the similarity values produced by word2vec, in particular to determine the range of similarity values that is indicative of actual term relatedness. Based on these results, uses of the similarity values in Information Retrieval are presented. Finally, we discuss the problem of topic shifting in Information Retrieval resulting from the incorporation of word2vec term similarities, mainly due to the local context of these similarities. A solution is presented that involves combining the local context of word2vec with the global context provided by Latent Semantic Indexing (LSI).
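To make the thresholding idea concrete, here is a minimal sketch (not the group's actual experiment) of filtering word2vec neighbours by a similarity cut-off, using gensim's KeyedVectors; the vector file path and the 0.6 threshold are illustrative assumptions, not values from the talk.

```python
# A minimal sketch of filtering word2vec neighbours by a similarity
# threshold. The vector file path and the 0.6 cut-off are assumptions
# for illustration; the talk determines such a range empirically.
from gensim.models import KeyedVectors

# Load pretrained word2vec vectors (hypothetical path).
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

RELATEDNESS_THRESHOLD = 0.6  # assumed cut-off, not a value from the talk

def related_terms(term, topn=20):
    """Return neighbours of `term` whose cosine similarity clears the threshold."""
    neighbours = vectors.most_similar(term, topn=topn)  # [(word, similarity), ...]
    return [(word, sim) for word, sim in neighbours if sim >= RELATEDNESS_THRESHOLD]

print(related_terms("retrieval"))
```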
Jaap Kamps, University of Amsterdam, The Netherlands
We introduce sparse neural ranking models that learn a latent sparse representation for each query and document. This representation captures the semantic relationship between the query and documents, but is also sparse enough to enable constructing an inverted index for the whole collection. Our model gains in efficiency without loss of effectiveness: it not only outperforms the existing term matching baselines, but also performs similarly to the recent re-ranking based neural models with dense representations. Our results demonstrate the importance of sparsity in neural IR models and show that dense representations can be pruned effectively, giving new insights about essential semantic features and their distributions.
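As a rough illustration of the indexing idea (not the authors' model), the following self-contained sketch posts toy sparse latent vectors into an inverted index keyed by latent dimension and scores queries by dot product over the shared non-zero dimensions; all vectors here are invented for illustration.

```python
# Why sparsity matters: sparse latent vectors can be posted into an
# inverted index keyed by latent dimension, exactly like terms in
# classical IR. Toy data only; the real vectors are learned neurally.
from collections import defaultdict

# Toy sparse latent representations: doc_id -> {latent_dim: weight}.
docs = {
    "d1": {3: 0.8, 17: 0.2},
    "d2": {3: 0.1, 42: 0.9},
    "d3": {17: 0.5, 42: 0.4},
}

# Build the inverted index: latent_dim -> [(doc_id, weight), ...].
index = defaultdict(list)
for doc_id, vec in docs.items():
    for dim, weight in vec.items():
        index[dim].append((doc_id, weight))

def score(query_vec):
    """Dot-product scoring that only touches postings for the query's
    non-zero latent dimensions -- the efficiency gain sparsity buys."""
    scores = defaultdict(float)
    for dim, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(dim, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(score({3: 0.7, 42: 0.3}))  # d1 (0.56) > d2 (0.34) > d3 (0.12)
```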
Douglas W. Oard, University of Maryland, USA
MAchine Translation for English Retrieval of Information in Any Language (MATERIAL) is a research program that includes Cross-Language Information Retrieval (CLIR). Its CLIR test collections are used for set-based rather than ranked retrieval, and evaluation relies on system-generated English summaries to identify relevant documents. The talk also discusses issues in the design of the evaluation and some open research directions.
Paulo Quaresma, Universidade de Évora, Portugal
The talk will propose a threefold architecture to identify arguments in legal documents. The system identifies argumentative sentences, clusters them into arguments, and identifies the structure of the arguments. The proposed system was implemented at the University of Évora and evaluated on an annotated ECHR corpus. The architecture will be described in detail, and the extension of this work to other languages and/or domains will also be discussed.
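As a rough sketch of the three stages under stated assumptions (scikit-learn components, invented toy sentences and labels; not the Évora system itself):

```python
# A rough sketch (not the Évora system) of the three stages: classify
# argumentative sentences, cluster them into candidate arguments, then
# identify structure. Sentences and labels below are invented toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

train_sentences = [
    "The applicant claims a violation of Article 6.",
    "Therefore the interference was not proportionate.",
    "The hearing took place on 3 May.",
    "The case file was transferred to the registry.",
]
train_labels = [1, 1, 0, 0]  # 1 = argumentative, 0 = not

vectorizer = TfidfVectorizer()
classifier = LogisticRegression().fit(
    vectorizer.fit_transform(train_sentences), train_labels)

# Stage 1: keep sentences the classifier marks as argumentative.
doc_sentences = [
    "Therefore the detention was unlawful.",
    "The applicant claims the trial was unfair.",
    "The judgment was delivered in June.",
]
predictions = classifier.predict(vectorizer.transform(doc_sentences))
argumentative = [s for s, y in zip(doc_sentences, predictions) if y == 1]

# Stage 2: cluster argumentative sentences into candidate arguments.
if argumentative:
    k = min(2, len(argumentative))
    clusters = KMeans(n_clusters=k, n_init=10).fit_predict(
        vectorizer.transform(argumentative))
    # Stage 3 would link premises to conclusions within each cluster;
    # here we only print the grouping.
    for sentence, cluster in zip(argumentative, clusters):
        print(cluster, sentence)
```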
Karin Verspoor, University of Melbourne
The language of biomedicine is varied and complex, with a plethora of surface variation, synonyms and name overloading occurring in a diversity of text types. This talk will explore the interaction between information extraction and information retrieval in the biomedical domain, arguing that normalization of concepts to ontologies and formal representation of relations is key to supporting effective retrieval of relevant texts in response to specific queries.
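As a tiny illustration of the normalization argument, the sketch below maps invented surface variants onto a placeholder concept identifier before indexing, so a query phrased one way can retrieve documents phrased another; the lexicon and concept IDs are hypothetical, not drawn from any real ontology.

```python
# Concept normalization in miniature: map surface variants and synonyms
# onto one concept ID before indexing. The lexicon and the "EX:" IDs are
# invented placeholders; real systems use ontology-backed lexicons and
# proper tokenization rather than naive substring replacement.
LEXICON = {
    "heart attack": "EX:0001",
    "myocardial infarction": "EX:0001",
    "cardiac arrest": "EX:0002",
}

def normalize(text):
    """Replace known surface forms with their concept IDs (longest match first)."""
    result = text.lower()
    for surface in sorted(LEXICON, key=len, reverse=True):
        result = result.replace(surface, LEXICON[surface])
    return result

# Both formulations index and match on the same concept token.
print(normalize("Patient suffered a heart attack in 2017."))
print(normalize("History of myocardial infarction."))
```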
Pushpak Bhattacharyya, IIT Patna, India
Machine Translation has always been an important problem, and data-driven MT is the reigning paradigm today. However, the absence of parallel corpora is a constraining factor, and a reality for most language pairs of the world. In this presentation we will describe our work on low-resource MT, exploiting the techniques of factors, subwords, pivots and unsupervised Neural MT for translating between pairs of languages that have little or no parallel corpora. Our case studies are on various Indian languages and English. We achieve significant performance improvement through the use of a combination of techniques.
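As one concrete example of the subword techniques mentioned, here is a compact sketch of byte-pair encoding (BPE) merge learning in the style of Sennrich et al.; the toy corpus and merge count are illustrative.

```python
# A compact sketch of BPE merge learning: repeatedly merge the most
# frequent adjacent symbol pair. Toy corpus and merge count are
# illustrative, not from the talk.
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a word-frequency dictionary."""
    # Represent each word as a tuple of symbols, initially characters,
    # with an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges

# Frequent substrings ("es", "est", "lo", ...) become subword units.
print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=5))
```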