Christel Geurts graduates on Cross-Domain Authorship Attribution

January 12th, 2018, posted by Djoerd Hiemstra

Cross-Domain Authorship Attribution as a Tool for Digital Investigations

by Christel Geurts

On the dark web, sites promoting illegal content are abundant and new sites are constantly created. At the same time, law enforcement is working hard to take these sites down and track down the persons involved. Often, after a site is taken down, users change their name and move to a different site. But what if law enforcement could track users across sites? Each site or source of information is called a domain. As the domain changes, the context of a message often changes too, making it challenging to track users based only on the words they use. The aim of this thesis is to develop a system that can link written text of authors in a cross-domain setting. The system was tested on a blog corpus and verified on police data. Tests show that multinomial logistic regression and Support Vector Machines with a linear kernel perform well. Character 3-grams work well as features, and combining multiple feature sets increases performance. On the blog corpus, Logistic Regression models with a combined feature set performed best (accuracy = 0.717, MRR = 0.7785 for 1000 authors). On the police data, the Logistic Regression model had an accuracy of 0.612 and an MRR of 0.6883 for 521 authors.
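To make the features and metrics in this abstract concrete, here is a minimal sketch of character 3-gram extraction and of how accuracy and mean reciprocal rank (MRR) can be computed from ranked lists of candidate authors. The function names and the toy data are assumptions for illustration only; this is not the code or data used in the thesis.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (n=3 gives character 3-grams)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def accuracy_and_mrr(rankings, true_authors):
    """Compute accuracy and MRR.

    rankings[i] is a list of candidate authors for text i, best guess first.
    true_authors[i] is the actual author of text i.
    """
    hits = 0
    reciprocal_ranks = []
    for ranked, truth in zip(rankings, true_authors):
        # Accuracy only counts the top-ranked candidate.
        if ranked and ranked[0] == truth:
            hits += 1
        # Reciprocal rank is 1/position of the true author, or 0 if absent.
        rr = 1.0 / (ranked.index(truth) + 1) if truth in ranked else 0.0
        reciprocal_ranks.append(rr)
    n = len(true_authors)
    return hits / n, sum(reciprocal_ranks) / n


# Hypothetical toy data: three texts, all written by author "a".
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
true_authors = ["a", "a", "a"]
acc, mrr = accuracy_and_mrr(rankings, true_authors)
```

MRR rewards a system for placing the true author near the top even when the first guess is wrong, which is why it is at least as high as accuracy in the results above.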

Supporting the Exploration of Online Cultural Heritage Collections

January 10th, 2018, posted by Djoerd Hiemstra

The Case of the Dutch Folktale Database

by Iwe Muiser, Mariët Theune, Ruud de Jong, Nigel Smink, Dolf Trieschnigg, Djoerd Hiemstra, and Theo Meder

This paper demonstrates the use of a user-centred design approach for the development of generous interfaces/rich prospect browsers for an online cultural heritage collection, determining its primary user groups and designing different browsing tools to cater to their specific needs. We set out to solve a set of problems faced by many online cultural heritage collections: lack of accessibility, limited functionality for exploring the collection through browsing, and the risk of lesser-known content being overlooked. The object of our study is the Dutch Folktale Database, an online collection of tens of thousands of folktales from the Netherlands. Although this collection was designed as a research commodity for folktale experts, its primary user group consists of casual users from the general public. We present the new interfaces we developed to facilitate browsing and exploration of the collection by both folktale experts and casual users. We focus on the user-centred design approach we adopted to develop interfaces that would fit the users’ needs and preferences.

Screen Shot of the Dutch Folktale Database

Published in Digital Humanities Quarterly 11(4), 2017

[Read more]

Access the Folktale Database at:

Ethical Search Advertising

December 5th, 2017, posted by Djoerd Hiemstra

Over the past months, Searsia investigated ways for search engines to provide search advertisements without participating in the large advertisement networks of Google and Facebook and, more importantly, without the need for search engines to track their users.

Read more

Ranking Learning-to-Rank Methods

October 2nd, 2017, posted by Djoerd Hiemstra

Slides of the keynote at the 1st International Workshop on LEARning Next gEneration Rankers, LEARNER 2017 on 1 October 2017 in Amsterdam are now available:

Slides of Learner 2017 keynote

Download the paper: Niek Tax, Sander Bockting, and Djoerd Hiemstra. “A cross-benchmark comparison of 87 learning to rank methods”, Information Processing and Management 51(6), Elsevier, pages 757–772, 2015 [download pdf]

Dutch-Belgian Information Retrieval Workshop 2017

September 13th, 2017, posted by Djoerd Hiemstra

Send in your DIR 2017 submissions (novel, dissemination, or demo) before 15 October.

16th Dutch-Belgian Information Retrieval Workshop
Friday 24th of November 2017
Netherlands Institute for Sound and Vision,
Hilversum, the Netherlands

DIR 2017 aims to serve as an international platform (with a special focus on the Netherlands and Belgium) for exchange and discussions on research & applications in the field of information retrieval as well as related fields. We invite quality research contributions addressing relevant challenges. Contributions may range from theoretical work to descriptions of applied research and real-world systems. We especially encourage doctoral students to present their research.

This year’s edition is co-organized by the CLARIAH project, which is developing a Research Infrastructure for the Arts and Humanities in the Netherlands. Use cases in this infrastructure cover a wide range of IR-related topics. To foster discussions between the IR community and CLARIAH researchers and developers, DIR 2017 organizes a special session on IR related to data-driven research and data critique.

Read more

MTCB: A Multi-Tenant Customizable database Benchmark

September 4th, 2017, posted by Djoerd Hiemstra

by Wim van der Zijden, Djoerd Hiemstra, and Maurice van Keulen

We argue that there is a need for Multi-Tenant Customizable OLTP systems. Such systems need a Multi-Tenant Customizable Database (MTC-DB) as a backing. To stimulate the development of such databases, we propose the benchmark MTCB. Benchmarks for OLTP exist and multi-tenant benchmarks exist, but no MTC-DB benchmark exists that accounts for customizability. We formulate seven requirements for the benchmark: realistic, unambiguous, comparable, correct, scalable, simple and independent. It focuses on performance aspects and produces nine metrics: Aulbach compliance, size on disk, tenants created, types created, attributes created, transaction data type instances created per minute, transaction data type instances loaded by ID per minute, conjunctive searches per minute and disjunctive searches per minute. We present a specification and an example implementation in Java 8, which can be accessed from the following public repository. In the same repository a naive implementation can be found of an MTC-DB where each tenant has its own schema. We believe that this benchmark is a valuable contribution to the community of MTC-DB developers, because it provides objective comparability as well as a precise definition of the concept of MTC-DB.

The Multi-Tenant Customizable database Benchmark will be presented at the 9th International Conference on Information Management and Engineering (ICIME 2017) on 9-11 October 2017 in Barcelona, Spain.

[download pdf]

Alexandru Serban graduates on Personalized Ranking in Academic Search

August 18th, 2017, posted by Djoerd Hiemstra

Context Based Personalized Ranking in Academic Search

by Alexandru Serban

A criticism of search engines is that they return the same results to users who send exactly the same query but have distinct information needs. Personalized search is considered a solution, as search results are re-evaluated based on user preferences or activity. Instead of relying on the unrealistic assumption that people will precisely specify their intent when searching, the user profile is exploited to re-rank the results. This thesis focuses on two problems related to academic information retrieval systems. The first part is dedicated to data sets for search engine evaluation. Test collections consist of documents, a set of information needs (also called topics), queries that represent the data structure sent to the information retrieval tool, and relevance judgements for the top documents retrieved from the collection. Relevance judgements are difficult to gather because the process involves manual work. We propose an automatic method to generate queries from the content of a scientific article and evaluate the relevant results. A test collection is generated, but its power to discriminate between relevant and non-relevant results is limited. In the second part of the thesis, Scopus performance is improved through personalization. We focus on the academic background of researchers that interact with Scopus, since information about their academic profile is already available. Two methods for personalized search are investigated.
First, the connections between academic entities, expressed as a graph structure, are used to evaluate how relevant a result is to the user. We use SimRank, a similarity measure for entities based on their relationships with other entities. Second, the semantic structure of documents is exploited to evaluate how meaningful a document is for the user. A topic model is trained to reflect the user’s interests in research areas and how relevant the search results are.
Finally, both methods are merged with the initial Scopus rank. The results of a user study show a constant performance increase for the first 10 results.
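As an illustration of the first method, here is a minimal sketch of the standard iterative SimRank computation, in which two entities are similar when they are referenced by similar entities. The decay factor, the number of iterations, and the tiny graph are assumptions chosen for illustration; they are not the settings or the Scopus data used in the thesis.

```python
from itertools import product


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Naive iterative SimRank.

    in_neighbors maps each node to the list of nodes that point to it.
    Every node that appears as a neighbor must also be a key.
    """
    nodes = list(in_neighbors)
    # Start with s(a, a) = 1 and s(a, b) = 0 for a != b.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                # Nodes without in-neighbors are similar only to themselves.
                new_sim[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbors, then decay.
            total = sum(sim[(i, j)] for i in ia for j in ib)
            new_sim[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new_sim
    return sim


# Hypothetical graph: author "u" points to papers "p1" and "p2";
# paper "p3" has no incoming links.
graph = {"u": [], "p1": ["u"], "p2": ["u"], "p3": []}
scores = simrank(graph)
```

In this toy graph, "p1" and "p2" come out as similar because they share the same in-neighbor, while "p3" is similar to nothing but itself.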

[download pdf]

Bas Niesink graduates on biomedical information retrieval

August 18th, 2017, posted by Djoerd Hiemstra

Improving biomedical information retrieval with pseudo and explicit relevance feedback

by Bas Niesink

The HERO project aims to increase the quality of supervised exercise during cancer treatment by making use of a clinical decision support system. In this research, concept-based information retrieval techniques to find relevant medical publications for such a system were developed and tested. These techniques were designed to search multiple document collections, without the need to store copies of the collections.
The influence of pseudo and explicit relevance feedback using the Rocchio algorithm was explored. The underlying retrieval models that were tested are TF-IDF and BM25.
The tests were conducted using the TREC Clinical Decision Support datasets for the 2014 and 2015 editions. The TREC CDS relevance judgements were used to simulate explicit feedback. The NLM Medical Text Indexer was used to extract MeSH terms from the TREC CDS topics, to be able to conduct concept-based queries. Furthermore, the difference in performance between using inverse document frequencies calculated on the entire PMC dataset and using those calculated on a collection of several thousand intermediate search results was measured.
The results show that both pseudo and explicit relevance feedback have a strong positive influence on the inferred NDCG. Additionally, the performance difference when using IDF values calculated on a very small document collection is limited.
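For readers unfamiliar with the Rocchio algorithm mentioned above, here is a minimal sketch of the textbook update: the query vector is moved towards the centroid of relevant documents and away from the centroid of non-relevant ones. With pseudo relevance feedback the "relevant" documents are simply the top-ranked results; with explicit feedback they are documents judged relevant. The weights alpha, beta and gamma and the toy vectors are assumptions for illustration, not the settings used in the thesis.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant=(), alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated query vector; vectors are dicts of term -> weight."""
    updated = defaultdict(float)
    for term, weight in query.items():
        updated[term] += alpha * weight
    # Add the relevant centroid and subtract the non-relevant centroid.
    for docs, coeff in ((relevant, beta), (non_relevant, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                updated[term] += coeff * weight / len(docs)
    # Terms whose weight is not positive are dropped, a common convention.
    return {term: weight for term, weight in updated.items() if weight > 0}


# Hypothetical toy vectors.
query = {"cancer": 1.0}
relevant = [{"cancer": 1.0, "exercise": 2.0}, {"exercise": 1.0}]
non_relevant = [{"diet": 1.0}]
new_query = rocchio(query, relevant, non_relevant)
```

The expanded query now contains "exercise", a term that did not occur in the original query but is frequent in the feedback documents.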

[download pdf]

Term Extraction paper in Computing Reviews’ Best of 2016

July 10th, 2017, posted by Djoerd Hiemstra

The paper “Evaluation and analysis of term scoring methods for term extraction”, written with Suzan Verberne, Maya Sappelli and Wessel Kraaij, has been selected as one of ACM Computing Reviews’ 2016 Best of Computing. Computing Reviews is published by the Association for Computing Machinery (ACM) and the editor-in-chief is Carol Hutchins (New York University).

In the paper, we evaluate five term scoring methods for automatic term extraction on four different types of text collections. We show that extracting relevant terms using unsupervised term scoring methods is possible in diverse use cases, and that the methods are applicable in more contexts than their original design purpose.

[download pdf]

Federated Search for Sheet Music

July 7th, 2017, posted by Djoerd Hiemstra

After running the UT search engine for about a year, there is now a new search engine that uses Searsia. The search engine, called Dr. Sheet Music, is a federated search engine for sheet music. Give it a try at