April 23rd, 2018, posted by Djoerd Hiemstra
Welcome to Research Experiments in Databases and Information Retrieval (REDI)! The theme of this year’s course is: Recommendation in federated social networks. Federated social networks consist of multiple independent servers that cooperate. An example is Mastodon, a free open source implementation of a micro-blogging social network that resembles Twitter. Unlike Twitter (or Facebook for that matter), nobody has a complete view of all accounts and posts in a federated social network. We will address two research problems: 1) How to implement recommendations using only local knowledge of the network? and 2) How to evaluate your system in such a highly dynamic environment?
We are the first University of Twente course with a public Canvas syllabus.
Appropriately, we will use Mastodon to communicate about REDI. Please make an account on mastodon.utwente.nl and follow the hash tag #REDI. Use the hash tag in questions and toots about the course.
Posted in Social Data, Course REDI
April 20th, 2018, posted by Djoerd Hiemstra
The Data Management and Biometrics group and the Formal Methods & Tools group at the University of Twente seek a PhD candidate for SEQUOIA: Smart maintenance optimization via big data & fault tree analysis, a project funded by NWO Applied and Engineering Sciences and the companies ProRail and NS. ProRail is responsible for the Dutch railway network, including its construction, management, maintenance, and safety; NS has the same responsibility for the Dutch train fleet. The project is led by Mariëlle Stoelinga, Joost-Pieter Katoen and Djoerd Hiemstra.
Job description
SEQUOIA aims to improve the reliability of the Dutch railroads by deploying big data analytics to predict and prevent failures. Its scientific core is a novel combination of machine learning, fault tree analysis and stochastic model checking. The key idea is that big data analytics provide the statistics on failures, their correlations, dependencies, etc., and fault trees provide the domain knowledge needed to interpret these data. The project aims to develop explainable machine learning techniques that discover causal relations instead of statistical correlations: machine learning of fault trees or of other models that are normally designed top-down by domain experts. The techniques should help ProRail to decrease train disruptions and delays, to lower maintenance cost, and to increase passenger comfort.
The project involves intense cooperation with ProRail and RWTH Aachen University. The PhD candidate will spend a portion of their time at ProRail.
Key project deliverables are efficient analysis algorithms and a workable tool to be used in the ProRail context.
For more information, see:
https://www.utwente.nl/en/organization/careers/vacancy/!/phd-position-sequoia/134206
Posted in Vacancies
April 16th, 2018, posted by Djoerd Hiemstra
by Johannes Wassenaar
Linking segments of video using text-based methods and a flexible form of segmentation
To let users explore and use large archives, video hyperlinking aims to help the user link segments
of video to other video segments, similar to the way hyperlinks are used on the web, instead of using a
regular search tool. Indexing, querying and re-ranking multimodal data, in this case videos, are common
subjects in the video hyperlinking community. A video hyperlinking system contains an index of multimodal
(video) data, and the currently watched segment, the anchor, is translated into a query in the query generation phase.
Finally, the system responds to the user with a ranked list of targets that are about the anchor segment. In this
study, the payloads of terms, in the form of position and offset in Elasticsearch, are used to obtain time-based
information along the speech transcripts to link users directly to spoken text. The queries are generated by a
statistical method using TF-IDF, a grammar-based part-of-speech tagger, or a combination of both. Finally,
results are ranked by weighting specific components and by cosine similarity. The system is evaluated with the
Precision at 5 and MAiSP measures, which are used in the TRECVid benchmark on this topic. The results show
that TF-IDF and the cosine similarity work the best for the proposed system.
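To make the TF-IDF query generation and cosine-similarity ranking concrete, here is a minimal sketch in plain Python. It is not the thesis implementation: the whitespace tokenizer, the smoothed IDF formula, the function names and the example transcripts are all assumptions made for illustration, and no Elasticsearch index is involved.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real transcript processing.
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document, using idf = log(N / df) + 1."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_targets(anchor, targets, n_query_terms=5):
    """Rank target segments (by index) for an anchor segment transcript."""
    vectors = tf_idf_vectors(targets + [anchor])
    anchor_vec = vectors[-1]
    # Query generation: keep only the highest-weighted terms of the anchor.
    top_terms = sorted(anchor_vec, key=anchor_vec.get, reverse=True)[:n_query_terms]
    query = {term: anchor_vec[term] for term in top_terms}
    scored = [(cosine(query, vec), idx) for idx, vec in enumerate(vectors[:-1])]
    # Highest score first; ties are broken by the original order.
    return [idx for score, idx in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Hypothetical transcripts, invented for this example.
targets = [
    "the train was delayed near utrecht",
    "folktales about giants and dwarfs",
    "railway maintenance prevents train delays",
]
ranking = rank_targets("train delayed by railway maintenance", targets)
print(ranking)  # [2, 0, 1]
```

The third target shares the most highly weighted terms with the anchor and is ranked first, while the unrelated folktale segment scores zero and ends up last.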
[download pdf]
Posted in Multimedia Search
April 15th, 2018, posted by Djoerd Hiemstra
The University of Twente is the first Dutch university to run its own Mastodon server.
Mastodon is a social network based on open web protocols and free, open-source software. It is decentralized like e-mail. Learning from failures of other networks, Mastodon aims to make ethical design choices to combat the misuse of social media.
By joining U. Twente Mastodon, you join a global social network with more than a million people. The university will not sell your data, nor show you advertisements.
Mastodon U. Twente is available to all students, alumni, and employees.
Join Mastodon U. Twente now.
Posted in Social Data, Course REDI
March 27th, 2018, posted by Djoerd Hiemstra
Automatic Product Name Recognition from Short Product Descriptions
by Elnaz Pazhouhi
This thesis studies the problem of product name recognition from short product
descriptions. This is an important problem especially with the increasing use of ERP
(Enterprise Resource Planning) software at the core of modern business
management systems, where the information of business transactions is stored in
unstructured data stores. A solution to the problem of product name recognition is
especially useful for intermediate businesses, as they are interested in finding
potential matches between the items in product catalogs (produced by manufacturers or
another intermediate business) and items in the product requests (given by the end
user or another intermediate business).
In this context, the problem of product name recognition is especially challenging
because product descriptions are typically short, ungrammatical, incomplete,
abbreviated and multilingual. In this thesis we investigate the application of supervised
machine-learning techniques and gazetteer-based techniques to our problem. To
approach the problem, we define it as a classification problem where the tokens of
product descriptions are classified into I, O and B classes according to the standard
IOB tagging scheme. Next we investigate and compare the performance of a set of
hybrid solutions that combine machine learning and gazetteer-based approaches.
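As an illustration of the IOB tagging scheme mentioned above, the following sketch labels the tokens of a product description given a known product name, and decodes a tag sequence back into product names. This shows only the label format, not the classifiers studied in the thesis; the function names and the example description are hypothetical.

```python
def iob_tags(tokens, product_tokens):
    """Tag tokens as B (begin), I (inside) or O (outside) for the first match of a product name."""
    tags = ["O"] * len(tokens)
    n = len(product_tokens)
    if n == 0:
        return tags
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in product_tokens]
    for start in range(len(tokens) - n + 1):
        if lowered[start:start + n] == target:
            tags[start] = "B"
            for i in range(start + 1, start + n):
                tags[i] = "I"
            break  # only the first occurrence is labeled in this sketch
    return tags


def extract_names(tokens, tags):
    """Decode an IOB tag sequence back into a list of product name strings."""
    names, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                names.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


# Hypothetical short product description, invented for this example.
tokens = "10 x LED lamp E27 warm white".split()
tags = iob_tags(tokens, ["LED", "lamp"])
print(tags)                         # ['O', 'O', 'B', 'I', 'O', 'O', 'O']
print(extract_names(tokens, tags))  # ['LED lamp']
```

In a supervised setting, tag sequences like these would serve as the class labels that a token classifier is trained to predict.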
We study a solution space that uses four learning models: linear and non-linear
SVC, Random Forest, and AdaBoost. For each solution, we use the same set of
features. We divide the features into four categories: token-level features,
document-level features, gazetteer-based features and frequency-based features. Moreover,
we use automatic feature selection to reduce the dimensionality of the data, which
in turn improves training efficiency and helps avoid over-fitting.
To be able to evaluate the solutions, we develop a machine learning framework
that takes as its inputs a list of predefined solutions (i.e. our solution space) and
a preprocessed labeled dataset (i.e. a feature vector X, and a corresponding class
label vector Y). It automatically selects the optimal number of most relevant features,
optimizes the hyper-parameters of the learning models, trains the learning models,
and evaluates the solution set. We believe that our automated machine learning
framework can effectively be used as an AutoML framework that automates most
of the decisions that have to be made in the design process of a machine learning
solution for a particular domain (e.g. for product name recognition).
Moreover, we conduct a set of experiments and, based on the results, we answer
the research questions of this thesis. In particular, we determine (1) which learning
models are more effective for our task, (2) which feature groups contain the most
relevant features, (3) what the contribution of different feature groups to the overall
performance of the induced model is, (4) how gazetteer-based features are incorporated
into the machine learning solutions, (5) how effective gazetteer-based features
are, (6) what the role of hyper-parameter optimization is and (7) which models are
more sensitive to hyper-parameter optimization.
According to our results, the solutions with maximum and minimum performance
are non-linear SVC with an F1 measure of 65% and AdaBoost with an F1 measure
of 59% respectively. This indicates that the choice of classifier has only a limited effect on the
final outcome of the learning model, at least for the studied dataset.
Additionally, our results show that the most effective feature group is the document-level
features, with a 14.8% contribution to the overall performance (i.e. F1 measure). In
second position is the group of token-level features, with a 6.8% contribution.
The other two groups, the gazetteer-based features and frequency-based features,
have small contributions of 1% and 0.5% respectively. However, further investigation
relates the poor performance of the gazetteer-based features to the low coverage of the
gazetteer used (i.e. ETIM).
Our experiments also show that all learning models over-fit the training data when
a large number of features is used; thus the use of feature selection techniques is
essential to the robustness of the proposed solutions. Among the studied learning
models, the performance of the non-linear SVC and AdaBoost models strongly depends
on the hyper-parameters used. Therefore, for those models the computational cost
of hyper-parameter tuning is justifiable.
[download pdf]
Posted in Machine Learning
February 19th, 2018, posted by Djoerd Hiemstra
To celebrate Peter Apers’ retirement, we created The Apers Tree, which displays the academic genealogy of Peter Apers. The tree is inspired by the wonderful Mathematics Genealogy Project and is a gift from the Database Group of the University of Twente on the occasion of Peter’s retirement on 16 February 2018.
Check out the Apers Tree on GitHub.
Posted in Uncategorized
January 12th, 2018, posted by Djoerd Hiemstra
Cross-Domain Authorship Attribution as a Tool for Digital Investigations
by Christel Geurts
On the dark web, sites promoting illegal content are abundant and new sites are constantly created. At the same time, law enforcement is working hard to take these sites down and track down the persons involved. Often, after a site is taken down, users change their name and move to a different site. But what if law enforcement could track users across sites? Different sites or sources of information are called domains. As the domain changes, the context of a message often changes too, making it challenging to track users simply by the words they use. The aim of this thesis is to develop a system that can link written text of authors in a cross-domain setting. The system was tested on a blog corpus and verified on police data. Tests show that multinomial logistic regression and Support Vector Machines with a linear kernel perform well. Character 3-grams work well as features, and combining multiple feature sets increases performance. Logistic regression models with a combined feature set performed best (accuracy = 0.717, MRR = 0.7785 for 1000 authors on the blog corpus). On the police data, the logistic regression model had an accuracy of 0.612 and an MRR of 0.6883 for 521 authors.
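For readers unfamiliar with the features and measures named in this abstract, the sketch below shows how character 3-grams can be counted and how accuracy and mean reciprocal rank (MRR) are computed from ranked lists of candidate authors. It is an illustration only: the function names and the toy rankings are assumptions, and the classifiers from the thesis are not reproduced here.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (3-grams by default) in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def accuracy(ranked_lists, true_authors):
    """Fraction of texts for which the true author is ranked first."""
    hits = sum(
        1 for ranking, truth in zip(ranked_lists, true_authors)
        if ranking and ranking[0] == truth
    )
    return hits / len(true_authors)


def mean_reciprocal_rank(ranked_lists, true_authors):
    """Average of 1 / rank of the true author; 0 when the author is not in the list."""
    total = 0.0
    for ranking, truth in zip(ranked_lists, true_authors):
        if truth in ranking:
            total += 1.0 / (ranking.index(truth) + 1)
    return total / len(true_authors)


print(char_ngrams("hello"))  # Counter({'hel': 1, 'ell': 1, 'llo': 1})

# Toy rankings of candidate authors for two texts, both written by author "a".
rankings = [["a", "b"], ["b", "a"]]
truth = ["a", "a"]
print(accuracy(rankings, truth))              # 0.5
print(mean_reciprocal_rank(rankings, truth))  # 0.75
```

MRR rewards a system for placing the true author near the top even when the first guess is wrong, which is why it can be higher than accuracy, as in the figures reported above.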
Posted in Social Data, Data Science
January 10th, 2018, posted by Djoerd Hiemstra
The Case of the Dutch Folktale Database
by Iwe Muiser, Mariët Theune, Ruud de Jong, Nigel Smink, Dolf Trieschnigg, Djoerd Hiemstra, and Theo Meder
This paper demonstrates the use of a user-centred design approach for the development of generous interfaces/rich prospect browsers for an online cultural heritage collection, determining its primary user groups and designing different browsing tools to cater to their specific needs. We set out to solve a set of problems faced by many online cultural heritage collections. These problems are lack of accessibility, limited functionality for exploring the collection through browsing, and the risk of lesser-known content being overlooked. The object of our study is the Dutch Folktale Database, an online collection of tens of thousands of folktales from the Netherlands. Although this collection was designed as a research commodity for folktale experts, its primary user group consists of casual users from the general public. We present the new interfaces we developed to facilitate browsing and exploration of the collection by both folktale experts and casual users. We focus on the user-centred design approach we adopted to develop interfaces that would fit the users’ needs and preferences.
Published in Digital Humanities Quarterly 11(4), 2017
[Read more]
Access the Folktale Database at: http://www.verhalenbank.nl/.
Posted in Cultural heritage
December 5th, 2017, posted by Djoerd Hiemstra
Over the past months, Searsia investigated ways for search engines to provide search advertisements without participating in the large advertisement networks of Google and Facebook, and more importantly, without the need for search engines to track their users.
Read more
Posted in Searsia
October 2nd, 2017, posted by Djoerd Hiemstra
Slides of the keynote at the 1st International Workshop on LEARning Next gEneration Rankers
(LEARNER 2017), held on 1 October 2017 in Amsterdam, are now available:

learner2017.pdf
Download the paper:
Niek Tax, Sander Bockting, and Djoerd Hiemstra. “A cross-benchmark comparison of 87 learning to rank methods”, Information Processing and Management 51(6), Elsevier, pages 757–772, 2015 [download pdf]
Posted in Machine Learning