Archaeoinformatics - Data Science

MA: Probability-based Relevance Search in Uncertain Heterogeneous Information Networks

Author: Niko Amann, M.Sc.

Supervisors:

Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Uncertain Heterogeneus Information Network

Abstract

Today large multi-typed networks are ubiquitous and form a critical component of modern information infrastructure. Heterogeneous information networks (HINs) are a powerful modeling tool for such networks and meta-path based relevance measures allow for the utilization of the rich semantic relations to be found in these models. Therefor relevance search in HINs has drawn a lot of interest of researchers in recent years. Moreover there is an ongoing push towards taking uncertain data into consideration. Not only does the cost of eliminating uncertainty increase with the scale of the data and with its degree of heterogeneity, but also do we often loose the opportunity of finding important insights by ignoring different degrees of often natural uncertainty. Therefor this thesis explores the opportunities and the challenges that arise, when we make uncertainty a first class citizen by directly incorporating it into the model. For the first time, to the best of our knowledge, the problem of relevance search in probabilistic heterogeneous information networks is explored. We define a model for uncertain HINs using the possible worlds semantics and apply existing state-of-the -art relevance measures on the uncertain scenario, and as a result we present the probabilistic variants expected path count (EPC) and expected path-constrained random walk (EPCRW). For these measures efficient algorithms are presented that alleviate the otherwise prohibitive complexity of naive approaches. Thereby we employ an incremental computation scheme called Poisson binomial recurrence that was previously successfully employed for frequent itemset mining in uncertain databases [1]. The proposed approaches were implemented, and experiments on a bibliographical network (dblp) with artificially added uncertainty demonstrate the gains of the proposed algorithms.

[1]  Probabilistic frequent itemset mining in uncertain databases