Archaeoinformatics - Data Science

Theses

This is a list of all open and completed thesis topics (both Bachelor's and Master's theses). For more information please contact the supervisor(s).

Open Topics

BA/MA: Data Science Applications in Marine Sciences

There are multiple open topics available that target data science applications in marine science. If you are interested in one of the topics or have an idea about a related topic, please contact Carola Trahms, M.Sc.
You can find more suggested topics below.

Fish Larvae Trajectories in the Mediterranean Sea

Example data that can be encountered in marine data science.

 

BA/MA: Efficient spatio-temporal indexing for HINs: Getting from measurement tables to HINs

Example of a spatial HIN for marine data science

Much of the data and measurements obtained in marine science are of a spatial and/or temporal nature, i.e. they are associated with geo-coordinates or time stamps.
These spatio-temporal properties can be leveraged to obtain new and deeper insights into the data. However, managing spatial and temporal data efficiently often requires careful indexing. In this thesis you will study efficient methods to index spatio-temporal data for Heterogeneous Information Networks (HINs), which are large graphs in which different types of nodes and relationships are modelled. This topic offers the opportunity to hone skills and techniques learned in lectures like Information Systems, Geo-Information Systems, and Methods of Efficient Similarity Search in Large Databases (although the latter two are not prerequisites).
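To give a feel for what "indexing" means here, the following is a minimal sketch of one of the simplest possible schemes: a fixed grid over latitude/longitude combined with fixed time buckets. The class name, the cell size, the bucket length and the sample measurements are all assumptions made for illustration; a thesis would study more capable structures.

```python
from collections import defaultdict
from math import floor


class GridTimeIndex:
    """Toy spatio-temporal index: fixed lat/lon cells plus fixed time buckets."""

    def __init__(self, cell_deg=1.0, bucket_s=86400):
        self.cell_deg = cell_deg      # cell edge length in degrees (assumed)
        self.bucket_s = bucket_s      # time bucket length in seconds (assumed)
        self.buckets = defaultdict(list)

    def _key(self, lat, lon, t):
        return (floor(lat / self.cell_deg),
                floor(lon / self.cell_deg),
                floor(t / self.bucket_s))

    def insert(self, lat, lon, t, item):
        self.buckets[self._key(lat, lon, t)].append((lat, lon, t, item))

    def query(self, lat_min, lat_max, lon_min, lon_max, t_min, t_max):
        """Return all items inside the query box, visiting only overlapping cells."""
        i0, j0, k0 = self._key(lat_min, lon_min, t_min)
        i1, j1, k1 = self._key(lat_max, lon_max, t_max)
        hits = []
        for i in range(i0, i1 + 1):
            for j in range(j0, j1 + 1):
                for k in range(k0, k1 + 1):
                    for lat, lon, t, item in self.buckets.get((i, j, k), []):
                        # Cells only pre-filter; the exact check happens here.
                        if (lat_min <= lat <= lat_max
                                and lon_min <= lon <= lon_max
                                and t_min <= t <= t_max):
                            hits.append(item)
        return hits


# Hypothetical measurements: (latitude, longitude, seconds since start, label)
index = GridTimeIndex()
index.insert(54.3, 10.1, 0, "ctd_cast_1")
index.insert(54.9, 10.8, 3600, "ctd_cast_2")
index.insert(36.0, 14.5, 90000, "larvae_tow_1")
nearby = index.query(54.0, 55.0, 10.0, 11.0, 0, 7200)
```

The query only touches the cells that overlap the search box instead of scanning every measurement, which is the basic idea that more refined index structures build on.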

BA/MA: (Linear) combinations of MetaStructures for Clustering or Community Detection in (schema-rich) HINs

Heterogeneous Information Networks (HINs) are graphs in which nodes have different types and edges form different relationships between the nodes (a homogeneous information network would just be a plain graph). In all graphs, but especially in HINs, it is of great interest to find groups or communities that exhibit similar behavior or are more closely related to one another. Meta structures are complex relationships in HINs that can be used to express the 'relatedness' of nodes within the graph, and thus provide a powerful tool for clustering or community detection. In this thesis you will study efficient (fast, scalable) and effective (meaningful) clustering and community detection algorithms using (linear) combinations of meta structures. Participation in a lecture such as KDDM (or similar) is required.
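As a rough illustration of the idea, the sketch below combines two similarity matrices linearly and groups nodes whose combined similarity exceeds a threshold. It makes several simplifying assumptions: it uses plain meta paths (author-paper-author and author-venue-author) as the simplest kind of meta structure, a PathSim-style normalisation, hand-picked weights, and thresholded connected components as a stand-in for a real clustering algorithm. The tiny network is invented.

```python
def commuting_counts(incidence):
    """Path counts for a symmetric meta path A-X-A from an A-by-X incidence matrix."""
    n = len(incidence)
    return [[sum(a * b for a, b in zip(incidence[i], incidence[j]))
             for j in range(n)] for i in range(n)]


def normalise(counts):
    """PathSim-style normalisation: 2 * c_ij / (c_ii + c_jj)."""
    n = len(counts)
    return [[2 * counts[i][j] / (counts[i][i] + counts[j][j])
             if counts[i][i] + counts[j][j] else 0.0
             for j in range(n)] for i in range(n)]


def combine(sims, weights):
    """Linear combination of several similarity matrices."""
    n = len(sims[0])
    return [[sum(w * s[i][j] for w, s in zip(weights, sims))
             for j in range(n)] for i in range(n)]


def threshold_components(sim, tau):
    """Label connected components of the graph with an edge wherever sim >= tau."""
    n = len(sim)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                if labels[v] == -1 and sim[u][v] >= tau:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels


# Invented toy data: 4 authors, 3 papers, 2 venues.
author_paper = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]]
author_venue = [[1, 0], [1, 0], [0, 1], [0, 1]]

sim_apa = normalise(commuting_counts(author_paper))
sim_ava = normalise(commuting_counts(author_venue))
combined = combine([sim_apa, sim_ava], weights=[0.5, 0.5])  # weights are assumed
communities = threshold_components(combined, tau=0.5)
```

How to choose or learn the weights, and which clustering algorithm to run on the combined measure, are exactly the kind of questions the thesis would address.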

 

BA/MA: Text Mining and Knowledge Extraction in Marine Sciences

There are multiple open topics available that target text mining and knowledge extraction from text with applications in marine science. If you are interested in one of the topics, or have an idea about a related topic, please contact Asif Suryani, M.Sc.
You can find more suggested topics below.

BA/MA: Study and Evaluation: From NER to Network Representation of Scientific Text

Named entity recognition (NER) is the task of finding and extracting relevant entities from text. Scientific measurements and their associated values are of particular interest in this scenario, as is the automatic recognition of locations, institutions, persons, etc. Once these entities are extracted from a text, the task is to construct a network representation (e.g. a heterogeneous information network) of the document at hand, where the challenge lies in predicting the appropriate relationships between the extracted entities (e.g. linking an extracted quantity 'mass' to its respective measurement '42' and unit 'kilograms'). The goal of this thesis is to develop and study novel techniques for NER and for linking the extracted entities into a network representation of the document.
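The 'mass', '42', 'kilograms' example can be made concrete with a very simple rule-based baseline. The sketch below is not a learned NER model; it assumes a small hand-picked vocabulary of quantities and units and a fixed sentence pattern, and it emits typed edges that could seed a network representation. All names and the sample sentence are illustrative.

```python
import re

# Hypothetical, hand-picked vocabularies; longer unit names come first so that
# "metres" is tried before "m".
QUANTITIES = ("mass", "temperature", "depth", "salinity")
UNITS = ("kilograms", "kg", "metres", "m", "°C")

PATTERN = re.compile(
    r"\b(?P<quantity>" + "|".join(QUANTITIES) + r")\s+(?:of|was|is)\s+"
    r"(?P<value>-?\d+(?:\.\d+)?)\s*"
    r"(?P<unit>" + "|".join(re.escape(u) for u in UNITS) + r")(?!\w)",
    re.IGNORECASE,
)


def link_measurements(text):
    """Return typed edges (source, relation, target) for each matched measurement."""
    edges = []
    for i, m in enumerate(PATTERN.finditer(text)):
        node = f"measurement_{i}"
        edges.append((node, "has_quantity", m.group("quantity").lower()))
        edges.append((node, "has_value", float(m.group("value"))))
        edges.append((node, "has_unit", m.group("unit")))
    return edges


sentence = ("The sample had a mass of 42 kilograms "
            "and was taken at a depth of 120.5 m.")
edges = link_measurements(sentence)
```

A pattern like this breaks as soon as the phrasing varies, which is why learned approaches to entity recognition and relation prediction are the actual subject of the topic.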

BA: Scientific Text Parser: An Interactive and Intelligent Approach

In this bachelor's thesis the task is to develop a framework for a parsing toolkit that reads and summarizes scientific text documents, tailored to the needs of the user. The target domain for these studies will be scientific texts from marine science.

MA: Pre-trained Language Models for Domain-driven Q/A

In this master's thesis the objective is to leverage novel, state-of-the-art pre-trained language models (such as BERT) to facilitate automated question answering (Q/A). The target domain for these studies will be scientific texts from marine science.

BA/MA: Scalable Co-Location Mining in Large Protein Databases

Contact: Steffen Strohm, M.Sc., Christian Beth, M.Sc.

We are given a large set of genomes covering a set of genes, where each gene can be assigned to a specific family (according to its function). The question at issue is which genes significantly co-occur (i.e. appear together) in genomes. Such questions relate to comparing genes among species with similar or different ecological or physiological properties. The aim of this thesis is to develop algorithms and methods that efficiently support the identification of co-occurrence patterns in gene/genome datasets.
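A naive baseline for this task simply counts how often each pair of gene families appears in the same genome. The sketch below does that and reports the lift of each frequent pair (observed co-occurrence relative to what independence would predict). The genome and family names are invented, the support threshold is an assumption, and no statistical significance test is performed.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(genomes, min_support=2):
    """Map each frequent family pair to (number of shared genomes, lift)."""
    n = len(genomes)
    single = Counter()
    pairs = Counter()
    for families in genomes.values():
        single.update(families)
        pairs.update(combinations(sorted(families), 2))
    result = {}
    for (a, b), count in pairs.items():
        if count >= min_support:
            # lift > 1: the pair appears together more often than expected
            # if the two families were distributed independently.
            lift = (count / n) / ((single[a] / n) * (single[b] / n))
            result[(a, b)] = (count, lift)
    return result


# Invented toy data: genome -> set of gene families it contains.
genomes = {
    "g1": {"famA", "famB", "famC"},
    "g2": {"famA", "famB"},
    "g3": {"famA", "famB", "famD"},
    "g4": {"famC", "famD"},
}
frequent_pairs = co_occurrence(genomes)
```

Enumerating all pairs per genome is quadratic in the number of families per genome, so it does not scale to large databases or to patterns of more than two families; that gap is what the thesis targets.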

MA: Evaluating Protein Networks – Puzzling the Biological Secret Behind the Data

Contact: Christian Beth, M.Sc.

Example Protein Network

Protein interaction and function overview (figure taken from [1])

The development of novel metallic biomaterials as implants for medical applications requires a careful elucidation of impaired or improved physiological processes and responses. Therefore, we establish in vitro models to analyse cellular reactions at different biological levels. Fundamental information can be obtained from analysing the variations in protein synthesis. Functional interactions, relationships, and correlations between the tremendously diverse proteins lead to multifaceted and complex networks. Structured data science will help to elucidate these networks and contribute to defining sensitive cellular processes and regularities. The main task will be the classification of big data sets in order to read out principles and patterns. This work will be based on the creative embedding of biological data into mathematical models.

[1] Understanding Protein Networks Using Vester's Sensitivity Model, Moreno, LA et al. in IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 17, no. 4, pp. 1440-1450, (2020)

MA: Extending Expressiveness of Meta-Structures in Heterogeneous Information Networks

Contact: Christian Beth, M.Sc.

Heterogeneous information networks (HINs) are graphs in which nodes have different types and edges form different relations between the nodes. They thus allow semantically rich modelling of virtually any kind of data and information, ranging from protein-protein interaction networks to bibliographical networks. The meta-path is a composite relationship between nodes in an HIN and an integral part of state-of-the-art similarity/relevance measures in HINs, which in turn are essential for downstream data mining tasks such as clustering, classification, or link prediction. To allow for more powerful, complex, and expressive relationships, the meta-structure was developed. It flexibly combines meta-path relations with the 'and'-linkage, which allows the user to specify more precisely what she is looking for. But why stop at the 'and'-linkage? Why not consider 'or', 'not', or other logical constraints? In this thesis you will design and study effective (meaningful) and efficient (fast, scalable) relevance measures based on more expressive meta-structures.
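One possible reading of such logical linkages, sketched under strong simplifying assumptions: two authors are related through a pair of their papers, and the linkage decides which paper pairs count. With 'and', the two papers must share both venue and topic; 'or' and 'not' relax or invert these conditions. The bibliographic data, the function names, and the assumption that every paper has exactly one venue and one topic are all illustrative, and this is not a definition taken from the literature.

```python
from itertools import product

# Invented toy bibliographic network.
writes = {"alice": {"p1"}, "bob": {"p2"}, "carol": {"p3"}}
venue = {"p1": "KDD", "p2": "KDD", "p3": "KDD"}
topic = {"p1": "graphs", "p2": "graphs", "p3": "text"}


def same_venue(p, q):
    return venue[p] == venue[q]


def same_topic(p, q):
    return topic[p] == topic[q]


def both(f, g):
    """'and'-linkage: the paper pair must satisfy both relations."""
    return lambda p, q: f(p, q) and g(p, q)


def either(f, g):
    """'or'-linkage: at least one relation must hold."""
    return lambda p, q: f(p, q) or g(p, q)


def negate(f):
    """'not'-linkage: the relation must not hold."""
    return lambda p, q: not f(p, q)


def count_instances(src, dst, link):
    """Count paper pairs (p by src, q by dst) that satisfy the linkage."""
    return sum(1 for p, q in product(writes[src], writes[dst]) if link(p, q))


and_ab = count_instances("alice", "bob", both(same_venue, same_topic))
and_ac = count_instances("alice", "carol", both(same_venue, same_topic))
or_ac = count_instances("alice", "carol", either(same_venue, same_topic))
venue_not_topic_ac = count_instances(
    "alice", "carol", both(same_venue, negate(same_topic)))
```

Here alice and carol are unrelated under the 'and'-linkage but related under 'or' and under "same venue, different topic", which shows how the choice of linkage changes what the relevance measure expresses.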

Completed Topics

MA: Evaluating Meaningful Metastructures in Heterogeneous Information Networks

Author: Steffen Strohm, M.Sc.

Supervisors:

Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Example Meta-Structures

Abstract

This work is based on the concept of meta structures and ETree traversal by Huang et al. 2016 [1] and the research of Zhu and Cheng 2018 and 2019 [2] [3], which focuses on discovering and ranking meta paths using a specifically designed importance function. These two lines of work are combined in an attempt to rank meta structures in heterogeneous information networks. The importance function is modified for this task and then tested on a DBLP subset. Tests include two groups of meta structures set up around the APA and APVTPA meta paths. An expected or intended ranking within these meta structure groups is compared to the rankings found. The results show that one of the components of the importance function (namely the new or modified path count) tends to dominate the overall importance value in certain situations. Therefore, a redesign of this component with a specific connection to meta structure design is recommended, and some ideas on how to achieve this are given. The other components are also discussed; however, the results show that their influence is more balanced.

 

[1]  Meta Structure: Computing Relevance in Large Heterogeneous Information Networks

[2]  Evaluating Top-k Meta Path Queries on Large Heterogeneous Information Networks

[3]  Effective and Efficient Discovery of Top-k Meta Paths in Heterogeneous Information Networks

 

MA: Probability-based Relevance Search in Uncertain Heterogeneous Information Networks

Author: Niko Amann, M.Sc.

Supervisors:

Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Uncertain Heterogeneous Information Network

Abstract

Today, large multi-typed networks are ubiquitous and form a critical component of modern information infrastructure. Heterogeneous information networks (HINs) are a powerful modeling tool for such networks, and meta-path based relevance measures make it possible to exploit the rich semantic relations found in these models. Relevance search in HINs has therefore drawn a lot of research interest in recent years. Moreover, there is an ongoing push towards taking uncertain data into consideration. Not only does the cost of eliminating uncertainty increase with the scale of the data and with its degree of heterogeneity, but we also often lose the opportunity to find important insights when we ignore the different degrees of often natural uncertainty. This thesis therefore explores the opportunities and challenges that arise when we make uncertainty a first-class citizen by directly incorporating it into the model. For the first time, to the best of our knowledge, the problem of relevance search in probabilistic heterogeneous information networks is explored. We define a model for uncertain HINs using possible worlds semantics and apply existing state-of-the-art relevance measures to the uncertain scenario; as a result, we present the probabilistic variants expected path count (EPC) and expected path-constrained random walk (EPCRW). For these measures, efficient algorithms are presented that alleviate the otherwise prohibitive complexity of naive approaches. To this end, we employ an incremental computation scheme called Poisson binomial recurrence, which was previously employed successfully for frequent itemset mining in uncertain databases [1]. The proposed approaches were implemented, and experiments on a bibliographical network (DBLP) with artificially added uncertainty demonstrate the gains of the proposed algorithms.
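The Poisson binomial recurrence mentioned in the abstract can be illustrated on a deliberately small case. The sketch below assumes independent edge probabilities and a length-two meta path (author, paper, author) between two fixed authors, so that each connecting paper yields one path whose existence is independent of the others. Under these assumptions the number of existing paths follows a Poisson binomial distribution, which the recurrence builds one path at a time instead of enumerating all possible worlds. The edge probabilities are invented, and this is not the thesis' algorithm for general meta paths.

```python
def poisson_binomial(probs):
    """Distribution of the number of successes among independent Bernoulli trials.

    dist[k] is the probability of exactly k successes. Each new trial updates
    the distribution in place of enumerating all 2**n outcomes.
    """
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1 - p)      # trial fails: count stays at k
            nxt[k + 1] += mass * p        # trial succeeds: count becomes k + 1
        dist = nxt
    return dist


def path_probs(edge_prob, src, dst, papers):
    """Existence probability of each path src - paper - dst (independent edges)."""
    return [edge_prob.get((src, p), 0.0) * edge_prob.get((p, dst), 0.0)
            for p in papers]


# Invented uncertain edges: (node, node) -> existence probability.
edge_prob = {
    ("a", "p1"): 0.5, ("p1", "b"): 1.0,
    ("a", "p2"): 0.8, ("p2", "b"): 0.5,
}
probs = path_probs(edge_prob, "a", "b", ["p1", "p2"])
distribution = poisson_binomial(probs)
expected_path_count = sum(k * mass for k, mass in enumerate(distribution))
```

For this example the distribution over path counts 0, 1 and 2 is 0.3, 0.5 and 0.2, and the expected path count of 0.9 equals the sum of the individual path probabilities, as linearity of expectation requires.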

[1]  Probabilistic frequent itemset mining in uncertain databases

BA: Spatial Semantics Expansion for computing relevance on Heterogeneous Information Networks

Author: Jerome Spindler

Supervisors:

Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Excerpt of the PANGAEA dataset.

Sample map region of the North Atlantic Ocean from the PANGAEA database.

Abstract

In recent years, with the rise of Big Data in both the scientific community and commercial sectors, the demand for solutions for storing and performing computations on heterogeneous data has steadily increased. As a result, the heterogeneous information network (HIN) model was conceived as a visually intuitive and semantically connected model for interpreting heterogeneous data. These networks, represented by directed graphs, led to the development of relation models such as Meta Paths and, as a generalization thereof, Meta Structures, which represent subgraph patterns on the network by which relations between objects are determined. Building on these, Path Count, Struct Count and Structure Constrained Subgraph Expansion (SCSE) were developed as measures of relevance between objects on the network, based on all occurrences of the pattern described by a Meta Path or Meta Structure starting from one designated source object. This work looks to expand the Meta Structure model to allow for edges that are not manifested on the network, yet represent a non-trivial relation that many objects may have with each other, with a focus on spatial features including, but not limited to, distance from each other, inclusion of one in the other, and overlap. These edges are intended to be definable by users and parameterizable per query. Additionally, they are intended to potentially influence the results of relevance computations, allowing for increased expressiveness with regard to weighting properties of objects, such as "the closer the better". Example computation results on excerpt data taken from the PANGAEA database, utilizing Struct Count and SCSE, are provided to demonstrate the efficacy and impact of the proposed expansion, as well as an experimental comparison of implementation approaches for these ephemeral edges with regard to data page access.
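A minimal sketch of what such an ephemeral spatial edge could look like: a "near" relation that is not stored in the network but computed per query from coordinates, with a weight that grows as objects get closer. The haversine formula is standard; the linear weighting, the function names, the distance threshold and the site coordinates are assumptions for illustration and not the thesis' implementation.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def near_edges(sites, max_km):
    """Compute ephemeral 'near' edges (u, v, weight) for one query.

    max_km is the per-query parameter. The weight is 1 at distance 0 and
    falls linearly to 0 at max_km ("the closer the better"); this particular
    weighting is an assumption.
    """
    names = sorted(sites)
    edges = []
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            d = haversine_km(sites[u], sites[v])
            if d <= max_km:
                edges.append((u, v, 1 - d / max_km))
    return edges


# Hypothetical sampling sites in the North Atlantic: name -> (lat, lon).
sites = {"s1": (60.0, -30.0), "s2": (60.0, -29.0), "s3": (40.0, -30.0)}
edges_100km = near_edges(sites, max_km=100.0)
```

Comparing all pairs is only feasible for small excerpts; for larger data the edges would have to be derived with the help of a spatial index, which relates to the data page access comparison mentioned in the abstract.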

BA: Relevance Measures in Temporal Heterogeneous Information Networks

Author: Fabian Krüger

Supervisors:

Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Temporal HIN example from PANGAEA-dataset.

An example excerpt of the PANGAEA-dataset, modelled as a temporal HIN.

Abstract

Many correlations and real systems can be modeled as heterogeneous information networks (HINs). Unlike a homogeneous information network, an HIN is a directed graph containing more than one type of node or more than one type of edge. To measure the relevance between two such nodes, we use Struct Count and Structure Constrained Subgraph Expansion (SCSE) for meta structures. These measures indicate, in different ways, how strongly two nodes are related with respect to a given meta structure.
To this end, we search the given graph for instances of the meta structure, following its layers in topological order. Meta structures describe the relationship of one node to another by indicating types of edges. They are more expressive than meta paths, which can be considered special meta structures that have no branches.
Current publications do not consider temporal aspects when measuring relevance with meta structures in HINs. For that reason, different concepts for doing so are presented in this thesis. Furthermore, we implement an HIN in Python with the help of the NetworkX package, add Struct Count and SCSE to it, and evaluate the new approaches.
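To illustrate one way temporal information could enter a relevance measure, the sketch below counts instances of a meta path (the branch-free special case of a meta structure) while only following edges whose timestamp lies in a query window. This is an assumed, simplified concept and not necessarily one of those developed in the thesis. It also uses plain Python tuples instead of NetworkX to stay dependency-free, and the edge list is invented.

```python
# Invented temporal HIN: (source, target, edge_type, timestamp as a year).
edges = [
    ("cruise1", "sample1", "collected", 2001),
    ("cruise1", "sample2", "collected", 2005),
    ("sample1", "siteA", "taken_at", 2001),
    ("sample2", "siteA", "taken_at", 2005),
]


def count_paths(edges, src, dst, edge_types, t_min, t_max):
    """Count paths src -> dst that follow edge_types in order,
    using only edges with t_min <= timestamp <= t_max."""
    frontier = {src: 1}  # node -> number of partial paths ending there
    for etype in edge_types:
        nxt = {}
        for u, v, typ, t in edges:
            if typ == etype and u in frontier and t_min <= t <= t_max:
                nxt[v] = nxt.get(v, 0) + frontier[u]
        frontier = nxt
    return frontier.get(dst, 0)


meta_path = ["collected", "taken_at"]
all_time = count_paths(edges, "cruise1", "siteA", meta_path, 2000, 2010)
early_only = count_paths(edges, "cruise1", "siteA", meta_path, 2000, 2003)
```

Restricting the window from 2000-2010 to 2000-2003 lowers the path count from 2 to 1, so the same pair of nodes is judged less related when only the early data is considered.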

BA: Locating Photo Files from the Kiel City Archive Based on Text Labels (original title: Verortung von Fotodateien aus dem Kieler Stadtarchiv auf der Basis von Textlabels)

Supervisors:

Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Abstract

Many institutions hold historically grown data collections with high information content which, because of their form, can often only be processed manually. Enriching poorly structured data holdings with structured metadata makes modern methods of automated knowledge discovery applicable to them. Using the photo database of the Kiel City Archive (Kieler Stadtarchiv) as an example, this thesis develops a method that automatically extracts the place references occurring in the text labels assigned to the photos and provides them in structured form as additional metadata. To this end, the text labels are examined for occurrences of place names, which are then specified further by analysing their linguistic context. In addition, the recognised locations are georeferenced. The method can extract a location for approximately 61% of the available photos. Locating and georeferencing the photos can enable new uses for the collection, such as a geographic search of the archive holdings. The example shows that the automated preparation of larger data holdings can be an alternative to manual structuring, or can at least simplify it considerably.
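The first two steps described in the abstract above, finding place names in a label and capturing their linguistic context, could be sketched as a gazetteer lookup with an optional preceding context phrase. The gazetteer entries, the context phrases and the sample label are assumptions for illustration; the thesis' actual rules are not given here, and georeferencing is left out.

```python
import re

# Hypothetical mini-gazetteer and German context phrases; a real system
# would need far larger lists and handling of inflected forms.
GAZETTEER = ("Holstenstraße", "Rathaus", "Hörn")
CONTEXT = ("in der", "an der", "vor dem", "am")

PLACE_RE = re.compile(
    r"(?:\b(?P<context>" + "|".join(CONTEXT) + r")\s+)?"
    r"\b(?P<place>" + "|".join(map(re.escape, GAZETTEER)) + r")\b"
)


def locate(label):
    """Return (context phrase or None, place name) for each place in the label."""
    return [(m.group("context"), m.group("place"))
            for m in PLACE_RE.finditer(label)]


label = "Straßenbahn in der Holstenstraße vor dem Rathaus"
places = locate(label)
```

The context phrase hints at how the photo relates to the place (inside a street versus in front of a building), which is the kind of information that can refine the extracted location.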

MA: Overcoming Oversmoothing in Graph Neural Networks

Author: Christian Beth, M.Sc.

Supervisor: Prof. Dr. Matthias Renz

Graph Attention Network (GAT) Convolution

Abstract

Graphs are data structures with a wide variety of applications, ranging from medical and life science to social networks. Recently, graph neural networks (GNNs) have proven to be a seminal tool for solving a multitude of graph-related tasks across these domains, including representation learning, node classification, and link prediction. The success of these approaches is attributed to the effect of Laplacian smoothing over the node features that occurs during the GNN filtering. This Laplacian smoothing effectively acts as a low-pass filter over the node features, which functions as a de-noiser and thus improves feature quality and performance on the learning task. However, state-of-the-art architectures run into the problem of oversmoothing when too many GNN layers are stacked. Oversmoothing occurs when the filtered features become too similar to each other, to the point where they become hard to distinguish for a learning task, which can greatly hurt performance. To overcome the oversmoothing problem, this work proposes a high-pass filter that, similarly to edge detectors in convolutional neural networks (CNNs), highlights signal parts where large changes occur. Finally, the approach is evaluated in comparison to several state-of-the-art architectures on various real-world graphs.
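The smoothing and oversmoothing effects can be reproduced on a four-node path graph without any learning. In the sketch below, the low-pass step replaces each node feature by the mean over the node and its neighbours (one step of random-walk Laplacian smoothing with self-loops), and the high-pass filter is simply the part of the signal that this step removes. Both are generic textbook filters chosen for illustration; the filter proposed in the thesis is not specified here. The graph and features are invented.

```python
def smooth(adj, x):
    """Low-pass step: mean of each node's feature and its neighbours' features."""
    return [(x[i] + sum(x[j] for j in adj[i])) / (1 + len(adj[i]))
            for i in range(len(x))]


def high_pass(adj, x):
    """Complementary filter: what one smoothing step removes from the signal."""
    return [xi - si for xi, si in zip(x, smooth(adj, x))]


# Path graph 0 - 1 - 2 - 3 with a step in the signal between nodes 1 and 2.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
x = [0.0, 0.0, 1.0, 1.0]

once = smooth(adj, x)
edges_highlighted = high_pass(adj, x)

# Stacking many smoothing steps mimics a very deep GNN without nonlinearities.
deep = x
for _ in range(50):
    deep = smooth(adj, deep)
spread = max(deep) - min(deep)
```

After one step the signal is only softened, and the high-pass output is non-zero exactly at nodes 1 and 2 where the signal changes. After fifty steps all four features are practically identical, which is the oversmoothing effect described in the abstract.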