Archaeoinformatics - Data Science

Completed Topics

This is a list of successfully completed bachelor's and master's theses. If you are interested in one of the topics (e.g. for your own thesis), please contact the respective supervisor(s) for further information and other current or open topics.

MA: Extraction of Scientific Measurements from Text

Named-entity recognition example with scientific measurements

The goal of this work is to automatically extract scientific measurements, such as 4 kg/m2, from papers. One of the many challenges is that the same quantities are often written in different ways (metre, meter, m, etc.). A longer-term goal is to also link the extracted quantities to their geospatial location (if given somewhere in the text).
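The unit-variant problem described above can be illustrated with a minimal, rule-based sketch. It is not the method used in the thesis; the alias table `UNIT_ALIASES`, the regular expression and the function name `extract_measurements` are assumptions made purely for illustration, and a real system would need a far larger unit inventory.

```python
import re

# Hypothetical alias table: maps surface forms (lower-cased) to a canonical unit symbol.
UNIT_ALIASES = {
    "m": "m", "meter": "m", "meters": "m", "metre": "m", "metres": "m",
    "kg": "kg", "kilogram": "kg", "kilograms": "kg",
    "kg/m2": "kg/m^2", "kg/m^2": "kg/m^2", "kg/m²": "kg/m^2",
}

# Longest alternatives first, so that "kg/m2" is tried before "kg".
_UNIT_PATTERN = "|".join(
    sorted((re.escape(unit) for unit in UNIT_ALIASES), key=len, reverse=True)
)

# A number, optional whitespace, then a known unit that is not followed by
# further unit-like characters (so "5 mm" does not match the unit "m").
MEASUREMENT_RE = re.compile(
    rf"(?<![\w.])(\d+(?:\.\d+)?)\s*({_UNIT_PATTERN})(?![\w/^²])",
    re.IGNORECASE,
)


def extract_measurements(text):
    """Return (value, canonical_unit) pairs found in the text."""
    results = []
    for match in MEASUREMENT_RE.finditer(text):
        value = float(match.group(1))
        unit = UNIT_ALIASES[match.group(2).lower()]
        results.append((value, unit))
    return results
```

With this table, "12.5 metres", "3 meter" and "7 m" all normalize to the unit `m`, while unknown units such as "mm" are skipped rather than guessed.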

For more information please contact Asif Suryani

MA: Extending Expressiveness of Meta-Structures in Heterogeneous Information Networks

Contact: Christian Beth, M.Sc.

Many databases of popular social media or video streaming sites can be modeled as a Heterogeneous Information Network (HIN). A common task is to compute the relevance between two objects in such HINs, for which meta paths or meta structures are used. While meta paths can only express sequential compositions of relations, meta structures can describe more complex relations, such as two users being friends and having the same interests. To extend the expressiveness even further, inductive meta structures are introduced. They can describe even more complex relations, such as two users being friends or having the same interests, and can also exclude certain structures, such as users who are not friends but have the same interests. Additionally, an approximate approach is presented that offers faster computation times at an accuracy of 90%.
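The kinds of relations mentioned above can be made concrete with a small sketch. It only illustrates counting meta path instances and combining the results with "and", "or" and "not"; it is not the formalism of inductive meta structures from the thesis. The class `HIN`, the helper names and the toy data are hypothetical, and edges are assumed to be undirected and distinguished only by the types of their endpoints.

```python
from collections import defaultdict


class HIN:
    """A tiny undirected heterogeneous information network with typed nodes."""

    def __init__(self):
        self.node_type = {}
        self.neighbors = defaultdict(set)

    def add_node(self, node, node_type):
        self.node_type[node] = node_type

    def add_edge(self, u, v):
        self.neighbors[u].add(v)
        self.neighbors[v].add(u)


def path_count(hin, source, target, meta_path):
    """Count path instances from source to target whose node types follow meta_path."""
    if hin.node_type.get(source) != meta_path[0]:
        return 0
    frontier = {source: 1}  # node -> number of partial paths ending there
    for wanted_type in meta_path[1:]:
        next_frontier = defaultdict(int)
        for node, count in frontier.items():
            for neighbor in hin.neighbors.get(node, ()):
                if hin.node_type[neighbor] == wanted_type:
                    next_frontier[neighbor] += count
        frontier = next_frontier
    return frontier.get(target, 0)


def are_friends(hin, u, v):
    return path_count(hin, u, v, ("User", "User")) > 0


def share_interest(hin, u, v):
    return path_count(hin, u, v, ("User", "Interest", "User")) > 0


def build_example():
    """Hypothetical toy data: three users, two interests, one friendship."""
    hin = HIN()
    for user in ("alice", "bob", "carol"):
        hin.add_node(user, "User")
    for interest in ("chess", "hiking"):
        hin.add_node(interest, "Interest")
    hin.add_edge("alice", "bob")  # friendship edge
    for user, interest in [("alice", "chess"), ("alice", "hiking"),
                           ("bob", "chess"), ("bob", "hiking"),
                           ("carol", "chess")]:
        hin.add_edge(user, interest)
    return hin
```

In the toy network, alice and bob satisfy "friends and same interests", whereas alice and carol only satisfy "friends or same interests" and also match the pattern "not friends but same interests".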

BA: Application of Language Models to Examine beta-Lactam Resistant AMRs

Contact: Steffen Strohm, M.Sc., Christian Beth, M.Sc.

Embedding visualization

For this work, a bidirectional long short-term memory (bi-LSTM) neural network previously used for the study of viral mutations was adapted for bacteria. The network was trained on lactamase and antimicrobial-resistant bacterial DNA sequences, and different settings for the network parameters were studied. For these data sets, embeddings into subsets of R^n were created for different embedding dimensions n and analyzed via visualization methods based on dimensionality reduction. The network's ability to predict the likelihood of certain mutations was used to analyze the correlation between the network's results and biological fitness data of some mutations in the presence of two antibiotics.

BA: Wave-based Damage Detection in Engineering Structures using Artificial Neural Networks

Supervisors:

Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Steffen Strohm, M.Sc.

 

ResNet18

Abstract

Structural health monitoring plays a critical role in various engineering disciplines. Less critical structures are monitored at specified intervals using the conventional method, which relies on an array of sensors to detect damage and is tedious, labour-intensive, and lengthy. Critical infrastructure, in contrast, applies the idea of digital twins, where various physical properties are continuously measured and processed to estimate damage location and intensity. A further technique is the numerical method, which offers a versatile solution for forecasting crack scenarios with location, orientation and length. Numerical methods can generate the response of crack-wave interactions but require considerable computational power. In this study, crack-wave interaction data is generated with the Lattice Element Method and used to train neural network models to predict the location, orientation and length of cracks. The networks 1D-ResNetDense50, 1D-ResNetDecoder34, 1D-ResNetDecoder18 and 1D-SimpleCNN have been implemented in the framework to detect cracks. The work further explores the structure of all models and other selected essential components, such as the loss functions, metric, optimizer, learning rate and threshold. Steps are taken to achieve high accuracy and precision, to eliminate possible error sources, and to retain more information about the investigated data.

BA: Colored Motif Search In Heterogeneous Information Networks

Supervisors:

Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Examples of homogeneous and heterogeneous motifs.

Most graphs contain only nodes and edges of a single type; such graphs are also called homogeneous information networks. Heterogeneous information networks can be seen as an extension of these: they are graphs that consist of differently typed nodes and edges. Because of the types that can be assigned to nodes and edges, they carry richer semantic information than homogeneous information networks do. A network motif is a fundamental building block of a graph, that is, a subgraph that plays an important role in the network structure of the graph. The additional information in heterogeneous information networks can also be used for the problem of network motif discovery, as heterogeneous network motifs contain not only structural information but also semantic information about the node and edge types that appear in them. The problem of network motif discovery in heterogeneous information networks was covered very efficiently by Rossi et al. in their work "Heterogeneous Network Motifs" [1]. This work extends the approach by Rossi et al. so that the input networks can have directed and typed edges. These extensions allow the network motifs to contain richer information based on the edge types, which were not part of the network motifs in the approach shown in [1]. The extended approach is developed step by step, and the performance of the final approach as well as of the intermediate approaches is evaluated on the DBLP dataset. This evaluation shows that directed edges do not have a big influence on the performance, but typed edges do: the average runtime increased by a factor of 2593 compared to the approach by Rossi et al. [1]. Nevertheless, the extensions made in this work greatly increase the amount of information that a network motif can contain, so the increase in runtime is tolerable in relation to the information gained about the network structure from network motifs with directed, typed edges.
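To illustrate what node types add to a motif, the following sketch counts triangles and groups them by the types of their three nodes. It is a brute-force illustration only, not the algorithm of Rossi et al. or the extension developed in this thesis: edges are assumed to be undirected and untyped, and the function name and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def typed_triangle_counts(edges, node_type):
    """Count triangles, keyed by the sorted tuple of their three node types."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    counts = Counter()
    # Brute force over all node triples; fine for toy graphs only.
    for u, v, w in combinations(sorted(adjacency), 3):
        if v in adjacency[u] and w in adjacency[u] and w in adjacency[v]:
            key = tuple(sorted(node_type[n] for n in (u, v, w)))
            counts[key] += 1
    return counts


# Hypothetical bibliographic toy network.
NODE_TYPE = {"a1": "Author", "a2": "Author", "p1": "Paper", "v1": "Venue"}
EDGES = [("a1", "a2"), ("a1", "p1"), ("a2", "p1"), ("p1", "v1"), ("a1", "v1")]
```

A homogeneous count would report two triangles for this graph; the typed count separates them into one (Author, Author, Paper) and one (Author, Paper, Venue) motif.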

For more information please contact Christian Beth, M.Sc.

[1] Rossi, Ryan A., et al. "Heterogeneous Network Motifs." arXiv preprint (2019).

MA: Evaluating Meaningful Metastructures in Heterogeneous Information Networks

Author: Steffen Strohm, M.Sc.

Supervisors:

Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Example Meta-Structures

Abstract

This work is based on the concept of meta structures and ETree traversal by Huang et al. (2016) [1] and the research of Zhu and Cheng (2018, 2019) [2] [3], which focuses on discovering and ranking meta paths using a specifically designed importance function. These works are combined in an attempt to rank meta structures in heterogeneous information networks. To this end, the importance function is modified for this task and then tested on a DBLP subset. The tests include two groups of meta structures set up around the APA and APVTPA meta paths. An expected or intended ranking within these meta structure groups is compared to the rankings found. The results show that one component of the importance function (namely the new or modified path count) tends to dominate the overall importance value in certain situations. Therefore, a redesign of this component with a specific connection to meta structure design is recommended, and some ideas are given on how to achieve this. The other components are also discussed; however, the results show that their influence is more balanced.

 

[1]  Meta Structure: Computing Relevance in Large Heterogeneous Information Networks

[2]  Evaluating Top-k Meta Path Queries on Large Heterogeneous Information Networks

[3]  Effective and Efficient Discovery of Top-k Meta Paths in Heterogeneous Information Networks

 

MA: Probability-based Relevance Search in Uncertain Heterogeneous Information Networks

Author: Niko Amann, M.Sc.

Supervisors:

Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Uncertain Heterogeneous Information Network

Abstract

Today, large multi-typed networks are ubiquitous and form a critical component of modern information infrastructure. Heterogeneous information networks (HINs) are a powerful modeling tool for such networks, and meta-path-based relevance measures make it possible to utilize the rich semantic relations found in these models. Therefore, relevance search in HINs has drawn a lot of interest from researchers in recent years. Moreover, there is an ongoing push towards taking uncertain data into consideration. Not only does the cost of eliminating uncertainty increase with the scale of the data and with its degree of heterogeneity, but we also often lose the opportunity to find important insights by ignoring different degrees of often natural uncertainty. Therefore, this thesis explores the opportunities and challenges that arise when we make uncertainty a first-class citizen by directly incorporating it into the model. For the first time, to the best of our knowledge, the problem of relevance search in probabilistic heterogeneous information networks is explored. We define a model for uncertain HINs using possible-worlds semantics and apply existing state-of-the-art relevance measures to the uncertain scenario. As a result, we present the probabilistic variants expected path count (EPC) and expected path-constrained random walk (EPCRW). For these measures, efficient algorithms are presented that alleviate the otherwise prohibitive complexity of naive approaches. To this end, we employ an incremental computation scheme called Poisson binomial recurrence, which was previously employed successfully for frequent itemset mining in uncertain databases [1]. The proposed approaches were implemented, and experiments on a bibliographical network (DBLP) with artificially added uncertainty demonstrate the gains of the proposed algorithms.
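The gap between the naive and an efficient computation can be sketched on a toy example. The code below assumes that every edge exists independently with a given probability and reads "expected path count" as the expectation of the path count over all possible worlds; the exact definitions and algorithms of the thesis may differ. The edge probabilities and all function names are hypothetical. The last function shows the general form of a Poisson binomial recurrence, which builds the distribution of the number of successes among independent events one event at a time.

```python
from itertools import product

# Hypothetical uncertain network: Author -> Paper -> Venue edges with
# independent existence probabilities.
EDGE_PROBS = {
    ("a1", "p1"): 0.9,
    ("a1", "p2"): 0.5,
    ("p1", "v1"): 0.8,
    ("p2", "v1"): 0.4,
}


def two_hop_paths(edges, source, target):
    """All paths source -> x -> target whose edges are both in `edges`."""
    return [(source, mid, target)
            for (u, mid) in edges
            if u == source and (mid, target) in edges]


def expected_path_count_naive(edge_probs, source, target):
    """Enumerate all 2^|E| possible worlds and average the path count."""
    edge_list = list(edge_probs)
    expectation = 0.0
    for world in product([False, True], repeat=len(edge_list)):
        present = {e for e, keep in zip(edge_list, world) if keep}
        prob = 1.0
        for e, keep in zip(edge_list, world):
            prob *= edge_probs[e] if keep else 1.0 - edge_probs[e]
        expectation += prob * len(two_hop_paths(present, source, target))
    return expectation


def expected_path_count_direct(edge_probs, source, target):
    """Linearity of expectation: sum the existence probability of each path."""
    return sum(edge_probs[(s, m)] * edge_probs[(m, t)]
               for (s, m, t) in two_hop_paths(edge_probs, source, target))


def poisson_binomial_pmf(probs):
    """Distribution of the number of successes among independent events."""
    pmf = [1.0]  # with zero events, the count is 0 with probability 1
    for p in probs:
        extended = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            extended[k] += mass * (1.0 - p)   # event fails: count stays k
            extended[k + 1] += mass * p       # event succeeds: count becomes k + 1
        pmf = extended
    return pmf
```

Both expectation functions return 0.92 for the pair (a1, v1), up to floating-point rounding. Because the two paths in this toy network share no edge, their existence events are independent, so `poisson_binomial_pmf([0.9 * 0.8, 0.5 * 0.4])` gives the full distribution of the path count; this shortcut would not be valid for paths that share edges.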

[1]  Probabilistic frequent itemset mining in uncertain databases

BA: Spatial Semantics Expansion for computing relevance on Heterogeneous Information Networks

Author: Jerome Spindler

Supervisors:

Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Excerpt of the PANGAEA dataset.

Sample map region of the North Atlantic Ocean from the PANGAEA database.

Abstract

In recent years, with the rise of Big Data in both the scientific community and commercial sectors, the demand for solutions for storing and performing computations on heterogeneous data has steadily increased. In response, the heterogeneous information network (HIN) model was conceived as a visually intuitive and semantically connected model for interpreting heterogeneous data. These networks, represented by directed graphs, led to the development of relation models such as Meta Paths and, as a generalization thereof, Meta Structures, which represent subgraph patterns on the network by which relations between objects are determined. Building on these, measures such as Path Count, Struct Count and Structure Constrained Subgraph Expansion (SCSE) were developed to compute the relevance between objects on the network, based on all occurrences of the pattern described by a Meta Path or Meta Structure starting from one designated source object. This work expands the Meta Structure model to allow for edges that are not manifested in the network, yet represent a non-trivial relation that many objects may have with each other. The focus is on spatial features including, but not limited to, distance from each other, inclusion of one in the other, and overlap with each other. These edges are intended to be definable by users and parameterizable per query. Additionally, they are intended to potentially influence the results of relevance computations, allowing for increased expressiveness with regard to weighting properties of objects, such as "the closer the better". Example computation results on excerpt data taken from the PANGAEA database, utilizing Struct Count and SCSE, are provided to demonstrate the efficacy and impact of the proposed expansion, as well as an experimental comparison of implementation approaches for these ephemeral edges with regard to data page access.
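One way to picture an edge that is not stored in the network but derived per query is a distance-based relation. The sketch below is an assumption-laden illustration, not the implementation evaluated in the thesis: it uses the haversine great-circle distance, a linear "the closer the better" weight, and made-up site coordinates that are not taken from PANGAEA.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0

# Hypothetical sampling sites as (latitude, longitude) in degrees.
SAMPLE_SITES = {
    "s1": (54.0, -30.0),
    "s2": (55.0, -30.0),
    "s3": (10.0, -30.0),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def proximity_edges(positions, max_km):
    """Derive (u, v, weight) edges for all pairs at most max_km apart.

    The weight is 1.0 at distance 0 and falls linearly to 0.0 at max_km,
    so closer objects contribute more ("the closer the better").
    """
    edges = []
    for u, v in combinations(sorted(positions), 2):
        distance = haversine_km(positions[u], positions[v])
        if distance <= max_km:
            edges.append((u, v, 1.0 - distance / max_km))
    return edges
```

Calling `proximity_edges(SAMPLE_SITES, 150.0)` yields a single edge between `s1` and `s2`, which lie roughly 111 km apart; changing `max_km` changes the derived edge set without touching the stored network, which is the sense in which such edges are parameterizable per query.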

BA: Relevance Measures in Temporal Heterogeneous Information Networks

Author: Fabian Krüger

Supervisors:

Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Temporal HIN example from PANGAEA-dataset.

An example excerpt of the PANGAEA-dataset, modelled as a temporal HIN.

Abstract

Many correlations and real systems can be modeled as heterogeneous information networks (HINs). Unlike a homogeneous information network, an HIN is a directed graph containing more than one type of node or more than one type of edge. To measure the relevance between two of these nodes, we use Struct Count and Structure Constrained Subgraph Expansion (SCSE) for meta structures. These measures indicate, in different ways, how strongly two nodes are related with respect to a given meta structure.
To this end, we search the given graph for instances of the meta structure, proceeding topologically along its layers. Meta structures describe the relationship of one node to another by indicating types of edges. They are more expressive than meta paths, which can be considered special meta structures that have no branches.
Current publications do not consider temporal aspects when measuring relevance with meta structures in HINs. For that reason, different concepts are presented in this thesis. Furthermore, we implement an HIN in Python with the help of the NetworkX package, add Struct Count and SCSE to it, and evaluate the new approaches.

BA: Locating Photo Files from the Kiel City Archive Based on Text Labels (Verortung von Fotodateien aus dem Kieler Stadtarchiv auf der Basis von Textlabels)

Supervisors:

Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Abstract

Many institutions hold data collections that have grown over time and are rich in information, but that, because of their form, can often only be processed manually. Enriching poorly structured data holdings with structured metadata makes them accessible to modern methods of automated knowledge discovery. Using the photo database of the Kiel City Archive as an example, this thesis develops a method that automatically extracts the location references occurring in the text labels assigned to the photos and provides them in a structured form as additional metadata. To this end, the text labels are searched for place names, which are then further specified by examining their linguistic context. In addition, the detected locations are georeferenced. The method can extract a location for approx. 61% of the available photos. Locating and georeferencing the photos opens up new uses for the collection, such as a geographic search in the archive holdings. The example shows that the automated preparation of larger data holdings can be an alternative to manual structuring, or can at least simplify it considerably.

MA: Overcoming Oversmoothing in Graph Neural Networks

Author: Christian Beth, M.Sc.

Supervisor: Prof. Dr. Matthias Renz

Graph Attention Network (GAT) Convolution

Abstract

Graphs are data structures that are applied in a wide variety of fields, ranging from medical and life sciences to social networks. Recently, graph neural networks (GNNs) have proven to be a seminal tool for solving a multitude of graph-related tasks across these domains, including representation learning, node classification, and link prediction. The success of these approaches is attributed to the effect of Laplacian smoothing over the node features that occurs during the GNN filtering. This Laplacian smoothing effectively acts as a low-pass filter over the node features, which functions as a de-noiser and thus improves feature quality and performance on the learning task. However, state-of-the-art architectures run into the problem of oversmoothing when too many GNN layers are stacked. Oversmoothing occurs when the filtered features become too similar to each other - to the point where they become hard to distinguish for a learning task, which can greatly hurt performance. To overcome the oversmoothing problem, this work proposes a high-pass filter that - similarly to edge detectors in convolutional neural networks (CNNs) - highlights signal parts where large changes occur. Finally, the approach is evaluated in comparison to several state-of-the-art architectures on various real-world graphs.
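The contrast between low-pass smoothing and a high-pass response can be seen on a toy graph. The sketch below is not the filter proposed in the thesis: plain neighborhood averaging (including the node itself) stands in for the smoothing step of a GNN layer, the high-pass signal is simply the difference between a feature and its smoothed version, and the graph, features and function names are hypothetical.

```python
# Hypothetical toy graph: a path 0-1-2-3-4-5 with one scalar feature per node.
ADJ = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
X = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]


def low_pass(adj, x):
    """One smoothing step: average each node's feature with its neighbors'."""
    return [(x[i] + sum(x[j] for j in adj[i])) / (1 + len(adj[i]))
            for i in range(len(x))]


def high_pass(adj, x):
    """Difference between a feature and its smoothed version."""
    smoothed = low_pass(adj, x)
    return [xi - si for xi, si in zip(x, smoothed)]


def stack_low_pass(adj, x, layers):
    """Apply the smoothing step repeatedly, as when stacking many layers."""
    for _ in range(layers):
        x = low_pass(adj, x)
    return x


def spread(x):
    """Largest difference between any two node features."""
    return max(x) - min(x)
```

On this graph, `high_pass(ADJ, X)` is zero everywhere except at nodes 2 and 3, where the feature jumps from 0 to 1, much like an edge detector. Repeated smoothing with `stack_low_pass` drives `spread` towards zero, which is the toy analogue of node features becoming indistinguishable under oversmoothing.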

MA: Data Mining for deeper Understanding of Cyanobacteria Blooms in the Baltic Sea

Supervisor: Prof. Dr. Matthias Renz

Cyanobacteria teaser

Abstract

Cyanobacteria (blue-green algae) blooms are of growing societal concern in the Baltic Sea. The potentially toxic algae deteriorate water quality and add extra nutrients to an already overfertilized system. Consequently, a comprehensive understanding of the controlling mechanisms is essential if eutrophication is to be managed effectively. This master project investigated the controlling factors that promote cyanobacteria mass accumulation in the Baltic Sea. The underlying database consisted of a combination of numerical ocean model output, satellite observations and in-situ nutrient samples (collected from international partners and compiled by GEOMAR Helmholtz Centre for Ocean Research Kiel). Support vector machines, decision trees and random forest models were examined analytically and experimentally and compared with one another. Major challenges were posed by the heterogeneity, sparsity and noisiness of the available data, as well as by the large number of factors potentially inducing bacterial growth. The results showcase a remarkable forecast skill based on abiotic factors alone. Somewhat counterintuitively, ambient nutrient concentrations have only minor explanatory power.

For more information please contact Dr. Ulrike Löptien

MA: Scalable Methods for Traffic Optimization based on a Road Demand & Supply Model Strategy

Supervisor: Prof. Dr. Matthias Renz

BA: Development of a Platform for Evaluating Traffic Flow Optimization Methods (Erstellung einer Plattform zur Evaluierung von Verkehrsflussoptimierungs-Methodiken)

Supervisor: Prof. Dr. Matthias Renz