Archaeoinformatics - Data Science


This is a list of all open and completed thesis topics (both Bachelor's and Master's theses). For more information please contact the supervisor(s).

Open Topics

BA/MA: Data Science Applications in Marine Sciences

There are multiple open topics available that target data science applications in marine science. If you are interested in one of the topics or have an idea about a related topic, please contact Carola Trahms, M.Sc.
You can find more suggested topics below.

Fish Larvae Trajectories in the Mediterranean Sea

Example data that can be encountered in marine data science.


BA/MA: Efficient spatio-temporal indexing for HINs: Getting from measurement tables to HINs

Example of a spatial HIN for marine data science

Much of the data and measurements obtained in marine science are of a spatial and/or temporal nature, i.e. they are associated with geo-coordinates or time stamps.
These spatio-temporal properties can be leveraged to obtain new and deeper insights into the data. However, managing spatial and temporal data often requires careful indexing in order to remain efficient. In this thesis you will study efficient methods to index spatio-temporal data for Heterogeneous Information Networks (HINs), which are large graphs in which different types of nodes and relationships are modelled. This topic offers the opportunity to hone skills and techniques learned in lectures like Information Systems, Geo-Information Systems, and Methods of Efficient Similarity Search in Large Databases (although the latter two are not prerequisites).

BA/MA: (Linear) combinations of MetaStructures for Clustering or Community Detection in (schema-rich) HINs

Heterogeneous Information Networks (HINs) are graphs in which nodes have different types and edges form different relationships between the nodes (a homogeneous information network would just be a plain graph). In all graphs, but especially in HINs, it is of great interest to find groups or communities that exhibit similar behavior or are more closely related to one another. Meta structures are complex relationships in HINs that can be used to express the 'relatedness' of nodes within the graph, and thus provide a powerful tool for clustering or community detection. In this thesis you will study efficient (fast, scalable) and effective (meaningful) clustering and community detection algorithms using (linear) combinations of meta structures. Participation in a lecture such as KDDM (or similar) is required.


BA/MA: Text Mining and Knowledge Extraction in Marine Sciences

There are multiple open topics available that target text mining and knowledge extraction from text with applications in marine science. If you are interested in one of the topics, or have an idea about a related topic, please contact Asif Suryani, M.Sc.
You can find more suggested topics below.

BA/MA: Study and Evaluation: From NER to Network Representation of Scientific Text

named-entity recognition example

Named entity recognition (NER) is the task of finding and extracting relevant entities from text. Scientific measurements and their associated values are of particular interest in this scenario, as is the automatic recognition of locations, institutions, persons, etc. Once these entities are extracted from a text, the task is to construct a network representation (e.g. a heterogeneous information network) of the text document at hand, where the challenge lies in predicting the appropriate relationships between the extracted entities (e.g. linking an extracted quantity 'mass' to its respective measurement '42' and unit 'kilograms'). The goal of this thesis is to develop and study novel techniques for NER and to link the extracted entities into a network representation of the document.

BA: Scientific Text Parser: An Interactive and Intelligent Approach

In this bachelor's thesis the task is to develop a framework for a parsing toolkit that reads and summarizes scientific text documents, tailored to the needs of the user. The target domain for these studies will be scientific texts from marine science.

MA: Pre-trained Language Models for Domain-driven Q/A

In this master's thesis the objective is to leverage novel, state-of-the-art pre-trained language models (such as BERT) to facilitate automated question answering (Q/A). The target domain for these studies will be scientific texts from marine science.

BA/MA: Thesis in Data Science - Classification of Soil Layers based on Georadar Data

Contact: Steffen Strohm, M.Sc.


We are looking for an interested student who wants to engage in the classification of soil layers based on georadar data for a bachelor's or master's thesis. The aim is to develop data science methods for the automatic detection of sedimentary layer boundaries and for the automatic naming of geological units.

The topic is embedded in the cooperation between the Institute for Computer Science at the CAU Kiel and the Centre for Baltic and Scandinavian Archaeology in Schleswig within the framework of the CRC 1266. No previous knowledge of geophysical data processing/georadar or the archaeology of the Paleolithic or Mesolithic is required.

Example of Soil Layers

Example of Geo Radar Data


Thesis Description (PDF)

BA/MA: Thesis in Data Science - Magnetic Pattern Classification

Contact: Steffen Strohm, M.Sc.

Magnetic prospection data depicts the remains of archaeological sites. We investigate the sites of the Cucuteni-Tripolye Culture via the magnetic anomalies of burned houses to infer the social structure. From the magnetic anomaly we derive the magnetization distribution, which correlates with the mass of burnt clay. The magnetization patterns of the building remains are directly related to the architecture, which in turn can reflect social structures.

The settlements of the Cucuteni-Tripolye Culture are characterized by large sites with up to several thousand buildings, a concentric settlement layout, and a high degree of standardization in the floor plans of the houses. The figure shows a section of the magnetic anomaly map (a), the magnetization distribution derived from it (b), and three examples of house anomalies in comparison to a standardized floor plan (c).

Example of a magnetic anomaly map, derived magnetization distribution and examples of house anomalies. 


There are two main questions related to the magnetic data of these sites, which can be tackled with supervised or unsupervised learning or data mining techniques:

(1) How can the building remains be automatically identified?
(2) Are there different clusters of buildings in terms of their magnetization patterns? If so, how are these clusters related to the ideas of standardized floor plans?


This work is a collaborative project with scientists from subproject G2 (Geophysics) and subproject D1 (Archaeology) within the CRC 1266 "Scales of Transformation" at Kiel University. No previous knowledge of Geophysics or Archaeology is required.

MA: Evaluating Protein Networks – Puzzling the Biological Secret Behind the Data

Contact: Christian Beth, M.Sc.

Example Protein Network

Protein interaction and function overview (figure taken from [1])

The development of novel metallic biomaterials as implants for medical application requires a careful elucidation of impaired or improved physiological processes and responses. Therefore, we establish in vitro models to analyse cellular reactions on different biological levels. Fundamental information can be obtained from analysing the variations in protein synthesis. Functional interactions, relationships, and correlations between the tremendously diverse proteins lead to multifaceted and complex networks. Structured data science will help to elucidate these networks and contribute to defining sensitive cellular processes and regularities. The main task will be the classification of big data sets in order to read out principles and patterns. This work will be based on the creative embedding of biological data into mathematical models.

[1] Understanding Protein Networks Using Vester's Sensitivity Model, Moreno, LA et al. in IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 17, no. 4, pp. 1440-1450, (2020)

MA: Identification and Synchronization of Events in Data Series from Lake Sediments

Contact: Steffen Strohm, M.Sc.

The preliminary title of this master's thesis points towards a set of possible subproblems in the context of mining time series data derived from lake sediments. The project focusses on analysing these time series to identify patterns that represent events, environmental conditions or human impact. Identifying these patterns and comparing the time series with other, independent climate data series will help to understand (multifactorial) transformation processes, their role and their temporal dynamics.

Example of synchronized time series derived from different data sources.

Example of time series derived from several data sources. Figure taken from [1].

This work is a collaborative project with scientists from subproject F2 (Geoarchaeology) within the CRC 1266 "Scales of Transformation" at Kiel University. No previous knowledge in Geoarchaeology is required.


Project Background

The Collaborative Research Centre (CRC) 1266 "Scales of Transformation" focuses on the investigation of human-environment interactions. Within the CRC, project F2 collects data through pollen-analytical and geochemical analyses of annually laminated lake sediments from northern Germany. These data form time series that reflect changes within the lake and in the surrounding landscape, and they capture both climatic and human influences. A comparison with independent climate data series is intended to help better understand the often multifactorial changes as well as the role and the temporal dynamics (e.g. the question of synchronicity or asynchronicity) of individual processes.


[1] Feeser I, Dörfler W, Czymzik M, Dreibrodt S. A mid-Holocene annually laminated sediment sequence from Lake Woserin: The role of climate and environmental change for cultural development during the Neolithic in Northern Germany. The Holocene. 2016;26(6):947-963.

Ongoing Topics

This is a list of ongoing thesis topics. For further information, or if you wish to suggest your own topic, contact the responsible supervisor(s).

MA: Extraction of Scientific Measurements from Text

named-entity recognition example with scientific measurements

The goal of this work is to automatically extract scientific measurements, such as 4 kg/m2, from papers. One of the many challenges is that the same quantities are often written in different ways (metre, meter, m, etc.). A long-term goal is to also link the extracted quantities to their geo-spatial location (if given somewhere in the text).
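As a rough illustration of the unit-variation challenge, the following minimal sketch pairs a regular expression with a small alias table. The alias table, the pattern, and the function name are assumptions made for this example and are not the method developed in the thesis; a real system would need far broader unit coverage and context handling.

```python
import re

# Hypothetical lookup table mapping unit spellings to one canonical symbol.
UNIT_ALIASES = {
    "m": "m", "meter": "m", "meters": "m", "metre": "m", "metres": "m",
    "kg": "kg", "kilogram": "kg", "kilograms": "kg",
    "kg/m2": "kg/m2", "kg/m^2": "kg/m2",
}

# A number (optional sign and decimals) followed by a unit-like token,
# optionally with a "/unit" part and a single-digit exponent.
MEASUREMENT = re.compile(
    r"(?<![\w.])([-+]?\d+(?:\.\d+)?)\s*([A-Za-z]+(?:/[A-Za-z]+(?:\^?\d)?)?)"
)


def extract_measurements(text):
    """Return (value, canonical_unit) pairs for units listed in UNIT_ALIASES."""
    results = []
    for number, unit in MEASUREMENT.findall(text):
        canonical = UNIT_ALIASES.get(unit.lower())
        if canonical is not None:  # skip tokens that are not known units
            results.append((float(number), canonical))
    return results


sample = "The core was 3.5 metres long, 2 meter wide and weighed 4 kg/m2."
print(extract_measurements(sample))
```

Normalizing through a lookup table keeps the pattern simple, at the cost of silently ignoring any unit spelling that is missing from the table.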

For more information please contact Asif Suryani

BA/MA: Investigating Fitness Effects for Beta-Lactamase with Recurrent Neural Networks

Contact: Steffen Strohm, M.Sc., Christian Beth, M.Sc.

In this work the goal is to predict the fitness of bacteria against beta-lactam-based antibiotics in silico (in the computer). This is done with protein sequence data from the beta-lactamase enzyme, which is crucial for the survival of beta-lactam-resistant bacteria. The performance of the predictor is then evaluated on real-world data obtained from in vitro tests (in the lab).

MA: Extending Expressiveness of Meta-Structures in Heterogeneous Information Network

Contact: Christian Beth, M.Sc.

Heterogeneous information networks (HINs) are graphs in which nodes have different types and edges form different relations between the nodes. They thus allow semantically rich modelling of virtually any kind of data and information, ranging from protein-protein interaction networks to bibliographical networks. The meta-path is a composite relationship between nodes in an HIN and an integral part of state-of-the-art similarity/relevance measures in HINs, which in turn underpin downstream data mining tasks such as clustering, classification, or link prediction in HINs. To allow for more powerful, complex, and expressive relationships, the meta-structure was developed. It flexibly combines meta-path relations with the 'and'-linkage, which allows the user to specify more precisely what she is looking for. But why stop at the 'and'-linkage? Why not consider 'or', 'not', or other logical constraints? In this thesis you will design and study effective (meaningful) and efficient (fast, scalable) relevance measures based on more expressive meta-structures.
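To make the idea of logical linkages concrete, here is a deliberately simplified sketch on a toy bibliographic network. It only combines the sets of nodes reachable via two meta-paths using set operations; real meta-structure semantics constrain individual instances and yield graded relevance scores, so this should be read as an illustration of the question, not as an answer to it. All node names, edge types, and function names are invented for the example.

```python
from collections import defaultdict

# Toy bibliographic HIN; node names and edge types are made up.
EDGES = [
    ("alice", "writes", "paper1"), ("bob", "writes", "paper1"),
    ("alice", "writes", "paper2"), ("carol", "writes", "paper2"),
    ("dave", "writes", "paper3"),
    ("paper1", "published_in", "KDD"), ("paper2", "published_in", "ICDM"),
    ("paper3", "published_in", "KDD"),
]


def build_index(edges):
    """Index neighbours by (node, edge type); edges are treated as undirected."""
    index = defaultdict(set)
    for source, edge_type, target in edges:
        index[(source, edge_type)].add(target)
        index[(target, edge_type)].add(source)
    return index


def reachable(index, start, meta_path):
    """Nodes reachable from `start` by following the edge types in `meta_path`."""
    frontier = {start}
    for edge_type in meta_path:
        frontier = set().union(*(index[(node, edge_type)] for node in frontier))
    return frontier - {start}


index = build_index(EDGES)
# Author-Paper-Author: co-authors of alice.
coauthors = reachable(index, "alice", ["writes", "writes"])
# Author-Paper-Venue-Paper-Author: authors sharing a venue with alice.
same_venue = reachable(
    index, "alice", ["writes", "published_in", "published_in", "writes"]
)

print("and:", sorted(coauthors & same_venue))
print("or: ", sorted(coauthors | same_venue))
print("same venue and not co-author:", sorted(same_venue - coauthors))
```

In this toy network the 'not' combination isolates authors who publish at the same venues as alice but have never written a paper with her, a query that a pure 'and'-linkage cannot express.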

Completed Topics

This is a list of successfully completed bachelor's and master's theses. If you are interested in one of the topics (e.g. for your own thesis), please contact the respective supervisor(s) for further information and other current or open topics.

BA: Wave-based Damage Detection in Engineering Structures using Artificial Neural Networks


Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Steffen Strohm, M.Sc.




Structural health monitoring plays a critical role in various engineering disciplines. Less critical structures are inspected at specified intervals using the conventional method, which relies on an array of sensors to detect damage and is tedious, labour-intensive, and lengthy. Critical infrastructure, in contrast, applies the idea of digital twins, where various physical properties are continuously measured and processed to estimate damage location and intensity. Another technique that serves this purpose is numerical simulation, which offers a versatile solution for forecasting crack scenarios with location, orientation and length. Numerical methods can generate the response of crack-wave interactions but require a considerable amount of computing power. In this study, crack-wave interaction data is generated with the Lattice Element Method and used to train neural network models to predict the location, orientation and length of cracks. The 1D-ResNetDense50, 1D-ResNetDecoder34, 1D-ResNetDecoder18 and 1D-SimpleCNN networks have been implemented in the framework to detect cracks. The work further explores the structure of all models and other selected essential components, such as the loss functions, metrics, optimizer, learning rate and threshold. Steps are taken to achieve high accuracy and precision, to eliminate possible error sources, and to retain more information about the investigated data.

BA: Colored Motif Search In Heterogeneous Information Networks


Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Examples of homogeneous and heterogeneous motifs.

Most graphs only contain nodes and edges of a single type; such graphs are also called homogeneous information networks. Heterogeneous information networks can be seen as an extension of them: they are graphs that consist of differently typed nodes and edges, and because of these types they carry richer semantic information than homogeneous information networks do. A network motif is a fundamental building block of a graph, i.e. a subgraph that plays an important role in the network structure of the graph. The additional information in heterogeneous information networks can also be used for network motif discovery, as heterogeneous network motifs contain not only structural information but also semantic information about the node and edge types that appear in them. The problem of network motif discovery in heterogeneous information networks was covered very efficiently by Rossi et al. in their work "Heterogeneous Network Motifs" [1]. This work extends the approach by Rossi et al. so that the input networks can have directed and typed edges. These extensions allow the network motifs to contain richer information based on the edge types, which had not been part of the network motifs in the approach shown in [1]. The extended approach is developed step by step, and the performance of the final approach as well as of the intermediate approaches is evaluated on the DBLP dataset. This evaluation shows that directed edges do not have a big influence on the performance, but typed edges do: the average runtime compared to the approach by Rossi et al. [1] increases by a factor of 2593. Nevertheless, the extensions made in this work greatly increase the amount of information that a network motif can contain, so the increase in runtime is tolerable when seen in relation to the information gained about the network structure from network motifs with directed, typed edges.

For more information please contact Christian Beth, M.Sc.

[1] Rossi, Ryan A., et al. "Heterogeneous Network Motifs." arXiv preprint (2019)

MA: Evaluating Meaningful Metastructures in Heterogeneous Information Networks

Author: Steffen Strohm, M.Sc.


Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Example Meta-Structures


This work is based on the concept of meta structure and ETree traversal by Huang et al. 2016 [1] and the research of Zhu and Cheng 2018 and 2019 [2] [3], which focuses on discovering and ranking meta paths using a specifically designed importance function. These two lines of work are combined in an attempt to rank meta structures in heterogeneous information networks. The importance function is modified for this task and then tested on a DBLP subset. Tests include two groups of meta structures set up around the APA and APVTPA meta paths. An expected or intended ranking within these meta structure groups is compared to the rankings found. The results show that one of the components of the importance function (namely the new or modified path count) tends to dominate the overall importance value in certain situations. Therefore, a redesign of this component with a specific connection to meta structure design is recommended, and some ideas on how to achieve this are given. The other components are also discussed; the results show, however, that their influence is more balanced.


[1]  Meta Structure: Computing Relevance in Large Heterogeneous Information Networks

[2]  Evaluating Top-k Meta Path Queries on Large Heterogeneous Information Networks

[3]  Effective and Efficient Discovery of Top-k Meta Paths in Heterogeneous Information Networks


MA: Probability-based Relevance Search in Uncertain Heterogeneous Information Networks

Author: Niko Amann, M.Sc.


Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Uncertain Heterogeneous Information Network


Today, large multi-typed networks are ubiquitous and form a critical component of modern information infrastructure. Heterogeneous information networks (HINs) are a powerful modeling tool for such networks, and meta-path based relevance measures make it possible to exploit the rich semantic relations found in these models. Therefore, relevance search in HINs has drawn a lot of interest from researchers in recent years. Moreover, there is an ongoing push towards taking uncertain data into consideration. Not only does the cost of eliminating uncertainty increase with the scale of the data and with its degree of heterogeneity, but we also often lose the opportunity to find important insights by ignoring different degrees of often natural uncertainty. Therefore, this thesis explores the opportunities and challenges that arise when we make uncertainty a first-class citizen by directly incorporating it into the model. For the first time, to the best of our knowledge, the problem of relevance search in probabilistic heterogeneous information networks is explored. We define a model for uncertain HINs using the possible worlds semantics and apply existing state-of-the-art relevance measures to the uncertain scenario; as a result, we present the probabilistic variants expected path count (EPC) and expected path-constrained random walk (EPCRW). For these measures, efficient algorithms are presented that alleviate the otherwise prohibitive complexity of naive approaches. To this end, we employ an incremental computation scheme called Poisson binomial recurrence that was previously employed successfully for frequent itemset mining in uncertain databases [1]. The proposed approaches were implemented, and experiments on a bibliographical network (DBLP) with artificially added uncertainty demonstrate the gains of the proposed algorithms.

[1]  Probabilistic frequent itemset mining in uncertain databases
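The Poisson binomial recurrence mentioned in the abstract can be sketched generically as follows. The sketch computes the distribution of the number of successes among independent Bernoulli trials with different success probabilities, one trial at a time. Treating path instances as independent trials, as well as the example probabilities and names, are simplifying assumptions for illustration; the thesis's actual algorithms for EPC and EPCRW may differ, since path instances that share edges are correlated.

```python
def poisson_binomial_pmf(probabilities):
    """Distribution of the number of successes among independent Bernoulli trials.

    After processing trial i, the probability of k successes follows
    pmf_i[k] = pmf_{i-1}[k] * (1 - p_i) + pmf_{i-1}[k - 1] * p_i.
    """
    pmf = [1.0]  # with zero trials, zero successes has probability 1
    for p in probabilities:
        updated = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            updated[k] += mass * (1.0 - p)  # trial fails
            updated[k + 1] += mass * p      # trial succeeds
        pmf = updated
    return pmf


# Hypothetical existence probabilities of three path instances.
path_probabilities = [0.9, 0.5, 0.2]
pmf = poisson_binomial_pmf(path_probabilities)
expected_count = sum(k * mass for k, mass in enumerate(pmf))

print(pmf)
print(expected_count)
```

The recurrence needs quadratic time in the number of trials, whereas enumerating all possible worlds would need exponential time. By linearity of expectation, the expected count equals the sum of the individual probabilities, which serves as a quick sanity check.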

BA: Spatial Semantics Expansion for computing relevance on Heterogeneous Information Networks

Author: Jerome Spindler


Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Excerpt of the PANGAEA dataset.

Sample map region of the North Atlantic Ocean from the PANGAEA database.


In recent years, with the rise of Big Data within both the scientific community and commercial sectors, the demand for solutions for storing and performing computations on heterogeneous data has steadily increased. As a result, the heterogeneous information network (HIN) model was conceived as a visually intuitive and semantically connected model for interpreting heterogeneous data. These networks, represented by directed graphs, led to the development of relation models such as Meta Paths and, as a generalization thereof, Meta Structures, which represent subgraph patterns on the network that determine relations between objects. Building on these, Path Count, Struct Count and Structure Constrained Subgraph Expansion (SCSE) were developed as measures of relevance between objects on the network, based on all occurrences of the pattern described by a Meta Path or Meta Structure starting from one designated source object. This work looks to expand the Meta Structure model to allow for edges that are not manifested on the network, yet represent a non-trivial relation that many objects may have with each other. The focus is on spatial features including, but not limited to, distance from each other, inclusion of one in the other, and overlap with each other. These edges are intended to be definable by users and parameterizable per query. Additionally, they are intended to potentially influence the results of relevance computations, allowing for increased expressiveness with regard to weighting properties of objects, such as "the closer the better". Example computation results on excerpt data taken from the PANGAEA database, utilizing Struct Count and SCSE, are provided to demonstrate the efficacy and impact of the proposed expansion, as well as an experimental comparison of implementation approaches for these ephemeral edges with regard to data page access.
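A distance-based ephemeral edge of the kind described above could look roughly like the following sketch, which derives edges on the fly from coordinates instead of storing them in the network. The haversine distance, the linear "the closer the better" weight, and the station data are assumptions chosen for illustration, not the definitions used in the thesis.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def ephemeral_edges(nodes, max_km):
    """Yield (a, b, weight) for node pairs within max_km (which must be positive).

    The weight decays linearly with distance: "the closer the better".
    """
    names = sorted(nodes)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            distance = haversine_km(*nodes[a], *nodes[b])
            if distance <= max_km:
                yield a, b, 1.0 - distance / max_km


# Hypothetical sampling stations as (latitude, longitude).
stations = {"s1": (54.0, 10.0), "s2": (54.0, 10.5), "s3": (60.0, 20.0)}
edges = list(ephemeral_edges(stations, max_km=100.0))
print(edges)
```

Because `max_km` is passed per call, the same network can be queried with different distance thresholds without rewriting any stored edges. The pairwise loop is quadratic in the number of nodes, so a spatial index would be needed for larger data sets.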

BA: Relevance Measures in Temporal Heterogeneous Information Networks

Author: Fabian Krüger


Prof. Dr. Matthias Renz

Christian Beth, M.Sc.

Temporal HIN example from PANGAEA-dataset.

An example excerpt of the PANGAEA-dataset, modelled as a temporal HIN.


Many correlations and real systems can be modeled as heterogeneous information networks (HINs). Unlike a homogeneous information network, an HIN is a directed graph containing more than one type of node or more than one type of edge. To measure the relevance between two of these nodes, we use Struct Count and Structure Constrained Subgraph Expansion (SCSE) for meta structures. These measures indicate, in different ways, how strongly two nodes are related with respect to a given meta structure.
To this end, we search the given graph for instances of the meta structure, following its layers in topological order. Meta structures describe the relationship of one node to another by indicating types of edges. They are more expressive than meta paths, which can be considered special meta structures that have no branches.
Current publications do not consider temporal aspects when measuring relevance with meta structures in HINs. For that reason, different concepts are presented in this thesis. Furthermore, we implement an HIN in Python with the help of the NetworkX package, add Struct Count and SCSE to it, and evaluate the new approaches.
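One conceivable way to bring time into such a measure is to count only those pattern instances whose edges fall into a given time window. The sketch below does this for the simpler Path Count on a meta path (the branch-free special case), using plain dictionaries rather than NetworkX to stay self-contained. The data, the time-window concept, and all names are assumptions for illustration and are not necessarily among the concepts presented in the thesis.

```python
from collections import defaultdict

# Hypothetical temporal HIN: (source, edge type, target, year of the edge).
EDGES = [
    ("cruise1", "sampled", "sample1", 2010),
    ("cruise1", "sampled", "sample2", 2015),
    ("sample1", "measured", "salinity", 2010),
    ("sample2", "measured", "salinity", 2015),
]


def path_count(edges, source, target, meta_path, start=None, end=None):
    """Count meta-path instances from source to target.

    Only directed edges whose year lies in [start, end] are used;
    a bound of None leaves that side of the window open.
    """
    adjacency = defaultdict(list)
    for u, edge_type, v, year in edges:
        if (start is None or year >= start) and (end is None or year <= end):
            adjacency[(u, edge_type)].append(v)

    counts = {source: 1}  # number of partial instances ending at each node
    for edge_type in meta_path:
        next_counts = defaultdict(int)
        for node, count in counts.items():
            for neighbour in adjacency[(node, edge_type)]:
                next_counts[neighbour] += count
        counts = next_counts
    return counts.get(target, 0)


meta_path = ["sampled", "measured"]
print(path_count(EDGES, "cruise1", "salinity", meta_path))
print(path_count(EDGES, "cruise1", "salinity", meta_path, start=2012))
```

Propagating counts layer by layer avoids enumerating every instance explicitly, which matters once nodes have many neighbours.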

BA: Verortung von Fotodateien aus dem Kieler Stadtarchiv auf der Basis von Textlabels (Locating Photo Files from the Kiel City Archive Based on Text Labels)


Prof. Dr. Matthias Renz

Christian Beth, M.Sc.


Many institutions hold historically grown data collections with high information content which, because of their form, can often only be processed manually. Enriching poorly structured data holdings with structured metadata makes them accessible to modern methods of automated knowledge discovery. Using the photo database of the Kiel City Archive as an example, this thesis develops a method that automatically extracts the location references occurring in the text labels assigned to the photos and provides them in structured form as additional metadata. To this end, the text labels are examined for occurrences of place names, which are then specified further by analysing their linguistic context. In addition, the recognized locations are georeferenced. The method can extract a location for approx. 61% of the available photos. Locating and georeferencing the photos opens up new uses for the collection, such as a geographic search within the archive holdings. The example shows that the automated preparation of larger data holdings can be an alternative to manual structuring, or can at least simplify it considerably.

MA: Overcoming Oversmoothing in Graph Neural Networks

Author: Christian Beth, M.Sc.

Supervisor: Prof. Dr. Matthias Renz

Graph Attention Network (GAT) Convolution


Graphs are data structures that are used in a wide variety of application fields, ranging from medical and life science to social networks. Recently, graph neural networks (GNNs) have proven to be a seminal tool for solving a multitude of graph-related tasks across these domains, including representation learning, node classification, and link prediction. The success of these approaches is attributed to the effect of Laplacian smoothing over the node features that occurs during the GNN filtering. This Laplacian smoothing effectively acts as a low-pass filter over the node features, which functions as a de-noiser and thus improves feature quality and performance on the learning task. However, state-of-the-art architectures run into the problem of oversmoothing when too many GNN layers are stacked. Oversmoothing occurs when the filtered features become too similar to each other, to the point where they become hard to distinguish for a learning task, which can greatly hurt performance. To overcome the oversmoothing problem, this work proposes a high-pass filter that, similarly to edge detectors in convolutional neural networks (CNNs), highlights signal parts where large changes occur. Finally, the approach is evaluated in comparison to several state-of-the-art architectures on various real-world graphs.
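The oversmoothing effect and the intuition behind a high-pass filter can be reproduced on a tiny example. The sketch below uses plain neighbourhood averaging (including the node itself) as a stand-in for the low-pass filtering in a GNN layer and defines the high-pass response as the part that averaging removes. Both choices, along with the graph and the function names, are simplifying assumptions and not the filter proposed in the thesis.

```python
def smooth(features, neighbours):
    """One low-pass step: replace each value by the mean over the node and its neighbours."""
    return [
        (features[i] + sum(features[j] for j in neighbours[i]))
        / (1 + len(neighbours[i]))
        for i in range(len(features))
    ]


def high_pass(features, neighbours):
    """Complementary high-pass step: the part of the signal the low-pass step removes."""
    smoothed = smooth(features, neighbours)
    return [x - s for x, s in zip(features, smoothed)]


def spread(values):
    """Difference between the largest and smallest feature value."""
    return max(values) - min(values)


# Path graph 0-1-2-3-4-5 with a step in the node features between nodes 2 and 3.
neighbours = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
features = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

deep = features
for _ in range(50):  # stacking many smoothing layers
    deep = smooth(deep, neighbours)

response = high_pass(features, neighbours)
print(spread(features), spread(deep))
print(response)
```

After many smoothing steps the node features become nearly indistinguishable, while the high-pass response is non-zero only at the two nodes adjacent to the step, much like an edge detector applied to an image.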

MA: Data Mining for deeper Understanding of Cyanobacteria Blooms in the Baltic Sea

Supervisor: Prof. Dr. Matthias Renz

cyano bacteria teaser


Cyanobacteria (blue-green algae) blooms are of growing societal concern in the Baltic Sea. The potentially toxic algae deteriorate water quality and add extra nutrients to an already overfertilized system. Consequently, a comprehensive understanding of the controlling mechanisms is essential if eutrophication is to be managed effectively. This master's project investigated the controlling factors that promote cyanobacteria mass accumulation in the Baltic Sea. The underlying database consisted of a combination of numerical ocean model output, satellite observations and in-situ nutrient samples (collected from international partners and compiled by GEOMAR Helmholtz Centre for Ocean Research Kiel). Support vector machines, decision trees and random forest models were examined analytically and experimentally and compared with one another. Major challenges were posed by the heterogeneity, sparsity and noisiness of the available data, as well as by the large number of factors potentially inducing bacterial growth. The results showcase a remarkable forecast skill based on abiotic factors alone. Somewhat counter to intuition, ambient nutrient concentrations have only minor explanatory power.

For more information please contact Dr. Ulrike Löptien

MA: Scalable Methods for Traffic Optimization based on a Road Demand & Supply Model Strategy

Supervisor: Prof. Dr. Matthias Renz

BA: Erstellung einer Plattform zur Evaluierung von Verkehrsflussoptimierungs-Methodiken (Creating a Platform for Evaluating Traffic Flow Optimization Methods)

Supervisor: Prof. Dr. Matthias Renz