Archaeoinformatics - Data Science

Open Topics

This is a list of open thesis topics. For further information, or if you wish to suggest your own topic, contact the responsible supervisor(s).

BA/MA: Data Science Applications in Marine Sciences

There are multiple open topics available that target data science applications in marine science. If you are interested in one of the topics or have an idea about a related topic, please contact Carola Trahms, M.Sc.
You can find more suggested topics below.

Fish Larvae Trajectories in the Mediterranean Sea

Example data that can be encountered in marine data science.


BA/MA: Efficient spatio-temporal indexing for HINs: Getting from measurement tables to HINs

Example of a spatial HIN for marine data science

Much of the data and measurements obtained in marine science are of a spatial and/or temporal nature, i.e. they are associated with geo-coordinates or time stamps.
These spatio-temporal properties can be leveraged to obtain new and deeper insights into the data. However, managing spatial and temporal data often requires careful indexing in order to remain efficient. In this thesis you will study efficient methods to index spatio-temporal data for Heterogeneous Information Networks (HINs), which are large graphs in which different types of nodes and relationships are modelled. This topic offers the opportunity to hone skills and techniques learned in lectures like Information Systems, Geo-Information Systems, and Methods of Efficient Similarity Search in Large Databases (although the latter two are not prerequisites).
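
To give a rough idea of what indexing spatio-temporal measurements can mean, the following minimal sketch buckets measurements into a uniform grid over latitude, longitude, and time. It is only an illustration under simple assumptions: the class name `GridIndex`, the cell sizes, and the sample records are hypothetical, and a thesis would likely study more refined structures than a fixed grid.

```python
from collections import defaultdict
from math import floor


class GridIndex:
    """Uniform grid over (latitude, longitude, time); an illustrative assumption only."""

    def __init__(self, cell_deg=1.0, bucket_s=3600):
        self.cell_deg = cell_deg    # spatial cell size in degrees
        self.bucket_s = bucket_s    # temporal bucket size in seconds
        self.cells = defaultdict(list)

    def _key(self, lat, lon, t):
        # Map a measurement to the grid cell that contains it.
        return (floor(lat / self.cell_deg),
                floor(lon / self.cell_deg),
                floor(t / self.bucket_s))

    def insert(self, lat, lon, t, payload):
        self.cells[self._key(lat, lon, t)].append((lat, lon, t, payload))

    def query(self, lat_min, lat_max, lon_min, lon_max, t_min, t_max):
        # Visit only the cells overlapping the query box, then filter exactly.
        lo = self._key(lat_min, lon_min, t_min)
        hi = self._key(lat_max, lon_max, t_max)
        hits = []
        for i in range(lo[0], hi[0] + 1):
            for j in range(lo[1], hi[1] + 1):
                for k in range(lo[2], hi[2] + 1):
                    for lat, lon, t, payload in self.cells.get((i, j, k), ()):
                        if (lat_min <= lat <= lat_max
                                and lon_min <= lon <= lon_max
                                and t_min <= t <= t_max):
                            hits.append(payload)
        return hits


# Hypothetical measurements: (lat, lon, seconds since start, label)
index = GridIndex()
index.insert(54.3, 10.1, 100, "a")
index.insert(54.4, 10.2, 5000, "b")
index.insert(36.0, 15.0, 100, "c")
result = index.query(54.0, 55.0, 10.0, 11.0, 0, 3600)
```

The query touches only the grid cells that overlap the requested box instead of scanning every measurement, which is the basic trade-off an index offers.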

BA/MA: (Linear) combinations of MetaStructures for Clustering or Community Detection in (schema-rich) HINs

Heterogeneous Information Networks (HINs) are graphs in which nodes have different types and edges form different relationships between the nodes (a homogeneous information network would just be a plain graph). In all graphs, but especially in HINs, it is of great interest to find groups or communities that exhibit similar behavior or are more closely related to one another. Meta structures are complex relationships in HINs that can be used to express the 'relatedness' of nodes within the graph, and thus provide a powerful tool for clustering or community detection (CD). In this thesis you will study efficient (fast, scalable) and effective (meaningful) clustering and CD algorithms using (linear) combinations of meta structures. Participation in a lecture such as KDDM (or similar) is required.
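
As a rough illustration of the idea, the sketch below builds a tiny HIN and scores the relatedness of two nodes by a linear combination of meta-path counts. Meta paths are used here as the simplest kind of meta structure. All node names, relation names, and weights are hypothetical, and simple path counting is only one of many possible relatedness measures.

```python
from collections import defaultdict

# Hypothetical toy HIN with typed edges (source, relation, target):
# samples are collected at stations and contain species.
edges = [
    ("s1", "collected_at", "st_A"),
    ("s2", "collected_at", "st_A"),
    ("s3", "collected_at", "st_B"),
    ("s1", "contains", "cod"),
    ("s2", "contains", "cod"),
    ("s3", "contains", "cod"),
    ("s2", "contains", "herring"),
    ("s3", "contains", "herring"),
]


def metapath_counts(edges, relation):
    """Count instances of the symmetric meta path X -relation-> Y <-relation- X."""
    sources_by_target = defaultdict(set)
    for src, rel, dst in edges:
        if rel == relation:
            sources_by_target[dst].add(src)
    counts = defaultdict(int)
    for sources in sources_by_target.values():
        for a in sources:
            for b in sources:
                if a != b:
                    counts[(a, b)] += 1
    return counts


def combined_score(count_maps, weights, pair):
    """Linear combination of meta-path counts for one node pair."""
    return sum(w * m.get(pair, 0) for m, w in zip(count_maps, weights))


same_station = metapath_counts(edges, "collected_at")
same_species = metapath_counts(edges, "contains")
weights = [0.5, 0.5]  # assumed weights; choosing them well is part of the research question

score_s1_s2 = combined_score([same_station, same_species], weights, ("s1", "s2"))
score_s1_s3 = combined_score([same_station, same_species], weights, ("s1", "s3"))
```

The resulting pairwise scores could serve as a similarity matrix for a standard clustering algorithm.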


BA/MA: Text Mining and Knowledge Extraction in Marine Sciences

There are multiple open topics available that target text mining and knowledge extraction from text with applications in marine science. If you are interested in one of the topics, or have an idea about a related topic, please contact Asif Suryani, M.Sc.
You can find more suggested topics below.

BA/MA: Study and Evaluation: From NER to Network Representation of Scientific Text

Named-entity recognition example

Named entity recognition (NER) is the task of finding and extracting relevant entities from text. Scientific measurements and their associated values are of particular interest in this scenario, as is the automatic recognition of locations, institutions, persons, etc. Once these entities are extracted from a text, the task is to construct a network representation (e.g. a heterogeneous information network) of the text document at hand, where the challenge lies in predicting the appropriate relationships between the extracted entities (e.g. linking an extracted quantity 'mass' to its respective measurement '42' and unit 'kilograms'). The goal of this thesis is to develop and study novel techniques for NER and to link the extracted entities into a network representation of the document.
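
The linking step from the example above can be sketched as follows. This is only a toy illustration: a regular expression stands in for a real NER model, and the pattern, the function name `text_to_network`, and the relation names `has_value` and `has_unit` are assumptions made for this example.

```python
import re

# Hypothetical pattern "<quantity> of <value> <unit>", e.g. "mass of 42 kilograms".
# A real system would use a trained NER model instead of a regular expression.
PATTERN = re.compile(
    r"(?P<quantity>[a-z]+) of (?P<value>\d+(?:\.\d+)?) (?P<unit>[a-z]+)"
)


def text_to_network(text):
    """Return typed nodes (id, type, label) and typed edges (source, relation, target)."""
    nodes, edges = [], []
    for i, m in enumerate(PATTERN.finditer(text.lower())):
        q, v, u = (f"{kind}_{i}" for kind in ("quantity", "value", "unit"))
        nodes += [
            (q, "quantity", m["quantity"]),
            (v, "value", float(m["value"])),
            (u, "unit", m["unit"]),
        ]
        # Link the quantity to its measured value, and the value to its unit.
        edges += [(q, "has_value", v), (v, "has_unit", u)]
    return nodes, edges


nodes, edges = text_to_network("The sample had a mass of 42 kilograms.")
```

Each match yields three typed nodes and two typed edges, which together form a small heterogeneous information network for the sentence.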

BA: Scientific Text Parser: An Interactive and Intelligent Approach

In this bachelor's thesis the task is to develop a framework for a parsing toolkit that reads and summarizes scientific text documents, tailored to the needs of the user. The target domain for these studies will be scientific texts from marine science.

MA: Pre-trained Language Models for Domain-driven Q/A

In this master's thesis the objective is to leverage novel, state-of-the-art pre-trained language models (such as BERT) to facilitate automated question answering (Q/A). The target domain for these studies will be scientific texts from marine science.

BA/MA: Thesis in Data Science - Classification of Soil Layers based on Georadar Data

Contact: Steffen Strohm, M.Sc.


We are looking for an interested student who wants to engage in the classification of soil layers based on georadar data for a bachelor's or master's thesis. The aim is to develop data science methods for the automatic detection of sedimentary layer boundaries and for the automatic naming of geological units.

The topic is embedded in the cooperation between the Institute for Computer Science at the CAU Kiel and the Center for Baltic and Scandinavian Archaeology in Schleswig within the framework of the CRC 1266. No previous knowledge of geophysical data processing/georadar or of the archaeology of the Paleolithic or Mesolithic is required.

Example of Soil Layers

Example of Geo Radar Data

Thesis Description (PDF)

BA/MA: Thesis in Data Science - Magnetic Pattern Classification

Contact: Steffen Strohm, M.Sc.

Magnetic prospection data depicts the remains of archaeological sites. We investigate the sites of the Cucuteni-Tripolye Culture via the magnetic anomalies of burned houses to infer the social structure. From the magnetic anomaly we derive the magnetization distribution, which correlates with the mass of burnt clay. The magnetization patterns of the building remains are directly related to the architecture, which in turn can reflect social structures.

The settlements of the Cucuteni-Tripolye Culture are characterized by large sites with up to several thousand buildings, a concentric settlement layout, and a high degree of standardization of the house floor plans. The figure shows a section of the magnetic anomaly map (a), the magnetization distribution derived from it (b), and three examples of house anomalies compared with a standardized floor plan (c).

Example of a magnetic anomaly map, derived magnetization distribution and examples of house anomalies. 

There are two main questions related to the magnetic data of these sites, which can be tackled with supervised or unsupervised learning or data mining techniques:

(1) How can the building remains be automatically identified?
(2) Are there different clusters of buildings in terms of their magnetization patterns? If so, how are these clusters related to the ideas of standardized floor plans?


This work is a collaborative project with scientists from subproject G2 (Geophysics) and subproject D1 (Archaeology) within the CRC 1266 "Scales of Transformation" at Kiel University. No previous knowledge of Geophysics or Archaeology is required.

MA: Evaluating Protein Networks – Puzzling the Biological Secret Behind the Data

Contact: Christian Beth, M.Sc.

Example Protein Network

Protein interaction and function overview (figure taken from [1])

The development of novel metallic biomaterials as implants for medical applications requires a careful elucidation of impaired or improved physiological processes and responses. Therefore, we establish in vitro models to analyse cellular reactions on different biological levels. Fundamental information can be obtained from analysing the variations in protein synthesis. Functional interactions, relationships, and correlations between the tremendously diverse proteins lead to multifaceted and complex networks. Structured data science will help to elucidate these networks and contribute to defining sensitive cellular processes and regularities. The main task will be the classification of big data sets in order to extract principles and patterns. This work will be based on the creative embedding of biological data into mathematical models.

[1] Moreno, LA et al. Understanding Protein Networks Using Vester's Sensitivity Model. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2020;17(4):1440-1450.

MA: Identification and Synchronization of Events in Data Series from Lake Sediments

Contact: Steffen Strohm, M.Sc.

The preliminary title of this master's thesis points towards a set of possible subproblems in the context of mining time series data derived from lake sediments. The project focusses on analysing these time series to identify patterns that represent events, environmental conditions, or human impact. Identifying these patterns and comparing the time series with other, independent climate data series will help to understand (multifactorial) transformation processes, their role, and their temporal dynamics.

Example of synchronized time series derived from different data sources.

Example of time series derived from several data sources. Figure taken from [1].

This work is a collaborative project with scientists from subproject F2 (Geoarchaeology) within the CRC 1266 "Scales of Transformation" at Kiel University. No previous knowledge of Geoarchaeology is required.


The Collaborative Research Centre 1266 "Scales of Transformation" focuses on investigating human-environment interactions. Within it, project F2 collects data through pollen-analytical and geochemical analyses of annually laminated lake sediments from Northern Germany. These data form time series that reflect changes within the lake and in the surrounding landscape, recording both climatic and human influences. A comparison with independent climate data series is intended to help better understand the often multifactorial changes as well as the role and temporal dynamics (e.g. the question of synchronicity versus asynchronicity) of individual processes.


[1] Feeser I, Dörfler W, Czymzik M, Dreibrodt S. A mid-Holocene annually laminated sediment sequence from Lake Woserin: The role of climate and environmental change for cultural development during the Neolithic in Northern Germany. The Holocene. 2016;26(6):947-963.

MA: Predicting Ammonia Emissions of Livestock

Contact: Prof. Dr. Matthias Renz

Climate and environmental protection is an important goal for the future. To meet the required European climate targets and thus contribute to climate and environmental protection, German livestock farming must adapt. Both climate protection and animal welfare have come into the focus of German and European politics. To reconcile the conflicting demands of environmental protection and animal welfare, measures and strategies for reducing emissions in livestock farming are necessary.

Current emission measurements of relevant gases in pig farming are very cost-, time-, and labour-intensive. The existing procedures are therefore not suited to driving a rapid transition to animal- and climate-friendly housing systems.

One problem in studying the release of environmentally and climate-relevant gases (methane, ammonia, nitrous oxide, CO2) is, for example, the large number of different factors that influence the release. Previous research approaches were based on model analyses using analysis of variance and regression, or on mechanistic models. These procedures are applied differently depending on the country, university, or measurement institute, and are continually adapted and changed. A robust and reliable result that works at different locations has not yet been found.

Based on available time series data, the relationships between the relevant influencing factors and the release of the different gases are to be worked out. From these relationships, it should be possible to predict emission trajectories over time. A corresponding tool for predicting the emissions of different livestock houses is to be developed.

Among other things, this tool could speed up the approval of new builds or conversions of emission-reducing and animal-friendly housing systems, and thus make a decisive contribution to a rapid transition towards animal- and environmentally friendly livestock farming.

The work takes place in close cooperation between the Institute for Computer Science and the Institute of Agricultural Engineering (Institut für Landwirtschaftliche Verfahrenstechnik) at the CAU.