Hello, and welcome to my website!

My name is Julia, and I'm a computer science PhD student based in Heidelberg, Germany. My main interests lie in bioinformatics, data analysis, and machine learning. I am currently doing a PhD in bioinformatics with the CME group at HITS in Heidelberg. Additionally, I work as a part-time freelance software engineer. I love learning new things and new technologies.

If you have an interesting project for me or just want to chat, feel free to contact me!

Education

Mar 2022 - Present
Heidelberg Institute for Theoretical Studies
Oct 2019 - Feb 2022
Karlsruhe Institute of Technology
  • Master's Thesis: "Empirical Numerical Properties of Maximum Likelihood Phylogenetic Inference" (Mark: 1.0)
  • Specialization subjects: Data-Intensive Computing, Machine Learning and Artificial Intelligence
  • Minor subject: Biology
  • Final mark: 1.1 (graduation with distinction)
Oct 2015 - Sep 2019
Karlsruhe Institute of Technology
  • Bachelor's Thesis: "Patient Tracking in Surgery: An Image-Guided, Markerless Approach" (Mark: 1.0)
  • Minor subject: Physics
  • Final mark: 1.9
Sep 2009 - Jul 2015
Robert-Gerwig-Gymnasium Hausach
  • General qualification for university entrance
  • Final mark: 1.3

Further Education

January 2024
DeepLearning.AI

Coursera course teaching the most important concepts of probability and statistics for Machine Learning and Data Science.

You can verify the certificate here. (Note: The certificate states my maiden name Julia Schmid, but yes, that is me ☺︎)

November 2023
Imperial College London on Coursera

Coursera specialization teaching the basic mathematical concepts for Machine Learning algorithms. The specialization consists of the following courses:

  • Linear Algebra
  • Multivariate Calculus
  • PCA

You can verify the certificate here. (Note: The certificate states my maiden name Julia Schmid, but yes, that is me ☺︎)

November 2023
Stanford University on Coursera

Coursera course teaching the basics of statistics and hypothesis testing.

You can verify the certificate here. (Note: The certificate states my maiden name Julia Schmid, but yes, that is me ☺︎)

September 2020
DeepLearning.AI on Coursera

Coursera course teaching the basics of medical diagnosis prediction using machine learning and deep learning.

You can verify the certificate here. (Note: I got married since I took this class, so the certificate says Julia Schmid, but that is me ☺︎)

July 2020
DeepLearning.AI on Coursera

Coursera specialization consisting of the following courses:

  • Neural Networks and Deep Learning
  • Convolutional Neural Networks
  • Sequence Models
  • Structuring Machine Learning Projects
  • Regularization and Optimization

You can verify the certificate here (Note: I got married since I took these classes, so the certificate says Julia Schmid, but that is me ☺︎).

Experience

Mar 2022 – Present
Heidelberg Institute for Theoretical Studies

PhD in Computer Science in the interdisciplinary field of computational biology.

Python Snakemake Biopython Pandas numpy Plotly SQLite scikit-learn LightGBM TreeLite Optuna C C++ Git Jenkins RAxML-NG IQ-Tree FastTree

I am currently pursuing my PhD in Computer Science at HITS in Heidelberg in the field of Bioinformatics. The working title of my thesis is "Applications of Machine Learning and Data Science in Phylogenetics". One main aim of phylogenetics, and a research focus of my group at HITS, is to infer phylogenetic trees. Phylogenetic trees represent hypothetical evolutionary relationships between organisms or species. With the advances in machine learning and deep learning over the last decades, deep learning and large-scale data analytics are becoming more popular in phylogenetics.

The goal of my thesis is to explore potential applications of machine learning and deep learning techniques to improve phylogenetic inference, both in terms of accuracy and runtime. Furthermore, I am trying to improve current techniques and explore future directions by analyzing vast amounts of biological data and research results.

The following list is a brief summary of what I am working on in my day-to-day work and projects that I have finished.

  • Pandora: Quantification of the uncertainty of population genetics genotype datasets under dimensionality reduction. See below for more details on this project.
  • Debunking Simulations: In a joint work with researchers in France, we demonstrated that current state-of-the-art models of sequence evolution in phylogenetics cannot simulate empirical-like data. See below for more details on this project.
  • Pythia: Predicting the difficulty of phylogenetic analysis. See below for more details on this project.
  • Numerical Analysis of thresholds in phylogenetic inference tools. See below for more details on this project.
  • My group develops a C library providing frequently used phylogenetic inference functionality (Coraxlib). I am currently working on setting up a CI pipeline in Jenkins for this project.

Jun 2024 – Sep 2024
Apple
To be updated :-)
I'm very happy to share that I am joining Apple for an AIML internship this summer!

Oct 2021 – Mar 2024
Freelance

Major refactoring and migration of a business-critical Python 2 codebase to Python 3.

Python MySQL Redis FastAPI GraphQL Docker docker-compose Apache Kafka

In addition to pursuing my PhD in computer science, I worked as a part-time freelance software engineer. The idea was to gain further experience in the industry with frameworks and technologies I don't use in my research.

The primary task was to refactor and migrate a business-critical, yet undocumented and untested, Python 2 codebase to Python 3. The original developers were unavailable, and there was no specification of the expected functionality. The challenge lay in a circular dependency: refactoring safely requires tests to validate changes, but writing tests first requires understanding and refactoring the codebase. The initial structure, with database interaction logic scattered throughout, did not allow for testing. Potential logical bugs presented another issue, as it was unclear whether they were actual bugs or undocumented, intended behavior. I made substantial progress in refactoring, adding unit tests and documentation, reducing technical debt, and improving maintainability. I also identified potential bugs, data inconsistencies, and opportunities for speedup and complexity reduction.

In various side projects for the same company, I worked with a few additional technologies, including FastAPI, GraphQL, Docker, docker-compose and Apache Kafka.

Dec 2020 - Feb 2022
Heidelberg Institute for Theoretical Studies

Large-scale analysis of numerical properties of phylogenetic inference using the maximum likelihood method.

Python Snakemake Pandas Plotly SQLite

For more information, see the respective Project section below.

April 2020
Company on request

Development of a web-based filesharing tool.

Python Django JavaScript HTML CSS AWS

Files can be uploaded and shared with other users. Share links allow for sharing specific files with non-registered users as well. The files are stored in an Amazon S3 bucket to ensure scalability.

Oct 2019 - Mar 2020
ArtiMinds Robotics GmbH

Development of a web-based analytics software suite for large-scale robotics data.

JavaScript TypeScript HTML CSS MariaDB Python Pandas Matplotlib

ArtiMinds' main product is a robot programming suite (RPS) for industrial robots. Instead of writing code, users of the RPS formulate the task of the robot using pre-programmed building blocks. The RPS continuously monitors the task execution while recording measurements such as velocity and force. This data can be directly transferred to the second key product of ArtiMinds: the LAR (Learning & Analytics for Robots). The LAR is a web-based monitoring interface that displays the recorded measurements and task executions of the robot.

As a working student, I was part of the LAR development team. This work included writing tests for the previously largely untested LAR frontend and backend codebase.

I further analyzed velocity and force data of an industrial robot for customers to identify errors and optimization potential.

Aug 2017 - Aug 2019
Walk In Fitness, KIT

To balance out the time spent sitting in the library programming and studying, I wanted to do something other than computer science. So I decided to do an internship, and later work, at the university gym. I helped clients achieve their fitness goals and did some basic nutrition counselling.

Apr 2018 - Jul 2018
Chair for Embedded Systems, KIT

Tutor for the subject “Digitaltechnik und Entwurfsverfahren” (Digital Technology and Design Methods).

Study aid for 28 undergraduate computer science students.

Jan 2017 - Sep 2017
Institute for Anthropomatics and Robotics, KIT

Working with the ArmarX robot programming framework.

Preparation of experiments for marker-based human motion capture.

Mar 2016 - Sep 2016
Geophysical Institute, KIT

Writing lecture notes for the lecture "Introduction to Geophysics".

LaTeX

The resulting notes comprise 70 pages, including 50 custom-made graphics, and are available on GitHub.

Projects

This is an (incomplete) list of projects I worked on or that I am still working on.

I contribute to and update conda-forge recipes.

conda-forge

In my day-to-day work I rely on a lot of open-source software, and I think it's important to contribute to open-source projects myself. So far my contributions mainly concern conda-forge recipes. Namely, I contributed the following:

  • addition of the PyPythia feedstock (maintainer)
  • addition of the apricot-select feedstock (maintainer)
  • update of the pomegranate feedstock (maintainer)
  • update of the scikit-allel feedstock (maintainer)
  • update of the r-curl feedstock

Additionally, all my software projects are available open-source on GitHub.

I implemented a Python tool to estimate the uncertainty of dimensionality reduction on population genetics datasets.

The tool is available on GitHub (WIP).

The respective preprint publication will be available soon.

Python Eigensoft Plotly Pandas Scikit-Learn GitHub Actions

Dimensionality reduction techniques, for example Principal Component Analysis (PCA) or Multidimensional Scaling (MDS), are frequently used in population genetics and ancient DNA studies. To the best of my knowledge, there exists no standard in either field to report an estimate of the uncertainty or instability of such methods. Without such an estimate, the reader of a study cannot assess the accuracy of a conclusion based on dimensionality reduction. In my current work, I'm implementing the Python framework Pandora, which provides such an uncertainty estimate for a given genotype dataset based on bootstrapping. Pandora supports stability analyses for PCA and MDS. In addition to the overall stability of the dataset, Pandora can also compute the stability of a subsequent K-Means clustering analysis, as well as bootstrap support values for all samples in the input dataset.

The tool is unit-tested and I set up CI using GitHub Actions.
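
To illustrate the general idea of a bootstrap-based stability estimate, here is a minimal sketch. It is not Pandora's actual implementation: the function names, the use of NumPy's SVD for the PCA, and the Procrustes-style similarity score are all assumptions made for this example. The sketch resamples the SNP columns with replacement, computes a PCA embedding for each replicate, and averages the pairwise similarity of the embeddings.

```python
import numpy as np


def pca_embedding(genotypes, n_components=2):
    """Project samples (rows) onto the first principal components via SVD."""
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def embedding_similarity(a, b):
    """Similarity in [0, 1] of two embeddings after optimal rotation/scaling.

    Hypothetical score for this sketch: 1 minus the Procrustes disparity.
    Assumes neither embedding is constant (non-zero norm after centering).
    """
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    # The sum of singular values of a^T b is the best achievable alignment.
    singular_values = np.linalg.svd(a.T @ b, compute_uv=False)
    return float(min(singular_values.sum() ** 2, 1.0))


def bootstrap_stability(genotypes, n_bootstraps=20, n_components=2, seed=0):
    """Average pairwise similarity of PCA embeddings across bootstrap replicates."""
    rng = np.random.default_rng(seed)
    n_snps = genotypes.shape[1]
    embeddings = []
    for _ in range(n_bootstraps):
        # Resample SNPs (columns) with replacement; samples (rows) stay fixed.
        columns = rng.integers(0, n_snps, size=n_snps)
        embeddings.append(pca_embedding(genotypes[:, columns], n_components))
    similarities = [
        embedding_similarity(embeddings[i], embeddings[j])
        for i in range(n_bootstraps)
        for j in range(i + 1, n_bootstraps)
    ]
    return float(np.mean(similarities))


if __name__ == "__main__":
    # Toy data (made up): two populations with different allele frequencies.
    rng = np.random.default_rng(1)
    population_a = rng.binomial(2, 0.1, size=(10, 200))
    population_b = rng.binomial(2, 0.9, size=(10, 200))
    data = np.vstack([population_a, population_b]).astype(float)
    print(f"Stability: {bootstrap_stability(data):.3f}")
```

A value close to 1 means the bootstrap replicates yield nearly the same embedding up to rotation and scaling; lower values indicate that conclusions drawn from a single embedding should be treated with care.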

We provide empirical evidence that current state-of-the-art models of sequence evolution cannot reproduce empirical-like evolutionary data.

The publication is available as a preprint on bioRxiv.

Python Pandas Scikit-Learn LightGBM

Simulating sequence evolution plays an important role in the development and evaluation of phylogenetic inference tools. Naturally, the simulated data needs to be as realistic as possible to be indicative of the performance of the developed tools on empirical data. Over the years, numerous phylogenetic sequence simulators, employing various models of evolution, have been published with the goal to simulate such empirical-like data. In this study, we simulated DNA and protein Multiple Sequence Alignments (MSAs) under increasingly complex models of evolution with and without insertion/deletion (indel) events using a state-of-the-art sequence simulator. We assessed their realism by quantifying how well supervised learning methods are able to predict whether a given MSA is simulated or empirical.

Our results show that we can distinguish between empirical and simulated MSAs with high accuracy using two distinct and independently developed classification approaches across all tested models of sequence evolution. Our findings suggest that the current state-of-the-art models fail to accurately replicate the process of evolution.

This is joint work with my colleagues at HITS and a team of researchers in Lyon. Johanna Trost, Dimitri Höhler, and I contributed equally to this work.

I implemented a Python library for predicting the difficulty of multiple sequence alignments in phylogenetics.

PyPythia is available on GitHub.

The peer-reviewed publication is available in Molecular Biology and Evolution.

Python Pandas Scikit-Learn LightGBM TreeLite Snakemake RAxML-NG GitHub Actions

Phylogenetic analyses under the Maximum Likelihood model are time and resource intensive. To adequately capture the vastness of tree space, one needs to infer multiple independent trees. On some datasets, multiple tree inferences converge to similar tree topologies, on others to multiple, topologically highly distinct yet statistically indistinguishable topologies. At present, no method exists to quantify and predict this behavior. We introduce a method to quantify the degree of difficulty for analyzing a dataset and present Pythia, a Random Forest Regressor that accurately predicts this difficulty. Pythia predicts the degree of difficulty of analyzing a dataset prior to initiating Maximum Likelihood based tree inferences. The prediction and computation of all required attributes is substantially faster than a single Maximum Likelihood tree inference. Pythia can be used to increase user awareness with respect to the amount of signal and uncertainty to be expected in phylogenetic analyses, and hence inform an appropriate (post-)analysis setup. Further, it can be used to select appropriate search algorithms for easy-, intermediate-, and hard-to-analyze datasets.

In the current version of Pythia we replaced the Random Forest Regressor with boosted trees implemented in LightGBM.

PyPythia is unit-tested and I set up CI using GitHub Actions.

Pythia is available as open source software libraries in C (CPythia) and Python (PyPythia).

CPythia is also available on GitHub. CPythia can only be used in conjunction with Coraxlib.

I performed a large-scale numerical analysis of the influence of threshold parameters on the runtime and results of Maximum Likelihood phylogenetic inferences.

The peer-reviewed publication is available in Bioinformatics Advances.

Python Snakemake Plotly RAxML-NG IQ-Tree FastTree

Maximum Likelihood (ML) is a widely used phylogenetic inference model. ML implementations heavily rely on numerical optimization routines that use internal numerical thresholds to determine convergence. We systematically analyze the impact of these threshold settings on the log-likelihood and runtimes for ML tree inferences with RAxML-NG, IQ-TREE, and FastTree on empirical datasets. We provide empirical evidence that we can substantially accelerate tree inferences with RAxML-NG and IQ-TREE by changing the default values of two such numerical thresholds. At the same time, altering these settings does not significantly impact the quality of the inferred trees. We further show that increasing both thresholds accelerates the RAxML-NG bootstrap without influencing the resulting support values. For RAxML-NG, increasing the likelihood thresholds ϵ_LnL and ϵ_brlen to 10 and 10³, respectively, results in an average tree inference speedup of 1.9 ± 0.6 on Data collection 1, 1.8 ± 1.1 on Data collection 2, and 1.9 ± 0.8 on Data collection 2 for the RAxML-NG bootstrap. Increasing the likelihood threshold ϵ_LnL to 10 in IQ-TREE results in an average tree inference speedup of 1.3 ± 0.4 on Data collection 1 and 1.3 ± 0.9 on Data collection 2.

Our research comprises four studies. I conducted Study 1 during my Master's thesis with the CME group at HITS. You can find the full thesis here.

I implemented a private web-based diary to keep track of vacations.

The source code is available on GitHub.

Python Django HTML CSS Bootstrap

To remember vacations, holidays and trips, I implemented a diary website using the Django web framework. This website also stores images associated with certain trips and image descriptions, so I can show the best images to friends and family. It also includes a map, so I can see where on this beautiful earth I have already travelled ☺︎ In its latest version, I can also upload files, e.g. GPS tracks of hikes.

I implemented a web-based tool to organize courses for the computer science master's degree at KIT.

The source code is available on GitHub.

Python Django HTML CSS Bootstrap

For my master’s studies at KIT I implemented a web-based tool to organize the courses I intend to take. The tool checks if the planned courses satisfy all requirements to get the master’s degree. It further shows a schedule of exam dates, as well as an overview of all grades. Since the computation of the final grade at KIT is a bit tedious, I also implemented an automatic computation of the final grade, as well as the average grades per module.

I implemented a markerless patient movement tracking algorithm for cochlear implant surgeries.

The source code is available on GitHub.

The peer-reviewed paper is available in Frontiers in Surgery.

Python OpenCV scikit-image

As my bachelor's thesis, at the Intelligent Process Automation and Robotics Lab (IPR) at KIT, I implemented a system for markerless patient movement tracking during cochlear implant surgeries. The approach uses only the images obtained by the microscopic camera used by the surgeon. I implemented the algorithm using Python and the image processing frameworks OpenCV and scikit-image. It also includes a neural network for semantic image segmentation. My bachelor's thesis is available upon request only as it contains sensitive patient images.

I implemented a website for storing and sharing recipes.

The source code is available on GitHub.

Python Django HTML CSS Bootstrap

This project was inspired by my love for baking. It is meant for storing all the ideas and recipes that my friends and I collect. Different users can create new recipes, categorize them or search for a specific recipe. The website also includes functionality to filter recipes by ingredients and categories.

This website is an ongoing project and I keep adding new features. The latest feature I added was a shopping list that is especially useful when planning to bake multiple different recipes.

We implemented a web-based Lambda Calculus IDE and interpreter for the programming paradigm lecture at KIT.

The source code is available here.

Java Google Web Toolkit JavaScript CSS

As part of my undergraduate studies, we implemented a browser-based lambda calculus IDE and interpreter in a team of six students. We used Java with the Google Web Toolkit, JavaScript and CSS and wrote about 10,000 lines of code plus 6,000 lines of tests. The project was graded with mark 1.0 (best possible grade). I mostly implemented the interactive frontend to explore the lambda terms and their reduction steps. I also wrote many tests for the controller layer (MVC architecture).

Publications & Preprints

Publications

Simulations of Sequence Evolution: How (Un)realistic They Are and Why

J. Trost*, J. Haag*, D. Höhler*, L. Jacob, A. Stamatakis and B. Boussau (2024) Simulations of Sequence Evolution: How (Un)realistic They Are and Why. Molecular Biology and Evolution. https://doi.org/10.1093/molbev/msad277
* equal contribution

Adaptive RAxML-NG: Accelerating Phylogenetic Inference under Maximum Likelihood using dataset difficulty

A. Togkousidis, A.M. Kozlov, J. Haag, D. Höhler and A. Stamatakis (2023) Adaptive RAxML-NG: Accelerating Phylogenetic Inference under Maximum Likelihood using dataset difficulty. Molecular Biology and Evolution. https://doi.org/10.1093/molbev/msad227

The Free Lunch is not over yet – Systematic Exploration of Numerical Thresholds in Maximum Likelihood Phylogenetic Inference

J. Haag, L. Hübner, A.M. Kozlov and A. Stamatakis (2023) The Free Lunch is not over yet – Systematic Exploration of Numerical Thresholds in Maximum Likelihood Phylogenetic Inference. Bioinformatics Advances, 3(1). https://doi.org/10.1093/bioadv/vbad124

From Easy to Hopeless - Predicting the Difficulty of Phylogenetic Analyses

J. Haag, D. Höhler, B. Bettisworth and A. Stamatakis (2022) From Easy to Hopeless - Predicting the Difficulty of Phylogenetic Analyses. Molecular Biology and Evolution, 39(12). https://doi.org/10.1093/molbev/msac254

Continuous Feature-Based Tracking of the Inner Ear for Robot-Assisted Microsurgery

C. Marzi, T. Prinzen, J. Haag, T. Klenzner and F. Mathis-Ullrich (2021) Continuous Feature-Based Tracking of the Inner Ear for Robot-Assisted Microsurgery. Front. Surg., 8:742160. https://doi.org/10.3389/fsurg.2021.742160

Preprints

Pandora: A Tool to Estimate Dimensionality Reduction Stability of Genotype Data

J. Haag, A. I. Jordan and A. Stamatakis (2024) Pandora: A Tool to Estimate Dimensionality Reduction Stability of Genotype Data. bioRxiv. https://doi.org/10.1101/2024.03.14.584962

Predicting Phylogenetic Bootstrap Values via Machine Learning

J. Wiegert, D. Höhler, J. Haag and A. Stamatakis (2024) Predicting Phylogenetic Bootstrap Values via Machine Learning. bioRxiv. https://doi.org/10.1101/2024.03.04.583288

A representative Performance Assessment of Maximum Likelihood based Phylogenetic Inference Tools (Preprint)

D. Höhler, J. Haag, A. M. Kozlov and A. Stamatakis (2022) A representative Performance Assessment of Maximum Likelihood based Phylogenetic Inference Tools. bioRxiv. https://doi.org/10.1101/2022.10.31.514545

Talks

Educated Bootstrap Guesser: Predicting Phylogenetic Bootstrap Values

legend2024
May 2024
In this presentation, I presented the work of my master's student Julius.
Talk slides

Simulations of Sequence Evolution: How (Un)realistic They Are and Why

legend2024
May 2024
Talk slides

Predicting the Difficulty of Phylogenetic Analyses

ERGA BioGenome Analysis and Applications Seminar
November 2023
A recording of this talk is available on YouTube.
Talk slides

Predicting the Difficulty of Phylogenetic Analyses

Peder Sather/Invertomics Symposium “Progress and Development in Phylogenetic Methods”
March 2023, University of Oslo, Norway
Talk slides

Predicting the Difficulty of a Phylogenetic Analysis

EVOLCYP Workshop on Biodiversity Genomics
September 2022, University of Cyprus, Cyprus

Skills

Programming Languages & Tools

I feel most comfortable working with Python, but I have worked with many other programming languages. Since most of my hobby projects are websites, I am also very familiar with HTML5 and CSS. During my work at ArtiMinds I wrote code in JavaScript and TypeScript, and during my computer science studies I also coded in Java and C. I recently also learned the basics of C++ and R.

For all my projects I'm using Git, and I know the basic usage (commit, push, merge, rebase, ...). However, advanced Git features still feel like magic to me and I have to google a lot of commands ☺︎

Python

HTML5 & CSS

Git

SQL

JavaScript & TypeScript

Java

C++

C

R



Frameworks

Here is a list of frameworks I frequently work with and that I would say I know my way around.

Pandas Plotly Dash Matplotlib Numpy scikit-learn LightGBM Django Snakemake Biopython


Languages

  • German (Native)
  • English (Professional)
  • Spanish (Elementary)
  • Greek (Elementary)

Miscellaneous

Hobbies
  • Baking and cooking! You can see pictures of my baked goods here ☺︎
    ...yes I will bring cake to the office ☺︎
  • Sports, especially tennis, bouldering, and cycling.
  • Drinking good coffee ☺︎ (I even have a barista certificate!)
  • Meeting friends
  • Reading a good book