Empowering Life Scientists: What HPC Support Should Look Like

It’s a paradox: many smaller HPC clusters are built by life scientists, for life scientists — yet the very idea of hosting life science workloads still feels alien to many traditional HPC administrators [1]. Why? Because life science workflows rarely fit the classic HPC mould: they’re often data-heavy, workflow-driven, interactive, and built on scripting languages — not just MPI or GPU codes.

So what does it take to provide good support for life science — and other “non-traditional” domains — whose users don’t just want to submit a single, tightly coupled job and walk away?

In this article, I’ll share some practical ideas for bridging that gap — because great HPC support isn’t just about uptime and throughput. It’s about understanding the people and their scientific needs.

Admittance to a Cluster: The First Hurdle

Ordinarily, a group leader (PI) must apply for compute time — but the process ranges wildly in complexity. At one end: “Just ask.” At the other: “Write pages justifying your need, list every software package, predefine performance metrics, and estimate storage requirements — before you’ve even run a test job.”

Moreover, many scientists — not just in the life sciences — aren’t fluent in HPC jargon. Asking them to estimate performance in FLOPS per CPU or compute node, for instance, is not just difficult — it’s often theoretically meaningless for applications that spend most of their time traversing matrices, managing I/O, or orchestrating pipelines rather than multiplying floating-point numbers. Even most molecular dynamics programs do not report this figure; they report simulated time per wall-clock time (e.g. nanoseconds per day), which — strangely enough — is usually accepted because these programs are known to the HPC world.

So how do we design an admittance process that respects HPC resource constraints without alienating the very users we’re trying to support?

Life scientists rarely run just one application — they use many, typically in sequence, with wildly different needs: some are memory-hungry, others I/O-bound, some need GPUs, others run on a few cores for days. Asking them to predict all of this upfront — before they’ve even tested their pipeline — is unrealistic and counterproductive. Instead, offer a sandbox allocation during a discovery phase: ask only for the basics (storage needs and, for the most resource-intensive tasks, approximate job duration and RAM requirements) and let the user explore their workflow. The real resource profile emerges after they run, not before. That’s when you can guide them to the right queues, containers, or allocations — not when you’re still trying to get them through the door.

Now, if a realistic resource profile emerges after they run a data analysis, not before, then administrators can help gather performance metrics from those early jobs to guide the next, more formal allocation — turning usage data into smarter, more efficient scheduling without ever asking the scientist to guess in the dark. Eventually, after that sandbox phase, a more detailed application for follow-up projects or extended project time can reasonably be expected.
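As a minimal sketch of what such metric gathering could look like (assuming Slurm accounting via sacct is available; the user name is a placeholder), a few lines of Python suffice to summarise peak memory and wall time per job name from the sandbox phase:

```python
#!/usr/bin/env python3
"""Sketch: summarise peak memory and wall time of a user's recent jobs from
Slurm accounting, so a follow-up allocation can be sized from real usage
instead of guesses. Assumes `sacct` is in PATH; 'jdoe' is hypothetical."""

import subprocess
from collections import defaultdict

USER = "jdoe"  # hypothetical sandbox-phase user

cmd = [
    "sacct", "-u", USER, "--starttime", "now-30days",
    "--format=JobName,Elapsed,MaxRSS,AllocCPUS,State",
    "--parsable2", "--noheader", "--units=M",
]
out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

peak_rss = defaultdict(float)  # peak MaxRSS in MB, per job (step) name
wall_time = {}                 # Elapsed of the step that hit that peak

for line in out.splitlines():
    name, elapsed, maxrss, cpus, state = line.split("|")
    if not maxrss or not state.startswith("COMPLETED"):
        continue  # skip allocation lines and failed/running steps
    # note: step lines may be named 'batch' or 'extern'; refine as needed
    rss_mb = float(maxrss.rstrip("KMGT"))  # '--units=M' normalises to MB
    if rss_mb >= peak_rss[name]:
        peak_rss[name] = rss_mb
        wall_time[name] = elapsed

for name in sorted(peak_rss):
    print(f"{name:30s}  peak RSS {peak_rss[name]:8.0f} MB  elapsed {wall_time[name]}")
```

The resulting per-tool profile can be pasted into the follow-up allocation request, or used to steer the user toward suitable partitions and sensible default resources.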

I hear you saying: without performance figures, should these groups not apply for lower-tier clusters? Maybe so, but sometimes the sheer amount of data requires a bigger cluster, and support is needed already during the application phase.

Software Provisioning – A User-Friendly, Admin-Sustainable Model

Most life-science groups need many little tools [2], not just one monolithic application, and they often want to experiment with new versions as soon as they appear. The “either-or” model you see today – either “we” pre-install a static list of programs, or users try to install everything themselves in a personal Conda tree that quickly busts their disk quota – breaks down for modern bioinformatics pipelines [3].

The reality? The required tools are not all available as pre-built modules. While “conda-in-home-dir” is the default fallback, it’s a quota disaster waiting to happen. A better path: start with a curated set of module files (bundled where possible for common workflows — easy to load, nothing to install from the user’s perspective, no quota hit). For the rest, allow self-installation via Spack or EasyBuild (with a shared prefix or scratch space to avoid home-directory bloat) or, where appropriate, containers (Apptainer) for full reproducibility. The goal isn’t to lock users in — it’s to give them multiple safe, supported paths to get their software running, without turning their home directory into a package graveyard.

Also note that contemporary workflow systems can install Conda packages on the fly, e.g. onto a project file system, and a user can simply delete such unnamed environments after the analysis. So yet another potential solution is on the horizon.
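For illustration, here is a minimal sketch of such an on-the-fly environment in Snakemake (a Python-based DSL); the rule, file paths, and environment file are placeholders, and the environments land wherever `--conda-prefix` points, e.g. the project file system:

```python
# Snakefile sketch: the rule declares its own Conda environment, which
# Snakemake builds on first use when invoked with
#   snakemake --use-conda --conda-prefix /path/to/project/conda-envs ...
# (rule name, paths, and envs/fastqc.yaml are hypothetical)
rule fastqc:
    input:
        "reads/{sample}.fastq.gz"
    output:
        "qc/{sample}_fastqc.html"
    conda:
        "envs/fastqc.yaml"   # pinned tool versions live here, not in $HOME
    shell:
        "fastqc {input} --outdir qc"
```

Once the analysis is done, the directory given to `--conda-prefix` can simply be removed, which is exactly the clean-up described above.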

Expertise and Workflow Support

A sustainable life-science service on an HPC cluster needs at least one domain expert on the admin team. When that expertise is given a voice in resource planning and software provisioning, the centre can steer users toward reproducible, scalable pipelines instead of ad-hoc solutions.

Current reproducibility initiatives already point to mature workflow managers that understand the batch systems used on HPC clusters (SLURM, LSF, HTCondor, …). Notable examples are Snakemake – with its SLURM plugin maintained by the HPC groups in Mainz and a bioinformatics unit in Santa Cruz, and with support for LSF as well as HTCondor – and Nextflow, originally developed at the Barcelona Supercomputing Center, which likewise supports a number of HPC batch systems. These tools let scientists describe their analysis once and run it anywhere, while the admin team can support the workflow execution centrally.
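To make the “describe once, run anywhere” point concrete, here is a hedged sketch of a Snakemake rule whose resource declaration is translated into batch parameters by the SLURM executor plugin; the tool, file names, and resource values are placeholders:

```python
# Snakefile excerpt: resources are declared once with the rule and are
# mapped onto sbatch parameters by the SLURM executor plugin, e.g. when
# launched as:
#   snakemake --executor slurm --jobs 100
rule map_reads:
    input:
        reads="reads/{sample}.fastq.gz",
        index="reference/genome.idx"
    output:
        "mapped/{sample}.bam"
    threads: 16                # becomes the job's CPU request
    resources:
        mem_mb=32000,          # memory request
        runtime=240            # wall time in minutes
    shell:
        "mapper --threads {threads} -x {input.index} -U {input.reads} -o {output}"
```

On an LSF or HTCondor cluster only the executor choice changes; the rule itself, and hence the scientist’s description of the analysis, stays the same.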

Providing “quick-look” interactive services such as RStudio or Jupyter notebooks (currently fashionable in the AI world) may please users initially, but it often forces them to reinvent pipelines that a workflow manager already handles. (Not to mention that a workflow manager can orchestrate many jobs concurrently, takes care of staging reference data into node-local storage, etc.) Advising on and maintaining workflow-based solutions should therefore be a core responsibility of the HPC support team – with the domain expert’s guidance ensuring that the advice is both scientifically sound and technically feasible.

Thrashing Ideology

Many HPC centres impose minimum-core policies (e.g. “only jobs ≥ 16 CPUs, or an average of > 256 cores per allocation”), as traditionally HPC is all about well-scaling applications and squeezing nanoseconds out of compute kernels. Also, scheduling gets easier. In practice, any realistic data-driven workflow — life-science or otherwise — spends most of its jobs on single-core processing or steps with only a few threads, such as quality control, format conversion, archiving, and the final plotting of results. Forcing all of these lightweight tasks onto the compute nodes creates unnecessary queue pressure and defeats the purpose of the core-size rule. A pragmatic compromise is to allow obvious “shepherd” jobs to run on the login node: the workflow manager itself (e.g. Snakemake or Nextflow), which merely builds a DAG of the jobs to perform, submits the heavy jobs to the batch system, and performs a quick download or a few-second R/Python plot locally. (NB: ROOT, CERN’s core program for analysing high-energy-physics data, has only modest threading support and is used to analyse petabytes of data on HPC systems worldwide.) All the while, shared-memory programs can sometimes be pooled and launched together in a bigger job – or a small partition for such programs can be provided.
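As one hedged illustration of such a shepherd setup, Snakemake’s `localrules` directive marks trivial steps to be executed by the workflow process itself, i.e. on the login node where it was launched, while every other rule is still submitted to the batch system (rule names, files, and the URL below are made up):

```python
# Snakefile excerpt: 'download' and 'plot_summary' run locally, right where
# snakemake itself runs (typically the login node); all other rules are
# submitted to the batch system as usual.
localrules: download, plot_summary

rule download:
    output:
        "reference/annotation.gtf.gz"
    shell:
        "wget -q -O {output} https://example.org/annotation.gtf.gz"

rule plot_summary:
    input:
        "results/summary.tsv"
    output:
        "results/summary.png"
    script:
        "scripts/plot_summary.py"   # a few-second plotting script
```

This keeps the minute-long shepherd work off the queue, while the actual number crunching still ends up on the compute nodes.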

Allowing such lightweight operations on the login node preserves queue efficiency, respects the administrators’ goal of keeping compute‑intensive work on the cluster, and avoids the needless “thrashing” caused by over‑constraining tiny jobs.

If you would like to comment, please do so below on this blog platform. Comments on Mastodon will not be persistent.

Note: This article has been written with the help of LanguageTool and no other AI assistance.

  1. That, at least, is my personal experience: something I encounter daily and at HPC conferences, and something expressed to me by many, many life scientists. ↩︎
  2. A single workflow involves on the order of one to six dozen application packages. So my estimate, depending on the heterogeneity of the research questions in my vicinity, is that 500 to 800 application packages is a realistic figure. This, of course, varies from site to site. ↩︎
  3. Yes, I know that some admin teams have a more flexible attitude to this challenge. ↩︎
