Empowering Life Scientists: What HPC Support Should Look Like

It’s a paradox: many smaller HPC clusters are built by life scientists, for life scientists — yet the very idea of hosting life science workloads still feels alien to many traditional HPC administrators¹. Why? Because life science workflows rarely fit the classic HPC mould: they’re often data-heavy, workflow-driven, interactive, and built on scripting languages — not just MPI or GPU codes.

So what does it take to provide good support for life science — and other “non-traditional” domains — whose users don’t just want to submit a single, tightly coupled job and walk away?

In this article, I’ll share some practical ideas for bridging that gap — because great HPC support isn’t just about uptime and throughput. It’s about understanding the people and their scientific needs.

Admittance to a Cluster: The First Hurdle

Ordinarily, a group leader (PI) must apply for compute time — but the process varies wildly in complexity. At one end: “Just ask.” At the other: “Write pages justifying your need, list every software package, predefine performance metrics, and estimate storage requirements — before you’ve even run a test job.”

Moreover, many scientists — not just in the life sciences — aren’t fluent in HPC jargon. Asking them to estimate performance in FLOPS per CPU or compute node, for instance, is not just difficult — it’s often theoretically meaningless for applications that spend most of their time traversing matrices, managing I/O, or orchestrating pipelines rather than multiplying floating-point numbers. Even most molecular dynamics programs don’t report this figure; they report simulated time per wall-clock time (e.g. nanoseconds per day), which — strangely enough — is usually accepted because these programs are well known in the HPC world.

So how do we design an admittance process that respects HPC resource constraints without alienating the very users we’re trying to support?

Life scientists rarely run just one application — they use many, typically in sequence, with wildly different needs: some are memory-hungry, others I/O-bound, some need GPUs, others run on a few cores for days. Asking them to predict all of this upfront — before they’ve even tested their pipeline — is unrealistic and counterproductive. Instead, let there be a sandbox allocation during a discovery phase: ask only for the basics — storage needs and, for the most resource-intensive tasks only, things like approximate job duration and RAM requirements — and let the user explore their workflow. The real resource profile emerges after they run, not before. That’s when you can guide them to the right queues, containers, or allocations — not when you’re still trying to get them through the door.

Now, if a realistic resource profile emerges after they run a data analysis, not before, then administrators can help gather performance metrics from those early jobs to guide the next, more formal allocation — turning usage data into smarter, more efficient scheduling without ever asking the scientist to guess in the dark. Eventually, after that sandbox phase, a more detailed application for follow-up projects or extended project time may be expected.
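
On SLURM-based clusters, much of that usage data can be read straight out of the accounting database. As a minimal sketch of what such metric gathering could look like (my own illustration, not any particular site’s tooling), the following Python snippet shells out to the standard sacct command and prints elapsed time, allocated CPUs and peak memory per job step; the field selection and the 30-day window are assumptions to adapt locally.

    #!/usr/bin/env python3
    """Summarise what a user's recent jobs actually consumed, from SLURM accounting.

    A sketch only: it calls `sacct` and prints elapsed time, allocated CPUs and
    peak memory per job step, so that a follow-up allocation can be based on
    measured numbers rather than guesses.  Usage: ./job_summary.py <username>
    """
    import subprocess
    import sys


    def job_summary(user, since="now-30days"):
        # -P: pipe-separated, machine-readable output; -n: suppress the header line.
        fields = "JobID,JobName,Elapsed,AllocCPUS,ReqMem,MaxRSS,State"
        out = subprocess.run(
            ["sacct", "-u", user, "-S", since, "-P", "-n", "--format", fields],
            capture_output=True, text=True, check=True,
        ).stdout
        jobs = []
        for line in out.splitlines():
            jobid, name, elapsed, cpus, reqmem, maxrss, state = line.split("|")
            # Only the batch/step lines carry a MaxRSS figure; skip the rest.
            if maxrss:
                jobs.append((jobid, name, elapsed, cpus, reqmem, maxrss, state))
        return jobs


    if __name__ == "__main__":
        for jobid, name, elapsed, cpus, reqmem, maxrss, state in job_summary(sys.argv[1]):
            print(f"{jobid:<16} {elapsed:>10}  {cpus:>3} CPUs  "
                  f"req {reqmem:>8}  peak {maxrss:>8}  {state}")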

I hear you saying: without performance figures, should these groups not apply for lower-tier clusters? Maybe so, but sometimes the sheer amount of data requires a bigger cluster — and support — already during the application phase.

Software Provisioning – A User-Friendly, Admin-Sustainable Model

Most life-science groups need many little tools², not just one monolithic application, and they often want to experiment with new versions as soon as they appear. The either-or model you see today – either “we” (the admins) pre-install a static list of programs, or users try to install everything themselves in a personal Conda tree that quickly busts their disk quota – breaks down for modern bioinformatics pipelines³.

The reality? The required tools are not all available as pre-built modules⁴. While “conda-in-home-dir” is the default fallback, it’s a quota disaster waiting to happen. A better path: start with a curated set of module files (bundled where possible for common workflows — easy to load, no additional install requirement from the user’s perspective, no quota hit). For the rest, allow self-installs via Spack or EasyBuild (with a shared prefix or scratch space to avoid home-directory bloat⁵) or, where appropriate, containers (Apptainer) for full reproducibility. The goal isn’t to lock users in — it’s to give them multiple safe, supported paths to get their software running, without turning their home directory into a package graveyard.
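
To make the container path a little more concrete: a workflow manager such as Snakemake can tie an individual pipeline step to a published image, which is then executed through Apptainer/Singularity when the corresponding software-deployment option is enabled. This is only a sketch; the image URI, tag and tool below are placeholders, not recommendations.

    # Sketch: pinning one pipeline step to a container image for reproducibility.
    # The image URI and tag are placeholders.
    rule variant_calling:
        input:
            bam="aligned/{sample}.bam",
            ref="ref/genome.fa"
        output:
            "calls/{sample}.vcf"
        container:
            "docker://quay.io/biocontainers/bcftools:<tag>"
        shell:
            "bcftools mpileup -f {input.ref} {input.bam} | bcftools call -mv -o {output}"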

Also note that contemporary workflow systems can install Conda packages on the fly – e.g. onto a project file system – and the user can simply delete such unnamed environments after the analysis. Yet another potential solution is on the horizon.
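
As a sketch of how that looks from the user’s side (Snakemake is used here as an example; the tool and file names are placeholders): a rule declares its own Conda environment, which the workflow manager builds on demand when the corresponding software-deployment option is enabled, and whose location can be steered onto a project file system rather than the home directory.

    # Sketch: a per-rule Conda environment, created on the fly by the workflow manager.
    # Tool and file names are placeholders; the environment prefix can be pointed
    # away from the home directory (e.g. onto a project file system).
    rule quality_control:
        input:
            "reads/{sample}.fastq.gz"
        output:
            "qc/{sample}_fastqc.html"
        conda:
            "envs/qc.yaml"      # pinned tool versions live here, not in a global env
        threads: 2
        shell:
            "fastqc --threads {threads} --outdir qc {input}"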

Expertise and Workflow Support

A sustainable life‑science service on an HPC cluster needs at least one domain‑expert on the admin team. When that expertise is given a voice in resource‑planning and software‑provisioning, the centre can steer users toward reproducible, scalable pipelines instead of ad‑hoc solutions.

Current reproducibility initiatives already point to mature workflow managers that understand the batch systems used on HPC clusters (SLURM, LSF, HTCondor, …). Notable examples are Snakemake – with its SLURM plugin maintained by the HPC groups in Mainz and a bioinformatics unit in Santa Cruz, plus support for LSF as well as HTCondor – and Nextflow, originally developed at the Centre for Genomic Regulation in Barcelona, also with support for a number of HPC batch systems. These tools let scientists describe their analysis once and run it anywhere, while the admin team can support the workflow execution centrally.
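
To give an idea of what “describe once, run anywhere” means in practice, here is a minimal Snakemake sketch: the rule’s resource declarations are translated into batch-job requests by the SLURM executor plugin, so the scientist never writes an sbatch script by hand. The tool, paths and resource figures are placeholders, and the exact invocation depends on the Snakemake version and site configuration (this assumes Snakemake >= 8 with the SLURM executor plugin installed).

    # Sketch: a compute-heavy step, described once and submitted to SLURM by the
    # workflow manager. Tool, paths and resource figures are placeholders.
    #
    # Invocation from the login node (Snakemake >= 8 with the SLURM executor plugin):
    #
    #   snakemake --executor slurm --jobs 50 \
    #             --default-resources slurm_account=<account> slurm_partition=<partition>
    rule align:
        input:
            reads="reads/{sample}.fastq.gz",
            ref="ref/genome.fa"
        output:
            "aligned/{sample}.bam"
        threads: 8
        resources:
            mem_mb=16000,   # memory request for the batch job
            runtime=120     # minutes; becomes the job's time limit
        shell:
            "bwa mem -t {threads} {input.ref} {input.reads} | samtools sort -o {output} -"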

Providing “quick-look” interactive services such as RStudio or Jupyter notebooks (currently fashionable in the AI world) may please users initially, but it often forces them to reinvent pipelines that a workflow manager already handles. (Not to mention that a workflow manager can orchestrate many jobs concurrently and takes care of staging reference data into node-local storage, etc.) Advising on and maintaining workflow-based solutions should therefore be a core responsibility of the HPC support team – with the domain expert’s guidance ensuring that the advice is both scientifically sound and technically feasible.

Thrashing Ideology

Many HPC centres impose minimum-core policies (e.g. “only jobs ≥ 16 CPUs, or an average > 256 cores per allocation”), as traditionally HPC is all about well-scaling applications and squeezing nanoseconds out of compute kernels. Also, scheduling gets easier. In practice, any realistic data-driven workflow — life-science or otherwise — spends most of its jobs on single-core processing or steps with only a few threads, such as quality control, format conversion, archiving, and the final plotting of results. Forcing all of these lightweight tasks onto the compute nodes creates unnecessary queue pressure and defeats the purpose of the core-size rule. A pragmatic compromise is to allow obvious “shepherd” jobs to run on the login node: the workflow manager (e.g. Snakemake or Nextflow) that merely builds a DAG of the jobs to perform and submits the heavy ones to the batch system, a quick download, or a few-second R/Python plot. (NB: ROOT, CERN’s core program for analysing high-energy-physics data, has only modest threading support and is used to analyse petabytes of data on HPC systems worldwide.) All the while, shared-memory programs can sometimes be pooled and launched together in a bigger job — or a small partition for such programs can be provided (see the sketch at the end of this section).

Allowing such lightweight operations on the login node preserves queue efficiency, respects the administrators’ goal of keeping compute‑intensive work on the cluster, and avoids the needless “thrashing” caused by over‑constraining tiny jobs.
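
As for pooling, Snakemake’s job grouping offers one concrete mechanism (a sketch, with illustrative names and numbers): rules tagged with the same group name can be bundled so that many small, shared-memory steps share a single batch allocation instead of flooding the queue.

    # Sketch: pooling many small shared-memory steps into fewer, larger batch jobs.
    # Tool, paths and numbers are illustrative only.
    rule count_features:
        input:
            "aligned/{sample}.bam"
        output:
            "counts/{sample}.tsv"
        group:
            "light_steps"   # all instances of this rule belong to one job group
        threads: 4
        resources:
            mem_mb=8000
        shell:
            "featureCounts -T {threads} -a ref/annotation.gtf -o {output} {input}"

When the workflow is submitted with, for example, --group-components light_steps=16, sixteen of these per-sample jobs are packed into a single batch job instead of sixteen separate ones.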

If you would like to comment, please do so below on this blog platform. Comments on Mastodon will not be persistent.

Note: This article has been written with the help of LanguageTool and no other AI assistance. Also, this blog article is a contribution to an upcoming discussion on life science support in the German NHR network and reflects my opinion.

  1. That, at least, is my personal experience: something I encounter daily and at HPC conferences, and which has been expressed to me by many, many life scientists ↩︎
  2. A single workflow uses on the order of one to six dozen application packages. So, depending on the heterogeneity of the research questions in my vicinity, my estimate is that 300 to 500 application packages (edit: an earlier version said 500-800, but this was an upper-bound estimate) is a realistic figure. This, of course, varies from site to site. ↩︎
  3. Yes, I know that some admin teams have a more flexible attitude to this challenge. ↩︎
  4. edit after publishing: Yes, this is anecdotal evidence from teaching elsewhere, from user feedback, from a number of tales and a yet unpublished survey. But otherwise totally anecdotal. ↩︎
  5. edit after publishing: To avoid misunderstanding, I should point out that these two build systems are established on many clusters and provide a HUGE number of community-approved and tested tools which are known to run somewhere. While no build framework can cover all edge cases, either one can satisfy the vast majority of scientific needs. ↩︎

Comments

2 replies to “Empowering Life Scientists: What HPC Support Should Look Like”

  1. Sebastian P.

    Dear Christian,

    Thank you for encouraging this conversation about life sciences and HPC. Before I start, let me state that I am not a life scientist and only have an outsider’s opinion on this matter. However, in the past I was on the user side of HPC (doing molecular dynamics and ab-initio simulations since ca. 2012) and then switched to the administrators’ side, where I have tried for the last couple of years to help researchers from various fields perform computations on a (smaller-tier) HPC cluster.

    # HPC access:
    I agree, users with little or no prior knowledge of HPC need to have some kind of test ground to get familiar with the systems or the workflow in general. Ideally this should start on the user’s own workstation or laptop, but this might not be available to everyone, so a low bar for entry is necessary on the smaller-tier systems. But just letting everyone (e.g. PhD students) access even the smallest cluster without any prior advice or knowledge should be avoided – e.g. group leaders at our site have to (once) write a small summary of what they intend to do on the system. No deep dive, just the basics – and we caught many proposals that were basically better off using a “beefy” laptop.
    I also strongly disagree with having easy access to higher-tier clusters. I would suggest learning and optimizing on the smaller systems to then be able to write full-fledged proposals for the big sites. Also, in my experience, next to the scientific justification, an applicant usually has to estimate their planned usage of core-hours (in one form or another) – not FLOPS directly. You argue that sometimes massive datasets don’t allow the usage of smaller systems. In my opinion every (initial) test run should start with smaller datasets – if it’s too big, take a subset which can run fast on smaller systems – everything else is a waste of valuable resources.

    # Software provisioning:
    Again I agree with the assessment of the current state: static modules on the one side and the users’ own Conda (or any other package manager) environments on the other. But static modules are there for a reason – reproducibility. This is one of the most underestimated factors in producing reliable scientific output in this community. This also directly touches on the life sciences’ approach of using “many little tools” in some kind of pipeline. My extreme view on this is: this is a disaster waiting to happen. There are so many small tools written by inexperienced users, potentially riddled with bugs, that I wouldn’t trust any output produced by these pipelines. Of course this is more nuanced, and there probably are well-maintained tools. But trying out very new versions of this or that tool for my pipeline should be the responsibility of the user – I as an admin refuse to install the latest and greatest from some shady git repository to make it available cluster-wide. The next inexperienced user, discovering these modules, will use them without a second thought. While containers (like Apptainer or Singularity) are often proposed as the middle ground to solve this, I see them more as a band-aid to cover up the underlying shortcomings of the code base rather than a cure for the lack of optimization and maintenance. And yes, the EESSI project is a good approach to provide software in a concise and tested way, but I would argue it will also concentrate on the big, monolithic and well-proven software applications.

    # Workflow support
    While admirable and definitely feasible at larger sites, dedicated personnel to help one specific group of users are costly and often not within the budget of what (public) universities are willing to spend. For the smaller sites I would suggest relying on networks of different sites (with different expertise), as is already the case with e.g. bwHPC or hpc.nrw in Germany.

    # Hardware
    Requiring users to allocate a minimum number of cores is indeed very restrictive. Like you mentioned, there are a lot of single-core jobs requiring vast amounts of memory. So for me, you either require the cores, the memory, or both of a node. But what we see much more often are jobs allocating full nodes (with e.g. 192 cores and 768 GB) but effectively using 1 core and 16 GB … at their peak. These are not HPC jobs in any way. Sometimes we can educate users to use only the amount they actually need – and then maybe submit hundreds of smaller jobs (so they are essentially performing High Throughput Computing – HTC – which is often confused with HPC). The takeaway is that for these computations you actually require different hardware. You need fewer nodes with a lot of memory and lower-core-count CPUs, which can then clock much higher instead. You don’t even need a fast interconnect. Such specific clusters could be much more affordable than a general-purpose HPC cluster.

    # Cultural differences
    But my biggest gripe with the life science community is actually their expectations. Very often a “smartphone-like” experience is expected to run their code – and at first thought, who can blame them? We are fed perfectly tailored software solutions all day long. The enormous workforce and large amounts of money these experiences actually require are usually invisible. So why shouldn’t this be the case for a university HPC system? Well, obviously resources …

    … and then I think about the fact that HPC has historically been driven by physics, computational chemistry, climate modeling and so on. Sure, maybe some of these groups are naturally more attuned to coding and informatics in general, but surely not all of them. These communities have demonstrated the ability to adapt their code to fully leverage parallel architectures and optimize performance. They have moved from monolithic serial applications to highly scalable, optimized code. What’s so different about life scientists? This community uses other complex, multi-million-euro appliances (e.g. advanced mass spectrometers or cryo-electron microscopes). If a scientist wouldn’t touch a 2-million-euro microscope without training, why should they expect to run code on a 10-million-euro cluster without it? A cluster is no different in that sense, and I would expect the same approach to their code base as other fields have already demonstrated – identify the parts of your pipeline/workflow that actually require and can make use of HPC systems, and optimize those parts. Everything else should be done elsewhere.

    I think, from an admin’s perspective, yes, we of course have to support the life sciences and help to the best of our knowledge. But in general (not only with regard to the life sciences) we are overdoing it. We cannot cater for simple one-click (one-command) solutions on HPC – we simply do not have the resources to do this, and the past has shown that scientists are very capable of using these installations efficiently to advance their research.

    OK, this was my take. I apologize for the lengthy comment and the frequent generalization of certain groups – I know there are lots of exceptions everywhere. I was only speaking from my own experience.

  2. rupture-de-catenaire

    Thank you for your input, Sebastian! No need to apologize for a lengthy comment. A thread always falls short compared to a discussion – but it is better than nothing. So here is an answer, which I hope establishes more common ground:

    as for “high-tier clusters”: agreed, my remark was misleading. There are currently four life science support clusters within the German NHR (the national HPC association) labelled tier 2. The NHR actually offers starter accounts. So we can actually tick this item off. Most sites offer tier 3 clusters, too. Therefore, all that ought to be eased is the transition from tier 3 to tier 2 where required. But that is actually not hard, either.

    as for modules: I see their benefit more in better performance – software which is compiled on “your” machine and for “your” machine just runs faster (most of the time). Reproducibility depends on the bugs within a version, on whether the algorithm is deterministic, and so on – and most of all on documenting all parameters used. So, if modules are provided to carry out an analysis: fine!

    as for the many applications: that is not specific to the life sciences – all steps in data processing, combined, require more tools than “just” MD. Refusal to install software often ends in users abandoning HPC services and resorting to “basement servers” – a tale often told to me. Whether that is better is doubtful. I think there must be a compromise on the service level.

    as for wrong parameterization: I agree with you. Yet I was writing about HPC, and I think we support people are aware of the HTC/HPC distinction. It is just that big-data analysis should be carried out from A to Z on HPC systems, even if it requires many smaller tools at the start and the end. To give an example, consider a pharmaceutical ligand-screening workflow. Downloading ligands, preprocessing them and eventually collating statistical results require those smaller jobs. The actual screening program, which takes > 90 % of the compute time, is an MPI program. Now, requiring such users to carry out all these time-consuming steps elsewhere, transfer their data, compute, and transfer again is neither a remedy for the reproducibility crisis (as it gets harder to document) nor a way of making HPC attractive.

    as for cultural differences: yes, indeed. And genomics is different again. Yet consider a typical metagenomics task such as local alignment. Pooling such shared-memory (SMP) jobs into a one-node or many-node job can a) be a remedy for I/O contention (avoiding random access patterns by staging the reference once) and b) boost efficiency. Some SMP tools scale pretty well, sometimes better than initially thought, given proper stage-in.

    > What’s so different about life scientists?

    The algorithms. Scaling applications which are supposed to hold huge matrices in memory proved relatively “easy” (yes, programming MPI is challenging, and programming numerics, too – this “easy” qualifier is meant historically). There _is_ plenty of progress, machine-wise, which requires new software to be written and to mature. This applies mostly to genomics. Life science, however, is a vast field. Consider electron microscopy: there are dedicated HPC applications. These people require visualization before the heavy processing and thereafter. But as for the compute-heavy stuff therein: I have yet to learn the difference between a physics Fourier transform and a “bio” one. Yet some sites fail to support such research. The result? I have seen self-built clusters run such code instead – some on par with bigger tier 3 clusters – or groups who abandon “their” site and calculate elsewhere. The same goes for mass spectrometry. Life science means a huge spectrum of different requirements. If there are investments of millions of euros or dollars in instruments, why deny crunching such data on a local site? By the way, my favourite workflow system allows mounting the cluster file system and triggering jobs whenever a file gets written there – “pseudo-real-time computing”, so to speak. If the code is HPC-“worthy”, we should be open to the possibilities.

    We HPC support scientists all have _one_ background. We cannot know everything. I have been told an anecdote of geneticists wanting to compute a (genome) “assembly” and compute scientists who thought of “assembler”. It does not translate well. The end of the story was quite some frustration.

    Sometimes we just need to learn to listen and to translate the needs of a compute-illiterate researcher to an HPC system. Otherwise we hinder scientific progress. And: there are many HPC sites that already think differently.
