{"id":313,"date":"2025-12-15T07:10:50","date_gmt":"2025-12-15T07:10:50","guid":{"rendered":"https:\/\/blogs.fediscience.org\/rupture-de-catenaire\/?p=313"},"modified":"2026-01-03T10:46:04","modified_gmt":"2026-01-03T10:46:04","slug":"empowering-life-scientists-what-hpc-support-should-look-like","status":"publish","type":"post","link":"https:\/\/blogs.fediscience.org\/rupture-de-catenaire\/2025\/12\/15\/empowering-life-scientists-what-hpc-support-should-look-like\/","title":{"rendered":"Empowering Life Scientists: What HPC Support Should Look Like"},"content":{"rendered":"\n<p>It\u2019s a paradox: many smaller HPC clusters are <em>built by<\/em> life scientists, <em>for<\/em> life scientists \u2014 yet the very idea of hosting life science workloads still feels alien to many traditional HPC administrators<sup data-fn=\"a35c4375-9ee7-48bd-a4fd-6872f9e933d1\" class=\"fn\"><a href=\"#a35c4375-9ee7-48bd-a4fd-6872f9e933d1\" id=\"a35c4375-9ee7-48bd-a4fd-6872f9e933d1-link\">1<\/a><\/sup>. Why? Because life science workflows rarely fit the classic HPC mould: they\u2019re often data-heavy, workflow-driven, interactive, and built on scripting languages \u2014 not just MPI or GPU codes.<\/p>\n\n\n\n<p>So what does it take to provide <em>good<\/em> support for life science \u2014 and other \u201cnon-traditional\u201d domains \u2014 whose users don\u2019t just want to submit a single, tightly coupled job and walk away?<\/p>\n\n\n\n<p>In this article, I\u2019ll share some practical ideas for bridging that gap \u2014 because great HPC support isn\u2019t just about uptime and throughput. It\u2019s about understanding the <em>people<\/em>&nbsp;and their scientific needs.<\/p>\n\n\n\n<p><strong>Admission to a Cluster: The First Hurdle<\/strong><\/p>\n\n\n\n<p>Ordinarily, a group leader (PI) must apply for compute time \u2014 but the process varies wildly in complexity. 
At one end: <em>\u201cJust ask.\u201d<\/em> At the other: <em>\u201cWrite pages justifying your need, list every software package, predefine performance metrics, and estimate storage requirements \u2014 before you\u2019ve even run a test job.\u201d<br><\/em><br>Moreover, many scientists \u2014 not just in the life sciences \u2014 aren\u2019t fluent in HPC jargon. Asking them to estimate performance in <a href=\"https:\/\/en.wikipedia.org\/wiki\/Floating_point_operations_per_second\">FLOPS<\/a> per CPU or compute node, for instance, is not just difficult \u2014 it\u2019s often <em>theoretically meaningless<\/em> for applications that spend most of their time traversing matrices, managing I\/O, or orchestrating pipelines rather than multiplying floating-point numbers. Even most <a href=\"https:\/\/en.wikipedia.org\/wiki\/Molecular_dynamics\">molecular dynamics<\/a> programs do not report this figure, but rather a simulated-time-per-wall-clock-time performance indicator (e.g. nanoseconds per day), which \u2014 strangely enough \u2014 is usually accepted because these programs are well known to the HPC world.<br><br>So how do we design an admission process that respects HPC resource constraints\u202f<em>without<\/em>\u202falienating the very users we\u2019re trying to support?<br><br>Life scientists rarely run just one application \u2014 they use <em>many<\/em>, typically in sequence, with wildly different needs: some are memory-hungry, others I\/O-bound, some need GPUs, others run on a few cores for days. Asking them to predict all this upfront \u2014 before they\u2019ve even tested their pipeline \u2014 is unrealistic and counterproductive. Instead, let there be a sandbox allocation during a <em>discovery phase<\/em>: ask only for the basics \u2014 storage needs and, for the most resource-intensive tasks only, approximate job duration and RAM requirements \u2014 and let the user explore their workflow. The real resource profile emerges <em>after<\/em> they run, not before. 
That\u2019s when you can guide them to the right queues, containers, or allocations \u2014 not when you\u2019re still trying to get them through the door.<br><br>Now, if a realistic resource profile emerges <em>after<\/em> they run a data analysis, not before, then administrators can help gather performance metrics from those early jobs to guide the next, more formal allocation \u2014 turning usage data into smarter, more efficient scheduling, without ever asking the scientist to guess in the dark. Eventually, after that sandbox phase, a more detailed application for follow-up projects or extended allocations may reasonably be expected.<br><br>I hear you saying: without performance figures, shouldn\u2019t these groups apply for lower-tier clusters? Maybe so \u2014 but sometimes the sheer amount of data requires a bigger cluster, and support, already during the application phase.<\/p>\n\n\n\n<p><strong>Software provisioning \u2013 a user\u2011friendly, admin\u2011sustainable model<\/strong><\/p>\n\n\n\n<p>Most life\u2011science groups need <em>many<\/em> little tools<sup data-fn=\"8d9684bd-e111-44a8-824f-a53c77d87606\" class=\"fn\"><a href=\"#8d9684bd-e111-44a8-824f-a53c77d87606\" id=\"8d9684bd-e111-44a8-824f-a53c77d87606-link\">2<\/a><\/sup>, not just one monolithic application, and they often want to experiment with new versions as soon as they appear. The \u201ceither\u2011or\u201d model you see today \u2013 <em>either<\/em> \u201cwe\u201d pre\u2011install a static list of programs <em>or<\/em> users try to install everything themselves in a personal Conda tree that quickly busts their disk quota \u2013 breaks down for modern bio\u2011informatics pipelines<sup data-fn=\"569f152a-2db4-4e86-a826-32d1f47d3929\" class=\"fn\"><a href=\"#569f152a-2db4-4e86-a826-32d1f47d3929\" id=\"569f152a-2db4-4e86-a826-32d1f47d3929-link\">3<\/a><\/sup>.<br><br>The reality? 
The required tools are not all available as pre-built modules<sup data-fn=\"10e56c26-10ab-4e66-99d0-c71cafe49db8\" class=\"fn\"><a href=\"#10e56c26-10ab-4e66-99d0-c71cafe49db8\" id=\"10e56c26-10ab-4e66-99d0-c71cafe49db8-link\">4<\/a><\/sup>. While \u201cconda-in-home-dir\u201d is the default fallback, it\u2019s a quota disaster waiting to happen. A better path: start with a curated set of module files (bundled where possible for common workflows \u2014 easy to load, no additional install requirement from a user&#8217;s perspective, no quota hit). For the rest, allow self-install via <a href=\"https:\/\/doi.org\/10.1145\/2807591.2807623\">Spack<\/a>\u00a0or EasyBuild (with a shared prefix or scratch space to avoid home-dir bloat<sup data-fn=\"5dda9987-904a-4271-b164-e4b0dc65e4da\" class=\"fn\"><a href=\"#5dda9987-904a-4271-b164-e4b0dc65e4da\" id=\"5dda9987-904a-4271-b164-e4b0dc65e4da-link\">5<\/a><\/sup>) or, where appropriate, containers (Apptainer) for full reproducibility. The goal isn\u2019t to lock users in \u2014 it\u2019s to give them <em>multiple safe, supported paths<\/em> to get their software running, without turning their home directory into a package graveyard.<br><br>Also note that contemporary workflow systems can install Conda packages on the fly &#8211; e.g. onto a project file system &#8211; and a user can simply delete such unnamed environments after the analysis. <a href=\"https:\/\/doi.org\/10.1002\/spe.3075\">Another potential solution is on the horizon<\/a>.<br><br><strong>Expertise and Workflow Support<\/strong><\/p>\n\n\n\n<p>A sustainable life\u2011science service on an HPC cluster needs at least one domain expert on the admin team. 
When that expertise is given a voice in resource\u2011planning and software\u2011provisioning, the centre can steer users toward reproducible, scalable pipelines instead of <em>ad\u2011hoc<\/em> solutions.<\/p>\n\n\n\n<p>Current reproducibility initiatives already point to mature workflow managers that understand the batch systems used on HPC clusters (SLURM, LSF, HTCondor, \u2026). Notable examples are <strong><a href=\"https:\/\/doi.org\/10.12688\/f1000research.29032.3\">Snakemake<\/a><\/strong> \u2013 with its <a href=\"https:\/\/doi.org\/10.5281\/zenodo.16922261\">SLURM plugin<\/a>, maintained by the HPC groups in Mainz and a bioinformatics unit in Santa Cruz, and support for LSF as well as HTCondor \u2013 and <strong><a href=\"https:\/\/doi.org\/10.1038\/nbt.3820\">Nextflow<\/a><\/strong>, originally developed at the Barcelona Supercomputing Center, which likewise supports a number of HPC batch systems. These tools let scientists describe their analysis once and run it anywhere, while the admin team can support the workflow execution centrally.<\/p>\n\n\n\n<p>Providing \u201cquick\u2011look\u201d interactive services such as <a href=\"http:\/\/www.posit.co\/\">RStudio<\/a> or <a href=\"https:\/\/doi.org\/10.1109\/MCSE.2021.3059263\">Jupyter<\/a> notebooks (currently fashionable in the AI world) may please users initially, but it often forces them to reinvent pipelines that a workflow manager already handles. (Not to mention that a workflow manager can orchestrate many jobs concurrently and takes care of staging reference data onto node-local storage, etc.) Advising on and maintaining workflow\u2011based solutions should therefore be a core responsibility of the HPC support team \u2013\u202fwith the domain expert\u2019s guidance ensuring that the advice is both scientifically sound and technically feasible.<br><br><strong>Thrashing Ideology<\/strong><br><br>Many HPC centres impose minimum\u2011core policies (e.g. 
\u201conly jobs \u2265\u202f16\u202fCPUs, or an average &gt;\u202f256\u202fcores per allocation\u201d), as traditionally HPC has been all about well-scaling applications and squeezing nanoseconds out of compute kernels. Also, scheduling gets easier. In practice, any realistic data\u2011driven workflow \u2014 life\u2011science or otherwise \u2014 spends most of its jobs on single\u2011core processing or steps with only a few threads, such as quality control, format conversion, archiving, and the final plotting of results. Forcing all of these light\u2011weight tasks onto the compute nodes creates unnecessary queue pressure and defeats the purpose of the core\u2011size rule. A pragmatic compromise is to allow obvious \u201cshepherd\u201d jobs to run on the login node: the launch of a workflow manager (e.g. Snakemake or Nextflow) that merely builds a DAG of the jobs to perform and then submits the heavy jobs to the batch system, a quick download, or a few\u2011second R\/Python plot performed locally. (NB: <a href=\"https:\/\/root.cern\/\">ROOT<\/a>, CERN\u2019s core program for analysing high-energy physics data, has only modest threading support and is used to analyse petabytes of data on HPC systems worldwide.) Meanwhile, shared-memory programs can sometimes be pooled and launched together in a bigger job &#8211; or a small partition can be provided for such programs.<br><\/p>\n\n\n\n<p>Allowing such lightweight operations on the login node preserves queue efficiency, respects the administrators\u2019 goal of keeping compute\u2011intensive work on the cluster, and avoids the needless \u201cthrashing\u201d caused by over\u2011constraining tiny jobs.<br><br><em><strong>If you\u2019d like to comment, please do so below, on this blog platform. 
Comments on Mastodon will not be persistent.<\/strong><\/em><br><br><em>Note: This article has been written with the help of <a href=\"https:\/\/languagetool.org\/\">LanguageTool<\/a> and no other AI assistance.<\/em> Also, this blog article is a contribution to an upcoming discussion on life science support in the German NHR network and reflects my opinion.<\/p>\n\n\n\n<ol class=\"wp-block-footnotes\"><li id=\"a35c4375-9ee7-48bd-a4fd-6872f9e933d1\">That, at least, is my personal experience \u2014 something I encounter daily and at HPC conferences, and which many, many life scientists have expressed to me <a href=\"#a35c4375-9ee7-48bd-a4fd-6872f9e933d1-link\" aria-label=\"Zur Fu\u00dfnotenreferenz 1 navigieren\">\u21a9\ufe0e<\/a><\/li><li id=\"8d9684bd-e111-44a8-824f-a53c77d87606\">A single workflow involves on the order of one to six dozen application packages. So, depending on the heterogeneity of the research questions in my vicinity, I estimate 300 to 500 (edit: an earlier version said 500-800, but this was an upper bound estimate) application packages to be a realistic figure. This, of course, varies from site to site.  <a href=\"#8d9684bd-e111-44a8-824f-a53c77d87606-link\" aria-label=\"Zur Fu\u00dfnotenreferenz 2 navigieren\">\u21a9\ufe0e<\/a><\/li><li id=\"569f152a-2db4-4e86-a826-32d1f47d3929\">Yes, I know that some admin teams have a more flexible attitude to this challenge. <a href=\"#569f152a-2db4-4e86-a826-32d1f47d3929-link\" aria-label=\"Zur Fu\u00dfnotenreferenz 3 navigieren\">\u21a9\ufe0e<\/a><\/li><li id=\"10e56c26-10ab-4e66-99d0-c71cafe49db8\">edit after publishing: Yes, this is anecdotal evidence from teaching elsewhere, from user feedback, from a number of tales and a yet unpublished survey. But otherwise totally anecdotal. 
<a href=\"#10e56c26-10ab-4e66-99d0-c71cafe49db8-link\" aria-label=\"Zur Fu\u00dfnotenreferenz 4 navigieren\">\u21a9\ufe0e<\/a><\/li><li id=\"5dda9987-904a-4271-b164-e4b0dc65e4da\">edit after publishing: To avoid misunderstanding, I should point out that these two build systems are established on many clusters and provide a <strong>HUGE<\/strong> number of already community-approved and tested tools. Whilst no build framework can cover <em>all<\/em> edge cases, either one can satisfy the vast majority of scientific needs. <a href=\"#5dda9987-904a-4271-b164-e4b0dc65e4da-link\" aria-label=\"Zur Fu\u00dfnotenreferenz 5 navigieren\">\u21a9\ufe0e<\/a><\/li><\/ol>","protected":false},"excerpt":{"rendered":"<p>It\u2019s a paradox: many smaller HPC clusters are built by life scientists, for life scientists \u2014 yet the very idea of hosting life science workloads [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":287,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":"[{\"id\":\"a35c4375-9ee7-48bd-a4fd-6872f9e933d1\",\"content\":\"That, at least, is my personal experience. Something, I experience daily and at HPC conferences \\u2014 and expressed to me by many, many life scientists\"},{\"id\":\"8d9684bd-e111-44a8-824f-a53c77d87606\",\"content\":\"A single workflow ranges in the order of one to six dozen application packages. So, my estimate is, depending on the heterogeneity of the research questions in my vicinity, 300 to 500 (edit: an earlier version said 500-800, but this was an upper bound estimate) application packages to be a realistic figure. This, of course, varies from site to site. 
\"},{\"id\":\"569f152a-2db4-4e86-a826-32d1f47d3929\",\"content\":\"Yes, I know that some admin teams have a more flexible attitude to this challenge.\"},{\"id\":\"10e56c26-10ab-4e66-99d0-c71cafe49db8\",\"content\":\"edit after publishing: Yes, this is anecdotal evidence from teaching elsewhere, from user feedback, from a number of tales and a yet unpublished survey. But otherwise totally anecdotal.\"},{\"id\":\"5dda9987-904a-4271-b164-e4b0dc65e4da\",\"content\":\"edit after publishing: To avoid misunderstanding, I should point out, that these two build-systems are established on many clusters and provide a <strong>HUGE<\\\/strong> number of already community approved and tested tools which run somewhere. Whilst no build-framework can cover <em>all<\\\/em> edge cases, either one can satisfy the vast majority of scientific needs.\"}]","_share_on_mastodon":"1"},"categories":[17,29],"tags":[],"class_list":["post-313","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bioinformatics","category-high-performance-computing"],"share_on_mastodon":{"url":"","error":""},"_links":{"self":[{"href":"https:\/\/blogs.fediscience.org\/rupture-de-catenaire\/wp-json\/wp\/v2\/posts\/313","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.fediscience.org\/rupture-de-catenaire\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.fediscience.org\/rupture-de-catenaire\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.fediscience.org\/rupture-de-catenaire\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.fediscience.org\/rupture-de-catenaire\/wp-json\/wp\/v2\/comments?post=313"}],"version-history":[{"count":24,"href":"https:\/\/blogs.fediscience.org\/rupture-de-catenaire\/wp-json\/wp\/v2\/posts\/313\/revisions"}],"predecessor-version":[{"id":352,"href":"https:\/\/blogs.fediscience.org\/rupture-de-catenaire\/wp-json\/wp\/v2\/posts\/313\/revisions\/352"
}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blogs.fediscience.org\/rupture-de-catenaire\/wp-json\/wp\/v2\/media\/287"}],"wp:attachment":[{"href":"https:\/\/blogs.fediscience.org\/rupture-de-catenaire\/wp-json\/wp\/v2\/media?parent=313"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.fediscience.org\/rupture-de-catenaire\/wp-json\/wp\/v2\/categories?post=313"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.fediscience.org\/rupture-de-catenaire\/wp-json\/wp\/v2\/tags?post=313"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}