At the NeIC 2019 conference, we hosted a workshop titled Reimagining Research Computing. This workshop was in fact the founding moment of NordicHPC. (For the “Jupyter for research facilities” meeting notes, see jupyter.)

The abstract can be found at the link above, but in short: we realized that clusters are not keeping up with the times. They are still, for the most part, managed as they were decades ago. These days, standards for usability are much higher and our audience of users has changed a lot. At the same time, we could do much better with modern software and tools.

This page gives the main lessons from the workshop with some possible actions.

Reimagined HPC

First, what do we mean by HPC? Traditional ssh-accessed Linux clusters with a scheduling system. There are simpler special-purpose computing platforms, but we are focused on more general-purpose infrastructure (there will always be a need for this). We also aren’t talking about new systems where users get private virtual clusters - these are important, but there will still be some need for large-scale scheduled infrastructure (and efficiently using these virtual clusters can be an even harder development challenge!).

Like it or not, the Linux cluster isn’t going away anytime soon and will continue to be the workhorse for people who need computing and can’t develop for more specialized platforms. Yet these clusters are not evolving with the times. This has to change.

Thus, reimagined HPC is improving Linux clusters so that they are usable by more people. This is just one small portion of “research computing”, but still important, and something that not many think about from the start. Most attempts to improve clusters are small changes, but we think that we need to reimagine from the beginning while still focusing on Linux clusters.

The POSIX Linux interface is still the standard, so it needs good support. We want it to be easy to use, but not to strip everything away: layering on consumer-style usability can make the later learning curve steeper. We need to make the learning curve gradual again, while still making it possible to end up at the same place.

Just think how much more usable desktop Linux is compared to 10 or 20 years ago. Why haven’t clusters made the same strides?

Papercuts

A “papercut” is a usability term for a small problem that causes great pain for users, yet seems easily solvable. Since HPC systems still operate mostly as they were designed long ago, they are full of papercuts. One primary outcome of the workshop was cataloging the most common papercuts and how they could be resolved.

Organizational

  • Problem: Buy big resources first, worry about usability later. Companies, in contrast, make something usable first and scale later. Usability and staffing are rarely a primary concern.
  • Problem: Need to agree on a common framework for communication.
  • Problem: Centralization means less user interaction.
  • Problem: Sysadmin hiring - your ideal sysadmin is already a Linux expert and doesn’t represent most users.
  • Problem: Bad project management.
  • Problem: Bad bureaucracy
  • Problem: Bad decision-making
  • Idea: Documentation-driven development. Write clear, sensible instructions first, then implement them.
  • Idea: Standardization as a goal. Sites are almost gratuitously different.
  • Are cloud computing and new-style platforms replacing HPC just because they have less history and are able to adapt faster?

Accounts/access

  • Idea: Better authentication/authorization delegation.
  • Idea: Easy on-demand access to HPC for new users. Tiered access levels, lowest level is free and without applications. Useful for testing.
  • Idea: Being able to login everywhere with my university identity or ORCID.
  • Idea: Cloud <-> HPC. Some workloads (data science) work fine on cloud. Or only part of workloads need HPC.
  • Idea: HPC in your pocket. Or on smartphone? Easier to connect.
  • Idea: JupyterHub as an interface. It can use the same data, compute, and software.
  • Idea: Better terminal MOTDs
  • Idea: “Free tier” of HPC. Lets you get started immediately and test the system before deciding if you need to go through the complex application process.
  • Problem: HPC requires a separate account to manage.
  • Problem: Lack of “community accounts” on HPC sites. Leads to sharing accounts or making it hard to share data. Or perhaps this is a symptom of making it hard to share data and billing between users.
  • Problem: HPC is CentOS/RedHat based, not the same as typical desktops.
  • Problem: Easy and fast access is not always the case, both for logging in and for applying for accounts.

Documentation

  • Problem: Maintaining documentation is a headache, which limits what you can document.
  • Problem: Users will not read the documentation.
  • Problem: Look-and-feel differs between resources, and nomenclature differs sometimes. This is confusing for users.
  • Problem: “Normal” installation instructions you find online don’t work. No sudo, apt-get, etc.
  • Problem: Need good guidelines for job data sharing.
  • Goal: No need for documentation because everything is so easy (as easy as reading an email). Or perhaps as easy as using command line on your desktop computer.

Storage/filesystems

  • Problem: Filesystems are different everywhere - is there any reason for this? Standardization as a value: at least standardize some environment variables.
  • Problem: Filesystems are hard to understand; users need a better mental model of them.
  • Problem: Data owned by individual users (in per-user directories). Sharing is not easy, which leads to reinventing things and starting from scratch.
  • Idea: Better illustrations in documentation, especially filesystems.
  • Idea: Web-based access to filesystems
  • Idea: Remote mounts of filesystems
  • Idea: Standard filesystem paths and names, at least within organizations. Changing names and locations are an unnecessary abstraction.
  • Idea: Global filesystem: accessing data without copying huge amounts of it around. Or at least without having to learn the right way to tunnel ssh between two arbitrary locations.
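The environment-variable idea above can be sketched concretely. This is a hypothetical example of what a site could ship in a login profile; the names SCRATCH, PROJECT, and ARCHIVE and their paths are illustrative assumptions, not an existing cross-site standard:

```shell
# Hypothetical site-provided profile snippet (e.g. /etc/profile.d/storage.sh).
# Variable names and paths are illustrative, not a real standard.
export SCRATCH="/scratch/${USER:-unknown}"    # fast, temporary, may be purged
export PROJECT="/projects/${USER:-unknown}"   # shared project space, backed up
export ARCHIVE="/archive/${USER:-unknown}"    # cold long-term storage

# User scripts can then stay portable across sites, with no hard-coded paths:
workdir="${SCRATCH}/my-analysis"
echo "Job data goes in: ${workdir}"
```

If every site exported the same names, documentation and job scripts could be shared between clusters instead of being rewritten for each one.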

Outreach/training

  • Idea: Data stewardship is needed - more training, more people throughout the data lifecycle. Data stewards are people who support the incidental use of the data.
  • Idea: Focus on the long tail of science: The real long tail, not the long tail of real HPC.
  • Idea: Be strategic, not tactical: if the goal is to enable science, begin there, then ask how to make HPC easier to use.
  • Problem: Visibility, awareness of resources available.
  • Problem: Poor preliminary study of users’ needs: lots of work with little feedback or results.

Computing/scheduling

  • Lightweight computing is getting more common.
  • Idea: The ability to extend a job once (self-service or automatically) for free would prevent wasted resources. Some sites already offer this.
  • Problem: Subtasks. There are too many subtasks which are out of our control. We need to aim for a unified approach and a combined team.
  • Problem: Testing/debugging must go through the queue. A test run lasts only one or two minutes - why do we have to wait in the queue for this?
  • Problem: opaque queuing system. Why is my job prioritized like this, can you change it for me?
  • Problem: Efficient resource use is an important task, but hard for users to achieve.
  • Problem: Threads, cores, processes: an abstraction is needed; this is too far from day-to-day life.
  • Idea: Have enough resources
  • Idea: Tools for precisely and dynamically estimating job runtime - or just better tools for estimating needed resources in general.
  • Idea: run singularity/docker containers
  • Idea: encourage interactive computation more for new users.
  • Idea: More powerful login nodes so interactive jobs can be done here. Note: CSC has a system for sending ssh directly to a Slurm node to emulate this.
  • Idea: Default options that just work, more effort on inferring default options. Ideally only specify time/mem/CPUs/GPUs needed.
  • Idea: Encapsulation: users should be able to think 100% about science.
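Two of the ideas above - defaults that just work, and interactive testing without the full submit-and-wait loop - can be sketched for Slurm, which most of the clusters discussed here run. The script below is a minimal example under the assumption that the site infers partition, account, and other options; `./my_analysis` is a hypothetical user program:

```shell
#!/bin/bash
# Minimal batch script: the user states only what the job needs
# (time, memory, CPUs) and the site supplies sensible defaults
# for partition, account, and everything else.
#SBATCH --time=00:30:00      # wall-clock limit
#SBATCH --mem=4G             # memory for the job
#SBATCH --cpus-per-task=4    # CPU cores

srun ./my_analysis           # hypothetical user program
```

For short testing and debugging, `srun --time=00:10:00 --mem=1G --pty bash` starts an interactive shell on an allocated compute node, which is roughly the experience the CSC ssh-to-node setup mentioned above emulates.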