At the NeIC 2019 conference, we hosted a workshop titled Reimagining Research Computing. This workshop was in fact the founding moment of NordicHPC. (For the “Jupyter for research facilities” meeting notes, see jupyter.)
The abstract can be found at the link above, but in short we had the realization that clusters are not keeping up with the times: they are still, for the most part, managed as they were decades ago. These days,
standards for usability are much higher and our audience of users has
changed a lot. At the same time, we could do much better with modern
software and tools.
This page gives the main lessons from the workshop with some possible
actions.
Reimagined HPC
First, what do we mean by HPC? Traditional ssh-access Linux clusters
with a scheduling system. There are simpler special-purpose computing
platforms, but we are focused on more general-purpose infrastructure
(there will always be a need for this). We also aren’t exactly talking
about new systems where users get private virtual machine clusters -
these are important, but there will still be some need for large-scale
scheduled infrastructure (and efficiently using these virtual clusters
can be an even harder development challenge!).
Like it or not, the Linux cluster isn’t going away anytime soon and
will continue to be the workhorse for people who need computing and
can’t develop for more specialized platforms. Yet, these clusters are
not evolving with the times. This has to change.
Thus, reimagined HPC is improving Linux clusters so that they are
usable by more people. This is just one small portion of “research
computing”, but still important, and something that not many think
about from the start. Most attempts to improve clusters are small
changes, but we think that we need to reimagine from the beginning
while still focusing on Linux clusters.
The POSIX Linux interface is still the standard, so it needs good support. We want to make it easier to use, not to remove it.
As consumer software becomes more usable, the cluster learning curve feels steeper by comparison. We need to make the learning curve more gradual again, while still making it possible to end up at the same place.
Just think how much more usable desktop Linux is compared to 10 or 20
years ago. Why haven’t clusters made the same strides?
Papercuts
A “papercut” is a usability term for a small problem which causes
great pain for users, yet which seems easily solvable. Since HPC
systems still operate mostly as designed long ago, there are many
papercuts in them. One primary outcome of the workshop was cataloging
what the most common papercuts are and how they could be resolved.
Organizational
- Problem: We buy big resources first and think about usability later. Companies, by contrast, make something usable first and scale later. Usability and staffing are rarely a primary concern.
- Problem: Need to agree on a common framework for communication.
- Problem: Centralization means less interaction with users.
- Problem: Sysadmin hiring - your ideal sysadmin is already a Linux expert and doesn’t represent most users.
- Problem: Bad project management.
- Problem: Bad bureaucracy.
- Problem: Bad decision-making.
- Idea: Documentation-driven development. Write clear, sensible instructions first, then implement to match; see the sketch after this list.
- Idea: Standardization as a goal. Sites are almost gratuitously different.
- Are cloud computing and new-style platforms replacing HPC just
because they have less history and are able to adapt faster?
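One way to make documentation-driven development concrete is to let the usage examples in the docs double as tests, so the documentation is written first and the implementation has to catch up to it. A minimal Python sketch; the submit_job helper and its interface are hypothetical:

```python
"""Minimal sketch of documentation-driven development: the user-facing
documentation below was written first, and doctest keeps the eventual
implementation honest. The submit_job helper is hypothetical."""


def submit_job(script, hours=1, mem_gb=4):
    """Submit a batch script with sensible defaults.

    Documented usage, written before any implementation existed:

    >>> submit_job("train.sh")
    'submitted train.sh (1h, 4GB)'
    >>> submit_job("train.sh", hours=12, mem_gb=64)
    'submitted train.sh (12h, 64GB)'
    """
    # A real version would talk to the scheduler; the sketch only
    # returns the confirmation string the documentation promises.
    return f"submitted {script} ({hours}h, {mem_gb}GB)"


if __name__ == "__main__":
    import doctest
    doctest.testmod()  # fails until the code matches the docs
```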
Accounts/access
- Idea: Better authentication/authorization delegation.
- Idea: Easy on-demand access to HPC for new users. Tiered access
  levels, where the lowest level is free and requires no application. Useful for testing.
- Idea: Being able to login everywhere with my university identity
or ORCID.
- Idea: Cloud <-> HPC. Some workloads (data science) work fine in the
  cloud, or only part of a workload needs HPC.
- Idea: HPC in your pocket. Or on a smartphone? Easier to connect.
- Idea: JupyterHub as an interface. It can use the same data,
  compute, and software; see the configuration sketch after this list.
- Idea: Better terminal MOTDs
- Idea: “Free tier” of HPC. Lets you get started immediately and
test the system before deciding if you need to go through the
complex application process.
- Problem: HPC requires a separate account to manage.
- Problem: Lack of “community accounts” on HPC sites. This leads to
  sharing personal accounts or makes it hard to share data. Or perhaps
  this is a symptom of it being hard to share data and billing between
  users.
- Problem: HPC is CentOS/RedHat based, not the same as typical desktops.
- Problem: Easy and fast access is not a given, for both logging in
  and applying for accounts.
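For the JupyterHub idea above, a minimal configuration sketch, assuming the batchspawner package on a Slurm cluster; the partition name and resource values are placeholders:

```python
# jupyterhub_config.py -- minimal sketch, assuming the batchspawner
# package and a Slurm cluster; partition and resource values below
# are placeholders, not recommendations.
c = get_config()  # noqa: F821  (injected by JupyterHub at load time)

# Spawn each user's notebook server as a Slurm batch job, so Jupyter
# sessions see the same filesystems, software stack, and accounting
# as ordinary cluster jobs.
c.JupyterHub.spawner_class = "batchspawner.SlurmSpawner"

c.SlurmSpawner.req_partition = "interactive"  # hypothetical partition
c.SlurmSpawner.req_memory = "4G"
c.SlurmSpawner.req_runtime = "4:00:00"
```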
Documentation
- Problem: Documentation tooling is a headache; it limits what you can
  document.
- Problem: Users will not read the documentation.
- Problem: Look-and-feel differs between resources, and nomenclature
differs sometimes. This is confusing for users.
- Problem: “Normal” installation instructions you find online don’t
  work on clusters: no sudo, no apt-get, etc. See the user-space install sketch after this list.
- Problem: Need good guidelines for job data sharing.
- Goal: No need for documentation because everything is so easy (as
easy as reading an email). Or perhaps as easy as using command line
on your desktop computer.
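One common workaround for the no-sudo papercut is installing into user-owned space instead. A minimal sketch, assuming only a stock Python on the login node; the path and the example package are placeholders:

```python
"""Minimal sketch: install software without sudo by building a
per-user virtual environment under $HOME. The path and the example
package (numpy) are placeholders."""
import os
import subprocess
import sys

venv_dir = os.path.expanduser("~/venvs/demo")  # hypothetical location

# Create the environment once; it lives entirely in the user's home.
if not os.path.isdir(venv_dir):
    subprocess.run([sys.executable, "-m", "venv", venv_dir], check=True)

# Install into the user-owned environment -- no root, no apt-get.
pip = os.path.join(venv_dir, "bin", "pip")
subprocess.run([pip, "install", "numpy"], check=True)

print(f"installed into {venv_dir}")
```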
Storage/filesystems
- Problem: Filesystems are different everywhere. Why? Is there any
  reason? Standardization as a value: at least standardize some
  environment variables.
- Problem: Filesystems are hard to understand; users need a better
  mental model of them.
- Problem: Data owned by individual users (in per-user
  directories). Sharing is not easy, which leads to reinventing and
  starting things from scratch.
- Idea: Better illustrations in documentation, especially
filesystems.
- Idea: Web-based access to filesystems
- Idea: Remote mounts of filesystems
- Idea: Standard filesystem paths and names, at least within
  organizations. The changing names and locations are an unnecessary
  abstraction; see the sketch after this list.
- Idea: Global filesystem, for accessing data without copying huge
  amounts of it. Or at least without having to learn the right way to
  tunnel ssh between two arbitrary locations.
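To make the environment-variable idea concrete: if every site exported the same variables, user scripts would be portable across clusters. A minimal sketch; SCRATCH and PROJECT are hypothetical standard names, not an existing convention:

```python
"""Minimal sketch of site-portable paths via standardized environment
variables. SCRATCH and PROJECT are hypothetical standard names."""
import os
from pathlib import Path


def data_dir(kind: str = "scratch") -> Path:
    """Resolve a standard storage area, falling back to $HOME."""
    var = {"scratch": "SCRATCH", "project": "PROJECT"}[kind]
    return Path(os.environ.get(var, str(Path.home())))


# The same script runs unchanged on any cluster that follows the
# convention -- no hard-coded per-site /wrk/... or /proj/... paths.
print(f"writing intermediate files under {data_dir('scratch')}")
print(f"reading shared inputs from {data_dir('project')}")
```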
Outreach/training
- Idea: Data stewardship is needed - more training, more people throughout
the data lifecycle. Data stewards are people who support the
incidental use of the data.
- Idea: Focus on the long tail of science: The real long tail, not
the long tail of real HPC.
- Idea: Be strategic, not tactical: if the goal is to enable
  science, begin there, then ask how to make HPC easier to use.
- Problem: Visibility, awareness of resources available.
- Problem: Poor preliminary study of users’ needs: lots of work
  with little feedback or results.
Computing/scheduling
- Lightweight computing is getting more common
- Ability to extend a job once (self-service or automatically) for free
  would prevent wasted resources. Some sites already offer this.
- Problem: subtasks. There are too many subtasks which are out
  of our control. We need to aim for a unified approach and a combined
  team.
- Problem: testing/debugging must go through the queue. A test run
  lasts only one or two minutes; why do we have to queue for it?
- Problem: opaque queuing system. Why is my job prioritized like
  this, and can you change it for me?
- Problem: efficient resource use is an important task, but users get little help with it.
- Problem: threads, cores, processes: an abstraction is needed; this
  is too far from day-to-day life.
- Idea: Have enough resources
- Idea: tools for precisely and dynamically estimating job runtime. Or
  just better tools for estimating needed resources in general.
- Idea: run Singularity/Docker containers.
- Idea: encourage interactive computation more for new users.
- Idea: More powerful login nodes so interactive jobs can be done
  there. Note: CSC has a system for ssh-ing directly to a Slurm
  node to emulate this.
- Idea: Default options that just work, with more effort put into
  inferring defaults. Ideally users would specify only the
  time/memory/CPUs/GPUs needed; see the submit-wrapper sketch after this list.
- Idea: Encapsulation: users think 100% about science only.
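To illustrate the “default options that just work” idea, a minimal sketch of a submit wrapper, using standard sbatch flags; the default values themselves are made up:

```python
"""Minimal sketch of "default options that just work": a wrapper that
fills in scheduler defaults so users state only what they need.
Uses standard sbatch flags; the default values are made up."""
import subprocess


def submit(script: str, time: str = "1:00:00", mem: str = "4G",
           cpus: int = 1, gpus: int = 0) -> None:
    """Submit a batch script, supplying defaults for everything else."""
    cmd = ["sbatch",
           f"--time={time}",
           f"--mem={mem}",
           f"--cpus-per-task={cpus}"]
    if gpus:
        cmd.append(f"--gpus={gpus}")  # available in Slurm >= 19.05
    cmd.append(script)
    subprocess.run(cmd, check=True)


# Users say only what they know; everything else has a sane default.
submit("analysis.sh")                                   # small default job
submit("train.sh", time="12:00:00", mem="32G", cpus=8, gpus=1)
```

A natural extension would be inferring the defaults from a user’s past jobs rather than hard-coding them.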