Combining Technical and Enterprise Computing (2019-01-23)

New workload types and management systems keep popping up. Blockchain-based workloads started to run on CPUs, then on GPUs, and some of them later even on ASICs. The recent success of AI systems is pushing GPUs and, in analogy to blockchains, specialized circuits to their limits. The interesting point here is that AI and blockchain-based technologies are of general business interest, i.e. they are driving forces behind lots of new business models in many industries. The automotive industry, for example, (re-)discovered artificial neural networks for solving many of the tasks required to let cars drive themselves. I was lucky to join, for a short while more than 10 years ago, a research team working on image-based environment perception at a famous car maker. From my perspective the recent developments were not so clear back then - even though we already had self-driving cars steered by neural networks at the beginning of the 90s.

The high-performance computing community has a long tradition of managing tons of compute workload, maximizing resource usage, and queueing and prioritizing work. But business-critical workloads in many enterprises are quite different by nature: stability, interconnectivity, and security are the key requirements. Today the boundaries are getting blurred. Hyperscalers had to solve unique problems running huge amounts of enterprise applications, and they built their own systems which combined traditional batch scheduling with services. Finally Google came up with an open-source system (Kubernetes) that allows companies to build their own workload management systems. Kubernetes solves a lot of core problems around container orchestration, but many things are still missing. Pivotal's Container Service (PKS) enriches Kubernetes when it comes to solving what we at Pivotal call day-2 issues: Kubernetes needs to be created, updated, and maintained. It is not a closed box - PKS gives you choice, opinions, and best practices for running the platform as a product within large organizations.

But back to technical computing. What is clearly missing in Kubernetes are the capabilities built into traditional HPC schedulers over decades. Batch jobs are only supported rudimentarily at the moment: there is no queueing, no sophisticated job prioritization, and no first-class support for MPI workloads as we know it from HPC schedulers. The interfaces are also completely different, and many products are built on top of the HPC scheduler interfaces. Kubernetes will not replace traditional HPC schedulers, just as Hadoop's way of doing batch scheduling (what most people today associate with batch scheduling) did not replace them; classic HPC schedulers are still around and will survive for a reason. The cloud may also make you think queueing is obsolete - but only in a perfect world where money and resources are unlimited.
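To make that concrete, here is a minimal sketch of what a "batch job" looks like on plain Kubernetes, using the official kubernetes Python client; the image name and namespace are made up for illustration. The Job starts as soon as the API server accepts it and resources allow - there is no queue, no fair-share prioritization, and no MPI-aware startup as known from HPC schedulers.

```python
# Sketch: a batch job on plain Kubernetes via the official Python client.
# Image name and namespace are hypothetical. The Job runs as soon as it is
# admitted - no queueing, no fair-share priorities, no parallel environments.
from kubernetes import client, config

config.load_kube_config()

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="train-model"),
    spec=client.V1JobSpec(
        backoff_limit=2,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="train",
                        image="registry.example.com/train:latest",  # hypothetical image
                        command=["python", "train.py"],
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```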

What we need in order to tackle large-scale technical computing problems is a complete system architecture combining different aspects and different specialized products. There are three areas I want to look at:

  • Applications
  • Containers
  • Batch jobs

Pivotal has the most complete solution when it comes to providing an application runtime that abstracts away from pure container orchestration. The famous cf push command works with all kinds of programming languages, including Java, Java/Spring, Go, Python, Node.js… it keeps developers focused on their application and business logic rather than on building and wiring containers. All of that has been fully automated for years by concepts like buildpacks, service discovery, etc. In addition we need a container runtime for pre-defined containers; this is what Pivotal's Container Service (PKS) is for. Finally we have the batch-job part, which can be your traditional HPC system, be it Univa Grid Engine, Slurm, or HTCondor - a small submission sketch follows below.
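As a contrast to the Kubernetes Job above, this is roughly what the batch-job side looks like when the traditional HPC system is, for example, Slurm. This is a sketch only, with made-up resource numbers and script names; the point is that the job goes into a queue and the scheduler decides when and where it runs.

```python
# Sketch: submitting an MPI-style job to Slurm from Python. Resource numbers
# and script names are hypothetical. sbatch reads the job script from stdin;
# the scheduler queues, prioritizes, and places the job on the cluster.
import subprocess

job_script = """#!/bin/bash
#SBATCH --job-name=train-model
#SBATCH --ntasks=64
#SBATCH --time=02:00:00
srun python train.py
"""

result = subprocess.run(
    ["sbatch"], input=job_script, text=True, capture_output=True, check=True
)
print(result.stdout.strip())  # e.g. "Submitted batch job 12345"
```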

If we draw the full picture of a modern architecture supporting large-scale AI and technical computing workloads, it looks like the following:

[Figure: combined PKS and HPC architecture]

Thanks to open interfaces, RESTful APIs, and software-defined networking, a smooth interaction between these components is possible. The Open Service Broker API already acts as a bridge in many of the components; a minimal sketch of such a broker endpoint is shown below.
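For illustration only: a broker that exposes an HPC cluster to the platform implements a handful of REST endpoints defined by the Open Service Broker API. The following Flask sketch shows just the catalog endpoint; service and plan names are invented, and a real broker would also have to implement provisioning and binding endpoints.

```python
# Hypothetical sketch: the /v2/catalog endpoint of an Open Service Broker
# that advertises an HPC batch service to the platform. Service and plan
# identifiers below are made up for illustration.
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/v2/catalog")
def catalog():
    return jsonify({
        "services": [{
            "id": "hpc-batch-service",          # hypothetical service id
            "name": "hpc-batch",
            "description": "Submit batch jobs to the HPC cluster",
            "bindable": True,
            "plans": [{
                "id": "default-queue",          # hypothetical plan id
                "name": "default",
                "description": "Shared queue with standard priority",
            }],
        }]
    })


if __name__ == "__main__":
    app.run(port=8080)
```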

Enough for now - back to my Macap coffee grinder and later to the OOP conference here in Munich.