Friday 2 December 2016

DevOps and traditional HPC

Last April, I co-presented a short talk at Saudi HPC 2016 titled "What HPC can learn from DevOps." It was meant to raise awareness of the DevOps culture and mindset among HPC practitioners. The following day, the talk was complemented by an introductory tutorial on containers. This talk and tutorial were my second contribution to promoting DevOps locally. The first attempt was at Saudi HPC 2013 with the DevOps afternoon, in which we had speakers from Puppet and Ansible with good examples, even back then, of how automation and "infrastructure as code" frameworks encourage communication, visibility, and feedback loops within the organisation.

Talk Abstract: 

Cloud, web, and big data operations and the DevOps mindset are rapidly changing the landscape of Internet, IT, and enterprise services and applications. What can the HPC community learn from these technologies, processes, and culture? From the IT unicorns (Google, Facebook, Twitter, LinkedIn, and Etsy) that are in the lead? What could be applied to tackle HPC operations challenges, such as efficiency and better use of resources? The talk presents a use case of automation and version control in an enterprise HPC data centre, as well as a proposal for using containers and new schedulers to drive better utilisation and diversify data centre workloads: not just HPC, but big data, interactive, batch, short- and long-lived scientific jobs.

Here are some of my personal notes from the talk. Apparently, they did not fit the 15-minute window I was given.



Talk reflections and thought points:


Definitions: Presenting the different possible HPC workloads: HTC, HPC, HSC, and the recent trend of data centre convergence that brings in big data "analytics" and, more recently, MLDM "machine learning and data mining." Highlighting the diversity and variability of HPC workloads, then moving on to what DevOps means for HPC: why has it not picked up as much, and what can HPC learn from the enabling technologies and from cloud and big data operations?

The disconnect: Traditional HPC software changes are infrequent; HPC does not need to be agile in handling frequent, continuous deployments. Each HPC cluster deployment is a snowflake, unique in its own way, which makes it hard for user groups to port their work to other clusters, a process that takes days, weeks, often months. Application instrumentation and performance monitoring are not the norm, nor are the plumbing and CI/CD pipelines.

The motivation: However, HPC infrastructures inevitably have to grow. Innovations in HPC hardware require a new look at how HPC software is deployed and developed, and HPC data centres will need their few highly skilled operational engineers to scale operations efficiently with fewer resources. The fragmented use of system resources needs to be optimised. Scientific and business applications might be rearranged, refactored, and reworked to allow better workflows: analysing application and data-processing stages and their dependencies, looking at them as a whole of connected parts while avoiding compartmentalisation and infrastructure silos.

The scalability challenge: What could be the primary HPC driver for introducing DevOps culture and tooling? I cannot stress scalability enough: the imminent growth due to initiatives like national grids and international exascale computing, the workload, the number of nodes, and the number of personalities or roles an HPC node might take.

DevOps tools: Emphasise the richness of the tool set and the culture that has driven its evolution. Point out that it is not about the tools so much as the concepts the tools enable: not just automation and build, ship, and delivery workflows, but the ever-engaging feedback loops, the collaboration, and the ease that comes from integration. Highlight that communication and feedback are not just human and face-to-face, but also meaningful dashboards and actionable metrics, the importance of code reviews, rich APIs, and a natural UX. Such a comprehensive tool set contrasts with the current fragmented HPC alternatives or, in some cases, enterprise tools used wrongly for HPC.
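To make the "infrastructure as code" idea concrete, here is a minimal, hedged sketch of the desired-state, idempotent pattern that tools like Puppet and Ansible implement far more completely; the file path and contents below are purely illustrative and not from the talk.

```python
# Minimal sketch of the idempotent, desired-state pattern behind
# "infrastructure as code" tools. The path and contents are placeholders.
from pathlib import Path

def ensure_file(path: str, desired: str) -> bool:
    """Converge a file to its desired content; return True if a change was made."""
    target = Path(path)
    current = target.read_text() if target.exists() else None
    if current == desired:
        return False              # already converged, nothing to do
    target.write_text(desired)    # apply the change
    return True                   # changed: worth surfacing on a dashboard or log

if __name__ == "__main__":
    changed = ensure_file("/tmp/cluster.conf", "scheduler=slurm\n")
    print("changed" if changed else "no change")   # the feedback half of the loop
```

The point is not the file itself but the two properties: re-running the code is safe, and the "changed / no change" signal is exactly the kind of feedback that can feed a dashboard or a code review.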

Use case of differences: The case of provisioning, and how the terminology differs between the HPC and web/cloud communities. Taking this example further to pivot to the false assumption that HPC provisioning can only be bare metal.

Validation: Validating the hypothesis of serious HPC workloads running in the cloud, and the recent use cases of container deployment in HPC drawn from surveys and the production-ready vendor solutions trending over the last couple of years; perhaps also presenting some of the related HPC cloud news.

2nd-generation data centre provisioning tools: Offer open-source alternatives to traditional HPC provisioning tools and highlight their diversity in handling bare metal, virtual machine images and instances, and containers, as well as the possibility of combining this with diskless and thin-OS hosts.

The current state of the HPC data centre: Highlight the problem of static partitioning (silos) and the various workloads needed to either support or complement the bigger business/scientific application, and discuss the valid reasons for partitioning.

Resource Abstraction: What if we abstract the data centre resources and break down the silos? How should that be done? What core components need to be addressed, and why? Present an example proposal of such tooling with the reasoning behind it.
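As a rough illustration of what "abstracting the resources" could mean in practice, here is a toy sketch (not the actual proposal from the talk) that treats nodes from different silos as one pool and matches jobs by their requirements alone; node names and job shapes are made up.

```python
# Toy resource pool: heterogeneous nodes from different silos treated as one pool.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpus: int
    mem_gb: int
    gpus: int = 0

@dataclass
class Job:
    name: str
    cpus: int
    mem_gb: int
    gpus: int = 0

POOL = [
    Node("hpc-001", cpus=32, mem_gb=256),
    Node("bigdata-007", cpus=16, mem_gb=128),
    Node("gpu-003", cpus=24, mem_gb=192, gpus=4),
]

def fit(job: Job):
    """Pick any node that satisfies the job, regardless of which silo it came from."""
    for node in POOL:
        if node.cpus >= job.cpus and node.mem_gb >= job.mem_gb and node.gpus >= job.gpus:
            return node
    return None

print(fit(Job("mpi-solver", cpus=16, mem_gb=64)))            # lands on hpc-001
print(fit(Job("training-run", cpus=8, mem_gb=32, gpus=2)))   # lands on gpu-003
```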

Unit of change: Container technology is a useful enabler for such problems. It does not have the performance overhead that made HPC shy away from virtualisation-based solutions, and it enables portability for the various HPC workload deployments. Not to mention the richness of its ecosystem for taking the current status quo of scheduling, resource, and workload management to greater levels of efficiency and better utilisation of data centre resources.
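A hedged sketch of what a containerised batch step might look like: the image name, bind mount, and solver command are placeholders, and the resource limits use standard Docker flags (--cpuset-cpus, --memory) that map onto cgroups rather than a hypervisor.

```python
# Illustrative containerised batch step; image, paths, and command are placeholders.
import subprocess

def run_step(image, workdir, command):
    docker_cmd = [
        "docker", "run", "--rm",
        "--cpuset-cpus", "0-3",        # pin to four cores via cgroups
        "--memory", "8g",              # memory limit, again via cgroups
        "-v", f"{workdir}:/scratch",   # share the job's scratch directory
        "-w", "/scratch",
        image,
    ] + command
    return subprocess.call(docker_cmd)

if __name__ == "__main__":
    run_step("myorg/solver:latest", "/lustre/scratch/job42",
             ["./solve", "--input", "case.dat"])
```

Because the image carries its own user-space dependencies, the same step can in principle be moved to another cluster without rebuilding the software stack.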

The software-defined data centre: Everything so far can either be code or be managed and monitored by code. How flexible is that, and what new opportunities does it bring? How can everything be broken down into components? How do the parts integrate and fit together, enabling a "Lego-style" composable infrastructure driven and managed by code, policies, and desired-state models? How has code opened new possibilities to stakeholders?
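One way to read "managed by code, policies, and desired-state models" is as a reconciliation loop: compare the declared state with the observed state and derive the actions needed to converge. A toy sketch, with made-up node roles:

```python
# Toy reconciliation loop for a desired-state model; node names and roles are illustrative.
desired = {"node-01": "compute", "node-02": "compute", "node-03": "container-host"}
observed = {"node-01": "compute", "node-02": "container-host"}

def plan(desired, observed):
    """Return the list of actions needed to converge observed state to desired state."""
    actions = []
    for node, role in desired.items():
        current = observed.get(node)
        if current is None:
            actions.append(f"provision {node} as {role}")
        elif current != role:
            actions.append(f"reassign {node}: {current} -> {role}")
    for node in observed.keys() - desired.keys():
        actions.append(f"decommission {node}")
    return actions

print(plan(desired, observed))
```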


Some Docker evaluation use cases:

Challenges ahead: What to expect on the road ahead? The unique differences and requirements? Which underlying container technologies need to be in place, and for what? The right amount of namespace isolation versus cgroups control; and what about LXC, LXD, Singularity, and Docker? What will we see coming next?
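On the cgroups side of that question, it helps to remember that a container's resource ceiling is just a few files in the cgroup hierarchy. A small read-only sketch, assuming the cgroup v1 layout mounted at /sys/fs/cgroup that was typical in 2016 and Docker's default cgroupfs driver:

```python
# Read a cgroup's CPU ceiling straight from the cgroup v1 filesystem.
from pathlib import Path

def cpu_quota(cgroup):
    """Return the CPU ceiling of a cgroup as a number of CPUs, or None if unlimited."""
    base = Path("/sys/fs/cgroup/cpu") / cgroup
    quota = int((base / "cpu.cfs_quota_us").read_text())
    period = int((base / "cpu.cfs_period_us").read_text())
    return None if quota < 0 else quota / period   # e.g. 2.0 means two CPUs' worth

if __name__ == "__main__":
    # with the cgroupfs driver, Docker keeps its containers under the "docker" cgroup
    print(cpu_quota("docker"))
```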


The importance of having the right mindset to evaluate and experiment with new paradigms and technologies, and eventually to deploy and utilise them in production; to introduce new workflows and enable better communication between the different teams (developers, users, security, operations, business stakeholders). The concept of indirection and abstraction for solving computing problems, in this case two-level scheduling for granular resource management. The container as a unit concept for the workload is not just for applications; it could also be for data.
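The two-level scheduling indirection mentioned above can be sketched in a few lines, in the spirit of Mesos-style resource offers (all names and numbers here are illustrative): the first level offers slices of the data centre, and each framework (batch, big data, or interactive) decides what to accept.

```python
# Sketch of two-level, offer-based scheduling; nodes, sizes, and jobs are made up.
offers = [
    {"node": "n1", "cpus": 8, "mem_gb": 64},
    {"node": "n2", "cpus": 4, "mem_gb": 32},
]

class BatchFramework:
    """Second-level scheduler: accepts only offers big enough for its next job."""
    def __init__(self, need_cpus, need_mem_gb):
        self.need_cpus, self.need_mem_gb = need_cpus, need_mem_gb

    def review(self, offer):
        if offer["cpus"] >= self.need_cpus and offer["mem_gb"] >= self.need_mem_gb:
            return ("accept", offer["node"])
        return ("decline", offer["node"])

framework = BatchFramework(need_cpus=6, need_mem_gb=48)
for offer in offers:
    print(framework.review(offer))   # accepts n1, declines n2
```

Declined resources stay in the pool for other frameworks, which is what lets batch, big data, and long-running services share the same machines.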

to be continued ...

References:

https://blog.ajdecon.org/the-hpc-cluster-software-stack/
http://sebgoa.blogspot.com/2012/11/i-was-asked-other-day-what-was.html
http://qnib.org/data/isc2016/2_docker_drivers.pdf


1 comment:

  1. Hi Angelina,
    First, I would advise you to join the DevOps Weekly mailing list: http://www.devopsweekly.com/

    As for books, in a recent Configuration Management Camp keynote, "Implementing Infrastructure as Code", Kief Morris suggested a good list of books on slide 44. Check the slides out at http://www.slideshare.net/KiefMorris/implementing-infrastructure-as-code-configmgtcamp-2017/44
