From 2 Weeks To 2 Hours: AWS Batch Fixes Solugen’s HPC Problems

High-Performance Computing

  • 30 September 2023
AWS Funding Secured by Cloud303
  • Partner Opportunity Acceleration
  • Well-Architected
  • Migration Acceleration Program 2.0

About the Customer

Solugen is an innovative biotech startup committed to transforming the industrial chemicals market by using renewable resources as an alternative to petroleum-based production. Operating at the intersection of science and sustainability, Solugen faced a pressing challenge: their computational modeling of protein structures, core to their business, was consuming excessive time and computational resources.

Summary

Solugen is a biotech startup that produces industrial chemicals from sources other than petroleum. Solugen was running large high-performance computing (HPC) workloads on a single EC2 instance using RosettaCommons, a software suite for computational modeling and analysis of protein structures. A single job would take weeks to complete. They needed a solution that was more time- and resource-efficient.

Problem Statement

Solugen uses a software platform called RosettaCommons to model protein folding behavior. This analysis is core to their research, and that research is core to their business. Because the software is complex and computationally demanding, scaling the business with traditional computational methods would have meant considerable investment in individual on-premises or cloud machines without producing a scalable solution: each additional job would have required an additional machine. They needed a creative way to scale the application for larger jobs only when that computational power was needed, in a reasonable amount of time and at an affordable cost.

Why Cloud303?

  • Demonstrated Expertise in HPC: Cloud303 possesses specialized expertise in HPC, which is crucial for applications that require complex computational processes. This includes genomics sequencing, molecular modeling, and advanced simulations.
  • Robust Infrastructure: The infrastructure provided by Cloud303 is tailored to meet the stringent performance, reliability, and scalability needs of HPC. Our team offers a robust ecosystem that can handle large-scale and intricate computations.
  • Exceptional Support and Security: Cloud303 offers round-the-clock support, along with proven security protocols, to ensure that sensitive data and complex workloads are managed in compliance with industry standards.
  • Proven Track Record: Cloud303 has a strong history of successful partnerships within the life sciences industry. Our commitment to excellence, reliability, and client-focused solutions has made us a trusted partner.

Engagement Overview

Cloud303's engagements follow a streamlined five-phase lifecycle: Requirements, Design, Implementation, Testing, and Maintenance. Initially, a comprehensive assessment is conducted through a Well-Architected Review to identify client needs. This is followed by a scoping call to fine-tune the architectural design, upon which a Statement of Work (SoW) is agreed and signed.

The implementation phase kicks in next, closely adhering to the approved designs. Rigorous testing ensures that all components meet the client's specifications and industry standards. Finally, clients have the option to either manage the deployed solutions themselves or to enroll in Cloud303's Managed Services for ongoing maintenance, an option many choose due to their high satisfaction with the services provided.

Solution Provided

Optimizing with AWS Batch

Cloud303 chose AWS Batch for Solugen's HPC workload because the customer would pay only for the resources actually used.

To start, a templatized Rosetta environment was created using Docker containers so the jobs could seamlessly scale. Then, multiple compute environments were deployed (for testing as well as production jobs). To optimize costs, S3 buckets were used to house data. To give faster access to storage and ensure data did not leave Solugen’s VPC, VPC endpoints were created.
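As a rough illustration of the VPC endpoint step, the sketch below builds the parameters for an S3 gateway endpoint. The case study does not say which endpoint type was used, so the gateway type is an assumption, and the VPC ID, region, and route table ID are hypothetical placeholders.

```python
def s3_gateway_endpoint_request(vpc_id, region, route_table_ids):
    """Build keyword arguments for the EC2 CreateVpcEndpoint API call."""
    if not route_table_ids:
        # A gateway endpoint only takes effect through associated route tables.
        raise ValueError("provide at least one route table ID")
    return {
        "VpcEndpointType": "Gateway",  # assumption: gateway rather than interface
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    }


# Hypothetical identifiers, for illustration only.
request = s3_gateway_endpoint_request("vpc-0example", "us-east-1", ["rtb-0example"])
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("ec2").create_vpc_endpoint(**request)
print(request["ServiceName"])
```

With the endpoint associated to the subnets' route tables, S3 traffic from the compute environment stays on the AWS network instead of crossing the public internet.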

One goal was to simplify Solugen’s experience as much as possible, so the data pipeline starts with the upload of an input file to S3. That file contains all the relevant instructions. The runtime will download the file, read the instructions and start the job based on those instructions. It is also possible to use the environment to spin up a single server (which picks up the job from S3, runs the job, then uploads the output artifact back to S3).
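The pipeline described above might look something like the following sketch. The actual format of Solugen's input file is not documented here, so the JSON layout and the field names (`name`, `command`, `output_prefix`) are assumptions made purely for illustration, and the S3 download and upload steps are left out so the example runs on its own.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSpec:
    name: str
    command: list
    output_prefix: str


def parse_instructions(text):
    """Parse the uploaded instruction file (assumed here to be JSON)."""
    raw = json.loads(text)
    if not isinstance(raw, dict):
        raise ValueError("instruction file must contain a JSON object")
    missing = {"name", "command", "output_prefix"} - set(raw)
    if missing:
        raise ValueError(f"instruction file is missing: {sorted(missing)}")
    if not isinstance(raw["command"], list) or not raw["command"]:
        raise ValueError("'command' must be a non-empty list of arguments")
    return JobSpec(raw["name"], [str(a) for a in raw["command"]], raw["output_prefix"])


def run_job(spec, runner):
    """Run the job via the injected runner; return the key for the output artifact."""
    exit_code = runner(spec.command)
    if exit_code != 0:
        raise RuntimeError(f"job {spec.name!r} failed with exit code {exit_code}")
    return f"{spec.output_prefix.rstrip('/')}/{spec.name}.tar.gz"


# Hypothetical instruction file contents; in the real pipeline this text
# would come from the object uploaded to S3.
example = '{"name": "fold-001", "command": ["echo", "placeholder"], "output_prefix": "results/"}'
spec = parse_instructions(example)
# The runner is stubbed out; a real one could be
#   lambda cmd: subprocess.run(cmd).returncode
output_key = run_job(spec, runner=lambda cmd: 0)
print(output_key)
```

Passing the runner in as a parameter keeps the parsing and naming logic testable without launching a real Rosetta process.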

OpenMPI Framework

The more elegant solution, however, and the one that truly changed Solugen’s workflow, was the parallel computing design. By leveraging the OpenMPI framework in the runtime environment, AWS Batch could spin up multiple nodes to process a single job. Several instances would be launched, with one assigned as the master and the rest acting as worker nodes. Each worker node would report its unique ID to the master, and once the master had enough nodes to run the submitted job, it would run the Rosetta script while OpenMPI managed the distribution of computation across the worker nodes.
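A minimal sketch of that master/worker handshake follows. It assumes the job runs as an AWS Batch multi-node parallel job, which exposes the `AWS_BATCH_JOB_NODE_INDEX`, `AWS_BATCH_JOB_MAIN_NODE_INDEX`, and `AWS_BATCH_JOB_NUM_NODES` environment variables, and that workers identify themselves by private IP address. The case study confirms neither detail, and the IP addresses, slot count, and `rosetta_job` command are hypothetical.

```python
def node_role(env):
    """Return 'master' when this node's index equals the main node index."""
    index = int(env["AWS_BATCH_JOB_NODE_INDEX"])
    main = int(env["AWS_BATCH_JOB_MAIN_NODE_INDEX"])
    return "master" if index == main else "worker"


def cluster_ready(registered_ips, env):
    """True once every node in the job (master included) has registered."""
    return len(set(registered_ips)) >= int(env["AWS_BATCH_JOB_NUM_NODES"])


def hostfile_text(ips, slots_per_node):
    """Render an Open MPI hostfile: one 'address slots=N' line per node."""
    return "".join(f"{ip} slots={slots_per_node}\n" for ip in ips)


def mpirun_command(hostfile_path, ips, slots_per_node, job_command):
    """Build the mpirun invocation the master would execute."""
    total_processes = len(ips) * slots_per_node
    return ["mpirun", "--hostfile", hostfile_path, "-np", str(total_processes), *job_command]


# Hypothetical three-node job, seen from the master node.
env = {
    "AWS_BATCH_JOB_NODE_INDEX": "0",
    "AWS_BATCH_JOB_MAIN_NODE_INDEX": "0",
    "AWS_BATCH_JOB_NUM_NODES": "3",
}
ips = ["10.0.0.10", "10.0.0.11", "10.0.0.12"]

if node_role(env) == "master" and cluster_ready(ips, env):
    print(hostfile_text(ips, slots_per_node=4), end="")
    print(mpirun_command("/tmp/hostfile", ips, 4, ["rosetta_job"]))
```

The master only launches `mpirun` once the registered node count matches the job size, which mirrors the "enough nodes" condition described above.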

Building an ephemeral, distributed workload like this does pose one significant challenge compared to a single server: storage. To solve that, Cloud303 incorporated an EFS file system to serve as common storage. The EFS share was mounted on every worker node as a local drive, so all artifacts produced by the cluster ended up in the same place when the nodes finished processing. The master node would then compile the artifacts into a deliverable and upload it to an S3 bucket.
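The final collection step could be sketched as follows. The directory layout, file names, and the choice of a gzip tarball as the deliverable are assumptions for illustration; a temporary directory stands in for the EFS mount so the example runs anywhere.

```python
import tarfile
import tempfile
from pathlib import Path


def bundle_artifacts(shared_dir, archive_path):
    """Pack every file under shared_dir into one gzip tarball; return member names."""
    shared = Path(shared_dir)
    files = sorted(p for p in shared.rglob("*") if p.is_file())
    if not files:
        raise FileNotFoundError(f"no artifacts found under {shared}")
    names = []
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in files:
            name = path.relative_to(shared).as_posix()
            archive.add(path, arcname=name)
            names.append(name)
    return names


# A temporary directory plays the role of the shared EFS mount.
with tempfile.TemporaryDirectory() as tmp:
    shared = Path(tmp) / "efs"
    for node in ("node-0", "node-1"):
        (shared / node).mkdir(parents=True)
        (shared / node / "result.pdb").write_text("placeholder")
    members = bundle_artifacts(shared, Path(tmp) / "deliverable.tar.gz")
    # The master could then upload the archive with boto3, for example:
    #   boto3.client("s3").upload_file(archive_filename, bucket_name, object_key)
    print(members)
```

Because every node writes into the same mounted file system, the master needs no network copy step before bundling; it simply walks the shared directory.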

Customer Quotes

Working with Cloud303 and leveraging AWS resources has been a game-changer for us. The computational bottleneck was a major hurdle to our growth. Thanks to the new infrastructure, we're not only performing more experiments in a fraction of the time, but we're also focusing on what we do best: innovating in biotechnology.

Gaurab Chakrabarti, Chief Executive Officer, Solugen

Engineer Quotes

The challenge with Solugen was fascinating because it wasn't just about offloading computational work to the cloud. It was about doing so efficiently, economically, and securely, all while keeping their unique scientific requirements in mind. Utilizing AWS Batch along with OpenMPI really took their workflow to the next level.

Xhefri Toro, Principal Solutions Architect (HPC and Life Sciences), Cloud303

Outcomes

Prior to this solution, Solugen was running Rosetta on a single EC2 instance and jobs would take about two weeks to complete. Due to the massive amount of parallelization that the new solution enables, a job that took two weeks before now takes about two hours to complete. This has been an enormous benefit to their business, contributing significantly to the efficiency of their workload. In September 2021, Solugen completed a US$357 million Series C round.
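To put the headline numbers in perspective, the short calculation below converts "about two weeks" and "about two hours" into a speedup factor. Both durations are approximate figures from the text above, so the result is an order-of-magnitude estimate rather than a measured benchmark.

```python
HOURS_PER_DAY = 24

before_hours = 14 * HOURS_PER_DAY  # "about two weeks" on a single EC2 instance
after_hours = 2                    # "about two hours" with the parallel solution

speedup = before_hours / after_hours
reduction = 1 - after_hours / before_hours

print(f"{speedup:.0f}x faster, {reduction:.1%} less wall-clock time")
```

Under those assumptions the job runs roughly 168 times faster.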
