Cloud303’s goal was to move Mirvie to the cloud, using an S3 bucket for storage and AWS Batch to orchestrate the compute environment that runs the Nextflow jobs in Mirvie’s bioinformatics pipeline, and to do all of this within a HIPAA-compliant architecture so that their clients’ data is protected to the standard HIPAA requires.
Cloud303 started by taking the Nextflow scripts Mirvie was already using and moving them to the cloud to run through AWS Batch. Because the deployment aimed at lower cost as well as greater power and efficiency, Cloud303 chose to run compute tasks on spot instances rather than on-demand instances, since spot capacity is sold at a steep discount. Nextflow is designed for parallel scientific computing, but it is not normally able to cope with compute nodes appearing and disappearing, as they do when spot capacity is reclaimed. By using S3 to preserve the application’s state, Cloud303 designed a head node that can resume processes and retry jobs that were dropped. This lets Nextflow run in an environment where any node may disappear at any time, greatly increasing both its flexibility and its affordability as a parallel computing platform in the cloud.
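The head node’s internals are not described here, so the following is only a minimal sketch of the resume-and-retry idea, not Cloud303’s implementation. It assumes progress is checkpointed as a JSON document (in the real deployment this would be an object in S3; here it is abstracted behind hypothetical `load_state`/`save_state` callbacks) and that a reclaimed spot node surfaces as a hypothetical `SpotInterruption` exception. Nextflow itself also offers `-resume` and per-process `errorStrategy`/`maxRetries` settings, which serve a related purpose.

```python
import json
from typing import Callable, Dict, Optional


class SpotInterruption(Exception):
    """Hypothetical signal that a task's spot node was reclaimed mid-run."""


def run_with_resume(
    tasks: Dict[str, Callable[[], str]],
    load_state: Callable[[], Optional[str]],
    save_state: Callable[[str], None],
    max_retries: int = 3,
) -> Dict[str, str]:
    # Load previously persisted progress (e.g. a JSON object in S3), if any.
    raw = load_state()
    done: Dict[str, str] = json.loads(raw) if raw else {}
    for name, task in tasks.items():
        if name in done:
            continue  # finished in an earlier run: skip instead of recomputing
        for attempt in range(1, max_retries + 1):
            try:
                done[name] = task()
            except SpotInterruption:
                if attempt == max_retries:
                    raise  # give up; earlier successes are already checkpointed
                continue  # node vanished: retry the same task
            save_state(json.dumps(done))  # checkpoint after every success
            break
    return done
```

Because the state is saved after every completed task, a head node that is restarted from scratch picks up where the previous run stopped, and only the interrupted task is retried.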
Nodes share an EFS volume that serves as a common ephemeral data directory, although staged files are copied to local storage to improve overall speed. All data at rest (in S3, on EFS, and on local volumes) is encrypted with customer-managed KMS keys, and secrets are stored in SSM Parameter Store. Each job submission must include a text file listing the details needed to complete the job successfully. S3 event notifications watch for these submissions: when the files are uploaded, a Lambda function is triggered to configure the job and stage its data where it needs to be. Because S3 events and Lambda handle this, no persistent server has to run just to watch for new jobs, which further reduces cost.
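The text does not specify the job file’s name or format, so the sketch below assumes a hypothetical `.manifest.txt` suffix and simple `key = value` lines. It shows the first two steps such a Lambda function would take: picking the manifest out of the S3 event notification (whose `Records[].s3.bucket.name` and `Records[].s3.object.key` fields are standard, with keys URL-encoded) and parsing its contents. In a real handler the object body would be fetched with boto3’s `s3.get_object`, and the job handed to AWS Batch; those calls are omitted so the sketch runs without AWS access.

```python
from typing import Dict, Iterator, Tuple
from urllib.parse import unquote_plus

# Hypothetical naming convention for the job description file.
MANIFEST_SUFFIX = ".manifest.txt"


def manifest_objects(event: dict) -> Iterator[Tuple[str, str]]:
    """Yield (bucket, key) for each uploaded manifest in an S3 event."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.endswith(MANIFEST_SUFFIX):
            yield bucket, key


def parse_manifest(text: str) -> Dict[str, str]:
    """Parse hypothetical 'key = value' lines, ignoring blanks and # comments."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed manifest line: {line!r}")
        fields[key.strip()] = value.strip()
    return fields
```

S3 event notifications can also filter on a key prefix or suffix, so the suffix check here could instead live in the bucket’s notification configuration, keeping the function from being invoked for every data file.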
Once the job is complete, output files are stored in an S3 bucket where they can be viewed and downloaded by whoever needs them.
By using spot instances and removing the need for continuously running servers, Cloud303 created a parallel computing deployment that costs considerably less than an equivalent deployment on on-demand EC2 instances, let alone dedicated on-premises servers.
Mirvie needs robust audit tracking and logging for two reasons: its own internal requirements and HIPAA compliance. Cloud303 therefore built auditing and logging into the deployment from the start, so that everything happening in the cloud environment is recorded. AWS CloudTrail and AWS Config were implemented so that all API calls and all configuration changes are captured. These records, along with application logs, are sent to CloudWatch Logs and then to S3 for long-term storage. Logs stay in S3 Standard storage for 30 days, where they can be queried directly with Athena, and are then transitioned to S3 Glacier, where they are kept for six years in accordance with HIPAA retention requirements. The log buckets also have S3 server access logging enabled, so every request to access the log data is itself recorded.
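The retention schedule above maps naturally onto an S3 lifecycle rule. The sketch below builds a configuration in the shape accepted by boto3’s `put_bucket_lifecycle_configuration`; the `logs/` prefix and rule ID are hypothetical. Since lifecycle day counts are measured from object creation, the expiration adds the 30 Standard days to the six-year Glacier period, and it assumes 366-day years so leap days can only lengthen retention, never shorten it.

```python
from typing import Dict

STANDARD_DAYS = 30   # kept in S3 Standard so Athena can query the logs directly
RETENTION_YEARS = 6  # retention period stated in the text


def log_lifecycle_rules(prefix: str = "logs/") -> Dict:
    # Expiration counts from object creation, so add the Standard period
    # to the Glacier period; 366-day years round up to cover leap days.
    expire_after = STANDARD_DAYS + RETENTION_YEARS * 366
    return {
        "Rules": [
            {
                "ID": "hipaa-log-retention",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": STANDARD_DAYS, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": expire_after},
            }
        ]
    }
```

Applied with something like `s3.put_bucket_lifecycle_configuration(Bucket=log_bucket, LifecycleConfiguration=log_lifecycle_rules())`, where `log_bucket` is a placeholder for the actual log bucket name.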
Cloud303 also used CloudWatch Alarms to set up budget alarms, so Mirvie is notified as spending approaches its budget and does not have to keep checking the bill as the month goes on.
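As an illustration only, budget alarms of this kind can be built on the `EstimatedCharges` metric in the `AWS/Billing` namespace, which reports month-to-date charges and so can be compared directly against a monthly budget. The sketch below produces keyword arguments for boto3’s CloudWatch `put_metric_alarm`, one alarm per fraction of the budget; the budget amount, SNS topic ARN, alarm names, and the 50/80/100% thresholds are all hypothetical. Billing metrics are published in `us-east-1` and require billing alerts to be enabled on the account.

```python
from typing import Dict, List, Sequence


def billing_alarm_params(
    monthly_budget_usd: float,
    sns_topic_arn: str,
    fractions: Sequence[float] = (0.5, 0.8, 1.0),
) -> List[Dict]:
    """Build put_metric_alarm kwargs for alarms at fractions of a budget."""
    alarms = []
    for frac in fractions:
        pct = round(frac * 100)
        alarms.append({
            "AlarmName": f"estimated-charges-{pct}pct",  # hypothetical naming
            "Namespace": "AWS/Billing",
            "MetricName": "EstimatedCharges",  # month-to-date estimated charges
            "Dimensions": [{"Name": "Currency", "Value": "USD"}],
            "Statistic": "Maximum",
            "Period": 21600,  # six hours; billing data updates a few times a day
            "EvaluationPeriods": 1,
            "Threshold": monthly_budget_usd * frac,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            "AlarmActions": [sns_topic_arn],  # notify subscribers via SNS
        })
    return alarms
```

Each dictionary would be passed as `cloudwatch.put_metric_alarm(**params)`; staged thresholds give early warning well before the full budget is reached.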
CodePipeline was deployed to automate the build and release process, pushing container images of the runtime environment to Amazon Elastic Container Registry (ECR) and copying the necessary Nextflow job files to S3.
One important aspect of this build was strict version control, so that jobs could be closely tracked. Commit IDs from the source repository were used to tag builds and Nextflow code, so every file is associated with a unique identifier that can be traced. CodePipeline also supports this effort by updating parameters in SSM Parameter Store to record the current build.
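A minimal sketch of commit-based tagging, assuming a full 40-character Git commit ID (in CodeBuild this is commonly available as `CODEBUILD_RESOLVED_SOURCE_VERSION`). The registry host, repository name, `workflows/` S3 prefix, and `/pipeline/current-build` parameter name are hypothetical; the returned `ssm_put_parameter` dictionary uses the keyword arguments of boto3’s `ssm.put_parameter`.

```python
import re
from typing import Dict

# Hypothetical parameter path; the real names are not given in the text.
SSM_BUILD_PARAM = "/pipeline/current-build"


def release_metadata(commit_id: str, registry: str, repository: str) -> Dict[str, object]:
    """Derive commit-tagged artifact names from a full Git commit ID."""
    if not re.fullmatch(r"[0-9a-f]{40}", commit_id):
        raise ValueError("expected a full 40-character lowercase hex commit ID")
    tag = commit_id[:12]  # 12 hex chars is usually unique within one repository
    return {
        # Container image pushed to ECR, tagged with the commit.
        "image_uri": f"{registry}/{repository}:{tag}",
        # S3 prefix for the Nextflow job files built from the same commit.
        "workflow_prefix": f"workflows/{tag}/",
        # Keyword arguments for ssm.put_parameter, recording the current build.
        "ssm_put_parameter": {
            "Name": SSM_BUILD_PARAM,
            "Value": commit_id,
            "Type": "String",
            "Overwrite": True,
        },
    }
```

Deriving the image tag, the S3 prefix, and the SSM value from the same commit means any output can be traced back to the exact container and workflow code that produced it.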