Cloud303 Designs Computational Pipelines
For Interline's Data Storage and Processing Needs
Life Sciences Pharmaceutical Startup Utilizes Leading Cloud Provider to Develop a Modernized Solution
HIPAA Serverless Modernization
Interline is a drug discovery startup focused on systematically elucidating protein communities to identify new medicines. Their discovery platform identifies drug candidates targeting genetically validated signaling pathways and ensures that those candidates comprehensively correct dysfunctional disease networks. Advances across multiple disciplines are aligning to expose previously hidden dimensions of protein communities in their native context, and Interline is using this information to develop new medicines for genetically validated signaling pathways.
Life Science Pharmaceutical
Interline is developing a precision medicine platform to map and modulate protein communities, leveraging recent advances in genomics, proteomics, structural biology, and computational chemistry.
Interline faced several pain points as they began to build out their MVP and data lake, which centered on High Performance Computing (HPC) for genomics processing. The two biggest issues are described below.
Heterogeneous Data Sources
Interline Therapeutics' data was heterogeneous and arrived from multiple sources, and therefore in multiple formats. This made it difficult to connect and centralize the separate data sources, including genomics and LIMS data.
Proper Movement and Storage of Data
Interline was also unsure of the best way to move and store their data. They were weighing options such as maintaining multiple Amazon Simple Storage Service (S3) buckets for different types of data versus consolidating into fewer buckets for MapReduce-style processing. They were also considering a graph database such as Amazon Neptune, but were unsure whether it was the best fit for their needs.
Why Interline Chose AWS
AWS was a strong fit for Interline for a few reasons. Their data needs were initially modest but set to scale quickly, so moving to a scalable solution on AWS was ideal: it offered the ability to scale along several dimensions while allowing Interline to pay only for the capacity they actually used.
Why Interline Chose Cloud303
Cloud303 turned out to be a great fit for Interline because of their extensive experience with Health and Life Science companies. Cloud303's reputation as the number one AWS Well-Architected Review partner boded well for Interline's goal of rolling out an HPC pipeline that had to be HIPAA-compliant. Led by Cloud303 Principal Solutions Architect and Life Sciences Team Lead Tim Furlong, the team was deeply knowledgeable about the requirements for HIPAA compliance in the cloud, including how to effectively track the data required for compliance at the AWS account level and how to capture data and logs from within the application itself. All of these requirements were defined and discussed extensively during the Well-Architected Review in the Assess Phase of development.
AWS Batch, AWS Lambda, Amazon Simple Storage Service (S3), Amazon Redshift, Amazon FSx for Lustre, AWS Transfer Family, AWS CloudFormation, AWS Secrets Manager, AWS Key Management Service (KMS)
To help Interline overcome these challenges, Cloud303, an AWS Advanced Consulting Partner, was brought in to deploy their HPC pipeline on AWS. Cloud303 delivered three accounts to Interline: a Pipeline Account to process all of the data required for their project, a Storage Account to store the files and metadata needed for further studies, and an Audit Account to centralize logging for auditing purposes. Additionally, a Root Account was set up as the main AWS Organizations account from which users log in to all other accounts; it also served as the entry and exit point for all inbound and outbound internet traffic.
To handle the compute workload, Cloud303 deployed AWS Batch and used AWS Lambda functions to invoke the pipelines and process data. All files were stored in an S3 bucket with Cross-Region Replication to ensure a disaster recovery plan was in place. Amazon Redshift stored the metadata Interline needed, and Amazon FSx for Lustre served as scratch storage for the pipelines, holding temporary files and enabling parallel computing if required in the future.
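As a rough illustration of this Lambda-to-Batch pattern, a handler might look like the sketch below. The job queue, job definition, and parameter names are illustrative assumptions, not Interline's actual configuration.

```python
"""Sketch: a Lambda handler that submits an AWS Batch job when a file lands in S3.

The queue, job definition, and parameter names are assumed for illustration.
"""

def build_batch_job(bucket: str, key: str,
                    job_queue: str = "hpc-pipeline-queue",
                    job_definition: str = "genomics-pipeline:1") -> dict:
    """Build the keyword arguments for batch.submit_job() from one S3 object."""
    # Batch job names allow only letters, digits, hyphens, and underscores,
    # up to 128 characters, so sanitize the object key first.
    safe_name = "".join(c if c.isalnum() or c in "-_" else "-" for c in key)[:120]
    return {
        "jobName": f"ingest-{safe_name}",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        # The pipeline container reads its input location from these parameters.
        "parameters": {"inputBucket": bucket, "inputKey": key},
    }

def lambda_handler(event, context):
    """Entry point: one S3 PUT event record -> one Batch job submission."""
    import boto3  # provided by the AWS Lambda runtime
    batch = boto3.client("batch")
    for record in event["Records"]:
        job = build_batch_job(record["s3"]["bucket"]["name"],
                              record["s3"]["object"]["key"])
        batch.submit_job(**job)
```

Keeping the payload construction in a pure function like `build_batch_job` makes the sanitization logic testable without touching AWS.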
Cloud303 also integrated protein structure tools such as Rosetta and AlphaFold into the solution to facilitate the drug discovery process, allowing Interline to systematically identify new medicines.
For the Benchling pipeline, Cloud303 set up an event bus to push events from Benchling and trigger AWS Batch for the ingestion stage. This stage used the transaction ID from each event to query the Benchling APIs for the uploaded metadata for the new DNA sequence, the file itself, and selected data, then stored that data in an S3 bucket. Any new metadata events were also configured to trigger the AWS Lambda function that launched the AWS Batch pipeline for the processing stage. During this stage, the necessary data was pulled from S3, computation was run using the scripts provided by Interline, and the results were stored in a folder named "result" within the same S3 bucket.
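The glue logic for this processing stage can be sketched as two small helpers: one pulling the transaction ID out of the incoming event, and one mapping an input object key to its "result" location. Both the event shape and the key layout here are assumptions for illustration, not the exact structure of Benchling's events or Interline's bucket.

```python
"""Sketch: glue helpers for the processing-stage trigger.

The event shape and S3 key layout are illustrative assumptions.
"""
import posixpath

def transaction_id(event: dict) -> str:
    """Pull the Benchling transaction ID out of an EventBridge-style event
    (assumed to carry it under detail.id)."""
    return event["detail"]["id"]

def result_key(input_key: str) -> str:
    """Map an input object key to its output location: a "result" folder
    alongside the input, keeping the original file name."""
    folder, name = posixpath.split(input_key)
    return posixpath.join(folder, "result", name)
```

With this layout, results always land next to their inputs in the same bucket, so no second bucket or cross-bucket permissions are needed.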
For any other manual data ingestion, Cloud303 provided Interline with an SFTP server using AWS Transfer Family. All connections to the SFTP server were made through a VPN server deployed by Cloud303, and users could be added or removed using CloudFormation templates. All credentials were stored in Secrets Manager, encrypted with custom Key Management Service (KMS) keys provided by Cloud303 and configured with a rotation policy of 60 days. All logs produced by the SFTP server were pushed to an application logging bucket in the Audit account for retention, satisfying HIPAA compliance.
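To show how user management by template might work, the sketch below generates an `AWS::Transfer::User` resource per user, so adding or removing a user is just a template change. The server ID, role ARN, and bucket name are placeholders, not Interline's real values.

```python
"""Sketch: generating CloudFormation resources for Transfer Family SFTP users.

Server ID, role ARN, and bucket name below are placeholder assumptions.
"""
import json

def sftp_user_resource(username: str, ssh_public_key: str,
                       server_id: str = "s-EXAMPLE",
                       role_arn: str = "arn:aws:iam::123456789012:role/sftp-access",
                       bucket: str = "example-ingest-bucket") -> dict:
    """Return a CloudFormation resource mapping for one Transfer Family user."""
    # Logical IDs must be alphanumeric, so strip hyphens from the username.
    logical_id = f"SftpUser{username.title().replace('-', '')}"
    return {
        logical_id: {
            "Type": "AWS::Transfer::User",
            "Properties": {
                "ServerId": server_id,
                "UserName": username,
                "Role": role_arn,
                # Confine each user to their own prefix of the ingest bucket.
                "HomeDirectory": f"/{bucket}/{username}",
                "SshPublicKeys": [ssh_public_key],
            },
        }
    }

def render_template(users: dict) -> str:
    """Merge per-user resources into a minimal CloudFormation template body."""
    resources = {}
    for name, key in users.items():
        resources.update(sftp_user_resource(name, key))
    return json.dumps({"AWSTemplateFormatVersion": "2010-09-09",
                       "Resources": resources}, indent=2)
```

Deleting a user's entry and redeploying the stack would then remove their SFTP access, keeping the user list in version control.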
Overall, by working with Cloud303, Interline was able to develop and deploy a computational pipeline on AWS that automated their bioinformatics workloads and let them promote new pipelines as needed. By following a multi-account structure for this project, Cloud303 ensured the application could scale for any future additions Interline might require. This allowed Interline to focus on their core business of drug discovery, with the peace of mind that their data was being handled and stored compliantly and efficiently on AWS.
Phase 1: Proof of Concept (POC)
Phase 2: Jumpstart