ASPPH scales data curation for members with a data lake on AWS

The Association of Schools and Programs of Public Health (ASPPH) is a nonprofit association with a vision for improved health and well-being for everyone, everywhere. They represent schools and programs accredited by the Council on Education for Public Health (CEPH) as well as applicants for CEPH accreditation. ASPPH’s mission is to advance academic public health by mobilizing the collective power of its members to drive excellence and innovation in education, research, and practice.

While ASPPH provides many services, members consistently rank the curated data resources published on the Data Center Portal (DCP) as a top benefit. ASPPH’s technical team has built custom web applications that capture and store data in a relational database. For many years, a single full-time staffer handled both the maintenance of these systems and data curation. But as ASPPH membership grew, so did the demand for new curated datasets. They added new data sources to support dataset requests but found that the SQL queries used to analyze and integrate the data were becoming more complex and more costly to run. They realized that they needed more efficient ways to store, query, and govern their data.

To solve these growing issues, ASPPH partnered with Amazon Web Services (AWS) Professional Services (AWS ProServe) to move their curated data to a managed data lake on AWS. In this post, we share how ASPPH and AWS designed and built the data lake and the results of moving to a modern, scalable data architecture.

Solution overview

AWS ProServe met with ASPPH to understand their issues with data ingestion, storage, processing, and governance. Because ASPPH used multiple data sources and expected to onboard new sources in the future, AWS ProServe recommended a data lake architecture to store the curated data. AWS ProServe conducted detailed design and architecture reviews and used the AWS Data Lake Accelerator to create the solution shown in Figure 1.

Figure 1. Architectural diagram of the ASPPH solution described in this post. The major components are AWS Glue, Amazon Relational Database Service (Amazon RDS), Amazon Simple Storage Service (Amazon S3), and AWS CloudFormation.

Solution walkthrough

ASPPH staff found the development process simple and intuitive despite their inexperience with the AWS services that made up the solution architecture. AWS ProServe provided the support needed to jumpstart the project, including curated resources and live, guided introductions to AWS services. ASPPH IT staff set up access to the AWS services and resources that the development team needed to implement the solution. Business intelligence (BI) and data curation staff built the data ingestion, visualization, and analytics features on that foundation. ASPPH completed the transition to the new architecture in about six months.

The ASPPH DCP was previously hosted on two servers running MySQL located in an on-premises data center. The production server stored raw data from multiple sources. The analytics server stored the transformed, curated data that ASPPH provided to members as well as the metadata about the curated data. The first step in moving to the new cloud architecture was to migrate ASPPH’s production server to a highly available MySQL cluster hosted in Amazon RDS. This RDS cluster is spread across multiple AWS Availability Zones, so the database service will continue to operate in the event of a failure in a single data center.

ASPPH’s BI team developed the extract, transform, and load (ETL) pipeline that pulls data from the production server into a new data lake based on Amazon S3 using AWS Glue jobs and workflows. AWS Glue met ASPPH’s needs for schema inference and metadata storage, eliminating the analytics server entirely. Using the data lake for transformed data reduced costs and made it easier for ASPPH to add new data sources compared to their previous system based on MySQL. Now, as more sources are added to the production RDS server, ASPPH triggers AWS Glue crawlers that read the table structure and store that metadata in the AWS Glue Data Catalog, while AWS Glue jobs write the transformed data to the S3 data lake.
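As a rough illustration of the crawler step, the following Python sketch starts an AWS Glue crawler and waits for it to finish. It is not ASPPH’s code: the function name `run_crawler`, the polling interval, and the timeout are assumptions made for this example. The `glue` argument is expected to behave like a boto3 Glue client, and only its `start_crawler` and `get_crawler` calls are used.

```python
import time


def run_crawler(glue, crawler_name, poll_seconds=15, timeout_seconds=1800):
    """Start a Glue crawler and wait until it returns to the READY state.

    `glue` is any object exposing the boto3 Glue client methods
    start_crawler and get_crawler. Names and defaults are illustrative.
    """
    glue.start_crawler(Name=crawler_name)
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        crawler = glue.get_crawler(Name=crawler_name)["Crawler"]
        if crawler["State"] == "READY":
            # LastCrawl describes how the most recent run ended.
            return crawler.get("LastCrawl", {}).get("Status", "UNKNOWN")
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawler {crawler_name!r} did not finish in time")
```

With AWS credentials configured, a call might look like `run_crawler(boto3.client("glue"), "example-production-crawler")`, where the crawler name is a placeholder for a crawler that already exists in the account.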

The DCP site was built with third-party tools including Tableau. Recreating the site dashboards would have been time-consuming for ASPPH’s analytics and visualization staff. Fortunately, they were able to connect their new data lake directly to the existing dashboards using Amazon Athena, which greatly accelerated the project timeline and reduced the overall risk.
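Dashboards reach Athena through its standard drivers, but the same catalog tables can also be queried programmatically. The sketch below shows one way to run a SQL statement through Athena from Python. It is an illustration only: the function name `run_athena_query` is an assumption, and the `athena` argument is expected to behave like a boto3 Athena client.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=2):
    """Run one SQL statement in Athena and return the first page of rows.

    `athena` is any object exposing the boto3 Athena client methods
    start_query_execution, get_query_execution, and get_query_results.
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until Athena reports a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    # For SELECT statements the first row holds the column headers.
    result = athena.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
```

The database name, SQL text, and S3 output location passed to this function would all be specific to the account. None of them come from ASPPH’s environment.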

The majority of the work in transitioning to the new architecture was re-thinking how ASPPH transformed and curated data. While time-consuming, this phase of the project enabled ASPPH to reconsider what data to include in curated datasets and how to make their processes more scalable and better governed. Under the new process, ASPPH’s data curation staff uses Jupyter notebooks in AWS Glue to interactively develop data transformations. They control access to these notebooks using carefully managed AWS Identity and Access Management (IAM) accounts, roles, and policies. With this setup, their distributed team can collaboratively develop data transforms, which leads to faster delivery of new products to members and better data quality. AWS Glue integrates with GitHub, which gives ASPPH the ability to develop datasets collaboratively, centrally manage tasks and resources, and track changes to notebooks. Figure 2 shows ASPPH’s development and deployment process for the curation notebooks.
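To make the IAM part more concrete, here is a minimal sketch of a policy document that could be attached to a notebook user. It assumes the notebooks run on AWS Glue interactive sessions, which the post does not state, and the function name `notebook_session_policy` and the statement IDs are made up for this example. ASPPH’s actual policies are not published and are likely more restrictive.

```python
import json


def notebook_session_policy(glue_role_arn):
    """Build an IAM policy document for a notebook user (illustrative only).

    Assumes notebooks run on AWS Glue interactive sessions. The user may
    manage sessions and pass exactly one execution role to AWS Glue.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "UseGlueInteractiveSessions",
                "Effect": "Allow",
                "Action": [
                    "glue:CreateSession",
                    "glue:GetSession",
                    "glue:ListSessions",
                    "glue:StopSession",
                    "glue:DeleteSession",
                    "glue:RunStatement",
                    "glue:GetStatement",
                    "glue:ListStatements",
                    "glue:CancelStatement",
                ],
                # A real policy would usually narrow this to specific ARNs.
                "Resource": "*",
            },
            {
                "Sid": "PassOnlyTheCurationRole",
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": glue_role_arn,
                "Condition": {
                    "StringEquals": {"iam:PassedToService": "glue.amazonaws.com"}
                },
            },
        ],
    }


def policy_json(glue_role_arn):
    """Serialize the policy so it can be pasted into IAM or a template."""
    return json.dumps(notebook_session_policy(glue_role_arn), indent=2)
```

Restricting `iam:PassRole` to a single role is the design choice worth noting: notebook users can only run sessions with the permissions that role carries, regardless of what other roles exist in the account.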

Figure 2. Git-based workflow for developing and deploying data curation notebooks. The major components are AWS Glue Studio, Amazon S3, and Amazon Athena.

When data transform notebooks are ready to deploy, ASPPH staff create a CloudFormation template in YAML based on an easily modifiable example that was provided by AWS ProServe. The resulting CloudFormation stack sets up an AWS Glue workflow that transforms the data based on the notebook code, sends the data to an Amazon S3 bucket for temporary storage, then starts an AWS Glue crawler to map the Amazon S3 objects into the Data Catalog. Once the crawler is finished, the data is immediately available in Athena for ad hoc analysis by the ASPPH team. AWS Glue triggers establish connections between the new datasets and the tools that ASPPH uses to deliver the data to members.
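The deployment described above can be sketched as a small template generator. The example below is not the template AWS ProServe provided. It builds a comparable CloudFormation template as a Python dictionary and serializes it to JSON, which CloudFormation accepts alongside YAML. The function name, resource names, Glue version, and the `--output_path` job argument are all hypothetical. The resource types are the standard `AWS::Glue::Workflow`, `AWS::Glue::Job`, `AWS::Glue::Crawler`, and `AWS::Glue::Trigger`.

```python
import json


def build_curation_template(dataset, script_s3_path, output_s3_path,
                            role_arn, database):
    """Return a CloudFormation template (as a dict) for one curated dataset.

    Workflow: an on-demand trigger runs the transform job, and a conditional
    trigger starts the crawler once the job succeeds. Names are illustrative.
    """
    workflow = f"{dataset}-curation"
    job = f"{dataset}-transform"
    crawler = f"{dataset}-crawler"
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Glue workflow that curates the {dataset} dataset",
        "Resources": {
            "Workflow": {
                "Type": "AWS::Glue::Workflow",
                "Properties": {"Name": workflow},
            },
            "TransformJob": {
                "Type": "AWS::Glue::Job",
                "Properties": {
                    "Name": job,
                    "Role": role_arn,
                    "GlueVersion": "4.0",  # assumed version
                    "Command": {
                        "Name": "glueetl",
                        "ScriptLocation": script_s3_path,
                        "PythonVersion": "3",
                    },
                    # Hypothetical argument that the script would read.
                    "DefaultArguments": {"--output_path": output_s3_path},
                },
            },
            "OutputCrawler": {
                "Type": "AWS::Glue::Crawler",
                "Properties": {
                    "Name": crawler,
                    "Role": role_arn,
                    "DatabaseName": database,
                    "Targets": {"S3Targets": [{"Path": output_s3_path}]},
                },
            },
            "StartTrigger": {
                "Type": "AWS::Glue::Trigger",
                "DependsOn": ["Workflow", "TransformJob"],
                "Properties": {
                    "Name": f"{workflow}-start",
                    "Type": "ON_DEMAND",
                    "WorkflowName": workflow,
                    "Actions": [{"JobName": job}],
                },
            },
            "CrawlTrigger": {
                "Type": "AWS::Glue::Trigger",
                "DependsOn": ["Workflow", "TransformJob", "OutputCrawler"],
                "Properties": {
                    "Name": f"{workflow}-crawl",
                    "Type": "CONDITIONAL",
                    "StartOnCreation": True,
                    "WorkflowName": workflow,
                    "Predicate": {"Conditions": [{
                        "LogicalOperator": "EQUALS",
                        "JobName": job,
                        "State": "SUCCEEDED",
                    }]},
                    "Actions": [{"CrawlerName": crawler}],
                },
            },
        },
    }


def render_template(**kwargs):
    """Serialize the template to a JSON string ready for deployment."""
    return json.dumps(build_curation_template(**kwargs), indent=2)
```

Generating one template per dataset from a shared function mirrors the idea of starting from a modifiable example: only the dataset name, script location, and output path change between deployments.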

Results

ASPPH staff saw an immediate improvement in their productivity after transitioning to the data lake architecture on AWS. Curation work that used to take 6 hours or more per month has been reduced to less than 1 hour per month. The AWS infrastructure also gives them greater flexibility when allocating staffing resources because their team can collaborate more easily thanks to the global availability of AWS tools and strong governance policies. ASPPH makes better use of staff talent by distributing work to more people and allowing them to collaborate to solve problems and develop solutions. The new curation processes improved the accuracy of many of their datasets, including student pipeline forecasting, which helps universities plan for the size of incoming classes.

Conclusion

ASPPH’s data strategy will continue to improve with the help of AWS services and infrastructure. The association is curating more data and providing members with more information at a lower cost than before. They plan to use IAM to securely expand members’ access to the data in their Amazon S3 data lake. This will allow members to become a bigger part of mission fulfillment and will ultimately increase ASPPH’s contribution to public health.

AWS data lakes and the services used to build and maintain them are for organizations of all sizes that believe data is one of their most valued strategic assets. Visit Data Lakes on AWS to learn more.

AWS Public Sector Blog
