How researchers can meet new open data policies for federally-funded research with AWS

In 2022, an order from the White House’s Office of Science and Technology Policy introduced new data sharing requirements for federally-funded research. Researchers across academic institutions, nonprofits, and federal agencies themselves will be expected to make research publications and underlying data accessible to the public at no cost, immediately upon publication. This new model gives researchers direct and equitable access to the raw data behind scientific publications, maximizing data reusability as well as experimental reproducibility.

Under these new requirements, agencies must update their public access policies as soon as possible—no later than the end of 2025—and must achieve full implementation of the public access requirements by 2027. So how should researchers prepare to comply with these new data sharing policies? Learn how federal agencies are enacting these public access policies, and how you can use Amazon Web Services (AWS) to prepare your research to meet these new data management and sharing requirements.

New public access policies push federal agencies towards open science

Federal agencies across all domains of research are moving towards open science. Open access data and software prime researchers to pick up where other scientists left off. The United States Department of Agriculture, Department of Defense, Department of Energy, and the National Science Foundation have all taken an open data stance.

Although requiring researchers to submit data management and sharing plans does not make open access the default, it does require them to consider and strive towards open science from the beginning of a project. Such an approach can help researchers design their data collection systems for public access at the outset. This model also makes sure that researchers address the privacy, legal, and ethical considerations in making certain data public, especially given the sensitive data often used in biomedical and human subject research studies.

“The new mandate will facilitate the necessary cultural change in biomedical research that patients and families deserve,” says Ashwini Davison, MD, healthcare executive advisor of academic medical centers at AWS. “Academic medicine has been grappling with a reproducibility crisis, but the enhanced data sharing and management plan requirements are a step in the right direction towards a future of truly open science.”

How AWS can help researchers align with new public access policies

As these and future data sharing policies take effect, researchers must consider how they can meet these new requirements, broaden access to their findings, and equip the next generation of researchers with better data to improve the health of the public. Researchers can use AWS to meet this challenge and design data architectures that optimize research abilities while supporting secure and cost-effective access to data.

Making your artifacts findable, accessible, and reusable with AWS

The AWS Data Exchange public data catalog lists over 3,000 data products that are available on a subscription basis. Publishing on AWS Data Exchange is self-service. Once you have registered as an AWS Marketplace seller, you can provide a managed subscription experience for your data users. Data users who subscribe to your data product receive a copy of the data product in their own Amazon Simple Storage Service (Amazon S3) bucket. At re:Invent 2022, AWS also announced AWS Data Exchange for Amazon S3. This feature serves subscribers who want to use third-party data files for analysis with AWS services without creating or managing data copies, and data providers who want to offer in-place access to data hosted in their Amazon S3 buckets. AWS Data Exchange also supports tables (AWS Data Exchange for Amazon Redshift) and APIs (AWS Data Exchange for APIs).

The Registry of Open Data on AWS is an open-source digital catalog that helps data providers keep their data findable and accessible. Currently, the registry lists 390 datasets spanning the geospatial sciences, climate, weather, sustainability, healthcare, machine learning, and life sciences. Users can search for datasets by keyword or by data provider, and are pointed directly to the resource and the mechanism by which they can access the dataset. Nearly all of the datasets on the Registry of Open Data on AWS are distributed using Amazon S3. Data users can access the data in place via AWS native APIs, often without needing an AWS account. To list your dataset on the Registry of Open Data on AWS, make a pull request to the GitHub repository.
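As a rough illustration of in-place access without an AWS account, the sketch below lists objects in a publicly readable Amazon S3 bucket using an unsigned request to the S3 REST API (ListObjectsV2) and only the Python standard library. The bucket name, Region, and prefix in the usage note are hypothetical placeholders, and the sketch assumes the data provider allows anonymous listing, which not every public dataset does.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by Amazon S3 REST API responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def build_list_url(bucket, region, prefix="", max_keys=10):
    """Build an unsigned ListObjectsV2 URL for a publicly readable bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.{region}.amazonaws.com/?{query}"


def parse_listing(xml_data):
    """Return (key, size_in_bytes) pairs from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_data)
    return [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]


def list_public_objects(bucket, region, prefix="", max_keys=10):
    """Fetch and parse one page of a public bucket listing (needs network access)."""
    url = build_list_url(bucket, region, prefix, max_keys)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_listing(response.read())
```

For a dataset whose registry entry points to a bucket named, say, example-open-data-bucket in us-east-1 (a made-up name), calling list_public_objects("example-open-data-bucket", "us-east-1", prefix="data/") would return the first few object keys and sizes. Each object could then be downloaded over HTTPS from the same endpoint.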

Findability, accessibility, and reusability don’t stop at data. As NASA’s directive indicates, technical research artifacts include software and workflows. The Amazon Elastic Container Registry (Amazon ECR) Public Gallery lets you list and search for public container artifacts. All AWS users get 50 GB of public storage in Amazon ECR every month at no cost, and the Amazon ECR Public Gallery is available for anyone to browse at no cost, even if you do not have an AWS account. Learn more about how to get started with Amazon ECR Public Gallery.

Supporting data interoperability with AWS

Using datasets in concert with other datasets can be a major challenge. AWS offers high-level guidance and blueprints that can help users deploy AWS services to aggregate, manage, and integrate different data sources (in a data lake, for example), with well-architected guidance for multi-modal and multi-omic data.

Several AWS analytics services have features that let you work with datasets in different formats. For example, Amazon Athena Federated Query lets you query across datasets in different formats and databases. AWS Glue custom connectors let you transfer data from applications and custom data sources to your data lake in Amazon S3. The Amazon HealthLake suite of services lets you analyze imaging, structured, and unstructured health data in a HIPAA-eligible environment.
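To make the federated query idea concrete, here is a minimal sketch of how a single Athena query might join a table in an Amazon S3 data lake with a table in an external database exposed through a federated data source. Every catalog, database, table, and column name is hypothetical, as is the results bucket. The sketch assumes a federated data source has already been registered in Athena under the name clinical_pg and that the third-party boto3 library is installed for the optional submit step.

```python
# Hypothetical query: "awsdatacatalog" is Athena's default catalog for
# S3-backed tables; "clinical_pg" is an assumed name for a federated
# data source registered in Athena.
EXAMPLE_SQL = """
SELECT s.sample_id, s.assay, p.cohort
FROM "awsdatacatalog"."research_lake"."samples" AS s
JOIN "clinical_pg"."public"."participants" AS p
  ON s.participant_id = p.participant_id
LIMIT 10
""".strip()


def build_query_request(sql, output_location, catalog="awsdatacatalog",
                        database="default", workgroup="primary"):
    """Assemble keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
        "WorkGroup": workgroup,
    }


def submit_query(request, region_name):
    """Submit the query and return its execution ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("athena", region_name=region_name)
    return client.start_query_execution(**request)["QueryExecutionId"]
```

A request could be built with build_query_request(EXAMPLE_SQL, "s3://example-athena-results/") and passed to submit_query along with a Region. Keeping the request builder separate from the submit step lets you inspect or log the query before anything runs against your account.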

For research data that may be decentralized, AWS services also support data mesh principles to help customers find, aggregate, and analyze data. AWS customers have used AWS Lake Formation, AWS Glue, and Amazon HealthLake as data platforms built on data mesh principles. Customers who regularly use data from many third-party sources can also use AWS Data Exchange as the basis for a data mesh. At re:Invent 2022, AWS also announced Amazon DataZone to help customers share, search, and discover data at scale across organizational boundaries.

Accelerating public access policy compliance with AWS support services, programs, and partners

AWS offers technical and strategic support to navigate the many options available to researchers to comply with data sharing policies. AWS Professional Services (AWS ProServe) supplements your team with specialized skills and experience to help you build and implement the right data solution for your organization. Through the Data Driven Everything program, AWS works with customers to move faster and with greater precision using a Working Backwards approach to address people, process, and technology-related considerations. The AWS Data Lab offers accelerated, joint engineering engagements between customers and AWS technical resources to create tangible deliverables that accelerate data, analytics, artificial intelligence (AI) and machine learning (ML), serverless, and containers modernization initiatives.

The AWS Open Data Sponsorship Program (ODP) covers the cost of storage for publicly available high-value cloud-optimized datasets. All datasets sponsored by the Open Data Program are listed on the Registry of Open Data on AWS and the AWS Data Exchange, helping you to keep the cost of your shared data low while optimizing findability. The Open Data Program is managed by the open data team, who are experts in best practices for highly distributed datasets.

Researchers can also work with the robust AWS Partner community, which can help you build any data and analytics application in the cloud. Find an AWS data and analytics partner here.

Learn more about AWS for open data

These new public access policies offer opportunities for researchers and higher education institutions. Once implemented, we believe that they will accelerate the rate of scientific discovery and innovation, all while saving the research community time and money by ensuring maximal discoverability and reuse of data.

Do you have questions about how to use AWS to optimize your research for open science? Reach out to your AWS account team, or contact us to learn more.

AWS Public Sector Blog