AWS Public Sector Blog

Designing an educational big data analysis architecture with AWS

Educational institutions create a significant amount of data across administrative systems, learners’ educational journeys through the institution, and more. This data can largely be classified into two forms. The first is “teaching-learning” data generated from teacher-student interactions. It includes semi-structured data, such as student and teacher evaluation data and schemas, eLearning content, eBooks, teachers’ guides, and eTextbooks, as well as unstructured data, such as emails, blogs, social networking services (SNS), educational entertainment, digital textbooks, and image, video, and audio data. The second is “non-teaching-learning” data: structured educational administration data that can be saved into fixed fields, such as information about the school’s current state of affairs, teacher and student information, personnel and payroll information, school administration information, health and meal information, and academic achievements.

So how can educational institutions and education technology companies (EdTechs) make the most of all this data?

Graph databases help manage this kind of big data because they organize complex data points according to the relationships between them. Graph analytics offers an alternative to the traditional data warehouse model: a framework that absorbs both structured and unstructured data from various sources and lets analysts probe the data, and the relationships within it, in an undirected manner.

In this blog post, I present a design for a high-level architecture, built on Amazon Web Services (AWS), that uses a graph database to analyse unstructured and structured educational data that can, for example, help inform a recommendation to a student for the appropriate courses to take in their next semester based on multiple personalized data factors.

Designing an educational big data analysis architecture on AWS

Figure 1. The high-level architecture of an example solution that uses AWS to analyse structured and unstructured educational data, which can support a use case like providing a personalized recommendation to a learner on which courses they should take in their next semester.


Educational institutions and EdTechs can design a big data platform to analyse various data attributes of academic information. For this example solution, we propose an architecture design for an academic information system that provides specialized services such as analysing patterns of students’ various learning information using an Amazon Neptune knowledge graph and predicting and recommending the courses to be taken in the next semester. The Neptune knowledge graph can be built from an existing AWS data warehousing ecosystem, in this case Amazon Redshift, and it can be used to integrate strategic relationships into the knowledge graph so that we can extend its use to serve as a recommendation system.
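To make the recommendation idea concrete, here is a minimal sketch in plain Python of the kind of relationship-based query a knowledge graph supports: "which courses did students who share courses with me take that I have not?" The student names, course names, and the `TOOK` edge list are hypothetical sample data, and the in-memory dictionaries only stand in for a graph database such as Neptune. This is an illustration of the pattern, not the query used in the architecture.

```python
from collections import Counter

# Hypothetical "student -took-> course" edges. In the architecture described
# here, these relationships would live in the Neptune knowledge graph.
TOOK = [
    ("alice", "algebra"), ("alice", "statistics"),
    ("bob", "algebra"), ("bob", "statistics"), ("bob", "databases"),
    ("carol", "algebra"), ("carol", "machine-learning"),
    ("dave", "statistics"), ("dave", "databases"),
]


def build_index(edges):
    """Index the edge list in both directions for fast neighbour lookups."""
    courses_by_student = {}
    students_by_course = {}
    for student, course in edges:
        courses_by_student.setdefault(student, set()).add(course)
        students_by_course.setdefault(course, set()).add(student)
    return courses_by_student, students_by_course


def recommend_courses(student, edges, top_n=3):
    """Rank courses the student has not taken by how often peers took them.

    A peer is any other student who shares at least one course with the
    student; a peer who shares several courses counts once per shared course.
    """
    courses_by_student, students_by_course = build_index(edges)
    taken = courses_by_student.get(student, set())
    scores = Counter()
    for course in taken:
        for peer in students_by_course[course]:
            if peer == student:
                continue
            for candidate in courses_by_student[peer] - taken:
                scores[candidate] += 1
    # Sort by score (highest first), then by name for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [course for course, _ in ranked[:top_n]]


if __name__ == "__main__":
    print(recommend_courses("alice", TOOK))
```

In a graph database the same walk (student to courses, courses to peers, peers to their other courses) is expressed as a traversal over stored edges rather than as nested loops, which is why the relationship-centric model suits this use case.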

High-level overview for the educational big data analysis architecture

1. Student/educational data from various data sources are delivered into Amazon EMR—an industry-leading cloud big data solution for petabyte-scale data processing, interactive analytics, and machine learning—to draw useful insights or values from the educational and student data.

2. Data from Amazon EMR is copied to Amazon Redshift. Amazon Redshift SQL helps parse and understand the data patterns and associated links between the different sources of student and educational data. For example, queries can determine the correlation between online behaviour and learning styles, or between online behaviour and student results.

3. Business state entities and their associated relationships must be defined in an entity relationship (ER) diagram. A graph model is then created that identifies vertices, edges, and properties from the ER diagram.

4. Neptune load files are generated on Amazon Simple Storage Service (Amazon S3) from Amazon Redshift.

5. The Neptune bulk loader loads the connected graph data from Amazon S3 into Neptune to build a knowledge graph. The resulting graph can be used to identify patterns in, for example, lecture data, achievement data, and evaluation data from the courses that students have taken in a given year and semester. These patterns include students’ preferred lectures, and based on them, educators can implement a recommendation system that suggests appropriate lectures to students who have not yet taken them.

6. From the front end of the application, a user, like an educator or student, can sign in to Amazon QuickSight and access the analysis or dashboard connected to an Amazon Athena dataset. Using a business intelligence tool like Amazon QuickSight is suggested to help equip educators, regardless of their technical experience, with information to support decision-making and respond to the learning styles of their students.

7. Amazon QuickSight sends the associated SQL query to Amazon Athena, a serverless, interactive analytics service.

8. The Amazon Athena Federated Query functionality calls an Amazon Athena Neptune connector Lambda function to fetch the data from Neptune. Because the schema or properties of graph nodes and edges may vary, the database and the table schemas must be predefined. This example architecture uses AWS Glue Data Catalog to predefine them, and the connector Lambda function retrieves the details from the Data Catalog.

9. The Amazon Athena Neptune connector Lambda function converts the incoming SQL query into a Gremlin traversal and sends the query to Neptune. It then converts the graph traversal output into a SQL result set by mapping it to the predefined schema in the Data Catalog, and returns that to Amazon Athena and thereby to Amazon QuickSight, which delivers a data visualization to the user. In this use case, the visualization shows recommendations for course selection.
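Steps 3 to 5 above can be sketched as follows. This is a minimal example, assuming the rows have already been exported from the warehouse as Python dictionaries; it writes them in the Neptune Gremlin bulk-load CSV layout (`~id`, `~label` for vertices; `~id`, `~from`, `~to`, `~label` for edges; typed property columns such as `name:String`) and builds the JSON body for a request to the Neptune loader endpoint. The table and column names (`student_id`, `course_id`, `grade`), the `took` edge label, the ID scheme, and the bucket, role, and region values are all hypothetical and chosen only for illustration.

```python
import csv
import io
import json


def vertex_csv(rows, label, id_column, property_types):
    """Render rows as a Neptune Gremlin vertex load file.

    property_types maps a source column name to a Neptune type
    (for example "String", "Int", "Double").
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(
        ["~id", "~label"] + [f"{name}:{typ}" for name, typ in property_types.items()]
    )
    for row in rows:
        # Hypothetical ID scheme: prefix the key with the label to keep IDs unique.
        vertex_id = f"{label}-{row[id_column]}"
        writer.writerow([vertex_id, label] + [row[name] for name in property_types])
    return buf.getvalue()


def edge_csv(rows, label, from_label, from_column, to_label, to_column, property_types):
    """Render join-table rows as a Neptune Gremlin edge load file."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(
        ["~id", "~from", "~to", "~label"]
        + [f"{name}:{typ}" for name, typ in property_types.items()]
    )
    for row in rows:
        src = f"{from_label}-{row[from_column]}"
        dst = f"{to_label}-{row[to_column]}"
        writer.writerow(
            [f"{src}-{label}-{dst}", src, dst, label]
            + [row[name] for name in property_types]
        )
    return buf.getvalue()


def loader_request(s3_prefix, iam_role_arn, region):
    """Build the JSON body for a POST to the Neptune loader endpoint (/loader)."""
    return {
        "source": s3_prefix,
        "format": "csv",
        "iamRoleArn": iam_role_arn,
        "region": region,
        "failOnError": "FALSE",
        "parallelism": "MEDIUM",
    }


if __name__ == "__main__":
    students = [{"student_id": 1, "name": "Alice"}]
    enrolments = [{"student_id": 1, "course_id": "MATH101", "grade": 3.7}]
    print(vertex_csv(students, "student", "student_id", {"name": "String"}))
    print(edge_csv(enrolments, "took", "student", "student_id",
                   "course", "course_id", {"grade": "Double"}))
    # Placeholder values only; substitute a real bucket, role, and region.
    print(json.dumps(loader_request(
        "s3://example-bucket/neptune-load/",
        "arn:aws:iam::123456789012:role/ExampleNeptuneLoadRole",
        "us-east-1"), indent=2))
```

In practice the files would be written to Amazon S3 (for example, by unloading query results from Amazon Redshift) and the request body would be posted to the cluster's loader endpoint from a host that can reach Neptune. The sketch stops short of both calls so that it runs without AWS credentials.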

Learn more

Graph database solutions are increasingly becoming a necessity for educators performing analytics to obtain a unique advantage in the types of analyses they can perform. Having a solid data warehousing model and an understanding of the specific types of questions that educational institutions and EdTechs want to answer using graph algorithms can provide a framework to design a purposeful graph database model. Migrating a relationship from a relational data model to a graph data model can help solve very specific questions related to student behaviour and education recommendations. Graph databases can leverage the relationships between structured and unstructured data to help provide a more personalized and enhanced learning experience for students.

Learn more about Amazon Neptune and graph databases on AWS. Plus, discover how educators, EdTechs, and institutions use AWS to support student success and more at the AWS Cloud Computing for Education hub.
