CarahCast: Podcasts on Technology in the Public Sector

Conquering Data Warehouse Challenges in Higher Education with Snowflake

Episode Summary

Learn how Auburn changed the "data mindset" across campus while building a solid foundation and a repeatable process to scale its data warehouse as its needs continue to grow. Hear from the core project team about the best practices to make data warehouse projects successful and how Snowflake enables a centralized and secure single source of truth for all data.

Episode Transcription

On behalf of Snowflake and Carahsoft, we would like to welcome you to today's podcast, focused on conquering data warehouse challenges in higher education, where you'll hear data experts from Snowflake, Auburn University, and Apex IT share a compelling story about how they came together to create an innovative approach to accomplishing Auburn's strategic vision around data.

 

Nicholas Speece 

Hello, everyone. Welcome, welcome, welcome. During the next hour, we're going to have a fantastic conversation about vision, innovation, and technology in the higher ed space. Joining me today from Auburn University: data modeler and all-around great guy, Mr. Tyler Crawford. Tyler, welcome to the show.

 

Tyler Crawford 

Yeah, it's good to be here.

 

Tyler Crawford 

I enjoy talking about this stuff. As you know, talking is not one of my problems; usually the problem is talking too much. But I'm glad to be here. The idea of wrangling data, data warehousing, all of this, is very relevant in all industries, and particularly in higher ed right now. So I'm very excited about this call.

 

 

Nicholas Speece 

Absolutely. And you'll have plenty of chances to talk over the next hour, I promise you that. Also joining me, from Apex IT, Mr. Brian White. Brian, welcome.

 

Brian White 

Hey, thanks, Nick. I am the engagement director and senior consultant on the Auburn project, and I have thoroughly enjoyed it. You mentioned vision. When I came in, they already had the vision; I didn't have to give it to them, and that's not my job anyway. So I've just been excited to be a part of it.

 

Nicholas Speece 

Outstanding. And we'll have plenty of opportunity to talk about your contributions and the great work you all did during the project. And finally, last but not least, joining me from my home company of Snowflake, Mr. Dan Sandler. Dan, welcome to the show.

 

Dan Sandler 

Thank you, Nick. And thank you, everyone, for joining; I look forward to taking part in this conversation. I'm a Principal Solutions Architect with Snowflake, and I had the privilege and honor of working with Auburn University as they set out on their Snowflake journey.

 

Nicholas Speece 

Outstanding. So we mentioned vision; let's come back and talk a little bit about vision and the opening parts of what eventually became One Auburn. Tyler, I'm going to turn to you to start us off, because it seems like a lot of vision is in response to a problem: something is broken, I want to fix it, and I'm going to do that in a big, bodacious, grand way. Right? Is that how things started at Auburn?

 

Tyler Crawford 

Yeah, I would say so. I do want to mention Chuck; we're thinking of and praying for him and his family today. What I'll say is this: I've been here at Auburn for about two years, and a lot of where this came from was the top, from executive sponsorship, in terms of the vision. If we didn't have that, it just wouldn't have worked. You talked about problems, and I'm sure the people on this call understand that there are going to be problems; there's resistance. But there was a vision. By the time I was hired, coming on board was actually part of that vision: that data is a big deal. Trying to govern data, trying to find a way to massage it, to prepare it, to get it ready for consumption: that's a big deal in all industries, but in higher ed particularly. And like the public sector generally, what I learned when I came from a different industry into higher ed is that things move a little slower. That's one of the benefits at times; it's a really nice pace, and I love it. But there are a lot of other industries that are further along. Our CIO had that vision you're talking about, and what became apparent was that the vision really was about enterprise centralization. We talked a lot about data governance early on, and we realized it was even bigger than that, because when you talk about enterprise project management, enterprise security modeling, systems architecture, the more we got into it, the more we realized it was really about bringing silos together. In terms of the project we're going to discuss today, the data warehouse was a huge part of that centralization effort. And that's really what became this One Auburn brand: centralization, bringing us all together with one purpose and goal.

 

Nicholas Speece 

That's very interesting. So you went into this project saying, our main problem right now is that our data is all over the place, and the leadership team brings you in. What was the first thing you decided to tackle? Did you say, I need a data warehouse? Or did you say, I need a connector, I've got to hire some developers? Where do you start something so big?

 

Tyler Crawford 

Really, one of the first things that I did, before I ever got into the technical weeds: my first year was mostly getting around campus and talking to the business community, the actual end users, trying to understand their problems. And it was from that that you start to see why that vision exists; you see what the problems are. That really is where the articulation of "we need a data warehouse" came from. We'll talk more about the data warehouse and the project, but that's really where it came from: we were coming to understand these problems. I think probably anybody on this call, no matter what industry you're in, has been part of an organization where you can ask the same question and three or four different groups rush to answer it. And then it's embarrassing when there are four different answers for some executive, right? And they're like, why does this happen? It's a simple question: how many students do we have? That's one of those things, and it was great for me to understand that. A lot of my own processing of this was learning what I don't know; I didn't know what I didn't know. And along the way, you're growing into it and understanding, wow, there's a lot I have to learn. So all of that together: we knew we needed a data warehouse, and we just started going down that path and exploring it with this idea of centralization, trying to simplify the distribution of our data.

 

Nicholas Speece 

Outstanding. So I want to pivot over to Brian for a second, because Apex IT got brought in on this project to help facilitate a lot of the construction. Brian, talk us through how you got brought into this project and your first impressions of it. And, more to the point for the folks at home: what scared you about a project like this?

 

Brian White 

So here's the thing. I want to give you my first impression of Auburn and the whole project, and it is this: I walked into the first couple of meetings, and Zane Everett,

 

Brian White 

the project manager, brought banana bread, fresh homemade banana bread, and homemade pecan pie. And I thought, well, you know, this might work out. So that was the first impression, and it was a great impression; from there, only greatness ensued. Auburn brought me, brought Apex IT, in to do a couple of things: one, business process modeling, and two, my part, which was data governance, data quality, and metadata management workshops. That was my initial in with Auburn: to go through a series of workshops and say, hey, let's take a look at your data quality, let's profile your data and see what's there; what kind of governance do you have in place, and what do you need in place? Finding out a lot of those things. So that was my role. And then somewhere in there, someone said, oh, and we're already in negotiations with Snowflake for this data warehouse.

 

Brian White 

I usually take a system-agnostic approach when I come in, because everybody's got their own tools. So I quickly said, well, what is this Snowflake? Then I started digging into it and learned something about it. Usually, when I come into a project like this, I have multiple concerns, and it's not just about the banana nut bread and the pecan pie. The first big concern is the state of the data. I don't care how cool your technology is; if your data is a mess, you have a problem. You take dirty data, put it in a brand new system, and guess what: you have what you had before. So, is the data dirty? How did it get that way? I'm usually very concerned about that. With Auburn, what we found was that there were some problems, but it wasn't Armageddon; it wasn't overwhelming. The next thing I look for is executive buy-in and leadership, and Tyler has already mentioned this. It doesn't have to be a top-down thing, but it has to be led: cast the vision and offer the support to get there. A lot of times you try to do this as grassroots, and it just doesn't get legs and stay viable. Auburn had that in place. And the third thing is, anytime there's an ERP- or SIS-centric enterprise, people say a kind of weird thing. They say everything's on fire, we're all going to die, the data is terrible, and so is the SIS. And oh, by the way, the only way this new thing will work is if the data is exactly like the SIS. It happens every time.

 

Nicholas Speece 

Brian, just for the folks that don't know what an SIS is, can you real quick define what that means?

 

Brian White 

It is the all-encompassing, hungry beast that is the enterprise system; everything goes through it. It's your transactional system, and it also includes all of your master data: you have the financial information, the course registration information, everything. When you have an institution with a big enterprise system like that, everybody feeds the beast daily in their jobs. But people say they have problems, and usually that's when I get called in, when things are really messed up. And the last thing I look for, the other challenge I'm afraid of when I walk into a project, is the feudal lords running their fiefdoms, the little silos behind stone walls: everything I do is great, you stay out of here, that sort of thing. So I look for those. Well, because of the groundwork, and because of the enterprise leadership that Auburn had already put in place, I didn't run into a lot of that. There are still turf wars, but it's not like someone said, no, I shall go my own way, thank you very much. So the concerns were there, but Auburn had already put things in place, and I was very happy about that.

 

Nicholas Speece 

And it wasn't just that; we've actually had some folks in the chat talking about this. And please, attendees, keep the chat coming; it's been a great dialogue so far. We had questions like, is it just ERP and SIS? We're going to get into that in a little bit; there are a lot of components at Auburn that started to come together very quickly, so stay tuned, we have more coming on that. And along those lines, coming off of what we said with Brian: Dan, I want to come over to you for a second. I'm a Snowflake, you're a Snowflake; we come from a company where we try to envision the future of data, how it's arranged, how it's shared, and how it's warehoused. The Data Cloud has been a huge mission for us and a huge talking point. When you come into an organization like Auburn and you build what you've built here, can you talk about what it's like to come into a situation like this and start building something from the ground up that looks like everything we always tell people to go build?

 

Dan Sandler 

Yes, great question, Nick. First, I want to take a moment to thank Auburn for being a great customer; all of our customers are great, Auburn in particular. Special thanks to Apex for being a great partner, thanks to everyone in attendance, and most importantly, condolences to Chuck and his family. I've been working in and around data for 25 years, and I've learned one thing: there are always unknowns when you start a project of this nature. I think Tyler and Brian did a very good job of highlighting those things. You always know there's some anxiety, and your radar always has to be up and running, and there were some wrinkles when we started this initiative. One thing that was eye-opening, but that also helped remind me of the importance of an SIS or an ERP, was the role that Banner plays. Banner, from what I've learned along the way, makes the wheels go round and round. The solution has to work for the data coming out of Banner, the data has to make sense as it comes from Banner, and we cannot impact Banner. So we had to be very conscious about building on-ramps from Banner into Snowflake, but not have it be a one-off; it can't be spoke-to-spoke. The vision was very clear: we are building a hub. We want to make sure it works for Banner in the short term, but down the road there are going to be other on-ramps we need to build. So we had to make sure there was a clear architectural vision; that was clearly entrenched, and it was just a matter of executing and realizing that vision. There were things that slowed us along the way, like RabbitMQ; RabbitMQ was one of those things we had to use to shovel the Banner messages into Snowflake, and somehow it all came together. It had to be a matter of getting that data out, because those horses had never left the barn that far, right? You have to get the horses to leave the barn, but in a way that they don't go too far. That was a key thing, and it was very clear: it had to be secure, it had to be governed, there had to be control. We are dealing with sensitive data, and it is our mission to not just break down those silos, but to do it in a secure and governed fashion. There was a concerted effort, and it was something I appreciated early on: the team really engaged, whether it was the Office of the Registrar or the Provost, making sure that data was classified and cared for, all while we were doing the engineering and the design to get this off the ground. It was a very impressive thing that I certainly took back and took note of.

 

Nicholas Speece 

Outstanding. So I want to talk about the technology, but there's one more thing I want to hit on first, and it's the quality of the team and the people that build these kinds of solutions: having the right people in place, having the right people on the team to do their individual tasks. So Tyler, if I could come to you for just a second: Brian came on, Dan came on, and I know there's a much larger team at Auburn working on this project. Is the team more important than the goal? Is the goal more important than the team? Give us some feedback on how that played out at Auburn.

 

Tyler Crawford 

Yeah, I can see, like you said, some activity in the chat, people getting excited about the technical weeds, and like you said, we're going to get there. Interestingly, everything we're talking about in this project is highly technical; a lot of it was me and Dan just sitting in conference rooms, or sometimes on a Zoom call, getting into code. But you have to start with the stuff we're talking about: you have to start with governance, you have to start with executive sponsorship. And you have to make sure you have not just the right number of people in the room, although you do have to make sure you have enough man-hours; you need to make sure you have the right people. Early on it was just me working in a conference room; when we did our POC, Nick, it was just us. So a lot of people looked at it early on like, oh, this is kind of just a Tyler thing that he's working on. But I always tell people that the success of this happened when we got the right people. It wasn't a lot of people, but it was the right people. And remarkably, that came together so that we did have the right kind of architects. For systems architecture, we brought in somebody who was able to help us figure out which combination of existing tools we could use to do what we were planning. And another thing we're going to talk about is what we've learned since the initial goal of a data warehouse. What about a data lake? A lot of people will pit those things against each other, but they don't have to be pitted against each other. What we're finding out now, and again, this was all learned by having the right people, the right team, is that we learned things along the way by having it scoped out very small: let's just do this right. Let's get our SIS; you heard Banner mentioned, that's an Ellucian product, our student information system. We just wanted to know: can we do that? Everybody around campus is already on board with Banner data,

 

Dan Sandler 

They actually want Banner data.

 

Tyler Crawford 

So can we deliver that? That integration is the beginning of not only transforming and conforming student data for consumption in the data warehouse; it's also the beginning of a data lake project, because it's the first of many sources that we're going to go out and try to get. A lot of that was in response to some of the chat I've seen already. But also, in terms of the team and that dynamic: having the right people internally, but also the right partners, because we have two of them on this call. Honestly, I look back and there were things along the way that you realize, looking back, were crucial. We couldn't have done this if we didn't have Brian's experience and his leadership, and their partnership, especially when we were having some tough conversations about data governance. It's always nice to have somebody who has literally no dog in the fight just come in and throw it out there: hey, we've got to talk about this, I know you don't want to, but let's do it. Bring consultants to the table, and he'll throw it out there. And Dan, I mean, he really was a catalyst to our time to value, because these were all new systems to us; Snowflake was completely new. So it was the best on-the-job training we could have. Having the right people, not necessarily a bunch of people, just the right people.

 

Nicholas Speece 

So, in summary of our first section here, for everybody who's been listening: we now know that the right vision doesn't happen all at once; it comes iteratively. It's something that happens over time, as long as you have the right people solving a well-defined problem. A couple of things from the chat: does Snowflake work with PeopleSoft? Dan, is there a direct integration? I know ODBC connections to PeopleSoft are possible, but I don't know of a direct connector unless you do.

 

Dan Sandler 

I want to double-check. I know that we have partners that specialize in near-real-time replication from sources like SAP into Snowflake, so I would have to double-check whether PeopleSoft also falls into that lane.

 

Nicholas Speece 

Awesome, thank you very much. Let's get to the technology; let's talk about the tech. Brian, I want to come to you with this, because as sort of the PM on the project, knowing the landscape of the tools and technology that would be leveraged, you can give us the lay of the land: what sources, tools, and systems were out there that you had to navigate as the project kicked off and gained steam?

 

Brian White 

So again, I come in with a system-agnostic approach, and one of the very first things we do is take an inventory of sources, then an inventory of targets, then an inventory of attributes. Let me just say this: Auburn has a lot of all of that. So then we go to sort of a reference architecture. Let me back up: I did a chart with Banner at the top and a list of all the sources and targets. It was a very big chart with really small writing; dozens of sources and targets. So I usually come in with a reference architecture, and the reference architecture is essentially this: you have a thing, which is the source; you have a thing between the source and the eventual thing, which is the target. In this case, because this thing is a hub (the data warehouse is the hub), you have another thing, which is the thing between it and the final target, which was Salesforce in this case. And guess what: Salesforce had a deadline. The Snowflake data warehouse project had to be completed before the Salesforce project could start, and that project, which hadn't even started, already had a deadline. So we were under the gun from the very beginning. But anyway: source, thing, hub, thing, target. In Auburn's case, we said, okay, what is this thing between Banner, which is the original source, and Snowflake? In the past, all of this had been queries extracting data from Banner to flat files, which were then passed on to targets. And we said, we don't want to do that; we want to modernize the architecture using modern tools. So we asked, what do we have? Well, Banner has this thing called BEP, the Banner Event Publisher. BEP is something where you can set triggers to say what data from what tables needs to go out. So there was a configuration of that piece. Then we said, what does this middle piece look like? Do we need to buy middleware? Do we use Ellucian's product? What do we have in house? What we had in house was RabbitMQ. RabbitMQ could take the data that was triggered by BEP and ingest it into a queue. Okay, we got it that far. What else do we have in house? Well, we have Azure. So now we had built Banner, through BEP, to RabbitMQ, to Azure, which then connected to Snowpipe, which connected to Snowflake. We named that tune in, like, five notes. It's a little more complex than that, and I like it. I kept saying, let's make this sublime, let's make this elegant, and it is certainly much more elegant than it had been. And the thing is, it's scary fast, and I was very happy to see that. So once it gets into Snowflake (let me shift my hand drawing over), now we're in Snowflake: we have to stage the data, then conform the data, and then present the data. When we stage the data, we also have this thing called the historical data store, so we had that going on. The data goes through different stages all the way to the presentation layer. The reason we want the presentation layer is so we can present it to users in Qlik, so they can write their analytics and reporting, but also so we can send data on down to Salesforce, which is the CRM.
So what we did there was say, okay, we don't want to write a lot of transformations in middleware to do this. What we're going to do is write queries that present the presentation data layer in easily digestible Salesforce form. All the transformations happen in Snowflake: they wrote the queries, built the tables, and we used Informatica IICS to send the data on to Salesforce. So that's the tech stack for this whole thing. One of the things it gives us is, again, scalability and sustainability: we'll be able to go back to the other side of the drawing and add other sources that flow into Snowflake and then out to different targets. That's the tech stack in a nutshell.
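For listeners who want to picture the Snowflake end of that pipeline, here is a minimal sketch in Snowflake SQL of a blob-to-Snowpipe ingest like the one Brian describes. Every object name, URL, and ID below is a hypothetical placeholder, not Auburn's actual configuration:

    -- External stage over the Azure container the RabbitMQ shovel writes to
    CREATE STORAGE INTEGRATION banner_azure_int
      TYPE = EXTERNAL_STAGE
      STORAGE_PROVIDER = 'AZURE'
      ENABLED = TRUE
      AZURE_TENANT_ID = '<tenant-id>'
      STORAGE_ALLOWED_LOCATIONS = ('azure://<account>.blob.core.windows.net/banner-events/');

    CREATE STAGE banner_stage
      URL = 'azure://<account>.blob.core.windows.net/banner-events/'
      STORAGE_INTEGRATION = banner_azure_int
      FILE_FORMAT = (TYPE = AVRO);

    -- Landing table: one VARIANT column, loaded "by the book" with no transformation
    CREATE TABLE banner_raw (msg VARIANT);

    -- Snowpipe auto-ingest loads each file as blob-storage event notifications arrive
    CREATE PIPE banner_pipe
      AUTO_INGEST = TRUE
      INTEGRATION = 'BANNER_NOTIFICATION_INT'  -- hypothetical notification integration
    AS
      COPY INTO banner_raw FROM @banner_stage;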

 

Nicholas Speece 

And Dan, I'll come to you for just a second. Given that tech stack, it sounds like there are multiple databases inside that single instance of Snowflake acting as the staging area, then maybe the conformed layer, and then some kind of view, or variety of views, exposed out to Qlik. Is that accurate? And can you give us a sense of how you decided where those pipelines were built and how those databases stacked up?

 

Dan Sandler 

That is accurate. The architecture was definitely layered. We understood that on ingest, we wanted to ingest data by the book, meaning as-is from the source; no transformation was taken. So we had that intake valve: as Brian mentioned, data was shoveled from Banner, via Banner triggers, and we used the RabbitMQ shovel to get it into blob storage. From there (and if you see me looking off at this monitor, it's because I'm looking at one of my diagrams where this is all spelled out), the data gets into Snowflake using Snowpipe auto-ingest. At that time, storage integrations were new, so we just had to make sure everything was configured properly, and it took some time. And as we took that data in, here's the thing: all of these pieces might be new to an organization, right? Azure, Snowflake, all relatively new. So how do you establish some common ground, some familiarity, so you're not just disrupting everything? We've been talking about disruption as a good thing, but too much of it is actually very bad; you want it to be forward-looking while making sure folks can support it and get it off the ground. So it was very SQL-centric. We basically focused on how we transform the data, from the point it comes in, into something that is usable. The Avro data that the Banner messages arrive as needs to be transformed into something that can be consumed by a reporting layer, a BI tool, or an end user just running ad hoc queries. So there was a SQL-centricity to consuming the data as it came in, via streams, into the historical Banner store, which was the ODS layer, and then into the consumption layer, which is the dimensional model that, to Tyler's credit, we had from day one; it was all in place, and we just had to feed it. And then that data had to be staged up and made available downstream to salesforce.com. In all these moves it was very SQL-centric. There were some nuances, and we had to get everyone warmed up: what do semi-structured functions look like in SQL? How do we take that message and handle schema-on-read versus schema-on-write? We had to impose a schema coming out of the Banner messages. But I will say, sometimes you just have to be lucky: those Banner message units were very conducive to ingest, and very conducive to downstream consumption. So it all worked out, even though there were a lot of moving parts.

 

Nicholas Speece 

And I assume that data was hierarchical, right? Like JSON or Avro, coming out of Banner?

 

Dan Sandler 

It was definitely an object hierarchy, but within that enigma wrapped in a riddle there were name/value pairs that really made it easy. We basically pivoted that data; that's exactly what we did. And it actually set the stage so that anytime new columns come in, we can leave them off to the side and then, at our leisure, bring them in. That sort of plays to the question: is it a data warehouse, an ODS, or a data lake? Why does it matter? We're bringing in all the data we need to use.
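A rough illustration of that pivot, assuming (purely hypothetically) that each Banner message carries an array of name/value pairs under a "fields" key; the attribute names here are invented:

    -- Flatten the name/value array, then pivot selected names into columns
    SELECT
      msg:id::STRING AS record_id,
      MAX(IFF(f.value:name::STRING = 'FIRST_NAME', f.value:value::STRING, NULL)) AS first_name,
      MAX(IFF(f.value:name::STRING = 'LAST_NAME',  f.value:value::STRING, NULL)) AS last_name
    FROM banner_raw,
         LATERAL FLATTEN(input => msg:fields) f
    GROUP BY record_id;

New attributes arriving in the feed simply show up as extra name/value pairs, which is why, as Dan describes, they can be left aside and added to the pivot later.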

 

Nicholas Speece 

As a follow-on question, I want to be really specific in this instance, because technologically I feel like this is interesting. So you brought in these hierarchical data objects from the message queue. Did you write those into a VARIANT column that stored them natively? And then did you do a SQL transformation with tasks and streams, almost like triggers inside of Snowflake? Or did you just use a view to make it look structured? Does that question make sense?

 

Dan Sandler 

It does, and we actually used all of the above. So we used a VARIANT for the data coming in; it's kept by the book. Then we had a stream on top of that raw data, so we consumed only the deltas coming in. We pivoted that data out with an ELT view on top of the stream; the transformation layer was essentially ELT views. And tasks were how we were able to orchestrate it. Tyler latched on to that as one piece of it, and to fully orchestrate it, he built his own framework.
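A minimal sketch of that streams-plus-tasks pattern, with hypothetical names; the real views and the orchestration framework Dan mentions are more elaborate:

    -- Stream exposes only the deltas landing in the raw table
    CREATE STREAM banner_raw_stream ON TABLE banner_raw;

    -- "ELT view" on top of the stream: the transformation logic lives in SQL
    CREATE VIEW banner_elt_v AS
    SELECT msg:id::STRING  AS record_id,
           msg             AS payload,
           METADATA$ACTION AS dml_action
    FROM banner_raw_stream;

    -- Task drains the stream into the historical data store when data is waiting
    CREATE TASK load_historical_store
      WAREHOUSE = etl_wh
      SCHEDULE  = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('BANNER_RAW_STREAM')
    AS
      INSERT INTO historical_banner_store (record_id, payload, dml_action)
      SELECT record_id, payload, dml_action FROM banner_elt_v;

Because the task's insert reads through the stream, each run consumes only what arrived since the previous run.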

 

Nicholas Speece 

And that's a great time to pivot over to Tyler, because I want to talk about two things at the same time. The first is data quality, and the second is data governance. I know governance and quality were top of mind even when you and I first started talking, weeks and months before Dan and Brian came onto the project and started building things out. On the data quality layer, how did you approach that? Is it done in Snowflake or through an outside technology? And on the governance side, how did you approach that on day one? Anybody with a great answer can jump in, but I'm picking on Tyler.

 

Tyler Crawford 

Yeah, so data quality is significant; it's part of the brand. If you're going to deliver data (call it whatever you want, data lake, data warehouse), when you're structuring it for consumption, which is what I mean when I talk about a data warehouse, you don't want to just send junk out there and create a data swamp, or whatever you want to call it. You want to meet the threshold of what people expect. And in some cases, if you try to get it perfect, you'll never finish your project, right? Sometimes actually exposing the data as it is, and calling out the quality issues to the people consuming it, is what you need to get traction on the governance side: to get the people who can clean this stuff up at the source to go back to the source and clean it up. So there's some of that, where we've used our visualization tool to actually surface those issues. And here's the reason why. A lot of people ask, well, why not just plug into your student system, your ERP? We've been doing that for a long time. Why would you go through all this trouble if the data warehouse is for reporting only? You don't need to do this for integrations. Well, the beauty of this is that when we efficiently capture changes from our student information system, we aren't competing for resources. We're not going to shut the whole thing down in the middle of the day, when we're trying to collect tuition and bills and all this stuff, by going in and starting to write some big gigantic file. We don't have to do that; we've got the data out in Snowflake. And one really cool use case we got out of this is mastering identities. Back in time, before I was here, a really great process was built to help with finding duplicates, because duplicates are awful; you don't want them. But we've got them out there; they get created, and you want to find them as quickly as possible, so you don't end up with two people getting financial aid that are the same person on two different records.

 

Dan Sandler 

Long story short,

 

Tyler Crawford 

there was a process that was running. Really smart people had developed some common matching, but it would take like four days to run. I mean, literally, you'd have to turn it on, watch it spin, cross your fingers, and go home. You just had to wait. Well, in Snowflake, what we found was that we could be really specific. We could say, hey, as these changes are coming in and identities are being created, let's take, say, the five new identities. Instead of matching all of my identities against all of my identities, I'm matching just the changes against what exists, and I can do that common matching really quickly. We took something that was four days down to seconds; it just runs and flags, and then we dashboard it. So yeah, data quality was huge, and we're continuing to build out the governance and data quality parts of this. What's great is that because we have the data all in one place, and because we've started collecting other sources, we have a product called Ataccama that we just finished going through an RFP on. They do master data management, data quality, and reference data management, and we can integrate with Snowflake and do all of our standardization of addresses and all that stuff. We can write it, stage it, get it ready, and say, hey, this is what our address table in Banner could look like if it's approved. And then, if it is approved, we can build the piece that sends it back, that communicates it back to Banner. So the fact that we're in the cloud, doing it out there: I think that's what you were trying to convince me of when you were selling it to us, and I just didn't get it because I didn't understand. But now that we're doing it, I'm starting to see just how valuable that is.
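The shape of that deltas-versus-master matching might look something like the sketch below. The match keys (SOUNDEX of last name plus birth date) and all table names are invented for illustration; Auburn's actual matching rules aren't described here:

    -- Stream carries only identities created or changed since the last run
    CREATE STREAM new_identities ON TABLE identities;

    -- Match just the deltas against the full master set instead of all-vs-all
    SELECT n.identity_id AS new_id,
           m.identity_id AS possible_duplicate_of
    FROM new_identities n
    JOIN identities m
      ON  m.identity_id <> n.identity_id
      AND SOUNDEX(m.last_name) = SOUNDEX(n.last_name)
      AND m.birth_date = n.birth_date;

Comparing a handful of new rows against the master set, rather than the whole table against itself, is what turns a four-day batch into seconds.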

 

Nicholas Speece 

It's a huge change. As you said earlier, Tyler, in a lot of big architectures like what Auburn put together for One Auburn, certain elements are resource-constrained, not the entire chain, but certain elements, and those constraints prevent you from inventing, or disrupting in the good way, as opposed to Dan's bad kind of disrupting. The good disruption is saying: why are we doing an everything-versus-everything comparison for four days? Here's a change; let's compare it to the master set, and we know what we're working with. And because you can have multiple resources out there at the same time, you're no longer resource-constrained, and that opens up the number of things you can come up with. So I want to go to Brian and ask: as you were architecting and building this out, did you take it as an approach to say, how many virtual warehouses do we need in this Snowflake instance? Or was it more, as the joke I always make goes, the redshirts of Star Trek: throw them at the problem, and if they die, who cares? I've got one source of data; it never goes away. So just throw my redshirts at it. Which approach did you guys take at Auburn, and where are you now?

 

Brian White 

So when I came in... Tyler. Everyone needs a Tyler; everyone on this call needs a Tyler. Not this Tyler, but a Tyler; you can't have our Tyler. He's the guy that says, hey, what does this thing do? Hey, I wonder what happens if I do that? What if we...? He's the guy, and you need that guy. Absolutely. But Tyler had already taken a big leap forward on the data model before I even came in. They had done the tour; they had identified the 300, the 300 attributes they wanted to try to add to this model. And he had already started on the entity diagram, and it was looking good. So we started with that. We said, this is what we're going to build; we're not going to build ten things, we're going to build this thing. The final name of that thing was the course registration data model, and it's a star schema with the fact table in the middle. That's what we said we were going to build, and that's what we built.
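As a sketch of the general shape of a course registration star schema (the tables and columns below are hypothetical, not Auburn's actual model):

    -- Dimensions around the edges
    CREATE TABLE dim_student (student_key INT, student_id STRING, full_name STRING);
    CREATE TABLE dim_course  (course_key  INT, subject STRING, course_number STRING, title STRING);
    CREATE TABLE dim_term    (term_key    INT, term_code STRING, term_desc STRING);

    -- Fact table in the middle: one row per registration event
    CREATE TABLE fact_course_registration (
      student_key   INT,
      course_key    INT,
      term_key      INT,
      credit_hours  NUMBER(4,1),
      registered_at TIMESTAMP_NTZ
    );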

 

Nicholas Speece 

That's outstanding. So we've got about 15 minutes or so left, and I've been very structured to this point, but I want to give you guys a little bit of freedom, because you've been in the weeds so much. Brian, since we're already talking, let's keep going. Think about an aha moment that you had while building this whole thing out. What was new? What was different? What really made it click in your head?

 

Brian White 

Okay, I have two stories. Ready? Go. One was my own doing: writing the project plan with Zane over a slice of pecan pie. We sent out a text to Tyler: hey, Tyler, we need the tasks that you're going to work on; we need the milestones and tasks for building out this data warehouse. We know that you need to ingest the data; we know that you need to stage it. Let's write some tasks around all that stuff. Well, come to find out, Tyler and Dan had already, as Dan says, left the barn. They were gone; they were developing away, banging away at this thing, and they were way down the road. So Zane and I ended up having to write tasks that had already been accomplished, just for the purpose of putting them in the project plan to say they were done. So the aha was: this is rapid deployment. As I say, this is not your daddy's data warehouse. This is rapid deployment, and they were rapidly deploying. The second aha was essentially this. There's always a point in a data warehouse project where we hit the big time sucker, and it's this: how do we do change data capture? How do we get the delta? How do we get the diffs? How are we going to do this? What you usually do is build a thing: here's today's table, here's yesterday's table, and then you build complex logic to figure out what's different between today and yesterday, stage it to a third thing, and then send it on. Well, they introduced me to a thing called Time Travel. Time Travel is brilliant. It's Star Trek; it was awesome. You want a delta, so you write a query, and boom, that's it. We were getting down to the end of the project, where we had to start delivering to Salesforce, and I was like, oh no, we're not going to make it, because we have to build this change data capture. Boom: Time Travel. That was a big aha, and it's a function of Snowflake. It was great.

 

Nicholas Speece 

And that's just writing a SELECT from my table AT a given timestamp and comparing it, right, in a query?

 

 

Yep.
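In Snowflake SQL, that delta-by-Time-Travel idea reduces to comparing a table with its own past state; my_table here is just a placeholder:

    -- Rows present now that were not present 24 hours ago
    SELECT * FROM my_table
    MINUS
    SELECT * FROM my_table AT (OFFSET => -60*60*24);

No second copy of the table, no snapshot job: the historical state is queried in place.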

 

Nicholas Speece 

We're getting some questions on the ETL process; there's no data warehouse without building out the model and doing some ETL. I guess, Dan or Tyler, I'll turn this over to you guys; the point is relevant. What is your ETL tool of choice? We talked a little bit about that first data pipeline, where essentially Snowflake is doing the ETL. They say it truly takes 80% of your data warehouse effort. Would you guys agree that it took 80% of your time to build those pipelines?

 

Dan Sandler 

Yeah, I would say it probably was around 70 to 80%, and we got tremendous mileage out of those pipelines. It wasn't just for ongoing data integration; we were actually able to repurpose the majority of those pipelines to support the initial load, which is one of those things where you can't do it without a little luck, and without being ruthlessly efficient. Everything that was built was built with a purpose, and we got the maximum amount of mileage out of it. I remember some aha moments where Tyler said, okay, we can actually leverage this ongoing control auditing process to make sure what we have coming out of Banner actually matches what's in Snowflake. That became the baseline for migration: the balancing process we had for ongoing data quality control checking also fed the initial load, because on an initial load, after you do the compare, you're comparing everything to nothing, so you get everything as a delta, and that fed the downstream data pipeline. So while it was 70 to 80% of the work, it was the lifeline to get the data into Snowflake. And I see some questions: was Informatica the tool of choice? Yes, that was a tool choice, though it's never one size fits all. We had IICS to feed the hooks into salesforce.com on an ongoing basis, but within Snowflake we used streams and tasks to feed the data coming in from Banner into the historical Banner store and into the dimensional data marts.

 

Tyler Crawford 

Nick, may I speak real quick, just to that?

 

Nicholas Speece 

I would love that.

 

Tyler Crawford 

Okay. So in terms of tooling, think about what you want to do, particularly for the T part of ETL, or ELT. Do you want a visual coding process, doing all of your transformations with drag and drop? Or do you want to maintain a code library and be able to be very specific about your definitions? That's really the decision point. What we use Informatica for is just point-to-point mapping. It's really just a connector, to say we've got the data staged and ready. We don't actually want our logic to live there. We want to know what the logic is, and if there's a change, we go to our Git repository, make the change, and version-track it that way. So we're managing with what we call ELT views. If I have 12 dimensions, I've got 12 ELT views in a staging layer. I'm always calculating with our logic, saying, okay, here's what it should look like; what does it look like in the presentation layer? And then, based on those diffs, I send the inserts, updates, and deletes. That's what Dan and I worked on, and that really is the lift of the project, all of that. So it's really just a question of your comfort level: do you want to maintain your code? I found I love the nuances of Snowflake in terms of managing that. For us, it also helped articulate what we mean by ELT. The first step is just unpacking Banner messages and flattening them into what we call our historical data store, which means we can now query what our student information system looked like at any time in the past, from the onset of when we started tracking. We do that first, and then we transform into the presentation layer. I'll stop, because I can see we're almost out of time.
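A compressed sketch of one of those ELT views and its apply step, with invented names; in practice the deletes Tyler mentions would be handled separately, and the diff logic is richer:

    -- ELT view: what the dimension *should* look like, per the governed logic
    CREATE VIEW stg_dim_student_elt_v AS
    SELECT student_id, first_name, last_name, email
    FROM historical_banner_store;  -- hypothetical source

    -- Apply the diff to the presentation layer as inserts and updates
    MERGE INTO dim_student d
    USING stg_dim_student_elt_v s
      ON d.student_id = s.student_id
    WHEN MATCHED AND (d.first_name <> s.first_name
                   OR d.last_name  <> s.last_name
                   OR d.email      <> s.email) THEN
      UPDATE SET first_name = s.first_name,
                 last_name  = s.last_name,
                 email      = s.email
    WHEN NOT MATCHED THEN
      INSERT (student_id, first_name, last_name, email)
      VALUES (s.student_id, s.first_name, s.last_name, s.email);

Keeping the logic in views under version control, as Tyler describes, means the "T" is a diffable SQL artifact rather than configuration buried in middleware.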

 

Nicholas Speece 

We got a couple of questions about Time Travel, how it works and what it does, so really quickly: Snowflake uses object storage underneath as its underlying file system. Whenever data is written into Snowflake, it gets put into object storage: in Amazon it's S3, and in Azure it's Azure Blob storage. We didn't invent our own; we use the one that's in the cloud. Whenever data gets written in, it's immutable; whenever a new piece of data, an update, or an insert comes in, we create a new object. Because we have all those objects, we can move forward and backward through time by knowing when objects were written and what data was relevant when. This allows you to write a query dynamically that shows what data was relevant, or real, in a table at a given time. It doesn't require copies of data, replication, or snapshots, and it works up to 90 days back in that window. I hope that was helpful for the three or four folks who asked about using Snowflake in some of those ways. Tyler, I want to end this with you, and feel free to rant for the last four minutes, brother; they're all yours. Give me an aha moment, a key takeaway, and the future: what you want for Auburn. Take us out.

 

Tyler Crawford 

I'll try not to use all the time, in case anybody else wants to get the last word. But I will say that our aha moment was when we realized how efficient the modern approach is. You talked about semi-structured data. Now, as we're trying to do integrations, what's next? API integrations. And this may go back to tooling: how do we do that? We're starting to explore, because we need to get our data to Azure, and that is kind of the Snowflake way: get it to the blob, and then you can take it the rest of the way; you can auto-ingest, you can do all of that. Azure actually has some tools that help with API integrations. We've used Logic Apps for some simple communication of data between different places, and Data Factory is out there too. But basically, knowing how to work with semi-structured data is a really, really cool thing about Snowflake, because now we're doing those API integrations, whether inbound or outbound. When we started, Dan, we had directory paths, and there was inbound, which is all the data coming in, and he told me, and also outbound. And I was like, well, what are we going to do with that? He explained it to me, but I was just like, yeah, we know we have outbound; I don't know how, we just never use it. Well, as soon as we got off the ground, the questions started: oh, good, we can do this now? Okay, great, we've got it in Qlik, we've got it in Salesforce, but we want it here, then somewhere else. So the outbound delivery, learning how to deliver those objects, that semi-structured data, so you can just hand it to an API, in and out of systems: that's what's next, and what we're starting to figure out in terms of scaling is API libraries and all these different things. And that's really going to take us to, I mean, like I said at the beginning: it's data lake, bringing in and storing all the source data in its own contained schemas where we can control the security, and then data warehouse, conforming whatever you need from all of that into a nice consumable format for people, for reporting or integrations.
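One plausible shape for that outbound path, packaging rows as JSON objects that an API consumer can pick up from blob storage; the stage, view, and field names are hypothetical:

    -- Unload presentation-layer rows as JSON to an outbound stage
    COPY INTO @outbound_stage/registrations/
    FROM (
      SELECT OBJECT_CONSTRUCT(
               'studentId', student_id,
               'termCode',  term_code,
               'credits',   credit_hours)
      FROM fact_course_registration_v
    )
    FILE_FORMAT = (TYPE = JSON)
    OVERWRITE = TRUE;

From the blob container, a tool such as Logic Apps or Data Factory can take the files the rest of the way, which mirrors the inbound pattern in reverse.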

 

Nicholas Speece 

I can't think of a better use case for a data platform in the cloud. Tyler, Brian, Dan, thanks so much for coming on the program today and giving everybody at home what they needed; we got some great compliments. Now back over to Carahsoft for our closing comments. For Snowflake, I'm Nick Speece. Thanks, everybody, so much for coming out to see us. Thanks for listening. If you'd like more information on how Carahsoft or Snowflake can assist your educational institution, please visit www.carahsoft.com or email us at snowflake@carahsoft.com. Thanks again for listening, and have a great day.