CarahCast: Podcasts on Technology in the Public Sector

Dive Deeper into Google Cloud Dataprep by Trifacta and How it Accelerates Analytics

Episode Summary

In this podcast, Eric Clark, Program Manager at SpringML, and Brian Shealey, Vice President of Public Sector at Trifacta, discuss how the service is used across many of their clients and how you can use Cloud Dataprep in your own organization.

Episode Transcription

Speaker 1: On behalf of Google Cloud, Trifacta, SpringML, and Carahsoft, we would like to welcome you to today's podcast, focused on Google Cloud Dataprep by Trifacta and how it accelerates analytics. Eric Clark, Program Manager at SpringML, and Brian Shealey, Vice President of Public Sector at Trifacta, will discuss how Cloud Dataprep allows users to explore, clean, and prepare data for analysis and machine learning through an intelligent and visual interface.

Eric Clark: Thanks to everybody for carving out time today to join us. My name is Eric, and I'm joined by Brian Shealey from Trifacta. Brian, do you want to introduce yourself really quick?

Brian Shealey: Hey, everybody, thanks for being here. I'm the Vice President of Public Sector for Trifacta. We are a Silicon Valley software company based in San Francisco, and we have a really unique relationship with Google. Carahsoft is our distributor in public sector, and they've been a great partner for quite a long time. My role is driving projects and new revenue across the entire North American public sector for Trifacta, and public sector happens to be our fastest-growing market segment worldwide. So we've done a pretty good job as a team getting our message out there and having customers adopt our technology. Ultimately, what we focus on as a company is what's called self-service data preparation. What's unique about Trifacta and Google's relationship is that our product has been OEM'd by Google into the GCP offering. So Google Cloud Dataprep is Trifacta. We have a unique co-sell motion with the Google teams. You're going to learn a little bit about how that all works today and where Dataprep fits into the bigger GCP ecosystem, and I can cover the core differentiation of that offering versus other cloud providers if people are interested. All in all, we've worked on projects with GCP and SpringML, with Dataprep as a key component, and we fit nicely into the overall GCP ecosystem. So Eric, I'll let you pick it up from there and talk a little about what you guys do. But I'm excited to be here. Thanks for having me.

Eric Clark: Thanks, Brian. Again, my name is Eric Clark. I'm an account director for Google Cloud services with SpringML, focusing on public sector, and have been for the last 12 or so years. A little bit about SpringML: we're a professional services company and a premier Google Cloud partner, and our key focus areas are machine learning and artificial intelligence, modernizing analytics platforms, handling cloud migrations, and application development. I'm excited to talk today about our perspective on where Dataprep fits into the overall modern data analytics platform picture. To get started, I want to start and end with the same three key takeaways that we want to communicate today. The first is planning now for the growth of your data. Deciding now how you're going to handle the growth of your data is an important step, and we'll talk about why. Secondly, increasing the trust and usefulness of your data. This is something organizations struggle with quite a bit, especially if you're pulling together data from multiple sources. In order to build that modern analytics platform, you've got to pull in the sources and make the data trusted and useful for your organization. And thirdly, and probably the most important takeaway I want to communicate today, is the position of your data and how that matters for the future of your capabilities around advanced analytics. We'll talk in more detail about what positioning your data means in a second.

So you might be familiar with this statistic. As I've done presentations over the years, I've kept tabs on what IDC publishes about what they call the Global DataSphere, which tracks how fast we're assimilating and amassing data. Their projection, as of a couple of years ago, is that worldwide data will grow to 175 zettabytes by 2025. You may have to Google what a zettabyte is; there are a lot of zeros on the end of that number. What's driving that? The adoption of cloud, how cheap storage is today, and the fact that we're amassing a ton of data from IoT devices, wearables and other sources of health data for research analytics, transactional data in retail settings, and logistics and transportation data. We're able to collect and store more data more cheaply and more efficiently than ever before, and it's almost an exponential growth curve. Maybe it's not a concern for you today, but we do need to be planning for how to handle this influx of data, because it's just going to keep growing. How do we make sense of it? How do we prepare it for advanced analytics use cases, or for machine learning purposes, or to develop artificial intelligence? It's important to have a plan to address this massive amount of growth.

In my experience working on small and large scale information management and analytics projects, I've noticed some key challenges that organizations face with regard to data preparation activities, and I want to compare and contrast five of those specific challenges with how Dataprep can actually solve them. First, traditional methods are not scalable.
Traditional approaches often require a lot more capital expenditure upfront to deploy servers in a data center somewhere, or to pay for annual licensing, which can get pretty expensive and add up. In Dataprep's case, there are no servers to worry about; it's a service. You turn it on and can start using it immediately, it scales both up and down with usage, and you don't have to worry about it not scaling with your demand. Secondly, traditional methods are not really intelligent, at least not as intelligent as Dataprep can be. As soon as you connect your data, Dataprep provides automatic insights about the quality of your data. It points you right to the places where you need to apply some focus, and it gives you recommended next steps on how to identify issues in your data and how to actually fix them with intelligent recommendations; we'll see exactly how that works in the demo. Thirdly, traditional methods typically haven't been that fast. In Dataprep's UI there are two really key components: the user interface, being a web-based product, is really fast and responsive, but it's also very intuitive, which makes the learning curve a lot lower than other solutions. You don't have to go get a certification to know how to use Dataprep; you can just plug your data in, and it's intuitive enough for beginners to jump in and start using it right away. Fourthly, I've seen that traditional methods are not inherently collaborative. Dataprep allows you to share the flows that you build, the datasets that you add within Dataprep, and the recipes and cleansing logic that you create with your colleagues. You can definitely use it on your own individually, but the ability to share and collaborate on those things together is really important. And finally, these traditional methods aren't built for the modern workforce. What I mean by that is the modern workforce is going to have a lot more of what we call citizen developers who need to be empowered to take action on their data and do what they need to do. These are folks who haven't gone to school for computer science, and they don't need to, but they're close to the business. They understand the challenges of the business, and they understand the data surrounding their organization. So they need to be empowered to handle the data, prepare it, and get it ready for advanced use cases. Dataprep does exactly that: it makes data wrangling accessible to those closest to the business need itself. And it doesn't require a request to IT to spin up servers and install complex hardware; you're off to the races within minutes, really. That's something that is really great about this solution.
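To make the profiling and recipe ideas concrete, here is a minimal sketch in plain Python of the kind of quality checks and cleansing steps a Dataprep recipe records through its visual interface. This is ordinary pandas code rather than the Dataprep product itself, and the file and column names are hypothetical:

```python
# A rough pandas equivalent of what Dataprep's quality bars and
# suggested transformations automate. File and columns are hypothetical.
import pandas as pd

df = pd.read_csv("air_quality_sample.csv")

# Profile: per-column missing-value counts, roughly what the
# quality bars in Dataprep visualize for each column.
for col in df.columns:
    print(f"{col}: {df[col].isna().sum()} missing of {len(df)} rows")

# Cleansing steps a recipe might capture:
df["city"] = df["city"].str.strip().str.title()  # standardize text casing
df["reading_date"] = pd.to_datetime(df["reading_date"], errors="coerce")
df = df.dropna(subset=["reading_date"])  # drop rows with unparseable dates
df = df.drop_duplicates()                # remove exact duplicate records

df.to_csv("air_quality_cleansed.csv", index=False)
```

The difference in Dataprep is that each of these steps is suggested visually and recorded in a shareable recipe, rather than being hand-written code that lives on one analyst's machine.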

Brian Shealey: So, a couple of interesting points here. I think these are very well positioned and understood challenges for anybody who's worked on modern data projects, especially across the public sector. The pattern that we're seeing is this move to empower users to operationalize data more effectively and faster, to achieve their analytics outcomes. Those analytics outcomes are driven, obviously, by policy and by organizational needs. And one of the obvious things we've seen with the pandemic over the last 12 to 15 months is that a lot of agencies haven't been ready to tackle large-scale analytics; it's a major problem. Trifacta as a company came out of a Stanford and Cal Berkeley joint project seven years ago, and a lot of the things Eric just covered (scalability, intuitiveness, speed to get your work done and to expedite your time to insight, and the concept of collaboration in the modern way of working in DataOps) were all tackled because of the way those folks at Stanford and Cal Berkeley solved the problem. They were living this in the world of big data. And I would say the pattern we're seeing in the public sector that's really expediting this is that most agencies, from the state and local level all the way up to the largest federal programs, are trying to consolidate spend on tools; no one wants to spend tax dollars unduly on tools. They also want to unlock the data from the silos it's in, which is an inherent problem in any large enterprise or large program. So the pattern we've been witnessing for the last four years in public sector is this: number one, everybody's trying to unlock the power of data. Maybe you're in the defense space and you've read the 2018 National Defense Strategy, which says, hey, we've got competition for the future and we need to field AI and ML faster; that's one way of thinking about analytics and expediting those analytics capabilities to market. Or maybe you're a municipal government organization with data in a bunch of different systems, trying to do better predictive planning. In either case, whether it's an AI example on one side or a business intelligence example on the other, everybody is moving to cloud-native architectures. They're trying to use compute and storage in a utility fashion, just like you consume your power: you don't go build your own power plant to power your house, you just plug into the grid and get what you need. And they're ultimately trying to set up these collaborative, very robust, integrated tool stacks to enable users of different personas, different skills, and different business needs to meet their analytics and big data needs in a timely fashion. Trifacta as a tool is definitely an enabling technology and a key part of the overall ecosystem. But with these challenges, if you've ever lived any of them, you'll understand that you can't just say, oh, well, that's just an ETL function, because the concept of ETL has massively changed. So looking for cloud-native, fast capabilities to operationalize data across a myriad of users is what we see, and the move to these cloud-native shared-service models is, again, what we're seeing over and over across the entire public sector during the pandemic.
Ultimately, regardless of what state you live in, states are doing this. The state of Maryland, where I live, is certainly doing it; I know the state of Washington is doing it; and there are a bunch of others out there, really driven by the pandemic. They're trying to unlock a lot of different data, combine it with open source data and everything else, and have the ability, in a very fast turn, to figure out new analyses related to things like what the supply chain looks like for goods and services related to the pandemic, or contact tracing, which is a massive data integration and analytics problem. So all these things are driving it. And I would say GCP has very unique packaging and a unique way of going to market. As you go through this, Eric, I think you'll keep hitting on things that will be aha moments for folks who are wondering, well, why is Dataprep different? What value does it provide beyond my normal desktop ETL tool?

Eric Clark: Great point, Brian, thanks for chiming in on that. I think this is a good transition point to the next topic: why does all of that matter? At the end of the day, we all need clean and well-formatted data, because that's what drives up trust, and that's what drives up the actual usability of the data. All those things Brian just mentioned are what we're leading up to, and Dataprep is poised to handle just that, so you can create and generate clean and well-formatted data quickly and easily, much quicker than prior ways of doing it. But given the amount of data we're generating (remember the 175 zettabytes by 2025) coming from more and more sources, the number of sources is just going to keep growing, and the amount of data we're going to be processing and trying to make sense of is going to keep growing with it. So we need to learn how to leverage it, understand it, and use it to our advantage, and we don't want to be thinking about servers and managing IT infrastructure to do that. That's exactly where Dataprep comes into the picture.

So let's talk about what Dataprep is. I like the definition here: an intelligent data service to visually explore, clean, and prepare your data, structured and unstructured, to get it ready for analysis, reporting, and machine learning. Very simple. A key point, though, is that it's serverless. That means, as I mentioned previously and as Brian noted, there's no need to spin up servers; you turn on the service and start consuming it right away, it scales with whatever demand you put on it, and there are no infrastructure components to worry about. The other thing that's really interesting is that it's intelligent: it understands the data about your data and gives you meaningful suggestions for how to cleanse it and transform it into useful data on the analytics front. And it's no-code: as you'll see in the demo, it's all point and click through a very intuitive user interface. Just think of it like this: data preparation as a service. Plug it in and start using it right away. If you're using AWS or Azure or any other public cloud, you can still use Dataprep; you're not locked into using Google Cloud specifically. It's more open than that. So even though our focus at SpringML, as a Google Cloud premier partner, is building solutions on GCP, we often connect with other public clouds, and that's just fine; it supports that.

From SpringML's perspective, I think it's important to define, in this bigger picture, what a modern analytics platform is and looks like. We basically describe it as a cloud-native platform, as Brian mentioned earlier, that empowers people to advance their organization's analytics capabilities. That's what's at the heart of this. And that means abstracting as much complexity out of the way as we can, so we can get to the business value from the data as quickly as possible. You can string these products and services together to create that modern analytics platform; it's not that you have to use every single one of these services.
But it's organized from left to right in a way that shows example tools and services you can use. First, data capture, ingesting data in. Then, second from the left, the streaming and data pipeline stage; that's where Dataprep sits, and this is really about processing, cleansing, and transforming your data so that it can be trusted and useful. The next one over, data warehousing and the data lake, is basically storing data. And once you have the data in a place where you can store it, you can start layering on more advanced analytics. There are really great products and services out there that you can leverage to start building ML models right away without having to have a data science degree: you can add in Google Cloud AutoML, or you can add in reporting right from Data Studio, or even partner BI tools like Looker, Tableau, Power BI, etc. So that's our perspective on the modern analytics platform and how we do it. Again, it's cloud native, and all these things are serverless; we're not worried about the infrastructure as much as we are about what we're trying to do with the data.

But I do want to talk about something really strategic in this whole conversation, a transformational concept, and that is the position of your data. One of the key points I wanted to mention today is to simply ask: where is your data positioned? When we think about where your data is positioned: is it in disparate, unconnected sources? Is it living on antiquated software and hardware that can't meet the demands of the data growth we talked about, or the demands of the users, as easily anymore? Is it still on the mainframe? You might even feel vendor-locked in some cases, where you're heavily invested in licensing and you feel like your hands are tied. Where is your data positioned? It turns out that when we think about building modern analytics platforms, the position of your data matters, and it matters a lot. The position of your data determines your ability to achieve advanced analytics, like making predictions and doing forecasting. It determines your ability to adopt machine learning and augmented intelligence capabilities. It determines your ability to consolidate and connect the dots, making sure all of those data sources that might be disparate today are connected so you get the full picture. And it determines your ability to future-proof yourself by avoiding more vendor lock-in and creating the flexibility to use the data the way you need to. It's all about positioning: where is your data? That's why SpringML and I have spent the last several years helping my public sector customers reposition their data. It sets them up for a future of those types of use cases, without having to worry as much about scale or infrastructure, so they can take fuller advantage of what the data is trying to tell them. So Brian, I don't know if you have any quick thoughts on this before we transition over to how Dataprep can help in positioning or repositioning data, but feel free to chime in if you've got any thoughts.
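As a concrete, minimal illustration of what "positioning" data can look like in practice, the sketch below moves a local extract into a Google Cloud Storage landing zone using the standard google-cloud-storage Python client. The bucket, object path, and file name are hypothetical:

```python
# A minimal sketch of "positioning" data: moving a local extract into
# Google Cloud Storage as a landing zone. Names are hypothetical; this
# uses the standard google-cloud-storage client library.
from google.cloud import storage

client = storage.Client()  # uses your default GCP credentials and project
bucket = client.bucket("example-agency-landing-zone")

# Upload the raw export as-is; Dataprep can connect to it from here.
blob = bucket.blob("raw/air_quality_export.csv")
blob.upload_from_filename("air_quality_export.csv")
print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```

From a landing zone like this, Dataprep can connect directly to the file, and the rest of the GCP stack (BigQuery, AutoML, Data Studio) can reach the data without further movement.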

Brian Shealey: I mean, I think you've touched on the key things. One concept I always share with people on these kinds of projects is that when you get going on any analytics project, your goal may be to reposition your data, unlock it from where it lives today, and figure out ways to integrate it together. But the key paradigm you should consider, I believe, is building for agility, meaning what you start out trying to do with your data will evolve, because you will get new insights once you start that journey. One nice part about the packaging in GCP is that Google has a very unique way of looking at cloud services, in my opinion. It's not just infrastructure, not a commodity-based thing for storage and compute; these are real solutions that are tied together in a well-integrated and well-understood packaged way, so that the agility is there when it becomes a requirement. You start out saying, hey, we just want decision support, and we need to unlock data from various siloed systems of record, and you start populating it into Google Cloud Storage or colocating it in BigQuery for data warehousing, etc. What you're going to find over time is that other projects will come online, and they'll have the ability to leverage that data. The nice part about the approach is that all of this is interconnected around the storage of the data, and it gives you, in as self-service-oriented a way as possible, a nice architecture that truly is scalable, agile, and API-oriented. In many cases it can be your centerpiece, or even a strategy for connecting to on-prem and multi-cloud data. So I think there's really good positioning there, pun intended, in helping you reposition your data with the right tool stack in GCP, which is unique packaging.

Eric Clark: Yeah, that's awesome, Brian, thanks. So how does Dataprep fit in, in terms of positioning or repositioning data? It plays a key role here, so let's talk about how it works. Once the data is in the cloud, Cloud Storage, for example, acts as a kind of landing zone; you can load your data there. It's object storage, so it could be a flat file, something easy to upload, and then you can connect Dataprep to it right away. Dataflow is there to process and move the data to its eventual target location, which could be BigQuery or Cloud Storage, and then you can start layering on analysis tools. I do want to mention that although things like S3 from AWS aren't shown here, that's a valid source too; it doesn't have to be just GCP technology in this mix.

Let me briefly talk about how we see Dataprep being used. In that flow of data, from connecting to sources, to cleansing and processing, to ultimate usage of the cleansed data, we see five key things. First, preparing data for machine learning modeling. It turns out machine learning needs well-prepared, very clean data to produce accurate models, so Dataprep definitely has a use case in creating and standardizing data to make it useful for machine learning. Second, we see it used for combining multiple data sources: flat files that you load into a Cloud Storage location, spreadsheets, other databases, other data lakes and warehouses, or applications. Joining them together is where the magic starts to happen; you can enrich the data, bring it through this pipeline concept, clean it, and load it into its eventual target. Third, we see Dataprep used for quickly understanding the data about your data: being able to profile your data, visually inspect it, see charts and quality bars, and determine right away what mismatches you have. You can see those things visually, click around in Dataprep, and address those issues very quickly. Fourth, sharing and collaborating on flows. With more traditional tools you're probably working individually, but with Dataprep you can actually build out your flows. Think of a flow as a container that holds your source data, which could be multiple sources, then recipes, which are the transformational steps you apply to clean the data, and then outputs. Having the ability to share those flows with your team is really empowering, and we love to see teams use it for that. And finally, the components of a flow can be scheduled, and you don't have to code anything. If I build a flow that takes data from sources, transforms it, cleans it, and gets all of those things ready for me, I can schedule it to load into my target on a daily basis, a weekly basis, whatever the schedule might be. Very powerful options at your fingertips.

I wanted to give a very specific example from a request that came in just last week. We had a customer that we've been working with who is interested in exploring a move off of Oracle. So as we're talking about this, he sends a 40-gigabyte export from their Oracle database with a bunch of air quality data in it.
It's 40 gigabytes in CSV format, containing about 35 million records, and the data set is very wide. Imagine opening a spreadsheet with 304 columns; where do you start? There's an old way of doing this: I could maybe carve off a piece of that 40 gigs, because good luck opening a 40-gig CSV on your desktop, and I might be writing regular expressions or doing other manual work to understand the data, figure out what's in there, and determine what I need to do before I can make it useful. Before you know it, that adds up and takes time. The better way is, again, positioning your data. The first thing we did was move that 40-gig file to Google Cloud Storage; we positioned the data, still in its raw form. But because it was there, all we had to do was point Dataprep at that 40-gig file. Dataprep reads it and samples it, and we could immediately start interacting with it and seeing information about the data. Within less than two days, we had understood the data in that 40-gig file, we knew the schema, we knew the issues we needed to fix, and we had it loaded into BigQuery, the alternative data warehouse in this case. A really cool story. And this is how it flowed, from ingestion and processing through to the analysis and ML side: the data landed in Cloud Storage, we used Dataprep to understand it and ran the job, which kicks off Dataflow; Dataflow loads the cleansed data into BigQuery; and then we just pointed Data Studio at BigQuery and started to build some reports to understand it. So that's an example pipeline, an example way to use Dataprep in the mix of that whole pipeline.
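For a sense of what the final step of that pipeline does, here is a minimal sketch of loading a cleansed CSV from Cloud Storage into BigQuery with the standard Python client. In the actual story, Dataprep's job runs on Dataflow and handles this for you; the project, dataset, bucket, and file names below are hypothetical:

```python
# A rough sketch of the load step: cleansed CSV in Cloud Storage into
# a BigQuery table. In the Dataprep flow, the Dataflow job does this
# for you; names here are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.air_quality.readings"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # header row
    autodetect=True,      # infer the schema, much as Dataprep did by sampling
)

load_job = client.load_table_from_uri(
    "gs://example-agency-landing-zone/cleansed/air_quality.csv",
    table_id,
    job_config=job_config,
)
load_job.result()  # wait for the load to complete

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}")
```

Once the table exists, pointing Data Studio (or any BI tool) at it is just a matter of selecting the BigQuery table as a data source.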

Speaker 1: Thanks for listening. If you would like more information on how Carahsoft or Google Cloud can assist your agency, please visit www.carahsoft.com or email Tierra Brooks at tierra.brooks@carahsoft.com. Thanks again for listening and have a great day.