Between Product and Partnerships is a podcast brought to you by our group the SaaS Ecosystem Alliance, and it’s focused on bringing together product, partnerships and engineering leaders to discuss how to build, support, and scale SaaS ecosystems. If you're interested in watching or listening in on this conversation, you can access the video here and a link to listen on podcast platforms here.
Our Director of Marketing, Kelly Sarabyn, interviewed Arpit Choudhury, the Founder of astorik. The experienced data leader and educator talks about the biggest challenges to offering and managing product integrations, how to assess the internal data landscape of B2B customers when building integrations, and how customers are currently evaluating integrations.
Arpit: For the past few years, I've been working in the data space. I primarily started with Integromat, where I built our community and then, eventually, our growth team. I also set up our customer data infrastructure and contributed to building our partnership program and ecosystem. That was a pretty incredible experience.
Data itself has evolved so much, and there's been an explosion in the data technology landscape; so many new data tools and technologies have come up in just the last two to two and a half years. These tools are solving a lot of interesting problems, and I’ve been fascinated by the space.
Over the last few years I've spent a lot of time learning these new tools and technologies, and talking to founders of companies building these tools to understand the kinds of data problems they're solving. When we talk about data, there are a lot of problems that people don't even know exist. That's pretty fascinating, and I write a lot about this stuff, including in my newsletter. I’m excited to be here, thanks for having me.
Kelly: Well, let's jump right in.
What is data integration and why is it so important to SaaS companies today?
Arpit: In simple terms, data integration is the process of moving data from internal systems or databases, as well as external, third-party tools, to a target system. The target system can also be an internal database, a third-party tool, or a third-party system. At its core, data integration is the process of moving data between systems. There are other nuances when you're moving data: you also have to make sure the data is formatted correctly, meaning it's in the format the target system expects it to be in.
Just moving data is not enough. It has to be moved in a manner that is accepted by the target system, and it must be stored so that it is usable in the future. At the core, it's about moving data, but there's also another popular definition of data integration, which is essentially centralizing data: taking data from all your first-party and third-party systems and centralizing it in a database that is typically meant for analytics workflows.
This particular database is typically referred to as a data warehouse: you're warehousing all of your data to build a central source of truth. That is also a definition of data integration, and it's becoming more and more popular, especially in the data ecosystem. When people talk about data integration tools, they generally refer to tools that extract data from source systems and load that data into a data warehouse.
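To make this centralize-then-query pattern concrete, here's a minimal sketch in Python. SQLite stands in for a cloud warehouse, and the source extracts, table names, and field names are all made up for illustration:

```python
import sqlite3

# Hypothetical source extracts; in practice these would be API calls
# to each SaaS tool (CRM, support desk, and so on).
def extract_crm():
    return [{"email": "a@example.com", "plan": "pro"}]

def extract_support():
    return [{"email": "a@example.com", "open_tickets": 2}]

# Load everything into one central store. SQLite stands in for a
# cloud warehouse like Snowflake, BigQuery, or Redshift.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crm_users (email TEXT, plan TEXT)")
conn.execute("CREATE TABLE support_users (email TEXT, open_tickets INTEGER)")
conn.executemany("INSERT INTO crm_users VALUES (:email, :plan)", extract_crm())
conn.executemany("INSERT INTO support_users VALUES (:email, :open_tickets)",
                 extract_support())

# Once the data is centralized, one query can span what used to live
# in two separate tools.
row = conn.execute(
    "SELECT c.email, c.plan, s.open_tickets "
    "FROM crm_users c JOIN support_users s ON c.email = s.email"
).fetchone()
print(row)  # ('a@example.com', 'pro', 2)
```

In a real pipeline the extract functions would call each tool's API (or be handled by an integration tool), but the shape is the same: land everything in one place so a single query can span systems.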
Kelly: For the people in our audience who are working on these integrated ecosystems, from the perspective of SaaS companies, they tend to think about the issue of integrating with their partners, but I think it's really important to take a step back and think about your customer. If you're a B2B SaaS company, you are selling to a business. It's important that before you even get to the point where you're integrating with other SaaS companies, you really understand how they're handling their data internally, and what their problems are.
This definitely changes at scale. If you have Susie's Candle Shop with seven employees, her data practices are going to look very different from Kraft Foods or some large enterprise.
Could you share what the data landscape looks like internally, perhaps starting with a mid-market business and then larger companies. What does their data landscape look like internally? What are they struggling with? What are the big pain points?
Arpit: It boils down to all the sources where data is generated for that business. The first-party sources are typically the company’s website, apps, or smart devices, if it has them. Then there are third-party sources, which are all the SaaS tools a company uses. Now, if we talk about a mid-market company: I don't know what the average is, but some people claim that an average company uses over two hundred SaaS tools.
With so many different SaaS tools, you also have to understand that each one has a different data model. There is no standardization in terms of the data model; every tool refers to users or customers differently, and the definition of a user varies from tool to tool. The question is how a company takes all of this data, from third-party systems as well as first-party sources, and stores it in a manner that is actually usable.
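One way to picture the data-model problem is as a thin normalization layer that maps each tool's field names onto one canonical schema. The tool names and raw field names below are illustrative, not any vendor's actual property names:

```python
# Each tool names the same concept differently, so a small mapping layer
# translates every source record into one canonical schema.
FIELD_MAPS = {
    "tool_a": {"email": "email",         "name": "firstname"},
    "tool_b": {"email": "Email",         "name": "FirstName"},
    "tool_c": {"email": "contact_email", "name": "contact_name"},
}

def normalize(source, record):
    """Translate a raw record from `source` into the canonical schema."""
    mapping = FIELD_MAPS[source]
    return {canonical: record.get(raw) for canonical, raw in mapping.items()}

print(normalize("tool_b", {"Email": "a@example.com", "FirstName": "Ada"}))
# {'email': 'a@example.com', 'name': 'Ada'}
```

Real integration tools maintain mappings like this for hundreds of tools, which is a large part of what customers are paying for.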
A lot of companies end up spending a lot of resources and buying expensive tools that allow them to take all of this data and dump it pretty quickly into a data warehouse. Data warehouses are also becoming more and more affordable. But that's not enough; just taking the data and dumping it is not enough.
The biggest challenge is not the technology; there's good technology available. The biggest challenge is understanding how to make all of the available data usable. How do you make it easy for the entire organization, and for different teams, to derive insights from the data and drive action on it? That is a pretty difficult problem, because you cannot just dump all of the data and then figure out how to use it later.
Companies need to have processes in place that allow people to understand what data is available, ask questions of the data, and then work backwards to figure out how best to answer those questions. Sometimes the data to answer a question is already available in your warehouse, your analytics system, your BI tool, or whatever your analysts use in their day-to-day. Other times the data is not available, in which case they have to work with their engineers to get that data into the system. Again, this system could be a warehouse or something else.
Essentially, there’s a whole workflow where an engineer is tasked with taking data from this system and dumping the data into a warehouse. An analyst is tasked with cleaning this data, transforming this data, making this data usable, and eventually building reports that are consumed by various teams that derive insights from this data. Once they derive insights, they have more questions and they also want to take the data and do something with it.
They want to send this data into their marketing automation tools to build personalized customer experiences. Making the data actionable is the biggest challenge, I would say.
Kelly: Related to that is the real-time component of data: having accurate data move, if not in real-time, close to real-time. A lot of information can quickly go out of date, and in some roles you need that actionable data in a timely manner. That adds another layer of complexity for the reason you're describing: some systems don't even allow for moving data in real-time; you just can't get it out of the system in real-time. There may not be a way around that for the business, because if the system they're using doesn't allow it technically, they are going to have to find a way to work around that delay.
There's the challenge of not only getting actionable data, but getting it to people when they need it. So many times, even if you have good data, if it comes to you past the point of actionability, it doesn't meet the business need and purpose. The other core challenge is just the surplus of data. As you noted, there are so many systems being used.
When you get into large organizations, ideally they're all being tracked by some centralized system. But sometimes people in certain business functions just go rogue, because they're trying to meet their own goals and metrics, and they're looking for whatever system will enable them to innovate. As software gets cheaper, that becomes more possible: someone can just try out different systems. But then you have a problem moving data across the organization accurately when some systems aren't even on the radar.
Arpit: I would like to mention one other challenge; what you said makes a lot of sense. One big problem people forget is that integrations break. No matter how well they're built, they tend to break for a variety of reasons. When your pipelines break, you have to have systems in place that notify you, because oftentimes you only realize something's broken much later, after the problem has already appeared in a lot of the data. That is a very common problem, even for small businesses that are just trying to move data from their CRM into their marketing automation tool.
Then there's a change in the API and something breaks, or a user goes and changes something in the tool itself, changing the way a certain object is defined. That breaks the integration, and nobody realizes it unless you actually have tools in place. There are observability tools that monitor your pipelines and notify you as soon as something breaks. That is a significant problem, and as companies build more pipelines and integrate more tools, it is only becoming more severe and more critical.
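A minimal sketch of what such monitoring looks like in code, with a hypothetical `notify` function standing in for a real alerting channel (a Slack webhook, PagerDuty, email):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def notify(message):
    # Hypothetical stand-in for a real alerting channel
    # (Slack webhook, PagerDuty, email, and so on).
    log.error("ALERT: %s", message)

def run_with_monitoring(step, retries=3, delay=1.0):
    """Run one pipeline step; retry on failure, alert if it keeps failing."""
    for attempt in range(1, retries + 1):
        try:
            return step()
        except Exception as exc:
            log.warning("attempt %d of %s failed: %s",
                        attempt, step.__name__, exc)
            time.sleep(delay)
    notify(f"{step.__name__} failed after {retries} attempts")
    return None

def sync_crm_to_marketing():
    # A real step would move data between two tools; here it just succeeds.
    return "synced"

result = run_with_monitoring(sync_crm_to_marketing)
print(result)  # 'synced'
```

The point is that a broken pipeline surfaces as an alert within minutes, instead of being discovered days later as bad numbers in the data.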
Kelly: I agree, I think that's a huge challenge. That's a challenge for SaaS companies, as well, in terms of if you are going to provide products and integrations to your customers, how do you keep up with all those changes? And also, are you notifying your customers? Are they being alerted when something breaks?
As a business user of business systems, I can say there are many cases where I've used product integrations provided by a system, was not alerted when they broke, and only found out a day or two later. As a marketer, I may have lost a lot of registrations. If you don't respond to a demo request quickly, you can lose a certain number. That stuff is key, and it can really impact user experience.
These terms “ELT”, “ETL” and “reverse ETL” are key to the data industry, but a lot of people don't know exactly what they are or how they're different. Can you kind of elucidate that?
Arpit: In my head it is really simple. ETL stands for Extract, Transform, Load. The concept of ETL goes back to the 70s; it's not a new concept at all. It essentially means you extract data from your source systems, typically third-party tools. Before you load the data into the target system, which is your warehouse where you want to analyze the data, you go through a transformation process to make sure that when the data lands in the target system, it is exactly how analysts or data scientists want it to be. This allows them to put the data into action, meaning they can start analyzing the data, building models, and building algorithms.
The transformation step was a huge step: with ETL you figure out exactly what the data should look like, and once you extract the data, you transform it, and only then load it. The biggest shift since then has been the way data warehouses have become cloud native. Today, more and more companies are embracing cloud data warehouses like Snowflake, BigQuery, or Redshift. These warehouses have also become much cheaper and faster. Companies have figured out it's easier and faster to just ELT all the data, meaning they extract and load all the data first, and take care of transformation later.
This is a good thing and a bad thing. The good thing is that analysts don't have to wait on engineers to make data available; the data is already there, and it's being loaded on a regular basis, sometimes even in real-time. They can use new technologies such as dbt, which is basically a transformation framework that allows analysts to write SQL queries, transform the data, maintain version control, and take other steps that make them more productive. That is a big shift.
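A toy illustration of the ELT pattern, again with SQLite standing in for the warehouse and made-up event data. In practice the SQL transformation step would be the kind of model a tool like dbt manages and version-controls:

```python
import sqlite3

# ELT: extract and load the raw data first, transform later inside the
# warehouse. SQLite stands in for the warehouse; the events are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_email TEXT, event TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [("a@example.com", "signup"),
     ("a@example.com", "purchase"),
     ("b@example.com", "signup")],
)

# The "T" step runs after loading: a SQL model reshapes raw events
# into an analysis-ready table.
conn.execute("""
    CREATE TABLE user_activity AS
    SELECT user_email, COUNT(*) AS event_count
    FROM raw_events
    GROUP BY user_email
""")

activity = conn.execute(
    "SELECT * FROM user_activity ORDER BY user_email"
).fetchall()
print(activity)  # [('a@example.com', 2), ('b@example.com', 1)]
```

Because the raw events are already in the warehouse, the analyst can rewrite this SQL model at any time without asking an engineer to re-extract anything.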
The downside is there's a lot of data being stored, and there are challenges. When there's too much data, there are more requests; analysts are constantly trying to fulfill them, and they're all overburdened. They have no excuse, because all the data is there, and they have to constantly transform the data, build models, and make sense of it.
Reverse ETL is taking the transformed data from the warehouse and putting it back into third-party tools. That's why it's called reverse ETL: you're doing the opposite of ETL. You're still extracting data, but you're extracting it from your warehouse and loading it into your third-party tools; the transformation has already taken place. So it's not exactly ETL; it starts from data that is already there.
It's mainly done to make clean, actionable data available in downstream systems where data is eventually activated: think marketing automation tools, CRMs, sales engagement tools, and all your advertising channels like Facebook, where you want to send data about your users. This can happen in near real-time if you need it to, depending on the technology.
There are real-time databases and reverse ETL tools that allow you to move this data in near real-time, and reverse ETL tools that connect to your typical data warehouse and move data in a batch process on a schedule. All of this boils down to moving data from a source system into a target system. Because organizations are collecting more and more data, and storing and moving data is becoming cheaper, all these different technologies are coming up that enable organizations to move data fast and in a lot of ways.
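A sketch of the reverse ETL motion: read already-transformed rows out of the warehouse and push each one to a downstream tool. The API call here is a hypothetical stand-in, not any real endpoint:

```python
import sqlite3

# Reverse ETL: read already-transformed rows from the warehouse and
# push each one to a downstream tool on a schedule.
def push_to_marketing_tool(record):
    # Hypothetical stand-in for a real API call, e.g.
    # requests.post("https://api.example-tool.com/contacts", json=record)
    return record

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_activity (email TEXT, event_count INTEGER)")
conn.execute("INSERT INTO user_activity VALUES ('a@example.com', 2)")

# Batch sync: every warehouse row becomes one call to the downstream tool.
synced = []
for email, event_count in conn.execute(
        "SELECT email, event_count FROM user_activity"):
    synced.append(push_to_marketing_tool(
        {"email": email, "event_count": event_count}))

print(f"synced {len(synced)} record(s)")  # synced 1 record(s)
```

A real reverse ETL tool adds the hard parts on top of this loop: field mapping per destination, rate limiting, retries, and change detection so only new or updated rows are sent.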
Kelly: That was a great overview of the landscape. I'd be curious how you think product integrations from SaaS companies tie into that picture. A good product integration, and really any product integration, should be doing that for you. It's extracting the data from Salesforce, transforming it, and then loading it into HubSpot, for example. Ideally, you have an integration, and as the business user, you don't need to do anything. Presumably your data analysts are still pulling data out of both Salesforce and HubSpot independently, but for that particular motion, they don't have to do anything either.
You also see more and more business systems integrating to data warehouses directly. You can take your Salesforce data and go straight into a data warehouse once you set it up. How are businesses looking at that? Are they saying, this is a good thing that we can get some of the work off of our plate and enable some of the business users to not need us as much?
What's your take on the interplay of internal data management and analytics and the SaaS product integrations?
Kelly: We all know they're proliferating; there are more and more SaaS integrations being offered and used. Where that's going to end up, who knows, but we have seen a trend.
Arpit: This goes back to my experience at Integromat; it was really about empowering the business user to connect all of these systems. If there was a native integration, you would use it, but if there wasn't, you would use Integromat to make that connection. The way I like to think about this is that there's a divide between data teams and business teams, like GTM and marketing teams. For a data team, data is really there to answer questions. They want the data to be properly formatted, properly transformed, and made available in the warehouse in a certain manner, so they can derive insights from the data, build reports, and answer people's questions.
Whereas for a business user, it's not just about answering questions, data is also a tool to build better customer experiences. That's why integration is so important. It's about building better customer experience, it's also about improving internal workflows. A lot of internal workflows are improved when you connect the systems together.
When you connect these systems together, you're making sure that when data is generated in System A, it’s automatically moved to System B and available in System B. Whereas without an integration, you'd have to manually extract data from System A, typically as a CSV file, and then upload it into System B, which is inefficient and error-prone.
These integrations are extremely important for the SaaS companies building them. Even from a company's strategic point of view, it's really important to offer good integrations, because that is one of the best ways to ensure customer retention: customers that use a lot of integrations are unlikely to switch from your product to your competitor's product. The role that tools like Integromat play here is also very important. A lot of users will first look at the native integration a SaaS tool offers; if I think the integration will do my job, I'll definitely use it. There is no fun in building the integration yourself if one is already available.
These native integrations solve the more common use cases, so they're not super deep. If you have a more customized or more advanced use case, that's when you use tools like Integromat or Zapier. Typeform is a very good example: they have really neat integrations that are super quick to set up. Doing the same thing using third-party tools is possible, but it's a waste of time; you end up spending more money, and things can break. I hope that answers your question: native integrations and the tools that connect SaaS products are extremely important to the business persona.
If you talk about the data person, they would actually prefer that these point-to-point integrations didn't exist, because there is a downside when a lot of them exist. The data team is tasked with taking all of the data from all of the sources and dumping it into the warehouse for other purposes. This is a struggle, because then they find a lot of data is duplicated and a lot of data is missing.
It just increases work for them. It just boils down to what a business's priorities are today. If you want to enable your business teams to move faster and get things done faster and not rely on external people, then you want to give them these capabilities. You want to give them access to these tools that allow them to do more without relying on other teams. I'm a huge proponent of empowering our business teams and GTM teams.
My background is in go-to-market; I don't really have a background in data per se. I have personally benefited a lot from having access to these tools. Being able to connect, integrate, and build workflows has enabled me to experiment. If you think about it, no one really gets it right on the first attempt. They build an integration, run it, benefit from it, and then try to optimize and improve it. You start with a very simple two or five step integration, and it can turn into a twenty step, a fifty step, or sometimes even a hundred step integration.
Kelly: I was intrigued that you said for data analysts it can be a downside; I want to unpack that a little more. If you have an integration set up directly between HubSpot and Salesforce, wouldn't that make the data in those two systems slightly more aligned than if there was no integration? From the perspective of the data analyst, who's probably pulling from both systems regardless of whether there's an integration, is that integration really adding complexity?
Arpit: It depends on how that integration between HubSpot and Salesforce has been set up. If it's a plain integration that is just literally moving leads from one to the other, it should not be a problem. But let's say I'm an advanced user, I'm using Integromat, and I'm doing some funky stuff: I'm transforming the data, changing the data format, or doing things that improve my own workflows by customizing the integration. I'm not just doing a one-to-one mapping. That can be challenging when data is extracted from these two systems and brought together into a centralized repository.
Sometimes it doesn't add up. Or I might want to move only the data from System A that matches certain rules to System B, which means that not all of the data will be available in System B. This leads to discrepancies: there might be more leads in Salesforce and fewer leads in HubSpot, and the stages might be set up totally differently or called something else.
Those things become a challenge when the analyst is tasked with building a report that takes data from these systems and consolidates everything, because it's hard for them to figure out the actual number of leads or the actual status of a particular lead: it shows one thing here and something else there. There can be many other challenges with these point-to-point integrations. Things tend to break, you don't always realize it, and you don't always get to fix it.
Kelly: I think it can, especially when you're using middleware, or when somebody has had a custom integration built independently, in which case you can introduce a lot more complexity and transformation. I also agree that since integrations do break, like any other software, you can end up in a situation where some of the data moved but some of it didn't, you thought it all did, and now you have an inaccurate data set.
One thing I've noticed in the last few years is business buyers becoming a lot more savvy about integrations, I think a few years ago, people in pre-purchase would just be asking, do you connect to Salesforce? Or do you connect to Zendesk? The salespeople would say, “Yeah, it's great!” Then they would find out six months later, that actually the integration wasn't very robust, or that it was through middleware and the middleware had a very simplistic offering in terms of what it could do between those two systems.
My sense is that businesses have started to get more savvy, because they've been burned, and they realize this is something that should have been scoped before purchase. I'd be curious if that's what you've seen in the last few years as well.
How are B2B customers evaluating integrations currently?
Arpit: That's a pretty common problem. As a SaaS company, unless you're in the business of providing integrations, integrations are a necessary evil, because they're a pain and they're hard. The hard work is not building the integrations; it's maintaining them: making sure they stay up to date, adhere to API changes, and that the customer is informed if something breaks. A lot of businesses neglect to do that.
The second challenge is that if you talk to engineers, they don't actually like to build integrations, because it's very monotonous work. At one point in time, I was trying to hire engineers to build more and more integrations for Integromat. It's pretty challenging, because people don't enjoy doing that.
To answer your question, it is a really common problem where the buyer wants a certain integration. Typically, they just look at an integration directory and say, “Your tool integrates with whatever tool I'm looking to use it with.” But there's a second challenge: just because A integrates with B doesn't mean it actually solves your needs. You also have to see how deep that integration is; often the integration will be very basic.
Let's say you want to integrate with a certain object from a system and there is an API. If you talk to the salesperson, they’ll say, “Yeah, totally possible. We have an API for that!” That doesn't mean they actually offer a native integration that pulls data from the endpoint and makes it available in your target systems.
There's a lot of gray area where everything is literally possible as long as there's an API. A lot of times APIs are available, but they're not publicly available. It's also a sales tactic to offer integration or offer an API as part of a specific deal where they might offer you an API that's not publicly available. Then it's on you to make the API work with whatever system you want to work with. I think this will continue to be a challenge until there is some universal standard or a unified API.
A lot of companies have tried to solve this problem, and it's a really, really hard problem. With a lot of companies, priorities change. I don't want to name names, but there have been large acquisitions in the data integration space, and when these acquisitions take place, a lot of the integrations that were on the roadmap disappear.
As a customer, you might be relying on a certain tool that works well with another tool you use, but all of a sudden it stops working, and they stop updating that integration. They literally tell you to go talk to Vendor B, and Vendor B says, “Oh, we didn't build the integration, talk to Vendor A.” These problems won't ever really go away, so as a buyer, it's important that you understand your integration needs sooner rather than later.
It's not enough to say that I want to integrate A with B. You also need to understand what exactly you want to do when you say you want to integrate. What data do you want to move? How do you want to make the data available? Do you want the data to move in real-time? Do you want to move it in batches on a schedule? Or do you just want to move it once in a while? These sorts of questions need to be asked of vendors when you're evaluating tools, which will hopefully make things less painful for you as a buyer in the long run.
Kelly: SaaS companies need to ask that as well. SaaS companies often look to the native integrations they're building as a way to increase retention. That does work, and we've seen it, but the problem is that if you build an integration that doesn't address your customer’s use case, they may sign a contract thinking they can do something, and ultimately it can even increase churn if the experience with a key system is bad enough. SaaS companies need to really scope out those use cases, treat the integration like a product feature, and do the research to make sure they're actually meeting the needs of the user, instead of just saying they connected two systems.
You mentioned this before in terms of one of the problems around having hundreds of systems, each with their own data models, that adds difficulty and complexity. We've seen with external APIs, they've moved very heavily towards using REST, and OAuth 2.0. That's happened organically. It's partly for the reason you're saying, which is, it's hard to find developers to work on this. If you can use a standard that a lot of developers already know, you are going to get more people willing and able to work with your API.
I've seen some companies crop up in the product integration space calling themselves unified, or universal, APIs. Now, all the ones that I know of are vertical-specific, because it's obviously easier to unify data models around a set of systems that at least share underlying realities, versus trying to unify, say, QuickBooks and Salesforce, which do two pretty different things.
What's your take on the state of unified APIs right now?
Kelly: Where might they be able to go? Is it possible someone's gonna come up with a universal API? Or would that have to happen in an open source organic way versus a company being able to put it forth and have adoption?
Arpit: There have been attempts; a lot of companies have tried to solve this and attempted to build a universal standard, typically via an open-source route. I haven't really seen a successful unified API company. But there are a couple of things I'd like to mention. Stitch Data created a standard called Singer, which was basically a standard for companies to build integrations on top of, enabling the end user to very easily and quickly extract data from a lot of these third-party systems and do whatever they wanted with the data. It was an attempt to build an open-source framework, or standard.
If every company adopted it, then extracting data from different tools and actually unifying it would be a lot easier. But it didn't quite take off; eventually Stitch, which is an ETL company, was acquired by another ETL company, and the open-source project gradually died down. There's a newer company called Airbyte, also an ETL company, which says it will not only embrace the Singer standard but also create a standard of its own, enabling companies to build integrations with its product really fast and enabling its customers to integrate with all kinds of systems.
This is not exactly a unified API like what I'm referring to, because, like you said, a unified API seems like a pipe dream: all of these systems are so different, and their data models are so different. SaaS is also becoming more extensible, customers are becoming savvier, and they're personalizing and customizing tools a lot more. A unified API that could understand all of these systems, transform the data in real-time, and make the data available in all of these systems is, again, really challenging. I'm not really an expert here, because I haven't worked with a unified API company and have never used one, and I haven't come across a super successful one.
If you talk about data integration, there have been attempts, and people are still trying to build these standards, often open source. Airbyte is a two-year-old company; they've seen pretty good adoption, they have a pretty active community, and a lot of people are building these connectors for an array of tools. If you're using Airbyte, you can not only use one of the pre-built connectors, but, if you're familiar with the standard, you can actually build a connector pretty quickly, transform the data in your warehouse, and send it back to the destination or system where you want to utilize the data. It's still not exactly a unified API solution.
If someone can build that, it would be really amazing, but we'll see how it plays out.
Kelly: Definitely. Well, thank you so much for joining us today. I think that was a lot of awesome information about data. Is there a place people can go on social media to find you?
Where to connect with Arpit
Arpit: Yeah, I’m definitely pretty active on LinkedIn. I am also on Twitter at @ICanAutomate. In fact, listeners can go to astorik.com or ask.astorik.com, which is essentially a place for folks to get answers to their data questions directly from experts, and from me as well. I have my own community, and there are other experts who have their own micro-communities on astorik where folks can learn from them and ask them questions.
If you enjoyed this interview, check out our YouTube channel and subscribe for more content on all things APIs, integrations, and technology partnerships. If you're someone who is working on building and scaling SaaS product partnerships, we invite you to apply to be a member of our community and network with other leaders like Arpit at SaaSecosystemalliance.com.