Data Ingestion

Why Customer Data Ingestion Continues To Be a Painful Problem

Written by Naresh Venkat
December 8, 2022

This is part 3 of our first mile data ingestion series. Part 2 covered the most common sources of data and outlined why customer data, the first mile of your data ingestion process, is often overlooked by companies.

When it comes to customer data ingestion, most companies don’t have a cohesive strategy because building one is a time-consuming and complex process. Some companies lack the resources or technical expertise to develop a comprehensive strategy that accounts for all the variation in data sources, data types, and required transformations. Others are reluctant to invest in the infrastructure and tools needed to create a customer data ingestion strategy.

In this blog, we are going to explore the typical paths most companies take and the pitfalls they run into.

How is data shared today?

Before we start dissecting the various approaches companies take, let’s first understand how data is shared between organizations.

How data is currently shared between organizations

From talking to hundreds of CTOs, it is quite evident that more than 80% of shared data takes the form of CSV and Excel files. Fewer than 10% of companies have truly adopted API integrations for data exchange. The reason is simple: while almost every modern service has APIs, writing, maintaining, and managing API clients is a tedious effort that requires expensive software engineers. Hence most companies end up resorting to CSV exports shared via FTP, S3, and email. And last but not least, specifically for supply chain and healthcare data, EDI (Electronic Data Interchange) is still a very popular option, accounting for approximately 10% of data sharing.
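To make that dominant pattern concrete, here is a minimal sketch of a CSV hand-off over S3, assuming the boto3 and pandas packages; the bucket name, file key, and expected layout are hypothetical.

```python
# A minimal sketch of the most common hand-off: a customer drops a CSV into a
# shared S3 bucket and someone on your side pulls it down and loads it.
# The bucket, key, and file layout below are hypothetical.
import boto3
import pandas as pd

BUCKET = "customer-drops"          # hypothetical shared bucket
KEY = "acme/2022-12/orders.csv"    # hypothetical customer file

def fetch_customer_csv() -> pd.DataFrame:
    s3 = boto3.client("s3")
    s3.download_file(BUCKET, KEY, "/tmp/orders.csv")
    # Every customer formats the file a little differently, so the real work
    # (renaming columns, fixing dates, handling encodings) starts here.
    return pd.read_csv("/tmp/orders.csv")

if __name__ == "__main__":
    df = fetch_customer_csv()
    print(df.head())
```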

How Companies Try to Solve Customer Data Ingestion Today

Based on our experience working with hundreds of companies, approaches often fall into three buckets:

Approach #1 - Throw people at the problem

This is the default option for most companies, especially for the CSV/Excel data ingestion scenarios. The people in this scenario tend to be the externally facing teams who interact with your customers, vendors, or partners – typically customer/partner success managers, data analysts, implementation specialists, or professional services teams. 
There are many reasons why companies do this:

  1. Data is “too messy” and the ingestion process involves removing inconsistencies, correcting errors, ensuring accuracy, and enforcing data integrity. This often requires a pair of human eyes and the expertise to understand the context and business logic.
  2. Even if those tasks could somehow be automated, the volume and variety of data across customers are so broad and inconsistent that automation across the customer base is next to impossible.
  3. The process requires customer involvement, such as back-and-forth clarifications, so someone customer-facing ends up owning it.

The problem with this approach is not the “who” or the “why,” but the “how”: which tools and what processes will be used to solve the problem? The ubiquitous tool ends up being Excel, and the process ends up looking something like this:

When you throw people at the problem, it can get messy

This ineffective tool + process combination results in two problems:  

  1. You can't scale your business unless you hire more people to keep up with the manual data cleanup.
  2. Your customer experience suffers because onboarding drags on while human errors are found and fixed by hand.

Approach #2 - Use the tools we’ve already invested in

“To a man with a hammer, everything looks like a nail.” - Abraham Maslow 

Too often, businesses make large investments in technology to gain a competitive advantage. Then they try to make that technology work for every adjacent problem they encounter. 

The “Modern Data Stack,” or MDS, has been gaining a lot of attention in the data community recently. The MDS is a set of solutions that revolves around bringing ALL of your data into the data warehouse and handling the cleanup, validation, and so on inside the warehouse. This “stack” involves not one, not two, but often ten different tools glued together with Python scripts or similar. ELT (Extract, Load, Transform) tools, dbt (data build tool), and Reverse ETL are among the many pieces often needed to make this approach work.

Leveraging MDS tools to solve customer data ingestion takes the problem to the other extreme: one that requires an extremely technical team with expertise in SQL and Python, plus multiple expensive tools that eat up your budget very quickly.

Using MDS tools to solve customer data ingestion only shifts the problem

There are many pitfalls to this approach:

  1. The problem is not solved, it is shifted: ELT (Extract, Load, Transform) solutions, the tools that actually suck the data into the data warehouse, are basically giant data vacuum cleaners. They are typically limited in terms of validation, cleanup, and transformation functionality and have difficulty accommodating more complex customer data scenarios. ELT tools throw messy customer data directly into your data warehouse, moving the problem from your externally facing teams to your data teams. The validation, cleanup, and transformation of the messy data then happens in your data warehouse (a sketch of the kind of glue this produces follows this list).
  2. Over-reliance on expensive engineering resources: Thanks to ELT tools shifting the problem to your warehouse, the only people who can now handle it are those who know databases, tables, SQL, and Python. In other words, your engineering teams.
  3. Lack of business understanding: These engineering teams, while proficient in SQL and Python, are often not well versed in the business context and logic required to handle customer data in particular. This results in back-and-forth between your customer teams and engineering teams, which is a headache and a waste of time.
  4. Not scalable: While the tools of the MDS are very scalable and capable of handling petabyte-scale data, the process itself is not. Every data ingestion pipeline ends up being a snowflake (pun intended) of its own. So if you have 1,000 customers, you need an army of engineers building, monitoring, and maintaining thousands of these data pipelines.
  5. Expensive: If it is not already obvious, this approach is expensive across several dimensions. First, the various MDS tools come with hefty price tags. Second, the engineering resources and training required are pricey. And third, the over-reliance on engineering slows down business agility and degrades the customer experience, which leads to costly churn when customers have paid for a product but are unable to use it effectively.
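To make pitfall #1 concrete, here is a minimal sketch of the kind of glue script that ends up owning the cleanup once an ELT tool has dumped raw customer files into the warehouse. It assumes a Snowflake warehouse reached through the snowflake-connector-python package; the table names, column names, and credentials are hypothetical placeholders.

```python
# A minimal sketch of "cleanup in the warehouse": raw customer rows landed by an
# ELT tool get repaired with SQL that only the data/engineering team can own.
# RAW_ORDERS, CLEAN_ORDERS, and the connection details are hypothetical.
import snowflake.connector

CLEANUP_SQL = """
CREATE OR REPLACE TABLE CLEAN_ORDERS AS
SELECT
    TRIM(customer_id)                        AS customer_id,
    TRY_TO_DATE(order_date, 'MM/DD/YYYY')    AS order_date,
    TRY_TO_NUMBER(REPLACE(amount, '$', ''))  AS amount
FROM RAW_ORDERS
WHERE customer_id IS NOT NULL;
"""

def run_cleanup() -> None:
    # Placeholder credentials; in practice these come from a secrets manager.
    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="...",
        warehouse="TRANSFORM_WH", database="ANALYTICS", schema="STAGING",
    )
    try:
        conn.cursor().execute(CLEANUP_SQL)
    finally:
        conn.close()

if __name__ == "__main__":
    run_cleanup()
```

Multiply this by every customer's quirks and you get the pipeline sprawl described in pitfall #4.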

Approach #3 - Duct-tape together an in-house solution

Building an in-house solution for customer data ingestion has some advantages. Being able to build to your exact specifications and requirements while maintaining the tooling infrastructure internally are the biggest benefits. This sometimes works well if you need to build an EDI connector or an API integration with your customer. 

Considerations to make before building or buying a customer data ingestion solution

That said, custom, in-house solutions work well…until they don’t. Writing custom Python scripts, building data uploaders, and maintaining data pipelines are all possible (it’s just software after all, and a sketch of how such a script typically starts follows this list), but…

  1. It can get expensive: Building a data ingestion service internally is a costly and time-consuming process. It requires expertise, resources, time, and money to build, test, and deploy a solution that meets all of your customers’ requirements. This project often ends up taking a year or more of your roadmap.
  2. It’s unreliable: The resulting solution is unlikely to be as reliable, scalable, or secure as those offered by established vendors whose sole purpose is to solve the customer data ingestion problem.
  3. Ongoing commitment: Building an internal solution doesn’t stop with the large upfront investment. It needs ongoing care and maintenance, and often turns into a mini product of its own that you need to maintain for years.
  4. Focus dilution: Finally, building in-house can shift the focus away from core product development, which could ultimately lead to a decline in core product quality, customer service, and overall profitability.
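For context, here is a minimal sketch of how these in-house scripts usually begin, assuming plain Python and only the standard library; the required columns and validation rules are hypothetical. The trouble is rarely this first version, but the per-customer branches, retries, alerting, and file-format quirks that pile on top of it.

```python
# A minimal sketch of an in-house CSV validator: one file, a handful of
# hard-coded rules. Column names and rules are hypothetical examples.
import csv
import sys

REQUIRED_COLUMNS = {"customer_id", "order_date", "amount"}

def validate(path: str) -> list[str]:
    errors = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return [f"missing columns: {sorted(missing)}"]
        for i, row in enumerate(reader, start=2):  # row 1 is the header
            if not row["customer_id"].strip():
                errors.append(f"row {i}: empty customer_id")
            try:
                float(row["amount"].replace("$", "").replace(",", ""))
            except ValueError:
                errors.append(f"row {i}: bad amount {row['amount']!r}")
    return errors

if __name__ == "__main__":
    problems = validate(sys.argv[1])
    print("\n".join(problems) or "OK")
```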

The Status Quo Bias

"If you always do what you've always done, you'll always get what you've always got." - Henry Ford

Status quo bias is the preference to maintain your current situation as opposed to taking action that may change your current state.

According to Harvard Business Review, over half of business leaders say they rarely challenge their status quo, which leads to employees who unconsciously feel the need to conform. Why? Many leaders feel pressed for time or have difficulty prioritizing, and others have a more systemic "if it ain't broke, don't fix it" culture.

Unfortunately, “that’s the way we’ve always done it” is the attitude many organizations take toward customer data ingestion. Change is hard. It’s easy to push back against any new process or technology, worrying that it might adversely change the state of affairs. But if your teams are working late nights and weekends manually wrangling spreadsheets, maintaining broken pipelines, or fixing bugs, it’s time for a change.

Do you have a customer data ingestion strategy?

You must strike a balance between people, processes, and technology when it comes to finding a solution for customer data ingestion. 

  • Empower the right team to handle it on their own, without over-reliance on engineering teams. We recommend externally-facing teams, like customer success, professional services, implementation teams, or data analysts.
  • Pick a solution that is fit for the job. The wrong tools simply shift the problem and lead to more headaches and maintenance down the line.
  • Let your engineering teams do what they do best: build amazing things. Writing countless CSV and Excel parsers would be a substantial waste of their time and your money. 

In the next post, we’ll break down the top factors to look for when evaluating a customer data ingestion solution: Capability, Usability, and Self-Serviceability.

Should You Build or Buy a Data Importer?

But before you jump headfirst into building your own solution, make sure you consider these eleven often overlooked and underestimated variables.


Naresh Venkat

Co-founder and COO