Data Ingestion 101: Why Customer Data Ingestion is Still So Painful

Written by

Naresh Venkat

October 3, 2023

This is part 3 of our first mile data ingestion series. Part 2 covered the most common sources of data. It outlined why customer data, the first mile of your data ingestion process, is often overlooked by companies.

The time-consuming and complex customer data ingestion process often prevents businesses from developing a cohesive strategy to solve the problem. Some companies don’t have the resources or technical expertise to properly develop a comprehensive approach that incorporates all the variance around data sources, data types, and required transformations. Others may be reluctant to invest in the necessary infrastructure and tools needed to create a customer data ingestion strategy.

In this blog, we will explore the typical paths most companies take when trying to solve the first mile problem and the pitfalls they experience along the way.

Executive Summary

Maintaining and managing API clients is tedious and expensive.
Companies end up resorting to the most cost-effective solution: CSV exports.
This creates a situation where technical employees must handle data transformation.
This requires data teams, engineers, and developers to complete the data ingestion process.
Businesses lose time and money by:
- Trying to throw people at the data ingestion problem,
- Using tools that weren’t built for the job,
- Building expensive tools that prove unreliable.
Avoid the status quo bias by developing a robust customer data ingestion strategy that empowers any employee, technical or not, to take data ingestion head-on.

How is data shared today?

Before we start dissecting the various approaches companies take when trying to solve the first mile of data ingestion, we must first understand how data is shared between organizations.

How data is shared between organizations — Less than 10% of organizations have truly adopted API integrations for data exchange. More than 80% of the data businesses share is in the form of CSVs and Excel files.

From talking to hundreds of CTOs, it is evident that more than 80% of the data that is shared is in the form of CSVs and Excel files. Less than 10% of businesses have truly adopted API integrations for data exchange. The reason is simple: while almost every modern service has APIs, writing, maintaining, and managing API clients is a tedious effort requiring expensive software engineers. So, most companies resort to CSV exports, shared via FTP, S3, and email. Lastly, EDI (Electronic Data Interchange) is still prevalent, specifically for supply chain and healthcare data. Approximately 10% of data sharing happens via EDI.

How Companies Try to Solve Customer Data Ingestion Today

Based on our experience working with hundreds of companies, approaches often fall into three buckets:

Approach #1 - Throw people at the problem

This is the default option for most companies, especially for the CSV/Excel data ingestion scenarios. The people in this scenario tend to be the externally facing teams who interact with your customers, vendors, or partners – typically customer/partner success managers, data analysts, implementation specialists, or professional services teams.

There are many reasons why companies do this:

Data is “too messy,” and the ingestion process involves removing inconsistencies, correcting errors, ensuring accuracy, and enforcing data integrity. This often needs a pair of human eyes and expertise to understand the context and business logic.
Even if the above tasks were to be automated somehow, the volume and variety of data across customers is often broad and inconsistent, making automation across customers next to impossible.
The process requires customer involvement, such as back-and-forth clarifications. Someone customer-facing ends up owning the process.
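
To make the cleanup work above concrete, here is a minimal Python sketch of the checks a person typically performs by eye in Excel. Everything in it is an assumption for illustration: the sample file, the column names (`email`, `signup_date`, `plan`), the two accepted date formats, and the deliberately naive email check are all hypothetical, and real customer files usually need many more rules.

```python
import csv
import io
from datetime import datetime

# Hypothetical customer export: stray whitespace, mixed casing, mixed date formats.
RAW_CSV = """email,signup_date,plan
 Alice@Example.com ,2023-01-15,pro
bob@example.com,01/20/2023,Basic
not-an-email,2023-02-01,pro
carol@example.com,sometime in March,basic
"""

# Assumed date formats; a real file may contain others.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(value):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_rows(text):
    """Split rows into (clean, rejected); rejected entries carry a line number and reason."""
    clean, rejected = [], []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        email = row["email"].strip().lower()
        date = parse_date(row["signup_date"])
        if "@" not in email:  # naive check, for illustration only
            rejected.append((line_no, "invalid email"))
        elif date is None:
            rejected.append((line_no, "unrecognized date"))
        else:
            clean.append(
                {"email": email, "signup_date": date, "plan": row["plan"].strip().lower()}
            )
    return clean, rejected


clean, rejected = clean_rows(RAW_CSV)
print(clean)
print(rejected)
```

The rejected rows are the ones that still need a human: someone has to go back to the customer and ask what "sometime in March" means, which is why a customer-facing person usually ends up owning the process.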

The problem with this approach is not the “who” or the “why,” but the “how”. What tools and processes will be used to solve the problem? The ubiquitous tool for this problem ends up being Excel, and a typical process looks something like this:

What it looks like to ingest messy customer data: when you throw people at the problem, it can get messy

This ineffective tool + process combination results in two problems:

You can't scale your business unless you hire more people to keep up with the manual data cleanup.
Your customer experience suffers because of long onboarding times due to the time it takes to manually fix human errors.

Approach #2 - Use the tools we’ve already invested in

Too often, businesses invest significantly in technology to gain a competitive advantage. Then, they try to make that technology work for every adjacent problem they encounter.

The “Modern Data Stack,” or MDS, has received a lot of attention in the data community recently. The MDS is a set of solutions that revolve around bringing ALL of your data into the data warehouse and dealing with the cleanup, validation, etc., in the data warehouse. This “Stack” involves not one, not two, but often ten different tools glued together using Python scripts or similar. ELT (Extract, Load, Transform) tools, dbt (data build tool), and Reverse ETL tools are among the many pieces that are often needed to make this approach work.

Leveraging MDS tools to solve customer data ingestion takes the problem to another extreme: it requires a highly technical team with expertise in SQL and Python, plus multiple expensive tools that quickly eat up your budget.

There are many pitfalls to this approach:

The problem is not solved. It is shifted: ELT (Extract, Load, Transform) solutions, the tools that actually suck the data into the data warehouse, are basically giant data vacuum cleaners. They are typically limited in terms of validation, cleanup, and transformation functionality and have difficulty accommodating more complex customer data scenarios. ELT tools throw messy customer data directly into your data warehouse, moving the problem from your externally facing teams to your data teams. The validation, cleanup, and transformation of the messy data then happen in your data warehouse.
Over-reliance on expensive engineering resources: Thanks to ELT tools shifting the problem to your warehouse, the only people who can now handle this problem are those who have knowledge of databases, tables, SQL, and Python — your engineering teams.
Lack of business understanding: While proficient in SQL and Python, these engineering teams are often not proficient in the business context and logic required to handle customer data in particular. This results in back and forth between your customer teams and engineering teams, which is a headache and a waste of time.
Not scalable: While the tools of MDS are very scalable and capable of handling petabyte-scale data, the process itself is not scalable. Every data ingestion pipeline ends up being a snowflake (pun intended) of its own. So if you have 1,000 customers, you need an army of engineers building, monitoring, and maintaining thousands of these data pipelines.
Expensive: If it is not obvious, this approach is costly across various dimensions. First, the multiple tools of MDS come with hefty price tags. Second, the cost of the engineering resources and training required is high. Third, the over-reliance on engineering slows down business agility and degrades the customer experience, which leads to costly churn when customers have paid for a product but are unable to use it effectively.
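
To illustrate the “snowflake” point above, here is a small Python sketch of what per-customer variance looks like in code. The customer names (`acme`, `globex`), their column headers, and the internal schema (`email`, `name`) are all hypothetical. The point is only that each customer's file needs its own mapping before it fits one internal shape, and that someone has to write and maintain every one of those mappings.

```python
import csv
import io

# Hypothetical per-customer header mappings onto one internal schema.
COLUMN_MAPS = {
    "acme": {"Email Address": "email", "Full Name": "name"},
    "globex": {"contact_email": "email", "customer": "name"},
}


def normalize(customer, text):
    """Rename a customer's columns to the internal schema; fail on unmapped headers."""
    mapping = COLUMN_MAPS[customer]
    reader = csv.DictReader(io.StringIO(text))
    unknown = set(reader.fieldnames or []) - set(mapping)
    if unknown:
        # A renamed or added column in the customer's file breaks the pipeline here.
        raise ValueError(f"{customer}: unmapped columns {sorted(unknown)}")
    return [{mapping[col]: val.strip() for col, val in row.items()} for row in reader]


acme_rows = normalize("acme", "Email Address,Full Name\na@acme.test,Ann Lee\n")
globex_rows = normalize("globex", "customer,contact_email\nBo Chen,b@globex.test\n")
print(acme_rows)
print(globex_rows)
```

With two customers this is trivial. With 1,000 customers, each changing their exports on their own schedule, the mapping table and its failure cases become a maintenance job in their own right.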

Approach #3 - Duct-tape together an in-house solution

Developing an in-house solution for customer data ingestion has some advantages. You can build to your exact specifications and requirements and keep control of the tooling infrastructure. This sometimes works well if you need to build an EDI connector or an API integration with your customer.

Build vs. buy considerations — Considerations to make before building or buying a customer data ingestion solution

That said, custom, in-house solutions work well until they don’t. Writing custom Python scripts, building data uploaders, and maintaining data pipelines are all possible. It’s just software, after all. Then there are these considerations:

It can get expensive: Building a data ingestion service internally can be a costly and time-consuming process. It requires expertise, resources, time, and money to build, test, and deploy a solution that meets the customer's requirements. This project often ends up taking one year or more of your roadmap.
It’s unreliable: The resulting solution is unlikely to be as reliable, scalable, or secure as those offered by established vendors whose sole purpose is to solve the customer data ingestion problem.
Ongoing commitment: The cost of an internal solution doesn’t stop at the significant upfront investment. It needs continuous care and maintenance and often turns into a mini product that you need to maintain for years.
Focus dilution: Finally, building in-house can shift the focus away from core product development, which could ultimately lead to a decline in core product quality, customer service, and overall profitability.

The Status Quo Bias

Status quo bias is the preference to maintain your current situation instead of taking action that may change your current state.

According to Harvard Business Review, over half of business leaders say they rarely challenge their status quo, which leads to employees who unconsciously feel the need to conform. Why? Many leaders feel pressed for time. They may have difficulties prioritizing. Others have a more systemic issue of an "if it ain't broke, don't fix it" culture.

Unfortunately, “that’s the way we’ve always done it” is the attitude many organizations take toward customer data ingestion. Change is hard. It’s easy to push back against any new process or technology change, worrying that it might adversely change the state of affairs. But if your teams are working late nights and weekends manually wrangling spreadsheets, maintaining broken pipelines, or fixing bugs, it’s time for a change.

What’s the best customer data ingestion strategy?

When it comes to finding a solution for customer data ingestion, you must strike a balance between people, processes, and technology.

Empower the right team to handle it independently without over-reliance on engineering teams. We recommend externally facing teams, like customer success, professional services, implementation teams, or data analysts.

Pick a solution that is fit for the job. If you end up with the wrong tools, it can shift the problem and lead to more headaches and maintenance down the line.
Let your engineering teams do what they do best: build amazing things. Writing countless CSV and Excel parsers would waste a substantial amount of their time and your money.

In our next post, Part 4 in the series, we’ll break down the top factors to look for when evaluating a customer data ingestion solution: Capability, Usability, and Self-Serviceability.