Data Ingestion

The Most Overlooked Data Source Impacting Business Growth

Written by 
JD Prater
October 27, 2022

This is part 2 of our first mile data ingestion series. Part 1 covered why companies of all sizes are experiencing an explosive growth in their data sources, volumes, and velocity. It outlined a brief history of the tools used to ingest data and the big shift in first mile data ingestion.

We’ve definitely seen a shift not only in how companies ingest data, but also in the first mile data ingestion problem itself. With data ingestion now taking roughly one-quarter of data teams’ time, where does all this data come from? What types of data are we ingesting?

Four Most Common Sources of Data

Our world has become datafied. Like the Matrix, data is all around us. We know that this data is business critical, but where does it come from? What are the sources of data? The most common sources of data can be bucketed into four broad categories.

Let’s explore these categories and provide a few examples.

4 common sources of data

1. Internal data

Your organization's processes capture internal data. It comes from a variety of sources and departments, including sales reports, financial documents, and human resources data. It could be customer transactions or employees' wages, or it might be recorded automatically by a machine-generated data collection system or by a product's sensors and software. You’ll typically find this data stored in databases, operational systems (e.g., CRM, ERP), or in system log files.

2. Third-party analytics

Third-party analytics services can provide cost-effective collection and analysis of your website's performance over time, or in comparison to averages across the provider's customer base. The best-known examples are web analytics tools like Google Analytics.

3. External data

External data can range from historical demographic data to market prices, from weather conditions to social media trends. It’s used by organizations to analyze and model economic, political, social, and environmental factors that affect their operations.

4. Open data

You can access open data without charge or restrictions on how you use it. However, the data may be too detailed or too summarized to be relevant to you. It may also not be in the format you need, or may be very difficult to clean up, so a lot of time can go into making it usable. A few examples include the World Health Organization (WHO), the Open Science Data Cloud (OSDC), and Google Trends.

The Most Overlooked Data Source is Your Customer’s Data

But there’s another source that doesn’t get the attention it deserves: your customer’s data. For many companies, customer data fuels the product’s value. It could be considered internal data once it’s in your production database, ERP, or CRM.

For example, let’s say you're a customer engagement platform that ingests conversation data (transcripts and metadata for chats, emails, phone calls, etc.) and provides analytical capabilities on top of that data set. Before your customers can get value from your platform, you must first ingest their data into your product in a timely manner.

If your company is to successfully deliver your product’s value, then you must invest more time and effort in solving data’s first mile problem.

Three trends emerged from this new first mile paradigm:

  1. Companies have an increasing number of interdependent “data relationships” with their customers and partners. In other words, customers need to be able to use your product and you need their data populated in your product.
  2. Companies are forced to figure out a solution for customer data exchange.
  3. To increase the number of data relationships, companies must simplify and automate their process. This results in a business critical need for customer data exchange methods that are scalable, reliable, and flexible.

This is data’s true first mile problem: ingesting clean customer data into your operational systems at scale. It’s a gnarly ball of interrelated problems that's difficult to untangle.

5 Gnarly Problems with Customer Data Ingestion

Customer data ingestion is painful for the vast majority of companies. As a whole, the industry hasn't paid as much attention to this first mile problem, because each company has unique needs and it’s costly at scale.

The process is plagued by schema changes, volume anomalies, and formatting errors, which then spread to your downstream production tables and business processes. 

Problem #1 - N:1: Each customer is a different data source

Each customer and partner represents a different data source. There’s no industry standard for data exchange, so each customer shares their data in their own particular shape and format, and it rarely matches your destination’s schema.

Customer data ingestion is an N:1 problem
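One common way to tame the N:1 problem is a per-customer mapping layer that renames each source’s columns into one canonical schema. The sketch below assumes hypothetical customer names, column names, and canonical fields; it illustrates the pattern, not any particular product’s implementation.

```python
# Sketch: normalizing per-customer schemas into one canonical schema.
# Customer names, column names, and canonical fields are hypothetical.
import csv
import io

CANONICAL_FIELDS = ["customer_id", "order_date", "amount"]

# Each customer ships the same information under different column names.
CUSTOMER_MAPPINGS = {
    "acme": {"CustID": "customer_id", "Date": "order_date", "Total": "amount"},
    "globex": {"client_ref": "customer_id", "ordered_on": "order_date", "amt_usd": "amount"},
}

def normalize(customer: str, raw_csv: str) -> list:
    """Rename one customer's columns to the canonical schema."""
    mapping = CUSTOMER_MAPPINGS[customer]
    rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        # Keep only mapped columns, renamed to canonical names.
        rows.append({mapping[col]: value for col, value in row.items() if col in mapping})
    return rows

print(normalize("acme", "CustID,Date,Total\n42,2022-10-27,19.99\n"))
# → [{'customer_id': '42', 'order_date': '2022-10-27', 'amount': '19.99'}]
```

The mapping dictionaries grow with each new customer, which is exactly why this approach strains at scale.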

Problem #2 - Data exchange mechanism

The mechanism for data exchange varies for each organization involved. Some want to use SFTP, S3, email attachments, or CSV uploads. 

Now think of all the different methods, files, syntaxes, and systems used to share data—CSV, JSON, TSV, XML, APIs, data warehouses, emails, CRMs, and ERPs. You have very little leverage to get the other party to change anything about their process.
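In practice this usually means funneling every exchange mechanism into one internal record shape. Here is a minimal sketch, assuming just two formats (CSV and JSON) and hypothetical field names; a real system would add many more parsers for SFTP drops, API payloads, and so on.

```python
# Sketch: accepting the same records over two common exchange formats
# and parsing both into one internal row shape. Payloads are hypothetical.
import csv
import io
import json

def rows_from_csv(payload: str) -> list:
    """Parse CSV text into a list of dicts keyed by header."""
    return list(csv.DictReader(io.StringIO(payload)))

def rows_from_json(payload: str) -> list:
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(payload)

def ingest(payload: str, fmt: str) -> list:
    """Dispatch on the exchange format; every path yields the same shape."""
    parsers = {"csv": rows_from_csv, "json": rows_from_json}
    return parsers[fmt](payload)

# Two customers, two mechanisms, one internal representation:
assert ingest("sku,qty\nA1,3\n", "csv") == ingest('[{"sku": "A1", "qty": "3"}]', "json")
```

The dispatch table is the leverage point: downstream code only ever sees the normalized rows, no matter which mechanism the customer insisted on.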

Problem #3 - Customer data is inherently messy

As mentioned earlier, incoming customer data arrives in different shapes and formats, but it all needs to land in a few curated "golden datasets" in your system. You’re dealing with very messy data with no unified structure across businesses. Even if a customer's CSV is successfully imported, the lack of standard formatting renders a lot of it useless.

Your operational or production system can quickly become riddled with errors, inaccuracies, duplicates, etc. when data isn't formatted or validated before being ingested. 


Without the team or tools to manage the data onboarding process, it's difficult for customers to realize the value of your product. CRMs, inventory management software, ERPs, and product lifecycle management software all need clean data coming in to execute properly.

Problem #4 - Complex data cleanup process

The complexity of customer data cleanup spans schema mapping (700 columns!) and dirty contents in the fields ("addresses missing zip codes").
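Field-level checks run before ingestion are one way to catch dirty contents early. The sketch below flags the "addresses missing zip codes" case mentioned above; the rule, field names, and sample rows are illustrative assumptions.

```python
# Sketch: pre-ingestion field validation, e.g. catching missing or
# malformed zip codes before rows reach production tables.
# The field name "zip" and the US-only rule are illustrative assumptions.
import re

ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")  # US ZIP or ZIP+4

def dirty_rows(rows: list) -> list:
    """Return (row index, problem) pairs instead of silently ingesting."""
    problems = []
    for i, row in enumerate(rows):
        zip_code = (row.get("zip") or "").strip()
        if not ZIP_RE.match(zip_code):
            problems.append((i, f"bad zip: {zip_code!r}"))
    return problems

print(dirty_rows([{"zip": "94105"}, {"zip": ""}, {"zip": "9410"}]))
# → [(1, "bad zip: ''"), (2, "bad zip: '9410'")]
```

Surfacing a list of problems per file gives whoever owns cleanup (CS, the customer, or engineering) something concrete to act on.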


Ingesting customer data into a format and style that your systems can understand and use is no easy task, which is why companies rely on dev and data teams to support their data onboarding efforts.

These manual processes introduce errors. Without usable data, it's impossible for your product to work properly. This causes teams to build solutions, such as custom data importers and pipelines, that wind up taking more time away from strategic initiatives and sacrificing your product roadmap.

Problem #5 - Who’s dealing with the problem and who owns the problem? 

All of those one-off data cleaning requests have to go somewhere, and they usually fall to technical teams. Dev and product management teams get bogged down with tickets to import customer data, create custom solutions for high-level clients, or help clean up messy data before it's ingested into the company's operational system.

So your frontline teams tell customers, "Of course we can handle your data," knowing that it's going to take a lot of all-nighters, a lot of stress, significant cost, and some potentially missed deadlines to (hopefully) be able to work with the data they have.

Imagine your company is onboarding customer data via CSVs. Let’s say your customer’s CSV has a field called "PO #," but your schema requires a field called "Purchase Order No." The data won't import properly due to the different schemas, and without validations someone from your team must manually transform the data before it’s ingested. 
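The "PO #" mismatch above can be caught mechanically with a header alias table: known variants map to the schema's canonical name, and anything unrecognized is rejected for manual review rather than silently mis-imported. The alias table and error handling below are illustrative assumptions, not a prescribed design.

```python
# Sketch of the "PO #" example: map known header aliases to the schema's
# canonical name and reject files whose headers can't be resolved.
# The alias table and ValueError handling are illustrative assumptions.
import csv
import io

HEADER_ALIASES = {
    "PO #": "Purchase Order No.",
    "PO Number": "Purchase Order No.",
    "Purchase Order No.": "Purchase Order No.",  # already canonical
}

def validate_headers(raw_csv: str) -> list:
    """Resolve a CSV's headers to canonical names, or raise on unknowns."""
    reader = csv.reader(io.StringIO(raw_csv))
    headers = next(reader)
    unknown = [h for h in headers if h not in HEADER_ALIASES]
    if unknown:
        raise ValueError(f"Unmapped columns need manual review: {unknown}")
    return [HEADER_ALIASES[h] for h in headers]

print(validate_headers("PO #\n12345\n"))  # → ['Purchase Order No.']
```

Raising on unmapped columns turns a silent import failure into an explicit hand-off, which at least makes the "who cleans this up?" question visible.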

Who manually cleans up these files? Your CS team? The customer? Product teams? How long before your eng teams get involved?

Ingesting your customer’s data introduces potential points of failure that are beyond the control and scope of a data team; it typically falls to engineering as a product infrastructure problem.

Engineering teams attempt to solve these recurring data onboarding problems by dedicating extra hours toward data wrangling, building and maintaining in-house solutions, and creating in-depth documentation for non-technical teams.

Unfortunately, these solutions cause further downstream problems. 

  1. The more custom scripts and glue code you write, the more people, time, and money you'll need to maintain each solution. You moved the problem to your dev teams. They’ll grapple with formatting errors, custom scripts, and maintaining APIs and building connectors.
  2. Companies struggle to scale in-house customer data ingestion solutions as the company grows. Custom data uploaders require constant updating to handle the multiple data importing scenarios, files, and formats. And bespoke data pipelines take weeks to build and regular maintenance to keep data flowing smoothly.
  3. Digging through documentation is cumbersome for customers and internal teams who want to quickly upload their data. That’s assuming you even have documentation that’s up-to-date and clearly written for customers.

Throwing more internal resources at manually onboarding customer and partner data is a losing game: the high cost isn't worth the small bump in efficiency. Having developers and engineers fix bugs and wrangle messy data takes resources away from investing in new products or technology.


In short, ingesting customer data is riddled with gnarly problems that are very difficult to solve at scale. Dealing with the full variety of sources, mechanisms, messiness, and ownership is no easy task. It requires careful cross-functional planning and resourcing to solve these problems effectively.

In the third post, we’ll showcase how companies attempt to solve these problems and why current solutions aren’t well-suited to handle customer data ingestion.

Should You Build or Buy a Data Importer?

But before you jump headfirst into building your own solution, make sure you consider these eleven often overlooked and underestimated variables.

