Data Ingestion 101: The Most Overlooked Data Source Impacting Business Growth

Written by

Naresh Venkat

October 3, 2023

This is part 2 of our first mile data ingestion series. Part 1 covered why companies of all sizes are experiencing explosive growth in their data sources, volumes, and velocity. It outlined a brief history of the tools and apps complicating the landscape and detailed the big shift in first mile data ingestion.

In this article, we’ll explore the four most common sources of data and investigate why we commonly overlook customer, vendor, and partner data as a business-critical data source.

Executive Summary

Customers need to use your product, and you need to input their data to make that happen. The challenge: Ingesting clean customer data into your operational systems at scale.
You need seamless data migration, but ingesting your customer’s data introduces potential points of failure that are typically beyond the control and scope of a data team. You must get engineering and development teams involved.
This presents a product infrastructure problem. Engineering teams attempt to solve recurring data onboarding problems by dedicating extra hours to data transformation fixes and by building and maintaining in-house solutions, to no avail.
Ingesting customer data is riddled with gnarly problems that are very difficult to solve at scale. If your company is to successfully deliver your product’s value, you must invest more time and effort in solving data’s first mile problem.

The Four Most Common Data Sources

Much of our decision-making time is spent analyzing the wealth of information we take in. Innovations like Cloud and IoT (Internet of Things) have created a datafied world that accelerates the flow of data both from consumers to businesses and between the apps companies use to collaborate and maintain operational efficiency. An Appen State of AI Report asserted that in 2020 data teams spent as much as 25% of their time on data ingestion. Given the explosive growth in modern cloud data warehouses and in the data ingestion market overall, this should come as no surprise.

Ingesting clean data is a huge operational challenge. Every business receives data in a variety of formats and schemas. The most common data sources fall into four broad categories.

Figure: Four common sources of data

1. Internal data

Your organization's processes capture internal data. Internal data comes from a variety of sources and departments, including sales reports, financial documents, human resources data, and so on. It could be customer transactions or employees' wages. A machine-generated data collection system or an item's sensors or software may have recorded the data. You’ll typically find this data stored in databases, operational systems (e.g., CRM, ERP), or in system log files.

2. Third-party analytics

Third-party web analytics services can provide cost-effective collection and analysis of your website's performance over time or in comparison to averages across the provider's customer base. The most well-known are web analytics like Google Analytics.

3. External data

External data can range from historical demographic data to market prices, from weather conditions to social media trends. It’s used by organizations to analyze and model economic, political, social, and environmental factors that affect their operations.

4. Open data

You can access open data without charge or restrictions on how you use it. However, if the data is too detailed or too heavily summarized, it might not be relevant to you. It also may not be in the format you need, or it may be very difficult to clean up. You can invest a lot of time in making open data usable, and all that work may still not pay off. Examples of open data sources include data.gov, the World Health Organization (WHO), the Open Science Data Cloud (OSDC), and Google Trends.

Customer Data: The Most Overlooked Data Source

One data source doesn’t garner the attention it deserves: customer, vendor, and partner data. This is the data that drives products and forms the basis of many B2B applications. For many companies, customer data fuels the product’s value. Once it lands in your production database, ERP, or CRM, it could be considered internal data.

For example, let’s say you're a customer engagement platform that ingests conversation data (transcripts and metadata for chats, emails, phone calls, etc.) and supplies analytical capabilities within that data set. But for your customers to get value from your platform, you must first quickly ingest their data into your product.

If your company is to successfully deliver your product’s value, you must invest more time and effort in solving data’s first mile problem.

Three trends have emerged from this new first mile paradigm:

Companies have an increasing number of interdependent “data relationships” with their customers and partners. In other words, customers need to be able to use your product, and you need their data populated in your product.
Companies are forced to figure out a solution for customer data onboarding.
To increase the number of data relationships, companies must simplify and automate their process. This results in a business-critical need for customer data exchange methods that are scalable, reliable, and flexible.

Ingesting clean customer data into your operational systems at scale is data’s true first mile problem. This gnarly web of interrelated challenges is exceedingly tricky to untangle at scale.

5 Gnarly Problems with Customer Data Ingestion

Customer data ingestion is painful for the vast majority of companies. The industry hasn't paid as much attention to this first mile problem because each company has unique needs, and historically, solutions have been costly at scale.

The data ingestion process can be plagued by schema changes, volume anomalies, and formatting errors, which then spread to your downstream production tables and business processes.
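Two of these failure modes, schema changes and volume anomalies, can be caught before a batch reaches production tables. The following is a minimal sketch rather than a definitive implementation: the function name check_batch, the tolerance parameter, and the idea of comparing against a "typical" row count are assumptions made for illustration.

```python
def check_batch(rows, expected_columns, typical_row_count, tolerance=0.5):
    """Return a list of human-readable problems found in one incoming batch.

    rows: list of dicts, one per record.
    expected_columns: column names the destination schema expects.
    typical_row_count: assumed baseline for this customer's batches.
    tolerance: allowed fractional deviation from the baseline (assumption).
    """
    problems = []

    # Collect every column name that appears anywhere in the batch.
    seen = set()
    for row in rows:
        seen.update(row.keys())

    missing = sorted(set(expected_columns) - seen)
    unexpected = sorted(seen - set(expected_columns))
    if missing:
        problems.append(f"schema change: missing columns {missing}")
    if unexpected:
        problems.append(f"schema change: unexpected columns {unexpected}")

    # Flag batches that are much smaller or larger than usual.
    low = typical_row_count * (1 - tolerance)
    high = typical_row_count * (1 + tolerance)
    if not low <= len(rows) <= high:
        problems.append(
            f"volume anomaly: got {len(rows)} rows, "
            f"expected about {typical_row_count}"
        )
    return problems
```

A batch that returns an empty list passes both checks. Anything else could be held back for review instead of flowing downstream.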

Problem #1 - N-to-1: Each customer is a different data source

Each customer and partner represents a different data source. There’s no industry standard for data exchange. So each customer wants to share their data in their particular shape and format, which doesn’t match your destination’s schema.

Problem #2 - Data ingestion mechanism

The mechanism for data ingestion varies for each organization involved. Some want to use SFTP or S3, while others prefer email attachments or CSV uploads.

Now, think of all the different methods, files, syntaxes, and systems used to share data—CSV, JSON, TSV, XML, APIs, data warehouses, emails, CRMs, and ERPs. You have little leverage to get the other party to change anything about their process.

Figure: Data ingestion process with varying data ingestion mechanisms
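To make the variety concrete, here is a hedged sketch of normalizing three of the file formats mentioned above (CSV, TSV, and JSON) into one common shape, a list of dictionaries. The function name parse_payload is hypothetical, and the assumption that JSON payloads contain either one object or a list of objects is ours. Real customer files will break these assumptions in other ways.

```python
import csv
import io
import json


def parse_payload(text, fmt):
    """Parse raw file contents into a list of dict records.

    fmt is a hypothetical format label: "csv", "tsv", or "json".
    """
    fmt = fmt.lower()
    if fmt in ("csv", "tsv"):
        delimiter = "," if fmt == "csv" else "\t"
        # DictReader uses the first row as the column headers.
        reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
        return [dict(row) for row in reader]
    if fmt == "json":
        data = json.loads(text)
        # Assumption: a payload is either one object or a list of objects.
        return data if isinstance(data, list) else [data]
    raise ValueError(f"unsupported format: {fmt}")
```

Every additional format or delivery mechanism adds another branch like these, which is part of why the problem grows with each new customer.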

Problem #3 - Customer data is inherently messy

As mentioned earlier, incoming customer data arrives in different shapes and formats, but it all needs to land in a few curated "golden datasets" in your system. You’re dealing with very messy data with no unified structure across businesses. Even if a customer's CSV imports successfully, the lack of standard formatting can render much of it useless.

Your operational or production system can quickly become riddled with errors, inaccuracies, duplicates, etc., when data isn't formatted or validated before being ingested.

Figure: Data transformation and data wrangling are a must
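As a rough illustration of validating before ingesting, the sketch below trims whitespace, rejects records with missing required fields, and drops duplicates by a key field. The function name clean_records, its parameters, and the rejection messages are assumptions for this example. It also assumes the key field is one of the required fields.

```python
def clean_records(records, required_fields, key_field):
    """Split records into (accepted, rejected) before ingestion.

    rejected is a list of (original_record, reason) tuples.
    Assumption: key_field is included in required_fields.
    """
    accepted, rejected = [], []
    seen_keys = set()

    for record in records:
        # Trim stray whitespace from string values.
        cleaned = {
            k: v.strip() if isinstance(v, str) else v
            for k, v in record.items()
        }

        # Reject records with empty or absent required fields.
        missing = [f for f in required_fields if not cleaned.get(f)]
        if missing:
            rejected.append((record, f"missing required fields: {missing}"))
            continue

        # Reject records whose key has already been accepted.
        key = cleaned.get(key_field)
        if key in seen_keys:
            rejected.append((record, f"duplicate {key_field}"))
            continue

        seen_keys.add(key)
        accepted.append(cleaned)

    return accepted, rejected
```

Keeping the rejected records with a reason, rather than silently dropping them, gives whoever owns the cleanup something concrete to send back to the customer.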

Problem #4 - Complex data cleanup process

The complexity of customer data cleanup spans both schema mapping (700 columns!) and dirty field contents ("Addresses missing zip codes").

Figure: Dirty data contents vs. how the clean data should look
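The "addresses missing zip codes" case can be sketched as a simple check. This example assumes US-style addresses written on a single line with the ZIP code (5 digits, optionally followed by a hyphen and 4 more) at the very end. The function name addresses_missing_zip is hypothetical, and real address data would need far more careful handling.

```python
import re

# Assumption: a US ZIP or ZIP+4 code appears at the end of the address line.
ZIP_AT_END = re.compile(r"\b\d{5}(?:-\d{4})?\s*$")


def addresses_missing_zip(addresses):
    """Return the addresses that do not end with a ZIP code."""
    return [a for a in addresses if not ZIP_AT_END.search(a)]
```

Anchoring the pattern to the end of the line avoids mistaking a five-digit street number for a ZIP code, at the cost of missing ZIP codes that appear mid-line.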

Ingesting customer data into a format and style that your systems can understand and use is no easy task. This is why companies rely on dev and data teams to support their data onboarding efforts.

These manual processes introduce errors, and without usable data, your product can't work properly. The result is a strong pull toward building in-house solutions, such as custom data importers and pipelines, which divert time away from strategic initiatives and jeopardize your product roadmap.

Figure: Deciding to build or buy

Problem #5 - Who’s dealing with the problem and who owns the problem?

Those one-off data cleaning requests have to go somewhere, and they usually fall to technical teams. Dev and product management teams get bogged down with tickets to import customer data, create custom solutions for high-level clients, or help clean up messy data before it's ingested into the company's operational system.

So your frontline teams tell customers, "Of course we can handle your data," knowing that it's going to take a lot of all-nighters, a lot of stress, significant cost, and possibly some missed deadlines to (hopefully) make that data usable.

Imagine your company is onboarding customer, vendor, or partner data via CSVs. Let’s say your customer’s CSV has a field called "PO #," but your schema requires a field called "Purchase Order No." The data won't import properly. On a partner’s CSV, you spot a field labeled DoB in the format MM/DD/YY, but it needs to be Birthdate formatted as YYYY-MM-DD. These different schemas create the need for validation, so someone from your team must manually transform the data before it can be ingested.
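The two transformations in this example can be sketched in a few lines. This is an illustration under stated assumptions, not a recommended importer: the HEADER_MAP dictionary and the function names are hypothetical, and the mapping covers only the two fields mentioned above.

```python
from datetime import datetime

# Hypothetical mapping from customer headers to the destination schema.
HEADER_MAP = {
    "PO #": "Purchase Order No.",
    "DoB": "Birthdate",
}


def to_iso_date(value):
    """Convert MM/DD/YY to YYYY-MM-DD; raises ValueError on bad input."""
    return datetime.strptime(value, "%m/%d/%y").strftime("%Y-%m-%d")


def transform_row(row):
    """Rename known headers and reformat the birthdate field."""
    out = {}
    for header, value in row.items():
        # Unknown headers pass through unchanged.
        target = HEADER_MAP.get(header.strip(), header.strip())
        if target == "Birthdate":
            value = to_iso_date(value.strip())
        out[target] = value
    return out
```

For example, transform_row({"PO #": "4500123", "DoB": "07/04/85"}) yields {"Purchase Order No.": "4500123", "Birthdate": "1985-07-04"}. Two-digit years are themselves a data quality risk: Python's %y maps 69 through 99 to 1969 through 1999 and 00 through 68 to 2000 through 2068, so a birthdate of 05/17/50 would come out as 2050. Even a two-field mapping carries edge cases that someone has to own.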

Who manually cleans up these files? Your CS team? The customer? Product teams? How long before your engineering teams get involved?

Ingesting customer, vendor, and partner data creates potential points of failure that extend beyond the control and scope of a data team - typically falling more into the hands of engineering as a product infrastructure problem.

Engineering teams attempt to solve recurring data onboarding problems. Unfortunately, solutions like those found in the Modern Data Stack (MDS) cause further downstream problems:

The more custom scripts and glue code you write, the more people, time, and money you'll need to maintain each solution. You've simply moved the problem to your dev teams, who now grapple with formatting errors, custom scripts, maintaining APIs, and building connectors.
Companies struggle to scale in-house customer data ingestion solutions as the company grows. Custom data uploaders require constant updating to handle the multiple data importing scenarios, files, and formats. Bespoke data pipelines take weeks to build and regular maintenance to keep data flowing smoothly.
Digging through documentation is cumbersome for customers and internal teams who want to upload their data quickly. That’s assuming you even have documentation that’s up-to-date and clearly written for customers.

Throwing more internal resources at onboarding customer and partner data manually is a losing game: the high cost isn't worth the slight bump in efficiency. Requiring developers and engineers to fix bugs and wrangle messy data pulls resources away from investments in new products or technology.

Learnings from this article

Ingesting customer data is riddled with gnarly problems that take considerable effort to solve at scale. Dealing with the various sources, mechanisms, messiness, and ownership is a challenging task. It requires careful cross-functional planning and resourcing to effectively solve these problems at any scale.

In the third post, we’ll showcase how companies attempt to solve these problems and why current solutions aren’t well-suited to handle customer data ingestion.

Read Part 3 in this series: Why Customer Data Ingestion Continues To Be a Painful Problem