Data Ingestion: A Brief History of the First Mile Problem
Have you purchased a book from Amazon? Do you use Google Maps to get from one location to another? Have you seen "Ted Lasso" on Apple TV? Ever uploaded a file to Box? Or taken a Lyft to get around town?
A product is only as good as the user data it collects. From navigation software to smart cars, and virtual assistants to robotic vacuums, we generate a ton of useful data that goes into improving almost every product we use today.
It seems that every time we do something, even a seemingly banal activity like watching a cat video on YouTube or purchasing a new book on Amazon, our data gets collected and improves future processes. As a result, companies recognize the importance of having a streamlined data ingestion process to deliver successful business outcomes.
But how does all that data get collected and ingested? For the sake of this article, let’s walk through a brief history of data ingestion solutions over the last decade.
2010 to 2015: The advent of cloud computing and growth of ETL solutions
The 2010s saw an explosion of new cloud and SaaS solutions as companies started moving workloads off premises and cloud giants emerged.
The three biggest cloud providers were all underway by 2010. Amazon Web Services (AWS) rolled out in 2006, followed by Google's App Engine in 2008 and Microsoft's Azure Cloud Services in 2010. Their primary focus was on offering a web application hosting platform on which to create apps.
And cloud adoption took off. According to Gartner, global spending on public cloud is expected to end at $600 billion in 2023, almost eight times higher than the $77 billion where it began the decade. Software-as-a-service (SaaS), which comprises over 30% of that revenue, is expected to reach $208 billion in 2023.
Companies like Netflix, Dropbox, and Uber all exist because of the cloud. Cloud-native services provide businesses with near-instant access to services that would have taken months to develop on premises. The number of cloud tools went through the roof as thousands of new SaaS, gaming, and tech companies were born in the cloud.
Engineering teams needed a way to combine all these new cloud-based data sources like ads, ERP, CRM, and payment to help calculate their cost of acquisition. With the shift away from on-premise, the adoption of ETL (extract-transform-load) solutions accelerated, as building and maintaining data pipelines in-house began to make less and less sense.
This new paradigm gave businesses deeper insight into their legacy data, enabled visualization on top of a unified view, and enabled the rise of the data analyst, who enjoyed this new playground of sanitized data.
Advantages of ETL
There are many use cases for ETL in modern applications where security, customization, and data quality are priorities.
ETL is ideal for applications where anonymization and security are key. Organizations in the healthcare, government, and finance industries all benefit since they’re subject to compliance regulations, like HIPAA or GDPR.
When destinations like databases require data to align to a specific schema, ETL is a good fit. That’s because prior to loading, transformations clean and align the incoming data by ironing out the wrinkles brought on by incompatibility.
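To make the "transform before load" idea concrete, here is a minimal sketch in Python. It assumes a hypothetical customers table, hypothetical raw field names ("ID", "Email", "Signup Date"), and uses an in-memory SQLite database as a stand-in for the destination. The unsalted SHA-256 hash is only a placeholder for whatever anonymization your compliance requirements actually call for.

```python
import hashlib
import sqlite3


def transform(record: dict) -> tuple:
    """Clean one raw record and align it to the destination schema."""
    # Hash the normalized email so the raw address never reaches the destination.
    # (Illustrative only: real anonymization usually needs more than a bare hash.)
    email = record["Email"].strip().lower()
    email_hash = hashlib.sha256(email.encode("utf-8")).hexdigest()
    # Normalize a M/D/YYYY date string to ISO 8601 (YYYY-MM-DD).
    month, day, year = record["Signup Date"].split("/")
    signup_date = f"{year}-{int(month):02d}-{int(day):02d}"
    return (int(record["ID"]), email_hash, signup_date)


def run_etl(raw_records: list, conn: sqlite3.Connection) -> None:
    """Extract is assumed done; transform every record, then load."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        "customer_id INTEGER PRIMARY KEY, "
        "email_hash TEXT NOT NULL, "
        "signup_date TEXT NOT NULL)"
    )
    rows = [transform(r) for r in raw_records]  # transform happens before load
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()


# Hypothetical raw input with messy casing, whitespace, and date formats.
raw = [
    {"ID": "1", "Email": " Ada@Example.com ", "Signup Date": "3/7/2021"},
    {"ID": "2", "Email": "grace@example.com", "Signup Date": "11/23/2021"},
]
conn = sqlite3.connect(":memory:")
run_etl(raw, conn)
loaded = conn.execute(
    "SELECT customer_id, signup_date FROM customers ORDER BY customer_id"
).fetchall()
print(loaded)  # [(1, '2021-03-07'), (2, '2021-11-23')]
```

Note that the destination only ever sees the cleaned, schema-aligned rows, which is both the appeal of ETL and the source of the drawbacks discussed next.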
Additionally, ETL is great for operationalizing transformed data (aka Reverse ETL). Examples include pushing data to your BI tools for analysis, building ML algorithms, and sending data to SaaS systems like Salesforce.
The Drawbacks of ETL
However, there are some drawbacks to ETL. It’s inherently customizable, which gives you more control over data quality, but at the expense of time and effort. It can take months to put into place. There’s also the overhead cost of a dedicated engineering team to oversee maintenance and keep up with changing requirements.
A few other limitations to consider:
- ETL requires data analysts to know the end result ahead of time so they can stitch together the data prior to loading into the data warehouse.
- Building ETL pipelines frequently goes beyond the technical capabilities of data analysts. Hence the rise in the number of analytics engineers.
- ETL is not ideal for near real-time or on-demand data access, where a fast response is required.
- Analysts only see the ‘clean data’ and not the raw data.
2015 to 2020: ETL leaders make moves, the growth of modern cloud data warehouses, and the ELT paradigm
The enterprise data integration landscape changed considerably during this period, due to a number of factors. First, the enterprise data integration market leaders changed hands, went public, and expanded their product portfolios:
- Tibco was taken private by private equity firm Vista Equity Partners in December 2014 for $4.3 billion.
- Informatica was taken private by Canada Pension Plan Investment Board and Permira in a $5.3 billion deal in April 2015.
- Talend went public in July 2016. They snapped up Stitch, a fast-growing, self-service data integration company, in a $60 million deal in 2018.
- Alteryx went public in March 2017.
The growth of modern cloud data warehouses
Since 2015, two big trends have coalesced: the rise of modern cloud data warehouses and the emergence of the ELT paradigm. Cloud data warehouses such as BigQuery, Redshift, and Snowflake quickly became the best place to consolidate data. They offer significantly lower costs for data computation and storage than traditional data warehouses. For example, AWS has dropped prices 107 times as of August 2021.
The fast growth of cloud data warehouses has also led to new jobs. The graph above shows the percentage of companies mentioning these products in their job postings over the last two years.
In an effort to modernize data storage and analytics processes, data ingestion shifted to ELT (extract-load-transform) solutions. With ELT, data is transformed after it has been loaded into a data warehouse or data lake.
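The difference in ordering is easier to see in code. The sketch below is a minimal illustration, not a production pipeline: it uses an in-memory SQLite database as a stand-in for a cloud data warehouse, and the raw_orders table and paid_orders view are hypothetical names. The raw records are loaded untouched, and the transformation runs afterwards as SQL inside the "warehouse".

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse

# Extract + Load: land the raw records as-is, inconsistent strings and all.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, status TEXT)")
raw_orders = [
    ("1001", "19.99", "PAID"),
    ("1002", "5.00", "refunded"),
    ("1003", "42.50", "paid"),
]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

# Transform: happens later, inside the warehouse, as SQL over the raw table.
conn.execute(
    """
    CREATE VIEW paid_orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL)      AS amount
    FROM raw_orders
    WHERE LOWER(status) = 'paid'
    """
)

paid = conn.execute(
    "SELECT order_id, amount FROM paid_orders ORDER BY order_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(paid)       # [(1001, 19.99), (1003, 42.5)]
print(raw_count)  # 3
```

Because the raw table is still there after the transformation, analysts can go back to it and define new transformations as questions change, which is exactly what the ETL approach makes difficult.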
In ELT, the biggest change is the move towards commodity servers with commodity power and memory. The price of commodity servers has dropped by over 50% since 2015, providing a huge boost to cloud data warehouses. This means that you can now run many smaller data warehouses on a single server at a fraction of the cost of traditional data warehouses. They are also becoming more powerful and efficient, which means they can be used to handle large amounts of data.
The most recent innovation in ELT technology is the use of GPUs, both for accelerating parallel computation on large datasets and for deep learning processing. Both techniques are taking off in cloud data warehouses, but only recently have they been combined into a single algorithm to reduce cost and increase performance.
The algorithms are based on sparse grids using deep neural networks running within multiple GPUs or CPU cores, allowing them to process multiple datasets at once without slowing down or requiring more than one core per dataset. This results in higher throughput and better performance than any other method currently available.
So what does this all mean? It means that cloud data warehouses are now able to compete with traditional systems from both sides of the trade-off between cost, performance and scalability: high-performance traditional systems for workloads that require large volumes of data (e.g., searching databases), low-cost but highly scalable systems for workloads that require small volumes (e.g., statistical and machine learning algorithms) and high-performance systems where the workload is not that demanding (e.g., deep learning).
The ELT paradigm makes a lot of sense, especially when dealing with large volumes of data. Cloud environments offer significant computing power at a lower cost, enabling data analysts to transform data as needed within the data warehouse.
Popular solutions that have emerged to enable this new architecture include:
- Fivetran and Airbyte for EL
- dbt for T
- BigQuery, Redshift, and Snowflake for the data warehouse
The stage is set for a paradigm shift in operational efficiency and the explosion of new cloud-native companies.
Matt Turck, VC at FirstMark, dubbed this explosive growth of Machine Learning, AI, and Data solutions the MAD landscape (see image below).
2020 - Current: More M&A and the big shift in first mile data ingestion
Over the last few years, the data ingestion market saw some significant activity. Here are a few examples:
- Thoma Bravo takes Talend private in March 2021
- Vista Equity Partners starts exploring a sale of Tibco in June 2021
- Matillion raises $150M Series E funding at a $1.5B valuation in September 2021
- Informatica returns to the public market in October 2021
- Alteryx acquires Trifacta in February 2022
- Airbyte acquires Grouparoo in April 2022
Even with these big moves and acquisitions, the number of ETL players continues to grow. Look at this G2 Grid® for ETL Tools.
With so many solutions available and the gap in differentiation getting smaller, these ETL solutions remain primarily focused on moving and processing internal data, such as moving data from Salesforce to your data warehouse. But what about ingesting data from outside the company walls?
The big shift in first mile data ingestion
There is a problem in supply chain management known as the first-mile/last-mile problem. The first mile is the distance raw materials travel from their extraction location to the processing location in your supply chain. The last mile problem, on the other hand, describes the challenge of getting the finished product from the shipping depot to the customer. Your data supply chain is also plagued by these issues.
In 2018, Mike Tuchen, then CEO of Talend, described the first mile of data ingestion as “pulling the data together, cleaning it up, and making it right” during a Mad Money interview.
And this has been the general definition of the first mile problem of data ingestion: pulling data from a variety of sources, cleaning it up (if using ETL), and ingesting it into your data warehouse. In practice, this might look like using Fivetran to bring LinkedIn ads data into Snowflake for transformation with dbt.
Put another way, the first mile problem focuses on getting data out of operational data sources and into the data warehouse for analysis.
The first mile problem shifts upstream
There are a variety of operational data sources, such as ERP, CRM, databases, Google Sheets, CSVs, and third-party data, that get piped into the data warehouse. But what's the process for ingesting data into these operational sources? For example, what's the process for ingesting customer data into your product to power your app?
The challenge here is the new “first mile problem”: getting data into operational sources after it has been translated into a meaningful format.
Thus, the first mile problem shifted upstream. Ingesting clean data into your operational sources is now the first mile problem. The previous first mile is now the middle mile and the last mile is activating your transformed data in your SaaS application (reverse ETL).
No company operates in a data silo. The average organization employs more than 130 applications, a figure that is increasing by 30% year over year. If your company is to successfully deliver your product’s value, then you must invest more time and effort in solving the first mile problem, especially when you're ingesting hundreds to thousands of external data sources (e.g., customer and partner data) to power your products.
As a whole, the industry hasn't paid as much attention to this first mile issue because it’s painful to solve at scale. It’s plagued by schema changes, volume anomalies, and late deliveries, which then spread to your downstream warehouse tables and business processes. These external data sources represent potential points of failure that are beyond the control and scope of a data team, typically falling more into the hands of engineering as a product infrastructure problem.
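To give a feel for what guarding against those three failure modes involves, here is a rough sketch of a per-delivery check. Everything in it is an assumption made for illustration: the expected column set, the typical batch size, the tolerance, the delivery deadline, and the check_delivery function itself are hypothetical, and a real system would derive them per source rather than hard-code them.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical expectations for one external data source.
EXPECTED_COLUMNS = {"customer_id", "email", "plan"}
EXPECTED_ROWS = 1000              # typical batch size
VOLUME_TOLERANCE = 0.5            # flag batches more than 50% off the typical size
MAX_DELAY = timedelta(hours=24)   # how late a delivery may arrive


def check_delivery(columns, row_count, delivered_at, expected_at):
    """Return a list of human-readable issues found in one external delivery."""
    issues = []
    missing = EXPECTED_COLUMNS - set(columns)
    extra = set(columns) - EXPECTED_COLUMNS
    if missing or extra:
        issues.append(
            f"schema change: missing={sorted(missing)} extra={sorted(extra)}"
        )
    if abs(row_count - EXPECTED_ROWS) > VOLUME_TOLERANCE * EXPECTED_ROWS:
        issues.append(
            f"volume anomaly: got {row_count} rows, expected about {EXPECTED_ROWS}"
        )
    if delivered_at - expected_at > MAX_DELAY:
        issues.append(
            f"late delivery: {delivered_at - expected_at} after the expected time"
        )
    return issues


# A delivery with a renamed column, far too few rows, arriving 30 hours late.
expected = datetime(2022, 5, 1, 6, 0, tzinfo=timezone.utc)
issues = check_delivery(
    columns=["customer_id", "email_address", "plan"],
    row_count=120,
    delivered_at=expected + timedelta(hours=30),
    expected_at=expected,
)
for issue in issues:
    print(issue)
```

Multiply this by hundreds or thousands of external sources, each with its own schema and delivery habits, and it becomes clear why the problem is painful to solve at scale.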
Despite substantial investments in on-premises systems, infrastructure clouds, and application clouds, data silo issues still plague businesses, preventing them from realizing the full value of their data. Data ingestion now takes up roughly one-quarter of our time, and first mile data ingestion receives far less "attention" than we believe it deserves.
In future posts, we’ll break down the steps to solving this new first-mile problem and why current solutions aren’t well-suited to handle it.