FTP to BigQuery: How to Automate Data Ingestion in Minutes
We are seeing a generational change in how companies collaborate and work with one another to raise the tide for everyone. This collaborative working model brings with it the need for ongoing data sharing between the parties involved.
Let’s be honest—dealing with third-party data is a pain. You have no control over the schema, formats, or types. Worst of all, it’s never a one-time cleanup process; it needs constant engineering and operational upkeep.
At Osmos, we provide solutions that simplify onboarding external data into your operational systems. Osmos Pipelines, one of our products, is specifically designed to automate ingesting third-party data into those systems. It lets you transform and map your data in less than five minutes, making it simple to bring in cleaner, fresher data.
Let’s take a look at one of the most common scenarios companies deal with on a regular basis: a customer or partner sends you CSV data via FTP, and you need to ingest it into your operational system. The data arrives in a schema that's very different from the schema in your database. Additionally, you need to handle null characters, remove duplicates, and pre-process the file before reading it. In this blog, you'll see how to handle this scenario with Osmos Pipelines, in five minutes or less.
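Osmos handles this cleanup for you, but to make the pain concrete, here's a rough Python sketch of what handling null characters and duplicate rows manually might look like (the sample bytes and column names are hypothetical):

```python
import csv
import io

def clean_csv(raw_bytes: bytes) -> list[dict]:
    """Strip null characters and drop duplicate rows from raw CSV bytes."""
    # Embedded NUL characters break many CSV parsers, so remove them first
    text = raw_bytes.decode("utf-8", errors="replace").replace("\x00", "")
    rows = list(csv.DictReader(io.StringIO(text)))
    # Deduplicate while preserving the original row order
    seen, unique = set(), []
    for row in rows:
        key = tuple(row.items())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical sample: one duplicated row and one embedded null character
raw = b"sku,qty\r\nA1,5\r\nA1,5\r\nB\x002,3\r\n"
print(clean_csv(raw))  # [{'sku': 'A1', 'qty': '5'}, {'sku': 'B2', 'qty': '3'}]
```

This is exactly the sort of brittle glue code a pipeline tool lets you avoid writing and maintaining.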
This is my setup for this use case:
- Data is in CSV format in an SFTP folder
- The operational database is Google BigQuery
- Field names and types are as indicated in Figure A
Step 1: Define the Source
The first step is to build a connector to the source system (in this case, the SFTP server). After selecting the source, fill in the connector name and the details needed to connect to the SFTP folder, such as the host, port number, and credentials.
You can follow the instructions on the right pane of the page to create the connector, or refer to our docs page.
Dealing with CSVs often requires dealing with multiple edge cases—files without headers, files with special characters or null characters, or CSVs that are zipped. With Osmos, handling these edge cases is as simple as configuring the connector to handle each unique situation.
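With the connector, these cases are checkboxes; for comparison, here's a stdlib-only sketch of handling two of them by hand, reading a zipped, headerless CSV and supplying the column names yourself (the file and field names below are hypothetical):

```python
import csv
import io
import zipfile

def read_zipped_headerless_csv(zip_bytes: bytes, fieldnames: list[str]) -> list[dict]:
    """Read the first CSV inside a zip archive, supplying the missing headers."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        name = zf.namelist()[0]
        text = zf.read(name).decode("utf-8")
    # Passing fieldnames tells DictReader the file has no header row
    return list(csv.DictReader(io.StringIO(text), fieldnames=fieldnames))

# Build a hypothetical zipped, headerless CSV in memory
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("orders.csv", "1001,widget,4\n1002,gadget,7\n")
rows = read_zipped_headerless_csv(buf.getvalue(), ["order_id", "product", "qty"])
print(rows[0])  # {'order_id': '1001', 'product': 'widget', 'qty': '4'}
```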
More information about these advanced options can be found on our docs pages. Once all the details are provided, test and save the connector.
Step 2: Define the Destination
Your data needs a destination. In this example, you want to save the data in a table within the Google BigQuery data warehouse. We offer two modes of connecting to BigQuery: connection using OAuth and connection using a service account.
After selecting the destination, you need to provide the project ID, dataset, and the table to write the data to. Once all the details are provided, test and save the connector.
All instructions to create the connector can be found in the side panel or on the docs pages.
Step 3: Data Mapping
You have the source and the destination defined. Now, it's time for data mapping. When assigning data to fields in the destination system, you need to ensure both structural integrity (the structure of the data is accurate) and type integrity (the data type is accurate).
Structural integrity can be accomplished in two ways with Osmos:
- Direct source to destination column mapping
- Transforming the data first, and then mapping the transformed output to destination
Data transformation can further be accomplished in two ways within Osmos:
- Transforming the data using simple formulas before mapping the columns
- Transforming the data using examples, so the system can auto-generate the necessary transformations
Data type integrity can be ensured using in-path Validations. Osmos natively understands the data types of the destination system. For example, if the user tries to enter a string in a number field, Osmos warns the user and prompts them to enter only acceptable values. Osmos also understands the date and time formats of the underlying systems and makes a best attempt to auto-transform dates and times into the format accepted by the destination system.
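Osmos performs these checks natively. As a rough stdlib approximation of the same idea, the sketch below rejects a value that won't fit a numeric field and normalizes a couple of common date layouts to ISO 8601 (the accepted input formats here are assumptions, not a list Osmos publishes):

```python
from datetime import datetime

def validate_number(value: str) -> float:
    """Reject values that cannot be stored in a numeric destination field."""
    try:
        return float(value)
    except ValueError:
        raise ValueError(f"{value!r} is not acceptable for a number field")

def normalize_date(value: str) -> str:
    """Try a few common input formats and emit ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%m/%d/%Y", "%d-%b-%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

print(validate_number("42.5"))        # 42.5
print(normalize_date("03/15/2024"))   # 2024-03-15
print(normalize_date("15-Mar-2024"))  # 2024-03-15
```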
In the video above, you can see all the methods outlined above in action:
- Extracting the EAN value from a combined EAN-UPC string and assigning it to the numeric productID field in the destination system, using examples
- Extracting the first name from the full name using just one example
- Extracting and transforming dates using the datetime() formula
- And so on...
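For reference, the first two transformations above can be approximated in a few lines of Python. In Osmos you would derive these from examples rather than write code, and the combined-string layout assumed here (a 13-digit EAN, then a slash, then the UPC) is hypothetical:

```python
def extract_ean(combined: str) -> int:
    """Pull the EAN out of a combined EAN-UPC string.

    Assumes an 'EAN/UPC' layout; the real layout depends on
    what your partner actually sends.
    """
    ean = combined.split("/")[0]
    return int(ean)  # the numeric productID field expects a number

def first_name(full_name: str) -> str:
    """Take the first whitespace-separated token as the first name."""
    return full_name.split()[0]

print(extract_ean("4006381333931/036000291452"))  # 4006381333931
print(first_name("Ada Lovelace"))                 # Ada
```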
For more details on transformation methods and supported formulas, please refer to our documentation.
Step 4: Schedule and Automate the Pipeline
With data mapping complete, you can now save the pipeline as either a “run-on-demand” pipeline or a “scheduled” pipeline. If set up as a “run-on-demand” pipeline (the “No scheduled runs” option), the pipeline can be manually triggered at any time from the pipeline details page. If set up as a “scheduled” pipeline, you choose the time, day, and frequency at which you want the pipeline to run, and it will run automatically on that schedule.
Step 5: Save the Pipeline
That’s it, folks! You just automated the process of ingesting messy data into your database without writing a line of code, in less than five minutes. In part two of this blog, I will talk about how to handle exceptions and errors from data handling.
Simplify Your B2B Data Ingestion
Using data onboarding tools to ingest external data does more than simply save time. It creates exponential value for businesses by cutting down the cost of managing data, freeing up your technical team, and allowing more time to transform data into insights that can impact your business.
Get hands-on. Explore how Osmos Pipelines can help your onboarding teams ingest data from external partners, without writing a line of code.