
Big Data Glossary: 101 Data Terms You Should Know

Written by JD Prater, August 25, 2021

It's pretty critical that teams learn how to speak each other's language. While internal teams have some shared vocabulary, there are plenty of data terms that get thrown around in meetings that leave people scratching their heads. So we thought it was time to create a blog post that could serve as a holistic data glossary: one that not only defines each term, but also offers some helpful resources in case you want to learn about them in more depth.

Instead of throwing hundreds of terms at you from other glossaries, we narrowed this one down to the top 101 data terms that are imperative to know for anyone working with data. We hope you can bookmark this post and come back to it whenever you need to.

101 Definitions of Common Data Terms

1) API - An application programming interface (API) is a set of defined rules and protocols that explain how applications talk to one another.

2) Artificial Intelligence - Artificial intelligence (AI) is a wide-ranging branch of computer science concerned with building smart machines capable of performing tasks commonly associated with intelligent beings.

3) Batch Processing - the running of high-volume, repetitive data jobs that can run without manual intervention and are typically scheduled to run as resources permit. A batch process has a beginning and an end.

4) Big Data - refers to data that is so large, fast, or complex that it's difficult or impossible to process using traditional methods, and that continues to grow rapidly over time.

5) BigQuery - Google's fully managed, serverless data warehouse that enables scalable analysis over petabytes of data by running fast SQL queries on the processing power of Google's infrastructure.

6) CSV - A Comma-Separated Values (CSV) file is a delimited text file in which information is separated by commas, and is common in spreadsheet apps.
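To make the CSV definition concrete, here is a minimal sketch using Python's standard csv module. The sample data and column names are made up for illustration. It also shows why a real parser beats splitting on commas: a quoted value can contain a comma of its own.

```python
import csv
import io

# Hypothetical sample data; the column names are illustrative only.
raw = 'name,city\nAda,London\nGrace,"Arlington, VA"\n'


def parse_csv(text):
    """Return each row as a dict keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text)))


rows = parse_csv(raw)
for row in rows:
    print(row["name"], "|", row["city"])
```

The quoted "Arlington, VA" value stays in one field, which a naive split on commas would break apart.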

7) Customer Data Onboarding - the process of ingesting online and offline customer data into a product’s operational system(s) in order to successfully use that product.

8) Customer Data Platform (CDP) - software that consolidates, integrates, and structures customer data from a variety of touchpoints into a single database, creating a unified customer view so marketing teams have the relevant insights needed to run campaigns.

9) Data Analytics - the science of analyzing raw data in order to make conclusions about that information.

10) Data Architecture - defines how information flows in an organization for both physical and logical data assets, and governs how they are controlled. The goal of data architecture is to translate business needs into data and system requirements and to manage data and its flow through the enterprise.

11) Data Augmentation - a technique to artificially increase the amount of training data from existing training data without actually collecting new data.

12) Data Capture - the process of collecting information and then converting it into data that can be read by a computer.

13) Data Catalog - a neatly detailed inventory of all data assets in an organization. It uses metadata to help data professionals quickly find, access, and evaluate the most appropriate data for any analytical or business purpose.

14) Data Center - a large group of networked computer servers typically used by organizations to centralize their shared IT operations and equipment like mainframes, servers and databases.

15) Data Cleanroom - a secure, protected environment where PII (Personally Identifiable Information) data is anonymized, processed, and stored to give teams a secure place to bring data together for joint analysis based on defined guidelines and restrictions.

16) Data Cleansing - the process of preparing data for analysis by amending or removing incorrect, corrupted, improperly formatted, duplicated, irrelevant, or incomplete data within a dataset.
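As a rough illustration of data cleansing, the sketch below fixes formatting, drops incomplete records, and removes duplicates. The record fields ("name", "email") and the choice of email as the duplicate key are assumptions made for the example, not a prescribed method.

```python
def cleanse(records):
    """Trim whitespace, normalize email case, drop incomplete rows and duplicates."""
    seen_emails = set()
    cleaned = []
    for record in records:
        name = (record.get("name") or "").strip()
        email = (record.get("email") or "").strip().lower()
        if not name or not email:
            continue  # incomplete record
        if email in seen_emails:
            continue  # duplicate record (same email after normalization)
        seen_emails.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned


# Hypothetical input with formatting problems, a duplicate, and missing values.
records = [
    {"name": " Ada ", "email": "ADA@example.com"},
    {"name": "Ada", "email": "ada@example.com "},
    {"name": "", "email": "nobody@example.com"},
    {"name": "Grace", "email": None},
]
cleaned = cleanse(records)
print(cleaned)
```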

17) Data Collaboration - the practice of using data to enhance customer, partner, and go-to-market relationships and create new value. Any time two or more organizations are in a business relationship, data sharing and data collaboration can be seen in action.

18) Data Democratization - the practice of making data accessible to the average user without gatekeepers or barriers in the way.

19) Data Enrichment - the process of enhancing, appending, refining, and improving collected data with relevant third-party data.

20) Data Exchange - the process of taking data from one file or database format and transforming it to suit the target schema.

21) Data Extensibility - the capacity to extend and enhance an application with data from external sources such as other applications, databases, and one-off datasets.

22) Data Extraction - the process of collecting or retrieving data from a variety of sources for further data processing, storage or analysis elsewhere.

23) Data Fabric - a single environment consisting of a unified architecture with services or technologies running on that architecture to enable frictionless access and sharing of data in a distributed data environment.

24) Data Governance - the system for defining the people, processes, and technologies needed to manage, organize, and protect a company’s data assets.

25) Data Health - the state of company data and its ability to support effective business objectives.

26) Data Hygiene - the ongoing processes employed to ensure data is clean and ready to use.

27) Data Import - the process of moving data from external sources into another application or database.

28) Data Ingestion - the process of transporting data from multiple sources into a centralized database, usually a data warehouse, where it can then be accessed and analyzed. This can be done in either a real-time stream or in batches.

29) Data Integration - the process of consolidating data from different sources to achieve a single, unified view.

30) Data Integrity - the overall accuracy, consistency, and trustworthiness of data throughout its lifecycle.

31) Data Intelligence - the process of analyzing various forms of data in order to improve a company’s services or investments.

32) Data Interoperability - the ability of different information technology systems and software applications to create, exchange, and consume data in order to use the information that has been exchanged.

33) Data Joins - combining multiple data tables based on a common field between them, or “key.” Common join types include inner, left outer, right outer, full outer, and cross joins.
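An inner join and a left outer join can be sketched in plain Python over lists of dictionaries. The tables, the customer_id key, and the join helper below are all hypothetical. In practice you would normally let SQL or a dataframe library do this.

```python
# Hypothetical tables that share the key "customer_id".
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 3},
]


def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; `how` is "inner" or "left"."""
    # Index the right-hand rows by key so each lookup is a dict access.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    right_fields = {field for row in right for field in row if field != key}

    joined = []
    for left_row in left:
        matches = index.get(left_row[key], [])
        for match in matches:
            joined.append({**left_row, **match})
        if not matches and how == "left":
            # Left outer join: keep the unmatched row, fill right-side fields with None.
            joined.append({**left_row, **dict.fromkeys(right_fields)})
    return joined


inner = join(customers, orders, "customer_id")
left_outer = join(customers, orders, "customer_id", how="left")
print(len(inner), len(left_outer))
```

The inner join keeps only customers that have orders, while the left outer join also keeps Grace, with her order fields left empty.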

34) Data Lake - a centralized storage repository that holds large amounts of data in its natural/raw format. 

35) Data Lineage - tells you where data originated. It’s the process of understanding, recording, and visualizing data as it flows from origin to destination.

36) Data Loading - the “L” in “ETL” or “ELT.” In ETL, after data is retrieved and combined from multiple sources (extracted) and cleaned and formatted (transformed), it is then packed up and moved into a designated data warehouse.

37) Data Manipulation - the process of organizing data to make it easier to read or more structured.

38) Data Mapping - the process of matching data fields from one or multiple source files to a data field in another source.
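A data mapping can be as simple as a lookup from source field names to destination field names. The field names in this sketch are invented for illustration.

```python
# Hypothetical mapping: source column name -> destination field name.
FIELD_MAP = {
    "Full Name": "name",
    "E-mail": "email",
    "Zip": "postal_code",
}


def map_record(source, field_map):
    """Build a destination record; source fields that are missing become None."""
    return {target: source.get(src) for src, target in field_map.items()}


mapped = map_record({"Full Name": "Ada Lovelace", "E-mail": "ada@example.com"}, FIELD_MAP)
print(mapped)
```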

39) Data Masking - a data security technique in which a dataset is copied but with sensitive data obfuscated. Also referred to as data obfuscation.
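Here is one possible sketch of data masking: copy the dataset, then obfuscate a sensitive field in the copy. The masking rule (keep the first character of the email's local part) and the field name are assumptions. Real masking policies depend on your security requirements.

```python
import copy


def mask_email(email):
    """Keep the first character of the local part and star out the rest."""
    local, sep, domain = email.partition("@")
    return local[:1] + "*" * max(len(local) - 1, 0) + sep + domain


def mask_dataset(rows, sensitive_field="email"):
    """Return a masked copy of the dataset; the original rows are left untouched."""
    masked = copy.deepcopy(rows)
    for row in masked:
        row[sensitive_field] = mask_email(row[sensitive_field])
    return masked


# Hypothetical dataset with one sensitive field.
original = [{"id": 1, "email": "ada@example.com"}]
masked = mask_dataset(original)
print(masked)
```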

40) Data Mesh - a highly decentralized data architecture, in contrast to a centralized, monolithic architecture built on a data warehouse and/or a data lake. It aims to make data highly available, easily discoverable, secure, and interoperable with the applications that need access to it.

41) Data Migration - the process of transferring internal data between different types of file formats, databases, or storage systems.

42) Data Mining - the process of discovering anomalies, patterns, and correlations within large volumes of data to solve problems through data analysis.

43) Data Modeling - the process of visualizing and representing data elements and the relationships between them.

44) Data Munging - the preparation process for transforming data and cleansing large data sets prior to analysis.

45) Data Onboarding - the process of bringing clean external data into applications and operational systems.


46) Data Orchestration - the process of gathering, combining, and organizing data to make it available for data analysis tools.

47) Data Pipeline - the series of steps required to move data from one system (source) to another (destination).

48) Data Portability - the ability to move data among different applications, programs, computing environments or cloud services. For example, it lets a user take their data from a service and transfer or “port” it elsewhere. 

49) Data Privacy - a branch of data security concerned with the proper handling of data, including consent, notice, and regulatory obligations. It focuses on how information should be handled to comply with data protection regulations.

50) Data Replication - the process of storing your data in more than one location to improve data availability, reliability, redundancy, and accessibility.

51) Data Quality - a measure of how reliable a data set is to serve the specific needs of an organization based on factors such as accuracy, completeness, consistency, reliability and whether it's up to date.

52) Data Science - a multidisciplinary approach to extracting actionable insights from the large and ever-increasing volumes of data collected and created by today’s organizations.

53) Data Scientist - a professional who uses technology for collecting, analyzing and interpreting large amounts of data. Their results create data-driven solutions to business challenges.

54) Data Scrubbing - the procedure of modifying or removing data from a database that is incomplete, incorrect, inaccurately formatted, repeated, or outdated.

55) Data Security - the practice of protecting data from unauthorized access, theft, or data corruption throughout its entire lifecycle.

56) Data Sharing - the ability to distribute the same sets of data resources with multiple users or applications while maintaining data fidelity across all entities consuming the data.

57) Data Silo - a collection of information within an organization that is scattered, not integrated, and/or isolated from one another, and generally not accessible by other parts of the organization.

58) Data Stack - a suite of tools used for data loading, warehousing, transforming, and analyzing & business intelligence.

59) Data Transfer - any information that is transferred or shared between systems or organizations.

60) Data Transformation - the process of converting the format, structure, or values of data to another, typically from the format of a source system into the required format of a destination system.

61) Data Upload - the transmission of a file from one computer system to another.

62) Data Validation - ensuring the accuracy and quality of data against defined rules before using, importing or otherwise processing data.
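Checking data against defined rules might look like the sketch below. The two rules (a loose email pattern and an age range) are illustrative assumptions, not a complete validation standard.

```python
import re

# Hypothetical rules: each field maps to a function that returns True when the value is valid.
RULES = {
    "email": lambda v: isinstance(v, str)
    and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "age": lambda v: isinstance(v, int) and 0 <= v <= 130,
}


def validate(record, rules):
    """Return the names of the fields that fail their rule (empty list means valid)."""
    return [field for field, check in rules.items() if not check(record.get(field))]


good = validate({"email": "ada@example.com", "age": 36}, RULES)
bad = validate({"email": "not-an-email", "age": -1}, RULES)
print(good, bad)
```

Running the checks before an import lets you reject or flag a bad record instead of loading it.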

63) Data Virtualization - the process of aggregating data across disparate systems to develop a single, logical and virtual view of information so that it can be accessed by business users in real time.

64) Data Warehouse - a repository for structured, filtered data that has already been processed for a specific purpose.

65) Data Wrangling - the process of restructuring, cleaning, and enriching raw data into a desired format for easy access and analysis.

66) Data Workflows - a sequence of tasks that must be completed and the decisions that must be made to process a set of data.

67) Database - an organized collection of structured information, or data, typically stored electronically in a computer system so that it can be easily accessed, managed and updated. Examples include MySQL, PostgreSQL, Microsoft SQL Server, MongoDB, Oracle Database, and Redis.

68) Database Schema - the collection of metadata that describes the relationships between objects and information in a database. It’s the blueprint or architecture of how the data will look.

69) Dataflows - represents the path for data to move from one part of the information system to another.

70) DataOps - the practice of operationalizing data management, used by analytics and data teams to develop data analytics, improve its quality, and reduce its cycle time.

71) Dataset - a structured collection of individual but related items that can be accessed and processed individually or as a unit.

72) Deep Learning - a subfield of machine learning that trains computers via algorithms to do what comes naturally to humans such as speech recognition, image identification and prediction making.

73) Dummy Data - mock data that has the same layout and format as real data but contains no real information, typically used in a testing environment.

74) Electronic Data Interchange (EDI) - the intercompany exchange of business documents in a standard electronic format between business partners.

75) ELT - stands for Extract, Load, and Transform. The data is extracted and loaded into the warehouse directly without any transformations. Instead of transforming the data before it’s written, ELT takes advantage of the target system to do the data transformation.

76) ETL - stands for Extract, Transform, and Load. ETL collects and processes data from various sources into a single data store making it much easier to analyze.
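The three ETL steps can be sketched with nothing but the Python standard library. In this example an in-memory SQLite database stands in for the data store, and the CSV source, the payments table, and its columns are all assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source data; note the stray space that the transform step cleans up.
SOURCE = "id,amount\n1, 10.50\n2,3.25\n"


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert strings to typed values before loading."""
    return [(int(row["id"]), float(row["amount"])) for row in rows]


def load(rows, conn):
    """Load: write the transformed rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (id INTEGER PRIMARY KEY, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE)), conn)
total = conn.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
print(total)
```

In an ELT variant, load would write the raw strings first and the type conversion would run as SQL inside the destination.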

77) First-Party Data - the information you collect directly from your owned sources about your audience or customers.

78) FTP - File Transfer Protocol (FTP) is a standard communication protocol that governs how computers transfer files between systems over the internet.

79) JSON - JavaScript Object Notation (JSON) is a text-based, human-readable data interchange format for storing and transporting data.
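For a quick feel of JSON as an interchange format, this sketch serializes a made-up record to text and parses it back using Python's standard json module.

```python
import json

# Hypothetical record; JSON supports objects, arrays, strings, numbers, booleans, and null.
record = {"id": 7, "tags": ["a", "b"], "active": True, "note": None}

text = json.dumps(record, sort_keys=True)  # Python object -> JSON text
restored = json.loads(text)                # JSON text -> Python object

print(text)
```

Because the output is plain text, any system that can read JSON can consume it, whatever language it is written in.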

80) Master Data Management (MDM) - the technology, tools, and processes to ensure the organization's data is consistent, uniform, and accurate on customers, products, suppliers and other business partners.

81) No-code ETL - an ETL process performed using software that has automation features and a user-friendly user interface (UI) for creating and managing data flows.


82) NoSQL - non-relational database that stores and retrieves data without needing to define its structure first - an alternative to the more rigid relational database model. Instead of storing data in rows and columns like a traditional database, a NoSQL database stores each item individually with a unique key.

83) Metadata - data that describes and provides information about other data.

84) MySQL - pronounced "My S-Q-L" or "My Sequel," it is an open-source relational database management system (RDBMS) that’s backed by Oracle Corp. Fun fact: it’s named after My, the daughter of Michael Widenius, one of the product’s originators.

85) PostgreSQL - a free and open-source object-relational database management system emphasizing extensibility and SQL compliance. It’s the fourth most popular database in the world.

86) Program Synthesis - the task of constructing a program that provably satisfies a given high-level formal specification. It’s the idea that computers can write programs automatically if we just tell them what we want. Bonus: Program Synthesis is what’s behind the Osmos AI-powered data transformation engine that lets end users easily teach the system how to clean up data.

87) Raw Data - a set of information that’s gathered by various sources, but hasn’t been processed, cleaned, or analyzed.

88) Relational Database (RDBMS) - a type of database where data is stored in the form of tables and connects the tables based on defined relationships. The most common means of data access for the RDBMS is SQL.

89) RESTful APIs - APIs that follow Representational State Transfer (REST), an architectural style for designing networked applications that provides a convenient and consistent approach to requesting and modifying data.

90) Reverse ETL - As the name suggests, Reverse ETL flips around the order of operations within the traditional ETL process. It’s the process of moving data from a data warehouse into third-party systems to make data operational. For example, extracting data from a data warehouse and loading it into sales, marketing, and analytics tools.

91) Rust (programming language) - a statically typed, multi-paradigm, memory-efficient, open-source programming language that is focused on performance and safety. Rust blends the performance of languages such as C++ with friendlier syntax, a focus on code safety and a well-engineered set of tools that simplify development. The annual Stack Overflow Developer Survey has ranked Rust as the “most loved” programming language for 5 years running. The code-sharing site GitHub says Rust was the second-fastest-growing language on the platform in 2019, up 235% from the previous year. And Osmos is built using Rust.

92) Stream Processing - a technique of ingesting data in which information is analyzed, transformed, and organized as it is generated.
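The key idea in stream processing is that each event is handled as it arrives instead of waiting for the whole dataset. A Python generator gives a small-scale sketch of that idea. The running average and the sample readings are illustrative assumptions, and production systems use dedicated streaming platforms for this.

```python
def running_average(events):
    """Yield the average of all values seen so far, one result per incoming event."""
    total = 0.0
    count = 0
    for value in events:
        total += value
        count += 1
        yield total / count  # a result is available before the stream ends


# Hypothetical stream of sensor readings; iter() stands in for a live source.
readings = iter([10, 20, 30])
averages = list(running_average(readings))
print(averages)
```

A batch job would compute one average after reading everything, while the generator produces an updated answer after every event.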

93) Streaming Data - the continuous flow of data generated by various sources to a destination to be processed and analyzed in near real-time.

94) Structured Data - data that has been organized and predefined into a formatted repository before being placed in data storage.

95) Third-Party Data - any information collected by a company that does not have a direct relationship with the user the data is being collected on. It includes anonymous information about interests, purchase intention, favourite brands, or demography.

96) TSV - Tab-Separated Values (TSV) is a delimited text file format in which values are separated by tabs. It is commonly used to exchange raw data between spreadsheet applications and databases.

97) Unstructured Data - datasets (typically large collections of files) that are not arranged in a predetermined data model or schema.

98) Usable Data - data that can be understood and used without additional information, so that people in an organization can apply it to meet the goals defined in the corporate strategy.

99) Webhook - a way for an app or service to send a web-based notification to other applications when a specific event occurs. Also called a web callback or HTTP push API.

100) XML (Extensible Markup Language) - a simple, very flexible markup language designed to store and transport data. It describes the text in a digital document.

101) XSL (Extensible Stylesheet Language) - a language for expressing style sheets. It is used for transforming and presenting XML documents.

The data world is full of terms that are critical to your success. These are just 101 of the most used data terms at the moment. New trends are emerging all the time, and we'll keep adding new terms as we continue learning.

Hopefully, as you get familiar with these common data terms and how to apply them to your business, you'll grow more confident using these data terms in meetings and daily conversations.
