Analytics Glossary

Data Lake vs. Data Mart vs. Data Warehouse

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” – James Dixon, the founder and CTO of Pentaho

The data mart is a subset of the data warehouse and is usually oriented to a specific business line or team. Whereas data warehouses have an enterprise-wide scope, the information in a data mart pertains to a single department. (In the original illustration, the data warehouse can store only the orange data, while the data lake can store all the orange and blue data.)
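As a rough illustration of the "subset" idea, a data mart can be pictured as a filtered slice of warehouse rows for one department. The rows, column names and the build_data_mart helper below are hypothetical and exist only for this sketch.

```python
# Hypothetical warehouse rows; the column names are assumptions for illustration.
warehouse = [
    {"department": "sales", "metric": "revenue", "value": 1200},
    {"department": "sales", "metric": "orders", "value": 35},
    {"department": "hr", "metric": "headcount", "value": 48},
    {"department": "finance", "metric": "costs", "value": 900},
]


def build_data_mart(rows, department):
    """Return only the warehouse rows that belong to one department."""
    return [row for row in rows if row["department"] == department]


sales_mart = build_data_mart(warehouse, "sales")
for row in sales_mart:
    print(row["metric"], row["value"])
```

A real data mart would usually also reshape and aggregate the data for its audience; the filter is only the simplest part of the idea.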

Data Latency

Data latency is how long it takes for a business user to retrieve source data from a data warehouse or business intelligence dashboard.

Data Scientist

A Data Scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.

Probability vs. Statistics

Probability deals with predicting the likelihood of future events, while Statistics involves the analysis of the frequency of past events.
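A small sketch may make the contrast concrete. It assumes a fair six-sided die (an example chosen here, not taken from the definition above): the probability is derived from the model before any roll, while the statistic is estimated from rolls that have already happened.

```python
import random
from fractions import Fraction

# Probability: reason forward from a known model to a future event.
# For a fair six-sided die, three of the six faces are even.
p_even = Fraction(3, 6)

# Statistics: reason backward from observed past events to an estimate.
rng = random.Random(0)  # fixed seed so the run is repeatable
rolls = [rng.randint(1, 6) for _ in range(10_000)]
observed_even = sum(1 for r in rolls if r % 2 == 0) / len(rolls)

print("predicted:", float(p_even))
print("observed: ", round(observed_even, 3))
```

With enough rolls the observed frequency should settle close to the predicted value, which is the link between the two disciplines.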

Edge Analytics

Edge analytics is the collection, processing and analysis of data at the edge of a network, either at or close to a sensor, a network switch or some other connected device.

Machine Learning: Algorithms that can make predictions through pattern recognition.

Deep Learning: A form of machine learning that uses a computing model inspired by the structure of the brain and requires less human supervision. Deep learning isn’t an application; it’s a technology that makes many applications smarter and more natural through experience. Deep learning is a subset of machine learning, and machine learning is a subset of AI, which is an umbrella term for any computer program that does something smart.

Propensity Modeling:

Propensity models are often used to identify those most likely to respond to an offer, or to focus retention activity on those most likely to churn.

The model may be applied to your database to score all your customers or prospects. You can then select only those who are most likely to exhibit the predicted behaviour (for example, responding to an offer) and focus your mailing activity on them.
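The score-then-select step can be sketched as follows. This is a minimal illustration that assumes a logistic model; the coefficients, feature names, customer records and the 0.5 threshold are all hypothetical, and in practice the coefficients would be fitted to historical response data.

```python
import math

# Hypothetical coefficients; a real model would learn these from past responses.
INTERCEPT = -2.0
WEIGHTS = {"recent_purchases": 0.8, "emails_opened": 0.3}


def propensity(customer):
    """Logistic score in (0, 1): estimated likelihood of responding."""
    z = INTERCEPT + sum(WEIGHTS[k] * customer[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def select_targets(customers, threshold=0.5):
    """Score every customer and keep those at or above the threshold."""
    scored = [(c["id"], propensity(c)) for c in customers]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [cid for cid, score in scored if score >= threshold]


# Made-up customer records for the sketch.
customers = [
    {"id": "a", "recent_purchases": 0, "emails_opened": 1},
    {"id": "b", "recent_purchases": 3, "emails_opened": 4},
    {"id": "c", "recent_purchases": 1, "emails_opened": 0},
]

targets = select_targets(customers)
print(targets)
```

The threshold is a business decision: lowering it reaches more people at a higher mailing cost, raising it concentrates effort on the most likely responders.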

DataOps

The application of continuous delivery and DevOps to data analytics has been termed DataOps. DataOps seeks to integrate data engineering, data integration, data quality, data security, and data privacy with operations. It applies principles from DevOps, Agile development and statistical process control (as used in lean manufacturing) to shorten the cycle time of extracting value from data analytics.

Concept Drift

Concept drift in machine learning and data mining refers to the change in the relationships between input and output data in the underlying problem over time.

In other domains, this change may be called “covariate shift,” “dataset shift,” or “nonstationarity.”

In most challenging data analysis applications, data evolve over time and must be analyzed in near real time. Because the patterns and relations in such data keep changing, models built to analyze them can quickly become obsolete.
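One simple way to notice that a model is becoming obsolete is to compare its recent error rate with the error rate it had when it was built. The sketch below assumes that approach; the DriftMonitor class, its window and tolerance values, and the simulated data are all hypothetical, and dedicated drift detectors are considerably more sophisticated.

```python
from collections import deque


class DriftMonitor:
    """Flags possible drift when the recent error rate exceeds a baseline."""

    def __init__(self, baseline_error, window=50, tolerance=0.10):
        self.baseline_error = baseline_error
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)  # 1 = wrong prediction, 0 = correct

    def update(self, predicted, actual):
        self.recent.append(0 if predicted == actual else 1)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough evidence yet
        recent_error = sum(self.recent) / len(self.recent)
        return recent_error > self.baseline_error + self.tolerance


def model(x):
    """A fixed rule that was correct when it was built: label is 1 when x > 5."""
    return 1 if x > 5 else 0


monitor = DriftMonitor(baseline_error=0.0)
flags = []
for step in range(200):
    x = step % 10
    # Simulated drift: from step 100 on, the true label is 1 when x > 2.
    true_cutoff = 5 if step < 100 else 2
    actual = 1 if x > true_cutoff else 0
    flags.append(monitor.update(model(x), actual))

first_flag = flags.index(True)
print("drift first flagged at step", first_flag)
```

The inputs here look the same before and after step 100; only the relationship between input and label changes, which is why the model's errors, rather than the input values, are what the monitor tracks.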

POC vs. Pilot

A proof of concept (POC) or a proof of principle is a realization of a certain method or idea to demonstrate its feasibility, or a demonstration in principle, whose purpose is to verify that some concept or theory has the potential of being used. A proof of concept is usually small and may or may not be complete.

A pilot project refers to an initial roll-out of a system into production, targeting a limited scope of the intended final solution. The scope may be limited by the number of users who can access the system, the business processes affected, the business partners involved, or other restrictions as appropriate to the domain. The purpose of a pilot project is to test the system under real conditions, often in a production environment.

LLMs:

Large language models (LLMs), like GPT-4, are artificial intelligence models that can predict what text should come next based on unique text inputs and prompts, drawing from a large text-based data set. LLMs are just one branch of artificial intelligence, a broad term used to describe computers’ ability to mimic human intelligence through processing, synthesizing and generating information. Generative AI refers to artificial intelligence models, including LLMs like GPT-4, that can generate new content, like audio, video and text.
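The idea of predicting what text comes next can be shown with a toy word-pair counter. This is only an illustration of the prediction task: real LLMs use large neural networks trained on vastly more text, and the tiny corpus and function names below are made up for the sketch.

```python
from collections import Counter, defaultdict

# A made-up miniature "training set".
corpus = "the cat sat on the mat . the cat ate the fish ."


def train_bigrams(text):
    """Count, for every word, which words followed it and how often."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]


bigram_model = train_bigrams(corpus)
print(predict_next(bigram_model, "the"))
```

In this corpus "the" is followed by "cat" twice and by "mat" and "fish" once each, so the counter picks "cat"; a word it has never seen yields no prediction at all.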

Customer Data Platform (CDP):

Gartner defines a CDP as “software that collects and unifies customer data – from multiple sources including first- and third-party – to build a single, coherent and complete view of each customer.” In other words, a CDP is a software application that supports marketing and customer experience use cases by unifying a company’s customer data from marketing and other channels. CDPs optimize the timing and targeting of messages, offers and customer engagement activities, and enable the analysis of individual-level customer behavior over time.

Organizations increasingly expect customer data platforms to support customer experience (CX) use cases that fall outside of marketing’s direct control, such as the next best action for customer service or customer journey insights.

Note: These terms are compiled from various articles & publications as part of my learning journey over the course of time.