What Are “Data Models?”

Seemingly simple terms such as data model or unstructured data can be confusing and do not always cross disciplines smoothly. These brief excerpts from Data Science for Marketing Analytics: Achieve your marketing goals with the data analytics power of Python by Tommy Blanchard, Debasish Behera, and Pranshu Bhatnagar succinctly explain what these terms mean and why they matter.

Any copy/paste and editing errors are mine.

“Raw data coming from external sources cannot generally be used directly; it needs to be structured, filtered, combined, analyzed, and observed before it can be used for any further analyses. In this chapter, we will explore how to get the right data in the right attributes, manipulate rows and columns, and apply transformations to data. This is essential because, otherwise, we will be passing incorrect data to the pipeline, thereby making it a classic example of garbage in, garbage out.

When we build an analytics pipeline, the first thing that we need to do is to build a data model. A data model is an overview of the data sources that we will be using, their relationships with other data sources, where exactly the data from a specific source is going to enter the pipeline, and in what form (such as an Excel file, a database, or a JSON from an internet source). The data model for the pipeline evolves over time as data sources and processes change. A data model can contain data of the following three types:

  • Structured Data: This is also known as completely structured or well-structured data. This is the simplest way to manage information. The data is arranged in a flat tabular form with the correct value corresponding to the correct attribute. There is a unique column, known as an index, for easy and quick access to the data, and there are no duplicate columns. Data can be queried exactly through SQL queries, for example, data in relational databases, MySQL, Amazon Redshift, and so on.
  • Semi-structured data: This refers to data that may be of variable lengths and that may contain different data types (such as numerical or categorical) in the same column. Such data may be arranged in a nested or hierarchical tabular structure, but it still follows a fixed schema. There are no duplicate columns (attributes), but there may be duplicate rows (observations). Also, each row might not contain values for every attribute, that is, there may be missing values. Semi-structured data can be stored accurately in NoSQL databases, Apache Parquet files, JSON files, and so on.
  • Unstructured data: Data that is unstructured may not be tabular, and even if it is tabular, the number of attributes or columns per observation may be completely arbitrary. The same data could be represented in different ways, and the attributes might not match each other, with values leaking into other parts. Unstructured data can be stored as text files, CSV files, Excel files, images, audio clips, and so on.
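As an illustration that is not part of the excerpt, the three categories can be sketched with Python's standard library. The table name, field names, and sample values below are hypothetical, and the "unstructured" case follows the excerpt's sense of tabular text whose rows carry an arbitrary number of fields.

```python
import csv
import io
import json
import sqlite3

# Structured: a fixed-schema table with an index column, queried exactly with SQL.
# (The table and its contents are made up for this sketch.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ana", 120.0), (2, "Ben", 80.5)],
)
structured_rows = conn.execute(
    "SELECT name, spend FROM customers WHERE spend > 100"
).fetchall()

# Semi-structured: nested records in which some attributes may be missing.
semi_structured = json.loads(
    '[{"id": 1, "name": "Ana", "tags": ["email", "web"]},'
    ' {"id": 2, "name": "Ben"}]'
)
# A default is needed because not every record carries every attribute.
tags_per_record = [record.get("tags", []) for record in semi_structured]

# Unstructured (in the excerpt's sense): tabular text whose rows have an
# arbitrary number of fields, so attributes cannot be lined up reliably.
raw_text = "Ana,120,email\nBen\n3,Cy,web,extra,field\n"
field_counts = [len(row) for row in csv.reader(io.StringIO(raw_text))]
```

The SQL query needs no defensive code, the JSON records need a default for the missing attribute, and the ragged text needs inspection before any column can be trusted at all.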

Marketing data, traditionally, comprises data of all three types. Initially, most data points originated from different (possibly manual) data sources, so the values for a field could be of different lengths, the value for one field would not match that of other fields because of different field names, some rows containing data from even the same sources could also have missing values for some of the fields, and so on. But now, because of digitization, structured and semi-structured data is also available and is increasingly being used to perform analytics.

A data model with all these different kinds of data is prone to errors and is very risky to use. If we somehow get a garbage value into one of the attributes, our entire analysis will go awry. Most of the time, the data we need is of a certain kind and if we don’t get that type of data, we might run into a bug or problem that would need to be investigated. Therefore, if we can enforce some checks to ensure that the data being passed to our model is almost always of the same kind, we can easily improve the quality of data from unstructured to at least semi-structured…
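The kind of check described above might look like the following sketch, which is not taken from the book. The schema, the field names, and the helper functions `validate_record` and `split_valid_invalid` are all hypothetical; a real pipeline would more likely rely on a dedicated validation library.

```python
from typing import Any

# Hypothetical schema: attribute name -> expected Python type.
EXPECTED_SCHEMA = {"customer_id": int, "channel": str, "spend": float}


def validate_record(record: dict, schema: dict = EXPECTED_SCHEMA) -> list:
    """Return a list of problems found in one record; an empty list means it passed."""
    problems = []
    for field, expected_type in schema.items():
        value: Any = record.get(field)
        if value is None:
            problems.append(f"missing value for '{field}'")
        elif not isinstance(value, expected_type):
            problems.append(
                f"'{field}' should be {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    # Attributes that the schema does not know about are flagged as well.
    for field in sorted(set(record) - set(schema)):
        problems.append(f"unexpected attribute '{field}'")
    return problems


def split_valid_invalid(records: list) -> tuple:
    """Separate records that pass every check from those that do not."""
    valid, rejected = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            valid.append(record)
    return valid, rejected


records = [
    {"customer_id": 1, "channel": "email", "spend": 42.0},
    {"customer_id": "two", "channel": "web", "spend": 10.0},  # wrong type
    {"customer_id": 3, "channel": "web"},  # missing spend
]
valid, rejected = split_valid_invalid(records)
```

Rejected records are kept alongside the reasons they failed, so a garbage value is investigated at the point of entry rather than discovered later in the analysis.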

OSEMN is one of the most common data science pipelines used for approaching any kind of data science problem. It’s pronounced awesome. OSEMN stands for the following:

  • Obtaining the data, which can be from any source, structured, unstructured, or semi-structured.
  • Scrubbing the data, which is getting your hands dirty and cleaning the data, which can involve renaming columns and imputing missing values.
  • Exploring the data to find out the relationships between each of the variables. Searching for any correlation among the variables. Finding the relationship between the explanatory variables and the response variable.
  • Modeling the data, which can include prediction, forecasting, and clustering.
  • iNterpreting the data, which is combining all the analyses and results to draw a conclusion.”
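The five OSEMN steps can be sketched end to end in a few lines of Python. This is an illustrative sketch rather than code from the book: the sample data, the column names, and the choice of mean imputation and a least-squares line are all assumptions made for the example.

```python
import csv
import io
from math import sqrt
from statistics import mean

# Hypothetical raw data standing in for an external source.
RAW_CSV = """Ad Spend,Sales
10,25
20,45
,50
30,65
"""


def obtain(raw: str) -> list:
    """Obtain: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(raw)))


def scrub(rows: list) -> list:
    """Scrub: rename columns and impute missing ad spend with the observed mean."""
    observed = [float(r["Ad Spend"]) for r in rows if r["Ad Spend"]]
    fill = mean(observed)
    return [
        {
            "ad_spend": float(r["Ad Spend"]) if r["Ad Spend"] else fill,
            "sales": float(r["Sales"]),
        }
        for r in rows
    ]


def explore(rows: list) -> float:
    """Explore: Pearson correlation between ad spend and sales."""
    xs = [r["ad_spend"] for r in rows]
    ys = [r["sales"] for r in rows]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def model(rows: list) -> tuple:
    """Model: fit sales = slope * ad_spend + intercept by least squares."""
    xs = [r["ad_spend"] for r in rows]
    ys = [r["sales"] for r in rows]
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def interpret(corr: float, slope: float, intercept: float) -> str:
    """iNterpret: combine the results into a plain-language conclusion."""
    return (
        f"Correlation {corr:.2f}; each extra unit of ad spend is associated "
        f"with about {slope:.2f} more sales (baseline {intercept:.2f})."
    )


clean = scrub(obtain(RAW_CSV))
corr = explore(clean)
slope, intercept = model(clean)
summary = interpret(corr, slope, intercept)
```

Each function maps to one letter of the acronym, so swapping the data source or the model changes a single step without touching the others.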

Source: Tommy Blanchard, Debasish Behera, and Pranshu Bhatnagar. Data Science for Marketing Analytics: Achieve your marketing goals with the data analytics power of Python. Packt Publishing.
