Provider 1: Product 1

Data Quality Scorecard

Parameter definitions are available in Appendix 2.

Parameter	No	Comments
Duplicate rows > 10%?	✓	100% Unique Rows
Missing values > 50% for one or more columns?	✓	0 Columns have missing values > 50%
Most recent update is more than 6 months old?	✓	Latest Date: 15-Nov-2019
Data contains PII (Personally Identifiable Information)?	✓
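The three numeric thresholds in the scorecard can be checked mechanically. The following is a minimal sketch in plain Python, not the tool that produced this report. The function name `scorecard`, the row-tuple layout, the `as_of` assessment date, and the month-based cutoff rule are all assumptions made for illustration. PII detection is not covered.

```python
from datetime import date


def scorecard(rows, date_index, as_of, months=6):
    """Evaluate the duplicate, missing-value, and recency thresholds.

    rows       -- list of tuples, one per record, with None for a missing value
    date_index -- position of the date column (hypothetical layout)
    as_of      -- assessment date (assumed; the report does not state one)
    """
    n = len(rows)
    # Share of rows that exactly repeat another row.
    duplicate_share = (n - len(set(rows))) / n
    # Share of missing values in each column.
    n_cols = len(rows[0])
    missing_share = [sum(r[c] is None for r in rows) / n for c in range(n_cols)]
    # Cutoff: first day of the month that lies `months` months before `as_of`.
    latest = max(r[date_index] for r in rows if r[date_index] is not None)
    month_number = as_of.year * 12 + (as_of.month - 1) - months
    cutoff = date(month_number // 12, month_number % 12 + 1, 1)
    return {
        "duplicates_over_10pct": duplicate_share > 0.10,
        "column_missing_over_50pct": any(s > 0.50 for s in missing_share),
        "stale": latest < cutoff,
    }
```

A result of `False` for every key corresponds to a tick in the "No" column for the first three scorecard rows.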

1. Metadata Summary

1.1. Product Structure Tree

Number of Files Received: 1

1.2. Data Dimensions

Number of Columns: 5

Number of Rows: 360,961

Date Range: 08-Jan-2014 to 15-Nov-2019

1.3. Structure/Data Types

A list of columns present in the dataset, the respective datatypes and the first value present.

1.4. Data Subset

A sample of 6 rows from the dataset.

1.5. Data Completeness

Discrete Columns: Columns consisting of discrete/categorical values

Continuous Columns: Columns consisting of continuous values

All Missing Columns: Columns in which all values are missing

Complete Rows: Rows with no missing values

Missing Observations: Number of missing values
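The completeness measures defined above can be computed as follows. This is a minimal sketch under stated assumptions: rows are tuples with `None` marking a missing value, and a column is treated as continuous when all of its non-missing values are numeric. That classification rule and the function name `completeness` are illustrative, not taken from the report.

```python
def completeness(rows):
    """Count column types, all-missing columns, complete rows, and missing values."""
    n_cols = len(rows[0])
    columns = [[r[c] for r in rows] for c in range(n_cols)]
    discrete = continuous = all_missing = 0
    for col in columns:
        present = [v for v in col if v is not None]
        if not present:
            all_missing += 1      # every value in the column is missing
        elif all(isinstance(v, (int, float)) for v in present):
            continuous += 1       # assumed rule: numeric means continuous
        else:
            discrete += 1
    return {
        "discrete_columns": discrete,
        "continuous_columns": continuous,
        "all_missing_columns": all_missing,
        # Rows in which no value is missing.
        "complete_rows": sum(all(v is not None for v in r) for r in rows),
        # Total number of missing values across the whole table.
        "missing_observations": sum(v is None for r in rows for v in r),
    }
```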

1.6. Missing Data

The plot shows the percentage of missing values for each column, color-coded on a spectrum from green (0%) to red (100%).

2. Univariate Analysis

The following sections provide a Univariate Analysis of the dataset, i.e., the dataset is analyzed one column at a time.

2.1. Histogram and Statistical Summary - Continuous Variable(s)

A histogram is a representation of the distribution of numerical data. The positioning values are binned and plotted on the X-axis, and the corresponding frequency is plotted on the Y-axis.
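The binning step described above can be sketched in a few lines. Equal-width bins and the function name `histogram` are assumptions for illustration; the report does not state which binning rule its charts use.

```python
def histogram(values, bins):
    """Return (edges, counts) for equal-width bins over the range of `values`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("values must span a non-zero range")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts
```

The edges give the X-axis positions of the bars and the counts give their heights on the Y-axis.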

2.1.1. Column 2

2.1.3. Column 4

2.2. Frequency Counts - Categorical Variable(s)

2.2.1. Column 1

Below are the top 10 and bottom 10 values of Column 1 by frequency:

Top 10		Bottom 10
Column 1	Frequency	Column 1	Frequency
CABO	2,138	RCL	2,132
CNK	2,138	TGT	2,132
CPB	2,138	CAG	2,133
DISH	2,138	CHD	2,133
DRI	2,138	HRB	2,133
FWONK	2,138	KDP	2,133
GM	2,138	MHK	2,133
KSS	2,138	VIA	2,133
MIK	2,138	WMT	2,133
NCLH	2,138	BKNG	2,134
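A frequency table of this kind can be produced with a counter. The sketch below is illustrative: the function name `top_and_bottom` is hypothetical, and breaking ties alphabetically is an assumption, although it is consistent with the ordering seen in the table above.

```python
from collections import Counter


def top_and_bottom(values, k=10):
    """Return the k most frequent and k least frequent values with their counts."""
    counts = Counter(values)
    # Top: highest frequency first, ties broken alphabetically.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    # Bottom: lowest frequency first, ties broken alphabetically.
    bottom = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:k]
    return top, bottom
```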

3. Bivariate Analysis

The following sections provide a Bivariate Analysis of the dataset, i.e., the dataset is analyzed two or more columns at a time.

3.1. Correlation Analysis

The coefficient of correlation is a measure of the linear correlation between two variables X and Y. It has a value between +1 and -1, where 1 is total positive linear correlation, 0 is no linear correlation, and -1 is total negative linear correlation.

The following is a correlation heatmap for all numerical columns. It is colored using a Blue-White-Red spectrum in which the extreme colors represent -1 and +1 respectively.
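For reference, the coefficient of linear correlation described in this section is commonly computed as the Pearson coefficient. The sketch below assumes that definition, since the report does not name the estimator, and the function name `pearson` is illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the summed squared deviations of each variable.
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)
```

Applying this to every pair of numerical columns yields the matrix that a heatmap colors. A constant column has zero deviation and would need separate handling.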

4. Time Series Analysis

Periodicity: Daily

The graph below shows the trend of data collection over the period Jan-2014 to Nov-2019.
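One plausible way to derive such a trend is to count records per calendar month. This is a sketch under that assumption: the report does not say how the trend series is aggregated, and `monthly_counts` is a hypothetical helper name.

```python
from collections import Counter
from datetime import date


def monthly_counts(dates):
    """Return [((year, month), record_count), ...] in chronological order."""
    counts = Counter((d.year, d.month) for d in dates)
    return sorted(counts.items())
```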

Appendix 1: Data Summary

1 Univariate Analysis

Univariate Analysis involves the analysis of one variable at a time.

1.1 Histogram

A Histogram visualizes the distribution of a numerical field over its continuous range of values. Each bar in a histogram represents the tabulated frequency at each interval/bin.

1.2 Density Plot

A Density Plot visualizes the distribution of data over a continuous interval or time period. This chart is a variation of a Histogram that uses kernel smoothing to plot values, allowing for smoother distributions by smoothing out the noise. The peaks of a Density Plot help display where values are concentrated over the interval.
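Kernel smoothing can be illustrated with a Gaussian kernel, which is one common choice. The kernel, the fixed `bandwidth` parameter, and the function name `kde_gaussian` are assumptions here; the report does not specify how its density plots are computed.

```python
import math


def kde_gaussian(sample, bandwidth):
    """Return a function that estimates the density of `sample` at a point x."""
    n = len(sample)
    # Normalizing constant so that the estimated density integrates to 1.
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Each observation contributes a Gaussian bump centered on itself.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density
```

A larger bandwidth gives a smoother curve, and a smaller one follows the individual observations more closely.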

1.3 QQ Plot

A QQ plot is a scatterplot created by plotting two sets of quantiles (theoretical and sample) against one another. The shape of the QQ plot indicates whether the data is normally distributed, skewed, or has a heavy tail.
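The two sets of quantiles can be paired as shown in this sketch, which compares the sample against a standard normal distribution. The plotting positions `(i + 0.5) / n` are one common convention and are an assumption here, as is the function name `qq_points`.

```python
from statistics import NormalDist


def qq_points(sample):
    """Return (theoretical_quantile, sample_quantile) pairs for a normal QQ plot."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions stay strictly inside (0, 1), as inv_cdf requires.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))
```

Points that fall close to a straight line suggest an approximately normal sample; systematic curvature at the ends suggests skew or heavy tails.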

2 Bivariate Analysis

Bivariate Analysis involves the analysis of two variables for the purpose of determining the empirical relationship between them. It explores the concept of the relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences.

2.1 Correlation

A correlation matrix displays the coefficient of correlation for every pair of variables present in the dataset. This allows you to see which pairs have the highest correlation. It can be used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.

3 Trend Charts

Trend charts are simple and efficient graphical representations of time-series data. Monthly trend charts can often reveal seasonal trends for a variable while yearly trend charts show trends over a longer period.

Appendix 2: Data Quality Scorecard

Tresvista’s Data Quality (DQ) framework is designed to assess data quality and data health. Data quality monitoring is performed on an ongoing basis to ensure sustainable data quality.

Dimensions of Data Quality

A Data Quality Dimension is a term used to describe a data quality measure that can relate to multiple data elements including attribute, record, table, system or more abstract groupings such as business unit, company or product range. While there are multiple parameters on which a dataset can be assessed in terms of quality, we have identified the following 5 core dimensions for our assessment.

1. Availability – Sufficient availability of data points

Availability means ensuring that enough data is available to end users and applications when and where they need it for further analysis. This is particularly important because many machine learning algorithms require enough data samples for training, testing, and validating models.

2. Completeness – All required data is captured, and no data is missing

Completeness is measured as the percentage of records with non-NULL values; it is also termed comprehensiveness. Missing or incomplete data can hamper the analysis and affect the interpretability of the insights.

3. Uniqueness – Data redundancy should be avoided

Uniqueness requires that no duplicate records are reported. Asserting uniqueness of the entities within a data set implies that no entity exists more than once within the data set and that there is a key that can be used to uniquely access each entity (and only that specific entity) within the data set.
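The key-based notion of uniqueness can be tested by counting how often each key occurs. In this sketch the function name `duplicate_keys` and the choice of key columns are hypothetical; the report does not say which columns form the key of this dataset.

```python
from collections import Counter


def duplicate_keys(rows, key_indices):
    """Return {key: count} for every key that identifies more than one row."""
    keys = Counter(tuple(r[i] for i in key_indices) for r in rows)
    return {key: count for key, count in keys.items() if count > 1}
```

An empty result means the chosen columns uniquely identify each row.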

4. Recency – Data is recent and not outdated

Recency is the degree to which information is current relative to the present period. It measures how “up-to-date” the information is and whether it remains correct despite possible time-related changes.

5. Consistency – Uniform data types across the column

Consistency means that the data values in a column are uniform across that column. A strict definition of consistency specifies that two data values drawn from the same column must not conflict with each other (column-level consistency).
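A simple type-level check of column consistency is sketched below. It assumes rows are tuples with `None` for missing values and only flags columns that mix Python types; it does not detect conflicting values of the same type. The function name `inconsistent_columns` is illustrative.

```python
def inconsistent_columns(rows):
    """Return {column_index: sorted type names} for columns with mixed types."""
    n_cols = len(rows[0])
    mixed = {}
    for c in range(n_cols):
        # Collect the type names of all non-missing values in the column.
        types = {type(r[c]).__name__ for r in rows if r[c] is not None}
        if len(types) > 1:
            mixed[c] = sorted(types)
    return mixed
```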