Parameter definitions are available in the Appendix.
Parameter | Yes | No | Comments |
---|---|---|---|
Duplicate rows > 10%? | ✓ | 100% Unique Rows | |
Missing values > 50% for one or more columns? | ✓ | 0 Columns have missing values > 50% | |
Most recent update is older than 6 months? | ✓ | Latest Date: 15-Nov-2019 | |
Data contains PII (personally identifiable information)? | ✓ |
Number of Files Received: 1
Number of Columns: 5
Number of Rows: 360,961
Date Range: 08-Jan-2014 to 15-Nov-2019
A list of the columns present in the dataset, their respective data types, and the first value of each.
Sample 6 rows of the dataset.
Discrete Columns: Columns consisting of discrete/categorical values
Continuous Columns: Columns consisting of continuous values
All Missing Columns: Columns in which all values are missing
Complete Rows: Rows with no missing values
Missing Observations: Number of missing values
The plot shows the percentage of missing values for each column, color-coded on a spectrum from Green (0%) to Red (100%)
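The missing-value metrics defined above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the tool that produced this report. It assumes that a missing value is represented as `None`, and the column names `ticker` and `value` are hypothetical.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def missing_summary(rows: List[Row], columns: List[str]) -> dict:
    """Summarise missing values; None is treated as missing (an assumption)."""
    n = len(rows)
    missing_per_col = {c: sum(1 for r in rows if r.get(c) is None) for c in columns}
    return {
        # Percentage of missing values per column (what the plot colours).
        "pct_missing": {c: (100.0 * m / n if n else 0.0) for c, m in missing_per_col.items()},
        # Columns in which every value is missing.
        "all_missing_columns": [c for c in columns if n and missing_per_col[c] == n],
        # Rows with no missing values in any column.
        "complete_rows": sum(1 for r in rows if all(r.get(c) is not None for c in columns)),
        # Total number of missing cells.
        "missing_observations": sum(missing_per_col.values()),
    }


# Hypothetical three-row sample; column names are invented for illustration.
sample = [
    {"ticker": "GM", "value": "1.5"},
    {"ticker": "WMT", "value": None},
    {"ticker": None, "value": None},
]
summary = missing_summary(sample, ["ticker", "value"])
```

A column would be flagged by the "Missing values > 50%" check when its entry in `pct_missing` exceeds 50.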
The following sections provide a Univariate Analysis of the dataset, i.e. the dataset is being analyzed one column at a time.
A histogram is a representation of the distribution of numerical data. The positioning values are binned and plotted on the X-axis, and the corresponding frequency is plotted on the Y-axis.
Below are the top 10 and bottom 10 values of Column 1 by frequency:
Column 1 (top 10) | Frequency | Column 1 (bottom 10) | Frequency |
---|---|---|---|
CABO | 2,138 | RCL | 2,132 |
CNK | 2,138 | TGT | 2,132 |
CPB | 2,138 | CAG | 2,133 |
DISH | 2,138 | CHD | 2,133 |
DRI | 2,138 | HRB | 2,133 |
FWONK | 2,138 | KDP | 2,133 |
GM | 2,138 | MHK | 2,133 |
KSS | 2,138 | VIA | 2,133 |
MIK | 2,138 | WMT | 2,133 |
NCLH | 2,138 | BKNG | 2,134 |
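A frequency table like the one above could be produced roughly as follows. This is a sketch, not the report's actual code. Breaking ties alphabetically is an assumption (it appears consistent with the ordering in the table), and the sample values are invented.

```python
from collections import Counter


def top_and_bottom(values, k=10):
    """Return the k most frequent and the k least frequent values with counts.

    Ties are broken alphabetically (an assumption, not a documented rule).
    """
    counts = Counter(values)
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    bottom = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:k]
    return top, bottom


# Tiny invented sample; the real column holds 360,961 values.
values = ["GM"] * 3 + ["WMT"] * 2 + ["TGT"]
top, bottom = top_and_bottom(values, k=2)
```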
The following sections provide a Bivariate Analysis of the dataset, i.e. the dataset is analyzed two or more columns at a time.
The coefficient of correlation is a measure of the linear correlation between two variables X and Y. It has a value between +1 and -1, where 1 is total positive linear correlation, 0 is no linear correlation, and -1 is total negative linear correlation.
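The report does not state which coefficient is used. Assuming it is the Pearson correlation coefficient, which matches the description above, it can be written for paired observations $(x_i, y_i)$, $i = 1, \dots, n$, with sample means $\bar{x}$ and $\bar{y}$ as:

```latex
r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator measures how the two variables vary together, and the denominator rescales it so that the result always lies between -1 and +1.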
Following is a correlation heatmap for all numerical columns. It is colored using a Blue-White-Red spectrum in which the extreme colors represent -1 and +1 respectively.
Periodicity: Daily
The graph below shows the trend of data collection over the period Jan-2014 to Nov-2019.
Univariate Analysis involves the analysis of one variable at a time.
A Histogram visualizes the distribution of a numerical field over its continuous range of values. Each bar in a histogram represents the tabulated frequency at each interval/bin.
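The binning step behind a histogram can be sketched as follows. This is an illustration only. Equal-width bins spanning the minimum to the maximum are an assumption, since the report does not say how its bins are chosen.

```python
def histogram(values, bins=10):
    """Count values in `bins` equal-width intervals spanning [min, max]."""
    if bins < 1:
        raise ValueError("bins must be at least 1")
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        # The last bin is closed on the right so that max(values) is counted.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


# Invented sample data.
edges, counts = histogram([1, 2, 2, 3, 3, 3, 4, 9], bins=4)
```

Each entry of `counts` is the height of one bar, and consecutive entries of `edges` give that bar's interval.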
A Density Plot visualizes the distribution of data over a continuous interval or time period. This chart is a variation of a Histogram that uses kernel smoothing to plot values, allowing for smoother distributions by smoothing out the noise. The peaks of a Density Plot help display where values are concentrated over the interval.
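Kernel smoothing can be illustrated with a small Gaussian kernel density estimate. This is a sketch under stated assumptions: the Gaussian kernel and the hand-picked `bandwidth` are illustrative choices, not settings taken from this report.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density at x with a Gaussian kernel."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each observation contributes a small bell curve centred on itself.
        return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)

    return density


# Invented sample data and a hypothetical bandwidth.
data = [1.0, 2.0, 2.5, 3.0, 7.0]
kde = gaussian_kde(data, bandwidth=0.8)
grid = [i * 0.1 for i in range(-50, 151)]  # x from -5.0 to 15.0
curve = [kde(x) for x in grid]
```

A larger bandwidth gives a smoother curve that hides more detail; a smaller one follows the data more closely but is noisier.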
A QQ plot is a scatterplot created by plotting two sets of quantiles (theoretical and sample) against one another. The shape of the QQ plot indicates whether the data is normally distributed, skewed, or has a heavy tail.
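The points of a normal QQ plot can be computed as in the sketch below. The plotting positions `(i + 0.5) / n` are one common convention and are an assumption here; the report does not say which convention it uses. The sample data are invented.

```python
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Pair each sorted sample value with the matching normal quantile."""
    xs = sorted(sample)
    n = len(xs)
    # Reference normal distribution fitted to the sample mean and std. deviation.
    ref = NormalDist(mean(xs), stdev(xs))
    theoretical = [ref.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, xs))


sample = [2.1, 3.4, 1.9, 5.0, 2.8, 3.1, 4.2]
points = qq_points(sample)
```

If the sample is close to normal, the `(theoretical, sample)` pairs fall near a straight line; systematic curvature suggests skew or heavy tails.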
Bivariate Analysis involves the analysis of two variables for the purpose of determining the empirical relationship between them. It explores the concept of the relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences.
A correlation matrix displays the coefficient of correlation for every pair of variables present in the dataset. This allows you to see which pairs have the highest correlation. It can be used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.
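As a minimal sketch, a correlation matrix can be built by applying a correlation function to every pair of numerical columns. The Pearson coefficient is assumed here, and the column names `a`, `b`, and `c` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A constant column would make sx or sy zero; this sketch does not handle it.
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Return a nested dict holding the coefficient for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Invented columns for illustration.
cols = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]}
matrix = correlation_matrix(cols)
```

The matrix is symmetric with ones on the diagonal, so only the cells above or below the diagonal carry distinct information.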
Trend charts are simple and efficient graphical representations of time-series data. Monthly trend charts can often reveal seasonal trends for a variable while yearly trend charts show trends over a longer period.
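The aggregation behind a monthly trend chart might look like the sketch below. It counts records per calendar month, which is an assumption about what "trend of data collection" means in this report. The first and last dates match the reported date range; the other two are invented.

```python
from collections import Counter
from datetime import date


def monthly_counts(dates):
    """Count records per (year, month), returned in chronological order."""
    counts = Counter((d.year, d.month) for d in dates)
    return sorted(counts.items())


dates = [date(2014, 1, 8), date(2014, 1, 9), date(2014, 2, 3), date(2019, 11, 15)]
trend = monthly_counts(dates)
```

Grouping by `d.year` alone instead of `(d.year, d.month)` would give the yearly view.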
Tresvista’s Data Quality (DQ) framework is designed to assess the data quality and data health. Data quality monitoring is performed on an ongoing basis to ensure sustainable data quality.
A Data Quality Dimension is a term used to describe a data quality measure that can relate to multiple data elements including attribute, record, table, system or more abstract groupings such as business unit, company or product range. While there are multiple parameters on which a dataset can be assessed in terms of quality, we have identified the following 5 core dimensions for our assessment.
It ensures that enough data is available to end users and applications, when and where they need it for further analysis. This is particularly important because many machine learning algorithms require enough data samples for training and testing/validating the models.
It identifies the percentage of records with non-NULL values. It can also be termed comprehensiveness. Missing or incomplete data can hamper the analysis and affect the interpretability of the insights.
It requires that no duplicate data be reported. Asserting uniqueness of the entities within a data set implies that no entity exists more than once within the data set and that there is a key that can be used to uniquely access each entity (and only that specific entity) within the data set.
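A duplicate-row check of this kind can be sketched as follows. The 10% threshold comes from the summary table in this report. Counting only the extra copies of a row as duplicates is an assumption, and the sample rows are invented.

```python
from collections import Counter


def duplicate_pct(rows):
    """Percentage of rows that repeat an earlier, identical row."""
    if not rows:
        return 0.0
    counts = Counter(tuple(r) for r in rows)
    # Only the extra copies count as duplicates (an assumption).
    duplicates = sum(c - 1 for c in counts.values())
    return 100.0 * duplicates / len(rows)


# Invented rows; the second row repeats the first.
rows = [
    ("GM", "2019-11-15", 1.0),
    ("GM", "2019-11-15", 1.0),
    ("WMT", "2019-11-15", 2.0),
    ("TGT", "2019-11-15", 3.0),
]
pct = duplicate_pct(rows)
exceeds_threshold = pct > 10.0
```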
It is the degree to which information is current relative to the present period. It measures how “up-to-date” information is, and whether it is correct despite possible time-related changes.
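The six-month recency check from the summary table can be sketched as below. The assessment date (`as_of`) is not stated in this report, so the dates used in the example are hypothetical, as are the function names.

```python
import calendar
from datetime import date


def months_back(d, months):
    """Shift a date back by whole months, clamping to the end of the month."""
    total = d.year * 12 + (d.month - 1) - months
    year, month = divmod(total, 12)
    month += 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def is_stale(latest, as_of, max_age_months=6):
    """True when the latest record is older than the allowed age."""
    return latest < months_back(as_of, max_age_months)


latest = date(2019, 11, 15)                       # latest date in this dataset
stale_in_jan = is_stale(latest, date(2020, 1, 1))  # hypothetical assessment date
stale_in_jun = is_stale(latest, date(2020, 6, 1))  # hypothetical assessment date
```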
It refers to the data values in one column being consistent across the column. A strict definition of consistency specifies that two data values drawn from the same column must not conflict with each other (column level consistency).