Data Profiling

What is a SSIS Data Profiling Task?

SSIS has a component which performs data profiling, and output the profile of the data as a report.

The Data Profile Task analyses and provides an insight of the data which is very useful in understating as well as identifying the quality issues of the source data. Data profiling task can also be useful in defining the targets for the datasets.

Component

How to configure a SSIS Data Profiling Task?

First double click on the Data Profiling Task.

Window1

First and foremost the destination is required to be configured, which will be used to store the profile data in an XML format. The destination type can be either a file connection or a variable.

Window6

For this example we’ll use File Connection as the destination type.
You will be required to select a new connection and set the connection for the first time.

Window7

The profiler can be set to create a new file as well as to use an existing file.

Window8

Select the desired file type and file and click on OK.
Next to configure the profiling checks click on the Quick Profiler and the following window will appear.

Select the Target data connection from the drop down menu of ADO.NET Connection. (If no connection is available, click on New and create a new connection.)

Limitation: Profiling Task can only profile SQL Server database data.

(If having the data in a different platform, first load the data into the SQL Server database in order to profile the data)

After creating the connection select the required database and the table.

Select the required data checks.

Data profiler provides a number of different data checks as follows:

Column Null Ratio Profile: Percentage of Null records of all the columns of the selected table.
Column Statistic Profile: Minimum, Maximum and standard deviations of the column values.
Column Value Distribution Profile: Pattern of repeating data as well as the percentage of total repeats.
Column Length Distribution Profile: Length of the strings, as well as the occurrence of the length across the columns.
Column pattern profile: Reveals a set of regular expressions in a string column. Identifies invalid strings as well as provides suggestions of regular expressions, which can be used to validate new values for the dataset.
Candidate Key Profile: Identifies the candidates, which can be used for the primary key constraint.

When using the Candidate Key Profile, an additional option is available to be selected as the no. of columns to be considered at a time.

E.g.: 2 – a combination of 2 columns for candidates for the PK constraints.