Data Cleansing with Python

Poor-quality data riddled with issues, in other words dirty data, leads to unreliable analysis and misleading decisions. Data scientists in particular spend hours upon hours cleansing data.

In this first tutorial of the series, we'll learn how to cleanse data step by step to obtain high-quality data that can be used not only for data science but also for BI.

The first issue we’ll be looking into is:

  • Data with multiple pieces of information in one column

First things first: before we start cleansing, we need to load the data along with the pandas library, which we'll be using throughout.
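As a minimal sketch of this step, the snippet below imports pandas and builds a small stand-in dataset. The rows, the `id` column, and the "Last, First" layout of the `name` column are assumptions made for illustration, not the tutorial's actual data; with a real file you would typically load it with `pd.read_csv` instead.

```python
import pandas as pd

# Hypothetical sample data: these values are illustrative only.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Smith, John", "Doe, Jane", "Brown, Alice"],
})

# Take a quick look at the first rows before cleansing.
print(df.head())
```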

Split: As the name implies, this function splits a string value into chunks, or smaller strings, based on a delimiter. Whitespace is the default delimiter. It is exactly the opposite of concatenation.
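To make the idea concrete, here is a small sketch using Python's built-in `str.split` on plain strings. The example strings are made up for illustration.

```python
sentence = "data cleansing with python"

# With no argument, split() breaks the string on whitespace.
words = sentence.split()

full_name = "Smith, John"

# With an explicit delimiter, the string is cut at every comma.
# Note that the space after the comma stays with the second piece.
parts = full_name.split(",")

# join() is the reverse operation: it concatenates the pieces again.
rejoined = ",".join(parts)

print(words)
print(parts)
print(rejoined)
```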

The split function is applied to the name column of the dataset, with ',' provided as the delimiter.

By using the str accessor, we enable the values of the 'name' column to be treated as strings.

The result of the split is assigned to a variable called New_Name.
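The step described above might look like the following sketch. The sample rows are assumptions; only the `name` column, the ',' delimiter, and the `New_Name` variable come from the text.

```python
import pandas as pd

# Hypothetical sample data standing in for the tutorial's dataset.
df = pd.DataFrame({"name": ["Smith, John", "Doe, Jane"]})

# .str exposes string methods on the column; split(",") cuts each value
# at the comma and returns a Series whose elements are lists.
New_Name = df["name"].str.split(",")

print(New_Name)
```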

So, simply by using split(), we can split column data.

Now let us look at how to assign the split values to a new column of the dataset.

Using the variable New_Name, we take the first item from each list it holds and assign that value to the new column.
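A possible version of this assignment is sketched below. The new column names `last_name` and `first_name` are hypothetical, as is the sample data; the second column and the whitespace stripping are additions for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for the tutorial's dataset.
df = pd.DataFrame({"name": ["Smith, John", "Doe, Jane"]})
New_Name = df["name"].str.split(",")

# .str[0] picks the first item of each list in the Series.
df["last_name"] = New_Name.str[0]

# The second item still carries the space that followed the comma,
# so it is stripped here.
df["first_name"] = New_Name.str[1].str.strip()

print(df)
```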

It is as simple as that.

Now let us see how to obtain the same result in just one go.

By setting the expand parameter to True in the split function, we obtain a DataFrame instead of a Series of lists.

So, with the split function and the expand parameter, multiple new columns can be created from the values of a single column in one go.
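The one-step variant could be sketched as follows. The sample data and the target column names are assumptions; `n=1` is an optional extra used here so that each value yields exactly two pieces even if a name contains more than one comma.

```python
import pandas as pd

# Hypothetical sample data standing in for the tutorial's dataset.
df = pd.DataFrame({"name": ["Smith, John", "Doe, Jane"]})

# expand=True returns a DataFrame with one column per piece, which can be
# assigned straight to two new columns. n=1 limits the split to the first comma.
df[["last_name", "first_name"]] = df["name"].str.split(",", n=1, expand=True)

# Remove the space left over after the comma.
df["first_name"] = df["first_name"].str.strip()

print(df)
```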

In the next tutorial, we'll look at how to keep only a subset of the dataset by removing all the unnecessary columns, and then move on with the data cleansing process.

 
