In this video, we discuss two related topics in data exploration,

data cleanup and transformation.

Here are some common questions to ask when you clean up data.

Do character variables have valid values?

Are numeric variables within range?

Are there missing values?

Are there duplicated values?

Are values unique for some variables, for example, ID variables?

Are the dates valid?

Do we need to combine multiple data files?

It is important to ask these questions and

handle the resulting issues appropriately in the data cleanup process.
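To make a few of these questions concrete, here is a minimal sketch in plain Python. The records, the field names (id, gender, age, visit_date), the set of valid codes, and the age range are all made up for illustration; they do not come from any real dataset.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records; field names and values are illustrative assumptions.
records = [
    {"id": 101, "gender": "F", "age": 34, "visit_date": "2020-01-15"},
    {"id": 102, "gender": "M", "age": 210, "visit_date": "2020-02-30"},
    {"id": 103, "gender": "X", "age": None, "visit_date": "2020-03-01"},
    {"id": 103, "gender": "F", "age": 45, "visit_date": "2020-04-10"},
]

VALID_GENDERS = {"F", "M"}  # assumed set of valid character values
AGE_RANGE = (0, 120)        # assumed plausible numeric range


def is_valid_date(text, fmt="%Y-%m-%d"):
    """Return True if text parses as a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except (TypeError, ValueError):
        return False


# Do character variables have valid values?
bad_gender = [r["id"] for r in records if r["gender"] not in VALID_GENDERS]

# Are there missing values?
missing_age = [r["id"] for r in records if r["age"] is None]

# Are numeric variables within range? (missing values are reported separately)
out_of_range_age = [
    r["id"]
    for r in records
    if r["age"] is not None and not (AGE_RANGE[0] <= r["age"] <= AGE_RANGE[1])
]

# Are values unique for ID variables?
id_counts = Counter(r["id"] for r in records)
duplicate_ids = [value for value, count in id_counts.items() if count > 1]

# Are the dates valid?
bad_dates = [r["id"] for r in records if not is_valid_date(r["visit_date"])]

print("invalid gender:", bad_gender)
print("missing age:", missing_age)
print("age out of range:", out_of_range_age)
print("duplicate ids:", duplicate_ids)
print("invalid dates:", bad_dates)
```

Each check only flags the suspicious records. Deciding what to do with them, for example correcting, dropping, or keeping them, still depends on the problem context.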

While it is straightforward to handle some of those issues,

dealing with others might be highly non-trivial.

As an example, handling missing values usually requires a good understanding of

the problem context and can take many alternative approaches.

We discuss it in a separate video.

There are many commercial and open source tools for data cleanup.

Two very popular open source tools are OpenRefine and Data Wrangler.

These tools can greatly help with many common data exploration tasks.

I encourage you to explore their websites to learn more.

Next, we move on to data transformation,

which usually means that we apply a mathematical function to each data value.

Perhaps the most common data transformation is centering and

scaling of a single variable.

For those of you with a statistics background,

you can think of it as calculating the z-score of each observed value.

That is, each data value is first reduced by the mean and

then divided by the standard deviation.
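As a small sketch of this calculation, the following Python snippet centers and scales one variable. The function name z_scores and the sample heights are hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption; some tools use the population version instead.

```python
from statistics import mean, stdev


def z_scores(values):
    """Center by the mean, then scale by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation (n - 1)
    if s == 0:
        raise ValueError("cannot scale a variable with zero variance")
    return [(v - m) / s for v in values]


# Hypothetical measurements of a single variable.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = z_scores(heights_cm)

print(scaled)
print(mean(scaled), stdev(scaled))  # approximately 0 and 1
```

After the transformation, the variable has a mean of about zero and a standard deviation of about one, whatever its original units were.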

One of the immediate benefits of centering and

scaling is to make numerical procedures easier to work with and more stable.

This is because the centering and

scaling ensures that multiple variables in a dataset are on a common scale.

Centering and scaling is often required or recommended for some modeling tools,

such as clustering, principal component analysis, and neural networks.

The main drawback is that the data becomes harder to interpret.

The data value after centering and scaling measures the number of standard deviations

between each data point and the mean, and it is unitless.

There are many other data transformations.

Some of them can be expressed using common mathematical functions such as logarithm,

square, square root, and inverse.

Except for the first one,

all transformations mentioned here are considered polynomial transformations

because they involve polynomial terms of the original data value.
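The four functions just mentioned can be applied value by value, as in this short Python sketch. The input values are hypothetical and chosen to be strictly positive, because the logarithm needs positive inputs, the square root needs non-negative inputs, and the inverse is undefined at zero.

```python
import math

# Hypothetical, strictly positive data values.
values = [1.0, 4.0, 9.0, 100.0]

# Apply each transformation to every data value.
transformed = {
    "log": [math.log(v) for v in values],    # natural logarithm
    "square": [v ** 2 for v in values],
    "sqrt": [math.sqrt(v) for v in values],
    "inverse": [1.0 / v for v in values],
}

for name, result in transformed.items():
    print(name, result)
```

Comparing the outputs side by side is one simple way to experiment: the logarithm and square root pull large values closer together, while the square spreads them further apart.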

Different transformations are appropriate for different problem contexts.

Sometimes we have to experiment to figure out the right one to use.