0:53

I think the key challenge in pretty much any data analysis was well characterized by Dan Meyer, a mathematics educator who taught high school mathematics. In his TED talk he said: ask yourselves, what problem have you ever solved that was worth solving, where you knew all the given information in advance? Where you didn't have a surplus of information and have to filter it out, or have insufficient information and have to go find some? I think that's a key element of data analysis: typically, you don't have all the facts, or you have too much information, and a lot of the process of data analysis is sorting through all of that.

And so the first, and most important, part of data analysis is to define a question. Not every data analysis starts with a very specific or coherent question, but the more effort you can put into coming up with a reasonable question, the less effort you'll spend having to filter through a lot of stuff. The reason is that defining a question is the most powerful dimension reduction tool you can ever employ.

Because if you're interested in a specific variable, like height or weight, then you can remove a lot of other variables that don't pertain to it at all. And if you're interested in a different type of variable, then you can remove another subset. So the idea is that if you can narrow down your question as specifically as possible, that will serve to reduce the noise you'll have to deal with when you're going through a potentially very large data set. Now, sometimes you just want to look at a data set and see what's inside it, and then you'll have to explore all kinds of things. But if you can narrow your interest down to a specific type of question, that can be extremely useful for simplifying your problem.
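To make the dimension reduction idea concrete, here is a toy sketch in Python. The variable names and values are invented for illustration, not taken from any real data set:

```python
# Toy illustration: once the question is about height and weight,
# every other variable can be dropped up front. All names and values
# here are made up.
rows = [
    {"height": 170, "weight": 65, "eye_color": "brown", "zip": "21205"},
    {"height": 182, "weight": 80, "eye_color": "blue", "zip": "02139"},
]

question_vars = {"height", "weight"}  # defined by the question we asked
reduced = [{k: v for k, v in r.items() if k in question_vars} for r in rows]

print(reduced)
```

Each record now carries only the two variables the question pertains to; everything else never has to be examined.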

So I encourage you to think about what type of question you're interested in answering before you delve into all the details of your data set. Generally speaking, the science will determine what type of question you're interested in asking, and that will lead you to the data, which may lead you to applied statistics, which you use to analyze the data. And then, if you get really ambitious, you might want to think of some theoretical statistics that generalize the methods you apply to different types of data. Now, of course, there are relatively few people who can do that, so that would not be expected of everyone.

The part that's in the red bracket, number one, is typically referred to as statistical methods development. The part that's in the purple bracket, number two, which is the application of statistics to raw data without any sense of the science, is what I would refer to as the danger zone, a term I derive here from the Venn diagram of data science drawn by Drew Conway. The idea is that if you just randomly apply statistical methods to data sets to find an interesting answer, you will almost certainly find something interesting, but it may not be reproducible and it may not be really meaningful. So I think a proper data analysis has a scientific context, and it hopefully has at least some general question that we're trying to investigate, which will narrow down the dimensionality of the problem. And then we'll apply the appropriate statistical methods to the appropriate data.

4:20

So, let's start with a very basic example of a question. A general question might be: can I automatically detect emails that are spam and those that are not? Of course, this is an important question if you use email. You want to know which emails are important and should be read, and which are just spam. Now, if you want to turn that into a data analysis problem, there are many ways you could answer this question. For example, you could just hire someone to go through your email and figure out what's spam or not, but that's probably not very sustainable, and it's not particularly efficient. So, to turn this into a data analysis question, you have to make the question a little bit more concrete and translate it using terms that are specific to data analysis tools. A more concrete version of this question might be: can I use quantitative characteristics of the emails themselves to classify them as spam or ham? Okay, so now we can start looking at emails and thinking about what quantitative characteristics we want to develop so that we can classify them as spam.
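As a sketch of what "quantitative characteristics" might look like, here is a hypothetical Python function. The particular features (capitalization, exclamation marks, and so on) are my own illustrative choices, not the ones from the lecture:

```python
def email_features(text):
    """Turn one raw email into a few quantitative characteristics.
    These features are illustrative; real spam filters use many more."""
    words = text.split()
    n = max(len(words), 1)  # avoid dividing by zero on an empty email
    return {
        "frac_all_caps": sum(w.isupper() for w in words) / n,
        "num_exclamations": text.count("!"),
        "mentions_free": int("free" in text.lower()),
        "num_chars": len(text),
    }

print(email_features("WIN a FREE prize!!! Click now!"))
```

A classifier would then work on these numbers rather than on the raw text.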

5:50

And depending on the goal and the type of question you're asking, different data sets will be ideal. If you're interested in a descriptive problem, you might think of a whole population, so you don't need to sample anything; you might just want the entire census or population that you're looking at, all the emails in the universe, for example. If you just want to explore your question, you might take a random sample with a bunch of variables measured. If you want to make inference about a problem, then you have to be very careful about the sampling mechanism and the definition of the population that you're sampling from, because typically, when you make an inferential statement, you're drawing from a sample to make a conclusion about a larger population, so there the sampling mechanism is very important. If you want to make a prediction, then you're going to need something like a training data set and a test data set from the population you're interested in, so that you can build a model and a classifier. If you want to make a causal statement, so you want to say, okay, if I modify this component, then something else happens, then you're basically going to need experimental data, and one type of experimental data comes from something like a randomized trial or a randomized study. And if you want to make mechanistic types of statements, you need data about all the different components of the system that you're trying to describe.
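The training set/test set idea for prediction can be sketched in a few lines of Python. The toy labeled emails, the seed, and the 70/30 split are all assumptions made for illustration:

```python
import random

# Made-up labeled emails standing in for a real collection.
emails = [
    ("WIN FREE cash!!!", "spam"), ("Meeting moved to 3pm", "ham"),
    ("Cheap meds online", "spam"), ("Lunch tomorrow?", "ham"),
    ("You are a WINNER", "spam"), ("Draft report attached", "ham"),
]

random.seed(42)                # fixed seed so the split is reproducible
random.shuffle(emails)
cut = int(0.7 * len(emails))   # a 70/30 split is an arbitrary common choice
train, test = emails[:cut], emails[cut:]

# Fit the classifier on `train` only; evaluate it on `test`, which the
# model never sees during fitting.
print(len(train), len(test))
```

Keeping the test set untouched until the end is what lets you honestly estimate how the classifier will do on new emails.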


So, for our problem here with spam, one ideal data set, perhaps, would be this: if you use Gmail, you know that all the emails in the Gmail system are going to be stored in Google's data centers. So why don't we just get all the emails in Google's data centers? That would be a whole population of emails, and then we could build our classifier based on all this data, and we wouldn't have to worry about sampling, because we'd have all the data. So that would be a kind of ideal data set.

7:42

So, of course, in the real world, you have to think about what data you can actually access. Maybe someone at Google can access all the emails that go through Gmail, but even in that extreme case it may be difficult, and furthermore, most people are not going to be able to access that. So sometimes you have to go for something that's not quite the ideal data set. You might be able to find free data on the web, or you might need to buy some data from a provider. In these kinds of cases, you should be sure to respect the terms of use for the data: any agreement or contract about the data that you've agreed to has to be adhered to.

8:23

And if the data simply do not exist out there, you may need to generate the data yourself in some way. So, getting all the data from Google will probably not be possible, because, I'm guessing, their data centers have some very high security, and so we're going to have to go with something else. One possible solution comes from the UCI Machine Learning Repository, which has the Spambase data set. This data set was created by people at Hewlett-Packard, who collected a couple thousand messages, spam and regular messages, and classified them appropriately. So you can use this database to explore your problem of how to classify emails into spam or ham.
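The Spambase file is plain comma-separated text, with the feature values for one message per row and a 0/1 spam label in the last column. A minimal sketch of reading that layout, using two made-up rows (the real file has 57 features per message):

```python
import csv
import io

# Two invented rows in the Spambase layout: feature values, then a
# 0/1 label (1 = spam, 0 = ham) in the last column.
raw = io.StringIO("0.0,0.64,0.64,0.32,1\n0.21,0.28,0.5,0.0,0\n")

rows = [[float(x) for x in line] for line in csv.reader(raw)]
features = [row[:-1] for row in rows]   # the quantitative characteristics
labels = [int(row[-1]) for row in rows]

print(labels)
```

With the features and labels separated like this, any classifier can be trained against them.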

9:07

So, when you obtain the data, the first goal is to try to obtain the raw data, for example from the UCI Machine Learning Repository. You have to be careful to reference the source: wherever you get the data from, you should always reference the source and keep track of where it came from. If you need to get data from a person or an investigator that you're not familiar with, often a very polite email will go a long way; they may be willing to share that data with you. And if you get data from an internet source, you should always, at the very minimum, record the URL, the web address where you got the data, and the time and date that you accessed it, so people have a reference for when that data was available. In the future, the website might go down, or the URL may change or may not be available, but at least at the time you got that data, you documented how you got it.
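Recording the URL and access time can be a one-step script, saved alongside the downloaded file. The URL below is a placeholder, not the real download location:

```python
import json
from datetime import datetime, timezone

# Provenance record: where the data came from and when it was fetched,
# so anyone reproducing the analysis has the reference.
provenance = {
    "source_url": "https://example.org/spambase.data",  # placeholder URL
    "date_accessed": datetime.now(timezone.utc).isoformat(),
}

print(json.dumps(provenance, indent=2))
```

In practice you would write this record to a small file (for example `provenance.json`) next to the raw data.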

10:39

You have to understand where the data come from. For example, if the data came from a survey, you need to know how the sampling was done: was it a convenience sample, did the data come from an observational study, did it come from an experiment? The source of the data is very important. You may need to reformat the data in a certain way to get it to work in a certain type of analysis, and if the data set is extremely large, you may want to subsample it to make it more manageable. And anything you do to clean the data, it is very important that you record these steps and write them down, in scripts or whatever is most convenient, because you or someone else is going to have to reproduce these steps in order to reproduce your findings. If you don't document all these pre-processing steps, then no one will ever be able to do it again.
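Subsampling is exactly the kind of cleaning step that should live in a script, with the random seed written down so the same subsample can be regenerated later. A minimal sketch, where the sizes and the seed are arbitrary choices:

```python
import random

big_data = list(range(100_000))  # stand-in for an extremely large data set

SEED = 20130101                  # recording the seed makes this reproducible
random.seed(SEED)
subsample = random.sample(big_data, k=1_000)

# Anyone re-running this script gets exactly the same subsample.
print(len(subsample))
```

Re-seeding and redrawing yields the identical subsample, which is what lets someone else reproduce your findings from the same raw data.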

So, once you've cleaned the data and you've gotten a basic look at it, it's important to determine whether the data are good enough to solve your problem, because in some cases they may not be. You may not have enough data, you may not have enough variables or characteristics, or the sampling of the data may be inappropriate for your question. There may be all kinds of problems that you only realize as you clean the data. And if you determine the data are not good enough for your question, then you've got to quit and try again: change the data, or try a different question. It's important not to just push on with the data you have, just because that's all you've got, because that can lead to inappropriate inferences or conclusions.
