[MUSIC] Welcome back. Last time we talked about three of the four dimensions that describe how we designed this course in data science. In this segment I want to talk about the last dimension, which I call structs versus stats. This is the relative importance of data manipulation versus deeper mathematics. You can see that I've put the dial here a little bit to the left, and I'll try to motivate that in the next few minutes.
All right. We already saw one example of this in the first segment, where I used some examples of data science from recent history. One of these was Nate Silver's prediction of the Electoral College votes in the 2012 US presidential election. If you recall, this prediction was accomplished by essentially taking the average of the polls for each state. So it didn't really require a sophisticated statistical model, and yet it had massive impact.
Okay, so a quote that I think sums this up comes from Aaron Kimball at a company called WibiData, where he says, you know, 80% of analytics is really just sums and averages. What he means by this is that if you can get these sums and averages right, if you can do it at any scale on any data that you might see, then you can always build up more advanced techniques. Everything sort of boils down to sums and averages. So I think this is a motivation for focusing on data manipulation, which is typically associated with being able to express sums and averages. That is, for example, what you can do in a database query, which we'll talk about in the next couple of lectures, and it gets you a pretty long way. It gets you 80% of the problem.
Another way of looking at this is that there are three main tasks involved in a data science project: preparing to run the model, running the actual statistical model, and then interpreting the results and communicating them.
I got the animation out of order here, so you can ignore that red. But the point here, again from Aaron Kimball (the conversation with him is where I got this), is that 80% of the work is really in this first step, where you're gathering data and cleaning it and integrating it and restructuring it, transforming it, loading it and so on. Right? So, all these verbs that you see here: this is the hard part. Actually running the model, or even choosing the model and then running it, doesn't tend to keep people up at night in practice. And then the joke here is that the other 80% of the work, implying that a data science task is sort of 160% of a normal task, is in interpreting the results. This is the visualization, the communication, and the explanation of the results. So this is another reason why I want to focus in this course on the data manipulation tasks that are associated with this first task.
Another way of looking at this is a quote that's now really old, 12 years old or so at the time of this recording, from Doug Laney. This is the document that first coined the notion of big data as the three V's of volume, velocity, and variety, and we'll talk about that in a couple of segments. He has this quote: "No greater barrier to effective data management will exist than the variety of incompatible data formats, non-aligned data structures, and inconsistent data semantics." This is what the database community, my community, calls the data integration problem. This is the hard part. He was saying this back in 2001, and I would argue that it's still true today: this is the greatest barrier. In context, he was making the point that variety is harder than volume or velocity. I'll explain more about what those V's mean in a couple of segments.
All right, so another vignette here is something that we like to ask the scientists we work with.
These were astronomers and oceanographers and biologists. We asked them, sort of informally, how much time they spend "handling data" as opposed to "doing science." Now, we let them interpret those quoted phrases however they want. But doing science includes things like choosing a statistical method or designing a statistical model, which they absolutely consider part of their science. What we mean by handling data is all the other crap, the format conversions and so on. So what do you think the most common answer is? You can guess to yourself for a second, but they don't even blink, and they say things like 90%. This number should give you pause, right? This is taxpayer money that goes to federal funding agencies and comes back to pay some postdoctoral fellow to spend 90% of her time doing something that she doesn't even consider science. This is why I think it's really important, as a data scientist, to focus on this problem.
Now, you might say, well, that's just science, what about business? But as we'll try to make the point throughout the course, there's increasing alignment between what's going on in business and what's going on in science. We'll talk about that more in a couple of segments. All right, so if 90% of the problem is handling data, boy, we ought to spend a lot of attention on that.
All right, so another argument, which follows on from the first slide that I gave, is that data manipulation platforms, and databases in particular, actually go a pretty long way toward expressing more advanced things. This isn't just a matter of, oh well, you can express anything if you have sums and averages. Even for fairly advanced techniques, there's an increasing amount of interest in figuring out how to get this stuff into the database. [MUSIC]
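As a rough illustration of the "sums and averages" point from this segment, here is a minimal sketch of a per-state poll average expressed as a database query. It is not the method used in any real forecast. The table name `polls`, the column names, the helper `average_by_state`, and all the numbers are made up for the example, and it assumes Python's built-in `sqlite3` module as a stand-in for whatever database you might actually use.

```python
import sqlite3

# Toy, invented rows of (state, candidate share in percent).
# These are NOT real polling data; they only exercise the query.
POLLS = [
    ("StateA", 52.0),
    ("StateA", 50.0),
    ("StateA", 51.0),
    ("StateB", 47.0),
    ("StateB", 49.0),
]


def average_by_state(rows):
    """Return {state: (mean share, number of polls)} via a SQL GROUP BY."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        # Hypothetical schema, chosen only for this sketch.
        conn.execute("CREATE TABLE polls (state TEXT, share REAL)")
        conn.executemany("INSERT INTO polls VALUES (?, ?)", rows)
        # The whole "model" is one aggregate query: an average per group.
        cur = conn.execute(
            "SELECT state, AVG(share), COUNT(*) "
            "FROM polls GROUP BY state ORDER BY state"
        )
        return {state: (avg, n) for state, avg, n in cur}
    finally:
        conn.close()


if __name__ == "__main__":
    for state, (avg, n) in average_by_state(POLLS).items():
        print(f"{state}: mean share {avg:.1f}% over {n} polls")
```

The design point is only that the aggregation lives in the query (`AVG` with `GROUP BY`), so the same expression keeps working as the table grows. Most of the real effort would go into getting clean, consistent rows into that table in the first place.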