Welcome to the Big Data demo using BigQuery on Google Cloud Platform. Here, we're going to show the serverless scaling features of BigQuery: how it scales up automatically behind the scenes, without your intervention, to query large datasets. We're going to be querying 10 billion rows of Wikipedia data. First things first, we're going to follow the demo script, which always has the code that I'm going to be running. Everything I'll be running is inside of our demos folder in our public repository. First up, we're going to copy the query to the clipboard and navigate to BigQuery. Inside of the Google Cloud Platform, I already have BigQuery open, but if you needed to navigate to it, it's in the navigation menu. I have it pinned up here by starring it, but if you scroll all the way down under Big Data, you'll find BigQuery. So that you don't have to continuously scroll and search, I just pin the products I use most: I'm often using AI Platform Notebooks for machine learning work, Composer for data engineering work, and BigQuery for data analysis work.

Once you're inside of BigQuery, we're going to paste our query into the query editor window. Notice that where you're getting datasets from, where your data actually lives, is under Resources, and one of the very popular public datasets available is Wikipedia. It's one of many: you can get airline data for flights, Reddit data, geographic data, and the Wikipedia benchmark for very, very large datasets. If you were given a script, one of my favorite hotkeys is to hold down the Command key on a Mac (on Windows, I think it's the Windows key); that highlights all of the tables inside of your query and turns them into buttons. If you click on one, you jump straight to its schema. It's a great way to iterate between what the columns are, the table details, and a preview of the data, versus the query results. Again, that's just a cool hotkey. All the shortcuts that I mention are available if you open up that module; you'll get the shortcuts there as well.

Ten billion rows: is it really 10 billion rows? The fastest way to find out is in the Details tab. Here we go: 10 billion, 600 million rows of Wikipedia data. What type of data are we talking about? What are we actually going to be querying here? The schema is not too wide: it's the year, month, and day, the Wikipedia project, the language it's in, the title of the Wikipedia page, and how many views it has. It's just a lot, a lot of rows. So what type of operation are we going to do? Well, you can see our query, when we run it, is going to go through 10 billion rows, which is about 415 gigabytes of data. Let's see how fast it does that. It's going to return not just columns, it's going to do a calculation. It's basically going to say: give me the language the Wikipedia page was written in, give me the title of that page, and give me the total number of views, wherever the name Google appears somewhere in the title of any of these articles. It has to be a capital G because the LIKE match is case-sensitive; I'll show you how to ignore that in just a second with a function. Of course, anytime you're doing aggregations you add the GROUP BY, and once you have the pages that have Google somewhere in the title, we order the top pages by view count first, which I'm assuming is just going to be the page called Google. Let's go ahead and run this.
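For reference, here's a rough sketch of the kind of query we're running. The table name Wiki10B in the public bigquery-samples.wikipedia_benchmark dataset and the exact column names are my assumptions based on that public benchmark, so treat this as illustrative rather than the exact demo script.

    -- Illustrative sketch of the demo query (table and column names assumed,
    -- based on the public Wikipedia benchmark dataset).
    SELECT
      language,
      title,
      SUM(views) AS total_views
    FROM
      `bigquery-samples.wikipedia_benchmark.Wiki10B`
    WHERE
      title LIKE '%Google%'   -- capital G matters: LIKE is case-sensitive
    GROUP BY
      language, title
    ORDER BY
      total_views DESC;

The percent signs on either side are the wildcards mentioned below: any characters before, any characters after.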
How long does it take to process 400 gigabytes? We're running, and again, you're not a DBA, you're not managing indexes or anything like that; you just have your SQL query. It's not even our data or our servers, we're just using somebody else's dataset. You can see that when I recorded this video, we got 10 seconds for 415 gigabytes processed, and here's your insight. It reached out and went through 10 billion rows, 10 billion pages of Wikipedia data stored here. A LIKE is a rather expensive operation: it doesn't just look at the column, it has to look inside that string value and find whether the word Google appears anywhere within it. The wildcard character, the percent sign, means any characters before and any characters after. Then it sums up those total views, and it did all of that pretty quickly. In total, there are 214,000 pages with Google somewhere in the name; the most popular page is the English page for Google, the Spanish page for Google shortly after that, and then Google Earth, Google Maps, and Chrome as well.

Now, of course, if you wanted to make this not case-sensitive, one thing you could do is wrap the title so everything becomes uppercase. Then you would have to uppercase the match string as well, so you're matching like for like. If you're doing wildcard matching with a LIKE, it's a good idea to use UPPER or LOWER, or if you're experienced with regular expressions, you can use those as well. That is 10 billion, and the really cool thing behind the scenes is in the execution details, where you can see how it actually did this. It took you, the human watching it, 10 seconds. Behind the scenes, if you were to do it serially, linearly, stacking up all the work that the computers did, it would be two hours and 38 minutes for one computer to do it, essentially. That's the beauty of the distributed parallel processing that happened behind the scenes. You don't have to care about how many virtual machines were spun up to do this work, but in aggregate they did almost three hours of work automatically, and they shared a lot of data between themselves as well. You can see the process of going from those 10 billion records all the way down, after the aggregations, to outputting the result that you see there.

That's cool, 10 billion; let's see if we can do 100 billion. Let's see if we have that dataset. I think it's literally just adding another zero, so why not go bigger? Again, if you want to get back to that dataset, use the hotkey and you have more information here. Yeah, we do: we've got the title, and I think it's largely the same schema. On to the details: cool, we've got a really big dataset, about six terabytes, and way more records. Same principle, an expensive operation where it has to go into every single field. How long do you think it's going to take to go through 100 billion records, open up every single title, and see whether or not somewhere in that title is the string literal Google? Once it has that result, it has to take it and all of its friends across the other 100 billion, or at least those that match, and sum them all together. The virtual machines have to communicate with each other when they're doing aggregations; that's where that shuffling step comes into play. Let's watch this data process. Less than a minute, just over 30 seconds: it went through 4.1 terabytes of data and gave us the result there. You can see it's almost a full day of computing if you were going to do that on just a single machine.
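Putting the two ideas from this step together, here's a hedged sketch of what the case-insensitive variant against the 100-billion-row table might look like. The table name Wiki100B and the REGEXP_CONTAINS alternative are my assumptions, not the exact demo script.

    -- Case-insensitive variant against the larger benchmark table (names assumed).
    SELECT
      language,
      title,
      SUM(views) AS total_views
    FROM
      `bigquery-samples.wikipedia_benchmark.Wiki100B`
    WHERE
      UPPER(title) LIKE '%GOOGLE%'
      -- Alternative, if you prefer regular expressions:
      -- REGEXP_CONTAINS(title, r'(?i)google')
    GROUP BY
      language, title
    ORDER BY
      total_views DESC;

The point is the same either way: both sides of the comparison are normalized, so "Google", "google", and "GOOGLE" all match.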
It doesn't even tell you how many machines were there behind the scenes. That slot time is a phenomenally interesting metric that really shows you the scale. You waited 31 seconds, and behind the scenes, on machines you don't even have to manage, we used essentially 24 hours of compute if you add it all up, boom, just like that. When you don't need it anymore, you're obviously not paying for those machines; you're just paying for the bytes of data that were processed. That's the demo of BigQuery at scale.
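If you want to look at that slot time metric yourself afterwards, one hedged option, assuming the BigQuery INFORMATION_SCHEMA jobs views are available to your project in the US region, is a query along these lines; double-check the column names against your own project before relying on it.

    -- Hedged sketch: compare elapsed time to total slot time for recent queries.
    -- Assumes the `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT view is available.
    SELECT
      job_id,
      total_bytes_processed,
      total_slot_ms,
      TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS elapsed_ms,
      -- Rough "how parallel was this?" ratio: slot time divided by wall-clock time
      ROUND(total_slot_ms / TIMESTAMP_DIFF(end_time, start_time, MILLISECOND), 1) AS avg_slots
    FROM
      `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE
      job_type = 'QUERY'
    ORDER BY
      creation_time DESC
    LIMIT 5;

As a rough sense of scale from the numbers quoted above: roughly 24 hours (about 86,400 seconds) of slot time over a 31-second wait works out to something on the order of 2,800 slots working in parallel on average.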