The first lab is about exploring the data. Why are we exploring the data? Why don't we just take all the columns in the dataset and feed them into the machine learning model? Shouldn't the machine learning model be able to figure out that some of the columns aren't needed, and maybe give them zero weight? Isn't the point of the machine learning model to learn how to combine the columns so as to get the label that we want? Well, real life doesn't work that way. Many times, the data as recorded isn't what you expect. Show me a dataset that no one is actively visualizing, whether in the form of dashboards or charts, and I'm quite confident that much of the data will be missing or even wrong. In the real world, there are surprisingly many intricacies hidden in the data, and if we use the data without developing an understanding of it, we will end up using it in a way that makes productionization very hard. The thing to remember about productionization is that in production, you have to deal with the data as it comes in. We'll see a few examples of this.

You are probably doing this specialization because you saw images, sequences, and recommendation models listed in the set of courses. However, all five courses in the first specialization focused on structured data. Why? Even though image models and text models get all the press, even at Google, most of our machine learning models operate on structured data. That's what this table shows. MLP is the multilayer perceptron, your traditional feedforward fully connected neural network with four or five layers, and that's what you tend to use for structured data. Nearly two-thirds of our models are MLPs. LSTMs, long short-term memory models, are what you tend to use on text and time-series data; those are 29% of all of our models. CNNs, convolutional neural networks, are the models you tend to use primarily for images, although you can also use them successfully for tasks like text classification; CNNs are just five percent of our models. This explains why we have focused so much on structured data models. These are, quite simply, the most common types of models that you will encounter in practice.

Our goal is to predict the weight of newborns so that all newborns can get the care that they need. The scenario is this: a mother calls a clinic and says that she's on her way. At that point, the nurse uses our application to predict what the weight of the newborn baby is going to be, and if the weight is below some threshold, the nurse arranges for special facilities like incubators, maybe different types of doctors, et cetera, so that we can get babies the care that they need. This is the application that we will build. Essentially, the nurse puts in the mother's age, the gestation weeks assuming that the baby is born today, the plurality (single, twins, et cetera), and the baby's gender if it is known. The nurse hits "Predict", the ML model runs, and the nurse gets back a prediction of, say, 7.19 pounds or 4.36 pounds, depending on the inputs, and then arranges for special facilities for the babies predicted to have low weights. That's the way it works, and this is what we will build.

For machine learning, we need training data. In our case, the US government has been collecting statistics on births for many years. That data is available as a sample dataset in BigQuery. It's reasonably sized: about 140 million rows, 22 gigabytes of data.
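As a quick preview of what working with that dataset looks like from a notebook, here's a minimal sketch. It assumes the google-cloud-bigquery client library is installed and credentials are configured for a project with BigQuery enabled; the table name and column names (weight_pounds, mother_age, and so on) are those of the public BigQuery sample.

```python
# Minimal sketch: preview a few rows of the natality sample dataset.
# Assumes google-cloud-bigquery (plus pandas support) is installed and
# application-default credentials are set up.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT weight_pounds, is_male, mother_age, plurality, gestation_weeks
FROM `bigquery-public-data.samples.natality`
LIMIT 5
"""
df = client.query(query).to_dataframe()
print(df)
```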
We can use this dataset to build a machine learning model. In reality, of course, you wouldn't want to use data this old, 1969 to 2008, but let's ignore the fact that the sample dataset stops in 2008 because this is a learning opportunity. The dataset includes a variety of details about the baby and about the pregnancy. We'll ignore the birthday, of course, but columns like the US state, the mother's age, and the gestation weeks might be useful features. The baby's birth weight in pounds is the label: it's what we're training our model to predict. Our first step will be to explore this dataset, primarily by visualizing various things, as sketched below. But before that, a quick word on how to access the lab environment.
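As a preview of that exploration, here's a minimal sketch of one such visualization, assuming the same public natality table as above. The key idea is to aggregate in BigQuery rather than in pandas, since pulling 140 million rows into a notebook isn't practical; only the small summary comes back for plotting.

```python
# Minimal sketch: aggregate in BigQuery, then plot the small summary locally.
from google.cloud import bigquery
import matplotlib.pyplot as plt

client = bigquery.Client()

query = """
SELECT
  mother_age,
  AVG(weight_pounds) AS avg_weight,
  COUNT(1) AS num_babies
FROM `bigquery-public-data.samples.natality`
WHERE weight_pounds IS NOT NULL
GROUP BY mother_age
ORDER BY mother_age
"""
df = client.query(query).to_dataframe()

# Plot the label (average birth weight) against one candidate feature.
df.plot(x='mother_age', y='avg_weight', kind='line')
plt.xlabel("Mother's age")
plt.ylabel('Average birth weight (pounds)')
plt.show()
```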