Now we'll look at how you can model exogenous performance data on competing chip manufacturers. Based on the results of your analysis, you can decide which manufacturers will likely see increases in their share price due to their competitive advantage. The first thing you need to ask yourself is, what data do I have? This determines the form of the model you will create. Is the data a numerical time series, or is it a qualitative ranking like pass or fail? Before investing the time and effort to create a model from your data, you should consider whether it is private data to which you have exclusive access, or public data that anyone can use. Private data is much more valuable in that it gives you the ability to model a unique feature that you can use to predict changes in the company's performance and price. Public data, on the other hand, may have already been used by other trading groups to create models and trading strategies. It may also be fully incorporated into the current share price, and so have no value in predicting future price changes. The only thing worse than having no data is using old data to predict changes in share price. It's as if you've got today's winning lottery numbers in tomorrow's mail; it doesn't do you much good. Old data has two downsides. One, it may not reflect the current business environment that the company faces. And two, it's already incorporated into the share price, and so has no trading value. Models, even purely statistical models, have implicit assumptions. The most common assumption is that the factors and features modeled using historical data will accurately predict future changes in share price. This assumes that the economic and market environment in which shares trade is fairly static. That's a reasonable assumption over short time horizons, but fails miserably over long time periods. Your model needs to be retested constantly with fresh data.
And your assumptions need to be re-validated based on the current performance of your model. Sometimes you're lucky enough to get both good-quality endogenous data and private exogenous data. This allows you to model both the current price trends and potentially predict longer-term changes in share value. Qualitative variables such as competitiveness and performance rankings can also have a large impact on long-term value. Now we'll look at how you can model exogenous performance data on competing chip manufacturers. Based on the results of your analysis, you will be able to decide which manufacturers are likely to see increases in their sales and share price due to their competitive advantage. Let's say you're an analyst at a technology fund. Specifically, your expertise is semiconductor manufacturers like Intel and AMD. The dataset that you have to analyze and model is the performance of CPU chips, or processors. This is a real public dataset that we will demo and make available to you in the course references. This exercise in modeling CPU performance by vendor is taken from Sarah Robinson's blog post on BigQuery ML. She's a developer advocate for Google. So what data are we given? Let's say that instead of a public dataset, this is a private dataset you created through your own research. It has all the characteristics of the CPUs that different vendors are producing. One of your goals should be to compare the performance of these processors against each other to see if there are any clear leaders or laggards among the vendors. If you can get a winning model that forecasts chip performance before it hits the market, it might help your fund value chip vendors more accurately. So given all these inputs about the chip specifications, like speed, memory, operating system, cache, etc., can you predict the benchmark scores, say from zero to 100, that you could use for comparison? Here's the complete code for creating a machine learning model with a dataset using BigQuery ML.
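As a sketch of what that complete code looks like, here's a minimal BigQuery ML statement in the style of Sarah Robinson's example. The dataset, table, and column names below are illustrative assumptions, not the exact names from the SPEC dataset:

```sql
-- Minimal sketch: train a linear regression that predicts a CPU's
-- benchmark score from its specs.
-- `cpu_performance.cpu_specs` and the column names are assumed here.
CREATE OR REPLACE MODEL `cpu_performance.cpu_benchmark_model`
OPTIONS (
  model_type = 'linear_reg',             -- numeric label, so linear regression
  input_label_cols = ['benchmark_score'] -- the column we want to predict
) AS
SELECT
  vendor, model_name, mhz, num_cores, num_chips,
  l1_cache_kb, l2_cache_kb, l3_cache_mb, memory_gb,
  operating_system, compiler, sponsor,
  benchmark_score                        -- known past results (the label)
FROM
  `cpu_performance.cpu_specs`;
```

Everything else, splitting the data, fitting the weights, and reporting loss metrics, is handled by BigQuery ML behind the scenes.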
At the top, we give our model a name. Then, since we're predicting a numeric benchmark score, we'll choose linear regression as our model type. There happens to be an existing benchmark for CPUs that you can use to try to predict performance for these newer models. One commonly used CPU benchmark is called 631.deepsjeng_s. It's based on the computer program Deep Sjeng, which was the 2008 world computer speed chess champion. This benchmark program stress-tests CPUs by making them play expert-level chess, computing very large decision trees and advanced move ordering. Say for the sake of argument that the benchmarking exercise took a week to run on new CPUs. If you were given the specs of these CPUs right as the test was starting, and you built a model that accurately predicted the score of each CPU using ML instead of waiting for the actual results a week later, you could have valuable insights into which vendor CPUs were under- or over-performing. You could then use this relative performance data to project increases and decreases in sales, which could impact vendor share price. Let's see a quick demo of running this model in BigQuery ML. >> Hi, I'm Evan. Let's do a test drive of BigQuery machine learning to see just how easy it is to create machine learning models with just SQL, right where your data already lives inside of BigQuery. For this example use case, we'll just be using a benchmark dataset of CPU characteristics, that is, how fast the computer chip inside of your computer is. Given known information about it, can we predict its performance? Let's get started. So here's the dataset inside of BigQuery. One of the tips and tricks that I like to use is, if somebody gives you a query like this one, you can actually hold down the Command key or the Windows key on your keyboard, and it turns all of those tables into buttons that you can click on. Why is that useful?
Say you were somewhere else in our query and you clicked on this; it would bring up the schema, the details, and the preview of the dataset. So if you have multiple different tables inside of this query, you just click on this first one and you can say, what are the columns that I have? Well, there's the CPU spec, the integer speed of its processing, some of these characteristics. So without giving too much away, essentially these computers are going to be running very fast chess algorithms. They're trying to basically optimize a bunch of moves inside of chess, which could take a human years to do. But since computers are very good at math, some processors were better at doing this than others. So what information do we have for the columns? Well, we have a vendor, like Intel and AMD. We have the model name of the CPU. You've got the megahertz, how fast it is, how many cores it has, how many chips, all things that are relevant to the CPU processing. If you're familiar with CPUs, a lot of these might look familiar. If you're not, they're just all kinds of features where we can basically say, well, I don't know if the number of chips or the amount of L2 cache memory in kilobytes is going to be a better feature for our machine learning model to predict overall performance. And the good news is you don't have to know. Machine learning is not about creating these if-then or case-when statements. Throw in all the features that you think might be useful, and let the model figure out the relationship between all of your input features and whatever you're doing the prediction on. So let's see what I'll actually be doing the prediction on. If you scroll all the way down, you'll notice that there are some benchmarks here. Let me just make this a little clearer on the screen. All the way at the bottom you're going to see that there is some _600, _602, _605, _620.
These are industry standard benchmarks, essentially different types of standard algorithms that CPU vendors can bring their CPUs to and run. It's a standardized test. If you've ever taken university or school courses, it's the same idea: the same test is run against different hardware to assess its performance, and it comes down to one number. So for example, if you wanted to see what this data looks like in the preview in BigQuery, we have all of these inputs: cores, chips, channels, and again, I'm showing the limits of my understanding of hardware, what operating system it's running on, what compiler it's running on, who's sponsoring the test. And then all of these columns here are how well those computer chips did in the past. So we have the right answer, which is your label column in machine learning terms, in the past. So we know that row 1 here, this Intel Xeon Gold 5115, for this given test, the 631 deepsjeng_s, scored 4.48, whatever that means. Generally, I'm assuming lower is better, or maybe higher is better; you have to check the spec for that. But mainly what I want to try to do with machine learning is, can we predict how well one of these CPUs does, given columns 1 through 20, everything except for the test result? Let's train the model to try to make this prediction. Not actually running it through the test, but predicting its performance given all these characteristics, how well you think it's going to do. A good analogy is a student studying for a test. If one student is studying for a test five hours a night for a week, and another one just decides, you know what, I don't want to study, and only studies one hour or zero hours, you can start to make predictions of how well they're going to do on that test.
Similar argument here, except a little bit more complicated in that you have a lot of different columns that may be interrelated with each other and carry more weight and more importance. But you let machine learning figure that out, okay? So to review, we have a lot of input features. They're just columns inside of SQL. And then we have one label; in this particular case we'd ignore all the rest of these tests except for the 631. And basically see if, knowing the information that we have on this from the past, we can predict the test for similar specs of hardware in the future. And the good news is, I mean not to give away the ending, but you can actually predict with relative accuracy, because there's a very strong relationship between some of these processor specs, as you can imagine, and how well it performs in the test. But again, machine learning is not about you saying, if the cores are greater than two, and this and that, and it's Sunday when you run the test, then it's going to be a greater-than-five benchmarking result. You don't need to do that. You just feed the machine learning model data, and it comes out with the result. So you can do the SELECT star, but again, the faster way to preview data in BigQuery is clicking that preview tab. How many different processors are we testing? Well, we have a small amount, we've got 1,400. All this data is publicly available under the fair use license from SPEC.org. And they actually provide a description, if you're wondering what this benchmark is. It's a program that's based on that world computer speed chess champion, and it's kind of cool. You have to have a standardized test on something, so why not pick something that has a lot of math, which chess is, memorizing those opening moves. All right, so back to BigQuery. So given the fact that all of these columns are inputs, how do you create a model to predict? Well, you can follow along in Sarah Robinson's blog post; this is where I got this entire narrative from as well.
Sarah Robinson from developer relations at Google has an amazing write-up on this, so this is just my condensed version of her original blog post. So to create a model inside of BigQuery, the first thing you need is some training data. The training data that we're going to have is basically, well, let's pass in the vendor name, the model name, the megahertz, the speed of the memory, how much memory in gigabytes, the L1, L2, and L3 cache, operating system, compiler, the sponsor. Whatever you think is right, and again, machine learning is iterative. If you don't get good model performance from these features (again, features are columns), you can come back, pick different ones, and retrain your model as often as you need. So it's literally just a SELECT statement, and I'll show you what this data looks like. So given all of these features all the way through here, can we predict this last column, this score? So we're going to try to predict that. And again, we're going to train the model on the past, saying, hey, rip through all these examples and try to learn a relationship. I'm not going to give you any more business logic or rules, I'm just going to give you the data. Can you learn the relationship between all the columns, from sponsor on earlier in the query, and can you predict this numeric field? Because we're predicting a numeric field, that really shortens the list of model options that we want to use. A good model option for predicting a numeric field, like next month's sales, here it's just a float or a decimal figure, is going to be a linear regression. So linear regression: when you're predicting a numeric field, you might want to start with linear regression. You might have heard of other models, like deep neural networks and whatnot. Absolutely, you could do those. But linear regression happens to be really fast.
So if you get good performance on linear regression inside of BigQuery, you might be able to say, okay, cool, can I squeeze out a little bit more performance by using a more sophisticated model type? There's absolutely nothing wrong with starting with linear regression. So the actual syntax for this, let me zoom in a little bit, looks like this. CREATE OR REPLACE MODEL, and you give the model a name. In BigQuery, what's before the dot, the prefix here, is generally projectname.dataset. A dataset is just a container that can have tables, SQL views, and machine learning models. So the dataset is CPU performance; I just happened to name it that. And then the model name is going to be the spec int speed 631 test. Now, if you omit the project name here, BigQuery will assume whatever current project you're on. So right now, this is my project, data-to-insights. And literally the only options that you need to provide, and I'll provide a link where you can actually see all the different model types from BigQuery ML, are a model type, linear regression, and then what you're asking the model to try to predict in the future, knowing the right answer in the past. Because you know the right answer in the past, this is called supervised machine learning. If you didn't know the answer in the past, and you're not training the model with any right answers, that's unsupervised machine learning. There you'll have machine learning models like k-means clustering, which is also available in BigQuery ML. But since we know the right answer in the past, we're going to say, given the relationship you know between all these historical scores, can you predict? Given a new test, even before it's run and has a benchmark, can you predict what the outcome is going to be? Much like that student that's going to take that test next week: given their sleep patterns and how often they've been studying for the last two weeks, can you predict their score, A plus, B minus, that kind of thing?
So it's literally just give it a name and specify these options; BigQuery handles a lot of the model performance behind the scenes for you. So what I'm going to do, actually, I've already created this, but it actually creates really fast. In the olden days of machine learning, running a model, even a linear regression, could take multiple days. Even for some of the production machine learning models at Google, training and retraining them was a matter of hours. Here, when you run this, it's generally about one to two minutes for a very simple linear regression model, and most of the BigQuery ML models will run in about ten minutes or less. So you can see this one has actually already been created, so maybe it already pulled from some kind of cache, which is cool. So it took only 10 seconds to create the actual model, which is kind of insane. There's a button that says go to model. The next thing you want to do is evaluate the performance of the model. If you go to the model, you can get the details of when it was created, the model type, and the loss metrics for the performance of the model. In linear regression terms, essentially, you're making that line. If the actual CPU benchmark is four seconds, or whatever, to complete that speed chess algorithm, and the model predicted five, the line is the actual, and the dot at that five is what you predicted. The distance from that five to the four, that's your error bar, and if you take the average of all those error bars, you get what's called the mean error. Generally, on the evaluation tab, you'll also see something called the absolute error. You don't want your positive and negative errors to cancel out; if you were wrong up by one, and then wrong down by negative one, you don't want those to cancel out and say that you have perfect model performance. So we take the absolute value of them first.
So that's the mean absolute error. Generally you'll have the mean squared error as well. And then the last thing that you might see in machine learning algorithms is the root mean squared error, RMSE. Generally, the lower the value the better, so the closer to zero the better. The lower the loss value the better, and this one is actually really, really good. But if you don't like just looking at the stats, and you want to look at actual predictions to see how well you did, one of the last things you can do, now that the model has been trained and created, is invoke it with a special command called ML.PREDICT. So this last part of the query is, we have this existing model now. Now what I want to do, in this last part, is pass in some input columns of data and get back what the predicted score is for the label on that trained model. That will always come back as predicted_ and then whatever the original label field was. So when you see predicted as part of the schema of the fields in the model, that's the new column; it's the model's prediction. But since this is supervised machine learning, we also know what that actual value was in the past, and you just invoke ML.PREDICT. So let's go ahead and run this. What I've done also, in this row 51 here in the code, is I've taken the difference, essentially the predicted value in seconds minus the actual value, rounded it, taken the absolute value, and then sorted by it. It'll make 1,400 predictions, sorted with the worst ones first, so: how far off was it? So now you can see how well this model does. So given this Xeon Processor E5, the predicted score it made, and the model just went bonkers here for some reason, was -605.
The actual score was 4, so the difference, as you can see, is massive. For maybe these first 5, where the error is over 100, there's something that threw the model off. Maybe we can go back in and see. There was probably a field that it tried to learn from and, for whatever reason, just didn't understand. So maybe adding more input data for that would be a little bit better. But again, this is the worst case: only these 5 out of 1,043, let me just bring my screen up here, out of 1,043 are over 100 in error. As you start scrolling down, since I ordered it from greatest to least, you start seeing that the error is plus or minus one out of four, which is still pretty bad. But okay, what record are we on now? This is 17 out of 1,400, and now we're down below 0.5, and everything below here is even less than that. So you can see that by the time you get to even just record 100 out of 1,473, the error is down to plus or minus 0.1 against a value of 5. On the next page, you can imagine, it gets even better, 0.08, and by the time you get to the last page, the predictions are exactly matching for all of the last 73 that you see here. What this means is that the relationship between the features holds, and this kind of makes sense for CPUs, right? Given the specs, you should generally be able to predict the performance, so that absolutely holds true here for a linear model. Just the fact that we have this dataset and can make those predictions, even without having run those tests, is super beneficial. So the big takeaway is, even if you don't deal in the world of CPUs, the fact that you can quickly iterate and build a machine learning model, on a numeric field here, or, if you're doing a buy or no-buy, maybe a classification model for those two buckets, those two classes. You can very quickly iterate right where your data lives inside of BigQuery ML.
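The evaluation and prediction steps from the demo can be sketched in two short queries. Again, the model, table, and column names are assumptions carried over for illustration:

```sql
-- 1) Inspect the trained model's loss metrics.
SELECT
  mean_absolute_error,
  mean_squared_error,
  r2_score
FROM
  ML.EVALUATE(MODEL `cpu_performance.cpu_benchmark_model`);

-- 2) Make predictions and rank them by how far off they are,
--    worst misses first, as in the demo.
SELECT
  model_name,
  predicted_benchmark_score,  -- ML.PREDICT adds the predicted_ column
  benchmark_score,            -- the known label, since this is supervised
  ABS(ROUND(predicted_benchmark_score - benchmark_score, 2)) AS abs_error
FROM
  ML.PREDICT(MODEL `cpu_performance.cpu_benchmark_model`,
             (SELECT * FROM `cpu_performance.cpu_specs`))
ORDER BY
  abs_error DESC;
```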
So I highly encourage you, as you get started on your journey with BigQuery machine learning, to check out the documentation. I'll just flip over here; you can just Google BigQuery ML. This is just the tip of the iceberg. So this is a very simple linear model. BigQuery ML, as of last year, now supports training deep neural networks, just as a different model type in that CREATE OR REPLACE MODEL syntax: deep neural networks, recommender models using matrix factorization, k-means clustering. You can actually run TensorFlow models on BigQuery ML; if you previously created them in TensorFlow, you can import them and run them in BigQuery ML as well. So if you're already familiar with SQL, or you have data scientists on your team and you kind of want to combine your skills, it's a great tool to get started with. Let me show you the syntax here. Here we go. The BigQuery ML CREATE MODEL SQL syntax; this shows you all the different model creation options. So if you're wondering, hey, I already know a lot of TensorFlow or scikit-learn, does BigQuery ML let me do things like control regularization? Does it allow me to control the split of the training, evaluation, and testing dataset? If you're not familiar with any of these terms, BigQuery ML will automatically handle a lot of this for you. But if you're an ML expert already, you can actually have control over a lot of those hyperparameters. So if you look in the CREATE MODEL syntax, the option list that I showed you only had about two things, right, the model type and the label, but it actually gets pretty deep. You can specify early stopping. You can specify whether or not you want certain class weights going into a classification model. Number of clusters, if you're doing k-means clustering. Optimization strategy, whether you're using batch gradient descent or an automatic strategy.
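Some of those hyperparameters show up as extra entries in the OPTIONS list. Here's a hedged sketch, with illustrative (not tuned) values and the same assumed names as before:

```sql
-- Sketch of a fuller OPTIONS list; the specific values here are
-- illustrative assumptions, not recommendations.
CREATE OR REPLACE MODEL `cpu_performance.cpu_benchmark_model_v2`
OPTIONS (
  model_type = 'linear_reg',
  input_label_cols = ['benchmark_score'],
  l2_reg = 0.1,                        -- control regularization
  early_stop = TRUE,                   -- stop training when loss flattens
  data_split_method = 'AUTO_SPLIT',    -- training/evaluation split handling
  optimize_strategy = 'AUTO_STRATEGY'  -- vs. 'BATCH_GRADIENT_DESCENT'
) AS
SELECT * FROM `cpu_performance.cpu_specs`;
```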
Again, it goes pretty deep, but the best part is, again, you're doing this all inside of SQL, right on top of where your data already lives. And naturally, it will scale to meet the size of your data. So let's see what model types are currently available in here. Right now we have, let's see, linear regression, logistic regression for classification, k-means clustering, matrix factorization, that's the really cool one for doing recommendations, like product basket analysis, and TensorFlow in here as well. All right, that's BigQuery ML in a nutshell. I'll make sure that this code is available; we'll see that there are licenses associated with the SPEC dataset here. I'll make sure that we can share as much as we can, and if not, I'll provide links where you can download the CSVs of this data and ingest it yourself inside of BigQuery. I hope you enjoy your SQL journey, and good luck with machine learning. >> As you saw in the demo, the actual model code was just a small fraction of what we did. This is true of ML projects in general. You'll spend the majority of your time collecting, cleansing, and transforming your data to tease out useful model inputs. And then, once you have this rich training dataset, you finally get to build the actual model. Let's do a recap of what we've learned so far. We started with a discussion on exogenous versus endogenous factors for our models. When you're modeling based on fundamentals, that's exogenous. When you're modeling based on technicals, like high-frequency traders do, that's endogenous. Keep in mind that sophisticated trading models can use a combination of both factor types, and even many different models as well. Next, we learned that your model is only as good as the data you feed into it. The larger and cleaner your training dataset is, the better luck you'll have in modeling useful business insights. Lastly, keep in mind the 80/20 rule.
You can expect to spend 80% of your time on an ML project doing feature engineering, that is, preparing your rows and columns of data to be clean model inputs. The other 20% is choosing and testing different model types. Machine learning at its core is an iterative process. Even once you have a model that you're happy with, you should be continually reviewing and collecting new data to retrain and grow your model over time.