For this lab, you'll need the source code copied from GitHub to your Cloud Shell environment, and you will also need to run a script that downloads the libraries the pipeline depends on. These steps take a few minutes to complete, so right now you can see the video fast forwarding through them until after the source code is installed and the libraries have been downloaded.

From Cloud Shell, you can use different editors to view the source code of the pipeline. You can use a text-based editor like Nano, but here in this video you'll see me use the built-in Cloud Shell graphical editor. Once this editor loads, you can use the menu on the left to open the training-data-analyst, courses, data_analysis, lab2, python folder and access the pipeline source code in the grep.py file.

The source code takes as input the various Java files highlighted here on line 26. The Java files are specified with a wildcard statement, and for each one of the files, the transform looks for lines of Java source code containing the search term, which here is the keyword "import". You can see the details of the pipeline implementation in lines 32 to 34. Notice that the Grep step of the pipeline uses the my_grep method defined on line 20. The my_grep method looks for the search term "import", and all the lines that contain the search term are written out to the /tmp/output directory.

To run the pipeline on Cloud Shell, you simply use the python command and pass the name of the source code file with the pipeline implementation. The pipeline completed successfully, and you can confirm that by looking at the output files that the pipeline created. The pipeline correctly identified all the lines of Java source code that contain the keyword "import".

In the remaining part of the lab, you will take this pipeline source code and prepare it to run on the Google Cloud Dataflow platform. But before you can do that, there are some prerequisite steps. First, you need to search for the Dataflow API in GCP and enable it using the Enable button you see on the screen. This is going to take a few moments, so the video will fast forward until after the API is enabled. Okay, you can confirm that the API is enabled if you see the Disable button on the Dataflow API screen.

Next, you need to make sure that you have a Cloud Storage bucket created for your pipeline. When you create this bucket, it's important that you assign it a unique name and set it up as a regional bucket. Here, I selected us-east4, the Northern Virginia region. Okay. Once the bucket is ready, you will copy the input source code files for your pipeline from Cloud Shell to the Google Cloud Storage bucket. You do this using the gsutil cp command. Remember that you are copying these Java source code files because the pipeline does not have access to your Cloud Shell file system while it's executing on Google Cloud Dataflow. After the gsutil cp command finishes copying the files, you can go back to the Google Cloud Storage bucket in your browser, refresh the page, and confirm that the files have been copied successfully. Here are the four Java files that will be used as input to your pipeline running on Google Cloud Dataflow.
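Before moving on to the Dataflow version, it helps to keep in mind what the Cloud Shell pipeline you just ran looks like. The following is only a minimal sketch in the style of grep.py; the wildcard input path, step names, and output prefix are illustrative of the walkthrough above, not an exact copy of the lab file.

```python
import sys
import apache_beam as beam

def my_grep(line, term):
    # Emit the line only if it contains the search term.
    if term in line:
        yield line

if __name__ == '__main__':
    p = beam.Pipeline(argv=sys.argv)

    # Illustrative wildcard over the lab's Java source files.
    input_pattern = 'javahelp/*.java'
    output_prefix = '/tmp/output'
    search_term = 'import'

    # Read each line, keep the ones containing the search term,
    # and write the results out under the /tmp/output prefix.
    (p
     | 'GetJava' >> beam.io.ReadFromText(input_pattern)
     | 'Grep' >> beam.FlatMap(lambda line: my_grep(line, search_term))
     | 'write' >> beam.io.WriteToText(output_prefix)
    )

    p.run().wait_until_finish()
```

Because no runner is specified, this version executes locally with the default runner, which is why it can read from and write to the Cloud Shell file system.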
Next, take a look at the source code for the pipeline implementation that was modified to run on the Google Cloud Dataflow platform. It is in the grepc.py file. Notice that this one uses constants for the project and bucket names. In my case, I've used the same unique ID for both the project and the bucket, so I'm going to put the same value for both. The code also specifies some parameters that are needed to run this pipeline on Cloud Dataflow. For example, you need to specify the name of the job running your pipeline, and also the Dataflow runner to execute the pipeline on Dataflow. Here, the input and the output are specified as paths to your Google Cloud Storage bucket. The rest of the code for the pipeline stays the same.

To run your pipeline on Dataflow, you still use the python command and pass as an argument the name of the file with the source code of your pipeline implementation. Since the source code uses the Dataflow runner, your code is going to be packaged together with the Dataflow libraries and submitted as a job to execute the pipeline on top of the Google Cloud Dataflow platform.

When the python command finishes executing, go back to GCP and open Dataflow using the hamburger menu on the left or the search bar. From the Dataflow dashboard, you can monitor the pipeline you just submitted as one of the jobs. Here, the job is called example2, because that's the name I used in the grepc.py file. First, you'll notice that the job has not yet fully started: it says it's autoscaling and currently shows only a single virtual core in use. On the right-hand side, you can also see the pipeline options and other information about the job. In the Logs section, you can find out that the pipeline is not yet running because it's still starting up one of the workers, and you can confirm that by looking at the graph in the Autoscaling section. Here, you'll notice that the job expects to use one target worker, and the number of workers went from zero to one. This means that exactly one virtual instance has been provisioned to execute this pipeline.

It is going to take a few minutes for this pipeline to finish executing, so right now you can see the video fast forwarding until after the job is done. If you take a closer look at the pipeline, you can tell by the green check marks that all the individual transformation steps have completed. And by reviewing the graph on the bottom right, you'll notice that the workers used to execute the pipeline have been scaled down.

You can take a look at the output of this pipeline by copying the output files from Google Cloud Storage to Cloud Shell. Once the files are copied, you can review them directly in Cloud Shell, or you can open Google Cloud Storage in your browser and find the files in your bucket under the javahelp folder. The files share the output prefix, followed by a shard number and the total number of shards. To review the contents of the files, it is important that you use the Public link checkbox on the right. Here, you can see the contents of the first file.
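To tie these Dataflow steps together, here is a minimal sketch of how a grepc.py-style file can be structured. The project ID, bucket name, and job name are placeholders you would replace with your own values, and the exact option names are illustrative of the approach described above rather than a verbatim copy of the lab file.

```python
import apache_beam as beam

# Placeholders: use your own project ID and bucket name here.
PROJECT = 'your-unique-project-id'
BUCKET = 'your-unique-bucket-name'

def my_grep(line, term):
    # Emit the line only if it contains the search term.
    if term in line:
        yield line

def run():
    # Pipeline options: a job name, the Dataflow runner, and
    # Cloud Storage locations for staging and temporary files.
    # Depending on the Beam version, you may also need '--region=...'.
    argv = [
        '--project={0}'.format(PROJECT),
        '--job_name=example2',
        '--save_main_session',
        '--staging_location=gs://{0}/staging/'.format(BUCKET),
        '--temp_location=gs://{0}/staging/'.format(BUCKET),
        '--runner=DataflowRunner',
    ]
    p = beam.Pipeline(argv=argv)

    # Input and output now point at the Cloud Storage bucket,
    # not the Cloud Shell file system.
    input_pattern = 'gs://{0}/javahelp/*.java'.format(BUCKET)
    output_prefix = 'gs://{0}/javahelp/output'.format(BUCKET)
    search_term = 'import'

    # The pipeline steps themselves are the same as in grep.py.
    (p
     | 'GetJava' >> beam.io.ReadFromText(input_pattern)
     | 'Grep' >> beam.FlatMap(lambda line: my_grep(line, search_term))
     | 'write' >> beam.io.WriteToText(output_prefix)
    )

    p.run()

if __name__ == '__main__':
    run()
```

Submitting a file like this with the python command packages the code together with the Dataflow libraries and creates the job you saw in the Dataflow dashboard; the sharded output files then appear under the bucket's output prefix, where you can copy them back to Cloud Shell or view them in the browser as described above.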