[MUSIC] Welcome to Distributed Computing with Spark SQL, produced by the UC Davis Division of Continuing and Professional Education in partnership with Databricks. My name is Brooke Wenig, and I lead the machine learning practice at Databricks. I've been working with Apache Spark for four years and have a master's degree in computer science from UCLA, focused on distributed machine learning. Fun fact: I enjoy riding bikes and am fluent in Mandarin Chinese. I'm accompanied here by my esteemed colleague Connor Murphy. Connor, take it away.

>> Hi, I'm Connor Murphy, a data scientist at Databricks. I've been focusing my work on Apache Spark and distributed computing for over two years now. My area of focus is scaling machine learning across large data sets. I did both my undergraduate and graduate work in philosophy. I know, not a traditional approach. I became interested in data after seeing the impact of data-driven methodologies on humanitarian interventions during my work with the Rotary Foundation. I then transitioned from the nonprofit sector to the technology sector a few years ago. Outside of data science and engineering problems, I spend most of my free time in free fall as a skydiver. Currently, Brooke and I work together at Databricks, which was founded by the original creators of Apache Spark.

This course is designed to scale the SQL queries and workloads that you developed in earlier courses in this series. It is designed for students who are already familiar with SQL but want to work on larger data sets, where they have more data than can fit on any single machine. This is where distributed computing and Apache Spark come in. Spark solves the problem of scaling computation to large data sets, which poses a number of unique challenges. In this course, we'll give you the conceptual framework to approach those challenges, as well as hands-on experience writing Spark code.

>> Now, let's talk about what you will be able to accomplish by the end of this class.
In the first week, we will cover the core concepts of distributed computing and when and where it is useful. By the end of the first week, you'll be able to run your SQL queries at scale using Spark SQL. We'll also introduce the basic data structure in Spark, called a DataFrame. This is a collection of data distributed across a number of machines, not just sitting in a single database or on your laptop. We'll end the module by introducing the Databricks collaborative workspace, where you'll be able to write SQL code that executes against a cluster of machines.

>> In module 2, we'll cover the core concepts of Spark so that you'll be able to optimize and use Spark in your own work. Spark SQL looks very similar to the way you've accessed data in databases in previous courses in this series, with some key distinctions because of its distributed nature. Spark itself is not a database; it is a computation engine. We will demonstrate how to speed up your queries by caching your data and how to use the Spark UI to debug slow queries. This can speed up slow queries, and it can also fix queries that might not complete otherwise.

In module 3, we'll explore engineering data pipelines. This allows us to go under the hood with how Spark clusters connect to databases using the JDBC protocol, a common way of connecting to databases in Java environments. We'll show you schemas and types and why they matter in data pipelines. Certain file formats work well in distributed environments and others don't; we'll discuss some of those trade-offs. Finally, we'll explore best practices for writing data to save the results at the end of our queries. At the end of this third module, you'll be able to create a basic pipeline that reads, transforms, and writes data, a process known as ETL.

>> Lastly, we will cover real-world applications of Apache Spark and build a machine learning model for our data set.
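As a small preview of the caching technique covered in module 2, here is a minimal Spark SQL sketch. It assumes a hypothetical table named `trips` already registered in your workspace; the column names are illustrative only.

```sql
-- Cache the (hypothetical) trips table in memory so that repeated
-- queries against it avoid re-reading the underlying files.
CACHE TABLE trips;

-- Subsequent queries on the cached table can be served from memory,
-- which is one way to speed up a slow, repeatedly-run query.
SELECT pickup_zip, COUNT(*) AS num_trips
FROM trips
GROUP BY pickup_zip
ORDER BY num_trips DESC;
```

This runs against a Spark cluster (for example, in a Databricks notebook), not a standalone database; you would see the cache's effect by comparing query times in the Spark UI before and after caching.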
We will cover the basics of machine learning at a high level and introduce how we can combine machine learning with Spark SQL. After this week, you'll be able to explain the difference between regression and classification and apply machine learning models using Spark SQL. We will wrap up the course with a summary of the key concepts you have learned. All you need to take this class is a working knowledge of SQL, a desire to learn, and access to Databricks Free Community Edition, as well as the Spark documentation. Let's get started. [MUSIC]