Hi again. We've discussed the six V's of data and considerations associated with different data sets. In this segment, we'll discuss dimensions and complexity of data, data quality, reliability, and cleansing.

Let's begin with dimensions. A dimension of data is an attribute that defines something about the data. Some dimensions describe a particular data set; others may describe multiple data sets. Dimensions help us understand how to think about and evaluate the data we have for data analytics projects. We'll go over several important dimensions, but note that the six V's are also dimensions that help us evaluate data. Using our athletic clothing company example, attributes such as business segment, customer, product, region, and month or quarter could all be dimensions of a data set. Sometimes a dimension is also used to characterize data quality. Dimensions associated with data quality include completeness, accuracy, consistency, uniqueness, and timeliness.

Complexity of data is an important consideration. The more complex a data set is, the more time, effort, and resources it takes to work with. The complexity of the data is a manifestation of these factors and dimensions. If a data set is incomplete, perhaps including only quarterly rather than monthly reported data, or if it uses different currency conversion conventions from multiple sources for different periods of time, the data set becomes complex.

Let's envision a data set that is the system of record for 100 students' names and attendance, maintained by a specific teacher and audited by a designated school administrator. In that case, it is a structured data set, likely with low variability, and it would also have low complexity. In contrast, if the data set is the system of record for students' names, attendance, extracurricular activities, grade point averages, standardized test scores, social media account names, recent yearbook pictures, and notes from school counselors, it would be a complex data set with both structured and unstructured data. The unstructured data in this example would be the notes from the school counselors.

Understanding the complexity of data sets helps inform the organizational capabilities needed in data governance, data infrastructure, and data management. Understanding the organization's data governance, data infrastructure, and data management enables you to better work with data of various levels of complexity and determine technology that could address gaps.

Data quality is the state of the data set, and it is often defined by a set of dimensions used to evaluate that state. In assessing the quality of a data set, ask the following questions: Is the data set accurate? Was the data set updated promptly? Is the data set complete? Is the data set valid, meeting requirements such as type, format, time range, and owner? Is the data set consistent, that is, are the data points updated consistently following specific data governance rules and best practices?

Attributes of data quality include completeness: is enough data present in the data set, and is all the data there? Accuracy and consistency: is the data accurate and consistent across multiple systems? Usability: what do differences in data set updates mean for usability? Integrity and duplication: can records from different systems be linked and mapped? Availability or timeliness: is updated data available when needed?
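To make these quality attributes a little more concrete, here is a minimal sketch of how a few of them might be checked with pandas. The file name, column names, and checks are illustrative assumptions, not part of our athletic clothing company's actual systems.

```python
import pandas as pd

# Hypothetical monthly sales extract; file and column names are assumptions for illustration.
sales = pd.read_csv("monthly_sales.csv")

# Completeness: what share of each column is actually populated?
completeness = sales.notna().mean()
print("Share of populated values per column:\n", completeness)

# Uniqueness / duplication: are there duplicate records for the same order?
duplicate_orders = sales.duplicated(subset=["order_id"]).sum()
print("Duplicate order_id rows:", duplicate_orders)

# Validity: do values meet simple requirements on type, format, and range?
sales["order_date"] = pd.to_datetime(sales["order_date"], errors="coerce")
invalid_dates = sales["order_date"].isna().sum()
negative_amounts = (sales["amount"] < 0).sum()
print("Unparseable dates:", invalid_dates, "| Negative amounts:", negative_amounts)

# Timeliness: how recent is the latest record compared with today?
lag_days = (pd.Timestamp.today() - sales["order_date"].max()).days
print("Days since most recent record:", lag_days)
```

A real quality assessment would go further, of course, but even a short pass like this turns the questions above into numbers you can track over time.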
Attributes of data integrity include relevance to the events or instances that cause figures to be included in the data set; relationships between the members of the data set; clearly defined inclusion and exclusion criteria; clearly stated characteristics, complete lists, and accurate measurements of what was measured and recorded; and clarity that the data is valid, complete, accurate, and current.

In addition to ensuring data quality, data governance will address data reliability. Data reliability extends beyond data quality and incorporates areas such as data availability, data redundancy, and data protection. In assessing a data set for reliability, ask the following questions: Is the data set available when the organization needs it? Is there a business continuity plan for the data set in case of unintended data loss, data hacks, or other types of internal and external risks? Is the data set adequately protected to prevent data theft, data tampering, or data loss?

Lastly, let's discuss the key concept of data cleansing. Data cleansing is the task of correcting or completing records in a data set or database. For example, in a supplier data set, the same supplier's name can appear as Pacific Industrial, Pacific Ind., or Pacific Industrial Limited. If you are conducting a supplier spend analysis, you want to use data cleansing to make clear that all three supplier names refer to the same supplier entity, Pacific Industrial Limited. If location data is missing, consider filling in the site location data; identifying the location will allow you to break down supplier spend by country and city. Data cleansing is a crucial step to ensure high-quality and reliable data.

In practice, when you receive a data set, you should perform something called exploratory data analysis, or EDA. EDA involves reviewing the data set and plotting it to get a sense of how the data looks. Often, EDA helps uncover focus areas for data cleansing. In addition to exploratory data analysis, your expertise in the subject matter area can guide you to look deeper into areas of the data set that may require data cleansing. Data cleansing can be time-consuming, but when done correctly it improves the data quality and, in turn, the quality of the decisions that will be made. A short sketch of this supplier cleansing and EDA pass appears at the end of this segment.

You've learned about thinking beyond the six V's of data and from the perspective of data dimensions. You know the importance of understanding data complexity, data quality, and data reliability, and how to deploy data cleansing as you embark on a data analytics project. Next, we'll discuss feature engineering, which is used frequently in advanced data analytics and machine learning tasks.
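As promised above, here is a minimal sketch of the supplier cleansing and EDA pass we just walked through, assuming a hypothetical supplier spend table. The file names, column names, name-mapping rules, and location reference table are illustrative assumptions, not a prescribed method.

```python
import pandas as pd

# Hypothetical supplier spend extract; file and column names are assumptions for illustration.
spend = pd.read_csv("supplier_spend.csv")

# Quick exploratory data analysis (EDA): shape, summary statistics, and missing values
# often point to the areas that need cleansing.
print(spend.shape)
print(spend.describe(include="all"))
print(spend.isna().sum())
spend["amount"].plot(kind="hist", title="Distribution of spend amounts")

# Cleansing step 1: map name variants of the same entity to one canonical supplier name.
name_map = {
    "Pacific Industrial": "Pacific Industrial Limited",
    "Pacific Ind.": "Pacific Industrial Limited",
}
spend["supplier_name"] = spend["supplier_name"].replace(name_map)

# Cleansing step 2: fill in missing site locations from a (hypothetical) reference table
# so spend can later be broken down by country and city.
sites = pd.read_csv("supplier_sites.csv")  # columns: supplier_name, country, city
spend = spend.merge(sites, on="supplier_name", how="left")

# With cleaner data, the spend analysis by supplier and location becomes meaningful.
print(spend.groupby(["supplier_name", "country", "city"])["amount"].sum())
```

In practice the EDA output would guide which cleansing steps matter most for your data set; this sketch simply shows how the two activities fit together.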