If you are a mysteriously skilled stock operator, a senior employee of the Asia-Pacific section of an unusual company who wants to know the stock information of similar companies, an odd-lotter who has carefully studied stock trading and successfully become a small stockholder, or a Python learner who loves thinking and learning, then what we're going to talk about today might be of help to you.

Data exploration is the step before real statistical analysis and mining of data, and is often the first step after obtaining data. Its tasks somewhat overlap with data preprocessing. Data exploration helps us better understand the data and make wise decisions. As we discussed before, data exploration has two main tasks: one is checking for data errors, and the other is learning about the distribution characteristics and inherent regularities of the data.

As for checking for data errors, the main task is to check whether the raw data contain any dirty data, such as missing values, outliers, and inconsistent data. The detection and processing of missing values and outliers have been introduced before, while inconsistent data are often generated during data integration. For example, two sheets may store the same attribute with different types, or certain data may be updated in only one sheet, causing inconsistency between the two sheets. For instance, if the type of attribute "a" is "int" in one sheet but "float" in another, it's necessary to convert one of the two into "float" to achieve consistency.

Let's briefly try it with the target attribute of Iris. Suppose we've built an Iris DataFrame (iris_df) containing all the data, including the target attribute, i.e., the category. The data returned from the "target" attribute of iris_df is originally of "int" type. We may use the astype() method to easily convert the data into the float type. We got the correct result.
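As a sketch of the conversion just described (assuming iris_df is built from scikit-learn's load_iris, so the column names below come from that library):

```python
# A sketch of the astype() conversion described above, assuming iris_df
# is built from scikit-learn's load_iris dataset.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["target"] = iris.target              # integer-coded categories

# astype() converts the integer target into float for consistency
iris_df["target"] = iris_df["target"].astype("float")
print(iris_df["target"].dtype)               # float64
```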
In this part, we focus on the distribution characteristics and inherent regularities of data, especially how to utilize Python, mainly the statistics in the "pandas" module and the plotting functions in the Matplotlib and pandas modules, to realize basic analysis of data characteristics. We'll mainly talk about three common methods for analyzing data characteristics: the first is distribution analysis, the second is statistics analysis, and the third is correlation analysis.

First, look at the first method: distribution analysis. Distribution analysis consists of two types, quantitative and qualitative. For quantitative data, we often want to learn whether the data distribution is symmetrical, and a common way, for example, is to observe a histogram describing attribute-frequency pairs. Qualitative data, by contrast, mainly focus on the distribution of categories, often shown in pie charts.

First, look at the analysis of quantitative data. As we introduced before, histograms may be used to express attribute-frequency pairs for numerosity reduction. Obviously, histograms serve well for observing data distribution. The plotting method is easy; you might still remember it: just use the hist() function and properly set "bins", and we can plot. However, when describing data distribution, be careful to make the width of each group the same. Let's try with an attribute of the iris_df we just created. First, select an attribute. Suppose we select 'sepal length (cm)', which is the attribute numbered 0, and then use the hist() function in the pyplot module to plot. First, import this module. Suppose we divide the data range into 30 equal-width parts and set the colors. The 150 records are put into 30 bins. In this way, we may roughly see the number of values within each equal-width data range and learn about the overall data distribution. Moreover, let's observe this histogram. Can we infer that this attribute is roughly in a normal distribution?
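The histogram step just described can be sketched like this (again assuming iris_df comes from scikit-learn's load_iris; the Agg backend is only so the script runs without a display):

```python
# A sketch of the histogram step: the 150 sepal-length values split into
# 30 equal-width bins with pyplot's hist().
import matplotlib
matplotlib.use("Agg")                        # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

col = iris_df["sepal length (cm)"]           # attribute numbered 0
counts, bin_edges, _ = plt.hist(col, bins=30, color="steelblue")
plt.xlabel("sepal length (cm)")
plt.ylabel("frequency")
plt.savefig("sepal_length_hist.png")

print(int(counts.sum()))                     # all 150 records are binned
```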
A lot of methods are available for testing normality. We may test with normaltest(), the special function for testing normal distributions in the scipy.stats module. This module abounds in other useful statistical functions too. First, import this module. Apart from the data themselves, this function also has an "axis" argument whose default value is 0, indicating that testing is done along the axis numbered 0, namely, testing the normality of each column of data. We just use the default value here. The function returns a test statistic and a p value. Statistically speaking, a very small p value means it's quite unlikely that the data come from a normal distribution; 0.05 is a standard threshold for the p value. If the p value exceeds 0.05, the data may be deemed to follow a normal distribution. Here, the p value is slightly greater than 0.05, so the data are roughly normally distributed. The result is the same as what we observed from the histogram.

By the way, let's look at the other attributes. First, look at the histograms of the other 3 attributes. Copy the previous code from the log. This is the histogram of attribute 1, this is the histogram of attribute 2, and this is the histogram of attribute 3. As we see, attribute 1 has noticeable characteristics of a normal distribution, while attribute 2 and attribute 3 do not seem quite normally distributed. Let's test them with the normaltest() function; the "axis" argument may be omitted. Look at the p values in the result, and we can see the result is indeed identical to our previous judgment from the histograms: the p value of attribute 1 is 0.2, far greater than 0.05, so it is normally distributed, while the p values of attribute 2 and attribute 3 are quite small, so they are not normally distributed.
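The normality tests above can be sketched as follows (the 0.05 threshold and the "roughly normal" wording follow the lecture's convention):

```python
# A sketch of the normality check: scipy.stats.normaltest() returns a
# statistic and a p value; p > 0.05 is read here as "roughly normal".
import pandas as pd
from scipy import stats
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

for i, name in enumerate(iris_df.columns):
    result = stats.normaltest(iris_df[name])   # axis=0 is the default
    verdict = "roughly normal" if result.pvalue > 0.05 else "not normal"
    print(f"attribute {i} ({name}): p={result.pvalue:.4f} -> {verdict}")
```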
For qualitative data, we often examine the categorical distribution, typically with the value_counts() method of Series or DataFrame, or by observing a pie chart, such as when observing the categories of the Iris dataset. Suppose we don't know that the 150 records have been equally divided into 3 categories. First, select the "target" attribute, and then call the value_counts() method. The results quite clearly show the distribution of data categories: 50 each. Then, look at the pie chart. The result is the same as the previous one; a pie chart is more vivid, of course.

Next, look at statistics analysis among the methods for analyzing data characteristics. Statistics analysis includes central tendency analysis and divergence tendency analysis. Central tendency discusses where the data average or concentrate, and the most common measures are means and medians. Divergence tendency discusses how discrete or dispersed the data are, with measures such as the minimum value, the maximum value, the standard deviation, and the interquartile range, which we talked about before and which means the difference between the upper quartile and the lower quartile.

Now, let's first discuss the four statistical quantities below. First, look at means. A mean may well show the central tendency of data, but it's easy to understand that if the data contain some extremely big or small values, or the data are skewed, the mean cannot well reflect the central tendency, and we may then use the median instead. The median is the value in the middle position when a set of data is ranked from smallest to biggest. If the number of data points is odd, say 101, and supposing the starting index is 1, the median is the element whose index is (101+1)/2 = 51; there are 50 numbers to its left and 50 to its right. If the number of data points is even, say 100, the median is the mean of the two numbers in the middle: the index of one is 100/2 = 50, and the index of the other is 100/2+1 = 51. The standard deviation shows how dispersed the data are around the mean: a small standard deviation indicates that the values are close to the mean, while a bigger one means the values are spread farther from the mean. The interquartile range covers the middle half of the raw data, and a bigger value indicates a greater extent of variation in the data.
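The category-distribution check described earlier can be sketched like this (value_counts() for the counts, then a pie chart of the same Series):

```python
# A sketch of the categorical-distribution check on the Iris "target"
# attribute: value_counts() plus a pie chart.
import matplotlib
matplotlib.use("Agg")                        # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["target"] = iris.target

counts = iris_df["target"].value_counts()    # 50 records per category
print(counts)

counts.plot.pie(autopct="%.0f%%")            # pie chart of the categories
plt.savefig("target_pie.png")
```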
Next, we still take the Iris dataset as an example to see how these 4 statistical quantities are calculated. Similarly, select the data first. I'm sure you know the method to acquire the mean: mean(), whose value here is 5.84. The method to calculate the median is median(): 5.8. You must know the method for the standard deviation too: std(). Here, you might as well calculate the standard deviation of each attribute and see which attribute's values are least concentrated around its mean; after the calculation, you'll see the attribute numbered 2 (petal length) is the least concentrated. The method to calculate quartiles is quantile(), and the default argument value is 0.5, that is, it calculates the median. We may also pass percentages to calculate the values at several quantiles at one time. For example, to calculate the lower quartile and the upper quartile, write like this: this is the lower quartile, and this is the upper quartile. Then, how do we calculate the interquartile range? Can we take out these two values and then just do subtraction? How do we take them out? The result is a Series object, and 0.25 and 0.75 are the indexes of this Series. These indexes are labels rather than integer positions, so we may use the "loc" attribute to take the values out of the Series. Copy it first. Use the "loc" attribute to select the data, and this is the calculated interquartile range: 1.3. It equals 6.4 minus 5.1.
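The four statistics just walked through, including the .loc-based interquartile-range computation, can be sketched as:

```python
# A sketch of the four statistics for 'sepal length (cm)', including the
# interquartile range taken out of the quantile() result with .loc.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
col = iris_df["sepal length (cm)"]

print(round(col.mean(), 2))                  # 5.84
print(col.median())                          # 5.8
print(round(col.std(), 2))                   # sample standard deviation

q = col.quantile([0.25, 0.75])               # lower and upper quartiles
iqr = q.loc[0.75] - q.loc[0.25]              # interquartile range
print(q.loc[0.25], q.loc[0.75], round(iqr, 2))
```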
That's right. The 4 statistical quantities we just introduced were all computed with separate methods. Can you still remember, as we introduced before, a method that returns the basic statistics of data: describe()? It may produce all 4 statistical quantities at once. Let's have a look. The output includes the mean, median, and standard deviation we just calculated. The interquartile range can be obtained by subtracting the lower quartile from the upper quartile, both extracted from the result. Note that the "index" here is of string type. Let's calculate it. Copy first. The result of the value at the 75% location minus the value at the 25% location is, as we see, correct. These are common statistical analyses. For "ndarray", there are similar functions and methods to achieve the same effect.

Next, look at correlation analysis. For analyzing the linear correlation strength among continuous variables, we may use correlation analysis. Scatter plots are the most vivid form for expressing data correlation. For example, suppose two variables can be expressed as a set of data points on the coordinate axes. If we can find an upward-sloping straight line that makes the sum of squares of the errors (the error here is measured perpendicular to the x-axis) quite small, that means the two variables are in positive linear correlation. If we instead find such a downward-sloping straight line, that means they are in negative linear correlation. If no such line can be found, that means they are not linearly correlated; sure, they might be in nonlinear correlation instead. Like this, for example: find such a curve. Suppose we don't know the category of each record of Iris. We may use a scatter plot to view the relations among the 4 attributes. For example, look at the sepal length and the petal length. The result is like this. Based on the positions of the dots, we may find that there is indeed a certain correlation between the two variables. Let me demonstrate this program. The program itself is quite simple.
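A sketch of that scatter-plot program (attribute names taken from scikit-learn's load_iris):

```python
# A sketch of the scatter plot: sepal length (attribute 0) against
# petal length (attribute 2).
import matplotlib
matplotlib.use("Agg")                        # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

x = iris_df["sepal length (cm)"]             # attribute numbered 0
y = iris_df["petal length (cm)"]             # attribute numbered 2
plt.scatter(x, y, s=15)
plt.xlabel("sepal length (cm)")
plt.ylabel("petal length (cm)")
plt.savefig("sepal_vs_petal.png")
print(len(x))                                # 150 points plotted
```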
After selecting the attributes numbered 0 and 2, we call the scatter() function to plot them. Execute. This is the result of the execution. Think about it: if we didn't know the category of the data beforehand, wouldn't this scatter plot be quite valuable for reference?

During data exploration, apart from single scatter plots, we often use matrices of scatter plots or bar charts to observe the correlation between variables. For example, during the analysis of multiple linear regression, we often utilize a scatter plot matrix to observe the correlation between the target attribute and the other attributes. In later parts, we'll provide specific methods for plot matrices when discussing specific cases, so let's skip them here.

Let's focus on the quantitative correlation analysis method: the correlation coefficient. It's more accurate to use a correlation coefficient to judge the correlation between two continuous variables. If the two variables are both normally distributed, we normally use the Pearson correlation coefficient; if not, the Spearman correlation coefficient is often used instead. Moreover, the Kendall correlation coefficient is also frequently used. Let's focus on the calculation of the Pearson correlation coefficient. Its formula is r = Σ(xi − x̄)(yi − ȳ) / √(Σ(xi − x̄)² · Σ(yi − ȳ)²), where x̄ and ȳ are the means of the two variables. This formula should be quite understandable. The range of the calculated result "r" is [-1, 1]. A value greater than 0 indicates positive correlation, meaning that a change in one variable will cause the other variable to change in the same direction; for example, an increase in one variable will cause an increase in the other. A value smaller than 0 indicates negative correlation; for example, an increase in one variable will cause a decrease in the other. A value equal to 0 means no linear correlation. An absolute value equal to 1 indicates complete linear correlation, and if the absolute value is within (0, 1), the degree of correlation depends on its magnitude.
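As a quick check of the Pearson formula, we can compute r by hand and compare it with what pandas' corr() returns (the two attributes chosen here are just an illustration):

```python
# A sketch verifying the Pearson formula:
# r = sum((x - mean_x)(y - mean_y)) /
#     sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

x = iris_df["sepal length (cm)"].to_numpy()
y = iris_df["petal length (cm)"].to_numpy()

dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_pandas = iris_df["sepal length (cm)"].corr(iris_df["petal length (cm)"])

print(round(r_manual, 4), round(r_pandas, 4))   # the two values agree
```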
It's generally believed that a correlation coefficient greater than 0.5 indicates relatively significant correlation and, if greater than 0.8 in particular, high correlation. If it is less than or equal to 0.3, we generally believe there's almost no correlation, and (0.3, 0.5) indicates low correlation. This division is a general guideline, which may depend on the characteristics and requirements of the data and application involved. Moreover, it's worth noticing that a correlation coefficient obtained through such a calculation, if too low, only indicates no linear correlation between the variables; there may still be nonlinear correlation.

Let's calculate it. The method for calculating the linear correlation between variables is easy: the corr() method can be used, and this method has a "method" argument indicating which correlation coefficient to adopt, which is "pearson" by default. Undoubtedly, we may set it to any other correlation coefficient. Suppose we execute this statement: select the 0th and 1st attributes and the target attribute to calculate the Pearson correlation coefficients among them. As we find out, the Pearson correlation coefficient between the 0th attribute (sepal length) and the target attribute (category of flowers) is close to 0.8, indicating strong correlation, while the linear correlation between the 1st attribute (sepal width) and the category of flowers is lower. Sure, we may separately calculate the correlation coefficient between any two variables; we may write it like this. As we see, this is the calculation of the Pearson correlation coefficient between the 0th attribute and the target attribute (target). I believe you can all calculate the correlation coefficient now, right?

Here, by the way, let's use the plotting module "seaborn" to plot a heatmap among the 0th and 1st attributes and the target attribute. The method is simple: just directly call the heatmap() function in the module. First, import the "seaborn" module, and then use its heatmap() function and adjust some arguments.
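The corr()-plus-heatmap steps can be sketched as follows (the annot argument is an extra touch, not from the lecture, that prints each coefficient inside its cell):

```python
# A sketch of the correlation heatmap: corr() on the 0th and 1st
# attributes plus the target, then seaborn's heatmap().
import matplotlib
matplotlib.use("Agg")                        # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["target"] = iris.target

cols = ["sepal length (cm)", "sepal width (cm)", "target"]
corr = iris_df[cols].corr()                  # Pearson by default

sns.heatmap(corr, annot=True, cmap="rainbow")
plt.savefig("iris_corr_heatmap.png")
print(round(corr.loc["sepal length (cm)", "target"], 2))   # 0.78
```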
There's a "cmap" argument indicating the "colormap"; we use "rainbow". Observe this heatmap. As we see, the correlation coefficients among the three attributes are more vividly expressed in the heatmap. Roughly speaking, a darker color indicates higher correlation. The "cmap" argument may be used to adjust the overall color scheme. Which "cmap" values are available? I've found an easy method: a deliberately invalid cmap value will produce an error message listing all the available cmap values. Have a try and find your favorite colormap. Normally, I prefer this cmap value: coolwarm. Let's try the effect of this color scheme. Like this.

Besides, for regression analysis, we often mention a coefficient of determination, r2, or r-squared. The coefficient of determination is the square of the correlation coefficient, and is thus known as r squared. It measures how well the regression equation explains the target attribute: the closer it is to 1, the stronger the correlation between the attributes, and the closer to 0, the weaker, of course. Relevant notes will be provided in the linear regression analysis in later lessons. Let's skip the details here.
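The r-squared relation mentioned above can be sketched in one line (the attribute pair is only an illustration):

```python
# A sketch of the r-squared relation: the coefficient of determination
# for a simple linear fit is the square of the Pearson correlation.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

r = iris_df["sepal length (cm)"].corr(iris_df["petal length (cm)"])
r_squared = r ** 2                           # coefficient of determination
print(round(r, 4), round(r_squared, 4))      # 0.8718 0.76
```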