[MUSIC] Hello and welcome to this lesson, in which we will learn about Moran’s I, an indicator of global spatial autocorrelation. An indicator of spatial autocorrelation quantifies the structure of a spatially distributed phenomenon by assessing the correlation between an observation and the values of neighbouring measurements. We will also see how to define a neighbourhood in order to create an appropriate spatial weighting scheme. These schemes are required to calculate indices such as the Moran’s I and are also useful for evaluating the spatial dependence of a dataset. Finally, in this lesson you will also learn how to evaluate the significance of the Moran’s I index. The main objective of the lesson is to present the background information required to calculate an index of spatial autocorrelation. We will begin by explaining some background so that you can define your own spatial weighting schemes. We will also review the Moran’s I index, an easy and intuitive method for calculating global spatial autocorrelation, and we will see how it can be interpreted as a regression coefficient. After having followed this video, you should be able to select a spatial weighting scheme that is adapted to your data and, from this, to calculate the global Moran’s I and its significance for any point or polygonal dataset. In this lesson we will present everything required to calculate a global index of spatial autocorrelation. Each of these indices is a single value that characterizes the spatial arrangement of geographic units on the basis of a given attribute. A number of such indices exist. The Join Count index is an enumeration statistic that can only be applied to polygons. Other common indices are Geary’s C, Ripley’s K, Getis-Ord’s G, and finally Moran’s I, which is the most widely used and will be the focus of this lesson.
[MUSIC] In order to quantify spatial dependence and assess the global spatial autocorrelation of a dataset, we must first consider the tendencies of the neighbouring values for each of the geographic units in our analysis. This will depend on how we define our neighbourhood. In essence, a measure of global autocorrelation compares the behaviour of an object in relation to its neighbours and summarizes this relationship throughout the entire study area. The neighbourhood used for the analysis is defined on the basis of a number of different criteria. For example, with this group of 54 points that correspond to the centroids of adjacent municipalities, the neighbourhood could be defined by a 5 km kernel around each point. The white circle corresponds to object 1, and the yellow circle around it delimits a 5 km radius, which defines the neighbourhood. From this, the value of attribute A for object 1 can be compared with a single statistic that summarizes the values of attribute A for its 17 neighbours, which are shown in green. This statistic could be the mean, as is the case for the Moran’s I. This operation is then repeated for each object by comparing its value of attribute A with its neighbourhood’s average. In this example, this process would have to be repeated 54 times, once for each municipality. Different criteria can be used to define a neighbourhood. Here, we will use the GeoDa program to illustrate how these different criteria can be implemented. The criteria used to define a neighbourhood are in part dependent on the type of object under consideration. If we are dealing with point objects, neighbourhoods are most commonly defined using either a distance threshold or a proximity criterion that identifies an object’s k nearest neighbours. Returning to the example we looked at previously, here we illustrate a neighbourhood defined by a 5000 m distance threshold.
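For readers who want to follow along outside of GeoDa, the distance-threshold neighbourhood described above can be sketched in a few lines of Python. The coordinates below are made up for illustration; they are not the lesson’s municipality centroids.

```python
import math

def distance_band_neighbours(coords, bandwidth):
    """For each point, list the indices of all other points lying
    within `bandwidth` (a fixed kernel with bandwidth d)."""
    neighbours = {}
    for i, (xi, yi) in enumerate(coords):
        neighbours[i] = [
            j for j, (xj, yj) in enumerate(coords)
            if j != i and math.hypot(xi - xj, yi - yj) <= bandwidth
        ]
    return neighbours

# Hypothetical centroid coordinates in metres, with a 5000 m bandwidth.
pts = [(0, 0), (3000, 0), (4000, 3000), (20000, 0)]
print(distance_band_neighbours(pts, 5000))
# the last point has no neighbour within 5000 m and ends up isolated
```

Note that with a fixed bandwidth the number of neighbours varies from object to object, and isolated objects are possible, which is precisely why the adaptive criterion discussed next is sometimes preferred.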
In this case, the neighbourhood is determined solely by the distance between each point and the neighbouring observations; accordingly, it is called a fixed kernel neighbourhood, and the 5000 m threshold distance (d) is the bandwidth. We could, alternatively, use the nearest neighbour criterion to define an object’s neighbourhood. Here, k = 7, and the neighbourhood adapts as a function of point density; this is called an adaptive kernel. Neighbourhoods of polygonal objects can be defined according to the adjacent spatial units and a chosen order of contiguity. There are two measures of contiguity that can be applied, Queen or Rook; these correspond to the movements of their respective pieces in the game of chess. Queen’s contiguity corresponds to a neighbourhood that includes all adjacent polygons that touch the polygon of interest. All polygons classified as neighbours must have at least one pair of coordinates in common with the object of interest. Rook’s contiguity, as in chess, corresponds to the polygons situated to the North, South, East or West of the polygon of interest; the neighbours must share at least one side with the object of interest. This type of contiguity is used primarily when geographical units are orthogonal to one another, as with the States in the USA. Next, the order of contiguity must also be defined. We could take into account only the neighbours that are immediately contiguous with our polygon of interest, as indicated by the yellow arrows surrounding polygon E; this is a first order contiguity neighbourhood. Alternatively, the neighbourhood could be defined by a higher level of contiguity. Here the green arrows correspond to a second order contiguity: the polygons that are adjacent, whether adjacency is determined according to the Queen or the Rook definition, to the first order neighbours of polygon E define the neighbourhood. Finally, depending on the analysis, it is also possible to include lower orders of contiguity in the neighbourhood.
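The adaptive kernel (k nearest neighbours) can be sketched in the same spirit as the fixed kernel. Again, the points below are invented for illustration only.

```python
import math

def knn_neighbours(coords, k):
    """Adaptive kernel: each point's neighbourhood is its k nearest
    points, so every object gets exactly k neighbours regardless of
    the local point density."""
    out = {}
    for i, (xi, yi) in enumerate(coords):
        # sort the other points by distance (ties broken by index)
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords) if j != i
        )
        out[i] = [j for _, j in dists[:k]]
    return out

# Two clusters of hypothetical points; with k = 2 even the isolated
# pair on the right still gets a full neighbourhood.
pts = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
print(knn_neighbours(pts, 2))
```

This is the trade-off between the two kernels: the fixed kernel keeps the spatial scale constant, while the adaptive kernel keeps the number of neighbours constant.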
Fixed kernels, such as those defined by a distance threshold, and adaptive kernels, defined by k nearest neighbours, can also be applied to polygonal datasets. In this case, the coordinates of each polygon’s centroid are used for the neighbourhood definition. These same criteria can equally be applied to linear objects. [MUSIC] The defined neighbourhood can then be used to create a spatial weighting scheme designed with a given dataset in mind. This weighting scheme will be used to create a weighting file that lists the neighbours of each object in the dataset. Returning again to our example of the 54 municipalities, we can see an example of one of these spatial weighting files. The first line is the file header; it comprises 4 elements separated by a space. In the current version of GeoDa, the zero does not have any significance, and the 54 indicates the number of spatial units in the data file. The next two elements correspond to the file name and the name of the unique identifier. The rest of the file lists the neighbours of each polygon. On line 2, the number 1 refers to polygon 1, and the 6 indicates that, using the defined neighbourhood criteria, polygon 1 has 6 neighbours; these are polygons 35, 29, 13, 4, 3, and 2, and are listed on line 3. Continuing on line 4, polygon 2 has 5 neighbours: 9, 8, 4, 3, and 1. Using the k nearest neighbours criterion, as shown on the right, after the file header, the 7 nearest neighbours of polygon 1 are listed with their centroid-to-centroid distances. Starting on line 9, the 7 nearest neighbours of polygon 2 are listed, and so on. [MUSIC] Spatial autocorrelation is quantified by calculating the correlation between neighbouring measurements. With Moran’s I, we calculate the correlation between the measured attribute of a geographic unit and the average of this attribute for all units contained by the defined neighbourhood.
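The contiguity file layout just described (a header line, then alternating lines of “identifier, neighbour count” and the neighbour list itself) can be read with a few lines of Python. This is a minimal sketch based on the structure shown in the lesson; it assumes every unit has at least one neighbour, and the sample content is invented.

```python
def read_gal(text):
    """Parse a GAL-style contiguity listing into {id: [neighbour ids]}.
    Assumes no islands (every unit has at least one neighbour)."""
    lines = text.strip().splitlines()
    neighbours = {}
    for i in range(1, len(lines), 2):   # line 0 is the file header
        obj_id, n = lines[i].split()    # e.g. "1 6" -> unit 1, 6 neighbours
        ids = lines[i + 1].split()      # the neighbour list on the next line
        assert len(ids) == int(n)       # sanity check against the header count
        neighbours[obj_id] = ids
    return neighbours

# A tiny made-up file: 3 units, each with 2 neighbours.
sample = """0 3 example ide
1 2
3 2
2 2
3 1
3 2
2 1
"""
print(read_gal(sample))
```

Reading the file yourself is mainly useful for checking that the weighting scheme you defined really produced the neighbourhoods you expected.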
In order to do this, the spatial weighting file is used to calculate the neighbourhood average for each object in the dataset. In this example, the variable z indicates the average monthly precipitation in each municipality, expressed in tenths of a millimetre. Measured precipitation, z, is given for each object in the third column, after the identifier and the list of neighbours that were used to calculate mean z. The neighbourhood mean of z is given in the last column. The neighbours listed in column 2 are used to calculate omega, which corresponds to the weight attributed to each polygon when calculating mean z. If the contiguity criterion is used to define the neighbourhoods, this weight is 1 if the polygon is adjacent and 0 if not. The Moran’s I coefficient of autocorrelation is an extension of the Bravais-Pearson correlation coefficient. It quantifies the difference between the values of a variable across all sets of neighbouring objects. It is defined as the ratio between the covariance of a variable, in relation to its mean value, and its variance throughout the entire study area. Simply put, Moran’s I corresponds to the linear correlation between z and bar z, and results in a value between +1 and -1, where +1 indicates a perfect positive correlation, 0 indicates no correlation, and -1 indicates a perfect negative, or inverse, correlation. In 1996, Luc Anselin suggested that Moran’s I could be interpreted as a regression coefficient. By adopting this interpretation, we can better understand the calculation procedure for Moran’s I as implemented in GeoDa. We will see how this works by calculating the Moran’s I for the measured precipitation in our dataset of 54 municipalities. To do this we will use an adaptive weighting scheme based on a Queen’s case, first order contiguity.
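The ratio described above (neighbourhood covariance over total variance) can be written out directly. The sketch below uses row-standardized contiguity weights, so each neighbour of object i receives the weight 1/(number of neighbours of i), which makes the weighted sum of neighbours exactly the neighbourhood mean used in the lesson. The toy data are invented: six units in a chain, with low values on one side and high values on the other.

```python
def morans_i(z, neighbours):
    """Global Moran's I: (n / S0) * sum_ij w_ij (z_i - m)(z_j - m)
    divided by sum_i (z_i - m)^2, with row-standardized weights."""
    n = len(z)
    mean = sum(z) / n
    dev = [v - mean for v in z]
    num = 0.0   # weighted covariance between each unit and its neighbours
    s0 = 0.0    # sum of all weights (equals n after row standardization)
    for i, nbrs in neighbours.items():
        w = 1.0 / len(nbrs)             # row standardization
        for j in nbrs:
            num += w * dev[i] * dev[j]
            s0 += w
    den = sum(d * d for d in dev)       # total variance term
    return (n / s0) * (num / den)

# A chain of 6 units: 0-1-2-3-4-5, with clustered values.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(morans_i([1, 1, 1, 5, 5, 5], chain))  # clustered -> positive I
```

Shuffling the same values so that highs and lows alternate would drive the index toward -1, which is the intuition behind the significance test discussed later.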
The Moran’s I formula is equivalent to the correlation between the measured precipitation in each municipality (z in the table) and the average precipitation in the municipality’s neighbourhood (bar z in the table). By regressing bar z on the independent variable z, the slope of the regression line corresponds to the Moran’s I value. A bivariate scatterplot is used to visualize the linear relationship between the studied variable and the mean value in the surrounding neighbourhood. In this case, the regression coefficient, given by the slope, is 0.79. The Moran’s I scatterplot illustrates the same relationship, but with standardized values. In this example, the high Moran’s I means that, in general, precipitation in a municipality is similar to the average precipitation in neighbouring municipalities. This means that there is spatial autocorrelation, and that precipitation is spatially dependent. This thematic map shows precipitation in each of the municipalities, divided into 5 classes, and supports the spatial dependency indicated by the Moran’s I. As we can see, the intensity of precipitation gradually declines from the East to the West of the study area. [MUSIC] Next, we check whether this value of spatial autocorrelation is statistically significant. We have to check whether this value is the result of a spatial process or whether it is simply due to chance. That is, how does the observed situation, in this case the spatial structure indicated by the 0.79 Moran’s I, compare to all of the potential spatial configurations, or at least to a large number of them? In order to create these different spatial configurations, we randomly permute the observed values among all of the locations in the dataset. Here, this process is illustrated with values ranging from 1 to 54: each configuration corresponds to a random assignment of values to locations.
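Anselin’s regression interpretation can be checked numerically: standardize the variable, compute each object’s neighbourhood mean (the spatial lag), and take the ordinary least-squares slope of the lag against the variable. The toy chain data below are invented for illustration.

```python
def moran_slope(z, neighbours):
    """Moran's I as the OLS slope of the spatial lag (bar z) regressed
    on z, using standardized values (Anselin's scatterplot reading)."""
    n = len(z)
    mean = sum(z) / n
    sd = (sum((v - mean) ** 2 for v in z) / n) ** 0.5
    zs = [(v - mean) / sd for v in z]                  # standardized z
    lag = [sum(zs[j] for j in neighbours[i]) / len(neighbours[i])
           for i in range(n)]                          # neighbourhood mean
    # OLS slope of lag on zs; zs is centred, so the slope reduces to
    # sum(x*y) / sum(x*x) regardless of the lag's mean.
    return sum(x * y for x, y in zip(zs, lag)) / sum(x * x for x in zs)

chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(moran_slope([1, 1, 1, 5, 5, 5], chain))
```

With row-standardized weights this slope coincides with the ratio definition of Moran’s I, which is why GeoDa can display the index directly as the slope of the scatterplot’s regression line.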
The same procedure is repeated thousands of times following the Monte-Carlo method. For each configuration, we calculate the Moran’s I and compare it to the Moran’s I calculated for the observed situation. After 999 permutations, 999 configurations and their corresponding Moran’s I values will have been generated. These values can then be compared with the observed situation to assess whether the observed Moran’s I resembles the randomly generated configurations or whether it is clearly different. GeoDa stores the Moran’s I values generated for each of the random configurations and uses them to create a histogram. Using the precipitation by municipality dataset, we can generate the Moran’s I histograms using 99 permutations, then 999, then 9999, and finally 99999. The higher the number of permutations used, the more the histogram will approach a normal distribution and the more the standard deviation and mean will approach their theoretical values. Let’s look at the 999 permutation histogram in more depth. The random distribution represents a geographically neutral space: around the mean, where Moran’s I values are near zero, precipitation is not similar to the neighbourhood average. The Moran’s I of the observed situation, shown in yellow, is not part of the histogram and can be clearly distinguished from the rest of the distribution. It can thus be deemed significant. A non-significant value, like the one shown here in green in the middle of the random distribution, indicates no spatial dependence. This significance is numerically translated as the probability of wrongly rejecting the null hypothesis, where the p-value serves as a threshold for this rejection. The null hypothesis is that the observed Moran’s I has resulted from chance, and that it is similar to the other values generated by random permutation.
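The permutation procedure is straightforward to sketch: shuffle the observed values among the locations many times, recompute Moran’s I for each shuffle, and count how often a random configuration matches or exceeds the observed index. The data and weights below are the same invented six-unit chain used earlier, not the lesson’s dataset.

```python
import random

def moran(z, neighbours):
    """Moran's I with row-standardized contiguity weights (no islands);
    with these weights S0 = n, so the index reduces to num / den."""
    n = len(z)
    m = sum(z) / n
    dev = [v - m for v in z]
    num = sum(dev[i] * sum(dev[j] for j in nb) / len(nb)
              for i, nb in neighbours.items())
    return num / sum(d * d for d in dev)

def permutation_test(z, neighbours, n_perm=999, seed=42):
    """Monte-Carlo significance test: permute values among locations
    n_perm times and derive a pseudo p-value for the observed I."""
    observed = moran(z, neighbours)
    rng = random.Random(seed)          # fixed seed for reproducibility
    values = list(z)
    n_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(values)            # one random spatial configuration
        if moran(values, neighbours) >= observed:
            n_extreme += 1
    pseudo_p = (n_extreme + 1) / (n_perm + 1)
    return observed, pseudo_p

chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
obs, p = permutation_test([1, 1, 1, 5, 5, 5], chain)
print(obs, p)
```

With only six units, many permutations tie the observed clustering, so the pseudo p-value stays well above what the 54-municipality example produces; the point of the sketch is the mechanism, not the verdict.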
The lower the p-value, the lower the risk of committing an error by rejecting the null hypothesis, in which case we can say that the observed value is significantly different from a random distribution. This p-value is the ratio between the number of randomly generated Moran’s I values that are greater than or equal to the observed Moran’s I, plus 1, and the total number of random permutations, plus 1. Moran’s I can also take on negative values, and in that case the p-value is calculated from the number of randomly generated Moran’s I values that are smaller than or equal to the observed value. The p-value is more accurately called a pseudo p-value, because the significance threshold depends on the number of permutations. In this example, we can conclude that precipitation is significantly spatially autocorrelated. [MUSIC] We will now look at how to generate a spatial weighting file, calculate the global Moran’s I, and assess its significance in GeoDa. To begin, we need to open the shapefile that contains the municipality polygons and their corresponding precipitations. Next, we create a spatial weighting file by clicking on the “create weights” button. In the corresponding pop-up window, a unique identifier must be selected, here “ide”, and an appropriate weighting scheme must be defined. In this case we will use a Queen’s case, first order contiguity. After having clicked on the “create” button, we have to navigate to where we want to store this file and give it a name. Next, by selecting “connectivity histogram”, we can inspect the histogram, which indicates how many neighbours each polygon has. We can then, for example, highlight the most connected municipality, or the seven municipalities that are located on the borders and are the most poorly connected. Let’s take a look at the spatial weighting file. This can be done by simply opening the file in a text editor. We find the same structure that we demonstrated earlier.
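The pseudo p-value formula, including the handling of negative observed values described above, fits in a few lines. The simulated values below are fabricated purely to exercise the formula.

```python
def pseudo_p(simulated, observed):
    """Pseudo p-value: (count of simulated I at least as extreme + 1)
    divided by (number of permutations + 1). For a negative observed I,
    'as extreme' means smaller than or equal to the observed value."""
    if observed >= 0:
        extreme = sum(1 for s in simulated if s >= observed)
    else:
        extreme = sum(1 for s in simulated if s <= observed)
    return (extreme + 1) / (len(simulated) + 1)

# 999 hypothetical simulated Moran's I values, none reaching 0.79:
sims = [i / 10000 - 0.05 for i in range(999)]   # roughly -0.05 .. 0.05
print(pseudo_p(sims, 0.79))  # 0.001, the smallest value 999 permutations allow
```

This is why the threshold is tied to the number of permutations: with 999 permutations the pseudo p-value can never drop below 1/1000, and with 9999 it can never drop below 1/10000.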
The header is on the first line, followed by the neighbourhood description of polygon 1, which has 6 neighbours, then polygon 2 with 5, polygon 3 which also has 5, and so on. Now we can move on to the calculation of the Moran’s I. We first have to select the variable of interest: here the z that corresponds to monthly precipitation. The Moran’s scatterplot is automatically generated on the basis of the weighting file that we just created. It shows the standardized values and indicates a high Moran’s I of 0.79. The significance of the Moran’s I can be quickly assessed by right-clicking on the scatterplot and selecting “randomization”. Here we will generate 9999 random permutations. The histogram corresponds to a very small pseudo p-value, which means that we are unlikely to commit an error by rejecting the null hypothesis. Thus, the observed spatial structure, characterized by a Moran’s I of 0.79, is significantly different from a random spatial distribution. Each time we click “run”, we generate a different series of 9999 permutations with different statistics. Regardless, the pseudo p-value remains small. To finish, we will calculate the Moran’s I again, this time using Luc Anselin’s interpretation. Here, we can interpret the result as a linear regression between the z variable and the weighted average of the neighbouring precipitations (bar z). We can see that the beta coefficient, which is given by the slope of the regression line, is equal to 0.79, and by standardizing the distributions, originally given in tenths of a millimetre, we get the Moran’s scatterplot that we saw previously. Over the course of this lesson, we have shown you a number of criteria that can be used to define a geographic neighbourhood. It is important to assess whether the data we are using would be best represented using a fixed or an adaptive kernel or, if we are working with polygons, which type and order of contiguity should be used.
Because the manner in which we define the neighbourhood determines the spatial weighting scheme, this constitutes a key step in the calculation of spatial autocorrelation. Indeed, this step delimits the zone within which the variable’s behaviour will be compared for each geographic object. This comparison is made by calculating the correlation between the value of the variable of interest for each object and the average of this variable within each object’s neighbourhood. This value corresponds to the Moran’s I. Finally, we also showed you how to generate random permutations using the Monte-Carlo method in order to evaluate the significance of the Moran’s I. [MUSIC]