Welcome back. So in this video we're going to talk about different file organizations. So how to organize files, there are different ways. But the major, or the main approaches to organize a file, first, we have a heap file, we have sorted files, we have indices. So a heap file, you basically have a random order of the data, there's no specific order, okay? In a sorted file, you sort the data based on some attribute. In an index, it's a level above the sorted file, which you can still sort the data sometimes but you have an index too where some data is in the database. So this helps you to have a faster search sometimes, of the data. So in a heap file, this is just to give you an idea of how it looks like, the data is stored as multiple pages again on disc, and you have data pages. So you have the header page, this is the header of the file. And, again, a file in a database can represent a collection of the objects stored in the database. So you have a header page and the data page are the set of objects. If we have, for example, a relation database, a data page is like a subset of the table, for instance. And these data pages are connected in a heap format. So each page is kind of like a linked list. There's a linked list of these pages, and they're connected to each other through pointers that way. And whenever data comes in it's inserted into the heap file an empty space, okay? So you have pages that have empty spaces, you include it in them so in the header you have all this information, right? So it's a very simple file architecture. You can easily insert data into it. Because sometimes you just append the data into the heap file. So the idea of having different file organizations is to have different ways to organize the data. And hence, if you have different ways to organize the data, there will be different ways to insert new data and different ways to read the already stored data. And each approach has its own pros and its own cons. So the number of page access is a really important aspect when it comes to calculating the cost of operations which run on the file. Number of page accesses, like for instance, if you have a query or if you're reading some pages in the database. So if you have, for example, to read 3 pages, Reading 3 pages is definitely less time consuming than reading just 1 page. So every file organization that we are going to discuss has different cost in terms of the number of pages you read. And if you have a good cost model to capture the IO cost, when we say number of pages is I/O cost. So if you have different accurate cost models, you can actually estimate how much time the operation on the database can take. And if you have this kind of good estimate, you can optimize the queries and the application running on top of the database and minimize the time to application, right? So in a heap file, the advantage of this is that you can easily bulk load data into the database. So if you have a set of data and you want to bulk load it into the database, the easiest way is to use a heap file structure. It's actually very fast. Because there is no specific order that you need to take care of, you just keep inserting the data. So if you have small data, so having indices or sorting the data, maybe it's not that necessary. So having a heap file structure can be the best way to store the data in that case. So if the whole workload coming from the application is to load all the data all the time, having the data stored in a heap file is also good. Because you don't need any indices. You're not actually selecting a specific part of the data. You're selecting all the data. So a heap file will be very sufficient. And the disadvantages actually comes when you have more selective queries. So if you want to find a specific piece, like a needle in the haystack of the data that you have, heap files may not be very useful. Because in this case, you need to load the whole pages in the heap file. And also, if you need to sort the data for some reason, because in some operations you need to sort the data, having data sorted in a heap file will be very time consuming. If you have an index, so an indexed file. Index file, the benefits of them is that you have index entries. So you have the data stored here. And you have some sort of an index on top. And this index can magically, somehow, we're going to see how, example of indices, can magically takes you, whenever you're searching for a specific kind of data, it can takes you right away to where the data is, Instead of loading the whole data or checking every piece of the data one-by-one. So you have index entries that help you do that and you have data entries. We're going to take a look at examples of how indices work. A very popular example of type of indices used in database systems is the B+ tree index. In the B+ tree index you have, again, index entries. And these are basically the non-leaf pages of the tree. So a B+ tree is a typical tree data structure. So the non-leaf pages are index entries. And then you have the leaf pages, or the leaf nodes, in the tree. They have data entries. And they sort it by the key you build the index on. So the non-leaf pages, which are called the index entries, they have a very simple structure. You have the very important component, which is the key. And this is, again, the attribute you build the index on. So you key one, you have different values of this attribute. Let's assume the example of it is you have a database that has the students in it, and you have one attribute is the age of the student. And you want to build an index on the age so that you easily search students by their age. So the key in this case is the age. So you have multiple values of the key in the non-leaf pages, in the index entries. And you have two pointers with each key. The first pointer is on the left side and this pointer points to a node underneath and this node has all the keys that has a value less than key 1, k1. And on the left-hand side you have a pointer to all the subtree or the nodes, the children of this node that has a key less than k1. And so on and so forth. So you have all of these, you have multiple keys, you have different pointers, like the left pointer and the right pointer. And left has the value with the less. And right has the value with the higher than the key. So, again, another example, this is a more detailed example to show you how it works. This index is on the age attribute. So at the root node, you have the key 17. This means the age is 17. The left pointer points to a children node, such that every key in this children node has an age value less than 17. The right pointer points to a child node. And the child node and the right pointer has all the age values larger than 17. And you keep going down that way. So, also, you look at the left hand-side here. So this key, here in this node, is 5. Everything in the child node on the left pointer has age less than 5. And everything on the right child pointer has everything larger or equal to 5. So the tree structure is recursively build that way. So that when you search all the way down, I start from the root node, and let's assume the query, I want to retrieve all students such that the age is between is between 5 and 20. So in this case I start up there. I found that. Okay, so, is 5 less than 17? Yes, so I go down this way. 20 also is larger than 17, so I still need to trace down the tree this way, on the right direction too. So I go in the left direction first. I figure out, okay, where's 5 here? So I have 5. If everything less than 5, I'm going to ignore because I'm looking for age range between 5 and 20. So I'm going to ignore the everything on the left pointer and then go to the right pointer. So everything in this Pretty much area represent the data that I need to retrieve. And this index helps you get through this data way faster than just scanning the whole data to retrieve it.