Now, what we are gonna do here is run interactively,
using a program called Beeline.
But there are other run mechanisms that I will list at the end of the video.
So what's Apache Hive?
It's essentially a data warehousing software.
So it lets you structure and query data using an SQL-like language.
So you could have the data in HDFS or HBase, but
basically this sits on top of it and lets you do the querying.
The execution environment can be MapReduce on YARN, or Tez, or Spark.
So there are lots of options there, essentially.
And just like we saw in Pig, you can essentially write your own
Map and Reduce functions if you need to.
What are the kinds of applications that people use it for?
Hive is useful in data mining, analytics kinds of applications, machine learning;
you can do ad hoc analysis of data that's sitting in HDFS or HBase.
Again, there's a huge list of vendors and companies using this, and
application areas, so you can take a look at the link that I've got on there.
But these are the types of areas people typically use it in.
So for our example, what we'll do is go back to our passwd file example from
Pig, and try to do the same thing using Hive.
So we, again, start by loading the file into HDFS and,
as I mentioned, the command is hdfs dfs -put /etc/passwd, except
this time we're gonna stick it in the /tmp directory.
Now this /tmp is on the HDFS file system.
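Put together, the terminal commands would look roughly like this; the -ls is just a sanity check, the same one that shows up in the snapshot later:

    hdfs dfs -put /etc/passwd /tmp    # copy the local passwd file into /tmp on HDFS
    hdfs dfs -ls /tmp                 # confirm the file is actually there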
And what we are doing is trying to extract the same
information we got from Pig, this time using Hive.
And as I mentioned, we're gonna use Beeline to access this data,
and access Hive, and run things interactively.
That's the command to do that.
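As a rough sketch, connecting with Beeline typically looks like this; the JDBC URL and port here are the common HiveServer2 defaults, not something shown in the video, so treat them as assumptions:

    beeline -u jdbc:hive2://localhost:10000    # connect Beeline to the local HiveServer2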
Remember, all these are commands that you're typing in the terminal
window of your Quickstart VM.
This is how it looks essentially.
I have a snapshot of the Quickstart window that I used, and
you can see the dfs -put command on the first line, and
then I'm using a dfs -ls to make sure the passwd file is indeed in there.
Then I'm just running Beeline to get interactive access to Hive.
There are a lot of Hive commands that are going to be used,
so I'm listing them at the start so that we have some place to copy and
paste them from if you're following along with the video.
So the first command essentially creates the table that will hold
the import from the passwd file.
So you kind of look at the passwd file, see what the entries are, and
then you can just create appropriate string columns in this table.
So you have a user name, a password, a UID, a GID,
a full name, a home directory, and a shell.
And we're also specifying the row format parameter, the field delimiter;
remember in Pig you had to specify that explicitly,
so we are doing it here too.
And it's a text file, so we tell Hive it's stored as a plain text file.
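Putting that together, the create statement would look roughly like the following. The table and column names (userinfo, uname, fullname, hdir, and so on) come from what's mentioned in this video, but the exact statement is a sketch, not a copy of the slide:

    CREATE TABLE userinfo (
      uname    STRING,   -- user name
      pswd     STRING,   -- password placeholder (usually 'x')
      uid      STRING,   -- numeric user id
      gid      STRING,   -- numeric group id
      fullname STRING,   -- full name (GECOS field)
      hdir     STRING,   -- home directory
      shell    STRING    -- login shell
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ':'   -- /etc/passwd is colon-delimited
    STORED AS TEXTFILE;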
So, once the table is set up, we essentially load
the data, giving it the /tmp/passwd path,
and load it into the table userinfo.
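A minimal sketch of that load, assuming the userinfo table above:

    LOAD DATA INPATH '/tmp/passwd' INTO TABLE userinfo;   -- moves the file from /tmp on HDFS into Hive's warehouse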
And now what we're doing is saying, hey, let's select the user name, the full
name, and the home directory from this table,
all ordered by user name.
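In HiveQL, with the column names sketched above, that query would be roughly:

    SELECT uname, fullname, hdir
    FROM userinfo
    ORDER BY uname;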
So essentially we're trying to get the same data we got previously from Pig.
So how does this look from our snapshot point of view?
So we're just running these commands now.
So first we create the userinfo table.
And it says no rows affected; that's because we have no data in it yet.
Then we load the input.
And essentially we're writing it into the table;
at this point we just have the raw data.
Now we essentially do the select of uname,
fullname, and home directory from userinfo.
And once we do that, you can see that on the Hadoop backend,
MapReduce jobs are launched.
Essentially one job gets run, and the number of reducers is set to one because we
just have one task in this case.
You can actually change the number of reducers if you have more tasks, of course.
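For example, from the Beeline prompt you could typically do something like this; mapred.reduce.tasks is the classic property name, and newer setups may use mapreduce.job.reduces instead:

    SET mapred.reduce.tasks=4;   -- ask for 4 reduce tasks on subsequent queries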