Let's create a simple word count example using Spark and the Hadoop file system (HDFS) in a Linux environment.
I hope you have already installed Apache Spark.
Now create a simple text file. Let's name it "input.txt" and give it the following content:
people are not as beautiful as they look,
as they walk or as they talk.
they are only as beautiful as they love,
as they care as they share.
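One way to create the file straight from the terminal is with a heredoc (assuming you are in your home directory):

[cloudera@quickstart ~]$ cat > input.txt << 'EOF'
people are not as beautiful as they look,
as they walk or as they talk.
they are only as beautiful as they love,
as they care as they share.
EOF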
Start Spark by typing spark-shell in your Linux terminal.
[cloudera@quickstart ~]$ spark-shell
After a few moments you should be in the Scala REPL.
In order to get the word count, we need to:
- load the file content
- split the words
- get the count
scala> sc.textFile("input.txt").flatMap(_.split(" ")).count
When you try to execute this command, you will get an error saying:
org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: hdfs://localhost:8020/user/input.txt
This is because Spark resolves a bare path like "input.txt" against HDFS, the default file system in this setup, and HDFS has its own directory tree separate from your local Linux file system.
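As an aside, if you want Spark to read the local file without copying it to HDFS, you can pass an explicit file:// URI instead; the path below assumes input.txt sits in the cloudera user's home directory:

scala> sc.textFile("file:///home/cloudera/input.txt").flatMap(_.split(" ")).count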
You can inspect the HDFS file system with:
[cloudera@quickstart ~]$ hdfs dfs -ls /
and you will see that your file is not there. So we need to copy our local file into HDFS:
[cloudera@quickstart ~]$ hdfs dfs -mkdir /nuwan
[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal input.txt /nuwan
Now we have copied our file to /nuwan/input.txt, which you can verify with:
[cloudera@quickstart ~]$ hdfs dfs -ls /nuwan
Found 1 items
-rw-r--r-- 1 cloudera supergroup 144 2017-06-18 12:32 /nuwan/input.txt
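You can also print the file's contents from HDFS to double-check the copy:

[cloudera@quickstart ~]$ hdfs dfs -cat /nuwan/input.txt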
Now run your command again with the HDFS path:
scala> sc.textFile("/nuwan/input.txt").flatMap(_.split(" ")).count
res14: Long = 30
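Note that count only gives the total number of words. For the classic word count, i.e. the number of occurrences of each distinct word, a minimal sketch using the same RDD API would be:

scala> sc.textFile("/nuwan/input.txt").flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect

Here map emits a (word, 1) pair for every word, reduceByKey sums the 1s for each distinct word, and collect returns the result as an Array[(String, Int)].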