Apache Spark Hadoop Word Count Example

Let's create a simple word count example using Spark and the Hadoop file system (HDFS) in a Linux environment.

I hope you have already installed Apache Spark.

Now create a simple text file. Let's name it "input.txt" and give it the following content:

people are not as beautiful as they look,
as they walk or as they talk.
they are only as beautiful as they love,
as they care as they share.
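
One way to create it straight from the terminal (any text editor works just as well):

[cloudera@quickstart ~]$ cat > input.txt

then paste the four lines above and press Ctrl+D to save.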

Start Spark by typing spark-shell in your Linux terminal.

[cloudera@quickstart ~]$ spark-shell


After some time you should land in the Scala REPL.

In order to get the word count we need to:
  1. load the file content
  2. split each line into words
  3. count the words
The only command you need to get the count is

scala> sc.textFile("input.txt").flatMap(_.split(" ")).count
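
For clarity, here is the same pipeline split into its three steps. sc is the SparkContext the shell creates for you, and because RDD transformations are lazy, nothing actually runs until count is called:

val lines = sc.textFile("input.txt")     // 1. load the file content (lazy)
val words = lines.flatMap(_.split(" "))  // 2. split each line into words (lazy)
words.count                              // 3. count the words; this triggers the job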


When you try to execute this command you will get an error saying

org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: hdfs://localhost:8020/user/input.txt


This is because Spark resolves the bare path against HDFS (note the hdfs://localhost:8020 prefix in the error), and input.txt exists only on the local file system, not in HDFS.
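
If you only want to read the local copy, Spark also accepts an explicit file:// URI instead of an HDFS path. The path below is an assumption (the file sitting in the cloudera user's home directory); on a single-node VM like the quickstart image this works because the one worker can see the local path:

scala> sc.textFile("file:///home/cloudera/input.txt").flatMap(_.split(" ")).count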

You can check the HDFS file system with

[cloudera@quickstart ~]$ hdfs dfs -ls /


and you will see that your file is not there. So we need to copy our local file into HDFS: first create a directory, then copy the file.

[cloudera@quickstart ~]$ hdfs dfs -mkdir /nuwan


[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal input.txt /nuwan


So now we have copied our file to /nuwan/input.txt, which we can verify:

[cloudera@quickstart ~]$ hdfs dfs -ls /nuwan
Found 1 items
-rw-r--r--   1 cloudera supergroup        144 2017-06-18 12:32 /nuwan/input.txt


Now run your command again with the HDFS path

scala> sc.textFile("/nuwan/input.txt").flatMap(_.split(" ")).count
res14: Long = 30  
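
Note that count returns the total number of words in the file (30 here). If you want the frequency of each distinct word instead, the same pipeline extends with the standard RDD map/reduceByKey pattern. A minimal sketch:

scala> val counts = sc.textFile("/nuwan/input.txt").
     |   flatMap(_.split(" ")).        // split each line into words
     |   map(word => (word, 1)).       // pair every word with a count of 1
     |   reduceByKey(_ + _)            // sum the counts per distinct word
scala> counts.collect().foreach(println)  // prints one (word, count) pair per line

Keep in mind that split(" ") leaves punctuation attached, so "look," and "look" would be counted as different words; splitting on a regex such as "\\W+" would avoid that.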
