Apache Spark Hadoop Word Count Example

Let's create a simple word count example using Spark and the Hadoop file system (HDFS) in a Linux environment.

I hope you have already installed Apache Spark.

Now create a simple text file. Let's name it "input.txt" and give it the following content:

people are not as beautiful as they look,
as they walk or as they talk.
they are only as beautiful as they love,
as they care as they share.
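
One way to create it straight from the terminal (any text editor works just as well):

[cloudera@quickstart ~]$ cat > input.txt

then paste the four lines above and press Ctrl+D to save.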

Start Spark by typing spark-shell in your Linux terminal.

[cloudera@quickstart ~]$ spark-shell


After some time you should land in the Scala REPL.

In order to get the word count we need to:
  1. load the file content
  2. split each line into words
  3. count the words
The only command you need to get the count is

scala> sc.textFile("input.txt").flatMap(_.split(" ")).count
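
For clarity, here is the same pipeline split into its three steps. sc is the SparkContext the shell creates for you, and because RDD transformations are lazy, nothing actually runs until count is called:

val lines = sc.textFile("input.txt")     // 1. load the file content (lazy)
val words = lines.flatMap(_.split(" "))  // 2. split each line into words (lazy)
words.count                              // 3. count the words; this triggers the job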


When you try to execute this command you will get an error saying

org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: hdfs://localhost:8020/user/input.txt


This is because Spark resolves the bare path against HDFS (note the hdfs://localhost:8020 prefix in the error), and input.txt exists only on the local file system, not in HDFS.
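
If you only want to read the local copy, Spark also accepts an explicit file:// URI instead of an HDFS path. The path below is an assumption (the file sitting in the cloudera user's home directory); on a single-node VM like the quickstart image this works because the one worker can see the local path:

scala> sc.textFile("file:///home/cloudera/input.txt").flatMap(_.split(" ")).count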

You can check the HDFS file system with

[cloudera@quickstart ~]$ hdfs dfs -ls /


and you will see that your file is not there. So we need to copy our local file into HDFS: first create a directory, then copy the file.

[cloudera@quickstart ~]$ hdfs dfs -mkdir /nuwan


[cloudera@quickstart ~]$ hdfs dfs -copyFromLocal input.txt /nuwan


So now we have copied our file to /nuwan/input.txt, which we can verify:

[cloudera@quickstart ~]$ hdfs dfs -ls /nuwan
Found 1 items
-rw-r--r--   1 cloudera supergroup        144 2017-06-18 12:32 /nuwan/input.txt


Now run your command again with the HDFS path

scala> sc.textFile("/nuwan/input.txt").flatMap(_.split(" ")).count
res14: Long = 30  
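
Note that count returns the total number of words in the file (30 here). If you want the frequency of each distinct word instead, the same pipeline extends with the standard RDD map/reduceByKey pattern. A minimal sketch:

scala> val counts = sc.textFile("/nuwan/input.txt").
     |   flatMap(_.split(" ")).        // split each line into words
     |   map(word => (word, 1)).       // pair every word with a count of 1
     |   reduceByKey(_ + _)            // sum the counts per distinct word
scala> counts.collect().foreach(println)  // prints one (word, count) pair per line

Keep in mind that split(" ") leaves punctuation attached, so "look," and "look" would be counted as different words; splitting on a regex such as "\\W+" would avoid that.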
