■ 10. Running MapReduce with Java
1. Set up the environment so the CLASSPATH points at the jar files
   (the Java executables) under the Hadoop home directory.
2. Make a directory called labs under the Hadoop home directory and
   create a SingleFileWriteRead.java file there.
   ↓
   "Java code that writes a single text file into the
    Hadoop file system"
3. Create a file called PutMerge.java, which merges two text files
   into one and uploads the result to the Hadoop file system.
4. Run WordCount with Java.
   ↓
   Prints every word in the Frozen script along with its count.
   SQL ------> done with the REGEXP_COUNT function
   Python ---> count, len
   Linux ----> wc
   Hadoop ---> done with Java code
   R --------> not covered yet
   (a single-machine Java sketch of the same idea follows below)
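* For intuition, here is what "word count" boils down to on one machine in
  plain Java (LocalWordCount is a made-up name for illustration, not part of
  the labs); MapReduce distributes exactly this split-and-tally logic:
===================================================================================
import java.util.HashMap;
import java.util.Map;

public class LocalWordCount {
    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog the end";
        Map<String, Integer> counts = new HashMap<String, Integer>();
        // Split on whitespace and tally each word.
        for (String word : text.split("\\s+")) {
            Integer n = counts.get(word);
            counts.put(word, n == null ? 1 : n + 1);
        }
        // Print word<TAB>count, the same format WordCount produces.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
===================================================================================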
▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
Always:
start-all.sh   // then run jps to confirm the daemons before starting!
# HDFS I/O with Java (pp. 82-84)
1. Configure the environment so the CLASSPATH points at the jar files
   (the Java executables) under the Hadoop home directory:
[oracle@edydr1p2 ~]$ cat >> .bash_profile << EOF    * you can append it with the cat command like this
export CLASSPATH=.:$HADOOP_HOME/hadoop-ant-1.2.1.jar:$HADOOP_HOME/hadoop-client-1.2.1.jar:$HADOOP_HOME/hadoop-core-1.2.1.jar:$HADOOP_HOME/hadoop-examples-1.2.1.jar:$HADOOP_HOME/hadoop-minicluster-1.2.1.jar:$HADOOP_HOME/hadoop-test-1.2.1.jar:$HADOOP_HOME/hadoop-tools-1.2.1.jar:$HADOOP_HOME/lib/asm-3.2.jar:$HADOOP_HOME/lib/aspectjrt-1.6.11.jar:$HADOOP_HOME/lib/aspectjtools-1.6.11.jar:$HADOOP_HOME/lib/commons-beanutils-1.7.0.jar:$HADOOP_HOME/lib/commons-beanutils-core-1.8.0.jar:$HADOOP_HOME/lib/commons-cli-1.2.jar:$HADOOP_HOME/lib/commons-codec-1.4.jar:$HADOOP_HOME/lib/commons-collections-3.2.1.jar:$HADOOP_HOME/lib/commons-configuration-1.6.jar:$HADOOP_HOME/lib/commons-daemon-1.0.1.jar:$HADOOP_HOME/lib/commons-digester-1.8.jar:$HADOOP_HOME/lib/commons-el-1.0.jar:$HADOOP_HOME/lib/commons-httpclient-3.0.1.jar:$HADOOP_HOME/lib/commons-io-2.1.jar:$HADOOP_HOME/lib/commons-lang-2.4.jar:$HADOOP_HOME/lib/commons-logging-1.1.1.jar:$HADOOP_HOME/lib/commons-logging-api-1.0.4.jar:$HADOOP_HOME/lib/commons-math-2.1.jar:$HADOOP_HOME/lib/commons-net-3.1.jar:$HADOOP_HOME/lib/core-3.1.1.jar:$HADOOP_HOME/lib/hadoop-capacity-scheduler-1.2.1.jar:$HADOOP_HOME/lib/hadoop-fairscheduler-1.2.1.jar:$HADOOP_HOME/lib/hadoop-thriftfs-1.2.1.jar:$HADOOP_HOME/lib/hsqldb-1.8.0.10.jar:$HADOOP_HOME/lib/jackson-core-asl-1.8.8.jar:$HADOOP_HOME/lib/jackson-mapper-asl-1.8.8.jar:$HADOOP_HOME/lib/jasper-compiler-5.5.12.jar:$HADOOP_HOME/lib/jasper-runtime-5.5.12.jar:$HADOOP_HOME/lib/jdeb-0.8.jar:$HADOOP_HOME/lib/jersey-core-1.8.jar:$HADOOP_HOME/lib/jersey-json-1.8.jar:$HADOOP_HOME/lib/jersey-server-1.8.jar:$HADOOP_HOME/lib/jets3t-0.6.1.jar:$HADOOP_HOME/lib/jetty-6.1.26.jar:$HADOOP_HOME/lib/jetty-util-6.1.26.jar:$HADOOP_HOME/lib/jsch-0.1.42.jar:$HADOOP_HOME/lib/junit-4.5.jar:$HADOOP_HOME/lib/kfs-0.2.2.jar:$HADOOP_HOME/lib/log4j-1.2.15.jar:$HADOOP_HOME/lib/mockito-all-1.8.5.jar:$HADOOP_HOME/lib/oro-2.0.8.jar:$HADOOP_HOME/lib/servlet-api-2.5-20081211.jar:$HADOOP_HOME/lib/slf4j-api-1.4.3.jar:$HADOOP_HOME/lib/slf4j-log4j12-1.4.3.jar:$HADOOP_HOME/lib/xmlenc-0.52.jar:$CLASSPATH
EOF
[oracle@edydr1p2 ~]$ source .bash_profile
* ". .bash_profile" does the same thing; the dot is equivalent to source.
2. Make a directory called labs under the Hadoop home directory and
   create SingleFileWriteRead.java there.
   SingleFileWriteRead.java ---> a Java program that writes a file
   into the Hadoop file system
[oracle@edydr1p2 ~]$ cd $HADOOP_HOME
[oracle@edydr1p2 hadoop-1.2.1]$ mkdir labs
[oracle@edydr1p2 hadoop-1.2.1]$ cd labs
[oracle@edydr1p2 labs]$ vi SingleFileWriteRead.java
===================================================================================
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class SingleFileWriteRead {
    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Usage: SingleFileWriteRead <filename> <contents>");
            System.exit(2);
        }
        try {
            // Point the client at the HDFS NameNode.
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://localhost:9000");
            FileSystem hdfs = FileSystem.get(conf);

            // If the target file already exists in HDFS, remove it first.
            Path path = new Path(args[0]);
            if (hdfs.exists(path)) {
                hdfs.delete(path, true);
            }

            // Write the <contents> argument into the HDFS file.
            FSDataOutputStream outStream = hdfs.create(path);
            outStream.writeUTF(args[1]);
            outStream.close();

            // Read it back to verify.
            FSDataInputStream inputStream = hdfs.open(path);
            String inputString = inputStream.readUTF();
            inputStream.close();
            System.out.println("## inputString:" + inputString);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
===================================================================================
// From Java we can access the Hadoop file system directly.
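* Note: SingleFileWriteRead writes its <contents> argument into HDFS; it does
  not read the local file. To upload an actual local file the way
  hadoop fs -put does, a minimal sketch could use FileSystem.copyFromLocalFile
  (the class name LocalFilePut is hypothetical, not part of this lab):
===================================================================================
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalFilePut {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: LocalFilePut <localsrc> <hdfsdst>");
            System.exit(2);
        }
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000");
        FileSystem hdfs = FileSystem.get(conf);
        // Reads the local path and writes its bytes to the HDFS path.
        hdfs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));
    }
}
===================================================================================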
3. Compile the Java file to produce the class file.
[oracle@edydr1p2 labs]$ javac SingleFileWriteRead.java
4. Create a test file, first.txt:
$ vi first.txt
aaaaaaa
bbbbbbb
ccccccc
5. Use the SingleFileWriteRead class file to write first.txt into
   the Hadoop file system.
[oracle@edydr1p2 labs]$ hadoop -cp $CLASSPATH:. SingleFileWriteRead first.txt firstText
## inputString:firstText   ** the string firstText was written into Hadoop by the Java code
[oracle@edydr1p2 labs]$ hadoop -cp $CLASSPATH:. SingleFileWriteRead second.txt "secondText thirdText"
## inputString:secondText thirdText
** this is the Java equivalent of uploading with hadoop fs -put
** if you run hadoop fs -cat second.txt, it prints
*** secondText thirdText[orcl:labs]$  (the prompt follows immediately
    because the file has no trailing newline; writeUTF also prefixes
    the text with a 2-byte length)
[oracle@edydr1p2 labs]$ hadoop fs -ls    ** you can see first and second were uploaded
Found 2 items
-rw-r--r-- 3 oracle supergroup 11 2014-09-18 22:03 /user/oracle/first.txt
-rw-r--r-- 3 oracle supergroup 22 2014-09-18 22:03 /user/oracle/second.txt
[oracle@edydr1p2 labs]$ hadoop fs -cat first.txt
firstText
[oracle@edydr1p2 labs]$ hadoop fs -cat second.txt
secondText thirdText
[oracle@edydr1p2 labs]$ hadoop fs -rm first.txt second.txt    * delete them
Deleted hdfs://localhost:9000/user/oracle/first.txt
Deleted hdfs://localhost:9000/user/oracle/second.txt
▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
■ Lab: merging two files into one and uploading it to the Hadoop file system
# PutMerge
[oracle@edydr1p2 labs]$ vi PutMerge.java
============================================================
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class PutMerge {
    public static void main(String[] args) throws IOException {
        // One handle for HDFS, one for the local file system.
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000");
        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);

        Path inputDir = new Path(args[0]);   // local directory to merge
        Path hdfsFile = new Path(args[1]);   // merged output file in HDFS
        try {
            // List every file in the local input directory.
            FileStatus[] inputFiles = local.listStatus(inputDir);
            FSDataOutputStream out = hdfs.create(hdfsFile);
            for (int i = 0; i < inputFiles.length; i++) {
                System.out.println(inputFiles[i].getPath().getName());
                // Append each local file's bytes to the single HDFS file.
                FSDataInputStream in = local.open(inputFiles[i].getPath());
                byte buffer[] = new byte[256];
                int bytesRead = 0;
                while ((bytesRead = in.read(buffer)) > 0) {
                    out.write(buffer, 0, bytesRead);
                }
                in.close();
            }
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
============================================================
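* Aside: the hand-rolled byte-buffer loop in PutMerge can be replaced with
  Hadoop's own stream utility, org.apache.hadoop.io.IOUtils.copyBytes. A
  sketch of the equivalent copy step (CopyHelper is a made-up wrapper, not
  part of the lab):
============================================================
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.io.IOUtils;

public class CopyHelper {
    // Copies in to out with a 4096-byte buffer; 'false' leaves both
    // streams open so the caller closes them, as PutMerge does.
    public static void copy(InputStream in, OutputStream out) throws IOException {
        IOUtils.copyBytes(in, out, 4096, false);
    }
}
============================================================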
[oracle@edydr1p2 labs]$ javac PutMerge.java
[oracle@edydr1p2 labs]$ mkdir inputtext
[oracle@edydr1p2 labs]$ cp /etc/hosts inputtext/
[oracle@edydr1p2 labs]$ cp /etc/group inputtext/
[oracle@edydr1p2 labs]$ ls -l inputtext/
total 8
-rw-r--r-- 1 oracle oinstall 717 Sep 18 22:07 group
-rw-r--r-- 1 oracle oinstall 294 Sep 18 22:07 hosts
* these two files will be merged into result.txt below
[oracle@edydr1p2 labs]$ hadoop -cp $CLASSPATH:. PutMerge inputtext result.txt
group
hosts
[oracle@edydr1p2 labs]$ hadoop fs -ls
Found 1 items
-rw-r--r-- 3 oracle supergroup 1011 2014-09-18 22:08 /user/oracle/result.txt
[oracle@edydr1p2 labs]$ hadoop fs -cat result.txt
root:x:0:root
bin:x:1:root,bin,daemon
daemon:x:2:root,bin,daemon
sys:x:3:root,bin,adm
adm:x:4:root,adm,daemon
...
# Do not remove the following line, or various programs
# that require network functionality will fail.
192.168.100.101 edydr1p1.us.oracle.com edydr1p1
192.168.100.102 edydr1p2.us.oracle.com edydr1p2
127.0.0.1 edydr1p2.us.oracle.com edydr1p2 localhost.localdomain localhost
[oracle@edydr1p2 labs]$ hadoop fs -rm result.txt    * lab finished, so delete it
Deleted hdfs://localhost:9000/user/oracle/result.txt
[oracle@edydr1p2 labs]$ rm -rf inputtext/
▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
# WordCount example
1. Create a directory called input in the Hadoop file system.
[oracle@edydr1p2 labs]$ hadoop fs -mkdir input
2. Put the local /etc/hosts file into the input directory in HDFS,
   and also put README.txt from the Hadoop home directory there.
[oracle@edydr1p2 labs]$ hadoop fs -put /etc/hosts input
[oracle@edydr1p2 labs]$ hadoop fs -put $HADOOP_HOME/README.txt input
[oracle@edydr1p2 labs]$ hadoop fs -lsr
drwxr-xr-x - oracle supergroup 0 2014-09-18 22:16 /user/oracle/input
-rw-r--r-- 3 oracle supergroup 1366 2014-09-18 22:16 /user/oracle/input/README.txt
-rw-r--r-- 3 oracle supergroup 294 2014-09-18 22:15 /user/oracle/input/hosts
3. Use the MapReduce job built into the Hadoop distribution
   (hadoop-examples-*.jar) to word-count the file we just uploaded
   (a sketch of the logic it runs follows below).
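For reference, the wordcount class in hadoop-examples-*.jar is the classic
WordCount job: the mapper emits (word, 1) for every token and the reducer
sums the ones. A minimal sketch of that logic (not the exact bundled source)
looks like this:
===================================================================================
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every whitespace-separated token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sum the 1s collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combine locally before shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
===================================================================================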
[oracle@edydr1p2 labs]$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount input/hosts output1
14/09/18 22:17:10 INFO input.FileInputFormat: Total input paths to process : 1
14/09/18 22:17:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/09/18 22:17:10 WARN snappy.LoadSnappy: Snappy native library not loaded
14/09/18 22:17:10 INFO mapred.JobClient: Running job: job_201409021854_0002
14/09/18 22:17:11 INFO mapred.JobClient: map 0% reduce 0%
14/09/18 22:17:15 INFO mapred.JobClient: map 100% reduce 0%
14/09/18 22:17:23 INFO mapred.JobClient: map 100% reduce 100%
14/09/18 22:17:24 INFO mapred.JobClient: Job complete: job_201409021854_0002
14/09/18 22:17:24 INFO mapred.JobClient: Counters: 29
14/09/18 22:17:24 INFO mapred.JobClient: Job Counters
14/09/18 22:17:24 INFO mapred.JobClient: Launched reduce tasks=1
14/09/18 22:17:24 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=4013
14/09/18 22:17:24 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/09/18 22:17:24 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/09/18 22:17:24 INFO mapred.JobClient: Launched map tasks=1
14/09/18 22:17:24 INFO mapred.JobClient: Data-local map tasks=1
14/09/18 22:17:24 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8201
14/09/18 22:17:24 INFO mapred.JobClient: File Output Format Counters
14/09/18 22:17:24 INFO mapred.JobClient: Bytes Written=290
14/09/18 22:17:24 INFO mapred.JobClient: FileSystemCounters
14/09/18 22:17:24 INFO mapred.JobClient: FILE_BYTES_READ=396
14/09/18 22:17:24 INFO mapred.JobClient: HDFS_BYTES_READ=404
14/09/18 22:17:24 INFO mapred.JobClient: FILE_BYTES_WRITTEN=126001
14/09/18 22:17:24 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=290
14/09/18 22:17:24 INFO mapred.JobClient: File Input Format Counters
14/09/18 22:17:24 INFO mapred.JobClient: Bytes Read=294
14/09/18 22:17:24 INFO mapred.JobClient: Map-Reduce Framework
14/09/18 22:17:24 INFO mapred.JobClient: Map output materialized bytes=396
14/09/18 22:17:24 INFO mapred.JobClient: Map input records=7
14/09/18 22:17:24 INFO mapred.JobClient: Reduce shuffle bytes=396
14/09/18 22:17:24 INFO mapred.JobClient: Spilled Records=50
14/09/18 22:17:24 INFO mapred.JobClient: Map output bytes=386
14/09/18 22:17:24 INFO mapred.JobClient: Total committed heap usage (bytes)=222298112
14/09/18 22:17:24 INFO mapred.JobClient: CPU time spent (ms)=1660
14/09/18 22:17:24 INFO mapred.JobClient: Combine input records=28
14/09/18 22:17:24 INFO mapred.JobClient: SPLIT_RAW_BYTES=110
14/09/18 22:17:24 INFO mapred.JobClient: Reduce input records=25
14/09/18 22:17:24 INFO mapred.JobClient: Reduce input groups=25
14/09/18 22:17:24 INFO mapred.JobClient: Combine output records=25
14/09/18 22:17:24 INFO mapred.JobClient: Physical memory (bytes) snapshot=236175360
14/09/18 22:17:24 INFO mapred.JobClient: Reduce output records=25
14/09/18 22:17:24 INFO mapred.JobClient: Virtual memory (bytes) snapshot=795942912
14/09/18 22:17:24 INFO mapred.JobClient: Map output records=28
[oracle@edydr1p2 labs]$ hadoop fs -lsr
drwxr-xr-x - oracle supergroup 0 2014-09-18 22:16 /user/oracle/input
-rw-r--r-- 3 oracle supergroup 1366 2014-09-18 22:16 /user/oracle/input/README.txt
-rw-r--r-- 3 oracle supergroup 294 2014-09-18 22:15 /user/oracle/input/hosts
drwxr-xr-x - oracle supergroup 0 2014-09-18 22:17 /user/oracle/output1
-rw-r--r-- 3 oracle supergroup 0 2014-09-18 22:17 /user/oracle/output1/_SUCCESS
drwxr-xr-x - oracle supergroup 0 2014-09-18 22:17 /user/oracle/output1/_logs
drwxr-xr-x - oracle supergroup 0 2014-09-18 22:17 /user/oracle/output1/_logs/history
-rw-r--r-- 3 oracle supergroup 13855 2014-09-18 22:17 /user/oracle/output1/_logs/history/job_201409021854_0002_1411046230148_oracle_word+count
-rw-r--r-- 3 oracle supergroup 54765 2014-09-18 22:17 /user/oracle/output1/_logs/history/job_201409021854_0002_conf.xml
-rw-r--r-- 3 oracle supergroup 290 2014-09-18 22:17 /user/oracle/output1/part-r-00000
[oracle@edydr1p2 labs]$ hadoop fs -get output1/part-r-00000 /home/oracle/result1.txt
[oracle@edydr1p2 labs]$ hadoop fs -cat output1/part-r-00000
# 2
127.0.0.1 1
192.168.100.101 1
192.168.100.102 1
Do 1
edydr1p1 1
edydr1p1.us.oracle.com 1
edydr1p2 2
edydr1p2.us.oracle.com 2
fail. 1
following 1
functionality 1
line, 1
localhost 1
localhost.localdomain 1
network 1
not 1
or 1
programs 1
remove 1
require 1
that 1
the 1
various 1
will 1
[oracle@edydr1p2 labs]$ hadoop fs -rmr output*
Deleted hdfs://localhost:9000/user/oracle/output1
[oracle@edydr1p2 labs]$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount input/README.txt output2
14/09/18 22:19:03 INFO input.FileInputFormat: Total input paths to process : 1
14/09/18 22:19:03 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/09/18 22:19:03 WARN snappy.LoadSnappy: Snappy native library not loaded
14/09/18 22:19:04 INFO mapred.JobClient: Running job: job_201409021854_0003
14/09/18 22:19:05 INFO mapred.JobClient: map 0% reduce 0%
14/09/18 22:19:08 INFO mapred.JobClient: map 100% reduce 0%
14/09/18 22:19:15 INFO mapred.JobClient: map 100% reduce 33%
14/09/18 22:19:17 INFO mapred.JobClient: map 100% reduce 100%
14/09/18 22:19:17 INFO mapred.JobClient: Job complete: job_201409021854_0003
...
[oracle@edydr1p2 labs]$ hadoop fs -cat output2/part-r-00000
(BIS), 1
(ECCN) 1
(TSU) 1
(see 1
5D002.C.1, 1
740.13) 1
Administration 1
Apache 1
BEFORE 1
BIS 1
Bureau 1
Commerce, 1
Commodity 1
...
[oracle@edydr1p2 labs]$ hadoop fs -rmr output*
Deleted hdfs://localhost:9000/user/oracle/output2
[oracle@edydr1p2 labs]$ hadoop fs -rmr input
Deleted hdfs://localhost:9000/user/oracle/input
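* Since part-r-00000 is plain text with one word<TAB>count pair per line, it
  can also be read back through the same HDFS API used earlier, instead of
  hadoop fs -cat / -get. A sketch (ReadResult is a hypothetical name):
===================================================================================
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadResult {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000");
        FileSystem hdfs = FileSystem.get(conf);
        // Open the result file in HDFS and read it line by line.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(hdfs.open(new Path(args[0]))));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t");   // word TAB count
            System.out.println(parts[0] + " appears " + parts[1] + " time(s)");
        }
        reader.close();
    }
}
===================================================================================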
▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒