1. Create the working folder and move into it (copy the jar files needed for execution into /home/hadoop/S2RDF_work, create a working folder (e.g., test3) there, and generate the triple data in it; a sketch of the jar copy follows the sub-steps below)
a. mkdir /home/hadoop/S2RDF_work
b. cd /home/hadoop/S2RDF_work
c. mkdir test3
d. cd test3
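The jar copy mentioned in step 1 is not shown above; a minimal sketch, assuming the three S2RDF jars used in the later steps were already built or downloaded to a hypothetical ~/S2RDF_build directory:
# Copy the jars used by the later steps into the working directory
# (~/S2RDF_build is an assumed location; adjust it to wherever the jars actually are)
cp ~/S2RDF_build/datasetcreator_2.10-1.1.jar /home/hadoop/S2RDF_work/
cp ~/S2RDF_build/queryTranslator-1.1.0.jar /home/hadoop/S2RDF_work/
cp ~/S2RDF_build/queryexecutor_2.10-1.1.jar /home/hadoop/S2RDF_work/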
2. Create the triple data file (test3.nq)
vi test3.nq
===>
<http://www.w3.org/2002/07/owl#Thing> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Resource> .
<http://www.w3.org/2002/07/owl#Thing> <http://www.w3.org/1999/02/22-rdf-syntax-ns#have> <http://www.w3.org/2000/01/rdf-schema#Resource2> .
<http://www.w3.org/2002/07/owl#Thing2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Resource> .
<http://www.w3.org/2002/07/owl#Thing2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#have> <http://www.w3.org/2000/01/rdf-schema#Resource2> .
<http://www.w3.org/2002/07/owl#Thing2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#have> <http://www.w3.org/2000/01/rdf-schema#Resource3> .
3. Upload to HDFS (a quick verification sketch follows the sub-steps)
a. hadoop fs -mkdir test3
b. hadoop fs -put test3.nq test3
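Before launching the Spark jobs it is worth confirming that the file actually landed on HDFS; a quick check using only the paths created above:
# Confirm that test3.nq is in the HDFS test3 directory and readable
hadoop fs -ls test3
hadoop fs -cat test3/test3.nq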
4. Run DataSetCreator (DB name: test3; executed from /home/hadoop/S2RDF_work; test3.nq is under the test3 folder on HDFS; a sketch for inspecting the generated output follows the sub-steps)
a. Generate Vertical Partitioning
$HOME/spark/bin/spark-submit --driver-memory 1g --class runDriver --master yarn --executor-memory 1g --deploy-mode cluster ./datasetcreator_2.10-1.1.jar test3/ test3.nq VP 0.2
==> /tmp/stat_vp.txt is created on the server where the job ran
==> Contents of stat_vp.txt (cat stat_vp.txt; fields are tab-separated)
VP Statistic
---------------------------------------------------------
<<http://www.w3.org/1999/02/22-rdf-syntax-ns#have>> 3 5 0.60
<<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>> 2 5 0.40
---------------------------------------------------------
Saved tabels ->2
Unsaved non-empty tables ->0
Empty tables ->0
b. Generate Extended Vertical Partitioning subset SO
$HOME/spark/bin/spark-submit --driver-memory 1g --class runDriver --master yarn --executor-memory 1g --deploy-mode cluster ./datasetcreator_2.10-1.1.jar test3/ test3.nq SO 0.2
==> /tmp/stat_so.txt is created on the server where the job ran
==> Contents of stat_so.txt (cat stat_so.txt; fields are tab-separated)
SO Statistic
---------------------------------------------------------
---------------------------------------------------------
Saved tabels ->0
Unsaved non-empty tables ->0
Empty tables ->4
c. Generate Extended Vertical Partitioning subset OS
$HOME/spark/bin/spark-submit --driver-memory 1g --class runDriver --master yarn --executor-memory 1g --deploy-mode cluster ./datasetcreator_2.10-1.1.jar test3/ test3.nq OS 0.2
==> /tmp/stat_os.txt is created on the server where the job ran
==> Contents of stat_os.txt (cat stat_os.txt; fields are tab-separated)
OS Statistic
---------------------------------------------------------
---------------------------------------------------------
Saved tabels ->0
Unsaved non-empty tables ->0
Empty tables ->4
d. Generate Extended Vertical Partitioning subset SS
$HOME/spark/bin/spark-submit --driver-memory 1g --class runDriver --master yarn --executor-memory 1g --deploy-mode cluster ./datasetcreator_2.10-1.1.jar test3/ test3.nq SS 0.2
==> /tmp/stat_ss.txt is created on the server where the job ran
==> Contents of stat_ss.txt (cat stat_ss.txt; fields are tab-separated)
SS Statistic
---------------------------------------------------------
<<http://www.w3.org/1999/02/22-rdf-syntax-ns#have>><<http://www.w3.org/1999/02/22-rdf-syntax-ns#have>> 3 3 1.00 0.60
<<http://www.w3.org/1999/02/22-rdf-syntax-ns#have>><<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>> 3 3 1.00 0.60
<<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>><<http://www.w3.org/1999/02/22-rdf-syntax-ns#have>> 2 2 1.00 0.40
<<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>><<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>> 2 2 1.00 0.40
---------------------------------------------------------
Saved tabels ->0
Unsaved non-empty tables ->2
Empty tables ->2
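Once all four DataSetCreator runs have finished, the generated tables can be inspected on HDFS. The VP output is stored as Parquet under test3/VP/ (this path shows up in the executor log in step 8); the recursive listing below simply shows whatever was written under the database directory:
# Recursively list everything DataSetCreator wrote under the test3 HDFS directory
hadoop fs -ls -R test3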
5. Collect the statistics files into one folder
Copy the files generated above into the /home/hadoop/S2RDF_work/test3/statistics folder.
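A minimal sketch of the copy, assuming the stat_*.txt files from step 4 are available under /tmp on this machine (with --deploy-mode cluster the driver may have run on another node, in which case they have to be fetched from there first):
# Gather the statistics files produced by DataSetCreator into one folder
mkdir -p /home/hadoop/S2RDF_work/test3/statistics
cp /tmp/stat_vp.txt /tmp/stat_so.txt /tmp/stat_os.txt /tmp/stat_ss.txt /home/hadoop/S2RDF_work/test3/statistics/
After the copy, the folder should look something like this: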
-rw-rw-r--. 1 hadoop hadoop 201 2016-06-16 17:37 stat_os.txt
-rw-rw-r--. 1 hadoop hadoop 201 2016-06-16 17:37 stat_so.txt
-rw-rw-r--. 1 hadoop hadoop 732 2016-06-16 17:38 stat_ss.txt
-rw-rw-r--. 1 hadoop hadoop 354 2016-06-16 17:36 stat_vp.txt
6. Create a file containing the SPARQL query to run.
vi /home/hadoop/S2RDF_work/test3/test3.sparql
Contents: select ?s ?o where {?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?o}
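The same file can also be created non-interactively; a sketch equivalent to the vi step above:
# Write the SPARQL query to the file without opening an editor
cat > /home/hadoop/S2RDF_work/test3/test3.sparql <<'EOF'
select ?s ?o where {?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?o}
EOF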
7. Run QueryTranslator (executed from /home/hadoop/S2RDF_work;
queryTranslator-1.1.0.jar is not the original queryTranslator-1.1.jar shipped with the project but a jar rebuilt after modifying part of the source and recompiling). The -i and -o options appear to be the input SPARQL file and the output name prefix, -sd points to the statistics folder prepared in step 5, and -sUB appears to be the ScaleUB threshold (0.2 here), matching the value passed to DataSetCreator.
java -jar ./queryTranslator-1.1.0.jar -i ./test3/test3.sparql -o ./test3/test3.sparql -sd ./test3/statistics/ -sUB 0.2
==> Running it prints a log like the one below; the log file and the SQL file are created in the same directory as test3.sparql (e.g., /home/hadoop/S2RDF_work/test3/test3.sparql.sql). A sketch for listing the generated files follows the log.
inputFile- =================>./test3/test3.sparql
18:34:25 DEBUG Main :: inputFile-- =================>./test3/test3.sparql
18:34:25 DEBUG JenaIOEnvironment :: Failed to find configuration: location-mapping.ttl;location-mapping.rdf;location-mapping.n3;etc/location-mapping.rdf;etc/location-mapping.n3;etc/location-mapping.ttl
VP STAT Size = 2
OS STAT Size = 0
SO STAT Size = 0
SS STAT Size = 4
THE NUMBER OF ALL SAVED (< ScaleUB) TRIPLES IS -> 5
THE NUMBER OF ALL SAVED (< ScaleUB) TABLES IS -> 2
TABLE-><http__//www.w3.org/1999/02/22-rdf-syntax-ns#type>
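After the translator finishes, the generated files can be listed next to the input query. Only test3.sparql.sql is named explicitly above, so the listing just shows whatever was produced:
# List the files QueryTranslator wrote next to test3.sparql
ls -l ./test3/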
8. Execute the query using the SQL generated in step 7.
a. Edit the /home/hadoop/S2RDF_work/test3/test3.sparql.sql file.
(In >>>>>>TEST3--SO-OS-SS_VP__test3, the markers --, SO, and __ must be present; this check should later be removed from the source so they are no longer required.)
>>>>>>TEST3--SO-OS-SS_VP__test3
SELECT sub AS s , obj AS o
FROM `_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_$$1$$`
++++++Tables Statistic
_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_$$1$$ 0 VP _L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_/
VP <http__//www.w3.org/1999/02/22-rdf-syntax-ns#type> 2
------
b. Run QueryExecutor
$HOME/spark/bin/spark-submit --driver-memory 2g --class runDriver --master yarn --executor-memory 1g --deploy-mode cluster --files ./test3/test3.sparql.sql ./queryexecutor_2.10-1.1.jar test3 test3.sparql.sql
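Since the job runs with --deploy-mode cluster, the stdout shown below has to be pulled from YARN after the application finishes; a typical way to do that (the application id is a placeholder):
# Fetch the aggregated container logs; <applicationId> is a placeholder,
# e.g. taken from `yarn application -list -appStates FINISHED`
yarn logs -applicationId <applicationId>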
--------------------- Printing the YARN application log to verify the data shows the following ------------------
Log Type: stdout
Log Upload Time: 목 6월 16 20:09:59 +0900 2016
Log Length: 2443
queryName ==>TEST3--SO-OS-SS_VP__test3
sqlQuery==>SELECT sub AS s , obj AS o FROM `_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B___1__`
qStat ==>_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B___1__ 0 VP _L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_/ VP <http__//www.w3.org/1999/02/22-rdf-syntax-ns#type> 2 ------
tables==>Map(_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B___1__ -> queryExecutor.Table@2224c8cc)
queryNames======>TEST3--SO-OS-SS_VP__test3
pr-TEST3pf-SO-OS-SS_VP__test3atTEST3
Test TEST3--SO-OS-SS_VP__test3:
tPath=======>_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_/
Load Table _L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B___1__ from test3/VP/_L_http__/www.w3.org/1999/02/22-rdf-syntax-ns#type_B_.parquet->
==_sqlContext.sql result =====================>[sub: string, obj: string]
Cached 2 Elements in 754ms
tPath=======>_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_/
query.query=================>SELECT sub AS s , obj AS o FROM `_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B___1__`
HaLLO
Project [sub#6 AS s#36,obj#7 AS o#37]
InMemoryColumnarTableScan [sub#6,obj#7], [], (InMemoryRelation [sub#6,obj#7], true, 20000, StorageLevel(true, true, false, true, 1), (PhysicalRDD [sub#6,obj#7], MapPartitionsRDD[6] at repartition at DataFrame.scala:907), Some(_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B___1__))
HaLL1
Run query ->
t==>[<http://www.w3.org/2002/07/owl#Thing>,<http://www.w3.org/2000/01/rdf-schema#Resource> .]
t==>[<http://www.w3.org/2002/07/owl#Thing2>,<http://www.w3.org/2000/01/rdf-schema#Resource> .]
colname[0] name ===>s,value===>[s: string]
colname[1] name ===>o,value===>[o: string]
temp.toJSON.toString ============>MapPartitionsRDD[23] at mapPartitions at DataFrame.scala:862
Run query ->
t==>[<http://www.w3.org/2002/07/owl#Thing>,<http://www.w3.org/2000/01/rdf-schema#Resource> .]
t==>[<http://www.w3.org/2002/07/owl#Thing2>,<http://www.w3.org/2000/01/rdf-schema#Resource> .]
colname[0] name ===>s,value===>[s: string]
colname[1] name ===>o,value===>[o: string]
temp.toJSON.toString ============>MapPartitionsRDD[34] at mapPartitions at DataFrame.scala:862
MapPartitionsRDD[38] at mapPartitions at DataFrame.scala:862
results============================>Map()
fileName==>/tmp/./results.txt
line ==>Thu Jun 16 20:10:08 KST 2016
fileName==>/tmp/./resultTimes.txt
line ==>Thu Jun 16 20:10:08 KST 2016
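According to the log above, the executor writes its result and timing files to /tmp on the machine where the driver ran (a cluster node in cluster deploy mode), so they have to be read there; a hedged check:
# Result and timing files as named in the log above (run on the node that hosted the driver)
cat /tmp/results.txt
cat /tmp/resultTimes.txt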