Notes on development and operations for Cloudera CDH/CDP, the Hadoop ecosystem, Semantic IoT, and related technologies. Send questions to gooper@gooper.com.

Spark+S2RDF: procedure for working in a dedicated work folder using five triples of data

Administrator 2016.06.16 20:07 Views: 123

1. Create and enter the work folder (copy the jar files needed for execution into /home/hadoop/S2RDF_work, then create a working folder (e.g., test3) in which the triple data is created and processed)

a. mkdir /home/hadoop/S2RDF_work

b. cd /home/hadoop/S2RDF_work

c. mkdir test3

d. cd test3

2. Create the triple data file (test3.nq)

vi test3.nq

===>

<http://www.w3.org/2002/07/owl#Thing> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Resource> .

<http://www.w3.org/2002/07/owl#Thing> <http://www.w3.org/1999/02/22-rdf-syntax-ns#have> <http://www.w3.org/2000/01/rdf-schema#Resource2> .

<http://www.w3.org/2002/07/owl#Thing2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Resource> .

<http://www.w3.org/2002/07/owl#Thing2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#have> <http://www.w3.org/2000/01/rdf-schema#Resource2> .

<http://www.w3.org/2002/07/owl#Thing2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#have> <http://www.w3.org/2000/01/rdf-schema#Resource3> .
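
Instead of typing the triples into vi, the file can also be written with a here-document. This is only a minimal sketch, assuming the current directory is the test3 work folder; the quoted 'EOF' delimiter keeps the shell from expanding anything inside the IRIs.

```shell
# Write the five triples to test3.nq without opening an editor.
cat > test3.nq <<'EOF'
<http://www.w3.org/2002/07/owl#Thing> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Resource> .
<http://www.w3.org/2002/07/owl#Thing> <http://www.w3.org/1999/02/22-rdf-syntax-ns#have> <http://www.w3.org/2000/01/rdf-schema#Resource2> .
<http://www.w3.org/2002/07/owl#Thing2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Resource> .
<http://www.w3.org/2002/07/owl#Thing2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#have> <http://www.w3.org/2000/01/rdf-schema#Resource2> .
<http://www.w3.org/2002/07/owl#Thing2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#have> <http://www.w3.org/2000/01/rdf-schema#Resource3> .
EOF

# Sanity check: one triple per line, so the count should equal the number of triples.
wc -l < test3.nq
```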

3. Upload to HDFS

a. hadoop fs -mkdir test3

b. hadoop fs -put test3.nq test3

4. Run DataSetCreator (database name: test3; run from /home/hadoop/S2RDF_work; test3.nq is under the test3 folder in HDFS)

a. Generate Vertical Partitioning

$HOME/spark/bin/spark-submit --driver-memory 1g --class runDriver --master yarn --executor-memory 1g --deploy-mode cluster ./datasetcreator_2.10-1.1.jar test3/ test3.nq VP 0.2

==> /tmp/stat_vp.txt is created on the server where the job ran

==> Contents of stat_vp.txt (cat stat_vp.txt; fields are tab-separated)

VP Statistic

---------------------------------------------------------

<http://www.w3.org/1999/02/22-rdf-syntax-ns#have> 3 5 0.60

<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 2 5 0.40

---------------------------------------------------------

Saved tabels ->2

Unsaved non-empty tables ->0

Empty tables ->0
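
The VP numbers can be checked by hand: each VP table holds the triples of one predicate, and the last column is that table's size divided by the total number of triples (3/5 and 2/5 here). The following is a rough sketch with awk, assuming one triple per line with whitespace-separated terms; `vp_stats` is a hypothetical helper name and not part of S2RDF.

```shell
# Print "<predicate> <count> <total> <ratio>" (tab-separated), one line per predicate.
# Hypothetical helper for checking stat_vp.txt by hand; reads triples from stdin.
vp_stats() {
  awk '
    NF >= 4 { count[$2]++; total++ }   # $2 is the predicate IRI; blank lines are skipped
    END {
      for (p in count)
        printf "%s\t%d\t%d\t%.2f\n", p, count[p], total, count[p] / total
    }
  ' | sort
}

# Feed the five triples of test3.nq to the helper.
vp_stats <<'EOF'
<http://www.w3.org/2002/07/owl#Thing> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Resource> .
<http://www.w3.org/2002/07/owl#Thing> <http://www.w3.org/1999/02/22-rdf-syntax-ns#have> <http://www.w3.org/2000/01/rdf-schema#Resource2> .
<http://www.w3.org/2002/07/owl#Thing2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Resource> .
<http://www.w3.org/2002/07/owl#Thing2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#have> <http://www.w3.org/2000/01/rdf-schema#Resource2> .
<http://www.w3.org/2002/07/owl#Thing2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#have> <http://www.w3.org/2000/01/rdf-schema#Resource3> .
EOF
```

Both predicates are well above the 0.2 threshold passed on the command line, yet both VP tables are saved, which suggests the threshold only applies to the extended (SO/OS/SS) tables.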

b. Generate Extended Vertical Partitioning subset SO

$HOME/spark/bin/spark-submit --driver-memory 1g --class runDriver --master yarn --executor-memory 1g --deploy-mode cluster ./datasetcreator_2.10-1.1.jar test3/ test3.nq SO 0.2

==> /tmp/stat_so.txt is created on the server where the job ran

==> Contents of stat_so.txt (cat stat_so.txt; fields are tab-separated)

SO Statistic

---------------------------------------------------------

Saved tabels ->0

Unsaved non-empty tables ->0

Empty tables ->4

c. Generate Extended Vertical Partitioning subset OS

$HOME/spark/bin/spark-submit --driver-memory 1g --class runDriver --master yarn --executor-memory 1g --deploy-mode cluster ./datasetcreator_2.10-1.1.jar test3/ test3.nq OS 0.2

==> /tmp/stat_os.txt is created on the server where the job ran

==> Contents of stat_os.txt (cat stat_os.txt; fields are tab-separated)

OS Statistic

---------------------------------------------------------

Saved tabels ->0

Unsaved non-empty tables ->0

Empty tables ->4

d. Generate Extended Vertical Partitioning subset SS

$HOME/spark/bin/spark-submit --driver-memory 1g --class runDriver --master yarn --executor-memory 1g --deploy-mode cluster ./datasetcreator_2.10-1.1.jar test3/ test3.nq SS 0.2

==> /tmp/stat_ss.txt is created on the server where the job ran

==> Contents of stat_ss.txt (cat stat_ss.txt; fields are tab-separated)

SS Statistic

---------------------------------------------------------

<http://www.w3.org/1999/02/22-rdf-syntax-ns#have><http://www.w3.org/1999/02/22-rdf-syntax-ns#have> 3 3 1.00 0.60

<http://www.w3.org/1999/02/22-rdf-syntax-ns#have><http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 3 3 1.00 0.60

<http://www.w3.org/1999/02/22-rdf-syntax-ns#type><http://www.w3.org/1999/02/22-rdf-syntax-ns#have> 2 2 1.00 0.40

<http://www.w3.org/1999/02/22-rdf-syntax-ns#type><http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 2 2 1.00 0.40

---------------------------------------------------------

Saved tabels ->0

Unsaved non-empty tables ->2

Empty tables ->2
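
The SO, OS and SS results follow from the data itself. No subject IRI (Thing, Thing2) ever appears as an object, so every SO and OS table is empty. For SS, both subjects carry both predicates, so each table keeps all rows of its VP table (selectivity 1.00) and nothing falls below the 0.2 threshold. The sketch below recomputes the SS numbers, assuming the SS subset for a predicate pair (p1, p2) is the subject-subject semi-join of the two VP tables as described for S2RDF's ExtVP; `ss_stats` is a hypothetical helper, and the last column is assumed to be the table size over the total triple count.

```shell
# For every predicate pair (p1, p2), count the p1 triples whose subject also
# occurs as the subject of some p2 triple. Reads triples from stdin.
# Output columns: p1p2, semi-join size, VP size of p1, size/VP size, size/total.
ss_stats() {
  awk '
    NF >= 4 {
      n++
      subj[n] = $1
      pred[n] = $2
      vp[$2]++                    # size of the VP table of this predicate
      has[$2 SUBSEP $1] = 1       # predicate $2 has a triple with subject $1
    }
    END {
      for (p1 in vp)
        for (p2 in vp) {
          size = 0
          for (i = 1; i <= n; i++)
            if (pred[i] == p1 && ((p2 SUBSEP subj[i]) in has))
              size++
          printf "%s%s\t%d\t%d\t%.2f\t%.2f\n", p1, p2, size, vp[p1], size / vp[p1], size / n
        }
    }
  ' | sort
}

ss_stats <<'EOF'
<http://www.w3.org/2002/07/owl#Thing> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Resource> .
<http://www.w3.org/2002/07/owl#Thing> <http://www.w3.org/1999/02/22-rdf-syntax-ns#have> <http://www.w3.org/2000/01/rdf-schema#Resource2> .
<http://www.w3.org/2002/07/owl#Thing2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Resource> .
<http://www.w3.org/2002/07/owl#Thing2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#have> <http://www.w3.org/2000/01/rdf-schema#Resource2> .
<http://www.w3.org/2002/07/owl#Thing2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#have> <http://www.w3.org/2000/01/rdf-schema#Resource3> .
EOF
```

To get a table that is actually saved, the data would need a predicate pair whose semi-join keeps fewer than 20% of the rows, which is hard to achieve with only five triples.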

5. Collect the statistics files into one folder

Copy the files generated above into the /home/hadoop/S2RDF_work/test3/statistics folder.

-rw-rw-r--. 1 hadoop hadoop 201 2016-06-16 17:37 stat_os.txt

-rw-rw-r--. 1 hadoop hadoop 201 2016-06-16 17:37 stat_so.txt

-rw-rw-r--. 1 hadoop hadoop 732 2016-06-16 17:38 stat_ss.txt

-rw-rw-r--. 1 hadoop hadoop 354 2016-06-16 17:36 stat_vp.txt
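
The copy step can be wrapped in a small function. This is a sketch only: `collect_stats` is a hypothetical helper, and it assumes the four stat files are already reachable on the local machine. With --deploy-mode cluster they are written on whichever node ran the job, so they may first have to be fetched from that node.

```shell
# Copy stat_vp/so/os/ss.txt from a source directory into <workdir>/statistics.
# Hypothetical helper; warns on stderr about any file that is missing.
collect_stats() {
  src_dir="$1"
  dest_dir="$2/statistics"
  mkdir -p "$dest_dir" || return 1
  for mode in vp so os ss; do
    if [ -f "$src_dir/stat_$mode.txt" ]; then
      cp "$src_dir/stat_$mode.txt" "$dest_dir/"
    else
      echo "missing: $src_dir/stat_$mode.txt" >&2
    fi
  done
}

# Example call for the layout used in this post:
# collect_stats /tmp /home/hadoop/S2RDF_work/test3
```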

6. Create a file containing the SPARQL query to run.

vi /home/hadoop/S2RDF_work/test3/test3.sparql

Contents: select ?s ?o where {?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?o}

7. Run QueryTranslator (run from /home/hadoop/S2RDF_work;

queryTranslator-1.1.0.jar is not the queryTranslator-1.1.jar shipped with the original distribution. It was built by modifying part of the source, recompiling, and repackaging it as a jar)

java -jar ./queryTranslator-1.1.0.jar -i ./test3/test3.sparql -o ./test3/test3.sparql -sd ./test3/statistics/ -sUB 0.2

==> Running it prints a log like the one below; the log file and the SQL file are created in the same place as test3.sparql (e.g., /home/hadoop/S2RDF_work/test3/test3.sparql.sql)

inputFile- =================>./test3/test3.sparql

18:34:25 DEBUG Main :: inputFile-- =================>./test3/test3.sparql

18:34:25 DEBUG JenaIOEnvironment :: Failed to find configuration: location-mapping.ttl;location-mapping.rdf;location-mapping.n3;etc/location-mapping.rdf;etc/location-mapping.n3;etc/location-mapping.ttl

VP STAT Size = 2

OS STAT Size = 0

SO STAT Size = 0

SS STAT Size = 4

THE NUMBER OF ALL SAVED (< ScaleUB) TRIPLES IS -> 5

THE NUMBER OF ALL SAVED (< ScaleUB) TABLES IS -> 2

TABLE-><http__//www.w3.org/1999/02/22-rdf-syntax-ns#type>

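8. Run the query using the SQL generated in step 7.

a. Edit the file /home/hadoop/S2RDF_work/test3/test3.sparql.sql.

(In >>>>>>TEST3--SO-OS-SS_VP__test3, the tokens --, SO, and __ must all be present. This check should eventually be removed from the source so that it is no longer required.)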

>>>>>>TEST3--SO-OS-SS_VP__test3

SELECT sub AS s , obj AS o

FROM `_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_$$1$$`

++++++Tables Statistic

_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_$$1$$ 0 VP _L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_/

VP <http__//www.w3.org/1999/02/22-rdf-syntax-ns#type> 2

------
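
Judging from the SQL above, the table name appears to be derived from the predicate IRI by replacing `<` with `_L_`, `>` with `_B_`, and `:` with `__`. This mapping is inferred from the output shown in this post, not taken from the S2RDF source, and `escape_predicate` is a hypothetical helper name. A small sketch for building such names when editing the SQL by hand:

```shell
# Turn a predicate IRI into the table-name form seen in test3.sparql.sql.
# Hypothetical helper; the three substitutions are inferred from the output above.
escape_predicate() {
  printf '%s\n' "$1" | sed -e 's/</_L_/g' -e 's/>/_B_/g' -e 's/:/__/g'
}

escape_predicate '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>'
```

For the rdf:type predicate this prints the same `_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_` prefix that appears in the FROM clause, before the `$$1$$` suffix.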

b. Run QueryExecutor

$HOME/spark/bin/spark-submit --driver-memory 2g --class runDriver --master yarn --executor-memory 1g --deploy-mode cluster --files ./test3/test3.sparql.sql ./queryexecutor_2.10-1.1.jar test3 test3.sparql.sql

--------------------- Logging from the YARN application to check the data shows the following ------------------

Log Type: stdout

Log Upload Time: Thu Jun 16 20:09:59 +0900 2016

Log Length: 2443

queryName ==>TEST3--SO-OS-SS_VP__test3
sqlQuery==>SELECT sub AS s , obj AS o 
	 FROM `_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B___1__`
	
	

qStat ==>_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B___1__	0	VP	_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_/
	VP	<http__//www.w3.org/1999/02/22-rdf-syntax-ns#type>	2
------

tables==>Map(_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B___1__ -> queryExecutor.Table@2224c8cc)
queryNames======>TEST3--SO-OS-SS_VP__test3
pr-TEST3pf-SO-OS-SS_VP__test3atTEST3
Test TEST3--SO-OS-SS_VP__test3:
tPath=======>_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_/
	Load Table _L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B___1__ from test3/VP/_L_http__/www.w3.org/1999/02/22-rdf-syntax-ns#type_B_.parquet-> 
==_sqlContext.sql result =====================>[sub: string, obj: string]
		Cached 2 Elements in 754ms
tPath=======>_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_/
query.query=================>SELECT sub AS s , obj AS o 
	 FROM `_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B___1__`
	
	

HaLLO
Project [sub#6 AS s#36,obj#7 AS o#37]
 InMemoryColumnarTableScan [sub#6,obj#7], [], (InMemoryRelation [sub#6,obj#7], true, 20000, StorageLevel(true, true, false, true, 1), (PhysicalRDD [sub#6,obj#7], MapPartitionsRDD[6] at repartition at DataFrame.scala:907), Some(_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B___1__))

HaLL1

	 Run query -> 
t==>[<http://www.w3.org/2002/07/owl#Thing>,<http://www.w3.org/2000/01/rdf-schema#Resource> .]
t==>[<http://www.w3.org/2002/07/owl#Thing2>,<http://www.w3.org/2000/01/rdf-schema#Resource> .]
colname[0] name ===>s,value===>[s: string]
colname[1] name ===>o,value===>[o: string]
temp.toJSON.toString ============>MapPartitionsRDD[23] at mapPartitions at DataFrame.scala:862

	 Run query -> 
t==>[<http://www.w3.org/2002/07/owl#Thing>,<http://www.w3.org/2000/01/rdf-schema#Resource> .]
t==>[<http://www.w3.org/2002/07/owl#Thing2>,<http://www.w3.org/2000/01/rdf-schema#Resource> .]
colname[0] name ===>s,value===>[s: string]
colname[1] name ===>o,value===>[o: string]
temp.toJSON.toString ============>MapPartitionsRDD[34] at mapPartitions at DataFrame.scala:862
MapPartitionsRDD[38] at mapPartitions at DataFrame.scala:862
results============================>Map()
fileName==>/tmp/./results.txt
line ==>Thu Jun 16 20:10:08 KST 2016
fileName==>/tmp/./resultTimes.txt
line ==>Thu Jun 16 20:10:08 KST 2016
