Notes on developing and operating Cloudera CDH/CDP, the Hadoop ecosystem, Semantic IoT, and related technologies. Please send inquiries to gooper@gooper.com.

Spark+S2RDF: procedure for working in a dedicated working folder using five triples of data

Site administrator 2016.06.16 20:07 Views: 2079

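1. Create and enter the working folder (copy the jar files needed for execution into /home/hadoop/S2RDF_work, then create a working folder (e.g., test3) in which the triple data is created and processed)

a. mkdir /home/hadoop/S2RDF_work

b. cd /home/hadoop/S2RDF_work

c. mkdir test3

d. cd test3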
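
The four commands above can be collapsed into one small helper. This is only a minimal sketch: the function name `make_workdir` is hypothetical (it is not part of S2RDF), and `mkdir -p` is used so that rerunning the step does not fail when the folders already exist.

```shell
#!/bin/sh
# Hypothetical helper: create <base>/<name> if it is missing and move into it.
make_workdir() {
    base="$1"   # e.g. /home/hadoop/S2RDF_work
    name="$2"   # e.g. test3
    mkdir -p "$base/$name" && cd "$base/$name"
}
```

Calling `make_workdir /home/hadoop/S2RDF_work test3` should leave the shell in the test3 folder, the same state as after step d.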

2. Create the triple data file (test3.nq)

vi test3.nq

===>

<http://www.w3.org/2002/07/owl#Thing> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Resource> .

<http://www.w3.org/2002/07/owl#Thing> <http://www.w3.org/1999/02/22-rdf-syntax-ns#have> <http://www.w3.org/2000/01/rdf-schema#Resource2> .

<http://www.w3.org/2002/07/owl#Thing2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Resource> .

<http://www.w3.org/2002/07/owl#Thing2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#have> <http://www.w3.org/2000/01/rdf-schema#Resource2> .

<http://www.w3.org/2002/07/owl#Thing2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#have> <http://www.w3.org/2000/01/rdf-schema#Resource3> .
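
Before uploading, it can help to sanity-check the file. The sketch below assumes that every statement sits on one line and consists of three whitespace-separated IRI terms followed by a terminating `.`, as in the sample above. It would reject literals containing spaces or statements with a fourth graph term. The helper name `count_triples` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the number of well-formed lines in a triple file.
# Returns non-zero if any non-blank line is not "subject predicate object ."
count_triples() {
    awk '
        NF == 0 { next }                      # skip blank lines
        NF == 4 && $4 == "." { ok++; next }   # three terms plus the final dot
        {
            printf "line %d is not a simple triple: %s\n", NR, $0 > "/dev/stderr"
            bad++
        }
        END { print ok + 0; exit (bad > 0) }
    ' "$1"
}
```

For the file above, `count_triples test3.nq` should print 5, which matches the total that the later statistics report.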

3. Upload to HDFS

a. hadoop fs -mkdir test3

b. hadoop fs -put test3.nq test3

4. Run DataSetCreator (database name: test3; run from /home/hadoop/S2RDF_work; test3.nq is located under the test3 folder in HDFS)

a. Generate Vertical Partitioning

$HOME/spark/bin/spark-submit --driver-memory 1g --class runDriver --master yarn --executor-memory 1g --deploy-mode cluster ./datasetcreator_2.10-1.1.jar test3/ test3.nq VP 0.2

==> /tmp/stat_vp.txt is created on the server where the job ran

==> Contents of stat_vp.txt (cat stat_vp.txt; fields are tab-separated)

VP Statistic

---------------------------------------------------------

<http://www.w3.org/1999/02/22-rdf-syntax-ns#have> 3 5 0.60

<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 2 5 0.40

---------------------------------------------------------

Saved tabels ->2

Unsaved non-empty tables ->0

Empty tables ->0

b. Generate Extended Vertical Partitioning subset SO

$HOME/spark/bin/spark-submit --driver-memory 1g --class runDriver --master yarn --executor-memory 1g --deploy-mode cluster ./datasetcreator_2.10-1.1.jar test3/ test3.nq SO 0.2

==> /tmp/stat_so.txt is created on the server where the job ran

==> Contents of stat_so.txt (cat stat_so.txt; fields are tab-separated)

SO Statistic

---------------------------------------------------------

Saved tabels ->0

Unsaved non-empty tables ->0

Empty tables ->4

c. Generate Extended Vertical Partitioning subset OS

$HOME/spark/bin/spark-submit --driver-memory 1g --class runDriver --master yarn --executor-memory 1g --deploy-mode cluster ./datasetcreator_2.10-1.1.jar test3/ test3.nq OS 0.2

==> /tmp/stat_os.txt is created on the server where the job ran

==> Contents of stat_os.txt (cat stat_os.txt; fields are tab-separated)

OS Statistic

---------------------------------------------------------

Saved tabels ->0

Unsaved non-empty tables ->0

Empty tables ->4

d. Generate Extended Vertical Partitioning subset SS

$HOME/spark/bin/spark-submit --driver-memory 1g --class runDriver --master yarn --executor-memory 1g --deploy-mode cluster ./datasetcreator_2.10-1.1.jar test3/ test3.nq SS 0.2

==> /tmp/stat_ss.txt is created on the server where the job ran

==> Contents of stat_ss.txt (cat stat_ss.txt; fields are tab-separated)

SS Statistic

---------------------------------------------------------

<http://www.w3.org/1999/02/22-rdf-syntax-ns#have><http://www.w3.org/1999/02/22-rdf-syntax-ns#have> 3 3 1.00 0.60

<http://www.w3.org/1999/02/22-rdf-syntax-ns#have><http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 3 3 1.00 0.60

<http://www.w3.org/1999/02/22-rdf-syntax-ns#type><http://www.w3.org/1999/02/22-rdf-syntax-ns#have> 2 2 1.00 0.40

<http://www.w3.org/1999/02/22-rdf-syntax-ns#type><http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 2 2 1.00 0.40

---------------------------------------------------------

Saved tabels ->0

Unsaved non-empty tables ->2

Empty tables ->2
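
The four runs in step 4 differ only in the partitioning mode (VP, SO, OS, SS), so they can be generated from one loop. The sketch below reuses the spark-submit options exactly as written above and only prints the command lines. The function names `creator_cmd` and `creator_cmds` are hypothetical and not part of S2RDF.

```shell
#!/bin/sh
# Hypothetical helper: print the spark-submit command line for one mode.
creator_cmd() {
    db="$1"; input="$2"; mode="$3"; scale="$4"
    # \$HOME is printed literally so that it is expanded when the line is run.
    printf '%s\n' "\$HOME/spark/bin/spark-submit --driver-memory 1g --class runDriver --master yarn --executor-memory 1g --deploy-mode cluster ./datasetcreator_2.10-1.1.jar $db/ $input $mode $scale"
}

# Print the commands for all four modes in the same order as steps a to d.
creator_cmds() {
    for m in VP SO OS SS; do
        creator_cmd "$1" "$2" "$m" "$3"
    done
}
```

Running `creator_cmds test3 test3.nq 0.2` from /home/hadoop/S2RDF_work prints four lines that can be reviewed first and then piped to `sh` to execute them one after another.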

5. Gather the statistics files into one folder

Copy the files generated above into the /home/hadoop/S2RDF_work/test3/statistics folder.

-rw-rw-r--. 1 hadoop hadoop 201 2016-06-16 17:37 stat_os.txt

-rw-rw-r--. 1 hadoop hadoop 201 2016-06-16 17:37 stat_so.txt

-rw-rw-r--. 1 hadoop hadoop 732 2016-06-16 17:38 stat_ss.txt

-rw-rw-r--. 1 hadoop hadoop 354 2016-06-16 17:36 stat_vp.txt
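
The copy step can be scripted so that a missing statistics file is noticed immediately. This is a sketch under the assumption that all four files are reachable under one source directory on the current machine. With `--deploy-mode cluster` the driver may run on different nodes, in which case the files have to be fetched from each node first. The helper name `collect_stats` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: copy stat_vp/so/os/ss.txt from <src> into <dest>.
# Returns non-zero if any of the four files is missing or cannot be copied.
collect_stats() {
    src="$1"    # directory holding the stat files, e.g. /tmp
    dest="$2"   # e.g. /home/hadoop/S2RDF_work/test3/statistics
    rc=0
    mkdir -p "$dest" || return 1
    for kind in vp so os ss; do
        if [ -f "$src/stat_$kind.txt" ]; then
            cp "$src/stat_$kind.txt" "$dest/" || rc=1
        else
            echo "missing: $src/stat_$kind.txt" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

A typical call would be `collect_stats /tmp /home/hadoop/S2RDF_work/test3/statistics`.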

6. Create a file containing the SPARQL query to run.

vi /home/hadoop/S2RDF_work/test3/test3.sparql

Contents: select ?s ?o where {?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?o}

7. Run QueryTranslator (run from /home/hadoop/S2RDF_work; queryTranslator-1.1.0.jar is not the queryTranslator-1.1.jar shipped with the original distribution but a jar rebuilt after modifying and recompiling part of the source)

java -jar ./queryTranslator-1.1.0.jar -i ./test3/test3.sparql -o ./test3/test3.sparql -sd ./test3/statistics/ -sUB 0.2

==> Running it prints the log below; the log file and the SQL file are created in the same location as test3.sparql (e.g., /home/hadoop/S2RDF_work/test3/test3.sparql.sql)

inputFile- =================>./test3/test3.sparql

18:34:25 DEBUG Main :: inputFile-- =================>./test3/test3.sparql

18:34:25 DEBUG JenaIOEnvironment :: Failed to find configuration: location-mapping.ttl;location-mapping.rdf;location-mapping.n3;etc/location-mapping.rdf;etc/location-mapping.n3;etc/location-mapping.ttl

VP STAT Size = 2

OS STAT Size = 0

SO STAT Size = 0

SS STAT Size = 4

THE NUMBER OF ALL SAVED (< ScaleUB) TRIPLES IS -> 5

THE NUMBER OF ALL SAVED (< ScaleUB) TABLES IS -> 2

TABLE-><http__//www.w3.org/1999/02/22-rdf-syntax-ns#type>
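
The counts reported by the translator (two VP tables, five triples) and the rows in stat_vp.txt can be cross-checked against the input file. The sketch below assumes that each stat_vp.txt row means predicate, number of triples with that predicate, total number of triples, and their ratio. This reading is inferred from the sample numbers (3 of 5 = 0.60, 2 of 5 = 0.40), not taken from S2RDF documentation. The helper name `vp_stats` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: recompute per-predicate counts from a local triple file.
# Output columns (tab-separated): predicate, count, total, count/total.
vp_stats() {
    awk '
        NF >= 3 { count[$2]++; total++ }   # field 2 is the predicate
        END {
            for (p in count)
                printf "%s\t%d\t%d\t%.2f\n", p, count[p], total, count[p] / total
        }
    ' "$1" | sort
}
```

Run against the local test3.nq, this should print one row per predicate, with 3, 5, 0.60 for `have` and 2, 5, 0.40 for `type`.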

8. Run the query using the SQL generated in step 7.

a. Edit the /home/hadoop/S2RDF_work/test3/test3.sparql.sql file.

(The header line >>>>>>TEST3--SO-OS-SS_VP__test3 must contain --, SO, and __. This check should eventually be removed from the source.)

>>>>>>TEST3--SO-OS-SS_VP__test3

SELECT sub AS s , obj AS o

FROM `_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_$$1$$`

++++++Tables Statistic

_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_$$1$$ 0 VP _L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_/

VP <http__//www.w3.org/1999/02/22-rdf-syntax-ns#type> 2

------

b. Run QueryExecutor

$HOME/spark/bin/spark-submit --driver-memory 2g --class runDriver --master yarn --executor-memory 1g --deploy-mode cluster --files ./test3/test3.sparql.sql ./queryexecutor_2.10-1.1.jar test3 test3.sparql.sql

--------------------- Log output printed in the YARN application to verify the data ------------------

Log Type: stdout

Log Upload Time: Thu Jun 16 20:09:59 +0900 2016

Log Length: 2443

queryName ==>TEST3--SO-OS-SS_VP__test3
sqlQuery==>SELECT sub AS s , obj AS o 
	 FROM `_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B___1__`
	
	

qStat ==>_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B___1__	0	VP	_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_/
	VP	<http__//www.w3.org/1999/02/22-rdf-syntax-ns#type>	2
------

tables==>Map(_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B___1__ -> queryExecutor.Table@2224c8cc)
queryNames======>TEST3--SO-OS-SS_VP__test3
pr-TEST3pf-SO-OS-SS_VP__test3atTEST3
Test TEST3--SO-OS-SS_VP__test3:
tPath=======>_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_/
	Load Table _L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B___1__ from test3/VP/_L_http__/www.w3.org/1999/02/22-rdf-syntax-ns#type_B_.parquet-> 
==_sqlContext.sql result =====================>[sub: string, obj: string]
		Cached 2 Elements in 754ms
tPath=======>_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B_/
query.query=================>SELECT sub AS s , obj AS o 
	 FROM `_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B___1__`
	
	

HaLLO
Project [sub#6 AS s#36,obj#7 AS o#37]
 InMemoryColumnarTableScan [sub#6,obj#7], [], (InMemoryRelation [sub#6,obj#7], true, 20000, StorageLevel(true, true, false, true, 1), (PhysicalRDD [sub#6,obj#7], MapPartitionsRDD[6] at repartition at DataFrame.scala:907), Some(_L_http__//www.w3.org/1999/02/22-rdf-syntax-ns#type_B___1__))

HaLL1

	 Run query -> 
t==>[<http://www.w3.org/2002/07/owl#Thing>,<http://www.w3.org/2000/01/rdf-schema#Resource> .]
t==>[<http://www.w3.org/2002/07/owl#Thing2>,<http://www.w3.org/2000/01/rdf-schema#Resource> .]
colname[0] name ===>s,value===>[s: string]
colname[1] name ===>o,value===>[o: string]
temp.toJSON.toString ============>MapPartitionsRDD[23] at mapPartitions at DataFrame.scala:862

	 Run query -> 
t==>[<http://www.w3.org/2002/07/owl#Thing>,<http://www.w3.org/2000/01/rdf-schema#Resource> .]
t==>[<http://www.w3.org/2002/07/owl#Thing2>,<http://www.w3.org/2000/01/rdf-schema#Resource> .]
colname[0] name ===>s,value===>[s: string]
colname[1] name ===>o,value===>[o: string]
temp.toJSON.toString ============>MapPartitionsRDD[34] at mapPartitions at DataFrame.scala:862
MapPartitionsRDD[38] at mapPartitions at DataFrame.scala:862
results============================>Map()
fileName==>/tmp/./results.txt
line ==>Thu Jun 16 20:10:08 KST 2016
fileName==>/tmp/./resultTimes.txt
line ==>Thu Jun 16 20:10:08 KST 2016
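
Step 8a requires the query name on the `>>>>>>` header line to contain `--`, `SO`, and `__`. A quick pre-check before submitting the job can catch a malformed header without waiting for a YARN run. The sketch below only mirrors the rule as stated in this post, not the executor's actual parsing, and the helper name `check_query_header` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: verify the query name in the first ">>>>>>" line.
# Prints the name on success; returns non-zero if a required token is missing.
check_query_header() {
    header=$(sed -n '/^>>>>>>/{p;q;}' "$1")
    if [ -z "$header" ]; then
        echo "no >>>>>> header line in $1" >&2
        return 1
    fi
    name=${header#'>>>>>>'}           # strip the leading marker
    for token in -- SO __; do
        case "$name" in
            *"$token"*) ;;            # token present, keep checking
            *) echo "query name '$name' lacks '$token'" >&2; return 1 ;;
        esac
    done
    printf '%s\n' "$name"
}
```

For the file edited in step 8a, `check_query_header ./test3/test3.sparql.sql` should print TEST3--SO-OS-SS_VP__test3.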
