메뉴 건너뛰기

Cloudera, BigData, Semantic IoT, Hadoop, NoSQL

Cloudera CDH/CDP 및 Hadoop EcoSystem, Semantic IoT등의 개발/운영 기술을 정리합니다. gooper@gooper.com로 문의 주세요.


1. 상황

 가. zookeeper의 zookeeper_trace.log파일이 너무커서 강제로 cat /dev/null > zookeeper_trace.log로 0로 만들고 zkServer.sh로 재기동 했더니 hadoop cluster전체가 이상하게 행동하고 journalnode데몬이 다운되고 이어서 namenode가 다운되는 등의 문제가 발생함.

 나. 많은 WARN발생후에 journalnode데몬이 기동되더라고 namenode가 비 정상적으로 작동하다가 다운되는 문제가 발생함.

 다. 이에 따른 영향으로 hadoop cluster나 zookeeper를 사용하는 각 데몬들이 문제가 발생하면서 다운되는 문제가 발생함.


2. 원인

journal node가 사용하는 editLog가 노드가 sync되지 않고 다른 경우에 발생하는 문제임


3. 조치 

아래와 같은 오류가 발생하면 문제없이 journalnode데몬이 수행되는 곳의 edit log(edits_로 시작되는 파일만)들을 journal node로 사용하는 모든 서버에 복사하고 다시 기동시켜준다.(특히 edits_inprogress_0000000000005681277.empty파일이 포함되어야 한다.)

(예, scp -P 10022 /data/hadoop/journal/data/mycluster/current/edits_* root@gsda2:/data/hadoop/journal/data/mycluster/current)



-------------nodename기동시 오류(INFO)내용---------------

2017-09-14 15:10:33,450 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 73044 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:34,452 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 74046 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:35,453 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 75047 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:36,454 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 76048 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:37,455 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 77049 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:38,456 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 78050 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:39,457 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 79051 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:40,458 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 80052 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:41,459 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 81053 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:42,460 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 82054 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:43,462 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 83056 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:44,463 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 84057 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:45,464 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 85058 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:46,465 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 86059 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:47,466 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 87060 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:48,467 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 88061 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:49,468 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 89062 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:50,469 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 90063 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:51,470 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 91064 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:52,471 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 92065 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:53,472 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 93066 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:54,473 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 94067 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:55,474 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 95068 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:56,475 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 96070 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]



--------------journalnode 데몬 기동시 WARNING내용(editLog파일 각각 유사한 WARN이 발생함)---------

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: After resync, position is 1036288

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /data/hadoop/journal/data/mycluster/current/edits_inprogress_0000000000004738264 while determining its valid length. Position was 1036288

java.io.IOException: Can't scan a pre-transactional edit log.

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4924)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.scanEditLog(FSEditLogLoader.java:1147)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:329)

        at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:548)

        at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:195)

        at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:155)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:93)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:102)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:190)

        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:224)

        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)

        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)

        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)

        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845)

        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:788)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:422)

        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2455)

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: After resync, position is 1036288

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /data/hadoop/journal/data/mycluster/current/edits_inprogress_0000000000004738264 while determining its valid length. Position was 1036288

java.io.IOException: Can't scan a pre-transactional edit log.

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4924)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.scanEditLog(FSEditLogLoader.java:1147)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:329)

        at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:548)

        at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:195)

        at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:155)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:93)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:102)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:190)

        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:224)

        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)

        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)

        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)

        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845)

        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:788)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:422)

        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2455)

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: After resync, position is 1036288

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /data/hadoop/journal/data/mycluster/current/edits_inprogress_0000000000004738264 while determining its valid length. Position was 1036288

java.io.IOException: Can't scan a pre-transactional edit log.

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4924)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)

번호 제목 날짜 조회 수
301 Ubuntu 16.04 LTS에 Hive 2.1.1설치하면서 "Version information not found in metastore"발생하는 오류원인및 조치사항 2017.05.03 567
300 [oneM2M]Ontologies used for oneM2M 2017.08.02 568
299 [우분투] suppoie 채굴 프로세스 발생시 자동으로 삭제하는 shell프로그램 2018.04.01 568
298 Cloudera의 API를 이용하여 impala의 실행되었던 쿼리 확인하는 예시 2018.05.03 568
297 kafka 0.9.0.1 for scala 2.1.1 설치및 테스트 2016.05.02 569
296 kudu 테이블 metadata강제 삭제시 발생하는 오류 메세지 2022.01.12 571
295 Impala daemon기동시 "Could not create temporary timezone file"오류 발생시 조치사항 2018.03.29 575
294 데이타 제공 사이트 링크 2014.08.03 576
293 lagom에서 제공하는 초기 생성기능을 이용하여 생성한 프로젝트의 소스 파악 2018.01.16 576
292 impala,hive및 hdfs만 접근가능하고 파일을 이용한 테이블생성가능하도록 hue 권한설정설정 2018.09.17 576
291 Incompatible clusterIDs오류 원인및 해결방법 2016.04.01 577
290 [kudu]테이블 drop이 안되고 timeout이 걸리는 경우 조치 방법 2020.06.08 579
289 spark-sql실행시 The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH오류 발생시 조치사항 2016.06.09 581
288 Eclipse실행시 Java was started but returned exit code=1이라는 오류가 발생할때 조치방법 2016.11.07 581
287 CDH에서 Sentry 개념및 설정 file 2018.06.21 582
286 시스템날짜를 현재 정보로 동기화 하는 방법(rdate, ntpdate이용) 2014.08.24 583
285 Windows7 64bit 환경에서 Apache Spark 2.2.0 설치하기 2017.07.26 583
284 It is indirectly referenced from required .class files 오류 발생시 조치방법 2017.03.09 587
283 TLS/SSl설정시 방법및 참고 사항 2021.10.08 592
282 CDH 5.14.2 설치중 agent설치에서 실패하는 경우 확인/조치 2018.05.22 593
위로