I have a Spark job failing on memory allocation, and in CloudWatch I can also see that HDFS is running out of space fast. I know that memory tuning for a Spark app is not trivial, and I have very little expertise in it.
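Before touching the JVM flags I want to confirm whether the heap is really the problem or whether the cluster is simply running out of HDFS/local disk (e.g. from shuffle spill or logs). A quick sanity check from the master node using the standard Hadoop/YARN CLIs; the application id is just the one from the executor command further down:

Code:
# Cluster-wide HDFS capacity and per-datanode usage
hdfs dfsadmin -report | head -n 20

# Human-readable remaining space at the root of HDFS
hdfs dfs -df -h /

# Aggregated YARN logs for the failing application, filtered for OOM/kill messages
yarn logs -applicationId application_1489430908790_1946 | grep -i "outofmemory\|killed"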
https://databricks.com/blog/2015/05/28/ ... tions.html
I need to see whether the memory settings should be tuned. Below are some recommendations from gceasy.io based on the GC log from the cluster.
Current flags are:
Code:
LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:$LD_LIBRARY_PATH" \
  {{JAVA_HOME}}/bin/java -server \
  -XX:OnOutOfMemoryError='kill %p' \
  -Xms2048m -Xmx2048m \
  '-verbose:gc' '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' \
  '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' \
  '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' \
  '-XX:OnOutOfMemoryError=kill -9 %p' \
  -Djava.io.tmpdir={{PWD}}/tmp \
  '-Dspark.port.maxRetries=32' '-Dspark.history.ui.port=18080' \
  '-Dspark.driver.port=45330' '-Dspark.ui.port=0' \
  -Dspark.yarn.app.container.log.dir=<LOG_DIR> \
  org.apache.spark.executor.CoarseGrainedExecutorBackend \
  --driver-url spark://CoarseGrainedScheduler@1.....:45330 \
  --executor-id 3 \
  --hostname ip-....ec2.internal \
  --cores 4 \
  --app-id application_1489430908790_1946 \
  --user-class-path file:$PWD/__app__.jar \
  1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
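For context, the -Xms2048m/-Xmx2048m pair above means each executor heap is 2 GB, which normally comes from spark.executor.memory. If gceasy.io recommends a bigger heap or different GC settings, the usual way to apply that is through spark-submit rather than by editing the launch command. A rough sketch is below; the class name, jar and concrete values are placeholders, not recommendations, and spark.yarn.executor.memoryOverhead is the pre-Spark-2.3 name of the overhead setting:

Code:
# --executor-memory controls the -Xms/-Xmx seen in the launch command above.
# spark.yarn.executor.memoryOverhead (MB) is the off-heap headroom YARN adds on top of the heap.
# spark.executor.extraJavaOptions carries the GC flags you want the executor JVMs to use.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4g \
  --executor-cores 4 \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  --conf spark.executor.extraJavaOptions="-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps" \
  --class com.example.MyJob \
  my-job.jar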
https://spark-summit.org/2017/events/de ... ory-model/
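If the distinction that talk draws between execution and storage memory turns out to be the issue (rather than total heap size), the unified memory manager also has its own knobs, spark.memory.fraction and spark.memory.storageFraction. The values below are just the Spark 2.x defaults, shown for illustration:

Code:
# spark.memory.fraction: share of (heap - 300MB) usable for execution + storage (default 0.6 in Spark 2.x).
# spark.memory.storageFraction: share of that region protected for cached data (default 0.5).
spark-submit \
  --conf spark.memory.fraction=0.6 \
  --conf spark.memory.storageFraction=0.5 \
  --class com.example.MyJob \
  my-job.jar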