
Compiling Spark yourself to get spark-sql working on CDH 6.3.2

At first this looked simple: some articles suggest just dropping in the prebuilt Apache 2.4.0 package as a replacement, but after a long struggle spark-sql still refused to work.

So I resolved to compile it myself, following guides found online.

Software versions: JDK 1.8, Maven 3.6.3, Scala 2.11.12, Spark 3.1.2

1. Download the software

wget  http://distfiles.macports.org/scala2.11/scala-2.11.12.tgz
wget  https://archive.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
wget  https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2.tgz

Place the tarballs in /opt and extract them all (sketched below), then set the environment variables for the JDK, Scala, and Maven:
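A minimal extraction sketch, assuming all three tarballs were downloaded into /opt:

cd /opt
# unpack each tool next to the others so the paths below line up
tar -xzf scala-2.11.12.tgz
tar -xzf apache-maven-3.6.3-bin.tar.gz
tar -xzf spark-3.1.2.tgz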

#####java#####
export JAVA_HOME=/usr/java/jdk1.8.0_181-cloudera
export JRE_HOME=$JAVA_HOME/jre
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib

######maven#######
export PATH=/opt/apache-maven-3.6.3/bin:$PATH

####scala#####
export SCALA_HOME=/opt/scala-2.11.12
export PATH=${SCALA_HOME}/bin:$PATH
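Before building, it is worth confirming the toolchain resolves correctly; a quick check, assuming the exports above went into your shell profile:

# reload the profile, then confirm each tool reports the version installed above
source /etc/profile
java -version
mvn -version
scala -version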

2. Compile Spark 3

Edit Spark 3's pom file, /opt/spark-3.1.2/pom.xml, to add the Cloudera Maven repository:

<repositories>
  <repository>
    <id>central</id>
    <!-- This should be at top, it makes maven try the central repo first and then others and hence faster dep resolution -->
    <name>Maven Repository</name>
    <url>https://repo1.maven.org/maven2</url>
    <releases>
      <enabled>true</enabled>
    </releases>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
  </repository>
  <repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>

Change the Hadoop version in the pom file

The sources default to Hadoop 3.2; change the hadoop.version property to 3.0.0-cdh6.3.2, as sketched below.
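The edit can be done with sed, assuming the stock Spark 3.1.2 default of <hadoop.version>3.2.0</hadoop.version> (check your pom first):

# back up the pom, then swap the hadoop.version property in place
cp /opt/spark-3.1.2/pom.xml /opt/spark-3.1.2/pom.xml.bak
sed -i 's#<hadoop.version>3.2.0</hadoop.version>#<hadoop.version>3.0.0-cdh6.3.2</hadoop.version>#' /opt/spark-3.1.2/pom.xml

The -Dhadoop.version flag passed to the build command below overrides this property anyway, so the pom edit is mainly a safeguard.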

Note 2: Maven needs enough memory for the build. If you compile with Maven directly, set the Maven memory options first; if you use make-distribution.sh, set them inside the /opt/spark-3.1.2/dev/make-distribution.sh script:

For this build, Xmx was set to 4 GB and ReservedCodeCacheSize to 2 GB; with less, the compile kept failing:

export MAVEN_OPTS="-Xmx4g -XX:ReservedCodeCacheSize=2g"

Note 3: dev/change-scala-version.sh switches the source tree to a different supported Scala version. The 2.10 and 2.11 switches circulating in older guides apply to Spark 2.x sources; Spark 3.1.2 builds against Scala 2.12 by default and no longer supports 2.10 or 2.11, so with these sources you would only run the script to move to another version the branch supports, e.g.:

# cd  /opt/spark-3.1.2
# ./dev/change-scala-version.sh 2.12

Note 4:

The recommended build command is:

./dev/make-distribution.sh \
--name 3.0.0-cdh6.3.2 --tgz  -Pyarn -Phadoop-3.0 \
-Phive -Phive-thriftserver -Dhadoop.version=3.0.0-cdh6.3.2 -X

This uses Spark's make-distribution.sh script, which itself drives the build through Maven.

  • --tgz packages the build as a .tgz archive
  • --name sets the suffix of the generated tarball's name; here it is our Hadoop version, which is why the tarball below ends in 3.0.0-cdh6.3.2 (see the make-distribution.sh source)
  • -Pyarn enables YARN support
  • -Phive and -Phive-thriftserver compile in Hive support and the Thrift server that spark-sql relies on
  • -Dhadoop.version=3.0.0-cdh6.3.2 pins the Hadoop version; note that in the stock Spark 3.1.2 pom the Hadoop 3 profile is named hadoop-3.2, so -Phadoop-3.0 likely only triggers a "profile not found" warning and the -D override does the real work (a way to check is sketched below)
  • -X enables Maven debug output
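To confirm which Hadoop profiles your source tree actually defines (the name has changed across Spark releases), Maven's help plugin can list them; this walks every module, so expect verbose output:

cd /opt/spark-3.1.2
# list all profiles and keep only the hadoop ones
./build/mvn help:all-profiles | grep -i 'Profile Id: hadoop'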

A build error you may hit:

/root/spark-3.1.2/build/mvn: line 212:  6877 Killed               "${MVN_BIN}" -DzincPort=${ZINC_PORT} "$@"

Fix:

Edit ./dev/make-distribution.sh so that MVN points at the Maven installed on the system instead of the bundled copy:

# cat make-distribution.sh
# Figure out where the Spark framework is installed
SPARK_HOME="$(cd "`dirname "$0"`/.."; pwd)"
DISTDIR="$SPARK_HOME/dist"

MAKE_TGZ=false
MAKE_PIP=false
MAKE_R=false
NAME=none
#MVN="$SPARK_HOME/build/mvn"
MVN="/opt/apache-maven-3.6.3/bin/mvn"

 

The build takes quite a while. Once it succeeds, the packaged Spark tarball in the source root is:

spark-3.1.2-bin-3.0.0-cdh6.3.2.tgz

3. Deploy

tar zxvf spark-3.1.2-bin-3.0.0-cdh6.3.2.tgz -C /opt
mv /opt/spark-3.1.2-bin-3.0.0-cdh6.3.2 /opt/cloudera/parcels/CDH/lib/spark3

Copy the CDH cluster's spark-env.sh into /opt/cloudera/parcels/CDH/lib/spark3/conf:

cp /etc/spark/conf/spark-env.sh  /opt/cloudera/parcels/CDH/lib/spark3/conf

Then adjust SPARK_HOME in it:

[root@master1 conf]# cat spark-env.sh
#!/usr/bin/env bash
##
# Generated by Cloudera Manager and should not be modified directly
##

SELF="$(cd $(dirname $BASH_SOURCE) && pwd)"
if [ -z "$SPARK_CONF_DIR" ]; then
  export SPARK_CONF_DIR="$SELF"
fi

#export SPARK_HOME=/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark3

Copy the gateway node's hive-site.xml into the spark3 conf directory; it needs no changes:

cp /etc/hive/conf/hive-site.xml /opt/cloudera/parcels/CDH/lib/spark3/conf/
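Before wiring up a spark-sql entry point, it is worth checking that the deployed build starts at all; a quick sanity check, assuming the paths above:

# should print the Spark 3.1.2 version banner
/opt/cloudera/parcels/CDH/lib/spark3/bin/spark-submit --version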

Check your CDH YARN configuration for the relevant yarn.resourcemanager settings and make sure they are enabled; under normal circumstances the ResourceManager enables them by default.

Create the spark-sql launcher:

cat /opt/cloudera/parcels/CDH/bin/spark-sql 
#!/bin/bash  
# Reference: http://stackoverflow.com/questions/59895/can-a-bash-script-tell-what-directory-its-stored-in  
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
SOURCE="${BASH_SOURCE[0]}"  
BIN_DIR="$( dirname "$SOURCE" )"  
while [ -h "$SOURCE" ]  
do  
 SOURCE="$(readlink "$SOURCE")"  
 [[ $SOURCE != /* ]] && SOURCE="$BIN_DIR/$SOURCE"  
 BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"  
done  
BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"  
LIB_DIR=$BIN_DIR/../lib  
export HADOOP_HOME=$LIB_DIR/hadoop  
  
# Autodetect JAVA_HOME if not defined  
. $LIB_DIR/bigtop-utils/bigtop-detect-javahome  
  
exec $LIB_DIR/spark3/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver "$@" 

Register the command with alternatives:

alternatives --install /usr/bin/spark-sql spark-sql /opt/cloudera/parcels/CDH/bin/spark-sql 1
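To confirm the link took effect (alternatives here is the standard RHEL/CentOS tool):

# show where /usr/bin/spark-sql now points
alternatives --display spark-sql
which spark-sql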

Test: run a quick query to confirm the rebuilt spark-sql works end to end, as sketched below.
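A minimal smoke test, assuming the alternatives link above put spark-sql on the PATH and the Hive metastore is reachable:

# list the Hive databases through the metastore
spark-sql -e "show databases;"
# run a trivial query end to end
spark-sql -e "select 1;"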

 

References:

https://its401.com/article/qq_26502245/120355741

https://blog.csdn.net/Mrheiiow/article/details/123007848