Compiling Spark yourself to get spark-sql working on CDH 6.3.2
At first this looked simple: some articles suggest just dropping in the Apache-built 2.4.0 package as a replacement, but after a lot of fiddling spark-sql still would not work.
So I decided to compile it myself, following what I found online.
Software versions: JDK 1.8, Maven 3.6.3, Scala 2.11.12, Spark 3.1.2
1. Download the software
wget http://distfiles.macports.org/scala2.11/scala-2.11.12.tgz
wget https://archive.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2.tgz
Put the archives under /opt, extract them all, and set the environment variables for the JDK, Scala, and Maven:
#####java#####
export JAVA_HOME=/usr/java/jdk1.8.0_181-cloudera
export JRE_HOME=$JAVA_HOME/jre
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
######maven#######
export PATH=/opt/apache-maven-3.6.3/bin:$PATH
####scala#####
export SCALA_HOME=/opt/scala-2.11.12
export PATH=${SCALA_HOME}/bin:$PATH
2. Build Spark 3
Edit the Spark 3 pom at /opt/spark-3.1.2/pom.xml and add the Cloudera Maven repository:
<repositories>
  <repository>
    <id>central</id>
    <!-- This should be at top, it makes maven try the central repo first and then others and hence faster dep resolution -->
    <name>Maven Repository</name>
    <url>https://repo1.maven.org/maven2</url>
    <releases>
      <enabled>true</enabled>
    </releases>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
  </repository>
  <repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>
Change the Hadoop version in the pom file.
Spark 3.1.2 ships against Hadoop 3.2 by default; change the hadoop.version property to 3.0.0-cdh6.3.2.
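One way to make this edit non-interactively is with sed. The sketch below runs against a throwaway copy in /tmp (the real file is /opt/spark-3.1.2/pom.xml), and assumes the stock value in spark-3.1.2 is 3.2.0; check your copy before patching.

```shell
# Sketch: switch <hadoop.version> to the CDH version non-interactively.
# /tmp/pom-demo.xml is a stand-in for /opt/spark-3.1.2/pom.xml.
POM=/tmp/pom-demo.xml
cat > "$POM" <<'EOF'
<properties>
  <hadoop.version>3.2.0</hadoop.version>
</properties>
EOF

# Replace the default Hadoop version with the CDH one.
sed -i 's#<hadoop.version>3.2.0</hadoop.version>#<hadoop.version>3.0.0-cdh6.3.2</hadoop.version>#' "$POM"

grep '<hadoop.version>' "$POM"
```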
Note 2: the Maven build needs enough memory. If you build with Maven directly, set the Maven memory options first; if you use make-distribution.sh, edit them in the /opt/spark-3.1.2/dev/make-distribution.sh script instead.
For my build I set Xmx to 4 GB and the code cache size to 2 GB; with less, the build kept failing:
export MAVEN_OPTS="-Xmx4g -XX:ReservedCodeCacheSize=2g"
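If you take the make-distribution.sh route, the memory defaults can also be patched with sed. This sketch operates on a one-line stand-in file; the quoted default is what Spark 3.1.2 ships (verify the exact line in your copy of dev/make-distribution.sh first):

```shell
# Sketch: bump the memory settings inside make-distribution.sh.
# /tmp/make-distribution-demo.sh stands in for the real script.
SCRIPT=/tmp/make-distribution-demo.sh
cat > "$SCRIPT" <<'EOF'
export MAVEN_OPTS="${MAVEN_OPTS:--Xmx2g -XX:ReservedCodeCacheSize=1g}"
EOF

# Raise the heap to 4g and the code cache to 2g.
sed -i 's/-Xmx2g -XX:ReservedCodeCacheSize=1g/-Xmx4g -XX:ReservedCodeCacheSize=2g/' "$SCRIPT"
grep MAVEN_OPTS "$SCRIPT"
```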
Note 3: if your Scala version is 2.10.x, run:
# cd /opt/spark-3.1.2
# ./dev/change-scala-version.sh 2.10
If it is 2.11.x, run:
# cd /opt/spark-3.1.2
# ./dev/change-scala-version.sh 2.11
Note 4:
The recommended build command is:
./dev/make-distribution.sh \
  --name 3.0.0-cdh6.3.2 --tgz -Pyarn -Phadoop-3.0 \
  -Phive -Phive-thriftserver -Dhadoop.version=3.0.0-cdh6.3.2 -X
This uses Spark's make-distribution.sh script, which itself drives Maven under the hood. The flags:
- --tgz packages the build as a tar.gz archive
- --name is followed by our Hadoop version; the generated tarball's name ends with this same version string (you can see how in the make-distribution.sh source)
- -Pyarn builds with YARN support
- -Dhadoop.version=3.0.0-cdh6.3.2 specifies the Hadoop version.
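Since the build runs for a long time, it is worth capturing its output in a log file. A minimal sketch of that pattern, with `echo` standing in for the real make-distribution.sh invocation and an arbitrary log path:

```shell
# Sketch: run a long command while keeping a full log on disk.
# `echo` below stands in for ./dev/make-distribution.sh ... .
LOG=/tmp/spark-build.log

run_logged() {
  # Show output on the terminal and write it to the log at the same time.
  "$@" 2>&1 | tee "$LOG"
}

run_logged echo "pretend this is make-distribution.sh output"
```

In practice, running the build under nohup (e.g. `nohup ./dev/make-distribution.sh ... > build.log 2>&1 &`) also keeps an SSH disconnect from killing a multi-hour build.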
Build error message:
/root/spark-3.1.2/build/mvn: line 212: 6877 Killed "${MVN_BIN}" -DzincPort=${ZINC_PORT} "$@"
Fix:
Edit the ./dev/make-distribution.sh
file and point the Maven variable at the Maven installed on your own system:
# cat make-distribution.sh
# Figure out where the Spark framework is installed
SPARK_HOME="$(cd "`dirname "$0"`/.."; pwd)"
DISTDIR="$SPARK_HOME/dist"
MAKE_TGZ=false
MAKE_PIP=false
MAKE_R=false
NAME=none
#MVN="$SPARK_HOME/build/mvn"
MVN="/opt/apache-maven-3.6.3/bin/mvn"
The build takes a very long time. After it succeeds, the distribution directory is created, and the packaged Spark tarball is:
spark-3.1.2-bin-3.0.0-cdh6.3.2.tgz
3. Deployment
Extract the tarball into the CDH parcel's lib directory and rename the result to spark3:
tar zxvf spark-3.1.2-bin-3.0.0-cdh6.3.2.tgz -C /opt/cloudera/parcels/CDH/lib/
mv /opt/cloudera/parcels/CDH/lib/spark-3.1.2-bin-3.0.0-cdh6.3.2 /opt/cloudera/parcels/CDH/lib/spark3
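The extract-and-rename pattern can be tried end to end against a scratch directory first. The sketch below builds a dummy tarball under /tmp (all paths are made up for the demo) instead of touching the real parcel directory:

```shell
# Sketch: extract a distribution tarball with -C, then rename it to `spark3`.
# Everything happens under a scratch dir; no real parcel paths are touched.
BASE=/tmp/parcel-demo
rm -rf "$BASE"

# Fabricate a tiny tarball shaped like the real distribution.
mkdir -p "$BASE/src/spark-3.1.2-bin-3.0.0-cdh6.3.2/bin"
touch "$BASE/src/spark-3.1.2-bin-3.0.0-cdh6.3.2/bin/spark-submit"
tar -czf "$BASE/spark.tgz" -C "$BASE/src" spark-3.1.2-bin-3.0.0-cdh6.3.2

# Extract into the target lib dir, then rename to the plain `spark3` name.
mkdir -p "$BASE/lib"
tar -zxf "$BASE/spark.tgz" -C "$BASE/lib"
mv "$BASE/lib/spark-3.1.2-bin-3.0.0-cdh6.3.2" "$BASE/lib/spark3"

ls "$BASE/lib/spark3/bin"
```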
Copy the CDH cluster's spark-env.sh into /opt/cloudera/parcels/CDH/lib/spark3/conf:
cp /etc/spark/conf/spark-env.sh /opt/cloudera/parcels/CDH/lib/spark3/conf
Then change SPARK_HOME in it:
[root@master1 conf]# cat spark-env.sh
#!/usr/bin/env bash
##
# Generated by Cloudera Manager and should not be modified directly
##
SELF="$(cd $(dirname $BASH_SOURCE) && pwd)"
if [ -z "$SPARK_CONF_DIR" ]; then
  export SPARK_CONF_DIR="$SELF"
fi
#export SPARK_HOME=/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark3
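A quick way to confirm the edit took effect is to source the file and inspect SPARK_HOME. The sketch uses a throwaway one-line copy under /tmp; on the cluster you would source the real spark3/conf/spark-env.sh instead:

```shell
# Sketch: verify that the edited spark-env.sh exports the new SPARK_HOME.
# /tmp/spark-env-demo.sh stands in for spark3/conf/spark-env.sh.
ENVFILE=/tmp/spark-env-demo.sh
cat > "$ENVFILE" <<'EOF'
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark3
EOF

. "$ENVFILE"
echo "SPARK_HOME=$SPARK_HOME"
```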
Copy the gateway node's hive-site.xml into the spark3/conf directory; no changes to it are needed:
cp /etc/hive/conf/hive-site.xml /opt/cloudera/parcels/CDH/lib/spark3/conf/
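After the copy it is worth sanity-checking that the file actually carries the metastore address spark-sql will need. The sketch below fabricates a minimal hive-site.xml under /tmp with a made-up thrift URI; on the cluster you would grep the real spark3/conf/hive-site.xml:

```shell
# Sketch: check that hive-site.xml defines hive.metastore.uris.
# File path and the thrift URI are illustrative examples only.
HS=/tmp/hive-site-demo.xml
cat > "$HS" <<'EOF'
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://master1:9083</value>
  </property>
</configuration>
EOF

# Print the value line that follows the property name.
grep -A1 'hive.metastore.uris' "$HS" | grep '<value>'
```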
Configure yarn.resourcemanager: check whether your CDH YARN configuration has the following settings, which need to be enabled.
Normally the ResourceManager enables these settings by default.
Create the spark-sql launcher
cat /opt/cloudera/parcels/CDH/bin/spark-sql
#!/bin/bash
# Reference: http://stackoverflow.com/questions/59895/can-a-bash-script-tell-what-directory-its-stored-in
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
SOURCE="${BASH_SOURCE[0]}"
BIN_DIR="$( dirname "$SOURCE" )"
while [ -h "$SOURCE" ]
do
  SOURCE="$(readlink "$SOURCE")"
  [[ $SOURCE != /* ]] && SOURCE="$BIN_DIR/$SOURCE"
  BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
done
BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
LIB_DIR=$BIN_DIR/../lib
export HADOOP_HOME=$LIB_DIR/hadoop
# Autodetect JAVA_HOME if not defined
. $LIB_DIR/bigtop-utils/bigtop-detect-javahome
exec $LIB_DIR/spark3/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver "$@"
Set up the shortcut
alternatives --install /usr/bin/spark-sql spark-sql /opt/cloudera/parcels/CDH/lib/spark3/bin/spark-sql 1
Test:
References:
https://its401.com/article/qq_26502245/120355741
https://blog.csdn.net/Mrheiiow/article/details/123007848