
How-to: use Spark to support queries across MySQL tables and HBase tables

To resolve this, one good choice is Spark: its Parquet support and DataFrame API solve the problem, and Parquet is also a good choice for performance. Here are the steps:
  1. With sqlContext, the big MySQL tables can be loaded and saved as Parquet files in HDFS. Design this as a daily job. The code could look like the following:
    Please notice that this code is based on Spark 1.3. From Spark 1.4 on, please use sqlContext.read (see the sketch after this block).
    val partitions = (upper_bound - lower_bound) / lines_each_part

    var options: HashMap[String, String] = new HashMap
    options.put("driver", "com.mysql.jdbc.Driver")
    options.put("url", url)
    options.put("dbtable", table)
    options.put("lowerBound", lower_bound.toString())
    options.put("upperBound", upper_bound.toString())
    // partitions are based on lower_bound and upper_bound
    options.put("numPartitions", partitions.toString())
    options.put("partitionColumn", id)

    val jdbcDF = sqlContext.load("jdbc", options)
    jdbcDF.save(output)
  2. Load the MySQL data from the HDFS Parquet files like the following:
    sqlContext.parquetFile(base_dir + "/" + table).toDF().registerTempTable(table)
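    If several MySQL tables are dumped under the same base directory, they can all be registered in one pass; a small sketch (base_dir and the table names are just placeholders):

    // Sketch: register every dumped MySQL table as a temp table.
    val mysqlTables = Seq("orders", "users") // placeholder table names
    for (t <- mysqlTables) {
      sqlContext.parquetFile(base_dir + "/" + t).toDF().registerTempTable(t)
    }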
  3. Load the HBase table as a DataFrame and register it as a table:
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.HBaseAdmin
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    var config = HBaseConfiguration.create()
    config.addResource(new Path(System.getenv("HBASE_HOME") + "/conf/hbase-site.xml"))
    try {
      HBaseAdmin.checkHBaseAvailable(config)
      System.out.println("Detected HBase is running")
    } catch {
      case e => e.printStackTrace()
    }
    config.set(TableInputFormat.INPUT_TABLE, hbase_table)
    config.set(TableInputFormat.SCAN_COLUMN_FAMILY, columnF)
    ......
    sqlContext.createDataFrame(hc.toRowRDD(hc.createPairRDD(jsc, config)), hc.schema()).toDF().registerTempTable(table)

    //hc.createPairRDD:
    public JavaPairRDD<ImmutableBytesWritable, Result> createPairRDD(
            JavaSparkContext jsc, Configuration conf) {
        return jsc.newAPIHadoopRDD(
                conf,
                TableInputFormat.class,
                ImmutableBytesWritable.class,
                Result.class
                ).cache();
    }

    //hc.toRowRDD:
    public JavaRDD<Row> toRowRDD(JavaPairRDD<ImmutableBytesWritable, Result> pairRDD) {
        return pairRDD.map(new Function<Tuple2<ImmutableBytesWritable, Result>, Row>() {
            private static final long serialVersionUID = -4887770613787757684L;

            public Row call(Tuple2<ImmutableBytesWritable, Result> re) throws Exception {
                Result result = re._2();
                Row row = null;
                // If no columns are configured, take every column of the family;
                // otherwise only the configured columns.
                if (schema.getColumns().length == 0) {
                    row = getAll(result);
                } else {
                    row = get(result);
                }
                return row;
            }

            public Row get(Result result) throws Exception {
                List<Object> values = new ArrayList<Object>();
                for (String col : schema.getColumns()) {
                    byte[] b = result.getValue(schema.getFamily().getBytes(), col.getBytes());
                    if (b == null) {
                        values.add("0");
                        continue;
                    }
                    values.add(new String(b));
                }
                Row row = RowFactory.create(values.toArray(new Object[values.size()]));
                return row;
            }

            public Row getAll(Result result) throws Exception {
                NavigableMap<byte[], byte[]> map = result.getFamilyMap(schema.getFamily().getBytes());
                List<Object> values = new ArrayList<Object>();
                for (byte[] key : map.keySet()) {
                    values.add(new String(map.get(key)));
                }
                Row row = RowFactory.create(values.toArray(new Object[values.size()]));
                return row;
            }
        });
    }

    //hc.schema():
    public StructType schema() {
        final List<StructField> keyFields = new ArrayList<StructField>();
        for (String fieldName : this.hbase_columns) { // hbase_columns is String[]
            keyFields.add(DataTypes.createStructField(fieldName, DataTypes.StringType, true));
        }
        return DataTypes.createStructType(keyFields);
    }
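    The schema object referenced above (getFamily() / getColumns()) is not shown in the original snippets; it only needs to hold the column family and the column qualifiers. A minimal hypothetical holder (here sketched in Scala; the class and parameter names are made up):

    // Hypothetical sketch of the schema holder used above: it only needs to
    // expose the HBase column family and the column qualifiers to read.
    class HBaseSchema(family: String, columns: Array[String]) {
      def getFamily(): String = family
      def getColumns(): Array[String] = columns
    }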
  4. Run the SQL as follows and save the result in HDFS:
    val rdd_parquet = sqlContext.sql(sql)
    rdd_parquet.rdd.saveAsTextFile(output)
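    For example, the SQL here can join across the two stores. A hypothetical query (the table and column names are placeholders, assuming a MySQL-derived orders table registered in step 2 and an HBase-backed user_profiles table registered in step 3):

    // Hypothetical cross-source join; table and column names are placeholders.
    val sql =
      """SELECT o.order_id, o.amount, u.profile
        |FROM orders o
        |JOIN user_profiles u ON o.user_id = u.user_id""".stripMargin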