Mark: Spark-Avro Learning 1: Reading Avro Files with Spark SQL

1. Installation:

  1. https://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/2.0.1/
Download the spark-avro JAR from the directory above and import it into your Spark project.
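If the project is built with sbt, the same artifact can be declared as a dependency instead of copying the JAR by hand. The fragment below is a minimal sketch: the coordinates come from the Maven URL above, while the exact `scalaVersion` is an assumption (any 2.10.x release should match the `_2.10` suffix), and the Spark core and SQL dependencies are assumed to be declared elsewhere in the build.

```scala
// build.sbt (sketch)

// Assumption: a Scala 2.10.x compiler, to match the spark-avro_2.10 artifact.
scalaVersion := "2.10.6"

// %% appends the Scala binary version, resolving to spark-avro_2.10:2.0.1.
libraryDependencies += "com.databricks" %% "spark-avro" % "2.0.1"
```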

Sample file:

  1. https://github.com/databricks/spark-avro/raw/master/src/test/resources/episodes.avro  


2. Usage

Code:

  /**
   * @author xubo
   * @time 20160502
   * ref https://github.com/databricks/spark-avro
   */
  package org.apache.spark.avro.learning

  import java.text.SimpleDateFormat
  import java.util.Date

  import org.apache.spark.SparkConf
  import org.apache.spark.SparkContext
  import org.apache.spark.sql.SQLContext

  object learning1 {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("readFileFromFaFq").setMaster("local")
      val sc = new SparkContext(conf)
      // Import needed for the .avro method to be added
      import com.databricks.spark.avro._
      val sqlContext = new SQLContext(sc)
      // The Avro records get converted to Spark types, filtered, and
      // then written back out as Avro records
      val df = sqlContext.read.avro("file/data/avro/input/episodes.avro")
      df.show()
      // The timestamp suffix gives each run a distinct output directory
      val iString = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date())
      df.filter("doctor > 5").write.avro("file/data/avro/output/episodes/avro" + iString)
      df.filter("doctor > 5").show()
      // Release the local Spark resources once the job is done
      sc.stop()
    }
  }
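The listing appends a formatted timestamp to the output path. A likely reason (an assumption, since the post does not say) is that Spark's `DataFrameWriter` fails by default when the target path already exists, so a unique suffix lets the job be rerun without deleting earlier output. The sketch below isolates that step in plain Scala; `TimestampSuffix` and `outputPath` are hypothetical names, not part of Spark or spark-avro.

```scala
import java.text.SimpleDateFormat
import java.util.Date

object TimestampSuffix {
  // Same pattern as in the listing: year, month, day, hour, minute,
  // second, and millisecond, which yields a 17-character suffix.
  private val Pattern = "yyyyMMddHHmmssSSS"

  // Hypothetical helper: append the formatted time to a base path so
  // that each run targets a new directory.
  def outputPath(base: String, now: Date): String =
    base + new SimpleDateFormat(Pattern).format(now)

  def main(args: Array[String]): Unit = {
    val path = outputPath("file/data/avro/output/episodes/avro", new Date())
    println(path)
  }
}
```

Two runs started within the same millisecond would still collide, which is unlikely to matter for a learning example like this one.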

Original file (raw contents of episodes.avro):
  Objavro.schema?{"type":"record","name":"episodes","namespace":"testing.hive.avro.serde","fields":[{"name":"title","type":"string","doc":"episode title"},{"name":"air_date","type":"string","doc":"initial date"},{"name":"doctor","type":"int","doc":"main actor playing the Doctor in episode"}]} 巏RLS|]Z^{??"The Eleventh Hour3 April 2010"The Doctor's Wife14 May 2011&Horror of Fang Rock 3 September 1977$An Unearthly Child 23 November 1963*The Mysterious Planet 6 September 1986Rose26 March 2005.The Power of the Daleks5 November 1966Castrolava4 January 1982
  巏RLS|]Z^{?

Output:
  +--------------------+----------------+------+
  |               title|        air_date|doctor|
  +--------------------+----------------+------+
  |   The Eleventh Hour|    3 April 2010|    11|
  |   The Doctor's Wife|     14 May 2011|    11|
  | Horror of Fang Rock|3 September 1977|     4|
  |  An Unearthly Child|23 November 1963|     1|
  |The Mysterious Pl...|6 September 1986|     6|
  |                Rose|   26 March 2005|     9|
  |The Power of the ...| 5 November 1966|     2|
  |          Castrolava|  4 January 1982|     5|
  +--------------------+----------------+------+

  +--------------------+----------------+------+
  |               title|        air_date|doctor|
  +--------------------+----------------+------+
  |   The Eleventh Hour|    3 April 2010|    11|
  |   The Doctor's Wife|     14 May 2011|    11|
  |The Mysterious Pl...|6 September 1986|     6|
  |                Rose|   26 March 2005|     9|
  +--------------------+----------------+------+
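The filtered table can be cross-checked without Spark by applying the same predicate, `doctor > 5`, to plain Scala collections. This is only a sketch of the filter's semantics, not of how Spark evaluates it. `FilterSketch`, `Episode`, and `doctorAbove5` are hypothetical names; the rows are copied from the first table, with the truncated titles completed from the raw file contents.

```scala
object FilterSketch {
  // Hypothetical case class mirroring the three fields of the episodes schema.
  case class Episode(title: String, airDate: String, doctor: Int)

  // The eight records printed by the first df.show call.
  val episodes: Seq[Episode] = Seq(
    Episode("The Eleventh Hour", "3 April 2010", 11),
    Episode("The Doctor's Wife", "14 May 2011", 11),
    Episode("Horror of Fang Rock", "3 September 1977", 4),
    Episode("An Unearthly Child", "23 November 1963", 1),
    Episode("The Mysterious Planet", "6 September 1986", 6),
    Episode("Rose", "26 March 2005", 9),
    Episode("The Power of the Daleks", "5 November 1966", 2),
    Episode("Castrolava", "4 January 1982", 5)
  )

  // Plain-collection counterpart of df.filter("doctor > 5").
  def doctorAbove5(rows: Seq[Episode]): Seq[Episode] =
    rows.filter(_.doctor > 5)

  def main(args: Array[String]): Unit =
    doctorAbove5(episodes).foreach(println)
}
```

The comparison is strict, so the record with `doctor` equal to 5 is excluded, which matches the four rows in the second table.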

Written output:
  Objavro.codecsnappyavro.schema�{"type":"record","name":"topLevelRecord","fields":[{"name":"title","type":["string","null"]},{"name":"air_date","type":["string","null"]},{"name":"doctor","type":["int","null"]}]}


Reference:
  1. https://github.com/databricks/spark-avro