
Scala for Data Science Engineering — Part 1

Data Science is an interesting field to work in, a combination of statistics and real-world programming. There are a number of programming languages used by Data Science Engineers, each of which has unique features. The most popular among them are Scala, Python and R. Since I work with Scala, I would like to share some of the most important concepts I have come across that are worthwhile for beginners in Data Science Engineering.

Data Science (DS) vs Data Science Engineering (DSE)

Before moving on to Scala concepts, I thought of giving a short description of DS vs DSE. Many of us, especially beginners, are not clear about the roles played by each. Data Science is about finding answers to questions by applying statistics, machine learning and data mining tools to a large or small dataset. Since it involves a good deal of research, normally a PhD or M.Sc in Data Science is required to get the job done. DSE, on the other hand, is about gathering data, storing it, doing batch or real-time processing and exposing it as an API or in other forms that enable a Data Scientist to finish their job. There are many Big Data tools involved in DSE, such as Hadoop, Spark, Hive, HBase etc. A Data Science Engineer should have strong programming knowledge as well as knowledge of databases.

Scala

When it comes to DSE, Apache Spark is the most widely used tool in the industry, and it is written in the Scala programming language. Spark complements Hadoop and supports batch processing as well as real-time (streaming) processing. Compared to Hadoop MapReduce, Spark is more efficient for many workloads, largely because it keeps intermediate data in memory. Find more information on Spark here.

Scala runs on the JVM and is fully interoperable with Java. It has many features that Java lacks and is a language more focused on Functional Programming. Since Spark is written in Scala, Scala is more appropriate for Spark users than other languages, and with it we can get the maximum out of the Spark framework without any restrictions. There are many features in Scala a Data Science Engineer should be familiar with, such as val, Higher Order Functions, Partial Functions, Pattern Matching & Case Classes, Collections, Currying and Implicits.

Note :- The above list is based on my knowledge and experience in using Scala with Spark.

val as a Keyword

val declares a read-only (immutable) reference. Since the data is distributed and frameworks like Spark rely on the RDD, which is an immutable data structure, val is the suitable keyword for declaring a variable.

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
val sc = new SparkContext(conf)
val rdd = sc.textFile("filelocation")
val filteredRDD = rdd.filter(r => r.split(",")(2) == "USA")

Declaring an RDD with val keeps the reference read-only, which fits Spark's model of sharing immutable, serializable data structures safely across processes.

In Scala it is not always a good idea to eagerly serialize every field of an object; doing so can increase memory consumption and network traffic (e.g. in Spark) and waste time recomputing fields. We can use lazy val instead, which denotes a field that is calculated once, when it is first accessed, and then stored for future references. Additionally, we can use @transient lazy val, which denotes that the field shall not be serialized; it is recomputed lazily wherever it is needed.
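As a minimal sketch of this idea (the LineParser class and its fields are illustrative names, not from Spark's API), a field that is expensive to build or not serializable can be marked @transient lazy val so it is skipped during serialization and rebuilt lazily on each executor:

import java.util.regex.Pattern

class LineParser extends Serializable {
  // Not serialized with the object; compiled lazily on first access on each executor.
  @transient lazy val splitter: Pattern = Pattern.compile(",")

  def country(line: String): String = splitter.split(line)(2)
}

Here the compiled Pattern is not shipped inside the closure; each executor compiles it once, the first time country is called.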

Higher Order Functions

In Scala there are functions that take other functions as parameters, or whose result is a function. These are called higher-order functions.

def apply(f: Int => String, x: Int) = f(x)

In the above example, the apply method takes a function f as a parameter along with a value x, and the result is f applied to x.

Note:- In Scala, methods and functions are different concepts, but here methods are automatically coerced to functions when the context requires it.
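For instance, here is a small usage sketch (the layout method is only an illustrative example) where a method is passed to apply and coerced to a function:

def layout(x: Int): String = "[" + x.toString + "]"
val result = apply(layout, 10)   // result is "[10]"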

In Spark, filter, map and reduce are some of the frequently used higher order functions. Let’s consider a simple example of applying a higher order function to a Resilient Distributed Dataset (RDD).

val conf = new SparkConf().setAppName("reduce").setMaster("local[*]")
val sc = new SparkContext(conf)
val inputIntegers = List(1, 2, 3, 4, 5)
val integerRdd = sc.parallelize(inputIntegers)
val product = integerRdd.reduce((x, y) => x * y)

In the above example, an RDD is created from a list of integers using the built-in parallelize method available on the Spark context, and then the reduce (higher-order) function is applied, which takes an anonymous function ((x, y) => x * y) as input. The result, product, is 120.
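As a brief sketch using the same integerRdd, map and filter work the same way, each taking a function as an argument:

val doubled = integerRdd.map(x => x * 2)          // RDD containing 2, 4, 6, 8, 10
val evens = integerRdd.filter(x => x % 2 == 0)    // RDD containing 2, 4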

See you soon with Scala for Data Science Engineering — Part 2 :)

Thanks for reading
