Posts

Showing posts from May, 2018

Spark Enhanced Support for Multiple Operations

Spark provides essential support for a range of data-processing workloads in the big data industry, including:
- Streaming data
- SQL queries
- Graph processing
- Machine learning
- MapReduce-style operations

Multi-platform support: Apache Spark offers broad interoperability in terms of where it runs and which data sources it can use. Spark supports applications running:
- In the cloud
- In standalone cluster mode

Besides that, Spark can access varied data sources:
- HBase
- Tachyon
- HDFS
- Cassandra
- Hive
- Hadoop data sources

Spark can be deployed on:
- A distributed framework such as YARN or Mesos
- A standalone server

Spark features: open source, fast processing, distributed data processing, highly productive, scalable.

In how many ways can we create RDDs in Apache Spark?

There are three ways to create an RDD.

1. The first method is used when data is already available in an external system such as the local filesystem, HDFS, or HBase. An RDD can be created by calling the textFile method of SparkContext with a path/URL as the argument:
scala> val data = sc.textFile("File1.txt")
Here sc is the SparkContext object, and File1.txt must exist in the SPARK_HOME directory (or be referenced by its full path).

2. The second approach parallelizes an existing collection:
scala> val arr1 = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> val rdd1 = sc.parallelize(arr1)

3. The third way is to create a new RDD from an existing one by applying a transformation:
scala> val newRDD = rdd1.map(data => data * 2)

Ref: http://data-flair.training/forums/topic/in-how-many-ways-can-we-create-rdds-in-apache-spark-explain

Map vs flatMap in Spark?

Spark map: map is a transformation operation in Apache Spark. It applies a function to each element of an RDD and returns the result as a new RDD, producing exactly one output element per input element. The developer defines the logic inside the map function, and the same logic is applied to every element of the RDD.

Spark flatMap: flatMap is also a transformation operation. Like map, it applies a function to each element of an RDD and returns the result as a new RDD, but the function passed to flatMap may return 0, 1, or more elements per input element, and the results are flattened into a single RDD. Again, the developer-defined logic is applied to every element of the RDD.

Ref: https://www.quora.com/What-is-the-difference-between-Map-and-FlatMap-in-Apache-Spark
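A minimal sketch in the Spark shell that makes the difference visible (sc is the pre-created SparkContext; the sample lines are illustrative):
scala> val lines = sc.parallelize(Seq("hello world", "hi"))
scala> val mapped = lines.map(line => line.split(" "))         // RDD[Array[String]]: one array per input line
scala> val flatMapped = lines.flatMap(line => line.split(" ")) // RDD[String]: all words, flattened
scala> mapped.count()       // 2 - exactly one output element per input element
scala> flatMapped.count()   // 3 - "hello", "world", "hi"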

What are the Spark Driver and Executors?

The main() method of the program runs in the driver. The driver is the process that runs the user code that creates RDDs, performs transformations and actions, and creates the SparkContext. When the Spark shell is launched, a driver program has been created; when the driver terminates, the application is finished. The driver program splits the Spark application into tasks and schedules them to run on the executors. The task scheduler resides in the driver and distributes tasks among the workers. The two main roles of the driver are:
- Converting the user program into tasks.
- Scheduling tasks on the executors.
At a high level, the structure of a Spark program is: RDDs are created from some input data, new RDDs are derived from existing ones using various transformations, and then an action is performed to compute the data. In a Spark program, the DAG (directed acyclic graph) of operations is created implicitly. And when the driver r...
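A minimal sketch of that high-level structure in the Spark shell (File1.txt and the word-count logic are illustrative assumptions):
scala> val input = sc.textFile("File1.txt")                   // RDD created from input data
scala> val words = input.flatMap(line => line.split(" "))     // transformation: derive a new RDD (lazy)
scala> val counts = words.map(w => (w, 1)).reduceByKey(_ + _) // transformation: still lazy, only extends the DAG
scala> counts.count()                                         // action: the driver converts the DAG into tasks and schedules them on executors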

The basic steps for Spark Streaming code?

- Initialize a Spark StreamingContext object.
- Apply transformations and output operations to DStreams.
- Start receiving data and process it using streamingContext.start().
- Wait for the processing to be stopped using streamingContext.awaitTermination().
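A minimal sketch of those steps using the socket text stream source (the host, port, batch interval, and word-count logic are illustrative assumptions):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// 1. Initialize a StreamingContext with a 10-second batch interval
val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// 2. Apply transformations and output operations to DStreams
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

// 3. Start receiving data and processing it
ssc.start()

// 4. Wait for the processing to be stopped
ssc.awaitTermination()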

What is SparkContext in Apache Spark?

A SparkContext is a client of Spark's execution environment, and it acts as the master of the Spark application. SparkContext sets up internal services and establishes a connection to a Spark execution environment. After the SparkContext is created, you can create RDDs, accumulators, and broadcast variables, access Spark services, and run jobs (until the SparkContext is stopped). Only one SparkContext may be active per JVM; you must stop() the active SparkContext before creating a new one. In the Spark shell, a special interpreter-aware SparkContext is already created for the user, in the variable called sc. The first step of any Spark driver application is to create a SparkContext. The SparkContext allows the Spark driver application to access the cluster through a resource manager. The resource manager can be YARN or Spark's own cluster manager. A few functionalities which SparkContext offers are: 1. We can get the current status of a Spark application like c...
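A minimal sketch of creating a SparkContext in a standalone driver program (the app name and local master are illustrative assumptions; in the Spark shell, sc already exists):
import org.apache.spark.{SparkConf, SparkContext}

// Configure the application and connect to an execution environment (local mode here)
val conf = new SparkConf().setAppName("SparkContextSketch").setMaster("local[*]")
val sc = new SparkContext(conf)

// Use the context to create an RDD and run a job
val rdd = sc.parallelize(1 to 100)
println(rdd.sum())

// Stop the active context before creating a new one (only one may be active per JVM)
sc.stop()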

What is SparkSession in Apache Spark?

SparkSession is the new entry point for Spark applications. Prior to 2.0, SparkContext was the entry point for Spark jobs. RDD was one of the main APIs then, and it was created and manipulated using the SparkContext. For every other API a different context was required: for SQL, an SQLContext; for Streaming, a StreamingContext; for Hive, a HiveContext. But from 2.0, the Dataset and its DataFrame variant, along with RDD, are becoming the standard APIs and the basic unit of data abstraction in Spark. All user-defined code is written and evaluated against the Dataset and DataFrame APIs as well as RDDs. So there was a need for a new entry point built for handling these new APIs, which is why SparkSession has been introduced. SparkSession also includes the APIs available in the different contexts - SparkContext, SQLContext, StreamingContext, HiveContext. Ref: https://data-flair.training/forums/topic/what-is-sparksession-in...
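A minimal sketch of creating a SparkSession (the app name is an illustrative assumption; in the Spark 2.x shell the session is already available as spark):
import org.apache.spark.sql.SparkSession

// Single entry point: builds (or reuses) the session
val spark = SparkSession.builder()
  .appName("SparkSessionSketch")
  .master("local[*]")
  .getOrCreate()

// DataFrame/Dataset APIs hang off the session...
val df = spark.range(0, 10).toDF("id")
df.show()

// ...and the underlying SparkContext is still reachable for RDD work
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))
println(rdd.count())

spark.stop()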

Broadcast Variables in Apache Spark

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost. A broadcast variable is created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method.
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that ...
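A short usage sketch showing the broadcast value being read inside a closure that runs on the executors (the lookup-table contents are an illustrative assumption):
scala> val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two", 3 -> "three"))
scala> val ids = sc.parallelize(Seq(1, 2, 3, 2))
scala> ids.map(id => lookup.value.getOrElse(id, "unknown")).collect()
// Array(one, two, three, two) - each executor reads its cached copy via lookup.value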

What are Accumulators in Apache Spark?

Accumulators  are variables that are "added" to through an associative and commutative "add" operation. They act as a container for accumulating partial values across multiple tasks (running on executors). They are designed to be used safely and efficiently in parallel and distributed Spark computations and are meant for distributed counters and sums (e.g.  task metrics ). You can create built-in accumulators for  longs, doubles, or collections  or register custom accumulators using the  SparkContext.register  methods. You can create accumulators with or without a name, but only  named accumulators  are displayed in  web UI  (under Stages tab for a given stage). Ref:  https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-accumulators.html https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#accumulators
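A minimal sketch using the Spark 2.x built-in long accumulator (the accumulator name and the counted condition are illustrative assumptions):
scala> val badRecords = sc.longAccumulator("Bad Records")   // named, so it appears in the web UI (Stages tab)
scala> val data = sc.parallelize(Seq("1", "2", "oops", "4"))
scala> data.foreach(s => if (scala.util.Try(s.toInt).isFailure) badRecords.add(1))   // tasks only add partial counts
scala> badRecords.value   // only the driver reads the accumulated total (1 here)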

What is an Accumulator in Spark?

Accumulators are variables that are only "added" to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they will be displayed in Spark's UI. This can be useful for understanding the progress of running stages. An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python). However, they cannot read its value. Only the driver program can read the accumulator's value, using its value method.
scala> val accum = sc.accumulator(0, "My Accumulator")
accum: spark.Accumulator[Int] = 0
scala> sc.parallelize(Array(1, 2, ...

Useful UNIX commands

Creating a symbolic link for a directory:
ln -s directory_with_location link_name_with_location
Example:
ln -s spark-listener-dir current
Updating the link to point at a new directory:
ln -sfn spark-listener-updated-dir current
See also: https://www.tecmint.com/check-linux-disk-usage-of-files-and-directories/

Eclipse Oxygen is not starting

I found that Eclipse Oxygen was not starting; it was throwing the error below. Renaming workbench.xmi to workbenchOLD.xmi inside <ECLIPSE_WORKSPACE>/.metadata/.plugins/org.eclipse.e4.workbench fixed it:
cd .metadata/.plugins/org.eclipse.e4.workbench
mv workbench.xmi workbenchOLD.xmi
It worked!
The original error:
!ENTRY org.eclipse.e4.ui.workbench.swt 4 2 2018-05-14 13:02:12.435
!MESSAGE Problems occurred when invoking code from plug-in: "org.eclipse.e4.ui.workbench.swt".
!STACK 0
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:204)
at scala.tools.scalap.Classfile.<init>(Classfile.scala:17)
at org.scalaide.core.internal.jdt.model.ScalaClassFileDescriber$.isScala(ScalaClassFileDescriber.scala:16)
at org.scalaide.core.internal.jdt.model.ScalaClassFileDescriber.describe(ScalaClassFileDescriber.scala:38)
at org.eclipse.core.internal.content.ContentTypeCatalo...