Posts

Showing posts from May, 2018

Spark Enhanced Support for Multiple Operations

Spark provides essential support for a range of data-processing workloads in the big data industry, including:
- Streaming data
- SQL queries
- Graph processing
- Machine learning
- MapReduce-style operations

Multi-platform support: Apache Spark offers broad interoperability in terms of where it runs and which data sources it can use. Spark supports applications running:
- In the cloud
- In standalone cluster mode

Besides that, Spark can access varied data sources:
- HBase
- Tachyon
- HDFS
- Cassandra
- Hive
- Hadoop data sources

Spark can be deployed on:
- A distributed framework such as YARN or Mesos
- A standalone server

Spark features: open source, fast processing, distributed data processing, highly productive, scalable.

In how many ways can we create RDDs in Apache Spark?

There are three ways to create an RDD.

1. The first method is used when data is already available in an external system such as the local filesystem, HDFS, or HBase. An RDD can be created by calling the textFile method of SparkContext with a path/URL as the argument:
scala> val data = sc.textFile("File1.txt")
Here sc is the SparkContext object, and File1.txt must exist in the SPARK_HOME directory (or be referenced by its full path).

2. The second approach parallelizes an existing collection:
scala> val arr1 = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> val rdd1 = sc.parallelize(arr1)

3. The third way is to create a new RDD from an existing one by applying a transformation:
scala> val newRDD = rdd1.map(data => data * 2)

Ref: http://data-flair.training/forums/topic/in-how-many-ways-can-we-create-rdds-in-apache-spark-explain

Map vs flatMap in Spark?

Spark map: map is a transformation operation in Apache Spark. It applies a function to each element of an RDD and returns the result as a new RDD, producing exactly one output element per input element. The developer defines the logic inside the map function, and the same logic is applied to every element of the RDD.

Spark flatMap: flatMap is also a transformation operation. Like map, it applies a function to each element of an RDD and returns the result as a new RDD, but the function passed to flatMap may return 0, 1, or more elements per input element, and the results are flattened into a single RDD. Again, the developer-defined logic is applied to every element of the RDD.

Ref: https://www.quora.com/What-is-the-difference-between-Map-and-FlatMap-in-Apache-Spark
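A minimal sketch in the Spark shell that makes the difference visible (sc is the pre-created SparkContext; the sample lines are illustrative):
scala> val lines = sc.parallelize(Seq("hello world", "hi"))
scala> val mapped = lines.map(line => line.split(" "))         // RDD[Array[String]]: one array per input line
scala> val flatMapped = lines.flatMap(line => line.split(" ")) // RDD[String]: all words, flattened
scala> mapped.count()       // 2 - exactly one output element per input element
scala> flatMapped.count()   // 3 - "hello", "world", "hi"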

What are the Spark Driver and Executors?

The main() method of the program runs in the driver. The driver is the process that runs the user code that creates RDDs, performs transformations and actions, and creates the SparkContext. When the Spark shell is launched, a driver program has been created; when the driver terminates, the application is finished. The driver program splits the Spark application into tasks and schedules them to run on the executors. The task scheduler resides in the driver and distributes tasks among the workers. The two main roles of the driver are:
- Converting the user program into tasks.
- Scheduling tasks on the executors.
At a high level, the structure of a Spark program is: RDDs are created from some input data, new RDDs are derived from existing ones using various transformations, and then an action is performed to compute the data. In a Spark program, the DAG (directed acyclic graph) of operations is created implicitly. And when the driver r...
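A minimal sketch of that high-level structure in the Spark shell (File1.txt and the word-count logic are illustrative assumptions):
scala> val input = sc.textFile("File1.txt")                   // RDD created from input data
scala> val words = input.flatMap(line => line.split(" "))     // transformation: derive a new RDD (lazy)
scala> val counts = words.map(w => (w, 1)).reduceByKey(_ + _) // transformation: still lazy, only extends the DAG
scala> counts.count()                                         // action: the driver converts the DAG into tasks and schedules them on executors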

The basic steps for Spark Streaming code?

- Initialize a Spark StreamingContext object.
- Apply transformations and output operations to DStreams.
- Start receiving data and process it using streamingContext.start().
- Wait for the processing to be stopped using streamingContext.awaitTermination().
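A minimal sketch of those steps using the socket text stream source (the host, port, batch interval, and word-count logic are illustrative assumptions):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// 1. Initialize a StreamingContext with a 10-second batch interval
val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// 2. Apply transformations and output operations to DStreams
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

// 3. Start receiving data and processing it
ssc.start()

// 4. Wait for the processing to be stopped
ssc.awaitTermination()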

What is SparkContext in Apache Spark?

A SparkContext is a client of Spark's execution environment, and it acts as the master of the Spark application. SparkContext sets up internal services and establishes a connection to a Spark execution environment. After the SparkContext is created, you can create RDDs, accumulators, and broadcast variables, access Spark services, and run jobs (until the SparkContext is stopped). Only one SparkContext may be active per JVM; you must stop() the active SparkContext before creating a new one. In the Spark shell, a special interpreter-aware SparkContext is already created for the user, in the variable called sc. The first step of any Spark driver application is to create a SparkContext. The SparkContext allows the Spark driver application to access the cluster through a resource manager. The resource manager can be YARN or Spark's own cluster manager. A few functionalities which SparkContext offers are: 1. We can get the current status of a Spark application like c...
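A minimal sketch of creating a SparkContext in a standalone driver program (the app name and local master are illustrative assumptions; in the Spark shell, sc already exists):
import org.apache.spark.{SparkConf, SparkContext}

// Configure the application and connect to an execution environment (local mode here)
val conf = new SparkConf().setAppName("SparkContextSketch").setMaster("local[*]")
val sc = new SparkContext(conf)

// Use the context to create an RDD and run a job
val rdd = sc.parallelize(1 to 100)
println(rdd.sum())

// Stop the active context before creating a new one (only one may be active per JVM)
sc.stop()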

What is SparkSession in Apache Spark?

SparkSession is the new entry point for Spark applications. Prior to 2.0, SparkContext was the entry point for Spark jobs. RDD was one of the main APIs then, and it was created and manipulated using the SparkContext. For every other API a different context was required: for SQL, an SQLContext; for Streaming, a StreamingContext; for Hive, a HiveContext. But from 2.0, the Dataset and its DataFrame variant, along with RDD, are becoming the standard APIs and the basic unit of data abstraction in Spark. All user-defined code is written and evaluated against the Dataset and DataFrame APIs as well as RDDs. So there was a need for a new entry point built for handling these new APIs, which is why SparkSession has been introduced. SparkSession also includes the APIs available in the different contexts - SparkContext, SQLContext, StreamingContext, HiveContext. Ref: https://data-flair.training/forums/topic/what-is-sparksession-in...
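A minimal sketch of creating a SparkSession (the app name is an illustrative assumption; in the Spark 2.x shell the session is already available as spark):
import org.apache.spark.sql.SparkSession

// Single entry point: builds (or reuses) the session
val spark = SparkSession.builder()
  .appName("SparkSessionSketch")
  .master("local[*]")
  .getOrCreate()

// DataFrame/Dataset APIs hang off the session...
val df = spark.range(0, 10).toDF("id")
df.show()

// ...and the underlying SparkContext is still reachable for RDD work
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))
println(rdd.count())

spark.stop()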

Broadcast Variables in Apache Spark

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost. A broadcast variable is created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method.
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that ...
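A short usage sketch showing the broadcast value being read inside a closure that runs on the executors (the lookup-table contents are an illustrative assumption):
scala> val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two", 3 -> "three"))
scala> val ids = sc.parallelize(Seq(1, 2, 3, 2))
scala> ids.map(id => lookup.value.getOrElse(id, "unknown")).collect()
// Array(one, two, three, two) - each executor reads its cached copy via lookup.value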

What are Accumulators in Apache Spark?

Accumulators  are variables that are "added" to through an associative and commutative "add" operation. They act as a container for accumulating partial values across multiple tasks (running on executors). They are designed to be used safely and efficiently in parallel and distributed Spark computations and are meant for distributed counters and sums (e.g.  task metrics ). You can create built-in accumulators for  longs, doubles, or collections  or register custom accumulators using the  SparkContext.register  methods. You can create accumulators with or without a name, but only  named accumulators  are displayed in  web UI  (under Stages tab for a given stage). Ref:  https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-accumulators.html https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#accumulators
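A minimal sketch using the Spark 2.x built-in long accumulator (the accumulator name and the counted condition are illustrative assumptions):
scala> val badRecords = sc.longAccumulator("Bad Records")   // named, so it appears in the web UI (Stages tab)
scala> val data = sc.parallelize(Seq("1", "2", "oops", "4"))
scala> data.foreach(s => if (scala.util.Try(s.toInt).isFailure) badRecords.add(1))   // tasks only add partial counts
scala> badRecords.value   // only the driver reads the accumulated total (1 here)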

What is an Accumulator in Spark?

Accumulators are variables that are only "added" to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they will be displayed in Spark's UI. This can be useful for understanding the progress of running stages. An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python). However, they cannot read its value. Only the driver program can read the accumulator's value, using its value method.
scala> val accum = sc.accumulator(0, "My Accumulator")
accum: spark.Accumulator[Int] = 0
scala> sc.parallelize(Array(1, 2, ...

Useful UNIX commands

Creating a symbolic link for a directory:
ln -s directory_with_location link_name_with_location
Example:
ln -s spark-listener-dir current
Updating the link to point at a new directory:
ln -sfn spark-listener-updated-dir current
See also: https://www.tecmint.com/check-linux-disk-usage-of-files-and-directories/

Eclipse Oxygen is not starting

I found that Eclipse Oxygen was not starting; it was throwing the error below. Renaming workbench.xmi to workbenchOLD.xmi inside <ECLIPSE_WORKSPACE>/.metadata/.plugins/org.eclipse.e4.workbench fixed it:
cd .metadata/.plugins/org.eclipse.e4.workbench
mv workbench.xmi workbenchOLD.xmi
It worked!
The original error:
!ENTRY org.eclipse.e4.ui.workbench.swt 4 2 2018-05-14 13:02:12.435
!MESSAGE Problems occurred when invoking code from plug-in: "org.eclipse.e4.ui.workbench.swt".
!STACK 0
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:204)
at scala.tools.scalap.Classfile.<init>(Classfile.scala:17)
at org.scalaide.core.internal.jdt.model.ScalaClassFileDescriber$.isScala(ScalaClassFileDescriber.scala:16)
at org.scalaide.core.internal.jdt.model.ScalaClassFileDescriber.describe(ScalaClassFileDescriber.scala:38)
at org.eclipse.core.internal.content.ContentTypeCatalo...