Posts

Why Scala for Spark?

In summary:
1) Apache Spark is itself written in Scala, and Scala scales well on the JVM.
2) Scala strikes a good balance between productivity and performance.
3) Organizations want the expressive power of a dynamic programming language without giving up type safety.
4) Scala is designed with parallelism and concurrency in mind, which suits big data applications.
5) Scala's functional paradigm fits naturally with the MapReduce big data model (see the sketch after this list).
6) Scala provides a clear path for building big data applications that scale in both data size and program complexity.
7) Scala programs are comparatively less complex than their Java equivalents.
8) Scala has well-designed libraries for scientific computing, linear algebra and random number generation.
9) Efficiency and speed remain vital, regardless of increasing processor speeds.
10) Other programming languages like Python or Java lag behind in Spark API coverage.
11) The Functio
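For point 5, a minimal Scala sketch of why the functional style maps naturally onto MapReduce; the word list here is made up for illustration:

// Word count in plain Scala collections: a "map" phase followed by a "reduce" phase.
val words = Seq("spark", "scala", "spark", "hive")
val counts = words
  .map(w => (w, 1))                                       // map: emit (word, 1) pairs
  .groupBy { case (w, _) => w }                           // shuffle: group pairs by key
  .map { case (w, pairs) => (w, pairs.map(_._2).sum) }    // reduce: sum the counts per word
// counts contains spark -> 2, scala -> 1, hive -> 1

The same map/group/reduce vocabulary carries over, essentially unchanged, to Spark RDDs.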

Hive insert overwrite output data delimiter

You cannot specify the output data delimiter in Hive during an insert overwrite. The default delimiter it generates is '\001' (Ctrl-A, often written '\1').
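A minimal Spark/Scala sketch for consuming such output, assuming a running SparkContext sc and a hypothetical /tmp/hive_out directory that the insert overwrite wrote to; the fields are split on the default '\u0001' delimiter:

// Read the Hive-produced text files and split each row on the Ctrl-A ('\u0001') delimiter.
val rows = sc.textFile("/tmp/hive_out")       // hypothetical output path
  .map(line => line.split('\u0001'))          // Array of column values per row
rows.take(5).foreach(cols => println(cols.mkString(" | ")))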

Application Localisation in YARN and its pitfalls

An application container goes through a phase called Application Localisation before it is started. The Node Manager performs this activity only once per application, irrespective of the number of the application's containers it has been entrusted to execute. This step mostly includes:
-- Downloading the application jars, files and archives from HDFS
-- Setting up log aggregation for the application
If you notice a container not starting (check the container logs) as soon as the Application Master has sent a request to the Node Manager (check the Node Manager logs), it usually indicates delays in localisation. Be judicious about the jars you ship with your application to keep the localisation overhead manageable; a sketch of how Spark ships such resources follows.
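A minimal Spark/Scala sketch, assuming a running SparkContext sc and hypothetical local paths, of how extra jars and files are attached to an application; on YARN these are among the resources the Node Manager localises before a container starts:

// Ship an extra jar and a side file with the application.
// On YARN they are downloaded (localised) on each node before containers run.
sc.addJar("/local/path/udfs.jar")        // hypothetical jar needed by tasks
sc.addFile("/local/path/lookup.csv")     // hypothetical side file

// Tasks can resolve the localised copy of an added file by name.
import org.apache.spark.SparkFiles
val localCopy = SparkFiles.get("lookup.csv")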

How to pass arguments to kill via a pipe

kill $(ps -e | grep <YOUR_SEARCH_STRING> | awk '{print $1}')
This greps the process list for the search string, extracts the matching PIDs with awk, and hands them to kill via command substitution.

Spark Enhanced Support for Multiple Operations

Spark provides essential support for many data-processing workloads in the big data industry, such as:
-- Streaming data
-- SQL queries
-- Graph data processing
-- Machine learning
-- MapReduce-style operations

Multi-platform support: Apache Spark offers broad interoperability in terms of where it runs and which data stores it can access. Spark supports applications running:
-- In the cloud
-- In standalone cluster mode

Besides that, Spark can access varied data sources:
-- HBase
-- Tachyon
-- HDFS
-- Cassandra
-- Hive
-- Hadoop data sources

Spark can be deployed on:
-- A distributed framework such as YARN or Mesos
-- A standalone server

Spark features: open source, fast processing, distributed data processing, highly productive, scalable.
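A minimal Spark 1.x-style sketch in the spark-shell, assuming the SparkContext sc, showing the same engine serving both collection-style processing and SQL; the sample data is made up for illustration:

// Build a small DataFrame from an RDD and query it with SQL.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val counts = sc.parallelize(Seq(("spark", 10), ("hive", 3), ("yarn", 5)))
  .toDF("word", "cnt")                        // RDD -> DataFrame
counts.registerTempTable("counts")            // expose it to SQL
sqlContext.sql("SELECT word, cnt FROM counts WHERE cnt > 4").show()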

In how many ways can we create RDDs in Apache Spark?

There are three methods to create an RDD.

1. The first method is used when the data is already available in an external system such as the local filesystem, HDFS or HBase. An RDD can be created by calling the textFile method of SparkContext with a path/URL as the argument.
scala> val data = sc.textFile("File1.txt")
Here sc is the SparkContext object. You need to create the file File1.txt in the SPARK_HOME directory, or pass a full path/URL.

2. The second approach works with an existing in-memory collection:
scala> val arr1 = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> val rdd1 = sc.parallelize(arr1)

3. The third way is to create a new RDD from an existing one, by applying a transformation such as map:
scala> val newRDD = rdd1.map(data => data * 2)

Ref : http://data-flair.training/forums/topic/in-how-many-ways-can-we-create-rdds-in-apache-spark-explain