Posts

Why Scala for Spark?

In summary:
1) Apache Spark is itself written in Scala, and Scala scales well on the JVM.
2) Scala strikes a good balance between productivity and performance.
3) Organizations want the expressive power of a dynamic programming language without giving up type safety.
4) Scala is designed with parallelism and concurrency in mind, which suits big data applications.
5) Scala's functional paradigm fits naturally with the MapReduce big data model (see the sketch after this list).
6) Scala provides a clear path for building big data applications that scale in both data size and program complexity.
7) Scala programs are comparatively less complex than their Java equivalents.
8) Scala has well-designed libraries for scientific computing, linear algebra and random number generation.
9) Efficiency and speed remain vital, regardless of increasing processor speeds.
10) Other programming languages like Python or Java lag behind in Spark API coverage.
11) The Functio
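For point 5, a minimal Scala sketch of why the functional style maps naturally onto MapReduce; the word list here is made up for illustration:

// Word count in plain Scala collections: a "map" phase followed by a "reduce" phase.
val words = Seq("spark", "scala", "spark", "hive")
val counts = words
  .map(w => (w, 1))                                       // map: emit (word, 1) pairs
  .groupBy { case (w, _) => w }                           // shuffle: group pairs by key
  .map { case (w, pairs) => (w, pairs.map(_._2).sum) }    // reduce: sum the counts per word
// counts contains spark -> 2, scala -> 1, hive -> 1

The same map/group/reduce vocabulary carries over, essentially unchanged, to Spark RDDs.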

Hive insert overwrite output data delimiter

You cannot specify the output data delimiter in Hive during an insert overwrite. The default delimiter it generates is '\001' (Ctrl-A, often written '\1').
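A minimal Spark/Scala sketch for consuming such output, assuming a running SparkContext sc and a hypothetical /tmp/hive_out directory that the insert overwrite wrote to; the fields are split on the default '\u0001' delimiter:

// Read the Hive-produced text files and split each row on the Ctrl-A ('\u0001') delimiter.
val rows = sc.textFile("/tmp/hive_out")       // hypothetical output path
  .map(line => line.split('\u0001'))          // Array of column values per row
rows.take(5).foreach(cols => println(cols.mkString(" | ")))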

Application Localisation in YARN and its pitfalls

An application container goes through a phase called Application Localisation before it is started. The Node Manager performs this activity only once per application, irrespective of the number of the application's containers it has been entrusted to execute. This step mostly includes:
-- Downloading the application jars, files and archives from HDFS
-- Setting up log aggregation for the application
If you notice a container not starting (check the container logs) as soon as the Application Master has sent a request to the Node Manager (check the Node Manager logs), it usually indicates delays in localisation. Be judicious about the jars you ship with your application to keep the localisation overhead manageable; a sketch of how Spark ships such resources follows.
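A minimal Spark/Scala sketch, assuming a running SparkContext sc and hypothetical local paths, of how extra jars and files are attached to an application; on YARN these are among the resources the Node Manager localises before a container starts:

// Ship an extra jar and a side file with the application.
// On YARN they are downloaded (localised) on each node before containers run.
sc.addJar("/local/path/udfs.jar")        // hypothetical jar needed by tasks
sc.addFile("/local/path/lookup.csv")     // hypothetical side file

// Tasks can resolve the localised copy of an added file by name.
import org.apache.spark.SparkFiles
val localCopy = SparkFiles.get("lookup.csv")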

How to pass arguments to kill via a pipe

kill $(ps -e | grep <YOUR_SEARCH_STRING> | awk '{print $1}')
This greps the process list for the search string, extracts the matching PIDs with awk, and hands them to kill via command substitution.

Spark Enhanced Support for Multiple Operations

Spark provides essential support for many data-processing workloads in the big data industry, such as:
-- Streaming data
-- SQL queries
-- Graph data processing
-- Machine learning
-- MapReduce-style operations

Multi-platform support: Apache Spark offers broad interoperability in terms of where it runs and which data stores it can access. Spark supports applications running:
-- In the cloud
-- In standalone cluster mode

Besides that, Spark can access varied data sources:
-- HBase
-- Tachyon
-- HDFS
-- Cassandra
-- Hive
-- Hadoop data sources

Spark can be deployed on:
-- A distributed framework such as YARN or Mesos
-- A standalone server

Spark features: open source, fast processing, distributed data processing, highly productive, scalable.
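A minimal Spark 1.x-style sketch in the spark-shell, assuming the SparkContext sc, showing the same engine serving both collection-style processing and SQL; the sample data is made up for illustration:

// Build a small DataFrame from an RDD and query it with SQL.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val counts = sc.parallelize(Seq(("spark", 10), ("hive", 3), ("yarn", 5)))
  .toDF("word", "cnt")                        // RDD -> DataFrame
counts.registerTempTable("counts")            // expose it to SQL
sqlContext.sql("SELECT word, cnt FROM counts WHERE cnt > 4").show()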

In how many ways can we create RDDs in Apache Spark?

There are three methods to create an RDD.

1. The first method is used when the data is already available in an external system such as the local filesystem, HDFS or HBase. An RDD can be created by calling the textFile method of SparkContext with a path/URL as the argument.
scala> val data = sc.textFile("File1.txt")
Here sc is the SparkContext object. You need to create the file File1.txt in the SPARK_HOME directory, or pass a full path/URL.

2. The second approach works with an existing in-memory collection:
scala> val arr1 = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> val rdd1 = sc.parallelize(arr1)

3. The third way is to create a new RDD from an existing one, by applying a transformation such as map:
scala> val newRDD = rdd1.map(data => data * 2)

Ref : http://data-flair.training/forums/topic/in-how-many-ways-can-we-create-rdds-in-apache-spark-explain