Table of Contents

1 Variables

  • When a function passed to a Spark operation (such as a map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machines are propagated back to the driver program.
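A minimal Scala sketch of this pitfall, assuming an already-running `SparkContext` named `sc`: the counter incremented inside `foreach` is a per-executor copy, so the driver's variable never changes. Spark's shared-variable mechanisms (broadcast variables and accumulators) exist for exactly this case.

```scala
// Sketch only: assumes a running SparkContext `sc`.
var counter = 0
val rdd = sc.parallelize(1 to 100)

// Each executor receives its own serialized copy of `counter`;
// the increments happen on those copies, not on the driver's variable.
rdd.foreach(x => counter += x)
println(counter) // still 0 on the driver

// An accumulator is the supported way to aggregate back to the driver:
val acc = sc.longAccumulator("sum")
rdd.foreach(x => acc.add(x))
println(acc.value) // 5050
```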

2 Cassandra

2.1 Performance

2.1.1 Repartitioning

Using the DataStax Spark Cassandra Connector we can repartition RDDs to line up with the partitioning of the Cassandra nodes, as long as the items we're considering can be mapped to the corresponding table's partition key. This is done with the repartitionByCassandraReplica(keyspace, table) method, called on an RDD (or on the RDDs underlying, say, a DStream).
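A short sketch of how this might look, assuming a SparkContext `sc` and a hypothetical keyspace `test` with a table `kv` whose partition key is `key`; the RDD elements must be mappable to that partition key:

```scala
import com.datastax.spark.connector._

// Hypothetical schema: keyspace "test", table "kv", partition key "key".
case class KV(key: Int, value: String)

val rdd = sc.parallelize(1 to 1000).map(k => KV(k, s"v$k"))

// Repartition so each element lands on a Spark partition whose preferred
// location is a Cassandra replica owning that element's partition key.
val local = rdd.repartitionByCassandraReplica("test", "kv")

// A subsequent joinWithCassandraTable can then read replica-locally:
val joined = local.joinWithCassandraTable[KV]("test", "kv")
```

The point of the repartition step is that the follow-up read (here joinWithCassandraTable) can be served by the Cassandra replica co-located with each Spark partition, avoiding cross-node data transfer.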

In fact, if Cassandra nodes are co-located with Spark nodes, queries are always sent to the Cassandra process running on the same node as the Spark executor process, so no data is transferred between nodes. If a Cassandra node fails or becomes overloaded during a read, the queries are retried on a different node.