kokilasoral wrote:
Hi,
I'm exploring the Databricks trial version using a 3-node r3.2xlarge cluster (Spark 1.6). I tried a Cassandra data insert (external Cassandra DB on the same AWS cluster) using spark-cassandra-connector 1.5.0-M3, and it took an average of 28 seconds to insert the data (3 rows, 2 columns) via a scheduled job. Am I missing something in the optimization here? I used the standard Cassandra-Spark code snippet:
import com.datastax.spark.connector._
val sampleRdd = sc.parallelize(Array(("Databricks", 101), ("Spark", 110), ("Big Data", 10)))
// This function saves to an existing table. To save to a new table, use saveAsCassandraTable.
sampleRdd.saveToCassandra("test_keyspace", "words", SomeColumns("word", "count"))
Thanks,
It's hard to tell exactly from your post, but I guess your question is about the latency (28 seconds).
Since the dataset sampleRdd is so small, and based on your description, all I can guess is:
(1) maybe the network latency between the Spark cluster and the Cassandra DB is high
(2) maybe Spark is running on YARN, so it needs to schedule a job to run your insert. If that's the case, most of the 28 seconds is spent on scheduling, not on the write itself.
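One quick way to separate job-startup/scheduling overhead from the actual write latency is to run the same save twice in one session and time each run. This is only a sketch for the Databricks notebook environment (where `sc` is already defined); the `timed` helper is a hypothetical utility, not part of the connector API:

```scala
import com.datastax.spark.connector._

// Hypothetical helper: time a block of code and print the elapsed seconds.
def timed[T](label: String)(body: => T): T = {
  val t0 = System.nanoTime()
  val result = body
  println(f"$label took ${(System.nanoTime() - t0) / 1e9}%.2f s")
  result
}

val sampleRdd = sc.parallelize(Array(("Databricks", 101), ("Spark", 110), ("Big Data", 10)))

// First run includes executor startup and job-scheduling overhead;
// the second run should be closer to the raw Cassandra write latency.
timed("first save") {
  sampleRdd.saveToCassandra("test_keyspace", "words", SomeColumns("word", "count"))
}
timed("second save") {
  sampleRdd.saveToCassandra("test_keyspace", "words", SomeColumns("word", "count"))
}
```

If the second save is much faster than the first, the 28 seconds is mostly scheduling/startup cost rather than Cassandra itself.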
- Jazz