I benchmarked your code `iris = parallelize_dataframe(iris, multiply_columns)` (which calls `apply(lambda x: len(x))`) against `iris['length_of_word'] = iris['species'].str.len()`, and the vectorized `str.len()` was much faster.
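A minimal, self-contained sketch of the comparison, using a hypothetical stand-in for the iris DataFrame (the real one comes from the post's own setup):

```python
import pandas as pd

# Hypothetical stand-in for the iris DataFrame used in the post.
iris = pd.DataFrame({"species": ["setosa", "versicolor", "virginica"] * 50000})

# Row-wise apply: invokes the Python-level len() once per element.
apply_lengths = iris["species"].apply(lambda x: len(x))

# Vectorized string method: a single call that loops in optimized code.
vec_lengths = iris["species"].str.len()

# Both produce identical results; str.len() is just much faster.
assert apply_lengths.equals(vec_lengths)
```

Timing each line with `%timeit` should show the `str.len()` version winning by a wide margin, consistent with the result above.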

I wrote a post on exploring this at: https://maxpowerwastaken.github.io/blog/pandas-dont-apply-_-vectorize/

I am trying to use the built-in visualisation in Zeppelin when retrieving data from a Cassandra keyspace using the spark-cassandra connector.

import org.apache.spark.sql.cassandra.CassandraSQLContext
import org.apache.spark.sql.cassandra._

val csc = new CassandraSQLContext(sc)
import sqlContext.implicits._

val rdd1 = csc.sql("SELECT * FROM q_data.qr_ep_data LIMIT 100")
rdd1.registerTempTable("epdata")

%sql

SELECT * from q_data.qr_ep_data limit 10

Error: :171: error: not found: value %


I am not able to use `%sql`; it throws the error above. Any help is much appreciated.
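One likely cause, sketched here as an assumption: in Zeppelin, `%sql` is an interpreter binding that must appear on the first line of its own notebook paragraph. If it is typed inside the Scala paragraph, the Scala compiler sees `%` as an identifier and reports `not found: value %`. The intended layout, assuming the `epdata` temp table registered above, looks like:

```
// Paragraph 1 (Scala interpreter)
rdd1.registerTempTable("epdata")

// Paragraph 2 (a separate Zeppelin paragraph)
%sql
SELECT * FROM epdata LIMIT 10
```

Note that `%sql` queries registered temp tables, not Cassandra keyspaces directly, so the query should target `epdata` rather than `q_data.qr_ep_data`.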

I've got the default Anaconda NumPy installation with OpenBLAS, and it shows all cores being used for some matrix computations (SVD in my case). When I tried to run SVD on a list of random matrices in parallel, the result was actually slower than running them serially. Some googling matched my intuition: many of the base numerical routines are already parallelized internally, so they utilize resources much more efficiently when you run them serially than when you wrap them in parallel Python processes.
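A common workaround, sketched under the assumption of an OpenBLAS-backed NumPy: cap the BLAS thread pool via environment variables before NumPy is imported, so any process-level parallelism you add yourself does not oversubscribe the cores. The matrix sizes and count here are arbitrary illustrations:

```python
import os

# Cap BLAS internal threads BEFORE importing numpy; with these caps
# removed, OpenBLAS parallelizes each SVD internally on its own.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np

rng = np.random.default_rng(0)
matrices = [rng.standard_normal((200, 200)) for _ in range(8)]

# Serial loop: each SVD now uses a single-threaded BLAS, so it is safe
# to distribute this loop across worker processes without contention.
singular_values = [np.linalg.svd(m, compute_uv=False) for m in matrices]
```

With the caps in place, farming the loop out to a process pool stops competing with BLAS's own threads; without them, the serial loop is often the faster option, matching the observation above.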

Maybe parallelizing still pays off for more straightforward, non-BLAS operations (as are common in pandas).
