I dislike working with Spark DataFrames in notebooks, because (a) you always need to remember (how) to force their display, and (b) they don't have the fancy out-of-the-box plotting capabilities Pandas DFs have. Instead, it is easier to work with Pandas in Databricks notebooks:

```python
import pyspark.pandas as ps
```

Now `ps` works like `import pandas as pd`, but any DataFrames you create (`ps.DataFrame()`, `ps.read_…`, `ps.sql('select …')`, etc.) are Spark DataFrames wrapped with the Pandas API. To wrap an existing Spark DataFrame `spark_df` with the Pandas DataFrame API (without converting it to an actual Pandas DataFrame, so you keep the benefits of Spark when the data doesn't fit in memory, or when Spark might be more efficient than Pandas):

```python
ps_df = spark_df.to_pandas_on_spark()
```

**You can also convert Spark DataFrames straight to Pandas if all the data easily fits in memory** (this is preferred, as `pyspark.pandas` doesn't support the full Pandas API):

```python
pd_df = spark_df.toPandas()
```

PySpark is much slower than (native) Scala Spark, and Arrow recovers some of that gap, especially when converting DataFrames. So you should also make sure you have [Arrow as your execution engine](https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html) when using Pandas in Databricks notebooks, to make conversions to/from Spark DataFrames more efficient:

```python
import pandas as pd

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
```
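To tie the pieces together, here is a minimal sketch of the workflow in a Databricks notebook, assuming `spark` is already defined (as it is in Databricks) and a hypothetical `sales` table with `region` and `revenue` columns; the table and column names are purely illustrative:

```python
import pyspark.pandas as ps

# Enable Arrow so conversions to/from Pandas use columnar transfers
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Query with the Pandas API; the result is still backed by Spark
ps_df = ps.sql("SELECT region, revenue FROM sales")

# Out-of-the-box plotting, as with a regular Pandas DataFrame
ps_df.groupby("region")["revenue"].sum().plot.bar()

# Only convert to an actual Pandas DataFrame once the data is small enough
pd_df = ps_df.to_pandas()
```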