Spark and CSV and SQL
SQL
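# The snippets below assume an existing SparkSession named `spark` (as in a
# notebook or spark-shell); a minimal setup sketch, assuming local mode and an
# arbitrary app name:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("csv-sql-demo").getOrCreate()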
# Read the CSV (header row present); without inferSchema every column comes back as a string
data_frame = (spark.read
              .csv("/db/nowe/APLUS_LOG_MINMAX_1183241001_LoggerData.csv", header=True)
              .select("ID", "UTC")
              .limit(200))
data_frame.createOrReplaceTempView("my_table")
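# Sanity check (an aside, not strictly needed): the view should now appear in
# the session catalog as a temporary table
print(spark.catalog.listTables())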
# What did we just read? Compare printSchema (DataFrame API) with DESC (SQL)
# data_frame.printSchema()
spark.sql("desc my_table").show()
# Fetch a single row: first() vs LIMIT 1
# data_frame.first()
spark.sql("select * from my_table limit 1").show()
# Renaming a column: data_frame.withColumnRenamed("ID", "SOME_ID")
spark.sql("select ID as SOME_ID from my_table limit 1").show()
# Casting... data_frame.select(data_frame.ID.cast("float")).show(2)
spark.sql("select CAST(ID as FLOAT) as SOME_ID from my_table limit 1").show()
# Now cast ID to float, then keep only the IDs that are evenly divisible by 2
# data_frame.select((data_frame.ID.cast("float") % 2).alias("IS_DIV_BY_TWO")).filter("IS_DIV_BY_TWO < 1").show(10)
# (WHERE does the row filter here; HAVING is for filtering after a GROUP BY)
spark.sql("select ID, (ID % 2) AS IS_DIV_BY_TWO from my_table WHERE (ID % 2) < 1 LIMIT 10").show()
# Group by with aliasing
# data_frame.select((data_frame.ID.cast("float") % 2).alias("IS_DIV_BY_TWO")).groupby("IS_DIV_BY_TWO").count().collect()
spark.sql("SELECT md, count(*) AS cnt FROM (select (ID % 2) AS md FROM my_table) t GROUP BY md").show()
