Spark and CSV for Python

It is now the second quarter of 2017, and it seems that after a year the above-mentioned instructions are no longer adequate.
Simple instructions:
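The snippets below assume an existing SparkSession named spark (the interactive pyspark shell creates one for you). For a standalone script, a minimal sketch, assuming Spark 2.x:

# Create the SparkSession entry point (only needed outside the pyspark shell)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("csv-example").getOrCreate()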

# Open the file and use the first row as the header
data_frame = spark.read.csv("/path/to/file.csv", header=True)
# You get back the basic Spark type - a DataFrame.
# See what the structure looks like
data_frame.printSchema()
root
 |-- ID: string (nullable = true)
 |-- ConfigId: string (nullable = true)
 |-- UTC: string (nullable = true)
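# Note: every column came back as a string. If you would like Spark to guess
# the types instead, the csv reader also accepts inferSchema=True (assuming
# Spark 2.x; it costs an extra pass over the file):
data_frame = spark.read.csv("/path/to/file.csv", header=True, inferSchema=True)
data_frame.printSchema()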
# Add conditions to your selection (assume you want to see only ID and UTC):
data_frame = spark.read.csv("/path/to/file.csv", header=True).select("ID", "UTC")
data_frame.printSchema()
root
 |-- ID: string (nullable = true)
 |-- UTC: string (nullable = true)
# And now - get the first row
data_frame.first()
Row(ID=u'0', UTC=u'04/10/17 10:41:25')
# Amazing, now get the first two rows:
data_frame.take(2)
[Row(ID=u'0', UTC=u'04/10/17 10:41:25'), Row(ID=u'1', UTC=u'04/10/17 10:41:26')]
# Do you prefer a tabular format?
data_frame.show(2)
# Show only ID
data_frame.select("ID").show(2)
# Rename ID to MY_KEY and UTC to MY_DATE
data_frame.withColumnRenamed("ID", "MY_KEY").withColumnRenamed("UTC", "MY_DATE").show(2)
# Change the type of ID from string to float
data_frame.select(data_frame.ID.cast("float")).show(2)
# Did Spark truncate the values? Show them in full:
data_frame.select(data_frame.ID.cast("float")).show(2, truncate=False)

# Now cast ID to float and keep only the IDs divisible by 2
data_frame.select("ID", (data_frame.ID.cast("float") % 2).alias("IS_DIV_BY_TWO")).filter("IS_DIV_BY_TWO < 1").show(10)

# Group by?
data_frame.select((data_frame.ID.cast("float") % 2).alias("IS_DIV_BY_TWO")).groupBy("IS_DIV_BY_TWO").count().collect()
[Row(IS_DIV_BY_TWO=1.0, count=99929), Row(IS_DIV_BY_TWO=0.0, count=99930)]      

Do I really need to do all of this in Python? Shouldn't there be a way to use SQL instead?
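It turns out you can: register the DataFrame as a temporary view and query it with spark.sql. A minimal sketch, assuming Spark 2.x (the view name my_table is just an example):

# Register the DataFrame as a temporary view and query it with plain SQL
data_frame.createOrReplaceTempView("my_table")
spark.sql("SELECT ID, UTC FROM my_table WHERE CAST(ID AS FLOAT) % 2 = 0").show(2)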
