Spark and CSV for Python
It is now the second quarter of 2017, and it seems that after a year the aforementioned instructions are no longer adequate.
Simple instructions:
# Open the file and use the first row as column headers
data_frame = spark.read.csv("/path/to/file.csv", header=True)

# You get back the basic Spark type - a DataFrame.
# See what the structure looks like
data_frame.printSchema()
root
 |-- ID: string (nullable = true)
 |-- ConfigId: string (nullable = true)
 |-- UTC: string (nullable = true)

# Narrow the selection (assume you only want to see ID and UTC):
data_frame = spark.read.csv("/path/to/file.csv", header=True).select("ID", "UTC")
data_frame.printSchema()
root
 |-- ID: string (nullable = true)
 |-- UTC: string (nullable = true)

# And now - get the first row
data_frame.first()
Row(ID=u'0', UTC=u'04/10/17 10:41:25')

# Now get the first two rows:
data_frame.take(2)
[Row(ID=u'0', UTC=u'04/10/17 10:41:25'), Row(ID=u'1', UTC=u'04/10/17 10:41:26')]

# Do you prefer a tabular format?
data_frame.show(2)

# Show only ID
data_frame.select("ID").show(2)

# Rename ID to MY_KEY and UTC to MY_DATE
data_frame.withColumnRenamed("ID", "MY_KEY").withColumnRenamed("UTC", "MY_DATE").show(2)

# Cast ID from string to float
data_frame.select(data_frame.ID.cast("float")).show(2)

# Did Spark truncate the rows?
data_frame.select(data_frame.ID.cast("float")).show(2, truncate=False)

# Now cast ID to float and keep only the IDs divisible by 2
data_frame.select(
    (data_frame.ID.cast("float") % 2).alias("IS_DIV_BY_TWO")
).filter("IS_DIV_BY_TWO < 1").show(10)

# Group by?
data_frame.select(
    (data_frame.ID.cast("float") % 2).alias("IS_DIV_BY_TWO")
).groupBy("IS_DIV_BY_TWO").count().collect()
[Row(IS_DIV_BY_TWO=1.0, count=99929), Row(IS_DIV_BY_TWO=0.0, count=99930)]
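Note that every column above comes back as a string, which is why ID had to be cast before doing arithmetic. If you would rather let Spark guess the column types, the CSV reader also accepts an inferSchema option. A minimal sketch, using the same placeholder path as above (the inferred types will of course depend on your data):

# Let Spark sample the file and infer column types
# instead of reading everything as strings
data_frame = spark.read.csv("/path/to/file.csv", header=True, inferSchema=True)
data_frame.printSchema()
# With this option ID may come back as a numeric type,
# so the explicit cast to float above may no longer be needed.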
Do I really have to do all of this in Python? Shouldn't there be a way to use SQL instead?
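There is. As a rough sketch of the SQL route: register the DataFrame as a temporary view and query it with spark.sql (the view name my_table here is just an example):

# Register the DataFrame as a temporary view so it can be queried with SQL
data_frame.createOrReplaceTempView("my_table")

# The same "divisible by two" grouping, expressed in SQL
spark.sql("""
    SELECT CAST(ID AS float) % 2 AS IS_DIV_BY_TWO, COUNT(*) AS count
    FROM my_table
    GROUP BY CAST(ID AS float) % 2
""").show()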