{"id":1262,"date":"2017-05-04T22:15:33","date_gmt":"2017-05-04T22:15:33","guid":{"rendered":"http:\/\/10sa.com\/sql_stories\/?p=1262"},"modified":"2017-05-05T00:00:21","modified_gmt":"2017-05-05T00:00:21","slug":"spark-and-csv-for-python-language","status":"publish","type":"post","link":"http:\/\/10sa.com\/sql_stories\/?p=1262","title":{"rendered":"Spark and CSV for python language"},"content":{"rendered":"<p>Now, we have 2017 year, second quarter. It seems that in one year the a\/m instruction are not to be adequate.<br \/>\nSimple instructions:<\/p>\n<pre language=\"python\">\r\n# Open file and use first columns as header\r\ndata_frame = spark.read.csv(\"\/path\/to\/file.csv\", header=True)\r\n# You received basic Spark type - DataFrame.\r\n# See what structure how looks the structure\r\ndata_frame.printSchema()\r\nroot\r\n |-- ID: string (nullable = true)\r\n |-- ConfigId: string (nullable = true)\r\n |-- UTC: string (nullable = true)\r\n# Add conditions to your selection (assume that you want see only ID and UTC):\r\ndata_frame = spark.read.csv(\"\/path\/to\/file.csv\", header=True).select(\"ID\", \"UTC\")\r\ndata_frame.printSchema()\r\nroot\r\n |-- ID: string (nullable = true)\r\n |-- UTC: string (nullable = true)\r\n# And now - get first row\r\ndata_frame.first()\r\nRow(ID=u'0', UTC=u'04\/10\/17 10:41:25')\r\n# Amazing, now get first two rows:\r\ndata_frame.take(2)\r\n[Row(ID=u'0', UTC=u'04\/10\/17 10:41:25'), Row(ID=u'1', UTC=u'04\/10\/17 10:41:26')]\r\n# Do you prefer column format?\r\ndata_frame.show(2)\r\n# Show only ID\r\ndata_frame.select(\"ID\").show(2)\r\n# Change name ID into MY_KEY and UTC -> MY_DATE\r\ndata_frame.withColumnRenamed(\"ID\", \"MY_KEY\").withColumnRenamed(\"UTC\", \"MY_DATE\").show(2)\r\n# Change type from int TO float\r\ndata_frame.select(data_frame.ID.cast(\"float\")).show(2)\r\n# Did spark truncated row?\r\ndata_frame.select(data_frame.ID.cast(\"float\")).show(2, truncate=False)\r\n\r\n# Now, cast ID to float, and get the ID which can be divided by 2\r\ndata_frame.select( (data_frame.ID.cast(\"float\") % 2) .alias(\"IS_DIV_BY_TWO\")).filter(\"IS_DIV_BY_TWO < 1\").show(10)\r\n\r\n# Grouby?\r\ndata_frame.select( (data_frame.ID.cast(\"float\") % 2) .alias(\"IS_DIV_BY_TWO\")).groupby(\"IS_DIV_BY_TWO\").count().collect()\r\n[Row(IS_DIV_BY_TWO=1.0, count=99929), Row(IS_DIV_BY_TWO=0.0, count=99930)]      \r\n<\/pre>\n<p>Do I need doing all that stuff with python? There shouldn't be a SQL, should it?<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Now, we have 2017 year, second quarter. It seems that in one year the a\/m instruction are not to be adequate. Simple instructions: # Open file and use first columns as header data_frame = spark.read.csv(&#8220;\/path\/to\/file.csv&#8221;, header=True) # You received basic Spark type &#8211; DataFrame. # See what structure how looks the structure data_frame.printSchema() root |&#8211; [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[24],"tags":[],"_links":{"self":[{"href":"http:\/\/10sa.com\/sql_stories\/index.php?rest_route=\/wp\/v2\/posts\/1262"}],"collection":[{"href":"http:\/\/10sa.com\/sql_stories\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/10sa.com\/sql_stories\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/10sa.com\/sql_stories\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/10sa.com\/sql_stories\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1262"}],"version-history":[{"count":8,"href":"http:\/\/10sa.com\/sql_stories\/index.php?rest_route=\/wp\/v2\/posts\/1262\/revisions"}],"predecessor-version":[{"id":1272,"href":"http:\/\/10sa.com\/sql_stories\/index.php?rest_route=\/wp\/v2\/posts\/1262\/revisions\/1272"}],"wp:attachment":[{"href":"http:\/\/10sa.com\/sql_stories\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1262"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/10sa.com\/sql_stories\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1262"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/10sa.com\/sql_stories\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1262"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}