Getting Started with Apache Spark

Getting Started with Apache Spark

Spark MLib + SQL

Apache Spark* is a dispatcher processing engine responsible for orchestrating, distributing, and monitoring applications that perform multiple data processing tasks on multiple worker machines, which form a cluster. As we have already mentioned, it is possible to read the data from different persistent storage solutions such as Amazon S3 or Google Storage, distributed storage systems such as HDFS, key-value systems such as Apache Cassandra, or message buses such as Kafka.

How does Spark work?

Spark does not store data itself, but rather has a focus on processing. This is one of the points that differentiate it from Hadoop, which includes both persistent storage (HDFS) and a processing system (MapReduce) in a very integrated way. It is important to talk about the processing speed: the key is the possibility that Spark offers to perform processing from memory. This, and the extension of the popular MapReduce to efficiently allow other types of operations: Interactive Queries and Streaming Processing.

image.png

Differences between Apache Spark & Hadoop

Apache SparkHadoop MapReduce
Uses memory instead of disk for intermediate results.In MapReduce, until the map phase completes its processing, the reducers do not start executing. Nor can you control their order of execution.
The data type used by the core of Spark, known as Spark Core, is RDD or Resilient Distributed Datasets.The intrinsic limitation of MapReduce is, in fact, the “one-way scalability” of its design.
The graph scheduler, DAG, optimizes the stages in execution.Due to the fixed cost incurred by each MapReduce job submitted, an application that requires low latency time or random access to a large set of data is infeasible.

Basics of Apache Spark

First, install pyspark

!pip3 install pyspark

Note: This may take some time...be patient

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
     |████████████████████████████████| 281.4 MB 35 kB/s 
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
     |████████████████████████████████| 198 kB 68.2 MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=d169ac1b6c62c56f835e15772f7c9402760bd12802709aabbdc6764782d71589
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1

Spark Session

import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.conf import SparkConf

conf = SparkConf().set('spark.ui.port', '4050')
sc = SparkContext(conf=conf)

spark = SparkSession \
    .builder \
    .appName('Basics') \
    .getOrCreate()

We've just launched our first Spark context & session. Congratulations! 🎉

The dataset

water.jpeg

We are going to work with the Water Treatment Plant Data Set from UCI Machine Learning Repository.

This dataset comes from the daily measures of sensors in a urban waste water treatment plant. The objective is to classify the operational state of the plant in order to predict faults through the state variables of the plant at each of the stages of the treatment process. This domain has been stated as an ill-structured domain.

Attributes Information:

All attributes are numeric and continuous

N. Attrib.

  • 1 Q-E (input flow to plant)
  • 2 ZN-E (input Zinc to plant)
  • 3 PH-E (input pH to plant)
  • 4 DBO-E (input Biological demand of oxygen to plant)
  • 5 DQO-E (input chemical demand of oxygen to plant)
  • 6 SS-E (input suspended solids to plant)
  • 7 SSV-E (input volatile supended solids to plant)
  • 8 SED-E (input sediments to plant)
  • 9 COND-E (input conductivity to plant)
  • 10 PH-P (input pH to primary settler)
  • 11 DBO-P (input Biological demand of oxygen to primary settler)
  • 12 SS-P (input suspended solids to primary settler)
  • 13 SSV-P (input volatile supended solids to primary settler)
  • 14 SED-P (input sediments to primary settler)
  • 15 COND-P (input conductivity to primary settler)
  • 16 PH-D (input pH to secondary settler)
  • 17 DBO-D (input Biological demand of oxygen to secondary settler)
  • 18 DQO-D (input chemical demand of oxygen to secondary settler)
  • 19 SS-D (input suspended solids to secondary settler)
  • 20 SSV-D (input volatile supended solids to secondary settler)
  • 21 SED-D (input sediments to secondary settler)
  • 22 COND-D (input conductivity to secondary settler)
  • 23 PH-S (output pH)
  • 24 DBO-S (output Biological demand of oxygen)
  • 25 DQO-S (output chemical demand of oxygen)
  • 26 SS-S (output suspended solids)
  • 27 SSV-S (output volatile supended solids)
  • 28 SED-S (output sediments)
  • 29 COND-S (output conductivity)
  • 30 RD-DBO-P (performance input Biological demand of oxygen in primary settler)
  • 31 RD-SS-P (performance input suspended solids to primary settler)
  • 32 RD-SED-P (performance input sediments to primary settler)
  • 33 RD-DBO-S (performance input Biological demand of oxygen to secondary settler)
  • 34 RD-DQO-S (performance input chemical demand of oxygen to secondary settler)
  • 35 RD-DBO-G (global performance input Biological demand of oxygen)
  • 36 RD-DQO-G (global performance input chemical demand of oxygen)
  • 37 RD-SS-G (global performance input suspended solids)
  • 38 RD-SED-G (global performance input sediments)

Load the dataset

We download it from the UCI repo and save it to our Spark DataFrame.

df_data = spark.read.csv(path='water-treatment.data', header=False, inferSchema=True)

Let's see our dataframe

df_data.show()
+---------+-----+----+---+---+---+---+----+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|      _c0|  _c1| _c2|_c3|_c4|_c5|_c6| _c7|_c8| _c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|_c19|_c20|_c21|_c22|_c23|_c24|_c25|_c26|_c27|_c28|_c29|_c30|_c31|_c32|_c33|_c34|_c35|_c36|_c37|_c38|
+---------+-----+----+---+---+---+---+----+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| D-1/3/90|44101|1.50|7.8|  ?|407|166|66.3|4.5|2110| 7.9|   ?| 228|70.2| 5.5|2120| 7.9|   ?| 280|  94|72.3| 0.3|2010| 7.3|   ?|  84|  21|81.0|0.02|2000|   ?|58.8|95.5|   ?|70.0|   ?|79.4|87.3|99.6|
| D-2/3/90|39024|3.00|7.7|  ?|443|214|69.2|6.5|2660| 7.7|   ?| 244|75.4| 7.7|2570| 7.6|   ?| 474|  96|79.2| 0.4|2700| 7.5|   ?|  91|  17|94.1|0.00|2590|   ?|60.7|94.8|   ?|80.8|   ?|79.5|92.1| 100|
| D-4/3/90|32229|5.00|7.6|  ?|528|186|69.9|3.4|1666| 7.7|   ?| 220|72.7| 4.5|1594| 7.7|   ?| 272|  92|78.3| 0.2|1742| 7.6|   ?| 128|  21|  81|0.05|1888|   ?|58.2|95.6|   ?|52.9|   ?|75.8|88.7|98.5|
| D-5/3/90|35023|3.50|7.9|205|588|192|65.6|4.5|2430| 7.8| 236| 268|73.1| 8.5|2280| 7.8| 158| 376|  96|77.1| 0.4|2060| 7.6|  20| 104|  20|96.7|0.00|1840|33.1|64.2|95.3|87.3|72.3|90.2|82.3|89.6| 100|
| D-6/3/90|36924|1.50|8.0|242|496|176|64.8|4.0|2110| 7.9|   ?| 236|57.6| 4.5|2020| 7.8|   ?| 372|  88|68.2| 0.2|2250| 7.6|  19| 108|  22|65.9|0.02|2120|   ?|62.7|95.6|   ?|71.0|92.1|78.2|87.5|99.5|
| D-7/3/90|38572|3.00|7.8|202|372|186|68.8|4.5|1644| 7.8|   ?| 248|66.1| 8.5|1762| 7.7| 150| 460| 100|76.0| 0.3|1768| 7.5|  20| 100|  28|82.1|0.00|1764|   ?|59.7|96.5|86.7|78.3|90.1|73.1|84.9| 100|
| D-8/3/90|41115|6.00|7.8|  ?|552|262|64.1|5.0|1603| 7.8|   ?| 320|67.5| 6.5|1608| 7.8| 192| 376| 122|72.1| 0.4|1668| 7.5|  21|  76|  26|84.6|0.05|1703|   ?|61.9|93.8|89.1|79.8|   ?|86.2|90.1|99.0|
| D-9/3/90|36107|5.00|7.7|215|489|334|40.7|6.0|1613| 7.6|   ?| 304|53.9| 8.0|1557| 7.6| 181| 350|  90|71.1| 0.4|1596| 7.5|  17| 162|  18|66.7|0.00|1606|   ?|70.4|95.6|90.6|53.7|92.1|66.9|94.6| 100|
|D-11/3/90|29156|2.50|7.7|206|451|194|69.1|4.5|1249| 7.7| 206| 220|61.8| 4.0|1219| 7.7| 111| 282| 124|77.4| 0.3|1233| 7.5|  16| 118|  19|84.2|0.03|1338|46.1|43.6|92.5|85.6|58.2|92.2|73.8|90.2|99.4|
|D-12/3/90|39246|2.00|7.8|172|506|200|69.0|5.0|1865| 7.8| 208| 248|66.1| 6.5|1929| 7.8| 164| 463| 100|78.0| 0.6|1825| 7.6|  19| 157|  27|87.0|0.02|1616|21.2|59.7|90.8|88.4|66.1|89.0|69.0|86.5|99.6|
|D-13/3/90|42393|0.70|7.9|189|478|230|67.0|5.5|1410| 8.1| 173| 192|62.5| 5.0|1406| 7.7| 172| 412| 104|71.2| 0.4|1562| 7.6| 152| 306| 131|79.6|3.50|1575| 0.6|45.8|92.0|11.6|25.7|19.6|36.0|43.0|36.4|
|D-14/3/90|42857|1.50|7.7|238|319|292|33.8|3.5|1261| 7.6| 170| 268|31.3| 4.2|1204| 7.6| 116| 276| 104|51.9| 0.3|1261| 7.4| 320| 350| 238|73.9|2.00|1304|31.8|61.2|92.9|   ?|   ?|   ?|   ?|18.5|42.9|
|D-15/3/90|42911|0.70|7.6|114|252|116|58.6|1.2|1238| 7.9| 148| 136|64.7| 3.0|1208| 7.7|  79| 216|  70|82.9| 0.3|1177| 7.5|  84| 172| 104|78.8|0.06|1221|46.6|48.5|91.7|   ?|20.4|26.3|31.7|10.3|95.4|
|D-16/3/90|40376|   ?|8.1|204|333|174|67.8|3.0|2390| 7.8| 231| 156|74.4| 2.5|2540| 7.8| 136| 325|  78|79.5| 0.4|2580| 7.6|  32| 153|  98|87.8|0.00|2550|41.1|50.0|84.0|76.5|52.9|84.3|54.1|43.7| 100|
|D-18/3/90|40923|3.50|7.6|146|329|188|57.4|2.5|1300| 7.6| 162| 132|63.6| 2.0|1324| 7.6| 109| 243|  88|81.8| 0.2|1467| 7.5|  19|  94|  41|82.9|0.02|1545|32.7|33.3|90.0|82.6|61.3|87.0|71.4|78.2|99.2|
|D-19/3/90|43830| 1.5|7.8|177|512|214|58.9|5.5|1605| 7.7| 164| 256|71.9| 5.5|1599| 7.7| 118| 320|  70|88.6| 0.4|1401| 7.6|  25| 203|  20|85.0|0.00|1110|28.0|72.7|92.7|78.8|36.6|85.9|60.4|90.7| 100|
|D-20/3/90|39165|1.20|7.4|250|447|252|61.1|7.0|1533| 7.4| 275| 216|57.4| 6.5|1501| 7.4| 138| 269|  90|73.3| 0.5|1458| 7.3|  14|   9|  20|82.5|0.00|1402|49.8|58.3|92.3|89.9|96.8|94.4|98.1|92.1| 100|
|D-21/3/90|35791|1.20|7.8|277|466|246|63.4|4.0|1556| 7.7|   ?| 288|65.3| 6.0|1846| 7.7| 166| 419| 174|80.5| 1.3|1664| 7.5|  24| 124|  26|82.7|0.03|1606|   ?|39.6|78.3|85.5|70.4|91.3|73.4|89.4|99.4|
|D-22/3/90|37419|1.20|7.6|219|446|222|61.3|5.5|1600| 7.7| 266| 240|70.0| 5.0|1645| 7.6| 172| 345| 102|84.3| 0.4|1670| 7.5|  42| 175|  53|84.2|0.02|1780|35.3|57.5|92.0|75.6|49.3|80.8|60.8|76.1|99.6|
|D-23/3/90|40983|3.00|7.6|182|431|214|57.0|7.0|1591| 7.5| 219| 248|58.1| 5.5|1473| 7.5| 175| 376|  88|65.9| 0.4|1537| 7.5|  23| 120|  25|68.0|0.00|1597|20.1|64.5|93.6|86.9|68.1|87.4|72.2|88.3| 100|
+---------+-----+----+---+---+---+---+----+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+

Print Schema

df = df_data
df.printSchema()
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: integer (nullable = true)
 |-- _c10: double (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: integer (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: integer (nullable = true)
 |-- _c16: double (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- _c20: string (nullable = true)
 |-- _c21: string (nullable = true)
 |-- _c22: integer (nullable = true)
 |-- _c23: string (nullable = true)
 |-- _c24: string (nullable = true)
 |-- _c25: string (nullable = true)
 |-- _c26: string (nullable = true)
 |-- _c27: string (nullable = true)
 |-- _c28: string (nullable = true)
 |-- _c29: string (nullable = true)
 |-- _c30: string (nullable = true)
 |-- _c31: string (nullable = true)
 |-- _c32: string (nullable = true)
 |-- _c33: string (nullable = true)
 |-- _c34: string (nullable = true)
 |-- _c35: string (nullable = true)
 |-- _c36: string (nullable = true)
 |-- _c37: string (nullable = true)
 |-- _c38: string (nullable = true)

Since, only the first column should be a string ( or datetime to be precise). We are going to change each column type to double.

from pyspark.sql.types import IntegerType,BooleanType,DateType, DoubleType
# Convert String to Integer Type
for i in range(1,38):
  df = df.withColumn('_c'+ str(i),df['_c' + str(i)].cast(DoubleType()))
df.printSchema()
|-- _c0: string (nullable = true)
 |-- _c1: double (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: double (nullable = true)
 |-- _c5: double (nullable = true)
 |-- _c6: double (nullable = true)
 |-- _c7: double (nullable = true)
 |-- _c8: double (nullable = true)
 |-- _c9: double (nullable = true)
 |-- _c10: double (nullable = true)
 |-- _c11: double (nullable = true)
 |-- _c12: double (nullable = true)
 |-- _c13: double (nullable = true)
 |-- _c14: double (nullable = true)
 |-- _c15: double (nullable = true)
 |-- _c16: double (nullable = true)
 |-- _c17: double (nullable = true)
 |-- _c18: double (nullable = true)
 |-- _c19: double (nullable = true)
 |-- _c20: double (nullable = true)
 |-- _c21: double (nullable = true)
 |-- _c22: double (nullable = true)
 |-- _c23: double (nullable = true)
 |-- _c24: double (nullable = true)
 |-- _c25: double (nullable = true)
 |-- _c26: double (nullable = true)
 |-- _c27: double (nullable = true)
 |-- _c28: double (nullable = true)
 |-- _c29: double (nullable = true)
 |-- _c30: double (nullable = true)
 |-- _c31: double (nullable = true)
 |-- _c32: double (nullable = true)
 |-- _c33: double (nullable = true)
 |-- _c34: double (nullable = true)
 |-- _c35: double (nullable = true)
 |-- _c36: double (nullable = true)
 |-- _c37: double (nullable = true)
 |-- _c38: double (nullable = true)

Now, let's parse the first column to datetime

i = 0
df = df.withColumn('_c'+ str(i),df['_c' + str(i)].cast(DateType()))
root
 |-- _c0: date (nullable = true)
.
.
.
 |-- _c38: double (nullable = true)

Deleting N/As

Before taking any decision, we are going to count how many N/As exist in our dataframe.

from pyspark.sql.functions import isnan, when, count, col
df_aux = df.drop('_c0')
df_aux.select([count(when(isnan(c), c)).alias(c) for c in df_aux.columns]).show()
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|_c1|_c2|_c3|_c4|_c5|_c6|_c7|_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|_c19|_c20|_c21|_c22|_c23|_c24|_c25|_c26|_c27|_c28|_c29|_c30|_c31|_c32|_c33|_c34|_c35|_c36|_c37|_c38|
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|  0|  0|  0|  0|  0|  0|  0|  0|  0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|   0|
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+

Now null values...

df_aux.select([count(when(isnull(c), c)).alias(c) for c in df_aux.columns]).show()
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|_c1|_c2|_c3|_c4|_c5|_c6|_c7|_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|_c19|_c20|_c21|_c22|_c23|_c24|_c25|_c26|_c27|_c28|_c29|_c30|_c31|_c32|_c33|_c34|_c35|_c36|_c37|_c38|
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| 18|  3|  0| 23|  6|  1| 11| 25|  0|   0|  40|   0|  11|  24|   0|   0|  28|   9|   2|  13|  25|   0|   1|  23|  18|   5|  17|  28|   1|  62|   4|  27|  40|  26|  36|  25|   8|  31|
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+

We have decided to drop null rows entirely

df = df_aux.na.drop(how="any")
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+
|_c1    |_c2 |_c3|_c4  |_c5  |_c6  |_c7 |_c8|_c9   |_c10|_c11 |_c12 |_c13|_c14|_c15  |_c16|_c17 |_c18 |_c19 |_c20|_c21|_c22  |_c23|_c24 |_c25 |_c26 |_c27|_c28|_c29  |_c30|_c31|_c32|_c33|_c34|_c35|_c36|_c37|_c38 |
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+
|35023.0|3.5 |7.9|205.0|588.0|192.0|65.6|4.5|2430.0|7.8 |236.0|268.0|73.1|8.5 |2280.0|7.8 |158.0|376.0|96.0 |77.1|0.4 |2060.0|7.6 |20.0 |104.0|20.0 |96.7|0.0 |1840.0|33.1|64.2|95.3|87.3|72.3|90.2|82.3|89.6|100.0|
|29156.0|2.5 |7.7|206.0|451.0|194.0|69.1|4.5|1249.0|7.7 |206.0|220.0|61.8|4.0 |1219.0|7.7 |111.0|282.0|124.0|77.4|0.3 |1233.0|7.5 |16.0 |118.0|19.0 |84.2|0.03|1338.0|46.1|43.6|92.5|85.6|58.2|92.2|73.8|90.2|99.4 |
|39246.0|2.0 |7.8|172.0|506.0|200.0|69.0|5.0|1865.0|7.8 |208.0|248.0|66.1|6.5 |1929.0|7.8 |164.0|463.0|100.0|78.0|0.6 |1825.0|7.6 |19.0 |157.0|27.0 |87.0|0.02|1616.0|21.2|59.7|90.8|88.4|66.1|89.0|69.0|86.5|99.6 |
|42393.0|0.7 |7.9|189.0|478.0|230.0|67.0|5.5|1410.0|8.1 |173.0|192.0|62.5|5.0 |1406.0|7.7 |172.0|412.0|104.0|71.2|0.4 |1562.0|7.6 |152.0|306.0|131.0|79.6|3.5 |1575.0|0.6 |45.8|92.0|11.6|25.7|19.6|36.0|43.0|36.4 |
|40923.0|3.5 |7.6|146.0|329.0|188.0|57.4|2.5|1300.0|7.6 |162.0|132.0|63.6|2.0 |1324.0|7.6 |109.0|243.0|88.0 |81.8|0.2 |1467.0|7.5 |19.0 |94.0 |41.0 |82.9|0.02|1545.0|32.7|33.3|90.0|82.6|61.3|87.0|71.4|78.2|99.2 |
|43830.0|1.5 |7.8|177.0|512.0|214.0|58.9|5.5|1605.0|7.7 |164.0|256.0|71.9|5.5 |1599.0|7.7 |118.0|320.0|70.0 |88.6|0.4 |1401.0|7.6 |25.0 |203.0|20.0 |85.0|0.0 |1110.0|28.0|72.7|92.7|78.8|36.6|85.9|60.4|90.7|100.0|
|39165.0|1.2 |7.4|250.0|447.0|252.0|61.1|7.0|1533.0|7.4 |275.0|216.0|57.4|6.5 |1501.0|7.4 |138.0|269.0|90.0 |73.3|0.5 |1458.0|7.3 |14.0 |9.0  |20.0 |82.5|0.0 |1402.0|49.8|58.3|92.3|89.9|96.8|94.4|98.1|92.1|100.0|
|37419.0|1.2 |7.6|219.0|446.0|222.0|61.3|5.5|1600.0|7.7 |266.0|240.0|70.0|5.0 |1645.0|7.6 |172.0|345.0|102.0|84.3|0.4 |1670.0|7.5 |42.0 |175.0|53.0 |84.2|0.02|1780.0|35.3|57.5|92.0|75.6|49.3|80.8|60.8|76.1|99.6 |
|40983.0|3.0 |7.6|182.0|431.0|214.0|57.0|7.0|1591.0|7.5 |219.0|248.0|58.1|5.5 |1473.0|7.5 |175.0|376.0|88.0 |65.9|0.4 |1537.0|7.5 |23.0 |120.0|25.0 |68.0|0.0 |1597.0|20.1|64.5|93.6|86.9|68.1|87.4|72.2|88.3|100.0|
|42217.0|8.5 |7.5|138.0|333.0|240.0|55.0|3.8|1087.0|7.5 |153.0|184.0|67.4|4.0 |1109.0|7.5 |108.0|194.0|82.0 |85.4|0.4 |1136.0|7.1 |16.0 |62.0 |17.0 |94.1|0.0 |1223.0|29.4|55.4|91.3|85.2|68.0|88.4|81.4|92.9|100.0|
|47665.0|1.2 |7.7|156.0|405.0|200.0|74.0|4.0|1856.0|7.6 |178.0|184.0|71.7|3.5 |1976.0|7.5 |128.0|302.0|92.0 |78.3|0.3 |1920.0|7.6 |19.0 |71.0 |23.0 |78.3|0.01|1706.0|28.1|50.0|91.4|85.2|76.5|87.8|82.5|88.5|99.8 |
|44314.0|3.0 |7.8|155.0|389.0|308.0|49.4|6.0|1927.0|7.7 |252.0|308.0|49.4|6.5 |2150.0|7.7 |121.0|302.0|108.0|72.2|0.6 |1950.0|7.6 |15.0 |87.0 |23.0 |69.7|0.0 |1869.0|52.0|64.9|90.8|87.6|71.2|90.3|77.6|92.5|100.0|
|40841.0|1.0 |7.6|179.0|389.0|168.0|69.0|3.5|1240.0|7.8 |202.0|272.0|72.1|6.0 |1381.0|7.8 |148.0|302.0|92.0 |78.3|0.3 |1425.0|7.9 |16.0 |83.0 |20.0 |85.0|0.0 |1416.0|26.7|66.2|95.0|89.2|72.5|91.1|78.7|88.1|100.0|
|41157.0|3.0 |8.0|145.0|398.0|192.0|66.7|4.5|2240.0|8.0 |213.0|240.0|61.7|6.0 |2010.0|8.0 |140.0|287.0|84.0 |78.6|0.4 |2270.0|7.8 |15.0 |87.0 |21.0 |81.0|0.0 |2290.0|34.3|65.0|94.2|89.3|69.7|89.7|78.1|89.1|100.0|
|40078.0|1.4 |7.9|198.0|464.0|228.0|64.9|4.6|1431.0|7.6 |243.0|272.0|64.7|7.5 |1606.0|7.8 |177.0|319.0|88.0 |81.8|0.2 |1556.0|7.8 |17.0 |102.0|22.0 |81.8|0.0 |1475.0|27.2|67.6|97.3|90.4|68.0|91.4|78.0|90.4|100.0|
|43080.0|4.25|7.8|95.0 |349.0|136.0|76.5|2.5|1063.0|7.8 |132.0|188.0|74.5|2.0 |1139.0|7.8 |123.0|317.0|98.0 |69.4|0.4 |1218.0|7.5 |19.0 |67.0 |24.0 |83.3|0.0 |1220.0|6.8 |47.9|80.0|84.6|78.9|80.0|80.8|82.4|100.0|
|29414.0|3.0 |7.6|160.0|374.0|168.0|69.0|3.1|1042.0|7.6 |220.0|246.0|69.9|4.6 |1057.0|7.6 |126.0|299.0|112.0|75.0|0.2 |1085.0|7.4 |19.0 |79.0 |28.0 |82.0|0.0 |1087.0|42.7|54.5|95.7|84.9|73.6|88.1|78.9|83.3|100.0|
|38568.0|0.7 |8.2|233.0|506.0|204.0|66.7|6.7|1692.0|8.3 |218.0|212.0|66.0|10.5|1614.0|7.9 |188.0|355.0|88.0 |81.8|0.2 |1516.0|7.5 |47.0 |116.0|59.0 |81.4|0.05|1483.0|13.8|58.5|98.6|75.0|67.3|79.8|77.1|71.1|99.3 |
|34193.0|2.0 |8.0|166.0|396.0|176.0|70.5|4.0|1265.0|8.0 |178.0|188.0|70.2|5.5 |1380.0|7.8 |165.0|368.0|90.0 |77.8|0.2 |1434.0|7.5 |26.0 |106.0|31.0 |83.9|0.0 |1442.0|7.3 |52.1|96.4|84.2|71.2|84.3|73.2|82.4|100.0|
|36332.0|3.5 |7.9|120.0|455.0|184.0|67.4|4.0|1224.0|8.1 |205.0|188.0|68.1|5.5 |1217.0|7.7 |168.0|333.0|90.0 |77.8|0.2 |1353.0|7.6 |24.0 |98.0 |32.0 |81.3|0.0 |1420.0|18.0|52.1|96.4|85.7|70.6|80.0|78.5|82.6|100.0|
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+
only showing top 20 rows

KMeans Model

df.describe().show()
+-------+-----------------+------------------+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+------------------+-------------------+-----------------+-----------------+------------------+-----------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+-----------------+-----------------+
|summary|              _c1|               _c2|                _c3|               _c4|               _c5|               _c6|               _c7|               _c8|               _c9|               _c10|              _c11|              _c12|              _c13|             _c14|              _c15|              _c16|              _c17|             _c18|              _c19|              _c20|              _c21|              _c22|               _c23|             _c24|             _c25|              _c26|             _c27|               _c28|              _c29|              _c30|              _c31|             _c32|             _c33|              _c34|             _c35|             _c36|             _c37|             _c38|
+-------+-----------------+------------------+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+------------------+-------------------+-----------------+-----------------+------------------+-----------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+-----------------+-----------------+
|  count|              380|               380|                380|               380|               380|               380|               380|               380|               380|                380|               380|               380|               380|              380|               380|               380|               380|              380|               380|               380|               380|               380|                380|              380|              380|               380|              380|                380|               380|               380|               380|              380|              380|               380|              380|              380|              380|              380|
|   mean|37371.67105263158|2.2772631578947404|  7.825263157894731|189.62894736842105|403.97105263157897|226.80526315789473| 60.85842105263157| 4.685789473684211|1468.4315789473685|  7.851052631578943|209.66842105263157|257.39473684210526|59.680263157894764|5.117105263157893|1484.6184210526317| 7.835263157894732|122.58947368421053|272.9842105263158| 93.91052631578947| 72.72315789473694|0.4142105263157895|1477.6921052631578|  7.720526315789466|18.73421052631579|84.13157894736842| 20.91315789473684|79.68368421052628|0.03613157894736828|1481.5210526315789|  39.1263157894737| 58.89921052631584|90.64736842105263| 84.1065789473684| 68.65078947368418|89.52815789473688|78.34157894736842| 89.6128947368421|99.14210526315776|
| stddev|6851.660908338489| 2.328609351883363|0.23660680499389936| 61.37377766311134|117.85509059364503|118.07798493502413|12.660505462095747|2.9078926638512526|393.60511271084596|0.22395803265084027| 71.91717276788036|147.33252456754101|12.733014654813468|3.577671452661805|398.60282263659286|0.1952433951140157|36.528803031438024|70.95343720605777|23.438853131795298|10.394320070958777|0.3781752576287504| 401.0942518686844|0.15341232777572877| 9.60735141555947|33.66306044704339|11.546639873835971|9.238464795265763|0.20034280266077764| 384.2781906221161|14.927089012623302|12.955976785539372|8.913158777278337|6.933349020128138|10.515216987456455|5.450973312552462| 8.22838305730927|6.084902120391749|4.131641951230141|
|    min|          10050.0|               0.1|                7.3|              48.0|             105.0|              98.0|              13.2|               0.4|             651.0|                7.3|              59.0|             104.0|               7.1|              1.0|             646.0|               7.3|              26.0|            100.0|              49.0|              23.8|               0.0|              85.0|                7.1|              5.0|              9.0|               8.0|             29.2|                0.0|             683.0|               0.6|              12.3|              7.7|             11.6|              25.7|             19.6|             36.0|             43.0|             36.4|
|    max|          60081.0|              19.1|                8.5|             438.0|             941.0|            1228.0|              84.8|              36.0|            3230.0|                8.5|             449.0|            1692.0|              93.5|             46.0|            3170.0|               8.3|             223.0|            463.0|             244.0|             100.0|               3.5|            3690.0|                8.1|            152.0|            306.0|             131.0|            100.0|                3.5|            3950.0|              79.1|              94.7|            100.0|             94.7|              96.8|             97.0|             98.1|             98.6|            100.0|
+-------+-----------------+------------------+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+------------------+-------------------+-----------------+-----------------+------------------+-----------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+-----------------+-----------------+

VectorAssembler

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
vecAssembler = VectorAssembler(
    inputCols=df.columns,
    outputCol="features"
)
v = vecAssembler.transform(df)
v.show()
**+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+--------------------+
|    _c1| _c2|_c3|  _c4|  _c5|  _c6| _c7|_c8|   _c9|_c10| _c11| _c12|_c13|_c14|  _c15|_c16| _c17| _c18| _c19|_c20|_c21|  _c22|_c23| _c24| _c25| _c26|_c27|_c28|  _c29|_c30|_c31|_c32|_c33|_c34|_c35|_c36|_c37| _c38|            features|
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+--------------------+
|35023.0| 3.5|7.9|205.0|588.0|192.0|65.6|4.5|2430.0| 7.8|236.0|268.0|73.1| 8.5|2280.0| 7.8|158.0|376.0| 96.0|77.1| 0.4|2060.0| 7.6| 20.0|104.0| 20.0|96.7| 0.0|1840.0|33.1|64.2|95.3|87.3|72.3|90.2|82.3|89.6|100.0|[35023.0,3.5,7.9,...|
|29156.0| 2.5|7.7|206.0|451.0|194.0|69.1|4.5|1249.0| 7.7|206.0|220.0|61.8| 4.0|1219.0| 7.7|111.0|282.0|124.0|77.4| 0.3|1233.0| 7.5| 16.0|118.0| 19.0|84.2|0.03|1338.0|46.1|43.6|92.5|85.6|58.2|92.2|73.8|90.2| 99.4|[29156.0,2.5,7.7,...|
|39246.0| 2.0|7.8|172.0|506.0|200.0|69.0|5.0|1865.0| 7.8|208.0|248.0|66.1| 6.5|1929.0| 7.8|164.0|463.0|100.0|78.0| 0.6|1825.0| 7.6| 19.0|157.0| 27.0|87.0|0.02|1616.0|21.2|59.7|90.8|88.4|66.1|89.0|69.0|86.5| 99.6|[39246.0,2.0,7.8,...|
|42393.0| 0.7|7.9|189.0|478.0|230.0|67.0|5.5|1410.0| 8.1|173.0|192.0|62.5| 5.0|1406.0| 7.7|172.0|412.0|104.0|71.2| 0.4|1562.0| 7.6|152.0|306.0|131.0|79.6| 3.5|1575.0| 0.6|45.8|92.0|11.6|25.7|19.6|36.0|43.0| 36.4|[42393.0,0.7,7.9,...|
|40923.0| 3.5|7.6|146.0|329.0|188.0|57.4|2.5|1300.0| 7.6|162.0|132.0|63.6| 2.0|1324.0| 7.6|109.0|243.0| 88.0|81.8| 0.2|1467.0| 7.5| 19.0| 94.0| 41.0|82.9|0.02|1545.0|32.7|33.3|90.0|82.6|61.3|87.0|71.4|78.2| 99.2|[40923.0,3.5,7.6,...|
|43830.0| 1.5|7.8|177.0|512.0|214.0|58.9|5.5|1605.0| 7.7|164.0|256.0|71.9| 5.5|1599.0| 7.7|118.0|320.0| 70.0|88.6| 0.4|1401.0| 7.6| 25.0|203.0| 20.0|85.0| 0.0|1110.0|28.0|72.7|92.7|78.8|36.6|85.9|60.4|90.7|100.0|[43830.0,1.5,7.8,...|
|39165.0| 1.2|7.4|250.0|447.0|252.0|61.1|7.0|1533.0| 7.4|275.0|216.0|57.4| 6.5|1501.0| 7.4|138.0|269.0| 90.0|73.3| 0.5|1458.0| 7.3| 14.0|  9.0| 20.0|82.5| 0.0|1402.0|49.8|58.3|92.3|89.9|96.8|94.4|98.1|92.1|100.0|[39165.0,1.2,7.4,...|
|37419.0| 1.2|7.6|219.0|446.0|222.0|61.3|5.5|1600.0| 7.7|266.0|240.0|70.0| 5.0|1645.0| 7.6|172.0|345.0|102.0|84.3| 0.4|1670.0| 7.5| 42.0|175.0| 53.0|84.2|0.02|1780.0|35.3|57.5|92.0|75.6|49.3|80.8|60.8|76.1| 99.6|[37419.0,1.2,7.6,...|
|40983.0| 3.0|7.6|182.0|431.0|214.0|57.0|7.0|1591.0| 7.5|219.0|248.0|58.1| 5.5|1473.0| 7.5|175.0|376.0| 88.0|65.9| 0.4|1537.0| 7.5| 23.0|120.0| 25.0|68.0| 0.0|1597.0|20.1|64.5|93.6|86.9|68.1|87.4|72.2|88.3|100.0|[40983.0,3.0,7.6,...|
|42217.0| 8.5|7.5|138.0|333.0|240.0|55.0|3.8|1087.0| 7.5|153.0|184.0|67.4| 4.0|1109.0| 7.5|108.0|194.0| 82.0|85.4| 0.4|1136.0| 7.1| 16.0| 62.0| 17.0|94.1| 0.0|1223.0|29.4|55.4|91.3|85.2|68.0|88.4|81.4|92.9|100.0|[42217.0,8.5,7.5,...|
|47665.0| 1.2|7.7|156.0|405.0|200.0|74.0|4.0|1856.0| 7.6|178.0|184.0|71.7| 3.5|1976.0| 7.5|128.0|302.0| 92.0|78.3| 0.3|1920.0| 7.6| 19.0| 71.0| 23.0|78.3|0.01|1706.0|28.1|50.0|91.4|85.2|76.5|87.8|82.5|88.5| 99.8|[47665.0,1.2,7.7,...|
|44314.0| 3.0|7.8|155.0|389.0|308.0|49.4|6.0|1927.0| 7.7|252.0|308.0|49.4| 6.5|2150.0| 7.7|121.0|302.0|108.0|72.2| 0.6|1950.0| 7.6| 15.0| 87.0| 23.0|69.7| 0.0|1869.0|52.0|64.9|90.8|87.6|71.2|90.3|77.6|92.5|100.0|[44314.0,3.0,7.8,...|
|40841.0| 1.0|7.6|179.0|389.0|168.0|69.0|3.5|1240.0| 7.8|202.0|272.0|72.1| 6.0|1381.0| 7.8|148.0|302.0| 92.0|78.3| 0.3|1425.0| 7.9| 16.0| 83.0| 20.0|85.0| 0.0|1416.0|26.7|66.2|95.0|89.2|72.5|91.1|78.7|88.1|100.0|[40841.0,1.0,7.6,...|
|41157.0| 3.0|8.0|145.0|398.0|192.0|66.7|4.5|2240.0| 8.0|213.0|240.0|61.7| 6.0|2010.0| 8.0|140.0|287.0| 84.0|78.6| 0.4|2270.0| 7.8| 15.0| 87.0| 21.0|81.0| 0.0|2290.0|34.3|65.0|94.2|89.3|69.7|89.7|78.1|89.1|100.0|[41157.0,3.0,8.0,...|
|40078.0| 1.4|7.9|198.0|464.0|228.0|64.9|4.6|1431.0| 7.6|243.0|272.0|64.7| 7.5|1606.0| 7.8|177.0|319.0| 88.0|81.8| 0.2|1556.0| 7.8| 17.0|102.0| 22.0|81.8| 0.0|1475.0|27.2|67.6|97.3|90.4|68.0|91.4|78.0|90.4|100.0|[40078.0,1.4,7.9,...|
|43080.0|4.25|7.8| 95.0|349.0|136.0|76.5|2.5|1063.0| 7.8|132.0|188.0|74.5| 2.0|1139.0| 7.8|123.0|317.0| 98.0|69.4| 0.4|1218.0| 7.5| 19.0| 67.0| 24.0|83.3| 0.0|1220.0| 6.8|47.9|80.0|84.6|78.9|80.0|80.8|82.4|100.0|[43080.0,4.25,7.8...|
|29414.0| 3.0|7.6|160.0|374.0|168.0|69.0|3.1|1042.0| 7.6|220.0|246.0|69.9| 4.6|1057.0| 7.6|126.0|299.0|112.0|75.0| 0.2|1085.0| 7.4| 19.0| 79.0| 28.0|82.0| 0.0|1087.0|42.7|54.5|95.7|84.9|73.6|88.1|78.9|83.3|100.0|[29414.0,3.0,7.6,...|
|38568.0| 0.7|8.2|233.0|506.0|204.0|66.7|6.7|1692.0| 8.3|218.0|212.0|66.0|10.5|1614.0| 7.9|188.0|355.0| 88.0|81.8| 0.2|1516.0| 7.5| 47.0|116.0| 59.0|81.4|0.05|1483.0|13.8|58.5|98.6|75.0|67.3|79.8|77.1|71.1| 99.3|[38568.0,0.7,8.2,...|
|34193.0| 2.0|8.0|166.0|396.0|176.0|70.5|4.0|1265.0| 8.0|178.0|188.0|70.2| 5.5|1380.0| 7.8|165.0|368.0| 90.0|77.8| 0.2|1434.0| 7.5| 26.0|106.0| 31.0|83.9| 0.0|1442.0| 7.3|52.1|96.4|84.2|71.2|84.3|73.2|82.4|100.0|[34193.0,2.0,8.0,...|
|36332.0| 3.5|7.9|120.0|455.0|184.0|67.4|4.0|1224.0| 8.1|205.0|188.0|68.1| 5.5|1217.0| 7.7|168.0|333.0| 90.0|77.8| 0.2|1353.0| 7.6| 24.0| 98.0| 32.0|81.3| 0.0|1420.0|18.0|52.1|96.4|85.7|70.6|80.0|78.5|82.6|100.0|[36332.0,3.5,7.9,...|
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+--------------------+**

Standarize data

from pyspark.ml.feature import StandardScaler
inputcols2 = df.columns
scaler = StandardScaler(inputCol='features', outputCol="scaledFeatures",
                        withStd=True, withMean=True)
# Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(v)

# Normalize each feature to have unit standard deviation.
scaledData = scalerModel.transform(v)
scaledData.show()
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+--------------------+--------------------+
|    _c1| _c2|_c3|  _c4|  _c5|  _c6| _c7|_c8|   _c9|_c10| _c11| _c12|_c13|_c14|  _c15|_c16| _c17| _c18| _c19|_c20|_c21|  _c22|_c23| _c24| _c25| _c26|_c27|_c28|  _c29|_c30|_c31|_c32|_c33|_c34|_c35|_c36|_c37| _c38|            features|      scaledFeatures|
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+--------------------+--------------------+
|35023.0| 3.5|7.9|205.0|588.0|192.0|65.6|4.5|2430.0| 7.8|236.0|268.0|73.1| 8.5|2280.0| 7.8|158.0|376.0| 96.0|77.1| 0.4|2060.0| 7.6| 20.0|104.0| 20.0|96.7| 0.0|1840.0|33.1|64.2|95.3|87.3|72.3|90.2|82.3|89.6|100.0|[35023.0,3.5,7.9,...|[-0.3427885711292...|
|29156.0| 2.5|7.7|206.0|451.0|194.0|69.1|4.5|1249.0| 7.7|206.0|220.0|61.8| 4.0|1219.0| 7.7|111.0|282.0|124.0|77.4| 0.3|1233.0| 7.5| 16.0|118.0| 19.0|84.2|0.03|1338.0|46.1|43.6|92.5|85.6|58.2|92.2|73.8|90.2| 99.4|[29156.0,2.5,7.7,...|[-1.1990772985617...|
|39246.0| 2.0|7.8|172.0|506.0|200.0|69.0|5.0|1865.0| 7.8|208.0|248.0|66.1| 6.5|1929.0| 7.8|164.0|463.0|100.0|78.0| 0.6|1825.0| 7.6| 19.0|157.0| 27.0|87.0|0.02|1616.0|21.2|59.7|90.8|88.4|66.1|89.0|69.0|86.5| 99.6|[39246.0,2.0,7.8,...|[0.27355833460575...|
|42393.0| 0.7|7.9|189.0|478.0|230.0|67.0|5.5|1410.0| 8.1|173.0|192.0|62.5| 5.0|1406.0| 7.7|172.0|412.0|104.0|71.2| 0.4|1562.0| 7.6|152.0|306.0|131.0|79.6| 3.5|1575.0| 0.6|45.8|92.0|11.6|25.7|19.6|36.0|43.0| 36.4|[42393.0,0.7,7.9,...|[0.73286302613975...|
|40923.0| 3.5|7.6|146.0|329.0|188.0|57.4|2.5|1300.0| 7.6|162.0|132.0|63.6| 2.0|1324.0| 7.6|109.0|243.0| 88.0|81.8| 0.2|1467.0| 7.5| 19.0| 94.0| 41.0|82.9|0.02|1545.0|32.7|33.3|90.0|82.6|61.3|87.0|71.4|78.2| 99.2|[40923.0,3.5,7.6,...|[0.51831650673874...|
|43830.0| 1.5|7.8|177.0|512.0|214.0|58.9|5.5|1605.0| 7.7|164.0|256.0|71.9| 5.5|1599.0| 7.7|118.0|320.0| 70.0|88.6| 0.4|1401.0| 7.6| 25.0|203.0| 20.0|85.0| 0.0|1110.0|28.0|72.7|92.7|78.8|36.6|85.9|60.4|90.7|100.0|[43830.0,1.5,7.8,...|[0.94259319510523...|
|39165.0| 1.2|7.4|250.0|447.0|252.0|61.1|7.0|1533.0| 7.4|275.0|216.0|57.4| 6.5|1501.0| 7.4|138.0|269.0| 90.0|73.3| 0.5|1458.0| 7.3| 14.0|  9.0| 20.0|82.5| 0.0|1402.0|49.8|58.3|92.3|89.9|96.8|94.4|98.1|92.1|100.0|[39165.0,1.2,7.4,...|[0.26173638353671...|
|37419.0| 1.2|7.6|219.0|446.0|222.0|61.3|5.5|1600.0| 7.7|266.0|240.0|70.0| 5.0|1645.0| 7.6|172.0|345.0|102.0|84.3| 0.4|1670.0| 7.5| 42.0|175.0| 53.0|84.2|0.02|1780.0|35.3|57.5|92.0|75.6|49.3|80.8|60.8|76.1| 99.6|[37419.0,1.2,7.6,...|[0.00690766049306...|
|40983.0| 3.0|7.6|182.0|431.0|214.0|57.0|7.0|1591.0| 7.5|219.0|248.0|58.1| 5.5|1473.0| 7.5|175.0|376.0| 88.0|65.9| 0.4|1537.0| 7.5| 23.0|120.0| 25.0|68.0| 0.0|1597.0|20.1|64.5|93.6|86.9|68.1|87.4|72.2|88.3|100.0|[40983.0,3.0,7.6,...|[0.52707350753062...|
|42217.0| 8.5|7.5|138.0|333.0|240.0|55.0|3.8|1087.0| 7.5|153.0|184.0|67.4| 4.0|1109.0| 7.5|108.0|194.0| 82.0|85.4| 0.4|1136.0| 7.1| 16.0| 62.0| 17.0|94.1| 0.0|1223.0|29.4|55.4|91.3|85.2|68.0|88.4|81.4|92.9|100.0|[42217.0,8.5,7.5,...|[0.70717582381691...|
|47665.0| 1.2|7.7|156.0|405.0|200.0|74.0|4.0|1856.0| 7.6|178.0|184.0|71.7| 3.5|1976.0| 7.5|128.0|302.0| 92.0|78.3| 0.3|1920.0| 7.6| 19.0| 71.0| 23.0|78.3|0.01|1706.0|28.1|50.0|91.4|85.2|76.5|87.8|82.5|88.5| 99.8|[47665.0,1.2,7.7,...|[1.50231149571943...|
|44314.0| 3.0|7.8|155.0|389.0|308.0|49.4|6.0|1927.0| 7.7|252.0|308.0|49.4| 6.5|2150.0| 7.7|121.0|302.0|108.0|72.2| 0.6|1950.0| 7.6| 15.0| 87.0| 23.0|69.7| 0.0|1869.0|52.0|64.9|90.8|87.6|71.2|90.3|77.6|92.5|100.0|[44314.0,3.0,7.8,...|[1.01323300149305...|
|40841.0| 1.0|7.6|179.0|389.0|168.0|69.0|3.5|1240.0| 7.8|202.0|272.0|72.1| 6.0|1381.0| 7.8|148.0|302.0| 92.0|78.3| 0.3|1425.0| 7.9| 16.0| 83.0| 20.0|85.0| 0.0|1416.0|26.7|66.2|95.0|89.2|72.5|91.1|78.7|88.1|100.0|[40841.0,1.0,7.6,...|[0.50634860565650...|
|41157.0| 3.0|8.0|145.0|398.0|192.0|66.7|4.5|2240.0| 8.0|213.0|240.0|61.7| 6.0|2010.0| 8.0|140.0|287.0| 84.0|78.6| 0.4|2270.0| 7.8| 15.0| 87.0| 21.0|81.0| 0.0|2290.0|34.3|65.0|94.2|89.3|69.7|89.7|78.1|89.1|100.0|[41157.0,3.0,8.0,...|[0.55246880982706...|
|40078.0| 1.4|7.9|198.0|464.0|228.0|64.9|4.6|1431.0| 7.6|243.0|272.0|64.7| 7.5|1606.0| 7.8|177.0|319.0| 88.0|81.8| 0.2|1556.0| 7.8| 17.0|102.0| 22.0|81.8| 0.0|1475.0|27.2|67.6|97.3|90.4|68.0|91.4|78.0|90.4|100.0|[40078.0,1.4,7.9,...|[0.39498874558646...|
|43080.0|4.25|7.8| 95.0|349.0|136.0|76.5|2.5|1063.0| 7.8|132.0|188.0|74.5| 2.0|1139.0| 7.8|123.0|317.0| 98.0|69.4| 0.4|1218.0| 7.5| 19.0| 67.0| 24.0|83.3| 0.0|1220.0| 6.8|47.9|80.0|84.6|78.9|80.0|80.8|82.4|100.0|[43080.0,4.25,7.8...|[0.83313068520675...|
|29414.0| 3.0|7.6|160.0|374.0|168.0|69.0|3.1|1042.0| 7.6|220.0|246.0|69.9| 4.6|1057.0| 7.6|126.0|299.0|112.0|75.0| 0.2|1085.0| 7.4| 19.0| 79.0| 28.0|82.0| 0.0|1087.0|42.7|54.5|95.7|84.9|73.6|88.1|78.9|83.3|100.0|[29414.0,3.0,7.6,...|[-1.1614221951566...|
|38568.0| 0.7|8.2|233.0|506.0|204.0|66.7|6.7|1692.0| 8.3|218.0|212.0|66.0|10.5|1614.0| 7.9|188.0|355.0| 88.0|81.8| 0.2|1516.0| 7.5| 47.0|116.0| 59.0|81.4|0.05|1483.0|13.8|58.5|98.6|75.0|67.3|79.8|77.1|71.1| 99.3|[38568.0,0.7,8.2,...|[0.17460422565752...|
|34193.0| 2.0|8.0|166.0|396.0|176.0|70.5|4.0|1265.0| 8.0|178.0|188.0|70.2| 5.5|1380.0| 7.8|165.0|368.0| 90.0|77.8| 0.2|1434.0| 7.5| 26.0|106.0| 31.0|83.9| 0.0|1442.0| 7.3|52.1|96.4|84.2|71.2|84.3|73.2|82.4|100.0|[34193.0,2.0,8.0,...|[-0.4639270820835...|
|36332.0| 3.5|7.9|120.0|455.0|184.0|67.4|4.0|1224.0| 8.1|205.0|188.0|68.1| 5.5|1217.0| 7.7|168.0|333.0| 90.0|77.8| 0.2|1353.0| 7.6| 24.0| 98.0| 32.0|81.3| 0.0|1420.0|18.0|52.1|96.4|85.7|70.6|80.0|78.5|82.6|100.0|[36332.0,3.5,7.9,...|[-0.1517400038531...|
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+--------------------+--------------------+

SparkML

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
kmeans = KMeans().setK(4).setSeed(1)
model = kmeans.fit(scaledData)

Make predictions

predictions = model.transform(scaledData)

Evaluate clustering by computing Silhouette score

evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
ilhouette with squared euclidean distance = 0.7454468744909859

Show the results

# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
Cluster Centers: 
[3.27473399e+04 2.35556650e+00 7.80147783e+00 1.99857143e+02
 4.33714286e+02 2.23605911e+02 6.40591133e+01 4.90541872e+00
 1.49865517e+03 7.83349754e+00 2.19926108e+02 2.53487685e+02
 6.24507389e+01 5.39901478e+00 1.51449754e+03 7.83103448e+00
 1.29576355e+02 2.86660099e+02 9.53497537e+01 7.54472906e+01
 4.29064039e-01 1.51744828e+03 7.71625616e+00 1.88571429e+01
 8.53448276e+01 2.09359606e+01 8.00812808e+01 3.45812808e-02
 1.51224138e+03 3.89724138e+01 5.93418719e+01 9.13463054e+01
 8.49221675e+01 6.99300493e+01 9.00625616e+01 7.97216749e+01
 8.99798030e+01 9.91403941e+01]
[4.91368571e+04 2.04285714e+00 7.89642857e+00 1.56053571e+02
 3.05089286e+02 2.14392857e+02 5.31785714e+01 3.51964286e+00
 1.41028571e+03 7.91964286e+00 1.68964286e+02 2.38714286e+02
 5.21839286e+01 3.61428571e+00 1.43589286e+03 7.87678571e+00
 9.98035714e+01 2.25714286e+02 8.81428571e+01 6.56910714e+01
 3.69642857e-01 1.39700000e+03 7.75535714e+00 1.51250000e+01
 7.23392857e+01 1.73571429e+01 7.88160714e+01 1.71428571e-02
 1.43160714e+03 3.91785714e+01 5.67696429e+01 8.88767857e+01
 8.44196429e+01 6.71303571e+01 8.94785714e+01 7.55589286e+01
 9.02589286e+01 9.94857143e+01]
[4.03285169e+04 2.27313559e+00 7.83898305e+00 1.88500000e+02
 4.00449153e+02 2.35949153e+02 5.95169492e+01 4.87542373e+00
 1.44216102e+03 7.85677966e+00 2.11652542e+02 2.69644068e+02
 5.88618644e+01 5.35254237e+00 1.45742373e+03 7.83135593e+00
 1.22923729e+02 2.74855932e+02 9.45084746e+01 7.16042373e+01
 4.15254237e-01 1.45109322e+03 7.71355932e+00 2.03728814e+01
 8.83559322e+01 2.25932203e+01 7.94305085e+01 4.85593220e-02
 1.45549153e+03 3.86406780e+01 5.88372881e+01 9.01932203e+01
 8.27228814e+01 6.72805085e+01 8.85644068e+01 7.71593220e+01
 8.86406780e+01 9.89720339e+01]
[1.43653333e+04 1.51666667e+00 7.56666667e+00 1.68666667e+02
 3.75666667e+02 3.15333333e+02 4.04000000e+01 4.13333333e+00
 1.54200000e+03 7.53333333e+00 1.97333333e+02 3.88666667e+02
 4.43333333e+01 4.83333333e+00 1.44200000e+03 7.50000000e+00
 6.20000000e+01 1.56333333e+02 8.06666667e+01 6.36666667e+01
 2.00000000e-01 1.34000000e+03 7.63333333e+00 1.33333333e+01
 5.60000000e+01 1.96666667e+01 7.89333333e+01 6.66666667e-03
 1.35833333e+03 6.76666667e+01 7.11333333e+01 9.42666667e+01
 7.75000000e+01 6.43666667e+01 9.22000000e+01 8.34000000e+01
 9.09666667e+01 9.95333333e+01]
pred = model.summary.predictions
pred.show()
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+--------------------+--------------------+----------+
|    _c1| _c2|_c3|  _c4|  _c5|  _c6| _c7|_c8|   _c9|_c10| _c11| _c12|_c13|_c14|  _c15|_c16| _c17| _c18| _c19|_c20|_c21|  _c22|_c23| _c24| _c25| _c26|_c27|_c28|  _c29|_c30|_c31|_c32|_c33|_c34|_c35|_c36|_c37| _c38|            features|      scaledFeatures|prediction|
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+--------------------+--------------------+----------+
|35023.0| 3.5|7.9|205.0|588.0|192.0|65.6|4.5|2430.0| 7.8|236.0|268.0|73.1| 8.5|2280.0| 7.8|158.0|376.0| 96.0|77.1| 0.4|2060.0| 7.6| 20.0|104.0| 20.0|96.7| 0.0|1840.0|33.1|64.2|95.3|87.3|72.3|90.2|82.3|89.6|100.0|[35023.0,3.5,7.9,...|[-0.3427885711292...|         0|
|29156.0| 2.5|7.7|206.0|451.0|194.0|69.1|4.5|1249.0| 7.7|206.0|220.0|61.8| 4.0|1219.0| 7.7|111.0|282.0|124.0|77.4| 0.3|1233.0| 7.5| 16.0|118.0| 19.0|84.2|0.03|1338.0|46.1|43.6|92.5|85.6|58.2|92.2|73.8|90.2| 99.4|[29156.0,2.5,7.7,...|[-1.1990772985617...|         0|
|39246.0| 2.0|7.8|172.0|506.0|200.0|69.0|5.0|1865.0| 7.8|208.0|248.0|66.1| 6.5|1929.0| 7.8|164.0|463.0|100.0|78.0| 0.6|1825.0| 7.6| 19.0|157.0| 27.0|87.0|0.02|1616.0|21.2|59.7|90.8|88.4|66.1|89.0|69.0|86.5| 99.6|[39246.0,2.0,7.8,...|[0.27355833460575...|         2|
|42393.0| 0.7|7.9|189.0|478.0|230.0|67.0|5.5|1410.0| 8.1|173.0|192.0|62.5| 5.0|1406.0| 7.7|172.0|412.0|104.0|71.2| 0.4|1562.0| 7.6|152.0|306.0|131.0|79.6| 3.5|1575.0| 0.6|45.8|92.0|11.6|25.7|19.6|36.0|43.0| 36.4|[42393.0,0.7,7.9,...|[0.73286302613975...|         2|
|40923.0| 3.5|7.6|146.0|329.0|188.0|57.4|2.5|1300.0| 7.6|162.0|132.0|63.6| 2.0|1324.0| 7.6|109.0|243.0| 88.0|81.8| 0.2|1467.0| 7.5| 19.0| 94.0| 41.0|82.9|0.02|1545.0|32.7|33.3|90.0|82.6|61.3|87.0|71.4|78.2| 99.2|[40923.0,3.5,7.6,...|[0.51831650673874...|         2|
|43830.0| 1.5|7.8|177.0|512.0|214.0|58.9|5.5|1605.0| 7.7|164.0|256.0|71.9| 5.5|1599.0| 7.7|118.0|320.0| 70.0|88.6| 0.4|1401.0| 7.6| 25.0|203.0| 20.0|85.0| 0.0|1110.0|28.0|72.7|92.7|78.8|36.6|85.9|60.4|90.7|100.0|[43830.0,1.5,7.8,...|[0.94259319510523...|         2|
|39165.0| 1.2|7.4|250.0|447.0|252.0|61.1|7.0|1533.0| 7.4|275.0|216.0|57.4| 6.5|1501.0| 7.4|138.0|269.0| 90.0|73.3| 0.5|1458.0| 7.3| 14.0|  9.0| 20.0|82.5| 0.0|1402.0|49.8|58.3|92.3|89.9|96.8|94.4|98.1|92.1|100.0|[39165.0,1.2,7.4,...|[0.26173638353671...|         2|
|37419.0| 1.2|7.6|219.0|446.0|222.0|61.3|5.5|1600.0| 7.7|266.0|240.0|70.0| 5.0|1645.0| 7.6|172.0|345.0|102.0|84.3| 0.4|1670.0| 7.5| 42.0|175.0| 53.0|84.2|0.02|1780.0|35.3|57.5|92.0|75.6|49.3|80.8|60.8|76.1| 99.6|[37419.0,1.2,7.6,...|[0.00690766049306...|         2|
|40983.0| 3.0|7.6|182.0|431.0|214.0|57.0|7.0|1591.0| 7.5|219.0|248.0|58.1| 5.5|1473.0| 7.5|175.0|376.0| 88.0|65.9| 0.4|1537.0| 7.5| 23.0|120.0| 25.0|68.0| 0.0|1597.0|20.1|64.5|93.6|86.9|68.1|87.4|72.2|88.3|100.0|[40983.0,3.0,7.6,...|[0.52707350753062...|         2|
|42217.0| 8.5|7.5|138.0|333.0|240.0|55.0|3.8|1087.0| 7.5|153.0|184.0|67.4| 4.0|1109.0| 7.5|108.0|194.0| 82.0|85.4| 0.4|1136.0| 7.1| 16.0| 62.0| 17.0|94.1| 0.0|1223.0|29.4|55.4|91.3|85.2|68.0|88.4|81.4|92.9|100.0|[42217.0,8.5,7.5,...|[0.70717582381691...|         2|
|47665.0| 1.2|7.7|156.0|405.0|200.0|74.0|4.0|1856.0| 7.6|178.0|184.0|71.7| 3.5|1976.0| 7.5|128.0|302.0| 92.0|78.3| 0.3|1920.0| 7.6| 19.0| 71.0| 23.0|78.3|0.01|1706.0|28.1|50.0|91.4|85.2|76.5|87.8|82.5|88.5| 99.8|[47665.0,1.2,7.7,...|[1.50231149571943...|         1|
|44314.0| 3.0|7.8|155.0|389.0|308.0|49.4|6.0|1927.0| 7.7|252.0|308.0|49.4| 6.5|2150.0| 7.7|121.0|302.0|108.0|72.2| 0.6|1950.0| 7.6| 15.0| 87.0| 23.0|69.7| 0.0|1869.0|52.0|64.9|90.8|87.6|71.2|90.3|77.6|92.5|100.0|[44314.0,3.0,7.8,...|[1.01323300149305...|         2|
|40841.0| 1.0|7.6|179.0|389.0|168.0|69.0|3.5|1240.0| 7.8|202.0|272.0|72.1| 6.0|1381.0| 7.8|148.0|302.0| 92.0|78.3| 0.3|1425.0| 7.9| 16.0| 83.0| 20.0|85.0| 0.0|1416.0|26.7|66.2|95.0|89.2|72.5|91.1|78.7|88.1|100.0|[40841.0,1.0,7.6,...|[0.50634860565650...|         2|
|41157.0| 3.0|8.0|145.0|398.0|192.0|66.7|4.5|2240.0| 8.0|213.0|240.0|61.7| 6.0|2010.0| 8.0|140.0|287.0| 84.0|78.6| 0.4|2270.0| 7.8| 15.0| 87.0| 21.0|81.0| 0.0|2290.0|34.3|65.0|94.2|89.3|69.7|89.7|78.1|89.1|100.0|[41157.0,3.0,8.0,...|[0.55246880982706...|         2|
|40078.0| 1.4|7.9|198.0|464.0|228.0|64.9|4.6|1431.0| 7.6|243.0|272.0|64.7| 7.5|1606.0| 7.8|177.0|319.0| 88.0|81.8| 0.2|1556.0| 7.8| 17.0|102.0| 22.0|81.8| 0.0|1475.0|27.2|67.6|97.3|90.4|68.0|91.4|78.0|90.4|100.0|[40078.0,1.4,7.9,...|[0.39498874558646...|         2|
|43080.0|4.25|7.8| 95.0|349.0|136.0|76.5|2.5|1063.0| 7.8|132.0|188.0|74.5| 2.0|1139.0| 7.8|123.0|317.0| 98.0|69.4| 0.4|1218.0| 7.5| 19.0| 67.0| 24.0|83.3| 0.0|1220.0| 6.8|47.9|80.0|84.6|78.9|80.0|80.8|82.4|100.0|[43080.0,4.25,7.8...|[0.83313068520675...|         2|
|29414.0| 3.0|7.6|160.0|374.0|168.0|69.0|3.1|1042.0| 7.6|220.0|246.0|69.9| 4.6|1057.0| 7.6|126.0|299.0|112.0|75.0| 0.2|1085.0| 7.4| 19.0| 79.0| 28.0|82.0| 0.0|1087.0|42.7|54.5|95.7|84.9|73.6|88.1|78.9|83.3|100.0|[29414.0,3.0,7.6,...|[-1.1614221951566...|         0|
|38568.0| 0.7|8.2|233.0|506.0|204.0|66.7|6.7|1692.0| 8.3|218.0|212.0|66.0|10.5|1614.0| 7.9|188.0|355.0| 88.0|81.8| 0.2|1516.0| 7.5| 47.0|116.0| 59.0|81.4|0.05|1483.0|13.8|58.5|98.6|75.0|67.3|79.8|77.1|71.1| 99.3|[38568.0,0.7,8.2,...|[0.17460422565752...|         2|
|34193.0| 2.0|8.0|166.0|396.0|176.0|70.5|4.0|1265.0| 8.0|178.0|188.0|70.2| 5.5|1380.0| 7.8|165.0|368.0| 90.0|77.8| 0.2|1434.0| 7.5| 26.0|106.0| 31.0|83.9| 0.0|1442.0| 7.3|52.1|96.4|84.2|71.2|84.3|73.2|82.4|100.0|[34193.0,2.0,8.0,...|[-0.4639270820835...|         0|
|36332.0| 3.5|7.9|120.0|455.0|184.0|67.4|4.0|1224.0| 8.1|205.0|188.0|68.1| 5.5|1217.0| 7.7|168.0|333.0| 90.0|77.8| 0.2|1353.0| 7.6| 24.0| 98.0| 32.0|81.3| 0.0|1420.0|18.0|52.1|96.4|85.7|70.6|80.0|78.5|82.6|100.0|[36332.0,3.5,7.9,...|[-0.1517400038531...|         0|
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+--------------------+--------------------+----------+
pred.groupBy('prediction').count().show()
+----------+-----+
|prediction|count|
+----------+-----+
|         1|   56|
|         3|    3|
|         2|  118|
|         0|  203|
+----------+-----+

Visualization

Extreme pairplot DO NOT TRY AT HOME

import pandas as pd
pandasDF = pred.toPandas()
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image

%matplotlib inline
columns = []
for i in range(1, 38):
  columns += ['_c' + str(i)]
columns
sns_plot = sns.pairplot(pandasDF[columns], diag_kind='kde')
sns_plot.savefig("pairplot.png")
plt.clf() # Clean parirplot figure from sns 
Image(filename='pairplot.png') # Show pairplot as image

Screenshot 2022-03-25 at 13.54.45.png **Download full size:**

Plot Correlation

import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt


def plot_correlation(data):
    '''
    plot correlation's matrix to explore dependency between features 
    '''
    # init figure size
    rcParams['figure.figsize'] = 15, 20
    fig = plt.figure()
    sns.heatmap(data.corr(), annot=True, fmt=".2f")
    plt.show()
    fig.savefig('corr.png')

# load your data 
data  = pandasDF

# plot correlation & densities
plot_correlation(data)

corr.png

Density plot of features

import pandas as pd
from pylab import rcParams
import matplotlib.pyplot as plt


def plot_densities(data):
    '''
    Plot features densities depending on the outcome values
    '''
    # change fig size to fit all subplots beautifully 
    rcParams['figure.figsize'] = 30, 40

    # separate data based on outcome values 
    outcome_0 = data[data['prediction'] == 0]
    outcome_1 = data[data['prediction'] == 1]
    outcome_2 = data[data['prediction'] == 2]
    outcome_3 = data[data['prediction'] == 3]

    # init figure
    fig, axs = plt.subplots(38, 1)
    fig.suptitle('Features densities for different outcomes 0/1/2/3')
    plt.subplots_adjust(left = 0.25, right = 0.9, bottom = 0.1, top = 0.95,
                        wspace = 0.2, hspace = 0.95)

    # plot densities for outcomes
    for column_name in names[:-1]: 
        ax = axs[names.index(column_name)]
        #plt.subplot(4, 2, names.index(column_name) + 1)
        outcome_0[column_name].plot(kind='density', ax=ax, subplots=True, 
                                    sharex=False, color="red", legend=True,
                                    label=column_name + ' for Outcome = 0')
        outcome_1[column_name].plot(kind='density', ax=ax, subplots=True, 
                                     sharex=False, color="green", legend=True,
                                     label=column_name + ' for Outcome = 1')
        outcome_2[column_name].plot(kind='density', ax=ax, subplots=True, 
                                     sharex=False, color="black", legend=True,
                                     label=column_name + ' for Outcome = 2')
        outcome_3[column_name].plot(kind='density', ax=ax, subplots=True, 
                                     sharex=False, color="blue", legend=True,
                                     label=column_name + ' for Outcome = 3')
        ax.set_xlabel(column_name + ' values')
        ax.set_title(column_name + ' density')
        ax.grid('on')
    plt.show()
    fig.savefig('densities.png')

# load your data 
data  = pandasDF
names = list(data.columns)

# plot correlation & densities
plot_densities(data)

density.png

References