Apache Spark* is a dispatcher processing engine responsible for orchestrating, distributing, and monitoring applications that perform multiple data processing tasks on multiple worker machines, which form a cluster. As we have already mentioned, it is possible to read the data from different persistent storage solutions such as Amazon S3 or Google Storage, distributed storage systems such as HDFS, key-value systems such as Apache Cassandra, or message buses such as Kafka.
How does Spark work?
Spark does not store data itself, but rather has a focus on processing. This is one of the points that differentiate it from Hadoop, which includes both persistent storage (HDFS) and a processing system (MapReduce) in a very integrated way. It is important to talk about the processing speed: the key is the possibility that Spark offers to perform processing from memory. This, and the extension of the popular MapReduce to efficiently allow other types of operations: Interactive Queries and Streaming Processing.
Differences between Apache Spark & Hadoop
Apache Spark | Hadoop MapReduce |
Uses memory instead of disk for intermediate results. | In MapReduce, until the map phase completes its processing, the reducers do not start executing. Nor can you control their order of execution. |
The data type used by the core of Spark, known as Spark Core, is RDD or Resilient Distributed Datasets. | The intrinsic limitation of MapReduce is, in fact, the “one-way scalability” of its design. |
The graph scheduler, DAG, optimizes the stages in execution. | Due to the fixed cost incurred by each MapReduce job submitted, an application that requires low latency time or random access to a large set of data is infeasible. |
Basics of Apache Spark
First, install pyspark
!pip3 install pyspark
Note: This may take some time...be patient
Collecting pyspark
Downloading pyspark-3.2.1.tar.gz (281.4 MB)
|████████████████████████████████| 281.4 MB 35 kB/s
Collecting py4j==0.10.9.3
Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
|████████████████████████████████| 198 kB 68.2 MB/s
Building wheels for collected packages: pyspark
Building wheel for pyspark (setup.py) ... done
Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=d169ac1b6c62c56f835e15772f7c9402760bd12802709aabbdc6764782d71589
Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1
Spark Session
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.conf import SparkConf
conf = SparkConf().set('spark.ui.port', '4050')
sc = SparkContext(conf=conf)
spark = SparkSession \
.builder \
.appName('Basics') \
.getOrCreate()
We've just launched our first Spark context & session. Congratulations! 🎉
The dataset
We are going to work with the Water Treatment Plant Data Set from UCI Machine Learning Repository.
This dataset comes from the daily measures of sensors in a urban waste water treatment plant. The objective is to classify the operational state of the plant in order to predict faults through the state variables of the plant at each of the stages of the treatment process. This domain has been stated as an ill-structured domain.
Attributes Information:
All attributes are numeric and continuous
N. Attrib.
- 1 Q-E (input flow to plant)
- 2 ZN-E (input Zinc to plant)
- 3 PH-E (input pH to plant)
- 4 DBO-E (input Biological demand of oxygen to plant)
- 5 DQO-E (input chemical demand of oxygen to plant)
- 6 SS-E (input suspended solids to plant)
- 7 SSV-E (input volatile supended solids to plant)
- 8 SED-E (input sediments to plant)
- 9 COND-E (input conductivity to plant)
- 10 PH-P (input pH to primary settler)
- 11 DBO-P (input Biological demand of oxygen to primary settler)
- 12 SS-P (input suspended solids to primary settler)
- 13 SSV-P (input volatile supended solids to primary settler)
- 14 SED-P (input sediments to primary settler)
- 15 COND-P (input conductivity to primary settler)
- 16 PH-D (input pH to secondary settler)
- 17 DBO-D (input Biological demand of oxygen to secondary settler)
- 18 DQO-D (input chemical demand of oxygen to secondary settler)
- 19 SS-D (input suspended solids to secondary settler)
- 20 SSV-D (input volatile supended solids to secondary settler)
- 21 SED-D (input sediments to secondary settler)
- 22 COND-D (input conductivity to secondary settler)
- 23 PH-S (output pH)
- 24 DBO-S (output Biological demand of oxygen)
- 25 DQO-S (output chemical demand of oxygen)
- 26 SS-S (output suspended solids)
- 27 SSV-S (output volatile supended solids)
- 28 SED-S (output sediments)
- 29 COND-S (output conductivity)
- 30 RD-DBO-P (performance input Biological demand of oxygen in primary settler)
- 31 RD-SS-P (performance input suspended solids to primary settler)
- 32 RD-SED-P (performance input sediments to primary settler)
- 33 RD-DBO-S (performance input Biological demand of oxygen to secondary settler)
- 34 RD-DQO-S (performance input chemical demand of oxygen to secondary settler)
- 35 RD-DBO-G (global performance input Biological demand of oxygen)
- 36 RD-DQO-G (global performance input chemical demand of oxygen)
- 37 RD-SS-G (global performance input suspended solids)
- 38 RD-SED-G (global performance input sediments)
Load the dataset
We download it from the UCI repo and save it to our Spark DataFrame.
df_data = spark.read.csv(path='water-treatment.data', header=False, inferSchema=True)
Let's see our dataframe
df_data.show()
+---------+-----+----+---+---+---+---+----+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| _c0| _c1| _c2|_c3|_c4|_c5|_c6| _c7|_c8| _c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|_c19|_c20|_c21|_c22|_c23|_c24|_c25|_c26|_c27|_c28|_c29|_c30|_c31|_c32|_c33|_c34|_c35|_c36|_c37|_c38|
+---------+-----+----+---+---+---+---+----+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| D-1/3/90|44101|1.50|7.8| ?|407|166|66.3|4.5|2110| 7.9| ?| 228|70.2| 5.5|2120| 7.9| ?| 280| 94|72.3| 0.3|2010| 7.3| ?| 84| 21|81.0|0.02|2000| ?|58.8|95.5| ?|70.0| ?|79.4|87.3|99.6|
| D-2/3/90|39024|3.00|7.7| ?|443|214|69.2|6.5|2660| 7.7| ?| 244|75.4| 7.7|2570| 7.6| ?| 474| 96|79.2| 0.4|2700| 7.5| ?| 91| 17|94.1|0.00|2590| ?|60.7|94.8| ?|80.8| ?|79.5|92.1| 100|
| D-4/3/90|32229|5.00|7.6| ?|528|186|69.9|3.4|1666| 7.7| ?| 220|72.7| 4.5|1594| 7.7| ?| 272| 92|78.3| 0.2|1742| 7.6| ?| 128| 21| 81|0.05|1888| ?|58.2|95.6| ?|52.9| ?|75.8|88.7|98.5|
| D-5/3/90|35023|3.50|7.9|205|588|192|65.6|4.5|2430| 7.8| 236| 268|73.1| 8.5|2280| 7.8| 158| 376| 96|77.1| 0.4|2060| 7.6| 20| 104| 20|96.7|0.00|1840|33.1|64.2|95.3|87.3|72.3|90.2|82.3|89.6| 100|
| D-6/3/90|36924|1.50|8.0|242|496|176|64.8|4.0|2110| 7.9| ?| 236|57.6| 4.5|2020| 7.8| ?| 372| 88|68.2| 0.2|2250| 7.6| 19| 108| 22|65.9|0.02|2120| ?|62.7|95.6| ?|71.0|92.1|78.2|87.5|99.5|
| D-7/3/90|38572|3.00|7.8|202|372|186|68.8|4.5|1644| 7.8| ?| 248|66.1| 8.5|1762| 7.7| 150| 460| 100|76.0| 0.3|1768| 7.5| 20| 100| 28|82.1|0.00|1764| ?|59.7|96.5|86.7|78.3|90.1|73.1|84.9| 100|
| D-8/3/90|41115|6.00|7.8| ?|552|262|64.1|5.0|1603| 7.8| ?| 320|67.5| 6.5|1608| 7.8| 192| 376| 122|72.1| 0.4|1668| 7.5| 21| 76| 26|84.6|0.05|1703| ?|61.9|93.8|89.1|79.8| ?|86.2|90.1|99.0|
| D-9/3/90|36107|5.00|7.7|215|489|334|40.7|6.0|1613| 7.6| ?| 304|53.9| 8.0|1557| 7.6| 181| 350| 90|71.1| 0.4|1596| 7.5| 17| 162| 18|66.7|0.00|1606| ?|70.4|95.6|90.6|53.7|92.1|66.9|94.6| 100|
|D-11/3/90|29156|2.50|7.7|206|451|194|69.1|4.5|1249| 7.7| 206| 220|61.8| 4.0|1219| 7.7| 111| 282| 124|77.4| 0.3|1233| 7.5| 16| 118| 19|84.2|0.03|1338|46.1|43.6|92.5|85.6|58.2|92.2|73.8|90.2|99.4|
|D-12/3/90|39246|2.00|7.8|172|506|200|69.0|5.0|1865| 7.8| 208| 248|66.1| 6.5|1929| 7.8| 164| 463| 100|78.0| 0.6|1825| 7.6| 19| 157| 27|87.0|0.02|1616|21.2|59.7|90.8|88.4|66.1|89.0|69.0|86.5|99.6|
|D-13/3/90|42393|0.70|7.9|189|478|230|67.0|5.5|1410| 8.1| 173| 192|62.5| 5.0|1406| 7.7| 172| 412| 104|71.2| 0.4|1562| 7.6| 152| 306| 131|79.6|3.50|1575| 0.6|45.8|92.0|11.6|25.7|19.6|36.0|43.0|36.4|
|D-14/3/90|42857|1.50|7.7|238|319|292|33.8|3.5|1261| 7.6| 170| 268|31.3| 4.2|1204| 7.6| 116| 276| 104|51.9| 0.3|1261| 7.4| 320| 350| 238|73.9|2.00|1304|31.8|61.2|92.9| ?| ?| ?| ?|18.5|42.9|
|D-15/3/90|42911|0.70|7.6|114|252|116|58.6|1.2|1238| 7.9| 148| 136|64.7| 3.0|1208| 7.7| 79| 216| 70|82.9| 0.3|1177| 7.5| 84| 172| 104|78.8|0.06|1221|46.6|48.5|91.7| ?|20.4|26.3|31.7|10.3|95.4|
|D-16/3/90|40376| ?|8.1|204|333|174|67.8|3.0|2390| 7.8| 231| 156|74.4| 2.5|2540| 7.8| 136| 325| 78|79.5| 0.4|2580| 7.6| 32| 153| 98|87.8|0.00|2550|41.1|50.0|84.0|76.5|52.9|84.3|54.1|43.7| 100|
|D-18/3/90|40923|3.50|7.6|146|329|188|57.4|2.5|1300| 7.6| 162| 132|63.6| 2.0|1324| 7.6| 109| 243| 88|81.8| 0.2|1467| 7.5| 19| 94| 41|82.9|0.02|1545|32.7|33.3|90.0|82.6|61.3|87.0|71.4|78.2|99.2|
|D-19/3/90|43830| 1.5|7.8|177|512|214|58.9|5.5|1605| 7.7| 164| 256|71.9| 5.5|1599| 7.7| 118| 320| 70|88.6| 0.4|1401| 7.6| 25| 203| 20|85.0|0.00|1110|28.0|72.7|92.7|78.8|36.6|85.9|60.4|90.7| 100|
|D-20/3/90|39165|1.20|7.4|250|447|252|61.1|7.0|1533| 7.4| 275| 216|57.4| 6.5|1501| 7.4| 138| 269| 90|73.3| 0.5|1458| 7.3| 14| 9| 20|82.5|0.00|1402|49.8|58.3|92.3|89.9|96.8|94.4|98.1|92.1| 100|
|D-21/3/90|35791|1.20|7.8|277|466|246|63.4|4.0|1556| 7.7| ?| 288|65.3| 6.0|1846| 7.7| 166| 419| 174|80.5| 1.3|1664| 7.5| 24| 124| 26|82.7|0.03|1606| ?|39.6|78.3|85.5|70.4|91.3|73.4|89.4|99.4|
|D-22/3/90|37419|1.20|7.6|219|446|222|61.3|5.5|1600| 7.7| 266| 240|70.0| 5.0|1645| 7.6| 172| 345| 102|84.3| 0.4|1670| 7.5| 42| 175| 53|84.2|0.02|1780|35.3|57.5|92.0|75.6|49.3|80.8|60.8|76.1|99.6|
|D-23/3/90|40983|3.00|7.6|182|431|214|57.0|7.0|1591| 7.5| 219| 248|58.1| 5.5|1473| 7.5| 175| 376| 88|65.9| 0.4|1537| 7.5| 23| 120| 25|68.0|0.00|1597|20.1|64.5|93.6|86.9|68.1|87.4|72.2|88.3| 100|
+---------+-----+----+---+---+---+---+----+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
Print Schema
df = df_data
df.printSchema()
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: double (nullable = true)
|-- _c4: string (nullable = true)
|-- _c5: string (nullable = true)
|-- _c6: string (nullable = true)
|-- _c7: string (nullable = true)
|-- _c8: string (nullable = true)
|-- _c9: integer (nullable = true)
|-- _c10: double (nullable = true)
|-- _c11: string (nullable = true)
|-- _c12: integer (nullable = true)
|-- _c13: string (nullable = true)
|-- _c14: string (nullable = true)
|-- _c15: integer (nullable = true)
|-- _c16: double (nullable = true)
|-- _c17: string (nullable = true)
|-- _c18: string (nullable = true)
|-- _c19: string (nullable = true)
|-- _c20: string (nullable = true)
|-- _c21: string (nullable = true)
|-- _c22: integer (nullable = true)
|-- _c23: string (nullable = true)
|-- _c24: string (nullable = true)
|-- _c25: string (nullable = true)
|-- _c26: string (nullable = true)
|-- _c27: string (nullable = true)
|-- _c28: string (nullable = true)
|-- _c29: string (nullable = true)
|-- _c30: string (nullable = true)
|-- _c31: string (nullable = true)
|-- _c32: string (nullable = true)
|-- _c33: string (nullable = true)
|-- _c34: string (nullable = true)
|-- _c35: string (nullable = true)
|-- _c36: string (nullable = true)
|-- _c37: string (nullable = true)
|-- _c38: string (nullable = true)
Since, only the first column should be a string ( or datetime to be precise). We are going to change each column type to double.
from pyspark.sql.types import IntegerType,BooleanType,DateType, DoubleType
# Convert String to Integer Type
for i in range(1,38):
df = df.withColumn('_c'+ str(i),df['_c' + str(i)].cast(DoubleType()))
df.printSchema()
|-- _c0: string (nullable = true)
|-- _c1: double (nullable = true)
|-- _c2: double (nullable = true)
|-- _c3: double (nullable = true)
|-- _c4: double (nullable = true)
|-- _c5: double (nullable = true)
|-- _c6: double (nullable = true)
|-- _c7: double (nullable = true)
|-- _c8: double (nullable = true)
|-- _c9: double (nullable = true)
|-- _c10: double (nullable = true)
|-- _c11: double (nullable = true)
|-- _c12: double (nullable = true)
|-- _c13: double (nullable = true)
|-- _c14: double (nullable = true)
|-- _c15: double (nullable = true)
|-- _c16: double (nullable = true)
|-- _c17: double (nullable = true)
|-- _c18: double (nullable = true)
|-- _c19: double (nullable = true)
|-- _c20: double (nullable = true)
|-- _c21: double (nullable = true)
|-- _c22: double (nullable = true)
|-- _c23: double (nullable = true)
|-- _c24: double (nullable = true)
|-- _c25: double (nullable = true)
|-- _c26: double (nullable = true)
|-- _c27: double (nullable = true)
|-- _c28: double (nullable = true)
|-- _c29: double (nullable = true)
|-- _c30: double (nullable = true)
|-- _c31: double (nullable = true)
|-- _c32: double (nullable = true)
|-- _c33: double (nullable = true)
|-- _c34: double (nullable = true)
|-- _c35: double (nullable = true)
|-- _c36: double (nullable = true)
|-- _c37: double (nullable = true)
|-- _c38: double (nullable = true)
Now, let's parse the first column to datetime
i = 0
df = df.withColumn('_c'+ str(i),df['_c' + str(i)].cast(DateType()))
root
|-- _c0: date (nullable = true)
.
.
.
|-- _c38: double (nullable = true)
Deleting N/As
Before taking any decision, we are going to count how many N/As exist in our dataframe.
from pyspark.sql.functions import isnan, when, count, col
df_aux = df.drop('_c0')
df_aux.select([count(when(isnan(c), c)).alias(c) for c in df_aux.columns]).show()
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|_c1|_c2|_c3|_c4|_c5|_c6|_c7|_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|_c19|_c20|_c21|_c22|_c23|_c24|_c25|_c26|_c27|_c28|_c29|_c30|_c31|_c32|_c33|_c34|_c35|_c36|_c37|_c38|
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
Now null values...
df_aux.select([count(when(isnull(c), c)).alias(c) for c in df_aux.columns]).show()
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|_c1|_c2|_c3|_c4|_c5|_c6|_c7|_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|_c19|_c20|_c21|_c22|_c23|_c24|_c25|_c26|_c27|_c28|_c29|_c30|_c31|_c32|_c33|_c34|_c35|_c36|_c37|_c38|
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| 18| 3| 0| 23| 6| 1| 11| 25| 0| 0| 40| 0| 11| 24| 0| 0| 28| 9| 2| 13| 25| 0| 1| 23| 18| 5| 17| 28| 1| 62| 4| 27| 40| 26| 36| 25| 8| 31|
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
We have decided to drop null rows entirely
df = df_aux.na.drop(how="any")
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+
|_c1 |_c2 |_c3|_c4 |_c5 |_c6 |_c7 |_c8|_c9 |_c10|_c11 |_c12 |_c13|_c14|_c15 |_c16|_c17 |_c18 |_c19 |_c20|_c21|_c22 |_c23|_c24 |_c25 |_c26 |_c27|_c28|_c29 |_c30|_c31|_c32|_c33|_c34|_c35|_c36|_c37|_c38 |
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+
|35023.0|3.5 |7.9|205.0|588.0|192.0|65.6|4.5|2430.0|7.8 |236.0|268.0|73.1|8.5 |2280.0|7.8 |158.0|376.0|96.0 |77.1|0.4 |2060.0|7.6 |20.0 |104.0|20.0 |96.7|0.0 |1840.0|33.1|64.2|95.3|87.3|72.3|90.2|82.3|89.6|100.0|
|29156.0|2.5 |7.7|206.0|451.0|194.0|69.1|4.5|1249.0|7.7 |206.0|220.0|61.8|4.0 |1219.0|7.7 |111.0|282.0|124.0|77.4|0.3 |1233.0|7.5 |16.0 |118.0|19.0 |84.2|0.03|1338.0|46.1|43.6|92.5|85.6|58.2|92.2|73.8|90.2|99.4 |
|39246.0|2.0 |7.8|172.0|506.0|200.0|69.0|5.0|1865.0|7.8 |208.0|248.0|66.1|6.5 |1929.0|7.8 |164.0|463.0|100.0|78.0|0.6 |1825.0|7.6 |19.0 |157.0|27.0 |87.0|0.02|1616.0|21.2|59.7|90.8|88.4|66.1|89.0|69.0|86.5|99.6 |
|42393.0|0.7 |7.9|189.0|478.0|230.0|67.0|5.5|1410.0|8.1 |173.0|192.0|62.5|5.0 |1406.0|7.7 |172.0|412.0|104.0|71.2|0.4 |1562.0|7.6 |152.0|306.0|131.0|79.6|3.5 |1575.0|0.6 |45.8|92.0|11.6|25.7|19.6|36.0|43.0|36.4 |
|40923.0|3.5 |7.6|146.0|329.0|188.0|57.4|2.5|1300.0|7.6 |162.0|132.0|63.6|2.0 |1324.0|7.6 |109.0|243.0|88.0 |81.8|0.2 |1467.0|7.5 |19.0 |94.0 |41.0 |82.9|0.02|1545.0|32.7|33.3|90.0|82.6|61.3|87.0|71.4|78.2|99.2 |
|43830.0|1.5 |7.8|177.0|512.0|214.0|58.9|5.5|1605.0|7.7 |164.0|256.0|71.9|5.5 |1599.0|7.7 |118.0|320.0|70.0 |88.6|0.4 |1401.0|7.6 |25.0 |203.0|20.0 |85.0|0.0 |1110.0|28.0|72.7|92.7|78.8|36.6|85.9|60.4|90.7|100.0|
|39165.0|1.2 |7.4|250.0|447.0|252.0|61.1|7.0|1533.0|7.4 |275.0|216.0|57.4|6.5 |1501.0|7.4 |138.0|269.0|90.0 |73.3|0.5 |1458.0|7.3 |14.0 |9.0 |20.0 |82.5|0.0 |1402.0|49.8|58.3|92.3|89.9|96.8|94.4|98.1|92.1|100.0|
|37419.0|1.2 |7.6|219.0|446.0|222.0|61.3|5.5|1600.0|7.7 |266.0|240.0|70.0|5.0 |1645.0|7.6 |172.0|345.0|102.0|84.3|0.4 |1670.0|7.5 |42.0 |175.0|53.0 |84.2|0.02|1780.0|35.3|57.5|92.0|75.6|49.3|80.8|60.8|76.1|99.6 |
|40983.0|3.0 |7.6|182.0|431.0|214.0|57.0|7.0|1591.0|7.5 |219.0|248.0|58.1|5.5 |1473.0|7.5 |175.0|376.0|88.0 |65.9|0.4 |1537.0|7.5 |23.0 |120.0|25.0 |68.0|0.0 |1597.0|20.1|64.5|93.6|86.9|68.1|87.4|72.2|88.3|100.0|
|42217.0|8.5 |7.5|138.0|333.0|240.0|55.0|3.8|1087.0|7.5 |153.0|184.0|67.4|4.0 |1109.0|7.5 |108.0|194.0|82.0 |85.4|0.4 |1136.0|7.1 |16.0 |62.0 |17.0 |94.1|0.0 |1223.0|29.4|55.4|91.3|85.2|68.0|88.4|81.4|92.9|100.0|
|47665.0|1.2 |7.7|156.0|405.0|200.0|74.0|4.0|1856.0|7.6 |178.0|184.0|71.7|3.5 |1976.0|7.5 |128.0|302.0|92.0 |78.3|0.3 |1920.0|7.6 |19.0 |71.0 |23.0 |78.3|0.01|1706.0|28.1|50.0|91.4|85.2|76.5|87.8|82.5|88.5|99.8 |
|44314.0|3.0 |7.8|155.0|389.0|308.0|49.4|6.0|1927.0|7.7 |252.0|308.0|49.4|6.5 |2150.0|7.7 |121.0|302.0|108.0|72.2|0.6 |1950.0|7.6 |15.0 |87.0 |23.0 |69.7|0.0 |1869.0|52.0|64.9|90.8|87.6|71.2|90.3|77.6|92.5|100.0|
|40841.0|1.0 |7.6|179.0|389.0|168.0|69.0|3.5|1240.0|7.8 |202.0|272.0|72.1|6.0 |1381.0|7.8 |148.0|302.0|92.0 |78.3|0.3 |1425.0|7.9 |16.0 |83.0 |20.0 |85.0|0.0 |1416.0|26.7|66.2|95.0|89.2|72.5|91.1|78.7|88.1|100.0|
|41157.0|3.0 |8.0|145.0|398.0|192.0|66.7|4.5|2240.0|8.0 |213.0|240.0|61.7|6.0 |2010.0|8.0 |140.0|287.0|84.0 |78.6|0.4 |2270.0|7.8 |15.0 |87.0 |21.0 |81.0|0.0 |2290.0|34.3|65.0|94.2|89.3|69.7|89.7|78.1|89.1|100.0|
|40078.0|1.4 |7.9|198.0|464.0|228.0|64.9|4.6|1431.0|7.6 |243.0|272.0|64.7|7.5 |1606.0|7.8 |177.0|319.0|88.0 |81.8|0.2 |1556.0|7.8 |17.0 |102.0|22.0 |81.8|0.0 |1475.0|27.2|67.6|97.3|90.4|68.0|91.4|78.0|90.4|100.0|
|43080.0|4.25|7.8|95.0 |349.0|136.0|76.5|2.5|1063.0|7.8 |132.0|188.0|74.5|2.0 |1139.0|7.8 |123.0|317.0|98.0 |69.4|0.4 |1218.0|7.5 |19.0 |67.0 |24.0 |83.3|0.0 |1220.0|6.8 |47.9|80.0|84.6|78.9|80.0|80.8|82.4|100.0|
|29414.0|3.0 |7.6|160.0|374.0|168.0|69.0|3.1|1042.0|7.6 |220.0|246.0|69.9|4.6 |1057.0|7.6 |126.0|299.0|112.0|75.0|0.2 |1085.0|7.4 |19.0 |79.0 |28.0 |82.0|0.0 |1087.0|42.7|54.5|95.7|84.9|73.6|88.1|78.9|83.3|100.0|
|38568.0|0.7 |8.2|233.0|506.0|204.0|66.7|6.7|1692.0|8.3 |218.0|212.0|66.0|10.5|1614.0|7.9 |188.0|355.0|88.0 |81.8|0.2 |1516.0|7.5 |47.0 |116.0|59.0 |81.4|0.05|1483.0|13.8|58.5|98.6|75.0|67.3|79.8|77.1|71.1|99.3 |
|34193.0|2.0 |8.0|166.0|396.0|176.0|70.5|4.0|1265.0|8.0 |178.0|188.0|70.2|5.5 |1380.0|7.8 |165.0|368.0|90.0 |77.8|0.2 |1434.0|7.5 |26.0 |106.0|31.0 |83.9|0.0 |1442.0|7.3 |52.1|96.4|84.2|71.2|84.3|73.2|82.4|100.0|
|36332.0|3.5 |7.9|120.0|455.0|184.0|67.4|4.0|1224.0|8.1 |205.0|188.0|68.1|5.5 |1217.0|7.7 |168.0|333.0|90.0 |77.8|0.2 |1353.0|7.6 |24.0 |98.0 |32.0 |81.3|0.0 |1420.0|18.0|52.1|96.4|85.7|70.6|80.0|78.5|82.6|100.0|
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+
only showing top 20 rows
KMeans Model
df.describe().show()
+-------+-----------------+------------------+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+------------------+-------------------+-----------------+-----------------+------------------+-----------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+-----------------+-----------------+
|summary| _c1| _c2| _c3| _c4| _c5| _c6| _c7| _c8| _c9| _c10| _c11| _c12| _c13| _c14| _c15| _c16| _c17| _c18| _c19| _c20| _c21| _c22| _c23| _c24| _c25| _c26| _c27| _c28| _c29| _c30| _c31| _c32| _c33| _c34| _c35| _c36| _c37| _c38|
+-------+-----------------+------------------+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+------------------+-------------------+-----------------+-----------------+------------------+-----------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+-----------------+-----------------+
| count| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380| 380|
| mean|37371.67105263158|2.2772631578947404| 7.825263157894731|189.62894736842105|403.97105263157897|226.80526315789473| 60.85842105263157| 4.685789473684211|1468.4315789473685| 7.851052631578943|209.66842105263157|257.39473684210526|59.680263157894764|5.117105263157893|1484.6184210526317| 7.835263157894732|122.58947368421053|272.9842105263158| 93.91052631578947| 72.72315789473694|0.4142105263157895|1477.6921052631578| 7.720526315789466|18.73421052631579|84.13157894736842| 20.91315789473684|79.68368421052628|0.03613157894736828|1481.5210526315789| 39.1263157894737| 58.89921052631584|90.64736842105263| 84.1065789473684| 68.65078947368418|89.52815789473688|78.34157894736842| 89.6128947368421|99.14210526315776|
| stddev|6851.660908338489| 2.328609351883363|0.23660680499389936| 61.37377766311134|117.85509059364503|118.07798493502413|12.660505462095747|2.9078926638512526|393.60511271084596|0.22395803265084027| 71.91717276788036|147.33252456754101|12.733014654813468|3.577671452661805|398.60282263659286|0.1952433951140157|36.528803031438024|70.95343720605777|23.438853131795298|10.394320070958777|0.3781752576287504| 401.0942518686844|0.15341232777572877| 9.60735141555947|33.66306044704339|11.546639873835971|9.238464795265763|0.20034280266077764| 384.2781906221161|14.927089012623302|12.955976785539372|8.913158777278337|6.933349020128138|10.515216987456455|5.450973312552462| 8.22838305730927|6.084902120391749|4.131641951230141|
| min| 10050.0| 0.1| 7.3| 48.0| 105.0| 98.0| 13.2| 0.4| 651.0| 7.3| 59.0| 104.0| 7.1| 1.0| 646.0| 7.3| 26.0| 100.0| 49.0| 23.8| 0.0| 85.0| 7.1| 5.0| 9.0| 8.0| 29.2| 0.0| 683.0| 0.6| 12.3| 7.7| 11.6| 25.7| 19.6| 36.0| 43.0| 36.4|
| max| 60081.0| 19.1| 8.5| 438.0| 941.0| 1228.0| 84.8| 36.0| 3230.0| 8.5| 449.0| 1692.0| 93.5| 46.0| 3170.0| 8.3| 223.0| 463.0| 244.0| 100.0| 3.5| 3690.0| 8.1| 152.0| 306.0| 131.0| 100.0| 3.5| 3950.0| 79.1| 94.7| 100.0| 94.7| 96.8| 97.0| 98.1| 98.6| 100.0|
+-------+-----------------+------------------+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+------------------+-------------------+-----------------+-----------------+------------------+-----------------+-------------------+------------------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+-----------------+-----------------+
VectorAssembler
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
vecAssembler = VectorAssembler(
inputCols=df.columns,
outputCol="features"
)
v = vecAssembler.transform(df)
v.show()
**+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+--------------------+
| _c1| _c2|_c3| _c4| _c5| _c6| _c7|_c8| _c9|_c10| _c11| _c12|_c13|_c14| _c15|_c16| _c17| _c18| _c19|_c20|_c21| _c22|_c23| _c24| _c25| _c26|_c27|_c28| _c29|_c30|_c31|_c32|_c33|_c34|_c35|_c36|_c37| _c38| features|
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+--------------------+
|35023.0| 3.5|7.9|205.0|588.0|192.0|65.6|4.5|2430.0| 7.8|236.0|268.0|73.1| 8.5|2280.0| 7.8|158.0|376.0| 96.0|77.1| 0.4|2060.0| 7.6| 20.0|104.0| 20.0|96.7| 0.0|1840.0|33.1|64.2|95.3|87.3|72.3|90.2|82.3|89.6|100.0|[35023.0,3.5,7.9,...|
|29156.0| 2.5|7.7|206.0|451.0|194.0|69.1|4.5|1249.0| 7.7|206.0|220.0|61.8| 4.0|1219.0| 7.7|111.0|282.0|124.0|77.4| 0.3|1233.0| 7.5| 16.0|118.0| 19.0|84.2|0.03|1338.0|46.1|43.6|92.5|85.6|58.2|92.2|73.8|90.2| 99.4|[29156.0,2.5,7.7,...|
|39246.0| 2.0|7.8|172.0|506.0|200.0|69.0|5.0|1865.0| 7.8|208.0|248.0|66.1| 6.5|1929.0| 7.8|164.0|463.0|100.0|78.0| 0.6|1825.0| 7.6| 19.0|157.0| 27.0|87.0|0.02|1616.0|21.2|59.7|90.8|88.4|66.1|89.0|69.0|86.5| 99.6|[39246.0,2.0,7.8,...|
|42393.0| 0.7|7.9|189.0|478.0|230.0|67.0|5.5|1410.0| 8.1|173.0|192.0|62.5| 5.0|1406.0| 7.7|172.0|412.0|104.0|71.2| 0.4|1562.0| 7.6|152.0|306.0|131.0|79.6| 3.5|1575.0| 0.6|45.8|92.0|11.6|25.7|19.6|36.0|43.0| 36.4|[42393.0,0.7,7.9,...|
|40923.0| 3.5|7.6|146.0|329.0|188.0|57.4|2.5|1300.0| 7.6|162.0|132.0|63.6| 2.0|1324.0| 7.6|109.0|243.0| 88.0|81.8| 0.2|1467.0| 7.5| 19.0| 94.0| 41.0|82.9|0.02|1545.0|32.7|33.3|90.0|82.6|61.3|87.0|71.4|78.2| 99.2|[40923.0,3.5,7.6,...|
|43830.0| 1.5|7.8|177.0|512.0|214.0|58.9|5.5|1605.0| 7.7|164.0|256.0|71.9| 5.5|1599.0| 7.7|118.0|320.0| 70.0|88.6| 0.4|1401.0| 7.6| 25.0|203.0| 20.0|85.0| 0.0|1110.0|28.0|72.7|92.7|78.8|36.6|85.9|60.4|90.7|100.0|[43830.0,1.5,7.8,...|
|39165.0| 1.2|7.4|250.0|447.0|252.0|61.1|7.0|1533.0| 7.4|275.0|216.0|57.4| 6.5|1501.0| 7.4|138.0|269.0| 90.0|73.3| 0.5|1458.0| 7.3| 14.0| 9.0| 20.0|82.5| 0.0|1402.0|49.8|58.3|92.3|89.9|96.8|94.4|98.1|92.1|100.0|[39165.0,1.2,7.4,...|
|37419.0| 1.2|7.6|219.0|446.0|222.0|61.3|5.5|1600.0| 7.7|266.0|240.0|70.0| 5.0|1645.0| 7.6|172.0|345.0|102.0|84.3| 0.4|1670.0| 7.5| 42.0|175.0| 53.0|84.2|0.02|1780.0|35.3|57.5|92.0|75.6|49.3|80.8|60.8|76.1| 99.6|[37419.0,1.2,7.6,...|
|40983.0| 3.0|7.6|182.0|431.0|214.0|57.0|7.0|1591.0| 7.5|219.0|248.0|58.1| 5.5|1473.0| 7.5|175.0|376.0| 88.0|65.9| 0.4|1537.0| 7.5| 23.0|120.0| 25.0|68.0| 0.0|1597.0|20.1|64.5|93.6|86.9|68.1|87.4|72.2|88.3|100.0|[40983.0,3.0,7.6,...|
|42217.0| 8.5|7.5|138.0|333.0|240.0|55.0|3.8|1087.0| 7.5|153.0|184.0|67.4| 4.0|1109.0| 7.5|108.0|194.0| 82.0|85.4| 0.4|1136.0| 7.1| 16.0| 62.0| 17.0|94.1| 0.0|1223.0|29.4|55.4|91.3|85.2|68.0|88.4|81.4|92.9|100.0|[42217.0,8.5,7.5,...|
|47665.0| 1.2|7.7|156.0|405.0|200.0|74.0|4.0|1856.0| 7.6|178.0|184.0|71.7| 3.5|1976.0| 7.5|128.0|302.0| 92.0|78.3| 0.3|1920.0| 7.6| 19.0| 71.0| 23.0|78.3|0.01|1706.0|28.1|50.0|91.4|85.2|76.5|87.8|82.5|88.5| 99.8|[47665.0,1.2,7.7,...|
|44314.0| 3.0|7.8|155.0|389.0|308.0|49.4|6.0|1927.0| 7.7|252.0|308.0|49.4| 6.5|2150.0| 7.7|121.0|302.0|108.0|72.2| 0.6|1950.0| 7.6| 15.0| 87.0| 23.0|69.7| 0.0|1869.0|52.0|64.9|90.8|87.6|71.2|90.3|77.6|92.5|100.0|[44314.0,3.0,7.8,...|
|40841.0| 1.0|7.6|179.0|389.0|168.0|69.0|3.5|1240.0| 7.8|202.0|272.0|72.1| 6.0|1381.0| 7.8|148.0|302.0| 92.0|78.3| 0.3|1425.0| 7.9| 16.0| 83.0| 20.0|85.0| 0.0|1416.0|26.7|66.2|95.0|89.2|72.5|91.1|78.7|88.1|100.0|[40841.0,1.0,7.6,...|
|41157.0| 3.0|8.0|145.0|398.0|192.0|66.7|4.5|2240.0| 8.0|213.0|240.0|61.7| 6.0|2010.0| 8.0|140.0|287.0| 84.0|78.6| 0.4|2270.0| 7.8| 15.0| 87.0| 21.0|81.0| 0.0|2290.0|34.3|65.0|94.2|89.3|69.7|89.7|78.1|89.1|100.0|[41157.0,3.0,8.0,...|
|40078.0| 1.4|7.9|198.0|464.0|228.0|64.9|4.6|1431.0| 7.6|243.0|272.0|64.7| 7.5|1606.0| 7.8|177.0|319.0| 88.0|81.8| 0.2|1556.0| 7.8| 17.0|102.0| 22.0|81.8| 0.0|1475.0|27.2|67.6|97.3|90.4|68.0|91.4|78.0|90.4|100.0|[40078.0,1.4,7.9,...|
|43080.0|4.25|7.8| 95.0|349.0|136.0|76.5|2.5|1063.0| 7.8|132.0|188.0|74.5| 2.0|1139.0| 7.8|123.0|317.0| 98.0|69.4| 0.4|1218.0| 7.5| 19.0| 67.0| 24.0|83.3| 0.0|1220.0| 6.8|47.9|80.0|84.6|78.9|80.0|80.8|82.4|100.0|[43080.0,4.25,7.8...|
|29414.0| 3.0|7.6|160.0|374.0|168.0|69.0|3.1|1042.0| 7.6|220.0|246.0|69.9| 4.6|1057.0| 7.6|126.0|299.0|112.0|75.0| 0.2|1085.0| 7.4| 19.0| 79.0| 28.0|82.0| 0.0|1087.0|42.7|54.5|95.7|84.9|73.6|88.1|78.9|83.3|100.0|[29414.0,3.0,7.6,...|
|38568.0| 0.7|8.2|233.0|506.0|204.0|66.7|6.7|1692.0| 8.3|218.0|212.0|66.0|10.5|1614.0| 7.9|188.0|355.0| 88.0|81.8| 0.2|1516.0| 7.5| 47.0|116.0| 59.0|81.4|0.05|1483.0|13.8|58.5|98.6|75.0|67.3|79.8|77.1|71.1| 99.3|[38568.0,0.7,8.2,...|
|34193.0| 2.0|8.0|166.0|396.0|176.0|70.5|4.0|1265.0| 8.0|178.0|188.0|70.2| 5.5|1380.0| 7.8|165.0|368.0| 90.0|77.8| 0.2|1434.0| 7.5| 26.0|106.0| 31.0|83.9| 0.0|1442.0| 7.3|52.1|96.4|84.2|71.2|84.3|73.2|82.4|100.0|[34193.0,2.0,8.0,...|
|36332.0| 3.5|7.9|120.0|455.0|184.0|67.4|4.0|1224.0| 8.1|205.0|188.0|68.1| 5.5|1217.0| 7.7|168.0|333.0| 90.0|77.8| 0.2|1353.0| 7.6| 24.0| 98.0| 32.0|81.3| 0.0|1420.0|18.0|52.1|96.4|85.7|70.6|80.0|78.5|82.6|100.0|[36332.0,3.5,7.9,...|
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+--------------------+**
Standarize data
from pyspark.ml.feature import StandardScaler
inputcols2 = df.columns
scaler = StandardScaler(inputCol='features', outputCol="scaledFeatures",
withStd=True, withMean=True)
# Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(v)
# Normalize each feature to have unit standard deviation.
scaledData = scalerModel.transform(v)
scaledData.show()
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+--------------------+--------------------+
| _c1| _c2|_c3| _c4| _c5| _c6| _c7|_c8| _c9|_c10| _c11| _c12|_c13|_c14| _c15|_c16| _c17| _c18| _c19|_c20|_c21| _c22|_c23| _c24| _c25| _c26|_c27|_c28| _c29|_c30|_c31|_c32|_c33|_c34|_c35|_c36|_c37| _c38| features| scaledFeatures|
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+--------------------+--------------------+
|35023.0| 3.5|7.9|205.0|588.0|192.0|65.6|4.5|2430.0| 7.8|236.0|268.0|73.1| 8.5|2280.0| 7.8|158.0|376.0| 96.0|77.1| 0.4|2060.0| 7.6| 20.0|104.0| 20.0|96.7| 0.0|1840.0|33.1|64.2|95.3|87.3|72.3|90.2|82.3|89.6|100.0|[35023.0,3.5,7.9,...|[-0.3427885711292...|
|29156.0| 2.5|7.7|206.0|451.0|194.0|69.1|4.5|1249.0| 7.7|206.0|220.0|61.8| 4.0|1219.0| 7.7|111.0|282.0|124.0|77.4| 0.3|1233.0| 7.5| 16.0|118.0| 19.0|84.2|0.03|1338.0|46.1|43.6|92.5|85.6|58.2|92.2|73.8|90.2| 99.4|[29156.0,2.5,7.7,...|[-1.1990772985617...|
|39246.0| 2.0|7.8|172.0|506.0|200.0|69.0|5.0|1865.0| 7.8|208.0|248.0|66.1| 6.5|1929.0| 7.8|164.0|463.0|100.0|78.0| 0.6|1825.0| 7.6| 19.0|157.0| 27.0|87.0|0.02|1616.0|21.2|59.7|90.8|88.4|66.1|89.0|69.0|86.5| 99.6|[39246.0,2.0,7.8,...|[0.27355833460575...|
|42393.0| 0.7|7.9|189.0|478.0|230.0|67.0|5.5|1410.0| 8.1|173.0|192.0|62.5| 5.0|1406.0| 7.7|172.0|412.0|104.0|71.2| 0.4|1562.0| 7.6|152.0|306.0|131.0|79.6| 3.5|1575.0| 0.6|45.8|92.0|11.6|25.7|19.6|36.0|43.0| 36.4|[42393.0,0.7,7.9,...|[0.73286302613975...|
|40923.0| 3.5|7.6|146.0|329.0|188.0|57.4|2.5|1300.0| 7.6|162.0|132.0|63.6| 2.0|1324.0| 7.6|109.0|243.0| 88.0|81.8| 0.2|1467.0| 7.5| 19.0| 94.0| 41.0|82.9|0.02|1545.0|32.7|33.3|90.0|82.6|61.3|87.0|71.4|78.2| 99.2|[40923.0,3.5,7.6,...|[0.51831650673874...|
|43830.0| 1.5|7.8|177.0|512.0|214.0|58.9|5.5|1605.0| 7.7|164.0|256.0|71.9| 5.5|1599.0| 7.7|118.0|320.0| 70.0|88.6| 0.4|1401.0| 7.6| 25.0|203.0| 20.0|85.0| 0.0|1110.0|28.0|72.7|92.7|78.8|36.6|85.9|60.4|90.7|100.0|[43830.0,1.5,7.8,...|[0.94259319510523...|
|39165.0| 1.2|7.4|250.0|447.0|252.0|61.1|7.0|1533.0| 7.4|275.0|216.0|57.4| 6.5|1501.0| 7.4|138.0|269.0| 90.0|73.3| 0.5|1458.0| 7.3| 14.0| 9.0| 20.0|82.5| 0.0|1402.0|49.8|58.3|92.3|89.9|96.8|94.4|98.1|92.1|100.0|[39165.0,1.2,7.4,...|[0.26173638353671...|
|37419.0| 1.2|7.6|219.0|446.0|222.0|61.3|5.5|1600.0| 7.7|266.0|240.0|70.0| 5.0|1645.0| 7.6|172.0|345.0|102.0|84.3| 0.4|1670.0| 7.5| 42.0|175.0| 53.0|84.2|0.02|1780.0|35.3|57.5|92.0|75.6|49.3|80.8|60.8|76.1| 99.6|[37419.0,1.2,7.6,...|[0.00690766049306...|
|40983.0| 3.0|7.6|182.0|431.0|214.0|57.0|7.0|1591.0| 7.5|219.0|248.0|58.1| 5.5|1473.0| 7.5|175.0|376.0| 88.0|65.9| 0.4|1537.0| 7.5| 23.0|120.0| 25.0|68.0| 0.0|1597.0|20.1|64.5|93.6|86.9|68.1|87.4|72.2|88.3|100.0|[40983.0,3.0,7.6,...|[0.52707350753062...|
|42217.0| 8.5|7.5|138.0|333.0|240.0|55.0|3.8|1087.0| 7.5|153.0|184.0|67.4| 4.0|1109.0| 7.5|108.0|194.0| 82.0|85.4| 0.4|1136.0| 7.1| 16.0| 62.0| 17.0|94.1| 0.0|1223.0|29.4|55.4|91.3|85.2|68.0|88.4|81.4|92.9|100.0|[42217.0,8.5,7.5,...|[0.70717582381691...|
|47665.0| 1.2|7.7|156.0|405.0|200.0|74.0|4.0|1856.0| 7.6|178.0|184.0|71.7| 3.5|1976.0| 7.5|128.0|302.0| 92.0|78.3| 0.3|1920.0| 7.6| 19.0| 71.0| 23.0|78.3|0.01|1706.0|28.1|50.0|91.4|85.2|76.5|87.8|82.5|88.5| 99.8|[47665.0,1.2,7.7,...|[1.50231149571943...|
|44314.0| 3.0|7.8|155.0|389.0|308.0|49.4|6.0|1927.0| 7.7|252.0|308.0|49.4| 6.5|2150.0| 7.7|121.0|302.0|108.0|72.2| 0.6|1950.0| 7.6| 15.0| 87.0| 23.0|69.7| 0.0|1869.0|52.0|64.9|90.8|87.6|71.2|90.3|77.6|92.5|100.0|[44314.0,3.0,7.8,...|[1.01323300149305...|
|40841.0| 1.0|7.6|179.0|389.0|168.0|69.0|3.5|1240.0| 7.8|202.0|272.0|72.1| 6.0|1381.0| 7.8|148.0|302.0| 92.0|78.3| 0.3|1425.0| 7.9| 16.0| 83.0| 20.0|85.0| 0.0|1416.0|26.7|66.2|95.0|89.2|72.5|91.1|78.7|88.1|100.0|[40841.0,1.0,7.6,...|[0.50634860565650...|
|41157.0| 3.0|8.0|145.0|398.0|192.0|66.7|4.5|2240.0| 8.0|213.0|240.0|61.7| 6.0|2010.0| 8.0|140.0|287.0| 84.0|78.6| 0.4|2270.0| 7.8| 15.0| 87.0| 21.0|81.0| 0.0|2290.0|34.3|65.0|94.2|89.3|69.7|89.7|78.1|89.1|100.0|[41157.0,3.0,8.0,...|[0.55246880982706...|
|40078.0| 1.4|7.9|198.0|464.0|228.0|64.9|4.6|1431.0| 7.6|243.0|272.0|64.7| 7.5|1606.0| 7.8|177.0|319.0| 88.0|81.8| 0.2|1556.0| 7.8| 17.0|102.0| 22.0|81.8| 0.0|1475.0|27.2|67.6|97.3|90.4|68.0|91.4|78.0|90.4|100.0|[40078.0,1.4,7.9,...|[0.39498874558646...|
|43080.0|4.25|7.8| 95.0|349.0|136.0|76.5|2.5|1063.0| 7.8|132.0|188.0|74.5| 2.0|1139.0| 7.8|123.0|317.0| 98.0|69.4| 0.4|1218.0| 7.5| 19.0| 67.0| 24.0|83.3| 0.0|1220.0| 6.8|47.9|80.0|84.6|78.9|80.0|80.8|82.4|100.0|[43080.0,4.25,7.8...|[0.83313068520675...|
|29414.0| 3.0|7.6|160.0|374.0|168.0|69.0|3.1|1042.0| 7.6|220.0|246.0|69.9| 4.6|1057.0| 7.6|126.0|299.0|112.0|75.0| 0.2|1085.0| 7.4| 19.0| 79.0| 28.0|82.0| 0.0|1087.0|42.7|54.5|95.7|84.9|73.6|88.1|78.9|83.3|100.0|[29414.0,3.0,7.6,...|[-1.1614221951566...|
|38568.0| 0.7|8.2|233.0|506.0|204.0|66.7|6.7|1692.0| 8.3|218.0|212.0|66.0|10.5|1614.0| 7.9|188.0|355.0| 88.0|81.8| 0.2|1516.0| 7.5| 47.0|116.0| 59.0|81.4|0.05|1483.0|13.8|58.5|98.6|75.0|67.3|79.8|77.1|71.1| 99.3|[38568.0,0.7,8.2,...|[0.17460422565752...|
|34193.0| 2.0|8.0|166.0|396.0|176.0|70.5|4.0|1265.0| 8.0|178.0|188.0|70.2| 5.5|1380.0| 7.8|165.0|368.0| 90.0|77.8| 0.2|1434.0| 7.5| 26.0|106.0| 31.0|83.9| 0.0|1442.0| 7.3|52.1|96.4|84.2|71.2|84.3|73.2|82.4|100.0|[34193.0,2.0,8.0,...|[-0.4639270820835...|
|36332.0| 3.5|7.9|120.0|455.0|184.0|67.4|4.0|1224.0| 8.1|205.0|188.0|68.1| 5.5|1217.0| 7.7|168.0|333.0| 90.0|77.8| 0.2|1353.0| 7.6| 24.0| 98.0| 32.0|81.3| 0.0|1420.0|18.0|52.1|96.4|85.7|70.6|80.0|78.5|82.6|100.0|[36332.0,3.5,7.9,...|[-0.1517400038531...|
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+--------------------+--------------------+
SparkML
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
kmeans = KMeans().setK(4).setSeed(1)
model = kmeans.fit(scaledData)
Make predictions
predictions = model.transform(scaledData)
Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
ilhouette with squared euclidean distance = 0.7454468744909859
Show the results
# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
print(center)
Cluster Centers:
[3.27473399e+04 2.35556650e+00 7.80147783e+00 1.99857143e+02
4.33714286e+02 2.23605911e+02 6.40591133e+01 4.90541872e+00
1.49865517e+03 7.83349754e+00 2.19926108e+02 2.53487685e+02
6.24507389e+01 5.39901478e+00 1.51449754e+03 7.83103448e+00
1.29576355e+02 2.86660099e+02 9.53497537e+01 7.54472906e+01
4.29064039e-01 1.51744828e+03 7.71625616e+00 1.88571429e+01
8.53448276e+01 2.09359606e+01 8.00812808e+01 3.45812808e-02
1.51224138e+03 3.89724138e+01 5.93418719e+01 9.13463054e+01
8.49221675e+01 6.99300493e+01 9.00625616e+01 7.97216749e+01
8.99798030e+01 9.91403941e+01]
[4.91368571e+04 2.04285714e+00 7.89642857e+00 1.56053571e+02
3.05089286e+02 2.14392857e+02 5.31785714e+01 3.51964286e+00
1.41028571e+03 7.91964286e+00 1.68964286e+02 2.38714286e+02
5.21839286e+01 3.61428571e+00 1.43589286e+03 7.87678571e+00
9.98035714e+01 2.25714286e+02 8.81428571e+01 6.56910714e+01
3.69642857e-01 1.39700000e+03 7.75535714e+00 1.51250000e+01
7.23392857e+01 1.73571429e+01 7.88160714e+01 1.71428571e-02
1.43160714e+03 3.91785714e+01 5.67696429e+01 8.88767857e+01
8.44196429e+01 6.71303571e+01 8.94785714e+01 7.55589286e+01
9.02589286e+01 9.94857143e+01]
[4.03285169e+04 2.27313559e+00 7.83898305e+00 1.88500000e+02
4.00449153e+02 2.35949153e+02 5.95169492e+01 4.87542373e+00
1.44216102e+03 7.85677966e+00 2.11652542e+02 2.69644068e+02
5.88618644e+01 5.35254237e+00 1.45742373e+03 7.83135593e+00
1.22923729e+02 2.74855932e+02 9.45084746e+01 7.16042373e+01
4.15254237e-01 1.45109322e+03 7.71355932e+00 2.03728814e+01
8.83559322e+01 2.25932203e+01 7.94305085e+01 4.85593220e-02
1.45549153e+03 3.86406780e+01 5.88372881e+01 9.01932203e+01
8.27228814e+01 6.72805085e+01 8.85644068e+01 7.71593220e+01
8.86406780e+01 9.89720339e+01]
[1.43653333e+04 1.51666667e+00 7.56666667e+00 1.68666667e+02
3.75666667e+02 3.15333333e+02 4.04000000e+01 4.13333333e+00
1.54200000e+03 7.53333333e+00 1.97333333e+02 3.88666667e+02
4.43333333e+01 4.83333333e+00 1.44200000e+03 7.50000000e+00
6.20000000e+01 1.56333333e+02 8.06666667e+01 6.36666667e+01
2.00000000e-01 1.34000000e+03 7.63333333e+00 1.33333333e+01
5.60000000e+01 1.96666667e+01 7.89333333e+01 6.66666667e-03
1.35833333e+03 6.76666667e+01 7.11333333e+01 9.42666667e+01
7.75000000e+01 6.43666667e+01 9.22000000e+01 8.34000000e+01
9.09666667e+01 9.95333333e+01]
pred = model.summary.predictions
pred.show()
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+--------------------+--------------------+----------+
| _c1| _c2|_c3| _c4| _c5| _c6| _c7|_c8| _c9|_c10| _c11| _c12|_c13|_c14| _c15|_c16| _c17| _c18| _c19|_c20|_c21| _c22|_c23| _c24| _c25| _c26|_c27|_c28| _c29|_c30|_c31|_c32|_c33|_c34|_c35|_c36|_c37| _c38| features| scaledFeatures|prediction|
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+--------------------+--------------------+----------+
|35023.0| 3.5|7.9|205.0|588.0|192.0|65.6|4.5|2430.0| 7.8|236.0|268.0|73.1| 8.5|2280.0| 7.8|158.0|376.0| 96.0|77.1| 0.4|2060.0| 7.6| 20.0|104.0| 20.0|96.7| 0.0|1840.0|33.1|64.2|95.3|87.3|72.3|90.2|82.3|89.6|100.0|[35023.0,3.5,7.9,...|[-0.3427885711292...| 0|
|29156.0| 2.5|7.7|206.0|451.0|194.0|69.1|4.5|1249.0| 7.7|206.0|220.0|61.8| 4.0|1219.0| 7.7|111.0|282.0|124.0|77.4| 0.3|1233.0| 7.5| 16.0|118.0| 19.0|84.2|0.03|1338.0|46.1|43.6|92.5|85.6|58.2|92.2|73.8|90.2| 99.4|[29156.0,2.5,7.7,...|[-1.1990772985617...| 0|
|39246.0| 2.0|7.8|172.0|506.0|200.0|69.0|5.0|1865.0| 7.8|208.0|248.0|66.1| 6.5|1929.0| 7.8|164.0|463.0|100.0|78.0| 0.6|1825.0| 7.6| 19.0|157.0| 27.0|87.0|0.02|1616.0|21.2|59.7|90.8|88.4|66.1|89.0|69.0|86.5| 99.6|[39246.0,2.0,7.8,...|[0.27355833460575...| 2|
|42393.0| 0.7|7.9|189.0|478.0|230.0|67.0|5.5|1410.0| 8.1|173.0|192.0|62.5| 5.0|1406.0| 7.7|172.0|412.0|104.0|71.2| 0.4|1562.0| 7.6|152.0|306.0|131.0|79.6| 3.5|1575.0| 0.6|45.8|92.0|11.6|25.7|19.6|36.0|43.0| 36.4|[42393.0,0.7,7.9,...|[0.73286302613975...| 2|
|40923.0| 3.5|7.6|146.0|329.0|188.0|57.4|2.5|1300.0| 7.6|162.0|132.0|63.6| 2.0|1324.0| 7.6|109.0|243.0| 88.0|81.8| 0.2|1467.0| 7.5| 19.0| 94.0| 41.0|82.9|0.02|1545.0|32.7|33.3|90.0|82.6|61.3|87.0|71.4|78.2| 99.2|[40923.0,3.5,7.6,...|[0.51831650673874...| 2|
|43830.0| 1.5|7.8|177.0|512.0|214.0|58.9|5.5|1605.0| 7.7|164.0|256.0|71.9| 5.5|1599.0| 7.7|118.0|320.0| 70.0|88.6| 0.4|1401.0| 7.6| 25.0|203.0| 20.0|85.0| 0.0|1110.0|28.0|72.7|92.7|78.8|36.6|85.9|60.4|90.7|100.0|[43830.0,1.5,7.8,...|[0.94259319510523...| 2|
|39165.0| 1.2|7.4|250.0|447.0|252.0|61.1|7.0|1533.0| 7.4|275.0|216.0|57.4| 6.5|1501.0| 7.4|138.0|269.0| 90.0|73.3| 0.5|1458.0| 7.3| 14.0| 9.0| 20.0|82.5| 0.0|1402.0|49.8|58.3|92.3|89.9|96.8|94.4|98.1|92.1|100.0|[39165.0,1.2,7.4,...|[0.26173638353671...| 2|
|37419.0| 1.2|7.6|219.0|446.0|222.0|61.3|5.5|1600.0| 7.7|266.0|240.0|70.0| 5.0|1645.0| 7.6|172.0|345.0|102.0|84.3| 0.4|1670.0| 7.5| 42.0|175.0| 53.0|84.2|0.02|1780.0|35.3|57.5|92.0|75.6|49.3|80.8|60.8|76.1| 99.6|[37419.0,1.2,7.6,...|[0.00690766049306...| 2|
|40983.0| 3.0|7.6|182.0|431.0|214.0|57.0|7.0|1591.0| 7.5|219.0|248.0|58.1| 5.5|1473.0| 7.5|175.0|376.0| 88.0|65.9| 0.4|1537.0| 7.5| 23.0|120.0| 25.0|68.0| 0.0|1597.0|20.1|64.5|93.6|86.9|68.1|87.4|72.2|88.3|100.0|[40983.0,3.0,7.6,...|[0.52707350753062...| 2|
|42217.0| 8.5|7.5|138.0|333.0|240.0|55.0|3.8|1087.0| 7.5|153.0|184.0|67.4| 4.0|1109.0| 7.5|108.0|194.0| 82.0|85.4| 0.4|1136.0| 7.1| 16.0| 62.0| 17.0|94.1| 0.0|1223.0|29.4|55.4|91.3|85.2|68.0|88.4|81.4|92.9|100.0|[42217.0,8.5,7.5,...|[0.70717582381691...| 2|
|47665.0| 1.2|7.7|156.0|405.0|200.0|74.0|4.0|1856.0| 7.6|178.0|184.0|71.7| 3.5|1976.0| 7.5|128.0|302.0| 92.0|78.3| 0.3|1920.0| 7.6| 19.0| 71.0| 23.0|78.3|0.01|1706.0|28.1|50.0|91.4|85.2|76.5|87.8|82.5|88.5| 99.8|[47665.0,1.2,7.7,...|[1.50231149571943...| 1|
|44314.0| 3.0|7.8|155.0|389.0|308.0|49.4|6.0|1927.0| 7.7|252.0|308.0|49.4| 6.5|2150.0| 7.7|121.0|302.0|108.0|72.2| 0.6|1950.0| 7.6| 15.0| 87.0| 23.0|69.7| 0.0|1869.0|52.0|64.9|90.8|87.6|71.2|90.3|77.6|92.5|100.0|[44314.0,3.0,7.8,...|[1.01323300149305...| 2|
|40841.0| 1.0|7.6|179.0|389.0|168.0|69.0|3.5|1240.0| 7.8|202.0|272.0|72.1| 6.0|1381.0| 7.8|148.0|302.0| 92.0|78.3| 0.3|1425.0| 7.9| 16.0| 83.0| 20.0|85.0| 0.0|1416.0|26.7|66.2|95.0|89.2|72.5|91.1|78.7|88.1|100.0|[40841.0,1.0,7.6,...|[0.50634860565650...| 2|
|41157.0| 3.0|8.0|145.0|398.0|192.0|66.7|4.5|2240.0| 8.0|213.0|240.0|61.7| 6.0|2010.0| 8.0|140.0|287.0| 84.0|78.6| 0.4|2270.0| 7.8| 15.0| 87.0| 21.0|81.0| 0.0|2290.0|34.3|65.0|94.2|89.3|69.7|89.7|78.1|89.1|100.0|[41157.0,3.0,8.0,...|[0.55246880982706...| 2|
|40078.0| 1.4|7.9|198.0|464.0|228.0|64.9|4.6|1431.0| 7.6|243.0|272.0|64.7| 7.5|1606.0| 7.8|177.0|319.0| 88.0|81.8| 0.2|1556.0| 7.8| 17.0|102.0| 22.0|81.8| 0.0|1475.0|27.2|67.6|97.3|90.4|68.0|91.4|78.0|90.4|100.0|[40078.0,1.4,7.9,...|[0.39498874558646...| 2|
|43080.0|4.25|7.8| 95.0|349.0|136.0|76.5|2.5|1063.0| 7.8|132.0|188.0|74.5| 2.0|1139.0| 7.8|123.0|317.0| 98.0|69.4| 0.4|1218.0| 7.5| 19.0| 67.0| 24.0|83.3| 0.0|1220.0| 6.8|47.9|80.0|84.6|78.9|80.0|80.8|82.4|100.0|[43080.0,4.25,7.8...|[0.83313068520675...| 2|
|29414.0| 3.0|7.6|160.0|374.0|168.0|69.0|3.1|1042.0| 7.6|220.0|246.0|69.9| 4.6|1057.0| 7.6|126.0|299.0|112.0|75.0| 0.2|1085.0| 7.4| 19.0| 79.0| 28.0|82.0| 0.0|1087.0|42.7|54.5|95.7|84.9|73.6|88.1|78.9|83.3|100.0|[29414.0,3.0,7.6,...|[-1.1614221951566...| 0|
|38568.0| 0.7|8.2|233.0|506.0|204.0|66.7|6.7|1692.0| 8.3|218.0|212.0|66.0|10.5|1614.0| 7.9|188.0|355.0| 88.0|81.8| 0.2|1516.0| 7.5| 47.0|116.0| 59.0|81.4|0.05|1483.0|13.8|58.5|98.6|75.0|67.3|79.8|77.1|71.1| 99.3|[38568.0,0.7,8.2,...|[0.17460422565752...| 2|
|34193.0| 2.0|8.0|166.0|396.0|176.0|70.5|4.0|1265.0| 8.0|178.0|188.0|70.2| 5.5|1380.0| 7.8|165.0|368.0| 90.0|77.8| 0.2|1434.0| 7.5| 26.0|106.0| 31.0|83.9| 0.0|1442.0| 7.3|52.1|96.4|84.2|71.2|84.3|73.2|82.4|100.0|[34193.0,2.0,8.0,...|[-0.4639270820835...| 0|
|36332.0| 3.5|7.9|120.0|455.0|184.0|67.4|4.0|1224.0| 8.1|205.0|188.0|68.1| 5.5|1217.0| 7.7|168.0|333.0| 90.0|77.8| 0.2|1353.0| 7.6| 24.0| 98.0| 32.0|81.3| 0.0|1420.0|18.0|52.1|96.4|85.7|70.6|80.0|78.5|82.6|100.0|[36332.0,3.5,7.9,...|[-0.1517400038531...| 0|
+-------+----+---+-----+-----+-----+----+---+------+----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+-----+-----+-----+----+----+------+----+----+----+----+----+----+----+----+-----+--------------------+--------------------+----------+
pred.groupBy('prediction').count().show()
+----------+-----+
|prediction|count|
+----------+-----+
| 1| 56|
| 3| 3|
| 2| 118|
| 0| 203|
+----------+-----+
Visualization
Extreme pairplot DO NOT TRY AT HOME
import pandas as pd
pandasDF = pred.toPandas()
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
%matplotlib inline
columns = []
for i in range(1, 38):
columns += ['_c' + str(i)]
columns
sns_plot = sns.pairplot(pandasDF[columns], diag_kind='kde')
sns_plot.savefig("pairplot.png")
plt.clf() # Clean parirplot figure from sns
Image(filename='pairplot.png') # Show pairplot as image
Plot Correlation
import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
def plot_correlation(data):
'''
plot correlation's matrix to explore dependency between features
'''
# init figure size
rcParams['figure.figsize'] = 15, 20
fig = plt.figure()
sns.heatmap(data.corr(), annot=True, fmt=".2f")
plt.show()
fig.savefig('corr.png')
# load your data
data = pandasDF
# plot correlation & densities
plot_correlation(data)
Density plot of features
import pandas as pd
from pylab import rcParams
import matplotlib.pyplot as plt
def plot_densities(data):
'''
Plot features densities depending on the outcome values
'''
# change fig size to fit all subplots beautifully
rcParams['figure.figsize'] = 30, 40
# separate data based on outcome values
outcome_0 = data[data['prediction'] == 0]
outcome_1 = data[data['prediction'] == 1]
outcome_2 = data[data['prediction'] == 2]
outcome_3 = data[data['prediction'] == 3]
# init figure
fig, axs = plt.subplots(38, 1)
fig.suptitle('Features densities for different outcomes 0/1/2/3')
plt.subplots_adjust(left = 0.25, right = 0.9, bottom = 0.1, top = 0.95,
wspace = 0.2, hspace = 0.95)
# plot densities for outcomes
for column_name in names[:-1]:
ax = axs[names.index(column_name)]
#plt.subplot(4, 2, names.index(column_name) + 1)
outcome_0[column_name].plot(kind='density', ax=ax, subplots=True,
sharex=False, color="red", legend=True,
label=column_name + ' for Outcome = 0')
outcome_1[column_name].plot(kind='density', ax=ax, subplots=True,
sharex=False, color="green", legend=True,
label=column_name + ' for Outcome = 1')
outcome_2[column_name].plot(kind='density', ax=ax, subplots=True,
sharex=False, color="black", legend=True,
label=column_name + ' for Outcome = 2')
outcome_3[column_name].plot(kind='density', ax=ax, subplots=True,
sharex=False, color="blue", legend=True,
label=column_name + ' for Outcome = 3')
ax.set_xlabel(column_name + ' values')
ax.set_title(column_name + ' density')
ax.grid('on')
plt.show()
fig.savefig('densities.png')
# load your data
data = pandasDF
names = list(data.columns)
# plot correlation & densities
plot_densities(data)
References
- Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
- Manel Poch (igte2 '@' cc.uab.es) Unitat d'Enginyeria Quimica Universitat Autonoma de Barcelona. Bellaterra. Barcelona; Spain
- Javier Bejar and Ulises Cortes (bejar '@' lsi.upc.es) Dept. Llenguatges i Sistemes Informatics; Universitat Politecnica de Catalunya. Barcelona; Spain