# Statistical functions with NumPy¶

In [ ]:
import numpy as np

In [ ]:
arr = np.random.rand(100000)

In [ ]:
np.amin(arr)

Out[ ]:
1.2487471540811867e-05
In [ ]:
np.amax(arr)

Out[ ]:
0.9999991775057848
In [ ]:
np.mean(arr)

Out[ ]:
0.5000285410829219
In [ ]:
np.std(arr)

Out[ ]:
0.28858064751802887
In [ ]:
np.median(arr)

Out[ ]:
0.500066650992059
In [ ]:
np.var(arr)

Out[ ]:
0.08327879012192481
In [ ]:
np.percentile(arr, 50)

Out[ ]:
0.500066650992059
In [ ]:
np.median(arr)

Out[ ]:
0.5000945785815393
In [ ]:
np.percentile(arr, 75)

Out[ ]:
0.7497093639966632
In [ ]:
np.percentile(arr, [10, 30, 60])

Out[ ]:
array([0.10068439, 0.30077677, 0.60000558])
In [ ]:
np.percentile(arr, [25, 75])

Out[ ]:
array([0.25037751, 0.74986553])
In [ ]:
np.percentile(arr, 10)

Out[ ]:
0.10093826911393286
In [ ]:
np.percentile(arr, 90)

Out[ ]:
0.899997898924828
In [ ]:
np.percentile(arr, 30)

Out[ ]:
0.3005185124573417
In [ ]:
iqr = np.percentile(arr, 75) - np.percentile(arr, 25)

In [ ]:
print(iqr)

0.49948802790270597

In [ ]:
quartiles = np.percentile(arr, [25, 75])

In [ ]:
print(quartiles)

[0.25037751 0.74986553]

In [ ]:
iqr = quartiles[1] - quartiles[0]

In [ ]:
print(iqr)

0.49948802790270597


#### Observe the time difference between the following two methods of computing the IQR¶

In [ ]:
%%time
iqr = np.percentile(arr, 75) - np.percentile(arr, 25)

CPU times: user 5.31 ms, sys: 10 µs, total: 5.32 ms
Wall time: 7.53 ms

In [ ]:
%%time
quartiles = np.percentile(arr, [75, 25])
iqr = quartiles[0] - quartiles[1]

CPU times: user 3.51 ms, sys: 0 ns, total: 3.51 ms
Wall time: 3.53 ms


#### If the array size is a bit larger, the performance difference would be more significant¶

In [ ]:
arr2 = np.random.rand(100000000)

In [ ]:
%%time
iqr = np.percentile(arr2, 75) - np.percentile(arr2, 25)

CPU times: user 2.65 s, sys: 6.95 ms, total: 2.65 s
Wall time: 2.65 s

In [ ]:
print(iqr)

0.5000233875522627

In [ ]:
%%time
quartiles = np.percentile(arr2, [75, 25])
iqr = quartiles[0] - quartiles[1]

CPU times: user 1.96 s, sys: 1.99 ms, total: 1.96 s
Wall time: 1.95 s

In [ ]:
print(iqr)

0.5000233875522627


### Z score¶

• Z score - how far a particular point is from the mean, measured in standard deviations,
• a negative value implies the point lies to the left of (below) the mean and
• a positive value implies the point lies to the right of (above) the mean
In [ ]:
(arr - np.mean(arr))/np.std(arr) # z-score: one value per array element
# each value is the signed distance of that element from the mean,
# expressed in units of the array's standard deviation

Out[ ]:
array([ 1.10533211, -1.52392851,  0.71986322, ..., -1.66676396,
-0.51762155, -0.01055   ])

### histogram() with this array gives two arrays: one holds the counts per bin and the other holds the bin edges¶

In [ ]:
np.histogram(arr)

Out[ ]:
(array([ 9916, 10043,  9992, 10031, 10013, 10071,  9890,  9993, 10051,
10000]),
array([1.24874715e-05, 1.00011156e-01, 2.00009825e-01, 3.00008494e-01,
4.00007163e-01, 5.00005832e-01, 6.00004501e-01, 7.00003170e-01,
8.00001839e-01, 9.00000509e-01, 9.99999178e-01]))
In [ ]:
np.histogram(arr, 5)

Out[ ]:
(array([19959, 20023, 20084, 19883, 20051]),
array([1.24874715e-05, 2.00009825e-01, 4.00007163e-01, 6.00004501e-01,
8.00001839e-01, 9.99999178e-01]))
In [ ]:
np.histogram(arr, bins = [0, .25, .27, 1])

Out[ ]:
(array([24966,  1998, 73036]), array([0.  , 0.25, 0.27, 1.  ]))
In [ ]:
bins = [0, .25, .5, .75, 1]


# digitize¶

• gives an array where the numbers indicate the bin in which they are in
In [ ]:
np.digitize(arr, bins)

Out[ ]:
array([4, 1, 3, ..., 1, 2, 2])
In [ ]:
arr3 = np.random.randint(10, 20, 10)

In [ ]:
arr3

Out[ ]:
array([19, 17, 15, 13, 13, 15, 12, 12, 14, 18])
In [ ]:
bins = [10, 14, 18, 20] # here the bins are 10-14, 14-18, 18-20

In [ ]:
np.digitize(arr3, bins) # 19 in bin 3, 17 in bin2, 15 in bin 2, 13 in bin1 and so on

Out[ ]:
array([3, 2, 2, 1, 1, 2, 1, 1, 2, 3])

## Consider an example with realtime dataset with height, weight and age¶

In [ ]:
height = np.random.randint(100, 180, 10)
weight = np.random.randint(40, 150, 10)
age = np.random.randint(10, 80, 10)

In [ ]:
height

Out[ ]:
array([173, 124, 113, 169, 163, 144, 113, 164, 106, 166])
In [ ]:
weight

Out[ ]:
array([ 85, 125, 135,  76,  61,  58, 125, 132, 113,  94])
In [ ]:
age

Out[ ]:
array([77, 37, 13, 49, 47, 19, 58, 42, 49, 23])
In [ ]:
np.min(weight)

Out[ ]:
58
In [ ]:
np.max(weight)

Out[ ]:
135
In [ ]:
np.min(height)

Out[ ]:
106
In [ ]:
np.max(height)

Out[ ]:
173
In [ ]:
arr_concat = np.concatenate((weight, height, age))

In [ ]:
print(arr_concat)

[ 85 125 135  76  61  58 125 132 113  94 173 124 113 169 163 144 113 164
106 166  77  37  13  49  47  19  58  42  49  23]

In [ ]:
np.amin(arr_concat)

Out[ ]:
13

#### the expectation from the following line of code is to get the min of height, weight and age separately, but concatenate() produces one flat array with a single min — hence use vstack()¶

In [ ]:
np.concatenate((weight, height, age)).shape

Out[ ]:
(30,)

#### unlike concatenate(), which appends the next dataset to the previous dataset¶

In [ ]:
np.vstack((height, weight, age))

Out[ ]:
array([[173, 124, 113, 169, 163, 144, 113, 164, 106, 166],
[ 85, 125, 135,  76,  61,  58, 125, 132, 113,  94],
[ 77,  37,  13,  49,  47,  19,  58,  42,  49,  23]])
In [ ]:
np.vstack((height, weight, age)).shape

Out[ ]:
(3, 10)
In [ ]:
arr4 = np.vstack((height, weight, age))

In [ ]:
arr4

Out[ ]:
array([[173, 124, 113, 169, 163, 144, 113, 164, 106, 166],
[ 85, 125, 135,  76,  61,  58, 125, 132, 113,  94],
[ 77,  37,  13,  49,  47,  19,  58,  42,  49,  23]])
In [ ]:
np.amin(arr4, axis=1)

Out[ ]:
array([106,  58,  13])
In [ ]:
np.amax(arr4, axis=0)

Out[ ]:
array([173, 125, 135, 169, 163, 144, 125, 164, 113, 166])

#### unlike the concatenate() function, which gives the min across the combined height, weight and age values, amin with axis=1 gives one min per row¶

In [ ]:
np.amin(arr4, axis=1)

Out[ ]:
array([106,  58,  13])

# Rules of Statistics¶

In [ ]:
import numpy as np


## 1. Mean subtracted array has zero mean¶

In [ ]:
base_mean_data = np.random.rand(10000000)

In [ ]:
base_mean = base_mean_data - np.mean(base_mean_data)

In [ ]:
print(base_mean)

[ 0.31959795 -0.48022501  0.04116783 ...  0.23665416  0.03643021
-0.45697251]

In [ ]:
print(np.mean(base_mean))

-3.907985046680551e-18


## Computing mean with smaller set of values¶

• generally, the mean of a sample is close to the mean of the population
• a smaller dataset's mean need not be the same as the mean of the population, so,
• we check empirically how big a sample should be to get a close estimate of the population mean
In [ ]:
import matplotlib.pyplot as plt

In [ ]:
arr = np.random.randint(1, 100, 100)

In [ ]:
arr[:10]

Out[ ]:
array([33, 26, 75, 38, 17, 21, 87, 29, 38, 57])
In [ ]:
arr[0]

Out[ ]:
33
In [ ]:
np.mean(33)

Out[ ]:
33.0
In [ ]:
np.mean([33, 26])

Out[ ]:
29.5
In [ ]:
print(arr[:10])

[33 26 75 38 17 21 87 29 38 57]

In [ ]:
print(arr[0:0])

[]

In [ ]:
print(arr[0:1])

[33]


#### the outcome of the following code remains almost constant between 30 - 40 samples and thereafter...¶

• hence, 30 - 40 samples can be considered a good sample size whose mean resembles that of the population
In [ ]:
print(arr[:10])

[33 26 75 38 17 21 87 29 38 57]

In [ ]:
# Running mean of the first i elements: the extracted text lost the loop-body
# indentation, which is syntactically invalid Python — restored here.
for i in range(1, 50):
    # prefix of the first i elements of arr
    arr1 = arr[0:i]
    # print: sample size, the newest element included, and the running mean
    # (the mean visibly stabilizes once i reaches roughly 30-40)
    print(i, arr[i-1], np.mean(arr1))

1 33 33.0
2 26 29.5
3 75 44.666666666666664
4 38 43.0
5 17 37.8
6 21 35.0
7 87 42.42857142857143
8 29 40.75
9 38 40.44444444444444
10 57 42.1
11 50 42.81818181818182
12 54 43.75
13 15 41.53846153846154
14 66 43.285714285714285
15 15 41.4
16 26 40.4375
17 51 41.05882352941177
18 19 39.833333333333336
19 36 39.63157894736842
20 99 42.6
21 88 44.76190476190476
22 33 44.22727272727273
23 47 44.34782608695652
24 76 45.666666666666664
25 6 44.08
26 47 44.19230769230769
27 2 42.629629629629626
28 22 41.892857142857146
29 5 40.62068965517241
30 42 40.666666666666664
31 61 41.32258064516129
32 36 41.15625
33 90 42.63636363636363
34 27 42.1764705882353
35 74 43.08571428571429
36 48 43.22222222222222
37 23 42.67567567567568
38 39 42.578947368421055
39 19 41.97435897435897
40 28 41.625
41 15 40.97560975609756
42 10 40.23809523809524
43 68 40.883720930232556
44 4 40.04545454545455
45 46 40.17777777777778
46 81 41.06521739130435
47 96 42.234042553191486
48 30 41.979166666666664
49 3 41.183673469387756


#### cumsum() sums all the elements up to and including the current element of the array¶

• first time, first number
• second time, first two numbers,
• third time, first three numbers and so on till the end of the array
In [ ]:
np.cumsum(arr)

Out[ ]:
array([  33,   59,  134,  172,  189,  210,  297,  326,  364,  421,  471,
525,  540,  606,  621,  647,  698,  717,  753,  852,  940,  973,
1020, 1096, 1102, 1149, 1151, 1173, 1178, 1220, 1281, 1317, 1407,
1434, 1508, 1556, 1579, 1618, 1637, 1665, 1680, 1690, 1758, 1762,
1808, 1889, 1985, 2015, 2018, 2102, 2162, 2200, 2226, 2310, 2359,
2428, 2477, 2491, 2521, 2544, 2616, 2624, 2685, 2716, 2808, 2843,
2867, 2903, 2921, 2967, 2994, 3013, 3097, 3114, 3119, 3187, 3232,
3287, 3332, 3375, 3447, 3462, 3521, 3543, 3570, 3572, 3668, 3709,
3786, 3873, 3928, 4001, 4046, 4086, 4132, 4164, 4251, 4279, 4306,
4327])
In [ ]:
np.cumsum(arr)/(np.arange(1, 101))

Out[ ]:
array([33.        , 29.5       , 44.66666667, 43.        , 37.8       ,
35.        , 42.42857143, 40.75      , 40.44444444, 42.1       ,
42.81818182, 43.75      , 41.53846154, 43.28571429, 41.4       ,
40.4375    , 41.05882353, 39.83333333, 39.63157895, 42.6       ,
44.76190476, 44.22727273, 44.34782609, 45.66666667, 44.08      ,
44.19230769, 42.62962963, 41.89285714, 40.62068966, 40.66666667,
41.32258065, 41.15625   , 42.63636364, 42.17647059, 43.08571429,
43.22222222, 42.67567568, 42.57894737, 41.97435897, 41.625     ,
40.97560976, 40.23809524, 40.88372093, 40.04545455, 40.17777778,
41.06521739, 42.23404255, 41.97916667, 41.18367347, 42.04      ,
42.39215686, 42.30769231, 42.        , 42.77777778, 42.89090909,
43.35714286, 43.45614035, 42.94827586, 42.72881356, 42.4       ,
42.8852459 , 42.32258065, 42.61904762, 42.4375    , 43.2       ,
43.07575758, 42.79104478, 42.69117647, 42.33333333, 42.38571429,
42.16901408, 41.84722222, 42.42465753, 42.08108108, 41.58666667,
41.93421053, 41.97402597, 42.14102564, 42.17721519, 42.1875    ,
42.55555556, 42.2195122 , 42.42168675, 42.17857143, 42.        ,
41.53488372, 42.16091954, 42.14772727, 42.53932584, 43.03333333,
43.16483516, 43.48913043, 43.50537634, 43.46808511, 43.49473684,
43.375     , 43.82474227, 43.66326531, 43.49494949, 43.27      ])
In [ ]:
means = np.cumsum(arr)/np.arange(1, 101)

In [ ]:
means[:10]

Out[ ]:
array([33.        , 29.5       , 44.66666667, 43.        , 37.8       ,
35.        , 42.42857143, 40.75      , 40.44444444, 42.1       ])

#### how do scaling and shifting affect the mean and median¶

In [ ]:
ex_arr = np.random.randint(1, 100, 100)

In [ ]:
np.mean(ex_arr)

Out[ ]:
48.85
In [ ]:
ex_arr[:10]

Out[ ]:
array([29, 88, 93, 45, 41, 75, 78, 14,  6, 69])
In [ ]:
np.median(ex_arr)

Out[ ]:
44.0

#### adding outliers to the end of the array¶

In [ ]:
ex_arr = np.append(ex_arr, [4000, 2000])

In [ ]:
ex_arr[-10:]

Out[ ]:
array([  30,   74,   23,   24,   49,   66,   94,   22, 4000, 2000])

#### observe the difference between mean and median values, on par with the previous observations¶

• in the previous instance, the mean and median were very near to each other
• whereas after the addition of outliers, they are very far from each other
• observation: the mean is very sensitive to outliers whereas the median is not
In [ ]:
np.mean(ex_arr)

Out[ ]:
106.7156862745098
In [ ]:
np.median(ex_arr)

Out[ ]:
44.5

#### Effect of scaling on mean and median¶

In [ ]:
sca_arr = np.random.randint(1, 100, 100)

In [ ]:
np.mean(sca_arr)

Out[ ]:
47.2
In [ ]:
np.median(sca_arr)

Out[ ]:
49.0
In [ ]:
sca_arr1 = 2.5 * sca_arr + 10.02


#### SCALING the array by a linear coefficient and adding a constant is equivalent to applying the same transform to the mean (and median) of the array¶

In [ ]:
print(np.mean(sca_arr1),
2.5 * np.mean(sca_arr) + 10.02)

128.02000000000004 128.02

In [ ]:
print(np.mean(2.5 * sca_arr + 10.02),
2.5 * np.mean(sca_arr) + 10.02)

128.02000000000004 128.02

In [ ]:
print(np.median(sca_arr1), 2.5 * np.median(sca_arr) + 10.02)

132.52 132.52


#### the outcome of SCALING is different for variance and standard deviation: the variance scales by the square of the coefficient, the standard deviation by its absolute value, and the added constant affects neither¶

In [ ]:
print(np.var(sca_arr1), 2.5 * np.var(sca_arr) + 10.02)

4602.000000000001 1850.8200000000002

In [ ]:
print(np.std(sca_arr1), 2.5 * np.std(sca_arr) + 10.02)

67.83804242458652 77.85804242458651

In [ ]: