## CSE • Associate Analytics

 UNIT - II Summarizing Data & Revisiting Probability (NOS 2101)

Summarizing data with R:

When working with large amounts of data that is structured in a tabular format, a common operation is to summarize that data in different ways. R provides a variety of methods for summarizing data in tabular and other forms.

Generally, summarizing data means computing summary statistics such as the mean and median, and drawing plots such as a box plot.

 79 68 69 88 90 74 84 76 93

Sort: The sort function arranges the grades in ascending order.

 68 69 74 76 79 84 88 90 93

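Summary: summary is a generic function used to produce summaries of various objects, including the results of model fitting functions. For a numeric vector it returns the minimum, first quartile, median, mean, third quartile and maximum.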

Min.  1st Qu.  Median  Mean  3rd Qu.  Max.
68.00   74.00   79.00   80.11   88.00   93.00
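The outputs above can be reproduced with a short script. This is a minimal sketch; the variable name `grades` is an assumption, since the notes show only the printed values.

```r
# Grades as printed in the notes (the variable name `grades` is assumed)
grades <- c(79, 68, 69, 88, 90, 74, 84, 76, 93)

sorted_grades <- sort(grades)      # ascending order
grade_summary <- summary(grades)   # min, quartiles, median, mean, max

print(sorted_grades)
print(grade_summary)
```

With nine values, the median is the fifth value of the sorted vector, and the quartiles fall on the third and seventh values.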

# boxplot representation for the grades

The centre of data and spread of data:


Arithmetic Mean: The mean function is a generic function for calculating the arithmetic mean.

Creating a variable salaries

> salaries = c(33750,44000,138188,45566,44000,141666,292500)
> salaries
 33750 44000 138188 45566 44000 141666 292500

> mean(salaries)
 105667.1

The mean is not robust, as it is affected by extreme observations. To overcome this, the trimmed mean is used. The trimmed mean is the mean of the remaining data values after removing the k largest and k smallest values from the observations.

> salaries = c(12,33750,44000,138188,45566,44000,141666,292500)
> mean(salaries,trim = 0.1)
 92460.25
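How many values the trim argument actually removes depends on the sample size: R drops floor(n × trim) observations from each end. A minimal sketch using the eight salaries above suggests that trim = 0.1 removes nothing here (so the result equals the ordinary mean), while trim = 0.2 removes the smallest and the largest value.

```r
salaries <- c(12, 33750, 44000, 138188, 45566, 44000, 141666, 292500)
n <- length(salaries)

# Number of observations dropped from EACH end for a given trim fraction
dropped_10 <- floor(n * 0.1)   # 0 values
dropped_20 <- floor(n * 0.2)   # 1 value

plain_mean <- mean(salaries)
trimmed_10 <- mean(salaries, trim = 0.1)   # same as plain_mean for n = 8
trimmed_20 <- mean(salaries, trim = 0.2)   # drops 12 and 292500

print(c(plain_mean, trimmed_10, trimmed_20))
```

With trim = 0.2 the result is the mean of the six middle salaries, which is much less sensitive to the extreme values 12 and 292500.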

Median: The median is the "middle" of a sorted list of numbers. When the list has an even number of values, it is the average of the two middle values.

> sort(salaries)
 12 33750 44000 44000 45566 138188 141666 292500

> median(salaries)
 44783

The median is robust because it depends only on the middle values of the sorted data, so extreme observations such as 12 or 292500 do not affect it.

Length function:

> salaries = c(33750,44000,138188,45566.67,44000,141666.67,292500)
> salaries
 33750.00 44000.00 138188.00 45566.67 44000.00 141666.67
 292500.00

> length(salaries)
 7

> salaries1 = sort(salaries)[1:7]
> salaries1
 33750.00 44000.00 44000.00 45566.67 138188.00 141666.67
 292500.00

Range:

The range of a set of data is the difference between the highest and lowest values in the set. It tells the spread of the data.

range() in R: range returns a vector containing the minimum and maximum of all the given arguments.

> range(salaries)
 33750 292500
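Note that range() returns the two extreme values rather than their difference. A minimal sketch of obtaining the range as a single number, using the salaries vector from the notes:

```r
salaries <- c(33750, 44000, 138188, 45566.67, 44000, 141666.67, 292500)

extremes <- range(salaries)   # c(minimum, maximum)
spread   <- diff(extremes)    # maximum - minimum

print(extremes)
print(spread)
```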

Standard deviation:

sd(): This function computes the standard deviation of the values in x.

> sd(salaries)
 94560.3

var(): var computes the variance of x.

> var(salaries)
 8941650650

> sqrt(var(salaries))
 94560.3

IQR: The interquartile range of an observation variable is the difference of its upper and lower quartiles. It is a measure of how far apart the middle portion of data spreads in value.

> IQR(salaries)
 95927.34
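The relationships between these measures can be checked directly. The sketch below, using the same salaries, recomputes the variance from its definition and the IQR from the quartiles; it assumes R's default quantile method.

```r
salaries <- c(33750, 44000, 138188, 45566.67, 44000, 141666.67, 292500)
n <- length(salaries)

# Sample variance: squared deviations from the mean divided by n - 1
manual_var <- sum((salaries - mean(salaries))^2) / (n - 1)
manual_sd  <- sqrt(manual_var)

# IQR: third quartile minus first quartile
quartiles  <- quantile(salaries, probs = c(0.25, 0.75))
manual_iqr <- unname(quartiles[2] - quartiles[1])

print(c(manual_var, var(salaries)))
print(c(manual_sd, sd(salaries)))
print(c(manual_iqr, IQR(salaries)))
```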

Histogram representation in R

> hist(salaries, breaks = 3,xlab ='salaries', main='salaries')

Probability:

A deterministic experiment is one whose outcome can be predicted with certainty beforehand, such as adding two numbers, e.g. 2+3.

The outcome of a random experiment cannot be predicted with certainty beforehand. Examples: tossing a coin, rolling a die, etc.

Sample space:

For a random experiment E, the set of all possible outcomes of E is called the sample space and is denoted by the letter S.

Example: For coin-toss experiment E, S will be the results ‘Head’ and ‘Tail’, which we may represent by S = {H,T}. Formally, the performance of a random experiment is the unpredictable selection of an outcome in S.

In R, we can work with probability using the prob package.

A sample space is usually represented by a data frame, i.e., a collection of variables. Each row of the data frame corresponds to an outcome of the experiment.

Consider the random experiment of tossing a coin. The outcomes are H and T.

> library(prob)

> tosscoin(1)
toss1
1  H
2  T

The argument 1 specifies that the coin is tossed once.

> tosscoin(3)
toss1 toss2 toss3
1   H   H   H
2   T   H   H
3   H   T   H
4   T   T   H
5   H   H   T
6   T   H   T
7   H   T   T
8   T   T   T

> rolldie(1)
X1
1  1
2  2
3  3
4  4
5  5
6  6

The rolldie() function uses a 6-sided die by default. We can specify a different number of sides with the nsides argument.

If we would like to draw one card from a standard set of playing cards, the sample space is given by cards(). The head function displays its first six rows.

> head(cards())
rank   suit
1   2   Club
2   3   Club
3   4   Club
4   5   Club
5   6   Club
6   7   Club

Sampling from urn:

The most fundamental type of random experiment is to have an urn that contains a number of distinguishable objects (balls). We shake up the urn and grab a ball. If we would like to grab more than one ball, what are all of the possible outcomes of the experiment? It depends on how we sample:

1. With replacement
2. Without replacement

With replacement: we could select a ball, take a look, put it back and sample again.

Without replacement: we would select a ball, take a look but do not put it back and sample again.

There are certainly more possible outcomes of the experiment in the former case than in the latter.

The prob package accomplishes sampling from urns with the urnsamples function, which has the arguments x, size, replace and ordered.

The argument x represents the urn from which sampling is to be done. The size argument tells how large the sample will be. The ordered and replace arguments are logical and specify how the sampling will be performed.

Ordered, with replacement:

If sampling is with replacement, then we can get any outcome on any draw. Further, by “ordered” we mean that we shall keep track of the order of the draws that we observed.

> urnsamples(1:3, size=2,replace = TRUE, ordered = TRUE)
X1   X2
1   1   1
2   2   1
3   3   1
4   1   2
5   2   2
6   3   2
7   1   3
8   2   3
9   3   3

Ordered, without replacement:

In sampling without replacement, we may not observe the same number twice in any row.

> urnsamples(1:3, size=2,replace = FALSE, ordered = TRUE)
X1   X2
1   1   2
2   2   1
3   1   3
4   3   1
5   2   3
6   3   2

Unordered, without replacement:

In this case, we may not observe the same number twice in any row, and we retain only those outcomes which do not duplicate earlier ones when order is ignored (for example, 2 1 is dropped because 1 2 is already listed).

> urnsamples(1:3, size=2,replace = FALSE, ordered = FALSE)
X1   X2
1   1   2
2   1   3
3   2   3

Unordered, with replacement:

We replace the balls after every draw, but we do not remember the order in which the draws came.

> urnsamples(1:3, size=2,replace = TRUE, ordered = FALSE)
X1   X2
1   1   1
2   1   2
3   1   3
4   2   2
5   2   3
6   3   3
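The four listings above have 9, 6, 3 and 6 rows. These counts follow the standard counting formulas for drawing k items from n. The sketch below evaluates them in base R (it does not use the prob package, and the variable names are illustrative).

```r
n <- 3   # balls in the urn
k <- 2   # balls drawn

ordered_with      <- n^k                               # ordered, with replacement
ordered_without   <- factorial(n) / factorial(n - k)   # ordered, without replacement
unordered_without <- choose(n, k)                      # unordered, without replacement
unordered_with    <- choose(n + k - 1, k)              # unordered, with replacement

# Cross-check two of the counts by enumerating the outcomes
all_pairs <- expand.grid(X1 = 1:n, X2 = 1:n)   # ordered, with replacement
subsets   <- combn(n, k)                       # unordered, without replacement

print(c(ordered_with, ordered_without, unordered_without, unordered_with))
print(c(nrow(all_pairs), ncol(subsets)))
```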

Events:

An event A is a collection of outcomes or a subset of the sample space. After the performance of a random experiment E we say that the event A occurred if the experiment’s outcome belongs to A.

> s = tosscoin(2, makespace = TRUE)
> s
toss1 toss2 probs
1   H   H   0.25
2   T   H   0.25
3   H   T   0.25
4   T   T   0.25

> s[1:3,]

toss1 toss2 probs
1   H   H   0.25
2   T   H   0.25
3   H   T   0.25

> s[c(2,4),]

toss1 toss2 probs
2   T   H   0.25
4   T   T   0.25

We can also extract rows that satisfy a logical expression using the subset function.

> s1 = cards()
> s1
rank   suit
1   2   Club
2   3   Club
3   4   Club
4   5   Club
5   6   Club
6   7   Club
7   8   Club
8   9   Club
9   10   Club
10   J   Club
11   Q   Club
12   K   Club
13   A   Club
14   2   Diamond
15   3   Diamond
16   4   Diamond
17   5   Diamond
18   6   Diamond
19   7   Diamond
20   8   Diamond
21   9   Diamond
22   10   Diamond
23   J   Diamond
24   Q   Diamond
25   K   Diamond
26   A   Diamond
27   2   Heart
28   3   Heart
29   4   Heart
30   5   Heart
31   6   Heart
32   7   Heart
33   8   Heart
34   9   Heart
35   10   Heart
36   J   Heart
37   Q   Heart
38   K   Heart
39   A   Heart

> subset(s1, suit == "Heart")

rank   suit
27   2   Heart
28   3   Heart
29   4   Heart
30   5   Heart
31   6   Heart
32   7   Heart
33   8   Heart
34   9   Heart
35   10   Heart
36   J   Heart
37   Q   Heart
38   K   Heart
39   A   Heart

> subset(s1, rank %in% 7:9)
rank   suit
6   7   Club
7   8   Club
8   9   Club
19   7   Diamond
20   8   Diamond
21   9   Diamond
32   7   Heart
33   8   Heart
34   9   Heart

The %in% operator tests whether each value of one vector lies somewhere inside another vector.

The isin function:

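isin(x, y) tests whether the whole vector y is contained in x, taking repeated values into account. With ordered = TRUE, the elements of y must also appear in x in the same order.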

Example:
> x = 1:10
> y = c(3,3,7)
> isin(x,y)
 FALSE

> isin(x,c(3,4,5), ordered = TRUE)
 TRUE

> isin(x,c(3,5,4), ordered = TRUE)
 FALSE
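The first result is FALSE because y contains the value 3 twice while x contains it only once. By contrast, an element-wise membership test with %in% ignores multiplicity. A minimal base R sketch of the difference (no prob package needed):

```r
x <- 1:10
y <- c(3, 3, 7)

# Element-wise membership: is each value of y found somewhere in x?
elementwise <- y %in% x          # TRUE TRUE TRUE
all_present <- all(elementwise)  # TRUE, although x holds only one 3

# Multiplicity check: how often does 3 occur in each vector?
threes_in_x <- sum(x == 3)       # 1
threes_in_y <- sum(y == 3)       # 2

print(all_present)
print(c(threes_in_x, threes_in_y))
```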

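In probability, if the sample space is a data frame, isin can be combined with subset to extract the outcomes that contain a given vector. For example, the rows of four die rolls that contain 2, 2, 6 in that order are: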

> s2 <- rolldie(4)
> subset(s2, isin(s2, c(2,2,6), ordered = TRUE))
X1   X2  X3   X4
188   2   2   6   1
404   2   2   6   2
620   2   2   6   3
836   2   2   6   4
1052   2   2   6   5
1088   2   2   1   6
1118   2   1   2   6
1123   1   2   2   6
1124   2   2   2   6
1125   3   2   2   6
1126   4   2   2   6
1127   5   2   2   6
1128   6   2   2   6
1130   2   3   2   6
1136   2   4   2   6
1142   2   5   2   6
1148   2   6   2   6
1160   2   2   3   6
1196   2   2   4   6
1232   2   2   5   6
1268   2   2   6   6

Random variable:

It is a real variable and is associated with the outcomes of an experiment.

It is a real-valued function defined on the sample space S.

Example: When two coins are tossed as a random experiment and X denotes the number of heads, the function X: S -> R is called a random variable,

i.e., S = {HH, HT, TH, TT}

Let the random variable X = number of heads. So, for example, X(HH) = 2, while X(HT) = 1. X takes the values 0, 1, 2, which are the numbers assigned to the outcomes of the experiment.

Random variable in R:

The addrv function in the prob package is used for adding random variables to a probability space.

> S <- rolldie(3,nsides = 4, makespace = TRUE)
> S <- addrv(S, U = X1-X2+X3)

The idea is to write a formula defining the random variable inside the function; it will be added as a column to the data frame.

The above R code rolls a 4-sided die three times and defines the random variable U = X1-X2+X3. The first six rows of the resulting data frame are:

X1  X2  X3  U   probs
1   1   1   1   1   0.015625
2   2   1   1   2   0.015625
3   3   1   1   3   0.015625
4   4   1   1   4   0.015625
5   1   2   1   0   0.015625
6   2   2   1   1   0.015625
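For readers without the prob package, the same table can be approximated in base R. This is a sketch under the assumption that all 64 outcomes are equally likely; `expand.grid` and the column names are chosen to mirror the output above, and `dist_U` is an illustrative name.

```r
# All outcomes of rolling a 4-sided die three times
S <- expand.grid(X1 = 1:4, X2 = 1:4, X3 = 1:4)

# Random variable U defined on each outcome
S$U <- S$X1 - S$X2 + S$X3

# Equally likely outcomes: each has probability 1/64
S$probs <- 1 / nrow(S)

# Probability distribution of U: total probability for each value of U
dist_U <- tapply(S$probs, S$U, sum)

print(head(S))
print(dist_U)
```

Summing the probs column within each value of U turns the sample space into the probability distribution of the random variable.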

There are two types of random variables.

1. Discrete random variable
2. Continuous random variable

Discrete Random Variable:

A random variable X is said to be a discrete random variable if it takes a finite or countably infinite number of values.

Example : when two coins are tossed the sample space is

S = {HH, TH, HT, TT}

Let X denote the number of heads. Then X takes the values 0, 1, 2, which form a finite set, so X is a discrete random variable.

Continuous Random Variable:

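A random variable X is said to be continuous if it can take any value in some interval, for example

-∞ < X < ∞

Example: let X denote the waiting time between two successive calls received by a receptionist in a hospital. (The number of calls received in a 10-minute period, in contrast, is a count and is therefore a discrete random variable.)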

Probability Distribution:

It is a tabular form of data listing the values of a random variable together with their probabilities.

For example, when two coins are tossed the sample space is S = {TT, TH, HT, HH}. Let X denote the number of heads; the corresponding values of X are 0, 1, 1, 2. The probabilities are P(X=0) = 1/4, P(X=1) = 2/4, P(X=2) = 1/4. There are two types of probability distribution:

·   Discrete probability distribution (probability mass function, p.m.f.): the probability distribution of a discrete random variable is called a discrete probability distribution.

·   Continuous probability distribution (probability density function, p.d.f.): for a continuous random variable X taking values in $-\infty < X < \infty$ with density $p(x)$, the continuous probability distribution is defined by $P(-\infty < X < \infty) = \int_{-\infty}^{\infty} p(x)\,dx$, where $p(x)$ satisfies the conditions (1) $p(x) \ge 0$ for all x and (2) $\int_{-\infty}^{\infty} p(x)\,dx = 1$.
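Both kinds of distribution can be checked numerically. The sketch below tabulates the p.m.f. of the number of heads in two tosses and verifies that a density integrates to 1. The standard normal density `dnorm` is used only as a convenient example of a p.d.f., and the variable names are illustrative.

```r
# Discrete case: number of heads in two tosses of a fair coin
S <- expand.grid(toss1 = c("H", "T"), toss2 = c("H", "T"))
heads <- (S$toss1 == "H") + (S$toss2 == "H")
pmf <- table(heads) / nrow(S)   # P(X = 0), P(X = 1), P(X = 2)

# Continuous case: total area under a density equals 1
area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

print(pmf)
print(area)
```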

A probability distribution function (PDF) is a function that defines the probabilities of the outcomes under certain conditions.
Depending on these conditions, there are several commonly used types of distribution, listed below.

Types of Probability Distribution:
Binomial Distribution
Poisson Distribution
Continuous Uniform Distribution
Exponential Distribution
Normal Distribution
Chi-squared Distribution
Student t Distribution
F Distribution

Binomial Distribution (Bernoulli Distribution):

It is a discrete probability distribution. It was given by the Swiss mathematician James (Jacob) Bernoulli in 1713. It is the distribution of an experiment with two possible outcomes (hence binomial): success, with probability p, and failure, with probability q = 1 - p.

Characteristics of Binomial distribution:
1. The experiment consists of a finite number of trials.
2. Each trial has only two outcomes, i.e., success (with probability p) and failure (with probability q).
3. All trials are independent of each other.
4. The probability of success remains constant from trial to trial.

Trials satisfying the above four conditions are known as Bernoulli trials.

Definition of Binomial Distribution: A discrete random variable X is said to have a binomial distribution if it takes only a finite number of non-negative integer values, with probability mass function

$P(X = r) = \binom{n}{r} p^r q^{n-r}, \quad r = 0, 1, 2, \dots, n$

where n is the number of trials, p is the probability of success and q = 1 - p. It is denoted by X ~ B.D(n, p), and X is called a binomial variate.

Examples for binomial distribution:
1. Tossing a coin n times (head or tail)
2. Students writing an exam (pass or fail)
3. Testing a product a finite number of times (defective or non-defective)
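As an illustration of the binomial formula, consider tossing a fair coin n = 10 times and asking for the probability of exactly r = 3 heads (these numbers are chosen only for the example). The sketch compares the formula with R's built-in dbinom.

```r
n <- 10      # number of trials
p <- 0.5     # probability of success on each trial
q <- 1 - p   # probability of failure
r <- 3       # number of successes of interest

by_formula <- choose(n, r) * p^r * q^(n - r)
by_dbinom  <- dbinom(r, size = n, prob = p)

# The probabilities over r = 0, 1, ..., n add up to 1
total <- sum(dbinom(0:n, size = n, prob = p))

print(c(by_formula, by_dbinom, total))
```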

Poisson Distribution: It is a discrete probability distribution. It was given by the French mathematician Siméon Denis Poisson in the year 1837.

It is used to explain the behavior of a discrete random variable where the probability of occurrence of an event is very small and the total number of possible cases is very large.

Definition of Poisson Distribution:

A discrete random variable X is said to have a Poisson distribution if it takes only non-negative integer values (a countably infinite set) with probability mass function

$P(X = r) = \dfrac{e^{-\lambda} \lambda^r}{r!}, \quad r = 0, 1, 2, \dots$

where λ > 0 is the mean number of occurrences.
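A minimal sketch of the Poisson p.m.f., assuming an average rate λ = 2 (an arbitrary value for illustration), compared with R's dpois. It also compares against a binomial with large n and small p, the situation the Poisson distribution is meant to describe.

```r
lambda <- 2   # mean number of occurrences (assumed for illustration)
r <- 0:5

by_formula <- exp(-lambda) * lambda^r / factorial(r)
by_dpois   <- dpois(r, lambda = lambda)

# Poisson as a limit of the binomial: n large, p small, n * p = lambda
n <- 10000
by_binomial <- dbinom(r, size = n, prob = lambda / n)

print(round(cbind(r, by_formula, by_dpois, by_binomial), 5))
```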

Normal Distribution: It is a continuous probability distribution, discovered in 1733 by Abraham de Moivre, a French-born mathematician working in England. It arises as an approximation of the binomial distribution when the number of trials n is large and neither p nor q is very small. The "Bell Curve" is a normal distribution.

Definition of Normal distribution:

A continuous random variable X taking values in the interval from -∞ to ∞ is said to have a normal distribution with mean µ and standard deviation σ if its probability density function is

$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$

Where:

·   µ is the mean or expectation of the distribution (and also its median and mode).
·   σ is the standard deviation
·   σ² is the variance

Seven features of normal distributions:

1. Normal distributions are symmetric around their mean.
2. The mean, median, and mode of a normal distribution are equal.
3. The area under the normal curve is equal to 1.0.
4. Normal distributions are denser in the center and less dense in the tails.
5. Normal distributions are defined by two parameters, the mean (μ) and the standard deviation (σ).
6. Approximately 68% of the area of a normal distribution is within one standard deviation of the mean.
7. Approximately 95% of the area of a normal distribution is within two standard deviations of the mean.
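Features 6 and 7 can be checked with the normal cumulative distribution function pnorm. In this sketch the values of mu and sigma are arbitrary, chosen only to show that the result does not depend on them.

```r
# Area within k standard deviations of the mean for a standard normal
within_1 <- pnorm(1) - pnorm(-1)
within_2 <- pnorm(2) - pnorm(-2)

# Same result for any mean and standard deviation (values below are arbitrary)
mu <- 50
sigma <- 8
within_1_general <- pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)

print(round(c(within_1, within_2, within_1_general), 4))
```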

Characteristics of Normal Distribution:

1. The normal probability curve is bell shaped and symmetrical about the mean µ, i.e., the top of the bell is directly above the mean x = µ (z = 0).
2. Since the distribution is symmetrical, the mean, median and mode coincide, i.e., mean = median = mode.
3. Since mean = median = mode at the point x = µ (z = 0), the area under the curve is divided into two equal portions. That is, the area from -∞ to x = µ (z = 0) is equal to the area from x = µ (z = 0) to +∞, and $P(-\infty < X < \infty) = \int_{-\infty}^{\infty} f(x)\,dx = 1$, so the total probability is 1.
4. The curve is asymptotic to the x-axis: it approaches nearer and nearer to the x-axis but never touches it.
5. Since the curve has only one maximum point, at x = µ (z = 0), the normal distribution has only one mode.
6. A linear combination of independent normal variables is also a normal variate.
7. As X moves away from the mean in either direction, f(X) decreases rapidly.
8. The maximum of the density occurs at x = µ (z = 0), where $f(\mu) = \dfrac{1}{\sigma\sqrt{2\pi}}$.
9. Hence the maximum height of the normal probability curve is inversely proportional to the standard deviation σ.
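The peak height and its dependence on the standard deviation can be verified with the density function dnorm. The three sigma values below are arbitrary choices for illustration.

```r
mu <- 0
sigmas <- c(0.5, 1, 2)   # standard deviations to compare

peak_by_formula <- 1 / (sigmas * sqrt(2 * pi))
peak_by_dnorm   <- dnorm(mu, mean = mu, sd = sigmas)

print(cbind(sigmas, peak_by_formula, peak_by_dnorm))
```

Each doubling of the standard deviation halves the height of the curve at the mean.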

Central Limit Theorem

The central limit theorem states that the sampling distribution of the mean of independent random variables drawn from any population (with finite variance) will be normal or nearly normal if the sample size is large enough.

How large is "large enough"? The answer depends on two factors.

·   Requirements for accuracy. The more closely the sampling distribution needs to resemble a normal distribution, the more sample points will be required.

·   The shape of the underlying population. The more closely the original population resembles a normal distribution, the fewer sample points will be required.

In practice, some statisticians say that a sample size of 30 is large enough when the population distribution is roughly bell-shaped. Others recommend a sample size of at least 40. But if the original population is distinctly not normal (e.g., is badly skewed, has multiple peaks, and/or has outliers), researchers like the sample size to be even larger.

Suppose x1, x2, …, xn is a random sample of size n drawn from a population with mean µ and variance σ². Then, as the sample size increases, i.e., as n tends to infinity, the standardized sample mean $z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ follows approximately the standard normal distribution, with mean 0 and variance 1.

1. For large samples (n >= 30)
When the variance is known, we use the z distribution.

2. For small samples (n < 30)
When the variance is not known, we use the t distribution.
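The central limit theorem can be illustrated by simulation. The sketch below is only an illustration under assumed settings: it draws repeated samples of size 40 from a right-skewed exponential population (mean 1, standard deviation 1) and looks at the distribution of the sample means. The sample size, number of repetitions and seed are arbitrary choices.

```r
set.seed(1)    # for reproducibility
n <- 40        # sample size
reps <- 5000   # number of samples drawn

# Population: exponential with rate 1 (mean 1, standard deviation 1)
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

mean_of_means <- mean(sample_means)   # close to the population mean 1
sd_of_means   <- sd(sample_means)     # close to sigma / sqrt(n)

print(c(mean_of_means, sd_of_means, 1 / sqrt(n)))
# hist(sample_means) can be used to inspect the shape of the distribution
```

Although the population itself is far from normal, the sample means cluster around µ with a spread close to σ/√n, as the theorem predicts.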