Associate Analytics
Summarizing data with R:
When working with large amounts of data structured in a tabular format, a common operation is to summarize that data in different ways. R provides a variety of methods for summarizing data in tabular and other forms. Generally, summarizing data means finding statistical figures such as the mean and median, or drawing a box plot.
> grades = c(79,68,69,88,90,74,84,76,93) # creating a grade variable
> grades

Sort: Apply the sort function to arrange the grades in increasing order.
> sort(grades)

Summary: summary is a generic function used to produce summaries of the results of various model fitting functions. For a numeric vector such as grades it reports Min., 1st Qu., Median, Mean, 3rd Qu. and Max.
> summary(grades)
> boxplot(grades) # boxplot representation for the grades

The centre of data and spread of data:

Arithmetic Mean: mean is a generic function for calculating the arithmetic mean.
Creating a variable salaries:
> salaries = c(33750,44000,138188,45566,44000,141666,292500)
> mean(salaries)
The mean is not robust, because it is strongly affected by extreme observations. To overcome this, the trimmed mean is used. The trimmed mean is the mean of the remaining data values after removing the k largest and k smallest data values in the observations. In R, the trim argument of mean gives the fraction of observations removed from each end before averaging.
> salaries = c(12,33750,44000,138188,45566,44000,141666,292500)

Median: The median is the "middle" of a sorted list of numbers.
> sort(salaries)
> median(salaries)
The median is robust because it depends only on the middle data values.

Length function: length returns the number of elements in a vector.
> salaries = c(33750,44000,138188,45566.67,44000,141666.67,292500)
> length(salaries)
> salaries1 = sort(salaries)[1:7]

Range: The range of a set of data is the difference between the highest and lowest values in the set. It tells the spread of the data.
range() in R: range returns a vector containing the minimum and maximum of all the given arguments, so the spread itself is the difference of the two returned values.
> range(salaries)

Standard deviation:
sd(): This function computes the standard deviation of the values in x.
> sd(salaries)
var(): var computes the variance of x. The standard deviation is the square root of the variance.
> var(salaries)
> sqrt(var(salaries))

IQR: The interquartile range of an observation variable is the difference of its upper and lower quartiles. It is a measure of how far apart the middle portion of the data spreads in value.
> IQR(salaries)

Histogram representation in R:
> hist(salaries, breaks = 3, xlab = 'salaries', main = 'salaries')

Probability: A deterministic experiment is one whose outcome can be predicted with certainty beforehand, such as adding two numbers (2+3). The outcome of a random experiment cannot be predicted with certainty beforehand. Examples: tossing a coin, rolling a die.

Sample space: For a random experiment E, the set of all possible outcomes of E is called the sample space and is denoted by the letter S. Example: for the coin-toss experiment E, S consists of the results ‘Head’ and ‘Tail’, which we may represent by S = {H,T}. Formally, the performance of a random experiment is the unpredictable selection of an outcome in S.

In R, we can work with probability using the prob package. A sample space is usually represented by a data frame, i.e. a collection of variables. Each row of the data frame corresponds to an outcome of the experiment. Consider the random experiment of tossing a coin. The outcomes are H and T.
> library(prob)
> tosscoin(1)
The argument 1 says to toss the coin once.
> tosscoin(3)
> rolldie(1)
The rolldie() function uses a 6-sided die by default. A different number of sides can be specified with the nsides argument.
If we would like to draw one card from a standard set of playing cards, the sample space is given by cards():
> cards()
> head(cards())

Sampling from urn: The most fundamental type of random experiment is to have an urn that contains a number of distinguishable objects (balls). We shake up the urn and grab a ball. Now suppose we would like to grab more than one ball. What are all of the possible outcomes of the experiment now?
It depends on how we sample: with replacement or without replacement.
With replacement: we select a ball, take a look, put it back and sample again.
Without replacement: we select a ball, take a look, do not put it back, and sample again.
There are certainly more possible outcomes of the experiment in the former case than in the latter.

The prob package accomplishes sampling from urns with the urnsamples function, which has the arguments x, size, replace and ordered. The argument x represents the urn from which sampling is to be done. The size argument tells how large the sample will be. The ordered and replace arguments are logical and specify how sampling will be performed.

Ordered, with replacement: If sampling is with replacement, then we can get any outcome on any draw. Further, by “ordered” we mean that we keep track of the order of the draws that we observed.
> urnsamples(1:3, size=2, replace = TRUE, ordered = TRUE)

Ordered, without replacement: In sampling without replacement, we may not observe the same number twice in any row.
> urnsamples(1:3, size=2, replace = FALSE, ordered = TRUE)

Unordered, without replacement: In this case, we may not observe the same outcome twice, and we retain only those outcomes which do not duplicate earlier ones when order is ignored.
> urnsamples(1:3, size=2, replace = FALSE, ordered = FALSE)

Unordered, with replacement: We replace the balls after every draw, but we do not remember the order in which the draws came.
> urnsamples(1:3, size=2, replace = TRUE, ordered = FALSE)

Events: An event A is a collection of outcomes, that is, a subset of the sample space. After the performance of a random experiment E we say that the event A occurred if the experiment’s outcome belongs to A. Since the sample space is a data frame, an event can be extracted by row index. Each result below is a data frame with the columns toss1, toss2 and probs.
> s = tosscoin(2, makespace = TRUE)
> s[1:3,]
> s[c(2,4),]

We can also extract rows that satisfy a logical expression using the subset function.
> s1 = cards()
> subset(s1, suit == "Heart")
> subset(s1, rank %in% 7:9)
The function %in% tests whether each value of one vector lies somewhere inside another vector.

The isin function: It is used to check whether the whole vector y is in x. With ordered = TRUE, the elements of y must also appear in x in the same order.
Example:
> x = 1:10 # example vector; x is not defined in the original notes, so this value is an assumption
> isin(x, c(3,4,5), ordered = TRUE)
> isin(x, c(3,5,4), ordered = TRUE)
In probability, when the sample space is a data frame, we may want to find a subset of that space, for example starting from four rolls of a die:
> s2 <- rolldie(4)

Random variable: It is a real variable associated with the outcomes of an experiment. Formally, it is a real-valued function defined on the sample space S.
Example: When two coins are tossed as a random experiment and X denotes the number of heads, the function X: S -> R is called a random variable. Here S = {HH, HT, TH, TT}. Let the random variable X = number of heads. So, for example, X(HH) = 2, while X(HT) = 1. X takes the values 0, 1, 2, where 0, 1, 2 are the numbers assigned to the outcomes of the experiment.

Random variable in R: The addrv function in the prob package is used for adding random variables to a probability space.
> S <- rolldie(3, nsides = 4, makespace = TRUE)
> S <- addrv(S, U = X1-X2+X3)
The idea is to write a formula defining the random variable inside the function, and it will be added as a column to the data frame. The above R code rolls a 4-sided die three times and defines the random variable U = X1-X2+X3.
> head(S)

There are two types of random variables: discrete and continuous.

Discrete Random Variable: A random variable X is said to be a discrete random variable if it takes finitely many or countably infinitely many values.
Example: when two coins are tossed the sample space is S = {HH, TH, HT, TT}. Let X denote the number of heads. Then X takes the values 0, 1, 2, which is a finite set, so X is a discrete random variable.

Continuous Random Variable: A random variable X is said to be continuous if it can take any value in some interval, for example
-∞ < X < ∞.
Example: let X denote the length of a call received by a receptionist in a hospital during a 10-minute period. (The number of calls received, by contrast, is a count and is therefore a discrete random variable.)

Probability Distribution: It is a tabular listing of the values of a random variable together with their probabilities. For example, when two coins are tossed the sample space is S = {TT, TH, HT, HH}. Let X denote the number of heads, so the corresponding values of X are 0, 1, 1, 2. The probabilities are P(X=0) = 1/4, P(X=1) = 2/4, P(X=2) = 1/4.
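The two-coin distribution above can be reproduced by enumerating the sample space. The following is a minimal sketch in base R (it does not use the prob package); the names S and pmf are chosen here for illustration only.

```r
# Enumerate the sample space for two coin tosses using base R only.
S <- expand.grid(toss1 = c("H", "T"), toss2 = c("H", "T"),
                 stringsAsFactors = FALSE)

# Random variable X = number of heads in each outcome.
S$X <- (S$toss1 == "H") + (S$toss2 == "H")

# All 4 outcomes are equally likely, so P(X = x) = (number of rows with X = x) / 4.
pmf <- table(S$X) / nrow(S)
print(pmf)
```

The printed table should agree with the probabilities 1/4, 2/4 and 1/4 given in the text, and the entries sum to 1.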
There are two types of probability distribution.

Discrete Probability Distribution (probability mass function, p.m.f.): The probability distribution of a discrete random variable is called a discrete probability distribution, and it is described by its p.m.f.

Continuous Probability Distribution (probability density function, p.d.f.): For a continuous random variable X taking values in -∞ < X < ∞ with density p(x), the continuous probability distribution is defined by $P(-\infty < X < \infty) = \int_{-\infty}^{\infty} p(x)\,dx$, satisfying the conditions $p(x) \ge 0$ for all x and $\int_{-\infty}^{\infty} p(x)\,dx = 1$.

Probability Distribution Function or PDF is the function that defines the probability of outcomes based on certain conditions.

Types of Probability Distribution:

Binomial Distribution (Bernoulli Distribution): It is a discrete probability distribution. It was given by the Swiss mathematician Jacob (James) Bernoulli in 1713. It is the distribution of trials with two possibilities (hence "binomial"): each trial has only two outcomes, success (with probability p) and failure (with probability q = 1 - p).
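As a small illustration of success/failure trials, the sketch below computes the probability of r successes in n trials in two ways: directly from the counting argument (number of arrangements times p^r q^(n-r)) and with R's built-in dbinom. The values n = 3 and p = 0.5 (three tosses of a fair coin) are assumptions chosen only for the example.

```r
n <- 3          # number of trials (assumed for illustration)
p <- 0.5        # probability of success on each trial (assumed)
q <- 1 - p      # probability of failure
r <- 0:n        # possible numbers of successes

# Counting argument: choose(n, r) arrangements, each with probability p^r * q^(n - r).
manual <- choose(n, r) * p^r * q^(n - r)

# Built-in binomial probability mass function from the stats package.
builtin <- dbinom(r, size = n, prob = p)

print(data.frame(r, manual, builtin))
```

Both columns should match, and the probabilities over r = 0, ..., n add up to 1.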
Characteristics of Binomial distribution: Trials satisfying the four conditions of the binomial setting (typically stated as: a fixed number n of trials, exactly two outcomes per trial, independent trials, and a constant probability of success p) are known as Bernoulli trials.

Definition of Binomial Distribution: A discrete random variable X is said to have a binomial distribution if it assumes only a finite number of non-negative values with the probability mass function
$P(X = r) = \binom{n}{r} p^{r} q^{n-r}, \quad r = 0, 1, 2, \ldots, n$
It is denoted by X ~ B.D(n, p), and X is called a binomial variate.

Poisson Distribution: It is a discrete probability distribution. It was given by the French mathematician Siméon Denis Poisson in the year 1837. It is used to explain the behavior of a discrete random variable where the probability of occurrence of an event is very small and the total number of possible cases is very large.

Definition of Poisson Distribution: A discrete random variable X is said to have a Poisson distribution if X assumes only non-negative integer values (countably infinitely many) with the probability mass function
$P(X = r) = \frac{e^{-\lambda} \lambda^{r}}{r!}, \quad r = 0, 1, 2, \ldots$

Normal Distribution: It is a continuous probability distribution, and it was discovered in 1733 by Abraham de Moivre, a French-born mathematician working in England. It arises as an approximation to the binomial distribution when the number of trials n is large and neither p nor q is very small. The "Bell Curve" is a normal distribution.

Definition of Normal distribution: A continuous random variable X taking values in the interval from -∞ to ∞ is said to have a normal distribution with mean µ and standard deviation σ if its probability density function is
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}, \quad -\infty < x < \infty$
Where:
· µ is the mean or expectation of the distribution (and also its median and mode).
· σ is the standard deviation of the distribution.

Seven features of normal distributions:
1. Normal distributions are symmetric around their mean.

Characteristic of Normal Distribution:
1.
The normal probability curve is bell shaped and symmetrical about the mean µ; that is, the top of the bell is directly above the mean, x = µ (z = 0).

Central Limit Theorem
The central limit theorem states that the sampling distribution of the mean of independent random variables drawn from the same population will be normal or nearly normal if the sample size is large enough.

How large is "large enough"? The answer depends on two factors.
Requirements for accuracy: The more closely the sampling distribution needs to resemble a normal distribution, the more sample points will be required.
The shape of the underlying population: The more closely the original population resembles a normal distribution, the fewer sample points will be required.

In practice, some statisticians say that a sample size of 30 is large enough when the population distribution is roughly bell-shaped. Others recommend a sample size of at least 40. But if the original population is distinctly not normal (e.g., is badly skewed, has multiple peaks, and/or has outliers), researchers like the sample size to be even larger.

Suppose x1, x2, ..., xn is a random sample of size n drawn from a population with mean µ and variance σ². Then, as the sample size n tends to infinity, the standardized sample mean $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$ approximately follows the standard normal distribution, with mean 0 and variance 1.
1. For large samples (n >= 30)
2. For small samples (n < 30)
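The central limit theorem described above can be checked by simulation. The sketch below is only an illustration under stated assumptions: the population is taken to be exponential with rate 1 (mean 1, standard deviation 1, clearly skewed), the sample size is n = 40, and 10000 repeated samples are drawn. None of these choices come from the notes.

```r
set.seed(42)          # fixed seed so the run is repeatable
n     <- 40           # sample size (assumed for illustration)
reps  <- 10000        # number of repeated samples (assumed)
mu    <- 1            # mean of an Exponential(rate = 1) population
sigma <- 1            # standard deviation of the same population

# Draw `reps` samples of size n and keep the mean of each sample.
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# The theorem suggests a centre near mu and a spread near sigma / sqrt(n).
cat("mean of sample means:", mean(sample_means), "\n")
cat("sd of sample means:  ", sd(sample_means), "\n")
cat("sigma / sqrt(n):     ", sigma / sqrt(n), "\n")

# Standardized sample means; these should look roughly standard normal.
z <- (sample_means - mu) / (sigma / sqrt(n))
```

Calling hist(z) afterwards should show an approximately bell-shaped histogram centred at 0, even though the exponential population itself is strongly skewed.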