Introduction to Probability and Statistics


Foundation

What is statistics?

When there is a large amount of data, its nature cannot be read simply by looking at it. By summarizing the data from a particular viewpoint and calculating values that characterize its properties, you can discover the principles underlying it and grasp its overall features. In statistics, we extract the features of a population of interest from actual data, usually from a sample drawn from that population. For feature extraction it is common to compute the mean, variance, histogram, median, and mode. With suitable methods of presentation, features can also be extracted visually: line charts for time series, bar charts, pie charts, and so on.

What you can do with statistics

1) Capture the characteristics of data: When there is a lot of data, the individual values are scattered and it is difficult to find features by inspection. Using statistics, we can calculate "representative values" that express the characteristics of the data, and analyze those characteristics, the shape of the data's distribution, and the relationships between variables. 2) Estimate the characteristics of a whole group from sample data: When trying to examine the characteristics of a large group, a complete survey is sometimes impractical. We can examine only a small number of specimens (samples), estimate the characteristics of the group from their analysis, and reach much the same conclusion as a complete survey would give.

Complete survey and sample survey

Statistics examines measured values and derives their properties. If the number of data points is small, compute the statistics from all of them (a complete survey). On the other hand, if there are many data points, or only partial data can be gathered, only a sample is investigated and the whole picture is estimated by statistical processing. In recent years, with the development of IT, it has become possible to conduct exhaustive surveys that process large amounts of data almost instantaneously.

Basics of statistical processing

The mean is a representative value of the data: add all the values together and divide by their number. A single mean can stand in for the whole, and comparing means captures the difference between two data sets or a change over time. The maximum and minimum give the range of the data, showing the boundaries within which all values fall and hinting at the degree of variation. The variance is obtained by squaring the difference between each value and the mean and averaging the squares; it indicates the degree of variation in the data. The median is the value that comes in the center when the data are sorted from smallest to largest, and it is another representative value. A histogram shows the distribution of the data on xy coordinates, making the distribution visually apparent.
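As a concrete illustration, the basic statistics above can be computed in a few lines of Python; `data` here is made-up sample data, and the variance is the population form (dividing by the number of values):

```python
import statistics

# Made-up sample data to illustrate the basic statistics.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)                               # arithmetic mean
variance = sum((x - mean) ** 2 for x in data) / len(data)  # population variance
median = statistics.median(data)                           # middle value when sorted

print(mean, variance, median)   # 5.0 4.0 4.5
print(min(data), max(data))     # 2 9
```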

Visualization of statistics

Just looking at rows of data makes it difficult to read overall features when the quantity is large. By graphing, you can grasp features visually. 1. Bar graph, line graph: with the category of interest on the horizontal axis and quantity on the vertical axis, the height represents the quantity. A line graph is suitable for viewing changes over time (time series). 2. Pie chart, horizontal bar graph: used to see proportions, with the whole taken as 100%. 3. Correlation (scatter) graph: to see the relationship between two measures A and B, plot A on the vertical axis and B on the horizontal axis, and look for correlation in the bias of the points.

Frequency distribution

Many samples of, for example, height or weight are divided into ranges (bins), and the number of items falling in each range is plotted as a bar graph. The shape of this distribution expresses the degree of variation in the data.

Parameters characterizing the frequency distribution

The basic statistics are the mean, variance, maximum and minimum, median, and mode. The mean is a value that represents the data; when comparing two data sets, comparing their means explains the difference between them to some extent. The variance represents the spread of the data: if the variance is large, the variation around the mean is large, and if it is small, the values cluster tightly around the mean. The maximum and minimum indicate the range of the data; all values fall within it. When comparing two data sets, however, the maximum and minimum reflect only a small portion of the data, so they alone cannot establish a difference between the two.

The median is the value ranked in the middle when the data are sorted. It is a familiar representative value for order-conscious data; unlike the mean, it is obtained by ranking rather than by an arithmetic formula. An example is lining people up by height: to capture the "middle" of the heights, not only the mean but also the height of the person of middle rank can serve as the representative value. In income distributions, too, the median is often said to be a more intuitive representative value than the mean.

The mode is the value with the highest frequency in the frequency distribution. The mean, median, and mode are all parameters that characterize the data. In some cases the three coincide; when they do not, it is necessary to consider which of them best represents the data.

Statistics and Probability

Data collected in statistics take various values; there is variation. If the data are measured values, observation error is included, so even when there is a single true value, the observed values vary. We can interpret this as various values appearing from the measurement target. By viewing statistical values as collections of realized values that actually appeared from stochastic events, we can link probability and statistics. The calculations of mean and variance used in statistics are grounded in the ideas of probability.

What is probability?

At the present moment we do not know what will happen in the future. Even if we can anticipate it, it does not necessarily happen. However, except in cases where nothing at all can be anticipated, the possible outcomes are finite, and one of them will be realized. Mathematical reasoning becomes possible when there is no reason to think any one outcome more likely than another. This is probability. The fundamental idea is to interpret each outcome as occurring with probability 1/n if there are n possible outcomes and no difference in their likelihood.

Random variables

Variables whose values are uncertain and unknown at the present time are called random variables. When the uncertainty disappears and the value becomes known, the variable takes a specific value; the candidate values it may take are called realized values (realizations). If there are n realizations and no difference in their likelihood, the probability that the random variable takes any particular realization is 1/n. The probabilities of all realizations sum to 1. Formally, the only condition for an assignment of values to be a probability is that the probabilities of all realizations add up to 1.

Basic calculation of probabilities and independent events

When events A and B are independent and both occur, the probability is the product of the probabilities of A and B. When A and B are mutually exclusive, the probability that either one occurs is the sum of their probabilities. When the probability of event A is p, the probability that A does not occur is 1 - p. Independence means that whether event A occurs is not affected by event B, and likewise whether event B occurs is not affected by event A.
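A small numeric sketch of these rules, using hypothetical probabilities for two independent events (a coin showing heads and a die showing a six):

```python
# Hypothetical independent events: A = coin shows heads, B = die shows a six.
p_a = 0.5
p_b = 1 / 6

p_both = p_a * p_b      # A and B both occur (independent): multiply
p_not_a = 1 - p_a       # complement of A
# "A or B" adds only for mutually exclusive events; in general subtract the overlap:
p_either = p_a + p_b - p_a * p_b

print(p_both)    # 1/12 = 0.0833...
print(p_not_a)   # 0.5
```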

Conditional Probability

Law of large numbers

If the probability of an event in a single trial is x and the number of occurrences in n trials is m, then m/n approaches x as n grows. This is called the law of large numbers. Strictly speaking, m/n varies from one run of n trials to another, but as n gets larger the spread of m/n around x shrinks toward zero, so m/n converges to x.
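The law can be illustrated with a short simulation; the event here is a hypothetical fair coin with x = 0.5, and the seed is arbitrary:

```python
import random

# Simulating the law of large numbers: the relative frequency of "heads"
# (true probability 0.5) approaches 0.5 as the number of trials grows.
random.seed(0)

def relative_frequency(n):
    m = sum(1 for _ in range(n) if random.random() < 0.5)  # count occurrences
    return m / n

for n in (10, 1000, 100000):
    print(n, relative_frequency(n))   # m/n drifts toward 0.5 as n grows
```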

Probability distribution

A list that gives the probability of every possible realization is called a probability distribution. In a graph of a probability distribution, the horizontal axis carries the realizations and the vertical axis the probability. If the realizations are discrete, it becomes a bar graph; if continuous, it becomes a curve. Once the probability distribution is determined, the properties of the random variable are defined, and the expected value and standard deviation can be calculated from it. Semantically, the expected value is a value that represents the distribution; in statistics it corresponds to the mean. The standard deviation is a measure of the variation of the realizations: if it is large, the realizations vary widely. Even with the same expected value, distributions differ if their standard deviations differ, and even with the same expected value and standard deviation, the shapes of two distributions can differ.

Expected value and variance

Multiply every possible realization by its probability and sum; the result is the expected value. The expected value is, as its name suggests, the representative value to be expected. It need not itself be a possible value, and its own probability may even be zero: for a die, for example, the expected value is 3.5. The expected value is meaningful chiefly when comparing expected values with one another; given two random variables, the one with the higher expected value tends to produce higher realizations. Squaring the difference between each realization and the expected value, multiplying by the probability, and summing gives the variance. Intuitively, the variance indicates the degree of variation of the random variable, and the spread of a random variable is evaluated by its variance.
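For the die example above, the expected value and variance follow directly from the definitions:

```python
from fractions import Fraction

# Expected value and variance of a fair die, straight from the definitions.
values = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)                                        # uniform probability

expected = sum(v * p for v in values)                     # E[X]
variance = sum((v - expected) ** 2 * p for v in values)   # E[(X - E[X])^2]

print(float(expected))   # 3.5  (a value the die can never actually show)
print(float(variance))   # 2.9166...
```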

Continuous Probability Distribution

For a discrete random variable x, the probability distribution is a table listing each realization and the corresponding probability. For a continuous variable, we instead work with ranges: writing P(x < a) for the probability that x is less than a, we can assign a probability to every value of the continuous threshold a. Plotting a on the horizontal axis against the corresponding probability density on the vertical axis gives a single curve. The probability P(a < x < b) corresponds to the area under this curve between a and b.

Parameters characterizing probability distribution

Besides the expected value and the standard deviation, the median and the mode also describe the nature of a distribution. The median is the value that comes in the center when the samples are sorted from smallest to largest. When the distribution is symmetric, the expected value and the median nearly coincide, but when the distribution is skewed the median becomes important. The mode is the value at which the distribution peaks. Unlike the expected value and standard deviation, which are defined by calculation formulas, the median and mode are read off from the shape of the distribution.

Relationship with statistics

We can connect probability theory and statistics by regarding a sequence of statistical data as a sequence of realized values of a random variable. The statistical data form a series of samples from the population, and the mean and variance of the statistics are the sample mean and the sample variance. The median is the 50% point of the cumulative distribution function, and the mode is the peak of the probability density function.

Representative probability distribution: Uniform distribution

There are countless probability distributions, but a few representative patterns recur. The first is the uniform distribution, in which every realization has the same probability. It is the simplest distribution; in particular, when nothing is known about the shape of the distribution, it is natural to think that every realization is equally likely, that is, to assume a uniform distribution.

Representative probability distribution: Binomial distribution

The second is the binomial distribution. There are two realizations, A and B, with probabilities p and 1 - p respectively. One realization appears in each trial, and repeating the trial n times gives a batch of n results. Take the number of occurrences of A in the n trials as the random variable; its candidate values are 0, 1, 2, ..., n. The probability that A occurs exactly k times is C(n, k) p^k (1 - p)^(n - k), where C(n, k) is the binomial coefficient; for example, the probability that A occurs exactly once is n p (1 - p)^(n - 1). Computing this for k = 0, 1, 2, ..., n and graphing it gives a mountain shape, high in the middle and low at the ends. The expected value of the distribution is np, and this is where the probability is highest, at the center of the mountain. When p = 0.3, A is most likely to appear about 30 times per 100 trials, matching the intuition behind the expected value; the foot of the mountain spreads out around 30.
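The binomial probabilities can be computed directly from this formula; the sketch below confirms that for n = 100 and p = 0.3 the peak of the distribution sits at np = 30:

```python
from math import comb

# The binomial probability C(n, k) * p^k * (1 - p)^(n - k) described above.
def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 100, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(max(range(n + 1), key=lambda k: pmf[k]))  # 30: the peak sits at np
print(abs(sum(pmf) - 1.0) < 1e-9)               # True: probabilities sum to 1
```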

Representative probability distribution: Normal distribution

When n is small, the binomial distribution has few realizations (the possible numbers of occurrences) and is drawn as a bar graph, but as n increases the realizations become nearly continuous and the distribution approaches a gentle curve. This limiting curve is the normal distribution. A random variable can be modeled as normal when it has an expected value and observations consist of that expected value plus random error. Many uncertain natural phenomena can be represented by the normal distribution; in general, when there is no special condition, one assumes a uniform or a normal distribution. The features of the normal distribution are as follows. (1) It has a single peak and a symmetric shape. (2) Its distribution width is infinite. (3) The expected value, median, and mode coincide at the center of the graph. (4) It has two parameters, the expected value and the standard deviation, and its form is determined by these two alone. Given these features and its ease of use in modeling natural phenomena, the normal distribution is often used as a probabilistic model.

Nature of normal distribution

The normal distribution is the representative probability distribution, and some of its properties must be understood. Its shape is determined by the expected value and the standard deviation. The positions one standard deviation to the left and right of the expected value (the peak of the graph) are called 1 sigma; they correspond roughly to the 16% and 84% points of the distribution, so about 68% of the probability falls within 1 sigma. The positions two standard deviations away are called 2 sigma and correspond to the 2.5% and 97.5% points, so about 95% falls within 2 sigma. In risk management, the uncertain quantity is assumed to be normal, its expected value and standard deviation are estimated, and values within twice the standard deviation are treated as the permitted range, which covers uncertainty with a probability of about 95%.
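The 1-sigma and 2-sigma coverages quoted above can be checked from the standard normal cumulative distribution function, which can be written with `math.erf`:

```python
from math import erf, sqrt

# Standard normal CDF: Phi(z) = (1 + erf(z / sqrt(2))) / 2.
def phi(z):
    return (1 + erf(z / sqrt(2))) / 2

within_1_sigma = phi(1) - phi(-1)   # probability of falling within 1 sigma
within_2_sigma = phi(2) - phi(-2)   # probability of falling within 2 sigma

print(round(within_1_sigma, 3))  # 0.683
print(round(within_2_sigma, 3))  # 0.954
```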

Poisson distribution

The Poisson distribution is the limit of the binomial distribution when the occurrence probability is low and the number of trials is large. It gives the probability of n occurrences of an event in a fixed period as a function of n. It is used to analyze the number of occurrences of events whose rate is known (e.g. queue lengths).

Exponential distribution

When the occurrence rate per unit time is constant, the time until the event first occurs varies. The distribution of that time is called the exponential distribution, and it is often used in engineering. When the failure rate per unit time is constant, the time to failure (the duration of normal operation) follows an exponential distribution. From its mean and variance it can be used for product lifetime design and for setting warranty periods.

Failure rate curve

The exponential distribution assumed a constant failure rate, but in general the failure rate of industrial products is high right after a product is introduced, then gradually falls to a certain level, and rises again as the end of the product's life approaches. This shape is called a bathtub curve. The middle period, excluding the initial and final phases, can be modeled by the exponential distribution.

Sample and population

In statistics, the target about which statistics are to be computed is called the population. When the population is large, an exhaustive survey is impossible, so some specimens are extracted from the population and the population is estimated from the analysis of the specimens. This is the familiar sample survey or questionnaire survey. The distinction between sample and population is important: we study the sample, find its statistical properties, and infer the nature of the population. The simplest idea is to take the sample mean as an estimate of the population mean and the sample standard deviation as an estimate of the population standard deviation. There is also maximum likelihood estimation, which estimates the statistical properties of the population on the assumption that the obtained sample was the most probable one.

Sample statistics and population statistics

It may be impossible to survey the entire population. In that case we infer the population from a sample survey. A sample survey requires two things: (1) random sampling, and (2) taking as many samples as possible, so that the nature of the population is reflected in the sample. As the number of samples increases, the sample mean and sample variance become more accurate estimates of the population mean and population variance.

Estimation of population from specimen

The advantage of probability and statistics is that much knowledge about a population can be derived from little information (sample data). The statistics obtained from the sample data are the mean, variance, and histogram; from these we estimate the probability distribution, mean, and variance of the population by calculation. If the sample is random, independent, and large, the most reasonable estimate is to take the sample mean and variance as the population mean and variance, an estimate that improves as the number of samples grows.

Estimation of true values

Observed quantities contain observation errors, so the true value must be estimated from the observations. The error is often assumed to follow a normal distribution with zero mean and constant standard deviation. Since an observation is the true value plus an error, subtracting the error from the observation would recover the true value. The error itself is unknown, but estimation is possible if it follows a fixed probability distribution. The probability that the true value lies within the observation plus or minus 1 sigma is about 68%, and the probability that it lies within plus or minus 2 sigma is about 95%.

Observation = true value + measurement error, hence true value = observation - measurement error. Although the value of the measurement error is unknown, its statistical properties are generally assumed to be (1) normally distributed and (2) independent across time. Under these assumptions, knowledge of probability and statistics can be applied and various calculations become possible.

Confidence interval

From true value = observation - measurement error, if the measurement error is a random variable, then the candidate true values estimated from an observation also form a random variable: the candidates have a spread. If the observation is x and the measurement error is normal with standard deviation σ, the true value lies between x - σ and x + σ with probability about 68%. A range that contains the true value with at least a given probability is called a confidence interval. Enlarging the interval increases the probability that the true value falls within it: the true value lies between x - 2σ and x + 2σ with probability about 95%.
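A minimal sketch of such an interval, assuming normally distributed errors with a known standard deviation; the observations and sigma are made-up values, and z = 1.96 gives roughly 95% coverage:

```python
from math import sqrt

# ~95% confidence interval for the true value behind repeated measurements,
# assuming normal errors with known standard deviation sigma (made-up data).
def confidence_interval(observations, sigma, z=1.96):
    n = len(observations)
    mean = sum(observations) / n
    half_width = z * sigma / sqrt(n)   # z times the standard error of the mean
    return mean - half_width, mean + half_width

obs = [9.8, 10.1, 10.0, 9.9, 10.2]    # hypothetical repeated measurements
low, high = confidence_interval(obs, sigma=0.2)
print(round(low, 3), round(high, 3))  # 9.825 10.175
```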

Hypothesis testing

Testing means making a hypothesis and checking whether it can be rejected. Even when a hypothesis cannot be rejected, that does not prove it. An example: a physical quantity is measured with an instrument. The instrument has error, and the error distribution is assumed to be normal. The test checks whether a value obtained in a sample survey is a reasonable value given that error. First estimate the probability distribution of the population; given that distribution, the probability of the sample value appearing can be computed. If the significance level is 2% and the occurrence probability of the sample value is 2% or less, we reason: "If the estimated probability distribution of the population were correct, an extremely improbable event would have occurred. Since that is implausible, the estimated probability distribution must be wrong," and we reject the estimated distribution. Even when we cannot reject it, we cannot conclude that the estimated distribution is correct.

Calculation formulas for random variables

For a random variable X whose probability distribution has mean m and standard deviation s, the following formulas hold: E[aX + b] = am + b and D[aX + b] = |a| s. That is, the mean of aX + b is obtained by applying the same linear transformation to the mean of X, while the standard deviation is simply multiplied by |a| (the constant shift b does not affect the spread).
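These formulas can be checked numerically; the sketch below simulates dice rolls (mean 3.5) and applies the transformation aX + b, with arbitrary choices of a, b, and seed:

```python
import random
from math import sqrt

# Numerically checking E[aX + b] = a*m + b and D[aX + b] = |a|*s
# on simulated dice rolls (m = 3.5).
random.seed(1)
xs = [random.randint(1, 6) for _ in range(100000)]
a, b = 2, 10
ys = [a * x + b for x in xs]

def mean(v):
    return sum(v) / len(v)

def std(v):
    mu = mean(v)
    return sqrt(sum((x - mu) ** 2 for x in v) / len(v))

print(round(mean(ys), 1), round(a * mean(xs) + b, 1))  # the two agree
print(round(std(ys) / std(xs), 2))                     # 2.0 = |a|
```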

Advanced topics

Two random variables

For two random variables X and Y with realized values x and y, X and Y are said to be independent when P(X = x) P(Y = y) = P(X = x, Y = y) holds for every pair of realizations. Semantically, neither random variable is affected by the other's realized value. When X and Y are independent with means m and n, E[X + Y] = E[X] + E[Y] = m + n, and the variances add, so the standard deviation of the sum is the square root of the sum of the squared standard deviations. A scatter plot places the paired realizations of two random variables as points on a plane. By looking at the positions of the plotted points, you can see whether the two random variables are related: if the points line up as in one figure, the variables are presumably not unrelated but roughly proportional; if they scatter without pattern as in the other, they appear unrelated. The degree of relationship can be calculated as the correlation coefficient, a value between -1 and 1: values near -1 indicate inverse (negative) correlation, values near 1 positive correlation, and 0 no correlation. If two random variables are independent they are uncorrelated; on the other hand, uncorrelated variables are not necessarily independent. When X and Y are not independent, the mean still satisfies E[X + Y] = E[X] + E[Y] = m + n, but the variance gains a cross term: Var[X + Y] = Var[X] + Var[Y] + 2ρ D[X] D[Y], where ρ is the correlation coefficient. The standard deviation of the sum is thus determined by the standard deviations of the two variables and their correlation coefficient.

Covariance and Correlation Coefficient

Let the two random variables X and Y have expected values E[X] and E[Y] and variances S[X] and S[Y]. For Z = X + Y, if X and Y are independent, E[X + Y] = E[X] + E[Y] and S[X + Y] = S[X] + S[Y]. For Z = aX, S[aX] = a² S[X]. When X and Y are not independent the calculation becomes more involved; since the values differ depending on independence, caution is needed in the two-variable case. The covariance is the sum of (X - E[X])(Y - E[Y]) P(X, Y) over all realization pairs; dividing the covariance by the product of the two standard deviations gives the correlation coefficient.
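A minimal implementation of these definitions for paired sample data (population form, dividing by n); the data are made up so that y is exactly proportional to x:

```python
from math import sqrt

# Covariance and correlation coefficient for paired samples, per the
# definitions above (population form, dividing by n).
def covariance(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]        # exactly proportional to xs

print(correlation(xs, ys))   # 1.0: a perfect positive correlation
```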

Scatter diagram

When the paired values of two random variables are plotted on the xy plane, the set of points may reveal a relationship. If the points scatter uniformly over the whole plane, the two random variables can be interpreted as having no correlation; calculating the correlation coefficient gives a value close to zero. If the points cluster along an upward direction (toward the upper right), the variables can be interpreted as positively correlated; the correlation coefficient is positive, and if the points lie exactly on a straight line it equals 1. If the points cluster along a downward direction (toward the lower right), the variables can be interpreted as negatively correlated; the correlation coefficient is negative, and if the points lie exactly on a straight line it equals -1. The closer the points are to a straight line, the closer the correlation coefficient is to 1 or -1; the degree of the slope of the line does not matter.

Regression analysis

Calculate the correlation coefficient, and if it is close to 1 or -1, hypothesize the relationship between the two random variables as Y = aX + b. Since each actual value of y includes an error e around the value ax + b, the data can be modeled as y = ax + b + e. We then consider how to build a model representing the relationship between the two random variables. For the pairs of actual observations (x, y), compute the error y - (ax + b) between the hypothesized model and the actual value, and choose a and b so that the sum of squared errors is minimized. Semantically, we choose the a and b that minimize the spread of the errors between the hypothesized model and the measured values; conversely, the a and b so chosen give the model that deviates least from the measurements. The square root of the mean squared error y - (ax + b) is called the standard error. Obtaining the parameters a and b under the assumed relationship y = ax + b + e is called regression analysis. The coefficient a is the component proportional to X, and b is a constant component unrelated to X; the size and the sign (positive or negative) of a are of particular interest.
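The least-squares choice of a and b has a closed form, a = cov(x, y)/var(x) and b = mean(y) - a·mean(x); the sketch below fits made-up data generated near y = 2x + 1:

```python
# Least-squares fit of y = a*x + b, as described above.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Closed-form solution: a = cov(x, y) / var(x), b = mean(y) - a * mean(x)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 7.1, 8.9]   # made-up data near y = 2x + 1

a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))  # 1.98 1.08
```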

Application of regression analysis

Regression analysis is often used to explore the relationship between two random variables X and Y that always appear in pairs. The simplest form represents the relationship between X and Y as the linear model Y = aX + b; when the true model is unknown, this linear model is assumed. Y is the explained variable and X the explanatory variable, and the model is useful for analyzing how movements in X influence Y. Note that the model shows an association between X and Y; it does not by itself establish a causal relationship between them. Based on samples (statistical data) of X and Y, the population model Y = aX + b is estimated by computing the a and b that minimize the standard error. To assess the validity of the result, t values are calculated to evaluate the reliability of a and b, and the fit between the samples (statistical data) and the model is evaluated by the coefficient of determination. The sign of a deserves attention: if a is positive, X and Y are positively correlated; if a is negative, X and Y are negatively correlated. Regression analysis is used to find relationships between pairs of variables in economic and natural science data.

Central Limit Theorem

Take n random variables A, B, C, D, ... drawn from an arbitrary probability distribution. The central limit theorem states that the distribution of their sample mean (A + B + C + D + ...)/n approaches a normal distribution, with a variance that shrinks as n grows. The use of this theorem is that the sample mean is approximately normal and its variation keeps decreasing: by collecting as many samples as possible and averaging them, one obtains an estimate with a narrow (highly accurate, highly reliable) normal distribution.

What you can see from the central limit theorem

The central limit theorem implies that when many samples are averaged, the distribution of the average becomes normal, and that this distribution narrows as more samples are added. This means that the average of a large number of samples is a good estimate of the population mean, whatever the distribution of the population. A different reading of the theorem is that a phenomenon produced by many overlapping (added) factors tends toward a normal distribution; simply put, the distribution of such a complicated phenomenon can be assumed to be normal.
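A short simulation illustrates the narrowing: with a uniform population (mean 0.5), the spread of the sample mean shrinks as n grows. The seed and trial count are arbitrary:

```python
import random

# Central limit theorem sketch: the standard deviation of the mean of n
# uniform samples shrinks (roughly like 1/sqrt(n)) as n grows.
random.seed(2)

def spread_of_means(n, trials=2000):
    means = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]
    mu = sum(means) / len(means)
    return (sum((m - mu) ** 2 for m in means) / len(means)) ** 0.5

for n in (1, 10, 100):
    print(n, round(spread_of_means(n), 3))   # the spread shrinks with n
```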

Statistical analysis method

Blindly performing statistical calculations in the hope that some conclusion will emerge wastes an enormous amount of time, so we proceed systematically as follows.

  1. Set up a hypothesis.
    1. Define the survey target as a population and define a sample group. Perform sample survey.
    2. Compute the basic statistics of the sample. Create a distribution.
    3. Perform a hypothesis test and verify the hypothesis.
    4. If you are interested in two variables, create a correlation diagram. Perform regression analysis.
    5. Calculate the coefficient of determination and t value and verify the relationship between the two variables.
  2. Examine the population
    1. Clarify survey target (population) and set hypothesis
    2. Acquire sample data (random, set condition based on hypothesis)
    3. Collect data in a table
    4. Create distribution chart
    5. Calculate average, variance, median, mode
    6. Estimate the mean and variance of the population
       
  3. Examine the relationship between two variables
    1. Clarification of investigation target (two variables) and setting of hypothesis (model)
    2. Acquire sample data (random, set condition based on hypothesis)
    3. Collect data in a table
    4. Create a correlation diagram of two variables
    5. Calculate parameters by mean value, variance, regression analysis
    6. Inference of relational model of two variables of population

Probability and time

A probability refers to a single trial. However, it is easy to understand if interpreted as a repeated phenomenon: a probability of 0.5, for example, can be thought of as occurring once every two times, or five times per ten. Since probability strictly concerns a single trial, interpreting a probability of 1/n as exactly one occurrence per n trials is not necessarily correct.

Stochastic processes

We now bring the concept of time into probability. Let the random variable X be a function X(t) of time t; the specific value of X changes as time passes. If X = 3t + 2, the value of X changes over time but is deterministic. In a stochastic process, by contrast, the values are random: for example, if X is the result of rolling a die every second, we might observe X(0) = 1, X(1) = 5, X(2) = 3, and so on. A typical example of a stochastic process is the random walk: with probability 0.5, X(t + 1) = X(t) + 1, and with probability 0.5, X(t + 1) = X(t) - 1. Each time one unit of time elapses, the value jumps up or down by one. Another example is the Markov process, which expresses state transitions by probabilities: with states A and B, the probability of moving from A to B might be 0.5, and the probability of moving from B to A 0.5.
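The random walk described above can be sketched in a few lines; the step probability 0.5 is taken from the text, while the function name and seed are arbitrary:

```python
import random

# Random walk: at each time step the value moves up or down by 1,
# each with probability 0.5.
random.seed(3)

def random_walk(steps, start=0):
    x = start
    path = [x]
    for _ in range(steps):
        x += 1 if random.random() < 0.5 else -1
        path.append(x)
    return path

path = random_walk(10)
print(path)   # 11 positions; each consecutive difference is +1 or -1
```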

Time series analysis

Even a continuous signal becomes statistical data if it is sampled at discrete times. For example, from one hour of data collected every second (the sample), we estimate the properties of the population (the full, potentially infinite time series). In a stationary process, the statistical properties of the samples are assumed to be time invariant, so the sample mean and variance serve as estimates for the whole time series (the statistical population). If the sample shows a trend, the trend is assumed to continue outside the sample and is applied to the whole time series as well. In time series analysis, trends and periodicity are extracted in addition to the basic statistics (mean, variance); if periodicity exists, the length of the period and the wave components that make it up are analyzed.

Autocorrelation

Collect samples of f(x) and f(x + T) and calculate their correlation coefficient. Varying T and recomputing the correlation gives a function of T. If this function has peaks at certain values of T, those values of T represent periodic components. Conversely, time series data can be thought of as a synthesis of waves of various periods.
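The lag-T autocorrelation can be sketched directly from this description; the test signal below is a made-up sine wave with period 8, so the autocorrelation peaks at lag 8 and is most negative at lag 4 (half a period):

```python
from math import sqrt, sin, pi

# Autocorrelation at lag T: correlate f(x) with f(x + T), as described above.
def autocorrelation(series, lag):
    xs, ys = series[:-lag], series[lag:]     # f(x) paired with f(x + T)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / sqrt(vx * vy)

# A made-up periodic signal with period 8.
signal = [sin(2 * pi * t / 8) for t in range(80)]
print(round(autocorrelation(signal, 8), 3))   # 1.0  (one full period)
print(round(autocorrelation(signal, 4), 3))   # -1.0 (half a period)
```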

Big data analysis

The elementary data of human behavior follow 5W1H: who, when, where, what, why, and how. For purchase data, this becomes who (an individual, with attribute information such as sex, age, address), when, where, what, how, and how much was bought. We collect large amounts of such data, find regularities in it, and use them to optimize the product lineup and to link sales promotion activities to revenue and profit. Without a target conclusion at which to aim the analysis, the analysis becomes aimless.