2.3 Descriptive statistics
R is mainly used for data processing, analysis and visualisation. Subsequent parts of the present work are devoted to these three typical applications.
Before we discuss those complex applications, we will present some basic cases here. Variables in data analyses are usually characterised according to the classification by Stanley Stevens:
- Qualitative variables (also referred to as factors or categorical variables) are variables which can take on a limited number of values (usually non-numerical). They can be further divided into the following groups:
- Binary variables (also known as dichotomous or binomial variables), such as gender (female/male).
- Nominal variables (also known as unordered qualitative variables), such as car make: there is no specific order for car makes.
- Ordinal variables (also known as ordered qualitative variables), such as education (primary/secondary/tertiary).
- Quantitative variables, which can be further divided into:
- Count variables (count of occurrences of a given phenomenon expressed as a natural number), such as the number of education years.
- Interval variables, measured on a scale where values can be subtracted, but not divided by each other, such as temperature in Celsius degrees, or A.D. year.
- Ratio variables, measured on a scale where proportions are kept. This means that values can be divided by one another and there is a clear definition of 0.0. Examples include temperature in Kelvin degrees or height in centimetres.
In R, quantitative variables are represented with a numerical type called numeric
. There are no separate types to describe numbers on a ratio scale or an interval scale.
Qualitative data in R are represented with a type called factor. factor
variables can be additionally marked as ordered. In such cases, they have an additional class called ordered
.
Binary variables can be represented with a logical type called logical
.
Table 2.1 presents some functions which calculate the most popular descriptive statistics. We will practice calculating descriptive statistics on a data set called socData
from the Przewodnik
package.
library("Przewodnik")
<- read.csv("socData.csv", sep=";"); head(socData, 3) socData
## age education cyvil.status sex employment diastolic_pressure
## 1 70 zawodowe w zwiazku male uczen lub pracuje 143
## 2 66 zawodowe w zwiazku female uczen lub pracuje 123
## 3 71 zawodowe singiel female uczen lub pracuje 167
## systolic_pressure
## 1 83
## 2 80
## 3 80
Function | Description |
---|---|
. | \(\texttt{base}\) package |
\(\texttt{max()/min()}\) | Maximal/minimal value in the sample. |
\(\texttt{mean()}\) | Arithmetic mean,\(\bar{x}=\sum_{i}x_i/n\) \(\texttt{trim}\) is an optional argument. When it is different than 0, a trimmed mean is calculated. A trimmed mean is calculated just like the arithmetic mean after removing \(200\% * \texttt{trim}\) of edge observations. |
\(\texttt{length()}\) | Count of elements in the sample. |
\(\texttt{range()}\) | Variability range of the sample, calculated as \([\mbox{min}_i x_i,\mbox{max}_i x_i]\). |
. | \(\texttt{stats}\) package |
\(\texttt{weighted.mean}\) | Weighted mean, calculated as \(\frac{1}{n}\sum_i w_i x_i\). The weight vector \(w_i\) is the second argument. |
\(\texttt{median()}\) | Median (middle value). |
\(\texttt{quantile()}\) | Q-quantile. The second argument of \(\texttt{quantile()}\) is the vector of quantiles to find. This function implements 9 different algorithms to find quantiles, see the description of \(\texttt{type}\) argument for more information. |
\(\texttt{IQR()}\) | Interquartile range, i.e. the difference between the upper and lower quartile, \(IQR=q_{0.75}-q_{0.25}\). |
\(\texttt{var()}\) | Variation in the sample. The unbiased estimator of variance is calculated as \(S^2=\frac{1}{n-1}\sum_i (x-\bar{x})^2\). For two vectors, the covariance of these two vectors will be calculated. For a matrix, the covariance matrix for its columns will be calculated instead. |
\(\texttt{sd()}\) | Standard deviation, calculated as \(\sqrt{S^2}\), where \(S^2\) is the estimator of variance. |
\(\texttt{cor()}\), \(\texttt{cov()}\) | Correlation and covariance matrix. The arguments may be a pair of vectors, or a matrix. |
\(\texttt{mad()}\) | Median absolute deviation, calculated as \(1.4826*median(|x_i-median(x_i)|)\). |
. | other packages |
\(\texttt{kurtosis()}\) | Kurtosis, measure of concentration, \(\frac{n\sum_i(x_i-\bar{x})^4}{(\sum_i(x_i-\bar{x})^2)^2}-3\). The normal distribution has a kurtosis of 0. This function comes from the \(\texttt{e1071}\) package. |
\(\texttt{skewness()}\) | Skewness, measure of asymmetry, \(\frac{\sqrt{n}\sum_i(x_i-\bar{x})^3}{(\sum_i(x_i-\bar{x})^2)^{3/2}}\). The symmetric distribution has a skewness of 0. This function comes from the \(\texttt{e1071}\) package. |
\(\texttt{geometric.mean()}\) | Geometric mean, calculated as \((\prod_ix_i)^{1/n}\). This function comes from the \(\texttt{psych}\) package. |
\(\texttt{harmonic.mean()}\) | Harmonic mean, calculated as \(n/\sum_ix_i^{-1}\). This function comes from the \(\texttt{psych}\) package. |
\(\texttt{moda()}\) | Mode, i.e. the most frequent value. This function comes from the \(\texttt{dprep}\) package. In Linux, we can also use the \(\texttt{mod()}\) function from \(\texttt{RVAideMemoire}\). |
2.3.1 Quantitative variables
Let us take a look at the values in the age
column. We can refer to that column with socData$age
.
Age is a quantitative ratio variable (ratios make sense in this case; for example, we can say that someone is twice as old as someone else).
Our first question is: what are the lowest and greatest values that the age
variable can take on? It is always a good idea to check boundary values as they may help us identify errors in data.
range(socData$age)
## [1] 22 75
What is the mean age?
mean(socData$age)
## [1] 43.16176
And what is the trimmed mean calculated for the middle 60% of observations?
mean(socData$age, trim=0.2)
## [1] 42.58065
The median turns out to be close to the mean – that could mean there is no skewness.
median(socData$age)
## [1] 45
We can use the summary()
function to quickly calculate the most important characteristics. In the case of quantitative variables, the result is given as a vector with the following values: the minimum, maximum, mean, median, first and third quartiles (also called lower and upper quartiles).
All of these values, apart from the mean, are always returned by the fivenum()
function (the so-called five-number summary that divides the values observed into four equal parts). If there are missing observations in the variable, their count is also given.
summary(socData$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.00 30.00 45.00 43.16 53.00 75.00
Standard deviation:
sd(socData$age)
## [1] 13.8471
Kurtosis / measure of tailedness:
::kurtosis(socData$age) e1071
## [1] -0.9558479
Skewness:
::skewness(socData$age) e1071
## [1] 0.233151
Selected quantiles of the age
variable:
quantile(socData$age, c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 26.0 30.0 45.0 53.0 62.4
One statistic which is frequently computed for multiple variables is called correlation. We can use the cor()
function to calculate it. A correlation matrix is given below for three selected columns:
cor(socData[,c(1,6,7)])
## age diastolic_pressure systolic_pressure
## age 1.00000000 -0.02765239 -0.08313656
## diastolic_pressure -0.02765239 1.00000000 0.67852707
## systolic_pressure -0.08313656 0.67852707 1.00000000
2.3.2 Qualitative variables
Let us now take a look at the education
column. We can refer to it by typing socData$education
.
Education is a qualitative variable. It can take on four different values and there is a natural order for them.
A contingency table is the most frequent statistic for qualitative variables. The example below uses the table()
function:
table(socData$education)
##
## podstawowe srednie wyzsze zawodowe
## 93 55 34 22
This function defines a contingency table for one, two or more count variables. Contingency tables can also be obtained with xtabs()
and ftable()
.
table(socData$education, socData$employment)
##
## nie pracuje uczen lub pracuje
## podstawowe 22 71
## srednie 16 39
## wyzsze 6 28
## zawodowe 8 14
In the case of qualitative variables, the summary()
function has a similar effect to the table()
function. The only difference is that table()
ignores NA
data, whereas summary()
provides their count.
summary(socData$education)
## Length Class Mode
## 204 character character
The summary()
function can also take an argument of data.frame
type. In this case, summaries are given for each column of the data frame.
summary(socData[,1:4])
## age education cyvil.status sex
## Min. :22.00 Length:204 Length:204 Length:204
## 1st Qu.:30.00 Class :character Class :character Class :character
## Median :45.00 Mode :character Mode :character Mode :character
## Mean :43.16
## 3rd Qu.:53.00
## Max. :75.00