library(palmerpenguins)
library(ggplot2)
data <- na.omit(penguins$body_mass_g) # Remove NA values
num_bins_sturges <- 1 + log2(length(data)) # Calculate number of bins using Sturges' rule
ggplot(data.frame(data), aes(x = data)) +
geom_histogram(bins = round(num_bins_sturges)) +
ggtitle("Histogram with Sturges' Rule") + theme_light()
Data Summarization
Data summarization is an essential process in statistical programming, which involves reducing and simplifying large datasets into more manageable and understandable forms. This process is crucial as it helps in highlighting the key aspects of the data by extracting important patterns, trends, and relationships. In the realm of statistical analysis, summarization is not just about making the data smaller or simpler; it’s about capturing the essence of the data in a way that is both informative and useful for analysis.
Significance in Large Datasets
The advent of big data has made data summarization more important than ever. With the sheer volume of data available today, it’s practically impossible to analyze every individual data point. Summarization helps in distilling large datasets to a form where they can be easily interpreted and analyzed. This process not only saves time and computational resources but also aids in making data-driven decisions more effectively.
For instance, summarizing sales data over years can reveal trends and patterns that might not be evident when looking at daily sales figures. Similarly, summarizing survey data can help in quickly understanding the general opinion or trend, without getting lost in the myriad of individual responses.
Types of Summarization Techniques
Numerical Summarization: This involves using statistical measures to summarize the key characteristics of numerical data. Techniques include calculating measures of central tendency (like mean, median, and mode) and measures of variability or spread (like range, variance, and standard deviation). These techniques are fundamental in providing a quick snapshot of the data’s overall distribution and central values.
Categorical Summarization: When dealing with categorical (or qualitative) data, summarization often involves understanding the frequency or occurrence of different categories. Techniques include creating frequency tables, cross-tabulations, and using measures like mode. This type of summarization is particularly useful in understanding the distribution of categorical variables, like customer categories or product types.
Visual Summarization: Visual representations of data, such as histograms, bar charts, box plots, and scatter plots, provide an intuitive way to summarize and understand complex datasets. These techniques are invaluable in revealing patterns, trends, outliers, and relationships in data that might not be obvious in textual or numerical summaries. A quick sketch of all three types follows this list.
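For a first taste, each of these can be done in a line or two of R. This is a minimal sketch using the palmerpenguins data loaded at the start of the chapter; the variables chosen are arbitrary examples:
# Numerical summary: a measure of central tendency
mean(penguins$body_mass_g, na.rm = TRUE)

# Categorical summary: frequency counts
table(penguins$species)

# Visual summary: a bar chart of species counts
ggplot(penguins, aes(x = species)) +
  geom_bar() +
  ggtitle("Counts of penguins by species") +
  theme_minimal()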
In R, these summarization techniques are supported by a variety of functions and packages, making it an ideal environment for both basic and advanced data summarization tasks. This chapter will delve into these techniques, providing the reader with the knowledge and tools to effectively summarize and interpret large datasets using R.
Summarizing Numerical Data
Introduction to Normality Testing
Normality testing is a fundamental step in statistical analysis, particularly when deciding which statistical methods to apply. Many statistical techniques assume that the data follows a normal (or Gaussian) distribution. However, real-world data often deviates from this idealized distribution. Therefore, assessing the normality of your dataset is crucial before applying techniques that assume normality, such as parametric tests.
In R, there are several methods to test for normality, including graphical methods like Q-Q plots and statistical tests like the Shapiro-Wilk test. Let’s explore how to perform these tests in R with an example dataset.
Graphical Methods to Assess Normality
1) Histogram
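A minimal sketch in the same style as the boxplot below; the choice of 30 bins is arbitrary here, and the Sturges'-rule calculation at the top of the chapter is a more principled alternative:
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 30) +
  ggtitle("A Histogram showing distribution of weight of Penguins in grams") +
  theme_minimal()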
2) Boxplot
ggplot(penguins, aes(body_mass_g)) + geom_boxplot() +
ggtitle("A Boxplot showing distribution of weight of Penguins in grams") +
theme_minimal() +
scale_y_continuous(limits=c(-1,1))
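Besides histograms and boxplots, the Q-Q plot mentioned at the start of this section is another common graphical check: points lying close to the reference line suggest approximate normality. A minimal ggplot2 sketch:
# Q-Q plot of sample quantiles against theoretical normal quantiles
ggplot(penguins, aes(sample = body_mass_g)) +
  stat_qq() +
  stat_qq_line() +
  ggtitle("Q-Q plot of penguin body mass (g)") +
  theme_minimal()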
3) Statistical Test
Shapiro-Wilk Test
The Shapiro-Wilk test is widely used for testing normality. This test checks the null hypothesis that the data were drawn from a normal distribution.
It is generally preferred for small sample sizes (< 50 samples), but can be used for larger ones; note that R's shapiro.test() accepts between 3 and 5000 non-missing values.
shapiro.test(penguins$body_mass_g)
	Shapiro-Wilk normality test

data:  penguins$body_mass_g
W = 0.95921, p-value = 3.679e-08
Kolmogorov-Smirnov Test (K-S Test)
The K-S test compares the empirical distribution function of the data with the expected distribution function for a normal distribution.
This test is more sensitive towards the center of the distribution than the tails.
# Removing missing values
library(dplyr)
data1 <- penguins %>% filter(!is.na(body_mass_g))

# Standardizing the data
standardized_data <- scale(data1$body_mass_g)

# Kolmogorov-Smirnov Test
ks.test(standardized_data, "pnorm")
	Asymptotic one-sample Kolmogorov-Smirnov test

data:  standardized_data
D = 0.10408, p-value = 0.00121
alternative hypothesis: two-sided
Lilliefors Test
A modification of the K-S test, the Lilliefors test is specifically designed for testing normality.
It is particularly useful when the mean and variance of the distribution are unknown.
library(nortest)
lillie.test(penguins$body_mass_g)
	Lilliefors (Kolmogorov-Smirnov) normality test

data:  penguins$body_mass_g
D = 0.10408, p-value = 1.544e-09
Basic Statistical Summarization
Data summarization is a crucial aspect of statistical analysis, providing a way to describe and understand large datasets through a few summary statistics. Among the key concepts in data summarization are measures of central tendency, position, and dispersion. Each of these measures gives different insights into the nature of the data (the measures below apply to continuous/numerical data only).
1. Measures of Central Tendency
Measures of central tendency describe the center point or typical value of a dataset. The most common measures are:
Mean (Arithmetic Average): The sum of all values divided by the number of values. It’s sensitive to outliers and can be skewed by them.
# Calculating the mean of a numerical variable
mean(penguins$body_mass_g, na.rm = TRUE)
[1] 4201.754
Median: The middle value when the data is sorted in ascending order. It’s less affected by outliers and skewness and provides a better central value for skewed distributions.
# Calculating the median of a numerical variable
median(penguins$body_mass_g, na.rm = TRUE)
[1] 4050
Mode: The most frequently occurring value in the dataset. There can be more than one mode in a dataset (bimodal, multimodal). Useful in understanding the most common value, especially for categorical data.
There is no specific function in base R to compute the mode; the DescTools package provides one.
library(DescTools)
DescTools::Mode(penguins$island)
[1] Biscoe
attr(,"freq")
[1] 168
Levels: Biscoe Dream Torgersen
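If you would rather avoid the extra dependency, a small base-R helper built on table() and which.max() returns the same answer. This is a minimal sketch; mode_val is a hypothetical name introduced here, and it returns only the first mode if there are ties:
# A base-R mode helper (returns the first mode when there are ties)
mode_val <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}
mode_val(penguins$island)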
2. Measures of Position
Measures of position describe how data points fall in relation to the distribution or to each other. These include:
Percentiles: Values below which a certain percentage of the data falls. For example, the 25th percentile (or 1st quartile) is the value below which 25% of the data lies.
quantile(penguins$body_mass_g, na.rm=TRUE)
  0%  25%  50%  75% 100% 
2700 3550 4050 4750 6300 
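quantile() defaults to the quartiles shown above, but any percentile can be requested via the probs argument. A minimal sketch; the 10th and 90th percentiles are chosen arbitrarily here:
# Arbitrary percentiles via the probs argument
quantile(penguins$body_mass_g, probs = c(0.10, 0.90), na.rm = TRUE)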
Quartiles: Special percentiles that divide the dataset into four equal parts. The median is the second quartile.
# 1st Quartile
quantile(penguins$body_mass_g, 0.25, na.rm = TRUE)
 25% 
3550 
# 2nd Quartile (same as the median)
quantile(penguins$body_mass_g, 0.50, na.rm = TRUE)
 50% 
4050 
# 3rd Quartile
quantile(penguins$body_mass_g, 0.75, na.rm = TRUE)
 75% 
4750 
Interquartile Range (IQR): The range between the first and third quartiles (25th and 75th percentiles). It represents the middle 50% of the data and is a measure of variability that’s not influenced by outliers.
IQR(penguins$body_mass_g, na.rm=TRUE)
[1] 1200
3. Measures of Dispersion
Measures of dispersion or variability tell us about the spread of the data points in a dataset:
min: To find the minimum value in the variable
min(penguins$body_mass_g, na.rm = TRUE)
[1] 2700
max: To find the maximum value in the variable
max(penguins$body_mass_g, na.rm=TRUE)
[1] 6300
length: to find how many observations are in the variable (including missing values)
length(penguins$body_mass_g)
[1] 344
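Note that length() counts missing values as well (344 here, versus 342 non-missing body-mass values). A one-line way to count only the non-missing observations:
# Count non-missing observations only
sum(!is.na(penguins$body_mass_g))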
Variance: The average of the squared differences from the mean. It gives a sense of the spread of the data, but it’s not in the same unit as the data.
var(penguins$body_mass_g, na.rm=TRUE)
[1] 643131.1
Standard Deviation (SD): The square root of the variance. It’s in the same units as the data and describes how far data points tend to deviate from the mean.
sd(penguins$body_mass_g, na.rm=TRUE)
[1] 801.9545
The summary() Function
The summary() function in R is a generic function used to produce result summaries of various model and data objects. When applied to a data frame, it provides a quick overview of the statistical properties of each column. The function is particularly useful for getting a rapid sense of the data, especially during the initial stages of data analysis.
Key Features of summary() in R:
Applicability to Different Objects: The summary() function can be used on different types of objects in R, including vectors, data frames, and model objects. The output format varies depending on the type of object.
Default Output for Data Frames: For a data frame, summary() typically returns the following statistics for each column. For numeric variables: Minimum, 1st Quartile, Median, Mean, 3rd Quartile, Maximum. For factor variables: counts for each level, and an NA count if there are missing values.
Handling of Missing Values: The function includes NA values in its output, providing a count of missing values, which is crucial for data cleaning and preprocessing.
Customization: The behavior of summary() can be customized for user-defined classes (S3 or S4) in R. This means that when you create a new type of object, you can also define what summary() should return when applied to objects of this type (see the sketch after this list).
Use in Exploratory Data Analysis (EDA): It is often used as a preliminary step in EDA to get a sense of the data distribution, identify possible outliers, and detect missing values.
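A minimal sketch of that customization using S3. The class name exam_scores and the helper new_exam_scores() are hypothetical, invented purely for illustration:
# A hypothetical S3 class, "exam_scores", invented for illustration
new_exam_scores <- function(x) structure(x, class = c("exam_scores", class(x)))

# Define what summary() returns for objects of this class
summary.exam_scores <- function(object, ...) {
  c(n    = length(object),
    mean = mean(unclass(object), na.rm = TRUE),
    sd   = sd(unclass(object), na.rm = TRUE))
}

scores <- new_exam_scores(c(67, 72, 85, 90, NA))
summary(scores) # dispatches to summary.exam_scores()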
summary(penguins)
species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 female:165 Min. :2007
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Median :197.0 Median :4050 NA's : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
While summary() provides a quick and useful overview, it's often just a starting point for data analysis. Depending on the results, you might need more detailed analysis, such as specific statistical tests or detailed data visualizations. The function is particularly handy for quickly checking data after importation, allowing for a rapid assessment of data quality, structure, and potential areas that may require further investigation.
Apply Function
We can calculate summary statistics for several variables at once by using sapply(). This function lets us obtain the numerical summary measures for all numeric variables in one call. We will discuss the apply family in another lecture.
First, select only the numerical variables in the dataset:
numeric_data <- penguins[, sapply(penguins, is.numeric)]
Mean
sapply(numeric_data, mean, na.rm=TRUE)
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
43.92193 17.15117 200.91520 4201.75439
year
2008.02907
Standard Deviation
sapply(numeric_data, sd, na.rm=TRUE)
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
5.4595837 1.9747932 14.0617137 801.9545357
year
0.8183559
Sum / total in a column
sapply(numeric_data, sum, na.rm=T)
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
15021.3 5865.7 68713.0 1437000.0
year
690762.0
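sapply() also accepts anonymous functions, which is handy for summaries base R does not provide by name. A minimal sketch counting the missing values in each numeric column:
# Count NAs in each numeric column
sapply(numeric_data, function(x) sum(is.na(x)))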
Summary Statistics by Group
Often in data analysis, when there is a categorical variable, we want to compare a numerical measure across its categories. For example, we may want the mean and standard deviation by gender (male and female) instead of summarizing the mean and standard deviation for the whole dataset.
Example: calculating the mean and standard deviation for the mtcars dataset
Calculating the mean and standard deviation of mpg by cyl:
mtcars %>%
  group_by(factor(cyl)) %>%
  summarise(mean_mpg = mean(mpg),
            sd_mpg = sd(mpg))
# A tibble: 3 × 3
`factor(cyl)` mean_mpg sd_mpg
<fct> <dbl> <dbl>
1 4 26.7 4.51
2 6 19.7 1.45
3 8 15.1 2.56
Example: calculating the median and IQR for non-normally distributed data (mtcars dataset)
Calculating the median and IQR of mpg by am:
mtcars %>%
  group_by(am) %>%
  filter(!is.na(am)) %>%
  summarise(median_mpg = median(mpg),
            IQR_mpg = IQR(mpg))
# A tibble: 2 × 3
am median_mpg IQR_mpg
<dbl> <dbl> <dbl>
1 0 17.3 4.25
2 1 22.8 9.4
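If your dplyr version is 1.0 or later, across() computes several statistics for several variables in a single summarise() call. A sketch on the penguins data; the variables chosen here are arbitrary:
penguins %>%
  group_by(species) %>%
  summarise(across(c(body_mass_g, flipper_length_mm),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd   = ~ sd(.x, na.rm = TRUE))))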
Several useful packages for numerical data
1) Hmisc package
library(Hmisc)
Hmisc::describe(penguins$bill_length_mm)
penguins$bill_length_mm
n missing distinct Info Mean Gmd .05 .10
342 2 164 1 43.92 6.274 35.70 36.60
.25 .50 .75 .90 .95
39.23 44.45 48.50 50.80 51.99
lowest : 32.1 33.1 33.5 34 34.1, highest: 55.1 55.8 55.9 58 59.6
2) psych package
library(psych)
psych::describe(penguins$bill_length_mm)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 342 43.92 5.46 44.45 43.91 7.04 32.1 59.6 27.5 0.05 -0.89 0.3
We can also describe the whole dataset:
psych::describe(penguins)
vars n mean sd median trimmed mad min max
species* 1 344 1.92 0.89 2.00 1.90 1.48 1.0 3.0
island* 2 344 1.66 0.73 2.00 1.58 1.48 1.0 3.0
bill_length_mm 3 342 43.92 5.46 44.45 43.91 7.04 32.1 59.6
bill_depth_mm 4 342 17.15 1.97 17.30 17.17 2.22 13.1 21.5
flipper_length_mm 5 342 200.92 14.06 197.00 200.34 16.31 172.0 231.0
body_mass_g 6 342 4201.75 801.95 4050.00 4154.01 889.56 2700.0 6300.0
sex* 7 333 1.50 0.50 2.00 1.51 0.00 1.0 2.0
year 8 344 2008.03 0.82 2008.00 2008.04 1.48 2007.0 2009.0
range skew kurtosis se
species* 2.0 0.16 -1.73 0.05
island* 2.0 0.61 -0.91 0.04
bill_length_mm 27.5 0.05 -0.89 0.30
bill_depth_mm 8.4 -0.14 -0.92 0.11
flipper_length_mm 59.0 0.34 -1.00 0.76
body_mass_g 3600.0 0.47 -0.74 43.36
sex* 1.0 -0.02 -2.01 0.03
year 2.0 -0.05 -1.51 0.04
3) skimr package
library(skimr)
skim(penguins$bill_length_mm)
Name | penguins$bill_length_mm |
Number of rows | 344 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
data | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
We can also skim the whole dataset:
skim(penguins)
Name | penguins |
Number of rows | 344 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
Summarizing Categorical Data
Summarizing categorical data is an essential part of data analysis, especially when dealing with survey results, demographic information, or any data where variables are qualitative rather than quantitative. The goal is to gain insights into the distribution of categories, identify patterns, and make inferences about the population being studied.
Key Concepts in Summarizing Categorical Data:
Frequency Counts:
The most basic form of summarization for categorical data is to count the number of occurrences of each category.
In R, the table() function is commonly used for this purpose.
# To tabulate categorical data
table(penguins$species)
   Adelie Chinstrap    Gentoo 
      152        68       124 
Proportions and Percentages:
Converting frequency counts into proportions or percentages provides a clearer understanding of the data relative to the whole.
This is particularly useful when comparing groups of different sizes.
# To get the relative frequency by group
prop.table(table(penguins$species))
   Adelie Chinstrap    Gentoo 
0.4418605 0.1976744 0.3604651 
To combine the frequency value and percentage value together:
freq1 <- as.data.frame(table(penguins$species))
percent1 <- as.data.frame(prop.table(table(penguins$species)) * 100)
names(percent1)[2] <- "percentage"
cbind(freq1, percent1[2])
Var1 Freq percentage
1 Adelie 152 44.18605
2 Chinstrap 68 19.76744
3 Gentoo 124 36.04651
combine <- cbind(freq1, round(percent1[2], 2))
combine
Var1 Freq percentage
1 Adelie 152 44.19
2 Chinstrap 68 19.77
3 Gentoo 124 36.05
Several Useful packages for categorical data analysis
1) janitor package
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
tabyl(penguins$species, sort=TRUE)
penguins$species n percent
Adelie 152 0.4418605
Chinstrap 68 0.1976744
Gentoo 124 0.3604651
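janitor also provides adorn_* helpers for polishing tabyl output. A minimal sketch that adds a total row and formats the percentages; the digits setting is an arbitrary choice:
penguins %>%
  tabyl(species) %>%
  adorn_totals("row") %>%          # append a Total row
  adorn_pct_formatting(digits = 2) # show percents as e.g. "44.19%"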
2) epiDisplay package
library(epiDisplay)
Loading required package: foreign
Loading required package: survival
Loading required package: MASS
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
Loading required package: nnet
Attaching package: 'epiDisplay'
The following objects are masked from 'package:psych':
alpha, cs, lookup
The following object is masked from 'package:ggplot2':
alpha
tab1(penguins$species, sort.group = "decreasing",
cum.percent = TRUE)
penguins$species :
Frequency Percent Cum. percent
Adelie 152 44.2 44.2
Gentoo 124 36.0 80.2
Chinstrap 68 19.8 100.0
Total 344 100.0 100.0
3) summarytools package
library(summarytools)
Attaching package: 'summarytools'
The following objects are masked from 'package:Hmisc':
label, label<-
summarytools::freq(penguins$species)
Frequencies
penguins$species
Type: Factor
Freq % Valid % Valid Cum. % Total % Total Cum.
--------------- ------ --------- -------------- --------- --------------
Adelie 152 44.19 44.19 44.19 44.19
Chinstrap 68 19.77 63.95 19.77 63.95
Gentoo 124 36.05 100.00 36.05 100.00
<NA> 0 0.00 100.00
Total 344 100.00 100.00 100.00 100.00
4) gmodels package
library(gmodels)
Registered S3 method overwritten by 'gdata':
method from
reorder.factor DescTools
Attaching package: 'gmodels'
The following object is masked from 'package:epiDisplay':
ci
CrossTable(penguins$species, format = "SPSS") # returns output in SPSS format
Cell Contents
|-------------------------|
| Count |
| Row Percent |
|-------------------------|
Total Observations in Table: 344
| Adelie | Chinstrap | Gentoo |
|-----------|-----------|-----------|
| 152 | 68 | 124 |
| 44.186% | 19.767% | 36.047% |
|-----------|-----------|-----------|
This package can also be used to build cross-tabulation tables, for example a tabulation of species by sex.
library(gmodels)
CrossTable(penguins$species, penguins$sex,
           format = "SPSS",
           expected = T,      # expected values
           prop.r = T,        # row percentages
           prop.c = F,        # column percentages
           prop.t = F,        # overall percentages
           prop.chisq = F,    # chi-square contribution of each cell
           chisq = T,         # the results of a chi-square test
           fisher = F,        # the result of a Fisher exact test
           mcnemar = F)       # the result of a McNemar test
Cell Contents
|-------------------------|
| Count |
| Expected Values |
| Row Percent |
|-------------------------|
Total Observations in Table: 333
| penguins$sex
penguins$species | female | male | Row Total |
-----------------|-----------|-----------|-----------|
Adelie | 73 | 73 | 146 |
| 72.342 | 73.658 | |
| 50.000% | 50.000% | 43.844% |
-----------------|-----------|-----------|-----------|
Chinstrap | 34 | 34 | 68 |
| 33.694 | 34.306 | |
| 50.000% | 50.000% | 20.420% |
-----------------|-----------|-----------|-----------|
Gentoo | 58 | 61 | 119 |
| 58.964 | 60.036 | |
| 48.739% | 51.261% | 35.736% |
-----------------|-----------|-----------|-----------|
Column Total | 165 | 168 | 333 |
-----------------|-----------|-----------|-----------|
Statistics for All Table Factors
Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 = 0.04860717 d.f. = 2 p = 0.9759894
Minimum expected frequency: 33.69369
Let’s try a real example
Example 1: TB Incidence
We will use the TB incidence dataset, which can be downloaded from this link: https://dataintror.s3.ap-southeast-1.amazonaws.com/tb_incidence.xlsx
Step 1: Load the dataset
library(readxl) # library for importing Excel files
url <- "https://dataintror.s3.ap-southeast-1.amazonaws.com/tb_incidence.xlsx"
destfile <- "tb_incidence.xlsx"
curl::curl_download(url, destfile)
data1 <- read_excel(destfile)
Step 2: Rename the first variable to “country”.
library(dplyr)
data1 <- rename(data1, country = 'TB incidence, all forms (per 100 000 population per year)')
names(data1[1])
[1] "country"
Step 3: Find the mean for all variables before the year 2000.
avg <- dplyr::select(data1, starts_with("1"))
colnames(avg)
[1] "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999"
sapply(avg, mean, na.rm = T) # alternative: colMeans(avg, na.rm = T)
1990 1991 1992 1993 1994 1995 1996 1997
105.5797 107.6715 108.3140 110.3188 111.9662 114.1981 115.3527 118.8792
1998 1999
121.5169 125.0435
Step 4: Create a new variable to represent the mean of TB incidence before year 2000 for all observations.
data1$before_2000_avg <- rowMeans(avg, na.rm = T)
names(data1)
[1] "country" "1990" "1991" "1992"
[5] "1993" "1994" "1995" "1996"
[9] "1997" "1998" "1999" "2000"
[13] "2001" "2002" "2003" "2004"
[17] "2005" "2006" "2007" "before_2000_avg"
head(data1[, c("country", "before_2000_avg")])
# A tibble: 6 × 2
country before_2000_avg
<chr> <dbl>
1 Afghanistan 168
2 Albania 26.3
3 Algeria 41.8
4 American Samoa 8.5
5 Andorra 28.8
6 Angola 225.
Example 2: Youth Tobacco
Import the dataset from this link: https://dataintror.s3.ap-southeast-1.amazonaws.com/Youth_Tobacco_Survey_YTS_Data.csv
Step 1: Loading the dataset
library(readr)
data2 <- read_csv("https://dataintror.s3.ap-southeast-1.amazonaws.com/Youth_Tobacco_Survey_YTS_Data.csv")
Rows: 9794 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (24): LocationAbbr, LocationDesc, TopicType, TopicDesc, MeasureDesc, Dat...
dbl (7): YEAR, Data_Value, Data_Value_Std_Err, Low_Confidence_Limit, High_C...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Step 2: Explore these variables, taking note of the frequency of each category (MeasureDesc, Gender, and Response).
summarytools::freq(data2$MeasureDesc)
Frequencies
data2$MeasureDesc
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
--------------------------------------------------------------- ------ --------- -------------- --------- --------------
Percent of Current Smokers Who Want to Quit 1205 12.30 12.30 12.30 12.30
Quit Attempt in Past Year Among Current Cigarette Smokers 1041 10.63 22.93 10.63 22.93
Smoking Status 3783 38.63 61.56 38.63 61.56
User Status 3765 38.44 100.00 38.44 100.00
<NA> 0 0.00 100.00
Total 9794 100.00 100.00 100.00 100.00
summarytools::freq(data2$Gender)
Frequencies
data2$Gender
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
------------- ------ --------- -------------- --------- --------------
Female 3256 33.24 33.24 33.24 33.24
Male 3256 33.24 66.49 33.24 66.49
Overall 3282 33.51 100.00 33.51 100.00
<NA> 0 0.00 100.00
Total 9794 100.00 100.00 100.00 100.00
summarytools::freq(data2$Response)
Frequencies
data2$Response
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
-------------- ------ --------- -------------- --------- --------------
Current 2514 33.31 33.31 25.67 25.67
Ever 2520 33.39 66.69 25.73 51.40
Frequent 2514 33.31 100.00 25.67 77.07
<NA> 2246 22.93 100.00
Total 9794 100.00 100.00 100.00 100.00
Step 3: Filter MeasureDesc == "Smoking Status", Gender == "Male", and Response == "Frequent". Show only YEAR, LocationDesc, and Data_Value, and save this filtered data into sub_ytl.
sub_ytl <- data2 %>%
  dplyr::filter(MeasureDesc == "Smoking Status",
                Gender == "Male",
                Response == "Frequent") %>%
  dplyr::select(YEAR, LocationDesc, Data_Value)
Step 4: Summarise Data_Value by mean based on YEAR.
sub_ytl %>%
  dplyr::group_by(YEAR) %>%
  dplyr::summarize(mean = mean(Data_Value))
# A tibble: 17 × 2
YEAR mean
<dbl> <dbl>
1 1999 8.61
2 2000 8.51
3 2001 6.4
4 2002 7.26
5 2003 5.22
6 2004 5.71
7 2005 5.78
8 2006 5.9
9 2007 5.53
10 2008 5.39
11 2009 4.6
12 2010 5.00
13 2011 5.03
14 2012 3.95
15 2013 2.81
16 2014 2.56
17 2015 2.17
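As an optional follow-up, the declining trend in the table above is easier to see in a plot. A minimal sketch reusing the same pipeline; the title wording is our own:
sub_ytl %>%
  dplyr::group_by(YEAR) %>%
  dplyr::summarize(mean = mean(Data_Value)) %>%
  ggplot(aes(x = YEAR, y = mean)) +
  geom_line() +
  geom_point() +
  ggtitle("Mean Data_Value (frequent smoking, male youth) by year") +
  theme_minimal()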