Data Summarization

Author

Dr. Mohammad Nasir Abdullah

Data Summarization

Data summarization is an essential process in statistical programming, which involves reducing and simplifying large datasets into more manageable and understandable forms. This process is crucial as it helps in highlighting the key aspects of the data by extracting important patterns, trends, and relationships. In the realm of statistical analysis, summarization is not just about making the data smaller or simpler; it’s about capturing the essence of the data in a way that is both informative and useful for analysis.

Significance in Large Datasets

The advent of big data has made data summarization more important than ever. With the sheer volume of data available today, it’s practically impossible to analyze every individual data point. Summarization helps in distilling large datasets to a form where they can be easily interpreted and analyzed. This process not only saves time and computational resources but also aids in making data-driven decisions more effectively.

For instance, summarizing sales data over years can reveal trends and patterns that might not be evident when looking at daily sales figures. Similarly, summarizing survey data can help in quickly understanding the general opinion or trend, without getting lost in the myriad of individual responses.

Types of Summarization Techniques

  1. Numerical Summarization: This involves using statistical measures to summarize the key characteristics of numerical data. Techniques include calculating measures of central tendency (like mean, median, and mode) and measures of variability or spread (like range, variance, and standard deviation). These techniques are fundamental in providing a quick snapshot of the data’s overall distribution and central values.

  2. Categorical Summarization: When dealing with categorical (or qualitative) data, summarization often involves understanding the frequency or occurrence of different categories. Techniques include creating frequency tables, cross-tabulations, and using measures like mode. This type of summarization is particularly useful in understanding the distribution of categorical variables, like customer categories or product types.

  3. Visual Summarization: Visual representations of data, such as histograms, bar charts, box plots, and scatter plots, provide an intuitive way to summarize and understand complex datasets. These techniques are invaluable in revealing patterns, trends, outliers, and relationships in data that might not be obvious in textual or numerical summaries.

In R, these summarization techniques are supported by a variety of functions and packages, making it an ideal environment for both basic and advanced data summarization tasks. This chapter will delve into these techniques, providing the reader with the knowledge and tools to effectively summarize and interpret large datasets using R.

Summarizing Numerical Data

Introduction to Normality Testing

Normality testing is a fundamental step in statistical analysis, particularly when deciding which statistical methods to apply. Many statistical techniques assume that the data follows a normal (or Gaussian) distribution. However, real-world data often deviates from this idealized distribution. Therefore, assessing the normality of your dataset is crucial before applying techniques that assume normality, such as parametric tests.

In R, there are several methods to test for normality, including graphical methods like Q-Q plots and statistical tests like the Shapiro-Wilk test. Let’s explore how to perform these tests in R with an example dataset.

Graphical Methods to Assess Normality

1) Histogram

library(palmerpenguins)
library(ggplot2)

data <- na.omit(penguins$body_mass_g)  # Remove NA values

num_bins_sturges <- 1 + log2(length(data)) # calculating number of bins using Sturges' rule

ggplot(data.frame(data), aes(x = data)) +
    geom_histogram(bins = round(num_bins_sturges)) +
    ggtitle("Histogram with Sturges' Rule") + theme_light()

2) Boxplot

ggplot(penguins, aes(body_mass_g)) + geom_boxplot() +
  ggtitle("A Boxplot showing distribution of weight of Penguins in grams") + 
  theme_minimal() + 
  scale_y_continuous(limits=c(-1,1))
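3) Q-Q Plot

A Q-Q plot, mentioned in the introduction above, plots the sample quantiles against the theoretical quantiles of a normal distribution; points falling close to the reference line suggest approximate normality. A minimal sketch using ggplot2’s stat_qq() and stat_qq_line() on the same body_mass_g values:

ggplot(data.frame(data), aes(sample = data)) +
    stat_qq() +      # sample quantiles vs. theoretical normal quantiles
    stat_qq_line() + # reference line drawn through the quartiles
    ggtitle("Q-Q Plot of Penguin Body Mass (g)") + theme_light()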

Statistical Tests to Assess Normality

  1. Shapiro-Wilk Test

    • The Shapiro-Wilk test is widely used for testing normality. This test checks the null hypothesis that the data were drawn from a normal distribution.

    • It is generally preferred for small sample sizes (< 50 samples), but can be used for larger ones.

    shapiro.test(penguins$body_mass_g)
    
        Shapiro-Wilk normality test
    
    data:  penguins$body_mass_g
    W = 0.95921, p-value = 3.679e-08
  2. Kolmogorov-Smirnov Test (K-S Test)

    • The K-S test compares the empirical distribution function of the data with the expected distribution function for a normal distribution.

    • This test is more sensitive towards the center of the distribution than the tails.

    #removing missing value
    library(dplyr)
    data1 <- penguins %>%
      filter(!is.na(body_mass_g))
    # Standardizing the data
    standardized_data <- scale(data1$body_mass_g)
    
    # Kolmogorov-Smirnov Test
    ks.test(standardized_data, "pnorm")
    
        Asymptotic one-sample Kolmogorov-Smirnov test
    
    data:  standardized_data
    D = 0.10408, p-value = 0.00121
    alternative hypothesis: two-sided
  3. Lilliefors Test

    • A modification of the K-S test, the Lilliefors test is specifically designed for testing normality.

    • It is particularly useful when the mean and variance of the distribution are unknown.

      library(nortest)
      lillie.test(penguins$body_mass_g)
      
          Lilliefors (Kolmogorov-Smirnov) normality test
      
      data:  penguins$body_mass_g
      D = 0.10408, p-value = 1.544e-09

All three tests return p-values far below 0.05, so we reject the null hypothesis of normality: the penguins’ body mass does not follow a normal distribution, which is consistent with the graphical checks above.

Basic Statistical Summarization

Data summarization is a crucial aspect of statistical analysis, providing a way to describe and understand large datasets through a few summary statistics. Among the key concepts in data summarization are measures of central tendency, position, and dispersion. Each of these measures gives different insights into the nature of the data. The measures in this section apply to continuous/numerical data only.

1. Measures of Central Tendency

Measures of central tendency describe the center point or typical value of a dataset. The most common measures are:

  • Mean (Arithmetic Average): The sum of all values divided by the number of values. It’s sensitive to outliers and can be skewed by them (see the small illustration after this list).

    # calculating mean of numerical variable
    mean(penguins$body_mass_g, na.rm = TRUE)
    [1] 4201.754
  • Median: The middle value when the data is sorted in ascending order. It’s less affected by outliers and skewness and provides a better central value for skewed distributions.

    #calculating median of numerical variable
    median(penguins$body_mass_g, na.rm=TRUE)
    [1] 4050
  • Mode: The most frequently occurring value in the dataset. There can be more than one mode in a dataset (bimodal, multimodal). Useful in understanding the most common value, especially for categorical data.

    • There is no specific function for the mode in base R; we can use the DescTools package:
    library(DescTools)
    DescTools::Mode(penguins$island)
    [1] Biscoe
    attr(,"freq")
    [1] 168
    Levels: Biscoe Dream Torgersen
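A small illustration of the outlier sensitivity noted above, using a made-up vector whose values are chosen purely for demonstration:

x <- c(2, 3, 3, 4, 100) # 100 is an extreme outlier
mean(x)                 # pulled strongly toward the outlier
[1] 22.4
median(x)               # unaffected by the outlier
[1] 3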

2. Measures of Position

Measures of position describe how data points fall in relation to the distribution or to each other. These include:

  • Percentiles: Values below which a certain percentage of the data falls. For example, the 25th percentile (or 1st quartile) is the value below which 25% of the data lies.

    quantile(penguins$body_mass_g, na.rm=TRUE)
      0%  25%  50%  75% 100% 
    2700 3550 4050 4750 6300 
  • Quartiles: Special percentiles that divide the dataset into four equal parts. The median is the second quartile.

    #1st Quartile
    quantile(penguins$body_mass_g, 0.25, na.rm=TRUE)
     25% 
    3550 
    #2nd Quartile
    quantile(penguins$body_mass_g, 0.50, na.rm=TRUE) #same as median
     50% 
    4050 
    #3rd Quartile
    quantile(penguins$body_mass_g, 0.75, na.rm=TRUE)
     75% 
    4750 
  • Interquartile Range (IQR): The range between the first and third quartiles (25th and 75th percentiles). It represents the middle 50% of the data and is a measure of variability that’s not influenced by outliers.

    IQR(penguins$body_mass_g, na.rm=TRUE)
    [1] 1200

3. Measures of Dispersion

Measures of dispersion or variability tell us about the spread of the data points in a dataset:

  • min: To find the minimum value in the variable

    min(penguins$body_mass_g, na.rm = TRUE)
    [1] 2700
  • max: To find the maximum value in the variable

    max(penguins$body_mass_g, na.rm=TRUE)
    [1] 6300
  • length: to count the number of observations in the variable (note that the count includes missing values)

    length(penguins$body_mass_g)
    [1] 344
  • Variance: The average of the squared differences from the mean. It gives a sense of the spread of the data, but it’s not in the same unit as the data.

    var(penguins$body_mass_g, na.rm=TRUE)
    [1] 643131.1
  • Standard Deviation (SD): The square root of the variance. It’s in the same units as the data and describes how far data points tend to deviate from the mean.

    sd(penguins$body_mass_g, na.rm=TRUE)
    [1] 801.9545
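The range, listed earlier among the measures of spread, follows directly from the minimum and maximum above; a one-line sketch:

# range() returns c(min, max); diff() gives max - min (6300 - 2700)
diff(range(penguins$body_mass_g, na.rm = TRUE))
[1] 3600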

The summary() Function

The summary() function in R is a generic function used to produce result summaries of various model and data objects. When applied to a data frame, it provides a quick overview of the statistical properties of each column. The function is particularly useful for getting a rapid sense of the data, especially during the initial stages of data analysis.

Key Features of summary() in R:

  1. Applicability to Different Objects: The summary() function can be used on different types of objects in R, including vectors, data frames, and model objects. The output format varies depending on the type of object (a model example is shown below).

  2. Default Output for Data Frames: For a data frame, summary() typically returns the following statistics for each column:

    • For numeric variables: Minimum, 1st Quartile, Median, Mean, 3rd Quartile, Maximum.

    • For factor variables: Counts for each level, and NA count if there are missing values.

  3. Handling of Missing Values: The function includes NA values in its output, providing a count of missing values, which is crucial for data cleaning and preprocessing.

  4. Customization: The behavior of summary() can be customized for user-defined classes (S3 or S4) in R. This means that when you create a new type of object, you can also define what summary() should return when applied to objects of this type (see the sketch after the notes below).

  5. Use in Exploratory Data Analysis (EDA): It is often used as a preliminary step in EDA to get a sense of the data distribution, identify possible outliers, and detect missing values.

summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 
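As noted in point 1 above, summary() also works on model objects, where it reports coefficients, residuals, and fit statistics rather than column summaries. A minimal sketch using a simple linear model on the penguins data (output omitted; the model formula is chosen only for illustration):

summary(lm(body_mass_g ~ flipper_length_mm, data = penguins))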
Notes:
  • While summary() provides a quick and useful overview, it’s often just a starting point for data analysis. Depending on the results, you might need more detailed analysis, such as specific statistical tests or detailed data visualizations.

  • The function is particularly handy for quickly checking data after importation, allowing for a rapid assessment of data quality, structure, and potential areas that may require further investigation.
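As mentioned in point 4 above, summary() can be customized for user-defined classes. A minimal S3 sketch, with the class name scores chosen purely for illustration:

# constructor for a hypothetical "scores" class
new_scores <- function(x) structure(list(values = x), class = "scores")

# defining summary.scores() makes summary() dispatch here for "scores" objects
summary.scores <- function(object, ...) {
  cat("scores object:", length(object$values),
      "values, mean =", mean(object$values), "\n")
}

summary(new_scores(c(80, 90, 100)))
scores object: 3 values, mean = 90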

The sapply() Function

We can calculate several summary statistics at once using sapply(), which applies a function to each column and returns the numerical measure for all numeric variables. The apply family will be discussed in another lecture.

First, select only the numerical variables in the dataset:

numeric_data <- penguins[, sapply(penguins, is.numeric)]

Mean

sapply(numeric_data, mean, na.rm=TRUE)
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
         43.92193          17.15117         200.91520        4201.75439 
             year 
       2008.02907 

Standard Deviation

sapply(numeric_data, sd, na.rm=TRUE)
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
        5.4595837         1.9747932        14.0617137       801.9545357 
             year 
        0.8183559 

Sum / total in a column

sapply(numeric_data, sum, na.rm=T)
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
          15021.3            5865.7           68713.0         1437000.0 
             year 
         690762.0 

Summary Statistics by group

Often in data analysis, when a categorical variable is present, we want to compare a numerical measure across its categories. For example, we may want the mean and standard deviation by gender (male and female) instead of summarizing the mean and standard deviation for the whole dataset.
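For instance, with the penguins data and the dplyr functions loaded earlier, a sketch of grouping by sex looks like this (output omitted):

penguins %>%
  filter(!is.na(sex)) %>% # drop rows with missing sex
  group_by(sex) %>%
  summarise(mean_mass = mean(body_mass_g, na.rm = TRUE),
            sd_mass = sd(body_mass_g, na.rm = TRUE))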

Example: calculating mean and standard deviation for mtcars dataset

Calculating Mean and Standard Deviation for mpg by cyl.

mtcars %>%
  group_by(factor(cyl)) %>%
  summarise(mean_mpg = mean(mpg),
            sd_mpg = sd(mpg))
# A tibble: 3 × 3
  `factor(cyl)` mean_mpg sd_mpg
  <fct>            <dbl>  <dbl>
1 4                 26.7   4.51
2 6                 19.7   1.45
3 8                 15.1   2.56

Example: calculating median and IQR for non-normal distributed data (mtcars dataset)

Calculating Median and IQR for mpg by am.

mtcars %>%
  group_by(am) %>%
  filter(!is.na(am)) %>%
  summarise(median_mpg = median(mpg),
            IQR_mpg = IQR(mpg))
# A tibble: 2 × 3
     am median_mpg IQR_mpg
  <dbl>      <dbl>   <dbl>
1     0       17.3    4.25
2     1       22.8    9.4 

Several useful packages for numerical data

1) Hmisc package

library(Hmisc)
Hmisc::describe(penguins$bill_length_mm)
penguins$bill_length_mm 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     342        2      164        1    43.92    6.274    35.70    36.60 
     .25      .50      .75      .90      .95 
   39.23    44.45    48.50    50.80    51.99 

lowest : 32.1 33.1 33.5 34   34.1, highest: 55.1 55.8 55.9 58   59.6

2) psych package

library(psych)
psych::describe(penguins$bill_length_mm)
   vars   n  mean   sd median trimmed  mad  min  max range skew kurtosis  se
X1    1 342 43.92 5.46  44.45   43.91 7.04 32.1 59.6  27.5 0.05    -0.89 0.3

We can also describe the whole dataset:

psych::describe(penguins)
                  vars   n    mean     sd  median trimmed    mad    min    max
species*             1 344    1.92   0.89    2.00    1.90   1.48    1.0    3.0
island*              2 344    1.66   0.73    2.00    1.58   1.48    1.0    3.0
bill_length_mm       3 342   43.92   5.46   44.45   43.91   7.04   32.1   59.6
bill_depth_mm        4 342   17.15   1.97   17.30   17.17   2.22   13.1   21.5
flipper_length_mm    5 342  200.92  14.06  197.00  200.34  16.31  172.0  231.0
body_mass_g          6 342 4201.75 801.95 4050.00 4154.01 889.56 2700.0 6300.0
sex*                 7 333    1.50   0.50    2.00    1.51   0.00    1.0    2.0
year                 8 344 2008.03   0.82 2008.00 2008.04   1.48 2007.0 2009.0
                   range  skew kurtosis    se
species*             2.0  0.16    -1.73  0.05
island*              2.0  0.61    -0.91  0.04
bill_length_mm      27.5  0.05    -0.89  0.30
bill_depth_mm        8.4 -0.14    -0.92  0.11
flipper_length_mm   59.0  0.34    -1.00  0.76
body_mass_g       3600.0  0.47    -0.74 43.36
sex*                 1.0 -0.02    -2.01  0.03
year                 2.0 -0.05    -1.51  0.04

3) skimr package

library(skimr)
skim(penguins$bill_length_mm)
Data summary
Name penguins$bill_length_mm
Number of rows 344
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁

We can also skim the whole dataset:

skim(penguins)
Data summary
Name penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 ▇▁▇▁▇

Summarizing Categorical Data

Summarizing categorical data is an essential part of data analysis, especially when dealing with survey results, demographic information, or any data where variables are qualitative rather than quantitative. The goal is to gain insights into the distribution of categories, identify patterns, and make inferences about the population being studied.

Key Concepts in Summarizing Categorical Data:

  1. Frequency Counts:

    • The most basic form of summarization for categorical data is to count the number of occurrences of each category.

    • In R, table() function is commonly used for this purpose.

      # To tabulate categorical data
      table(penguins$species)
      
         Adelie Chinstrap    Gentoo 
            152        68       124 
  2. Proportions and Percentages:

    • Converting frequency counts into proportions or percentages provides a clearer understanding of the data relative to the whole.

    • This is particularly useful when comparing groups of different sizes.

      #to get relative frequency by group
      prop.table(table(penguins$species))
      
         Adelie Chinstrap    Gentoo 
      0.4418605 0.1976744 0.3604651 

To combine the frequency and percentage values in one table:

freq1 <- as.data.frame(table(penguins$species))
percent1 <- as.data.frame(prop.table(table(penguins$species))*100)
names(percent1)[2] <- "percentage"
cbind(freq1, percent1[2])
       Var1 Freq percentage
1    Adelie  152   44.18605
2 Chinstrap   68   19.76744
3    Gentoo  124   36.04651
combine <- cbind(freq1, round(percent1[2],2))
combine
       Var1 Freq percentage
1    Adelie  152      44.19
2 Chinstrap   68      19.77
3    Gentoo  124      36.05

Several useful packages for categorical data analysis

1) janitor package

library(janitor)

Attaching package: 'janitor'
The following objects are masked from 'package:stats':

    chisq.test, fisher.test
tabyl(penguins$species, sort=TRUE)
 penguins$species   n   percent
           Adelie 152 0.4418605
        Chinstrap  68 0.1976744
           Gentoo 124 0.3604651

2) epiDisplay package

library(epiDisplay)
Loading required package: foreign
Loading required package: survival
Loading required package: MASS

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select
Loading required package: nnet

Attaching package: 'epiDisplay'
The following objects are masked from 'package:psych':

    alpha, cs, lookup
The following object is masked from 'package:ggplot2':

    alpha
tab1(penguins$species, sort.group = "decreasing", 
     cum.percent = TRUE)

penguins$species : 
          Frequency Percent Cum. percent
Adelie          152    44.2         44.2
Gentoo          124    36.0         80.2
Chinstrap        68    19.8        100.0
  Total         344   100.0        100.0

3) summarytools package

library(summarytools)

Attaching package: 'summarytools'
The following objects are masked from 'package:Hmisc':

    label, label<-
summarytools::freq(penguins$species)
Frequencies  
penguins$species  
Type: Factor  

                  Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
--------------- ------ --------- -------------- --------- --------------
         Adelie    152     44.19          44.19     44.19          44.19
      Chinstrap     68     19.77          63.95     19.77          63.95
         Gentoo    124     36.05         100.00     36.05         100.00
           <NA>      0                               0.00         100.00
          Total    344    100.00         100.00    100.00         100.00

4) gmodels package

library(gmodels)
Registered S3 method overwritten by 'gdata':
  method         from     
  reorder.factor DescTools

Attaching package: 'gmodels'
The following object is masked from 'package:epiDisplay':

    ci
CrossTable(penguins$species, format="SPSS") #will return as spss format

   Cell Contents
|-------------------------|
|                   Count |
|             Row Percent |
|-------------------------|

Total Observations in Table:  344 

          |    Adelie  | Chinstrap  |    Gentoo  | 
          |-----------|-----------|-----------|
          |      152  |       68  |      124  | 
          |   44.186% |   19.767% |   36.047% | 
          |-----------|-----------|-----------|

 

This package can also be used for cross-tabulation tables, for example tabulating species by sex.

library(gmodels)
CrossTable(penguins$species, penguins$sex,
           format="SPSS", 
           expected = T, #expected value
           prop.r = T, #row total
           prop.c = F, #column total
           prop.t = F, #overall total
           prop.chisq = F, #chi-square contribution of each cell
           chisq = T, #the results of a chi-square
           fisher = F, #the result of a Fisher Exact test
           mcnemar = F) #the result of McNemar test

   Cell Contents
|-------------------------|
|                   Count |
|         Expected Values |
|             Row Percent |
|-------------------------|

Total Observations in Table:  333 

                 | penguins$sex 
penguins$species |   female  |     male  | Row Total | 
-----------------|-----------|-----------|-----------|
          Adelie |       73  |       73  |      146  | 
                 |   72.342  |   73.658  |           | 
                 |   50.000% |   50.000% |   43.844% | 
-----------------|-----------|-----------|-----------|
       Chinstrap |       34  |       34  |       68  | 
                 |   33.694  |   34.306  |           | 
                 |   50.000% |   50.000% |   20.420% | 
-----------------|-----------|-----------|-----------|
          Gentoo |       58  |       61  |      119  | 
                 |   58.964  |   60.036  |           | 
                 |   48.739% |   51.261% |   35.736% | 
-----------------|-----------|-----------|-----------|
    Column Total |      165  |      168  |      333  | 
-----------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  0.04860717     d.f. =  2     p =  0.9759894 


 
       Minimum expected frequency: 33.69369 

Let’s try some real examples

Example 1: TB Incidence

We will use the TB incidence dataset, which you can download from this link: https://dataintror.s3.ap-southeast-1.amazonaws.com/tb_incidence.xlsx

Step 1: Load the dataset

library(readxl) #library for importing excel file
url <- "https://dataintror.s3.ap-southeast-1.amazonaws.com/tb_incidence.xlsx"
destfile <- "tb_incidence.xlsx"
curl::curl_download(url, destfile)
data1 <- read_excel(destfile)

Step 2: Rename the first variable to “country”.

library(dplyr)
data1 <- rename(data1, country = 'TB incidence, all forms (per 100 000 population per year)' )
names(data1[1])
[1] "country"

Step 3: Find the mean for all variables before the year 2000.

avg <- dplyr::select(data1, starts_with("1")) 
colnames(avg)
 [1] "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999"
sapply(avg, mean, na.rm = T) # alternative: colMeans(avg, na.rm = TRUE)
    1990     1991     1992     1993     1994     1995     1996     1997 
105.5797 107.6715 108.3140 110.3188 111.9662 114.1981 115.3527 118.8792 
    1998     1999 
121.5169 125.0435 

Step 4: Create a new variable to represent the mean of TB incidence before year 2000 for all observations.

data1$before_2000_avg <- rowMeans(avg, na.rm=T)
names(data1)
 [1] "country"         "1990"            "1991"            "1992"           
 [5] "1993"            "1994"            "1995"            "1996"           
 [9] "1997"            "1998"            "1999"            "2000"           
[13] "2001"            "2002"            "2003"            "2004"           
[17] "2005"            "2006"            "2007"            "before_2000_avg"
head(data1[, c("country", "before_2000_avg")])
# A tibble: 6 × 2
  country        before_2000_avg
  <chr>                    <dbl>
1 Afghanistan              168  
2 Albania                   26.3
3 Algeria                   41.8
4 American Samoa             8.5
5 Andorra                   28.8
6 Angola                   225. 

Example 2: Youth Tobacco

Import the dataset from this link: https://dataintror.s3.ap-southeast-1.amazonaws.com/Youth_Tobacco_Survey_YTS_Data.csv

Step 1: Loading the dataset

library(readr)
data2 <- read_csv("https://dataintror.s3.ap-southeast-1.amazonaws.com/Youth_Tobacco_Survey_YTS_Data.csv")
Rows: 9794 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (24): LocationAbbr, LocationDesc, TopicType, TopicDesc, MeasureDesc, Dat...
dbl  (7): YEAR, Data_Value, Data_Value_Std_Err, Low_Confidence_Limit, High_C...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Step 2: Explore these variables, taking note of the frequency of each category (MeasureDesc, Gender, and Response).

summarytools::freq(data2$MeasureDesc)
Frequencies  
data2$MeasureDesc  
Type: Character  

                                                                  Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
--------------------------------------------------------------- ------ --------- -------------- --------- --------------
                    Percent of Current Smokers Who Want to Quit   1205     12.30          12.30     12.30          12.30
      Quit Attempt in Past Year Among Current Cigarette Smokers   1041     10.63          22.93     10.63          22.93
                                                 Smoking Status   3783     38.63          61.56     38.63          61.56
                                                    User Status   3765     38.44         100.00     38.44         100.00
                                                           <NA>      0                               0.00         100.00
                                                          Total   9794    100.00         100.00    100.00         100.00
summarytools::freq(data2$Gender)
Frequencies  
data2$Gender  
Type: Character  

                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
------------- ------ --------- -------------- --------- --------------
       Female   3256     33.24          33.24     33.24          33.24
         Male   3256     33.24          66.49     33.24          66.49
      Overall   3282     33.51         100.00     33.51         100.00
         <NA>      0                               0.00         100.00
        Total   9794    100.00         100.00    100.00         100.00
summarytools::freq(data2$Response)
Frequencies  
data2$Response  
Type: Character  

                 Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
-------------- ------ --------- -------------- --------- --------------
       Current   2514     33.31          33.31     25.67          25.67
          Ever   2520     33.39          66.69     25.73          51.40
      Frequent   2514     33.31         100.00     25.67          77.07
          <NA>   2246                              22.93         100.00
         Total   9794    100.00         100.00    100.00         100.00

Step 3: Filter MeasureDesc = Smoking Status, Gender = Male, and Response = Frequent. Show only YEAR, LocationDesc, and Data_Value, and save the filtered data into sub_ytl.

sub_ytl <- data2 %>%
          dplyr::filter(MeasureDesc == "Smoking Status", 
                        Gender == "Male", 
                        Response == "Frequent") %>%
          dplyr::select(YEAR, LocationDesc, Data_Value)

Step 4: Summarize Data_Value by its mean for each YEAR.

sub_ytl %>%
  dplyr::group_by(YEAR) %>%
  dplyr::summarize(mean = mean(Data_Value))
# A tibble: 17 × 2
    YEAR  mean
   <dbl> <dbl>
 1  1999  8.61
 2  2000  8.51
 3  2001  6.4 
 4  2002  7.26
 5  2003  5.22
 6  2004  5.71
 7  2005  5.78
 8  2006  5.9 
 9  2007  5.53
10  2008  5.39
11  2009  4.6 
12  2010  5.00
13  2011  5.03
14  2012  3.95
15  2013  2.81
16  2014  2.56
17  2015  2.17