Import and Export Data

Author

Dr. Mohammad Nasir Abdullah

Importing Data

Importing data is one of the first steps in data science, coming before any analysis or modelling. To do any kind of analysis, you need well-organised data, so it is essential to have a quick and error-free way to import it. Likewise, after doing modelling or analysis in R, it is common to need to export the results, visualisations, or modified datasets for sharing, reporting, or further processing in other tools.

Why is effective data import/export crucial?

  1. Integrity of the data: If something goes wrong during import, such as mishandled missing values, data types, or encoding, the results of the analysis could be wrong. Reliable import methods preserve the integrity of the data.

  2. Efficiency: The time you spend fixing problems with importing and exporting data could be better spent analysing the data itself. Importing and exporting data without any problems can greatly increase productivity.

  3. Versatility: Data can be stored in many different ways, such as CSV files, Excel spreadsheets, databases, and more specialised formats like SPSS or SAS. Knowing how to import and export data across these different sources makes a data worker more flexible and adaptable.

  4. Collaboration: People who work with data often need to collaborate or present their findings to stakeholders. Reliable export tools make it easy to share results, datasets, or visualisations, which supports collaboration and decision-making.

Important

In these notes, we will use demo.csv, demo.sav, demo.xls, Data Exercise.sav, and KAP1.dta. All the datasets can be downloaded from this link: https://dataintror.s3-ap-southeast-1.amazonaws.com/dataset+Introduction+to+R.zip

First, let us set up the environment. We need to identify the working directory and clear all objects in the environment.

1) To clear all objects in the R environment

rm(list=ls())

2) To check working directory

#To check the current working directory
getwd()

#To change the current working directory, use:
setwd("C:/Users/Nasirthegreats/Documents")

1. Importing from a Comma Separated file

First, we must make sure the original file meets these criteria:

  1. The first row of the file should contain the variable names.

  2. Variable names should contain only letters, numbers, and underscores (for example: Age, Admission_date, Score_1).

  3. Variable names should not include special characters such as spaces, @, $, or # (for example: “Admission Date”, #race, @prob).

  4. The data should begin at the second row, first column.

  5. Anything that is not part of the data should be removed (for example: comments, labels, graphs, pivot tables).
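The criteria above can be illustrated with a small, self-contained sketch: the file name and its contents below are hypothetical, created in a temporary location just to show a well-formed CSV.

```r
# Write a small, well-formed CSV (hypothetical data), then read it back
csv_file <- tempfile(fileext = ".csv")
writeLines(c("Age,Admission_date,Score_1",   # row 1: variable names only
             "34,2021-03-15,78",             # data starts at row 2, column 1
             "51,2021-03-16,65"),
           csv_file)
toy <- read.csv(csv_file)
str(toy)  # 2 observations of 3 variables
```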

1st method using read.csv

We can read a comma separated text file using read.csv():

data1 <- read.csv("demo.csv")

Next, we can check the dimensions of the dataset with:

dim(data1)
[1] 6400   28

We can also view some of the observations:

head(data1) #will show first 6 observations for all variables

head(data1[1:4], n=10L)
#Shows the first 10 observations (n = 10L)
#for variables 1 to 4

2nd method using readr

library(readr)
data2 <- read_csv("demo.csv")
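Unlike read.csv(), read_csv() returns a tibble and reports the column types it guessed. As a quick, self-contained sketch (the inline data below is made up), read_csv() can also parse literal text wrapped in I():

```r
library(readr)

# read_csv() accepts literal data via I(), handy for quick checks
df <- read_csv(I("id,score\n1,90\n2,85"), show_col_types = FALSE)
df  # a 2 x 2 tibble
```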

2. Importing from MS Excel file

One simple way to read an Excel file is to export it to a comma delimited file and import it using the method above. Alternatively, you can use the readxl package to read Excel files directly. The first row should contain the variable/column names. As a start, we can read in the first worksheet from the workbook demo.xls; double-click the original Excel file to confirm that the first row contains variable names and that the file meets all the criteria mentioned in the previous section.

To start importing Excel files, we need to install a package named readxl.

install.packages("readxl")

If you have already installed the package, load it by calling the library:

library(readxl)

Now we can begin to import the data from MS Excel file

mydata1 <- read_excel("demo.xls", sheet="demo")
#we need to specify the sheet name.
#In this case the sheet name is "demo"

In the file name, you can use either the .xls or .xlsx extension. There is another way to import data from Excel without specifying the sheet name, but you need to know the position of the sheet you want to import; in this case, the "demo" sheet is the first sheet.

mydata1 <- read_excel("demo.xls", 1)
#we need to specify the sheet number. 
#In this case the "demo" sheet is the first sheet.
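If you are unsure of the sheet names or their order, readxl can list them with excel_sheets(). A self-contained sketch follows; it creates a temporary workbook with the writexl package (used later in these notes for exporting) rather than assuming demo.xls is present:

```r
library(writexl)  # used here only to create a demo workbook
library(readxl)

xlsx_file <- tempfile(fileext = ".xlsx")
write_xlsx(list(demo = head(mtcars)), xlsx_file)  # one sheet named "demo"

excel_sheets(xlsx_file)  # lists the sheet names in order
```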

3. Importing data from SPSS file

1st method using foreign package

To import a data file saved in SPSS format, we need to install the foreign package. With the foreign package, we are able to import data files from both SPSS and STATA.

install.packages("foreign")

If you have already installed the package, load it by calling the library:

library(foreign)

Now, we can begin to import the data from SPSS file

data.sav <- read.spss("Data Exercise.sav", to.data.frame=TRUE) #SPSS
#to.data.frame=TRUE returns a data frame and converts value labels to R factors.

2nd method using haven package

To import an SPSS data file into R using the haven package:

library(haven)
data4 <- read_sav("demo.sav")
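Note that haven keeps SPSS value labels as labelled vectors rather than converting them to factors automatically; as_factor() does the conversion when you need it. A small sketch using a constructed labelled vector (made-up values, not taken from demo.sav):

```r
library(haven)

# A labelled vector like those haven creates when reading .sav files
sex <- labelled(c(1, 2, 1), c(Male = 1, Female = 2))
as_factor(sex)  # factor with levels "Male" and "Female"
```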

4. Importing data from STATA file

1st method using foreign package

In this example, we will use the foreign package to import the data:

library(foreign)

Now, we will import a STATA data file (.dta format) into R.

mydata <- read.dta("KAP1.dta")

It is actually quite fast to import a STATA data file into R.

2nd method using haven package

library(haven)
data10 <- read_dta("KAP1.dta")

5. Importing data from Internet / Web

The internet is a vast repository of data, ranging from structured datasets in CSV or Excel formats, to APIs returning JSON or XML data, to unstructured data in web pages. R provides a range of tools and packages that make importing this data straightforward.
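For APIs that return JSON, the jsonlite package (an assumption; it is not used elsewhere in these notes) can parse a response into a data frame. A self-contained sketch using a literal JSON string with made-up figures, standing in for a live API response:

```r
library(jsonlite)

# A literal JSON string standing in for an API response (hypothetical data)
txt <- '[{"state":"Johor","births":120},{"state":"Kedah","births":95}]'
api_data <- fromJSON(txt)  # an array of objects becomes a data frame
api_data
```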

From the website https://catalog.data.gov/dataset/youth-tobacco-survey-yts-data, we will download the data at https://data.cdc.gov/api/views/4juz-x2tp/rows.csv?accessType=DOWNLOAD; notice that this is a .csv file provided by the webpage.

my_data <- read_csv("https://data.cdc.gov/api/views/4juz-x2tp/rows.csv?accessType=DOWNLOAD")

Now, let us explore another dataset from data.gov.my, the number of births in Malaysia by date: https://storage.data.gov.my/demography/births.csv

birth <- read_csv("https://storage.data.gov.my/demography/births.csv")

Another example from the same website is the RON97 petrol price: https://storage.data.gov.my/commodities/fuelprice.csv

ron97 <- read.csv("https://storage.data.gov.my/commodities/fuelprice.csv")

6. Importing data from .parquet file

Parquet is a columnar storage file format optimized for analytics. Originating from the Hadoop ecosystem, Parquet is especially used with big data technologies because of its efficiency. Some features and advantages of Parquet include:

  1. Columnar Storage: By storing data column-wise, Parquet allows more efficient I/O operations, compression, and encoding schemes.

  2. Schema Evolution: Parquet supports complex nested data structures and allows schema evolution, meaning you can modify the schema over time without having to rewrite the entire dataset.

  3. Language and Framework Neutrality: Parquet can be used with any project in the Hadoop ecosystem, regardless of the data processing framework, data model, or programming language.

  4. Compression: Columnar storage also makes it possible to better compress data. Different compression schemes can be specified for different columns.

Reading a Parquet file in R

To read a Parquet file in R, you can use the arrow package, which provides a set of tools for working with Parquet and Arrow datasets. First, install the package:

install.packages("arrow")

Once the package is installed, load it into the R environment:

library(arrow)

Let us try to import a Parquet file from data.gov.my; this dataset contains daily exchange rates of the Malaysian ringgit against various currencies:

data1 <- read_parquet("https://storage.data.gov.my/finsector/exchangerates.parquet")

head(data1)
# A tibble: 6 × 25
  date                myr_usd myr_rmb myr_sgd myr_eur myr_jpy myr_twd myr_thb
  <dttm>                <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 2003-12-01 08:00:00   0.263    2.18   0.453   0.220    28.8    8.36    10.5
2 2003-12-02 08:00:00   0.263    2.18   0.453   0.218    28.6    8.36    10.5
3 2003-12-03 08:00:00   0.264    2.18   0.453   0.218    28.6    8.36    10.5
4 2003-12-04 08:00:00   0.263    2.18   0.451   0.218    28.5    8.36    10.5
5 2003-12-05 08:00:00   0.264    2.18   0.452   0.217    28.4    8.36    10.5
6 2003-12-06 08:00:00   0.264    2.18   0.452   0.217    28.4    8.36    10.5
# ℹ 17 more variables: myr_idr <dbl>, myr_hkd <dbl>, myr_krw <dbl>,
#   myr_vnd <dbl>, myr_inr <dbl>, myr_aud <dbl>, myr_php <dbl>, myr_aed <dbl>,
#   myr_sar <dbl>, myr_try <dbl>, myr_gbp <dbl>, myr_brl <dbl>, myr_mxn <dbl>,
#   myr_bdt <dbl>, myr_chf <dbl>, myr_cad <dbl>, myr_rub <dbl>

Notice that the Parquet file is read in as a tibble.
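The arrow package can also write Parquet files, which makes a quick local round trip possible. A sketch using a built-in dataset and a temporary file, so no network access is required:

```r
library(arrow)

# Write a data frame to Parquet, then read it back
pq_file <- tempfile(fileext = ".parquet")
write_parquet(mtcars, pq_file)
back <- read_parquet(pq_file)
dim(back)  # same 32 x 11 as mtcars
```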

Exporting data

The ability to export data efficiently and accurately is a crucial step in the data analysis process. As a bridge between data processing and data communication, exporting ensures that insights generated through meticulous analyses can be shared, interpreted, and acted upon across platforms and stakeholders.

R, a leading language in statistical computing and graphics, has carved out a reputation for its data manipulation capabilities. However, its prowess is not limited to just importing and transforming data. Its extensive suite of exporting tools ensures that users can save their work in a myriad of formats, be it traditional spreadsheets like Excel, databases like MySQL, or even more specialized formats like SPSS and SAS. This flexibility ensures that R integrates seamlessly into various data workflows, making it a preferred tool for many data professionals.

However, with flexibility comes complexity. The vast array of exporting options can be overwhelming for both new and experienced users. The choice of format, the specific package to use, and the nuances of each exporting function can impact the integrity and usability of the exported data. Hence, understanding these options and their implications becomes paramount.

In these notes, we aim to demystify the data exporting process in R. Through detailed explanations, examples, and best practices, we will guide readers through the diverse landscape of data exporting in R. Whether you are looking to share your findings with colleagues, create visualizations in another software, or move large datasets to a different storage platform, this guide will equip you with the knowledge and skills you need to do so effectively.

1. Exporting data to comma separated file (.csv)

write.csv(data1, "data2.csv")
#Note: row names are written as the first column by default;
#add row.names = FALSE to omit them

2. Exporting data to excel data type

Method 1

library(openxlsx)
write.xlsx(data1, "data3.xlsx", sheetName = "Sheet1", 
           colNames = TRUE, rowNames = TRUE)
#openxlsx writes .xlsx files only; the append argument
#belongs to the older xlsx package and is not used here

Method 2

library(writexl)
write_xlsx(data1, "data.xlsx")

3. Exporting data to SPSS format

Method 1

library(foreign)
write.foreign(data1, "data2.txt", 
              "data2.sps", package = "SPSS")
#writes the data to data2.txt and the SPSS syntax
#needed to read it into data2.sps

Method 2

library(haven)
write_sav(data1, "data.sav")

4. Exporting data to STATA format

Method 1

write.dta(data1, "data2.dta") #write.dta is from the foreign package

Method 2

write_dta(data1, "data2.dta") #write_dta is from the haven package

5. Export to RDS and RDATA

R's native formats are efficient for storing R objects.

saveRDS(data1, "data2.rds") #RDS stores a single R object
loaded_data <- readRDS("data2.rds")


#multiple objects with RData
save(data1, data2, file="data1and2.RData")
load("data1and2.RData")

Concluding Remarks

R offers an extensive set of tools for exporting and importing data. Depending on your needs and the tools you or your collaborators use, you can choose the appropriate method for exporting and importing. Always keep in mind the compatibility and constraints of your chosen data format.