Date and time in R
There are several types of date and time objects in R.
All of the explanations below are taken from Cole Beck's manual, Handling date-times in R.
- Date Class
The simplest data type to use for dates is the "Date" class. Dates are stored internally as the number of days since a reference date, called the origin. Programming languages typically use a default origin of 1970-01-01. The origin is really day zero, not day one, and negative values are perfectly valid.
unclass(Sys.Date())
## [1] 18015
Sys.Date() - as.Date("1970-01-01")
## Time difference of 18015 days
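As a small illustration of the origin idea (the dates here are chosen only for the example), a day count can be converted back to a date, and a negative count gives a date before the origin:

```r
# Day counts are relative to the 1970-01-01 origin
origin <- as.Date("1970-01-01")

day_zero   <- as.Date(0, origin = origin)        # the origin itself
day_before <- as.Date(-1, origin = origin)       # one day before the origin
ten_days   <- as.numeric(as.Date("1970-01-11"))  # days elapsed since the origin

print(day_zero)
print(day_before)
print(ten_days)
```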
- POSIXt Date-Time Class
Dates are pretty simple, and most of the operations that we could use for them also apply to date-time variables. There are at least three good options for date-time data types: the built-in POSIXt classes, the chron package, and the lubridate package. There are two POSIXt types, POSIXct and POSIXlt. "ct" can stand for calendar time; it stores the number of seconds since the origin. "lt", or local time, keeps the date as a list of time attributes (such as "hour" and "mon").
current time as POSIXct:
unclass(Sys.time())
## [1] 1556570382
Sys.time()-1556568634
## [1] "1970-01-01 02:29:08 IST"
current time as POSIXlt:
unclass(as.POSIXlt(Sys.time()))
## $sec
## [1] 42.50493
##
## $min
## [1] 39
##
## $hour
## [1] 23
##
## $mday
## [1] 29
##
## $mon
## [1] 3
##
## $year
## [1] 119
##
## $wday
## [1] 1
##
## $yday
## [1] 118
##
## $isdst
## [1] 1
##
## $zone
## [1] "IDT"
##
## $gmtoff
## [1] 10800
##
## attr(,"tzone")
## [1] "" "IST" "IDT"
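A minimal sketch of the difference between the two classes, using a fixed timestamp in UTC (chosen only for the example) so the result does not depend on the local machine. Note that in POSIXlt the month is zero-based and the year counts from 1900, which is why the output above shows 3 for April and 119 for 2019:

```r
# One instant, stored two ways
x_ct <- as.POSIXct("2019-04-29 23:39:42", tz = "UTC")  # seconds since the origin
x_lt <- as.POSIXlt(x_ct)                               # list of time components

print(is.numeric(unclass(x_ct)))  # POSIXct is a single number underneath
print(x_lt$hour)                  # components are accessed by name
print(x_lt$mon + 1)               # add 1 to get the calendar month
print(x_lt$year + 1900)           # add 1900 to get the calendar year
```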
- Usually POSIXct is easier to work with, and with lubridate you don't need to worry about the difference.
In this workshop we will give an introduction to the 'lubridate' package.
Part 1 - simple example
library(tidyverse)
library(lubridate)
library(hms)
library(scales)
library(knitr)
theme_set(theme_bw())
timeSample <- tibble(
date_1 = c("02-10-2019", "2/10/19"),
date_2 = c("2019 10 02", "19/10/2"),
datetime_1 = c("02-10-2019 19:00:00", "02-10-2019 7:00:00PM")
)
date_1 | date_2 | datetime_1 |
---|---|---|
02-10-2019 | 2019 10 02 | 02-10-2019 19:00:00 |
2/10/19 | 19/10/2 | 02-10-2019 7:00:00PM |
Convert to date or date-time objects (class Date or POSIXct)
lubridate has multiple functions that convert character strings (or other classes, such as factors) to dates. For example:
- ymd() (year, month, day)
- mdy() (month, day, year)
- dmy() (day, month, year)
The difference is the order of the date components: you need to choose the one that matches your data.
For example, in the date_1 column the date is in day-month-year format, so I use the dmy function.
timeSample %>%
mutate(date_1 = dmy(date_1)) %>%
kable()
## Warning: package 'bindrcpp' was built under R version 3.4.4
date_1 | date_2 | datetime_1 |
---|---|---|
2019-10-02 | 2019 10 02 | 02-10-2019 19:00:00 |
2019-10-02 | 19/10/2 | 02-10-2019 7:00:00PM |
In the date_2 column the date is in year-month-day format, so I use the ymd function.
timeSample %>%
mutate(date_1 = dmy(date_1),
date_2 = ymd(date_2)) %>%
kable()
date_1 | date_2 | datetime_1 |
---|---|---|
2019-10-02 | 2019-10-02 | 02-10-2019 19:00:00 |
2019-10-02 | 2019-10-02 | 02-10-2019 7:00:00PM |
Note that the only thing that matters is the order of the day, month and year. The separators can change (':' or '/' or '-' etc.) and lubridate can handle it. The same goes for 02 or 2, and 2019 or 19.
Really cool!!
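A quick check of this behaviour. The first two strings come from the table above; the other two are extra variants added here for illustration. The last line shows why the order still matters: the same string read as month-day-year is a different date.

```r
library(lubridate)

# Different separators and digit widths, same day-month-year order
same_day <- dmy(c("02-10-2019", "2/10/19", "2.10.2019", "02 10 2019"))
print(same_day)

# Reading the same string with a different order gives a different date
other_order <- mdy("02-10-2019")
print(other_order)
```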
lubridate can also handle date-time objects in the same manner.
- Note: it can translate AM/PM to the right hour on the 24-hour clock.
timeSample <- timeSample %>%
mutate(date_1 = dmy(date_1),
date_2 = ymd(date_2),
datetime_1 = dmy_hms(datetime_1))
date_1 | date_2 | datetime_1 |
---|---|---|
2019-10-02 | 2019-10-02 | 2019-10-02 19:00:00 |
2019-10-02 | 2019-10-02 | 2019-10-02 19:00:00 |
Time zones
When dealing with times in R it's super important to pay attention to time zones.
If you don't specify the time zone, lubridate's parsing functions default to UTC. This can cause problems later on if you do all kinds of time calculations and don't understand why you get weird numbers.
The force_tz function keeps the clock time as it is and changes the time zone attached to it (in the 'back office'), which means the underlying moment in time changes. It's good for cases where you didn't specify the time zone for your data beforehand and now you want to fix it.
tz(timeSample$datetime_1)
## [1] "UTC"
timeSample$datetime_1 <- force_tz(timeSample$datetime_1, "Asia/Jerusalem")
tz(timeSample$datetime_1)
## [1] "Asia/Jerusalem"
The with_tz function shows the same moment in a different time zone. This function is good for cases where you want to know the time in different places.
timeSample$datetime_utc <- with_tz(timeSample$datetime_1, "UTC")
date_1 | date_2 | datetime_1 | datetime_utc |
---|---|---|---|
2019-10-02 | 2019-10-02 | 2019-10-02 19:00:00 | 2019-10-02 16:00:00 |
2019-10-02 | 2019-10-02 | 2019-10-02 19:00:00 | 2019-10-02 16:00:00 |
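The contrast between the two functions can be sketched on a single value, using the same timestamp as in the table above: force_tz keeps the clock reading and moves the instant, while with_tz keeps the instant and changes the clock reading.

```r
library(lubridate)

x <- ymd_hms("2019-10-02 19:00:00")       # parsed as UTC by default

forced  <- force_tz(x, "Asia/Jerusalem")  # still reads 19:00, but a different instant
shifted <- with_tz(x, "Asia/Jerusalem")   # same instant, shown on the Jerusalem clock

print(hour(forced))                        # clock reading is unchanged
print(difftime(x, forced, units = "hours"))  # the instant moved by the UTC offset
print(shifted == x)                        # the instant is unchanged
print(hour(shifted))                       # the clock reading changed
```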
The best practice is to specify the time zone in the first place.
ymd_hms("2019-07-20 08:35:00", tz = "Asia/Jerusalem")
## [1] "2019-07-20 08:35:00 IDT"
Other nice functions in lubridate
days_in_month(today()) # how many days in this month?
## Apr
## 30
days_in_month(today() + 31) # how many days in the month 31 days from now?
## May
## 31
per <- period(hours = 10, minutes = 5) # define a period you can add to other date time...
per # a period class object
## [1] "10H 5M 0S"
Sys.time() + per
## [1] "2019-04-30 09:44:44 IDT"
Sys.time() - per
## [1] "2019-04-29 13:34:44 IDT"
difftime(Sys.time() + per, Sys.time(), units = "hours")
## Time difference of 10.08333 hours
difftime(Sys.time() + period(weeks = 3), Sys.time(), units = "days")
## Time difference of 21 days
example_interval <- interval(start = dmy("01-04-2019"), end = dmy("01-06-2019"), tz = "Asia/Jerusalem") # create time interval class
example_interval
## [1] 2019-04-01 03:00:00 IDT--2019-06-01 03:00:00 IDT
today() %within% example_interval
## [1] TRUE
quarter(Sys.time(), with_year = TRUE) # the quarter of the year...
## [1] 2019.2
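The outputs above depend on when the code is run. A reproducible sketch of the same period and interval ideas, using fixed dates chosen only for illustration:

```r
library(lubridate)

start <- ymd_hms("2019-04-29 23:39:44", tz = "UTC")
per   <- period(hours = 10, minutes = 5)

later <- start + per   # adding a period moves the clock time forward
print(later)

# An interval has a fixed start and end; %within% tests membership
iv <- interval(ymd("2019-04-01"), ymd("2019-06-01"))
inside  <- ymd("2019-04-30") %within% iv
outside <- ymd("2019-06-02") %within% iv
print(c(inside, outside))
```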
Part 2 - Taxi data frame
Import the data
taxi <- read_csv("taxi.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_integer(),
## taxi_id = col_integer(),
## trip_start = col_datetime(format = ""),
## trip_end = col_datetime(format = ""),
## trip_miles = col_double(),
## fare = col_double(),
## tips = col_double(),
## tolls = col_double(),
## trip_total = col_double(),
## payment_type = col_character(),
## company = col_integer()
## )
Convert to time class ("POSIXct" or "POSIXlt")
class(taxi$trip_start)
## [1] "POSIXct" "POSIXt"
class(taxi$trip_end)
## [1] "POSIXct" "POSIXt"
taxi <- taxi %>%
mutate(trip_start = ymd_hms(trip_start, tz = "America/Chicago"),
trip_end = ymd_hms(trip_end, tz = "America/Chicago"))
What is the week-day with the highest number of taxi trips?
lubridate has a series of functions that extract different components from a date-time:
- day()
- month()
- year()
- week()
- hour()
- etc.
# If label is FALSE, the weekdays are shown as numbers (1:7)
wday(taxi$trip_start, label = TRUE) %>% head()
## [1] Wed Fri Sun Sat Thu Fri
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
taxi %>%
mutate(week_day = wday(trip_start, label = TRUE)) %>%
ggplot(aes(week_day)) +
geom_bar()
Friday is the busiest day…
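The same question can also be answered with a table instead of a plot. This is a sketch on a tiny made-up tibble (the `trips` data below is hypothetical, not the taxi file), counting trips per weekday and sorting so the busiest day comes first:

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for the taxi data: one Wednesday and two Fridays
trips <- tibble(
  trip_start = ymd_hms(c("2016-01-13 06:00:00",
                         "2016-01-15 10:00:00",
                         "2016-01-22 09:30:00"),
                       tz = "America/Chicago")
)

busiest <- trips %>%
  mutate(week_day = wday(trip_start, label = TRUE)) %>%
  count(week_day, sort = TRUE)   # most frequent weekday first

print(busiest)
```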
What is the duration of each taxi trip?
?difftime
taxi <- taxi %>%
mutate(duration = difftime(trip_end, trip_start,
units = "mins")) %>%
filter(duration < 100) #remove the abnormally long rides
taxi %>%
ggplot(aes(duration)) +
geom_histogram(bins = 15) +
xlim(-5, 100)
## Warning: Removed 2 rows containing missing values (geom_bar).
It seems that trips with a duration below 15 minutes default to zero…
Let's assume that this is a mistake and change every zero duration to a five-minute drive.
five_min <- as.period(5, unit = "mins") # create a five-minute period object
taxi <- taxi %>%
# if the duration is 0 add five minutes
mutate(alt_end_time = ifelse(duration == 0,
trip_end + five_min,
trip_end)) %>%
# the output is in numeric format - change to datetime format and don't forget the time zone
mutate(alt_end_time = as_datetime(alt_end_time, tz = "America/Chicago")) %>%
# fix the duration
mutate(fix_duration = difftime(alt_end_time, trip_start,
units = "mins"))
taxi %>%
filter(fix_duration < 100) %>%
ggplot(aes(fix_duration)) +
geom_histogram(bins = 15) +
xlim(-5, 100)
## Warning: Removed 2 rows containing missing values (geom_bar).
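An alternative sketch, not the approach used above: dplyr's if_else() keeps the date-time class of its inputs, so the extra conversion back from numeric is not needed. The two-row `trips` tibble is hypothetical and only there to make the example runnable.

```r
library(dplyr)
library(lubridate)

# Hypothetical data: the first trip has zero duration, the second lasts 30 minutes
trips <- tibble(
  trip_start = ymd_hms(c("2016-01-13 06:00:00", "2016-01-13 07:00:00"),
                       tz = "America/Chicago"),
  trip_end   = ymd_hms(c("2016-01-13 06:00:00", "2016-01-13 07:30:00"),
                       tz = "America/Chicago")
) %>%
  mutate(duration     = difftime(trip_end, trip_start, units = "mins"),
         # if_else() returns a POSIXct, unlike base ifelse()
         alt_end_time = if_else(duration == 0, trip_end + minutes(5), trip_end),
         fix_duration = difftime(alt_end_time, trip_start, units = "mins"))

print(trips)
```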
When are the longest trips? Morning, noon or night?
taxi <- taxi %>%
mutate(hour = hour(trip_start),
time_category = case_when(
between(hour, 5, 12) ~ "morning",
between(hour, 12, 18) ~ "noon",
TRUE ~ "night"
))
ggplot(taxi, aes(time_category, trip_miles)) +
geom_boxplot() +
scale_y_log10() +
coord_flip()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1366 rows containing non-finite values (stat_boxplot).
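One detail of the categories above is worth checking on a few sample hours (picked here only for illustration): between() includes both end points and case_when() uses the first condition that matches, so hour 12 is labelled "morning" rather than "noon".

```r
library(dplyr)

sample_hours <- c(4, 5, 11, 12, 18, 19)

labels <- case_when(
  between(sample_hours, 5, 12)  ~ "morning",  # matches 5 through 12 inclusive
  between(sample_hours, 12, 18) ~ "noon",     # 12 was already taken by "morning"
  TRUE                          ~ "night"
)

print(setNames(labels, sample_hours))
```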
Round dates and times
Assume I want to round the trip start times to the nearest hour so I can group them better…
The unit can be other combinations as well, e.g. "2 day", "5 mins" or "quarter".
(taxi <- taxi %>%
mutate(trip_start = round_date(trip_start,
unit = "1 hour")))
## # A tibble: 4,992 x 16
## X1 taxi_id trip_start trip_end trip_miles fare
## <int> <int> <dttm> <dttm> <dbl> <dbl>
## 1 1 85 2016-01-13 06:00:00 2016-01-13 06:15:00 0.4 4.5
## 2 2 2776 2016-01-22 10:00:00 2016-01-22 09:45:00 0.7 4.45
## 3 3 3168 2016-01-31 22:00:00 2016-01-31 21:30:00 0 42.8
## 4 4 4237 2016-01-23 18:00:00 2016-01-23 17:30:00 1.1 7
## 5 5 5710 2016-01-14 06:00:00 2016-01-14 06:00:00 2.71 10.2
## 6 6 1987 2016-01-08 18:00:00 2016-01-08 18:45:00 6.2 17.8
## 7 7 4986 2016-01-14 05:00:00 2016-01-14 05:00:00 18.4 45
## 8 8 6400 2016-01-26 04:00:00 2016-01-26 04:15:00 0.2 3.75
## 9 9 7418 2016-01-22 12:00:00 2016-01-22 11:45:00 0 5
## 10 10 6450 2016-01-07 21:00:00 2016-01-07 21:15:00 0 3.25
## # ... with 4,982 more rows, and 10 more variables: tips <dbl>,
## # tolls <dbl>, trip_total <dbl>, payment_type <chr>, company <int>,
## # duration <time>, alt_end_time <dttm>, fix_duration <time>, hour <int>,
## # time_category <chr>
Other functions are floor_date() and ceiling_date(), which round the time down or up respectively.
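A small sketch comparing the three rounding functions on one timestamp (the value is made up for the example):

```r
library(lubridate)

x <- ymd_hms("2016-01-13 06:20:00", tz = "America/Chicago")

nearest_hour    <- round_date(x, unit = "1 hour")     # nearest full hour
nearest_quarter <- round_date(x, unit = "15 mins")    # nearest quarter of an hour
down_half       <- floor_date(x, unit = "30 mins")    # always rounds down
up_half         <- ceiling_date(x, unit = "30 mins")  # always rounds up

print(c(nearest_hour, nearest_quarter, down_half, up_half))
```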
Plots:
A basic plot
(p <- ggplot(taxi, aes(trip_start, fare)) +
geom_point(color = "blue", alpha = 0.1) +
xlab("Trip start") + ylab("Trip cost ($)"))
To control the time axis use scale_x_datetime(), or scale_x_date() if you only have a date object.
p + scale_x_datetime(date_breaks = "5 day",
labels = date_format("%d-%m-%Y"))
The date_breaks argument controls the spacing of the tick marks.
The labels argument controls the format of the tick labels.
Note that the syntax here is different…
Before each time component you need to add a % sign, then the symbol for that component, followed by the desired separator.
The format symbols are documented in R's ?strptime help page.
p + scale_x_datetime(date_breaks = "12 hours",
labels = date_format("%d-%m-%y %H:%M")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
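The format strings passed to the labels argument use the same % codes as base R's format(). A minimal sketch on a single made-up timestamp, so the effect of each string can be seen without drawing a plot:

```r
x <- as.POSIXct("2016-01-13 18:05:00", tz = "UTC")

full_year  <- format(x, "%d-%m-%Y")        # day-month-four-digit year
short_year <- format(x, "%d-%m-%y %H:%M")  # two-digit year plus hour:minute
time_only  <- format(x, "%H:%M")           # hour:minute only

print(c(full_year, short_year, time_only))
```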
How to show only some of the time series
start_time <- ymd_hms("2016-01-01 06:00:00", tz = "Asia/Jerusalem")
end_time <- ymd_hms("2016-01-05 06:00:00", tz = "Asia/Jerusalem")
# create a start and end time R object
start_end <- c(start_time, end_time)
p + scale_x_datetime(limits = start_end,
date_breaks = "12 hours",
labels = date_format("%d-%m-%y %H:%M")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 4406 rows containing missing values (geom_point).
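The limits are date-times, so their time zone matters. A quick check, added here as an aside, that the same clock time in two zones marks two different instants; when setting limits it may be worth confirming that they use the zone the data were parsed with (America/Chicago in Part 2):

```r
library(lubridate)

lim_jerusalem <- ymd_hms("2016-01-01 06:00:00", tz = "Asia/Jerusalem")
lim_chicago   <- ymd_hms("2016-01-01 06:00:00", tz = "America/Chicago")

# Same clock reading, different instants
gap <- difftime(lim_chicago, lim_jerusalem, units = "hours")
print(gap)

# The Jerusalem limit as it appears on the Chicago clock
print(with_tz(lim_jerusalem, "America/Chicago"))
```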
By hour plots
If you want to plot hourly patterns there are 2 options.
Option 1
Use the hms package, which can extract only the time of day from a date-time.
taxi <- taxi %>%
mutate(time = as_hms(trip_start)) # as_hms() replaces the deprecated as.hms()
ggplot(taxi, aes(time, fare)) +
geom_point(color = "blue", alpha = 0.1) +
scale_x_time(labels = function(x) strftime(x, "%H:%M"))
Option 2
Give one fake date to all the observations and customize the axis with the scale_x_datetime() arguments.
taxi <- taxi %>%
mutate(fake_date = ymd_hms(paste("2000-01-01", time)))
(p_2 <- ggplot(taxi, aes(fake_date, fare)) +
geom_point(color = "blue", alpha = 0.1) +
xlab("Trip start (hour)") +
ylab("Trip cost ($)"))
p_2 +
scale_x_datetime(date_breaks = "2 hours",
labels = date_format("%H:%M")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
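Both options rest on the same step: separating the time of day from the date. A sketch of that step on two made-up timestamps, assuming a version of the hms package that provides as_hms():

```r
library(lubridate)

x <- ymd_hms(c("2016-01-13 06:15:00", "2016-01-22 21:45:00"), tz = "UTC")

# Option 1: keep only the time of day
time_of_day <- hms::as_hms(x)

# Option 2: attach every time of day to the same placeholder date
fake_date <- ymd_hms(paste("2000-01-01", time_of_day))

print(time_of_day)
print(fake_date)
```

Calling hms::as_hms() with the package prefix avoids any clash with lubridate's own hms() function when both packages are loaded.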