Date and time in R

there are several types of R objects

all the explanation below are taken from the Handling date-times in R Cole Beck manual

  1. Date Class

The simplest data type to use for dates is the ”Date” class. these will be internally stored as integers.The specific date used to index your dates is called the origin. Typically programming languages use a default origin of 1970-01-01, though it is really day zero, not day one (negative values are perfectly valid).

unclass(Sys.Date())
## [1] 18015
Sys.Date() - as.Date("1970-01-01")
## Time difference of 18015 days
  1. POSIXt Date-Time Class

Dates are pretty simple and most of the operations that we could use for them will also apply to date-time variables. There are at least three good options for date-time data types: built-in POSIXt, chron package, lubridate package. There are two POSIXt types, POSIXct and POSIXlt. ”ct” can stand for calendar time, it stores the number of seconds since the origin. ”lt”, or local time, keeps the date as a list of time attributes (such as ”hour” and ”mon”)

current time as POSIXct:

unclass(Sys.time())
## [1] 1556570382
Sys.time()-1556568634
## [1] "1970-01-01 02:29:08 IST"

current time as POSIXit:

unclass(as.POSIXlt(Sys.time()))
## $sec
## [1] 42.50493
## 
## $min
## [1] 39
## 
## $hour
## [1] 23
## 
## $mday
## [1] 29
## 
## $mon
## [1] 3
## 
## $year
## [1] 119
## 
## $wday
## [1] 1
## 
## $yday
## [1] 118
## 
## $isdst
## [1] 1
## 
## $zone
## [1] "IDT"
## 
## $gmtoff
## [1] 10800
## 
## attr(,"tzone")
## [1] ""    "IST" "IDT"
  • usually POSIXct is easier to work with… with lubridate you don’t need to worry about it

in this workshop we will give introduction to ‘lubridate’ package

Part 1 - simple example

library(tidyverse)
library(lubridate)
library(hms)
library(scales)
library(knitr)
theme_set(theme_bw())
timeSample <- tibble(
  date_1 = c("02-10-2019", "2/10/19"),
  date_2 = c("2019 10 02", "19/10/2"),
  datetime_1 = c("02-10-2019 19:00:00", "02-10-2019 7:00:00PM")
)
date_1 date_2 datetime_1
02-10-2019 2019 10 02 02-10-2019 19:00:00
2/10/19 19/10/2 02-10-2019 7:00:00PM

Convert to Datetime object (class POSIXCT/POSIXIT)

lubridate has multiple functions that convert factor (or any other class) to dates. For example:

  • ymd() (year, month, day)
  • mdy()
  • dmy()

The difference is the order of the time variables–you need to choose the one that fits for your date type.

e.g. in the date_1 column the date is built in the day-month-year format so I use the dmy function

timeSample %>% 
  mutate(date_1 = dmy(date_1)) %>% 
  kable()
## Warning: package 'bindrcpp' was built under R version 3.4.4
date_1 date_2 datetime_1
2019-10-02 2019 10 02 02-10-2019 19:00:00
2019-10-02 19/10/2 02-10-2019 7:00:00PM

in date_2 column the date is built in format of year-month-day so i use the ymd function

timeSample %>% 
  mutate(date_1 = dmy(date_1),
         date_2 = ymd(date_2)) %>% 
  kable()
date_1 date_2 datetime_1
2019-10-02 2019-10-02 02-10-2019 19:00:00
2019-10-02 2019-10-02 02-10-2019 7:00:00PM

Note the only thing that matters is the order of the day/month/year. The separators can change (‘:’ or ‘/’ or ‘-’ etc.) and lubridate can handle it…the same goes for 02 or 2, 2019 or 19.

Really cool!!

lubridate can also handle date and time object in the same manner… * note - can translate AM/PM to the right 24 clock hour

timeSample <- timeSample %>% 
  mutate(date_1 = dmy(date_1),
         date_2 = ymd(date_2),
         datetime_1 = dmy_hms(datetime_1)) 
date_1 date_2 datetime_1
2019-10-02 2019-10-02 2019-10-02 19:00:00
2019-10-02 2019-10-02 2019-10-02 19:00:00

Time zones

When dealing with times in R it’s super important to pay attention to time zones

If you don’t specify the time zone, the default will be UTC. This can cause problems later on if you do all kind of time calculations and don’t understand why you get weird numbers (;

The force_tz function will take the time as it and change the time zone (in the ‘back office’) to the required one. It’s good for cases where you didn’t specify the time zone for your data beforehand and now you want to fix it.

tz(timeSample$datetime_1)
## [1] "UTC"
timeSample$datetime_1 <- force_tz(timeSample$datetime_1, "Asia/Jerusalem")
tz(timeSample$datetime_1)
## [1] "Asia/Jerusalem"

The with_tz function will convert your time zone to a different one. this function is good for cases where you want to know the time in different places

timeSample$datetime_utc <- with_tz(timeSample$datetime_1, "UTC")
date_1 date_2 datetime_1 datetime_utc
2019-10-02 2019-10-02 2019-10-02 19:00:00 2019-10-02 16:00:00
2019-10-02 2019-10-02 2019-10-02 19:00:00 2019-10-02 16:00:00

The best practice is to specify the timezone in first place

ymd_hms("2019-07-20 08:35:00", tz = "Asia/Jerusalem")
## [1] "2019-07-20 08:35:00 IDT"

Other nice functions in lubridate

days_in_month(today()) # how many days in this month?
## Apr 
##  30
days_in_month(today() + 31) # how many day in the month 31 days from now?
## May 
##  31
per <- period(hours = 10, minutes = 5) # define a period you can add to other date time...
per # a period class object
## [1] "10H 5M 0S"
Sys.time() + per
## [1] "2019-04-30 09:44:44 IDT"
Sys.time() - per
## [1] "2019-04-29 13:34:44 IDT"
difftime(Sys.time() + per, Sys.time(), unit = "hours")
## Time difference of 10.08333 hours
difftime(Sys.time() + period(week = 3), Sys.time(), unit = "day")
## Time difference of 21 days
exmaple_interval<- interval(start = dmy("01-04-2019"),end = dmy("01-06-2019"),tz="Asia/Jerusalem") # create time interval class
exmaple_interval
## [1] 2019-04-01 03:00:00 IDT--2019-06-01 03:00:00 IDT
today() %within% exmaple_interval
## [1] TRUE
quarter(Sys.time(), with_year = T) # the qurater of the year...
## [1] 2019.2

Part 2 - Taxi data frame

Import the data

taxi <- read_csv("taxi.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   taxi_id = col_integer(),
##   trip_start = col_datetime(format = ""),
##   trip_end = col_datetime(format = ""),
##   trip_miles = col_double(),
##   fare = col_double(),
##   tips = col_double(),
##   tolls = col_double(),
##   trip_total = col_double(),
##   payment_type = col_character(),
##   company = col_integer()
## )

Convert to time class (“POSIXct” or “POSIXt” )

class(taxi$trip_start)
## [1] "POSIXct" "POSIXt"
class(taxi$trip_end)
## [1] "POSIXct" "POSIXt"
taxi <- taxi %>% 
  mutate(trip_start = ymd_hms(trip_start, tz = "America/Chicago"),
         trip_end = ymd_hms(trip_end, tz = "America/Chicago"))

What is the week-day with the highest number of taxi trips?

lubridate has a series of functions that extract different variables out of the datetime

  • day()
  • month()
  • year()
  • week()
  • hour()
  • etc.
# If label is false then is will show the week day numbers (1:7)
wday(taxi$trip_start, label = TRUE) %>% head()
## [1] Wed Fri Sun Sat Thu Fri
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
taxi %>% 
  mutate(week_day = wday(trip_start, label = TRUE)) %>% 
  ggplot(aes(week_day)) +
  geom_bar()

Friday is the most busy day…

What is the duration of each taxi trip?

?difftime

taxi <- taxi %>% 
  mutate(duration = difftime(trip_end, trip_start, 
                             units = "mins")) %>% 
  filter(duration < 100) #remove the abnormally long rides
  
taxi %>% 
  ggplot(aes(duration)) +
  geom_histogram(bins = 15) +
  xlim(-5, 100)
## Warning: Removed 2 rows containing missing values (geom_bar).

It seems that trips with duration below 15 mins default to zero…

Let’s assume that this is a mistake and change all the zero duration to a five minus drive

five_min <- as.period(5, unit = "mins") # create five minute object
taxi <- taxi %>%
  
  # if the duration is 0 add five minutes
  mutate(alt_end_time = ifelse(duration == 0,
                               trip_end + five_min,
                               trip_end)) %>% 
  # the output is in numeric format - change to datetime format and don't forget the time zone
  mutate(alt_end_time = as_datetime(alt_end_time, tz = "America/Chicago")) %>% 
  
  # fix the duration
  mutate(fix_duration = difftime(alt_end_time, trip_start, 
                                 units = "mins"))
taxi %>% 
  filter(fix_duration < 100) %>% 
  ggplot(aes(fix_duration)) +
  geom_histogram(bins = 15) +
  xlim(-5, 100)
## Warning: Removed 2 rows containing missing values (geom_bar).

When are the longest trips? Morning, noon or night?

taxi <- taxi %>% 
  mutate(hour = hour(trip_start),
         time_category = case_when(
           between(hour, 5, 12) ~ "morning",
           between(hour, 12, 18) ~ "noon",
           TRUE ~ "night"
         )) 
ggplot(taxi, aes(time_category, trip_miles)) +
  geom_boxplot() +
  scale_y_log10() +
  coord_flip()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1366 rows containing non-finite values (stat_boxplot).

Round dates and times

Assume I want to round the trips to half hour unit so I can group them better…

the unit can be other combinations as well…

e.g.

(taxi <- taxi %>% 
  mutate(trip_start = round_date(trip_start, 
                                 unit = "1 hour")))
## # A tibble: 4,992 x 16
##       X1 taxi_id trip_start          trip_end            trip_miles  fare
##    <int>   <int> <dttm>              <dttm>                   <dbl> <dbl>
##  1     1      85 2016-01-13 06:00:00 2016-01-13 06:15:00       0.4   4.5 
##  2     2    2776 2016-01-22 10:00:00 2016-01-22 09:45:00       0.7   4.45
##  3     3    3168 2016-01-31 22:00:00 2016-01-31 21:30:00       0    42.8 
##  4     4    4237 2016-01-23 18:00:00 2016-01-23 17:30:00       1.1   7   
##  5     5    5710 2016-01-14 06:00:00 2016-01-14 06:00:00       2.71 10.2 
##  6     6    1987 2016-01-08 18:00:00 2016-01-08 18:45:00       6.2  17.8 
##  7     7    4986 2016-01-14 05:00:00 2016-01-14 05:00:00      18.4  45   
##  8     8    6400 2016-01-26 04:00:00 2016-01-26 04:15:00       0.2   3.75
##  9     9    7418 2016-01-22 12:00:00 2016-01-22 11:45:00       0     5   
## 10    10    6450 2016-01-07 21:00:00 2016-01-07 21:15:00       0     3.25
## # ... with 4,982 more rows, and 10 more variables: tips <dbl>,
## #   tolls <dbl>, trip_total <dbl>, payment_type <chr>, company <int>,
## #   duration <time>, alt_end_time <dttm>, fix_duration <time>, hour <int>,
## #   time_category <chr>

Other functions are floor_date() and ceiling_date() that will round the time down or up respectively

Plots:

A basic plot

(p <- ggplot(taxi, aes(trip_start, fare)) +
  geom_point(color = "blue", alpha = 0.1) + 
  xlab("Trip start") + ylab("Trip cost ($)"))

To control the time axis use scale_x_datetime() or scale_x_date() if you only have date object

p + scale_x_datetime(date_breaks = "5 day", 
                     labels = date_format("%d-%m-%Y"))

The date_break argument controls for the tick mark

The labels argument controls for the label of your data

Note that the syntax here is different…

before each time variable, you need to add % sign, then the symbol for the time argument and the desired separator

link to the date symbol guide:

link[]

p + scale_x_datetime(date_breaks = "12 hours",
                     labels = date_format("%d-%m-%y %H:%M")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

How to show only some of the time series

start_time <- ymd_hms("2016-01-01 06:00:00", tz = "Asia/Jerusalem")
end_time <- ymd_hms("2016-01-05 06:00:00", tz = "Asia/Jerusalem")
# create a start and end time R object
start_end <- c(start_time, end_time)
p + scale_x_datetime(limits = start_end,
                     date_breaks = "12 hours", 
                     labels = date_format("%d-%m-%y %H:%M")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 4406 rows containing missing values (geom_point).

By hour plots

If you want to plot hourly patterns there are 2 options

1. use the hms package that can extract only the time variable from the date-time

taxi <- taxi %>% 
  mutate(time = as.hms(trip_start)) 
ggplot(taxi, aes(time, fare)) +
  geom_point(color = "blue", alpha = 0.1) +
  scale_x_time(labels = function(x) strftime(x, "%H:%M"))

Option 2

Give one fake date to all the observation and customize with the scale_x_datetime() arguments

taxi <- taxi %>% 
  mutate(fake_date = ymd_hms(paste("2000-01-01", time)))
(p_2 <- ggplot(taxi, aes(fake_date, fare)) +
    geom_point(color = "blue", alpha = 0.1) + 
    xlab("Trip start (hour)") + 
    ylab("Trip cost ($)"))

p_2 + 
  scale_x_datetime(date_breaks = "2 hours", 
                   labels = date_format("%H:%M")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))