Time series

Some useful public time series data repositories:

  1. UCI Machine Learning Repository has 126 time series data sets.
  2. The UEA and UCR Time Series Classification Repository has a diverse set of data.
  3. NOAA National Centers for Environmental Information publishes time series data related to temperature and precipitation.
  4. The Bureau of Labor Statistics publishes employment, productivity, and price indices.
  5. The Centers for Disease Control and Prevention publishes data related to the flu season.
  6. The Federal Reserve Bank of St. Louis offers a set of economic time series data.
  7. CompEngine is a self-organizing database of more than 25,000 time series data sets. It promotes highly comparative time series analysis, i.e. analysis of temporal behavior that is agnostic to the discipline the data comes from.
  8. The Mcomp and M4comp2018 R packages provide the competition data from the 1982 M-competition (1,001 time series), the M3 competition in 2000 (3,003 time series), and the M4 competition in 2018 (100,000 time series). R's compdata and the CRAN time series repository also have more specialized time series data sets.

In time series analysis, the term lookahead denotes any knowledge of the future, which you should not have when designing, training, or evaluating a model. More precisely, a lookahead is any way in which information about the future propagates back in time through your modeling and affects how the model behaves at earlier times. For example, choosing the best-performing hyperparameters on the test set is a lookahead.
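One way to keep lookahead out of hyperparameter selection is to split the data chronologically and tune on a validation block that sits between the training and test periods. The sketch below is only an illustration under stated assumptions: the random-walk data, the moving-average forecaster, and the helper names chronological_split and moving_average_error are all hypothetical and do not come from any particular library.

```python
import numpy as np
import pandas as pd

def chronological_split(series, n_val, n_test):
    """Split a time-indexed series into train/validation/test blocks, oldest first, without shuffling."""
    series = series.sort_index()
    n_train = len(series) - n_val - n_test
    return (series.iloc[:n_train],
            series.iloc[n_train:n_train + n_val],
            series.iloc[n_train + n_val:])

def moving_average_error(history, target, window):
    """Mean absolute error of a one-step forecast that averages the last `window` past values."""
    full = pd.concat([history, target])
    # shift(1) ensures the forecast for time t only uses values up to t-1
    forecast = full.shift(1).rolling(window).mean()
    return (forecast.loc[target.index] - target).abs().mean()

# Synthetic random walk standing in for real data (assumption for illustration)
rng = np.random.default_rng(0)
index = pd.date_range("2023-01-01", periods=100, freq="D")
series = pd.Series(rng.normal(size=100).cumsum(), index=index)

train, val, test = chronological_split(series, n_val=20, n_test=20)

# Tune the window on the validation block only; the test block stays untouched
candidate_windows = [2, 5, 10]
best_window = min(candidate_windows,
                  key=lambda w: moving_average_error(train, val, w))

# Evaluate once on the test block, using only data that precedes it as history
test_error = moving_average_error(pd.concat([train, val]), test, best_window)
print(best_window, test_error)
```

Selecting best_window by its error on test instead of val would be the lookahead described above, because the reported test error would then be computed on data that had already influenced the model.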

Some general advice:

  1. Make sure the data means what you think it means (is member status a yearly status or just the most recent status?).
  2. Some data should be treated as periods rather than as timestamps occurring a fixed interval apart (emails opened per week).
  3. When analyzing human activities, align the starting point with the cycle of human activity (start the week on a Monday, not on Jan 1).
  4. In some cases, null weeks matter for time-oriented modeling and must appear explicitly in the data (0 emails opened for that week).
  5. When calculating a length of time, remember to add one to the result to account for the first period, which the subtraction leaves out ((April 28 - April 7) / 7 = 3, but the span covers 4 weeks).
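Points 2 through 5 can be sketched with pandas. This is a minimal illustration, not a prescribed method: the opens table and its opened_at column are hypothetical, and the year 2014 is an assumption chosen only because April 7 and April 28 both fall on a Monday in that year.

```python
import pandas as pd

# Hypothetical event log: one row per opened email (dates are assumptions)
opens = pd.DataFrame({
    "opened_at": pd.to_datetime(["2014-04-07", "2014-04-09", "2014-04-28"]),
})

# Treat each event as belonging to a weekly period rather than a timestamp.
# The "W" frequency means weeks ending on Sunday, so each week runs Monday to Sunday.
weeks = opens["opened_at"].dt.to_period("W")
counts = weeks.value_counts().sort_index()

# Weeks with no opens are missing from `counts`; add them back as explicit zeros
all_weeks = pd.period_range(weeks.min(), weeks.max(), freq="W")
weekly = counts.reindex(all_weeks, fill_value=0)
print(weekly)

# Subtraction counts the gaps between weeks, so add one to count the weeks themselves
first, last = pd.Timestamp("2014-04-07"), pd.Timestamp("2014-04-28")
n_weeks = (last - first).days // 7 + 1
print(n_weeks)
```

Without the reindex step, the two weeks with no opened emails would simply be absent from the result, and a model would never see them as zeros.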

The pandas package is used extensively in time series analysis. The "10 minutes to pandas" guide in the official documentation is a brief overview for those unfamiliar with it.