The competition is described in the 650-page book (still available for 30-something dollars at your bookstore)
Time Series Prediction: Forecasting
the Future and Understanding the Past. A. S. Weigend and
N. A. Gershenfeld, eds. Reading, MA: Addison-Wesley, 1994.
This page contains the information distributed through ftp on the
Internet in 1991, and html-ized in 1994.
If you are interested in time series, data mining, applications to finance, etc., you might also want to go to Andreas
Weigend's current research, or send comments to aweigend@stern.nyu.edu. Thank
you.
For each data set we give: (1) the
original description from the competition instruction file, (2) the
full information about the data set, (3) an explanation of why it was
chosen, and (4) a description of the format of the continuation file
if one exists. In addition, hyperlinks are provided to the original data
sets, any continuation set, and papers known to this server that make reference
to the data sets. Two of the data sets, B and E, were fully
described in the instruction file and are not used for a prediction task, so
items (2) and (4) are omitted for them.
We have tried, with 5 data sets, to provide data that cover as wide a range of
realistic time series problems as possible. In order to do this, each data set
was selected to have a number of features of interest. Although this has
necessarily entailed some compromises, we believe that these data represent
the core of the time series analysis problems that arise in many disciplines.
This is a univariate time record of a single observed quantity,
measured in a physics laboratory experiment.
The measurements were made on an 81.5-micron 14NH3 cw (FIR) laser,
pumped optically by the P(13) line of an N2O laser via the vibrational
aQ(8,7) NH3 transition. The basic laser setup can be found in Ref. 1.
The intensity data was recorded by a LeCroy oscilloscope, with no further
processing. The experimental signal-to-noise ratio was about
300, which is slightly under the half-bit uncertainty of the
analog-to-digital conversion.
The data is a cross-cut through periodic to chaotic intensity pulsations
of the laser. Chaotic pulsations more or less follow the theoretical
Lorenz model (see References) of a two-level system.
The data was analyzed. References are e.g.:
This page contains descriptions of the 6 data sets (A, B, C, D, E, and F)
that were used in the Santa Fe Competition, directed by Neil Gershenfeld
(now at MIT's Media Lab) and Andreas
Weigend (now at NYU's Stern School of Business).
The Santa Fe Time Series Competition Data
Data Set A: Laser-generated data
Original Description:
Full Description:
Reason for choice:
These data were chosen because they are a good example of the complicated
behavior that can be seen in a clean, stationary, low-dimensional
non-trivial physical system for which the underlying governing equations
are well understood.
In many fields, such as economics, the data sets may only be a few
hundred points long, and so a great deal of expertise has been developed
in analyzing such short data sets. The size of the data set that we
provided, 1,000 points, was chosen to be long compared to the shortest
time series that people seriously analyze, but to be short compared to
the length that some techniques require. To help make the task more
manageable, we picked a data set known to have low-dimensional dynamics
as a test case for analyzing short data sets.
Many forecasting techniques do not easily provide a measure of the
prediction error, and there is no single best approach to determining
this error, yet for many uses of time series forecasting knowing the
uncertainty is as important as knowing the prediction. We included the
predicted error in the competition metric in order to evaluate how well
the time dependence of the prediction error was estimated. To help
clarify this evaluation, we chose a data set that is very predictable
on the shortest time scales (relatively simple oscillations), but that
has global events that are harder to predict (the rapid decay of the
oscillations).
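As an illustration of why submitted error bars matter, here is a minimal sketch of one common way to score a forecast together with its stated uncertainty: the average negative log-likelihood of the observations under Gaussians centered on the predictions. This is an assumption chosen for illustration and is not claimed to be the exact competition metric; the function name `gaussian_nll` is hypothetical.

```python
import math


def gaussian_nll(observed, predicted, sigma):
    """Average negative log-likelihood of the observations, assuming each
    one is drawn from a Gaussian centered on the prediction with the
    submitted standard deviation (illustrative scoring rule only)."""
    if not (len(observed) == len(predicted) == len(sigma)):
        raise ValueError("all three sequences must have the same length")
    if not observed:
        raise ValueError("at least one point is required")
    total = 0.0
    for x, mu, s in zip(observed, predicted, sigma):
        if s <= 0:
            raise ValueError("error bars must be positive")
        # log of the normalization plus the squared error scaled by the variance
        total += 0.5 * math.log(2 * math.pi * s * s) + (x - mu) ** 2 / (2 * s * s)
    return total / len(observed)
```

Under such a score, a forecast that misses by 10 units while claiming an error bar of 1 is penalized far more heavily than the same miss reported with an honest error bar of 10, which is the behavior the paragraph above asks of a metric.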
Continuation file:
2:00: W, 2:30: 1, 3:30: W, 9:30: 1, 10:00: W, 11:00: 1, 12:00: W, 15:30: 1, 16:00: 2, 36:30: 1, 38:30: W, 39:30: 1, 42:30: 2, 44:00: 1, 44:30: 2, 45:00: W, 46:00: 1, 47:00: W, 47:30: 2, 48:30: 1, 50:00: 2, 50:30: 1, 51:00: 2, 51:30: 1, 52:00: 2, 52:30: W, 53:00: 1, 53:30: W, 55:00: 1, 56:00: 2, 1:21:30: W, 1:22:30: 1, 1:25:00: W, 1:30:00: 1, 1:30:30: W, 1:31:00: 1, 1:31:30: W, 1:34:00: 1, 1:35:00: W, 1:38:30: 1, 1:39:00: W, 1:40:00: 1, 1:40:30: W, 1:42:00: 1, 1:42:30: 2, 1:44:00: 1, 1:50:30: 2, 2:04:30: R, 2:21:00: W, 2:22:00: 1, 2:22:30: W, 2:25:00: 1, 2:43:30: W, 2:47:30: 1, 2:48:30: W, 2:50:00: 1, 2:57:30: W, 2:58:30: 1, 2:59:00: W, 3:00:00: 1, 3:00:30: W, 3:01:00: 1, 3:05:00: W, 3:17:30: 1, 3:18:00: 2, 3:21:00: W, 3:21:30: 1, 3:22:00: W, 3:43:00: 1, 4:11:00: W, 4:11:30: 1, 4:12:00: W, 4:25:00: 1, 4:27:00: W, 4:27:30: 1, 4:28:00: W, 4:43:30: 1, 4:44:00: 2, 4:44:30: 1, 4:45:00: 2, 4:47:00: 1, 4:47:30: 2, 4:48:30: 1, 4:49:00: 2, 4:49:30: 1, 4:50:00: 2, 4:52:00: 1, 4:52:30: 2, 4:54:00: 1, 4:54:30: 2, 4:57:30: 1, 4:58:00: 2
This patient shows sleep apnea (periods during which he takes a few
quick breaths and then stops breathing for up to 45 seconds). Sleep
apnea is medically important because it leads to sleep deprivation
and occasionally death. There are three primary research questions
associated with this data set:
Tick-wise time record of a financial series. The first column is the day of the week (Monday = 1, Friday = 5), the second is the time of that day (in hours, minutes, and seconds as HH.MMSS) after opening of the house, and the third is the value of the series. We provide ten parts (1 through 10), each of 3000 contiguous data points. There are gaps of varying length between these ten sets. The sets are ordered in time (C1 = earliest, C10 = latest). The first five parts are combined in file C1-5.dat (with one line of comment at the beginning of each part), the remaining parts in C6-10.dat.
==> set C part 1 <==
3  9.4846  1.2740  900822  (1 minute)
3 10.0218  1.2715  900822  (15 minutes)
3 10.47    1.273   900822  (60 minutes)
3 10.5520  1.2730  900822  (close of trading day)
4  0.0507  1.2740  900823  (open at next trading day)
3 10.4145  1.2835  900829  (close after 5 trading days)
^    ^       ^       ^
|    |       |       |
|    |       |       ------ date (YYMMDD)
|    |       ------ exchange rate
|    ------ time (HH.MMSS)
------ day (Monday = 1)
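The HH.MMSS time field is easy to misread as a decimal number, so here is a minimal parsing sketch. It assumes whitespace-separated columns in the order described above (day, time, value, and optionally a YYMMDD date), and it parses the time as text so that a short field such as 10.47 is read as 10:47:00. The names `Tick`, `hhmmss_to_seconds`, and `parse_tick` are hypothetical.

```python
from typing import NamedTuple, Optional


class Tick(NamedTuple):
    day: int              # day of the week, Monday = 1
    seconds: int          # HH.MMSS field converted to seconds
    rate: float           # value of the series
    date: Optional[str]   # YYMMDD if a fourth column is present, else None


def hhmmss_to_seconds(field: str) -> int:
    """Convert a time written as HH.MMSS into seconds.

    The field is handled as a string: trailing zeros may be dropped in the
    file, so the fractional part is right-padded to four digits first.
    """
    hours, _, frac = field.partition(".")
    frac = frac.ljust(4, "0")[:4]
    minutes, seconds = int(frac[:2]), int(frac[2:])
    return int(hours) * 3600 + minutes * 60 + seconds


def parse_tick(line: str) -> Tick:
    """Parse one data line; anything after the fourth column is ignored."""
    parts = line.split()
    date = parts[3] if len(parts) > 3 else None
    return Tick(int(parts[0]), hhmmss_to_seconds(parts[1]), float(parts[2]), date)
```

Comment lines such as `==> set C part 1 <==` are not data and would need to be skipped before calling `parse_tick`.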
This univariate time series is provided without background. D2.dat immediately follows D1.dat.
in an asymmetric four-dimensional four-well potential
The forcing is periodic, F(t) = F sin(omega t), applied in the x3 direction, and the dissipation is -gamma times the velocity. The value of a1 has a small drift produced by integrating a Gaussian random variable, and the observable saved is:
The equations of motion were integrated with a simple fixed-step fourth-order Runge-Kutta routine. The program that generated the data (well6.f) is in the programs directory on sfi.santafe.edu, and the input parameters are in the same directory in the file well6.in. Here is a log of the session used to generate the data:
Quartic coefficient? 1
Quadratic coefficient? 1
Starting linear coefficient? -0.01
Random walk amplitude? 0.000002
Random walk offset? 3.5e-8
Output noise magnitude? 0.004
Dissipation? 0.01
Drive amplitude? 0.135
Drive frequency? 0.6
Time step? 0.05
Number of points to save? 105000
Step between saves? 10
Initial iterates to skip? 100000
File to save the observable? data.out
File to save the linear term? a1.out
Minimum a1 = -1.0066601085927E-2
Maximum a1 = 6.1581271767362E-2
a1(start) = 2.5680793489264E-2
a1(end) = 6.1575419060772E-2
The data was generated on a Cray Y-MP, and so 64 bits were used for the floating-point numbers.
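For readers unfamiliar with the integration scheme, the following is a minimal sketch of a fixed-step fourth-order Runge-Kutta integrator applied to a damped, periodically driven particle. It is not well6.f. As a loudly labeled assumption, it uses a one-dimensional stand-in potential V(x) = a4 x^4/4 - a2 x^2/2 - a1 x rather than the four-dimensional potential of the actual system, treats the drive frequency as an angular frequency, and omits the random walk in a1 and the output noise. The default parameter values are taken from the session log; all function names are hypothetical.

```python
import math


def rk4_step(f, t, y, h):
    """Advance the state y by one fixed step h with classical RK4."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def make_rhs(a4=1.0, a2=1.0, a1=-0.01, gamma=0.01, drive=0.135, omega=0.6):
    """Right-hand side for the ASSUMED 1-D potential (illustration only)."""
    def rhs(t, y):
        x, v = y
        force = -(a4 * x ** 3 - a2 * x - a1)  # -dV/dx for the stand-in potential
        return [v, force - gamma * v + drive * math.sin(omega * t)]
    return rhs


def integrate(rhs, y0, h=0.05, n_save=1000, save_every=10, n_skip=1000):
    """Skip n_skip steps, then save the position every save_every steps."""
    y = list(y0)
    saved = []
    for step in range(n_skip + n_save * save_every):
        y = rk4_step(rhs, step * h, y, h)
        if step >= n_skip and (step - n_skip + 1) % save_every == 0:
            saved.append(y[0])
    return saved
```

Calling `integrate(make_rhs(), [0.1, 0.0])` returns 1000 saved positions; the much larger skip and save counts in the log would be passed as `n_skip` and `n_save`.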
This is a set of measurements of the light curve (time variation of the intensity) of the variable white dwarf star PG1159-035 during March 1989. It was recorded by the Whole Earth Telescope (a coordinated group of telescopes distributed around the earth that permits the continuous observation of an astronomical object) and submitted by James Dixson and Don Winget of the Department of Astronomy and the McDonald Observatory of the University of Texas at Austin. The telescope is described in an article in The Astrophysical Journal (361), p. 309-317 (1990), and the measurements on PG1159-035 will be described in an article scheduled for the September 1 issue of the Astrophysical Journal. The observations were made of PG1159-035 and a nonvariable comparison star. A polynomial was fit to the light curve of the comparison star, and then this polynomial was used to normalize the PG1159-035 signal to remove changes due to varying extinction (light absorption) and differing telescope properties.
The samples in the files are all integrations spaced at 10-second intervals. The number of points and starting times of the parts are:
part  points  start time
----  ------  ----------
   1     618    521048.7
   2    1256    526881.9
   3    1222    539951.9
   4     980    550941.5
   5     550    559402.8
   6    1554    566422.8
   7    1937    585517.5
   8    2496    613164.2
   9    1941    633834.8
  10    1472    647065.1
  11    2605    671536.7
  12    1549    699206.4
  13    2568    707915.6
  14    2602    731247.2
  15     673    764048.0
  16    1512    774058.0
  17    1669    794053.6
where the times are in seconds from the beginning of the observational run.
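Since the files contain only intensities, the time axis has to be rebuilt from the table. The sketch below does this under two assumptions: the first sample of each part falls exactly at the listed start time, and samples are exactly 10 seconds apart. The names `PARTS`, `sample_times`, and `gaps` are hypothetical.

```python
# (part, number of points, start time in seconds), copied from the table above
PARTS = [
    (1, 618, 521048.7), (2, 1256, 526881.9), (3, 1222, 539951.9),
    (4, 980, 550941.5), (5, 550, 559402.8), (6, 1554, 566422.8),
    (7, 1937, 585517.5), (8, 2496, 613164.2), (9, 1941, 633834.8),
    (10, 1472, 647065.1), (11, 2605, 671536.7), (12, 1549, 699206.4),
    (13, 2568, 707915.6), (14, 2602, 731247.2), (15, 673, 764048.0),
    (16, 1512, 774058.0), (17, 1669, 794053.6),
]
SAMPLE_INTERVAL = 10.0  # seconds between consecutive integrations


def sample_times(n_points, start):
    """Time stamps (seconds from the start of the run) for one part."""
    return [start + i * SAMPLE_INTERVAL for i in range(n_points)]


def gaps(parts):
    """Seconds between the last sample of each part and the first sample
    of the next one; a negative value means the two parts overlap in time."""
    result = []
    for (p, n, start), (q, _, next_start) in zip(parts, parts[1:]):
        last = start + (n - 1) * SAMPLE_INTERVAL
        result.append((p, q, next_start - last))
    return result
```

A helper like `gaps(PARTS)` makes it easy to see where the record is interrupted before applying any method that assumes uniformly sampled, contiguous data.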
The intensity variations of the star arise from the excited modes, which are spherical harmonics. For a given mode Y_(klm), for each l value there will be 2l+1 m modes. For a fixed star the m modes have the same frequency; rotation of the star and magnetic fields split this degeneracy. The two main questions that it is hoped these data will help answer are:
This is a vector data set, consisting of measurements of four interacting degrees of freedom of the system. The data format consists of lines spaced by equal time steps, with one column per degree of freedom. For a very interesting reason, the continuation of this data set cannot be measured. Therefore, any insight into the long-term predictability that allows the set to be continued will be of interest to a very large community. The identity of this set will be announced at the Workshop.