This page contains descriptions of the six data sets (A, B, C, D, E, and F) that were used in the Santa Fe Competition, directed by Neil Gershenfeld (now at MIT's Media Lab) and Andreas Weigend (now at NYU's Stern School of Business).

The competition is described in the 650-page book Time Series Prediction: Forecasting the Future and Understanding the Past, A. S. Weigend and N. A. Gershenfeld, eds. (Reading, MA: Addison-Wesley, 1994), still available at your bookstore for thirty-something dollars.

This page contains the information distributed through ftp on the Internet in 1991 and html-ized in 1994.

If you are interested in time series, data mining, applications to finance, etc., you may also want to visit Andreas Weigend's current research, or send comments to aweigend@stern.nyu.edu. Thank you.

The Santa Fe Time Series Competition Data

For each data set we give: (1) the original description from the competition instruction file, (2) the full information about the data set, (3) an explanation of why it was chosen, and (4) a description of the format of the continuation file, if one exists. In addition, hyperlinks are provided to the original data sets, any continuation set, and papers known to this server that make reference to the data sets. Two of the data sets, B and E, were fully described in the instruction file and were not used for a prediction task, so items (2) and (4) are omitted for them.

We have tried, with these data sets, to provide data that cover as wide a range of realistic time series problems as possible. In order to do this, each data set was selected to have a number of features of interest. Although this has necessarily entailed some compromises, we believe that these data represent the core of the time series analysis problems that arise in many disciplines.

Data Set A: Laser generated data


Original Description:

A.dat (1,000 points)

This is a univariate time record of a single observed quantity, measured in a physics laboratory experiment.

Full Description:

The data were contributed by Udo Huebner, Phys.-Techn. Bundesanstalt, Braunschweig, Germany, and were collected primarily by N. B. Abraham and C. O. Weiss. These data were recorded from a far-infrared laser in a chaotic state; here is the description from Dr. Huebner:

The measurements were made on an 81.5-micron 14NH3 cw (FIR) laser, pumped optically by the P(13) line of an N2O laser via the vibrational aQ(8,7) NH3 transition. The basic laser setup can be found in Ref. 1. The intensity data were recorded by a LeCroy oscilloscope; no further processing was done. The experimental signal-to-noise ratio was about 300, which corresponds to slightly less than the half-bit uncertainty of the analog-to-digital conversion.

The data set is a cross-cut through periodic to chaotic intensity pulsations of the laser. The chaotic pulsations more or less follow the theoretical Lorenz model (see References) of a two-level system.

These data have been analyzed; see, e.g.:

  1. U. Huebner, N. B. Abraham, and C. O. Weiss, "Dimensions and entropies of chaotic intensity pulsations in a single-mode far-infrared NH3 laser," Phys. Rev. A 40, 6354 (1989).

  2. U. Huebner, W. Klische, N. B. Abraham, and C. O. Weiss, "On problems encountered with dimension calculations," in Measures of Complexity and Chaos, ed. N. B. Abraham et al. (Plenum Press, New York, 1989), p. 133.

  3. U. Huebner, W. Klische, N. B. Abraham, and C. O. Weiss, "Comparison of Lorenz-like laser behavior with the Lorenz model," in Coherence and Quantum Optics VI, ed. J. Eberly et al. (Plenum Press, New York, 1989), p. 517.

Reason for choice:

  1. Relatively simple
    These data were chosen because they are a good example of the complicated behavior that can be seen in a clean, stationary, low-dimensional, non-trivial physical system for which the underlying governing equations are well understood.

  2. Short data sets
    In many fields, such as economics, the available data sets may be only a few hundred points long, and so a great deal of expertise has been developed in analyzing such short records. The size of the data set that we provided, 1,000 points, was chosen to be long compared to the shortest time series that people seriously analyze, but short compared to the length that some techniques require. To help make the task manageable, we picked a data set known to have low-dimensional dynamics as a test case for analyzing short data sets.

  3. Prediction error measures
    Many forecasting techniques do not easily provide a measure of the prediction error, and there is no single best approach to determining this error; yet for many uses of time series forecasting, knowing the uncertainty is as important as knowing the prediction. We included the predicted error in the competition metric in order to evaluate how well the time dependence of the prediction error was estimated. To help clarify this evaluation, we chose a data set that is very predictable on the shortest time scales (relatively simple oscillations), but that has global events that are harder to predict (the rapid decay of the oscillations).
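
For concreteness, here is a minimal sketch of one common score for such forecasts, the normalized mean squared error (NMSE), in which the mean squared prediction error is divided by the variance of the true continuation. This shows only one ingredient of the competition's evaluation (which also weighted the submitted error bars); the file handling and the forecasting routine named below are illustrative assumptions, not part of the original competition software.

   import numpy as np

   def nmse(y_true, y_pred):
       # Normalized mean squared error: mean squared error divided by
       # the variance of the true continuation, so that always
       # predicting the mean of the truth scores about 1.0.
       y_true = np.asarray(y_true, dtype=float)
       y_pred = np.asarray(y_pred, dtype=float)
       return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

   # Illustrative use with the continuation file described below:
   # truth = np.loadtxt("A.cont")[:100]   # first 100 continuation points
   # preds = my_forecast(100)             # hypothetical forecasting routine
   # print(nmse(truth, preds))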

Continuation file:

A.cont provides approximately 10,000 points beyond the end of the competition data set. We included many more points than were needed for the prediction task because the supplied data set was short; the extra points are provided in case there is interest in further testing models derived from the supplied data.

Data Set B: Physiological data


Original Description:

This is a multivariate data set recorded from a patient in the sleep laboratory of the Beth Israel Hospital in Boston, Massachusetts (data submitted by David Rigney and Ary Goldberger). The file has been split into two sequential parts, B1.dat and B2.dat; the lines in the files are spaced by 0.5 seconds. The first column is the heart rate, the second is the chest volume (respiration force), and the third is the blood oxygen concentration (measured by ear oximetry).

The heart rate was determined by measuring the time between the QRS complexes in the electrocardiogram, taking the inverse, and then converting this to an evenly sampled record by interpolation. There were no premature beats; sudden changes in the heart rate are not artifacts. The respiration and blood oxygen data are given in uncalibrated A/D bits; these two sensors slowly drift with time (and are therefore occasionally rescaled by a technician) and can be detached by the motion of the patient, hence their calibration is not constant over the data set. They were converted from 250 Hz to 2 Hz data by averaging over a 0.08 second window at the times of the heart rate samples. Between roughly 4 hours 30 minutes and 4 hours 34 minutes from the start of the file the sensors were disconnected.

The following table gives the times and stages of sleep, as determined by a neurologist looking at the EEG (W = awake, 1 and 2 = waking/sleep stages, R = REM sleep):
   2:00: W,    2:30: 1,    3:30: W,    9:30: 1,   10:00: W,
  11:00: 1,   12:00: W,   15:30: 1,   16:00: 2,   36:30: 1,
  38:30: W,   39:30: 1,   42:30: 2,   44:00: 1,   44:30: 2,
  45:00: W,   46:00: 1,   47:00: W,   47:30: 2,   48:30: 1,
  50:00: 2,   50:30: 1,   51:00: 2,   51:30: 1,   52:00: 2,
  52:30: W,   53:00: 1,   53:30: W,   55:00: 1,   56:00: 2,
1:21:30: W, 1:22:30: 1, 1:25:00: W, 1:30:00: 1, 1:30:30: W,
1:31:00: 1, 1:31:30: W, 1:34:00: 1, 1:35:00: W, 1:38:30: 1,
1:39:00: W, 1:40:00: 1, 1:40:30: W, 1:42:00: 1, 1:42:30: 2,
1:44:00: 1, 1:50:30: 2, 2:04:30: R, 2:21:00: W, 2:22:00: 1,
2:22:30: W, 2:25:00: 1, 2:43:30: W, 2:47:30: 1, 2:48:30: W,
2:50:00: 1, 2:57:30: W, 2:58:30: 1, 2:59:00: W, 3:00:00: 1,
3:00:30: W, 3:01:00: 1, 3:05:00: W, 3:17:30: 1, 3:18:00: 2,
3:21:00: W, 3:21:30: 1, 3:22:00: W, 3:43:00: 1, 4:11:00: W,
4:11:30: 1, 4:12:00: W, 4:25:00: 1, 4:27:00: W, 4:27:30: 1,
4:28:00: W, 4:43:30: 1, 4:44:00: 2, 4:44:30: 1, 4:45:00: 2,
4:47:00: 1, 4:47:30: 2, 4:48:30: 1, 4:49:00: 2, 4:49:30: 1,
4:50:00: 2, 4:52:00: 1, 4:52:30: 2, 4:54:00: 1, 4:54:30: 2,
4:57:30: 1, 4:58:00: 2

This patient shows sleep apnea (periods during which he takes a few quick breaths and then stops breathing for up to 45 seconds). Sleep apnea is medically important because it leads to sleep deprivation and occasionally death. There are three primary research questions associated with this data set:

  1. Can part of the temporal variation in the heart rate be explained by a low-dimensional mechanism, or is it due to noise or external inputs?

  2. How do the evolution of the heart rate, the respiration rate, and the blood oxygen concentration affect each other? (A correlation between breathing and the heart rate, called sinus arrhythmia, is almost always observed.)

  3. Can the episodes of sleep apnea (stoppage of breathing) be predicted from the preceding data?
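
For readers who want to reproduce the preprocessing described above, here is a minimal sketch of the heart-rate derivation: invert the interbeat (QRS-to-QRS) intervals and interpolate onto an even 0.5-second grid. The beat times and the choice of beats-per-minute units are illustrative assumptions; the distributed files already contain the processed record.

   import numpy as np

   def heart_rate_series(beat_times, dt=0.5):
       # Convert QRS event times (in seconds) into an evenly sampled
       # heart-rate record: instantaneous rate = 1 / interbeat interval,
       # then interpolate onto a grid spaced by dt seconds.
       beat_times = np.asarray(beat_times, dtype=float)
       rate = 60.0 / np.diff(beat_times)   # beats per minute (assumed unit)
       t_rate = beat_times[1:]             # rate is defined at each beat
       t_even = np.arange(t_rate[0], t_rate[-1], dt)
       return t_even, np.interp(t_even, t_rate, rate)

   # Illustrative use with fabricated beat times:
   # t, hr = heart_rate_series([0.0, 0.8, 1.7, 2.5, 3.2, 4.1])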

Reason for choice:

  1. Heart rate variability
    There is growing (but still controversial) evidence that the observed variations in the heart rate might be related to a low-dimensional governing mechanism; understanding this mechanism is obviously very important in order to understand its failures (i.e., heart attacks).

  2. Multi-dimensional data sets
    These data provide simultaneous measurements of a number of potentially interacting variables; it is an open question how best to use the extra information to learn how the variables interact. Most importantly, there is interest in verifying and understanding the coupling between respiration and the heart rate (a simple lagged-correlation sketch follows this list).

  3. Non-stationary data
    These data were recorded with as much care as is possible, but the experimental system (the sleeping patient) is obviously non-stationary. A successful analysis of these data must attempt to distinguish the presumed internal dynamics from changes in the patient's state.
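
As a simple illustration of point 2 above, the following sketch estimates the lagged cross-correlation between the heart-rate and respiration columns. The file layout (three whitespace-separated columns per 0.5-second line) follows the description given earlier; everything else, including the lag range, is an illustrative choice rather than an analysis anyone performed for the competition.

   import numpy as np

   # Columns per the description: heart rate, chest volume, blood oxygen.
   data = np.loadtxt("B1.dat")
   hr = (data[:, 0] - data[:, 0].mean()) / data[:, 0].std()
   resp = (data[:, 1] - data[:, 1].mean()) / data[:, 1].std()

   def xcorr_at(a, b, k):
       # Correlation of a[t] with b[t + k]; k in samples of 0.5 s each.
       if k >= 0:
           return np.mean(a[: len(a) - k] * b[k:])
       return np.mean(a[-k:] * b[: len(b) + k])

   lags = range(-60, 61)                  # lags of up to +/- 30 seconds
   c = [xcorr_at(hr, resp, k) for k in lags]
   best = max(zip(c, lags), key=lambda p: abs(p[0]))
   print("strongest coupling at lag %+.1f s (r = %.2f)" % (best[1] * 0.5, best[0]))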

Data Set C: Currency exchange rate data


Original Description:

C1-5.dat and C6-10.dat (10 segments of 3,000 points each)

Tick-wise time record of a financial series. The first column is the day of the week (Monday = 1, Friday = 5), the second is the time of day (in hours, minutes, and seconds as HH.MMSS) after the opening of the house, and the third is the value of the series. We provide ten segments (1 through 10) of 3,000 contiguous data points each. There are gaps of varying length between these ten segments. The segments are ordered in time (C1 = earliest, C10 = latest). The first five segments are combined in the file C1-5.dat (with one line of comment at the beginning of each segment), the remaining five in C6-10.dat.

Full Description:

These data are the tickwise bids for the exchange rate from Swiss francs to US dollars; they were recorded by a currency trading group from August 7, 1990 to April 18, 1991.

Reason for choice:

  1. Financial data
    Predicting currency exchange rates is a classic problem in time series analysis, and is of both academic and financial interest. The promise of finding predictable structure in such a data set is a source of much of the interest in, and support for, time series analysis. These particular data were chosen because they were available on a tick-wise basis, and were representative of financial prediction tasks, but still were obscure enough that they would not be easily recognized.

  2. Multiple prediction data sets
    We collected the predictions for 10 data sets in order to build up better statistics about how algorithms compare, and to check to see if the predictability of the exchange rate varies over time.

Continuation file:

C.cont consists, for each of the data sets, of the exchange rate at the tick closest to the requested time. Following is an example:
   ==> set C part 1 <==
   3  9.4846  1.2740 900822   (1 minute)
   3 10.0218  1.2715 900822   (15 minutes)
   3 10.47    1.273 900822    (60 minutes)
   3 10.5520  1.2730 900822   (close of trading day)
   4  0.0507  1.2740 900823   (open at next trading day)
   3 10.4145  1.2835 900829   (close after 5 trading days)
   ^    ^        ^     ^
   |    |        |     |
   |    |        |     ------ date (YYMMDD)
   |    |        ------ exchange rate
   |    ------ time (HH.MMSS)
   ------ day (Monday = 1)
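
Note that the time field packs hours, minutes, and seconds into a single decimal number, so naive arithmetic on it is misleading. Here is a minimal decoding sketch; it assumes the field is read as text (reading it as a float silently drops trailing zeros, which the padding below compensates for):

   def hhmmss_to_seconds(t):
       # Decode the HH.MMSS field of data set C into seconds after opening.
       # The fractional part holds minutes and seconds, two digits each;
       # pad to four digits in case trailing zeros were dropped.
       hh, _, frac = str(t).partition(".")
       frac = frac.ljust(4, "0")
       return int(hh) * 3600 + int(frac[:2]) * 60 + int(frac[2:4])

   # From the example above: 10.47 decodes to 10 h 47 min 00 s.
   assert hhmmss_to_seconds("10.47") == 10 * 3600 + 47 * 60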

Data Set D: Computer generated series


Original Description:

D1.dat and D2.dat (100,000 points)

This univariate time series is provided without background information. D2.dat immediately follows D1.dat.

Full Description:

In order to provide a relatively long series of known high-dimensional dynamics (between the extremes of Data Set A and Data Set C) with weak nonstationarity, we generated 100,000 points by numerically integrating the equations of motion for a damped, driven particle

           see SantaFe/data_instructions

in an asymmetric four-dimensional four-well potential

           see SantaFe/data_instructions

with forcing F(t) = F sin(omega t) in the x3 direction and dissipation -gamma times the velocity. The value of a1 has a small drift, produced by integrating a Gaussian random variable, and the observable saved is:

           see SantaFe/data_instructions

The equations of motion were integrated with a simple fixed-step 4th order Runge-Kutta routine. The program that generated the data (well6.f) is in the programs directory on sfi.santafe.edu, and the input parameters are in the same directory in the file well6.in. Here is a log of the session used to generate the data:

   Quartic coefficient? 1
   Quadratic coefficient? 1
   Starting linear coefficient? -0.01
   Random walk amplitude? 0.000002
   Random walk offset? 3.5e-8
   Output noise magnitude? 0.004
   Dissipation? 0.01
   Drive amplitude? 0.135
   Drive frequency? 0.6
   Time step? 0.05
   Number of points to save? 105000
   Step between saves? 10
   Initial iterates to skip? 100000
   File to save the observable? data.out
   File to save the linear term? a1.out
   Minimum a1 = -1.0066601085927E-2
   Maximum a1 = 6.1581271767362E-2
   a1(start) = 2.5680793489264E-2
   a1(end) = 6.1575419060772E-2

The data were generated on a Cray Y-MP, and so 64 bits were used for the floating-point numbers.
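
To give a feel for the integration scheme, here is a minimal fixed-step 4th-order Runge-Kutta sketch for a damped, driven particle in a one-dimensional double-well potential. The real system is four-dimensional with an asymmetric four-well potential, a drifting linear coefficient, and an observable all defined in well6.f; the potential and observable below are simplified stand-ins, with only the damping, drive, and step-size values borrowed from the session log.

   import numpy as np

   # 1-D stand-in potential V(x) = x**4 - x**2 (a symmetric double well);
   # the actual Set D potential is an asymmetric four-well in 4 dimensions.
   GAMMA, F, OMEGA = 0.01, 0.135, 0.6     # values from the session log

   def deriv(t, y):
       x, v = y
       accel = -(4 * x**3 - 2 * x) - GAMMA * v + F * np.sin(OMEGA * t)
       return np.array([v, accel])

   def rk4_step(t, y, h):
       # One classical fixed-step 4th-order Runge-Kutta step.
       k1 = deriv(t, y)
       k2 = deriv(t + h / 2, y + h / 2 * k1)
       k3 = deriv(t + h / 2, y + h / 2 * k2)
       k4 = deriv(t + h, y + h * k3)
       return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

   h, t, y, out = 0.05, 0.0, np.array([0.1, 0.0]), []
   for i in range(10000):                 # short toy run
       y = rk4_step(t, y, h)
       t += h
       if i % 10 == 0:                    # "step between saves" = 10
           out.append(y[0])               # position as a toy observable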

Reason for choice:

  1. Synthetically generated
    We wanted one of the data sets to be synthetically generated, so that in principle everything about it is known. A simple, fixed-step 4th-order Runge-Kutta routine with a small step size was used to generate the data, instead of a much more efficient adaptive algorithm, in order to remove any question about the dynamics of the adaptive algorithm coupling to the dynamics of the system.

  2. Relatively high-dimensional dynamics
    The state space of the system has 9 dimensions (4 position, 4 velocity, and 1 forcing phase), which was picked to be large but still realistically accessible.

  3. Long data sets
    To recognize the presence of such high-dimensional dynamics, large data sets are needed, and they can obviously be generated easily here. For a time series of this length, the efficiency of the algorithm used to analyze the data begins to be important. 10^5 points was picked as the longest set that could realistically be distributed by email and floppy disk.

  4. No background information
    For some of the data sets we provided a great deal of background information; however, some techniques do not require such information, and so we included one data set without any.

  5. Finite states
    The potential has four minima; the dynamics consist of nonlinear oscillations within the wells and transitions among them. A symbolic dynamics analysis in terms of the observed states of the system might be able to avoid examining a great deal of unnecessary information (a quantization sketch follows this list). The observable was picked so that there were three observed values for the four wells (two were indistinguishable); a second challenge is to recognize that there were more internal states than were externally observable.

  6. Drifting parameters
    A common observational situation is to have deterministic dynamics coupled to non-stationary parameters (such as, perhaps, the heart-rate data in set B). We put a slow drift into this problem to provide a clean example of this: from the beginning to the end of the time series the dynamics are very similar, but the relative depths of the wells drift, so that the transition probabilities change. An analysis of these data that does not check for stationarity will not recognize the distinction between the short-term deterministic evolution and the long-term drift.
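
Here is the minimal sketch of the symbolic approach suggested in point 5: quantize the observable into three symbols and estimate the symbol-to-symbol transition matrix. The quantile thresholds are a crude stand-in assumption; in practice the cuts should be placed at the gaps between the three observed levels in a histogram of the data.

   import numpy as np

   y = np.concatenate([np.loadtxt("D1.dat"), np.loadtxt("D2.dat")])

   # Crude stand-in thresholds; better cuts come from the histogram of y.
   edges = np.quantile(y, [1 / 3, 2 / 3])
   symbols = np.digitize(y, edges)        # 0, 1, or 2 for each sample

   # Estimate the 3x3 transition matrix between successive symbols:
   T = np.zeros((3, 3))
   for a, b in zip(symbols[:-1], symbols[1:]):
       T[a, b] += 1
   T /= T.sum(axis=1, keepdims=True)
   print(np.round(T, 3))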

Continuation file:

D.cont gives the requested 500 continuation points.

Data Set E: Astrophysical data


Original Description:

E.dat (27,704 points)

This is a set of measurements of the light curve (time variation of the intensity) of the variable white dwarf star PG1159-035 during March 1989. It was recorded by the Whole Earth Telescope (a coordinated group of telescopes distributed around the earth that permits the continuous observation of an astronomical object) and submitted by James Dixson and Don Winget of the Department of Astronomy and the McDonald Observatory of the University of Texas at Austin. The telescope is described in an article in The Astrophysical Journal 361, pp. 309-317 (1990), and the measurements on PG1159-035 will be described in an article scheduled for the September 1 issue of The Astrophysical Journal. The observations were made of PG1159-035 and a nonvariable comparison star. A polynomial was fit to the light curve of the comparison star, and this polynomial was then used to normalize the PG1159-035 signal to remove changes due to varying extinction (light absorption) and differing telescope properties.

The samples in the files are all integrations spaced at 10-second intervals. The number of points and starting times of the parts are:

   part  points  start time
   ----  ------  ----------
      1     618    521048.7
      2    1256    526881.9
      3    1222    539951.9
      4     980    550941.5
      5     550    559402.8
      6    1554    566422.8
      7    1937    585517.5
      8    2496    613164.2
      9    1941    633834.8
     10    1472    647065.1
     11    2605    671536.7
     12    1549    699206.4
     13    2568    707915.6
     14    2602    731247.2
     15     673    764048.0
     16    1512    774058.0
     17    1669    794053.6

where the times are in seconds from the beginning of the observational run.

The intensity variations of the star arise from the excited modes, which are spherical harmonics. For a given mode Y_(klm), each l value has 2l+1 m modes. For a fixed star the m modes have the same frequency; rotation of the star and magnetic fields split this degeneracy. The two main questions that it is hoped these data will help answer are:

  1. How many modes are excited? The W.E.T. group believes that this number is very large (on the order of 100).

  2. Do the excited modes of the star interact, and what is the form of this interaction (i.e., is it nonlinear)?

Reason for choice:

  1. Noisy data
    Both this set and set A represent the optical oscillations of a physical system, but unlike set A, which has very little noise, these data come from a much noisier measurement process. These observations are all that is available (and in fact represent a tremendous experimental effort), so any insight into this system must come from the time series analysis.

  2. Discontinuous data
    The method of collection naturally partitions the data into separated observations, and so a successful analysis must combine the information from these segments.

  3. Probably linear, possibly nonlinear
    Unlike many of the time series problems in the recent literature (such as the optical oscillations in set A), these data represent a difficult observational question about the behavior of a primarily linear system. The initial question here is in principle simple: assuming linear spherical modes, how many of them can be recognized above the experimental background? A followup question is much harder: are there nonlinear interactions among the modes?
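
One standard way to attack the mode-counting question for data with gaps between segments, offered here only as a sketch and not as the method the W.E.T. group used, is a periodogram designed for unevenly sampled data, such as the Lomb-Scargle periodogram; significant peaks then count as candidate modes. The segment loader named below and the fabricated single-mode data are illustrative assumptions.

   import numpy as np
   from scipy.signal import lombscargle

   # t, flux = load_all_segments("E.dat")     # hypothetical segment loader
   rng = np.random.default_rng(0)             # fabricated stand-in data:
   t = np.sort(rng.uniform(0, 2.0e5, 4000))   # gappy, uneven time base (s)
   flux = np.sin(2 * np.pi * t / 516.0)       # one fake mode, period 516 s
   flux += 0.5 * rng.standard_normal(t.size)  # plus observational noise

   # Angular frequencies spanning periods from 100 s to 2000 s:
   freqs = 2 * np.pi / np.linspace(100.0, 2000.0, 5000)
   power = lombscargle(t, flux - flux.mean(), freqs, normalize=True)
   print("strongest period: %.1f s" % (2 * np.pi / freqs[np.argmax(power)]))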

Data Set F: J. S. Bach's last (unfinished) fugue


Original Description:

F.dat (4 variables, 3,808 points)

This is a vector data set, consisting of measurements of four interacting degrees of freedom of the system. The data format consists of lines spaced by equal time steps, with one column per degree of freedom. For a very interesting reason, the continuation of this data set cannot be measured. Therefore, any insight into the long-term predictability that allows the set to be continued will be of interest to a very large community. The identity of this set will be announced at the Workshop.


Reason for choice:

This data set is most easily listened to. A full discussion of this fugue, and of the reasons for choosing it, is given in the book chapter by Dirst and Weigend, "Baroque Forecasting."

Any feedback is welcome: aweigend@stern.nyu.edu. Thank you.