Analysis of Tick Data for High Frequency Trading

Abstract

In this project, I implemented C++ method to analyse the tick data for 100 Scandinavian blue chip stocks. I considered statistics of tick and trades in terms of mean, median and maximum. As advised, I also considered the round number effect. Some sanity check is conducted as well, by for example, checking the scale of the output. Considering the needs from industry, my top priority is efficiency in processing data and the reliability of measurement, then comes the precision. The final version of the code only cost less than 100MB and would finish within seconds. Finally, for the code, we heavily used containers of map in this implementation.

Codes

Codes are available on my repo

Pre-Start Check & Analysis

Considering the size and the specific requirement of the data, we first of all reviewed the CSV file.

Completeness

The CSV is checked. It is witnessed there is no incorrect data during transmitting. The only thing is that the first 72 rows have 15 columns while the rest are in 16 columns. However, we don’t need the 16th column. In other words, it is ready to proceed.

Precision and scale of the Data

We check to what precision / scale the data is recorded. This would be useful for processing the median. Here, we found the spread is recorded at different scale, while the trade time in seconds is recorded with the precision of 0.1s. Suppose we want to find the median of trading time, an idea is that we could try counting method, which would discuss in the following section.

Crucial Idea

These ideas are used for efficiency or better illustration purposes.

Map

Map is the central structure in this work. Since the tick data contains 100 stocks, and each of the stocks are updated randomly, we have to set an index to find the location of the stock’s data, then update it. The advantage of map is that it is very efficient in finding it by its index. We constructed three types of maps, both of them are indexed by string type, which is the stock name.

  • typedef map<std::string, std::vector<double>> mymap

    mymap is mainly used for dynamic programming. The first part stores the stock name for index, the second part is a vector that stores all the required data, which would be updated when reading line by line.

  • typedef map< std::string, std::vector<vector<double>> > trademap

    trademap is used to store the trade and bid - ask information. Since we are going to compute the time between trades and ticks, we need to quiry the time of last update.

  • typedef map< std::string, medianclass> stock_medianmap

    stock_medianmap is a bit complicated. It is used to store median. We would further discuss what do class medianclass contains in the following section. At this stage, we just need to hold that it is also indexed by the stock name, and stores the median of the stock’s bid-ask spread, tick and trade update times.

Median - Counting

For median, we are aware that we cannot use the normal dynamic programming method. Here, we tried a counting method. Since we are aware that the time between trades and tick changes are in the precision of 0.1s, and are generally very short, we could infer that the change times are generally of very few selections. We just need to count how many times do one of the selection appear, and then we could get the median. Hence, we constructed the following structure:

  • typedef map< double, unsigned long> medmap

    medmap is used to count the frequency of the index. For example, suppose we record one more instance of trade time of 0.1s, we just do medmap.find(0.1)->second++

However, for the spread, we cannot do this, as there could be as many selections / choices as the scale of the csv file. Hence, we set up a class medianclass, which contains two medmap, storing the information of tick and trades, respectively, along with a spreadvec, which saves all the spread.

Dynamic Programming

Dynamic Programming is a method that dramatically decrease the usage of memory. Since we don’t need to store all the information, we just read them line by line and we could already get most of the required data. For example, for mean value, we calculated as below: [\mu_i = \mu_{i-1} \frac{i-1}{i} + \frac{x_i}{i}] (\mu_i) is the mean when we read the (i)-th input (x_i).

Round Effect

For the round effect, we didn’t mechanically count how many zeros are there in each position, as we have varied length for volumes and prices. Alternatively, we count the density of zeros. We give the rigorous definition below: [f = \frac {N_0}{N}] Here, (f) denotes the density of zeros in other positions, (N_0, N) denotes the zeros in positions other than the first and the last, (N) is the total digits minus the first and the last.

NB: We disregard the first digit. For decimals (number smaller than 0), the first digit is always 0. For the rest, the first digit is never

  1. Both the cases would misguide the density, hence we disregard the first digit.

NB: For the requirement of the above, we didn’t consider the following two kinds of the digit:

  • numbers with less than two digits

Structure of the Code

The following two figures give a basic illustration of the structure of this work.

Illustration for stock class

Illustration for stock_medianmap class

Although there are remarks in the code, we still supply the reader with the following general explanation for main functions.

  • stock::updatemap This function reads in the lines, and then call the stock::calculate function to execute all the calculations.

  • stock::calculate This function ither initialises the vector store the required values or calculates all the required values by referring to some member variables of stock class.

  • stock::updatetradebid This function either initialises the map to store the last trade / bid-ask time info or updates the last trade/bid-ask time.

  • stock::updatemedian This function either initialises the map that is required for computing median, or update the map for median.

The above functions are built based on polymorphism to realise multi-function purposes.

  • stock::findmedian This function updates the median in the vector of required values based on medianmap

Results & Performance

Results

For the result file, please visit the output stock.txt. Here, we briefly discuss some of the findings:

DesciptionAverage Value
Mean Spread / Median Spread1.49
Longest Time Trade / Tick1.32
Probability of zeros at last digit / other position - trade price27.21
Probability of zeros at last digit / other position - trade volume70.09
Mean trade update time / Mean tick update time9.06
Total tick update times / total trade times9.26
  • Median is generally smaller than the mean. This implies that the mean value is dominated by some large values.

  • The round effect do exist in most of the stocks, in both volumes and the prices. (There are still exceptions in very few stocks).

  • Tick updates much faster than trade updates.

We cite the resources this program require when running to verify its efficiency.

MethodMemory Required
Python3.6Gb
C++ without optimisation500Mb
C++ with optimisation for median<100Mb

Drawbacks & Explanation

We explain several points that could be confusing or improved.

  • Computation of median: we could consider online algorithm or other better methods, such as median of median for better efficiency

  • Median of trades and tick updates is 0: After check by more than one methods, we have to confirm this is the truth. Due to the precision of the tick data is at 0.1s, some of the ticks happen very rapid, thus this makes the median become 0.

  • Some of the stocks have the possibility of zeros in price in other positions except the last digit as 0: After check, this happens to stocks with fewer trades, or small variations in price. But the volume won’t be impact. This is due the price is very stable, with very small variations, that is hard to generate a zero.

  • Due to the limit of time, I cannot go through the analysis stock by stock. However, if time permit, we could check, for example, the difference between stocks with higher / lower price, or stocks with more / fewer trades.

  • Simplified computation of median: we didn’t consider the two variants of median if the vector is odd or even. We just out put the value (x[i]), i = int(x.size()/2). The vector size is sufficiently big, hence this wouldn’t dramatically impact our value.

  • The loss of “BBHBEAT Index”. We have in total 99 stocks and 1 index. We disregard the BBHBEAT index since we only need to consider the condition code that is none or “XT”.

Sanity Check

We checked the following terms to confirm our data is convincing:

  • Total read lines: we compare the total lines we read with the original csv file.

  • The scale of the numbers: For example, all the numbers should be non negative. The update frequency of the tick / trade should in proportional to its time gap.

  • The output: For example, if the total stocks amount equal to the given one.

Remarks

The column of the output denotes:

  • 1 = stock name

  • 2 = mean of spread

  • 3 = median of spread

  • 4 = longest time between trades

  • 5 = longest time between tick updates

  • 6 = possibility of having zero in price in positions other than the last digit

  • 7 = possibility of having zero in price in last digit

  • 8 = possibility of having zero in volume in positions other than the last digit

  • 9 = mean of time between trades

  • 10 = median of time between trades

  • 11 = mean of time between tick changes

  • 12 = median of time between tick changes

  • 13 = total times of trade updates computed (exclude ones we disregarded)

  • 14 = total times of tick updates computed (exclude ones we disregarded)

Compile: GNU C++, g++ -o filename update.cpp median.cpp main.cpp, output written to stock.txt