NAG C Library Manual

# NAG Library Function Document: nag_approx_quantiles_arbitrary (g01apc)

## 1  Purpose

nag_approx_quantiles_arbitrary (g01apc) finds approximate quantiles from a large arbitrary-sized data stream using an out-of-core algorithm.

## 2  Specification

 #include <nag.h>
 #include <nagg01.h>
 void nag_approx_quantiles_arbitrary (Integer *ind, const double rv[], Integer nb, double eps, Integer *np, const double q[], double qv[], Integer nq, double rcomm[], Integer lrcomm, Integer icomm[], Integer licomm, NagError *fail)

## 3  Description

A quantile is a value which divides a frequency distribution such that there is a given proportion of data values below the quantile. For example, the median of a dataset is the $0.5$ quantile because half the values are less than or equal to it.
nag_approx_quantiles_arbitrary (g01apc) uses a slightly modified version of an algorithm described by Zhang and Wang (2007) to determine $\epsilon$-approximate quantiles of a large arbitrary-sized data stream of real values, where $\epsilon$ is a user-defined approximation factor. Let $m$ denote the number of data elements processed so far. Then, given any quantile $q\in \left[0.0,1.0\right]$, an $\epsilon$-approximate quantile is defined as an element in the data stream whose rank falls within $\left[\left(q-\epsilon \right)m,\left(q+\epsilon \right)m\right]$. If more than one $\epsilon$-approximate quantile is available, the one closest to $qm$ is used.
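
The rank window in this definition can be checked directly. The following is a minimal sketch, not part of the library: the helper name `is_eps_approximate` is hypothetical, and it assumes that ranks are counted from 1 (smallest element) to $m$ (largest element), which the text above does not state explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not a NAG function).
   Returns true if an element with 1-based rank `rank` in a stream of m
   elements qualifies as an eps-approximate q-quantile, i.e. if
   (q - eps) * m <= rank <= (q + eps) * m. */
bool is_eps_approximate(size_t rank, size_t m, double q, double eps)
{
    assert(m > 0 && rank >= 1 && rank <= m);

    double lo = (q - eps) * (double)m;  /* lower end of the rank window */
    double hi = (q + eps) * (double)m;  /* upper end of the rank window */

    return (double)rank >= lo && (double)rank <= hi;
}
```

For example, with $m=1000$, $q=0.5$ and $\epsilon =0.01$ the window is roughly ranks 490 to 510, so the element of rank 500 qualifies while the element of rank 480 does not.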

## 4  References

Zhang Q and Wang W (2007) A fast algorithm for approximate quantiles in high speed data streams Proceedings of the 19th International Conference on Scientific and Statistical Database Management IEEE Computer Society 29

## 5  Arguments

1:     ind    Integer *    Input/Output
On initial entry: must be set to $0$.
On entry: indicates the action required in the current call to nag_approx_quantiles_arbitrary (g01apc).
${\mathbf{ind}}=0$
Initialize the communication arrays and attempt to process the first nb values from the data stream. eps, rv and nb must be set and licomm must be at least $10$.
${\mathbf{ind}}=1$
Attempt to process the next block of nb values from the data stream. The calling program must update rv and (if required) nb, and re-enter nag_approx_quantiles_arbitrary (g01apc) with all other parameters unchanged.
${\mathbf{ind}}=2$
Continue calculation following the reallocation of either or both of the communication arrays rcomm and icomm.
${\mathbf{ind}}=3$
Calculate the nq $\epsilon$-approximate quantiles specified in q. The calling program must set q and nq and re-enter nag_approx_quantiles_arbitrary (g01apc) with all other parameters unchanged. This option can be chosen only when ${\mathbf{np}}\ge ⌈\mathrm{exp}\left(1.0\right)/{\mathbf{eps}}⌉$.
On exit: indicates output from the call.
${\mathbf{ind}}=1$
nag_approx_quantiles_arbitrary (g01apc) has processed np data points and expects to be called again with additional data.
${\mathbf{ind}}=2$
One or both of the communication arrays rcomm and icomm are too small. The new minimum lengths of rcomm and icomm have been returned in ${\mathbf{icomm}}\left[0\right]$ and ${\mathbf{icomm}}\left[1\right]$ respectively. If a new minimum length is greater than the current length then the corresponding communication array must be reallocated with its contents preserved, and nag_approx_quantiles_arbitrary (g01apc) called again with all other parameters unchanged.
If there is more data to be processed, it is recommended that lrcomm and licomm are made significantly bigger than the minimum to limit the number of reallocations.
${\mathbf{ind}}=3$
nag_approx_quantiles_arbitrary (g01apc) has returned the requested $\epsilon$-approximate quantiles in qv. These quantiles are based on np data points.
Constraint: ${\mathbf{ind}}=0$, $1$, $2$ or $3$.
2:     rv[$\mathit{dim}$]    const double    Input
Note: the dimension, dim, of the array rv must be at least ${\mathbf{nb}}$ when ${\mathbf{ind}}=0$, $1$ or $2$.
On entry: if ${\mathbf{ind}}=0$, $1$ or $2$, the vector containing the current block of data, otherwise rv is not referenced.
3:     nb    Integer    Input
On entry: if ${\mathbf{ind}}=0$, $1$ or $2$, the size of the current block of data. The size of blocks of data in array rv can vary; therefore nb can change between calls to nag_approx_quantiles_arbitrary (g01apc).
Constraint: if ${\mathbf{ind}}=0$, $1$ or $2$, ${\mathbf{nb}}>0$.
4:     eps    double    Input
On entry: approximation factor $\epsilon$.
Constraint: ${\mathbf{eps}}>0.0\text{​ and ​}{\mathbf{eps}}\le 1.0$.
5:     np    Integer *    Output
On exit: $m$, the number of elements processed so far.
6:     q[$\mathit{dim}$]    const double    Input
Note: the dimension, dim, of the array q must be at least ${\mathbf{nq}}$ when ${\mathbf{ind}}=3$.
On entry: if ${\mathbf{ind}}=3$, the quantiles to be calculated, otherwise q is not referenced. Note that ${\mathbf{q}}\left[i\right]=0.0$ corresponds to the minimum value and ${\mathbf{q}}\left[i\right]=1.0$ to the maximum value.
Constraint: if ${\mathbf{ind}}=3$, $0.0\le {\mathbf{q}}\left[\mathit{i}-1\right]\le 1.0$, for $\mathit{i}=1,2,\dots ,{\mathbf{nq}}$.
7:     qv[$\mathit{dim}$]    double    Output
Note: the dimension, dim, of the array qv must be at least ${\mathbf{nq}}$ when ${\mathbf{ind}}=3$.
On exit: if ${\mathbf{ind}}=3$, ${\mathbf{qv}}\left[i\right]$ contains the $\epsilon$-approximate quantile specified by the value provided in ${\mathbf{q}}\left[i\right]$.
8:     nq    Integer    Input
On entry: if ${\mathbf{ind}}=3$, the number of quantiles requested, otherwise nq is not referenced.
Constraint: if ${\mathbf{ind}}=3$, ${\mathbf{nq}}>0$.
9:     rcomm[lrcomm]    double    Communication Array
On entry: if ${\mathbf{ind}}=1$ or $2$ then the first $l$ elements of rcomm as supplied to nag_approx_quantiles_arbitrary (g01apc) must be identical to the first $l$ elements of rcomm returned from the last call to nag_approx_quantiles_arbitrary (g01apc), where $l$ is the value of lrcomm used in the last call. In other words, the contents of rcomm must not be altered between calls to this function. If rcomm needs to be reallocated then its contents must be preserved. If ${\mathbf{ind}}=0$ then rcomm need not be set.
On exit: rcomm holds information required by subsequent calls to nag_approx_quantiles_arbitrary (g01apc).
10:   lrcomm    Integer    Input
On entry: the dimension of the array rcomm.
Constraints:
• if ${\mathbf{ind}}=0$, ${\mathbf{lrcomm}}\ge 1$;
• otherwise ${\mathbf{lrcomm}}\ge {\mathbf{icomm}}\left[0\right]$.
11:   icomm[licomm]    Integer    Communication Array
On entry: if ${\mathbf{ind}}=1$ or $2$ then the first $l$ elements of icomm as supplied to nag_approx_quantiles_arbitrary (g01apc) must be identical to the first $l$ elements of icomm returned from the last call to nag_approx_quantiles_arbitrary (g01apc), where $l$ is the value of licomm used in the last call. In other words, the contents of icomm must not be altered between calls to this function. If icomm needs to be reallocated then its contents must be preserved. If ${\mathbf{ind}}=0$ then icomm need not be set.
On exit: ${\mathbf{icomm}}\left[0\right]$ holds the minimum required length for rcomm and ${\mathbf{icomm}}\left[1\right]$ holds the minimum required length for icomm. The remaining elements of icomm are used for communication between subsequent calls to nag_approx_quantiles_arbitrary (g01apc).
12:   licomm    Integer    Input
On entry: the dimension of the array icomm.
Constraints:
• if ${\mathbf{ind}}=0$, ${\mathbf{licomm}}\ge 10$;
• otherwise ${\mathbf{licomm}}\ge {\mathbf{icomm}}\left[1\right]$.
13:   fail    NagError *    Input/Output
The NAG error argument (see Section 3.6 in the Essential Introduction).
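
The reallocation step required after an exit with ${\mathbf{ind}}=2$ might be handled along the following lines. This is a minimal sketch under stated assumptions: the helper name `grow_comm_arrays` and the `headroom` factor are hypothetical, the buffers are assumed to have been obtained from `malloc`, and `long` stands in for the NAG `Integer` type, whose actual width depends on the library build. The required lengths are read from ${\mathbf{icomm}}\left[0\right]$ and ${\mathbf{icomm}}\left[1\right]$ before icomm itself is touched.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not a NAG function).
   Grows rcomm and icomm to at least the minimum lengths reported in
   icomm[0] and icomm[1], multiplied by `headroom` to limit the number of
   future reallocations. realloc preserves the existing contents.
   Returns 0 on success and -1 if an allocation fails, in which case the
   array that could not be grown is left untouched. */
int grow_comm_arrays(double **rcomm, long *lrcomm,
                     long **icomm, long *licomm, double headroom)
{
    assert(rcomm && lrcomm && icomm && licomm && *icomm);

    long need_r = (*icomm)[0];  /* minimum length required for rcomm */
    long need_i = (*icomm)[1];  /* minimum length required for icomm */

    if (need_r > *lrcomm) {
        long new_len = (long)((double)need_r * headroom);
        if (new_len < need_r) new_len = need_r;  /* guard against headroom < 1 */
        double *p = realloc(*rcomm, (size_t)new_len * sizeof *p);
        if (p == NULL) return -1;
        *rcomm = p;
        *lrcomm = new_len;
    }

    if (need_i > *licomm) {
        long new_len = (long)((double)need_i * headroom);
        if (new_len < need_i) new_len = need_i;
        long *p = realloc(*icomm, (size_t)new_len * sizeof *p);
        if (p == NULL) return -1;
        *icomm = p;
        *licomm = new_len;
    }

    return 0;
}
```

After a successful return, the function would be re-entered with ${\mathbf{ind}}=2$, the updated lrcomm and licomm, and all other arguments unchanged.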

## 6  Error Indicators and Warnings

NE_ALLOC_FAIL
Dynamic memory allocation failed.
NE_ARRAY_SIZE
On entry, ${\mathbf{licomm}}=〈\mathit{\text{value}}〉$.
Constraint: ${\mathbf{licomm}}\ge 10$.
On entry, ${\mathbf{lrcomm}}=〈\mathit{\text{value}}〉$.
Constraint: ${\mathbf{lrcomm}}\ge 1$.
On entry, argument $〈\mathit{\text{value}}〉$ had an illegal value.
NE_ILLEGAL_COMM
The contents of icomm have been altered between calls to this function.
The contents of rcomm have been altered between calls to this function.
NE_INT
On entry, ${\mathbf{ind}}=0$, $1$ or $2$ and ${\mathbf{nb}}=〈\mathit{\text{value}}〉$.
Constraint: if ${\mathbf{ind}}=0$, $1$ or $2$ then ${\mathbf{nb}}>0$.
On entry, ${\mathbf{ind}}=3$ and ${\mathbf{nq}}=〈\mathit{\text{value}}〉$.
Constraint: if ${\mathbf{ind}}=3$ then ${\mathbf{nq}}>0$.
On entry, ${\mathbf{ind}}=〈\mathit{\text{value}}〉$.
Constraint: ${\mathbf{ind}}=0$, $1$, $2$ or $3$.
NE_INTERNAL_ERROR
An internal error has occurred in this function. Check the function call and any array sizes. If the call is correct then please contact NAG for assistance.
NE_Q_OUT_OF_RANGE
On entry, ${\mathbf{ind}}=3$ and ${\mathbf{q}}\left[〈\mathit{\text{value}}〉\right]=〈\mathit{\text{value}}〉$.
Constraint: if ${\mathbf{ind}}=3$ then $0.0\le {\mathbf{q}}\left[i\right]\le 1.0$ for all $i$.
NE_REAL
On entry, ${\mathbf{eps}}=〈\mathit{\text{value}}〉$.
Constraint: $0.0<{\mathbf{eps}}\le 1.0$.
NE_TOO_SMALL
Number of data elements streamed, $〈\mathit{\text{value}}〉$, is not sufficient for a quantile query when ${\mathbf{eps}}=〈\mathit{\text{value}}〉$.
Supply more data or reprocess the data with a larger value of eps.

## 7  Accuracy

Not applicable.

## 8  Further Comments

The average time taken by nag_approx_quantiles_arbitrary (g01apc) scales as ${\mathbf{np}}\mathrm{log}\left(1/\epsilon \mathrm{log}\left(\epsilon {\mathbf{np}}\right)\right)$.
It is not possible to determine in advance the final size of the communication arrays rcomm and icomm without knowing the size of the dataset. However, if a rough size ($n$) is known, the speed of the computation can be increased if the sizes of the communication arrays are not smaller than
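 ${\mathbf{lrcomm}}=\left({\mathrm{log}}_{2}\left(n\times {\mathbf{eps}}+1.0\right)-2\right)\times \left\lceil 1.0/{\mathbf{eps}}\right\rceil +1+x+2\times \mathrm{min}\left(x,\left\lceil x/2.0\right\rceil +1\right)\times y+1$
 ${\mathbf{licomm}}=\left({\mathrm{log}}_{2}\left(n\times {\mathbf{eps}}+1.0\right)-2\right)\times \left(2\times \left(\left\lceil 1.0/{\mathbf{eps}}\right\rceil +1\right)+1\right)+2\times \left(x+2\times \mathrm{min}\left(x,\left\lceil x/2.0\right\rceil +1\right)\times y\right)+y+11$
where
 $x=\mathrm{max}\left(1,\left\lfloor \mathrm{log}\left({\mathbf{eps}}\times n\right)/{\mathbf{eps}}\right\rfloor \right)$
 $y=\left\lfloor {\mathrm{log}}_{2}\left(n/x+1.0\right)\right\rfloor +1.$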

## 9  Example

This example computes a list of $\epsilon$-approximate quantiles. The data is processed in blocks of $20$ observations at a time to simulate a situation in which the data is made available in a piecemeal fashion.

### 9.1  Program Text

Program Text (g01apce.c)

### 9.2  Program Data

Program Data (g01apce.d)

### 9.3  Program Results

Program Results (g01apce.r)