Python Library

The Making of the New nAG Library for Python, With Usage Tutorials


Published 05/09/2018 By Mat Cross

We have a new Python package for accessing the nAG Library. It has been a few person-years in the making and is a substantial improvement over its predecessors in terms of usability.

This post splits into two sections: the first covers some technical details, history and context about what’s happened behind the scenes with the package; the second jumps into some examples of how to use it.

Background

Previous Offerings

Earlier packages for accessing the nAG Library from Python based themselves on the off-the-peg nAG C Library.

The downside was that the result was still very much like using a C library, but from Python (i.e., fairly unforgiving!). Particularly close to the metal were callback functions (where you had to supply Python callbacks having C signatures …) and the I/O system (where you could interact only with the spartan C-like default system used by the underlying compiled library). C header files were used as the source templates for the Python wrappers. The inherent terseness of these meant that incorrectly-shaped arrays supplied by the caller could not be detected at the Python level; instead they bubbled up from the bowels of the OS as scary illegal-memory-access crashes.

In response to customer interest and feedback we decided to commit to an overhaul, to deliver a more Pythonic product.

nAG Engine Technology

The nAG Library is a single, growing collection of mathematical and statistical algorithms. Initially provided for use from Fortran, and subsequently also via native C interfaces, for many years these two products used two separate implementations of the same methods. Naturally this duplication made it harder to develop new features and to support existing ones.

The nAG Library Engine was introduced to unify the implementation of the algorithms into a single core, interoperable with ANSI C. For any language that can talk to ANSI C we can ‘wrap’ the nAG Engine with a minimal communication layer for marshalling between the user’s environment and the Engine API:

define optimization_foolang_wrapper(Array x, ...):
   [...]
   engine_x = EnsureContiguous(x);
   New c_exit_struct();
   Call nag_engine_optimizer(engine_x, ..., c_exit_struct);
   c_exit_struct.handle_errors();

We document the Engine API using XML:

<param>
<paramhead xid="X" type="iarray" class="I" purpose="data" />
   [...]

For reasons of interface mutability we do not make this API public; as new language environments come along we often need to expand the API in some way to assist communication with the Engine.

The public interfaces you see in the nAG C Library or the nAG Toolbox for MATLAB®, say, are just facets of the full superset Engine API. For each routine in the Library we encode a wealth of metadata, parts of which may be discarded or made irrelevant depending on the language framework being targeted. For example: detailed loci for errors diagnosed in the Engine; or full metadata for an array shape that is expected by the Engine, extending in some cases to the shape that is actually used on entry or that is of interest to the user on exit:

<paramhead xid="V" type="iarray" class="IO" purpose="data">
   <dim1>100</dim1>
   <dim1-input>3</dim1-input>
   <input>three algorithmic flags. The remainder of this array is used as workspace.</input>
   <dim1-output>10</dim1-output>
   <output>ten interesting iteration statistics.</output>
</paramhead>

The implication there is that the interface for this algorithm could be created to require an array of length 3 on entry, expand it to 100 for the Engine to use, and return just the first 10 elements back to the caller, in a high-level language where this was appropriate.
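As a sketch of that implication (the names here are invented for illustration and are not part of any nAG product), a generated wrapper might marshal the array like this:

```python
import numpy as np

IN_LEN, ENGINE_LEN, OUT_LEN = 3, 100, 10  # lengths from the XML above

def wrapped_algorithm(v, engine_call):
    """Hypothetical marshalling for the array V described above: the
    caller supplies 3 flags, the Engine sees a 100-element array, and
    only the first 10 elements come back."""
    v = np.asarray(v, dtype=int)
    if v.shape != (IN_LEN,):
        raise ValueError('expected an array of {} flags'.format(IN_LEN))
    engine_v = np.zeros(ENGINE_LEN, dtype=int)
    engine_v[:IN_LEN] = v      # flags on entry; the rest is workspace
    engine_call(engine_v)      # stand-in for the call into the Engine
    return engine_v[:OUT_LEN].copy()  # iteration statistics on exit
```

The caller never sees the workspace portion; the metadata alone tells the code generator to expand and truncate the array at the boundary.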

When we approach a language framework for the first time (or, in the Python case, when we revisit it for a usability revamp) there is naturally a large amount of initial design work required to settle how the user API will look. But once that has been agreed, whenever a new wrapper is needed for a particular Engine algorithm in that framework we can interrogate its XML and create an appropriate set of marshalling commands to get the user’s data between the two systems.

It is very important to remember that some algorithms must return back to the user’s space during the course of their computations and not just on completion. This is usually when a supplied callback procedure is present (as with an objective function in optimization, for example). In recent years, as we began to supply access to the algorithmic Engine from languages that have rich I/O systems, we also expanded the Engine API to allow Engine I/O calls (e.g. for iteration monitoring) to call back to the user.

Just as small wrapper layers manage the initial and final movement of data between the user and the Engine, we use ‘helper’ procedures in a similar context for managing movement between a callback argument written in the user’s language and its invocation from within the Engine core:

def cb_objfun(x, y, z):
   """
   Wrapping the user's Python objfun.
   Engine argument z is not required in the Python interface.
   """
   y[:] = objfun(x.asndarray)
[...]
nag_engine_call(..., objfun_helper=cb_objfun, ...)

The moral is that to get the most usable interfaces in any given language context one must interface directly to the underlying Engine.

Intermediate XML

The original model we used to underpin the code generation for language wrappers was that the ‘root’ XML contained annotations about where a piece of information was relevant.

Consider language environment somelib (the only language we support that handles array sections/slices) and an array that we can receive as a slice. In all other environments where there is no provision for slicing we need to make the array addressable using an increment; that is, whatever the operation in question is, it is performed on elements $1, 1+\mathrm{incx}, 1+2*\mathrm{incx},\ldots$, say, of the array:

<param>
  <paramhead xid="INCX" type="integer" class="I" purpose="index-increment"/>
  [...]
</param>
[...]
<param>
  <paramhead xid="X" type="iarray" class="IO" purpose="data" omitlib="somelib">
    <dim1>max(1, 1+(<arg>N</arg>-1)*<arg>INCX</arg>)</dim1>
  </paramhead>
  <paramhead xid="X" type="iarray" class="IO" purpose="data" lib="somelib">
    <dim1><arg>N</arg></dim1>
  </paramhead>
  [...]

As we interface to more, and higher-level, languages in which design paradigms recur, it has become apparent that tagging ‘where’ a piece of information is relevant does not scale. With the example above, once we start supporting calling the nAG Library from Python (which allows slicing), the somelib tags above need to be extended to cover this case too.

We need instead to tag what happens when an entity or annotation is active:

<param>
  <paramhead xid="X" type="iarray" class="IO" purpose="data" when-arg="INCX">
    <dim1>max(1, 1+(<arg>N</arg>-1)*<arg>INCX</arg>)</dim1>
  </paramhead>
  <paramhead xid="X" type="iarray" class="IO" purpose="data" not-when-arg="INCX">
    <dim1><arg>N</arg></dim1>
  </paramhead>
  [...]

To collapse the multifaceted description in the root XML down for code generation, we run a series of filtering stylesheets which define the relevant characteristics of the outer and inner interfaces for the two ‘layers’ being bridged.

In the example above, the Python filtering stylesheet may mark that it will be dropping arguments that have purpose="index-increment" (because the language supports array sections), and hence it will make the not-when-arg="INCX" element become active.
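The real pipeline uses XSLT stylesheets, but the selection logic can be sketched in plain Python with the standard library's xml.etree (element and attribute names follow the examples above; the function name and variables are ours):

```python
import xml.etree.ElementTree as ET

ROOT_XML = """
<param>
  <paramhead xid="X" class="IO" when-arg="INCX"><dim1>incremented length</dim1></paramhead>
  <paramhead xid="X" class="IO" not-when-arg="INCX"><dim1>N</dim1></paramhead>
</param>
"""

DROPPED = {'INCX'}  # this target supports slicing, so increments are dropped

def active_paramheads(xml_text, dropped):
    """Keep a variant only if its guard is satisfied for this target."""
    keep = []
    for head in ET.fromstring(xml_text).iter('paramhead'):
        if head.get('when-arg') in dropped:
            continue  # active only when the named argument is kept
        guard = head.get('not-when-arg')
        if guard is not None and guard not in dropped:
            continue  # active only when the named argument is dropped
        keep.append(head)
    return keep

active = active_paramheads(ROOT_XML, DROPPED)
# Only the not-when-arg="INCX" variant survives the filtering.
```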

After filtering, the resulting ‘intermediate’ XML file is purely the information required for the layer pairing, and no more:

<params type="input">
   <param interface="wrapper-input" xid="x" stype="int" pclass="array" class="I" purpose="data">
      <dim1>:</dim1>
   </param>
</params>
[...]

By annotating our XML so that it focuses not on specific language environments but more generally on language features, we get a more future-proof description of each API.

New Package

Our new nAG Library for Python is intended to be intuitive, flexible and natural. We believe that there should be no need to spell out what it does from the Python viewpoint, because it should do everything you would expect it to. We are confident that it addresses the usability issues of the previous offerings.

The Library interface collection is supplied by a library subpackage, which is a one-to-one provision of all algorithmics in a standard procedural form to match the set supplied by the traditional nAG Libraries. We have been careful to build in space so that other encapsulations could be included in the future depending on demand – a subpackage for a specific vertical domain perhaps, or a certain object-oriented bundling suitable for use by a particular application.

There are two points of technical interest that came up during the design of the new package.

The nAG Engine uses column-major storage for multidimensional arrays. This is therefore the scheme you should use for your own data, for best efficiency. The NumPy default storage scheme is row-major, but the ndarray container supports either scheme. So for flexibility the new nAG interfaces support both schemes as well.
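For example (plain NumPy, nothing nAG-specific), opting in to column-major storage for your own data is cheap and explicit:

```python
import numpy as np

a_c = np.arange(6.).reshape(2, 3)   # row-major (C order), the NumPy default
a_f = np.asfortranarray(a_c)        # column-major copy, matching the Engine

# Same logical matrix, different memory layouts:
assert a_c.flags['C_CONTIGUOUS']
assert a_f.flags['F_CONTIGUOUS']
assert (a_c == a_f).all()

# asfortranarray does not copy data that is already column-major, so
# building arrays with order='F' from the start avoids conversions:
b = np.zeros((2, 3), order='F')
assert np.shares_memory(np.asfortranarray(b), b)
```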

We have added a unified transposition kernel to the nAG Engine which can be activated for row-ordered input, so that the Engine proper (on input) and the caller (on output) each receive the view of the data they expect. The language wrapper (and consequently the code-generation framework that creates it) does not need to include any of the transposition code; it merely decides when to ask the Engine to activate it:

nag_engine_marshalling(do_transpose=True, a)
[...]
define nag_engine_marshalling(do_transpose, a):
   [...]
   if do_transpose
      transposer(a, a_t);
      nag_engine_proper(a_t);
      transposer(a_t, a);
   else
      nag_engine_proper(a);

Along similar lines we have ‘workspace allocation’ handlers in the Engine too. In a user environment where workspace arrays in the Engine are suppressed from the caller—which will usually be the case except for lower-level environments—a single allocation handler can be triggered to set up workspace of the correct size without the language wrapper having to worry about the details:

nag_engine_marshalling(alloc_work=True, wk=0)
[...]
define nag_engine_marshalling(alloc_work, wk_array):
   [...]
   if alloc_work
      allocate(local_work(100));
      nag_engine_proper(local_work);
   else
      nag_engine_proper(wk_array);

We hope that the innovations above will help us respond more quickly when supplying nAG algorithmics in emerging environments, and ensure that what we provide is as natural as possible.

Access

All the information on applicability, installation and getting started can be found in the HTML documentation for the new package.

Installation directly from the nAG website can be as simple as

pip install --extra-index-url https://www.nag.com/downloads/py/naginterfaces_mkl naginterfaces

Code Examples

Now we’ll take a look at some examples of using the new package. The emphasis is on how to access functionality from the Library, so the problems we solve are deliberately simple but good for illustration.

We cover surface fitting, global optimization, quantile regression, kernel density estimation and change point analysis.

Surface Fitting

This first example fits a surface to some scattered data and plots the result.

To begin, generate some scattered $(x, y)$ data, of a modest size for this example: 100 pseudorandom uniformly-distributed values on (0,1] using the Wichmann–Hill I generator.

nAG random number generators pass around their state using a communication structure which is constructed by an initialization function. To create a repeatable sequence we use rand.init_repeat, for which the required generator is selected using genid = 2:

from naginterfaces.library import rand
statecomm = rand.init_repeat(genid=2, seed=[32958])

Generate the scattered data using this initialized RNG state

In [2]:
n = 100
x = rand.dist_uniform01(n, statecomm)
y = rand.dist_uniform01(n, statecomm)

Ensure that the bounding box for the data includes the corners $(0, 0)$ and $(1, 1)$

In [3]:
x[0], y[0] = 0., 0.
x[-1], y[-1] = 1., 1.

Here are the function values to fit, which are from the Franke function—a popular function for testing surface-fitting methods

In [4]:
import numpy as np
f = (
    0.75*np.exp(
        -(
            (9.*x-2.)**2 +
            (9.*y-2.)**2
        )/4.
    ) +
    0.75*np.exp(
        -(9.*x+1.)**2/49. -
        (9.*y+1.)/10.
    ) +
    0.5*np.exp(
        -(
            (9.*x-7.)**2 +
            (9.*y-3.)**2
        )/4.
    ) -
    0.2*np.exp(
        -(9.*x-4.)**2 -
        (9.*y-7.)**2
    )
)

The nAG Library supplies a very configurable function (based on the TSFIT package of Davydov and Zeilfelder) for fitting a smooth ($C^1$ or $C^2$) surface to scattered data: fit.dim2_spline_ts_sctr.

We need to choose a granularity for the triangulation that the function will use

In [5]:
nxcels = 6
nycels = 6

and also filtering thresholds for that triangulation

In [6]:
lsminp = 3
lsmaxp = 100

The fitting function needs its communication state to be initialized using the fit Chapter’s initialization and option-setting function

In [7]:
from naginterfaces.library import fit
comm = {}
fit.opt_set('Initialize = dim2_spline_ts_sctr', comm)

Further algorithmic options can also be configured using the option-setting function, if the documented defaults are not desired; we would like to use $C^2$ global smoothing and for the local approximations on the triangulation to use averaged radial basis functions

In [8]:
for fit_opt in [
    'Global Smoothing Level = 2',
    'Averaged Spline = Yes',
    'Local Method = RBF',
]:
    fit.opt_set(fit_opt, comm)

Now compute the spline coefficients, which are entered into the communication structure for subsequent use by the evaluation functions for the fitter

In [9]:
fit.dim2_spline_ts_sctr(
    x, y, f, lsminp, lsmaxp, nxcels, nycels, comm,
)

For plotting the spline we use the associated mesh evaluator, which is fit.dim2_spline_ts_evalm. For the evaluation this requires a vector of $x$ coordinates and a vector of $y$ coordinates

In [10]:
x_m = np.linspace(0, 1, 101)
y_m = np.linspace(0, 1, 101)

The values of the spline on the grid defined by these vectors is then

In [11]:
z_m = fit.dim2_spline_ts_evalm(x_m, y_m, comm)

For compatibility with surface plotting in matplotlib we need to convert the mesh vectors into a mesh grid

In [12]:
X, Y = np.meshgrid(x_m, y_m, indexing='ij')

Now plot the spline surface (in wireframe here) with a projection of the contours

In [13]:
# Jupyter magic for displaying figures inline:
%matplotlib inline

In [14]:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
fig = plt.figure()
ax = Axes3D(fig)
ax.grid(False)
ax.plot_wireframe(X, Y, z_m, color='red', linewidths=0.4)
ax.contour(X, Y, z_m, 15, offset=-1.2, cmap=cm.jet)
ax.set_title('Spline Fit of the Franke Function')
ax.set_xlabel(r'$\mathit{x}$')
ax.set_ylabel(r'$\mathit{y}$')
ax.set_zlim3d(-1.2, 1.2)
ax.azim = -20
ax.elev = 20
plt.show()

Spline fit

Global Optimization

Finding the absolute maximum or minimum value of a function can be hard. For problems with a few tens of variables and only bound constraints the nAG solver glopt.bnd_mcs_solve for multi-level coordinate search (from Huyer and Neumaier) is an effective routine.

For a quick demonstration of glopt.bnd_mcs_solve we find the global minimum of the two-dimensional ‘peaks’ function, which (in a suitable form for the solver’s objfun argument) is

In [15]:
from math import exp
objfun = lambda x, _nstate: (
    3.*(1. - x[0])**2*exp(-x[0]**2 - (x[1] + 1.)**2)
    - (10.*(x[0]/5. - x[0]**3 - x[1]**5)*exp(-x[0]**2 - x[1]**2))
    - 1./3.0*exp(-(x[0] + 1.)**2 - x[1]**2)
)

The optimization is over the box $[-3, 3]\times[-3, 3]$

In [16]:

n = 2
bl = [-3.]*n
bu = [3.]*n

The ibound argument tells the solver that we are supplying the bounds explicitly ourselves

In [17]:
ibound = 0

The plot for this example is going to show the search boxes that the optimizer considers as it iterates. We can use the function’s monitoring callback to store the boxes that it visits

In [18]:
from matplotlib.patches import Rectangle

boxes = []

def monit(
    _ncall, _xbest, _icount, _inlist, _numpts,
    _initpt, _xbaskt, boxl, boxu, _nstate,
):
    boxes.append(Rectangle(boxl, *(boxu - boxl)))

Before calling the optimizer we need to initialize its communication state

In [19]:
from naginterfaces.library import glopt
comm = glopt.bnd_mcs_init()

Now do the optimization. The monitoring function is an optional keyword argument. In this example we don’t need to access any of the return data

In [20]:
glopt.bnd_mcs_solve(objfun, ibound, bl, bu, comm, monit=monit);

Grid the objective function for plotting

In [21]:
import numpy as np
delta = 0.025
x = np.arange(bl[0], bu[0], delta)
y = np.arange(bl[1], bu[1], delta)
X, Y = np.meshgrid(x, y)
Z = np.empty((len(y), len(x)))
for i, x_i in enumerate(x):
    for j, y_j in enumerate(y):
        Z[j, i] = objfun([x_i, y_j], None)

Store the optimization start and end points for inclusion in the plot. For the routine’s default ‘simple’ initialization method (iinit = 0) the start point (initpt) was at the origin

In [22]:
start_x = [0.]*n

Here is the known global minimum for the problem

In [23]:
min_x = [0.23, -1.63]

Aggregate the search boxes to include in the subsequent plot

In [24]:
from matplotlib.collections import PatchCollection
boxes_col = PatchCollection(
    boxes,
    edgecolor='b', facecolor='none', linewidth=0.2,
)

Now assemble the plot

In [25]:
from matplotlib import cm
import matplotlib.pyplot as plt
ax = plt.axes()
ax.contour(
    X, Y, Z,
    levels=[-6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 6, 8],
    cmap=cm.jet,
)
ax.plot(start_x[0], start_x[1], 'rx')
ax.plot(min_x[0], min_x[1], 'go')
ax.legend(('Start', 'Glob. min.'), loc='upper right')
ax.add_collection(boxes_col)
ax.axis(xmin=bl[0], ymin=bl[1], xmax=bu[0], ymax=bu[1])
ax.set_title(r'The Peaks Function and MCS Search Boxes')
plt.show()

Peaks Function

Quantile Regression

The nAG function correg.quantile_linreg_easy can be used to model the conditional $\tau$-th quantile of a dependent variable against one or more independent or explanatory variables.
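Underlying such a fit is the asymmetric ‘check’ (pinball) loss; minimizing it over a constant recovers the sample $\tau$-quantile, which is why the fit is robust to outliers. A small self-contained illustration (plain NumPy; the function name is ours, not part of the nAG API):

```python
import numpy as np

def pinball(u, tau):
    """Check loss rho_tau(u) = u*(tau - I(u < 0)), minimized by the
    tau-th quantile."""
    return np.where(u >= 0, tau * u, (tau - 1.0) * u)

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
tau = 0.5

# The constant minimizing the summed check loss is a tau-quantile; for
# tau = 0.5 that is the median, unmoved by the outlying value 100:
losses = [pinball(data - c, tau).sum() for c in data]
best = data[int(np.argmin(losses))]  # -> 3.0, the median
```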

In our example below the dependent variable is household food expenditure, which is regressed against household income. The data comes from Engel's 1857 study.

In [26]:
income = [
    420.1577, 541.4117, 901.1575, 639.0802, 750.8756, 945.7989, 829.3979, 979.1648,
    1309.8789, 1492.3987, 502.8390, 616.7168, 790.9225, 555.8786, 713.4412, 838.7561,
    535.0766, 596.4408, 924.5619, 487.7583, 692.6397, 997.8770, 506.9995, 654.1587,
    933.9193, 433.6813, 587.5962, 896.4746, 454.4782, 584.9989, 800.7990, 502.4369,
    713.5197, 906.0006, 880.5969, 796.8289, 854.8791, 1167.3716, 523.8000, 670.7792,
    377.0584, 851.5430, 1121.0937, 625.5179, 805.5377, 558.5812, 884.4005, 1257.4989,
    2051.1789, 1466.3330, 730.0989, 2432.3910, 940.9218, 1177.8547, 1222.5939, 1519.5811,
    687.6638, 953.1192, 953.1192, 953.1192, 939.0418, 1283.4025, 1511.5789, 1342.5821,
    511.7980, 689.7988, 1532.3074, 1056.0808, 387.3195, 387.3195, 410.9987, 499.7510,
    832.7554, 614.9986, 887.4658, 1595.1611, 1807.9520, 541.2006, 1057.6767, 800.7990,
    1245.6964, 1201.0002, 634.4002, 956.2315, 1148.6010, 1768.8236, 2822.5330, 922.3548,
    2293.1920, 627.4726, 889.9809, 1162.2000, 1197.0794, 530.7972, 1142.1526, 1088.0039,
    484.6612, 1536.0201, 678.8974, 671.8802, 690.4683, 860.6948, 873.3095, 894.4598,
    1148.6470, 926.8762, 839.0414, 829.4974, 1264.0043, 1937.9771, 698.8317, 920.4199,
    1897.5711, 891.6824, 889.6784, 1221.4818, 544.5991, 1031.4491, 1462.9497, 830.4353,
    975.0415, 1337.9983, 867.6427, 725.7459, 989.0056, 1525.0005, 672.1960, 923.3977,
    472.3215, 590.7601, 831.7983, 1139.4945, 507.5169, 576.1972, 696.5991, 650.8180,
    949.5802, 497.1193, 570.1674, 724.7306, 408.3399, 638.6713, 1225.7890, 715.3701,
    800.4708, 975.5974, 1613.7565, 608.5019, 958.6634, 835.9426, 1024.8177, 1006.4353,
    726.0000, 494.4174, 776.5958, 415.4407, 581.3599, 643.3571, 2551.6615, 1795.3226,
    1165.7734, 815.6212, 1264.2066, 1095.4056, 447.4479, 1178.9742, 975.8023, 1017.8522,
    423.8798, 558.7767, 943.2487, 1348.3002, 2340.6174, 587.1792, 1540.9741, 1115.8481,
    1044.6843, 1389.7929, 2497.7860, 1585.3809, 1862.0438, 2008.8546, 697.3099, 571.2517,
    598.3465, 461.0977, 977.1107, 883.9849, 718.3594, 543.8971, 1587.3480, 4957.8130,
    969.6838, 419.9980, 561.9990, 689.5988, 1398.5203, 820.8168, 875.1716, 1392.4499,
    1256.3174, 1362.8590, 1999.2552, 1209.4730, 1125.0356, 1827.4010, 1014.1540, 880.3944,
    873.7375, 951.4432, 473.0022, 601.0030, 713.9979, 829.2984, 959.7953, 1212.9613,
    958.8743, 1129.4431, 1943.0419, 539.6388, 463.5990, 562.6400, 736.7584, 1415.4461,
    2208.7897, 636.0009, 759.4010, 1078.8382, 748.6413, 987.6417, 788.0961, 1020.0225,
    1230.9235, 440.5174, 743.0772,
]
expenditure = [
    255.8394, 310.9587, 485.6800, 402.9974, 495.5608, 633.7978, 630.7566,
    700.4409, 830.9586, 815.3602, 338.0014, 412.3613, 520.0006, 452.4015,
    512.7201, 658.8395, 392.5995, 443.5586, 640.1164, 333.8394, 466.9583,
    543.3969, 317.7198, 424.3209, 518.9617, 338.0014, 419.6412, 476.3200,
    386.3602, 423.2783, 503.3572, 354.6389, 497.3182, 588.5195, 654.5971,
    550.7274, 528.3770, 640.4813, 401.3204, 435.9990, 276.5606, 588.3488,
    664.1978, 444.8602, 462.8995, 377.7792, 553.1504, 810.8962, 1067.9541,
    1049.8788, 522.7012, 1424.8047, 517.9196, 830.9586, 925.5795, 1162.0024,
    383.4580, 621.1173, 621.1173, 621.1173, 548.6002, 745.2353, 837.8005,
    795.3402, 418.5976, 508.7974, 883.2780, 742.5276, 242.3202, 242.3202,
    266.0010, 408.4992, 614.7588, 385.3184, 515.6200, 1138.1620, 993.9630,
    299.1993, 750.3202, 572.0807, 907.3969, 811.5776, 427.7975, 649.9985,
    860.6002, 1143.4211, 2032.6792, 590.6183, 1570.3911, 483.4800, 600.4804,
    696.2021, 774.7962, 390.5984, 612.5619, 708.7622, 296.9192, 1071.4627,
    496.5976, 503.3974, 357.6411, 430.3376, 624.6990, 582.5413, 580.2215,
    543.8807, 588.6372, 627.9999, 712.1012, 968.3949, 482.5816, 593.1694,
    1033.5658, 693.6795, 693.6795, 761.2791, 361.3981, 628.4522, 771.4486,
    757.1187, 821.5970, 1022.3202, 679.4407, 538.7491, 679.9981, 977.0033,
    561.2015, 728.3997, 372.3186, 361.5210, 620.8006, 819.9964, 360.8780,
    395.7608, 442.0001, 404.0384, 670.7993, 297.5702, 353.4882, 383.9376,
    284.8008, 431.1000, 801.3518, 448.4513, 577.9111, 570.5210, 865.3205,
    444.5578, 680.4198, 576.2779, 708.4787, 734.2356, 433.0010, 327.4188,
    485.5198, 305.4390, 468.0008, 459.8177, 863.9199, 831.4407, 534.7610,
    392.0502, 934.9752, 813.3081, 263.7100, 769.0838, 630.5863, 645.9874,
    319.5584, 348.4518, 614.5068, 662.0096, 1504.3708, 406.2180, 692.1689,
    588.1371, 511.2609, 700.5600, 1301.1451, 879.0660, 912.8851, 1509.7812,
    484.0605, 399.6703, 444.1001, 248.8101, 527.8014, 500.6313, 436.8107,
    374.7990, 726.3921, 1827.2000, 523.4911, 334.9998, 473.2009, 581.2029,
    929.7540, 591.1974, 637.5483, 674.9509, 776.7589, 959.5170, 1250.9643,
    737.8201, 810.6772, 983.0009, 708.8968, 633.1200, 631.7982, 608.6419,
    300.9999, 377.9984, 397.0015, 588.5195, 681.7616, 807.3603, 696.8011,
    811.1962, 1305.7201, 442.0001, 353.6013, 468.0008, 526.7573, 890.2390,
    1318.8033, 331.0005, 416.4015, 596.8406, 429.0399, 619.6408, 400.7990,
    775.0209, 772.7611, 306.5191, 522.6019,
]

In the design matrix for the regression we include an intercept term by augmenting the income data set with a column of ones

In [27]:
income_X = [[1., incomei] for incomei in income]

Our quantiles of interest

In [28]:
tau = [0.10, 0.25, 0.5, 0.75, 0.9]

Compute the regression

In [29]:
from naginterfaces.library import correg
regn = correg.quantile_linreg_easy(income_X, expenditure, tau)

The regression coefficients are returned in attribute b of the function’s return tuple.

For the plot, compute the regression lines

In [30]:
import numpy as np
plot_x = np.linspace(0, max(income))
plot_ys = [regn.b[0, i] + regn.b[1, i]*plot_x for i in range(len(tau))]

Make a scatter plot of the original income data (without the intercept) and add in the regression lines

In [31]:
import matplotlib.pyplot as plt
plt.scatter(income, expenditure, c='red', marker='+', linewidth=0.5)
for tau_i, tau_val in enumerate(tau):
    plt.plot(
        plot_x, plot_ys[tau_i],
        label=r'$\tau$ = {:.2f}'.format(tau_val),
    )
plt.ylim((0., max(expenditure)))
plt.xlabel('Household Income')
plt.ylabel('Household Food Expenditure')
plt.legend(loc='lower right')
plt.title(
    'Quantile Regression\n'
    'Engel\'s 1857 Study of Household Expenditure on Food'
)
plt.show()

Quantile Regression

Kernel Density Estimation

Density estimation starts from a set of observations and produces a smooth nonparametric estimate of the unknown density function from which the observations were drawn. The nAG function for this computation is smooth.kerndens_gauss.
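The idea can be sketched directly in a few lines (plain NumPy; the nAG routine evaluates the kernel sum via an FFT, which this naive version does not):

```python
import numpy as np

def naive_kde(obs, t, window):
    """Direct Gaussian kernel density estimate at the points t."""
    obs = np.asarray(obs, dtype=float)
    u = (t[:, None] - obs[None, :]) / window  # scaled distances to data
    kernels = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return kernels.sum(axis=1) / (obs.size * window)

obs = np.array([-1.0, 0.0, 0.2, 1.0])
t = np.linspace(-5.0, 5.0, 1001)
dens = naive_kde(obs, t, window=0.5)

# A valid density estimate: non-negative, with total mass close to 1.
area = dens.sum() * (t[1] - t[0])
```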

Here is some trial data

In [32]:

x = [
    0.114, -0.232, -0.570, 1.853, -0.994, -0.374, -1.028, 0.509, 0.881, -0.453,
    0.588, -0.625, -1.622, -0.567, 0.421, -0.475, 0.054, 0.817, 1.015, 0.608,
    -1.353, -0.912, -1.136, 1.067, 0.121, -0.075, -0.745, 1.217, -1.058, -0.894,
    1.026, -0.967, -1.065, 0.513, 0.969, 0.582, -0.985, 0.097, 0.416, -0.514,
    0.898, -0.154, 0.617, -0.436, -1.212, -1.571, 0.210, -1.101, 1.018, -1.702,
    -2.230, -0.648, -0.350, 0.446, -2.667, 0.094, -0.380, -2.852, -0.888, -1.481,
    -0.359, -0.554, 1.531, 0.052, -1.715, 1.255, -0.540, 0.362, -0.654, -0.272,
    -1.810, 0.269, -1.918, 0.001, 1.240, -0.368, -0.647, -2.282, 0.498, 0.001,
    -3.059, -1.171, 0.566, 0.948, 0.925, 0.825, 0.130, 0.930, 0.523, 0.443,
    -0.649, 0.554, -2.823, 0.158, -1.180, 0.610, 0.877, 0.791, -0.078, 1.412,
]

We are going to repeat the density estimation using four different multipliers for the window width

In [33]:
window_ms = [2./2.**i for i in range(4)]

The nAG routine has an initialization mode for computing FFT data from the input data once only. The mode is triggered by passing an empty communication dict. On subsequent calls the FFT is then re-used

In [34]:
from naginterfaces.library import smooth
comm = {}
k_gs = [smooth.kerndens_gauss(x, comm, window=window_m) for window_m in window_ms]

From each return tuple the window attribute gives the window width that was actually used, t the points at which the density estimate was calculated, and smooth the values of the estimate

In [35]:
import matplotlib.pyplot as plt
for k_g in k_gs:
    plt.plot(
        k_g.t, k_g.smooth,
        linewidth=0.75, label='window = {:.4f}'.format(k_g.window),
    )
plt.xlabel(r'$\mathit{t}$')
plt.ylabel('Density Estimate')
plt.legend()
plt.title('Gaussian Kernel Density Estimation')
plt.show()

Kernel Density

Change Point Analysis

Change points are time points at which some feature of a data set changes. For detecting change points in a univariate time series we can use tsa.cp_pelt.
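Under the hood, change-in-mean detection trades off per-segment costs against a penalty per change point. For a Normal model with known unit variance, the cost of a segment is the sum of squared deviations from the segment mean; a toy sketch of that trade-off (plain NumPy, not the nAG implementation):

```python
import numpy as np

def segment_cost(y):
    """Cost of one segment in a change-in-mean model (unit variance):
    the sum of squared deviations from the segment mean."""
    y = np.asarray(y, dtype=float)
    return ((y - y.mean())**2).sum()

y = [0.1, -0.2, 0.0, 5.1, 4.9, 5.0]

# Splitting at the true change point is far cheaper than not splitting;
# PELT searches for the split set minimizing total cost plus a penalty.
cost_unsplit = segment_cost(y)
cost_split = segment_cost(y[:3]) + segment_cost(y[3:])
```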

Consider the following time series, assumed to be from a Normal distribution

In [36]:
y = [
    0., 0.78, -0.02, 0.17, 0.04, -1.23, 0.24, 1.7, 0.77, 0.06,
    0.67, 0.94, 1.99, 2.64, 2.26, 3.72, 3.14, 2.28, 3.78, 0.83,
    2.8, 1.66, 1.93, 2.71, 2.97, 3.04, 2.29, 3.71, 1.69, 2.76,
    1.96, 3.17, 1.04, 1.5, 1.12, 1.11, 1., 1.84, 1.78, 2.39,
    1.85, 0.62, 2.16, 0.78, 1.7, 0.63, 1.79, 1.21, 2.2, -1.34,
    0.04, -0.14, 2.78, 1.83, 0.98, 0.19, 0.57, -1.41, 2.05, 1.17,
    0.44, 2.32, 0.67, 0.73, 1.17, -0.34, 2.95, 1.08, 2.16, 2.27,
    -0.14, -0.24, 0.27, 1.71, -0.04, -1.03, -0.12, -0.67, 1.15,
    -1.1, -1.37, 0.59, 0.44, 0.63, -0.06, -0.62, 0.39, -2.63, -1.63,
    -0.42, -0.73, 0.85, 0.26, 0.48, -0.26, -1.77, -1.53, -1.39, 1.68, 0.43,
]

We wish to look for changes in the mean, which is selected using the following ctype value in the nAG routine

In [37]:
ctype = 1

We are also assuming unit variance for this problem, which must be communicated to the nAG function using the param keyword argument

In [38]:
param = [1.]

Find the change points

In [39]:
from naginterfaces.library import tsa
soln = tsa.cp_pelt(ctype, y, param=param)

The locations of the change points are in attribute tau of the return tuple, while the estimated distribution parameters are in attribute sparam.

Plot the time series, and also the change points as vertical lines with the corresponding means as horizontal lines

In [40]:
import matplotlib.pyplot as plt
plt.plot(range(1, len(y)+1), y, color='r', linewidth=0.75)
last_cp = 1.
for tau_i, tau_val in enumerate(soln.tau):
    vl = plt.axvline(tau_val, label='Change pt.')
    hl = plt.hlines(
        soln.sparam[2*tau_i],
        xmin=last_cp, xmax=tau_val, color='g', label='Mean',
    )
    last_cp = tau_val
plt.xlim((1, len(y)))
plt.xticks(range(10, len(y)+1, 10))
plt.xlabel('Time')
plt.ylabel('Value')
plt.title(
    'Simulated Time Series and\n'
    'Corresponding Changes in Mean')
plt.legend(handles=[vl, hl], loc='upper right')
plt.show()

simulated time series
