A taste of data science with Auto Trader

We were delighted to welcome Auto Trader to the Alan Turing Building last Wednesday for an industry problem-solving event focused on data analysis.

If you’ve ever looked at buying a car online (or, like me, enjoy a bit of fantasy car shopping), you’ll doubtless have come across Auto Trader’s website; it’s the UK’s largest digital automotive marketplace, with around 48 million visits a month. Combine this site traffic with a plethora of search filters and you get a huge amount of data to deal with. What can maths tell us about users’ search habits and the cars they’re buying?

Cue Auto Trader data scientist Dr Peter Appleby, who presented us with two problems that reflect the mathematical challenges the company faces, each focusing on a different technique for data analysis. The first used regression on historic car prices and mileages to create a model for valuation. The second looked at applying clustering to understand the search space generated by users combining the different filters: namely, which combinations of filters (make, model, fuel, etc.) are more common than others?
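To give a flavour of the first problem, here is a minimal sketch of the sort of model you might start with: an ordinary least-squares fit of price against mileage, written in Python with scikit-learn. The numbers are entirely invented for illustration; this is not Auto Trader’s data or methodology.

# Toy valuation model: fit price as a linear function of mileage by
# ordinary least squares.  All numbers below are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
mileage = rng.uniform(5_000, 120_000, size=200)                  # miles
price = 18_000 - 0.09 * mileage + rng.normal(0, 800, size=200)   # pounds

model = LinearRegression().fit(mileage.reshape(-1, 1), price)
print(f"estimated depreciation: {-model.coef_[0]:.3f} pounds per mile")
print(f"predicted price at 60,000 miles: {model.predict([[60_000]])[0]:.0f} pounds")

In practice you would of course want more features than mileage alone (age, make, model, condition and so on), but the principle is the same.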

After a quick crash course (pun intended) on regression and clustering, we headed over to the computer cluster to wrangle with some real-life datasets. Of course, there’s only so much you can do in an hour, but there was plenty of discussion and coding going on as people tried to extract useful insights from the data. The clustering problem proved the most popular; the conclusion from the session seemed to be that users who search for Audis also tend to look for BMWs and other pricier makes. Not the most groundbreaking conclusion, you might say, but one can imagine extending this method to a complex multidimensional search space, where much more interesting patterns might start to emerge.

An example of k-means clustering applied to a dataset, using two cluster centres. Source: Wikimedia Commons
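If you fancy recreating something like this at home, a minimal k-means example with scikit-learn might look like the sketch below. The two numeric features (a budget filter and a mileage filter) and the data are invented stand-ins; Auto Trader’s real filters (make, model, fuel and so on) are categorical and would first need encoding, for example one-hot.

# Toy k-means example: cluster invented "searches" described by two numeric
# features (maximum budget and maximum mileage) into two groups.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
budget_hunters = rng.normal([8_000, 80_000], [1_500, 10_000], size=(100, 2))
premium_browsers = rng.normal([30_000, 20_000], [4_000, 5_000], size=(100, 2))
searches = np.vstack([budget_hunters, premium_browsers])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(searches)
print(kmeans.cluster_centers_)       # one centre per group of similar searches
print(kmeans.labels_[:5])            # cluster assignments of the first few searches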

In all, it was a fun day for both those familiar with these techniques and those who were trying out data analysis for the first time. As one undergraduate attendee told me over coffee at the close of the event, it didn’t matter that he wasn’t familiar with the techniques beforehand: “it’s fun just to come along and have a go, mess around with different models – you’re learning something new.” For the more experienced, it was “motivating to get to work with a real dataset, rather than making up your own.”

For my part, I wouldn’t necessarily have realised the extent to which maths is used behind the scenes at a company like Auto Trader – it’s really interesting to see how maths and data-driven methods are becoming increasingly important tools for companies in the digital age. I would like to offer my personal thanks to Dr Appleby and Auto Trader for their support for this event, and I look forward to organising the next industry problem-solving event!


math.git version control seminar

Many thanks to Weijian Zhang for his excellent introduction to version control with Git. The room was almost packed out – whether this was entirely owing to interest in version control, or the pile of chocolate biscuits on offer, I couldn’t possibly say – but there was a great informal, interactive atmosphere in the seminar. Most of the room (myself included) were trying out Git for the first time on their laptops, with Weijian and the more experienced users in the room fielding questions.

Created by Linux godfather Linus Torvalds (apparently over the course of a weekend in response to the loss of access to BitKeeper), Git is now one of the most widely used systems for software version management, as well as being used for countless personal coding projects. Definitely a useful tool for the modern mathematician to have, if not the most glamorous. Speaking to attendees afterwards, however, I found lots of people enthusiastic to go away and start introducing Git to their workflow – we might even run a follow-up seminar! Who knew there was such enthusiasm for version control in the Alan Turing Building?

In the meantime, if you’re looking to learn more about Git, head to GitHub, which has full documentation and example projects to help familiarise you with the tool. Check out Bitbucket too as an alternative place to set up a remote repository for your projects. The University of Manchester also offers training on Git to staff and PhD students as part of its Research IT support programme; details here (University of Manchester login required).

We’ll be announcing the next math.seminar soon, on creating vector graphics with Inkscape, so watch this space!

 

 

New(ish) math.seminar series!

Following on from a successful workshop on Julia (see Chapter Secretary Matthew Gwynne’s blog post for SIAM UKIE), we’re organising a new series of seminars to promote the kinds of computer skills which are useful for any mathematician, but don’t necessarily involve actual maths!

Have you been meaning to set up a personal website to promote your research? Looking for something better than Microsoft Paint to make your diagrams? Need to brush up your LaTeX? Then the “maths dot” series is for you!

Each seminar is aimed at beginners – no previous experience necessary! – and is accompanied by tea, coffee and biscuits. It’s best if you can bring your laptop with you, but feel free to come along and watch if not.

We’ve already had our first successful seminar, math.html, on the basics of HTML and setting up a personal webpage, given by our very own Webmaster Jonathan Deakin (blog post to follow!).

Upcoming math.seminars:

 

math.git
4 – 5 pm, Thursday 16th February 2017
Frank Adams 1, Alan Turing Building
Weijian Zhang

If you’ve ever found yourself using filenames like “project_final_draft_3_March12_DEFINITELYFINAL_v2”, then you need Git! Learn to keep track of your changes to your thesis, paper, source code etc. by using Git for version control.

math.inkscape
Date & time TBC
Georgia Lynott

Want to fill your papers and presentations with beautiful, non-fuzzy graphics? Inkscape is a free vector graphics program with lots of excellent features which will allow you to make high-quality diagrams and graphics.

Have you got an idea for another math.seminar? If you’re interested in sharing your computing expertise, please do contact the committee!

 

 

On functional data analysis

Functional data analysis (FDA) is a collection of methods for datasets whose elements are curves or continuous functions, or at least observations of such curves at discrete points. It covers many topics in statistics, such as supervised and unsupervised classification, factor analysis, inference and regression. FDA is particularly interesting because its methods can incorporate information on the rates of change, or derivatives, of the curves, which is extremely useful when modelling and analysing results from physical phenomena. Hence, these methods have recently grown in popularity across a large number of fields, including bioscience, systems engineering and meteorology; the two main references are the monographs by Ramsay and Silverman [1] and Ferraty and Vieu [2].
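To make the “curves and their derivatives” idea concrete, here is a small Python sketch, using only NumPy and SciPy and synthetic data, which smooths noisy observations of a curve with a spline and then evaluates the estimated first derivative – one of the basic building blocks of FDA.

# Smooth noisy discrete observations of a curve with a spline, then evaluate
# the estimated first derivative.  The data are synthetic.
import numpy as np
from scipy.interpolate import UnivariateSpline

t = np.linspace(0, 1, 101)                                        # observation points
y = np.sin(2 * np.pi * t) + 0.05 * np.random.default_rng(2).normal(size=t.size)

curve = UnivariateSpline(t, y, k=4, s=0.05)                       # smoothing spline fit
velocity = curve.derivative(1)                                    # spline of the first derivative

print(velocity(0.25))   # the true derivative 2*pi*cos(2*pi*0.25) is 0, so this should be small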

Datasets in the functional data literature

In the FDA literature there are many publicly available datasets; the Canadian weather, poblenou and tecator datasets are among the most popular. These functional datasets stem from real phenomena and have been used extensively to illustrate the nonparametric methods for functional data developed by Ferraty and Vieu [3]. For example, the Canadian weather dataset consists of the daily temperature and precipitation at 35 different locations in Canada, whereas the poblenou dataset contains the NOx levels measured every hour by a control station in Barcelona. The tecator dataset, which appeared in the paper by Borggaard and Thodberg [4], consists of measurements of 215 finely chopped meat samples with different moisture, fat and protein contents. For each sample we observe one spectrometric curve, corresponding to the absorbance measured at 100 wavelengths. The samples are split into two classes, with small (<20%) and large fat content, as determined by analytical chemical processing. The spectrometric curves are shown in Figure 1.

Figure 1: Spectrometric curves for the tecator dataset.
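The tecator data itself is publicly available, as noted above; to keep things self-contained, the sketch below instead generates synthetic stand-in “spectrometric” curves and classifies small versus large fat content by treating each discretised curve as a plain feature vector. This is a crude baseline rather than a genuinely functional method, but it shows the shape of the task.

# Crude baseline for a tecator-style task: treat each discretised curve
# (100 wavelengths) as a feature vector and classify small vs. large fat
# content with logistic regression.  The curves here are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
wavelengths = np.linspace(850, 1050, 100)
fat = rng.uniform(5, 50, size=215)                       # fat content in per cent
curves = (2.5 + 0.01 * fat[:, None]                      # fat shifts the whole curve
          + 0.2 * np.sin(wavelengths / 30)[None, :]      # common spectral shape
          + 0.02 * rng.normal(size=(215, 100)))          # measurement noise
labels = (fat >= 20).astype(int)                         # 0: small fat, 1: large fat

classifier = LogisticRegression(max_iter=1000)
print(cross_val_score(classifier, curves, labels, cv=5).mean())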

With the huge advances in technology and the constant production of data, ever more sophisticated data structures will appear in the future. Developing statistical methods for functional data therefore prepares us for these new kinds of datasets, as well as for the technologies that will produce them.

Challenges for the future

There are many open problems in the field of functional data: for instance, Bayesian approaches, spatial functional statistics and differential equation models, as suggested by Ramsay, Hooker and Graves [5]. There is also a need for suitable mathematical models to explore nonlinear structures in high-dimensional spaces, and the challenge remains of determining the dimension of a principal component representation in infinite-dimensional spaces in order to define a density for functional data. Given the large class of problems to which FDA can already be applied, advances in this area are likely to have a high impact on applied sciences such as environmetrics, chemometrics and econometrics, among many others.

 

References

[1] Ramsay, J. O. and Silverman, B. W. (2006). Functional Data Analysis, 2nd ed. Springer, New York.

[2] Ferraty, F.  and  Vieu, P.  (2006). Nonparametric Functional Data Analysis: Theory and Practice. Springer Series in Statistics, Springer-Verlag, New York.

[3] Ferraty, F. and Vieu, P. (2003). Curves Discrimination: A Nonparametric Functional Approach. Computational Statistics and Data Analysis, 44, 161-173.

[4] Borggaard, C. and Thodberg, H.H. (1992).  Optimal minimal neural interpretation of spectra. Anal. Chem. 64, 545-551.

[5] Ramsay, J. O., Hooker, G. and Graves, S. (2009). Functional Data Analysis with R and MATLAB. Springer, New York.

Author: Diego Andres Perez Ruiz, diego.perezruiz@manchester.ac.uk

 

Bayesian Modelling and Markov chain Monte Carlo

Statisticians and applied mathematicians are interested in building models of real-world processes from experimental data. Quite often the experiments that produce this data are both expensive and time-consuming. Fortunately, the Bayesian method [4] provides an intuitive way to fill the gaps left by small or incomplete datasets. One example where the Bayesian method might be useful is in the design of car engines.


Example

Suppose a car engine manufacturer has designed a new engine and would like to know the lifetime of an existing drive belt design when used on this new engine. To determine the lifetime of the belt, a car must be run 24/7 for many weeks until the belt gives up: a highly time-consuming experiment. Suppose the company is willing to perform at most five repeats, to reduce the delay in production and to limit expenses. The resulting sample is prohibitively small, and we cannot expect to reach accurate, informative conclusions using a classical approach.

Assume the company already has some prior knowledge of the lifetime of the belt. The Bayesian method makes use of this “prior information” and combines it with the experimental results to create a Bayesian predictive distribution.


Figure 1: Probability distributions predicting the lifetime of a drive belt in a new car engine.

We can see in Figure 1 that the Bayesian predictive distribution is somewhere between the prior and the classical estimate. By balancing the classical estimate with the prior knowledge we have obtained a more peaked distribution. This distribution has a smaller variance, suggesting we can have more confidence in the prediction of the expected lifetime of the drive belt.


To calculate the Bayesian predictive distribution \pi(x|D), given some data D, we simply multiply the density function of the classical solution, \ell(D|x), by the density function representing our prior knowledge, \pi(x). This is a direct application of Bayes’ theorem [6]. Unfortunately, this product will not integrate to one, which is a necessary condition for a probability density function. To overcome this, we divide the product by a constant Z, which rescales the density so that it does integrate to one. The resulting Bayesian distribution, defined over the n-dimensional parameter space S \subset \mathbb{R}^n, is

\pi(x|D) = \frac{1}{Z}\pi(x)\ell(D|x), \quad Z = \int_{S} \! \pi(x)\ell(D|x) \, \text{d}x.
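As a concrete one-dimensional illustration, here is the normalising constant computed by quadrature in Python. The Gaussian prior and likelihood are arbitrary, made-up choices standing in for the drive-belt example, not real data.

# Normalising constant Z for a one-dimensional Bayesian posterior, computed
# by quadrature.  The prior and likelihood below are made-up Gaussians.
from scipy.integrate import quad
from scipy.stats import norm

prior = norm(loc=100, scale=20).pdf        # prior belief about the belt lifetime (weeks)
likelihood = norm(loc=120, scale=15).pdf   # plays the role of ell(D|x)

Z, _ = quad(lambda x: prior(x) * likelihood(x), 0, 400)          # generous finite range
check, _ = quad(lambda x: prior(x) * likelihood(x) / Z, 0, 400)  # rescaled density

print(Z)
print(check)   # should be very close to 1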

In one dimension it is easy to calculate Z using numerical quadrature, as the sketch above shows. However, as the dimension of the parameter space grows (some applications involve as many as 10^9 variables), this approach quickly becomes impractical. Here we turn to a class of statistical algorithms known as Markov chain Monte Carlo (MCMC) methods [1], which can tackle these high-dimensional parameter spaces.

Monte Carlo methods are algorithms which produce samples from probability distributions; the samples can then be used to estimate statistics such as the mean and variance. MCMC is a class of Monte Carlo algorithms which cleverly choose where to sample using Markov chains, in order to achieve faster convergence [3,5].
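For readers meeting MCMC for the first time, here is a minimal random-walk Metropolis sketch in Python, targeting the same made-up one-dimensional posterior as in the quadrature sketch above. Note that it only ever evaluates the unnormalised product \pi(x)\ell(D|x), so the constant Z is never needed.

# Minimal random-walk Metropolis sampler.  It only evaluates the unnormalised
# posterior, so the normalising constant Z never appears.
import numpy as np
from scipy.stats import norm

prior = norm(100, 20)          # same made-up choices as in the quadrature sketch
likelihood = norm(120, 15)

def unnormalised_posterior(x):
    return prior.pdf(x) * likelihood.pdf(x)

rng = np.random.default_rng(4)
x = 110.0                                  # arbitrary starting point
density = unnormalised_posterior(x)
samples = []
for _ in range(20_000):
    proposal = x + rng.normal(0, 10)       # symmetric random-walk proposal
    proposal_density = unnormalised_posterior(proposal)
    if rng.uniform() < proposal_density / density:   # Metropolis accept/reject step
        x, density = proposal, proposal_density
    samples.append(x)

chain = np.array(samples[5_000:])          # discard burn-in
print(chain.mean(), chain.std())           # estimates of the posterior mean and spread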

Such methods are notoriously computationally intensive. It is surprising therefore that MCMC is thought to be one of the most widely used classes of algorithms in computational science, with applications common in computational physics, statistics, biology, computer science and even linguistics. An interesting discussion on further applications of MCMC is available at [2].


[1] S. Brooks, A. Gelman, G. Jones, and X. Meng. Handbook of Markov Chain Monte Carlo. CRC press, 2011.

[2] P. Diaconis. The Markov chain Monte Carlo Revolution. Bulletin of the American Mathematical Society, 2009.

[3] W. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 1970.

[4] P. Lee. Bayesian statistics: an introduction. John Wiley & Sons, 2012.

[5] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 1953.

[6] Wikipedia. Bayes’ theorem | Wikipedia, The Free Encyclopedia, 2016. [Online; accessed 14-April-2016].

Author: Paul Russell, paul.russell@manchester.ac.uk

Public Key Cryptography: Merkle’s Puzzles

Up until the invention of public key cryptography in the 1970s, symmetric ciphers were the only encryption schemes available. A symmetric cipher refers to a cryptographic model in which the communicating parties share a secret key to ensure the privacy of their communications. Depending on the key, a symmetric encryption algorithm performs various permutations and substitutions to transform a given message into ciphertext. To recover the original message, the decryption algorithm performs the same operations in reverse, using the secret key. However, the key first had to be sent through a secure channel, e.g. a trusted courier, registered mail, or, best of all, agreed verbally in advance. There did not seem to be any other way to exchange information securely.

In 1974 Ralph Merkle, a computer science student at UC Berkeley, attended a course in computer security. As part of the course he submitted a proposal addressing the problem of distributing a secret key [1]. The proposal was rejected several times, which led Merkle to drop the course and continue working on his idea independently [2]. Assuming that only an insecure channel is available for transmitting information, Merkle suggested that communications be based on a set of ‘puzzles’, the solution to each of which contains a unique puzzle ID and a unique puzzle key. If user B wishes to set up a connection with user A, they require A to generate a large number, say N = 1,000,000, of puzzles and transmit them to B. After picking one of the puzzles at random and solving it, B transmits only the puzzle ID to A and keeps the puzzle key private; this key can then be used for subsequent message encryption with a symmetric cipher. Since the puzzle ID is sent through an insecure channel it is available to an adversary; however, only A can match the puzzle ID to the puzzle key. The only way for the adversary to obtain the key is to solve puzzles until a matching ID is found, which requires solving \frac{1}{2}N puzzles on average.

Figure: First, A transmits a large number of puzzles \text{X}_1, \text{X}_2, \ldots, \text{X}_N to B. Next, B randomly selects and solves one puzzle \text{X}_j in order to obtain an (\text{ID}_j,\text{key}_j) pair. Finally, B sends only the puzzle ID \text{ID}_j back to A.

A puzzle could be a symmetric encryption of the puzzle key and ID, solvable by brute force, i.e. by trying all possible keys. If one puzzle takes 2 minutes to solve, then with 1,000,000 puzzles the adversary would need, on average, around two years of continuous work to obtain the secret key. Merkle’s proposal was a new paradigm, of which the puzzles were just an example. Indeed, soon after the proposal, cryptographic systems based on hard mathematical problems such as the discrete logarithm [3] and integer factorisation [4] were developed. The model is now known as public key cryptography, and it revolutionised the field by eliminating the need for a secure channel to transmit the key.
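As a toy illustration of the idea (emphatically not a secure implementation, and simplified compared with Merkle’s original construction), the Python sketch below builds each puzzle by encrypting an ID and a session key under a deliberately weak two-byte puzzle key, so that any single puzzle can be brute-forced almost instantly.

# Toy Merkle puzzles.  Each puzzle encrypts (ID, session key) under a weak
# 2-byte key, so solving one puzzle by brute force is cheap, while solving
# half of a large batch is expensive.  Illustration only, not secure.
import hashlib
import os
import secrets

def xor_cipher(data, key):
    # keystream derived from the key; applying the cipher twice decrypts
    stream = hashlib.sha256(key).digest()
    stream = (stream * (len(data) // len(stream) + 1))[:len(data)]
    return bytes(a ^ b for a, b in zip(data, stream))

def make_puzzles(n):
    """A builds n puzzles and remembers which session key belongs to which ID."""
    table, puzzles = {}, []
    for puzzle_id in range(n):
        session_key = secrets.token_bytes(16)
        table[puzzle_id] = session_key
        weak_key = os.urandom(2)                                  # only 2**16 possibilities
        plaintext = b"PUZZLE" + puzzle_id.to_bytes(4, "big") + session_key
        puzzles.append(xor_cipher(plaintext, weak_key))
    return table, puzzles

def solve_puzzle(puzzle):
    """B brute-forces the weak key; success is recognised by the 'PUZZLE' marker."""
    for guess in range(2**16):
        plaintext = xor_cipher(puzzle, guess.to_bytes(2, "big"))
        if plaintext.startswith(b"PUZZLE"):
            return int.from_bytes(plaintext[6:10], "big"), plaintext[10:]
    raise ValueError("no key found")

table, puzzles = make_puzzles(1000)                               # A -> B: the whole batch
chosen_id, session_key = solve_puzzle(secrets.choice(puzzles))    # B solves one at random
assert table[chosen_id] == session_key                            # B -> A: only the ID is sent back

In a more faithful version the puzzles would be shuffled before transmission, so that their order reveals nothing about the IDs, and the difficulty of each individual puzzle would be tuned to set the adversary’s workload.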

References

[1] Merkle, R. C. Secure communications over insecure channels. Communications of the ACM 21.4 (1978): 294-299.

[2] Diffie, W. The first ten years of public-key cryptography. Proceedings of the IEEE 76.5 (1988): 560-577.

[3] Diffie, W. and Hellman, M. E. New directions in cryptography. Information Theory, IEEE Transactions on 22.6 (1976): 644-654.

[4] Rivest, R. L., Shamir, A. and Adleman, L. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM 21.2 (1978): 120-126.

Author: Mante Zemaityte, mante.zemaityte@manchester.ac.uk

On the Leja method…

…or how we replaced the heuristics of a code by a rigorous backward error analysis.
We all know the procedure of developing a numerical method/algorithm:

  1. get it running,
  2. perform experiments,
  3. optimize the algorithm/code.

The emphasis is always on step 1. Unfortunately, step 3 is often not pursued because the PhD student left, the focus shifted, the funding ended, and so on. In 2009 there existed a code for computing the action of the matrix exponential based on the Leja method, see [3], [4], [5]. However, the code relied on some heuristics and had room for optimization.

The main idea of the Leja method is to compute \exp(A)v without explicitly forming \exp(A) for possibly large (sparse) matrices. The action of the matrix exponential has various applications, e.g. in evolution equations (exponential integrators), network analysis, and many other areas. For the computation, a sequence of Leja points in an interval is used (see Figure 1 for an example), as this allows an iterative selection of the degree of interpolation to reduce the computational cost.

Figure 1: 15 Leja points in the interval [-2,2].
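For the curious, Leja points can be generated greedily: after fixing a starting point, each new point maximises the product of its distances to the points already chosen. A rough Python sketch, which approximates the maximisation over a fine grid of candidate points, might look like this.

# Greedy computation of (approximate) Leja points on an interval: each new
# point maximises the product of distances to the points chosen so far,
# with the maximisation carried out over a fine grid of candidates.
import numpy as np

def leja_points(n, a=-2.0, b=2.0, n_candidates=10_000):
    candidates = np.linspace(a, b, n_candidates)
    points = [b]                                   # a common convention: start at an endpoint
    for _ in range(n - 1):
        # product of |x - p| over all chosen points p, for every candidate x
        products = np.prod(np.abs(candidates[:, None] - np.array(points)), axis=1)
        points.append(candidates[np.argmax(products)])
    return np.array(points)

print(leja_points(15))   # a set of 15 Leja points on [-2, 2], as in Figure 1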

In order to allow for matrices with a large norm, one takes advantage of the functional equation of the exponential, \exp(A)v = (\exp(A/s))^s v, to introduce scaling steps. In the end the interpolation essentially depends on three parameters: the number of scaling steps, the location/sequence of the interpolation points, and the degree of interpolation. The remaining question is: how do you select these parameters for a specific input?

The code from 2009 included the following heuristics:

% three magic numbers...
maxm = 150; minm = 30;
m = max (2.6 * ceil (gamma), minm);   % gamma describes an ellipse enclosing the field of values
nsteps = ceil (m / maxm);             % number of scaling steps

These magic numbers determined all of the parameters necessary to perform the interpolation. The gamma in the formula describes an ellipse that encloses the field of values of the input matrix, and nsteps corresponds to the number of scaling steps. The first attempt to get rid of the heuristics replaced the above selection procedure, see [1]. The selection of the parameters was then based on the interpolation behaviour of a 2-by-2 matrix whose entries reflect bounds on the extrema of the field of values of A. Nevertheless, some heuristics were still included.

In the preprint [2], the heuristics were finally eliminated. Yet again, the code was redesigned, this time based on a rigorous backward error analysis. The main idea is to interpret the result of the interpolation as the exact solution of a perturbed problem. The backward error analysis gives rise to a selection procedure for the essential parameters that can minimize the cost of the algorithm.

As a result, the code is simpler and no longer has any restrictions on the input matrix. Nevertheless, there is still room for optimization: numerical experiments show that the method can suffer performance losses for nonnormal matrices. We are currently working to fix this, but that is a story for another blog entry.

The code is available at the Numerical Analysis Innsbruck project page.

References

[1] Caliari, M., Kandolf, P., Ostermann, A., Rainer, S., 2014. Comparison of methods for computing the action of the matrix exponential. BIT Numer. Math. 52 (1), 113-128.

[2] Caliari, M., Kandolf, P., Ostermann, A., Rainer, S., 2015. The Leja method revisited: backward error analysis for the matrix exponential. Preprint, arXiv:1506.08665.

[3] Caliari, M., Vianello, M., Bergamaschi, L., 2004. Interpolating discrete advection-diffusion propagators at Leja sequences. J. Comput. Appl. Math. 172 (1), 79-99.

[4] Huisinga, W., Pesce, L., Kosloff, R., Saalfrank, P., 1999. Faber and Newton polynomial integrators for open-system density matrix propagation. J. Chem. Phy., 110, 5538-5547.

[5] Reichel, L., 1990. Newton interpolation at Leja points. BIT Numer. Math. 30 (2), 332-346.

Author: Peter Kandolf, peter.kandolf@manchester.ac.uk