WILEY SERIES IN PROBABILITY AND STATISTICS

Established by Walter A. Shewhart and Samuel S. Wilks

Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Geof H. Givens, Harvey Goldstein, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay

Editors Emeriti: J. Stuart Hunter, Iain M. Johnstone, Joseph B. Kadane, Jozef L. Teugels

The Wiley Series in Probability and Statistics is well established and authoritative. It covers many topics of current research interest in both pure and applied statistics and probability theory. Written by leading statisticians and institutions, the titles span both state-of-the-art developments in the field and classical methods. Reflecting the wide range of current research in statistics, the series encompasses applied, methodological and theoretical statistics, ranging from applications and new techniques made possible by advances in computerized practice to rigorous treatment of theoretical approaches. This series provides essential and invaluable reading for all statisticians, whether in academia, industry, government, or research.

A complete list of titles in this series can be found at http://www.wiley.com/go/wsps

Machine Learning: a Concise Introduction

Steven W. Knox

Preface

The goal of statistical data analysis is to extract the maximum information from the data, and to present a product that is as accurate and as useful as possible.

David Scott, Multivariate Density Estimation: Theory, Practice and Visualization, 1992

My purpose in writing this book is to introduce the mathematically sophisticated reader to a large number of topics and techniques in the field variously known as machine learning, statistical learning, or predictive modeling. I believe that a deeper understanding of the subject as a whole will be obtained from reflection on an intuitive understanding of many techniques rather than a very detailed understanding of only one or two, and the book is structured accordingly. I have omitted many details while focusing on what I think shows “what is really going on.” For details, the reader will be directed to the relevant literature, or to the exercises, which form an integral part of the text.

No work this small on a subject this large can be self-contained. Some undergraduate-level calculus, linear algebra, and probability is assumed without reference, as are a few basic ideas from statistics. All of the techniques discussed here can, I hope, be implemented using this book and a mid-level programming language (such as C),1 and explicit implementation of many techniques using R is presented in the last chapter.

The reader may detect a coverage bias in favor of classification over regression. This is deliberate. The existing literature on the theory and practice of linear regression and many of its variants is so strong that it does not need any contribution from me. Classification, I believe, is not yet so well documented. In keeping with what has been important in my experience, loss functions are completely general and predictive modeling is stressed more than explanatory modeling.

The intended audience for these notes has an extremely diverse background in probability, ranging from one introductory undergraduate course to extensive graduate work and published research.2 In seeking a probability notation which will create the least confusion for all concerned, I arrived at the non-standard use of P(x) for both the probability of an event x and a probability mass or density function, with respect to some measure which is never stated, evaluated at a point x. My hope, which I believe has been borne out in practice, is that anyone with sufficient knowledge to find this notation confusing will have sufficient knowledge to work through that confusion.

Notes