Baseball by the Numbers

A Brown Bag Presentation

(with apologies to Phil Birnbaum and the SABR Statistical Analysis Committee)


Baseball is a game that is immersed in numbers. So much of the way baseball is played today hinges upon interpreting statistical data. This presentation is designed for fans who want to get involved with those numbers. It will show you how to get, store, and manipulate baseball data for free (or close to free). Attendees are encouraged to bring their laptop computers and follow along with the presentation.

Presentation PowerPoint Slides (Office 2007 format), PDF

Rene Knott (KSDK Channel 5) interviewed me following this presentation. Here is a QuickTime video of this interview (111 MB file, approx. 20 min of content). Because this file is large, it may take some time to download.

SQL queries from the presentation

2007 Double Plays.sql <-- This is a corrected version of the query. It finds the correct number of double plays. Orlando Hudson (Arizona Diamondbacks) led the majors by grounding into 34 double plays [Ryan Zimmerman (Washington) had 32, and Albert Pujols (Saint Louis), Carlos Lee (Houston), and Mark Teahen (Kansas City) all had 30].

2007 NL Central Batting Averages.sql

Manager Wins.sql

Team Salaries.sql
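
The leaderboard described for 2007 Double Plays.sql could be reproduced along the following lines. This is a minimal sketch, not the query from the presentation. It assumes a Lahman-style Batting table with playerID, yearID, stint, and GIDP columns, uses Python's built-in sqlite3 module in place of MySQL or Access, and loads only the five totals quoted above, with full names standing in for the database's player codes. The function name gidp_leaders is made up for illustration.

```python
import sqlite3

# Hypothetical miniature of a Lahman-style Batting table: only the columns
# this sketch needs. Full names stand in for the real playerID codes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Batting ("
    "playerID TEXT, yearID INTEGER, stint INTEGER, GIDP INTEGER)"
)

# The 2007 totals quoted in the text above, entered as one stint per player.
rows = [
    ("Orlando Hudson", 2007, 1, 34),
    ("Ryan Zimmerman", 2007, 1, 32),
    ("Albert Pujols", 2007, 1, 30),
    ("Carlos Lee", 2007, 1, 30),
    ("Mark Teahen", 2007, 1, 30),
]
conn.executemany("INSERT INTO Batting VALUES (?, ?, ?, ?)", rows)

# SUM + GROUP BY combines a player's stints, so someone who changed teams
# during the season is counted once with a full-season total.
QUERY = """
SELECT playerID, SUM(GIDP) AS gidp
FROM Batting
WHERE yearID = ?
GROUP BY playerID
ORDER BY gidp DESC, playerID
LIMIT ?
"""


def gidp_leaders(connection, year, limit=5):
    """Return (playerID, total GIDP) rows for one season, highest first."""
    return connection.execute(QUERY, (year, limit)).fetchall()


for name, gidp in gidp_leaders(conn, 2007):
    print(f"{name}: {gidp}")
```

Grouping by player before sorting is the design choice that matters here: a Lahman-style table stores one row per player per stint, so sorting raw rows would understate the total for any player who was traded in mid-season.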

On-line Resources (data & research) - Noncommercial

Society for American Baseball Research (SABR) was formed in August 1971 in Cooperstown, New York. It now consists of more than 6,700 members worldwide, including many prominent writers, officials, and players. The purpose of SABR is to foster the research and dissemination of the history and record of baseball. SABR carries out that mission through programs that: 1) encourage the study of baseball, past and present, as a significant athletic and social institution; 2) encourage further research and literary efforts to establish the accurate historical record of baseball; and 3) help disseminate educational, historical, and research information about baseball.

Lahman Database - The Baseball Archive was launched in 1995, making it the oldest continually operating baseball site on the web. It began as an effort to collect baseball information in one place for personal use. As a freelance writer and baseball enthusiast, Sean Lahman found that having this sort of information at his fingertips was an invaluable resource. In his 1984 book, Bill James lamented that research was tremendously hindered by the fact that the game's statistics were so closely guarded by those who compiled them (namely the Elias Sports Bureau). This database is an attempt to make this data available to all baseball enthusiasts.

Sean Lahman's Baseball Archive (MS Access file)

Baseball-Databank (MySQL and other database format files)

Retrosheet - Retrosheet was founded in 1989 for the purpose of computerizing play-by-play accounts of as many pre-1984 major league games as possible. Retrosheet has been very successful in the collection of game accounts, with more than 100,000 currently in hand. A group of some 100 volunteers is actively involved in translating the accounts, entering them into the computer, and proofing the results. However, the task ahead is enormous and Retrosheet is always looking for more volunteers; any offers of help are greatly appreciated. Baseball fans interested in this historical effort are invited to volunteer their assistance in the translation and inputting efforts, as well as to make available copies of game accounts they might have.

The Baseball Index - Started in 1990, The Baseball Index project is the product of dozens of volunteers who have indexed over 200,000 baseball sources in order to facilitate baseball research. The Baseball Index is part of the Society for American Baseball Research's commitment to advance and support baseball research.

Fungoes - Sabermetric analysis with an emphasis on the Saint Louis Cardinals. Sponsored by the Bob Broeg chapter (Saint Louis) of SABR.

Business of Baseball - Downloadable data and documents covering business agreements, salary issues, and related topics.

Tango on Baseball - If Linear Weights, Run Expectancy, and Runs Created mean something to you, if you are a fan of Pete Palmer or Bill James, then this is the place for you. This is a repository for research and reports by Tom Tango.


Sabermetric Research Blogspot

The Official MLB Website

On-line Resources (data & research) - Commercial

The Baseball Cube - Stats for MLB, Minors, College, and the Draft.

Baseball Info Solutions

Baseball Prospectus - Analysis, statistics, and research articles, available for a subscription fee.

Baseball-Reference - Lots of data and analysis for a modest subscription fee.

On-line Resources (tools)

Open Office is a multiplatform and multilingual office suite and an open-source project. Compatible with all other major office suites, the product is free to download, use, and distribute.

MySQL - MySQL is the world's most popular open source database software, with over 100 million copies of its software downloaded or distributed throughout its history. With superior speed, reliability, and ease of use, MySQL has become the preferred choice of corporate IT Managers because it eliminates the major problems associated with downtime, maintenance, administration and support.

Perl - Perl is a high-level programming language with an eclectic heritage written by Larry Wall and a cast of thousands. It derives from the ubiquitous C programming language and to a lesser extent from sed, awk, the Unix shell, and at least a dozen other tools and languages. Perl's process, file, and text manipulation facilities make it particularly well-suited for tasks involving quick prototyping, system utilities, software tools, system management tasks, database access, graphical programming, networking, and world wide web programming. These strengths make it especially popular with system administrators and CGI script authors, but mathematicians, geneticists, journalists, and even managers also use Perl.

R - R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.

R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

Octave - GNU Octave is a high-level language, primarily intended for numerical computations. It provides a convenient command line interface for solving linear and nonlinear problems numerically, and for performing other numerical experiments using a language that is mostly compatible with Matlab. It may also be used as a batch-oriented language.

Sage - SAGE is free open source math software that supports research and teaching in algebra, geometry, number theory, cryptography, and related areas. Both the SAGE development model and the technology in SAGE itself are distinguished by an extremely strong emphasis on openness, community, cooperation, and collaboration: we are building the car, not reinventing the wheel. SAGE makes it easy for you to use most mathematics software together. SAGE includes interfaces to Magma, Maple, Mathematica, MATLAB, and MuPAD, and the free programs Axiom, GAP, GP/PARI, Macaulay2, Maxima, Octave, and Singular.



