Sports, Stats and Science Introduction and Cubs Opening Day

As my first post, I thought it best to introduce what I am trying to bring here that might separate me from the thousands of other sports blogs.  When I was a kid I was a baseball card collector and a stats junkie; as an adult I am still a stats junkie, and now a computer scientist who can use that data to do the things I want to do.  So I am going to combine my love of sports, stats, and coding here, and I hope it can do something for the reader, and maybe even land me a front office job with a baseball team or other sports team.

I see the Cubs this year and their stadium at the same stage of the building process.  The team is past the transition point of selling off its bad parts and is now watching shiny new rookies come to the plate and dazzle us.  The stadium is past the rooftop owners blocking its progress and is now getting shiny new toys of its own, like that massive scoreboard and those really loud speakers.

There are also going to be some ups and downs as both the team on the field and the field itself work out some kinks.  We saw what the young guys will look like at times, with Soler having a bad day in the field and being tricked into swinging at pitches outside the zone (something he apparently never did in Spring Training).  And at the stadium we saw, from the outside, a great-looking visual dedicated to Banks covering up the construction, and on the inside, long lines at the bathrooms and concession stands, which ran out of hot dog buns.

All in all it was a very exciting experience as I watched the game from my brother's place in Wrigleyville and heard the loudspeakers from half a mile away.  The excitement this year already far exceeds what I hoped to see.  I am not looking for a playoff team; I am simply looking for players to play well, and for enough improvement that we have a playoff team from 2016 to 2021 and multiple chances to win a World Series.  This team is far from complete, and so is the stadium, but both offer a glimpse of what may happen in the future, and that future could be amazing.

So what am I doing with the coding?

I am working with Big Data and Hadoop, an open-source framework based on MapReduce, a model Google published for parsing through lots of data really quickly.  The program I am working on is simple, just so I can get my feet wet and develop my ability to do things with data that can hopefully land me a front office gig.  My first program takes the overall batting average and ERA for both leagues combined for every season from 1871 to 2014.  The idea is to see whether batting average correlates with earned run average, how strongly it correlates, and eventually to do this with many batting categories.  Right now I have finished figuring out how to get the averages, so if you are a baseball fan who can't read code, you can start to zone out here and skip to the bottom.
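Once both jobs produce a per-year league average, the correlation itself can be computed outside Hadoop.  As a rough sketch of that step (plain Java; the sample numbers below are made up for illustration, not real league values), Pearson's r over paired yearly values looks like this:

```java
public class Correlation {
    // Pearson correlation between two equal-length series,
    // e.g. yearly league batting averages vs. yearly league ERAs.
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i];
            sumY2 += y[i] * y[i];
        }
        double num = n * sumXY - sumX * sumY;
        double den = Math.sqrt(n * sumX2 - sumX * sumX)
                   * Math.sqrt(n * sumY2 - sumY * sumY);
        return num / den;
    }

    public static void main(String[] args) {
        // Illustrative numbers only
        double[] avg = {0.255, 0.259, 0.266, 0.271, 0.268};
        double[] era = {3.90, 4.05, 4.20, 4.45, 4.30};
        System.out.printf("r = %.3f%n", pearson(avg, era));
    }
}
```

A value of r near 1 would mean league batting average and league ERA rise and fall together, which is the relationship the two jobs are meant to test.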

Here is the Mapper class, which reads from a comma-separated value file and passes each record to the Reducer.  Comments in the code explain what each line does.

package BaseballStats;

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

import com.opencsv.CSVReader;

public class StatMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private Text year = new Text();
    private Text info = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {
        // Convert the incoming Text line to a String
        String line = value.toString();

        // CSVReader object created to parse the line through a StringReader
        CSVReader reader = new CSVReader(new StringReader(line));
        String[] parsedLine = reader.readNext();

        // Set the year as the key and "atBats,hits" as one combined value
        // (column positions assume the Lahman Batting.csv layout; adjust for your file)
        year.set(parsedLine[1]);
        info.set(parsedLine[6] + "," + parsedLine[8]);

        output.collect(year, info);
    }
}
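To see what the map step does to a single record, here is a standalone sketch using String.split in place of opencsv.  The sample row is made up, and the column positions (yearID in column 1, at bats in column 6, hits in column 8) are assumptions about the file layout:

```java
public class MapDemo {
    // Mirror of the mapper logic: pull the year, at bats, and hits
    // out of one CSV row and pair them as key and value.
    public static String[] yearAndInfo(String line) {
        String[] cols = line.split(",");
        String year = cols[1];                  // assumed yearID column
        String info = cols[6] + "," + cols[8];  // assumed AB and H columns
        return new String[]{year, info};
    }

    public static void main(String[] args) {
        // Made-up row in the assumed layout
        String row = "doejo01,1994,1,CHN,NL,120,450,60,122";
        String[] kv = yearAndInfo(row);
        System.out.println(kv[0] + " -> " + kv[1]); // prints "1994 -> 450,122"
    }
}
```

Every mapper output for the same year lands in the same reduce call, which is what lets the reducer total a whole season.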

Here is the Reducer class, which takes the at bats, hits, and year and computes a league batting average for that year:

package BaseballStats;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class StatReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {

        float AB = 0;
        float H = 0;

        while (values.hasNext()) {
            // Take in each "atBats,hits" value and split it into two ints
            String[] v = values.next().toString().split(",");
            int atBats = 0;
            int hits = 0;
            try {
                if (Integer.parseInt(v[0]) != 0 && Integer.parseInt(v[1]) != 0) {
                    // Parse the array entries into ints
                    atBats = Integer.parseInt(v[0]);
                    hits = Integer.parseInt(v[1]);
                } else { continue; }
            } catch (ArrayIndexOutOfBoundsException | NumberFormatException e) { continue; }
            // Add this record's at bats and hits to the year's totals
            AB = AB + atBats;
            H = H + hits;
        }

        if (AB != 0 && H != 0) {
            // Division to create the batting average for this year
            float average = H / AB;
            output.collect(key, new Text(String.format("%.3f", average)));
        }
    }
}
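The heart of the reduce step is just summing and dividing.  A minimal plain-Java sketch of that arithmetic, with made-up player totals, looks like this:

```java
public class ReduceDemo {
    // Mirror of the reducer logic: total at bats and hits across
    // "atBats,hits" values, then divide to get a league batting average.
    public static float leagueAverage(String[] values) {
        float ab = 0, h = 0;
        for (String v : values) {
            String[] parts = v.split(",");
            ab += Integer.parseInt(parts[0]);
            h += Integer.parseInt(parts[1]);
        }
        return h / ab;
    }

    public static void main(String[] args) {
        // Three players' "atBats,hits" pairs, made up for illustration
        String[] values = {"400,100", "300,90", "300,60"};
        System.out.printf("%.3f%n", leagueAverage(values)); // prints 0.250
    }
}
```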

Here is the driver that starts the program:

package BaseballStats;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CountJob {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(CountJob.class);
        conf.setJobName("Batting Average");

        // Wire up the mapper and reducer and declare their output types
        conf.setMapperClass(StatMapper.class);
        conf.setReducerClass(StatReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // First argument is the input CSV, second is the output directory
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}


The output of this code looks like this (year, followed by the league batting average):

1988    0.255
1989    0.255
1990    0.259
1991    0.256
1992    0.256
1993    0.266
1994    0.271
1995    0.268
1996    0.271
1997    0.268
1998    0.267
1999    0.272
2000    0.271
2001    0.265
2002    0.262
2003    0.265
2004    0.267
2005    0.265
2006    0.270
2007    0.269
2008    0.265
2009    0.263
2010    0.258
2011    0.256
2012    0.255
2013    0.255
2014    0.252

I included these years because you can clearly see where the steroid era kicked in.

