Kentucky Derby

There is a horse race called the Kentucky Derby. People bet on the outcome of this race. Let’s do an analysis to see if we can get an edge over other bettors.

Plan

  • Get data
  • Analyse data

    Acquiring Data

    https://en.wikipedia.org/wiki/2020_Kentucky_Derby
    I decided to get the data from Wikipedia. After checking the 2020 and 2021 Kentucky Derby pages, I found that each page contains a results table with the data that I need. After some tinkering, I came up with data acquisition code that works.
    In[]:=
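    getData[year_]:=With[{data=Import["https://en.wikipedia.org/wiki/"<>year<>"_Kentucky_Derby","Data"]},
      (* find the header row of the results table: its first cell is "Finish" *)
      With[{position=First@Position[data,{"Finish"|"Finish ",___}]},
        (* drop the row index to get the whole table that contains that row *)
        Extract[data,Most[position]]]]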
    I decided to consider only data from 2015 to 2021. This is an arbitrary choice: I didn’t want to use data that is too old because I believe the older relationships might no longer be relevant, so I set the time window to the last 7 years.
    In[]:=
    years=Table[ToString@year,{year,2015,2021}]
    Out[]=
    {2015,2016,2017,2018,2019,2020,2021}
    In[]:=
    races={#,getData[#]}&/@years;
    In[]:=
    cleanHeader[element_]:=StringTrim@StringReplace[element,("[26]"|"[33]")->""];
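    The header cleaner above removes only the two footnote markers that happen to appear in these pages. A more general variant could strip any numeric footnote marker. This is a minimal sketch, not the code used in this analysis; the name cleanHeaderGeneral is my own and it assumes the headers are strings.
    In[]:=
    (* hypothetical generalisation: drop every marker of the form [digits], then trim *)
    cleanHeaderGeneral[element_String]:=StringTrim@StringReplace[element,"["~~(DigitCharacter..)~~"]"->""];
    In[]:=
    cleanHeaderGeneral["Finish [26]"]
    Markers with letters, such as [f], are left untouched by this pattern.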
    In[]:=
    addAssociations[headers_,body_,year_]:=AssociationThread[Append[headers[[;;6]],"Year"],Append[#[[;;6]],year]]&/@body;
    In[]:=
    data=Flatten[addAssociations[cleanHeader/@First@#[[2]],Rest@#[[2]],#[[1]]]&/@races,1];
    In[]:=
    cleanValues[value_]:=If[StringQ[value],StringTrim[value],value];
    In[]:=
    cleanData=Map[cleanValues,data[[All,{"Finish","Horse","Jockey","Trainer","Year"}]],{2}];
    In[]:=
    ds=Dataset[cleanData][Select[If[NumberQ[#Finish],True,!StringContainsQ[#Finish,"also"]]&]];
    In[]:=
    ds
    Out[]=
    Finish          Horse               Jockey                Trainer            Year
    1               American Pharoah    Victor Espinoza       Bob Baffert        2015
    2               Firing Line         Gary Stevens          Simon Callaghan    2015
    3               Dortmund            Martin Garcia         Bob Baffert        2015
    4               Frosted             Joel Rosario          Kiaran McLaughlin  2015
    5               Danzig Moon         Julien Leparoux       Mark E. Casse      2015
    6               Materiality         Javier Castellano     Todd Pletcher      2015
    7               Keen Ice            Kent Desormeaux       Dale Romans        2015
    8               Mubtaahij           Christophe Soumillon  Michael de Kock    2015
    9               Itsaknockout        Luis Saez             Todd Pletcher      2015
    10              Carpe Diem          John Velazquez        Todd Pletcher      2015
    11              Frammento           Corey Nakatani        Nick Zito          2015
    12              Bolo                Rafael Bejarano       Carla Gaines       2015
    13              Mr. Z               Ramon A. Vazquez      D. Wayne Lukas     2015
    14              Ocho Ocho Ocho      Elvis Trujillo        James M. Cassidy   2015
    15              Far Right           Mike E. Smith         Ron Moquett        2015
    16              War Story           Joe Talamo            Tom Amoss          2015
    17              Tencendur           Manuel Franco         George Weaver      2015
    18              Upstart             José Ortiz            Richard Violette   2015
    scratched [13]  International Star  Miguel Mena           Michael Maker      2015
    scratched [12]  El Kabeir           Calvin Borel          John Terranova     2015
    rows 1–20 of 143

    EDA (Exploratory Data Analysis)

    Jockey

    Let’s get some domain expertise before delving right into the data.
    Unfortunately, Wikipedia doesn’t have metrics for a jockey’s size, fitness level, communication skills or courage, or for whether or not a jockey lives at a stable.
    However, we can find a proxy for hard training: how often a jockey participates in a race. We can think of it as live training, the training that matters.
    In[]:=
    ds[GroupBy["Jockey"],#[["Finish"]]&/@#&]
    Out[]=
    Victor Espinoza       {1,19,9,12}
    Gary Stevens          {2,10,N/A}
    Martin Garcia         {3,n/a}
    Joel Rosario          {4,DNF,5,5,16}
    Julien Leparoux       {5,17,4,6,7}
    Javier Castellano     {6,6,7,3,12,13,9}
    Kent Desormeaux       {7,2,16,5}
    Christophe Soumillon  {8,DNF}
    Luis Saez             {9,7,15,19,17-DQ [f],3}
    John Velazquez        {10,14,1,N/A,9,2,1,DQ}
    Corey Nakatani        {11}
    Rafael Bejarano       {12,13}
    Ramon A. Vazquez      {13}
    Elvis Trujillo        {14}
    Mike E. Smith         {15,15,13,4}
    Joe Talamo            {16,14}
    Manuel Franco         {17,2}
    José Ortiz            {18,6,2,3}
    Miguel Mena           {scratched [13],9}
    Calvin Borel          {scratched [12]}
    rows 1–20 of 63
    From this, we can see that jockeys differ in their experience in this race: some participate more often than others. This might be because jockeys who perform well have a better chance of returning the next year, so there might be survivorship bias in the dataset. It might also indicate that the same jockeys consistently perform better than the others, so they keep returning.
    In[]:=
    Histogram@ds[GroupBy["Jockey"],Length[#[["Finish"]]&/@#]&]
    Out[]=
    In[]:=
    topJockeys=ds[GroupBy["Jockey"],Count[#[["Finish"]]&/@#,1|2|3]&][Select[#!=0&]]
    These are the jockeys that finished in the top 3 between 2015 and 2021.
    We can see that jockeys who have taken one of the leading places do seem to participate more often than those who haven’t. Another thing we notice is that jockeys who take top places have usually participated in the race more than once.
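    One way to put a number on this comparison is to average the number of starts separately for jockeys with and without a top-3 finish. This is a rough sketch that builds on the ds dataset from above; the names starts, placings and placed are mine, and scratched entries are counted as starts here, which is an assumption.
    In[]:=
    (* number of rows (starts) per jockey *)
    starts=Counts[#Jockey&/@Normal[ds]];
    (* number of top-3 finishes per jockey *)
    placings=GroupBy[Normal[ds],Key["Jockey"]->Key["Finish"],Count[#,1|2|3]&];
    placed=Keys@Select[placings,#>0&];
    (* mean starts: {jockeys with a top-3 finish, jockeys without one} *)
    N@{Mean@Values@KeyTake[starts,placed],Mean@Values@KeyDrop[starts,placed]}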
    Do jockeys win the first race they participate in?
    Only 20% of the jockeys that participated between 2015 and 2021 won their first race. This ignores the fact that they might have participated in the race before 2015.
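    A share like this can be computed by taking each jockey’s earliest row in ds and checking its finish. This is a sketch only: what counts as "won" is not spelled out above, so the places argument is my assumption, and both readings (1st place only, or top 3) are shown.
    In[]:=
    (* earliest row per jockey; a jockey rides at most once per year, so sorting by year is enough *)
    firstStarts=First/@GroupBy[SortBy[Normal[ds],#Year&],Key["Jockey"]];
    (* share of jockeys whose first start in the window matches the given places *)
    firstStartShare[places_]:=N@Mean[Boole[MatchQ[#Finish,places]]&/@Values[firstStarts]];
    In[]:=
    firstStartShare[1]
    In[]:=
    firstStartShare[1|2|3]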

    Domain knowledge

    In order to understand the importance of the relationships that we find, as well as what to pay attention to, I decided to find some articles that explain what role jockeys and trainers play.
    A jockey is booked to ride a horse by his agent. The booking requires the agreement of the owner and trainer of the racehorse. The jockey is not the sole decision-maker over which horse he rides. However, good riders are sought after and often can pick their horse. - https://horseracingsense.com/how-do-jockeys-choose-which-horses-they-ride/
    Jockeys ride the horses on race days and often follow the instructions issued by the horse’s trainer, but sometimes they use their own initiative. Winning a race reflects well on the jockey, while losing can provoke a search for riding errors. - https://www.racingpost.com/guide-to-racing/trainers-and-jockeys/
    He can’t do much with a lousy horse, but he can help a great horse win. The best jockeys know an animal’s strengths and weaknesses. Some horses prefer to hang back and break at the last minute, while others, known as speed horses, like to be out front the whole time. Some horses are comfortable running in close quarters and can pass along the rail on the left, while others need more space and pass on the right. A jockey takes these factors into account and adjusts his strategy accordingly. - https://slate.com/news-and-politics/2009/05/do-jockeys-matter-at-all-in-horse-racing.html

    Observations

    It seems that jockeys follow the "success to the successful" pattern. If you are already successful, then you will have more chances to be successful later. It’ll be easier because you’ll be able to choose a better horse and have more opportunities to succeed.

    Trainer

    A horse trainer or instructor works with horses to ready them for riders, races or shows. They typically are expected to analyze horses’ dispositions to anticipate any possible behavioral problems such as kicking, tossing or biting. Then, they train accordingly to prevent future behavioral problems. Additionally, trainers/instructors assist horses in adapting to gear, acclimating to riding on various terrains and performing various exercises. - https://agexplorer.ffa.org/career/horse-trainer-instructor
    In other words, the trainer is responsible in large part for the victory.
    Trainers can participate multiple times within the same race because they can have more than one horse participating.
    We can see that the majority of the trainers within our dataset participated only once, and there are some outliers that have participated multiple times.
    Number of 1st places per trainer.
    Number of 2nd places per trainer.
    Number of 3rd places per trainer.
    Number of times a trainer’s horse finished 1st, 2nd or 3rd.
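    Counts like these can be produced the same way as topJockeys above, grouping by trainer instead of jockey. This is a minimal sketch built on ds; the helper name trainerPlacings is my own.
    In[]:=
    (* number of entries per trainer, to see how many trainers appear only once *)
    Histogram[Values@Counts[#Trainer&/@Normal[ds]]]
    In[]:=
    (* number of finishes matching the given places per trainer, zeros dropped *)
    trainerPlacings[places_]:=Select[GroupBy[Normal[ds],Key["Trainer"]->Key["Finish"],Count[#,places]&],#>0&];
    In[]:=
    trainerPlacings/@{1,2,3}
    In[]:=
    trainerPlacings[1|2|3]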

    Let’s look at the relationship between trainer and jockey

    With the naked eye we can see that Todd Pletcher seems to have horses that every jockey wants.
    We can also see that all top jockeys have ridden horses trained by Todd Pletcher.
    Let’s find some communities:
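    A possible way to do this is to connect every jockey to every trainer whose horse they rode and let the built-in community detection group the graph. This is a sketch under my own assumptions (one undirected edge per jockey-trainer pair, default FindGraphCommunities settings), so the resulting clusters may differ from the ones described here.
    In[]:=
    (* one edge per distinct jockey-trainer pair *)
    edges=DeleteDuplicates[UndirectedEdge[#Jockey,#Trainer]&/@Normal[ds]];
    g=Graph[edges,VertexLabels->Placed["Name",Tooltip]];
    In[]:=
    communities=FindGraphCommunities[g];
    CommunityGraphPlot[g,communities]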
  • We can see that there are community clusters around the trainers. There are some trainers whose horses have been ridden by many jockeys. There are other trainers that appeared only once.
  • Interesting to see that Todd Pletcher and Bob Baffert are within the same community. Bob Baffert is suspended from the race until 2023, and in the previous race first place went to Brad Cox instead. - https://www.wdrb.com/derby_148/with-3-kentucky-derby-horses-trainer-brad-cox-wants-the-real-experience-of-winning-roses/article_72db583a-ca2a-11ec-a768-0b6b91c03f53.html
    This is the distribution of trainer popularity, measured as the number of jockeys that rode each trainer’s horses. We can see that there are a lot of unpopular trainers whose horses were ridden by only one jockey, and there are two trainers whose horses were ridden by 11 and 12 jockeys.
    Let’s see the names of these trainers.
    Let’s see the number of different trainers that each jockey has ridden for.
    Most of the jockeys ride for one trainer. However, there are some jockeys that have ridden for 4 or 5 trainers.
    Who are the jockeys that ride for different trainers?
    Let’s see if the jockeys that have won before ride for different trainers.
    In fact, it seems that our top jockeys have ridden horses from different trainers.
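    The counts in this section can be sketched with two groupings of ds, one in each direction. The variable names are mine, and the threshold of 4 trainers simply mirrors the "4 or 5 trainers" remark above.
    In[]:=
    (* number of distinct jockeys per trainer, and the two most popular trainers *)
    jockeysPerTrainer=GroupBy[Normal[ds],Key["Trainer"]->Key["Jockey"],Length@*Union];
    Take[Reverse@Sort[jockeysPerTrainer],2]
    In[]:=
    (* number of distinct trainers per jockey, and the jockeys with 4 or more *)
    trainersPerJockey=GroupBy[Normal[ds],Key["Jockey"]->Key["Trainer"],Length@*Union];
    Select[trainersPerJockey,#>=4&]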

    Wins per community

    We can see that the largest share of the wins falls into the cluster that has Bob Baffert and Todd Pletcher in it.
    However, it is not an outright majority: slightly more than half of the wins fall outside of this community, while almost half of all the wins that happened between 2015 and 2021 fall into this first cluster.
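    Wins per community can be counted by checking, for each community, how many winning trainers belong to it. This is a self-contained sketch on top of ds; it assumes a "win" means a Finish of 1 and uses default community detection, so the exact numbers may not match the ones quoted above.
    In[]:=
    (* jockey-trainer graph and its communities *)
    g=Graph[DeleteDuplicates[UndirectedEdge[#Jockey,#Trainer]&/@Normal[ds]]];
    communities=FindGraphCommunities[g];
    In[]:=
    (* trainer of each first-place horse, one entry per win *)
    winningTrainers=#Trainer&/@Select[Normal[ds],#Finish===1&];
    (* number of wins whose trainer belongs to each community *)
    Count[winningTrainers,Alternatives@@#]&/@communities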

    Summary

  • Successful jockeys are not successful right away (only considering 2015-2021)
  • Successful jockeys have more freedom in choosing the best horses (success to the successful)
  • There is a cluster of trainers whose horses account for almost 50% of the wins in 2015-2021.

    Participants in 2022

    Now, let’s apply our knowledge and see if we can make some informed predictions about the Kentucky Derby race that is going to take place on May 7th, 2022.

    Acquire data for 2022
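
    Before the race there are no results yet, so the page cannot be searched for a "Finish" column the way getData does. One option is a variant that looks for a table by the first cell of its header. This is a sketch only: the helper name getTable is mine, and the header "Post" is an assumption about how the 2022 page lays out its field table.
    In[]:=
    (* hypothetical helper: return the table whose header row starts with firstHeader *)
    getTable[year_,firstHeader_]:=With[{data=Import["https://en.wikipedia.org/wiki/"<>year<>"_Kentucky_Derby","Data"]},
      With[{positions=Position[data,{h_String/;StringTrim[h]===firstHeader,___}]},
        (* Missing if no such table exists, otherwise the first match *)
        If[positions==={},Missing["NotFound"],Extract[data,Most@First@positions]]]]
    In[]:=
    field2022=getTable["2022","Post"]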

    Let’s use our knowledge

    Let’s see if the trainers that accounted for almost 50% of the wins in past years are in the 2022 dataset.
    For some reason trainer names are spelled differently in the 2022 dataset, so we need to correct them first.
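    The exact spelling differences are not listed here, so the following is only one way such a correction could look: normalise names on both sides before comparing them. The function normalizeName is hypothetical; it removes diacritics, periods and single-letter initials, which is lossy (for example, "D. Wayne Lukas" becomes "Wayne Lukas") and is only meant for matching.
    In[]:=
    (* hypothetical normaliser: strip accents and periods, then drop one-letter tokens *)
    normalizeName[name_String]:=StringRiffle@DeleteCases[StringSplit@StringDelete[RemoveDiacritics[name],"."],s_/;StringLength[s]==1];
    In[]:=
    normalizeName["Mark E. Casse"]
    Applying the same function to the trainer names in both datasets makes them comparable without editing individual names by hand.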
    Let’s see which trainers from the winning cluster are participating.
    Let’s see if any jockeys that won previously are participating:
    So here we have two participants who, according to our analysis, might be able to win the race.

    Improvements

    I didn’t use ML algorithms to build a model that predicts the outcome of a race. If I had more time and interest, I would definitely look into how to train such a model. I didn’t do it right away because I don’t believe there is enough data on Wikipedia to make a meaningful prediction.