Who Wrote That Book?
In this module, students will build a classifier that predicts the author of a text based on some training data.
Appropriate for ages 12+.
Allow 60 minutes to complete the module.
Important note: This module should be led by an instructor with basic Wolfram Language knowledge. If you would like to learn the language, please try this free online introduction. If you would like a Computational Thinking Initiative ambassador or volunteer to help you run an adventure, please contact us.
Learning Objective
◼ Students will learn about machine learning capabilities by using the Classify function to detect the author of a text.
Computational Thinking Principles and Practices
◼ Interpreting a problem or idea in a way that a computer can assist with it
◼ Exploring entire categories all at once (e.g. looking at all flags, all of Shakespeare’s sonnets, etc.)
◼ Simulating things that are hard or impossible to do by performing real-world experiments
Standards Alignment
◼ AP Computer Science Principles:
    ◼ LO 3.1.1: Find patterns and test hypotheses about digitally processed information to gain insight and knowledge
Helpful Background
◼ Wolfram Code Gallery: https://www.wolfram.com/language/gallery/determine-the-author-of-a-text
◼ Wolfram Code Gallery: https://www.wolfram.com/language/gallery/classify-images-as-day-or-night
◼ Wolfram Code Gallery: https://www.wolfram.com/language/gallery/recognize-handwritten-digits
STARTING POINT
“Today we’re going to build a classifier that can detect the author of a piece of text. For now, let’s find and import some text from https://datarepository.wolframcloud.com/type/Text.”
Students should choose authors who have more than one text available.
In[1]:=
hamlet=ResourceData["Hamlet"];
sonnets=ResourceData["Shakespeare's Sonnets"];
caesar=ResourceData["Friends, Romans, Countrymen"];
pride=ResourceData["Pride and Prejudice"];
persuasion=ResourceData["Persuasion"];
mansfield=ResourceData["Mansfield Park"];
sawyer=ResourceData["The Adventures of Tom Sawyer"];
finn=ResourceData["The Adventures of Huckleberry Finn"];
yankee=ResourceData["A Connecticut Yankee in King Arthur's Court"];
Encourage students to assign their texts to variables for later use and to end each line with a semicolon, which hides the output and keeps their notebooks short and neat.
◼ What authors did you choose?
◼ Your favorite author isn’t listed in the repository yet? How could you get text from another resource? You can even submit new resources so that everyone can use them!
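One possible way to answer the second question is sketched below. It assumes the built-in ExampleData text collection is an acceptable alternative source; the variable names and the file path in the comment are hypothetical placeholders, not part of the module.

```wolfram
(* A sample text that ships with the Wolfram Language *)
alice = ExampleData[{"Text", "AliceInWonderland"}];

(* Quick sanity check that a string was loaded *)
StringLength[alice]

(* To load a plain-text file saved on your own computer, a call of this
   shape should work. The path is a hypothetical placeholder:
   myBook = Import["path/to/my-book.txt", "Text"]; *)
```

Texts loaded this way are ordinary strings, so they can be placed in the training and test sets in the same way as texts from the repository.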
CHECKPOINT
Check that each student has chosen and imported some texts.
“Now that we have our texts, we can separate them into training and test sets. Later we will use the Classify function to teach the machine which texts belong to which authors. Classify is an example of machine learning. In some ways, machine learning is like human learning—the computer can’t learn the correct answer without some examples. If we want the computer to learn which author is which, we have to give it many examples of text and tell it which author wrote those examples. We call this our ‘training set.’ After the computer has learned using those examples, we are ready to test it to see how good it is at identifying the right author. We call the pieces of text we use for testing our ‘test set.’”
Be sure that students do not have any overlap between their training and test sets.
In[10]:=
training=<|"Shakespeare"->{hamlet,sonnets},"Austen"->{pride,persuasion},"Twain"->{sawyer,finn}|>;
In[11]:=
test=<|"Shakespeare"->{caesar},"Austen"->{mansfield},"Twain"->{yankee}|>;
◼ Why do you think we need the training and test sets to be completely different?
◼ Do you have at least one text for each author in both the training and test sets?
◼ What is a training set for? What is a test set for?
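A quick check of the first two questions can be sketched as follows. It assumes training and test are associations of the form author -> list of texts, as defined above; the names trainingTexts and testTexts are introduced here only for illustration.

```wolfram
(* Collect every text on each side of the split *)
trainingTexts = Flatten[Values[training]];
testTexts = Flatten[Values[test]];

(* False means that no text appears in both sets *)
IntersectingQ[trainingTexts, testTexts]

(* True means that both sets cover exactly the same authors *)
Sort[Keys[training]] === Sort[Keys[test]]
```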
CHECKPOINT
Check that everyone has divided their data up into a training and test set.
“Let’s build your classifier using Classify. Remember, we will build the classifier using our training data.”
In[12]:=
classifier=Classify[training,Method->"Markov"]
Out[12]=
ClassifierFunction
◼ What do you think happens if we don’t specify a method?
◼ Try methods other than Markov. A list of available methods can be found here: http://reference.wolfram.com/language/ref/Classify.html
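A minimal sketch of how students might compare methods is shown below. The three method names come from the Classify documentation; the choice of these three and the variable name methods are assumptions made for illustration. With only two training texts per author, some methods may produce warnings or behave differently from what students expect, which can itself be a useful discussion point.

```wolfram
(* Method names from the Classify documentation; adjust freely *)
methods = {"Markov", "NaiveBayes", "LogisticRegression"};

(* Train one classifier per method and measure its accuracy on the test set *)
AssociationMap[
  ClassifierMeasurements[Classify[training, Method -> #], test, "Accuracy"] &,
  methods
]

(* Show a summary of the classifier trained earlier, including its method *)
ClassifierInformation[classifier]
```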
CHECKPOINT
Check that everyone has trained their classifiers.
“We can use ClassifierMeasurements on our test set to see how accurate our classifier is.”
In[13]:=
ClassifierMeasurements[classifier,test,"Accuracy"]
Out[13]=
1.
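Explain what the accuracy value means: it is the fraction of test texts assigned to the correct author. An accuracy of 1. means that every text in the test set was classified correctly.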
◼ What other measurements do you think would be interesting or useful?
◼ Why do you think your classifier worked so well/poorly? How might we improve your classifier?
◼ Try running the classifier on texts not written by one of your authors. Can you trick the classifier into thinking a piece of text was written by one of your authors?
◼ What do you think the classifier is looking for when deciding which author a text belongs to?
◼ What other types of things or text can we classify?
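Some of the questions above can be explored with a sketch like the following. It assumes classifier and test are defined as earlier in the module; the names cm and outsider, and the choice of Alice in Wonderland from the built-in ExampleData collection as the outside text, are illustrative assumptions.

```wolfram
(* Compute all measurements once, then ask for individual properties *)
cm = ClassifierMeasurements[classifier, test];
cm["ConfusionMatrixPlot"]

(* A text by none of the chosen authors (Lewis Carroll) *)
outsider = ExampleData[{"Text", "AliceInWonderland"}];

(* The classifier still has to pick one of the authors it knows *)
classifier[outsider]

(* The probabilities show how confident that pick is *)
classifier[outsider, "Probabilities"]
```

Comparing the probabilities for an outside text with those for a genuine test text can help students reason about what the classifier is picking up on.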
FINAL POINT
Begin the final point ten minutes before the end of the module time.
Summarize
Summarize what was done in the module and talk about findings.
Refer
Refer back to the learning objective and summarize how you have reached it.
Extend
Extend the module to the future. For example, “If you have time at home, try using your classifier on some text that you’ve written. Which author do you write most like?”
Possible Additional Relevant Functions
Predict • Nearest • ClassifierInformation • ClusterClassify
Possible Pitfalls
◼ Classify often performs very well on text, so students may get 100% accuracy on their first try. This can make discussions about how to improve accuracy difficult.
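If every student reaches 100% accuracy, one possible way to make the task harder is to test on short snippets instead of whole texts. The sketch below is a suggestion rather than part of the original module; it assumes test is an association with one text per author, and the name snippetTest and the limit of 50 sentences are arbitrary illustrative choices.

```wolfram
(* Replace each test text with up to 50 randomly chosen sentences from it *)
snippetTest = Map[RandomSample[TextSentences[First[#]], UpTo[50]] &, test];

(* Measure accuracy when the classifier sees only one sentence at a time *)
ClassifierMeasurements[classifier, snippetTest, "Accuracy"]
```

Single sentences carry much less information than a whole book, so accuracy may drop and give students something to improve.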