WOLFRAM|DEMONSTRATIONS PROJECT

Term Weighting with TF-IDF

term weighting: none, term frequency, raw document frequency, inverse document frequency, TF-IDF
option: stopwords grayed out
charles bailey was indicted for feloniously stealing on the 29th of december two dressed deer skins value 20 s the property of samuel savage and richard savage richard savage i am a leather seller 63 chiswell street my partner s name is samuel savage a few days previous to the 29th of december i looked out seventy skins for an order these skins being of a bad colour i directed them to be brimstoned to make them of equal colour pale on the 29th in the afternoon i saw them all smooth on a horse a few hours afterwards they appeared very much tumbled and one was thrown into the yard and dirtied i caused them to be brought in the warehouse and counted there was two gone our foreman went to worship street and brought armstrong and vickrey they searched and found this skin in the prisoner s breeches and the other skin was found in the workshop carter i am foreman to samuel and richard savage the seventy skins i was with mr savage looking them out i took them out of the stove and counted them on the horse and on friday i counted them three times over there were no more than sixty eight instead of seventy i went to worship street brought mr armstrong and vickery with me they waited till the men left work and when they came down they were searched and on the prisoner one skin was found john armstrong i went to this gentleman s house after the men came down vickrey and i were searching in one minute vickrey called me i received this skin from him it was taken out of the prisoner s breeches i have had it ever since john vickrey q you were with armstrong a yes while i was searching another man i saw the prisoner very uneasy and his breeches were unbuttoned i put my hand in and took that skin out he said he could not tell how it came there the property produced and identified the prisoner said nothing in his defence called four witnesses who gave him a good character guilty aged 27 confined six months in the house of correction and fined 1 s second middlesex jury before mr 
recorder
TF-IDF (term frequency-inverse document frequency) is a way of determining which terms in a document should be weighted most heavily when trying to understand what the document is about. The term frequency reflects how often a given term appears in the document of interest. The document frequency is measured with respect to a corpus of other documents: it tells you how widely the term occurs across that corpus. The terms that are most informative about a particular text have a high term frequency and a low document frequency, because they are common in this text but rare elsewhere. The TF-IDF for a term is the product of its term frequency and the scaled inverse of its document frequency. Stopwords are words that occur so frequently in the language that they rarely convey information about the meaning of a particular document. In this Demonstration, stopwords can be grayed out, and the font size of each term can be scaled by term frequency, raw document frequency, inverse document frequency, or TF-IDF.
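The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the Demonstration's own implementation: the function name `tf_idf`, the tiny stopword list, and the three-document toy corpus are all assumptions made for the example. The "scaled inverse" is taken here to be a smoothed logarithm, log((1 + N) / (1 + df)), where N is the number of documents and df is the number of documents containing the term. That is one common choice among several.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "of", "and", "was", "in", "to", "i", "on"}

def tf_idf(document, corpus, stopwords=STOPWORDS):
    """Return {term: weight} for `document`, scored against `corpus`.

    `document` is a list of tokens; `corpus` is a list of token lists.
    """
    # Term frequency: share of the document's non-stopword tokens.
    counts = Counter(t for t in document if t not in stopwords)
    total = sum(counts.values())
    n_docs = len(corpus)

    weights = {}
    for term, count in counts.items():
        tf = count / total
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for doc in corpus if term in doc)
        # Smoothed log-scaled inverse document frequency (assumed scaling).
        idf = math.log((1 + n_docs) / (1 + df))
        weights[term] = tf * idf
    return weights

# Hypothetical toy corpus, loosely in the style of the sample text.
corpus = [
    "the prisoner stole two deer skins".split(),
    "the prisoner stole a silver watch".split(),
    "the prisoner was found guilty".split(),
]

weights = tf_idf(corpus[0], corpus)
top_terms = sorted(weights, key=weights.get, reverse=True)
```

In this toy corpus, "prisoner" appears in every document, so its weight is zero, while "deer" and "skins" appear only in the first document and receive the highest weights. In a display like this Demonstration's, such weights would drive the font size of each term.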