Term Weighting with TF-IDF
TF-IDF (term frequency-inverse document frequency) is a way of determining which terms in a document should be weighted most heavily when trying to understand what the document is about. The term frequency reflects how often a given term appears in the document of interest. The document frequency is measured with respect to a corpus of other documents: it tells you how common the term is across the corpus as a whole. The terms that are most informative about a particular text have a high term frequency and a low document frequency. The TF-IDF weight of a term is the product of its term frequency and the scaled (typically logarithmic) inverse of its document frequency. Stopwords are words that occur so frequently in the language that they rarely convey information about the meaning of a particular document. In this Demonstration, stopwords can be turned on and off, and the font size of each term can be scaled by various weighting factors.
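The weighting described above can be sketched in a few lines of Python. This is a minimal illustration and not the Demonstration's own code. It makes several assumptions: document frequency is counted as the number of corpus documents containing the term, the inverse is scaled with a smoothed logarithm, and the tokenizer, the stopword list, and the names `tokenize`, `tf_idf`, and `STOPWORDS` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "is", "on", "in"}


def tokenize(text, drop_stopwords=False):
    """Lowercase and split on whitespace; optionally remove stopwords."""
    tokens = text.lower().split()
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def tf_idf(document, corpus, drop_stopwords=False):
    """Return {term: weight} for `document`, scored against `corpus`."""
    doc_tokens = tokenize(document, drop_stopwords)
    corpus_terms = [set(tokenize(d, drop_stopwords)) for d in corpus]
    n_docs = len(corpus)
    weights = {}
    for term, count in Counter(doc_tokens).items():
        # Term frequency: share of the document's tokens that are this term.
        tf = count / len(doc_tokens)
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for terms in corpus_terms if term in terms)
        # Assumed scaling: smoothed logarithm of the inverse document frequency.
        idf = math.log((1 + n_docs) / (1 + df))
        weights[term] = tf * idf
    return weights


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
weights = tf_idf(corpus[0], corpus)
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

In this toy corpus, "the" appears in every document, so its weight is zero even though it is the most frequent term in the first document, while "mat", which appears in only one document, receives the highest weight. Passing `drop_stopwords=True` removes such words before weighting, which mirrors the stopword toggle described above.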