Multilingual Stopwords

Functions and arguments

The tidystopwords package gives you potential stopwords in more than 100 languages. Its main function is generate_stoplist. Its language argument accepts a single character string or a character vector of language names or language abbreviations, matching those listed by the helper function list_supported_languages.

The generate_stoplist function comes with three numbered output options (illustrated in the code sketch after this list).

  • Option 1 outputs a character vector of unique word forms.
  • Option 2 outputs a named character vector of word forms. The names denote stop classes, roughly corresponding to parts of speech. Note that in this output the word forms are not unique: among English stopwords, for instance, the form that occurs both as a subordinating conjunction and as a pronoun.
  • Option 3 (the default) outputs a data frame in which each row represents a combination of language (columns lang_name and lang_id), word form, and lemma (columns form and lemma), together with several other columns explained below.
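
For concreteness, here is a minimal usage sketch. The language argument is documented above; the name of the argument that selects the numbered output option, as well as the exact spelling of the language names, are assumptions here and should be checked against ?generate_stoplist and the output of list_supported_languages in your installed version.

    library(tidystopwords)

    # Which language names and abbreviations are accepted.
    list_supported_languages()

    # Default output (option 3): a data frame with lang_name, lang_id,
    # form, lemma, and the additional columns described below.
    # The spelling "English" is an assumption; take the exact name from
    # list_supported_languages().
    stops_en <- generate_stoplist(language = "English")
    head(stops_en)

    # Output option 1: a plain character vector of unique word forms.
    # The argument name "output" is an assumption; see ?generate_stoplist.
    generate_stoplist(language = "English", output = 1)

    # Several languages at once via a character vector of names.
    stops_multi <- generate_stoplist(language = c("English", "German"))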

Definition of stopwords

The generate_stoplist output is based on multilingual_stoplist, a data frame that was automatically extracted from the Universal Dependencies treebanks (henceforth UD). Universal Dependencies is a framework for cross-linguistically consistent grammatical annotation. The tidystopwords package uses its lemmatization, universal parts of speech, and universal features to derive an inventory of stop classes:

  • abbreviation (e.g. e.g., cf., etc.);
  • adposition (preposition or postposition, e.g. in and ago);
  • auxiliary verb (e.g. been, have, must);
  • conjunction_subordinator (e.g. and, because);
  • contraction (e.g. n’t);
  • determiner_quantifier (e.g. third, which, both);
  • interjection (e.g. yes);
  • particle (e.g. off in take off);
  • pronominal (function words that act as nouns, e.g. him, it; pronouns acting as adjectives (your) and pronominal adverbs (where) are covered by the determiner_quantifier stop class).

In terms of the Universal Dependencies annotation, the stop classes are defined as follows (two of these conditions are sketched in code after the list):

  • abbreviation: ufeat contains Abbr=Yes and upos does not equal NOUN or ADJ;
  • adposition: upos equals ADP;
  • auxiliary verb: upos equals AUX;
  • conjunction_subordinator: upos equals CCONJ or SCONJ;
  • contraction: neither the form nor the lemma equals _, the upos equals _, and the form has occurred more than twice in the corpus;
  • determiner_quantifier: either upos equals DET, or ufeat contains PronType together with at least one of the following strings: NumType, Ind, Dem, Int, Rel, Tot, Neg;
  • interjection: upos equals INTJ;
  • particle: upos equals PART;
  • pronominal: either upos equals PRON (with no restriction on ufeat), or ufeat contains PronType while upos does not equal DET.
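
These definitions amount to boolean conditions on the upos and ufeat values of a UD-style token table. The sketch below uses a tiny made-up data frame rather than the package's internal data, and the column names upos and ufeat are assumptions made for the sake of the example; it illustrates two of the conditions in base R.

    # A tiny, made-up UD-style token table (real tables come from CoNLL-U files).
    ud_tokens <- data.frame(
      form  = c("both", "him", "the", "where"),
      upos  = c("DET", "PRON", "DET", "ADV"),
      ufeat = c("PronType=Tot", "PronType=Prs", "PronType=Art|Definite=Def",
                "PronType=Int"),
      stringsAsFactors = FALSE
    )

    # determiner_quantifier: upos equals DET, or ufeat contains PronType plus
    # one of NumType, Ind, Dem, Int, Rel, Tot, Neg.
    is_det_quant <- ud_tokens$upos == "DET" |
      (grepl("PronType", ud_tokens$ufeat) &
         grepl("NumType|Ind|Dem|Int|Rel|Tot|Neg", ud_tokens$ufeat))

    # pronominal: upos equals PRON (no restriction on ufeat), or ufeat contains
    # PronType while upos is not DET.
    is_pronominal <- ud_tokens$upos == "PRON" |
      (grepl("PronType", ud_tokens$ufeat) & ud_tokens$upos != "DET")

    ud_tokens$form[is_det_quant]   # "both" "the" "where"
    ud_tokens$form[is_pronominal]  # "him" "where"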

Vocabulary coverage

Each version of this package uses the latest UD release available at the time of its creation to generate the multilingual_stoplist data frame; multilingual_stoplist can therefore differ from version to version. Typically, a new UD release brings larger annotated corpora as well as corpora of newly added languages.

All stopword lists in tidystopwords have been generated automatically from the data available at the time of generation. Hence their quality depends on the size of the underlying corpora as well as on the morphological richness of the given language.

To allow the user to assess the reliability of the stopword list for a given language, the multilingual_stoplist data frame contains frequency information for each word form in three columns: n_formlemma, n_uposlemma, and n_stopclasses.

The n_formlemma column gives the absolute frequency of the given word form with the given lemma. The n_uposlemma column gives the absolute frequency of the given word form with the given lemma and upos.

The n_stopclasses column says in how many stop classes the given word form with the given lemma occurs. For instance, that occurs as determiner_quantifier (that pie tastes good), pronominal (don’t mention that), and conjunction_subordinator (say that you will do it).
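
As a quick sketch of how these columns can be inspected (assuming that lang_name stores capitalized language names such as "English"; check unique(multilingual_stoplist$lang_name) for the actual values):

    library(tidystopwords)

    # All rows for the English form "that": n_formlemma and n_uposlemma carry
    # the absolute frequencies described above, and n_stopclasses shows in how
    # many stop classes the form/lemma pair occurs.
    that_rows <- subset(multilingual_stoplist,
                        lang_name == "English" & form == "that")
    that_rows[, c("form", "lemma", "n_formlemma", "n_uposlemma", "n_stopclasses")]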

Noise control

Even high-quality reference corpora such as the UD treebanks contain tagging errors and typos. A two-step frequency filter minimizes the noise: 1) a word form must occur more than three times with a given lemma; 2) if a word form with a given lemma (counted in n_formlemma) occurs in several different upos combinations (n_uposlemma), only the combinations that represent more than 20% of n_formlemma remain listed.
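
Expressed as plain R conditions on the frequency columns, the filter looks roughly as follows. This is illustrative only: the filter is applied while multilingual_stoplist is built, so it is not something the user needs to rerun.

    # Step 1: the form/lemma pair occurs more than three times.
    # Step 2: the form/lemma/upos combination accounts for more than 20% of
    # the form/lemma frequency.
    keep <- with(multilingual_stoplist,
                 n_formlemma > 3 & n_uposlemma / n_formlemma > 0.2)
    filtered <- multilingual_stoplist[keep, ]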