The tidystopwords package gives you potential stopwords in more than
100 languages. Its main function is generate_stoplist. Its language
argument accepts single strings and character vectors of language names
or language abbreviations corresponding to those listed by the helper
function list_supported_languages.
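A minimal usage sketch of the two functions named above. It assumes that the package is installed and that "English" is one of the names reported by list_supported_languages; the variable names are illustrative only.

```r
# Sketch only: runs the calls when tidystopwords is installed, otherwise does nothing.
if (requireNamespace("tidystopwords", quietly = TRUE)) {
  # Inspect which language names and abbreviations are available.
  supported <- tidystopwords::list_supported_languages()

  # Request a stoplist for one language; "English" is an assumed valid name.
  english_stoplist <- tidystopwords::generate_stoplist(language = "English")
}
```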
The generate_stoplist function comes with three numbered output
options:
1 outputs a character vector of unique word forms.
2 outputs a named character vector of word forms. The names denote
stop classes roughly corresponding to parts of speech. Note that, in
this output, the word forms are not unique. For instance, in English
stopwords, the form "that" would occur as a subordinating conjunction
as well as a pronoun.
3 (the default) outputs a data frame, where each row represents a
combination of language (columns lang_name and lang_id), word form and
word lemma (columns form and lemma), and several other columns
explained below.
The list_supported_languages output is based on multilingual_stoplist,
a data frame that was automatically extracted from the Universal
Dependencies treebanks (henceforth UD). Universal Dependencies is a
framework for cross-linguistically consistent grammatical annotation.
The tidystopwords package uses their lemmatization, universal parts of
speech, and universal features to derive an inventory of stop classes:
abbreviation (e.g. e.g., cf., etc);
adposition (preposition or postposition, e.g. in and ago);
auxiliary verb (e.g. been, have, must);
conjunction_subordinator (e.g. and, because);
contraction (e.g. n’t);
determiner_quantifier (e.g. third, which, both);
interjection (e.g. yes);
particle (e.g. off in take off);
pronominal (function words that act as nouns, e.g. him, it. Pronouns
acting as adjectives (your) and pronominal adverbs (where) are covered
by the determiner_quantifier stop class.)
In terms of the Universal Dependencies, the stop classes are defined as follows:
abbreviation: ufeat contains Abbr=Yes and upos does not equal NOUN or
ADJ;
adposition: upos equals ADP;
auxiliary verb: upos equals AUX;
conjunction_subordinator: upos equals CONJ or SCONJ;
contraction: neither form nor lemma equals _, upos equals _, and the
form has occurred more than twice in the corpus;
determiner_quantifier: either upos equals DET, or ufeat contains
PronType and at least one of the following strings: NumType, Ind, Dem,
Int, Rel, Tot, Neg;
interjection: upos equals INTJ;
particle: upos equals PART;
pronominal: upos equals PRON with no restrictions on ufeat, or ufeat
contains PronType but then upos does not equal DET.
Each version of this package uses the latest UD release available to
generate the multilingual_stoplist data frame. Therefore,
multilingual_stoplist can differ from version to version. Typically, a
new UD release brings bigger annotated corpora and new corpora for
additional languages.
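The stop-class definitions given earlier in terms of upos and ufeat can be sketched as a small base R function. This is an illustration only, not the package's own code: the function name stop_classes_for and its form_count argument are hypothetical, and returning every matching class (rather than picking one) is an assumption, since the text does not state how overlaps are resolved.

```r
# Hypothetical helper: return every stop class whose rule matches one token.
# form_count is the corpus frequency of the form (used by the contraction rule).
stop_classes_for <- function(form, lemma, upos, ufeat, form_count) {
  # Literal substring test on the feature string.
  has <- function(s) grepl(s, ufeat, fixed = TRUE)
  dq_strings <- c("NumType", "Ind", "Dem", "Int", "Rel", "Tot", "Neg")
  has_dq <- any(vapply(dq_strings, has, logical(1)))

  rules <- c(
    abbreviation = has("Abbr=Yes") && !(upos %in% c("NOUN", "ADJ")),
    adposition = upos == "ADP",  # ADP is the UD tag for adpositions
    `auxiliary verb` = upos == "AUX",
    conjunction_subordinator = upos %in% c("CONJ", "SCONJ"),
    contraction = form != "_" && lemma != "_" && upos == "_" && form_count > 2,
    determiner_quantifier = upos == "DET" || (has("PronType") && has_dq),
    interjection = upos == "INTJ",
    particle = upos == "PART",
    pronominal = upos == "PRON" || (has("PronType") && upos != "DET")
  )
  names(rules)[rules]
}

# Toy calls with made-up frequencies.
stop_classes_for("in", "in", "ADP", "_", 100)
stop_classes_for("that", "that", "PRON", "Number=Sing|PronType=Dem", 50)
```

Under these literal rules a demonstrative pronoun matches both determiner_quantifier and pronominal, which is consistent with one form appearing in several stop classes.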
All stopword lists in tidystopwords have been generated automatically
from the data available at the moment. Hence their quality depends on
the size of the underlying corpora as well as the morphological
richness of the given language.
To allow the user to assess the reliability of the stopword list for
the given language, multilingual_stoplist contains relevant frequency
information for each word form in three columns: n_formlemma,
n_uposlemma, and n_stopclasses.
The n_formlemma column gives the absolute frequency of the given word
form with the given lemma. The n_uposlemma column gives the absolute
frequency of the given word form with the given lemma and upos.
The n_stopclasses column says in how many stop classes the given word
form with the given lemma occurs. For instance, the form "that" occurs
as determiner_quantifier (that pie tastes good), pronominal (don’t
mention that), and conjunction_subordinator (say that you will do it).
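To make the three frequency columns concrete, here is a sketch that derives them from a toy token table in base R. The table and its counts are invented for illustration, and this is not necessarily how the package computes the columns.

```r
# Toy token table: one row per corpus occurrence (invented data).
tokens <- data.frame(
  form = c("that", "that", "that", "that", "that", "it"),
  lemma = c("that", "that", "that", "that", "that", "it"),
  upos = c("SCONJ", "SCONJ", "PRON", "DET", "DET", "PRON"),
  stop_class = c("conjunction_subordinator", "conjunction_subordinator",
                 "pronominal", "determiner_quantifier",
                 "determiner_quantifier", "pronominal"),
  stringsAsFactors = FALSE
)

# Grouping keys: form + lemma, and form + lemma + upos.
key_fl <- paste(tokens$form, tokens$lemma, sep = "\r")
key_ufl <- paste(key_fl, tokens$upos, sep = "\r")
row_id <- seq_len(nrow(tokens))

# Occurrences of each form with each lemma.
tokens$n_formlemma <- ave(row_id, key_fl, FUN = length)
# Occurrences of each form with each lemma and upos.
tokens$n_uposlemma <- ave(row_id, key_ufl, FUN = length)
# Number of distinct stop classes per form and lemma.
class_id <- as.integer(factor(tokens$stop_class))
tokens$n_stopclasses <- ave(class_id, key_fl,
                            FUN = function(x) length(unique(x)))

# One row per distinct combination, as in a stoplist table.
unique(tokens)
```

In this toy table the form "that" ends up with n_stopclasses equal to 3, mirroring the example in the text.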
Even high-quality reference corpora such as the UD treebanks contain
tagging errors and typos. A two-step frequency filter reduces the
noise: 1) a word form must occur more than three times with a given
lemma; 2) if a word form with a given lemma (rendered by n_formlemma)
occurs in several different upos combinations (n_uposlemma), only
combinations that represent more than 20% of n_formlemma remain listed.
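The two-step filter can be sketched on a toy table as follows. The rows and counts are invented, and the code only illustrates the two thresholds described above; it is not taken from the package.

```r
# Toy stoplist candidates with precomputed frequencies (invented data).
candidates <- data.frame(
  form = c("that", "that", "that", "teh"),
  lemma = c("that", "that", "that", "the"),
  upos = c("SCONJ", "PRON", "ADV", "DET"),
  n_formlemma = c(100, 100, 100, 2),
  n_uposlemma = c(60, 35, 5, 2),
  stringsAsFactors = FALSE
)

# Step 1: the form must occur more than three times with its lemma.
step1 <- candidates$n_formlemma > 3

# Step 2: a upos combination must cover more than 20% of n_formlemma.
# A form with a single upos has n_uposlemma == n_formlemma and always passes.
step2 <- candidates$n_uposlemma > 0.2 * candidates$n_formlemma

filtered <- candidates[step1 & step2, ]
filtered
```

Here the rare ADV reading of "that" (5 of 100 occurrences) is dropped by step 2, and the typo "teh" is dropped by step 1.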