This vignette explains basic functionalities of the package litRiddle, a part of the Riddle of Literary Quality project.
The package contains the data of a reader survey about fiction in Dutch, a description of the novels the readers rated, and the results of stylistic measurements of the novels. The package also contains functions to combine, analyze, and visualize these data.
See: https://literaryquality.huygens.knaw.nl/ for further details. Information in Dutch about the package can be found at https://karinavdo.github.io/RaadselLiteratuur/02_07_data_en_R_package.html.
These data are also available as individual CSV files for those who want to work with the data outside R. See: https://github.com/karinavdo/RiddleData.
If you use litRiddle in your academic publications, please consider citing the following references:
Eder, M., Lensink, S., van Zundert, J.J., and van Dalen-Oskam, K.H. (2022). Replicating The Riddle of Literary Quality: The LitRiddle Package for R. In Digital Humanities 2022 Conference Abstracts, 636–637. Tokyo: The University of Tokyo / DH2022 Local Organizing Committee. https://dh2022.dhii.asia/abstracts/files/VAN_DALEN_OSKAM_Karina_Replicating_The_Riddle_of_Literary_Qu.html
van Dalen-Oskam, K. (2023). The Riddle of Literary Quality: A Computational Approach. Amsterdam University Press.
Install the package from the CRAN repository:
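A minimal sketch of the CRAN route, using base R's `install.packages()`. The `requireNamespace()` guard and the explicit mirror URL are additions made for this sketch, so that the download only happens when the package is missing and also works in a non-interactive session:

```r
# Install litRiddle from CRAN only if it is not yet available.
# The guard and the `repos` value are conveniences assumed for this sketch.
if (!requireNamespace("litRiddle", quietly = TRUE)) {
  install.packages("litRiddle", repos = "https://cloud.r-project.org")
}
```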
Alternatively, try installing it directly from the GitHub repository:
First, one has to activate the package so that its functions become visible to the user:
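Activation is the usual `library()` call; the startup message shown next is what the package prints when it is attached:

```r
# Attach the package so that its functions and datasets can be used.
library(litRiddle)
```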
## litRiddle version: 1.0.0
##
## Thank you for working with the litRiddle package. We would greatly
## appreciate properly citing this software if you find it of use.
## When citing, you can refer to this software as:
## Eder, M., van Zundert, J., Lensink, S., and van Dalen-Oskam, K. (2023).
## litRiddle: Dataset and Tools to Research the Riddle of Literary Quality
## CRAN. <https://CRAN.R-project.org/package=litRiddle>
## If you prefer to cite a publication instead, here is our suggestion:
## Eder, M., Lensink, S., Van Zundert, J.J., and Van Dalen-Oskam, K.H.
## (2022). Replicating The Riddle of Literary Quality: The LitRiddle
## Package for R. In Digital Humanities 2022 Conference Abstracts,
## 636-637. Tokyo: The University of Tokyo,
## <https://dh2022.dhii.asia/abstracts/163>.
##
## To get full BibTeX entry, type: citation("litRiddle").
To activate the dataset, type one of the following lines (or all of them):
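A sketch of the activation step, assuming the tables are exposed under the names that appear elsewhere in this vignette (`books`, `respondents`, `reviews`, `motivations`):

```r
library(litRiddle)

# Load the four data tables into the current session.
data(books)
data(respondents)
data(reviews)
data(motivations)
# The word-frequency table is assumed to load the same way:
# data(frequencies)
```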
From now on, the dataset, divided into four data tables, is visible to the user. Note that the functions discussed below do not need the dataset to be activated (they take care of it themselves), so you can skip this step if you plan to analyze the data using only the functions from the package.
Time to explore some of the data tables. Typing the name of a table lists all of its data points, for example the table books:
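In its simplest form, and assuming the table has been activated with `data(books)` first, this amounts to:

```r
library(litRiddle)
data(books)

# Typing the object name prints the whole table; expect a long listing.
books
```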
This command will dump quite a lot of output on the screen, offering little insight or overview. It’s usually a better idea to select one portion of information at a time, typically one variable or one observation. We assume here that the user has some basic knowledge of R, and in particular knows how to access values in vectors and tables (matrices). To get the titles of the books scored in the survey (or, say, the first 10 titles), one might type:
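A sketch using plain R indexing on the `title` column, which matches the factor listing that follows:

```r
library(litRiddle)
data(books)

# Select the `title` column and keep its first 10 values.
books$title[1:10]
```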
## [1] Haar naam was Sarah Duel
## [3] Het Familieportret De kraai
## [5] Mannen die vrouwen haten Heldere hemel
## [7] Vijftig tinten grijs Gerechtigheid
## [9] De verrekijker De vrouw die met vuur speelde
## 399 Levels: 1 Fifth Avenue 13 uur 1953 1q84 22/11/63 50/50 Moorden ... Zwarte piste
Well, but how do I know that the name of the particular variable I want to get is title, rather than anything else? There exists a function that lists all the variables from the four data tables.
The function that creates a list of all the column names from all four datasets is named get.columns() and needs no arguments to be run. This means that you simply type the following code, remembering the parentheses at the end of the function name:
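The call itself, as described above, takes no arguments:

```r
library(litRiddle)

# List the column names of every data table in the package.
get.columns()
```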
## $books
## [1] "short.title" "author"
## [3] "title" "title.english"
## [5] "genre" "book.id"
## [7] "riddle.code" "riddle.code.english"
## [9] "translated" "gender.author"
## [11] "origin.author" "original.language"
## [13] "inclusion.criterion" "publication.date"
## [15] "first.print" "publisher"
## [17] "word.count" "type.count"
## [19] "sentence.length.mean" "sentence.length.variance"
## [21] "paragraph.count" "sentence.count"
## [23] "paragraph.length.mean" "raw.TTR"
## [25] "sampled.TTR"
##
## $respondents
## [1] "respondent.id" "gender.resp" "age.resp"
## [4] "zipcode" "education" "education.english"
## [7] "books.per.year" "typically.reads" "how.literary"
## [10] "s.4a1" "s.4a2" "s.4a3"
## [13] "s.4a4" "s.4a5" "s.4a6"
## [16] "s.4a7" "s.4a8" "s.12b1"
## [19] "s.12b2" "s.12b3" "s.12b4"
## [22] "s.12b5" "s.12b6" "s.12b7"
## [25] "s.12b8" "remarks.survey" "date.time"
## [28] "week.nr" "day"
##
## $reviews
## [1] "respondent.id" "book.id" "quality.read"
## [4] "literariness.read" "quality.notread" "literariness.notread"
## [7] "book.read"
##
## $motivations
## [1] "motivation.id" "respondent.id" "book.id" "paragraph.id"
## [5] "sentence.id" "token" "lemma" "upos"
Not bad indeed. However, how can I know what s.4a2 stands for?
The function that lists a short explanation of what the different column names refer to and what their levels consist of is called explain(). To work properly, this function needs an argument, which means that the user has to specify which dataset they are interested in. The options are as follows:
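A sketch of the calls that presumably produced the listings below. Only `explain("books")` is quoted elsewhere in this vignette; the other four argument values are inferred from the headers of the output and should be treated as assumptions:

```r
library(litRiddle)

# One call per dataset; each prints a description of its columns.
explain("books")
explain("reviews")
explain("respondents")
explain("motivations")
explain("frequencies")
```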
## The 'books' dataset contains information on several details of the 401
## different books used in the survey.
##
## Here follows a list with the different column names and an explanation of
## the information they contain:
##
## 1. short.title A short name containing the author's name and
## (a part of) the title;
## 2. author Last name and first name of the author;
## 3. title Full title of the book;
## 4. title.english Full title of the book in English;
## 5. genre Genre of the book.
## There are four different genres:
## a) Fiction; b) Romantic; c) Suspense; d) Other;
## 6. book.id Unique number to identify each book;
## 7. riddle.code More complete list of genres of the books.
## Contains 13 categories --- to see which, type
## `levels(books$riddle.code)` in the terminal;
## 8. riddle.code.english Translation of code in column 7 in English;
## 9. translated 'yes' if the book has been translated,
## 'no' if not;
## 10. gender.author The gender of the author:
## Female, Male, Unknown/Multiple;
## 11. origin.author The country of origin of the author.
## Note that short country codes have been used
## instead of the full country names;
## 12. original.language The original language of the book. Note that short
## language codes have been used, instead of the full
## language names;
## 13. inclusion.criterion In what category a book has been placed, either
## a) bestseller; b) boekenweekgeschenk;
## c) library; or d) literair juweeltje;
## 14. publication.date Publication date of the book, YYYY-MM-DD format;
## 15. first.print Year in which the first print appeared;
## 16. publisher Publishers of the books;
## 17. word.count Word count, or total number of words (tokens)
## used in a book;
## 18. type.count Total number of unique words (types) used in book;
## 19. sentence.length.mean Average sentence length in a book (in words);
## 20. sentence.length.variance Standard deviation of the sentence length;
## 21. paragraph.count Total number of paragraphs in a book;
## 22. sentence.count Total number of sentences in a book;
## 23. paragraph.length.mean Average paragraph length in a book (in words);
## 24. raw.TTR Lexical diversity, or type-token ratio, which
## gives an indication of how diverse the word use
## in a book is;
## 25. sampled.TTR Unlike the raw type-token ratio, the sampled
## TTR is significantly more resistant to text
## size, and thus it should be preferred over the
## raw TTR.
## The 'reviews' dataset contains four different ratings that were given
## to 401 different books.
##
## Here follows a list with the different column names and an explanation of
## what information they contain:
##
## 1. respondent.id Unique number for each respondent of the survey;
## 2. book.id Unique number to identify each book;
## 3. quality.read Rating on the quality of a book that a respondent
## has read. Scale from 1 - 7, with 1 meaning
## 'very bad' and 7 meaning 'very good';
## 4. literariness.read Rating on how literary a respondent found a book
## that s/he has read. Scale from 1 - 7, with 1 meaning
## 'not literary at all' and 7 meaning 'very literary';
## 5. quality.notread Rating on the quality of a book that a respondent
## has not read. Scale from 1 - 7, with 1 meaning
## 'very bad' and 7 meaning 'very good';
## 6. literariness.notread Rating on how literary a respondent found a book that
## s/he has not read. Scale from 1 - 7, with 1 meaning
## 'not literary at all' and 7 meaning 'very literary';
## 7. book.read 1 or 0: 1 indicates that the respondent read
## the book, 0 indicates the respondent did not
## read the book but had an opinion about
## the literary quality of the book.
## The 'respondents' dataset contains information on the people that participated
## in the survey.
##
## Here follows a list with the different column names and an explanation of
## what information they contain:
##
## 1. respondent.id Unique number for each respondent of the survey;
## 2. gender.resp Gender of the respondent: Female, Male, NA;
## 3. age.resp Age of the respondent;
## 4. zipcode Zip code of the respondent;
## 5. education Education level, containing 8 levels (see which
## levels by typing 'levels(respondents$education)'
## in the terminal);
## 6. education.english English translation of education levels.
## 7. books.per.year Number of books read per year by each respondent;
## 8. typically.reads Typical genre of books that a respondent reads,
## with three levels a) Fiction; b) Non-fiction;
## c) both;
## 9. how.literary Answer to the question 'How literary a reader do
## you consider yourself to be?', where respondents
## could fill in a number from 1 - 7, with 1 meaning
## 'not literary at all' and 7 meaning 'very literary';
## 10. s.4a1 Answer to the question: 'I like reading novels that
## I can relate to my own life'. Scale from 1 - 5, with
## 1 meaning 'completely disagree', and 5 meaning
## 'completely agree';
## 11. s.4a2 Answer to the question: 'The story of a novel is what
## matters most to me'. Scale from 1 - 5;
## 12. s.4a3 Answer to the question: 'The writing style in a book
## is important to me'. Scale from 1 - 5;
## 13. s.4a4 Answer to the question: 'I like searching for deeper
## layers in a novel'. Scale from 1 - 5;
## 14. s.4a5 Answer to the question: 'I like reading literature'.
## Scale from 1 - 5;
## 15. s.4a6 Answer to the question: 'I read novels to discover new
## worlds and unknown time periods'. Scale from 1 - 5;
## 16. s.4a7 Answer to the question: 'I mostly read novels during my
## vacation'. Scale from 1 - 5;
## 17. s.4a8 Answer to the question: 'I usually read several novels at
## the same time'. Scale from 1 - 5;
## 18. s.12b1 Answer to the question: 'I like novels based on real
## events'. Scale from 1 - 5;
## 19. s.12b2 Answer to the question: 'I like thinking about a novel's
## structure'. Scale from 1 - 5;
## 20. s.12b3 Answer to the question: 'The writing style in a novel
## is of more importance to me than its story'.
## Scale from 1 - 5;
## 21. s.12b4 Answer to the question: 'I like to get carried away by
## a novel'. Scale from 1 - 5;
## 22. s.12b5 Answer to the question: 'I like to pick my books from
## the top 10 list of best sold books'. Scale from 1 - 5;
## 23. s.12b6 Answer to the question: 'I read novels to be challenged
## intellectually'. Scale from 1 - 5;
## 24. s.12b7 Answer to the question: 'I love novels that are easy
## to read'. Scale from 1 - 5;
## 25. s.12b8 Answer to the question: 'In the evening, I prefer
## to read books over watching TV'. Scale from 1 - 5;
## 26. remarks.survey Any additional remarks that respondents filled in
## at the end of the survey;
## 27. date.time Date and time of the moment a respondent filled in
## the survey, format in YYYY-MM-DD HH:MM:SS;
## 28. week.nr Number of week in which the respondent filled in
## the survey;
## 29. day Day of the week in which the respondent filled in
## the survey.
## The 'motivations' dataset contains all motivations that people provided
## about why they gave a certain book a specific rating. The motivations have been
## parsed to provide POS tag information
##
## Here follows a list with the different column names and an explanation of
## what information they contain:
##
## 1. motivation.id Unique number for each motivation given;
## 2. respondent.id Unique number for each respondent;
## 3. book.id Unique number of the book the motivation pertains to;
## 4. paragraph.id Number of paragraph within the motivation;
## 5. sentence.id Number of sentence within the paragraph;
## 6. token Token (in sentence order);
## 7. lemma Lemma of token;
## 8. upos POS tag of token;
## This is a dataframe containing numerical values for word frequencies
## of the 5000 most frequent words (in a descending order of frequency)
## of 401 literary novels in Dutch. The table contains relative frequencies,
## meaning that the original word occurrences from a book were divided
## by the total number of words of the book in question.
##
## The row names coincide with the column short.title from the data frame books.
## The column names list the 5000 most frequent words in the corpus.
The package provides a function to combine all information from the survey, reviews, and books into one big data frame. The user can specify whether or not to also load the freqTable with the frequency counts of the word n-grams of the books.
Combine and load all data from the books, respondents and reviews into a new dataframe (tibble format)
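A sketch of this step. The function name `combine.all()` and its `load.freq.table` argument are written from recollection of the package's help pages rather than from this text, so check them with `help(package = "litRiddle")`:

```r
library(litRiddle)

# Join books, respondents and reviews into one table,
# without loading the word-frequency table.
dat <- combine.all(load.freq.table = FALSE)
```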
## Joining with `by = join_by(book.id)`
## Joining with `by = join_by(respondent.id)`
Combine and load all data from the books, respondents and reviews into a new dataframe (tibble format), and additionally load the frequency table of all word 1-grams of the corpus used.
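The variant with the frequency table, under the same assumption about `combine.all()` and `load.freq.table` (names recalled from the package documentation, not stated in this text):

```r
library(litRiddle)

# Same join as before, but also load the word-frequency table.
dat <- combine.all(load.freq.table = TRUE)
```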
## Joining with `by = join_by(book.id)`
## Joining with `by = join_by(respondent.id)`
Return the name of the dataset where a column can be found.
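A sketch of two such lookups. The function name `find.dataset()` is an assumption recalled from the package documentation; the two column names are chosen to match the two results printed below:

```r
library(litRiddle)

# Which tables contain a given column?
find.dataset("book.id")
find.dataset("age.resp")
```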
## [1] "books" "reviews"
## [1] "respondents"
It’s useful to combine it with the already-discussed function get.columns().
Make a table of frequency counts for one variable, and plot a histogram of the results. Not sure which variable you want to plot? Invoke the above-discussed function get.columns() once more, to see which variables you can choose from:
Now the fun stuff:
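In its simplest form, with only the `table.of` argument (the same argument used in the customized call further down), the call looks like this:

```r
library(litRiddle)

# Frequency table and histogram of respondent age.
make.table(table.of = 'age.resp')
```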
##
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
## 63 56 67 66 83 104 126 150 160 156 152 153 142 128 145 143 141 128 126 139
## 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
## 123 139 135 124 147 148 181 178 209 196 208 231 229 258 283 312 331 343 372 384
## 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 389 409 419 394 389 389 407 362 382 445 459 309 312 272 222 159 143 130 96 107
## 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 93 94 98
## 70 54 62 42 49 18 25 19 8 10 7 5 8 3 4 1 1 1 1
You can also adjust the x label, y label, title, and colors:
make.table(table.of = 'age.resp', xlab = 'age respondent',
ylab = 'number of people',
title = 'Distribution of respondent age',
barcolor = 'red', barfill = 'white')
##
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
## 63 56 67 66 83 104 126 150 160 156 152 153 142 128 145 143 141 128 126 139
## 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
## 123 139 135 124 147 148 181 178 209 196 208 231 229 258 283 312 331 343 372 384
## 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 389 409 419 394 389 389 407 362 382 445 459 309 312 272 222 159 143 130 96 107
## 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 93 94 98
## 70 54 62 42 49 18 25 19 8 10 7 5 8 3 4 1 1 1 1
Note: please mind that in the above examples we used single quotes to indicate arguments (e.g. xlab = 'age respondent'), whereas at the beginning of the document, we used double quotes (explain("books")). We did it for a reason, namely we wanted to emphasize that the functions provided by the package litRiddle are fully compliant with the generic R syntax, which allows for using either single or double quotes to indicate the strings.
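The function make.table2() tabulates one variable split by a second one. A sketch of the two calls that correspond, in order, to the two listings that follow; the argument names are those used in the customized make.table2() calls later in this section:

```r
library(litRiddle)

# Respondent age, split by respondent gender.
make.table2(table.of = 'age.resp', split = 'gender.resp')

# Literariness scores of books that were read, split by author gender.
make.table2(table.of = 'literariness.read', split = 'gender.author')
```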
## Joining with `by = join_by(book.id)`
## Joining with `by = join_by(respondent.id)`
##
## 16 17 18 19 20 21 22 23 24 25 26
## female 704 748 791 735 1238 1889 2536 2507 2879 3205 2701
## male 95 59 215 100 437 194 212 227 267 357 535
## NA 12 0 0 0 2 22 33 10 18 7 14
##
## 27 28 29 30 31 32 33 34 35 36 37
## female 3265 2826 2871 3472 2961 3621 3136 3519 3445 2963 3073
## male 405 429 480 517 487 621 362 401 675 380 909
## NA 0 19 21 48 0 12 1 0 17 15 10
##
## 38 39 40 41 42 43 44 45 46 47 48
## female 3618 2519 4296 4020 5024 5253 5855 5859 5557 6392 6630
## male 602 606 568 619 1111 852 1103 1036 786 709 1750
## NA 0 0 42 0 119 44 16 41 75 4 5
##
## 49 50 51 52 53 54 55 56 57 58 59
## female 8439 9399 9957 10284 10012 11748 13228 12400 13059 12023 12659
## male 1055 1455 1669 1920 2074 2066 2014 2522 2045 2459 3206
## NA 101 148 87 34 194 36 56 39 89 1 34
##
## 60 61 62 63 64 65 66 67 68 69 70
## female 11663 12296 11626 9625 11363 12173 10903 8509 8469 6240 5049
## male 3136 3206 2695 2696 2747 3659 4631 2548 2728 3564 2221
## NA 100 144 76 56 42 0 51 0 0 6 147
##
## 71 72 73 74 75 76 77 78 79 80 81
## female 3530 3495 2991 1944 1905 1863 849 912 758 955 342
## male 1021 1031 1649 812 1129 471 731 618 237 343 113
## NA 0 0 0 0 0 0 0 0 0 0 0
##
## 82 83 84 85 86 87 88 89 90 91 93
## female 440 190 123 183 294 51 115 53 48 27 0
## male 216 119 32 88 110 34 23 9 32 0 28
## NA 0 0 0 0 0 0 0 0 0 0 0
##
## 94 98
## female 5 16
## male 0 0
## NA 0 0
## Joining with `by = join_by(book.id)`
## Joining with `by = join_by(respondent.id)`
## Warning: Removed 309688 rows containing non-finite outside the scale range
## (`stat_count()`).
##
## 1 2 3 4 5 6 7
## female 3565 7145 10667 9259 12221 10630 2530
## male 1591 4532 7679 8553 16334 26154 13090
## unknown/multiple 206 491 837 817 977 570 101
Note that the ‘split’ argument only accepts a variable with fewer than 31 unique values, to avoid uninterpretable outputs. E.g., consider the following code:
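A sketch of a call that should trigger this check. The choice of `zipcode` as the split variable is an assumption made for illustration; any column with many distinct values would do:

```r
library(litRiddle)

# `zipcode` has far too many distinct values to split on,
# so the function is expected to print a message instead of a table.
make.table2(table.of = 'age.resp', split = 'zipcode')
```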
## Joining with `by = join_by(book.id)`
## Joining with `by = join_by(respondent.id)`
## The 'split-by' variable has many unique values, which will make the output very
## hard to process. Please providea 'split-by' variable that contains less unique
## values.
You can also adjust the x label, y label, title, and colors:
make.table2(table.of = 'age.resp', split = 'gender.resp',
xlab = 'age respondent', ylab = 'number of people',
barcolor = 'purple', barfill = 'yellow')
## Joining with `by = join_by(book.id)`
## Joining with `by = join_by(respondent.id)`
##
## 16 17 18 19 20 21 22 23 24 25 26
## female 704 748 791 735 1238 1889 2536 2507 2879 3205 2701
## male 95 59 215 100 437 194 212 227 267 357 535
## NA 12 0 0 0 2 22 33 10 18 7 14
##
## 27 28 29 30 31 32 33 34 35 36 37
## female 3265 2826 2871 3472 2961 3621 3136 3519 3445 2963 3073
## male 405 429 480 517 487 621 362 401 675 380 909
## NA 0 19 21 48 0 12 1 0 17 15 10
##
## 38 39 40 41 42 43 44 45 46 47 48
## female 3618 2519 4296 4020 5024 5253 5855 5859 5557 6392 6630
## male 602 606 568 619 1111 852 1103 1036 786 709 1750
## NA 0 0 42 0 119 44 16 41 75 4 5
##
## 49 50 51 52 53 54 55 56 57 58 59
## female 8439 9399 9957 10284 10012 11748 13228 12400 13059 12023 12659
## male 1055 1455 1669 1920 2074 2066 2014 2522 2045 2459 3206
## NA 101 148 87 34 194 36 56 39 89 1 34
##
## 60 61 62 63 64 65 66 67 68 69 70
## female 11663 12296 11626 9625 11363 12173 10903 8509 8469 6240 5049
## male 3136 3206 2695 2696 2747 3659 4631 2548 2728 3564 2221
## NA 100 144 76 56 42 0 51 0 0 6 147
##
## 71 72 73 74 75 76 77 78 79 80 81
## female 3530 3495 2991 1944 1905 1863 849 912 758 955 342
## male 1021 1031 1649 812 1129 471 731 618 237 343 113
## NA 0 0 0 0 0 0 0 0 0 0 0
##
## 82 83 84 85 86 87 88 89 90 91 93
## female 440 190 123 183 294 51 115 53 48 27 0
## male 216 119 32 88 110 34 23 9 32 0 28
## NA 0 0 0 0 0 0 0 0 0 0 0
##
## 94 98
## female 5 16
## male 0 0
## NA 0 0
make.table2(table.of = 'literariness.read', split = 'gender.author',
xlab = 'Overall literariness scores',
ylab = 'number of people', barcolor = 'black',
barfill = 'darkred')
## Joining with `by = join_by(book.id)`
## Joining with `by = join_by(respondent.id)`
## Warning: Removed 309688 rows containing non-finite outside the scale range
## (`stat_count()`).
##
## 1 2 3 4 5 6 7
## female 3565 7145 10667 9259 12221 10630 2530
## male 1591 4532 7679 8553 16334 26154 13090
## unknown/multiple 206 491 837 817 977 570 101
The original survey about Dutch fiction was designed to rank the responses using descriptive terms, e.g. “very bad”, “neutral”, “a bit good” etc. In order to conduct the analyses, the responses were then converted to numerical scales ranging from 1 to 7 (the questions about literariness and literary quality) or from 1 to 5 (the questions about the reviewer’s reading patterns). However, if you want the responses converted back to their original form, invoke the function order.responses(), which transforms the survey responses into ordered factors. Use either “bookratings” or “readingbehavior” to specify which of the survey questions need to be changed into ordered factors. (We assume here that the user knows what ordered factors are, because otherwise this function will not seem very useful.)
Levels of quality.read and quality.notread: “very bad”, “bad”, “a bit bad”, “neutral”, “a bit good”, “good”, “very good”, “NA”.
Levels of literariness.read and literariness.notread: “absolutely not literary”, “non-literary”, “not very literary”, “between literary and non-literary”, “a bit literary”, “literary”, “very literary”, “NA”.
Levels of statements 4/12: “completely disagree”, “disagree”, “neutral”, “agree”, “completely agree”, “NA”.
To create a data frame with ordered factor levels of the questions on reading behavior:
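A sketch using the "readingbehavior" option named above; the variable name `dat.behavior` is hypothetical, and `str()` is base R's structure summary:

```r
library(litRiddle)

# Convert the reading-behavior answers into ordered factors.
dat.behavior <- order.responses('readingbehavior')

# Inspect the column types of the result.
str(dat.behavior)
```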
## tibble [13,541 × 29] (S3: tbl_df/tbl/data.frame)
## $ respondent.id : num [1:13541] 0 1 2 3 4 5 6 7 8 9 ...
## $ gender.resp : Factor w/ 3 levels "female","male",..: 1 1 1 1 1 2 1 1 2 1 ...
## $ age.resp : num [1:13541] 18 24 78 77 71 58 38 51 66 32 ...
## $ zipcode : num [1:13541] 4834 5625 2272 2151 NA ...
## $ education : Ord.factor w/ 8 levels "none/primary school"<..: 5 7 7 5 5 7 6 6 7 7 ...
## $ education.english: Ord.factor w/ 8 levels "No education / primary school"<..: 5 7 7 5 5 7 6 6 7 7 ...
## $ books.per.year : num [1:13541] 20 30 30 12 15 60 25 30 50 2 ...
## $ typically.reads : Ord.factor w/ 5 levels "completely disagree"<..: NA NA NA NA NA NA NA NA NA NA ...
## $ how.literary : Ord.factor w/ 5 levels "completely disagree"<..: 3 3 3 3 4 2 3 3 1 3 ...
## $ s.4a1 : Ord.factor w/ 5 levels "completely disagree"<..: 4 4 4 3 4 2 3 2 3 2 ...
## $ s.4a2 : Ord.factor w/ 5 levels "completely disagree"<..: 4 4 5 4 3 4 5 3 4 5 ...
## $ s.4a3 : Ord.factor w/ 5 levels "completely disagree"<..: 4 5 4 4 4 5 4 5 4 4 ...
## $ s.4a4 : Ord.factor w/ 5 levels "completely disagree"<..: 4 5 4 3 4 3 1 4 4 4 ...
## $ s.4a5 : Ord.factor w/ 5 levels "completely disagree"<..: 5 5 4 3 4 4 3 5 5 4 ...
## $ s.4a6 : Ord.factor w/ 5 levels "completely disagree"<..: 4 5 4 4 4 4 4 4 3 4 ...
## $ s.4a7 : Ord.factor w/ 5 levels "completely disagree"<..: 4 3 3 2 2 1 3 2 2 5 ...
## $ s.4a8 : Ord.factor w/ 5 levels "completely disagree"<..: 4 5 3 4 2 3 1 5 4 1 ...
## $ s.12b1 : Ord.factor w/ 5 levels "completely disagree"<..: 2 4 4 3 4 2 3 2 3 3 ...
## $ s.12b2 : Ord.factor w/ 5 levels "completely disagree"<..: 4 1 4 4 3 4 2 3 5 3 ...
## $ s.12b3 : Ord.factor w/ 5 levels "completely disagree"<..: 3 3 3 3 3 3 2 3 3 3 ...
## $ s.12b4 : Ord.factor w/ 5 levels "completely disagree"<..: 4 3 4 4 4 4 5 4 4 4 ...
## $ s.12b5 : Ord.factor w/ 5 levels "completely disagree"<..: 1 2 3 2 3 1 2 2 4 2 ...
## $ s.12b6 : Ord.factor w/ 5 levels "completely disagree"<..: 4 4 4 3 3 4 2 4 3 2 ...
## $ s.12b7 : Ord.factor w/ 5 levels "completely disagree"<..: 2 3 4 4 4 2 5 3 2 2 ...
## $ s.12b8 : num [1:13541] 3 4 3 4 3 3 2 4 3 3 ...
## $ remarks.survey : chr [1:13541] "" "" "" "" ...
## $ date.time : POSIXct[1:13541], format: "2013-06-04 11:12:00" "2013-04-10 15:33:00" ...
## $ week.nr : num [1:13541] 23 15 15 27 15 29 15 15 15 15 ...
## $ day : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 3 4 5 6 4 2 4 4 4 5 ...
To create a data frame with ordered factor levels of the book ratings:
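The same pattern with the "bookratings" option; `dat.ratings` is again a hypothetical variable name:

```r
library(litRiddle)

# Convert the quality and literariness ratings into ordered factors.
dat.ratings <- order.responses('bookratings')

# Inspect the column types of the result.
str(dat.ratings)
```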
## tibble [448,055 × 7] (S3: tbl_df/tbl/data.frame)
## $ respondent.id : num [1:448055] 0 0 0 0 0 0 0 0 0 0 ...
## $ book.id : num [1:448055] 1 9 11 19 30 34 82 116 300 372 ...
## $ quality.read : Ord.factor w/ 7 levels "very bad"<"bad"<..: 6 5 7 5 5 7 5 5 6 6 ...
## $ literariness.read : Ord.factor w/ 7 levels "absolutely not literary"<..: 5 6 6 6 4 6 3 5 6 6 ...
## $ quality.notread : Ord.factor w/ 7 levels "very bad"<"bad"<..: NA NA NA NA NA NA NA NA NA NA ...
## $ literariness.notread: Ord.factor w/ 7 levels "absolutely not literary"<..: NA NA NA NA NA NA NA NA NA NA ...
## $ book.read : num [1:448055] 1 1 1 1 1 1 1 1 1 1 ...
The data frame frequencies contains numerical values for word frequencies of the 5000 most frequent words (in descending order of frequency) of 401 literary novels in Dutch. The table contains relative frequencies, meaning that the original word occurrences from a book were divided by the total number of words of the book in question. The measurements were obtained using the R package stylo, and were later rounded to the 5th digit. To learn more about the novels themselves, type help(books).
The row names of the frequencies data frame contain the titles of the novels corresponding to the short.title column in the data frame books.
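A sketch for peeking at a slice of the row names. The range 10:20 is an assumption (it yields 11 names, like the listing below, but the range actually used is not stated here), as is loading the table with `data(frequencies)`:

```r
library(litRiddle)
data(frequencies)

# Show 11 consecutive row names (short titles) of the frequency table.
rownames(frequencies)[10:20]
```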
## [1] "Allende_NegendeSchriftVan" "Amirrezvani_DochterVanIsfahan"
## [3] "Ammaniti_JijEnIk" "Ammaniti_LaatFeestBeginnen"
## [5] "Ammaniti_LaatsteOudejaarVan" "Ammaniti_ZoGodWil"
## [7] "Appel_VanTweeKanten" "Appel_Weerzin"
## [9] "Auel_LiedVanGrotten" "Austin_EindelijkThuis"
## [11] "Avallone_Staal"
Listing the relative frequency values for the novel Weerzin by René Appel:
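A sketch that selects the row by its short title and keeps the first 10 columns, i.e. the 10 most frequent words; loading the table with `data(frequencies)` is an assumption:

```r
library(litRiddle)
data(frequencies)

# Row = the novel's short title, columns = the 10 most frequent words.
frequencies["Appel_Weerzin", 1:10]
```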
## de het en een ik ze dat hij van in
## 2.91937 2.89593 1.80979 2.64645 1.10820 2.11716 1.75603 2.50586 1.40455 1.20331
And getting the book information:
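One way to do this with base R subsetting, matching on the `short.title` column:

```r
library(litRiddle)
data(books)

# Keep the row(s) of `books` whose short title matches the novel.
books[books$short.title == "Appel_Weerzin", ]
```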
## short.title author title title.english genre book.id
## 391 Appel_Weerzin Appel, René Weerzin *Aversion Suspense 391
## riddle.code riddle.code.english translated gender.author
## 391 305 LITERAIRE THRILLER 305 Literary thriller no male
## origin.author original.language inclusion.criterion publication.date
## 391 NL NL bestseller 2011-10-31
## first.print publisher word.count type.count sentence.length.mean
## 391 2008 Ambo/Anthos B.V. 72134 5710 9.9646
## sentence.length.variance paragraph.count sentence.count
## 391 7.2009 2168 7239
## paragraph.length.mean raw.TTR sampled.TTR
## 391 33.2721 0.0792 0.2466
Version 1.0 of the package introduces a table motivations, containing the 200k+ lemmatized and POS-tagged tokens making up the text of all motivations. The Dutch Language Institute (INT, Leiden) took care of POS-tagging the data. The tagging was manually corrected by Karina van Dalen-Oskam. We tried to guarantee the highest possible quality, but mistakes may still occur.
The solution to add a token-based table was chosen to not burden the table reviews with lots of text, XML, or JSON in additional columns, leading to potential problems with default memory constraints in R.
To retrieve all tokens:
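Printing the whole table would produce a very long listing, so this sketch uses base R's `head()` to show only the first 15 tokens, which is the number of rows in the listing below:

```r
library(litRiddle)
data(motivations)

# First 15 tokens, with their lemma and POS tag.
head(motivations, 15)
```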
## motivation.id respondent.id book.id paragraph.id sentence.id token
## 1 1 0 82 1 1 Het
## 2 1 0 82 1 1 is
## 3 1 0 82 1 1 een
## 4 1 0 82 1 1 snel
## 5 1 0 82 1 1 te
## 6 1 0 82 1 1 lezen
## 7 1 0 82 1 1 en
## 8 1 0 82 1 1 snel
## 9 1 0 82 1 1 te
## 10 1 0 82 1 1 doorgronden
## 11 1 0 82 1 1 boek
## 12 1 0 82 1 1 met
## 13 1 0 82 1 1 vlakke
## 14 1 0 82 1 1 personages
## 15 1 0 82 1 1 en
## lemma upos
## 1 het PRON_DET
## 2 is VERB
## 3 een PRON_DET
## 4 snel ADJ
## 5 te ADP
## 6 lezen VERB
## 7 en CONJ
## 8 snel ADJ
## 9 te ADP
## 10 doorgronden VERB
## 11 boek NOUN
## 12 met ADP
## 13 vlak ADJ
## 14 personage NOUN
## 15 en CONJ
Usually one will probably want to work with the full text of motivations. A convenience function motivations.text() is provided to create a view that has one motivation per row:
# We're importing `dplyr` to use `tibble` so we can
# show very large tables somewhat nicer.
suppressMessages(library(dplyr))
mots = motivations.text()
tibble(mots)
## # A tibble: 11,950 × 4
## motivation.id book.id respondent.id text
## <dbl> <dbl> <dbl> <chr>
## 1 1 82 0 Het is een snel te lezen en snel te door…
## 2 2 46 1 Ik vond HhhH eerder een verslag van een …
## 3 3 153 2 het is goed verteld , heeft ook een hist…
## 4 4 239 3 Een prachtig verhaal en uitstekend gesch…
## 5 5 248 5 Het is een goed en snel verteld verhaal …
## 6 6 382 6 -
## 7 7 91 7 Heel de opbouw van het verhaal , de pers…
## 8 8 311 8 Het is de combinatie van vorm en inhoud …
## 9 9 128 10 Wanneer ik denk dat ik een boek nogmaals…
## 10 10 392 11 Beeldend en goed geschreven .
## # ℹ 11,940 more rows
Note that loading dplyr masks the explain function from the package litRiddle, because dplyr has its own explain function. To use litRiddle’s explain function after dplyr has been loaded, call it explicitly, like this: litRiddle::explain("books").
Gathering all motivations for, for instance, one book, requires some trivial merging. Let’s see what people said about Binet’s HhhH. For this we select the book information of the book with ID 46 and we left join (merge) that (book.id by book.id) with the table mots having all the motivations:
mots_hhhh <- merge(x = books[books["book.id"]==46,], y = mots, by = "book.id", all.x = TRUE)
tibble(mots_hhhh)
## # A tibble: 64 × 28
## book.id short.title author title title.english genre riddle.code
## <int> <fct> <fct> <fct> <fct> <fct> <fct>
## 1 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## 2 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## 3 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## 4 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## 5 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## 6 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## 7 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## 8 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## 9 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## 10 46 Binet_Hhhh Binet, Laurent HhhH HhhH Fiction 301-302 (VERT…
## # ℹ 54 more rows
## # ℹ 21 more variables: riddle.code.english <fct>, translated <fct>,
## # gender.author <fct>, origin.author <fct>, original.language <fct>,
## # inclusion.criterion <fct>, publication.date <date>, first.print <int>,
## # publisher <fct>, word.count <int>, type.count <int>,
## # sentence.length.mean <dbl>, sentence.length.variance <dbl>,
## # paragraph.count <int>, sentence.count <int>, paragraph.length.mean <dbl>, …
Hmm… pretty wide table. Select the text column to get an idea of what is being said, and print with the n parameter to see more rows:
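A self-contained sketch that rebuilds `mots_hhhh` and prints only its `text` column. The value `n = 40` is an assumption; any number larger than tibble's default of 10 rows shows more motivations:

```r
suppressMessages(library(dplyr))
library(litRiddle)
data(books)

# One motivation per row, then keep those for book 46 (HhhH).
mots <- motivations.text()
mots_hhhh <- merge(x = books[books$book.id == 46, ], y = mots,
                   by = "book.id", all.x = TRUE)

# Print just the text column, showing up to 40 rows.
print(tibble(mots_hhhh[, "text"]), n = 40)
```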
## # A tibble: 64 × 1
## `mots_hhhh[, "text"]`
## <chr>
## 1 vanwege de verpakking
## 2 Het vertelperspectief , het feit dat de schrijver en onderzoeker ook op zijn…
## 3 Fictie en non-fictie ( geschiedenis ) gemengd , met gebruikmaking van litera…
## 4 Ja , dat kan ik . Hoewel het hier om non-fictie gaat , zet de auteur alle mo…
## 5 Omdat het zo is .
## 6 Op grond van het spel dat Binet speelt met de verwachtingen van de lezer , d…
## 7 Tussen historische roman en geromantiseerde historie .
## 8 Door de vorm waarin het geschreven is . Vooral de beschouwingen over bronnen…
## 9 ' Vanwege het complexe gedachtengoed dat deze roman rijk is taalgebruik is v…
## 10 de verteller neemt mij mee in de beschreven wereld. ik voel me aangesproken …
## 11 Het is fascinerend . Mooi spel met wat werkelijk gebeurd is en wat de schrij…
## 12 Plot en spanning in combinatie met historische context
## 13 Op een bijzondere wijze schakelt Binet regelmatig de lezer zelf in om zijn v…
## 14 Per ongeluk aangevinkt , niet gelezen , sorry
## 15 Het is een geweldig onderzoek en goed geschreven bovendien . Toch is het gee…
## 16 ' Het ging ' ergens over '.'
## 17 -
## 18 Die kwaliteit ligt in de structuur en beeldende taal van het boek . Eeen goe…
## 19 Je moet in dit boek groeien . Dan laat het je niet meer los . Het blijft lan…
## 20 Het boek heeft meerdere lagen , en benadert vanuit een originele invalshoek …
## 21 Als of zelfs H nog een ziel kan hebben .
## 22 mixture fictie en non-fictie
## 23 Het boek zet mij aan tot nadenken en is ' vloeiend ' geschreven .
## 24 Het schrijfproces van de schrijver en zijn verhaal lopen in verschillende ho…
## 25 veel interieure monologen , bizarre gedachtenspinsels , geheimzinnige zinswe…
## 26 het heeft op mij een goede indruk gemaakt over het verleden
## 27 ' Ik baseer me slechts op een enkel hoofdstuk maar voor wat dat aangaat , is…
## 28 Zeer indringend geschreven , het is een buitengewoon gevoelig onderwerp , ma…
## 29 In het boek wordt op verfijnde wijze gespeeld met de grens tussen fictie en …
## 30 Er wordt niet alleen een verhaal verteld , de verteller is ook duidelijk vra…
## 31 schrijftsijl , opbouw
## 32 Ik vind het bijzonder hoe Binet non-fictie en fictie met elkaar weet te comb…
## 33 het is een nieuw genre : de persoonlijke biografie ( combinatie van biografi…
## 34 Echt andere vorm van verhalen , ernstig geschreven . In het Frans gelezen en…
## 35 Het is een verhalend soort non-fictie , gelezen in vertaling , dat wel , waa…
## 36 Interessant spel met het werk van de historicus : bevragen van de eigen meth…
## 37 Nee , dat kan ik moeilijk toelichten . De rol van de auteur zelf binnen het …
## 38 omdat het mij niet boeide
## 39 Ik vond het een vreselijk boek misschien dat ik daarom denk dat het in hoge …
## 40 Binet behandelt niet alleen de geschiedenis van Heydrich , maar ook de manie…
## # ℹ 24 more rows
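The call that produced the listing above is not reproduced in this text. The following is a minimal sketch of the idea on a small stand-in table, not the original code: the data frame `toy` and its values are invented for illustration, and the tibble package is assumed to be installed. It selects one column with bracket indexing and passes `n` to `print` to control how many rows are shown.

```r
library(tibble)

# Hypothetical stand-in for mots_hhhh; the values are invented.
toy <- data.frame(
  book.id = rep(46L, 15),
  text = paste("motivation", 1:15),
  stringsAsFactors = FALSE
)

# Keep only the text column; drop = FALSE keeps it a data frame,
# so the resulting tibble column is simply called "text".
text_only <- as_tibble(toy[, "text", drop = FALSE])

# A tibble prints 10 rows by default; n asks for more.
print(text_only, n = 15)
```

Selecting the column as a plain vector (without `drop = FALSE`) and wrapping it in `tibble()` also works, but the column then takes the selecting expression as its name, as in the header of the listing above.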
If we also want to include review information, this requires another
merge. Rather than trying to combine all data in one huge statement, it
is usually easier to follow a step-by-step method. First, let's collect
the motivations for HhhH, this time being more selective about columns.
If you compare the following query with the merge statement above, you
will find that we use only author and title from books and only the
respondent ID and the motivational text from mots, while we use book.id
from both to match rows for merging.
mots_hhhh = merge(x = books[books["book.id"] == 46, c("book.id", "author", "title")], y = mots[, c("book.id", "respondent.id", "text")], by = "book.id", all.x = TRUE)
tibble(mots_hhhh)
## # A tibble: 64 × 5
## book.id author title respondent.id text
## <int> <fct> <fct> <dbl> <chr>
## 1 46 Binet, Laurent HhhH 361 vanwege de verpakking
## 2 46 Binet, Laurent HhhH 4503 Het vertelperspectief , het feit …
## 3 46 Binet, Laurent HhhH 9910 Fictie en non-fictie ( geschieden…
## 4 46 Binet, Laurent HhhH 1923 Ja , dat kan ik . Hoewel het hier…
## 5 46 Binet, Laurent HhhH 505 Omdat het zo is .
## 6 46 Binet, Laurent HhhH 1242 Op grond van het spel dat Binet s…
## 7 46 Binet, Laurent HhhH 4963 Tussen historische roman en gerom…
## 8 46 Binet, Laurent HhhH 1425 Door de vorm waarin het geschreve…
## 9 46 Binet, Laurent HhhH 968 ' Vanwege het complexe gedachteng…
## 10 46 Binet, Laurent HhhH 8986 de verteller neemt mij mee in de …
## # ℹ 54 more rows
We now have a new view that we can again merge with the information in
the reviews data:
## # A tibble: 64 × 10
## book.id respondent.id author title text quality.read literariness.read
## <int> <dbl> <fct> <fct> <chr> <dbl> <dbl>
## 1 46 1 Binet, Laur… HhhH Ik v… 7 2
## 2 46 1022 Binet, Laur… HhhH Een … 7 6
## 3 46 10278 Binet, Laur… HhhH veel… 7 7
## 4 46 10350 Binet, Laur… HhhH Het … 7 5
## 5 46 10546 Binet, Laur… HhhH Fant… 6 7
## 6 46 10735 Binet, Laur… HhhH Het … 7 7
## 7 46 10980 Binet, Laur… HhhH Stij… 6 7
## 8 46 1121 Binet, Laur… HhhH ' He… 6 6
## 9 46 11443 Binet, Laur… HhhH omda… 4 NA
## 10 46 11586 Binet, Laur… HhhH Op e… 7 7
## # ℹ 54 more rows
## # ℹ 3 more variables: quality.notread <dbl>, literariness.notread <dbl>,
## # book.read <dbl>
Note how we use a vector for by to ensure we match on both book ID and
respondent ID. If we used only book.id, we would get all scores for that
book by all respondents, whereas we want only the scores given by the
particular respondents who motivated their rating.
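The difference between matching on one key and on two can be seen on a small example. This is a sketch on invented data (the tables `mots_toy` and `reviews_toy` are hypothetical stand-ins, not the package data), using only base R `merge`:

```r
# Hypothetical stand-ins for the motivations view and the reviews table.
mots_toy <- data.frame(
  book.id = c(46, 46),
  respondent.id = c(1, 1022),
  text = c("first motivation", "second motivation"),
  stringsAsFactors = FALSE
)
reviews_toy <- data.frame(
  book.id = c(46, 46, 46, 7),
  respondent.id = c(1, 1022, 5000, 1),
  quality.read = c(7, 7, 5, 3)
)

# Match on both keys: one row per motivation, with that respondent's score.
both_keys <- merge(mots_toy, reviews_toy,
                   by = c("book.id", "respondent.id"), all.x = TRUE)

# Match on book.id only: every motivation is paired with every rating
# of the book, regardless of who gave it.
book_only <- merge(mots_toy, reviews_toy, by = "book.id", all.x = TRUE)

nrow(both_keys)  # 2
nrow(book_only)  # 6
```

In the second result the respondent columns come back as `respondent.id.x` and `respondent.id.y`, because `merge` adds suffixes to same-named columns that are not used as keys.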
And, being sceptical as we always should be about our strategies, let us check that we didn’t miss anything by verifying that respondent 1022 indeed had only one rating for HhhH:
## respondent.id book.id quality.read literariness.read quality.notread
## 33356 1022 46 7 6 NA
## literariness.notread book.read
## 33356 NA 1
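The check itself is not shown in this text. One way to perform it, sketched here on invented data (`reviews_toy` is a hypothetical stand-in for the reviews table), is to subset on both IDs and count the rows. The last line widens the check to any repeated respondent and book pair.

```r
# Hypothetical stand-in for the reviews table; the values are invented.
reviews_toy <- data.frame(
  respondent.id = c(1, 1022, 1022, 5000),
  book.id = c(46, 46, 7, 46),
  quality.read = c(7, 7, 4, 5)
)

# Rows for one respondent and one book.
one_pair <- reviews_toy[reviews_toy$respondent.id == 1022 &
                          reviews_toy$book.id == 46, ]
nrow(one_pair)

# Broader check: does any (respondent, book) pair occur more than once?
any_repeats <- any(duplicated(reviews_toy[, c("respondent.id", "book.id")]))
any_repeats
```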
Suppose we want to look into word frequencies of motivations. We can
use base R table to get an idea of how often each combination of lemma
and POS tag appears in the motivations:
toks = motivations # Remember: that is a *token* table, one token + lemma + POS tag per row.
head(table(toks$lemma, toks$upos), n = 30)
##
## ADJ ADP ADV CONJ INTJ NOUN NUM PRON_DET PROPN PUNCT VERB X
## ! 0 0 0 0 0 0 0 0 0 175 0 0
## !! 0 0 0 0 0 0 0 0 0 12 0 0
## !!! 0 0 0 0 0 0 0 0 0 6 0 0
## !!!! 0 0 0 0 0 0 0 0 0 1 0 0
## !!' 0 0 0 0 0 0 0 0 0 4 0 0
## !' 0 0 0 0 0 0 0 0 0 8 0 0
## !', 0 0 0 0 0 0 0 0 0 1 0 0
## !) 0 0 0 0 0 0 0 0 0 5 0 0
## !)' 0 0 0 0 0 0 0 0 0 1 0 0
## !, 0 0 0 0 0 0 0 0 0 1 0 0
## # 0 0 0 0 0 0 0 0 0 3 0 0
## % 0 0 0 0 0 0 0 0 0 4 0 0
## & 0 0 0 0 0 0 0 0 0 4 0 0
## ' 0 0 0 0 0 0 0 0 0 1702 0 0
## '' 0 0 0 0 0 0 0 0 0 40 0 0
## ') 0 0 0 0 0 0 0 0 0 6 0 0
## ', 0 0 0 0 0 0 0 0 0 51 0 0
## '. 0 0 0 0 0 0 0 0 0 86 0 0
## '.' 0 0 0 0 0 0 0 0 0 18 0 0
## '... 0 0 0 0 0 0 0 0 0 2 0 0
## ': 0 0 0 0 0 0 0 0 0 5 0 0
## '? 0 0 0 0 0 0 0 0 0 2 0 0
## ( 0 0 0 0 0 0 0 0 0 596 0 0
## (' 0 0 0 0 0 0 0 0 0 5 0 0
## (- 0 0 0 0 0 0 0 0 0 1 0 0
## (?) 0 0 0 0 0 0 0 0 0 3 0 0
## (?), 0 0 0 0 0 0 0 0 0 1 0 0
## ) 0 0 0 0 0 0 0 0 0 349 0 0
## )!!!, 0 0 0 0 0 0 0 0 0 1 0 0
## )' 0 0 0 0 0 0 0 0 0 5 0 0
Wow, respondents are creative about using punctuation! In the
interest of completeness we chose not to clean out all those emoticons
from the data set. However, here we don’t need those. So we filter, and
sort. The code in the next cell is not trivial if you are new to R or to
regular expressions. Hopefully the inserted comments will clarify a bit.
Note, just in case you run into puzzling errors: this uses the filter
function from dplyr (dplyr::filter), as we loaded dplyr above. The base R
filter (stats::filter) does something else entirely (it filters time
series), so it requires a different approach.
# keep only tokens whose lemma contains at least one word character;
# we use the regular expression "\w+", which means "one or more word characters".
# The extra backslash is needed because R itself treats a single backslash
# in a string as an escape character.
mots = filter(motivations, grepl('\\w+', lemma))
# create a data frame out of a table of raw frequencies.
# Look up 'table function' in R documentation.
mots = data.frame(table(mots$lemma, mots$upos))
# use interpretable column names
colnames(mots) = c("lemma", "upos", "freq")
# select only useful information, i.e. those lemma+pos combinations
# that appear more than 0 times
mots = mots[mots['freq'] > 0, ]
# sort from most used to least used
mots = mots[order(mots$freq, decreasing = TRUE), ]
# finally show as a nicer looking table
tibble(mots)
## # A tibble: 10,749 × 3
## lemma upos freq
## <fct> <fct> <int>
## 1 het PRON_DET 9889
## 2 de PRON_DET 6983
## 3 een PRON_DET 5586
## 4 ik PRON_DET 5210
## 5 en CONJ 5017
## 6 is VERB 4756
## 7 van ADP 3766
## 8 niet ADV 3730
## 9 boek NOUN 3573
## 10 in ADP 2917
## # ℹ 10,739 more rows
And rather unsurprisingly it is the pronouns and other function words that lead the pack.
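For readers who prefer not to depend on dplyr, the same filter-count-sort sequence can be written with base R subsetting. The following is a sketch on an invented token table (`toks_toy` is hypothetical), not the code used for the output above:

```r
# Hypothetical miniature token table; the values are invented.
toks_toy <- data.frame(
  lemma = c("boek", "!", "het", "!!!", "boek", "(?)"),
  upos  = c("NOUN", "PUNCT", "PRON_DET", "PUNCT", "NOUN", "PUNCT"),
  stringsAsFactors = FALSE
)

# Base R counterpart of filter(toks_toy, grepl("\\w+", lemma)):
# logical row indexing keeps lemmas containing a word character.
words_only <- toks_toy[grepl("\\w+", toks_toy$lemma), ]

# Count lemma + POS combinations; naming the table dimensions and the
# count column avoids a separate colnames() step.
freq <- as.data.frame(table(lemma = words_only$lemma, upos = words_only$upos),
                      responseName = "freq", stringsAsFactors = FALSE)

# Drop combinations that never occur and sort from most to least used.
freq <- freq[freq$freq > 0, ]
freq <- freq[order(freq$freq, decreasing = TRUE), ]
freq
```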
For another exercise, let’s look up something about the lemma “boek” (en. “book”):
## lemma upos freq
## 50542 boek NOUN 3573
## 109576 boek X 3
Linguistic parsers are not infallible. Apparently in three cases the parser did not know how to classify the word “boek”, in which case the POS tag handed out is “X”. Can we find the contexts where those linguistic unknowns were used? For this, we first find the token rows, and with them the motivation IDs, where this happened:
# First we find the motivation IDs from the books where this happens.
boekx = motivations[motivations["lemma"] == "boek" & motivations["upos"] == "X", ]
boekx
## motivation.id respondent.id book.id paragraph.id sentence.id token lemma
## 317 12 14 159 1 1 boek boek
## 144901 8132 9288 107 1 1 boek boek
## 147507 8258 9431 62 1 2 boek boek
## upos
## 317 X
## 144901 X
## 147507 X
Now we need the full texts of all motivations, so we can find those three motivations we are looking for.
To find the three motivations we merge the boekx table and the table with
all the motivations, and we keep only those rows that pertain to the three
motivation IDs. I.e. we merge on by="motivation.id" with all.x=TRUE,
implying that we will keep all rows from x (i.e. the three motivations
with the “boek” POS tagged as “X”) and that we will drop all non-related
rows from y (i.e. all those motivations that do not have those
linguistically unknown “boek” mentions).
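The merge described here is not shown in this text. A minimal sketch of the same pattern on invented data follows; `boekx_toy` and `mots_toy` are hypothetical stand-ins and their values are made up.

```r
# Hypothetical stand-ins: two flagged tokens and a table of full motivations.
boekx_toy <- data.frame(
  motivation.id = c(12, 8132),
  respondent.id = c(14, 9288),
  book.id = c(159, 107)
)
mots_toy <- data.frame(
  motivation.id = c(12, 300, 8132),
  respondent.id = c(14, 77, 9288),
  book.id = c(159, 46, 107),
  text = c("text of motivation 12", "an unrelated motivation",
           "text of motivation 8132"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of x; rows of y without a match in x
# (here motivation 300) do not appear in the result.
contexts <- merge(x = boekx_toy, y = mots_toy,
                  by = "motivation.id", all.x = TRUE)

# Columns present in both tables but not used as key get .x / .y suffixes.
contexts[, c("book.id.x", "respondent.id.x", "text")]
```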
And finally we show what those contexts are:
## # A tibble: 3 × 3
## book.id.x respondent.id.x text
## <dbl> <dbl> <chr>
## 1 159 14 maakt me niet uit of het literair is als ik het eee…
## 2 107 9288 Het is een feel good boek , er is geen literaire wa…
## 3 62 9431 Is net een graadje beter dan 3 stuivers romans , ma…
And just for good measure the full text of the third mention:
## [1] "Is net een graadje beter dan 3 stuivers romans , maar het blijft lectuur , verstrooiing , boeken die je zeer waarschijnlijk geen tweede keer leest want dat weet je al wie elkaar krijgen en hoe . Feel good boek ."
Future versions of the litRiddle package will support Likert plots.
Visit https://github.com/jbryer/likert to learn more about the general
idea and the implementation in R.
Future versions of the litRiddle package will support topic modeling of
the motivations given by the reviewers.
Each function provided by the package has its own help page; the same applies to the datasets:
All the datasets use UTF-8, an encoding of Unicode. This should normally not cause any problems on macOS and Linux machines, but Windows might be more tricky in this respect. We haven’t experienced any inconveniences in our testing environment, but we cannot guarantee the same for all other machines.
Karina van Dalen-Oskam (2023). The Riddle of Literary Quality: A Computational Approach. Amsterdam University Press.
Karina van Dalen-Oskam (2021). Het raadsel literatuur. Is literaire kwaliteit meetbaar? Amsterdam University Press.
Maciej Eder, Saskia Lensink, Joris van Zundert, Karina van Dalen-Oskam (2022). Replicating The Riddle of Literary Quality: The litRiddle package for R, in: Digital Humanities 2022 Conference Abstracts. The University of Tokyo, Japan, 25–29 July 2022, p. 636–637 https://dh2022.dhii.asia/dh2022bookofabsts.pdf
Corina Koolen, Karina van Dalen-Oskam, Andreas van Cranenburgh, Erica Nagelhout (2020). Literary quality in the eye of the Dutch reader: The National Reader Survey. Poetics 79: 101439, https://doi.org/10.1016/j.poetic.2020.101439.
More publications from the project: see https://literaryquality.huygens.knaw.nl/?page_id=588.