--- title: "Literary Quality of Dutch Novels with litRiddle" author: - Maciej Eder - Joris van Zundert - Karina van Dalen-Oskam - Saskia Lensink output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{The Riddle of Literary Quality} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Introduction This vignette explains basic functionalities of the package `litRiddle`, a part of the Riddle of Literary Quality project. The package contains the data of a reader survey about fiction in Dutch, a description of the novels the readers rated, and the results of stylistic measurements of the novels. The package also contains functions to combine, analyze, and visualize these data. See: [https://literaryquality.huygens.knaw.nl/](https://literaryquality.huygens.knaw.nl/) for further details. Information in Dutch about the package can be found at [https://karinavdo.github.io/RaadselLiteratuur/02_07_data_en_R_package.html](https://karinavdo.github.io/RaadselLiteratuur/02_07_data_en_R_package.html). These data are also available as individual csv files for persons wanting to work with the data in non R environments. See: https://github.com/karinavdo/RiddleData. If you use `litRiddle` in your academic publications, please consider citing the following references: **Maciej Eder, Lensink, S., van Zundert, J.J., and van Dalen-Oskam, K.H.** (2022). Replicating The Riddle of Literary Quality: The LitRiddle Package for R. In _Digital Humanities 2022 Conference Abstracts_, 636--637. Tokyo: The University of Tokyo / DH2022 Local Organizing Committee. https://dh2022.dhii.asia/abstracts/files/VAN_DALEN_OSKAM_Karina_Replicating_The_Riddle_of_Literary_Qu.html **Karina van Dalen-Oskam** (2023). _The Riddle of Literary Quality: A Computational Approach._ Amsterdam University Press. ## Installation Install the package from the CRAN repository: ``` R install.packages("litRiddle") ``` Alternatively, try installing it directly from the GitHub repository: ``` R library(devtools) install_github("karinavdo/LitRiddleData", build_vignettes = TRUE) ``` ## Usage First, one has to activate the package so that its functions become visible to the user: ``` {r warning = FALSE} library(litRiddle) ``` ## The dataset To activate the dataset, type one of the following lines (or all of them): ``` {r} data(books) data(respondents) data(reviews) data(motivations) data(frequencies) ``` From now on, the dataset, divided into four data tables, is visible for the user. Please note that the functions discussed below do not need the dataset to be activated (they take care of it themselves), therefore you don't have to remember about this step if you plan to analyze the data using the functions from the package. Time to explore some of the data tables. This generic function will list all the data points from the table `books`: ``` {r eval = FALSE} books ``` This command will dump quite a lot of stuff on the screen offering little insight or overview. It's usually a better idea to select one portion of information at a time, usually one variable or one observation. We assume here that the user has some basic knowledge about R, and particularly s/he knows how to access values in vectors and tables (matrices). To get the titles of the books scored in the survey (or, say, the first 10 titles), one might type: ``` {r} books$title[1:10] ``` Well, but how do I know that the name of the particular variable I want to get is `title`, rather than anything else? There exists a function that lists all the variables from the three data tables. ## Print column names The function that creates a list of all the column names from all three datasets is named `get.columns()` and needs no arguments to be run. What it means is that you simply type the following code, remembering about the parentheses at the end of the function: ``` {r} get.columns() ``` Not bad indeed. However, how can I know what `s.4a2` stands for? ## Explain variables Function that lists an short explanation of what the different column names refer to and what their levels consist of is called `explain()`. To work properly, this function needs an _argument_ to be passed, which basically mean that the user has to specify which dataset s/he is interested in. The options are as follows: ``` {r} explain("books") explain("reviews") explain("respondents") explain("motivations") explain("frequencies") ``` ## Combine data from books, survey, reviews The package provides a function to combine all information of the survey, reviews, and books into one big dataframe. The user can specify whether or not s/he wants to also load the freqTable with the frequency counts of the word n-grams of the books. Combine and load all data from the books, respondents and reviews into a new dataframe (tibble format) ``` {r} dat = combine.all(load.freq.table = FALSE) ``` Combine and load all data from the books, respondents and reviews into a new dataframe (tibble format), and additionally also load the frequency table of all word 1grams of the corpus used. ``` {r} dat = combine.all(load.freq.table = TRUE) ``` ## Find dataset Return the name of the dataset where a column can be found. ``` {r} find.dataset("book.id") find.dataset("age.resp") ``` It's useful to combine it with the already-discussed function `get.columns()`. ## Make table (and plot it!) Make a table of frequency counts for one variable, and plot a histogram of the results. Not sure which variable you want to plot? Invoke the above-discussed function `get.columns()` once more, to see which variables you can choose from: ``` {r eval = FALSE} get.columns() ``` Now the fun stuff: ``` {r} make.table(table.of = 'age.resp') ``` You can also adjust the x label, y label, title, and colors: ``` {r} make.table(table.of = 'age.resp', xlab = 'age respondent', ylab = 'number of people', title = 'Distribution of respondent age', barcolor = 'red', barfill = 'white') ``` Note: please mind that in the above examples we used single quotes to indicate arguments (e.g. `xlab = 'age respondent'`), whereas at the beginning of the document, we used double quotes (`explain("books")`). We did it for a reason, namely we wanted to emphasize that the functions provided by the package `litRiddle` are fully compliant with the generic R syntax, which allows for using either single or double quotes to indicate the strings. ## Make table of X split by Y ``` {r} make.table2(table.of = 'age.resp', split = 'gender.resp') ``` ``` {r} make.table2(table.of = 'literariness.read', split = 'gender.author') ``` Note that you can only provide an argument to the 'split' variable that has less than 31 unique values, to avoid uninterpretable outputs. E.g., consider the following code: ``` {r} make.table2(table.of = 'age.resp', split = 'zipcode') ``` You can also adjust the x label, y label, title, and colors: ``` {r} make.table2(table.of = 'age.resp', split = 'gender.resp', xlab = 'age respondent', ylab = 'number of people', barcolor = 'purple', barfill = 'yellow') ``` ``` {r} make.table2(table.of = 'literariness.read', split = 'gender.author', xlab = 'Overall literariness scores', ylab = 'number of people', barcolor = 'black', barfill = 'darkred') ``` ## Order responses The orginal survey about Dutch fiction was designed to rank the responses using descriptive terms, e.g. "very bad", "neutral", "a bit good" etc. In order to conduct the analyses, the responses were then converted to numerical scales ranging from 1 to 7 (the questions about literariness and literary quality) or from 1 to 5 (the questions about the reviewer's reading patterns). However, if you want the responses converted back to their original form, invoke the function `order.responses()` that transforms the survey responses into ordered factors. Use either "bookratings" or "readingbehavior" to specify which of the survey questions needs to be changed into ordered factors. (We assume here that the user knows what the ordered factors are, because otherwise this function will not seem very useful). Levels of `quality.read` and `quality.notread`: "very bad", "bad", "a bit bad", "neutral", "a bit good", "good", "very good", "NA". Levels `literariness.read` and `literariness.notread`: "absolutely not literary", "non-literary", "not very literary", "between literary and non-literary","a bit literary", "literary", "very literary", "NA". Levels statements 4/12: "completely disagree", "disagree", "neutral", "agree", "completely agree", "NA". To create a data frame with ordered factor levels of the questions on reading behavior: ``` {r} dat.reviews = order.responses('readingbehavior') str(dat.reviews) ``` To create a data frame with ordered factor levels of the book ratings: ``` {r} dat.ratings = order.responses('bookratings') str(dat.ratings) ``` ## Frequencies The data frame `frequencies` contains numerical values for word frequencies of the 5000 most frequent words (in a descending order of frequency) of 401 literary novels in Dutch. The table contains relative frequencies, meaning that the original word occurrences from a book were divided by the total number of words of the book in question. The measurments were obtained using the R package `stylo`, and were later rounded to the 5th digit. To learn more about the novels themselves, type `help(books)`. The row names of the `frequencies` data frame contain the titles of the novels corresponding to the `title.short` column in the data frame `books`. ``` {r} rownames(frequencies)[10:20] ``` Listing the relative frequency values for the novel _Weerzin_ by Rene Appel: ``` {r} frequencies["Appel_Weerzin",][1:10] ``` And getting the book information: ``` {r} books[books["short.title"]=="Appel_Weerzin",] ``` ## Motivations Version 1.0 of the package introduces a table `motivations`, containing the 200k+ lemmatized and POS tagged tokens making up the text of all motivations. The Dutch Language Institute (INT, Leiden) took care of POS-tagging the data. The tagging was manually corrected by Karina van Dalen-Oskam. We tried to guarantee the highest possible quality, but mistakes may still occur. The solution to add a token based table was chosen to not burden the table `reviews` with lots of text, XML, or JSON in additional columns, leading to potential problems with default memory constraints in R. To retrieve all tokens: ``` {r} data(motivations) head(motivations, 15) ``` ### From tokens to text Usually one will probably want to work with the full text of motivations. A convenience function `motivations.text()` is provided to create a view that has one motivation per row: ``` {r} # We're importing `dplyr` to use `tibble` so we can # show very large tables somewhat nicer. suppressMessages(library(dplyr)) mots = motivations.text() tibble(mots) ```
explain
function from the package litRiddle because it has its own explain function. To use litRiddle's explain function after dplyr has been loaded, call it explicitly, like this: litRiddle::explain("books")
.