{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### The Impact of Scale on Content Analysis of Goodreads Reviews\n", "\n", "- We use content analysis, a quantitative method for analysing the content of reviews\n", "- We compare subsets of reviews with different types of focus and at different scales (from 1 to 100 to 10,000 to 1 million reviews)\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/marijnkoolen/Code/Huygens/scale\n" ] } ], "source": [ "# Automatically reload imported modules when they change, which is\n", "# convenient while the project's own modules are still in development.\n", "%reload_ext autoreload\n", "%autoreload 2\n", "\n", "# This is needed to add the repo dir to the path so jupyter\n", "# can load the modules in the scripts directory from the notebooks\n", "import os\n", "import sys\n", "repo_dir = os.path.split(os.getcwd())[0]\n", "print(repo_dir)\n", "if repo_dir not in sys.path:\n", "    sys.path.append(repo_dir)\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import json\n", "import csv\n", "\n", "data_dir = '../data/GoodReads'\n", "\n", "books_10k_file = os.path.join(data_dir, 'goodreads_reviews-books_above_10k_lang_reviews.csv.gz')\n", "reviewers_5k_file = os.path.join(data_dir, 'goodreads_reviews-reviewers_above_5k_reviews.csv.gz')\n", "random_1M_file = os.path.join(data_dir, 'goodreads_reviews-random_sample_1M.csv.gz')\n", "author_file = os.path.join(data_dir, 'goodreads_book_authors.csv.gz') # author information\n", "book_file = os.path.join(data_dir, 'goodreads_books.csv.gz') # basic book metadata\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading and Merging Data\n", "\n", "We start with a subset of reviews for frequently reviewed books. To see how this subset was created, go to the [Filtering Goodreads reviews](./Filtering-Goodreads-Reviews.ipynb) notebook. 
This subset contains all reviews for books that have at least 10,000 reviews each. \n", "\n", "We first load the reviews into a Pandas dataframe, then add metadata for the reviewed books from some of the datasets with book metadata." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
121930 rows × 17 columns
" ], "text/plain": [ " Unnamed: 0 user_id book_id \\\n", "0 0 8842281e1d1347389f2ab93d60773d4d 2767052 \n", "1 1 704eb93a316aff687a93d5215882eb21 2767052 \n", "2 2 4b3636a043e5c99fa27ac897ccfa1151 2767052 \n", "3 3 012aa353140af13109d00ca36cdc0637 2767052 \n", "4 4 2f6af21d14c83a5df6cdcef5e6af0b3e 2767052 \n", "... ... ... ... \n", "121925 121972 d168e4a91a8cb0795d72d0adbe9a5897 10818853 \n", "121926 121973 d43b94b7a0a02e0bbaa6b93b884a0c9d 10818853 \n", "121927 121974 43202656e9c338bb711afbc7136ab344 10818853 \n", "121928 121975 d94c83867337514c94738b57a1d19677 10818853 \n", "121929 121976 e60fcbb1c70ed4f383145efcae21c7ac 10818853 \n", "\n", " review_id rating \\\n", "0 248c011811e945eca861b5c31a549291 5 \n", "1 c52e231744768e9d7f939d1cbeb87666 5 \n", "2 89f5c6ed51ba6f70d3955a620f9af830 5 \n", "3 77fa951667b104fd565d5bd6c760437b 5 \n", "4 46f876086c1e378859f889e87d1e6e5c 4 \n", "... ... ... \n", "121925 a72358e15220c703fbcd1a61ceb60ea6 3 \n", "121926 f35af15602f353e3c4b8b357ca2cfd01 4 \n", "121927 0931f46ea40d06bb201410a1c465b2ff 2 \n", "121928 bf6e6e995804cd92d2e0f66a0fe4c5d8 5 \n", "121929 6b298c960776d63607d06023ad38b567 4 \n", "\n", " date_added date_updated \\\n", "0 Wed Jan 13 13:38:25 -0800 2010 Wed Mar 22 11:46:36 -0700 2017 \n", "1 Fri Jul 20 13:59:12 -0700 2012 Sun Aug 23 20:49:13 -0700 2015 \n", "2 Thu Jun 09 22:05:49 -0700 2011 Fri Sep 13 08:47:42 -0700 2013 \n", "3 Sun Nov 04 18:57:00 -0800 2012 Mon Apr 15 12:57:23 -0700 2013 \n", "4 Thu Jun 07 10:31:00 -0700 2012 Thu Jun 07 10:33:17 -0700 2012 \n", "... ... ... 
\n", "121925 Tue Aug 06 16:05:58 -0700 2013 Tue Aug 06 16:06:37 -0700 2013 \n", "121926 Sat Jun 16 04:02:33 -0700 2012 Sat Jun 16 04:03:52 -0700 2012 \n", "121927 Sun Nov 11 01:28:33 -0800 2012 Sun Nov 11 01:29:46 -0800 2012 \n", "121928 Sat Sep 08 09:20:43 -0700 2012 Wed Dec 26 03:13:01 -0800 2012 \n", "121929 Tue Jul 21 03:53:31 -0700 2015 Sun Jul 26 09:25:26 -0700 2015 \n", "\n", " read_at started_at \\\n", "0 Sun Mar 25 00:00:00 -0700 2012 Fri Mar 23 00:00:00 -0700 2012 \n", "1 Sat Feb 18 00:00:00 -0800 2012 NaN \n", "2 Tue Jul 05 00:00:00 -0700 2011 Mon Jul 04 00:00:00 -0700 2011 \n", "3 Sun Apr 14 00:00:00 -0700 2013 NaN \n", "4 Mon Apr 16 00:00:00 -0700 2012 NaN \n", "... ... ... \n", "121925 NaN NaN \n", "121926 Fri Jun 08 00:00:00 -0700 2012 NaN \n", "121927 NaN NaN \n", "121928 NaN NaN \n", "121929 Fri Jul 24 00:00:00 -0700 2015 Tue Jul 21 00:00:00 -0700 2015 \n", "\n", " n_votes n_comments review_length \\\n", "0 24 25 1326 \n", "1 0 0 31 \n", "2 0 0 201 \n", "3 0 0 1523 \n", "4 0 0 98 \n", "... ... ... ... \n", "121925 0 0 107 \n", "121926 1 0 45 \n", "121927 0 0 118 \n", "121928 0 0 296 \n", "121929 0 0 274 \n", "\n", " review_text author_id \\\n", "0 I cracked and finally picked this up. Very enj... 153394 \n", "1 Exciting, fun, entertaining! :) 153394 \n", "2 This was the perfect quick read for a beach va... 153394 \n", "3 The United States (and I assume most other soc... 153394 \n", "4 A page turner. Since I hate reality TV I value... 153394 \n", "... ... ... \n", "121925 Very shocking content. Not well written. Makes... 4725841 \n", "121926 A wonderful, if slightly twisted, love story. 4725841 \n", "121927 Read to see what all the hype was about. Mills... 4725841 \n", "121928 This book killed the little innocence in me. I... 4725841 \n", "121929 I actually to my own surprise, enjoyed this bo... 
4725841 \n", "\n", " title author_name review_lang \n", "0 The Hunger Games (The Hunger Games, #1) Suzanne Collins en \n", "1 The Hunger Games (The Hunger Games, #1) Suzanne Collins en \n", "2 The Hunger Games (The Hunger Games, #1) Suzanne Collins en \n", "3 The Hunger Games (The Hunger Games, #1) Suzanne Collins en \n", "4 The Hunger Games (The Hunger Games, #1) Suzanne Collins en \n", "... ... ... ... \n", "121925 Fifty Shades of Grey (Fifty Shades, #1) E.L. James en \n", "121926 Fifty Shades of Grey (Fifty Shades, #1) E.L. James en \n", "121927 Fifty Shades of Grey (Fifty Shades, #1) E.L. James en \n", "121928 Fifty Shades of Grey (Fifty Shades, #1) E.L. James en \n", "121929 Fifty Shades of Grey (Fifty Shades, #1) E.L. James en \n", "\n", "[121930 rows x 17 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the review dataframe\n", "review_df = pd.read_csv(books_10k_file, sep='\\t', compression='gzip')\n", "\n", "review_df" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from dateutil.parser import parse, tz\n", "\n", "def parse_date(date_str):\n", " try:\n", " return parse(date_str).astimezone(utc)\n", " except TypeError:\n", " return None\n", "\n", "utc = tz.gettz('UTC')\n", "\n", "review_df['date_added'] = review_df.date_added.apply(parse_date)\n", "review_df['date_updated'] = review_df.date_updated.apply(parse_date)\n", "review_df['read_at'] = review_df.read_at.apply(parse_date)\n", "review_df['started_at'] = review_df.started_at.apply(parse_date)\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# get a list of book ids that are in the review dataset\n", "review_book_ids = set(review_df.book_id.unique())\n", "\n", "# load basic book metadata (only book and author id and book title)\n", "bookmeta_df = pd.read_csv(book_file, sep='\\t', compression='gzip', usecols=['book_id', 'work_id', 
'author_id', 'title'])\n", "\n", "# filter the book metadata to only the book ids in the review dataset\n", "bookmeta_df = bookmeta_df[bookmeta_df.book_id.isin(review_book_ids)]\n", "\n", "# load the author metadata to get author names \n", "author_df = pd.read_csv(author_file, sep='\\t', compression='gzip', usecols=['author_id', 'name'])\n", "author_df = author_df.rename(columns={'name': 'author_name'})\n", "\n", "# merge the book and author metadata into a single dataframe, \n", "# keeping only author names for books in the review dataset\n", "metadata_df = pd.merge(bookmeta_df, author_df, on='author_id', how='left')\n", "\n", "# merge the review dataset with the book metadata\n", "review_df = pd.merge(review_df, metadata_df, on='book_id')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We remove empty reviews as they are non-reviews (see [Filtering Goodreads Reviews](./Filtering-Goodreads-Reviews.ipynb) for details on how and why we do this)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of empty reviews: 0\n" ] } ], "source": [ "print('Number of empty reviews:', len(review_df[review_df.review_length == 0]))\n", "review_df = review_df[review_df.review_length > 0]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# This step writes the current dataframe to file, \n", "# so all the merging steps can be skipped in reruns of the notebook\n", "merged_data_file = os.path.join(data_dir, 'goodreads_reviews-books_above_10k.merged.csv.gzip')\n", "review_df.to_csv(merged_data_file, sep='\\t', compression='gzip')\n", "\n", "#review_df = pd.read_csv(merged_data_file, sep='\\t', compression='gzip')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset contains reviews for nine books that each have at least 10,000 reviews:" ] }, { 
"cell_type": "code", "execution_count": 290, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "author_name title \n", "E.L. James Fifty Shades of Grey (Fifty Shades, #1) 11176\n", "John Green The Fault in Our Stars 20738\n", "Markus Zusak The Book Thief 11297\n", "Paula Hawkins The Girl on the Train 13401\n", "Stephenie Meyer Twilight (Twilight, #1) 10532\n", "Suzanne Collins Catching Fire (The Hunger Games, #2) 11900\n", " Mockingjay (The Hunger Games, #3) 13534\n", " The Hunger Games (The Hunger Games, #1) 18613\n", "Veronica Roth Divergent (Divergent, #1) 10739\n", "dtype: int64" ] }, "execution_count": 290, "metadata": {}, "output_type": "execute_result" } ], "source": [ "review_df.groupby(['author_name', 'title']).size()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suzanne Collins has three books, all part of the same trilogy, among the most frequently reviewed books:" ] }, { "cell_type": "code", "execution_count": 291, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Suzanne Collins 44047\n", "John Green 20738\n", "Paula Hawkins 13401\n", "Markus Zusak 11297\n", "E.L. 
James 11176\n", "Veronica Roth 10739\n", "Stephenie Meyer 10532\n", "Name: author_name, dtype: int64" ] }, "execution_count": 291, "metadata": {}, "output_type": "execute_result" } ], "source": [ "review_df.author_name.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are reviews in different languages:" ] }, { "cell_type": "code", "execution_count": 292, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "en 113338\n", "es 1650\n", "af 666\n", "id 624\n", "unknown 516\n", "it 486\n", "de 450\n", "tl 385\n", "cy 331\n", "fr 302\n", "so 283\n", "pt 270\n", "sv 254\n", "nl 252\n", "sl 245\n", "no 227\n", "ro 213\n", "ca 186\n", "pl 172\n", "sw 156\n", "da 155\n", "tr 124\n", "et 107\n", "hr 103\n", "vi 89\n", "sk 86\n", "hu 66\n", "cs 63\n", "sq 46\n", "fi 45\n", "lt 23\n", "lv 17\n", "Name: review_lang, dtype: int64" ] }, "execution_count": 292, "metadata": {}, "output_type": "execute_result" } ], "source": [ "review_df.review_lang.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For content analysis, we'll remove the non-English reviews, so content can be more easily compared across reviews." ] }, { "cell_type": "code", "execution_count": 293, "metadata": {}, "outputs": [], "source": [ "review_df = review_df[review_df.review_lang == 'en']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we compare how the reviews are spread over time, for all books together and per book." ] }, { "cell_type": "code", "execution_count": 379, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.rcParams['figure.figsize'] = [15, 5]\n", "\n", "# group all reviews by the year and month in which they were last updated\n", "g = review_df.groupby([review_df.date_updated.dt.year, review_df.date_updated.dt.month]).size()\n", "# plot the number of reviews per month as a bar chart\n", "ax = g.plot(kind='bar')\n", "# update the ticks on the x-axis so that they remain readable...\n", "ax.set_xticks(range(len(g)));\n", "# ... with only a tick label for January of each year\n", "ax.set_xticklabels([\"%s-%02d\" % item if item[1] == 1 else '' for item in g.index.tolist()], rotation=90);\n", "plt.gcf().autofmt_xdate()\n", "plt.xlabel('Review month')\n", "plt.ylabel('Number of reviews')\n", "plt.show()\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first reviews are from late 2007, the last from late 2017. The plot shows that the total number of reviews for these nine books increased from late 2007 with a sudden jump in 2012 and with another jump in 2014. However, with the current scale (over 100,000 reviews) and focus (reviews for nine popular books) we don't see differences in patterns per book. We shift our focus by creating views on numbers of reviews per book." ] }, { "cell_type": "code", "execution_count": 353, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 353, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Group the number of reviews by year and by book title\n", "g = review_df.groupby([review_df.date_updated.dt.year, 'title']).size()\n", "# turn the titles into columns; the count is zero for years in which a book has no reviews\n", "u = g.unstack('title').fillna(0)\n", "for title in review_df.title.unique():\n", "    # divide the number of reviews for a book in a certain \n", "    # year by the number of reviews over all years to get proportions\n", "    u[title] = u[title] / sum(u[title])\n", "# plot as bar chart\n", "u.plot(kind='bar')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We notice that there are some marked differences in how the reviews of a book are spread over time. For some books, there is a large burst just after release (especially *Fifty Shades of Grey*, with 50% of its reviews in 2012, after which the number of reviews drops off rapidly), while for others the reviews are more spread out, like *Twilight* and particularly *The Book Thief*, which was released in 2005, had a small fraction of its reviews in 2007, but got an increasing number of reviews up to a peak in 2014, a full 9 years after its release, and was still receiving many reviews in 2017. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start with analysing the reviews for a single book. A random pick from the book ids:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7260188" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.choice(list(review_book_ids))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a new dataframe by **selecting** only the reviews for the randomly selected book." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "18613 Mockingjay (The Hunger Games, #3)\n", "Name: title, dtype: object" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_id = 7260188\n", "book_df = review_df[review_df.book_id == book_id]\n", "book_df.title.drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The chosen book is *Mockingjay*, the third book in *The Hunger Games* trilogy by Suzanne Collins. Let's start with a quick look at the ratings to know if we can expect positive and/or negative reviews:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5 4817\n", "4 4084\n", "3 2834\n", "2 1133\n", "1 363\n", "0 303\n", "Name: rating, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "book_df.rating.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ratings of zero are not actual ratings, but non-ratings, i.e. the reviewer wrote a review but provided no explicit rating. 
" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "plt.rcParams['figure.figsize'] = [15, 5]\n", "\n", "# The .dt accessor only works on datetime-like values, so convert first\n", "# (the dates are plain text if, e.g., the merged file was reloaded from disk)\n", "year_added = pd.to_datetime(book_df.date_added, utc=True).dt.year\n", "\n", "# count the reviews per year and rating, with zero for missing combinations\n", "g = book_df.groupby([year_added, 'rating']).size()\n", "u = g.unstack('date_added').fillna(0)\n", "print('year\\tavg. rating')\n", "for year in u.columns:\n", "    print(f'{year}\\t{book_df[year_added == year].rating.mean(): >4.2f}')\n", "    # turn the counts per rating into fractions of that year's reviews\n", "    u[year] = u[year] / u[year].sum()\n", "\n", "g = u.stack()\n", "u = g.unstack('rating')\n", "u.plot(kind='bar')\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The plot shows the fraction of reviews per year that gets each rating of 1-5 stars (or no rating, represented by the zero values). \n", "\n", "The majority of reviews have a positive rating, and although the fraction of 5-star reviews drops somewhat after the first year (the lowest average rating is in 2014), the majority remains positive. This is typical of online reviews. People don't pick books to read at random, but choose those they expect to like. 
Furthermore, people who liked a book are more likely to be willing to put effort into reviewing it. \n", "\n", "Let's look at the differences in review length:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 10\n", "2 10\n", "3 30\n", "4 25\n", "5 22\n", " ..\n", "11845 1\n", "12221 1\n", "12472 1\n", "15704 1\n", "17786 1\n", "Name: review_length, Length: 2503, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "book_df.review_length.value_counts().sort_index()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The shortest review (in text characters): 1\n", "The longest review (in text characters): 17786\n", "The average review length: 608.3044923895375\n", "The standard deviation in review lengths: 1013.9769171514645\n", "\n", "Number of reviews with fewer than 100 characters: 3512\n", "Number of reviews of below average length: 9850\n", "Number of reviews of above average length: 3684\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# count the number of reviews of each length\n", "counts = book_df.review_length.value_counts().sort_index()\n", "print('The shortest review (in text characters):', book_df.review_length.min())\n", "print('The longest review (in text characters):', book_df.review_length.max())\n", "print('The average review length:', book_df.review_length.mean())\n", "print('The standard deviation in review lengths:', book_df.review_length.std())\n", "print('\\nNumber of reviews with fewer than 100 characters:', sum(book_df.review_length < 100))\n", "print('Number of reviews of below average length:', sum(book_df.review_length < book_df.review_length.mean()))\n", "print('Number of reviews of above average length:', sum(book_df.review_length > book_df.review_length.mean()))\n", "\n", "# plot the number of reviews (y) per review length (x), with a logarithmic x-axis\n", "x, y = zip(*counts.items())\n", "plt.plot(x, y)\n", "# mark the average review length with a red dotted line\n", "plt.axvline(x=book_df.review_length.mean(), color='red', linestyle='dotted')\n", "plt.xscale('log')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The plot above shows the distribution of review lengths in number of characters per review. There is a large spread in review lengths. There are thousands of reviews with fewer than 100 characters. Based on typical average word lengths in English of just over 4 characters per word, plus whitespace between words, that means that these are reviews with fewer than 20 words. 
The average length is 608 characters (the red dotted line), while the longest is almost 18,000 characters long (roughly 3,600 words).\n", "\n", "*Slight tangent on the distribution*: The standard deviation is higher than the average length, signaling that this distribution is strongly right-skewed: most reviews are shorter than the average, while a long tail of very long reviews pulls the average up. See the notebook on [Analysing Distributions](./Analyzing-Distributions.ipynb) for a detailed analysis of the different types of distributions and our arguments on why it is important to know about them and take them into account when interpreting data.\n", "\n", "\n", "Let's sample a review and look at the text." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"I found the writing great and the story well moving until it got to the end and the unnecessary tortures and games and killing of children ensued. Why repeat that? We didn't need the shock value. We have already established how cruel the Capitol was. I really didn't need to read about people getting killed in more and more creative ways - it seemed self-serving, like the author has some morbid fascination with meat grinders and burning people or letting them torn apart by vicious monsters. It really detracted from the story for me, which I did find interesting and the ending was surprising - although I would have liked to read about the trial instead of Katniss getting locked up again (she is locked up a lot in this book, another annoying bit). 
I liked the book very much but I deducted one star for the self-serving, morbid violence that didn't further the story.\"" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_seed = 1205921\n", "\n", "sample_df = book_df.sample(1, random_state=random_seed)\n", "\n", "review_text_col = list(sample_df.columns).index('review_text')\n", "sample_df.iloc[0,review_text_col]\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This review describes a somewhat negative reading experience due to the violence in the book, but the reviewer found the story interesting and the ending surprising.\n", "\n", "Let's compare a small sample of 10 reviews:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "review 1: I found the writing great and the story well moving until it got to the end and the unnecessary tortures and games and killing of children ensued. Why repeat that? We didn't need the shock value. We have already established how cruel the Capitol was. I really didn't need to read about people getting killed in more and more creative ways - it seemed self-serving, like the author has some morbid fascination with meat grinders and burning people or letting them torn apart by vicious monsters. It really detracted from the story for me, which I did find interesting and the ending was surprising - although I would have liked to read about the trial instead of Katniss getting locked up again (she is locked up a lot in this book, another annoying bit). I liked the book very much but I deducted one star for the self-serving, morbid violence that didn't further the story.\n", "\n", "\n", "review 2: No where near as good as the first two\n", "\n", "\n", "review 3: Amazing! 
Review to come.\n", "\n", "\n", "review 4: 1.5 stars Before you come at me with your pitch forks, screaming at me for disliking a book in the Hunger Games trilogy, let me explain something to you. I don't give books 5 stars just because I loved the previous 2 in the series. I refuse to give a book a postive review just because it is the conclusion to the series. I did not like Mockingjay. Not only was I disappointed but I felt robbed (and I do not mean by my $20 that could have been spent on something better like a giant ice-cream cone) I felt robbed because after 2 great books, Suzanne Collins ends her series in such an anti-climatic way that leaves me stunned, but not in the good way. Perhaps I am biased. Scratch that, I am biased. Many of my favourite characters died in this final installment. I know, I know, there was a war. There are going to be deaths. Suzanne Collins put them in the book to make it more real. Well you know what? I don't buy that! Those deaths had nothing of importance, they happened, in my opinion, just to cause angst. And our main character Katniss already provided a lot of that. Let's move on to Katniss, the true reason for my less-than-positive review. She was infuriating and annoying. She seemed like a hopeless little girl that could not pull herself together. She seemed like, dare I say it, Bella Swan. Whiny, indecisive, dependant ... all attributes that the Katniss I loved in the first novel did not possess. One of my favorite female characters, was just completely changed in this one book. I loved her for her strength and fight, not her inability to stand up for herself. This seemed like a watered-down version of Katniss, created to further show her love for Peeta is so strong that she can't function without him. Wait. Back pedal on that. Katniss Everdeen, the girl on fire, overly reliant on a boy? You heard (or read) me correctly. No longer is this girl the head-strong, kick-ass female protaginist I loved. 
In her place is a girl who is desperately needing Peeta, while toying with Gale only then to feel no remorse when Gale leaves and keeps Peeta just because he's there Let me clarify something before I go on. I am a fan of Peenis. Hehe. Peenis is the pairing of Katniss and Peeta. This book, despite what happened in the end, is not in support of Peenis. Why? Because Katniss is not the same Katniss and Peeta is not the same Peeta. They're imposters, they're not the characters I rooted for, loved and cherished. Despite the suger-coated epilogue, I could not bring myself to like the ending of the novel. It did not seem the least bit real or in-character. Sure, I wanted Peeta to be Katniss' love. But did that mean she would feel nothing when Gale leaves? Obviously not considering they are best friends. There were so many deaths in the book that, as I've mentioned, felt unneccesary. Why did Prim have to die? Why? The series began with Prim being the innocent little sister that Katniss sought to protect. Why make her die? With her dead, the purpose of the series escapes me. Is Suzanne Collins implying that in the war, there is no hope whatsoever. I feel that's a terrible message, I understand war is killing humanity but shouldn't there be some hope? Finnick Odair was my favorite character. He was hot, a hero, hot, sarcastic, hot, kind, hot, smart and did I mention hot? Yes, I did have a book-crush on Finnick. So his role in the book really pissed me off. After giving him such a great reunion with Annie, letting them getting married, why do you have to kill him off? And Annie's reaction (from Katniss' point of view) is little to nothing. We don't hear about her tears, her sadness, just ... nothing If I go on to explain all the deaths in the book, I might just die myself so I'll stop. But the enormous amount of deaths served no purpose whatsoever. The theme in this book (if there was any) is death and revenge and more death. 
I have a strong feeling the only reason Katniss killed Coin was because of Prim's death. And then, spoiler alert, when they are all voting on if there should be another Hunger Games but with the children of the Capital being the players, why in the world does Katniss say yes? And then say \"for Prim\"? I don't think Prim would like another Games to happen. This whole series is about how awful the Games are, how no one should have to suffer them, how they are cruel and a disgrace. So why does she do a 360 and vote for another Games? It's beyond me. It's purely for revenge. When she had a shimmer of goodness in the book, when she defended her dressing commitee because she claimed they were innocent and didn't know better because they were raised to regard the Games as okay, this is all in spite because then she chooses to make their children play the Games. So the theme in the book? Revenge. Pure revenge. It teaches us that humanity is a nothing but bad, that the only way to resolve problems if fighting fire with fire, to forget about tolerance and forgiveness and to make people pay. In other words, go ahead and be a hypocrite. Nice theme for a book targeted at young adults (notice my sarcasm). Other things I did not like include: the slow beginning, the lack of Peeta (his brain-washed version of himself is not Peeta but just another cause for drama and angst), the filming of everything (and I mean everything, in the midst of war, they are concerned about filming Katniss fight? Seriously?) and, of course, the fragments. That happened. Every. Single. Other. Sentance. I. Understand. That. Everyone. Has. There. Own. Way. Of. Writing. But. Making. All. These. Fragments. Does. Not. Create. Suspense. And. No. Teenage. Girl. Thinks. Like. This. Overall, it was a huge let down. I read the other 2 in the series, loved them both, and this felt like a huge downer. 
I had to force myself to finish this book while with the others, I finished them in one sitting because they were that good. Why give it 1.5 stars rather than just 1? Because, despite my disappointment, I know many people who do not read at all and yet they were reading this series and for that, I give this book the 0.5 star since it was able to get people to read. This review is ending kind of anti-climatic actually .... oh well, it goes with how Mockingjay ended If you're pissed off that I didn't like this book, I don't really care. I think if Collins decided that instead of writing a series, to just make the book in one book, the first one, it would have been better. All this extra rebellion, District 13, blahblahblah was unnecessary. I felt she could have wrapped up the book nicely in Book 1. Oh well. Still pissed I didn't give this book 5 stars?\n", "\n", "\n", "review 5: This book was more than I expected, the key word being expected. Three times in this book I was blindsided. I turned the page to the next chapter expecting something to happen and then got an eyeful of something else, stunning my conscious recognition. This is just one thing I loved about this book. I also love that this book is no fairytale. It is gut wrenchingly raw and cruel and can solicit emotions of repulsion and disgust right alongside desires of hope and happiness. There is no sugar coating to this series. It was a realistic read. Collins writes believability into a complex snare that draws you in and makes it so hard to put the books down. These books will definitely reside on my favorites shelf. For me this book was worth the wait not that I wouldn't have like to have had it sooner. Now I'm off to savor this book again and glean even more while listening to the audio.\n", "\n", "\n", "review 6: Okay, Actually, I'm a bit disappointing by the end of the story\n", "\n", "\n", "review 7: ** spoiler alert ** This book moved the slowest for me reading-wise. 
Specifically because it was so descriptive. Dialog is always quicker and more engaging to read. All the psychological stuff that I felt was left out of The Hunger Games, coagulates here in Mockingjay. Katniss, having been given only a few months of what she think will be the rest of her more semi-normal life, has to return to survival mode. Peeta's character development is almost scary, but I think was a good move. He moves from being just the baker's son who was drawn for the games and survived with Katniss' help, to being a survivor himself. Struggling with the Capitol's hijacking and the slow process of recovery. I had a hard time with Finnick's death, almost as much as I had with Cinna's in Catching Fire. But like the deaths in the Harry Potter series, as much as I disliked it such a thing is what gives a story depth and importance. Without loss the struggle loses it's meaning. Without it the story loses grasp on reality. Having read all three books, I can absolutely not agree with anyone that tries claiming this is another stab at a Twilight-style saga.\n", "\n", "\n", "review 8: This series is just hard to describe. For one thing I was continually pulled out of the narrative when the narrator switched back and forth between and past and present tense. It was strange, but maybe that is just me. Overall the idea and the world behind this story is brilliant. I really enjoyed the first book, but as the story went on into the second and third books it just got painful the read. These characters are really tortured beyond what is normal for a young adult book. And towards the end of this finale book I just got fed up with Katniss and her self loathing. She alternates between blaming the capital and the Games for all her problems and then blaming herself. I actually think that is is probably very realistic. 
But personally I left this series feeling like I should have stopped reading at the end of book one when I still liked all the characters.\n", "\n", "\n", "review 9: I fell in love with this series just months before Mocking Jay came out, one of my fellow library friends introduced me to the series and I devoured both Hunger Games and Catching Fire within days! I was devastated to find out that I had to wait for the conclusion, specially with the cliff hanger in Catching Fire! It was well worth the wait and like always, the conclusion was my favorite part. I love a good book that can make you feel a lot of emotions. My favorite part of this series was that it wasn't centered around boys like Twilight. There was two possible love interests but that wasn't the focus of the story. It was also my first time seeing a strong female lead. I felt like I connected the best with Katniss versus Bella or Trish. I also enjoyed the movies which is a plus knowing the most movies don't compare to the books. Finally I enjoy a book that throws you for a curve and this one had me speechless, and also crying at the end for the rest of the 3 hour drive to visit my family!\n", "\n", "\n", "review 10: Re-read november of 2014: I loved this book even more than I did the first time around. Maybe because I took my time reading it this time, or I already watched part 1 of the film but I just lovee itt. The ending makes sense now, at first it felt rushed and well maybe the second half of the books feels that way, but that's how war is. Fast-paced, no room for thinking things over, move or you're dead. And this book captures that essence really well. This is not a happy ending story, it is more bitter than sweet.But it shows that sometimes in life you need to accept what has happened and those things you can't change. Adapt, and keep going. I give it a 4.5. 
I think it's still my least favorite from the series considering how dark it is, but it's a great conclusion.\n", "\n", "\n" ] } ], "source": [ "from scripts.text_tail_analysis import get_dataframe_review_texts\n", "sample_size = 10\n", "sample_df = book_df.sample(sample_size, random_state=random_seed)\n", "\n", "for ri, review_text in enumerate(get_dataframe_review_texts(sample_df)):\n", "    print(f'review {ri+1}:', review_text)\n", "    print('\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many reviews are very short, just one or two short sentences. Many reviewers mention the ending. This book being the last of a trilogy, this is not unexpected, as this book wraps up a longer narrative. We see considerable differences of opinion.\n", "\n", "Taking a first step into a more quantitative analysis of the content, we do a Keyword in Context (KWiC) search for the words 'end', 'ends', 'ended', 'ending' and 'endings' to get insight into what reviewers say about it." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "end e story well moving until it got to the end and the unnecessary tortures and games \n", "ending e, which I did find interesting and the ending was surprising - although I would have \n", "ends se after 2 great books, Suzanne Collins ends her series in such an anti-climatic way\n", "end This book, despite what happened in the end, is not in support of Peenis. Why? Beca\n", "ending e, I could not bring myself to like the ending of the novel. It did not seem the least\n", "ending to get people to read. This review is ending kind of anti-climatic actually .... oh \n", "ended .. oh well, it goes with how Mockingjay ended If you're pissed off that I didn't li\n", "end ctually, I'm a bit disappointing by the end of the story\n", "end r a young adult book. 
And towards the end of this finale book I just got fed up w\n", "end ke I should have stopped reading at the end of book one when I still liked all the \n", "end d me speechless, and also crying at the end for the rest of the 3 hour drive to vis\n", "ending of the film but I just lovee itt. The ending makes sense now, at first it felt rushe\n", "ending ence really well. This is not a happy ending story, it is more bitter than sweet.But\n" ] } ], "source": [ "import re\n", "\n", "\n", "def kwic(pattern, reviews, word_boundaries=True):\n", "    pattern = pattern if not word_boundaries else r'\\b' + pattern + r'\\b'\n", "    for review in reviews:\n", "        for match in re.finditer(pattern, review):\n", "            start = match.start(0) - 40 if match.start(0) > 40 else 0\n", "            end = match.end(0) + 40\n", "            print(f'{match[0]: <15}{review[start:end]}')\n", "\n", "pattern = '(end|ends|ended|ending|endings)'\n", "kwic(pattern, get_dataframe_review_texts(sample_df))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Another way to get insight into the content of multiple reviews is to make frequency lists." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('the', 119),\n", " ('I', 88),\n", " ('and', 56),\n", " ('to', 51),\n", " ('a', 46),\n", " ('of', 43),\n", " ('that', 38),\n", " ('is', 38),\n", " ('book', 37),\n", " ('in', 36),\n", " ('was', 29),\n", " ('this', 29),\n", " ('it', 26),\n", " ('for', 22),\n", " ('Katniss', 19),\n", " ('not', 19),\n", " ('t', 18),\n", " ('with', 18),\n", " ('just', 18),\n", " ('like', 17)]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "from collections import Counter\n", "\n", "tf = Counter()\n", "for text in get_dataframe_review_texts(sample_df):\n", "    # split the texts on any non-word characters\n", "    words = re.split(r'\\W+', text.strip())\n", "    # count the number of times each word occurs across the review texts\n", "    tf.update(words)\n", "\n", "tf.most_common(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Among the top 20 most frequent words, we find a domain-generic term, 'book', as well as the name of a character in the book, 'Katniss'. Most of the others are common function words. The entry 't' is what remains of contractions like *didn't*, which are split at the apostrophe. \n", "\n", "How often do variants of 'end' and 'ending' appear in these 10 reviews?" 
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "end: 6\n", "ends: 1\n", "ended: 1\n", "ending: 5\n", "endings: 0\n" ] } ], "source": [ "for term in ['end', 'ends', 'ended', 'ending', 'endings']:\n", "    print(f'{term}:', tf[term])\n" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of words: 2327\n", "Number of distinct words: 765\n" ] } ], "source": [ "print('Number of words:', sum(tf.values()))\n", "print('Number of distinct words:', len(tf.keys()))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use some of the many wonderful open source Natural Language Processing (NLP) tools to have more control over the textual content. We use [Spacy](https://spacy.io) to parse the reviews, which gives us access to the individual sentences and words, and additional information on word forms, parts of speech, lemmas, etc.\n", "\n", "We start by listing all entities that Spacy identified in the sample of reviews." 
] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Katniss', 18),\n", " ('Peeta', 10),\n", " ('first', 7),\n", " ('one', 6),\n", " ('2', 3),\n", " ('Mockingjay', 3),\n", " ('Suzanne Collins', 3),\n", " ('Peenis', 3),\n", " ('Catching Fire', 3),\n", " ('Capitol', 2),\n", " ('two', 2),\n", " ('1.5', 2),\n", " ('5', 2),\n", " ('Finnick', 2),\n", " ('Annie', 2),\n", " ('Collins', 2),\n", " ('the Hunger Games', 1),\n", " ('20', 1),\n", " ('Bella Swan', 1),\n", " ('One', 1),\n", " ('Katniss Everdeen', 1),\n", " ('Gale', 1),\n", " ('Finnick Odair', 1),\n", " ('Coin', 1),\n", " ('Prim', 1),\n", " ('another Hunger Games', 1),\n", " ('Capital', 1),\n", " ('360', 1),\n", " ('Games', 1),\n", " ('Sentance', 1),\n", " ('I.', 1),\n", " ('0.5', 1),\n", " ('blahblahblah', 1),\n", " ('Book 1', 1),\n", " ('Three', 1),\n", " ('The Hunger Games', 1),\n", " ('only a few months', 1),\n", " ('baker', 1),\n", " ('Cinna', 1),\n", " ('Harry Potter', 1),\n", " ('three', 1),\n", " ('second', 1),\n", " ('third', 1),\n", " ('just months', 1),\n", " ('Jay', 1),\n", " ('Hunger Games', 1),\n", " ('days', 1),\n", " ('Twilight', 1),\n", " ('Bella', 1),\n", " ('Trish', 1),\n", " ('the 3 hour', 1),\n", " ('november of 2014', 1),\n", " ('1', 1),\n", " ('the second half', 1),\n", " ('4.5', 1)]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import spacy\n", "\n", "# load the large model for English\n", "nlp = spacy.load('en_core_web_lg')\n", "\n", "# use nlp to parse each text and store the parsed results as a list of docs\n", "docs = [nlp(text) for text in get_dataframe_review_texts(sample_df)]\n", "\n", "# iterate over the docs, then over the entities in each doc and count them\n", "tf = Counter([entity.text for doc in docs for entity in doc.ents])\n", "\n", "tf.most_common()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is only a short list of entities found in the 10 
reviews, most appearing only once. If we look not only at named entities, but at all noun phrases, we get a longer list:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('I', 87),\n", " ('it', 26),\n", " ('me', 14),\n", " ('this book', 13),\n", " ('Katniss', 12),\n", " ('you', 11),\n", " ('It', 10),\n", " ('she', 10),\n", " ('they', 10),\n", " ('the book', 9),\n", " ('Peeta', 8),\n", " ('the story', 7),\n", " ('the end', 6),\n", " ('them', 6),\n", " ('the series', 6),\n", " ('something', 5),\n", " ('what', 5),\n", " ('this series', 5),\n", " ('people', 4),\n", " ('a book', 4),\n", " ('nothing', 4),\n", " ('She', 4),\n", " ('Prim', 4),\n", " ('the Games', 4),\n", " ('We', 3),\n", " ('the conclusion', 3),\n", " ('Mockingjay', 3),\n", " ('Suzanne Collins', 3),\n", " ('herself', 3),\n", " ('her', 3),\n", " ('him', 3),\n", " ('fire', 3),\n", " ('who', 3),\n", " ('Gale', 3),\n", " ('Peenis', 3),\n", " ('This book', 3),\n", " ('myself', 3),\n", " ('war', 3),\n", " ('the books', 3),\n", " ('Catching Fire', 3),\n", " ('the ending', 2),\n", " ('5 stars', 2),\n", " ('deaths', 2),\n", " ('importance', 2),\n", " ('angst', 2),\n", " ('a lot', 2),\n", " ('humanity', 2),\n", " ('He', 2),\n", " ('revenge', 2),\n", " ('the world', 2),\n", " ('another Games', 2),\n", " ('himself', 2),\n", " ('everything', 2),\n", " ('Collins', 2),\n", " ('emotions', 2),\n", " ('the rest', 2),\n", " ('the writing', 1),\n", " ('the unnecessary tortures', 1),\n", " ('games', 1),\n", " ('killing', 1),\n", " ('children', 1),\n", " ('the shock value', 1),\n", " ('the Capitol', 1),\n", " ('more and more creative ways', 1),\n", " ('the author', 1),\n", " ('some morbid fascination', 1),\n", " ('meat grinders', 1),\n", " ('vicious monsters', 1),\n", " ('the trial', 1),\n", " ('another annoying bit', 1),\n", " ('one star', 1),\n", " ('the self-serving, morbid violence', 1),\n", " ('Review', 1),\n", " ('your pitch forks', 1),\n", " ('the Hunger 
Games', 1),\n", " ('trilogy', 1),\n", " ('books', 1),\n", " ('a postive review', 1),\n", " ('a giant ice-cream cone', 1),\n", " ('2 great books', 1),\n", " ('her series', 1),\n", " ('such an anti-climatic way', 1),\n", " ('the good way', 1),\n", " ('my favourite characters', 1),\n", " ('this final installment', 1),\n", " ('a war', 1),\n", " ('Those deaths', 1),\n", " ('my opinion', 1),\n", " ('our main character', 1),\n", " (\"'s\", 1),\n", " ('the true reason', 1),\n", " ('my less-than-positive review', 1),\n", " ('a hopeless little girl', 1),\n", " ('all attributes', 1),\n", " ('the Katniss', 1),\n", " ('the first novel', 1),\n", " ('my favorite female characters', 1),\n", " ('this one book', 1),\n", " ('her strength', 1),\n", " ('a watered-down version', 1),\n", " ('her love', 1),\n", " ('Wait', 1),\n", " ('Back pedal', 1),\n", " ('Katniss Everdeen', 1),\n", " ('the girl', 1),\n", " ('a boy', 1),\n", " ('You', 1),\n", " ('this girl', 1),\n", " ('the head-strong, kick-ass female protaginist', 1),\n", " ('her place', 1),\n", " ('a girl', 1),\n", " ('no remorse', 1),\n", " ('he', 1),\n", " ('a fan', 1),\n", " ('the pairing', 1),\n", " ('support', 1),\n", " ('the same Katniss', 1),\n", " ('the same Peeta', 1),\n", " ('They', 1),\n", " ('imposters', 1),\n", " ('the characters', 1),\n", " ('the suger-coated epilogue', 1),\n", " ('the novel', 1),\n", " ('character', 1),\n", " (\"Katniss' love\", 1),\n", " ('best friends', 1),\n", " ('so many deaths', 1),\n", " ('The series', 1),\n", " ('the innocent little sister', 1),\n", " ('the purpose', 1),\n", " ('the war', 1),\n", " ('no hope', 1),\n", " ('a terrible message', 1),\n", " ('some hope', 1),\n", " ('Finnick Odair', 1),\n", " ('my favorite character', 1),\n", " ('a book-crush', 1),\n", " ('Finnick', 1),\n", " ('his role', 1),\n", " ('such a great reunion', 1),\n", " ('Annie', 1),\n", " (\"Annie's reaction\", 1),\n", " (\"Katniss' point\", 1),\n", " ('view', 1),\n", " ('her tears', 1),\n", " ('her sadness', 1),\n", " 
('all the deaths', 1),\n", " ('the enormous amount', 1),\n", " ('no purpose', 1),\n", " ('The theme', 1),\n", " ('death', 1),\n", " ('more death', 1),\n", " ('a strong feeling', 1),\n", " ('the only reason', 1),\n", " ('Coin', 1),\n", " (\"Prim's death\", 1),\n", " ('spoiler alert', 1),\n", " ('another Hunger Games', 1),\n", " ('the children', 1),\n", " ('the Capital', 1),\n", " ('the players', 1),\n", " ('This whole series', 1),\n", " ('no one', 1),\n", " ('a disgrace', 1),\n", " ('vote', 1),\n", " ('a shimmer', 1),\n", " ('goodness', 1),\n", " ('her dressing commitee', 1),\n", " ('spite', 1),\n", " ('their children', 1),\n", " ('the theme', 1),\n", " ('Revenge', 1),\n", " ('Pure revenge', 1),\n", " ('us', 1),\n", " ('a nothing', 1),\n", " ('the only way', 1),\n", " ('problems', 1),\n", " ('tolerance', 1),\n", " ('forgiveness', 1),\n", " ('other words', 1),\n", " ('a hypocrite', 1),\n", " ('Nice theme', 1),\n", " ('young adults', 1),\n", " ('my sarcasm', 1),\n", " ('Other things', 1),\n", " ('the slow beginning', 1),\n", " ('the lack', 1),\n", " ('his brain-washed version', 1),\n", " ('just another cause', 1),\n", " ('drama', 1),\n", " ('the filming', 1),\n", " ('the midst', 1),\n", " ('course', 1),\n", " ('Sentance', 1),\n", " ('I. 
Understand', 1),\n", " ('Everyone', 1),\n", " ('Way', 1),\n", " ('Writing', 1),\n", " ('Fragments', 1),\n", " ('Suspense', 1),\n", " ('Teenage', 1),\n", " ('Girl', 1),\n", " ('a huge let', 1),\n", " ('a huge downer', 1),\n", " ('the others', 1),\n", " ('1.5 stars', 1),\n", " ('my disappointment', 1),\n", " ('many people', 1),\n", " ('the 0.5 star', 1),\n", " ('This review', 1),\n", " ('a series', 1),\n", " ('one book', 1),\n", " ('All this extra rebellion', 1),\n", " ('District', 1),\n", " ('blahblahblah', 1),\n", " ('Book', 1),\n", " ('the key word', 1),\n", " ('the page', 1),\n", " ('the next chapter', 1),\n", " ('an eyeful', 1),\n", " ('my conscious recognition', 1),\n", " ('just one thing', 1),\n", " ('no fairytale', 1),\n", " ('repulsion', 1),\n", " ('disgust', 1),\n", " ('desires', 1),\n", " ('hope', 1),\n", " ('happiness', 1),\n", " ('no sugar coating', 1),\n", " ('a realistic read', 1),\n", " ('believability', 1),\n", " ('a complex snare', 1),\n", " ('These books', 1),\n", " ('my favorites shelf', 1),\n", " ('the audio', 1),\n", " ('** spoiler alert', 1),\n", " ('Dialog', 1),\n", " ('All the psychological stuff', 1),\n", " ('The Hunger Games', 1),\n", " ('only a few months', 1),\n", " ('her more semi-normal life', 1),\n", " ('survival mode', 1),\n", " (\"Peeta's character development\", 1),\n", " ('a good move', 1),\n", " (\"just the baker's son\", 1),\n", " ('the games', 1),\n", " (\"Katniss' help\", 1),\n", " ('a survivor', 1),\n", " (\"the Capitol's hijacking\", 1),\n", " ('the slow process', 1),\n", " ('recovery', 1),\n", " ('a hard time', 1),\n", " (\"Finnick's death\", 1),\n", " ('Cinna', 1),\n", " ('the deaths', 1),\n", " ('the Harry Potter series', 1),\n", " ('such a thing', 1),\n", " ('a story depth', 1),\n", " ('loss', 1),\n", " ('the struggle', 1),\n", " ('grasp', 1),\n", " ('reality', 1),\n", " ('all three books', 1),\n", " ('anyone', 1),\n", " ('another stab', 1),\n", " ('a Twilight-style saga', 1),\n", " ('This series', 1),\n", " ('one thing', 
1),\n", " ('the narrative', 1),\n", " ('the narrator', 1),\n", " ('just me', 1),\n", " ('the idea', 1),\n", " ('this story', 1),\n", " ('the first book', 1),\n", " ('the second and third books', 1),\n", " ('the read', 1),\n", " ('These characters', 1),\n", " ('a young adult book', 1),\n", " ('this finale book', 1),\n", " ('her self loathing', 1),\n", " ('the capital', 1),\n", " ('all her problems', 1),\n", " ('book', 1),\n", " ('all the characters', 1),\n", " ('love', 1),\n", " ('Jay', 1),\n", " ('my fellow library friends', 1),\n", " ('both Hunger Games', 1),\n", " ('days', 1),\n", " ('the cliff hanger', 1),\n", " ('my favorite part', 1),\n", " ('a good book', 1),\n", " ('My favorite part', 1),\n", " ('boys', 1),\n", " ('Twilight', 1),\n", " ('two possible love interests', 1),\n", " ('the focus', 1),\n", " ('my first time', 1),\n", " ('a strong female lead', 1),\n", " ('Bella', 1),\n", " ('Trish', 1),\n", " ('the movies', 1),\n", " ('a plus', 1),\n", " ('the most movies', 1),\n", " ('a curve', 1),\n", " ('this one', 1),\n", " ('the 3 hour drive', 1),\n", " ('my family', 1),\n", " ('Re-read november', 1),\n", " ('the first time', 1),\n", " ('my time', 1),\n", " ('part', 1),\n", " ('the film', 1),\n", " ('The ending', 1),\n", " ('sense', 1),\n", " ('the second half', 1),\n", " ('Fast-paced, no room', 1),\n", " ('things', 1),\n", " ('that essence', 1),\n", " ('a happy ending story', 1),\n", " ('life', 1),\n", " ('a great conclusion', 1)]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# instead of entities, we can also look at noun-phrases\n", "tf = Counter([ne.text for doc in docs for ne in doc.noun_chunks])\n", "\n", "tf.most_common()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many of these noun chunks are pronouns like 'I', 'me', 'you', 'she', 'they', 'them', 'we'. These are common in reviews, as reviewers often describe their personal reading experience and the affect that the book had on them. 
In a small sample, they get in the way of seeing what content aspects are mentioned.\n", "\n", "Spacy adds word form information to each word in the document. We can easily filter out common stopwords to get a better view of the content words that are mentioned." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('.', 157),\n", " (',', 106),\n", " (' ', 40),\n", " ('book', 37),\n", " ('-', 21),\n", " ('?', 19),\n", " ('Katniss', 19),\n", " ('like', 17),\n", " ('series', 17),\n", " ('read', 11),\n", " ('story', 10),\n", " ('Games', 10),\n", " ('Peeta', 10),\n", " ('(', 8),\n", " (')', 8),\n", " ('books', 8),\n", " ('loved', 8),\n", " ('felt', 8),\n", " ('end', 6),\n", " ('deaths', 6)]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf = Counter([token.text for doc in docs for token in doc if not token.is_stop])\n", "\n", "tf.most_common(20)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we see many punctuation symbols. Let's filter those out as well." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(' ', 40),\n", " ('book', 37),\n", " ('Katniss', 19),\n", " ('like', 17),\n", " ('series', 17),\n", " ('read', 11),\n", " ('story', 10),\n", " ('Games', 10),\n", " ('Peeta', 10),\n", " ('books', 8),\n", " ('loved', 8),\n", " ('felt', 8),\n", " ('end', 6),\n", " ('deaths', 6),\n", " ('love', 6),\n", " ('think', 6),\n", " ('people', 5),\n", " ('ending', 5),\n", " ('good', 5),\n", " ('Collins', 5)]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf = Counter([token.text for doc in docs for token in doc if not token.is_stop and not token.is_punct])\n", "\n", "tf.most_common(20)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most common words are clearly related to the book domain (such as *book*, *read*, *series*, *story*) and the review domain (*like*, *loved*, *felt*, *love*, *good*). Notice that several of these words are morphological variants of each other (*book* and *books*, *love* and *loved*). \n", "\n", "We can also count the word lemmas instead of the surface variants in the text:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('book', 45),\n", " (' ', 40),\n", " ('like', 21),\n", " ('Katniss', 19),\n", " ('series', 17),\n", " ('read', 15),\n", " ('love', 14),\n", " ('feel', 14),\n", " ('story', 10),\n", " ('end', 10),\n", " ('Games', 10),\n", " ('death', 10),\n", " ('character', 9),\n", " ('Peeta', 9),\n", " ('think', 8),\n", " ('get', 7),\n", " ('good', 7),\n", " ('way', 6),\n", " ('let', 6),\n", " ('star', 6)]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf = Counter([token.lemma_ for doc in docs for token in doc if not token.is_stop and not token.is_punct])\n", "\n", "tf.most_common(20)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now *end* moves further up the list, because its inflected variants are counted together. 
\n", "\n", "### Zooming out to more reviewers\n", "\n", "With only 10 reviews we can see just a few commonalities and distinctions. Several mention the ending; some like it and some don't. A quantitative perspective doesn't give us much beyond what a close reading of the reviews would give us. \n", "\n", "If we zoom out to a larger group of 10,000 reviews, we get a more stable picture of what aspects are commonly mentioned. But now a different problem rears up." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "took: 294.7146508693695 seconds\n", "number of spacy docs loaded: 113338\n", "number of spacy docs selected: 12607\n" ] } ], "source": [ "from scripts.text_tail_analysis import read_spacy_docs_for_dataframe, select_dataframe_spacy_docs\n", "import spacy\n", "import time\n", "\n", "nlp = spacy.load('en_core_web_lg')\n", "fname = '../data/goodreads-reviews-books_above_10k.doc_bin'\n", "start = time.time()\n", "review_docs = read_spacy_docs_for_dataframe(fname, review_df, nlp)\n", "print('took:', time.time() - start, 'seconds')\n", "print('number of spacy docs loaded:', len(review_docs))\n", "book_docs = select_dataframe_spacy_docs(book_df, review_docs, as_dict=True)\n", "print('number of spacy docs selected:', len(book_docs.keys()))" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('book', 16843),\n", " (' ', 14091),\n", " ('Katniss', 9164),\n", " ('like', 6744),\n", " ('read', 6651),\n", " ('end', 6424),\n", " ('series', 5948),\n", " ('love', 5158),\n", " ('think', 4737),\n", " ('character', 4217),\n", " ('Peeta', 4105),\n", " ('feel', 4090),\n", " ('ending', 3969),\n", " ('good', 3882),\n", " ('story', 3162),\n", " ('Collins', 3136),\n", " ('time', 2856),\n", " ('Games', 2853),\n", " ('way', 2794),\n", " ('Hunger', 2763)]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"sample_size = 10000\n", "sample_df = book_df.sample(sample_size, random_state=random_seed)\n", "docs = [nlp(text) for text in get_dataframe_review_texts(sample_df)]\n", "docs = select_dataframe_spacy_docs(sample_df, review_docs, as_dict=False)\n", "\n", "\n", "# calculate the term frequency of individual words\n", "tf = Counter([token.lemma_ for doc in docs for token in doc if not token.is_stop and not token.is_punct])\n", "\n", "tf.most_common(20)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This list is very similar to the one for ten reviews. The book and review domain terms, plus the names of the book, author and main characters. \n", "\n", "Plain word lists are a quick way to get an overview of what is common across a set of reviews. Apart from total word counts, we can also count each word once per document regardless of how frequently the reviewer uses it, so that we get insight in how many reviewers mention a specific term, e.g. 'ending'. With each review being a document, this frequency is known as the *document frequency*." 
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('book', 6163),\n", " ('series', 3850),\n", " ('end', 3713),\n", " ('read', 3688),\n", " ('like', 3324),\n", " ('Katniss', 3118),\n", " ('ending', 2949),\n", " ('love', 2880),\n", " ('good', 2792),\n", " ('think', 2639),\n", " ('character', 2375),\n", " ('feel', 2305),\n", " ('story', 1960),\n", " ('Collins', 1950),\n", " ('Hunger', 1891),\n", " ('trilogy', 1870),\n", " ('Games', 1861),\n", " ('time', 1852),\n", " ('Peeta', 1842),\n", " ('way', 1832)]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scripts.text_tail_analysis import get_doc_word_token_set\n", "\n", "df = Counter([lemma for doc in docs for lemma in get_doc_word_token_set(doc, use_lemma=True)])\n", "\n", "df.most_common(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is quite insightful. There are 3713 reviews (37% of the 10,000 in the sample) that mention the word *end* and 2949 reviews (could be many of the same reviews) that mention *ending*. Also, 2375 reviewers mention the word *character*, and 1960 mention *story*. \n", "\n", "But what is the problem that rears up here? Let's look at the total number of words and distinct word forms:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of words: 487298\n", "Number of distinct words: 27632\n" ] } ], "source": [ "print('Number of words:', sum(tf.values()))\n", "print('Number of distinct words:', len(tf.keys()))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The 10,000 reviews contain 487,298 words in total, and 27,632 distinct words. Above, we have looked at only the 20 most frequent ones. What are these remaining 27,612 words?\n", "\n", "This is where the highly skewed distribution of word frequencies throws up barriers to analysis. 
How do we get a good overview of what those low-frequency words are?" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sum frequency of top 10 terms: 79977 (fraction: 0.16)\n", "Sum frequency of top 20 terms: 113587 (fraction: 0.23)\n", "Sum frequency of top 100 terms: 211555 (fraction: 0.43)\n", "Sum frequency of top 200 terms: 263236 (fraction: 0.54)\n" ] } ], "source": [ "sizes = [10, 20, 100, 200]\n", "for size in sizes:\n", "    sum_top = sum([freq for term, freq in tf.most_common(size)])\n", "    print(f'Sum frequency of top {size} terms: {sum_top} (fraction: {sum_top / sum(tf.values()): >.2f})')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The top 20 terms represent only 23% of all words. Even if we look at the top 200 words, we're still ignoring almost half (46%) of the text. " ] }, { "cell_type": "code", "execution_count": 172, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(('book', 'NOUN'), 1785),\n", " ((' ', 'SPACE'), 1221),\n", " (('Katniss', 'PROPN'), 874),\n", " (('read', 'VERB'), 633),\n", " (('series', 'NOUN'), 575),\n", " (('think', 'VERB'), 530),\n", " (('character', 'NOUN'), 473),\n", " (('feel', 'VERB'), 417),\n", " (('ending', 'NOUN'), 416),\n", " (('love', 'VERB'), 416),\n", " (('good', 'ADJ'), 409),\n", " (('Peeta', 'PROPN'), 387),\n", " (('like', 'SCONJ'), 371),\n", " (('end', 'NOUN'), 354),\n", " (('go', 'VERB'), 315),\n", " (('story', 'NOUN'), 313),\n", " (('like', 'VERB'), 306),\n", " (('Collins', 'PROPN'), 305),\n", " (('want', 'VERB'), 302),\n", " (('know', 'VERB'), 289)]" ] }, "execution_count": 172, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf_lemma_pos = Counter([(token.lemma_, token.pos_) for doc in docs for token in doc if not token.is_stop and not token.is_punct])\n", "\n", "tf_lemma_pos.most_common(20)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Long Tails and Classification\n", "\n", "One 
thing we can do to organise items in the long tail is to categorise or classify them. \n", "\n", "- group by part-of-speech\n", "- group by frequent terms in the sentence that they have a syntactical dependency with\n", "- group by semantic information about each word based on external resources, like sentiment, synonyms, hypernyms, or domain-specific word categorisations (e.g. LIWC, WordNet, ...)." ] }, { "cell_type": "code", "execution_count": 299, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Word form\tAll TF (frac)\tTF <= 5 (frac)\tTF = 1 (frac)\n", "----------------------------------------------\n", "VERB \t 12830 0.28\t 1671 0.21\t 564 0.19\n", "NOUN \t 16541 0.36\t 2876 0.36\t 1014 0.34\n", "ADJ \t 6866 0.15\t 1757 0.22\t 674 0.22\n", "PROPN \t 4738 0.10\t 715 0.09\t 408 0.13\n", "SCONJ \t 418 0.01\t 4 0.0\t 1 0.0\n", "NUM \t 481 0.01\t 121 0.02\t 49 0.02\n", "ADV \t 2438 0.05\t 502 0.06\t 210 0.07\n", "X \t 51 0.00\t 41 0.01\t 26 0.01\n", "INTJ \t 407 0.01\t 141 0.02\t 35 0.01\n", "SPACE \t 1221 0.03\t 0 0.0\t 0 0.0\n", "PUNCT \t 32 0.00\t 32 0.0\t 22 0.01\n", "CCONJ \t 12 0.00\t 12 0.0\t 0 0.0\n", "PRON \t 17 0.00\t 17 0.0\t 4 0.0\n", "PART \t 35 0.00\t 0 0.0\t 0 0.0\n", "ADP \t 40 0.00\t 20 0.0\t 10 0.0\n", "DET \t 10 0.00\t 10 0.0\t 6 0.0\n", "SYM \t 1 0.00\t 1 0.0\t 1 0.0\n" ] } ], "source": [ "from collections import defaultdict\n", "from scripts.text_tail_analysis import show_pos_tail_distribution\n", "\n", "tf_lemma_pos = Counter([(token.lemma_, token.pos_) for doc in docs for token in doc if not token.is_stop and not token.is_punct])\n", "\n", "show_pos_tail_distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above we see the proportion of part-of-speech tags across all words, across words that occur at most five times, and across words that occur only once. 
**Remember, this is after removal of stopwords and punctuation**.\n", "\n", "- First, the largest categories overall are nouns (36%), verbs (28%), adjectives (15%), proper nouns (10%) and adverbs (5%). Proper nouns refer to single identifiable entities.\n", "\n", "- Among the less frequent words, the proportions of nouns and adverbs remain stable and the proportion of verbs drops, while the proportion of adjectives goes up, as does that of proper nouns among words that occur only once. \n", "\n", "In other words, the tail has relatively many adjectives and entities, but also many other nouns. In terms of content analysis, these are important categories. Of course, with 1,000 reviews and only a few thousand of these words, it is possible to go through all of them to gain insight into what they are and how they relate to the book, the reading experience or something else. If we were to scale up to tens of thousands or millions of reviews, this would become increasingly infeasible. \n" ] }, { "cell_type": "code", "execution_count": 421, "metadata": {}, "outputs": [], "source": [ "from scripts.text_tail_analysis import get_lemma_pos_df_index" ] }, { "cell_type": "code", "execution_count": 431, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8269 1671\n" ] } ], "source": [ "df_group1 = book_df[book_df.rating > 3]\n", "df_group2 = book_df[book_df.rating < 3]\n", "\n", "book_docs_group1 = select_dataframe_spacy_docs(df_group1, review_docs, as_dict=True)\n", "book_docs_group2 = select_dataframe_spacy_docs(df_group2, review_docs, as_dict=True)\n", "\n", "print(len(book_docs_group1), len(book_docs_group2))" ] }, { "cell_type": "code", "execution_count": 438, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bad ADJ 684 324 0.0827 0.1939 2.34\n", "well ADJ 627 201 0.0758 0.1203 1.59\n", "second ADJ 597 213 0.0722 0.1275 1.77\n", "strong ADJ 521 200 0.0630 0.1197 1.90\n", "big ADJ 370 149 0.0447 0.0892 1.99\n", "interesting ADJ 339 131 0.0410 0.0784 1.91\n", "main 
ADJ 282 145 0.0341 0.0868 2.54\n", "old ADJ 270 82 0.0327 0.0491 1.50\n", "disappointed ADJ 217 186 0.0262 0.1113 4.24\n", "depressing ADJ 215 94 0.0260 0.0563 2.16\n", "high ADJ 170 70 0.0206 0.0419 2.04\n", "horrible ADJ 150 79 0.0181 0.0473 2.61\n", "possible ADJ 136 50 0.0164 0.0299 1.82\n", "weak ADJ 125 91 0.0151 0.0545 3.60\n", "major ADJ 123 72 0.0149 0.0431 2.90\n", "huge ADJ 119 58 0.0144 0.0347 2.41\n", "terrible ADJ 118 83 0.0143 0.0497 3.48\n", "poor ADJ 114 46 0.0138 0.0275 2.00\n", "okay ADJ 111 50 0.0134 0.0299 2.23\n", "disappointing ADJ 108 167 0.0131 0.0999 7.65\n", "predictable ADJ 104 41 0.0126 0.0245 1.95\n", "actual ADJ 96 37 0.0116 0.0221 1.91\n", "rushed ADJ 96 33 0.0116 0.0197 1.70\n", "complete ADJ 95 39 0.0115 0.0233 2.03\n", "boring ADJ 94 107 0.0114 0.0640 5.63\n", "close ADJ 94 29 0.0114 0.0174 1.53\n", "sorry ADJ 85 41 0.0103 0.0245 2.39\n", "compelling ADJ 85 33 0.0103 0.0197 1.92\n", "confused ADJ 82 36 0.0099 0.0215 2.17\n", "mad ADJ 80 38 0.0097 0.0227 2.35\n", "obvious ADJ 78 25 0.0094 0.0150 1.59\n", "female ADJ 77 35 0.0093 0.0209 2.25\n", "general ADJ 75 29 0.0091 0.0174 1.91\n", "fine ADJ 72 46 0.0087 0.0275 3.16\n", "awful ADJ 69 51 0.0083 0.0305 3.66\n", "original ADJ 69 25 0.0083 0.0150 1.79\n", "stupid ADJ 68 45 0.0082 0.0269 3.27\n", "anti ADJ 68 38 0.0082 0.0227 2.77\n", "confusing ADJ 68 34 0.0082 0.0203 2.47\n", "entertaining ADJ 65 21 0.0079 0.0126 1.60\n", "depressed ADJ 65 48 0.0079 0.0287 3.65\n", "single ADJ 63 26 0.0076 0.0156 2.04\n", "present ADJ 63 33 0.0076 0.0197 2.59\n", "annoying ADJ 62 47 0.0075 0.0281 3.75\n", "interested ADJ 61 19 0.0074 0.0114 1.54\n", "bleak ADJ 60 23 0.0073 0.0138 1.90\n", "cold ADJ 58 20 0.0070 0.0120 1.71\n", "unnecessary ADJ 56 52 0.0068 0.0311 4.60\n", "total ADJ 56 20 0.0068 0.0120 1.77\n", "weird ADJ 56 26 0.0068 0.0156 2.30\n", "selfish ADJ 56 45 0.0068 0.0269 3.98\n", "decent ADJ 53 20 0.0064 0.0120 1.87\n", "popular ADJ 53 17 0.0064 0.0102 1.59\n", "ok ADJ 52 32 0.0063 
0.0192 3.05\n", "teenage ADJ 52 18 0.0063 0.0108 1.71\n", "normal ADJ 51 20 0.0062 0.0120 1.94\n", "tired ADJ 51 35 0.0062 0.0209 3.40\n", "non ADJ 50 16 0.0060 0.0096 1.58\n", "dramatic ADJ 48 15 0.0058 0.0090 1.55\n", "extreme ADJ 47 17 0.0057 0.0102 1.79\n", "post ADJ 47 18 0.0057 0.0108 1.90\n", "instead ADV 345 189 0.0417 0.1131 2.71\n", "honestly ADV 179 93 0.0216 0.0557 2.57\n", "simply ADV 163 54 0.0197 0.0323 1.64\n", "later ADV 145 44 0.0175 0.0263 1.50\n", "seriously ADV 120 76 0.0145 0.0455 3.13\n", "somewhat ADV 117 38 0.0141 0.0227 1.61\n", "literally ADV 106 45 0.0128 0.0269 2.10\n", "nearly ADV 95 39 0.0115 0.0233 2.03\n", "anymore ADV 89 61 0.0108 0.0365 3.39\n", "basically ADV 87 62 0.0105 0.0371 3.53\n", "way ADV 85 34 0.0103 0.0203 1.98\n", "kinda ADV 83 27 0.0100 0.0162 1.61\n", "obviously ADV 80 27 0.0097 0.0162 1.67\n", "ahead ADV 66 29 0.0080 0.0174 2.17\n", "suddenly ADV 60 31 0.0073 0.0186 2.56\n", "unfortunately ADV 59 42 0.0071 0.0251 3.52\n", "constantly ADV 58 23 0.0070 0.0138 1.96\n", "barely ADV 55 42 0.0067 0.0251 3.78\n", "utterly ADV 50 26 0.0060 0.0156 2.57\n", "character NOUN 3184 1071 0.3851 0.6409 1.66\n", "death NOUN 1080 371 0.1306 0.2220 1.70\n", "page NOUN 866 265 0.1047 0.1586 1.51\n", "author NOUN 716 274 0.0866 0.1640 1.89\n", "point NOUN 631 280 0.0763 0.1676 2.20\n", "plot NOUN 550 273 0.0665 0.1634 2.46\n", "girl NOUN 436 143 0.0527 0.0856 1.62\n", "fact NOUN 410 145 0.0496 0.0868 1.75\n", "reason NOUN 408 161 0.0493 0.0963 1.95\n", "person NOUN 388 134 0.0469 0.0802 1.71\n", "sense NOUN 365 117 0.0441 0.0700 1.59\n", "triangle NOUN 346 142 0.0418 0.0850 2.03\n", "writing NOUN 337 133 0.0408 0.0796 1.95\n", "fire NOUN 333 105 0.0403 0.0628 1.56\n", "scene NOUN 326 106 0.0394 0.0634 1.61\n", "decision NOUN 262 95 0.0317 0.0569 1.79\n", "idea NOUN 241 119 0.0291 0.0712 2.44\n", "guy NOUN 223 75 0.0270 0.0449 1.66\n", "kid NOUN 217 92 0.0262 0.0551 2.10\n", "problem NOUN 212 94 0.0256 0.0563 2.19\n", "heroine NOUN 195 
87 0.0236 0.0521 2.21\n", "boy NOUN 178 54 0.0215 0.0323 1.50\n", "self NOUN 176 112 0.0213 0.0670 3.15\n", "half NOUN 175 62 0.0212 0.0371 1.75\n", "mother NOUN 147 48 0.0178 0.0287 1.62\n", "writer NOUN 146 51 0.0177 0.0305 1.73\n", "development NOUN 144 64 0.0174 0.0383 2.20\n", "pawn NOUN 138 42 0.0167 0.0251 1.51\n", "finale NOUN 138 53 0.0167 0.0317 1.90\n", "expectation NOUN 137 49 0.0166 0.0293 1.77\n", "resolution NOUN 127 39 0.0154 0.0233 1.52\n", "hell NOUN 122 42 0.0148 0.0251 1.70\n", "sort NOUN 114 41 0.0138 0.0245 1.78\n", "month NOUN 104 36 0.0126 0.0215 1.71\n", "control NOUN 100 42 0.0121 0.0251 2.08\n", "bomb NOUN 100 35 0.0121 0.0209 1.73\n", "example NOUN 92 40 0.0111 0.0239 2.15\n", "interest NOUN 92 47 0.0111 0.0281 2.53\n", "strength NOUN 92 35 0.0111 0.0209 1.88\n", "president NOUN 86 29 0.0104 0.0174 1.67\n", "sentence NOUN 85 44 0.0103 0.0263 2.56\n", "teenager NOUN 83 26 0.0100 0.0156 1.55\n", "protagonist NOUN 83 52 0.0100 0.0311 3.10\n", "tv NOUN 83 30 0.0100 0.0180 1.79\n", "attention NOUN 81 27 0.0098 0.0162 1.65\n", "state NOUN 81 27 0.0098 0.0162 1.65\n", "lack NOUN 79 44 0.0096 0.0263 2.76\n", "version NOUN 77 24 0.0093 0.0144 1.54\n", "snow NOUN 76 32 0.0092 0.0192 2.08\n", "personality NOUN 75 27 0.0091 0.0162 1.78\n", "hospital NOUN 75 63 0.0091 0.0377 4.16\n", "impact NOUN 74 28 0.0089 0.0168 1.87\n", "coin NOUN 74 35 0.0089 0.0209 2.34\n", "direction NOUN 74 25 0.0089 0.0150 1.67\n", "blood NOUN 73 25 0.0088 0.0150 1.69\n", "element NOUN 70 30 0.0085 0.0180 2.12\n", "climax NOUN 69 32 0.0083 0.0192 2.29\n", "narrative NOUN 69 23 0.0083 0.0138 1.65\n", "depression NOUN 68 23 0.0082 0.0138 1.67\n", "propaganda NOUN 68 23 0.0082 0.0138 1.67\n", "purpose NOUN 68 33 0.0082 0.0197 2.40\n", "explanation NOUN 67 27 0.0081 0.0162 1.99\n", "cause NOUN 65 20 0.0079 0.0120 1.52\n", "fighting NOUN 63 22 0.0076 0.0132 1.73\n", "group NOUN 63 25 0.0076 0.0150 1.96\n", "disappointment NOUN 62 133 0.0075 0.0796 10.62\n", "human NOUN 61 19 
0.0074 0.0114 1.54\n", "screen NOUN 61 20 0.0074 0.0120 1.62\n", "perspective NOUN 61 19 0.0074 0.0114 1.54\n", "setting NOUN 61 22 0.0074 0.0132 1.78\n", "quality NOUN 59 38 0.0071 0.0227 3.19\n", "result NOUN 58 22 0.0070 0.0132 1.88\n", "number NOUN 57 20 0.0069 0.0120 1.74\n", "excitement NOUN 57 21 0.0069 0.0126 1.82\n", "flaw NOUN 56 26 0.0068 0.0156 2.30\n", "revenge NOUN 56 36 0.0068 0.0215 3.18\n", "volume NOUN 55 22 0.0067 0.0132 1.98\n", "fantasy NOUN 54 18 0.0065 0.0108 1.65\n", "form NOUN 53 18 0.0064 0.0108 1.68\n", "effort NOUN 53 21 0.0064 0.0126 1.96\n", "arrow NOUN 53 26 0.0064 0.0156 2.43\n", "pacing NOUN 52 28 0.0063 0.0168 2.66\n", "citizen NOUN 52 20 0.0063 0.0120 1.90\n", "mission NOUN 51 30 0.0062 0.0180 2.91\n", "room NOUN 51 20 0.0062 0.0120 1.94\n", "song NOUN 49 15 0.0059 0.0090 1.51\n", "hype NOUN 48 17 0.0058 0.0102 1.75\n", "killing NOUN 48 15 0.0058 0.0090 1.55\n", "name NOUN 46 16 0.0056 0.0096 1.72\n", "despair NOUN 46 14 0.0056 0.0084 1.51\n", "ground NOUN 46 14 0.0056 0.0084 1.51\n", "trouble NOUN 46 17 0.0056 0.0102 1.83\n", "Fire PROPN 646 216 0.0781 0.1293 1.65\n", "Catching PROPN 646 222 0.0781 0.1329 1.70\n", "Snow PROPN 573 192 0.0693 0.1149 1.66\n", "Twilight PROPN 104 34 0.0126 0.0203 1.62\n", "Cinna PROPN 98 38 0.0119 0.0227 1.92\n", "Book PROPN 90 44 0.0109 0.0263 2.42\n", "PTSD PROPN 71 25 0.0086 0.0150 1.74\n", "Johanna PROPN 61 20 0.0074 0.0120 1.62\n", "Rue PROPN 54 22 0.0065 0.0132 2.02\n", "get VERB 1243 451 0.1503 0.2699 1.80\n", "die VERB 1024 364 0.1238 0.2178 1.76\n", "kill VERB 918 390 0.1110 0.2334 2.10\n", "start VERB 741 249 0.0896 0.1490 1.66\n", "try VERB 690 224 0.0834 0.1341 1.61\n", "hate VERB 626 238 0.0757 0.1424 1.88\n", "tell VERB 558 182 0.0675 0.1089 1.61\n", "mean VERB 501 155 0.0606 0.0928 1.53\n", "lose VERB 483 167 0.0584 0.0999 1.71\n", "let VERB 462 172 0.0559 0.1029 1.84\n", "have VERB 361 129 0.0437 0.0772 1.77\n", "fall VERB 292 100 0.0353 0.0598 1.69\n", "decide VERB 285 105 0.0345 
0.0628 1.82\n", "rush VERB 278 103 0.0336 0.0616 1.83\n", "care VERB 275 180 0.0333 0.1077 3.24\n", "follow VERB 222 76 0.0268 0.0455 1.69\n", "save VERB 222 70 0.0268 0.0419 1.56\n", "force VERB 217 78 0.0262 0.0467 1.78\n", "deserve VERB 213 67 0.0258 0.0401 1.56\n", "spend VERB 203 111 0.0245 0.0664 2.71\n", "throw VERB 195 93 0.0236 0.0557 2.36\n", "lead VERB 180 64 0.0218 0.0383 1.76\n", "suppose VERB 177 85 0.0214 0.0509 2.38\n", "stand VERB 162 62 0.0196 0.0371 1.89\n", "explain VERB 155 62 0.0187 0.0371 1.98\n", "build VERB 143 53 0.0173 0.0317 1.83\n", "compare VERB 138 43 0.0167 0.0257 1.54\n", "suffer VERB 137 45 0.0166 0.0269 1.63\n", "develop VERB 133 52 0.0161 0.0311 1.93\n", "run VERB 130 62 0.0157 0.0371 2.36\n", "mention VERB 127 55 0.0154 0.0329 2.14\n", "drag VERB 126 51 0.0152 0.0305 2.00\n", "allow VERB 122 42 0.0148 0.0251 1.70\n", "use VERB 109 35 0.0132 0.0209 1.59\n", "resolve VERB 108 33 0.0131 0.0197 1.51\n", "stick VERB 107 52 0.0129 0.0311 2.40\n", "shoot VERB 106 48 0.0128 0.0287 2.24\n", "cause VERB 102 39 0.0123 0.0233 1.89\n", "act VERB 101 32 0.0122 0.0192 1.57\n", "buy VERB 96 42 0.0116 0.0251 2.16\n", "ruin VERB 94 62 0.0114 0.0371 3.26\n", "drive VERB 91 36 0.0110 0.0215 1.96\n", "return VERB 88 28 0.0106 0.0168 1.57\n", "hang VERB 87 39 0.0105 0.0233 2.22\n", "suck VERB 86 53 0.0104 0.0317 3.05\n", "bother VERB 85 70 0.0103 0.0419 4.08\n", "lack VERB 82 49 0.0099 0.0293 2.96\n", "root VERB 78 28 0.0094 0.0168 1.78\n", "dislike VERB 75 47 0.0091 0.0281 3.10\n", "manipulate VERB 74 28 0.0089 0.0168 1.87\n", "introduce VERB 68 27 0.0082 0.0162 1.96\n", "fail VERB 68 64 0.0082 0.0383 4.66\n", "hide VERB 67 37 0.0081 0.0221 2.73\n", "walk VERB 66 21 0.0080 0.0126 1.57\n", "occur VERB 63 23 0.0076 0.0138 1.81\n", "blame VERB 61 19 0.0074 0.0114 1.54\n", "treat VERB 60 21 0.0073 0.0126 1.73\n", "sound VERB 60 20 0.0073 0.0120 1.65\n", "control VERB 57 18 0.0069 0.0108 1.56\n", "wake VERB 57 34 0.0069 0.0203 2.95\n", "pass VERB 56 29 
0.0068 0.0174 2.56\n", "invest VERB 55 28 0.0067 0.0168 2.52\n", "appear VERB 55 20 0.0067 0.0120 1.80\n", "support VERB 53 25 0.0064 0.0150 2.33\n", "serve VERB 52 17 0.0063 0.0102 1.62\n", "rise VERB 52 20 0.0063 0.0120 1.90\n", "marry VERB 52 19 0.0063 0.0114 1.81\n", "annoy VERB 52 28 0.0063 0.0168 2.66\n", "vote VERB 51 32 0.0062 0.0192 3.10\n", "drop VERB 51 36 0.0062 0.0215 3.49\n", "complete VERB 51 20 0.0062 0.0120 1.94\n", "adore VERB 49 16 0.0059 0.0096 1.62\n", "cut VERB 49 17 0.0059 0.0102 1.72\n", "step VERB 48 15 0.0058 0.0090 1.55\n", "warn VERB 47 19 0.0057 0.0114 2.00\n", "suggest VERB 47 17 0.0057 0.0102 1.79\n" ] } ], "source": [ "token_pos_types = ['ADJ', 'ADV', 'NOUN', 'PROPN', 'VERB']\n", "docs_group1 = [book_docs_group1[review_id] for review_id in book_docs_group1]\n", "docfreq_group1 = get_lemma_pos_df_index(docs_group1, keep_pron=True)\n", "\n", "docs_group2 = [book_docs_group2[review_id] for review_id in book_docs_group2]\n", "docfreq_group2 = get_lemma_pos_df_index(docs_group2, keep_pron=True)\n", "\n", "total_group1 = len(book_docs_group1)\n", "total_group2 = len(book_docs_group2)\n", "\n", "for pos_type in token_pos_types:\n", "    for term, freq in docfreq_group1.most_common(1000):\n", "        lemma, pos = term\n", "        if pos != pos_type:\n", "            continue\n", "        # fraction of reviews in each group that mention the term\n", "        prop_group1 = freq / total_group1\n", "        prop_group2 = docfreq_group2[term] / total_group2\n", "        # ratio above 1 means the term is relatively more common in group 2 (rating < 3)\n", "        prop = prop_group2 / prop_group1\n", "        if prop < 1.5:\n", "            continue\n", "        print(f'{lemma: <20}{pos: <6}{freq: >6}{docfreq_group2[term]: >6}{prop_group1: >8.4f}{prop_group2: >8.4f}{prop: >6.2f}')\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Adjectives**:\n", "\n", "- 'bad', 'strong', 'big', 'interesting', 'main', 'old', 'disappointed'\n", "\n", "**Adverbs**:\n", "\n", "**Nouns**:\n", "\n", "- 'character', 'page', 'author', 'point', 'plot', 'writing'\n", "\n", "**Proper nouns**:\n", "\n", "If we compare the proper nouns, we see that the negative reviews make a comparison to 
the Twilight series. \n", "\n", "**Verbs**:\n", "\n", "- 'get', 'die', 'kill', 'start', 'try', 'hate', 'tell', 'mean', 'care', 'spend', 'throw'\n" ] }, { "cell_type": "code", "execution_count": 459, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " dependency_type dependency_word dependency_pos dependency_freq \\\n", "19237 head bad ADJ 324 \n", "19239 head bad ADJ 324 \n", "43829 child bad ADJ 324 \n", "43830 child bad ADJ 324 \n", "43832 child bad ADJ 324 \n", "43833 child bad ADJ 324 \n", "43834 child bad ADJ 324 \n", "43835 child bad ADJ 324 \n", "43836 child bad ADJ 324 \n", "43838 child bad ADJ 324 \n", "43843 child bad ADJ 324 \n", "43845 child bad ADJ 324 \n", "43846 child bad ADJ 324 \n", "43848 child bad ADJ 324 \n", "43850 child bad ADJ 324 \n", "43851 child bad ADJ 324 \n", "43852 child bad ADJ 324 \n", "43853 child bad ADJ 324 \n", "43855 child bad ADJ 324 \n", "43856 child bad ADJ 324 \n", "43860 child bad ADJ 324 \n", "43862 child bad ADJ 324 \n", "43864 child bad ADJ 324 \n", "43865 child bad ADJ 324 \n", "43866 child bad ADJ 324 \n", "43867 child bad ADJ 324 \n", "43869 child bad ADJ 324 \n", "43870 child bad ADJ 324 \n", "43873 child bad ADJ 324 \n", "43874 child bad ADJ 324 \n", "43875 child bad ADJ 324 \n", "43876 child bad ADJ 324 \n", "43877 child bad ADJ 324 \n", "43881 child bad ADJ 324 \n", "43882 child bad ADJ 324 \n", "43883 child bad ADJ 324 \n", "43884 child bad ADJ 324 \n", "43885 child bad ADJ 324 \n", "43887 child bad ADJ 324 \n", "43888 child bad ADJ 324 \n", "43889 child bad ADJ 324 \n", "43892 child bad ADJ 324 \n", "\n", " tail_word tail_pos tail_freq dep_tail_freq liwc_category \n", "19237 year NOUN 102 1 relativ|time \n", "19239 character NOUN 1071 1 None \n", "43829 book NOUN 3681 20 None \n", "43830 term NOUN 24 1 quant|relativ|time \n", "43832 thing NOUN 519 13 None \n", "43833 part NOUN 71 8 funct|quant \n", "43834 mood NOUN 10 2 affect \n", "43835 outcome NOUN 15 1 None \n", "43836 ending NOUN 565 10 relativ|time \n", "43838 writing NOUN 133 4 social|cogmech \n", "43843 guy NOUN 75 10 None \n", "43845 choice NOUN 82 2 None \n", "43846 memory NOUN 29 2 None \n", "43848 person NOUN 134 3 social|humans \n", "43850 read NOUN 65 1 work|leisure \n", 
"43851 case NOUN 35 1 None \n", "43852 triangle NOUN 142 1 None \n", "43853 interest NOUN 47 1 None \n", "43855 situation NOUN 43 1 None \n", "43856 people NOUN 385 1 None \n", "43860 sequel NOUN 18 1 None \n", "43862 conclusion NOUN 95 3 None \n", "43864 reason NOUN 161 1 None \n", "43865 taste NOUN 19 4 None \n", "43866 aspect NOUN 27 1 None \n", "43867 loss NOUN 24 1 None \n", "43869 ass NOUN 22 4 swear|bio|body|sexual \n", "43870 stuff NOUN 32 1 funct|pronoun|ipron \n", "43873 model NOUN 12 1 None \n", "43874 good NOUN 13 1 affect|posemo \n", "43875 idea NOUN 119 1 cogmech|insight \n", "43876 decision NOUN 95 2 None \n", "43877 death NOUN 371 1 None \n", "43881 dialogue NOUN 14 1 None \n", "43882 game NOUN 187 1 None \n", "43883 series NOUN 925 1 quant \n", "43884 way NOUN 516 1 relativ \n", "43885 movie NOUN 149 1 None \n", "43887 sentence NOUN 44 1 None \n", "43888 boy NOUN 54 1 social|humans \n", "43889 one NOUN 45 1 funct|number \n", "43892 trilogy NOUN 304 1 None " ] }, "execution_count": 459, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tail_groupings = get_tail_groupings(docs_group2, docfreq_group2, token_pos_types, liwc, max_threshold=5000, min_threshold=10)\n", "\n", "tail_df = pd.DataFrame(tail_groupings)\n", "\n", "book_terms = ['book', 'novel', 'story', 'plot', 'character', 'twist', 'development']\n", "\n", "tail_df[(tail_df.dependency_word == 'bad') & (tail_df.tail_pos == 'NOUN')]\n" ] }, { "cell_type": "code", "execution_count": 453, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I question the plot, the writing, and the capricious actions of characters.\n", "Ofcourse, I was not expecting any Shakespearean writing in it but the whole book literally felt like that it was written by a 10-year-old; nothing really deep, not as exciting as the previous ones and a rush, out of nowhere ending! 
\n", "The writing style was still good, but I couldn't get into the storyline as deeply.\n", "The quality of the writing has seriously deteriorated, the main character has become a hysterical, trauma-ridden, selfish, self-pitying whiner, and the plot has taken a suicide leap. \n", "It seems that without the Arena to set the stage Collin's writing has stuttered out and Katniss has lost her fire. \n", "With the monotone writing, the monotone character of Katniss, and the monotone plot, the monotone and banal setting of District 13 doesn't help much.\n", "I also felt cheated by the writing in this installation.\n", "I obviously expect a war to be violent but the tone of the writing in this book (and the previous two) did not prepare me for the level of violence\n", "The world is still very interesting, and Collins' writing style is enjoyable.\n", "I was very disappointed with the quality of the writing and dwindling character development & storylines.\n", "I figured out how these books don't apply the regular paths of writing: three disasters, character's inner goal, external goal; having to decide between making the right decision or achieving the initial goal.\n", "As I was reading this, I slowly realized how bad the writing was.\n", "I know that I didn't like Hunger Games because the writing focused on so much details that really made no sense.\n", "The writing for the action/fighting scenes was awkward and unclear.\n", "Not because that's what the book's bad writing imposed.\n", "The book's writing was marvelous.\n", "Collins' writing style, which was so vivid and present in the first two books, is flattened to a dull, passive observation.\n", "What lazy writing.\n", "As for the writing - at times it rushed forward at the expense of clarity and at other times languished in the inconsequential, repetitive doldrums.\n", "Had the writing been stronger, less roughshod, then it would have sat better with me.\n", "I also didn't love the writing style.\n", "Writing was 
incredible.\n", "I would rate this a 3.5, but since that's not an option I am going with a 4 because I still find the writing quality and the overall message and themes to be very gripping and kept me glued.\n", "The writing is still brillant.\n", "The language and writing style didn't correspond to what actually happened. \n", "These irregularities in Collins writing stopped me from giving this a much higher rating.\n", "Suzanne Collins has a fantastic writing style and imagination.\n", ", I think Suzanne Collins just got tired of writing and everything went into super-hyper-wtf most-undeveloped-plot-ever speed.\n", "It poses as a Y.A., with its simple writing and juvenile style...\n", "I found the writing weak and needlessly long.\n", "** The quality of writing was piss-poor and long stretches of this were utterly dull.\n", "Can you take some initiative in your writing and explain this for us?)\n", "I mean, she didn't kill Peeta, Beetee, or Annie, which she could easily have done since none of them were particularly stable), and perhaps that's even what is supposed to have happened, but the writing just doesn't support it.\n", "Its like the author got tired of writing and just suddenly ended the book with very little resolution.\n", "All that great writing went out with a whimper, and I am mad about it!\n", "What a disappointment... 
I loved the first two, loved the author's style of writing and ability to keep you on the edge of your seat throughout the book -\n", "I was satisfied with the ending, but the writing was much sloppier, probably because Collins went from telling a story to telling an ideology.\n", "It seemed like Collins was just tired of writing, and so started throwing at a dartboard of characters to see who she would kill off next.\n", "There are several problems with the writing: \n", "The writing is strong but by Mockingjay I just did not care if the characters lived or died.\n", "I also was very frustrated with how crucial points were handled in the writing.\n", "Damn Suzanne Collins has a brilliant imagination and amazing writing skill.\n", "Hunger Games was great, Catching Fire was defendable, Mockingjay was just sloppy writing.\n", "I found the writing simplistic and could not get attached to characters in a meaningful way. \n", "I thought the writing was as strong in this book as the others and the imagery was outstanding.\n", "But I still like Collins' style of writing.\n", "The writing.\n", "THE BIGGEST WRITING FLAW\n", "I sure would like to see Ms Collins write a book with the scales balanced in the other direction because she IS a 5 star writing talent!\n", "This was the weakest of the trilogy, although the plot was more interesting than the second, the writing had gotten weaker.\n", "The writing was just TERRIBLE.\n", "but I felt the overall simplicity and poor structure of the writing was an insult to young readers.\n", "I'll still see the movie because the story intrigues me, but the writing has too many holes for me.\n", "And Collins employed my most hated writing device- knocking the main character unconscious during the climactic battle and wrapping up the story in a random rehashing of crucial events.\n", "yes the story overall was good but the writing was the worst I've read in a long time.\n", "Again, I think Collins ended with some lazy writing, excusable 
possibly by the fact that she was trying to get us to read into the character's head, but it was boring and dry. \n", "I've read all three books and although the writing was ok the stories and characters\n", "Writing.\n", "The writing, the intensity, the characters, the creativity....\n", "Seemed like the author got bored and wanted to hurry up and finish writing.\n", "The writing was good, but the plot of this book and the overall story bored me so much\n", "The end was even lazier than Collins' usual writing.\n", "Hell, there were moments of powerful writing in the final chapters.\n", "It really is a solid piece of writing on its own.\n", "The writing in the first book of the trilogy wasn't anything impressive, but I think it deteriorated greatly by the time the third book came around.\n", "Her writing is a copy and paste of one minute self loathing to \"\n", "and I sincerely apologize for my sloppy writing and overindulgence in run-on sentences!\n", "The bad writing, forgivable in the first 2 for good plot, is constantly in your face in this book.\n", "As for the writing?\n", "In the end my rating is more to do with my enjoyment of the book than the writing or story.\n", "the level writing and poor editing shown in Collins' last installment of the Hunger Game novels was unacceptable.\n", "Seriously, it is like when I am having my elementary students write a story that they get tired of writing, so they just say, \"\n", "A new triangle between leaders is squandered in the writing.\n", "They teach it in many writing workshops and seminars: \"Show, don't tell.\" \n", ", some heart-pumping, soul-inspiring lines (\"If we burn, you burn with us!\"), jaw-dropping surprises (I won't say more than The Return of Peeta), Collins' generally high-quality writing AND best of all a happy ending. 
\n", "It became a non-priority finish around page 100 when I didn't see where the action was going, and even the writing seemed to have changed a little.\n", "Both the writing and the content were not up to my expectations, and I felt very dissatisfied when I finished.\n", "The writing is second rate and so dumbed down a 4 year old has the IQ to follow and where the characters could have been built up and mad more complex\n", "While Collins has shown periods of brilliant writing, other passages leads one to wonder if she was under the influence of morphling herself.\n", "In evaluating this whole series, its biggest limitation is its poor writing.\n", "It was nice to see that in writing.\n", "I just don't like her writing and the character development.\n", "her writing is beautiful.\n", "I still love Suzanne Collins writing, she is an amazing writer but sometimes the plots' just fail.\n", "I was utterly taken aback with the many flaws in the writing and the utter lack of growth by the main character.\n", "His death was marred by horrible writing and I ended up feeling swindled and angry rather than remorse and sorrow. 
\n", "The writing lacked any emotion whatsoever.\n", "Everything feels authentically numb and unfeeling, a credit to Collins' writing style.\n", "Also, in my opinion, the quality of writing decreased in this novel.\n", "A friend of mine who originally suggested this series to me said something to the effect of \"You can tell by the writing that she loved writing the first book, endured writing the second book, and hated writing the third book.\n", "Without battle scenes to sustain the writing this novel shrivels and wilts.\n", "It was a little anti-climatic, I skipped a lot of unneeded writing.\n", "Bad writing.\n", "I think Collins got bored writing the story and felt like she had to conclude it as quickly - regardless of how sloppy and erratic the writing became - as possible.\n", "Very lazy, apathetic writing towards the end.\n", "It felt to me like Collins just became tired of writing, killed everyone off except for the final two characters to hopefully start their very own Garden of Eden.\n", "I found the book to be boring, unimaginative, and a significant lack of creative writing.\n", "But there were alot of problems with the writing, all of which, I hate to say, seems to go back to bad revising and editing. \n", "Because of this, the quality of writing suffered and then the book was choppy and difficult to follow. \n", "From a writing standpoint, it was seriously lacking in dialogue.\n", "It was like Collins was taking her merry time writing and then realized abruptly that she had a deadline.\n", "That said, the writing was pretty good, I guess. \n", "My problem with the book was the terrible writing and plot.\n", "{sigh again} Yes, it is great writing and full vivid images (\n", "Basically, I didn't feel that the writing was anything spectacular and I preferred the action of the arena as opposed to the bizarre young adult political tension that Mockingjay was driven by. \n", "To me, there was no growth for the author in her writing for this novel. 
\n", "The first of the Hunger Games trilogy was absolutely amazing, I was astounded by the writing and the whole story in general. '\n", "To be fare, some nice writing in this book, but a major let down regarding expectations.\n", "When an author's writing is two for two, you start getting really excited about the third one.\n", "I still think it's worth a 3 because if nothing else, her writing is very captivating\n", "that would be effin' brilliant writing! \n", ", IT'S JUST BAD WRITING.\n", "it had its good parts, but it felt like suzanne collins just got bored with writing and sped right through the story.\n", "I could hardly follow up the plot, I was not sure if it was because the author's fluid writing skill had deteriorated or because I was reading it in Katniss' voice.\n", "I apologize to Suzanne Collins--but I feel that perhaps her book schedule of editing and writing squeezed her too hard and left her unable to write the way she did for book one.\n", "She should have been given more time to finish writing and editing this book properly.\n", "It never promised to be more than young adult fiction, so I can't fault the writing so much.\n", "I finished the whole trilogy just because I really liked the idea of Panem and yearly Hunger games, but it was Twilight in terms od quality and writing skills.\n", "The themes and ideas were interesting, but the unbearable central character and the naive writing brought the book down.\n", "Seriously, I can't help but wonder if Ms. 
Collins suffered a major crisis during the writing of this book, had a horrific event take place in her childhood or if she is off her meds.\n", "Both the writing and the plot are rushed.\n", "To understand my rating, you have to understand that I love the writing.\n", "However, her writing is so strong that I had a strong visceral reaction to the story.\n", "This dulls the effectiveness of the great characters and writing.\n", "so you see, Katniss, after you passed out the others...\" The writing in this book is by far the least polished, and I had a lot of trouble following the chaotic street battle in the Capitol.\n", "Writing: 4 stars\n", "The writing is atrocious. \n", "but the lack of development, like Collins just got tired of writing.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "In my experience, the final book of a series is generally the best writing-wise.\n", "However, as any writer knows, long words do not equal good writing.\n", "To be able to experience Panem, its corrupted core, and everything else the book contains without being weighed down with the chintzy, high school-level writing will be very exciting. \n", "It is nothing personal, but I just can't stand the terrible writing, cliche-plot, irrational characters, and the whining teen-girl.\n", "In my view, that's just not good writing -- it's like seeing the stage hand during a play. 
\n" ] } ], "source": [ "from scripts.text_tail_analysis import has_lemma_pos, sentence_iter\n", "\n", "# Target lemma and part-of-speech tag to search for\n", "lemma = 'writing'\n", "pos = 'NOUN'\n", "\n", "# Print each sentence in docs_group2 that contains the target lemma with the target POS tag\n", "for sent in sentence_iter(docs_group2):\n", "    if has_lemma_pos(sent, lemma, pos):\n", "        print(sent)" ] }, { "cell_type": "code", "execution_count": 454, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "At times it's like reading a bad diary.\n", "The worst, though, was that I could see it all coming\n", "This is not my favorite book ever, but it's also not the worst book I've ever read.\n", "First of all, Katniss and Gale stop speaking on bad terms.\n", "This was bad....\n", "worst book in the series.\n", "And good and bad are not clearly defined black and white.\n", "Even in real life, no matter how bad things may be, there is always hope.\n", "Katniss is likely suffering from some form of PTSD, but rather than use the horrific experiences that keep piling up on Katniss as a jumping off point to explore what war-like events do to younger people (such as child soldiers), Collins shuffles around the whole issue by constantly throwing more bad things at Katniss so any recovery or clarity she gains is immediately destroyed by what happens next. 
\n", "But worse than this is the final fifty pages of the book when the twist is revealed, and I was left completely not caring because all the build up to make Katniss strong, to have her work through her pain, to have her continue forward because what she's doing it right, all of it gets cancelled out for a shock value plot twist that reeks of trying to make a point and failing.\n", "And the worst part is that this all happens over the course of a couple of pages.\n", "Maybe I'm tired and in a bad mood, didn't have a particularly good week, maybe that's making me go more annoyed at the book.\n", "And yes - that made me very emotional with Katniss and Buttercup on the same bad, both crying over Prim.\n", "so I steeled myself for the worst possible outcome because I obsessively fell in love with this series.\n", "We stay right there and see all Katniss sees, good or bad the whole time.\n", "too bad.\n", "Worst ending for a book series.\n", "She would be incredibly mean to people in the capital, and then the next she would suddenly feel bad. 
\n", "As I was reading this, I slowly realized how bad the writing was.\n", "I guess it was a good story, but it's always a bad sign when you really couldn't care less if the main character dies (and at some points, really hope that it happens).\n", "Not because that's what the book's bad writing imposed.\n", "Also, although the first two books are a bit gruesome, I didn't think they would be too bad for kids to read because Collins doesn't go into too much graphic detail, or dwell too long on the deaths and make them unbearable.\n", "I didn't mind it in the beginning, and even expected it; after all Peeta has been captured, her home has been destroyed, and she's been through hell and back, only to find out the worst is yet to come.\n", "Then there's what happened to Peeta - that may have been the worst part for me.\n", "The story wasn't bad\n", "Perhaps most impressively the books lack a clear good and bad narrative - Katniss herself is seen by others (and often by herself) as unemotional and manipulative in her triangulation between Peeta and Gale, while in turn being manipulated by others particularly in the second book and through the third.\n", "This book had the worst ending known to man!\n", "This book was one of the worst books I've ever read, near the end\n", "the writing[in my opinion] was just bad\n", "You've got to take the good with the bad\n", "the worst of the trilogy, and honestly the love story was never my favorite part of the books.\n", "It was bad enough that her beloved Cinna was killed off, Finnicke killed just after his wedding to the love of his life, Peeta's memories Hijacked, and Gale badly wounded over and over again, Ms Collins then decides to land the biggest blow of all by having Prim blown to bits.\n", "The first person narrative made it all the worse.\n", "War is bad?\n", "Even her relationship with her mother was the same, if not worse, than it was at the start of the series....\n", "The story is full of intrigue, excitement, action 
and suspense, but there's that lingering \"which guy will she choose\" trope that gets worse and worse as it goes along. \n", "Those are bad stats, Suzanne Collins. \n", "I hated Peeta throughout all three books and it got even worse when he announced that he loved her in the interview.\n", "I almost felt like the bad guys won on this one, and I didn't like it.\n", "And as if that isn't bad enough, there was still so much promise throughout the whole story that never comes to fruition! \n", "I mean, I expect bad endings with some authors like Nicholas Sparks, but I was completely thrown on this one.\n", "Even while expecting the worst, I didn't mind the first half of the book.\n", "There are probably going to be people throwing shoes at me now, but do your worst.\n", "It was worse than the first book, but better than the second.\n", "The novel reads like a bad space opera complete with shallow characters and a functional single level plot.\n", "To compound this bad structural choice, what action that does occurs is short-circuited.\n", "Bad, bad, so bad.\n", "It's too bad :(.\n", "Like a bad memory, I shall quietly sweep the memory of this book under a rug and then roll the rug up and throw it in the garbage, vacuum the leftover dirt (memories) and never think of the rug dirt again but if it happens to seep into my memories I will hastily override that memory with anything else that comes to mind.\n", "Finnick, the victor; Finnick, who had just been married after what is possibly the worst imaginable existence -\n", "The worst book I've ever read.\n", "Worst ending possible.\n", "I have to assume nothing changed in the government and it's just as bad as it was.\n", "Does that make them bad\n", "It was rude the way he said she was ugly and a bad person and a liar\n", "The worst part is, that it can't be undone, it can't be rewritten, the series is forever tainted. 
\n", "Nothing gets resolved, everything ends worse, and the whole thing is a giant unsatisfying downer.\n", "is bad and humans are dumb.\n", "It could be worse.\n", "I first thought it was a good book with a depressing message; now I realize it is a bad book with a condescending, obvious (and poorly delivered) message. \n", "And it only makes it worse that it's geared to teens. \n", "The series got worse and worse.\n", "At its worst, this series reminds me of the movie Idiocracy, which condemns scatalogical humor as a major theme, then proceeds to appeal to the audience with only scatological humor for 2 hours.\n", "and although she does feel bad for some of the destruction that she has inadvertedly caused Katniss just does wat she wants\n", "Honestly, finishing it just made me in a bad mood.\n", "Honestly one of the worst books I've ever read.\n", "But nothing was worse than the her fate for the Big Bad, Snow.\n", "I feel really bad giving this a two-star rating\n", "I should've stopped with the first as each in the series has gotten progressively worse for me.\n", "A couple of times, Katniss gets injured, things look bad and then CUT!\n", "The best thing I can say about it is that it wasn't as bad as the Harry Potter epilogue.\n", "The whole thing was a bit of a mess, but it could've been worse.\n", "but overall it wasn't a bad read.\n", "This book was the worst of the 3 in the series.\n", "That may make me a bad person due to the horrors she has seen\n", "And what's worse, President Snow has made it clear that no one else is safe either.\n", "Not bad.\n", "Was it better, worse....\n", "Of course, there's nothing bad in a little \"razzle dazzle\" (as Plutarch would say) when it comes to slaughtering children (Or, wait...\n", "But not even the worst case of PTSD could justify the level of age regression and brokeness we are shown in this story! 
\n", "The HG must be one of THE worst love triangle in history!!!!!!\n", "And while we're on that subject, might I point out how I believe that Katniss could very well be one of the worst love interest in any love triangle in history????!!!!!\n", "to their guy, or anyone for that matter, because someone else wants it so bad.\n", "definetly the worst book in the series.\n", "Now the third book will just be some crazy gaga, some impossible rebels that turn out to be worse than the people we thought were bad guys.\n", "There was just so much focus on the bad and nothing good.\n", "Ok, I completely understand everyone saying that this was a very bad situation, and that war is hard, and that Katniss, of course, is suffering from PTSD.\n", "Really bad ending to the series. \n", ", I still feel there is something left to say, I want to know more, which to me, is the worst thing that can happen when I finish a book. \n", "\"War is BAD!\n", "If you fight back, you are JUST as bad!\"\n", "\"You all don't fight - You'll be bad people!\n", "I get the dark \"war is bad\" thing.\n", "What was worse was that Katniss was stripped of any decency she had and\n", "Worst of all, Gale is stripped of any closure.\n", "This is worst of the three, and feels like meeting the deadline.\n", "For me it was one of those things where I felt the worst had already happened to her, so it was hard for me to think the conditions she was in were really that terrible. 
\n", "I was tempted to give this a 1-star, but okay this book was bad but not to that extent.\n", "Worst of the series - extremely violent, not very well written, and pretty confusing.\n", "Definitely the worst of the series.\n", "I understand a lot is going on, but her reactions are not consistent, this was a minor problem in the first book that was only made worse now.\n", "More like a 1.5 than 2 stars, because this is worst that the previous\n", "One of the worst sequels I've ever read, which is truly disappointing considering\n", "The only reason she ended up with Peeta was because Gale had the bad luck of being involved with the bombs and was whisked out of the story.\n", "Would it have been so bad for her to kill them?\n", "Very slow and boring, the worst of the 3\n", "yes the story overall was good but the writing was the worst I've read in a long time.\n", "To live there is worse than to live in prison.\n", "I could go on and on and on with things like this, but I think you get the point why this is a bad book.\n", "Seriously, it's almost as if Collins had a brain storming session, \"Let's see, what's the absolute worst possible thing that could happen to a person?\n", "Bad Things - Gale and Katniss shoot war planes down with bows and trick arrows.\n", "The worst love triangle conclusion since Pretty in Pink. 
\n", "And trust me with all the wars, games and gore in these books, they did not seem as bad as that anti climatic ending.\n", "And whats worst, Katniss was all for it.\n", "they had her give in to someone else's pressure, which is one of the worst possible reasons to become a parent.\n", "It teaches us that humanity is a nothing but bad, that the only way to resolve problems if fighting fire with fire, to forget about tolerance and forgiveness and to make people pay.\n", "I have nothing bad to say about Collins writing and her books.\n", "My rating on this book is very low, not because it was bad, but because by the end of the series I wished I have never began them.\n", "The thoughts this story left me with gave a bad taste in my mouth and horrible images in my mind.\n", "What's worse than living under a fascist regime?\n", "I can't say enough bad things about this book or how this series was ended.\n", "I read a lot of the negative reviews posted on Goodreads, which I still agree with them all, but after reading the book myself these negatives didn't seem as bad as when I read those reviews.\n", "Now whether that's because I was already prepared for them or the book wasn't as bad as they said, hmmm, IDK. 
\n", "it is by far the worst of the series.\n", "This is the worst of the series, it was choppy and sloppily written.\n", "Probably the worst aspect of the series was the character of Katniss who is a lazy, self-centered, mean and immature junkie who for some reason is in the middle of a love triangle (the boys seem sensible apart from this choice) and also the leader of the revolution.\n", "the promise that life can go on, no matter how bad our losses.\n", "Not saying the book was bad\n", "I almost feel bad for Gale, because he never really had a chance with her.\n", "But this one, should have been as The Hunger Games made me believe: a story about a girl that had the qualities to be admired, the strength to overcome the worst and the capacity to become a hero.\n", "The story wasn't as captivating, the names got worse (Leeg 1 and Leeg 2?\n", "the worst book of the series.\n", "To many reviews already on this book both good and bad.\n", "As if to confirm by suspicions we are renewed with the flip at the end of the book: I'll tell them that on bad mornings, it feels impossible to take pleasure in anything because I'm afraid it could be taken away. 
\n", "Bad ending hard to even finish was very let down after the first two books\n", "But worst than all of that is the pacing!\n", "The ending was the worst.\n", "So is that good or bad?\n", "I know war is bad and effects everyone in tragic ways, but still Suzanne!!\n", "You made Katniss seem like such a bad-ass throughout the entire series and in the end you make her pick THE F**KING DANDELIONS over the FIRE?!\n", "I usually finish books in a day or a week depending on how good or bad it is.\n", "I understand that it turned all of his good memories to bad, but it shouldn't really affect his personality in any way...\n", "Hey, this one wasn't actually as bad as the second!\n", "In Mockingjay, all these traits are scrapped and we get a Katniss-clone who is angsty and bitchy and whiny (wasn't Bella in Twilight bad enough?).\n", "Not only did she not improve herself from the first book (she was kickass in the first book btw), she got WORSE, an empty shadow of her former self.\n", "And good and bad are not clearly defined black and white.\n", "Even in real life, no matter how bad things may be, there is always hope.\n", "One of the worst books I have ever read.\n", "The bad writing, forgivable in the first 2 for good plot, is constantly in your face in this book.\n", "Bad stuff happens.\n", "Another book I feel bad that I am going to piss readers off with my review. \n", "Too bad.\n", "That was the worst part of the whole book.\n", "It felt like Breaking Bad for YA.\n", "But as before, this is a bad choice.\n", "In fact, it's even worse in this book because Katniss spends most of her time as everyone else's sock puppet.\n", "War is bad.\n", "Catching Fire wasn't too bad.\n", "I can stomach stories where bad things happen\n", "This is not to say it is a bad book.\n", "Worst of all, she never stepped up.\n", "Character deaths went from compelling and tragic in the first book to cheap attempts to show that War Is Bad in this book. 
\n", "It's bad when it's 140 pages in and\n", "It's too bad, there are so many things in here that could have been amazing and truly emotional to explore.\n", "As I said I dont read these kind of books and the war issue in the Mockingjay brought too much bad memories on the surface. \n", "That is not to say that this one is very much worse than the other two, but just that I am having second thoughts about the whole trilogy.\n", "she's kissing every on that just a bad rule model for a teenage \n", "When I return: the good, the bad, and what the page flip between Chapters 24 and 25 mean for this book and the series. \n", "Here are the reasons why this book is bad: 1.\n", "Book was worst of the series, became very predictable, and ended just like I thought it would.\n", "The worst book of the series.\n", "All I can say is that for me this is the worst from the three books even if the ending is quite satisfying. \n", "You want an example of bad conclusion of a trilogy/series that started off on the highest tops? \n", "All in all, worst ending ever!\n", "It was worse than a B Western in the movies.\n", "The bad guys were so obvious and yet Katniss didn't see them.\n", "Is this the worst and least coherent part of the series, or a point made, stating that there is no right or logical side of war? 
\n", "It was unnecessary, and it astounds me that neither Collins or her editor thought it would be a bad idea to include such an incessantly long explanation.\n", "Katniss turned out to have way too many issues, got injuried every other chapter and after a while I was just hoping for it to end without being worse and worse as a I read.\n", "After waiting in anticipation for both Mockingjay and Catching Fire, the series has gotten progressively worse.\n", "This is the only series I've ever read that gets progressively worse.\n", "that was the worst ending to such a great series!\n", "By this third book, I knew that certain main characters would live (they do) that Katniss would make one bad decision after another in complete defiance of all good sense or advice (she does), that every other important character in the book would stand up in defense of her bad judgment, make excuses for her behavior and protect her (they do), that it would all get passed off as stress and mental damage from her participation in the Games (it is) and that there would be no ramifications that really mattered (there aren't). \n", "she drifts along only making everything worse and 25 years of her life are wrapped up in the last 5 pages of the book without much explanation and no psychological investment. \n", "It's really too bad.\n", "This was an ok book with a good moral, \"War is bad\".\n", "And what's worse, President Snow has made it clear that no one else is safe either.\n", "worst of the 3 books\n", "The books have gotten progressively worse, almost as though they were destined to be made into movies. \n", "I mean I think Snow should have had a worse death\n", "Wow it sucked so bad\n", "Say to the world that Coin was as bad--or would be as bad--as Snow and how they needed to be careful.\n", "Seeing my favorite characters mangled and insane at the end was maybe the worst scenario.\n", "This can be a good or bad thing depending on the tastes of the reader. 
\n", "It was repetitive but still not bad.\n", "Too bad---loved The Hunger Games, but am now disillusioned.....\n", "** Bad.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "It's hard to begrudge Collins for concentrating on the complexities of war, where there are no good guys and bad guys, only uneasy alliances, or giving Katniss such a lot to think about.\n", ", so it wasn't the worst thing I've ever read.\n", "After the death of a beloved character, Collins composed a passage that I think aimed for high-brow narrative of loss but came across as an LSD-fueled bungle of bad descriptive poetry; (5)\n", "Things get worse from there. \n", "The worst ending I could have imagined.\n", "Too bad. \n", "Out of the three books, this one I thought was the worst.\n", "Can't really say I was expecting much, I find that the third books in trilogies to often be worse than the first two.\n", "\"it's all my fault, I'm a bad person\" business.\n", "Yes, war is bad.\n", "Everything gets even worse in this book.\n", "Bad dialogue.\n", "Bad writing.\n", "When I reached to the ending, it got even worse.\n", "Not Mockingjay, what a bad ending to a decent series.\n", "One of the WORST endings to a trilogy I have ever seen.\n", "I was hoping for some kind of tech warfare or economic warfare or some good ol' Ludlum-like lie/spy/get the bad guy.\n", "I can't believe how bad it was.\n", "Seriously, to think that you single-handedly have the best reason to bullet the target bad guy\n", "One of the worst books that I have ever read!\n", "Having been warned, I was expecting the worst and hoping for the best from this book, and I landed somewhere in the middle.\n", "But there were alot of problems with the writing, all of which, I hate to say, seems to go back to bad revising and editing. 
\n", "I loved how the last line was,'But there are much worse games to play'.\n", "A lot of the story can be attributed to her actions, for better or for worse.\n", "Is I bad that my favorite part of this book was reading the reviews of broken-heated readers?\n", "She was tough, bad ass, smart, etc.\n", "In short, I think the author covered a weak plot with action and violence and trying to make Katniss into a bad-ass rebel that didn't fit with the set up from the previous books.\n", "*** I officially take back every bad thing I wrote about Catching Fire.\n", "How bad is Mockingjay?\n", "In fact she made all the other characters worse because they loved her...\n", "And the end was really bad. \n", "The trilogy gets worse as it progresses.\n", "This final chapter is the worst of the three. \n", "However I wouldn't call the series bad.\n", "Why does this President Snow (bad guy) or anybody else in this universe give a rip about what Katniss says or does as she mopes around underground?\n", "You would think that now both our protagonist have survived the games they would have a happy life, well not exactly.. they run into a bit of bad luck\n", "I couldn't feel bad for her\n", "A bad ending\n", "And the ending is the worst half-ass ending ever for a book.\n", "This was literally one of the worst book I have ever read. 
\n", "** i'm sorry but this book was really bad \n", "The first book sets our hero up as a self-sufficient bad-ass survivor.\n", "I feel bad for Peeta that he's stuck with Katniss as the only emotional link to his forgotten past.\n", "The secound book was even worse than this one so don't bother with that.\n", "There are so many reasons why I thought this was a bad book.\n", "Worst of the 3.\n", "While that results in lots of angst and inevitable reader frustration, it doesn't mean it's a bad route to take - lots of people who had been put into Katniss's position\n", "His parting script should read, \"Thank you, Suzanne, for making it virtually indistinguishable who is the bad guy in this book.\n", "This only horrifies the reader and turns them off in the worst way possible.\n", "This series of books just got worse and worse....\n", "I was beginning to wonder who the good guy was, and who the bad guy was.\n", "Anyways, this review is really bad, but I really need to get started on all my homework which I chose to do at the end of the book. \n", "We knew Coin was bad from the beginning.\n", "What make it worst is, katniss's too often mental breakdown, that triggered with everything.\n", "Finally, the ending is the worst and make me hope suzanne Collins will announce that she is disappointed with her book and will proceed to write the real ending.\n", "Mockingjay has a bad pass at the latter.\n", "Worse, Katniss lapses into a passive heroine.\n", "I thought the movies are bad but\n", "but god honestly this book is so bad...\n", "The sum of the experience , mirrors my sentiment towards each of the books: it is not bad, but it is not good.\n", "I can't believe how bad this one was.\n", "It was so cheesy and something out of a bad 80s Sci-fi movie. 
\n", "The ending was bad.\n", "Really bad.\n", "I can't even say it was bad but good, just terrible.\n", "Bad\n", "I think \"Bright and early the next morning, the brains assemble to take on the problem of the Nut\" is possibly the worst sentence ever written in a YA book.... EVER!\n", ", IT'S JUST BAD WRITING.\n", "BAD DECISION.\n", "It left such a bad taste in my mouth so that whenever anyone mentions the series, all I can think about is how they end up.\n", "I know it wasn't THAT bad, but this was probably the closest I've ever come to writing something with what could be described best as \"burning, flesh-rotting rage\".\n", "\"And I was thinking,\" Collins continued, \"That we could make Gale, our 'bad boy,' into a huge terrorist.\n", "But worst of all was the way Katniss ends up with Peeta.\n", "The worst was all the characters that we have become attached to that have died (spoiler Alert) especially Prim.\n", "In the end, Katniss is extremely depressed and the example she presents is about the worst one I think of.\n", "This is one of the worse books I have ever read.\n", "Too bad.\n", "The ending of the Hunger Games left a bad taste in my mouth.\n", "It just went from bad to worse, to will someone please just slit my wrists and put me out of my misery?! 
\n", "This is definitely the worst book of the series.\n", "The worst part of it is that the focus leaves the world and the war, and centers more on Katniss's boy troubles.\n", "but I feel this ending was the worst.\n", "WAR IS BAD.\n", "We understand that it's especially bad for children, too\n", "This book is disturbing, and left me with a bad taste in my mouth.\n", "Bad conclusion to a series.\n", "I did however love the ending where the leader of the rebellion was just as bad as President Snow\n", "This was a huge disappointment - it was a huge surprise to me how bad this third book was\n", "The ending wasn't too bad though, could've been worse.\n", "That whole last third of the book, Finnick's death was THE WORST.\n", "THAT IS BAD,\n", "EQUALS= BAD BOOK.\n", "I FEEL BAD\n", "Worst of the series she made me not care about people who later die , and or characters who survive and I hate the ending where the hunger games continue what did they fight a war for ?\n", "The last mission is the worst in terms of pointlessness.\n", "The worse thing is that you know that this could have been epic (like all Spartacus and 300 like).\n", "Would it really have been a bad thing to give one character an actual happy ending?\n", "And when Peeta came back brainwashed thinking she was the bad guy after he gets rescued from his captors and tries repeatedly to kill her, I was convinced she would end up with Gail.\n", "Here she's confused and torn, in a love triangle that has been building up but suddenly feels underdeveloped and is frankly dull, and fighting to keep her identity in a rebellion that doesn't have any redeeming traits and seems just as bad - if not worse - than the evil Capitol domination of them.\n", "Other than feeling bad for herself for loving two people, she never really considered it at all, or truly acted upon it.\n", "Too bad.\n", "This was worse than the second book.\n", "Worst part - everything else.\n", "Progressively worse\n", "I waited so much for this 
book that at the end it just let me down so bad.\n", "It's unfortunate that this seems to be common practice for the last book of a series (Harry Potter, Twilight), but for this book it seems particularly misguided, because Mockingjay is (a) no longer than the first two books, and (b) without a doubt the worst book in the trilogy.\n", "The worst part for me was the lack of character development, I was expecting anything tragic to happen to any character including Katniss\n", "she was also still haunted with her bad dream, not really a good ending to me....\n", "No, Katniss has flaws, she has depth, and we watch her struggle and punish herself and we see both her good and bad sides. \n", "And the head of the Capitol, President Snow, felt like a cardboard cutout of a typical bad guy/tyrannical ruler.\n", "All in all, not a bad trilogy.\n", "I found her at best whiny and at worse annoyingly weak willed.\n", "What makes the worst even worse is that it seems Suzanne Collins is attempting to justify all of her actions by making them not her fault and thus make Katniss unnameable for the books plot.\n", "I find this to be the worse book in the series.\n" ] } ], "source": [ "from scripts.text_tail_analysis import has_lemma_pos, sentence_iter\n", "lemma = 'bad'\n", "pos = 'ADJ'\n", "\n", "for sent in sentence_iter(docs_group2):\n", " if has_lemma_pos(sent, lemma, pos):\n", " print(sent)" ] }, { "cell_type": "code", "execution_count": 439, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "great ADJ 1713 205 0.2072 0.1227 0.59\n", "happy ADJ 1209 127 0.1462 0.0760 0.52\n", "sad ADJ 999 128 0.1208 0.0766 0.63\n", "amazing ADJ 929 86 0.1123 0.0515 0.46\n", "perfect ADJ 556 23 0.0672 0.0138 0.20\n", "dark ADJ 436 58 0.0527 0.0347 0.66\n", "realistic ADJ 367 44 0.0444 0.0263 0.59\n", "wonderful ADJ 310 23 0.0375 0.0138 0.37\n", "excellent ADJ 258 15 0.0312 0.0090 0.29\n", "fantastic ADJ 257 29 0.0311 0.0174 0.56\n", "powerful ADJ 223 23 0.0270 
0.0138 0.51\n", "brilliant ADJ 223 28 0.0270 0.0168 0.62\n", "satisfied ADJ 215 22 0.0260 0.0132 0.51\n", "beautiful ADJ 209 11 0.0253 0.0066 0.26\n", "intense ADJ 191 21 0.0231 0.0126 0.54\n", "certain ADJ 190 25 0.0230 0.0150 0.65\n", "heartbreaking ADJ 153 8 0.0185 0.0048 0.26\n", "surprised ADJ 153 17 0.0185 0.0102 0.55\n", "unexpected ADJ 147 18 0.0178 0.0108 0.61\n", "bittersweet ADJ 143 9 0.0173 0.0054 0.31\n", "fitting ADJ 136 6 0.0164 0.0036 0.22\n", "brutal ADJ 133 14 0.0161 0.0084 0.52\n", "sweet ADJ 129 15 0.0156 0.0090 0.58\n", "favourite ADJ 122 15 0.0148 0.0090 0.61\n", "incredible ADJ 119 12 0.0144 0.0072 0.50\n", "shocking ADJ 108 11 0.0131 0.0066 0.50\n", "quick ADJ 97 12 0.0117 0.0072 0.61\n", "ready ADJ 95 11 0.0115 0.0066 0.57\n", "pleased ADJ 92 3 0.0111 0.0018 0.16\n", "impossible ADJ 87 10 0.0105 0.0060 0.57\n", "fast ADJ 85 8 0.0103 0.0048 0.47\n", "tough ADJ 85 11 0.0103 0.0066 0.64\n", "broken ADJ 80 9 0.0097 0.0054 0.56\n", "wrenching ADJ 70 5 0.0085 0.0030 0.35\n", "raw ADJ 67 5 0.0081 0.0030 0.37\n", "pretty ADJ 67 8 0.0081 0.0048 0.59\n", "appropriate ADJ 63 7 0.0076 0.0042 0.55\n", "worried ADJ 61 5 0.0074 0.0030 0.41\n", "devastating ADJ 61 1 0.0074 0.0006 0.08\n", "bright ADJ 58 7 0.0070 0.0042 0.60\n", "riveting ADJ 58 3 0.0070 0.0018 0.26\n", "fabulous ADJ 56 7 0.0068 0.0042 0.62\n", "flawed ADJ 55 6 0.0067 0.0036 0.54\n", "mixed ADJ 54 6 0.0065 0.0036 0.55\n", "unique ADJ 52 6 0.0063 0.0036 0.57\n", "open ADJ 49 4 0.0059 0.0024 0.40\n", "current ADJ 49 6 0.0059 0.0036 0.61\n", "fictional ADJ 48 2 0.0058 0.0012 0.21\n", "definitely ADV 624 68 0.0755 0.0407 0.54\n", "highly ADV 217 16 0.0262 0.0096 0.36\n", "emotionally ADV 199 25 0.0241 0.0150 0.62\n", "matter ADV 149 16 0.0180 0.0096 0.53\n", "perfectly ADV 143 15 0.0173 0.0090 0.52\n", "slightly ADV 131 16 0.0158 0.0096 0.60\n", "nicely ADV 91 5 0.0110 0.0030 0.27\n", "beautifully ADV 57 4 0.0069 0.0024 0.35\n", "differently ADV 56 5 0.0068 0.0030 0.44\n", "necessarily ADV 54 6 
0.0065 0.0036 0.55\n", "rarely ADV 47 6 0.0057 0.0036 0.63\n", "heart NOUN 575 51 0.0695 0.0305 0.44\n", "read NOUN 573 65 0.0693 0.0389 0.56\n", "question NOUN 314 30 0.0380 0.0180 0.47\n", "tear NOUN 277 17 0.0335 0.0102 0.30\n", "loss NOUN 264 24 0.0319 0.0144 0.45\n", "job NOUN 264 35 0.0319 0.0209 0.66\n", "future NOUN 252 22 0.0305 0.0132 0.43\n", "turn NOUN 199 26 0.0241 0.0156 0.65\n", "horror NOUN 189 16 0.0229 0.0096 0.42\n", "night NOUN 183 17 0.0221 0.0102 0.46\n", "favorite NOUN 172 14 0.0208 0.0084 0.40\n", "reading NOUN 154 19 0.0186 0.0114 0.61\n", "change NOUN 144 19 0.0174 0.0114 0.65\n", "journey NOUN 142 13 0.0172 0.0078 0.45\n", "struggle NOUN 136 15 0.0164 0.0090 0.55\n", "ride NOUN 133 8 0.0161 0.0048 0.30\n", "effect NOUN 131 16 0.0158 0.0096 0.60\n", "nature NOUN 119 9 0.0144 0.0054 0.37\n", "peace NOUN 118 15 0.0143 0.0090 0.63\n", "tribute NOUN 118 13 0.0143 0.0078 0.55\n", "adventure NOUN 114 13 0.0138 0.0078 0.56\n", "politic NOUN 109 12 0.0132 0.0072 0.54\n", "truth NOUN 107 13 0.0129 0.0078 0.60\n", "tale NOUN 105 13 0.0127 0.0078 0.61\n", "today NOUN 99 9 0.0120 0.0054 0.45\n", "ability NOUN 95 9 0.0115 0.0054 0.47\n", "evil NOUN 87 11 0.0105 0.0066 0.63\n", "answer NOUN 86 10 0.0104 0.0060 0.58\n", "plenty NOUN 85 4 0.0103 0.0024 0.23\n", "seat NOUN 85 3 0.0103 0.0018 0.17\n", "justice NOUN 84 11 0.0102 0.0066 0.65\n", "consequence NOUN 81 8 0.0098 0.0048 0.49\n", "side NOUN 81 7 0.0098 0.0042 0.43\n", "cost NOUN 79 6 0.0096 0.0036 0.38\n", "entertainment NOUN 75 5 0.0091 0.0030 0.33\n", "turner NOUN 74 8 0.0089 0.0048 0.53\n", "anger NOUN 71 8 0.0086 0.0048 0.56\n", "roller NOUN 67 8 0.0081 0.0048 0.59\n", "coaster NOUN 67 6 0.0081 0.0036 0.44\n", "morning NOUN 66 6 0.0080 0.0036 0.45\n", "copy NOUN 62 7 0.0075 0.0042 0.56\n", "school NOUN 58 6 0.0070 0.0036 0.51\n", "discussion NOUN 57 7 0.0069 0.0042 0.61\n", "cruelty NOUN 57 4 0.0069 0.0024 0.35\n", "punch NOUN 57 2 0.0069 0.0012 0.17\n", "commentary NOUN 53 7 0.0064 0.0042 
0.65\n", "cover NOUN 52 4 0.0063 0.0024 0.38\n", "gut NOUN 51 6 0.0062 0.0036 0.58\n", "aftermath NOUN 50 6 0.0060 0.0036 0.59\n", "dandelion NOUN 50 4 0.0060 0.0024 0.40\n", "damage NOUN 49 5 0.0059 0.0030 0.50\n", "father NOUN 48 4 0.0058 0.0024 0.41\n", "genius NOUN 47 3 0.0057 0.0018 0.32\n", "Everdeen PROPN 257 34 0.0311 0.0203 0.65\n", "series PROPN 124 13 0.0150 0.0078 0.52\n", "Trilogy PROPN 124 9 0.0150 0.0054 0.36\n", "Quell PROPN 71 9 0.0086 0.0054 0.63\n", "Quarter PROPN 69 9 0.0083 0.0054 0.65\n", "MOCKINGJAY PROPN 62 7 0.0075 0.0042 0.56\n", "cry VERB 704 53 0.0851 0.0317 0.37\n", "recommend VERB 496 56 0.0600 0.0335 0.56\n", "break VERB 414 53 0.0501 0.0317 0.63\n", "reread VERB 209 16 0.0253 0.0096 0.38\n", "face VERB 161 19 0.0195 0.0114 0.58\n", "pack VERB 136 15 0.0164 0.0090 0.55\n", "thank VERB 124 15 0.0150 0.0090 0.60\n", "surprise VERB 104 8 0.0126 0.0048 0.38\n", "provoke VERB 100 3 0.0121 0.0018 0.15\n", "laugh VERB 88 7 0.0106 0.0042 0.39\n", "haunt VERB 83 6 0.0100 0.0036 0.36\n", "review VERB 83 11 0.0100 0.0066 0.66\n", "grip VERB 83 8 0.0100 0.0048 0.48\n", "answer VERB 73 1 0.0088 0.0006 0.07\n", "devour VERB 69 5 0.0083 0.0030 0.36\n", "relate VERB 69 9 0.0083 0.0054 0.65\n", "escape VERB 67 8 0.0081 0.0048 0.59\n", "predict VERB 66 3 0.0080 0.0018 0.22\n", "be VERB 64 5 0.0077 0.0030 0.39\n", "unfold VERB 60 2 0.0073 0.0012 0.16\n", "inspire VERB 56 6 0.0068 0.0036 0.53\n", "overthrow VERB 54 7 0.0065 0.0042 0.64\n", "admire VERB 54 7 0.0065 0.0042 0.64\n", "disagree VERB 54 7 0.0065 0.0042 0.64\n", "share VERB 51 3 0.0062 0.0018 0.29\n", "sob VERB 47 2 0.0057 0.0012 0.21\n", "volunteer VERB 46 5 0.0056 0.0030 0.54\n", "reflect VERB 46 3 0.0056 0.0018 0.32\n" ] } ], "source": [ "for pos_type in token_pos_types:\n", " for term, freq in docfreq_group1.most_common(1000):\n", " lemma, pos = term\n", " if pos != pos_type:\n", " continue\n", " prop_group1 = freq / total_group1\n", " prop_group2 = docfreq_group2[term] / total_group2\n", " 
prop = prop_group2 / prop_group1\n", " if prop > 0.66:\n", " continue\n", " print(f'{lemma: <20}{pos: <6}{freq: >6}{docfreq_group2[term]: >6}{prop_group1: >8.4f}{prop_group2: >8.4f}{prop: >6.2f}')\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Terms that are clearly less common in group 2 than in group 1 (ratio of proportions at most 0.66), by part of speech:\n", "\n", "- adjectives: 'great', 'happy', 'sad', 'amazing', 'perfect', 'dark', 'realistic'. In these reviews, 'dark' is probably meant as praise.\n", "- adverbs: 'definitely', 'highly', 'emotionally', 'perfectly', 'beautifully'\n", "- nouns: 'heart', 'read', 'question', 'tear'\n", "- proper nouns: 'Everdeen', 'series', 'Trilogy', 'MOCKINGJAY'\n", "- verbs: 'cry', 'recommend', 'reread', 'thank', 'provoke', 'surprise', 'grip', 'devour', 'relate'\n" ] }, { "cell_type": "code", "execution_count": 437, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "good ADJ 3199 594 0.3869 0.3555 0.92\n", "little ADJ 1204 208 0.1456 0.1245 0.85\n", "real ADJ 1151 183 0.1392 0.1095 0.79\n", "final ADJ 867 178 0.1048 0.1065 1.02\n", "favorite ADJ 618 100 0.0747 0.0598 0.80\n", "sure ADJ 592 117 0.0716 0.0700 0.98\n", "different ADJ 592 85 0.0716 0.0509 0.71\n", "hard ADJ 539 118 0.0652 0.0706 1.08\n", "glad ADJ 507 76 0.0613 0.0455 0.74\n", "young ADJ 475 81 0.0574 0.0485 0.84\n", "entire ADJ 472 136 0.0571 0.0814 1.43\n", "new ADJ 451 68 0.0545 0.0407 0.75\n", "emotional ADJ 424 65 0.0513 0.0389 0.76\n", "true ADJ 401 76 0.0485 0.0455 0.94\n", "right ADJ 392 58 0.0474 0.0347 0.73\n", "previous ADJ 356 84 0.0431 0.0503 1.17\n", "long ADJ 341 72 0.0412 0.0431 1.04\n", "able ADJ 309 62 0.0374 0.0371 0.99\n", "slow ADJ 288 73 0.0348 0.0437 1.25\n", "worth ADJ 262 49 0.0317 0.0293 0.93\n", "easy ADJ 257 43 0.0311 0.0257 0.83\n", "human ADJ 252 48 0.0305 0.0287 0.94\n", "wrong ADJ 243 59 0.0294 0.0353 1.20\n", "awesome ADJ 242 36 0.0293 0.0215 0.74\n", "important ADJ 231 67 0.0279 0.0401 1.44\n", "nice ADJ 227 52 0.0275 0.0311 1.13\n", "dead ADJ 215 55 0.0260 0.0329 1.27\n", "exciting ADJ 193 56 
0.0233 0.0335 1.44\n", "difficult ADJ 185 40 0.0224 0.0239 1.07\n", "clear ADJ 180 37 0.0218 0.0221 1.02\n", "political ADJ 180 26 0.0218 0.0156 0.71\n", "alive ADJ 168 34 0.0203 0.0203 1.00\n", "satisfying ADJ 164 24 0.0198 0.0144 0.72\n", "dystopian ADJ 161 31 0.0195 0.0186 0.95\n", "crazy ADJ 143 34 0.0173 0.0203 1.18\n", "angry ADJ 138 38 0.0167 0.0227 1.36\n", "short ADJ 138 39 0.0167 0.0233 1.40\n", "violent ADJ 136 37 0.0164 0.0221 1.35\n", "safe ADJ 135 27 0.0163 0.0162 0.99\n", "honest ADJ 133 32 0.0161 0.0192 1.19\n", "deep ADJ 127 22 0.0154 0.0132 0.86\n", "personal ADJ 125 22 0.0151 0.0132 0.87\n", "necessary ADJ 118 27 0.0143 0.0162 1.13\n", "loose ADJ 118 24 0.0143 0.0144 1.01\n", "mental ADJ 118 29 0.0143 0.0174 1.22\n", "epic ADJ 118 35 0.0143 0.0209 1.47\n", "overall ADJ 110 32 0.0133 0.0192 1.44\n", "small ADJ 107 18 0.0129 0.0108 0.83\n", "tragic ADJ 105 21 0.0127 0.0126 0.99\n", "upset ADJ 103 16 0.0125 0.0096 0.77\n", "believable ADJ 102 20 0.0123 0.0120 0.97\n", "evil ADJ 99 25 0.0120 0.0150 1.25\n", "enjoyable ADJ 97 15 0.0117 0.0090 0.77\n", "cruel ADJ 96 22 0.0116 0.0132 1.13\n", "romantic ADJ 91 20 0.0110 0.0120 1.09\n", "painful ADJ 90 25 0.0109 0.0150 1.37\n", "excited ADJ 90 24 0.0109 0.0144 1.32\n", "heavy ADJ 89 15 0.0108 0.0090 0.83\n", "large ADJ 84 15 0.0102 0.0090 0.88\n", "prim ADJ 80 15 0.0097 0.0090 0.93\n", "surprising ADJ 78 12 0.0094 0.0072 0.76\n", "particular ADJ 77 19 0.0093 0.0114 1.22\n", "psychological ADJ 76 11 0.0092 0.0066 0.72\n", "free ADJ 76 20 0.0092 0.0120 1.30\n", "physical ADJ 75 11 0.0091 0.0066 0.73\n", "thrilling ADJ 75 13 0.0091 0.0078 0.86\n", "simple ADJ 75 20 0.0091 0.0120 1.32\n", "complex ADJ 74 11 0.0089 0.0066 0.74\n", "disturbing ADJ 73 20 0.0088 0.0120 1.36\n", "solid ADJ 71 13 0.0086 0.0078 0.91\n", "brave ADJ 70 12 0.0085 0.0072 0.85\n", "innocent ADJ 70 18 0.0085 0.0108 1.27\n", "similar ADJ 69 18 0.0083 0.0108 1.29\n", "fair ADJ 65 10 0.0079 0.0060 0.76\n", "willing ADJ 64 14 0.0077 0.0084 
1.08\n", "past ADJ 63 12 0.0076 0.0072 0.94\n", "constant ADJ 63 18 0.0076 0.0108 1.41\n", "afraid ADJ 60 14 0.0073 0.0084 1.15\n", "impressed ADJ 59 13 0.0071 0.0078 1.09\n", "harsh ADJ 59 11 0.0071 0.0066 0.92\n", "rebel ADJ 57 12 0.0069 0.0072 1.04\n", "strange ADJ 56 9 0.0068 0.0054 0.80\n", "late ADJ 55 11 0.0067 0.0066 0.99\n", "unpredictable ADJ 55 8 0.0067 0.0048 0.72\n", "bloody ADJ 54 16 0.0065 0.0096 1.47\n", "ultimate ADJ 54 10 0.0065 0.0060 0.92\n", "fun ADJ 52 12 0.0063 0.0072 1.14\n", "low ADJ 52 13 0.0063 0.0078 1.24\n", "bitter ADJ 51 13 0.0062 0.0078 1.26\n", "worthy ADJ 51 13 0.0062 0.0078 1.26\n", "horrific ADJ 49 13 0.0059 0.0078 1.31\n", "beloved ADJ 49 7 0.0059 0.0042 0.71\n", "minor ADJ 49 9 0.0059 0.0054 0.91\n", "pure ADJ 49 12 0.0059 0.0072 1.21\n", "mature ADJ 48 10 0.0058 0.0060 1.03\n", "early ADJ 48 14 0.0058 0.0084 1.44\n", "future ADJ 48 8 0.0058 0.0048 0.82\n", "hopeful ADJ 47 11 0.0057 0.0066 1.16\n", "smart ADJ 47 14 0.0057 0.0084 1.47\n", "suspenseful ADJ 47 12 0.0057 0.0072 1.26\n", "graphic ADJ 46 13 0.0056 0.0078 1.40\n", "book NOUN 13445 3681 1.6260 2.2029 1.35\n", "series NOUN 5372 925 0.6497 0.5536 0.85\n", "ending NOUN 3488 565 0.4218 0.3381 0.80\n", "end NOUN 2939 705 0.3554 0.4219 1.19\n", "story NOUN 2635 580 0.3187 0.3471 1.09\n", "time NOUN 2475 547 0.2993 0.3273 1.09\n", "way NOUN 2298 516 0.2779 0.3088 1.11\n", "trilogy NOUN 2250 304 0.2721 0.1819 0.67\n", "thing NOUN 2060 519 0.2491 0.3106 1.25\n", "people NOUN 1764 385 0.2133 0.2304 1.08\n", "war NOUN 1752 374 0.2119 0.2238 1.06\n", "life NOUN 1279 257 0.1547 0.1538 0.99\n", "spoiler NOUN 1151 265 0.1392 0.1586 1.14\n", "lot NOUN 1134 243 0.1371 0.1454 1.06\n", "love NOUN 1095 296 0.1324 0.1771 1.34\n", "bit NOUN 929 180 0.1123 0.1077 0.96\n", "world NOUN 862 158 0.1042 0.0946 0.91\n", "action NOUN 862 258 0.1042 0.1544 1.48\n", "movie NOUN 841 149 0.1017 0.0892 0.88\n", "district NOUN 817 151 0.0988 0.0904 0.91\n", "review NOUN 796 179 0.0963 0.1071 1.11\n", 
"star NOUN 765 154 0.0925 0.0922 1.00\n", "reader NOUN 754 208 0.0912 0.1245 1.37\n", "game NOUN 722 187 0.0873 0.1119 1.28\n", "novel NOUN 714 195 0.0863 0.1167 1.35\n", "alert NOUN 703 181 0.0850 0.1083 1.27\n", "conclusion NOUN 599 95 0.0724 0.0569 0.78\n", "rebellion NOUN 577 113 0.0698 0.0676 0.97\n", "day NOUN 567 79 0.0686 0.0473 0.69\n", "child NOUN 535 141 0.0647 0.0844 1.30\n", "year NOUN 526 102 0.0636 0.0610 0.96\n", "feeling NOUN 464 108 0.0561 0.0646 1.15\n", "moment NOUN 464 98 0.0561 0.0586 1.05\n", "chapter NOUN 427 103 0.0516 0.0616 1.19\n", "twist NOUN 406 65 0.0491 0.0389 0.79\n", "friend NOUN 404 64 0.0489 0.0383 0.78\n", "mind NOUN 404 79 0.0489 0.0473 0.97\n", "rebel NOUN 400 89 0.0484 0.0533 1.10\n", "epilogue NOUN 391 89 0.0473 0.0533 1.13\n", "emotion NOUN 385 54 0.0466 0.0323 0.69\n", "place NOUN 359 96 0.0434 0.0575 1.32\n", "event NOUN 358 95 0.0433 0.0569 1.31\n", "word NOUN 356 80 0.0431 0.0479 1.11\n", "part NOUN 348 71 0.0421 0.0425 1.01\n", "choice NOUN 344 82 0.0416 0.0491 1.18\n", "revolution NOUN 340 74 0.0411 0.0443 1.08\n", "adult NOUN 334 65 0.0404 0.0389 0.96\n", "beginning NOUN 324 96 0.0392 0.0575 1.47\n", "violence NOUN 299 62 0.0362 0.0371 1.03\n", "hope NOUN 299 80 0.0362 0.0479 1.32\n", "thought NOUN 295 60 0.0357 0.0359 1.01\n", "line NOUN 285 81 0.0345 0.0485 1.41\n", "power NOUN 284 49 0.0343 0.0293 0.85\n", "relationship NOUN 283 53 0.0342 0.0317 0.93\n", "fan NOUN 270 53 0.0327 0.0317 0.97\n", "hunger NOUN 266 67 0.0322 0.0401 1.25\n", "arena NOUN 264 54 0.0319 0.0323 1.01\n", "sister NOUN 263 67 0.0318 0.0401 1.26\n", "detail NOUN 260 52 0.0314 0.0311 0.99\n", "opinion NOUN 258 77 0.0312 0.0461 1.48\n", "family NOUN 249 67 0.0301 0.0401 1.33\n", "reality NOUN 249 44 0.0301 0.0263 0.87\n", "installment NOUN 248 63 0.0300 0.0377 1.26\n", "kind NOUN 245 74 0.0296 0.0443 1.49\n", "course NOUN 234 48 0.0283 0.0287 1.02\n", "rest NOUN 229 57 0.0277 0.0341 1.23\n", "government NOUN 225 49 0.0272 0.0293 1.08\n", "head 
NOUN 218 50 0.0264 0.0299 1.13\n", "battle NOUN 218 62 0.0264 0.0371 1.41\n", "romance NOUN 217 44 0.0262 0.0263 1.00\n", "one NOUN 215 45 0.0260 0.0269 1.04\n", "eye NOUN 213 42 0.0258 0.0251 0.98\n", "theme NOUN 204 53 0.0247 0.0317 1.29\n", "hand NOUN 204 49 0.0247 0.0293 1.19\n", "face NOUN 200 43 0.0242 0.0257 1.06\n", "situation NOUN 198 43 0.0239 0.0257 1.07\n", "pain NOUN 197 41 0.0238 0.0245 1.03\n", "hero NOUN 196 50 0.0237 0.0299 1.26\n", "message NOUN 195 55 0.0236 0.0329 1.40\n", "role NOUN 191 55 0.0231 0.0329 1.42\n", "fiction NOUN 186 29 0.0225 0.0174 0.77\n", "rating NOUN 183 38 0.0221 0.0227 1.03\n", "society NOUN 175 33 0.0212 0.0197 0.93\n", "issue NOUN 174 47 0.0210 0.0281 1.34\n", "surprise NOUN 172 24 0.0208 0.0144 0.69\n", "team NOUN 166 42 0.0201 0.0251 1.25\n", "leader NOUN 165 42 0.0200 0.0251 1.26\n", "couple NOUN 164 41 0.0198 0.0245 1.24\n", "capitol NOUN 162 29 0.0196 0.0174 0.89\n", "symbol NOUN 158 23 0.0191 0.0138 0.72\n", "experience NOUN 156 31 0.0189 0.0186 0.98\n", "man NOUN 147 39 0.0178 0.0233 1.31\n", "closure NOUN 147 44 0.0178 0.0263 1.48\n", "work NOUN 145 33 0.0175 0.0197 1.13\n", "level NOUN 144 33 0.0174 0.0197 1.13\n", "order NOUN 142 32 0.0172 0.0192 1.12\n", "finish NOUN 141 24 0.0171 0.0144 0.84\n", "piece NOUN 138 29 0.0167 0.0174 1.04\n", "film NOUN 137 19 0.0166 0.0114 0.69\n", "hour NOUN 136 20 0.0164 0.0120 0.73\n", "week NOUN 133 27 0.0161 0.0162 1.00\n", "stuff NOUN 133 32 0.0161 0.0192 1.19\n", "teen NOUN 131 26 0.0158 0.0156 0.98\n", "destruction NOUN 131 18 0.0158 0.0108 0.68\n", "case NOUN 124 35 0.0150 0.0209 1.40\n", "age NOUN 124 19 0.0150 0.0114 0.76\n", "survival NOUN 122 18 0.0148 0.0108 0.73\n", "pace NOUN 122 22 0.0148 0.0132 0.89\n", "memory NOUN 122 29 0.0148 0.0174 1.18\n", "aspect NOUN 120 27 0.0145 0.0162 1.11\n", "home NOUN 120 19 0.0145 0.0114 0.78\n", "edge NOUN 120 18 0.0145 0.0108 0.74\n", "start NOUN 117 29 0.0141 0.0174 1.23\n", "suspense NOUN 117 23 0.0141 0.0138 0.97\n", "view NOUN 
116 20 0.0140 0.0120 0.85\n", "style NOUN 114 27 0.0138 0.0162 1.17\n", "happiness NOUN 111 18 0.0134 0.0108 0.80\n", "depth NOUN 109 28 0.0132 0.0168 1.27\n", "victor NOUN 109 24 0.0132 0.0144 1.09\n", "storyline NOUN 109 28 0.0132 0.0168 1.27\n", "chance NOUN 108 25 0.0131 0.0150 1.15\n", "fight NOUN 107 23 0.0129 0.0138 1.06\n", "capital NOUN 105 27 0.0127 0.0162 1.27\n", "middle NOUN 105 29 0.0127 0.0174 1.37\n", "woman NOUN 105 25 0.0127 0.0150 1.18\n", "survivor NOUN 103 14 0.0125 0.0084 0.67\n", "enemy NOUN 102 14 0.0123 0.0084 0.68\n", "sadness NOUN 102 14 0.0123 0.0084 0.68\n", "audience NOUN 102 27 0.0123 0.0162 1.31\n", "type NOUN 100 19 0.0121 0.0114 0.94\n", "trauma NOUN 98 20 0.0119 0.0120 1.01\n", "reaction NOUN 98 23 0.0119 0.0138 1.16\n", "need NOUN 96 27 0.0116 0.0162 1.39\n", "humanity NOUN 95 22 0.0115 0.0132 1.15\n", "nightmare NOUN 94 16 0.0114 0.0096 0.84\n", "outcome NOUN 92 15 0.0111 0.0090 0.81\n", "drama NOUN 92 19 0.0111 0.0114 1.02\n", "freedom NOUN 92 26 0.0111 0.0156 1.40\n", "term NOUN 92 24 0.0111 0.0144 1.29\n", "note NOUN 91 25 0.0110 0.0150 1.36\n", "torture NOUN 89 24 0.0108 0.0144 1.33\n", "minute NOUN 89 18 0.0108 0.0108 1.00\n", "act NOUN 88 17 0.0106 0.0102 0.96\n", "complaint NOUN 88 14 0.0106 0.0084 0.79\n", "shock NOUN 88 23 0.0106 0.0138 1.29\n", "People NOUN 88 20 0.0106 0.0120 1.12\n", "conflict NOUN 88 22 0.0106 0.0132 1.24\n", "tragedy NOUN 88 21 0.0106 0.0126 1.18\n", "matter NOUN 88 16 0.0106 0.0096 0.90\n", "voice NOUN 87 17 0.0105 0.0102 0.97\n", "description NOUN 87 19 0.0105 0.0114 1.08\n", "tone NOUN 85 21 0.0103 0.0126 1.22\n", "difference NOUN 85 23 0.0103 0.0138 1.34\n", "country NOUN 84 14 0.0102 0.0084 0.82\n", "warning NOUN 82 14 0.0099 0.0084 0.84\n", "mockingjay NOUN 81 20 0.0098 0.0120 1.22\n", "soldier NOUN 81 20 0.0098 0.0120 1.22\n", "picture NOUN 80 16 0.0097 0.0096 0.99\n", "doubt NOUN 76 14 0.0092 0.0084 0.91\n", "good NOUN 76 13 0.0092 0.0078 0.85\n", "sacrifice NOUN 74 12 0.0089 0.0072 
0.80\n", "plan NOUN 74 21 0.0089 0.0126 1.40\n", "fear NOUN 71 13 0.0086 0.0078 0.91\n", "history NOUN 71 15 0.0086 0.0090 1.05\n", "paragraph NOUN 70 12 0.0085 0.0072 0.85\n", "circumstance NOUN 69 15 0.0083 0.0090 1.08\n", "baby NOUN 67 14 0.0081 0.0084 1.03\n", "literature NOUN 67 15 0.0081 0.0090 1.11\n", "force NOUN 66 15 0.0080 0.0090 1.12\n", "cat NOUN 65 16 0.0079 0.0096 1.22\n", "peeta NOUN 65 15 0.0079 0.0090 1.14\n", "spot NOUN 65 9 0.0079 0.0054 0.69\n", "bow NOUN 65 14 0.0079 0.0084 1.07\n", "list NOUN 64 14 0.0077 0.0084 1.08\n", "imagination NOUN 61 10 0.0074 0.0060 0.81\n", "rule NOUN 61 12 0.0074 0.0072 0.97\n", "grief NOUN 61 10 0.0074 0.0060 0.81\n", "fate NOUN 61 13 0.0074 0.0078 1.05\n", "odd NOUN 61 9 0.0074 0.0054 0.73\n", "friendship NOUN 61 11 0.0074 0.0066 0.89\n", "genre NOUN 61 13 0.0074 0.0078 1.05\n", "darkness NOUN 60 9 0.0073 0.0054 0.74\n", "suffering NOUN 60 13 0.0073 0.0078 1.07\n", "light NOUN 60 13 0.0073 0.0078 1.07\n", "body NOUN 60 14 0.0073 0.0084 1.15\n", "food NOUN 59 14 0.0071 0.0084 1.17\n", "blog NOUN 59 10 0.0071 0.0060 0.84\n", "trial NOUN 59 13 0.0071 0.0078 1.09\n", "break NOUN 59 13 0.0071 0.0078 1.09\n", "meaning NOUN 57 17 0.0069 0.0102 1.48\n", "goal NOUN 57 13 0.0069 0.0078 1.13\n", "comment NOUN 57 13 0.0069 0.0078 1.13\n", "medium NOUN 57 8 0.0069 0.0048 0.69\n", "set NOUN 56 12 0.0068 0.0072 1.06\n", "path NOUN 56 11 0.0068 0.0066 0.97\n", "lesson NOUN 56 16 0.0068 0.0096 1.41\n", "camera NOUN 54 16 0.0065 0.0096 1.47\n", "breath NOUN 54 8 0.0065 0.0048 0.73\n", "promise NOUN 54 15 0.0065 0.0090 1.37\n", "hatred NOUN 54 10 0.0065 0.0060 0.92\n", "image NOUN 54 15 0.0065 0.0090 1.37\n", "feel NOUN 54 11 0.0065 0.0066 1.01\n", "dream NOUN 53 9 0.0064 0.0054 0.84\n", "scar NOUN 52 8 0.0063 0.0048 0.76\n", "saga NOUN 51 7 0.0062 0.0042 0.68\n", "connection NOUN 51 12 0.0062 0.0072 1.16\n", "past NOUN 51 10 0.0062 0.0060 0.97\n", "joy NOUN 51 10 0.0062 0.0060 0.97\n", "deal NOUN 50 15 0.0060 0.0090 1.48\n", 
"uprising NOUN 50 11 0.0060 0.0066 1.09\n", "thank NOUN 49 11 0.0059 0.0066 1.11\n", "tension NOUN 49 14 0.0059 0.0084 1.41\n", "favor NOUN 49 7 0.0059 0.0042 0.71\n", "conversation NOUN 49 10 0.0059 0.0060 1.01\n", "confusion NOUN 49 12 0.0059 0.0072 1.21\n", "tree NOUN 49 11 0.0059 0.0066 1.11\n", "soul NOUN 48 10 0.0058 0.0060 1.03\n", "right NOUN 48 10 0.0058 0.0060 1.03\n", "comparison NOUN 48 13 0.0058 0.0078 1.34\n", "danger NOUN 48 13 0.0058 0.0078 1.34\n", "courage NOUN 48 12 0.0058 0.0072 1.24\n", "brain NOUN 47 12 0.0057 0.0072 1.26\n", "return NOUN 46 8 0.0056 0.0048 0.86\n", "motive NOUN 46 10 0.0056 0.0060 1.08\n", "fun NOUN 46 12 0.0056 0.0072 1.29\n", "Katniss PROPN 7040 1953 0.8514 1.1688 1.37\n", "Peeta PROPN 3185 772 0.3852 0.4620 1.20\n", "Collins PROPN 2577 644 0.3116 0.3854 1.24\n", "Games PROPN 2385 548 0.2884 0.3279 1.14\n", "Hunger PROPN 2302 535 0.2784 0.3202 1.15\n", "Mockingjay PROPN 2162 487 0.2615 0.2914 1.11\n", "Gale PROPN 1990 503 0.2407 0.3010 1.25\n", "Suzanne PROPN 1059 180 0.1281 0.1077 0.84\n", "Capitol PROPN 1002 259 0.1212 0.1550 1.28\n", "Prim PROPN 817 246 0.0988 0.1472 1.49\n", "Finnick PROPN 648 168 0.0784 0.1005 1.28\n", "District PROPN 469 102 0.0567 0.0610 1.08\n", "Coin PROPN 467 133 0.0565 0.0796 1.41\n", "President PROPN 457 97 0.0553 0.0580 1.05\n", "Panem PROPN 361 58 0.0437 0.0347 0.80\n", "YA PROPN 298 63 0.0360 0.0377 1.05\n", "Haymitch PROPN 255 56 0.0308 0.0335 1.09\n", "Harry PROPN 198 50 0.0239 0.0299 1.25\n", "Potter PROPN 171 41 0.0207 0.0245 1.19\n", "Capital PROPN 165 33 0.0200 0.0197 0.99\n", "Annie PROPN 162 27 0.0196 0.0162 0.82\n", "Team PROPN 160 24 0.0193 0.0144 0.74\n", "katniss PROPN 142 31 0.0172 0.0186 1.08\n", "Overall PROPN 112 18 0.0135 0.0108 0.80\n", "Game PROPN 93 18 0.0112 0.0108 0.96\n", "Kat PROPN 81 15 0.0098 0.0090 0.92\n", "Buttercup PROPN 80 18 0.0097 0.0108 1.11\n", "peeta PROPN 71 12 0.0086 0.0072 0.84\n", "Boggs PROPN 65 17 0.0079 0.0102 1.29\n", "Plutarch PROPN 51 11 0.0062 
0.0066 1.07\n", "read VERB 5248 1040 0.6347 0.6224 0.98\n", "think VERB 4184 804 0.5060 0.4811 0.95\n", "love VERB 3635 500 0.4396 0.2992 0.68\n", "feel VERB 3047 881 0.3685 0.5272 1.43\n", "end VERB 2456 542 0.2970 0.3244 1.09\n", "like VERB 2250 560 0.2721 0.3351 1.23\n", "know VERB 2192 500 0.2651 0.2992 1.13\n", "want VERB 2109 516 0.2550 0.3088 1.21\n", "go VERB 2046 569 0.2474 0.3405 1.38\n", "happen VERB 1525 459 0.1844 0.2747 1.49\n", "come VERB 1468 312 0.1775 0.1867 1.05\n", "find VERB 1345 332 0.1627 0.1987 1.22\n", "leave VERB 1117 289 0.1351 0.1730 1.28\n", "enjoy VERB 1109 184 0.1341 0.1101 0.82\n", "finish VERB 1074 297 0.1299 0.1777 1.37\n", "write VERB 1035 288 0.1252 0.1724 1.38\n", "take VERB 1024 231 0.1238 0.1382 1.12\n", "give VERB 899 218 0.1087 0.1305 1.20\n", "make VERB 842 189 0.1018 0.1131 1.11\n", "need VERB 817 152 0.0988 0.0910 0.92\n", "expect VERB 805 194 0.0974 0.1161 1.19\n", "say VERB 717 154 0.0867 0.0922 1.06\n", "live VERB 686 148 0.0830 0.0886 1.07\n", "keep VERB 578 135 0.0699 0.0808 1.16\n", "see VERB 574 110 0.0694 0.0658 0.95\n", "understand VERB 546 155 0.0660 0.0928 1.40\n", "believe VERB 521 113 0.0630 0.0676 1.07\n", "wait VERB 498 122 0.0602 0.0730 1.21\n", "turn VERB 497 138 0.0601 0.0826 1.37\n", "wish VERB 496 123 0.0600 0.0736 1.23\n", "hope VERB 480 103 0.0580 0.0616 1.06\n", "change VERB 477 94 0.0577 0.0563 0.98\n", "look VERB 442 102 0.0535 0.0610 1.14\n", "survive VERB 429 60 0.0519 0.0359 0.69\n", "choose VERB 413 119 0.0499 0.0712 1.43\n", "fight VERB 404 116 0.0489 0.0694 1.42\n", "guess VERB 402 87 0.0486 0.0521 1.07\n", "bring VERB 400 89 0.0484 0.0533 1.10\n", "stop VERB 367 109 0.0444 0.0652 1.47\n", "realize VERB 365 78 0.0441 0.0467 1.06\n", "play VERB 333 62 0.0403 0.0371 0.92\n", "help VERB 333 74 0.0403 0.0443 1.10\n", "show VERB 327 57 0.0395 0.0341 0.86\n", "wrap VERB 321 49 0.0388 0.0293 0.76\n", "pick VERB 318 93 0.0385 0.0557 1.45\n", "grow VERB 312 80 0.0377 0.0479 1.27\n", "disappoint VERB 
307 87 0.0371 0.0521 1.40\n", "watch VERB 300 66 0.0363 0.0395 1.09\n", "stay VERB 293 51 0.0354 0.0305 0.86\n", "work VERB 283 71 0.0342 0.0425 1.24\n", "miss VERB 282 63 0.0341 0.0377 1.11\n", "continue VERB 276 59 0.0334 0.0353 1.06\n", "begin VERB 275 69 0.0333 0.0413 1.24\n", "hold VERB 267 43 0.0323 0.0257 0.80\n", "create VERB 264 42 0.0319 0.0251 0.79\n", "remember VERB 257 54 0.0311 0.0323 1.04\n", "will VERB 257 59 0.0311 0.0353 1.14\n", "move VERB 243 49 0.0294 0.0293 1.00\n", "wonder VERB 241 61 0.0291 0.0365 1.25\n", "hear VERB 233 62 0.0282 0.0371 1.32\n", "talk VERB 228 54 0.0276 0.0323 1.17\n", "learn VERB 223 34 0.0270 0.0203 0.75\n", "forget VERB 213 48 0.0258 0.0287 1.12\n", "catch VERB 210 60 0.0254 0.0359 1.41\n", "satisfy VERB 204 29 0.0247 0.0174 0.70\n", "agree VERB 204 48 0.0247 0.0287 1.16\n", "admit VERB 200 40 0.0242 0.0239 0.99\n", "ask VERB 198 27 0.0239 0.0162 0.67\n", "set VERB 196 52 0.0237 0.0311 1.31\n", "appreciate VERB 191 28 0.0231 0.0168 0.73\n", "win VERB 191 51 0.0231 0.0305 1.32\n", "consider VERB 190 49 0.0230 0.0293 1.28\n", "describe VERB 189 38 0.0229 0.0227 0.99\n", "pull VERB 189 31 0.0229 0.0186 0.81\n", "imagine VERB 187 35 0.0226 0.0209 0.93\n", "add VERB 180 45 0.0218 0.0269 1.24\n", "destroy VERB 174 50 0.0210 0.0299 1.42\n", "tie VERB 173 26 0.0209 0.0156 0.74\n", "deal VERB 173 35 0.0209 0.0209 1.00\n", "figure VERB 169 34 0.0204 0.0203 1.00\n", "remind VERB 161 32 0.0195 0.0192 0.98\n", "sit VERB 159 45 0.0192 0.0269 1.40\n", "manage VERB 151 34 0.0183 0.0203 1.11\n", "rescue VERB 150 25 0.0181 0.0150 0.82\n", "remain VERB 149 30 0.0180 0.0180 1.00\n", "include VERB 142 36 0.0172 0.0215 1.25\n", "blow VERB 142 34 0.0172 0.0203 1.18\n", "capture VERB 140 27 0.0169 0.0162 0.95\n", "tear VERB 140 21 0.0169 0.0126 0.74\n", "trust VERB 139 19 0.0168 0.0114 0.68\n", "protect VERB 134 27 0.0162 0.0162 1.00\n", "hurt VERB 134 31 0.0162 0.0186 1.14\n", "struggle VERB 128 25 0.0155 0.0150 0.97\n", "spoil VERB 127 30 
0.0154 0.0180 1.17\n", "draw VERB 126 30 0.0152 0.0180 1.18\n", "involve VERB 122 32 0.0148 0.0192 1.30\n", "meet VERB 120 27 0.0145 0.0162 1.11\n", "fit VERB 119 21 0.0144 0.0126 0.87\n", "provide VERB 118 19 0.0143 0.0114 0.80\n", "experience VERB 118 20 0.0143 0.0120 0.84\n", "fill VERB 117 18 0.0141 0.0108 0.76\n", "accept VERB 111 22 0.0134 0.0132 0.98\n", "speak VERB 111 22 0.0134 0.0132 0.98\n", "rate VERB 110 27 0.0133 0.0162 1.21\n", "handle VERB 108 25 0.0131 0.0150 1.15\n", "pace VERB 107 20 0.0129 0.0120 0.92\n", "torture VERB 106 26 0.0128 0.0156 1.21\n", "portray VERB 101 19 0.0122 0.0114 0.93\n", "reach VERB 97 21 0.0117 0.0126 1.07\n", "affect VERB 97 27 0.0117 0.0162 1.38\n", "focus VERB 96 27 0.0116 0.0162 1.39\n", "question VERB 91 13 0.0110 0.0078 0.71\n", "call VERB 90 26 0.0109 0.0156 1.43\n", "put VERB 90 25 0.0109 0.0150 1.37\n", "exist VERB 89 23 0.0108 0.0138 1.28\n", "hit VERB 88 16 0.0106 0.0096 0.90\n", "discuss VERB 88 13 0.0106 0.0078 0.73\n", "listen VERB 87 14 0.0105 0.0084 0.80\n", "send VERB 87 24 0.0105 0.0144 1.37\n", "burn VERB 86 26 0.0104 0.0156 1.50\n", "carry VERB 85 19 0.0103 0.0114 1.11\n", "sacrifice VERB 83 21 0.0100 0.0126 1.25\n", "contain VERB 83 18 0.0100 0.0108 1.07\n", "complain VERB 82 15 0.0099 0.0090 0.91\n", "deliver VERB 81 21 0.0098 0.0126 1.28\n", "recover VERB 80 12 0.0097 0.0072 0.74\n", "push VERB 80 16 0.0097 0.0096 0.99\n", "damage VERB 79 14 0.0096 0.0084 0.88\n", "lie VERB 78 15 0.0094 0.0090 0.95\n", "close VERB 75 13 0.0091 0.0078 0.86\n", "conclude VERB 73 15 0.0088 0.0090 1.02\n", "plan VERB 72 21 0.0087 0.0126 1.44\n", "hijack VERB 71 10 0.0086 0.0060 0.70\n", "settle VERB 69 19 0.0083 0.0114 1.36\n", "prefer VERB 69 16 0.0083 0.0096 1.15\n", "prove VERB 68 12 0.0082 0.0072 0.87\n", "present VERB 68 18 0.0082 0.0108 1.31\n", "avoid VERB 67 16 0.0081 0.0096 1.18\n", "base VERB 67 17 0.0081 0.0102 1.26\n", "worry VERB 66 12 0.0080 0.0072 0.90\n", "shock VERB 66 12 0.0080 0.0072 0.90\n", "prepare 
VERB 66 14 0.0080 0.0084 1.05\n", "offer VERB 65 11 0.0079 0.0066 0.84\n", "touch VERB 64 11 0.0077 0.0066 0.85\n", "attach VERB 63 16 0.0076 0.0096 1.26\n", "overcome VERB 63 13 0.0076 0.0078 1.02\n", "explore VERB 62 16 0.0075 0.0096 1.28\n", "release VERB 61 16 0.0074 0.0096 1.30\n", "reveal VERB 61 13 0.0074 0.0078 1.05\n", "jump VERB 58 12 0.0070 0.0072 1.02\n", "open VERB 57 14 0.0069 0.0084 1.22\n", "endure VERB 57 11 0.0069 0.0066 0.95\n", "engage VERB 57 11 0.0069 0.0066 0.95\n", "discover VERB 56 9 0.0068 0.0054 0.80\n", "kick VERB 55 10 0.0067 0.0060 0.90\n", "pay VERB 55 9 0.0067 0.0054 0.81\n", "bear VERB 54 13 0.0065 0.0078 1.19\n", "sleep VERB 54 14 0.0065 0.0084 1.28\n", "mock VERB 52 12 0.0063 0.0072 1.14\n", "post VERB 50 9 0.0060 0.0054 0.89\n", "scream VERB 50 15 0.0060 0.0090 1.48\n", "justify VERB 50 15 0.0060 0.0090 1.48\n", "represent VERB 50 7 0.0060 0.0042 0.69\n", "teach VERB 50 7 0.0060 0.0042 0.69\n", "heal VERB 49 9 0.0059 0.0054 0.91\n", "anticipate VERB 49 10 0.0059 0.0060 1.01\n", "connect VERB 47 13 0.0057 0.0078 1.37\n" ] } ], "source": [ "for pos_type in token_pos_types:\n", " for term, freq in docfreq_group1.most_common(1000):\n", " lemma, pos = term\n", " if pos != pos_type:\n", " continue\n", " prop_group1 = freq / total_group1\n", " prop_group2 = docfreq_group2[term] / total_group2\n", " prop = prop_group2 / prop_group1\n", " if prop < 0.66 or prop > 1.5:\n", " continue\n", " print(f'{lemma: <20}{pos: <6}{freq: >6}{docfreq_group2[term]: >6}{prop_group1: >8.4f}{prop_group2: >8.4f}{prop: >6.2f}')\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The word 'end' is used slightly more in positive reviews, while 'ending' is used more in negative reviews. 
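
The last column of the table above is the ratio between a term's relative document frequency in the two groups of reviews. A minimal sketch of that calculation on toy data follows. It assumes that each total is the number of reviews in a group (the notebook defines the totals elsewhere), and the helper name `docfreq_ratio` and all counts are hypothetical, chosen only for illustration.

```python
from collections import Counter


def docfreq_ratio(docfreq_a, docfreq_b, total_a, total_b, term):
    # Relative document frequency of the term in each group.
    prop_a = docfreq_a[term] / total_a
    prop_b = docfreq_b[term] / total_b
    # Ratio > 1: relatively more common in group b; ratio < 1: in group a.
    return prop_a, prop_b, prop_b / prop_a


# Hypothetical document frequencies, keyed by (lemma, pos) as in the notebook.
toy_group1 = Counter({('end', 'VERB'): 30, ('ending', 'NOUN'): 10})
toy_group2 = Counter({('end', 'VERB'): 12, ('ending', 'NOUN'): 8})
toy_total1, toy_total2 = 100, 50  # hypothetical number of reviews per group

for toy_term in toy_group1:
    p1, p2, ratio = docfreq_ratio(toy_group1, toy_group2, toy_total1, toy_total2, toy_term)
    lemma, pos = toy_term
    print(f'{lemma: <10}{pos: <6}{p1: >8.4f}{p2: >8.4f}{ratio: >6.2f}')
```

Because the ratio divides group 2 by group 1, a term that is missing from group 1 would raise a division error. The loop in the notebook avoids this by iterating only over terms that occur in group 1.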
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 365, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reading dictionary file ../data/LIWC2007_English131104.dic\n", "encoding = utf-8\n", "number of words : 4482\n", "number of categories : 64\n" ] } ], "source": [ "from scripts.liwc import LIWC\n", "\n", "# This dictionary is part of LIWC 2007, which is a commercial product, so not available in our Github repo\n", "liwc_dict_file = '../data/LIWC2007_English131104.dic'\n", "\n", "liwc = LIWC(liwc_dict_file)" ] }, { "cell_type": "code", "execution_count": 366, "metadata": {}, "outputs": [], "source": [ "sample_size = 1000\n", "sample_df = book_df.sample(sample_size, random_state=random_seed)\n", "sample_docs = select_dataframe_spacy_docs(sample_df, review_docs, as_dict=True)\n" ] }, { "cell_type": "code", "execution_count": 367, "metadata": {}, "outputs": [], "source": [ "from scripts.text_tail_analysis import get_lemma_pos_tf_index, group_by_head, group_by_child\n", "\n", "token_pos_types = ['ADJ', 'NOUN', 'PROPN', 'VERB']\n", "doc_list = [sample_docs[review_id] for review_id in sample_docs]\n", "tf_lemma_pos = get_lemma_pos_tf_index(doc_list)\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 369, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
index  dependency_type  dependency_word  dependency_pos  dependency_freq  tail_word    tail_pos  tail_freq  dep_tail_freq  liwc_category
218    head             book             NOUN            1630             eighth       ADJ       1          1              None
220    head             book             NOUN            1630             sappy        ADJ       2          1              None
221    head             book             NOUN            1630             disjointed   ADJ       5          1              None
223    head             book             NOUN            1630             government   ADJ       1          1              None
224    head             book             NOUN            1630             exceptional  ADJ       5          1              None
226    head             book             NOUN            1630             middle       ADJ       5          1              None
227    head             book             NOUN            1630             lengthy      ADJ       2          1              None
229    head             book             NOUN            1630             later        ADJ       2          1              relativ|time
232    head             book             NOUN            1630             remarkable   ADJ       3          1              None
233    head             book             NOUN            1630             separate     ADJ       2          1              None
234    head             book             NOUN            1630             war          ADJ       4          1              affect|negemo|anger|death
235    head             book             NOUN            1630             adequate     ADJ       2          1              None
236    head             book             NOUN            1630             dull         ADJ       4          1              None
237    head             book             NOUN            1630             special      ADJ       2          1              affect|posemo
240    head             book             NOUN            1630             triumphant   ADJ       1          1              None
242    head             book             NOUN            1630             audio        ADJ       2          2              None
245    head             book             NOUN            1630             best         ADJ       2          1              funct|quant|affect|posemo|achieve
249    head             book             NOUN            1630             7th          ADJ       1          1              None
250    head             book             NOUN            1630             light        ADJ       3          1              percept
251    head             book             NOUN            1630             fourth       ADJ       1          1              None
255    head             book             NOUN            1630             dystopic     ADJ       2          1              None
256    head             book             NOUN            1630             expletive    ADJ       1          1              None
258    head             book             NOUN            1630             pretty       ADJ       5          1              affect|posemo|cogmech|tentat
259    head             book             NOUN            1630             paced        ADJ       5          1              None
260    head             book             NOUN            1630             thrilling    ADJ       5          1              None
263    head             book             NOUN            1630             anticipated  ADJ       1          1              None
269    head             book             NOUN            1630             darned       ADJ       1          1              None
270    head             book             NOUN            1630             previious    ADJ       1          1              None
5111   child            book             NOUN            1630             intriguing   ADJ       1          1              None
\n", "
" ], "text/plain": [ " dependency_type dependency_word dependency_pos dependency_freq \\\n", "218 head book NOUN 1630 \n", "220 head book NOUN 1630 \n", "221 head book NOUN 1630 \n", "223 head book NOUN 1630 \n", "224 head book NOUN 1630 \n", "226 head book NOUN 1630 \n", "227 head book NOUN 1630 \n", "229 head book NOUN 1630 \n", "232 head book NOUN 1630 \n", "233 head book NOUN 1630 \n", "234 head book NOUN 1630 \n", "235 head book NOUN 1630 \n", "236 head book NOUN 1630 \n", "237 head book NOUN 1630 \n", "240 head book NOUN 1630 \n", "242 head book NOUN 1630 \n", "245 head book NOUN 1630 \n", "249 head book NOUN 1630 \n", "250 head book NOUN 1630 \n", "251 head book NOUN 1630 \n", "255 head book NOUN 1630 \n", "256 head book NOUN 1630 \n", "258 head book NOUN 1630 \n", "259 head book NOUN 1630 \n", "260 head book NOUN 1630 \n", "263 head book NOUN 1630 \n", "269 head book NOUN 1630 \n", "270 head book NOUN 1630 \n", "5111 child book NOUN 1630 \n", "\n", " tail_word tail_pos tail_freq dep_tail_freq \\\n", "218 eighth ADJ 1 1 \n", "220 sappy ADJ 2 1 \n", "221 disjointed ADJ 5 1 \n", "223 government ADJ 1 1 \n", "224 exceptional ADJ 5 1 \n", "226 middle ADJ 5 1 \n", "227 lengthy ADJ 2 1 \n", "229 later ADJ 2 1 \n", "232 remarkable ADJ 3 1 \n", "233 separate ADJ 2 1 \n", "234 war ADJ 4 1 \n", "235 adequate ADJ 2 1 \n", "236 dull ADJ 4 1 \n", "237 special ADJ 2 1 \n", "240 triumphant ADJ 1 1 \n", "242 audio ADJ 2 2 \n", "245 best ADJ 2 1 \n", "249 7th ADJ 1 1 \n", "250 light ADJ 3 1 \n", "251 fourth ADJ 1 1 \n", "255 dystopic ADJ 2 1 \n", "256 expletive ADJ 1 1 \n", "258 pretty ADJ 5 1 \n", "259 paced ADJ 5 1 \n", "260 thrilling ADJ 5 1 \n", "263 anticipated ADJ 1 1 \n", "269 darned ADJ 1 1 \n", "270 previious ADJ 1 1 \n", "5111 intriguing ADJ 1 1 \n", "\n", " liwc_category \n", "218 None \n", "220 None \n", "221 None \n", "223 None \n", "224 None \n", "226 None \n", "227 None \n", "229 relativ|time \n", "232 None \n", "233 None \n", "234 affect|negemo|anger|death 
\n", "235 None \n", "236 None \n", "237 affect|posemo \n", "240 None \n", "242 None \n", "245 funct|quant|affect|posemo|achieve \n", "249 None \n", "250 percept \n", "251 None \n", "255 None \n", "256 None \n", "258 affect|posemo|cogmech|tentat \n", "259 None \n", "260 None \n", "263 None \n", "269 None \n", "270 None \n", "5111 None " ] }, "execution_count": 369, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scripts.text_tail_analysis import get_tail_groupings\n", "\n", "tail_groupings = get_tail_groupings(doc_list, tf_lemma_pos, token_pos_types, liwc, max_threshold=5, min_threshold=0)\n", "\n", "tail_df = pd.DataFrame(tail_groupings)\n", "\n", "book_terms = ['book', 'novel', 'story', 'plot', 'character', 'twist', 'development']\n", "\n", "tail_df[(tail_df.tail_pos == 'ADJ') & (tail_df.dependency_word == 'book')]" ] }, { "cell_type": "code", "execution_count": 221, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
index  dependency_type  dependency_word  dependency_pos  dependency_freq  tail_word  tail_pos  tail_freq  dep_tail_freq  liwc_category
2501   head             describe         VERB            20               describe   VERB      20         3              verb|present|social
2502   head             describe         VERB            20               event      NOUN      40         1              relativ|time
2503   head             describe         VERB            20               feeling    NOUN      50         1              None
2504   head             describe         VERB            20               theme      NOUN      19         1              None
2505   head             describe         VERB            20               scene      NOUN      30         1              None
2506   head             describe         VERB            20               explain    VERB      29         1              verb|present|social|cogmech|insight
9015   child            describe         VERB            20               word       NOUN      42         3              None
9016   child            describe         VERB            20               begin      VERB      35         1              verb|present|relativ|time
9017   child            describe         VERB            20               reality    NOUN      19         1              cogmech|certain
9018   child            describe         VERB            20               pull       VERB      11         1              None
\n", "
" ], "text/plain": [ " dependency_type dependency_word dependency_pos dependency_freq \\\n", "2501 head describe VERB 20 \n", "2502 head describe VERB 20 \n", "2503 head describe VERB 20 \n", "2504 head describe VERB 20 \n", "2505 head describe VERB 20 \n", "2506 head describe VERB 20 \n", "9015 child describe VERB 20 \n", "9016 child describe VERB 20 \n", "9017 child describe VERB 20 \n", "9018 child describe VERB 20 \n", "\n", " tail_word tail_pos tail_freq dep_tail_freq \\\n", "2501 describe VERB 20 3 \n", "2502 event NOUN 40 1 \n", "2503 feeling NOUN 50 1 \n", "2504 theme NOUN 19 1 \n", "2505 scene NOUN 30 1 \n", "2506 explain VERB 29 1 \n", "9015 word NOUN 42 3 \n", "9016 begin VERB 35 1 \n", "9017 reality NOUN 19 1 \n", "9018 pull VERB 11 1 \n", "\n", " liwc_category \n", "2501 verb|present|social \n", "2502 relativ|time \n", "2503 None \n", "2504 None \n", "2505 None \n", "2506 verb|present|social|cogmech|insight \n", "9015 None \n", "9016 verb|present|relativ|time \n", "9017 cogmech|certain \n", "9018 None " ] }, "execution_count": 221, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tail_groupings = get_tail_groupings(doc_list, tf_lemma_pos, token_pos_types, liwc, max_threshold=50, min_threshold=10)\n", "\n", "tail_df = pd.DataFrame(tail_groupings)\n", "\n", "book_terms = [\n", " 'book', 'novel', 'story', 'plot', 'character', 'twist', 'development', \n", " 'pace', 'scene', 'setting', 'narrative', 'theme', 'event']\n", "author_terms = ['writing', 'style', 'write', 'author', 'writer', 'voice', 'describe', 'explain']\n", "reader_terms = ['reader', 'feel', 'feeling', 'make', 'pull', 'throw']\n", "tail_df[(tail_df.dependency_word == 'describe')]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 251, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[Lemma('memorable.s.01.memorable')]\n", "[]\n", "None\n", "\n", "[Lemma('cooling.n.01.cooling'), Lemma('cooling.n.01.chilling'), Lemma('cooling.n.01.temperature_reduction')]\n", "[Synset('temperature_change.n.01')]\n", "None\n", "\n", "[Lemma('chill.v.01.chill')]\n", "[Synset('depress.v.01')]\n", "0.2857142857142857\n", "\n", "[Lemma('cool.v.01.cool'), Lemma('cool.v.01.chill'), Lemma('cool.v.01.cool_down')]\n", "[Synset('change.v.01')]\n", "0.3333333333333333\n", "\n", "[Lemma('cool.v.02.cool'), Lemma('cool.v.02.chill'), Lemma('cool.v.02.cool_down')]\n", "[Synset('change_state.v.01')]\n", "0.2857142857142857\n", "\n", "[Lemma('chilling.s.01.chilling'), Lemma('chilling.s.01.scarey'), Lemma('chilling.s.01.scary'), Lemma('chilling.s.01.shivery'), Lemma('chilling.s.01.shuddery')]\n", "[]\n", "None\n", "\n", "[Lemma('distraught.s.01.distraught'), Lemma('distraught.s.01.overwrought')]\n", "[]\n", "None\n", "\n", "[Lemma('grip.v.01.grip')]\n", "[Synset('seize.v.01')]\n", "0.2857142857142857\n", "\n", "[Lemma('grapple.v.02.grapple'), Lemma('grapple.v.02.grip')]\n", "[Synset('seize.v.01')]\n", "0.2857142857142857\n", "\n", "[Lemma('fascinate.v.02.fascinate'), Lemma('fascinate.v.02.transfix'), Lemma('fascinate.v.02.grip'), Lemma('fascinate.v.02.spellbind')]\n", "[Synset('interest.v.01')]\n", "0.25\n", "\n", "[Lemma('absorbing.s.01.absorbing'), Lemma('absorbing.s.01.engrossing'), Lemma('absorbing.s.01.fascinating'), Lemma('absorbing.s.01.gripping'), Lemma('absorbing.s.01.riveting')]\n", "[]\n", "None\n", "\n" ] } ], "source": [ "terms = ['memorable', 'chilling', 'overwrought', 'gripping']\n", "for term in terms:\n", " syns = wn.synsets(term)\n", " for syn in syns:\n", " print(syn.lemmas())\n", " print(syn.hypernyms())\n", " #print(syn.hyponyms())\n", " 
affect = wn.synset('affect.v.01')\n", " print(syn.wup_similarity(affect))\n", " print()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 278, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 315, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "book_id author_name title \n", "19063 Markus Zusak The Book Thief 10547\n", "41865 Stephenie Meyer Twilight (Twilight, #1) 9826\n", "2767052 Suzanne Collins The Hunger Games (The Hunger Games, #1) 17403\n", "6148028 Suzanne Collins Catching Fire (The Hunger Games, #2) 11057\n", "7260188 Suzanne Collins Mockingjay (The Hunger Games, #3) 12607\n", "10818853 E.L. 
James Fifty Shades of Grey (Fifty Shades, #1) 10257\n", "11870085 John Green The Fault in Our Stars 19151\n", "13335037 Veronica Roth Divergent (Divergent, #1) 9866\n", "22557272 Paula Hawkins The Girl on the Train 12624\n", "dtype: int64" ] }, "execution_count": 315, "metadata": {}, "output_type": "execute_result" } ], "source": [ "review_df.groupby(['book_id', 'author_name', 'title']).size()" ] }, { "cell_type": "code", "execution_count": 321, "metadata": {}, "outputs": [], "source": [ "sample_size = 1000\n", "hg1_df = review_df[review_df.book_id == 2767052]\n", "sample_hg1_df = hg1_df.sample(sample_size)\n", "sample_hg1_df\n", "docs_hg1 = [nlp(text) for text in get_sample_review_texts(sample_hg1_df)]\n" ] }, { "cell_type": "code", "execution_count": 322, "metadata": {}, "outputs": [], "source": [ "tf_lemma_pos = get_lemma_pos_tf_index(docs + docs_hg1)\n" ] }, { "cell_type": "code", "execution_count": 326, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('Katniss', 'PROPN')\n", "\t ADJ \t []\n", "\t NOUN \t ['prowess', 'behaviour', 'mask', 'monologueing', 'selflessness', 'crisis', 'tomboyishness', 'forehead', 'pin', 'volunter', 'reluctance', 'falseness', 'eyesight', 'volenteer', 'hunt', 'viewpoint']\n", "\t PROPN \t ['Goes', 'Szeyenne', 'story', 'Ngunit', 'Everdeen-', 'edumped']\n", "\n", "('read', 'VERB')\n", "\t ADJ \t ['exhausting', 'stoked', 'eager']\n", "\t ADJ \t ['hesitant', 'psyched', 'adamant']\n", "\n", "\t NOUN \t ['generation', 'joy', 'push', 'try', 'sob']\n", "\t NOUN \t ['headache', 'wish', 'yearn', 'holiday', 'crime', 'reluctance', 'promise', 'thon', 'chore', 'million', 'blessing', 'urge', 'disclaimer', 'car', 'instance', 'hurry', 'category']\n", "\n", "\t PROPN \t []\n", "\t PROPN \t []\n", "\n", "('Katniss', 'PROPN')\n", "\t ADJ \t ['redeeming', 'lead', 'twisted', 'married']\n", "\t ADJ \t []\n", "\n", "\t NOUN \t ['host', 'liberation', 'intervention', 'evolution', 'talk', 'desire', 'progression', 
'hallucination', 'relationsip', 'pov', 'channel', 'trial', 'turmoil', 'doing', 'house', 'search', 'bout', 'escapade', 'involve', 'process', 'insight', 'vulnerability', 'mom', 'mark', 'trait', 'challenge', 'reflection', 'whining', 'recovery', 'dilemma', 'bow', 'rumination', 'obsession', 'attempt', 'plague', 'vote']\n", "\t NOUN \t ['prowess', 'behaviour', 'mask', 'monologueing', 'selflessness', 'crisis', 'tomboyishness', 'forehead', 'pin', 'volunter', 'reluctance', 'falseness', 'eyesight', 'volenteer', 'hunt', 'viewpoint']\n", "\n", "\t PROPN \t ['Everdden', 'Everdean', 'Everdine', 'p.o.v']\n", "\t PROPN \t ['Goes', 'Szeyenne', 'story', 'Ngunit', 'Everdeen-', 'edumped']\n", "\n", "('book', 'NOUN')\n", "\t ADJ \t ['underwhelming', 'touching', 'watery', 'mesmerizing', 'unpredictable', 'special']\n", "\t ADJ \t ['unfun', 'intolerable', 'prescient', 'remarkable', 'provoking']\n", "\n", "\t NOUN \t ['wasn', 'hiding', 'tour', 'defense', 'plotline', 'content', 'hangover', 'club', 'weakness', 'center', 'recovery']\n", "\t NOUN \t ['length', '#', 'jacket', 'chore', 'report', 'anthem', 'seductiveness', 'readability', 'worm', 'advance', 'hipster']\n", "\n", "\t PROPN \t []\n", "\t PROPN \t ['Gregor', 'can;t']\n", "\n", "('little', 'ADJ')\n", "\t ADJ \t ['tiring', 'skewed', 'hokey', 'lost', 'spastic', 'agonizing', 'underwhelmed', 'dense', 'odd', 'extreme', 'facetious', 'akward', 'nervous', 'wary', 'eyed', 'paranoid', 'ridiculous', 'anxious', 'wordy']\n", "\t ADJ \t ['grumpy', 'dimensional', 'intimidated', 'thin', 'gross', 'icky', 'childish', 'shaky', 'clunky', 'forced', 'cheesy']\n", "\n", "\t NOUN \t ['fit', 'annoyance', 'visit', 'package', 'talk', 'bow', 'light', 'meh', 'redemption', 'primrose', 'guarantee', 'heartedness', 'duck', 'distinction', 'outside', 'dark', 'context', 'pansy']\n", "\t NOUN \t ['suspension', 'guidance', 'showing', 'publicity', 'juvenile', 'duplicity', 'dark', 'ambiguity', 'quibble', 'mention', 'machina', 'batter', 'roll']\n", "\n", "\t PROPN \t 
['Duck']\n", "\t PROPN \t ['OTT']\n", "\n", "('very', 'ADV')\n", "\t ADJ \t ['unconvincing', 'lacking', 'cringy', 'imaginative', 'worthwhile', 'intriguing', 'impulsive', 'hatefull', 'military', 'consistent', 'loyal', 'worried', 'rich', 'fond', 'sympathetic', 'peculiar', 'shallow', 'sensitive', 'touching', 'vivid', 'impressive', 'subtle', 'substantial', 'narrow', 'sensetive', 'mediocre', 'conflicting', 'independent', 'tangible', 'mopey', 'precious', 'conflicted', 'telling', 'cranky', 'concerned', 'dissapointed']\n", "\t ADJ \t ['unlikely', 'telling', 'energetic', 'concerned', 'intertaining', 'addicting', 'talented', 'sensible', 'touching', 'unmoved', 'admirable', 'foward', 'interior', 'connected', 'distinct', 'devoted', 'torn', 'poignant', 'convenient', 'cryptic', 'critical', 'resourceful', 'absorbing', 'english', 'boyish', 'unlucky', 'unusual', 'squicky', 'approachable', 'angsty', 'cinematic', 'gripping']\n", "\n", "\t NOUN \t ['end!!!!!!!!!!I']\n", "\t NOUN \t ['sublty', 'gentleman']\n", "\n", "\t PROPN \t []\n", "\t PROPN \t []\n", "\n", "('more', 'ADV')\n", "\t ADJ \t ['sophisticated', 'bearable', 'insipid', 'detailed', 'introspective', 'thoughtful', 'eloquent', 'triumphant', 'caring', 'hidden', 'focused', 'lovely', 'meaningful', 'understandable', 'readable', 'serious', 'likable', 'rounded', 'concerned', 'critical', 'evident', 'subtle', 'valuable', 'convincing', 'heightened', 'positive', 'natural', 'vulnerable', 'paced', 'gripping', 'forgiving', 'acquainted', 'profound', 'frequent', 'likely', 'effective', 'ideal', 'integral', 'redemptive', 'explicit', 'practical']\n", "\t ADJ \t ['articulate', 'savage', 'substantial', 'suitable', 'dimensional', 'meaty', 'contrived', 'psychotic', 'endearing', 'accessible', 'jaded', 'superficial', 'lethal', 'layered', 'imagined', 'appealing', 'stylistic', 'involved', 'coherent', 'poetic', 'mystified', 'enthralling', 'enthralled', 'amorous']\n", "\n", "\t NOUN \t ['gravitas']\n", "\t NOUN \t []\n", "\n", "\t PROPN \t []\n", "\t 
PROPN \t []\n", "\n", "('so', 'ADV')\n", "\t ADJ \t ['frustrating', 'brave', 'cringy', 'staged', 'unsure', 'jaded', 'thankful', 'badass', 'intelligent', 'ish', 'attached', 'limited', 'underwhelming', 'hollow', 'indifferent', 'unhappy', 'lackluster', 'aggrivated', 'distant', 'impure', 'dissapointing', 'ephemeral', 'unsettled', 'conflicted', 'damaged', 'descriptive', 'dangerous', 'climatic', 'surprising', 'riveting', 'repetitive', 'disjointed', 'pitiful', 'cliche', 'extraordinary', 'stunned', 'masculine', 'stoked', 'dire', 'fustrating', 'skillful', 'frightening', 'enthralled', 'saccharine', 'exhausted', 'climactic', 'detailed', 'naive', 'gloomy', 'connected', 'mesmerizing', 'engrossing', 'likeable', 'grateful', 'special']\n", "\t ADJ \t ['observant', 'charming', 'wide', 'cheesy', 'repetitive', 'colourful', 'incredibly', 'stressful', 'hyper', 'thick', 'rounded', 'tardy', 'oblivious', 'unneccessary', 'sucky', 'distressed', 'lucky', 'hurt', 'discrete', 'connected', 'blurry', 'menacing', 'immersed', 'vicious', 'fond']\n", "\n", "\t NOUN \t ['reference', 'scatter', 'heartbreaking--']\n", "\t NOUN \t ['manything', 'applause']\n", "\n", "\t PROPN \t ['wright']\n", "\t PROPN \t ['edumped', 'twisted!After']\n", "\n" ] } ], "source": [ "child_group_hg1 = group_by_child(docs_hg1, tf_lemma_pos, token_pos_types, max_threshold=5)\n", "\n", "shared_tokens = [token for token in child_group_hg1 if token in child_group]\n", "\n", "token_lemma_pos = ('Katniss', 'PROPN')\n", "if token_lemma_pos in child_group_hg1:\n", " print(token_lemma_pos)\n", " for token_pos in token_pos_types:\n", " print('\\t', token_pos, '\\t', [lemma for lemma, pos in child_group_hg1[token_lemma_pos] if pos == token_pos])\n", " print()\n", "\n", "\n", "#for token_lemma_pos in child_group_hg1:\n", "for token_lemma_pos in shared_tokens:\n", " if sum(child_group_hg1[token_lemma_pos].values()) < 20:\n", " continue\n", " print(token_lemma_pos)\n", " for token_pos in token_pos_types:\n", " print('\\t', token_pos, 
'\\t', [lemma for lemma, pos in child_group[token_lemma_pos] if pos == token_pos])\n", " print('\\t', token_pos, '\\t', [lemma for lemma, pos in child_group_hg1[token_lemma_pos] if pos == token_pos])\n", " print()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 198, "metadata": {}, "outputs": [], "source": [ "def filter_doc_terms(doc, filter_terms):\n", " return [token for token in doc if token in filter_terms]\n", " \n", "def doc_generator(docs, use_sentences=False):\n", " for doc in docs:\n", " if use_sentences:\n", " for sent in doc.sents:\n", " yield sent\n", " else:\n", " yield doc\n", "\n", "def get_cooc(docs, filter_terms=None, use_sentences=False, use_lemma=False):\n", " cooc = Counter()\n", " for doc in doc_generator(docs, use_sentences=use_sentences):\n", " token_set = get_doc_token_set(doc, use_lemma=use_lemma)\n", " if filter_terms:\n", " token_set = filter_doc_terms(token_set, filter_terms)\n", " cooc.update([term_pair for term_pair in combinations(sorted(token_set), 2)])\n", " return cooc\n", "\n" ] }, { "cell_type": "code", "execution_count": 199, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(('book', 'read'), 308),\n", " (('book', 'end'), 280),\n", " (('book', 'series'), 271),\n", " (('book', 'like'), 271),\n", " (('Katniss', 'book'), 254),\n", " (('book', 'good'), 214),\n", " (('book', 'think'), 213),\n", " (('book', 'love'), 212),\n", " (('book', 'character'), 201),\n", " (('Katniss', 'end'), 199),\n", " (('book', 'ending'), 198),\n", " (('read', 'series'), 198),\n", " (('book', 'feel'), 196),\n", " (('end', 'series'), 190),\n", " (('Games', 'Hunger'), 
189),\n", " (('Katniss', 'read'), 184),\n", " (('like', 'read'), 179),\n", " (('end', 'read'), 178),\n", " (('end', 'like'), 177),\n", " (('book', 'way'), 176),\n", " (('Katniss', 'like'), 174),\n", " (('Collins', 'book'), 170),\n", " (('book', 'story'), 169),\n", " (('Games', 'book'), 168),\n", " (('Hunger', 'book'), 168),\n", " (('book', 'time'), 164),\n", " (('like', 'series'), 163),\n", " (('Katniss', 'series'), 161),\n", " (('read', 'think'), 154),\n", " (('end', 'think'), 153),\n", " (('Katniss', 'think'), 153),\n", " (('Katniss', 'Peeta'), 152),\n", " (('Katniss', 'character'), 148),\n", " (('end', 'love'), 148),\n", " (('character', 'end'), 147),\n", " (('like', 'think'), 146),\n", " (('book', 'trilogy'), 146),\n", " (('Katniss', 'love'), 146),\n", " (('character', 'series'), 145),\n", " (('good', 'read'), 145),\n", " (('Peeta', 'book'), 144),\n", " (('love', 'series'), 144),\n", " (('feel', 'read'), 144),\n", " (('love', 'read'), 144),\n", " (('feel', 'like'), 143),\n", " (('good', 'series'), 142),\n", " (('character', 'read'), 141),\n", " (('ending', 'like'), 141),\n", " (('end', 'good'), 140),\n", " (('book', 'want'), 139)]" ] }, "execution_count": 199, "metadata": {}, "output_type": "execute_result" } ], "source": [ "common_terms = [term for term, freq in df.most_common() if freq >= 100 and term != ' ']\n", "cooc = get_cooc(docs, filter_terms=common_terms, use_sentences=False, use_lemma=True)\n", "\n", "\n", "cooc.most_common(50)\n", "\n" ] }, { "cell_type": "code", "execution_count": 200, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(('Games', 'Hunger'), 283),\n", " (('book', 'read'), 200),\n", " (('book', 'like'), 179),\n", " (('Katniss', 'Peeta'), 164),\n", " (('Katniss', 'book'), 144),\n", " (('book', 'series'), 129),\n", " (('book', 'good'), 126),\n", " (('Katniss', 'end'), 117),\n", " (('book', 'end'), 117),\n", " (('feel', 'like'), 115),\n", " (('book', 'think'), 109),\n", " (('book', 'love'), 109),\n", " (('Gale', 'Katniss'), 
108),\n", " (('Gale', 'Peeta'), 103),\n", " (('book', 'feel'), 100),\n", " (('Collins', 'Suzanne'), 98),\n", " (('Katniss', 'like'), 90),\n", " (('end', 'series'), 82),\n", " (('Collins', 'book'), 81),\n", " (('read', 'series'), 78),\n", " (('book', 'character'), 77),\n", " (('Hunger', 'book'), 74),\n", " (('Katniss', 'think'), 74),\n", " (('Games', 'book'), 73),\n", " (('Katniss', 'love'), 73),\n", " (('end', 'like'), 69),\n", " (('book', 'trilogy'), 67),\n", " (('Katniss', 'feel'), 66),\n", " (('Katniss', 'character'), 65),\n", " (('Peeta', 'end'), 65),\n", " (('book', 'story'), 63),\n", " (('end', 'way'), 63),\n", " (('book', 'way'), 62),\n", " (('book', 'time'), 60),\n", " (('Peeta', 'love'), 60),\n", " (('end', 'think'), 58),\n", " (('book', 'thing'), 56),\n", " (('book', 'finish'), 56),\n", " (('Mockingjay', 'book'), 56),\n", " (('Games', 'series'), 55),\n", " (('like', 'series'), 55),\n", " (('Peeta', 'book'), 54),\n", " (('love', 'series'), 54),\n", " (('Hunger', 'series'), 54),\n", " (('book', 'go'), 54),\n", " (('read', 'time'), 53),\n", " (('book', 'final'), 53),\n", " (('Peeta', 'like'), 52),\n", " (('good', 'series'), 50),\n", " (('book', 'want'), 49)]" ] }, "execution_count": 200, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cooc = get_cooc(docs, filter_terms=common_terms, use_sentences=True, use_lemma=True)\n", "cooc.most_common(50)\n", "\n" ] }, { "cell_type": "code", "execution_count": 268, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('Games', 'Hunger') 6.460479234356834\n", "('Collins', 'Suzanne') 6.088259967047493\n", "('Gale', 'Peeta') 5.99832727669207\n", "('Katniss', 'Peeta') 5.610687390134804\n", "('Gale', 'Katniss') 5.495901734017394\n", "('Collins', 'write') 4.992401742361956\n", "('feel', 'like') 4.943319340093928\n", "('Games', 'final') 4.925125255903836\n", "('Hunger', 'final') 4.915074920050335\n", "('Games', 'Mockingjay') 4.733835029127121\n", "('Hunger', 'Mockingjay') 
4.72378469327362\n" ] } ], "source": [ "from helper import get_pmi_cooc\n", "\n", "pmi_cooc = get_pmi_cooc(df, cooc, filter_terms=common_terms)\n", "\n", "# Show the first 11 term pairs with their PMI scores\n", "for ti, term_pair in enumerate(pmi_cooc):\n", "    print(term_pair, pmi_cooc[term_pair])\n", "    if ti == 10:\n", "        break" ] }, { "cell_type": "code", "execution_count": 296, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('ending', 'trilogy') 5.6627197008620245\n", "('lot', 'lot') 3.73396263103296\n", "('like', 'Games') 3.703295832639988\n", "('way', 'way') 3.5018627676966796\n", "('enjoy', 'series') 3.13889763482435\n", "('series', 'book') 3.086841272868297\n", "('like', 'ending') 3.0101486520800425\n", "('Suzanne', 'character') 2.9081494841249214\n", "('lot', 'people') 2.805975859395614\n", "('story', 'thing') 2.583653438550424\n", "('book', 'book') 2.541614222384974\n" ] } ], "source": [ "from collections import Counter\n", "from itertools import combinations\n", "\n", "from helper import get_doc_content_chunks\n", "from scripts.pmi import PMICOOC\n", "\n", "# One token set per sentence chunk; use the lemma unless it is the '-PRON-' placeholder\n", "token_sets = [sent_chunks for doc in docs for sent_chunks in get_doc_content_chunks(doc)]\n", "token_sets = [[token.lemma_ if token.lemma_ != '-PRON-' else token.text for token in token_set] for token_set in token_sets]\n", "pmi_cooc = PMICOOC(token_sets, filter_terms=common_terms)\n", "# Count single tokens and ordered token pairs within each token set\n", "token_freq = Counter([token for token_set in token_sets for token in token_set])\n", "cooc_freq = Counter([token_pair for token_set in token_sets for token_pair in combinations(token_set, 2)])\n", "pmi_cooc = get_pmi_cooc(token_freq, cooc_freq, filter_terms=common_terms)\n", "\n", "# Show the first 11 term pairs with their PMI scores\n", "for ti, term_pair in enumerate(pmi_cooc):\n", "    print(term_pair, pmi_cooc[term_pair])\n", "    if ti == 10:\n", "        break" ] }, { "cell_type": "code", "execution_count": 313, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('ending', 'trilogy') 5.6627197008620245\n", "('lot', 'lot') 3.73396263103296\n", "('like', 'Games') 3.703295832639988\n", "('way', 'way') 3.5018627676966796\n", "('enjoy', 
'series') 3.13889763482435\n", "('series', 'book') 3.086841272868297\n" ] } ], "source": [ "from scripts.pmi import PMICOOC\n", "pmi_cooc = PMICOOC(token_sets, filter_terms=common_terms)\n", "\n", "for term in pmi_cooc.highest(5):\n", " print(term, pmi_cooc[term])" ] }, { "cell_type": "code", "execution_count": 317, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('ending', 'trilogy') 5.6627197008620245\n", "('lot', 'lot') 3.73396263103296\n", "('like', 'Games') 3.703295832639988\n", "('way', 'way') 3.5018627676966796\n", "('enjoy', 'series') 3.13889763482435\n", "('series', 'book') 3.086841272868297\n", "('like', 'ending') 3.0101486520800425\n", "('Suzanne', 'character') 2.9081494841249214\n", "('lot', 'people') 2.805975859395614\n", "('story', 'thing') 2.583653438550424\n", "('book', 'book') 2.541614222384974\n", "('series', 'leave') 2.476760691006965\n", "('thing', 'happen') 2.34846154872004\n", "('great', 'way') 2.343352337447501\n", "('Collins', 'write') 2.314179600159026\n", "('time', 'trilogy') 2.312815613587419\n", "('lot', 'war') 2.2233705532354926\n", "('Peeta', 'Gale') 2.211760809640805\n", "('Gale', 'Peeta') 2.211760809640805\n", "('come', 'series') 2.208959989109013\n", "('Suzanne', 'write') 2.1851493404152946\n", "('write', 'Suzanne') 2.1851493404152946\n", "('great', 'character') 2.14268164198535\n", "('character', 'great') 2.14268164198535\n", "('Collins', 'war') 2.135814662211173\n", "('people', 'lot') 2.1128286788356685\n", "('Gale', 'character') 2.0972192679085926\n", "('Suzanne', 'war') 2.0608516237377175\n", "('war', 'people') 1.9885309621580913\n", "('Suzanne', 'happen') 1.9798104137345918\n", "('Collins', 'final') 1.9798104137345918\n", "('Suzanne', 'people') 1.9503097493378938\n", "('people', 'Suzanne') 1.9503097493378938\n", "('think', 'series') 1.9364527146764674\n", "('final', 'Gale') 1.9048473752611363\n", "('trilogy', 'people') 1.9015195851684619\n", "('write', 'way') 1.8850447479649568\n", "('way', 
'write') 1.8850447479649568\n", "('story', 'happen') 1.8845002339302668\n", "('great', 'great') 1.8779890877582677\n", "('enjoy', 'people') 1.8626041689187882\n", "('story', 'get') 1.854999569533569\n", "('time', 'happen') 1.8487821513281877\n", "('enjoy', 'enjoy') 1.847219250079309\n", "('time', 'get') 1.8192814869314895\n", "('go', 'way') 1.8140930119926724\n", "('story', 'Gale') 1.8095371954568116\n", "('enjoy', 'character') 1.7218316150377058\n", "('know', 'happen') 1.7217288110591014\n", "('find', 'happen') 1.6882895649756453\n", "('end', 'series') 1.6861581742963068\n", "('character', 'finish') 1.6843740525028057\n", "('people', 'enjoy') 1.6802826121248335\n", "('read', 'series') 1.6790407065274429\n", "('get', 'little') 1.6555102091172489\n", "('find', 'way') 1.6386443344864796\n", "('time', 'thing') 1.6316446240741898\n", "('like', 'series') 1.6238542909601519\n", "('think', 'lot') 1.5881460204082518\n", "('way', 'little') 1.5612576708710595\n", "('little', 'way') 1.5612576708710595\n", "('Peeta', 'character') 1.5568348419010574\n", "('Katniss', 'Gale') 1.5560293028176242\n", "('Katniss', 'Peeta') 1.5437123070105858\n", "('write', 'spoiler') 1.5432954542428998\n", "('feel', 'Suzanne') 1.5422639946362378\n", "('thing', 'little') 1.5368662177469004\n", "('go', 'happen') 1.5323811065273951\n", "('little', 'time') 1.5071904496007837\n", "('time', 'little') 1.5071904496007837\n", "('enjoy', 'trilogy') 1.4806695582208178\n", "('write', 'trilogy') 1.4432119956859175\n", "('find', 'character') 1.4379736390243285\n", "('think', 'Suzanne') 1.4256270909104771\n", "('way', 'go') 1.408627903884508\n", "('war', 'war') 1.4059256559979703\n", "('book', 'great') 1.4050826988545706\n", "('great', 'book') 1.4050826988545706\n", "('enjoy', 'book') 1.3896977800150914\n", "('Collins', 'leave') 1.3841185693253595\n", "('final', 'Peeta') 1.3644629492536011\n", "('Mockingjay', 'war') 1.3528158306840217\n", "('spoiler', 'come') 1.3439625516224087\n", "('know', 'go') 
1.3328678580029723\n", "('end', 'Peeta') 1.3280953050827264\n", "('want', 'know') 1.3272430942782067\n", "('want', 'happen') 1.3119810411589363\n", "('series', 'love') 1.3023545629693218\n", "('time', 'want') 1.2883112970730315\n", "('character', 'love') 1.2715829043025684\n", "('story', 'Peeta') 1.2691527694492766\n", "('leave', 'way') 1.2603653666824723\n", "('thing', 'thing') 1.255855284112142\n", "('finish', 'Mockingjay') 1.2539699960473891\n", "('Collins', 'trilogy') 1.2378730690052144\n", "('people', 'thing') 1.2203485956552322\n", "('story', 'Katniss') 1.2155966649803143\n", "('get', 'happen') 1.2143425715950202\n", "('feel', 'little') 1.1986742902461607\n", "('finish', 'read') 1.1972026196347048\n", "('way', 'finish') 1.1918975674050114\n", "('spoiler', 'feel') 1.1880921809156237\n", "('people', 'people') 1.1848419071983223\n", "('love', 'way') 1.184571527312939\n", "('write', 'write') 1.184517460107389\n", "('Mockingjay', 'spoiler') 1.1427443609371646\n", "('Suzanne', 'feel') 1.1367988865280734\n", "('book', 'time') 1.123231546713583\n", "('know', 'lot') 1.116137521523476\n", "('good', 'great') 1.1119805589334504\n", "('book', 'find') 1.1058398040017139\n", "('time', 'go') 1.0668786101624514\n", "('spoiler', 'go') 1.066878610162451\n", "('think', 'write') 1.049149519675565\n", "('Mockingjay', 'trilogy') 1.042660902380182\n", "('feel', 'character') 1.0414887067237484\n", "('think', 'Collins') 1.0379666074358191\n", "('read', 'book') 1.0284531403862938\n", "('Suzanne', 'Katniss') 1.0232247723328585\n", "('want', 'end') 1.0183288017206515\n", "('enjoy', 'Collins') 1.016636095961586\n", "('Suzanne', 'go') 1.0155853157749009\n", "('time', 'war') 1.0135326294571585\n", "('war', 'time') 1.0135326294571585\n", "('enjoy', 'write') 0.9988314713280796\n", "('book', 'feel') 0.9970369441529148\n", "('leave', 'feel') 0.9930921658618983\n", "('write', 'character') 0.9912268719428605\n", "('Collins', 'way') 0.9865586407243083\n", "('Gale', 'Katniss') 
0.9854844443500113\n", "('love', 'character') 0.9839008318507876\n", "('love', 'Peeta') 0.9806593379266164\n", "('think', 'go') 0.9781977837032804\n", "('Gale', 'leave') 0.9726832942306912\n", "('feel', 'thing') 0.9664535207808344\n", "('Gale', 'Gale') 0.9603857664202851\n", "('get', 'way') 0.9570579763276105\n", "('happen', 'happen') 0.9561611635399372\n", "('know', 'Suzanne') 0.9536185920257009\n", "('Suzanne', 'know') 0.9536185920257009\n", "('thing', 'time') 0.9384974435142445\n", "('come', 'people') 0.9326665232034513\n", "('thing', 'get') 0.9326665232034513\n", "('go', 'go') 0.9248309525064365\n", "('Gale', 'way') 0.9115956022508531\n", "('find', 'end') 0.9091295097556594\n", "('think', 'end') 0.906833297495309\n", "('war', 'leave') 0.9055439913930627\n", "('Peeta', 'Katniss') 0.900575547025302\n", "('trilogy', 'want') 0.900545766064268\n", "('find', 'find') 0.8740381899443894\n", "('lot', 'like') 0.8700824885837721\n", "('like', 'Peeta') 0.8603263136384072\n", "('like', 'book') 0.8554836891626193\n", "('little', 'go') 0.8543171681787783\n", "('feel', 'way') 0.8366942940777351\n", "('thing', 'leave') 0.8305088054501487\n", "('leave', 'thing') 0.8305088054501487\n", "('people', 'good') 0.8242984864816697\n", "('want', 'finish') 0.8241727872796942\n", "('know', 'book') 0.8138566496505423\n", "('Katniss', 'end') 0.8119156786656515\n", "('Katniss', 'character') 0.8101315568721499\n", "('enjoy', 'come') 0.7994985687075882\n", "('read', 'Mockingjay') 0.7966515263289691\n", "('Katniss', 'go') 0.7893695654237208\n", "('Collins', 'character') 0.7858879452621571\n", "('happen', 'character') 0.7858879452621571\n", "('Collins', 'come') 0.7798456308061944\n", "('enjoy', 'leave') 0.7796171981537594\n", "('go', 'leave') 0.7765684153044009\n", "('end', 'Gale') 0.7698674424221519\n", "('think', 'happen') 0.7697026208411397\n", "('thing', 'write') 0.762041006172688\n", "('enjoy', 'find') 0.7524310578496025\n", "('want', 'get') 0.7434838760295515\n", "('leave', 'write') 
0.7421596356188591\n", "('write', 'leave') 0.7421596356188591\n", "('leave', 'finish') 0.7421596356188591\n", "('think', 'good') 0.7414485803332919\n", "('book', 'Collins') 0.7414361826913233\n", "('find', 'come') 0.7387841440084206\n", "('thing', 'end') 0.7330534692994355\n", "('find', 'Collins') 0.7327781199482086\n", "('great', 'finish') 0.726534317715778\n", "('think', 'little') 0.7253624425816675\n", "('feel', 'finish') 0.7239536711222865\n", "('Collins', 'feel') 0.7192854399037342\n", "('know', 'leave') 0.7146016915552009\n", "('trilogy', 'read') 0.7139598104838558\n", "('know', 'end') 0.7124565352088127\n", "('get', 'book') 0.7119355182946253\n", "('write', 'enjoy') 0.7111493988762986\n", "('character', 'Gale') 0.7109249067887018\n", "('like', 'Gale') 0.7075635590859971\n", "('like', 'Suzanne') 0.7075635590859971\n", "('leave', 'little') 0.7060546309767428\n", "('know', 'war') 0.7041577323941179\n", "('think', 'Gale') 0.6947395823676844\n", "('Collins', 'go') 0.6850832461401916\n", "('enjoy', 'little') 0.6750443942341823\n", "('feel', 'leave') 0.6746384347433637\n", "('finish', 'finish') 0.6736918363413983\n", "('like', 'thing') 0.6707495859632806\n", "('read', 'story') 0.6674397948489631\n", "('Gale', 'book') 0.666473144217868\n", "('love', 'finish') 0.6663657962493253\n", "('think', 'time') 0.6659901690816986\n", "('think', 'get') 0.6601592487709054\n", "('get', 'go') 0.6555825817434937\n", "('way', 'know') 0.6535139995753627\n", "('finish', 'know') 0.6461338922777402\n", "('Mockingjay', 'little') 0.6425008465017109\n", "('want', 'write') 0.6418512304857396\n", "('book', 'good') 0.6390741700297535\n", "('come', 'war') 0.6377432894951106\n", "('war', 'thing') 0.6377432894951106\n", "('write', 'little') 0.6375868316992819\n", "('great', 'like') 0.635242897506371\n", "('like', 'end') 0.6334555869322751\n", "('read', 'time') 0.6317217122468838\n", "('Mockingjay', 'leave') 0.6292906147649046\n", "('think', 'Katniss') 0.6268921477879663\n", "('want', 'thing') 
0.624839884659203\n", "('feel', 'happen') 0.6239752600994095\n", "('Collins', 'want') 0.618833860598991\n", "('little', 'end') 0.6085992948260295\n", "('Katniss', 'get') 0.6060636244615027\n", "('time', 'know') 0.5994467783050872\n", "('great', 'Peeta') 0.5989951071140298\n", "('know', 'people') 0.593615857994294\n", "('happen', 'end') 0.593516052614701\n", "('think', 'war') 0.5883795663767747\n", "('come', 'Mockingjay') 0.5846334641811621\n", "('love', 'love') 0.5820787150211242\n", "('find', 'want') 0.5777723738012172\n", "('know', 'know') 0.5777539536938271\n", "('Collins', 'Katniss') 0.5749396670417656\n", "('come', 'thing') 0.5627081035521967\n", "('thing', 'come') 0.5627081035521967\n", "('Collins', 'happen') 0.5506960554317728\n", "('know', 'Gale') 0.5481534839175365\n", "('Gale', 'know') 0.5481534839175365\n", "('book', 'Mockingjay') 0.546224016066291\n", "('Mockingjay', 'book') 0.546224016066291\n", "('feel', 'Katniss') 0.5428583560515761\n", "('Katniss', 'find') 0.5338781802439918\n", "('leave', 'end') 0.530850541951652\n", "('people', 'come') 0.5272014150952868\n", "('thing', 'people') 0.5272014150952868\n", "('Collins', 'people') 0.521195391035075\n", "('happen', 'people') 0.521195391035075\n", "('think', 'character') 0.5193866948898233\n", "('know', 'Peeta') 0.518594681675992\n", "('Gale', 'end') 0.5185530141412458\n", "('think', 'come') 0.5133443804338607\n", "('read', 'happen') 0.5122906126921154\n", "('Peeta', 'little') 0.5100476210975337\n", "('little', 'Peeta') 0.5100476210975337\n", "('Katniss', 'want') 0.5095460795844611\n", "('know', 'read') 0.5046683719777977\n", "('happen', 'go') 0.5027616893462369\n", "('want', 'little') 0.500385710185797\n", "('know', 'love') 0.49570700854499405\n", "('feel', 'time') 0.49494500035567873\n", "('enjoy', 'read') 0.49272283744022766\n", "('people', 'get') 0.491694726638377\n", "('leave', 'know') 0.4914581402409912\n", "('like', 'find') 0.4901506824807723\n", "('feel', 'get') 0.4891140800448854\n", "('Katniss', 
'time') 0.4867314018182897\n", "('Peeta', 'love') 0.48166817180762866\n", "('like', 'Katniss') 0.48134780873481714\n", "('great', 'find') 0.4801339042373011\n", "('enjoy', 'get') 0.47630980779889764\n", "('happen', 'Gale') 0.47573301695831754\n", "('go', 'Peeta') 0.4752008897673657\n", "('finish', 'come') 0.47435893372090715\n", "('know', 'feel') 0.4732521757444189\n", "('write', 'feel') 0.4726392428413805\n", "('go', 'book') 0.47035826529157765\n", "('way', 'end') 0.4697628499718137\n", "('like', 'Mockingjay') 0.46953139527803645\n", "('Peeta', 'get') 0.46546371448950735\n", "('know', 'Katniss') 0.4650385772070299\n", "('Mockingjay', 'read') 0.46017928970775623\n", "('little', 'Katniss') 0.4564915166285714\n", "('go', 'end') 0.4559695278394779\n", "('find', 'Peeta') 0.4539028920884314\n", "('know', 'like') 0.44948195641050676\n", "('know', 'come') 0.446800989657249\n", "('Gale', 'get') 0.4462323525616195\n", "('get', 'Gale') 0.4462323525616195\n", "('great', 'Gale') 0.4462323525616195\n", "('Gale', 'great') 0.4462323525616195\n", "('time', 'write') 0.44468316557479043\n", "('feel', 'Gale') 0.44365170596812803\n", "('want', 'read') 0.44322729634584823\n", "('book', 'read') 0.44066647548417454\n", "('know', 'want') 0.4399398992773038\n", "('get', 'finish') 0.43885224526399735\n", "('little', 'come') 0.43825392907879057\n", "('happen', 'want') 0.43651230380503636\n", "('find', 'Gale') 0.4346715301605436\n", "('end', 'end') 0.4333952058009389\n", "('little', 'happen') 0.4322479050185786\n", "('like', 'good') 0.4288501566169764\n", "('know', 'write') 0.4229903409635305\n", "('know', 'finish') 0.4229903409635305\n", "('great', 'good') 0.41883337837350515\n", "('get', 'know') 0.4112943012003392\n", "('end', 'people') 0.40986470839074485\n", "('want', 'Katniss') 0.40946262102747866\n", "('feel', 'come') 0.40683773284541175\n", "('book', 'want') 0.40410887975037707\n", "('like', 'like') 0.40409761893295393\n", "('great', 'read') 0.40274724062188094\n", "('little', 'get') 
0.40274724062188083\n", "('feel', 'Collins') 0.4008317087851997\n", "('go', 'good') 0.40039967668466736\n", "('think', 'read') 0.3888902059604547\n", "('Suzanne', 'love') 0.386063831095167\n", "('love', 'Suzanne') 0.386063831095167\n", "('feel', 'want') 0.3819581369627882\n", "('people', 'feel') 0.37133104438850184\n", "('feel', 'people') 0.37133104438850184\n", "('Peeta', 'way') 0.37121117624331795\n", "('finish', 'good') 0.3659908969991254\n", "('write', 'good') 0.3659908969991254\n", "('want', 'Gale') 0.36154926533158105\n", "('Gale', 'little') 0.3572848665451234\n", "('little', 'Gale') 0.3572848665451234\n", "('enjoy', 'feel') 0.3559461255490225\n", "('like', 'people') 0.3475608250545902\n", "('people', 'like') 0.3475608250545902\n", "('thing', 'Peeta') 0.34681972311915893\n", "('good', 'good') 0.34597203010863326\n", "('like', 'know') 0.33169892075412327\n", "('finish', 'feel') 0.3184885630141222\n", "('feel', 'write') 0.3184885630141222\n", "('people', 'love') 0.31374316951554093\n", "('get', 'Peeta') 0.31131303466224913\n", "('Mockingjay', 'go') 0.3075495227212048\n", "('read', 'find') 0.3041750412311751\n", "('Peeta', 'find') 0.2997522122611732\n", "('love', 'go') 0.2953094678267029\n", "('want', 'war') 0.29440996249395257\n", "('thing', 'Katniss') 0.29326361865019684\n", "('book', 'enjoy') 0.29108549134698153\n", "('come', 'go') 0.28562416209223923\n", "('go', 'come') 0.28562416209223923\n", "('go', 'thing') 0.28562416209223923\n", "('thing', 'go') 0.28562416209223923\n", "('trilogy', 'think') 0.2782246380729349\n", "('come', 'come') 0.2750260311004159\n", "('Katniss', 'leave') 0.2733822480963677\n", "('know', 'think') 0.2696038950290279\n", "('end', 'know') 0.2604714114657554\n", "('get', 'Katniss') 0.257756930193287\n", "('people', 'Katniss') 0.257756930193287\n", "('people', 'think') 0.2546941406627408\n", "('book', 'finish') 0.2536279288120812\n", "('finish', 'book') 0.2536279288120812\n", "('book', 'write') 0.2536279288120812\n", "('go', 'get') 
0.2501174736353292\n", "('go', 'people') 0.2501174736353292\n", "('great', 'go') 0.2501174736353292\n", "('find', 'Katniss') 0.24619610779221102\n", "('like', 'leave') 0.2454031073012873\n", "('time', 'come') 0.24535026295429915\n", "('Katniss', 'know') 0.24189502589282005\n", "('read', 'go') 0.2412126952923697\n", "('Collins', 'time') 0.23934423889408732\n", "('Peeta', 'Peeta') 0.23629185834394714\n", "('happen', 'get') 0.23351331858329408\n", "('read', 'lot') 0.2321217235911176\n", "('think', 'Peeta') 0.22846312851387138\n", "('thing', 'know') 0.2236574383430393\n", "('Collins', 'find') 0.22195249618221805\n", "('come', 'love') 0.21571846534792813\n", "('like', 'enjoy') 0.21439287055872705\n", "('get', 'time') 0.20984357449738925\n", "('Katniss', 'come') 0.20625224166056685\n", "('go', 'Gale') 0.2046550995585718\n", "('want', 'leave') 0.1994934059972098\n", "('leave', 'want') 0.1994934059972098\n", "('time', 'find') 0.19828275209631332\n", "('go', 'finish') 0.19727499226094958\n", "('people', 'know') 0.18815074988612948\n", "('great', 'know') 0.18815074988612948\n", "('Gale', 'happen') 0.18805094450653684\n", "('war', 'enjoy') 0.181386574090557\n", "('character', 'Peeta') 0.17054048078116676\n", "('go', 'Katniss') 0.1703303570174975\n", "('go', 'want') 0.16543438640529065\n", "('feel', 'good') 0.16493830349910737\n", "('come', 'Peeta') 0.16449816632520414\n", "('read', 'read') 0.16426802063442092\n", "('go', 'little') 0.16116998761883308\n", "('Collins', 'good') 0.16065197031842227\n", "('Peeta', 'want') 0.15763707594525905\n", "('read', 'come') 0.1505718566270098\n", "('come', 'leave') 0.1373616248902034\n", "('good', 'time') 0.13698222623251757\n", "('read', 'enjoy') 0.1360478935014953\n", "('write', 'know') 0.1353082685117497\n", "('leave', 'happen') 0.13135560082999143\n", "('get', 'good') 0.1311513059217244\n", "('finish', 'want') 0.13102560671974892\n", "('write', 'want') 0.13102560671974892\n", "('go', 'think') 0.13089992331607678\n", "('feel', 'go') 
0.12975379138545406\n", "('time', 'read') 0.12089608848089323\n", "('read', 'spoiler') 0.12089608848089302\n", "('find', 'good') 0.11959048352064833\n", "('feel', 'feel') 0.11743596951410425\n", "('Peeta', 'end') 0.1116999807582332\n", "('thing', 'enjoy') 0.10635138814764288\n", "('enjoy', 'thing') 0.10635138814764288\n", "('read', 'finish') 0.09859033096659497\n", "('Katniss', 'good') 0.09788420493878523\n", "('little', 'want') 0.09492060207763253\n", "('time', 'feel') 0.08947989224751424\n", "('Katniss', 'lot') 0.08713141316252361\n", "('lot', 'Katniss') 0.08713141316252361\n", "('Katniss', 'love') 0.07980537307045085\n", "('Katniss', 'people') 0.0754353733993322\n", "('Mockingjay', 'feel') 0.07208814953564498\n", "('write', 'come') 0.0688938256127427\n", "('come', 'finish') 0.0688938256127427\n", "('finish', 'thing') 0.0688938256127427\n", "('Mockingjay', 'happen') 0.06780181635495974\n", "('Mockingjay', 'Collins') 0.06780181635495974\n", "('Collins', 'finish') 0.06288780155253057\n", "('happen', 'write') 0.06288780155253057\n", "('finish', 'happen') 0.06288780155253057\n", "('love', 'come') 0.06156778552066988\n", "('love', 'feel') 0.05984809464114317\n", "('come', 'book') 0.054295026191589914\n", "('book', 'happen') 0.04828900213137792\n", "('happen', 'book') 0.04828900213137792\n", "('time', 'Mockingjay') 0.044132072269054995\n", "('get', 'Mockingjay') 0.03830115195826193\n", "('end', 'happen') 0.03390026467927831\n", "('Collins', 'end') 0.03390026467927831\n", "('write', 'get') 0.03338713715583276\n", "('get', 'write') 0.03338713715583276\n", "('people', 'finish') 0.03338713715583276\n", "('come', 'read') 0.03278882097062636\n", "('read', 'thing') 0.03278882097062636\n", "('love', 'time') 0.03189201737455294\n", "('think', 'way') 0.026910209792029096\n", "('way', 'think') 0.026910209792029096\n", "('Mockingjay', 'find') 0.02674032955718588\n", "('find', 'Mockingjay') 0.02674032955718588\n", "('love', 'people') 0.026061097063760082\n", "('like', 'think') 
0.02354886208932412\n", "('Mockingjay', 'know') 0.02243924765779492\n", "('read', 'trilogy') 0.02081262992391053\n", "('end', 'leave') 0.020024918185661537\n", "('book', 'get') 0.018788337734680024\n", "('war', 'Peeta') 0.016389800953908365\n", "('little', 'leave') 0.012907450416797465\n", "('read', 'want') 0.007909225088002656\n", "('find', 'book') 0.007227515333603956\n", "('people', 'end') 0.004399600282580265\n", "('think', 'thing') 0.0025187566678700193\n", "('feel', 'end') 0.0018189536890887228\n", "('Katniss', 'happen') -0.0004244778617961298\n", "('go', 'feel') -0.0037776012390683955\n", "('enjoy', 'end') -0.010985318556898898\n", "('love', 'Gale') -0.019401277012997264\n", "('Gale', 'love') -0.019401277012997264\n", "('go', 'read') -0.021151569175121548\n", "('thing', 'like') -0.02239759459666465\n", "('book', 'Gale') -0.026674036342077354\n", "('go', 'time') -0.03173367850565846\n", "('think', 'people') -0.03298793178903991\n", "('finish', 'end') -0.04844288109179921\n", "('think', 'know') -0.048849836089506604\n", "('think', 'want') -0.05313249788150729\n", "('good', 'book') -0.054073010530191784\n", "('war', 'come') -0.05540389106483467\n", "('write', 'read') -0.05556034886066321\n", "('character', 'feel') -0.05712358194436134\n", "('Peeta', 'come') -0.05864538498900561\n", "('Peeta', 'thing') -0.05864538498900561\n", "('come', 'know') -0.0640246341087415\n", "('end', 'good') -0.06846174798229139\n", "('find', 'like') -0.06946510545465037\n", "('think', 'feel') -0.07330890636537855\n", "('like', 'go') -0.07633798474241234\n", "('end', 'want') -0.08028348694745813\n", "('Gale', 'go') -0.08302697289320887\n", "('end', 'little') -0.08454788573391582\n", "('war', 'get') -0.0909105795217445\n", "('know', 'time') -0.09370040225485816\n", "('Mockingjay', 'enjoy') -0.09486680253760127\n", "('enjoy', 'Mockingjay') -0.09486680253760127\n", "('feel', 'Peeta') -0.09673272003940706\n", "('find', 'war') -0.10247140192282031\n", "('like', 'finish') 
-0.11074676442795403\n", "('come', 'Katniss') -0.11220148945796768\n", "('end', 'think') -0.11481795003667217\n", "('feel', 'read') -0.12308154973615869\n", "('love', 'know') -0.12333219986122934\n", "('want', 'want') -0.1239585504501197\n", "('want', 'love') -0.12761486165323013\n", "('love', 'think') -0.13089678123833973\n", "('leave', 'Katniss') -0.1320828600117966\n", "('Mockingjay', 'write') -0.1323243650725016\n", "('think', 'leave') -0.13514564954234243\n", "('Gale', 'war') -0.13637295359850174\n", "('come', 'happen') -0.1364451010679606\n", "('happen', 'thing') -0.1364451010679606\n", "('thing', 'Collins') -0.1364451010679606\n", "('leave', 'go') -0.1397223165697542\n", "('love', 'get') -0.15626045973019462\n", "('book', 'love') -0.15916321938815614\n", "('love', 'book') -0.15916321938815614\n", "('get', 'thing') -0.16594576546465847\n", "('enjoy', 'think') -0.16615588628490285\n", "('Peeta', 'good') -0.1670134217107873\n", "('find', 'love') -0.16782128213127065\n", "('read', 'leave') -0.16941410637715734\n", "('people', 'Collins') -0.17195178952487034\n", "('people', 'happen') -0.17195178952487034\n", "('know', 'good') -0.17239267083052323\n", "('good', 'know') -0.17239267083052323\n", "('Peeta', 'know') -0.1745524988839533\n", "('come', 'find') -0.1775065878657345\n", "('thing', 'find') -0.1775065878657345\n", "('find', 'thing') -0.1775065878657345\n", "('want', 'Peeta') -0.17883516067595395\n", "('read', 'feel') -0.18762007087372987\n", "('read', 'know') -0.18847880858214766\n", "('Mockingjay', 'Katniss') -0.1956366444868284\n", "('think', 'Mockingjay') -0.19869943401737422\n", "('get', 'get') -0.20145245392156833\n", "('get', 'people') -0.20145245392156833\n", "('go', 'Mockingjay') -0.20327610104478588\n", "('finish', 'think') -0.20361344881980326\n", "('think', 'finish') -0.20361344881980326\n", "('Katniss', 'feel') -0.20744723834831777\n", "('finish', 'go') -0.20819011584721497\n", "('war', 'feel') -0.21127426177161948\n", "('Gale', 'come') 
-0.21140813954141596\n", "('come', 'Gale') -0.21140813954141596\n", "('thing', 'Gale') -0.21140813954141596\n", "('Katniss', 'think') -0.21285850696385425\n", "('find', 'get') -0.21301327632264422\n", "('people', 'find') -0.21301327632264422\n", "('book', 'think') -0.218212248240956\n", "('feel', 'know') -0.2198950048155264\n", "('way', 'read') -0.2305017983569954\n", "('read', 'way') -0.2305017983569954\n", "('enjoy', 'know') -0.23269927706151453\n", "('know', 'enjoy') -0.23269927706151453\n", "('want', 'enjoy') -0.23698193885351523\n", "('end', 'go') -0.2371776527204674\n", "('good', 'thing') -0.23880711372953034\n", "('good', 'happen') -0.24481313778974226\n", "('people', 'Gale') -0.24691482799832576\n", "('come', 'little') -0.2548932514811547\n", "('Gale', 'find') -0.2584756503994017\n", "('Gale', 'think') -0.260771862659752\n", "('happen', 'little') -0.2608992755413667\n", "('know', 'Mockingjay') -0.2652428247939857\n", "('want', 'Mockingjay') -0.2695254865859864\n", "('Mockingjay', 'want') -0.2695254865859864\n", "('good', 'get') -0.2743138021864402\n", "('good', 'people') -0.2743138021864402\n", "('end', 'time') -0.2774515518584073\n", "('end', 'get') -0.28328247216920044\n", "('thing', 'feel') -0.2863094477145336\n", "('leave', 'leave') -0.2879848537717899\n", "('find', 'read') -0.3019607623391402\n", "('find', 'little') -0.3019607623391405\n", "('love', 'end') -0.30708334946477833\n", "('like', 'Collins') -0.31608569110865736\n", "('Gale', 'good') -0.31977617626319743\n", "('love', 'Mockingjay') -0.32197196195852906\n", "('Katniss', 'war') -0.3248483759668344\n", "('love', 'Katniss') -0.32565973503771367\n", "('war', 'think') -0.3279111654973804\n", "('Mockingjay', 'come') -0.3316572676929928\n", "('Mockingjay', 'thing') -0.3316572676929928\n", "('thing', 'Mockingjay') -0.3316572676929928\n", "('go', 'war') -0.3324878325247921\n", "('Peeta', 'go') -0.3357293264489633\n", "('go', 'know') -0.3411085755686991\n", "('love', 'thing') -0.3438973225874946\n", 
"('want', 'go') -0.3453912373606998\n", "('enjoy', 'Katniss') -0.34541465354831163\n", "('like', 'get') -0.3455863555053551\n", "('love', 'happen') -0.34990334664770667\n", "('love', 'Collins') -0.34990334664770667\n", "('happen', 'Peeta') -0.3523334815009983\n", "('finish', 'leave') -0.3564526530492507\n", "('read', 'good') -0.3632612882029363\n", "('little', 'good') -0.3632612882029363\n", "('people', 'Mockingjay') -0.36716395614990255\n", "('Gale', 'feel') -0.3672785102482009\n", "('read', 'think') -0.37324984608644207\n", "('Katniss', 'way') -0.3754921087855896\n", "('Peeta', 'time') -0.376003225586903\n", "('great', 'love') -0.37940401104440435\n", "('enjoy', 'finish') -0.3874628897918111\n", "('love', 'find') -0.39096483344548044\n", "('Gale', 'like') -0.3910487295821126\n", "('leave', 'read') -0.392557657691367\n", "('think', 'love') -0.3932610457058308\n", "('war', 'want') -0.3987372180659927\n", "('thing', 'think') -0.40294635144029434\n", "('happen', 'Katniss') -0.4058895859699604\n", "('Mockingjay', 'Gale') -0.41262633022666\n", "('finish', 'write') -0.42492045232671155\n", "('time', 'Katniss') -0.42955933005586533\n", "('character', 'read') -0.4311724938191467\n", "('like', 'read') -0.43453384152185137\n", "('think', 'great') -0.43845303989720447\n", "('read', 'Collins') -0.4432208323353211\n", "('Mockingjay', 'end') -0.4489939743975347\n", "('find', 'think') -0.45001386229828044\n", "('think', 'find') -0.45001386229828044\n", "('love', 'good') -0.45226535930927625\n", "('think', 'think') -0.4523100745586306\n", "('find', 'go') -0.4545905293256922\n", "('end', 'book') -0.4685067886211167\n", "('know', 'thing') -0.469489742216906\n", "('read', 'Peeta') -0.4707816319141924\n", "('thing', 'want') -0.4737724040089067\n", "('Collins', 'know') -0.475495766277118\n", "('know', 'Collins') -0.475495766277118\n", "('enjoy', 'like') -0.47875431000121826\n", "('Peeta', 'leave') -0.48399186365099905\n", "('leave', 'Peeta') -0.48399186365099905\n", "('write', 
'think') -0.49129552127158416\n", "('know', 'get') -0.5049964306738158\n", "('get', 'want') -0.5092790924658166\n", "('people', 'want') -0.5092790924658166\n", "('want', 'people') -0.5092790924658166\n", "('enjoy', 'Peeta') -0.5150021003935593\n", "('good', 'go') -0.5158910551894879\n", "('like', 'write') -0.5162118725361186\n", "('finish', 'like') -0.5162118725361186\n", "('know', 'find') -0.5165572530748919\n", "('end', 'Katniss') -0.5172202686142907\n", "('want', 'find') -0.5208399148668927\n", "('Katniss', 'little') -0.5243377363831546\n", "('little', 'think') -0.5274005259137006\n", "('book', 'like') -0.5308106719572714\n", "('end', 'like') -0.5451994094093711\n", "('Mockingjay', 'Peeta') -0.5475456481260306\n", "('finish', 'Peeta') -0.5524596629284597\n", "('Gale', 'want') -0.5547414665425742\n", "('book', 'Peeta') -0.5670584623496124\n", "('come', 'feel') -0.5739915201663144\n", "('character', 'Katniss') -0.5761628042477407\n", "('end', 'war') -0.578205705877541\n", "('war', 'end') -0.578205705877541\n", "('happen', 'feel') -0.5799975442265263\n", "('come', 'enjoy') -0.5867957924123024\n", "('Katniss', 'Collins') -0.5882111427639152\n", "('get', 'leave') -0.5912922441266516\n", "('Collins', 'enjoy') -0.5928018164725143\n", "('little', 'know') -0.5939439166903122\n", "('know', 'little') -0.5939439166903122\n", "('find', 'leave') -0.6028530665277277\n", "('leave', 'find') -0.6028530665277277\n", "('write', 'Katniss') -0.6060157673974217\n", "('get', 'feel') -0.6094982086232241\n", "('Katniss', 'book') -0.6206145668185746\n", "('book', 'Katniss') -0.6206145668185746\n", "('find', 'feel') -0.6210590310243002\n", "('read', 'end') -0.6235443864666027\n", "('want', 'feel') -0.6296427747156919\n", "('write', 'happen') -0.6302593790074147\n", "('write', 'Collins') -0.6302593790074147\n", "('feel', 'love') -0.6332990859188021\n", "('leave', 'Gale') -0.6367546182034091\n", "('like', 'war') -0.6405095892136957\n", "('like', 'feel') -0.6482715945491851\n", "('come', 
'end') -0.6532408918204551\n", "('like', 'want') -0.6534129940496035\n", "('time', 'love') -0.6612551631853923\n", "('good', 'leave') -0.6641535923915237\n", "('leave', 'good') -0.6641535923915237\n", "('happen', 'read') -0.666364383649531\n", "('enjoy', 'Gale') -0.6677648549459695\n", "('finish', 'find') -0.6713208658051885\n", "('find', 'finish') -0.6713208658051885\n", "('find', 'write') -0.6713208658051885\n", "('write', 'find') -0.6713208658051885\n", "('Peeta', 'war') -0.676757379606037\n", "('good', 'feel') -0.6823595568880961\n", "('read', 'like') -0.6858482698027576\n", "('get', 'end') -0.688747580277365\n", "('enjoy', 'good') -0.695163829134084\n", "('read', 'people') -0.6958650480462288\n", "('Collins', 'think') -0.6966344479522872\n", "('end', 'find') -0.700308402678441\n", "('Gale', 'write') -0.7052224174808699\n", "('like', 'love') -0.7058594694221462\n", "('come', 'like') -0.71554477515661\n", "('time', 'think') -0.7203041920381921\n", "('happen', 'like') -0.721550799216822\n", "('like', 'happen') -0.721550799216822\n", "('get', 'think') -0.7261351123489852\n", "('good', 'finish') -0.7326213916689844\n", "('love', 'read') -0.7560335695126815\n", "('leave', 'Mockingjay') -0.757003746354986\n", "('Peeta', 'happen') -0.7577985896091629\n", "('love', 'leave') -0.769243801249488\n", "('feel', 'Mockingjay') -0.7752097108515585\n", "('Katniss', 'Mockingjay') -0.7834233093889474\n", "('little', 'read') -0.7848125340627251\n", "('read', 'little') -0.7848125340627251\n", "('Peeta', 'people') -0.7872992540058608\n", "('good', 'think') -0.7989964606138569\n", "('love', 'enjoy') -0.8002540379920483\n", "('Peeta', 'think') -0.8011562886672869\n", "('Katniss', 'thing') -0.805348670017913\n", "('end', 'feel') -0.8091112625272401\n", "('Mockingjay', 'finish') -0.8254715456324468\n", "('leave', 'think') -0.8282928301022877\n", "('come', 'Collins') -0.8295922816279059\n", "('happen', 'Collins') -0.8355983056881179\n", "('write', 'love') -0.8377116005269488\n", 
"('finish', 'love') -0.8377116005269488\n", "('like', 'little') -0.8399989496300161\n", "('great', 'Katniss') -0.8408553584748228\n", "('Katniss', 'great') -0.8408553584748228\n", "('get', 'come') -0.8590929460246038\n", "('enjoy', 'go') -0.8638797338722599\n", "('Collins', 'get') -0.8650989700848156\n", "('get', 'Collins') -0.8650989700848156\n", "('Mockingjay', 'think') -0.8918466145773195\n", "('Katniss', 'write') -0.8936978398492025\n", "('Katniss', 'finish') -0.8936978398492025\n", "('write', 'go') -0.9013372964071603\n", "('go', 'write') -0.9013372964071603\n", "('Peeta', 'feel') -0.907662936255736\n", "('Collins', 'Gale') -0.9105613441615731\n", "('Gale', 'Collins') -0.9105613441615731\n", "('think', 'book') -0.9113594288009013\n", "('Peeta', 'like') -0.9314331555896477\n", "('come', 'good') -0.9319542942894756\n", "('happen', 'good') -0.9379603183496875\n", "('good', 'Collins') -0.9379603183496875\n", "('love', 'war') -0.9620093172045259\n", "('good', 'find') -0.9790218051474614\n", "('think', 'like') -0.9880520495891558\n", "('go', 'like') -0.9926287166165674\n", "('feel', 'find') -1.0265241391324649\n", "('feel', 'think') -1.028820351392815\n", "('Gale', 'read') -1.029009494574767\n", "('Collins', 'Mockingjay') -1.0308104723131502\n", "('Collins', 'love') -1.043050527207652\n", "('happen', 'love') -1.043050527207652\n", "('good', 'read') -1.0564084687628816\n", "('come', 'think') -1.0960935320002396\n", "('little', 'feel') -1.103910802747885\n", "('end', 'enjoy') -1.1095976072250087\n", "('Mockingjay', 'good') -1.1331724849747198\n", "('leave', 'like') -1.1408912538186033\n", "('end', 'write') -1.147055169759909\n", "('write', 'end') -1.147055169759909\n", "('end', 'finish') -1.147055169759909\n", "('go', 'find') -1.1477377098856376\n", "('end', 'love') -1.154381209851982\n", "('little', 'love') -1.161498677620846\n", "('happen', 'know') -1.1686429468370634\n", "('want', 'Collins') -1.172925608629064\n", "('end', 'read') -1.1831601744020255\n", "('find', 
'know') -1.2097044336348373\n", "('read', 'Katniss') -1.2174849169430997\n", "('Peeta', 'write') -1.2456068434884051\n", "('Peeta', 'finish') -1.2456068434884051\n", "('good', 'want') -1.2752876212906339\n", "('want', 'good') -1.2752876212906339\n", "('war', 'read') -1.2784703542063502\n", "('finish', 'Katniss') -1.299162947957367\n", "('go', 'love') -1.3141284446073975\n", "('Katniss', 'read') -1.3352679525994833\n", "('read', 'love') -1.3438202344148005\n", "('end', 'come') -1.3463880723804005\n", "('thing', 'read') -1.3535055401492642\n", "('Collins', 'read') -1.3595115642094764\n", "('get', 'read') -1.3890122286061741\n", "('read', 'get') -1.3890122286061741\n", "('people', 'read') -1.3890122286061741\n", "('Katniss', 'like') -1.3904543681667745\n", "('Collins', 'like') -1.4146979797767674\n", "('war', 'Katniss') -1.4234606646349444\n", "('read', 'Gale') -1.4344746026829316\n", "('want', 'think') -1.4394268590013979\n", "('get', 'like') -1.4441986441734649\n", "('Collins', 'Peeta') -1.4509457701691082\n", "('Peeta', 'Collins') -1.4509457701691082\n", "('good', 'end') -1.4547561091021821\n", "('leave', 'love') -1.4623909818094332\n", "('Katniss', 'Katniss') -1.4802582630280776\n", "('enjoy', 'love') -1.4934012185519936\n", "('good', 'like') -1.517059992438337\n", "('love', 'write') -1.530858781086894\n", "('good', 'Peeta') -1.5533077828306778\n", "('feel', 'like') -1.5645623264233401\n", "('Peeta', 'read') -1.569393920582302\n", "('good', 'Katniss') -1.6068638872996401\n", "('love', 'like') -1.6221502012963014\n", "('Peeta', 'Mockingjay') -1.6461579367941404\n", "('want', 'like') -1.7520252827177134\n", "('get', 'love') -1.765698372164295\n", "('happen', 'think') -1.7952467366203972\n", "('read', 'write') -1.8473198180887183\n", "('love', 'want') -2.0735250107085434\n" ] } ], "source": [ "for term in pmi_cooc:\n", " print(term, pmi_cooc[term])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "- compare genres\n", "\n", "- differences in 
subjectivity are not noticeable at small scale; a focused, larger-scale subset is needed to bring them out, but they can be drowned out again in very large sets\n", "\n", "- identifying topics requires a large scale\n", "\n", "- named entities are manageable at small scale, but become harder to deal with at large scale: most are in the long tail, unknown, and identified with lower accuracy\n", "\n", "- many aspects become harder to summarise and organise at large scale" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "1\n", "2\n", "3\n", "4\n", "5\n", "6\n", "7\n", "8\n", "9\n", "10\n", "11\n" ] } ], "source": [ "import math\n", "\n", "# Select the first ten rows (not displayed, since this is not the cell's last expression)\n", "review_df.iloc[0:10]\n", "\n", "# Number of chunks of 10,000 needed to cover 113,000 items, rounding up\n", "chunks = math.ceil(113000 / 10000)\n", "for chunk in range(chunks):\n", "    print(chunk)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }