{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Filtering Goodreads Reviews\n", "\n", "Data exploration surfaced a number of issues with the reviews that require some form of data *cleaning*, i.e. *selection* and *normalization* of reviews.\n", "\n", "This notebook shows the cleaning steps that were taken." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Non-Reviews\n", "\n", "A plot of the review length distribution (length in number of characters) revealed high peaks at a few specific lengths. For example, there are many more reviews of length 3 than expected given the rest of the distribution. Inspection revealed that many of those 3-character reviews contain only a rating, like '3.5' or '4.5'.\n", "\n", "Another peak occurs at length 40: a large number of reviews consist only of a URL for a webpage that contains the actual review. Goodreads shortens longer URLs to 40 characters in the anchor text of an HTML `<a>` element for display, with the full URL in the anchor `href` attribute. There are 30,277 such reviews.\n", "\n", "Types of non-reviews:\n", "\n", "- length 0: empty reviews, with no review content at all.\n", "- length 3: mainly reviews that only mention a rating, like '3.5' or '4.5'.\n", "- length 9-12: mainly reviews that only mention a rating followed by the word 'stars', like '3.5 stars' or '4.5 stars'.\n", "- length 40: mainly reviews that consist only of a URL for a webpage that contains the actual review, as described above.
\n", "\n", "\n", "\n", "The steps below aim to remove these so-called *non-reviews*:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/marijnkoolen/Code/Huygens/scale\n" ] } ], "source": [ "# The autoreload extension is only used while developing the modules in the scripts directory\n", "# and can be removed once those modules are stable.\n", "%reload_ext autoreload\n", "%autoreload 2\n", "\n", "# This is needed to add the repo dir to the path so jupyter\n", "# can load the modules in the scripts directory from the notebooks\n", "import os\n", "import sys\n", "repo_dir = os.path.split(os.getcwd())[0]\n", "print(repo_dir)\n", "if repo_dir not in sys.path:\n", "    sys.path.append(repo_dir)\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import json\n", "import csv\n", "from collections import Counter\n", "import gzip\n", "\n", "data_dir = '/Volumes/Samsung_T5/Data/Book-Reviews/GoodReads/'\n", "\n", "author_file = os.path.join(data_dir, 'goodreads_book_authors.csv.gz') # author information\n", "book_file = os.path.join(data_dir, 'goodreads_books.csv.gz') # basic book metadata\n", "genre_file = os.path.join(data_dir, 'goodreads_book_genres_initial.csv.gz') # book genre information\n", "review_file = os.path.join(data_dir, 'goodreads_reviews_dedup-no_text.csv.gz') # excludes text to save memory\n", "review_text_file = os.path.join(data_dir, 'goodreads_reviews_dedup.csv.gz') # includes text\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "15739967 rows × 11 columns\n", "
" ], "text/plain": [ " user_id book_id \\\n", "0 8842281e1d1347389f2ab93d60773d4d 24375664 \n", "1 8842281e1d1347389f2ab93d60773d4d 18245960 \n", "2 8842281e1d1347389f2ab93d60773d4d 6392944 \n", "3 8842281e1d1347389f2ab93d60773d4d 22078596 \n", "4 8842281e1d1347389f2ab93d60773d4d 6644782 \n", "... ... ... \n", "15739962 d0f6d1a4edcab80a6010cfcfeda4999f 1656001 \n", "15739963 594c86711bd7acdaf655d102df52a9cb 10024429 \n", "15739964 594c86711bd7acdaf655d102df52a9cb 6721437 \n", "15739965 594c86711bd7acdaf655d102df52a9cb 15788197 \n", "15739966 594c86711bd7acdaf655d102df52a9cb 8239301 \n", "\n", " review_id rating \\\n", "0 5cd416f3efc3f944fce4ce2db2290d5e 5 \n", "1 dfdbb7b0eb5a7e4c26d59a937e2e5feb 5 \n", "2 5e212a62bced17b4dbe41150e5bb9037 3 \n", "3 fdd13cad0695656be99828cd75d6eb73 4 \n", "4 bd0df91c9d918c0e433b9ab3a9a5c451 4 \n", "... ... ... \n", "15739962 b3d9a00405f7e96752d67b85deda4c7d 4 \n", "15739963 2bcba3579aa1d728e664de293e16aacf 5 \n", "15739964 7c1a7fcc2614a1a2a29213c11c991083 3 \n", "15739965 74a9f9d1db09a90aae3a5acea68c6593 2 \n", "15739966 f2af741fb7a99ff730cf29e004f127da 4 \n", "\n", " date_added date_updated \\\n", "0 Fri Aug 25 13:55:02 -0700 2017 Mon Oct 09 08:55:59 -0700 2017 \n", "1 Sun Jul 30 07:44:10 -0700 2017 Wed Aug 30 00:00:26 -0700 2017 \n", "2 Mon Jul 24 02:48:17 -0700 2017 Sun Jul 30 09:28:03 -0700 2017 \n", "3 Mon Jul 24 02:33:09 -0700 2017 Sun Jul 30 10:23:54 -0700 2017 \n", "4 Mon Jul 24 02:28:14 -0700 2017 Thu Aug 24 00:07:20 -0700 2017 \n", "... ... ... 
\n", "15739962 Mon Jun 04 18:08:44 -0700 2012 Tue Jun 26 18:58:46 -0700 2012 \n", "15739963 Fri Aug 01 18:46:18 -0700 2014 Fri Aug 01 18:47:07 -0700 2014 \n", "15739964 Tue Aug 27 12:49:25 -0700 2013 Tue Aug 27 12:53:46 -0700 2013 \n", "15739965 Fri May 03 13:06:15 -0700 2013 Fri May 03 15:35:39 -0700 2013 \n", "15739966 Sat Apr 20 15:18:15 -0700 2013 Thu May 02 16:51:20 -0700 2013 \n", "\n", " read_at started_at \\\n", "0 Sat Oct 07 00:00:00 -0700 2017 Sat Aug 26 00:00:00 -0700 2017 \n", "1 Sat Aug 26 12:05:52 -0700 2017 Tue Aug 15 13:23:18 -0700 2017 \n", "2 Tue Jul 25 00:00:00 -0700 2017 Mon Jul 24 00:00:00 -0700 2017 \n", "3 Sun Jul 30 15:42:05 -0700 2017 Tue Jul 25 00:00:00 -0700 2017 \n", "4 Sat Aug 05 00:00:00 -0700 2017 Sun Jul 30 00:00:00 -0700 2017 \n", "... ... ... \n", "15739962 NaN Sun Jun 10 00:00:00 -0700 2012 \n", "15739963 NaN NaN \n", "15739964 NaN NaN \n", "15739965 Fri May 03 15:35:39 -0700 2013 Fri May 03 00:00:00 -0700 2013 \n", "15739966 Thu May 02 16:51:20 -0700 2013 Sat Apr 20 00:00:00 -0700 2013 \n", "\n", " n_votes n_comments review_length \n", "0 16 0 968 \n", "1 28 1 2086 \n", "2 6 0 474 \n", "3 22 4 962 \n", "4 8 0 420 \n", "... ... ... ... \n", "15739962 0 1 299 \n", "15739963 0 0 71 \n", "15739964 0 0 224 \n", "15739965 0 0 108 \n", "15739966 0 0 6 \n", "\n", "[15739967 rows x 11 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "review_df = pd.read_csv(review_file, sep='\\t', compression='gzip')\n", "\n", "review_df\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "review_df.review_length.value_counts().sort_index().plot(logx=True)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 6938\n", "1 19717\n", "2 13640\n", "3 45288\n", "4 24220\n", "5 22144\n", "6 15422\n", "7 19845\n", "8 24205\n", "9 66297\n", "10 47467\n", "11 37583\n", "12 30385\n", "13 26734\n", "14 32114\n", "15 35955\n", "16 32618\n", "17 35097\n", "18 34022\n", "19 35913\n", "20 33675\n", "21 31992\n", "22 32401\n", "23 31484\n", "24 32009\n", "25 32592\n", "26 33616\n", "27 32977\n", "28 33254\n", "29 33810\n", "30 33634\n", "31 33967\n", "32 34169\n", "33 33071\n", "34 33903\n", "35 33726\n", "36 33558\n", "37 34014\n", "38 33884\n", "39 33581\n", "40 64165\n", "41 34872\n", "42 34845\n", "43 33229\n", "44 33592\n", "45 34321\n", "46 34057\n", "47 33651\n", "48 33981\n", "49 33656\n", "Name: review_length, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "review_df[review_df.review_length < 50].review_length.value_counts().sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following steps check individual reviews for characteristics of non-reviews and create a derived review file with the identified non-reviews removed.\n" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "# helper is a module with simple helper functions\n", "from langdetect import detect\n", "from langdetect.lang_detect_exception import LangDetectException\n", "from scripts.helper import read_csv\n", "from collections import Counter\n", "import re\n", "\n", "def is_url(record):\n", "    # short review that consists of a (shortened) URL\n", "    return record['review_length'] <= 40 and record['review_text'].startswith('http')\n", "\n", "def is_rating(record):\n", "    if record['review_length'] > 12:\n", "        return False\n", "    # up to three characters including a digit, like '3.5'\n", "    if record['review_length'] < 4 and re.search(r'\\d', record['review_text']):\n", "        return True\n", "    # a digit combined with a rating word, like '3.5 stars'\n", "    for word in ['star', 'stars', 'sterne', 'ster', 'sterren', 'rating']:\n", "        if re.search(word, record['review_text'], re.IGNORECASE) and re.search(r'\\d', record['review_text']):\n", "            return True\n", "    return False\n", "\n", "def is_date(record):\n", "    if record['review_length'] > 12:\n", "        return False\n", "    # matches '20' followed by up to two digits, e.g. a year like 2017\n", "    if re.search(r'20\\d{,2}', record['review_text']):\n", "        return True\n", "    return False\n", "\n", "def is_empty(record):\n", "    return record['review_length'] == 0\n", "\n", "def is_non_review(record):\n", "    if record['review_length'] > 40:\n", "        return False\n", "    return is_empty(record) or is_url(record) or is_rating(record) or is_date(record)\n", "\n", "def lang_detect(record):\n", "    try:\n", "        return detect(record['review_text'])\n", "    except LangDetectException:\n", "        return 'unknown'\n", "\n", "def is_english(record):\n", "    return lang_detect(record) == 'en'\n", "\n" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Volumes/Samsung_T5/Data/Book-Reviews/GoodReads/goodreads_reviews_dedup_filtered.csv.gz\n", "1000000 records parsed\n", "2000000 records parsed\n", "3000000 records parsed\n", "4000000 records parsed\n", "5000000 records parsed\n", "6000000 records parsed\n", "7000000 records parsed\n", "8000000 records parsed\n", "9000000 records parsed\n", "10000000 records parsed\n", "11000000 records parsed\n", "12000000 records parsed\n", "13000000 records parsed\n", "14000000 records parsed\n", "15000000 records parsed\n", "15739967 records parsed\n" ] } ], "source": [ "review_filtered_file = os.path.join(data_dir, 'goodreads_reviews_dedup_filtered-no_text.csv.gz') # excludes text and non-reviews\n", "\n", "\n", "headers = [\n", "    'user_id', 'book_id', 'review_id', 'rating', 'date_added', 'date_updated', 'read_at',\n", "    'started_at', 'n_votes', 'n_comments', 'review_length', 'review_text'\n", "]\n", "\n", 
"# filtered review file that includes text; path as shown in the recorded cell output\n", "filtered_file = os.path.join(data_dir, 'goodreads_reviews_dedup_filtered.csv.gz')\n", "print(filtered_file)\n", "\n", "with gzip.open(filtered_file, 'wt') as fh:\n", "    writer = csv.writer(fh, delimiter='\\t')\n", "    writer.writerow(headers)\n", "    for ri, record in enumerate(read_csv(review_text_file)):\n", "        record['review_length'] = int(record['review_length'])\n", "        if is_non_review(record):\n", "            continue\n", "        row = [record[header] for header in headers]\n", "        writer.writerow(row)\n", "        if (ri+1) % 1000000 == 0:\n", "            print(ri+1, 'records parsed')\n", "\n", "print(ri+1, 'records parsed')\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "15616197 rows × 11 columns\n", "
" ], "text/plain": [ " user_id book_id \\\n", "0 8842281e1d1347389f2ab93d60773d4d 24375664 \n", "1 8842281e1d1347389f2ab93d60773d4d 18245960 \n", "2 8842281e1d1347389f2ab93d60773d4d 6392944 \n", "3 8842281e1d1347389f2ab93d60773d4d 22078596 \n", "4 8842281e1d1347389f2ab93d60773d4d 6644782 \n", "... ... ... \n", "15616192 d0f6d1a4edcab80a6010cfcfeda4999f 1656001 \n", "15616193 594c86711bd7acdaf655d102df52a9cb 10024429 \n", "15616194 594c86711bd7acdaf655d102df52a9cb 6721437 \n", "15616195 594c86711bd7acdaf655d102df52a9cb 15788197 \n", "15616196 594c86711bd7acdaf655d102df52a9cb 8239301 \n", "\n", " review_id rating \\\n", "0 5cd416f3efc3f944fce4ce2db2290d5e 5 \n", "1 dfdbb7b0eb5a7e4c26d59a937e2e5feb 5 \n", "2 5e212a62bced17b4dbe41150e5bb9037 3 \n", "3 fdd13cad0695656be99828cd75d6eb73 4 \n", "4 bd0df91c9d918c0e433b9ab3a9a5c451 4 \n", "... ... ... \n", "15616192 b3d9a00405f7e96752d67b85deda4c7d 4 \n", "15616193 2bcba3579aa1d728e664de293e16aacf 5 \n", "15616194 7c1a7fcc2614a1a2a29213c11c991083 3 \n", "15616195 74a9f9d1db09a90aae3a5acea68c6593 2 \n", "15616196 f2af741fb7a99ff730cf29e004f127da 4 \n", "\n", " date_added date_updated \\\n", "0 Fri Aug 25 13:55:02 -0700 2017 Mon Oct 09 08:55:59 -0700 2017 \n", "1 Sun Jul 30 07:44:10 -0700 2017 Wed Aug 30 00:00:26 -0700 2017 \n", "2 Mon Jul 24 02:48:17 -0700 2017 Sun Jul 30 09:28:03 -0700 2017 \n", "3 Mon Jul 24 02:33:09 -0700 2017 Sun Jul 30 10:23:54 -0700 2017 \n", "4 Mon Jul 24 02:28:14 -0700 2017 Thu Aug 24 00:07:20 -0700 2017 \n", "... ... ... 
\n", "15616192 Mon Jun 04 18:08:44 -0700 2012 Tue Jun 26 18:58:46 -0700 2012 \n", "15616193 Fri Aug 01 18:46:18 -0700 2014 Fri Aug 01 18:47:07 -0700 2014 \n", "15616194 Tue Aug 27 12:49:25 -0700 2013 Tue Aug 27 12:53:46 -0700 2013 \n", "15616195 Fri May 03 13:06:15 -0700 2013 Fri May 03 15:35:39 -0700 2013 \n", "15616196 Sat Apr 20 15:18:15 -0700 2013 Thu May 02 16:51:20 -0700 2013 \n", "\n", " read_at started_at \\\n", "0 Sat Oct 07 00:00:00 -0700 2017 Sat Aug 26 00:00:00 -0700 2017 \n", "1 Sat Aug 26 12:05:52 -0700 2017 Tue Aug 15 13:23:18 -0700 2017 \n", "2 Tue Jul 25 00:00:00 -0700 2017 Mon Jul 24 00:00:00 -0700 2017 \n", "3 Sun Jul 30 15:42:05 -0700 2017 Tue Jul 25 00:00:00 -0700 2017 \n", "4 Sat Aug 05 00:00:00 -0700 2017 Sun Jul 30 00:00:00 -0700 2017 \n", "... ... ... \n", "15616192 NaN Sun Jun 10 00:00:00 -0700 2012 \n", "15616193 NaN NaN \n", "15616194 NaN NaN \n", "15616195 Fri May 03 15:35:39 -0700 2013 Fri May 03 00:00:00 -0700 2013 \n", "15616196 Thu May 02 16:51:20 -0700 2013 Sat Apr 20 00:00:00 -0700 2013 \n", "\n", " n_votes n_comments review_length \n", "0 16 0 968 \n", "1 28 1 2086 \n", "2 6 0 474 \n", "3 22 4 962 \n", "4 8 0 420 \n", "... ... ... ... 
\n", "15616192 0 1 299 \n", "15616193 0 0 71 \n", "15616194 0 0 224 \n", "15616195 0 0 108 \n", "15616196 0 0 6 \n", "\n", "[15616197 rows x 11 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the filtered review file excludes text and non-reviews\n", "review_filtered_file = os.path.join(data_dir, 'goodreads_reviews_dedup_filtered-no_text.csv.gz')\n", "\n", "review_df = pd.read_csv(review_filtered_file, sep='\\t', compression='gzip')\n", "\n", "review_df\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['user_id', 'book_id', 'review_id', 'rating', 'date_added',\n", " 'date_updated', 'read_at', 'started_at', 'n_votes', 'n_comments',\n", " 'review_length', 'work_id'],\n", " dtype='object')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from dateutil import tz\n", "from dateutil.parser import parse\n", "\n", "utc = tz.gettz('UTC')\n", "\n", "def parse_date(date_str):\n", "    # convert a date string to a UTC datetime; missing values (NaN) become None\n", "    try:\n", "        return parse(date_str).astimezone(utc)\n", "    except TypeError:\n", "        return None\n", "\n", "#book_df = pd.read_csv(book_file, sep='\\t', compression='gzip')\n", "\n", "#book_df[['book_id', 'work_id']]\n", "#review_df = pd.merge(review_df, book_df[['book_id', 'work_id']], on='book_id', how='left')\n", "\n", "review_df['date_added'] = review_df.date_added.apply(parse_date)\n", "review_df['date_updated'] = review_df.date_updated.apply(parse_date)\n", "review_df['read_at'] = review_df.read_at.apply(parse_date)\n", "review_df['started_at'] = review_df.started_at.apply(parse_date)\n", "\n", "review_df.columns" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# index=False avoids writing the dataframe index as an extra unnamed column\n", "review_df.to_csv(review_filtered_file, sep='\\t', compression='gzip', index=False)\n", "\n", "#review_df = pd.read_csv(review_filtered_file, sep='\\t', compression='gzip')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Making review subsets 
for content analysis\n", "\n", "The entire Goodreads review collection, including all the review text, is too large to read into a single dataframe, so we create a number of sample review subsets with text included that will be used for content analysis.\n", "\n", "The following selection criteria will be used to analyse various aspects of scale:\n", "\n", "- all reviews of frequently reviewed books (repetition across reviews as book characteristics)\n", "- all reviews of frequent reviewers (repetition across reviews as reviewer characteristics)\n", "- random sample of reviews (repetition across reviews as book review characteristics)\n" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[11870085, 2767052, 29056083, 20309175, 7260188, 22557272, 5470, 6148028, 19063, 10818853, 13335037, 41865]\n", "[8812783, 15732562, 2792775, 3212258, 878368, 153313, 153313, 6171458, 41107568, 16827462, 48765776, 48765776, 28143699, 28143699, 28143699, 28143699, 28143699, 28143699, 28143699, 28143699, 28143699, 28143699, 28143699, 13155899]\n", "number of books with over 10000 reviews: 12\n", "number of reviews for books with over 10000 reviews: 167512\n" ] } ], "source": [ "# book metadata, needed to map book_id to work_id\n", "book_df = pd.read_csv(book_file, sep='\\t', compression='gzip')\n", "\n", "counts = review_df.book_id.value_counts()\n", "\n", "threshold = 10000\n", "books_above_10k = counts[counts > threshold].index.tolist()\n", "book_ids = list(book_df[book_df.book_id.isin(books_above_10k)].book_id)\n", "work_ids = list(book_df[book_df.book_id.isin(books_above_10k)].work_id)\n", "print(books_above_10k)\n", "print(work_ids)\n", "mapping = {str(book_id): work_id for book_id, work_id in zip(book_ids, work_ids)}\n", "# the review file is read as text, so book ids are compared as strings\n", "books_above_10k = [str(book_id) for book_id in books_above_10k]\n", "print(f'number of books with over {threshold} reviews:', len(counts[counts > threshold]))\n", "print(f'number of reviews for books with over {threshold} reviews:', sum(counts[counts > threshold]))\n", "\n", "data_dir = 
'/Volumes/Samsung_T5/Data/Book-Reviews/GoodReads/'\n", "sample_review_text_file = os.path.join(data_dir, 'goodreads_reviews-books_above_10k_reviews.csv.gz') # includes text\n", "\n", "headers = [\n", "    'user_id', 'book_id', 'work_id', 'review_id', 'rating', 'date_added', 'date_updated', 'read_at', 'started_at',\n", "    'n_votes', 'n_comments', 'review_length', 'review_lang', 'review_text'\n", "]\n", "\n", "with gzip.open(sample_review_text_file, 'wt') as fh:\n", "    writer = csv.writer(fh, delimiter='\\t')\n", "    writer.writerow(headers)\n", "    for review in read_csv(review_text_file):\n", "        if review['book_id'] not in books_above_10k:\n", "            continue\n", "        review['review_lang'] = lang_detect(review)\n", "        review['work_id'] = mapping[review['book_id']]\n", "        row = [review[header] for header in headers]\n", "        writer.writerow(row)\n" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[16827462, 48765776, 2792775, 28143699, 13306276, 13155899, 41107568, 8812783, 153313, 878368, 17763198, 6171458, 4640799, 15732562, 21825181, 15524549, 17225055, 14863741, 14345371, 3212258, 15545385, 2267189, 14245059, 15524542, 21861351, 4835472, 2754161, 3275794, 50459161]\n", "{'17623911', '7623768', '17380203', '6469151', '848654', '13533423', '33229067', '634951', '1167751', '27069714', '25615225', '22919918', '6794926', '35478512', '20555501', '20661705', '18933600', '20792791', '23219487', '32678493', '31768151', '23211741', '18219043', '25459160', '24792411', '742573', '3708952', '13517276', '17225793', '22747979', '12783911', '31224712', '8045416', '7631356', '16071194', '24483350', '27281555', '10783858', '30652277', '16039145', '33285096', '35547598', '12658862', '21464429', '31333626', '30129222', '33638045', '6703256', '28421203', '17841055', '8376549', '22444974', '6565942', '16049993', '15780438', '2728527', '13541067', '7561318', '23562530', '30332792', '8146730', '1175483', 
'1029083', '18717073', '15719793', '22907938', '20634095', '30306795', '16100002', '8775537', '28352416', '34615046', '20765880', '31450752', '115233', '24965395', '5136453', '14289293', '18523885', '1294624', '7378684', '3171676', '32604784', '34964436', '1175482', '32196033', '18779332', '15719795', '29624161', '6089758', '6077978', '9923043', '25928461', '19288043', '24675652', '17257607', '19733543', '25750832', '12091570', '15860667', '18189665', '13261812', '10818853', '23394970', '22077269', '18086425', '3744438', '12083073', '21976013', '32311968', '5207452', '20621490', '18913703', '12465069', '15762884', '11494324', '35614220', '653187', '1811146', '1032890', '26029387', '20804177', '20760125', '15723988', '25492938', '10339954', '31140032', '121121', '7667056', '25782431', '9771265', '26868374', '16010180', '20513634', '15868', '3207547', '70740', '9312497', '3345529', '27880360', '7889219', '8306706', '13450845', '24703950', '23347115', '17466045', '20799366', '6469152', '6333218', '1303749', '11283315', '24950272', '7146518', '12213207', '22558519', '17899904', '17796487', '20939592', '18524290', '31279693', '4255', '12386561', '1090736', '29855909', '17697760', '11019770', '12058988', '20895289', '34171499', '31201349', '11506091', '169182', '11072324', '22443450', '19458358', '28933083', '7459906', '22206765', '6596839', '6421355', '17182122', '12576748', '12973516', '12187803', '20446165', '22052749', '32711626', '15747213', '14462380', '18592792', '8116091', '18456083', '22877052', '20819401', '27856546', '20939738', '18144590', '30527969', '25889880', '2193926', '11489526', '23487535', '23813546', '28075512', '1185972', '25576955', '21558662', '11112731', '28116469', '23272156', '15711759', '12823877', '33407095', '22019444', '29561070', '34803847', '6007025', '87784', '18085525', '25570789', '26050341', '27468538', '9970464', '25532234', '33964226', '17736795', '15871', '18626824', '17610151', '3591974', '22008879', '10089874', '16001467', 
'25398136', '34119033', '430569', '5475', '7932736', '1401476', '5471', '28801690', '25767346', '20563811', '13540149', '9856721', '18688016', '16300093', '6419743', '28930862', '22747441', '5470', '27178169', '16180175', '6696985', '21540995', '17839087', '12657479', '21560020', '25426791', '1471425', '8683527', '11262855', '107497', '9831947', '18713720', '33876518', '18949467', '22918384', '6668046', '35711865', '25038707', '29055777', '19383583', '7202990', '17282465', '14595296', '17317768', '28191772', '19254270', '1162898', '23802330', '28670400', '17797233', '17789078', '13623558', '17982595', '26514633', '7432752', '35596721', '3296450', '9670791', '1167752', '7107665', '26845374', '36226992', '7877620', '16847978', '18689113', '17345357', '24024309', '32854695', '20873522', '12935613', '11502420', '22469599', '25753230', '13625800', '16287485', '26085996', '18461014', '20894754', '15774971', '28094773', '18416234', '13641954', '25372528', '21522382', '2262747', '10296957', '22565648', '28960475', '9969571', '970917', '18667069', '13614350', '23698704', '13561202', '7005741', '16044564', '23695850', '6201586', '256683', '17253765', '11272641', '1558486', '925412', '23223667', '22059026', '18710190', '13424701', '24112181', '22105022', '13602666', '7782572', '848655', '18158464', '13558128', '17406183', '24252929', '11367926', '2187995', '17305435', '20318659', '19478439', '18617752', '26243089', '22820624', '22612307', '25048937', '17380078', '18070037', '25351314', '6214263', '18923851', '17056572', '6682611', '17855823', '7877609', '6988836', '9860176', '30272403', '6287607', '13446802', '25989499', '28927632', '16082380', '1792183', '29090191', '15704186', '31342081', '20362341', '21848807', '21544013', '32196100', '12512617', '18085454', '25477553', '17285979', '23607406', '23586798', '15784168', '35610130', '18277351', '10234834', '18740106', '32787035', '22057878', '22374550', '25032511', '13131651', '26462167', '28364349', '27865405', '28116173', 
'12370908', '12379979', '23364977', '13562614', '26162946', '26401901', '32469600', '6484809', '15827051', '18517479', '16386616', '30235395', '35291237', '17879118', '22518395', '20799951', '23306917', '13087683', '16638435', '33013778', '32705495', '18750002', '32327896', '20815159', '13611052', '20958895', '26865481', '12535267', '15845588', '20346560', '19140487', '20885669', '11465793', '16110229', '18367755', '16431267', '893131', '25002842', '26328452', '3086961', '17999114', '22820646', '35267531', '26876834', '27068801', '13556970', '32596639', '23902261', '20747215', '296843', '27839326', '17905376', '35561948', '9335682', '455592', '23122150', '13646978', '32672169', '5999961', '18711392', '295898', '20876955', '18400305', '20651067', '21798664', '6017885', '22209071', '20826236', '11476408', '18710654', '17244064', '10616360', '33812720', '36217912', '12497752', '12626302', '18738787', '23592434', '15719753', '23007086', '33791193', '21569815', '22569433', '171813', '13570308', '22247199', '6369436', '17902413', '16234117', '23245346', '18217608', '29546094', '18917661', '15771057', '13621880', '12643928', '22008638', '15736528', '25730991', '1175481', '22820703', '33818894', '18465782', '77523', '10754989', '5477', '4479640', '18274994', '10430432', '18710464', '1781701', '18046319', '7107296', '17789454', '25862866', '25578651', '35624010', '6490389', '29942695', '17182084', '13562963', '12066190', '13184882', '25478508', '12492298', '22451087', '30308795', '25952311', '18722749', '16025268', '25345762', '16018364', '30812729', '19358657', '15743060', '24503488', '18248500', '12280320', '2938728', '8040193', '25851222', '6606279', '13425323', '23463187', '26840292', '22590639', '26826098', '1611724', '1359366', '23003589', '32672170', '16038170', '7776692', '17934350', '15114365', '8667747', '13517330', '18500423', '16043861', '15770294', '23392951', '21800933', '15869972', '18618963', '25220067', '23667081', '16106928', '17802732', '9761397', 
'6064259', '31700393', '13159888', '23478203', '31304766', '29983806', '25806040', '6772333', '12408058', '29245055', '25095072', '22301044', '23213324', '13489404', '2239941', '15755597', '25402233', '18667071', '3883919', '1039462', '33655993', '17905623', '32306675', '25580994', '25878460', '41865', '27421523', '22397261', '16118339', '9480062', '949562', '872666', '11316396', '21825308', '22946401', '22914379', '17268167', '23264557', '13627059', '11333587', '15991856', '11736539', '5527857', '30507892', '17951429', '18298900', '13372855', '7937116', '20803587', '17319298', '6721985', '13609836', '2617468', '18750632', '12042324', '2747979', '25538495', '6772387', '30631952', '8797737', '15852552', '21415819', '23396332', '15830744', '20763122', '6013089', '7303448', '9226958', '12861705', '23677696', '27804437', '23431006', '34466147', '24629547', '47000', '17315048', '16142085', '13519470', '25491310', '32327029', '16221771', '14759663', '31125943', '21899768', '30963868', '25905906', '18804646', '25303122', '29995403', '30290686', '32172002', '23148432', '18685229', '26244621', '31686251', '17678435', '25270785', '19240738', '23499218', '23011238', '14424887', '13573833', '21838140', '6460568', '22733227', '19019010', '26344382', '22028282', '6535792', '13035203', '11296680', '16136188', '21895490', '18071590', '25414128', '24947668', '30639879', '24060213', '32190294', '32196024', '18505857', '25402339', '8267080', '10428902', '21825994', '16094784', '8430187', '15298019', '22886522', '25841708', '6820997', '25086476', '13335038', '28593492', '23231754', '31302742', '26699682', '18333278', '14069099', '23752009', '35427227', '10319865', '11471507', '8891899', '16163526', '12656079', '31938415', '23944267', '25739188', '19564919', '893136', '13578037', '26815023', '27251284', '13643473', '22026125', '17807489', '20549381', '18926988', '19304768', '6249519', '33959060', '25676931', '16103612', '11718435', '25153867', '21081683', '25563463', '19030181', 
'32721509', '11831028', '36383978', '4591352', '25532945', '10435027', '15784893', '30302897', '22173562', '816816', '7955579', '31681856', '15833672', '34391120', '3362689', '16025398', '24885537', '12835105', '18667072', '26067879', '29216063', '1781707', '18193741', '15779138', '21926693', '13565922', '24581473', '28572585', '32800617', '17232449', '25782357', '30798481', '18755030', '12328746', '27067314', '17970732', '23251775', '28963439', '23969542', '15860371', '15792385', '30268316', '17928172', '296839', '23241465', '24742536', '12600138', '2140907', '25185586', '17185524', '17563287', '7062423', '23565121', '24737104', '28255378', '26050627', '8419315', '34863649', '7285601', '27835858', '17163051', '4794161', '20838734', '15698354', '8125982', '6567260', '22138225', '25035808', '30074903', '17286849', '15715013', '1118668', '18661631', '8322565', '16149483', '11059672', '11406968', '23308562', '20307239', '22610291', '33874143', '14163319', '68808', '35651100', '13068184', '11780869', '22391027', '16532513', '23729961', '16048509', '10771500', '29453740', '29066539', '26068758', '9884613', '17976538', '17321852', '20321004', '13492054', '23395098', '10197660', '22557272', '762743', '30909770', '31362625', '14742263', '9378347', '5043803', '3279999', '28963433', '864', '21798679', '27865418', '16176099', '20737476', '20493675', '15829776', '13502270', '31076720', '8684868', '47535', '134502', '19174917', '29476176', '6601779', '17372301', '20486826', '28441377', '26136747', '23681652', '12267806', '15827609', '16217859', '12352354', '26133133', '17162156', '25313807', '34005800', '11106711', '22820681', '25559177', '25871053', '34303350', '2430904', '2383357', '16121916', '3652228', '25467816', '29058155', '17670022', '17372039', '24235658', '22013246', '35277259', '29452280', '8696793', '26150280', '27245104', '15743069', '24992096', '11295686', '3544003', '17744501', '27457744', '12250584', '25867602', '15782527', '31928508', '6530628', '7592539', 
'6012312', '6404106', '862268', '25409006', '25068655', '22371139', '31247145', '6885125', '5983062', '25465121', '13512568', '18491244', '31930223', '6645647', '13585687', '13603056', '1907905', '28813854', '31416507', '8608219', '8442457', '34479625', '20876176', '1118028', '15982675', '20661290', '16153415', '27865302', '24865428', '18747814', '16120785', '13564669', '17383543', '20930777', '15706076', '15704169', '1163304', '16221043', '13189115', '20527042', '18413677', '34168383', '25922545', '13572785', '17740067', '21480931', '202588', '28280014', '18158779', '15783857', '6309556', '30253546', '4468230', '4079980', '29976154', '6945680', '6277870', '20776304', '17835248', '13631961', '22098536', '16309769', '14762245', '1743455', '20959422', '13366606', '28422098', '18960669', '22443451', '4582108', '6662198', '34364064', '21443484', '858352', '22396149', '23509049', '13572331', '23055327', '17926316', '9286283', '3243910', '24712933', '6454654', '29348347', '6753193', '17280301', '26716806', '23362138', '16023652', '2655', '1885731', '6214264', '28486119', '2728868', '11366370', '28251513', '10151606', '22025049', '9797134', '20487641', '11382943', '22886190', '12773641', '18038996', '7641032', '8175796', '16038894', '24585399', '25454820', '7949442', '25418711', '15704174', '18310712', '18872581', '30992603', '28226184', '10382704', '10489305', '23129688', '23510841', '3001581', '22736932', '22141666', '15923121', '12158002', '18402178', '17159918', '17855196', '5970445', '14796360', '23343051', '6788879', '7745385', '26014657', '16207383', '18336048', '24213573', '16084645', '10616322', '11291921', '26133825', '13279499', '2711522', '16691284', '27758324', '832130', '32501454', '24844880', '17694651', '2437444', '30814938', '9523411', '1812673', '17312364', '20936040', '15738816', '22529254', '6941985', '20301447', '13067277', '11472275', '27833446', '621714', '23055326', '19024389', '6308402', '412765', '6365868', '21606929', '18810809', '28002909', 
'17622641', '30733881', '16059263', '24936823', '15782628', '7303447', '11366921', '25458088', '21526588', '30819387', '21523103', '15834381', '17253961', '19191388', '16136701', '10860047', '15795357', '13450992', '7818697', '23488177', '13266989', '32327894', '23290223', '13637131', '2311681', '23481110', '32452232', '17374967', '23377235', '25149009', '17268736', '3234963', '18075570', '9025186', '35841506', '25739181', '11295839', '6277597', '32571475', '6795650', '21803253', '29557229', '17698624', '6540523', '18303228', '18330586', '32187706', '17701696', '31574524', '295895', '13644251', '30190890', '27432466', '32601138', '18710455', '22098740', '22443454', '13640224', '17792118', '7107655', '27192797', '22027303', '13572091', '18240444', '18808404', '7538739', '1039461', '13509182', '14287920', '16410928', '17458508', '17207560', '6319350', '15765784', '16080569', '4544759', '30194474', '18592700', '10407121', '11542687', '25925012', '25115981', '11306050', '33957050', '28239572', '22880450', '29240385', '2798098', '3708951', '28550078', '25719729', '11387490', '15738309', '23003119', '30073478', '25552609', '16368772', '25750833', '27240909', '1200110', '22901602', '33970415', '25907194', '30736096', '16031628', '13646586', '20910326', '13577338', '25817368', '31616479', '19005845', '32493379', '28396808', '493092', '6848433', '16163548', '9996853', '13417027', '23298852', '24733515', '17368146', '11521684', '988811', '26097476', '13364111', '23396344', '32310213', '23012014', '15779955', '19501104', '18407011', '23264599', '10327361', '25979514', '25380799', '15710432', '7599408', '12494832', '18277434', '23398687', '171796', '1430336', '30537071', '22667278', '13613910', '24844873', '6338979', '10307295', '16123468', '20822585', '27951429', '12061729', '18163440', '15839201', '26847251', '18364652', '32574083', '16240747', '29056090', '26104159', '18374073', '7619312', '16108278', '20691208', '13563204', '15820139', '296829', '18136444', '13233432', 
'27073501', '27222447', '3292087', '31361778', '35066575', '1359361', '32348285', '17233649', '10180184', '15853000', '22011300', '23245330', '16074780', '21458175', '26256536', '12606483', '15834248', '8423493', '13558130', '18599593', '25648152', '7882466', '31142477', '10400074', '17928697', '22397401', '19866705', '3283211', '25238369', '6456478', '49813', '9156663', '142293', '18331088', '25470866', '19248302', '20446243', '492000', '11870085', '13502941', '1167711', '14769440', '19572096', '10684183', '5514914', '35525846', '25817366', '28415052', '18590221', '16624883', '13608379', '18373805', '11072325', '1167758', '15784593', '13369984', '28869162', '20821105', '70750', '22026914', '23251163', '19465062', '25859959', '11340157', '25907431', '35427560', '26204799', '24598365', '25573037', '18332192', '8256471', '12985366', '7775689', '16122644', '17903642', '24338230', '16094352', '17156273', '6467118', '23597316', '1324062', '20746310', '28525303', '32729810', '34572308', '12943330', '16141742', '31453333', '32912399', '17200915', '26372386', '1033918', '22012199', '17858541', '6453849', '13147906', '2166416', '21845565', '18079910', '9947735', '47522', '3228468', '1508371', '24169712', '27863078', '25195069', '25707413', '4031861', '20555341', '7862090', '25645289', '21840310', '24763181', '28183257', '23506567', '7265901', '20534719', '23666179', '21280885', '17857648', '3', '17735871', '12997113', '1248077', '12973964', '18331796', '26309684', '31352119', '25929273', '22914373', '22841994', '12187799', '22558796', '11085074', '26143397', '21941928', '25748701', '18870649', '25573742', '24615234', '18765860', '10255274', '13455462', '30627196', '28479611', '7428537', '2264102', '28052767', '19035585', '16068905', '24245069', '7683425', '14760741', '22014037', '1611537', '26166591', '27775414', '22012779', '18461597', '22069364', '11304270', '21877015', '29067264', '22232135', '18240575', '7902652', '1826649', '33005686', '11249180', '4786064', 
'12510082', '25021837', '6371614', '23453787', '12941351', '12813562', '13452130', '22048373', '30968446', '16172579', '23290384', '9577857', '13450839', '18082943', '28147904', '7723926', '17378967', '19496550', '25402286', '18665794', '30258297', '9402421', '2681117', '11389422', '8113493', '20614474', '20861464', '26072029', '25655480', '20319534', '14762489', '20945198', '1907452', '975442', '27042066', '2661', '17252078', '23392759', '20775702', '35711177', '15995451', '13043622', '33249543', '13490003', '36172798', '615203', '22022075', '6658717', '21825598', '16050285', '13541093', '17617613', '17118898', '12139510', '16026746', '26632196', '16282204', '7541858', '651603', '6313338', '13099951', '26154394', '26055229', '30169721', '33133924', '23349454', '2193922', '1035885', '21008563', '9885717', '11477648', '3162139', '17347634', '33597793', '25445626', '17226872', '28959274', '12915074', '24965453', '28485265', '28132722', '25059351', '13074276', '9662788', '6275962', '17618692', '25537895', '15749148', '13584041', '7769080', '9745301', '24129434', '23261348', '51762', '25536116', '19245512', '22591099', '15327848', '21394255', '3162133', '18761327', '29972929', '17233160', '18045177', '24416368', '17372830', '295897', '18685593', '16960820', '24987978', '13492029', '33538809', '17470676', '25248885', '30698397', '848656', '10370664', '1629406', '10323524', '26124554', '736301', '29336777', '31501375', '86940', '22369653', '25742968', '8242247', '27774810', '16041590', '20698577', '1168126', '25153272', '22009164', '9576890', '35628404', '9782708', '25799090', '1482132', '13641944', '30625052', '3475269', '13190720', '12029011', '13020641', '16636891', '12328744', '6999776', '35604481', '13484101', '28863513', '18705050', '28371839', '16118334', '23633718', '27272362', '22614198', '24845481', '33832947', '15750489', '16071770', '13064926', '16003299', '20565586', '25519319', '18513682', '30312363', '5478', '20664548', '27556033', '13417580', '24826119', 
'35160402', '18428110', '25407919', '25560245', '7701058', '1577365', '13648024', '34318133', '9700012', '7635541', '22053825', '28946401', '6072428', '14854809', '13554008', '30757322', '18918647', '13631515', '34225260', '27561654', '13566014', '296825', '19241699', '3356', '18247775', '14059087', '6371188', '30139651', '10297811', '3049950', '15704161', '22464636', '25520880', '22858373', '2308980', '17973762', '6214262', '26207857', '13558932', '24761879', '12649718', '23383694', '23110080', '20873735', '4593339', '19747310', '18208828', '8519739', '22447268', '26175968', '19478858', '25076674', '23278585', '33401915', '22839547', '988812', '13037285', '20650332', '29192980', '36226979', '36186865', '15798097', '23390821', '71001', '22022744', '7636830', '12040504', '15702765', '30261139', '30365843', '10763617', '3347826', '13582806', '17791941', '11531822', '10857048', '31342863', '2767052', '23965011', '19088744', '17906724', '25885299', '3228285', '18373282', '17340867', '13599993', '15507958', '22616546', '23627224', '24955138', '13384167', '22381143', '45503', '26869474', '29434730', '15757746', '32150125', '19307977', '23392349', '22443992', '18713356', '10623852', '35957342', '35963122', '13410173', '13722513', '18697818', '30177570', '28462125', '1238900', '7415669', '22845230', '24548235', '13484097', '26228167', '26542880', '7636597', '31437636', '15717791', '32113294', '3529641', '8874564', '30625069', '1587371', '21798645', '32881040', '24517276', '33561480', '24845394', '17238204', '13543691', '8409352', '10955232', '25293708', '28449886', '17917566', '13319603', '3882099', '33397085', '27262601', '20446212', '21464366', '18159220', '35185807', '17697576', '11763051', '13556437', '24068972', '19532570', '30892022', '9501098', '22915233', '17230106', '17190305', '1990311', '32876436', '12649671', '20896811', '30304889', '10029930', '21496310', '49771', '21956204', '27209601', '6908529', '25695197', '26829551', '27865646', '13563020', '34625531', 
'13512949', '20326033', '18039636', '26819816', '1570814', '32307783', '25060416', '32053652', '13601883', '13536858', '6604736', '23217267', '10293381', '29348016', '28457773', '17857128', '13578028', '19232069', '26858928', '21842632', '2830061', '9902321', '23607381', '29560766', '5945479', '13413859', '9886587', '25849641', '30750943', '31342728', '26869421', '23291340', '30268317', '28220755', '10794976', '30329629', '24307863', '9480077', '13408579', '30243012', '26018476', '25124132', '29750146', '25568667', '22915525', '12885533', '19523475', '18878756', '18810188', '12970552', '3597767', '18729913', '25432891', '13574246', '17727547', '22756742', '3312730', '1689096', '13480671', '8930594', '17083703', '11736454', '32070902', '30050074', '8369717', '19376702', '13609263', '1508373', '23362134', '25815945', '23207076', '16046182', '20765998', '18914023', '3487145', '27859636', '25184100', '10790543', '32079434', '24845176', '6599441', '45497', '27224217', '22753803', '22565407', '29352237', '3399740', '16410758', '17673265', '18277369', '16281293', '23164819', '34878457', '13562891', '16052844', '25471933', '15840628', '25170697', '13537626', '36235191', '4666058', '17998865', '25034701', '17383918', '7769068', '24500712', '32862166', '23596051', '17240354', '24911900', '33249079', '29976332', '31693618', '25688111', '1772798', '8679437', '13032018', '11235712', '11735983', '11450073', '25202033', '32714405', '14856786', '2660', '19184431', '26267528', '15743073', '15869901', '87823', '13444844', '5971209', '18522512', '1050869', '18590594', '7338921', '23304136', '24143070', '31177549', '17907734', '2888212', '17289350', '18041074', '13563913', '21416527', '6219406', '15738367', '26190650', '25792193', '25142590', '27872610', '25617779', '7328985', '13598952', '19548976', '22882407', '17798613', '35666553', '33985876', '32716769', '26869457', '23399768', '10201650', '25069196', '2195228', '19542808', '15947905', '26219614', '12390063', '25555156', 
'16049104', '7260188', '7315452', '23112751', '22754100', '18398230', '20868113', '7693362', '18950883', '18139937', '25587116', '497318', '15717897', '2262746', '1611839', '23121026', '6342509', '23388229', '18952334', '2195230', '22758737', '18177642', '8659282', '20574633', '22716410', '6626071', '21143500', '13569253', '21480930', '10924618', '19345703', '17847583', '32613491', '23356642', '6663401', '30124071', '25557821', '21375409', '25080929', '24834780', '5473', '13394748', '20587921', '11337018', '17568374', '15742581', '18659505', '18397048', '15848517', '24544370', '18218748', '6129629', '13135514', '1141402', '17410339', '17450549', '17451711', '24510659', '6965175', '32327289', '27344676', '11290745', '22006767', '15743062', '26200934', '36364920', '10794937', '18680821', '6355432', '25989361', '8502965', '13319650', '19404278', '27810119', '13580753', '26788341', '31574430', '17351872', '2253168', '25392657', '29800365', '13509120', '7740998', '28447015', '32804803', '6748892', '24548709', '2297303', '25150558', '18340095', '18493426', '31864314', '10002196', '24420539', '27037673', '31114367', '22713014', '21841495', '11143580', '10754520', '32598849', '20929026', '9269688', '23362140', '23588396', '26099082', '7350825', '25665316', '23664702', '13617471', '17984552', '8564532', '15860671', '15768174', '6496061', '19509353', '32196106', '17280541', '29093221', '18040633', '12819583', '16116728', '31285614', '11417254', '3310083', '9325431', '23347055', '18488601', '26939805', '18218169', '23116317', '3188872', '12016313', '591213', '17258791', '14288439', '13511506', '18330911', '6388432', '13087601', '17458734', '16051061', '17611372', '31278743', '18044174', '8045417', '17341321', '13197222', '32309194', '19243721', '13490034', '18746499', '29372056', '16077194', '15821439', '35437810', '15714522', '25636642', '23809341', '16075864', '6130950', '8518234', '27262439', '2264130', '13563330', '25779553', '11985087', '18553527', '4266322', '22589511', 
'12305738', '22088011', '12396771', '13753712', '19097400', '24911818', '30140875', '49818', '15755549', '13420264', '28525374', '8409368', '8306857', '13414282', '23640217', '6980817', '29639950', '35391134', '7231642', '20806584', '25063777', '11058389', '11732640', '18271170', '18523261', '25923000', '15724164', '17666012', '13321335', '30182118', '7710333', '25912961', '11982320', '15701662', '26796330', '22241296', '25969640', '17884755', '13558189', '25698317', '13508601', '13625156', '17561098', '10327392', '17258713', '28109445', '28001785', '9662715', '18819489', '18410374', '25161218', '17951497', '30733913', '18962294', '35594963', '12038650', '101426', '25440045', '8828360', '2173683', '24501910', '17930465', '16005666', '17337509', '7419435', '24104539', '30770810', '865', '862272', '14785910', '17561086', '2738612', '34823580', '29213388', '14289689', '29479422', '29056083', '1781723', '22911409', '25630971', '20561797', '17204095', '23715973', '33878306', '20829029', '18597369', '9979062', '33843449', '32829500', '15711451', '3175211', '25506775', '30278752', '15727775', '25127455', '17730679', '13604494', '27219097', '14802335', '17608371', '25758236', '24204146', '18337477', '28789556', '22443481', '8333557', '23295309', '30972270', '1739991', '22677634', '30743499', '13512140', '27065545', '17457185', '29752927', '23019251', '21399905', '1546633', '35121839', '18966551', '22666173', '147476', '20555443', '18668534', '19258660', '18463614', '17118893', '32709452', '24617231', '6507425', '16275651', '13553854', '24941243', '12168042', '6418599', '333674', '19203179', '18052325', '32196073', '17698919', '6148028', '28550610', '13330943', '19248807', '11549187', '21797160', '3360617', '13376097', '18224374', '23201213', '22031664', '16154910', '14796823', '26076903', '31831496', '31200595', '21028179', '12955096', '11436054', '9637697', '20614318', '17456181', '7442412', '1137327', '14762491', '20742224', '9460487', '16039308', '16060617', '8391758', 
'20761204', '20821174', '17384001', '6475535', '1328498', '18071359', '22438404', '13456264', '13082496', '17666929', '15182675', '17159916', '13578034', '25066117', '7092083', '22550979', '22944242', '13687076', '1281235', '49773', '18005662', '30257299', '12359421', '22180850', '22044746', '32493793', '13588543', '22890011', '29238847', '20758225', '18007564', '25318594', '17853242', '25353247', '19660884', '25500233', '23507', '608867', '18871776', '18712499', '13649327', '70754', '17371845', '17564499', '21414828', '27557999', '22466620', '9809096', '14762463', '8479009', '27515628', '17563814', '8515383', '18798818', '24590557', '37449', '29967520', '20603758', '373676', '21887189', '10282869', '13344493', '22492901', '34399341', '18928661', '18634486', '6472227', '25215271', '49810', '28185384', '28415009', '16069305', '22844292', '12338163', '23363818', '21976014', '18039423', '13616052', '19043489', '2156863', '13425450', '15768774', '16177870', '7445780', '17907491', '32564969', '31843042', '25852860', '8284932', '11291520', '23382512', '4163418', '12187797', '34434581', '17255591', '9365923', '24892145', '25431622', '12827311', '24401111', '34052111', '26150650', '26819767', '16067349', '15829468', '185900', '31538363', '17322949', '11143600', '32500082', '13408075', '34858734', '36381037', '13068756', '10327421', '16039920', '35079565', '1498842', '16058418', '18481276', '12250585', '4266341', '27876118', '29069989', '33526658', '34230397', '9489489', '19064', '587862', '840211', '26256646', '983552', '16018772', '13645903', '35525168', '8101281', '17986986', '13589235', '17290810', '31561733', '27190381', '16156593', '33845852', '8846414', '6039746', '15983784', '13562570', '72193', '9171146', '32621564', '862267', '30968175', '1039463', '872', '9361589', '11344832', '9745341', '15751678', '23559858', '2711523', '2657', '295899', '34615137', '33312853', '35553817', '34999214', '1271159', '18386077', '4616627', '20543358', '13356854', '28680885', 
'18360625', '17407461', '2654', '18039923', '13410779', '12025202', '25258300', '17853376', '12269780', '23276112', '19101755', '19539657', '32570242', '8142865', '13574349', '13575634', '29587499', '12024', '13063912', '12382340', '9365986', '36110908', '21956078', '2264108', '18190123', '9629811', '13612277', '7014517', '6015185', '27858482', '31298704', '6573014', '22088945', '8722062', '16292420', '18086403', '7077211', '20307209', '6633956', '24111211', '29242483', '22703858', '24822879', '25309654', '11149647', '492286', '18770574', '18746856', '13373663', '17795506', '19546752', '6574803', '17880024', '3153393', '22019906', '10526094', '16157474', '6979801', '22024583', '25701321', '21527585', '21939020', '20934720', '7950150', '26092775', '6040381', '18710449', '18597421', '35233095', '3847055', '17343716', '18665249', '18684947', '17566731', '17161949', '19029961', '10282880', '35390821', '25794765', '26164740', '30838324', '18267158', '23396682', '29079827', '15743071', '13092575', '21525121', '17968365', '18362734', '31560002', '8013752', '26853362', '20307223', '13515350', '13452143', '15791627', '12539591', '7931168', '6626070', '10858973', '15704149', '24359582', '9997510', '13554812', '16130437', '12835106', '15784152', '25987480', '30853073', '23275109', '29914433', '526272', '15843793', '26170148', '14802374', '437151', '2198721', '17934906', '45495', '9212093', '13234679', '4478365', '10244455', '24718615', '858351', '13557703', '13026021', '34128781', '25755787', '27259778', '25120177', '1238901', '608858', '18040699', '9584810', '22873046', '49799', '12328749', '13603351', '28757261', '15717947', '23354313', '8120173', '1310182', '19100', '25370921', '18983451', '22038278', '23018804', '888945', '16127939', '5801016', '2397465', '25964074', '29543528', '6902967', '18998388', '29425003', '16280929', '16157187', '27223386', '2206723', '24317333', '936221', '23595964', '22919473', '2261214', '13335037', '7454063', '14569776', '13397021', 
'29082328', '24964815', '17230504', '29605141', '18401393', '17833818', '893135', '18517873', '25154774', '35608444', '33845528', '20806556', '28494414', '296820', '17557750', '30517639', '17308606', '26803527', '22370890', '9309796', '25477718', '24490481', '35378698', '8039099', '17152275', '14574973', '17845840', '12104761', '7095445', '34823579', '34510127', '21726837', '11738407', '3407971', '14739280', '6405758', '18190276', '23607404', '13509127', '2983323', '17857225', '15736550', '17540725', '15818092', '11112712', '17187024', '28176874', '13530788', '20309175', '22398290', '7645685', '25214284', '30982036', '25147847', '24612163', '8720518', '35383657', '29757622', '26856722', '16412296', '12027088', '17794847', '2287904', '23502642', '18219535', '22614515', '21934594', '9859820', '16125241', '8763821', '20447452', '31187505', '14290343', '32054384', '6906068', '23392562', '7364254', '2261213', '29496243', '27838766', '25310065', '9969572', '17859161', '27224408', '16692909', '17134626', '18815174', '13643133', '8409357', '9452482', '23551264', '30532762', '23315870', '29880709', '16084682', '27263459', '25231392', '20910195', '8574414', '24705880', '20945300', '22828783', '25804350', '18010943', '22263899', '20612067', '27396571', '13557136', '30078661', '22622278', '13582001', '12547520', '27168405', '15715789', '24781288', '27869625', '22455901', '20222637', '31417231', '742576', '23307757', '29747283', '23350847', '840208', '22602377', '18040125', '6984466', '24171268', '18206985', '24674676', '25758406', '17861287', '13164565', '18163629', '33622481', '14622803', '3102821', '13598806', '12061833', '36219831', '6071573', '22824462', '34318', '23527094', '29762289', '6577666', '24963328', '16283961', '26851201', '25270661', '28602838', '45499', '13613827', '2195227', '15747347', '22857439', '31030335', '871224', '18809355', '27278972', '31818515', '12004870', '12885649', '2397472', '21542483', '30295420', '8549012', '19063', '12509921', '25879546', 
'32572144', '10753689', '12979843', '21863194', '22071142', '20520487', '29481736', '20784229', '32795719', '23811616', '23704118', '13412974', '31682902', '3174348', '9397983', '6010848', '219069', '18539129', '13416045', '30632127', '18486818', '17416947', '35263629', '13690325', '18743497', '25739498', '29430226', '35163194', '14478151', '18132708', '22917699', '28363557', '34389262', '18620726', '165220', '2711524', '21904149', '13451546', '29636938', '22723275', '28946493', '26055123', '24862868', '18596991', '1808773', '25908605', '9692831', '31836443', '1261442', '32596240', '21473426', '29346191', '13617429', '2638810', '15745753', '171797', '24911870', '24559146', '15766140', '296827', '1167760', '12292612', '29070291', '15992524', '33128765', '5670194', '16058834', '25683005', '6053292', '309174', '20626209', '29743258', '27272264', '26522288', '17873797', '22009240', '18488649', '1649550', '26127556', '13164350', '15994655', '12384483', '29232480', '24845469', '11925410', '821787', '2699067', '17616021', '24298721', '26046350', '22019178', '27231015', '25728861', '31283253', '455033', '6089760', '9638604', '2200815', '28928335', '26802550', '9490473', '943315', '7941396', '10471154', '9717320', '18049021', '18901352', '22595857', '12207476', '25661280', '16637370', '23109031', '30964258', '13107968', '8313731', '13089144', '16135349', '31321368', '31314242', '23009744', '8663600', '20445993', '19631541', '31938451', '6772388', '2206724', '13648975', '32294046', '18624585', '16081583', '13219316', '2549275', '7040332', '19078806', '27796606', '24878742', '30525923', '9088786', '14793577', '17970762', '71000', '17917593', '18164704', '6432120', '13507226', '27246710', '28266126', '6588588', '24317970', '15836164', '13111334', '35609920', '3484606', '25016042', '17281930', '31823738', '13548787', '14760501', '13543908', '30525922', '28109798', '17466044', '31299247', '836042', '18044116', '19505295', '20958096', '16165124', '1904709', '32326262', '5514922', 
'12279616', '24856461', '17906268', '22880245', '31856423', '8128934', '20809922', '23547767', '26068759', '22586854', '26829136', '28525496', '7838236', '22091506', '23691247', '3242678', '34236100', '15743076', '25233603', '13796816', '15848087', '18273877', '18333638', '30365803', '33630768', '8332744', '25500715', '27885543', '26308557', '6080900', '32914485', '10385172', '816815', '13425880', '24801569', '28352417', '28003656', '30851062', '35957356', '9307599', '18246707', '12837725', '15987042', '28117517', '5999963', '7636293', '9456961', '13639293', '15719757', '22010842', '7546830', '24995028', '26887962', '18738782', '15753836', '26087530', '25946362', '15880', '12061809', '25418547', '29969536', '18498332', '26088065', '18866539', '69438', '5953576', '23513830', '23524746', '4687635', '22824478', '21424761', '25587975', '31681857', '10415854', '587400', '18104647', '25088196', '11990331', '13141608', '13497933', '518844', '16205681', '18410187', '11192675', '20498963', '18493699', '13560379', '31122073', '29356081', '6257502', '7176054', '415675', '10257528', '17466622', '34996826', '10611698', '13562630', '32672711', '18626858', '17201174'}\n", "number of works with over 10000 reviews: 29\n", "number of reviews for works with over 10000 reviews: 412905\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "100000 reviews parsed 2013 written\n", "200000 reviews parsed 4103 written\n", "300000 reviews parsed 6351 written\n", "400000 reviews parsed 8555 written\n", "500000 reviews parsed 10701 written\n", "600000 reviews parsed 12692 written\n", "700000 reviews parsed 14628 written\n", "800000 reviews parsed 16675 written\n", "900000 reviews parsed 18746 written\n", "1000000 reviews parsed 20870 written\n", "1100000 reviews parsed 22833 written\n", "1200000 reviews parsed 24898 written\n", "1300000 reviews parsed 26681 written\n", "1400000 reviews parsed 28831 written\n", "1500000 reviews parsed 30908 written\n", "1600000 reviews parsed 32997 
written\n", "1700000 reviews parsed 35007 written\n", "1800000 reviews parsed 36898 written\n", "1900000 reviews parsed 38950 written\n", "2000000 reviews parsed 40817 written\n", "2100000 reviews parsed 42744 written\n", "2200000 reviews parsed 44942 written\n", "2300000 reviews parsed 47112 written\n", "2400000 reviews parsed 49475 written\n", "2500000 reviews parsed 51626 written\n", "2600000 reviews parsed 53931 written\n", "2700000 reviews parsed 55797 written\n", "2800000 reviews parsed 57756 written\n", "2900000 reviews parsed 59843 written\n", "3000000 reviews parsed 62320 written\n", "3100000 reviews parsed 64461 written\n", "3200000 reviews parsed 66572 written\n", "3300000 reviews parsed 68727 written\n", "3400000 reviews parsed 70821 written\n", "3500000 reviews parsed 72954 written\n", "3600000 reviews parsed 75265 written\n", "3700000 reviews parsed 77276 written\n", "3800000 reviews parsed 79434 written\n", "3900000 reviews parsed 81383 written\n", "4000000 reviews parsed 83573 written\n", "4100000 reviews parsed 85724 written\n", "4200000 reviews parsed 87724 written\n", "4300000 reviews parsed 89946 written\n", "4400000 reviews parsed 92127 written\n", "4500000 reviews parsed 94314 written\n", "4600000 reviews parsed 96313 written\n", "4700000 reviews parsed 98484 written\n", "4800000 reviews parsed 100492 written\n", "4900000 reviews parsed 102564 written\n", "5000000 reviews parsed 104652 written\n", "5100000 reviews parsed 106968 written\n", "5200000 reviews parsed 109106 written\n", "5300000 reviews parsed 111296 written\n", "5400000 reviews parsed 113354 written\n", "5500000 reviews parsed 115409 written\n", "5600000 reviews parsed 117487 written\n", "5700000 reviews parsed 119753 written\n", "5800000 reviews parsed 122027 written\n", "5900000 reviews parsed 124172 written\n", "6000000 reviews parsed 126501 written\n", "6100000 reviews parsed 128724 written\n", "6200000 reviews parsed 130821 written\n", "6300000 reviews parsed 133040 
written\n", "6400000 reviews parsed 135462 written\n", "6500000 reviews parsed 137646 written\n", "6600000 reviews parsed 139999 written\n", "6700000 reviews parsed 142474 written\n", "6800000 reviews parsed 144780 written\n", "6900000 reviews parsed 147061 written\n", "7000000 reviews parsed 149093 written\n", "7100000 reviews parsed 151527 written\n", "7200000 reviews parsed 153774 written\n", "7300000 reviews parsed 156279 written\n", "7400000 reviews parsed 158617 written\n", "7500000 reviews parsed 161079 written\n", "7600000 reviews parsed 163102 written\n", "7700000 reviews parsed 165534 written\n", "7800000 reviews parsed 167715 written\n", "7900000 reviews parsed 169851 written\n", "8000000 reviews parsed 171990 written\n", "8100000 reviews parsed 174345 written\n", "8200000 reviews parsed 176447 written\n", "8300000 reviews parsed 178493 written\n", "8400000 reviews parsed 180749 written\n", "8500000 reviews parsed 182914 written\n", "8600000 reviews parsed 185476 written\n", "8700000 reviews parsed 187747 written\n", "8800000 reviews parsed 190244 written\n", "8900000 reviews parsed 192425 written\n", "9000000 reviews parsed 194689 written\n", "9100000 reviews parsed 196898 written\n", "9200000 reviews parsed 199171 written\n", "9300000 reviews parsed 201383 written\n", "9400000 reviews parsed 203584 written\n", "9500000 reviews parsed 205613 written\n", "9600000 reviews parsed 208139 written\n", "9700000 reviews parsed 210200 written\n", "9800000 reviews parsed 212082 written\n", "9900000 reviews parsed 214407 written\n", "10000000 reviews parsed 216429 written\n", "10100000 reviews parsed 218879 written\n", "10200000 reviews parsed 221157 written\n", "10300000 reviews parsed 223389 written\n", "10400000 reviews parsed 225595 written\n", "10500000 reviews parsed 227881 written\n", "10600000 reviews parsed 230155 written\n", "10700000 reviews parsed 232567 written\n", "10800000 reviews parsed 234743 written\n", "10900000 reviews parsed 237046 written\n", 
"11000000 reviews parsed 239503 written\n", "11100000 reviews parsed 241727 written\n", "11200000 reviews parsed 244201 written\n", "11300000 reviews parsed 246430 written\n", "11400000 reviews parsed 248960 written\n", "11500000 reviews parsed 251349 written\n", "11600000 reviews parsed 253614 written\n", "11700000 reviews parsed 255938 written\n", "11800000 reviews parsed 258210 written\n", "11900000 reviews parsed 260196 written\n", "12000000 reviews parsed 262362 written\n", "12100000 reviews parsed 264689 written\n", "12200000 reviews parsed 267016 written\n", "12300000 reviews parsed 269351 written\n", "12400000 reviews parsed 271443 written\n", "12500000 reviews parsed 273787 written\n", "12600000 reviews parsed 276068 written\n", "12700000 reviews parsed 278134 written\n", "12800000 reviews parsed 280214 written\n", "12900000 reviews parsed 282462 written\n", "13000000 reviews parsed 284555 written\n", "13100000 reviews parsed 286804 written\n", "13200000 reviews parsed 289258 written\n", "13300000 reviews parsed 291719 written\n", "13400000 reviews parsed 294020 written\n", "13500000 reviews parsed 296323 written\n", "13600000 reviews parsed 298409 written\n", "13700000 reviews parsed 300271 written\n", "13800000 reviews parsed 302678 written\n", "13900000 reviews parsed 304861 written\n", "14000000 reviews parsed 307088 written\n", "14100000 reviews parsed 309131 written\n", "14200000 reviews parsed 311246 written\n", "14300000 reviews parsed 313573 written\n", "14400000 reviews parsed 315823 written\n", "14500000 reviews parsed 318158 written\n", "14600000 reviews parsed 320361 written\n", "14700000 reviews parsed 322520 written\n", "14800000 reviews parsed 324702 written\n", "14900000 reviews parsed 326827 written\n", "15000000 reviews parsed 330967 written\n", "15100000 reviews parsed 335467 written\n", "15200000 reviews parsed 340097 written\n", "15300000 reviews parsed 345419 written\n", "15400000 reviews parsed 349807 written\n", "15500000 reviews 
parsed 354202 written\n", "15600000 reviews parsed 358680 written\n", "15700000 reviews parsed 362824 written\n", "15739967 reviews parsed 364653 written\n" ] } ], "source": [ "from scripts.helper import read_csv\n", "\n", "counts = review_df.work_id.value_counts()\n", "\n", "threshold = 10000\n", "# Series.items() is used because Series.iteritems() was removed in pandas 2.0\n", "works_above_10k = [int(work_id) for work_id, count in counts[counts > threshold].items()]\n", "print(works_above_10k)\n", "\n", "# Build the book_id -> work_id mapping from aligned rows of the same selection;\n", "# zipping a set of book ids with a list of work ids would pair them arbitrarily.\n", "selected_books = book_df[book_df.work_id.isin(works_above_10k)]\n", "work_ids = list(selected_books.work_id)\n", "mapping = {str(book_id): work_id for book_id, work_id in zip(selected_books.book_id, work_ids)}\n", "book_ids = set(mapping)\n", "\n", "print(f'number of works with over {threshold} reviews:', len(counts[counts > threshold]))\n", "print(f'number of reviews for works with over {threshold} reviews:', sum(counts[counts > threshold]))\n", "\n", "data_dir = '/Volumes/Samsung_T5/Data/Book-Reviews/GoodReads/'\n", "sample_review_text_file = os.path.join(data_dir, 'goodreads_reviews-works_above_10k_reviews.csv.gz') # includes text\n", "\n", "# Output columns ('review_lang' is listed once; it was duplicated before)\n", "headers = [\n", "    'user_id', 'book_id', 'work_id', 'review_id', 'rating', 'date_added', 'date_updated', 'read_at', 'started_at',\n", "    'n_votes', 'n_comments', 'review_length', 'review_lang', 'review_text'\n", "]\n", "\n", "with gzip.open(sample_review_text_file, 'wt') as fh:\n", "    writer = csv.writer(fh, delimiter='\\t')\n", "    writer.writerow(headers)\n", "    written = 0\n", "    for ri, review in enumerate(read_csv(review_text_file)):\n", "        if (ri+1) % 1000000 == 0:\n", "            print(ri+1, 'reviews parsed', written, 'written')\n", "        # keep only reviews of books that belong to the selected works\n", "        if review['book_id'] not in book_ids:\n", "            continue\n", "        review['review_lang'] = lang_detect(review)\n", "        review['work_id'] = mapping[review['book_id']]\n", "        written += 1\n", "        row = [review[header] for header in headers]\n", "        writer.writerow(row)\n", "print(ri+1, 'reviews parsed', written, 'written')\n" ] }, { "cell_type": "code", 
"execution_count": 77, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 0 0\n" ] }, { "data": { "text/plain": [ "Index(['user_id', 'book_id', 'review_id', 'rating', 'date_added',\n", " 'date_updated', 'read_at', 'started_at', 'n_votes', 'n_comments',\n", " 'review_length', 'work_id'],\n", " dtype='object')" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counts[counts > threshold]\n", "\n", "print(len(book_ids), len(work_ids), len(mapping.keys()))\n", "review_df.columns" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['a2d6dd1685e5aa0a72c9410f8f55e056', '459a6c4decf925aedd08e45045c0d8c6', '4922591667fd3e8adc0c5e3d42cf557a', 'dd9785b14664103617304996541ed77a', '843a44e2499ba9362b47a089b0b0ce75', '9003d274774f4c47e62f77600b08ac1d', 'b7772313835ce6257a3fbe7ad2649a29', '8bb031b637de69eba020a8a466d1110b', '8e7e5b546a63cb9add8431ee6914cf59', '6ac35fe952c608da50153d64f616291b', '795595616d3dbd81bd16b617c9a1fa48', 'a45fb5d39a6a9857ff8362900790510a', '60982541be85a0611e9634b4f63d0cb0', '97e2ce2141fa1c880967d78aec3c14fa', '422e76592e2717d5d59465d22d74d47c', '59151b639f247aa97fffd5c71701db29', 'e5905d648022af7b1309d82a1f4d255b', '37b3e60b4e4152c580fd798d405150ff', 'd8c39b3b11bb2da1c1d5c39f49669dea']\n", "148355\n", "148355\n", "number of users with over 5000 reviews: 19\n", "number of reviews for users with over 5000 reviews: 148355\n", "100000 reviews parsed 3848 written\n", "200000 reviews parsed 3848 written\n", "300000 reviews parsed 3848 written\n", "400000 reviews parsed 9192 written\n", "500000 reviews parsed 9192 written\n", "600000 reviews parsed 9192 written\n", "700000 reviews parsed 13847 written\n", "800000 reviews parsed 13847 written\n", "900000 reviews parsed 13847 written\n", "1000000 reviews parsed 13847 written\n", "1100000 reviews parsed 21612 written\n", "1200000 reviews parsed 21612 
written\n", "1300000 reviews parsed 31667 written\n", "1400000 reviews parsed 31667 written\n", "1500000 reviews parsed 31667 written\n", "1600000 reviews parsed 31667 written\n", "1700000 reviews parsed 31667 written\n", "1800000 reviews parsed 36645 written\n", "1900000 reviews parsed 36645 written\n", "2000000 reviews parsed 36645 written\n", "2100000 reviews parsed 36645 written\n", "2200000 reviews parsed 36645 written\n", "2300000 reviews parsed 36645 written\n", "2400000 reviews parsed 36645 written\n", "2500000 reviews parsed 36645 written\n", "2600000 reviews parsed 36645 written\n", "2700000 reviews parsed 46162 written\n", "2800000 reviews parsed 46162 written\n", "2900000 reviews parsed 46162 written\n", "3000000 reviews parsed 46162 written\n", "3100000 reviews parsed 46162 written\n", "3200000 reviews parsed 46162 written\n", "3300000 reviews parsed 46162 written\n", "3400000 reviews parsed 46162 written\n", "3500000 reviews parsed 46162 written\n", "3600000 reviews parsed 46162 written\n", "3700000 reviews parsed 46162 written\n", "3800000 reviews parsed 46162 written\n", "3900000 reviews parsed 46162 written\n", "4000000 reviews parsed 46162 written\n", "4100000 reviews parsed 46162 written\n", "4200000 reviews parsed 46162 written\n", "4300000 reviews parsed 46162 written\n", "4400000 reviews parsed 46162 written\n", "4500000 reviews parsed 46162 written\n", "4600000 reviews parsed 46162 written\n", "4700000 reviews parsed 46162 written\n", "4800000 reviews parsed 53718 written\n", "4900000 reviews parsed 53718 written\n", "5000000 reviews parsed 53718 written\n", "5100000 reviews parsed 53718 written\n", "5200000 reviews parsed 53718 written\n", "5300000 reviews parsed 53718 written\n", "5400000 reviews parsed 53718 written\n", "5500000 reviews parsed 53718 written\n", "5600000 reviews parsed 53718 written\n", "5700000 reviews parsed 53718 written\n", "5800000 reviews parsed 53718 written\n", "5900000 reviews parsed 53718 written\n", "6000000 
reviews parsed 53718 written\n", "6100000 reviews parsed 53718 written\n", "6200000 reviews parsed 53718 written\n", "6300000 reviews parsed 59065 written\n", "6400000 reviews parsed 59065 written\n", "6500000 reviews parsed 59065 written\n", "6600000 reviews parsed 59065 written\n", "6700000 reviews parsed 59065 written\n", "6800000 reviews parsed 59065 written\n", "6900000 reviews parsed 61004 written\n", "7000000 reviews parsed 64257 written\n", "7100000 reviews parsed 64257 written\n", "7200000 reviews parsed 64257 written\n", "7300000 reviews parsed 64257 written\n", "7400000 reviews parsed 64257 written\n", "7500000 reviews parsed 64257 written\n", "7600000 reviews parsed 64257 written\n", "7700000 reviews parsed 64257 written\n", "7800000 reviews parsed 64257 written\n", "7900000 reviews parsed 64257 written\n", "8000000 reviews parsed 64257 written\n", "8100000 reviews parsed 64257 written\n", "8200000 reviews parsed 64257 written\n", "8300000 reviews parsed 69248 written\n", "8400000 reviews parsed 69248 written\n", "8500000 reviews parsed 69248 written\n", "8600000 reviews parsed 69248 written\n", "8700000 reviews parsed 69248 written\n", "8800000 reviews parsed 69248 written\n", "8900000 reviews parsed 69248 written\n", "9000000 reviews parsed 69248 written\n", "9100000 reviews parsed 69248 written\n", "9200000 reviews parsed 69248 written\n", "9300000 reviews parsed 69248 written\n", "9400000 reviews parsed 69248 written\n", "9500000 reviews parsed 74090 written\n", "9600000 reviews parsed 74090 written\n", "9700000 reviews parsed 79621 written\n", "9800000 reviews parsed 101432 written\n", "9900000 reviews parsed 101432 written\n", "10000000 reviews parsed 101432 written\n", "10100000 reviews parsed 101432 written\n", "10200000 reviews parsed 101432 written\n", "10300000 reviews parsed 106380 written\n", "10400000 reviews parsed 106380 written\n", "10500000 reviews parsed 106380 written\n", "10600000 reviews parsed 106380 written\n", "10700000 reviews 
parsed 106380 written\n", "10800000 reviews parsed 106380 written\n", "10900000 reviews parsed 106380 written\n", "11000000 reviews parsed 106380 written\n", "11100000 reviews parsed 106380 written\n", "11200000 reviews parsed 106380 written\n", "11300000 reviews parsed 106380 written\n", "11400000 reviews parsed 106380 written\n", "11500000 reviews parsed 111476 written\n", "11600000 reviews parsed 111476 written\n", "11700000 reviews parsed 111476 written\n", "11800000 reviews parsed 111476 written\n", "11900000 reviews parsed 111476 written\n", "12000000 reviews parsed 111476 written\n", "12100000 reviews parsed 111476 written\n", "12200000 reviews parsed 111476 written\n", "12300000 reviews parsed 119416 written\n", "12400000 reviews parsed 119416 written\n", "12500000 reviews parsed 119416 written\n", "12600000 reviews parsed 119416 written\n", "12700000 reviews parsed 119416 written\n", "12800000 reviews parsed 119416 written\n", "12900000 reviews parsed 119416 written\n", "13000000 reviews parsed 119416 written\n", "13100000 reviews parsed 119416 written\n", "13200000 reviews parsed 119416 written\n", "13300000 reviews parsed 119416 written\n", "13400000 reviews parsed 119416 written\n", "13500000 reviews parsed 119416 written\n", "13600000 reviews parsed 119416 written\n", "13700000 reviews parsed 126470 written\n", "13800000 reviews parsed 126470 written\n", "13900000 reviews parsed 126470 written\n", "14000000 reviews parsed 126470 written\n", "14100000 reviews parsed 126470 written\n", "14200000 reviews parsed 126470 written\n", "14300000 reviews parsed 126470 written\n", "14400000 reviews parsed 126470 written\n", "14500000 reviews parsed 126470 written\n", "14600000 reviews parsed 126470 written\n", "14700000 reviews parsed 126470 written\n", "14800000 reviews parsed 130938 written\n", "14900000 reviews parsed 130938 written\n", "15000000 reviews parsed 130938 written\n", "15100000 reviews parsed 130938 written\n", "15200000 reviews parsed 130938 
written\n", "15300000 reviews parsed 130938 written\n", "15400000 reviews parsed 130938 written\n", "15500000 reviews parsed 130938 written\n", "15600000 reviews parsed 130938 written\n", "15700000 reviews parsed 130938 written\n", "15739967 reviews parsed 130938 written\n" ] } ], "source": [ "counts = review_df.user_id.value_counts()\n", "\n", "threshold = 5000\n", "# Series.items() is used because Series.iteritems() was removed in pandas 2.0\n", "users_above_5k = [user_id for user_id, count in counts[counts > threshold].items()]\n", "book_ids = list(review_df[review_df.user_id.isin(users_above_5k)].book_id)\n", "work_ids = list(review_df[review_df.user_id.isin(users_above_5k)].work_id)\n", "print(users_above_5k)\n", "print(len(book_ids))\n", "print(len(work_ids))\n", "# both lists come from the same row selection, so zip pairs them correctly\n", "mapping = {str(book_id): work_id for book_id, work_id in zip(book_ids, work_ids)}\n", "print(f'number of users with over {threshold} reviews:', len(counts[counts > threshold]))\n", "print(f'number of reviews for users with over {threshold} reviews:', sum(counts[counts > threshold]))\n", "\n", "data_dir = '/Volumes/Samsung_T5/Data/Book-Reviews/GoodReads/'\n", "sample_review_text_file = os.path.join(data_dir, 'goodreads_reviews-reviewers_above_5k_reviews.csv.gz') # includes text\n", "\n", "# Output columns ('review_lang' is listed once; it was duplicated before)\n", "headers = [\n", "    'user_id', 'book_id', 'work_id', 'review_id', 'rating', 'date_added', 'date_updated', 'read_at', 'started_at',\n", "    'n_votes', 'n_comments', 'review_length', 'review_lang', 'review_text'\n", "]\n", "\n", "\n", "with gzip.open(sample_review_text_file, 'wt') as fh:\n", "    writer = csv.writer(fh, delimiter='\\t')\n", "    writer.writerow(headers)\n", "    written = 0\n", "    for ri, review in enumerate(read_csv(review_text_file)):\n", "        if (ri+1) % 100000 == 0:\n", "            print(ri+1, 'reviews parsed', written, 'written')\n", "        if review['user_id'] not in users_above_5k:\n", "            continue\n", "        # skip reviews whose book_id has no known work_id\n", "        try:\n", "            review['work_id'] = mapping[review['book_id']]\n", "        except KeyError:\n", "            continue\n", "        written += 1\n", "        review['review_lang'] = lang_detect(review)\n", "        row =
[review[header] for header in headers]\n", "        writer.writerow(row)\n", "print(ri+1, 'reviews parsed', written, 'written')\n" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
user_idbook_idreview_idratingdate_addeddate_updatedread_atstarted_atn_votesn_commentsreview_lengthwork_id
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [user_id, book_id, review_id, rating, date_added, date_updated, read_at, started_at, n_votes, n_comments, review_length, work_id]\n", "Index: []" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "data_dir = '/Volumes/Samsung_T5/Data/Book-Reviews/GoodReads/'\n", "sample_review_text_file = os.path.join(data_dir, 'goodreads_reviews-random_sample_1M.csv.gz') # includes text\n", "\n", "headers = [\n", " 'user_id', 'book_id', 'work_id', 'review_id', 'rating', 'date_added', 'date_updated', 'read_at', 'started_at', \n", " 'n_votes', 'n_comments', 'review_length', 'review_lang', 'review_text', 'review_lang'\n", "]\n", "\n", "threshold = 1000000\n", "prob_threshold = threshold / len(review_df)\n", "\n", "with gzip.open(sample_review_text_file, 'wt') as fh:\n", " writer = csv.writer(fh, delimiter='\\t')\n", " writer.writerow(headers)\n", " for review in read_csv(review_text_file):\n", " if np.random.rand() > prob_threshold:\n", " continue\n", " review['review_lang'] = lang_detect(review)\n", " row = [review[header] for header in headers]\n", " writer.writerow(row)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 4 }