Filtering Goodreads Reviews

Data exploration surfaced a number of issues with the reviews that require some form of data cleaning, i.e. selection and normalization of reviews.

This notebook shows the cleaning steps that were taken.

Non-Reviews

A plot of the review length distribution revealed high peaks at a few lengths (in number of characters). For example, there are many more reviews of length 3 than expected given the rest of the distribution. Inspection revealed that many of those 3-character reviews contain only a rating, like '3.5' or '4.5'.

Another peak occurs at length 40: there is a large number of reviews that are only a URL for a webpage that contains the actual review. Goodreads shortens longer URLs to 40 characters in the anchor text of an HTML <a> element for display, with the full URL in the anchor href attribute. There are 30,277 such reviews.

Types of non-reviews:

  • length 0: these are empty reviews, i.e. there is no review content at all.
  • length 3: these are mainly reviews that only mention a rating, like '3.5' or '4.5'.
  • length 9-12: these are mainly reviews that only mention a rating followed by the word 'stars', like '3.5 stars' or '4.5 stars'.
  • length 40: these are mainly reviews that consist only of a URL for a webpage that contains the actual review, shortened by Goodreads to 40 characters of anchor text. There are 30,277 such reviews.
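
The rating-only reviews described in the list can be approximated with a single regular expression. The sketch below is an illustration, not the filter used in this notebook; the name `looks_like_rating_only` and the exact pattern are assumptions.

```python
import re

# Hypothetical pattern: one digit, an optional decimal part, optionally followed
# by "star" or "stars" and a closing '.' or '!'.
RATING_ONLY = re.compile(r'\s*\d(?:[.,]\d+)?\s*(?:stars?)?\s*[.!]?\s*', re.IGNORECASE)

def looks_like_rating_only(text):
    """Return True if the whole text is just a rating such as '3.5' or '4.5 stars'."""
    return RATING_ONLY.fullmatch(text) is not None

for text in ['3.5', '4.5 stars', '4 Stars!', 'Loved it, 5 stars']:
    print(repr(text), looks_like_rating_only(text))
```

Using `fullmatch` rather than `search` keeps short but genuine reviews that merely mention a rating.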

The steps below aim to remove these so-called non-reviews:

In [1]:
# autoreload is only used during development
# and can be removed once the helper modules are stable.
%reload_ext autoreload
%autoreload 2

# This is needed to add the repo dir to the path so jupyter
# can load the modules in the scripts directory from the notebooks
import os
import sys
repo_dir = os.path.split(os.getcwd())[0]
print(repo_dir)
if repo_dir not in sys.path:
    sys.path.append(repo_dir)
    
import numpy as np
import pandas as pd
import json
import csv
from collections import Counter
import gzip

data_dir = '/Volumes/Samsung_T5/Data/Book-Reviews/GoodReads/'

author_file = os.path.join(data_dir, 'goodreads_book_authors.csv.gz') # author information
book_file = os.path.join(data_dir, 'goodreads_books.csv.gz') # basic book metadata
genre_file = os.path.join(data_dir, 'goodreads_book_genres_initial.csv.gz') # book genre information
review_file = os.path.join(data_dir, 'goodreads_reviews_dedup-no_text.csv.gz') # excludes text to save memory
review_text_file = os.path.join(data_dir, 'goodreads_reviews_dedup.csv.gz') # includes text
/Users/marijnkoolen/Code/Huygens/scale
In [2]:
review_df = pd.read_csv(review_file, sep='\t', compression='gzip')

review_df
Out[2]:
user_id book_id review_id rating date_added date_updated read_at started_at n_votes n_comments review_length
0 8842281e1d1347389f2ab93d60773d4d 24375664 5cd416f3efc3f944fce4ce2db2290d5e 5 Fri Aug 25 13:55:02 -0700 2017 Mon Oct 09 08:55:59 -0700 2017 Sat Oct 07 00:00:00 -0700 2017 Sat Aug 26 00:00:00 -0700 2017 16 0 968
1 8842281e1d1347389f2ab93d60773d4d 18245960 dfdbb7b0eb5a7e4c26d59a937e2e5feb 5 Sun Jul 30 07:44:10 -0700 2017 Wed Aug 30 00:00:26 -0700 2017 Sat Aug 26 12:05:52 -0700 2017 Tue Aug 15 13:23:18 -0700 2017 28 1 2086
2 8842281e1d1347389f2ab93d60773d4d 6392944 5e212a62bced17b4dbe41150e5bb9037 3 Mon Jul 24 02:48:17 -0700 2017 Sun Jul 30 09:28:03 -0700 2017 Tue Jul 25 00:00:00 -0700 2017 Mon Jul 24 00:00:00 -0700 2017 6 0 474
3 8842281e1d1347389f2ab93d60773d4d 22078596 fdd13cad0695656be99828cd75d6eb73 4 Mon Jul 24 02:33:09 -0700 2017 Sun Jul 30 10:23:54 -0700 2017 Sun Jul 30 15:42:05 -0700 2017 Tue Jul 25 00:00:00 -0700 2017 22 4 962
4 8842281e1d1347389f2ab93d60773d4d 6644782 bd0df91c9d918c0e433b9ab3a9a5c451 4 Mon Jul 24 02:28:14 -0700 2017 Thu Aug 24 00:07:20 -0700 2017 Sat Aug 05 00:00:00 -0700 2017 Sun Jul 30 00:00:00 -0700 2017 8 0 420
... ... ... ... ... ... ... ... ... ... ... ...
15739962 d0f6d1a4edcab80a6010cfcfeda4999f 1656001 b3d9a00405f7e96752d67b85deda4c7d 4 Mon Jun 04 18:08:44 -0700 2012 Tue Jun 26 18:58:46 -0700 2012 NaN Sun Jun 10 00:00:00 -0700 2012 0 1 299
15739963 594c86711bd7acdaf655d102df52a9cb 10024429 2bcba3579aa1d728e664de293e16aacf 5 Fri Aug 01 18:46:18 -0700 2014 Fri Aug 01 18:47:07 -0700 2014 NaN NaN 0 0 71
15739964 594c86711bd7acdaf655d102df52a9cb 6721437 7c1a7fcc2614a1a2a29213c11c991083 3 Tue Aug 27 12:49:25 -0700 2013 Tue Aug 27 12:53:46 -0700 2013 NaN NaN 0 0 224
15739965 594c86711bd7acdaf655d102df52a9cb 15788197 74a9f9d1db09a90aae3a5acea68c6593 2 Fri May 03 13:06:15 -0700 2013 Fri May 03 15:35:39 -0700 2013 Fri May 03 15:35:39 -0700 2013 Fri May 03 00:00:00 -0700 2013 0 0 108
15739966 594c86711bd7acdaf655d102df52a9cb 8239301 f2af741fb7a99ff730cf29e004f127da 4 Sat Apr 20 15:18:15 -0700 2013 Thu May 02 16:51:20 -0700 2013 Thu May 02 16:51:20 -0700 2013 Sat Apr 20 00:00:00 -0700 2013 0 0 6

15739967 rows × 11 columns

In [5]:
review_df.review_length.value_counts().sort_index().plot(logx=True)
Out[5]:
<AxesSubplot:>
In [4]:
review_df[review_df.review_length < 50].review_length.value_counts().sort_index()
Out[4]:
0      6938
1     19717
2     13640
3     45288
4     24220
5     22144
6     15422
7     19845
8     24205
9     66297
10    47467
11    37583
12    30385
13    26734
14    32114
15    35955
16    32618
17    35097
18    34022
19    35913
20    33675
21    31992
22    32401
23    31484
24    32009
25    32592
26    33616
27    32977
28    33254
29    33810
30    33634
31    33967
32    34169
33    33071
34    33903
35    33726
36    33558
37    34014
38    33884
39    33581
40    64165
41    34872
42    34845
43    33229
44    33592
45    34321
46    34057
47    33651
48    33981
49    33656
Name: review_length, dtype: int64
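
Peaks like the one at length 40 can also be flagged programmatically instead of by eye. The following sketch compares each count with the median of its neighbours. The counts are copied from the output above, while the function name `find_peaks`, the window size and the ratio threshold are illustrative assumptions.

```python
from statistics import median

# Counts for review lengths 36-44, copied from the value_counts output above.
length_counts = {
    36: 33558, 37: 34014, 38: 33884, 39: 33581, 40: 64165,
    41: 34872, 42: 34845, 43: 33229, 44: 33592,
}

def find_peaks(counts, window=2, ratio=1.5):
    """Return lengths whose count exceeds `ratio` times the median of nearby lengths."""
    peaks = []
    for length, count in sorted(counts.items()):
        neighbours = [
            counts[other]
            for other in range(length - window, length + window + 1)
            if other != length and other in counts
        ]
        if neighbours and count > ratio * median(neighbours):
            peaks.append(length)
    return peaks

print(find_peaks(length_counts))
```

With these nine counts only length 40 is flagged. The median is used instead of the mean so that the peak itself does not inflate the baseline of its neighbours.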

The following steps check individual reviews for characteristics of non-reviews and create a derived review file with the identified non-reviews removed.

In [64]:
# helper is a module with simple helper functions
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException
from scripts.helper import read_csv
from collections import Counter
import re

def is_url(record):
    return record['review_length'] <= 40 and record['review_text'].startswith('http')
    
def is_rating(record):
    if record['review_length'] > 12:
        return False
    if record['review_length'] < 4 and re.search(r'\d', record['review_text']):
        return True
    for word in ['star', 'stars', 'sterne', 'ster', 'sterren', 'rating']:
        if re.search(word, record['review_text'], re.IGNORECASE) and re.search(r'\d', record['review_text']):
            return True
    return False

def is_date(record):
    if record['review_length'] > 12:
        return False
    if re.search(r'20\d{,2}', record['review_text']):
        return True
    return False

def is_empty(record):
    return record['review_length'] == 0

def is_non_review(record):
    if record['review_length'] > 40:
        return False
    return is_empty(record) or is_url(record) or is_rating(record) or is_date(record)

def lang_detect(record):
    try:
        return detect(record['review_text'])
    except LangDetectException:
        return 'unknown'

def is_english(record):
    # call lang_detect on the record; comparing the function itself to 'en' is always False
    return lang_detect(record) == 'en'
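
# The date check in is_date relies on a regex quantifier that is easy to misread.
# The short sketch below is not part of the original filter; it only contrasts
# `{,2}` (zero to two repetitions) with `{2}` (exactly two). The names `loose`
# and `strict` are assumptions for illustration.

```python
import re

# `{,2}` means "zero to two repetitions", so a bare "20" already matches.
loose = re.compile(r'20\d{,2}')
# `{2}` requires exactly two more digits, i.e. a four-digit year starting with 20.
strict = re.compile(r'20\d{2}')

for text in ['2017', 'Read 2015', '20 pages']:
    print(repr(text), bool(loose.search(text)), bool(strict.search(text)))
```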
    
In [37]:
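review_filtered_file = os.path.join(data_dir, 'goodreads_reviews_dedup_filtered-no_text.csv.gz') # excludes text and non-reviews
# the file written below includes the review text (see headers), with non-reviews removed
filtered_file = os.path.join(data_dir, 'goodreads_reviews_dedup_filtered.csv.gz')
print(filtered_file)
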
headers = [
    'user_id', 'book_id', 'review_id', 'rating', 'date_added', 'date_updated', 'read_at', 
    'started_at', 'n_votes', 'n_comments', 'review_length', 'review_text'
]

with gzip.open(filtered_file, 'wt') as fh:
    writer = csv.writer(fh, delimiter='\t')
    writer.writerow(headers)
    for ri, record in enumerate(read_csv(review_text_file)):
        record['review_length'] = int(record['review_length'])
        if is_non_review(record):
            continue
        row = [record[header] for header in headers]
        writer.writerow(row)
        if (ri+1) % 1000000 == 0:
            print(ri+1, 'records parsed')

print(ri+1, 'records parsed')
/Volumes/Samsung_T5/Data/Book-Reviews/GoodReads/goodreads_reviews_dedup_filtered.csv.gz
1000000 records parsed
2000000 records parsed
3000000 records parsed
4000000 records parsed
5000000 records parsed
6000000 records parsed
7000000 records parsed
8000000 records parsed
9000000 records parsed
10000000 records parsed
11000000 records parsed
12000000 records parsed
13000000 records parsed
14000000 records parsed
15000000 records parsed
15739967 records parsed
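
The loop above depends on `read_csv` from `scripts.helper`, which is not shown in this notebook. A minimal stand-in, assuming it yields one dict per row of a gzipped tab-separated file, could look like the sketch below. The names `read_tsv_gz` and `filter_tsv_gz` are hypothetical.

```python
import csv
import gzip

def read_tsv_gz(path):
    """Yield one dict per row of a gzipped, tab-separated file with a header row."""
    with gzip.open(path, 'rt', newline='') as fh:
        for record in csv.DictReader(fh, delimiter='\t'):
            yield record

def filter_tsv_gz(in_path, out_path, headers, keep):
    """Copy rows for which keep(record) is true; return (rows read, rows written)."""
    n_read = n_written = 0
    with gzip.open(out_path, 'wt', newline='') as fh:
        writer = csv.writer(fh, delimiter='\t')
        writer.writerow(headers)
        for record in read_tsv_gz(in_path):
            n_read += 1
            if keep(record):
                writer.writerow([record[header] for header in headers])
                n_written += 1
    return n_read, n_written
```

Because both reading and writing are streamed row by row, the full review text never has to fit in memory. Returning both counts makes it easy to report how many records were dropped.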
In [3]:
# the filtered review file excludes text and non-reviews
review_filtered_file = os.path.join(data_dir, 'goodreads_reviews_dedup_filtered-no_text.csv.gz') 

review_df = pd.read_csv(review_filtered_file, sep='\t', compression='gzip')

review_df
Out[3]:
user_id book_id review_id rating date_added date_updated read_at started_at n_votes n_comments review_length
0 8842281e1d1347389f2ab93d60773d4d 24375664 5cd416f3efc3f944fce4ce2db2290d5e 5 Fri Aug 25 13:55:02 -0700 2017 Mon Oct 09 08:55:59 -0700 2017 Sat Oct 07 00:00:00 -0700 2017 Sat Aug 26 00:00:00 -0700 2017 16 0 968
1 8842281e1d1347389f2ab93d60773d4d 18245960 dfdbb7b0eb5a7e4c26d59a937e2e5feb 5 Sun Jul 30 07:44:10 -0700 2017 Wed Aug 30 00:00:26 -0700 2017 Sat Aug 26 12:05:52 -0700 2017 Tue Aug 15 13:23:18 -0700 2017 28 1 2086
2 8842281e1d1347389f2ab93d60773d4d 6392944 5e212a62bced17b4dbe41150e5bb9037 3 Mon Jul 24 02:48:17 -0700 2017 Sun Jul 30 09:28:03 -0700 2017 Tue Jul 25 00:00:00 -0700 2017 Mon Jul 24 00:00:00 -0700 2017 6 0 474
3 8842281e1d1347389f2ab93d60773d4d 22078596 fdd13cad0695656be99828cd75d6eb73 4 Mon Jul 24 02:33:09 -0700 2017 Sun Jul 30 10:23:54 -0700 2017 Sun Jul 30 15:42:05 -0700 2017 Tue Jul 25 00:00:00 -0700 2017 22 4 962
4 8842281e1d1347389f2ab93d60773d4d 6644782 bd0df91c9d918c0e433b9ab3a9a5c451 4 Mon Jul 24 02:28:14 -0700 2017 Thu Aug 24 00:07:20 -0700 2017 Sat Aug 05 00:00:00 -0700 2017 Sun Jul 30 00:00:00 -0700 2017 8 0 420
... ... ... ... ... ... ... ... ... ... ... ...
15616192 d0f6d1a4edcab80a6010cfcfeda4999f 1656001 b3d9a00405f7e96752d67b85deda4c7d 4 Mon Jun 04 18:08:44 -0700 2012 Tue Jun 26 18:58:46 -0700 2012 NaN Sun Jun 10 00:00:00 -0700 2012 0 1 299
15616193 594c86711bd7acdaf655d102df52a9cb 10024429 2bcba3579aa1d728e664de293e16aacf 5 Fri Aug 01 18:46:18 -0700 2014 Fri Aug 01 18:47:07 -0700 2014 NaN NaN 0 0 71
15616194 594c86711bd7acdaf655d102df52a9cb 6721437 7c1a7fcc2614a1a2a29213c11c991083 3 Tue Aug 27 12:49:25 -0700 2013 Tue Aug 27 12:53:46 -0700 2013 NaN NaN 0 0 224
15616195 594c86711bd7acdaf655d102df52a9cb 15788197 74a9f9d1db09a90aae3a5acea68c6593 2 Fri May 03 13:06:15 -0700 2013 Fri May 03 15:35:39 -0700 2013 Fri May 03 15:35:39 -0700 2013 Fri May 03 00:00:00 -0700 2013 0 0 108
15616196 594c86711bd7acdaf655d102df52a9cb 8239301 f2af741fb7a99ff730cf29e004f127da 4 Sat Apr 20 15:18:15 -0700 2013 Thu May 02 16:51:20 -0700 2013 Thu May 02 16:51:20 -0700 2013 Sat Apr 20 00:00:00 -0700 2013 0 0 6

15616197 rows × 11 columns

In [9]:
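from dateutil import tz
from dateutil.parser import parse

utc = tz.gettz('UTC')

def parse_date(date_str):
    # missing dates are read as NaN (a float), for which parse() raises a TypeError
    try:
        return parse(date_str).astimezone(utc)
    except TypeError:
        return None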

#book_df = pd.read_csv(book_file, sep='\t', compression='gzip')

#book_df[['book_id', 'work_id']]
#review_df = pd.merge(review_df, book_df[['book_id', 'work_id']], on='book_id', how='left')

review_df['date_added'] = review_df.date_added.apply(parse_date)
review_df['date_updated'] = review_df.date_updated.apply(parse_date)
review_df['read_at'] = review_df.read_at.apply(parse_date)
review_df['started_at'] = review_df.started_at.apply(parse_date)

review_df.columns
Out[9]:
Index(['user_id', 'book_id', 'review_id', 'rating', 'date_added',
       'date_updated', 'read_at', 'started_at', 'n_votes', 'n_comments',
       'review_length', 'work_id'],
      dtype='object')
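
The timestamps in the review file follow a fixed layout, so they can also be parsed with the standard library alone. The sketch below assumes the format seen in the sample rows above (e.g. 'Fri Aug 25 13:55:02 -0700 2017'). The names `GOODREADS_DATE_FORMAT` and `parse_goodreads_date` are hypothetical, and `%a`/`%b` assume an English locale.

```python
from datetime import datetime, timezone

# Format inferred from the sample rows: weekday, month, day, time, UTC offset, year.
GOODREADS_DATE_FORMAT = '%a %b %d %H:%M:%S %z %Y'

def parse_goodreads_date(date_str):
    """Parse a timestamp to an aware UTC datetime; return None for missing values."""
    if not isinstance(date_str, str):
        return None  # e.g. NaN for reviews without a read_at or started_at date
    return datetime.strptime(date_str, GOODREADS_DATE_FORMAT).astimezone(timezone.utc)

print(parse_goodreads_date('Fri Aug 25 13:55:02 -0700 2017'))
```

A fixed format string avoids the guessing that a general-purpose date parser has to do, at the cost of failing on any timestamp that deviates from the layout.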
In [10]:
# index=False prevents the row index from being written as an extra unnamed column
review_df.to_csv(review_filtered_file, sep='\t', compression='gzip', index=False)

#review_df = pd.read_csv(review_filtered_file, sep='\t', compression='gzip')

Making review subsets for content analysis

The entire Goodreads review collection including all the review text is too big to read into a dataframe, so we create a number of sample review subsets with text included that will be used for content analysis.

The following criteria will be used to analyse various aspects of scale:

  • all reviews of frequently reviewed books (repetition across reviews as book characteristics)
  • all reviews of frequent reviewers (repetition across reviews as reviewer characteristics)
  • random sample of reviews (repetition across reviews as book review characteristics)
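
Since the full collection does not fit in memory, the random sample has to be drawn while streaming through the file. One common way to do this is reservoir sampling. The sketch below is an assumption about how such a sample could be drawn, not the method used in this notebook, and `reservoir_sample` is a hypothetical name.

```python
import random

def reservoir_sample(records, k, seed=None):
    """Return k records chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, record in enumerate(records):
        if index < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = record
    return sample

print(len(reservoir_sample(range(1000000), 5, seed=42)))
```

Only `k` records are held in memory at any time, and a single pass over the reviews is enough. If the input has fewer than `k` records, all of them are returned.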
In [70]:
counts = review_df.book_id.value_counts()

threshold = 10000
# Series.iteritems() was removed in pandas 2.0; items() is the equivalent call
books_above_10k = [book_id for book_id, count in counts[counts > threshold].items()]
book_ids = list(book_df[book_df.book_id.isin(books_above_10k)].book_id)
work_ids = list(book_df[book_df.book_id.isin(books_above_10k)].work_id)
print(books_above_10k)
print(work_ids)
mapping = {str(book_id): work_id for book_id, work_id in zip(book_ids, work_ids)}
books_above_10k = [str(book_id) for book_id in books_above_10k]
print(f'number of books with over {threshold} reviews:', len(counts[counts > threshold]))
print(f'number of reviews for books with over {threshold} reviews:', sum(counts[counts > threshold]))

data_dir = '/Volumes/Samsung_T5/Data/Book-Reviews/GoodReads/'
sample_review_text_file = os.path.join(data_dir, 'goodreads_reviews-books_above_10k_reviews.csv.gz') # includes text

headers = [
    'user_id', 'book_id', 'work_id', 'review_id', 'rating', 'date_added', 'date_updated', 'read_at', 'started_at', 
    'n_votes', 'n_comments', 'review_length', 'review_lang', 'review_text'
]

with gzip.open(sample_review_text_file, 'wt') as fh:
    writer = csv.writer(fh, delimiter='\t')
    writer.writerow(headers)
    for review in read_csv(review_text_file):
        if review['book_id'] not in books_above_10k:
            continue
        review['review_lang'] = lang_detect(review)
        review['work_id'] = mapping[review['book_id']]
        row = [review[header] for header in headers]
        writer.writerow(row)
[11870085, 2767052, 29056083, 20309175, 7260188, 22557272, 5470, 6148028, 19063, 10818853, 13335037, 41865]
[8812783, 15732562, 2792775, 3212258, 878368, 153313, 153313, 6171458, 41107568, 16827462, 48765776, 48765776, 28143699, 28143699, 28143699, 28143699, 28143699, 28143699, 28143699, 28143699, 28143699, 28143699, 28143699, 13155899]
number of books with over 10000 reviews: 12
number of reviews for books with over 10000 reviews: 167512
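
Counting reviews per work rather than per book groups the different editions of the same work, so a work can pass the threshold even when none of its editions does. The sketch below illustrates this with toy data; all ids and counts are made up for the example.

```python
from collections import Counter

# Toy data: three editions (book ids) of one work and a single edition of another.
book_to_work = {'b1': 'w1', 'b2': 'w1', 'b3': 'w1', 'b4': 'w2'}
reviewed_book_ids = ['b1'] * 6 + ['b2'] * 5 + ['b3'] * 4 + ['b4'] * 8

# Count reviews per book, then add the book counts up per work.
book_counts = Counter(reviewed_book_ids)
work_counts = Counter()
for book_id, count in book_counts.items():
    work_counts[book_to_work[book_id]] += count

threshold = 10
books_above = [book_id for book_id, count in book_counts.items() if count > threshold]
works_above = [work_id for work_id, count in work_counts.items() if count > threshold]
print(books_above, works_above)
```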
In [66]:
from scripts.helper import read_csv

counts = review_df.work_id.value_counts()

threshold = 10000
works_above_10k = [int(work_id) for work_id, count in counts[counts > threshold].items()]
print(works_above_10k)
# take book_id and work_id from the same rows, so that each book maps to its own work
# (zipping an unordered set of book ids with a list of work ids would misalign them)
works_df = book_df[book_df.work_id.isin(works_above_10k)]
mapping = {str(book_id): work_id for book_id, work_id in zip(works_df.book_id, works_df.work_id)}
book_ids = set(mapping)
print(book_ids)

print(f'number of works with over {threshold} reviews:', len(counts[counts > threshold]))
print(f'number of reviews for works with over {threshold} reviews:', sum(counts[counts > threshold]))

data_dir = '/Volumes/Samsung_T5/Data/Book-Reviews/GoodReads/'
sample_review_text_file = os.path.join(data_dir, 'goodreads_reviews-works_above_10k_reviews.csv.gz') # includes text

headers = [
    'user_id', 'book_id', 'work_id', 'review_id', 'rating', 'date_added', 'date_updated', 'read_at', 'started_at', 
    'n_votes', 'n_comments', 'review_length', 'review_lang', 'review_text'
]

with gzip.open(sample_review_text_file, 'wt') as fh:
    writer = csv.writer(fh, delimiter='\t')
    writer.writerow(headers)
    written = 0
    for ri, review in enumerate(read_csv(review_text_file)):
        if (ri+1) % 1000000 == 0:
            print(ri+1, 'reviews parsed', written, 'written')
        if review['book_id'] not in book_ids:
            continue
        review['review_lang'] = lang_detect(review)
        review['work_id'] = mapping[review['book_id']]
        written += 1
        row = [review[header] for header in headers]
        writer.writerow(row)
print(ri+1, 'reviews parsed', written, 'written')
[16827462, 48765776, 2792775, 28143699, 13306276, 13155899, 41107568, 8812783, 153313, 878368, 17763198, 6171458, 4640799, 15732562, 21825181, 15524549, 17225055, 14863741, 14345371, 3212258, 15545385, 2267189, 14245059, 15524542, 21861351, 4835472, 2754161, 3275794, 50459161]
{'17623911', '7623768', '17380203', '6469151', '848654', '13533423', '33229067', '634951', '1167751', '27069714', '25615225', '22919918', '6794926', '35478512', '20555501', '20661705', '18933600', '20792791', '23219487', '32678493', '31768151', '23211741', '18219043', '25459160', '24792411', '742573', '3708952', '13517276', '17225793', '22747979', '12783911', '31224712', '8045416', '7631356', '16071194', '24483350', '27281555', '10783858', '30652277', '16039145', '33285096', '35547598', '12658862', '21464429', '31333626', '30129222', '33638045', '6703256', '28421203', '17841055', '8376549', '22444974', '6565942', '16049993', '15780438', '2728527', '13541067', '7561318', '23562530', '30332792', '8146730', '1175483', '1029083', '18717073', '15719793', '22907938', '20634095', '30306795', '16100002', '8775537', '28352416', '34615046', '20765880', '31450752', '115233', '24965395', '5136453', '14289293', '18523885', '1294624', '7378684', '3171676', '32604784', '34964436', '1175482', '32196033', '18779332', '15719795', '29624161', '6089758', '6077978', '9923043', '25928461', '19288043', '24675652', '17257607', '19733543', '25750832', '12091570', '15860667', '18189665', '13261812', '10818853', '23394970', '22077269', '18086425', '3744438', '12083073', '21976013', '32311968', '5207452', '20621490', '18913703', '12465069', '15762884', '11494324', '35614220', '653187', '1811146', '1032890', '26029387', '20804177', '20760125', '15723988', '25492938', '10339954', '31140032', '121121', '7667056', '25782431', '9771265', '26868374', '16010180', '20513634', '15868', '3207547', '70740', '9312497', '3345529', '27880360', '7889219', '8306706', '13450845', '24703950', '23347115', '17466045', '20799366', '6469152', '6333218', '1303749', '11283315', '24950272', '7146518', '12213207', '22558519', '17899904', '17796487', '20939592', '18524290', '31279693', '4255', '12386561', '1090736', '29855909', '17697760', '11019770', '12058988', '20895289', '34171499', '31201349', '11506091', 
'169182', '11072324', '22443450', '19458358', '28933083', '7459906', '22206765', '6596839', '6421355', '17182122', '12576748', '12973516', '12187803', '20446165', '22052749', '32711626', '15747213', '14462380', '18592792', '8116091', '18456083', '22877052', '20819401', '27856546', '20939738', '18144590', '30527969', '25889880', '2193926', '11489526', '23487535', '23813546', '28075512', '1185972', '25576955', '21558662', '11112731', '28116469', '23272156', '15711759', '12823877', '33407095', '22019444', '29561070', '34803847', '6007025', '87784', '18085525', '25570789', '26050341', '27468538', '9970464', '25532234', '33964226', '17736795', '15871', '18626824', '17610151', '3591974', '22008879', '10089874', '16001467', '25398136', '34119033', '430569', '5475', '7932736', '1401476', '5471', '28801690', '25767346', '20563811', '13540149', '9856721', '18688016', '16300093', '6419743', '28930862', '22747441', '5470', '27178169', '16180175', '6696985', '21540995', '17839087', '12657479', '21560020', '25426791', '1471425', '8683527', '11262855', '107497', '9831947', '18713720', '33876518', '18949467', '22918384', '6668046', '35711865', '25038707', '29055777', '19383583', '7202990', '17282465', '14595296', '17317768', '28191772', '19254270', '1162898', '23802330', '28670400', '17797233', '17789078', '13623558', '17982595', '26514633', '7432752', '35596721', '3296450', '9670791', '1167752', '7107665', '26845374', '36226992', '7877620', '16847978', '18689113', '17345357', '24024309', '32854695', '20873522', '12935613', '11502420', '22469599', '25753230', '13625800', '16287485', '26085996', '18461014', '20894754', '15774971', '28094773', '18416234', '13641954', '25372528', '21522382', '2262747', '10296957', '22565648', '28960475', '9969571', '970917', '18667069', '13614350', '23698704', '13561202', '7005741', '16044564', '23695850', '6201586', '256683', '17253765', '11272641', '1558486', '925412', '23223667', '22059026', '18710190', '13424701', '24112181', '22105022', 
'13602666', '7782572', '848655', '18158464', '13558128', '17406183', '24252929', '11367926', '2187995', '17305435', '20318659', '19478439', '18617752', '26243089', '22820624', '22612307', '25048937', '17380078', '18070037', '25351314', '6214263', '18923851', '17056572', '6682611', '17855823', '7877609', '6988836', '9860176', '30272403', '6287607', '13446802', '25989499', '28927632', '16082380', '1792183', '29090191', '15704186', '31342081', '20362341', '21848807', '21544013', '32196100', '12512617', '18085454', '25477553', '17285979', '23607406', '23586798', '15784168', '35610130', '18277351', '10234834', '18740106', '32787035', '22057878', '22374550', '25032511', '13131651', '26462167', '28364349', '27865405', '28116173', '12370908', '12379979', '23364977', '13562614', '26162946', '26401901', '32469600', '6484809', '15827051', '18517479', '16386616', '30235395', '35291237', '17879118', '22518395', '20799951', '23306917', '13087683', '16638435', '33013778', '32705495', '18750002', '32327896', '20815159', '13611052', '20958895', '26865481', '12535267', '15845588', '20346560', '19140487', '20885669', '11465793', '16110229', '18367755', '16431267', '893131', '25002842', '26328452', '3086961', '17999114', '22820646', '35267531', '26876834', '27068801', '13556970', '32596639', '23902261', '20747215', '296843', '27839326', '17905376', '35561948', '9335682', '455592', '23122150', '13646978', '32672169', '5999961', '18711392', '295898', '20876955', '18400305', '20651067', '21798664', '6017885', '22209071', '20826236', '11476408', '18710654', '17244064', '10616360', '33812720', '36217912', '12497752', '12626302', '18738787', '23592434', '15719753', '23007086', '33791193', '21569815', '22569433', '171813', '13570308', '22247199', '6369436', '17902413', '16234117', '23245346', '18217608', '29546094', '18917661', '15771057', '13621880', '12643928', '22008638', '15736528', '25730991', '1175481', '22820703', '33818894', '18465782', '77523', '10754989', '5477', '4479640', 
'18274994', '10430432', '18710464', '1781701', '18046319', '7107296', '17789454', '25862866', '25578651', '35624010', '6490389', '29942695', '17182084', '13562963', '12066190', '13184882', '25478508', '12492298', '22451087', '30308795', '25952311', '18722749', '16025268', '25345762', '16018364', '30812729', '19358657', '15743060', '24503488', '18248500', '12280320', '2938728', '8040193', '25851222', '6606279', '13425323', '23463187', '26840292', '22590639', '26826098', '1611724', '1359366', '23003589', '32672170', '16038170', '7776692', '17934350', '15114365', '8667747', '13517330', '18500423', '16043861', '15770294', '23392951', '21800933', '15869972', '18618963', '25220067', '23667081', '16106928', '17802732', '9761397', '6064259', '31700393', '13159888', '23478203', '31304766', '29983806', '25806040', '6772333', '12408058', '29245055', '25095072', '22301044', '23213324', '13489404', '2239941', '15755597', '25402233', '18667071', '3883919', '1039462', '33655993', '17905623', '32306675', '25580994', '25878460', '41865', '27421523', '22397261', '16118339', '9480062', '949562', '872666', '11316396', '21825308', '22946401', '22914379', '17268167', '23264557', '13627059', '11333587', '15991856', '11736539', '5527857', '30507892', '17951429', '18298900', '13372855', '7937116', '20803587', '17319298', '6721985', '13609836', '2617468', '18750632', '12042324', '2747979', '25538495', '6772387', '30631952', '8797737', '15852552', '21415819', '23396332', '15830744', '20763122', '6013089', '7303448', '9226958', '12861705', '23677696', '27804437', '23431006', '34466147', '24629547', '47000', '17315048', '16142085', '13519470', '25491310', '32327029', '16221771', '14759663', '31125943', '21899768', '30963868', '25905906', '18804646', '25303122', '29995403', '30290686', '32172002', '23148432', '18685229', '26244621', '31686251', '17678435', '25270785', '19240738', '23499218', '23011238', '14424887', '13573833', '21838140', '6460568', '22733227', '19019010', '26344382', 
'22028282', '6535792', '13035203', '11296680', '16136188', '21895490', '18071590', '25414128', '24947668', '30639879', '24060213', '32190294', '32196024', '18505857', '25402339', '8267080', '10428902', '21825994', '16094784', '8430187', '15298019', '22886522', '25841708', '6820997', '25086476', '13335038', '28593492', '23231754', '31302742', '26699682', '18333278', '14069099', '23752009', '35427227', '10319865', '11471507', '8891899', '16163526', '12656079', '31938415', '23944267', '25739188', '19564919', '893136', '13578037', '26815023', '27251284', '13643473', '22026125', '17807489', '20549381', '18926988', '19304768', '6249519', '33959060', '25676931', '16103612', '11718435', '25153867', '21081683', '25563463', '19030181', '32721509', '11831028', '36383978', '4591352', '25532945', '10435027', '15784893', '30302897', '22173562', '816816', '7955579', '31681856', '15833672', '34391120', '3362689', '16025398', '24885537', '12835105', '18667072', '26067879', '29216063', '1781707', '18193741', '15779138', '21926693', '13565922', '24581473', '28572585', '32800617', '17232449', '25782357', '30798481', '18755030', '12328746', '27067314', '17970732', '23251775', '28963439', '23969542', '15860371', '15792385', '30268316', '17928172', '296839', '23241465', '24742536', '12600138', '2140907', '25185586', '17185524', '17563287', '7062423', '23565121', '24737104', '28255378', '26050627', '8419315', '34863649', '7285601', '27835858', '17163051', '4794161', '20838734', '15698354', '8125982', '6567260', '22138225', '25035808', '30074903', '17286849', '15715013', '1118668', '18661631', '8322565', '16149483', '11059672', '11406968', '23308562', '20307239', '22610291', '33874143', '14163319', '68808', '35651100', '13068184', '11780869', '22391027', '16532513', '23729961', '16048509', '10771500', '29453740', '29066539', '26068758', '9884613', '17976538', '17321852', '20321004', '13492054', '23395098', '10197660', '22557272', '762743', '30909770', '31362625', '14742263', '9378347', 
'5043803', '3279999', '28963433', '864', '21798679', '27865418', '16176099', '20737476', '20493675', '15829776', '13502270', '31076720', '8684868', '47535', '134502', '19174917', '29476176', '6601779', '17372301', '20486826', '28441377', '26136747', '23681652', '12267806', '15827609', '16217859', '12352354', '26133133', '17162156', '25313807', '34005800', '11106711', '22820681', '25559177', '25871053', '34303350', '2430904', '2383357', '16121916', '3652228', '25467816', '29058155', '17670022', '17372039', '24235658', '22013246', '35277259', '29452280', '8696793', '26150280', '27245104', '15743069', '24992096', '11295686', '3544003', '17744501', '27457744', '12250584', '25867602', '15782527', '31928508', '6530628', '7592539', '6012312', '6404106', '862268', '25409006', '25068655', '22371139', '31247145', '6885125', '5983062', '25465121', '13512568', '18491244', '31930223', '6645647', '13585687', '13603056', '1907905', '28813854', '31416507', '8608219', '8442457', '34479625', '20876176', '1118028', '15982675', '20661290', '16153415', '27865302', '24865428', '18747814', '16120785', '13564669', '17383543', '20930777', '15706076', '15704169', '1163304', '16221043', '13189115', '20527042', '18413677', '34168383', '25922545', '13572785', '17740067', '21480931', '202588', '28280014', '18158779', '15783857', '6309556', '30253546', '4468230', '4079980', '29976154', '6945680', '6277870', '20776304', '17835248', '13631961', '22098536', '16309769', '14762245', '1743455', '20959422', '13366606', '28422098', '18960669', '22443451', '4582108', '6662198', '34364064', '21443484', '858352', '22396149', '23509049', '13572331', '23055327', '17926316', '9286283', '3243910', '24712933', '6454654', '29348347', '6753193', '17280301', '26716806', '23362138', '16023652', '2655', '1885731', '6214264', '28486119', '2728868', '11366370', '28251513', '10151606', '22025049', '9797134', '20487641', '11382943', '22886190', '12773641', '18038996', '7641032', '8175796', '16038894', '24585399', 
'25454820', '7949442', '25418711', '15704174', '18310712', '18872581', '30992603', '28226184', '10382704', '10489305', '23129688', '23510841', '3001581', '22736932', '22141666', '15923121', '12158002', '18402178', '17159918', '17855196', '5970445', '14796360', '23343051', '6788879', '7745385', '26014657', '16207383', '18336048', '24213573', '16084645', '10616322', '11291921', '26133825', '13279499', '2711522', '16691284', '27758324', '832130', '32501454', '24844880', '17694651', '2437444', '30814938', '9523411', '1812673', '17312364', '20936040', '15738816', '22529254', '6941985', '20301447', '13067277', '11472275', '27833446', '621714', '23055326', '19024389', '6308402', '412765', '6365868', '21606929', '18810809', '28002909', '17622641', '30733881', '16059263', '24936823', '15782628', '7303447', '11366921', '25458088', '21526588', '30819387', '21523103', '15834381', '17253961', '19191388', '16136701', '10860047', '15795357', '13450992', '7818697', '23488177', '13266989', '32327894', '23290223', '13637131', '2311681', '23481110', '32452232', '17374967', '23377235', '25149009', '17268736', '3234963', '18075570', '9025186', '35841506', '25739181', '11295839', '6277597', '32571475', '6795650', '21803253', '29557229', '17698624', '6540523', '18303228', '18330586', '32187706', '17701696', '31574524', '295895', '13644251', '30190890', '27432466', '32601138', '18710455', '22098740', '22443454', '13640224', '17792118', '7107655', '27192797', '22027303', '13572091', '18240444', '18808404', '7538739', '1039461', '13509182', '14287920', '16410928', '17458508', '17207560', '6319350', '15765784', '16080569', '4544759', '30194474', '18592700', '10407121', '11542687', '25925012', '25115981', '11306050', '33957050', '28239572', '22880450', '29240385', '2798098', '3708951', '28550078', '25719729', '11387490', '15738309', '23003119', '30073478', '25552609', '16368772', '25750833', '27240909', '1200110', '22901602', '33970415', '25907194', '30736096', '16031628', '13646586', 
'20910326', '13577338', '25817368', '31616479', '19005845', '32493379', '28396808', '493092', '6848433', '16163548', '9996853', '13417027', '23298852', '24733515', '17368146', '11521684', '988811', '26097476', '13364111', '23396344', '32310213', '23012014', '15779955', '19501104', '18407011', '23264599', '10327361', '25979514', '25380799', '15710432', '7599408', '12494832', '18277434', '23398687', '171796', '1430336', '30537071', '22667278', '13613910', '24844873', '6338979', '10307295', '16123468', '20822585', '27951429', '12061729', '18163440', '15839201', '26847251', '18364652', '32574083', '16240747', '29056090', '26104159', '18374073', '7619312', '16108278', '20691208', '13563204', '15820139', '296829', '18136444', '13233432', '27073501', '27222447', '3292087', '31361778', '35066575', '1359361', '32348285', '17233649', '10180184', '15853000', '22011300', '23245330', '16074780', '21458175', '26256536', '12606483', '15834248', '8423493', '13558130', '18599593', '25648152', '7882466', '31142477', '10400074', '17928697', '22397401', '19866705', '3283211', '25238369', '6456478', '49813', '9156663', '142293', '18331088', '25470866', '19248302', '20446243', '492000', '11870085', '13502941', '1167711', '14769440', '19572096', '10684183', '5514914', '35525846', '25817366', '28415052', '18590221', '16624883', '13608379', '18373805', '11072325', '1167758', '15784593', '13369984', '28869162', '20821105', '70750', '22026914', '23251163', '19465062', '25859959', '11340157', '25907431', '35427560', '26204799', '24598365', '25573037', '18332192', '8256471', '12985366', '7775689', '16122644', '17903642', '24338230', '16094352', '17156273', '6467118', '23597316', '1324062', '20746310', '28525303', '32729810', '34572308', '12943330', '16141742', '31453333', '32912399', '17200915', '26372386', '1033918', '22012199', '17858541', '6453849', '13147906', '2166416', '21845565', '18079910', '9947735', '47522', '3228468', '1508371', '24169712', '27863078', '25195069', '25707413', 
'4031861', '20555341', '7862090', '25645289', '21840310', '24763181', '28183257', '23506567', '7265901', '20534719', '23666179', '21280885', '17857648', '3', '17735871', '12997113', '1248077', '12973964', '18331796', '26309684', '31352119', '25929273', '22914373', '22841994', '12187799', '22558796', '11085074', '26143397', '21941928', '25748701', '18870649', '25573742', '24615234', '18765860', '10255274', '13455462', '30627196', '28479611', '7428537', '2264102', '28052767', '19035585', '16068905', '24245069', '7683425', '14760741', '22014037', '1611537', '26166591', '27775414', '22012779', '18461597', '22069364', '11304270', '21877015', '29067264', '22232135', '18240575', '7902652', '1826649', '33005686', '11249180', '4786064', '12510082', '25021837', '6371614', '23453787', '12941351', '12813562', '13452130', '22048373', '30968446', '16172579', '23290384', '9577857', '13450839', '18082943', '28147904', '7723926', '17378967', '19496550', '25402286', '18665794', '30258297', '9402421', '2681117', '11389422', '8113493', '20614474', '20861464', '26072029', '25655480', '20319534', '14762489', '20945198', '1907452', '975442', '27042066', '2661', '17252078', '23392759', '20775702', '35711177', '15995451', '13043622', '33249543', '13490003', '36172798', '615203', '22022075', '6658717', '21825598', '16050285', '13541093', '17617613', '17118898', '12139510', '16026746', '26632196', '16282204', '7541858', '651603', '6313338', '13099951', '26154394', '26055229', '30169721', '33133924', '23349454', '2193922', '1035885', '21008563', '9885717', '11477648', '3162139', '17347634', '33597793', '25445626', '17226872', '28959274', '12915074', '24965453', '28485265', '28132722', '25059351', '13074276', '9662788', '6275962', '17618692', '25537895', '15749148', '13584041', '7769080', '9745301', '24129434', '23261348', '51762', '25536116', '19245512', '22591099', '15327848', '21394255', '3162133', '18761327', '29972929', '17233160', '18045177', '24416368', '17372830', '295897', '18685593', 
'16960820', '24987978', '13492029', '33538809', '17470676', '25248885', '30698397', '848656', '10370664', '1629406', '10323524', '26124554', '736301', '29336777', '31501375', '86940', '22369653', '25742968', '8242247', '27774810', '16041590', '20698577', '1168126', '25153272', '22009164', '9576890', '35628404', '9782708', '25799090', '1482132', '13641944', '30625052', '3475269', '13190720', '12029011', '13020641', '16636891', '12328744', '6999776', '35604481', '13484101', '28863513', '18705050', '28371839', '16118334', '23633718', '27272362', '22614198', '24845481', '33832947', '15750489', '16071770', '13064926', '16003299', '20565586', '25519319', '18513682', '30312363', '5478', '20664548', '27556033', '13417580', '24826119', '35160402', '18428110', '25407919', '25560245', '7701058', '1577365', '13648024', '34318133', '9700012', '7635541', '22053825', '28946401', '6072428', '14854809', '13554008', '30757322', '18918647', '13631515', '34225260', '27561654', '13566014', '296825', '19241699', '3356', '18247775', '14059087', '6371188', '30139651', '10297811', '3049950', '15704161', '22464636', '25520880', '22858373', '2308980', '17973762', '6214262', '26207857', '13558932', '24761879', '12649718', '23383694', '23110080', '20873735', '4593339', '19747310', '18208828', '8519739', '22447268', '26175968', '19478858', '25076674', '23278585', '33401915', '22839547', '988812', '13037285', '20650332', '29192980', '36226979', '36186865', '15798097', '23390821', '71001', '22022744', '7636830', '12040504', '15702765', '30261139', '30365843', '10763617', '3347826', '13582806', '17791941', '11531822', '10857048', '31342863', '2767052', '23965011', '19088744', '17906724', '25885299', '3228285', '18373282', '17340867', '13599993', '15507958', '22616546', '23627224', '24955138', '13384167', '22381143', '45503', '26869474', '29434730', '15757746', '32150125', '19307977', '23392349', '22443992', '18713356', '10623852', '35957342', '35963122', '13410173', '13722513', '18697818', 
'30177570', '28462125', '1238900', '7415669', '22845230', '24548235', '13484097', '26228167', '26542880', '7636597', '31437636', '15717791', '32113294', '3529641', '8874564', '30625069', '1587371', '21798645', '32881040', '24517276', '33561480', '24845394', '17238204', '13543691', '8409352', '10955232', '25293708', '28449886', '17917566', '13319603', '3882099', '33397085', '27262601', '20446212', '21464366', '18159220', '35185807', '17697576', '11763051', '13556437', '24068972', '19532570', '30892022', '9501098', '22915233', '17230106', '17190305', '1990311', '32876436', '12649671', '20896811', '30304889', '10029930', '21496310', '49771', '21956204', '27209601', '6908529', '25695197', '26829551', '27865646', '13563020', '34625531', '13512949', '20326033', '18039636', '26819816', '1570814', '32307783', '25060416', '32053652', '13601883', '13536858', '6604736', '23217267', '10293381', '29348016', '28457773', '17857128', '13578028', '19232069', '26858928', '21842632', '2830061', '9902321', '23607381', '29560766', '5945479', '13413859', '9886587', '25849641', '30750943', '31342728', '26869421', '23291340', '30268317', '28220755', '10794976', '30329629', '24307863', '9480077', '13408579', '30243012', '26018476', '25124132', '29750146', '25568667', '22915525', '12885533', '19523475', '18878756', '18810188', '12970552', '3597767', '18729913', '25432891', '13574246', '17727547', '22756742', '3312730', '1689096', '13480671', '8930594', '17083703', '11736454', '32070902', '30050074', '8369717', '19376702', '13609263', '1508373', '23362134', '25815945', '23207076', '16046182', '20765998', '18914023', '3487145', '27859636', '25184100', '10790543', '32079434', '24845176', '6599441', '45497', '27224217', '22753803', '22565407', '29352237', '3399740', '16410758', '17673265', '18277369', '16281293', '23164819', '34878457', '13562891', '16052844', '25471933', '15840628', '25170697', '13537626', '36235191', '4666058', '17998865', '25034701', '17383918', '7769068', '24500712', 
'32862166', '23596051', '17240354', '24911900', '33249079', '29976332', '31693618', '25688111', '1772798', '8679437', '13032018', '11235712', '11735983', '11450073', '25202033', '32714405', '14856786', '2660', '19184431', '26267528', '15743073', '15869901', '87823', '13444844', '5971209', '18522512', '1050869', '18590594', '7338921', '23304136', '24143070', '31177549', '17907734', '2888212', '17289350', '18041074', '13563913', '21416527', '6219406', '15738367', '26190650', '25792193', '25142590', '27872610', '25617779', '7328985', '13598952', '19548976', '22882407', '17798613', '35666553', '33985876', '32716769', '26869457', '23399768', '10201650', '25069196', '2195228', '19542808', '15947905', '26219614', '12390063', '25555156', '16049104', '7260188', '7315452', '23112751', '22754100', '18398230', '20868113', '7693362', '18950883', '18139937', '25587116', '497318', '15717897', '2262746', '1611839', '23121026', '6342509', '23388229', '18952334', '2195230', '22758737', '18177642', '8659282', '20574633', '22716410', '6626071', '21143500', '13569253', '21480930', '10924618', '19345703', '17847583', '32613491', '23356642', '6663401', '30124071', '25557821', '21375409', '25080929', '24834780', '5473', '13394748', '20587921', '11337018', '17568374', '15742581', '18659505', '18397048', '15848517', '24544370', '18218748', '6129629', '13135514', '1141402', '17410339', '17450549', '17451711', '24510659', '6965175', '32327289', '27344676', '11290745', '22006767', '15743062', '26200934', '36364920', '10794937', '18680821', '6355432', '25989361', '8502965', '13319650', '19404278', '27810119', '13580753', '26788341', '31574430', '17351872', '2253168', '25392657', '29800365', '13509120', '7740998', '28447015', '32804803', '6748892', '24548709', '2297303', '25150558', '18340095', '18493426', '31864314', '10002196', '24420539', '27037673', '31114367', '22713014', '21841495', '11143580', '10754520', '32598849', '20929026', '9269688', '23362140', '23588396', '26099082', '7350825', 
'25665316', '23664702', '13617471', '17984552', '8564532', '15860671', '15768174', '6496061', '19509353', '32196106', '17280541', '29093221', '18040633', '12819583', '16116728', '31285614', '11417254', '3310083', '9325431', '23347055', '18488601', '26939805', '18218169', '23116317', '3188872', '12016313', '591213', '17258791', '14288439', '13511506', '18330911', '6388432', '13087601', '17458734', '16051061', '17611372', '31278743', '18044174', '8045417', '17341321', '13197222', '32309194', '19243721', '13490034', '18746499', '29372056', '16077194', '15821439', '35437810', '15714522', '25636642', '23809341', '16075864', '6130950', '8518234', '27262439', '2264130', '13563330', '25779553', '11985087', '18553527', '4266322', '22589511', '12305738', '22088011', '12396771', '13753712', '19097400', '24911818', '30140875', '49818', '15755549', '13420264', '28525374', '8409368', '8306857', '13414282', '23640217', '6980817', '29639950', '35391134', '7231642', '20806584', '25063777', '11058389', '11732640', '18271170', '18523261', '25923000', '15724164', '17666012', '13321335', '30182118', '7710333', '25912961', '11982320', '15701662', '26796330', '22241296', '25969640', '17884755', '13558189', '25698317', '13508601', '13625156', '17561098', '10327392', '17258713', '28109445', '28001785', '9662715', '18819489', '18410374', '25161218', '17951497', '30733913', '18962294', '35594963', '12038650', '101426', '25440045', '8828360', '2173683', '24501910', '17930465', '16005666', '17337509', '7419435', '24104539', '30770810', '865', '862272', '14785910', '17561086', '2738612', '34823580', '29213388', '14289689', '29479422', '29056083', '1781723', '22911409', '25630971', '20561797', '17204095', '23715973', '33878306', '20829029', '18597369', '9979062', '33843449', '32829500', '15711451', '3175211', '25506775', '30278752', '15727775', '25127455', '17730679', '13604494', '27219097', '14802335', '17608371', '25758236', '24204146', '18337477', '28789556', '22443481', '8333557', 
'23295309', '30972270', '1739991', '22677634', '30743499', '13512140', '27065545', '17457185', '29752927', '23019251', '21399905', '1546633', '35121839', '18966551', '22666173', '147476', '20555443', '18668534', '19258660', '18463614', '17118893', '32709452', '24617231', '6507425', '16275651', '13553854', '24941243', '12168042', '6418599', '333674', '19203179', '18052325', '32196073', '17698919', '6148028', '28550610', '13330943', '19248807', '11549187', '21797160', '3360617', '13376097', '18224374', '23201213', '22031664', '16154910', '14796823', '26076903', '31831496', '31200595', '21028179', '12955096', '11436054', '9637697', '20614318', '17456181', '7442412', '1137327', '14762491', '20742224', '9460487', '16039308', '16060617', '8391758', '20761204', '20821174', '17384001', '6475535', '1328498', '18071359', '22438404', '13456264', '13082496', '17666929', '15182675', '17159916', '13578034', '25066117', '7092083', '22550979', '22944242', '13687076', '1281235', '49773', '18005662', '30257299', '12359421', '22180850', '22044746', '32493793', '13588543', '22890011', '29238847', '20758225', '18007564', '25318594', '17853242', '25353247', '19660884', '25500233', '23507', '608867', '18871776', '18712499', '13649327', '70754', '17371845', '17564499', '21414828', '27557999', '22466620', '9809096', '14762463', '8479009', '27515628', '17563814', '8515383', '18798818', '24590557', '37449', '29967520', '20603758', '373676', '21887189', '10282869', '13344493', '22492901', '34399341', '18928661', '18634486', '6472227', '25215271', '49810', '28185384', '28415009', '16069305', '22844292', '12338163', '23363818', '21976014', '18039423', '13616052', '19043489', '2156863', '13425450', '15768774', '16177870', '7445780', '17907491', '32564969', '31843042', '25852860', '8284932', '11291520', '23382512', '4163418', '12187797', '34434581', '17255591', '9365923', '24892145', '25431622', '12827311', '24401111', '34052111', '26150650', '26819767', '16067349', '15829468', '185900', 
'31538363', '17322949', '11143600', '32500082', '13408075', '34858734', '36381037', '13068756', '10327421', '16039920', '35079565', '1498842', '16058418', '18481276', '12250585', '4266341', '27876118', '29069989', '33526658', '34230397', '9489489', '19064', '587862', '840211', '26256646', '983552', '16018772', '13645903', '35525168', '8101281', '17986986', '13589235', '17290810', '31561733', '27190381', '16156593', '33845852', '8846414', '6039746', '15983784', '13562570', '72193', '9171146', '32621564', '862267', '30968175', '1039463', '872', '9361589', '11344832', '9745341', '15751678', '23559858', '2711523', '2657', '295899', '34615137', '33312853', '35553817', '34999214', '1271159', '18386077', '4616627', '20543358', '13356854', '28680885', '18360625', '17407461', '2654', '18039923', '13410779', '12025202', '25258300', '17853376', '12269780', '23276112', '19101755', '19539657', '32570242', '8142865', '13574349', '13575634', '29587499', '12024', '13063912', '12382340', '9365986', '36110908', '21956078', '2264108', '18190123', '9629811', '13612277', '7014517', '6015185', '27858482', '31298704', '6573014', '22088945', '8722062', '16292420', '18086403', '7077211', '20307209', '6633956', '24111211', '29242483', '22703858', '24822879', '25309654', '11149647', '492286', '18770574', '18746856', '13373663', '17795506', '19546752', '6574803', '17880024', '3153393', '22019906', '10526094', '16157474', '6979801', '22024583', '25701321', '21527585', '21939020', '20934720', '7950150', '26092775', '6040381', '18710449', '18597421', '35233095', '3847055', '17343716', '18665249', '18684947', '17566731', '17161949', '19029961', '10282880', '35390821', '25794765', '26164740', '30838324', '18267158', '23396682', '29079827', '15743071', '13092575', '21525121', '17968365', '18362734', '31560002', '8013752', '26853362', '20307223', '13515350', '13452143', '15791627', '12539591', '7931168', '6626070', '10858973', '15704149', '24359582', '9997510', '13554812', '16130437', '12835106', 
'15784152', '25987480', '30853073', '23275109', '29914433', '526272', '15843793', '26170148', '14802374', '437151', '2198721', '17934906', '45495', '9212093', '13234679', '4478365', '10244455', '24718615', '858351', '13557703', '13026021', '34128781', '25755787', '27259778', '25120177', '1238901', '608858', '18040699', '9584810', '22873046', '49799', '12328749', '13603351', '28757261', '15717947', '23354313', '8120173', '1310182', '19100', '25370921', '18983451', '22038278', '23018804', '888945', '16127939', '5801016', '2397465', '25964074', '29543528', '6902967', '18998388', '29425003', '16280929', '16157187', '27223386', '2206723', '24317333', '936221', '23595964', '22919473', '2261214', '13335037', '7454063', '14569776', '13397021', '29082328', '24964815', '17230504', '29605141', '18401393', '17833818', '893135', '18517873', '25154774', '35608444', '33845528', '20806556', '28494414', '296820', '17557750', '30517639', '17308606', '26803527', '22370890', '9309796', '25477718', '24490481', '35378698', '8039099', '17152275', '14574973', '17845840', '12104761', '7095445', '34823579', '34510127', '21726837', '11738407', '3407971', '14739280', '6405758', '18190276', '23607404', '13509127', '2983323', '17857225', '15736550', '17540725', '15818092', '11112712', '17187024', '28176874', '13530788', '20309175', '22398290', '7645685', '25214284', '30982036', '25147847', '24612163', '8720518', '35383657', '29757622', '26856722', '16412296', '12027088', '17794847', '2287904', '23502642', '18219535', '22614515', '21934594', '9859820', '16125241', '8763821', '20447452', '31187505', '14290343', '32054384', '6906068', '23392562', '7364254', '2261213', '29496243', '27838766', '25310065', '9969572', '17859161', '27224408', '16692909', '17134626', '18815174', '13643133', '8409357', '9452482', '23551264', '30532762', '23315870', '29880709', '16084682', '27263459', '25231392', '20910195', '8574414', '24705880', '20945300', '22828783', '25804350', '18010943', '22263899', '20612067', 
'27396571', '13557136', '30078661', '22622278', '13582001', '12547520', '27168405', '15715789', '24781288', '27869625', '22455901', '20222637', '31417231', '742576', '23307757', '29747283', '23350847', '840208', '22602377', '18040125', '6984466', '24171268', '18206985', '24674676', '25758406', '17861287', '13164565', '18163629', '33622481', '14622803', '3102821', '13598806', '12061833', '36219831', '6071573', '22824462', '34318', '23527094', '29762289', '6577666', '24963328', '16283961', '26851201', '25270661', '28602838', '45499', '13613827', '2195227', '15747347', '22857439', '31030335', '871224', '18809355', '27278972', '31818515', '12004870', '12885649', '2397472', '21542483', '30295420', '8549012', '19063', '12509921', '25879546', '32572144', '10753689', '12979843', '21863194', '22071142', '20520487', '29481736', '20784229', '32795719', '23811616', '23704118', '13412974', '31682902', '3174348', '9397983', '6010848', '219069', '18539129', '13416045', '30632127', '18486818', '17416947', '35263629', '13690325', '18743497', '25739498', '29430226', '35163194', '14478151', '18132708', '22917699', '28363557', '34389262', '18620726', '165220', '2711524', '21904149', '13451546', '29636938', '22723275', '28946493', '26055123', '24862868', '18596991', '1808773', '25908605', '9692831', '31836443', '1261442', '32596240', '21473426', '29346191', '13617429', '2638810', '15745753', '171797', '24911870', '24559146', '15766140', '296827', '1167760', '12292612', '29070291', '15992524', '33128765', '5670194', '16058834', '25683005', '6053292', '309174', '20626209', '29743258', '27272264', '26522288', '17873797', '22009240', '18488649', '1649550', '26127556', '13164350', '15994655', '12384483', '29232480', '24845469', '11925410', '821787', '2699067', '17616021', '24298721', '26046350', '22019178', '27231015', '25728861', '31283253', '455033', '6089760', '9638604', '2200815', '28928335', '26802550', '9490473', '943315', '7941396', '10471154', '9717320', '18049021', '18901352', 
'22595857', '12207476', '25661280', '16637370', '23109031', '30964258', '13107968', '8313731', '13089144', '16135349', '31321368', '31314242', '23009744', '8663600', '20445993', '19631541', '31938451', '6772388', '2206724', '13648975', '32294046', '18624585', '16081583', '13219316', '2549275', '7040332', '19078806', '27796606', '24878742', '30525923', '9088786', '14793577', '17970762', '71000', '17917593', '18164704', '6432120', '13507226', '27246710', '28266126', '6588588', '24317970', '15836164', '13111334', '35609920', '3484606', '25016042', '17281930', '31823738', '13548787', '14760501', '13543908', '30525922', '28109798', '17466044', '31299247', '836042', '18044116', '19505295', '20958096', '16165124', '1904709', '32326262', '5514922', '12279616', '24856461', '17906268', '22880245', '31856423', '8128934', '20809922', '23547767', '26068759', '22586854', '26829136', '28525496', '7838236', '22091506', '23691247', '3242678', '34236100', '15743076', '25233603', '13796816', '15848087', '18273877', '18333638', '30365803', '33630768', '8332744', '25500715', '27885543', '26308557', '6080900', '32914485', '10385172', '816815', '13425880', '24801569', '28352417', '28003656', '30851062', '35957356', '9307599', '18246707', '12837725', '15987042', '28117517', '5999963', '7636293', '9456961', '13639293', '15719757', '22010842', '7546830', '24995028', '26887962', '18738782', '15753836', '26087530', '25946362', '15880', '12061809', '25418547', '29969536', '18498332', '26088065', '18866539', '69438', '5953576', '23513830', '23524746', '4687635', '22824478', '21424761', '25587975', '31681857', '10415854', '587400', '18104647', '25088196', '11990331', '13141608', '13497933', '518844', '16205681', '18410187', '11192675', '20498963', '18493699', '13560379', '31122073', '29356081', '6257502', '7176054', '415675', '10257528', '17466622', '34996826', '10611698', '13562630', '32672711', '18626858', '17201174'}
number of works with over 10000 reviews: 29
number of reviews for works with over 10000 reviews: 412905
100000 reviews parsed 2013 written
200000 reviews parsed 4103 written
300000 reviews parsed 6351 written
400000 reviews parsed 8555 written
500000 reviews parsed 10701 written
600000 reviews parsed 12692 written
700000 reviews parsed 14628 written
800000 reviews parsed 16675 written
900000 reviews parsed 18746 written
1000000 reviews parsed 20870 written
1100000 reviews parsed 22833 written
1200000 reviews parsed 24898 written
1300000 reviews parsed 26681 written
1400000 reviews parsed 28831 written
1500000 reviews parsed 30908 written
1600000 reviews parsed 32997 written
1700000 reviews parsed 35007 written
1800000 reviews parsed 36898 written
1900000 reviews parsed 38950 written
2000000 reviews parsed 40817 written
2100000 reviews parsed 42744 written
2200000 reviews parsed 44942 written
2300000 reviews parsed 47112 written
2400000 reviews parsed 49475 written
2500000 reviews parsed 51626 written
2600000 reviews parsed 53931 written
2700000 reviews parsed 55797 written
2800000 reviews parsed 57756 written
2900000 reviews parsed 59843 written
3000000 reviews parsed 62320 written
3100000 reviews parsed 64461 written
3200000 reviews parsed 66572 written
3300000 reviews parsed 68727 written
3400000 reviews parsed 70821 written
3500000 reviews parsed 72954 written
3600000 reviews parsed 75265 written
3700000 reviews parsed 77276 written
3800000 reviews parsed 79434 written
3900000 reviews parsed 81383 written
4000000 reviews parsed 83573 written
4100000 reviews parsed 85724 written
4200000 reviews parsed 87724 written
4300000 reviews parsed 89946 written
4400000 reviews parsed 92127 written
4500000 reviews parsed 94314 written
4600000 reviews parsed 96313 written
4700000 reviews parsed 98484 written
4800000 reviews parsed 100492 written
4900000 reviews parsed 102564 written
5000000 reviews parsed 104652 written
5100000 reviews parsed 106968 written
5200000 reviews parsed 109106 written
5300000 reviews parsed 111296 written
5400000 reviews parsed 113354 written
5500000 reviews parsed 115409 written
5600000 reviews parsed 117487 written
5700000 reviews parsed 119753 written
5800000 reviews parsed 122027 written
5900000 reviews parsed 124172 written
6000000 reviews parsed 126501 written
6100000 reviews parsed 128724 written
6200000 reviews parsed 130821 written
6300000 reviews parsed 133040 written
6400000 reviews parsed 135462 written
6500000 reviews parsed 137646 written
6600000 reviews parsed 139999 written
6700000 reviews parsed 142474 written
6800000 reviews parsed 144780 written
6900000 reviews parsed 147061 written
7000000 reviews parsed 149093 written
7100000 reviews parsed 151527 written
7200000 reviews parsed 153774 written
7300000 reviews parsed 156279 written
7400000 reviews parsed 158617 written
7500000 reviews parsed 161079 written
7600000 reviews parsed 163102 written
7700000 reviews parsed 165534 written
7800000 reviews parsed 167715 written
7900000 reviews parsed 169851 written
8000000 reviews parsed 171990 written
8100000 reviews parsed 174345 written
8200000 reviews parsed 176447 written
8300000 reviews parsed 178493 written
8400000 reviews parsed 180749 written
8500000 reviews parsed 182914 written
8600000 reviews parsed 185476 written
8700000 reviews parsed 187747 written
8800000 reviews parsed 190244 written
8900000 reviews parsed 192425 written
9000000 reviews parsed 194689 written
9100000 reviews parsed 196898 written
9200000 reviews parsed 199171 written
9300000 reviews parsed 201383 written
9400000 reviews parsed 203584 written
9500000 reviews parsed 205613 written
9600000 reviews parsed 208139 written
9700000 reviews parsed 210200 written
9800000 reviews parsed 212082 written
9900000 reviews parsed 214407 written
10000000 reviews parsed 216429 written
10100000 reviews parsed 218879 written
10200000 reviews parsed 221157 written
10300000 reviews parsed 223389 written
10400000 reviews parsed 225595 written
10500000 reviews parsed 227881 written
10600000 reviews parsed 230155 written
10700000 reviews parsed 232567 written
10800000 reviews parsed 234743 written
10900000 reviews parsed 237046 written
11000000 reviews parsed 239503 written
11100000 reviews parsed 241727 written
11200000 reviews parsed 244201 written
11300000 reviews parsed 246430 written
11400000 reviews parsed 248960 written
11500000 reviews parsed 251349 written
11600000 reviews parsed 253614 written
11700000 reviews parsed 255938 written
11800000 reviews parsed 258210 written
11900000 reviews parsed 260196 written
12000000 reviews parsed 262362 written
12100000 reviews parsed 264689 written
12200000 reviews parsed 267016 written
12300000 reviews parsed 269351 written
12400000 reviews parsed 271443 written
12500000 reviews parsed 273787 written
12600000 reviews parsed 276068 written
12700000 reviews parsed 278134 written
12800000 reviews parsed 280214 written
12900000 reviews parsed 282462 written
13000000 reviews parsed 284555 written
13100000 reviews parsed 286804 written
13200000 reviews parsed 289258 written
13300000 reviews parsed 291719 written
13400000 reviews parsed 294020 written
13500000 reviews parsed 296323 written
13600000 reviews parsed 298409 written
13700000 reviews parsed 300271 written
13800000 reviews parsed 302678 written
13900000 reviews parsed 304861 written
14000000 reviews parsed 307088 written
14100000 reviews parsed 309131 written
14200000 reviews parsed 311246 written
14300000 reviews parsed 313573 written
14400000 reviews parsed 315823 written
14500000 reviews parsed 318158 written
14600000 reviews parsed 320361 written
14700000 reviews parsed 322520 written
14800000 reviews parsed 324702 written
14900000 reviews parsed 326827 written
15000000 reviews parsed 330967 written
15100000 reviews parsed 335467 written
15200000 reviews parsed 340097 written
15300000 reviews parsed 345419 written
15400000 reviews parsed 349807 written
15500000 reviews parsed 354202 written
15600000 reviews parsed 358680 written
15700000 reviews parsed 362824 written
15739967 reviews parsed 364653 written
In [77]:
counts[counts > threshold]

print(len(book_ids), len(work_ids), len(mapping.keys()))
review_df.columns
0 0 0
Out[77]:
Index(['user_id', 'book_id', 'review_id', 'rating', 'date_added',
       'date_updated', 'read_at', 'started_at', 'n_votes', 'n_comments',
       'review_length', 'work_id'],
      dtype='object')
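The next cell selects the reviewers with more than a threshold number of reviews and keeps only their reviews. As a minimal sketch of that selection logic, the example below runs on a handful of toy records. It uses `collections.Counter` from the standard library instead of the pandas `value_counts` call in the notebook. The names `toy_reviews`, `select_prolific_users` and `filter_reviews` are hypothetical and only serve the illustration.

```python
from collections import Counter

# Toy stand-in for the review records; each dict mimics one row of the review file.
toy_reviews = [
    {'user_id': 'u1', 'book_id': '10', 'work_id': '100'},
    {'user_id': 'u1', 'book_id': '11', 'work_id': '101'},
    {'user_id': 'u1', 'book_id': '12', 'work_id': '102'},
    {'user_id': 'u2', 'book_id': '10', 'work_id': '100'},
    {'user_id': 'u2', 'book_id': '13', 'work_id': '103'},
    {'user_id': 'u3', 'book_id': '14', 'work_id': '104'},
]


def select_prolific_users(reviews, threshold):
    """Return the set of user ids with more than `threshold` reviews."""
    counts = Counter(review['user_id'] for review in reviews)
    return {user_id for user_id, count in counts.items() if count > threshold}


def filter_reviews(reviews, user_ids):
    """Yield only the reviews written by one of the given users."""
    for review in reviews:
        if review['user_id'] in user_ids:
            yield review


# The notebook uses a threshold of 5000; 2 is chosen here to fit the toy data.
prolific = select_prolific_users(toy_reviews, threshold=2)
kept = list(filter_reviews(toy_reviews, prolific))
print(sorted(prolific), len(kept))
```

Holding the selected user ids in a set keeps the membership test cheap when streaming through millions of reviews.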
In [88]:
counts = review_df.user_id.value_counts()

threshold = 5000
# Series.iteritems() was removed in pandas 2.0; items() is the equivalent call.
users_above_5k = [user_id for user_id, count in counts[counts > threshold].items()]
book_ids = list(review_df[review_df.user_id.isin(users_above_5k)].book_id)
work_ids = list(review_df[review_df.user_id.isin(users_above_5k)].work_id)
print(users_above_5k)
print(len(book_ids))
print(len(work_ids))
mapping = {str(book_id): work_id for book_id, work_id in zip(book_ids, work_ids)}
print(f'number of users with over {threshold} reviews:', len(counts[counts > threshold]))
print(f'number of reviews for users with over {threshold} reviews:', sum(counts[counts > threshold]))

data_dir = '/Volumes/Samsung_T5/Data/Book-Reviews/GoodReads/'
sample_review_text_file = os.path.join(data_dir, 'goodreads_reviews-reviewers_above_5k_reviews.csv.gz') # includes text

# Each column is listed once; 'review_lang' was previously repeated, which wrote the language column twice.
headers = [
    'user_id', 'book_id', 'work_id', 'review_id', 'rating', 'date_added', 'date_updated', 'read_at', 'started_at',
    'n_votes', 'n_comments', 'review_length', 'review_lang', 'review_text'
]


with gzip.open(sample_review_text_file, 'wt') as fh:
    writer = csv.writer(fh, delimiter='\t')
    writer.writerow(headers)
    written = 0
    for ri, review in enumerate(read_csv(review_text_file)):
        if (ri+1) % 100000 == 0:
            print(ri+1, 'reviews parsed', written, 'written')
        if review['user_id'] not in users_above_5k:
            continue
        try:
            review['work_id'] = mapping[review['book_id']]
        except KeyError:
            continue
        written += 1
        review['review_lang'] = lang_detect(review)
        row = [review[header] for header in headers]
        writer.writerow(row)
print(ri+1, 'reviews parsed', written, 'written')
['a2d6dd1685e5aa0a72c9410f8f55e056', '459a6c4decf925aedd08e45045c0d8c6', '4922591667fd3e8adc0c5e3d42cf557a', 'dd9785b14664103617304996541ed77a', '843a44e2499ba9362b47a089b0b0ce75', '9003d274774f4c47e62f77600b08ac1d', 'b7772313835ce6257a3fbe7ad2649a29', '8bb031b637de69eba020a8a466d1110b', '8e7e5b546a63cb9add8431ee6914cf59', '6ac35fe952c608da50153d64f616291b', '795595616d3dbd81bd16b617c9a1fa48', 'a45fb5d39a6a9857ff8362900790510a', '60982541be85a0611e9634b4f63d0cb0', '97e2ce2141fa1c880967d78aec3c14fa', '422e76592e2717d5d59465d22d74d47c', '59151b639f247aa97fffd5c71701db29', 'e5905d648022af7b1309d82a1f4d255b', '37b3e60b4e4152c580fd798d405150ff', 'd8c39b3b11bb2da1c1d5c39f49669dea']
148355
148355
number of users with over 5000 reviews: 19
number of reviews for users with over 5000 reviews: 148355
100000 reviews parsed 3848 written
200000 reviews parsed 3848 written
300000 reviews parsed 3848 written
400000 reviews parsed 9192 written
500000 reviews parsed 9192 written
600000 reviews parsed 9192 written
700000 reviews parsed 13847 written
800000 reviews parsed 13847 written
900000 reviews parsed 13847 written
1000000 reviews parsed 13847 written
1100000 reviews parsed 21612 written
1200000 reviews parsed 21612 written
1300000 reviews parsed 31667 written
1400000 reviews parsed 31667 written
1500000 reviews parsed 31667 written
1600000 reviews parsed 31667 written
1700000 reviews parsed 31667 written
1800000 reviews parsed 36645 written
1900000 reviews parsed 36645 written
2000000 reviews parsed 36645 written
2100000 reviews parsed 36645 written
2200000 reviews parsed 36645 written
2300000 reviews parsed 36645 written
2400000 reviews parsed 36645 written
2500000 reviews parsed 36645 written
2600000 reviews parsed 36645 written
2700000 reviews parsed 46162 written
2800000 reviews parsed 46162 written
2900000 reviews parsed 46162 written
3000000 reviews parsed 46162 written
3100000 reviews parsed 46162 written
3200000 reviews parsed 46162 written
3300000 reviews parsed 46162 written
3400000 reviews parsed 46162 written
3500000 reviews parsed 46162 written
3600000 reviews parsed 46162 written
3700000 reviews parsed 46162 written
3800000 reviews parsed 46162 written
3900000 reviews parsed 46162 written
4000000 reviews parsed 46162 written
4100000 reviews parsed 46162 written
4200000 reviews parsed 46162 written
4300000 reviews parsed 46162 written
4400000 reviews parsed 46162 written
4500000 reviews parsed 46162 written
4600000 reviews parsed 46162 written
4700000 reviews parsed 46162 written
4800000 reviews parsed 53718 written
4900000 reviews parsed 53718 written
5000000 reviews parsed 53718 written
5100000 reviews parsed 53718 written
5200000 reviews parsed 53718 written
5300000 reviews parsed 53718 written
5400000 reviews parsed 53718 written
5500000 reviews parsed 53718 written
5600000 reviews parsed 53718 written
5700000 reviews parsed 53718 written
5800000 reviews parsed 53718 written
5900000 reviews parsed 53718 written
6000000 reviews parsed 53718 written
6100000 reviews parsed 53718 written
6200000 reviews parsed 53718 written
6300000 reviews parsed 59065 written
6400000 reviews parsed 59065 written
6500000 reviews parsed 59065 written
6600000 reviews parsed 59065 written
6700000 reviews parsed 59065 written
6800000 reviews parsed 59065 written
6900000 reviews parsed 61004 written
7000000 reviews parsed 64257 written
7100000 reviews parsed 64257 written
7200000 reviews parsed 64257 written
7300000 reviews parsed 64257 written
7400000 reviews parsed 64257 written
7500000 reviews parsed 64257 written
7600000 reviews parsed 64257 written
7700000 reviews parsed 64257 written
7800000 reviews parsed 64257 written
7900000 reviews parsed 64257 written
8000000 reviews parsed 64257 written
8100000 reviews parsed 64257 written
8200000 reviews parsed 64257 written
8300000 reviews parsed 69248 written
8400000 reviews parsed 69248 written
8500000 reviews parsed 69248 written
8600000 reviews parsed 69248 written
8700000 reviews parsed 69248 written
8800000 reviews parsed 69248 written
8900000 reviews parsed 69248 written
9000000 reviews parsed 69248 written
9100000 reviews parsed 69248 written
9200000 reviews parsed 69248 written
9300000 reviews parsed 69248 written
9400000 reviews parsed 69248 written
9500000 reviews parsed 74090 written
9600000 reviews parsed 74090 written
9700000 reviews parsed 79621 written
9800000 reviews parsed 101432 written
9900000 reviews parsed 101432 written
10000000 reviews parsed 101432 written
10100000 reviews parsed 101432 written
10200000 reviews parsed 101432 written
10300000 reviews parsed 106380 written
10400000 reviews parsed 106380 written
10500000 reviews parsed 106380 written
10600000 reviews parsed 106380 written
10700000 reviews parsed 106380 written
10800000 reviews parsed 106380 written
10900000 reviews parsed 106380 written
11000000 reviews parsed 106380 written
11100000 reviews parsed 106380 written
11200000 reviews parsed 106380 written
11300000 reviews parsed 106380 written
11400000 reviews parsed 106380 written
11500000 reviews parsed 111476 written
11600000 reviews parsed 111476 written
11700000 reviews parsed 111476 written
11800000 reviews parsed 111476 written
11900000 reviews parsed 111476 written
12000000 reviews parsed 111476 written
12100000 reviews parsed 111476 written
12200000 reviews parsed 111476 written
12300000 reviews parsed 119416 written
12400000 reviews parsed 119416 written
12500000 reviews parsed 119416 written
12600000 reviews parsed 119416 written
12700000 reviews parsed 119416 written
12800000 reviews parsed 119416 written
12900000 reviews parsed 119416 written
13000000 reviews parsed 119416 written
13100000 reviews parsed 119416 written
13200000 reviews parsed 119416 written
13300000 reviews parsed 119416 written
13400000 reviews parsed 119416 written
13500000 reviews parsed 119416 written
13600000 reviews parsed 119416 written
13700000 reviews parsed 126470 written
13800000 reviews parsed 126470 written
13900000 reviews parsed 126470 written
14000000 reviews parsed 126470 written
14100000 reviews parsed 126470 written
14200000 reviews parsed 126470 written
14300000 reviews parsed 126470 written
14400000 reviews parsed 126470 written
14500000 reviews parsed 126470 written
14600000 reviews parsed 126470 written
14700000 reviews parsed 126470 written
14800000 reviews parsed 130938 written
14900000 reviews parsed 130938 written
15000000 reviews parsed 130938 written
15100000 reviews parsed 130938 written
15200000 reviews parsed 130938 written
15300000 reviews parsed 130938 written
15400000 reviews parsed 130938 written
15500000 reviews parsed 130938 written
15600000 reviews parsed 130938 written
15700000 reviews parsed 130938 written
15739967 reviews parsed 130938 written
In [87]:
 
Out[87]:
user_id book_id review_id rating date_added date_updated read_at started_at n_votes n_comments review_length work_id
In [56]:
import csv
import gzip

import numpy as np

data_dir = '/Volumes/Samsung_T5/Data/Book-Reviews/GoodReads/'
sample_review_text_file = os.path.join(data_dir, 'goodreads_reviews-random_sample_1M.csv.gz') # includes text

headers = [
    'user_id', 'book_id', 'work_id', 'review_id', 'rating', 'date_added', 'date_updated', 'read_at', 'started_at', 
    'n_votes', 'n_comments', 'review_length', 'review_lang', 'review_text'
]

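# Target sample size and the resulting per-review probability of being kept.
# Each review is kept independently, so the sample holds roughly (not exactly) 1M reviews.
threshold = 1000000
prob_threshold = threshold / len(review_df)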

# newline='' leaves line-ending handling to the csv writer
with gzip.open(sample_review_text_file, 'wt', newline='') as fh:
    writer = csv.writer(fh, delimiter='\t')
    writer.writerow(headers)
    for review in read_csv(review_text_file):
        if np.random.rand() > prob_threshold:
            continue
        review['review_lang'] = lang_detect(review)
        row = [review[header] for header in headers]
        writer.writerow(row)
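
The cell above keeps each review independently with probability threshold / len(review_df), so the sample size is only approximately the target. A minimal standalone sketch of that sampling scheme follows. It assumes nothing about the notebook's own helpers: the function name bernoulli_sample, its parameters and the toy input are hypothetical, and it uses the standard-library random module with a fixed seed instead of np.random so the run is repeatable.

```python
import random


def bernoulli_sample(items, target_size, total, seed=None):
    """Yield each item independently with probability target_size / total.

    Hypothetical helper for illustration; the number of yielded items
    varies around target_size from run to run (or from seed to seed).
    """
    rng = random.Random(seed)
    keep_prob = target_size / total
    for item in items:
        # Same decision rule as the cell above: skip when the draw exceeds keep_prob.
        if rng.random() > keep_prob:
            continue
        yield item


# Toy stream standing in for the review file: 100,000 items, target of 1,000.
sample = list(bernoulli_sample(range(100_000), target_size=1_000, total=100_000, seed=42))
print(len(sample))  # close to 1,000, but usually not exactly 1,000
```

Because items are yielded in stream order, the sample preserves the order of the input file. If an exact sample size were required, a different scheme (for example reservoir sampling) would be needed; that is not what the cell above does.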