veryscrape package

Submodules

veryscrape.cli module

Console script for veryscrape.

veryscrape.items module

class veryscrape.items.Item(content='', topic='', source='', created_at=None)[source]

Bases: object

class veryscrape.items.ItemGenerator(q, topic='', source='')[source]

Bases: object

cancel()[source]
filter(text)[source]
max_seen_items = 50000
process_text(text)[source]
process_time(text)[source]
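
A minimal sketch of a custom generator, assuming the usual pattern of subclassing ItemGenerator and overriding the hooks listed above; the filtering rule and its truthy-means-keep semantics are illustrative assumptions, not part of the documented API:

    import asyncio
    from veryscrape.items import ItemGenerator

    class HeadlineGenerator(ItemGenerator):
        def process_text(self, text):
            # Normalize whitespace before the text is wrapped into an Item
            return ' '.join(text.split())

        def filter(self, text):
            # Hypothetical rule: drop very short fragments
            # (assumes a truthy return value means "keep")
            return len(text) > 20

    queue = asyncio.Queue()
    gen = HeadlineGenerator(queue, topic='finance', source='article')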

veryscrape.process module

veryscrape.process.clean_article(content)[source]

Converts HTML text into article text

veryscrape.process.clean_tweet(content)[source]

Unescapes and replaces mentions and hashtags with static tokens (@ - MENTION, # - HASHTAG)

veryscrape.process.clean_reddit_comment(content)[source]

Replace subreddit paths and user paths with static tokens (/r/… - SUBREDDIT)

veryscrape.process.clean_general(content)[source]

Removes any urls, non-ASCII text and redundant spaces, and normalizes swear words
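
For illustration, the per-source cleaners above can be called directly; the input strings here are made up:

    from veryscrape.process import clean_tweet, clean_general

    raw = '@alice loves #python https://t.co/abc123'
    text = clean_tweet(raw)     # mentions/hashtags become static tokens
    text = clean_general(text)  # urls, non-ASCII text and extra spaces removed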

veryscrape.process.clean_item(item)[source]

Clean an item of undesirable data.

Parameters:item – item to clean with all functions registered to item.source
Returns:cleaned item

veryscrape.process.register(name, *funcs)[source]

Register a cleaning function so it is run automatically on items with source 'name'.

Parameters:
  • name – name of data source (e.g. 'twitter')
  • funcs – cleaning functions to apply
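
Putting register and clean_item together, a hedged sketch of wiring cleaners to a source (the 'blog' source name is hypothetical):

    from veryscrape.items import Item
    from veryscrape.process import (
        register, clean_article, clean_general, clean_item,
    )

    # Run clean_article then clean_general on every item with source 'blog'
    register('blog', clean_article, clean_general)

    item = Item(content='<p>Some <b>html</b> body</p>',
                topic='tech', source='blog')
    item = clean_item(item)  # applies everything registered for item.source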

veryscrape.process.unregister(name, *funcs)[source]

Unregister a function registered with veryscrape.process.register.

Parameters:
  • name – name of data source (e.g. 'twitter')
  • funcs – cleaning functions to remove

veryscrape.process.classify_text(text, topic_query_dict)[source]

Attempts to classify a text based on query strings organized by topic. (Note: this is meant to be very fast, for use in a web spider.)

Parameters:
  • text – text to classify
  • topic_query_dict – dict of topics and queries, e.g. {'t1': ['q1', 'q2'], 't2': ['q3'], ...}
Returns:the topic the text belongs to
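
A small usage sketch; the topics and queries are invented, and the exact matching rules (e.g. case sensitivity) are not specified above:

    from veryscrape.process import classify_text

    topics = {'bitcoin': ['btc', 'bitcoin'], 'ethereum': ['eth', 'ether']}
    topic = classify_text('btc just hit a new high', topics)
    # topic should be 'bitcoin' for this input
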
veryscrape.process.extract_urls(text)[source]

Extract urls from a given text.

Parameters:text – text to extract urls from
Returns:set of urls

veryscrape.process.remove_urls(text, remove={':', ';', '[', ']', ' ', '}', '(', ')', '{'})[source]

Removes all urls present in a text (the urls themselves are discarded rather than returned).

Parameters:
  • text – text to clean urls from
  • remove – break characters for a url
Returns:text clean of urls
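
The two url helpers side by side, with an invented input:

    from veryscrape.process import extract_urls, remove_urls

    text = 'see https://example.com/a and http://example.org/b'
    urls = extract_urls(text)   # set of both urls
    clean = remove_urls(text)   # same text with the urls stripped out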

veryscrape.scrape module

class veryscrape.scrape.Scraper(*args, proxy_pool=None, **kwargs)[source]

Bases: abc.ABC

close()[source]
item_gen

alias of veryscrape.items.ItemGenerator

scrape(query, topic='', **kwargs)[source]
scrape_continuously(query, topic='', **kwargs)[source]
scrape_every = 300
session_class

alias of veryscrape.session.Session

source = ''
stream(query, topic='', **kwargs)[source]
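
A hedged sketch of a concrete scraper. Which methods Scraper leaves abstract is not shown above, so this assumes scrape is the main override and that it is a coroutine (the package is asyncio-based); the source name and interval are illustrative:

    from veryscrape.scrape import Scraper
    from veryscrape.items import ItemGenerator

    class MyFeedScraper(Scraper):
        source = 'myfeed'         # label attached to scraped items
        scrape_every = 600        # seconds between scrape runs
        item_gen = ItemGenerator  # generator class used to wrap results

        async def scrape(self, query, topic='', **kwargs):
            # Fetch and enqueue items for `query` here (endpoint omitted)
            ...
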
class veryscrape.scrape.SearchEngineScraper(*args, proxy_pool=None, **kwargs)[source]

Bases: veryscrape.scrape.Scraper

bad_domains = {'.com/', '.biz/', '.org/', '.edu/', '.net/', '.gov/'}
static clean_urls(urls)[source]

Generator for removing useless or uninteresting urls from an iterable of urls

extract_urls(text)[source]
false_urls = {'googleusercontent.', 'youtube.', 'googlenewsblog.', 'google.', 'blogger.'}
query_string(query)[source]
scrape(query, topic='', **kwargs)[source]
scrape_every = 900
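
clean_urls is a static method, so it can be exercised without instantiating a scraper; the candidate urls are invented:

    from veryscrape.scrape import SearchEngineScraper

    candidates = ['https://news.example.com/story',
                  'https://www.youtube.com/watch?v=x']
    good = list(SearchEngineScraper.clean_urls(candidates))
    # urls matching false_urls (youtube., google., ...) are filtered out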

veryscrape.session module

exception veryscrape.session.FetchError[source]

Bases: Exception

Exception raised when a fetch request fails despite retries. Used for flow control of circuit-breaker logic in Scraper.
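
A sketch of the intended flow control, assuming Session.fetch is a coroutine (consistent with the asyncio design of the package):

    from veryscrape.session import FetchError

    async def fetch_or_skip(session, url):
        try:
            return await session.fetch('GET', url)
        except FetchError:
            return None  # retries exhausted; skip rather than crash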

class veryscrape.session.OAuth1(client, secret, token, token_secret)[source]

Bases: object

oauth_params

Returns a dictionary of OAuth1 parameters required for an OAuth1-signed HTTP request

patch_request(method, url, params, kwargs)[source]
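
Constructing the signer from the documented signature; the credential values are placeholders:

    from veryscrape.session import OAuth1

    auth = OAuth1(client='consumer-key', secret='consumer-secret',
                  token='access-token', token_secret='access-token-secret')
    params = auth.oauth_params  # oauth_* fields for signing the next request
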
class veryscrape.session.OAuth1Session(*args, **kwargs)[source]

Bases: veryscrape.session.Session

base_url = None
class veryscrape.session.OAuth2(client, secret, token_url)[source]

Bases: object

auth_token()[source]

Returns a dictionary of OAuth2 parameters required for an OAuth2-signed HTTP request

oauth2_token_expired

Returns True if the current OAuth2 token needs to be refreshed

patch_request(method, url, params, kwargs)[source]
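
Similarly for OAuth2; the token url is a placeholder, and whether auth_token is a plain call or a coroutine is left open here as an assumption:

    from veryscrape.session import OAuth2

    auth = OAuth2(client='my-client-id', secret='my-secret',
                  token_url='https://provider.example/oauth2/token')
    if auth.oauth2_token_expired:
        ...  # refresh via auth_token() before making signed requests
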
class veryscrape.session.OAuth2Session(*args, **kwargs)[source]

Bases: veryscrape.session.OAuth1Session

class veryscrape.session.RateLimiter(rate_limits, period)[source]

Bases: object

get_limit(url, parent=None, metadata=None)[source]
refresh_limits(rate_limit, metadata)[source]
wait_limit(url)[source]
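
A sketch of throttling with RateLimiter; the rate_limits mapping format is assumed (url pattern to allowed requests per period), and wait_limit is assumed to be awaitable:

    from veryscrape.session import RateLimiter

    limiter = RateLimiter({'api.example.com': 15}, period=60)

    async def throttled_get(session, url):
        await limiter.wait_limit(url)  # block until a request slot frees up
        return await session.fetch('GET', url)
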
class veryscrape.session.Session(*args, proxy_pool=None, **kwargs)[source]

Bases: object

error_on_failure = True
fetch(method, url, *, params=None, stream_func=None, **kwargs)[source]
on_error(error_code)[source]

Called when a request returns a non-200 status code.

Parameters:error_code – status code of the http request

persist_user_agent = True
rate_limit_period = 60
rate_limits = {}
request(method, url, **kwargs)[source]
retries_to_error = 5
sleep_increment = 15
user_agent = None
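
The class attributes above are the tuning knobs, so a specialized session can be declared by overriding them; the values and the rate_limits format here are illustrative:

    from veryscrape.session import Session

    class GentleSession(Session):
        rate_limits = {'search': 10}  # format assumed, see RateLimiter above
        rate_limit_period = 60        # seconds over which rate_limits apply
        retries_to_error = 3          # raise FetchError after 3 failed tries
        sleep_increment = 30          # back-off step between retries (seconds)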

veryscrape.veryscrape module

class veryscrape.veryscrape.VeryScrape(q, loop=None)[source]

Bases: object

Many API, much data, VeryScrape!

Parameters:
  • q – Queue to output data gathered from scraping
  • loop – Event loop to run the scraping
close()[source]
create_all_scrapers_and_streams(config)[source]

Creates all scrapers and stream functions associated with a config. A scraper is a class inheriting from scrape.Scraper; a stream function is a function returning an items.ItemGenerator.

Parameters:config – scrape configuration
Returns:list of scrapers, list of stream functions
scrape(config, *, n_cores=1, max_items=0, max_age=None)[source]

Scrape, process and organize data on the web based on a scrape config.

Parameters:
  • config – dict: scrape configuration. This is a map of scrape sources to data gathering information. The basic scheme is as follows (see examples for a real example):

        {
            "source1": {
                "first_authentication|split|by|pipe": {
                    "topic1": ["query1", "query2"],
                    "topic2": ["query3", "query4"]
                },
                "second_authentication|split|by|pipe": {
                    "topic3": ["query5", "query6"]
                }
            },
            "source2": ...
        }

  • n_cores – number of cores to use for processing data. Set to 0 to use all available cores; set to -1 to disable processing.
  • max_items
  • max_age
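
An end-to-end sketch built from the signatures above. The twitter key follows the "split|by|pipe" scheme but the credentials are placeholders; scrape is assumed to be a coroutine that runs until stopped, and close() is assumed to stop it:

    import asyncio
    from veryscrape import VeryScrape

    config = {
        'twitter': {
            'key|secret|token|token_secret': {
                'bitcoin': ['btc', 'bitcoin'],
            },
        },
    }

    async def main():
        q = asyncio.Queue()
        vs = VeryScrape(q)
        task = asyncio.ensure_future(vs.scrape(config, n_cores=1))
        item = await q.get()  # first scraped Item(content, topic, source)
        vs.close()            # stop scraping once we have what we need
        await asyncio.gather(task, return_exceptions=True)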

veryscrape.veryscrape.register(name, scraper, classify=False)[source]

Register a scraper class so it is created automatically from keys in VeryScrape.config when VeryScrape is run.

Parameters:
  • name – name of data source (e.g. 'twitter')
  • scraper – scraper class
  • classify – whether the scraper needs to classify text topics afterwards
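
Registering a custom source, reusing the hypothetical scraper sketched under veryscrape.scrape above:

    from veryscrape.veryscrape import register
    from veryscrape.scrape import Scraper

    class MyFeedScraper(Scraper):
        source = 'myfeed'

    # 'myfeed' keys in a scrape config now instantiate MyFeedScraper;
    # classify=True would ask VeryScrape to infer item topics afterwards
    register('myfeed', MyFeedScraper, classify=False)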

veryscrape.veryscrape.unregister(name)[source]

Unregister a scraper class registered with veryscrape.register.

Parameters:name – name of data source (e.g. 'twitter')

veryscrape.wrappers module

class veryscrape.wrappers.GeneratorWrapper(item_gen, loop=None)[source]

Bases: object

cancel()[source]
get()[source]
put(item)[source]
class veryscrape.wrappers.ItemMerger(*item_gens)[source]

Bases: object

cancel()[source]
class veryscrape.wrappers.ItemProcessor(items, n_cores=1, loop=None)[source]

Bases: veryscrape.wrappers.GeneratorWrapper

cancel()[source]
classify(topic_query_dict)

Attempts to classify a text based on query strings organized by topic. (Note: this is meant to be very fast, for use in a web spider.)

Parameters:
  • text – text to classify
  • topic_query_dict – dict of topics and queries, e.g. {'t1': ['q1', 'q2'], 't2': ['q3'], ...}
Returns:the topic the text belongs to
put(item)[source]
update_topics(**topics_by_source)[source]

Update local topics by source for use in classification of items.

Parameters:topics_by_source – dict[list]: associated queries by topic

class veryscrape.wrappers.ItemSorter(items, max_items=None, max_age=None, loop=None)[source]

Bases: veryscrape.wrappers.GeneratorWrapper

get()[source]
put(item)[source]
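
The wrappers compose into a pipeline; that each one accepts the previous as its items argument is an assumption read off the signatures above:

    import asyncio
    from veryscrape.items import ItemGenerator
    from veryscrape.wrappers import ItemMerger, ItemProcessor, ItemSorter

    q1, q2 = asyncio.Queue(), asyncio.Queue()
    gen1 = ItemGenerator(q1, topic='t1', source='twitter')
    gen2 = ItemGenerator(q2, topic='t2', source='reddit')

    merged = ItemMerger(gen1, gen2)               # fan-in of item generators
    processed = ItemProcessor(merged, n_cores=1)  # clean/classify items
    newest = ItemSorter(processed, max_items=1000, max_age=3600)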

Module contents

class veryscrape.VeryScrape(q, loop=None)[source]

Bases: object

Many API, much data, VeryScrape!

Parameters:
  • q – Queue to output data gathered from scraping
  • loop – Event loop to run the scraping
close()[source]
create_all_scrapers_and_streams(config)[source]

Creates all scrapers and stream functions associated with a config. A scraper is a class inheriting from scrape.Scraper; a stream function is a function returning an items.ItemGenerator.

Parameters:config – scrape configuration
Returns:list of scrapers, list of stream functions
scrape(config, *, n_cores=1, max_items=0, max_age=None)[source]

Scrape, process and organize data on the web based on a scrape config.

Parameters:
  • config – dict: scrape configuration. This is a map of scrape sources to data gathering information. The basic scheme is as follows (see examples for a real example):

        {
            "source1": {
                "first_authentication|split|by|pipe": {
                    "topic1": ["query1", "query2"],
                    "topic2": ["query3", "query4"]
                },
                "second_authentication|split|by|pipe": {
                    "topic3": ["query5", "query6"]
                }
            },
            "source2": ...
        }

  • n_cores – number of cores to use for processing data. Set to 0 to use all available cores; set to -1 to disable processing.
  • max_items
  • max_age

veryscrape.register(name, scraper, classify=False)[source]

Register a scraper class so it is created automatically from keys in VeryScrape.config when VeryScrape is run.

Parameters:
  • name – name of data source (e.g. 'twitter')
  • scraper – scraper class
  • classify – whether the scraper needs to classify text topics afterwards

veryscrape.unregister(name)[source]

Unregister a scraper class registered with veryscrape.register.

Parameters:name – name of data source (e.g. 'twitter')