veryscrape package
Subpackages
Submodules
veryscrape.cli module
Console script for veryscrape.
veryscrape.items module
veryscrape.process module
veryscrape.process.clean_tweet(content)
Unescapes the content and replaces mentions and hashtags with static tokens (@… to MENTION, #… to HASHTAG).
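To illustrate the token replacement described above, a minimal sketch might look like the following. It assumes that "unescapes" refers to HTML entities and that mentions and hashtags can be matched with simple regular expressions; the name `clean_tweet_sketch` and both patterns are hypothetical and are not taken from veryscrape's source.

```python
import html
import re

# Hypothetical patterns; the patterns veryscrape actually uses may differ.
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def clean_tweet_sketch(content: str) -> str:
    """Unescape HTML entities, then swap mentions and hashtags for static tokens."""
    # Unescape first so numeric entities such as &#39; are not read as hashtags.
    content = html.unescape(content)
    content = MENTION_RE.sub("MENTION", content)
    return HASHTAG_RE.sub("HASHTAG", content)
```

Unescaping before matching is a deliberate choice in this sketch: numeric entities contain a `#` and would otherwise be mistaken for hashtags.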
veryscrape.process.clean_reddit_comment(content)
Replaces subreddit paths and user paths with static tokens (/r/… to SUBREDDIT).
veryscrape.process.clean_general(content)
Removes any URLs, non-ASCII text and redundant spaces, and normalizes swearwords.
veryscrape.process.clean_item(item)
Clean an item of undesirable data.
Parameters: item – item to clean with all functions registered to item.source
Returns: cleaned item
veryscrape.process.register(name, *funcs)
Register cleaning functions so they are run automatically on items with source 'name'.
Parameters: - name – name of data source (e.g. 'twitter')
- funcs – cleaning functions to apply
veryscrape.process.unregister(name, *funcs)
Unregister functions registered with 'veryscrape.process.register'.
Parameters: - name – name of data source (e.g. 'twitter')
- funcs – cleaning functions to remove
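The way `register`, `unregister` and `clean_item` fit together can be sketched with a small registry that maps a source name to an ordered list of cleaning functions. This is an illustration under assumptions, not veryscrape's implementation: the `Item` class, its `content` field and the `_registry` dict are hypothetical stand-ins (only `item.source` is named in the documentation above).

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical registry: source name -> ordered list of cleaning functions.
_registry = defaultdict(list)


@dataclass
class Item:
    """Stand-in for a scraped item; the `content` field is an assumption."""
    content: str
    source: str


def register(name, *funcs):
    """Add cleaning functions for a source, skipping ones already registered."""
    for func in funcs:
        if func not in _registry[name]:
            _registry[name].append(func)


def unregister(name, *funcs):
    """Remove the given cleaning functions from a source's list, if present."""
    for func in funcs:
        if func in _registry[name]:
            _registry[name].remove(func)


def clean_item(item):
    """Run every function registered for item.source over the item's content."""
    for func in _registry.get(item.source, []):
        item.content = func(item.content)
    return item
```

For example, after `register("twitter", str.strip, str.lower)`, cleaning an item whose source is `"twitter"` strips and lowercases its content, while items from unregistered sources pass through unchanged.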
veryscrape.process.classify_text(text, topic_query_dict)
Attempts to classify a text based on query strings organized by topic. (Note: this is meant to be very fast, for use in a web spider.)
Parameters: - text – text to classify
- topic_query_dict – dict of topics and queries, e.g. {'t1': ['q1', 'q2'], 't2': ['q3'], …}
Returns: the topic the text belongs to
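One simple way to classify a text against a topic/query dict is to count query occurrences per topic and pick the topic with the most hits. The sketch below takes that approach; the scoring rule, the case-insensitive matching and the name `classify_text_sketch` are assumptions made for illustration, and veryscrape's own heuristic may differ.

```python
def classify_text_sketch(text, topic_query_dict):
    """Return the topic whose queries occur most often in text, or None if none match."""
    lowered = text.lower()
    best_topic, best_hits = None, 0
    for topic, queries in topic_query_dict.items():
        # Count case-insensitive substring occurrences of every query for this topic.
        hits = sum(lowered.count(query.lower()) for query in queries)
        if hits > best_hits:
            best_topic, best_hits = topic, hits
    return best_topic
```

Plain substring counting keeps the work per text linear in the number of queries, which fits the stated goal of being fast enough for a web spider.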
veryscrape.scrape module
class veryscrape.scrape.Scraper(*args, proxy_pool=None, **kwargs)
Bases: abc.ABC
item_gen
alias of veryscrape.items.ItemGenerator
scrape_every = 300
session_class
alias of veryscrape.session.Session
source = ''
class veryscrape.scrape.SearchEngineScraper(*args, proxy_pool=None, **kwargs)
Bases: veryscrape.scrape.Scraper
bad_domains = {'.org/', '.edu/', '.com/', '.biz/', '.net/', '.gov/'}
static clean_urls(urls)
Generator that removes useless or uninteresting URLs from an iterable of URLs.
false_urls = {'youtube.', 'blogger.', 'googlenewsblog.', 'google.', 'googleusercontent.'}
scrape_every = 900
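A generator like `clean_urls` could plausibly combine the two class attributes as filters. The sketch below copies the `bad_domains` and `false_urls` values listed above, but how veryscrape actually applies them is not documented here: treating `bad_domains` as URL suffixes (bare homepages) and `false_urls` as substrings is an assumption, and `clean_urls_sketch` is a hypothetical name.

```python
# Values copied from the SearchEngineScraper attributes listed above.
BAD_DOMAINS = {'.org/', '.edu/', '.com/', '.biz/', '.net/', '.gov/'}
FALSE_URLS = {'youtube.', 'blogger.', 'googlenewsblog.', 'google.', 'googleusercontent.'}


def clean_urls_sketch(urls):
    """Yield only the urls that pass both (assumed) filters."""
    for url in urls:
        # Assumption: a url ending in a bare TLD plus slash is a site homepage.
        if url.endswith(tuple(BAD_DOMAINS)):
            continue
        # Assumption: urls containing these fragments point at unwanted hosts.
        if any(fragment in url for fragment in FALSE_URLS):
            continue
        yield url
```

Because the function is a generator, it works lazily over any iterable of URLs and only materializes results when the caller iterates.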
veryscrape.session module
exception veryscrape.session.FetchError
Bases: Exception
Exception raised when a fetch request fails despite retries. This is used for flow control of circuit-breaker logic in Scraper.
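The "fails despite retries" behaviour can be sketched as a retry loop that raises once its attempts are exhausted. Everything here is illustrative: `FetchError` is redefined locally as a stand-in, `fetch_with_retries` is a hypothetical helper, the linear back-off is an assumption, and the defaults merely echo the `retries_to_error` and `sleep_increment` values listed for `Session` on this page.

```python
import time


class FetchError(Exception):
    """Local stand-in for veryscrape.session.FetchError."""


def fetch_with_retries(fetch, retries=5, sleep_increment=15, sleep=time.sleep):
    """Call fetch() until it succeeds; raise FetchError after `retries` failures."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except OSError:
            # Assumed linear back-off: wait a little longer after each failure,
            # but do not sleep after the final attempt.
            if attempt < retries:
                sleep(attempt * sleep_increment)
    raise FetchError(f"fetch failed after {retries} attempts")
```

A caller such as a scraper loop could catch `FetchError` and pause or skip the failing source, which is one way to read the "circuit-breaker" remark above. Passing `sleep` as a parameter keeps the helper testable without real delays.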
class veryscrape.session.OAuth1(client, secret, token, token_secret)
Bases: object
oauth_params
Returns a dictionary of the OAuth1 parameters required for an OAuth1-signed HTTP request.
class veryscrape.session.OAuth1Session(*args, **kwargs)
Bases: veryscrape.session.Session
base_url = None
class veryscrape.session.OAuth2(client, secret, token_url)
Bases: object
auth_token()
Returns a dictionary of the OAuth2 parameters required for an OAuth2-signed HTTP request.
oauth2_token_expired
Returns true if the current OAuth2 token needs to be refreshed.
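A token-expiry check of the kind `oauth2_token_expired` describes is commonly built by recording when a token was issued and for how long it is valid. The sketch below is a generic pattern, not veryscrape's code: `TokenState`, its attributes and the injectable `clock` are hypothetical, and `expires_in` follows the lifetime-in-seconds field that OAuth2 token responses usually carry.

```python
import time


class TokenState:
    """Hypothetical holder for an OAuth2 access token and its expiry time."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self.token = None
        self.expires_at = 0.0

    def store(self, token, expires_in):
        """Record a new token; expires_in is a lifetime in seconds."""
        self.token = token
        self.expires_at = self._clock() + expires_in

    @property
    def expired(self):
        """True when there is no token yet or its lifetime has elapsed."""
        return self.token is None or self._clock() >= self.expires_at
```

A session could consult `expired` before each request and fetch a fresh token from the token URL only when it returns true.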
class veryscrape.session.Session(*args, proxy_pool=None, **kwargs)
Bases: object
error_on_failure = True
on_error(error_code)
Called when a request returns a non-200 status code.
Parameters: error_code – status code of the HTTP request
persist_user_agent = True
rate_limit_period = 60
rate_limits = {}
retries_to_error = 5
sleep_increment = 15
user_agent = None
veryscrape.veryscrape module
class veryscrape.veryscrape.VeryScrape(q, loop=None)
Bases: object
Many API, much data, VeryScrape!
Parameters: - q – Queue to output data gathered from scraping
- loop – Event loop to run the scraping
create_all_scrapers_and_streams(config)
Creates all scrapers and stream functions associated with a config. A scraper is a class inheriting from scrape.Scraper. A stream function is a function returning an items.ItemGenerator.
Parameters: config – scrape configuration
Returns: list of scrapers, list of stream functions
scrape(config, *, n_cores=1, max_items=0, max_age=None)
Scrape, process and organize data on the web based on a scrape config.
Parameters: - config – dict: scrape configuration. This is a map of scrape sources to data gathering information. The basic scheme is as follows (see examples for a real example):
    {
        "source1": {
            "first_authentication|split|by|pipe": {
                "topic1": ["query1", "query2"],
                "topic2": ["query3", "query4"]
            },
            "second_authentication|split|by|pipe": {
                "topic3": ["query5", "query6"]
            }
        },
        "source2": …
    }
- n_cores – number of cores to use for processing data. Set to 0 to use all available cores. Set to -1 to disable processing.
- max_items – (no description given)
- max_age – (no description given)
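To make the config scheme concrete, the sketch below builds a small config in that shape and walks it. The source names, credential placeholders and queries are invented for illustration, and `iter_config` is a hypothetical helper rather than part of veryscrape; it only shows how the pipe-separated authentication key splits into individual credentials.

```python
# Hypothetical config following the scheme above; source names, credentials
# and queries are placeholders.
config = {
    "twitter": {
        "client|secret|token|token_secret": {
            "apple": ["iphone", "macbook"],
            "tesla": ["model 3"],
        },
    },
    "reddit": {
        "client|secret": {
            "apple": ["r/apple"],
        },
    },
}


def iter_config(config):
    """Yield (source, credentials, topic, queries) for every topic in a config."""
    for source, auths in config.items():
        for auth_key, topics in auths.items():
            # The authentication key packs several credentials, split by pipe.
            credentials = auth_key.split("|")
            for topic, queries in topics.items():
                yield source, credentials, topic, queries


rows = list(iter_config(config))
```

Each source can carry several authentication keys, so one set of credentials can be dedicated to one group of topics and a second set to another.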
veryscrape.veryscrape.register(name, scraper, classify=False)
Register a scraper class so it is created automatically from keys in VeryScrape.config when VeryScrape is run.
Parameters: - name – name of data source (e.g. 'twitter')
- scraper – scraper class
- classify – whether the scraper needs to classify the text topic afterwards
veryscrape.wrappers module
class veryscrape.wrappers.ItemProcessor(items, n_cores=1, loop=None)
Bases: veryscrape.wrappers.GeneratorWrapper
classify(topic_query_dict)
Attempts to classify a text based on query strings organized by topic. (Note: this is meant to be very fast, for use in a web spider.)
Parameters: - text – text to classify
- topic_query_dict – dict of topics and queries, e.g. {'t1': ['q1', 'q2'], 't2': ['q3'], …}
Returns: the topic the text belongs to
Module contents
class veryscrape.VeryScrape(q, loop=None)
Bases: object
Many API, much data, VeryScrape!
Parameters: - q – Queue to output data gathered from scraping
- loop – Event loop to run the scraping
create_all_scrapers_and_streams(config)
Creates all scrapers and stream functions associated with a config. A scraper is a class inheriting from scrape.Scraper. A stream function is a function returning an items.ItemGenerator.
Parameters: config – scrape configuration
Returns: list of scrapers, list of stream functions
scrape(config, *, n_cores=1, max_items=0, max_age=None)
Scrape, process and organize data on the web based on a scrape config.
Parameters: - config – dict: scrape configuration. This is a map of scrape sources to data gathering information. The basic scheme is as follows (see examples for a real example):
    {
        "source1": {
            "first_authentication|split|by|pipe": {
                "topic1": ["query1", "query2"],
                "topic2": ["query3", "query4"]
            },
            "second_authentication|split|by|pipe": {
                "topic3": ["query5", "query6"]
            }
        },
        "source2": …
    }
- n_cores – number of cores to use for processing data. Set to 0 to use all available cores. Set to -1 to disable processing.
- max_items – (no description given)
- max_age – (no description given)
veryscrape.register(name, scraper, classify=False)
Register a scraper class so it is created automatically from keys in VeryScrape.config when VeryScrape is run.
Parameters: - name – name of data source (e.g. 'twitter')
- scraper – scraper class
- classify – whether the scraper needs to classify the text topic afterwards