veryscrape package¶
Subpackages¶
Submodules¶
veryscrape.cli module¶
Console script for veryscrape.
veryscrape.items module¶
veryscrape.process module¶
veryscrape.process.clean_tweet(content)[source]¶
Unescapes and replaces mentions and hashtags with static tokens (@ → MENTION, # → HASHTAG).
veryscrape.process.clean_reddit_comment(content)[source]¶
Replace subreddit paths and user paths with static tokens (/r/… → SUBREDDIT).
veryscrape.process.clean_general(content)[source]¶
Remove any URLs, non-ASCII text and redundant spaces; normalize swearwords.
veryscrape.process.clean_item(item)[source]¶
Clean an item of undesirable data.
Parameters: item – item to clean with all functions registered to item.source
Returns: cleaned item
veryscrape.process.register(name, *funcs)[source]¶
Register cleaning functions so they are run automatically on items with source ‘name’.
Parameters: - name – name of data source (e.g. ‘twitter’)
- funcs – cleaning functions to apply
veryscrape.process.unregister(name, *funcs)[source]¶
Unregister functions registered with ‘veryscrape.process.register’.
Parameters: - name – name of data source (e.g. ‘twitter’)
- funcs – cleaning functions to remove
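register and unregister amount to a per-source registry of cleaning functions that clean_item later applies. The library's internal storage isn't shown in these docs; a minimal sketch of the pattern, with a hypothetical clean helper standing in for clean_item:

```python
from collections import defaultdict

# Hypothetical stand-in for the registry veryscrape.process keeps internally.
_registry = defaultdict(list)

def register(name, *funcs):
    """Attach cleaning functions to a data-source name."""
    _registry[name].extend(funcs)

def unregister(name, *funcs):
    """Detach previously registered cleaning functions."""
    for f in funcs:
        _registry[name].remove(f)

def clean(name, content):
    """Run every function registered for `name` over `content`, in order."""
    for f in _registry[name]:
        content = f(content)
    return content

register('twitter', str.strip, str.lower)
print(clean('twitter', '  Hello WORLD  '))  # hello world
```

Registered functions compose left to right, so order of registration matters.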
veryscrape.process.classify_text(text, topic_query_dict)[source]¶
Attempts to classify a text based on query strings organized by topic. (Note: this is meant to be very fast – for a web spider.)
Parameters: - text – text to classify
- topic_query_dict – dict of topics and queries, e.g. {‘t1’: [‘q1’, ‘q2’], ‘t2’: [‘q3’], …}
Returns: which topic the text belongs to
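The matching heuristic isn't specified here; a plausible sketch of fast substring-based classification (the real scoring in veryscrape.process may differ):

```python
def classify_text(text, topic_query_dict):
    # Count how many of each topic's query strings occur in the text and
    # return the best-matching topic, or None when nothing matches.
    # Sketch only -- the actual veryscrape heuristic is not documented.
    text = text.lower()
    best_topic, best_hits = None, 0
    for topic, queries in topic_query_dict.items():
        hits = sum(1 for q in queries if q.lower() in text)
        if hits > best_hits:
            best_topic, best_hits = topic, hits
    return best_topic

topics = {'finance': ['stock', 'market'], 'sport': ['goal', 'match']}
print(classify_text('Stock market rallies again', topics))  # finance
```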
veryscrape.scrape module¶
class veryscrape.scrape.Scraper(*args, proxy_pool=None, **kwargs)[source]¶
Bases: abc.ABC
item_gen¶
alias of veryscrape.items.ItemGenerator
scrape_every = 300¶
session_class¶
alias of veryscrape.session.Session
source = ''¶
class veryscrape.scrape.SearchEngineScraper(*args, proxy_pool=None, **kwargs)[source]¶
Bases: veryscrape.scrape.Scraper
bad_domains = {'.com/', '.biz/', '.org/', '.edu/', '.net/', '.gov/'}¶
static clean_urls(urls)[source]¶
Generator that removes useless or uninteresting urls from an iterable of urls.
false_urls = {'googleusercontent.', 'youtube.', 'googlenewsblog.', 'google.', 'blogger.'}¶
scrape_every = 900¶
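Given bad_domains and false_urls above, the filtering can be pictured roughly as follows; the exact predicate inside clean_urls is an assumption:

```python
bad_domains = {'.com/', '.biz/', '.org/', '.edu/', '.net/', '.gov/'}
false_urls = {'googleusercontent.', 'youtube.', 'googlenewsblog.', 'google.', 'blogger.'}

def clean_urls(urls):
    # Sketch: drop urls from known-uninteresting hosts (false_urls) and
    # urls that are bare domain roots with no article path (bad_domains).
    # The real heuristic in SearchEngineScraper.clean_urls may differ.
    for url in urls:
        if any(marker in url for marker in false_urls):
            continue
        if any(url.endswith(root) for root in bad_domains):
            continue
        yield url

urls = ['https://news.example.com/story-1',
        'https://www.youtube.com/watch?v=x',
        'https://example.com/']
print(list(clean_urls(urls)))  # ['https://news.example.com/story-1']
```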
veryscrape.session module¶
exception veryscrape.session.FetchError[source]¶
Bases: Exception
Exception raised when a fetch request fails despite retries. This is used for flow control of circuit-breaker logic in Scraper.
class veryscrape.session.OAuth1(client, secret, token, token_secret)[source]¶
Bases: object
oauth_params¶
Returns a dictionary of OAuth1 parameters required for an OAuth1-signed HTTP request.
class veryscrape.session.OAuth1Session(*args, **kwargs)[source]¶
Bases: veryscrape.session.Session
base_url = None¶
class veryscrape.session.OAuth2(client, secret, token_url)[source]¶
Bases: object
auth_token()[source]¶
Returns a dictionary of OAuth2 parameters required for an OAuth2-signed HTTP request.
oauth2_token_expired¶
Returns True if the current OAuth2 token needs to be refreshed.
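auth_token plus oauth2_token_expired suggest a cache-and-refresh pattern for the bearer token. A sketch of that idea with illustrative names (OAuth2Token, lifetime and refresh are not actual veryscrape.session.OAuth2 attributes):

```python
import time

class OAuth2Token:
    # Illustrative expiring-token cache, not the library's implementation.
    def __init__(self, lifetime):
        self.lifetime = lifetime    # seconds the token stays valid
        self.issued_at = None       # no token fetched yet

    def refresh(self):
        # The real class would POST client credentials to token_url;
        # here we only record when the token was issued.
        self.issued_at = time.monotonic()

    @property
    def expired(self):
        return (self.issued_at is None or
                time.monotonic() - self.issued_at >= self.lifetime)

token = OAuth2Token(lifetime=3600)
assert token.expired        # nothing cached yet, must refresh
token.refresh()
assert not token.expired    # fresh token is reused until it expires
```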
class veryscrape.session.Session(*args, proxy_pool=None, **kwargs)[source]¶
Bases: object
error_on_failure = True¶
on_error(error_code)[source]¶
Called when a request returns a non-200 status code.
Parameters: error_code – status code of the HTTP request
persist_user_agent = True¶
rate_limit_period = 60¶
rate_limits = {}¶
retries_to_error = 5¶
sleep_increment = 15¶
user_agent = None¶
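Taken together, retries_to_error, sleep_increment and FetchError suggest a retry loop that pauses between attempts and gives up once the retry budget is exhausted. A toy sketch of that flow, not the library's actual implementation (fetch_with_retries is hypothetical):

```python
class FetchError(Exception):
    """Raised when a fetch fails despite retries (mirrors veryscrape.session)."""

retries_to_error = 5
sleep_increment = 15

def fetch_with_retries(do_request, sleep=lambda seconds: None):
    # Sketch: retry a failing request, pausing sleep_increment seconds
    # between attempts; raise FetchError after retries_to_error failures.
    for _ in range(retries_to_error):
        status = do_request()
        if status == 200:
            return status
        sleep(sleep_increment)  # real code would time.sleep / asyncio.sleep
    raise FetchError('request failed after %d retries' % retries_to_error)

responses = iter([500, 503, 200])
print(fetch_with_retries(lambda: next(responses)))  # 200
```

Raising a dedicated FetchError (rather than returning a sentinel) is what lets the Scraper's circuit-breaker logic react to persistent failures.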
veryscrape.veryscrape module¶
class veryscrape.veryscrape.VeryScrape(q, loop=None)[source]¶
Bases: object
Many API, much data, VeryScrape!
Parameters: - q – Queue to output data gathered from scraping
- loop – Event loop to run the scraping
create_all_scrapers_and_streams(config)[source]¶
Creates all scrapers and stream functions associated with a config. A scraper is a class inheriting from scrape.Scraper; a stream function is a function returning an items.ItemGenerator.
Parameters: config – scrape configuration
Returns: list of scrapers, list of stream functions
scrape(config, *, n_cores=1, max_items=0, max_age=None)[source]¶
Scrape, process and organize data on the web based on a scrape config.
Parameters: - config – dict: scrape configuration. This is a map of scrape sources to data gathering information. The basic scheme is as follows (see examples for a real example):
{
    "source1": {
        "first_authentication|split|by|pipe": {
            "topic1": ["query1", "query2"],
            "topic2": ["query3", "query4"]
        },
        "second_authentication|split|by|pipe": {
            "topic3": ["query5", "query6"]
        }
    },
    "source2": …
}
- n_cores – number of cores to use for processing data. Set to 0 to use all available cores; set to -1 to disable processing.
- max_items –
- max_age –
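The scheme above can be written out as a concrete Python dict; the source names, pipe-joined credentials and queries below are placeholders, not real values:

```python
# Shape of a scrape config: source -> auth string -> topic -> queries.
config = {
    'twitter': {
        'CLIENT|SECRET|TOKEN|TOKEN_SECRET': {
            'finance': ['stocks', 'markets'],
            'sport': ['football'],
        },
    },
    'reddit': {
        'CLIENT|SECRET': {
            'finance': ['investing'],
        },
    },
}

# Each source maps pipe-joined authentication strings to topic->query maps.
for source, auths in config.items():
    for auth, topics in auths.items():
        assert '|' in auth
        assert all(isinstance(queries, list) for queries in topics.values())
```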
veryscrape.veryscrape.register(name, scraper, classify=False)[source]¶
Register a scraper class so it is created automatically from keys in VeryScrape.config when VeryScrape is run.
Parameters: - name – name of data source (e.g. ‘twitter’)
- scraper – scraper class
- classify – whether the scraper needs to classify text topic afterwards
veryscrape.wrappers module¶
class veryscrape.wrappers.ItemProcessor(items, n_cores=1, loop=None)[source]¶
Bases: veryscrape.wrappers.GeneratorWrapper
classify(topic_query_dict)¶
Attempts to classify a text based on query strings organized by topic. (Note: this is meant to be very fast – for a web spider.)
Parameters: - text – text to classify
- topic_query_dict – dict of topics and queries, e.g. {‘t1’: [‘q1’, ‘q2’], ‘t2’: [‘q3’], …}
Returns: which topic the text belongs to
Module contents¶
class veryscrape.VeryScrape(q, loop=None)[source]¶
Bases: object
Many API, much data, VeryScrape!
Parameters: - q – Queue to output data gathered from scraping
- loop – Event loop to run the scraping
create_all_scrapers_and_streams(config)[source]¶
Creates all scrapers and stream functions associated with a config. A scraper is a class inheriting from scrape.Scraper; a stream function is a function returning an items.ItemGenerator.
Parameters: config – scrape configuration
Returns: list of scrapers, list of stream functions
scrape(config, *, n_cores=1, max_items=0, max_age=None)[source]¶
Scrape, process and organize data on the web based on a scrape config.
Parameters: - config – dict: scrape configuration. This is a map of scrape sources to data gathering information. The basic scheme is as follows (see examples for a real example):
{
    "source1": {
        "first_authentication|split|by|pipe": {
            "topic1": ["query1", "query2"],
            "topic2": ["query3", "query4"]
        },
        "second_authentication|split|by|pipe": {
            "topic3": ["query5", "query6"]
        }
    },
    "source2": …
}
- n_cores – number of cores to use for processing data. Set to 0 to use all available cores; set to -1 to disable processing.
- max_items –
- max_age –
veryscrape.register(name, scraper, classify=False)[source]¶
Register a scraper class so it is created automatically from keys in VeryScrape.config when VeryScrape is run.
Parameters: - name – name of data source (e.g. ‘twitter’)
- scraper – scraper class
- classify – whether the scraper needs to classify text topic afterwards