elgeopaso.jobs.crawlers.georezo_rss_parser module
Name: GeoRezo Jobs RSS Parser
Purpose: Parse GeoRezo RSS
Python: 3.7+
- class elgeopaso.jobs.crawlers.georezo_rss_parser.GeorezoRssParser(feed_base_url='https://georezo.net/extern.php?fid=10', feed_length_param='show', items_to_parse=50, user_agent='ElGeoPaso/DEV +https://elgeopaso.georezo.net/')
Bases: object
Handy module to parse GeoRezo job offers through RSS.
- Parameters
feed_base_url (str) – URL of the feed. Defaults to: "https://georezo.net/extern.php?fid=10" - optional
feed_length_param (str) – name of the URL parameter used to specify the number of items. Defaults to: "show" - optional
items_to_parse (int) – number of items to request from the feed. Defaults to: 50 - optional
user_agent (str) – HTTP user-agent. Defaults to: "ElGeoPaso/DEV +https://elgeopaso.georezo.net/" - optional
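The parameters above suggest that the requested URL is the base URL with the length parameter appended. The following is a minimal sketch of that idea, not the class's actual code: `build_feed_url` is a hypothetical helper introduced here for illustration only.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_feed_url(feed_base_url: str, feed_length_param: str, items_to_parse: int) -> str:
    """Hypothetical helper: append the length parameter to the feed URL."""
    parts = urlsplit(feed_base_url)
    # Keep the existing query parameters (e.g. fid=10) and add the new one.
    query = parse_qsl(parts.query)
    query.append((feed_length_param, str(items_to_parse)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Uses the documented default values of the constructor.
feed_url = build_feed_url("https://georezo.net/extern.php?fid=10", "show", 50)
print(feed_url)
```

Going through `urllib.parse` rather than string concatenation keeps the result valid whether or not the base URL already carries a query string.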
- CRAWLER_LATEST_METADATA = 'crawler_georezo_rss_latest.json'
- FEED_DATETIME_RAW_FORMAT = '%a, %d %b %Y %H:%M:%S %z'
- FEED_DATETIME_RAW_FORMAT_ARROW = 'ddd, D MMM YYYY HH:mm:ss Z'
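As an illustration of what `FEED_DATETIME_RAW_FORMAT` describes, the sketch below parses an RFC 822 style date string with the standard library. The sample string is the `feed_updated_raw` value from the metadata example in this page; the Arrow constant appears to express the same layout with Arrow's own tokens. Note that `%a` and `%b` depend on the locale, so this assumes the default C/English locale.

```python
from datetime import datetime

# Same value as the class attribute documented above.
FEED_DATETIME_RAW_FORMAT = "%a, %d %b %Y %H:%M:%S %z"

raw = "Tue, 10 Mar 2020 13:07:06 +0100"
parsed = datetime.strptime(raw, FEED_DATETIME_RAW_FORMAT)

# The result is timezone-aware because of the %z directive.
print(parsed)
```

The printed value matches the `feed_updated_converted` field of the metadata example.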
- classmethod extract_offer_id_from_url(in_url)
Parse input URL to extract RSS item ID = job offer ID.
- Parameters
in_url (str) – input URL as a string. In the GeoRezo RSS feed, it appears:
  - in raw XML: <guid isPermaLink="true">https://georezo.net/forum/viewtopic.php?pid=331081#p331081</guid>
  - parsed by feedparser: entry.id = "https://georezo.net/forum/viewtopic.php?pid=331144#p331144"
- Returns
offer ID
- Return type
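Given the URL shapes quoted above, the offer ID is the value of the `pid` query parameter. A minimal sketch of one way to extract it follows. It is not the method's actual implementation: `offer_id_from_url` is a hypothetical stand-in, and returning an `int` is an assumption based on the integer `latest_offer_id` in the metadata example.

```python
from urllib.parse import parse_qs, urlsplit


def offer_id_from_url(in_url: str) -> int:
    """Hypothetical stand-in: read the `pid` query parameter as an integer."""
    query = parse_qs(urlsplit(in_url).query)
    # parse_qs maps each key to a list of values; the fragment (#p...) is ignored.
    return int(query["pid"][0])


offer_id = offer_id_from_url("https://georezo.net/forum/viewtopic.php?pid=331081#p331081")
print(offer_id)
```

A URL without a `pid` parameter raises `KeyError` in this sketch; a real crawler would likely want to handle that case explicitly.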
- classmethod load_previous_crawler_metadata(from_source='./last_id_georezo.txt')
Retrieve last parsed item ID from specified source.
- Parameters
from_source (str) – source from which to load the ID. Defaults to: "./last_id_georezo.txt"
- Raises
NotImplementedError – [description]
ValueError – [description]
- Returns
dictionary with the metadata of the previous crawler execution
- Return type
- parse_new_offers(ignore_encoding_errors=True, only_new_offers=True)
Parse RSS feed, handle errors and filter on new offers.
- Parameters
- Returns
list of offers whose identifier is greater than that of the latest parsed offer
- Return type
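The filtering step described above can be sketched as follows. This is an illustration under stated assumptions, not the method's code: entries are modelled as plain dictionaries with an `id` key (feedparser entries allow the same key access), and `keep_new_offers` and `offer_id_from_url` are hypothetical helpers.

```python
from urllib.parse import parse_qs, urlsplit


def offer_id_from_url(in_url: str) -> int:
    """Hypothetical helper: read the `pid` query parameter as an integer."""
    return int(parse_qs(urlsplit(in_url).query)["pid"][0])


def keep_new_offers(entries, latest_offer_id: int) -> list:
    """Keep only the entries whose offer ID is greater than the last one seen."""
    return [e for e in entries if offer_id_from_url(e["id"]) > latest_offer_id]


entries = [
    {"id": "https://georezo.net/forum/viewtopic.php?pid=331081#p331081"},
    {"id": "https://georezo.net/forum/viewtopic.php?pid=331144#p331144"},
]
# 331132 is the latest_offer_id shown in the metadata example of this page.
new_offers = keep_new_offers(entries, latest_offer_id=331132)
print(len(new_offers))
```

Comparing numeric IDs relies on the forum assigning post IDs in increasing order, which is what the description of the return value implies.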
- save_parsing_metadata(feed_parsed, save_type='json')
Dump some metadata from the parsed feed into a structured JSON file, to track the feed's behavior and support future runs.
- Parameters
feed_parsed (feedparser.FeedParserDict) – parsed feed
save_type (str) – type of save to perform. Defaults to: "json" - optional
- Returns
dictionary of saved data
- Return type
- Example
[
  {
    "encoding": "ISO-8859-1",
    "entries_required": 50,
    "entries_total": 50,
    "feed_updated_converted": "2020-03-10 13:07:06+01:00",
    "feed_updated_parsed": [2020, 3, 10, 12, 7, 6, 1, 70, 0],
    "feed_updated_raw": "Tue, 10 Mar 2020 13:07:06 +0100",
    "latest_offer_id": 331132,
    "status": 200,
    "version": "rss20"
  }
]
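To show how such a record might be consumed on a later run, here is a short sketch that reads a trimmed copy of the example above with the standard library. It assumes the file holds a list of records shaped like the example; how the crawler itself reads the file is not documented here.

```python
import calendar
import json
from datetime import datetime

# Trimmed copy of the example record above, inlined to keep the sketch self-contained.
raw_metadata = """
[{"feed_updated_converted": "2020-03-10 13:07:06+01:00",
  "feed_updated_parsed": [2020, 3, 10, 12, 7, 6, 1, 70, 0],
  "latest_offer_id": 331132}]
"""

latest = json.loads(raw_metadata)[0]
last_id = latest["latest_offer_id"]

# feed_updated_converted is an ISO-like string carrying its UTC offset.
updated = datetime.fromisoformat(latest["feed_updated_converted"])

# feed_updated_parsed is the same instant as a UTC struct_time-like list.
updated_epoch = calendar.timegm(tuple(latest["feed_updated_parsed"]))
print(last_id, updated.isoformat(), updated_epoch)
```

The two date fields describe the same instant: 13:07:06 at UTC+01:00 is 12:07:06 UTC, which is what the `feed_updated_parsed` list holds.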