elgeopaso.jobs.crawlers.georezo_rss_parser module

Name: GeoRezo Jobs RSS Parser
Purpose: Parse GeoRezo RSS
Python: 3.7+

class elgeopaso.jobs.crawlers.georezo_rss_parser.GeorezoRssParser(feed_base_url='https://georezo.net/extern.php?fid=10', feed_length_param='show', items_to_parse=50, user_agent='ElGeoPaso/DEV +https://elgeopaso.georezo.net/')

Bases: object

Handy module to parse GeoRezo job offers through RSS.

Parameters
  • feed_base_url (str) – URL of the feed. Defaults to: "https://georezo.net/extern.php?fid=10" - optional

  • feed_length_param (str) – name of the URL parameter used to specify the number of items. Defaults to: "show" - optional

  • items_to_parse (int) – number of items to request from the feed. Defaults to: 50 - optional

  • user_agent (str) – HTTP user-agent. Defaults to: "ElGeoPaso/DEV +https://elgeopaso.georezo.net/" - optional
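
As a rough illustration of how these parameters might combine into the request URL, the sketch below appends the length parameter to the base URL using only the standard library. The helper name `build_feed_url` is hypothetical and not part of the module; how the class actually composes the URL internally is an assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_feed_url(feed_base_url: str, feed_length_param: str, items_to_parse: int) -> str:
    """Hypothetical helper: add the item-count parameter to the feed URL."""
    parts = urlsplit(feed_base_url)
    # Keep the existing query parameters (e.g. fid=10) and append the length one.
    query = parse_qsl(parts.query)
    query.append((feed_length_param, str(items_to_parse)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Values below are the documented defaults of GeorezoRssParser.
feed_url = build_feed_url("https://georezo.net/extern.php?fid=10", "show", 50)
print(feed_url)
```

With the documented defaults, this yields a URL that carries both the `fid` and `show` parameters.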

CRAWLER_LATEST_METADATA = 'crawler_georezo_rss_latest.json'
FEED_DATETIME_RAW_FORMAT = '%a, %d %b %Y %H:%M:%S %z'
FEED_DATETIME_RAW_FORMAT_ARROW = 'ddd, D MMM YYYY HH:mm:ss Z'
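
The `FEED_DATETIME_RAW_FORMAT` pattern uses standard `strptime` directives, so it can be tried directly with the standard library. The sketch below parses the raw date string that appears in the metadata example of `save_parsing_metadata`; it assumes an English (or C) locale, since `%a` and `%b` are locale-dependent. `FEED_DATETIME_RAW_FORMAT_ARROW` presumably expresses the same layout in Arrow's token syntax, which is not exercised here.

```python
from datetime import datetime

# Same value as the class attribute documented above.
FEED_DATETIME_RAW_FORMAT = "%a, %d %b %Y %H:%M:%S %z"

# Raw date string as published in the RSS feed.
raw = "Tue, 10 Mar 2020 13:07:06 +0100"

# %z turns the "+0100" suffix into a timezone-aware datetime.
converted = datetime.strptime(raw, FEED_DATETIME_RAW_FORMAT)
print(converted)
```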
classmethod extract_offer_id_from_url(in_url)

Parse the input URL to extract the RSS item ID, which is also the job offer ID.

Parameters

in_url (str) – input URL as a string. In the GeoRezo RSS feed, it appears as:
  - in raw XML: <guid isPermaLink="true">https://georezo.net/forum/viewtopic.php?pid=331081#p331081</guid>
  - parsed by feedparser: entry.id = "https://georezo.net/forum/viewtopic.php?pid=331144#p331144"

Returns

offer ID

Return type

int
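
One way to perform this kind of extraction is to read the `pid` query parameter from the URL. The sketch below is a stand-in written with the standard library, not the method's actual implementation; the function name `offer_id_from_url` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def offer_id_from_url(in_url: str) -> int:
    """Hypothetical stand-in: return the `pid` query value as an integer."""
    # urlsplit separates the "#p331081" fragment from the query string.
    query = parse_qs(urlsplit(in_url).query)
    # parse_qs maps each key to a list of values; take the first `pid`.
    return int(query["pid"][0])


offer_id = offer_id_from_url("https://georezo.net/forum/viewtopic.php?pid=331081#p331081")
print(offer_id)
```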

classmethod load_previous_crawler_metadata(from_source='./last_id_georezo.txt')

Retrieve the last parsed item ID from the specified source.

Parameters

from_source (str) – where to load the ID from. Defaults to: "./last_id_georezo.txt"

Raises
Returns

dictionary with previous crawler execution metadata

Return type

dict

parse_new_offers(ignore_encoding_errors=True, only_new_offers=True)

Parse RSS feed, handle errors and filter on new offers.

Parameters
  • ignore_encoding_errors (bool) – option to ignore encoding exceptions. Defaults to: True

  • only_new_offers (bool) – option to return only new offers, based on the previous crawler execution. If False, all of the feed items are returned. Defaults to: True

Returns

list of offers whose identifier is greater than that of the latest parsed offer

Return type

list
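
The filtering rule described above (keep only offers whose identifier is greater than the latest one already parsed) can be sketched as follows. The helper `filter_new_offers` and the `offer_id` key are hypothetical and used only for illustration; the real method works on feedparser entries.

```python
def filter_new_offers(entries, latest_offer_id, only_new_offers=True):
    """Hypothetical helper: keep entries newer than the last parsed offer."""
    if not only_new_offers:
        # Mirrors only_new_offers=False: return every item of the feed.
        return list(entries)
    return [entry for entry in entries if entry["offer_id"] > latest_offer_id]


# Illustrative entries; IDs are taken from the URLs and example in this page.
entries = [{"offer_id": 331081}, {"offer_id": 331132}, {"offer_id": 331144}]

new_offers = filter_new_offers(entries, latest_offer_id=331132)
print(new_offers)
```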

save_parsing_metadata(feed_parsed, save_type='json')

Dump some metadata from the parsed feed into a structured JSON file, to track the feed's behavior and support future runs.

Parameters
  • feed_parsed (feedparser.FeedParserDict) – parsed feed

  • save_type (str) – type of save to perform. Defaults to: "json" - optional

Returns

dictionary of saved data

Return type

dict

Example

[
    {
        "encoding": "ISO-8859-1",
        "entries_required": 50,
        "entries_total": 50,
        "feed_updated_converted": "2020-03-10 13:07:06+01:00",
        "feed_updated_parsed": [
            2020,
            3,
            10,
            12,
            7,
            6,
            1,
            70,
            0
        ],
        "feed_updated_raw": "Tue, 10 Mar 2020 13:07:06 +0100",
        "latest_offer_id": 331132,
        "status": 200,
        "version": "rss20"
    }
]
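
A minimal sketch of reading such a structure back, assuming the file content matches the example above (a list holding one dictionary). It also checks that `feed_updated_parsed`, which feedparser expresses in UTC, denotes the same instant as `feed_updated_converted`. Only a subset of the example keys is reproduced here.

```python
import json
from datetime import datetime, timezone

# Subset of the example metadata shown above, inlined as a JSON string.
saved = json.loads("""
[
    {
        "feed_updated_converted": "2020-03-10 13:07:06+01:00",
        "feed_updated_parsed": [2020, 3, 10, 12, 7, 6, 1, 70, 0],
        "latest_offer_id": 331132
    }
]
""")

latest = saved[0]
latest_offer_id = latest["latest_offer_id"]

# The first six fields of the time tuple are year..second, in UTC.
updated_utc = datetime(*latest["feed_updated_parsed"][:6], tzinfo=timezone.utc)
updated_local = datetime.fromisoformat(latest["feed_updated_converted"])

print(latest_offer_id)
print(updated_utc == updated_local)
```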