Setup:: >>> import lxml.html We'll define a link translation function: >>> base_href = 'http://old/base/path.html' >>> try: import urlparse ... except ImportError: import urllib.parse as urlparse >>> def relocate_href(link): ... link = urlparse.urljoin(base_href, link) ... if link.startswith('http://old'): ... return 'https://new' + link[len('http://old'):] ... else: ... return link Now for content. First, to make it easier on us, we need to trim the normalized HTML we get from these functions:: Some basics:: >>> from lxml.html import usedoctest, tostring >>> from lxml.html import rewrite_links >>> print(rewrite_links( ... 'link', relocate_href)) link >>> print(rewrite_links( ... '', relocate_href)) >>> print(rewrite_links( ... '', relocate_href)) >>> print(rewrite_links('''\ ... ... ... x\ ... ''', relocate_href)) x Links in CSS are also handled:: >>> print(rewrite_links(''' ... ''', relocate_href)) >>> print(rewrite_links(''' ... ''', relocate_href)) Those links in style attributes are also rewritten:: >>> print(rewrite_links(''' ...
text
... ''', relocate_href))
text
The ```` tag is also respected (but also removed):: >>> print(rewrite_links(''' ... ... ... ... ... link ... ''', relocate_href)) link The ``iterlinks`` method (and function) gives you all the links in the document, along with the element and attribute the link comes from. This makes it fairly easy to see what resources the document references or embeds (an ```` tag is a reference, an ```` tag is something embedded). It returns a generator of ``(element, attrib, link)``, which is awkward to test here, so we'll make a printer:: >>> from lxml.html import iterlinks, document_fromstring, tostring >>> def print_iter(seq): ... for element, attrib, link, pos in seq: ... if pos: ... extra = '@%s' % pos ... else: ... extra = '' ... print('%s %s="%s"%s' % (element.tag, attrib, link, extra)) >>> print_iter(iterlinks(''' ... ... ... ... ... ... ... ... ... ... ... ... ... ...
... ... Hi world! ... ...
... ''')) meta content="/redirect"@6 meta content="/quoted_url"@8 link href="style.css" style None="/other-styles.css"@69 style None="/bg.gif"@40 script src="/js-funcs.js" a href="/test.html" a href="/other.html" td style="/td-bg.png"@22 img src="/logo.gif" td style="/quoted.png"@23 An application of ``iterlinks()`` is ``make_links_absolute()``:: >>> from lxml.html import make_links_absolute >>> print(make_links_absolute(''' ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
... ... Hi world! ...
... ''', ... base_url="http://my.little.server/url/"))
Hi world!
### Test disabled to support Py2.6 and earlier #If the document contains invalid links, you may choose to "discard" or "ignore" #them by passing the respective option into the ``handle_failures`` argument:: # # >>> html = lxml.html.fromstring ('''\ # ...
# ... test2 # ...
''') # # >>> html.make_links_absolute(base_url="http://my.little.server/url/", # ... handle_failures="discard") # # >>> print(lxml.html.tostring (html, pretty_print=True, encoding='unicode')) #
# test2 #
Check if we can replace multiple links inside of the same text string:: >>> html = lxml.html.fromstring ("""\ ... ... ... Test ... ... ... ...

Hi

... ... ... """, ... base_url = 'http://www.example.com/') >>> html.make_links_absolute () >>> print(lxml.html.tostring (html, pretty_print=True, encoding='unicode')) Test

Hi