commit python3-beautifulsoup4 for openSUSE:Factory
Hello community,

here is the log from the commit of package python3-beautifulsoup4 for openSUSE:Factory checked in at 2016-07-28 23:46:31
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/python3-beautifulsoup4 (Old)
 and      /work/SRC/openSUSE:Factory/.python3-beautifulsoup4.new (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Package is "python3-beautifulsoup4"

Changes:
--------
--- /work/SRC/openSUSE:Factory/python3-beautifulsoup4/python3-beautifulsoup4-doc.changes       2016-05-25 21:27:38.000000000 +0200
+++ /work/SRC/openSUSE:Factory/.python3-beautifulsoup4.new/python3-beautifulsoup4-doc.changes  2016-07-28 23:46:33.000000000 +0200
@@ -1,0 +2,42 @@
+Wed Jul 20 15:07:27 UTC 2016 - arun@gmx.de
+
+- update to version 4.5.0:
+  * Beautiful Soup is no longer compatible with Python 2.6. This
+    actually happened a few releases ago, but it's now official.
+  * Beautiful Soup will now work with versions of html5lib greater
+    than 0.99999999. [bug=1603299]
+  * If a search against each individual value of a multi-valued
+    attribute fails, the search will be run one final time against the
+    complete attribute value considered as a single string. That is,
+    if a tag has class="foo bar" and neither "foo" nor "bar" matches,
+    but "foo bar" does, the tag is now considered a match.
+    This happened in previous versions, but only when the value being
+    searched for was a string. Now it also works when that value is a
+    regular expression, a list of strings, etc. [bug=1476868]
+  * Fixed a bug that deranged the tree when a whitespace element was
+    reparented into a tag that contained an identical whitespace
+    element. [bug=1505351]
+  * Added support for CSS selector values that contain quoted spaces,
+    such as tag[style="display: foo"]. [bug=1540588]
+  * Corrected handling of XML processing instructions. [bug=1504393]
+  * Corrected an encoding error that happened when a BeautifulSoup
+    object was copied. [bug=1554439]
+  * The contents of <textarea> tags will no longer be modified when
+    the tree is prettified. [bug=1555829]
+  * When a BeautifulSoup object is pickled but its tree builder cannot
+    be pickled, its .builder attribute is set to None instead of being
+    destroyed. This avoids a performance problem once the object is
+    unpickled. [bug=1523629]
+  * Specify the file and line number when warning about a
+    BeautifulSoup object being instantiated without a parser being
+    specified. [bug=1574647]
+  * The `limit` argument to `select()` now works correctly, though
+    it's not implemented very efficiently. [bug=1520530]
+  * Fixed a Python 3 ByteWarning when a URL was passed in as though it
+    were markup. Thanks to James Salter for a patch and
+    test. [bug=1533762]
+  * We don't run the check for a filename passed in as markup if the
+    'filename' contains a less-than character; the less-than character
+    indicates it's most likely a very small document. [bug=1577864]
+
+-------------------------------------------------------------------
@@ -11 +52,0 @@
-
python3-beautifulsoup4.changes: same change
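To make the search-related changes above concrete, here is a minimal
sketch against the 4.5.0 API. The markup, class values and style value
are invented for illustration; they are not taken from this commit.

    import re
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(
        '<div class="foo bar">1</div>'
        '<div style="display: foo">2</div>'
        '<div class="baz">3</div>',
        'html.parser')

    # Multi-valued attribute fallback: the regex matches neither "foo"
    # nor "bar" on its own, but it does match the joined string
    # "foo bar", so as of 4.5.0 the first div is found.
    print(soup.find_all('div', class_=re.compile('o b')))

    # CSS selector values containing quoted spaces now parse correctly.
    print(soup.select('div[style="display: foo"]'))

    # The limit argument to select() is now honored.
    print(soup.select('div', limit=2))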
Old:
----
  beautifulsoup4-4.4.1.tar.gz

New:
----
  beautifulsoup4-4.5.0.tar.gz

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Other differences:
------------------
++++++ python3-beautifulsoup4-doc.spec ++++++
--- /var/tmp/diff_new_pack.F2JBPt/_old  2016-07-28 23:46:34.000000000 +0200
+++ /var/tmp/diff_new_pack.F2JBPt/_new  2016-07-28 23:46:34.000000000 +0200
@@ -17,7 +17,7 @@
 Name:           python3-beautifulsoup4-doc
-Version:        4.4.1
+Version:        4.5.0
 Release:        0
 Summary:        Documentation for python3-beautifulsoup4
 License:        MIT
@@ -25,8 +25,8 @@
 Url:            http://www.crummy.com/software/BeautifulSoup/
 Source:         https://files.pythonhosted.org/packages/source/b/beautifulsoup4/beautifulsoup4-%{version}.tar.gz
 BuildRoot:      %{_tmppath}/%{name}-%{version}-build
-BuildRequires:  python3-beautifulsoup4 = %{version}
 BuildRequires:  python3-Sphinx
+BuildRequires:  python3-beautifulsoup4 = %{version}
 Requires:       python3-beautifulsoup4 = %{version}
 BuildArch:      noarch

++++++ python3-beautifulsoup4.spec ++++++
--- /var/tmp/diff_new_pack.F2JBPt/_old  2016-07-28 23:46:34.000000000 +0200
+++ /var/tmp/diff_new_pack.F2JBPt/_new  2016-07-28 23:46:34.000000000 +0200
@@ -17,7 +17,7 @@
 Name:           python3-beautifulsoup4
-Version:        4.4.1
+Version:        4.5.0
 Release:        0
 Summary:        HTML/XML Parser for Quick-Turnaround Applications Like Screen-Scraping
 License:        MIT

++++++ beautifulsoup4-4.4.1.tar.gz -> beautifulsoup4-4.5.0.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/COPYING.txt new/beautifulsoup4-4.5.0/COPYING.txt
--- old/beautifulsoup4-4.4.1/COPYING.txt        2015-06-24 12:56:23.000000000 +0200
+++ new/beautifulsoup4-4.5.0/COPYING.txt        2016-07-16 17:25:37.000000000 +0200
@@ -1,6 +1,6 @@
 Beautiful Soup is made available under the MIT license:
 
- Copyright (c) 2004-2015 Leonard Richardson
+ Copyright (c) 2004-2016 Leonard Richardson
 
 Permission is hereby granted, free of charge, to any person obtaining
 a copy of this software and associated documentation files (the
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/NEWS.txt new/beautifulsoup4-4.5.0/NEWS.txt
--- old/beautifulsoup4-4.4.1/NEWS.txt   2015-09-29 01:53:36.000000000 +0200
+++ new/beautifulsoup4-4.5.0/NEWS.txt   2016-07-20 02:35:09.000000000 +0200
@@ -1,3 +1,56 @@
+= 4.5.0 (20160719) =
+
+* Beautiful Soup is no longer compatible with Python 2.6. This
+  actually happened a few releases ago, but it's now official.
+
+* Beautiful Soup will now work with versions of html5lib greater than
+  0.99999999. [bug=1603299]
+
+* If a search against each individual value of a multi-valued
+  attribute fails, the search will be run one final time against the
+  complete attribute value considered as a single string. That is, if
+  a tag has class="foo bar" and neither "foo" nor "bar" matches, but
+  "foo bar" does, the tag is now considered a match.
+
+  This happened in previous versions, but only when the value being
+  searched for was a string. Now it also works when that value is
+  a regular expression, a list of strings, etc. [bug=1476868]
+
+* Fixed a bug that deranged the tree when a whitespace element was
+  reparented into a tag that contained an identical whitespace
+  element. [bug=1505351]
+
+* Added support for CSS selector values that contain quoted spaces,
+  such as tag[style="display: foo"]. [bug=1540588]
+
+* Corrected handling of XML processing instructions. [bug=1504393]
+
+* Corrected an encoding error that happened when a BeautifulSoup
+  object was copied. [bug=1554439]
+
+* The contents of <textarea> tags will no longer be modified when the
+  tree is prettified. [bug=1555829]
+
+* When a BeautifulSoup object is pickled but its tree builder cannot
+  be pickled, its .builder attribute is set to None instead of being
+  destroyed. This avoids a performance problem once the object is
+  unpickled. [bug=1523629]
+
+* Specify the file and line number when warning about a
+  BeautifulSoup object being instantiated without a parser being
+  specified. [bug=1574647]
+
+* The `limit` argument to `select()` now works correctly, though it's
+  not implemented very efficiently. [bug=1520530]
+
+* Fixed a Python 3 ByteWarning when a URL was passed in as though it
+  were markup. Thanks to James Salter for a patch and
+  test. [bug=1533762]
+
+* We don't run the check for a filename passed in as markup if the
+  'filename' contains a less-than character; the less-than character
+  indicates it's most likely a very small document. [bug=1577864]
+
 = 4.4.1 (20150928) =
 
 * Fixed a bug that deranged the tree when part of it was
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/PKG-INFO new/beautifulsoup4-4.5.0/PKG-INFO
--- old/beautifulsoup4-4.4.1/PKG-INFO   2015-09-29 02:19:48.000000000 +0200
+++ new/beautifulsoup4-4.5.0/PKG-INFO   2016-07-20 12:38:04.000000000 +0200
@@ -1,6 +1,6 @@
 Metadata-Version: 1.1
 Name: beautifulsoup4
-Version: 4.4.1
+Version: 4.5.0
 Summary: Screen-scraping library
 Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
 Author: Leonard Richardson
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/beautifulsoup4.egg-info/PKG-INFO new/beautifulsoup4-4.5.0/beautifulsoup4.egg-info/PKG-INFO
--- old/beautifulsoup4-4.4.1/beautifulsoup4.egg-info/PKG-INFO  2015-09-29 02:19:48.000000000 +0200
+++ new/beautifulsoup4-4.5.0/beautifulsoup4.egg-info/PKG-INFO  2016-07-20 12:38:04.000000000 +0200
@@ -1,6 +1,6 @@
 Metadata-Version: 1.1
 Name: beautifulsoup4
-Version: 4.4.1
+Version: 4.5.0
 Summary: Screen-scraping library
 Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
 Author: Leonard Richardson
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/beautifulsoup4.egg-info/requires.txt new/beautifulsoup4-4.5.0/beautifulsoup4.egg-info/requires.txt
--- old/beautifulsoup4-4.4.1/beautifulsoup4.egg-info/requires.txt      2015-09-29 02:19:48.000000000 +0200
+++ new/beautifulsoup4-4.5.0/beautifulsoup4.egg-info/requires.txt      2016-07-20 12:38:04.000000000 +0200
@@ -1,7 +1,6 @@
+[html5lib]
+html5lib
 [lxml]
 lxml
-
-[html5lib]
-html5lib
\ No newline at end of file
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/bs4/__init__.py new/beautifulsoup4-4.5.0/bs4/__init__.py
--- old/beautifulsoup4-4.4.1/bs4/__init__.py    2015-09-29 02:09:17.000000000 +0200
+++ new/beautifulsoup4-4.5.0/bs4/__init__.py    2016-07-20 02:28:09.000000000 +0200
@@ -5,8 +5,8 @@
 Beautiful Soup uses a pluggable XML or HTML parser to parse a
 (possibly invalid) document into a tree representation. Beautiful Soup
-provides provides methods and Pythonic idioms that make it easy to
-navigate, search, and modify the parse tree.
+provides methods and Pythonic idioms that make it easy to navigate, +search, and modify the parse tree. Beautiful Soup works with Python 2.6 and up. It works better if lxml and/or html5lib is installed. @@ -14,17 +14,22 @@ For more than you ever wanted to know about Beautiful Soup, see the documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ + """ +# Use of this source code is governed by a BSD-style license that can be +# found in the LICENSE file. + __author__ = "Leonard Richardson (leonardr@segfault.org)" -__version__ = "4.4.1" -__copyright__ = "Copyright (c) 2004-2015 Leonard Richardson" +__version__ = "4.5.0" +__copyright__ = "Copyright (c) 2004-2016 Leonard Richardson" __license__ = "MIT" __all__ = ['BeautifulSoup'] import os import re +import traceback import warnings from .builder import builder_registry, ParserRejectedMarkup @@ -77,7 +82,7 @@ ASCII_SPACES = '\x20\x0a\x09\x0c\x0d' - NO_PARSER_SPECIFIED_WARNING = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nTo get rid of this warning, change this:\n\n BeautifulSoup([your markup])\n\nto this:\n\n BeautifulSoup([your markup], \"%(parser)s\")\n" + NO_PARSER_SPECIFIED_WARNING = "No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system (\"%(parser)s\"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, change code that looks like this:\n\n BeautifulSoup([your markup])\n\nto this:\n\n BeautifulSoup([your markup], \"%(parser)s\")\n" def __init__(self, markup="", features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, @@ -137,6 +142,10 @@ from_encoding = from_encoding or deprecated_argument( "fromEncoding", "from_encoding") + if from_encoding and isinstance(markup, unicode): + warnings.warn("You provided Unicode markup but also provided a value for from_encoding. Your from_encoding will be ignored.") + from_encoding = None + if len(kwargs) > 0: arg = kwargs.keys().pop() raise TypeError( @@ -161,19 +170,29 @@ markup_type = "XML" else: markup_type = "HTML" + + caller = traceback.extract_stack()[0] + filename = caller[0] + line_number = caller[1] warnings.warn(self.NO_PARSER_SPECIFIED_WARNING % dict( + filename=filename, + line_number=line_number, parser=builder.NAME, markup_type=markup_type)) self.builder = builder self.is_xml = builder.is_xml + self.known_xml = self.is_xml self.builder.soup = self self.parse_only = parse_only if hasattr(markup, 'read'): # It's a file-type object. markup = markup.read() - elif len(markup) <= 256: + elif len(markup) <= 256 and ( + (isinstance(markup, bytes) and not b'<' in markup) + or (isinstance(markup, unicode) and not u'<' in markup) + ): # Print out warnings for a couple beginner problems # involving passing non-markup to Beautiful Soup. # Beautiful Soup will still parse the input as markup, @@ -195,16 +214,10 @@ if isinstance(markup, unicode): markup = markup.encode("utf8") warnings.warn( - '"%s" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.' 
% markup) - if markup[:5] == "http:" or markup[:6] == "https:": - # TODO: This is ugly but I couldn't get it to work in - # Python 3 otherwise. - if ((isinstance(markup, bytes) and not b' ' in markup) - or (isinstance(markup, unicode) and not u' ' in markup)): - if isinstance(markup, unicode): - markup = markup.encode("utf8") - warnings.warn( - '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup) + '"%s" looks like a filename, not markup. You should' + 'probably open this file and pass the filehandle into' + 'Beautiful Soup.' % markup) + self._check_markup_is_url(markup) for (self.markup, self.original_encoding, self.declared_html_encoding, self.contains_replacement_characters) in ( @@ -223,15 +236,52 @@ self.builder.soup = None def __copy__(self): - return type(self)(self.encode(), builder=self.builder) + copy = type(self)( + self.encode('utf-8'), builder=self.builder, from_encoding='utf-8' + ) + + # Although we encoded the tree to UTF-8, that may not have + # been the encoding of the original markup. Set the copy's + # .original_encoding to reflect the original object's + # .original_encoding. + copy.original_encoding = self.original_encoding + return copy def __getstate__(self): # Frequently a tree builder can't be pickled. d = dict(self.__dict__) if 'builder' in d and not self.builder.picklable: - del d['builder'] + d['builder'] = None return d + @staticmethod + def _check_markup_is_url(markup): + """ + Check if markup looks like it's actually a url and raise a warning + if so. Markup can be unicode or str (py2) / bytes (py3). + """ + if isinstance(markup, bytes): + space = b' ' + cant_start_with = (b"http:", b"https:") + elif isinstance(markup, unicode): + space = u' ' + cant_start_with = (u"http:", u"https:") + else: + return + + if any(markup.startswith(prefix) for prefix in cant_start_with): + if not space in markup: + if isinstance(markup, bytes): + decoded_markup = markup.decode('utf-8', 'replace') + else: + decoded_markup = markup + warnings.warn( + '"%s" looks like a URL. Beautiful Soup is not an' + ' HTTP client. You should probably use an HTTP client like' + ' requests to get the document behind the URL, and feed' + ' that document to Beautiful Soup.' % decoded_markup + ) + def _feed(self): # Convert the document to Unicode. self.builder.reset() @@ -335,7 +385,18 @@ if parent.next_sibling: # This node is being inserted into an element that has # already been parsed. Deal with any dangling references. - index = parent.contents.index(o) + index = len(parent.contents)-1 + while index >= 0: + if parent.contents[index] is o: + break + index -= 1 + else: + raise ValueError( + "Error building tree: supposedly %r was inserted " + "into %r after the fact, but I don't see it!" % ( + o, parent + ) + ) if index == 0: previous_element = parent previous_sibling = None diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/bs4/builder/__init__.py new/beautifulsoup4-4.5.0/bs4/builder/__init__.py --- old/beautifulsoup4-4.4.1/bs4/builder/__init__.py 2015-06-28 21:48:48.000000000 +0200 +++ new/beautifulsoup4-4.5.0/bs4/builder/__init__.py 2016-07-20 02:28:09.000000000 +0200 @@ -1,9 +1,13 @@ +# Use of this source code is governed by a BSD-style license that can be +# found in the LICENSE file. 
+ from collections import defaultdict import itertools import sys from bs4.element import ( CharsetMetaAttributeValue, ContentMetaAttributeValue, + HTMLAwareEntitySubstitution, whitespace_re ) @@ -227,7 +231,7 @@ Such as which tags are empty-element tags. """ - preserve_whitespace_tags = set(['pre', 'textarea']) + preserve_whitespace_tags = HTMLAwareEntitySubstitution.preserve_whitespace_tags empty_element_tags = set(['br' , 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base']) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/bs4/builder/_html5lib.py new/beautifulsoup4-4.5.0/bs4/builder/_html5lib.py --- old/beautifulsoup4-4.4.1/bs4/builder/_html5lib.py 2015-09-29 01:48:58.000000000 +0200 +++ new/beautifulsoup4-4.5.0/bs4/builder/_html5lib.py 2016-07-17 17:31:37.000000000 +0200 @@ -1,8 +1,10 @@ +# Use of this source code is governed by a BSD-style license that can be +# found in the LICENSE file. + __all__ = [ 'HTML5TreeBuilder', ] -from pdb import set_trace import warnings from bs4.builder import ( PERMISSIVE, @@ -23,6 +25,15 @@ Tag, ) +try: + # Pre-0.99999999 + from html5lib.treebuilders import _base as treebuilder_base + new_html5lib = False +except ImportError, e: + # 0.99999999 and up + from html5lib.treebuilders import base as treebuilder_base + new_html5lib = True + class HTML5TreeBuilder(HTMLTreeBuilder): """Use html5lib to build a tree.""" @@ -47,7 +58,14 @@ if self.soup.parse_only is not None: warnings.warn("You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.") parser = html5lib.HTMLParser(tree=self.create_treebuilder) - doc = parser.parse(markup, encoding=self.user_specified_encoding) + + extra_kwargs = dict() + if not isinstance(markup, unicode): + if new_html5lib: + extra_kwargs['override_encoding'] = self.user_specified_encoding + else: + extra_kwargs['encoding'] = self.user_specified_encoding + doc = parser.parse(markup, **extra_kwargs) # Set the character encoding detected by the tokenizer. if isinstance(markup, unicode): @@ -55,7 +73,13 @@ # charEncoding to UTF-8 if it gets Unicode input. doc.original_encoding = None else: - doc.original_encoding = parser.tokenizer.stream.charEncoding[0] + original_encoding = parser.tokenizer.stream.charEncoding[0] + if not isinstance(original_encoding, basestring): + # In 0.99999999 and up, the encoding is an html5lib + # Encoding object. We want to use a string for compatibility + # with other tree builders. 
+ original_encoding = original_encoding.name + doc.original_encoding = original_encoding def create_treebuilder(self, namespaceHTMLElements): self.underlying_builder = TreeBuilderForHtml5lib( @@ -67,7 +91,7 @@ return u'<html><head></head><body>%s</body></html>' % fragment -class TreeBuilderForHtml5lib(html5lib.treebuilders._base.TreeBuilder): +class TreeBuilderForHtml5lib(treebuilder_base.TreeBuilder): def __init__(self, soup, namespaceHTMLElements): self.soup = soup @@ -105,7 +129,7 @@ return self.soup def getFragment(self): - return html5lib.treebuilders._base.TreeBuilder.getFragment(self).element + return treebuilder_base.TreeBuilder.getFragment(self).element class AttrList(object): def __init__(self, element): @@ -137,9 +161,9 @@ return name in list(self.attrs.keys()) -class Element(html5lib.treebuilders._base.Node): +class Element(treebuilder_base.Node): def __init__(self, element, soup, namespace): - html5lib.treebuilders._base.Node.__init__(self, element.name) + treebuilder_base.Node.__init__(self, element.name) self.element = element self.soup = soup self.namespace = namespace @@ -324,7 +348,7 @@ class TextNode(Element): def __init__(self, element, soup): - html5lib.treebuilders._base.Node.__init__(self, None) + treebuilder_base.Node.__init__(self, None) self.element = element self.soup = soup diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/bs4/builder/_htmlparser.py new/beautifulsoup4-4.5.0/bs4/builder/_htmlparser.py --- old/beautifulsoup4-4.4.1/bs4/builder/_htmlparser.py 2015-06-28 21:49:08.000000000 +0200 +++ new/beautifulsoup4-4.5.0/bs4/builder/_htmlparser.py 2016-07-17 21:10:15.000000000 +0200 @@ -1,5 +1,8 @@ """Use the HTMLParser library to parse HTML files that aren't too bad.""" +# Use of this source code is governed by a BSD-style license that can be +# found in the LICENSE file. + __all__ = [ 'HTMLParserTreeBuilder', ] diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/bs4/builder/_lxml.py new/beautifulsoup4-4.5.0/bs4/builder/_lxml.py --- old/beautifulsoup4-4.4.1/bs4/builder/_lxml.py 2015-06-28 21:49:20.000000000 +0200 +++ new/beautifulsoup4-4.5.0/bs4/builder/_lxml.py 2016-07-17 00:35:57.000000000 +0200 @@ -1,3 +1,5 @@ +# Use of this source code is governed by a BSD-style license that can be +# found in the LICENSE file. __all__ = [ 'LXMLTreeBuilderForXML', 'LXMLTreeBuilder', @@ -12,6 +14,7 @@ Doctype, NamespacedAttribute, ProcessingInstruction, + XMLProcessingInstruction, ) from bs4.builder import ( FAST, @@ -103,6 +106,10 @@ # iterate over the encodings, and tell lxml to try to parse # the document as each one in turn. 
is_html = not self.is_xml + if is_html: + self.processing_instruction_class = ProcessingInstruction + else: + self.processing_instruction_class = XMLProcessingInstruction try_encodings = [user_specified_encoding, document_declared_encoding] detector = EncodingDetector( markup, try_encodings, is_html, exclude_encodings) @@ -201,7 +208,7 @@ def pi(self, target, data): self.soup.endData() self.soup.handle_data(target + ' ' + data) - self.soup.endData(ProcessingInstruction) + self.soup.endData(self.processing_instruction_class) def data(self, content): self.soup.handle_data(content) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/bs4/dammit.py new/beautifulsoup4-4.5.0/bs4/dammit.py --- old/beautifulsoup4-4.4.1/bs4/dammit.py 2015-09-29 01:58:41.000000000 +0200 +++ new/beautifulsoup4-4.5.0/bs4/dammit.py 2016-07-17 21:14:33.000000000 +0200 @@ -6,9 +6,10 @@ Feed Parser. It works best on XML and HTML, but it does not rewrite the XML or HTML to reflect a new encoding; that's the tree builder's job. """ +# Use of this source code is governed by a BSD-style license that can be +# found in the LICENSE file. __license__ = "MIT" -from pdb import set_trace import codecs from htmlentitydefs import codepoint2name import re @@ -346,7 +347,7 @@ self.tried_encodings = [] self.contains_replacement_characters = False self.is_html = is_html - + self.log = logging.getLogger(__name__) self.detector = EncodingDetector( markup, override_encodings, is_html, exclude_encodings) @@ -376,9 +377,10 @@ if encoding != "ascii": u = self._convert_from(encoding, "replace") if u is not None: - logging.warning( + self.log.warning( "Some characters could not be decoded, and were " - "replaced with REPLACEMENT CHARACTER.") + "replaced with REPLACEMENT CHARACTER." + ) self.contains_replacement_characters = True break diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/bs4/diagnose.py new/beautifulsoup4-4.5.0/bs4/diagnose.py --- old/beautifulsoup4-4.4.1/bs4/diagnose.py 2015-09-29 01:56:24.000000000 +0200 +++ new/beautifulsoup4-4.5.0/bs4/diagnose.py 2016-07-16 17:27:02.000000000 +0200 @@ -1,5 +1,7 @@ """Diagnostic functions, mainly for use when doing tech support.""" +# Use of this source code is governed by a BSD-style license that can be +# found in the LICENSE file. __license__ = "MIT" import cProfile diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/bs4/element.py new/beautifulsoup4-4.5.0/bs4/element.py --- old/beautifulsoup4-4.4.1/bs4/element.py 2015-09-29 01:56:01.000000000 +0200 +++ new/beautifulsoup4-4.5.0/bs4/element.py 2016-07-20 02:28:09.000000000 +0200 @@ -1,8 +1,10 @@ +# Use of this source code is governed by a BSD-style license that can be +# found in the LICENSE file. __license__ = "MIT" -from pdb import set_trace import collections import re +import shlex import sys import warnings from bs4.dammit import EntitySubstitution @@ -99,6 +101,8 @@ preformatted_tags = set(["pre"]) + preserve_whitespace_tags = set(['pre', 'textarea']) + @classmethod def _substitute_if_appropriate(cls, ns, f): if (isinstance(ns, NavigableString) @@ -169,11 +173,19 @@ This is used when mapping a formatter name ("minimal") to an appropriate function (one that performs entity-substitution on - the contents of <script> and <style> tags, or not). It's + the contents of <script> and <style> tags, or not). 
It can be inefficient, but it should be called very rarely. """ + if self.known_xml is not None: + # Most of the time we will have determined this when the + # document is parsed. + return self.known_xml + + # Otherwise, it's likely that this element was created by + # direct invocation of the constructor from within the user's + # Python code. if self.parent is None: - # This is the top-level object. It should have .is_xml set + # This is the top-level object. It should have .known_xml set # from tree creation. If not, take a guess--BS is usually # used on HTML markup. return getattr(self, 'is_xml', False) @@ -677,6 +689,11 @@ PREFIX = '' SUFFIX = '' + # We can't tell just by looking at a string whether it's contained + # in an XML document or an HTML document. + + known_xml = None + def __new__(cls, value): """Create a new NavigableString. @@ -743,10 +760,16 @@ SUFFIX = u']]>' class ProcessingInstruction(PreformattedString): + """A SGML processing instruction.""" PREFIX = u'<?' SUFFIX = u'>' +class XMLProcessingInstruction(ProcessingInstruction): + """An XML processing instruction.""" + PREFIX = u'<?' + SUFFIX = u'?>' + class Comment(PreformattedString): PREFIX = u'<!--' @@ -781,7 +804,8 @@ """Represents a found HTML tag with its attributes and contents.""" def __init__(self, parser=None, builder=None, name=None, namespace=None, - prefix=None, attrs=None, parent=None, previous=None): + prefix=None, attrs=None, parent=None, previous=None, + is_xml=None): "Basic constructor." if parser is None: @@ -795,6 +819,14 @@ self.name = name self.namespace = namespace self.prefix = prefix + if builder is not None: + preserve_whitespace_tags = builder.preserve_whitespace_tags + else: + if is_xml: + preserve_whitespace_tags = [] + else: + preserve_whitespace_tags = HTMLAwareEntitySubstitution.preserve_whitespace_tags + self.preserve_whitespace_tags = preserve_whitespace_tags if attrs is None: attrs = {} elif attrs: @@ -805,6 +837,13 @@ attrs = dict(attrs) else: attrs = dict(attrs) + + # If possible, determine ahead of time whether this tag is an + # XML tag. + if builder: + self.known_xml = builder.is_xml + else: + self.known_xml = is_xml self.attrs = attrs self.contents = [] self.setup(parent, previous) @@ -824,7 +863,7 @@ Its contents are a copy of the old Tag's contents. """ clone = type(self)(None, self.builder, self.name, self.namespace, - self.nsprefix, self.attrs) + self.nsprefix, self.attrs, is_xml=self._is_xml) for attr in ('can_be_empty_element', 'hidden'): setattr(clone, attr, getattr(self, attr)) for child in self.contents: @@ -997,7 +1036,7 @@ tag_name, tag_name)) return self.find(tag_name) # We special case contents to avoid recursion. 
- elif not tag.startswith("__") and not tag=="contents": + elif not tag.startswith("__") and not tag == "contents": return self.find(tag) raise AttributeError( "'%s' object has no attribute '%s'" % (self.__class__, tag)) @@ -1057,10 +1096,11 @@ def _should_pretty_print(self, indent_level): """Should this tag be pretty-printed?""" + return ( - indent_level is not None and - (self.name not in HTMLAwareEntitySubstitution.preformatted_tags - or self._is_xml)) + indent_level is not None + and self.name not in self.preserve_whitespace_tags + ) def decode(self, indent_level=None, eventual_encoding=DEFAULT_OUTPUT_ENCODING, @@ -1280,6 +1320,7 @@ _selector_combinators = ['>', '+', '~'] _select_debug = False + quoted_colon = re.compile('"[^"]*:[^"]*"') def select_one(self, selector): """Perform a CSS selection operation on the current element.""" value = self.select(selector, limit=1) @@ -1305,8 +1346,7 @@ if limit and len(context) >= limit: break return context - - tokens = selector.split() + tokens = shlex.split(selector) current_context = [self] if tokens[-1] in self._selector_combinators: @@ -1358,7 +1398,7 @@ return classes.issubset(candidate.get('class', [])) checker = classes_match - elif ':' in token: + elif ':' in token and not self.quoted_colon.search(token): # Pseudo-class tag_name, pseudo = token.split(':', 1) if tag_name == '': @@ -1389,11 +1429,8 @@ self.count += 1 if self.count == self.destination: return True - if self.count > self.destination: - # Stop the generator that's sending us - # these things. - raise StopIteration() - return False + else: + return False checker = Counter(pseudo_value).nth_child_of_type else: raise NotImplementedError( @@ -1498,13 +1535,12 @@ # don't include it in the context more than once. new_context.append(candidate) new_context_ids.add(id(candidate)) - if limit and len(new_context) >= limit: - break elif self._select_debug: print " FAILURE %s %s" % (candidate.name, repr(candidate.attrs)) - current_context = new_context + if limit and len(current_context) >= limit: + current_context = current_context[:limit] if self._select_debug: print "Final verdict:" @@ -1668,21 +1704,15 @@ if isinstance(markup, list) or isinstance(markup, tuple): # This should only happen when searching a multi-valued attribute # like 'class'. - if (isinstance(match_against, unicode) - and ' ' in match_against): - # A bit of a special case. If they try to match "foo - # bar" on a multivalue attribute's value, only accept - # the literal value "foo bar" - # - # XXX This is going to be pretty slow because we keep - # splitting match_against. But it shouldn't come up - # too often. - return (whitespace_re.split(match_against) == markup) - else: - for item in markup: - if self._matches(item, match_against): - return True - return False + for item in markup: + if self._matches(item, match_against): + return True + # We didn't match any particular value of the multivalue + # attribute, but maybe we match the attribute value when + # considered as a string. + if self._matches(' '.join(markup), match_against): + return True + return False if match_against is True: # True matches any non-None value. 
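A side note on the element.py hunk above, where the :nth-of-type
counter's `raise StopIteration()` is replaced by `return False`: a
predicate that raises StopIteration leaks the exception into whatever
generator is driving it. A minimal sketch of the hazard, with invented
names (this is illustrative code, not code from the package):

    def is_wanted(value, stop_at):
        # Old-style predicate: abuses StopIteration as control flow.
        if value > stop_at:
            raise StopIteration()
        return value % 2 == 0

    def wanted(items, stop_at):
        for item in items:
            if is_wanted(item, stop_at):  # exception leaks into this generator
                yield item

    # Before Python 3.7 the leaked StopIteration silently ends the
    # generator, truncating the output to [2, 4]; from Python 3.7 on
    # (PEP 479) it is re-raised as RuntimeError. Returning False, as the
    # new code does, keeps the behavior explicit and version-independent.
    print(list(wanted([2, 4, 99, 6], stop_at=10)))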
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/bs4/testing.py new/beautifulsoup4-4.5.0/bs4/testing.py --- old/beautifulsoup4-4.4.1/bs4/testing.py 2015-09-29 01:56:34.000000000 +0200 +++ new/beautifulsoup4-4.5.0/bs4/testing.py 2016-07-17 04:18:57.000000000 +0200 @@ -1,5 +1,7 @@ """Helper classes for tests.""" +# Use of this source code is governed by a BSD-style license that can be +# found in the LICENSE file. __license__ = "MIT" import pickle @@ -215,9 +217,22 @@ self.assertEqual(comment, baz.previous_element) def test_preserved_whitespace_in_pre_and_textarea(self): - """Whitespace must be preserved in <pre> and <textarea> tags.""" - self.assertSoupEquals("<pre> </pre>") - self.assertSoupEquals("<textarea> woo </textarea>") + """Whitespace must be preserved in <pre> and <textarea> tags, + even if that would mean not prettifying the markup. + """ + pre_markup = "<pre> </pre>" + textarea_markup = "<textarea> woo\nwoo </textarea>" + self.assertSoupEquals(pre_markup) + self.assertSoupEquals(textarea_markup) + + soup = self.soup(pre_markup) + self.assertEqual(soup.pre.prettify(), pre_markup) + + soup = self.soup(textarea_markup) + self.assertEqual(soup.textarea.prettify(), textarea_markup) + + soup = self.soup("<textarea></textarea>") + self.assertEqual(soup.textarea.prettify(), "<textarea></textarea>") def test_nested_inline_elements(self): """Inline elements can be nested indefinitely.""" @@ -480,7 +495,9 @@ hebrew_document = b'<html><head><title>Hebrew (ISO 8859-8) in Visual Directionality</title></head><body><h1>Hebrew (ISO 8859-8) in Visual Directionality</h1>\xed\xe5\xec\xf9</body></html>' soup = self.soup( hebrew_document, from_encoding="iso8859-8") - self.assertEqual(soup.original_encoding, 'iso8859-8') + # Some tree builders call it iso8859-8, others call it iso-8859-9. + # That's not a difference we really care about. + assert soup.original_encoding in ('iso8859-8', 'iso-8859-8') self.assertEqual( soup.encode('utf-8'), hebrew_document.decode("iso8859-8").encode("utf-8")) @@ -563,6 +580,11 @@ soup = self.soup(markup) self.assertEqual(markup, soup.encode("utf8")) + def test_processing_instruction(self): + markup = b"""<?xml version="1.0" encoding="utf8"?>\n<?PITarget PIContent?>""" + soup = self.soup(markup) + self.assertEqual(markup, soup.encode("utf8")) + def test_real_xhtml_document(self): """A real XHTML document should come out *exactly* the same as it went in.""" markup = b"""<?xml version="1.0" encoding="utf-8"?> diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/bs4/tests/test_html5lib.py new/beautifulsoup4-4.5.0/bs4/tests/test_html5lib.py --- old/beautifulsoup4-4.4.1/bs4/tests/test_html5lib.py 2015-09-29 01:51:22.000000000 +0200 +++ new/beautifulsoup4-4.5.0/bs4/tests/test_html5lib.py 2016-07-17 17:42:40.000000000 +0200 @@ -84,6 +84,17 @@ self.assertEqual(u"<body><p><em>foo</em></p><em>\n</em><p><em>bar<a></a></em></p>\n</body>", soup.body.decode()) self.assertEqual(2, len(soup.find_all('p'))) + def test_reparented_markup_containing_identical_whitespace_nodes(self): + """Verify that we keep the two whitespace nodes in this + document distinct when reparenting the adjacent <tbody> tags. 
+ """ + markup = '<table> <tbody><tbody><ims></tbody> </table>' + soup = self.soup(markup) + space1, space2 = soup.find_all(string=' ') + tbody1, tbody2 = soup.find_all('tbody') + assert space1.next_element is tbody1 + assert tbody2.next_element is space2 + def test_processing_instruction(self): """Processing instructions become comments.""" markup = b"""<?PITarget PIContent?>""" diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/bs4/tests/test_soup.py new/beautifulsoup4-4.5.0/bs4/tests/test_soup.py --- old/beautifulsoup4-4.4.1/bs4/tests/test_soup.py 2015-07-05 19:19:39.000000000 +0200 +++ new/beautifulsoup4-4.5.0/bs4/tests/test_soup.py 2016-07-16 17:55:30.000000000 +0200 @@ -118,15 +118,34 @@ soup = self.soup(filename) self.assertEqual(0, len(w)) - def test_url_warning(self): - with warnings.catch_warnings(record=True) as w: - soup = self.soup("http://www.crummy.com/") - msg = str(w[0].message) - self.assertTrue("looks like a URL" in msg) + def test_url_warning_with_bytes_url(self): + with warnings.catch_warnings(record=True) as warning_list: + soup = self.soup(b"http://www.crummybytes.com/") + # Be aware this isn't the only warning that can be raised during + # execution.. + self.assertTrue(any("looks like a URL" in str(w.message) + for w in warning_list)) + + def test_url_warning_with_unicode_url(self): + with warnings.catch_warnings(record=True) as warning_list: + # note - this url must differ from the bytes one otherwise + # python's warnings system swallows the second warning + soup = self.soup(u"http://www.crummyunicode.com/") + self.assertTrue(any("looks like a URL" in str(w.message) + for w in warning_list)) + + def test_url_warning_with_bytes_and_space(self): + with warnings.catch_warnings(record=True) as warning_list: + soup = self.soup(b"http://www.crummybytes.com/ is great") + self.assertFalse(any("looks like a URL" in str(w.message) + for w in warning_list)) + + def test_url_warning_with_unicode_and_space(self): + with warnings.catch_warnings(record=True) as warning_list: + soup = self.soup(u"http://www.crummyuncode.com/ is great") + self.assertFalse(any("looks like a URL" in str(w.message) + for w in warning_list)) - with warnings.catch_warnings(record=True) as w: - soup = self.soup("http://www.crummy.com/ is great") - self.assertEqual(0, len(w)) class TestSelectiveParsing(SoupTest): diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/bs4/tests/test_tree.py new/beautifulsoup4-4.5.0/bs4/tests/test_tree.py --- old/beautifulsoup4-4.4.1/bs4/tests/test_tree.py 2015-09-29 01:42:21.000000000 +0200 +++ new/beautifulsoup4-4.5.0/bs4/tests/test_tree.py 2016-07-20 02:51:35.000000000 +0200 @@ -222,6 +222,17 @@ self.assertSelects( tree.find_all(id_matches_name), ["Match 1.", "Match 2."]) + def test_find_with_multi_valued_attribute(self): + soup = self.soup( + "<div class='a b'>1</div><div class='a c'>2</div><div class='a d'>3</div>" + ) + r1 = soup.find('div', 'a d'); + r2 = soup.find('div', re.compile(r'a d')); + r3, r4 = soup.find_all('div', ['a b', 'a d']); + self.assertEqual('3', r1.string) + self.assertEqual('3', r2.string) + self.assertEqual('1', r3.string) + self.assertEqual('3', r4.string) class TestFindAllByAttribute(TreeTest): @@ -294,10 +305,10 @@ f = tree.find_all("gar", class_=re.compile("a")) self.assertSelects(f, ["Found it"]) - # Since the class is not the string "foo bar", but the two - # strings "foo" and "bar", this will not find anything. 
+ # If the search fails to match the individual strings "foo" and "bar", + # it will be tried against the combined string "foo bar". f = tree.find_all("gar", class_=re.compile("o b")) - self.assertSelects(f, []) + self.assertSelects(f, ["Found it"]) def test_find_all_with_non_dictionary_for_attrs_finds_by_class(self): soup = self.soup("<a class='bar'>Found it</a>") @@ -1328,6 +1339,13 @@ copied = copy.deepcopy(self.tree) self.assertEqual(copied.decode(), self.tree.decode()) + def test_copy_preserves_encoding(self): + soup = BeautifulSoup(b'<p> </p>', 'html.parser') + encoding = soup.original_encoding + copy = soup.__copy__() + self.assertEqual(u"<p> </p>", unicode(copy)) + self.assertEqual(encoding, copy.original_encoding) + def test_unicode_pickle(self): # A tree containing Unicode characters can be pickled. html = u"<b>\N{SNOWMAN}</b>" @@ -1676,8 +1694,8 @@ def setUp(self): self.soup = BeautifulSoup(self.HTML, 'html.parser') - def assertSelects(self, selector, expected_ids): - el_ids = [el['id'] for el in self.soup.select(selector)] + def assertSelects(self, selector, expected_ids, **kwargs): + el_ids = [el['id'] for el in self.soup.select(selector, **kwargs)] el_ids.sort() expected_ids.sort() self.assertEqual(expected_ids, el_ids, @@ -1720,6 +1738,13 @@ for selector in ('html div', 'html body div', 'body div'): self.assertSelects(selector, ['data1', 'main', 'inner', 'footer']) + + def test_limit(self): + self.assertSelects('html div', ['main'], limit=1) + self.assertSelects('html body div', ['inner', 'main'], limit=2) + self.assertSelects('body div', ['data1', 'main', 'inner', 'footer'], + limit=10) + def test_tag_no_match(self): self.assertEqual(len(self.soup.select('del')), 0) @@ -1902,6 +1927,14 @@ ('div[data-tag]', ['data1']) ) + def test_quoted_space_in_selector_name(self): + html = """<div style="display: wrong">nope</div> + <div style="display: right">yes</div> + """ + soup = BeautifulSoup(html, 'html.parser') + [chosen] = soup.select('div[style="display: right"]') + self.assertEqual("yes", chosen.string) + def test_unsupported_pseudoclass(self): self.assertRaises( NotImplementedError, self.soup.select, "a:no-such-pseudoclass") diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/doc/source/index.rst new/beautifulsoup4-4.5.0/doc/source/index.rst --- old/beautifulsoup4-4.4.1/doc/source/index.rst 2015-09-29 00:46:53.000000000 +0200 +++ new/beautifulsoup4-4.5.0/doc/source/index.rst 2015-11-24 13:36:12.000000000 +0100 @@ -1649,7 +1649,7 @@ soup.select("title") # [<title>The Dormouse's story</title>] - soup.select("p nth-of-type(3)") + soup.select("p:nth-of-type(3)") # [<p class="story">...</p>] Find tags beneath other tags:: diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.4.1/setup.py new/beautifulsoup4-4.5.0/setup.py --- old/beautifulsoup4-4.4.1/setup.py 2015-09-29 02:11:15.000000000 +0200 +++ new/beautifulsoup4-4.5.0/setup.py 2016-07-20 12:37:28.000000000 +0200 @@ -5,7 +5,7 @@ setup( name="beautifulsoup4", - version = "4.4.1", + version = "4.5.0", author="Leonard Richardson", author_email='leonardr@segfault.org', url="http://www.crummy.com/software/BeautifulSoup/bs4/",