Hello community,

here is the log from the commit of package python-beautifulsoup4 for openSUSE:Factory checked in at 2019-07-30 13:05:12
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/python-beautifulsoup4 (Old)
 and /work/SRC/openSUSE:Factory/.python-beautifulsoup4.new.4126 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Package is "python-beautifulsoup4"

Tue Jul 30 13:05:12 2019 rev:29 rq:717648 version:4.8.0

Changes:
--------
--- /work/SRC/openSUSE:Factory/python-beautifulsoup4/python-beautifulsoup4.changes 2019-03-04 09:11:05.132700786 +0100
+++ /work/SRC/openSUSE:Factory/.python-beautifulsoup4.new.4126/python-beautifulsoup4.changes 2019-07-30 13:05:15.146390127 +0200
@@ -1,0 +2,23 @@
+Mon Jul 22 16:18:23 UTC 2019 - Todd R <toddrme2178@gmail.com>
+
+- Update to 4.8.0
+  * It's now possible to customize the TreeBuilder object by passing
+    keyword arguments into the BeautifulSoup constructor. The main
+    reason to do this right now is to change how which attributes are
+    treated as multi-valued attributes (the way 'class' is treated by
+    default). You can do this with the `multi_valued_attributes` argument.
+  * The role of Formatter objects has been greatly expanded. The Formatter
+    class now controls the following:
+    > The function to call to perform entity substitution. (This was
+      previously Formatter's only job.)
+    > Which tags should be treated as containing CDATA and have their
+      contents exempt from entity substitution.
+    > The order in which a tag's attributes are output.
+    > Whether or not to put a '/' inside a void element, e.g. '<br/>' vs '<br>'
+    All preexisting code should work as before.
+  * Added a new method to the API, Tag.smooth(), which consolidates
+    multiple adjacent NavigableString elements.
+  * &apos; (which is valid in XML, XHTML, and HTML 5, but not HTML 4) is now
+    recognized as a named entity and converted to a single quote.
+
+-------------------------------------------------------------------

Old:
----
  beautifulsoup4-4.7.1.tar.gz

New:
----
  beautifulsoup4-4.8.0.tar.gz

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Other differences:
------------------
++++++ python-beautifulsoup4.spec ++++++
--- /var/tmp/diff_new_pack.SDgoFB/_old 2019-07-30 13:05:15.818389950 +0200
+++ /var/tmp/diff_new_pack.SDgoFB/_new 2019-07-30 13:05:15.822389949 +0200
@@ -18,7 +18,7 @@
 %{?!python_module:%define python_module() python-%{**} python3-%{**}}
 Name: python-beautifulsoup4
-Version: 4.7.1
+Version: 4.8.0
 Release: 0
 Summary: HTML/XML Parser for Quick-Turnaround Applications Like Screen-Scraping
 License: MIT

++++++ beautifulsoup4-4.7.1.tar.gz -> beautifulsoup4-4.8.0.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/NEWS.txt new/beautifulsoup4-4.8.0/NEWS.txt
--- old/beautifulsoup4-4.7.1/NEWS.txt 2019-01-07 01:36:52.000000000 +0100
+++ new/beautifulsoup4-4.8.0/NEWS.txt 2019-07-20 01:41:41.000000000 +0200
@@ -1,3 +1,30 @@
+= 4.8.0 (20190720, "One Small Soup")
+
+* It's now possible to customize the TreeBuilder object by passing
+  keyword arguments into the BeautifulSoup constructor. The main
+  reason to do this right now is to change how which attributes are
+  treated as multi-valued attributes (the way 'class' is treated by
+  default). You can do this with the `multi_valued_attributes` argument.
+  [bug=1832978]
+
+* The role of Formatter objects has been greatly expanded. The Formatter
+  class now controls the following:
+
+  - The function to call to perform entity substitution. (This was
+    previously Formatter's only job.)
+  - Which tags should be treated as containing CDATA and have their
+    contents exempt from entity substitution.
+  - The order in which a tag's attributes are output. [bug=1812422]
+  - Whether or not to put a '/' inside a void element, e.g. '<br/>' vs '<br>'
+
+  All preexisting code should work as before.
+
+* Added a new method to the API, Tag.smooth(), which consolidates
+  multiple adjacent NavigableString elements.
+
+* &apos; (which is valid in XML, XHTML, and HTML 5, but not HTML 4) is now
+  recognized as a named entity and converted to a single quote. [bug=1818721]
+
 = 4.7.1 (20190106)

 * Fixed a significant performance problem introduced in 4.7.0. [bug=1810617]
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/PKG-INFO new/beautifulsoup4-4.8.0/PKG-INFO
--- old/beautifulsoup4-4.7.1/PKG-INFO 2019-01-07 01:51:37.000000000 +0100
+++ new/beautifulsoup4-4.8.0/PKG-INFO 2019-07-20 13:29:22.000000000 +0200
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: beautifulsoup4
-Version: 4.7.1
+Version: 4.8.0
 Summary: Screen-scraping library
 Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
 Author: Leonard Richardson
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/beautifulsoup4.egg-info/PKG-INFO new/beautifulsoup4-4.8.0/beautifulsoup4.egg-info/PKG-INFO
--- old/beautifulsoup4-4.7.1/beautifulsoup4.egg-info/PKG-INFO 2019-01-07 01:51:37.000000000 +0100
+++ new/beautifulsoup4-4.8.0/beautifulsoup4.egg-info/PKG-INFO 2019-07-20 13:29:22.000000000 +0200
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: beautifulsoup4
-Version: 4.7.1
+Version: 4.8.0
 Summary: Screen-scraping library
 Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
 Author: Leonard Richardson
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/beautifulsoup4.egg-info/SOURCES.txt new/beautifulsoup4-4.8.0/beautifulsoup4.egg-info/SOURCES.txt
--- old/beautifulsoup4-4.7.1/beautifulsoup4.egg-info/SOURCES.txt 2019-01-07 01:51:37.000000000 +0100
+++ new/beautifulsoup4-4.8.0/beautifulsoup4.egg-info/SOURCES.txt 2019-07-20 13:29:22.000000000 +0200
@@ -17,6 +17,7 @@
bs4/dammit.py bs4/diagnose.py bs4/element.py +bs4/formatter.py bs4/testing.py bs4/builder/__init__.py bs4/builder/_html5lib.py diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/__init__.py new/beautifulsoup4-4.8.0/bs4/__init__.py --- old/beautifulsoup4-4.7.1/bs4/__init__.py 2019-01-07 01:50:44.000000000 +0100 +++ new/beautifulsoup4-4.8.0/bs4/__init__.py 2019-07-17 03:31:51.000000000 +0200 @@ -18,7 +18,7 @@ """ __author__ = "Leonard Richardson (leonardr@segfault.org)" -__version__ = "4.7.1" +__version__ = "4.8.0" __copyright__ = "Copyright (c) 2004-2019 Leonard Richardson" # Use of this source code is governed by the MIT license. __license__ = "MIT" @@ -98,8 +98,10 @@ name a specific parser, so that Beautiful Soup gives you the same results across platforms and virtual environments. - :param builder: A specific TreeBuilder to use instead of looking one - up based on `features`. You shouldn't need to use this. + :param builder: A TreeBuilder subclass to instantiate (or + instance to use) instead of looking one up based on + `features`. You only need to use this if you've implemented a + custom TreeBuilder. :param parse_only: A SoupStrainer. Only parts of the document matching the SoupStrainer will be considered. This is useful @@ -118,11 +120,17 @@ :param kwargs: For backwards compatibility purposes, the constructor accepts certain keyword arguments used in Beautiful Soup 3. None of these arguments do anything in - Beautiful Soup 4 and there's no need to actually pass keyword - arguments into the constructor. + Beautiful Soup 4; they will result in a warning and then be ignored. + + Apart from this, any keyword arguments passed into the BeautifulSoup + constructor are propagated to the TreeBuilder constructor. This + makes it possible to configure a TreeBuilder beyond saying + which one to use. 
+ """ if 'convertEntities' in kwargs: + del kwargs['convertEntities'] warnings.warn( "BS4 does not respect the convertEntities argument to the " "BeautifulSoup constructor. Entities are always converted " @@ -177,13 +185,17 @@ warnings.warn("You provided Unicode markup but also provided a value for from_encoding. Your from_encoding will be ignored.") from_encoding = None - if len(kwargs) > 0: - arg = kwargs.keys().pop() - raise TypeError( - "__init__() got an unexpected keyword argument '%s'" % arg) - - if builder is None: - original_features = features + # We need this information to track whether or not the builder + # was specified well enough that we can omit the 'you need to + # specify a parser' warning. + original_builder = builder + original_features = features + + if isinstance(builder, type): + # A builder class was passed in; it needs to be instantiated. + builder_class = builder + builder = None + elif builder is None: if isinstance(features, basestring): features = [features] if features is None or len(features) == 0: @@ -194,9 +206,16 @@ "Couldn't find a tree builder with the features you " "requested: %s. Do you need to install a parser library?" % ",".join(features)) - builder = builder_class() - if not (original_features == builder.NAME or - original_features in builder.ALTERNATE_NAMES): + + # At this point either we have a TreeBuilder instance in + # builder, or we have a builder_class that we can instantiate + # with the remaining **kwargs. + if builder is None: + builder = builder_class(**kwargs) + if not original_builder and not ( + original_features == builder.NAME or + original_features in builder.ALTERNATE_NAMES + ): if builder.is_xml: markup_type = "XML" else: @@ -231,7 +250,10 @@ markup_type=markup_type ) warnings.warn(self.NO_PARSER_SPECIFIED_WARNING % values, stacklevel=2) - + else: + if kwargs: + warnings.warn("Keyword arguments to the BeautifulSoup constructor will be ignored. 
These would normally be passed into the TreeBuilder constructor, but a TreeBuilder instance was passed in as `builder`.") + self.builder = builder self.is_xml = builder.is_xml self.known_xml = self.is_xml diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/builder/__init__.py new/beautifulsoup4-4.8.0/bs4/builder/__init__.py --- old/beautifulsoup4-4.7.1/bs4/builder/__init__.py 2018-12-31 02:49:46.000000000 +0100 +++ new/beautifulsoup4-4.8.0/bs4/builder/__init__.py 2019-07-14 22:16:04.000000000 +0200 @@ -7,7 +7,6 @@ from bs4.element import ( CharsetMetaAttributeValue, ContentMetaAttributeValue, - HTMLAwareEntitySubstitution, nonwhitespace_re ) @@ -90,18 +89,40 @@ is_xml = False picklable = False - preserve_whitespace_tags = set() empty_element_tags = None # A tag will be considered an empty-element # tag when and only when it has no contents. # A value for these tag/attribute combinations is a space- or # comma-separated list of CDATA, rather than a single CDATA. - cdata_list_attributes = {} + DEFAULT_CDATA_LIST_ATTRIBUTES = {} + DEFAULT_PRESERVE_WHITESPACE_TAGS = set() + + USE_DEFAULT = object() + + def __init__(self, multi_valued_attributes=USE_DEFAULT, preserve_whitespace_tags=USE_DEFAULT): + """Constructor. - def __init__(self): - self.soup = None + :param multi_valued_attributes: If this is set to None, the + TreeBuilder will not turn any values for attributes like + 'class' into lists. Setting this do a dictionary will + customize this behavior; look at DEFAULT_CDATA_LIST_ATTRIBUTES + for an example. + + Internally, these are called "CDATA list attributes", but that + probably doesn't make sense to an end-user, so the argument name + is `multi_valued_attributes`. 
+ :param preserve_whitespace_tags: + """ + self.soup = None + if multi_valued_attributes is self.USE_DEFAULT: + multi_valued_attributes = self.DEFAULT_CDATA_LIST_ATTRIBUTES + self.cdata_list_attributes = multi_valued_attributes + if preserve_whitespace_tags is self.USE_DEFAULT: + preserve_whitespace_tags = self.DEFAULT_PRESERVE_WHITESPACE_TAGS + self.preserve_whitespace_tags = preserve_whitespace_tags + def initialize_soup(self, soup): """The BeautifulSoup object has been initialized and is now being associated with the TreeBuilder. @@ -131,7 +152,7 @@ if self.empty_element_tags is None: return True return tag_name in self.empty_element_tags - + def feed(self, markup): raise NotImplementedError() @@ -237,7 +258,6 @@ Such as which tags are empty-element tags. """ - preserve_whitespace_tags = HTMLAwareEntitySubstitution.preserve_whitespace_tags empty_element_tags = set([ # These are from HTML5. 'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen', 'link', 'menuitem', 'meta', 'param', 'source', 'track', 'wbr', @@ -259,7 +279,7 @@ # encounter one of these attributes, we will parse its value into # a list of values if possible. Upon output, the list will be # converted back into a string. 
- cdata_list_attributes = { + DEFAULT_CDATA_LIST_ATTRIBUTES = { "*" : ['class', 'accesskey', 'dropzone'], "a" : ['rel', 'rev'], "link" : ['rel', 'rev'], @@ -276,6 +296,8 @@ "output" : ["for"], } + DEFAULT_PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea']) + def set_up_substitutions(self, tag): # We are only interested in <meta> tags if tag.name != 'meta': diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/builder/_html5lib.py new/beautifulsoup4-4.8.0/bs4/builder/_html5lib.py --- old/beautifulsoup4-4.7.1/bs4/builder/_html5lib.py 2018-12-31 02:50:27.000000000 +0100 +++ new/beautifulsoup4-4.8.0/bs4/builder/_html5lib.py 2019-07-08 03:59:55.000000000 +0200 @@ -199,7 +199,7 @@ def __setitem__(self, name, value): # If this attribute is a multi-valued attribute for this element, # turn its value into a list. - list_attr = HTML5TreeBuilder.cdata_list_attributes + list_attr = self.element.cdata_list_attributes if (name in list_attr['*'] or (self.element.name in list_attr and name in list_attr[self.element.name])): diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/builder/_htmlparser.py new/beautifulsoup4-4.8.0/bs4/builder/_htmlparser.py --- old/beautifulsoup4-4.7.1/bs4/builder/_htmlparser.py 2018-12-24 16:32:39.000000000 +0100 +++ new/beautifulsoup4-4.8.0/bs4/builder/_htmlparser.py 2019-07-07 20:09:37.000000000 +0200 @@ -214,12 +214,15 @@ NAME = HTMLPARSER features = [NAME, HTML, STRICT] - def __init__(self, *args, **kwargs): + def __init__(self, parser_args=None, parser_kwargs=None, **kwargs): + super(HTMLParserTreeBuilder, self).__init__(**kwargs) + parser_args = parser_args or [] + parser_kwargs = parser_kwargs or {} if CONSTRUCTOR_TAKES_STRICT and not CONSTRUCTOR_STRICT_IS_DEPRECATED: - kwargs['strict'] = False + parser_kwargs['strict'] = False if CONSTRUCTOR_TAKES_CONVERT_CHARREFS: - kwargs['convert_charrefs'] = False - self.parser_args 
= (args, kwargs) + parser_kwargs['convert_charrefs'] = False + self.parser_args = (parser_args, parser_kwargs) def prepare_markup(self, markup, user_specified_encoding=None, document_declared_encoding=None, exclude_encodings=None): diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/builder/_lxml.py new/beautifulsoup4-4.8.0/bs4/builder/_lxml.py --- old/beautifulsoup4-4.7.1/bs4/builder/_lxml.py 2019-01-07 00:41:32.000000000 +0100 +++ new/beautifulsoup4-4.8.0/bs4/builder/_lxml.py 2019-07-08 03:59:55.000000000 +0200 @@ -94,7 +94,7 @@ parser = parser(target=self, strip_cdata=False, encoding=encoding) return parser - def __init__(self, parser=None, empty_element_tags=None): + def __init__(self, parser=None, empty_element_tags=None, **kwargs): # TODO: Issue a warning if parser is present but not a # callable, since that means there's no way to create new # parsers for different encodings. @@ -103,6 +103,7 @@ self.empty_element_tags = set(empty_element_tags) self.soup = None self.nsmaps = [self.DEFAULT_NSMAPS_INVERTED] + super(LXMLTreeBuilderForXML, self).__init__(**kwargs) def _getNsTag(self, tag): # Split the namespace URL out of a fully-qualified lxml tag diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/dammit.py new/beautifulsoup4-4.8.0/bs4/dammit.py --- old/beautifulsoup4-4.7.1/bs4/dammit.py 2018-12-24 16:31:48.000000000 +0100 +++ new/beautifulsoup4-4.8.0/bs4/dammit.py 2019-07-08 03:45:20.000000000 +0200 @@ -57,15 +57,24 @@ lookup = {} reverse_lookup = {} characters_for_re = [] - for codepoint, name in list(codepoint2name.items()): + + # &apos is an XHTML entity and an HTML 5, but not an HTML 4 + # entity. We don't want to use it, but we want to recognize it on the way in. + # + # TODO: Ideally we would be able to recognize all HTML 5 named + # entities, but that's a little tricky. 
+ extra = [(39, 'apos')] + for codepoint, name in list(codepoint2name.items()) + extra: character = unichr(codepoint) - if codepoint != 34: + if codepoint not in (34, 39): # There's no point in turning the quotation mark into - # &quot;, unless it happens within an attribute value, which - # is handled elsewhere. + # &quot; or the single quote into &apos;, unless it + # happens within an attribute value, which is handled + # elsewhere. characters_for_re.append(character) lookup[character] = name - # But we do want to turn &quot; into the quotation mark. + # But we do want to recognize those entities on the way in and + # convert them to Unicode characters. reverse_lookup[name] = character re_definition = "[%s]" % "".join(characters_for_re) return lookup, reverse_lookup, re.compile(re_definition) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/element.py new/beautifulsoup4-4.8.0/bs4/element.py --- old/beautifulsoup4-4.7.1/bs4/element.py 2019-01-07 01:35:05.000000000 +0100 +++ new/beautifulsoup4-4.8.0/bs4/element.py 2019-07-16 22:46:05.000000000 +0200 @@ -16,7 +16,11 @@ 'The soupsieve package is not installed. CSS selectors cannot be used.' ) -from bs4.dammit import EntitySubstitution +from bs4.formatter import ( + Formatter, + HTMLFormatter, + XMLFormatter, +) DEFAULT_OUTPUT_ENCODING = "utf-8" PY3K = (sys.version_info[0] > 2) @@ -99,138 +103,71 @@ return match.group(1) + encoding return self.CHARSET_RE.sub(rewrite, self.original_value) -class HTMLAwareEntitySubstitution(EntitySubstitution): - - """Entity substitution rules that are aware of some HTML quirks. - - Specifically, the contents of <script> and <style> tags should not - undergo entity substitution. - - Incoming NavigableString objects are checked to see if they're the - direct children of a <script> or <style> tag. 
- """ - - cdata_containing_tags = set(["script", "style"]) - - preformatted_tags = set(["pre"]) - - preserve_whitespace_tags = set(['pre', 'textarea']) - - @classmethod - def _substitute_if_appropriate(cls, ns, f): - if (isinstance(ns, NavigableString) - and ns.parent is not None - and ns.parent.name in cls.cdata_containing_tags): - # Do nothing. - return ns - # Substitute. - return f(ns) - - @classmethod - def substitute_html(cls, ns): - return cls._substitute_if_appropriate( - ns, EntitySubstitution.substitute_html) - - @classmethod - def substitute_xml(cls, ns): - return cls._substitute_if_appropriate( - ns, EntitySubstitution.substitute_xml) - -class Formatter(object): - """Contains information about how to format a parse tree.""" - - # By default, represent void elements as <tag/> rather than <tag> - void_element_close_prefix = '/' - - def substitute_entities(self, *args, **kwargs): - """Transform certain characters into named entities.""" - raise NotImplementedError() - -class HTMLFormatter(Formatter): - """The default HTML formatter.""" - def substitute(self, *args, **kwargs): - return HTMLAwareEntitySubstitution.substitute_html(*args, **kwargs) - -class MinimalHTMLFormatter(Formatter): - """A minimal HTML formatter.""" - def substitute(self, *args, **kwargs): - return HTMLAwareEntitySubstitution.substitute_xml(*args, **kwargs) - -class HTML5Formatter(HTMLFormatter): - """An HTML formatter that omits the slash in a void tag.""" - void_element_close_prefix = None - -class XMLFormatter(Formatter): - """Substitute only the essential XML entities.""" - def substitute(self, *args, **kwargs): - return EntitySubstitution.substitute_xml(*args, **kwargs) - -class HTMLXMLFormatter(Formatter): - """Format XML using HTML rules.""" - def substitute(self, *args, **kwargs): - return HTMLAwareEntitySubstitution.substitute_html(*args, **kwargs) - class PageElement(object): """Contains the navigational information for some part of the page (either a tag or a piece of text)""" 
+ + def setup(self, parent=None, previous_element=None, next_element=None, + previous_sibling=None, next_sibling=None): + """Sets up the initial relations between this element and + other elements.""" + self.parent = parent + + self.previous_element = previous_element + if previous_element is not None: + self.previous_element.next_element = self + + self.next_element = next_element + if self.next_element is not None: + self.next_element.previous_element = self - # There are five possible values for the "formatter" argument passed in - # to methods like encode() and prettify(): - # - # "html" - All Unicode characters with corresponding HTML entities - # are converted to those entities on output. - # "html5" - The same as "html", but empty void tags are represented as - # <tag> rather than <tag/> - # "minimal" - Bare ampersands and angle brackets are converted to - # XML entities: &amp; &lt; &gt; - # None - The null formatter. Unicode characters are never - # converted to entities. This is not recommended, but it's - # faster than "minimal". - # A callable function - it will be called on every string that needs to undergo entity substitution. - # A Formatter instance - Formatter.substitute(string) will be called on every string that - # needs to undergo entity substitution. - # - - # In an HTML document, the default "html", "html5", and "minimal" - # functions will leave the contents of <script> and <style> tags - # alone. For an XML document, all tags will be given the same - # treatment. 
- - HTML_FORMATTERS = { - "html" : HTMLFormatter(), - "html5" : HTML5Formatter(), - "minimal" : MinimalHTMLFormatter(), - None : None - } - - XML_FORMATTERS = { - "html" : HTMLXMLFormatter(), - "minimal" : XMLFormatter(), - None : None - } + self.next_sibling = next_sibling + if self.next_sibling is not None: + self.next_sibling.previous_sibling = self - def format_string(self, s, formatter='minimal'): + if (previous_sibling is None + and self.parent is not None and self.parent.contents): + previous_sibling = self.parent.contents[-1] + + self.previous_sibling = previous_sibling + if previous_sibling is not None: + self.previous_sibling.next_sibling = self + + def format_string(self, s, formatter): """Format the given string using the given formatter.""" - if isinstance(formatter, basestring): - formatter = self._formatter_for_name(formatter) if formatter is None: - output = s - else: - if isinstance(formatter, Callable): - # Backwards compatibility -- you used to pass in a formatting method. - output = formatter(s) - else: - output = formatter.substitute(s) + return s + if not isinstance(formatter, Formatter): + formatter = self.formatter_for_name(formatter) + output = formatter.substitute(s) return output + def formatter_for_name(self, formatter): + """Look up or create a Formatter for the given identifier, + if necessary. + + :param formatter: Can be a Formatter object (used as-is), a + function (used as the entity substitution hook for an + XMLFormatter or HTMLFormatter), or a string (used to look up + an XMLFormatter or HTMLFormatter in the appropriate registry. + """ + if isinstance(formatter, Formatter): + return formatter + if self._is_xml: + c = XMLFormatter + else: + c = HTMLFormatter + if callable(formatter): + return c(entity_substitution=formatter) + return c.REGISTRY[formatter] + @property def _is_xml(self): """Is this element part of an XML tree or an HTML tree? 
- This is used when mapping a formatter name ("minimal") to an - appropriate function (one that performs entity-substitution on - the contents of <script> and <style> tags, or not). It can be + This is used in formatter_for_name, when deciding whether an + XMLFormatter or HTMLFormatter is more appropriate. It can be inefficient, but it should be called very rarely. """ if self.known_xml is not None: @@ -248,46 +185,13 @@ return getattr(self, 'is_xml', False) return self.parent._is_xml - def _formatter_for_name(self, name): - "Look up a formatter function based on its name and the tree." - if self._is_xml: - return self.XML_FORMATTERS.get(name, XMLFormatter()) - else: - return self.HTML_FORMATTERS.get(name, HTMLFormatter()) - - def setup(self, parent=None, previous_element=None, next_element=None, - previous_sibling=None, next_sibling=None): - """Sets up the initial relations between this element and - other elements.""" - self.parent = parent - - self.previous_element = previous_element - if previous_element is not None: - self.previous_element.next_element = self - - self.next_element = next_element - if self.next_element is not None: - self.next_element.previous_element = self - - self.next_sibling = next_sibling - if self.next_sibling is not None: - self.next_sibling.previous_sibling = self - - if (previous_sibling is None - and self.parent is not None and self.parent.contents): - previous_sibling = self.parent.contents[-1] - - self.previous_sibling = previous_sibling - if previous_sibling is not None: - self.previous_sibling.next_sibling = self - nextSibling = _alias("next_sibling") # BS3 previousSibling = _alias("previous_sibling") # BS3 def replace_with(self, replace_with): if self.parent is None: raise ValueError( - "Cannot replace one element with another when the" + "Cannot replace one element with another when the " "element to be replaced is not part of a tree.") if replace_with is self: return @@ -742,6 +646,7 @@ self.__class__.__name__, attr)) def 
output_ready(self, formatter="minimal"): + """Run the string through the provided formatter.""" output = self.format_string(self, formatter) return self.PREFIX + output + self.SUFFIX @@ -760,10 +665,12 @@ but the return value will be ignored. """ - def output_ready(self, formatter="minimal"): - """CData strings are passed into the formatter. - But the return value is ignored.""" - self.format_string(self, formatter) + def output_ready(self, formatter=None): + """CData strings are passed into the formatter, purely + for any side effects. The return value is ignored. + """ + if formatter is not None: + ignore = self.format_string(self, formatter) return self.PREFIX + self + self.SUFFIX class CData(PreformattedString): @@ -831,14 +738,6 @@ self.name = name self.namespace = namespace self.prefix = prefix - if builder is not None: - preserve_whitespace_tags = builder.preserve_whitespace_tags - else: - if is_xml: - preserve_whitespace_tags = [] - else: - preserve_whitespace_tags = HTMLAwareEntitySubstitution.preserve_whitespace_tags - self.preserve_whitespace_tags = preserve_whitespace_tags if attrs is None: attrs = {} elif attrs: @@ -861,12 +760,31 @@ self.setup(parent, previous) self.hidden = False - # Set up any substitutions, such as the charset in a META tag. - if builder is not None: + if builder is None: + # In the absence of a TreeBuilder, assume this tag is nothing + # special. + self.can_be_empty_element = False + self.cdata_list_attributes = None + else: + # Set up any substitutions for this tag, such as the charset in a META tag. builder.set_up_substitutions(self) + + # Ask the TreeBuilder whether this tag might be an empty-element tag. self.can_be_empty_element = builder.can_be_empty_element(name) - else: - self.can_be_empty_element = False + + # Keep track of the list of attributes of this tag that + # might need to be treated as a list. + # + # For performance reasons, we store the whole data structure + # rather than asking the question of every tag. 
Asking would + # require building a new data structure every time, and + # (unlike can_be_empty_element), we almost never need + # to check this. + self.cdata_list_attributes = builder.cdata_list_attributes + + # Keep track of the names that might cause this tag to be treated as a + # whitespace-preserved tag. + self.preserve_whitespace_tags = builder.preserve_whitespace_tags parserClass = _alias("parser_class") # BS3 @@ -981,6 +899,43 @@ for element in self.contents[:]: element.extract() + def smooth(self): + """Smooth out this element's children by consolidating consecutive strings. + + This makes pretty-printed output look more natural following a + lot of operations that modified the tree. + """ + # Mark the first position of every pair of children that need + # to be consolidated. Do this rather than making a copy of + # self.contents, since in most cases very few strings will be + # affected. + marked = [] + for i, a in enumerate(self.contents): + if isinstance(a, Tag): + # Recursively smooth children. + a.smooth() + if i == len(self.contents)-1: + # This is the last item in .contents, and it's not a + # tag. There's no chance it needs any work. + continue + b = self.contents[i+1] + if (isinstance(a, NavigableString) + and isinstance(b, NavigableString) + and not isinstance(a, PreformattedString) + and not isinstance(b, PreformattedString) + ): + marked.append(i) + + # Go over the marked positions in reverse order, so that + # removing items from .contents won't affect the remaining + # positions. + for i in reversed(marked): + a = self.contents[i] + b = self.contents[i+1] + b.extract() + n = NavigableString(a+b) + a.replace_with(n) + def index(self, element): """ Find the index of a child by identity, not value. 
Avoids issues with @@ -1115,14 +1070,6 @@ u = self.decode(indent_level, encoding, formatter) return u.encode(encoding, errors) - def _should_pretty_print(self, indent_level): - """Should this tag be pretty-printed?""" - - return ( - indent_level is not None - and self.name not in self.preserve_whitespace_tags - ) - def decode(self, indent_level=None, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter="minimal"): @@ -1136,30 +1083,32 @@ encoding. """ - # First off, turn a string formatter into a Formatter object. This - # will stop the lookup from happening over and over again. - if not isinstance(formatter, Formatter) and not isinstance(formatter, Callable): - formatter = self._formatter_for_name(formatter) + # First off, turn a non-Formatter `formatter` into a Formatter + # object. This will stop the lookup from happening over and + # over again. + if not isinstance(formatter, Formatter): + formatter = self.formatter_for_name(formatter) + attributes = formatter.attributes(self) attrs = [] - if self.attrs: - for key, val in sorted(self.attrs.items()): - if val is None: - decoded = key - else: - if isinstance(val, list) or isinstance(val, tuple): - val = ' '.join(val) - elif not isinstance(val, basestring): - val = unicode(val) - elif ( + for key, val in attributes: + if val is None: + decoded = key + else: + if isinstance(val, list) or isinstance(val, tuple): + val = ' '.join(val) + elif not isinstance(val, basestring): + val = unicode(val) + elif ( isinstance(val, AttributeValueWithCharsetSubstitution) - and eventual_encoding is not None): - val = val.encode(eventual_encoding) - - text = self.format_string(val, formatter) - decoded = ( - unicode(key) + '=' - + EntitySubstitution.quoted_attribute_value(text)) - attrs.append(decoded) + and eventual_encoding is not None + ): + val = val.encode(eventual_encoding) + + text = formatter.attribute_value(val) + decoded = ( + unicode(key) + '=' + + formatter.quoted_attribute_value(text)) + attrs.append(decoded) close = '' 
closeTag = '' @@ -1168,9 +1117,7 @@ prefix = self.prefix + ":" if self.is_empty_element: - close = '' - if isinstance(formatter, Formatter): - close = formatter.void_element_close_prefix or close + close = formatter.void_element_close_prefix or '' else: closeTag = '</%s%s>' % (prefix, self.name) @@ -1185,7 +1132,8 @@ else: indent_contents = None contents = self.decode_contents( - indent_contents, eventual_encoding, formatter) + indent_contents, eventual_encoding, formatter + ) if self.hidden: # This is the 'document root' object. @@ -1217,6 +1165,13 @@ s = ''.join(s) return s + def _should_pretty_print(self, indent_level): + """Should this tag be pretty-printed?""" + return ( + indent_level is not None + and self.name not in self.preserve_whitespace_tags + ) + def prettify(self, encoding=None, formatter="minimal"): if encoding is None: return self.decode(True, formatter=formatter) @@ -1232,19 +1187,19 @@ indented this many spaces. :param eventual_encoding: The tag is destined to be - encoded into this encoding. This method is _not_ + encoded into this encoding. decode_contents() is _not_ responsible for performing that encoding. This information is passed in so that it can be substituted in if the document contains a <META> tag that mentions the document's encoding. - :param formatter: The output formatter responsible for converting - entities to Unicode characters. + :param formatter: A Formatter object, or a string naming one of + the standard Formatters. """ # First off, turn a string formatter into a Formatter object. This # will stop the lookup from happening over and over again. 
- if not isinstance(formatter, Formatter) and not isinstance(formatter, Callable): - formatter = self._formatter_for_name(formatter) + if not isinstance(formatter, Formatter): + formatter = self.formatter_for_name(formatter) pretty_print = (indent_level is not None) s = [] @@ -1255,16 +1210,19 @@ elif isinstance(c, Tag): s.append(c.decode(indent_level, eventual_encoding, formatter)) - if text and indent_level and not self.name == 'pre': + preserve_whitespace = ( + self.preserve_whitespace_tags and self.name in self.preserve_whitespace_tags + ) + if text and indent_level and not preserve_whitespace: text = text.strip() if text: - if pretty_print and not self.name == 'pre': + if pretty_print and not preserve_whitespace: s.append(" " * (indent_level - 1)) s.append(text) - if pretty_print and not self.name == 'pre': + if pretty_print and not preserve_whitespace: s.append("\n") return ''.join(s) - + def encode_contents( self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING, formatter="minimal"): diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/formatter.py new/beautifulsoup4-4.8.0/bs4/formatter.py --- old/beautifulsoup4-4.7.1/bs4/formatter.py 1970-01-01 01:00:00.000000000 +0100 +++ new/beautifulsoup4-4.8.0/bs4/formatter.py 2019-07-16 22:46:05.000000000 +0200 @@ -0,0 +1,99 @@ +from bs4.dammit import EntitySubstitution + +class Formatter(EntitySubstitution): + """Describes a strategy to use when outputting a parse tree to a string. + + Some parts of this strategy come from the distinction between + HTML4, HTML5, and XML. Others are configurable by the user. + """ + # Registries of XML and HTML formatters. 
+ XML_FORMATTERS = {} + HTML_FORMATTERS = {} + + HTML = 'html' + XML = 'xml' + + HTML_DEFAULTS = dict( + cdata_containing_tags=set(["script", "style"]), + ) + + def _default(self, language, value, kwarg): + if value is not None: + return value + if language == self.XML: + return set() + return self.HTML_DEFAULTS[kwarg] + + def __init__( + self, language=None, entity_substitution=None, + void_element_close_prefix='/', cdata_containing_tags=None, + ): + """ + + :param void_element_close_prefix: By default, represent void + elements as <tag/> rather than <tag> + """ + self.language = language + self.entity_substitution = entity_substitution + self.void_element_close_prefix = void_element_close_prefix + self.cdata_containing_tags = self._default( + language, cdata_containing_tags, 'cdata_containing_tags' + ) + + def substitute(self, ns): + """Process a string that needs to undergo entity substitution.""" + if not self.entity_substitution: + return ns + from element import NavigableString + if (isinstance(ns, NavigableString) + and ns.parent is not None + and ns.parent.name in self.cdata_containing_tags): + # Do nothing. + return ns + # Substitute. + return self.entity_substitution(ns) + + def attribute_value(self, value): + """Process the value of an attribute.""" + return self.substitute(value) + + def attributes(self, tag): + """Reorder a tag's attributes however you want.""" + return sorted(tag.attrs.items()) + + +class HTMLFormatter(Formatter): + REGISTRY = {} + def __init__(self, *args, **kwargs): + return super(HTMLFormatter, self).__init__(self.HTML, *args, **kwargs) + + +class XMLFormatter(Formatter): + REGISTRY = {} + def __init__(self, *args, **kwargs): + return super(XMLFormatter, self).__init__(self.XML, *args, **kwargs) + + +# Set up aliases for the default formatters. 
+HTMLFormatter.REGISTRY['html'] = HTMLFormatter( + entity_substitution=EntitySubstitution.substitute_html +) +HTMLFormatter.REGISTRY["html5"] = HTMLFormatter( + entity_substitution=EntitySubstitution.substitute_html, + void_element_close_prefix = None +) +HTMLFormatter.REGISTRY["minimal"] = HTMLFormatter( + entity_substitution=EntitySubstitution.substitute_xml +) +HTMLFormatter.REGISTRY[None] = HTMLFormatter( + entity_substitution=None +) +XMLFormatter.REGISTRY["html"] = XMLFormatter( + entity_substitution=EntitySubstitution.substitute_html +) +XMLFormatter.REGISTRY["minimal"] = XMLFormatter( + entity_substitution=EntitySubstitution.substitute_xml +) +XMLFormatter.REGISTRY[None] = Formatter( + Formatter(Formatter.XML, entity_substitution=None) +) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/testing.py new/beautifulsoup4-4.8.0/bs4/testing.py --- old/beautifulsoup4-4.7.1/bs4/testing.py 2018-12-31 03:11:14.000000000 +0100 +++ new/beautifulsoup4-4.8.0/bs4/testing.py 2019-07-08 03:59:55.000000000 +0200 @@ -63,19 +63,19 @@ @property def default_builder(self): - return default_builder() + return default_builder def soup(self, markup, **kwargs): """Build a Beautiful Soup object from markup.""" builder = kwargs.pop('builder', self.default_builder) return BeautifulSoup(markup, builder=builder, **kwargs) - def document_for(self, markup): + def document_for(self, markup, **kwargs): """Turn an HTML fragment into a document. The details depend on the builder. 
""" - return self.default_builder.test_fragment_to_document(markup) + return self.default_builder(**kwargs).test_fragment_to_document(markup) def assertSoupEquals(self, to_parse, compare_parsed_to=None): builder = self.default_builder @@ -232,7 +232,7 @@ soup = self.soup("") new_tag = soup.new_tag(name) self.assertEqual(True, new_tag.is_empty_element) - + def test_pickle_and_unpickle_identity(self): # Pickling a tree, then unpickling it, yields a tree identical # to the original. @@ -491,6 +491,12 @@ u"<p>\u2022 AT&T is in the s&p 500</p>" ) + def test_apos_entity(self): + self.assertSoupEquals( + u"<p>Bob's Bar</p>", + u"<p>Bob's Bar</p>", + ) + def test_entities_in_foreign_document_encoding(self): # and are invalid numeric entities referencing # Windows-1252 characters. - references a character common diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/tests/test_html5lib.py new/beautifulsoup4-4.8.0/bs4/tests/test_html5lib.py --- old/beautifulsoup4-4.7.1/bs4/tests/test_html5lib.py 2018-12-23 23:16:18.000000000 +0100 +++ new/beautifulsoup4-4.8.0/bs4/tests/test_html5lib.py 2019-07-07 21:54:34.000000000 +0200 @@ -22,7 +22,7 @@ @property def default_builder(self): - return HTML5TreeBuilder() + return HTML5TreeBuilder def test_soupstrainer(self): # The html5lib tree builder does not support SoupStrainers. 
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/tests/test_htmlparser.py new/beautifulsoup4-4.8.0/bs4/tests/test_htmlparser.py --- old/beautifulsoup4-4.7.1/bs4/tests/test_htmlparser.py 2018-07-15 14:26:01.000000000 +0200 +++ new/beautifulsoup4-4.8.0/bs4/tests/test_htmlparser.py 2019-07-07 21:52:25.000000000 +0200 @@ -9,9 +9,7 @@ class HTMLParserTreeBuilderSmokeTest(SoupTest, HTMLTreeBuilderSmokeTest): - @property - def default_builder(self): - return HTMLParserTreeBuilder() + default_builder = HTMLParserTreeBuilder def test_namespaced_system_doctype(self): # html.parser can't handle namespaced doctypes, so skip this one. diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/tests/test_lxml.py new/beautifulsoup4-4.8.0/bs4/tests/test_lxml.py --- old/beautifulsoup4-4.7.1/bs4/tests/test_lxml.py 2019-01-07 00:41:32.000000000 +0100 +++ new/beautifulsoup4-4.8.0/bs4/tests/test_lxml.py 2019-07-07 21:54:54.000000000 +0200 @@ -36,7 +36,7 @@ @property def default_builder(self): - return LXMLTreeBuilder() + return LXMLTreeBuilder def test_out_of_range_entity(self): self.assertSoupEquals( @@ -79,7 +79,7 @@ @property def default_builder(self): - return LXMLTreeBuilderForXML() + return LXMLTreeBuilderForXML def test_namespace_indexing(self): # We should not track un-prefixed namespaces as we can only hold one diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/tests/test_soup.py new/beautifulsoup4-4.8.0/bs4/tests/test_soup.py --- old/beautifulsoup4-4.7.1/bs4/tests/test_soup.py 2016-07-27 03:27:42.000000000 +0200 +++ new/beautifulsoup4-4.8.0/bs4/tests/test_soup.py 2019-07-16 22:46:05.000000000 +0200 @@ -24,6 +24,7 @@ EncodingDetector, ) from bs4.testing import ( + default_builder, SoupTest, skipIf, ) @@ -54,7 +55,72 @@ soup = self.soup(utf8_data, exclude_encodings=["utf-8"]) 
self.assertEqual("windows-1252", soup.original_encoding) + def test_custom_builder_class(self): + # Verify that you can pass in a custom Builder class and + # it'll be instantiated with the appropriate keyword arguments. + class Mock(object): + def __init__(self, **kwargs): + self.called_with = kwargs + self.is_xml = True + def initialize_soup(self, soup): + pass + def prepare_markup(self, *args, **kwargs): + return '' + + kwargs = dict( + var="value", + # This is a deprecated BS3-era keyword argument, which + # will be stripped out. + convertEntities=True, + ) + with warnings.catch_warnings(record=True): + soup = BeautifulSoup('', builder=Mock, **kwargs) + assert isinstance(soup.builder, Mock) + self.assertEqual(dict(var="value"), soup.builder.called_with) + + # You can also instantiate the TreeBuilder yourself. In this + # case, that specific object is used and any keyword arguments + # to the BeautifulSoup constructor are ignored. + builder = Mock(**kwargs) + with warnings.catch_warnings(record=True) as w: + soup = BeautifulSoup( + '', builder=builder, ignored_value=True, + ) + msg = str(w[0].message) + assert msg.startswith("Keyword arguments to the BeautifulSoup constructor will be ignored.") + self.assertEqual(builder, soup.builder) + self.assertEqual(kwargs, builder.called_with) + + def test_cdata_list_attributes(self): + # Most attribute values are represented as scalars, but the + # HTML standard says that some attributes, like 'class' have + # space-separated lists as values. + markup = '<a id=" an id " class=" a class "></a>' + soup = self.soup(markup) + + # Note that the spaces are stripped for 'class' but not for 'id'. + a = soup.a + self.assertEqual(" an id ", a['id']) + self.assertEqual(["a", "class"], a['class']) + + # TreeBuilder takes an argument called 'mutli_valued_attributes' which lets + # you customize or disable this. As always, you can customize the TreeBuilder + # by passing in a keyword argument to the BeautifulSoup constructor. 
+ soup = self.soup(markup, builder=default_builder, multi_valued_attributes=None) + self.assertEqual(" a class ", soup.a['class']) + + # Here are two ways of saying that `id` is a multi-valued + # attribute in this context, but 'class' is not. + for switcheroo in ({'*': 'id'}, {'a': 'id'}): + with warnings.catch_warnings(record=True) as w: + # This will create a warning about not explicitly + # specifying a parser, but we'll ignore it. + soup = self.soup(markup, builder=None, multi_valued_attributes=switcheroo) + a = soup.a + self.assertEqual(["an", "id"], a['id']) + self.assertEqual(" a class ", a['class']) + class TestWarnings(SoupTest): def _no_parser_specified(self, s, is_there=True): @@ -217,7 +283,7 @@ self.assertEqual( self.sub.substitute_xml_containing_entities("&Aacute;T&T"), "&Aacute;T&amp;T") - + def test_quotes_not_html_substituted(self): """There's no need to do this except inside attribute values.""" text = 'Bob\'s "bar"' diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/bs4/tests/test_tree.py new/beautifulsoup4-4.8.0/bs4/tests/test_tree.py --- old/beautifulsoup4-4.7.1/bs4/tests/test_tree.py 2019-01-07 00:46:26.000000000 +0100 +++ new/beautifulsoup4-4.8.0/bs4/tests/test_tree.py 2019-07-16 22:46:05.000000000 +0200 @@ -25,6 +25,7 @@ Comment, Declaration, Doctype, + Formatter, NavigableString, SoupStrainer, Tag, @@ -416,6 +417,48 @@ self.assertEqual([], soup.find_all(id=1, text="bar")) +class TestSmooth(TreeTest): + """Test Tag.smooth.""" + + def test_smooth(self): + soup = self.soup("<div>a</div>") + div = soup.div + div.append("b") + div.append("c") + div.append(Comment("Comment 1")) + div.append(Comment("Comment 2")) + div.append("d") + builder = self.default_builder() + span = Tag(soup, builder, 'span') + span.append('1') + span.append('2') + div.append(span) + + # At this point the tree has a bunch of adjacent + # NavigableStrings.
This is normal, but it has no meaning in + # terms of HTML, so we may want to smooth things out for + # output. + + # Since the <span> tag has two children, its .string is None. + self.assertEquals(None, div.span.string) + + self.assertEqual(7, len(div.contents)) + div.smooth() + self.assertEqual(5, len(div.contents)) + + # The three strings at the beginning of div.contents have been + # merged into on string. + # + self.assertEqual('abc', div.contents[0]) + + # The call is recursive -- the <span> tag was also smoothed. + self.assertEqual('12', div.span.string) + + # The two comments have _not_ been merged, even though + # comments are strings. Merging comments would change the + # meaning of the HTML. + self.assertEqual('Comment 1', div.contents[1]) + self.assertEqual('Comment 2', div.contents[2]) class TestIndex(TreeTest): @@ -896,7 +939,7 @@ self.assertEqual(soup.a.contents[0].next_element, "bar") def test_insert_tag(self): - builder = self.default_builder + builder = self.default_builder() soup = self.soup( "<a><b>Find</b><c>lady!</c><d></d></a>", builder=builder) magic_tag = Tag(soup, builder, 'magictag') @@ -1532,7 +1575,7 @@ # callable is called on every string. self.assertEqual( decoded, - self.document_for(u"<b><FOO></b><b>BAR</b><br>")) + self.document_for(u"<b><FOO></b><b>BAR</b><br/>")) def test_formatter_is_run_on_attribute_values(self): markup = u'<a href="http://a.com?a=b&c=é">e</a>' @@ -1570,11 +1613,11 @@ self.assertTrue(b"< < hey > >" in encoded) def test_prettify_leaves_preformatted_text_alone(self): - soup = self.soup("<div> foo <pre> \tbar\n \n </pre> baz ") + soup = self.soup("<div> foo <pre> \tbar\n \n </pre> baz <textarea> eee\nfff\t</textarea></div>") # Everything outside the <pre> tag is reformatted, but everything # inside is left alone. 
self.assertEqual( - u'<div>\n foo\n <pre> \tbar\n \n </pre>\n baz\n</div>', + u'<div>\n foo\n <pre> \tbar\n \n </pre>\n baz\n <textarea> eee\nfff\t</textarea>\n</div>', soup.div.prettify()) def test_prettify_accepts_formatter_function(self): @@ -1683,6 +1726,29 @@ else: self.assertEqual(b'<b>\\u2603</b>', repr(soup)) +class TestFormatter(SoupTest): + + def test_sort_attributes(self): + # Test the ability to override Formatter.attributes() to, + # e.g., disable the normal sorting of attributes. + class UnsortedFormatter(Formatter): + def attributes(self, tag): + self.called_with = tag + for k, v in sorted(tag.attrs.items()): + if k == 'ignore': + continue + yield k,v + + soup = self.soup('<p cval="1" aval="2" ignore="ignored"></p>') + formatter = UnsortedFormatter() + decoded = soup.decode(formatter=formatter) + + # attributes() was called on the <p> tag. It filtered out one + # attribute and sorted the other two. + self.assertEquals(formatter.called_with, soup.p) + self.assertEquals(u'<p aval="2" cval="1"></p>', decoded) + + class TestNavigableStringSubclasses(SoupTest): def test_cdata(self): diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/doc/source/index.rst new/beautifulsoup4-4.8.0/doc/source/index.rst --- old/beautifulsoup4-4.7.1/doc/source/index.rst 2018-12-31 17:52:43.000000000 +0100 +++ new/beautifulsoup4-4.8.0/doc/source/index.rst 2019-07-17 03:31:27.000000000 +0200 @@ -31,7 +31,7 @@ * `这篇文档当然还有中文版. <http://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/>`_ * このページは日本語で利用できます(`外部リンク <http://kondou.com/BS4/>`_) -* 이 문서는 한국어 번역도 가능합니다. (`외부 링크 <http://coreapython.hosting.paran.com/etc/beautifulsoup4.html>`_) +* 이 문서는 한국어 번역도 가능합니다. 
(`외부 링크 <https://web.archive.org/web/20150319200824/http://coreapython.hosting.paran.com/etc/beautifulsoup4.html>`_) Getting help ------------ @@ -266,9 +266,9 @@ +----------------------+--------------------------------------------+--------------------------------+--------------------------+ | Parser | Typical usage | Advantages | Disadvantages | +----------------------+--------------------------------------------+--------------------------------+--------------------------+ -| Python's html.parser | ``BeautifulSoup(markup, "html.parser")`` | * Batteries included | * Not very lenient | -| | | * Decent speed | (before Python 2.7.3 | -| | | * Lenient (as of Python 2.7.3 | or 3.2.2) | +| Python's html.parser | ``BeautifulSoup(markup, "html.parser")`` | * Batteries included | * Not as fast as lxml, | +| | | * Decent speed | less lenient than | +| | | * Lenient (As of Python 2.7.3 | html5lib. | | | | and 3.2.) | | +----------------------+--------------------------------------------+--------------------------------+--------------------------+ | lxml's HTML parser | ``BeautifulSoup(markup, "lxml")`` | * Very fast | * External C dependency | @@ -428,8 +428,15 @@ print(rel_soup.p) # <p>Back to the <a rel="index contents">homepage</a></p> -You can use ```get_attribute_list`` to get a value that's always a list, -string, whether or not it's a multi-valued atribute + You can disable this by passing ``multi_valued_attributes=None`` as a +keyword argument into the ``BeautifulSoup`` constructor:: + + no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html', multi_valued_attributes=None) + no_list_soup.p['class'] + # u'body strikeout' + +You can use ```get_attribute_list`` to get a value that's always a +list, whether or not it's a multi-valued atribute:: id_soup.p.get_attribute_list('id') # ["my id"] @@ -440,8 +447,20 @@ xml_soup.p['class'] # u'body strikeout' +Again, you can configure this using the ``multi_valued_attributes`` argument:: + + class_is_multi= { '*' : 
'class'} + xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi) + xml_soup.p['class'] + # [u'body', u'strikeout'] +You probably won't need to do this, but if you do, use the defaults as +a guide. They implement the rules described in the HTML specification:: + from bs4.builder import builder_registry + builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES + + ``NavigableString`` ------------------- @@ -2093,6 +2112,40 @@ Like ``replace_with()``, ``unwrap()`` returns the tag that was replaced. +``smooth()`` +--------------------------- + +After calling a bunch of methods that modify the parse tree, you may end up with two or more ``NavigableString`` objects next to each other. Beautiful Soup doesn't have any problems with this, but since it can't happen in a freshly parsed document, you might not expect behavior like the following:: + + soup = BeautifulSoup("<p>A one</p>") + soup.p.append(", a two") + + soup.p.contents + # [u'A one', u', a two'] + + print(soup.p.encode()) + # <p>A one, a two</p> + + print(soup.p.prettify()) + # <p> + # A one + # , a two + # </p> + +You can call ``Tag.smooth()`` to clean up the parse tree by consolidating adjacent strings:: + + soup.smooth() + + soup.p.contents + # [u'A one, a two'] + + print(soup.p.prettify()) + # <p> + # A one, a two + # </p> + +The ``smooth()`` method is new in Beautiful Soup 4.8.0. 
+ Output ====== @@ -2103,7 +2156,7 @@ The ``prettify()`` method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each -tag and each string: +tag and each string:: markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>' soup = BeautifulSoup(markup) @@ -2216,7 +2269,7 @@ # </body> # </html> - If you pass in ``formatter="html5"``, it's the same as +If you pass in ``formatter="html5"``, it's the same as ``formatter="html5"``, but Beautiful Soup will omit the closing slash in HTML void tags like "br":: @@ -2245,16 +2298,17 @@ print(link_soup.a.encode(formatter=None)) # <a href="http://example.com/?foo=val1&bar=val2">A link</a> -Finally, if you pass in a function for ``formatter``, Beautiful Soup -will call that function once for every string and attribute value in -the document. You can do whatever you want in this function. Here's a -formatter that converts strings to uppercase and does absolutely -nothing else:: +If you need more sophisticated control over your output, you can +use Beautiful Soup's ``Formatter`` class. Here's a formatter that +converts strings to uppercase, whether they occur in a text node or in an +attribute value:: + from bs4.formatter import HTMLFormatter def uppercase(str): return str.upper() + formatter = HTMLFormatter(uppercase) - print(soup.prettify(formatter=uppercase)) + print(soup.prettify(formatter=formatter)) # <html> # <body> # <p> @@ -2263,34 +2317,31 @@ # </body> # </html> - print(link_soup.a.prettify(formatter=uppercase)) + print(link_soup.a.prettify(formatter=formatter)) # <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2"> # A LINK # </a> -If you're writing your own function, you should know about the -``EntitySubstitution`` class in the ``bs4.dammit`` module. 
This class -implements Beautiful Soup's standard formatters as class methods: the -"html" formatter is ``EntitySubstitution.substitute_html``, and the -"minimal" formatter is ``EntitySubstitution.substitute_xml``. You can -use these functions to simulate ``formatter=html`` or -``formatter==minimal``, but then do something extra. - -Here's an example that replaces Unicode characters with HTML entities -whenever possible, but `also` converts all strings to uppercase:: - - from bs4.dammit import EntitySubstitution - def uppercase_and_substitute_html_entities(str): - return EntitySubstitution.substitute_html(str.upper()) - - print(soup.prettify(formatter=uppercase_and_substitute_html_entities)) - # <html> - # <body> - # <p> - # IL A DIT <<SACRÉ BLEU!>> - # </p> - # </body> - # </html> +Subclassing ``HTMLFormatter`` or ``XMLFormatter`` will give you even +more control over the output. For example, Beautiful Soup sorts the +attributes in every tag by default:: + + attr_soup = BeautifulSoup(b'<p z="1" m="2" a="3"></p>') + print(attr_soup.p.encode()) + # <p a="3" m="2" z="1"></p> + +To turn this off, you can subclass the ``Formatter.attributes()`` +method, which controls which attributes are output and in what +order. This implementation also filters out out one of the attributes. 
+ + class UnsortedAttributes(HTMLFormatter): + def attributes(self, tag): + for k, v in tag.attrs.items(): + if k == 'm': + continue + yield k, v + print(attr_soup.p.encode(formatter=UnsortedAttributes())) + # <p z="1" a="3"></p> One last caveat: if you create a ``CData`` object, the text inside that object is always presented `exactly as it appears, with no @@ -3097,6 +3148,7 @@ * ``findPrevious`` -> ``find_previous`` * ``findPreviousSibling`` -> ``find_previous_sibling`` * ``findPreviousSiblings`` -> ``find_previous_siblings`` +* ``getText`` -> ``get_text`` * ``nextSibling`` -> ``next_sibling`` * ``previousSibling`` -> ``previous_sibling`` diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/beautifulsoup4-4.7.1/setup.py new/beautifulsoup4-4.8.0/setup.py --- old/beautifulsoup4-4.7.1/setup.py 2019-01-07 01:47:07.000000000 +0100 +++ new/beautifulsoup4-4.8.0/setup.py 2019-07-20 01:50:29.000000000 +0200 @@ -8,7 +8,7 @@ setup( name="beautifulsoup4", - version = "4.7.1", + version = "4.8.0", author="Leonard Richardson", author_email='leonardr@segfault.org', url="http://www.crummy.com/software/BeautifulSoup/bs4/",
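Not part of the commit: for anyone who wants to try the user-visible changes listed in the 4.8.0 changelog without reading the whole diff, the following is a minimal sketch. It assumes Beautiful Soup 4.8.0 or later running on Python 3.7+ (so that attribute dictionaries keep insertion order) and uses the standard library's "html.parser" builder. The class name KeepOrderFormatter, the PARSER constant, and the sample markup are made up for illustration; only BeautifulSoup, HTMLFormatter, multi_valued_attributes, Formatter.attributes() and Tag.smooth() come from the diff above.

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

PARSER = "html.parser"  # assumption: stdlib parser, so lxml/html5lib are not needed

# 1. Keyword arguments given to the constructor are forwarded to the
#    TreeBuilder. multi_valued_attributes=None turns off list-valued
#    attributes such as 'class'.
markup = '<p class="body strikeout"></p>'
default_class = BeautifulSoup(markup, PARSER).p["class"]
scalar_class = BeautifulSoup(markup, PARSER, multi_valued_attributes=None).p["class"]


# 2. A Formatter subclass can override attributes() to decide which
#    attributes are written out and in what order.
class KeepOrderFormatter(HTMLFormatter):
    """Hypothetical formatter: keep the stored attribute order, drop 'm'."""

    def attributes(self, tag):
        for key, value in tag.attrs.items():
            if key != "m":
                yield key, value


attr_soup = BeautifulSoup('<p z="1" m="2" a="3"></p>', PARSER)
sorted_output = attr_soup.p.decode()  # default formatter sorts attributes
unsorted_output = attr_soup.p.decode(formatter=KeepOrderFormatter())

# 3. Tag.smooth() merges adjacent NavigableString objects left behind by
#    tree modifications.
smooth_soup = BeautifulSoup("<p>A one</p>", PARSER)
smooth_soup.p.append(", a two")
before = len(smooth_soup.p.contents)
smooth_soup.smooth()
after = len(smooth_soup.p.contents)

print(default_class, repr(scalar_class))
print(sorted_output, unsorted_output)
print(before, after, repr(smooth_soup.p.string))
```

With the default formatter the attributes come out alphabetically; the subclass emits them in document order and leaves out "m". After smooth() the paragraph holds a single string, so .string is usable again.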