commit perl-HTML-TableExtract for openSUSE:Factory
Hello community, here is the log from the commit of package perl-HTML-TableExtract for openSUSE:Factory checked in at Sun Feb 27 13:56:55 CET 2011. -------- --- perl-HTML-TableExtract/perl-HTML-TableExtract.changes 2010-12-01 14:27:33.000000000 +0100 +++ perl-HTML-TableExtract/perl-HTML-TableExtract.changes 2011-02-25 18:52:49.000000000 +0100 @@ -1,0 +2,8 @@ +Fri Feb 25 17:51:03 UTC 2011 - chris@computersalat.de + +- recreated by cpanspec 1.78.03 + o fix deps +- add HTML patch +- noarch pkg + +------------------------------------------------------------------- calling whatdependson for head-i586 New: ---- HTML-TableExtract-2.10-HTML.patch ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Other differences: ------------------ ++++++ perl-HTML-TableExtract.spec ++++++ --- /var/tmp/diff_new_pack.hvKXSi/_old 2011-02-27 13:56:31.000000000 +0100 +++ /var/tmp/diff_new_pack.hvKXSi/_new 2011-02-27 13:56:31.000000000 +0100 @@ -1,7 +1,7 @@ # -# spec file for package perl-HTML-TableExtract (Version 2.10) +# spec file for package perl-HTML-TableExtract # -# Copyright (c) 2010 SUSE LINUX Products GmbH, Nuernberg, Germany. +# Copyright (c) 2011 SUSE LINUX Products GmbH, Nuernberg, Germany. # # All modifications and additions to the file contributed by third parties # remain the property of their copyright owners, unless otherwise agreed @@ -15,58 +15,144 @@ # Please submit bugfixes or comments via http://bugs.opensuse.org/ # -# norootforbuild Name: perl-HTML-TableExtract -Url: http://cpan.org/modules/by-module/HTML/ -License: Public Domain, Freeware -Group: Development/Libraries/Perl -AutoReqProv: on -Requires: perl-HTML-Parser -BuildRequires: perl-HTML-Parser -BuildRequires: perl-macros -# Needed only for tests: -BuildRequires: perl-HTML-Tree perl-Test-Pod-Coverage -Summary: Simplifies extraction of information within tables in HTML documents Version: 2.10 -Release: 81 -Source: HTML-TableExtract-%{version}.tar.bz2 +Release: 86 +License: GPL+ or Artistic +%define cpan_name HTML-TableExtract +Summary: For extracting the content contained in tables within an HTML document +Url: http://search.cpan.org/dist/HTML-TableExtract/ +Group: Development/Libraries/Perl +#Source: http://www.cpan.org/authors/id/M/MS/MSISK/HTML-TableExtract-2.10.tar.gz +Source: %{cpan_name}-%{version}.tar.bz2 +Patch0: %{cpan_name}-2.10-HTML.patch +BuildArch: noarch BuildRoot: %{_tmppath}/%{name}-%{version}-build +BuildRequires: perl +BuildRequires: perl-macros +BuildRequires: perl(HTML::ElementTable) >= 1.16 +BuildRequires: perl(HTML::Parser) +Requires: perl(HTML::ElementTable) >= 1.16 +Requires: perl(HTML::Parser) %{perl_requires} %description -HTML::TableExtract is a module that simplifies the extraction of -information contained in tables within HTML documents. - -Tables of note may be specified using Headers, Depth, Count, -Attributes, or some combination of the three. See the module -documentation for details. - - +HTML::TableExtract is a subclass of HTML::Parser that serves to extract the +information from tables of interest contained within an HTML document. The +information from each extracted table is stored in table objects. Tables +can be extracted as text, HTML, or HTML::ElementTable structures (for +in-place editing or manipulation). + +There are currently four constraints available to specify which tables you +would like to extract from a document: _Headers_, _Depth_, _Count_, and +_Attributes_. + +_Headers_, the most flexible and adaptive of the techniques, involves +specifying text in an array that you expect to appear above the data in the +tables of interest. Once all headers have been located in a row of that +table, all further cells beneath the columns that matched your headers are +extracted. All other columns are ignored: think of it as vertical slices +through a table. In addition, TableExtract automatically rearranges each +row in the same order as the headers you provided. If you would like to +disable this, set _automap_ to 0 during object creation, and instead rely +on the column_map() method to find out the order in which the headers were +found. Furthermore, TableExtract will automatically compensate for cell +span issues so that columns are really the same columns as you would +visually see in a browser. This behavior can be disabled by setting the +_gridmap_ parameter to 0. HTML is stripped from the entire textual content +of a cell before header matches are attempted -- unless the _keep_html_ +parameter was enabled. + +_Depth_ and _Count_ are more specific ways to specify tables in relation to +one another. _Depth_ represents how deeply a table resides in other tables. +The depth of a top-level table in the document is 0. A table within a +top-level table has a depth of 1, and so on. Each depth can be thought of +as a layer; tables sharing the same depth are on the same layer. Within +each of these layers, _Count_ represents the order in which a table was +seen at that depth, starting with 0. Providing both a _depth_ and a _count_ +will uniquely specify a table within a document. + +_Attributes_ match based on the attributes of the html <table> tag, for +example, boder widths or background color. + +Each of the _Headers_, _Depth_, _Count_, and _Attributes_ specifications +are cumulative in their effect on the overall extraction. For instance, if +you specify only a _Depth_, then you get all tables at that depth (note +that these could very well reside in separate higher- level tables +throughout the document since depth extends across tables). If you specify +only a _Count_, then the tables at that _Count_ from all depths are +returned (i.e., the _n_th occurrence of a table at each depth). If you only +specify _Headers_, then you get all tables in the document containing those +column headers. If you have specified multiple constraints of _Headers_, +_Depth_, _Count_, and _Attributes_, then each constraint has veto power +over whether a particular table is extracted. + +If no _Headers_, _Depth_, _Count_, or _Attributes_ are specified, then all +tables match. + +When extracting only text from tables, the text is decoded with +HTML::Entities by default; this can be disabled by setting the _decode_ +parameter to 0. + +Extraction Modes + The default mode of extraction for HTML::TableExtract is raw text or + HTML. In this mode, embedded tables are completely decoupled from one + another. In this case, HTML::TableExtract is a subclass of + HTML::Parser: + + use HTML::TableExtract; + + Alternativevly, tables can be extracted as HTML::ElementTable + structures, which are in turn embedded in an HTML::Element tree + representing the entire HTML document. Embedded tables are not + decoupled from one another since this tree structure must be + manitained. In this case, HTML::TableExtract is a subclass of + HTML::TreeBuilder (itself a subclass of HTML:::Parser): + + use HTML::TableExtract qw(tree); + + In either case, the basic interface for HTML::TableExtract and the + resulting table objects remains the same -- all that changes is what + you can do with the resulting data. + + HTML::TableExtract is a subclass of HTML::Parser, and as such inherits + all of its basic methods such as 'parse()' and 'parse_file()'. During + scans, 'start()', 'end()', and 'text()' are utilized. Feel free to + override them, but if you do not eventually invoke them in the SUPER + class with some content, results are not guaranteed. + +Advice + The main point of this module was to provide a flexible method of + extracting tabular information from HTML documents without relying to + heavily on the document layout. For that reason, I suggest using + _Headers_ whenever possible -- that way, you are anchoring your + extraction on what the document is trying to communicate rather than + some feature of the HTML comprising the document (other than the fact + that the data is contained in a table). %prep -%setup -q -n HTML-TableExtract-%{version} +%setup -q -n %{cpan_name}-%{version} +%patch0 -p1 %build -perl Makefile.PL -make %{?_smp_mflags} +%{__perl} Makefile.PL INSTALLDIRS=vendor +%{__make} %{?_smp_mflags} %check -make test +%{__make} test %install -make DESTDIR=$RPM_BUILD_ROOT install_vendor +%perl_make_install %perl_process_packlist +%perl_gen_filelist %clean -[ "$RPM_BUILD_ROOT" != "/" ] && [ -d $RPM_BUILD_ROOT ] && rm -rf $RPM_BUILD_ROOT +%{__rm} -rf %{buildroot} -%files -%defattr(-,root,root) -%doc Changes MANIFEST README -%doc %{_mandir}/man?/* -%{perl_vendorlib}/HTML -%{perl_vendorarch}/auto/HTML-TableExtract +%files -f %{name}.files +%defattr(644,root,root,755) +%doc Changes README %changelog ++++++ HTML-TableExtract-2.10-HTML.patch ++++++ diff -ruN HTML-TableExtract-2.10-orig/t/gnarly.html HTML-TableExtract-2.10/t/gnarly.html --- HTML-TableExtract-2.10-orig/t/gnarly.html 2006-05-01 23:22:47.000000000 +0200 +++ HTML-TableExtract-2.10/t/gnarly.html 2011-02-25 18:41:08.000000000 +0100 @@ -1 +1 @@ -<html><head><title>gnarly table</title></head><body><table border=1><tr><td colspan=4 rowspan=1>(0,0) [1,4]</td><td colspan=4 rowspan=2>(0,1) [2,4]</td></tr><tr><td colspan=1 rowspan=2>(1,0) [2,1]</td><td colspan=1 rowspan=1>(1,1) [1,1]</td><td colspan=2 rowspan=1>(1,2) [1,2]</td></tr><tr><td colspan=4 rowspan=2>(2,0) [2,4]</td><td colspan=2 rowspan=2>(2,1) [2,2]</td><td colspan=1 rowspan=1>(2,2) [1,1]</td></tr><tr><td colspan=1 rowspan=1>(3,0) [1,1]</td><td colspan=1 rowspan=1>(3,1) [1,1]</td></tr><tr><td colspan=2 rowspan=3>(4,0) [3,2]</td><td colspan=1 rowspan=1>(4,1) [1,1]</td><td colspan=1 rowspan=3>(4,2) [3,1]</td><td colspan=4 rowspan=4>(4,3) [4,4]</td></tr><tr><td colspan=1 rowspan=1>(5,0) [1,1]</td></tr><tr><td colspan=1 rowspan=1>(6,0) [1,1]</td></tr><tr><td colspan=4 rowspan=1>(7,0) [1,4]</td></tr></table></body></html> +<html><head><title>gnarly table</title></head><body><table border="1"><tr><td colspan="4" rowspan="1">(0,0) [1,4]</td><td colspan="4" rowspan="2">(0,1) [2,4]</td></tr><tr><td colspan="1" rowspan="2">(1,0) [2,1]</td><td colspan="1" rowspan="1">(1,1) [1,1]</td><td colspan="2" rowspan="1">(1,2) [1,2]</td></tr><tr><td colspan="4" rowspan="2">(2,0) [2,4]</td><td colspan="2" rowspan="2">(2,1) [2,2]</td><td colspan="1" rowspan="1">(2,2) [1,1]</td></tr><tr><td colspan="1" rowspan="1">(3,0) [1,1]</td><td colspan="1" rowspan="1">(3,1) [1,1]</td></tr><tr><td colspan="2" rowspan="3">(4,0) [3,2]</td><td colspan="1" rowspan="1">(4,1) [1,1]</td><td colspan="1" rowspan="3">(4,2) [3,1]</td><td colspan="4" rowspan="4">(4,3) [4,4]</td></tr><tr><td colspan="1" rowspan="1">(5,0) [1,1]</td></tr><tr><td colspan="1" rowspan="1">(6,0) [1,1]</td></tr><tr><td colspan="4" rowspan="1">(7,0) [1,4]</td></tr></table></body></html> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Remember to have fun... -- To unsubscribe, e-mail: opensuse-commit+unsubscribe@opensuse.org For additional commands, e-mail: opensuse-commit+help@opensuse.org
participants (1)
-
root@hilbert.suse.de