commit python-logreduce for openSUSE:Factory
Hello community,

here is the log from the commit of package python-logreduce for
openSUSE:Factory checked in at 2019-04-01 12:35:03
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/python-logreduce (Old)
and /work/SRC/openSUSE:Factory/.python-logreduce.new.25356 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Package is "python-logreduce"

Mon Apr 1 12:35:03 2019 rev:7 rq:686071 version:0.4.0

Changes:
--------
--- /work/SRC/openSUSE:Factory/python-logreduce/python-logreduce.changes 2018-12-24 11:39:25.645556701 +0100
+++ /work/SRC/openSUSE:Factory/.python-logreduce.new.25356/python-logreduce.changes 2019-04-01 12:35:10.017825784 +0200
@@ -1,0 +2,48 @@
+Mon Mar 18 09:23:25 UTC 2019 - Dirk Mueller <dmueller@xxxxxxxx>
+
+- update to 0.4.0:
+ * Bump model version and fix typo
+ * Add HashingAnnoy model
+ * Add hashing_nn benchmark in doc string
+ * Add HashingApproximateNeighbors model
+ * Implement iterator interface for file-like objects
+ * Refactor TokenizerTests
+ * Provide a bit more info about timings of the training
+ * Remove support for bag-of-words_lshf
+ * Don't store duplicate data in model
+ * Fix heat_uuid regexp formatting
+ * Relax digits_re again a bit
+ * Vectorizer optimisation: don't do word analysing
+ * debug_lineprocess: Handle more than one input file
+ * debug_lineprocess: Format output slightly nicer and remove duplicates
+ * Tighten heat_uuid regexp
+ * Tighten length-based regexp matches properly
+ * debug_lineprocess add some simple word / token statistics
+ * Blacklist .xml extension
+ * Use for loop instead of handcrafted while construct
+ * tests: use free tcp port for gearman server
+ * Add --model-type argument to top-level command
+ * tokenizer: remove sshd warnings
+ * Make debugging scripts callable again
+ * Reduce code duplication a bit
+ * Micro-optimize the tokenization
+ * ci: enable gate jobs
+ * Make systemd service file SCL independent
+ * Transition webui related files to the log-classify name
+ * Match uuid_re before heat_re
+ * Use SqlAlchemy intrinsics for ordering
+ * Fix overly greedy date tokenization
+ * Fix tokenization error on removing SSH fingerprints
+ * DRY: Remove implementation override that also exists in the base class
+ * Fix assertEquals() deprecation warning
+ * Use generator for reading files
+ * tokenizer regexp speedups
+ * cmd: add --json argument to report options
+ * Spelling typos
+ * logreduce: Fix inconsistency for model_file in model-run
+ * logreduce.spec: Fixes
+ * README: Add openSUSE instructions
+ * Add py36/py37 to the env list as well
+ * Run pep8 against pip installed flake8
+
+-------------------------------------------------------------------

Old:
----
logreduce-0.3.0.tar.gz

New:
----
logreduce-0.4.0.tar.gz

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Other differences:
------------------
++++++ python-logreduce.spec ++++++
--- /var/tmp/diff_new_pack.o1KJfu/_old 2019-04-01 12:35:11.441826149 +0200
+++ /var/tmp/diff_new_pack.o1KJfu/_new 2019-04-01 12:35:11.441826149 +0200
@@ -1,7 +1,7 @@
#
# spec file for package python-logreduce
#
-# Copyright (c) 2018 SUSE LINUX GmbH, Nuernberg, Germany.
+# Copyright (c) 2019 SUSE LINUX GmbH, Nuernberg, Germany.
#
# All modifications and additions to the file contributed by third parties
# remain the property of their copyright owners, unless otherwise agreed
@@ -19,7 +19,7 @@
%{?!python_module:%define python_module() python-%{**} python3-%{**}}
%define skip_python2 1
Name: python-logreduce
-Version: 0.3.0
+Version: 0.4.0
Release: 0
Summary: Log file anomaly extractor
License: Apache-2.0

++++++ logreduce-0.3.0.tar.gz -> logreduce-0.4.0.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/.zuul.yaml
new/logreduce-0.4.0/.zuul.yaml
--- old/logreduce-0.3.0/.zuul.yaml 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/.zuul.yaml 2018-11-08 08:39:56.000000000 +0100
@@ -10,7 +10,7 @@
nodeset:
nodes:
- name: container
- label: f27-oci
+ label: runc-fedora

- project:
name: logreduce
@@ -21,15 +21,25 @@
nodeset:
nodes:
- name: testrunner
- label: fedora-oci
+ label: runc-fedora
- tox-py35:
nodeset:
nodes:
- name: testrunner
- label: fedora-oci
+ label: runc-fedora
gate:
jobs:
- - noop
+ - logreduce-tests
+ - tox-pep8:
+ nodeset:
+ nodes:
+ - name: testrunner
+ label: runc-fedora
+ - tox-py35:
+ nodeset:
+ nodes:
+ - name: testrunner
+ label: runc-fedora
release:
jobs:
- upload-pypi
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/ChangeLog
new/logreduce-0.4.0/ChangeLog
--- old/logreduce-0.3.0/ChangeLog 2018-10-25 11:23:59.000000000 +0200
+++ new/logreduce-0.4.0/ChangeLog 2018-11-08 08:40:11.000000000 +0100
@@ -1,6 +1,53 @@
CHANGES
=======

+0.4.0
+-----
+
+* Bump model version and fix typo
+* Add HashingAnnoy model
+* Add hashing_nn benchmark in doc string
+* Add HashingApproximateNeighbors model
+* Implement iterator interface for file-like objects
+* Refactor TokenizerTests
+* Provide a bit more info about timings of the training
+* Remove support for bag-of-words_lshf
+* Don't store duplicate data in model
+* Fix heat_uuid regexp formatting
+* Relax digits_re again a bit
+* Vectorizer optimisation: don't do word analysing
+* debug_lineprocess: Handle more than one input file
+* debug_lineprocess: Format output slightly nicer and remove duplicates
+* Tighten heat_uuid regexp
+* Tighten length-based regexp matches properly
+* debug_lineprocess add some simple word / token statistics
+* Blacklist .xml extension
+* Use for loop instead of handcrafted while construct
+* tests: use free tcp port for gearman server
+* Add --model-type argument to top-level command
+* tokenizer: remove sshd warnings
+* Make debugging scripts callable again
+* Reduce code duplication a bit
+* Micro-optimize the tokenization
+* ci: enable gate jobs
+* Make systemd service file SCL independent
+* Transition webui related files to the log-classify name
+* Match uuid_re before heat_re
+* Use SqlAlchemy intrinsics for ordering
+* Fix overly greedy date tokenization
+* Fix tokenization error on removing SSH fingerprints
+* DRY: Remove implementation override that also exists in the base class
+* Fix assertEquals() deprecation warning
+* Use generator for reading files
+* tokenizer regexp speedups
+* cmd: add --json argument to report options
+* Spelling typos
+* logreduce: Fix inconsistency for model_file in model-run
+* logreduce.spec: Fixes
+* README: Add openSUSE instructions
+* Add py36/py37 to the env list as well
+* Run pep8 against pip installed flake8
+
0.3.0
-----

diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/PKG-INFO new/logreduce-0.4.0/PKG-INFO
--- old/logreduce-0.3.0/PKG-INFO 2018-10-25 11:23:59.000000000 +0200
+++ new/logreduce-0.4.0/PKG-INFO 2018-11-08 08:40:11.000000000 +0100
@@ -1,6 +1,6 @@
Metadata-Version: 1.1
Name: logreduce
-Version: 0.3.0
+Version: 0.4.0
Summary: Extract anomalies from log files
Home-page: https://logreduce.softwarefactory-project.io/
Author: Tristan Cacqueray
@@ -52,6 +52,18 @@
python3 setup.py develop --user
popd

+
+ * openSUSE:
+
+ .. code-block:: console
+
+ sudo zypper install python3-scikit-learn
+ git clone https://softwarefactory-project.io/r/logreduce
+ pushd logreduce
+ python3 setup.py develop --user
+ popd
+
+
* Pip:

.. code-block:: console
@@ -159,7 +171,7 @@
* logreduce-server: the REST and Gearman server
* logreduce-worker: job executor
* logreduce-client: client cli
- * logreduce-ui: web ui
+ * logreduce-webui: logreduce web interface

API
...
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/README.rst
new/logreduce-0.4.0/README.rst
--- old/logreduce-0.3.0/README.rst 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/README.rst 2018-11-08 08:39:56.000000000 +0100
@@ -44,6 +44,18 @@
python3 setup.py develop --user
popd

+
+* openSUSE:
+
+.. code-block:: console
+
+ sudo zypper install python3-scikit-learn
+ git clone https://softwarefactory-project.io/r/logreduce
+ pushd logreduce
+ python3 setup.py develop --user
+ popd
+
+
* Pip:

.. code-block:: console
@@ -151,7 +163,7 @@
* logreduce-server: the REST and Gearman server
* logreduce-worker: job executor
* logreduce-client: client cli
-* logreduce-ui: web ui
+* logreduce-webui: logreduce web interface

API
...
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/doc/index.rst
new/logreduce-0.4.0/doc/index.rst
--- old/logreduce-0.3.0/doc/index.rst 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/doc/index.rst 2018-11-08 08:39:56.000000000 +0100
@@ -44,6 +44,18 @@
python3 setup.py develop --user
popd

+
+* openSUSE:
+
+.. code-block:: console
+
+ sudo zypper install python3-scikit-learn
+ git clone https://softwarefactory-project.io/r/logreduce
+ pushd logreduce
+ python3 setup.py develop --user
+ popd
+
+
* Pip:

.. code-block:: console
@@ -151,7 +163,7 @@
* logreduce-server: the REST and Gearman server
* logreduce-worker: job executor
* logreduce-client: client cli
-* logreduce-ui: web ui
+* logreduce-webui: logreduce web interface

API
...
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/etc/httpd/log-classify.conf
new/logreduce-0.4.0/etc/httpd/log-classify.conf
--- old/logreduce-0.3.0/etc/httpd/log-classify.conf 1970-01-01 01:00:00.000000000 +0100
+++ new/logreduce-0.4.0/etc/httpd/log-classify.conf 2018-11-08 08:39:56.000000000 +0100
@@ -0,0 +1,27 @@
+ProxyVia On
+ProxyRequests Off
+RewriteEngine on
+
+<Directory /var/www/log-classify>
+ Options Indexes SymLinksIfOwnerMatch
+ Require all granted
+ IndexOptions FancyIndexing HTMLTable NameWidth=* SuppressDescription
+</Directory>
+
+Alias /log-classify/datasets /var/www/log-classify/anomalies
+
+<Directory /usr/share/log-classify>
+ DirectoryIndex index.html
+ Require all granted
+ Order allow,deny
+ Allow from all
+</Directory>
+
+Alias /log-classify /usr/share/log-classify
+# Don't rewrite files or directories
+RewriteRule ^/log-classify/api/(.*)$ http://localhost:20004/api/$1 [L,P]
+RewriteCond /usr/share/%{REQUEST_FILENAME} !-f
+RewriteCond /usr/share/%{REQUEST_FILENAME} !-d
+RewriteCond /usr/share/%{REQUEST_FILENAME} !-l
+# Rewrite everything else to index.html to allow html5 state links
+RewriteRule ^/log-classify/.*$ /usr/share/log-classify/index.html [L]
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/etc/httpd/logreduce.conf
new/logreduce-0.4.0/etc/httpd/logreduce.conf
--- old/logreduce-0.3.0/etc/httpd/logreduce.conf 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/etc/httpd/logreduce.conf 1970-01-01 01:00:00.000000000 +0100
@@ -1,27 +0,0 @@
-ProxyVia On
-ProxyRequests Off
-RewriteEngine on
-
-<Directory /var/www/logreduce>
- Options Indexes SymLinksIfOwnerMatch
- Require all granted
- IndexOptions FancyIndexing HTMLTable NameWidth=* SuppressDescription
-</Directory>
-
-Alias /log-classify/datasets /var/www/logreduce/anomalies
-
-<Directory /usr/share/log-classify>
- DirectoryIndex index.html
- Require all granted
- Order allow,deny
- Allow from all
-</Directory>
-
-Alias /log-classify /usr/share/log-classify
-# Don't rewrite files or directories
-RewriteRule ^/log-classify/api/(.*)$ http://localhost:20004/api/$1 [L,P]
-RewriteCond /usr/share/log-classify/%{REQUEST_FILENAME} !-f
-RewriteCond /usr/share/log-classify/%{REQUEST_FILENAME} !-d
-RewriteCond /usr/share/log-classify/%{REQUEST_FILENAME} !-l
-# Rewrite everything else to index.html to allow html5 state links
-RewriteRule ^/log-classify/.*$ /usr/share/log-classify/index.html [L]
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/etc/logreduce/config.yaml
new/logreduce-0.4.0/etc/logreduce/config.yaml
--- old/logreduce-0.3.0/etc/logreduce/config.yaml 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/etc/logreduce/config.yaml 2018-11-08 08:39:56.000000000 +0100
@@ -13,9 +13,9 @@
# Where the models are saved locally
models_folder: /var/lib/logreduce/models
# Where the archived dataset are stored locally
- dataset_folder: /var/www/logreduce/anomalies
+ dataset_folder: /var/www/log-classify/anomalies
# Where the logs are expected or downloaded
- logserver_folder: /var/www/logreduce/logs
+ logserver_folder: /var/www/log-classify/logs
logging:
loggers:
logreduce:
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/etc/systemd/logreduce-server.service
new/logreduce-0.4.0/etc/systemd/logreduce-server.service
--- old/logreduce-0.3.0/etc/systemd/logreduce-server.service 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/etc/systemd/logreduce-server.service 2018-11-08 08:39:56.000000000 +0100
@@ -7,8 +7,7 @@
User=logreduce
Group=logreduce
SyslogIdentifier=logreduce-server
-EnvironmentFile=-/etc/opt/rh/rh-python35/sysconfig/enable-py3
-ExecStart=/opt/rh/rh-python35/root/usr/bin/logreduce-server -c /etc/opt/rh/rh-python35/logreduce/config.yaml
+ExecStart=/usr/bin/logreduce-server -c /etc/logreduce/config.yaml

[Install]
WantedBy=multi-user.target
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/etc/systemd/logreduce-worker.service
new/logreduce-0.4.0/etc/systemd/logreduce-worker.service
--- old/logreduce-0.3.0/etc/systemd/logreduce-worker.service 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/etc/systemd/logreduce-worker.service 2018-11-08 08:39:56.000000000 +0100
@@ -7,8 +7,7 @@
User=logreduce
Group=logreduce
SyslogIdentifier=logreduce-worker
-EnvironmentFile=-/etc/opt/rh/rh-python35/sysconfig/enable-py3
-ExecStart=/opt/rh/rh-python35/root/usr/bin/logreduce-worker -c /etc/opt/rh/rh-python35/logreduce/config.yaml
+ExecStart=/usr/bin/logreduce-worker -c /etc/logreduce/config.yaml

[Install]
WantedBy=multi-user.target
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/logreduce/cmd.py
new/logreduce-0.4.0/logreduce/cmd.py
--- old/logreduce-0.3.0/logreduce/cmd.py 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/cmd.py 2018-11-08 08:39:56.000000000 +0100
@@ -41,7 +41,6 @@
parser.print_help()
exit(4)
logreduce.utils.setup_logging(args.debug)
- self.model_type = "hashing_nn"
self.job = None
self.exclude_file = logreduce.utils.DEFAULT_IGNORE_FILES
self.exclude_path = logreduce.utils.DEFAULT_IGNORE_PATHS
@@ -87,6 +86,9 @@
parser.add_argument("--tmp-dir", default=os.getcwd())
parser.add_argument("--cacheonly", action="store_true",
help="Do not download any logs")
+ parser.add_argument("--model-type", default="hashing_nn",
+ choices=list(models.keys()),
+ help="The model type")

# Common arguments
def path_filters(s):
@@ -117,6 +119,8 @@

def report_filters(s):
s.add_argument("--html", metavar="FILE", help="Render html result")
+ s.add_argument("--json", metavar="FILE",
+ help="Optional json output")
s.add_argument("--static-location",
help="The js/css static directory location")
s.add_argument("--threshold", default=0.2, type=float,
@@ -137,9 +141,6 @@
def model_filters(s):
s.add_argument("--max-age", type=int, default=7,
help="Maximum age of a model")
- s.add_argument("--model-type", default="hashing_nn",
- choices=list(models.keys()),
- help="The model type")

def journal_filters(s):
s.add_argument("--range", choices=("day", "week", "month"),
@@ -158,7 +159,7 @@
s.set_defaults(func=self.model_run)
path_filters(s)
report_filters(s)
- s.add_argument("model_file", metavar="FILE")
+ s.add_argument("model_file")
s.add_argument("target", nargs='+')

# Local directory
@@ -256,8 +257,6 @@
s = sub.add_parser("diff", help="Compare directories/files")
s.set_defaults(func=self.diff)
report_filters(s)
- s.add_argument("--json", metavar="FILE",
- help="Optional json output")
s.add_argument("baseline", nargs='+')
s.add_argument("target")

@@ -439,7 +438,7 @@
def diff(self, baseline, target):
clf = self._get_classifier()
clf.train(baseline)
- self._report(clf, target, json_file=self.json)
+ self._report(clf, target)

def download_logs(self, logs_url, target_dir=None):
if logs_url.endswith("/job-output.txt.gz"):
@@ -486,13 +485,13 @@
clf.include_path = self.include_path
return clf

- def _report(self, clf, target_dirs, target_source=None, json_file=None):
+ def _report(self, clf, target_dirs, target_source=None):
if self.context_length is not None:
self.before_context = self.context_length
self.after_context = self.context_length

console_output = True
- if json_file or self.html:
+ if self.json or self.html:
console_output = False
output = clf.process(path=target_dirs,
path_source=target_source,
@@ -508,8 +507,8 @@
render_html(output, self.static_location))
open(self.html.replace(".html", ".json"), "w").write(
json.dumps(output))
- if json_file is not None:
- open(json_file, "w").write(json.dumps(output))
+ if self.json:
+ open(self.json, "w").write(json.dumps(output))
else:
print("%02.2f%% reduction (from %d lines to %d)" % (
output["reduction"],
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/logreduce/models.py
new/logreduce-0.4.0/logreduce/models.py
--- old/logreduce-0.3.0/logreduce/models.py 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/models.py 2018-11-08 08:39:56.000000000 +0100
@@ -13,8 +13,9 @@
import os
import warnings

+import numpy as np
+
from sklearn.feature_extraction.text import TfidfVectorizer
-from sklearn.neighbors import LSHForest
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import HashingVectorizer
# from sklearn import svm
@@ -55,23 +56,20 @@
return [0.5] * len(test_data)


-class LSHF(Model):
- """Forest model, faster for on large index (>20000 samples)"""
+class SimpleNeighbors(Model):
+ """Simple NN model"""
def __init__(self, name=""):
super().__init__(name)
self.vectorizer = TfidfVectorizer(
- analyzer='word', lowercase=False, tokenizer=None,
+ analyzer=str.split, lowercase=False, tokenizer=None,
preprocessor=None, stop_words=None)
-
- self.lshf = LSHForest(
- random_state=int(os.environ.get("LR_RANDOM_STATE", 42)),
- n_estimators=int(os.environ.get("LR_N_ESTIMATORS", 23)))
+ self.nn = NearestNeighbors(
+ algorithm='brute',
+ metric='cosine')

def train(self, train_data):
- with warnings.catch_warnings():
- warnings.simplefilter("ignore")
- dat = self.vectorizer.fit_transform(train_data)
- self.lshf.fit(dat)
+ dat = self.vectorizer.fit_transform(train_data)
+ self.nn.fit(dat)
self.info = "%d samples, %d features" % dat.shape
return dat

@@ -82,24 +80,31 @@
chunk = test_data[chunk_pos:min(len(test_data),
chunk_pos + CHUNK_SIZE)]
dat = self.vectorizer.transform(chunk)
- distances, _ = self.lshf.kneighbors(dat, n_neighbors=1)
+ distances, _ = self.nn.kneighbors(dat, n_neighbors=1)
all_distances.extend(distances)
return all_distances


-class SimpleNeighbors(Model):
- """Simple NN model"""
+class HashingNeighbors(Model):
+ """ HashingVectorized NN model.
+ Fastest implementation for low sample sizes (<1e5),
+ logreduce-tests benchmark: 12sec
+ """
def __init__(self, name=""):
super().__init__(name)
- self.vectorizer = TfidfVectorizer(
- analyzer='word', lowercase=False, tokenizer=None,
+ self.vectorizer = HashingVectorizer(
+ binary=True, n_features=2**18,
+ analyzer=str.split, lowercase=False, tokenizer=None,
preprocessor=None, stop_words=None)
+ # HashingVectorizer produces sparse vectors, and
+ # sorted(sklearn.neighbors.VALID_METRICS_SPARSE['algorithm']) is
+ # empty for anything != brute
self.nn = NearestNeighbors(
- algorithm='brute',
- metric='cosine')
+ algorithm='brute', metric='cosine',
+ n_jobs=1, n_neighbors=1)

def train(self, train_data):
- dat = self.vectorizer.fit_transform(train_data)
+ dat = self.vectorizer.transform(train_data)
self.nn.fit(dat)
self.info = "%d samples, %d features" % dat.shape
return dat
@@ -111,46 +116,105 @@
chunk = test_data[chunk_pos:min(len(test_data),
chunk_pos + CHUNK_SIZE)]
dat = self.vectorizer.transform(chunk)
- distances, _ = self.nn.kneighbors(dat, n_neighbors=1)
+ distances, _ = self.nn.kneighbors(dat)
all_distances.extend(distances)
return all_distances


-class HashingNeighbors(Model):
- """Simple NN model"""
- # True random words
+class HashingApproximateNeighbors(Model):
+ """ Approximate Nearest Neighbor Search.
+ This implementation is rather slow, logreduce-tests benchmark: 60sec.
+ The code may be optimized to not record training data since we don't care
+ what the actual neighbor is, and it should simply return distance as float
+ and not str.
+
+ TODO: benchmark with higher sample size.
+ """
def __init__(self, name=""):
super().__init__(name)
self.vectorizer = HashingVectorizer(
binary=True,
- analyzer='word', lowercase=False, tokenizer=None,
+ analyzer=str.split, lowercase=False, tokenizer=None,
preprocessor=None, stop_words=None)
- self.nn = NearestNeighbors(algorithm='brute', metric='cosine')

def train(self, train_data):
+ try:
+ import pysparnn.cluster_index as ci
+ except ImportError:
+ raise RuntimeError("Install this dependency to use this model: "
+ "https://github.com/facebookresearch/pysparnn";)
+ train_data = list(train_data)
dat = self.vectorizer.transform(train_data)
- self.nn.fit(dat)
- self.info = "%d samples, %d features" % dat.shape
- return dat
+ self.nn = ci.MultiClusterIndex(dat, train_data)
+ self.info = ''

def test(self, test_data):
all_distances = []
- with warnings.catch_warnings():
- for chunk_pos in range(0, len(test_data), CHUNK_SIZE):
- chunk = test_data[chunk_pos:min(len(test_data),
- chunk_pos + CHUNK_SIZE)]
- dat = self.vectorizer.transform(chunk)
- distances, _ = self.nn.kneighbors(dat, n_neighbors=1)
- all_distances.extend(distances)
+ for chunk_pos in range(0, len(test_data), CHUNK_SIZE):
+ chunk = test_data[chunk_pos:min(len(test_data),
+ chunk_pos + CHUNK_SIZE)]
+ dat = self.vectorizer.transform(chunk)
+ distances = self.nn.search(
+ dat, k=1, k_clusters=2, return_distance=True)
+ # Work around str format of distance...
+ for distance in distances:
+ if distance[0][0].startswith('-'):
+ all_distances.append([0.0])
+ continue
+ all_distances.append([float(distance[0][0])])
return all_distances

- def process_line(self, line):
- return Tokenizer.process(line)
+
+class HashingAnnoy(Model):
+ """HashingAnnoy NN model.
+ logreduce-tests FAILED: 85.66% accuracy, 21.84% false-positive,
+ logreduce-tests benchmark: 56sec
+
+ TODO: test and benchmark with higher sample size.
+ """
+ def __init__(self, name=""):
+ try:
+ from annoy import AnnoyIndex
+ except ImportError:
+ raise RuntimeError("Install annoy library first")
+ super().__init__(name)
+ features = 2**13
+ self.vectorizer = HashingVectorizer(
+ binary=True, n_features=features,
+ analyzer=str.split, lowercase=False, tokenizer=None,
+ preprocessor=None, stop_words=None)
+ self.nn = AnnoyIndex(features)
+
+ def train(self, train_data):
+ dat = self.vectorizer.transform(train_data)
+ for idx in range(len(train_data)):
+ self.nn.add_item(idx, dat[idx].toarray()[0])
+ self.nn.build(10) # n trees
+ self.info = "%d samples, %d features" % dat.shape
+ return dat
+
+ def test(self, test_data):
+ all_distances = []
+ dat = self.vectorizer.transform(test_data)
+ for v in dat:
+ d = self.nn.get_nns_by_vector(
+ v.toarray()[0], 1, include_distances=True)
+ all_distances.append([d[1][0]])
+ # normalize
+ # l1
+ # norm = np.sum(all_distances)
+ # l2
+ norm = np.sqrt(np.sum(np.square(all_distances)))
+ normalized_distances = all_distances / norm
+ # Scores are much lower, increase artificially here for now
+ normalized_distances *= 2
+ return normalized_distances


models = {
- 'bag-of-words_lshf': LSHF,
'bag-of-words_nn': SimpleNeighbors,
'hashing_nn': HashingNeighbors,
+ 'hashing_ann': HashingApproximateNeighbors,
+ 'hashing_annoy': HashingAnnoy,
'noop': Noop,
}
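
The hashing_nn model registered above is the fast path: HashingVectorizer
needs no fitted vocabulary, and, as the comment in the diff notes, brute
force is the only scikit-learn neighbor search that supports cosine
distance on sparse input. A self-contained sketch of the idea, with
made-up sample lines:

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.neighbors import NearestNeighbors

    baseline = ["service started", "connection accepted RNGI"]
    target = ["service started", "Traceback most recent call last"]

    # Stateless hashing: transform() works without fit(), so the model
    # stores no vocabulary.
    vectorizer = HashingVectorizer(binary=True, n_features=2**18,
                                   analyzer=str.split, lowercase=False)
    nn = NearestNeighbors(algorithm='brute', metric='cosine', n_neighbors=1)
    nn.fit(vectorizer.transform(baseline))

    # Distance to the nearest baseline line: high means likely anomaly.
    distances, _ = nn.kneighbors(vectorizer.transform(target))
    for line, dist in zip(target, distances):
        print("%0.2f %s" % (dist[0], line))
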
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/logreduce/process.py
new/logreduce-0.4.0/logreduce/process.py
--- old/logreduce-0.3.0/logreduce/process.py 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/process.py 2018-11-08 08:39:56.000000000 +0100
@@ -34,7 +34,9 @@

class Classifier:
log = logging.getLogger("logreduce.Classifier")
- version = 4
+ # Bump this version when models created with earlier versions
+ # should be rejected
+ version = 5

def __init__(self,
model='bag-of-words_nn', exclude_paths=[], exclude_files=[]):
@@ -112,6 +114,13 @@
# Remove numbers and symbols
return re.subn(r'[^a-zA-Z\/\._-]*', '', shortfilename)[0]

+ @staticmethod
+ def _is_log_classify_invocation(model_name, line):
+ """ Returns True if the line is related to log-classify"""
+ return model_name == "job-output.txt" and (
+ "TASK [log-classify " in line or
+ "TASK [Generate ara report]" in line)
+
def train(self, baselines, command=sys.argv):
"""Train the model, baselines can be path(s) or build dict(s)"""
start_time = time.monotonic()
@@ -147,36 +156,29 @@
model.size = 0
model.count = 0
model.uuid = str(uuid.uuid4())
- # Tokenize and store all lines in train_data
- train_data = []
+ # Tokenize and store all de-duplicated lines in train_data
+ train_data = set()
for filename in filenames:
self.log.debug("%s: Loading %s" % (model_name, filename))
fobj = None
try:
fobj = open_file(filename)
- idx = 0
- while True:
- line = fobj.readline()
- if line == b'':
- break
+ for line in fobj:
line = line.decode('ascii', errors='ignore')
# Special case to not train ourself
- if model_name == "job-output.txt" and (
- "TASK [log-classify " in line or
- "TASK [Generate ara report]" in line):
+ if self._is_log_classify_invocation(model_name, line):
break
# Remove ansible std_lines list now
line = remove_ansible_std_lines_lists(line)
for sub_line in line.split(r'\r'):
sub_line = model.process_line(sub_line)
if sub_line:
- train_data.append(sub_line)
- idx += 1
+ train_data.add(sub_line)
+ model.count += 1
try:
model.size += os.stat(filename).st_size
except TypeError:
pass
- model.count += idx
except KeyboardInterrupt:
exit(1)
except Exception:
@@ -203,14 +205,19 @@

self.training_lines_count += model.count
self.training_size += model.size
+ train_data_time = time.monotonic() - model_start_time
+ self.log.debug(
+ "%s: Parsing took %s", model_name,
+ format_speed(model.count, model.size, train_data_time))
try:
# Transform and fit the model data
+ train_start_time = time.monotonic()
model = self.get(model_name)
model.train(train_data)
- model.train_time = time.monotonic() - model_start_time
+ model.train_time = time.monotonic() - train_start_time

- self.log.debug("%s: %s %s" % (
- model_name, model.info,
+ self.log.debug("%s: Fitting took %s" % (
+ model_name,
format_speed(model.count, model.size, model.train_time)))
except ValueError:
self.log.exception("%s: couldn't train with %s" % (model_name,
@@ -291,15 +298,10 @@
try:
fobj = open_file(filename)
idx = 0
- while True:
- line = fobj.readline()
- if line == b'':
- break
+ for line in fobj:
line = line.decode('ascii', errors='ignore')
# Special case to not test ourself
- if model_name == "job-output.txt" and (
- "TASK [log-classify " in line or
- "TASK [Generate ara report]" in line):
+ if self._is_log_classify_invocation(model_name, line):
break
# Remove ansible std_lines list now
line = remove_ansible_std_lines_lists(line)
@@ -362,8 +364,7 @@
outliers = []
last_outlier = 0
remaining_after_context = 0
- line_pos = 0
- while line_pos < len(data):
+ for line_pos in range(len(data)):
distance, line = get_line_info(line_pos)
if distance >= self.threshold:
if line_pos - last_outlier >= self.merge_distance:
@@ -383,7 +384,6 @@
outliers.append((line_pos, distance, line))
remaining_after_context -= 1
last_outlier = line_pos
- line_pos += 1

# Yield result
yield (filename_rel, filename_orig, model, outliers,
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/logreduce/server/api.py
new/logreduce-0.4.0/logreduce/server/api.py
--- old/logreduce-0.3.0/logreduce/server/api.py 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/server/api.py 2018-11-08 08:39:56.000000000 +0100
@@ -65,7 +65,8 @@
"""Return the anomalies list"""
results = []
with self.db.session() as session:
- for anomaly in session.query(model.Anomaly):
+ for anomaly in (session.query(model.Anomaly)
+ .order_by(model.Anomaly.report_date.desc())):
results.append({
'uuid': anomaly.uuid,
'name': anomaly.name,
@@ -75,7 +76,6 @@
'build': anomaly.build.toDict()
})
cherrypy.response.headers['Access-Control-Allow-Origin'] = '*'
- results.reverse()
return results

def _getAnomaly(self, session, anomaly_id):
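
The api.py change pushes ordering into the query, newest report first,
instead of reversing the result list in Python. A self-contained sketch
of the order_by pattern, assuming SQLAlchemy 1.4+; the Anomaly class
below is a simplified stand-in for the project's actual model:

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class Anomaly(Base):
        __tablename__ = 'anomalies'
        id = Column(Integer, primary_key=True)
        name = Column(String)
        report_date = Column(Integer)  # simplified; the real column is a date

    engine = create_engine('sqlite://')
    Base.metadata.create_all(engine)
    with Session(engine) as session:
        session.add_all([Anomaly(name='old', report_date=1),
                         Anomaly(name='new', report_date=2)])
        newest_first = (session.query(Anomaly)
                        .order_by(Anomaly.report_date.desc()))
        print([a.name for a in newest_first])  # -> ['new', 'old']
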
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/logreduce/tests/test_api.py
new/logreduce-0.4.0/logreduce/tests/test_api.py
--- old/logreduce-0.3.0/logreduce/tests/test_api.py 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/tests/test_api.py 2018-11-08 08:39:56.000000000 +0100
@@ -22,7 +22,7 @@
import logreduce.server.client
import logreduce.server.rpc as rpc

-from . utils import fake_build_result
+from . utils import fake_build_result, find_free_port

logging.basicConfig(level=logging.DEBUG)

@@ -31,7 +31,7 @@
@classmethod
def setup_class(cls):
cls.tmpfile = tempfile.mkstemp()[1]
- cls.gearman = {'addr': '0.0.0.0', 'port': 4742}
+ cls.gearman = {'addr': '0.0.0.0', 'port': find_free_port()}
cls.gear = rpc.Server(**cls.gearman)
cls.gear.start()
cls.downloadLog = []
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/logreduce/tests/test_download.py
new/logreduce-0.4.0/logreduce/tests/test_download.py
--- old/logreduce-0.3.0/logreduce/tests/test_download.py 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/tests/test_download.py 2018-11-08 08:39:56.000000000 +0100
@@ -44,4 +44,4 @@
})
mock_request.return_value = MockResponse(json.dumps(fake_builds))
zb = logreduce.download.ZuulBuilds("http://zuul.example.com/api")
- self.assertEquals(3, len(zb.get(result="SUCCESS")))
+ self.assertEqual(3, len(zb.get(result="SUCCESS")))
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/logreduce/tests/test_model.py
new/logreduce-0.4.0/logreduce/tests/test_model.py
--- old/logreduce-0.3.0/logreduce/tests/test_model.py 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/tests/test_model.py 2018-11-08 08:39:56.000000000 +0100
@@ -79,4 +79,4 @@
anomaly_uuid = self.db.import_report(session, report)

anomaly = session.query(model.Anomaly).get(anomaly_uuid)
- self.assertEquals("check", anomaly.build.pipeline)
+ self.assertEqual("check", anomaly.build.pipeline)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/logreduce/tests/test_units.py
new/logreduce-0.4.0/logreduce/tests/test_units.py
--- old/logreduce-0.3.0/logreduce/tests/test_units.py 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/tests/test_units.py 2018-11-08 08:39:56.000000000 +0100
@@ -17,11 +17,74 @@


class TokenizerTests(unittest.TestCase):
+ def check_expected(self, tests):
+ for raw_line, tokens_out in tests.items():
+ self.assertEqual(
+ tokens_out, Tokenizer.process(raw_line))
+
def test_random_words(self):
tokens = Tokenizer.process("Created interface: br-42")
self.assertNotIn("br-42", tokens)
tokens = Tokenizer.process("Instance 0xdeadbeef42 created")
- self.assertEquals("Instance created", tokens)
+ self.assertEqual("Instance created", tokens)
+
+ def test_hash_tokenizing(self):
+ self.check_expected({
+ 'Accepted publickey: RSA '
+ 'SHA256:UkrwIX8QHA4B2Bny0XHyqgSXM7wFMQTEDtT+PpY9Ep4':
+ 'Accepted publickey RNGH',
+ # This used to match 'jan' -> DATE
+ 'SHA256:FePTgARR5A3kxb2GJa0QAWjanaI2q+TvneBxzHNqbTA zuul@ze03':
+ 'RNGH zuul'
+ })
+
+ def test_ipv6_tokenizing(self):
+ self.check_expected({
+ 'mysql+pymysql://root:secretdatabase@[::1]/cinder?"':
+ 'mysql pymysql //root secretdatabase RNGI /cinder',
+ 'listen_port fe80::f816:3eff:fe47:5142':
+ 'listen_port RNGI',
+ 'listen_port FE80::F816:3eff:fe47:5142':
+ 'listen_port RNGI',
+ 'listen_port ::8888':
+ 'listen_port RNGI'
+ })
+
+ def test_date_non_tokenizing(self):
+ """Tests that should not match the DATE verb"""
+ self.check_expected({
+ 'keys randomart image':
+ 'keys randomart image',
+ 'Start zuul_console daemon':
+ 'Start zuul_console daemon',
+ })
+
+ def test_uuid_words(self):
+ self.check_expected({
+ '| 0473427f-f505-4b50-bc70-72fb6d74568a | vmname | SHUTOFF | - '
+ ' | Shutdown | fixed=192.168.123.3 |':
+ 'RNGU vmname SHUTOFF Shutdown fixed RNGI',
+ '"UndercloudServiceChain-2kbhkd45kcs3-ServiceChain-54rklv3rnxhe" ':
+ 'UndercloudServiceChain HEATID ServiceChain HEATID'
+ })
+
+ def test_non_uuid_words(self):
+ self.check_expected({
+ 'dnsmasq-dhcp[31216]: DHCPRELEASE':
+ 'dnsmasq dhcp DHCPRELEASE',
+ })
+
+ def test_digits_tokenizing(self):
+ self.check_expected({
+ 'Started Session 2677 of user root':
+ 'Started Session user root',
+ 'Instance 0xdeadbeef42 created':
+ 'Instance created',
+ 'systemd[4552]: Startup finished in 28ms.':
+ 'systemd Startup finished',
+ '764928K 33% 469M 3.05s':
+ ''
+ })

def test_filename2modelname(self):
for fname, modelname in (
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/logreduce/tests/utils.py
new/logreduce-0.4.0/logreduce/tests/utils.py
--- old/logreduce-0.3.0/logreduce/tests/utils.py 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/tests/utils.py 2018-11-08 08:39:56.000000000 +0100
@@ -10,6 +10,16 @@
# License for the specific language governing permissions and limitations
# under the License.

+import socket
+from contextlib import closing
+
+
+def find_free_port():
+ with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
+ s.bind(('', 0))
+ return s.getsockname()[1]
+
+
fake_result = {
'anomalies_count': 18,
'baselines': ['test_process.py'],
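
find_free_port() above binds to port 0 so the kernel assigns an unused
port, which test_api.py then passes to the gearman server in place of the
previously hard-coded 4742. A usage sketch (note the small inherent race:
the probe socket closes before the server re-binds the port):

    import socket
    from contextlib import closing

    def find_free_port():
        # Port 0 asks the kernel for any currently unused TCP port.
        with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
            s.bind(('', 0))
            return s.getsockname()[1]

    gearman = {'addr': '0.0.0.0', 'port': find_free_port()}
    print("test gearman config:", gearman)
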
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/logreduce/tokenizer.py
new/logreduce-0.4.0/logreduce/tokenizer.py
--- old/logreduce-0.3.0/logreduce/tokenizer.py 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/tokenizer.py 2018-11-08 08:39:56.000000000 +0100
@@ -1,3 +1,5 @@
+# Copyright 2018 Red Hat, Inc.
+# Copyright 2018 SUSE Linux GmbH.
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
@@ -21,45 +23,14 @@

UUID_RE = r'[0-9a-f]{8}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{4}-' \
'?[0-9a-f]{12}'
-
IPV4_RE = r'(([01]?[0-9]?[0-9]|2[0-4][0-9]|2[5][0-5])\.){3}' \
r'([01]?[0-9]?[0-9]|2[0-4][0-9]|2[5][0-5])'
-# TODO: simplify this if possible...
-IPV6_RE = (r'(?:(?:[0-9A-Fa-f]{1,4}:){6}(?:[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|'
- r'(?:(?:[0-9]|[1-9][0-9]|'
- r'1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}(?:[0-9]|[1-9][0-9]|'
- r'1[0-9]{2}|2[0-4][0-9]|25[0-5]))|'
- r'::(?:[0-9A-Fa-f]{1,4}:){5}(?:[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|'
- r'(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}'
- r'(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|'
- r'(?:[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){4}'
- r'(?:[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|'
- r'(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}'
- r'(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|'
- r'(?:[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){3}'
- r'(?:[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|'
- r'(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}'
- r'(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|'
- r'(?:(?:[0-9A-Fa-f]{1,4}:){,2}[0-9A-Fa-f]{1,4})?::'
- r'(?:[0-9A-Fa-f]{1,4}:){2}(?:[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|'
- r'(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}'
- r'(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|'
- r'(?:(?:[0-9A-Fa-f]{1,4}:){,3}[0-9A-Fa-f]{1,4})?::'
- r'[0-9A-Fa-f]{1,4}:(?:[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|'
- r'(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}'
- r'(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|'
- r'(?:(?:[0-9A-Fa-f]{1,4}:){,4}[0-9A-Fa-f]{1,4})?::'
- r'(?:[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|'
- r'(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}'
- r'(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|'
- r'(?:(?:[0-9A-Fa-f]{1,4}:){,5}[0-9A-Fa-f]{1,4})?::[0-9A-Fa-f]{1,4}|'
- r'(?:(?:[0-9A-Fa-f]{1,4}:){,6}[0-9A-Fa-f]{1,4})?::)')
-MAC_RE = r'([0-9A-F]{2}[:-]){5}([0-9A-F]{2})'
+IPV6_RE = r'([0-9A-Fa-f]{0,4}:){2,6}(\d{1,3}\.){0,3}\d{1,3}'
+MAC_RE = r'([0-9a-fA-F]{2}[:-]){5}([0-9a-fA-F]{2})'


class Tokenizer:
rawline_re = re.compile(
- r'('
# useless http GET
r'"GET / HTTP/1.1"'
r'|"OPTIONS * HTTP/1.0" 200'
@@ -81,35 +52,30 @@
r'|unix_chkpwd.*: password check failed for user'
r'|sshd.*: authentication failure'
r'|sshd.*: Failed password for'
+ r'|sshd.*- POSSIBLE BREAK-IN ATTEMPT'
# zuul random test
r'|zuul.*echo BECOME-SUCCESS-'
r'|^[^ ]{64}$'
# useless debug statement
r'|ovs-ofctl .* (dump-ports|dump-flows|show)\b'
r'|(ip|eb)tables .* -L\b'
- r')')
- ip_re = re.compile(r'(%s|%s|%s)' % (IPV4_RE, IPV6_RE, MAC_RE), re.I)
- power2_re = re.compile(r'([0-9a-f]{128}|[0-9a-f+/]{64}|'
- '[0-9a-f]{40}|[0-9a-f]{32})', re.I)
- uuid_re = re.compile(r'(%s|tx[^ ]{32})' % UUID_RE, re.I)
- date_re = re.compile('(%s|%s|%s|%s)' % (DAYS, SHORT_DAYS,
- SHORT_MONTHS, MONTHS), re.I)
- heat_re = re.compile("-[^ -]{12}[- $]", re.I)
- comments = re.compile(r'([\s]*# |^%% |^#|^[\s]*id = ").*')
- alpha_re = re.compile(r'[^a-zA-Z_\/\s]')
- gitver_re = re.compile(r'git[a-z0-9]+', re.I)
- digits_re = re.compile(r'(0x[0-9a-f]+|[0-9])', re.I)
- randpath_re = re.compile(r'('
- r'/tmp/ansible\.[a-z0-9_]{8}'
- r'|/tmp/tmp[a-z0-9_]{6}'
- r'|/tmp/tmp.[a-z0-9]{10}'
- r')', re.I)
- gitsha_re = re.compile(r'('
- r'[a-z0-9]{7}\.\.[a-z0-9]{7}'
- r')', re.I)
- hash_re = re.compile(r'('
- r'SHA256:[a-z0-9+/]{43} '
- r')', re.I)
+ )
+ ip_re = re.compile(r'%s|%s|%s' % (IPV4_RE, IPV6_RE, MAC_RE))
+ power2_re = re.compile(r'\b(?:[\w+/]{128}|[\w+/]{64}|'
+ r'[0-9a-fA-F]{40}|[0-9a-fA-F]{32})\b')
+ uuid_re = re.compile(r'\b(?:%s|tx[^ ]{32})\b' % UUID_RE, re.I)
+ date_re = re.compile(r'\b(?:%s|%s|%s|%s)\b' % (DAYS, SHORT_DAYS,
+ SHORT_MONTHS, MONTHS), re.I)
+ heat_re = re.compile(r'-\w{12}[- \"$]')
+ comments = re.compile(r'(?:[\s]*# |^%% |^#|^[\s]*id = ").*')
+ alpha_re = re.compile(r'[^a-zA-Z_\/\s]+')
+ gitver_re = re.compile(r'git\w+')
+ digits_re = re.compile(r'0x[0-9a-fA-F]{2,}|[0-9]+(?:\.\d+)?')
+ randpath_re = re.compile(r'(?:/tmp/ansible\.\w{8}'
+ r'|/tmp/tmp\w{6}'
+ r'|/tmp/tmp\.\w{10})\b')
+ gitsha_re = re.compile(r'\b\w{7}\.\.\w{7}\b')
+ hash_re = re.compile(r'SHA256:[\w+/]{43}\b')

@staticmethod
def process(line):
@@ -118,24 +84,26 @@
return ''
strip = line
# Remove words that are exactly 32, 64 or 128 character longs
- strip = Tokenizer.power2_re.subn("RNGN", strip)[0]
+ strip = Tokenizer.power2_re.sub("RNGN", strip)
# Remove uuid
- strip = Tokenizer.heat_re.subn(" HEAT ", strip)[0]
- strip = Tokenizer.uuid_re.subn("RNGU", strip)[0]
- # Remove date
- strip = Tokenizer.date_re.subn("DATE", strip)[0]
+ strip = Tokenizer.uuid_re.sub("RNGU", strip)
+ # Remove heat short uuid but keep spacing
+ # ObjectName-2kbhkd45kcs3-ServiceName -> ObjectName-HEATID-ServiceName
+ strip = Tokenizer.heat_re.sub(" HEATID ", strip)
# Remove git sha
- strip = Tokenizer.gitsha_re.subn("RNGG", strip)[0]
+ strip = Tokenizer.gitsha_re.sub("RNGG", strip)
# Remove hashes
- strip = Tokenizer.hash_re.subn("RNGH", strip)[0]
+ strip = Tokenizer.hash_re.sub("RNGH", strip)
# Remove random path
- strip = Tokenizer.randpath_re.subn("RNGP", strip)[0]
+ strip = Tokenizer.randpath_re.sub("RNGP", strip)
+ # Remove date
+ strip = Tokenizer.date_re.sub("DATE", strip)
# Remove ip/addr
- strip = Tokenizer.ip_re.subn("RNGI", strip)[0]
+ strip = Tokenizer.ip_re.sub("RNGI", strip)
# Remove numbers
- strip = Tokenizer.digits_re.subn("", strip)[0]
+ strip = Tokenizer.digits_re.sub("", strip)
# Only keep characters
- strip = Tokenizer.alpha_re.subn(" ", strip)[0]
+ strip = Tokenizer.alpha_re.sub(" ", strip)
# Remove tiny words
strip = " ".join(filter(lambda x: len(x) > 3, strip.split()))
# Weight failure token
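
The rewritten tokenizer above swaps subn(...)[0] for sub(), anchors most
patterns with \b, and reorders the substitutions (uuid before heat, date
after hashes and paths). Two of those rules in isolation, applied to a
made-up line:

    import re

    # Two of the Tokenizer rules above, in isolation.
    uuid_re = re.compile(r'[0-9a-f]{8}-?[0-9a-f]{4}-?[0-9a-f]{4}-?'
                         r'[0-9a-f]{4}-?[0-9a-f]{12}', re.I)
    digits_re = re.compile(r'0x[0-9a-fA-F]{2,}|[0-9]+(?:\.\d+)?')

    line = "Instance 0473427f-f505-4b50-bc70-72fb6d74568a created in 3.05s"
    line = uuid_re.sub("RNGU", line)  # random identifiers -> fixed token
    line = digits_re.sub("", line)    # bare numbers carry no signal
    print(line)  # -> Instance RNGU created in s
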
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/logreduce/utils.py
new/logreduce-0.4.0/logreduce/utils.py
--- old/logreduce-0.3.0/logreduce/utils.py 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/utils.py 2018-11-08 08:39:56.000000000 +0100
@@ -18,10 +18,11 @@
import sqlite3
import zlib
import json
+import datetime
+import time
+
try:
from systemd import journal
- import datetime
- import time
journal_installed = True
except ImportError:
journal_installed = False
@@ -42,6 +43,7 @@
"etc/systemd/",
"etc/polkit-1/",
"etc/pki/",
+ "etc/swift/.*\.builder",
"group_vars/all.yaml",
"keystone/credential-keys",
"keystone/fernet-keys",
@@ -100,35 +102,36 @@
]

BLACKLIST_EXTENSIONS = (
- ".sqlite",
- ".svg",
- ".woff",
- ".ttf",
+ ".conf",
+ ".conf.txt",
+ ".crt",
+ ".csr",
".css",
- ".js",
".db",
".ico",
+ ".journal",
+ ".js",
+ ".json",
+ ".json.txt",
+ "_key",
+ ".key",
+ ".pem",
".png",
- ".tgz",
".pyc",
".pyo",
- ".so",
- ".key",
- "_key",
- ".crt",
- ".csr",
- ".pem",
+ "ring.gz",
".rpm",
+ ".so",
+ ".sqlite",
".subunit",
- ".journal",
- ".json",
- ".json.txt",
- ".yaml.txt",
- ".conf",
- ".conf.txt",
+ ".svg",
+ ".tgz",
+ ".ttf",
+ ".woff",
+ ".xml",
".yaml",
+ ".yaml.txt",
".yml",
- "ring.gz",
)

FACILITY2NAME = {
@@ -190,11 +193,14 @@
self.journal.close()
del self.journal

- def readline(self):
+ def __iter__(self):
+ return self
+
+ def __next__(self):
entry = self.journal.get_next()
ts = entry.get('__REALTIME_TIMESTAMP', datetime.datetime(1970, 1, 1))
if not entry or (self.until and ts.timestamp() > self.until):
- return b''
+ raise StopIteration
facility = entry.get('SYSLOG_FACILITY')
if isinstance(facility, int):
entry['LEVEL'] = FACILITY2NAME.get(facility, 'NOTI').upper()
@@ -216,9 +222,12 @@
self.lines = []
self.idx = 0

- def readline(self):
+ def __iter__(self):
+ return self
+
+ def __next__(self):
if self.idx >= len(self.lines):
- return b''
+ raise StopIteration
self.idx += 1
return self.lines[self.idx - 1].encode('utf-8')
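
The utils.py change above converts the journal and log-file wrappers from
a readline() loop contract to the standard iterator protocol, which is
what lets process.py and the debug scripts use plain for loops over them.
The pattern in isolation (an illustrative class, not the package's own):

    class LineSource:
        # Illustrative file-like object usable directly in a for loop.
        def __init__(self, lines):
            self.lines = lines
            self.idx = 0

        def __iter__(self):
            return self

        def __next__(self):
            if self.idx >= len(self.lines):
                raise StopIteration  # replaces the old readline() == b'' check
            self.idx += 1
            return self.lines[self.idx - 1].encode('utf-8')

    for line in LineSource(["first", "second"]):
        print(line.decode('ascii'))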

diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/logreduce.egg-info/PKG-INFO
new/logreduce-0.4.0/logreduce.egg-info/PKG-INFO
--- old/logreduce-0.3.0/logreduce.egg-info/PKG-INFO 2018-10-25 11:23:59.000000000 +0200
+++ new/logreduce-0.4.0/logreduce.egg-info/PKG-INFO 2018-11-08 08:40:11.000000000 +0100
@@ -1,6 +1,6 @@
Metadata-Version: 1.1
Name: logreduce
-Version: 0.3.0
+Version: 0.4.0
Summary: Extract anomalies from log files
Home-page: https://logreduce.softwarefactory-project.io/
Author: Tristan Cacqueray
@@ -52,6 +52,18 @@
python3 setup.py develop --user
popd

+
+ * openSUSE:
+
+ .. code-block:: console
+
+ sudo zypper install python3-scikit-learn
+ git clone https://softwarefactory-project.io/r/logreduce
+ pushd logreduce
+ python3 setup.py develop --user
+ popd
+
+
* Pip:

.. code-block:: console
@@ -159,7 +171,7 @@
* logreduce-server: the REST and Gearman server
* logreduce-worker: job executor
* logreduce-client: client cli
- * logreduce-ui: web ui
+ * logreduce-webui: logreduce web interface

API
...
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/logreduce.egg-info/SOURCES.txt
new/logreduce-0.4.0/logreduce.egg-info/SOURCES.txt
--- old/logreduce-0.3.0/logreduce.egg-info/SOURCES.txt 2018-10-25 11:23:59.000000000 +0200
+++ new/logreduce-0.4.0/logreduce.egg-info/SOURCES.txt 2018-11-08 08:40:11.000000000 +0100
@@ -11,7 +11,7 @@
tox.ini
doc/conf.py
doc/index.rst
-etc/httpd/logreduce.conf
+etc/httpd/log-classify.conf
etc/logreduce/config.yaml
etc/systemd/logreduce-server.service
etc/systemd/logreduce-worker.service
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/logreduce.egg-info/pbr.json
new/logreduce-0.4.0/logreduce.egg-info/pbr.json
--- old/logreduce-0.3.0/logreduce.egg-info/pbr.json 2018-10-25 11:23:59.000000000 +0200
+++ new/logreduce-0.4.0/logreduce.egg-info/pbr.json 2018-11-08 08:40:11.000000000 +0100
@@ -1 +1 @@
-{"git_version": "a7a1da5", "is_release": true}
\ No newline at end of file
+{"git_version": "aa49628", "is_release": true}
\ No newline at end of file
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/logreduce.spec
new/logreduce-0.4.0/logreduce.spec
--- old/logreduce-0.3.0/logreduce.spec 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce.spec 2018-11-08 08:39:56.000000000 +0100
@@ -2,7 +2,7 @@
%{!?scl:%global pkg_name %{name}}

Name: %{?scl_prefix}logreduce
-Version: 0.1.0
+Version: 0.3.0
Release: 2%{?dist}
Summary: Extract anomalies from log files

@@ -32,7 +32,7 @@

%package server
Summary: The logreduce server
-Requires: %{?scl_prefix}logreduce
+Requires: %{?scl_prefix}logreduce = %version
Requires: %{?scl_prefix}python-alembic
Requires: %{?scl_prefix}python-sqlalchemy
Requires: %{?scl_prefix}python-cherrypy
@@ -46,7 +46,7 @@

%package worker
Summary: The logreduce worker
-Requires: %{?scl_prefix}logreduce
+Requires: %{?scl_prefix}logreduce = %version
Requires: %{?scl_prefix}python-gear

%description worker
@@ -70,11 +70,17 @@
%{?scl:scl enable %{scl} - << \EOF}
PBR_VERSION=%{version} %{__python3} setup.py build
%{?scl:EOF}
+# TODO: make this replace conditional only when SCL is enabled
sed -e 's#/var/lib/logreduce#/var/opt/rh/rh-python35/lib/logreduce#' \
-e 's#/var/log/logreduce#/var/opt/rh/rh-python35/log/logreduce#' \
-i etc/logreduce/config.yaml
sed -e 's#/usr/share/#/opt/rh/rh-python35/root/usr/share/#' \
- -i etc/httpd/logreduce.conf
+ -i etc/httpd/log-classify.conf
+sed -e 's#/usr/bin/#/opt/rh/rh-python35/root/usr/bin/#' \
+ -e 's#/etc/logreduce/#/etc/opt/rh/rh-python35/logreduce/#' \
+ -e 's#^ExecStart#EnvironmentFile=-/etc/opt/rh/rh-python35/sysconfig/enable-py3\nExecStart#' \
+ -i etc/systemd/logreduce-server.service etc/systemd/logreduce-worker.service
+
pushd web
ln -s /opt/patternfly-react-ui-deps/node_modules/ node_modules
PUBLIC_URL="/log-classify/" ./node_modules/.bin/yarn build
@@ -90,27 +96,17 @@
install -p -D -m 0644 etc/systemd/logreduce-server.service %{buildroot}%{_unitdir}/%{?scl_prefix}logreduce-server.service
install -p -D -m 0644 etc/systemd/logreduce-worker.service %{buildroot}%{_unitdir}/%{?scl_prefix}logreduce-worker.service
install -p -D -m 0644 etc/logreduce/config.yaml %{buildroot}%{_sysconfdir}/logreduce/config.yaml
-install -p -D -m 0644 etc/httpd/logreduce.conf %{buildroot}/etc/httpd/conf.d/logreduce.conf
+install -p -D -m 0644 etc/httpd/log-classify.conf %{buildroot}/etc/httpd/conf.d/log-classify.conf
install -p -d -m 0700 %{buildroot}%{_sharedstatedir}/logreduce
install -p -d -m 0700 %{buildroot}%{_localstatedir}/log/logreduce
-install -p -d -m 0755 %{buildroot}/var/www/logreduce/anomalies
-install -p -d -m 0755 %{buildroot}/var/www/logreduce/logs
-
+install -p -d -m 0755 %{buildroot}/var/www/log-classify/anomalies
+install -p -d -m 0755 %{buildroot}/var/www/log-classify/logs

-%pre server
-getent group logreduce >/dev/null || groupadd -r logreduce
-if ! getent passwd logreduce >/dev/null; then
- useradd -r -g logreduce -G logreduce -d %{_sharedstatedir}/logreduce -s /sbin/nologin -c "Logreduce Daemon" logreduce
-fi
-exit 0

-%pre worker
+%pre
getent group logreduce >/dev/null || groupadd -r logreduce
-if ! getent passwd logreduce >/dev/null; then
+getent passwd logreduce >/dev/null || \
useradd -r -g logreduce -G logreduce -d %{_sharedstatedir}/logreduce -s /sbin/nologin -c "Logreduce Daemon" logreduce
-fi
-exit 0
-

%post server
%systemd_post %{?scl_prefix}logreduce-server.service
@@ -127,7 +123,6 @@
%postun worker
%systemd_postun %{?scl_prefix}logreduce-worker.service

-
%files
%license LICENSE
%doc README.rst
@@ -140,10 +135,10 @@

%files server
%{_bindir}/logreduce-server
-%config(noreplace) /etc/httpd/conf.d/logreduce.conf
+%config(noreplace) /etc/httpd/conf.d/log-classify.conf
%{_unitdir}/%{?scl_prefix}logreduce-server.service
-%dir %attr(0755, logreduce, logreduce) /var/www/logreduce/logs
-%dir %attr(0755, logreduce, logreduce) /var/www/logreduce/anomalies
+%dir %attr(0755, logreduce, logreduce) /var/www/log-classify/logs
+%dir %attr(0755, logreduce, logreduce) /var/www/log-classify/anomalies

%files worker
%{_bindir}/logreduce-worker
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore'
old/logreduce-0.3.0/roles/log-classify/defaults/main.yaml
new/logreduce-0.4.0/roles/log-classify/defaults/main.yaml
--- old/logreduce-0.3.0/roles/log-classify/defaults/main.yaml 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/roles/log-classify/defaults/main.yaml 2018-11-08 08:39:56.000000000 +0100
@@ -28,7 +28,7 @@
# Process console-log
logclassify_console: true
# Process ara ansible.sqlite
-logclassify_ara_databae: false
+logclassify_ara_database: false

# Include paths from baseline logs
logclassify_logserver_dir: ""
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/scripts/debug_binsize.py
new/logreduce-0.4.0/scripts/debug_binsize.py
--- old/logreduce-0.3.0/scripts/debug_binsize.py 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/scripts/debug_binsize.py 2018-11-08 08:39:56.000000000 +0100
@@ -1,4 +1,4 @@
-#!/bin/env python3
+#!/usr/bin/env python3
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
@@ -16,8 +16,9 @@

import sys
from logreduce.utils import files_iterator, open_file
-from logreduce.models import Classifier, Model
-from logreduce.models import remove_ansible_std_lines_lists
+from logreduce.process import Classifier
+from logreduce.models import Model
+from logreduce.tokenizer import remove_ansible_std_lines_lists

try:
path = sys.argv[1]
@@ -32,17 +33,14 @@
bag_name = Classifier.filename2modelname(filename_rel)
groups.setdefault(bag_name, []).append(filename)

-model = Model()
+model = Model(bag_name)
for group_name, files in sorted(groups.items()):
for filename in files:
fobj = None
try:
fobj = open_file(filename)
idx = 0
- while True:
- line = fobj.readline()
- if line == b'':
- break
+ for line in fobj:
line = line.decode('ascii', errors='ignore')
# Remove ansible std_lines list now
line = remove_ansible_std_lines_lists(line)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/scripts/debug_filename2modelname.py
new/logreduce-0.4.0/scripts/debug_filename2modelname.py
--- old/logreduce-0.3.0/scripts/debug_filename2modelname.py 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/scripts/debug_filename2modelname.py 2018-11-08 08:39:56.000000000 +0100
@@ -1,4 +1,4 @@
-#!/bin/env python3
+#!/usr/bin/env python3
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
@@ -16,7 +16,7 @@

import sys
from logreduce.utils import files_iterator
-from logreduce.models import Classifier
+from logreduce.process import Classifier

try:
path = sys.argv[1]
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/scripts/debug_lineprocess.py
new/logreduce-0.4.0/scripts/debug_lineprocess.py
--- old/logreduce-0.3.0/scripts/debug_lineprocess.py 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/scripts/debug_lineprocess.py 2018-11-08 08:39:56.000000000 +0100
@@ -1,4 +1,4 @@
-#!/bin/env python3
+#!/usr/bin/env python3
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
@@ -14,15 +14,33 @@

"""Script to debug line tokenization"""

+from collections import Counter
import sys
+
from logreduce.tokenizer import Tokenizer

try:
path = sys.argv[1]
except IndexError:
- print("usage: %s file" % sys.argv[0])
+ print("usage: %s [file]..." % sys.argv[0])
exit(1)

-for line in open(path).readlines():
- print(line[:-1])
- print("-> %s" % Tokenizer.process(line))
+tokens_c = Counter()
+word_c = Counter()
+line_set = set()
+for path in sys.argv[1:]:
+ for line in open(path):
+ word_c.update(line.split())
+ tokens = Tokenizer.process(line)
+ tokens_c.update(tokens.split())
+ line = line.rstrip()
+ if line not in line_set and (line != tokens):
+ line_set.add(line)
+ print(" ", line)
+ print("-> %s" % tokens)
+
+print("Total words: %d Total Tokens: %d" % (
+ len(word_c), len(tokens_c)))
+
+print("Top 10 words: %s", word_c.most_common(10))
+print("Top 10 Tokens: %s", tokens_c.most_common(10))
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/test-requirements.txt
new/logreduce-0.4.0/test-requirements.txt
--- old/logreduce-0.3.0/test-requirements.txt 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/test-requirements.txt 2018-11-08 08:39:56.000000000 +0100
@@ -1,2 +1,3 @@
pytest
mock
+systemd-python
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/tox.ini new/logreduce-0.4.0/tox.ini
--- old/logreduce-0.3.0/tox.ini 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/tox.ini 2018-11-08 08:39:56.000000000 +0100
@@ -1,15 +1,16 @@
[tox]
-envlist = py35,pep8
+envlist = py35,py36,py37,pep8
minversion = 1.6
skipsdist = True
sitepackages = True

[testenv]
-sitepackages = True
usedevelop = True
deps = -rtest-requirements.txt
commands = py.test -v

[testenv:pep8]
-deps = flake8
-commands = flake8-3 --ignore=E26,E501,E251,E225,E722 logreduce
+basepython = python3
+sitepackages = False
+deps = flake8<3.6.0
+commands = flake8 --ignore=E26,E501,E251,E225,E722 logreduce
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/logreduce-0.3.0/web/src/pages/UserReport.jsx
new/logreduce-0.4.0/web/src/pages/UserReport.jsx
--- old/logreduce-0.3.0/web/src/pages/UserReport.jsx 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/web/src/pages/UserReport.jsx 2018-11-08 08:39:56.000000000 +0100
@@ -97,7 +97,7 @@
<Grid>
<h2>Report a new build to be analyzed</h2>
<p>Use the form bellow to report a Zuul build and trigger an automated
- analyzis</p>
+ analyzes</p>
<hr />
<Form horizontal>
<FormGroup controlId='name'>
@@ -118,7 +118,7 @@
<Col sm={9}>
<FormControl type='text' inputRef={i => this.reporter = i}/>
<HelpBlock>
- {'Enter your name like "irc-name" or "email address"'}
+ {'Enter your name like "IRC nick" or "Email address"'}
</HelpBlock>
</Col>
</FormGroup>
@@ -157,7 +157,7 @@
))}
</FormControl>
<HelpBlock>
- Those are known Zuul API endpoints to query build informations.
+ Those are known Zuul API endpoints to query build information.
</HelpBlock>
</Col>
</FormGroup>

