xlutils.filter
==============

.. currentmodule:: xlutils.filter

This framework is designed to filter and split Excel files using a
series of modular readers, filters, and writers all tied together by
the :func:`process` function. These components are
described below, followed by documentation for the currently available
readers, filters and writers. Right at the bottom is an example that
puts all the various pieces together.

.. _reader:

Readers
=======

Readers are components that get data from a source and parse it into a
series of :class:`xlrd.Book` objects before making calls to the appropriate
methods on the first filter in the chain. 

Because it is usually only the source of the data to be processed that
changes, a handy base class is provided for readers.

Here's an example reader that uses this base class to process the file
it is initialised with:

>>> import os
>>> from xlrd import open_workbook
>>> from xlutils.filter import BaseReader
>>> class MyReader(BaseReader):
...
...    def __init__(self, filename):
...       self.filename = filename
...
...    def get_filepaths(self):
...       return (os.path.join(test_files, self.filename),)

If you need to create a more unorthodox reader, such as one that reads
its data from a network socket or that needs to pass special
parameters to :func:`xlrd.open_workbook`, then read the implementation of
:class:`BaseReader`.

.. _filter:

Filters
=======

Filters are the important bits of this framework. They are built up in
chains to achieve the results required for a particular task. A filter
must define certain methods, the full set of these is shown in the
example below. The implementation of these methods can do whatever the
filter requires, but generally they end up calling the appropriate
methods on the next filter. 

Here's an example filter that does nothing but print messages when its
methods are called and then call the next filter in the chain:

>>> class MyFilter:
...
...    def __init__(self,name):
...        self.name = name
... 
...    def start(self):
...        print self.name,'start'
...        self.next.start()
...
...    def workbook(self,rdbook,wtbook_name):
...        print self.name,'workbook',rdbook,wtbook_name
...        self.next.workbook(rdbook,wtbook_name)
...    
...    def sheet(self,rdsheet,wtsheet_name):
...        print self.name,'sheet',rdsheet,wtsheet_name
...        self.next.sheet(rdsheet,wtsheet_name)
...        
...    def set_rdsheet(self,rdsheet):
...        print self.name,'set_rdsheet',rdsheet
...        self.next.sheet(rdsheet,wtsheet_name)
...        
...    def row(self,rdrowx,wtrowx):
...        print self.name,'row',rdrowx,wtrowx
...        self.next.row(rdrowx,wtrowx)
...
...    def cell(self,rdrowx,rdcolx,wtrowx,wtcolx):
...        print self.name,'cell',rdrowx,rdcolx,wtrowx,wtcolx
...        self.next.cell(rdrowx,rdcolx,wtrowx,wtcolx)
...
...    def finish(self):
...        print self.name, 'finish'
...        self.next.finish()

For full details of when each of these methods are called, see the
AP documentation for :class:`BaseFilterInterface`.

.. _writer:

Writers
=======

Writers are components that get handle calls from the appropriate
methods on the last filter in the chain. It is the writer that
actually does the work of copying all the information from the
appropriate sources and writing them to the output files.

Because there is a lot of work involved in this and it is only usually
the method of writing the binary data to its destination that differs,
a handy base class is provided for writers.

Here's an example writer that just writes the data to a temporary
directory:

>>> from xlutils.filter import BaseWriter
>>> our_temp_dir = TempDirectory().path
>>> class MyWriter(BaseWriter):
...
...    def get_stream(self,filename):
...       return open(os.path.join(our_temp_dir, filename), 'wb')

Available Readers
=================

Several reader implementations are included that cover many common use cases: 

GlobReader
----------

If you're processing files that are on disk, then this is probably
the reader for you. It returns all files matching the path
specification it's created with. Here's an example:

>>> from xlutils.filter import GlobReader
>>> r = GlobReader(os.path.join(test_files,'test*.xls'))
>>> sorted([p[len(test_files)+1:] for p in r.get_filepaths()])
['test.xls', 'testall.xls', 'testnoformatting.xls']

All the other functionality is provided by :class:`BaseReader`:

>>> isinstance(r,BaseReader)
True

XLRDReader
----------

If you want to "save" an :class:`xlrd.Book` object which you've already
created in some other way:

>>> from xlrd import open_workbook
>>> wb = open_workbook(os.path.join(test_files, 'testall.xls'))

Then the :class:`XLRDReader` is likely what you're after: 

>>> from xlutils.filter import XLRDReader
>>> r = XLRDReader(wb, 'foo.xls')

.. note:: You must supply a filename as shown above as the original
    filename is not stored in the :class:`xlrd.Book` object

Most functionality is provided by :class:`BaseReader`:

>>> isinstance(r, BaseReader)
True

However, its :meth:`~XLRDReader.get_workbooks` method just returns the
values it was instantiated with:

>>> tuple(r.get_workbooks())
((<xlrd...Book object at ...>, 'foo.xls'),)
>>> tuple(r.get_workbooks())[0][0] is wb
True

To show it working, here we send the book straight to a writer:

>>> from xlutils.filter import DirectoryWriter, process
>>> os.listdir(temp_dir)
[]
>>> process(r,DirectoryWriter(temp_dir))
>>> os.listdir(temp_dir)
['foo.xls']

TestReader
----------

This reader is specifically designed for testing filter
implementations with known sets of cells. This example should give a
good idea of how to use it:

>>> from mock import Mock
>>> from pprint import pprint
>>> from xlutils.tests.test_filter import TestReader
>>> from xlrd import XL_CELL_NUMBER
>>> r = TestReader(
...         ('Sheet1',(('R0C0','R0C1'),
...                    ('R1C0','R1C1'))),
...         ('Sheet2',(((XL_CELL_NUMBER,0.0),),),)
...     )
>>> c = Mock()
>>> r(c)
>>> pprint(c.method_calls)
[call.start(),
 call.workbook(<xlutils.tests.fixtures.DummyBook...>, 'test.xls'),
 call.sheet(<xlrd.sheet.Sheet...>, 'Sheet1'),
 call.row(0, 0),
 call.cell(0, 0, 0, 0),
 call.cell(0, 1, 0, 1),
 call.row(1, 1),
 call.cell(1, 0, 1, 0),
 call.cell(1, 1, 1, 1),
 call.sheet(<xlrd.sheet.Sheet...>, 'Sheet2'),
 call.row(0, 0),
 call.cell(0, 0, 0, 0),
 call.finish()]

Available Writers
=================

Several writer implementations are included that cover many common use cases: 

DirectoryWriter
---------------

If you're processing files that are on disk, then this is probably
the writer for you. It stores files in the directory passed to it
during creation. Here's an example:

>>> from xlutils.filter import DirectoryWriter
>>> temp_dir = TempDirectory().path
>>> w = DirectoryWriter(temp_dir)

Most of the functionality is provided by :class:`BaseWriter`:

>>> isinstance(w, BaseWriter)
True

The :meth:`~DirectoryWriter.get_stream` method makes sure the files end up in the
directory specified:

>>> os.listdir(temp_dir)
[]
>>> f = w.get_stream('test.xls')
>>> f.write('some \r\n data')
>>> f.close()
>>> os.listdir(temp_dir)
['test.xls']
>>> open(os.path.join(temp_dir,'test.xls'),'rb').read()
'some \r\n data'

StreamWriter
------------

If you want to write exactly one workbook to a stream, then this is
the writer for you:

>>> from tempfile import TemporaryFile
>>> from xlutils.filter import StreamWriter
>>> tf = TemporaryFile()
>>> w = StreamWriter(tf)

Most of the functionality is provided by :class:`BaseWriter`:

>>> isinstance(w, BaseWriter)
True

The :meth:`~StreamWriter.get_stream` method makes sure the excel data is written to
the stream provided:

>>> f = w.get_stream('test.xls')
>>> f.write('xls data')
>>> tf.seek(0)
>>> tf.read()
'xls data'

.. note:: Only one file may be written to a :class:`StreamWriter`,
  further attempts will result in an exception being raised:

>>> w.get_stream('test2.xls')
Traceback (most recent call last):
...
Exception: Attempt to write more than one workbook

:class:`StreamWriter` also doesn't close any streams passed to it:

>>> tf = TemporaryFile()
>>> process(TestReader(('Sheet1',[['R0C0']])),StreamWriter(tf))
>>> tf.seek(0)
>>> len(tf.read())
5632

XLWTWriter
----------

If you want to change cells after the filtering process is complete
then an :class:`XLWTWriter` can be used to obtain the :class:`xlwt.Workbook`
objects that result: 

>>> from xlutils.filter import XLWTWriter
>>> w = XLWTWriter()
>>> process(TestReader(('Sheet1',[['R0C0']])),w)

The objects can then be manipulated and saved as required:

>>> w.output
[('test.xls', <xlwt.Workbook.Workbook object at ...>)]

>>> book = w.output[0][1]
>>> book.get_sheet(0).write(0,1,'R0C1')

>>> temp_dir = TempDirectory()
>>> temp_dir.listdir()
No files or directories found.
>>> book.save(os.path.join(temp_dir.path,w.output[0][0]))
>>> temp_dir.listdir()
test.xls

As with previous writers, most of the functionality is provided by
:class:`BaseWriter`:

>>> isinstance(w, BaseWriter)
True

Available Filters
=================

A selection of filters are included as described below:

BaseFilter
----------

This is a "do nothing" filter that makes a great base class for your
own filters. All the required methods are implemented such that they
just call the same method on the next filter in the chain.

.. _column_trimmer:

ColumnTrimmer
-------------

This filter will strip columns containing no useful data from the
end of sheets. For example:

>>> from xlutils.filter import process, ColumnTrimmer
>>> r = TestReader(
...         ('Sheet1',(('','',''),
...                    ('R1C0','',''),
...                    ('R2C0','R2C1',' \t\r\n'))),
...         ('Sheet2',(('R0C0',),),),
...         ('Sheet3',(('R0C0','',(XL_CELL_NUMBER,0.0)),)),
...     )
>>> c = Mock()
>>> process(r,ColumnTrimmer(),c)
>>> pprint(c.method_calls)
[call.start(),
 call.workbook(<xlutils.tests.fixtures.DummyBook...>, 'test.xls'),
 call.sheet(<xlrd.sheet.Sheet...>, 'Sheet1'),
 call.row(0, 0),
 call.row(1, 1),
 call.row(2, 2),
 call.cell(0, 0, 0, 0),
 call.cell(0, 1, 0, 1),
 call.cell(1, 0, 1, 0),
 call.cell(1, 1, 1, 1),
 call.cell(2, 0, 2, 0),
 call.cell(2, 1, 2, 1),
 call.sheet(<xlrd.sheet.Sheet...>, 'Sheet2'),
 call.row(0, 0),
 call.cell(0, 0, 0, 0),
 call.sheet(<xlrd.sheet.Sheet...>, 'Sheet3'),
 call.row(0, 0),
 call.cell(0, 0, 0, 0),
 call.finish()]

When sheets are trimmed, a message is also logged to aid debugging:
 
>>> from testfixtures import LogCapture
>>> l = LogCapture()
>>> process(r, ColumnTrimmer(), c)
>>> print l
xlutils.filter DEBUG
  Number of columns trimmed from 3 to 2 for sheet 'Sheet1'
xlutils.filter DEBUG
  Number of columns trimmed from 3 to 1 for sheet 'Sheet3'

The definition of 'no useful data' can also be controlled by passing
in a function that returns ``True`` or ``False`` for each value:

>>> def not_useful(cell):
...     if not cell.value or cell.value=='junk': return True

>>> r = TestReader(
...         ('Sheet1',(('R1C0','','junk'),)),
...     )
>>> c = Mock()
>>> process(r, ColumnTrimmer(not_useful), c)
>>> pprint(c.method_calls)
[call.start(),
 call.workbook(<xlutils.tests.fixtures.DummyBook...>, 'test.xls'),
 call.sheet(<xlrd.sheet.Sheet...>, 'Sheet1'),
 call.row(0, 0),
 call.cell(0, 0, 0, 0),
 call.finish()]

.. _error-filters:

ErrorFilter
-----------

This filter caches all method calls in a file on disk and will only
pass them on to filters further down in the chain when its
:class:`~ErrorFilter.finish` method has been called *and* no
error messages have been logged to the python logging framework.

.. note:: To be effective, this filter must be the last in the chain
    before the writer!

Here's an example of how to set one up. We need to be able to see
what messages are logged, so we use a :class:`~testfixtures.LogCapture`:

>>> h = LogCapture()

Now, we install the filter:

>>> from xlutils.filter import process,ErrorFilter
>>> f = ErrorFilter()

To show the filter in action, we need a little helper:

>>> import logging
>>> from xlutils.filter import BaseFilter
>>> class Log(BaseFilter):
...    def __init__(self,level):
...        self.level = level
...    def workbook(self,rdbook,wtbook_name):
...        if wtbook_name=='test.xls':
...            logging.getLogger('theLogger').log(self.level,'a message')
...        self.next.workbook(rdbook,wtbook_name)

So, when we have errors logged, no methods other than finish are
passed on to the next filter:
   
>>> c = Mock()
>>> process(MyReader('test.xls'), Log(logging.ERROR), f, c)
>>> len(c.method_calls)
0

As well as the error message logged, we can also see the :class:`ErrorFilter`
logs an error to that that the method calls have not been passed on:

>>> print h
theLogger ERROR
  a message
xlutils.filter ERROR
  No output as errors have occurred.
>>> h.clear()

This error message can be controlled when the :class:`ErrorFilter` is
instantiated:

>>> f = ErrorFilter(message='wingnuts! errors have occurred!')
>>> process(MyReader('test.xls'), Log(logging.ERROR), f, c)
>>> print h
theLogger ERROR
  a message
xlutils.filter ERROR
  wingnuts! errors have occurred!

However, when no errors are logged, all method calls are passed:

>>> c = Mock()
>>> process(MyReader('test.xls'), Log(logging.WARNING), f, c)
>>> len(c.method_calls)
17

In addition to the logging of error messages, error cells will 
also cause all methods to be filtered:

>>> from xlrd import XL_CELL_ERROR
>>> r = TestReader(
...         (u'Price(\xa3)',(((XL_CELL_ERROR,0),),)),
...         )
>>> c = Mock()
>>> h.clear()
>>> process(r,ErrorFilter(),c)
>>> len(c.method_calls)
0
>>> print h
xlutils.filter ERROR
  Cell A1 of sheet 'Price(?)' contains a bad value: error (#NULL!)
xlutils.filter ERROR
  No output as errors have occurred.

You can also configure the log level at which messages prevent the
:class:`ErrorFilter` from passing its method calls on to the next
filter in the chain:

>>> f = ErrorFilter(logging.WARNING)

Now, warnings will cause methods to not be passed on:

>>> c = Mock()
>>> process(MyReader('test.xls'),Log(logging.WARNING),f,c)
>>> len(c.method_calls)
0

But if only debug messages are logged, the method calls will still
be passed on:

>>> c = Mock()
>>> process(MyReader('test.xls'),Log(logging.DEBUG),f,c)
>>> len(c.method_calls)
17

An example which may prove useful is how to set up a filter such
that if any errors are logged while processing one workbook, that
workbook is filtered out but other subsequent workbooks are not
filtered out.

This is done by inserting a filter such as the following earlier in
the chain:

>>> class BatchByWorkbook(BaseFilter):
...    started = False
...    def start(self): pass
...    def workbook(self,rdbook,wtbook_name):
...       if self.started:
...           self.next.finish()
...       self.next.start()
...       self.next.workbook(rdbook,wtbook_name)
...       self.started = True

Here it is at work, starting with an empty output directory:

>>> temp_dir = TempDirectory().path
>>> os.listdir(temp_dir)
[]

Now ``test.xls``, ``testall.xls`` and ``testnoformatting.xls`` are processed,
but errors are only logged while processing ``test.xls``:

>>> process(
...     GlobReader(os.path.join(test_files,'test*.xls')),
...     BatchByWorkbook(),
...     Log(logging.ERROR),
...     ErrorFilter(),
...     DirectoryWriter(temp_dir)
... )

So, the output directory contains ``testall.xls``, but no ``test.xls``:

>>> sorted(os.listdir(temp_dir))
['testall.xls', 'testnoformatting.xls']

.. _echo:

Echo
----

This filter will print calls to the methods configured when the
filter is created along with the arguments passed.

>>> from xlutils.filter import Echo, process
>>> r = TestReader(
...         ('Sheet1',(('R0C0','R0C1'),
...                    ('R1C0','R1C1'))),
...     )
>>> process(r, Echo(methods=('workbook',)), Mock())
workbook:(<...DummyBook...>, 'test.xls')

If ``True`` is passed instead of a list of method names, then all
methods called will be printed:

>>> process(r, Echo(methods=True), Mock())
start:()
workbook:(<...DummyBook...>, 'test.xls')
sheet:(<xlrd.sheet.Sheet...>, 'Sheet1')
row:(0, 0)
cell:(0, 0, 0, 0)
cell:(0, 1, 0, 1)
row:(1, 1)
cell:(1, 0, 1, 0)
cell:(1, 1, 1, 1)
finish:()

If you need to see what's happening at various points in a chain,
you can also give an :class:`Echo` a name:

>>> process(r,Echo('first'), Echo('second'), Mock())
'first' start:()
'second' start:()
'first' workbook:(<...DummyBook...>, 'test.xls')
'second' workbook:(<...DummyBook...>, 'test.xls')
'first' sheet:(<xlrd.sheet.Sheet...>, 'Sheet1')
'second' sheet:(<xlrd.sheet.Sheet...>, 'Sheet1')
'first' row:(0, 0)
'second' row:(0, 0)
'first' cell:(0, 0, 0, 0)
'second' cell:(0, 0, 0, 0)
'first' cell:(0, 1, 0, 1)
'second' cell:(0, 1, 0, 1)
'first' row:(1, 1)
'second' row:(1, 1)
'first' cell:(1, 0, 1, 0)
'second' cell:(1, 0, 1, 0)
'first' cell:(1, 1, 1, 1)
'second' cell:(1, 1, 1, 1)
'first' finish:()
'second' finish:()


MemoryLogger
------------

This filter will dump stats to the path it was configured with using
the heapy package if it is available. If it is not available, no
operations are performed.

For example, with a :class:`MemoryLogger` configured as follows:

>>> from xlutils.filter import MemoryLogger
>>> m = MemoryLogger('/some/path', methods=('sheet','cell'))

The equivalent of the following call::

  from guppy import hpy; hpy().heap().stat.dump('/some/path')

will be performed whenever the :class:`MemoryLogger`'s
:meth:`~MemoryLogger.sheet` and :meth:`~MemoryLogger.cell` methods are
called.

A :class:`MemoryLogger` configured as followed will log memory usage
whenever any of the :class:`MemoryLogger`'s methods are called:

>>> m = MemoryLogger('/some/path',True)

For more information on heapy, please see:

http://guppy-pe.sourceforge.net/#Heapy

Example Usage
=============

Here's an example that makes use of all the types of components
described above to filter out odd numbered rows from an original
workbook's sheets. To do this we need one more filter:

>>> from xlutils.filter import BaseFilter
>>> class EvenFilter(BaseFilter):
...
...     def row(self,rdrowx,wtrowx):
...         if not rdrowx%2:
...             self.next.row(rdrowx,wtrowx)
...     
...     def cell(self,rdrowx,rdcolx,wtrowx,wtcolx):
...         if not rdrowx%2:
...             self.next.cell(rdrowx,rdcolx,wtrowx,wtcolx)

Now we can put it all together with a call to the :func:`process`
function:

>>> from xlutils.filter import process
>>> process(
...     MyReader('test.xls'),
...     MyFilter('before'),
...     EvenFilter(),
...     MyFilter('after'),
...     MyWriter()
...     )
before start
after start
before workbook <xlrd...Book object at ...> test.xls
after workbook <xlrd...Book object at ...> test.xls
before sheet <xlrd.sheet.Sheet object at ...> Sheet1
after sheet <xlrd.sheet.Sheet object at ...> Sheet1
before row 0 0
after row 0 0
before cell 0 0 0 0
after cell 0 0 0 0
before cell 0 1 0 1
after cell 0 1 0 1
before row 1 1
before cell 1 0 1 0
before cell 1 1 1 1
before sheet <xlrd.sheet.Sheet object at ...> Sheet2
after sheet <xlrd.sheet.Sheet object at ...> Sheet2
before row 0 0
after row 0 0
before cell 0 0 0 0
after cell 0 0 0 0
before cell 0 1 0 1
after cell 0 1 0 1
before row 1 1
before cell 1 0 1 0
before cell 1 1 1 1
before finish
after finish

As you can see if you've read this far, there's quite a lot of output,
but it's certainly informative! However, just to be on the safe side,
we can see that the output file was actually written:

>>> os.listdir(our_temp_dir)
['test.xls']
