A few elements in the interface are specific and need an explanation.
A udi (unique document identifier) identifies a document. Because of limitations inside the index engine, it is restricted in length (to 200 bytes), which is why a regular URI cannot be used. The structure and contents of the udi are defined by the application and opaque to the index engine. For example, the internal file system indexer uses the complete document path (file path + internal path), truncated to length, the suppressed part being replaced by a hash value.
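The truncate-and-hash scheme described above can be sketched as follows. This is only an illustration and not the code used by the Recoll file system indexer: the function name make_udi, the "|" separator and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib

MAX_UDI_BYTES = 200  # length limit stated in the text above


def make_udi(file_path, internal_path=""):
    """Hypothetical helper: build a udi from a document path.

    The "|" separator and the SHA-256 digest are assumptions for this
    sketch, not what Recoll itself does.
    """
    full = file_path + ("|" + internal_path if internal_path else "")
    raw = full.encode("utf-8")
    if len(raw) <= MAX_UDI_BYTES:
        return raw
    # Keep the head of the path and replace the suppressed tail by its hash.
    digest_len = hashlib.sha256().digest_size * 2  # length of the hex digest
    keep = MAX_UDI_BYTES - digest_len
    digest = hashlib.sha256(raw[keep:]).hexdigest().encode("ascii")
    return raw[:keep] + digest
```

Short paths pass through unchanged, while long ones always come out at exactly 200 bytes and still differ when only the suppressed part differs.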
The ipath data value (set as a field in the Doc object) is stored, along with the URL, but not indexed by Recoll. Its contents are not interpreted, and its use is up to the application. For example, the Recoll internal file system indexer stores the part of the document access path internal to the container file (ipath in this case is a list of subdocument sequential numbers). url and ipath are returned in every search result and permit access to the original document.
The fields file inside the Recoll configuration defines which document fields are either "indexed" (searchable), "stored" (retrievable with search results), or both.
Data for an external indexer should be stored in a separate index, not the one for the Recoll internal file system indexer (except if the latter is not used at all). The reason is that the main document indexer purge pass would remove all the other indexer's documents, as they were not seen during indexing. The main indexer documents would also probably be a problem for the external indexer purge operation.
Recoll versions after 1.11 define a Python programming interface, both for searching and indexing. The indexing portion has seen little use, but the searching one is used in the Recoll Ubuntu Unity Lens and Recoll Web UI.
The API is inspired by the Python database API specification. There were two major changes in recent Recoll versions:
The recoll module became a package (with an internal recoll module) as of Recoll version 1.19, in order to add more functions. For existing code, this only changes the way the interface must be imported.
We will mostly describe the new API and package structure here. A paragraph at the end of this section will explain a few differences and ways to write code compatible with both versions.
The Python interface can be found in the source package, under python/recoll.
The python/recoll/ directory contains the usual setup.py. After configuring the main Recoll code, you can use the script to build and install the Python module:
cd recoll-xxx/python/recoll
python setup.py build
python setup.py install
The normal Recoll installer installs the Python API along with the main code.
When installing from a repository, and depending on the distribution, the Python API can sometimes be found in a separate package.
The recoll package contains two modules:
The recoll module contains functions and classes used to query (or update) the index.
The rclextract module contains functions and classes used to access document data.
The connect() function connects to one or several Recoll index(es) and returns a Db object.
confdir may specify a configuration directory. The usual defaults apply.
extra_dbs is a list of additional indexes (Xapian directories).
writable decides if we can index new data through this connection.
A Db object is created by a connect() function and holds a connection to a Recoll index.
Methods
Db object after this.
query(): returns a Query object for this index.
setAbstractParams(maxchars, contextwords): maxchars defines the maximum total size of the abstract. contextwords defines how many terms are shown around the keyword.
match_type can be either of wildcard, regexp or stem. Returns a list of terms expanded from the input expression.
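As an illustration of the connect() parameters described above, here is a minimal sketch of a wrapper that only passes the arguments that were actually set. The helper name open_index is hypothetical, and the connect function is passed in as an argument (an assumption made so that the sketch can be exercised without Recoll installed); only the keyword names confdir, extra_dbs and writable come from the text.

```python
def open_index(connect, confdir=None, extra_dbs=None, writable=False):
    """Hypothetical wrapper around a connect()-style function.

    Only non-default arguments are forwarded, so that the usual
    defaults keep applying for everything else.
    """
    kwargs = {}
    if confdir is not None:
        kwargs["confdir"] = confdir
    if extra_dbs:
        # Additional indexes, given as a list of Xapian directories.
        kwargs["extra_dbs"] = list(extra_dbs)
    if writable:
        # Only request a writable connection when indexing is intended.
        kwargs["writable"] = True
    return connect(**kwargs)
```

With the Recoll package available, this would presumably be called as open_index(recoll.connect, confdir="/path/to/confdir"), the path here being a placeholder.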
A Query
object (equivalent to a
cursor in the Python DB API) is created by
a Db.query()
call. It is used to
execute index searches.
Methods
Sorts results by fieldname, in ascending or descending order. Must be called before executing the search.
execute(query_string): starts a search for query_string, a Recoll search language string.
Fetches the next Doc objects in the current search results, and returns them as an array of the required size, which is by default the value of the arraysize data member.
fetchone(): fetches the next Doc object from the current search results.
scroll(): mode can be relative or absolute.
highlight(): ishtml can be set to indicate that the input text is HTML and that HTML special characters should not be escaped. methods, if set, should be an object with methods startMatch(i) and endMatch(), which will be called for each match and should return a begin and end tag.
Creates an abstract for doc (a Doc object) by selecting text around the match terms. If methods is set, will also perform highlighting. See the highlight method.
Iteration is supported, so that for doc in query: will work.
Data descriptors
rownumber: position in the current search results (see scroll()). Starts at 0.
A Doc
object contains index data
for a given document. The data is extracted from the
index when searching, or set by the indexer program when
updating. The Doc object has many attributes to be read or
set by its user. It matches exactly the Rcl::Doc C++
object. Some of the attributes are predefined, but,
especially when indexing, others can be set, the name of
which will be processed as field names by the indexing
configuration. Inputs can be specified as Unicode or
strings. Outputs are Unicode objects. All dates are
specified as Unix timestamps, printed as strings. Please
refer to the rcldb/rcldoc.h
C++ file
for a description of the predefined attributes.
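Since dates come back as Unix timestamps printed as strings, a small conversion helper can be convenient on the application side. The following is a sketch only: the function name doc_time_to_datetime is hypothetical, and treating an empty value as "no date" is an assumption.

```python
from datetime import datetime, timezone


def doc_time_to_datetime(value):
    """Convert a Unix timestamp given as a string to a UTC datetime.

    Returns None for an empty value (assumption: an unset date field
    comes back as an empty string).
    """
    if not value:
        return None
    return datetime.fromtimestamp(int(value), tz=timezone.utc)
```

The result is timezone-aware, so it can be converted to local time with astimezone() before display.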
At query time, only the fields that are defined as stored, either by default or in the fields configuration file, will be meaningful in the Doc object. In particular, this will not be the case for the document text. See the rclextract module for accessing document contents.
Methods
A SearchData object allows building a query by combining clauses, for execution by Query.executesd(). It can be used instead of the query language approach. The interface is going to change a little, so no detailed doc for now...
Methods
Index queries do not provide document content (only a partial and imprecise reconstruction is performed to show the snippets text). In order to access the actual document data, the data extraction part of the indexing process must be performed (subdocument access and format translation). This is not trivial in general. The rclextract module currently provides a single class which can be used to access the data content for result documents.
Methods
An Extractor object is built from a Doc object, output from a query.
textextract(ipath): extracts the document designated by ipath and returns a Doc object. The doc.text field has the document text as either text/plain or text/html according to doc.mimetype. The typical use would be as follows:
qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc)
doc = extractor.textextract(qdoc.ipath)
# The extracted text is then available as doc.text

qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc)
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
The following sample would query the index with a user
language string. See the python/samples
directory inside the Recoll source for other
examples. The recollgui
subdirectory
has a very embryonic GUI which demonstrates the
highlighting and data extraction functions.
#!/usr/bin/env python
from recoll import recoll

db = recoll.connect()
db.setAbstractParams(maxchars=80, contextwords=4)

query = db.query()
nres = query.execute("some user question")
print "Result count: ", nres
if nres > 5:
    nres = 5
for i in range(nres):
    doc = query.fetchone()
    print "Result #%d" % (query.rownumber,)
    for k in ("title", "size"):
        print k, ":", getattr(doc, k).encode('utf-8')
    abs = db.makeDocAbstract(doc, query).encode('utf-8')
    print abs
    print
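The methods object expected by the highlighting functions (an object with startMatch(i) and endMatch() methods, each returning a tag) could look like the sketch below. The class name SpanTags, the tag contents and the meaning given to i (taken here to be the match index) are assumptions. The wrap_matches function is only a stand-in showing how such an object might be called; it is not Recoll code.

```python
class SpanTags(object):
    """Hypothetical 'methods' object supplying the tags around each match."""

    def startMatch(self, i):
        # Assumption: i identifies the match being opened (e.g. its index).
        return '<span class="hit" id="hit%d">' % i

    def endMatch(self):
        return '</span>'


def wrap_matches(text, spans, methods):
    """Stand-in for the library side of highlighting.

    Wraps each (start, end) character span of text with the tags
    returned by methods. Spans must be sorted and non-overlapping.
    """
    out, pos = [], 0
    for i, (start, end) in enumerate(spans):
        out.append(text[pos:start])        # unchanged text before the match
        out.append(methods.startMatch(i))  # begin tag
        out.append(text[start:end])        # the matched text itself
        out.append(methods.endMatch())     # end tag
        pos = end
    out.append(text[pos:])
    return "".join(out)
```

Keeping the tag generation in a separate object lets the same search code produce HTML for a web UI or different markup for a terminal or GUI.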
The following code fragments can be used to ensure that code can run with both the old and the new API (as long as it does not use the new abilities of the new API of course).
Adapting to the new package structure:
try:
    from recoll import recoll
    from recoll import rclextract
    hasextract = True
except:
    import recoll
    hasextract = False
Adapting to the change of nature of the next Query member. The same test can be used to choose to use the scroll() method (new) or set the next value (old).
rownum = query.next if type(query.next) == int else \
    query.rownumber
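The same test can be wrapped in a small function so that calling code does not need to repeat it. This is a sketch under the assumption, taken from the fragment above, that next is an integer in the old API and something else in the new one. The name current_row is hypothetical.

```python
def current_row(query):
    """Return the current result position with either API generation."""
    nxt = getattr(query, "next", None)
    if isinstance(nxt, int):
        # Old API: next is an integer data member holding the position.
        return nxt
    # New API: next is no longer an integer, the position is in rownumber.
    return query.rownumber
```

Using getattr with a default also covers a query object that has no next member at all.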