Pinot
Copyright 2005-2007 Fabrice Colin <fabricecolin at users dot berlios dot de>
http://pinot.berlios.de/


1. What is Pinot
2. Upgrading from previous versions
3. Available engines
4. Indexes
5. Indexing and monitoring
6. Search strategy
7. Viewing cached results
8. File formats
9. File patterns
10. More Like This
11. Saving results
12. D-Bus service
13. Deskbar Applet plugin
14. How to reset indexes
15. Dependencies
16. FAQ


1. What is Pinot


  Pinot is :
    * a D-Bus service that crawls, indexes and monitors your documents
      ("pinot-dbus-daemon")
    * a GTK-based user interface that enables to query the index built by the
      service or your favourite Web engine, and display and analyze the
      results ("pinot")
    * other command-line tools

  It was developed and tested on GNU/Linux and should work on other Unix-like
  systems.


2. Upgrading from previous versions


  * from 0.50 or older to 0.60 or newer
  The My Documents index was renamed My Web Pages. A new index, appropriately
  named My Documents ;-), includes all local files including mbox files and is
  populated by the new D-Bus service.
  If you used the Import facility to index specific directories on your system,
  it is recommended you manually unindex the corresponding documents from the
  My Web Pages index, and configure these directories to be automatically
  indexed by the D-Bus service, See section "5. Indexing and monitoring".
  The My EMail index was dropped. Email messages will now appear in your
  My Documents index.


  * to 0.71 or newer
  V0.71 and newer will detect what version the My Documents index was built
  with and automatically upgrade it by reindexing all documents, if necessary.
  As for My Web Pages, the UI will indicate when upgrading is necessary.
  Updating documents with the Index, Update menu is recommended.


3. Available engines 


  One of the main functionalities of Pinot is metasearch. This lets you query
  a variety of sources, including Web-based search engines. By default, the
  list of available engines is hidden and defaults to internal indexes (see
  section "4. Indexes"). To show the list of engines, click on the Show All
  Search Engines button, next to the Query field immediately below the menu
  bar. Click on the same button again to hide the list.

  Any number of engine or engine group may be selected at any one time.
  Multi-selection is done like in any other application. All queries are always
  run against the list of currently selected engines.

  Starting with v0.42, Pinot supports both Sherlock and OpenSearch Description
  plugins. They are installed in $PREFIX/share/pinot/engines/, where PREFIX
  is usually /usr. Additional engines can be installed in that directory or
  in ~/.pinot/engines. Note this directory is not created automatically.

  Sherlock is what Firefox and the Mozilla suite use. Chances are that somebody
  wrote a plugin for the engine you are interested in. MyCroft at
  http://mycroft.mozdev.org/ has got plenty of plugins. Beware that a lot are
  out of date and will require some changes. Use pinot-search on the
  command-line to run a quick check on a plugin, eg
  $ pinot-search sherlock /usr/share/pinot/engines/Bozo.src "clowns"

  Plugins are categorized by channels. For Sherlock plugins, the routeType
  element under SEARCH specifies the name of the channel the plugin belongs to.

  As for OpenSearch, A9.com has an extensive list at
  http://a9.com/-/search/moreColumns.jsp. A Description file is not available
  for all engines though...
  Pinot should work with OpenSearch Description 1.0 and 1.1 (draft 2) plugins.
  One thing to keep in mind is that because Description doesn't describe how
  to parse the results page returned by the search engine, Pinot assumes that
  the engine will return results formatted according to the OpenSearch Response
  standard.
  In practice, this means that plugins that don't stick to the following rules
  will be ignored or won't show any result :
    * For Description 1.1 plugins, the type attribute on the Url field must be
      set to "application/atom+xml" or "application/rss+xml" (default).
      "text/html" will be rejected.
    * The search engine's results page content type must be some form of XML,
      otherwise Pinot won't attempt parsing it. 
  Pinot differs from the Description spec in that it interprets the Tags field
  as a channel name. The standard defines Tags as a "space-delimited set of
  words that are used as keywords to identify and categorize this search
  content".

  If the version of Pinot you are using was built with support for the Google
  SOAP API, a "Google API" engine will appear in the group The Web. To query
  this engine, create an account at http://api.google.com/. You will then
  receive a Google API key by email. Enter this key in the corresponding
  field in the General tab of the Preferences box.


4. Indexes


  Pinot has two internal indexes. My Documents is populated by the D-Bus
  service and contains documents found on your computer. My Web Pages is
  populated by the UI whenever you :
    * import an external document, using the Index, Import URL menu
    * index results returned by Web engines, using the Results, Index menu
      or through a Stored Query
  Both index may have any of the file types listed in section "8. File formats".

  Indexes built by any other Xapian-based tools can be added to Pinot. To add
  an external index, click the + button at the bottom of the engines list.
  It can either be local, in which case you will have to select the directory
  where it is found, or served from a remote machine by xapian-tcpsrv. See
  the manual page for xapian-tcpsrv(1).

  All indexes are grouped together under the channel Current User in the
  engines list.


5. Indexing and monitoring


  Pinot can index any directory configured under the Indexing tab of the
  Preferences box. Monitoring is optional and should be disabled for the
  directories whose contents seldom change, eg /usr/share/doc.
  Indexing and monitoring of directories is handled by the D-Bus service.

  While Pinot is not currently able to get to and index application-specific
  data held in dot-directories, it can index common file formats as listed
  in section "8. File formats".

  All files and directories with a name that starts with a dot, eg ".gaim",
  are skipped and their content is not indexed. If you wish to include the
  contents of some dot-directories, create a symlink to a directory that is
  configured in Preferences. For instance, if your home directory is
  configured for indexing, create a symlink from "~/.gaim/logs" to
  "~/Documents/Logs" to index your GAIM log files.
  Do the same for mbox files held in dot-directories, eg ".thunderbird". Pinot
  0.74 and older required mbox files to be configured separately. Newer
  versions are able to fully index (and monitor, if necessary) all mbox files
  found during a directory crawl, just like any other file type.

  If you want to exclude any specific files or directories from indexing, use
  patterns as described in section "9. File patterns".


6. Search strategy


  When querying an index, Pinot adopts a multi-stepped search strategy that
  first attempts to find the documents that best match the query. If a given
  step doesn't return results, the next step alters the query slightly before
  it is run again. These steps are :
    * step 1
      obey the query's operators, don't stem terms
    * step 2
      obey operators, stem terms
    * step 3
      don't follow operators (all terms are OR'ed) and don't stem terms
    * step 4
      don't follow operators and stem terms
  Stemming is available only to Stored Queries for which a language is defined
  (Query parameters box, Indexes only tab).

  Queries can be expressed in a very natural way, using a combination of operators
  and filters. For instance "type:text/html and lang:en and (tcp near ip)" looks
  for HTML files in English that mention TCP/IP. 

  Supported filters are :
    * filters that apply to all engines
      "site" for host name, eg "site:pinot.berlios.de"
      "file" for file name, eg "file:index.html"
    * filters that only apply to Current User engines
      "title" for title, eg "title:Session"
      "url" for URL, eg "url:http://pinot.berlios.de/"
      "dir" for directory, eg "dir:/home/fabrice/test.txt"
      "lang" for ISO language code, eg "lang:en"
      "type" for MIME type, eg "type:text/html"
      "label" for label, eg "label:Important"

  Allowed language codes are "da", "nl", "en", "fi", "fr", "de", "it", "nn", "pt",
  "ru", "es" and "sv".

  For more information about the query syntax, you may want to have a look at the
  documentation for Xapian's QueryParser at http://www.xapian.org/docs/queryparser.html


7. Viewing cached results


  Results returned by search engines can be viewed "live" by selecting the View
  menuitem under Results. This opens whatever application defined for the
  result's MIME type and/or protocol scheme.
  In addition, Pinot allows to view the page as cached by Google and the Wayback
  Machine. Cache providers are actually configured in globalconfig.xml, located
  in /etc/pinot/. For instance :
  <cache>
    <name>Google</name>
    <location>http://www.google.com/search?q=cache:%url0</location>
    <protocols>http, https</protocols>
  </cache>

  This is self-explanatory :-) Here it configures a cache provider called
  "Google" that handles both http and https. The location field supports
  two parameters that are substituted to obtain the URL to open : 
    * %url is the result's URL as displayed by the UI, eg
      http://pinot.berlios.de/download.html
    * %url0 is the result's URL without the protocol, eg
      pinot.berlios.de/download.html


8. File formats


  The following document types are supported and will be fully indexed :
    * plain text
    * HTML
    * PDF (pdftotext is required)
    * RTF (unrtf is required)
    * MS Word (antiword is required)
    * XML
    * OpenDocument/StarOffice files
    * mbox
    * MP3 and Ogg Vorbis (TagLib required)
    * RPM (rpm required)
    * DEB (dpkg required)

  For other document types, Pinot will only index metadata such as name,
  location etc...

  If you wish to add support for another document type, and know of a
  command-line program that can convert that type to plain text, XML or HTML,
  you should be able to add it to external-filters.xml, located in /etc/pinot/.


9. File patterns


  Starting with version 0.62, it is possible to skip indexing of files with
  glob(3) patterns. These patterns are configured in the Indexing tab of the
  Preferences box, and can be used as a blacklist or a whitelist (as of 0.75).

  Patterns also to files and directories. For instance "*/Desktop*" will skip
  "~/Desktop" and not crawl nor monitor this directory's contents.
  If you have never run Pinot before, the list will be pre-configured to skip
  common picture and video types such as JPG, GIF, ZIP and MPG.


10. More Like This


   The More Like This feature is enabled when at least one of the results
   currently selected is indexed, ie if the result was previously manually
   indexed in My Web Pages, or if it's a local file indexed in My Documents.
   Its role is to expand queries.

   When activated, it will suggest other terms from the selected results
   and create a new Stored Query prefixed with "More Like". For instance,
   if you run a Stored Query with name "Me", the expanded query's name will be
   "More Like Me".


11. Saving results


   Starting with version 0.72, lists of results can be saved to disk by
   selecting the Save As menuitem under Results. Two output formats are
   available to choose from in the file selector opened by Save As :
     * CSV, a text format
       The semi-colon character (';') is used to delimit fields.
     * OpenSearch response, a XML/RSS format
       See http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_response_elements
       for details.


12. D-Bus service


   D-Bus activation makes sure the service is running whenever one of its
   methods is invoked by any consumer application. For instance, clicking
   OK on the Preferences box will call the service's GetStatistics method,
   thus ensuring the service is running.

   Starting with 0.63, the service is auto-started. The file that enables
   this is located at /etc/xdg/autostart/pinot-dbus-daemon.desktop.
   For older versions, if you want the service to be started with your X
   session, add pinot-dbus-daemon to your desktop environment's startup
   programs list.

   A few things to keep in mind :
     * the service doesn't reload its configuration while running, so any
       change (for instance, adding or removing an indexable directory) will
       take effect when the service is restarted.
     * when starting, the service will first crawl all configured locations
       and (re)index new and modified files. Starting with 0.64, the daemon's
       scheduling priority is set very low (15, can be adjusted with --priority)
       so that you can still play UT2004 while it is indexing !
     * when finished crawling, the service will monitor these locations
       for changes and should consume little resources, unless a huge
       quantity of files needs its attention.
     * any change detected by the monitor is immediately queued and acted upon
       as soon as possible, eg reindex a file that was modified. This and low
       priority should make the daemon not very intrusive.

  If you have dbus-send installed, you can start the daemon manually by calling
  the GetStatistics method with
  $ dbus-send --session --print-reply --type=method_call \
    --dest=de.berlios.Pinot /de/berlios/Pinot de.berlios.Pinot.GetStatistics
  and stop it by calling the Stop method with
  $ dbus-send --session --print-reply --type=method_call \
    --dest=de.berlios.Pinot /de/berlios/Pinot de.berlios.Pinot.Stop
  The daemon will also exit if killed.

  For a list of available D-Bus methods, see the file pinot-dbus-daemon.xml.


13. Deskbar Applet plugin


  Deskbar Applet provides an omnipresent versatile search interface for Gnome.
  Searching documents indexed by Pinot with the applet is possible. Users
  of RPM-based systems may install the pinot-desbar RPM. Otherwise, the
  file scripts/python/pinot-live.py can simply be copied into
  $PREFIX/lib/deskbar-applet/handlers (or, on 64 bit systems,
  $PREFIX/lib64/deskbar-applet/handlers) for use by all users, or into
  ~/.gnome2/deskbar-applet/handlers for use only by the current user.
  Once enabled, the plugin will rely on the D-Bus service to query the My
  Documents index.
  Deskbar Applet appears to sort results by names, so it may not present
  results in the same order as the Pinot UI.


14. How to reset indexes


  You may wish to reset one of the index and start from scratch. There
  are several ways to do this, depending on which index it is.

  If you want to reset My Web Pages, you can either :
    * use Pinot to unindex every single document by selecting them all
      and choosing Unindex in the Index menu
    * or delete ~/.pinot/index or ~/.pinot/index/* while Pinot isn't running

  If you want to reset My Documents, special considerations apply because
  of the historical data maintained by the daemon. First, make sure the
  daemon is stopped, then delete the index with
  $ rm ~/.pinot/daemon/*
  and remove historical data as follows
  $ sqlite3 ~/.pinot/history-daemon "delete from CrawlHistory; \
    delete from CrawlSources;"


15. Dependencies


  The version numbers indicate the minimum version Pinot has been tested
  with; older versions may or may not work.

---------------------------------------------------------------
Name							Version
---------------------------------------------------------------
SQLite							3.3.1
http://www.sqlite.org/

xapian-core						1.0.2
http://www.xapian.org/

curl (1)						7.13.1
http://curl.haxx.se/
- OR -
neon (1)						0.24.7
http://www.webdav.org/neon/

Google SOAP API (2)	Search/Google/googleapi		beta2

 gSOAP (2)						2.7.8c
 http://www.cs.fsu.edu/~engelen/soap.html

gtkmm							2.6.2
http://www.gtkmm.org/

libxml++						2.12.0
http://libxmlplusplus.sourceforge.net/

libtextcat						2.2
http://software.wise-guys.nl/libtextcat/

gmime							2.1.9
http://spruce.sourceforge.net/gmime

boost (3)						1.32.0
http://www.boost.org/

D-Bus with GLib bindings				0.61
http://www.freedesktop.org/wiki/Software/dbus

shared-mime-info					0.17
http://freedesktop.org/Software/shared-mime-info

desktop-file-utils					0.10
http://www.freedesktop.org/software/desktop-file-utils

unzip							5.52
http://www.info-zip.org/pub/infozip/UnZip.html

pdftotext						3.00
http://www.foolabs.com/xpdf/
http://poppler.freedesktop.org/

antiword						0.36.1
http://www.winfield.demon.nl/

unrtf							0.19.3
http://www.gnu.org/software/unrtf/unrtf.html

TagLib							1.4
http://ktown.kde.org/~wheeler/taglib/

openssh-askpass (4)					4.3
http://www.openssh.com/portable.html

---------------------------------------------------------------
Notes :
(1) enabled with "./configure --with-http=neon|curl"
(2) enabled with "./configure --with-soap=yes"
(3) for building only
(4) experimental - required only if _SSH_TUNNEL is set
---------------------------------------------------------------


16. FAQ


  * At startup, when listing an index or indexing documents, Pinot complains
    of an "index error"
    This is likely because a previous instance didn't exit properly and one
    (or more) index is still locked. Quit Pinot and look for a "db_lock" file
    in "~/.pinot/index". If it's there, delete it and restart Pinot.
    Starting with 0.60, indexes are created with the newer Flint back-end
    which is immune to this issue.

  * When compiling from source, you may run into linking problems related to
    libstdc++.la. Try patching xapian-config as explained in the post at
    http://lists.tartarus.org/pipermail/xapian-devel/2006-January/000293.html

  * If you experience segfaults at startup and you are on Fedora, chances are
    it's because of libxml++/libsigc++. See this Bugzilla entry :
    https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=178592
    The latest version seems to fix this issue.

  * If the daemon crashes seemingly randomly while indexing, it may be because
    SQLite wasn't built thread-safe. I have witnessed this mostly on dual CPU
    machines, but others are not immune. Try rebuilding SQLite by passing
    "--enable-threadsafe --enable-threads-override-locks" to configure.

  * If you have built Pinot from source, make sure you have done a "make install".
    Pinot will fail to start if it can't find stuff it needs, its icon for
    instance.

  * If "make install" fails with an error about Categories in pinot.desktop
    and you have desktop-file-utils 0.11, either downgrade to 0.10 or upgrade
    to 0.12 if possible, or edit the top-level Makefile and replace the line
     @desktop-file-install --vendor="" --dir=$(DESTDIR)$(datadir)/applications pinot.desktop
    with
     $(INSTALL_DATA) pinot.desktop $(DESTDIR)$(datadir)/applications/pinot.desktop
    and run "make install" again.

  * In stored queries, the directory filter doesn't work as expected if the
    directory name starts with a non alpha-numeric character. This has been
    fixed in Xapian 0.9.8.

  * On FreeBSD, threading issues may cause the daemon to crash unexpectedly.
    A fix is to add the following lines to /etc/libmap.conf (which may not exist) :
     [/usr/local/bin/pinot-dbus-daemon]
     libpthread.so.2         libc_r.so.6
     libpthread.so           libc_r.so

     [/usr/local/bin/pinot]
     libpthread.so.2         libc_r.so.6
     libpthread.so           libc_r.so
    Refer to the libmap.conf(5) man page for details.

  * Upgrading from Xapian 0.9 or older to 1.0 is handled automatically, but
    downgrading back to 0.9 requires resetting the indexes manually. 

