- Indexing one file (or database record) from file or pipe
- joda -s database wordListFile [fileRef] [datestamp] [infobyte] [id]
- database means the full path of the database without extension
- wordListFile is the full path to a file containing the word list.
Use a dot "." if data transfer is done by an input pipe from the caller.
The word list has a simple structure: item[,infobyte], e.g. "gnu,1". The optional infobyte is additional information attached to the word, e.g. "2" to flag a word from the headline of the source, or "1" to flag a leader word. The infobyte values can be used for retrieval in a simple way (-q) or in a more sophisticated way (-x).
If joda is used with file references (stand alone), you may give joda some meta information about a text. The meta info is stored in the database, and joda shows it with the search result. To do so, start the meta line with *|*. You can give title information, a teaser text etc. Example: *|*TITLE=Hurricane over Hawaii|SUBTITLE=foo|350 Words. This makes it easy to create a listing with short info about each text without opening and parsing the text itself.
The internal format of the meta info is your choice; only the leading *|* at the beginning of the line matters. joda does not care about the content of a meta line, and words from meta lines are never indexed.
- fileRef is the full path of the file to be archived (if there is any). Use a dot "." for a database record instead of a file.
- datestamp is an optional parameter. If it is not set, the current date/time is used. Datestamps are only available if the parameter "useBigOccList" in the database config file is set to 10 or 12. Otherwise no datestamp is stored with the words, and the parameter is ignored in all methods (see "Extended features" above).
- id is required if joda is used without a file reference list (*.ref). Otherwise joda creates the id by itself (see "preliminary remarks"). In other words: either a fileRef or an id has to be set!
- infobyte is an optional bitmap which is OR-ed with the individual (and also optional) infobyte given with each single word. E.g. it is possible to flag each word individually with a title flag (e.g. "1") and all words of a file or record generally with a flag for a special source (e.g. "8"). The detailed documentation for this extended feature still has to be written, as remarked above under "Extended features".
Examples:
- joda -s myarchiv myfile_words.txt /home/test/myfile.html 01.12.2003
stores the word list from "myfile_words.txt" for "myfile.html" using the given datestamp (the default is the current date)
- joda -s myarchiv . /home/test/myfile.html <myfile_words.txt
the same, but using piped input from a text file instead of a temp file
- filterscript.pl /home/test/myfile.html|joda -s myarchiv . /home/test/myfile.html
the same but using a piped input from an external filter script. This is the most common case!
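The word list format described above can be sketched in a few lines of Python. This is only an illustration, not part of joda: the function names are hypothetical, the flag values 1 and 2 are the example values from the text, and it is an assumption that each entry sits on its own line and that the meta line comes first.

```python
# Illustrative flag values; the text uses "1" and "2" only as examples.
LEADER = 1
HEADLINE = 2


def meta_line(**fields):
    """Build a meta line; only the leading *|* is significant to joda."""
    return "*|*" + "|".join(f"{key}={value}" for key, value in fields.items())


def word_list(words, meta=None):
    """Return the word list text: optional meta line, then item[,infobyte] lines.

    `words` is an iterable of (word, infobyte) pairs; infobyte 0 means "none".
    Assumption: one entry per line, meta line placed first.
    """
    lines = [meta] if meta else []
    for word, infobyte in words:
        lines.append(f"{word},{infobyte}" if infobyte else word)
    return "\n".join(lines) + "\n"


def effective_infobyte(word_byte, file_byte):
    """Combine a per-word infobyte with the file-level infobyte by bitwise OR."""
    return word_byte | file_byte
```

The text returned by word_list() could then be piped into a call such as "joda -s myarchiv . /home/test/myfile.html". effective_infobyte(1, 8) gives 9, which corresponds to the title flag "1" OR-ed with the source flag "8" from the infobyte description.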
- Indexing multiple files from pipe:
- joda -p database {word list from many files via pipe}
- database means the full path of the database without extension
- the word list is not a parameter; it is piped into joda (like the third "-s" example above). An external filter program calls joda and pipes the words of multiple files into it. In the data stream, each file/record has to be introduced by a start mark: @f@FILENAME|DATESTAMP|INFOBYTE(s)|ID. At least filename and datestamp have to be set. If database records are used rather than files, use a dot as filename; in this case the ID is required. For information on infobytes and datestamps, see "-s" above.
With one call, joda can be fed with the words from hundreds or thousands of files. A Perl or Python script (or something similar) creates the word lists, separated by the start marks described above, and calls joda. This is the fastest way of indexing, because joda, as a binary executable, is very fast at archiving and retrieving, while scripts written in Perl or Python are more flexible for filtering purposes. A combination of both is therefore recommended.
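A caller script for "-p" might assemble the data stream roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the start mark is assumed to stand on its own line before the words of each file, and unused INFOBYTE and ID fields are assumed to be acceptable as empty fields.

```python
import os
import time


def start_mark(filename, datestamp, infobyte="", record_id=""):
    """Build the @f@FILENAME|DATESTAMP|INFOBYTE(s)|ID start mark.

    Assumption: unused infobyte/id fields may simply be left empty.
    """
    return f"@f@{filename}|{datestamp}|{infobyte}|{record_id}"


def mtime_datestamp(path):
    """Format a file's modification time as dd.mm.yyyy."""
    return time.strftime("%d.%m.%Y", time.localtime(os.path.getmtime(path)))


def build_stream(entries):
    """Join many word lists into one stream for `joda -p`.

    `entries` yields (filename, datestamp, words) tuples, where `words` is
    a list of already formatted item[,infobyte] strings.
    """
    lines = []
    for filename, datestamp, words in entries:
        lines.append(start_mark(filename, datestamp))
        lines.extend(words)
    return "\n".join(lines) + "\n"
```

The resulting string could be handed to joda in a single call, for example with subprocess.run(["joda", "-p", database], input=stream, text=True), so that only one joda process is started for all files.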
Alternatively you can use the next function, where joda scans the directory itself and calls a filter program for each file:
- Indexing multiple files from a directory tree:
- joda -a database filepath filepattern [datestamp]
- database means the full path of the database without extension
- filepath is the full path of the directory tree containing the file(s) to be archived
- filepattern is the pattern of the file(s) to be archived (e.g. *.html)
- datestamp has the format dd.mm.yyyy. In most cases this parameter is left empty, so that each file's own datestamp (mtime) is used.
- word list(s) are created by an external program. This program is defined in the database config file ("execProg") or is called "arcfilter" by default. The program may write to a pipe or create a temporary file for data transfer, depending on the value of "tmpFile=" in the database config file.
Examples:
- joda -a /var/archives/myarchiv /home/test/myfiles *.html 01.12.2003
joda recursively scans the given "filepath" and calls an external filter program for every file that matches the "filepattern". The filter can be written in Perl, Python or any other language. It has to return the word list as described above, via pipe or temporary file. Because external filters parse the data format of the source (e.g. plain text, HTML, XML etc.), joda stays flexible and independent of formats.
- joda -a /var/archives/myarchiv /home/test/myfiles *.html
Same as above, but each file's own timestamp is used as joda's datestamp
In most cases the "-p" function above is recommended, because it needs only one auxiliary system call in total (the joda call) instead of one call for every file to be archived. On the other hand, the "-a" call needs only a very simple filter script without any other functionality.
See above (-s) for the structure of the word list.
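A very simple filter script of the kind "-a" needs could look like the following sketch. It is not the "arcfilter" program itself. It assumes that the filter receives the file path as its first argument and writes the word list to standard output (pipe mode), that the source is UTF-8 HTML, and that infobyte 2 marks words from the title. The sketch does not skip script or style content.

```python
import re
import sys
from html.parser import HTMLParser


class WordCollector(HTMLParser):
    """Collect lower-cased words from HTML text, flagging <title> words."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        for word in re.findall(r"\w+", data.lower()):
            # Infobyte 2 for title words is an illustrative choice.
            self.words.append(f"{word},2" if self.in_title else word)


def html_to_word_list(html):
    """Return item[,infobyte] entries for all words found in the HTML text."""
    collector = WordCollector()
    collector.feed(html)
    collector.close()
    return collector.words


def main(path):
    """Print one word list entry per line for the given file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for entry in html_to_word_list(handle.read()):
            print(entry)


# Assumption: the path of the file to filter is the only argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Because all format knowledge sits in the filter, the same joda call can index other formats simply by swapping in a different filter.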
- Erase one file via filter program:
- joda -d database fileToDelete id {word list from file via filter prog}
- database means the full path of the database without extension.
- fileToDelete is the full path of the file whose index entries in joda should be erased. The file itself is not touched and, if desired, has to be deleted manually.
- id is the joda identity key which the file received during archiving. In practice, a master database or list has stored this information during the archiving process.
- word list(s) are created by an external program using a pipe or a temporary file (same procedure as described above under "-a")
Example:
- joda -d /var/archives/myarchiv /home/test/myfiles/fileX.html 4711
- Erase one file via word list:
- joda -e database wordList id {word list from file}
- database means the full path of the database without extension
- wordList is an LF-separated (\n = character #10) list of the words to be deleted. To delete a file entry completely, it has to be the same list that was used for archiving.
- id is the joda identity key which the file received during archiving. In practice, a master database or list has stored this information during the archiving process.
Example:
- joda -e /var/archives/myarchiv word1\nword2\nword3\n 4711
- Merge two databases into one or optimize a database:
- joda -m [-wordcheck] [-filecheck] [-verbose] database sourcedb [minimalDate]
- wordcheck checks every word from sourcedb against the stop word list
- filecheck checks whether every file still exists
- verbose shows the progress of the process
- database means the full path of the target database without extension
- sourcedb means the full path of the source database without extension
- minimalDate is the earliest datestamp to be merged. With "minimalDate" it is possible to skip old entries while merging. Only available for the bigger formats of the occurrence lists (see "Extended features" above).
Example:
- joda -m -verbose /var/archives/myarchiv_new /var/archives/myarchiv
Merge can be used to optimize a database. For this purpose it is sufficient to create a copy (or softlink) of the source database config file as the target config file and then to merge the existing database into the new (empty) one. The result is a perfectly optimized copy of the source database, because all equal words are now pooled (internally) in one cluster. This minimizes disk seeks and therefore improves retrieval speed and saves disk space. Strongly recommended, at least for completed databases (like archives).
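The two steps (copy the config file, then merge) can be scripted. The sketch below only prepares the call and does not run joda. The function name is hypothetical, and it is an assumption that the config file is named after the database with the extension ".config".

```python
import shutil


def merge_for_optimization(source_db, target_db, config_ext=".config"):
    """Copy the source config for the target and return the merge command.

    Assumption: the config file is "<database>.config"; pass another
    `config_ext` if your installation uses a different extension.
    """
    shutil.copyfile(source_db + config_ext, target_db + config_ext)
    return ["joda", "-m", "-verbose", target_db, source_db]
```

For the example above, merge_for_optimization("/var/archives/myarchiv", "/var/archives/myarchiv_new") returns the argument list of the "joda -m -verbose" call shown in the example, which can then be passed to subprocess.run().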
- Optimize a database:
- joda -opt [-wordcheck] [-filecheck] [-verbose] database [minimalDate]
Example:
- joda -opt -verbose /var/archives/myarchiv
This is just an abbreviation for "merge". After processing, the optimized database has the same name as the original one. During processing a temporary file name is used.
- Retrieving functions:
- joda -q database query [from] [until] [fFilter] [maxHits] [sortOrder] [info]
joda -qu database query [from] [until] [fFilter] [maxHits] [sortOrder] [info]
- database means the full path of the database without extension
- query is a word or a combination of words. Parentheses and the logical operators AND, OR, NOT and NEAR can be used. The query can be sharpened with word distance values after AND and NOT, e.g. AND.4 means that hits are only valid if the words are at most four words apart. Parentheses should be used to clarify the context. Otherwise the search is evaluated sequentially and the result may not be satisfying. Queries are always case insensitive.
- from: the earliest date the result should be from (give a dot "." for empty value)
- until: the latest date the result should be from (give a dot "." for empty value)
- fFilter is only available if joda uses file references (master mode, see "preliminary remarks" above). In this case, parts of the filename of an indexed file can be used as a filter while retrieving. Regular expressions can be used (see example below).
- bFilter (shown as [info] in the -q synopsis above) is the bit filter in the range from 1 to 255. If it is set, the infobyte of the retrieved words has to match the value exactly. E.g. if you set the value "1" for all words in an article's headline, you can retrieve those by using bFilter=1. For a more sophisticated way of using bit filters, see "-x" below.
- maxHits is the maximum number of hits. Because joda sorts the results internally, the user will see the most recent "maxHits" hits.
- sortOrder can be set to 1, 2 or 3. "1" sorts by datestamp first and by weight second. "2" sorts by weight first (and by time second) and is the default if no value (or 0) is given. "3" sorts by the info byte first and by weight second.
With -q(u) joda retrieves all files/records which match the given query. The letter "u" means that the query is encoded in UTF-8. Otherwise the query has to be in the same ISO-8859-XX encoding as the archived words. If "u" is used, the value "charsettable" in the database config file has to be set to one of the "8859_X.iso" translation tables belonging to the joda distribution.
The first two lines of joda's response are always:
- {Query in short form}
- Number of hits
They are followed by a list of hits, separated by \n.
The format of the hits depends on the value of "fileref" in the database configuration file. Working with filenames (a *.ref file exists), joda answers:
- filename · title · id · datestamp · weight · infobyte
where "title" comes from the meta information if a file reference list is used (see "-s" above).
Without file ref joda will answer per hit:
- id · datestamp · weight · infobyte ·
Important: the " · " characters shown above are TABs in reality, i.e. \t (character #9)!
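A caller can split this response with a few lines of Python. The sketch below is an illustration only: the function name is hypothetical, and it assumes that the second line contains nothing but the hit count as an integer. All field values are kept as strings.

```python
def parse_response(text, with_fileref):
    """Split joda's -q output into (query, hit_count, hits).

    `with_fileref` selects the hit layout: True if a *.ref file exists
    (filename, title, id, ...), False for the id-only layout.
    Assumption: the second line is a bare integer.
    """
    lines = text.split("\n")
    query, hit_count = lines[0], int(lines[1])
    if with_fileref:
        names = ["filename", "title", "id", "datestamp", "weight", "infobyte"]
    else:
        names = ["id", "datestamp", "weight", "infobyte"]
    hits = []
    for line in lines[2:]:
        if not line.strip():
            continue  # skip empty lines, e.g. after the final \n
        # Fields are TAB-separated; zip() ignores a trailing empty field.
        hits.append(dict(zip(names, line.split("\t"))))
    return query, hit_count, hits
```

Each hit becomes a dictionary keyed by the field names listed above, so a listing with title and datestamp can be produced without opening the indexed files.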
Query examples:
- joda -q database Bill
finds all files/records with "Bill"
- joda -q database Bill*
finds all occurrences of "Bill", "Billy", "Billard" etc. "*" is a wildcard which can only be set at the tail of a word.
- joda -q database "Bill Clinton"
finds all files/records which contain both Bill and Clinton. Same as "Bill AND Clinton" or "Bill&Clinton"
- joda -q database "'Bill Clinton'"
is equal to "Bill AND.1 Clinton" and "Bill&.1Clinton". The word distance in the file must be one word.
- joda -q database "(Bill or Hillary) and.1 Clinton"
finds all "Bill Clinton" and "Hillary Clinton" but not "Christopher Clinton"
- joda -q database "((Bill or Hillary) and.1 Clinton) not Bush"
finds all "Bill Clinton" and "Hillary Clinton", but the word Bush must not occur in the article.
- joda -q database "Barney NEAR Flintstone*"
finds all Barneys and Flintstone(s) if the distance between both words is less than or equal to 50 words. Equal to AND.50. Used to sharpen the query.
- joda -q database "Barney NEAR Flintstone*" 01.01.2002 31.12.2003
The text with Barney and the Flintstones has to be last modified (or stored) between 01.01.2002 and 31.12.2003. Only available if joda handles datestamps (see above). Especially designed for running joda as a stand-alone database, but it can make sense in slave mode too.
- joda -q database "Barney NEAR Flintstone*" . . /home/comics
The filename has to match "/home/comics". If it does not, joda skips the hit.
The two dots mean "no start date" and "no end date" given.
- joda -q database "Barney NEAR Flintstone*" . . REGEX=^/var/www/doc/index.html?$
The filename has to match the regular expression.
- joda -q database "Barney NEAR Flintstone*" . . . 50
A maximum of 50 hits will be shown. The default is 100.
- joda -q database "Barney NEAR Flintstone*" . . . . 1
Sorting order is "by time" instead of "by weight".
- joda -q database "Barney NEAR Flintstone*" . . . . . 1
Both Barney and Flintstones have to be flagged with info bit value 1
(i.e. must be headline words). In a -q query, a bFilter value of zero means "undefined" and matches all stored values.
- joda -q database "Barney NEAR /stone/" . . . . . 1
is nearly the same, using a regular expression as search item
- joda -q database:database2 Bill
finds all files/records with "Bill" in the database named "database2", which has no config file of its own and uses "database.config" (a so-called "clone config")
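Since the optional parameters are positional and skipped ones are written as a dot, a caller may want to generate the argument list instead of counting dots by hand. The following helper is a hypothetical sketch that follows the parameter order of the -q synopsis and the examples above.

```python
def query_args(database, query, from_date=None, until=None, f_filter=None,
               max_hits=None, sort_order=None, info=None, utf8=False):
    """Build the argument list for `joda -q` / `joda -qu`."""
    optional = [from_date, until, f_filter, max_hits, sort_order, info]
    # Drop unset parameters at the tail, then fill inner gaps with ".".
    while optional and optional[-1] is None:
        optional.pop()
    filled = ["." if value is None else str(value) for value in optional]
    return ["joda", "-qu" if utf8 else "-q", database, query] + filled
```

query_args("database", "Barney NEAR Flintstone*", max_hits=50) yields the same arguments as the ". . . 50" example above. Passing the list to subprocess.run() bypasses the shell, so the query needs no extra quoting and "*" is not expanded.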
-
joda -x database query [from] [until] [fFilter] [bFilter] [maxHits] [sortOrder]
joda -xu database query [from] [until] [fFilter] [bFilter] [maxHits] [sortOrder]
The only difference between queries using "-q" and those using "-x" is the handling of the bit filter. While in -q it is a single value which has to match exactly, its use in -x is more sophisticated.
[lydon documentation starts here (to be written)]