Titan CMS online support, find help and get answers

Advanced dtSearch Configuration

  • Published: October 08, 2008
  • Updated: October 13, 2008
  • Version: 5
 Summary

This article describes how to modify the Advanced Configurations XML for search indexes in Titan CMS installations that leverage dtSearch as the primary site search platform, in an effort to improve results retrieval and in-context synopsis content.

 Background

A common goal for many site administrators and marketing professionals is rich, accurate search results for their site content. As a core feature, Titan CMS can integrate dtSearch as the primary search platform for a site. dtSearch is a full-text search engine used both for crawling websites, documents, and databases and for retrieving and presenting search results.

One of the key benefits of dtSearch is its wide breadth of programmatically configurable settings, which make it possible to fine-tune the tool's indexing and searching behavior for a number of situations. Titan CMS exposes these settings as XML data controlled through the Titan Administration module.

 Details

The sections that follow define the advanced settings that can be configured for dtSearch within Titan CMS. In most cases, the naming convention for these attributes follows the naming in the dtSearch API. Note that dtSearch also supports attributes that are not exposed through the Titan CMS configuration.

To modify these configuration options, work with the Raw dtSearch HTML Config text area. We recommend saving the current value first so you can restore it if something goes wrong.

Searching & Indexing Options

These are the basic options supported by dtSearch that apply to both the indexing and the search process. Titan CMS supplies default values for them, but you can override the defaults by specifying different values in the advanced configuration XML. Examine the sample configuration help text to see where these values should be defined.

AlphabetFile Name of the dtSearch alphabet file to use when parsing text into words.
FuzzyChar Character that enables fuzzy searching for a search term (default: "%").
Hyphens Controls the treatment of hyphens. See the Hyphen Settings options below.
IndexNumbers If false, any word that begins with a digit will not be indexed.
MatchDigitChar Wildcard character that matches a single digit (default: "=").
MaxStoredFieldSize Maximum size of a single stored field. Stored fields are field data collected during indexing and returned in search results.
MaxWordLength Words longer than MaxWordLength are truncated when indexing. The default is 32; the maximum is 128.
NoiseWordFile List of noise words to skip during indexing (default: "noise.dat").
PhonicChar Character that enables phonic searching for a search term (default: "#").
StemmingChar Character that enables stemming for a search term (default: "~").
StemmingRulesFile Stemming rules for stemming searches (default: "stemming.dat").
TextFieldsFile Name of the file containing rules for extracting field data from text files based on markers in the text.
XmlIgnoreTags Comma-separated list of tags to ignore when indexing XML.
FieldFlags Flags that control indexing of metadata. See the Field Flags options below.
IndexingFlags Flags that control the indexing job. See the Indexing Flags options below.
TextFlags Flags that control text-processing options. See the Text Flags options below.
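
As a rough illustration of how such overrides might look, the fragment below sets three of the options listed above. It is a sketch, not the confirmed Titan CMS schema: it assumes each option is written as an element whose text content is the value, and it omits the enclosing element, whose name this article does not give. The values themselves are illustrative. Check the sample configuration help text for the actual placement.

```xml
<!-- Assumed form: one element per option, value as text content. -->
<!-- Do not index words that begin with a digit. -->
<IndexNumbers>false</IndexNumbers>
<!-- Truncate indexed words at 64 characters (default 32, maximum 128). -->
<MaxWordLength>64</MaxWordLength>
<!-- Hypothetical file name for a custom noise word list (default: noise.dat). -->
<NoiseWordFile>custom-noise.dat</NoiseWordFile>
```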

 

Hyphen Settings

These are the options for the Hyphens node. The behavior of each is described below.

Ignore index "first-class" as "firstclass"
Hyphen index "first-class" as "first-class"
Space index "first-class" as "first" and "class"
All index "first-class" all three ways
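
For example, a site whose content mixes forms such as "first-class" and "first class" might prefer the Space treatment. The fragment below is a minimal sketch that assumes the option name is supplied as the text content of the Hyphens node; confirm the exact form against the sample configuration help text.

```xml
<!-- Assumed form: index "first-class" as the two words "first" and "class". -->
<Hyphens>Space</Hyphens>
```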

 

Field Flags

These are the Field Flags options that can be entered. Each item should be in a <FieldFlag> element inside the <FieldFlags> node.

dtsoFfSkipFilenameField  Do not generate a field named Filename containing the name of the file.
dtsoFfSkipDocumentProperties  Do not index or search document summary fields.
dtsoFfHtmlShowLinks  Make HTML links searchable.
dtsoFfHtmlShowImgSrc  Make HTML IMG src= attribute searchable.
dtsoFfHtmlShowComments  Make HTML comments searchable.
dtsoFfHtmlShowScripts  Make HTML scripts searchable.
dtsoFfHtmlShowStylesheets  Make HTML style sheets searchable.
dtsoFfHtmlShowMetatags  Make HTML meta tags searchable and visible, appended to the body of the HTML file.
dtsoFfHtmlNoHeaderFields  Suppress automatic generation of the HtmlTitle field for the title and the HtmlH1, HtmlH2, etc. fields for header content in HTML files.
dtsoFfOfficeSkipHiddenContent  Skip non-text streams in Office (Word, Excel, PowerPoint) documents.
dtsoFfXmlHideFieldNames  Do not index field names in XML files.
dtsoFfShowNtfsProperties  Make NTFS file properties searchable.
dtsoFfXmlSkipAttributes  Do not index attributes in XML files.
dtsoFfSkipFilenameFieldPath  Include only the filename (not the path) in the Filename field generated at the end of each document.
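
The structure described above might look like the following sketch, which enables two of the listed flags. It assumes each flag name is given as the text content of its <FieldFlag> element, which this article does not state explicitly.

```xml
<FieldFlags>
  <!-- Make HTML meta tags searchable. -->
  <FieldFlag>dtsoFfHtmlShowMetatags</FieldFlag>
  <!-- Keep only the file name, not the path, in the Filename field. -->
  <FieldFlag>dtsoFfSkipFilenameFieldPath</FieldFlag>
</FieldFlags>
```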

 

Indexing Flags

These are the Indexing Flags options that can be entered. Each item should be in an <IndexingFlag> element inside the <IndexingFlags> node.

dtsAlwaysAdd  Index every document specified in the IndexJob, even if the document is already in the index with the same modification date and size.
dtsIndexCreateCaseSensitive  Create a case-sensitive index. Index will treat words with different capitalization as different words. (apple and Apple would be two different words.)
dtsIndexCreateAccentSensitive  Create an accent-sensitive index.
dtsIndexCreateRelativePaths  Use relative rather than absolute paths in storing document locations.
dtsIndexResumeUpdate  Resume an earlier index update that did not complete. (Version 7 indexes only.)
dtsIndexCacheText  Compress and store the text of documents in the index, for use in generating Search Reports and highlighting hits. (Version 7 indexes only.)
dtsIndexCacheOriginalFile  Compress and store documents in the index, for use in generating Search Reports and highlighting hits. (Version 7 indexes only.)
dtsIndexCacheTextWithoutFields  When text caching is enabled, do not cache any fields that were provided through the data source API (in DocFields).
dtsIndexKeepExistingDocIds  Preserve existing document ids following a compression of an index or a merge of two or more indexes (this flag is ignored during merges if the indexes being merged have overlapping ranges of document ids).
dtsIndexCreateVersion7  Create an index using the version 7 index format. Version 7 indexes are created by default in versions after 7.0, so this flag is no longer needed.
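
As a sketch of the structure described above, the fragment below stores document text in the index for hit highlighting and uses relative paths. It assumes each flag name is the text content of its <IndexingFlag> element; verify this against the sample configuration help text.

```xml
<IndexingFlags>
  <!-- Compress and store document text for Search Reports and hit highlighting. -->
  <IndexingFlag>dtsIndexCacheText</IndexingFlag>
  <!-- Store document locations as relative rather than absolute paths. -->
  <IndexingFlag>dtsIndexCreateRelativePaths</IndexingFlag>
</IndexingFlags>
```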

 

Text Flags

These are the Text Flags options that can be entered. Each item should be in a <TextFlag> element inside the <TextFlags> node.

dtsoTfSkipNumericValues By default, dtSearch indexes numbers both as text and as numeric values, which is necessary for numeric range searching. Use this flag to suppress indexing of numeric values in applications that do not require numeric range searching. This setting can reduce the size of the index by about 20%.
dtsoTfSkipXFirstAndLast Suppress automatic generation of xfirstword and xlastword. By default, xfirstword is defined to be the first word in each document, and xlastword is defined to be the last word in each document. These words are generated when an index is created, so this flag must be set during indexing to suppress xlastword and xfirstword.
dtsoTfRecognizeDates Automatically recognize dates in text as it is indexed.
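
A site that does not need numeric range searching could, for instance, combine the first and last flags above. This is a sketch that assumes each flag name is the text content of its <TextFlag> element.

```xml
<TextFlags>
  <!-- Skip numeric-value indexing when numeric range searching is not needed. -->
  <TextFlag>dtsoTfSkipNumericValues</TextFlag>
  <!-- Recognize dates in text as it is indexed. -->
  <TextFlag>dtsoTfRecognizeDates</TextFlag>
</TextFlags>
```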

 

Search Settings

These are the basic options supported by dtSearch that apply only to the search process. Titan CMS supplies default values for them, but you can override the defaults by specifying different values in the advanced configuration XML. Examine the sample configuration help text to see where these values should be defined.

TimeoutSeconds Set to a non-zero value to force the search to halt after the specified number of seconds.
AutoStopLimit Make the search stop automatically once this many documents have been found.
MaxFilesToRetreive Limit the maximum size of search results to a specified number of files.
SearchStemming Enable stemming for all words in the search request.
SearchAutoTermWeight Apply automatic term weighting to each term in the request.
SearchPositionalScoring Rank documents higher when hits are closer to the top of the document and when hits are located close to each other within a document. This improves relevancy ranking for "all words" and "any words" searches.
SynopsisEnabled A Boolean value indicating whether or not to produce the synopsis report.
SearchPhonic Enable phonic searching for all words in the search request.
SearchFuzziness If non-zero, the engine will match words that are close to but not identical to a search term.
SearchTypeAnyWords Find any of the words in the search request.
SearchTypeAllWords Find all of the words in the search request.
SynopsisHeader Text to appear at the top of the report.
SynopsisFooter Text to appear after the end of the report.
SynopsisNumberContextBlocks Number of blocks of context to include in the report for each document.
SynopsisMaxWordsToRead Number of words to scan in each document looking for blocks of context to include in the report.
SynopsisWordsOfContext Number of words of context to include around each hit.
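
To illustrate how the synopsis settings above work together, the fragment below caps search time and shapes the in-context synopsis. It is a sketch only: it assumes each setting is an element with the value as text content, the enclosing element is omitted because this article does not name it, and the numbers are illustrative rather than recommended values.

```xml
<!-- Assumed form: one element per setting, value as text content. -->
<!-- Halt any search that runs longer than 10 seconds. -->
<TimeoutSeconds>10</TimeoutSeconds>
<!-- Produce the synopsis report, with 2 blocks of context per document. -->
<SynopsisEnabled>true</SynopsisEnabled>
<SynopsisNumberContextBlocks>2</SynopsisNumberContextBlocks>
<!-- Include 10 words of context around each hit. -->
<SynopsisWordsOfContext>10</SynopsisWordsOfContext>
```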

 

 Related Documents

Knowledge Base Article: Improving dtSearch Results

 References

dtSearch – Full Text Search Engine
http://www.dtsearch.com

How to prevent filenames and document properties from appearing in documents
http://support.dtsearch.com/dts0173.htm

dtSearch Text Retrieval Engine -- .NET 2.0 API
http://support.dtsearch.com/webhelp/dtSearchNetApi2/frames.html