Text Extract

Summary: Context-sensitive search, grep, and extract text from multiple pages using search terms and regular expressions
Version: 2017-06-07
Prerequisites: PmWiki 2.2 beta
Status: new & experimental
Maintainer: HansB
Categories: Markup Searching Includes SystemTools Markup Expressions
Download: see Cookbook:TextExtract

Questions answered by this recipe

How can I do context-sensitive searches through multiple pages showing results within lines or paragraphs?
How can I show content from different pages if the content matches specific query terms?
How can I have a context-sensitive Search form?
How can I search pages using regex like grep?

Description

A markup expression for extracting text lines (or paragraphs) from multiple pages using regular expressions and wildcard pagename patterns, plus a markup directive for a search form.

Installation:

Download extract.php, copy it to the cookbook folder, and enable it in config.php with:

include_once("$FarmD/cookbook/extract.php");

Usage:

Markup syntax:

{(extract 'Text(Pattern)' PageName [PageName2] [PageName3] ... [keyword=value] [keyword=value] ...)}

Arguments:

  • Text(Pattern) - display lines containing the string Text or matching the regular expression TextPattern. This must be the first argument. A dot '.' will include all of the page text, since it symbolises any character in a regex pattern.
  • PageName - source pages from PageName or Group.PageName. Allowed are Wiki wildcards '*' and '?'. You can specify any number of pagenames, space-separated, and each could contain wiki wildcards. Note that wiki wildcard pagename patterns are not the same as regex patterns!
  • -PageName - a minus in front of a page name will exclude the page from the search. Wildcards are allowed.
  • PageName#section - the text from anchored #section will be taken as source.
  • PageName#sectA#sectB - the text from anchor #sectA to anchor #sectB will be taken as source.
  • Options:
    • group=GROUP - source pages from GROUP (same as pagelist group= option) (wiki wildcards and comma-separated listing allowed)
    • name=NAME - source pages NAME from any group (same as pagelist name= option) (wiki wildcards and comma-separated listing allowed)
    • lines=n - the text source is the first n lines of a page or page section.
    • lines=-n - the text source is the last n lines of a page or page section.
    • lines=n..m - the text source is the lines from line n to line m (including line m) of a page or page section.
    • lines=n.. - the text source is the lines from line n till end of a page or page section.
    • unit=page - the whole page text is shown (or the part of a source page specified by PageName#section or PageName#sectA#sectB).
    • unit=line - default: single text row (line) is shown.
    • unit=para - whole paragraph is shown (separated by empty lines or headings)
    • cut=PATTERN - do not display lines matching PATTERN
    • count=n - include only n pages in the output.
    • snip=PATTERN - do not display text matching PATTERN, remove it from the line
    • markup=cut - lines with directives will be ignored.
    • markup=on - directives will be active, but only if pattern is '.' or unit=page.
    • markup=code - default: lines including directives will be shown as source code.
    • highlight=off - do not use match highlighting.
    • highlight=bold - default: bold (emphasised) highlighting, unless textpattern is '.' (including whole page source text).
    • case=1 - do a case-sensitive search. The default is case=0 (case-insensitive search).
    • prefix=link - display page link above extract; prefix=STRING display STRING above extract
    • prefix=linkmod - display line with page link and 'modified by' link and modified time above extract
    • suffix=STRING - display STRING on line below text page extract
    • header=STRING - display STRING on first line.
    • header=count - display results counter on first line.
    • header=full - display extended result count on first line plus a footer to mark end.
    • footer=STRING - display STRING at the end as a footer.
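
As a sketch of how these arguments combine (the group name Main and the search term 'install' are placeholders, not part of the recipe), the following expression would list paragraphs mentioning 'install' from every page in group Main except its RecentChanges page, with a page link above each extract and a result count on the first line:

{(extract 'install' Main.* -Main.RecentChanges unit=para prefix=link header=count)}

A second hypothetical example takes only the first 20 lines of each page in group Main as the text source, limits the output to 5 pages, and turns off match highlighting:

{(extract 'install' group=Main lines=20 count=5 highlight=off)}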

Text(Pattern)

You can enter text strings or regex patterns. 'cat' will look for all occurrences of 'cat'. The default is a case-insensitive search, so any occurrence of 'Cat', 'CAT', 'cAt' etc. will also be returned. 'cat dog' will look for the string 'cat dog'. To look for matches of 'cat' or 'dog' use 'cat|dog'. To match the word 'cat' and not 'catastrophe' use the regex word boundary marker '\b': '\bcat\b'. Regex uses some characters as special control characters: the dot ., the star *, the question mark ?, the pipe |, the dollar $, and brackets. To use any of these as normal characters you need to escape them with a backslash in front.

The regex dot . character represents any character, so if you use a single dot as the text pattern the whole page content will be returned, as it matches everything.

To specifically exclude lines matching some text(pattern) put it into the cut= option. With the snip= option on the other hand you can prevent certain words or phrases being shown in any matching lines, but still get the line.
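
For illustration, the patterns described above might be used as follows. Main.* is a placeholder page list, and the search terms are made up for the example:

{(extract 'cat' Main.*)}
{(extract 'cat|dog' Main.*)}
{(extract '\bcat\b' Main.*)}
{(extract 'v2\.0' Main.*)}
{(extract 'cat' Main.* case=1 cut='dog' snip='TODO')}

The first expression matches 'cat' anywhere, including inside 'Catastrophe'. The second matches lines containing either 'cat' or 'dog'. The third matches only the whole word 'cat'. The fourth escapes the dot so that it is treated as a literal character, so 'v2.0' matches but 'v2x0' does not. The fifth does a case-sensitive search for 'cat', leaves out any line matching 'dog', and removes the text 'TODO' from the lines that remain.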

PageName source lists

Following the Text(Pattern) as the first argument, the second and following bare arguments (not keyword=value option parameters) are treated as page names. They can be given in the form PageName or Group.PageName and can include the wildcard characters star * and question mark ?, where ? represents any valid single character and * represents any string of valid characters. So PmWiki.* will be interpreted as all pages in group PmWiki, and *.RecentChanges means all RecentChanges pages. A page name with a minus - in front will be excluded from the pages to be searched; again, wiki wildcard characters are allowed. Note that the wildcard pagename pattern is not a regex pattern, and a dot here is just the separator between the Group and PageName components of a page name!

Comma-separated lists of page names can also be given.

Instead of using all of a page as the source for the text extract, one can specify an anchor defined page section as source with Group.PageName#anchor, or a section between two anchors with Group.PageName#anchor1#anchor2. With this section syntax you cannot use wiki wildcards.
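
A few hedged examples of page source lists follow. The page names Main.HomePage and Main.WikiSandbox and the anchors #intro and #summary are assumptions chosen for illustration:

{(extract 'skin' PmWiki.* -*.RecentChanges)}
{(extract 'skin' Main.HomePage,Main.WikiSandbox)}
{(extract '.' Main.HomePage#intro)}
{(extract '.' Main.HomePage#intro#summary lines=5)}

The first searches all pages in group PmWiki but excludes every RecentChanges page. The second uses a comma-separated list of two pages. The third returns the whole text of the section anchored at #intro, and the fourth returns the first 5 lines of the text between anchors #intro and #summary.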

Search form markup

Markup (:extract:) will produce a search form with a field for entering search terms or a regular expression and a field for entering a page name or a pagename pattern with wildcards. Results will be shown with markup (:extractresult:).

Default parameters for markup (:extract:)

  • size=30
  • button='Search'
  • searchlabel='Search for'
  • pageslabel='On pages'
  • caselabel='Case sensitive search'
  • prefix='link'
  • header='full'

Other optional parameters

  • page=PageName (including wildcards) - this will hide the pagename field of the form, and pass on 'PageName' as source page parameter.
  • page2=PageName (incl. wildcards) - a second page option, this will not hide the pagename field. Useful for instance to hardcode pages which should not be searched, by providing these with a minus sign: page2=-*.*RecentChanges

Comma-separated lists of page names can also be given.

  • All the other keyword=value options from the {(extract ....)} markup expression can be used.
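
As a sketch using only the parameters listed above (the values are arbitrary), this form hides the pagename field, searches the PmWiki group, shows whole paragraphs, and relabels the button:

(:extract page=PmWiki.* unit=para button='Find' size=40:)

(:extractresult:)

A second hypothetical form keeps the pagename field visible but always excludes RecentChanges pages, and shows a plain result count as the header:

(:extract page2=-*.*RecentChanges header=count:)

(:extractresult:)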

Examples:

Default Search Form

(:extract:)

(:extractresult:)

Search PmWiki Documentation

(:extract page=PmWiki.*:)

(:extractresult:)

Notes

Fox Context Sensitive Search Form

The following Fox form can be used as a general search / find / text extraction tool (requires fox.php v.2008-01-25 or later):
Make sure to allow the foxaction 'display' by setting $FoxPagePermissions['Group.PageName'] = 'display';
or by adding to Site.FoxConfig: Group.PageName: display;

(:foxmessages:)
>>frame width=25em<<
(:fox form  foxaction=display target={*$FullName}:)
(:input default request=1:)
(:foxtemplate "{$$(extract '{$$search}' {$$pages}
   cut='{$$cut}' snip='{$$snip}' case='{$$case}' prefix=link header=full)}":)
|| Search for:||(:input text search size=30:) ||
|| On pages:||(:input text pages size=30:) ||
|| Cut lines incl.:||(:input text cut '' size=30:) ||
|| Snip text:||(:input text snip '' size=30:) ||
|| Case sensitive:(:input checkbox case '1':) ||||
|| ||(:input submit post Enter:) (:input submit cancel Cancel:) ||
(:foxend form:)
>><<

(:foxdisplay:)

The form displays the output at the position of the (:foxdisplay:) markup.

If you want to write the output into a page (instead of the expression), and want the output cleaned up without having the (:spacer:) markup written, use this in the template:

{$$(cleanspacer (extract .....))}

and change the fox form markup to:

(:fox form target=YourTargetPage:)

PmWiki Search Form

It is possible to use TextExtract with a standard PmWiki search form, but searches are a lot (3 to 4 times) slower. This may be useful in situations where you need pagelist options that TextExtract does not supply.

Create a custom pagelist template in Site.LocalTemplates:

!!!#extract
[[#extract]]
{(extract dummy {=$FullName} prefix=link)}
[[#extractend]]

and use a searchbox form for instance like this:

Search the PmWiki Documentation

(:searchbox group=PmWiki fmt=#extract:)

(:searchresults:)

(:searchbox ...:) can take all standard pagelist parameters.
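
For example, assuming the #extract template above is in place, a search box could restrict the search to normal pages of the PmWiki group and sort the results by name. The parameters list=normal and order=name are standard pagelist options, chosen here only for illustration:

(:searchbox group=PmWiki list=normal order=name fmt=#extract:)

(:searchresults:)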

Release Notes

  • 2008-03-07: Added unit=para option to show whole paragraphs, separated by empty lines or headings.
  • 2008-02-12: Changed extractresult markup so output does not get wrapped in <p>..</p> tags
  • 2008-02-11: Added options group= name= for source pages (same as PageList directive). Improved handling of input from pagelist markup expression (PowerTools). Added option count= and prefix=linkmod.
  • 2008-01-31: Added markup=on option for processing markup directives when pattern is '.' or unit=page. Fixed wrong line handling when unit=page. Added cleanup of form input options. Added qualifying of relative links.
  • 2008-01-29: Added simple filter to suppress bad pattern input by disallowing input of single regex special characters. Added capability to receive input from PmWiki standard search form, with use of custom fmt template.
  • 2008-01-28: Added search form with markup (:extract:) and (:extractresult:). Optimised code. Improved handling of directives and highlighting. Removed timer since results were not very meaningful. Added default option arrays. Added capability to handle comma-separated pagename lists.
  • 2008-01-25a: Added error notice if no pages were found matching the PageName list. Changed full header to include number of pages searched.
  • 2008-01-25: Minor fixes to handling of parameters supplied.
  • 2008-01-24: Further improved highlighting. Added markup expressions to be rendered as source code rather than evaluated in output (same as directives). Improved vertical spacing for both nolinebreaks and linebreaks conditions, by adding custom (:spacer:) markup. Added markup expression {(cleanspacer ...)} as a wrapper for use in form templates to write output directly into a page, to remove the (:spacer:) markup.
  • 2008-01-23: Added handling of -PageName for page exclusion from source list. Added results counter and timer for option 'header'. Added case sensitive and insensitive search option. Improved handling of directives and of highlighting. Renamed 'out' to 'markup'.
  • 2008-01-22: Added 'highlight', 'unit' and 'out' options.
  • 2008-01-21b: Renamed script. Renamed expression to 'extract'. Renamed 'hide' option to 'snip'.
  • 2008-01-21a: Added suffix= option. Added handling of page section as source input. Added support for multiple PageNames, each can also have wiki wildcard characters, unless the pagename has a #section specified.
  • 2008-01-21: Enhanced lines= option. Changed fmt= to prefix=
  • 2008-01-20a: Added lines= parameter
  • 2008-01-20: Initial release

See Also

Contributors

Comments