Modern Information Retrieval Appendix: Porter's algorithm |
Contents |
The rules in the Porter algorithm are separated into five distinct phases numbered from 1 to 5. They are applied to the words in the text starting from phase 1 and moving on to phase 5. Further, they are applied sequentially one after the other as commands in a program. Thus, in the immediately following, we specify the Porter algorithm in a pseudo programming language whose commands take the form of rules for suffix substitution (as above). This pseudo language adopts the following (semi-formal) conventions:
- a consonant variable is represented by the symbol C which is used to refer to any letter other than a,e,i,o,u and other than the letter y preceded by a consonant;Thus, the expression (C)* refers to a sequence of zero or more consonants while the expression ((V)*(C)*)* refers to a sequence of zero or more vowels followed by zero or more consonants which can appear zero or more times. It is important to distinguish the above from the sequence (V*C) which states that a sequence must be present and that this sequence necessarily starts with a vowel, followed by a subsequence of zero or more letters, and finished by a consonant. Finally, the command
- a vowel variable is represented by the symbol V which is used to refer to any letter which is not a consonant;
- a generic letter (consonant or vowel) is represented by the symbol L;
- the symbol 1#1 is used to refer to an empty string (i.e., one with no letters);
- combinations of C, V, and L are used to define patterns;
- the symbol * is used to refer to zero or more repetitions of a given pattern;
- the symbol + is used to refer to one or more repetitions of a given pattern;
- matched parenthesis are used to subordinate a sequence of variables to the operators * and +;
- a generic pattern is a combination of symbols, matched parenthesis, and the operators * and +;
- the substitution rules are treated as commands which are separated by a semicolon punctuation mark;
- the substitution rules are applied to the suffixes in the current word;
- a conditional if statement is expressed as ``if (pattern) rule'' and the rule is executed only if the pattern in the condition matches the current word;
- a line which starts with a % is treated as a comment;
- curled brackets are used to form compound commands;
- a ``select rule with longest suffix'' statement selects a single rule for execution among all the rules in a compound command. The rule selected is the one with the largest matching suffix.
if (*V*L) then ed 2#21#1states that the substitution of the suffix ed by nil (i.e., the removal of the suffix ed) only occurs if the current word contains a vowel and at least one additional letter.
The Porter algorithm is applied to each word in the text (simple formulation) and is given by the following procedure.
% Phase 1: Plurals and past participles.
select rule with longest suffix {sses 2#2 ss;select rule with longest suffix {
ies 2#2 i;
ss 2#2 ss;
s 2#2 1#1; }
if ( (C)*((V)+(C)+)+(V)*eed) then eed 2#2 ee;if (*V*y) then y 2#2 i;
if (*V*ed or *V*ing) then {select rule with longest suffix {}ed 2#2 1#1;select rule with longest suffix {
ing 2#2 1#1; }at 2#2 ate;}
bl 2#2 ble;
iz 2#2 ize;
if ((* C1C2) and ( C1 = C2) and ( 3#3 {l,s,z})) then C1C2 2#2 C1;
if (( (C)*((V)+(C)+)C1V1C2) and ( 4#4 {w,x,y})) then C1V1C2 2#2 C1V1C2e; }
if ( (C)*((V)+(C)+)+(V)*) then
select rule with longest suffix {ational 2#2 ate;if ( (C)*((V)+(C)+)+(V)*) then
tional 2#2 tion;
enci 2#2 ence;
anci 2#2 ance;
izer 2#2 ize;
abli 2#2 able;
alli 2#2 al;
entli 2#2 ent;
eli 2#2 e;
ousli 2#2 ous;
ization 2#2 ize;
ation 2#2 ate;
ator 2#2 ate;
alism 2#2 al;
iveness 2#2 ive;
fulness 2#2 ful;
ousness 2#2 ous;
aliti 2#2 al;
iviti 2#2 ive;
biliti 2#2 ble; }
select rule with longest suffix {icate 2#2 ic;if ( (C)*((V)+(C)+)((V)+(C)+)+(V)*) then
ative 2#2 1#1;
alize 2#2 al;
iciti 2#2 ic;
ical 2#2 ic;
ful 2#2 1#1;
ness 2#2 1#1; }
select rule with longest suffix {al 2#2 1#1;select rule with longest suffix {
ance 2#2 1#1;
ence 2#2 1#1;
er 2#2 1#1;
ic 2#2 1#1;
able 2#2 1#1;
ible 2#2 1#1;
ant 2#2 1#1;
ement 2#2 1#1;
ment 2#2 1#1;
ent 2#2 1#1;
ou 2#2 1#1;
ism 2#2 1#1;
ate 2#2 1#1;
iti 2#2 1#1;
ous 2#2 1#1;
ive 2#2 1#1;
ize 2#2 1#1;
if (*s or *t) then ion 2#2 1#1; }if ( (C)*((V)+(C)+)((V)+(C)+)+(V)*) then e 2#2 1#1;if ( (C)*((V)+(C)+)((V)+(C)+)+V*ll) then ll 2#2 l;
if (( (C)*((V)+(C)+)(V)*) and not (( *C1V1C2) and ( 4#4 {w,x,y}))) then e 2#2 nil; }