A full and working Python implementation of sed
Usage as a command line utility
pythonsed is a full and working Python implementation of sed. Its reference is GNU sed 4.2, of which it implements almost all commands and features. It may be used as a command line utility or as a module to bring sed functionality to Python scripts.
A complete set of tests is available as well as a testing utility. These tests include scripts from various origins and cover all aspects of sed functionalities.
Version | Status |
---|---|
Python 3.7 | Fully compatible except s///g with zero length matches. See this question at stackoverflow |
Python 3 | Fully compatible |
Python 2.7.4 and above | Fully compatible |
Python 2.7 to Python 2.7.3 | Fully compatible except regexps of the form ((.*)*). This causes one of the scripts from the Chang suite to fail. |
Python 2.6 | Fully compatible except regexps of the form ((.*)*). argparse module must be installed. |
Python 2.5 and below | Not tested |
Compatibility status also applies to the testing utility test-suite.py.
PythonSed is released under the MIT license.
To install, clone the repository or download and extract its zip file, then run the setup from the download directory:
pip install .
This installs a command line utility named pythonsed and a package named PythonSed.
pythonsed is a console program that receives its parameters from the command line. The format of the command line is:
pythonsed [options] -e<script expression> <input text file>
pythonsed [options] -f<script file> <input text file>
Note that pythonsed accepts only one script file or expression, and only one input file. options may be one or both of:
- -n: disable automatic printing
- -r: use extended regular expressions
pythonsed may also use redirection to receive its input or send its output, with the usual syntax:
cat myfile | pythonsed -f myscript1.sed | pythonsed -f myscript2.sed > myresultfile
It is also possible for pythonsed to receive its input from the keyboard by omitting the input file:
pythonsed -f myscript.sed
An example covering all necessary symbols:
from PythonSed import Sed, SedException

sed = Sed()
try:
    sed.no_autoprint = True        # disable automatic printing
    sed.regexp_extended = False    # keep basic regular expressions
    sed.load_script('myscript.sed')
    sed.apply('myinput.txt')
except SedException as e:
    # errors reported by PythonSed
    print(e.message)
except:
    raise
Note that sed.apply() returns the list of lines printed by the script. By default, these lines are also printed to stdout. sed.apply() has an output parameter that can suppress printing (output=None) or redirect the output to a text file (output=somefile.txt).
The script may also be read from a string by using sed.load_string(my_script_string).
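As an illustration of the calls described above, the following sketch wraps them in a small helper that runs a script held in a string and returns the printed lines without echoing them to stdout. The helper name run_script_quietly and its parameters are hypothetical; only Sed, SedException, regexp_extended, load_string, apply and its output parameter come from the description above.

```python
def run_script_quietly(script_text, input_path, extended=False):
    """Apply a sed script held in a string to a file and return the printed lines.

    Hypothetical helper: it only relies on the calls described above.
    """
    # Imported here so that the helper can be defined even where the
    # PythonSed package is not installed yet.
    from PythonSed import Sed, SedException

    sed = Sed()
    sed.regexp_extended = extended   # extended regular expressions on or off
    try:
        sed.load_string(script_text)
        # output=None inhibits printing; the lines are only returned
        return sed.apply(input_path, output=None)
    except SedException as e:
        print(e.message)
        return None
```

A call such as run_script_quietly('s/an/AN/g', 'myinput.txt') would then return the transformed lines as a list, or None when the script is rejected.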
PythonSed implements all standard commands and regular expression features of sed. Its reference is GNU sed 4.2, of which it implements almost all features except the most specific ones.
The GNU sed manual page can serve as a reference for PythonSed, subject to the differences described below.
Address | Status |
---|---|
number | standard behavior |
$ | standard behavior |
/regexp/ | standard behavior |
/regexp/I | implemented |
\%regexp% | standard behavior |
address,address | standard behavior |
address! | standard behavior |
0,/regexp/ | not implemented |
first~step | not implemented |
addr1,+N | not implemented |
addr1,~N | not implemented |

Regular expression | Status |
---|---|
char | standard behavior |
* | standard behavior |
\+ | standard behavior |
\? | standard behavior |
\{i\} \{i,j\} \{i,\} | standard behavior |
\(regexp\) | standard behavior |
. | standard behavior |
^ | standard behavior. When not at start of regexp, matches as itself |
$ | standard behavior. When not at end of regexp, matches as itself |
[list] [^list] | standard behavior. [.ch.], [=a=], [:space:] are not implemented |
regexp1\|regexp2 | standard behavior |
regexp1regexp2 | standard behavior |
\digit | standard behavior (back reference) |
\n \t | standard behavior (extensions \s\S etc. are not handled) |
\char | standard behavior (disable special regexp characters) |
Note that consecutive quantifiers (any combination of *, +, ?, {}) or a quantifier at the start of a regexp raise an error. This is true in both basic and extended regular expression modes.
Using the -r switch simplifies regular expressions by removing the backslash before the special characters +, ?, (, ), |, { and }. If these characters must appear as regular characters in a regexp, they must be escaped with a backslash.
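The relation between the two modes can be sketched as a small conversion function that swaps the escaping of these seven characters. This is only an illustration, not code from PythonSed: the name basic_to_extended is hypothetical, and the sketch assumes the pattern contains no bracket expressions (inside [list], these characters are ordinary and would need separate handling).

```python
# Characters whose escaping is inverted between basic and extended modes
SWAPPED = set("+?(){}|")


def basic_to_extended(pattern):
    """Rewrite a basic regexp as the equivalent extended regexp (sketch).

    Assumption: the pattern contains no bracket expression such as [+?].
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # \+ \? \( \) \{ \} \| lose their backslash; other escapes are kept
            out.append(nxt if nxt in SWAPPED else ch + nxt)
            i += 2
        elif ch in SWAPPED:
            # A bare special character is literal in basic mode: escape it
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(basic_to_extended(r"\(ab\)\+c\?"))   # prints (ab)+c?
```

With this mapping, the basic regexp a+b (a literal plus sign) becomes a\+b in extended mode.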
Command | Status | Notes |
---|---|---|
a\ | Compliant | (including one liner syntax and double address extensions) |
b label | Compliant | |
: label | Compliant | |
c\ | Compliant | (including single line and double address extensions) |
d | Compliant | |
D | Compliant | |
= | Compliant | (including double address extension) |
g | Compliant | |
G | Compliant | |
h | Compliant | |
H | Compliant | |
i\ | Compliant | (including single line and double address extensions) |
l | Compliant | (length parameter not implemented) |
n | Compliant | |
N | Compliant | |
p | Compliant | |
P | Compliant | |
q | Compliant | (except exit code extension) |
r filename | Compliant | (including double address extension but not reading from stdin) |
s | Compliant | (except escape sequences in replacement (\L, \l, \U, \u, \E), modifiers e and M/m, and combination of modifier g and number) |
t label | Compliant | |
w filename | Compliant | (including double address extension but not writing to stdout or stderr) |
x | Compliant | |
y | Compliant | |
# | Compliant | (comments start anywhere in the line.) |
Compliant means compliant with GNU sed description.
The other commands specific to GNU sed are not implemented.
The behavior of PythonSed is tested and compared to that of GNU sed with a set of tests and a testing utility.
The tests are either coded in text files with .suite extension or may be stored in test directories as standard sed scripts.
The test suites are:
Test suite | Description |
---|---|
unit.suite | a text file containing unit tests |
chang.suite | a text file containing scripts from Roger Chang's web site |
test-suite1 | a set of scripts from the GNU sed test suite |
test-suite2 | a set of scripts from the seder's grab-bag, the Rosetta Code web site and GitHub (lisp!) |
test-suite3 | additional unit tests better stored in a folder with some extra data text files |
test-suite4 | a set of scripts from the sed $HOME |
Tests are launched and checked with the test-suite.py Python script. This script uses either the PythonSed package or any sed executable to run the sed scripts. This makes it possible to compare the behavior of PythonSed with that of GNU sed.
The calling syntax is:
test-suite.py <testsuite> [number] [-b executable] [-x list of script references]
Parameter | Description |
---|---|
testsuite | either a text file with .suite extension or a test directory |
number | an optional reference number of a test; when present, only this test is run |
executable | an optional name or path of a sed executable to use for testing |
list of script references | an optional list of tests to exclude, for instance when a feature is not implemented. A script reference is either the title of the test, for tests stored in modules, or the name of the script file. |
When tests are stored in a text file (with .suite extension), they are made of four elements:
- the title of the test
- the script itself
- the input list of lines
- the expected result
The four elements of a test are separated with lines made of three identical characters, for instance:
---
Test substitution with global flag
---
s/an/AN/g
---
In Xanadu did Kubhla Khan
---
In XANadu did Kubhla KhAN
---
Note also that:
- The script section may be empty, which allows testing a script on various data without repeating the script.
- The input and output sections may be empty, which allows testing various scripts on the same data without repeating the data.
- Flags are set with a comment on the first line. As usual, #n stops autoprint mode and extended regexp mode is set with #r or #nr.
- The expected result may be ??? when the test has no result and ends with an error.
- All text outside the test, i.e. before first delimiter or after last delimiter, is ignored and acts like a comment.
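The layout described above can be read with a few lines of Python. The sketch below is not the code of test-suite.py: the function names are hypothetical, it extracts only the first test of a text, and it assumes that the delimiter used inside one test is the first three-character line encountered.

```python
def is_delimiter(line):
    """True for a line made of exactly three identical, non-blank characters."""
    return len(line) == 3 and len(set(line)) == 1 and not line.isspace()


def parse_first_test(text):
    """Extract the first test found in the text of a .suite file (sketch)."""
    lines = text.splitlines()
    # Everything before the first delimiter acts like a comment
    start = next((i for i, line in enumerate(lines) if is_delimiter(line)), None)
    if start is None:
        return None
    delimiter = lines[start]
    sections, current = [], []
    for line in lines[start + 1:]:
        if line == delimiter:
            sections.append(current)
            current = []
            if len(sections) == 4:      # title, script, input, expected result
                break
        else:
            current.append(line)
    if len(sections) < 4:
        return None
    title, script, input_lines, expected = sections
    first = script[0] if script else ""
    return {
        "title": " ".join(title),
        "script": script,
        "input": input_lines,
        "expected": expected,
        "no_autoprint": first in ("#n", "#nr"),   # flags on the first script line
        "extended": first in ("#r", "#nr"),
        "expects_error": expected == ["???"],
    }
```

Because only lines equal to the first delimiter split the sections, an expected result of ??? is kept as content even though it is also made of three identical characters.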
When tests are stored in a directory, they are represented by three or four files with same name but different extensions:
- the script itself, with '.sed' extension
- the input of the script, with '.inp' extension
- the expected result of the script, with '.good' extension
- possibly a file, with '.flags' extension, containing the sed switches -n and/or -r.
Some other files may be used when using reading or writing commands in scripts. In that case, the expected written files must be named with extension '.wgoodN' where N is the number of the expected written file.
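Grouping the files of such a directory into tests might look like the sketch below. It is only an illustration under stated assumptions: collect_directory_tests is a hypothetical name, each test is identified by its '.sed' file, the '.flags' file is assumed to hold the switches separated by whitespace, and the '.wgoodN' files are left out.

```python
from pathlib import Path


def collect_directory_tests(directory):
    """Group the files of a test directory by script name (sketch)."""
    tests = []
    for script in sorted(Path(directory).glob("*.sed")):
        flags_file = script.with_suffix(".flags")
        tests.append({
            "name": script.stem,
            "script": script,
            "input": script.with_suffix(".inp"),
            "expected": script.with_suffix(".good"),
            # The .flags file is optional; it holds -n and/or -r
            "flags": flags_file.read_text().split() if flags_file.exists() else [],
        })
    return tests
```

Each entry then carries everything needed to run one script and compare its output with the '.good' file.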
A Python implementation of sed has to face legitimate questions about timing. Fortunately, results are not bad. Unfortunately, they seem to get slower as the Python version number increases. Timings are given in seconds.
Platform | GNU sed 4.2.1 | sed.py python 2.6 | sed.py python 2.7 | sed.py python 3.4 |
---|---|---|---|---|
Windows 7, Intel Xeon 3.2 GHz, 6 GB RAM | 19.4 | 19.1 | 22.6 | 26.9 |
Windows XP, Intel Pentium4 3.2 GHz, 4 GB RAM | 47.5 | 50.7 | 56.5 | 71.2 |
Linux, Intel Pentium4 3.2 GHz, 4 GB RAM | - | - | 51.0 | - |
Test conditions:
- Only script files are used (scripts from folders testsuiteN). This is to avoid measuring the time to extract scripts, inputs and results from .suite files.
- The given values are averaged from three consecutive test runs.
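The averaging over three consecutive runs can be sketched as follows. How the figures above were actually measured is not stated; average_runtime and run_once are hypothetical names, and run_once stands for whatever launches one complete test run.

```python
import time


def average_runtime(run_once, runs=3):
    """Return the mean wall-clock time in seconds of `runs` consecutive calls."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```

Note that time.perf_counter requires Python 3.3 or later; on older versions another clock would have to be used.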
At one moment, one has to decide what will be in the release to come, and what can be delayed. Here are some features which would be nice to have but can be delayed to a future version.
- Better POSIX compliance:
  - multiple scripts on the command line (-e, -f)
  - multiple input files
  - character classes
- Better error handling (display of the number of the line in error)
- Better error handling when testing (the error message could be tested)
- Use PythonSed as a basis for a sed debugger.
- ...