GitHub - learnbyexample/regexp-cut: Use awk to provide cut like syntax for field extraction
Uses awk to provide cut-like syntax for field extraction. The command name is rcut.
Motivation
cut's syntax is handy for many field extraction problems, but it doesn't allow multi-character or regexp delimiters. This project aims to provide cut-like syntax for those cases. It currently uses mawk in a bash script.
ℹ️ Note that rcut isn't feature compatible or a replacement for the cut command. rcut helps when you need features like regexp field separator.
Features
- Default field separation is same as `awk`
- Both input (`-d`) and output (`-o`) field separators can be multiple characters
- Input field separator can use regular expressions
    - this script uses `mawk` by default
    - you can change it to `gawk` for better regexp support with `-g` option
- If input field separator is a single character, output field separator will also be this same character
- Fixed string input field separator can be enabled by using the `-F` option
    - if `-o` is not used, value passed to the `-d` option will be set as the output field separator
- Field range can be specified by using `-` separator (same as `cut`)
    - `-` by itself means all the fields (this is also the default if `-f` option isn't used at all)
    - if start of the range isn't given, default is `1`
    - if end of the range isn't given, default is last field of a line
- Negative indexing is allowed if you use `-n` option
    - `-1` means the last field, `-2` means the second-last field and so on
    - you'll have to use `:` to specify field ranges
- Multiple fields and ranges can be separated using `,` character (same as `cut`)
- Unlike `cut`, order matters with the `-f` option and field/range duplication is also allowed
    - this assumes `-c` (complement) is not active
- Using `-c` option will print all the fields in the same order as input except the fields specified by `-f` option
- Using `-s` option will suppress lines not matching the input field separator
- Minimum field number is forced to be `1`
- Maximum field number is forced to be last field of a line
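The field-selection rules above (comma-separated fields, `-` ranges with optional start and end, order and duplicates preserved) can be sketched in a few lines of awk. This is only an illustration and not the actual rcut script: the function name `fields_sketch` and its positional arguments are hypothetical, and it assumes "forced" means that out-of-range field numbers are clamped to `1` and the last field.

```shell
# Hypothetical helper, NOT the project's rcut script.
# usage: fields_sketch FIELD_SPEC [FS_REGEXP] [OFS]
fields_sketch() {
    awk -v spec="$1" -v fs="${2:- }" -v ofs="${3:- }" '
    BEGIN { FS = fs; OFS = ofs; n = split(spec, parts, ",") }
    {
        out = ""; sep = ""
        for (i = 1; i <= n; i++) {
            p = parts[i]
            idx = index(p, "-")
            if (idx > 0) {
                # range: missing start defaults to 1, missing end to NF
                s = substr(p, 1, idx - 1)
                e = substr(p, idx + 1)
                s = (s == "") ? 1 : s + 0
                e = (e == "") ? NF : e + 0
            } else {
                s = p + 0
                e = s
            }
            # assumption: clamp both ends to the valid range 1..NF
            if (s < 1) s = 1
            if (e < 1) e = 1
            if (s > NF) s = NF
            if (e > NF) e = NF
            # fields are emitted in the order given, duplicates included
            for (j = s; j <= e; j++) { out = out sep $j; sep = OFS }
        }
        print out
    }'
}

printf '1 2 3 4 5\na b c d e\n' | fields_sketch 2-3,5,1,2-4
echo 'Sample123string42with777numbers' | fields_sketch 1,4 '[0-9]+'
```

Leaving the second argument empty keeps awk's default whitespace splitting, which mirrors the first feature in the list. Negative indexing, complement and line suppression are left out to keep the sketch short.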
Examples
$ cat spaces.txt
1 2 3
x y z
i j k

# by default, it uses awk's space/tab field separation and trimming
# unlike cut, order matters
$ rcut -f3,1 spaces.txt
3 1
z x
k i

# multi-character delimiter
$ echo 'apple:-:fig:-:guava' | rcut -d:-: -f2
fig

# regexp delimiter
$ echo 'Sample123string42with777numbers' | rcut -d'[0-9]+' -f1,4
Sample numbers

# fixed string delimiter
$ echo '123)(%)*#^&(*@#.[](\\){1}\xyz' | rcut -Fd')(%)*#^&(*@#.[](\\){1}\' -f1,2 -o,
123,xyz

# multiple ranges can be specified, order matters
$ printf '1 2 3 4 5\na b c d e\n' | rcut -f2-3,5,1,2-4
2 3 5 1 2 3 4
b c e a b c d

# last field
$ printf 'apple ball cat\n1 2 3 4 5' | rcut -nf-1
cat
5

# except last two fields
$ printf 'apple ball cat\n1 2 3 4 5' | rcut -cnf-2:
apple
1 2 3

# suppress lines without input field delimiter
$ printf '1,2,3,4\nhello\na,b,c\n' | rcut -sd, -f2
2
b

# -g option will switch to gawk
$ echo '1aa2aa3' | rcut -gd'a{2}' -f2
2
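For comparison, a few of these examples can be approximated in plain awk. The function names below are hypothetical, and the code is a rough mapping for these particular inputs rather than a description of how rcut works internally.

```shell
# Illustrative awk equivalents; function names are hypothetical.

# roughly: rcut -nf-1 (last field)
last_field() { awk '{ print $NF }'; }

# roughly: rcut -cnf-2: (everything except the last two fields)
drop_last_two() {
    awk '{
        out = ""; sep = ""
        for (i = 1; i <= NF - 2; i++) { out = out sep $i; sep = OFS }
        print out
    }'
}

# roughly: rcut -sd, -f2 (skip lines that lack the delimiter)
second_csv_field() { awk -F, 'NF > 1 { print $2 }'; }

printf 'apple ball cat\n1 2 3 4 5\n' | last_field
printf 'apple ball cat\n1 2 3 4 5\n' | drop_last_two
printf '1,2,3,4\nhello\na,b,c\n' | second_csv_field
```

Each one-off case needs its own awk program, which is the gap the cut-like `-f` syntax is meant to fill.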
See Examples.md for many more examples.
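The fixed string delimiter case is awkward in plain awk because `FS` is treated as a regular expression when it is longer than one character. One way around this, shown here as a sketch under assumptions and not as rcut's actual method, is to split with `index()` and `substr()` instead. The function name `nth_fixed` and the `DELIM` environment variable are hypothetical; the delimiter is passed through the environment because `awk -v` would interpret its backslashes as escape sequences.

```shell
# Hypothetical helper, NOT the project's rcut script.
# usage: nth_fixed LITERAL_DELIMITER FIELD_NUMBER
nth_fixed() {
    DELIM="$1" awk -v n="$2" '
    BEGIN { d = ENVIRON["DELIM"]; dl = length(d) }
    {
        rest = $0; k = 1
        # skip over the first n-1 fields, matching the delimiter literally
        while (k < n && (pos = index(rest, d)) > 0) {
            rest = substr(rest, pos + dl)
            k++
        }
        if (k < n) { print ""; next }   # line has fewer than n fields
        pos = index(rest, d)
        if (pos > 0) rest = substr(rest, 1, pos - 1)
        print rest
    }'
}

# printf is used instead of echo so that backslashes are passed through as-is
printf '%s\n' '123)(%)*#^&(*@#.[](\\){1}\xyz' | nth_fixed ')(%)*#^&(*@#.[](\\){1}\' 2
```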
Tests
You can use script.awk to check if all the example code snippets are working as expected.
$ cd examples/
$ awk -f script.awk Examples.md
TODO
- Step value other than `1` for field range
- What to do if start of the range is greater than end?
- And possibly more...
Similar tools
- hck: close to drop-in replacement for `cut` that can use a regex delimiter, works on compressed files, etc
- choose: negative indexing, regexp based delimiters, etc
Contributing
- Please open an issue for typos/bugs/suggestions/etc
- Even for pull requests, open an issue for discussion before submitting PRs
- In case you need to reach me, mail me at `echo 'bGVhcm5ieWV4YW1wbGUubmV0QGdtYWlsLmNvbQo=' | base64 --decode` or send a DM via twitter
License
This project is licensed under MIT, see LICENSE file for details.