I have a project creating a Ladino dictionary, in which I have a few thousand YAML files. They used to include lists of values, but a while ago I split them up into individual entries. I did this because the people who edit them are not used to YAML files, and this makes it a lot easier to explain to them what to do.
However, that change left me with a 1-item list in each file. I wanted to clean that up.
Example files
Here are a few example files, reduced in size for this demo.
- ladino: kaza

- ladino: komer
  inglez: to eat

- ladino: biervo
  inglez: word
# some comment
As you can see each one has an entry for a Ladino expression. Some of the files have translations to English. Other files in the real data-set had further translations to Hebrew, Turkish, French, Portuguese, and Spanish.
Some files had comments.
The dash in the first row and the indentation are left over from the time when each file held more than one of these entries.
So I wanted to get rid of the first two characters of every line, except for lines that start with a hash mark (#).
Here is the Perl one-liner to do so.
perl -p -i -e 's/^[^#].//' *.yaml
- The '*.yaml' at the end is a shell expression that the shell expands to all the YAML files in the current directory, which become the parameters of this command.
- The -p tells perl to read the content of each file line-by-line and print it.
- The -i tells perl to replace the original files with the content that was printed.
- The -e tells perl that the following string is a perl program and not the name of the file where the perl program is.
- The perl program 's/^[^#].//' will be executed on every line read from the files.
- The 's///' is regex substitution. It works on the current line and changes the current line. So the lines that are saved back to the files are the modified lines.
- Between the 1st and 2nd slash is the regex.
  - The first '^' means the match must start at the beginning of the line.
  - The '[^#]' means that there must be a character that is not '#'. This will match any character in the first position of the line except '#'.
  - The '.' means match any character.
- The string that is between the 2nd and 3rd slash is the replacement. It is an empty string, so if there is a match it will be replaced by the empty string.
That's the whole thing.
Improvement
Now that I am explaining it, it occurred to me that this would be a safer solution:
perl -p -i -e 's/^[- ] //' *.yaml
Here the substitution is s/^[- ] //, which means the first character must be either a dash or a space and the second character must be a space. Those two characters are removed.
If the first two characters are anything else, the line will not be changed. This is safer because it is more specific about what we want to match for replacement.
Results
For this article I saved the resulting files in a separate place:
ladino: kaza

ladino: komer
inglez: to eat

ladino: biervo
inglez: word
# some comment
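One way to check that no file was missed is to look for files that still contain a line starting with a dash or a space followed by a space. This is only a sketch: the sample files and demo_dir are hypothetical, and it uses grep rather than Perl.

```shell
# Work in a temporary directory so no real data is touched.
demo_dir=$(mktemp -d)

# One file that is already clean and one that still has the old layout.
printf '%s\n' 'ladino: kaza' > "$demo_dir/kaza.yaml"
printf '%s\n' '- ladino: komer' '  inglez: to eat' > "$demo_dir/komer.yaml"

# grep -l prints only the names of files that contain a matching line.
leftovers=$(grep -l '^[- ] ' "$demo_dir"/*.yaml)
echo "$leftovers"
```

Here only komer.yaml should be listed. On a fully cleaned directory the same grep prints nothing.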
Top comments (5)
Perl seems amazing. I don't think it is that popular nowadays, but if I want to learn it, where can I? I learnt Perl regex, but maybe there is a good "top to bottom" Perl guide? I'm really interested in Perl after a few of your posts.
I am not sure what you mean by "top to bottom", but I can point you to the Perl Tutorial I wrote.
Thanks :).
Nice one, but you can do this with sed instead of perl too and save a bunch of characters 😀
will do the trick just fine, and if you really have a lot of work to do:
will run it in parallel on as many threads as you have CPU cores 😁
You kids and your fancy "sed -i". Back in the day, sed didn't have that, but Perl had -i long before! The sed folks wisened up and took that idea as their own!