I found this YouTube movie explaining how to do it. I would like to repeat it but I'm not a Linux user. At 01:50 a sed command is used to replace some bytes with something else. But I don't understand with what exactly. Can anybody explain what is done, please?
The issue being addressed is that someone (very incorrectly) ran the original code through a character set/encoding conversion tool, which changed a few of the bytes. That's bad enough, but worse yet, some of the original tools used to assemble this code still expect that original character set/encoding, consider some of the new byte sequences to be invalid, and presumably barf on them.
Ideally the way to recover from this would be to go back and get the original files, before they were run through that conversion tool. This doesn't seem possible at the moment, so in an attempt to recover anyway, the presenter is running yet another conversion tool (of sorts, it's basically a hack) to at least partially reverse the conversion process, enough that the original tools will now accept this new source code.
The sed command in particular is as follows:
The for loop takes each of those filenames in turn, assigned to $f, and runs the sed command with the filename substituted in src/$f.
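If the loop mechanics are unfamiliar, here is a minimal sketch of the same pattern. The scratch directory and the file names a.asm and b.asm are made up for illustration; they are not the names used in the video.

```shell
# Hypothetical setup: a scratch directory with two made-up source files.
work=$(mktemp -d)
mkdir "$work/src"
printf 'one\n' > "$work/src/a.asm"
printf 'two\n' > "$work/src/b.asm"

visited=""
contents=""
for f in a.asm b.asm; do
    # On each pass $f holds the next name in the list,
    # so src/$f refers to a different file every time.
    echo "processing src/$f"
    visited="$visited src/$f"
    contents="$contents$(cat "$work/src/$f") "
done
```

The loop body runs once per name, which is why a single sed line is enough to fix a whole list of files.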
Sed itself is invoked with three options, albeit not in the most obvious way. The first is -i, which asks it to overwrite the input file instead of writing its edited output to the standard output. So src/$f will be replaced with the changed version of the file that sed makes.
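A small sketch of the difference, assuming GNU sed (the BSD/macOS sed wants an explicit suffix argument after -i, e.g. -i ''). The temporary file and its contents are made up.

```shell
tmp=$(mktemp)
printf 'hello world\n' > "$tmp"

# Without -i: the edited text goes to standard output, the file is untouched.
to_stdout=$(sed -e 's/world/there/' "$tmp")
unchanged=$(cat "$tmp")

# With -i (GNU sed): the file itself is rewritten and nothing is printed.
sed -i -e 's/world/there/' "$tmp"
after=$(cat "$tmp")

echo "stdout: $to_stdout | before -i: $unchanged | after -i: $after"
```

Since -i destroys the original contents, it is worth working on a copy the first time you try this.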
The second option is combined with the third, but a clearer way of stating it would be -r -e ... (though not -ire ..., since GNU sed would read the attached re as a backup-file suffix for -i). But anyway, the important thing is the -r, which is much better known as -E, which says to use extended regular expressions rather than "standard" (basic) regular expressions. Essentially, this is just selecting the exact regular expression language that sed will use.
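To see what that choice of language changes in practice, here is a sketch assuming GNU sed, which accepts both -r and -E. The input text is made up.

```shell
# Basic syntax: a bare | is an ordinary character, so nothing matches here.
basic=$(printf 'AB CD\n' | sed -e 's/AB|CD/#/g')

# Extended syntax: | separates alternatives.
extended_E=$(printf 'AB CD\n' | sed -E -e 's/AB|CD/#/g')

# -r is the older GNU spelling of the same option.
extended_r=$(printf 'AB CD\n' | sed -r -e 's/AB|CD/#/g')

echo "basic: $basic | -E: $extended_E | -r: $extended_r"
```

Without the option the command would silently do nothing to the files, because the pattern would be looking for a literal vertical bar.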
The third option is where the real work happens: -e command gives sed a command to run. (This can be used multiple times to give it multiple commands or, if you have enough of them, you might put them in a file and ask sed to read commands from that.)
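Both variants mentioned in the parenthesis can be sketched like this; the substitutions and the script file are made up for illustration, and -f is the option that reads commands from a file.

```shell
# Two commands given with two -e options, applied in order to each line.
multi=$(printf 'cat dog\n' | sed -e 's/cat/bird/' -e 's/dog/fish/')

# The same two commands stored in a file, one per line, and read with -f.
script=$(mktemp)
printf 's/cat/bird/\ns/dog/fish/\n' > "$script"
from_file=$(printf 'cat dog\n' | sed -f "$script")

echo "$multi | $from_file"
```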
The command itself is of the form s/match/replacement/g, which says to process each line in the file and, anywhere you find match, replace it with replacement. This would normally be done only once per line, but the g at the end says to repeat this as many times as needed on that line until every copy of match is replaced.
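A quick sketch of what the trailing g changes, on a made-up input line:

```shell
# Without g: only the first match on the line is replaced.
first_only=$(printf 'a-b-c\n' | sed -e 's/-/+/')

# With g: every match on the line is replaced.
every=$(printf 'a-b-c\n' | sed -e 's/-/+/g')

echo "$first_only | $every"
```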
The particular match they're using uses the \xNN hex character notation, where \x41 would be an ASCII or UTF-8 'A' and so on. The two vertical bars in the match pattern separate alternatives, as in AB|CD would match the characters "AB" or the characters "CD". So here they're matching any one of three different possible sequences of characters, which are (in my usual Motorola hex notation):
- $EF $BF $BD
- $C4 $BF
- $C4 $B4
The replacement part is much simpler: it's a plain '#' character. So, anywhere sed finds one of the three sequences above, it will be replaced with the single character '#'.
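Putting the pieces together, here is a sketch reconstructed from the description above. It is not the presenter's exact command. The file name demo.asm, its contents, the order of the alternatives and the LC_ALL=C setting (to make sed match raw bytes regardless of locale) are all my assumptions. It relies on GNU sed for -i, -r and \xNN.

```shell
work=$(mktemp -d)
mkdir "$work/src"

# Build a made-up test file containing the three byte sequences.
# printf octal escapes: \357\277\275 = EF BF BD, \304\277 = C4 BF, \304\264 = C4 B4
printf '; \357\277\275 top \304\277\n; \304\264 end\n' > "$work/src/demo.asm"

# \x41 is just another way of writing 'A' in the pattern.
hex_a=$(printf 'A\n' | LC_ALL=C sed -e 's/\x41/#/')

for f in demo.asm; do
    # In-place edit, extended regex, three alternatives, each replaced by '#'.
    LC_ALL=C sed -i -re 's/\xEF\xBF\xBD|\xC4\xBF|\xC4\xB4/#/g' "$work/src/$f"
done

result=$(cat "$work/src/demo.asm")
printf '%s\n' "$result"
```

Each two- or three-byte sequence collapses to a single '#', so the edited file comes out slightly shorter than the original.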
Those character sequences above are various characters in UTF-8 ('�', 'Ŀ' and 'Ĵ', I think) but they don't really have any meaning because they're the result of a nonsense conversion, as far as I can tell. So basically, the sed script just gets rid of them; presumably '#' is a comment character or within a comment or something else ignored or at least not invalid at the particular points in the file where the above strings occur.
Another question: what are these bytes doing in an ASM file at all?
Per above, someone apparently incorrectly ran a conversion program on the files that they ought not have run.
(Personally, I bet it was non-Vim users, since Vim users would let Vim do the appropriate character set conversion on the fly, keeping the file in the original encoding but converting to UTF-8 for display systems that require that. :-))