deduplicate | DullSharpness

As I’ve stated in previous posts, I’m often forced to use minimalist tools to accomplish a job. In this installment, I challenged myself to deduplicate a list of random numbers (or any values, really) and output just the distinct ones into a new file. It’s very easy in SQL (“Select distinct…”), but there’s a little more to it when DOS is your only option. Not many people will find themselves limited in such a way, but as with many of my posts, this is more about the mental exercise and the application of techniques than it is about the end product.

To see how this works, start with a file that has about 100 lines, each line beginning with a random number between 1 and 50 inclusive. You can generate this list using Microsoft Excel. Enter the formula “=INT(RAND()*50)+1” into cell A1 and copy it down the next 99 rows. Voila: a set of random integers that’s sure to have some duplicates. Copy/paste (or just save) this set into a text file named “UnsortedDupes.txt”.

Drop the following script into a batch file within the same directory as “UnsortedDupes.txt”

@Echo Off
REM AUTHOR: www.dullsharpness.com

REM Read up on Delayed Variable Expansion in the SET command's help
SETLOCAL ENABLEDELAYEDEXPANSION

REM Initialize the variable that will hold previous value
REM (choose a value that you won't encounter in your list)
set prev=bogusvalue

REM Pipe the file into a sorter, and for each full sorted line, cast aside
REM the lines that match the previous line.
for /F %%i in ('type UnsortedDupes.txt ^| sort') Do (
   echo PREV1=%prev%, VAL=%%i, PREV2="!prev!"
   If NOT "!prev!"=="%%i" (echo %%i>> DeDuped.txt)
   set prev=%%i
)

REM Turn off the SETLOCAL settings
ENDLOCAL

After you run the batch file, you’ll find that all of your distinct/deduplicated values have been echoed into “DeDuped.txt.”

Worth noting:

The reason this script works, which I presume is obvious, is that when we iterate over a list of sorted duplicate values, each time the value we’re processing differs from the previous value we processed, we recognize that it’s a new unique value and echo it to the deduped list. And each time the value we’re processing matches the previous value processed, we know we’ve already recorded it, so we ignore it.

Delayed variable expansion is required for this script to work. Otherwise, the PREV variable won’t get assigned its new value each time through the loop. Behavior of DOS variable expansion is discussed in more detail here.

I made the sort operation more complicated (and perhaps less efficient) than it needed to be because I piped standard output into it using the TYPE command. I did this because I wanted to demonstrate the technique of piping within the FOR command, and specifically the requirement to escape the pipe character with a caret “^|”. An easier way to perform the same thing is with “sort UnsortedDupes.txt”, and the SORT help listing says that’s a faster way too.

Using numbers for this exercise exposes an imperfection in the SORT command: it sorts alphabetically instead of numerically. The result, for example, is that “10” comes before “2”. There does not appear to be a way to overcome the SORT command’s limitation.

This example shows how to dedupe a list of single values, but the FOR command allows you to examine a specific field from a delimited row. And SORT allows you to specify which character in the row to start examining. Therefore, applying either one of those tactics would allow you to deduplicate your list based on a specific field or character number. Review the help contents of FOR and SORT to learn how to use those capabilities.

That’s all for now.

DullSharpness

Bright Ideas, Boring Topics

Tag Archives: deduplicate