
Search for filenames with non-standard characters?


philocalist


Bit of an odd one this: as an ex pro photographer, I have many thousands of files that I need to sort for duplicates, typically a folder containing many subfolders, and the root folder might contain in excess of 100,000 files.

 

Now, I have very good software that would normally handle a job like this very nicely; the problem is that it is being stalled by SOME of the files in there. In simple terms, it normally scans each file to create a data file, then looks for dupes within the data file. What is happening now is that it WILL scan the folder / subfolders OK and produce said data file; it will also identify the duplicate files within there. What would normally happen next is that I would have the option to delete or move duplicate files (according to criteria I set), either manually or automatically (and the auto route is the one normally taken, as a scan of a typical folder may well turn up in excess of 10,000 dupes - manually deleting each one is not really an option!)

 

The problem is that now, the software is refusing to delete anything after identifying the dupes.

A bit of head scratching and a couple of emails to the software vendor seems to indicate that the problem is filenames that contain non-qwerty standard characters. If for example, a filename contains say Japanese characters, or even some of the characters used in mainland Europe that have accents attached, or are characters similar to UK ones, but not in use in the UK, the software will go no further.

 

So the question is, can anyone think of a way that I can search within said root folders / subfolders for filenames that contain characters not used in the UK alphabet?

 

I'm currently facing this problem on a Vista PC, but could easily run it on a Windows 7 or 8 PC if there was a solution there?

In a perfect world, I want to be able to search for these files, and have them listed in a way that indicates their location. In a REALLY perfect world, I'd then be able to batch rename them in some controllable way to eliminate the rogue characters, or as a last resort, either move or delete them totally from the root folder / subfolder in question.

 

If anyone can suss out how to do this within the OS, great, but I'm not against buying software that will do the job if necessary.

Thanks!


Tough one. if you had an entire text list of the directory and sub directory it would be easy to write a script in php or java to sort out A-Z, 0-9 and any other characters deemed acceptable by the software.
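A minimal sketch of that idea, written in Python rather than php or java. It assumes the directory listing has already been saved as text with one full path per line (for example from `dir /s /b`) in an encoding that preserves the unusual characters. The function name `flag_unusual` and the set of "acceptable" characters are illustrative choices, not anything the duplicate finder is known to require.

```python
import re

# Assumption: letters, digits, spaces and a few common filename symbols are acceptable.
# Adjust this set to whatever the duplicate-finding software can actually handle.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9 ._()\-]*")

def flag_unusual(listing_lines):
    """Return the paths whose file name contains a character outside the allowed set."""
    flagged = []
    for line in listing_lines:
        path = line.rstrip("\r\n")
        if not path:
            continue
        # Only the last path component is checked, so drive letters and
        # backslashes in the folder part do not trigger a match.
        name = path.replace("/", "\\").rsplit("\\", 1)[-1]
        if not ALLOWED_NAME.fullmatch(name):
            flagged.append(path)
    return flagged

# Hypothetical sample listing; \u escapes stand in for the accented and Japanese characters.
sample = [
    "C:\\Photos\\2012\\wedding_001.jpg",
    "C:\\Photos\\2012\\caf\u00e9.jpg",
    "C:\\Photos\\Japan\\\u6771\u4eac.jpg",
]
for path in flag_unusual(sample):
    # ascii() prints the odd characters as escapes, which a console can always display.
    print(ascii(path))
```

The check is an allow-list, so there is no need to know in advance which odd characters are in there: anything not explicitly permitted gets flagged.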

 

you could try experimenting with vista search parameters. there is one that might be of use which is NOT

 

it must be in caps to be recognised as an operator.

 

if you go to the root directory and type a NOT b for example, in the search bar it will find files with a in the filename but excluding the ones with a and b.

 

i think you might be able to get away with doing: *.jpg NOT a,b,c,d,e,f,g,h so on and so forth (not sure how it will treat the , )

 

if that doesn't work then have fun trying this one: *.jpg NOT a NOT b NOT c NOT d etc :D

 

im not holding my breath on any of that working how i would expect it to. so i guess your only other option is to find file search software with an advanced regular expression feature like the ones used in java, php etc :(

Owner of Tacklesack.co.uk


Moderator at The-Pikers-Pit.co.uk


ugh i found this after posting: http://stackoverflow.com/questions/1183659/windows-advanced-file-matching

 

regex can be used in cmd with the findstr command. as for the expression to use, i cant think of it as its 4am and im a little sleepy. happy hunting.

Owner of Tacklesack.co.uk


Moderator at The-Pikers-Pit.co.uk


Is your software smart enough to look for unicode values? If so, and since most of the UK symbols are in a fairly small and contiguous range, you could search for codes outside that range, convert them to some string, then do something with any file names containing that string. Anything above U+007A (hex) is unlikely to be a UK letter or number, so if you converted any value above that to something like xyz, you could do something with any file name containing "xyz".
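As a rough illustration of the marker idea, here is a small Python sketch. The function name `mark_foreign` is made up for this example, and the cut-off of U+007A is taken straight from the suggestion above, which means the symbols just after z (the braces, the bar and the tilde) get marked as well.

```python
def mark_foreign(name, marker="xyz", limit=0x7A):
    """Replace every character whose code point is above `limit` with `marker`."""
    return "".join(marker if ord(ch) > limit else ch for ch in name)

# Hypothetical file names; \u00fc is u with an umlaut.
names = ["holiday 2012.jpg", "Z\u00fcrich.jpg"]
for name in names:
    marked = mark_foreign(name)
    if marked != name:
        # Only names that actually changed contained a character above the limit.
        print(marked)   # Zxyzrich.jpg
```

Comparing the marked name with the original, rather than searching for "xyz", avoids a false hit on a file that genuinely has xyz in its name.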

 

Take a look in charmap to see what I'm talking about.

" My choices in life were either to be a piano player in a whore house or a politician. And to tell the truth, there's hardly any difference!" - Harry Truman, 33rd US President


:doh: Waaaay over my head :-) The biggest single problem (I think) is that I'm trying to search for files that contain letters / symbols etc that are unknown to me (and therefore I cannot specify them in any search). What I need is a list of the file names (and locations) that contain these 'unknown' characters. Basically, if filename contains ONLY UK alphabet and / or numbers and / or spaces, ignore it: if filename contains anything else, list it for me!


I can't think of a solution that isn't at least a little technical but it can be done if your software can look for unicode values.

 

The most technical thing you need to understand is hexadecimal (base 16) numbers. Computers love hex numbers. What you have is a number system that does not stop at 9 as our decimal system does (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) but extends the numbers as 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F.

 

In our familiar decimal system, the number following the largest single digit (9) is 10 so we count 8, 9, 10, 11, etc.

In hex, we count 8, 9, A, B, C, D, E, F, 10, 11, etc. and 10 hex is 16 in decimal, so a bit larger than 10 decimal.

 

Here endeth the lesson.
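For anyone who would rather see the counting than take it on trust, a few lines of Python reproduce it. This is only a demonstration of the number system, using the built-in `format` and `int` functions.

```python
# Count from 8 to 17 and show each value in decimal and in hex.
for n in range(8, 18):
    print(n, format(n, "X"))

# "10" read as a hex number is sixteen, not ten.
sixteen = int("10", 16)
print(sixteen)   # 16
```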

 

Now click on start>run>charmap and look at the familiar letters, numbers, symbols. They have the lowest unicode values (written as U+ followed by a 4 digit hexadecimal number). Any that are larger are not our symbols. When you look at charmap you can see that the first (top-left) symbol is ! or U+0021. The last of the English symbols is ~ or U+007E. Any unicode symbol larger than 007E is not one of ours (the one exception you might care about is £, which is U+00A3), so if you find a U+007F, U+0080, U+0081, etc. it isn't English.
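The same values charmap shows can be read off in Python with `ord`. The sketch below assumes the cut-off is U+007E (the tilde, the last printable ASCII character); the helper name `code_points` is made up for this example.

```python
def code_points(name):
    """Return (character, 'U+XXXX') pairs for every character above U+007E."""
    return [(ch, f"U+{ord(ch):04X}") for ch in name if ord(ch) > 0x7E]

# The first symbol in charmap.
print(f"U+{ord('!'):04X}")   # U+0021

# A hypothetical file name with an accented e (\u00e9) and a pound sign (\u00a3).
print([cp for _, cp in code_points("caf\u00e9 \u00a35.jpg")])   # ['U+00E9', 'U+00A3']
```

Run against a real file name, this reports exactly which characters are the rogue ones, which helps when deciding what to rename them to.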

" My choices in life were either to be a piano player in a whore house or a politician. And to tell the truth, there's hardly any difference!" - Harry Truman, 33rd US President


Thanks for trying to help Newt (and Andy!) Fortunately, I'm of a generation that was still taught 'proper' maths at school (and some of it stuck :hypocrite: ), so I can get my head around hex numbers, but unless I'm mistaken, no resident program with Vista (or Win 7 / 8) will search for filenames using that info - though I'd be happy to be proven wrong. Any thoughts on a third-party alternative search program?

That said, I'm still not convinced it would work, as what I need to do is identify everything that is NOT standard a - z / 0 - 9 ... for all I know there could be hundreds of them in there, all different, and without knowing what they are individually, I can't search for them specifically.


pain in arse is this one mate. been trying to figure out ways of doing it with a batch file.

 

i created one file with odd characters and tried to rename all files with an incremental number and it fails on the one with the odd character. though it did rename all the files except those weird ones if you reckon that would help ? it might make it easier to "see" those weird ones manually

 

heres the batch file i used.

 

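@echo off
rem Renames every .jpg in the current folder to renamed_05_01_2013_0.jpg, _1.jpg and so on, oldest first.
setlocal enabledelayedexpansion
set /a count=0
for /f "delims=" %%a in ('dir /b /od *.jpg') do (
    rem Quote both names so spaces in a filename do not split the arguments.
    ren "%%a" "renamed_05_01_2013_!count!.jpg"
    set /a count+=1
)
endlocal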

you can change this part (renamed_05_01_2013_) to suit your needs. also change jpg wherever you see it if the file extension isn't jpg.

 

you would need to copy the batch file to each sub folder and run each one though, as it wont work its way through each sub directory on its own.

 

make a backup first in-case it all goes wrong. i cant really stress that enough. i dont want to be to blame for screwing up all your photos :P
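For the "controllable batch rename" part of the original question, a Python sketch along the following lines might be a starting point. It is an assumption-laden illustration, not a tested tool: the function names are made up, accented letters are reduced to their base letter using `unicodedata`, and anything else outside ASCII becomes an underscore. Unlike the batch file, it walks subfolders itself, and it builds a list of proposed renames first so nothing is touched until that list has been checked.

```python
import os
import unicodedata

def ascii_name(name):
    """Suggest an ASCII-only version of a file name."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    cleaned = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop the accent itself, keep the base letter
        cleaned.append(ch if ord(ch) < 128 else "_")
    return "".join(cleaned)

def plan_renames(root):
    """Walk root and its subfolders, returning (old_path, new_path) pairs."""
    plan = []
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            new_name = ascii_name(name)
            if new_name != name:
                plan.append((os.path.join(folder, name),
                             os.path.join(folder, new_name)))
    return plan

def apply_renames(plan):
    """Carry out the plan, skipping any rename that would overwrite an existing file."""
    for old, new in plan:
        if os.path.exists(new):
            print("skipped, target exists:", ascii(old))
        else:
            os.rename(old, new)
```

The intended use would be to call `plan_renames` on the root folder, read through the pairs it returns, and only then pass the list to `apply_renames`, with a backup already made as stressed above.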

Owner of Tacklesack.co.uk


Moderator at The-Pikers-Pit.co.uk


  • 2 weeks later...

Just a thought, which may help someone out there to point me in the right direction ... can anyone come up with a way to accomplish the search outlined, so that it would end up listing all files that contained non-ascii characters?
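One possible way to get such a list, sketched in Python under the assumption that Python 3.7 or later is installed on the PC (it is not part of Windows). The function names and the tab-separated report layout are illustrative choices. The script never needs to know which odd characters exist; it simply reports any name that is not pure ASCII, together with the folder it sits in.

```python
import os

def has_non_ascii(name):
    """True if any character in name falls outside the 128 ASCII characters."""
    return not name.isascii()

def find_non_ascii(root):
    """Yield (folder, name) for every file or subfolder under root with a non-ASCII name."""
    for folder, subfolders, files in os.walk(root):
        for name in subfolders + files:
            if has_non_ascii(name):
                yield folder, name

def write_report(root, report_path):
    """Write one line per hit: location, name, and the name with odd characters escaped."""
    count = 0
    # UTF-8 keeps the original characters; backslashreplace guards against
    # names that cannot be encoded cleanly.
    with open(report_path, "w", encoding="utf-8", errors="backslashreplace") as report:
        for folder, name in find_non_ascii(root):
            report.write(f"{folder}\t{name}\t{ascii(name)}\n")
            count += 1
    return count
```

Calling `write_report` with the root folder and a path for the report file would produce a text file that opens in a spreadsheet or text editor, with the escaped column showing the code of each rogue character.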
