Discussion: Find unused XML files in a project
Nordlund, Eric
2014-07-31 05:20:46 UTC
Hello docbook-apps.

I have a large set of projects that I am looking to scrub for unused graphics and XML files prior to sending them off to localization.

Some of my colleagues have created some very basic bash and batch scripts that scan through the folders and find files that aren’t referenced in any of the source files so we can delete them. I worry, though, that these scripts don’t catch everything (an unused XML file in the base directory that references images will ‘bless’ those images), so we could still leave extraneous files behind or accidentally delete important ones without knowing it.

Each project has a book.xml file that is the gold master for the outputs. If neither the book.xml file nor any of its includes references a file in the project, that file is safe to delete. I was hoping that I could use xmllint to tell me which files are loaded when I validate book.xml, but I haven’t found the magic formula yet.

I’ve tried the following command to list all of the files loaded during a pass, but it doesn’t list the image files referenced, which are largely the point of this exercise, and I get a lot of noise from the DTD module files on every include.
$ xmllint --load-trace book.xml --xinclude --noout &> test1
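
(As far as I can tell, --load-trace only reports documents the parser itself loads, such as XIncluded files and DTD modules; an imagedata fileref is just an attribute and the image is never opened, so it can’t show up in the trace.) Here is a minimal Python/lxml sketch of what I’m after instead — untested, and it assumes the includes are XIncludes and the images use imagedata/@fileref:

#!/usr/bin/env python3
# Rough sketch (untested): resolve XIncludes with lxml, then collect
# every file the book actually pulls in -- both the XML modules and
# the images referenced via fileref attributes.
import os
from urllib.parse import urlparse, unquote
from lxml import etree

def to_path(uri):
    # lxml may report base URIs as file:// URLs; normalize to plain paths.
    parsed = urlparse(uri)
    return unquote(parsed.path) if parsed.scheme == "file" else uri

def referenced_files(book_path):
    tree = etree.parse(book_path)
    tree.xinclude()  # merge xi:include targets; origins end up in xml:base
    used = {os.path.abspath(book_path)}
    for el in tree.getroot().iter():
        if not isinstance(el.tag, str):  # skip comments and PIs
            continue
        if el.base:  # base URI == the XML file this element came from
            used.add(os.path.abspath(to_path(el.base)))
        fileref = el.get("fileref")  # DocBook imagedata/videodata reference
        if fileref:
            base_dir = os.path.dirname(to_path(el.base or book_path))
            used.add(os.path.abspath(os.path.join(base_dir, fileref)))
    return used

Since xinclude() resolves includes recursively and records each fragment’s origin in xml:base, nested includes should be tracked without re-implementing the include logic.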

Has anyone had a similar problem to solve? Am I going about this the right way?

Thanks, and I’m open to any suggestion. If bash and xmllint don’t work here, I am partial to Python as an alternative. Just saying.
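
For the comparison step, this is roughly what I have in mind (hypothetical; it reuses referenced_files() from the sketch above and assumes it runs from the project root):

# Hypothetical follow-on to the sketch above: list everything on disk
# that the book never references. Review the output before deleting anything!
import os

def unused_files(project_dir, used):
    candidates = []
    for root, _dirs, names in os.walk(project_dir):
        for name in names:
            path = os.path.abspath(os.path.join(root, name))
            if path not in used:
                candidates.append(path)
    return sorted(candidates)

if __name__ == "__main__":
    used = referenced_files("book.xml")  # assumed gold-master entry point
    for path in unused_files(".", used):
        print(path)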

Eric Nordlund
Senior Technical Writer
Amazon Web Services
Ph: 206-266-8048 | ***@amazon.com

Kirill Churin
2014-07-31 08:21:47 UTC
Post by Nordlund, Eric
[...]
https://gist.github.com/reflexing/3184e28a315ed0cc4a1c
--
Kirill Churin
Fintan Bolton
2014-07-31 08:41:16 UTC
Maybe this Python script would be of some use:

https://github.com/fbolton/sibin

My doc library does lots of re-use with xi:includes, and images are also
referenced all over the place. To interface with 'publican' (Red Hat's
publication tool), though, I need to have all my XML files under one
directory and all my image files under one images/ subdirectory. The
sibin tool consolidates all of my books (under the publican/ directory)
so that they have the tidy structure that 'publican' likes.

It sounds like a similar kind of problem to the one you are dealing with.

Oh! But one thing to watch out for: the script also converts 'olink'
elements into plain HTTP 'link' elements. You will probably want to
disable that part of the script.
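
If you only want the gist of it, the consolidation step boils down to something like this (my own rough sketch, not sibin's actual code; it assumes the XML files already sit in one directory and use imagedata/@fileref):

# Rough illustration only -- not sibin's actual code. Copy every image
# referenced by an XML file into images/ and rewrite the fileref so the
# book matches the flat layout publican expects. Assumes the XML files
# already live in one directory.
import os
import shutil
from lxml import etree

def consolidate_images(xml_path, images_dir="images"):
    os.makedirs(images_dir, exist_ok=True)
    tree = etree.parse(xml_path)
    changed = False
    for el in tree.iter():
        if not isinstance(el.tag, str):  # skip comments and PIs
            continue
        fileref = el.get("fileref")
        if not fileref or fileref.startswith(images_dir + "/"):
            continue
        src = os.path.join(os.path.dirname(xml_path), fileref)
        dst = os.path.join(images_dir, os.path.basename(fileref))
        shutil.copy2(src, dst)  # gather the image under images/
        el.set("fileref", "images/" + os.path.basename(fileref))
        changed = True
    if changed:
        tree.write(xml_path, encoding="UTF-8", xml_declaration=True)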

Cheers,
Fintan
Post by Nordlund, Eric
[...]
--
Fintan Bolton
Content Services | Red Hat, Inc.
***@redhat.com
home office: +49-89-14347132
blog: http://docinfusion.blogspot.com