PDF questions and ArchivesData center connections

  • Creator
    Discussion
  • #118255

    Lynn Yarmey
    Member

    Hello Archives and Records Professionals IG!
    I am hoping you might be able to help me answer a request I just received from a data center friend on archival practices around PDF migration. My apologies if this is a distraction to the group or would be better addressed elsewhere; an offline reply would be lovely if that is more appropriate.
    The broader connection for the IG might be that data center folks, not necessarily trained in archival practices or up to speed in the institutional records environments, also archive more traditional records with their data holdings. Making (additional) connections across these communities would be an amazing thing, imo.
    Sincere thanks!
    Lynn
    ————————————————————————-
    My basic questions are: (1) how are you creating PDF/A files, (2) are you validating them, and (3) how are you addressing validation errors? (4) Lastly, any thoughts on where I can find answers to similar questions online? I seem to be having zero luck, but I can’t be the only one having these issues.
    Background:
    We’re in the process of converting our repository’s backlog of materials into archival versions. We’re a discipline-specific repository, and I’ve worked to ensure that we only accession materials that have a clear migration path, but as you can imagine, it’d be nice if it were only that easy. My current challenge is in working with the 19,000 PDF files in our system (many are quite large).
    We’ve been doing an analysis of various tools that can create PDF/A files, comparing Adobe Acrobat, Ghostscript, and ABBYY FineReader, in how well they converted a small test set of PDFs and evaluating how well they’d work for larger batches in an automated manner. We then took the resulting PDF/A files and validated them using VeraPDF, Adobe’s preflight, and PDFBox’s preflight tools.
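A comparison like the one described above can be summarised as a pass rate per converter and validator pair. The following is only a minimal sketch of such a tally, not the poster's actual analysis: the function name `pass_rates`, the record layout, and the sample rows are all hypothetical, and the outcomes shown are made-up illustrations rather than measurements of any real tool.

```python
from collections import defaultdict


def pass_rates(results):
    """Return {(converter, validator): fraction of files that passed}.

    `results` is an iterable of (converter, validator, filename, passed)
    tuples; this record layout is an assumption made for the sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # [passed, total] per pair
    for converter, validator, _filename, passed in results:
        tally = counts[(converter, validator)]
        tally[0] += 1 if passed else 0
        tally[1] += 1
    return {key: passed / total for key, (passed, total) in counts.items()}


# Hypothetical outcomes, for illustration only (not real measurements).
sample = [
    ("converter_a", "validator_x", "doc1.pdf", True),
    ("converter_a", "validator_x", "doc2.pdf", False),
    ("converter_b", "validator_x", "doc1.pdf", True),
    ("converter_b", "validator_x", "doc2.pdf", True),
]
rates = pass_rates(sample)
for (converter, validator), rate in sorted(rates.items()):
    print(f"{converter} / {validator}: {rate:.0%}")
```

Keeping the validator in the key makes it visible when two validators disagree about the same converter's output, which is worth knowing before trusting any single validation result.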
    We’ve also looked at tools like Archivematica, but shied away as they don’t support all of the file formats that we support, and don’t address some of these issues.
    Challenges:
    In our initial tests a few years ago, none of the tools created valid PDF/A files in all cases. Currently Adobe Acrobat is doing better, but it lacks useful tools for automating hundreds or thousands of conversions. Archivematica uses Ghostscript to convert files, but it seems to be the worst of the bunch in terms of validation errors. We use ABBYY FineReader because of its superior OCR and batch tools, but it also produces errors at the moment. I’ve yet to find a good tool beyond Acrobat, when it works.
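For readers who want to reproduce the Ghostscript part of such a test in batch, the sketch below builds a PDF/A conversion command from Ghostscript's documented `pdfwrite` options and runs it over a directory. It is an assumption-laden illustration, not the configuration used by the poster or by Archivematica: the helper names, the directory layout, the choice of PDF/A part 2, the RGB colour strategy, and the location of `PDFA_def.ps` (which ships with Ghostscript and normally has to be edited to point at an ICC profile) are all hypothetical choices. A zero exit status only means Ghostscript finished; the output still has to be checked with a separate validator.

```python
import subprocess
from pathlib import Path


def ghostscript_pdfa_command(source, target, pdfa_def="PDFA_def.ps", part=2):
    """Build a Ghostscript argument list for a PDF to PDF/A conversion attempt."""
    return [
        "gs",
        f"-dPDFA={part}",                # requested PDF/A part (1, 2 or 3)
        "-dBATCH", "-dNOPAUSE",          # run non-interactively
        "-sColorConversionStrategy=RGB",
        "-sDEVICE=pdfwrite",
        "-dPDFACompatibilityPolicy=1",   # drop disallowed features instead of aborting
        f"-sOutputFile={target}",
        str(pdfa_def),                   # PostScript prologue that names the ICC profile
        str(source),
    ]


def convert_all(source_dir, target_dir, run=subprocess.run):
    """Convert every *.pdf in source_dir; return names of files where gs failed."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for source in sorted(Path(source_dir).glob("*.pdf")):
        command = ghostscript_pdfa_command(source, target_dir / source.name)
        if run(command).returncode != 0:
            failed.append(source.name)
    return failed
```

Note that `-dPDFACompatibilityPolicy=1` tells Ghostscript to discard or alter features that PDF/A forbids, so content can be lost silently; setting it to `2` makes Ghostscript abort with an error instead, which may be the safer default for an archive.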

  • Author
    Replies
  • #131589

    Dear Lynn and colleagues,
    Data management at the DANS research data archive includes conversion of document formats such as .doc, .docx, .rtf to PDF/A for long-term preservation. We make use of a Python script for processing the files; depending on the number of files, we manually check either all converted files or a representative sample.
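The "check everything or check a sample" rule could look roughly like the sketch below. This is not the DANS script, which is not shown in this thread: the function name `files_to_check`, the threshold of 50 files, and the 10% sample fraction are hypothetical values chosen purely for illustration.

```python
import random


def files_to_check(converted, check_all_up_to=50, sample_fraction=0.1, seed=None):
    """Pick converted files for manual review.

    Small batches are checked in full; larger batches get a random sample.
    The sample is never smaller than `check_all_up_to`, so a slightly larger
    batch does not suddenly receive far less scrutiny than a small one.
    """
    files = sorted(converted)
    if len(files) <= check_all_up_to:
        return files
    size = max(check_all_up_to, round(len(files) * sample_fraction))
    size = min(len(files), size)
    # A fixed seed makes the selection reproducible for audit purposes.
    return sorted(random.Random(seed).sample(files, size))


small_batch = [f"file{i:03}.pdf" for i in range(10)]
large_batch = [f"file{i:04}.pdf" for i in range(1000)]
picked = files_to_check(large_batch, seed=1)
print(len(files_to_check(small_batch)), len(picked))
```

Recording the seed alongside the review results lets someone else later reconstruct exactly which files were inspected.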
    In the words of our preservation officer, Valentijn Gilissen: “We do not convert PDF files to PDF/A. We were looking into options to do so three years ago, but opted against it after consulting with Johan van der Knijff, Digital Preservation Researcher at the National Library of the Netherlands (KB). He argued that PDF/A documents are PDF files with additional restrictions, which means that you will lose properties should the original file contain features that are not allowed by the PDF/A standard.
    The type of source material makes a difference: digitized or born-digital material. You are more likely to succeed in conversion of digitized PDF files to PDF/A, but you might argue that the PDF/A standard does not actually offer a huge benefit over the original PDF in those cases. The standard offers more advantages if the PDF is born-digital, but conversion options had always been limited and included the risk of information loss, with little to no options for validation.
    The conclusion was to only pursue PDF->PDF/A conversion strategies if the advantages of going for the PDF/A standard are clear, if you are aware of all the risks and if you have a validation strategy.
    The majority of PDF files that DANS ingests is digitized material; we tend to receive born-digital documents in Microsoft Word or other formats, which we can easily convert to PDF/A. For the born-digital PDF files, we have so far decided to stick with the original PDF because they are robust enough as they are. There is also the matter of the lack of proper tools: there may have been improvement on this front over the past years, but we are not aware of any trustworthy methods.”
    Kind regards,
    Marjan
    Dr Marjan Grootveld
    Senior policy officer
    +31(0)6 12 10 15 14
    Skype: mgrootveld1
    ***@***.***
    Office days Monday – Thursday
    DANS: Netherlands Institute for Permanent Access to Digital Research Resources
    Anna van Saksenlaan 51 | 2593 HW The Hague | P.O. Box 93067 | 2509 AB The Hague | +31 70 349 44 50 | ***@***.*** | dans.knaw.nl
    DANS is an institute of the Dutch Academy KNAW and funding organisation NWO.
    From: on behalf of yarmey
    Date: Wednesday 16 August 2017 21:56
    To: ***@***.***-groups.org
    Subject: [rda-archives-records-ig] PDF questions and ArchivesData center connections

  • #131580

    Thanks for the response, Marjan!
    And just to note, the chairs are happy for the mailing list to be used for this type of query.
    Best regards,
    Rebecca Grant
    Research Data Manager
    Open Research Group
    Springer Nature
    The Campus | Trematon Walk | Wharfdale Road | London N1 9FN
    T: +44 (0) 2070144273
    ***@***.***
    http://www.springernature.com
    ORCiD: 0000-0002-7614-0806
    From: marjan.grootveld=***@***.***-groups.org [mailto:***@***.***-groups.org] On Behalf Of mgrootveld
    Sent: 17 August 2017 11:58
    To: ***@***.***-groups.org
    Cc: Valentijn Gilissen
    Subject: Re: [rda-archives-records-ig] PDF questions and ArchivesData center connections
