How to Copy-and-Paste Text from a PDF to Word in Seconds and Make It Pretty

Now that the Florida Rules of Civil Procedure require parties to prepare their responses to written discovery (interrogatories, requests for production, and requests for admissions), lawyers and support staff need to become efficient at creating their response documents from the PDF version that was served. It can be difficult to copy and paste text from a PDF and get everything properly reformatted in Microsoft Word. You often get duplicate numbers and all kinds of stray line breaks that make everything a big mess, and reformatting it by hand is no fun. Thankfully, there’s a quick way to avoid the drudgery.

In this post, I’ll show you how to quickly create a Word document from the PDF document with the proper automatic numbering. Once you learn how to do this, you can take a 13-page, 85-item discovery request and get it into Word in about 60 seconds, all properly formatted and ready for you to add the responses. If you’re an Apple person, you’re going to have to adapt these concepts to your environment; the steps below are for Windows.

What Happens When You Try to Paste Directly into Microsoft Word

First, a little background will help explain why this process is so efficient and why pasting directly into Word causes problems. Have you ever tried to copy text from a PDF and paste it directly into Word? If so, you’ve probably gotten something that looks like this:

Word document showing poorly-formatted cut-and-paste text from PDF file.

What a mess. While you’ve gotten the text into Word, Word doesn’t recognize the existing numbers and adds its own number to each new line. The manual (re)formatting of this mess can be a slog and take a long time. The duplicate numbers must be deleted, and you have to manually delete line breaks so that all the text for one request ends up in a single numbered paragraph. Sometimes pasting directly from PDF to Word works well and you don’t have to reformat much. But when things get ugly, like above, the following process really shines.

Happily, there’s a really fast way to get text from a PDF into a Word document cleanly without any manual formatting in Word. You just need a text editor and a few pieces of code. Now don’t get worried, this is very easy, even if you’ve never used a text editor or done any coding. Just follow these steps and they should work for you.

By the way, if something happens that you didn’t expect as you go through these steps (i.e. you made a mistake), just select “undo” from the Edit menu or Control-Z, and you’ll go backwards and undo whatever happened that you didn’t expect. Review and try again.

Get Notepad++

To get started, you need to download a text editor that allows you to manipulate text with “regular expressions.” Personally, I like to use Notepad++ because it’s well-maintained and you can edit all kinds of files in it. Really versatile.

After you download Notepad++, open it and it should give you a new blank window that looks like this:

Notepad++ blank new window.

Copy The Text From the PDF

Next, you need to copy and paste the text from the PDF file into the new Notepad++ window. Open your PDF file in Acrobat or Acrobat Reader. Using Acrobat’s “Selection tool”, you can select the text you want to copy by clicking and dragging the cursor. (Where to find the tool depends on what version of Acrobat you’re using.)

Start selecting text at (and include) the number 1, and go to the end of the page–just don’t accidentally copy any header or footer. Press Control-C and the text you’ve selected will be copied to the clipboard.

PDF file starting point for cut-and-paste processing.
Acrobat select text tool.
PDF file showing selected and copied text.

Then go to your Notepad++ window and paste it there (Control-V):

Notepad++ window with cut-and-paste text from PDF prior to formatting.

Great, that’s our first chunk. See the numbers in gray on the lefthand side? Each one of those is a new line (or paragraph). Just like in Word, we’ve got a bunch of formatting issues so stuff isn’t all pretty like we want it. But Notepad++ will allow us to format it all at once very precisely, unlike Word.

By the way, if the text is running far off to the right and you have to scroll to see it, you can change that by selecting word wrap from the menu bar: View –> Word wrap.

Alright, keep going: copy each page of the PDF text you need and paste it at the end of the last chunk in Notepad++. Again, be sure to include the numbers! Continue until all of the text from the PDF is in your Notepad++ window. Note: You may need to manually add a space between chunks of copied text to make sure you don’t accidentally smush the last word of one chunk into the first word of the next.

Everything copied into Notepad++, including the numbers from the PDF? OK, now we’re ready to format it!

The Magic: RegEx

Now comes the magical part. We’re going to remove all those line breaks (and the numbers themselves) so that all the text that corresponds to one list item from the PDF is grouped together as one paragraph. When we’re done editing our text in Notepad++, we can copy and paste it from there into Word. Don’t worry that we’re removing the numbers. Word will automatically add them back when you format the text as a numbered list, and then we’ll have our pretty and properly formatted Word document. Yay.

Remove the new lines

The first step is to remove all the separate lines and get everything into a single large paragraph. In Notepad++, open the “Replace” dialog box (control-H, or from the menu bar click on Search –> Replace). Once that dialog box is open, be sure to select the bottom radio button for Regular expression:

Notepad++ replace dialogue box from menu bar.

Now we’re ready to add our code into those two boxes. Add this code to the first box, “Find what:”

\r\n

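That expression tells Notepad++ to look for all the line breaks. Technically, it says to look for a carriage return (\r) immediately followed by a line feed (\n), which is how Windows marks the end of a line. If “Replace All” reports no matches, your text may use a different line ending; in Notepad++ you can try \R instead, which matches any kind of line break. In the “Replace with” box we simply type a space. You won’t be able to see it in the box, but be sure to type it; otherwise you’ll end up with words smushed together.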

Notepad++ replace dialog box showing radio button to select regular expressions.

Click “Replace All” button, and presto! All the text will now be a single paragraph (or, more accurately, on one line).
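If you would rather script this step than click through the dialog, here is a minimal Python sketch of the same replacement. It is not part of the Notepad++ workflow: the function name join_lines and the sample text are made up for illustration, and, unlike the pattern above, it also accepts a bare \n on the assumption that the text may not have Windows line endings.

```python
import re

def join_lines(text: str) -> str:
    """Replace every line break with a single space (hypothetical helper)."""
    # \r?\n matches Windows (\r\n) and Unix (\n) line breaks alike.
    return re.sub(r"\r?\n", " ", text)

# Made-up sample standing in for text pasted from a PDF.
sample = (
    "1. Please produce all\r\n"
    "documents relating to\r\n"
    "the contract.\r\n"
    "2. Please produce all emails."
)
print(join_lines(sample))
```

As in Notepad++, the result is one long line with the original list numbers still embedded in it.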

Great, we’ve removed all those extra new lines in one fell swoop. Everything is together in one big paragraph, which you can tell by the single numeral one in gray on the lefthand side (blue arrow, below). Now we’re ready to separate out each numbered item onto its own line (make it into its own paragraph). How are we going to do that? Well, your text should still have the numbers from your PDF document, which should be one- or two-digit numbers followed by a period (red circles, below). The numbers should also be preceded by a space and followed by a space (if not, see Final Thoughts, below):

Text in Notepad++ window after first step, showing all text in single paragraph.

Remove the numbers and make each its own paragraph

Now we’re going to simultaneously delete each list number and add a return to make each item its own paragraph. This way, the text that corresponds to each numbered item from our PDF will be all together. The way we do that is with the following code in the Replace dialog box.

Find What: \s[0-9]{1,2}\.\s
Replace with: \r\n

Once again, click “Replace All.” Voila! Your one giant paragraph of text should now be multiple paragraphs of text–each represented by the line number on the lefthand side in Notepad++ and corresponding to the list number from your PDF document.

Text in Notepad++ window after second step of replacing numbers with new lines.

The above RegEx code does two things: it selects the list numbers (and only the list numbers) and replaces each with a new line. It tells Notepad++ to find a space, followed by a one- or two-digit number, followed by a period, followed by a space. When it finds that, it replaces it with a new line. If your text happens to include that exact sequence somewhere other than at a list number, you’ll get an extra line break there and will have to fix it manually. If you’ve got a really long list that includes three-digit numbers, you can modify the code by changing the 2 in the curly brackets to a 3. Also note that this code may not find the very first number, since it may not be preceded by a space. Just delete it manually.
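For comparison, the same find-and-replace can be sketched in Python. This is only an illustration under assumptions: the names ITEM_NUMBER and split_items and the sample sentence are invented here, and Python’s re module stands in for the Notepad++ Replace dialog.

```python
import re

# Same pattern as in the Replace dialog:
# whitespace, a 1- or 2-digit number, a period, whitespace.
ITEM_NUMBER = re.compile(r"\s[0-9]{1,2}\.\s")

def split_items(one_line: str) -> str:
    """Swap each list number for a line break (hypothetical helper)."""
    return ITEM_NUMBER.sub("\r\n", one_line)

# Made-up sample: one long line, as left by the first step.
sample = "1. All contracts. 2. All emails dated after 2019. 3. All invoices."
print(split_items(sample))
```

In this sample the leading “1.” is left in place because nothing precedes it, which is the same caveat noted above, and the year “2019.” is not touched because it has more than two digits.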

Again, don’t worry that we lost the numbers from the text itself, they’ll come back in Word when we cut-and-paste the text into our Word document.

Before going to the next step, you can verify that your edited/formatted list in Notepad++ corresponds to the number of items from your PDF document by matching the number on the lefthand side with the number of items in your PDF file:

Comparison of Notepad++ window after formatting with source PDF file to ensure successful formatting prior to pasting into Word document.

Does the last number on the gray lefthand side in Notepad++ match up with the number of items from your PDF? If so, move on. If not, you might have missed something in the process of copying and pasting, or there was some unique number in your text that caused a problem in the second step. Be sure that you copy and paste the text with the number from the PDF document; don’t skip copying the number or the above process won’t work!
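The same sanity check can be done in code if you are scripting the process. A minimal sketch, assuming the formatted text is already in a string; count_items is a hypothetical helper, and the expected count is something you still read off the PDF yourself.

```python
def count_items(formatted: str) -> int:
    """Count non-empty lines, one per list item (hypothetical helper)."""
    return sum(1 for line in formatted.splitlines() if line.strip())

# Made-up sample: three items, one per line.
formatted = "All contracts.\r\nAll emails.\r\nAll invoices."
expected = 3  # number of requests in the PDF, counted by hand

found = count_items(formatted)
if found == expected:
    print("Counts match")
else:
    print(f"Mismatch: found {found}, expected {expected}")
```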

Copy-and-Paste the Text into Word

The last step is to select all of our formatted text in Notepad++, copy it, and paste it into Word. Press Control-A to select all the text, or click-and-drag the cursor to select everything in Notepad++. Then press Control-C to copy.

Now in your Word document, paste the text (Control-V). If you don’t already have a style set up, you can just use Word’s list tool to quickly add numbers to your text. Word should add a number for each of the separate paragraphs, and they should match up perfectly with your PDF file.

Microsoft Word document with text pasted from Notepad++ showing proper automatic numbered list, identical to source PDF.

There you go, a fully formatted list from a PDF file into a Word document without ever hitting the delete button or doing any manual formatting in Word. Now you can easily add each response right after the corresponding discovery request.

Or, as you may also appreciate, this works really well if you need to create discovery from a form template you found somewhere (say, a West form, discovery in another lawsuit, etc.).

The time difference between this method and pasting directly into Word from the PDF and then manually formatting everything is huge. HUGE! Like an order of magnitude! (Sorry, the nerdy side of me couldn’t resist. In plain English, that means ten times.)

Final Thoughts

This should work for most situations. The other way to quickly get the information into Word might be to dictate it with software like Nuance’s Dragon Naturally Speaking (which is incredibly powerful and efficient; I use it constantly). But the above programmatic method, once you do it and familiarize yourself with it, is the most efficient way I’ve found short of asking the other side for a copy of their original Word document. And you can imagine what most lawyers’ responses to that question will be! Yeah, no.

Sometimes, depending on how the PDF file was created in the first place, copying may not result in all the garbage, and you can copy and paste cleanly from the PDF into a Word document. But frequently it’s garbage, especially with an OCR’d scan of a printed piece of paper.

Other times, the code above might not work because the numbers in the text you copied aren’t set off by whitespace the way the pattern expects. A tab by itself is not the problem, since \s matches tabs as well as spaces, but if you ever need to match only a tab you can change the backslash-s to a backslash-t: \t. Your mileage may vary, but the goal is always the same: select the right text, then replace it. Step one is to select the returns and replace them with a space. Step two is to select the numbers and replace them with a return. The above code should work for most situations. Otherwise you could ask me, and maybe I can help. Or try Regex101.com, a great sandbox for testing things.
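To make the two steps and their variations concrete, here is a hedged Python sketch that chains them and exposes the digit count as a parameter. The function name pdf_list_to_paragraphs, the max_digits parameter, and the sample text are all assumptions made for illustration. It also strips the very first number, which the Notepad++ pattern leaves for you to delete by hand. In Python’s re module, \s matches tabs as well as spaces.

```python
import re

def pdf_list_to_paragraphs(text: str, max_digits: int = 2) -> str:
    """Join the lines, then break at each list number (hypothetical helper)."""
    # Step one: replace the returns with a space (\r\n or a bare \n).
    one_line = re.sub(r"\r?\n", " ", text).strip()
    # Step two: replace each list number with a return.
    number = rf"\s[0-9]{{1,{max_digits}}}\.\s"
    paragraphs = re.sub(number, "\r\n", one_line)
    # The first number has no whitespace in front of it, so drop it separately.
    return re.sub(rf"^[0-9]{{1,{max_digits}}}\.\s", "", paragraphs)

# Made-up sample where a tab follows each number.
sample = "1.\tAll contracts\r\nand amendments.\r\n2.\tAll emails."
print(pdf_list_to_paragraphs(sample))
```

For a list that runs past 99 items, calling pdf_list_to_paragraphs(text, max_digits=3) mirrors changing the 2 to a 3 in the curly brackets.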

Anyway, I hope lawyers and support staff out there will help improve the public’s perception that we’re inefficient and over-bill clients. We need to always strive to become more efficient with the ever-evolving tools that technology gives us. Just because we bill by the hour doesn’t mean we should be deliberately inefficient with our workflows.

In future posts, I’ll share how to use Notepad++ to copy-and-paste deposition transcripts to get rid of all those pesky line numbers!

And if you happen to need a fees expert witness, contact the firm and we can evaluate whether other lawyers or support staff are being efficient in their billing.
