Posted | |
---|---|
Parent Directory | 2014-05-07 14:16 |
My goal with this blog post is to show you how I created the solution so that you’re better equipped to roll your own.
Relevanssi is an excellent WordPress search plugin. A client asked me recently if we could make Relevanssi search PDFs. He found that SearchWP searches content of PDFs IF you install xpdf on your server and allow php’s exec() function. Installing XPDF seemed straight-forward but allowing exec() seemed like a bad idea from a security standpoint. So I decided to stick with Relevanssi and roll my own PDF content-extracting solution.
In a nutshell: I extract content from each uploaded PDF and store it in a post meta of the post or page to which the PDF was attached. I told Relevanssi to index content in that post meta.
I knew I could extract the content of a PDF document using xpdf with a simple command (Once I installed it from here: http://www.foolabs.com/xpdf/download.html):
pdftotext -nopgbrk document.pdf document.pdf.txt
Since without exec() enabled I couldn’t have WordPress invoke xpdf on upload I needed a way to queue all documents that needed to be converted to text files and then create a scheduled task that would do so.
First though: The table I used for this queueing:
CREATE TABLE IF NOT EXISTS `document_index_queue` ( `pkid` int(11) NOT NULL AUTO_INCREMENT, `post_id` bigint(20) unsigned DEFAULT NULL, `file_path` varchar(1000) DEFAULT NULL, `status` enum('to_be_processed','in_process','processed','failed') DEFAULT 'to_be_processed', `attachment_id` bigint(20) unsigned NOT NULL, PRIMARY KEY (`pkid`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Note that when you upload a file through WordPress it gets its own entry in wp_posts (I’m assuming your db prefix was left default as wp_) with the post_type of ‘attachment’. I am storing this value in the attachment_id column. If the uploaded file is associated with a page or a post it will have post_parent pointing to the ID of that post/page. I’m storing this value in post_id.
If you use the Document Revisions plugin, the post_id will be the main document entry.
Next I needed to make sure that each uploaded PDF document gets added to the queue. I rolled the following function as a plugin and activated it.
/** * Queue uploaded PDF for indexing * * @global class $wpdb * @param string $file Path to the file * @param int $attachment_id Post ID of the attachment * @return string */ function cs_queue_pdf_for_indexing( $file, $attachment_id ) { global $wpdb; if ( false !== strpos( strtolower( $file ), '.pdf' ) ) { $post = get_post( $attachment_id ); $wpdb->insert( 'document_index_queue', array( 'post_id' => $post->post_parent, 'attachment_id' => $attachment_id, 'file_path' => $file ), array( '%d', '%d', '%s' ) ); } return $file; } add_filter( 'update_attached_file', 'cs_queue_pdf_for_indexing', 10, 2 );
Now that we have our queue getting populated we need to schedule a task to process the queue. Based on the path to wp-config.php you can see that I have this file in a subfolder of the WordPress install.
require_once( '../wp-config.php' ); global $wpdb; $to_be_processed = $wpdb->get_results( "SELECT attachment_id, file_path FROM document_index_queue WHERE status = 'to_be_processed'"); if ( ! $fp = fopen('../pdfs.txt', 'w') ) die( 'Cannot open file.' ); $in_process_ids = array(); foreach ( $to_be_processed as $file ) { if ( file_exists( $file->file_path ) ) { fwrite( $fp, $file->file_path . "\n" ); $in_process_ids[] = $file->attachment_id; } } fclose( $fp ); $in_process_csv = implode( ',', $in_process_ids ); $wpdb->query( "UPDATE document_index_queue SET status = 'in_process' WHERE attachment_id IN ({$in_process_csv})");
At the end we will have a file called pdfs.txt in the root of the WordPress install with one file path per line. Next in the sequence of the scheduled task is converting the PDF files to text files.
for /F "tokens=*" %%A in (pdfs.txt) do pdftotext -nopgbrk -eol dos %%A %%A.txt
Now we’ll have our text files alongside the original PDF files in their respective folders. So for example wp-content/uploads/2014/4/document.pdf will have wp-content/uploads/2014/4/document.pdf.txt.
Next, we need to read all those text files and push their content to the post meta table, pointing to the right posts or pages.
<?php require_once( '../wp-config.php' ); global $wpdb; $to_be_processed = $wpdb->get_results( "SELECT * FROM document_index_queue WHERE status = 'in_process'"); // See what is alread indexed $indexed_ids_a = $wpdb->get_results( "SELECT post_id AS attachment_id, meta_value AS parent_id FROM wp_postmeta WHERE meta_key = 'txt_meta_id' " ); $indexed_post_ids = array(); $indexed_attachment_ids = array(); if ( $indexed_ids_a ) { foreach ( $indexed_ids_a as $indexed_row ) { $indexed_post_ids[ $indexed_row->attachment_id ] = $indexed_row->parent_id; } $indexed_attachment_ids = array_keys( $indexed_row->attachment_id ); } // Process files $fq = fopen( 'pdftotextquery.sql', 'w' ); $fr = fopen( 'relevansi_ids.txt', 'w' ); foreach ( $to_be_processed as $file ) { $txt_filepath = $file->file_path . '.txt'; if ( file_exists( $txt_filepath ) ) { $handle = fopen( $txt_filepath, 'r' ); $contents = ''; while ( ! feof( $handle ) ) { $contents .= fread( $handle, 8192 ); } fclose( $handle ); $contents = mb_convert_encoding( $contents, 'UTF-8' ); if ( empty( $contents ) ) { $status = 'failed'; } else { if ( empty( $file->post_id ) ) $file->post_id = $file->attachment_id; if ( in_array( $file->attachment_id, $indexed_attachment_ids ) ) { $query = $wpdb->prepare( "UPDATE `wp_postmeta` SET `meta_value` = %s WHERE post_id = %d AND meta_key = 'txt_contents' );", $contents, $file->post_id); fwrite( $fq, $query . "\r\n" ); $status = 'processed'; } else { $query = $wpdb->prepare( "INSERT INTO `wp_postmeta` ( `post_id`, `meta_key`, `meta_value` ) VALUES ( %d, 'txt_contents', %s );", $file->post_id, $contents ); fwrite( $fq, $query . "\r\n" ); $query = $wpdb->prepare( "INSERT INTO `wp_postmeta` ( `post_id`, `meta_key`, `meta_value` ) VALUES ( %d, 'txt_meta_id', LAST_INSERT_ID() );", $file->attachment_id ); fwrite( $fq, $query . "\r\n" ); $status = 'processed'; } fwrite( $fr, $file->post_id . ";" ); } } else { $status = 'failed'; } fwrite( $fq, "UPDATE document_index_queue SET status = '{$status}' WHERE pkid = {$file->pkid}; \r\n" ); echo "UPDATE document_index_queue SET status = '{$status}' WHERE pkid = {$file->pkid}\r\n"; } fclose( $fr ); fclose( $fq );
Notice that this file does not execute any write queries. This is not to overwhelm the server. Instead, the script writes the queries into a SQL file which our scheduled task will import all at once, much more efficiently.
The two meta keys we used were txt_contents and txt_meta_id.
The txt_contents is stored with the post/page the PDF is attached to. This way when you search for something that is contained in the PDF the search will return the page to which the PDF is attached. Uploaded PDFs that do not have a parent post/page have this value stored directly with them. You can modify this behavior but this is how my client wanted it.
The txt_meta_id is stored with the attachment itself and it points to wp_usermeta.umeta_id of where the corresponding txt_contents live. This is how we know what to remove when the user deletes the PDF.
So now let’s import that SQL file:
mysql -u mysqluser -p mysqlpass mysqldatabase < pdftotextquery.sql
The import_parsed_files.php above also created a file called relevansi_ids.txt that contains the post IDs we need to have Relevanssi reindex. Here’s a script that takes care of that:
<?php $file = 'relevansi_ids.txt'; require '../wp-config.php'; if ( ! file_exists( $file ) ) die( 'File does not exist' ); $fr = fopen( $file, 'r' ); if ( false === $fr ) die( 'could not open file' ); $contents = ''; while ( ! feof( $fr ) ) { $contents .= fread( $fr, 8192 ); } fclose( $fr ); if ( empty( $contents ) ) die( 'No contents read' ); $post_ids = explode( ';', $contents ); foreach ( $post_ids as $post_id ) { relevanssi_publish( $post_id ); }
And that’s that. Well… almost. When a document is deleted we also need to remove the post meta we created for the post/page it is attached to. The following code is in my plugin to take care of that:
function cs_delete_attachment_txt( $attachment_id ) { global $wpdb; $meta_id = get_post_meta( $attachment_id, 'txt_meta_id', true ); if ( $meta_id ) { $wpdb->query( $wpdb->prepare( "DELETE FROM $wpdb->postmeta WHERE meta_id = %d", $meta_id ) ); } } add_action( 'delete_attachment', 'cs_delete_attachment_txt' );
Lastly, I will share a script that I ran to process all existing PDFs on the website:
<?php require_once( '../wp-config.php' ); global $wpdb; $to_be_processed = $wpdb->get_results( "SELECT p.ID, p.post_parent, pm.meta_value FROM wp_posts p INNER JOIN wp_postmeta pm ON pm.post_id = p.ID WHERE p.post_mime_type = 'application/pdf' AND pm.meta_key = '_wp_attached_file'"); foreach( $to_be_processed as $row ) { $file = ABSPATH . 'wp-content/uploads/' . $row->meta_value; $wpdb->insert( 'document_index_queue', array( 'post_id' => $row->post_parent, 'attachment_id' => $row->ID, 'file_path' => $file ), array( '%d', '%d', '%s' ) ); }
This website is powered by WordPress, using the IndexOf theme.
Weston Ruter wrote:
Clever approach!
Have you considered using WP-CLI for doing your shell scripting? You could create custom commands, and then leverage `wp db query` to run your SQL instead of having to invoke the mysql client manually with the necessary connection info.
Anyway, I think it’s better to require wp-load.php instead of wp-config.php.
BTW, love your theme!
Posted on 16-May-14 at 4:58 pm | Permalink
Andrej wrote:
Thanks for the feedback Weston! I’m having way too much fun with WP-CLI this evening. You’re absolutely right about the right way to include WP.
Posted on 16-May-14 at 9:53 pm | Permalink