Indexing PDFs and Word documents into MySQL

This is really cool: the idea behind these scripts is to index the contents of PDF and .doc files in a MySQL database so they can be searched later. It is a bit convoluted, but it comes in handy in many situations, and I am publishing it because I don't think there is much of this on the internet.

Since I am going to index a lot of files and then search them, it is essential to keep each project in its own tables; otherwise MySQL takes too long to find matches (with FULLTEXT searches).

The project table holds the list of projects to index. We create it like this:

CREATE TABLE `project` (
`id_proj` int(10) unsigned NOT NULL,
`nombre` char(50) default NULL,
`tabla` char(50) NOT NULL,
PRIMARY KEY  (`id_proj`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

Now we insert a project:

insert into project values ('13','pruebaproyecto','pruebaproyecto');

and create the tables for that project:

CREATE TABLE `pruebaproyecto_files` (
`id_files` int(10) unsigned NOT NULL auto_increment,
`ruta` varchar(200) NOT NULL default '',
`nombre` varchar(200) NOT NULL default '',
PRIMARY KEY  (`id_files`),
KEY `rutai` (`ruta`(50)),
KEY `nombrei` (`nombre`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `pruebaproyecto_lineas` (
`id_lines` int(10) unsigned NOT NULL,
`pagina` int(10) unsigned NOT NULL,
`id_files` int(10) unsigned NOT NULL,
`texto` text NOT NULL,
PRIMARY KEY  (`pagina`,`id_files`,`id_lines`),
FULLTEXT KEY `textoi` (`texto`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
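
Since every project needs its own pair of tables with the same layout, a second project can probably be set up by copying the structure of the first one. This is only a sketch: the project name otroproyecto and the id 14 are made up for illustration, and it assumes the pruebaproyecto tables above already exist. CREATE TABLE ... LIKE copies the column definitions and the indexes, including the FULLTEXT one.

```sql
-- Hypothetical second project: 'otroproyecto' and id 14 are example values.
INSERT INTO project (id_proj, nombre, tabla)
VALUES (14, 'otroproyecto', 'otroproyecto');

-- Copy columns and indexes (including FULLTEXT) from the existing project tables.
CREATE TABLE `otroproyecto_files`  LIKE `pruebaproyecto_files`;
CREATE TABLE `otroproyecto_lineas` LIKE `pruebaproyecto_lineas`;
```

The value stored in the tabla column is what the scripts use as the table-name prefix, so it has to match the prefix of the two new tables.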

Now the scripts that do the work:
--indexa_dir.pl--

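#!/usr/bin/perl
# Dependencies
use strict;
use warnings;
use File::Find;

my $ruta = $ARGV[0];   # directory to index (use an absolute path)
my $user = $ARGV[1];   # project name, as stored in project.tabla
main();

##########
## MAIN ##
##########
sub main
{
    find(\&indexa_file, $ruta);
}

# Called by File::Find for every entry; hands .doc and .pdf files to indexa_txt.pl
sub indexa_file {
    my $file = $_;
    my $dir  = $File::Find::dir;
    return unless -f $file && $file =~ /\.(?:doc|pdf)$/;
    #print $file."\n";
    # The list form of system() bypasses the shell, so no escaping is needed here;
    # indexa_txt.pl escapes its own arguments.
    system('/usr/sbin/indexa_txt.pl', $dir, $file, $user);
}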

--EOF--
--indexa_txt.pl--

#!/usr/bin/perl
# Dependencies
use DBI;
$ENV{'PATH'} = $ENV{'PATH'}.":/usr/kerberos/sbin:/usr/kerberos/bin:/usr/lib/courier-imap/sbin:/usr/lib/courier-imap/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin:/usr/share/texmf/bin/";
#system('PATH=$PATH:"/usr/share/texmf/bin/"');
# Database name
$db='txt';
# Database user
$user='root';
# Database password
$pass='XXXX';
# File to index: directory, file name and project
$proj_ph=$ARGV[0];
$proj_nombre=$ARGV[2];
$proj_ph=~s/\/$//g;
# Backslash-escape everything except word characters, ".", "/" and "-",
# because these names are passed to shell commands below
$proj_ph=~s/([^\w.\/-])/\\$1/g;
$file_na=$ARGV[1];
$file_na=~s/([^\w.\/-])/\\$1/g;
$fullname=$proj_ph.'/'.$file_na;
$wvPDF='/usr/bin/wvPDF';
$pdftk='/usr/bin/pdftk';
$mensaje="";
# pdftotext options
$xpdf_opt="-layout";
# pdftotext path
$pht_xpdf="/usr/bin/pdftotext";
$ext=substr($file_na,length($file_na)-3,3);
db_connect();
# Make sure the project exists before touching its tables
$sth = $dbh->prepare("SELECT count(*) FROM project WHERE tabla = ?");
$sth->execute($proj_nombre) or die("Couldn't query the project table: " . $dbh->errstr);
@rows = $sth->fetchrow_array;
if ($rows[0] < 1) {
    print "The project $proj_nombre does not exist\n";
    exit;
}
main() if ($file_na =~ /\.doc$/ || $file_na =~ /\.pdf$/);

##########
## MAIN ##
##########
sub main
{
    $rand = randomStr(30);
    $rand = 'ind_'.$rand;
    mkdir "/tmp/$rand/";
    if ( $ext eq 'doc' ) {
        # first convert it to PDF
        system("$wvPDF $fullname /tmp/$rand/$rand.pdf > /dev/null");
    }
    else {
        system("cp $fullname /tmp/$rand/$rand.pdf > /dev/null");
    }
    # split the file into one PDF per page
    chdir("/tmp/$rand");
    system("$pdftk $rand.pdf burst > /dev/null");
    @file = sort glob("pg*.pdf");
    # insert the file into the _files table
    $id_files = insert_file($proj_ph, $file_na);
    # prepare the insert once and execute it for every line
    $sth_line = $dbh->prepare("INSERT DELAYED INTO ".$proj_nombre."_lineas values (?, ?, ?, ?);") or die("Couldn't prepare statement: " . $dbh->errstr);
    foreach my $f (@file) {
        # pg_0007.pdf is page 7 and its text goes to pg_0007.txt
        ($page = $f) =~ s/^pg_0*(\d+)\.pdf$/$1/;
        ($out  = $f) =~ s/\.pdf$/.txt/;
        system("$pht_xpdf $xpdf_opt $f $out > /dev/null");
        # insert the lines into the _lineas table
        $renglon = 0;
        open(DATOS, "<", $out) or do { warn "Could not read $out\n"; next; };
        while (<DATOS>) {
            $renglon++;
            chomp;
            $sth_line->execute($renglon, $page, $id_files, $_);
        }
        close(DATOS);
    }
    # remove temporary files
    system("rm -rf /tmp/$rand");
}
sub db_connect
{
$dsn = "DBI:mysql:database=".$db.";host=localhost;port=3306";
$dbh = DBI->connect($dsn, $user, $pass ) or die("Could not connect!");
}

sub usage {
    print "indexa_txt.pl directory file_name project\nDeveloped by Matias Neiff\n";
}

# Returns the id of the file's row in <project>_files, creating the row if needed.
# If the file was already indexed, its old lines are deleted so it can be re-indexed.
sub insert_file {
    my ($ruta, $file) = @_;
    # undo the shell escaping before storing the names
    $ruta =~ s/\\//g;
    $file =~ s/\\//g;
    my $sth = $dbh->prepare("select id_files from ".$proj_nombre."_files where ruta = ? and nombre = ?;") or die("Couldn't prepare statement: " . $dbh->errstr);
    $sth->execute($ruta, $file) or die("Couldn't execute statement: " . $dbh->errstr);
    my ($pk) = $sth->fetchrow_array;
    if (not $pk) {
        $sth = $dbh->prepare("INSERT INTO ".$proj_nombre."_files (ruta,nombre) values (?, ?);") or die("Couldn't prepare statement: " . $dbh->errstr);
        $sth->execute($ruta, $file) or die("Couldn't execute statement: " . $dbh->errstr);
        # LAST_INSERT_ID() is tracked per connection, so no FROM clause is needed
        ($pk) = $dbh->selectrow_array("select last_insert_id()");
    } else {
        $sth = $dbh->prepare("delete from ".$proj_nombre."_lineas where id_files = ?;") or die("Couldn't prepare statement: " . $dbh->errstr);
        $sth->execute($pk) or die("Couldn't execute statement: " . $dbh->errstr);
    }
    return $pk;
}

# Returns a random string of lowercase letters and digits (default length 10)
sub randomStr {
    my $password = '';
    my $password_length = $_[0] || 10;

    my @chars = ('a'..'z', 0..9);

    for (my $i = 0; $i < $password_length; $i++) {
        # rand(@chars) always yields a valid index into @chars
        $password .= $chars[int(rand @chars)];
    }
    return $password;
}

--EOF--

Then, whenever we want to index a directory and its subdirectories, we run:
/usr/sbin/indexa_dir.pl /var/directorio/a/indexar pruebaproyecto

The result?

Every line of every document is loaded into the database, keyed by file and page. This is very useful, for example, for finding regionalisms, but also for searching for anything else 😉
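
As a rough idea of what a search could look like once a project is indexed, here is a minimal sketch of a full-text query against the example project. The search string is just a placeholder, and the query assumes the pruebaproyecto tables defined earlier.

```sql
-- Find every indexed line that matches the search text and show
-- which file and page it came from. 'texto a buscar' is a placeholder.
SELECT f.ruta, f.nombre, l.pagina, l.id_lines, l.texto
FROM pruebaproyecto_lineas AS l
JOIN pruebaproyecto_files  AS f ON f.id_files = l.id_files
WHERE MATCH (l.texto) AGAINST ('texto a buscar')
ORDER BY f.ruta, f.nombre, l.pagina, l.id_lines;
```

Keep in mind that, with the default MyISAM settings, words shorter than four characters are not indexed, and in this natural-language mode words that appear in more than half of the rows are ignored. If that gets in the way, AGAINST ('+palabra' IN BOOLEAN MODE) is worth trying, since boolean mode does not apply the 50% rule.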

I hope you find it useful.

Regards.

Tags: .doc mysql, bash script, dividir pdf, indexar archivos doc, indexar pdf, indexar word, mysql, pdf, pdf mysql, pdftk, perl, perl script, word mysql, wvPDF

This entry was posted on Thursday, December 17th, 2009 at 12:46 pm and is filed under Scripting, Servidores Linux.
