
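Searching for Similar Images with the LIRE Library

What is LIRE

LIRE (Lucene Image REtrieval) provides a simple way to build a Lucene index from image features. With that index you can build a content-based image retrieval (CBIR) system that searches for similar images. The features LIRE offers include descriptors from the MPEG-7 standard, namely ScalableColor, ColorLayout, and EdgeHistogram; the test code in this post uses the CEDD descriptor. The library also provides methods for searching the index.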
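
To make the CBIR idea concrete, here is a minimal sketch that uses only the JDK: each image is reduced to a feature vector, and two images are compared by the distance between their vectors. The coarse 64-bin color histogram and the L1 distance are stand-ins chosen for brevity; they are not the descriptors LIRE implements, and the class and method names are made up for this illustration.

```java
import java.awt.image.BufferedImage;

class HistogramSketch {
    /** Coarse 64-bin RGB histogram (2 bits per channel), normalized so the bins sum to 1. */
    static double[] histogram(BufferedImage img) {
        double[] bins = new double[64];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = ((rgb >> 16) & 0xFF) >> 6; // keep the top 2 bits of each channel
                int g = ((rgb >> 8) & 0xFF) >> 6;
                int b = (rgb & 0xFF) >> 6;
                bins[(r << 4) | (g << 2) | b]++;
            }
        }
        double total = (double) img.getWidth() * img.getHeight();
        for (int i = 0; i < bins.length; i++) {
            bins[i] /= total;
        }
        return bins;
    }

    /** L1 distance between two feature vectors: 0 means identical features. */
    static double l1Distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    /** Builds a small single-color test image so the example needs no files on disk. */
    static BufferedImage solid(int rgb) {
        BufferedImage img = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 8; y++) {
            for (int x = 0; x < 8; x++) {
                img.setRGB(x, y, rgb);
            }
        }
        return img;
    }

    public static void main(String[] args) {
        double[] red = histogram(solid(0xFF0000));
        double[] blue = histogram(solid(0x0000FF));
        System.out.println("red vs red:  " + l1Distance(red, red));
        System.out.println("red vs blue: " + l1Distance(red, blue));
    }
}
```

An index such as the one LIRE builds stores one feature vector per image, so a query only has to extract features from the query image and compare them against the stored vectors.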

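The rest of this post goes straight to the code.

Code structure

The Gradle dependencies are as follows. LIRE itself is not declared as a Maven coordinate here, so its jar is presumably picked up from the libs directory by the fileTree entry: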

dependencies {
    compile fileTree(dir: 'libs', include: ['*.jar'])
    testCompile group: 'junit', name: 'junit', version: '4.11'

    compile group: 'us.codecraft', name: 'webmagic-core', version: '0.7.3'
    // https://mvnrepository.com/artifact/us.codecraft/webmagic-extension
    compile group: 'us.codecraft', name: 'webmagic-extension', version: '0.7.3'

    compile group: 'commons-io', name: 'commons-io', version: '2.6'

    compile group: 'org.apache.lucene', name: 'lucene-core', version: '6.4.0'
    compile group: 'org.apache.lucene', name: 'lucene-analyzers-common', version: '6.4.0'
    compile group: 'org.apache.lucene', name: 'lucene-queryparser', version: '6.4.0'

    // https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient
    compile group: 'org.apache.httpcomponents', name: 'httpclient', version: '4.5.6'
}

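Crawling image samples

WebMagic is used to crawl app icons from the Huawei app store as sample images. For WebMagic usage, see the post "Crawling App Store Application Information with WebMagic".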

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

/**
 * @author wzj
 * @create 2018-07-17 22:06
 **/
public class AppStoreProcessor implements PageProcessor
{
    // Part 1: crawl configuration for the site, including encoding, crawl interval, retry count, etc.
    private Site site = Site.me().setRetryTimes(5).setSleepTime(1000);

    public void process(Page page)
    {
        // Extract the app name
        String name = page.getHtml().xpath("//p/span[@class='title']/text()").toString();
        page.putField("appName",name );

        String downloadIconUrl =  page.getHtml().xpath("//img[@class='app-ico']/@src").toString();
        page.putField("downloadIconUrl",downloadIconUrl );

        if (name == null || downloadIconUrl == null)
        {
            //skip this page
            page.setSkip(true);
        }

        // Collect the other links on the page and queue the app detail pages
        Selectable links = page.getHtml().links();
        page.addTargetRequests(links.regex("(http://app.hicloud.com/app/C\\d+)").all());
    }


    public Site getSite()
    {
        return site;
    }

    public static void main(String[] args)
    {
        Spider.create(new AppStoreProcessor())

                .addUrl("http://app.hicloud.com")
                .addPipeline(new MyPipeline())
                .thread(20)
                .run();
    }
}
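
The link filter in process() depends on a regular expression, so it can help to check the pattern in isolation. The sketch below uses only java.util.regex and approximates what the regex call on the page links does: keep the first capture group of every link that matches. The sample URLs and app IDs are invented, and the dots in the host name are escaped here, which is slightly stricter than the pattern in the crawler.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkFilterSketch {
    // Same shape as the crawler's pattern, with the dots escaped so they match literally.
    static final Pattern APP_PAGE = Pattern.compile("(http://app\\.hicloud\\.com/app/C\\d+)");

    /** Returns the app detail URLs found in the given links, in order. */
    static List<String> appLinks(List<String> links) {
        List<String> result = new ArrayList<>();
        for (String link : links) {
            Matcher m = APP_PAGE.matcher(link);
            if (m.find()) {
                result.add(m.group(1)); // keep only the captured part of the URL
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical links as they might appear on a listing page.
        List<String> links = Arrays.asList(
                "http://app.hicloud.com/app/C10000001",
                "http://app.hicloud.com/soft/list",
                "http://app.hicloud.com/app/C42#comments");
        System.out.println(appLinks(links));
    }
}
```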

AppStoreProcessor extracts the icon download URL from each page. A custom Pipeline then saves the app icons, using Apache HttpClient to download the images.

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.io.*;
import java.nio.file.Paths;

/**
 * @author wzj
 * @create 2018-07-17 22:16
 **/
public class MyPipeline implements Pipeline
{
    /**
     * Directory where the icons are saved (under the resources directory)
     */
    private static final String saveDir = MyPipeline.class.getResource("/conf/image").getPath();

    /*
     * Running count of processed apps
     */
    private int count = 1;


    /**
     * Process extracted results.
     *
     * @param resultItems resultItems
     * @param task        task
     */
    public void process(ResultItems resultItems, Task task)
    {
        String appName = resultItems.get("appName");
        String downloadIconUrl = resultItems.get("downloadIconUrl");

        try
        {
            saveIcon(downloadIconUrl,appName);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }

        System.out.println(String.valueOf(count++) + " " + appName);
    }

    public void saveIcon(String downloadUrl,String appName) throws IOException
    {
        File file = Paths.get(saveDir,appName + ".png").toFile();

        // try-with-resources closes the client, the response and both streams,
        // even when the download fails part-way through
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(downloadUrl));
             InputStream input = new BufferedInputStream(response.getEntity().getContent());
             OutputStream output = new FileOutputStream(file))
        {
            byte[] imgByte = new byte[1024 * 2];
            int len;
            while ((len = input.read(imgByte, 0, imgByte.length)) != -1)
            {
                output.write(imgByte, 0, len);
            }
        }
    }
}
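
One practical caveat: the pipeline uses the app name directly as the file name, and app names can contain characters that are not allowed in file names (for example `/`, `:` or `?`). The helper below is a hypothetical addition, not part of the original pipeline; it replaces such characters with an underscore before the name is passed to Paths.get.

```java
class FileNameSketch {
    /**
     * Replaces characters that are invalid in Windows or Unix file names
     * (and control characters) with '_'. Falls back to "unnamed" if nothing is left.
     */
    static String safeFileName(String appName) {
        String cleaned = appName.replaceAll("[\\\\/:*?\"<>|\\p{Cntrl}]", "_").trim();
        return cleaned.isEmpty() ? "unnamed" : cleaned;
    }

    public static void main(String[] args) {
        // Hypothetical app names, only for illustration.
        System.out.println(safeFileName("QQ/WeChat: Lite?"));
        System.out.println(safeFileName("Maps"));
    }
}
```

In saveIcon this would be used as `Paths.get(saveDir, safeFileName(appName) + ".png")`. Two different app names can still map to the same cleaned name, so a real crawler might also append an ID.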

Note: the Huawei app store may have an anti-crawling mechanism; each run only managed to fetch around 1,000 icons.

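LIRE test code

Note: IMAGE_PATH in the class is the directory that holds the images, and INDEX_PATH is where the index is stored. After copying the code you need to change both paths.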
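
If editing the constants for every machine is inconvenient, one option is to read the two directories from JVM system properties and fall back to relative defaults. This is only a sketch of that idea: the property names `imagesim.images` and `imagesim.index` and the default directories are assumptions made for this example, not something the original project defines.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class PathConfigSketch {
    /** Resolves a directory from a system property, or from the fallback if the property is unset. */
    static Path resolve(String property, String fallback) {
        return Paths.get(System.getProperty(property, fallback)).toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        // Hypothetical property names; pass them as -Dimagesim.images=... on the command line.
        Path images = resolve("imagesim.images", "conf/image");
        Path index = resolve("imagesim.index", "conf/index");
        System.out.println("images: " + images);
        System.out.println("index:  " + index);
    }
}
```

The resulting paths could then be passed to FSDirectory.open and FileUtils.getAllImages in place of the hard-coded constants.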

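The indexImages method builds the index. The searchSimilarityImage method finds the most similar images and prints their scores.

The first argument of the GenericFastImageSearcher constructor is the number of top matches to return. I set it to 5, so the 5 most similar images are returned.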

ImageSearcher searcher = new GenericFastImageSearcher(5, CEDD.class);
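
To illustrate what that first argument controls, the sketch below ranks candidates by score and keeps the N best, assuming a lower score means a closer match. It is a plain-JDK illustration of top-N selection, not a description of how GenericFastImageSearcher is implemented; the file names and scores are invented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopNSketch {
    /** Returns the names of the n entries with the smallest distance, closest first. */
    static List<String> topN(Map<String, Double> distances, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(distances.entrySet());
        entries.sort(Map.Entry.comparingByValue()); // ascending: smallest distance first
        List<String> names = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            names.add(entries.get(i).getKey());
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical distances between a query image and four indexed images.
        Map<String, Double> distances = new LinkedHashMap<>();
        distances.put("a.png", 0.0);
        distances.put("b.png", 12.5);
        distances.put("c.png", 40.0);
        distances.put("d.png", 3.2);
        System.out.println(topN(distances, 2));
    }
}
```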

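The more similar two images are, the smaller the reported score, because the score is a distance between feature vectors rather than a similarity percentage. An image identical to the query (for example the query image itself, if it is in the index) therefore gets the smallest possible score, which for a distance is 0.0 rather than 1.0. The complete code follows.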

import net.semanticmetadata.lire.builders.DocumentBuilder;
import net.semanticmetadata.lire.builders.GlobalDocumentBuilder;
import net.semanticmetadata.lire.imageanalysis.features.global.CEDD;
import net.semanticmetadata.lire.searchers.GenericFastImageSearcher;
import net.semanticmetadata.lire.searchers.ImageSearchHits;
import net.semanticmetadata.lire.searchers.ImageSearcher;
import net.semanticmetadata.lire.utils.FileUtils;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Iterator;
import java.util.List;


/**
 * @author wzj
 * @create 2018-07-22 11:16
 **/
public class ImageSimilarityTest
{
    /**
     * Directory that holds the images
     */
    private static final String IMAGE_PATH = "H:\\JAVA\\ImageSim\\conf\\image";

    /**
     * Directory where the index is stored
     */
    private static final String INDEX_PATH = "H:\\JAVA\\ImageSim\\conf\\index";


    public static void main(String[] args) throws IOException
    {
        // Run indexImages() once to build the index, then comment it out again for later searches
        //indexImages();
        searchSimilarityImage();
    }

    private static void indexImages() throws IOException
    {
        List<String> images = FileUtils.getAllImages(Paths.get(IMAGE_PATH).toFile(), true);

        GlobalDocumentBuilder globalDocumentBuilder = new GlobalDocumentBuilder(false, false);
        globalDocumentBuilder.addExtractor(CEDD.class);

        IndexWriterConfig conf = new IndexWriterConfig(new WhitespaceAnalyzer());
        IndexWriter indexWriter = new IndexWriter(FSDirectory.open(Paths.get(INDEX_PATH)), conf);

        for (String imageFilePath : images)
        {
            System.out.println("Indexing " + imageFilePath);

            // Close each stream after reading so file handles are not leaked on large image sets
            try (FileInputStream in = new FileInputStream(imageFilePath))
            {
                BufferedImage img = ImageIO.read(in);
                Document document = globalDocumentBuilder.createDocument(img, imageFilePath);
                indexWriter.addDocument(document);
            }
        }

        indexWriter.close();

        System.out.println("Create index image successful.");
    }

    private static void searchSimilarityImage() throws IOException
    {
        IndexReader ir = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));
        ImageSearcher searcher = new GenericFastImageSearcher(5, CEDD.class);

        String inputImagePath = "H:\\JAVA\\ImageSim\\conf\\image\\5.png";
        BufferedImage img = ImageIO.read(Paths.get(inputImagePath).toFile());

        ImageSearchHits hits = searcher.search(img, ir);


        for (int i = 0; i < hits.length(); i++)
        {
            String fileName = ir.document(hits.documentID(i)).getValues(DocumentBuilder.FIELD_NAME_IDENTIFIER)[0];
            System.out.println(hits.score(i) + ": \t" + fileName);
        }

        // Release the index files once the results have been printed
        ir.close();
    }
}

The test results are as follows:

Source code download