Searching for Similar Images with the LIRE Library
阿新 • Published: 2019-02-02
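What is LIRE
LIRE (Lucene Image REtrieval) provides a simple way to build a Lucene index from image features. With that index you can build a content-based image retrieval (CBIR) system that searches for similar images. LIRE's classic features are taken from the MPEG-7 standard: ScalableColor, ColorLayout, and EdgeHistogram. It also ships other descriptors; the test code below uses CEDD. In addition, the library provides a way to search the index.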
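To make the idea of content-based retrieval concrete, here is a minimal sketch that describes each image by a coarse color histogram and compares two images by the distance between their histograms. This is not one of LIRE's descriptors and does not use its API; the class HistogramSimilarityDemo, its methods, and the 64-bin layout are assumptions chosen only for illustration.

```java
import java.awt.image.BufferedImage;

/** Toy content-based comparison: coarse RGB histogram plus L1 distance (not a LIRE feature). */
class HistogramSimilarityDemo {

    /** Builds a normalized histogram with 4 bins per RGB channel (64 bins in total). */
    static double[] histogram(BufferedImage img) {
        double[] bins = new double[64];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = ((rgb >> 16) & 0xFF) / 64;
                int g = ((rgb >> 8) & 0xFF) / 64;
                int b = (rgb & 0xFF) / 64;
                bins[r * 16 + g * 4 + b]++;
            }
        }
        double total = (double) img.getWidth() * img.getHeight();
        for (int i = 0; i < bins.length; i++) {
            bins[i] /= total;
        }
        return bins;
    }

    /** L1 distance between two histograms: smaller means more similar. */
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    /** Creates a single-color 16x16 test image so the demo needs no files on disk. */
    static BufferedImage solid(int rgb) {
        BufferedImage img = new BufferedImage(16, 16, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 16; y++) {
            for (int x = 0; x < 16; x++) {
                img.setRGB(x, y, rgb);
            }
        }
        return img;
    }

    public static void main(String[] args) {
        double[] red = histogram(solid(0xFF0000));
        double[] darkRed = histogram(solid(0xC80000));
        double[] blue = histogram(solid(0x0000FF));
        System.out.println("red vs dark red: " + distance(red, darkRed));
        System.out.println("red vs blue:     " + distance(red, blue));
    }
}
```

A real descriptor captures far more than a 64-bin histogram, but the workflow is the same: extract a feature vector per image, store it, and rank the stored vectors by their distance to the query image's vector.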
The code implementation follows.
Code structure
The Gradle dependencies are:
dependencies {
    compile fileTree(dir: 'libs', include: ['*.jar'])
    testCompile group: 'junit', name: 'junit', version: '4.11'
    compile group: 'us.codecraft', name: 'webmagic-core', version: '0.7.3'
    // https://mvnrepository.com/artifact/us.codecraft/webmagic-extension
    compile group: 'us.codecraft', name: 'webmagic-extension', version: '0.7.3'
    compile group: 'commons-io', name: 'commons-io', version: '2.6'
    compile group: 'org.apache.lucene', name: 'lucene-core', version: '6.4.0'
    compile group: 'org.apache.lucene', name: 'lucene-analyzers-common', version: '6.4.0'
    compile group: 'org.apache.lucene', name: 'lucene-queryparser', version: '6.4.0'
    // https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient
    compile group: 'org.apache.httpcomponents', name: 'httpclient', version: '4.5.6'
}
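Crawling image samples
The WebMagic crawler is used to fetch app icons from the Huawei app store as samples. For how to use WebMagic, see "Crawling App Store Application Information with WebMagic".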
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

/**
 * @author wzj
 * @create 2018-07-17 22:06
 **/
public class AppStoreProcessor implements PageProcessor
{
    // Part 1: crawl configuration for the site, including encoding, crawl interval, and retry count
    private Site site = Site.me().setRetryTimes(5).setSleepTime(1000);

    public void process(Page page)
    {
        // Get the app name
        String name = page.getHtml().xpath("//p/span[@class='title']/text()").toString();
        page.putField("appName", name);

        // Get the icon download URL
        String downloadIconUrl = page.getHtml().xpath("//img[@class='app-ico']/@src").toString();
        page.putField("downloadIconUrl", downloadIconUrl);

        if (name == null || downloadIconUrl == null)
        {
            //skip this page
            page.setSkip(true);
        }

        // Collect the other links on the page and queue the app detail pages
        Selectable links = page.getHtml().links();
        page.addTargetRequests(links.regex("(http://app.hicloud.com/app/C\\d+)").all());
    }

    public Site getSite()
    {
        return site;
    }

    public static void main(String[] args)
    {
        Spider.create(new AppStoreProcessor())
                .addUrl("http://app.hicloud.com")
                .addPipeline(new MyPipeline())
                .thread(20)
                .run();
    }
}
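The link filter in process() decides which pages the crawler visits next, so it helps to see what the pattern keeps. The sketch below applies the same regular expression with plain java.util.regex, outside of WebMagic. It assumes find-style matching that returns the captured group, which may differ in detail from WebMagic's own selector, and the sample URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Shows which links survive the app-detail pattern used in AppStoreProcessor. */
class LinkFilterDemo {

    // Same pattern string as in AppStoreProcessor
    static final Pattern APP_DETAIL = Pattern.compile("(http://app.hicloud.com/app/C\\d+)");

    /** Keeps only the matching part of each link that contains an app detail URL. */
    static List<String> filter(List<String> links) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            Matcher m = APP_DETAIL.matcher(link);
            if (m.find()) {
                kept.add(m.group(1));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical links, for illustration only
        List<String> links = Arrays.asList(
                "http://app.hicloud.com/app/C10000001",
                "http://app.hicloud.com/soft/list",
                "http://app.hicloud.com/app/C10000002?from=home");
        for (String link : filter(links)) {
            System.out.println(link);
        }
    }
}
```

Only links of the form /app/C followed by digits are kept, and anything after the digits, such as a query string, is dropped because only the captured group is returned.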
AppStoreProcessor extracts the icon download URL from each page. A custom Pipeline then saves the app icons, using Apache's HttpClient package to download the images:
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.io.*;
import java.nio.file.Paths;

/**
 * @author wzj
 * @create 2018-07-17 22:16
 **/
public class MyPipeline implements Pipeline
{
    /**
     * Directory where files are saved, under the resources directory
     */
    private static final String saveDir = MyPipeline.class.getResource("/conf/image").getPath();

    /*
     * Running count of processed apps
     */
    private int count = 1;

    /**
     * Process extracted results.
     *
     * @param resultItems resultItems
     * @param task task
     */
    public void process(ResultItems resultItems, Task task)
    {
        String appName = resultItems.get("appName");
        String downloadIconUrl = resultItems.get("downloadIconUrl");

        try
        {
            saveIcon(downloadIconUrl, appName);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }

        System.out.println(String.valueOf(count++) + " " + appName);
    }

    public void saveIcon(String downloadUrl, String appName) throws IOException
    {
        File file = Paths.get(saveDir, appName + ".png").toFile();

        // try-with-resources closes the client, response, and streams even if the download fails
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(downloadUrl)))
        {
            HttpEntity entity = response.getEntity();
            try (InputStream input = new BufferedInputStream(entity.getContent());
                 FileOutputStream output = new FileOutputStream(file))
            {
                byte[] imgByte = new byte[1024 * 2];
                int len;
                while ((len = input.read(imgByte, 0, imgByte.length)) != -1)
                {
                    output.write(imgByte, 0, len);
                }
            }
        }
    }
}
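One thing saveIcon leaves open is what happens when an app name contains characters that are not valid in a file name, since the name is used directly as the file name. Below is a minimal sketch of one way to handle that. The helpers safeFileName and save are hypothetical and are not part of WebMagic or HttpClient; the set of replaced characters is an assumption based on what Windows file names disallow. The stream is a stand-in for the HTTP response body.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Saves a downloaded stream under a file name derived from the app name. */
class IconSaveDemo {

    /** Hypothetical helper: replaces characters that Windows does not allow in file names. */
    static String safeFileName(String appName) {
        return appName.trim().replaceAll("[\\\\/:*?\"<>|]", "_");
    }

    /** Copies the stream to dir/<safe name>.png and closes the stream even on failure. */
    static Path save(InputStream in, Path dir, String appName) throws IOException {
        Path target = dir.resolve(safeFileName(appName) + ".png");
        try (InputStream input = in) {
            Files.copy(input, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("icons");
        // Placeholder bytes standing in for a downloaded icon
        byte[] fakeIcon = "not a real PNG".getBytes(StandardCharsets.UTF_8);
        Path saved = save(new ByteArrayInputStream(fakeIcon), dir, "Maps: Navi/Transit");
        System.out.println(saved.getFileName() + " (" + Files.size(saved) + " bytes)");
    }
}
```

Note that two different app names can map to the same sanitized name, in which case the later icon overwrites the earlier one; appending an ID to the name would avoid that.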
Note: the Huawei app store may have an anti-crawling mechanism; each run could only fetch about 1,000 icons.
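LIRE test code
Note: in the class, IMAGE_PATH specifies the image directory and INDEX_PATH specifies where the index is stored. After copying the code, change both paths to match your machine.
The indexImages method builds the index; the searchSimilarityImage method finds the most similar images and prints their similarity values.
The first argument of the GenericFastImageSearcher constructor specifies how many of the top similar images to return. I set it to 5, so it finds the 5 most similar images.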
ImageSearcher searcher = new GenericFastImageSearcher(5, CEDD.class);
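To illustrate what "top 5" means in practice, the sketch below keeps only the N candidates with the smallest distance to the query while scanning a list of candidates. It is a generic top-N selection using a bounded heap, not LIRE's actual implementation; the class TopNDemo, the Hit type, and the sample names and distances are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Generic top-N selection by smallest distance (illustrative, not LIRE code). */
class TopNDemo {

    /** A candidate image with its distance to the query (smaller = more similar). */
    static final class Hit {
        final String name;
        final double distance;

        Hit(String name, double distance) {
            this.name = name;
            this.distance = distance;
        }
    }

    /** Returns the n hits with the smallest distance, best match first. */
    static List<Hit> topN(List<Hit> all, int n) {
        // Max-heap on distance: the worst of the current best n sits on top
        PriorityQueue<Hit> heap = new PriorityQueue<>(
                Comparator.comparingDouble((Hit h) -> h.distance).reversed());
        for (Hit h : all) {
            heap.add(h);
            if (heap.size() > n) {
                heap.poll(); // drop the current worst candidate
            }
        }
        List<Hit> best = new ArrayList<>(heap);
        best.sort(Comparator.comparingDouble(h -> h.distance));
        return best;
    }

    public static void main(String[] args) {
        // Made-up file names and distances
        List<Hit> candidates = Arrays.asList(
                new Hit("a.png", 12.5),
                new Hit("b.png", 3.0),
                new Hit("c.png", 40.2),
                new Hit("d.png", 7.1));
        for (Hit h : topN(candidates, 2)) {
            System.out.println(h.distance + ": \t" + h.name);
        }
    }
}
```

The heap never holds more than N + 1 entries, so memory stays bounded no matter how many images the index contains.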
The more similar an image is, the smaller the reported value; a value of 1.0 means it is the original image. The complete code follows:
import net.semanticmetadata.lire.builders.DocumentBuilder;
import net.semanticmetadata.lire.builders.GlobalDocumentBuilder;
import net.semanticmetadata.lire.imageanalysis.features.global.CEDD;
import net.semanticmetadata.lire.searchers.GenericFastImageSearcher;
import net.semanticmetadata.lire.searchers.ImageSearchHits;
import net.semanticmetadata.lire.searchers.ImageSearcher;
import net.semanticmetadata.lire.utils.FileUtils;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.List;

/**
 * @author wzj
 * @create 2018-07-22 11:16
 **/
public class ImageSimilarityTest
{
    /**
     * Directory where the images are stored
     */
    private static final String IMAGE_PATH = "H:\\JAVA\\ImageSim\\conf\\image";

    /**
     * Directory where the index is stored
     */
    private static final String INDEX_PATH = "H:\\JAVA\\ImageSim\\conf\\index";

    public static void main(String[] args) throws IOException
    {
        // The index must exist before searching: run indexImages() once, then comment it out
        //indexImages();
        searchSimilarityImage();
    }

    private static void indexImages() throws IOException
    {
        // Collect all image files under IMAGE_PATH, including subdirectories
        List<String> images = FileUtils.getAllImages(Paths.get(IMAGE_PATH).toFile(), true);

        GlobalDocumentBuilder globalDocumentBuilder = new GlobalDocumentBuilder(false, false);
        globalDocumentBuilder.addExtractor(CEDD.class);

        IndexWriterConfig conf = new IndexWriterConfig(new WhitespaceAnalyzer());

        // try-with-resources closes the writer even if indexing an image fails
        try (IndexWriter indexWriter = new IndexWriter(FSDirectory.open(Paths.get(INDEX_PATH)), conf))
        {
            for (String imageFilePath : images)
            {
                System.out.println("Indexing " + imageFilePath);

                BufferedImage img;
                try (FileInputStream in = new FileInputStream(imageFilePath))
                {
                    img = ImageIO.read(in);
                }
                if (img == null)
                {
                    // ImageIO.read returns null when no registered reader can decode the file
                    System.out.println("Skipping unreadable image " + imageFilePath);
                    continue;
                }

                Document document = globalDocumentBuilder.createDocument(img, imageFilePath);
                indexWriter.addDocument(document);
            }
        }

        System.out.println("Create index image successful.");
    }

    private static void searchSimilarityImage() throws IOException
    {
        // Query image
        String inputImagePath = "H:\\JAVA\\ImageSim\\conf\\image\\5.png";
        BufferedImage img = ImageIO.read(Paths.get(inputImagePath).toFile());

        try (IndexReader ir = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH))))
        {
            // Return the 5 most similar images, compared by the CEDD feature
            ImageSearcher searcher = new GenericFastImageSearcher(5, CEDD.class);
            ImageSearchHits hits = searcher.search(img, ir);

            for (int i = 0; i < hits.length(); i++)
            {
                String fileName = ir.document(hits.documentID(i)).getValues(DocumentBuilder.FIELD_NAME_IDENTIFIER)[0];
                System.out.println(hits.score(i) + ": \t" + fileName);
            }
        }
    }
}
The test results are as follows: