Efficient Parallel kNN Joins for Large Data in MapReduce

[Overview] [Papers and Talks] [Source Code] [Dataset] [Contacts]　

Overview

In data mining applications and spatial and multimedia databases, a useful tool is the kNN join, which is to produce the k nearest neighbors (NN), from a dataset S, of every point in a dataset R. Since it involves both the join and the NN search, performing kNN joins efficiently is a challenging task. Meanwhile, applications continue to witness a quick (exponential in some cases) increase in the amount of data to be processed. A popular model nowadays for large-scale data processing is the shared-nothing cluster on a number of commodity machines using MapReduce . Hence, how to execute kNN joins efficiently on large data that are stored in a MapReduce cluster is an intriguing problem that meets many practical needs. This work proposes novel (exact and approximate) algorithms in MapReduce to perform efficient parallel kNN joins on large data. We demonstrate our ideas using Hadoop. Extensive experiments in large real and synthetic datasets, with tens or hundreds of millions of records in both R and S and up to 30 dimensions, have demonstrated the efficiency, effectiveness, and scalability of our methods.

Papers and Talks

1. Efficient Parallel kNN Joins for Large Data in MapReduce

Full version: Talk:

Source Code

Important Notice

If you use this library for your work, please kindly cite our paper. Thanks!

If you find any bugs or any suggestions/comments, we are very happy to hear from you!

Library Description

The library is developed in Java for use with Hadoop 0.20.2. The code is self-explanatory. Makefiles are included to compile the library into jar files. The classpath in the makefile should be updated to refer to the Hadoop 0.20.2 jar file and the Apache Commons Logging jar file. Our library requires third party libraries please refer to readme.txt contained in the source code tarball file for detailed information.

Download

kNN Joins MapReduce Library [tar.bz2]

Quick Install

Hadoop 0.20.2 may be obtained from here and Apache Commons Logging may be obtained from here. After obtaining both, update the makefiles as mentioned above in order to compile. As an example, the library may be invoked by specifying the command:
hadoop jar knn.jar test.BPhase1
A detailed description of command line arguments is displayed.

How to apply patch file and compile code.
(Given all dependent libraries are ready.)
tar jxvf mrknnj-release-0.1.tar.bz2
cd mrknnj-release

For hbrj
mkdir rtree
cd rtree
tar jxvf ../rtree.tar.bz2
tar xvf elki.jar
patch -p1 < ../hbrj/patch-index
mv de ../hbrj/
cd ..
rm -rf rtree
cd hbrj
make elki
make all

For hzknnj
cd mrknnj-release
mkdir btree
cd btree
tar xvf ../btree.tar.bz2
gzip -d *.gz
tar xvf 02-06-04_collections.tar
cd collections/src
patch -p1 < ../../../hzknnj/patch-collections
mv com ../../../hzknnj/
cd ../..
tar xvf 02-06-04_disc.tar
cd disc/src
patch -p1 < ../../../hzknnj/patch-disc
mv com ../../../hzknnj/
cd ../..
tar xvf 02-06-04_util.tar
cd util/
mv util.jar ../../hzknnj
cd ../../hzknnj
make lib
make all

Dataset

A data set with 1 million records can be from here.

Contacts

Chi Zhang Jeffrey Jestes