search2009.06.05 13:45
Posted by myditto
search2009.04.22 15:57
search2009.03.03 11:35

Leo Breiman and Adele Cutler

RF is an example of a tool that is useful in doing analyses of scientific data.
But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem.
Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem.


We assume that the user knows about the construction of single classification trees. Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).

Each tree is grown as follows:

  1. If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This sample will be the training set for growing the tree.
  2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
  3. Each tree is grown to the largest extent possible. There is no pruning.

In the original paper on random forests, it was shown that the forest error rate depends on two things:

  • The correlation between any two trees in the forest. Increasing the correlation increases the forest error rate.
  • The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of the individual trees decreases the forest error rate.

Reducing m reduces both the correlation and the strength. Increasing it increases both. Somewhere in between is an "optimal" range of m - usually quite wide. Using the oob error rate (see below) a value of m in the range can quickly be found. This is the only adjustable parameter to which random forests is somewhat sensitive.

Features of Random Forests

  • It is unexcelled in accuracy among current algorithms.
  • It runs efficiently on large data bases.
  • It can handle thousands of input variables without variable deletion.
  • It gives estimates of what variables are important in the classification.
  • It generates an internal unbiased estimate of the generalization error as the forest building progresses.
  • It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
  • It has methods for balancing error in class population unbalanced data sets.
  • Generated forests can be saved for future use on other data.
  • Prototypes are computed that give information about the relation between the variables and the classification.
  • It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) give interesting views of the data.
  • The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
  • It offers an experimental method for detecting variable interactions.

Posted by myditto
search2009.02.16 15:12
While an overly complex system may allow perfect classification of the training samples, it is unlikely perform well on new patterns. This situation is Known as overfitting. One of the most important areas of research in statistical pattern classification is determining how to adjust the complexity of the model - not so simple that it can not explain the difference between the categories, yet not so complex as to give poor classification on novel patterns. Are there principled methods for finding the best complexity for a classifier?

from  prof. Richard Duda's book 
Posted by myditto
search2009.01.22 15:50

from :

What is the future of mobile internet usage?

March 17, 2008

The world of Mobile search is evolving extremely fast. Beginning with keyword-based search, going through the next step - voice search, now the end user is offered to send a photo by his cell phone in order to find relevant to his photo’s query information in Internet.

Mobile Search is a developing branch that allows users to find mobile content interactively on mobile websites. With the years, mobile content has changed its media direction towards mobile multimedia. Nevertheless, mobile search is not just a simple shift of PC web search to mobile equipment, but it is connected to specialized segments of mobile broadband and mobile content, both of which have been fast-paced evolving recently.

The major search engines are aggressively trying to create applications and relationships in order to take advantage of a mobile ad market. According to a leading market research firm eMarketer, strong competition for the US mobile search market might be anticipated, having in mind the large US online ad market and strong pushes by portals. By 2011, mobile search is expected to account for around $715 million.

The Mobile directory search industry is almost as old as the telecom and offers services that enable people by entering a word or phrase on their phone to find local services based on their current location. An example of usage would be a person looking for a local hotel after a tiring journey or taxi company after a night out. The services can also come with a map and directions to facilitate the user.

What was the next step? GOOG-411. This is another but this time voice-activated mobile search. The free service allows callers to access Google’s local information through voice search. There is no doubt, that mobile voice search is simpler and more convenient for the callers than typing on the phone’s buttons.

“I’d have to be a visionary to be vindicated, and I’m making no such claim. It’s just hard to ignore that most people prefer talking in their phones to typing on them, and a mobile search engine that made voice search possible might have an easier time finding an audience”, said Bryson Meunier, Product Champion, Natural Search in a posting at For the same reasons Meunier believes that mobile visual search could be bigger than voice search.

How do the searchers initiate a visual query? Simply by snapping a photo of something with their phone, which the mobile search engine processes with algorithms and returns relevant digital content based on its interpretation of the user’s visual query.

Visual Search is now gathering popularity. At the Cebit trade show in Germany, Vodafone demonstrated Otello, a search engine that uses images as input. Users send pictures via MMS (Multimedia Messaging Service) from their mobile phones. Otello then returns information relevant to the picture to the mobile phone, just like a normal search engine. There are other examples of companies like SnapNow and Mobot that have actually been offering this service for a few years. Google has its own Mobile Visual Search engines in the face of Never Vision.

Of course, the audience for mobile visual search is currently not so large, but it might be just a matter of time, predicted Meunier.

Posted by myditto
search2009.01.21 12:00

Similarity Based Image Retrieval System - With the volume of data already at unprecedented levels and expected to continue to increase rampantly, technology enabling quick searches of still and video images is much in demand. In response, Hitachi has developed a Similarity-Based Image Retrieval technology, a search engine for just such large-scale image and video archives. Similarity-Based Image Retrieval technology automatically extracts quantified information intrinsic to the image — such as color, shapes and forms — and runs searches to locate a match. This innovative search technique can be used for something as basic as searching for a movie scene or image on a camcorder to something as complex as searching for facial imagery in security, video surveillance or law enforcement applications.

Hitachi 연구소에서 진해되어왔던 프로젝트가 적용되어 빛을 보는 것 같습니다. 하지만 데모 이미지가 없어서 성능이 어느 정도인지는 알 수가 없습니다. 이전에 CHI나 UIST에 만났던 연구원들은 꽤 재미있는 일들을 많이 하는 것 같아서 부러웟던 기억이 나네요. 
Posted by myditto
search2009.01.09 11:06
Tip #9: Google's Vision for OpenCV
Google makes extensive use of OpenCV internally for their street view, map stitching and other image processing needs. They have recently contributed some of their code back to OpenCV in the form of Daniel Filip's C++ image wrapper class cvwimage. Look for cvwimage.h in the .../cxcore/include directory. This code came too late to be documented in the book, but we will document it on the website after we ourselves get more familiar with it. Google is interested in speed and minimizing bugs, so this class wraps IplImage and concentrates on these things:

1. Images are explicitly owned to avoid memory leak problems.
2. This class provides fast access to subregions of an image, especially lines. The member functions are only things that are very fast. To call most OpenCV functionality, you expose the pointer to the image and call the OpenCV functions as usual. That means, there is little learning curve with this class.
3. You can derive window pointers to sub regions of the original image.

Typically in Google, they allocate a huge image space and then put all their images that they are processing into sub-regions of this huge window (see the Region of Interest "ROI" discussion starting on the bottom of page 43 in the book). This huge window becomes their processing buffer and sub-regions are allocated and passed to OpenCV or Google functions for processing. Memory managment isn't much of a problem because only one routine "owns" the huge window so it's easy to manage when this is allocated or deallocated.

Posted by myditto
search2007.06.06 23:38
사용자 삽입 이미지

barcamp Seoul

를 처음으로 참가하다... 워낙 내 자신을 드러내지 않는 성격에 이런 오프라인 모임에 나가기가 쉽지 않았습니다. 하지만 이제는 내가 하고 있는 일들이 사용자들에게 얼마나 유익한 일들인지 알리고자 한다. 처음 만나는 사람을 본다는 생각에 설레이기도 하며
겁이 나기도 했다. 내가 과연 생각했던대로 내 생각을 잘 전달할 수 있을까? 개인적으로는 나에게는 가르침의 달란트가 없다는 걸 잘 알기 때문이다. 진정으로 절대 고수들은 어려운 내용도 아주 쉽게 설명할 수 있다. 예를 들면 파인만 아저씨의 물리책 말이다.. 아직 난 고수는 아닌 것 같다. 내공을 쌓아야겠다... 아래는 발표한 자료 content based image retrieval 이다.
처음 만나는 분들이지만, 웹 이라는 울타리 안에서 다양한 얘기를 들으니 책임감도 느끼고 배운 내용도 참 많았다. 내 발표 내용은 듣는 분들에게 조금 어려운 내용이라는 생각과 함께 다음에는 좀 더 쉬운 이미지 검색 얘기를 해야겠다. 그러고 보니 블로그를 연지 처음으로 한국말로 썼네...
Posted by myditto
search2007.01.23 18:19

from :

Spatial Data Structures

K-D Trees, PR Quadtrees

K-Dimensional Trees work with "k" dimensions not just 1 dimension as BSTs do
Point Regional (PR) Quadtrees have 2k branches at each node given data in a k-dimensional space
Multi-dimensional queries

Range queries


Geographic Information Systems (GIS)
Computer Graphics


K-D Trees: a Generalization of BSTs

Binary Search Trees search using 1 key

Some data naturally contain multiple dimensions

Given k-dimensional data, a K-D tree indexes dimension 0 of the data at level 0, dimension 1 at level 1, etc.

Figure 13.8 (b) from book


Definition: the discriminator is the search key at each level i

calculation of discriminator : i mod k

Given 3-dimensional data indexed by (x,y,z),

y is the discriminator if the level number of the k-d tree is 10 since 10 mod 3 = 1


K-D Trees ….

At ith level of the tree, discriminator indexes the tree exactly as for a BST: select i mod kth coordinate of data item and compare it with i mod kth coordinate of value stored at node of K-D tree.

If (data value < node value)



    take-right-branch // >= case

For 2-D space, K-D tree subdivides space into rectangles. The direction of subdivision alternates among the dimensions:


Figure 13.8 (a) from book


K-D Tree Find Operation

Simple modification of BST search.

Must determine correct discriminator at each level (= level number mod k) to find out what coordinate of key to use for comparison at that level

KDNode* KD::Kdfindhelp(KDNode* rt, ELEM val, int lev, int K) const


   //lev: current level (mod K); K: number of dimensions

   //dkey(val, i) returns the key value of val for dimension i

   if (rt == NULL)

      return NULL; // Empty tree

   else if (val == rt->value)

      return rt; // Found it

   else if (dkey(val, lev) < dkey(rt->value, lev))

      return Kdfindhelp(rt->left, val, (lev+1) % K, K);


      return Kdfindhelp(rt->right, val, (lev+1) % K, K);


Example: find (55, 80) in

Figure 13.8 (b) from book


Deleting Nodes in a K-D Tree

Deletion similar to BST but slightly harder.

Step 1: find node to be deleted

Step 2: two cases must be handled

No children - replace ptr to node by NULL

Has children - replace node by minimum node in right subtree using KDfindmin (harder to find than in a BST!)

If no right subtree exists, then first move left subtree to become right subtree and proceed with usual delete procedure using KDfindmin

Special note: node found by KDfindmin may occur in middle of tree. Recursive call to the delete routine must be used to replace it (careful!)



Must search both left and right subtrees to find minimum value for a given discriminator

KDNode* KD::KDfindmin(KDNode* rt, int discrim, int level, int K)


    // discrim: discriminator key used for minimum search

    // level: current level(mod K); K: # of levels in key

    KDNode *temp1, *temp2;

    if (rt == NULL)

      return NULL;

    temp1 = KDfindmin(rt->left, discrim, (level+1)%K, K)

    if (discrim != level)


       //Min val could be on l/r side

       temp2=KDfindmin(rt->right, discrim, (level+1)%K,K);

       if((temp1 == NULL) || ((temp2 != NULL) && (dkey(temp2->value,discrim) <


           temp1 = temp2;   //in line above, rt subtree has smaller key value


    // temp1 has the smallest value in children (if any)

    if ((temp1 == NULL) || (dkey(rt->value,discrim) <


          return rt;


         return temp1;


Now delete C(70, 10) from

Figure 13.8 (b) from book


K-D Tree Range Query

Problem: find all k-dimensional points within distance 'D' of point P

Must search through K-D tree using Euclidean distance formula

If any coordinate in the K-D tree is more than distance 'D' away from point P's corresponding coordinate, then no need to search one of the subtrees below that node.

Example: if D=10, y-coordinate of node is being examined and is 13 less than the y-coordinate of P, then no search needed of left subtree below that node

void KDRegionSearch(KDNode* rt, ELEM val, int D, int level, int K)


    // val: search key; D: search radius;

    // level: current level (mod K); K: # of dimensions

    // Check if record at rt is in circle

    if(InCircle(key(val), D, key(rt->value)))

       printout(rt->value);//Do appropriate action with node

    if(dkey(rt->value,level)>(dkey(val,level)-D)      //Go left

       KDRegionSearch(rt->left, val, D, (level+1)%K,K);

    if(dkey(rt->value,level)<(dkey(val,level)+D))     //Go right

       KDRegionSearch(rt->right, val, D, (level+1) % K, K);


Function 'InCircle' uses Euclidean distance to test if point is within distance D of query point

Posted by myditto
search2006.10.10 08:42


"The YouTube team has built an exciting and powerful media platform that complements Google's mission to organize the world's information and make it universally accessible and useful," Google chief executive Eric Schmidt said in a statement.

"Together, we are natural partners to offer a compelling media entertainment service to users, content owners and advertisers."

YouTube soared to online popularity after its launch in February 2005. The company claims that more than 100 million videos are watched daily by visitors to the free website, which features a content ranging from silly home videos to snippets of Hollywood films, television shows and concerts.

"Our community has played a vital role in changing the way that people consume media, creating a new clip culture," said YouTube co-founder Chad Hurley.

"By joining forces with Google, we can benefit from its global reach and technology leadership to deliver a more comprehensive entertainment experience for our users and to create new opportunities for our partners."

YouTube will continue to operate independently with its headquarters in San Bruno, California, after the acquisition is complete, according to Google. All of YouTube's 65 employees were to remain with the company.

In unofficial Silicon Valley tradition, YouTube was launched on its meteoric rise from a garage. In its first foray seeking external funds, the company raised 3.5 million dollars from Sequoia Capital in November 2005. In the newest deal, though, Google will pay for YouTube with shares of its own high-flying stock.

Analysts have touted a Google purchase of YouTube as a symbiotic alliance. By purchasing YouTube, Google would be able to apply its proven prowess for generating revenue through online advertising.

YouTube meanwhile will provide Google traction in the online video-sharing market, where Google's own video service has failed to take off.

"This is the next step in the evolution of the Internet," Schmidt said in a telephone press conference from "YouTube's world headquarters."

"It is a natural next step," Schmidt said in the conference, joined by Hurley, YouTube co-founder Steve Chen, and other Google executives.

"With Google's technology and search leadership we will have the resources to take our services to the next level," Chen said. "We believe this is just the beginning."

Hurley downplayed his previous public mantra that YouTube was not for sale, saying that the point was that the website remains independent and had the resources to thrive.

Google said that it was drawn to YouTube because it was the clear market leader and had put together a "remarkable team" in a short time.

"We think one of the keys to comprehensive search experience will be video," said Google co-founder Sergey Brin.

"On the whole it is hard to me to imagine a better fit with another company. YouTube really reminds me of Google just a few short years ago."

Google said that the reason the deal was done as a stock trade was to provide YouTube owners a tax break while making the takeover "cheaper" for the Mountain View, California, online search powerhouse.

The deal is expected to be culminated by year-end provided it clears regulatory requirements.

"We feel we arrived at a purchase price that is very fair and it is a great value that has been created," said Google vice president David Drummond.

The blockbuster deal was announced on the same day that Google and YouTube unveiled agreements with major studios to post copyrighted music videos online.

The deals were seen as efforts to pre-empt accusations of rampant copyright infringement at the online video-sharing sites and enlist studios as potential beneficiaries of the trend.

All of the new agreements will rely on advertising to generate revenues by encouraging users to click on companies' links alongside the videos.

YouTube was putting new systems in place to "fingerprint" copyrighted material so it could be tracked, Hurley said.

"We are committed to develop tools that help identify content and monetize it in new ways," Hurley said.

Google and YouTube engineers already have dozens of ideas for handling ads, search, and videos, according to Schmidt.

"Most people believe that this is just the beginning of a video Internet revolution," Schmidt said. "I think there is a whole new ecosystem and we are expecting to be a part of it."

Meanwhile, Google Video "will not go away now, or ever," Schmidt said.

Posted by myditto

티스토리 툴바