SML Pro Blog: algorithm

Saturday, January 5, 2008

Flickr Analytics: The making of interestingness / SML Analytics

As part of an ongoing effort to document myself to better understand myself, I posted this design concept I created for Eric Roos' music business Dangeroos onto Flickr in August 2007.

The piece was designed back in 2002, and was a quick sketch or mood board for the Dangeroos Web site. The identity Dangeroos is a play on Eric's last name, so the choice of Interstate (the typeface) was obvious. Red was chosen because it suggests danger. Wave morphs, color rectangles and circular compositions were dropped in to support the idea of sound.

Shortly after, I started a project called Flickr Analytics to analyze Flickr's interestingness algorithm. Because of that, I became fairly aware of each image's ranking over time. To my surprise, this design has stayed consistently within the top 20 most interesting images and, over the past few months, has reigned as the number one most interesting image on my entire Flickr stream.

What was even more interesting to me is that the image also drives a lot of traffic: more than 4,000 views within the last 5 months, which amounts to 25+ views a day. That's a lot for an image with no human element.

So when Flickr Stats launched on 2007-12-17, it was a godsend, for it enables me to analyze traffic and get a better understanding of where it is coming from.

How to get to Flickr Stats

At the bottom right corner of an image page on Flickr, below the list of tags, is what I would call the utility area of the page. This is also where you can access the Photo stats of the image in question. Clicking on the link will bring you to a page similar to the one below:

Photo Stats

At this time, Flickr Stats only allows you to view detailed traffic information from the last 28 days, which is not as flexible as most site analytics tools, but is still much better than none at all.

The first time I saw this page I was stunned. Previously I had guessed that the Dangeroos design was popular because it was the first image in my 100 Most Interesting Design (set), which is why I had kept it on my Flickr homepage all this time. The data suggests otherwise. In fact, 54% of its traffic (1,041 visits) came from images.search.yahoo.com and 30% (572 visits) came from flickr.com.

Really? What were people searching for on Yahoo? I clicked on the domain name and got to the referrer detail page for Yahoo Image Search:

Apparently, I'm getting a lot of hits from Yahoo from people searching for graphic design. Flickr Stats has a nice feature which allows you to click on the keywords to go directly to the search results in question. This is where I noted that on 2007-12-18, searching for graphic design on Yahoo put this piece among the first 5 results:

Image Search Algorithm

If you think about how difficult it is to develop a useful text search algorithm, you can imagine how challenging it must be to create a good image search algorithm. Indeed, until very recently, Google Image Search relied on the image's file name alone to feed you results.

Aside from gaining a huge user base, Yahoo's decision to buy Flickr is obvious: image tagging data.

Tagging is a voluntary act by the user: creating metadata makes organizing a collection of photos much easier. To the search engine, however, tagging is free metadata association. One strategy for deciphering whether the tags are accurate can go like this: each time someone searches for a term, say graphic design, my search engine throws 20 images associated with that tag onto the search results page. An image that's more related to that search term will be more likely to be clicked on by someone searching for it.

With time and patience, it would be possible for me to figure out which tags are valid and which tags are not. When people search for the term and then either favorite or comment on a photo, that suggests the image is more relevant (and thus "interesting") to that search term.
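A minimal sketch of the click-through idea described above, in Python. The log format, the function name `tag_click_through` and the sample data are all assumptions made up for illustration; nothing here reflects how Flickr or Yahoo actually work.

```python
from collections import defaultdict

def tag_click_through(impressions, clicks):
    """Return {(search_term, image_id): click-through rate}.

    impressions: iterable of (search_term, image_id) pairs, one per time an
                 image was shown on a results page for that term.
    clicks:      iterable of (search_term, image_id) pairs, one per click.
    Both log formats are hypothetical.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for pair in impressions:
        shown[pair] += 1
    for pair in clicks:
        clicked[pair] += 1
    # Only pairs that were actually shown can have a rate.
    return {pair: clicked[pair] / count for pair, count in shown.items()}

# Made-up log: two images tagged "graphic design", each shown four times.
impressions = ([("graphic design", "dangeroos")] * 4
               + [("graphic design", "cat-photo")] * 4)
clicks = ([("graphic design", "dangeroos")] * 3
          + [("graphic design", "cat-photo")] * 1)

rates = tag_click_through(impressions, clicks)
```

An image with a consistently higher rate for a term is a candidate for having a valid tag; a real system would also need enough impressions before trusting the number.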

The same strategy can be applied to Flickr Groups. When you post an image to a particular group, the group usually is associated with certain keywords. When users click on an image among all others, they are functioning as bots with very advanced algorithms to do things that machines cannot yet do: tell the good images from the rest. I call humans participating in these activities BioBots.

The key in these systems is to identify the experts. Once you have collected enough data on a user and noticed, for example, that they have a degree in graphic design, work in the graphic design field, and perhaps are members of mostly graphic design groups, it may be fair to say that their opinions on graphic design matter more. You can thus set up your algorithm to give their opinions more weight, for the same reason that KOLs (key opinion leaders) in pharma talk have their role on medical sites.
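Giving experts more weight can be sketched as a weighted sum of votes. The function name, the user names and the weight of 3.0 are hypothetical; how to measure expertise is left open here.

```python
def weighted_score(votes, expertise):
    """Sum the votes for one image, weighting each voter by expertise.

    votes:     {user: 1 or 0}, where 1 means the user favorited or clicked.
    expertise: {user: weight}; users not listed get a weight of 1.0.
    """
    return sum(vote * expertise.get(user, 1.0) for user, vote in votes.items())

# Hypothetical weights: a working graphic designer counts three times as much.
expertise = {"designer_a": 3.0}
votes = {"designer_a": 1, "casual_b": 1, "casual_c": 0}

score = weighted_score(votes, expertise)  # 3.0 + 1.0 + 0.0
```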

©2008 See-ming Lee 李思明 SML / SML Pro Blog / SML Universe. All rights reserved.

Friday, November 9, 2007

Thoughts on Google Personalized Search Results

I did some random Googling this morning and found that my initials SML are featured on page one again (Google: SML), which is sweet. However, I noted something interesting when I performed the same search at work.

In both cases, I was logged into my Google account, which technically speaking should give me the same results. Well, apparently not. I took some screenshots so you can scrutinize them:

Home: SML Flickr: Google: SML / 2007-11-08 / SML Data

Work: SML Flickr: Google: SML / 2007-11-08 / SML Data

Noting this, I can't help but wonder if Google uses algorithms to randomize results a bit to test which links people would click on at various times, a strategy that I hypothesize is being used on Flickr to determine interestingness as well.
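The kind of randomization hypothesized above could look something like the Python sketch below. It is purely illustrative: the function name `perturb_ranking` and the swap probability are assumptions, and nothing here is known to match what Google or Flickr actually do.

```python
import random

def perturb_ranking(results, swap_probability, rng):
    """Return a copy of results with some adjacent pairs swapped at random.

    Walks the list and swaps a result with its neighbour with the given
    probability, so no result moves more than one position from its rank.
    """
    shuffled = list(results)
    i = 0
    while i < len(shuffled) - 1:
        if rng.random() < swap_probability:
            shuffled[i], shuffled[i + 1] = shuffled[i + 1], shuffled[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return shuffled

results = ["a", "b", "c", "d", "e"]
rng = random.Random(2007)  # seeded so the variant is reproducible
variant = perturb_ranking(results, 0.3, rng)
```

Logging which variant each visitor saw, together with what they clicked, would give the data needed to compare orderings.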

In turn, this suggests an interesting visualization exercise: what would the results pages look like if I searched for the same thing over and over again and took a screenshot of the results page each time? It would be a kind of SERP time-lapse, if you will. And if the data is available, all the merrier. It would surely be nice to see search result nodes moving in real time. Now I just need two more instances of SML working on these projects while I can still live my 24-hour day.

More random searches
Additional random searches that I like to do are Google: Google SML vs Google: SML Google, which give different results. This means that Google's search algorithm gives weight to the ordering of the words. By how much? I'll let you genius programmers find out. Feel free to email me your results when you do. I LOVE DATA!
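One way to start measuring the effect of word order is to compare the two ranked result lists directly. The sketch below assumes the results have already been collected by hand into Python lists; the URLs are placeholders, not real search results.

```python
def compare_rankings(first, second):
    """Compare two ranked result lists (best result first).

    Returns (shared, moved): the set of URLs present in both lists, and a
    dict mapping each shared URL whose position differs to the pair
    (rank_in_first, rank_in_second), with ranks starting at 1.
    """
    rank_first = {url: i for i, url in enumerate(first, start=1)}
    rank_second = {url: i for i, url in enumerate(second, start=1)}
    shared = set(rank_first) & set(rank_second)
    moved = {
        url: (rank_first[url], rank_second[url])
        for url in shared
        if rank_first[url] != rank_second[url]
    }
    return shared, moved

# Placeholder lists standing in for "Google SML" and "SML Google".
google_sml = ["example.org/a", "example.org/b", "example.org/c"]
sml_google = ["example.org/b", "example.org/a", "example.org/d"]

shared, moved = compare_rankings(google_sml, sml_google)
```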

Related SML Universe
+ SML Data
+ SML Google
+ SML Search

SML Copyright Notice
©2007 See-ming Lee 李思明 SML / SML Pro Blog / SML Universe. All rights reserved.

Wednesday, October 10, 2007

Human Computation / Google TechTalks

Recently I proposed an alternative method for human identification on my blog so when a friend sent me this video via Del.icio.us, I was absolutely delighted:

Google Video: Human Computation / Google TechTalks 2006-07-26

Luis von Ahn

Assistant Professor, Computer Science Department, Carnegie Mellon University
Ph.D. Computer Science, Carnegie Mellon University, 2005
B.S. Mathematics, Duke University, 2000
Recipient, Microsoft Research Fellowship

Key points

Tasks like image recognition are trivial for humans, but continue to challenge even the most sophisticated computer programs.
Utilize human processing power to solve problems that computers cannot yet solve.
Traditional approaches to solving such problems focus on improving software.
Luis von Ahn advocates a novel approach: constructively channel human brainpower using computer games.
ESP Game = an enjoyable online game that helps label images on the Web with descriptive keywords
These keywords can be used to improve the accuracy of image search.
People play the game not because they want to help, but because they enjoy it.
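The labeling idea behind the game can be sketched as keeping only the words that two players independently type for the same image. This is a simplification written for illustration; the function name and the sample guesses are made up, and the real game has rules not modeled here.

```python
def agreed_labels(player_one, player_two):
    """Return the labels both players typed, in player one's order.

    Matching ignores case and surrounding whitespace, and each agreed
    label is reported once.
    """
    def normalize(word):
        return word.strip().lower()

    typed_by_two = {normalize(word) for word in player_two}
    agreed = []
    for word in player_one:
        key = normalize(word)
        if key in typed_by_two and key not in agreed:
            agreed.append(key)
    return agreed

# Made-up guesses from two players looking at the same image.
labels = agreed_labels(["Dog", "grass", "ball"], ["park", "ball ", "dog"])
```

Because the players cannot see each other's guesses, a word they both type is likely to describe the image.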

Fun Facts

Number of human-hours spent playing Solitaire in 2003 = 9 billion human-hours
Number of human-hours spent building the Empire State Building = 7 million human-hours (6.8 hours of Solitaire)
Number of human-hours spent building the Panama Canal = 20 million human-hours (less than a day of Solitaire)
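The Solitaire equivalents above follow from spreading the 9 billion hours evenly over one year. A short Python check of the arithmetic, assuming a 365-day year:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 clock hours, ignoring leap years

solitaire_per_year = 9_000_000_000  # human-hours of Solitaire in 2003
# Human-hours of Solitaire played worldwide during one clock hour.
solitaire_per_hour = solitaire_per_year / HOURS_PER_YEAR

def solitaire_equivalent(project_human_hours):
    """Clock hours of worldwide Solitaire play that match a project."""
    return project_human_hours / solitaire_per_hour

empire_state = solitaire_equivalent(7_000_000)   # about 6.8 hours
panama_canal = solitaire_equivalent(20_000_000)  # about 19.5 hours
```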

Friday, September 28, 2007

SML Google: Web History: Interesting Items / 2007-09-26T21:41-04:00 / SML

SML Flickr: SML Google: Web History: Interesting Items / 2007-09-26T21:41-04:00 / SML

SML Google: Web History: Interesting Items / 2007-09-26T21:41-04:00 / SML, originally uploaded by See-ming Lee 李思明 SML.

SML.Screenshot: SML.Google.Web-History.Interesting-Items.20070926T2141-0400.png

SML.to.Google.2007-09-26T20:09-04:00

1. It appears that your algorithm on machine-learning is doing quite well. These items are indeed interesting!

2. I recommend that you tweak your modifiers. For example, in Top-Queries.5, Propellerhead.Reason.4 is obviously related to my audio software searches, but you should be smart enough to know that I don't torrent.

Torrent, like data, information, website, etc. is a modifier and as such should probably not be used. Focus on the subject, which, in this case, should be Reason 4. In other words, make sets out of your queries for (reason 4)* instead.

3. SML.initHHI(Google.Algorithm.Machine-Learning.Author);

Cheers,
See-ming
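Point 2 of the note above, dropping modifiers and grouping queries by subject, can be sketched in Python. The modifier list comes from the examples in the note; the function names are hypothetical and this is not a description of Google's method.

```python
# Modifier words taken from the examples in the note above.
MODIFIERS = {"torrent", "data", "information", "website"}

def query_subject(query, modifiers=MODIFIERS):
    """Drop modifier words from a query and return the remaining subject."""
    kept = [word for word in query.lower().split() if word not in modifiers]
    return " ".join(kept)

def group_by_subject(queries):
    """Group raw queries under the subject they share."""
    groups = {}
    for query in queries:
        groups.setdefault(query_subject(query), []).append(query)
    return groups

groups = group_by_subject(["reason 4 torrent", "Reason 4", "sonnox"])
```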

Searches
Recent top queries related to your searches
1. columbia university
2. david deutsch
3. demonoid
4. leah culver
5. reason 4 torrent
6. revolution money
7. sonnox
8. feedhub
9. attendi

Pages
Web pages related to your searches
1. NeHe Productions: Main Page
2. FT.com / World - Chinese military hacked into Pentagon
3. NATIVE INSTRUMENTS : Home
4. Become an SEO Professional & Dominate Google's Search Results ...
5. kirupa.com - Shocked Resource for Making Designers better Developers!
6. Schema Tutorial
7. A Picture is Worth... Being Nice to Cyclists in Toronto (TreeHugger)
8. MTA Metro-North
9. Lamictal (lamotrigine) - The Good, The Bad and The Funny. From ...
10. http://upload.wikimedia.org/wikipedia/commons/7/74/Timeline_of_web_browsers.svg

Videos
+ Using Data to "Brute Force" Hard Problems in Vision and Graphics
+ Canon Shutter sounds ranging from Canon 1DS MARK II to the 350D (Rebel XT)
+ skydive life

Google Gadgets
+ Rob Galbraith DPI: Digital photography news, reviews, tutorials and discussion forums for professional photographers

SML Copyright Notice
©2007 See-ming Lee / SML Flickr / SML Universe. All rights reserved.

SML Copyright Notice
©2007 See-ming Lee 李思明 SML / SML Ideas / SML Universe. All rights reserved.

Wednesday, September 19, 2007

FM7 0.2.5 / 2006.09.10 / SML

FM7 0.2.5 - 20060910 - SML / SML Music / SML IMEEM
Copyright 2006 See-ming Lee / SML Music / SML Universe

Sound Design: See-ming Lee playing with Native Instruments FM7 in 2006
Algorithmic Music Generation: See-ming Lee using Ableton Live

Result = Engineering + Mathematics + Music + Programming = Fun!

Human Identification Algorithm

Are you human? (Sorry, We Have to Ask). Can’t read the text? Listen to it.
—Digg - Submit Item

You have seen it time and time again: that required human identification step when you submit stories these days. It is composed of letters that are salted with noise and artifacts to deter a bot's attempt to fill out the form and thus spam.

It was a smart idea, but algorithms catch up, and soon these tests don’t work quite as well anymore. So the level of difficulty for OCR comprehension has increased bit by bit every day. It has become so complex these days that even I cannot identify what characters they are supposed to be.

Perhaps because of the number of complaints received, now we also have the audio version of the same thing. What are they going to do when the audio recognition algorithms get better?

In my opinion, this human identification process simply does not work. Visual and audio recognition algorithms will get smarter every day. A better way is to ask logic questions. For example, ask people to verbally describe the difference between a nerd and a geek. Ask them why they are reading your blog.

Opinions are largely based on logic, but they are also largely based on creativity, and creativity is something that cannot easily be programmed yet, at least until natural language algorithms catch up. Another thing that comes naturally to us but is fairly difficult for a machine is comprehension.

I have tested this behavior with a survey which asks the question: Name the odd-man-out among the following: AOL / Google / MSN / Yahoo and state the reason supporting your answer. I got very interesting answers. They are all very inspiring, and as such I know that they did not come from machines.

Being able to go through those answers and pick out the human responses is also most definitely a task that ought to be done by a human. I do not think that there is a computer program that can decipher how creative something is yet. However, I fear that there are projects underway that are attempting to understand creativity using brute force.

A couple of weeks ago, I went to Google Answers, and I discovered that they are no longer accepting new questions. This was a site where users submitted questions and got answers from other users. The snippets are very interesting and no doubt allow Google to index more interesting data that is not readily available on the Web. Training an algorithm to act like a human is a very ambitious activity, but it appears that the algorithm training has paid off.

I visited Google Translate and Google Language Tools recently and I am very impressed with their English to Chinese translation capability.

Unlike English, Chinese uses a defined set of characters. Where Latin languages generally create new meanings by the use of new words, the number of Chinese characters does not change. New meanings are created by combining and ordering these characters. As such, while Chinese children are rarely able to read newspapers until they are graduating from primary school, there are no more new words to learn after that. It's all pattern recognition from then on. It's a bit like iconography systems, where new meanings are created out of a predefined set of modifiers.

Despite the language's complexity, I witnessed that the Google Translator is able to handle English to Chinese text relatively easily, which is light years ahead of the translation tools I have used before, and it definitely makes me wonder what else Google Research is brewing inside their labs these days.

Thursday, August 30, 2007

SML on SEO

In the world of SEO (Wikipedia: Search Engine Optimization), content is king. If you write good content and thus draw enough target audience, search engines will be your friends.

If you are mainly interested in the U.S. market, then Google is your friend, because that is where this search engine has the highest market penetration. If your target audience happens to be in Asia or Europe, then you are probably better off with Yahoo!, because it has long had an international brand presence and Google only recently started to expand into those markets.

Interestingly, most Americans find it surprising that Yahoo has a larger user population than Google overall (Source: Compete.com). Since Google has more American users than any other search engine, it is natural to assume that Google leads everywhere. However, if you survey your friends outside the U.S. to see which search engines they use most, you may be surprised by your results. In my random sampling, I have found that almost all of my friends in the UK prefer MSN Live Search.

To find out just how well you rank among all the search engines, I recommend Jux2, a meta search engine which combines and compares the results of Google, Yahoo and MSN. You may be surprised by how many results are specific to a single database. If you are trying to appeal to an international audience, you will do best to optimize your search strategies for all three primary players.
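The comparison a meta search engine makes can be approximated with set operations. The sketch below is not Jux2's implementation; the function name is hypothetical and the URLs are placeholders rather than real search results.

```python
def unique_to_each(results_by_engine):
    """Map each engine to the results that no other engine returned."""
    unique = {}
    for engine, results in results_by_engine.items():
        others = set()
        for other_engine, other_results in results_by_engine.items():
            if other_engine != engine:
                others.update(other_results)
        unique[engine] = set(results) - others
    return unique

# Placeholder URLs standing in for page-one results from each engine.
results = {
    "google": ["a.example", "b.example", "c.example"],
    "yahoo": ["b.example", "d.example"],
    "msn": ["b.example", "c.example", "e.example"],
}
unique = unique_to_each(results)
```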

If you are an individual, can you utilize these techniques to compete with global international companies? I think so. I Googled SML (initials for See-ming Lee, my name) today, and this blog is prominently featured on page one among approximately 7,450,000 results. I am competing with global players and acronyms here. It's definitely a very 'gratifying' activity. :)

Do you have to spend a lot of money? Does it take a long time to see your ROI (Wikipedia: Return on Investment)? I don't think so. I believe that I am gaining these benefits all by writing a few poems recently. And I published pretty much all of them within the last four months.

I fell into all of this mostly out of my recent interest in network theory. Based on my research, I have a hunch that Google's algorithm has largely to do with network theory (Wikipedia: Network Theory / SML Bookmarks: Social Media / SML Bookmarks: Network). This is a hunch, not a proof. Theoretically speaking, I don't think that any proofs are definite. You can, on the other hand, validate your confidence level based on statistics and analytics reports.

Sunday, August 19, 2007

Flickr Analytics / SML Analytics

FlickrAnalytics.com

Objective

Analyze the Flickr Interestingness algorithm.

Methodology

Tag and group the top 20 images from the Popular-Views, Popular-Favorites and Popular-Comments pages on Flickr.

Compare results with images from the various sets via AND / OR operators.
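The AND / OR comparison maps directly onto set operations. A small Python sketch with made-up image IDs standing in for the three top-20 lists:

```python
# Hypothetical image IDs; the real lists would hold 20 images each.
top_views = {"img1", "img2", "img3", "img4"}
top_faves = {"img2", "img3", "img5"}
top_comments = {"img3", "img4", "img5"}

in_all_three = top_views & top_faves & top_comments  # AND
in_any = top_views | top_faves | top_comments        # OR
views_only = top_views - top_faves - top_comments    # popular by views alone
```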

Preliminary Analysis

Top Views = what drives users to click;

Related tags: top-v111 / top-v333 / top-v555 / top-v777 / top-v999 / top-v1111 / top-v2222 / top-v3333 / top-v4444 / top-v5555 / views-top20

Top Faves = what drives users to buy;

Top Comments = what drives users to blog;

Related tags: comments-top20

All three galleries combined give you the winning ingredients for market success in any advertising campaign. These sets, as such, do double duty as a quick visual reference for your next photoshoot.

If analyzing my own photographs gives me this conclusion, I can only imagine what the entire collection of images hosted on Flickr can provide.

Hypothesis: Top Interesting = Algorithm (Views, Favorites, Comments)
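To make the hypothesis concrete, here is a toy scoring function in Python. The linear form, the weights and the sample numbers are all invented for illustration; this is not Flickr's actual formula.

```python
def interestingness(views, favorites, comments,
                    w_views=1.0, w_favorites=10.0, w_comments=5.0):
    """Toy score: a weighted sum of the three counts.

    The linear form and the default weights are assumptions made up for
    this sketch, not anything Flickr is known to use.
    """
    return w_views * views + w_favorites * favorites + w_comments * comments

# (views, favorites, comments) for three imaginary photos.
photos = {
    "a": (4000, 12, 3),
    "b": (500, 40, 25),
    "c": (1500, 5, 1),
}
ranked = sorted(photos, key=lambda name: interestingness(*photos[name]),
                reverse=True)
```

Comparing a ranking like this against the order Flickr actually shows would be one way to test candidate weights.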

AND(Views, Favorites, Comments) / Flickr Analytics / SML (Set)

I hypothesize that Yahoo is using this data to train an AI (artificial intelligence) algorithm to predict images that will be influential. (Related: see my blog post on Theorizing aesthetics.) And if they aren't, I think that they should. Google literally dominates the text-ad market by indexing every single bit of text available to them, so it would be smart for Yahoo to pick a different market segment and become the expert in it.

200 Most Interesting Images / Flickr Analytics / SML (Set)

200 Most Interesting People / Flickr Analytics / SML (Set)

200 Most Interesting Designs / Flickr Analytics / SML (Set)