aicoder - nealrichter's blog

aicoder - nealrichter's blog

Neal Richter  //  Chief Scientist @rubiconproject. PhD in CS. AI, data mining, NLP, search engine and optimization geek. Formerly @ OthersOnline, RightNow. Hometown: Bozeman, MT

Jan 7 / 9:13am

Software pattern for proportional control of QPS in a webservice

Problem Statement

Imagine your are writing a webservice that must call a back-end service, such as a data store.  Let's assume (with out loss of generality) that the data store (and your hardware supporting it) has some limit in QPS that it can handle.  We'd like the client system (your web service) to impose a limit on the QPS to the back-end service.  Also assume that this is a distributed webservice, lots of worker threads.

Requirements

  1. Given a goal in QPS manage the maximum outgoing requests per second to that goal.
  2. Be fast. Maintain a fast controller settling time when the goal or queries change.
  3. Be adaptive.  Respond to swings of incomming requests that need to be queried against the service.
  4. Be distributed. Locally active against global numbers without knowing the number of workers.
  5. Be robust.  Handle additions and subtractions of worker/clients to the system without coordination. Minimize overshoot.

Design

Assume that querying this backend service, while valuable and mostly needed, is optional under duress.  It's far more important for your front-end service to be responsive and return some error or 'No Content' than hang on a busted back-end.   As result we'll use a sampling rate 'r' to denote the % of time that the web service should query the back-end.  Under normal conditions this rate is 1 (100%).  Also assume that the goal in QPS to the back-end is set in some configuration area in your system.  Under duress the rate r will be adaptively tuned to obey the QPS goal.  Also assume you have smartly implemented some monitoring system like Ganglia/Nagios/Cacti and are emitting events to it when you call the back-end service.

Inputs

  • G = Goal QPS
  • M = Current measured QPS (from Ganglia).
  • r = Current sampling rate [0,1]

Outputs

  • r_new = A sampling rate [0,1]

Adaptive Mechanism

  • r_new = r * (G/M)
  • r_new = MAX(0,MIN(1,r_new))    //clamp r_new between [0,1]

Benefits

  • Needs only global G and M as inputs.
  • No-coordination needed between workers/servers other than the globally observed M.
  • Adaptively moves the per-worker sampling rate independent of all other worker's rates.  
  • Workers can have different incoming QPS rates from a load balancer, the controller will adapt.

Failure Modes

  • If the sensor for M fails to be updated then the controller is blind
  • If the Goal is set or re-set to zero, then the controller will stop traffic
  • Both of these can be addressed easily.
Desired Response
  • The overshoot/undershoot is called 'ringing'
  • The time to approach the goal is called the 'settling time'

Implementation Notes

  • Emit a ganglia/graphite/XXX counter for both requests sent and skipped
  • Each worker should do a bit of randomization of when it performs its sampling-rate-update to smooth out the 
    • every 3 minutes plus/minus 1 minute

Invertability

This design could be inverted for a server implementation.  If your webservice has a set of APIs that are heavy to execute, then this controller could be used to control the incomming QPS that are delegated to the heavy work.

  1. Receive request
  2. Submit to sampling rate
  3. If Yes then delegate the request to the executor
  4. If No then respond to the client with 'HTTP 204 No Content' or equivalent empty reponse.
Dec 27 / 9:10am

Standford's Introduction to Computational Advertising course

Andrei Broder and Vanja Josifovski, of Yahoo! Research, have wrapped up their Stanford course again this fall.  As always the lecture slides are a great intro to the area.

MS&E 239: Introduction to Computational Advertising

Computational advertising is an emerging new scientific sub-discipline, at the intersection of large scale search and text analysis, information retrieval, statistical modeling, machine learning, classification, optimization, and microeconomics. The central problem of computational advertising is to find the "best match" between a given user in a given context and a suitable advertisement. The context could be a user entering a query in a search engine ("sponsored search"), a user reading a web page ("content match" and "display ads"), a user watching a movie on a portable device, and so on. The information about the user can vary from scarily detailed to practically nil. The number of potential advertisements might be in the billions. Thus, depending on the definition of "best match" this problem leads to a variety of massive optimization and search problems, with complicated constraints, and challenging data representation and access problems. The solution to these problems provides the scientific and technical foundations for the $20 billion online advertising industry.

This course aims to provide a good introduction to the main algorithmic issues and solutions in computational advertising, as currently applied to building platforms for various online advertising formats. At the same time we intend to briefly survey the economics and marketplace aspects of the industry, as well as some of the research frontiers. The intended audience are students interested in the practical and theoretical aspects of web advertising.
Dec 27 / 9:03am

Summer Program on Computational Advertising

Deepak Agarwal of Yahoo Research is organizing a summer program on Computational Advertising.  It appears to be geared for grad students.

http://www.samsi.info/programs/summer-program-august-6-17-2012-computational-advertising

This two-week program will run from August 6 to August 17, 2012. The first week will be at the Radisson RTP in the Research Triangle Park, NC. The location is in close proximity to SAMSI. The first three days will be spent on technical presentations by leading researchers and industry experts, to bring everyone up to speed on the currently used methodology. On the fourth day, the participants will self-organize into working groups, each of which will address one of the key problem areas (it is permitted that people join more than one group, and the organizers will try to arrange the working group schedules to faciliate that).  The second week will be spent at SAMSI headquarters in Research Triangle Park.
Nov 15 / 8:24am

What Software Engineers should know about Control Theory

Over the years I've noticed an interesting lack of specific domain knowledge among CS and software people.  Other than the few co-workers that majored in Electrical Engineering, almost no one has heard of the field of 'Control Theory'.

Control theory is an interdisciplinary branch of engineering and mathematics, that deals with the behavior of dynamical systems. The desired output of a system is called the reference. When one or more output variables of a system need to follow a certain reference over time, a controller manipulates the inputs to a system to obtain the desired effect on the output of the system.
 
Let's imagine that you write internet web services for a living.  Some Rest or SOAP APIs that take an input and give an output.

Your boss walks up to you one day and that asks for a system that does the following:
- Create a webservice that calls another (or three) for data/inputs, then does X with them.
- Meters the usage of the other web services.
- Your webservice must respond within Y milliseconds with good output or a NULL output.
- Support high concurrency, ie not use too many servers.

The problem is that these other third-party webservices are not your own.  What is their response time?  Will they give bad data?  How should your webservice react to failures of the others?

Does this sound familiar?  It should to many.  This is the replicated-and-shared-connector problem (MySQL, memcached), the partitioned-services problem (federated search, and large scale search engines) and the API-as-a-service problem (Mashery, etc).

There are two basic types of controls relevant here:
  • Open Loop, Feed-forward: Requires good model of system inputs and response of the system.
Feedforward

  • Closed Loop, Feed-back

Types of adaptive control are as follows:
  • Linear Feedback
  • Stability Analysis
  • Frequency response
  • response time
Adaptive Schemes
  • Gain Scheduling
  • Model Reference Adaptive Systems
  • Self-tuning regulators
  • Dual Control

Here's one survey deck from a lecture. Unfortunately for software engineers, most of the presentations of the above are in linear system form rather than an algorithmic form.

Dr. Joe L Hellerstein of Google and co-workers taught a course at U of Washington in 2008 that was more software focused.  He's also written a textbook on it and a few papers.

I'd like to see a 'software patterns' set created for easier use by software engineers.  I'll attempt to present a couple common forms as patterns in a future blog post.
Jul 12 / 10:49pm

SchemaMgr - MySQL schema management tool

SchemaMgr is a simple perl tool to manage schema change in MySQL DBs. I wrote this in 2007/2008 and it has been used in production for many years.

https://github.com/nealrichter/schemamgr

Each change is assigned a version number and placed in a file.  When the SQL in the file is executed successfully, a special table is set with that version number. Subsequent runs install only the higher versioned files.


It can also be used to reinstall views and stored procedures.

The best practice is to copy the file and change X in the filename and in the $DB_NAME variable.

$ ./bin/schemamgr_X.pl 
Usage: either create or upgrade X database
        schemamgr_X.pl -i -uUSERNAME -pPASSWORD [-vVERSION] [-b]
                 updates DB of to current (default) or requested version
        schemamgr_X.pl -s -uUSERNAME -pPASSWORD
                 reinstalls all stored procedures
        schemamgr_X.pl -w -uUSERNAME -pPASSWORD 
                 reinstalls all views
        schemamgr_X.pl -q -uUSERNAME -pPASSWORD 
                 Requests and prints current version
Optional Params 
         -vXX -- upgrades upto a specific version number XX 
         -b   -- backs up the database (with data) before upgrades 
         -nYY -- runs the upgrades against database YY - default is X

By convention you create ONE create_X_objects_v1 file with a date.
All other files are update files with greater than v1 numbers.

build/
|-- create_X_objects_v1_20110615.sql
|-- update_X_objects_v2_20110701.sql
`-- update_X_objects_v3_20110702.sql

Jun 28 / 9:02am

Managing yourself to tasks and finishing them.

I saw these two short articles from HBR come across my twitter stream and read them.  They stuck with me for more than a week.

The essence of the advice is "Prep-Do-Review".  For each task you do make and review a plan for it before starting.  Once the task completed review the plan.  Did you finish? What did you learn?  What would you do differently next time?

The essence of this advice is to think in terms of "to-go" versus "to-date" performance on a task.  When you entertain to-date thinking it's very easy to see how much you have accomplished so far.  This can lead to a lowering of ongoing effort or allow yourself to become distracted and work on other tasks.

So the optimal algorithm that combines the two is:

PrepForWork();
Do {
   IncrementalWork();
   X = DistanceToFinish();
} Until (X == 0)
ReviewResults();

Don't look back until you are done!

Note that there is a formal method of proof in mathematics and CS that goes something like this:

1) Define an integral metric f(x) measuring the distance to the goal
2) Define the starting distance
3) Show that your "algorithm/method" monotonically decreases f(x)
4) Infer that the goal will be reached
5) (Optional) Calculate the minimum number of steps required to reach the goal.

The important point is the rigor of the mathematical proof version. Your idea gets no partial credit for so-far progress, the algorithm either gets to the goal or it fails. The proof is either true or it is false. Thus you are either Done or you are NOT Done.
Apr 7 / 2:21pm

JSON parsing speed in various Node.JS versions

We use Node.JS for a very high capacity service at the Rubicon Project.  It often drives or handles in excess of 10B HTTP requests per day sending or receiving JSON data.
Out of curiosity I ran some tests on JSON parsing speed in different versions of Node.JS

node.js code:

var sys = require('sys');
var data = "{ \"item_uuid\": \"8ec56438-d3cf-442a-bbf7-7f076f229f35\", \"return_code\": 0, \"data\": [ { \"valid\": true, \"votes\": 2345, \"date\":\"Thu, 07 Apr 2011 15:17:17 EDT\", \"headline\": \"Senate Majority Leader Harry Reid indicates there likely will be a government shutdown on Friday. Lawmakers have been unable to agree on a new federal budget\", \"source\": \"Yahoo News\", \"published\":{\"hour\":\"19\",\"timezone\":\"UTC\",\"second\":\"17\",\"month\":\"4\",\"minute\":\"17\",\"utime\":\"1302203837\",\"day\":\"7\",\"day_of_week\":\"4\",\"year\":\"2011\"} } ] }";

try {
   for(var i = 0; i < 1000000; i++)
   {
        var tmp = JSON.parse(data);
   }
} catch(e) { sys.puts("ERROR: on parsing JSON with v8 parser"); }

sys.puts(data);
var tmp = JSON.parse(data);
sys.puts(JSON.stringify(tmp));
sys.puts("\n DONE \n");
process.exit();

Essentially this re-parses the same example JSON (I created a fake RSS like JSON pacakge) 1M times.

Here are the results:
  • Node 0.1.3x:    real 0m30.050s
  • Node 0.2.6:      real 0m30.050s
  • Node 0.3.8:      real 0m9.915s
  • Node 0.4.5:      real 0m9.999s
For reference I ran the same test against a very fast tokening parser in C called jsmn, and a C++ one called vjson.
  • jsmn: real 0m2.276s
  • vjson: real 0m7.465s        Note that vjson is a destructive parser, and I had to fix that first.

Interestingly the JSON parser in node 0.4.5 and prior versions appears to be written in pure Javascript.  See the file:  node-v0.4.5/deps/v8/src/json.js

It's unclear if the speed improvements are a result of improvements to the parser implementation or in some efficiency/speed leap in versions of V8 included in Node.
  • Node 0.1.33:    v8: 2010-03-17: Version 2.1.5
  • Node 0.2.6:      v8: 2010-08-16: Version 2.3.8
  • Node 0.3.8:      v8: 2011-02-02: Version 3.1.1
  • Node 0.4.5:      v8: 2011-03-02: Version 3.1.8
Mar 22 / 9:26pm

Pamela Samuelson on startups and software patents

Following up on my last post here is the view from Pamela Samuelson:

Why software startups decide to patent ... or not

Two-thirds of the approximately 700 software entrepreneurs who participated in the 2008 Berkeley Patent Survey report that they neither have nor are seeking patents for innovations embodied in their products and services. These entrepreneurs rate patents as the least important mechanism among seven options for attaining competitive advantage in the marketplace. Even software startups that hold patents regard them as providing only a slight incentive to invest in innovation.

She also lists a variety of reasons why these software entrepreneurs decided to forgo patenting their last invention.  It's a very interesting write up. 
Mar 10 / 2:30pm

Comments re "The Noisy Channel: A Practical Rant About Software Patents"

The Noisy Channel: A Practical Rant About Software Patents - [My comments cross-posted here]

Daniel, nice writeup.

I worked for a BigCo and filed many patents. It was a mixed bag. The time horizon is so long that even after I’ve been gone for 3.5 years many of them are still lost in the USPTO. Average time for me to see granted patents was 5+ years.

Here are my biased opinions:

1) Patents really matter for BigCos operating on a long time horizon. It’s a strategic investment.

2) Patents are nearly worthless for a Startup or SmallCo. The time horizon is way past your foreseeable future, and thus the whole effort is akin to planning for an alternate reality different than the current business context. Throwing coins in a fountain for good luck is about as relevant. You simply are better off getting a filing date on a provisional design writeup and hiring an engineer with the money you’d spend on Patent lawyers.

3) As an Acquiring company looking at a company to acquire, Provisional or Pending Patents are a liability not an asset. They take time and resources to push to completion for a strategy of deterrence.

4) Patents are mostly ignored in the professional literature. Take Sentiment Analysis as one example. Sentiment Analysis exploded in 2001 w.r.t. Academic publishing, yet there are more than a few older patents discussing good technical work on Sentiment Analysis. I’ve NEVER seen an algorithm in a patent cited in a paper as previous work. And I have seen academic papers with algorithms already 90% covered by an older patent… and the papers are cited as ‘novel work’.

5) Finding relevant patents is ludicrously hard. It might be the most challenging problem in IR w.r.t. a corpus IMO. Different words mean the same thing and vise versa due to the pseudo-ability in a Patent to redefine a word away from the obvious meaning. With two different lawyers rendering the same technical design into a writeup and claims results in wildly different work product.

6) I’ve seen some doosey granted Patents. Things that appear to either be implementations of very old CS ideas into new domains.. or worse stuff that would be a class project as an undergrad.

It’s just plain ugly in this realm.


Mar 10 / 1:20am

On Strategic Plans

This needs absolutely no comment.

“We have a ‘strategic plan.’ It’s called doing things.” ~ Herb Kelleher