Friday, March 7. 2014
Some weeks ago I gave a talk about the "Dark Side of Open Data" at the Open Data Institute, where I predicted that the major beneficiaries of government data were not going to be private citizens, taxpayers, or enthusiastic small startups, but large enterprises with deep pockets and less than altruistic service models. The slide I used noted that history tells us any potential goldmine will be mined, and the obvious business model would be:
As to who would do this, the question I posed was "Which side are all the sharpest knives on?". No surprises then, that today I read in a McKinsey article on trends in Big Data that:
...there was a growing awareness, among participants, of the potential of tapping swelling reservoirs of external data—sometimes known as open data—and combining them with existing proprietary data to improve models and business outcomes. (See “What executives should know about open data.”) Hedge funds have been among the first to exploit a flood of newly accessible government data, correlating that information with stock-price movements to spot short-term investment opportunities.
Which immediately raises the question: given that the government is giving away the data, and the taxpayer is funding it, should they be letting it go for $0.00 rather than getting a better deal? I contend, in a world where companies such as Facebook (valued at c. $175bn) will pay $19bn for companies like WhatsApp primarily for their user data assets, that the answer is "no".
Another slide I put up was a rather perceptive comment by Jo Bates, of Manchester Metropolitan University, from 2012:
The current ‘transparency agenda’ [of the UK government, supported by prominent Open Data advocates] should be recognised as an initiative that also aims to enable the marketisation of public services, and this is something that is not readily apparent to the general observer.
The issue is that there is a major asymmetry between those that stand to gain (a few corporations and companies) and those that stand to lose (citizens who have their data appropriated and misused with no recompense). That point is made loud and clear by the McKinsey news...and this is just the beginning, I'd predict. My last slide but one was about what I predict we will see for the next few years:
- The combination of enthusiasts who see no problems, and commercial interests who intend to make money from the exact problems it will cause, will ensure data will get out without adequate protections or safeguards, at low cost (to the buyers)
So it is no great surprise that hedge funds are early entrants, nor that this week news emerged that 13 years of UK health data had already been sold under the radar to insurance companies for a pittance (to be fair, it was sold for modelling purposes, but the fact remains no one had agreed their data should be sold).
However, there are signs of hope. Days after I gave my talk, the Health Secretary had to abandon plans to sell off health data after a vigorous public protest campaign (waged heavily by social media....) and days later decided they would not sell patient data to such customers. In fact, what looks like an early day charter emerged, as the Government promised to:
....provide "rock-solid" assurance to patients that confidential information will not be sold for commercial insurance purposes, the Department of Health said.
Reading the comments to that report though, it is clear that all the shenanigans, and the backlash that finally brought the Government to this point, have significantly reduced any trust that this new recommendation will actually be followed - especially as they are going to try yet again to change the law, to be able to make data accessible in a few months' time.
The other interesting event today was an abortion charity being heavily fined for being somewhat cavalier with people's data and giving it to a hacker. While it's a pity it's a charity, unless penalties for slack data care are pretty heavy there will be little incentive to look after people's data and it will be open season for hackers.
Wednesday, February 26. 2014
I haven't heard much about Prediction Markets for a while, but here is a new one - predicting Innovation - Innovation Excellence:
Prediction markets were popularized in James Surowiecki’s 2004 book, The Wisdom of Crowds. They are systems which forecast the outcome of projects or events based on how willing individuals are to buy “stock” in them. People buy shares in the topics they think will succeed. Each topic or event then gets a value similar to a stock market price. These prices can be interpreted as predictions of the likelihood of the event.
Much was predicted for Prediction Markets a few years back, but they faded from view as results were not as stellar as, er, predicted (especially in the US elections), but hope always burns. The reason is typically that the preconditions for them to work are ignored, i.e. that all choices must be made by a heterogeneous and fairly large number of people who are in no way influenced by one another or any common intrinsic factors.
If this can be pulled off in companies (or by companies crowdsourcing innovation) it will be very interesting.
One to watch.
Wednesday, February 19. 2014
From the BBC, a report on a series of universities trying to build a system that can counter social media borne rumours, lies and gossip. The Pheme project (named after the Greek goddess of Gossip) is a collaboration between five universities (Sheffield, Warwick, King's College London, Saarland in Germany and MODUL University Vienna) and four companies: ATOS in Spain, iHub in Kenya, Ontotext in Bulgaria and swissinfo.ch, led by Kalina Bontcheva of the University of Sheffield. Pheme will classify online rumours into four types:
Apparently different types of digital disingenuity leave their own type of digital footprints and can be recognized. The system will also look at the accounts spreading it and look for bots. The idea is then to search for information that is true from known sources and re-seed the stream of the original falsehood with "the truth". It will be ready late 2015 apparently.
The obvious flaw is that if it can detect falsehoods, any half decent falsehood spreading system can detect it and re-seed the same trails. The other sad flaw is that many people will rather believe a convenient lie than an uncomfortable truth. The war for the truth is about to be fought in the cyber-memespace to an unprecedented degree - I wonder if there will be a new subscience of memetics, called "phemetics".
Saturday, February 1. 2014
Twitter Internal Social Network
Ever since Billy Beane used algorithms to pick a winning baseball team from B-list talent (Moneyball), it's been clear that the "War for Talent" (and worship of the A lister) is going to be turned upside down by algorithms sooner or later. Here it comes...(Atlantic)
According to John Hausknecht, a professor at Cornell’s school of industrial and labor relations, in recent years the economy has witnessed a “huge surge in demand for workforce-analytics roles.” Hausknecht’s own program is rapidly revising its curriculum to keep pace. You can now find dedicated analytics teams in the human-resources departments of not only huge corporations such as Google, HP, Intel, General Motors, and Procter & Gamble, to name just a few, but also companies like McKee Foods, the Tennessee-based maker of Little Debbie snack cakes. Even Billy Beane is getting into the game. Last year he appeared at a large conference for corporate HR executives in Austin, Texas, where he reportedly stole the show with a talk titled “The Moneyball Approach to Talent Management.”
This will be interesting - my empirical view of "A list" talent is that it is too often nothing more than people picking similar people to themselves, so I have no doubt that once the algorithms get going a totally different picture of what "talent" looks like will emerge. As Malcolm Gladwell showed in Blink, when orchestras transitioned to “blind” auditions, in which each musician seeking a job performed from behind a screen, the proportion of women winning spots in the most-prestigious orchestras shot up fivefold, notably when they played instruments typically identified closely with men. I have no doubt that we will see similar, and that many "Non-Alpha-Male" characteristics will be seen as far more effective.
The Atlantic quotes the example of Xerox call centres (call centres have a LOT of measurable employee data) when they switched to an algorithm based approach in 2010, using:
....an online evaluation that incorporates personality testing, cognitive-skill assessment, and multiple-choice questions about how the applicant would handle specific scenarios that he or she might encounter on the job. An algorithm behind the evaluation analyzes the responses, along with factual information gleaned from the candidate’s application, and spits out a color-coded rating: red (poor candidate), yellow (middling), or green (hire away). Those candidates who score best, I learned, tend to exhibit a creative but not overly inquisitive personality, and participate in at least one but not more than four social networks, among many other factors.
This is exactly what you'll see when algorithms run wider - characteristics that people thought were important will turn out not to be, and vice versa.
I do have a concern though - the previous generations of predictive recruitment systems (like psychometric testing etc) all came complete with their own brands of psychobabble, and the loop wasn't always closed, i.e. following through and testing the theories against actual vs. predicted outcomes. I can see a risk emerging with, for example, company social networks (see this map of the Twitter internal comms net, one example of which is the picture at the top of the post) which are today being analysed to try and predict employee performance. In the early days people won't really understand what they are looking at, and will make some crap decisions about the required characteristics for "success" (and even what success looks like) until enough loops have been closed, and a large body of predictive data is gathered. We are already seeing some interestingly counter-intuitive effects - for example, it's not the drunken party pic of people on Facebook that accurately predicts poor employees, but instead factors like a propensity to badmouth others. I wonder how many companies have nixed hard partying, high potential people over the last 10 years...
(There is another irony to this - Facebook - and thus dis-employment by Facebook - originated on campuses, yet the algorithms are starting to show that a university education is not a be all and end all either)
And of course, given the 'Net is always on, why not assess everyone all the time - a company called Gild, for example, has algorithms that monitor developers 24/7/365:
....begin by scouring the Web for any and all open-source code, and for the coders who wrote it. They evaluate the code for its simplicity, elegance, documentation, and several other factors, including the frequency with which it’s been adopted by other programmers. For code that was written for paid projects, they look at completion times and other measures of productivity. Then they look at questions and answers on social forums such as Stack Overflow, a popular destination for programmers seeking advice on challenging projects. They consider how popular a given coder’s advice is, and how widely that advice ranges.
The risk is that, just as Taylorism took "Scientific Management" too far and dehumanised work to such an extent that huge other social costs emerged (but were borne elsewhere than the companies that caused them), and it took half a century to peg it back, we may well have "Networked Management" turning the online world into a culture of Mechanical Turks, rather than the craft haven the NetUtopians dream of - with similar results.
Never mind Orwell's 1984 "Big Brother", by 2024* we could have a digital Brave New World, with people pre-sorted from Alpha to Epsilon by the time they come out of high school.
(*or 116 After Ford)
Tuesday, February 26. 2013
Talk tonight - "The Keys to the City of Knowledge" by Conrad Wolfram at Policy Exchange, calling for Computable Data. Once one got over the "Leader of company specialising in computable data search argues for more computable data to search" shocker there were some interesting points made.
Firstly, he argued that computable data is about presenting the "data behind the data", for what he terms "citizen computation" - or "trying to get the answers, not other people's answers". Now this I agree with, as we note that Gresham's Law is increasingly applying to online data, i.e. bad data is driving out good. I also buy his argument that discussions about decisions are about "where the sliders are on data models", not the simple Black/White answers that pass for public and media debate. The issue I have with this though is the level of maths capability required of everyone - and in the UK at any rate, maths and the whole STEM area is relatively lowly valued among the university educated (and relatively lowly paid in the UK), never mind teaching much higher levels of maths skills to les autres. Wolfram argues that Maths needs to be taught differently in schools, and that Computable Data now is like computers in the Assembler days, and we need to get to the "Mac" layer of computable data fast.
But that was all by the by, what really did interest me in the talk were three other points he made:
1. The Value Chain of Knowledge - here are some notes I made:
Thus A Guiding Rule: Compositional knowledge >> dead information
2. What Data is most likely to emerge first? Mainly data that is either publicly funded, or largely yours, eg:
Publicly funded R&D - should be computable
Areas that are most likely to be early data sources are:
- Health - biggest gainer due to diagnosis improvement, which is inefficient & labour intensive. Sensor based medicine is coming. Also data on relative hospital performance (Tripadvisor for your bypass op, as Susan Calman may have put it)
He points out it is necessary to unpack simple metric data (exam results, school league tables), as they are both easy to game and not hugely informative. Work off computable data, not metrics.
Corporate information - a lot is missing (very private)
3. National Productivity - the Computable country needs a computable knowledge economy, which requires:
There was a rather fascinating angle on this, during the question phase. Essentially the discussion had moved to getting corporate data out from behind the firewall (he argues for a VRM-like ownership of your own data) but the point was made that privatisation is bad for Big Infrastructure and some other areas, so maybe a Computable Country should re-nationalise some areas as the data creates more value in a public entity than in a private one*.
Now THAT is an interesting argument, if one started to calculate the value of the chained data....the economics of Open Data may move from "Interesting" to "Critical"
(*Somehow I don't think the Policy Exchange would have intended this, being a right wing think tank)
Friday, February 1. 2013
Broadstuff's riff on Prof David De Roure of OeRC's Web Science slide.
I attended the WebScience Trust event yesterday on Data Observatories, a very "Motley Crew" (as Dame Wendy Hall put it) of people who are active in the space of Web data analytics etc. It was a good session as it had representation from various academics studying the area from Oxford, Warwick, UCL, Cardiff Universities etc, a number of non-profits, research groups, and a few companies operating in the area (like us). I took copious notes.
Apart from being fascinating to see how many ways so many other people are attacking this emerging and extremely varied area, my main takeaways of the day were:
One person suggested we needed to derive a "3 Laws of Robotics" for web data collection and analysis companies. Amen to that!
Also, it was interesting to see not just the mix of hard scientists and "soft" scientists, but the segue of hard scientists doing soft science, soft scientists doing hard science, etc - a Motley Crew indeed....
Wednesday, November 7. 2012
Friday, October 19. 2012
Interesting article in HBR over here, implying that "big set" data analysis reaches limitations to its effectiveness fairly fast:
Firstly, remember the Netflix competition to improve their algorithm:
Five years ago, the company launched a competition to improve on the Cinematch algorithm it had developed over many years. It released a record-large (for 2007) dataset, with about 480,000 anonymized users, 17,770 movies, and user/movie ratings ranging from 1 to 5 (stars). Before the competition, the error of Netflix's own algorithm was about 0.95 (using a root-mean-square error, or RMSE, measure), meaning that its predictions tended to be off by almost a full "star." The Netflix Prize of $1 million would go to the first algorithm to reduce that error by just 10%, to about 0.86.
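For readers unfamiliar with the RMSE measure mentioned in that quote, here is a minimal Python sketch of how it is computed and what the 10% target works out to. The function name and the toy ratings are illustrative assumptions, not Netflix's code or data.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length lists of ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy example (made-up ratings): every prediction is off by exactly one star.
toy_error = rmse([2, 3, 4], [3, 4, 5])

# The prize target: a 10% reduction on the quoted baseline of about 0.95.
baseline_rmse = 0.95
target_rmse = baseline_rmse * (1 - 0.10)
```

A 10% cut from 0.95 is 0.855, which is the "about 0.86" in the quote.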
I recall the guys at a UK Netflix lookalike, LoveFilm, telling me that about 5 factors got the 80/20 prediction, so there was clearly a massive falling off in effectiveness as data analysis complexity increased.
But that is predicting demand, i.e. intended behavior, so what about retention? Is this any easier? After all, one should have bucketloads of data and lots of historical nous in dealing with one's customers. It would appear not:
A study [pdf here] that Brij Masand and I [Gregory Piatetsky-Shapiro] conducted would suggest the answer is no. We looked at some 30 different churn-modeling efforts in banking and telecom, and surprisingly, although the efforts used different data and different modeling algorithms, they had very similar lift curves. The lists of top 1% likely defectors had a typical lift of around 9-11. Lists of top 10% defectors all had a lift of about 3-4. Very similar lift curves have been reported in other work. (See here and here.) All this suggests a limiting factor to prediction accuracy for consumer behavior such as churn.
(Lift is the ratio of the churn rate within the list the "Big Data" analysis picks out to the average churn rate across all customers. So if a "Big Data" algorithm produces a list of customers of whom 20% are actual churners, vs an average churn of 2%, that is a "Lift" of (20/2) = 10. That still means the list is 80% wrong though.)
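The lift arithmetic can be sketched in a few lines of Python. The function name is my own, and the 20% and 2% figures are simply the worked example from the note above, not numbers from the study.

```python
def lift(list_churn_rate, base_churn_rate):
    """Lift = churn rate within the targeted list / churn rate across the whole base."""
    if base_churn_rate <= 0:
        raise ValueError("base churn rate must be positive")
    return list_churn_rate / base_churn_rate

# Worked example from the text: 20% of the list churn vs a 2% average.
example_lift = lift(0.20, 0.02)

# Even at that lift, this share of the list are customers who do not churn.
share_of_list_not_churning = 1 - 0.20
```

So a lift of 10 sounds impressive, but four out of five customers on such a list would still be false alarms.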
And how about predicting Ad effectiveness?
The average CTR% [Click Through Rate] for display ads has been reported as low as 0.1-0.2%. Behavioral and targeted advertising have been able to improve on that significantly, with researchers reporting up to seven-fold improvements. But note that a seven-fold improvement from 0.2% amounts to 1.4% — meaning that today's best targeted advertising is ignored 98.6% of the time.
(Actually, 0.1% sounds high to me, I'd think it was almost an order of magnitude lower nowadays)
Interestingly the article predicts Big Data will help more in the emerging services:
Big data analytics can improve predictions, but the biggest effects of big data will be in creating wholly new areas. Google, for example, can be considered one of the first successes of big data; the fact of its growth suggests how much value can be produced. While analytics may be a small part of its overall code, Google's ability to target ads based on queries is responsible for over 95% of its revenue. Social networks, too, will rely on big data to grow and prosper. The success of Facebook, Twitter, and LinkedIn social networks depends on their scale, and big data tools and analytics will be required for them to keep growing.
Google's reducing profits may be a sign that its advantage is coming to an end, which - if the view here is right about diminishing returns - does not augur well going forward. Also, as they warn:
if you're counting on it to make people much more predictable, you're expecting too much.
Quite. And yet, and yet...one more tweak...
Also, bear in mind there are some big impacts in pivotal areas. A small change in a competitive area like, say, churn can have tremendous impact, especially if one is in a zero sum game (eg mature mobile phone markets), and played over multiple cycles. For example, assume 2 companies with equal market share, both with 20% churn. A very simple simulation will show that if one player can get a sustained reduction of 1% of that 20% churn (i.e. to 19.8%) every month over, say, 36 cycles (3 years), that player ends up with 53.5% vs the other's reduced 46.5% share. That is a shift of 7 percentage points of market share - not a bad structural change in any saturated market; in fact shifts like that can drive competitors out.....
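For anyone who wants to play with this kind of two-player churn model, here is a minimal Python sketch. It assumes (my assumption, not stated in the post) that every customer who churns from one player moves straight to the other in the same cycle. The post does not spell out the model behind its simulation, so this sketch is not expected to reproduce the 53.5% / 46.5% figure; the outcome is very sensitive to where churned customers are assumed to go.

```python
def simulate_share(churn_a, churn_b, cycles, start_share_a=0.5):
    """Return player A's market share after `cycles` rounds of a two-player, zero-sum market.

    Assumption (not from the post): every customer who churns from one
    player moves straight to the other player in the same cycle.
    """
    share_a = start_share_a
    for _ in range(cycles):
        lost_by_a = share_a * churn_a        # customers leaving A for B
        lost_by_b = (1 - share_a) * churn_b  # customers leaving B for A
        share_a = share_a - lost_by_a + lost_by_b
    return share_a

# Hypothetical run: A trims its churn from 20% to 19.8%, B stays at 20%.
share_a = simulate_share(churn_a=0.198, churn_b=0.20, cycles=36)
share_b = 1 - share_a
```

Changing the switching rule (for example, letting churners pick a provider in proportion to current share) is a one-line change inside the loop, which makes it easy to test how much the result depends on that choice before weighing the costs against the benefits.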
The answer, as always, is to accurately understand the costs vs the benefits.
Friday, May 25. 2012
Facebook is in talks to buy Opera, the company behind the Opera web browser, PocketLint reports. Opera has both a mobile web browsing app and a desktop browser, and it's an alternative to Internet Explorer, Mozilla's Firefox browser, Google's Chrome browser and the Apple Safari browser. Opera says it has more than 270 million users on its browser. Additional sources told The Next Web that it's looking for buyers and currently has a hiring freeze.
It's an interesting strategy for 2 reasons:
(i) Facebook is already doing every sort of datamining it can think of (and probably that its Genetic Algorithms can think of), and apart from your credit card the next best way of knowing all about your intentions is your browser. That is why, by the way, there is no way I'll use Google Chrome (though it seems my privacy concerns are not a worry of the Great Masses out there, with Chrome now having the largest installed base).
I will immediately disconnect Opera from my Smartphone if Facebook get their hooks on it. And if they issue a credit card....
Tuesday, March 13. 2012
Big Data - Rough Cut Valuation
Interesting post from Nic Brisbourne, summarising a GigaOm article on Big Data case studies. Like Nic I am dubious about some areas so applied a classic 2x2 analysis (see diagram above) to parse where I thought value might lie. His summary is below, my comments are in italics:
I was interested when GigaOM put up a post this morning titled 10 ways big data changes everything. I read through the ten ‘case studies’, and summarised them below. I’ve put my opinion on the trend in italics after a summary of the GigaOM case study. There was, in my opinion, a lot of fluff in the examples they chose, and of the ten there were only two that really stood out to me as areas with the depth and breadth to be home to multiple successful startups, and they were business intelligence applications of big data and virtual assistants.
Interesting that it's now suddenly so popular: when we built the data analytic MDE in 2007, there was very little interest in deep analytics of the social media datastream; now there are (I heard at the FT Conference) "50 starving London startups looking for funding for Big Data".
Another point made by Nic is that:
With many years' experience in this game I'd say the potential is there, but often the culture, mindset and business models are too different/difficult for the company to easily assimilate them. Still, there is nothing like a spinoff.....
Creative Commons Licence
Original content in this work is licensed under a Creative Commons License