Thursday, March 24. 2016
Now who would have predicted this - Sarah Perez at TC:
The clue is Sarah's line "Given that this is the Internet" - it was an 800lb Gorilla "Given"
Cue a million human internet trolls thinking today "hey, with one of those I am unstoppable 24 x 7"
The problem AI has is that it potentially promises everything and the Hypesters run with that as fact, but in truth today's early AI systems are more like intelligences with savant syndrome - there's one thing they are taught to do very well, but they are completely clueless at anything else. DeepMind can beat Lee Sedol at Go, but it can't do any of the other things Lee can do (and it burns 50,000 times more energy in not doing it....). Tay can "learn" (aka parrot) what people tell it but cannot distinguish between deep truths and inflammatory statements.
Oh - one last thing - Tay is modelled on a teenage girl. Sheeesh.
To be fair this is not Tay's fault per se, that you have to lay at the hands of its "parents" - how did they not see this one coming (and, when it started, who was watching its first faltering steps on its first day out playing in the traffic?). Anyone who has been around the internet for more than a few months (or even days - Boaty McBoatface anyone) would have seen that this was a not improbable risk.
PS There is a deeper story here, about the reliance on machine intelligences before they are ready, and the damage they can do without anyone knowing. The thing about complex systems is that it's often very hard to find out when they are not working properly, as malfunction is often not obvious - so it is necessary to watch them very closely, not let them off to roam on the Interwebz like Tay.
Wednesday, April 30. 2014
Mindmap from Ric West (@RiczWest) of the event
I spoke last night at the Cybersalon event on "Reclaiming our Big Data". I must say I enjoyed all the contributions from the other panellists, who I felt were very well qualified to talk about this area:
Also very well facilitated by Wendy Grossman, and an interesting introduction from Eva Pascoe*
There was a rather good mindmap of the event (see picture above - and it's worth a thousand words, so I don't have to write any). As you can imagine there were opinions from all over the spectrum, but I did think there was a coalescence around one major issue over the evening from all - that the social understanding, regulations and various laws about privacy, damage restitution and rights to people's own data are not in any way ready for the impact of Big Data.
To my mind the major issue then, over the next few years, will be the "Wild West" that will occur until regulation, law and social understanding catch up. That there will be a Wild West cum Gold Rush is not disputable (though no doubt it will be), as there is just too much value in this arena (and too much naivety among the consumer/citizen base) for it not to. The question we need to address going forward is how wild we want it to be, and for how long. And the only way to minimise both in my view is to raise the understanding among those people whose data is being put into play - i.e. YOU! The recent care.data shenanigans are merely the start of this game, and set the tone of the way it will be played in this coming Wild West - that one was headed off in the nick of time for now, but not before the NHS sold off a lot of the data for a pittance on the quiet (See Ben Goldacre's summary of this over here).
(More detail on my previous notes over here)
*Oh yes - one more thing - Eva Pascoe, the event organiser, made a fascinating observation - that in the history of technology, the more cuddly the logo, the more sinister the company. You have been warned.
Friday, March 7. 2014
Some weeks ago I gave a talk about the "Dark Side of Open Data" at the Open Data Institute, where I predicted that the major beneficiaries of government data were not going to be private citizens, taxpayers, or enthusiastic small startups, but large enterprises with deep pockets and less than altruistic service models. The slide I used noted that history tells us any potential goldmine will be mined, and the obvious business model would be:
As to who would do this, the question I posed was "Which side are all the sharpest knives on?". No surprises then, that today I read in a McKinsey article on trends in Big Data that:
...there was a growing awareness, among participants, of the potential of tapping swelling reservoirs of external data—sometimes known as open data—and combining them with existing proprietary data to improve models and business outcomes. (See “What executives should know about open data.”) Hedge funds have been among the first to exploit a flood of newly accessible government data, correlating that information with stock-price movements to spot short-term investment opportunities.
Which immediately raises the question of whether, given that the government is giving away the data and the taxpayer is funding it, they should be letting it go for $0.00 rather than getting a better deal. I contend, in a world where companies such as Facebook, valued at c. $175bn, will pay $19bn for companies like Whatsapp primarily for their user data assets, that the answer is "no".
Another slide I put up was a rather perceptive comment by Jo Bates, of Manchester Metropolitan University, from 2012:
The current ‘transparency agenda’ [of the UK government, supported by prominent Open Data advocates] should be recognised as an initiative that also aims to enable the marketisation of public services, and this is something that is not readily apparent to the general observer.
The issue is that there is major asymmetry between those that stand to gain (a few corporations and companies) and those that stand to lose (citizens who have their data appropriated and misused with no recompense). That point is made loud and clear by the McKinsey news...and this is just the beginning, I'd predict. My last slide but one was about what I predict we will see for the next few years:
- The combination of enthusiasts who see no problems, and commercial interests who intend to make money from the exact problems it will cause, will ensure data will get out without adequate protections or safeguards, at low cost (to the buyers)
So it is no great surprise that hedge funds are early entrants, nor that this week news emerged that 13 years of UK health data had already been sold under the radar to insurance companies for a pittance (to be fair, it was sold for modelling purposes, but the fact remains no one had agreed their data should be sold).
However, there are signs of hope. Days after I gave my talk, the Health Secretary had to abandon plans to sell off health data after a vigorous public protest campaign (waged heavily by social media....) and days later decided they would not sell patient data to such customers. In fact, what looks like an early day charter emerged, as the Government promised to:
....provide "rock-solid" assurance to patients that confidential information will not be sold for commercial insurance purposes, the Department of Health said.
Reading the comments to that report though, it is clear that all the shenanigans and the backlash that finally brought the Government to this point have significantly reduced any trust that this new recommendation will actually be followed - especially as they are going to try yet again to change the law, to be able to make data accessible in a few months' time.
The other interesting event today was an abortion charity being heavily fined for being somewhat cavalier with people's data and giving it to a hacker. While it's a pity it's a charity, unless penalties for slack data care are pretty heavy there will be little incentive to look after people's data and it will be open season for hackers.
Wednesday, February 26. 2014
I haven't heard much about Prediction Markets for a while, but here is a new one - predicting Innovation - Innovation Excellence:
Prediction markets were popularized in James Surowiecki’s 2004 book, The Wisdom of Crowds. They are systems which forecast the outcome of projects or events based on how willing individuals are to buy “stock” in them. People buy shares in the topics they think will succeed. Each topic or event then gets a value similar to a stock market price. These prices can be interpreted as predictions of the likelihood of the event.
Much was predicted for Prediction Markets a few years back, but they faded from view as results were not as stellar as, er, predicted (especially in the US elections), but hope always burns. The reason is typically that the preconditions for them to work are ignored, i.e. that all choices must be made by a heterogeneous and fairly large number of people who are in no way influenced by one another or any common intrinsic factors.
If this can be pulled off in companies (or by companies crowdsourcing innovation) it will be very interesting.
One to watch.
Wednesday, February 19. 2014
From the BBC, a report on a group of universities trying to build a system that can counter social media borne rumours, lies and gossip. Pheme (named after the Greek goddess of Gossip) is a collaboration between five universities (Sheffield, Warwick, King's College London, Saarland in Germany and MODUL University Vienna) and four companies (ATOS in Spain, iHub in Kenya, Ontotext in Bulgaria and swissinfo.ch), led by Kalina Bontcheva from the University of Sheffield. Pheme will classify online rumours into four types:
Apparently different types of digital disingenuity leave their own type of digital footprints and can be recognized. The system will also look at the accounts spreading it and look for bots. The idea is then to search for information that is true from known sources and re-seed the stream of the original falsehood with "the truth". It will be ready late 2015 apparently.
The obvious flaw is if it can detect falsehoods, any half decent falsehood spreading system can detect it and re-seed the same trails. The other sad flaw is many people will rather believe a convenient lie than an uncomfortable truth. The war for the truth is about to be fought in the cyber-memespace to an unprecedented degree - I wonder if there will be a new subscience of memetics, called "phemetics".
Saturday, February 1. 2014
Twitter Internal Social Network
Ever since Billy Beane used algorithms to pick a winning baseball team from B-list talent (Moneyball), it's been clear that the "War for Talent" (and worship of the A lister) is going to be turned upside down by algorithms sooner or later. Here it comes...(Atlantic)
According to John Hausknecht, a professor at Cornell’s school of industrial and labor relations, in recent years the economy has witnessed a “huge surge in demand for workforce-analytics roles.” Hausknecht’s own program is rapidly revising its curriculum to keep pace. You can now find dedicated analytics teams in the human-resources departments of not only huge corporations such as Google, HP, Intel, General Motors, and Procter & Gamble, to name just a few, but also companies like McKee Foods, the Tennessee-based maker of Little Debbie snack cakes. Even Billy Beane is getting into the game. Last year he appeared at a large conference for corporate HR executives in Austin, Texas, where he reportedly stole the show with a talk titled “The Moneyball Approach to Talent Management.”
This will be interesting - my empirical view of "A list" talent is that it is too often nothing more than people picking similar people to themselves, so I have no doubt that once the algorithms get going a totally different picture of what "talent" looks like will emerge. As Malcolm Gladwell showed in Blink, when orchestras transitioned to “blind” auditions, in which each musician seeking a job performed from behind a screen, the proportion of women winning spots in the most-prestigious orchestras shot up fivefold, notably when they played instruments typically identified closely with men. I have no doubt that we will see similar, and that many "Non-Alpha-Male" characteristics will be seen as far more effective.
The Atlantic quotes the example of Xerox call centres (call centres have a LOT of measurable employee data) when they switched to an algorithm based approach in 2010, using:
....an online evaluation that incorporates personality testing, cognitive-skill assessment, and multiple-choice questions about how the applicant would handle specific scenarios that he or she might encounter on the job. An algorithm behind the evaluation analyzes the responses, along with factual information gleaned from the candidate’s application, and spits out a color-coded rating: red (poor candidate), yellow (middling), or green (hire away). Those candidates who score best, I learned, tend to exhibit a creative but not overly inquisitive personality, and participate in at least one but not more than four social networks, among many other factors.
This is exactly what you'll see when algorithms run wider - characteristics that people thought were important will turn out not to be, and vice versa.
I do have a concern though - the previous generations of predictive recruitment systems (like Psychometric testing etc) all came complete with their own brands of psychobabble, and the loop wasn't always closed - i.e. following through and testing the theories against the actual vs. predicted outcomes. I can see a risk emerging with, for example, company social networks (see this map of the Twitter internal comms net, one example of which is the picture at the top of the post) which are today being analysed to try and predict employee performance. In the early days people won't really understand what they are looking at, and will make some crap decisions about the required characteristics for "success" (and even what success looks like) until enough loops have been closed, and a large body of predictive data is gathered. We are already seeing some interestingly counter-intuitive effects - e.g. it's not the drunken party pic of people on Facebook that accurately predicts poor employees, but instead factors like a propensity to badmouth others. I wonder how many companies have nixed hard partying, high potential people over the last 10 years...
(There is another irony to this - Facebook - and thus dis-employment by Facebook - originated on campuses, yet the algorithms are starting to show that a university education is not a be all and end all either)
And of course, given the 'Net is always on, why not assess everyone all the time - a company called Gild, for example, has algorithms that monitor developers 24/7/365:
....begin by scouring the Web for any and all open-source code, and for the coders who wrote it. They evaluate the code for its simplicity, elegance, documentation, and several other factors, including the frequency with which it’s been adopted by other programmers. For code that was written for paid projects, they look at completion times and other measures of productivity. Then they look at questions and answers on social forums such as Stack Overflow, a popular destination for programmers seeking advice on challenging projects. They consider how popular a given coder’s advice is, and how widely that advice ranges.
The risk is that, just as Taylorism took "Scientific Management" too far and dehumanised work to such an extent that huge other social costs emerged (but were borne elsewhere than the companies that caused them), and it took half a century to peg it back, we may well have "Networked Management" turning the online world into a culture of Mechanical Turks, rather than the craft haven the NetUtopians dream of - with similar results.
Never mind Orwell's 1984 "Big Brother", by 2024* we could have a digital Brave New World, with people pre-sorted from Alpha to Epsilon by the time they come out of high school.
(*or 116 After Ford)
Tuesday, February 26. 2013
Talk tonight - "The Keys to the City of Knowledge" by Conrad Wolfram at Policy Exchange, calling for Computable Data. Once one got over the "Leader of company specialising in computable data search argues for more computable data to search" shocker, there were some interesting points made.
Firstly, he argued that computable data is about presenting the "data behind the data", for what he terms "citizen computation" - or "trying to get the answers, not other people's answers". Now this I agree with, as we note that Gresham's Law is increasingly applying to online data, i.e. bad data is driving out good. I also buy his argument that discussions about decisions are about "where the sliders are on data models", not the simple Black/White answers that pass for public and media debate. The issue I have with this though is the level of maths capability required of everyone - and in the UK at any rate, maths and the whole STEM area is relatively lowly valued among the university educated (and relatively lowly paid in the UK), never mind teaching much higher levels of maths skills to les autres. Wolfram argues that Maths needs to be taught differently in schools, and that Computable Data now is like computers in the Assembler days, and we need to get to the "Mac" layer of computable data fast.
But that was all by the by, what really did interest me in the talk were three other points he made:
1. The Value Chain of Knowledge - here are some notes I made:
Thus A Guiding Rule: Compositional knowledge >> dead information
2. What Data is most likely to emerge first? Mainly data that is either publicly funded, or largely yours, eg:
Publicly funded R&D - should be computable
Areas that are most likely to be early data sources are:
- Health - biggest gainer due to diagnosis improvement, which is inefficient & labour intensive. Sensor based medicine is coming. Also data on relative hospital performance (Tripadvisor for your bypass op, as Susan Calman may have put it)
He points out it is necessary to unpack simple metric data (exam results, school league tables), as they are both easy to game and not hugely informative. Work off computable data, not metrics.
Corporate information - a lot is missing (very private)
3. National Productivity - the Computable country needs a computable knowledge economy, which requires:
There was a rather fascinating angle on this, during the question phase. Essentially the discussion had moved to getting corporate data out from behind the firewall (he argues for a VRM-like ownership of your own data) but the point was made that privatisation is bad for Big Infrastructure and some other areas, so maybe a Computable Country should re-nationalise some areas as the data creates more value in a public entity than in a private one*.
Now THAT is an interesting argument, if one started to calculate the value of the chained data....the economics of Open Data may move from "Interesting" to "Critical"
(*Somehow I don't think the Policy Exchange would have intended this, being a right wing think tank)
Friday, February 1. 2013
Broadstuff's riff on Prof David De Roure of OeRC's Web Science slide.
I attended the WebScience Trust event yesterday on Data Observatories, a very "Motley Crew" (As Dame Wendy Hall put it) of people who are active in the space of Web data analytics etc. It was a good session as it had representation from various academics studying the area from Oxford, Warwick, UCL, Cardiff Universities etc, a number of non-profits, research groups, and a few companies operating in the area (like us). I took copious notes.
Apart from being fascinating to see how many ways so many other people are attacking this emerging and extremely varied area, my main takeaways of the day were:
One person suggested we needed to derive a "3 Laws of Robotics" for web data collection and analysis companies. Amen to that!
Also, it was interesting to see not just the mix of hard scientists and "soft" scientists, but the segue of hard scientists doing soft science, soft scientists doing hard science, etc - a Motley Crew indeed....
Wednesday, November 7. 2012
Friday, October 19. 2012
Interesting article in HBR over here, implying that "big set" data analysis reaches limitations to its effectiveness fairly fast:
Firstly, remember the Netflix competition to improve their algorithm:
Five years ago, the company launched a competition to improve on the Cinematch algorithm it had developed over many years. It released a record-large (for 2007) dataset, with about 480,000 anonymized users, 17,770 movies, and user/movie ratings ranging from 1 to 5 (stars). Before the competition, the error of Netflix's own algorithm was about 0.95 (using a root-mean-square error, or RMSE, measure), meaning that its predictions tended to be off by almost a full "star." The Netflix Prize of $1 million would go to the first algorithm to reduce that error by just 10%, to about 0.86.
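For anyone not familiar with the RMSE measure in that quote, here is a minimal sketch of how it is calculated. The ratings below are made-up numbers for illustration only (not Netflix data), and the function name is my own.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual star ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical predictions vs. the stars users actually gave
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4, 3, 2, 4]
error = rmse(predicted, actual)

# The prize target: a 10% reduction on a baseline error of about 0.95
baseline = 0.95
target = baseline * (1 - 0.10)

print(f"RMSE on the toy data: {error:.3f}")
print(f"Prize target RMSE: {target:.3f}")
```

Because the errors are squared before averaging, a few badly wrong predictions weigh more heavily than many slightly wrong ones.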
I recall the guys at a UK Netflix lookalike, LoveFilm, telling me that about 5 factors got the 80/20 prediction, so there was clearly a massive falling off in effectiveness as data analysis complexity increased.
But that is predicting demand, so what about retention - is this any easier? After all, one should have bucketloads of data and lots of historical nous dealing with one's customers. It would appear not:
A study [pdf here] that Brij Masand and I [Gregory Piatetsky-Shapiro] conducted would suggest the answer is no. We looked at some 30 different churn-modeling efforts in banking and telecom, and surprisingly, although the efforts used different data and different modeling algorithms, they had very similar lift curves. The lists of top 1% likely defectors had a typical lift of around 9-11. Lists of top 10% defectors all had a lift of about 3-4. Very similar lift curves have been reported in other work. (See here and here.) All this suggests a limiting factor to prediction accuracy for consumer behavior such as churn.
(Lift is the ratio of the churn rate within the list picked by the "Big Data" analysis to the average churn rate, so if a "Big Data" algorithm predicts a list of customers that has 20% of actual churners in it, vs an average churn of 2%, that is a "Lift" of (20/2) = 10. That still means the list is 80% wrong though.)
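The lift arithmetic above can be sketched in a few lines. The function name and the list size of 100 are my own assumptions for illustration; the 20% and 2% figures are the ones from the example.

```python
def lift(churners_in_list, list_size, base_churn_rate):
    """Churn rate inside the predicted list divided by the overall churn rate."""
    list_churn_rate = churners_in_list / list_size
    return list_churn_rate / base_churn_rate

# A predicted list of 100 customers, 20 of whom actually churn,
# against an average churn of 2% across the whole customer base
churners, size, base_rate = 20, 100, 0.02

example_lift = lift(churners, size, base_rate)
wrong_share = 1 - churners / size  # share of the list that did not churn

print(f"Lift: {example_lift:.1f}")
print(f"Share of the list that was wrong: {wrong_share:.0%}")
```

So a high lift and a mostly wrong list can happily coexist, which is the point of the aside.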
And how about predicting Ad effectiveness?
The average CTR% [Click Through Rate] for display ads has been reported as low as 0.1-0.2%. Behavioral and targeted advertising have been able to improve on that significantly, with researchers reporting up to seven-fold improvements. But note that a seven-fold improvement from 0.2% amounts to 1.4% — meaning that today's best targeted advertising is ignored 98.6% of the time.
(Actually, 0.1% sounds high to me, I'd think it was almost an order of magnitude lower nowadays)
Interestingly the article predicts Big Data will help more in the emerging services:
Big data analytics can improve predictions, but the biggest effects of big data will be in creating wholly new areas. Google, for example, can be considered one of the first successes of big data; the fact of its growth suggests how much value can be produced. While analytics may be a small part of its overall code, Google's ability to target ads based on queries is responsible for over 95% of its revenue. Social networks, too, will rely on big data to grow and prosper. The success of Facebook, Twitter, and LinkedIn social networks depends on their scale, and big data tools and analytics will be required for them to keep growing.
Google's reducing profits may be a sign that its advantage is coming to an end, which - if the view here is right about diminishing returns - does not augur well going forward. Also, as they warn:
if you're counting on it to make people much more predictable, you're expecting too much.
Quite. And yet, and yet...one more tweak...
Also, bear in mind there are some big impacts in pivotal areas. A small change in a competitive area like, say, churn can have tremendous impact, especially if one is in a zero sum game (eg mature mobile phone markets), and played over multiple cycles. For example, assume 2 companies with equal market share, both with 20% churn. A very simple simulation will show that if one player can get a sustained reduction of 1% of that 20% churn (i.e. to 19.8%) over, say, 36 monthly cycles (3 years), that player ends up with 53.5% vs the other's reduced 46.5% share - a shift of 7 percentage points of market share - not a bad structural change in any saturated market, in fact shifts like that can drive competitors out.....
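For anyone wanting to play with this, below is a minimal sketch of one possible zero-sum churn model. It is an assumption on my part, not the simulation referred to above: every customer who churns from one player is assumed to land with the other, once per cycle. The post does not spell out its model, so this sketch should not be expected to reproduce the 53.5% / 46.5% split quoted; the outcome depends heavily on assumptions such as where churners go and how churn compounds.

```python
def simulate_shares(churn_a, churn_b, cycles, share_a=0.5):
    """Two-player zero-sum market: each cycle, every churner switches to the rival.

    churn_a and churn_b are per-cycle churn rates (fractions of each
    player's own base); share_a is player A's starting market share.
    """
    share_b = 1.0 - share_a
    for _ in range(cycles):
        lost_by_a = share_a * churn_a  # customers leaving A for B
        lost_by_b = share_b * churn_b  # customers leaving B for A
        share_a = share_a - lost_by_a + lost_by_b
        share_b = share_b - lost_by_b + lost_by_a
    return share_a, share_b

# Player A trims churn from 20% to 19.8%; player B stays at 20%
share_a, share_b = simulate_shares(churn_a=0.198, churn_b=0.20, cycles=36)
print(f"A: {share_a:.2%}  B: {share_b:.2%}")
```

Changing the churn rates, the number of cycles or the switching rule is the quickest way to see how sensitive the end shares are to the model chosen.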
The answer, as always, is to accurately understand the costs vs the benefits.
Creative Commons Licence
Original content in this work is licensed under a Creative Commons License