Archive
Cascade: Crowdsourcing Taxonomy Creation
Cascade: Crowdsourcing Taxonomy Creation
Source: Microsoft Research
Taxonomies are a useful and ubiquitous way of organizing information. However, creating organizational hierarchies is difficult because the process requires a global understanding of the objects to be categorized. Usually one is created by an individual or a small group of people working together for hours or even days. Unfortunately, this centralized approach does not work well for the large, quickly-changing datasets found on the web. Cascade is an automated workflow that creates a taxonomy from the collective efforts of crowd workers who spend as little as 20 seconds each. We evaluate Cascade and show that on three datasets its quality is 80-90% of that of experts. The cost of Cascade is competitive with expert information architects, despite taking six times more human labor. Fortunately, this labor can be parallelized such that Cascade will run in as fast as five minutes instead of hours or days.
Enhancing Personalized Search by Mining and Modeling Task Behavior
Enhancing Personalized Search by Mining and Modeling Task Behavior
Source: Microsoft Research
Personalized search systems tailor search results to the current user intent using historic search interactions. This relies on being able to find pertinent information in the user’s search history, which can be challenging for unseen queries and for new search scenarios. Building richer models of users’ current and historic search tasks can help improve the likelihood of finding relevant content and enhance the relevance and coverage of personalization methods. The task-based approach can be applied to the current user’s search history, or as we focus on here, all users’ search histories as so-called “groupization” (a variant of personalization whereby other users’ profiles can be used to personalize the search experience). We describe a method whereby we mine historic search-engine logs to find other users performing similar tasks to the current user and leverage their on-task behavior to identify Web pages to promote in the current ranking. We investigate the effectiveness of this approach versus query-based matching and finding related historic activity from the current user (i.e., group vs. individual). As part of our studies we also explore the use of the on-task behavior of particular user cohorts, such as people who are more expert in the current topic, rather than all users, with potentially-promising results. Our findings have direct implications for improving personalization in Web search engines.
Unified Entity Search in Social Media Community
Unified Entity Search in Social Media Community
Source: Microsoft Research
The search for entities is the most common search behavior on the Web, especially in social media communities where entities (such as images, videos, people, locations, and tags) are highly heterogeneous and correlated. While previous research usually deals with these social media entities separately, we are investigating in this paper a unified, multilevel, and correlative entity graph to represent the unstructured social media data, through which various applications (e.g., friend suggestion, personalized image search, image tagging, etc.) can be realized more effectively in one single framework. We regard the social media objects equally as “entities” and all of these applications as “entity search” problem which searches for entities with different types. We first construct a multi-level graph which organizes the heterogeneous entities into multiple levels, with one type of entities as vertices in each level. The edges between graphs pairwisely connect the entities weighted by intra-relations in the same level and inter-links across two different levels distilled from the social behaviors (e.g., tagging, commenting, and joining communities). To infer the strength of intrarelations, we propose a circular propagation scheme, which reinforces the mutual exchange of information across different entity types in a cyclic manner. Based on the constructed unified graph, we explicitly formulate entity search as a global optimization problem in a unified Bayesian framework, in which various applications are efficiently realized. Empirically, we validate the effectiveness of our unified entity graph for various social media applications on millionscale real-world dataset.
Formal Methods for Computer-Aided STEM Education
Formal Methods for Computer-Aided STEM Education
Source: Microsoft Research
Providing personalized and interactive education (as in one-on-one tutoring) remains an unsolved problem for standard classrooms. The arrival of Massive Open Online Courses (MOOCs), while having provided a unique opportunity to share quality instruction with massive number of students, only exacerbate this problem with an even higher student to teacher ratio. We believe that automated intelligent tutoring can play a revolutionary role in this context. In this article, we motivate four problem definitions, namely problem generation, solution generation, feedback generation, and content authoring in the context of intelligent tutoring. We describe how formal methods can play a useful role in addressing these problems. We present some recent results that have been applied to a variety of STEM subject domains (including logic, automata, programming, arithmetic, algebra, and geometry) as illustrative examples.
Querying Encrypted Data (Tutorial)
Querying Encrypted Data (Tutorial)
Source: Microsoft Research
Data security is a serious concern when we migrate data to a cloud DBMS. Database encryption, where sensitive columns are encrypted before they are stored in the cloud, has been proposed as a mechanism to address such data security concerns. The intuitive expectation is that an adversary cannot “learn” anything about the encrypted columns, since she does not have access to the encryption key. However, query processing becomes a challenge since it needs to “look inside” the data. This tutorial explores the space of designs studied in prior work on processing queries over encrypted data. We cover approaches based on both classic client-server and involving the use of a trusted hardware module where data can be securely decrypted. We discuss the security challenges that arise in both approaches and how they may be addressed. Briefly, supporting the full complexity of a modern DBMS including complex queries, transactions and stored procedures leads to significant challenges that we survey.
Some Evidence for Impact of Limited Education on Hierarchical User Interface Navigation
Some Evidence for Impact of Limited Education on Hierarchical User Interface Navigation
Source: Microsoft Research
One of the greatest challenges in designing applications for economically poor communities is that potential users may have little or no education. We investigated how limited education appears to impact the ability to navigate a hierarchical UI, even when it has no text. We scored 60 participants from low-income communities in India using tests of textual literacy and Raven’s Progressive Matrices. These were used as proxies for educational level and a subset of cognitive abilities. We then evaluated participants’ performance on a UI task involving hierarchical navigation. First, our results confirm that textual literacy is correlated with scores on the Raven’s test. In addition, we found that performance on both instruments are predictive of performance in navigating UI hierarchies, even when the UI is text-free. This provides statistically significant confirmation of previous anecdotal hypotheses. We conclude with design recommendations for UI hierarchies for people with limited education.
Speech-Centric Information Processing: An Optimization-Oriented Approach
Speech-Centric Information Processing: An Optimization-Oriented Approach
Source: Microsoft Research
Automatic speech recognition is a central and common component of voice-driven information processing systems in human language technology, including spoken language translation, spoken language understanding, voice search, spoken document retrieval, and so on. Interfacing speech recognition with its downstream text-based processing tasks of translation, understanding, and information retrieval creates both challenges and opportunities in optimal design of the combined, speech-enabled systems. We present an optimization-oriented statistical framework for the overall system design where the interactions between the sub-systems in tandem are fully incorporated and where design consistency is established between the optimization objectives and the end-to-end system performance metrics. Techniques for optimizing such objectives in both the decoding and learning phases of the speech-centric information processing system design are described, in which the uncertainty in speech recognition sub-system’s outputs is fully considered and marginalized. This paper provides an overview of the past and current work in this area. Future challenges and new opportunities are also discussed and analyzed.
Joint Optimization of Bid and Budget Allocation in Sponsored Search
Joint Optimization of Bid and Budget Allocation in Sponsored Search
Source: Microsoft Research
This paper is concerned with the joint allocation of bid price and campaign budget in sponsored search. In this application, an advertiser can create a number of campaigns and set a budget for each of them. In a campaign, he/she can further create several ad groups with bid keywords and bid prices. Data analysis shows that many advertisers are dealing with a very large number of campaigns, bid keywords, and bid prices at the same time, which poses a great challenge to the optimality of their campaign management. As a result, the budgets of some campaigns might be too low to achieve the desired performance goals while those of some other campaigns might be wasted; the bid prices for some keywords may be too low to win competitive auctions while those of some other keywords may be unnecessarily high. In this paper, we propose a novel algorithm to automatically address this issue. In particular, we model the problem as a constrained optimization problem, which maximizes the expected advertiser revenue subject to the constraints of the total budget of the advertiser and the ranges of bid price change. By solving this optimization problem, we can obtain an optimal budget allocation plan as well as an optimal bid price setting. Our simulation results based on the sponsored search log of a commercial search engine have shown that by employing the proposed method, we can effectively improve the performances of the advertisers while at the same time we also see an increase in the revenue of the search engine. In addition, the results indicate that this method is robust to the second-order effects caused by the bid fluctuations from other advertisers.
Fixity: Identity, Time and Durée on Facebook
Fixity: Identity, Time and Durée on Facebook
Source: Microsoft Research
The purpose of social network services (SNS) is to enable new ways of making contact and staying in touch. The finessed use of SNS can enable people to manage their social connections with fluidity; enabling change of social grouping and evolving identity. Key to this performance is that it is enacted through time. Certain aspects of SNS may of course create a fixing in identity and its performance, trapping people, for example, in a display of identity in the past that they have come to regret. In this paper, we shall report evidence that suggests that the temporal experiencing of Facebook with regard to this aspect of time and identity needs to be placed alongside another feature of the way the service is used. This leads people to feel as if they are always acting ‘in the now’ and that their history – as well as that of others they connect to – seems to disappear from view. We shall suggest that the performance of identity through time is thus constrained. Users seek but cannot find adequate ways of adjusting their identity by crafting past and future performances outside the envelope of identity in the present, in the ‘now’, the one facilitated and emphasized by Facebook design and use patterns.
Discovering Regions of Different Functions in a City Using Human Mobility and POIs
Discovering Regions of Different Functions in a City Using Human Mobility and POIs
Source: Microsoft Research
The development of a city gradually fosters different functional regions, such as educational areas and business districts. In this paper, we propose a framework (titled DRoF) that discovers Regions of different Functions in a city using both human mobility among regions and points of interests (POIs) located in a region. Specifically, we segment a city into disjointed regions according to major roads, such as highways and urban express ways. We infer the functions of each region using a topic-based inference model, which regards a region as a document, a function as a topic, categories of POIs (e.g., restaurants and shopping malls) as metadata (like authors, affiliations, and key words), and human mobility patterns (when people reach/leave a region and where people come from and leave for) as words. As a result, a region is represented by a distribution of functions, and a function is featured by a distribution of mobility patterns. We further identify the intensity of each function in different locations. The results generated by our framework can benefit a variety of applications, including urban planning, location choosing for a business, and social recommendations. We evaluated our method using large-scale and real-world datasets, consisting of two POI datasets of Beijing (in 2010 and 2011) and two 3-month GPS trajectory datasets (representing human mobility) generated by over 12,000 taxicabs in Beijing in 2010 and 2011 respectively. The results justify the advantages of our approach over baseline methods solely using POIs or human mobility.
Keeping Information Safe from Social Networking Apps
Keeping Information Safe from Social Networking Apps
Source: Microsoft Research
The ability of third-party applications to aggregate and repurpose personal data is a fundamental privacy weakness in today’s social networking platforms. Prior work has proposed sandboxing in a hosted cloud infrastructure to prevent leakage of user information. In this paper, we extend simple sandboxing to allow sharing of information among friends in a social network, and to help application developers securely aggregate user data according to differential privacy properties. Enabling these two key features requires preventing, among other subtleties, a new “Kevin Bacon” attack aimed at aggregating private data through a social network graph. We describe the significant architectural and security implications for the application framework in the
Progressive authentication: deciding when to authenticate on mobile phones
Progressive authentication: deciding when to authenticate on mobile phones
Source: Microsoft Research
Mobile users are often faced with a trade-off between security and convenience. Either users do not use any security lock and risk compromising their data, or they use security locks but then have to inconveniently authenticate every time they use the device. Rather than exploring a new authentication scheme, we address the problem of deciding when to surface authentication and for which applications. We believe reducing the number of times a user is requested to authenticate lowers the barrier of entry for users who currently do not use any security. Progressive authentication, the approach we propose, combines multiple signals (biometric, continuity, possession) to determine a level of confidence in a user’s authenticity. Based on this confidence level and the degree of protection the user has configured for his applications, the system determines whether access to them requires authentication. We built a prototype running on modern phones to demonstrate progressive authentication and used it in a lab study with nine users. Compared to the state-of-theart, the system is able to reduce the number of required authentications by 42% and still provide acceptable security guarantees, thus representing an attractive solution for users who do not use any security mechanism on their devices.
Measuring and Fingerprinting Click-Spam in Ad Networks
Measuring and Fingerprinting Click-Spam in Ad Networks
Source: Microsoft Research
Advertising plays a vital role in supporting free websites and smartphone apps. Click-spam, i.e., fraudulent or invalid clicks on online ads where the user has no actual interest in the advertiser’s site, results in advertising revenue being misappropriated by click-spammers. While ad networks take active measures to block click-spam today, the effectiveness of these measures is largely unknown. Moreover, advertisers and third parties have no way of independently estimating or defending against click-spam.
In this paper, we take the first systematic look at click-spam. We propose the first methodology for advertisers to independently measure click-spam rates on their ads. We also develop an automated methodology for ad networks to proactively detect different simultaneous click-spam attacks. We validate both methodologies using data from major ad networks. We then conduct a large-scale measurement study of click-spam across ten major ad networks and four types of ads. In the process, we identify and perform in-depth analysis on seven ongoing click-spam attacks not blocked by major ad networks at the time of this writing. Our findings highlight the severity of the click-spam problem, especially for mobile ads.
Goldilocks and the Two Mobile Devices: Going Beyond All-Or-Nothing Access to a Device’s Applications
Goldilocks and the Two Mobile Devices: Going Beyond All-Or-Nothing Access to a Device’s Applications
Source: Microsoft Research
Most mobile phones and tablets support only two access control device states: locked and unlocked. We investigated how well all-or-nothing device access control meets the need of users by interviewing 20 participants who had both a smartphone and tablet. We find all-or-nothing device access control to be a remarkably poor fit with users’ preferences. On both phones and tablets, participants wanted roughly half their applications to be available even when their device was locked and half protected by authentication. We also solicited participants’ interest in new access control mechanisms designed specifically to facilitate device sharing. ; Fourteen participantsa majority (14 out of 20) preferred these controls to existing security locks alone. Finally, we gauged participants’ interest in using face and voice biometrics to authenticate to their mobile phone and tablets; participants were surprisingly receptive to biometrics, given that they were also aware of security and reliability limitations.
Learning from GPS Data for Mobile Recommendation
Learning from GPS Data for Mobile Recommendation
Source: Microsoft Research
With the increasing popularity of location-based services, we have accumulated a lot of location data on the Web. In this paper, we are interested in answering two popular location-related queries in our daily life: 1) if we want to do something such as sightseeing or dining in a large city like Beijing, where should we go? 2) If we want to visit a place such as the Bird’s
Nest in Beijing Olympic park, what can we do there? We develop a mobile recommendation system to answer these queries. In our system, we first model the users’ location and activity histories as a user-location-activity rating tensor1. Because each user has limited data, the resulting rating tensor is essentially very sparse. This makes our recommendation task difficult.
In order to address this data sparsity problem, we propose three algorithms2 based on collaborative filtering. The first algorithm merges all the users’ data together, and uses a collective matrix factorization model to provide general recommendation [3]. The second algorithm treats each user differently and uses a collective tensor and matrix factorization model to provide personalized recommendation [4]. The third algorithm is a new algorithm which further improves our previous two algorithms by using a ranking-based collective tensor and matrix factorization model. Instead of trying to predict the missing entry values as accurately as possible, it focuses on directly optimizing the ranking loss w.r.t. user preferences on the locations and activities. Therefore, it is more consistent with our ultimate goal of ranking locations/activities for recommendations. For these three algorithms, we also exploit some additional information, such as user-user similarities, location features, activity-activity correlations and user-location preferences, to help the CF tasks. We extensively evaluate our algorithms using a real-world GPS dataset collected by 119 users over 2.5 years. We show that all our three algorithms can consistently outperform the competing baselines, and our newly proposed third algorithm can also outperform our other two previous algorithms.
+ Full Paper (PDF)
An Operating System for the Home
Network devices for the home such as remotely controllable locks, lights, thermostats, cameras, and motion sensors are now readily available and inexpensive. In theory, this enables scenarios like remotely monitoring cameras from a smartphone or customizing climate control based on occupancy patterns. However,in practice today, such smarthome scenarios are limited to expert hobbyists and the rich because of the high overhead of managing and extending current technology. We present HomeOS, a platform that bridges this gap by presenting users and developers with a PC-like abstraction for technology in the home. It presents network devices as peripherals with abstract interfaces, enables cross-device tasks via applications written against these interfaces, and gives users a management interface designed for the home environment. HomeOS already has tens of applications and supports a wide range of devices. It has been running in 12 real homes for 4–8 months, and 42 students have built new applications and added support for additional devices independent of our efforts.
+ Full Paper (PDF)
An Untold Story of Middleboxes in Cellular Networks
An Untold Story of Middleboxes in Cellular Networks (PDF)
Source: University of Michigan and Microsoft Research
The use of cellular data networks is increasingly popular as network coverage becomes more ubiquitous and many diverse usercontributed mobile applications are available. The growing cellular traffic demand means that cellular network carriers are facing greater challenges to provide users with good network performance and energy ef?ciency, while protecting networks from potential attacks. To better utilize their limited network resources while securing the network and protecting client devices, the carriers have already deployed various network policies that influence traffic behavior. Today, these policies are mostly opaque, though they directly impact application designs and may even introduce network vulnerabilities.
We present NetPiculet, the first tool that unveils carriers’ NAT and firewall policies by conducting intelligent measurement. By running NetPiculet in the major U.S. cellular providers as well as deploying it as a smartphone application in the wild in more than 100 cellular ISPs, we identified the key NAT and firewall policies which have direct implications on performance, energy, and security. For example, NAT boxes and firewalls set timeouts for idle TCP connections, which sometimes cause significant energy waste on mobile devices. Although most carriers today deploy sophisticated firewalls, they are still vulnerable to various attacks such as battery draining and denial of service. These findings can inform developers in optimizing the interaction between mobile applications and cellular networks and also guide carriers in improving their network con?gurations.
Ameliorating Buyer’s Remorse
Ameliorating Buyer’s Remorse
Source: Microsoft Research
Keeping in pace with the increasing importance of commerce conducted over the Web, several e-commerce websites now provide admirable facilities for helping consumers decide what product to buy and where to buy it. However, since the prices of durable and high-tech products generally fall over time as firms continually introduce products that have enhanced features, a buyer of such products is often faced with a dilemma: Should she buy the product now or wait for cheaper prices?
We present the design and implementation of Prodcast, an experimental system whose goal is to help consumers decide when to buy a product. The system makes use of forecasts of future prices based on price histories of the products, incorporating features such as sales volume, seasonality, and competition in making its recommendation. We describe techniques that are well-suited for this task and present a comprehensive evaluation of their relative merits using retail sales data for electronic products. Our back-testing of the system indicates that the system is capable of helping consumers time their purchase, resulting in significant savings to them.
+ Full Paper (PDF)