New release of Cloud Storage Connector for Hadoop: Improving performance, throughput and more

We’re pleased to announce a new version of the Cloud Storage Connector for Hadoop (also known as GCS Connector), which makes it even easier to substitute your Hadoop Distributed File System (HDFS) with Cloud Storage. This new release can give you increased throughput efficiency for columnar file formats such as Parquet and ORC, isolation for Cloud Storage directory modifications, and overall big data workload performance improvements, like lower latency, increased parallelization, and intelligent defaults.

The Cloud Storage Connector is an open source Java client library that runs in Hadoop JVMs (like data nodes, mappers, reducers, Spark executors, and more) and allows your workloads to access Cloud Storage. The connector lets your big data open source software [such as Hadoop and Spark jobs, or the Hadoop Compatible File System (HCFS) CLI] read/write data directly to Cloud Storage, instead of to HDFS. Storing data in Cloud Storage has several benefits over HDFS:

Significant cost reduction as compared to a long-running HDFS cluster with three replicas on persistent disks
Separation of storage from compute, allowing you to grow each layer independently
Persisting the storage even after Hadoop clusters are terminated
Sharing Cloud Storage buckets between ephemeral Hadoop clusters
No storage administration overhead, like managing upgrades and high availability for HDFS

The Cloud Storage Connector’s source code is completely open source and is supported by Google Cloud Platform (GCP). The connector comes pre-configured in Cloud Dataproc, GCP’s managed Hadoop and Spark offering. However, it is also easily installed and fully supported for use in other Hadoop distributions such as MapR, Cloudera, and Hortonworks. This makes it easy to migrate on-prem HDFS data to the cloud or burst workloads to GCP. The open source aspect of the Cloud Storage Connector allowed Twitter’s engineering team to closely collaborate with us on the design, implementation, and productionizing of the fadvise and cooperative locking features at petabyte scale.

Cloud Storage Connector architecture
Here’s a look at what the Cloud Storage Connector architecture looks like. Cloud Storage Connector is an open source Apache 2.0 implementation of an HCFS interface for Cloud Storage. Architecturally, it is composed of four major components:

gcs—implementation of the Hadoop Distributed File System and input/output channels
util-hadoop—common (authentication, authorization) Hadoop-related functionality shared with other Hadoop connectors
gcsio—high-level abstraction of Cloud Storage JSON API
util—utility functions (error handling, HTTP transport configuration, etc.) used by gcs and gcsio components

In the following sections, we highlight a few of the major features in this new release of Cloud Storage Connector. For a full list of settings and how to use them, check out the newly published Configuration Properties and gcs-core-default.xml settings pages. Here are the key new features of the Cloud Storage Connector:

Improved performance for Parquet and ORC columnar formats
As part of Twitter’s migration of Hadoop to Google Cloud, in mid-2018 Twitter started testing big data SQL queries against columnar files in Cloud Storage at massive scale, against a 20+ PB dataset. Since the Cloud Storage Connector is open source, Twitter prototyped the use of range requests to read only the columns required by the query engine, which increased read efficiency. We incorporated that work into a more generalized fadvise feature.
In previous versions of the Cloud Storage Connector, reads were optimized for MapReduce-style workloads, where all data in a file was processed sequentially. However, modern columnar file formats such as Parquet or ORC are designed to support predicate pushdown, allowing the big data engine to intelligently read only the chunks of the file (columns) that are needed to process the query. The Cloud Storage Connector now fully supports predicate pushdown, and only reads the bytes requested by the compute layer. This is done by introducing a technique known as fadvise.

You may already be familiar with the fadvise feature in Linux. Fadvise allows applications to provide a hint to the Linux kernel with the intended I/O access pattern, indicating how they intend to read a file, whether for sequential scans or random seeks. This lets the kernel choose appropriate read-ahead and caching techniques to increase throughput or reduce latency.

The new fadvise feature in Cloud Storage Connector implements similar functionality and automatically detects (in the default auto mode) whether the current big data application’s I/O access pattern is sequential or random. In the default auto mode, fadvise starts by assuming a sequential read pattern, but then switches to random mode upon detection of a backward seek or a long forward seek. These seeks are performed by the position() method call and can change the current channel position backward or forward. Any backward seek triggers the mode change to random; however, a forward seek needs to be greater than 8 MB (configurable via fs.gs.inputstream.inplace.seek.limit). The read pattern transition (from sequential to random) in fadvise’s auto mode is stateless and gets reset for each new file read session.

Fadvise can be configured via the gcs-core-default.xml file with the fs.gs.inputstream.fadvise parameter:

AUTO (default), also called adaptive range reads—In this mode, the connector starts in SEQUENTIAL mode, but switches to RANDOM as soon as a backward seek, or a forward seek greater than fs.gs.inputstream.inplace.seek.limit bytes (8 MiB by default), is detected.
RANDOM—The connector will send bounded range requests to Cloud Storage; Cloud Storage read-ahead will be disabled.
SEQUENTIAL—The connector will send a single, unbounded streaming request to Cloud Storage to read an object from a specified position sequentially.

In most use cases, the default setting of AUTO should be sufficient: it dynamically adjusts the mode for each file read. However, you can hard-set the mode.

Ideal use cases for fadvise in RANDOM mode include:

SQL (Spark SQL, Presto, Hive, etc.) queries into columnar file formats (Parquet, ORC, etc.) in Cloud Storage
Random lookups by a database system (HBase, Cassandra, etc.) to storage files (HFile, SSTables) in Cloud Storage

Ideal use cases for fadvise in SEQUENTIAL mode include:

Traditional MapReduce jobs that scan entire files sequentially
DistCp file transfers
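As an illustration, here’s a minimal PySpark sketch that hard-sets RANDOM mode for a columnar SQL workload like those above; the bucket path and column names are hypothetical, and the connector property is passed through Spark’s spark.hadoop.* prefix:

from pyspark.sql import SparkSession

# Minimal sketch; the bucket path and column names are hypothetical.
spark = (
    SparkSession.builder
    .appName("columnar-reads-on-gcs")
    # AUTO (the default) adapts per file; RANDOM forces bounded range requests.
    .config("spark.hadoop.fs.gs.inputstream.fadvise", "RANDOM")
    .getOrCreate()
)

# With predicate pushdown, only the needed Parquet column chunks are fetched
# from Cloud Storage as range requests instead of a single streaming read.
events = spark.read.parquet("gs://my-bucket/events/")
events.where("event_date = '2019-08-01'").select("user_id", "latency_ms").show()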
Cooperative locking: Isolation for Cloud Storage directory modifications
Another major addition to Cloud Storage Connector is cooperative locking, which isolates directory modification operations performed through the Hadoop file system shell (hadoop fs command) and other HCFS API interfaces to Cloud Storage. Although Cloud Storage is strongly consistent at the object level, it does not natively support directory semantics. For example, what should happen if two users issue conflicting commands (delete vs. rename) to the same directory? In HDFS, such directory operations are atomic and consistent. So Joep Rottinghuis, leading the @TwitterHadoop team, worked with us to implement cooperative locking in Cloud Storage Connector. This feature prevents data inconsistencies during conflicting directory operations to Cloud Storage, facilitates recovery of any failed directory operations, and simplifies operational migration from HDFS to Cloud Storage.

With cooperative locking, concurrent directory modifications that could interfere with each other, like a user deleting a directory while another user is trying to rename it, are safeguarded. Cooperative locking also supports recovery of failed directory modifications (where a JVM might have crashed mid-operation) via the FSCK command, which can resume or roll back the incomplete operation.

With this cooperative locking feature, you can perform isolated directory modification operations, using the hadoop fs commands as you normally would to move or delete a folder. To recover failed directory modification operations performed with cooperative locking enabled, use the included FSCK tool, which will recover (roll back or roll forward) all failed directory modification operations based on the operation log.

The cooperative locking feature is intended to be used by human operators when modifying Cloud Storage directories through the hadoop fs interface. Since the underlying Cloud Storage system does not support locking, this feature should be used cautiously for use cases beyond directory modifications (such as when a MapReduce or Spark job modifies a directory).

Cooperative locking is disabled by default. To enable it, either set the fs.gs.cooperative.locking.enable Hadoop property to true in core-site.xml or specify it directly in your hadoop fs command.

How cooperative locking works
Here’s what a directory move with cooperative locking looks like. Cooperative locking is implemented via atomic lock acquisition in the lock file (_lock/all.lock) using Cloud Storage preconditions. Before each directory modification operation, the Cloud Storage Connector atomically acquires a lock in this bucket-wide lock file.

Additional operational metadata is stored in *.lock and *.log files in the _lock directory at the root of the Cloud Storage bucket. The list of files to modify is stored in a per-operation *.log file, and additional lock metadata in a per-operation *.lock file. This per-operation lock file is used for lock renewal and for checkpointing operation progress.

The acquired lock will automatically expire if it is not periodically renewed by the client. The timeout interval can be modified via the fs.gs.cooperative.locking.expiration.timeout.ms setting.

Cooperative locking supports isolation of directory modification operations only within the same Cloud Storage bucket, and does not support directory moves across buckets.

Note: Cooperative locking is a Cloud Storage Connector feature, and it is not implemented by gsutil, Object Lifecycle Management, or applications directly using the Cloud Storage API.
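As an example of the per-command option mentioned above, here is a minimal sketch (the bucket and paths are hypothetical) that drives an isolated directory move from Python, setting the documented property just for that one hadoop fs call:

import subprocess

# Sketch only: bucket and paths are hypothetical. The -D generic option sets
# fs.gs.cooperative.locking.enable for this single hadoop fs invocation.
subprocess.run(
    [
        "hadoop", "fs",
        "-Dfs.gs.cooperative.locking.enable=true",
        "-mv", "gs://my-bucket/staging", "gs://my-bucket/archive",
    ],
    check=True,
)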
General performance improvements to Cloud Storage Connector
In addition to the above features, there are many other performance improvements and optimizations in this Cloud Storage Connector release. For example:

Directory modification parallelization: in addition to using batch requests, the Cloud Storage Connector executes Cloud Storage batches in parallel, reducing the rename time for a directory with 32,000 files from 15 minutes to 1 minute, 30 seconds.
Latency optimizations by decreasing the number of Cloud Storage requests needed for high-level Hadoop file system operations.
Concurrent glob algorithms (regular and flat glob) execution to yield the best performance for all use cases (deep and broad file trees).
Repair of implicit directories during delete and rename operations instead of during list and glob operations, reducing the latency of expensive list and glob operations and eliminating the need for write permissions for read requests.
Cloud Storage read consistency, to allow requests of the same Cloud Storage object version, preventing reads of different object versions and improving performance.

You can upgrade to the new version of Cloud Storage Connector using the connectors initialization action for existing Cloud Dataproc versions. It will become standard starting in Cloud Dataproc version 2.0.

Thanks to the contributors to the design and development of the new release of Cloud Storage Connector, in no particular order: Joep Rottinghuis, Lohit Vijayarenu, Hao Luo and Yaliang Wang from the Twitter engineering team.
Quelle: Google Cloud Platform

OpenShift 4.2 on Azure Preview

Introduction
In this blog we will be showing a video on how to get Red Hat OpenShift 4 installed on Microsoft Azure using the full stack automated method. This method differs from the pre-existing infrastructure method, as the full stack automation gets you from zero to a full OpenShift deployment, creating all the required infrastructure components automatically.
Currently, installing OpenShift 4 on Azure is under tech preview. It won’t be supported until the GA release of OpenShift 4.2. This blog is meant for those who want to get a preview on what’s coming. Detailed instructions are below if you wish to follow along!

Prerequisites
It’s important that you get familiar with the general prerequisites by looking at the official documentation for OpenShift. There you can find specific details about the requirements and installation details for either full-stack automated or for pre-existing infrastructure deployments. I have broken up the prerequisites into sections and have marked those that are optional.
DNS
You will need to have a DNS domain already controlled by Azure. The OpenShift installer will configure DNS resolution (internal and external) for the cluster. This can be done by buying a domain on Azure or delegating a domain (or subdomain) to Azure. In either case, make sure the domain is set ahead of time.
During the install, you will be providing a $CLUSTERID. This ID will be used as part of the FQDN of the components created for your cluster. In other words, the ID will become part of your DNS name. For example, a domain of example.com and a $CLUSTERID of ocp4 will yield an OpenShift domain of ocp4.example.com for your cluster.
Choose wisely.
Azure CLI Tools (Optional)
It’s useful to install the Azure az CLI client. Although you can do all of what you need for Azure from the web UI, it’s helpful to have the CLI tool installed for debugging or streamlining the setup process.
Once you’ve installed the Azure CLI, you will need to log in to set up the CLI for access. Be sure to visit the Getting Started page for more information. Once set up, verify that you have a connection to your account with the following:
az account show

The output should look something like this:
{
  "environmentName": "AzureCloud",
  "id": "VVVVVVVV-VVVV-VVVV-VVVV-VVVVVVVVVVVV",
  "isDefault": true,
  "name": "Microsoft Azure Account",
  "state": "Enabled",
  "tenantId": "WWWWWWWW-WWWW-WWWW-WWWW-WWWWWWWWWWWW",
  "user": {
    "name": "user@email.com",
    "type": "user"
  }
}

Again, you don’t need the Azure CLI tool, but it does help.
OpenShift CLI Tools
In order to install and interact with OpenShift, you will need to download some CLI tools. These can be found by going to try.openshift.com and logging in with your Red Hat Customer Portal credentials. Click on Azure (note that it’s only Developer Preview currently). You will need to download the following:

The OpenShift Installer
The OpenShift CLI tools (includes oc and kubectl)
Download or copy your pull secret

You may need the “dev preview” binaries instead, as dev previews are always being updated. Always consult try.openshift.com for details.
Install
In this section I will be going over the installation of OpenShift 4.2 dev preview on Azure, with the assumption you have an Azure account and that you performed all of the prerequisites. I will be installing the following:

The installer will set up 3 master nodes, 3 worker nodes, and 1 bootstrap node.
I will be using az.redhatworkshops.io as my example domain.
I will be using openshift4 as my clusterid.
I am doing the install from a Linux host.

Creating a Service Principal
A Service Principal needs to be created for the installer to use. A Service Principal can be thought of as a “robot” account for automation on Azure. More information about Service Principals can be found in the Microsoft docs. To create a service principal, run the following command:
az ad sp create-for-rbac --name chernand-azure-video-sp

When successful, it should output the information about the service principal. Save this information somewhere as the installer will need it to do the install. The information should look something like this.
{
  "appId": "ZZZZZZZZ-ZZZZ-ZZZZ-ZZZZ-ZZZZZZZZZZZZ",
  "displayName": "chernand-azure-video-sp",
  "name": "http://chernand-azure-video-sp",
  "password": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
  "tenant": "YYYYYYYY-YYYY-YYYY-YYYY-YYYYYYYYYYYY"
}

Next, you need to give the service principal the right roles in order to properly install OpenShift. The service principal needs to have at least Contributor and User Access Administrator roles assigned in your subscription.
az role assignment create --assignee ZZZZZZZZ-ZZZZ-ZZZZ-ZZZZ-ZZZZZZZZZZZZ --role Contributor
az role assignment create --assignee ZZZZZZZZ-ZZZZ-ZZZZ-ZZZZ-ZZZZZZZZZZZZ --role "User Access Administrator"

NOTE: The UUID passed to --assignee is the appId in the output when you created the service principal.

In order to properly mint credentials for components in the cluster, your service principal needs to request the following application permission before you can deploy OpenShift on Azure: Azure Active Directory Graph -> Application.ReadWrite.OwnedBy
You can request permissions using the Azure portal or the Azure CLI. (You can read more about Azure Active Directory Permissions at the Microsoft Azure website)
az ad app permission add --id ZZZZZZZZ-ZZZZ-ZZZZ-ZZZZ-ZZZZZZZZZZZZ \
  --api 00000002-0000-0000-c000-000000000000 \
  --api-permissions 824c81eb-e3f8-4ee6-8f6d-de7f50d565b7=Role

NOTE: The Application.ReadWrite.OwnedBy permission is granted to the application only after it is provided an “Admin Consent” by the tenant administrator. If you are the tenant administrator, you can run the following to grant this permission.

az ad app permission grant --id ZZZZZZZZ-ZZZZ-ZZZZ-ZZZZ-ZZZZZZZZZZZZ \
  --api 00000002-0000-0000-c000-000000000000

You will also need your Subscription ID; you can get this by running the following.
az account list --output table

Installing OpenShift
It’s best to create a working directory when creating a cluster. This directory will hold all the install artifacts, including the initial kubeadmin account.
mkdir ~/ocp4

Run the openshift-install create install-config command specifying this working directory. This creates the initial install config (install-config.yaml) and stores it in that directory. You will need the information about the service principal you created earlier.
$ openshift-install create install-config --dir=~/ocp4
? SSH Public Key /home/chernand/.ssh/azure_rsa.pub
? Platform azure
? azure subscription id 12345678-1234-1234-1234-123456789012
? azure tenant id YYYYYYYY-YYYY-YYYY-YYYY-YYYYYYYYYYYY
? azure service principal client id ZZZZ-ZZZZ-ZZZZ-ZZZZZZZZZZZZ
? azure service principal client secret [? for help] ***********
INFO Saving user credentials to "/home/chernand/.azure/osServicePrincipal.json"
? Region centralus
? Base Domain az.redhatworkshops.io
? Cluster Name openshift4
? Pull Secret [? for help] ****************************

Let’s go over the Azure specific options.

azure subscription id – This is your subscription id. This can be obtained by running: az account list --output table
azure tenant id – Your tenant id (this was in the output when you created your service principal)
azure service principal client id – This is the appId from the service principal creation output.
azure service principal client secret – This is the password from the service principal creation output.

The install-config.yaml file is now in the ~/ocp4 working directory. The installer also creates a ~/.azure/osServicePrincipal.json file. Inspect these files if you wish.
cat ~/ocp4/install-config.yaml
cat ~/.azure/osServicePrincipal.json

After you’ve inspected these files, go ahead and install OpenShift.
openshift-install create cluster --dir=~/ocp4/

When the install is finished, you’ll see the following output.
INFO Consuming "Install Config" from target directory
INFO Creating infrastructure resources…
INFO Waiting up to 30m0s for the Kubernetes API at https://api.openshift4.az.redhatworkshops.io:6443…
INFO API v1.14.0+8e63b6d up
INFO Waiting up to 30m0s for bootstrapping to complete…
INFO Destroying the bootstrap resources…
INFO Waiting up to 30m0s for the cluster at https://api.openshift4.az.redhatworkshops.io:6443 to initialize…
INFO Waiting up to 10m0s for the openshift-console route to be created…
INFO Install complete!
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/chernand/ocp4/auth/kubeconfig'
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.openshift4.az.redhatworkshops.io:6443
INFO Login to the console with user: kubeadmin, password: 5char-5char-5char-5char

Set the KUBECONFIG environment variable to connect to your cluster.
export KUBECONFIG=$HOME/ocp4/auth/kubeconfig

Verify that your cluster is up and running.
$ oc cluster-info
Kubernetes master is running at https://api.openshift4.az.redhatworkshops.io:6443

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

Post Install
After your cluster is deployed, you may want to do some additional configuration tasks such as:

Configuring authentication and additional users
Adding additional routes and/or sharding network traffic
Migrating OpenShift services to specific nodes
Adding additional persistent storage or a dynamic storage provisioner
Adding more nodes to the cluster

It’s important to note that the kubeadmin user is meant to be a temporary admin user. You should replace this user with a more permanent admin user when you configure authentication.
Conclusion
In this blog we went over how to install OpenShift 4 on Azure using the full stack automated method. It’s important to note that this method is marked as developer preview, meaning it’s not supported by Red Hat. However, the installer is ready for you to deploy and test for non-production workloads. Please feel free to try it and provide feedback by leaving a comment below or reaching out via the Customer Portal Discussions page.
The post OpenShift 4.2 on Azure Preview appeared first on Red Hat OpenShift Blog.
Quelle: OpenShift

National Express signs 8-year cloud deal with Vodafone Business and IBM

UK’s largest coach operator National Express turns to new IBM & Vodafone venture for hybrid cloud boost. (PRNewsfoto/IBM)
National Express has signed an eight-year deal with Vodafone Business and IBM to help the UK-based coach company with its hybrid cloud plans, reports Computer Weekly. Under the agreement, infrastructure for the transportation provider will move to IBM Cloud as part of a larger hybrid cloud strategy.
The deal will enable National Express to more “effectively manage multiple clouds in different locations and from different vendors,” shared IBM. The deal will also help National Express to “seamlessly scale up and down in support of usage spikes. Additional security and risk management will be added to protect the transport operator’s technology infrastructure and provide greater resilience.”
Michael Valocchi, IBM general manager of the venture with Vodafone shared the following with Computer Weekly: “What we’re building for National Express is a future-proof platform that’s going to allow them to use hybrid cloud, to have the flexibility and scalability they need. It’s a way for them to innovate, both from a consumer experience [point of view] and from an operational perspective, by bringing together the predictive nature of maintenance and vehicle placement, and the operational benefits that brings.”
Vodafone Business and IBM launched their joint venture earlier this year to help companies innovate faster. The joint venture aims to provide the open, flexible technologies enterprises need to integrate multiple clouds and prepare for a digital future enabled by AI, 5G, edge computing and Software Defined Networking (SDN).
Read more about the National Express cloud deal in the full article from Computer Weekly.
The post National Express signs 8-year cloud deal with Vodafone Business and IBM appeared first on Cloud computing news.
Quelle: Thoughts on Cloud

Expanding your patent set with ML and BigQuery

Patents protect unique ideas and intellectual property. Patent landscaping is an analytical approach commonly used by corporations, patent offices, and academics to better understand the potential technical coverage of a large number of patents where manual review (i.e., actually reading the patents) is not feasible due to time or cost constraints. Luckily, patents contain rich information, including metadata (examiner-supplied classification codes, citations, dates, and information about the patent applicant), images, and thousands of words of descriptive text, which enable the use of more advanced methodological techniques to augment manual review.

Patent landscaping techniques have improved as machine learning models have increased practitioners’ ability to analyze all this data. Here on Google’s Global Patents Team, we’ve developed a new patent landscaping methodology that uses Python and BigQuery on Google Cloud to allow you to easily access patent data and generate automated landscapes.

There are some important concepts to know as you’re getting started with patent landscaping. Machine learning (ML) landscaping methods that use these sources of information generally fall into one of two categories:

Unsupervised: Given a portfolio of patents about which the user knows no prior information, utilize an unsupervised algorithm to generate topic clusters to provide users a better high-level overview of what that portfolio contains.
Supervised: Given a seed set of patents about which the user is confident covers a specific technology, identify other patents among a given set that are likely to relate to the same technology.

The focus of this post is on supervised patent landscaping, which tends to have more impact and be commonly used across industries, such as:

Corporations that have highly curated seed sets of patents that they own and wish to identify patents with similar technical coverage owned by other entities. That may aid various strategic initiatives, including targeted acquisitions and cross-licensing discussions.
Patent offices that regularly perform statistical analyses of filing trends in emerging technologies (like AI) for which the existing classification codes are not sufficiently nuanced.
Academics who are interested in understanding how economic policy impacts patent filing trends in specific technology areas across industries.

Whereas landscaping methods have historically relied on keyword searching and Boolean logic applied to the metadata, supervised landscaping methodologies are increasingly using advanced ML techniques to extract meaning from the actual full text of the patent, which contains far richer descriptive information than the metadata. Despite this recent progress, most supervised patent landscaping methodologies face at least one of these challenges:

Lack of confidence scoring: Many approaches simply return a list of patents without indication of which are the most likely to actually be relevant to a specific technology space covered in the seed set. This means that a manual reviewer can’t prioritize the results for manual review, which is a common use of supervised landscapes.
Speed: Many approaches that use more advanced machine learning techniques are extremely slow, making them difficult to use on-demand.
Cost: Most existing tools are provided by for-profit companies that charge per analysis or as a recurring SaaS model, which is cost-prohibitive for many users.
Transparency: Most available approaches are proprietary, so the user cannot actually review the code or have full visibility into the methodologies and data inputs.
Lack of clustering: Many technology areas comprise multiple sub-categories that require a clustering routine to identify. Clustering the input set could formally group the sub-categories in a formulaic way that any downstream tasks could then make use of to more effectively rank and return results. Few (if any) existing approaches attempt to discern sub-categories within the seed set.

The new patent landscaping methodology we’ve developed addresses all of the common shortcomings listed above. This methodology uses Colab (Python) and GCP (BigQuery) to provide the following benefits:

Fully transparent, with all code and data publicly available, and provides confidence scoring of all results
Clusters patent data to capture variance within the seed set
Inexpensive, with the sole costs incurred being GCP compute fees
Fast: hundreds or thousands of patents can be used as input, with results returned in a few minutes

Read on for a high-level overview of the methodology with code snippets. The complete code is found here, and can be reused and modified for your own ML and BigQuery projects. Finally, if you need an introduction to the Google Public Patents Datasets, a great overview is found here.

Getting started with the patent landscaping methodology
1. Select a seed set and a patent representation
Generating a landscape first requires a seed set to be used as a starting point for the search. In order to produce a high-quality search, the input patents should themselves be closely related. A more closely related seed set tends to generate a landscape more tightly clustered around the same technical coverage, while a set of completely random patents will likely yield noisy and more uncertain results.

The input set could span a Cooperative Patent Code (CPC), a technology, an assignee, an inventor, etc., or a specific list of patents covering some known technological area. In this walkthrough, a term (word) is used to find a seed set. In the Google Patents Public Datasets, there is a “top terms” field available for all patents in the “google_patents_research.publications” table. The field contains 10 of the most important terms used in a patent. The terms can be unigrams (such as “aeroelastic,” “genotyping,” or “engine”) or bi-grams (such as “electrical circuit,” “background noise,” or “thermal conductivity”).

With a seed set selected, you’ll next need a representation of a patent suitable to be passed through an algorithm. Rather than using the entire text of a patent or discrete features of a patent, it’s more consumable to use an embedding for each patent. Embeddings are a learned representation of a data input through some type of model, often with a neural network architecture. They reduce the dimensionality of an input set by mapping the most important features of the inputs to a vector of continuous numbers. A benefit of using embeddings is the ability to calculate distances between them, since several distance measures between vectors exist.

You can find a set of patent embeddings in BigQuery. The patent embeddings were built using a machine learning model that predicted a patent’s CPC code from its text. Therefore, the learned embeddings are a vector of 64 continuous numbers intended to encode the information in a patent’s text. Distances between the embeddings can then be calculated and used as a measure of similarity between two patents. In the following example query (performed in BigQuery), we’ve selected a random set of U.S. patents (and collected their embeddings) granted after Jan. 1, 2005, with a top term of “neural network.”
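A minimal sketch of such a query, wrapped in Python with the BigQuery client, might look like the following; the column names (publication_number, top_terms, embedding_v1), the join to the patents.publications table for the grant date, and the LIMIT are assumptions based on the public tables’ documented schema rather than the post’s exact query:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical seed-set query: U.S. patents granted after 2005-01-01 whose
# top terms include "neural network", together with their 64-dim embeddings.
seed_query = """
SELECT
  gpr.publication_number,
  gpr.embedding_v1
FROM `patents-public-data.google_patents_research.publications` AS gpr
JOIN `patents-public-data.patents.publications` AS pub
  USING (publication_number)
WHERE gpr.country = 'United States'
  AND 'neural network' IN UNNEST(gpr.top_terms)
  AND pub.grant_date > 20050101
LIMIT 250
"""

seed_df = client.query(seed_query).to_dataframe()  # one row per seed patent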
2. Organize the seed set
With the input set determined and the embedding representations retrieved, you have a few options for determining similarity to the seed set of patents. Let’s go through each of the options in more detail.

1. Calculating an overall embedding point—centroid, medoid, etc.—for the entire input set and performing similarity to that value. Under this method, one metric is calculated to represent the entire input set. That means that the input set of embeddings, which could contain information on hundreds or thousands of patents, ends up pared down to a single point. There are drawbacks to any methodology that is dependent on one point. If the value itself is not well selected, all results from the search will be poor. Furthermore, even if the point is well selected, the search depends on only that one embedding point, meaning all search results may represent the same area of a topic, technology, etc. By reducing the entire set of inputs to one point, you’ll lose significant information about the input set.

2. Seed set x N similarity, e.g., calculating similarity from all patents in the input set to all other patents. Doing it this way means you apply the vector distance metric between each patent in the input set and all other patents in existence. This method presents a few issues:
Lack of tractability. Calculating similarity for (seed_set_size x all_patents) is an expensive solution in terms of time and compute.
Outliers in the input set are treated as equals to highly representative patents.
Dense areas around a single point could be overrepresented in the results.
Reusing the input points for similarity may fail to expand the input space.

3. Clustering the input set and performing similarity to a cluster. We recommend clustering as the preferred approach to this problem, as it will overcome many of the issues presented by the other two methods. Using clustering, information about the seed set will be condensed into multiple representative points, with no point being an exact replica of its input. With multiple representative points, you can capture various parts of the input technology, features, etc.

3. Cluster the seed set
A couple of notes about the embeddings on BigQuery:
The embeddings are a vector of 64 numbers, meaning that the data is high-dimensional.
As noted earlier, the embeddings were trained in a prediction task, not explicitly trained to capture the “distance” (difference) between patents.

Based on the embedding training, the clustering algorithm needs to be able to effectively handle clusters of varying density. Since the embeddings were not trained to separate patents evenly, there will be areas of the embedding space that are more or less dense than others, yet represent similar information between documents.

Furthermore, with high-dimensional data, distance measures can degrade rapidly. One possible approach to overcoming the dimensionality is to use a secondary metric to represent the notion of distance. Rather than using absolute distance values, it’s been shown that a ranking of data points by their distances (removing the importance of the distance magnitudes) will produce more stable results with higher-dimensional data.
So our clustering algorithm should remove sole dependence on absolute distance.

It’s also important that a clustering method be able to detect outliers. When providing a large set of input patents, you can expect that not all documents in the set will be reduced to a clear sub-grouping. When the clustering algorithm is unable to group data in a space, it should be capable of ignoring those documents and spaces.

Several clustering algorithms exist (hierarchical, clique-based, hdbscan, etc.) that have the properties we require, any of which can be applied to this problem in place of the algorithm used here. In this application, we used the shared nearest neighbor (SNN) clustering method to determine the patent grouping. SNN is a clustering method that evaluates the neighbors for each point in a dataset and compares the neighbors shared between points to find clusters. SNN is useful for determining clusters of varying density, and it is good for high-dimensional data, since the explicit distance value is not used in its calculation; rather, it uses a ranking of neighborhood density. The complete clustering code is available in the GitHub repo.

For each cluster found, the SNN method determines a representative point in order to perform a search against it. Two common approaches for representing geometric centers are centroids and medoids. The centroid simply takes the mean value from each of the 64 embedding dimensions. A medoid is the point in a cluster whose average dissimilarity to all objects in the cluster is minimized. In this walkthrough, we’re using the centroid method.

The accompanying Python code applies the clustering, calculates some cluster characteristics, and visualizes the clustering results. The dimensions in the visualization were reduced using TSNE, and outliers in the input set are grayed out. The results of the clustering can be seen by like-colored points forming clusters.
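The walkthrough’s actual clustering code lives in the GitHub repo linked above; the following is only a rough, self-contained sketch of the shared nearest neighbor idea (a Jarvis-Patrick-style variant with arbitrary parameter values, not the repo’s exact implementation), including the centroid step:

import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import NearestNeighbors

def snn_clusters(embeddings, n_neighbors=20, min_shared=5):
    # Build each point's k-nearest-neighbor set using cosine distance.
    nn = NearestNeighbors(n_neighbors=n_neighbors, metric="cosine").fit(embeddings)
    _, idx = nn.kneighbors(embeddings)
    neighbor_sets = [set(row) for row in idx]

    # Link two points if each is in the other's neighbor list and they share
    # at least `min_shared` neighbors (a ranking, not a raw distance value).
    n = len(embeddings)
    adj = lil_matrix((n, n), dtype=bool)
    for i in range(n):
        for j in neighbor_sets[i]:
            if j > i and i in neighbor_sets[j] \
                    and len(neighbor_sets[i] & neighbor_sets[j]) >= min_shared:
                adj[i, j] = adj[j, i] = True

    # Clusters are connected components; unlinked singletons become outliers (-1).
    _, labels = connected_components(adj.tocsr(), directed=False)
    counts = np.bincount(labels)
    return np.where(counts[labels] == 1, -1, labels)

def cluster_centroids(embeddings, labels):
    # Centroid = mean of the 64 embedding dimensions per cluster (skip outliers).
    embeddings = np.asarray(embeddings)
    return {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels) if c != -1}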
4. Perform a similarity search
Once the cluster groups and their centers have been determined, you’ll need a measure of similarity between vectors. Several measures exist, and you can implement any preferred measure. In this example, we used cosine distance to find the similarity between two vectors.

Using the cosine distance, the similarity between a cluster center and all other patents is calculated using each of their embeddings. Distance values close to zero mean that the patent is very similar to the cluster point, whereas distances close to one are very far from the cluster point. The resulting similarity calculations are ordered for each cluster and capped at an upper-bound number of assets. In the accompanying code, a Python snippet iterates through each cluster; for each cluster, a query is performed in BigQuery that calculates the cosine distance between the cluster center and all other patents, and returns the most similar results to that cluster.

5. Apply confidence scoring
The previous step returns the most similar results to each cluster along with their cosine distance values. From here, the final step takes properties of the cluster and the distance measure from the similarity results to create a confidence level for each result. There are multiple ways to construct a confidence function, and each method may have benefits for certain datasets. In this walkthrough, we do the confidence scoring using a half squash function.

The half squash function takes as input the cosine distance value found between a patent and a cluster center (x). In addition, the function requires two parameters that affect how the distances of the results are fit onto the confidence scale:

A power variable, which defines the properties of the distribution showing the distance results—effectively the slope of the curve. In this version, a power of two is used.
A half value, which represents the midpoint of the curve returned and defines the saturation on either side of the curve. In this implementation, each cluster uses its own half value. The half value for each cluster is calculated as: (mean distance of input patents in the cluster) + 2 * (standard deviation of the input cluster distances).

The confidence scoring function effectively re-saturates the returned distance values to a scale between [0,1], with an exponentially decreasing value as the distance between a patent and the cluster center grows.
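Here is a minimal sketch of a confidence function consistent with that description; the exact functional form used in the published code may differ, so treat the half squash below as an assumption that merely matches the stated behavior (output in [0,1], midpoint at the half value, slope controlled by power):

import numpy as np

def half_squash(distance, half, power=2.0):
    # Assumed form: confidence = half**power / (half**power + distance**power).
    # Equals 1.0 at distance 0, 0.5 at distance == half, and decays toward 0
    # as the distance from the cluster center grows.
    d = np.asarray(distance, dtype=float)
    return half**power / (half**power + d**power)

def cluster_half_value(input_distances):
    # Per-cluster half value, as described above: mean distance of the input
    # patents to their cluster center plus two standard deviations.
    d = np.asarray(input_distances, dtype=float)
    return d.mean() + 2 * d.std()

# Example usage with hypothetical distances:
half = cluster_half_value([0.04, 0.07, 0.10, 0.15])
confidences = half_squash([0.05, 0.12, 0.30, 0.55], half)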
Results from this patent landscaping methodology
Applying the confidence function to all of the similarity search results yields a distribution of patents by confidence score. At the highest levels of confidence, fewer results will appear. As you move down the confidence distribution, the number of results increases exponentially.

Not all results returned are guaranteed to be high quality; however, the higher the confidence level, the more likely a result is to be positive. Depending on the input set, the confidence levels will not necessarily begin at 99%. Using our “neural network” random patent set, the highest-confidence results sit in the 60% to 70% range. From our own experimentation, the more tightly related the input set, the higher the confidence level in the results will be, since the clusters will be more compact.

This walkthrough provides one method for expanding a set of patents to generate a landscape. Several changes or improvements can be made to the clustering algorithm, distance calculations, and confidence functions to suit any dataset. Explore the patents dataset for yourself, and try out GitHub for the patent set expansion code too.
Quelle: Google Cloud Platform