Low-latency fraud detection with Cloud Bigtable

Each time someone makes a purchase with a credit card, financial companies want to determine whether the transaction was legitimate or whether it involved a stolen credit card, promotion abuse, or a hacked user account. Every year, billions of dollars are lost to credit card fraud, so there are serious financial consequences. Companies dealing with these transactions need to balance predicting fraud accurately with predicting fraud quickly. In this blog post, you will learn how to build a low-latency, real-time fraud detection system that scales seamlessly by using Bigtable for user attributes, transaction history, and machine learning features. We will follow an existing code solution, examine the architecture, define the database schema for this use case, and look at opportunities for customization.

The code for this solution is on GitHub and includes a simplistic sample dataset, a pre-trained fraud detection model, and a Terraform configuration. The goal of this blog and example is to showcase the end-to-end solution rather than machine learning specifics, since most real-world fraud detection models can involve hundreds of variables. If you want to spin up the solution and follow along, clone the repo and follow the instructions in the README to set up resources and run the code.

```shell
git clone https://github.com/GoogleCloudPlatform/java-docs-samples.git
cd java-docs-samples/bigtable/use-cases/fraudDetection
```

Fraud detection pipeline

When someone initiates a credit card purchase, the transaction is sent for processing before the purchase can be completed. The processing includes validating the credit card, checking for fraud, and adding the transaction to the user's transaction history. Once those steps are completed, and if no fraud is identified, the point of sale system can be notified that the purchase can finish.
Otherwise, the customer might receive a notification indicating there was fraud, and further transactions can be blocked until the user can secure their account.

The architecture for this application includes:

- Input stream of customer transactions
- Fraud detection model
- Operational data store with customer profiles and historical data
- Data pipeline for processing transactions
- Data warehouse for training the fraud detection model and querying table-level analytics
- Output stream of fraud query results

The architecture diagram below shows how the system is connected and which services are included in the Terraform setup.

Pre-deployment

Before creating a fraud detection pipeline, you will need a fraud detection model trained on an existing dataset. This solution provides a fraud model to try out, but it is tailored to the simplistic sample dataset. When you're ready to deploy this solution yourself based on your own data, you can follow our blog on how to train a fraud model with BigQuery ML.

Transaction input stream

The first step toward detecting fraud is managing the stream of customer transactions. We need an event-streaming service that can scale horizontally to meet the workload traffic, so Cloud Pub/Sub is a great choice. As our system grows, additional services can subscribe to the event stream to add new functionality as part of a microservice architecture. Perhaps the analytics team will subscribe to this pipeline for real-time dashboards and monitoring.

When someone initiates a credit card purchase, a request from the point of sale system will come in as a Pub/Sub message. This message will have information about the transaction, like location, transaction amount, merchant id, and customer id. Collecting all the transaction information is critical for making an informed decision, since we will update the fraud detection model based on purchase patterns over time as well as accumulate recent data to use for the model inputs.
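As a sketch of what consuming such a message might look like, the snippet below parses a comma-separated transaction payload into typed fields. The field names and their order are assumptions for illustration only; they are not the layout used by the sample dataset.

```java
import java.util.Arrays;

// Hypothetical transaction payload parser. The field layout
// (customerId, merchantId, amount, location) is an illustrative
// assumption, not the sample dataset's actual schema.
public class TransactionMessage {
    public final String customerId;
    public final String merchantId;
    public final double amount;
    public final String location;

    public TransactionMessage(String customerId, String merchantId,
                              double amount, String location) {
        this.customerId = customerId;
        this.merchantId = merchantId;
        this.amount = amount;
        this.location = location;
    }

    // Parse a comma-separated Pub/Sub payload such as
    // "customer-123,merchant-456,42.50,London".
    public static TransactionMessage parse(String payload) {
        String[] fields = payload.split(",");
        if (fields.length != 4) {
            throw new IllegalArgumentException(
                "Expected 4 fields, got " + Arrays.toString(fields));
        }
        return new TransactionMessage(
            fields[0].trim(), fields[1].trim(),
            Double.parseDouble(fields[2].trim()), fields[3].trim());
    }

    public static void main(String[] args) {
        TransactionMessage tx =
            TransactionMessage.parse("customer-123,merchant-456,42.50,London");
        System.out.println(tx.customerId + " spent " + tx.amount
            + " at " + tx.merchantId);
    }
}
```

In practice this parsing would happen inside the pipeline's preprocessing step, and a production system would validate each field rather than trust the payload.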
The more data points we have, the more opportunities we have to find anomalies and make an accurate decision.

Transaction pipeline

Pub/Sub has built-in integration with Cloud Dataflow, Google Cloud's data pipeline tool, which we will use for processing the stream of transactions with horizontal scalability. It's common to design Dataflow jobs with multiple sources and sinks, so there is a lot of flexibility in pipeline design. Our pipeline design here only fetches data from Bigtable, but you could also add additional data sources or even third-party financial APIs as part of the processing. Dataflow is also great for outputting results to multiple sinks, so we can write to databases, publish an event stream with the results, and even call APIs to send emails or texts to users about the fraud activity.

Once the pipeline receives a message, our Dataflow job does the following:

1. Fetch user attributes and transaction history from Bigtable
2. Request a prediction from Vertex AI
3. Write the new transaction to Bigtable
4. Send the prediction to a Pub/Sub output stream

```java
Pipeline pipeline = Pipeline.create(options);

PCollection<RowDetails> modelOutput =
    pipeline
        .apply(
            "Read PubSub Messages",
            PubsubIO.readStrings().fromTopic(options.getInputTopic()))
        .apply("Preprocess Input", ParDo.of(PREPROCESS_INPUT))
        .apply("Read from Cloud Bigtable",
            ParDo.of(new ReadFromTableFn(config)))
        .apply("Query ML Model",
            ParDo.of(new QueryMlModelFn(options.getMLRegion())));

modelOutput
    .apply(
        "TransformParsingsToBigtable",
        ParDo.of(WriteCBTHelper.MUTATION_TRANSFORM))
    .apply(
        "WriteToBigtable",
        CloudBigtableIO.writeToTable(config));

modelOutput
    .apply(
        "Preprocess Pub/Sub Output",
        ParDo.of(
            new DoFn<RowDetails, String>() {
              @ProcessElement
              public void processElement(
                  @Element final RowDetails modelOutput,
                  final OutputReceiver<String> out)
                  throws IllegalAccessException {
                out.output(modelOutput.toCommaSeparatedString());
              }
            }))
    .apply("Write to PubSub",
        PubsubIO.writeStrings().to(options.getOutputTopic()));

pipeline.run();
```

Operational data store

To detect fraud in most scenarios, you cannot look at just one transaction in a silo – you need additional context in real time in order to detect an anomaly. Information about the customer's transaction history and user profile are the features we will use for the prediction.

We'll have lots of customers making purchases, and since we want to validate each transaction quickly, we need a scalable, low-latency database that can act as part of our serving layer. Cloud Bigtable is a horizontally scalable database service with consistent single-digit millisecond latency, so it aligns well with our requirements.

Schema design

Our database will store customer profiles and transaction history. The historical data provides context that tells us whether a transaction follows its customer's typical purchase patterns. These patterns can be found by looking at hundreds of attributes. A NoSQL database like Bigtable allows us to add columns for new features seamlessly, unlike less flexible relational databases, which would require schema changes. Data scientists and engineers can evolve the model over time by mixing and matching features to see what creates the most accurate model. They can also use the data in other parts of the application: generating credit card statements for customers or creating reports for analysts. Bigtable as an operational data store here allows us to provide a clean, current version of the truth shared by multiple access points within our system.

For the table design, we can use one column family for customer profiles and another for transaction history, since they won't always be queried together.
Most users are only going to make a few purchases a day, so we can use the user id for the row key. All transactions can go in the same row, since Bigtable's cell versioning lets us store multiple values at different timestamps in row-column intersections. Our table example data includes more columns, but the structure looks like this:

Since we are recording every transaction each customer makes, the data could grow very quickly, but garbage collection policies can simplify data management. For example, we might want to keep a minimum of 100 transactions and then delete any transactions older than six months. Garbage collection policies apply per column family, which gives us flexibility. We want to retain all the information in the customer profile family, so we can use a default policy that won't delete any data. These policies can be managed easily via the Cloud Console and ensure there's enough data for decision making while trimming the database of extraneous data.

Bigtable stores timestamps for each cell by default, so if a transaction is incorrectly categorized as fraud/not fraud, we can look back at all of the information to debug what went wrong. There is also the opportunity to use cell versioning to support temporary features. For example, if a customer notifies us that they will be traveling during a certain time, we can update the location with a future timestamp, so they can go on their trip with ease.

Query

With our pending transaction, we can extract the customer id and fetch that information from the operational data store.
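To make the "keep a minimum of 100 transactions, delete anything older than six months" policy concrete, the sketch below models the semantics of an intersection garbage collection rule in plain Java: a cell version is removed only when it is both older than the age limit and beyond the most recent 100 versions. This illustrates the rule's behavior only; real policies are configured on the column family through the Bigtable admin API or the cbt tool, and the six-month cutoff of 180 days is an approximation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Models an intersection GC rule: a cell is deleted only if it is BOTH
// older than MAX_AGE AND beyond the MAX_VERSIONS most recent cells.
// That combination is what guarantees a minimum of 100 retained versions.
public class GcPolicySketch {
    static final int MAX_VERSIONS = 100;
    static final Duration MAX_AGE = Duration.ofDays(180); // ~six months

    // newestFirst: cell timestamps sorted newest-first, so an element's
    // index equals the number of newer versions that exist.
    public static boolean isDeleted(List<Instant> newestFirst, int index, Instant now) {
        boolean tooOld = newestFirst.get(index).isBefore(now.minus(MAX_AGE));
        boolean beyondVersionCount = index >= MAX_VERSIONS;
        return tooOld && beyondVersionCount; // intersection of both conditions
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        List<Instant> cells = new ArrayList<>();
        for (int i = 0; i < 150; i++) {
            cells.add(now.minus(Duration.ofDays(2L * i))); // one cell every 2 days
        }
        // A 240-day-old cell beyond the 100 most recent is collected.
        System.out.println(isDeleted(cells, 120, now));
        // A 190-day-old cell still within the 100 most recent is kept.
        System.out.println(isDeleted(cells, 95, now));
    }
}
```

A union rule (delete if older than six months *or* beyond 100 versions) would behave very differently, so the intersection/union choice is worth checking carefully when configuring the family.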
Our schema allows us to do a single row lookup to get an entire user's information.

```java
Table table = getConnection().getTable(TableName.valueOf(options.getCBTTableId()));
Result row = table.get(new Get(Bytes.toBytes(transactionDetails.getCustomerID())));

CustomerProfile customerProfile = new CustomerProfile(row);
```

Request a prediction

Now we have our pending transaction and the additional features, so we can make a prediction. We took the fraud detection model that we trained previously and deployed it to Vertex AI Endpoints. This is a managed service with built-in tooling to track our model's performance.

```java
PredictRequest predictRequest =
    PredictRequest.newBuilder()
        .setEndpoint(endpointName.toString())
        .addAllInstances(instanceList)
        .build();

PredictResponse predictResponse = predictionServiceClient.predict(
    predictRequest);
double fraudProbability =
    predictResponse
        .getPredictionsList()
        .get(0)
        .getListValue()
        .getValues(0)
        .getNumberValue();

LOGGER.info("fraudProbability = " + fraudProbability);
```

Working with the result

We will receive the fraud probability back from the prediction service and can then use it in a variety of ways.

Stream the prediction

First, we need to pass the result along. We can send the result and transaction as a Pub/Sub message in a result stream, so the point of sale service and other services can complete processing. Multiple services can react to the event stream, so there is a lot of customization you can add here.
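Once the probability comes back, the pipeline has to turn it into a decision. A minimal sketch of that step is below; the threshold values and the three-way approve/review/decline split are illustrative assumptions, not part of the sample solution.

```java
// Hypothetical decision step: map a fraud probability to an action.
// The thresholds are illustrative assumptions, not values from the
// sample solution; real systems tune them against business costs.
public class FraudDecision {
    public enum Action { APPROVE, REVIEW, DECLINE }

    static final double REVIEW_THRESHOLD = 0.50;
    static final double DECLINE_THRESHOLD = 0.90;

    public static Action decide(double fraudProbability) {
        if (fraudProbability >= DECLINE_THRESHOLD) {
            return Action.DECLINE; // block the transaction outright
        }
        if (fraudProbability >= REVIEW_THRESHOLD) {
            return Action.REVIEW; // allow, but flag for manual review
        }
        return Action.APPROVE;
    }

    public static void main(String[] args) {
        System.out.println(decide(0.97)); // DECLINE
        System.out.println(decide(0.62)); // REVIEW
        System.out.println(decide(0.03)); // APPROVE
    }
}
```

Where the thresholds sit depends on the relative cost of a false decline (lost sale, annoyed customer) versus a missed fraud, which is why the model's raw probability is usually kept alongside the final decision for later analysis.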
One example would be to use the event stream as a Cloud Functions trigger for a custom function that notifies users of fraud via email or text.

Another customization you could add to the pipeline is a mainframe or a relational database like Cloud Spanner or AlloyDB to commit the transaction and update the account balance. The payment only goes through if the amount fits within the remaining credit limit; otherwise, the customer's card has to be declined.

```java
modelOutput
    .apply(
        "Preprocess Pub/Sub Output",
        ParDo.of(
            new DoFn<RowDetails, String>() {
              @ProcessElement
              public void processElement(
                  @Element final RowDetails modelOutput,
                  final OutputReceiver<String> out)
                  throws IllegalAccessException {
                out.output(modelOutput.toCommaSeparatedString());
              }
            }))
    .apply("Write to PubSub",
        PubsubIO.writeStrings().to(options.getOutputTopic()));
```

Update operational data store

We can also write the new transaction and its fraud status to our operational data store in Bigtable. As our system processes more transactions, we can improve the accuracy of our model by updating the transaction history, so we will have more data points for future transactions. Bigtable scales horizontally for reading and writing data, so keeping our operational data store up to date requires minimal additional infrastructure setup.

Making test predictions

Now that you understand the entire pipeline and it's up and running, we can send a few transactions to the Pub/Sub stream from our dataset.
If you've deployed the codebase, you can generate transactions with gcloud and look through each tool in the Cloud Console to monitor the fraud detection ecosystem in real time. Run this bash script from the terraform directory to publish transactions from the testing data:

```shell
NUMBER_OF_LINES=5000
PUBSUB_TOPIC=$(terraform -chdir=../ output pubsub_input_topic | tr -d '"')
FRAUD_TRANSACTIONS_FILE="../datasets/testing_data/fraud_transactions.csv"
LEGIT_TRANSACTIONS_FILE="../datasets/testing_data/legit_transactions.csv"

for i in $(eval echo "{1..$NUMBER_OF_LINES}")
do
  # Send a fraudulent transaction
  MESSAGE=$(sed "${i}q;d" $FRAUD_TRANSACTIONS_FILE)
  echo ${MESSAGE}
  gcloud pubsub topics publish ${PUBSUB_TOPIC} --message="${MESSAGE}"
  sleep 5

  # Send a legit transaction
  MESSAGE=$(sed "${i}q;d" $LEGIT_TRANSACTIONS_FILE)
  echo ${MESSAGE}
  gcloud pubsub topics publish ${PUBSUB_TOPIC} --message="${MESSAGE}"
  sleep 5
done
```

Summary

In this piece, we've looked at each part of a fraud detection pipeline and how to give each one scale and low latency using the power of Google Cloud. This example is available on GitHub, so explore the code, launch it yourself, and try making modifications to match your needs and data. The included Terraform setup uses dynamically scalable resources like Dataflow, Pub/Sub, and Vertex AI with an initial one-node Cloud Bigtable instance that you can scale up to match your traffic and system load.

Related article: How Cloud Bigtable helps Ravelin detect retail fraud with low latency
Source: Google Cloud Platform

BigQuery Geospatial Functions – ST_IsClosed and ST_IsRing

Geospatial data analytics lets you use location data (latitude and longitude) to get business insights. It's used for a wide variety of applications in industry, such as package delivery logistics services, ride-sharing services, autonomous control of vehicles, real estate analytics, and weather mapping. BigQuery, Google Cloud's large-scale data warehouse, provides support for analyzing large amounts of geospatial data. This blog post discusses two geography functions we've recently added in order to expand the capabilities of geospatial analysis in BigQuery: ST_IsClosed and ST_IsRing.

BigQuery geospatial functions

In BigQuery, you can use the GEOGRAPHY data type to represent geospatial objects like points, lines, and polygons on the Earth's surface. BigQuery geographies are based on the Google S2 Library, which uses Hilbert space-filling curves to perform spatial indexing so that queries run efficiently. BigQuery comes with a set of geography functions that let you process spatial data using standard ANSI-compliant SQL. (If you're new to using BigQuery geospatial analytics, start with Get started with geospatial analytics, a tutorial that uses BigQuery to analyze and visualize the popular NYC Bikes Trip dataset.)

The new ST_IsClosed and ST_IsRing functions are boolean accessor functions that help determine whether a geographical object (a point, a line, a polygon, or a collection of these objects) is closed or is a ring. Both of these functions accept a GEOGRAPHY column as input and return a boolean value. The following diagram provides a visual summary of the types of geometric objects. For more information about these geometric objects, see Well-known text representation of geometry in Wikipedia.

Is the object closed? (ST_IsClosed)

The ST_IsClosed function examines a GEOGRAPHY object and determines whether each of the elements of the object has an empty boundary. The boundary for each element is defined formally in the ST_Boundary function.
The following rules are used to determine whether a GEOGRAPHY object is closed:

- A point is always closed.
- A linestring is closed if the start point and end point of the linestring are the same.
- A polygon is closed only if it's a full polygon.
- A collection is closed if every element in the collection is closed.
- An empty GEOGRAPHY object is not closed.

Is the object a ring? (ST_IsRing)

The other new BigQuery geography function is ST_IsRing. This function determines whether a GEOGRAPHY object is a linestring and whether the linestring is both closed and simple. A linestring is considered closed as defined by the ST_IsClosed function. The linestring is considered simple if it doesn't pass through the same point twice, with one exception: if the start point and end point are the same, the linestring forms a ring. In that case, the linestring is considered simple.

Seeing the new functions in action

The following query shows you what the ST_IsClosed and ST_IsRing functions return for a variety of geometric objects. The query creates a series of ad-hoc geography objects and uses the UNION ALL statement to create a set of inputs. The query then calls the ST_IsClosed and ST_IsRing functions to determine whether each of the inputs is closed or is a ring.
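The linestring rules above can be sketched in plain code. The version below treats a linestring as a list of coordinate pairs and checks closedness (first point equals last) and simplicity (no vertex repeats, except that matching endpoints are allowed). It is an illustrative planar simplification, not BigQuery's geodesic implementation, which also detects edges that cross between vertices.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Planar sketch of ST_IsClosed / ST_IsRing semantics for a linestring,
// represented as a list of {longitude, latitude} pairs. Illustrative only:
// BigQuery works on geodesics and also catches crossings between vertices,
// which a vertex-equality check alone cannot see.
public class LinestringChecks {

    public static boolean isClosed(List<double[]> points) {
        return points.size() > 1
            && Arrays.equals(points.get(0), points.get(points.size() - 1));
    }

    public static boolean isSimple(List<double[]> points) {
        // No vertex may repeat, except the shared start/end of a closed line.
        int end = isClosed(points) ? points.size() - 1 : points.size();
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < end; i++) {
            if (!seen.add(Arrays.toString(points.get(i)))) {
                return false;
            }
        }
        return true;
    }

    public static boolean isRing(List<double[]> points) {
        return isClosed(points) && isSimple(points);
    }

    public static void main(String[] args) {
        // Closed square ring: LINESTRING(2 2, 4 2, 4 4, 2 4, 2 2)
        List<double[]> square = List.of(
            new double[]{2, 2}, new double[]{4, 2}, new double[]{4, 4},
            new double[]{2, 4}, new double[]{2, 2});
        System.out.println(isClosed(square) + " " + isRing(square)); // true true

        // Open path: LINESTRING(1 2, 4 2, 4 4)
        List<double[]> open = List.of(
            new double[]{1, 2}, new double[]{4, 2}, new double[]{4, 4});
        System.out.println(isClosed(open) + " " + isRing(open)); // false false
    }
}
```

The two sample linestrings mirror the second and third rows of the query that follows, so you can compare this sketch's output with what BigQuery returns.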
You can run this query in the BigQuery SQL workspace page in the Google Cloud console.

```sql
WITH example AS (
  SELECT ST_GeogFromText('POINT(1 2)') AS geography
  UNION ALL
  SELECT ST_GeogFromText('LINESTRING(2 2, 4 2, 4 4, 2 4, 2 2)') AS geography
  UNION ALL
  SELECT ST_GeogFromText('LINESTRING(1 2, 4 2, 4 4)') AS geography
  UNION ALL
  SELECT ST_GeogFromText('POLYGON((0 0, 2 2, 4 2, 4 4, 0 0))') AS geography
  UNION ALL
  SELECT ST_GeogFromText('MULTIPOINT(5 0, 8 8, 9 6)') AS geography
  UNION ALL
  SELECT ST_GeogFromText('MULTILINESTRING((0 0, 2 0, 2 2, 0 0), (4 4, 7 4, 7 7, 4 4))') AS geography
  UNION ALL
  SELECT ST_GeogFromText('GEOMETRYCOLLECTION EMPTY') AS geography
  UNION ALL
  SELECT ST_GeogFromText('GEOMETRYCOLLECTION(POINT(1 2), LINESTRING(2 2, 4 2, 4 4, 2 4, 2 2))') AS geography
)
SELECT
  geography,
  ST_IsClosed(geography) AS is_closed,
  ST_IsRing(geography) AS is_ring
FROM example;
```

The console shows the following results. You can see in the is_closed and is_ring columns what each function returns for the various input geography objects.

The new functions with real-world geography objects

In this section, we show queries using linestring objects that represent line segments connecting some of the cities in Europe. We show the various geography objects on maps and then discuss the results that you get when you call ST_IsClosed and ST_IsRing for these geography objects. You can run the queries by using the BigQuery Geo Viz tool. The maps are the output of the tool. In the tool, you can click the Show results button to see the values that the functions return for the query.

Start point and end point are the same, no intersection

In the first example, the query creates a linestring object that has three segments.
The segments are defined by using four sets of coordinates: the longitude and latitude for London, Paris, Amsterdam, and then London again, as shown in the following map created by the Geo Viz tool.

The query looks like the following:

```sql
WITH example AS (
  SELECT ST_GeogFromText('LINESTRING(-0.2420221 51.5287714, 2.2768243 48.8589465, 4.763537 52.3547921, -0.2420221 51.5287714)') AS geography
)
SELECT
  geography,
  ST_IsClosed(geography) AS is_closed,
  ST_IsRing(geography) AS is_ring
FROM example;
```

In the example table that's created by the query, the columns with the function values show the following:

- ST_IsClosed returns true. The start point and end point of the linestring are the same.
- ST_IsRing returns true. The geography is closed, and it's also simple because there are no self-intersections.

Start point and end point are different, no intersection

Another scenario is when the start and end points are different. For example, imagine two segments that connect London to Paris and then Paris to Amsterdam, as in this map. The following query represents this set of coordinates:

```sql
WITH example AS (
  SELECT ST_GeogFromText('LINESTRING(-0.2420221 51.5287714, 2.2768243 48.8589465, 4.763537 52.3547921)') AS geography
)
SELECT
  geography,
  ST_IsClosed(geography) AS is_closed,
  ST_IsRing(geography) AS is_ring
FROM example;
```

This time, the ST_IsClosed and ST_IsRing functions return the following values:

- ST_IsClosed returns false. The start point and end point of the linestring are different.
- ST_IsRing returns false. The linestring is not closed.
It's simple because there are no self-intersections, but ST_IsRing returns true only when the geometry is both closed and simple.

Start point and end point are the same, with intersection

The third example is a query that creates a more complex geography. In the linestring, the start point and end point are the same. However, unlike the earlier example, the line segments of the linestring intersect. A map of the segments shows connections that go from London to Zürich, then to Paris, then to Amsterdam, and finally back to London. In the following query, the linestring object has five sets of coordinates that define the four segments:

```sql
WITH example AS (
  SELECT ST_GeogFromText('LINESTRING(-0.2420221 51.5287714, 8.393389 47.3774686, 2.2768243 48.8589465, 4.763537 52.3547921, -0.2420221 51.5287714)') AS geography
)
SELECT
  geography,
  ST_IsClosed(geography) AS is_closed,
  ST_IsRing(geography) AS is_ring
FROM example;
```

In the query, ST_IsClosed and ST_IsRing return the following values:

- ST_IsClosed returns true. The start point and end point are the same, and the linestring is closed despite the self-intersection.
- ST_IsRing returns false. The linestring is closed, but it's not simple because of the intersection.

Start point and end point are different, with intersection

In the last example, the query creates a linestring that has three segments that connect four points: London, Zürich, Paris, and Amsterdam.
On a map, the segments look like the following. The query is as follows:

```sql
WITH example AS (
  SELECT ST_GeogFromText('LINESTRING(-0.2420221 51.5287714, 8.393389 47.3774686, 2.2768243 48.8589465, 4.763537 52.3547921)') AS geography
)
SELECT
  geography,
  ST_IsClosed(geography) AS is_closed,
  ST_IsRing(geography) AS is_ring
FROM example;
```

The new functions return the following values:

- ST_IsClosed returns false. The start point and end point are not the same.
- ST_IsRing returns false. The linestring is not closed, and it's not simple.

Try it yourself

Now that you've got an idea of what you can do with the new ST_IsClosed and ST_IsRing functions, you can explore more on your own. For details about the individual functions, read the ST_IsClosed and ST_IsRing entries in the BigQuery documentation. To learn more about the rest of the geography functions available in BigQuery Geospatial, take a look at the BigQuery geography functions page.

Thanks to Chad Jennings, Eric Engle, and Jing Jing Long for their valuable support in adding more functions to BigQuery Geospatial. Thank you Mike Pope for helping review this article.
Source: Google Cloud Platform

November 2022 Newsletter

Docker Hub now supports OCI Artifacts!
Docker Hub can now serve as a registry for any type of application artifact! It can help you distribute WebAssembly modules, Helm charts, Docker volumes, SBOMs, and more.

Learn More

Check out the most popular Docker content this month:
Security Advisory: High Severity OpenSSL Vulnerabilities – by Christian Dupuis, Principal Software Engineer at Docker
Build, Share, and Run WebAssembly Apps Using Docker (Video) – from Chris Crone, Director of Engineering at Docker
New in Docker Desktop 4.14: Greater Visibility Into Your Containers – by Amy Bass, Group Product Manager at Docker
How to Implement Decentralized Storage Using Docker Extensions – by Marton Elek, Principal Software Engineer at Storj

See more great content from Docker and the community

Read Now

Find us at AWS re:Invent!
Did you know you can buy Docker through AWS Marketplace? Stop by booth #946 to chat about the Hardened Desktop security model, Wasm, and deploying Docker to AWS.

Learn More

Subscribe to our newsletter to get the latest news, blogs, tips, how-to guides, best practices, and more from Docker experts sent directly to your inbox once a month.

Source: https://blog.docker.com/feed/