Skip to content

Gateway server that provides an OData interface to BigQuery and Elasticsearch

License

Notifications You must be signed in to change notification settings

thumbtack/becquerel

Repository files navigation

Becquerel

Becquerel is a gateway server that provides an OData 4.0 interface to BigQuery, Elasticsearch, and SQL databases. Want your CRM or customer support system to be able to read from your big data warehouse? Becquerel has the hookup.

Erica Ehrhardt developed Becquerel at Thumbtack, where it's currently used in production to provide Thumbtack's Salesforce instance with customer data from BigQuery and Elasticsearch.

license Travis Codecov GitHub release Github All Releases

Table of contents

Getting started

If you're not customizing it and just want to use it, download a prebuilt Becquerel release from our GitHub Releases section and read the subsections below on configuring, deploying, and connecting to Salesforce. For full development info, read all of this section.

Configuring Becquerel

Becquerel uses Play's conf/application.conf file for most configuration. You'll need to customize this for your Becquerel deployment by adding at least one OData service. We've provided an annotated conf/application-example.conf to document Becquerel's config options; copy it to conf/application.conf and edit the services section to get started.

Becquerel uses Play's default logging, which is based on Logback and should be configured by copying conf/logback-example.xml to conf/logback.xml. See the Logging section for more.

If you are using a BigQuery-backed service, the easiest way to get started is to install the Google Cloud SDK and then run gcloud init, which will generate application default credentials. These will grant your development instance of Becquerel access to the same BQ tables that you have access to. You can also use a service account; see the example config file for more.

Running Becquerel

Becquerel can be run like any other Play app with sbt run, or the IntelliJ Play 2 integration, both of which will launch Becquerel at http://localhost:9000/.

Testing Becquerel

Run sbt test, or use IntelliJ's ScalaTest task to run all tests in the test folder.

By default, this will only run unit tests. Integration tests can be turned after creating config files by setting certain environment variables:

  • ES_TESTS=true: Tests ES integration. Assumes you have an ES server running at localhost:9200 containing the Dell DVD Store data.
  • PG_TESTS=true: Tests JDBC integration with PG specifically. Assumes you have a PG server running at localhost:5432 with a database named ds2 containing the Dell DVD Store data.
  • INTEGRATION_TESTS=true: End to end tests of parts of the entire Becquerel app, using the real Play HTTP stack. Requires at least one of the above.

Loading the test data

Becquerel comes with some classes in the com.thumbtack.becquerel.demo namespace to make it easy to load test data into backing databases. Currently, it supports the data from Dell DVD Store, version 2.1.

# Fetch the data as well as MySQL and PG loader scripts.
pushd test_data
./ds2_download.sh
popd

# Copy and edit the demo configuration file to match your setup.
cp conf/demo-example.conf conf/demo.conf

# Load the data into one or more backing databases.
# Note that BQ requires the dataset to be created ahead of time, as well as a writable GCS bucket for temp files.
sbt 'runMain com.thumbtack.becquerel.demo.BqDemoLoader'
sbt 'runMain com.thumbtack.becquerel.demo.EsDemoLoader'
# Note that the data files must be on the same machine as the PG server for this loader to work.
sbt 'runMain com.thumbtack.becquerel.demo.PgDemoLoader'

Packaging Becquerel

# Run the full test suite first.
# Configure application.conf for your data sources, and start their backing servers, then:
ES_TESTS=true PG_TESTS=true INTEGRATION_TESTS=true sbt test

# Package it into a tarball, without the integration test config.
rm -f conf/application.conf conf/demo.conf conf/logback.xml
sbt universal:packageZipTarball

# Copy that tarball to your target machine somehow.
scp target/becquerel/becquerel-*.tgz target:

Deploying Becquerel

# On your target machine:
tar -xzf becquerel-*.tgz
rm becquerel-*.tgz
cd becquerel-*
# Note: if you used a zip release, you'll need to do this extra step to restore exec permissions:
chmod +x bin/becquerel

# Create and edit configuration files.
cp conf/application-example.conf conf/application.conf
cp conf/logback-example.xml conf/logback.xml

# Run Becquerel in production mode.
bin/becquerel

In production configurations, you should deploy Becquerel behind a reverse proxy server such as nginx, especially if it's expected to connect to cloud services on the public Internet. The reverse proxy should be configured to use HTTP basic authentication and TLS. Multi-instance configurations should also use a load balancer.

Developer proxy setup with ngrok

Since the details of that setup are likely specific to your organization, here's an example of how to use Becquerel with the ngrok developer proxy with HTTP basic auth and TLS:

# Assumes Becquerel is running on the default port 9000.
# Change username and password to a username and password you generate.
# ngrok will assign a random public URL like <https://3c81bfc9.ngrok.io>.
ngrok http -bind-tls=true -auth="username:password" 9000

Connecting to Salesforce

See the Salesforce setup instructions. They assume you're already running a Becquerel instance with Elasticsearch, the DVD Store data, and an ngrok proxy using the above settings.

Getting involved

We welcome feedback in the GitHub issue tracker, as well as pull requests!

Supported data sources

BigQuery (BQ)

com.thumbtack.becquerel.datasources.bigquery

Google's BigQuery analytics-oriented SQL database is Becquerel's original raison d'être and is well-supported. While BQ can process colossal amounts of data, it's a high-latency system that takes a few seconds to return results from any query, no matter how small the result set.

Elasticsearch (ES)

com.thumbtack.becquerel.datasources.elasticsearch

Elasticsearch is a horizontally scalable NoSQL data store supporting key lookups and full-text search. ES is a good choice for storing precomputed output from batch processes for low-latency retrieval.

JDBC

com.thumbtack.becquerel.datasources.jdbc

Becquerel has experimental partial support for SQL databases that provide JDBC drivers. It's been tested with PostgreSQL (PG) 9.6 and 10.1, and the H2 version that ships with Play. Expect some string functions not to work.

OData

OData is a refinement of HTTP REST APIs for database-like applications. It originated at Microsoft, was subsequently standardized, and is mostly used to connect enterprise business software to external data providers: some examples are Salesforce Connect external data sources, Microsoft SharePoint with Business Connectivity Services, and Microsoft Dynamics virtual entities.

OData can be used from a browser. The following sections list details of Becquerel's OData implementation along with some useful URLs.

Service documents

The service document (http://localhost:9000/bq/ if running locally with a service named bq) lists all of the OData entity collections for a service, which are roughly semantically equivalent to database tables. You can get this in either JSON or XML formats by changing the HTTP Accept header, or by appending ?$format=json or ?$format=xml.

Queries

Let's say you have a service named bq running locally, backed by a BQ dataset named dvdstore with a table named categories. Becquerel will publish that table at http://localhost:9000/bq/dvdstore__categories, and querying that URL will return data from that table as an OData document. Becquerel supports the following OData system query options:

  • $filter: equivalent to SQL WHERE.
  • $orderby: equivalent to SQL ORDER BY.
  • $top and $skip: equivalent to SQL LIMIT and OFFSET.
  • $select: equivalent to SQL SELECT.
  • $search: full-text search. $search has no standard SQL equivalent, and in fact is emulated using LIKE for SQL data sources, but maps naturally to a subset of ES query strings.

Becquerel does not currently support the $count or $expand system query options, and supports a limited set of string functions for $filter/$orderby: contains(), startswith(), endswith(), tolower(), and toupper().

Metadata

OData is typed and Becquerel thus uses metadata such as table schemas from its data sources to generate OData entity types. Becquerel can map most SQL and ES primitive types to equivalent OData primitive types, and can map array/list and struct/record types to OData collection and complex types respectively. It does not support entity relationships (BQ and ES don't have foreign keys anyway).

The metadata document (http://localhost:9000/bq/$metadata if running locally with a service named bq) lists all of the OData types for a service, which include entity types generated from tables, as well as complex types generated from any struct columns used by those tables. This is an OData CSDL document and is only available in XML.

Collection naming

OData entity collection names are built from the parts of the data source table identifier glued together with a double underscore (__), with runs of non-alphanumeric characters replaced with a single underscore (_). For example:

  • The BQ table gcp-example-project:dvdstore.categories becomes gcp_example_project__dvdstore__categories with omitProjectID off, and dvdstore__categories with omitProjectID on.
  • ES index names are used directly since they're in a flat namespace already, so categories is still categories. (Mapping types are ignored and are deprecated in newer ES versions.)
  • The PG table ds2.public.categories (in database ds2 with the default public schema) becomes ds2__public__categories with omitCatalogID off, and public__categories with omitCatalogID on.

Primary keys

All OData entity types must have a unique primary key (PK), which is used to, among other things, form the canonical URL for an entity. Although Becquerel itself doesn't use the PK for anything, and can serve entity collections with non-unique PKs, this is behavior specifically disallowed by the OData standard, and your OData consumer may behave erratically when it can't tell two entities with the same PK apart.

The SQL backends will use the first column of the table as the primary key (since BQ doesn't have a notion of primary keys, or for that matter, any kind of index, and not all JDBC-compatible SQL databases will necessarily have a PK on every table), so your SQL tables must start with a column containing unique values. The ES backend will use the ES document ID, which is guaranteed to be unique.

Fetching metadata

Rather than fetch table schemas with every query, or fetch them and cache them as new tables are requested, Becquerel fetches them ahead of time, at app startup and then every 10 minutes afterward (by default). This work is done on the metadataRefresh execution context, which is backed by a fixed-size thread pool. You can change the thread pool size in the contexts section of the config file. Note that OData requests won't work until the metadata has been fetched, but the status page will.

Multiple Becquerel instances do not share state. If launching multiple instances at once, you may want to set metadataRefreshDelayMax for your service in your config file: each instance will wait a random amount of time between zero seconds and the value of metadataRefreshDelayMax from the config file before fetching metadata. This will reduce the load on the data source, and in the case of BQ, reduce the chance of exhausting the BQ API call quota.

Architecture notes

Becquerel is a web service written in Scala and uses the Play 2.5 webapp framework, along with Apache Olingo 4 for OData support, and Apache Calcite for SQL manipulation. It provides an OData interface to several different kinds of data source, including Google BigQuery, Elasticsearch 5.x, and JDBC-compatible SQL databases, and translates OData requests to queries on the underlying data source, then translates the query results to OData responses.

OData implementation

Becquerel uses Apache Olingo to parse OData requests and serialize responses as OData entities. Olingo's tutorial is quite good and much of the OData interface code is adapted from it. BecquerelServiceEntityCollectionProcessor and DataSourceMetadata are the two main interface points with Olingo, for data and metadata respectively.

Olingo's HTTP handler uses the Servlet API, and is therefore synchronous and not very Play-friendly. Rather than implement a new Olingo handler type on top of Akka streams, Becquerel has the PlayRequestAdapter and PlayResponseAdapter classes which buffer requests and responses in memory, and provide just enough of the Servlet HttpServletRequest/HttpServletResponse API for Olingo to work.

Since Olingo's HTTP handler, Google's BigQuery client library, and JDBC all do blocking I/O, Becquerel uses a large fixed-size thread pool for Play's default execution context, based on Play's tuning guidelines for highly synchronous operations. The default size of 50 threads was based on load testing in Becquerel's production environment, and the optimal size may be different for your deployment.

Status page

Becquerel's index page (http://localhost:9000/ if running locally) displays the Git revision and Jenkins build info (in Jenkins builds only), the Play environment mode, info for each configured OData service, environment variables, request headers, Java runtime stats for CPUs and memory, and Java properties. Variables known to contain passwords or credentials can be optionally be censored (see the censor section of the config file), so you'll only see the first 4 characters followed by an ellipsis (…).

Logging

Becquerel logs to stdout by default. If you're using something like Logstash or Fluentd that can handle structured logging, edit conf/logback.xml and select logstash-logback-encoder to produce JSON logs instead of the default plain-text logs. This file is also where you'd change the logging destination.

All log entries resulting from HTTP requests are tagged with a "run ID" for tracing purposes, which is normally a random UUID, but may also be provided by setting a Run-ID: XXX header in the HTTP request. This is handled by RunIDFilter and propagated to the logging library by MDCPropagatingDispatcher and MDCPropagatingExecutionContext. The HTTP response will also contain a Run-ID: XXX header, and any SQL queries will contain a -- run_id: XXX comment.

Metrics

Becquerel uses breadfan's fork of the metrics-play library, which is a convenient wrapper for the DropWizard Metrics library. It can send these to InfluxDB using dropwizard-metrics-influxdb.

The metrics-play library provides a metrics controller that shows all registered metrics as a JSON document (http://localhost:9000/metrics if running locally). Note that not all metrics are registered at app startup; some may not show up on the metrics page until the first OData request is served.

Security considerations

Becquerel has no concept of authentication or authorization itself. In its production configuration at Thumbtack, authentication is between Salesforce and an HTTP reverse proxy in front of Becquerel. Salesforce has a single set of service account credentials, decides which of its users can access which external objects through Becquerel, and makes authenticated HTTP requests with its credentials on their behalf. Becquerel itself assumes that any request that reaches it through the proxy is authenticated by the proxy as being Salesforce, and that Salesforce authorized the Salesforce user to perform that action.

The BQ data source uses a single service account to connect to BQ, as does the JDBC data source (for databases that support authentication). The ES data source doesn't support authentication at all, because vanilla ES deployments don't either. Becquerel can thus potentially access any table its service accounts can access; although the BQ and ES backends can be configured to allow access to only certain tables, this will not help you if the data source credentials are somehow leaked.

Limitations

Becquerel has only been used in production with Salesforce thus far. It may or may not work with your particular OData provider. If you get it working with something new, please let us know! Patches and feedback are welcome.

Becquerel is developed on macOS, and built and deployed on Linux. It will probably work on any modern UNIX-like OS, but has not been tested on Windows, although there's no obvious reason it wouldn't run. Again, please let us know if you run it on a new OS!

Becquerel does not currently support access to individual OData entities or entity properties. Salesforce doesn't use this capability, so we haven't added it yet, although it's unlikely to be very complex to implement.

Becquerel is not capable of streaming massive result sets yet, and retains them in memory while generating a response. This is an edge case, and we recommend you use OData's paging features, instead of executing queries so broad that the results don't fit in available RAM. In practice, this is only likely to come up if you try to load an entire table.

Becquerel is a read-only implementation of OData, and can't modify, add, or delete data in its data sources. This was out of scope for Thumbtack's customer support use case: reading customer data from replicas on our data platform is a fairly generic operation, but changing it requires specific business logic that didn't fit well into the OData model.

Alternatives

WSO2 DSS

Thumbtack briefly evaluated WSO2 DSS but found it less capable than we'd hoped:

At the time of evaluation, it did not connect to BQ natively, or work with Elasticsearch at all, and was unable to publish BQ tables over OData: we attempted to use the closed-source BigQuery JDBC driver, and ran into the issue that BQ doesn't have any notion of primary keys but OData requires them, and WSO2 DSS had no way to specify them out of band, or in code, as Becquerel does with its SQL data sources.

Finally, WSO2 DSS is rather difficult to configure or modify. Their preferred scenario involves a custom Eclipse fork and an XML DSL. Becquerel is only about six thousand lines of Scala code.

WSO2 DSS might be a good option if you only need to talk to conventional SQL databases with conservative schemas and available JDBC drivers.

Maintainer info

Docs

If you add new sections to a Markdown document, run gh-md-toc on that document, drop the headers and footers, and copy the generated TOC into the document manually. gh-md-toc README.md | grep -v gh-md-toc | tail -n +5 | sed '/^$/d' will do most of this for you.

Releases

Becquerel follows SemVer and uses sbt-release. Run sbt release to make and push a release tag, and set the next version. Travis CI is configured to build new GitHub binary releases from tags automatically.

If there's something wrong with a release, follow these steps, in this order:

  • Delete the release tag for that version, locally and on GitHub.
  • Delete all binaries attached to that release on GitHub, then delete the release itself.
  • Run sbt release, but specify the version as the one you just deleted as the current version.

About

Gateway server that provides an OData interface to BigQuery and Elasticsearch

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages