We’ve previously described the Hadoop/Hive data warehouse we built in 2012 to store and process the HTTP access logs (450M records/day) and structured application event logs (170M events/day) that are generated by our service.
This setup is still working well for us, but we added Impala into our cluster last year to speed up ad hoc analytic queries. This led to the promised 4x reduction in query times, but access to data in the cluster was basically “all or nothing” … anyone who could make queries against the cluster would have visibility into every table, row, and column within the environment.
Our engineering team works hard to make sure that our logs don’t contain sensitive data or personally identifying information, but we always want to operate under the principle of least privilege for all access into our production systems and data. (E.g. Phil Libin has no logins to our admin/support tools and no permission to crawl our HTTP access logs.) This principle means that we had to restrict Impala query privileges to a very small handful of staff who absolutely needed to go back to the primary data sources.
Recently, we spent some time trying to figure out how we could give a slightly wider group of analysts the ability to access a subset of the data stored within Impala. A few constraints and goals:
- The analysts access our reporting environment through a VPN that performs strong, two-factor authentication and then restricts access to a minimal whitelist of IP:port endpoints in the environment. We’d like to enable Impala queries via the smallest possible expansion of that ACL (ideally, one new TCP port on one host).
- The analysts do not currently have or need any shell accounts on the Debian servers that run our Hadoop cluster, and we’d really prefer not to create Linux logins for them just to permit Impala queries.
- They should be able to perform Impala queries using their existing desktop SQL clients. (We use Razor due to its JDBC support.)
We flailed around for a couple of weeks trying to figure out some way to do this before stumbling across a solution using a mix of Hive/Impala views, SASL authentication to a local DB file, and user/group/role definitions via a Sentry policy file.
Hive/Impala views
Many databases rely on views to provide variable levels of access to data stored within tables. Access to a full table may be restricted, but you can create views giving access to a subset of the rows and/or columns in that table, and permit a different set of consumers to access those views.
Here’s an example table in Hive that contains a hypothetical sequence of events. Each event has an IP address, country code, client identifier, and an “action” that was performed by that client. The full ‘events’ table is in the database named ‘sensitive’, and we create two views in the ‘filtered’ database to give restricted access onto that table:
$ cat /tmp/events.csv 10.1.2.3,US,android,createNote 10.200.88.99,FR,windows,updateNote 10.1.2.3,US,android,updateNote 10.200.88.77,FR,ios,createNote 10.1.4.5,US,windows,updateTag $ hive -S hive> create database sensitive; hive> create table sensitive.events ( ip STRING, country STRING, client STRING, action STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; hive> load data local inpath '/tmp/events.csv' overwrite into table sensitive.events; hive> create database filtered; hive> create view filtered.events as select country, client, action from sensitive.events; hive> create view filtered.events_usonly as select * from filtered.events where country = 'US';
The ‘filtered.events’ view gives access to all rows, but removes access to the IP address column. The ‘filtered.events_usonly’ view further restricts access to only rows that have a country of ‘US’.
These views work great, but now we need to tell Impala to restrict groups of people to only access the correct views.
SASL username/password database
Impala’s daemon officially supports two mechanisms for authentication: kerberos and LDAP. We don’t particularly want to set up a Kerberos infrastructure, and it’s not clear how that would work for people who are just connecting to the Impala daemon over TCP from a SQL tool on their laptops.
We do have LDAP, but the current support in Impala is extremely preliminary (“only tested against Active Directory“) and we couldn’t get it to work against our TLS-only OpenLDAP infrastructure with the limited set of configuration options available today.
While flailing around with LDAP, we decided to try the undocumented –ldap_manual_config option. It turns out that if you tell Impala that it should perform authentication with LDAP (–enable_ldap_auth) using this “manual configuration” option, that means “don’t use LDAP at all, just try to match the client’s username+password against a BerkeleyDB file sitting at /etc/sasldb2.”
We created that file using the ‘saslpassd2‘ command to enter each username and password on our desired impala-server host. As an example, the following shows three different accounts being created in the sasldb2 file:
# saslpasswd2 sysadmin1 Password: Again (for verification): # saslpasswd2 analyst1 ... # saslpasswd2 analyst2 ... # ls -al /etc/sasldb2 -rw-r----- 1 root sasl 12288 Jun 4 17:26 /etc/sasldb2 # usermod -a -G sasl impala
(These usernames do not correspond with any shell accounts in /etc/passwd … they are a standalone authentication database.)
Sentry policy file
To specify the set of permissions for various groups of users, we need to tell Impala to use a Sentry policy file in HDFS. This file contains sections for mapping users into groups and groups onto roles. Roles specify which operations can be performed against which objects in Impala. Here we show our three example SASL users mapped into groups that can either perform any Impala query, perform SELECT operations against any of our ‘filtered’ views, or only SELECT from the ‘events_usonly’ view:
$ cat /tmp/impala-policy.ini [groups] sysadmins = any_operation global_analysts = select_filtered us_analysts = select_us [roles] any_operation = server=testimpala->db=*->table=*->action=* select_filtered = server=testimpala->db=filtered->table=*->action=SELECT select_us = server=testimpala->db=filtered->table=events_usonly->action=SELECT [users] sysadmin1 = sysadmins analyst1 = global_analysts analyst2 = us_analysts $ hdfs dfs -put /tmp/impala-policy.ini /user/hive/warehouse/
Impala server arguments
Finally, we need to tell Impala’s daemon to use the SASL database for authentication and the Sentry policy file for authorization by adding the following arguments to IMPALA_SERVER_ARGS in /etc/default/impala:
-server_name=testimpala \ -authorization_policy_provider_class=org.apache.sentry.provider.file.LocalGroupResourceAuthorizationProvider \ -authorization_policy_file=/user/hive/warehouse/impala-policy.ini \ --enable_ldap_auth=true \ --ldap_manual_config=true \
Then restart the Impala daemons on that host and confirm that there are no errors in /var/log/impala/*
Testing …
Using the impala-shell command-line query tool, we can now confirm that the ‘sysadmin1’ user can query the sensitive source table:
$ impala-shell --quiet -l -u sysadmin1 LDAP password for sysadmin1: [debian-virtualbox.rwc.etonreve.com:21000] > select * from sensitive.events; +--------------+---------+---------+------------+ | ip | country | client | action | +--------------+---------+---------+------------+ | 10.1.2.3 | US | android | createNote | | 10.200.88.99 | FR | windows | updateNote | | 10.1.2.3 | US | android | updateNote | | 10.200.88.77 | FR | ios | createNote | | 10.1.4.5 | US | windows | updateTag | +--------------+---------+---------+------------+
The first analyst can’t query that table, but can use the ‘filtered.events’ view to see everything but the IP addresses:
$ impala-shell --quiet -l -u analyst1 LDAP password for analyst1: [testimpala:21000] > select * from sensitive.events; ERROR: AuthorizationException: User 'analyst1' does not have privileges to execute 'SELECT' on: sensitive.events [testimpala:21000] > select * from filtered.events; +---------+---------+------------+ | country | client | action | +---------+---------+------------+ | US | android | createNote | | FR | windows | updateNote | | US | android | updateNote | | FR | ios | createNote | | US | windows | updateTag | +---------+---------+------------+
And the second analyst can only see the US events:
[testimpala:21000] > $ impala-shell --quiet -l -u analyst2 LDAP password for analyst2: [testimpala:21000] > select * from filtered.events; ERROR: AuthorizationException: User 'analyst2' does not have privileges to execute 'SELECT' on: filtered.events [testimpala:21000] > select * from filtered.events_usonly; +---------+---------+------------+ | country | client | action | +---------+---------+------------+ | US | android | createNote | | US | android | updateNote | | US | windows | updateTag | +---------+---------+------------+ [testimpala:21000] > select client, count(*) as c from filtered.events_usonly group by 1; +---------+---+ | client | c | +---------+---+ | android | 2 | | windows | 1 | +---------+---+
We also use this via the Impala/Hive driver from Razor and JasperServer via the JDBC URL (jdbc:hive2://testimpala:21050/).
Futures…
The ‘sasldb2’ file is not a perfect long-term solution. There’s no UI for self-management of passwords by our analysts, so a sysadmin will need to help them every time they want to change their password. The flat file representation relies on root security on the box, so obviously wouldn’t be appropriate in otherwise-poorly-secured environments.
We’re sure that the LDAP capabilities will improve over time, although I’d be a bit nervous about using LDAP passwords for database connectivity, since desktop tools with SQL integrations would tend to manage and store passwords insecurely.
The same applies for the Sentry policy file. Manually loading this into HDFS whenever we add a user is manageable for now, but not a long-term solution. We could reduce the churn by creating OS-level accounts in OS-level groups and leveraging those, but that’s replacing one clunky group management solution with another.
We couldn’t figure out how to get Hue to talk to Impala properly with SASL+Sentry enabled, so we currently have Hue/Impala configured to talk to the Impala daemon on one of our data nodes, which does not have this enabled. (We’re using network-level ACLs for isolation for the time being.)
But, overall, this solution will meet our needs for a year or two while we’re still dealing with access from only a small number of analysts.