Earlier it was difficult to analyse data since it was all done in one place but with the evolution of Athena, the problem is resolved. Storage of data is entirely managed on Amazon S3. The bottleneck of DML interference which is virtually present in all the databases is not an issue in Athena. AWS Athena is a Data warehouse tool by Amazon that helps users to analyse tons of data. It reduces IT overheads, Does not require infrastructure, is Accurate and high speed, and Cost-effective solution for data analysing. Also, easy & simple to use. Therefore, the preferred choice for every organisation.
AWS Athena is a service offered by Amazon. It enables data analysts to query data from S3 (Simple Storage Service) by using Standard SQL (Structures Query Language) syntax. AWS is a leader in cloud computing technology. The most popular service in the Amazon analytics domain is known as AWS Athena. As Athena is a serverless query service, therefore there is no requirement for managing infrastructure or loading S3 data for analysis also. An analyst can access data via the AWS Management Console, an application programming interface. The user has to then design a scheme and start building to execute SQL queries.
An Amazon Athena user can query the encrypted data with the help of keys that are managed by AWS Key Management Service and then encrypt query results. It is thereby an analytical tool for two-way and responsive query service which helps organisations to analyse data of Amazon S3. It also enables cross-account access to any S3 buckets owned by other users.
Athena users can manage data catalogues to store all the information and schemes related to searches on Amazon S3 data. Thus, it can process unstructured, semistructured, and also structured data sets. This can be further used for research, log analysis, and also online analytical processing.
AWS Athena works directly with all the data that is stored in S3. It is used to run queries with the help of Presto which is a distributed SQL engine. It also uses Apache Spark hive for the creation and alteration of tables and partitions.
Before we find out how Athena works, let us understand all the prerequisites to start working on it!
A user needs to have an AWS account
Become a AWS Certified professional by learning this HKR AWS Training !
1. Create Databases - To begin, create both an Athena database and a table. Remember you need to create a table that will match CSV formats and files in the S3 billing bucket.
Few unseen errors that can be viewed are as follows
2. Partitioning Data - Users generally store their data in time series format and require to query specific data considering period time i.e day, months, and year. Without partitioning data, Athena will scan all the data without executing queries. By partitioning the data you can restrict Athena, thus reducing time, and effort, and lowering cost. This will help to improve performance also.
3. Convert data into columnar facts - Customers can use open sources columnar formats such as Apache ORC and Apache Parquet. It will save cost as well as improve the performance.
4. Compare the Performance - The user can compare the performance of the same query between Parquet files and text files.
AWS Athena charges the users by the amount of data that is scanned per query. By converting the data to columnar formats, Partitioning, and compressing it, you will save cost and also get better performance.
1. Serverless: As AWS Athena is serverless there is no need for infrastructure to use the tool. You can quickly query the data without the need of setting up servers or data warehouses. It allows the user to tap all the data easily. All you need is to set up a schema and begin querying by using the inbuilt query editor.
2. Cost Effective: For using AWS Athena users have to only pay for the queries they run. They only have to shell money for the amount of data that is scanned per query. Also, there are no additional charges for storage as the user has to perform queries directly in S3.
3. Widely Accessible: It is accessible on a larger scale for everyone from developer’s engineers to business analysts to other data specialists, as it is merely a query tool that uses normal Structure Query Language. The queries are simple & easy hence can be used by anyone.
4. Flexibility: A versatile and open architecture of AWS Athena guarantees that the user is not limited or tied up with a single provider, tool, or technology. For instance, the user can work with the Nth number of file formats that are open source without changing the schema swap amongst query engines.
5. Secure: Amazon provides its users with services that are secured. Security is its priority. The effectiveness of which is regularly tested and verified by third-party auditors which is a part of AWS compliance programs.
6. Integration: AWS Athena can integrate with a variety of tools such as AWS Glue, Key Management Service, and Amazon Quicksight. For example, If a user integrates Athena with Glue then the user can access the Glue catalogue which will help him to create a metadata repository across various services.
Want to know more abou AWS, visit here AWS tutorial !
Let’s take an example. As shown in the diagram given above. It depicts a data pipeline whereby data is retrieved and then put in S3 buckets taken from many sources. As the image indicates, unprocessed data depicts that they are not transferred yet. So, the user now can connect the data to S3 using AWS Athena and start analyzing them.
There is no requirement of setting up a database or using external tools for querying the raw data, hence a very straight approach. Once you complete your research and get the desired results. You can use EMR as seen in the diagram to do data transformations and complex analytics and return to S3 after cleaning and processing the data that is raw
The user can then use Athena for querying the processed data to analyse it further. Quick sight can also connect directly with AWS Athena and create images of the data which is stored on S3. You can also migrate the data to Amazon RedShift which is a MPP data warehouse for data analysis, then the user should use QuickSight to view the data from the Redshift.
AWS Athena has proved to be decently priced. But its billing process is not simple. It will let you know the funds used but would be difficult to see how and when.
Following are the optimization techniques that you can use to optimise AWS Athena
Partitioning the data in S3: Partitioning the data will help to divide it into parts and keep the related data in one place based on column values such as date, region, and country. It acts as virtual columns. First, you need to define the table and then reduce the data that needs to be scanned for the query. You can restrict the query i.e., the amount of data that should be scanned based on the partition. Therefore, Athena will only scan the necessary data and charge accordingly.
Use Data Compression Techniques: AWS Athena supports many compression formats for both readings and writing the data. It can successfully read the data from a table that uses Parquet file format when it is compressed with snappy and others with GZIP. The same condition applies to text files, ORC, and JSON storage formats. Various compression formats that Athena supports are GZIP, LZO, SNAPPY, DEFLATE, and others.
Optimise JOIN conditions in queries: Selecting the right join order is important for better query performance. In case you are joining two tables, mention the larger table and smaller table on the left and right side of the join respectively. It distributes the table to worker notes which is on the right and then the left side of the table is streamed to do the join. If the right-side table is smaller then less memory is utilised and also the query runs faster.
Top 50+ frequently asked AWS interview questions !
AWS Athena is very cost-effective. The charges are calculated on the quantity of data that is scanned during query execution. It costs around $5 per Terabyte of data. An additional advantage of using Athena is it allows users to compress data and have Columnar formats, and also routinely delete the old results sets.
Additional charges: Since Amazon Athena reads your data that is stored in Amazon S3 there will be nominal charges for the storage of the data depending on the nature of its storage. History & results are also stored in the S3 bucket therefore, there will be charges which are very nominal for the data stored in that bucket too.
For Instance - Let us consider a table with equal size on S3 with three columns with a size of 3 TB in total as an uncompressed file. Since the text formats are impossible to be divided, running queries to extract the data from each column of the table would require AWS Athena to scan the whole file.
Thus the query would cost approximately $15. Price of 3 TB ( Terabyte) of data scanned i.e. 3*5= 15 $
AWS Athena is thus very cost-effective when compared with its competitors in the market. As it charges only when the user utilises it. Athena helps the user with the ability to use the standard SQL statements on data that is stored in S3 buckets. As seen, there are many benefits of using Amazon Athena along with a few limitations. All you need is to understand the requirements of your organisation and make a sound decision.
AWS CloudFormation: https://hkrtrainings.com/aws-cloudformation
AWS CloudWatch: https://hkrtrainings.com/aws-cloudwatch
AWS Elastic Beanstalk: https://hkrtrainings.com/aws-elastic-beanstalk
AWS elasticsearch: https://hkrtrainings.com/aws-elasticsearch
Big data in AWS: https://hkrtrainings.com/big-data-in-aws
Batch starts on 29th Sep 2023, Fast Track batch
Batch starts on 3rd Oct 2023, Weekday batch
Batch starts on 7th Oct 2023, Weekend batch
S3 is a lightweight solution that is designed to let the analyst use SQL to perform a single object but Athena can be used for querying multiple objects at once. Athena can also be used for complex queries while a user can use S3 only for simple queries. AWS Console can use Athena whereas S3 is an API.
The advantages and disadvantages of using Athena are as follows:
The limitations of AWS Athena are:
Both Redshift spectrum and Amazon Athena are serverless but they differ in various aspects from one another.
Amazon Glue is an ETL service that allows users to manipulate data and also the management of data pipelines. Therefore it is more of a transformation and data movement tool. On other hand, Amazon Athena is majorly used as a query tool for the analysis of data.
Amazon Athena and Presto both belong to Big Data Tools. Athena is serverless and requires no infrastructure. Users will have to pay for the queries that are run. In nutshell, it is an interactive query service where users can analyse data in Amazon in S3 using Structured Query Language. However, Presto is described as a “ Distributed Structure Query Language ( SQL) engine for Big Data”. It is also used for running interactive analytics queries for all data sources of all sizes that range from gigabytes to petabytes.
Spark is an engine for processing large-scale data which is compatible with Hadoop data. Whereas Athena is an interactive tool for queries for easy analysis of data in Amazon S3 using SQL.