Keywords: Hive | CSV | skip.header.line.count | external table
Abstract: This article explores technical methods for skipping header rows in CSV files when creating Hive external tables. It introduces the skip.header.line.count property introduced in Hive v0.13.0, detailing its application in table creation and modification with example code. Additionally, it covers alternative approaches using OpenCSVSerde for finer control, along with considerations to help users handle data efficiently.
Introduction
In big data processing, Hive is widely used for querying and managing large datasets stored in Hadoop Distributed File System (HDFS). Often, data is stored in CSV files that include header rows containing column names. When creating external tables in Hive over such files, it is necessary to skip these header rows to avoid treating them as data records.
Core Method: Using skip.header.line.count
Starting from Hive version 0.13.0, a convenient table property skip.header.line.count is introduced to skip a specified number of header lines. This property can be set during table creation or modified later using ALTER TABLE.
To create an external table and skip the first row as header:
CREATE EXTERNAL TABLE testtable (
name string,
message string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION '/testtable'
TBLPROPERTIES ("skip.header.line.count"="1");
For an existing table, you can use:
ALTER TABLE tablename
SET TBLPROPERTIES ("skip.header.line.count"="1");
This method is straightforward and effective for most use cases.
Considerations and Limitations
It is important to note that skip.header.line.count may have unintended behavior when multiple output files are generated. For instance, if reducers are more than one, it skips the first record in each file, which might not always be desired. Users should be cautious in such scenarios.
Alternative Approach: Using OpenCSVSerde
As a supplementary method, the OpenCSVSerde provides more control over CSV parsing. It allows customization of separator, quote, and escape characters, along with header skipping.
CREATE EXTERNAL TABLE mydb.mytable (
product_name string,
brand_id string,
brand string,
color string,
description string,
sale_price string
)
PARTITIONED BY (seller_id string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = '\t',
'quoteChar' = '"',
'escapeChar' = '\\'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://namenode.com:port/data/mydb/mytable'
TBLPROPERTIES (
'serialization.null.format' = '',
'skip.header.line.count' = '1'
);
This approach offers flexibility for complex CSV formats.
Conclusion
To efficiently skip CSV header rows in Hive external tables, using the skip.header.line.count property is the recommended method for its simplicity. However, for advanced control, consider using OpenCSVSerde. Always test in your specific environment to ensure the desired behavior.