Hadoop 2.5.0 Release Notes
These release notes include new developer and user-facing incompatibilities, features, and major improvements.
Changes since Hadoop 2.4.1
- YARN-2335.
Minor improvement reported by Wei Yan and fixed by Wei Yan
Annotate all hadoop-sls APIs as @Private
- YARN-2319.
Major test reported by Wenwu Peng and fixed by Wenwu Peng (resourcemanager)
Fix MiniKdc not being closed in TestRMWebServicesDelegationTokens.java
MiniKdc is only started, never stopped, in TestRMWebServicesDelegationTokens.java:
{code}
testMiniKDC.start();
{code}
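A sketch of the likely fix, using a stand-in KDC class (MiniKdc itself needs a running Kerberos environment, so the class here is illustrative): pair the existing start() with a stop() in the test teardown.

```java
// Sketch of the fix pattern: every start() needs a matching stop() in
// teardown. FakeKdc stands in for MiniKdc, whose start()/stop() pair the
// test currently leaves unbalanced.
public class KdcLifecycle {
    static class FakeKdc {
        boolean running;
        void start() { running = true; }
        void stop()  { running = false; }
    }

    static final FakeKdc testMiniKDC = new FakeKdc();

    static void setUp() {
        testMiniKDC.start();          // existing behavior
    }

    static void tearDown() {          // the missing piece
        if (testMiniKDC != null) {
            testMiniKDC.stop();
        }
    }

    public static void main(String[] args) {
        setUp();
        tearDown();
        System.out.println(testMiniKDC.running); // false once stop() is called
    }
}
```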
- YARN-2300.
Major improvement reported by Varun Vasudev and fixed by Varun Vasudev (documentation)
Document better sample requests for RM web services for submitting apps
The documentation for RM web services should provide better examples for app submission.
- YARN-2270.
Minor test reported by Ted Yu and fixed by Akira AJISAKA
TestFSDownload#testDownloadPublicWithStatCache fails in trunk
From https://builds.apache.org/job/Hadoop-yarn-trunk/608/console :
{code}
Running org.apache.hadoop.yarn.util.TestFSDownload
Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.955 sec <<< FAILURE! - in org.apache.hadoop.yarn.util.TestFSDownload
testDownloadPublicWithStatCache(org.apache.hadoop.yarn.util.TestFSDownload) Time elapsed: 0.137 sec <<< FAILURE!
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at org.apache.hadoop.yarn.util.TestFSDownload.testDownloadPublicWithStatCache(TestFSDownload.java:363)
{code}
Similar error can be seen here: https://builds.apache.org/job/PreCommit-YARN-Build/4243//testReport/org.apache.hadoop.yarn.util/TestFSDownload/testDownloadPublicWithStatCache/
Looks like future.get() returned null.
- YARN-2250.
Major bug reported by Krisztian Horvath and fixed by Krisztian Horvath (scheduler)
FairScheduler.findLowestCommonAncestorQueue returns null when queues not identical
We need to update the queue metrics up to the lowest common ancestor of the target and source queues. This method fails to retrieve the correct queue.
- YARN-2247.
Blocker sub-task reported by Varun Vasudev and fixed by Varun Vasudev
Allow RM web services users to authenticate using delegation tokens
The RM webapp should allow users to authenticate using delegation tokens to maintain parity with RPC.
- YARN-2241.
Minor bug reported by Robert Kanter and fixed by Robert Kanter (resourcemanager)
ZKRMStateStore: On startup, show nicer messages if znodes already exist
When using the RMZKStateStore, if you restart the RM, you get a bunch of stack traces with messages like {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /rmstore}}. This is expected as these nodes already exist from before. We should catch these and print nicer messages.
- YARN-2233.
Blocker sub-task reported by Varun Vasudev and fixed by Varun Vasudev (resourcemanager)
Implement web services to create, renew and cancel delegation tokens
Implement functionality to create, renew and cancel delegation tokens.
- YARN-2232.
Major bug reported by Varun Vasudev and fixed by Varun Vasudev
ClientRMService doesn't allow delegation token owner to cancel their own token in secure mode
The ClientRMService doesn't allow delegation token owners to cancel their own tokens. The root cause is this piece of code from the cancelDelegationToken function:
{noformat}
String user = getRenewerForToken(token);
...
private String getRenewerForToken(Token<RMDelegationTokenIdentifier> token)
    throws IOException {
  UserGroupInformation user = UserGroupInformation.getCurrentUser();
  UserGroupInformation loginUser = UserGroupInformation.getLoginUser();
  // we can always renew our own tokens
  return loginUser.getUserName().equals(user.getUserName())
      ? token.decodeIdentifier().getRenewer().toString()
      : user.getShortUserName();
}
{noformat}
It ends up passing the user short name to the cancelToken function whereas AbstractDelegationTokenSecretManager::cancelToken expects the full user name. This bug occurs in secure mode and is not an issue with simple auth.
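To illustrate the mismatch with a self-contained sketch (the principal string and the realm-stripping helper are illustrative; real short-name resolution goes through UserGroupInformation and the auth_to_local rules):

```java
public class PrincipalNames {
    // Simplified short-name derivation, for illustration only: in secure
    // mode the full name is a Kerberos principal like "alice@EXAMPLE.COM",
    // while the short name is just "alice".
    static String shortName(String fullPrincipal) {
        int at = fullPrincipal.indexOf('@');
        return at < 0 ? fullPrincipal : fullPrincipal.substring(0, at);
    }

    public static void main(String[] args) {
        String owner = "alice@EXAMPLE.COM";        // token owner's full name
        String passed = shortName(owner);          // what the buggy code passes
        System.out.println(passed);                // alice
        System.out.println(owner.equals(passed));  // false -> cancel rejected
    }
}
```

A secret manager that stores owners by full name can therefore never match the short name the buggy path hands it.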
- YARN-2224.
Trivial test reported by Anubhav Dhoot and fixed by Anubhav Dhoot (nodemanager)
Explicitly enable vmem check in TestContainersMonitor#testContainerKillOnMemoryOverflow
If the default setting DEFAULT_NM_VMEM_CHECK_ENABLED is set to false, the test will fail. Make the test not rely on the default settings; instead, have it explicitly enable the setting and verify that the memory check is actually performed. See YARN-2225, which suggests turning the default off.
- YARN-2216.
Minor test reported by Ted Yu and fixed by Zhijie Shen
TestRMApplicationHistoryWriter sometimes fails in trunk
From https://builds.apache.org/job/Hadoop-Yarn-trunk/595/ :
{code}
testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 33.469 sec <<< FAILURE!
java.lang.AssertionError: expected:<10000> but was:<7156>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430)
at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391)
{code}
- YARN-2204.
Trivial bug reported by Robert Kanter and fixed by Robert Kanter (resourcemanager)
TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
- YARN-2201.
Major bug reported by Ray Chiang and fixed by Varun Vasudev
TestRMWebServicesAppsModification dependent on yarn-default.xml
TestRMWebServicesAppsModification.java has some errors that are yarn-default.xml dependent. By changing yarn-default.xml properties, I'm seeing the following errors:
1) Changing yarn.resourcemanager.scheduler.class from capacity.CapacityScheduler to fair.FairScheduler gives the error:
{code}
Running org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 79.047 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
testSingleAppKillUnauthorized[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 3.22 sec <<< FAILURE!
java.lang.AssertionError: expected:<Forbidden> but was:<Accepted>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:144)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKillUnauthorized(TestRMWebServicesAppsModification.java:458)
{code}
2) Changing yarn.acl.enable from false to true results in the following errors:
{code}
Running org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
Tests run: 10, Failures: 4, Errors: 0, Skipped: 0, Time elapsed: 49.044 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
testSingleAppKill[0](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 2.986 sec <<< FAILURE!
java.lang.AssertionError: expected:<Accepted> but was:<Unauthorized>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:144)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:287)
testSingleAppKillInvalidState[0](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 2.258 sec <<< FAILURE!
java.lang.AssertionError: expected:<Bad Request> but was:<Unauthorized>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:144)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKillInvalidState(TestRMWebServicesAppsModification.java:369)
testSingleAppKillUnauthorized[0](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 2.263 sec <<< FAILURE!
java.lang.AssertionError: expected:<Forbidden> but was:<Unauthorized>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:144)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKillUnauthorized(TestRMWebServicesAppsModification.java:458)
testSingleAppKillInvalidId[0](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 0.214 sec <<< FAILURE!
java.lang.AssertionError: expected:<Not Found> but was:<Unauthorized>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:144)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKillInvalidId(TestRMWebServicesAppsModification.java:482)
{code}
I'm opening this JIRA as a discussion for the best way to fix this. I've got a few ideas, but I would like to get some feedback about potentially more robust ways to fix this test.
- YARN-2195.
Trivial improvement reported by Wei Yan and fixed by Wei Yan
Clean a piece of code in ResourceRequest
{code}
if (numContainersComparison == 0) {
  return 0;
} else {
  return numContainersComparison;
}
{code}
This code can be simplified to:
{code}
return numContainersComparison;
{code}
- YARN-2192.
Major bug reported by Anubhav Dhoot and fixed by Anubhav Dhoot
TestRMHA fails when run with a mix of Schedulers
If the test is run with the FairScheduler, some of the tests fail because the metrics system objects are shared across tests and not destroyed completely.
{code}
Error Message
Metrics source QueueMetrics,q0=root already exists!
Stacktrace
org.apache.hadoop.metrics2.MetricsException: Metrics source QueueMetrics,q0=root already exists!
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:126)
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:107)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:217)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:96)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1281)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:427)
{code}
- YARN-2191.
Major bug reported by Wangda Tan and fixed by Wangda Tan (resourcemanager)
Add a test to make sure NM will do application cleanup even if RM restarting happens before application completed
In YARN-1885, there's a test in TestApplicationCleanup#testAppCleanupWhenRestartedAfterAppFinished. However, we need one more test to make sure the NM will do app cleanup when the restart happens before the app finishes. The sequence is:
1. Submit app1 to RM1
2. NM1 launches app1's AM (container-0), NM2 launches app1's task containers.
3. Restart RM1
4. Before RM1 finishes restarting, container-0 completed in NM1
5. RM1 finishes restarting, NM1 reports container-0 as completed, and app1 completes
6. RM1 should be able to notify NM1/NM2 to cleanup app1.
- YARN-2187.
Major bug reported by Robert Kanter and fixed by Robert Kanter (fairscheduler)
FairScheduler: Disable max-AM-share check by default
Say you have a small cluster with 8GB of memory and 5 queues. Each queue's equal share is 8GB / 5 = 1.6GB, but an AM requires 2GB to start, so no AMs can be started. The max-AM-share check should be disabled by default so users don't see a regression. On medium-sized clusters, it still makes sense to set max-am-share to a value between 0 and 1.
- YARN-2171.
Critical bug reported by Jason Lowe and fixed by Jason Lowe (capacityscheduler)
AMs block on the CapacityScheduler lock during allocate()
When AMs heartbeat into the RM via the allocate() call they are blocking on the CapacityScheduler lock when trying to get the number of nodes in the cluster via getNumClusterNodes.
- YARN-2167.
Major bug reported by Junping Du and fixed by Junping Du (nodemanager)
LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block
In NMLeveldbStateStoreService#loadLocalizationState(), we use a LeveldbIterator to read the NM's localization state, but it is not closed in a finally block. We should close this connection to the DB, as is common practice.
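A self-contained sketch of the close-in-finally pattern being asked for (StubIterator stands in for LeveldbIterator; the real fix wraps the leveldb reads):

```java
import java.io.Closeable;

public class IteratorCleanup {
    // Stand-in for LeveldbIterator: tracks whether close() was called.
    static class StubIterator implements Closeable {
        boolean closed;
        void next() { throw new RuntimeException("read failure"); }
        @Override public void close() { closed = true; }
    }

    // Read state, guaranteeing the iterator is closed even on failure.
    static StubIterator loadState() {
        StubIterator iter = new StubIterator();
        try {
            iter.next(); // throws here, simulating a mid-read error
        } catch (RuntimeException e) {
            // swallowed for the demo; real code would rethrow as IOException
        } finally {
            iter.close();
        }
        return iter;
    }

    public static void main(String[] args) {
        System.out.println(loadState().closed); // true: closed despite the error
    }
}
```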
- YARN-2163.
Minor bug reported by Wangda Tan and fixed by Wangda Tan (resourcemanager, webapp)
WebUI: Order of AppId in apps table should be consistent with ApplicationId.compareTo().
Currently, AppId is treated as numeric, so the applications table is sorted by the int-typed id only (not including the cluster timestamp); see the attached screenshot. The order of AppId in the web page should be consistent with ApplicationId.compareTo().
- YARN-2159.
Trivial improvement reported by Ray Chiang and fixed by Ray Chiang (resourcemanager)
Better logging in SchedulerNode#allocateContainer
This bit of code:
{code}
LOG.info("Assigned container " + container.getId() + " of capacity "
    + container.getResource() + " on host " + rmNode.getNodeAddress()
    + ", which currently has " + numContainers + " containers, "
    + getUsedResource() + " used and " + getAvailableResource()
    + " available");
{code}
results in a line like:
{code}
2014-05-30 16:17:43,573 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_1400666605555_0009_01_001111 of capacity <memory:1536, vCores:1> on host machine.host.domain.com:8041, which currently has 18 containers, <memory:27648, vCores:18> used and <memory:3072, vCores:0> available
{code}
That message is fine in most cases, but looks pretty bad after the last available allocation, since it says something like "vCores:0 available".
Here is one suggested phrasing:
- "which has 18 containers, <memory:27648, vCores:18> used and <memory:3072, vCores:0> available after allocation"
- YARN-2155.
Major bug reported by Wei Yan and fixed by Wei Yan
FairScheduler: Incorrect threshold check for preemption
{code}
private boolean shouldAttemptPreemption() {
  if (preemptionEnabled) {
    return (preemptionUtilizationThreshold < Math.max(
        (float) rootMetrics.getAvailableMB() / clusterResource.getMemory(),
        (float) rootMetrics.getAvailableVirtualCores() /
            clusterResource.getVirtualCores()));
  }
  return false;
}
{code}
preemptionUtilizationThreshold should be compared with the allocated resources instead of the available resources.
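A self-contained sketch of the corrected check (parameter names are illustrative; the real code reads QueueMetrics and the cluster Resource): attempt preemption when utilization, i.e. allocated over capacity, exceeds the threshold, rather than when the free fraction does.

```java
public class PreemptionCheck {
    // Corrected logic: compare the threshold against allocated/capacity.
    static boolean shouldAttemptPreemption(boolean enabled, float threshold,
            long allocatedMB, long totalMB, int allocatedVCores, int totalVCores) {
        if (!enabled) {
            return false;
        }
        return threshold < Math.max(
            (float) allocatedMB / totalMB,
            (float) allocatedVCores / totalVCores);
    }

    public static void main(String[] args) {
        // 90% of memory allocated, threshold 0.8 -> attempt preemption
        System.out.println(shouldAttemptPreemption(true, 0.8f, 9216, 10240, 4, 10));
        // 10% allocated -> below threshold, don't preempt
        System.out.println(shouldAttemptPreemption(true, 0.8f, 1024, 10240, 1, 10));
    }
}
```

With the buggy available-based comparison, the second case (a nearly idle cluster) is exactly the one that would have triggered preemption.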
- YARN-2152.
Major sub-task reported by Jian He and fixed by Jian He (resourcemanager)
Recover missing container information
Container information such as the container priority and container start time cannot be recovered, because today the NM lacks this information to send across on NM registration when RM recovery happens.
- YARN-2148.
Major bug reported by Wangda Tan and fixed by Wangda Tan (client)
TestNMClient fails due to more exit code values being added and passed to the AM
Currently, TestNMClient fails in trunk; see https://builds.apache.org/job/PreCommit-YARN-Build/3959/testReport/junit/org.apache.hadoop.yarn.client.api.impl/TestNMClient/testNMClient/
{code}
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:385)
at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:347)
at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226)
{code}
Test cases in TestNMClient use the following code to verify the exit code of COMPLETED containers:
{code}
testGetContainerStatus(container, i, ContainerState.COMPLETE,
    "Container killed by the ApplicationMaster.",
    Arrays.asList(new Integer[] {137, 143, 0}));
{code}
But YARN-2091 added logic to make the exit code reflect the actual status, so the exit code of a container killed by the ApplicationMaster will be -105:
{code}
if (container.hasDefaultExitCode()) {
  container.exitCode = exitEvent.getExitCode();
}
{code}
We should update the test case as well.
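A sketch of the updated expectation (assuming the fix simply adds -105 to the accepted set; the exact shape of the final patch may differ):

```java
import java.util.Arrays;
import java.util.List;

public class KilledByAmExitCodes {
    // 137/143 are signal-based codes (SIGKILL/SIGTERM); -105 is the
    // "killed by ApplicationMaster" exit status introduced by YARN-2091.
    static final List<Integer> ACCEPTED = Arrays.asList(137, 143, 0, -105);

    public static void main(String[] args) {
        System.out.println(ACCEPTED.contains(-105)); // true with the updated set
    }
}
```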
- YARN-2132.
Major bug reported by Karthik Kambatla and fixed by Vamsee Yarlagadda (resourcemanager)
ZKRMStateStore.ZKAction#runWithRetries doesn't log the exception it encounters
If we encounter any ZooKeeper issues, we don't know what is going on unless we exhaust all the retries. It would really help to log the exception sooner, so we know what is going on with the cluster.
- YARN-2128.
Major bug reported by Wei Yan and fixed by Wei Yan
FairScheduler: Incorrect calculation of amResource usage
1. The amResource should be normalized instead of fetching from ApplicationSubmissionContext directly.
{code}
ApplicationSubmissionContext appSubmissionContext =
    rmContext.getRMApps().get(applicationAttemptId.getApplicationId())
        .getApplicationSubmissionContext();
if (appSubmissionContext != null) {
  amResource = appSubmissionContext.getResource();
  unmanagedAM = appSubmissionContext.getUnmanagedAM();
}
{code}
2. When one application is removed, the FSLeafQueue's amResourceUsage should be updated only if the app has started its AM.
{code}
if (runnableAppScheds.remove(app.getAppSchedulable())) {
  // Update AM resource usage
  if (app.getAMResource() != null) {
    Resources.subtractFrom(amResourceUsage, app.getAMResource());
  }
  return true;
}
{code}
- YARN-2125.
Minor task reported by Wangda Tan and fixed by Wangda Tan (resourcemanager, scheduler)
ProportionalCapacityPreemptionPolicy should only log CSV when debug enabled
Currently, logToCSV() is output via LOG.info() in ProportionalCapacityPreemptionPolicy, which generates non-human-readable text in the resource manager's log every few seconds, like:
{code}
...
2014-06-05 15:57:07,603 INFO org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy: QUEUESTATE: 1401955027603, a1, 4096, 3, 2048, 2, 4096, 3, 4096, 3, 0, 0, 0, 0, b1, 3072, 2, 1024, 1, 3072, 2, 3072, 2, 0, 0, 0, 0, b2, 3072, 2, 1024, 1, 3072, 2, 3072, 2, 0, 0, 0, 0
2014-06-05 15:57:10,603 INFO org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy: QUEUESTATE: 1401955030603, a1, 4096, 3, 2048, 2, 4096, 3, 4096, 3, 0, 0, 0, 0, b1, 3072, 2, 1024, 1, 3072, 2, 3072, 2, 0, 0, 0, 0, b2, 3072, 2, 1024, 1, 3072, 2, 3072, 2, 0, 0, 0, 0
...
{code}
It's better to output this only when debug logging is enabled.
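The guard pattern in question, sketched with java.util.logging so the example is self-contained (YARN uses commons-logging, where the equivalent check is LOG.isDebugEnabled()): gate both the expensive CSV construction and the emission behind the level check.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class GuardedCsvLogging {
    static final Logger LOG = Logger.getLogger(GuardedCsvLogging.class.getName());
    static int csvBuilds = 0; // counts how often the CSV is actually built

    static String logToCSV() {
        csvBuilds++; // expensive string join in the real code
        return "QUEUESTATE: 1401955027603, a1, 4096, 3, ...";
    }

    static void maybeLogQueueState() {
        // Analogous to: if (LOG.isDebugEnabled()) { LOG.debug(logToCSV()); }
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(logToCSV());
        }
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO);  // debug disabled: CSV is never built
        maybeLogQueueState();
        System.out.println(csvBuilds); // 0
    }
}
```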
- YARN-2124.
Critical bug reported by Wangda Tan and fixed by Wangda Tan (resourcemanager, scheduler)
ProportionalCapacityPreemptionPolicy cannot work because it's initialized before scheduler initialized
While experimenting with the scheduler with preemption enabled, I found that ProportionalCapacityPreemptionPolicy cannot work; an NPE is raised when the RM starts:
{code}
2014-06-05 11:01:33,201 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[SchedulingMonitor (ProportionalCapacityPreemptionPolicy),5,main] threw an Exception.
java.lang.NullPointerException
at org.apache.hadoop.yarn.util.resource.Resources.greaterThan(Resources.java:225)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.computeIdealResourceDistribution(ProportionalCapacityPreemptionPolicy.java:302)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.recursivelyComputeIdealAssignment(ProportionalCapacityPreemptionPolicy.java:261)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:198)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:174)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:72)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PreemptionChecker.run(SchedulingMonitor.java:82)
at java.lang.Thread.run(Thread.java:744)
{code}
This is caused by ProportionalCapacityPreemptionPolicy needing the ResourceCalculator from the CapacityScheduler, but ProportionalCapacityPreemptionPolicy is initialized before the CapacityScheduler, so the ResourceCalculator is left null in ProportionalCapacityPreemptionPolicy.
- YARN-2122.
Major bug reported by Karthik Kambatla and fixed by Robert Kanter (scheduler)
In AllocationFileLoaderService, the reloadThread should be created in init() and started in start()
AllocationFileLoaderService has a reloadThread that is currently created and started in start(). Instead, it should be created in init() and started in start().
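The lifecycle split being requested, sketched with a minimal stand-in class (Hadoop services implement this via serviceInit()/serviceStart()): construct the thread during init, start it only during start, so a service that is init()'d but never start()'d leaves no thread running.

```java
public class ReloadLifecycle {
    // Minimal stand-in for AllocationFileLoaderService's lifecycle.
    static class LoaderService {
        Thread reloadThread;

        void init() {
            // Create, but do not start.
            reloadThread = new Thread(() -> { /* reload loop */ });
            reloadThread.setDaemon(true);
        }

        void start() {
            reloadThread.start();
        }
    }

    public static void main(String[] args) {
        LoaderService svc = new LoaderService();
        svc.init();
        System.out.println(svc.reloadThread.isAlive()); // false: created, not started
        svc.start(); // only now does the thread run
    }
}
```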
- YARN-2121.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
TimelineAuthenticator#hasDelegationToken may throw NPE
{code}
private boolean hasDelegationToken(URL url) {
  return url.getQuery().contains(
      TimelineAuthenticationConsts.DELEGATION_PARAM + "=");
}
{code}
If the given URL doesn't have any query parameters at all, url.getQuery() returns null and this throws an NPE.
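A null-safe version, as a self-contained sketch (the "delegation" parameter name and the check() wrapper are for illustration):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class DelegationTokenCheck {
    static final String DELEGATION_PARAM = "delegation"; // assumed for the sketch

    // Null-safe version: URL.getQuery() returns null when there is no
    // query string, so guard before calling contains().
    static boolean hasDelegationToken(URL url) {
        String query = url.getQuery();
        return query != null && query.contains(DELEGATION_PARAM + "=");
    }

    // Convenience wrapper for the demo.
    static boolean check(String spec) {
        try {
            return hasDelegationToken(new URL(spec));
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(check("http://host/ws/v1/timeline"));              // false, no NPE
        System.out.println(check("http://host/ws/v1/timeline?delegation=x")); // true
    }
}
```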
- YARN-2119.
Major bug reported by Anubhav Dhoot and fixed by Anubhav Dhoot
DEFAULT_PROXY_ADDRESS should use DEFAULT_PROXY_PORT
The fix for [YARN-1590|https://issues.apache.org/jira/browse/YARN-1590] introduced a method to get the web proxy bind address with an incorrect default port. Because the method's only user ignores the port, it's not breaking anything yet. Fixing it in case someone else uses this in the future.
- YARN-2118.
Major sub-task reported by Ted Yu and fixed by Ted Yu
Type mismatch in contains() check of TimelineWebServices#injectOwnerInfo()
{code}
if (timelineEntity.getPrimaryFilters() != null &&
    timelineEntity.getPrimaryFilters().containsKey(
        TimelineStore.SystemFilter.ENTITY_OWNER)) {
  throw new YarnException(
getPrimaryFilters() returns a Map keyed by String.
However, TimelineStore.SystemFilter.ENTITY_OWNER is an enum.
Their types don't match.
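A minimal demonstration of why the check never matches (Map.containsKey accepts any Object, so the mismatch compiles silently):

```java
import java.util.HashMap;
import java.util.Map;

public class OwnerFilterMismatch {
    enum SystemFilter { ENTITY_OWNER }

    public static void main(String[] args) {
        Map<String, Object> filters = new HashMap<>();
        filters.put("ENTITY_OWNER", "alice"); // keys are Strings
        // Passing the enum constant never matches a String key:
        System.out.println(filters.containsKey(SystemFilter.ENTITY_OWNER));            // false
        // Comparing against the enum's string form does:
        System.out.println(filters.containsKey(SystemFilter.ENTITY_OWNER.toString())); // true
    }
}
```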
- YARN-2117.
Minor sub-task reported by Ted Yu and fixed by Chen He
Close of Reader in TimelineAuthenticationFilterInitializer#initFilter() should be enclosed in finally block
Here is the related code:
{code}
Reader reader = new FileReader(signatureSecretFile);
int c = reader.read();
while (c > -1) {
  secret.append((char) c);
  c = reader.read();
}
reader.close();
If an IOException is thrown from reader.read(), the reader is left unclosed.
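A sketch of the fix using try-with-resources, which closes the reader even if read() throws (a classic finally block works equally well; the temp-file handling is only for the demo):

```java
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

public class SecretFileReader {
    // The reader is closed on every exit path, including exceptions.
    static String readSecret(File f) throws IOException {
        StringBuilder secret = new StringBuilder();
        try (Reader reader = new FileReader(f)) {
            int c;
            while ((c = reader.read()) > -1) {
                secret.append((char) c);
            }
        }
        return secret.toString();
    }

    // Self-contained demo: write a temp file, read it back.
    static String demo() {
        try {
            File tmp = File.createTempFile("secret", ".txt");
            tmp.deleteOnExit();
            try (Writer w = new FileWriter(tmp)) {
                w.write("s3cret");
            }
            return readSecret(tmp);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // s3cret
    }
}
```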
- YARN-2115.
Major sub-task reported by Jian He and fixed by Jian He
Replace RegisterNodeManagerRequest#ContainerStatus with a new NMContainerStatus
This jira covers protocol changes only: replace the ContainerStatus sent across via the NM register call with a new NMContainerStatus that includes all the information necessary for container recovery.
- YARN-2112.
Major bug reported by Zhijie Shen and fixed by Zhijie Shen
Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml
Now YarnClient uses TimelineClient, which depends on the jackson libs. However, the current dependency configurations make the hadoop-client artifact miss two jackson libs, such that applications with a hadoop-client dependency will see the following exception
{code}
java.lang.NoClassDefFoundError: org/codehaus/jackson/jaxrs/JacksonJaxbJsonProvider
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.<init>(TimelineClientImpl.java:92)
at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:44)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:149)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.mapred.ResourceMgrDelegate.serviceInit(ResourceMgrDelegate.java:94)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.mapred.ResourceMgrDelegate.<init>(ResourceMgrDelegate.java:88)
at org.apache.hadoop.mapred.YARNRunner.<init>(YARNRunner.java:111)
at org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34)
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:394)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306)
at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.jaxrs.JacksonJaxbJsonProvider
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 48 more
{code}
when using YarnClient to submit an application.
- YARN-2111.
Major bug reported by Sandy Ryza and fixed by Sandy Ryza (scheduler)
In FairScheduler.attemptScheduling, we don't count containers as assigned if they have 0 memory but non-zero cores
{code}
if (Resources.greaterThan(RESOURCE_CALCULATOR, clusterResource,
    queueMgr.getRootQueue().assignContainer(node),
    Resources.none())) {
As RESOURCE_CALCULATOR is a DefaultResourceCalculator, cores are not taken into account here.
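A minimal illustration (greaterThanMemoryOnly is a stand-in for the DefaultResourceCalculator comparison, which considers memory only): an assignment of <0 MB, 2 vCores> compares as not greater than Resources.none(), so the container is not counted as assigned.

```java
public class MemoryOnlyCompare {
    // Stand-in for the DefaultResourceCalculator comparison: memory only,
    // vcores intentionally ignored -- the behavior behind this bug.
    static boolean greaterThanMemoryOnly(int memA, int coresA, int memB, int coresB) {
        return memA > memB;
    }

    public static void main(String[] args) {
        // Container assigned with 0 MB but 2 vcores vs Resources.none() (0, 0):
        System.out.println(greaterThanMemoryOnly(0, 2, 0, 0)); // false -> not counted
        // A dominant-resource style check would also consider cores:
        boolean dominant = Math.max(0, 2) > Math.max(0, 0);
        System.out.println(dominant); // true -> would be counted
    }
}
```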
- YARN-2109.
Major bug reported by Anubhav Dhoot and fixed by Karthik Kambatla (scheduler)
Fix TestRM to work with both schedulers
testNMTokenSentForNormalContainer requires the CapacityScheduler and was fixed in [YARN-1846|https://issues.apache.org/jira/browse/YARN-1846] to explicitly set it to be the CapacityScheduler. But if the default scheduler is set to the FairScheduler, the tests that execute after this one will fail with invalid cast exceptions when getting queue metrics. This depends on test execution order, as only the tests that execute after this test will fail: the queue metrics are initialized by this test to QueueMetrics and shared by the subsequent tests.
We can explicitly clear the metrics at the end of this test to fix this.
For example:
{code}
java.lang.ClassCastException: org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics cannot be cast to org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:103)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1275)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:418)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:808)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:230)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:90)
at org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:85)
at org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:81)
at org.apache.hadoop.yarn.server.resourcemanager.TestRM.testNMToken(TestRM.java:232)
{code}
- YARN-2107.
Major sub-task reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli
Refactor timeline classes into server.timeline package
Right now, most of the timeline-server classes are present in an applicationhistoryserver package instead of a top-level timeline package.
This is one part of YARN-2043; there is more to do.
- YARN-2105.
Major test reported by Ted Yu and fixed by Ashwin Shankar
Fix TestFairScheduler after YARN-2012
The following tests fail in trunk:
{code}
Failed tests:
  TestFairScheduler.testDontAllowUndeclaredPools:2412 expected:<1> but was:<0>
Tests in error:
  TestFairScheduler.testQueuePlacementWithPolicy:624 NullPointer
  TestFairScheduler.testNotUserAsDefaultQueue:530 » NullPointer
{code}
- YARN-2104.
Major bug reported by Wangda Tan and fixed by Wangda Tan (resourcemanager, webapp)
Scheduler queue filter failed to work because index of queue column changed
YARN-563 added:
{code}
+ th(".type", "Application Type").
{code}
to the application table, which changes the queue column's index from 3 to 4. But in the scheduler page, the queue column index is hard-coded to 3 when filtering applications by queue name:
{code}
" if (q == 'root') q = '';",
" else q = '^' + q.substr(q.lastIndexOf('.') + 1) + '$';",
" $('#apps').dataTable().fnFilter(q, 3, true);",
{code}
So the queue filter will not work for the application page.
Reproduction steps (thanks to Bo Yang for pointing this out):
{code}
1) In the default setup, there's a default queue under the root queue
2) Run an arbitrary application; you can find it in the "Applications" page
3) Click the "Default" queue in the scheduler page
4) Click "Applications"; no application will show here
5) Click the "Root" queue in the scheduler page
6) Click "Applications"; the application will show again
{code}
- YARN-2103.
Major bug reported by Binglin Chang and fixed by Binglin Chang
Inconsistency between viaProto flag and initial value of SerializedExceptionProto.Builder
Bug 1:
{code}
SerializedExceptionProto proto = SerializedExceptionProto
.getDefaultInstance();
SerializedExceptionProto.Builder builder = null;
boolean viaProto = false;
{code}
Since viaProto is false, we should initialize the builder rather than the proto.
Bug 2:
The class also does not provide hashCode() and equals() like other PBImpl records do. Since this class is embedded in other records, that may affect their behavior.
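The conventional PBImpl fix is to start from a fresh builder when viaProto is false. A minimal sketch of the invariant follows; class and field names are stand-ins, not the actual SerializedExceptionPBImpl code:

```java
// Minimal sketch of the PBImpl proto/builder convention (illustrative names).
// The invariant: viaProto == true means `proto` holds the current state;
// viaProto == false means `builder` does. The bug described above initialized
// `proto` while viaProto was false, violating this invariant.
public class RecordPBImpl {
    // Stand-ins for the generated protobuf types.
    static class Proto { final int value; Proto(int v) { value = v; } }
    static class Builder {
        int value;
        Builder() {}
        Builder(Proto p) { value = p.value; }
        Proto build() { return new Proto(value); }
    }

    private Proto proto;
    private Builder builder = new Builder(); // start from a builder, not a proto
    private boolean viaProto = false;

    public void setValue(int v) {
        maybeInitBuilder();
        builder.value = v;
    }

    public int getValue() {
        return viaProto ? proto.value : builder.value;
    }

    public Proto getProto() {
        proto = viaProto ? proto : builder.build();
        viaProto = true;
        return proto;
    }

    private void maybeInitBuilder() {
        if (viaProto || builder == null) {
            // Rebuild the builder from the current proto before mutating.
            builder = new Builder(proto == null ? new Proto(0) : proto);
        }
        viaProto = false;
    }
}
```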
- YARN-2096.
Major bug reported by Anubhav Dhoot and fixed by Anubhav Dhoot
Race in TestRMRestart#testQueueMetricsOnRMRestart
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testQueueMetricsOnRMRestart fails randomly because of a race condition.
The test validates that metrics are incremented, but does not wait for all transitions to finish before checking the values.
It also resets metrics after kicking off recovery of the second RM. The metrics that need to be incremented race with this reset, causing the test to fail randomly.
We need to wait for the right transitions.
- YARN-2091.
Major task reported by Bikas Saha and fixed by Tsuyoshi OZAWA
Add more values to ContainerExitStatus and pass it from NM to RM and then to app masters
Currently, the AM cannot programmatically determine whether a task was killed due to excessive memory use. The NM kills it without passing this information in the container status back to the RM, so the AM cannot take any action. This JIRA tracks adding this exit status and passing it from the NM to the RM and then to the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM.
- YARN-2089.
Major improvement reported by Anubhav Dhoot and fixed by zhihai xu (scheduler)
FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations
We should mark QueuePlacementPolicy and QueuePlacementRule with audience annotations @Private @Unstable
- YARN-2075.
Major bug reported by Zhijie Shen and fixed by Kenji Kikushima
TestRMAdminCLI consistently fail on trunk and branch-2
{code}
Running org.apache.hadoop.yarn.client.TestRMAdminCLI
Tests run: 13, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 1.191 sec <<< FAILURE! - in org.apache.hadoop.yarn.client.TestRMAdminCLI
testTransitionToActive(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.082 sec <<< ERROR!
java.lang.UnsupportedOperationException: null
at java.util.AbstractList.remove(AbstractList.java:144)
at java.util.AbstractList$Itr.remove(AbstractList.java:360)
at java.util.AbstractCollection.remove(AbstractCollection.java:252)
at org.apache.hadoop.ha.HAAdmin.isOtherTargetNodeActive(HAAdmin.java:173)
at org.apache.hadoop.ha.HAAdmin.transitionToActive(HAAdmin.java:144)
at org.apache.hadoop.ha.HAAdmin.runCmd(HAAdmin.java:447)
at org.apache.hadoop.ha.HAAdmin.run(HAAdmin.java:380)
at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:318)
at org.apache.hadoop.yarn.client.TestRMAdminCLI.testTransitionToActive(TestRMAdminCLI.java:180)
testHelp(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.088 sec <<< FAILURE!
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at org.apache.hadoop.yarn.client.TestRMAdminCLI.testError(TestRMAdminCLI.java:366)
at org.apache.hadoop.yarn.client.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:307)
{code}
- YARN-2074.
Major sub-task reported by Vinod Kumar Vavilapalli and fixed by Jian He (resourcemanager)
Preemption of AM containers shouldn't count towards AM failures
One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM containers getting preempted shouldn't count towards AM failures and thus shouldn't eventually fail applications.
We should explicitly handle AM container preemption/kill as a separate issue and not count it towards the limit on AM failures.
- YARN-2073.
Critical bug reported by Karthik Kambatla and fixed by Karthik Kambatla (scheduler)
Fair Scheduler: Add a utilization threshold to prevent preempting resources when cluster is free
Preemption should kick in only when the currently available slots don't match the request.
- YARN-2072.
Major improvement reported by Nathan Roberts and fixed by Nathan Roberts (nodemanager , resourcemanager , webapp)
RM/NM UIs and webservices are missing vcore information
Change RM and NM UIs and webservices to include virtual cores.
- YARN-2071.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
Enforce more restricted permissions for the directory of Leveldb store
We need to enforce more restricted permissions for the directory of the Leveldb store, as we did for the filesystem generic history store.
- YARN-2065.
Major bug reported by Steve Loughran and fixed by Jian He
AM cannot create new containers after restart-NM token from previous attempt used
Slider AM Restart failing (SLIDER-34). The AM comes back up, but it cannot create new containers.
The Slider minicluster test {{TestKilledAM}} can replicate this reliably: it kills the AM, then kills a container while the AM is down, which triggers a reallocation of a container, leading to this failure.
- YARN-2061.
Minor improvement reported by Karthik Kambatla and fixed by Ray Chiang (resourcemanager)
Revisit logging levels in ZKRMStateStore
ZKRMStateStore has a few places where it is logging at the INFO level. We should change these to DEBUG or TRACE level messages.
- YARN-2059.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
Extend access control for admin acls
- YARN-2054.
Major bug reported by Karthik Kambatla and fixed by Karthik Kambatla (resourcemanager)
Better defaults for YARN ZK configs for retries and retry-interval when HA is enabled
Currently, we have the following default values:
# yarn.resourcemanager.zk-num-retries - 500
# yarn.resourcemanager.zk-retry-interval-ms - 2000
This leads to a cumulative 1,000 seconds (500 retries × 2 seconds) before the RM gives up trying to connect to ZK.
- YARN-2052.
Major sub-task reported by Tsuyoshi OZAWA and fixed by Tsuyoshi OZAWA (resourcemanager)
ContainerId creation after work preserving restart is broken
Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high-churn activity, the RM does not store the sequence number per app, so after restart it does not know what the sequence number for new allocations should be.
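One way to make post-restart ids safe without persisting per-app sequence numbers is to fold a small RM-persisted "epoch", bumped on every restart, into new container ids. The sketch below illustrates the idea only; the bit layout is illustrative, not YARN's actual encoding:

```java
// Sketch of epoch-based container ids: the RM persists one small counter
// (the epoch) and increments it on restart. New ids embed the epoch, so a
// post-restart id can never collide with a pre-restart one even though the
// per-app sequence numbers start over. Bit widths here are illustrative.
public class ContainerIdEpoch {
    private static final int SEQ_BITS = 40;

    public static long newContainerId(long epoch, long sequenceNumber) {
        return (epoch << SEQ_BITS) | sequenceNumber;
    }

    public static long epochOf(long containerId) {
        return containerId >>> SEQ_BITS;
    }

    public static long sequenceOf(long containerId) {
        return containerId & ((1L << SEQ_BITS) - 1);
    }
}
```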
- YARN-2050.
Major bug reported by Ming Ma and fixed by Ming Ma
Fix LogCLIHelpers to create the correct FileContext
LogCLIHelpers calls FileContext.getFileContext() without any parameters. Thus the FileContext created isn't necessarily the FileContext for remote log.
- YARN-2049.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
Delegation token stuff for the timeline sever
- YARN-2036.
Minor bug reported by Karthik Kambatla and fixed by Ray Chiang (documentation)
Document yarn.resourcemanager.hostname in ClusterSetup
ClusterSetup doesn't talk about yarn.resourcemanager.hostname - most people should just be able to use that directly.
- YARN-2030.
Major improvement reported by Junping Du and fixed by Binglin Chang
Use StateMachine to simplify handleStoreEvent() in RMStateStore
Now the logic to handle different store events in handleStoreEvent() is as following:
{code}
if (event.getType().equals(RMStateStoreEventType.STORE_APP)
|| event.getType().equals(RMStateStoreEventType.UPDATE_APP)) {
...
if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
...
} else {
...
}
...
try {
if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
...
} else {
...
}
}
...
} else if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)
|| event.getType().equals(RMStateStoreEventType.UPDATE_APP_ATTEMPT)) {
...
if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) {
...
} else {
...
}
...
if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) {
...
} else {
...
}
}
...
} else if (event.getType().equals(RMStateStoreEventType.REMOVE_APP)) {
...
} else {
...
}
}
{code}
This not only confuses people but also easily leads to mistakes. We could leverage a state machine to simplify this, even without state transitions.
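One way to see the simplification: replace the nested equals() checks with a per-event-type handler table. This sketch uses an EnumMap rather than YARN's StateMachine classes, and all names are illustrative:

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Consumer;

// Sketch of dispatching store events through a handler table instead of a
// long if/else chain on event.getType(). Each event type is registered once,
// so no branch can silently fall through to the wrong handler.
public class StoreEventDispatch {
    enum EventType { STORE_APP, UPDATE_APP, STORE_APP_ATTEMPT, UPDATE_APP_ATTEMPT, REMOVE_APP }

    private final Map<EventType, Consumer<String>> handlers =
        new EnumMap<>(EventType.class);
    final StringBuilder log = new StringBuilder(); // visible for testing

    public StoreEventDispatch() {
        handlers.put(EventType.STORE_APP, id -> log.append("store:").append(id));
        handlers.put(EventType.UPDATE_APP, id -> log.append("update:").append(id));
        handlers.put(EventType.REMOVE_APP, id -> log.append("remove:").append(id));
        // ... one entry per event type, instead of nested equals() checks
    }

    public void handleStoreEvent(EventType type, String appId) {
        Consumer<String> h = handlers.get(type);
        if (h == null) {
            throw new IllegalArgumentException("Unknown event type: " + type);
        }
        h.accept(appId);
    }
}
```

The actual patch used YARN's StateMachine factory, which additionally validates transitions; the table above only shows why dispatch-by-type reads better than cascaded conditionals.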
- YARN-2022.
Major sub-task reported by Sunil G and fixed by Sunil G (resourcemanager)
Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
Cluster Size = 16GB [2NM's]
Queue A Capacity = 50%
Queue B Capacity = 50%
Consider there are 3 applications running in Queue A which has taken the full cluster capacity.
J1 = 2GB AM + 1GB * 4 Maps
J2 = 2GB AM + 1GB * 4 Maps
J3 = 2GB AM + 1GB * 2 Maps
Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ].
Currently in this scenario, job J3 will get killed, including its AM.
It would be better if AMs could be given the lowest preemption priority among multiple applications. In this same scenario, map tasks from J3 and J2 could be preempted instead.
Later, when the cluster is free, maps can be allocated to these jobs.
- YARN-2017.
Major sub-task reported by Jian He and fixed by Jian He (resourcemanager)
Merge some of the common lib code in schedulers
A bunch of the same code is repeated among the schedulers, e.g. between FiCaSchedulerNode and FSSchedulerNode. It would be good to merge and share it in a common base.
- YARN-2012.
Major improvement reported by Ashwin Shankar and fixed by Ashwin Shankar (scheduler)
Fair Scheduler: allow default queue placement rule to take an arbitrary queue
Currently the 'default' rule in the queue placement policy, if applied, puts the app in the root.default queue. It would be great if we could make the 'default' rule optionally point to a different queue as the default queue.
This default queue can be a leaf queue, or it can also be a parent queue if the 'default' rule is nested inside a nestedUserQueue rule (YARN-1864).
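With this change, the allocation file's placement policy can point the 'default' rule at an arbitrary queue via a queue attribute. A plausible fragment (the queue name is illustrative):

```xml
<queuePlacementPolicy>
  <rule name="specified"/>
  <!-- Apps that match no earlier rule land in root.adhoc instead of root.default -->
  <rule name="default" queue="root.adhoc"/>
</queuePlacementPolicy>
```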
- YARN-2011.
Trivial test reported by Chen He and fixed by Chen He
Fix typo and warning in TestLeafQueue
{code}
a.assignContainers(clusterResource, node_0);
assertEquals(2*GB, a.getUsedResources().getMemory());
assertEquals(2*GB, app_0.getCurrentConsumption().getMemory());
assertEquals(0*GB, app_1.getCurrentConsumption().getMemory());
assertEquals(0*GB, app_0.getHeadroom().getMemory()); // User limit = 2G
assertEquals(0*GB, app_0.getHeadroom().getMemory()); // User limit = 2G
// Again one to user_0 since he hasn't exceeded user limit yet
a.assignContainers(clusterResource, node_0);
assertEquals(3*GB, a.getUsedResources().getMemory());
assertEquals(2*GB, app_0.getCurrentConsumption().getMemory());
assertEquals(1*GB, app_1.getCurrentConsumption().getMemory());
assertEquals(0*GB, app_0.getHeadroom().getMemory()); // 3G - 2G
assertEquals(0*GB, app_0.getHeadroom().getMemory()); // 3G - 2G
{code}
- YARN-1987.
Major improvement reported by Jason Lowe and fixed by Jason Lowe
Wrapper for leveldb DBIterator to aid in handling database exceptions
Per discussions in YARN-1984 and MAPREDUCE-5652, it would be nice to have a utility wrapper around leveldb's DBIterator to translate the raw RuntimeExceptions it can throw into DBExceptions to make it easier to handle database errors while iterating.
- YARN-1982.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
Rename the daemon name to timelineserver
Nowadays, it's confusing that we call the new component timeline server, but we use
{code}
yarn historyserver
yarn-daemon.sh start historyserver
{code}
to start the daemon.
Before the confusion keeps being propagated, we'd better modify the command line ASAP.
- YARN-1981.
Major bug reported by Jason Lowe and fixed by Jason Lowe (resourcemanager)
Nodemanager version is not updated when a node reconnects
When a nodemanager is quickly restarted and happens to change versions during the restart (e.g.: rolling upgrade scenario) the NM version as reported by the RM is not updated.
- YARN-1977.
Minor test reported by Junping Du and fixed by Junping Du
Add tests on getApplicationRequest with filtering start time range
There is no unit test to verify whether a request with a start time range retrieves the right application list; we should add one.
- YARN-1970.
Minor test reported by Chris Nauroth and fixed by Chris Nauroth
Prepare YARN codebase for JUnit 4.11.
HADOOP-10503 upgrades the entire Hadoop repo to use JUnit 4.11. Some of the YARN code needs some minor updates to fix deprecation warnings and test isolation problems before the upgrade.
- YARN-1957.
Major sub-task reported by Carlo Curino and fixed by Carlo Curino (resourcemanager)
ProportionalCapacityPreemptionPolicy handling of corner cases...
The current version of ProportionalCapacityPreemptionPolicy should be improved to deal with the following two scenarios:
1) when rebalancing over-capacity allocations, it potentially preempts without considering the maxCapacity constraints of a queue (i.e., preempting possibly more than strictly necessary)
2) a zero-capacity queue is preempted even if there is no demand (consistent with the old use of zero capacity to disable queues)
The proposed patch fixes both issues and introduces a few new test cases.
- YARN-1940.
Major bug reported by Kihwal Lee and fixed by Rushabh S Shah
deleteAsUser() terminates early without deleting more files on error
In container-executor.c, delete_path() returns early when unlink() against a file or a symlink fails. We have seen many cases of the error being ENOENT, which can safely be ignored during delete.
This is what we saw recently: An app mistakenly created a large number of files in the local directory and the deletion service failed to delete a significant portion of them due to this bug. Repeatedly hitting this on the same node led to exhaustion of inodes in one of the partitions.
Besides ignoring ENOENT, delete_path() can simply skip the failed entry and continue in some cases, rather than aborting and leaving files behind.
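The container-executor fix is in C, but the idea is language-independent. A hedged Java sketch of "treat a missing file as success, skip other failures and continue" (method names are mine, not Hadoop's):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Java sketch of the delete_path() fix: a missing entry (the C code's
// ENOENT; NoSuchFileException here) is not an error during deletion, and
// other failures should skip the entry and continue rather than abort,
// so one bad file no longer strands everything after it.
public class RobustDelete {
    /** Returns the number of entries that could not be deleted. */
    public static int deleteAll(Iterable<Path> paths) {
        int failures = 0;
        for (Path p : paths) {
            try {
                Files.delete(p);
            } catch (NoSuchFileException e) {
                // Already gone: exactly the outcome delete wanted; ignore.
            } catch (IOException e) {
                failures++;  // remember the failure...
                // ...but keep going so the remaining files are still removed.
            }
        }
        return failures;
    }
}
```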
- YARN-1938.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
Kerberos authentication for the timeline server
- YARN-1937.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
Add entity-level access control of the timeline data for owners only
- YARN-1936.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
Secured timeline client
TimelineClient should be able to talk to the timeline server with kerberos authentication or delegation token
- YARN-1931.
Blocker bug reported by Thomas Graves and fixed by Sandy Ryza (applications)
Private API change in YARN-1824 in 2.4 broke compatibility with previous releases
YARN-1824 broke compatibility with previous 2.x releases by changing the APIs in org.apache.hadoop.yarn.util.Apps.{setEnvFromInputString,addToEnvironment}. The old API should be added back.
This affects any ApplicationMasters that were using this API. It also breaks previously built MapReduce libraries from working with the new YARN release, as MR uses this API.
- YARN-1923.
Major improvement reported by Anubhav Dhoot and fixed by Anubhav Dhoot (scheduler)
Make FairScheduler resource ratio calculations terminate faster
In the fair scheduler, computing shares continues until the iterations are complete, even when there is a perfect match between the resource shares and total resources. This is because the binary search checks only less-than or greater-than, not equality. Add an early-termination condition for when they are equal.
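A sketch of the early-exit idea on a toy version of the share computation; the helper below only stands in for the scheduler's real resource function, and all names are illustrative:

```java
// Sketch of the early-termination idea: the fair-share computation
// binary-searches for a weight-to-resource multiplier, and previously always
// ran its full iteration budget because the loop only compared less/greater.
// Checking for an exact hit lets it stop as soon as shares match resources.
public class BinarySearchShares {
    // Stand-in for the scheduler's resource-used-at-this-ratio function.
    static int resourceUsedWithRatio(double ratio, int[] weights) {
        int total = 0;
        for (int w : weights) total += (int) (w * ratio);
        return total;
    }

    /** Returns the number of iterations actually used. */
    public static int solve(int[] weights, int totalResource, int maxIterations) {
        double lo = 0.0, hi = 1_000_000.0;
        for (int i = 1; i <= maxIterations; i++) {
            double mid = (lo + hi) / 2;
            int used = resourceUsedWithRatio(mid, weights);
            if (used == totalResource) {
                return i;                 // early exit: perfect match found
            } else if (used < totalResource) {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        return maxIterations;             // budget exhausted without exact match
    }
}
```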
- YARN-1913.
Major bug reported by bc Wong and fixed by Wei Yan (scheduler)
With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs.
One solution is for the scheduler to limit the resources taken up by AMs, as a percentage of total cluster resources, via a "maxApplicationMasterShare" config.
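A sketch of such a guard; the method signature, field names, and MB units are illustrative, not the FairScheduler implementation:

```java
// Sketch of a "max AM share" guard: before launching another AM, check that
// total AM resource usage plus the new AM's demand stays within a configured
// fraction of cluster memory. A negative share disables the check.
public class MaxAMShareCheck {
    public static boolean canRunAM(long amUsedMb, long amDemandMb,
                                   long clusterMb, double maxAMShare) {
        if (maxAMShare < 0) {
            return true;                 // negative value disables the limit
        }
        return amUsedMb + amDemandMb <= (long) (clusterMb * maxAMShare);
    }
}
```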
- YARN-1889.
Minor improvement reported by Hong Zhiguo and fixed by Hong Zhiguo (scheduler)
In Fair Scheduler, avoid creating objects on each call to AppSchedulable comparator
In the fair scheduler, each scheduling attempt performs a full sort of the List of AppSchedulables, which invokes the Comparator.compare method many times. Both FairShareComparator and DRFComparator call AppSchedulable.getWeights and AppSchedulable.getPriority. A new ResourceWeights object is allocated on each call of getWeights, and the same goes for getPriority. This puts a lot of pressure on the GC because these methods are called very frequently.
The test case below shows the improvement in performance and GC behaviour. The results show that the GC pressure during NodeUpdate processing is reduced by half with this patch.
The code to show the improvement: (Add it to TestFairScheduler.java)
{code}
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public void printGCStats() {
  long totalGarbageCollections = 0;
  long garbageCollectionTime = 0;
  for (GarbageCollectorMXBean gc :
      ManagementFactory.getGarbageCollectorMXBeans()) {
    long count = gc.getCollectionCount();
    if (count >= 0) {
      totalGarbageCollections += count;
    }
    long time = gc.getCollectionTime();
    if (time >= 0) {
      garbageCollectionTime += time;
    }
  }
  System.out.println("Total Garbage Collections: "
      + totalGarbageCollections);
  System.out.println("Total Garbage Collection Time (ms): "
      + garbageCollectionTime);
}

@Test
public void testImpactOnGC() throws Exception {
  scheduler.reinitialize(conf, resourceManager.getRMContext());

  // Add nodes
  int numNode = 10000;
  for (int i = 0; i < numNode; ++i) {
    String host = String.format("192.1.%d.%d", i / 256, i % 256);
    RMNode node =
        MockNodes.newNodeInfo(1, Resources.createResource(1024 * 64), i, host);
    NodeAddedSchedulerEvent nodeEvent = new NodeAddedSchedulerEvent(node);
    scheduler.handle(nodeEvent);
    assertEquals(1024 * 64 * (i + 1), scheduler.getClusterCapacity().getMemory());
  }
  assertEquals(numNode, scheduler.getNumClusterNodes());
  assertEquals(1024 * 64 * numNode, scheduler.getClusterCapacity().getMemory());

  // Add apps, each app has 100 containers.
  int minReqSize =
      FairSchedulerConfiguration.DEFAULT_RM_SCHEDULER_INCREMENT_ALLOCATION_MB;
  int numApp = 8000;
  int priority = 1;
  for (int i = 1; i < numApp + 1; ++i) {
    ApplicationAttemptId attemptId = createAppAttemptId(i, 1);
    AppAddedSchedulerEvent appAddedEvent = new AppAddedSchedulerEvent(
        attemptId.getApplicationId(), "queue1", "user1");
    scheduler.handle(appAddedEvent);
    AppAttemptAddedSchedulerEvent attemptAddedEvent =
        new AppAttemptAddedSchedulerEvent(attemptId, false);
    scheduler.handle(attemptAddedEvent);
    createSchedulingRequestExistingApplication(minReqSize * 2, 1, priority, attemptId);
  }
  scheduler.update();
  assertEquals(numApp, scheduler.getQueueManager().getLeafQueue("queue1", true)
      .getRunnableAppSchedulables().size());

  System.out.println("GC stats before NodeUpdate processing:");
  printGCStats();

  int hb_num = 5000;
  long start = System.nanoTime();
  for (int i = 0; i < hb_num; ++i) {
    String host = String.format("192.1.%d.%d", i / 256, i % 256);
    RMNode node =
        MockNodes.newNodeInfo(1, Resources.createResource(1024 * 64), 5000, host);
    NodeUpdateSchedulerEvent nodeEvent = new NodeUpdateSchedulerEvent(node);
    scheduler.handle(nodeEvent);
  }
  long end = System.nanoTime();
  System.out.printf("processing time for a NodeUpdate in average: %d us\n",
      (end - start) / (hb_num * 1000));

  System.out.println("GC stats after NodeUpdate processing:");
  printGCStats();
}
{code}
- YARN-1885.
Major bug reported by Arpit Gupta and fixed by Wangda Tan
RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts
During our HA testing we have seen cases where YARN application logs are not available through the CLI, but I can look at AM logs through the UI. The RM was also being restarted in the background while the application was running.
- YARN-1877.
Critical sub-task reported by Karthik Kambatla and fixed by Robert Kanter (resourcemanager)
Document yarn.resourcemanager.zk-auth and its scope
- YARN-1870.
Minor improvement reported by Ted Yu and fixed by Fengdong Yu (resourcemanager)
FileInputStream is not closed in ProcfsBasedProcessTree#constructProcessSMAPInfo()
{code}
List<String> lines = IOUtils.readLines(new FileInputStream(file));
{code}
FileInputStream is not closed.
- YARN-1868.
Major bug reported by Chuan Liu and fixed by Chuan Liu (webapp)
YARN status web ui does not show correctly in IE 11
The YARN status web UI does not show correctly in IE 11. The drop-down menus for app entries are not shown. Also, the navigation menu displays incorrectly.
- YARN-1865.
Minor bug reported by Remus Rusanu and fixed by Remus Rusanu (nodemanager)
ShellScriptBuilder does not check for some error conditions
The WindowsShellScriptBuilder does not check for commands exceeding the Windows maximum shell command line length (8191 characters).
Neither the Unix nor the Windows script builder checks for error conditions after mkdir or link.
WindowsShellScriptBuilder's mkdir is not safe with regard to paths containing spaces.
- YARN-1864.
Major new feature reported by Ashwin Shankar and fixed by Ashwin Shankar (scheduler)
Fair Scheduler Dynamic Hierarchical User Queues
In the Fair Scheduler, we want to be able to create user queues under any parent queue in the hierarchy. For example, say user1 submits a job to a parent queue called root.allUserQueues; we want to be able to create a new queue called root.allUserQueues.user1 and run user1's job in it. Any further jobs submitted by this user to root.allUserQueues will run in this newly created root.allUserQueues.user1.
This is very similar to the 'user-as-default' feature in the Fair Scheduler, which creates user queues under the root queue. But we want the ability to create user queues under ANY parent queue.
Why do we want this?
1. Preemption: these dynamically created user queues can preempt each other if their fair share is not met, so there is fairness among users.
User queues can also preempt other non-user leaf queues if below their fair share.
2. Allocation to user queues: we want all the (ad hoc) user queries to consume only a fraction of the resources in the shared cluster. With this feature, we could do that by giving a fair share to the parent user queue, which is then redistributed to all the dynamically created user queues.
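A plausible allocation-file fragment for such nested user queues, based on the nestedUserQueue rule this issue introduces (the particular nesting shown is illustrative):

```xml
<queuePlacementPolicy>
  <!-- Place each app in a per-user queue under the queue picked by the nested rule -->
  <rule name="nestedUserQueue">
    <rule name="specified" create="false"/>
  </rule>
  <rule name="default"/>
</queuePlacementPolicy>
```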
- YARN-1845.
Major improvement reported by Rushabh S Shah and fixed by Rushabh S Shah
Elapsed time for failed tasks that never started is wrong
The elapsed time for tasks in a failed job that were never started can be way off. It looks like we're marking the start time as the beginning of the epoch (i.e., start time = -1), but the finish time is when the task was marked as failed when the whole job failed. That causes the calculated elapsed time of the task to be a ridiculous number of hours.
Tasks that fail without any attempts shouldn't have start/finish/elapsed times.
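A sketch of the guard; the -1 sentinel mirrors the description above, and the method and constant names are mine:

```java
// Sketch of the elapsed-time fix: a task that never started carries the
// start-time sentinel (-1), so computing finish - start yields an absurd
// duration. Guard on the sentinel and report no elapsed time instead.
public class ElapsedTime {
    public static final long NOT_SET = -1;

    public static long elapsed(long startTime, long finishTime) {
        if (startTime == NOT_SET || finishTime == NOT_SET) {
            return NOT_SET;  // never started (or never finished): no elapsed time
        }
        return finishTime - startTime;
    }
}
```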
- YARN-1833.
Major bug reported by Mit Desai and fixed by Mit Desai
TestRMAdminService Fails in trunk and branch-2 : Assert Fails due to different count of UserGroups for currentUser()
In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed.
{code}
Assert.assertTrue(groupWithInit.size() != groupBefore.size());
{code}
As the assert takes the default groups for groupWithInit (which in my case are users, sshusers, and wheel), it fails because the sizes of groupWithInit and groupBefore are the same.
I do not think we need this assert here. Moreover, we are also checking that groupWithInit does not have the userGroups that are in groupBefore, so removing the assert should not be harmful.
- YARN-1790.
Major bug reported by bc Wong and fixed by bc Wong
Fair Scheduler UI not showing apps table
There is a running job, which shows up in the summary table in the FairScheduler UI, the queue display, etc. Just not in the apps table at the bottom.
- YARN-1784.
Minor bug reported by Karthik Kambatla and fixed by Robert Kanter (resourcemanager)
TestContainerAllocation assumes CapacityScheduler
TestContainerAllocation assumes CapacityScheduler
- YARN-1771.
Critical improvement reported by Sangjin Lee and fixed by Sangjin Lee (nodemanager)
many getFileStatus calls made from node manager for localizing a public distributed cache resource
We're observing that the getFileStatus calls are putting a fair amount of load on the NameNode as part of checking the public-ness of a resource being localized into the public cache.
We see 7 getFileStatus calls made for each of these resources. We should look into reducing the number of calls to the NameNode. One example:
{noformat}
2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ...
2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ...
2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ...
2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/ ...
2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
{noformat}
- YARN-1757.
Major sub-task reported by Jason Lowe and fixed by Jason Lowe (nodemanager)
NM Recovery. Auxiliary service support.
There needs to be a mechanism for communicating to auxiliary services whether nodemanager recovery is enabled and where they should store their state.
- YARN-1751.
Major improvement reported by Ming Ma and fixed by Ming Ma (nodemanager)
Improve MiniYarnCluster for log aggregation testing
MiniYarnCluster specifies an individual remote log aggregation root dir for each NM, so test code that uses MiniYarnCluster can't get the value of the log aggregation root dir. The following code isn't necessary in MiniYarnCluster:
{code}
File remoteLogDir =
    new File(testWorkDir, MiniYARNCluster.this.getName()
        + "-remoteLogDir-nm-" + index);
remoteLogDir.mkdir();
config.set(YarnConfiguration.NM_REMOTE_APP_LOG_DIR,
    remoteLogDir.getAbsolutePath());
{code}
In LogCLIHelpers.java, dumpAllContainersLogs should pass its conf object to the FileContext.getFileContext() call.
- YARN-1736.
Minor bug reported by Sandy Ryza and fixed by Naren Koneru (scheduler)
FS: AppSchedulable.assignContainer's priority argument is redundant
The ResourceRequest includes a Priority, so there is no need to pass in a Priority alongside it.
- YARN-1726.
Blocker bug reported by Wei Yan and fixed by Wei Yan
ResourceSchedulerWrapper broken due to AbstractYarnScheduler
The YARN scheduler simulator failed when running Fair Scheduler, due to AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper should inherit AbstractYarnScheduler, instead of implementing ResourceScheduler interface directly.
- YARN-1718.
Major bug reported by Sandy Ryza and fixed by Sandy Ryza (scheduler)
Fix a couple isTerminals in Fair Scheduler queue placement rules
SecondaryGroupExistingQueue and Default are incorrect
- YARN-1713.
Blocker sub-task reported by Varun Vasudev and fixed by Varun Vasudev
Implement getnewapplication and submitapp as part of RM web service
- YARN-1702.
Major sub-task reported by Varun Vasudev and fixed by Varun Vasudev
Expose kill app functionality as part of RM web services
Expose functionality to kill an app via the ResourceManager web services API.
- YARN-1678.
Major bug reported by Sandy Ryza and fixed by Sandy Ryza (scheduler)
Fair scheduler gabs incessantly about reservations
Come on FS. We really don't need to know every time a node with a reservation on it heartbeats.
{code}
2014-01-29 03:48:16,043 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Trying to fulfill reservation for application appattempt_1390547864213_0347_000001 on node: host: a2330.halxg.cloudera.com:8041 #containers=8 available=<memory:0, vCores:8> used=<memory:8192, vCores:8>
2014-01-29 03:48:16,043 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable: Making reservation: node=a2330.halxg.cloudera.com app_id=application_1390547864213_0347
2014-01-29 03:48:16,043 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: Application application_1390547864213_0347 reserved container container_1390547864213_0347_01_000003 on node host: a2330.halxg.cloudera.com:8041 #containers=8 available=<memory:0, vCores:8> used=<memory:8192, vCores:8>, currently has 6 at priority 0; currentReservation 6144
2014-01-29 03:48:16,044 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Updated reserved container container_1390547864213_0347_01_000003 on node host: a2330.halxg.cloudera.com:8041 #containers=8 available=<memory:0, vCores:8> used=<memory:8192, vCores:8> for application org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp@1cb01d20
{code}
- YARN-1670.
Critical bug reported by Thomas Graves and fixed by Mit Desai
aggregated log writer can write more log data than it says is the log length
We have seen exceptions when using 'yarn logs' to read log files:
{code}
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:441)
at java.lang.Long.parseLong(Long.java:483)
at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.readAContainerLogsForALogType(AggregatedLogFormat.java:518)
at org.apache.hadoop.yarn.logaggregation.LogDumper.dumpAContainerLogs(LogDumper.java:178)
at org.apache.hadoop.yarn.logaggregation.LogDumper.run(LogDumper.java:130)
at org.apache.hadoop.yarn.logaggregation.LogDumper.main(LogDumper.java:246)
{code}
We traced it down to the reader trying to read the file type of the next file, but what it reads is still log data from the previous file. What happened was that the Log Length was written as a certain size, but the log data was actually longer than that.
Inside the write() routine in LogValue, it first writes what the logfile length is, but then when it goes to write the log itself it just goes to the end of the file. There is a race condition here: if someone is still writing to the file when it is aggregated, the length written could be too small.
We should have the write() routine stop when it has written whatever it said the length was. It would be nice if we could somehow tell the user it might be truncated, but I'm not sure of a good way to do this.
We also noticed a bug in readAContainerLogsForALogType, where it uses an int for curRead whereas it should use a long:
{code}
while (len != -1 && curRead < fileLength) {
{code}
This isn't actually a problem right now, as it looks like the underlying decoder does the right thing and the len condition exits the loop.
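A small demonstration of why the int matters for multi-GB aggregated logs; this is an illustration of the overflow, not the Hadoop code itself:

```java
// Why an int curRead is wrong for file offsets: once the aggregated log
// grows past Integer.MAX_VALUE bytes, an int accumulator wraps negative,
// so a comparison like `curRead < fileLength` silently misbehaves.
public class IntOverflowDemo {
    public static int addAsInt(int curRead, int chunk) {
        return curRead + chunk;          // wraps past 2^31 - 1
    }

    public static long addAsLong(long curRead, long chunk) {
        return curRead + chunk;          // safe for multi-GB lengths
    }
}
```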
- YARN-1561.
Minor improvement reported by Junping Du and fixed by Chen He (scheduler)
Fix a generic type warning in FairScheduler
The Comparator below should be parameterized with a type:
{code}
private Comparator nodeAvailableResourceComparator =
    new NodeAvailableResourceComparator();
{code}
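The fix is simply to parameterize the Comparator. A sketch with a stand-in node type (FSSchedulerNode and the real comparator's ordering are not reproduced here; ordering by available memory is an assumption for illustration):

```java
import java.util.Comparator;

// Parameterizing the Comparator removes the raw-type warning and lets
// compare() take the node type directly, with no casts.
public class TypedComparator {
    static class Node {
        final int availableMb;
        Node(int availableMb) { this.availableMb = availableMb; }
    }

    // Typed: Comparator<Node> instead of a raw Comparator.
    static final Comparator<Node> NODE_AVAILABLE_RESOURCE_COMPARATOR =
        new Comparator<Node>() {
            @Override
            public int compare(Node a, Node b) {
                // Most available resources first (descending order).
                return Integer.compare(b.availableMb, a.availableMb);
            }
        };

    public static int compare(int a, int b) {
        return NODE_AVAILABLE_RESOURCE_COMPARATOR.compare(new Node(a), new Node(b));
    }
}
```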
- YARN-1550.
Critical bug reported by caolong and fixed by Anubhav Dhoot (scheduler)
NPE in FairSchedulerAppsBlock#render
Three steps:
1. Debug at RMAppManager#submitApplication after the code:
{code}
if (rmContext.getRMApps().putIfAbsent(applicationId, application) != null) {
  String message = "Application with id " + applicationId
      + " is already present! Cannot add a duplicate!";
  LOG.warn(message);
  throw RPCUtil.getRemoteException(message);
}
{code}
2. Submit one application: hadoop jar ~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar sleep -Dhadoop.job.ugi=test2,#111111 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 -r 1
3. Go to the page http://ip:50030/cluster/scheduler and find a 500 error!
The log:
{noformat}
2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/scheduler
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
....
Caused by: java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.webapp.FairSchedulerAppsBlock.render(FairSchedulerAppsBlock.java:96)
at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
{noformat}
- YARN-1520.
Major bug reported by Chen He and fixed by Chen He
update capacity scheduler docs to include necessary parameters
- YARN-1479.
Major improvement reported by Kendall Thrapp and fixed by Chen He
Invalid NaN values in Hadoop REST API JSON response
I've been occasionally coming across instances where Hadoop's Cluster Applications REST API (http://hadoop.apache.org/docs/r0.23.6/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API) has returned JSON that PHP's json_decode function failed to parse. I've tracked the syntax error down to the presence of the unquoted word NaN appearing as a value in the JSON. For example:
{code}
"progress":NaN,
{code}
NaN is not part of the JSON spec, so its presence renders the whole JSON string invalid. Hadoop needs to return something other than NaN in this case -- perhaps an empty string or the quoted string "NaN".
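To see why the value ends up in the response, consider how a float serializes; the `sanitizeProgress` guard below is a hypothetical sketch of one possible fix (the quoted-string alternative mentioned above would also work), not the actual patch.

```java
// Hypothetical sketch: a raw float concatenated into JSON yields the
// unquoted token NaN, which no strict JSON parser accepts. A simple
// guard maps NaN to a parseable default before serialization.
public class SafeProgress {
  public static float sanitizeProgress(float progress) {
    // NaN is not legal JSON, so substitute a parseable default.
    return Float.isNaN(progress) ? 0.0f : progress;
  }

  public static void main(String[] args) {
    System.out.println("\"progress\":" + Float.NaN + ",");                    // "progress":NaN,
    System.out.println("\"progress\":" + sanitizeProgress(Float.NaN) + ",");  // "progress":0.0,
  }
}
```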
- YARN-1474.
Major sub-task reported by Sandy Ryza and fixed by Tsuyoshi OZAWA (scheduler)
Make schedulers services
Schedulers currently have a reinitialize method but no start and stop. Fitting them into the YARN service model would make things more coherent.
- YARN-1429.
Trivial bug reported by Sandy Ryza and fixed by Jarek Jarcec Cecho (client)
*nix: Allow a way for users to augment classpath of YARN daemons
YARN_CLASSPATH is referenced in the comments in ./hadoop-yarn-project/hadoop-yarn/bin/yarn and ./hadoop-yarn-project/hadoop-yarn/bin/yarn.cmd, but doesn't do anything.
- YARN-1424.
Minor improvement reported by Sandy Ryza and fixed by Ray Chiang (resourcemanager)
RMAppAttemptImpl should return the DummyApplicationResourceUsageReport for all invalid accesses
RMAppImpl has a DUMMY_APPLICATION_RESOURCE_USAGE_REPORT to return when the caller of createAndGetApplicationReport doesn't have access.
RMAppAttemptImpl should have something similar for getApplicationResourceUsageReport.
It also might make sense to put the dummy report into ApplicationResourceUsageReport and allow both to use it.
A test would also be useful to verify that RMAppAttemptImpl#getApplicationResourceUsageReport doesn't return null if the scheduler doesn't have a report to return.
- YARN-1408.
Major sub-task reported by Sunil G and fixed by Sunil G (resourcemanager)
Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
Capacity preemption is enabled as follows.
* yarn.resourcemanager.scheduler.monitor.enable=true
* yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
Queue = a,b
Capacity of Queue A = 80%
Capacity of Queue B = 20%
Step 1: Assign a big jobA on queue a which uses full cluster capacity
Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity
The jobA tasks using queue b's capacity were preempted and killed.
This caused the following problem:
1. A new container was allocated for jobA in Queue A after a node update from an NM.
2. This container was immediately preempted, as per the preemption policy.
An "ACQUIRED at KILLED" invalid state exception was raised when the next AM heartbeat reached the RM:
{noformat}
ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED
{noformat}
This also caused the task to time out after 30 minutes, as the container had already been killed by preemption:
{noformat}
attempt_1380289782418_0003_m_000000_0 Timed out after 1800 secs
{noformat}
- YARN-1368.
Major sub-task reported by Bikas Saha and fixed by Jian He
Common work to re-populate containers’ state into scheduler
YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster.
- YARN-1366.
Major sub-task reported by Bikas Saha and fixed by Rohith (resourcemanager)
AM should implement Resync with the ApplicationMasterService instead of shutting down
The ApplicationMasterService currently sends a resync response, to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0, after which the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed as normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once.
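The proposed resync semantics can be modeled in a few lines. `AmResyncModel` and all its method names are invented for illustration and do not correspond to the actual AMRMClient code; the point is only the state transition: on resync, reset the sequence number and resend the full outstanding request rather than only the latest delta.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the proposed AM behavior: on a resync response,
// reset the allocate sequence number to 0 and re-send the entire
// outstanding request instead of shutting down.
public class AmResyncModel {
  private int responseId = 0;
  private final List<String> outstanding = new ArrayList<>();
  private final List<String> pendingDelta = new ArrayList<>();

  public void addRequest(String request) {
    outstanding.add(request);
    pendingDelta.add(request);
  }

  // Returns the payload sent on the next allocate heartbeat.
  public List<String> nextAllocate(boolean resyncOrdered) {
    if (resyncOrdered) {
      responseId = 0;                       // reset the RPC sequence number
      pendingDelta.clear();
      return new ArrayList<>(outstanding);  // resend everything outstanding
    }
    responseId++;
    List<String> delta = new ArrayList<>(pendingDelta);
    pendingDelta.clear();
    return delta;  // normal heartbeat: only new asks
  }

  public int getResponseId() { return responseId; }
}
```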
- YARN-1365.
Major sub-task reported by Bikas Saha and fixed by Anubhav Dhoot (resourcemanager)
ApplicationMasterService to allow Register of an app that was running before restart
For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit.
- YARN-1362.
Major sub-task reported by Jason Lowe and fixed by Jason Lowe (nodemanager)
Distinguish between nodemanager shutdown for decommission vs shutdown for restart
When a nodemanager shuts down it needs to determine if it is likely to be restarted. If a restart is likely then it needs to preserve container directories, logs, distributed cache entries, etc. If it is being shutdown more permanently (e.g.: like a decommission) then the nodemanager should cleanup directories and logs.
- YARN-1339.
Major sub-task reported by Jason Lowe and fixed by Jason Lowe (nodemanager)
Recover DeletionService state upon nodemanager restart
- YARN-1338.
Major sub-task reported by Jason Lowe and fixed by Jason Lowe (nodemanager)
Recover localized resource cache state upon nodemanager restart
Today, when the nodemanager restarts, we clean up all the distributed cache files from disk. This is not ideal for two reasons:
* For work-preserving restart, we definitely want to keep them, as running containers are using them.
* Even for non-work-preserving restart, this is useful: we don't have to download them again if future tasks need them.
- YARN-1136.
Major bug reported by Karthik Kambatla and fixed by Chen He
Replace junit.framework.Assert with org.junit.Assert
There are several places where we are using junit.framework.Assert instead of org.junit.Assert.
{code}grep -rn "junit.framework.Assert" hadoop-yarn-project/ --include=*.java{code}
- YARN-738.
Major bug reported by Omkar Vinit Joshi and fixed by Ming Ma
TestClientRMTokens is failing irregularly while running all yarn tests
{code}
Running org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens
Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 16.787 sec <<< FAILURE!
testShortCircuitRenewCancel(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens) Time elapsed: 186 sec <<< ERROR!
java.lang.RuntimeException: getProxy
at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens$YarnBadRPC.getProxy(TestClientRMTokens.java:334)
at org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.getRmClient(RMDelegationTokenIdentifier.java:157)
at org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.renew(RMDelegationTokenIdentifier.java:102)
at org.apache.hadoop.security.token.Token.renew(Token.java:372)
at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:306)
at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancel(TestClientRMTokens.java:240)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)
{code}
- YARN-614.
Major improvement reported by Bikas Saha and fixed by Xuan Gong
Separate AM failures from hardware failure or YARN error and do not count them to AM retry count
Attempts can fail due to a large number of user errors and they should not be retried unnecessarily. The only reason YARN should retry an attempt is when the hardware fails or YARN has an error. NM failing, lost NM and NM disk errors are the hardware errors that come to mind.
- YARN-596.
Major bug reported by Sandy Ryza and fixed by Wei Yan (scheduler)
Use scheduling policies throughout the queue hierarchy to decide which containers to preempt
In the fair scheduler, containers are chosen for preemption in the following way:
* All containers of all apps in queues that are over their fair share are put in a list.
* The list is sorted by the priority at which each container was requested.
This means that an application can shield itself from preemption by requesting its containers at higher priorities, which doesn't really make sense.
Also, an application that is not over its fair share, but is in a queue that is over its fair share, is just as likely to have containers preempted as an application that is over its fair share.
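The shielding problem above can be shown in miniature. `Candidate` and both comparators are hypothetical illustrations, not FairScheduler code: sorting by request priority alone picks the under-share app's container as the victim, while folding fair-share status into the comparison does not.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of the flaw: a global sort by request priority
// lets an over-fair-share app shield its containers simply by requesting
// them at a higher priority.
public class PreemptionOrder {
  static class Candidate {
    final String app; final boolean appOverFairShare; final int priority;
    Candidate(String app, boolean over, int priority) {
      this.app = app; this.appOverFairShare = over; this.priority = priority;
    }
  }

  // Current behavior (sketch): lowest-priority containers are preempted first.
  static final Comparator<Candidate> byPriorityOnly =
      Comparator.comparingInt((Candidate c) -> c.priority);

  // One possible refinement: prefer containers of apps over their fair share.
  static final Comparator<Candidate> byFairShareThenPriority =
      Comparator.comparing((Candidate c) -> !c.appOverFairShare)
                .thenComparing(byPriorityOnly);

  public static void main(String[] args) {
    List<Candidate> l = new ArrayList<>();
    l.add(new Candidate("hog", true, 10));   // over fair share, high priority
    l.add(new Candidate("small", false, 1)); // under fair share, low priority
    l.sort(byPriorityOnly);
    System.out.println(l.get(0).app);  // small -- the wrong victim
    l.sort(byFairShareThenPriority);
    System.out.println(l.get(0).app);  // hog
  }
}
```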
- YARN-483.
Major improvement reported by Sandy Ryza and fixed by Akira AJISAKA (documentation)
Improve documentation on log aggregation in yarn-default.xml
The current documentation for log aggregation is
{code}
<property>
<description>Whether to enable log aggregation</description>
<name>yarn.log-aggregation-enable</name>
<value>false</value>
</property>
{code}
This could be improved to explain what enabling log aggregation does.
- MAPREDUCE-6002.
Major bug reported by Wangda Tan and fixed by Wangda Tan (task)
MR task should prevent report error to AM when process is shutting down
- MAPREDUCE-5952.
Blocker bug reported by Gera Shegalov and fixed by Gera Shegalov (mr-am , mrv2)
LocalContainerLauncher#renameMapOutputForReduce incorrectly assumes a single dir for mapOutIndex
- MAPREDUCE-5939.
Major bug reported by Kihwal Lee and fixed by Chen He
StartTime showing up as the epoch time in JHS UI after upgrade
- MAPREDUCE-5924.
Major bug reported by Yesha Vora and fixed by Zhijie Shen
Windows: Sort Job failed due to 'Invalid event: TA_COMMIT_PENDING at COMMIT_PENDING'
- MAPREDUCE-5920.
Minor bug reported by Uma Maheswara Rao G and fixed by Yi Liu (distcp , documentation)
Add Xattr option in DistCp docs
- MAPREDUCE-5900.
Major sub-task reported by Mayank Bansal and fixed by Mayank Bansal (applicationmaster , mr-am , mrv2)
Container preemption interpreted as task failures and eventually job failures
- MAPREDUCE-5899.
Major improvement reported by Jing Zhao and fixed by Jing Zhao (distcp)
Support incremental data copy in DistCp
- MAPREDUCE-5898.
Major bug reported by Uma Maheswara Rao G and fixed by Yi Liu (distcp)
distcp to support preserving HDFS extended attributes(XAttrs)
- MAPREDUCE-5896.
Major improvement reported by Sandy Ryza and fixed by Sandy Ryza
InputSplits should indicate which locations have the block cached in memory
- MAPREDUCE-5895.
Major bug reported by Kousuke Saruta and fixed by Kousuke Saruta (client)
FileAlreadyExistsException was thrown : Temporary Index File can not be cleaned up because OutputStream doesn't close properly
- MAPREDUCE-5888.
Major bug reported by Jason Lowe and fixed by Jason Lowe (mr-am)
Failed job leaves hung AM after it unregisters
- MAPREDUCE-5886.
Minor improvement reported by Chris Nauroth and fixed by Chris Nauroth (examples)
Allow wordcount example job to accept multiple input paths.
- MAPREDUCE-5884.
Major bug reported by Mohammad Kamrul Islam and fixed by Mohammad Kamrul Islam (jobhistoryserver , security)
History server uses short user name when canceling tokens
- MAPREDUCE-5874.
Major bug reported by Ravi Prakash and fixed by Tsuyoshi OZAWA (documentation)
Creating MapReduce REST API section
- MAPREDUCE-5868.
Major bug reported by Jason Lowe and fixed by Akira AJISAKA (test)
TestPipeApplication causing nightly build to fail
- MAPREDUCE-5862.
Critical bug reported by bc Wong and fixed by bc Wong
Line records longer than 2x split size aren't handled correctly
- MAPREDUCE-5861.
Minor improvement reported by Ted Yu and fixed by Tsuyoshi OZAWA
finishedSubMaps field in LocalContainerLauncher does not need to be volatile
- MAPREDUCE-5852.
Minor test reported by Chris Nauroth and fixed by Chris Nauroth (test)
Prepare MapReduce codebase for JUnit 4.11.
- MAPREDUCE-5846.
Major bug reported by Nathan Roberts and fixed by Nathan Roberts (tools/rumen)
Rumen doesn't understand JobQueueChangedEvent
- MAPREDUCE-5844.
Major bug reported by Maysam Yabandeh and fixed by Maysam Yabandeh
Add a configurable delay to reducer-preemption
- MAPREDUCE-5837.
Critical bug reported by Haohui Mai and fixed by Haohui Mai
MRAppMaster fails when checking on uber mode
- MAPREDUCE-5836.
Trivial bug reported by Akira AJISAKA and fixed by Akira AJISAKA
Fix typo in RandomTextWriter
- MAPREDUCE-5834.
Major bug reported by Mit Desai and fixed by Mit Desai
TestGridMixClasses tests timesout on branch-2
- MAPREDUCE-5825.
Major improvement reported by Gera Shegalov and fixed by Gera Shegalov (mr-am)
Provide diagnostics for reducers killed during ramp down
- MAPREDUCE-5821.
Major bug reported by Todd Lipcon and fixed by Todd Lipcon (performance , task)
IFile merge allocates new byte array for every value
- MAPREDUCE-5814.
Major bug reported by Gera Shegalov and fixed by Gera Shegalov (mrv2)
fat jar with *-default.xml may fail when mapreduce.job.classloader=true.
- MAPREDUCE-5812.
Major improvement reported by Mohammad Kamrul Islam and fixed by Mohammad Kamrul Islam (mr-am)
Make job context available to OutputCommitter.isRecoverySupported()
- MAPREDUCE-5809.
Major improvement reported by Chris Nauroth and fixed by Chris Nauroth (distcp)
Enhance distcp to support preserving HDFS ACLs.
- MAPREDUCE-5804.
Major test reported by Mit Desai and fixed by Mit Desai
TestMRJobsWithProfiler#testProfiler timesout
- MAPREDUCE-5790.
Blocker bug reported by Andrew Wang and fixed by Gera Shegalov
Default map hprof profile options do not work
- MAPREDUCE-5777.
Major improvement reported by bc Wong and fixed by zhihai xu
Support utf-8 text with BOM (byte order marker)
- MAPREDUCE-5775.
Minor bug reported by Liyin Liang and fixed by jhanver chand sharma
Remove unnecessary job.setNumReduceTasks in SleepJob.createJob
- MAPREDUCE-5774.
Trivial improvement reported by Gera Shegalov and fixed by Gera Shegalov (jobhistoryserver)
Job overview in History UI should list reducer phases in chronological order
- MAPREDUCE-5765.
Minor bug reported by Jonathan Eagles and fixed by Mit Desai (pipes)
Update hadoop-pipes examples README
- MAPREDUCE-5759.
Major bug reported by Sandy Ryza and fixed by Sandy Ryza
Remove unnecessary conf load in Limits
- MAPREDUCE-5749.
Major bug reported by shenhong and fixed by Jason Lowe
TestRMContainerAllocator#testReportedAppProgress Failed
- MAPREDUCE-5713.
Trivial bug reported by Ben Robie and fixed by Chen He (documentation)
InputFormat and JobConf JavaDoc Fixes
- MAPREDUCE-5671.
Major bug reported by Chen He and fixed by Chen He
NaN can be created by client and assigned to Progress
- MAPREDUCE-5665.
Major bug reported by Sandy Ryza and fixed by Anubhav Dhoot (test)
Add audience annotations to MiniMRYarnCluster and MiniMRCluster
- MAPREDUCE-5652.
Major bug reported by Karthik Kambatla and fixed by Jason Lowe
NM Recovery. ShuffleHandler should handle NM restarts
- MAPREDUCE-5642.
Minor test reported by Chuan Liu and fixed by Chuan Liu (test)
TestMiniMRChildTask fails on Windows
- MAPREDUCE-5639.
Major sub-task reported by Akira AJISAKA and fixed by Akira AJISAKA (documentation)
Port DistCp2 document to trunk
- MAPREDUCE-5638.
Major sub-task reported by Akira AJISAKA and fixed by Akira AJISAKA (documentation)
Port Hadoop Archives document to trunk
- MAPREDUCE-5637.
Major sub-task reported by Akira AJISAKA and fixed by Akira AJISAKA (documentation)
Convert Hadoop Streaming document to APT
- MAPREDUCE-5636.
Major sub-task reported by Akira AJISAKA and fixed by Akira AJISAKA (documentation)
Convert MapReduce Tutorial document to APT
- MAPREDUCE-5517.
Minor bug reported by Siqi Li and fixed by Siqi Li
enabling uber mode with 0 reducer still requires mapreduce.reduce.memory.mb to be less than yarn.app.mapreduce.am.resource.mb
- MAPREDUCE-5456.
Minor bug reported by Jason Lowe and fixed by Jason Lowe (mrv2 , test)
TestFetcher.testCopyFromHostExtraBytes is missing
- MAPREDUCE-5402.
Major improvement reported by David Rosenstrauch and fixed by Tsuyoshi OZAWA (distcp , mrv2)
DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE
- MAPREDUCE-5309.
Major bug reported by Vrushali C and fixed by Rushabh S Shah (jobhistoryserver , mrv2)
2.0.4 JobHistoryParser can't parse certain failed job history files generated by 2.0.3 history server
- MAPREDUCE-5014.
Major improvement reported by Srikanth Sundarrajan and fixed by Srikanth Sundarrajan (distcp)
Extending DistCp through a custom CopyListing is not possible
- MAPREDUCE-4937.
Major bug reported by Jason Lowe and fixed by Eric Payne (mr-am)
MR AM handles an oversized split metainfo file poorly
- MAPREDUCE-4282.
Major task reported by Eli Collins and fixed by Akira AJISAKA (documentation)
Convert Forrest docs to APT
- MAPREDUCE-3191.
Trivial bug reported by Todd Lipcon and fixed by Chen He
docs for map output compression incorrectly reference SequenceFile
- HDFS-6752.
Major bug reported by Vinayakumar B and fixed by Vinayakumar B (test)
Avoid Address bind errors in TestDatanodeConfig#testMemlockLimit
- HDFS-6723.
Major bug reported by Ming Ma and fixed by Ming Ma
New NN webUI no longer displays decommissioned state for dead node
- HDFS-6717.
Minor sub-task reported by Jeff Hansen and fixed by Brandon Li (nfs)
Jira HDFS-5804 breaks default nfs-gateway behavior for unsecured config
- HDFS-6712.
Major bug reported by Arpit Agarwal and fixed by Arpit Agarwal (documentation)
Document HDFS Multihoming Settings
- HDFS-6703.
Major bug reported by Abhiraj Butala and fixed by Srikanth Upputuri (nfs)
NFS: Files can be deleted from a read-only mount
- HDFS-6696.
Blocker bug reported by Kihwal Lee and fixed by Andrew Wang
Name node cannot start if the path of a file under construction contains ".snapshot"
- HDFS-6680.
Major bug reported by Tsz Wo Nicholas Sze and fixed by Tsz Wo Nicholas Sze (namenode)
BlockPlacementPolicyDefault does not choose favored nodes correctly
- HDFS-6647.
Blocker bug reported by Aaron T. Myers and fixed by Kihwal Lee (namenode , snapshots)
Edit log corruption when pipeline recovery occurs for deleted file present in snapshot
- HDFS-6632.
Major bug reported by Yongjun Zhang and fixed by Yongjun Zhang
Reintroduce dfs.http.port / dfs.https.port in branch-2
- HDFS-6631.
Major bug reported by Chris Nauroth and fixed by Liang Xie (hdfs-client , test)
TestPread#testHedgedReadLoopTooManyTimes fails intermittently.
- HDFS-6622.
Blocker bug reported by Kihwal Lee and fixed by Kihwal Lee
Rename and AddBlock may race and produce invalid edits
- HDFS-6620.
Major improvement reported by Uma Maheswara Rao G and fixed by Stephen Chu (namenode)
Snapshot docs should specify about preserve options with cp command
- HDFS-6618.
Blocker bug reported by Kihwal Lee and fixed by Kihwal Lee
FSNamesystem#delete drops the FSN lock between removing INodes from the tree and deleting them from the inode map
- HDFS-6614.
Minor test reported by Liang Xie and fixed by Liang Xie (test)
shorten TestPread run time with a smaller retry timeout setting
- HDFS-6612.
Minor bug reported by Juan Yu and fixed by Juan Yu
MiniDFSNNTopology#simpleFederatedTopology(int) always hardcode nameservice ID
- HDFS-6610.
Minor bug reported by Charles Lamb and fixed by Charles Lamb (test)
TestShortCircuitLocalRead tests sometimes timeout on slow machines
- HDFS-6604.
Critical bug reported by Giuseppe Reina and fixed by Colin Patrick McCabe (hdfs-client)
The short-circuit cache doesn't correctly time out replicas that haven't been used in a while
- HDFS-6603.
Minor improvement reported by Stephen Chu and fixed by Stephen Chu (test)
Add XAttr with ACL test
- HDFS-6601.
Blocker bug reported by Kihwal Lee and fixed by Kihwal Lee
Issues in finalizing rolling upgrade when there is a layout version change
- HDFS-6599.
Blocker bug reported by Kihwal Lee and fixed by Daryn Sharp
2.4 addBlock is 10 to 20 times slower compared to 0.23
- HDFS-6598.
Trivial bug reported by Yongjun Zhang and fixed by Yongjun Zhang (webhdfs)
Fix a typo in message issued from explorer.js
- HDFS-6595.
Minor improvement reported by Benoy Antony and fixed by Benoy Antony (balancer , datanode)
Configure the maximum threads allowed for balancing on datanodes
- HDFS-6593.
Minor improvement reported by Jing Zhao and fixed by Jing Zhao (namenode , snapshots)
Move SnapshotDiffInfo out of INodeDirectorySnapshottable
- HDFS-6591.
Major bug reported by LiuLei and fixed by Liang Xie (hdfs-client)
while loop is executed tens of thousands of times in Hedged Read
- HDFS-6587.
Major bug reported by Zhilei Xu and fixed by Zhilei Xu (test)
Bug in TestBPOfferService can cause test failure
- HDFS-6583.
Minor bug reported by Haohui Mai and fixed by Haohui Mai (namenode)
Remove clientNode in FileUnderConstructionFeature
- HDFS-6580.
Major improvement reported by Zhilei Xu and fixed by Zhilei Xu (namenode)
FSNamesystem.mkdirsInt should call the getAuditFileInfo() wrapper
- HDFS-6578.
Major improvement reported by Yongjun Zhang and fixed by Yongjun Zhang
add toString method to DatanodeStorage for easier debugging
- HDFS-6572.
Minor bug reported by Charles Lamb and fixed by Charles Lamb (namenode)
Add an option to the NameNode that prints the software and on-disk image versions
- HDFS-6563.
Critical bug reported by Aaron T. Myers and fixed by Aaron T. Myers (namenode , snapshots)
NameNode cannot save fsimage in certain circumstances when snapshots are in use
- HDFS-6562.
Minor sub-task reported by Haohui Mai and fixed by Haohui Mai (namenode)
Refactor rename() in FSDirectory
- HDFS-6559.
Minor bug reported by Akira AJISAKA and fixed by Akira AJISAKA (documentation)
Fix wrong option "dfsadmin -rollingUpgrade start" in the document
- HDFS-6558.
Trivial improvement reported by Akira AJISAKA and fixed by Chen He
Missing '\n' in the description of dfsadmin -rollingUpgrade
- HDFS-6557.
Major sub-task reported by Haohui Mai and fixed by Haohui Mai (namenode)
Move the reference of fsimage to FSNamesystem
- HDFS-6556.
Major bug reported by Yi Liu and fixed by Uma Maheswara Rao G (namenode)
Refine XAttr permissions
- HDFS-6553.
Major bug reported by Stephen Chu and fixed by Stephen Chu (nfs)
Add missing DeprecationDeltas for NFS Kerberos configurations
- HDFS-6552.
Trivial bug reported by Amir Langer and fixed by (namenode)
add DN storage to a BlockInfo will not replace the different storage from same DN
- HDFS-6551.
Major bug reported by Jing Zhao and fixed by Jing Zhao (namenode , snapshots)
Rename with OVERWRITE option may throw NPE when the target file/directory is a reference INode
- HDFS-6549.
Major bug reported by Aaron T. Myers and fixed by Aaron T. Myers (nfs)
Add support for accessing the NFS gateway from the AIX NFS client
- HDFS-6545.
Critical improvement reported by Kihwal Lee and fixed by Kihwal Lee
Finalizing rolling upgrade can make NN unavailable for a long duration
- HDFS-6539.
Major bug reported by Binglin Chang and fixed by Binglin Chang
test_native_mini_dfs is skipped in hadoop-hdfs/pom.xml
- HDFS-6535.
Major bug reported by George Wong and fixed by George Wong (namenode)
HDFS quota update is wrong when file is appended
- HDFS-6530.
Minor bug reported by Tsz Wo Nicholas Sze and fixed by Tsz Wo Nicholas Sze (documentation)
Fix Balancer documentation
- HDFS-6529.
Minor improvement reported by Anubhav Dhoot and fixed by Anubhav Dhoot (hdfs-client)
Trace logging for RemoteBlockReader2 to identify remote datanode and file being read
- HDFS-6528.
Minor improvement reported by Stephen Chu and fixed by Stephen Chu (test)
Add XAttrs to TestOfflineImageViewer
- HDFS-6527.
Blocker bug reported by Kihwal Lee and fixed by Kihwal Lee
Edit log corruption due to deferred INode removal
- HDFS-6518.
Major bug reported by Yongjun Zhang and fixed by Andrew Wang
TestCacheDirectives#testExceedsCapacity should take FSN read lock when accessing pendingCached list
- HDFS-6507.
Major improvement reported by Zesheng Wu and fixed by Zesheng Wu (tools)
Improve DFSAdmin to support HA cluster better
- HDFS-6503.
Minor improvement reported by Zesheng Wu and fixed by Zesheng Wu (tools)
Fix typo of DFSAdmin restoreFailedStorage
- HDFS-6500.
Blocker bug reported by Junping Du and fixed by Tsz Wo Nicholas Sze (snapshots)
Snapshot shouldn't be removed silently after renaming to an existing snapshot
- HDFS-6499.
Major improvement reported by Yongjun Zhang and fixed by Yongjun Zhang (namenode)
Use NativeIO#renameTo instead of File#renameTo in FileJournalManager
- HDFS-6497.
Minor bug reported by Colin Patrick McCabe and fixed by Colin Patrick McCabe (test)
Make TestAvailableSpaceVolumeChoosingPolicy deterministic
- HDFS-6493.
Trivial bug reported by Juan Yu and fixed by Juan Yu
Change dfs.namenode.startup.delay.block.deletion to second instead of millisecond
- HDFS-6492.
Major improvement reported by Andrew Wang and fixed by Andrew Wang (namenode)
Support create-time xattrs and atomically setting multiple xattrs
- HDFS-6487.
Major bug reported by Mit Desai and fixed by Mit Desai
TestStandbyCheckpoint#testSBNCheckpoints is racy
- HDFS-6486.
Minor task reported by Yi Liu and fixed by Yi Liu (webhdfs)
Add user doc for XAttrs via WebHDFS.
- HDFS-6480.
Major sub-task reported by Haohui Mai and fixed by Haohui Mai (namenode)
Move waitForReady() from FSDirectory to FSNamesystem
- HDFS-6475.
Major bug reported by Yongjun Zhang and fixed by Yongjun Zhang (ha , webhdfs)
WebHdfs clients fail without retry because incorrect handling of StandbyException
- HDFS-6472.
Trivial bug reported by Juan Yu and fixed by Juan Yu
fix typo in webapps/hdfs/explorer.js
- HDFS-6471.
Major bug reported by Dasha Boudnik and fixed by Dasha Boudnik (test)
Make moveFromLocal CLI testcases to be non-disruptive
- HDFS-6470.
Major bug reported by Andrew Wang and fixed by Ming Ma
TestBPOfferService.testBPInitErrorHandling is flaky
- HDFS-6464.
Major bug reported by Yi Liu and fixed by Yi Liu (webhdfs)
Support multiple xattr.name parameters for WebHDFS getXAttrs.
- HDFS-6463.
Trivial improvement reported by Aaron T. Myers and fixed by Chris Nauroth (namenode)
Clarify behavior of AclStorage#createFsPermissionForExtendedAcl in comments.
- HDFS-6462.
Major bug reported by Yesha Vora and fixed by Brandon Li (nfs)
NFS: fsstat request fails with the secure hdfs
- HDFS-6461.
Trivial bug reported by James Thomas and fixed by James Thomas (datanode)
Use Time#monotonicNow to compute duration in DataNode#shutDown
- HDFS-6460.
Minor improvement reported by Yongjun Zhang and fixed by Yongjun Zhang
Ignore stale and decommissioned nodes in NetworkTopology#sortByDistance
- HDFS-6453.
Major improvement reported by Liang Xie and fixed by Liang Xie (datanode , namenode)
use Time#monotonicNow to avoid system clock reset
- HDFS-6448.
Major improvement reported by Liang Xie and fixed by Liang Xie (hdfs-client)
BlockReaderLocalLegacy should set socket timeout based on conf.socketTimeout
- HDFS-6447.
Trivial improvement reported by Allen Wittenauer and fixed by Juan Yu (balancer)
balancer should timestamp the completion message
- HDFS-6443.
Minor bug reported by Zesheng Wu and fixed by Zesheng Wu (test)
Fix MiniQJMHACluster related test failures
- HDFS-6442.
Minor improvement reported by Zesheng Wu and fixed by Zesheng Wu (test)
Fix TestEditLogAutoroll and TestStandbyCheckpoints failure caused by port conflicts
- HDFS-6439.
Major bug reported by Brandon Li and fixed by Aaron T. Myers (nfs)
NFS should not reject NFS requests to the NULL procedure whether port monitoring is enabled or not
- HDFS-6438.
Major bug reported by Jing Zhao and fixed by Jing Zhao (webhdfs)
DeleteSnapshot should be a DELETE request in WebHdfs
- HDFS-6435.
Major new feature reported by Aaron T. Myers and fixed by Aaron T. Myers (nfs)
Add support for specifying a static uid/gid mapping for the NFS gateway
- HDFS-6433.
Major improvement reported by Benoy Antony and fixed by Benoy Antony (balancer)
Replace BytesMoved class with AtomicLong
- HDFS-6432.
Major improvement reported by Suresh Srinivas and fixed by Jing Zhao (namenode , webhdfs)
Add snapshot related APIs to webhdfs
- HDFS-6430.
Major task reported by Yi Liu and fixed by Yi Liu
HTTPFS - Implement XAttr support
- HDFS-6424.
Major bug reported by Ming Ma and fixed by Ming Ma
blockReport doesn't need to invalidate blocks on SBN
- HDFS-6423.
Major bug reported by Jing Zhao and fixed by Jing Zhao (namenode)
Diskspace quota usage should be updated when appending data to partial block
- HDFS-6422.
Blocker bug reported by Charles Lamb and fixed by Charles Lamb
getfattr in CLI doesn't throw exception or return non-0 return code when xattr doesn't exist
- HDFS-6421.
Major bug reported by Jason Lowe and fixed by Mit Desai (libhdfs)
Fix vecsum.c compile on BSD and some other systems
- HDFS-6419.
Major test reported by Akira AJISAKA and fixed by Akira AJISAKA
TestBookKeeperHACheckpoints#TestSBNCheckpoints fails on trunk
- HDFS-6418.
Blocker bug reported by Steve Loughran and fixed by Tsz Wo Nicholas Sze (hdfs-client)
Regression: DFS_NAMENODE_USER_NAME_KEY missing in trunk
- HDFS-6416.
Minor improvement reported by Brandon Li and fixed by Abhiraj Butala (nfs)
Use Time#monotonicNow in OpenFileCtx and OpenFileCtxCache to avoid system clock bugs
- HDFS-6409.
Trivial bug reported by Chris Nauroth and fixed by Chen He (namenode)
Fix typo in log message about NameNode layout version upgrade.
- HDFS-6406.
Major new feature reported by Aaron T. Myers and fixed by Aaron T. Myers (nfs)
Add capability for NFS gateway to reject connections from unprivileged ports
- HDFS-6404.
Major bug reported by Alejandro Abdelnur and fixed by Mike Yoder
HttpFS should use a 000 umask for mkdir and create operations
- HDFS-6403.
Major improvement reported by Yongjun Zhang and fixed by Yongjun Zhang (datanode , namenode)
Add metrics for log warnings reported by JVM pauses
- HDFS-6400.
Critical bug reported by Akira AJISAKA and fixed by Akira AJISAKA (tools)
Cannot execute "hdfs oiv_legacy"
- HDFS-6399.
Minor bug reported by Charles Lamb and fixed by Chris Nauroth (documentation , namenode)
Add note about setfacl in HDFS permissions guide
- HDFS-6396.
Minor improvement reported by Andrew Wang and fixed by Charles Lamb
Remove support for ACL feature from INodeSymlink
- HDFS-6395.
Major bug reported by Andrew Wang and fixed by Yi Liu (namenode)
Skip checking xattr limits for non-user-visible namespaces
- HDFS-6381.
Trivial bug reported by Binglin Chang and fixed by Binglin Chang (documentation)
Fix a typo in INodeReference.java
- HDFS-6379.
Major bug reported by Alejandro Abdelnur and fixed by Mike Yoder
HTTPFS - Implement ACLs support
- HDFS-6378.
Major bug reported by Brandon Li and fixed by Abhiraj Butala (nfs)
NFS registration should timeout instead of hanging when portmap/rpcbind is not available
- HDFS-6375.
Major improvement reported by Andrew Wang and fixed by Charles Lamb (namenode)
Listing extended attributes with the search permission
- HDFS-6370.
Major bug reported by Haohui Mai and fixed by Haohui Mai (datanode , journal-node , namenode)
Web UI fails to display in intranet under IE
- HDFS-6369.
Trivial improvement reported by Ted Yu and fixed by Ted Yu
Document that BlockReader#available() can return more bytes than are remaining in the block
- HDFS-6367.
Major bug reported by Yi Liu and fixed by Yi Liu (webhdfs)
EnumSetParam$Domain#parse fails for parameter containing more than one enum.
- HDFS-6364.
Major bug reported by Benoy Antony and fixed by Benoy Antony (balancer)
Incorrect check for unknown datanode in Balancer
- HDFS-6356.
Trivial improvement reported by Tulasi G and fixed by Tulasi G (datanode)
Fix typo in DatanodeLayoutVersion
- HDFS-6355.
Major bug reported by Colin Patrick McCabe and fixed by Colin Patrick McCabe
Fix divide-by-zero, improper use of wall-clock time in BlockPoolSliceScanner
- HDFS-6351.
Major sub-task reported by Yongjun Zhang and fixed by Yongjun Zhang (hdfs-client)
Command "hdfs dfs -rm -r" can't remove empty directory
- HDFS-6345.
Major bug reported by Lenni Kuff and fixed by Andrew Wang (caching)
DFS.listCacheDirectives() should allow filtering based on cache directive ID
- HDFS-6337.
Major bug reported by Uma Maheswara Rao G and fixed by Uma Maheswara Rao G (test)
Setfacl testcase is failing due to dash character in username in TestAclCLI
- HDFS-6334.
Major improvement reported by Kihwal Lee and fixed by Kihwal Lee
Client failover proxy provider for IP failover based NN HA
- HDFS-6330.
Major sub-task reported by Haohui Mai and fixed by Haohui Mai (namenode)
Move mkdirs() to FSNamesystem
- HDFS-6328.
Major sub-task reported by Haohui Mai and fixed by Haohui Mai (namenode)
Clean up dead code in FSDirectory
- HDFS-6315.
Major sub-task reported by Haohui Mai and fixed by Haohui Mai
Decouple recording edit logs from FSDirectory
- HDFS-6312.
Blocker bug reported by Daryn Sharp and fixed by Daryn Sharp (webhdfs)
WebHdfs HA failover is broken on secure clusters
- HDFS-6305.
Critical bug reported by Daryn Sharp and fixed by Daryn Sharp (webhdfs)
WebHdfs response decoding may throw RuntimeExceptions
- HDFS-6304.
Major improvement reported by Haohui Mai and fixed by Haohui Mai (namenode)
Consolidate the logic of path resolution in FSDirectory
- HDFS-6297.
Major improvement reported by Dasha Boudnik and fixed by Dasha Boudnik (test)
Add CLI testcases to reflect new features of dfs and dfsadmin
Committed to the trunk and branch-2. Thanks Dasha!
- HDFS-6295.
Major improvement reported by Andrew Wang and fixed by Andrew Wang
Add "decommissioning" state and node state filtering to dfsadmin
- HDFS-6294.
Major bug reported by Colin Patrick McCabe and fixed by Colin Patrick McCabe (namenode)
Use INode IDs to avoid conflicts when a file open for write is renamed
- HDFS-6293.
Blocker bug reported by Kihwal Lee and fixed by Kihwal Lee
Issues with OIV processing PB-based fsimages
Set "dfs.namenode.legacy-oiv-image.dir" to an appropriate directory to make the standby NameNode or secondary NameNode save its file system state in the old fsimage format during checkpointing. This image can be used for offline analysis with the OfflineImageViewer. Use the "hdfs oiv_legacy" command to process the old fsimage format.
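For illustration, a minimal hdfs-site.xml fragment enabling this behavior might look as follows (the directory path is a hypothetical placeholder, not taken from the release note):
{code}
<property>
  <name>dfs.namenode.legacy-oiv-image.dir</name>
  <!-- Hypothetical path: directory where legacy-format fsimages are saved -->
  <value>/path/to/legacy-oiv-images</value>
</property>
{code}
The saved image can then be analyzed offline with the "hdfs oiv_legacy" command.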
- HDFS-6289.
Critical bug reported by Aaron T. Myers and fixed by Aaron T. Myers (ha)
HA failover can fail if there are pending DN messages for DNs which no longer exist
- HDFS-6288.
Minor bug reported by Juan Yu and fixed by Juan Yu
DFSInputStream Pread doesn't update ReadStatistics
- HDFS-6287.
Minor test reported by Colin Patrick McCabe and fixed by Colin Patrick McCabe (libhdfs , test)
Add vecsum test of libhdfs read access times
- HDFS-6282.
Minor improvement reported by Colin Patrick McCabe and fixed by Colin Patrick McCabe (test)
re-add testIncludeByRegistrationName
- HDFS-6281.
Major new feature reported by Aaron T. Myers and fixed by Aaron T. Myers (nfs)
Provide option to use the NFS Gateway without having to use the Hadoop portmapper
- HDFS-6279.
Major improvement reported by Haohui Mai and fixed by Haohui Mai
Create new index page for JN / DN
- HDFS-6278.
Major improvement reported by Haohui Mai and fixed by Haohui Mai
Create HTML5-based UI for SNN
- HDFS-6276.
Major sub-task reported by Suresh Srinivas and fixed by Suresh Srinivas
Remove unnecessary conditions and null check
- HDFS-6275.
Major sub-task reported by Suresh Srinivas and fixed by Suresh Srinivas
Fix warnings - type arguments can be inferred and redundant local variable
- HDFS-6274.
Major sub-task reported by Suresh Srinivas and fixed by Suresh Srinivas
Cleanup javadoc warnings in HDFS code
- HDFS-6273.
Major improvement reported by Arpit Agarwal and fixed by Arpit Agarwal (namenode)
Config options to allow wildcard endpoints for namenode HTTP and HTTPS servers
HDFS-6273 introduces two new HDFS configuration keys:
- dfs.namenode.http-bind-host
- dfs.namenode.https-bind-host
The most common use case for these keys is to have the NameNode HTTP (or HTTPS) endpoints listen on all interfaces on multi-homed systems by setting the keys to 0.0.0.0, i.e. INADDR_ANY.
For the systems background on this usage of INADDR_ANY please refer to ip(7) in the Linux Programmer's Manual (web link: http://man7.org/linux/man-pages/man7/ip.7.html).
These keys complement the existing NameNode options:
- dfs.namenode.rpc-bind-host
- dfs.namenode.servicerpc-bind-host
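As a sketch, the wildcard binding described above could be expressed in hdfs-site.xml as follows (values shown are illustrative):
{code}
<property>
  <name>dfs.namenode.http-bind-host</name>
  <!-- INADDR_ANY: listen on all interfaces -->
  <value>0.0.0.0</value>
</property>
<property>
  <name>dfs.namenode.https-bind-host</name>
  <value>0.0.0.0</value>
</property>
{code}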
- HDFS-6270.
Minor bug reported by Benoy Antony and fixed by Benoy Antony
Secondary namenode status page shows transaction count in bytes
- HDFS-6269.
Major improvement reported by Eric Payne and fixed by Eric Payne (namenode , webhdfs)
NameNode Audit Log should differentiate between webHDFS open and HDFS open.
- HDFS-6266.
Major improvement reported by Jing Zhao and fixed by Jing Zhao (snapshots)
Identify full path for a given INode
- HDFS-6265.
Minor test reported by Chris Nauroth and fixed by Chris Nauroth (test)
Prepare HDFS codebase for JUnit 4.11.
- HDFS-6257.
Minor test reported by Ted Yu and fixed by Colin Patrick McCabe (caching)
TestCacheDirectives#testExceedsCapacity fails occasionally
- HDFS-6256.
Major improvement reported by Akira AJISAKA and fixed by Akira AJISAKA (tools)
Clean up ImageVisitor and SpotCheckImageVisitor
- HDFS-6250.
Major bug reported by Kihwal Lee and fixed by Binglin Chang
TestBalancerWithNodeGroup.testBalancerWithRackLocality fails
- HDFS-6243.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (ha , namenode)
HA NameNode transition to active or shutdown may leave lingering image transfer thread.
- HDFS-6240.
Major sub-task reported by Akira AJISAKA and fixed by Akira AJISAKA (tools)
WebImageViewer returns 404 if LISTSTATUS to an empty directory
- HDFS-6238.
Minor bug reported by Chris Nauroth and fixed by Chris Nauroth (datanode , test)
TestDirectoryScanner leaks file descriptors.
- HDFS-6230.
Major bug reported by Arpit Agarwal and fixed by Mit Desai (namenode)
Expose upgrade status through NameNode web UI
- HDFS-6227.
Major bug reported by Jing Zhao and fixed by Colin Patrick McCabe
ShortCircuitCache#unref should purge ShortCircuitReplicas whose streams have been closed by java interrupts
- HDFS-6225.
Major improvement reported by Haohui Mai and fixed by Haohui Mai
Remove the o.a.h.hdfs.server.common.UpgradeStatusReport
- HDFS-6224.
Minor test reported by Charles Lamb and fixed by Charles Lamb (test)
Add a unit test to TestAuditLogger for file permissions passed to logAuditEvent
- HDFS-6222.
Major bug reported by Daryn Sharp and fixed by Daryn Sharp (webhdfs)
Remove background token renewer from webhdfs
- HDFS-6219.
Major sub-task reported by Daryn Sharp and fixed by Daryn Sharp (namenode , webhdfs)
Proxy superuser configuration should use true client IP for address checks
- HDFS-6218.
Major sub-task reported by Daryn Sharp and fixed by Daryn Sharp (namenode , webhdfs)
Audit log should use true client IP for proxied webhdfs operations
- HDFS-6217.
Major sub-task reported by Daryn Sharp and fixed by Daryn Sharp (webhdfs)
Webhdfs PUT operations may not work via a http proxy
- HDFS-6216.
Major bug reported by Daryn Sharp and fixed by Daryn Sharp (webhdfs)
Issues with webhdfs and http proxies
- HDFS-6215.
Minor bug reported by Kihwal Lee and fixed by Kihwal Lee
Wrong error message for upgrade
- HDFS-6214.
Major bug reported by Daryn Sharp and fixed by Daryn Sharp (webhdfs)
Webhdfs has poor throughput for files >2GB
- HDFS-6213.
Minor bug reported by Steve Loughran and fixed by Andrew Wang (test)
TestDataNodeConfig failing on Jenkins runs due to DN web port in use
- HDFS-6210.
Major sub-task reported by Akira AJISAKA and fixed by Akira AJISAKA (tools)
Support GETACLSTATUS operation in WebImageViewer
- HDFS-6194.
Major bug reported by Haohui Mai and fixed by Akira AJISAKA
Create new tests for ByteRangeInputStream
- HDFS-6191.
Major improvement reported by Kihwal Lee and fixed by Kihwal Lee (namenode)
Disable quota checks when replaying edit log.
- HDFS-6190.
Trivial bug reported by Charles Lamb and fixed by Charles Lamb (tools)
minor textual fixes in DFSClient
- HDFS-6186.
Major sub-task reported by Suresh Srinivas and fixed by Jing Zhao (namenode)
Pause deletion of blocks when the namenode starts up
- HDFS-6181.
Trivial bug reported by Brandon Li and fixed by Brandon Li (documentation , nfs)
Fix the wrong property names in NFS user guide
- HDFS-6180.
Blocker bug reported by Travis Thompson and fixed by Haohui Mai
dead node count / listing is very broken in JMX and old GUI
- HDFS-6178.
Major bug reported by Ming Ma and fixed by Ming Ma (namenode)
Decommission on standby NN couldn't finish
- HDFS-6173.
Major sub-task reported by Akira AJISAKA and fixed by Akira AJISAKA (tools)
Move the default processor from Ls to Web in OfflineImageViewer
- HDFS-6170.
Major sub-task reported by Akira AJISAKA and fixed by Akira AJISAKA (tools)
Support GETFILESTATUS operation in WebImageViewer
- HDFS-6169.
Major sub-task reported by Akira AJISAKA and fixed by Akira AJISAKA (tools)
Move the address in WebImageViewer
- HDFS-6168.
Major improvement reported by Tsz Wo Nicholas Sze and fixed by Tsz Wo Nicholas Sze (hdfs-client)
Remove deprecated methods in DistributedFileSystem
- HDFS-6167.
Major improvement reported by Tsz Wo Nicholas Sze and fixed by Tsz Wo Nicholas Sze (hdfs-client)
Relocate the non-public API classes in the hdfs.client package
- HDFS-6164.
Major sub-task reported by Haohui Mai and fixed by Haohui Mai (tools)
Remove lsr in OfflineImageViewer
The OfflineImageViewer no longer generates lsr-style output. This functionality has been superseded by a tool that takes the fsimage and exposes a WebHDFS-like API for user queries.
- HDFS-6162.
Minor sub-task reported by Suresh Srinivas and fixed by Suresh Srinivas
Format strings should use platform independent line separator
- HDFS-6160.
Major bug reported by Ted Yu and fixed by Arpit Agarwal (test)
TestSafeMode occasionally fails
- HDFS-6159.
Major bug reported by Chen He and fixed by Chen He (test)
TestBalancerWithNodeGroup.testBalancerWithNodeGroup fails if there is block missing after balancer success
- HDFS-6158.
Major improvement reported by Haohui Mai and fixed by Haohui Mai
Clean up dead code for OfflineImageViewer
- HDFS-6156.
Major bug reported by Haohui Mai and fixed by Shinichi Yamashita
Simplify the JMX API that provides snapshot information
- HDFS-6155.
Major sub-task reported by Suresh Srinivas and fixed by Suresh Srinivas
Fix Boxing/unboxing to parse a primitive findbugs warnings
- HDFS-6153.
Minor bug reported by Akira AJISAKA and fixed by Akira AJISAKA (documentation , webhdfs)
Document "fileId" and "childrenNum" fields in the FileStatus Json schema
- HDFS-6143.
Blocker bug reported by Gera Shegalov and fixed by Gera Shegalov
WebHdfsFileSystem open should throw FileNotFoundException for non-existing paths
- HDFS-6125.
Major sub-task reported by Suresh Srinivas and fixed by Suresh Srinivas (test)
Cleanup unnecessary cast in HDFS code base
- HDFS-6119.
Minor sub-task reported by Suresh Srinivas and fixed by Suresh Srinivas (namenode)
FSNamesystem code cleanup
- HDFS-6112.
Minor bug reported by Aaron T. Myers and fixed by Aaron T. Myers (nfs)
NFS Gateway docs are incorrect for allowed hosts configuration
- HDFS-6110.
Major improvement reported by Liang Xie and fixed by Liang Xie (datanode)
Add more logging of slow actions in the critical write path
Slow I/O is now logged. Set the log thresholds in the DFS client and DataNode via the following new configuration keys:
dfs.client.slow.io.warning.threshold.ms (default: 30 seconds)
dfs.datanode.slow.io.warning.threshold.ms (default: 300 ms)
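As a sketch, the stated defaults correspond to the following hdfs-site.xml fragment (both keys take values in milliseconds, so 30 seconds is written as 30000):
{code}
<property>
  <name>dfs.client.slow.io.warning.threshold.ms</name>
  <value>30000</value>
</property>
<property>
  <name>dfs.datanode.slow.io.warning.threshold.ms</name>
  <value>300</value>
</property>
{code}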
- HDFS-6109.
Major improvement reported by Liang Xie and fixed by Liang Xie (datanode)
let sync_file_range() system call run in background
- HDFS-6056.
Major bug reported by Aaron T. Myers and fixed by Brandon Li (nfs)
Clean up NFS config settings
- HDFS-6007.
Minor improvement reported by Masatake Iwasaki and fixed by (documentation)
Update documentation about short-circuit local reads
- HDFS-5978.
Major sub-task reported by Akira AJISAKA and fixed by Akira AJISAKA (tools)
Create a tool to take fsimage and expose read-only WebHDFS API
- HDFS-5892.
Minor test reported by Ted Yu and fixed by Ted Yu
TestDeleteBlockPool fails in branch-2
- HDFS-5865.
Minor sub-task reported by Akira AJISAKA and fixed by Akira AJISAKA (documentation)
Update OfflineImageViewer document
- HDFS-5693.
Major improvement reported by Ming Ma and fixed by Ming Ma (namenode)
Few NN metrics data points were collected via JMX when NN is under heavy load
- HDFS-5683.
Major improvement reported by Andrew Wang and fixed by Abhiraj Butala (namenode)
Better audit log messages for caching operations
- HDFS-5669.
Major bug reported by Vinayakumar B and fixed by Vinayakumar B (datanode)
Storage#tryLock() should check for null before logging successful message
- HDFS-5591.
Minor bug reported by Andrew Wang and fixed by Charles Lamb (namenode)
Checkpointing should use monotonic time when calculating period
- HDFS-5522.
Major bug reported by Kihwal Lee and fixed by Rushabh S Shah
Datanode disk error check may be incorrectly skipped
- HDFS-5411.
Minor sub-task reported by Robert Rati and fixed by Rakesh R
Update Bookkeeper dependency to 4.2.3
- HDFS-5409.
Minor test reported by Chris Nauroth and fixed by Chris Nauroth (test)
TestOfflineEditsViewer#testStored fails on Windows due to CRLF line endings in editsStored.xml from git checkout
- HDFS-5381.
Minor improvement reported by Colin Patrick McCabe and fixed by Benoy Antony (federation)
ExtendedBlock#hashCode should use both blockId and block pool ID
- HDFS-5196.
Minor improvement reported by Haohui Mai and fixed by Shinichi Yamashita (snapshots)
Provide more snapshot information in WebUI
- HDFS-5168.
Critical improvement reported by Nikola Vujic and fixed by Nikola Vujic (namenode)
BlockPlacementPolicy does not work for cross node group dependencies
- HDFS-4913.
Major bug reported by Stephen Chu and fixed by Colin Patrick McCabe (fuse-dfs)
Deleting file through fuse-dfs when using trash fails requiring root permissions
- HDFS-4909.
Blocker bug reported by Ralph Castain and fixed by Colin Patrick McCabe (datanode , journal-node , namenode)
Avoid protocol buffer RPC namespace clashes
- HDFS-4667.
Major sub-task reported by Jing Zhao and fixed by Jing Zhao (namenode)
Capture renamed files/directories in snapshot diff report
- HDFS-4286.
Major sub-task reported by Vinayakumar B and fixed by Rakesh R
Changes from BOOKKEEPER-203 broke the capability of including the bookkeeper-server jar in a hidden package of BKJM
- HDFS-4221.
Major sub-task reported by Uma Maheswara Rao G and fixed by Rakesh R (ha)
Remove the format limitation point from BKJM documentation as HDFS-3810 closed
- HDFS-3848.
Major bug reported by Hooman Peiro Sajjad and fixed by Chen He (namenode)
A Bug in recoverLeaseInternal method of FSNameSystem class
- HDFS-3828.
Major bug reported by Andy Isaacson and fixed by Andy Isaacson
Block Scanner rescans blocks too frequently
- HDFS-3493.
Major bug reported by J.Andreina and fixed by Juan Yu
Invalidate excess corrupted blocks as long as minimum replication is satisfied
- HDFS-3087.
Critical bug reported by Kihwal Lee and fixed by Rushabh S Shah (namenode)
Decommissioning on NN restart can complete without blocks being replicated
- HDFS-2949.
Major improvement reported by Todd Lipcon and fixed by Rushabh S Shah (ha , namenode)
HA: Add check to active state transition to prevent operator-induced split brain
- HDFS-2006.
Major improvement reported by dhruba borthakur and fixed by Yi Liu (namenode)
ability to support storing extended attributes per file
- HADOOP-10896.
Blocker improvement reported by Karthik Kambatla and fixed by Karthik Kambatla (documentation)
Update compatibility doc to capture visibility of un-annotated classes/ methods
- HADOOP-10894.
Minor sub-task reported by Akira AJISAKA and fixed by Akira AJISAKA (documentation)
Fix dead link in ToolRunner documentation
- HADOOP-10890.
Major bug reported by Yongjun Zhang and fixed by Yongjun Zhang
TestDFVariations.testMount fails intermittently
- HADOOP-10872.
Major bug reported by Yongjun Zhang and fixed by Yongjun Zhang (fs)
TestPathData fails intermittently with "Mkdirs failed to create d1"
- HADOOP-10864.
Minor sub-task reported by Allen Wittenauer and fixed by Akira AJISAKA (documentation)
Tool documentation is broken
- HADOOP-10821.
Blocker task reported by Akira AJISAKA and fixed by Andrew Wang
Prepare the release notes for Hadoop 2.5.0
- HADOOP-10801.
Major bug reported by Akira AJISAKA and fixed by Akira AJISAKA (documentation)
Fix dead link in site.xml
- HADOOP-10782.
Trivial improvement reported by Jingguo Yao and fixed by Jingguo Yao
Typo in DataChecksum class
- HADOOP-10767.
Trivial improvement reported by Chris Nauroth and fixed by Chris Nauroth (fs)
Clean up unused code in Ls shell command.
- HADOOP-10754.
Trivial test reported by Chris Nauroth and fixed by Chris Nauroth (ha , test)
Reenable several HA ZooKeeper-related tests on Windows.
- HADOOP-10747.
Minor improvement reported by Chris Nauroth and fixed by Chris Nauroth (ipc)
Support configurable retries on SASL connection failures in RPC client.
- HADOOP-10746.
Major bug reported by Jinghui Wang and fixed by Jinghui Wang (test)
TestSocketIOWithTimeout#testSocketIOWithTimeout fails on Power PC
- HADOOP-10739.
Major bug reported by Jason Lowe and fixed by chang li (fs)
Renaming a file into a directory containing the same filename results in a confusing I/O error
- HADOOP-10716.
Critical bug reported by Daryn Sharp and fixed by Rushabh S Shah (conf , fs)
Cannot use more than 1 har filesystem
- HADOOP-10715.
Minor task reported by Ted Yu and fixed by
Remove public GraphiteSink#setWriter()
- HADOOP-10711.
Major bug reported by Robert Kanter and fixed by Robert Kanter (security)
Cleanup some extra dependencies from hadoop-auth
- HADOOP-10710.
Major bug reported by Alejandro Abdelnur and fixed by Juan Yu (security)
hadoop.auth cookie is not properly constructed according to RFC2109
- HADOOP-10702.
Minor bug reported by Benoy Antony and fixed by Benoy Antony (security)
KerberosAuthenticationHandler does not log the principal names correctly
- HADOOP-10701.
Major bug reported by Premchandra Preetham Kukillaya and fixed by Harsh J (nfs)
NFS should not validate the access permission based only on the user's primary group
- HADOOP-10699.
Major bug reported by Kirill A. Korinskiy and fixed by Binglin Chang
Fix build native library on mac osx
- HADOOP-10691.
Minor improvement reported by Lei (Eddy) Xu and fixed by Lei (Eddy) Xu (tools)
Improve the readability of 'hadoop fs -help'
- HADOOP-10688.
Major improvement reported by Sandy Ryza and fixed by Sandy Ryza (fs)
Expose thread-level FileSystem StatisticsData
- HADOOP-10686.
Major bug reported by Abraham Elmahrek and fixed by Abraham Elmahrek
Writables are not always configured
- HADOOP-10683.
Major bug reported by Benoy Antony and fixed by Benoy Antony (security)
Users authenticated with KERBEROS are recorded as being authenticated with SIMPLE
- HADOOP-10678.
Minor bug reported by Benoy Antony and fixed by Benoy Antony (security)
SecurityUtil has unnecessary synchronization on collection used for only tests
- HADOOP-10674.
Major improvement reported by Tsz Wo Nicholas Sze and fixed by Tsz Wo Nicholas Sze (performance , util)
Rewrite the PureJavaCrc32 loop for performance improvement
- HADOOP-10666.
Minor improvement reported by Henry Saputra and fixed by Henry Saputra (documentation)
Remove Copyright /d/d/d/d Apache Software Foundation from the source files license header
- HADOOP-10665.
Minor improvement reported by Benoy Antony and fixed by Benoy Antony (security)
Make Hadoop Authentication Handler loading case-insensitive
- HADOOP-10664.
Major bug reported by Chen He and fixed by Aaron T. Myers
TestNetUtils.testNormalizeHostName fails
- HADOOP-10660.
Major bug reported by Ted Yu and fixed by Chen He
GraphiteSink should implement Closeable
- HADOOP-10659.
Minor sub-task reported by Benoy Antony and fixed by Benoy Antony (security)
Refactor AccessControlList to reuse utility functions and to improve performance
- HADOOP-10658.
Major bug reported by Alejandro Abdelnur and fixed by Alejandro Abdelnur (security)
SSLFactory expects truststores being configured
- HADOOP-10657.
Major bug reported by Ming Ma and fixed by Ming Ma
Have RetryInvocationHandler log failover attempt at INFO level
- HADOOP-10656.
Major bug reported by Brandon Li and fixed by Brandon Li (security)
The password keystore file is not picked by LDAP group mapping
- HADOOP-10652.
Major sub-task reported by Benoy Antony and fixed by Benoy Antony (security)
Refactor Proxyusers to use AccessControlList
- HADOOP-10649.
Major sub-task reported by Benoy Antony and fixed by Benoy Antony (security)
Allow overriding the default ACL for service authorization
- HADOOP-10647.
Minor bug reported by Gene Kim and fixed by Gene Kim (fs/swift)
String Format Exception in SwiftNativeFileSystemStore.java
- HADOOP-10639.
Major bug reported by Alejandro Abdelnur and fixed by Alejandro Abdelnur (security)
FileBasedKeyStoresFactory initialization is not using default for SSL_REQUIRE_CLIENT_CERT_KEY
- HADOOP-10638.
Major bug reported by Manikandan Narayanaswamy and fixed by Manikandan Narayanaswamy (nfs)
Updating hadoop-daemon.sh to work as expected when nfs is started as a privileged user.
- HADOOP-10630.
Major bug reported by Jing Zhao and fixed by Jing Zhao
Possible race condition in RetryInvocationHandler
- HADOOP-10625.
Major bug reported by Wangda Tan and fixed by Wangda Tan (conf)
Configuration: names should be trimmed when putting/getting to properties
- HADOOP-10622.
Critical bug reported by Jason Lowe and fixed by Gera Shegalov
Shell.runCommand can deadlock
- HADOOP-10618.
Minor improvement reported by Akira AJISAKA and fixed by Akira AJISAKA (documentation)
Remove SingleNodeSetup.apt.vm
- HADOOP-10614.
Major improvement reported by Xiangrui Meng and fixed by Xiangrui Meng
CBZip2InputStream is not threadsafe
- HADOOP-10609.
Major bug reported by Karthik Kambatla and fixed by Karthik Kambatla
.gitignore should ignore .orig and .rej files
- HADOOP-10602.
Trivial bug reported by Chris Nauroth and fixed by Akira AJISAKA (documentation)
Documentation has broken "Go Back" hyperlinks.
- HADOOP-10590.
Major bug reported by Benoy Antony and fixed by Benoy Antony (security)
ServiceAuthorizationManager is not threadsafe
- HADOOP-10588.
Major bug reported by Kihwal Lee and fixed by Kihwal Lee
Workaround for jetty6 acceptor startup issue
- HADOOP-10585.
Critical bug reported by Daryn Sharp and fixed by Daryn Sharp (ipc)
Retry policies ignore interrupted exceptions
- HADOOP-10581.
Major bug reported by Mit Desai and fixed by Mit Desai
TestUserGroupInformation#testGetServerSideGroups fails because groups stored in Set and ArrayList are compared
- HADOOP-10572.
Trivial improvement reported by Harsh J and fixed by Harsh J (nfs)
Example NFS mount command must pass noacl as it isn't supported by the server yet
- HADOOP-10568.
Major bug reported by David S. Wang and fixed by David S. Wang (fs/s3)
Add s3 server-side encryption
S3 server-side encryption is now supported.
To enable this feature, specify the following in your client-side configuration:
name: fs.s3n.server-side-encryption-algorithm
value: AES256
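Expressed as a configuration-file fragment, the client-side setting named above would look like:
{code}
<property>
  <name>fs.s3n.server-side-encryption-algorithm</name>
  <value>AES256</value>
</property>
{code}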
- HADOOP-10566.
Major sub-task reported by Benoy Antony and fixed by Benoy Antony (security)
Refactor proxyservers out of ProxyUsers
- HADOOP-10565.
Major sub-task reported by Benoy Antony and fixed by Benoy Antony (security)
Support IP ranges (CIDR) in proxyuser.hosts
- HADOOP-10561.
Major improvement reported by Uma Maheswara Rao G and fixed by Yi Liu (fs)
Copy command with preserve option should handle Xattrs
- HADOOP-10557.
Major improvement reported by Akira AJISAKA and fixed by Akira AJISAKA (fs)
FsShell -cp -pa option for preserving extended ACLs
- HADOOP-10556.
Major improvement reported by Alejandro Abdelnur and fixed by Alejandro Abdelnur (security)
Add toLowerCase support to auth_to_local rules for service name
- HADOOP-10549.
Major improvement reported by Gera Shegalov and fixed by Gera Shegalov (conf)
MAX_SUBST and varPat should be final in Configuration.java
- HADOOP-10547.
Major bug reported by Jason Dere and fixed by Benoy Antony (security)
Give SaslPropertiesResolver.getDefaultProperties() public scope
- HADOOP-10543.
Major bug reported by Yongjun Zhang and fixed by Yongjun Zhang
RemoteException's unwrapRemoteException method failed for PathIOException
- HADOOP-10541.
Minor bug reported by Ted Yu and fixed by Swarnim Kulkarni (test)
InputStream in MiniKdc#initKDCServer for minikdc.ldiff is not closed
- HADOOP-10540.
Major bug reported by Huan Huang and fixed by Arpit Agarwal (tools)
Datanode upgrade in Windows fails with hardlink error.
- HADOOP-10539.
Minor improvement reported by Benoy Antony and fixed by Benoy Antony (security)
Provide backward compatibility for ProxyUsers.authorize() call
- HADOOP-10535.
Minor improvement reported by Jing Zhao and fixed by Jing Zhao
Make the retry numbers in ActiveStandbyElector configurable
- HADOOP-10533.
Minor bug reported by Benjamin Kim and fixed by Steve Loughran (fs/s3)
S3 input stream NPEs in MapReduce job
- HADOOP-10531.
Major bug reported by Sebastien Barrier and fixed by Sebastien Barrier
hadoop-config.sh - bug in --hosts argument
- HADOOP-10526.
Minor bug reported by SreeHari and fixed by Rushabh S Shah
Chance for Stream leakage in CompressorStream
- HADOOP-10517.
Minor bug reported by Ted Yu and fixed by Ted Yu (test , util)
InputStream is not closed in two methods of JarFinder
- HADOOP-10514.
Major new feature reported by Uma Maheswara Rao G and fixed by Yi Liu (fs)
Common side changes to support HDFS extended attributes (HDFS-2006)
- HADOOP-10508.
Major bug reported by Chris Li and fixed by Chris Li (ipc)
RefreshCallQueue fails when authorization is enabled
- HADOOP-10503.
Minor sub-task reported by Steve Loughran and fixed by Chris Nauroth (build)
Move junit up to v 4.11
- HADOOP-10500.
Trivial bug reported by Chris Nauroth and fixed by Chris Nauroth (security , test)
TestDoAsEffectiveUser fails on JDK7 due to failure to reset proxy user configuration.
- HADOOP-10499.
Minor sub-task reported by Benoy Antony and fixed by Benoy Antony (security)
Remove unused parameter from ProxyUsers.authorize()
- HADOOP-10498.
Major new feature reported by Daryn Sharp and fixed by Daryn Sharp (util)
Add support for proxy server
- HADOOP-10496.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (metrics)
Metrics system FileSink can leak file descriptor.
- HADOOP-10495.
Trivial bug reported by Chris Nauroth and fixed by Chris Nauroth (fs , test)
TestFileUtil fails on Windows due to bad permission assertions.
- HADOOP-10489.
Major bug reported by Jing Zhao and fixed by Robert Kanter
UserGroupInformation#getTokens and UserGroupInformation#addToken can lead to ConcurrentModificationException
- HADOOP-10479.
Major sub-task reported by Haohui Mai and fixed by Swarnim Kulkarni
Fix new findbugs warnings in hadoop-minikdc
- HADOOP-10475.
Major bug reported by Arpit Gupta and fixed by Jing Zhao (security)
ConcurrentModificationException in AbstractDelegationTokenSelector.selectToken()
- HADOOP-10471.
Major sub-task reported by Benoy Antony and fixed by Benoy Antony (security)
Reduce the visibility of constants in ProxyUsers
- HADOOP-10468.
Blocker bug reported by Haohui Mai and fixed by Akira AJISAKA
TestMetricsSystemImpl.testMultiThreadedPublish fails intermittently
- HADOOP-10467.
Major sub-task reported by Benoy Antony and fixed by Benoy Antony (security)
Enable proxyuser specification to support list of users in addition to list of groups.
- HADOOP-10462.
Major bug reported by Akira AJISAKA and fixed by Akira AJISAKA
DF#getFilesystem is not parsing the command output
- HADOOP-10459.
Major bug reported by Yongjun Zhang and fixed by Yongjun Zhang (tools/distcp)
distcp V2 doesn't preserve root dir's attributes when -p is specified
- HADOOP-10458.
Minor improvement reported by Steve Loughran and fixed by Steve Loughran (fs)
swifts should throw FileAlreadyExistsException on attempt to overwrite file
- HADOOP-10454.
Major improvement reported by Kihwal Lee and fixed by Kihwal Lee
Provide FileContext version of har file system
- HADOOP-10451.
Trivial improvement reported by Benoy Antony and fixed by Benoy Antony (security)
Remove unused field and imports from SaslRpcServer
SaslRpcServer.SASL_PROPS is removed.
Any use of this variable should be replaced with the following code:
{code}
SaslPropertiesResolver saslPropsResolver = SaslPropertiesResolver.getInstance(conf);
Map<String, String> saslProps = saslPropsResolver.getDefaultProperties();
{code}
- HADOOP-10448.
Major sub-task reported by Benoy Antony and fixed by Benoy Antony (security)
Support pluggable mechanism to specify proxy user settings
- HADOOP-10442.
Blocker bug reported by Kihwal Lee and fixed by Kihwal Lee
Group look-up can cause segmentation fault when certain JNI-based mapping module is used.
- HADOOP-10439.
Major sub-task reported by Haohui Mai and fixed by Haohui Mai (build)
Fix compilation error in branch-2 after HADOOP-10426
- HADOOP-10426.
Minor sub-task reported by Tsz Wo Nicholas Sze and fixed by Tsz Wo Nicholas Sze (fs)
CreateOpts.getOpt(..) should declare with generic type argument
- HADOOP-10419.
Minor bug reported by Steve Loughran and fixed by Steve Loughran (fs)
BufferedFSInputStream NPEs on getPos() on a closed stream
- HADOOP-10418.
Major bug reported by Aaron T. Myers and fixed by Aaron T. Myers (security)
SaslRpcClient should not assume that remote principals are in the default_realm
- HADOOP-10414.
Major bug reported by Joey Echeverria and fixed by Joey Echeverria (conf)
Incorrect property name for RefreshUserMappingProtocol in hadoop-policy.xml
- HADOOP-10401.
Major bug reported by Colin Patrick McCabe and fixed by Akira AJISAKA (util)
ShellBasedUnixGroupsMapping#getGroups does not always return primary group first
- HADOOP-10383.
Major improvement reported by Enis Soztutar and fixed by Enis Soztutar
InterfaceStability annotations should have RetentionPolicy.RUNTIME
- HADOOP-10378.
Major bug reported by Mit Desai and fixed by Mit Desai
Typo in help printed by hdfs dfs -help
- HADOOP-10376.
Minor improvement reported by Chris Li and fixed by Chris Li
Refactor refresh*Protocols into a single generic refreshConfigProtocol
- HADOOP-10350.
Major bug reported by Vinayakumar B and fixed by Vinayakumar B
BUILDING.txt should mention openssl dependency required for hadoop-pipes
- HADOOP-10345.
Minor improvement reported by Benoy Antony and fixed by Benoy Antony (security)
Sanitize the inputs (groups and hosts) for the proxyuser configuration
- HADOOP-10342.
Major bug reported by Larry McCay and fixed by Larry McCay (security)
Extend UserGroupInformation to return a UGI given a preauthenticated kerberos Subject
Add getUGIFromSubject to leverage an external kerberos authentication
- HADOOP-10332.
Major bug reported by Daryn Sharp and fixed by Jonathan Eagles
HttpServer's jetty audit log always logs 200 OK
- HADOOP-10322.
Major improvement reported by Benoy Antony and fixed by Benoy Antony (security)
Add ability to read principal names from a keytab
- HADOOP-10312.
Minor bug reported by Steve Loughran and fixed by Steve Loughran (util)
Shell.ExitCodeException to have more useful toString
- HADOOP-10279.
Major sub-task reported by Chris Li and fixed by Chris Li
Create multiplexer, a requirement for the fair queue
- HADOOP-10251.
Critical bug reported by Vinayakumar B and fixed by Vinayakumar B (ha)
Both NameNodes could be in STANDBY State if SNN network is unstable
- HADOOP-10158.
Critical bug reported by Kihwal Lee and fixed by Daryn Sharp
SPNEGO should work with multiple interfaces/SPNs.
- HADOOP-10104.
Minor sub-task reported by Steve Loughran and fixed by Akira AJISAKA (build)
Update jackson to 1.9.13
- HADOOP-9968.
Major improvement reported by Benoy Antony and fixed by Benoy Antony (security)
ProxyUsers does not work with NetGroups
- HADOOP-9919.
Major bug reported by Akira AJISAKA and fixed by Akira AJISAKA (conf)
Update hadoop-metrics2.properties examples to YARN
Remove MRv1 settings from hadoop-metrics2.properties, add YARN settings instead.
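A minimal sketch of what the updated example entries might look like; the sink and output-file names here are illustrative assumptions, not the exact lines shipped in the release:
{code}
# MRv1 jobtracker/tasktracker entries removed; YARN daemon entries added instead.
# FileSink writes each daemon's metrics to a local file (file names are examples).
*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
resourcemanager.sink.file.filename=resourcemanager-metrics.out
nodemanager.sink.file.filename=nodemanager-metrics.out
{code}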
- HADOOP-9712.
Minor sub-task reported by Steve Loughran and fixed by (fs/s3)
Write contract tests for FTP filesystem, fix places where it breaks
- HADOOP-9711.
Minor sub-task reported by Steve Loughran and fixed by Steve Loughran (fs/s3)
Write contract tests for S3Native; fix places where it breaks
- HADOOP-9705.
Major bug reported by Stephen Chu and fixed by Akira AJISAKA (fs)
FsShell cp -p does not preserve directory attributes
- HADOOP-9704.
Major new feature reported by Chu Tong and fixed by
Write metrics sink plugin for Hadoop/Graphite
- HADOOP-9559.
Major bug reported by Mostafa Elhemali and fixed by Mike Liddell (metrics)
When metrics system is restarted MBean names get incorrectly flagged as dupes
- HADOOP-9555.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (ha)
HA functionality that uses ZooKeeper may experience inadvertent TCP RST and miss session expiration event due to bug in client connection management
- HADOOP-9495.
Major improvement reported by Steve Loughran and fixed by Steve Loughran (fs)
Define behaviour of Seekable.seek(), write tests, fix all hadoop implementations for compliance
- HADOOP-9371.
Major sub-task reported by Steve Loughran and fixed by Steve Loughran (fs)
Define Semantics of FileSystem more rigorously
- HADOOP-9361.
Blocker improvement reported by Steve Loughran and fixed by Steve Loughran (fs , test)
Strictly define the expected behavior of filesystem APIs and write tests to verify compliance
- HADOOP-9099.
Minor bug reported by Ivan Mitic and fixed by Ivan Mitic (test)
NetUtils.normalizeHostName fails on domains where UnknownHost resolves to an IP address
- HADOOP-8943.
Major improvement reported by Kai Zheng and fixed by Kai Zheng (security)
Support multiple group mapping providers
- HADOOP-8826.
Minor bug reported by Robert Joseph Evans and fixed by Mit Desai
Docs still refer to 0.20.205 as stable line
- HADOOP-6350.
Major improvement reported by Hong Tang and fixed by Akira AJISAKA (documentation , metrics)
Documenting Hadoop metrics
- HADOOP-3679.
Minor test reported by Chris Douglas and fixed by jay vyas (test)
Calls to junit Assert::assertEquals invert arguments, causing misleading error messages; other minor improvements.